I'm trying to parse a csv file and extract data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
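I'm trying to capture only specific columns, say ID, Name, Zip and Phone.
Code I've looked at led me to believe I can call a specific column by its corresponding number, i.e. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.
Here's what I've done so far: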
import sys, argparse, csv
from settings import *
# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
    fromfile_prefix_chars="@" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file
# open csv file
with open(csv_file, 'rb') as csvfile:
    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]
    num_columns = len(array)
    csvfile.seek(0)
    reader = csv.reader(csvfile, delimiter=' ')
    included_cols = [1, 2, 6, 7]
    for row in reader:
        content = list(row[i] for i in included_cols)
        print content
I expected this to print only the specific columns I want for each row, but it doesn't; I get only the last column.
import csv
from collections import defaultdict
columns = defaultdict(list) # each value in each column is appended to a list
with open('file.txt') as f:
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value
            columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k
print(columns['name'])
print(columns['phone'])
print(columns['street'])
With a file like
name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
this will output
>>>
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
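If you only need a few named columns from each row, rather than whole-column lists, the same DictReader can be filtered by key. This is a minimal sketch, assuming comma-separated data and header names like those in the question; the sample rows and the `wanted` list are made up for illustration, and `io.StringIO` stands in for an open file:

```python
import csv
import io

# Hypothetical sample data; in practice this would be an open file object.
sample = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone\n"
    "1,Alpha,1 First St,Springfield,AA,11111,555-0100\n"
    "2,Beta,2 Second St,Shelbyville,BB,22222,555-0101\n"
)

wanted = ["ID", "Name", "Zip", "Phone"]  # assumed column names to keep

reader = csv.DictReader(sample)  # the first line is used as the header
selected_rows = [[row[name] for name in wanted] for row in reader]

for selected in selected_rows:
    print(selected)
```

Selecting by name avoids having to count column positions, at the cost of requiring a header row whose names match exactly.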
Or alternatively, if you want numerical indexing for the columns:
with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader) # skip the header row
    for row in reader:
        for (i,v) in enumerate(row):
            columns[i].append(v)
print(columns[0])
>>>
['Bob', 'James', 'Smithers']
To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
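As an illustration of the delimiter argument, here is a small sketch that assumes the file really is pipe-separated with padding spaces, as the question's example appears to be. The sample data is hypothetical, and the `strip()` call is an added step to remove the padding around each field:

```python
import csv
import io

# Hypothetical pipe-separated sample; io.StringIO stands in for a real file.
sample = io.StringIO(
    "ID | Name | Zip\n"
    "1 | Alpha | 11111\n"
    "2 | Beta | 22222\n"
)

reader = csv.reader(sample, delimiter="|")
# Each field still carries the spaces around the pipes, so strip them.
rows = [[field.strip() for field in row] for row in reader]

for row in rows:
    print(row)
```

If the delimiter passed to csv.reader does not match the file, each line tends to come back as one field or as oddly split fragments, so index-based column selection will not line up with the intended columns.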
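The only way you would be getting just the last column from this code is if your print statement is not inside your for loop.
This is most likely the end of your code: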
for row in reader:
    content = list(row[i] for i in included_cols)
print content
You want it to be this:
for row in reader:
    content = list(row[i] for i in included_cols)
    print content
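For reference, here is a Python 3 sketch of the same loop. It assumes comma-separated data and uses made-up sample rows. Note that list indices are zero-based, so with the question's column order ID, Name, Zip and Phone would be 0, 1, 5 and 6 rather than 1, 2, 6 and 7:

```python
import csv
import io

# Hypothetical sample data; io.StringIO stands in for open(csv_file, newline="").
sample = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone\n"
    "1,Alpha,1 First St,Springfield,AA,11111,555-0100\n"
    "2,Beta,2 Second St,Shelbyville,BB,22222,555-0101\n"
)

included_cols = [0, 1, 5, 6]  # zero-based positions of ID, Name, Zip, Phone

reader = csv.reader(sample)
header = next(reader)  # consume the header row so only data rows are printed
contents = []
for row in reader:
    content = [row[i] for i in included_cols]
    contents.append(content)
    print(content)  # inside the loop, so it runs once per row
```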
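Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is excellent for dealing with csv files, and the following code is all you need to read a csv and save an entire column into a variable: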
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So if you wanted to save all of the info in your column Names into a variable, this is all you need to do:
names = df.Names
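It's a great module and I suggest you look into it. If your print statement was in fact inside the for loop and it was still printing only the last column, which shouldn't happen, then my assumption was wrong, so let me know. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!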