I'm trying to parse a csv file and extract the data from only specific columns.
Example csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
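I'm trying to capture only specific columns, say ID, Name, Zip and Phone.
Code I've looked at has led me to believe I can call a specific column by its corresponding number, i.e. Name would correspond to 2, and iterating through each row using row[2] would produce all the items in column 2. Only it doesn't.
Here's what I've done so far: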
import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',
                                 fromfile_prefix_chars="@")
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:

    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]

    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
    included_cols = [1, 2, 6, 7]

    for row in reader:
        content = list(row[i] for i in included_cols)
        print content
I'm expecting this to print out only the specific columns I want for each row, except it doesn't: I get the last column only.
import csv
from collections import defaultdict

columns = defaultdict(list)  # each value in each column is appended to a list

with open('file.txt') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:  # read a row as {column1: value1, column2: value2,...}
        for (k, v) in row.items():  # go over each column name and value
            columns[k].append(v)  # append the value into the appropriate list
                                  # based on column name k

print(columns['name'])
print(columns['phone'])
print(columns['street'])
With a file like
name,phone,street
Bob,0893,32 Silly
James,000,400 McHilly
Smithers,4442,23 Looped St.
Will output
>>>
['Bob', 'James', 'Smithers']
['0893', '000', '4442']
['32 Silly', '400 McHilly', '23 Looped St.']
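If only some of the columns are needed, a minimal sketch along the same lines is to keep just the wanted keys from each row that DictReader yields. The in-memory sample and the `wanted` list are assumptions made for illustration; they stand in for the real file and the real column choice.

```python
import csv
import io

# In-memory stand-in for file.txt, using the sample rows shown above.
sample = io.StringIO(
    "name,phone,street\n"
    "Bob,0893,32 Silly\n"
    "James,000,400 McHilly\n"
    "Smithers,4442,23 Looped St.\n"
)

wanted = ["name", "phone"]  # hypothetical choice of columns to keep

reader = csv.DictReader(sample)
# Build one small dict per row that holds only the wanted columns.
rows = [{key: row[key] for key in wanted} for row in reader]

for row in rows:
    print(row)
```

This keeps the data row-oriented, which may be more convenient than per-column lists when each record is processed on its own.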
Or alternatively, if you want numerical indexing for the columns:
with open('file.txt') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        for (i, v) in enumerate(row):
            columns[i].append(v)

print(columns[0])
>>>
['Bob', 'James', 'Smithers']
To change the delimiter, add delimiter=" " to the appropriate instantiation, i.e. reader = csv.reader(f, delimiter=" ")
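As a rough sketch of the delimiter option applied to a pipe-separated layout like the one in the question: the csv module splits on the delimiter but keeps the padding spaces, so the fields are stripped by hand here. The sample rows, phone numbers and the `wanted` list are made up for illustration.

```python
import csv
import io

# Hypothetical pipe-delimited data, padded with spaces like the question's sample.
sample = io.StringIO(
    "ID | Name | Zip | Phone\n"
    "10 | Adam | 36101 | 334-555-0100\n"
    "11 | Carl | 36102 | 334-555-0101\n"
)

reader = csv.reader(sample, delimiter="|")

# The csv module keeps the padding around each field, so strip it explicitly.
header = [name.strip() for name in next(reader)]
rows = [[field.strip() for field in row] for row in reader]

wanted = ["Name", "Phone"]  # hypothetical column choice
indices = [header.index(name) for name in wanted]

selected = [[row[i] for i in indices] for row in rows]
print(selected)
```

Looking the positions up from the header avoids hard-coding column numbers, at the cost of a ValueError if a wanted name is missing.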
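The only way you would be getting the last column from this code is if you don't include your print statement in your for loop.
This is most likely the end of your code:
for row in reader:
    content = list(row[i] for i in included_cols)
print content
You want it to be this:
for row in reader:
    content = list(row[i] for i in included_cols)
    print content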
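One more thing that may be worth checking: list indices in Python start at 0, so with the column order from the question, ID, Name, Zip and Phone would sit at positions 0, 1, 5 and 6 rather than 1, 2, 6 and 7. A small sketch written for Python 3, assuming the file is really comma-separated and using made-up values in an in-memory sample:

```python
import csv
import io

# Hypothetical comma-separated data with the question's column order.
sample = io.StringIO(
    "ID,Name,Address,City,State,Zip,Phone,OPEID,IPEDS\n"
    "10,Adam,130 W,Mo,AL,36101,334-555-0100,01023,10063\n"
    "11,Carl,131 W,Mo,AL,36102,334-555-0101,01024,10064\n"
)

reader = csv.reader(sample)
header = next(reader)  # consume the header row

# Zero-based positions: ID=0, Name=1, Zip=5, Phone=6.
included_cols = [0, 1, 5, 6]

contents = []
for row in reader:
    content = [row[i] for i in included_cols]
    print(content)  # inside the loop, so every row is printed
    contents.append(content)
```

Collecting the rows in `contents` is only there so the result can be reused after the loop; printing alone would not need it.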
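Now that we have covered your mistake, I would like to take this time to introduce you to the pandas module.
Pandas is spectacular for dealing with csv files, and the following code would be all you need to read a csv and save an entire column into a variable: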
import pandas as pd
df = pd.read_csv(csv_file)
saved_column = df.column_name #you can also use df['column_name']
So if you wanted to save all of the info in your column Names into a variable, this is all you need to do:
names = df.Names
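It's a great module and I suggest you look into it. If for some reason your print statement was in the for loop and it was still only printing out the last column, which shouldn't happen, let me know and I'll know my assumption was wrong. Your posted code has a lot of indentation errors, so it was hard to know what was supposed to be where. Hope this was helpful!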
You can use numpy.loadtxt(filename). For example, if this is your database .csv:
ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
And you want the Name column:
import numpy as np
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))
>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
More easily, you can use genfromtxt:
b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '],
dtype='|S7')
Context: For this type of work you should use the amazing python petl library. That will save you a lot of work and potential frustration from doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data in your career from various strange sources, learning something like petl is one of the best investments you can make. Getting started should only take 30 minutes after you've run pip install petl. The documentation is excellent.
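Answer: Let's say you have the first table in a csv file (you can also load directly from the database using petl). Then you would simply load it and do the following.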
from petl import fromcsv, look, cut, tocsv
#Load the table
table1 = fromcsv('table1.csv')
# Alter the columns
table2 = cut(table1, 'Song_Name','Artist_ID')
# Have a quick look to make sure things are ok. Prints a nicely formatted table to your console
print(look(table2))
# Save to new file
tocsv(table2, 'new.csv')
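For comparison, roughly the same column cut can be sketched with only the standard csv module, using DictReader to read and DictWriter with extrasaction="ignore" to drop the unwanted keys on the way out. The table contents are invented for illustration, and in-memory buffers stand in for table1.csv and new.csv.

```python
import csv
import io

# Hypothetical table; only the Song_Name and Artist_ID names come from the example above.
source = io.StringIO(
    "Song_ID,Song_Name,Artist_ID\n"
    "1,First Song,7\n"
    "2,Second Song,9\n"
)
target = io.StringIO()  # stands in for new.csv

keep = ["Song_Name", "Artist_ID"]

reader = csv.DictReader(source)
# extrasaction="ignore" silently drops any keys that are not in fieldnames.
writer = csv.DictWriter(
    target, fieldnames=keep, extrasaction="ignore", lineterminator="\n"
)
writer.writeheader()
writer.writerows(reader)

result = target.getvalue()
print(result)
```

With real files, the two StringIO objects would be replaced by open() calls, passing newline="" as the csv documentation recommends.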
With pandas you can use read_csv with the usecols parameter:
df = pd.read_csv(filename, usecols=['col1', 'col3', 'col7'])
Example:
import pandas as pd
import io
s = '''
total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
'''
df = pd.read_csv(io.StringIO(s), usecols=['total_bill', 'day', 'size'])
print(df)
   total_bill  day  size
0       16.99  Sun     2
1       10.34  Sun     3
2       21.01  Sun     3
SAMPLE.CSV
a, 1, +
b, 2, -
c, 3, *
d, 4, /
import pandas as pd

column_names = ["Letter", "Number", "Symbol"]
df = pd.read_csv("sample.csv", names=column_names)
print(df)
OUTPUT
  Letter  Number Symbol
0      a       1      +
1      b       2      -
2      c       3      *
3      d       4      /
letters = df.Letter.to_list()
print(letters)
OUTPUT
['a', 'b', 'c', 'd']
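If pandas is not available, a similar result can be sketched with the standard library: csv.DictReader accepts a fieldnames argument for files that have no header row, and skipinitialspace=True drops the space that follows each comma in this sample. The in-memory buffer is a stand-in for sample.csv.

```python
import csv
import io

# In-memory stand-in for sample.csv, which has no header row.
sample = io.StringIO("a, 1, +\nb, 2, -\nc, 3, *\nd, 4, /\n")

column_names = ["Letter", "Number", "Symbol"]

# fieldnames supplies the header the file lacks;
# skipinitialspace removes the blank after each delimiter.
reader = csv.DictReader(sample, fieldnames=column_names, skipinitialspace=True)
rows = list(reader)

letters = [row["Letter"] for row in rows]
numbers = [int(row["Number"]) for row in rows]  # csv yields strings, so convert explicitly

print(letters)
print(numbers)
```

Unlike read_csv, the csv module does no type inference, which is why the Number column is converted by hand.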