我试图通过csv文件进行解析,并仅从特定列中提取数据。

例csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

我试图只捕获特定的列,比如ID、Name、Zip和Phone。

我看过的代码让我相信我可以通过对应的数字调用特定的列,因此ie: Name将对应于2,并且使用行[2]遍历每一行将产生列2中的所有项。但事实并非如此。

以下是我目前所做的:

import sys, argparse, csv
from settings import *

# command arguments
parser = argparse.ArgumentParser(description='csv to postgres',\
 fromfile_prefix_chars="@" )
parser.add_argument('file', help='csv file to import', action='store')
args = parser.parse_args()
csv_file = args.file

# open csv file
with open(csv_file, 'rb') as csvfile:

    # get number of columns
    for line in csvfile.readlines():
        array = line.split(',')
        first_item = array[0]

    num_columns = len(array)
    csvfile.seek(0)

    reader = csv.reader(csvfile, delimiter=' ')
        included_cols = [1, 2, 6, 7]

    for row in reader:
            content = list(row[i] for i in included_cols)
            print content

我期望它只打印出每行我想要的特定列,但它没有,我只打印出最后一列。


当前回答

如果需要单独处理列,我喜欢使用zip(*iterable)模式(有效地“unzip”)来解构列。举个例子:

ids, names, zips, phones = zip(*(
  (row[1], row[2], row[6], row[7])
  for row in reader
))

其他回答

Context: For this type of work you should use the amazing python petl library. That will save you a lot of work and potential frustration from doing things 'manually' with the standard csv module. AFAIK, the only people who still use the csv module are those who have not yet discovered better tools for working with tabular data (pandas, petl, etc.), which is fine, but if you plan to work with a lot of data in your career from various strange sources, learning something like petl is one of the best investments you can make. To get started should only take 30 minutes after you've done pip install petl. The documentation is excellent.

答:假设您在csv文件中有第一个表(您也可以使用petl直接从数据库加载)。然后您只需加载它并执行以下操作。

from petl import fromcsv, look, cut, tocsv 

#Load the table
table1 = fromcsv('table1.csv')
# Alter the colums
table2 = cut(table1, 'Song_Name','Artist_ID')
#have a quick look to make sure things are ok. Prints a nicely formatted table to your console
print look(table2)
# Save to new file
tocsv(table2, 'new.csv')

您可以使用numpy.loadtext(文件名)。例如,如果这是你的数据库。csv:

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | Adam | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Carl | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Adolf | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
10 | Den | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |

你需要Name列:

import numpy as np 
b=np.loadtxt(r'filepath\name.csv',dtype=str,delimiter='|',skiprows=1,usecols=(1,))

>>> b
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

你可以更容易地使用genfromtext:

b = np.genfromtxt(r'filepath\name.csv', delimiter='|', names=True,dtype=None)
>>> b['Name']
array([' Adam ', ' Carl ', ' Adolf ', ' Den '], 
      dtype='|S7')

我认为有一个更简单的方法

import pandas as pd

dataset = pd.read_csv('table1.csv')
ftCol = dataset.iloc[:, 0].values

在这里iloc[:, 0],:表示所有值,0表示列的位置。 在下面的例子中,ID将被选中

ID | Name | Address | City | State | Zip | Phone | OPEID | IPEDS |
10 | C... | 130 W.. | Mo.. | AL... | 3.. | 334.. | 01023 | 10063 |
import csv

with open('input.csv', encoding='utf-8-sig') as csv_file:
    # the below statement will skip the first row
    next(csv_file)
    reader= csv.DictReader(csv_file)
   
    Time_col ={'Time' : []}
    #print(Time_col)
    for record in reader :
        Time_col['Time'].append(record['Time'])
        print(Time_col)

由于你可以索引和子集pandas数据框架,一个非常简单的方法从csv文件提取单列到一个变量是:

myVar = pd.read_csv('YourPath', sep = ",")['ColumnName']

有几件事需要考虑:

上面的代码片段将生成一个pandas系列,而不是数据框架。 如果速度是一个问题,ayhan和usecols的建议也会更快。 在一个2122 KB大小的csv文件上使用%timeit测试这两种不同的方法,usecols方法得到22.8 ms的结果,而我建议的方法得到53 ms的结果。

别忘了进口熊猫当pd