我想从目录中读取几个CSV文件到熊猫,并将它们连接到一个大的DataFrame。不过我还没弄明白。以下是我目前所掌握的:

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

我想我在for循环中需要一些帮助?


当前回答

另一个带有列表理解的一行程序,允许使用read_csv参数。

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

其他回答

如果多个CSV文件被压缩,您可以使用zipfile读取所有文件并按以下方式连接:

import zipfile
import pandas as pd

ziptrain = zipfile.ZipFile('yourpath/yourfile.zip')

train = []

train = [ pd.read_csv(ziptrain.open(f)) for f in ziptrain.namelist() ]

df = pd.concat(train)

我在谷歌上找到了高拉夫·辛格的答案。

然而,到最近为止,我发现使用NumPy进行任何操作,然后将其分配给一个数据帧,而不是在迭代的基础上操作数据帧本身,这似乎在这个解决方案中也有效。

我真诚地希望访问此页的任何人都能考虑这种方法,但我不想将这段巨大的代码作为注释附加,从而降低其可读性。

您可以利用NumPy来真正加速数据帧连接。

import os
import glob
import pandas as pd
import numpy as np

path = "my_dir_full_path"
allFiles = glob.glob(os.path.join(path,"*.csv"))


np_array_list = []
for file_ in allFiles:
    df = pd.read_csv(file_,index_col=None, header=0)
    np_array_list.append(df.as_matrix())

comb_np_array = np.vstack(np_array_list)
big_frame = pd.DataFrame(comb_np_array)

big_frame.columns = ["col1", "col2"....]

时间统计:

total files :192
avg lines per file :8492
--approach 1 without NumPy -- 8.248656988143921 seconds ---
total records old :1630571
--approach 2 with NumPy -- 2.289292573928833 seconds ---

如果你想递归搜索(Python 3.5或以上),你可以这样做:

from glob import iglob
import pandas as pd

path = r'C:\user\your\path\**\*.csv'

all_rec = iglob(path, recursive=True)     
dataframes = (pd.read_csv(f) for f in all_rec)
big_dataframe = pd.concat(dataframes, ignore_index=True)

请注意,最后三行可以用一行表示:

df = pd.concat((pd.read_csv(f) for f in iglob(path, recursive=True)), ignore_index=True)

你可以在这里找到**的文档。另外,我使用了iglob而不是glob,因为它返回的是迭代器而不是列表。



编辑:多平台递归功能:

你可以把上面的内容包装成一个多平台函数(Linux, Windows, Mac),所以你可以这样做:

df = read_df_rec('C:\user\your\path', *.csv)

函数如下:

from glob import iglob
from os.path import join
import pandas as pd

def read_df_rec(path, fn_regex=r'*.csv'):
    return pd.concat((pd.read_csv(f) for f in iglob(
        join(path, '**', fn_regex), recursive=True)), ignore_index=True)

灵感来自MrFun的回答:

import glob
import pandas as pd

list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()

df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)

注:

By default, the list of files generated through glob.glob is not sorted. On the other hand, in many scenarios, it's required to be sorted e.g. one may want to analyze number of sensor-frame-drops v/s timestamp. In pd.concat command, if ignore_index=True is not specified then it reserves the original indices from each dataframes (i.e. each individual CSV file in the list) and the main dataframe looks like timestamp id valid_frame 0 1 2 . . . 0 1 2 . . . With ignore_index=True, it looks like: timestamp id valid_frame 0 1 2 . . . 108 109 . . . IMO, this is helpful when one may want to manually create a histogram of number of frame drops v/s one minutes (or any other duration) bins and want to base the calculation on very first timestamp e.g. begin_timestamp = df['timestamp'][0] Without, ignore_index=True, df['timestamp'][0] generates the series containing very first timestamp from all the individual dataframes, it does not give just a value.

另一个带有列表理解的一行程序,允许使用read_csv参数。

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])