我想从目录中读取几个CSV文件到熊猫,并将它们连接到一个大的DataFrame。不过我还没弄明白。以下是我目前所掌握的:

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

我想我在for循环中需要一些帮助?


当前回答

另一个带有列表理解的一行程序,允许使用read_csv参数。

df = pd.concat([pd.read_csv(f'dir/{f}') for f in os.listdir('dir') if f.endswith('.csv')])

其他回答

Dask库可以从多个文件中读取数据帧:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('data*.csv')

(来源:https://examples.dask.org/dataframes/01-data-access.html # Read-CSV-files)

Dask数据框架实现了Pandas数据框架API的一个子集。如果所有的数据都适合内存,你可以调用df.compute()将数据帧转换为Pandas数据帧。

使用map的一行代码,但是如果你想指定额外的参数,你可以这样做:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                    glob.glob("data/*.csv")))

注意:map本身不允许您提供额外的参数。

import glob
import os
import pandas as pd   
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))

darindaCoder的答案的替代方案:

path = r'C:\DRO\DCL_rawdata_files'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent

df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)
# doesn't create a list, nor does it append to one
import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    lsit_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(lsit_df_csv, ignore_index=True)