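Import multiple CSV files into pandas and concatenate them into one DataFrame

I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out, though. Here is what I have so far: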

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

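I guess I need some help within the for loop?

Current answer

An alternative is to use the pathlib library (often preferred over os.path).

This method avoids repeated use of pandas concat()/append().

From the pandas documentation: It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly reusing this function can create a significant performance hit. If you need to use the operation over several datasets, use a list comprehension.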

import pandas as pd
from pathlib import Path

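# Avoid the name `dir`, which shadows a Python builtin
data_dir = Path("../relevant_directory")

# Read each file lazily, then concatenate exactly once
frames = (pd.read_csv(f) for f in data_dir.glob("*.csv"))
df = pd.concat(frames)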
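
To make the "read everything, concatenate once" idea concrete, here is a minimal runnable sketch. The temporary directory, the file names a.csv and b.csv, and their contents are made up for the demonstration; sorting the paths and passing ignore_index=True are optional choices, not part of the answer above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical demo data: two small CSV files in a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "a.csv").write_text("x,y\n1,2\n3,4\n")
(tmp_dir / "b.csv").write_text("x,y\n5,6\n")

# Read every file first (list comprehension), then concatenate exactly once.
frames = [pd.read_csv(f) for f in sorted(tmp_dir.glob("*.csv"))]
combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Because pd.concat is called a single time, the data is copied once rather than once per file.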

2019-09-20 13:08:08

Other answers

Inspired by MrFun's answer:

import glob
import pandas as pd

list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()

df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)

Notes:

1. By default, the list of files generated by glob.glob is not sorted, yet many scenarios require a sorted list, e.g. when analyzing the number of sensor frame drops vs. timestamp.

2. In the pd.concat call, if ignore_index=True is not specified, it preserves the original indices from each DataFrame (i.e. each individual CSV file in the list), and the main DataFrame looks like:

        timestamp    id    valid_frame
    0
    1
    2
    .
    .
    .
    0
    1
    2
    .
    .
    .

   With ignore_index=True, it looks like:

        timestamp    id    valid_frame
    0
    1
    2
    .
    .
    .
    108
    109
    .
    .
    .

   IMO, this is helpful when you want to manually build a histogram of the number of frame drops vs. one-minute (or any other duration) bins and base the calculation on the very first timestamp, e.g. begin_timestamp = df['timestamp'][0].

   Without ignore_index=True, df['timestamp'][0] returns a Series containing the first timestamp from every individual DataFrame; it does not give a single value.
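
The index behaviour described in these notes can be checked with a small sketch. The two DataFrames below are invented stand-ins for CSV files that have already been read; only the column name timestamp comes from the notes.

```python
import pandas as pd

# Hypothetical stand-ins for two CSV files that have already been read.
first = pd.DataFrame({"timestamp": [100, 101], "valid_frame": [1, 0]})
second = pd.DataFrame({"timestamp": [200, 201], "valid_frame": [1, 1]})

# Without ignore_index the original labels survive: 0, 1, 0, 1.
kept = pd.concat([first, second])

# With ignore_index=True the rows are relabelled: 0, 1, 2, 3.
renumbered = pd.concat([first, second], ignore_index=True)

# Label 0 occurs twice, so this lookup returns a Series with two rows.
duplicated_lookup = kept["timestamp"][0]

# Label 0 is unique, so this lookup returns a single value.
begin_timestamp = renumbered["timestamp"][0]

print(duplicated_lookup)
print(begin_timestamp)
```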

2021-11-16 19:20:15

import glob
import os
import pandas as pd

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))

2017-02-21 16:25:56

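You can also do it this way:

import os
import pandas as pd

# csv_folder_path is assumed to be defined elsewhere
frames = []
for root, dirs, files in os.walk(csv_folder_path):
    for file in files:
        # Join with root so files in subfolders resolve correctly
        complete_file_path = os.path.join(root, file)
        frames.append(pd.read_csv(complete_file_path))

# DataFrame.append was removed in pandas 2.0; concatenate once instead
new_df = pd.concat(frames, ignore_index=True)

new_df.shape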

2020-10-21 07:05:44

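Almost all of the answers here are either unnecessarily complex (glob pattern matching) or rely on additional third-party libraries. You can do this in two lines using only what pandas and Python (all versions) already have built in.

For a few files: a one-liner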

df = pd.concat(map(pd.read_csv, ['d1.csv', 'd2.csv','d3.csv']))

For many files

import os

filepaths = [f for f in os.listdir(".") if f.endswith('.csv')]
df = pd.concat(map(pd.read_csv, filepaths))
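
One caveat: os.listdir returns bare file names, so the snippet above only works as written when the CSV files sit in the current working directory. A minimal sketch for any other folder, assuming a hypothetical temporary directory and made-up file contents, joins each name with the directory first:

```python
import os
import tempfile

import pandas as pd

# Hypothetical demo directory containing two CSV files and one other file.
data_dir = tempfile.mkdtemp()
demo_files = [
    ("d1.csv", "a,b\n1,2\n"),
    ("d2.csv", "a,b\n3,4\n"),
    ("notes.txt", "skip me\n"),
]
for name, body in demo_files:
    with open(os.path.join(data_dir, name), "w") as handle:
        handle.write(body)

# os.listdir returns bare names, so join each one with the directory.
filepaths = sorted(
    os.path.join(data_dir, name)
    for name in os.listdir(data_dir)
    if name.endswith(".csv")
)
df = pd.concat(map(pd.read_csv, filepaths), ignore_index=True)

print(df)
```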

For files without headers

If you want to change specific things about pd.read_csv (e.g. no headers), you can write a separate function and pass that to map:

def f(i):
    return pd.read_csv(i, header=None)

df = pd.concat(map(f, filepaths))

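This pandas line, which sets df, makes use of three things:

1. Python's map(function, iterable) sends each element of the iterable (our list, i.e. every CSV path in filepaths) to the function (pd.read_csv()).
2. pandas' read_csv() function reads each CSV file as normal.
3. pandas' concat() brings all of these together under one df variable.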
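
As a variation on the helper function f above (not something the answer itself uses), functools.partial from the standard library can fix the header=None argument without defining a separate function. The in-memory io.StringIO sources below are hypothetical stand-ins for header-less CSV files, used so the sketch runs on its own.

```python
import io
from functools import partial

import pandas as pd

# Hypothetical in-memory stand-ins for header-less CSV files.
sources = [io.StringIO("1,2\n3,4\n"), io.StringIO("5,6\n")]

# partial fixes header=None, giving a one-argument callable suitable for map.
read_headerless = partial(pd.read_csv, header=None)

df = pd.concat(map(read_headerless, sources), ignore_index=True)

print(df)
```

With real files, sources would simply be the filepaths list from the earlier snippet.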

2018-06-30 21:23:25
