我想从目录中读取几个CSV文件到熊猫,并将它们连接到一个大的DataFrame。不过我还没弄明白。以下是我目前所掌握的:

import glob
import pandas as pd

# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")

dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)

我想我在for循环中需要一些帮助?


当前回答

灵感来自MrFun的回答:

import glob
import pandas as pd

list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()

df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)

注:

By default, the list of files generated through glob.glob is not sorted. On the other hand, in many scenarios, it's required to be sorted e.g. one may want to analyze number of sensor-frame-drops v/s timestamp. In pd.concat command, if ignore_index=True is not specified then it reserves the original indices from each dataframes (i.e. each individual CSV file in the list) and the main dataframe looks like timestamp id valid_frame 0 1 2 . . . 0 1 2 . . . With ignore_index=True, it looks like: timestamp id valid_frame 0 1 2 . . . 108 109 . . . IMO, this is helpful when one may want to manually create a histogram of number of frame drops v/s one minutes (or any other duration) bins and want to base the calculation on very first timestamp e.g. begin_timestamp = df['timestamp'][0] Without, ignore_index=True, df['timestamp'][0] generates the series containing very first timestamp from all the individual dataframes, it does not give just a value.

其他回答

可选择使用pathlib库(通常优先于os.path)。

该方法避免了重复使用pandas concat()/ apping()。

从熊猫文档中可以看到: 值得注意的是,concat()(因此append())会生成数据的完整副本,并且不断重用此函数会产生显著的性能影响。如果需要对多个数据集使用操作,请使用列表推导式。

import pandas as pd
from pathlib import Path

dir = Path("../relevant_directory")

df = (pd.read_csv(f) for f in dir.glob("*.csv"))
df = pd.concat(df)

使用map的一行代码,但是如果你想指定额外的参数,你可以这样做:

import pandas as pd
import glob
import functools

df = pd.concat(map(functools.partial(pd.read_csv, sep='|', compression=None),
                    glob.glob("data/*.csv")))

注意:map本身不允许您提供额外的参数。

import pandas as pd
import glob

path = r'C:\DRO\DCL_rawdata_files' # use your path
file_path_list = glob.glob(path + "/*.csv")

file_iter = iter(file_path_list)

list_df_csv = []
list_df_csv.append(pd.read_csv(next(file_iter)))

for file in file_iter:
    lsit_df_csv.append(pd.read_csv(file, header=0))
df = pd.concat(lsit_df_csv, ignore_index=True)

所有可用的.read_方法参见pandas: IO工具。

如果所有CSV文件都有相同的列,请尝试以下代码。

我添加了header=0,这样在读取CSV文件的第一行之后,就可以将它赋值为列名。

import pandas as pd
import glob
import os

path = r'C:\DRO\DCL_rawdata_files' # use your path
all_files = glob.glob(os.path.join(path , "/*.csv"))

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

或者,归属于Sid的评论。

all_files = glob.glob(os.path.join(path, "*.csv"))

df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)

通常需要标识每个数据样本,这可以通过向数据框架添加一个新列来实现。 本例将使用标准库中的Pathlib。它将路径视为具有方法的对象,而不是要切片的字符串。

导入和设置

from pathlib import Path
import pandas as pd
import numpy as np

path = r'C:\DRO\DCL_rawdata_files'  # or unix / linux / mac path

# Get the files from the path provided in the OP
files = Path(path).glob('*.csv')  # .rglob to get subdirectories

选项1:

添加带有文件名的新列

dfs = list()
for f in files:
    data = pd.read_csv(f)
    # .stem is method for pathlib objects to get the filename w/o the extension
    data['file'] = f.stem
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

选项2:

使用enumerate添加具有泛型名称的新列

dfs = list()
for i, f in enumerate(files):
    data = pd.read_csv(f)
    data['file'] = f'File {i}'
    dfs.append(data)

df = pd.concat(dfs, ignore_index=True)

选项3:

使用列表理解创建数据框架,然后使用np。重复此操作以添加新列。 [f' s {i}' for i in range(len(dfs))]创建一个字符串列表来命名每个数据帧。 [len(df) for df in dfs]创建一个长度列表 这个选项的归属归属于这个绘图答案。

# Read the files into dataframes
dfs = [pd.read_csv(f) for f in files]

# Combine the list of dataframes
df = pd.concat(dfs, ignore_index=True)

# Add a new column
df['Source'] = np.repeat([f'S{i}' for i in range(len(dfs))], [len(df) for df in dfs])

选项4:

一行代码使用.assign创建新列,并将其归属于来自C8H10N4O2的注释

df = pd.concat((pd.read_csv(f).assign(filename=f.stem) for f in files), ignore_index=True)

or

df = pd.concat((pd.read_csv(f).assign(Source=f'S{i}') for i, f in enumerate(files)), ignore_index=True)
import glob
import os
import pandas as pd   
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "my_files*.csv"))))