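I would like to read several CSV files from a directory into pandas and concatenate them into one big DataFrame. I have not been able to figure it out, though. Here is what I have so far: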
import glob
import pandas as pd
# Get data file names
path = r'C:\DRO\DCL_rawdata_files'
filenames = glob.glob(path + "/*.csv")
dfs = []
for filename in filenames:
    dfs.append(pd.read_csv(filename))
# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
I guess I need some help within the for loop?
Inspired by MrFun's answer:
import glob
import pandas as pd
list_of_csv_files = glob.glob(directory_path + '/*.csv')
list_of_csv_files.sort()
df = pd.concat(map(pd.read_csv, list_of_csv_files), ignore_index=True)
Notes:
By default, the list of files returned by glob.glob is not sorted. In many scenarios the order matters, e.g. when analyzing the number of sensor frame drops vs. timestamp, so the list is sorted explicitly before reading.
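As a rough illustration of the sorting note, here is a minimal sketch. The temporary directory and the file names (frames_1.csv and so on) are made up for this example, and numeric_key is a hypothetical helper. It also shows one thing to watch for: a plain sort is lexicographic, so frames_10.csv lands before frames_2.csv unless the names are zero-padded or a numeric key is used.

```python
import glob
import os
import re
import tempfile

# Hypothetical file names, created in a throwaway directory for the demo.
tmp_dir = tempfile.mkdtemp()
for name in ["frames_10.csv", "frames_2.csv", "frames_1.csv"]:
    with open(os.path.join(tmp_dir, name), "w") as fh:
        fh.write("timestamp,id,valid_frame\n")

pattern = os.path.join(tmp_dir, "*.csv")

# glob.glob makes no ordering guarantee, so sort explicitly.
lexical = sorted(glob.glob(pattern))


def numeric_key(path):
    # Sort by the first integer in the file name (assumes every name has one).
    return int(re.search(r"\d+", os.path.basename(path)).group())


numeric = sorted(glob.glob(pattern), key=numeric_key)

lexical_names = [os.path.basename(p) for p in lexical]
numeric_names = [os.path.basename(p) for p in numeric]
print(lexical_names)
print(numeric_names)
```

If the file names carry zero-padded counters or ISO-style timestamps, the plain sort is already enough.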
In the pd.concat call, if ignore_index=True is not specified, the original index of each dataframe (i.e. each individual CSV file in the list) is preserved, so the index restarts at 0 for every file and the combined dataframe looks like:
timestamp id valid_frame
0
1
2
.
.
.
0
1
2
.
.
.
With ignore_index=True, it looks like:
timestamp id valid_frame
0
1
2
.
.
.
108
109
.
.
.
IMO, this is helpful when you want to manually build a histogram of the number of frame drops in one-minute (or any other duration) bins and base the calculation on the very first timestamp, e.g.
begin_timestamp = df['timestamp'][0]
Without ignore_index=True, df['timestamp'][0] returns a Series containing the first timestamp from every individual dataframe (all rows labelled 0) rather than a single value.
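The difference can be reproduced without any files. This is only a sketch: the two small frames stand in for two CSV files and their values are invented. It uses .loc for the label lookup (the explicit form of df['timestamp'][0]) and also shows .iloc[0], which picks the first row by position regardless of the index.

```python
import pandas as pd

# Two small frames standing in for two CSV files (values are made up).
df_a = pd.DataFrame({"timestamp": [100, 101], "valid_frame": [1, 0]})
df_b = pd.DataFrame({"timestamp": [200, 201], "valid_frame": [1, 1]})

kept = pd.concat([df_a, df_b])                      # index: 0, 1, 0, 1
reset = pd.concat([df_a, df_b], ignore_index=True)  # index: 0, 1, 2, 3

# Label 0 occurs twice in `kept`, so the lookup returns a Series.
dup = kept["timestamp"].loc[0]

# Label 0 is unique in `reset`, so the lookup returns a single value.
first = reset["timestamp"].loc[0]

# Positional access returns the first row in either case.
first_by_position = kept["timestamp"].iloc[0]

print(list(dup), first, first_by_position)
```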
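If you run into the unnamed column problem (an extra 'Unnamed: 0' column), use this code to merge multiple CSV files along the row axis (axis=0).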
import glob
import os
import pandas as pd
csv_files = glob.glob(os.path.join("data/", "*.csv"))
merged_df = pd.concat(
    [pd.read_csv(csv_file, index_col=0, header=0) for csv_file in csv_files],
    axis=0, ignore_index=True)
merged_df.to_csv("merged.csv")
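For context, the unnamed column usually appears when a CSV was saved together with its index, which leaves the first header cell empty. The sketch below reproduces that with an in-memory CSV (the contents are invented) and shows how index_col=0 avoids it. Passing index=False to to_csv is a suggestion on my part, not something the code above does, for cases where the index should not be written back out.

```python
import io

import pandas as pd

# A CSV that was saved with its index: the first header cell is empty.
csv_text = ",timestamp,id\n0,100,a\n1,101,b\n"

# Read naively: the index column comes back as a regular 'Unnamed: 0' column.
plain = pd.read_csv(io.StringIO(csv_text))

# Read with index_col=0: the first column becomes the index instead.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

plain_columns = list(plain.columns)
fixed_columns = list(with_index.columns)

# Writing without the index keeps the empty header cell out of the output.
out_text = with_index.to_csv(index=False)

print(plain_columns)
print(fixed_columns)
```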