Create a pandas DataFrame by appending one row at a time

How do I create an empty DataFrame and then add rows to it, one by one?

I created an empty DataFrame:

df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))

Then I can add a new row at the end and fill a single field:

df = df._set_value(index=len(df), col='qty1', value=10.0)

This works for only one field at a time. What is a better way to add a new row to df?

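Current answer

Never grow a DataFrame!

Yes, people have already explained that you should never grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?

Here are the most important reasons, taken from my post here.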

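- It is always cheaper and faster to append to a list and create a DataFrame in one go.
- Lists take up less memory and are a much lighter data structure to work with, append to, and remove from.
- dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs automatically makes the columns object, which is bad.
- An index is automatically created for you, instead of you having to take care to assign the correct index to each row you append.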

This is The Right Way™ to accumulate your data

data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
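To make the snippet above runnable, here is a minimal sketch. The body of `some_function_that_yields_data` is a hypothetical stand-in (the answer does not define it), and the sample values are made up purely for illustration.

```python
import pandas as pd


def some_function_that_yields_data():
    """Hypothetical data source: yields one (a, b, c) tuple per row."""
    yield 1, 2.5, "x"
    yield 2, 3.5, "y"
    yield 3, 4.5, "z"


data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])  # appending to a plain list is cheap

# Build the DataFrame once, at the end.
df = pd.DataFrame(data, columns=["A", "B", "C"])

# dtypes are inferred per column, and a default 0..n-1 index is created.
print(df.dtypes)
print(df.index)
```

Column A comes out as an integer dtype and column B as a float dtype without any explicit casting, which is the dtype-inference benefit listed above.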

These options are horrible

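append or concat inside a loop

append and concat are not inherently bad in isolation. The problem starts when you call them iteratively inside a loop: every call copies the whole frame, which results in quadratic memory usage.

# Creates an empty DataFrame and appends to it (DataFrame.append was removed in pandas 2.0)
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)
    # This is equally bad:
    # df = pd.concat(
    #     [df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])],
    #     ignore_index=True)

Empty DataFrame of NaNs

Never create a DataFrame of NaNs, because the columns are initialized as object (a slow, non-vectorizable dtype).

# Creates a DataFrame of NaNs and overwrites the values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]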
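The dtype problem with a preallocated frame of NaNs can be checked directly. The sketch below uses made-up values and the hypothetical names `prealloc` and `built`; it only compares the dtypes of the two construction styles.

```python
import pandas as pd

# Preallocated frame of NaNs: there is no data to infer from,
# so every column falls back to the object dtype.
prealloc = pd.DataFrame(columns=["A", "B", "C"], index=range(3))
print(prealloc.dtypes)

# The same shape built in one go from a list of rows:
# each column gets a numeric dtype inferred from its values.
built = pd.DataFrame(
    [[1, 2.0, 3], [4, 5.0, 6], [7, 8.0, 9]],
    columns=["A", "B", "C"],
)
print(built.dtypes)
```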

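The proof is in the pudding

Timing these methods is the fastest way to see just how much they differ in terms of memory and utility.

Benchmarking code for reference.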
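The linked benchmarking code is not reproduced on this page. As a rough substitute, the following sketch times growing a frame with `pd.concat` in a loop against building it once from a list. The row count `N`, the `rows` generator, and the function names are all assumptions chosen for illustration, and absolute timings will vary by machine and pandas version.

```python
import time

import pandas as pd

N = 500  # assumed row count, kept small so the sketch runs quickly
COLUMNS = ["A", "B", "C"]


def rows():
    """Hypothetical data source yielding N rows."""
    for i in range(N):
        yield i, i * 2, i * 3


def grow_with_concat():
    """Anti-pattern: concatenate one single-row frame per iteration."""
    df = None
    for a, b, c in rows():
        new = pd.DataFrame([[a, b, c]], columns=COLUMNS)
        df = new if df is None else pd.concat([df, new], ignore_index=True)
    return df


def build_from_list():
    """Recommended: collect rows in a list, build the frame once."""
    return pd.DataFrame(list(rows()), columns=COLUMNS)


for fn in (grow_with_concat, build_from_list):
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__}: {elapsed:.4f} s, shape={result.shape}")
```

Both functions produce the same data; only the construction strategy differs, so any gap in the printed times comes from the repeated copying in the loop.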

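Posts like this remind me why I am a part of this community. People understand the importance of teaching others to get the right answer with the right code, not the right answer with the wrong code. Now, you might argue that using loc or append is not an issue if you are only adding a single row to your DataFrame. However, people often come to this question looking to add more than one row. Often the requirement is to add rows iteratively inside a loop, using data that comes from a function (see the related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.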

2020-07-04 22:15:04

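Other answers

Before adding a row, we have to convert the DataFrame to a dictionary. In that dictionary the keys are the columns of the DataFrame, and each column's values are again stored in a dictionary whose keys are the index labels of the DataFrame.

That idea led me to write the code below.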

df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"]  # The complete row that we are going to add
for col, value in zip(df.columns, values):  # df.columns gives the keys of the outer dictionary
    df2[col][101] = value  # 101 is the new row's index label, which is also the key in each inner dictionary
df = pd.DataFrame(df2)  # Convert the dictionary back into a DataFrame
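The nested dictionary described above is easier to follow on a small frame. This is a minimal sketch with a made-up three-column DataFrame (the answer's real frame has 16 columns and is not shown); the names `as_dict` and `new_row` are hypothetical.

```python
import pandas as pd

# Small stand-in frame with a single existing row labelled 100.
df = pd.DataFrame(
    {"id": ["s_100"], "city": ["delhi"], "score": [10]},
    index=[100],
)

# Default orientation: {column_name: {index_label: value}}
as_dict = df.to_dict()
print(as_dict)

# Add one entry per column under the new index label 101.
new_row = ["s_101", "hyderabad", 20]
for col, value in zip(df.columns, new_row):
    as_dict[col][101] = value

# Rebuild the DataFrame from the nested dictionary.
df = pd.DataFrame(as_dict)
print(df)
```

Note that this round trip rebuilds the whole frame for every added row, so it shares the cost concerns raised in the current answer.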

2020-04-17 17:54:13

You can use pandas.concat(). For details and examples, see Merge, join, and concatenate.

For example:

def append_row(df, row):
    return pd.concat([
                df, 
                pd.DataFrame([row], columns=row.index)]
           ).reset_index(drop=True)

df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
new_row = pd.Series({'lib':'A', 'qty1':1, 'qty2': 2})

df = append_row(df, new_row)

2012-05-23 08:14:43

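If you can get all the data for the DataFrame up front, there is a much faster approach than appending to a DataFrame:

1. Create a list of dictionaries in which each dictionary corresponds to one input data row.
2. Create a DataFrame from this list.

I had a similar task where appending to a DataFrame row by row took 30 minutes, while creating a DataFrame from a list of dictionaries completed within seconds.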

rows_list = []
for row in input_rows:

        dict1 = {}
        # get input row in dictionary format
        # key = col_name
        dict1.update(blah..) 

        rows_list.append(dict1)

df = pd.DataFrame(rows_list)
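The snippet above leaves the parsing step elided. Here is a runnable sketch of the same pattern under an assumed input format: `input_rows` is a hypothetical list of comma-separated strings, and the column names are borrowed from the question.

```python
import pandas as pd

# Hypothetical raw input: one "lib,qty1,qty2" string per row.
input_rows = ["A,1,2.5", "B,3,4.5", "C,5,6.5"]

rows_list = []
for row in input_rows:
    lib, qty1, qty2 = row.split(",")
    # One dictionary per row; keys become column names.
    rows_list.append({"lib": lib, "qty1": int(qty1), "qty2": float(qty2)})

# Build the DataFrame once from the list of dictionaries.
df = pd.DataFrame(rows_list)
print(df)
```

If a dictionary lacks one of the keys, pandas fills that cell with NaN, so rows do not all need the same fields.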

2013-07-05 20:38:13

If you have a DataFrame df and want to add a list new_list to df as a new row, you can simply do:

df.loc[len(df)] = new_list

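If you want to add another DataFrame new_df below df, you can concatenate them and assign the result back (the older df.append(new_df) was removed in pandas 2.0, and it never modified df in place):

df = pd.concat([df, new_df], ignore_index=True)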
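One caveat with `df.loc[len(df)]` is worth illustrating: it adds a new row only when `len(df)` is not already an index label. The sketch below uses made-up data and the hypothetical name `gappy` to show what happens once the index has a gap.

```python
import pandas as pd

df = pd.DataFrame({"lib": ["A", "B", "C"], "qty1": [1, 2, 3]})

# Default index is 0, 1, 2 and len(df) is 3, so label 3 is new: a row is added.
df.loc[len(df)] = ["D", 4]
print(df)

# After dropping a row the labels are 0, 2, 3 but len(gappy) is 3,
# so label 3 already exists and that row is overwritten instead.
gappy = df.drop(index=1)
gappy.loc[len(gappy)] = ["E", 5]
print(gappy)
```

Calling `reset_index(drop=True)` before the assignment is one way to avoid the overwrite, at the cost of discarding the old labels.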

2020-12-21 09:57:20

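I was interested in performance when adding a large number of rows to a DataFrame, so I tried the four most popular methods and measured their speed.

Performance

1. Using .append (NPE's answer)
2. Using .loc (fred's answer)
3. Using .loc with preallocation (FooBar's answer)
4. Using dict and creating the DataFrame at the end (ShikharDua's answer)

Runtime results (in seconds):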

Approach	1000 rows	5000 rows	10 000 rows
.append	0.69	3.39	6.78
.loc without prealloc	0.74	3.90	8.35
.loc with prealloc	0.24	2.58	8.70
dict	0.012	0.046	0.084

So I use the dictionary approach myself.

Code:

import pandas as pd
import numpy as np
import time

numOfRows = 1000
# append (requires pandas < 2.0, where DataFrame.append still exists)
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)

# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)

# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i]  = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)

# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)

df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)

P.S.: I believe my implementation is not perfect, and there may be some optimizations that could still be made.

2017-12-26 14:02:37
