How to add a new column to an existing DataFrame?

I have the following indexed DataFrame with named columns and non-contiguous row numbers:

          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493

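I would like to add a new column, 'e', to the existing DataFrame without changing anything else in it (i.e., the new column always has the same length as the DataFrame).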

0   -0.335485
1   -1.166658
2   -0.385571
dtype: float64

How can I add column e to the above example?

Current answer

Here is what I did... but I'm pretty new to pandas and to Python in general, so no promises.

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5,6]], columns=list('AB'))

newCol = [3,5,7]
newName = 'C'

values = np.insert(df.values,df.shape[1],newCol,axis=1)
header = df.columns.values.tolist()
header.append(newName)

df = pd.DataFrame(values,columns=header)

2015-10-06 01:18:52

Other answers

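I got the dreaded SettingWithCopyWarning, and it wasn't fixed by using the iloc syntax. My DataFrame was created by read_sql from an ODBC source. Following lowtech's suggestion above, the following worked for me: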

df.insert(len(df.columns), 'e', pd.Series(np.random.randn(sLength),  index=df.index))

This worked fine to insert the column at the end. I don't know if it is the most efficient approach, but I don't like warning messages. I think there is a better solution, but I can't find it, and I think it depends on some aspect of the index. Note that this only works once and raises an error if you try to overwrite an existing column. Note: as mentioned above, from 0.16.0 on, assign is the best solution. See the documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.assign.html#pandas.DataFrame.assign. It works well for data-flow style code where you don't overwrite your intermediate values.
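As a rough sketch of the two options mentioned here, using part of the question's data (the variable names `new_values`, `second_insert_failed` and `df2` are illustrative, not from the original answer):

```python
import pandas as pd

# Frame modelled on the question: the row labels 2, 3, 5 are not contiguous.
df = pd.DataFrame(
    {"a": [0.671399, 0.446172, 0.614758],
     "b": [0.101208, -0.243316, 0.075793]},
    index=[2, 3, 5],
)
new_values = [-0.335485, -1.166658, -0.385571]

# Option 1: insert in place at the last position; the Series carries df's index.
df.insert(len(df.columns), "e", pd.Series(new_values, index=df.index))

# A second insert under the same name raises ValueError by default.
try:
    df.insert(len(df.columns), "e", new_values)
    second_insert_failed = False
except ValueError:
    second_insert_failed = True

# Option 2: assign returns a new frame and leaves the original untouched.
df2 = df.drop(columns="e").assign(e=new_values)
```

Which of the two fits better depends on whether you want to modify the frame in place (`insert`) or chain operations on a new object (`assign`).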

2015-06-11 09:45:04

If you only need to create a new empty column, the shortest solution is:

df.loc[:, 'e'] = pd.Series()
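A minimal sketch of the same idea with an explicit dtype, assuming a float column of NaN counts as "empty" for your purposes. Passing `dtype` avoids the default-dtype warning that some pandas versions emit for a bare `pd.Series()`; the frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[2, 3, 5])

# An empty Series with an explicit dtype; index alignment fills every row with NaN.
df["e"] = pd.Series(dtype="float64")

# Same result with a scalar, which is broadcast to every row.
df["f"] = np.nan
```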

2020-11-27 08:26:56

You can insert a new column with a for loop like this:

for label,row in your_dframe.iterrows():
      your_dframe.loc[label,"new_column_length"]=len(row["any_of_column_in_your_dframe"])

Example code:

import pandas as pd

data = {
  "any_of_column_in_your_dframe" : ["ersingulbahar","yagiz","TS"],
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
your_dframe = pd.DataFrame(data)


for label,row in your_dframe.iterrows():
      your_dframe.loc[label,"new_column_length"]=len(row["any_of_column_in_your_dframe"])
      
      
print(your_dframe)

Output:

any_of_column_in_your_dframe	calories	duration	new_column_length
ersingulbahar	420	50	13.0
yagiz	380	40	5.0
TS	390	45	2.0

You can also do it like this:

your_dframe["new_column_length"]=your_dframe["any_of_column_in_your_dframe"].apply(len)
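For string columns specifically, the vectorized `.str.len()` accessor gives the same lengths without a Python-level loop. A small sketch, reusing the column names from the example above:

```python
import pandas as pd

your_dframe = pd.DataFrame(
    {"any_of_column_in_your_dframe": ["ersingulbahar", "yagiz", "TS"]}
)

# Series.str.len() computes the length of each string element.
your_dframe["new_column_length"] = (
    your_dframe["any_of_column_in_your_dframe"].str.len()
)
```

Here the new column holds integers, whereas the row-by-row loop above produces floats (13.0, 5.0, 2.0).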

2021-08-12 05:33:40

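Super simple column assignment

A pandas DataFrame is implemented as an ordered dict of columns.

This means that __getitem__ ([]) can be used not only to get a column, but __setitem__ ([] =) can also be used to assign a new column.

For example, this DataFrame can have a column added to it simply by using the [] accessor: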

    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note that this works even if the index of the DataFrame is out of order.

df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

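[]= is the way to go, but watch out!

However, if you have a pd.Series and try to assign it to a DataFrame whose index is out of order, you will run into trouble. See this example: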

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

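This is because a pd.Series by default has an index enumerated from 0 to n-1, and the pandas [] = method tries to be "smart".

What is actually going on

When you use the [] = method, as in df['column'] = series, pandas quietly aligns the right-hand Series with the index of the left-hand DataFrame. This is effectively a left join on the index: the Series is reindexed to the DataFrame's index, so values are matched by label rather than by position.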
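A short sketch of what happens on assignment, reusing the example data (the `partial` column and the label 9 are made up for illustration): the assigned column equals the Series reindexed to the DataFrame's index, labels missing from the Series become NaN, and labels present only in the Series are dropped.

```python
import pandas as pd

df = pd.DataFrame({"name": ["rose", "violet", "tulip", "harebell"]},
                  index=[3, 2, 1, 0])
s = pd.Series(["no", "no", "no", "yes"])  # default index 0, 1, 2, 3

# Values are matched by label, not by position.
df["protected"] = s
aligned = s.reindex(df.index)

# Labels 2 and 3 are missing from this Series, and label 9 is not in df.
df["partial"] = pd.Series(["x", "y", "z"], index=[0, 1, 9])
```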

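Side note

This quickly causes cognitive dissonance, because the []= method tries to do many different things depending on the input, and the outcome cannot be predicted unless you already know how pandas works. I would therefore advise against []= in code bases, but it is fine when exploring data in a notebook.

Working around the problem

If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and are not sure of the index order, it is worth safeguarding against this kind of issue.

You could downcast the pd.Series to a np.ndarray or a list; this will do the trick.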

df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values

df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))
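The same positional assignment can be sketched with `to_numpy()`, which the pandas documentation recommends over `.values`; the data below is the running example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["rose", "violet", "tulip", "harebell"]},
                  index=[3, 2, 1, 0])
s = pd.Series(["no", "no", "no", "yes"])

# A plain array has no index, so values are assigned top to bottom.
df["protected"] = s.to_numpy()
```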

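But this is not very explicit.

Some coder may come along and say, "Hey, this looks redundant, I'll just optimize it away."

Explicit way

Setting the index of the pd.Series to be the index of the df is explicit.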

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)

Or, more realistically, you probably already have a pd.Series available.

protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes

It can now be assigned:

df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

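Alternative way with df.reset_index()

Since the index mismatch is the problem, if you feel that the index of the DataFrame should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.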

# reset_index returns a new object, so the result has to be assigned back
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note on df.assign

While df.assign makes it more explicit what you are doing, it actually has the same problem as []= above:

df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

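Just watch out with df.assign that your column is not called self. It will cause an error. This makes df.assign smelly, since the function carries this kind of artifact.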

df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'

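You may say, "Well, I'll just not use self then." But who knows how this function will change in the future to support new arguments. Maybe your column name will become an argument in a new pandas release, causing problems when upgrading.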

2017-04-03 08:59:22
