How do I add a new column to an existing DataFrame?

I have the following indexed DataFrame, with named columns and non-contiguous numeric row labels:

          a         b         c         d
2  0.671399  0.101208 -0.181532  0.241273
3  0.446172 -0.243316  0.051767  1.577318
5  0.614758  0.075793 -0.451460 -0.012493

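I would like to add a new column, 'e', to the existing DataFrame without changing anything else in it (i.e., the new column always has the same length as the DataFrame).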

0   -0.335485
1   -1.166658
2   -0.385571
dtype: float64

How can I add column e to the example above?

Current answer

When you add a Series object as a new column to an existing DataFrame, make sure both have the same index, then assign the Series to the DataFrame:

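import pandas as pd

# d_f is the DataFrame from the question (row labels 2, 3, 5)
e_series = pd.Series([-0.335485, -1.166658, -0.385571])
print(e_series)
e_series.index = d_f.index  # relabel the Series to match d_f
d_f['e'] = e_series
d_f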
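For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's frame by hand (the values are copied from the question; the name `misaligned` is only for illustration). It compares direct assignment, which matches on index labels, with relabelling the Series first.

```python
import pandas as pd

# Rebuild the DataFrame from the question (row labels 2, 3, 5).
d_f = pd.DataFrame(
    {
        "a": [0.671399, 0.446172, 0.614758],
        "b": [0.101208, -0.243316, 0.075793],
        "c": [-0.181532, 0.051767, -0.451460],
        "d": [0.241273, 1.577318, -0.012493],
    },
    index=[2, 3, 5],
)
e_series = pd.Series([-0.335485, -1.166658, -0.385571])  # labels 0, 1, 2

# Direct assignment matches on labels: only label 2 exists on both sides,
# so the other two rows of the new column are NaN.
misaligned = d_f.copy()
misaligned["e"] = e_series

# Relabelling the Series first assigns the values top to bottom.
e_series.index = d_f.index
d_f["e"] = e_series
print(d_f)
```

The relabelling step is what makes the assignment positional; without it, pandas keeps only the values whose labels happen to match.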

2021-03-02 21:09:51

Other answers

Four ways to insert a new column into a pandas DataFrame

Using simple assignment, insert(), assign() and concat().

import pandas as pd

df = pd.DataFrame({
    'col_a':[True, False, False], 
    'col_b': [1, 2, 3],
})
print(df)
    col_a  col_b
0   True     1
1  False     2
2  False     3

Using simple assignment

ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
print(ser)
0    a
1    b
2    c
dtype: object

df['col_c'] = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
print(df)
     col_a  col_b col_c
0   True     1  NaN
1  False     2    a
2  False     3    b

Using assign()

e = pd.Series([1.0, 3.0, 2.0], index=[0, 2, 1])
ser = pd.Series(['a', 'b', 'c'], index=[0, 1, 2])
df.assign(col_c=ser.values, col_b=e.values)
     col_a  col_b col_c
0   True   1.0    a
1  False   3.0    b
2  False   2.0    c
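One detail that is easy to miss: assign() returns a new DataFrame and leaves the original untouched. A small sketch, where `df_new` and `col_d` are names made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"col_a": [True, False, False], "col_b": [1, 2, 3]})

# assign() builds a new DataFrame; df itself keeps its two columns.
df_new = df.assign(col_c=["a", "b", "c"])

# A callable receives the frame being built, so a new column can be
# derived from existing ones.
df_new = df_new.assign(col_d=lambda frame: frame["col_b"] * 2)

print(df.columns.tolist())
print(df_new)
```

To keep the result, bind it to a name (or reassign `df`), since nothing is modified in place.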

Using insert()

df.insert(len(df.columns), 'col_c', ser.values)
print(df)
    col_a  col_b col_c
0   True     1    a
1  False     2    b
2  False     3    c
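Unlike the other methods, insert() modifies the frame in place and lets you choose the column position. The sketch below starts from a fresh two-column frame and also shows that, by default, inserting a name that already exists raises a ValueError (the flag `duplicate_rejected` is only there for illustration).

```python
import pandas as pd

df = pd.DataFrame({"col_a": [True, False, False], "col_b": [1, 2, 3]})

# insert() works in place and returns None; loc=1 makes col_c the second column.
df.insert(1, "col_c", ["a", "b", "c"])

# Inserting the same column name again is rejected unless
# allow_duplicates=True is passed.
try:
    df.insert(1, "col_c", ["x", "y", "z"])
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

print(df)
```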

Using concat()

ser = pd.Series(['a', 'b', 'c'], index=[10, 20, 30])
df = pd.concat([df, ser.rename('col_c')], axis=1)
print(df)
     col_a  col_b col_c
0    True   1.0  NaN
1   False   2.0  NaN
2   False   3.0  NaN
10    NaN   NaN    a
20    NaN   NaN    b
30    NaN   NaN    c
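The output above shows what concat() does when the labels do not overlap: it keeps the union of both indexes. If the intent is to add the values row by row, one option (a sketch, not the only way) is to give the Series the frame's labels before concatenating; `aligned` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"col_a": [True, False, False], "col_b": [1, 2, 3]})
ser = pd.Series(["a", "b", "c"], index=[10, 20, 30], name="col_c")

# Rebuild the Series on the frame's own index so the labels match.
aligned = pd.Series(ser.to_numpy(), index=df.index, name=ser.name)

# With matching labels, concat along axis=1 simply adds the column.
result = pd.concat([df, aligned], axis=1)
print(result)
```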

2022-03-06 14:21:40

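Super simple column assignment

A pandas DataFrame is implemented as an ordered dict of columns.

This means that __getitem__ ([]) can be used not only to get a column, but __setitem__ ([] =) can also be used to assign a new one.

For example, a column can be added to this DataFrame simply by using the [] accessor: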

    size      name color
0    big      rose   red
1  small    violet  blue
2  small     tulip   red
3  small  harebell  blue

df['protected'] = ['no', 'no', 'no', 'yes']

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

Note that this works even when the DataFrame's index is out of order.

df.index = [3,2,1,0]
df['protected'] = ['no', 'no', 'no', 'yes']
    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

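[]= is the way to go, but watch out!

However, if you have a pd.Series and try to assign it to a DataFrame whose index does not match, you will run into trouble. See the example: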

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'])
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

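This is because a pd.Series by default has an index enumerated from 0 to n-1, and the pandas [] = method tries to be "smart".

What is actually going on

When you write df['column'] = series, pandas quietly aligns the right-hand Series to the index of the left-hand DataFrame, much like a left join on the index: values are matched by label, labels missing from the Series become NaN, and labels that exist only in the Series are dropped.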
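A small sketch of that alignment, using a made-up frame and Series (`s` and the label 99 are only for illustration). The result always keeps exactly the frame's row labels.

```python
import pandas as pd

df = pd.DataFrame({"size": ["big", "small", "small"]}, index=[3, 2, 1])
s = pd.Series(["x", "y", "z"], index=[1, 2, 99])

df["new"] = s

# Label 3 has no match in s, so it becomes NaN.
# Label 99 exists only in s, so its value "z" is silently dropped.
print(df)
```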

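Side note

This quickly causes cognitive dissonance, because the []= method tries to do many different things depending on its input, and the outcome cannot be predicted unless you already know how pandas works. I would therefore advise against []= in code bases, but it is fine when exploring data in a notebook.

Working around the problem

If you have a pd.Series and want it assigned from top to bottom, or if you are writing production code and are not sure of the index order, it is worth safeguarding against this kind of issue.

You can downcast the pd.Series to a np.ndarray or a list, which does the trick.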

df['protected'] = pd.Series(['no', 'no', 'no', 'yes']).values

df['protected'] = list(pd.Series(['no', 'no', 'no', 'yes']))

But this is not very explicit.

Some coder may come along and say, "Hey, this looks redundant, I'll just optimize it away."
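If you do go the downcast route, here is a sketch using Series.to_numpy(), which the pandas documentation recommends over .values, with a comment stating the intent so the conversion is less likely to be "optimized away". The frame here is a reduced stand-in for the flower example.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["rose", "violet", "tulip", "harebell"]},
    index=[3, 2, 1, 0],
)
protected_series = pd.Series(["no", "no", "no", "yes"])

# Deliberately discard the Series index: assign by position, not by label.
df["protected"] = protected_series.to_numpy()
print(df)
```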

The explicit way

Setting the index of the pd.Series to the index of the df is explicit.

df['protected'] = pd.Series(['no', 'no', 'no', 'yes'], index=df.index)

Or, more realistically, you probably have a pd.Series already available.

protected_series = pd.Series(['no', 'no', 'no', 'yes'])
protected_series.index = df.index

3     no
2     no
1     no
0    yes

It can now be assigned:

df['protected'] = protected_series

    size      name color protected
3    big      rose   red        no
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue       yes

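Alternative with df.reset_index()

Since index dissonance is the problem, if you feel that the index of the DataFrame should not dictate things, you can simply drop the index. This should be faster, but it is not very clean, since your function now probably does two things.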

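# reset_index() returns a new object, so the result has to be assigned back
df = df.reset_index(drop=True)
protected_series = protected_series.reset_index(drop=True)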
df['protected'] = protected_series

    size      name color protected
0    big      rose   red        no
1  small    violet  blue        no
2  small     tulip   red        no
3  small  harebell  blue       yes

A note on df.assign

While df.assign makes it more explicit what you are doing, it actually has the same problems as []= above.

df.assign(protected=pd.Series(['no', 'no', 'no', 'yes']))
    size      name color protected
3    big      rose   red       yes
2  small    violet  blue        no
1  small     tulip   red        no
0  small  harebell  blue        no

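Just watch out with df.assign that your column is not called self. It will cause an error. This makes df.assign smelly, since the function carries this kind of artifact.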

df.assign(self=pd.Series(['no', 'no', 'no', 'yes']))
TypeError: assign() got multiple values for keyword argument 'self'

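You may say, "Well, I'll just not use self then." But who knows how this function will change in the future to support new arguments. Maybe your column name will become an argument in a new pandas release, causing problems when you upgrade.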

2017-04-03 08:59:22

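If you want to set the whole new column to an initial base value (e.g. None), you can do this: df1['e'] = None

This actually assigns the object dtype to the column, so later you are free to put complex data types, such as lists, into individual cells.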
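A quick sketch of that behaviour, assuming a small made-up frame `df1`. Using .at to write a single cell is my choice here, not something the answer prescribes.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]})
df1["e"] = None  # the scalar is broadcast to every row; dtype is object

# .at addresses exactly one cell, so the list is stored as a single value.
df1.at[0, "e"] = [1, 2]

print(df1["e"].dtype)
print(df1)
```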

2017-10-13 16:53:18

If the column you are trying to add is a Series variable, then just:

df["new_columns_name"]=series_variable_name #this will do it for you

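This works well even if you are replacing an existing column: just use the name of the column you want to replace as new_columns_name. It will overwrite the existing column data with the new Series data.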

2017-11-03 10:05:58

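This is a special case of adding a new column to a pandas DataFrame. Here, I am adding a new feature/column based on the data in an existing column.

So, let our DataFrame have the columns 'feature_1', 'feature_2' and 'probability_score', and say we have to add a new column 'predicted_class' based on the data in the 'probability_score' column.

I will use the Series.map() method from pandas and define a function of my own that implements the logic for assigning a class label to every row in the DataFrame.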

import pandas as pd

data = pd.read_csv('data.csv')

def myFunction(x):
    # implement your logic here; the 0.5 threshold and the 1/0 labels
    # below are placeholders, not part of the original answer
    if x >= 0.5:
        return 1
    return 0

variable_1 = data['probability_score']
predicted_class = variable_1.map(myFunction)

data['predicted_class'] = predicted_class

# check the DataFrame: the new column is derived row by row from an existing column
data.head()
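Since 'data.csv' is not available here, the following sketch uses a tiny in-memory frame instead. The 0.5 threshold, the 1/0 labels and the column name `predicted_class_np` are all assumptions for illustration. It also shows numpy.where as a vectorized alternative when the rule is a simple condition.

```python
import numpy as np
import pandas as pd

# Stand-in for the CSV file, with made-up scores.
data = pd.DataFrame({"probability_score": [0.1, 0.5, 0.9]})

def my_function(x):
    # Hypothetical rule: scores of 0.5 or more get class 1.
    return 1 if x >= 0.5 else 0

# Element-wise: map() calls my_function once per value.
data["predicted_class"] = data["probability_score"].map(my_function)

# Vectorized: one comparison over the whole column.
data["predicted_class_np"] = np.where(data["probability_score"] >= 0.5, 1, 0)

print(data)
```

For a simple threshold the two columns are identical; map() is the more flexible choice when the logic does not reduce to a single condition.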

2020-06-19 12:24:35
