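Apply a pandas function to a column to create multiple new columns?

How can I do this in pandas:

I have a function extract_text_features that operates on a single text column and returns multiple output columns. Specifically, the function returns 6 values.

The function works, but there does not seem to be any suitable return type (pandas DataFrame / numpy array / Python list) that lets the output be assigned correctly with df.ix[:, 10:16] = df.textcol.map(extract_text_features)

So I think I need to fall back to iterating with df.iterrows(), as described here?

UPDATE: Iterating with df.iterrows() is at least 20x slower, so I gave up and split the function into six separate .map(lambda ...) calls.

UPDATE 2: This question was asked around v0.11.0, before the usability of df.apply was improved and before df.assign() was added in v0.16. As a result, much of the question and many of the answers are no longer very relevant.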
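For the six-value case described in the question, one approach that should work on recent pandas versions is to collect the returned tuples into their own DataFrame and join it back. This is only a sketch: the body of extract_text_features and the column names f1 to f6 are placeholders made up for illustration, not the asker's real function.

```python
import pandas as pd

def extract_text_features(text):
    # Hypothetical stand-in for the asker's function: returns six values.
    return (
        len(text),
        text.count(" "),
        text.count("e"),
        text[:1],
        text.upper(),
        text.title(),
    )

df = pd.DataFrame({"textcol": ["he jumped", "she ran", "they hiked"]})

# Hypothetical names for the six new columns.
new_cols = ["f1", "f2", "f3", "f4", "f5", "f6"]

# map() yields a Series of tuples; tolist() turns it into a list of rows.
features = pd.DataFrame(
    df["textcol"].map(extract_text_features).tolist(),
    index=df.index,
    columns=new_cols,
)
df = df.join(features)
```

Building the feature frame with index=df.index keeps the rows aligned even when df does not have a default integer index.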

Current answer

This is what I've done in the past

df = pd.DataFrame({'textcol' : np.random.rand(5)})

df
    textcol
0  0.626524
1  0.119967
2  0.803650
3  0.100880
4  0.017859

df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))
   feature1  feature2
0  1.626524 -0.373476
1  1.119967 -0.880033
2  1.803650 -0.196350
3  1.100880 -0.899120
4  1.017859 -0.982141

Edit for completeness

pd.concat([df, df.textcol.apply(lambda s: pd.Series({'feature1':s+1, 'feature2':s-1}))], axis=1)
    textcol feature1  feature2
0  0.626524 1.626524 -0.373476
1  0.119967 1.119967 -0.880033
2  0.803650 1.803650 -0.196350
3  0.100880 1.100880 -0.899120
4  0.017859 1.017859 -0.982141

2013-04-26 17:39:39

Other answers

def extract_text_features(feature):
    ...
    ...
    return pd.Series((feature1, feature2)) 

df[['NewFeature1', 'NewFeature2']] = df[['feature']].apply(extract_text_features, axis=1)

Here, a DataFrame with a single feature is converted into two new features. You could try this as well.
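A runnable sketch of the pattern in this answer follows. The two computed features (string length and an upper-cased copy) are invented for illustration, since the answer elides the function body. Note that with axis=1 the function receives a whole row, so the text has to be read out of the row first.

```python
import pandas as pd

def extract_text_features(row):
    # With axis=1, `row` is a Series holding one row of df[['feature']].
    text = row["feature"]
    # Hypothetical features, chosen only to make the example concrete.
    return pd.Series((len(text), text.upper()))

df = pd.DataFrame({"feature": ["he jumped", "she ran"]})

# The returned Series become the columns of a two-column DataFrame,
# which is assigned to the two new column names in order.
df[["NewFeature1", "NewFeature2"]] = df[["feature"]].apply(
    extract_text_features, axis=1
)
```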

2020-09-30 10:20:11

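The accepted solution will be extremely slow on large amounts of data. The solution with the most upvotes is a little hard to read and is also slow on numeric data. If each new column can be computed independently of the others, I would just assign each of them directly without using apply.

Example with fake character data

Create 100,000 strings in a DataFrame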

df = pd.DataFrame(np.random.choice(['he jumped', 'she ran', 'they hiked'],
                                   size=100000, replace=True),
                  columns=['words'])
df.head()
        words
0     she ran
1     she ran
2  they hiked
3  they hiked
4  they hiked

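Suppose we want to extract some text features, as in the original question. For example, let's extract the first character, count the occurrences of the letter 'e', and capitalize the phrase.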

df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
df.head()
        words first  count_e         cap
0     she ran     s        1     She ran
1     she ran     s        1     She ran
2  they hiked     t        2  They hiked
3  they hiked     t        2  They hiked
4  they hiked     t        2  They hiked

Timings

%%timeit
df['first'] = df['words'].str[0]
df['count_e'] = df['words'].str.count('e')
df['cap'] = df['words'].str.capitalize()
127 ms ± 585 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

def extract_text_features(x):
    return x[0], x.count('e'), x.capitalize()

%timeit df['first'], df['count_e'], df['cap'] = zip(*df['words'].apply(extract_text_features))
101 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Surprisingly, you can get even better performance by looping over each value

%%timeit
a,b,c = [], [], []
for s in df['words']:
    a.append(s[0]), b.append(s.count('e')), c.append(s.capitalize())

df['first'] = a
df['count_e'] = b
df['cap'] = c
79.1 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Another example with fake numeric data

Create 1 million random numbers and test the powers function.

df = pd.DataFrame(np.random.rand(1000000), columns=['num'])


def powers(x):
    return x, x**2, x**3, x**4, x**5, x**6

%%timeit
df['p1'], df['p2'], df['p3'], df['p4'], df['p5'], df['p6'] = \
       zip(*df['num'].map(powers))
1.35 s ± 83.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Assigning each column directly is about 25x faster and very readable:

%%timeit 
df['p1'] = df['num'] ** 1
df['p2'] = df['num'] ** 2
df['p3'] = df['num'] ** 3
df['p4'] = df['num'] ** 4
df['p5'] = df['num'] ** 5
df['p6'] = df['num'] ** 6
51.6 ms ± 1.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
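As a variation on the direct assignment above (not part of the original answer and not timed here), all six powers can also be computed in a single NumPy broadcast and joined back. This is a minimal sketch on a tiny made-up frame; the p1 to p6 names follow the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num": [0.5, 1.0, 2.0]})

exponents = np.arange(1, 7)  # 1, 2, ..., 6

# A column vector of shape (n, 1) raised to a row of 6 exponents
# broadcasts to an (n, 6) array: one column per power.
values = df["num"].to_numpy()[:, None] ** exponents

names = [f"p{k}" for k in exponents]
powers = pd.DataFrame(values, index=df.index, columns=names)
df = df.join(powers)
```

Whether this beats six separate assignments depends on the data size and pandas version, so it is worth timing on your own data before preferring it.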

I wrote a similar answer here, with more detail on why apply is usually not the right choice.

2017-11-03 14:01:38

You can return the entire row instead of individual values:

df = df.apply(extract_text_features,axis = 1)

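where the function returns the row:

def extract_text_features(row):
    # value1 and value2 stand for whatever the function computes from the row
    row['new_col1'] = value1
    row['new_col2'] = value2
    return row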
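A concrete version of this row-returning pattern might look like the sketch below. The text column and the two computed values are assumptions made for illustration; the answer itself leaves value1 and value2 unspecified.

```python
import pandas as pd

def extract_text_features(row):
    # Work on a copy so the row object handed in by apply() is left untouched.
    row = row.copy()
    # Hypothetical features standing in for value1 and value2.
    row["new_col1"] = len(row["text"])
    row["new_col2"] = row["text"].upper()
    return row

df = pd.DataFrame({"text": ["he jumped", "she ran"]})

# Each returned row carries two extra labels, which become new columns.
df = df.apply(extract_text_features, axis=1)
```

Because every row is rebuilt as a Series, this is convenient but likely to be slow on large frames, in line with the timing remarks elsewhere in this thread.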

2018-06-24 19:06:57

In 2020, I use apply() with the argument result_type='expand'

applied_df = df.apply(lambda row: fn(row.text), axis='columns', result_type='expand')
df = pd.concat([df, applied_df], axis='columns')
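To make the snippet above self-contained, here is a sketch with a made-up fn that returns two values per row. The expanded columns come back labelled 0 and 1, so they are renamed before concatenating; the names length and count_e are assumptions.

```python
import pandas as pd

def fn(text):
    # Hypothetical feature function: returns a tuple of two values.
    return len(text), text.count("e")

df = pd.DataFrame({"text": ["he jumped", "she ran", "they hiked"]})

# result_type='expand' turns each returned tuple into separate columns.
applied_df = df.apply(
    lambda row: fn(row.text), axis="columns", result_type="expand"
)
applied_df.columns = ["length", "count_e"]  # replace the default 0, 1 labels

df = pd.concat([df, applied_df], axis="columns")
```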

2018-09-17 08:45:29
