我有一个80%类别变量的机器学习分类问题。如果我想使用一些分类器进行分类,我必须使用一个热编码吗?我可以将数据传递给分类器而不进行编码吗?

我试图做以下的特征选择:

I read the train file: num_rows_to_read = 10000 train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read) I change the type of the categorical features to 'category': non_categorial_features = ['orig_destination_distance', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'cnt'] for categorical_feature in list(train_small.columns): if categorical_feature not in non_categorial_features: train_small[categorical_feature] = train_small[categorical_feature].astype('category') I use one hot encoding: train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

问题是,第三部分经常卡住,尽管我使用的是一个强大的机器。

因此,如果没有一个热编码,我就无法进行任何特征选择,以确定特征的重要性。

你有什么建议吗?


当前回答

一个在numpy中使用矢量化并在pandas中应用的简单示例:

import numpy as np

a = np.array(['male','female','female','male'])

#define function
onehot_function = lambda x: 1.0 if (x=='male') else 0.0

onehot_a = np.vectorize(onehot_function)(a)

print(onehot_a)
# [1., 0., 0., 1.]

# -----------------------------------------

import pandas as pd

s = pd.Series(['male','female','female','male'])
onehot_s = s.apply(onehot_function)

print(onehot_s)
# 0    1.0
# 1    0.0
# 2    0.0
# 3    1.0
# dtype: float64

其他回答

您可以将数据传递给catboost分类器而不进行编码。Catboost通过执行单热和目标扩展平均编码来处理分类变量本身。

你也可以做以下事情。注意,对于下面的内容,您不必使用pd.concat。

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 

for _c in df.select_dtypes(include=['object']).columns:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

还可以将显式列更改为分类列。例如,这里我正在更改颜色和组

import pandas as pd 
# intialise data of lists. 
data = {'Color':['Red', 'Yellow', 'Red', 'Yellow'], 'Length':[20.1, 21.1, 19.1, 18.1],
       'Group':[1,2,1,2]} 

# Create DataFrame 
df = pd.DataFrame(data) 
columns_to_change = list(df.select_dtypes(include=['object']).columns)
columns_to_change.append('Group')
for _c in columns_to_change:
    print(_c)
    df[_c]  = pd.Categorical(df[_c])
df_transformed = pd.get_dummies(df)
df_transformed

一个在numpy中使用矢量化并在pandas中应用的简单示例:

import numpy as np

a = np.array(['male','female','female','male'])

#define function
onehot_function = lambda x: 1.0 if (x=='male') else 0.0

onehot_a = np.vectorize(onehot_function)(a)

print(onehot_a)
# [1., 0., 0., 1.]

# -----------------------------------------

import pandas as pd

s = pd.Series(['male','female','female','male'])
onehot_s = s.apply(onehot_function)

print(onehot_s)
# 0    1.0
# 1    0.0
# 2    0.0
# 3    1.0
# dtype: float64

为了补充其他问题,让我提供如何使用Numpy使用Python 2.0函数:

def one_hot(y_):
    # Function to encode output labels from number indexes 
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

    y_ = y_.reshape(len(y_))
    n_values = np.max(y_) + 1
    return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

行n_values = np.max(y_) + 1可以硬编码,以便在使用小批量的情况下使用足够数量的神经元。

使用此函数的演示项目/教程: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition

Pandas as有内置函数“get_dummies”来获取特定列/秒的热编码。

单热编码的一行代码:

df=pd.concat([df,pd.get_dummies(df['column name'],prefix='column name')],axis=1).drop(['column name'],axis=1)