I have a machine learning classification problem where 80% of the variables are categorical. Must I use one-hot encoding if I want to use a classifier for the classification? Can I pass the data to a classifier without the encoding?
I am trying to do the following for feature selection:
1. I read the train file:
num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
2. I change the type of the categorical features to 'category':
non_categorial_features = ['orig_destination_distance',
                           'srch_adults_cnt',
                           'srch_children_cnt',
                           'srch_rm_cnt',
                           'cnt']

for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')
3. I use one-hot encoding:
train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
The problem is that the third part often gets stuck, even though I am running on a strong machine.
As a result, without the one-hot encoding I cannot do any feature selection to determine the importance of the features.
What do you recommend?
A simple example using vectorize in numpy, and apply in pandas:
import numpy as np
a = np.array(['male','female','female','male'])
# define a mapping: 'male' -> 1.0, everything else -> 0.0
onehot_function = lambda x: 1.0 if (x == 'male') else 0.0
onehot_a = np.vectorize(onehot_function)(a)
print(onehot_a)
# [1. 0. 0. 1.]
# -----------------------------------------
import pandas as pd
s = pd.Series(['male','female','female','male'])
onehot_s = s.apply(onehot_function)
print(onehot_s)
# 0 1.0
# 1 0.0
# 2 0.0
# 3 1.0
# dtype: float64
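The mapping above only handles two levels. As a hedged sketch in plain numpy (same style of in-memory array as above), a feature with more levels can be one-hot encoded with np.unique and np.eye:

import numpy as np

a = np.array(['dog', 'cat', 'mouse', 'cat'])

# map each distinct level to an integer index, then select rows of an identity matrix
levels, indices = np.unique(a, return_inverse=True)   # levels: ['cat' 'dog' 'mouse']
onehot_a = np.eye(len(levels))[indices]
print(onehot_a)
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]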
First, the easiest way to do one-hot encoding: use Sklearn.
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
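A minimal sketch of that approach, assuming a reasonably recent scikit-learn (0.20+) where OneHotEncoder accepts string categories directly (the toy data here is made up for illustration):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

X = np.array([['dog'], ['cat'], ['mouse'], ['cat']])

# handle_unknown='ignore' avoids errors on levels unseen during fit;
# the result is a SciPy sparse matrix by default, which helps with high-cardinality features
encoder = OneHotEncoder(handle_unknown='ignore')
X_onehot = encoder.fit_transform(X)

print(encoder.categories_)   # [array(['cat', 'dog', 'mouse'], ...)]
print(X_onehot.toarray())
# [[0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]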
Second, I don't think one-hot encoding with pandas is quite that simple (though I have not verified this):
Creating dummy variables in pandas for python
Lastly, do you even need one-hot encoding? One-hot encoding adds one column per level of each categorical feature, which drastically increases the number of features and the run time of any classifier or anything else you are going to run, especially when each categorical feature has many levels. Instead, you can do dummy coding (by which I mean plain integer label encoding, as in the code below).
Dummy coding usually works quite well, at far less run time and complexity. A wise professor once told me, "Less is More."
If you want, here is the code for my custom encoding function:
from sklearn.preprocessing import LabelEncoder

def dummyEncode(df):
    """Auto-encode every dataframe column of type 'category' or 'object' as integer labels."""
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = LabelEncoder()
    for feature in columnsToEncode:
        try:
            df[feature] = le.fit_transform(df[feature])
        except Exception:
            print('Error encoding ' + feature)
    return df
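A quick usage sketch on a hypothetical toy dataframe (the column names are made up for illustration):

import pandas as pd

df = pd.DataFrame({'animal': ['dog', 'cat', 'mouse', 'cat'],
                   'cnt': [1, 2, 3, 4]})
df_encoded = dummyEncode(df)
print(df_encoded)
#    animal  cnt
# 0       1    1
# 1       0    2
# 2       2    3
# 3       0    4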
EDIT: A comparison, to make this clearer:
One-hot encoding: converts a feature with n levels into n - 1 indicator columns (one level is kept as the baseline).
Index  Animal         Index  cat  mouse
  1     dog               1    0     0
  2     cat       -->     2    1     0
  3     mouse             3    0     1
You can see how this will blow up your memory if your categorical feature has many distinct types (levels). Keep in mind this is just ONE column.
Dummy coding (integer labels):
Index  Animal         Index  Animal
  1     dog               1     0
  2     cat       -->     2     1
  3     mouse             3     2
It converts the levels to a numeric representation. This saves a great deal of feature space, at the potential cost of some accuracy, since it imposes an arbitrary ordering on the levels.
Expanding on @Martin Thoma's answer:
import numpy as np

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    # Sometimes a non-flattened vector is passed, e.g. of shape (118, 1); in that case the
    # function would end up creating a tensor of shape (118, 2, 1). flatten() removes this issue.
    y = y.flatten()
    nb_classes = len(np.unique(y))  # number of unique classes
    # Map the class labels to standardised indices 0..nb_classes-1.
    # E.g. if the labels are (4, 7, 9) and a vector of y containing 4, 7 and 9 is passed
    # directly, then np.eye(nb_classes)[4] (or 7, 9) throws an out-of-index error.
    # standardised_labels fixes this by building a dictionary, e.g. {4: 0, 7: 1, 9: 2},
    # whose values are then substituted for the keys in the y array.
    # It also removes the error raised when the labels are floats (e.g. 1.0), since an array
    # cannot be indexed with a float: np.eye(nb_classes)[1.0] throws an index error.
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes)))
    targets = np.vectorize(standardised_labels.get)(y)  # replace each label with its index
    return np.eye(nb_classes)[targets]
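A usage sketch with the kind of non-contiguous labels the comments describe (hypothetical values, just to illustrate):

y = np.array([[4], [7], [9], [7]])   # shape (4, 1), non-contiguous labels
print(one_hot_encode(y))
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]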