我有一个80%类别变量的机器学习分类问题。如果我想使用一些分类器进行分类,我必须使用一个热编码吗?我可以将数据传递给分类器而不进行编码吗?
我试图做以下的特征选择:
I read the train file:
num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
I change the type of the categorical features to 'category':
non_categorial_features = ['orig_destination_distance',
'srch_adults_cnt',
'srch_children_cnt',
'srch_rm_cnt',
'cnt']
for categorical_feature in list(train_small.columns):
if categorical_feature not in non_categorial_features:
train_small[categorical_feature] = train_small[categorical_feature].astype('category')
I use one hot encoding:
train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
问题是,第三部分经常卡住,尽管我使用的是一个强大的机器。
因此,如果没有一个热编码,我就无法进行任何特征选择,以确定特征的重要性。
你有什么建议吗?
你可以用numpy来做。眼和一个使用数组元素的选择机制:
import numpy as np
nb_classes = 6
data = [[2, 3, 4, 0]]
def indices_to_one_hot(data, nb_classes):
"""Convert an iterable of indices to one-hot encoded labels."""
targets = np.array(data).reshape(-1)
return np.eye(nb_classes)[targets]
indices_to_one_hot(nb_classes, data)的返回值现在是
array([[[ 0., 0., 1., 0., 0., 0.],
[ 0., 0., 0., 1., 0., 0.],
[ 0., 0., 0., 0., 1., 0.],
[ 1., 0., 0., 0., 0., 0.]]])
. remodeling(-1)的作用是确保标签格式正确(也可能有[[2],[3],[4],[0]])。
扩展@Martin Thoma的答案
def one_hot_encode(y):
"""Convert an iterable of indices to one-hot encoded labels."""
y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
# the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
nb_classes = len(np.unique(y)) # get the number of unique classes
standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
# which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
# directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
# standardised labels fixes this issue by returning a dictionary;
# standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
# standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
# cannot be called by an integer index e.g y[1.0] - throws an index error.
targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
return np.eye(nb_classes)[targets]
假设在10个变量中,在数据帧中有3个分类变量,分别为cname1、cname2和cname3。
然后下面的代码将自动在新的数据框架中创建一个热编码变量。
import category_encoders as ce
encoder_var=ce.OneHotEncoder(cols=['cname1','cname2','cname3'],handle_unknown='return_nan',return_df=True,use_cat_names=True)
new_df = encoder_var.fit_transform(old_df)
为了补充其他问题,让我提供如何使用Numpy使用Python 2.0函数:
def one_hot(y_):
# Function to encode output labels from number indexes
# e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
y_ = y_.reshape(len(y_))
n_values = np.max(y_) + 1
return np.eye(n_values)[np.array(y_, dtype=np.int32)] # Returns FLOATS
行n_values = np.max(y_) + 1可以硬编码,以便在使用小批量的情况下使用足够数量的神经元。
使用此函数的演示项目/教程:
https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition