I have a machine learning classification problem where 80% of the variables are categorical. Do I have to use one-hot encoding if I want to use a classifier, or can I pass the data to the classifier without encoding it?
I am trying to do the following for feature selection:
I read the train file:
import pandas as pd

num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)
I change the type of the categorical features to 'category':
non_categorial_features = ['orig_destination_distance',
                           'srch_adults_cnt',
                           'srch_children_cnt',
                           'srch_rm_cnt',
                           'cnt']

for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')
I use one hot encoding:
train_small_with_dummies = pd.get_dummies(train_small, sparse=True)
The problem is that the third step often gets stuck, even though I am using a powerful machine.
So without the one-hot encoding I cannot do any feature selection to determine the importance of the features.
What do you recommend?
You can use the numpy.eye function:
import numpy as np

def one_hot_encode(x, n_classes):
    """
    One-hot encode a list of sample labels. Return a one-hot encoded vector for each label.

    :param x: list of sample labels
    :param n_classes: number of classes
    :return: numpy array of one-hot encoded labels
    """
    return np.eye(n_classes)[x]
def main():
    labels = [0, 1, 2, 3, 4, 3, 2, 1, 0]  # renamed from `list` to avoid shadowing the built-in
    n_classes = 5
    one_hot_labels = one_hot_encode(labels, n_classes)
    print(one_hot_labels)

if __name__ == "__main__":
    main()
Result:
D:\Desktop>python test.py
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 1. 0. 0. 0. 0.]]
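If your labels are stored in one of the pandas 'category' columns from the question rather than in a plain list, the same np.eye trick can be applied to the integer codes that pandas keeps for each category. A minimal sketch, not from the original answer, using a hypothetical site_name column:

import numpy as np
import pandas as pd

# Hypothetical categorical column, standing in for one of the train.csv columns
df = pd.DataFrame({"site_name": ["a", "b", "a", "c"]})
df["site_name"] = df["site_name"].astype("category")

codes = df["site_name"].cat.codes.to_numpy()        # integer label per row
n_classes = len(df["site_name"].cat.categories)     # number of distinct categories

one_hot = np.eye(n_classes)[codes]                  # shape: (n_rows, n_classes)
print(one_hot)

Note that np.eye builds a dense array, so for columns with very many categories this can be memory-hungry.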
You can also do this with numpy.eye, using array indexing as the selection mechanism:
import numpy as np

nb_classes = 6
data = [[2, 3, 4, 0]]

def indices_to_one_hot(data, nb_classes):
    """Convert an iterable of indices to one-hot encoded labels."""
    targets = np.array(data).reshape(-1)
    return np.eye(nb_classes)[targets]
The return value of indices_to_one_hot(data, nb_classes) is now
array([[ 0.,  0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.],
       [ 1.,  0.,  0.,  0.,  0.,  0.]])
The .reshape(-1) is there to make sure you have the right label format (you might also have [[2], [3], [4], [0]]).
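To illustrate that last point, the same call works on column-vector labels, because reshape(-1) flattens them into a plain 1-D index array first:

# Column-vector labels, shape (4, 1) instead of (1, 4)
column_labels = [[2], [3], [4], [0]]
indices_to_one_hot(column_labels, nb_classes)
# reshape(-1) flattens this to [2, 3, 4, 0], so the result is the same array as above.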