How can I one-hot encode in Python?

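I have a machine learning classification problem in which 80% of the variables are categorical. Must I use one-hot encoding if I want to use some classifier for the classification? Can I pass the data to a classifier without the encoding?

I am trying to do the following for feature selection: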

I read the train file:

num_rows_to_read = 10000
train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read)

I change the type of the categorical features to 'category':

non_categorial_features = ['orig_destination_distance',
                           'srch_adults_cnt',
                           'srch_children_cnt',
                           'srch_rm_cnt',
                           'cnt']

for categorical_feature in list(train_small.columns):
    if categorical_feature not in non_categorial_features:
        train_small[categorical_feature] = train_small[categorical_feature].astype('category')

I use one-hot encoding:

train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

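The problem is that the third part often gets stuck, even though I am using a powerful machine.

Thus, without the one-hot encoding I can't do any feature selection to determine the importance of the features.

What do you recommend?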

Current answer

To add to the other answers, here is how I did it with a Python 2.0 function using NumPy:

import numpy as np

def one_hot(y_):
    # Encode integer class labels as one-hot rows.
    # e.g.: [[5], [0], [3]] --> [[0, 0, 0, 0, 0, 1], [1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]

    y_ = y_.reshape(len(y_))  # flatten (n, 1) to (n,)
    n_values = np.max(y_) + 1  # number of classes, inferred from the largest label
    return np.eye(n_values)[np.array(y_, dtype=np.int32)]  # Returns FLOATS

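The line n_values = np.max(y_) + 1 could be hard-coded so that the output has the right number of neurons (columns) even when a mini-batch does not contain the largest label.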
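As a minimal sketch of that hard-coded variant: the function name one_hot_fixed and the constant N_CLASSES are made up for illustration, and the value 6 is an assumed class count, not something taken from the demo project.

```python
import numpy as np

def one_hot_fixed(y_, n_values):
    """One-hot encode integer labels using a fixed number of classes."""
    y_ = np.asarray(y_, dtype=np.int64).reshape(-1)  # flatten (n, 1) to (n,)
    return np.eye(n_values)[y_]  # float array of shape (n, n_values)

N_CLASSES = 6  # assumed total number of classes in the full dataset

# A mini-batch that happens to contain only labels 0 and 3.
batch = np.array([[0], [3], [0]])
encoded = one_hot_fixed(batch, N_CLASSES)
print(encoded.shape)
```

With the inferred width, this batch would produce only 4 columns (np.max + 1), so consecutive batches could end up with different shapes.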

Demo project/tutorial where this function is used: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition

2017-03-28 00:04:56

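Other answers

It is much easier to use Pandas for basic one-hot encoding. If you are looking for more options, you can use scikit-learn.

For basic one-hot encoding with Pandas, you pass your data frame into the get_dummies function.

For example, if I have a data frame called imdb_movies:

...and I want to one-hot encode the Rated column, I do this: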

pd.get_dummies(imdb_movies.Rated)

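This returns a new data frame with one column for every "level" of rating that exists, along with a 1 or 0 indicating whether a given observation has that rating.

Usually, we want this to be part of the original data frame. In that case, we attach the new dummy-coded frame onto the original frame using "column binding".

We can column-bind by using the Pandas concat function: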

rated_dummies = pd.get_dummies(imdb_movies.Rated)
pd.concat([imdb_movies, rated_dummies], axis=1)

We can now run an analysis on the full data frame.

Simple utility function

I would recommend making yourself a utility function to do this quickly:
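The steps above can be tried end to end with a small sketch. The four-row imdb_movies frame here is a hypothetical stand-in for the real data, which is not shown in this thread.

```python
import pandas as pd

# Hypothetical stand-in for the imdb_movies frame.
imdb_movies = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rated": ["PG", "R", "PG-13", "R"],
})

# One indicator column per distinct rating.
rated_dummies = pd.get_dummies(imdb_movies["Rated"])

# Column-bind the indicator columns onto the original frame.
combined = pd.concat([imdb_movies, rated_dummies], axis=1)
print(combined.columns.tolist())
```

The indicator columns may be boolean or integer depending on the pandas version.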

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    return(res)

Usage:

encode_and_bind(imdb_movies, 'Rated')

Also, as per @pmalbu's comment, if you would like the function to remove the original feature_to_encode, use this version:

def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res)

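You can encode multiple features at the same time as follows (the result is assigned back to train_set so that each iteration builds on the previous one):

features_to_encode = ['feature_1', 'feature_2', 'feature_3',
                      'feature_4']
for feature in features_to_encode:
    train_set = encode_and_bind(train_set, feature)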

2018-10-22 18:07:06

You can use the numpy.eye function.

import numpy as np

def one_hot_encode(x, n_classes):
    """
    One hot encode a list of sample labels. Return a one-hot encoded vector for each label.
    : x: List of sample Labels
    : return: Numpy array of one-hot encoded labels
     """
    return np.eye(n_classes)[x]

def main():
    labels = [0,1,2,3,4,3,2,1,0]  # renamed from `list` to avoid shadowing the built-in
    n_classes = 5
    one_hot_list = one_hot_encode(labels, n_classes)
    print(one_hot_list)

if __name__ == "__main__":
    main()

Result

D:\Desktop>python test.py
[[ 1.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  0.  0.  1.]
 [ 0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.]
 [ 0.  1.  0.  0.  0.]
 [ 1.  0.  0.  0.  0.]]
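Going the other way is sometimes useful too. As a small sketch (the variable names are illustrative), np.argmax along each row recovers the original integer labels from the one-hot matrix:

```python
import numpy as np

labels = [0, 1, 2, 3, 4, 3, 2, 1, 0]
one_hot = np.eye(5)[labels]  # same encoding as above, as floats

# The position of the single 1 in each row is the original label.
decoded = np.argmax(one_hot, axis=1)
print(decoded.tolist())
```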

2017-03-18 21:00:54

This works for me:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

Output:

[0, 1, 2, 0]
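Note that factorize returns integer codes (label encoding) rather than one-hot vectors. If one-hot rows are needed, one possible follow-up, sketched here under that assumption, is to index an identity matrix with the codes:

```python
import numpy as np
import pandas as pd

# codes are the integer labels; uniques holds the original values in order of appearance.
codes, uniques = pd.factorize(["B", "C", "D", "B"])

# One row per sample, one column per distinct value.
one_hot = np.eye(len(uniques), dtype=int)[codes]
print(one_hot)
```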

2018-12-06 16:24:18

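You can pass the data to the CatBoost classifier without encoding. CatBoost handles categorical variables itself by performing one-hot and target expanding mean encoding.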

2019-02-28 13:53:20

Expanding on @Martin Thoma's answer

def one_hot_encode(y):
    """Convert an iterable of indices to one-hot encoded labels."""
    y = y.flatten() # Sometimes not flattened vector is passed e.g (118,1) in these cases
    # the function ends up creating a tensor e.g. (118, 2, 1). flatten removes this issue
    nb_classes = len(np.unique(y)) # get the number of unique classes
    standardised_labels = dict(zip(np.unique(y), np.arange(nb_classes))) # get the class labels as a dictionary
    # which then is standardised. E.g imagine class labels are (4,7,9) if a vector of y containing 4,7 and 9 is
    # directly passed then np.eye(nb_classes)[4] or 7,9 throws an out of index error.
    # standardised labels fixes this issue by returning a dictionary;
    # standardised_labels = {4:0, 7:1, 9:2}. The values of the dictionary are mapped to keys in y array.
    # standardised_labels also removes the error that is raised if the labels are floats. E.g. 1.0; element
    # cannot be called by an integer index e.g y[1.0] - throws an index error.
    targets = np.vectorize(standardised_labels.get)(y) # map the dictionary values to array.
    return np.eye(nb_classes)[targets]
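For comparison, a more compact sketch of the same idea uses np.unique with return_inverse=True, which yields the sorted class values and the standardised indices in one call. The name one_hot_encode_compact is hypothetical, and this is an alternative formulation rather than the code from the answer above.

```python
import numpy as np

def one_hot_encode_compact(y):
    """One-hot encode arbitrary (non-contiguous or float) class labels."""
    y = np.asarray(y).ravel()  # accept (n, 1) as well as (n,)
    # classes: sorted distinct labels; targets: index of each label in classes.
    classes, targets = np.unique(y, return_inverse=True)
    return np.eye(len(classes))[targets], classes

# Labels 4, 7 and 9 given as floats in a column vector.
encoded, classes = one_hot_encode_compact(np.array([[4.0], [7.0], [9.0], [4.0]]))
print(encoded)
print(classes)
```

Returning classes alongside the matrix keeps the mapping from column position back to the original label.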

2019-12-29 12:36:04
