如何在Python中进行热编码?

我有一个80%类别变量的机器学习分类问题。如果我想使用一些分类器进行分类，我必须使用一个热编码吗?我可以将数据传递给分类器而不进行编码吗?

我试图做以下的特征选择:

I read the train file: num_rows_to_read = 10000 train_small = pd.read_csv("../../dataset/train.csv", nrows=num_rows_to_read) I change the type of the categorical features to 'category': non_categorial_features = ['orig_destination_distance', 'srch_adults_cnt', 'srch_children_cnt', 'srch_rm_cnt', 'cnt'] for categorical_feature in list(train_small.columns): if categorical_feature not in non_categorial_features: train_small[categorical_feature] = train_small[categorical_feature].astype('category') I use one hot encoding: train_small_with_dummies = pd.get_dummies(train_small, sparse=True)

问题是，第三部分经常卡住，尽管我使用的是一个强大的机器。

因此，如果没有一个热编码，我就无法进行任何特征选择，以确定特征的重要性。

你有什么建议吗?

当前回答

方法1:你可以使用pandas的pd.get_dummies。

示例1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

示例2:

下面将把一个给定的列转换为一个hot。使用前缀有多个假人。

import pandas as pd
        
df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1

方法2:使用Scikit-learn

使用OneHotEncoder的优点是能够拟合一些训练数据，然后使用相同的实例对一些其他数据进行转换。我们还有handle_unknown来进一步控制编码器对未见数据的处理。

给定一个具有三个特征和四个样本的数据集，我们让编码器找到每个特征的最大值，并将数据转换为二进制one-hot编码。

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

下面是这个例子的链接:http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

2016-09-02 07:55:21

其他回答

首先，最简单的热编码方法:使用Sklearn。

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

其次，我不认为使用熊猫进行一个热编码是那么简单(虽然未经证实)

在pandas中为python创建虚拟变量

最后，你需要一个热编码吗?一个热编码以指数方式增加了特征的数量，大大增加了任何分类器或任何你要运行的东西的运行时间。特别是当每个分类特征都有很多层次时。相反，你可以进行虚拟编码。

使用虚拟编码通常工作得很好，运行时间和复杂性要少得多。一位睿智的教授曾经告诉我，“少即是多”。

如果你愿意，这是我的自定义编码函数的代码。

from sklearn.preprocessing import LabelEncoder

#Auto encodes any dataframe column of type category or object.
def dummyEncode(df):
        columnsToEncode = list(df.select_dtypes(include=['category','object']))
        le = LabelEncoder()
        for feature in columnsToEncode:
            try:
                df[feature] = le.fit_transform(df[feature])
            except:
                print('Error encoding '+feature)
        return df

编辑:比较更清楚:

一热编码:将n层转换为n-1列。

Index  Animal         Index  cat  mouse
  1     dog             1     0     0
  2     cat       -->   2     1     0
  3    mouse            3     0     1

你可以看到，如果你的分类特征中有许多不同类型(或级别)，这会使你的记忆爆发式增长。记住，这只是一列。

伪代码:

Index  Animal         Index  Animal
  1     dog             1      0   
  2     cat       -->   2      1 
  3    mouse            3      2

转换为数字表示。极大地节省了特征空间，代价是准确性。

2016-05-18 07:46:34

方法1:你可以使用pandas的pd.get_dummies。

示例1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0

示例2:

下面将把一个给定的列转换为一个hot。使用前缀有多个假人。

import pandas as pd
        
df = pd.DataFrame({
          'A':['a','b','a'],
          'B':['b','a','c']
        })
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get one hot encoding of columns B
one_hot = pd.get_dummies(df['B'])
# Drop column B as it is now encoded
df = df.drop('B',axis = 1)
# Join the encoded df
df = df.join(one_hot)
df  
Out[]: 
       A  a  b  c
    0  a  0  1  0
    1  b  1  0  0
    2  a  0  0  1

方法2:使用Scikit-learn

给定一个具有三个特征和四个样本的数据集，我们让编码器找到每个特征的最大值，并将数据转换为二进制one-hot编码。

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])

下面是这个例子的链接:http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

2016-09-02 07:55:21

我在我的声学模型中使用了这个: 也许这对你的模型有帮助。

def one_hot_encoding(x, n_out):
    x = x.astype(int)  
    shape = x.shape
    x = x.flatten()
    N = len(x)
    x_categ = np.zeros((N,n_out))
    x_categ[np.arange(N), x] = 1
    return x_categ.reshape((shape)+(n_out,))

2018-06-12 18:41:12

试试这个:

!pip install category_encoders
import category_encoders as ce

categorical_columns = [...the list of names of the columns you want to one-hot-encode ...]
encoder = ce.OneHotEncoder(cols=categorical_columns, use_cat_names=True)
df_train_encoded = encoder.fit_transform(df_train_small)

df_encoded.head ()

生成的数据框架df_train_encoded与原始数据框架相同，但是分类特征现在被它们的单热编码版本所取代。

更多关于category_encoders的信息请点击这里。

2020-03-05 10:03:35

这对我来说很管用:

pandas.factorize( ['B', 'C', 'D', 'B'] )[0]

输出:

[0, 1, 2, 0]

2018-12-06 16:24:18

如何在Python中进行热编码?

推荐文章

最新文章

标签