我试图使用scikit-learn的LabelEncoder来编码字符串标签的pandas DataFrame。由于数据帧有许多(50+)列,我想避免为每一列创建一个LabelEncoder对象;我宁愿只有一个大的LabelEncoder对象,它可以跨所有数据列工作。

将整个DataFrame扔到LabelEncoder中会产生以下错误。请记住,我在这里使用的是虚拟数据;实际上,我正在处理大约50列的字符串标记数据,所以需要一个解决方案,不引用任何列的名称。

import pandas
from sklearn import preprocessing 

df = pandas.DataFrame({
    'pets': ['cat', 'dog', 'cat', 'monkey', 'dog', 'dog'], 
    'owner': ['Champ', 'Ron', 'Brick', 'Champ', 'Veronica', 'Ron'], 
    'location': ['San_Diego', 'New_York', 'New_York', 'San_Diego', 'San_Diego', 
                 'New_York']
})

le = preprocessing.LabelEncoder()

le.fit(df)

回溯(最近一次调用): 文件“”,第1行,在 文件"/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/预处理/label.py",第103行 y = column_or_1d(y, warn=True) 文件"/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py",第306行,在column_or_1d中 raise ValueError("错误的输入形状{0}".format(形状)) ValueError:错误的输入形状(6,3)

对于如何解决这个问题有什么想法吗?


当前回答

根据对@PriceHardman解决方案提出的意见,我将提出以下版本的类:

class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
def __init__(self, cols=None):
    pdu._is_cols_input_valid(cols)
    self.cols = cols
    self.les = {col: LabelEncoder() for col in cols}
    self._is_fitted = False

def transform(self, df, **transform_params):
    """
    Scaling ``cols`` of ``df`` using the fitting

    Parameters
    ----------
    df : DataFrame
        DataFrame to be preprocessed
    """
    if not self._is_fitted:
        raise NotFittedError("Fitting was not preformed")
    pdu._is_cols_subset_of_df_cols(self.cols, df)

    df = df.copy()

    label_enc_dict = {}
    for col in self.cols:
        label_enc_dict[col] = self.les[col].transform(df[col])

    labelenc_cols = pd.DataFrame(label_enc_dict,
        # The index of the resulting DataFrame should be assigned and
        # equal to the one of the original DataFrame. Otherwise, upon
        # concatenation NaNs will be introduced.
        index=df.index
    )

    for col in self.cols:
        df[col] = labelenc_cols[col]
    return df

def fit(self, df, y=None, **fit_params):
    """
    Fitting the preprocessing

    Parameters
    ----------
    df : DataFrame
        Data to use for fitting.
        In many cases, should be ``X_train``.
    """
    pdu._is_cols_subset_of_df_cols(self.cols, df)
    for col in self.cols:
        self.les[col].fit(df[col])
    self._is_fitted = True
    return self

这个类适合编码器的训练集,并在转换时使用适合的版本。代码的初始版本可以在这里找到。

其他回答

根据对@PriceHardman解决方案提出的意见,我将提出以下版本的类:

class LabelEncodingColoumns(BaseEstimator, TransformerMixin):
def __init__(self, cols=None):
    pdu._is_cols_input_valid(cols)
    self.cols = cols
    self.les = {col: LabelEncoder() for col in cols}
    self._is_fitted = False

def transform(self, df, **transform_params):
    """
    Scaling ``cols`` of ``df`` using the fitting

    Parameters
    ----------
    df : DataFrame
        DataFrame to be preprocessed
    """
    if not self._is_fitted:
        raise NotFittedError("Fitting was not preformed")
    pdu._is_cols_subset_of_df_cols(self.cols, df)

    df = df.copy()

    label_enc_dict = {}
    for col in self.cols:
        label_enc_dict[col] = self.les[col].transform(df[col])

    labelenc_cols = pd.DataFrame(label_enc_dict,
        # The index of the resulting DataFrame should be assigned and
        # equal to the one of the original DataFrame. Otherwise, upon
        # concatenation NaNs will be introduced.
        index=df.index
    )

    for col in self.cols:
        df[col] = labelenc_cols[col]
    return df

def fit(self, df, y=None, **fit_params):
    """
    Fitting the preprocessing

    Parameters
    ----------
    df : DataFrame
        Data to use for fitting.
        In many cases, should be ``X_train``.
    """
    pdu._is_cols_subset_of_df_cols(self.cols, df)
    for col in self.cols:
        self.les[col].fit(df[col])
    self._is_fitted = True
    return self

这个类适合编码器的训练集,并在转换时使用适合的版本。代码的初始版本可以在这里找到。

在这里和其他地方进行了大量的搜索和实验后,我认为你的答案是:

pd.DataFrame(列= df.columns, data = LabelEncoder () .fit_transform (df.values.flatten ()) .reshape (df.shape))

这将跨列保留类别名称:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame([['A','B','C','D','E','F','G','I','K','H'],
                   ['A','E','H','F','G','I','K','','',''],
                   ['A','C','I','F','H','G','','','','']], 
                  columns=['A1', 'A2', 'A3','A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10'])

pd.DataFrame(columns=df.columns, data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape))

    A1  A2  A3  A4  A5  A6  A7  A8  A9  A10
0   1   2   3   4   5   6   7   9   10  8
1   1   5   8   6   7   9   10  0   0   0
2   1   3   9   6   8   7   0   0   0   0

这个怎么样?

def MultiColumnLabelEncode(choice, columns, X):
    LabelEncoders = []
    if choice == 'encode':
        for i in enumerate(columns):
            LabelEncoders.append(LabelEncoder())
        i=0    
        for cols in columns:
            X[:, cols] = LabelEncoders[i].fit_transform(X[:, cols])
            i += 1
    elif choice == 'decode': 
        for cols in columns:
            X[:, cols] = LabelEncoders[i].inverse_transform(X[:, cols])
            i += 1
    else:
        print('Please select correct parameter "choice". Available parameters: encode/decode')

这不是最有效的,但它工作,它是超级简单。

这是脚本

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
col_list = df.select_dtypes(include = "object").columns
for colsn in col_list:
    df[colsn] = le.fit_transform(df[colsn].astype(str))

问题是传递给fit函数的数据(pd dataframe)的形状。 你必须通过1d列表。