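Detect and exclude outliers in a pandas DataFrame

I have a pandas DataFrame with a few columns.

I know that certain rows are outliers based on the value in one column.

For instance

the column "Vol" has all values around 12xx, and one value is 4000 (an outlier).

I would like to exclude the rows whose Vol value looks like that.

So, essentially, I need to put a filter on the DataFrame that selects all rows where the value of a certain column is within, say, 3 standard deviations of the mean.

What is an elegant way to achieve this?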

Current answer

This answer is similar to the one provided by @tanemaki, but uses a lambda expression instead of scipy stats.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 3), columns=list('ABC'))

standard_deviations = 3
df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < standard_deviations)
   .all(axis=1)]

To filter the DataFrame so that only ONE column (e.g. 'B') has to be within three standard deviations:

df[((df['B'] - df['B'].mean()) / df['B'].std()).abs() < standard_deviations]

For how to apply this z-score on a rolling basis, see: Rolling Z-score applied to pandas dataframe
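For the rolling variant, here is a minimal sketch, assuming a fixed trailing window. The function name `rolling_zscore` and the window size of 20 are illustrative choices of mine, not taken from the linked answer:

```python
import numpy as np
import pandas as pd

def rolling_zscore(s, window=20):
    # Mean and std over the trailing `window` observations (current one included).
    roll = s.rolling(window=window)
    return (s - roll.mean()) / roll.std()

rng = np.random.default_rng(0)
s = pd.Series(rng.normal(size=200))
s.iloc[150] = 25.0  # inject an obvious outlier

z = rolling_zscore(s)
# The first window-1 z-scores are NaN; NaN < 3 is False, so those rows are dropped too.
filtered = s[z.abs() < 3]
```

Note that with a window of n points and the sample standard deviation, a single point can never score higher than (n-1)/sqrt(n), so very short windows cannot reach a threshold of 3 at all.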

2015-07-19 15:44:23

Other answers

Below is a full example with data and 2 groups:

Imports:

from io import StringIO  # Python 3; the Python 2 `StringIO` module no longer exists
import pandas as pd
#pandas config
pd.set_option('display.max_rows', 20)

Example data with 2 grouping columns: G1 (group 1) and G2 (group 2):

TESTDATA = StringIO("""G1;G2;Value
1;A;1.6
1;A;5.1
1;A;7.1
1;A;8.1

1;B;21.1
1;B;22.1
1;B;24.1
1;B;30.6

2;A;40.6
2;A;51.1
2;A;52.1
2;A;60.6

2;B;80.1
2;B;70.6
2;B;90.6
2;B;85.1
""")

Read the text data into a pandas DataFrame:

df = pd.read_csv(TESTDATA, sep=";")

Define the outliers using standard deviations:

stds = 1.0
outliers = df[['G1', 'G2', 'Value']].groupby(['G1','G2']).transform(
           lambda group: (group - group.mean()).abs().div(group.std())) > stds

Define the filtered values and the outliers:

dfv = df[outliers.Value == False]
dfo = df[outliers.Value == True]

Print the result:

print('\n'*5, 'All values with decimal 1 are non-outliers. On the other hand, all values with 6 in the decimal are.')
print('\nDef DATA:\n%s\n\nFiltered Values with %s stds:\n%s\n\nOutliers:\n%s' % (df, stds, dfv, dfo))

2018-03-20 01:06:34

Another option is to transform your data so that the effect of outliers is mitigated. You can do this by winsorizing your data.

import pandas as pd
from scipy.stats import mstats
%matplotlib inline

test_data = pd.Series(range(30))
test_data.plot()

# Truncate values to the 5th and 95th percentiles
transformed_test_data = pd.Series(mstats.winsorize(test_data, limits=[0.05, 0.05])) 
transformed_test_data.plot()
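If you would rather not depend on scipy, a similar effect can be sketched in plain pandas by clipping to quantiles. This is an approximation and not identical to `mstats.winsorize`: winsorizing replaces a fixed number of extreme values with the nearest retained value, whereas `clip` uses interpolated quantiles as bounds.

```python
import pandas as pd

# Same toy data as above, stored as floats so clipping to fractional bounds is lossless
test_data = pd.Series(range(30), dtype=float)

lower = test_data.quantile(0.05)
upper = test_data.quantile(0.95)

# Values below `lower` become `lower`, values above `upper` become `upper`
clipped = test_data.clip(lower=lower, upper=upper)
```

No rows are removed; only the extreme values are pulled in to the bounds.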

2017-07-13 14:14:31

If your DataFrame has outliers, there are many ways to handle them:

Most of them are mentioned in my article: Give it a read

Find the code here: Notebook

2022-12-24 15:37:42

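Before answering the actual question, we should ask another one that is very relevant, depending on the nature of your data:

What is an outlier?

Imagine the series of values [3, 2, 3, 4, 999] (where the 999 seemingly doesn't fit in) and consider various ways of detecting outliers.

Z-score

The problem here is that the value in question heavily distorts the mean and std we measure, resulting in inconspicuous z-scores of roughly [-0.5, -0.5, -0.5, -0.5, 2.0], which keeps every value within two standard deviations of the mean. One very large outlier can therefore distort your whole assessment of outliers. I would discourage this approach.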
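To make that concrete, here is a small check of the z-scores for the example series. It is only a sketch and uses the population standard deviation (`ddof=0`, numpy's default):

```python
import numpy as np

values = np.array([3, 2, 3, 4, 999], dtype=float)

# np.std defaults to ddof=0 (population standard deviation)
z = (values - values.mean()) / values.std()
print(np.round(z, 2))

# A |z| < 3 filter keeps everything, including the 999
keep = np.abs(z) < 3
```

Every entry of `keep` is True here, so the filter fails to remove the one value that obviously does not belong.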

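Quantile filter

A far more robust approach is given in this answer, which eliminates the bottom and top 1% of the data. However, this removes a fixed fraction regardless of whether those data are really outliers. You might lose a lot of valid data, and on the other hand you still keep some outliers if more than 1% or 2% of your data are outliers.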
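The fixed-fraction behaviour is easy to see on data that contains no outliers at all. A minimal sketch with synthetic data (the distribution parameters and seed are arbitrary assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Well-behaved data around 1200, with no outlier injected
clean = pd.Series(rng.normal(loc=1200, scale=10, size=1000))

lo, hi = clean.quantile([0.01, 0.99])
kept = clean[(clean > lo) & (clean < hi)]

# Roughly 2% of perfectly valid rows are discarded anyway
n_dropped = len(clean) - len(kept)
```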

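IQR distance from the median

An even more robust version of the quantile principle: eliminate all data that are more than f times the interquartile range (IQR) away from the median of the data. This is also the transformation that sklearn's RobustScaler uses, for example. The IQR and the median are robust to outliers, so you sidestep the problems of the z-score approach.

In a normal distribution we have roughly iqr = 1.35 * s, so z = 3 in a z-score filter translates to f = 3 / 1.35 = 2.22 in an IQR filter. This drops the 999 in the example above.

The basic assumption is that at least the "middle half" of your data is valid and resembles the distribution well. The approach also breaks down if your distribution has wide tails and a narrow q_25% to q_75% interval.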
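Applied to the example series, the IQR rule can be sketched like this (variable names are mine; note that an IQR of zero, e.g. for near-constant data, would need separate handling):

```python
import numpy as np

values = np.array([3, 2, 3, 4, 999], dtype=float)

median = np.median(values)
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# z = 3 translated to IQR units, using iqr ~ 1.35 * s for normal data
f = 3 / 1.35

keep = np.abs(values - median) / iqr < f
filtered = values[keep]
```

Unlike the z-score filter, this keeps 3, 2, 3 and 4 and drops only the 999.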

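Advanced statistical methods

Of course there are fancy mathematical methods such as Peirce's criterion, Grubbs' test or Dixon's Q test, to mention just a few that are also suitable for non-normally distributed data. None of them is easy to implement, so they are not addressed further here.

Code

Replace all outliers in all numerical columns with np.nan on an example DataFrame. The method is robust against all dtypes that pandas provides and can easily be applied to DataFrames with mixed types: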

import pandas as pd
import numpy as np                                     

# sample data of all dtypes in pandas (column 'a' has an outlier)         # dtype:
df = pd.DataFrame({'a': list(np.random.rand(8)) + [123456, np.nan],       # float64
                   'b': [0,1,2,3,np.nan,5,6,np.nan,8,9],                  # float64 (NaN forces float)
                   'c': [np.nan] + list("qwertzuio"),                     # object
                   'd': [pd.to_datetime(_) for _ in range(10)],           # datetime64[ns]
                   'e': [pd.Timedelta(_) for _ in range(10)],             # timedelta64[ns]
                   'f': [True] * 5 + [False] * 5,                         # bool
                   'g': pd.Series(list("abcbabbcaa"), dtype="category")}) # category
cols = df.select_dtypes('number').columns  # limits to a (float), b (int) and e (timedelta)
df_sub = df.loc[:, cols]


# OPTION 1: z-score filter: z-score < 3
lim = np.abs((df_sub - df_sub.mean()) / df_sub.std(ddof=0)) < 3

# OPTION 2: quantile filter: discard 1% upper / lower values
lim = np.logical_and(df_sub < df_sub.quantile(0.99, numeric_only=False),
                     df_sub > df_sub.quantile(0.01, numeric_only=False))

# OPTION 3: iqr filter: within 2.22 IQR (equiv. to z-score < 3)
iqr = df_sub.quantile(0.75, numeric_only=False) - df_sub.quantile(0.25, numeric_only=False)
lim = np.abs((df_sub - df_sub.median()) / iqr) < 2.22


# replace outliers with nan
df.loc[:, cols] = df_sub.where(lim, np.nan)

To drop all rows that contain at least one NaN value:

df.dropna(subset=cols, inplace=True) # drop rows with NaN in numerical columns
# or
df.dropna(inplace=True)  # drop rows with NaN in any column

Uses the following pandas 1.3 functions:

pandas.DataFrame.select_dtypes(), pandas.DataFrame.quantile(), pandas.DataFrame.where(), pandas.DataFrame.dropna()

2021-08-31 15:21:38

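Since I haven't seen an answer that deals with both numerical and non-numerical attributes, here is a complementary answer.

You might want to drop outliers only on numerical attributes (categorical variables can hardly be outliers).

Function definition

I have extended @tanemaki's suggestion to handle data where non-numerical attributes are also present: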

import numpy as np
from scipy import stats

def drop_numerical_outliers(df, z_thresh=3):
    # `constrains` is True for rows whose numeric values all have |z-score| below the threshold.
    # (The old `reduce=False` argument of DataFrame.apply was removed in pandas 1.0.)
    constrains = df.select_dtypes(include=[np.number]) \
        .apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    # Drop (in place) the rows that violate the constraint
    df.drop(df.index[~constrains], inplace=True)

Usage

drop_numerical_outliers(df)

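Example

Imagine a dataset df with some values about houses: alley, land contour, sale price ... E.g.: Data Documentation

First, you want to visualise the data on a scatter plot (with z-score threshold = 3):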

# Plot the data before dropping the points with a z-score greater than 3.
# The scatterAreaVsPrice function's definition has been removed for readability's sake.
scatterAreaVsPrice(df)

# Drop the outliers on every numerical attribute
drop_numerical_outliers(df)

# Plot the result. All outliers were dropped. Note that the red points are not
# the same outliers as in the first plot, but newly computed outliers based on the new DataFrame.
scatterAreaVsPrice(df)

2019-06-23 15:33:20
