I want to find the number of NaN values in each column of my data.


Current answer

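Another option is to use np.count_nonzero together with .isna(). Note that, called this way, it returns a single total for the whole DataFrame rather than one count per column: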

np.count_nonzero(df.isna())

%timeit np.count_nonzero(df.isna())
512 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
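To get one count per column out of np.count_nonzero, the axis argument can be used. The following is a minimal sketch on a small made-up frame (the name demo and its values are assumptions for illustration, not the benchmark data):

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the benchmark data).
demo = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [np.nan, 1.0, 2.0]})

# Boolean array, True wherever a value is missing.
mask = demo.isna().to_numpy()

# One number for the whole frame.
total_nan = np.count_nonzero(mask)

# One number per column: count along the row axis and re-attach the column labels.
per_column = pd.Series(np.count_nonzero(mask, axis=0), index=demo.columns)

print(total_nan)
print(per_column)
```

Converting to a NumPy array first keeps the axis semantics explicit; the column labels are lost in that step, which is why they are re-attached through pd.Series.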

Comparison with the top answers, using a DataFrame of 1000005 rows × 16 columns:

%timeit df.isna().sum()
492 ms ± 55.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.isnull().sum(axis = 0)
478 ms ± 34.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit count_nan = len(df) - df.count()
484 ms ± 47.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
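The three per-column variants timed above should give the same counts. A quick sanity check on a small made-up frame (demo and its contents are assumptions for illustration) might look like this:

```python
import numpy as np
import pandas as pd

# Illustrative frame with a float column and an object column.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["u", None, None]})

by_isna = demo.isna().sum()             # count True values per column
by_isnull = demo.isnull().sum(axis=0)   # isnull is an alias of isna
by_count = len(demo) - demo.count()     # rows minus non-missing values

print(by_isna)
```

Since the results agree, the choice between them is mostly a matter of readability; the timings above are within noise of each other.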

Data:

import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])

# big dataframe for %timeit 
big_df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 10)), columns=list('ABCDEFGHIJ'))
df = pd.concat([big_df,results]) # 1000005 rows × 16 columns
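The benchmark frame has 16 columns because pd.concat aligns on column names and fills every cell that has no source value with NaN, so it contains many more missing values than the small results frame alone. A tiny sketch of that effect, using made-up frames named left and right:

```python
import numpy as np
import pandas as pd

# Two frames with no column names in common (illustrative data).
left = pd.DataFrame({"A": [1, 2]})
right = pd.DataFrame({"x": ["a", np.nan]})

# Rows are stacked; columns missing from one input are filled with NaN.
combined = pd.concat([left, right])

nan_counts = combined.isna().sum()
print(combined.shape)
print(nan_counts)
```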

Other answers

I use this loop to count the missing values in each column:

# check missing values
import numpy as np, pandas as pd
for col in df:
    # np.str was removed from NumPy; an f-string also handles non-string column labels
    print(f"{col}: {df[col].isna().sum()}")
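On wide frames it can be handy to print only the columns that actually have missing values. One possible variation of the loop, sketched on a made-up frame (demo is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame: column "b" has no missing values.
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1, 2, 3],
    "c": [np.nan, np.nan, "z"],
})

counts = demo.isna().sum()

# Keep only columns with at least one NaN, largest count first.
only_missing = counts[counts > 0].sort_values(ascending=False)

for col, n in only_missing.items():
    print(f"{col}: {n}")
```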

Use the following to count the NaNs in a specific column:

dataframe.columnName.isnull().sum()

import pandas as pd
import numpy as np

# example DataFrame
df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})

# count the NaNs in a column
num_nan_a = df.loc[ (pd.isna(df['a'])) , 'a' ].shape[0]
num_nan_b = df.loc[ (pd.isna(df['b'])) , 'b' ].shape[0]

# print the DataFrame and the NaN counts for both columns
print(df)
print(' ')
print(f"There are {num_nan_a} NaNs in column a")
print(f"There are {num_nan_b} NaNs in column b")

Output:

     a    b
0  1.0  NaN
1  2.0  1.0
2  NaN  NaN

There are 1 NaNs in column a
There are 2 NaNs in column b
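The same per-column count can probably be written more directly as isna().sum() on the column. Bracket indexing also works for column names that are not valid Python attribute names, where the dataframe.columnName form cannot be used. A short sketch (the frame demo and the column name "my col" are made up for illustration):

```python
import numpy as np
import pandas as pd

# Illustrative frame; "my col" contains a space, so attribute access is not possible.
demo = pd.DataFrame({"a": [1, 2, np.nan], "my col": [np.nan, 1, np.nan]})

# Direct count: isna() gives a boolean Series, sum() counts the True values.
num_nan_a = demo["a"].isna().sum()
num_nan_my_col = demo["my col"].isna().sum()

# The .loc-based form selects the NaN rows and counts them; it gives the same number.
via_loc = demo.loc[pd.isna(demo["a"]), "a"].shape[0]

print(num_nan_a, num_nan_my_col, via_loc)
```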

Building on the answers given, with some improvements, this is my approach:

def PercentageMissin(Dataset):
    """Return the percentage of missing values in each column of a DataFrame."""
    if isinstance(Dataset, pd.DataFrame):
        adict = {}  # maps each column name to the percentage of missing values in that column
        for col in Dataset.columns:
            adict[col] = (np.count_nonzero(Dataset[col].isnull()) * 100) / len(Dataset[col])
        return pd.DataFrame(adict, index=['% of missing'], columns=adict.keys())
    else:
        raise TypeError("can only be used with a pandas DataFrame")
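A shorter way to get the same percentages, offered as a sketch rather than a drop-in replacement: the mean of a boolean mask is the fraction of True values, so isna().mean() * 100 gives the percentage of missing values per column. The frame demo below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative frame: "a" is half missing, "b" is a quarter missing.
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, 3.0, np.nan],
})

# Fraction of missing values per column, scaled to percent.
pct_missing = demo.isna().mean() * 100

print(pct_missing)
```

This returns a Series indexed by column name instead of a one-row DataFrame; calling pct_missing.to_frame('% of missing').T would give a layout close to the function above.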

You can use the value_counts method to get the number of np.nan values in a Series s:

s.value_counts(dropna = False)[np.nan]
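One caveat with the lookup by label: if the Series has no NaN at all, the np.nan label is absent from the value_counts result and the indexing raises a KeyError. A more defensive sketch follows; nan_count is a hypothetical helper name, and both Series are made up for illustration.

```python
import numpy as np
import pandas as pd

with_nan = pd.Series([1.0, np.nan, 1.0, np.nan, 2.0])
without_nan = pd.Series([1.0, 2.0, 2.0])


def nan_count(s):
    """Hypothetical helper: number of NaNs, 0 when nothing is missing."""
    return int(s.isna().sum())


# Staying with value_counts: select the NaN entry through a boolean mask
# on the index, which yields an empty selection (sum 0) when NaN is absent.
counts = with_nan.value_counts(dropna=False)
nan_from_counts = int(counts[counts.index.isna()].sum())

print(nan_count(with_nan), nan_count(without_nan), nan_from_counts)
```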