I'd like to remove the rows in this data frame that:
a) contain NAs across all columns. Below is my example data frame.
             gene hsap mmul mmus rnor cfam
1 ENSG00000208234    0   NA   NA   NA   NA
2 ENSG00000199674    0    2    2    2    2
3 ENSG00000221622    0   NA   NA   NA   NA
4 ENSG00000207604    0   NA   NA    1    2
5 ENSG00000207431    0   NA   NA   NA   NA
6 ENSG00000221312    0    1    2    3    2
Basically, I'd like to get a data frame such as the following.
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2
b) contain NAs in only some columns, so I can also get this result:
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2
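For anyone who wants to reproduce this, here is a minimal sketch of how the example data frame could be built. The name final is an assumption, chosen only to match the calls used in the answers, and the column types (character for gene, numeric for the rest) are my guess from the printed output.

```r
# Rebuild the example data frame from the printed table.
# `final` is an assumed name; column types are inferred from the output.
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,                          # recycled to all six rows
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

print(final)
```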
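If you want control over how many NAs are acceptable in each row, try this function. In many survey data sets, too many blank question responses can ruin the results, so rows are dropped once they pass a certain threshold. This function lets you choose how many NAs a row may contain before it is deleted: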
delete.na <- function(DF, n=0) {
  # keep only the rows that contain at most n NAs
  DF[rowSums(is.na(DF)) <= n,]
}
By default, it removes every row that contains any NA:
delete.na(final)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
6 ENSG00000221312    0    1    2    3    2
Or specify the maximum number of NAs allowed:
delete.na(final, 2)
             gene hsap mmul mmus rnor cfam
2 ENSG00000199674    0    2    2    2    2
4 ENSG00000207604    0   NA   NA    1    2
6 ENSG00000221312    0    1    2    3    2
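To see why those rows survive, it may help to look at the intermediate per-row NA counts that the function compares against n. This is only a sketch on the question's example data, rebuilt here under the assumed name final; na_per_row is a name I made up for illustration.

```r
# Example data from the question (name `final` assumed)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# is.na() gives a logical matrix; rowSums() counts the TRUEs in each row
na_per_row <- rowSums(is.na(final))   # counts are 4, 0, 4, 2, 4, 0

# n = 0 keeps only rows with a count of zero (rows 2 and 6)
strict <- final[na_per_row <= 0, ]

# n = 2 additionally keeps row 4, which has exactly two NAs
lenient <- final[na_per_row <= 2, ]
```

Rows 1, 3 and 5 each have four NAs, so they would only be kept with n of 4 or more.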
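Using the filter() function from the {dplyr} package together with the if_all() helper is an approach that is both general and yields fairly readable code. (Older code wrote this as filter(across(...)); recent dplyr releases deprecate across() inside filter() in favour of if_all() and if_any(), and one_of() has been superseded by all_of() and any_of().)
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Keep only rows with non-missing entries in a specific set of columns
df %>%
  filter(if_all(all_of(vars_to_check),
                ~ !is.na(.x)))
# Keep only rows with no NA in any column
df %>%
  filter(if_all(everything(),
                ~ !is.na(.x)))
# Keep only rows with no NA in any numeric column
df %>%
  filter(if_all(where(is.numeric),
                ~ !is.na(.x)))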
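The snippets above require every checked column to be non-missing. If you instead want to drop a row only when all of the checked columns are missing, dplyr's if_any() (available alongside if_all() since dplyr 1.0.4) seems to be the natural counterpart. A minimal sketch on the question's data follows; the names final, measure_cols, kept_any and kept_all are mine, not from the original posts.

```r
library(dplyr)

# Example data from the question (name `final` assumed)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

measure_cols <- c("mmul", "mmus", "rnor", "cfam")

# Keep a row if at least one checked column is non-missing
kept_any <- final %>%
  filter(if_any(all_of(measure_cols), ~ !is.na(.x)))

# Keep a row only if every checked column is non-missing
kept_all <- final %>%
  filter(if_all(all_of(measure_cols), ~ !is.na(.x)))
```

On this data kept_any retains the three genes from result (b), while kept_all retains the two genes from result (a).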
Similarly, the dplyr package also has scoped variants (filter_all, filter_at, filter_if) that accomplish the same thing. They are superseded in current dplyr but still work:
library(dplyr)
vars_to_check <- c("rnor", "cfam")
# Keep only rows with non-missing entries in a specific set of columns
df %>%
  filter_at(.vars = vars(one_of(vars_to_check)),
            ~ !is.na(.))
# Keep only rows with no NA in any column
df %>%
  filter_all(~ !is.na(.))
# Keep only rows with no NA in any numeric column
df %>%
  filter_if(is.numeric,
            ~ !is.na(.))
I'm a synthesizer :). Here I've combined the answers into one function:
#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df containing only the rows whose NA count matches n
#' @export
ez.na.keep = function(df, col=NULL, n=0){
  if (!is.null(col)) {
    # R converts a single row/col to a vector if the parameter col has only one col
    # see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
    df.temp = df[, col, drop=FALSE]
  } else {
    df.temp = df
  }
  # number of NAs per row in the searched columns
  # credit: http://stackoverflow.com/a/30461945/2292993
  na.count = rowSums(is.na(df.temp))
  if (length(n) == 1) {
    if (n == 0) {
      # simply call complete.cases which might be faster
      result = df[complete.cases(df.temp), ]
    } else {
      # keep rows with exactly n NAs
      result = df[na.count == n, ]
    }
  } else if (length(n) == 2) {
    # keep rows whose NA count lies in the inclusive range n[1]..n[2]
    result = df[na.count >= n[1] & na.count <= n[2], ]
  } else {
    # previously this case fell through and failed with "object 'result' not found"
    stop("n must be a single number or a range c(min, max)")
  }
  return(result)
}
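To check what each mode of n selects, the same logic can be spelled out by hand on the question's example data. This is only a sketch: final, checked, na.count and the three result names are mine, and the ez.na.keep calls in the comments are what I'd expect the equivalents to be rather than something I've verified against every edge case.

```r
# Example data from the question (name `final` assumed)
final <- data.frame(
  gene = c("ENSG00000208234", "ENSG00000199674", "ENSG00000221622",
           "ENSG00000207604", "ENSG00000207431", "ENSG00000221312"),
  hsap = 0,
  mmul = c(NA, 2, NA, NA, NA, 1),
  mmus = c(NA, 2, NA, NA, NA, 2),
  rnor = c(NA, 2, NA, 1, NA, 3),
  cfam = c(NA, 2, NA, 2, NA, 2)
)

# Restrict the search to two columns, no NAs allowed there;
# expected to match ez.na.keep(final, col = c("rnor", "cfam"))
checked <- final[, c("rnor", "cfam"), drop = FALSE]
no_na_in_checked <- final[complete.cases(checked), ]

# Per-row NA count over all columns
na.count <- rowSums(is.na(final))

# Exactly two NAs; expected to match ez.na.keep(final, n = 2)
exactly_two <- final[na.count == 2, ]

# Between one and four NAs inclusive; expected to match ez.na.keep(final, n = c(1, 4))
one_to_four <- final[na.count >= 1 & na.count <= 4, ]
```

Note that a single number means an exact count here, unlike delete.na, where n is an upper bound.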