我想删除这个数据帧中的行:
a)在所有列中包含NAs。下面是我的示例数据帧。
gene hsap mmul mmus rnor cfam
1 ENSG00000208234 0 NA NA NA NA
2 ENSG00000199674 0 2 2 2 2
3 ENSG00000221622 0 NA NA NA NA
4 ENSG00000207604 0 NA NA 1 2
5 ENSG00000207431 0 NA NA NA NA
6 ENSG00000221312 0 1 2 3 2
基本上,我想获得如下所示的数据帧。
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
6 ENSG00000221312 0 1 2 3 2
b)只在某些列中包含NAs,所以我也可以得到这个结果:
gene hsap mmul mmus rnor cfam
2 ENSG00000199674 0 2 2 2 2
4 ENSG00000207604 0 NA NA 1 2
6 ENSG00000221312 0 1 2 3 2
我更喜欢用下面的方法来检查行中是否包含NAs:
row.has.na <- apply(final, 1, function(x){any(is.na(x))})
这将返回逻辑向量,其中的值表示一行中是否有NA。你可以使用它来查看你需要删除多少行:
sum(row.has.na)
并最终放弃它们
final.filtered <- final[!row.has.na,]
对于过滤具有特定部分NAs的行,它变得有点棘手(例如,您可以将'final[,5:6]'输入到'apply')。
一般来说,Joris Meys的解决方案似乎更优雅。
我是个合成器:)。这里我把答案组合成一个函数:
#' keep rows that have a certain number (range) of NAs anywhere/somewhere and delete others
#' @param df a data frame
#' @param col restrict to the columns where you would like to search for NA; eg, 3, c(3), 2:5, "place", c("place","age")
#' \cr default is NULL, search for all columns
#' @param n integer or vector, 0, c(3,5), number/range of NAs allowed.
#' \cr If a number, the exact number of NAs kept
#' \cr Range includes both ends 3<=n<=5
#' \cr Range could be -Inf, Inf
#' @return returns a new df with rows that have NA(s) removed
#' @export
ez.na.keep = function(df, col=NULL, n=0){
if (!is.null(col)) {
# R converts a single row/col to a vector if the parameter col has only one col
# see https://radfordneal.wordpress.com/2008/08/20/design-flaws-in-r-2-%E2%80%94-dropped-dimensions/#comments
df.temp = df[,col,drop=FALSE]
} else {
df.temp = df
}
if (length(n)==1){
if (n==0) {
# simply call complete.cases which might be faster
result = df[complete.cases(df.temp),]
} else {
# credit: http://stackoverflow.com/a/30461945/2292993
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) sum(x) == n)
result = df[logindex, ]
}
}
if (length(n)==2){
min = n[1]; max = n[2]
log <- apply(df.temp, 2, is.na)
logindex <- apply(log, 1, function(x) {sum(x) >= min && sum(x) <= max})
result = df[logindex, ]
}
return(result)
}
Dplyr 1.0.4引入了两个配套的过滤函数:if_any()和if_all()。在这种情况下,if_all()伴随函数将特别有用:
a)删除所有列中包含NAs的行
df %>%
filter(if_all(everything(), ~ !is.na(.x)))
这一行将只保留那些列中没有NAs的行。
b)删除仅在某些列中包含NAs的行
cols_to_check = c("rnor", "cfam")
df %>%
filter(if_all(cols_to_check, ~ !is.na(.x)))
这一行将检查任何指定的列(cols_to_check)是否有NAs,并只保留没有NAs的那些行。