我有一个很大的数据集,我想阅读特定的列或放弃所有其他列。
data <- read.dta("file.dta")
我选择我不感兴趣的列:
var.out <- names(data)[!names(data) %in% c("iden", "name", "x_serv", "m_serv")]
然后我想做的事情是:
for(i in 1:length(var.out)) {
paste("data$", var.out[i], sep="") <- NULL
}
删除所有不需要的列。这是最优解吗?
我试图在使用包数据时删除一列。表并得到了意想不到的结果。我觉得下面的内容可能值得发表。只是一个小警告。
[编辑:Matthew…]
DF = read.table(text = "
fruit state grade y1980 y1990 y2000
apples Ohio aa 500 100 55
apples Ohio bb 0 0 44
apples Ohio cc 700 0 33
apples Ohio dd 300 50 66
", sep = "", header = TRUE, stringsAsFactors = FALSE)
DF[ , !names(DF) %in% c("grade")] # all columns other than 'grade'
fruit state y1980 y1990 y2000
1 apples Ohio 500 100 55
2 apples Ohio 0 0 44
3 apples Ohio 700 0 33
4 apples Ohio 300 50 66
library('data.table')
DT = as.data.table(DF)
DT[ , !names(dat4) %in% c("grade")] # not expected !! not the same as DF !!
[1] TRUE TRUE FALSE TRUE TRUE TRUE
DT[ , !names(DT) %in% c("grade"), with=FALSE] # that's better
fruit state y1980 y1990 y2000
1: apples Ohio 500 100 55
2: apples Ohio 0 0 44
3: apples Ohio 700 0 33
4: apples Ohio 300 50 66
基本上就是数据的语法。table与data.frame并不完全相同。实际上有很多不同之处,参见FAQ 1.1和FAQ 2.17。我警告过你!
不要使用-which(),这是极其危险的。考虑:
dat <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
dat[ , -which(names(dat) %in% c("z","u"))] ## works as expected
dat[ , -which(names(dat) %in% c("foo","bar"))] ## deletes all columns! Probably not what you wanted...
使用子集或!功能:
dat[ , !names(dat) %in% c("z","u")] ## works as expected
dat[ , !names(dat) %in% c("foo","bar")] ## returns the un-altered data.frame. Probably what you want
我从痛苦的经历中学到了这一点。不要过度使用which()!
这是另一个可能对其他人有帮助的解决方案。下面的代码从一个大数据集中选择少量的行和列。这些列被选择为juba的答案之一,除了我使用粘贴函数选择一组列的名称按顺序编号:
df = read.table(text = "
state county city region mmatrix X1 X2 X3 A1 A2 A3 B1 B2 B3 C1 C2 C3
1 1 1 1 111010 1 0 0 2 20 200 4 8 12 NA NA NA
1 2 1 1 111010 1 0 0 4 NA 400 5 9 NA NA NA NA
1 1 2 1 111010 1 0 0 6 60 NA NA 10 14 NA NA NA
1 2 2 1 111010 1 0 0 NA 80 800 7 11 15 NA NA NA
1 1 3 2 111010 0 1 0 1 2 1 2 2 2 10 20 30
1 2 3 2 111010 0 1 0 2 NA 1 2 2 NA 40 50 NA
1 1 4 2 111010 0 1 0 1 1 NA NA 2 2 70 80 90
1 2 4 2 111010 0 1 0 NA 2 1 2 2 10 100 110 120
1 1 1 3 010010 0 0 1 10 20 10 200 200 200 1 2 3
1 2 1 3 001000 0 0 1 20 NA 10 200 200 200 4 5 9
1 1 2 3 101000 0 0 1 10 10 NA 200 200 200 7 8 NA
1 2 2 3 011010 0 0 1 NA 20 10 200 200 200 10 11 12
", sep = "", header = TRUE, stringsAsFactors = FALSE)
df
df2 <- df[df$region == 2, names(df) %in% c(paste("C", seq_along(1:3), sep=''))]
df2
# C1 C2 C3
# 5 10 20 30
# 6 40 50 NA
# 7 70 80 90
# 8 100 110 120