我有一个非常大的表(3000万行),我想在r中作为数据框架加载,read.table()有很多方便的特性,但似乎在实现中有很多逻辑会减慢速度。在我的例子中,我假设我事先知道列的类型,表不包含任何列标题或行名,也没有任何需要担心的病态字符。

我知道使用scan()将表读入为列表可以相当快,例如:

datalist <- scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0)))

但我试图将其转换为数据框架的一些尝试似乎将上述性能降低了6倍:

df <- as.data.frame(scan('myfile',sep='\t',list(url='',popularity=0,mintime=0,maxtime=0))))

有更好的办法吗?或者完全不同的解决问题的方法?


当前回答

Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:

(1)数据一直加载在数据库中。当重新打开笔记本电脑时,只需在pgadmin中重新连接所需的数据库。

(2) R的确可以比SQL做更多漂亮的统计和绘图操作。但是我认为SQL比R更适合于查询大量的数据。

# Looking at Voter/Registrant Age by Decade

library(RPostgreSQL);library(lattice)

con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
                 port="2345", host="localhost", dbname="WC2014_08_01_2014")

Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)

with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)

其他回答

下面是一个使用fread from data的例子。表1.8.7

这些例子来自我的windows XP Core 2 duo E8400上的帮助页面。

library(data.table)
# Demo speedup
n=1e6
DT = data.table( a=sample(1:1000,n,replace=TRUE),
                 b=sample(1:1000,n,replace=TRUE),
                 c=rnorm(n),
                 d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
                 e=rnorm(n),
                 f=sample(1:1000,n,replace=TRUE) )
DT[2,b:=NA_integer_]
DT[4,c:=NA_real_]
DT[3,d:=NA_character_]
DT[5,d:=""]
DT[2,e:=+Inf]
DT[3,e:=-Inf]

标准read.table

write.table(DT,"test.csv",sep=",",row.names=FALSE,quote=FALSE)
cat("File size (MB):",round(file.info("test.csv")$size/1024^2),"\n")    
## File size (MB): 51 

system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   24.71    0.15   25.42
# second run will be faster
system.time(DF1 <- read.csv("test.csv",stringsAsFactors=FALSE))        
##    user  system elapsed 
##   17.85    0.07   17.98

优化read.table

system.time(DF2 <- read.table("test.csv",header=TRUE,sep=",",quote="",  
                          stringsAsFactors=FALSE,comment.char="",nrows=n,                   
                          colClasses=c("integer","integer","numeric",                        
                                       "character","numeric","integer")))


##    user  system elapsed 
##   10.20    0.03   10.32

从文件中读

require(data.table)
system.time(DT <- fread("test.csv"))                                  
 ##    user  system elapsed 
##    3.12    0.01    3.22

sqldf

require(sqldf)

system.time(SQLDF <- read.csv.sql("test.csv",dbname=NULL))             

##    user  system elapsed 
##   12.49    0.09   12.69

# sqldf as on SO

f <- file("test.csv")
system.time(SQLf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

##    user  system elapsed 
##   10.21    0.47   10.73

Ff / FFDF

 require(ff)

 system.time(FFDF <- read.csv.ffdf(file="test.csv",nrows=n))   
 ##    user  system elapsed 
 ##   10.85    0.10   10.99

总而言之:

##    user  system elapsed  Method
##   24.71    0.15   25.42  read.csv (first time)
##   17.85    0.07   17.98  read.csv (second time)
##   10.20    0.03   10.32  Optimized read.table
##    3.12    0.01    3.22  fread
##   12.49    0.09   12.69  sqldf
##   10.21    0.47   10.73  sqldf on SO
##   10.85    0.10   10.99  ffdf

Often times I think it is just good practice to keep larger databases inside a database (e.g. Postgres). I don't use anything too much larger than (nrow * ncol) ncell = 10M, which is pretty small; but I often find I want R to create and hold memory intensive graphs only while I query from multiple databases. In the future of 32 GB laptops, some of these types of memory problems will disappear. But the allure of using a database to hold the data and then using R's memory for the resulting query results and graphs still may be useful. Some advantages are:

(1)数据一直加载在数据库中。当重新打开笔记本电脑时,只需在pgadmin中重新连接所需的数据库。

(2) R的确可以比SQL做更多漂亮的统计和绘图操作。但是我认为SQL比R更适合于查询大量的数据。

# Looking at Voter/Registrant Age by Decade

library(RPostgreSQL);library(lattice)

con <- dbConnect(PostgreSQL(), user= "postgres", password="password",
                 port="2345", host="localhost", dbname="WC2014_08_01_2014")

Decade_BD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from Birthdate) from voterdb where extract(DECADE from Birthdate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

Decade_RD_1980_42 <- dbGetQuery(con,"Select PrecinctID,Count(PrecinctID),extract(DECADE from RegistrationDate) from voterdb where extract(DECADE from RegistrationDate)::numeric > 198 and PrecinctID in (Select * from LD42) Group By PrecinctID,date_part Order by Count DESC;")

with(Decade_BD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Birthdays later than 1980 by Precinct",side=1,line=0)

with(Decade_RD_1980_42,(barchart(~count | as.factor(precinctid))));
mtext("42LD Registration Dates later than 1980 by Precinct",side=1,line=0)

上面这些我都试过了,[1]做得最好。我只有8gb的内存

循环20个文件,每个5gb, 7列:

read_fwf(arquivos[i],col_types = "ccccccc",fwf_cols(cnpj = c(4,17), nome = c(19,168), cpf = c(169,183), fantasia = c(169,223), sit.cadastral = c(224,225), dt.sitcadastral = c(226,233), cnae = c(376,382)))

我正在阅读数据非常快地使用新的箭头包。它似乎还处于相当早期的阶段。

具体来说,我使用的是拼花柱状格式。这将转换回R中的data.frame,但如果不这样做,您可以获得更大的加速。这种格式很方便,因为它也可以从Python中使用。

我的主要用例是在一个相当受限的RShiny服务器上。出于这些原因,我更喜欢将数据附加到应用程序(即SQL之外),因此要求小文件大小和速度。

这篇链接的文章提供了基准测试和一个很好的概述。下面我引用了一些有趣的观点。

https://ursalabs.org/blog/2019-10-columnar-perf/

文件大小

也就是说,Parquet文件甚至是gzip的CSV的一半大。Parquet文件如此小的原因之一是字典编码(也称为“字典压缩”)。字典压缩可以产生比使用LZ4或ZSTD(以FST格式使用)等通用字节压缩器更好的压缩效果。Parquet的设计目的是生成非常小的文件,以便快速读取。

读取速度

当通过输出类型控制时(例如,比较所有R data.frame输出),我们看到Parquet、Feather和FST的性能之间的差距相对较小。熊猫也是如此。DataFrame输出。数据。table::fread与1.5 GB文件大小的竞争令人印象深刻,但在2.5 GB CSV上落后于其他文件。


独立测试

我在一个1,000,000行的模拟数据集上执行了一些独立的基准测试。基本上,我打乱了一堆东西,试图挑战压缩。我还添加了一个随机单词和两个模拟因素的简短文本字段。

Data

library(dplyr)
library(tibble)
library(OpenRepGrid)

n <- 1000000

set.seed(1234)
some_levels1 <- sapply(1:10, function(x) paste(LETTERS[sample(1:26, size = sample(3:8, 1), replace = TRUE)], collapse = ""))
some_levels2 <- sapply(1:65, function(x) paste(LETTERS[sample(1:26, size = sample(5:16, 1), replace = TRUE)], collapse = ""))


test_data <- mtcars %>%
  rownames_to_column() %>%
  sample_n(n, replace = TRUE) %>%
  mutate_all(~ sample(., length(.))) %>%
  mutate(factor1 = sample(some_levels1, n, replace = TRUE),
         factor2 = sample(some_levels2, n, replace = TRUE),
         text = randomSentences(n, sample(3:8, n, replace = TRUE))
         )

读和写

写入数据很容易。

library(arrow)

write_parquet(test_data , "test_data.parquet")

# you can also mess with the compression
write_parquet(test_data, "test_data2.parquet", compress = "gzip", compression_level = 9)

读取数据也很容易。

read_parquet("test_data.parquet")

# this option will result in lightning fast reads, but in a different format.
read_parquet("test_data2.parquet", as_data_frame = FALSE)

我将这些数据与一些竞争选项进行了测试,得到的结果与上面的文章略有不同,这是意料之中的。

这个文件远没有基准测试文章那么大,所以这可能就是区别所在。

测试

rds: test_data。rds (20.3 MB) parquet2_native: (14.9 MB,更高的压缩和as_data_frame = FALSE) parquet2: test_data2。拼花地板(14.9 MB压缩更高) 拼花:test_data。拼花(40.7 MB) fst2: test_data2。fst (27.9 MB高压缩) 置:test_data。fst (768 MB) fread2: test_data.csv.gz (23.6MB) test_data.csv (987 mb) feather_arrow: test_data。羽毛(157.2 MB带箭头读取) 羽毛:test_data。羽毛(157.2 MB用羽毛阅读)

观察

对于这个特定的文件,fread实际上非常快。我喜欢高度压缩parquet2测试的小文件大小。如果我真的需要加快速度,我可能会花时间使用原生数据格式,而不是data.frame。

这里fst也是一个很好的选择。我要么使用高度压缩的fst格式,要么使用高度压缩的parquet格式,这取决于我是否需要在速度或文件大小方面进行权衡。

一开始我没有看到这个问题,几天后我问了一个类似的问题。我将记下我之前的问题,但我认为我应该在这里添加一个答案,以解释我如何使用sqldf()来做到这一点。

关于将2GB或更多的文本数据导入R数据帧的最佳方法,已经有了一些讨论。昨天我写了一篇关于使用sqldf()将数据导入SQLite作为暂存区,然后将它从SQLite吸到r的博客文章,这对我来说真的很好。我能够在不到5分钟的时间内提取2GB(3列,40mm行)的数据。相比之下,read.csv命令运行了一整夜,始终没有完成。

下面是我的测试代码:

设置测试数据:

bigdf <- data.frame(dim=sample(letters, replace=T, 4e7), fact1=rnorm(4e7), fact2=rnorm(4e7, 20, 50))
write.csv(bigdf, 'bigdf.csv', quote = F)

在运行以下导入例程之前,我重新启动R:

library(sqldf)
f <- file("bigdf.csv")
system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F)))

我让下面这行写了一整晚,但始终没有写完:

system.time(big.df <- read.csv('bigdf.csv'))