人们使用什么技巧来管理交互式R会话的可用内存?我使用下面的函数[基于Petr Pikal和David Hinds在2004年发布的r-help列表]来列出(和/或排序)最大的对象,并偶尔rm()其中一些对象。但到目前为止最有效的解决办法是……在64位Linux下运行,有充足的内存。

大家还有什么想分享的妙招吗?请每人寄一份。

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

当前回答

请注意这些数据。table包的tables()似乎是Dirk的.ls.objects()自定义函数的一个很好的替代品(在前面的回答中有详细说明),尽管只是针对data.frames/tables,而不是矩阵,数组,列表。

其他回答

For both speed and memory purposes, when building a large data frame via some complex series of steps, I'll periodically flush it (the in-progress data set being built) to disk, appending to anything that came before, and then restart it. This way the intermediate steps are only working on smallish data frames (which is good as, e.g., rbind slows down considerably with larger objects). The entire data set can be read back in at the end of the process, when all the intermediate objects have been removed.

dfinal <- NULL
first <- TRUE
tempfile <- "dfinal_temp.csv"
for( i in bigloop ) {
    if( !i %% 10000 ) { 
        print( i, "; flushing to disk..." )
        write.table( dfinal, file=tempfile, append=!first, col.names=first )
        first <- FALSE
        dfinal <- NULL   # nuke it
    }

    # ... complex operations here that add data to 'dfinal' data frame  
}
print( "Loop done; flushing to disk and re-reading entire data set..." )
write.table( dfinal, file=tempfile, append=TRUE, col.names=FALSE )
dfinal <- read.table( tempfile )

这是个好把戏。

另一个建议是尽可能使用内存效率高的对象:例如,使用矩阵而不是data.frame。

这并没有真正解决内存管理问题,但是一个不为人所知的重要函数是memory.limit()。可以使用memory.limit(size=2500)命令增加默认值,这里的大小以MB为单位。正如Dirk提到的,为了真正利用这一点,您需要使用64位。

除了以上回答中给出的更通用的内存管理技术外,我总是尽可能地减小对象的大小。例如,我处理非常大但非常稀疏的矩阵,换句话说,大多数值为零的矩阵。使用“矩阵”包(大写很重要),我能够将我的平均对象大小从~2GB减小到~200MB,简单如下:

my.matrix <- Matrix(my.matrix)

Matrix包包含的数据格式可以像常规矩阵一样使用(不需要更改其他代码),但能够更有效地存储稀疏数据,无论是加载到内存中还是保存到磁盘中。

此外,我收到的原始文件是“长”格式的,其中每个数据点都有变量x, y, z, I。将数据转换为只有变量I的x * y * z维度数组更有效。

了解你的数据并使用一些常识。

我从不保存R工作区。我使用导入脚本和数据脚本,并将我不想经常重新创建的任何特别大的数据对象输出到文件。这样,我总是从一个新的工作空间开始,不需要清理大的物体。这是一个很好的函数。

只有4GB的内存(运行Windows 10,所以大约是2或更现实的1GB),我必须非常小心地分配。

我使用数据。几乎只有桌子。

'fread'函数允许您在导入时按字段名划分信息子集;只导入开始时实际需要的字段。如果使用base R read,则在导入后立即将伪列空。

正如42-所建议的,只要有可能,我将在导入信息后立即在列中进行子集。

我经常从环境中rm()对象,一旦他们不再需要,例如在使用他们子集其他东西后的下一行,并调用gc()。

'fread'和'fwrite'从数据。表的读写速度可以非常快。

As kpierce8 suggests, I almost always fwrite everything out of the environment and fread it back in, even with thousand / hundreds of thousands of tiny files to get through. This not only keeps the environment 'clean' and keeps the memory allocation low but, possibly due to the severe lack of RAM available, R has a propensity for frequently crashing on my computer; really frequently. Having the information backed up on the drive itself as the code progresses through various stages means I don't have to start right from the beginning if it crashes.

As of 2017, I think the fastest SSDs are running around a few GB per second through the M2 port. I have a really basic 50GB Kingston V300 (550MB/s) SSD that I use as my primary disk (has Windows and R on it). I keep all the bulk information on a cheap 500GB WD platter. I move the data sets to the SSD when I start working on them. This, combined with 'fread'ing and 'fwrite'ing everything has been working out great. I've tried using 'ff' but prefer the former. 4K read/write speeds can create issues with this though; backing up a quarter of a million 1k files (250MBs worth) from the SSD to the platter can take hours. As far as I'm aware, there isn't any R package available yet that can automatically optimise the 'chunkification' process; e.g. look at how much RAM a user has, test the read/write speeds of the RAM / all the drives connected and then suggest an optimal 'chunkification' protocol. This could produce some significant workflow improvements / resource optimisations; e.g. split it to ... MB for the ram -> split it to ... MB for the SSD -> split it to ... MB on the platter -> split it to ... MB on the tape. It could sample data sets beforehand to give it a more realistic gauge stick to work from.

A lot of the problems I've worked on in R involve forming combination and permutation pairs, triples etc, which only makes having limited RAM more of a limitation as they will often at least exponentially expand at some point. This has made me focus a lot of attention on the quality as opposed to quantity of information going into them to begin with, rather than trying to clean it up afterwards, and on the sequence of operations in preparing the information to begin with (starting with the simplest operation and increasing the complexity); e.g. subset, then merge / join, then form combinations / permutations etc.

There do seem to be some benefits to using base R read and write in some instances. For instance, the error detection within 'fread' is so good it can be difficult trying to get really messy information into R to begin with to clean it up. Base R also seems to be a lot easier if you're using Linux. Base R seems to work fine in Linux, Windows 10 uses ~20GB of disc space whereas Ubuntu only needs a few GB, the RAM needed with Ubuntu is slightly lower. But I've noticed large quantities of warnings and errors when installing third party packages in (L)Ubuntu. I wouldn't recommend drifting too far away from (L)Ubuntu or other stock distributions with Linux as you can loose so much overall compatibility it renders the process almost pointless (I think 'unity' is due to be cancelled in Ubuntu as of 2017). I realise this won't go down well with some Linux users but some of the custom distributions are borderline pointless beyond novelty (I've spent years using Linux alone).

希望其中一些能帮助到其他人。