人们使用什么技巧来管理交互式R会话的可用内存?我使用下面的函数[基于Petr Pikal和David Hinds在2004年发布的r-help列表]来列出(和/或排序)最大的对象,并偶尔rm()其中一些对象。但到目前为止最有效的解决办法是……在64位Linux下运行,有充足的内存。

大家还有什么想分享的妙招吗?请每人寄一份。

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

当前回答

gData包中的llfunction也可以显示每个对象的内存使用情况。

gdata::ll(unit='MB')

其他回答

这是个好把戏。

另一个建议是尽可能使用内存效率高的对象:例如,使用矩阵而不是data.frame。

这并没有真正解决内存管理问题,但是一个不为人所知的重要函数是memory.limit()。可以使用memory.limit(size=2500)命令增加默认值,这里的大小以MB为单位。正如Dirk提到的,为了真正利用这一点,您需要使用64位。

I'm fortunate and my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32bit binary). Thus I can do pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set. Calling gc () "by hand" can help if the size of the data get close to available memory. Sometimes a different algorithm needs much less memory. Sometimes there's a trade off between vectorization and memory use. compare: split & lapply vs. a for loop. For the sake of fast & easy data analysis, I often work first with a small random subset (sample ()) of the data. Once the data analysis script/.Rnw is finished data analysis code and the complete data go to the calculation server for over night / over weekend / ... calculation.

Unfortunately I did not have time to test it extensively but here is a memory tip that I have not seen before. For me the required memory was reduced with more than 50%. When you read stuff into R with for example read.csv they require a certain amount of memory. After this you can save them with save("Destinationfile",list=ls()) The next time you open R you can use load("Destinationfile") Now the memory usage might have decreased. It would be nice if anyone could confirm whether this produces similar results with a different dataset.

我真的很欣赏上面的一些答案,遵循@hadley和@Dirk的建议,关闭R并发布源代码,使用命令行,我想出了一个非常适合我的解决方案。我必须处理数百个质谱仪,每个质谱仪占用大约20 Mb的内存,所以我使用了两个R脚本,如下所示:

首先是包装器:

#!/usr/bin/Rscript --vanilla --default-packages=utils

for(l in 1:length(fdir)) {

   for(k in 1:length(fds)) {
     system(paste("Rscript runConsensus.r", l, k))
   }
}

用这个脚本,我基本上控制我的主脚本做什么运行共识。r,然后写出输出的数据答案。这样,每次包装器调用脚本时,似乎会重新打开R并释放内存。

希望能有所帮助。

如果真的想避免泄漏,应该避免在全局环境中创建任何大对象。

我通常做的是有一个函数来完成这项工作并返回NULL -所有数据都在这个函数或它调用的其他函数中读取和操作。