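What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occasionally rm() some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.

Any other nice tricks folks want to share? One per post, please.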

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

Current answer

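If you really want to avoid leaks, you should avoid creating any big objects in the global environment.

What I usually do is have a function that does the job and returns NULL: all data is read and manipulated in this function or in others that it calls.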
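A minimal sketch of that pattern, assuming a hypothetical wrapper called `run_analysis()` and simulated data in place of a real input file. The large objects exist only inside the function's frame, and only a small result is written out (here to a temporary file, purely for illustration):

```r
# Hypothetical wrapper: every large object lives only inside this function.
run_analysis <- function(n = 1e5) {
    big <- data.frame(x = rnorm(n), y = rnorm(n))  # stand-in for reading a large file
    fit <- lm(y ~ x, data = big)
    # Persist only the small result; the temp file is an illustrative choice.
    out.file <- tempfile(fileext = ".rds")
    saveRDS(coef(fit), out.file)
    message("coefficients written to ", out.file)
    invisible(NULL)  # nothing large is handed back to the caller
}

run_analysis()
ls()  # `big` and `fit` were never created in the global environment
```

Once the function returns, `big` and `fit` are no longer reachable and can be reclaimed by the garbage collector.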

Other answers

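That's a good trick.

One other suggestion is to use memory-efficient objects wherever possible: for instance, use a matrix instead of a data.frame.

This doesn't really address memory management, but one important function that isn't widely known is memory.limit(). You can increase the default with the command memory.limit(size=2500), where the size is in MB. Note that this function is Windows-only, and in R 4.2 and later it is a stub that no longer changes anything. As Dirk mentioned, you need to be using 64-bit in order to take real advantage of this.

rm(list=ls()) is a great way to keep you honest and keep things reproducible.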
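As a rough way to check this on your own data, the sketch below stores the same simulated numeric values once as a matrix and once as a data.frame and compares them with object.size(). The variable names and dimensions are made up for illustration, and the gap depends on the data, so it is worth measuring rather than assuming:

```r
n  <- 1e5
m  <- matrix(rnorm(n * 10), ncol = 10)   # homogeneous numeric data
df <- as.data.frame(m)                    # same values, stored as a list of columns

print(object.size(m),  units = "Mb")
print(object.size(df), units = "Mb")

# Difference in bytes (data.frame minus matrix)
as.numeric(object.size(df)) - as.numeric(object.size(m))
```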

Unfortunately I have not had time to test this extensively, but here is a memory tip that I have not seen before. For me, the required memory was reduced by more than 50%. When you read data into R with, for example, read.csv, the resulting objects take up a certain amount of memory. You can then save them with save(list=ls(), file="Destinationfile"). The next time you open R, you can use load("Destinationfile"), and the memory usage may be lower than before. It would be nice if anyone could confirm whether this produces similar results with a different dataset.
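A sketch of that round trip, using small simulated data and temporary file paths as placeholders. Note that save() takes the destination as its named `file` argument. Whether memory use actually drops will depend on the data, so the sketch only prints the size after reloading:

```r
# Illustrative data; both file paths are temporary placeholders.
csv.file   <- tempfile(fileext = ".csv")
rdata.file <- tempfile(fileext = ".RData")
write.csv(data.frame(id = 1:1000, value = rnorm(1000)),
          csv.file, row.names = FALSE)

dat <- read.csv(csv.file)
save(dat, file = rdata.file)   # destination goes in the named `file` argument
rm(dat)
gc()

load(rdata.file)               # restores `dat` under its original name
print(object.size(dat), units = "Kb")
```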

I'm fortunate that my large data sets are saved by the instrument in "chunks" (subsets) of roughly 100 MB (32-bit binary). Thus I can do pre-processing steps (deleting uninformative parts, downsampling) sequentially before fusing the data set.

Calling gc() "by hand" can help if the size of the data gets close to the available memory. Sometimes a different algorithm needs much less memory, and sometimes there is a trade-off between vectorization and memory use: compare split & lapply vs. a for loop.

For the sake of fast and easy data analysis, I often work first with a small random subset (sample()) of the data. Once the data analysis script/.Rnw is finished, the data analysis code and the complete data go to the calculation server for overnight / over-weekend / ... calculation.
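The chunk-wise workflow might look roughly like the sketch below. `read.chunk()` is a hypothetical stand-in for reading one instrument file, and the chunk count, chunk size and downsampling factor are arbitrary illustrative values:

```r
# Hypothetical stand-in for reading one chunk written by the instrument.
read.chunk <- function(i, n = 10000) {
    data.frame(chunk = i, signal = rnorm(n))
}

# Reduce each chunk on its own (here: keep every 10th row) before combining,
# so the full-resolution data never has to sit in memory all at once.
reduced <- vector("list", 5)
for (i in seq_along(reduced)) {
    chunk <- read.chunk(i)
    reduced[[i]] <- chunk[seq(1, nrow(chunk), by = 10), ]
    rm(chunk)
}
gc()  # optional manual collection once the raw chunks are gone
combined <- do.call(rbind, reduced)

# Develop the analysis on a small random subset first.
set.seed(1)
dev.subset <- combined[sample(nrow(combined), 500), ]
dim(dev.subset)
```

The for loop here is the memory-friendly side of the trade-off mentioned above: only one raw chunk is held at a time.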

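Use environments instead of lists to handle collections of objects that occupy a significant amount of working memory.

The reason: each time an element of a list structure is modified, the whole list is temporarily duplicated. This becomes an issue if the storage requirement of the list is about half the available working memory, because data then has to be swapped out to the slow hard disk. Environments, on the other hand, are not subject to this behaviour, and they can be treated similarly to lists.

Here is an example: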

get.data <- function(x)
{
  # get some data based on x
  return(paste("data from",x))
}

collect.data <- function(i,x,env)
{
  # get some data
  data <- get.data(x[[i]])
  # store data into environment
  element.name <- paste("V",i,sep="")
  env[[element.name]] <- data
  return(NULL)  
}

better.list <- new.env()
filenames <- c("file1","file2","file3")
lapply(seq_along(filenames),collect.data,x=filenames,env=better.list)

# read/write access
print(better.list[["V1"]])
better.list[["V2"]] <- "testdata"
# number of list elements
length(ls(better.list))

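In conjunction with structures such as big.matrix or data.table, which allow their content to be altered in place, very efficient memory usage can be achieved.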
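To make the difference from lists concrete, here is a small base-R sketch (the function and variable names are hypothetical). A list passed to a function is modified as a local copy, whereas an environment is passed by reference, so the change is visible to the caller without handing back a modified copy:

```r
# Hypothetical helpers: each tries to add one element to its argument.
add.to.list <- function(lst) {
    lst[["new"]] <- 1      # modifies a local copy only
    invisible(NULL)
}
add.to.env <- function(env) {
    env[["new"]] <- 1      # modifies the caller's environment in place
    invisible(NULL)
}

my.list <- list()
my.env  <- new.env()
add.to.list(my.list)
add.to.env(my.env)

length(my.list)      # 0: the caller's list is unchanged
length(ls(my.env))   # 1: the environment now holds "new"
```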