人们使用什么技巧来管理交互式R会话的可用内存?我使用下面的函数[基于Petr Pikal和David Hinds在2004年发布的r-help列表]来列出(和/或排序)最大的对象,并偶尔rm()其中一些对象。但到目前为止最有效的解决办法是……在64位Linux下运行,有充足的内存。

大家还有什么想分享的妙招吗?请每人寄一份。

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

当前回答

我真的很欣赏上面的一些答案,遵循@hadley和@Dirk的建议,关闭R并发布源代码,使用命令行,我想出了一个非常适合我的解决方案。我必须处理数百个质谱仪,每个质谱仪占用大约20 Mb的内存,所以我使用了两个R脚本,如下所示:

首先是包装器:

#!/usr/bin/Rscript --vanilla --default-packages=utils

for(l in 1:length(fdir)) {

   for(k in 1:length(fds)) {
     system(paste("Rscript runConsensus.r", l, k))
   }
}

用这个脚本,我基本上控制我的主脚本做什么运行共识。r,然后写出输出的数据答案。这样,每次包装器调用脚本时,似乎会重新打开R并释放内存。

希望能有所帮助。

其他回答

我非常喜欢Dirk开发的改进的对象函数。不过,大多数时候,一个包含对象名称和大小的更基本的输出对我来说就足够了。这是一个具有类似目标的简单函数。内存使用可以按字母顺序或大小排序,可以限制为一定数量的对象,并且可以按升序或降序排序。此外,我经常处理1GB以上的数据,因此该函数相应地改变单位。

showMemoryUse <- function(sort="size", decreasing=FALSE, limit) {

  objectList <- ls(parent.frame())

  oneKB <- 1024
  oneMB <- 1048576
  oneGB <- 1073741824

  memoryUse <- sapply(objectList, function(x) as.numeric(object.size(eval(parse(text=x)))))

  memListing <- sapply(memoryUse, function(size) {
        if (size >= oneGB) return(paste(round(size/oneGB,2), "GB"))
        else if (size >= oneMB) return(paste(round(size/oneMB,2), "MB"))
        else if (size >= oneKB) return(paste(round(size/oneKB,2), "kB"))
        else return(paste(size, "bytes"))
      })

  memListing <- data.frame(objectName=names(memListing),memorySize=memListing,row.names=NULL)

  if (sort=="alphabetical") memListing <- memListing[order(memListing$objectName,decreasing=decreasing),] 
  else memListing <- memListing[order(memoryUse,decreasing=decreasing),] #will run if sort not specified or "size"

  if(!missing(limit)) memListing <- memListing[1:limit,]

  print(memListing, row.names=FALSE)
  return(invisible(memListing))
}

下面是一些输出示例:

> showMemoryUse(decreasing=TRUE, limit=5)
      objectName memorySize
       coherData  713.75 MB
 spec.pgram_mine  149.63 kB
       stoch.reg  145.88 kB
      describeBy    82.5 kB
      lmBandpass   68.41 kB

For both speed and memory purposes, when building a large data frame via some complex series of steps, I'll periodically flush it (the in-progress data set being built) to disk, appending to anything that came before, and then restart it. This way the intermediate steps are only working on smallish data frames (which is good as, e.g., rbind slows down considerably with larger objects). The entire data set can be read back in at the end of the process, when all the intermediate objects have been removed.

dfinal <- NULL
first <- TRUE
tempfile <- "dfinal_temp.csv"
for( i in bigloop ) {
    if( !i %% 10000 ) { 
        print( i, "; flushing to disk..." )
        write.table( dfinal, file=tempfile, append=!first, col.names=first )
        first <- FALSE
        dfinal <- NULL   # nuke it
    }

    # ... complex operations here that add data to 'dfinal' data frame  
}
print( "Loop done; flushing to disk and re-reading entire data set..." )
write.table( dfinal, file=tempfile, append=TRUE, col.names=FALSE )
dfinal <- read.table( tempfile )

这是对这个优秀的老问题的一个新的回答。来自哈德利的高级R:

install.packages("pryr")

library(pryr)

object_size(1:10)
## 88 B

object_size(mean)
## 832 B

object_size(mtcars)
## 6.74 kB

(http://adv-r.had.co.nz/memory.html)

我喜欢Dirk的.ls.objects()脚本,但我总是眯着眼睛数大小列中的字符。所以我做了一些丑陋的hack,使它呈现出漂亮的格式大小:

.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.prettysize <- sapply(obj.size, function(r) prettyNum(r, big.mark = ",") )
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size,obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
        out <- out[c("Type", "PrettySize", "Rows", "Columns")]
        names(out) <- c("Type", "Size", "Rows", "Columns")
    if (head)
        out <- head(out, n)
    out
}

这是个好把戏。

另一个建议是尽可能使用内存效率高的对象:例如,使用矩阵而不是data.frame。

这并没有真正解决内存管理问题,但是一个不为人所知的重要函数是memory.limit()。可以使用memory.limit(size=2500)命令增加默认值,这里的大小以MB为单位。正如Dirk提到的,为了真正利用这一点,您需要使用64位。