人们使用什么技巧来管理交互式R会话的可用内存?我使用下面的函数[基于Petr Pikal和David Hinds在2004年发布的r-help列表]来列出(和/或排序)最大的对象,并偶尔rm()其中一些对象。但到目前为止最有效的解决办法是……在64位Linux下运行,有充足的内存。

大家还有什么想分享的妙招吗?请每人寄一份。

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

当前回答

我在推特上看到了这个,觉得德克的功能太棒了!根据JD Long的回答,为了方便用户阅读,我会这样做:

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
                           format(utils::object.size(x), units = "auto") })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Length/Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
    
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

结果如下:

                      Type   Size PrettySize Length/Rows Columns
pca.res                 PCA 790128   771.6 Kb          7      NA
DF               data.frame 271040   264.7 Kb        669      50
factor.AgeGender   factanal  12888    12.6 Kb         12      NA
dates            data.frame   9016     8.8 Kb        669       2
sd.                 numeric   3808     3.7 Kb         51      NA
napply             function   2256     2.2 Kb         NA      NA
lsos               function   1944     1.9 Kb         NA      NA
load               loadings   1768     1.7 Kb         12       2
ind.sup             integer    448  448 bytes        102      NA
x                 character     96   96 bytes          1      NA

注:我补充的主要部分是(再次改编自JD的回答):

obj.prettysize <- napply(names, function(x) {
                           print(object.size(x), units = "auto") })

其他回答

我真的很欣赏上面的一些答案,遵循@hadley和@Dirk的建议,关闭R并发布源代码,使用命令行,我想出了一个非常适合我的解决方案。我必须处理数百个质谱仪,每个质谱仪占用大约20 Mb的内存,所以我使用了两个R脚本,如下所示:

首先是包装器:

#!/usr/bin/Rscript --vanilla --default-packages=utils

for(l in 1:length(fdir)) {

   for(k in 1:length(fds)) {
     system(paste("Rscript runConsensus.r", l, k))
   }
}

用这个脚本,我基本上控制我的主脚本做什么运行共识。r,然后写出输出的数据答案。这样,每次包装器调用脚本时,似乎会重新打开R并释放内存。

希望能有所帮助。

Tip for dealing with objects requiring heavy intermediate calculation: When using objects that require a lot of heavy calculation and intermediate steps to create, I often find it useful to write a chunk of code with the function to create the object, and then a separate chunk of code that gives me the option either to generate and save the object as an rmd file, or load it externally from an rmd file I have already previously saved. This is especially easy to do in R Markdown using the following code-chunk structure.

```{r Create OBJECT}

COMPLICATED.FUNCTION <- function(...) { Do heavy calculations needing lots of memory;
                                        Output OBJECT; }

```
```{r Generate or load OBJECT}

LOAD <- TRUE
SAVE <- TRUE
#NOTE: Set LOAD to TRUE if you want to load saved file
#NOTE: Set LOAD to FALSE if you want to generate the object from scratch
#NOTE: Set SAVE to TRUE if you want to save the object externally

if(LOAD) { 
  OBJECT <- readRDS(file = 'MySavedObject.rds') 
} else {
  OBJECT <- COMPLICATED.FUNCTION(x, y, z)
  if (SAVE) { saveRDS(file = 'MySavedObject.rds', object = OBJECT) } }

```

With this code structure, all I need to do is to change LOAD depending on whether I want to generate the object, or load it directly from an existing saved file. (Of course, I have to generate it and save it the first time, but after this I have the option of loading it.) Setting LOAD <- TRUE bypasses use of my complicated function and avoids all of the heavy computation therein. This method still requires enough memory to store the object of interest, but it saves you from having to calculate it each time you run your code. For objects that require a lot of heavy calculation of intermediate steps (e.g., for calculations involving loops over large arrays) this can save a substantial amount of time and computation.

为了进一步说明频繁重启的常见策略,我们可以使用littler,它允许我们直接从命令行运行简单的表达式。这里有一个例子,我有时会用不同的BLAS为一个简单的交叉刺计时。

 r -e'N<-3*10^3; M<-matrix(rnorm(N*N),ncol=N); print(system.time(crossprod(M)))'

同样的,

 r -lMatrix -e'example(spMatrix)'

加载Matrix包(通过——packages | -l开关)并运行spMatrix函数的示例。由于总是“新鲜”开始,这个方法在包开发过程中也是一个很好的测试。

最后但并非最不重要的是,r在脚本中使用'#!/usr/bin/r shebang-header。Rscript是little不可用的替代方案(例如在Windows上)。

这并没有增加上面的内容,而是以我喜欢的简单和大量注释的风格编写的。它生成一个对象大小排序表,但没有上面例子中给出的一些细节:

#Find the objects       
MemoryObjects = ls()    
#Create an array
MemoryAssessmentTable=array(NA,dim=c(length(MemoryObjects),2))
#Name the columns
colnames(MemoryAssessmentTable)=c("object","bytes")
#Define the first column as the objects
MemoryAssessmentTable[,1]=MemoryObjects
#Define a function to determine size        
MemoryAssessmentFunction=function(x){object.size(get(x))}
#Apply the function to the objects
MemoryAssessmentTable[,2]=t(t(sapply(MemoryAssessmentTable[,1],MemoryAssessmentFunction)))
#Produce a table with the largest objects first
noquote(MemoryAssessmentTable[rev(order(as.numeric(MemoryAssessmentTable[,2]))),])

如果真的想避免泄漏,应该避免在全局环境中创建任何大对象。

我通常做的是有一个函数来完成这项工作并返回NULL -所有数据都在这个函数或它调用的其他函数中读取和操作。