人们使用什么技巧来管理交互式R会话的可用内存?我使用下面的函数[基于Petr Pikal和David Hinds在2004年发布的r-help列表]来列出(和/或排序)最大的对象,并偶尔rm()其中一些对象。但到目前为止最有效的解决办法是……在64位Linux下运行,有充足的内存。

大家还有什么想分享的妙招吗?请每人寄一份。

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

当前回答

我喜欢Dirk的.ls.objects()脚本,但我总是眯着眼睛数大小列中的字符。所以我做了一些丑陋的hack,使它呈现出漂亮的格式大小:

.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.prettysize <- sapply(obj.size, function(r) prettyNum(r, big.mark = ",") )
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size,obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
        out <- out[c("Type", "PrettySize", "Rows", "Columns")]
        names(out) <- c("Type", "Size", "Rows", "Columns")
    if (head)
        out <- head(out, n)
    out
}

其他回答

Unfortunately I did not have time to test it extensively but here is a memory tip that I have not seen before. For me the required memory was reduced with more than 50%. When you read stuff into R with for example read.csv they require a certain amount of memory. After this you can save them with save("Destinationfile",list=ls()) The next time you open R you can use load("Destinationfile") Now the memory usage might have decreased. It would be nice if anyone could confirm whether this produces similar results with a different dataset.

gData包中的llfunction也可以显示每个对象的内存使用情况。

gdata::ll(unit='MB')

我在推特上看到了这个,觉得德克的功能太棒了!根据JD Long的回答,为了方便用户阅读,我会这样做:

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
                           format(utils::object.size(x), units = "auto") })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Length/Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
    
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

结果如下:

                      Type   Size PrettySize Length/Rows Columns
pca.res                 PCA 790128   771.6 Kb          7      NA
DF               data.frame 271040   264.7 Kb        669      50
factor.AgeGender   factanal  12888    12.6 Kb         12      NA
dates            data.frame   9016     8.8 Kb        669       2
sd.                 numeric   3808     3.7 Kb         51      NA
napply             function   2256     2.2 Kb         NA      NA
lsos               function   1944     1.9 Kb         NA      NA
load               loadings   1768     1.7 Kb         12       2
ind.sup             integer    448  448 bytes        102      NA
x                 character     96   96 bytes          1      NA

注:我补充的主要部分是(再次改编自JD的回答):

obj.prettysize <- napply(names, function(x) {
                           print(object.size(x), units = "auto") })

我使用数据。表方案。使用它的:=运算符,你可以:

通过引用添加列 通过引用修改现有列的子集,通过引用修改组 通过引用删除列

这些操作都不会复制(可能很大的)数据。连一张桌子都没有。

聚合也特别快,因为数据。表使用更少的工作内存。

相关链接:

来自数据的新闻。表,伦敦R展示,2012年 什么时候我应该在data.table中使用:=操作符?

使用knitr和将脚本放在Rmd块中也可以获得一些好处。

我通常将代码划分为不同的块,并选择将检查点保存到缓存或RDS文件中

在那里,你可以设置一个块被保存到“缓存”,或者你可以决定运行或不运行一个特定的块。这样,在第一次运行时,你只能处理“第一部分”,而在另一次执行时,你只能选择“第二部分”,等等。

例子:

part1
```{r corpus, warning=FALSE, cache=TRUE, message=FALSE, eval=TRUE}
corpusTw <- corpus(twitter)  # build the corpus
```
part2
```{r trigrams, warning=FALSE, cache=TRUE, message=FALSE, eval=FALSE}
dfmTw <- dfm(corpusTw, verbose=TRUE, removeTwitter=TRUE, ngrams=3)
```

作为一个副作用,这也可以让你在可重复性方面省去一些麻烦:)