What tricks do people use to manage the available memory of an interactive R session? I use the functions below [based on postings by Petr Pikal and David Hinds to the r-help list in 2004] to list (and/or sort) the largest objects and to occasionally rm() some of them. But by far the most effective solution was ... to run under 64-bit Linux with ample memory.

Any other nice tricks folks want to share? One per post, please.

# improved list of objects
.ls.objects <- function (pos = 1, pattern, order.by,
                        decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                                         fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                        as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.dim)
    names(out) <- c("Type", "Size", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}
# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}
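A quick usage sketch of the two helpers above (the object bigMatrix is purely illustrative):

bigMatrix <- matrix(rnorm(1e6), ncol = 1000)   # illustrative large object

# ten largest objects in the global environment
lsos()

# full listing, ordered by type instead of size
.ls.objects(order.by = "Type")

# once the main offender is identified, remove it and reclaim the memory
rm(bigMatrix)
gc()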

Current answer

I really like the improved objects listing that Dirk developed. Much of the time, though, a more basic output with just the object name and size is enough for me. Here is a simpler function with a similar aim. Memory use can be sorted alphabetically or by size, can be limited to a certain number of objects, and can be ordered ascending or descending. Also, I often work with data that are 1GB+, so the function changes units accordingly.

showMemoryUse <- function(sort="size", decreasing=FALSE, limit) {

  objectList <- ls(parent.frame())

  oneKB <- 1024
  oneMB <- 1048576
  oneGB <- 1073741824

  memoryUse <- sapply(objectList, function(x) as.numeric(object.size(eval(parse(text=x)))))

  memListing <- sapply(memoryUse, function(size) {
        if (size >= oneGB) return(paste(round(size/oneGB,2), "GB"))
        else if (size >= oneMB) return(paste(round(size/oneMB,2), "MB"))
        else if (size >= oneKB) return(paste(round(size/oneKB,2), "kB"))
        else return(paste(size, "bytes"))
      })

  memListing <- data.frame(objectName=names(memListing),memorySize=memListing,row.names=NULL)

  if (sort=="alphabetical") memListing <- memListing[order(memListing$objectName,decreasing=decreasing),] 
  else memListing <- memListing[order(memoryUse,decreasing=decreasing),] #will run if sort not specified or "size"

  if(!missing(limit)) memListing <- memListing[1:limit,]

  print(memListing, row.names=FALSE)
  return(invisible(memListing))
}

Here is some example output:

> showMemoryUse(decreasing=TRUE, limit=5)
      objectName memorySize
       coherData  713.75 MB
 spec.pgram_mine  149.63 kB
       stoch.reg  145.88 kB
      describeBy    82.5 kB
      lmBandpass   68.41 kB

Other answers

Here is a newer answer to this excellent old question, from Hadley's Advanced R:

install.packages("pryr")

library(pryr)

object_size(1:10)
## 88 B

object_size(mean)
## 832 B

object_size(mtcars)
## 6.74 kB

(http://adv-r.had.co.nz/memory.html)

With only 4GB of RAM (running Windows 10, so realistically about 2GB, or even 1GB), I have had to be really careful with allocation.

I use data.table almost exclusively.

The 'fread' function lets you subset the information by field name at import time; only import the fields that are actually needed to begin with. If you are using base R read, null out the spurious columns immediately after the import.

As 42- suggests, wherever possible I then subset within the columns immediately after importing the information.

I frequently rm() objects from the environment as soon as they are no longer needed, e.g. on the line after using them to subset something else, and then call gc().

'fread' and 'fwrite' from data.table can be very fast by comparison with base R reads and writes.
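A minimal sketch putting those steps together (the file name and column names are made up for illustration):

library(data.table)

# import only the fields actually needed to begin with
DT <- fread("big_file.csv", select = c("id", "date", "value"))

# subset rows immediately after import
DT <- DT[value > 0]

# drop intermediates as soon as they have served their purpose, then collect
totals <- DT[, .(total = sum(value)), by = id]
rm(DT)
gc()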

As kpierce8 suggests, I almost always fwrite everything out of the environment and fread it back in, even with thousands / hundreds of thousands of tiny files to get through. This not only keeps the environment 'clean' and keeps the memory allocation low, it also guards against crashes: possibly due to the severe lack of RAM available, R has a propensity for crashing on my computer, really frequently. Having the information backed up on the drive itself as the code progresses through various stages means I don't have to start right from the beginning if it crashes.
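A sketch of that checkpoint pattern (the stage object and file name are stand-ins):

library(data.table)

stage1 <- data.table(id = 1:5, total = runif(5))   # stand-in for a real intermediate result

# write the stage to disk and free the memory it was using
fwrite(stage1, "stage1.csv")
rm(stage1)
gc()

# fread it back in when the next stage needs it; if R crashes, the run
# can resume from the last stage written out rather than from scratch
stage1 <- fread("stage1.csv")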

As of 2017, I think the fastest SSDs are running around a few GB per second through the M2 port. I have a really basic 50GB Kingston V300 (550MB/s) SSD that I use as my primary disk (has Windows and R on it). I keep all the bulk information on a cheap 500GB WD platter. I move the data sets to the SSD when I start working on them. This, combined with 'fread'ing and 'fwrite'ing everything has been working out great. I've tried using 'ff' but prefer the former. 4K read/write speeds can create issues with this though; backing up a quarter of a million 1k files (250MBs worth) from the SSD to the platter can take hours. As far as I'm aware, there isn't any R package available yet that can automatically optimise the 'chunkification' process; e.g. look at how much RAM a user has, test the read/write speeds of the RAM / all the drives connected and then suggest an optimal 'chunkification' protocol. This could produce some significant workflow improvements / resource optimisations; e.g. split it to ... MB for the ram -> split it to ... MB for the SSD -> split it to ... MB on the platter -> split it to ... MB on the tape. It could sample data sets beforehand to give it a more realistic gauge stick to work from.

A lot of the problems I've worked on in R involve forming combination and permutation pairs, triples etc., which only makes having limited RAM more of a limitation, as they will often expand at least exponentially at some point. This has made me focus a lot of attention on the quality as opposed to the quantity of information going into them to begin with, rather than trying to clean it up afterwards, and on the sequence of operations in preparing the information to begin with (starting with the simplest operation and increasing the complexity); e.g. subset, then merge / join, then form combinations / permutations etc.
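A rough sketch of what that ordering can look like with data.table (the file and column names are invented for illustration):

library(data.table)

# 1. subset first, so only the relevant rows ever reach the expensive steps
a <- fread("a.csv", select = c("key", "x"))[x > 0]
b <- fread("b.csv", select = c("key", "y"))[y > 0]

# 2. then merge/join the reduced tables
ab <- merge(a, b, by = "key")

# 3. only then form the (potentially exploding) pairs
pairs <- CJ(i = ab$key, j = ab$key)[i < j]   # data.table cross join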

There do seem to be some benefits to using base R read and write in some instances. For instance, the error detection within 'fread' is so good that it can be difficult to get really messy information into R in the first place in order to clean it up. Base R also seems to be a lot easier if you're using Linux. Base R seems to work fine in Linux; Windows 10 uses ~20GB of disc space whereas Ubuntu only needs a few GB, and the RAM needed with Ubuntu is slightly lower. But I've noticed large quantities of warnings and errors when installing third-party packages in (L)Ubuntu. I wouldn't recommend drifting too far away from (L)Ubuntu or other stock distributions with Linux, as you can lose so much overall compatibility that it renders the process almost pointless (I think 'unity' is due to be cancelled in Ubuntu as of 2017). I realise this won't go down well with some Linux users, but some of the custom distributions are borderline pointless beyond novelty (I've spent years using Linux alone).

Hopefully some of that helps others.

To further illustrate the common strategy of frequent restarts, we can use littler, which lets us run simple expressions directly from the command line. Here is an example I sometimes use to time different BLAS libraries for a simple crossprod.

 r -e'N<-3*10^3; M<-matrix(rnorm(N*N),ncol=N); print(system.time(crossprod(M)))'

Similarly,

 r -lMatrix -e'example(spMatrix)'

loads the Matrix package (via the --packages | -l switch) and runs the example for the spMatrix function. Since it always starts "fresh", this method is also a good test during package development.

Last but not least, r also works great in scripts using the '#!/usr/bin/r' shebang header. Rscript is an alternative where littler is unavailable (e.g. on Windows).
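A minimal littler script of that kind, reusing the crossprod timing from above:

#!/usr/bin/r
# time a simple crossprod and exit; every run starts from a fresh session
N <- 3 * 10^3
M <- matrix(rnorm(N * N), ncol = N)
print(system.time(crossprod(M)))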

If you are working on Linux and want to use several processes, and only need to perform read operations on one or more large objects, use makeForkCluster instead of makePSOCKcluster. This also saves you the time spent sending the large object to the other processes.
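A sketch of the difference (Linux only; bigObj and the per-task function are placeholders):

library(parallel)

bigObj <- matrix(rnorm(1e7), ncol = 100)   # stand-in for a large read-only object

# forked workers share the parent's memory pages (copy-on-write), so bigObj
# is not serialised and shipped to each worker as it would be with PSOCK
cl <- makeForkCluster(4)
res <- parLapply(cl, seq_len(ncol(bigObj)), function(i) sum(bigObj[, i]))
stopCluster(cl)

# with makePSOCKcluster you would additionally need
# clusterExport(cl, "bigObj"), copying the object into every worker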

Unfortunately I did not have time to test it extensively, but here is a memory tip that I have not seen before. For me the required memory was reduced by more than 50%. When you read things into R with, for example, read.csv, they require a certain amount of memory. After this you can save them with save(list=ls(), file="Destinationfile.RData"). The next time you open R you can use load("Destinationfile.RData"). Now the memory usage might have decreased. It would be nice if anyone could confirm whether this produces similar results with a different dataset.
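A sketch of that tip (the file names are illustrative):

# first session: read the raw data, then snapshot the whole workspace
mydata <- read.csv("big_raw_file.csv")
save(list = ls(), file = "Destinationfile.RData")

# a later, fresh session: restore the snapshot instead of re-reading the csv;
# memory usage after load() may well be lower than after read.csv()
load("Destinationfile.RData")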