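I'm having a little trouble understanding the pass-by-reference properties of data.table. Some operations seem to "break" the reference, and I'd like to understand exactly what's happening.

When I create a data.table from another data.table (via <-) and then update the new table with :=, the original table is altered as well. This is expected, as per:

?data.table::copy and the Stack Overflow question pass-by-reference-the-operator-in-data-table-package

Here's an example: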

library(data.table)

DT <- data.table(a=c(1,2), b=c(11,12))
print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

newDT <- DT        # reference, not copy
newDT[1, a := 100] # modify new DT

print(DT)          # DT is modified too.
#        a  b
# [1,] 100 11
# [2,]   2 12

However, if I insert a non-:= modification between the <- assignment and the := line above, DT is no longer modified:

DT = data.table(a=c(1,2), b=c(11,12))
newDT <- DT        
newDT$b[2] <- 200  # new operation
newDT[1, a := 100]

print(DT)
#      a  b
# [1,] 1 11
# [2,] 2 12

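So the newDT$b[2] <- 200 line seems to somehow "break" the reference. I'd guess that it triggers a copy somehow, but I would like to fully understand how R treats these operations, so that I don't introduce potential bugs in my code.

I'd very much appreciate it if someone could explain this to me.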


Current answer

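Yes, it's subassignment in R using <- (or = or ->) that makes a copy of the whole object. You can trace that using tracemem(DT) and .Internal(inspect(DT)), as below. The data.table features := and set() assign by reference to whatever object they are passed. So if that object was previously copied (by a subassigning <- or an explicit copy(DT)), then it's the copy that gets modified by reference.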

DT <- data.table(a = c(1, 2), b = c(11, 12)) 
newDT <- DT 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))   # precisely the same object at this point
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

tracemem(newDT)
# [1] "<0x0000000003b7e2a0>"

newDT$b[2] <- 200
# tracemem[0000000003B7E2A0 -> 00000000040ED948]: 
# tracemem[00000000040ED948 -> 00000000040ED830]: .Call copy $<-.data.table $<- 

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),TR,ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,12
# ATTRIB:  # ..snip..

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,200
# ATTRIB:  # ..snip..

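Notice how even the a vector was copied (a different hex value indicates a new copy of the vector), even though a wasn't changed. Even the whole of b was copied, rather than just changing the elements that need to be changed. That's important to avoid for large data, and it's why := and set() were introduced to data.table.

Now, with our copied newDT, we can modify it by reference: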

newDT
#      a   b
# [1,] 1  11
# [2,] 2 200

newDT[2, b := 400]
#      a   b        # See FAQ 2.21 for why this prints newDT
# [1,] 1  11
# [2,] 2 400

.Internal(inspect(newDT))
# @0000000003D97A58 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040ED7F8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040ED8D8 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,400
# ATTRIB:  # ..snip ..

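Notice that all 3 hex values (the vector of column pointers, and each of the 2 columns) remain unchanged. So it was truly modified by reference, with no copies at all.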
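The same comparison can be sketched without .Internal(inspect()) by using data.table's address() function, which returns the memory address of an object. This is only a minimal illustration of the behaviour described above, and the variable names same_before, same_after, addr_b and b_in_place are made up for the example:

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
newDT <- DT                                   # second name bound to the same object
same_before <- address(DT) == address(newDT)  # TRUE: nothing has been copied yet

newDT$b[2] <- 200                             # base-style sub-assignment
same_after <- address(DT) == address(newDT)   # FALSE: newDT is now a separate copy

addr_b <- address(newDT$b)                    # remember where column b lives
newDT[2, b := 400]                            # update by reference
b_in_place <- address(newDT$b) == addr_b      # TRUE when b was updated in place

DT$b     # 11 12   (original untouched)
newDT$b  # 11 400
```

Comparing addresses this way is a quick sanity check to run before relying on by-reference behaviour in longer code.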

Or, we can modify the original DT by reference:

DT[2, b := 600]
#      a   b
# [1,] 1  11
# [2,] 2 600

.Internal(inspect(DT))
# @0000000003B7E2A0 19 VECSXP g0c7 [OBJ,NAM(2),ATT] (len=2, tl=100)
#   @00000000040C2288 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 1,2
#   @00000000040C2250 14 REALSXP g0c2 [NAM(2)] (len=2, tl=0) 11,600
#   ATTRIB:  # ..snip..

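Those hex values are the same as the original values we saw for DT above. Type example(copy) for more examples using tracemem and a comparison with data.frame.

By the way, if you tracemem(DT) and then run DT[2,b:=600], you'll see one copy reported. That is a copy of the first 10 rows made by the print method. When the call is wrapped in invisible(), or made within a function or script, the print method isn't called.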

All this applies inside functions too; i.e., := and set() do not copy on write, even within functions. If you need to modify a local copy, then call x=copy(x) at the start of the function. But, remember data.table is for large data (as well as faster programming advantages for small data). We deliberately don't want to copy large objects (ever). As a result we don't need to allow for the usual 3* working memory factor rule of thumb. We try to only need working memory as large as one column (i.e. a working memory factor of 1/ncol rather than 3).
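As a small sketch of the "copy at the start of the function" advice, the following hypothetical helper add_flag() (the function name and the flag column are invented for illustration) takes a local copy first, so the caller's table is left alone:

```r
library(data.table)

# add_flag() is a hypothetical helper, used only for illustration
add_flag <- function(x) {
  x <- copy(x)          # deep copy: later := calls touch only the local table
  x[, flag := a > 1]    # adds a column by reference, but to the copy
  x
}

DT <- data.table(a = c(1, 2), b = c(11, 12))
res <- add_flag(DT)

names(DT)   # "a" "b"          (the caller's table is unchanged)
names(res)  # "a" "b" "flag"
```

Without the copy(x) line, the same call would add flag to DT itself, which is the behaviour the paragraph above warns about.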

Other answers

Just a quick sum up.

<- with data.table is just like base; i.e., no copy is taken until a subassign is done afterwards with <- (such as changing the column names or changing an element such as DT[i,j]<-v). Then it takes a copy of the whole object just like base. That's known as copy-on-write. Would be better known as copy-on-subassign, I think! It DOES NOT copy when you use the special := operator, or the set* functions provided by data.table. If you have large data you probably want to use them instead. := and set* will NOT COPY the data.table, EVEN WITHIN FUNCTIONS.

Given the following example data:

DT <- data.table(a=c(1,2), b=c(11,12))

The following just "binds" another name, DT2, to the same data object currently bound to the name DT:

DT2 <- DT

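This never copies, and it never copies in base R either. It just marks the data object so that R knows that two different names (DT2 and DT) point to the same object. R will therefore need to copy the object if either name is subassigned to afterwards.

That's perfect for data.table, too. The := operator isn't for doing that. So the following is a deliberate error, as := isn't for just binding object names: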

DT2 := DT    # not what := is for, not defined, gives a nice error

:= is for subassigning by reference. But you don't use it like you would in base:

DT[3,"foo"] := newvalue    # not like this

You use it like this:

DT[3,foo:=newvalue]    # like this

That changed DT by reference. Say you add a new column, new, by reference to the data object; there is no need to do this:

DT <- DT[,new:=1L]

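because the RHS already changed DT by reference. The extra DT <- misunderstands what := does. You can write it there, but it's superfluous.

DT is changed by reference, by :=, EVEN WITHIN FUNCTIONS: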

f <- function(X){
    X[,new2:=2L]
    return("something else")
}
f(DT)   # will change DT

DT2 <- DT
f(DT)   # will change both DT and DT2 (they're the same data object)

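data.table is for large datasets, remember. If you have a 20GB data.table in memory, then you need a way to do this. It's a very deliberate design decision of data.table.

Copies can be made, of course. You just need to tell data.table that you're sure you want to copy your 20GB dataset, by using the copy() function: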

DT3 <- copy(DT)   # rather than DT3 <- DT
DT3[,new3:=3L]     # now, this just changes DT3 because it's a copy, not DT too.

To avoid copies, don't use base-style assignment or update:

DT$new4 <- 1L                 # will make a copy so use :=
attr(DT,"sorted") <- "a"      # will make a copy, so use setattr()
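
For comparison, here is a sketch of by-reference counterparts to a few base-style updates, with address() used to check that the table is never reallocated. The attribute name "note" and the new column name id are arbitrary choices for the example:

```r
library(data.table)

DT <- data.table(a = c(1, 2), b = c(11, 12))
addr <- address(DT)                    # address before any update

DT[, new4 := 1L]                       # instead of DT$new4 <- 1L
set(DT, i = 2L, j = "b", value = 200)  # instead of DT$b[2] <- 200
setnames(DT, "a", "id")                # instead of names(DT)[1] <- "id"
setattr(DT, "note", "demo")            # instead of attr(DT, "note") <- "demo"

address(DT) == addr   # TRUE: every step above worked on the same object
names(DT)             # "id" "b" "new4"
```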

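If you want to be sure that you are updating by reference, use .Internal(inspect(x)) and look at the memory address values of the constituents (see Matthew Dowle's answer).

Writing := in j like that lets you subassign by reference by group. You can add a new column by reference by group. So that's why := is done that way inside [...]: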

DT[, newcol:=mean(x), by=group]
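
A concrete run of that pattern, on made-up data (the columns group and x and their values are assumptions for the example), might look like this:

```r
library(data.table)

# group and x are made-up example columns
DT <- data.table(group = c("a", "a", "b"), x = c(1, 3, 10))

# The group mean is computed once per group and recycled to every row of that group
DT[, newcol := mean(x), by = group]

DT$newcol  # 2 2 10
```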
