We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.
My intention is to cover each of these points as clearly as possible from a data.table perspective.
Note: unless explicitly mentioned otherwise, by referring to dplyr we refer to dplyr's data.frame interface, whose internals are implemented in C++ using Rcpp.
The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows to easily optimise operations for speed and, more importantly, memory usage, and also provide some powerful features, all while maintaining consistency in syntax.
1. Speed
Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increases, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) over varying numbers of groups and grouping columns, which also compares pandas. See also the updated benchmarks, which include Spark and pydatatable as well.
On benchmarks, it would be great to also cover these aspects (a rough benchmarking sketch follows the list):
Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.
Benchmark other operations such as update and joins.
Also benchmark memory footprint for each operation in addition to runtime.
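For instance, a rough sketch of what a benchmark for the first point could look like, using the microbenchmark package (the data here is made up purely for illustration):
library(data.table)
library(dplyr)
library(microbenchmark)
set.seed(1L)
N  <- 1e7L
DT <- data.table(x = sample(100L, N, TRUE), y = runif(N), z = sample(1e4L, N, TRUE))
DF <- as.data.frame(DT)
microbenchmark(
  data.table = DT[x > 50L, sum(y), by = z],  # subset + aggregate in a single step
  dplyr = DF %>% filter(x > 50L) %>% group_by(z) %>% summarise(sum(y)),
  times = 10L
)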
2. Memory usage
Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post.
Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.
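A crude way to see the allocations with just base R is sketched below; packages like profmem would give finer, per-allocation detail (the data sizes here are made up):
library(dplyr)
DF <- data.frame(x = runif(1e7), y = runif(1e7))
gc(reset = TRUE)               # reset the "max used" counters
ans <- DF %>% filter(x > 0.5)  # materialises fresh vectors for every column
gc()                           # "max used" now shows the peak since the reset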
The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable).
# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]
But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):
# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))
A concern with this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this post and this one for interesting cases. And we want to keep it.
Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:
foo <- function(DT) {
DT = shallow(DT) ## shallow copy DT
DT[, newcol := 1L] ## does not affect the original DT
DT[x > 2L, newcol := 2L] ## no need to copy (internally), as this column exists only in shallow copied DT
DT[x > 2L, x := 3L] ## have to copy (like base R / dplyr does always); otherwise original DT will
## also get modified.
}
By not using shallow(), the old functionality is retained:
bar <- function(DT) {
DT[, newcol := 1L] ## old behaviour, original DT gets updated by reference
DT[x > 2L, x := 3L] ## old behaviour, update column x in original DT.
}
By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also making sure to copy the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether, while providing the user with both possibilities.
Also, once shallow() is exported dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables.
But it will still lack many features that data.table provides, including (sub)-assignment by reference.
Aggregate while joining:
Suppose you have two data.tables as follows:
DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
# x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
# x y mul
# 1: 1 a 4
# 2: 2 b 3
And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:
1) aggregate DT1 to get sum(z), perform a join and then multiply, (or)
data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]
dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
right_join(DF2) %>% mutate(z = z * mul)
2) do it all in one go (using the by = .EACHI feature):
DT1[DT2, list(z=sum(z) * mul), by = .EACHI]
What is the advantage?
We don't have to allocate memory for the intermediate result.
We don't have to group/hash twice (one for aggregation and other for joining).
And more importantly, the operation we want to perform is clear just by looking at j in (2).
Check this post for a detailed explanation of by = .EACHI. No intermediate results are materialised, and the join+aggregate is performed all in one go.
Have a look at this post, this one and this one for real usage scenarios.
In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient, in terms of memory (which in turn translates to speed).
Update and joins:
Consider the data.table code shown below:
DT1[DT2, col := i.mul]
It adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., without resorting to a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary (see the sketch below).
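For contrast, a rough dplyr approximation is sketched below. It is not an exact equivalent: the join materialises a full copy of DF1 just to append the single column mul:
DF1 <- DF1 %>% left_join(select(DF2, x, y, mul), by = c("x", "y"))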
Check this post for a real usage scenario.
To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!
3. Syntax
Let us now look at the syntax. Hadley commented here:
Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...
I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side by side.
We will work with the dummy data shown below:
DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)
Basic aggregation/update operations.
# case (a)
DT[, sum(y), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y)) ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))
# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x>2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))
# case (c)
DT[, if(any(x > 5L)) y[1L]-y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])
data.table syntax is compact and dplyr's is quite verbose. Things are more or less equivalent in case (a).
In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in the first case get sum(y), whereas in the second case update those rows of y with its cumulative sum.
This is what we mean when we say the DT[i, j, by] form is consistent.
Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value.
While it returns the same result, using filter() here makes the actual operation less obvious.
It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.
Aggregation / update on multiple columns
# case (a)
DT[, lapply(.SD, sum), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum)) ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))
# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))
# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))
In case (a), the code is more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions passed to funs().
data.table's := requires column names to be provided, whereas dplyr generates them automatically.
In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.
In case (c) though, dplyr would return n() once for each column, instead of just once. In data.table, all we need to do is to return a list in j - each element of the list becomes a column in the result. So we can, once again, use the familiar base function c() to concatenate .N to the list that lapply() returns.
Note: once again, in data.table, all we need to do is return a list in j. You can use c(), as.list(), lapply(), list() etc. base functions to accomplish this, without having to learn any new functions.
You will need to learn just the special variables - .N and .SD, at the very least. The equivalents in dplyr are n() and .
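For instance, a quick sketch using the dummy DT from above:
DT[, as.list(range(y)), by = z]        # range() returns two values; they become columns V1 and V2
DT[, c(.N, lapply(.SD, sum)), by = z]  # .N concatenated to per-column sums via plain c()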
Joins
dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.
setkey(DT1, x, y)
# 1. normal join
DT1[DT2] ## data.table syntax
left_join(DT2, DT1) ## dplyr syntax
# 2. select columns while join
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))
# 3. aggregate while join
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)
# 4. update while join
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
??
# 5. rolling join
DT1[DT2, roll = -Inf]
??
# 6. other arguments to control output
DT1[DT2, mult = "first"]
??
Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc.), whereas others might like data.table's DT[i, j, by], or merge(), which is similar to base R.
However dplyr joins do just that. Nothing more. Nothing less.
data.tables can select columns while joining (2); in dplyr you will need to select() on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.
data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?
data.table is capable of rolling joins (5) - roll forward (LOCF), roll backward (NOCB), nearest (a small sketch follows this list).
data.table also has a mult = argument which selects first, last or all matches (6).
data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.
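To make (5) concrete, here is a minimal rolling-join sketch with made-up quotes/trades data, where each trade picks up the most recent quote at or before its timestamp (LOCF):
quotes <- data.table(id = "A", time = c(1L, 5L, 9L), price = c(10, 11, 12), key = c("id", "time"))
trades <- data.table(id = "A", time = c(3L, 7L), key = c("id", "time"))
quotes[trades, roll = TRUE]  # trade at time 3 gets price 10; trade at time 7 gets price 11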
Once again, the syntax is consistent with DT[i, j, by], with additional arguments allowing further control of the output.
do()...
dplyr's summarise() is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(), and you have to know beforehand what all your functions return.
DT[, list(x[1], y[1]), by = z] ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1]) ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))
DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))
DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))
The equivalent of .SD in dplyr is .
In data.table, you can throw pretty much anything in j - the only thing to remember is that it should return a list, so that each element of the list gets converted to a column.
In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are as to whether your function would always return a single value. And it is quite slow.
Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.
Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax...
To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.
data.table also has its quirks (some of which I have pointed out and which we are attempting to fix). We are also trying to improve data.table's joins, as I have highlighted here.
But one should also consider the number of features that dplyr lacks in comparison to data.table.
4. Features
I have pointed out most of the features here and also in this post. In addition:
fread - fast file reader has been available for a long time now.
fwrite - a parallelised fast file writer is now available. See this post for a detailed explanation on the implementation and #1664 for keeping track of further developments.
Automatic indexing - another handy feature to optimise base R syntax as is, internally.
Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not be always desirable.
Numerous advantages in data.table joins (for speed / memory efficiency and syntax) mentioned above.
Non-equi joins: allows joins using other operators <=, <, >, >=, along with all the other advantages of data.table joins (a small sketch follows this list).
Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
setorder() function in data.table that allows really fast reordering of data.tables by reference.
dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal, with an additional all argument (as in SQL).
data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes base functions filter, lag and [ which can cause problems; e.g. here and here.
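As a small illustration of the non-equi join point above (a sketch assuming data.table >= 1.9.8, where on = accepts non-equi operators; the data is made up):
dt  <- data.table(id = 1:3, val = c(5, 10, 15))
rng <- data.table(lo = c(1, 8), hi = c(7, 20))
dt[rng, on = .(val >= lo, val <= hi)]  # for each [lo, hi] range, all rows of dt whose val falls inside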
Finally:
On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It may get bumped up if users would very much like that feature... not sure.
On parallelism - everything is difficult, until someone goes ahead and does it. Of course, it will take effort (being thread safe).
Progress is currently being made (in v1.9.7 devel) towards parallelising known time consuming parts for incremental performance gains using OpenMP.