Overview

I'm fairly familiar with data.table, not so much with dplyr. I've read through some dplyr vignettes and examples that have popped up on SO, and so far my conclusions are:

1. data.table and dplyr are comparable in speed, except when there are many (e.g. >10-100K) groups, and in some other circumstances (see benchmarks below)
2. dplyr has more accessible syntax
3. dplyr abstracts (or will) potential DB interactions
4. There are some minor functionality differences (see "Examples / Usage" below)

In my mind 2. doesn't bear much weight because I am fairly familiar with data.table, though I understand that for users new to both it will be a big factor. I would like to avoid an argument about which is more intuitive, as that is irrelevant for my specific question asked from the perspective of someone already familiar with data.table. I also would like to avoid a discussion about how "more intuitive" leads to faster analysis (certainly true, but again, not what I'm most interested in here).

Question

What I want to know is:

1. Are there analytical tasks that are a lot easier to code with one package or the other for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?
2. Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. the other?

One recent SO question got me thinking about this a bit more, because up until that point I didn't think dplyr would offer much beyond what I can already do in data.table. Here is the dplyr solution (data at end of Q):

dat %.%
  group_by(name, job) %.%
  filter(job != "Boss" | year == min(year)) %.%
  mutate(cumu_job2 = cumsum(job2))

Which was much better than my hack attempt at a data.table solution. That said, good data.table solutions are also pretty good (thanks Jean-Robert, Arun, and note here I favored a single statement over the strictly most optimal solution):

setDT(dat)[,
  .SD[job != "Boss" | year == min(year)][, cumjob := cumsum(job2)], 
  by=list(id, job)
]

The syntax for the latter may seem very esoteric, but it actually is pretty straightforward if you're used to data.table (i.e. doesn't use some of the more esoteric tricks).

Ideally what I'd like to see are some good examples where the dplyr or data.table way is substantially more concise or performs substantially better.

Examples

Usage

- dplyr does not allow grouped operations that return an arbitrary number of rows (from eddi's question; note: this looks like it will be implemented in dplyr 0.5, and @beginneR shows a potential work-around using do in the answer to @eddi's question).
- data.table supports rolling joins (thanks @dholstius) as well as overlap joins.
- data.table internally optimises expressions of the form DT[col == value] or DT[col %in% values] for speed through automatic indexing, which uses binary search while keeping the same base R syntax. See here for some more details and a tiny benchmark; a small hedged sketch follows this list.
- dplyr offers standard evaluation versions of functions (e.g. regroup, summarize_each_) that can simplify the programmatic use of dplyr (note: programmatic use of data.table is definitely possible, it just requires some careful thought, substitution/quoting, etc., at least to my knowledge).
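As a hedged illustration of the auto-indexing point (the toy data and the indices() check are mine, not from the linked benchmark), the optimisation kicks in transparently on plain base R syntax:

library(data.table)
set.seed(1)
DT <- data.table(col = sample(1e6L), val = rnorm(1e6))

DT[col == 100L]  # first call builds a secondary index on 'col' automatically
DT[col == 200L]  # subsequent calls reuse it, subsetting via binary search
indices(DT)      # "col" -- shows the auto-created index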

Benchmarks

- I ran my own benchmarks and found both packages to be comparable in "split apply combine" style analysis, except when there are very large numbers of groups (>100K), at which point data.table becomes substantially faster.
- @Arun ran some benchmarks on joins, showing that data.table scales better than dplyr as the number of groups increases (updated with recent enhancements in both packages and a recent version of R).
- Also, a benchmark when trying to get unique values has data.table ~6x faster.
- (Unverified) has data.table 75% faster on larger versions of a group/apply/sort, while dplyr was 40% faster on the smaller ones (another SO question from comments, thanks danas).
- Matt, the main author of data.table, has benchmarked grouping operations on data.table, dplyr and python pandas on up to 2 billion rows (~100GB in RAM).
- An older benchmark on 80K groups has data.table ~8x faster.

Data

This is for the first example I showed in the question section.

dat <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L), name = c("Jane", "Jane", "Jane", "Jane", 
"Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", 
"Bob", "Bob", "Bob"), year = c(1980L, 1981L, 1982L, 1983L, 1984L, 
1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 
1991L, 1992L), job = c("Manager", "Manager", "Manager", "Manager", 
"Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", 
"Boss", "Boss", "Boss", "Boss", "Boss"), job2 = c(1L, 1L, 1L, 
1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)), .Names = c("id", 
"name", "year", "job", "job2"), class = "data.frame", row.names = c(NA, 
-16L))

Current Answer

We need to cover at least these aspects to provide a comprehensive answer/comparison (in no particular order of importance): Speed, Memory usage, Syntax and Features.

My intent is to cover each one of these as clearly as possible from a data.table perspective.

Note: unless explicitly mentioned otherwise, by referring to dplyr, we refer to dplyr's data.frame interface whose internals are in C++ using Rcpp.


The data.table syntax is consistent in its form - DT[i, j, by]. Keeping i, j and by together is by design. By keeping related operations together, it allows to easily optimise operations for speed and, more importantly, memory usage, and also provides some powerful features, all while maintaining consistency in syntax.
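A minimal sketch of that general form (toy data of my own, not from the answer): read DT[i, j, by] as "take DT, subset rows using i, then compute j, grouped by by":

library(data.table)
DT <- data.table(x = 1:6, y = (1:6) * 2, z = rep(c("a", "b"), 3))
DT[x > 1, sum(y), by = z]  # i: rows where x > 1;  j: sum(y);  by: each group of z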

1. Speed

Quite a few benchmarks (though mostly on grouping operations) have been added to the question already, showing that data.table gets faster than dplyr as the number of groups and/or rows to group by increase, including benchmarks by Matt on grouping from 10 million to 2 billion rows (100GB in RAM) on 100 to 10 million groups and varying grouping columns, which also compares pandas. See also the updated benchmarks, which include Spark and pydatatable as well.

On benchmarks, it would be great to also cover these remaining aspects:

- Grouping operations involving a subset of rows - i.e., DT[x > val, sum(y), by = z] type operations.
- Benchmark other operations such as update and joins.
- Benchmark the memory footprint of each operation in addition to runtime (a hedged sketch of such a benchmark follows this list).
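A sketch of how one might run such a benchmark oneself (illustrative only; the sizes, seed and the choice of the microbenchmark package are my assumptions, not part of the answer):

library(data.table)
library(dplyr)
library(microbenchmark)

set.seed(42)
N  <- 2e6L
DT <- data.table(x = runif(N), y = rnorm(N), z = sample(1e5L, N, TRUE))
DF <- as.data.frame(DT)

# grouping that involves a row subset, timed for both packages
microbenchmark(
  data.table = DT[x > 0.5, sum(y), by = z],
  dplyr      = DF %>% filter(x > 0.5) %>% group_by(z) %>% summarise(sum(y)),
  times = 10L
)

For the memory-footprint aspect, something like bench::mark(), which reports allocations alongside timings, could be substituted.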

2. Memory usage

Operations involving filter() or slice() in dplyr can be memory inefficient (on both data.frames and data.tables). See this post. Note that Hadley's comment talks about speed (that dplyr is plenty fast for him), whereas the major concern here is memory.

The data.table interface at the moment allows one to modify/update columns by reference (note that we don't need to re-assign the result back to a variable):

# sub-assign by reference, updates 'y' in-place
DT[x >= 1L, y := NA]

But dplyr will never update by reference. The dplyr equivalent would be (note that the result needs to be re-assigned):

# copies the entire 'y' column
ans <- DF %>% mutate(y = replace(y, which(x >= 1L), NA))

A concern for this is referential transparency. Updating a data.table object by reference, especially within a function, may not always be desirable. But this is an incredibly useful feature: see this and this posts for interesting cases. And we want to keep it.

Therefore we are working towards exporting a shallow() function in data.table that will provide the user with both possibilities. For example, if it is desirable to not modify the input data.table within a function, one can then do:

foo <- function(DT) {
    DT = shallow(DT)          ## shallow copy DT
    DT[, newcol := 1L]        ## does not affect the original DT
    DT[x > 2L, newcol := 2L]  ## no need to copy (internally), as this column exists only in shallow copied DT
    DT[x > 2L, x := 3L]       ## have to copy (like base R / dplyr does always); otherwise the original DT
                              ## would also get modified
}

By not using shallow(), the old functionality is retained:

bar <- function(DT) {
    DT[, newcol := 1L]   ## old behaviour, original DT gets updated by reference
    DT[x > 2L, x := 3L]  ## old behaviour, update column x in the original DT
}

By creating a shallow copy using shallow(), we understand that you don't want to modify the original object. We take care of everything internally to ensure that, while also ensuring to copy the columns you modify only when it is absolutely necessary. When implemented, this should settle the referential transparency issue altogether while providing the user with both possibilities.

Also, once shallow() is exported, dplyr's data.table interface should avoid almost all copies. So those who prefer dplyr's syntax can use it with data.tables. But it will still lack many features that data.table provides, including (sub)-assignment by reference.

Aggregate while joining: suppose you have two data.tables as follows:

DT1 = data.table(x=c(1,1,1,1,2,2,2,2), y=c("a", "a", "b", "b"), z=1:8, key=c("x", "y"))
#    x y z
# 1: 1 a 1
# 2: 1 a 2
# 3: 1 b 3
# 4: 1 b 4
# 5: 2 a 5
# 6: 2 a 6
# 7: 2 b 7
# 8: 2 b 8
DT2 = data.table(x=1:2, y=c("a", "b"), mul=4:3, key=c("x", "y"))
#    x y mul
# 1: 1 a   4
# 2: 2 b   3

And you would like to get sum(z) * mul for each row in DT2 while joining by columns x,y. We can either:

1) aggregate DT1 to get sum(z), 2) perform a join and 3) multiply (or)

# data.table way
DT1[, .(z = sum(z)), keyby = .(x,y)][DT2][, z := z*mul][]

# dplyr equivalent
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    right_join(DF2) %>% mutate(z = z * mul)

2) do it all in one go (using the by = .EACHI feature):

DT1[DT2, list(z=sum(z) * mul), by = .EACHI]

What is the advantage? We don't have to allocate memory for the intermediate result. We don't have to group/hash twice (once for aggregation and once for joining). And more importantly, the operation we wanted to perform is clear by looking at j in (2). Check this post for a detailed explanation of by = .EACHI.
No intermediate results are materialised, and the join+aggregate is performed all in one go. Have a look at this, this and this posts for real usage scenarios. In dplyr you would have to join and aggregate or aggregate first and then join, neither of which are as efficient in terms of memory (which in turn translates to speed).

Update while joining: consider the data.table code shown below:

DT1[DT2, col := i.mul]

It adds/updates DT1's column col with mul from DT2 on those rows where DT2's key column matches DT1. I don't think there is an exact equivalent of this operation in dplyr, i.e., one that avoids a *_join operation, which would have to copy the entire DT1 just to add a new column to it, which is unnecessary. Check this post for a real usage scenario.

To summarise, it is important to realise that every bit of optimisation matters. As Grace Hopper would say, Mind your nanoseconds!

3. Syntax

Let's now look at syntax. Hadley commented here:

Data tables are extremely fast but I think their concision makes it harder to learn and code that uses it is harder to read after you have written it ...

I find this remark pointless because it is very subjective. What we can perhaps try is to contrast consistency in syntax. We will compare data.table and dplyr syntax side by side.

We will work with the dummy data shown below:

DT = data.table(x=1:10, y=11:20, z=rep(1:2, each=5))
DF = as.data.frame(DT)

Basic aggregation/update operations

# case (a)
DT[, sum(y), by = z]                      ## data.table syntax
DF %>% group_by(z) %>% summarise(sum(y))  ## dplyr syntax
DT[, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = cumsum(y))

# case (b)
DT[x > 2, sum(y), by = z]
DF %>% filter(x > 2) %>% group_by(z) %>% summarise(sum(y))
DT[x > 2, y := cumsum(y), by = z]
ans <- DF %>% group_by(z) %>% mutate(y = replace(y, which(x > 2), cumsum(y)))

# case (c)
DT[, if(any(x > 5L)) y[1L] - y[2L] else y[2L], by = z]
DF %>% group_by(z) %>% summarise(if (any(x > 5L)) y[1L] - y[2L] else y[2L])
DT[, if(any(x > 5L)) y[1L] - y[2L], by = z]
DF %>% group_by(z) %>% filter(any(x > 5L)) %>% summarise(y[1L] - y[2L])

data.table syntax is compact and dplyr's quite verbose. Things are more or less equivalent in case (a).

In case (b), we had to use filter() in dplyr while summarising. But while updating, we had to move the logic inside mutate(). In data.table however, we express both operations with the same logic - operate on rows where x > 2, but in the first case, get sum(y), whereas in the second case update those rows for y with its cumulative sum. This is what we mean when we say the DT[i, j, by] form is consistent.

Similarly in case (c), when we have an if-else condition, we are able to express the logic "as-is" in both data.table and dplyr. However, if we would like to return just those rows where the if condition is satisfied and skip otherwise, we cannot use summarise() directly (AFAICT). We have to filter() first and then summarise, because summarise() always expects a single value. While it returns the same result, using filter() here makes the actual operation less obvious. It might very well be possible to use filter() in the first case as well (does not seem obvious to me), but my point is that we should not have to.

Aggregation / update on multiple columns

# case (a)
DT[, lapply(.SD, sum), by = z]                    ## data.table syntax
DF %>% group_by(z) %>% summarise_each(funs(sum))  ## dplyr syntax
DT[, (cols) := lapply(.SD, sum), by = z]
ans <- DF %>% group_by(z) %>% mutate_each(funs(sum))

# case (b)
DT[, c(lapply(.SD, sum), lapply(.SD, mean)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(sum, mean))

# case (c)
DT[, c(.N, lapply(.SD, sum)), by = z]
DF %>% group_by(z) %>% summarise_each(funs(n(), mean))

In case (a), the codes are more or less equivalent. data.table uses the familiar base function lapply(), whereas dplyr introduces *_each() along with a bunch of functions to funs().

data.table's := requires column names to be provided, whereas dplyr generates them automatically.

In case (b), dplyr's syntax is relatively straightforward. Improving aggregations/updates on multiple functions is on data.table's list.

In case (c) though, dplyr would return n() as many times as there are columns, instead of just once. In data.table, all we need to do is to return a list in j. Each element of the list will become a column in the result. So we can use, once again, the familiar base function c() to concatenate .N to a list, which returns a list.

Note: once again, in data.table, all we need to do is return a list in j. Each element of the list will become a column in the result. You can use c(), as.list(), lapply(), list() etc... base functions to accomplish this, without having to learn any new functions. You will need to learn just the special variables - .N and .SD at least. The equivalents in dplyr are n() and .
Joins

dplyr provides separate functions for each type of join, whereas data.table allows joins using the same syntax DT[i, j, by] (and with reason). It also provides an equivalent merge.data.table() function as an alternative.

setkey(DT1, x, y)

# 1. normal join
DT1[DT2]             ## data.table syntax
left_join(DT2, DT1)  ## dplyr syntax

# 2. select columns while joining
DT1[DT2, .(z, i.mul)]
left_join(select(DT2, x, y, mul), select(DT1, x, y, z))

# 3. aggregate while joining
DT1[DT2, .(sum(z) * i.mul), by = .EACHI]
DF1 %>% group_by(x, y) %>% summarise(z = sum(z)) %>%
    inner_join(DF2) %>% mutate(z = z*mul) %>% select(-mul)

# 4. update while joining
DT1[DT2, z := cumsum(z) * i.mul, by = .EACHI]
??

# 5. rolling join
DT1[DT2, roll = -Inf]
??

# 6. other arguments to control output
DT1[DT2, mult = "first"]
??

- Some might find a separate function for each join much nicer (left, right, inner, anti, semi etc.), whereas others might like data.table's DT[i, j, by], or merge() which is similar to base R.
- However dplyr joins do just that. Nothing more. Nothing less.
- data.tables can select columns while joining (2), and in dplyr you will need to select() first on both data.frames before joining, as shown above. Otherwise you would materialise the join with unnecessary columns only to remove them later, and that is inefficient.
- data.tables can aggregate while joining (3) and also update while joining (4), using the by = .EACHI feature. Why materialise the entire join result to add/update just a few columns?
- data.table is capable of rolling joins (5) - roll forward, LOCF, roll backward, NOCB, nearest.
- data.table also has the mult = argument which selects first, last or all matches (6).
- data.table has the allow.cartesian = TRUE argument to protect from accidental invalid joins.

Once again, the syntax is consistent with DT[i, j, by], with additional arguments allowing to control the output further.
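A hedged mini-demo of those extra control arguments (the toy tables are my own; roll and mult themselves are real data.table arguments):

library(data.table)
prices  <- data.table(id = c(1L, 1L, 2L), t = c(1, 5, 2), v = c(10, 20, 30),
                      key = c("id", "t"))
lookups <- data.table(id = c(1L, 2L), t = c(3, 3))

prices[lookups, roll = TRUE]                 # rolling join: last observation carried forward per id
prices[lookups, on = .(id), mult = "first"]  # keep only the first match for each row of 'lookups'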

do()...

dplyr's summarise is specially designed for functions that return a single value. If your function returns multiple/unequal values, you will have to resort to do(). You have to know beforehand what all your functions return.

DT[, list(x[1], y[1]), by = z]                ## data.table syntax
DF %>% group_by(z) %>% summarise(x[1], y[1])  ## dplyr syntax
DT[, list(x[1:2], y[1]), by = z]
DF %>% group_by(z) %>% do(data.frame(.$x[1:2], .$y[1]))

DT[, quantile(x, 0.25), by = z]
DF %>% group_by(z) %>% summarise(quantile(x, 0.25))
DT[, quantile(x, c(0.25, 0.75)), by = z]
DF %>% group_by(z) %>% do(data.frame(quantile(.$x, c(0.25, 0.75))))

DT[, as.list(summary(x)), by = z]
DF %>% group_by(z) %>% do(data.frame(as.list(summary(.$x))))

- .SD's equivalent in dplyr is .
- In data.table, you can throw pretty much anything in j - the only thing to remember is for it to return a list, so that each element of the list gets converted to a column.
- In dplyr, you cannot do that. You have to resort to do(), depending on how sure you are as to whether your function would always return a single value. And it is quite slow.

Once again, data.table's syntax is consistent with DT[i, j, by]. We can just keep throwing expressions in j without having to worry about these things.

Have a look at this SO question and this one. I wonder if it would be possible to express the answer as straightforwardly using dplyr's syntax ...

To summarise, I have particularly highlighted several instances where dplyr's syntax is either inefficient, limited or fails to make operations straightforward. This is particularly because data.table gets quite a bit of backlash about "harder to read/learn" syntax (like the one pasted/linked above). Most posts that cover dplyr talk about most straightforward operations. And that is great. But it is important to realise its syntax and feature limitations as well, and I am yet to see a post on it.

data.table has its quirks as well (some of which I have pointed out and which we are attempting to fix). We are also attempting to improve data.table's joins, as I have highlighted here.

But one should also consider the number of features that dplyr lacks in comparison to data.table.

4. Features

I have pointed out most of the features here and also in this post. In addition:

- fread - a fast file reader - has been available for a long time now.
- fwrite - a parallelised fast file writer - is now available. See this post for a detailed explanation of the implementation and #1664 for keeping track of further developments.
- Automatic indexing - another handy feature to optimise base R syntax as-is, internally.
- Ad-hoc grouping: dplyr automatically sorts the results by grouping variables during summarise(), which may not always be desirable.
- Numerous advantages in data.table joins (for speed / memory efficiency and syntax), mentioned above.
- Non-equi joins: allow joins using other operators <=, <, >, >=, along with all the other advantages of data.table joins (a short hedged sketch follows this list).
- Overlapping range joins were implemented in data.table recently. Check this post for an overview with benchmarks.
- The setorder() function in data.table allows really fast reordering of data.tables by reference.
- dplyr provides an interface to databases using the same syntax, which data.table does not at the moment.
- data.table provides faster equivalents of set operations (written by Jan Gorecki) - fsetdiff, fintersect, funion and fsetequal - with an additional all argument (as in SQL).
- data.table loads cleanly with no masking warnings and has a mechanism described here for [.data.frame compatibility when passed to any R package. dplyr changes the base functions filter, lag and [, which can cause problems; e.g. here and here.
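To make the non-equi join and set-operation bullets concrete, a small hedged sketch (the toy tables are mine; the functions are data.table's own):

library(data.table)

# non-equi join: match each reading to the interval containing it
intervals <- data.table(id = 1:3, lo = c(0, 10, 20), hi = c(9, 19, 29))
readings  <- data.table(val = c(5, 12, 25))
intervals[readings, on = .(lo <= val, hi >= val)]

# fast set operations with the SQL-like 'all' argument
a <- data.table(x = c(1L, 1L, 2L))
b <- data.table(x = c(1L, 3L))
fintersect(a, b)            # rows present in both
fsetdiff(a, b, all = TRUE)  # keeps duplicates, like SQL's EXCEPT ALL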


Finally:

- On databases - there is no reason why data.table cannot provide a similar interface, but this is not a priority now. It may get bumped up if users would very much like that feature... not sure.
- On parallelism - everything is difficult, until someone goes ahead and does it. Of course it will take effort (being thread safe). Progress is being made currently (in v1.9.7 devel) towards parallelising known time-consuming parts for incremental performance gains using OpenMP.

Other Answers

In direct response to the Question title ...

dplyr definitely does things that data.table can not.

Your point #3

dplyr abstracts (or will) potential DB interactions

is a direct answer to your own question but isn't elevated to a high enough level. dplyr is truly an extendable front-end to multiple data storage mechanisms, where data.table is an extension to a single one.

Look at dplyr as a back-end agnostic interface, with all of the targets using the same grammar, where you can extend the targets and handlers at will. data.table is, from the dplyr perspective, one of those targets.

You will never (I hope) see a day that data.table attempts to translate your queries to create SQL statements that operate with on-disk or networked data stores.

dplyr can do things data.table will not or might not do as well.

Based on the design of working in-memory, data.table could have a much more difficult time extending itself into parallel processing of queries than dplyr.


In response to the in-body questions ...

Usage

Are there analytical tasks that are a lot easier to code with one package or the other for people familiar with the packages (i.e. some combination of keystrokes required vs. required level of esotericism, where less of each is a good thing)?

This may seem like a punt but the real answer is no. People familiar with tools seem to use either the one most familiar to them or the one that is actually right for the job at hand. With that being said, sometimes you want to present a particular readability, sometimes a level of performance, and when you have need for a high enough level of both you may just need another tool to go along with what you already have to make clearer abstractions.

Performance

Are there analytical tasks that are performed substantially (i.e. more than 2x) more efficiently in one package vs. the other?

Again, no. data.table excels at being efficient in everything it does, where dplyr gets limited in some respects by the underlying data store and registered handlers.

This means that when you run into a performance issue with data.table you can be pretty sure it is in your query function, and if it is actually a bottleneck in data.table then you've won yourself the joy of filing a report. This is also true when dplyr is using data.table as the back-end; you may see some overhead from dplyr but odds are it is your query.

When dplyr has performance issues with back-ends you can get around them by registering a function for hybrid evaluation, or (in the case of databases) by manipulating the generated query prior to execution.

Also see the accepted answer to "When is plyr better than data.table?"

Here's my attempt at a comprehensive answer from the dplyr perspective, following the broad outline of Arun's answer (but somewhat rearranged based on differing priorities).

Syntax

There is some subjectivity to syntax, but I stand by my statement that the concision of data.table makes it harder to learn and harder to read. This is partly because dplyr is solving a much easier problem!

One really important thing that dplyr does for you is that it constrains your options. I claim that most single table problems can be solved with just five key verbs - filter, select, mutate, arrange and summarise - along with a "by group" adverb. That constraint is a big help when you're learning data manipulation, because it helps order your thinking about the problem. In dplyr, each of these verbs maps to a single function. Each function does one job, and is easy to understand in isolation.

You create complexity by piping these simple operations together with %>%. Here's an example from one of the posts Arun linked to:

diamonds %>%
  filter(cut != "Fair") %>%
  group_by(cut) %>%
  summarize(
    AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = n()
  ) %>%
  arrange(desc(Count))

Even if you've never seen dplyr before (or even R!), you can still get the gist of what's happening, because the functions are all English verbs. The disadvantage of English verbs is that they require more typing than [, but I think that can be largely mitigated by better autocomplete.

Here's the equivalent data.table code:

diamondsDT <- data.table(diamonds)
diamondsDT[
  cut != "Fair", 
  .(AvgPrice = mean(price),
    MedianPrice = as.numeric(median(price)),
    Count = .N
  ), 
  by = cut
][ 
  order(-Count) 
]

It's harder to follow this code unless you're already familiar with data.table. (I also couldn't figure out how to indent the repeated [ in a way that looks good to my eye.) Personally, when I look at code I wrote 6 months ago, it's like looking at code written by a stranger, so I've come to prefer straightforward, if verbose, code.

Two other minor factors that I think slightly decrease readability:

- Since almost every data.table operation uses [, you need additional context to figure out what's happening. For example, is x[y] joining two data.tables or extracting columns from a data frame? This is only a small issue, because in well-written code the variable names should suggest what's happening (a small illustration follows this list).
- I like that group_by() is a separate operation in dplyr. It fundamentally changes the computation, so I think it should be obvious when skimming the code, and it's easier to spot group_by() than the by argument to [.data.table.
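A small sketch of that x[y] ambiguity (my own toy example):

library(data.table)
x <- data.table(id = 1:3, v = c(10, 20, 30), key = "id")
y <- data.table(id = 2:3)
x[y]     # both are data.tables: this is a keyed join

x2 <- data.frame(id = 1:3, v = c(10, 20, 30))
x2["v"]  # data.frame with a character index: column extraction
         # same bracket, two very different operations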

I also like that the pipe isn't just limited to one package. You can start by tidying your data with tidyr, and finish up with a plot in ggvis. And you're not limited to the packages that I write - anyone can write a function that forms a seamless part of a data manipulation pipeline. In fact, I rather prefer the previous data.table code rewritten with %>%:

diamonds %>% 
  data.table() %>% 
  .[cut != "Fair", 
    .(AvgPrice = mean(price),
      MedianPrice = as.numeric(median(price)),
      Count = .N
    ), 
    by = cut
  ] %>% 
  .[order(-Count)]

And the idea of piping with %>% is not limited to just data frames and is easily generalised to other contexts: interactive web graphics, web scraping, gists, run-time contracts, ...

Memory and performance

I've lumped these together, because, to me, they're not that important. Most R users work with well under 1 million rows of data, and dplyr is sufficiently fast for that size of data that you're not aware of processing time. We optimise dplyr for expressiveness on medium data; feel free to use data.table for raw speed on bigger data.

The flexibility of dplyr also means that you can easily tweak performance characteristics using the same syntax. If the performance of dplyr with the data frame backend is not good enough for you, you can use the data.table backend (albeit with a somewhat restricted set of functionality). If the data you're working with doesn't fit in memory, then you can use a database backend.

All that said, dplyr performance will get better in the long term. We'll definitely implement some of the great ideas of data.table, like radix ordering and using the same index for joins and filters. We're also working on parallelisation so we can take advantage of multiple cores.

Features

A few things we're planning to work on in 2015:

- the readr package, to make it easy to get files off disk and into memory, analogous to fread()
- more flexible joins, including support for non-equi-joins
- more flexible grouping, like bootstrap samples, rollups and more

I'm also investing time into improving R's database connectors, the ability to talk to web APIs, and making it easier to scrape HTML pages.

Reading Hadley and Arun's answers, one gets the impression that those who prefer dplyr's syntax would in some cases have to switch over to data.table or compromise on long running times.

But as some have already mentioned, dplyr can use data.table as a backend. This is accomplished using the dtplyr package, which recently had its version 1.0.0 release. Learning dtplyr incurs practically zero additional effort.

When using dtplyr, one uses the function lazy_dt() to declare a lazy data.table, after which standard dplyr syntax is used to specify operations on it. This would look something like the following:

new_table <- mtcars2 %>% 
  lazy_dt() %>%
  filter(wt < 5) %>% 
  mutate(l100k = 235.21 / mpg) %>% # liters / 100 km
  group_by(cyl) %>% 
  summarise(l100k = mean(l100k))

new_table

#> Source: local data table [?? x 2]
#> Call:   `_DT1`[wt < 5][, `:=`(l100k = 235.21/mpg)][, .(l100k = mean(l100k)), 
#>     keyby = .(cyl)]
#> 
#>     cyl l100k
#>   <dbl> <dbl>
#> 1     4  9.05
#> 2     6 12.0 
#> 3     8 14.9 
#> 
#> # Use as.data.table()/as.data.frame()/as_tibble() to access results

The new_table object is not evaluated until as.data.table()/as.data.frame()/as_tibble() is called on it, at which point the underlying data.table operations are executed.
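For instance, continuing the example above, any of those coercions forces the computation (a trivial usage note of mine, not from the original answer):

as_tibble(new_table)      # triggers the underlying data.table call
as.data.table(new_table)  # same, but returns a data.table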

I recreated a benchmark analysis done by the data.table author Matt Dowle back in December 2018, which covers the case of operations over large numbers of groups. I found that dtplyr indeed, for the most part, enables those who prefer the dplyr syntax to keep using it while enjoying the speed offered by data.table.