Grouping functions (tapply, by, aggregate) and the *apply family

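Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I have never quite understood the differences between them: how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be. So I often just go through them all until I get what I want.

Can someone explain which one to use when?

My current (probably incorrect/incomplete) understanding is: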

1. sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output

2. lapply(vec, f): same as sapply, but output is a list?

3. apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)

4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names

5. by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.

6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape. Would plyr or reshape replace all of these entirely?

Current answer

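There are many great answers that discuss the differences in use cases for each function. None of them discuss the differences in performance. That is reasonable, because the functions expect different inputs and produce different outputs, yet most of them share a common objective: evaluating a function by series/groups. My answer focuses on performance. Because of the above, the creation of the input from the vectors is included in the timing, and the apply function itself is not measured.

I tested two different functions, sum and length, at once. The volume tested is 50M values on input and 500K groups on output (n = 5e7 and k = 5e5 in the code below). I also included two currently popular packages that were not widely used at the time the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.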

library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)

timing = list()

# sapply
timing[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})

# lapply
timing[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})

# tapply
timing[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)

# by
timing[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# aggregate
timing[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# dplyr
timing[["dplyr"]] = system.time({
    df = tibble(x, grp)  # tibble() replaces the deprecated data_frame()
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# data.table
timing[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})

# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table), 
       function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
#    sapply     lapply     tapply         by  aggregate      dplyr data.table 
#      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE

# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
              )[,.(fun = V1, elapsed = V2)
                ][order(-elapsed)]
#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686
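
The benchmark above needs a lot of memory and time to run. As a rough sanity check, the same grouped sum and count can be reproduced on a small input with base R only. This is just a sketch: the sizes and the names ending in _small are illustrative assumptions, not part of the original benchmark.

```r
# Small-scale sketch (assumed sizes) of the same grouped sum/length task
set.seed(123)
n_small <- 1000
k_small <- 10
x_small <- runif(n_small)
grp_small <- sample(k_small, n_small, TRUE)

# split + sapply: one row per statistic, one column per group
by_sapply <- sapply(split(x_small, grp_small),
                    function(v) c(sum = sum(v), n = length(v)))

# tapply: one call per statistic
sum_tapply <- tapply(x_small, grp_small, sum)
n_tapply <- tapply(x_small, grp_small, length)

# aggregate: returns a data frame with one row per group
sum_aggregate <- aggregate(x_small, list(grp = grp_small), sum)

# All three approaches should agree on the per-group sums
all.equal(unname(by_sapply["sum", ]), as.vector(sum_tapply))
all.equal(sum_aggregate$x, as.vector(sum_tapply))
```

The three results differ only in shape (matrix, 1-d array, data frame), which is part of why their running times are not directly comparable.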

2015-12-08 22:42:43

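Other answers

On the side note, here is how the various plyr functions correspond to the base *apply functions (from the "Intro to plyr" document on the plyr webpage, http://had.co.nz/plyr/):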

Base function   Input   Output   plyr function 
---------------------------------------
aggregate        d       d       ddply + colwise 
apply            a       a/l     aaply / alply 
by               d       l       dlply 
lapply           l       l       llply  
mapply           a       a/l     maply / mlply 
replicate        r       a/l     raply / rlply 
sapply           l       a       laply

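One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passed to ldply() to produce useful output, and so on.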
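
The input/output letters in the table can be checked directly on the base functions. Below is a minimal sketch, assuming a small made-up data frame df (not taken from the plyr documentation) and reading d as data frame, l as list, and a as array/vector.

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(g = rep(c("a", "b"), each = 3), v = 1:6)

# aggregate: data frame in, data frame out (d -> d)
res_aggregate <- aggregate(v ~ g, data = df, FUN = sum)
is.data.frame(res_aggregate)

# by: data frame in, list-mode "by" object out (d -> l);
# FUN returns two values here, so the result cannot be simplified to a vector
res_by <- by(df, df$g, function(d) range(d$v))
class(res_by)

# lapply: list in, list out (l -> l)
res_lapply <- lapply(split(df$v, df$g), sum)
is.list(res_lapply)

# sapply: list in, vector/array out (l -> a)
res_sapply <- sapply(split(df$v, df$g), sum)
is.atomic(res_sapply)
```

Note that by() only behaves as d -> l when the function returns something other than a single value; with a scalar result it simplifies to an array.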

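Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my everyday use. But, also from the "Intro to plyr" document: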

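Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.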
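
As a rough illustration of the last point, the sketch below computes a per-group summary and joins it back onto the original rows with merge. The data frame and the column names (g, v, v_mean, v_centered) are made up for this example.

```r
# Hypothetical toy data, used only for illustration
df <- data.frame(g = rep(c("a", "b", "c"), times = 2),
                 v = c(1, 2, 3, 4, 5, 6))

# Per-group summary as a data frame with columns g and v_mean
group_means <- aggregate(v ~ g, data = df, FUN = mean)
names(group_means)[2] <- "v_mean"

# merge() joins the summary onto every original row via the shared column g
merged <- merge(df, group_means, by = "g")

# Deviation of each observation from its group mean
merged$v_centered <- merged$v - merged$v_mean
merged
```

merge sorts the result by the key column by default, so the row order of merged may differ from df.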

2010-08-17 19:20:09

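R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation, or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you decide which *apply function suits your situation, and then it is up to you to research it further. With one exception, performance differences will not be addressed.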

apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)

# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4

# apply max to columns
apply(M, 2, max)
[1]  4  8 12 16

# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))

# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144

# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48

If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.

lapply - When you want to apply a function to each element of a list in turn and get a list back.

This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91

lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005

sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

If you find yourself typing unlist(lapply(...)), stop and consider sapply.

x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
 a  b  c
 1  3 91

sapply(x, FUN = sum)
   a    b    c
   1    6 5005

In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

sapply(1:5,function(x) rnorm(3,x))

If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

sapply(1:5,function(x) matrix(x,2,2))

Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

sapply(1:5,function(x) matrix(x,2,2), simplify = "array")

Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.

vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.

For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
 a  b  c
 1  3 91

mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

This is multivariate in the sense that your function must accept multiple arguments.

#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1]  3  6  9 12 15

#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4

Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15

rapply - For when you want to apply a function to each element of a nested list structure, recursively.

To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

# Append ! to string, otherwise increment
myFun <- function(x){
    if(is.character(x)){
      return(paste(x,"!",sep=""))
    }
    else{
      return(x + 1)
    }
}

#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Result is named vector, coerced to character
rapply(l, myFun)

# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")

tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

A vector:

x <- 1:20

A factor (of the same length!) defining groups:

y <- factor(rep(letters[1:5], each = 4))

Add up the values in x within each subgroup defined by y:

tapply(x, y, sum)
 a  b  c  d  e
10 26 42 58 74

More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.) Hence its black sheep status.
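
The "type safety" point about vapply can be made concrete. The sketch below is not part of the original answer. It reuses the list x from above and an ad hoc filtering function to show where sapply quietly changes its return type while vapply stops with an error.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# range() always returns two values, so sapply() simplifies to a 2 x 3 matrix
sapply(x, range)

# With results of differing lengths, sapply() silently falls back to a list
res_sapply <- sapply(x, function(v) v[v > 2])
is.list(res_sapply)

# vapply() checks every result against FUN.VALUE and stops on a mismatch;
# tryCatch() is used here only so the script keeps running
res_vapply <- tryCatch(
  vapply(x, function(v) v[v > 2], FUN.VALUE = numeric(1)),
  error = function(e) "vapply error"
)
res_vapply

# On empty input, sapply() returns an empty list,
# while vapply() keeps the declared type
empty_sapply <- sapply(list(), length)
empty_vapply <- vapply(list(), length, FUN.VALUE = integer(1))
```

In code where later steps depend on the shape of the result, this predictability is often a stronger reason to prefer vapply than any speed difference.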

2011-08-21 22:50:17

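It is perhaps worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.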

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...
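
Another base-R route to the same result, offered only as a sketch and not part of the original answer, is to split the data frame by group, add the column to each piece, and reassemble the pieces with unsplit. The block rebuilds dfr so that it runs on its own. The names pieces and dfr2 are made up for this example.

```r
# Rebuild the example data frame so this block is self-contained
dfr <- data.frame(a = 1:20, f = rep(LETTERS[1:5], each = 4))
dfr$m  <- ave(dfr$a, dfr$f, FUN = mean)
dfr$m2 <- dfr$m

# One data frame per group; the function sees all columns of the group
pieces <- lapply(split(dfr, dfr$f), function(d) {
  d$foo <- sum(d$m * d$m2)
  d
})

# unsplit() puts the rows back in their original order
dfr2 <- unsplit(pieces, dfr$f)
head(dfr2)
```

Compared with indexing through ave, the function here works directly on the sub data frame rather than on row numbers.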

2014-11-06 00:00:25

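From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:

(Hopefully it's clear that apply corresponds to @Hadley's aaply, aggregate corresponds to @Hadley's ddply, and so on. Slide 20 of the same slideshare will clarify if you don't get it from this image.)

(on the left is input, on the top is output)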

2011-10-09 05:29:32

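I recently discovered the rather useful sweep function and add it here for the sake of completeness:

sweep

The basic idea is to sweep through an array row-wise or column-wise and return a modified array. An example will make this clear (source: datacamp):

Let's say you have a matrix and want to standardize it column-wise: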

dataPoints <- matrix(4:15, nrow = 4)

# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)

# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)

# Center the points 
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")

# Return the result
dataPoints_Trans1
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5

# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans2
##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

NB: for this simple example the same result can of course be achieved more easily with apply(dataPoints, 2, scale).
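
The equivalence can be checked directly. The sketch below reuses dataPoints from the example. The other variable names are made up here, and the last line shows a row-wise sweep (MARGIN = 1) for contrast.

```r
dataPoints <- matrix(4:15, nrow = 4)

# Two sweep() passes: subtract column means, then divide by column sds
centered <- sweep(dataPoints, 2, colMeans(dataPoints), "-")
standardized <- sweep(centered, 2, apply(dataPoints, 2, sd), "/")

# One-step alternatives
via_apply <- apply(dataPoints, 2, scale)
via_scale <- scale(dataPoints)   # carries extra "scaled:" attributes

all.equal(standardized, via_apply)
all.equal(standardized, via_scale, check.attributes = FALSE)

# MARGIN = 1 sweeps row-wise instead: subtract each row's mean
row_centered <- sweep(dataPoints, 1, rowMeans(dataPoints), "-")
row_centered
```

sweep becomes more useful when the statistic being swept out is not one that scale already knows about, for example a column median or an externally supplied vector.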

2017-06-16 16:03:27
