分组函数(tapply, by, aggregate)和*apply族

每当我想在R中做一些“映射”py的事情时，我通常尝试使用apply家族中的函数。

然而，我从来没有完全理解它们之间的区别——{sapply, lapply，等等}如何将函数应用到输入/分组输入，输出将是什么样子，甚至输入可以是什么——所以我经常只是浏览它们，直到我得到我想要的。

有人能解释一下什么时候用哪个吗?

我目前(可能不正确/不完全)的理解是……

sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output lapply(vec, f): same as sapply, but output is a list? apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix) tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

题外话:我还没学过plyr或remodeling——plyr或remodeling会完全取代所有这些吗?

当前回答

因为我意识到这篇文章(非常优秀)的答案缺乏全面的解释。以下是我的贡献。

正如文档中所述，by函数可以作为tapply的“包装器”。当我们想要计算tapply无法处理的任务时，by的幂函数就出现了。例如下面的代码:

ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )

 cb
iris$Species: setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 
-------------------------------------------------------------- 
iris$Species: versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 
-------------------------------------------------------------- 
iris$Species: virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 


ct
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 

$versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 

$virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800

如果我们打印这两个对象，ct和cb，我们“本质上”有相同的结果，唯一的区别是它们的显示方式和不同的类属性，分别是by for cb和array for ct。

正如我所说，当我们不能使用tapply时，by的力量就出现了;下面的代码是一个例子:

 tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) : 
  arguments must have same length

R说参数必须具有相同的长度，比如“我们想要计算虹膜中所有变量沿因子Species的总和”:但是R不能这样做，因为它不知道如何处理。

使用by函数R为数据帧类分派一个特定的方法，然后让summary函数工作，即使第一个参数的长度(以及类型)不同。

bywork <- by(iris, iris$Species, summary )

bywork
iris$Species: setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
-------------------------------------------------------------- 
iris$Species: versicolor
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
-------------------------------------------------------------- 
iris$Species: virginica
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500

它确实起作用了，结果非常令人惊讶。它是一个类的对象，即沿着Species(例如，对于它们中的每一个)计算每个变量的摘要。

注意，如果第一个参数是一个数据帧，分派的函数必须有针对该类对象的方法。例如，我们将这段代码与均值函数一起使用我们将得到这段完全没有意义的代码:

 by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
------------------------------------------- 
iris$Species: versicolor
[1] NA
------------------------------------------- 
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA

总

如果我们以这种方式使用tapply，那么Aggregate可以被视为tapply的另一种不同的使用方式。

at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)

 at
    setosa versicolor  virginica 
     5.006      5.936      6.588 
 ag
     Group.1     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588

两个直接的区别是，aggregate的第二个参数必须是一个列表，而tapply可以(非强制)是一个列表，aggregate的输出是一个数据帧，而tapply的输出是一个数组。

aggregate的强大之处在于它可以轻松地处理带有子集参数的数据子集，并且它还有ts对象和公式的方法。

这些元素使聚合体在某些情况下更容易与tapply一起工作。以下是一些例子(可在文档中找到):

ag <- aggregate(len ~ ., data = ToothGrowth, mean)

 ag
  supp dose   len
1   OJ  0.5 13.23
2   VC  0.5  7.98
3   OJ  1.0 22.70
4   VC  1.0 16.77
5   OJ  2.0 26.06
6   VC  2.0 26.14

我们可以用tapply实现同样的效果，但语法略难，输出(在某些情况下)可读性较差:

att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

 att
       OJ    VC
0.5 13.23  7.98
1   22.70 16.77
2   26.06 26.14

还有一些时候，我们不能使用by或tapply，而必须使用aggregate。

 ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

 ag1
  Month    Ozone     Temp
1     5 23.61538 66.73077
2     6 29.44444 78.22222
3     7 59.11538 83.88462
4     8 59.96154 83.96154
5     9 31.44828 76.89655

我们不能在一次调用中使用tapply获得之前的结果，但我们必须计算每个元素的平均值沿着Month，然后将它们组合起来(还要注意，我们必须调用na。rm = TRUE，因为聚合函数的公式方法默认为na。Action = na.省略):

ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

 cbind(ta1, ta2)
       ta1      ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000

虽然使用by我们无法实现这一点，但实际上下面的函数调用会返回一个错误(但很可能与所提供的函数有关，mean):

by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)

其他时候，结果是相同的，差异只是在类(然后它是如何显示/打印的，而不仅仅是——例如，如何子集)对象:

byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

前面的代码实现了相同的目标和结果，在某些情况下使用什么工具只是个人口味和需求的问题;前面两个对象在子集方面有非常不同的需求。

2015-08-28 02:28:10

其他回答

也许值得一提的是ave。ave是tapply的好兄弟。它以一种可以直接插入数据帧的形式返回结果。

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

在基本包中，对于整个数据帧，没有任何东西可以像ave那样工作(就像by与tapply一样)。但你可以蒙混过关:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...

2014-11-06 00:00:25

首先是乔兰的精彩回答——恐怕没有比这更好的答案了。

下面的助记法可以帮助你记住它们之间的区别。虽然有些是显而易见的，但有些可能不那么明显——对于这些，你会在Joran的讨论中找到理由。

助记符

Lapply是一个列表应用程序，作用于列表或向量并返回一个列表。 Sapply是一个简单的lapply(函数默认返回一个向量或矩阵) Vapply是一个经过验证的apply(允许预先指定返回对象类型) Rapply是针对嵌套列表(即列表中的列表)的递归应用 Tapply是一个带标记的应用程序，其中标记标识子集 Apply是通用的:将函数应用到矩阵的行或列(或者更一般地，应用到数组的维数)

构建正确的背景

如果您仍然觉得使用apply族有点陌生，那么可能您缺少了一个关键的观点。

这两篇文章会有所帮助。它们为应用函数族所提供的函数式编程技术提供了必要的背景知识。

Lisp的用户会立刻认出这个范例。如果您不熟悉Lisp，一旦您了解了FP，您将获得在R中使用的强大观点——而apply将更有意义。

高级R:函数式编程，Hadley Wickham著《R中的简单函数式编程》，作者:Michael Barton

2014-04-25 00:20:19

附注:以下是各种plyr函数如何对应于基本*apply函数(来自plyr网页http://had.co.nz/plyr/的plyr介绍文档)

Base function   Input   Output   plyr function 
---------------------------------------
aggregate        d       d       ddply + colwise 
apply            a       a/l     aaply / alply 
by               d       l       dlply 
lapply           l       l       llply  
mapply           a       a/l     maply / mlply 
replicate        r       a/l     raply / rlply 
sapply           l       a       laply

plyr的目标之一是为每个函数提供一致的命名约定，在函数名中编码输入和输出数据类型。它还提供了输出的一致性，因为来自dlply()的输出很容易传递给ldply()以产生有用的输出，等等。

从概念上讲，学习plyr并不比理解基本的apply函数更难。

在我的日常使用中，Plyr和重塑函数几乎取代了所有这些函数。但是，同样来自Plyr文档的介绍:

相关函数tapply和sweep在plyr中没有相应的函数，仍然有用。Merge对于合并摘要和原始数据非常有用。

2010-08-17 19:20:09

R有许多在帮助文件中有巧妙描述的*apply函数(例如?apply)。但是，它们太多了，初学者可能很难决定哪一个适合他们的情况，甚至很难记住它们。他们可能有一个普遍的感觉，“我应该在这里使用一个*apply函数”，但一开始很难把它们都说清楚。

尽管事实上(在其他回答中提到)*apply系列的大部分功能都由非常流行的plyr包覆盖，但基本函数仍然有用，值得了解。

这个答案旨在作为新用户的一个路标，帮助他们针对特定的问题找到正确的*apply函数。注意，这不是为了简单地复制或替换R文档!希望这个答案能帮助您决定哪个*apply函数适合您的情况，然后由您进一步研究。除了一个例外，性能差异将不予处理。

apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first. # Two dimensional matrix M <- matrix(seq(1,16), 4, 4) # apply min to rows apply(M, 1, min) [1] 1 2 3 4 # apply max to columns apply(M, 2, max) [1] 4 8 12 16 # 3 dimensional array M <- array( seq(32), dim = c(4,4,2)) # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension apply(M, 1, sum) # Result is one-dimensional [1] 120 128 136 144 # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension apply(M, c(1,2), sum) # Result is two-dimensional [,1] [,2] [,3] [,4] [1,] 18 26 34 42 [2,] 20 28 36 44 [3,] 22 30 38 46 [4,] 24 32 40 48 If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums. lapply - When you want to apply a function to each element of a list in turn and get a list back. This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath. x <- list(a = 1, b = 1:3, c = 10:100) lapply(x, FUN = length) $a [1] 1 $b [1] 3 $c [1] 91 lapply(x, FUN = sum) $a [1] 1 $b [1] 6 $c [1] 5005 sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list. If you find yourself typing unlist(lapply(...)), stop and consider sapply. x <- list(a = 1, b = 1:3, c = 10:100) # Compare with above; a named vector, not a list sapply(x, FUN = length) a b c 1 3 91 sapply(x, FUN = sum) a b c 1 6 5005 In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix: sapply(1:5,function(x) rnorm(3,x)) If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector: sapply(1:5,function(x) matrix(x,2,2)) Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array: sapply(1:5,function(x) matrix(x,2,2), simplify = "array") Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension. vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety. For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector. x <- list(a = 1, b = 1:3, c = 10:100) #Note that since the advantage here is mainly speed, this # example is only for illustration. We're telling R that # everything returned by length() should be an integer of # length 1. vapply(x, FUN = length, FUN.VALUE = 0L) a b c 1 3 91 mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply. This is multivariate in the sense that your function must accept multiple arguments. #Sums the 1st elements, the 2nd elements, etc. mapply(sum, 1:5, 1:5, 1:5) [1] 3 6 9 12 15 #To do rep(1,4), rep(2,3), etc. mapply(rep, 1:4, 4:1) [[1]] [1] 1 1 1 1 [[2]] [1] 2 2 2 [[3]] [1] 3 3 [[4]] [1] 4 Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list. Map(sum, 1:5, 1:5, 1:5) [[1]] [1] 3 [[2]] [1] 6 [[3]] [1] 9 [[4]] [1] 12 [[5]] [1] 15 rapply - For when you want to apply a function to each element of a nested list structure, recursively. To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply: # Append ! to string, otherwise increment myFun <- function(x){ if(is.character(x)){ return(paste(x,"!",sep="")) } else{ return(x + 1) } } #A nested list structure l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), b = 3, c = "Yikes", d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5))) # Result is named vector, coerced to character rapply(l, myFun) # Result is a nested list like l, with values altered rapply(l, myFun, how="replace") tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor. The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple. A vector: x <- 1:20 A factor (of the same length!) defining groups: y <- factor(rep(letters[1:5], each = 4)) Add up the values in x within each subgroup defined by y: tapply(x, y, sum) a b c d e 10 26 42 58 74 More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.) Hence its black sheep status.

2011-08-21 22:50:17