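Grouping functions (tapply, by, aggregate) and the *apply family

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them: how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be. So I often just go through them all until I get what I want.

Can someone explain how to use which one when?

My current (probably incorrect/incomplete) understanding is...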

sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output
lapply(vec, f): same as sapply, but output is a list?
apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape. Would plyr or reshape replace all of these entirely?

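Current answer

In the collapse package recently released on CRAN, I have attempted to compress most of the common apply functionality into just 2 functions:

dapply (Data-Apply) applies functions to the rows or (by default) the columns of matrices and data.frames and (by default) returns an object of the same type and with the same attributes (unless the result of each computation is atomic and drop = TRUE). Its performance is comparable to lapply for data.frame columns, and about 2x faster than apply for matrix rows or columns. Parallelism is available via mclapply (only for MAC).

Syntax: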

dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L, 
       return = c("same", "matrix", "data.frame"), drop = TRUE)

Examples:

# Apply to columns:
dapply(mtcars, log)
dapply(mtcars, sum)
dapply(mtcars, quantile)
# Apply to rows:
dapply(mtcars, sum, MARGIN = 1)
dapply(mtcars, quantile, MARGIN = 1)
# Return as matrix:
dapply(mtcars, quantile, return = "matrix")
dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
# Same for matrices ...

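BY is an S3 generic for split-apply-combine computing, with vector, matrix and data.frame methods. It is significantly faster than tapply, by and aggregate (and also faster than plyr, although dplyr is faster on large data).

Syntax: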

BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
   expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
   return = c("same", "matrix", "data.frame", "list"))

Examples:

# Vectors:
BY(iris$Sepal.Length, iris$Species, sum)
BY(iris$Sepal.Length, iris$Species, quantile)
BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix 
# Data.frames
BY(iris[-5], iris$Species, sum)
BY(iris[-5], iris$Species, quantile)
BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
# Same for matrices ...

Lists of grouping variables can also be supplied to g.

Talking about performance: A main goal of collapse is to foster high-performance programming in R and to move beyond split-apply-combine altogether. For this purpose the package has a full set of C++ based fast generic functions: fmean, fmedian, fmode, fsum, fprod, fsd, fvar, fmin, fmax, ffirst, flast, fNobs, fNdistinct, fscale, fbetween, fwithin, fHDbetween, fHDwithin, flag, fdiff and fgrowth. They perform grouped computations in a single pass through the data (i.e. no splitting and recombining).

Syntax:

fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)

Examples:

v <- iris$Sepal.Length
f <- iris$Species

# Vectors
fmean(v)             # mean
fmean(v, f)          # grouped mean
fsd(v, f)            # grouped standard deviation
fsd(v, f, TRA = "/") # grouped scaling
fscale(v, f)         # grouped standardizing (scaling and centering)
fwithin(v, f)        # grouped demeaning

w <- abs(rnorm(nrow(iris)))
fmean(v, w = w)      # Weighted mean
fmean(v, f, w)       # Weighted grouped mean
fsd(v, f, w)         # Weighted grouped standard-deviation
fsd(v, f, w, "/")    # Weighted grouped scaling
fscale(v, f, w)      # Weighted grouped standardizing
fwithin(v, f, w)     # Weighted grouped demeaning

# Same using data.frames...
fmean(iris[-5], f)                # grouped mean
fscale(iris[-5], f)               # grouped standardizing
fwithin(iris[-5], f)              # grouped demeaning

# Same with matrices ...

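In the package vignettes I provide benchmarks. Programming with the fast functions is significantly faster than programming with dplyr or data.table, especially on smaller data, but also on large data.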

2020-03-20 07:22:45

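Other answers

Side note: here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document on the plyr webpage http://had.co.nz/plyr/)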

Base function   Input   Output   plyr function 
---------------------------------------
aggregate        d       d       ddply + colwise 
apply            a       a/l     aaply / alply 
by               d       l       dlply 
lapply           l       l       llply  
mapply           a       a/l     maply / mlply 
replicate        r       a/l     raply / rlply 
sapply           l       a       laply

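One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.

Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my everyday use. But, also from the Intro to plyr document:

Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.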
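As a rough illustration of the two base functions the quote singles out, here is a minimal sketch in base R. The toy objects (m, df, group_means) are made up for this example and do not come from the plyr document.

```r
# sweep: subtract each column's mean from that column (column centering)
m <- matrix(1:6, nrow = 3)            # columns are 1:3 and 4:6
col_means <- colMeans(m)              # 2 and 5
centered <- sweep(m, 2, col_means)    # FUN defaults to "-"

# merge: attach a per-group summary back onto the original rows
df <- data.frame(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))
group_means <- aggregate(x ~ g, data = df, FUN = mean)
names(group_means)[2] <- "x_mean"     # avoid a name clash with df$x
merged <- merge(df, group_means, by = "g")
```

After the merge, every original row carries its group's mean in the x_mean column, which is the "combining summaries with the original data" use mentioned above.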

2010-08-17 19:20:09

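R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation, and then it is up to you to research it further. With one exception, performance differences will not be addressed.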

apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)

# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4

# apply max to columns
apply(M, 2, max)
[1]  4  8 12 16

# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))

# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144

# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48

If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.

lapply - When you want to apply a function to each element of a list in turn and get a list back.

This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91

lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005

sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

If you find yourself typing unlist(lapply(...)), stop and consider sapply.

x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a  b  c
1  3 91

sapply(x, FUN = sum)
a    b    c
1    6 5005

In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

sapply(1:5,function(x) rnorm(3,x))

If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

sapply(1:5,function(x) matrix(x,2,2))

Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

sapply(1:5,function(x) matrix(x,2,2), simplify = "array")

Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.

vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.

For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

x <- list(a = 1, b = 1:3, c = 10:100)
# Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a  b  c
1  3 91

mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

This is multivariate in the sense that your function must accept multiple arguments.

# Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1]  3  6  9 12 15

# To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3

[[2]]
[1] 6

[[3]]
[1] 9

[[4]]
[1] 12

[[5]]
[1] 15

rapply - For when you want to apply a function to each element of a nested list structure, recursively.

To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

# Append ! to string, otherwise increment
myFun <- function(x){
    if(is.character(x)){
      return(paste(x,"!",sep=""))
    }
    else{
      return(x + 1)
    }
}

# A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Result is named vector, coerced to character
rapply(l, myFun)

# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")

tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

A vector:

x <- 1:20

A factor (of the same length!) defining groups:

y <- factor(rep(letters[1:5], each = 4))

Add up the values in x within each subgroup defined by y:

tapply(x, y, sum)
 a  b  c  d  e
10 26 42 58 74

More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.) Hence its black sheep status.
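Since tapply is compared above to aggregate and by without showing them, here is a small side-by-side sketch in base R. The data frame df and its columns are invented for illustration only.

```r
df <- data.frame(
  g = factor(rep(c("a", "b"), each = 3)),
  x = 1:6,
  y = c(2, 4, 6, 8, 10, 12)
)

# tapply: one vector plus a grouping -> one value per group, named by group
t_res <- tapply(df$x, df$g, sum)

# aggregate: data frame in, data frame out, one row per group
a_res <- aggregate(cbind(x, y) ~ g, data = df, FUN = sum)

# by: the function receives each group's sub-data-frame,
# so it can combine several columns; the result has class "by"
b_res <- by(df[, c("x", "y")], df$g, function(d) sum(d$x * d$y))
```

Roughly: reach for tapply when you group a single vector, aggregate when you want a data frame of per-column summaries, and by when the function needs the whole sub-data-frame at once.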

2011-08-21 22:50:17

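Perhaps it is worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.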

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...
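An alternative to the row-index trick, offered here only as a sketch and not as part of the original answer, is to split the data frame, compute one value per group, and let unsplit() restore the original row order. The names pieces and per_group are made up for this example.

```r
dfr <- data.frame(a = 1:20, f = rep(LETTERS[1:5], each = 4))
dfr$m  <- ave(dfr$a, dfr$f, FUN = mean)
dfr$m2 <- dfr$m

# split the data frame by group; each piece is a sub-data-frame
pieces <- split(dfr, dfr$f)

# compute one value per group and repeat it once per row of that group
per_group <- lapply(pieces, function(x) rep(sum(x$m * x$m2), nrow(x)))

# unsplit() reverses split(), putting values back in the original row order
dfr$foo <- unsplit(per_group, dfr$f)
```

This yields the same foo column as the ave version (25 for group A), at the cost of one extra intermediate object.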

2014-11-06 00:00:25

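From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:

(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)

(on the left is input, on the top is output)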

2011-10-09 05:29:32
