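Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them: how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be. So I often just go through them all until I get what I want.

Can someone explain how to use which one when?

My current (probably incorrect/incomplete) understanding is...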

sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output
lapply(vec, f): same as sapply, but output is a list?
apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape. Would plyr or reshape replace all of these entirely?


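Current answer

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginners may have difficulty deciding which one is appropriate for their situation, or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new users, to help direct them to the correct *apply function for their particular problem. Note that it is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you decide which *apply function suits your situation; it is then up to you to research it further. With one exception, performance differences will not be addressed.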

apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)

# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4

# apply max to columns
apply(M, 2, max)
[1]  4  8 12 16

# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))

# Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144

# Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
     [,1] [,2] [,3] [,4]
[1,]   18   26   34   42
[2,]   20   28   36   44
[3,]   22   30   38   46
[4,]   24   32   40   48

If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.

lapply - When you want to apply a function to each element of a list in turn and get a list back.

This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1

$b
[1] 3

$c
[1] 91

lapply(x, FUN = sum)
$a
[1] 1

$b
[1] 6

$c
[1] 5005

sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

If you find yourself typing unlist(lapply(...)), stop and consider sapply.

x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
 a  b  c
 1  3 91

sapply(x, FUN = sum)
   a    b    c
   1    6 5005

In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

sapply(1:5,function(x) rnorm(3,x))

If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

sapply(1:5,function(x) matrix(x,2,2))

Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

sapply(1:5,function(x) matrix(x,2,2), simplify = "array")

Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.

vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code or want more type safety.

For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

x <- list(a = 1, b = 1:3, c = 10:100)
# Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
 a  b  c
 1  3 91

mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

This is multivariate in the sense that your function must accept multiple arguments.

# Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1]  3  6  9 12 15

# To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3

[[2]]
[1] 6

[[3]]
[1] 9

[[4]]
[1] 12

[[5]]
[1] 15

rapply - For when you want to apply a function to each element of a nested list structure, recursively.

To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

# Append ! to string, otherwise increment
myFun <- function(x){
    if(is.character(x)){
      return(paste(x,"!",sep=""))
    }
    else{
      return(x + 1)
    }
}

# A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Result is named vector, coerced to character
rapply(l, myFun)

# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")

tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

A vector:

x <- 1:20

A factor (of the same length!) defining groups:

y <- factor(rep(letters[1:5], each = 4))

Add up the values in x within each subgroup defined by y:

tapply(x, y, sum)
 a  b  c  d  e
10 26 42 58 74

More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
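The remark about subgroups defined by combinations of several factors can be sketched as follows. The data here (score, team, stage) are made up purely for illustration; the sketch only assumes base R.

```r
# Hypothetical data: 12 scores with two grouping factors
score <- 1:12
team  <- factor(rep(c("red", "blue"), each = 6))
stage <- factor(rep(c("r1", "r2", "r3"), times = 4))

# Passing a list of factors gives one cell per (team, stage) combination,
# so the result is a 2 x 3 matrix with the factor levels as dimnames
totals <- tapply(score, list(team, stage), sum)
print(totals)

# aggregate() computes the same sums but returns them in "long" form,
# one row per combination, as a data frame
d    <- data.frame(score, team, stage)
long <- aggregate(score ~ team + stage, data = d, FUN = sum)
print(long)
```

tapply keeps the groups as array dimensions, while aggregate keeps them as columns, which is usually the more convenient shape if the result goes back into a data frame workflow.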

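Other answers

There are lots of great answers which discuss the differences in the use cases for each function. None of them discuss the differences in performance. That is reasonable, because the various functions expect different input and produce different output, yet most of them share a general common objective: to evaluate by series/groups. My answer is going to focus on performance. Because of the above, the creation of the input from the vectors is included in the timing, and the apply function itself is not measured.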

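I tested two different functions, sum and length, at once. The volume tested is 50M values on input and about 500K groups on output (k = 5e5 in the code). I have also included two currently popular packages which were not widely used at the time the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.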

library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)

timing = list()

# sapply
timing[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})

# lapply
timing[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})

# tapply
timing[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)

# by
timing[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# aggregate
timing[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# dplyr
timing[["dplyr"]] = system.time({
    df = tibble(x, grp)  # tibble() replaces the deprecated data_frame()
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# data.table
timing[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})

# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table), 
       function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
#    sapply     lapply     tapply         by  aggregate      dplyr data.table 
#      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 

# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
              )[,.(fun = V1, elapsed = V2)
                ][order(-elapsed)]
#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686
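
Before trusting timings like these, it can help to confirm on a small input that the different approaches really compute the same thing. The following is a scaled-down sketch using only base R; the sizes (1000 values, 10 groups) are arbitrary choices for illustration and say nothing about speed.

```r
set.seed(1)
x   <- runif(1000)
grp <- sample(10, 1000, replace = TRUE)

# split + sapply: named numeric vector of group sums
s1 <- sapply(split(x, grp), sum)

# tapply: 1-d array of group sums, named by group
s2 <- tapply(x, grp, sum)

# aggregate: data frame with one row per group, sums in column "x"
s3 <- aggregate(x, list(grp = grp), sum)

# Same numbers, three different containers
same_as_tapply    <- isTRUE(all.equal(unname(s1), as.vector(s2)))
same_as_aggregate <- isTRUE(all.equal(unname(s1), s3$x))
print(c(same_as_tapply, same_as_aggregate))
```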

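Perhaps it's worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.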

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

There is nothing in the base package that works like ave for whole data frames (in the way that by is like tapply for data frames). But you can fudge it:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...
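
As an alternative to the row-index trick, the same per-group value can be built with split and unsplit. This is only a sketch of one possible base R approach, not something the answer above proposes; the column names mirror the example, and the computed quantity (sum of a times m per group) is chosen for illustration.

```r
dfr   <- data.frame(a = 1:20, f = rep(LETTERS[1:5], each = 4))
dfr$m <- ave(dfr$a, dfr$f, FUN = mean)

# The index trick: ave() hands each group's row numbers to FUN
foo_ave <- ave(seq_len(nrow(dfr)), dfr$f, FUN = function(i) {
    g <- dfr[i, ]
    sum(g$a * g$m)
})

# split() the whole data frame, compute one value per group,
# repeat it for every row of the group, then unsplit() back
# into the original row order
per_group <- lapply(split(dfr, dfr$f), function(g) {
    rep(sum(g$a * g$m), nrow(g))
})
foo_split <- unsplit(per_group, dfr$f)

print(isTRUE(all.equal(foo_ave, foo_split)))
```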

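Despite all the great answers here, there are two more base functions that deserve to be mentioned: the useful outer function and the obscure eapply function.

outer is a very useful function hidden as a more mundane one. If you read the help for outer, its description says: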

The outer product of the arrays X and Y is the array A with dimension  
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =   
FUN(X[arrayindex.x], Y[arrayindex.y], ...).

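which makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply applies the function to the first elements of each vector, then the second elements of each, and so on, whereas outer applies the function to every combination of one element from the first vector and one from the second. For example: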

 A<-c(1,3,5,7,9)
 B<-c(0,3,6,9,12)

> mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12

 > outer(A,B, pmax)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    3    6    9   12
 [2,]    3    3    6    9   12
 [3,]    5    5    6    9   12
 [4,]    7    7    7    9   12
 [5,]    9    9    9    9   12

I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
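
A minimal sketch of that values-against-conditions use, with made-up numbers and thresholds. One thing to keep in mind is that outer calls the function once on fully expanded vectors, so the function must be vectorised; the scalar_check helper below is hypothetical and is wrapped in Vectorize for that reason.

```r
values     <- c(2, 5, 8, 11)
thresholds <- c(low = 3, mid = 6, high = 9)

# Cell [i, j] answers: does values[i] reach thresholds[j]?
meets <- outer(values, thresholds, ">=")
dimnames(meets) <- list(paste0("v", values), names(thresholds))
print(meets)

# How many values clear each threshold
cleared <- colSums(meets)
print(cleared)

# A function written for single values needs Vectorize() first
scalar_check <- function(v, t) if (v >= t) TRUE else FALSE
meets2 <- outer(values, thresholds, Vectorize(scalar_check))
```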

eapply

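eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example, if you want to find a list of user-defined functions in the global environment: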

A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}

> eapply(.GlobalEnv, is.function)
$A
[1] FALSE

$B
[1] FALSE

$C
[1] FALSE

$D
[1] TRUE 

Frankly I don't use this very much, but if you are building a lot of packages or creating a lot of environments it may come in handy.
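
To turn that output into an actual list of function names, the logical results can be collapsed and used as a filter. The sketch below uses a throwaway environment rather than .GlobalEnv so that it does not depend on what happens to be in your session; the object names are arbitrary.

```r
# A throwaway environment standing in for .GlobalEnv (hypothetical contents)
e <- new.env()
assign("A", c(1, 3, 5), envir = e)
assign("C", list(x = 1, y = 2), envir = e)
assign("D", function(x) x + 1, envir = e)
assign("E", function(x) x * 2, envir = e)

# eapply() returns a named list; the order of its elements is not guaranteed
is_fun <- unlist(eapply(e, is.function))

# Names of the functions stored in e, sorted for a stable result
fun_names <- sort(names(is_fun)[is_fun])
print(fun_names)
```

By default eapply skips objects whose names start with a dot; pass all.names = TRUE to include them.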

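I recently discovered the rather useful sweep function and add it here for the sake of completeness:

sweep

The basic idea is to sweep through an array row-wise or column-wise and return a modified array. An example will make this clear (source: datacamp):

Let's say you have a matrix and want to standardize it column-wise: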

dataPoints <- matrix(4:15, nrow = 4)

# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)

# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)

# Center the points 
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")

# Return the result
dataPoints_Trans1
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5

# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans2
##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

NB: for this simple example the same result can of course be achieved more easily with apply(dataPoints, 2, scale).
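
A short sketch to check that equivalence, and to show the row-wise direction as well. It reuses the dataPoints matrix from the example; comparing through as.vector is a choice made here because scale attaches extra attributes to its result.

```r
dataPoints <- matrix(4:15, nrow = 4)

# Centre and scale by column in two sweep() passes
centered <- sweep(dataPoints, 2, colMeans(dataPoints), "-")
scaled   <- sweep(centered, 2, apply(dataPoints, 2, sd), "/")

# scale() does both steps at once; compare the values only
via_scale <- scale(dataPoints)
print(isTRUE(all.equal(as.vector(scaled), as.vector(via_scale))))

# MARGIN = 1 sweeps a statistic out of the rows instead of the columns
row_centered <- sweep(dataPoints, 1, rowMeans(dataPoints), "-")
print(rowSums(row_centered))
```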

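First start with joran's excellent answer; it is doubtful anything can better that.

Then the following mnemonics may help you remember the distinctions between each. Whilst some are obvious, others may be less so; for these you'll find justification in joran's discussions.

Mnemonics

lapply is a list apply which acts on a list or vector and returns a list.
sapply is a simple lapply (the function defaults to returning a vector or matrix when possible)
vapply is a verified apply (allows the return object type to be prespecified)
rapply is a recursive apply for nested lists, i.e. lists within lists
tapply is a tagged apply where the tags identify the subsets
apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)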
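
The list / simple / verified distinction is easiest to see by running all three on the same input. This sketch reuses the list x from joran's answer; the second half uses a made-up function whose results differ in length, to show where sapply and vapply part ways.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

res_l <- lapply(x, length)              # always a list
res_s <- sapply(x, length)              # simplified to a named integer vector
res_v <- vapply(x, length, integer(1))  # same result, but the shape is checked

# sapply() quietly falls back to a list when results cannot be simplified
ragged <- sapply(x, function(el) el[el > 1])
print(is.list(ragged))

# vapply() raises an error instead of silently changing the return type
checked <- tryCatch(
    vapply(x, function(el) el[el > 1], numeric(1)),
    error = function(e) "vapply stopped: result was not a single number"
)
print(checked)
```

That difference is the practical meaning of "verified": inside functions and packages, vapply makes an unexpected result shape fail loudly at the point where it happens.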

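Building the right background

If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.

These two articles can help. They provide the necessary background to motivate the functional programming techniques that the apply family of functions provides.

Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP you'll have gained a powerful point of view for use in R, and apply will make a lot more sense.

Advanced R: Functional Programming, by Hadley Wickham
Simple Functional Programming in R, by Michael Barton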
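
As a small taste of that point of view, the sketch below treats functions as ordinary values that can be passed to and returned from other functions. Filter and Reduce are base R functions; compose is a hypothetical helper written only for this illustration.

```r
# Functions are ordinary values: they can be stored, passed and returned
square  <- function(x) x^2
compose <- function(f, g) function(x) g(f(x))   # hypothetical helper
square_then_negate <- compose(square, function(x) -x)

nums <- 1:6

squares <- sapply(nums, square)                   # map a function over a vector
evens   <- Filter(function(x) x %% 2 == 0, nums)  # keep elements passing a test
total   <- Reduce(`+`, squares)                   # fold a vector into one value
negated <- vapply(nums, square_then_negate, numeric(1))

print(list(squares = squares, evens = evens, total = total, negated = negated))
```

Seen this way, each *apply function is just a "map" with a particular choice of input container and output shape.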