在CRAN上最近发布的崩溃包中,我尝试将大多数常见的应用功能压缩为2个函数:
dapply (Data-Apply)将函数应用于矩阵和data.frames的行或列(默认值),并且(默认值)返回一个具有相同类型和相同属性的对象(除非每次计算的结果是原子的并且drop = TRUE)。对于data.frame列的性能与lapply相当,对于矩阵行或列的性能比apply快2倍。并行性可通过mclapply获得(仅适用于MAC)。
语法:
dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame"), drop = TRUE)
例子:
# Apply to columns:
dapply(mtcars, log)
dapply(mtcars, sum)
dapply(mtcars, quantile)
# Apply to rows:
dapply(mtcars, sum, MARGIN = 1)
dapply(mtcars, quantile, MARGIN = 1)
# Return as matrix:
dapply(mtcars, quantile, return = "matrix")
dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
# Same for matrices ...
BY是S3的一个泛型,用于拆分应用组合计算,使用矢量、矩阵和data.frame方法。它明显比tapply快,通过和聚合(也比plyr快,但在大数据上dplyr更快)。
语法:
BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame", "list"))
例子:
# Vectors:
BY(iris$Sepal.Length, iris$Species, sum)
BY(iris$Sepal.Length, iris$Species, quantile)
BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix
# Data.frames
BY(iris[-5], iris$Species, sum)
BY(iris[-5], iris$Species, quantile)
BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
# Same for matrices ...
分组变量的列表也可以提供给g。
Talking about performance: A main goal of collapse is to foster high-performance programming in R and to move beyond split-apply-combine alltogether. For this purpose the package has a full set of C++ based fast generic functions: fmean, fmedian, fmode, fsum, fprod, fsd, fvar, fmin, fmax, ffirst, flast, fNobs, fNdistinct, fscale, fbetween, fwithin, fHDbetween, fHDwithin, flag, fdiff and fgrowth. They perform grouped computations in a single pass through the data (i.e. no splitting and recombining).
语法:
fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)
例子:
v <- iris$Sepal.Length
f <- iris$Species
# Vectors
fmean(v) # mean
fmean(v, f) # grouped mean
fsd(v, f) # grouped standard deviation
fsd(v, f, TRA = "/") # grouped scaling
fscale(v, f) # grouped standardizing (scaling and centering)
fwithin(v, f) # grouped demeaning
w <- abs(rnorm(nrow(iris)))
fmean(v, w = w) # Weighted mean
fmean(v, f, w) # Weighted grouped mean
fsd(v, f, w) # Weighted grouped standard-deviation
fsd(v, f, w, "/") # Weighted grouped scaling
fscale(v, f, w) # Weighted grouped standardizing
fwithin(v, f, w) # Weighted grouped demeaning
# Same using data.frames...
fmean(iris[-5], f) # grouped mean
fscale(iris[-5], f) # grouped standardizing
fwithin(iris[-5], f) # grouped demeaning
# Same with matrices ...
在软件包小插图中,我提供了基准测试。使用快速函数编程要比使用dplyr或数据编程快得多。表,尤其适用于较小的数据,也适用于较大的数据。