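I am having some trouble with leading and trailing whitespace in a data.frame.
For example, I look at a specific row in a data.frame based on a certain condition: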
> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)]
[1] codeHelper country dummyLI dummyLMI dummyUMI
[6] dummyHInonOECD dummyHIOECD dummyOECD
<0 rows> (or 0-length row.names)
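I was wondering why I didn't get the expected output, since Austria obviously exists in my data.frame. After looking through my code history and trying to figure out what went wrong, I tried: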
> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
codeHelper country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18 AUT Austria 0 0 0 0 1
dummyOECD
18 1
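All I changed in the command is an additional space after Austria.
Further annoying problems obviously arise. For example, when I want to merge two data frames based on the country column: one data.frame uses "Austria " (with a trailing space) while the other has "Austria". The matching doesn't work.
Is there a nice way to "show" the whitespace on my screen so that I am aware of the problem?
And can I remove the leading and trailing whitespace in R?
So far I have used a simple Perl script that removes the whitespace, but it would be nice if I could somehow do it inside R.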
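To make the two questions above concrete, here is a minimal base R sketch. The small `myDummy` and `other` data frames are hypothetical stand-ins for the real data, and `trimws()` assumes R 3.2.0 or later. It shows one way to make stray whitespace visible, and one way to strip it before subsetting or merging.

```r
# Hypothetical stand-in for the myDummy data frame from the question
myDummy <- data.frame(
  codeHelper = c("AUT", "BEL"),
  country = c("Austria ", "Belgium"),
  stringsAsFactors = FALSE
)

# Printing with quotes makes leading/trailing blanks visible
print(myDummy, quote = TRUE)

# Flag the values that start or end with whitespace
has_ws <- grepl("^\\s+|\\s+$", myDummy$country)
myDummy$country[has_ws]

# Strip leading and trailing whitespace from the key column
myDummy$country <- trimws(myDummy$country)

# Subsetting and merging on the cleaned column now match "Austria"
austria <- myDummy[myDummy$country == "Austria", ]

other <- data.frame(
  country = c("Austria", "Belgium"),
  dummyOECD = c(1, 1),
  stringsAsFactors = FALSE
)
merged <- merge(myDummy, other, by = "country")
```

If the column is a factor rather than a character vector, `trimws()` returns a character vector, so the column may need to be converted back afterwards.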
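A benchmark of the main approaches in this thread. It doesn't capture all the weird cases, but so far we are still lacking an example where str_trim removes whitespace and trimws doesn't (see Richard Telford's comment on this answer). It doesn't seem to matter much: the gsub option appears to be the fastest :)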
x <- c(" lead", "trail ", " both ", " both and middle ", " _special")
## gsub function from https://stackoverflow.com/a/2261149/7941188
## this is NOT the function from user Bernhard Kausler, which uses
## a much less concise regex
gsub_trim <- function (x) gsub("^\\s+|\\s+$", "", x)
res <- microbenchmark::microbenchmark(
gsub = gsub_trim(x),
## https://stackoverflow.com/a/30210713/7941188
trimws = trimws(x),
## https://stackoverflow.com/a/15007398/7941188
str_trim = stringr::str_trim(x),
times = 10^5
)
res
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> gsub 20.201 22.788 31.43943 24.654 28.4115 5303.741 1e+05 a
#> trimws 38.204 41.980 61.92218 44.420 51.1810 40363.860 1e+05 b
#> str_trim 88.672 92.347 116.59186 94.542 105.2800 13618.673 1e+05 c
ggplot2::autoplot(res)
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> stringr_1.4.0
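One candidate for the "weird cases" mentioned above is a non-breaking space (U+00A0). The sketch below uses only base R and a made-up input string. It suggests that `trimws()` with its default whitespace pattern leaves such a character in place, while the broader `"[\\h\\v]"` pattern described in `?trimws` removes it. It assumes R 3.6.0 or later, where the `whitespace` argument is available; how `str_trim` and the `gsub` variant behave on this input is not tested here.

```r
# Hypothetical value with a trailing non-breaking space (U+00A0)
x_nbsp <- "Austria\u00a0"

# Default pattern only covers space, tab, carriage return and newline
default_trim <- trimws(x_nbsp)

# Broader pattern: any horizontal or vertical whitespace
unicode_trim <- trimws(x_nbsp, whitespace = "[\\h\\v]")

# Compare the lengths to see which call removed the character
nchar(default_trim)
nchar(unicode_trim)
```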
Use dplyr/tidyverse mutate_all with str_trim to trim an entire data frame:
myDummy %>%
mutate_all(str_trim)
library(tidyverse)
set.seed(335)
df <- mtcars %>%
rownames_to_column("car") %>%
mutate(car = ifelse(runif(nrow(mtcars)) > 0.4, car, paste0(car, " "))) %>%
select(car, mpg)
print(head(df), quote = TRUE)
#> car mpg
#> 1 "Mazda RX4 " "21.0"
#> 2 "Mazda RX4 Wag" "21.0"
#> 3 "Datsun 710 " "22.8"
#> 4 "Hornet 4 Drive " "21.4"
#> 5 "Hornet Sportabout " "18.7"
#> 6 "Valiant " "18.1"
df_trim <- df %>%
mutate_all(str_trim)
print(head(df_trim), quote = TRUE)
#> car mpg
#> 1 "Mazda RX4" "21"
#> 2 "Mazda RX4 Wag" "21"
#> 3 "Datsun 710" "22.8"
#> 4 "Hornet 4 Drive" "21.4"
#> 5 "Hornet Sportabout" "18.7"
#> 6 "Valiant" "18.1"
Created on 2021-05-07 by the reprex package (v0.3.0)
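One thing to be aware of: str_trim returns character vectors, so mutate_all(str_trim) turns every column into character, including numeric ones such as mpg. The following is a minimal base R sketch of trimming only the character columns, using a small hypothetical data frame rather than the one above.

```r
# Hypothetical example data: one character column, one numeric column
df <- data.frame(
  car = c("Mazda RX4 ", " Datsun 710", "Valiant"),
  mpg = c(21.0, 22.8, 18.1),
  stringsAsFactors = FALSE
)

# Identify the character columns and trim only those
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], trimws)

# mpg keeps its numeric type
str(df)
```

With dplyr 1.0.0 or later, `mutate(across(where(is.character), str_trim))` should express the same idea in tidyverse style.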