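I am having some trouble with leading and trailing white space in a data.frame.
For example, I look at a specific row in a data.frame based on a certain condition: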
> myDummy[myDummy$country == c("Austria"),c(1,2,3:7,19)]
[1] codeHelper country dummyLI dummyLMI dummyUMI
[6] dummyHInonOECD dummyHIOECD dummyOECD
<0 rows> (or 0-length row.names)
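I was wondering why I didn't get the expected output, since Austria obviously exists in my data.frame. After looking through my code history and trying to figure out what went wrong, I tried: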
> myDummy[myDummy$country == c("Austria "),c(1,2,3:7,19)]
   codeHelper  country dummyLI dummyLMI dummyUMI dummyHInonOECD dummyHIOECD
18        AUT Austria        0        0        0              0           1
   dummyOECD
18         1
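All I changed in the command is an additional white space after Austria.
Further annoying problems obviously arise, for example when I want to merge two data frames on the country column. One data.frame uses "Austria " while the other uses "Austria", so the matching does not work.
Is there a nice way to "show" the white space on my screen so that I am aware of the problem?
And can I remove the leading and trailing white space in R?
So far I have used a simple Perl script that removes the white space, but it would be nice if I could somehow do it inside R.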
Probably the best way is to handle the trailing white space when you read your data file. If you use read.csv or read.table you can set the parameter strip.white=TRUE.
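As a minimal sketch of the strip.white approach, the snippet below reads a small inline CSV twice, once with the default and once with strip.white = TRUE. The inline data and the names csv_text, raw, and clean are made up for illustration; they are not taken from the question's data. Note that strip.white only affects unquoted fields.

```r
# Hypothetical inline CSV with padded country names (illustration only)
csv_text <- "code,country\nAUT,Austria \nBEL, Belgium\n"

# Default: padding around unquoted character fields is kept
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# strip.white = TRUE drops leading/trailing white space from unquoted fields
clean <- read.csv(text = csv_text, stringsAsFactors = FALSE,
                  strip.white = TRUE)

raw$country == "Austria"    # FALSE FALSE
clean$country == "Austria"  # TRUE FALSE
```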
If you want to clean the strings afterwards, you could use one of these functions:
# Returns string without leading white space
trim.leading <- function (x) sub("^\\s+", "", x)
# Returns string without trailing white space
trim.trailing <- function (x) sub("\\s+$", "", x)
# Returns string without leading or trailing white space
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
To use one of these functions on myDummy$country:
myDummy$country <- trim(myDummy$country)
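To tie this back to the merge problem from the question, here is a small sketch with two made-up data frames (left_df and right_df are hypothetical names, as are their values). Keep in mind that gsub() returns a character vector, so a factor column becomes character after trimming.

```r
# Hypothetical data frames: one key column has a trailing blank
left_df  <- data.frame(country = c("Austria ", "Belgium"), gdp = c(1, 2),
                       stringsAsFactors = FALSE)
right_df <- data.frame(country = c("Austria", "Belgium"), oecd = c(1, 1),
                       stringsAsFactors = FALSE)

# Same helper as above: strip leading and trailing white space
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

# Before trimming, "Austria " does not match "Austria"
rows_before <- nrow(merge(left_df, right_df, by = "country"))

# After trimming the key column, both countries match
left_df$country <- trim(left_df$country)
rows_after <- nrow(merge(left_df, right_df, by = "country"))

c(before = rows_before, after = rows_after)  # before = 1, after = 2
```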
To "show" the white space you could use:
paste(myDummy$country)
This shows the strings surrounded by quotation marks ("), which makes white space easier to spot.
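Besides eyeballing the quoted output, the padding can also be detected programmatically. The following is only a sketch on a made-up vector (country and padded are hypothetical names); it flags values that start or end with white space and compares character counts.

```r
country <- c("Austria ", "Belgium", " Chile")  # hypothetical values

# TRUE where a value starts or ends with white space
padded <- grepl("^\\s|\\s$", country)
country[padded]

# Character counts give the padding away: "Austria " has 8, not 7
lengths_seen <- nchar(country)
lengths_seen

# Print with explicit quotes so the padding is visible
print(country, quote = TRUE)
```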
I created a trim.strings() function to trim leading and/or trailing whitespace, as follows:
# Arguments: x - character vector
# side - side(s) on which to remove whitespace
# default : "both"
# possible values: c("both", "leading", "trailing")
trim.strings <- function(x, side = "both") {
  if (is.na(match(side, c("both", "leading", "trailing")))) {
    side <- "both"
  }
  if (side == "leading") {
    sub("^\\s+", "", x)
  } else {
    if (side == "trailing") {
      sub("\\s+$", "", x)
    } else gsub("^\\s+|\\s+$", "", x)
  }
}
For illustration:
a <- c(" ABC123 456 ", " ABC123DEF ")
# returns string without leading and trailing whitespace
trim.strings(a)
# [1] "ABC123 456" "ABC123DEF"
# returns string without leading whitespace
trim.strings(a, side = "leading")
# [1] "ABC123 456 " "ABC123DEF "
# returns string without trailing whitespace
trim.strings(a, side = "trailing")
# [1] " ABC123 456" " ABC123DEF"
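For comparison, base R (since version 3.2.0) ships trimws(), whose which argument covers the same three cases. This is just a sketch that redefines the example vector so it runs on its own; the result names are made up. Note that trimws() by default strips only spaces, tabs, and line breaks rather than everything matched by \\s.

```r
a <- c(" ABC123 456 ", " ABC123DEF ")

both_sides <- trimws(a)                   # default: which = "both"
left_only  <- trimws(a, which = "left")   # leading white space only
right_only <- trimws(a, which = "right")  # trailing white space only

both_sides
left_only
right_only
```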
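A benchmark of the main approaches in this thread. It does not capture all the weird cases, but so far we are still lacking an example where str_trim removes white space and trimws does not (see Richard Telford's comment on this answer). It doesn't seem to matter much anyway; the gsub option appears to be the fastest :)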
x <- c(" lead", "trail ", " both ", " both and middle ", " _special")
## gsub function from https://stackoverflow.com/a/2261149/7941188
## this is NOT the function from user Bernhard Kausler, which uses
## a much less concise regex
gsub_trim <- function (x) gsub("^\\s+|\\s+$", "", x)
res <- microbenchmark::microbenchmark(
  gsub = gsub_trim(x),
  ## https://stackoverflow.com/a/30210713/7941188
  trimws = trimws(x),
  ## https://stackoverflow.com/a/15007398/7941188
  str_trim = stringr::str_trim(x),
  times = 10^5
)
res
#> Unit: microseconds
#>      expr    min     lq      mean median       uq       max neval cld
#>      gsub 20.201 22.788  31.43943 24.654  28.4115  5303.741 1e+05 a
#>    trimws 38.204 41.980  61.92218 44.420  51.1810 40363.860 1e+05  b
#>  str_trim 88.672 92.347 116.59186 94.542 105.2800 13618.673 1e+05   c
ggplot2::autoplot(res)
sessionInfo()
#> R version 4.0.3 (2020-10-10)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Big Sur 10.16
#>
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> stringr_1.4.0
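One "weird case" that the test vector above does not include is a non-breaking space (U+00A0). As far as I know, trimws() by default only strips the class [ \t\r\n], so such a character survives. Since R 3.6.0 it accepts a whitespace argument, and its help page suggests "[\\h\\v]" for Unicode white space. The sketch below uses a made-up string y; I have not benchmarked this variant, and how str_trim treats the same input is not checked here.

```r
y <- "\u00a0padded\u00a0"  # hypothetical string with non-breaking spaces

# Default character class: the non-breaking spaces are left in place
default_trim <- trimws(y)

# Wider class suggested in ?trimws for Unicode white space
strong_trim <- trimws(y, whitespace = "[\\h\\v]")

nchar(y)             # 8
nchar(default_trim)  # 8: nothing was removed
nchar(strong_trim)
```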
Use dplyr/tidyverse mutate_all with str_trim to trim the entire data frame:
myDummy %>%
  mutate_all(str_trim)
library(tidyverse)
set.seed(335)
df <- mtcars %>%
  rownames_to_column("car") %>%
  mutate(car = ifelse(runif(nrow(mtcars)) > 0.4, car, paste0(car, " "))) %>%
  select(car, mpg)
print(head(df), quote = TRUE)
#>                    car    mpg
#> 1         "Mazda RX4 " "21.0"
#> 2      "Mazda RX4 Wag" "21.0"
#> 3        "Datsun 710 " "22.8"
#> 4    "Hornet 4 Drive " "21.4"
#> 5 "Hornet Sportabout " "18.7"
#> 6           "Valiant " "18.1"
df_trim <- df %>%
  mutate_all(str_trim)
print(head(df_trim), quote = TRUE)
#>                   car    mpg
#> 1         "Mazda RX4"   "21"
#> 2     "Mazda RX4 Wag"   "21"
#> 3        "Datsun 710" "22.8"
#> 4    "Hornet 4 Drive" "21.4"
#> 5 "Hornet Sportabout" "18.7"
#> 6           "Valiant" "18.1"
Created on 2021-05-07 by the reprex package (v0.3.0)
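Note that in the output above mpg comes back as character ("21" rather than 21.0), because str_trim returns character vectors for every column it is applied to. If that is not wanted, one option is to trim only the character columns. Here is a minimal base R sketch on a made-up data frame (df_demo and is_chr are hypothetical names):

```r
# Hypothetical data frame with one padded character column
df_demo <- data.frame(car = c("Mazda RX4 ", " Valiant"),
                      mpg = c(21.0, 18.1),
                      stringsAsFactors = FALSE)

# Find the character columns and trim only those
is_chr <- vapply(df_demo, is.character, logical(1))
df_demo[is_chr] <- lapply(df_demo[is_chr], trimws)

str(df_demo)  # car is trimmed, mpg is still numeric
```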