我试图确定一个字符串是否是另一个字符串的子集。例如:

chars <- "test"
value <- "es"

如果“value”作为字符串“chars”的一部分出现,我想返回TRUE。在下面的场景中,我想返回false:

chars <- "test"
value <- "et"

当前回答

同样,可以使用"stringr"库:

> library(stringr)
> chars <- "test"
> value <- "es"
> str_detect(chars, value)
[1] TRUE

### For multiple value case:
> value <- c("es", "l", "est", "a", "test")
> str_detect(chars, value)
[1]  TRUE FALSE  TRUE FALSE  TRUE

其他回答

你想要grepl:

> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE

这里有类似的问题:给定一个字符串和一个关键字列表,检测字符串中包含哪些关键字(如果有的话)。

这个线程的建议是stringr的str_detect和grepl。下面是微基准测试包中的基准测试:

使用

map_keywords = c("once", "twice", "few")
t = "yes but only a few times"

mapper1 <- function (x) {
  r = str_detect(x, map_keywords)
}

mapper2 <- function (x) {
  r = sapply(map_keywords, function (k) grepl(k, x, fixed = T))
}

然后

microbenchmark(mapper1(t), mapper2(t), times = 5000)

我们发现

Unit: microseconds
       expr    min     lq     mean  median      uq      max neval
 mapper1(t) 26.401 27.988 31.32951 28.8430 29.5225 2091.476  5000
 mapper2(t) 19.289 20.767 24.94484 23.7725 24.6220 1011.837  5000

可以看到,使用str_detect和grepl对关键字的实际字符串和向量进行了超过5000次的关键字搜索,grepl的性能要比str_detect好得多。

结果是布尔向量r,它标识字符串中包含哪些关键字(如果有)。

因此,我建议使用grepl来确定字符串中是否有关键字。

使用stringi package中的这个函数:

> stri_detect_fixed("test",c("et","es"))
[1] FALSE  TRUE

一些基准:

library(stringi)
set.seed(123L)
value <- stri_rand_strings(10000, ceiling(runif(10000, 1, 100))) # 10000 random ASCII strings
head(value)

chars <- "es"
library(microbenchmark)
microbenchmark(
   grepl(chars, value),
   grepl(chars, value, fixed=TRUE),
   grepl(chars, value, perl=TRUE),
   stri_detect_fixed(value, chars),
   stri_detect_regex(value, chars)
)
## Unit: milliseconds
##                               expr       min        lq    median        uq       max neval
##                grepl(chars, value) 13.682876 13.943184 14.057991 14.295423 15.443530   100
##  grepl(chars, value, fixed = TRUE)  5.071617  5.110779  5.281498  5.523421 45.243791   100
##   grepl(chars, value, perl = TRUE)  1.835558  1.873280  1.956974  2.259203  3.506741   100
##    stri_detect_fixed(value, chars)  1.191403  1.233287  1.309720  1.510677  2.821284   100
##    stri_detect_regex(value, chars)  6.043537  6.154198  6.273506  6.447714  7.884380   100

使用grep或grepl,但要注意是否要使用正则表达式。

默认情况下,grep和相关的匹配采用正则表达式,而不是文字子字符串。如果你没有预料到,并且你试图匹配一个无效的正则表达式,它就不起作用:

> grep("[", "abc[")
Error in grep("[", "abc[") : 
  invalid regular expression '[', reason 'Missing ']''

要做一个真子字符串测试,使用fixed = true。

> grep("[", "abc[", fixed = TRUE)
[1] 1

如果你确实想要正则表达式,很好,但这并不是OP所要求的。

回答

唉,我花了45分钟才找到这个简单问题的答案。答案是:grepl(needle, haystack, fixed=TRUE)

# Correct
> grepl("1+2", "1+2", fixed=TRUE)
[1] TRUE
> grepl("1+2", "123+456", fixed=TRUE)
[1] FALSE

# Incorrect
> grepl("1+2", "1+2")
[1] FALSE
> grepl("1+2", "123+456")
[1] TRUE

解释

grep is named after the linux executable, which is itself an acronym of "Global Regular Expression Print", it would read lines of input and then print them if they matched the arguments you gave. "Global" meant the match could occur anywhere on the input line, I'll explain "Regular Expression" below, but the idea is it's a smarter way to match the string (R calls this "character", eg class("abc")), and "Print" because it's a command line program, emitting output means it prints to its output string.

现在,grep程序基本上是一个过滤器,从输入行到输出行。似乎R的grep函数也同样接受一个输入数组。出于我完全不知道的原因(我大约一个小时前才开始使用R),它返回一个匹配索引的向量,而不是一个匹配列表。

但是,回到你最初的问题,我们真正想知道的是我们是否找到了大海捞针,一个真/假的值。显然,他们决定将这个函数命名为grepl,就像在“grep”中一样,但是返回值是“Logical”(他们调用true和false逻辑值,例如class(true))。

So, now we know where the name came from and what it's supposed to do. Lets get back to Regular Expressions. The arguments, even though they are strings, they are used to build regular expressions (henceforth: regex). A regex is a way to match a string (if this definition irritates you, let it go). For example, the regex a matches the character "a", the regex a* matches the character "a" 0 or more times, and the regex a+ would match the character "a" 1 or more times. Hence in the example above, the needle we are searching for 1+2, when treated as a regex, means "one or more 1 followed by a 2"... but ours is followed by a plus!

所以,如果你使用grepl而没有设置fixed,你的针会意外地变成干草堆,这经常会意外地起作用,我们可以看到它甚至在OP的例子中起作用。但这是一个潜在的漏洞!我们需要告诉它输入是一个字符串,而不是一个正则表达式,这显然是fixed的用途。为什么固定?不知道,把这个答案收藏起来,你可能要再查5次才能记住它。

最后的一些想法

The better your code is, the less history you have to know to make sense of it. Every argument can have at least two interesting values (otherwise it wouldn't need to be an argument), the docs list 9 arguments here, which means there's at least 2^9=512 ways to invoke it, that's a lot of work to write, test, and remember... decouple such functions (split them up, remove dependencies on each other, string things are different than regex things are different than vector things). Some of the options are also mutually exclusive, don't give users incorrect ways to use the code, ie the problematic invocation should be structurally nonsensical (such as passing an option that doesn't exist), not logically nonsensical (where you have to emit a warning to explain it). Put metaphorically: replacing the front door in the side of the 10th floor with a wall is better than hanging a sign that warns against its use, but either is better than neither. In an interface, the function defines what the arguments should look like, not the caller (because the caller depends on the function, inferring everything that everyone might ever want to call it with makes the function depend on the callers, too, and this type of cyclical dependency will quickly clog a system up and never provide the benefits you expect). Be very wary of equivocating types, it's a design flaw that things like TRUE and 0 and "abc" are all vectors.