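How do I split a string into an array in Bash?

In a Bash script, I would like to split a line into pieces and store them in an array.

For example, given the line:

Paris, France, Europe

I would like the resulting array to look like this:

array[0] = Paris
array[1] = France
array[2] = Europe

A simple implementation is preferable; speed does not matter. How can I do it?

Current answer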

t="one,two,three"
a=($(echo "$t" | tr ',' '\n'))
echo "${a[2]}"

This prints three.
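
For the question's own input, where a space follows each comma, a small variation may be easier to reason about. This is only a sketch: it assumes the fields themselves contain no commas and that at most one space follows each comma, and the variable names are illustrative.

```bash
line='Paris, France, Europe'

# Split on commas only; read -a stores the fields in the array.
IFS=',' read -r -a array <<< "$line"

# Remove one leading space from every element.
array=("${array[@]# }")

declare -p array
```

Because the IFS assignment is a prefix to read, it does not leak into the rest of the script.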

2015-07-14 11:54:19

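Other answers

Multi-character delimiter solution.

As others have pointed out in this thread, the OP's question gave an example of a comma-delimited string to be parsed into an array, but did not indicate whether he/she was interested only in comma delimiters, single-character delimiters, or multi-character delimiters.

Since Google tends to rank this answer at or near the top of search results, I wanted to give readers a solid answer to the multi-character delimiter question, since at least one response also mentions it.

If you are looking for a solution to the multi-character delimiter problem, I suggest reviewing Mallikarjun M's post, in particular the response from gniourf_gniourf, who provides this elegant pure-Bash solution using parameter expansion: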

#!/bin/bash
str="LearnABCtoABCSplitABCaABCString"
delimiter=ABC
s=$str$delimiter
array=();
while [[ $s ]]; do
    array+=( "${s%%"$delimiter"*}" );
    s=${s#*"$delimiter"};
done;
declare -p array

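Link to the cited comment/referenced post

Link to the cited question: How do I split a string on a multi-character delimiter in bash?

3 Aug 2022

Xebeche raised a good point in a comment below. After reviewing their suggested edits, I revised the script provided by gniourf_gniourf and added comments to make it easier to understand what the script is doing. I also changed the double brackets [[ ]] to single brackets for better compatibility, since many shell variants do not support the double-bracket notation. In this case, for Bash, the logic works within single or double brackets.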

#!/bin/bash
  
str="LearnABCtoABCSplitABCABCaABCStringABC"
delimiter="ABC"
array=()

while [ "$str" ]; do

    # parse next sub-string, left of next delimiter
    substring="${str%%"$delimiter"*}" 

    # when substring = delimiter, truncate leading delimiter
    # (i.e. pattern is "$delimiter$delimiter")
    [ -z "$substring" ] && str="${str#"$delimiter"}" && continue

    # create next array element with parsed substring
    array+=( "$substring" )

    # remaining string to the right of delimiter becomes next string to be evaluated
    str="${str:${#substring}}"

    # prevent infinite loop when last substring = delimiter
    [ "$str" == "$delimiter" ] && break

done

declare -p array

Without comments:

#!/bin/bash
str="LearnABCtoABCSplitABCABCaABCStringABC"
delimiter="ABC"
array=()
while [ "$str" ]; do
    substring="${str%%"$delimiter"*}" 
    [ -z "$substring" ] && str="${str#"$delimiter"}" && continue
    array+=( "$substring" )
    str="${str:${#substring}}"
    [ "$str" == "$delimiter" ] && break
done
declare -p array
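
The two scripts above treat empty fields differently: the first keeps them, while the revised one skips them. The sketch below wraps both loops in functions so they can be compared on the same input. The function names split_keep and split_drop and the global array result are hypothetical, introduced only for this comparison.

```bash
# First variant (gniourf_gniourf's loop): keeps empty fields.
split_keep() {            # $1 = string, $2 = delimiter
    local s=$1$2 d=$2
    result=()
    while [[ $s ]]; do
        result+=( "${s%%"$d"*}" )
        s=${s#*"$d"}
    done
}

# Revised variant: skips empty fields.
split_drop() {            # $1 = string, $2 = delimiter
    local s=$1 d=$2 sub
    result=()
    while [ "$s" ]; do
        sub=${s%%"$d"*}
        [ -z "$sub" ] && s=${s#"$d"} && continue
        result+=( "$sub" )
        s=${s:${#sub}}
        [ "$s" == "$d" ] && break
    done
}

split_keep "aABCABCb" "ABC"; keep=("${result[@]}")
split_drop "aABCABCb" "ABC"; drop=("${result[@]}")
declare -p keep drop
```

For the input aABCABCb, keep should hold three elements (the middle one empty) and drop should hold two. Which behaviour is wanted depends on whether empty fields carry meaning in your data.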

2018-11-13 13:19:11

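Sometimes I have found that the method described in the accepted answer does not work, especially when the separator is a newline. In those cases I solved it this way: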

string='first line
second line
third line'

oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1} # this is useful to format your code with tabs
lines=( $string )
IFS="$oldIFS"

for line in "${lines[@]}"
    do
        echo "--> $line"
done

2012-11-02 13:44:37

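I was curious about the relative performance of the "Right answer" in @bgoldst's popular answer, with its apparent decrying of loops, so I ran a simple benchmark of it against three pure bash implementations.

In summary, I suggest:

pure bash is faster than gawk for string lengths < 4k or so;
pure bash is comparable to gawk for delimiter lengths < 10 and string lengths < 256k;
for delimiter lengths >> 10 and string lengths < 64k or so, pure bash is "acceptable", and gawk is less than 5x faster;
gawk is "acceptable" for string lengths < 512k or so.

I arbitrarily defined "acceptable" as "takes < 0.5s to split the string".

I am taking the problem to be: take a bash string and split it into a bash array, using an arbitrary-length delimiter string (not a regex).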

# in: $1=delim, $2=string
# out: sets array a

My pure bash implementations are:

# naive approach - slow
split_byStr_bash_naive(){
    a=()
    local prev=""
    local cdr="$2"
    [[ -z "${cdr}" ]] && a+=("")
    while [[ "$cdr" != "$prev" ]]; do
        prev="$cdr"
        a+=( "${cdr%%"$1"*}" )
        cdr="${cdr#*"$1"}"
    done
    # echo $( declare -p a | md5sum; declare -p a )
}

# use lengths wherever possible - faster
split_byStr_bash_faster(){
    a=()
    local car=""
    local cdr="$2"
    while
        car="${cdr%%"$1"*}"
        a+=("$car")
        cdr="${cdr:${#car}}"
        (( ${#cdr} ))
    do
        cdr="${cdr:${#1}}"
    done
    # echo $( declare -p a | md5sum; declare -p a )
}

# use pattern substitution and readarray - fastest
split_byStr_bash_sub(){
        a=()
        local delim="$1" string="$2"

        delim="${delim//=/=-}"
        delim="${delim//$'\n'/=n}"

        string="${string//=/=-}"
        string="${string//$'\n'/=n}"

        readarray -td $'\n' a <<<"${string//"$delim"/$'\n'}"

        local len=${#a[@]} i s
        for (( i=0; i<len; i++ )); do
                s="${a[$i]//=n/$'\n'}"
                a[$i]="${s//=-/=}"
        done
        # echo $( declare -p a | md5sum; declare -p a )
}

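In the naive version, the initial -z test handles the case where a zero-length string is passed. Without the test, the output array is empty; with it, the array has a single zero-length element.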
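
The effect of that guard can be checked in isolation. The sketch below is a reduced copy of the naive loop with a hypothetical third argument that switches the -z guard on or off; the function name split_naive and the two counter variables are illustrative only.

```bash
# $1 = delimiter, $2 = string, $3 = non-empty to enable the -z guard
split_naive() {
    a=()
    local prev="" cdr="$2"
    [[ -n "$3" && -z "$cdr" ]] && a+=("")
    while [[ "$cdr" != "$prev" ]]; do
        prev="$cdr"
        a+=( "${cdr%%"$1"*}" )
        cdr="${cdr#*"$1"}"
    done
}

split_naive "," "" "";  without_guard=${#a[@]}
split_naive "," "" yes; with_guard=${#a[@]}
echo "without guard: $without_guard element(s); with guard: $with_guard element(s)"
```

With an empty input string the loop body never runs, so the guard is the only thing that can add an element.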

Replacing readarray with while read results in a < 10% slowdown.

This is the gawk implementation I used:

split_byRE_gawk(){
    readarray -td '' a < <(awk '{gsub(/'"$1"'/,"\0")}1' <<<"$2$1")
    unset 'a[-1]'
    # echo $( declare -p a | md5sum; declare -p a )
}

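Obviously, in the general case, the delim argument needs to be sanitized, since gawk expects a regex and gawk-special characters could cause problems. Also, as written, the implementation will not correctly handle newlines in the delimiter.

Since gawk is being used anyway, a generalized version that handles more arbitrary delimiters could be: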

split_byREorStr_gawk(){
    local delim=$1
    local string=$2
    local useRegex=${3:+1}  # if set, delimiter is regex

    readarray -td '' a < <(
        export delim
        gawk -v re="$useRegex" '
            BEGIN {
                RS = FS = "\0"
                ORS = ""
                d = ENVIRON["delim"]

                # cf. https://stackoverflow.com/a/37039138
                if (!re) gsub(/[\\.^$(){}\[\]|*+?]/,"\\\\&",d)
            }
            gsub(d"|\n$","\0")
        ' <<<"$string"
    )
    # echo $( declare -p a | md5sum; declare -p a )
}

Or the same idea in Perl:

split_byREorStr_perl(){
    local delim=$1
    local string=$2
    local regex=$3  # if set, delimiter is regex

    readarray -td '' a < <(
        export delim regex
        perl -0777pe '
            $d = $ENV{delim};
            $d = "\Q$d\E" if ! $ENV{regex};
            s/$d|\n$/\0/g;
        ' <<<"$string"
    )
    # echo $( declare -p a | md5sum; declare -p a )
}

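The two implementations produce identical output, tested by comparing the md5sum of each separately.

Note that if the input is ambiguous ("logically incorrect", as @bgoldst put it), the behavior will differ slightly. For example, with delimiter -- and string a- or a---:

@bgoldst's code returns: declare -a a=([0]="a") or declare -a a=([0]="a" [1]="")
mine returns: declare -a a=([0]="a-") or declare -a a=([0]="a" [1]="-")

The arguments were derived with simple Perl scripts from: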

delim="-=-="
base="ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"

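Here are the tables of timing results (in seconds) for 3 different types of string and delimiter arguments.

#s  - length of the string argument
#d  - length of the delim argument
=   - performance break-even point
!   - "acceptable" performance limit (bash) is somewhere around here
!!  - "acceptable" performance limit (gawk) is somewhere around here
-   - function took too long
<!> - gawk command failed to run

Type 1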

d=$(perl -e "print( '$delim' x (7*2**$n) )")
s=$(perl -e "print( '$delim' x (7*2**$n) . '$base' x (7*2**$n) )")

	n	#s	#d	gawk	b_sub	b_faster	b_naive
	0	252	28	0.002	0.000	0.000	0.000
	1	504	56	0.005	0.000	0.000	0.001
	2	1008	112	0.005	0.001	0.000	0.003
	3	2016	224	0.006	0.001	0.000	0.009
	4	4032	448	0.007	0.002	0.001	0.048
=	5	8064	896	0.014	0.008	0.005	0.377
	6	16128	1792	0.018	0.029	0.017	(2.214)
	7	32256	3584	0.033	0.057	0.039	(15.16)
!	8	64512	7168	0.063	0.214	0.128	-
	9	129024	14336	0.111	(0.826)	(0.602)	-
	10	258048	28672	0.214	(3.383)	(2.652)	-
!!	11	516096	57344	0.430	(13.46)	(11.00)	-
	12	1032192	114688	(0.834)	(58.38)	-	-
	13	2064384	229376	<!>	(228.9)	-	-

Type 2

d=$(perl -e "print( '$delim' x ($n) )")
s=$(perl -e "print( ('$delim' x ($n) . '$base' x $n ) x (2**($n-1)) )")

	n	#s	#d	gawk	b_sub	b_faster	b_naive
	0	0	0	0.003	0.000	0.000	0.000
	1	36	4	0.003	0.000	0.000	0.000
	2	144	8	0.005	0.000	0.000	0.000
	3	432	12	0.005	0.000	0.000	0.000
	4	1152	16	0.005	0.001	0.001	0.002
	5	2880	20	0.005	0.001	0.002	0.003
	6	6912	24	0.006	0.003	0.009	0.014
=	7	16128	28	0.012	0.012	0.037	0.044
	8	36864	32	0.023	0.044	0.167	0.187
!	9	82944	36	0.049	0.192	(0.753)	(0.840)
	10	184320	40	0.097	(0.925)	(3.682)	(4.016)
	11	405504	44	0.204	(4.709)	(18.00)	(19.58)
!!	12	884736	48	0.444	(22.17)	-	-
	13	1916928	52	(1.019)	(102.4)	-	-

Type 3

d=$(perl -e "print( '$delim' x (2**($n-1)) )")
s=$(perl -e "print( ('$delim' x (2**($n-1)) . '$base' x (2**($n-1)) ) x ($n) )")

	n	#s	#d	gawk	b_sub	b_faster	b_naive
	0	0	0	0.000	0.000	0.000	0.000
	1	36	4	0.004	0.000	0.000	0.000
	2	144	8	0.003	0.000	0.000	0.000
	3	432	16	0.003	0.000	0.000	0.000
	4	1152	32	0.005	0.001	0.001	0.002
	5	2880	64	0.005	0.002	0.001	0.003
	6	6912	128	0.006	0.003	0.003	0.014
=	7	16128	256	0.012	0.011	0.010	0.077
	8	36864	512	0.023	0.046	0.046	(0.513)
!	9	82944	1024	0.049	0.195	0.197	(3.850)
	10	184320	2048	0.103	(0.951)	(1.061)	(31.84)
	11	405504	4096	0.222	(4.796)	-	-
!!	12	884736	8192	0.473	(22.88)	-	-
	13	1916928	16384	(1.126)	(105.4)	-	-

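Summary for delimiters of length 1..10

As short delimiters are probably more likely than long ones, summarized below are the results of varying the delimiter length between 1 and 10 (results for 2..9 are mostly omitted as very similar).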

s1=$(perl -e "print( '$d' . '$base' x (7*2**$n) )")
s2=$(perl -e "print( ('$d' . '$base' x $n ) x (2**($n-1)) )")
s3=$(perl -e "print( ('$d' . '$base' x (2**($n-1)) ) x ($n) )")

Bash_sub < gawk

string	n	#s	#d	gawk	b_sub	b_faster	b_naive
s1	10	229377	1	0.131	0.089	1.709	-
s1	10	229386	10	0.142	0.095	1.907	-
s2	8	32896	1	0.022	0.007	0.148	0.168
s2	8	34048	10	0.021	0.021	0.163	0.179
s3	12	786444	1	0.436	0.468	-	-
s3	12	786456	2	0.434	0.317	-	-
s3	12	786552	10	0.438	0.333	-	-

Bash_sub < 0.5s

string	n	#s	#d	gawk	b_sub	b_faster	b_naive
s1	11	458753	1	0.256	0.332	(7.089)	-
s1	11	458762	10	0.269	0.387	(8.003)	-
s2	11	361472	1	0.205	0.283	(14.54)	-
s2	11	363520	3	0.207	0.462	(16.66)	-
s3	12	786444	1	0.436	0.468	-	-
s3	12	786456	2	0.434	0.317	-	-
s3	12	786552	10	0.438	0.333	-	-

Gawk < 0.5s

string	n	#s	#d	gawk	b_sub	b_faster	b_naive
s1	11	458753	1	0.256	0.332	(7.089)	-
s1	11	458762	10	0.269	0.387	(8.003)	-
s2	12	788480	1	0.440	(1.252)	-	-
s2	12	806912	10	0.449	(4.968)	-	-
s3	12	786444	1	0.436	0.468	-	-
s3	12	786456	2	0.434	0.317	-	-
s3	12	786552	10	0.438	0.333	-	-

(I'm not entirely sure why bash_sub with s>160k and d=1 is consistently slower than with d>1 for s3.)

All tests were carried out with bash 5.0.17 on an Intel i7-7500U running xubuntu 20.04.

2022-08-03 17:26:20

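All of the answers to this question are wrong in one way or another.

Wrong answer #1

IFS=', ' read -r -a array <<< "$string"

1: This is a misuse of $IFS. The value of the $IFS variable is not taken as a single variable-length string separator; rather, it is taken as a set of single-character string separators, where each field that read splits off from the input line can be terminated by any character in the set (comma or space, in this example).

Actually, for the real sticklers out there, the full meaning of $IFS is slightly more involved. From the bash manual: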

The shell treats each character of IFS as a delimiter, and splits the results of the other expansions into words using these characters as field terminators. If IFS is unset, or its value is exactly <space><tab><newline>, the default, then sequences of <space>, <tab>, and <newline> at the beginning and end of the results of the previous expansions are ignored, and any sequence of IFS characters not at the beginning or end serves to delimit words. If IFS has a value other than the default, then sequences of the whitespace characters <space>, <tab>, and <newline> are ignored at the beginning and end of the word, as long as the whitespace character is in the value of IFS (an IFS whitespace character). Any character in IFS that is not IFS whitespace, along with any adjacent IFS whitespace characters, delimits a field. A sequence of IFS whitespace characters is also treated as a delimiter. If the value of IFS is null, no word splitting occurs.

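Basically, for non-default non-null values of $IFS, fields can be separated with either (1) a sequence of one or more characters that are all from the set of "IFS whitespace characters" (that is, whichever of <space>, <tab>, and <newline> ("newline" meaning line feed (LF)) are present anywhere in $IFS), or (2) any non-"IFS whitespace character" that is present in $IFS, along with whatever "IFS whitespace characters" surround it in the input line.

For the OP, it's possible that the second separation mode I described in the previous paragraph is exactly what he wants for his input string, but we can be pretty confident that the first separation mode I described is not correct at all. For example, what if his input string was 'Los Angeles, United States, North America'?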

IFS=', ' read -ra a <<<'Los Angeles, United States, North America'; declare -p a;
## declare -a a=([0]="Los" [1]="Angeles" [2]="United" [3]="States" [4]="North" [5]="America")
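
The two separation modes described above can be seen side by side in a small sketch. The inputs and the array names a through d are made up for illustration.

```bash
IFS=', ' read -ra a <<<'x,y'      # a comma alone delimits
IFS=', ' read -ra b <<<'x   y'    # a run of spaces is ONE delimiter
IFS=', ' read -ra c <<<'x ,  y'   # a comma plus adjacent spaces is ONE delimiter
IFS=', ' read -ra d <<<'x,,y'     # two commas are TWO delimiters: an empty field

declare -p a b c d
```

The first three arrays should each hold x and y, while d should hold x, an empty string, and y.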

2: Even if you were to use this solution with a single-character separator (such as a comma by itself, that is, with no following space or other baggage), if the value of the $string variable happens to contain any LFs, then read will stop processing once it encounters the first LF. The read builtin only processes one line per invocation. This is true even if you are piping or redirecting input only to the read statement, as we are doing in this example with the here-string mechanism, and thus unprocessed input is guaranteed to be lost. The code that powers the read builtin has no knowledge of the data flow within its containing command structure.

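You could argue that this is unlikely to cause a problem, but it is still a subtle hazard that should be avoided if possible. It is caused by the fact that the read builtin actually does two levels of input splitting: first into lines, then into fields. Since the OP only wants one level of splitting, this usage of the read builtin is not appropriate, and we should avoid it.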
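
A minimal sketch of that hazard, using a made-up input with an embedded LF:

```bash
string=$'Paris,France\nEurope,Earth'

# read consumes only the first line; everything after the LF is lost.
IFS=',' read -ra fields <<<"$string"

declare -p fields
```

Only Paris and France end up in the array; the second line is silently discarded.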

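3: A non-obvious potential issue with this solution is that read always drops the trailing field if it is empty, although it preserves empty fields otherwise. Here's a demo: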

string=', , a, , b, c, , , '; IFS=', ' read -ra a <<<"$string"; declare -p a;
## declare -a a=([0]="" [1]="" [2]="a" [3]="" [4]="b" [5]="c" [6]="" [7]="")

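Maybe the OP wouldn't care about this, but it's still a limitation worth knowing about. It reduces the robustness and generality of the solution.

This problem can be solved by appending a dummy trailing delimiter to the input string just before feeding it to read, as I will demonstrate later.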
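
As a quick preview of that fix, here is a sketch with a single-character delimiter and a made-up input whose last field is empty; the array names plain and padded are illustrative.

```bash
string='a,b,'                           # the last field is empty

IFS=',' read -ra plain  <<<"$string"    # trailing empty field is dropped
IFS=',' read -ra padded <<<"$string,"   # dummy delimiter preserves it

echo "plain: ${#plain[@]} fields, padded: ${#padded[@]} fields"
```

The padded form should report one field more than the plain form, the extra one being the empty trailing field.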

Wrong answer #2

string="1:2:3:4:5"
set -f                     # avoid globbing (expansion of *).
array=(${string//:/ })

Similar idea:

t="one,two,three"
a=($(echo $t | tr ',' "\n"))

(Note: I added the missing parentheses around the command substitution, which the answerer seems to have omitted.)

Similar idea:

string="1,2,3,4"
array=(`echo $string | sed 's/,/\n/g'`)

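These solutions leverage word splitting in an array assignment to split the string into fields. Funnily enough, just like read, general word splitting also uses the $IFS special variable, although in this case it is implied that it is set to its default value of <space><tab><newline>, and therefore any sequence of one or more IFS characters (which are all whitespace characters now) is considered to be a field delimiter.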

This solves the problem of two levels of splitting committed by read, since word splitting by itself constitutes only one level of splitting. But just as before, the problem here is that the individual fields in the input string can already contain $IFS characters, and thus they would be improperly split during the word splitting operation. This happens to not be the case for any of the sample input strings provided by these answerers (how convenient...), but of course that doesn't change the fact that any code base that used this idiom would then run the risk of blowing up if this assumption were ever violated at some point down the line. Once again, consider my counterexample of 'Los Angeles, United States, North America' (or 'Los Angeles:United States:North America').

Also, word splitting is normally followed by filename expansion (aka pathname expansion aka globbing), which, if done, would potentially corrupt words containing the characters *, ?, or [ followed by ] (and, if extglob is set, parenthesized fragments preceded by ?, *, +, @, or !) by matching them against file system objects and expanding the words ("globs") accordingly. The first of these three answerers has cleverly undercut this problem by running set -f beforehand to disable globbing. Technically this works (although you should probably add set +f afterward to reenable globbing for subsequent code which may depend on it), but it's undesirable to have to mess with global shell settings in order to hack a basic string-to-array parsing operation in local code.

Another problem with this answer is that all empty fields will be lost. This may or may not be a problem, depending on the application.

Note: If you're going to use this solution, it's better to use the ${string//:/ } "pattern substitution" form of parameter expansion, rather than going to the trouble of invoking a command substitution (which forks the shell), starting up a pipeline, and running an external executable (tr or sed), since parameter expansion is purely a shell-internal operation. (Also, for the tr and sed solutions, the input variable should be double-quoted inside the command substitution; otherwise word splitting would take effect in the echo command and potentially mess with the field values. Also, the $(...) form of command substitution is preferable to the old `...` form since it simplifies nesting of command substitutions and allows for better syntax highlighting by text editors.)
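
The globbing hazard and the set -f workaround can be reproduced with a sketch like the following. It assumes mktemp is available and creates two throwaway files (file1 and file2, names chosen only for this demo) so the glob has something predictable to match.

```bash
# Work in a scratch directory so * matches exactly two known files.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
touch file1 file2

string='a:*:b'

unsafe=(${string//:/ })   # the * field is replaced by the file names

set -f                    # disable globbing
safe=(${string//:/ })
set +f                    # re-enable it for later code

declare -p unsafe safe

cd / && rm -rf "$tmp"
```

The unsafe array should contain four elements (a, file1, file2, b), while safe keeps the literal * as its second element.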

Wrong answer #3

str="a, b, c, d"  # assuming there is a space after ',' as in Q
arr=(${str//,/})  # delete all occurrences of ','

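This answer is almost the same as #2. The difference is that the answerer has assumed that the fields are delimited by two characters, one of which is represented in the default $IFS and the other not. He has solved this rather specific case by removing the non-IFS-represented character using a pattern substitution expansion, and then using word splitting to split the fields on the surviving IFS-represented delimiter character.

This is not a very generic solution. Furthermore, it can be argued that the comma is really the "primary" delimiter character here, and that stripping it and then depending on the space character for field splitting is simply wrong. Once again, consider my counterexample: 'Los Angeles, United States, North America'.

Also, again, filename expansion could corrupt the expanded words, but this can be prevented by temporarily disabling globbing for the assignment with set -f and then set +f.

Also, again, all empty fields will be lost, which may or may not be a problem depending on the application.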
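
Applying this answer's idiom to the counterexample makes the problem concrete. This is a sketch only; set -f is added so that globbing cannot interfere with the demonstration.

```bash
str='Los Angeles, United States, North America'

set -f
arr=(${str//,/})   # commas removed, then split on the remaining spaces
set +f

declare -p arr
```

Instead of three fields, the array ends up with six single words, because the spaces inside each field are treated as delimiters too.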

Wrong answer #4

string='first line
second line
third line'

oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1} # this is useful to format your code with tabs
lines=( $string )
IFS="$oldIFS"

This is similar to #2 and #3 in that it uses word splitting to get the job done, only now the code explicitly sets $IFS to contain only the single-character field delimiter present in the input string. It should be repeated that this cannot work for multicharacter field delimiters such as the OP's comma-space delimiter. But for a single-character delimiter like the LF used in this example, it actually comes close to being perfect. The fields cannot be unintentionally split in the middle as we saw with previous wrong answers, and there is only one level of splitting, as required.

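One problem is that filename expansion will corrupt affected words as described earlier, although once again this can be solved by wrapping the critical statement in set -f and set +f.

Another potential problem is that, since LF qualifies as an "IFS whitespace character" as defined earlier, all empty fields will be lost, just as in #2 and #3. This would of course not be a problem if the delimiter happens to be a non-"IFS whitespace character", and depending on the application it may not matter anyway, but it does undermine the generality of the solution.

So, to sum up, assuming you have a one-character delimiter, and it is either a non-"IFS whitespace character" or you don't care about empty fields, and you wrap the critical statement in set -f and set +f, then this solution works, but otherwise not.

(Also, for information's sake, assigning a LF to a variable in bash can be done more easily with the $'...' syntax, e.g. IFS=$'\n';.)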
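
Putting those remarks together, a tidied form of this answer might look like the sketch below. The input is made up: it contains a glob character and an empty line so that both effects are visible.

```bash
string=$'first line\nsecond *\n\nthird line'

oldIFS=$IFS
IFS=$'\n'          # LF only, written with the $'...' syntax
set -f             # keep the * from being expanded
lines=( $string )
set +f
IFS=$oldIFS

declare -p lines
```

The * survives literally, but the empty line between the second and third entries is lost, so the array has three elements rather than four.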

Wrong answer #5

countries='Paris, France, Europe'
OIFS="$IFS"
IFS=', ' array=($countries)
IFS="$OIFS"

Similar idea:

IFS=', ' eval 'array=($string)'

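This solution is effectively a cross between #1 (in that it sets $IFS to comma-space) and #2-4 (in that it uses word splitting to split the string into fields). Because of this, it suffers from most of the problems that afflict all of the above wrong answers, sort of like the worst of all worlds.

Also, regarding the second variant, it may seem like the eval call is completely unnecessary, since its argument is a single-quoted string literal and is therefore statically known. But there is actually a very non-obvious benefit to using eval in this way. Normally, when you run a simple command that consists of a variable assignment only, meaning without an actual command word following it, the assignment takes effect in the shell environment: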

IFS=', '; ## changes $IFS in the shell environment

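This is true even if the simple command involves multiple variable assignments; again, as long as there is no command word, all variable assignments affect the shell environment: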

IFS=', ' array=($countries); ## changes both $IFS and $array in the shell environment

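But if the variable assignment is attached to a command name (I like to call this a "prefix assignment"), then it does not affect the shell environment; instead it only affects the environment of the executed command, regardless of whether it is a builtin or external: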

IFS=', ' :; ## : is a builtin command, the $IFS assignment does not outlive it
IFS=', ' env; ## env is an external command, the $IFS assignment does not outlive it

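Relevant quote from the bash manual:

If no command name results, the variable assignments affect the current shell environment. Otherwise, the variables are added to the environment of the executed command and do not affect the current shell environment.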

It is possible to exploit this feature of variable assignment to change $IFS only temporarily, which allows us to avoid the whole save-and-restore gambit like that which is being done with the $OIFS variable in the first variant. But the challenge we face here is that the command we need to run is itself a mere variable assignment, and hence it would not involve a command word to make the $IFS assignment temporary. You might think to yourself, well why not just add a no-op command word to the statement like the : builtin to make the $IFS assignment temporary? This does not work because it would then make the $array assignment temporary as well:

IFS=', ' array=($countries) :; ## fails; new $array value never escapes the : command

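So, we're effectively at an impasse, a bit of a catch-22. But when eval runs its code, it runs it in the shell environment, as if it were normal, static source code, and therefore we can run the $array assignment inside the eval argument to have it take effect in the shell environment, while the $IFS prefix assignment that is attached to the eval command will not outlive the eval command. This is exactly the trick that is being used in the second variant of this solution: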

IFS=', ' eval 'array=($string)'; ## $IFS does not outlive the eval command, but $array does

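So, as you can see, it's actually quite a clever trick, and it accomplishes exactly what is required (at least with respect to the effect of the assignments) in a rather non-obvious way. I'm actually not against this trick in general, despite the involvement of eval; just be careful to single-quote the argument string to guard against security threats.

But again, because of the "worst of all worlds" agglomeration of problems, this is still a wrong answer to the OP's requirement.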
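
The scoping behaviour of the eval trick can be observed with a sketch like the one below, which uses a single-character delimiter to sidestep the other problems. One caveat worth checking in your own environment: when bash runs in POSIX mode (for example when invoked as sh), assignments placed before special builtins such as eval persist, so the IFS change would leak there.

```bash
string='Paris,France,Europe'
before=$IFS

# IFS is changed only for the duration of eval; array is assigned normally.
IFS=',' eval 'array=($string)'

declare -p array
if [[ $IFS == "$before" ]]; then
    echo "IFS unchanged after eval"
else
    echo "IFS leaked (POSIX mode?)"
fi
```

The argument to eval is single-quoted, so $string is expanded by eval itself and no untrusted text is ever parsed as code.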

Wrong answer #6

IFS=', '; array=(Paris, France, Europe)

IFS=' ';declare -a array=(Paris France Europe)

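Um... what? The OP has a string variable that needs to be parsed into an array. This "answer" starts with the verbatim contents of the input string pasted into an array literal. I guess that's one way to do it.

It looks like the answerer may have assumed that the $IFS variable affects all bash parsing in all contexts, which is not true. From the bash manual:

IFS    The Internal Field Separator that is used for word splitting after expansion and to split lines into words with the read builtin command. The default value is <space><tab><newline>.

So the $IFS special variable is actually only used in two contexts: (1) word splitting that is performed after expansion (meaning not when parsing bash source code) and (2) splitting input lines into words by the read builtin.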

Let me try to make this clearer. I think it might be good to draw a distinction between parsing and execution. Bash must first parse the source code, which obviously is a parsing event, and then later it executes the code, which is when expansion comes into the picture. Expansion is really an execution event. Furthermore, I take issue with the description of the $IFS variable that I just quoted above; rather than saying that word splitting is performed after expansion, I would say that word splitting is performed during expansion, or, perhaps even more precisely, word splitting is part of the expansion process. The phrase "word splitting" refers only to this step of expansion; it should never be used to refer to the parsing of bash source code, although unfortunately the docs do seem to throw around the words "split" and "words" a lot. Here's a relevant excerpt from the linux.die.net version of the bash manual:

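Expansion is performed on the command line after it has been split into words. There are seven kinds of expansion performed: brace expansion, tilde expansion, parameter and variable expansion, command substitution, arithmetic expansion, word splitting, and pathname expansion. The order of expansions is: brace expansion; tilde expansion, parameter and variable expansion, arithmetic expansion, and command substitution (done in a left-to-right fashion); word splitting; and pathname expansion.

You could argue that the GNU version of the manual does slightly better, since it opts for the word "tokens" instead of "words" in the first sentence of the Expansion section:

Expansion is performed on the command line after it has been split into tokens.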

The important point is, $IFS does not change the way bash parses source code. Parsing of bash source code is actually a very complex process that involves recognition of the various elements of shell grammar, such as command sequences, command lists, pipelines, parameter expansions, arithmetic substitutions, and command substitutions. For the most part, the bash parsing process cannot be altered by user-level actions like variable assignments (actually, there are some minor exceptions to this rule; for example, see the various compatxx shell settings, which can change certain aspects of parsing behavior on-the-fly). The upstream "words"/"tokens" that result from this complex parsing process are then expanded according to the general process of "expansion" as broken down in the above documentation excerpts, where word splitting of the expanded (expanding?) text into downstream words is simply one step of that process. Word splitting only touches text that has been spit out of a preceding expansion step; it does not affect literal text that was parsed right off the source bytestream.
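
The distinction between literal source text and the result of an expansion can be checked directly. In this sketch the array names literal and expanded are illustrative.

```bash
IFS=','

literal=(Paris,France,Europe)   # literal source text: $IFS plays no part
string='Paris,France,Europe'
expanded=($string)              # result of an expansion: split on $IFS

unset IFS                       # back to default splitting behaviour

echo "literal: ${#literal[@]} element(s), expanded: ${#expanded[@]} element(s)"
```

The literal form yields a single element containing the commas, while the unquoted expansion yields three.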

Wrong answer #7

string='first line
        second line
        third line'

while read -r line; do lines+=("$line"); done <<<"$string"

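This is one of the best solutions. Notice that we're back to using read. Didn't I say earlier that read is inappropriate because it performs two levels of splitting when we only need one? The trick here is that you can call read in such a way that it effectively only does one level of splitting, specifically by splitting off only one field per invocation, which comes at the cost of having to call it repeatedly in a loop. It's a bit of a sleight of hand, but it works.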

But there are problems. First: When you provide at least one NAME argument to read, it automatically ignores leading and trailing whitespace in each field that is split off from the input string. This occurs whether $IFS is set to its default value or not, as described earlier in this post. Now, the OP may not care about this for his specific use-case, and in fact, it may be a desirable feature of the parsing behavior. But not everyone who wants to parse a string into fields will want this. There is a solution, however: A somewhat non-obvious usage of read is to pass zero NAME arguments. In this case, read will store the entire input line that it gets from the input stream in a variable named $REPLY, and, as a bonus, it does not strip leading and trailing whitespace from the value. This is a very robust usage of read which I've exploited frequently in my shell programming career. Here's a demonstration of the difference in behavior:

string=$'  a  b  \n  c  d  \n  e  f  '; ## input string

a=(); while read -r line; do a+=("$line"); done <<<"$string"; declare -p a;
## declare -a a=([0]="a  b" [1]="c  d" [2]="e  f") ## read trimmed surrounding whitespace

a=(); while read -r; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]="  a  b  " [1]="  c  d  " [2]="  e  f  ") ## no trimming

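The second issue with this solution is that it does not actually address the case of a custom field separator, such as the OP's comma-space. As before, multicharacter separators are not supported, which is an unfortunate limitation of this solution. We could try to at least split on comma by specifying the separator to the -d option, but look what happens: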

string='Paris, France, Europe';
a=(); while read -rd,; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France")

Predictably, the unaccounted surrounding whitespace got pulled into the field values, and hence this would have to be corrected subsequently through trimming operations (this could also be done directly in the while-loop). But there's another obvious error: Europe is missing! What happened to it? The answer is that read returns a failing return code if it hits end-of-file (in this case we can call it end-of-string) without encountering a final field terminator on the final field. This causes the while-loop to break prematurely and we lose the final field.

Technically this same error afflicted the previous examples as well; the difference there is that the field separator was taken to be LF, which is the default when you don't specify the -d option, and the <<< ("here-string") mechanism automatically appends a LF to the string just before it feeds it as input to the command. Hence, in those cases, we sort of accidentally solved the problem of a dropped final field by unwittingly appending an additional dummy terminator to the input. Let's call this solution the "dummy-terminator" solution. We can apply the dummy-terminator solution manually for any custom delimiter by concatenating it against the input string ourselves when instantiating it in the here-string:

a=(); while read -rd,; do a+=("$REPLY"); done <<<"$string,"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe")

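There, problem solved. Another solution is to only break the while-loop if both (1) read returned failure and (2) $REPLY is empty, meaning read was not able to read any characters before hitting end-of-file. Demo: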

a=(); while read -rd,|| [[ -n "$REPLY" ]]; do a+=("$REPLY"); done <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=$' Europe\n')

This approach also reveals the secretive LF that automatically gets appended to the here-string by the <<< redirection operator. It could of course be stripped off separately through an explicit trimming operation as described a moment ago, but obviously the manual dummy-terminator approach solves it directly, so we could just go with that. The manual dummy-terminator solution is actually quite convenient in that it solves both of these two problems (the dropped-final-field problem and the appended-LF problem) in one go.

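So, overall, this is quite a powerful solution. Its only remaining weakness is a lack of support for multicharacter delimiters, which I will address later.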
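
Combining the dummy-terminator trick with trimming inside the loop gives a compact form of this approach. This is a sketch that assumes at most one space follows each comma; stripping arbitrary whitespace would need a more general trim.

```bash
string='Paris, France, Europe'

a=()
while read -rd,; do
    a+=("${REPLY# }")      # drop the single space left after each comma
done <<<"$string,"         # dummy terminator so the last field is not lost

declare -p a
```

After the last comma only the LF appended by the here-string remains, so the final read fails and the loop ends without adding a stray element.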

Wrong answer #8

string='first line
        second line
        third line'

readarray -t lines <<<"$string"

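(This is actually from the same post as #7; the answerer provided two solutions in the same post.)

The readarray builtin, which is a synonym for mapfile, is ideal. It's a builtin command which parses a bytestream into an array variable in one shot; no messing with loops, conditionals, substitutions, or anything else. And it doesn't surreptitiously strip any whitespace from the input string. And (if -O is not given) it conveniently clears the target array before assigning to it. But it's still not perfect, hence my criticism of it as a "wrong answer".

First, just to get this out of the way, note that, just like the behavior of read when doing field-parsing, readarray drops the trailing field if it is empty. Again, this is probably not a concern for the OP, but it could be for some use-cases. I'll come back to this in a moment.

Second, as before, it does not support multicharacter delimiters. I'll give a fix for this in a moment as well.

Third, the solution as written does not parse the OP's input string, and in fact, it cannot be used as-is to parse it. I'll expand on this momentarily as well.

For the above reasons, I still consider this to be a "wrong answer" to the OP's question. Below I'll give what I consider to be the right answer.

Right answer

Here's a naive attempt to make #8 work by just specifying the -d option: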

string='Paris, France, Europe';
readarray -td, a <<<"$string"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=$' Europe\n')

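We see that the result is identical to the one we got from the double-conditional approach of the looping read solution discussed in #7. We can almost solve this with the manual dummy-terminator trick: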

readarray -td, a <<<"$string,"; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe" [3]=$'\n')

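The problem here is that readarray preserved the trailing field, since the <<< redirection operator appended the LF to the input string, and therefore the trailing field was not empty (otherwise it would have been dropped). We can take care of this by explicitly unsetting the final array element after the fact: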

readarray -td, a <<<"$string,"; unset 'a[-1]'; declare -p a;
## declare -a a=([0]="Paris" [1]=" France" [2]=" Europe")

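The only two problems that remain, which are actually related, are (1) the extraneous whitespace that needs to be trimmed, and (2) the lack of support for multicharacter delimiters.

The whitespace could of course be trimmed afterward (for example, see How to trim whitespace from a Bash variable?). But if we can hack a multicharacter delimiter, then that would solve both problems in one shot.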

Unfortunately, there's no direct way to get a multicharacter delimiter to work. The best solution I've thought of is to preprocess the input string to replace the multicharacter delimiter with a single-character delimiter that will be guaranteed not to collide with the contents of the input string. The only character that has this guarantee is the NUL byte. This is because, in bash (though not in zsh, incidentally), variables cannot contain the NUL byte. This preprocessing step can be done inline in a process substitution. Here's how to do it using awk:

readarray -td '' a < <(awk '{ gsub(/, /,"\0"); print; }' <<<"$string, "); unset 'a[-1]';
declare -p a;
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")

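There, finally! This solution will not erroneously split fields in the middle, will not cut out prematurely, will not drop empty fields, will not corrupt itself on filename expansions, will not automatically strip leading and trailing whitespace, will not leave a stowaway LF on the end, does not require loops, and does not settle for a single-character delimiter.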
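
If calling an external awk is undesirable, the same preprocessing idea can be sketched in pure bash, at the price of a weaker guarantee. The variant below is not the answer's method: it swaps NUL for the ASCII unit separator (0x1F) and therefore only works under the assumption that this byte never occurs in the data. It also assumes bash 4.4 or later for readarray -d. The variable name sep is illustrative.

```bash
string='Paris, France, Europe'
sep=$'\x1f'   # ASSUMPTION: this byte never appears in the input

# Replace every ", " with the sentinel and append one more as a terminator.
# printf is used instead of <<< so that no LF is appended to the input.
readarray -td "$sep" a < <(printf '%s' "${string//, /$sep}$sep")

declare -p a
```

Because the input ends with the sentinel and nothing follows it, there is no extra trailing element to unset.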

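Trimming solution

Lastly, I wanted to demonstrate my own fairly intricate trimming solution using the obscure -C callback option of readarray. Unfortunately, I've run out of room against Stack Overflow's draconian 30,000 character post limit, so I won't be able to explain it. I'll leave that as an exercise for the reader.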

function mfcb { local val="$4"; "$1"; eval "$2[$3]=\$val;"; };
function val_ltrim { if [[ "$val" =~ ^[[:space:]]+ ]]; then val="${val:${#BASH_REMATCH[0]}}"; fi; };
function val_rtrim { if [[ "$val" =~ [[:space:]]+$ ]]; then val="${val:0:${#val}-${#BASH_REMATCH[0]}}"; fi; };
function val_trim { val_ltrim; val_rtrim; };
readarray -c1 -C 'mfcb val_trim a' -td, <<<"$string,"; unset 'a[-1]'; declare -p a;
## declare -a a=([0]="Paris" [1]="France" [2]="Europe")

2017-07-19 21:20:22

For multi-line elements, why not something like:

$ array=($(echo -e $'a a\nb b' | tr ' ' '§')) && array=("${array[@]//§/ }") && echo "${array[@]/%/ INTERELEMENT}"

a a INTERELEMENT b b INTERELEMENT

2020-05-09 16:42:31
