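In a Bash script, I would like to split a line into pieces and store them in an array.
For example, given the line:
Paris, France, Europe
I would like the resulting array to look like this:
array[0] = Paris
array[1] = France
array[2] = Europe
A simple implementation is preferable; speed does not matter. How can I do it?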
Current answer
Try this:
IFS=', '; array=(Paris, France, Europe)
for item in ${array[@]}; do echo $item; done
It's simple. If you want, you can also add a declare (and remove the commas):
IFS=' ';declare -a array=(Paris France Europe)
The IFS assignment is there to undo the one above; in a fresh bash instance it works without it.
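Changing IFS globally, as above, affects every later word split in the script. A minimal sketch of one way to keep the change contained, assuming the line is held in a variable; `split_words` and `parts` are hypothetical names chosen for this illustration and are not part of the answer:

```bash
#!/usr/bin/env bash
# split_words and parts are hypothetical names used only for this sketch.
split_words() {
    local IFS=', '   # local: the caller's IFS is untouched after return
    set -f           # disable globbing so a field such as * stays literal
    parts=( $1 )     # unquoted on purpose: word splitting applies IFS
    set +f           # note: re-enables globbing unconditionally
}

split_words "Paris, France, Europe"
printf '%s\n' "${parts[@]}"
```

With this input the array should hold the three fields Paris, France and Europe, without the commas.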
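Other answers
I was curious about the relative performance of the "right answer" in @bgoldst's popular answer, with its apparent decrying of loops, so I ran a simple benchmark of it against three pure bash implementations.
In summary, I suggest:
- for string length < 4k or so, pure bash is faster than gawk
- for delimiter length < 10 and string length < 256k, pure bash is comparable to gawk
- for delimiter length >> 10 and string length < 64k or so, pure bash is "acceptable", and gawk is less than 5x faster
- for string length < 512k or so, gawk is "acceptable"
I arbitrarily define "acceptable" as "takes < 0.5s to split the string".
I take the problem to be: take a bash string and split it into a bash array, using an arbitrary-length delimiter string (not a regex).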
# in: $1=delim, $2=string
# out: sets array a
My pure bash implementations are:
# naive approach - slow
split_byStr_bash_naive(){
a=()
local prev=""
local cdr="$2"
[[ -z "${cdr}" ]] && a+=("")
while [[ "$cdr" != "$prev" ]]; do
prev="$cdr"
a+=( "${cdr%%"$1"*}" )
cdr="${cdr#*"$1"}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
# use lengths wherever possible - faster
split_byStr_bash_faster(){
a=()
local car=""
local cdr="$2"
while
car="${cdr%%"$1"*}"
a+=("$car")
cdr="${cdr:${#car}}"
(( ${#cdr} ))
do
cdr="${cdr:${#1}}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
# use pattern substitution and readarray - fastest
split_byStr_bash_sub(){
a=()
local delim="$1" string="$2"
delim="${delim//=/=-}"
delim="${delim//$'\n'/=n}"
string="${string//=/=-}"
string="${string//$'\n'/=n}"
readarray -td $'\n' a <<<"${string//"$delim"/$'\n'}"
local len=${#a[@]} i s
for (( i=0; i<len; i++ )); do
s="${a[$i]//=n/$'\n'}"
a[$i]="${s//=-/=}"
done
# echo $( declare -p a | md5sum; declare -p a )
}
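In the naive version, the initial -z test handles the case of a zero-length string being passed. Without the test, the output array is empty; with it, the array has a single zero-length element.
Replacing readarray with while read results in a < 10% slowdown.
This is the gawk implementation I used: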
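For reference, a sketch of what the while read variant mentioned above might look like. It keeps the escaping scheme of split_byStr_bash_sub and changes only the reading step; the function name `split_byStr_bash_sub_read` is made up here, and this is not necessarily the exact code that was benchmarked:

```bash
#!/usr/bin/env bash
# in: $1=delim, $2=string
# out: sets array a
# Hypothetical variant of split_byStr_bash_sub using while read.
split_byStr_bash_sub_read(){
    a=()
    local delim="$1" string="$2" s
    # escape "=" and newlines so a newline can serve as the field separator
    delim="${delim//=/=-}"
    delim="${delim//$'\n'/=n}"
    string="${string//=/=-}"
    string="${string//$'\n'/=n}"
    # one line per field; undo the escaping while appending
    while IFS= read -r s; do
        s="${s//=n/$'\n'}"
        a+=( "${s//=-/=}" )
    done <<<"${string//"$delim"/$'\n'}"
}

split_byStr_bash_sub_read "-=-=" "A-=-=B=C-=-=D"
declare -p a
```

For this input the array should contain the three elements A, B=C and D.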
split_byRE_gawk(){
readarray -td '' a < <(awk '{gsub(/'"$1"'/,"\0")}1' <<<"$2$1")
unset 'a[-1]'
# echo $( declare -p a | md5sum; declare -p a )
}
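Obviously, in the general case, the delim argument needs to be sanitised, since gawk expects a regex and gawk-special characters could cause problems. Also, as written, the implementation won't correctly handle newlines in the delimiter.
Since gawk is being used anyway, a generic version that handles more arbitrary delimiters could be: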
split_byREorStr_gawk(){
local delim=$1
local string=$2
local useRegex=${3:+1} # if set, delimiter is regex
readarray -td '' a < <(
export delim
gawk -v re="$useRegex" '
BEGIN {
RS = FS = "\0"
ORS = ""
d = ENVIRON["delim"]
# cf. https://stackoverflow.com/a/37039138
if (!re) gsub(/[\\.^$(){}\[\]|*+?]/,"\\\\&",d)
}
gsub(d"|\n$","\0")
' <<<"$string"
)
# echo $( declare -p a | md5sum; declare -p a )
}
Or the same idea in Perl:
split_byREorStr_perl(){
local delim=$1
local string=$2
local regex=$3 # if set, delimiter is regex
readarray -td '' a < <(
export delim regex
perl -0777pe '
$d = $ENV{delim};
$d = "\Q$d\E" if ! $ENV{regex};
s/$d|\n$/\0/g;
' <<<"$string"
)
# echo $( declare -p a | md5sum; declare -p a )
}
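The two implementations produce identical output, tested by comparing the md5sum of each.
Note that if the input had been ambiguous ("logically incorrect", as @bgoldst puts it), the behaviour would diverge slightly. For example, with delimiter -- and string a- or a---:
@bgoldst's code returns: declare -a a=([0]="a") or declare -a a=([0]="a" [1]="")
mine return: declare -a a=([0]="a-") or declare -a a=([0]="a" [1]="-")
Arguments were derived with simple Perl scripts from: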
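A minimal sketch of how the pure-bash approach treats delimiter -- with the strings a- and a---. The loop is an inlined copy of the logic in split_byStr_bash_faster, rearranged for illustration only:

```bash
#!/usr/bin/env bash
delim="--"
for string in "a-" "a---"; do
    a=()
    cdr=$string
    while :; do
        car=${cdr%%"$delim"*}      # text before the first delimiter (or everything)
        a+=( "$car" )
        cdr=${cdr:${#car}}         # drop the piece just stored
        (( ${#cdr} )) || break     # nothing left: done
        cdr=${cdr:${#delim}}       # drop the delimiter itself
    done
    declare -p a
done
```

A lone trailing - is kept as data rather than treated as part of a delimiter, which is why a- stays a single element and a--- splits into a and -.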
delim="-=-="
base="ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"
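Here are the tables of timing results (in seconds) for 3 different types of string and delimiter argument.
#s - length of string argument
#d - length of delim argument
= - performance break-even point
! - "acceptable" performance limit (bash) is somewhere around here
!! - "acceptable" performance limit (gawk) is somewhere around here
- - function took too long
<!> - gawk command failed to run
Type 1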
d=$(perl -e "print( '$delim' x (7*2**$n) )")
s=$(perl -e "print( '$delim' x (7*2**$n) . '$base' x (7*2**$n) )")
n | #s | #d | gawk | b_sub | b_faster | b_naive | mark |
---|---|---|---|---|---|---|---|
0 | 252 | 28 | 0.002 | 0.000 | 0.000 | 0.000 | |
1 | 504 | 56 | 0.005 | 0.000 | 0.000 | 0.001 | |
2 | 1008 | 112 | 0.005 | 0.001 | 0.000 | 0.003 | |
3 | 2016 | 224 | 0.006 | 0.001 | 0.000 | 0.009 | |
4 | 4032 | 448 | 0.007 | 0.002 | 0.001 | 0.048 | |
5 | 8064 | 896 | 0.014 | 0.008 | 0.005 | 0.377 | = |
6 | 16128 | 1792 | 0.018 | 0.029 | 0.017 | (2.214) | |
7 | 32256 | 3584 | 0.033 | 0.057 | 0.039 | (15.16) | |
8 | 64512 | 7168 | 0.063 | 0.214 | 0.128 | - | ! |
9 | 129024 | 14336 | 0.111 | (0.826) | (0.602) | - | |
10 | 258048 | 28672 | 0.214 | (3.383) | (2.652) | - | |
11 | 516096 | 57344 | 0.430 | (13.46) | (11.00) | - | !! |
12 | 1032192 | 114688 | (0.834) | (58.38) | - | - | |
13 | 2064384 | 229376 | <!> | (228.9) | - | - | |
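As a cross-check on the #s and #d columns, the Type 1 arguments can also be built without Perl. This is only a sketch: `repeat` is a hypothetical helper introduced here, and the benchmark itself used the Perl one-liners shown above.

```bash
#!/usr/bin/env bash
delim="-=-="
base="ABCDEFGHIJKLMNOPQRSTUVWXYZ012345"

# repeat STR COUNT: hypothetical helper, prints STR COUNT times
repeat() {
    local out="" i
    for (( i = 0; i < $2; i++ )); do out+="$1"; done
    printf '%s' "$out"
}

n=0
count=$(( 7 * 2**n ))                # same factor as in the Perl one-liners
d=$(repeat "$delim" "$count")        # delimiter argument
s="$d$(repeat "$base" "$count")"     # string argument
echo "n=$n #s=${#s} #d=${#d}"        # n=0 #s=252 #d=28
```

For n=0 this gives a 28-character delimiter and a 252-character string, matching the first row of the Type 1 table.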
Type 2
d=$(perl -e "print( '$delim' x ($n) )")
s=$(perl -e "print( ('$delim' x ($n) . '$base' x $n ) x (2**($n-1)) )")
n | #s | #d | gawk | b_sub | b_faster | b_naive | mark |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0.003 | 0.000 | 0.000 | 0.000 | |
1 | 36 | 4 | 0.003 | 0.000 | 0.000 | 0.000 | |
2 | 144 | 8 | 0.005 | 0.000 | 0.000 | 0.000 | |
3 | 432 | 12 | 0.005 | 0.000 | 0.000 | 0.000 | |
4 | 1152 | 16 | 0.005 | 0.001 | 0.001 | 0.002 | |
5 | 2880 | 20 | 0.005 | 0.001 | 0.002 | 0.003 | |
6 | 6912 | 24 | 0.006 | 0.003 | 0.009 | 0.014 | |
7 | 16128 | 28 | 0.012 | 0.012 | 0.037 | 0.044 | = |
8 | 36864 | 32 | 0.023 | 0.044 | 0.167 | 0.187 | |
9 | 82944 | 36 | 0.049 | 0.192 | (0.753) | (0.840) | ! |
10 | 184320 | 40 | 0.097 | (0.925) | (3.682) | (4.016) | |
11 | 405504 | 44 | 0.204 | (4.709) | (18.00) | (19.58) | |
12 | 884736 | 48 | 0.444 | (22.17) | - | - | !! |
13 | 1916928 | 52 | (1.019) | (102.4) | - | - | |
Type 3
d=$(perl -e "print( '$delim' x (2**($n-1)) )")
s=$(perl -e "print( ('$delim' x (2**($n-1)) . '$base' x (2**($n-1)) ) x ($n) )")
n | #s | #d | gawk | b_sub | b_faster | b_naive | mark |
---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0.000 | 0.000 | 0.000 | 0.000 | |
1 | 36 | 4 | 0.004 | 0.000 | 0.000 | 0.000 | |
2 | 144 | 8 | 0.003 | 0.000 | 0.000 | 0.000 | |
3 | 432 | 16 | 0.003 | 0.000 | 0.000 | 0.000 | |
4 | 1152 | 32 | 0.005 | 0.001 | 0.001 | 0.002 | |
5 | 2880 | 64 | 0.005 | 0.002 | 0.001 | 0.003 | |
6 | 6912 | 128 | 0.006 | 0.003 | 0.003 | 0.014 | |
7 | 16128 | 256 | 0.012 | 0.011 | 0.010 | 0.077 | = |
8 | 36864 | 512 | 0.023 | 0.046 | 0.046 | (0.513) | |
9 | 82944 | 1024 | 0.049 | 0.195 | 0.197 | (3.850) | ! |
10 | 184320 | 2048 | 0.103 | (0.951) | (1.061) | (31.84) | |
11 | 405504 | 4096 | 0.222 | (4.796) | - | - | |
12 | 884736 | 8192 | 0.473 | (22.88) | - | - | !! |
13 | 1916928 | 16384 | (1.126) | (105.4) | - | - | |
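Summary for delimiters of length 1..10
Since short delimiters are probably more common than long ones, the results for delimiter lengths between 1 and 10 are summarised below (results for 2..9 are mostly omitted as they are very similar).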
s1=$(perl -e "print( '$d' . '$base' x (7*2**$n) )")
s2=$(perl -e "print( ('$d' . '$base' x $n ) x (2**($n-1)) )")
s3=$(perl -e "print( ('$d' . '$base' x (2**($n-1)) ) x ($n) )")
Bash_sub < gawk
string | n | #s | #d | gawk | b_sub | b_faster | b_naive |
---|---|---|---|---|---|---|---|
s1 | 10 | 229377 | 1 | 0.131 | 0.089 | 1.709 | - |
s1 | 10 | 229386 | 10 | 0.142 | 0.095 | 1.907 | - |
s2 | 8 | 32896 | 1 | 0.022 | 0.007 | 0.148 | 0.168 |
s2 | 8 | 34048 | 10 | 0.021 | 0.021 | 0.163 | 0.179 |
s3 | 12 | 786444 | 1 | 0.436 | 0.468 | - | - |
s3 | 12 | 786456 | 2 | 0.434 | 0.317 | - | - |
s3 | 12 | 786552 | 10 | 0.438 | 0.333 | - | - |
Bash_sub < 0.5s
string | n | #s | #d | gawk | b_sub | b_faster | b_naive |
---|---|---|---|---|---|---|---|
s1 | 11 | 458753 | 1 | 0.256 | 0.332 | (7.089) | - |
s1 | 11 | 458762 | 10 | 0.269 | 0.387 | (8.003) | - |
s2 | 11 | 361472 | 1 | 0.205 | 0.283 | (14.54) | - |
s2 | 11 | 363520 | 3 | 0.207 | 0.462 | (16.66) | - |
s3 | 12 | 786444 | 1 | 0.436 | 0.468 | - | - |
s3 | 12 | 786456 | 2 | 0.434 | 0.317 | - | - |
s3 | 12 | 786552 | 10 | 0.438 | 0.333 | - | - |
Gawk < 0.5s
string | n | #s | #d | gawk | b_sub | b_faster | b_naive |
---|---|---|---|---|---|---|---|
s1 | 11 | 458753 | 1 | 0.256 | 0.332 | (7.089) | - |
s1 | 11 | 458762 | 10 | 0.269 | 0.387 | (8.003) | - |
s2 | 12 | 788480 | 1 | 0.440 | (1.252) | - | - |
s2 | 12 | 806912 | 10 | 0.449 | (4.968) | - | - |
s3 | 12 | 786444 | 1 | 0.436 | 0.468 | - | - |
s3 | 12 | 786456 | 2 | 0.434 | 0.317 | - | - |
s3 | 12 | 786552 | 10 | 0.438 | 0.333 | - | - |
(I'm not entirely sure why, for s3, bash_sub with s>160k and d=1 is consistently slower than with d>1.)
All tests were carried out with bash 5.0.17 on an Intel i7-7500U running xubuntu 20.04.
IFS=', ' read -r -a array <<< "$string"
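Note that the characters in $IFS are treated individually as separators, so in this case fields may be separated by either a comma or a space rather than by the two-character sequence. Interestingly though, no empty field is created when a comma followed by a space appears in the input, because the space is treated specially.
To access an individual element: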
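To make that concrete, a small sketch (the sample string is made up for illustration): a comma followed by a space counts as a single separator, two adjacent commas produce an empty field, and a space inside a name splits it as well.

```bash
#!/usr/bin/env bash
string='Paris, France,,New York'
IFS=', ' read -r -a array <<< "$string"

# Print each field in brackets so the empty one is visible.
for element in "${array[@]}"; do
    printf '[%s]\n' "$element"
done
```

This should yield five elements: Paris, France, an empty string, New and York. If fields may contain spaces, a single-character IFS such as a comma alone, followed by trimming, is probably the safer choice.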
echo "${array[0]}"
To iterate over the elements:
for element in "${array[@]}"
do
echo "$element"
done
To get both the index and the value:
for index in "${!array[@]}"
do
echo "$index ${array[index]}"
done
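The last example is useful because Bash arrays are sparse. In other words, you can delete an element or add an element, and then the indices are no longer contiguous.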
unset "array[1]"
array[42]=Earth
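A short sketch of what those two statements do to the indices, using the array from the question as assumed starting data:

```bash
#!/usr/bin/env bash
array=(Paris France Europe)
unset "array[1]"
array[42]=Earth

echo "indices: ${!array[*]}"   # indices: 0 2 42
echo "count:   ${#array[@]}"   # count:   3
```

The element count stays at 3, but the indices now have gaps, which is why iterating over "${!array[@]}" is safer than counting from 0 to the length.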
To get the number of elements in an array:
echo "${#array[@]}"
As mentioned above, arrays can be sparse, so you shouldn't use the length to get the last element. Here's how you can do it in Bash 4.2 and later:
echo "${array[-1]}"
In any version of Bash (from somewhere after 2.05b):
echo "${array[@]: -1:1}"
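Larger negative offsets select elements farther from the end of the array. Note the space before the minus sign in the older form. It is required.
Sometimes the method described in the accepted answer did not work for me, especially when the separator is a line break (newline). In those cases I solved it this way: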
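A sketch contrasting the slice form with the length-based index on a sparse array, followed by a larger negative offset on a dense one. The variable names `sparse`, `dense`, `last`, `by_length` and `second_last` are chosen here for illustration:

```bash
#!/usr/bin/env bash
sparse=(Paris France Europe)
unset "sparse[1]"
sparse[42]=Earth                          # indices are now 0, 2 and 42

last="${sparse[@]: -1:1}"                 # slice form: the real last element
by_length="${sparse[${#sparse[@]}-1]}"    # index 3-1=2, which is Europe
echo "last=$last by_length=$by_length"

dense=(Paris France Europe)
second_last="${dense[@]: -2:1}"           # France
echo "second_last=$second_last"
```

On the sparse array the length-based lookup returns Europe rather than Earth, which is the pitfall described above.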
string='first line
second line
third line'
oldIFS="$IFS"
IFS='
'
IFS=${IFS:0:1} # this is useful to format your code with tabs
lines=( $string )
IFS="$oldIFS"
for line in "${lines[@]}"
do
echo "--> $line"
done
If you are on macOS and cannot use readarray, you can simply do this:
MY_STRING="string1 string2 string3"
array=($MY_STRING)
To iterate over the elements:
for element in "${array[@]}"
do
echo "$element"
done
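One caveat with array=($MY_STRING): the unquoted expansion also undergoes pathname expansion, so a field such as * would be replaced by file names. A sketch of an alternative that should also work with the older Bash shipped with macOS, since read -a and here-strings do not depend on readarray; the sample string is made up:

```bash
#!/usr/bin/env bash
MY_STRING="string1 * string3"

# read splits on the default IFS (space, tab, newline) without globbing.
read -r -a array <<< "$MY_STRING"

for element in "${array[@]}"; do
    printf '%s\n' "$element"
done
```

Here the second element stays a literal *.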
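Here's my hack!
Splitting a string by another string is a pretty tedious thing to do in bash. In practice we have limited approaches that only work in a few cases (splitting by ";", "/", "." and so on), or we get a variety of side effects in the output.
The approach below required a number of maneuvers, but I believe it will cover most of our needs!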
#!/bin/bash
# --------------------------------------
# SPLIT FUNCTION
# ----------------
F_SPLIT_R=()
f_split() {
: 'It does a "split" into a given string and returns an array.
Args:
TARGET_P (str): Target string to "split".
DELIMITER_P (Optional[str]): Delimiter used to "split". If not
informed the split will be done by spaces.
Returns:
F_SPLIT_R (array): Array with the provided string separated by the
informed delimiter.
'
F_SPLIT_R=()
TARGET_P=$1
DELIMITER_P=$2
if [ -z "$DELIMITER_P" ] ; then
DELIMITER_P=" "
fi
REMOVE_N=1
if [ "$DELIMITER_P" == "\n" ] ; then
REMOVE_N=0
fi
# NOTE: This was the only parameter that has been a problem so far!
# By Questor
# [Ref.: https://unix.stackexchange.com/a/390732/61742]
if [ "$DELIMITER_P" == "./" ] ; then
DELIMITER_P="[.]/"
fi
if [ ${REMOVE_N} -eq 1 ] ; then
# NOTE: Due to bash limitations we have some problems getting the
# output of a split by awk inside an array and so we need to use
# "line break" (\n) to succeed. Seen this, we remove the line breaks
# momentarily afterwards we reintegrate them. The problem is that if
# there is a line break in the "string" informed, this line break will
# be lost, that is, it is erroneously removed in the output!
# By Questor
TARGET_P=$(awk 'BEGIN {RS="dn"} {gsub("\n", "3F2C417D448C46918289218B7337FCAF"); printf $0}' <<< "${TARGET_P}")
fi
# NOTE: The replace of "\n" by "3F2C417D448C46918289218B7337FCAF" results
# in more occurrences of "3F2C417D448C46918289218B7337FCAF" than the
# amount of "\n" that there was originally in the string (one more
# occurrence at the end of the string)! We can not explain the reason for
# this side effect. The line below corrects this problem! By Questor
TARGET_P=${TARGET_P%????????????????????????????????}
SPLIT_NOW=$(awk -F"$DELIMITER_P" '{for(i=1; i<=NF; i++){printf "%s\n", $i}}' <<< "${TARGET_P}")
while IFS= read -r LINE_NOW ; do
if [ ${REMOVE_N} -eq 1 ] ; then
# NOTE: We use "'" to prevent blank lines with no other characters
# in the sequence being erroneously removed! We do not know the
# reason for this side effect! By Questor
LN_NOW_WITH_N=$(awk 'BEGIN {RS="dn"} {gsub("3F2C417D448C46918289218B7337FCAF", "\n"); printf $0}' <<< "'${LINE_NOW}'")
# NOTE: We use the commands below to revert the intervention made
# immediately above! By Questor
LN_NOW_WITH_N=${LN_NOW_WITH_N%?}
LN_NOW_WITH_N=${LN_NOW_WITH_N#?}
F_SPLIT_R+=("$LN_NOW_WITH_N")
else
F_SPLIT_R+=("$LINE_NOW")
fi
done <<< "$SPLIT_NOW"
}
# --------------------------------------
# HOW TO USE
# ----------------
STRING_TO_SPLIT="
* How do I list all databases and tables using psql?
\"
sudo -u postgres /usr/pgsql-9.4/bin/psql -c \"\l\"
sudo -u postgres /usr/pgsql-9.4/bin/psql <DB_NAME> -c \"\dt\"
\"
\"
\list or \l: list all databases
\dt: list all tables in the current database
\"
[Ref.: https://dba.stackexchange.com/questions/1285/how-do-i-list-all-databases-and-tables-using-psql]
"
f_split "$STRING_TO_SPLIT" "bin/psql -c"
# --------------------------------------
# OUTPUT AND TEST
# ----------------
ARR_LENGTH=${#F_SPLIT_R[*]}
for (( i=0; i<=$(( $ARR_LENGTH -1 )); i++ )) ; do
echo " > -----------------------------------------"
echo "${F_SPLIT_R[$i]}"
echo " < -----------------------------------------"
done
if [ "$STRING_TO_SPLIT" == "${F_SPLIT_R[0]}bin/psql -c${F_SPLIT_R[1]}" ] ; then
echo " > -----------------------------------------"
echo "The strings are the same!"
echo " < -----------------------------------------"
fi