In a Bash script, I want to pick N lines at random from an input file and write them to another file.

How can this be done?


Current answer

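Just for completeness, and because it is available from Arch's community repos: there is also a tool called shuffle, but it has no command-line switch to limit the number of lines, and its man page warns: "Since shuffle reads the input into memory, it may fail on very large files."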

Other answers

Use shuf with the -n option, as shown below, to get N random lines:

shuf -n N input > output
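As a minimal sketch, the one-liner can be wrapped in a small function that checks its arguments before calling shuf. The name sample_lines and its argument order are assumptions made for this illustration; only shuf -n comes from the answer itself.

```shell
# sample_lines is a hypothetical helper name, not part of coreutils.
# Usage: sample_lines N input_file output_file
sample_lines() {
    n=$1
    input=$2
    output=$3

    # Reject anything that is not a non-negative integer
    case $n in
        ''|*[!0-9]*)
            echo "sample_lines: N must be a non-negative integer" >&2
            return 2
            ;;
    esac

    # Fail early if the input file cannot be read
    if [ ! -r "$input" ]; then
        echo "sample_lines: cannot read $input" >&2
        return 1
    fi

    # shuf -n prints at most N lines, chosen at random without repetition
    shuf -n "$n" < "$input" > "$output"
}
```

If the input has fewer than N lines, shuf simply outputs all of them in random order. The input and output files should be different, since the redirection truncates the output file before shuf reads anything.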
# Function to sample N lines randomly from a file
# Parameter $1: name of the original file
# Parameter $2: number of lines to sample (must not exceed the line count)
# The result is written to sampled_file.txt in the current directory
rand_line_sampler() {
    N_t=$(wc -l < "$1")       # total number of lines
    N_zeros=$(( N_t - $2 ))   # total number of lines minus desired number of lines

    # vector of 0s (reject), one per line that will not be sampled
    : > vector_0.temp
    for i in $(seq 1 "$N_zeros"); do
        echo "0" >> vector_0.temp
    done

    # vector of 1s (keep), one per desired line
    : > vector_1.temp
    for i in $(seq 1 "$2"); do
        echo "1" >> vector_1.temp
    done

    # shuffle the flags so that the 1s land on random line positions
    cat vector_1.temp vector_0.temp | shuf > rand_vector.temp

    # prefix each line with its flag, keep the lines flagged 1,
    # and strip the flag without altering the rest of the line
    paste -d" " rand_vector.temp "$1" |
    sed -n 's/^1 //p' > sampled_file.txt # file with the sampled lines

    rm vector_0.temp vector_1.temp rand_vector.temp
}

rand_line_sampler "parameter_1" "parameter_2"

In the script below, "c" is the number of lines to select from the input. Modify it as needed:

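#!/bin/sh

gawk '
BEGIN   { srand(); c = 5 }
# Reservoir sampling: keep the first c lines, then let line NR
# replace a randomly chosen slot with probability c/NR
NR <= c { lines[NR - 1] = $0; next }
rand() < c/NR { lines[int(rand() * c)] = $0 }
END { for (i in lines)  print lines[i] }

' "$@"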
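The same single-pass idea (reservoir sampling) can be sketched as a function that takes the count as an argument and reads standard input, so it can sit at the end of a pipeline. The name reservoir_sample is an assumption made for this illustration, and the sketch uses plain awk rather than gawk. It holds only N lines in memory at any time.

```shell
# reservoir_sample is a hypothetical helper name chosen for this sketch.
# Usage: some_command | reservoir_sample N
reservoir_sample() {
    awk -v k="$1" '
    BEGIN { srand() }                      # seed from the clock (1-second resolution)
    NR <= k { pool[NR] = $0; next }        # fill the reservoir with the first k lines
    rand() * NR < k {                      # keep line NR with probability k/NR
        pool[int(rand() * k) + 1] = $0     # and overwrite a uniformly chosen slot
    }
    END {
        n = (NR < k) ? NR : k              # the input may be shorter than k
        for (i = 1; i <= n; i++) print pool[i]
    }'
}
```

Overwriting a uniformly chosen slot, rather than cycling through the slots in order, is what gives every input line the same chance of ending up in the output. Because srand() seeds from the current time in seconds, two runs within the same second return the same sample.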

Sort the file randomly and pick the first 100 lines:

lines=100
input_file=/usr/share/dict/words

# This is the basic selection method
<"$input_file" sort -R | head -n "$lines"

# If the file has duplicates that must never cause duplicate results
<"$input_file" sort -u | sort -R | head -n "$lines"

# If the file has blank lines that must be filtered, use sed
<"$input_file" sed $'/^[ \t]*$/d' | sort -R | head -n "$lines"

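Of course, the <$input_file redirection can be replaced by any piped standard input. This approach (sort -R, plus $'…\t…' to let sed match tab characters) works on GNU/Linux and BSD/macOS.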