我有一个~23000行的SQL转储,其中包含几个数据库的数据价值。我需要提取这个文件的某个部分(即单个数据库的数据),并将其放在一个新文件中。我知道我想要的数据的开始行号和结束行号。
谁知道一个Unix命令(或一系列命令)可以从文件中提取16224到16482行之间的所有行,然后将它们重定向到一个新文件中?
我有一个~23000行的SQL转储,其中包含几个数据库的数据价值。我需要提取这个文件的某个部分(即单个数据库的数据),并将其放在一个新文件中。我知道我想要的数据的开始行号和结束行号。
谁知道一个Unix命令(或一系列命令)可以从文件中提取16224到16482行之间的所有行,然后将它们重定向到一个新文件中?
当前回答
我已经为sed、perl、head+tail和我自己的awk代码编译了一些最高评级的解决方案,并通过管道关注性能,同时使用LC_ALL=C确保所有候选程序以尽可能快的速度运行,并在两者之间分配2秒的睡眠间隔。
差距是显而易见的:
abs time awk/app speed ratio
----------------------------------
0.0672 sec : 1.00x mawk-2
0.0839 sec : 1.25x gnu-sed
0.1289 sec : 1.92x perl
0.2151 sec : 3.20x gnu-head+tail
还没有机会测试这些工具的python或BSD变体。
(fg && fg && fg && fg) 2>/dev/null;
echo;
( time ( pvE0 < "${m3t}"
| LC_ALL=C mawk2 '
BEGIN {
_=10420001-(\
__=10420256)^(FS="^$")
} _<NR {
print
if(__==NR) { exit }
}' ) | pvE9) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2;
(fg && fg && fg && fg) 2>/dev/null
echo;
( time ( pvE0 < "${m3t}"
| LC_ALL=C gsed -n '10420001,10420256p;10420256q'
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2; (fg && fg && fg && fg) 2>/dev/null
echo
( time ( pvE0 < "${m3t}"
| LC_ALL=C perl -ne 'print if 10420001..10420256'
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
sleep 2; (fg && fg && fg && fg) 2>/dev/null
echo
( time ( pvE0 < "${m3t}"
| LC_ALL=C ghead -n +10420256
| LC_ALL=C gtail -n +10420001
) | pvE9 ) | tee >(xxh128sum >&2) | LC_ALL=C gwc -lcm | lgp3 ;
in0: 1.51GiB 0:00:00 [2.31GiB/s] [2.31GiB/s] [============> ] 81%
out9: 42.5KiB 0:00:00 [64.9KiB/s] [64.9KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C mawk2 ; )
0.43s user 0.36s system 117% cpu 0.672 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
out9: 42.5KiB 0:00:00 [51.7KiB/s] [51.7KiB/s] [ <=> ]
in0: 1.51GiB 0:00:00 [1.84GiB/s] [1.84GiB/s] [==========> ] 81%
( pvE 0.1 in0 < "${m3t}" |LC_ALL=C gsed -n '10420001,10420256p;10420256q'; )
0.68s user 0.34s system 121% cpu 0.839 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
in0: 1.85GiB 0:00:01 [1.46GiB/s] [1.46GiB/s] [=============>] 100%
out9: 42.5KiB 0:00:01 [33.5KiB/s] [33.5KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}" | LC_ALL=C perl -ne 'print if 10420001..10420256'; )
1.10s user 0.44s system 119% cpu 1.289 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
in0: 1.51GiB 0:00:02 [ 728MiB/s] [ 728MiB/s] [=============> ] 81%
out9: 42.5KiB 0:00:02 [19.9KiB/s] [19.9KiB/s] [ <=> ]
( pvE 0.1 in0 < "${m3t}"
| LC_ALL=C ghead -n +10420256
| LC_ALL=C gtail -n ; )
1.98s user 1.40s system 157% cpu 2.151 total
256 43487 43487
54313365c2e66a48dc1dc33595716cc8 stdin
其他回答
又快又脏:
head -16428 < file.in | tail -259 > file.out
可能不是最好的方法,但应该有用。
顺便说一下:259 = 16482-16224+1。
我一直在寻找这个问题的答案,但最终我不得不编写自己的代码。以上的答案都不令人满意。 假设您有一个非常大的文件,并且有一些想要打印的行号,但这些行号不是按顺序排列的。您可以执行以下操作:
我的文件比较大 对于{a..k};执行echo $letter;完成| cat -n > myfile.txt
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
11 k
我想要的具体行号: shuf -i 1-11 -n 4 > line_numbers_I_want.txt
10
11
4
9
要打印这些行号,请执行以下操作。 awk ` {system("head myfile.txt -n " $0 " | tail -n 1")} ` line_numbers_I_want.txt
上面所做的是头n行,然后采取最后一行使用尾巴
如果您希望行号按顺序排列,首先sort (is -n numeric sort),然后获取行。
cat line_numbers_I_want.txt | sort -n | awk '{system("head myfile.txt -n " $0 " | tail -n 1")}'
4 d
9 i
10 j
11 k
也许,你会给这个简陋的脚本一个机会;-)
#!/usr/bin/bash
# Usage:
# body n m|-m
from=$1
to=$2
if [ $to -gt 0 ]; then
# count $from the begin of the file $to selected line
awk "NR >= $from && NR <= $to {print}"
else
# count $from the begin of the file skipping tailing $to lines
awk '
BEGIN {lines=0; from='$from'; to='$to'}
{++lines}
NR >= $from {line[lines]=$0}
END {for (i = from; i < lines + to + 1; i++) {
print line[i]
}
}'
fi
输出:
$ seq 20 | ./body.sh 5 15
5
6
7
8
9
10
11
12
13
14
15
$ seq 20 | ./body.sh 5 -5
5
6
7
8
9
10
11
12
13
14
15
我编写了一个小型bash脚本,您可以从命令行运行它,只要您更新PATH以包含它的目录(或者您可以将它放在PATH中已经包含的目录中)。
用法:$ pinch filename起始行结束行
#!/bin/bash
# Display line number ranges of a file to the terminal.
# Usage: $ pinch filename start-line end-line
# By Evan J. Coon
FILENAME=$1
START=$2
END=$3
ERROR="[PINCH ERROR]"
# Check that the number of arguments is 3
if [ $# -lt 3 ]; then
echo "$ERROR Need three arguments: Filename Start-line End-line"
exit 1
fi
# Check that the file exists.
if [ ! -f "$FILENAME" ]; then
echo -e "$ERROR File does not exist. \n\t$FILENAME"
exit 1
fi
# Check that start-line is not greater than end-line
if [ "$START" -gt "$END" ]; then
echo -e "$ERROR Start line is greater than End line."
exit 1
fi
# Check that start-line is positive.
if [ "$START" -lt 0 ]; then
echo -e "$ERROR Start line is less than 0."
exit 1
fi
# Check that end-line is positive.
if [ "$END" -lt 0 ]; then
echo -e "$ERROR End line is less than 0."
exit 1
fi
NUMOFLINES=$(wc -l < "$FILENAME")
# Check that end-line is not greater than the number of lines in the file.
if [ "$END" -gt "$NUMOFLINES" ]; then
echo -e "$ERROR End line is greater than number of lines in file."
exit 1
fi
# The distance from the end of the file to end-line
ENDDIFF=$(( NUMOFLINES - END ))
# For larger files, this will run more quickly. If the distance from the
# end of the file to the end-line is less than the distance from the
# start of the file to the start-line, then start pinching from the
# bottom as opposed to the top.
if [ "$START" -lt "$ENDDIFF" ]; then
< "$FILENAME" head -n $END | tail -n +$START
else
< "$FILENAME" tail -n +$START | head -n $(( END-START+1 ))
fi
# Success
exit 0
sed -n '16224,16482 p' orig-data-file > new-file
其中16224、16482是起始行号和结束行号,包括。这是1索引。-n抑制将输入回显为输出,这显然不是你想要的;数字表示要执行以下命令的行范围;命令p输出相关的行。