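How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools. It seems grep cannot search PDF files.
Current answer
Thanks for all the great ideas!
I tried the xargs approach, but as pointed out here, xargs makes it impossible (or very difficult) to include the actual file name in the output...
So I tried GNU parallel.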
parallel "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'" ::: *.pdf
This prints not only the matching line but, with --context=5, also the 5 lines above and below it for context. With -q, pdftotext does not print any error messages or warnings (quiet). I use brackets [] around the label instead of braces {}; if you want braces, --label='{'{}'}' will do that. Note that GNU parallel replaces {} with the actual file name, e.g. 'Example portable document file name with spaces.pdf' ({} already comes in single quotes '). With --label={} only the file name is printed, which may be the preferred way of displaying it. I also noticed that the output had no color when I tried it, unless I forced it by adding --color=always to grep. It may be useful to add --ignore-case to the grep command for a case-insensitive keyword search.
If all PDF files should be processed recursively, including every subdirectory of the current directory (.), this can be done with find:
find . -type f -iname '*.pdf' -print0 | parallel -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'"
With find, -iname '*.pdf' matches case-insensitively. With -name '*.pdf' only files with a lower-case .pdf extension are included (the usual case). Since I have sometimes encountered Windows PDF files with an upper-case .PDF extension, I tend to prefer -iname. The above command also works with find's -print option (instead of -print0), which makes it line-based (one file name per line); in that case -0 (NUL delimiter) must be omitted from the parallel command. Again, adding --ignore-case to the grep command makes the search case-insensitive.
As a general tip for working out the whole command line, parallel --dry-run prints the commands that would be executed.
$ find . -type f -iname '*.pdf' -print0 | parallel --dry-run -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --ignore-case --context=5 'pattern'"
pdftotext -q ./test PDF file 1.pdf - | grep --with-filename --label='['./test PDF file 1.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir1/test PDF file 2.pdf - | grep --with-filename --label='['./subdir1/test PDF file 2.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir2/test PDF file 3.pdf - | grep --with-filename --label='['./subdir2/test PDF file 3.pdf']' --color=always --ignore-case --context=5 'pattern'
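If GNU parallel is not installed, the same per-file pipeline can be run sequentially with find -exec. The following is only a sketch under that assumption: the function name pdf_grep_tree is made up for illustration, and -iname is a GNU/BSD find extension rather than POSIX.

```shell
# pdf_grep_tree PATTERN [DIR]   (hypothetical helper name)
# Sequential fallback: runs "pdftotext | grep" once per PDF found under DIR.
pdf_grep_tree() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -iname '*.pdf' -exec sh -c '
        pattern=$1; shift
        for f in "$@"; do
            # --label makes grep print the PDF name instead of "(standard input)"
            pdftotext -q "$f" - |
                grep --with-filename --label="[$f]" --context=5 -- "$pattern"
        done
    ' sh "$pattern" {} +
}
```

Calling pdf_grep_tree 'pattern' searches the current directory tree; a second argument selects another starting directory. Because grep writes straight to the terminal here, its default color detection applies and --color=always is left out.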
Other answers
There is also pdfgrep, which does exactly what its name suggests.
pdfgrep -R 'a pattern to search recursively from path' /some/path
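I have used it for simple searches and it worked well.
(Packages are available in Debian, Ubuntu and Fedora.)
Since version 1.3.0, pdfgrep supports recursive search. That version is available in Ubuntu starting with Ubuntu 12.10 (Quantal).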
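A small wrapper can make the recursive call easier to reuse. This is a sketch only: the function name pdf_grep_recursive is made up, and the choice of -i (ignore case) and -n (print page numbers) is an assumption about what is useful, not part of the answer above.

```shell
# pdf_grep_recursive PATTERN [DIR]   (hypothetical wrapper name)
pdf_grep_recursive() {
    if ! command -v pdfgrep >/dev/null 2>&1; then
        echo "pdfgrep not found; install it from your distribution's packages" >&2
        return 127
    fi
    # -R: search DIR recursively (needs pdfgrep 1.3.0 or later)
    # -i: ignore case; -n: prefix each match with its page number
    pdfgrep -R -i -n -- "$1" "${2:-.}"
}
```

For example, pdf_grep_recursive 'a pattern' /some/path searches every PDF below /some/path, and leaving out the second argument searches the current directory.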
I wrote this small, destructive script. Have fun.
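function pdfsearch()
{
    # Search every PDF below the current directory for the pattern given in $1.
    find . -iname '*.pdf' | while IFS= read -r filename
    do
        #echo -e "\033[34;1m// === PDF Document:\033[33;1m $filename\033[0m"
        # Extract the text into "$filename." (note the trailing dot), then grep that file.
        pdftotext -q -enc ASCII7 "$filename" "$filename."; grep -s -H --color=always -i -- "$1" "$filename."
        # remove it! rm -f "$filename."
    done
}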
I like @sjr's answer, but I prefer xargs over -exec. I find xargs more versatile. For example, with -P we can take advantage of multiple CPUs when needed.
find . -name '*.pdf' | xargs -P 5 -I % pdftotext % - | grep --with-filename --label="{}" --color "pattern"
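In the command above grep runs once, outside xargs, so --label="{}" prints a literal {} instead of a file name. One way around that, sketched here as an assumption and not as the original author's method, is to let xargs start a small sh -c pipeline per file. The function name pdf_grep_xargs is made up; -print0, -0 and -P are GNU/BSD extensions.

```shell
# pdf_grep_xargs PATTERN [DIR] [JOBS]   (hypothetical helper name)
pdf_grep_xargs() {
    # -print0 / -0 keep file names with spaces or newlines intact.
    # -n 1 appends one file name per sh invocation, so no -I placeholder is needed.
    find "${2:-.}" -type f -name '*.pdf' -print0 |
        xargs -0 -n 1 -P "${3:-5}" sh -c '
            # $1 is the pattern, $2 is the PDF that xargs appended
            pdftotext -q "$2" - |
                grep --with-filename --label="$2" --color=auto -- "$1"
        ' sh "$1"
}
```

With several jobs running at once, output from different files can arrive in any order and may interleave.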
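I ran into the same problem, so I wrote a script that searches all PDF files in a given folder for a string and prints the PDF files that match the query string.
Maybe this helps you.
You can download it here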
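The script itself is not reproduced here. A minimal sketch of the same idea, assuming pdftotext is installed and not taken from the author's script, could look like this. The function name pdf_list_matches is made up for illustration.

```shell
# pdf_list_matches PATTERN [DIR]   (hypothetical helper name)
# Prints the path of every PDF under DIR whose text contains PATTERN.
pdf_list_matches() {
    find "${2:-.}" -type f -iname '*.pdf' -exec sh -c '
        pattern=$1; shift
        for f in "$@"; do
            # grep -q prints nothing itself and only reports match or no match
            if pdftotext -q "$f" - | grep -q -i -- "$pattern"; then
                printf "%s\n" "$f"
            fi
        done
    ' sh "$1" {} +
}
```

For example, pdf_list_matches 'query string' ~/Documents prints one matching file path per line, which is convenient for piping into other tools.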