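How do I search the contents of multiple PDF files?

How can I search the contents of PDF files in a directory and its subdirectories? I'm looking for command-line tools. grep doesn't seem to be able to search PDF files.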

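Current answer

There is another utility called ripgrep-all, which is built on ripgrep.

It handles more than PDF documents, including Office documents and movies, and its author claims it is faster than pdfgrep.

The commands below search the current directory recursively; the second one restricts the search to PDF files: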

rga 'pattern' .
rga --type pdf 'pattern' .

2019-07-29 09:06:56

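Other answers

Thanks for all the great ideas!

I tried the xargs approach, but as was pointed out here, xargs makes it impossible (or at least very difficult) to print the actual file name along with each match.

So I tried GNU parallel instead.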

parallel "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'" ::: *.pdf

This prints each matching line and, because of --context=5, the 5 lines above and below it. With -q, pdftotext suppresses error messages and warnings (quiet mode). I use brackets [] around the label instead of braces {}; if you want braces, --label='{'{}'}' will do it. Note that GNU parallel replaces {} with the actual file name, e.g. 'Example portable document file name with spaces.pdf', and already wraps it in single quotes. With --label={} only the bare file name is printed, which may be the preferred way to display it. I also noticed that the output had no color unless I forced it by adding --color=always to grep. Adding --ignore-case to the grep command makes the keyword search case-insensitive.

To process all PDF files recursively, including every subdirectory of the current directory (.), use find:

find . -type f -iname '*.pdf' -print0 | parallel -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --context=5 'pattern'"

With find, -iname '*.pdf' matches case-insensitively, whereas -name '*.pdf' only matches the lower-case .pdf extension (the usual case). Since I sometimes come across PDF files from Windows with an upper-case .PDF extension, I tend to prefer -iname. The command above also works with find's -print option instead of -print0, in which case the output is line-based (one file name per line) and -0 (NUL delimiter) must be omitted from the parallel command. Again, adding --ignore-case to the grep command makes the search case-insensitive.

As a general tip for working out the full command line, parallel --dry-run prints the commands that would be run without executing them.

$ find . -type f -iname '*.pdf' -print0 | parallel --dry-run -0 "pdftotext -q {} - | grep --with-filename --label='['{}']' --color=always --ignore-case --context=5 'pattern'"
pdftotext -q ./test PDF file 1.pdf - | grep --with-filename --label='['./test PDF file 1.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir1/test PDF file 2.pdf - | grep --with-filename --label='['./subdir1/test PDF file 2.pdf']' --color=always --ignore-case --context=5 'pattern'
pdftotext -q ./subdir2/test PDF file 3.pdf - | grep --with-filename --label='['./subdir2/test PDF file 3.pdf']' --color=always --ignore-case --context=5 'pattern'
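
If GNU parallel is not installed, roughly the same recursive search can be done with find -exec alone, at the cost of processing the files one after another. The following is only a sketch under that assumption: search_pdfs is a hypothetical helper name, and the grep options are the GNU grep ones used in the commands above.

```shell
#!/bin/sh
# Sketch: recursive PDF search without GNU parallel (sequential, not parallel).
# search_pdfs is a hypothetical helper name, not part of any tool.
# Usage: search_pdfs PATTERN [DIR]
search_pdfs() {
    # find passes the PDF paths as arguments to a small inline sh script.
    # Inside that script "$1" is the pattern and the rest are file names,
    # so names containing spaces are handled safely.
    find "${2:-.}" -type f -iname '*.pdf' -exec sh -c '
        pattern=$1
        shift
        for pdf in "$@"; do
            # -q silences pdftotext warnings; "-" writes the text to stdout.
            pdftotext -q "$pdf" - |
                grep --with-filename --label="[$pdf]" \
                     --ignore-case --context=5 -e "$pattern"
        done
    ' sh "$1" {} +
}
```

For example, search_pdfs 'pattern' . searches the current directory tree. Add --color=always to the grep call if you want highlighted matches, as in the parallel version.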

2022-02-06 15:21:15

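You need a tool such as pdf2text to convert the PDF to a text file first, and then search within the text. (You may lose some information or symbols in the conversion.)

If you are using a programming language, there is probably a PDF library written for exactly this purpose, for example http://search.cpan.org/dist/CAM-PDF/ for Perl.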

2011-01-10 03:43:07

First convert all the PDF files to text files:

for file in *.pdf;do pdftotext "$file"; done

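Then use grep as usual. This works especially well when you have many queries and many PDF files, because each PDF only has to be converted once and the searches themselves are fast.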
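
The convert-once idea can be extended to subdirectories and made repeatable, so that only new or changed PDFs are converted again. This is a rough sketch, not a polished tool: refresh_pdf_text and grep_pdf_text are hypothetical helper names, and it assumes the .txt copy may be written next to each PDF.

```shell
#!/bin/sh
# Sketch: keep a .txt copy next to every PDF and refresh it only when needed.
# refresh_pdf_text and grep_pdf_text are hypothetical helper names.

# Usage: refresh_pdf_text [DIR]
refresh_pdf_text() {
    find "${1:-.}" -type f -iname '*.pdf' -exec sh -c '
        for pdf in "$@"; do
            txt="${pdf%.*}.txt"
            # Convert only if the text copy is missing or older than the PDF.
            if [ ! -e "$txt" ] || [ "$pdf" -nt "$txt" ]; then
                pdftotext -q "$pdf" "$txt"
            fi
        done
    ' sh {} +
}

# Usage: grep_pdf_text PATTERN [DIR]
grep_pdf_text() {
    # The extra /dev/null operand makes grep print file names even when
    # only a single .txt file is found.
    find "${2:-.}" -type f -name '*.txt' -exec grep -e "$1" /dev/null {} +
}
```

Run refresh_pdf_text . once, then grep_pdf_text 'pattern' . as often as needed. Note that grep_pdf_text searches every .txt file under the directory, not only those generated from PDFs.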

2016-01-02 22:07:10

There is also pdfgrep, which does exactly what its name suggests.

pdfgrep -R 'a pattern to search recursively from path' /some/path

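I have used it for simple searches, and it worked fine.

(Packages are available in Debian, Ubuntu and Fedora.)

pdfgrep has supported recursive search since version 1.3.0. That version is available in Ubuntu from Ubuntu 12.10 (Quantal) onwards.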

2011-03-25 15:42:11
