How do I search the contents of multiple PDF files?

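How can I search the contents of PDF files in a directory and its subdirectories? I am looking for command-line tools; grep does not seem to be able to search PDF files.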

Current answer

Use pdfgrep:

pdfgrep -HinR 'FWCOSP' DatenModel/

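This command searches the DatenModel/ folder for the word FWCOSP.

With these options, each match is printed with its file name and the number of the page where it was found.

The options I used are: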

-i : ignore case when matching
-H : print the file name for each match
-n : prefix each match with the number of the page where it is found
-R : same as -r, but it also follows all symlinks.
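
For repeated use, the command above could be wrapped in a small shell function. This is only a minimal sketch: the name pdfsearch and the default of searching the current directory are illustrative choices, not part of pdfgrep, and it assumes a pdfgrep build that supports the four options listed above.

```shell
# Hypothetical helper (the name is an assumption, not a pdfgrep feature):
# case-insensitive, recursive search that prints the file name and
# page number of every match.
# Usage: pdfsearch PATTERN [DIRECTORY]
pdfsearch() {
    if [ "$#" -lt 1 ]; then
        echo "usage: pdfsearch PATTERN [DIRECTORY]" >&2
        return 2
    fi
    # "${2:-.}" falls back to the current directory when none is given.
    pdfgrep -HinR "$1" "${2:-.}"
}
```

Calling pdfsearch 'FWCOSP' DatenModel/ runs the same search as the command shown earlier.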

2022-02-17 16:22:29

Other answers

I like @sjr's answer, but I prefer xargs over -exec because I find xargs more versatile. For example, with -P we can make use of multiple CPUs when needed.

find . -name '*.pdf' -print0 | xargs -0 -P 5 -I % sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "pattern"' sh %

2014-09-26 18:13:38

There is also pdfgrep, which does exactly what its name suggests.

pdfgrep -R 'a pattern to search recursively from path' /some/path

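I have used it for simple searches, and it works well.

(Packages are available in Debian, Ubuntu, and Fedora.)

pdfgrep has supported recursive search since version 1.3.0. That version is available in Ubuntu from Ubuntu 12.10 (Quantal) onward.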

2011-03-25 15:42:11

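I ran into the same problem, so I wrote a script that searches all PDF files in a given folder for a string and prints the names of the PDF files that match the query string.

Maybe it will help you.

You can download it here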
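
The script itself is not reproduced on this page. As a rough illustration of the same idea (not the author's actual script), here is a minimal sketch that prints the path of every PDF whose text contains a given string. It assumes pdftotext is installed; the function name pdf_files_matching is made up for this example.

```shell
# Hypothetical helper: list every PDF under DIR whose extracted text
# contains QUERY (matched as a fixed string, not a regular expression).
# Usage: pdf_files_matching QUERY [DIR]
pdf_files_matching() {
    if [ "$#" -lt 1 ]; then
        echo "usage: pdf_files_matching QUERY [DIR]" >&2
        return 2
    fi
    # The query is passed to the child shell as its first argument,
    # followed by the batch of file names that find collects.
    find "${2:-.}" -type f -name '*.pdf' -exec sh -c '
        query=$1
        shift
        for file do
            # grep -q stays silent and only reports success or failure.
            if pdftotext "$file" - 2>/dev/null | grep -qF -e "$query"; then
                printf "%s\n" "$file"
            fi
        done
    ' sh "$1" {} +
}
```

For example, pdf_files_matching 'FWCOSP' DatenModel/ would print one matching file path per line.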

2012-06-24 14:04:41

Try using 'acroread' in a simple script, like the one above.

2011-01-10 09:09:49

Your distribution should provide a utility called pdftotext:

find /path -name '*.pdf' -exec sh -c 'pdftotext "$1" - | grep --with-filename --label="$1" --color "your pattern"' sh {} \;

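The "-" is required so that pdftotext writes to standard output rather than to a file. The --with-filename and --label= options put the file name in grep's output. The optional --color flag is a nice touch: it tells grep to use colored output on the terminal.

(In Ubuntu, pdftotext is provided by the xpdf-utils or poppler-utils package.)

This approach of combining pdftotext and grep has an advantage over pdfgrep if you want to use GNU grep features that pdfgrep does not support. Note: pdfgrep 1.3.x supports the -C option for printing context lines.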
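
To make that flexibility concrete, the pipeline can be wrapped so that any grep options are passed straight through. This is a minimal sketch under a few assumptions: pdftotext and GNU grep are installed, and the function name pdf_text_grep is invented for this example.

```shell
# Hypothetical helper: run grep, with whatever options you pass, on the
# extracted text of every PDF under DIR, labelling each match with its file.
# Usage: pdf_text_grep DIR [grep options...] PATTERN
pdf_text_grep() {
    if [ "$#" -lt 2 ]; then
        echo "usage: pdf_text_grep DIR [grep options...] PATTERN" >&2
        return 2
    fi
    dir=$1
    shift
    # Each PDF gets its own child shell: $1 is the file name and the
    # remaining arguments are handed to grep unchanged.
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        file=$1
        shift
        pdftotext "$file" - 2>/dev/null |
            grep --with-filename --label="$file" "$@"
    ' sh {} "$@" \;
}
```

For example, pdf_text_grep /path -i -w 'your pattern' matches whole words only and ignores case, and any other option your grep accepts can be added the same way.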

2011-01-10 03:43:22
