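How do I extract text from a PDF file?

I am trying to extract the text contained in this PDF file using Python.

I am using the PyPDF2 package (version 1.27.2) and have the following script: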

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

When I run the code, I get the following output, which differs from the text contained in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%

How can I extract the text as it appears in the PDF document?

Current answer

Use the code below to extract text from a PDF:

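import PyPDF2

# Legacy PyPDF2 API; open the file in binary mode and close it when done.
with open('mypdf.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

    print(pdfReader.numPages)       # total number of pages

    pageObj = pdfReader.getPage(0)  # first page (index is zero-based)

    a = pageObj.extractText()

print(a)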
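
Note that `PdfFileReader`, `getPage` and `extractText` belong to the legacy PyPDF2 API. PyPDF2 3.0 removed them, and the project continues under the name `pypdf`. Below is a minimal sketch of the same steps with the newer names, assuming `pypdf` is installed (`pip install pypdf`). The helper names `extract_all_text` and `read_pdf_text` are made up for illustration, and nothing here promises that a newer version decodes every PDF correctly.

```python
def extract_all_text(reader):
    """Return the text of every page, with a blank line between pages.

    `reader` is any object exposing `.pages`, where each page has an
    `extract_text()` method (pypdf's PdfReader provides this interface).
    """
    # Treat a missing result as an empty page rather than failing.
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)


def read_pdf_text(path):
    """Open `path` with pypdf and return the text of all pages."""
    # Imported lazily so the helper above can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    print(len(reader.pages))  # replaces the old `numPages` attribute
    return extract_all_text(reader)
```

Calling `read_pdf_text('mypdf.pdf')` would then print the page count and return the text of the whole document.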

2020-01-13 18:31:55

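Other answers

You may want to use the time-proven xPDF and its derived tools to extract the text instead, as pyPDF2 still seems to have various issues with text extraction.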

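The long answer is that there are many variations in how text is encoded inside a PDF: extraction may require decoding the PDF string itself, then mapping it through a CMap, then analyzing the distance between words and letters, and so on.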
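
To make the CMap step concrete, here is a toy sketch. The mapping table is entirely hypothetical (it is not taken from the PDF in the question); it only illustrates why reading glyph codes as if they were ordinary characters yields punctuation such as `!"#$`, while mapping them through the font's table yields readable text.

```python
from typing import Mapping

# Hypothetical ToUnicode-style table: glyph code -> real character.
TOY_CMAP = {0x21: "H", 0x22: "e", 0x23: "l", 0x24: "o"}


def decode_naively(codes: bytes) -> str:
    """Interpret each glyph code as if it were a Latin-1 character."""
    return codes.decode("latin-1")


def decode_with_cmap(codes: bytes, cmap: Mapping[int, str]) -> str:
    """Translate each glyph code through the font's mapping table."""
    # Codes missing from the table become U+FFFD (replacement character).
    return "".join(cmap.get(code, "\ufffd") for code in codes)


raw = bytes([0x21, 0x22, 0x23, 0x23, 0x24])  # what the content stream stores
print(decode_naively(raw))              # garbage when the mapping is ignored
print(decode_with_cmap(raw, TOY_CMAP))  # readable once the mapping is applied
```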

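If the PDF is damaged (i.e. it displays the correct text but copying it yields garbage) and you really need the text, you may want to consider converting the PDF into images (using ImageMagick) and then running OCR with Tesseract to get the text from those images.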
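
A rough sketch of that image-then-OCR route, driving the two command-line tools from Python. It assumes that ImageMagick's `convert` (with Ghostscript available for reading PDFs) and `tesseract` are both installed and on the PATH; the function names, the 300 dpi default and the `page-%03d.png` naming are illustrative choices, not something the answer prescribes.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_command(pdf_path, out_dir, dpi=300):
    """ImageMagick command that renders each PDF page to a numbered PNG."""
    pattern = str(Path(out_dir) / "page-%03d.png")
    return ["convert", "-density", str(dpi), str(pdf_path), pattern]


def ocr_command(image_path, lang="eng"):
    """Tesseract command that writes the recognized text to standard output."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf(pdf_path, lang="eng"):
    """Rasterize `pdf_path`, OCR every page image, and return the joined text."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(rasterize_command(pdf_path, tmp), check=True)
        pages = []
        # Zero-padded file names keep the pages in order when sorted.
        for image in sorted(Path(tmp).glob("page-*.png")):
            done = subprocess.run(ocr_command(image, lang), check=True,
                                  capture_output=True, text=True)
            pages.append(done.stdout)
    return "\n".join(pages)
```

Calling `ocr_pdf("sample.pdf")` would return the recognized text; OCR quality depends heavily on the rendering resolution and on the language data installed for Tesseract.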

2016-01-18 08:42:47

I found a solution here: PDFLayoutTextStripper

It is good because it preserves the layout of the original PDF.

It is written in Java, but I have added a gateway to support Python.

Sample code:

from py4j.java_gateway import JavaGateway

# Connects to the Java side, which must already be running a py4j GatewayServer.
gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')

# result is a dict of {
#   'success': 'true' or 'false',
#   'payload': pdf file content if 'success' is 'true'
#   'error': error message if 'success' is 'false'
# }

print(result['payload'])

[Figure: sample output of PDFLayoutTextStripper]

You can see more details here: Stripper with Python

2019-05-07 01:54:26

Take a look at this code for PyPDF2<=1.26.0:

import PyPDF2

# Legacy PyPDF2 (<=1.26.0) API
with open('sample.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    page = read_pdf.getPage(0)
    page_content = page.extractText()

print(page_content)

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%
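
Output like this can be caught programmatically before it is used further. The following is only a heuristic sketch: the function name `looks_garbled` and the 0.5 threshold are assumptions chosen for illustration, and legitimate text that consists mostly of digits or symbols (tables, formulas) would be flagged as well.

```python
def looks_garbled(text, min_letter_ratio=0.5):
    """Heuristic: flag text whose visible characters are mostly not letters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True  # nothing was extracted at all
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible) < min_letter_ratio


# First line of the output shown above, and an ordinary sentence for contrast.
sample = "!\"#$%#$%&%$&'()*%+,-%./01'*23%4"
print(looks_garbled(sample))
print(looks_garbled("Quarterly report for fiscal year 2013"))
```

A check of this kind could decide when to fall back to another extraction tool.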

Using the same code to read the PDF 201308FCR.pdf, the output is normal.

Its documentation explains why:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text.  This works well for some PDF
    files, but poorly for others, depending on the generator used.  This will
    be refined in the future.  Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """

2016-01-20 04:00:40

In 2020, the solutions above did not work for the particular PDF I was working with. Below is what did the trick. I am on Windows 10 with Python 3.8.

Test PDF file: https://drive.google.com/file/d/1aUfQAlvq5hA9kz2c9CyJADiY3KpY3-Vn/view?usp=sharing

#pip install pdfminer.six
import io

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage


def convert_pdf_to_txt(path):
    '''Convert pdf content from a file path to text

    :path the file path
    '''
    rsrcmgr = PDFResourceManager()
    codec = 'utf-8'
    laparams = LAParams()

    with io.StringIO() as retstr:
        with TextConverter(rsrcmgr, retstr, codec=codec,
                           laparams=laparams) as device:
            with open(path, 'rb') as fp:
                interpreter = PDFPageInterpreter(rsrcmgr, device)
                password = ""
                maxpages = 0
                caching = True
                pagenos = set()

                for page in PDFPage.get_pages(fp,
                                              pagenos,
                                              maxpages=maxpages,
                                              password=password,
                                              caching=caching,
                                              check_extractable=True):
                    interpreter.process_page(page)

                return retstr.getvalue()


if __name__ == "__main__":
    print(convert_pdf_to_txt('C:\\Path\\To\\Test_PDF.pdf'))

2020-07-31 11:18:35

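How to extract text from a PDF file?

The first thing to understand is the PDF format. It has a public specification written in English: see ISO 32000-2:2017 and read the more than 700 pages of the PDF 1.7 specification. At the very least, you need to read the Wikipedia page about PDF.

Once you understand the details of the PDF format, extracting text is more or less easy (but what about text appearing in figures or images, e.g. its Figure 1?). Do not expect to write a perfect text extractor on your own in a few weeks.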

On Linux, you might also use pdf2text, which you could popen from your Python code.
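
A minimal sketch of calling such a tool from Python, assuming the tool meant here is the `pdftotext` command that ships with Xpdf and Poppler and that it is on the PATH. The function names are illustrative; `-layout`, `-enc UTF-8` and `-` (write to standard output) are options of `pdftotext`.

```python
import subprocess


def pdftotext_command(pdf_path, layout=True):
    """Build the command line; the final "-" sends the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    return cmd + [str(pdf_path), "-"]


def extract_with_pdftotext(pdf_path, layout=True):
    """Run pdftotext and return the extracted text as a str."""
    done = subprocess.run(pdftotext_command(pdf_path, layout),
                          check=True, capture_output=True)
    # Decode leniently so that a stray byte does not abort the extraction.
    return done.stdout.decode("utf-8", errors="replace")
```

With `check=True`, a missing file or a failing tool raises `subprocess.CalledProcessError` instead of silently returning empty text.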

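In general, extracting text from a PDF file is an ill-defined problem. What a human reader sees as text may actually be drawn from individual dots (as a figure), or be part of a photograph, and so on.

The Google search engine is able to extract text from PDFs, but it is rumored to need more than half a billion lines of source code. Do you have the necessary resources (manpower and budget) to develop a competitor?

One possibility is to print the PDF to some virtual printer (e.g. using GhostScript or Firefox), and then use OCR techniques to extract the text.

Instead, I would recommend working on the data representation from which the PDF file was generated, e.g. the original LaTeX code (or Lout code) or the OOXML code.

In all cases, you need to budget at least several person-years of software development.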

2020-08-21 07:08:40
