I am trying to extract the text contained in this PDF file using Python.

I am using the PyPDF2 package (version 1.27.2) and have the following script:

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

When I run the code, I get the following output, which differs from the text contained in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%

How can I extract the text as it appears in the PDF document?


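Current answer

Goal: extract text from a PDF

Required tools:

Poppler for Windows: provides the pdftotext utility, which converts PDF to text, on Windows. For Anaconda: conda install -c conda-forge pdftotext

Steps: Install Poppler. On Windows, add "xxx/bin/" to the PATH environment variable. Then run pip install pdftotext.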

import pdftotext
 
# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
 
# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))
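If the Python binding is hard to install, the pdftotext command-line tool that ships with Poppler can be called directly. The following is only a sketch under a few assumptions: pdftotext is on the PATH (as described in the steps above), and the helper names build_pdftotext_command and extract_text_with_cli are made up for illustration. The -layout flag and the "-" output argument are standard pdftotext options.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, layout=True):
    """Build the argument list for Poppler's pdftotext command-line tool."""
    command = ["pdftotext"]
    if layout:
        # -layout asks pdftotext to keep the physical layout of the page.
        command.append("-layout")
    # "-" as the output file sends the text to standard output.
    command += [pdf_path, "-"]
    return command


def extract_text_with_cli(pdf_path, layout=True):
    """Run pdftotext and return its output as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install Poppler first")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    # pdftotext writes UTF-8 by default.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text_with_cli("Target.pdf") should then return the text of the whole document, with layout=False giving the plain reading-order output instead.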

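Other answers

pdftotext is the best and simplest option! It also preserves the structure of the document.

I tried PyPDF2, PDFMiner, and a few other tools, but none of them gave satisfactory results.

I found a solution here: PDFLayoutTextStripper

It is good because it preserves the layout of the original PDF.

It is written in Java, but I have added a gateway to support Python.

Sample code: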

from py4j.java_gateway import JavaGateway

gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')

# result is a dict of {
#   'success': 'true' or 'false',
#   'payload': pdf file content if 'success' is 'true'
#   'error': error message if 'success' is 'false'
# }

print(result['payload'])

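[Image: sample output from PDFLayoutTextStripper]

You can find more details here: Stripper with Python

I have a workaround that works better than OCR and keeps the page alignment while extracting text from a PDF. It should help: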

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

def convert_pdf_to_txt(path):
    """Return the text of the PDF at `path`, using pdfminer's layout analysis."""
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages

    # Open the file in a with block so it is closed even if parsing fails.
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)

    text = retstr.getvalue()
    device.close()
    retstr.close()
    return text

text = convert_pdf_to_txt('test.pdf')
print(text)
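For reference, recent releases of pdfminer.six also expose a one-call helper, pdfminer.high_level.extract_text, that wraps the same steps. A minimal sketch, assuming pdfminer.six is installed; the wrapper name pdf_to_text is hypothetical and not part of the answer above:

```python
def pdf_to_text(path, page_numbers=None):
    """Return the text of a PDF using pdfminer.six's high-level helper."""
    # Imported lazily so this module still loads when pdfminer.six is absent.
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # page_numbers is a list of zero-based page indexes; None extracts every page.
    return extract_text(path, page_numbers=page_numbers, laparams=LAParams())
```

For example, pdf_to_text('test.pdf', page_numbers=[0]) would return only the first page.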

Look at this code for PyPDF2<=1.26.0:

import PyPDF2

# Open the file in a with block so it is closed after reading.
with open('sample.pdf', 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
print(page_content)

The output is:

!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%

Using the same code to read 201308FCR.pdf, the output is normal.

Its documentation explains why:

def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text.  This works well for some PDF
    files, but poorly for others, depending on the generator used.  This will
    be refined in the future.  Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
