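How to extract text from a PDF file?

I'm trying to extract the text contained in this PDF file using Python.

I'm using the PyPDF2 package (version 1.27.2) and have the following script: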

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

When I run the code, I get the following output, which differs from the text contained in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%

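How can I extract the text as it appears in the PDF document?

Current answer

pdftotext is the best and simplest option! It also preserves the document's structure.

I tried PyPDF2, PDFMiner, and a few others, but none of them gave satisfactory results.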

2019-04-03 12:16:08

Other answers

You can use pdftotext: https://github.com/jalan/pdftotext

pdftotext keeps the text's formatting and indentation, whether or not the document contains tables.
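For reference, here is a minimal sketch of how the Python binding linked above is typically used. It assumes the pdftotext package and its poppler dependency are installed; the helper names `join_pages` and `pdf_to_text` are made up for illustration.

```python
def join_pages(pages):
    """Join per-page text into one string, with a blank line between pages."""
    return "\n\n".join(pages)


def pdf_to_text(path):
    """Return the text of every page of the PDF at `path`."""
    # Third-party import kept local: pip install pdftotext (requires poppler)
    import pdftotext

    with open(path, "rb") as f:
        pdf = pdftotext.PDF(f)  # behaves like a sequence of page strings
        return join_pages(pdf)
```

Calling `pdf_to_text("sample.pdf")` should return the whole document as a single string.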

2017-12-06 23:20:46

The code below solves the problem in Python 3. Before running it, make sure the PyPDF2 library is installed in your environment. If it is not, open a command prompt and run:

pip3 install PyPDF2

Solution code for PyPDF2 <= 1.26.0:

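import PyPDF2

# Open in binary mode; the with block closes the file when done
with open('sample.pdf', 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    count = pdfReader.numPages
    # Print the extracted text of every page
    for i in range(count):
        page = pdfReader.getPage(i)
        print(page.extractText())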
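The names used above (PdfFileReader, numPages, getPage, extractText) were deprecated in later PyPDF2 releases and removed in 3.0. A rough equivalent for the newer API, assuming the successor package pypdf is installed; the helper names `extract_all_text` and `read_pdf` are hypothetical:

```python
def extract_all_text(reader):
    """Collect the extracted text of every page of an already opened reader."""
    return [page.extract_text() for page in reader.pages]


def read_pdf(path):
    """Open the PDF at `path` and return a list with one string per page."""
    # Third-party import kept local: pip install pypdf
    from pypdf import PdfReader

    return extract_all_text(PdfReader(path))
```

For example, `print("\n".join(read_pdf("sample.pdf")))` prints the whole document. This only changes the API names; a PDF whose fonts lack a usable character mapping may still come out garbled.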

2018-05-23 13:38:45

A more robust approach, whether you have several PDFs or just one!

import os
from PyPDF2 import PdfFileReader

# Placeholder: set this to the directory that holds your PDF or PDFs
mydir = "path/to/your/pdf/directory"

# Open the output file once so that one PDF does not overwrite another
with open("myfile.txt", "w") as file1:
    for arch in os.listdir(mydir):
        # Skip anything in the directory that is not a PDF
        if not arch.lower().endswith(".pdf"):
            continue
        archpath = os.path.join(mydir, arch)
        with open(archpath, "rb") as pdfFileObj:
            pdfReader = PdfFileReader(pdfFileObj)
            # Extract the text of the first page
            pageObj = pdfReader.getPage(0)
            ley = pageObj.extractText()
        file1.write(ley)

2020-08-01 17:53:30

You may want to use the time-tested xPDF and its derived tools to extract text, since PyPDF2 still seems to have various problems with text extraction.

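The long answer is that there is a lot of variation in how text is encoded inside a PDF: extraction may require decoding the PDF strings themselves, then mapping the result through a CMap, then analyzing the distances between words and letters, and so on.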
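A toy illustration of the CMap step, which may explain output like the one in the question. The codes and the mapping below are invented for the example, and real CMaps are far richer than a one-byte lookup table.

```python
# Hypothetical one-byte character codes as stored in a page's content stream
raw_codes = bytes([0x21, 0x22, 0x23, 0x24])

# Hypothetical mapping for a subset font: character code -> real character
to_unicode = {0x21: "T", 0x22: "e", 0x23: "x", 0x24: "t"}


def decode_naive(codes):
    """Treat the codes as plain Latin-1 text, ignoring the font's mapping."""
    return codes.decode("latin-1")


def decode_with_cmap(codes, cmap):
    """Translate each code through the mapping; unknown codes become U+FFFD."""
    return "".join(cmap.get(code, "\ufffd") for code in codes)


print(decode_naive(raw_codes))                  # !"#$
print(decode_with_cmap(raw_codes, to_unicode))  # Text
```

An extractor that skips or cannot find the mapping behaves like `decode_naive`, which is why the text can look right on screen yet come out as punctuation.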

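If the PDF is damaged (i.e. it displays the correct text but produces garbage when you copy it) and you really need to extract the text, you may want to consider converting the PDF to images (using ImageMagick) and then running OCR with Tesseract to get the text from the images.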
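A sketch of that fallback driven from Python, assuming the `convert` (ImageMagick, with Ghostscript for PDF input) and `tesseract` command-line tools are on the PATH. On ImageMagick 7 the executable is usually `magick` instead. The function names, the 300 dpi default and the file names are illustrative choices.

```python
import subprocess


def rasterize_command(pdf_path, png_pattern, dpi=300):
    """Build the ImageMagick call that renders each PDF page to a PNG."""
    return ["convert", "-density", str(dpi), pdf_path, png_pattern]


def ocr_command(image_path):
    """Build the Tesseract call that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout"]


def ocr_image(image_path):
    """Run Tesseract on one image and return the recognized text."""
    result = subprocess.run(
        ocr_command(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Typical use would be `subprocess.run(rasterize_command("sample.pdf", "page-%03d.png"), check=True)` followed by `ocr_image(...)` on each generated PNG. OCR quality depends heavily on the rendering resolution and the installed language data.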

2016-01-18 08:42:47

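pdfplumber is one of the better libraries for reading and extracting data from PDFs. It also provides methods for reading table data, and after going through a lot of such libraries, pdfplumber worked best for me.

Note that it works best on machine-generated PDFs rather than scanned ones.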

import pdfplumber

with pdfplumber.open(r'D:\examplepdf.pdf') as pdf:
    # Extract and print the text of the first page
    first_page = pdf.pages[0]
    print(first_page.extract_text())
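Since the answer mentions table data, here is a small sketch of reading tables as well. It assumes pdfplumber is installed; the helper names `first_page_tables` and `tables_from_file` are hypothetical.

```python
def first_page_tables(pdf):
    """Return every table found on the first page, each as a list of rows."""
    return pdf.pages[0].extract_tables()


def tables_from_file(path):
    """Open the PDF at `path` and return the tables on its first page."""
    # Third-party import kept local: pip install pdfplumber
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return first_page_tables(pdf)
```

Each table comes back as a list of rows, and each row as a list of cell values, so the result can be passed to the csv module or a DataFrame. How well tables are detected depends on the layout of the PDF.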

2021-10-19 14:04:35
