是否有可能,使用Python,合并单独的PDF文件?

假设是这样,我需要进一步扩展它。我希望循环通过目录中的文件夹,并重复此过程。

我可能是得过其实了,但是否可以排除每个pdf文件中包含的一页(我的报告生成总是创建一个额外的空白页)。


当前回答

pdfrw库可以很容易地做到这一点,假设您不需要保存书签和注释,并且您的pdf文件没有加密。Cat.py是一个示例拼接脚本,而子集.py是一个示例页面子集脚本。

串联脚本的相关部分——假设input是一个输入文件名列表,outfn是一个输出文件名:

from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)

正如你所看到的,省略最后一页是很容易的,例如:

    writer.addpages(PdfReader(inpfn).pages[:-1])

免责声明:我是pdfrw的主要作者。

其他回答

使用字典以获得更大的灵活性(例如sort, dedup):

import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
    for file in files:
        filepath = subdir + os.sep + file
        # you can have multiple endswith
        if filepath.endswith((".pdf", ".PDF")):
            file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)

for k, v in file_dict.items():
    print(k, v)
    merger.append(v)

merger.write("combined_result.pdf")

使用Pypdf或其后续版本PyPDF2:

作为PDF工具包构建的Pure-Python库。它能够: 逐页拆分文档, 逐页合并文件,

(以及更多)

下面是一个适用于这两个版本的示例程序。

#!/usr/bin/env python
import sys
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()
        output_stream.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    pdf_cat(sys.argv[1:], sys.stdout)

合并目录下的所有pdf文件

把pdf文件放到目录下。启动程序。你会得到一个合并了所有pdf文件的pdf。

import os
from PyPDF2 import PdfMerger

x = [a for a in os.listdir() if a.endswith(".pdf")]

merger = PdfMerger()

for pdf in x:
    merger.append(open(pdf, 'rb'))

with open("result.pdf", "wb") as fout:
    merger.write(fout)

今天我该如何编写上面相同的代码呢

from glob import glob
from PyPDF2 import PdfMerger



def pdf_merge():
    ''' Merges all the pdf files in current directory '''
    merger = PdfMerger()
    allpdfs = [a for a in glob("*.pdf")]
    [merger.append(pdf) for pdf in allpdfs]
    with open("Merged_pdfs.pdf", "wb") as new_file:
        merger.write(new_file)


if __name__ == "__main__":
    pdf_merge()
from PyPDF2 import PdfFileMerger
import webbrowser
import os
dir_path = os.path.dirname(os.path.realpath(__file__))

def list_files(directory, extension):
    return (f for f in os.listdir(directory) if f.endswith('.' + extension))

pdfs = list_files(dir_path, "pdf")

merger = PdfFileMerger()

for pdf in pdfs:
    merger.append(open(pdf, 'rb'))

with open('result.pdf', 'wb') as fout:
    merger.write(fout)

webbrowser.open_new('file://'+ dir_path + '/result.pdf')

Go 回购:https://github.com/mahaguru24/Python_Merge_PDF.git

下面是针对我的特定用例的最常见答案的时间比较:合并5个大单页pdf文件的列表。每个测试我都运行了两次。

(免责声明:我在Flask中运行这个函数,您的里程可能会有所不同)

博士TL;

pdfrw是我测试的3个pdf文件组合库中最快的一个。

PyPDF2

start = time.time()
merger = PdfFileMerger()
for pdf in all_pdf_obj:
    merger.append(
        os.path.join(
            os.getcwd(), pdf.filename # full path
                )
            )
formatted_name = f'Summary_Invoice_{date.today()}.pdf'
merge_file = os.path.join(os.getcwd(), formatted_name)
merger.write(merge_file)
merger.close()
end = time.time()
print(end - start) #1 66.50084733963013 #2 68.2995400428772

PyMuPDF

start = time.time()
result = fitz.open()

for pdf in all_pdf_obj:
    with fitz.open(os.path.join(os.getcwd(), pdf.filename)) as mfile:
        result.insertPDF(mfile)
formatted_name = f'Summary_Invoice_{date.today()}.pdf'

result.save(formatted_name)
end = time.time()
print(end - start) #1 2.7166640758514404 #2 1.694727897644043

PDFrw

start = time.time()
result = fitz.open()

writer = PdfWriter()
for pdf in all_pdf_obj:
    writer.addpages(PdfReader(os.path.join(os.getcwd(), pdf.filename)).pages)

formatted_name = f'Summary_Invoice_{date.today()}.pdf'
writer.write(formatted_name)
end = time.time()
print(end - start) #1 0.6040127277374268 #2 0.9576816558837891