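Is it possible, using Python, to merge separate PDF files?
Assuming so, I need to extend this a little further. I am hoping to loop through the folders in a directory and repeat this procedure.
I may be pushing my luck, but is it possible to exclude one page contained in each of the PDFs (my report generation always creates an extra blank page)?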
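Current answer
Use pyPdf or its successor PyPDF2:
A pure-Python library built as a PDF toolkit. It is capable of: splitting documents page by page, merging documents page by page,
(and much more)
Here is a sample program that works with both versions.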
#!/usr/bin/env python
import sys
# These class names come from pyPdf and PyPDF2 before 3.0; later
# releases renamed them to PdfReader and PdfWriter.
try:
    from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
    from pyPdf import PdfFileReader, PdfFileWriter

def pdf_cat(input_files, output_stream):
    input_streams = []
    try:
        # First open all the files, then produce the output file, and
        # finally close the input files. This is necessary because
        # the data isn't read from the input files until the write
        # operation. Thanks to
        # https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
        for input_file in input_files:
            input_streams.append(open(input_file, 'rb'))
        writer = PdfFileWriter()
        for reader in map(PdfFileReader, input_streams):
            for n in range(reader.getNumPages()):
                writer.addPage(reader.getPage(n))
        writer.write(output_stream)
    finally:
        for f in input_streams:
            f.close()
        output_stream.close()

if __name__ == '__main__':
    if sys.platform == "win32":
        import os, msvcrt
        msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
    # On Python 3 binary data must go to sys.stdout.buffer;
    # Python 2 has no such attribute, so fall back to sys.stdout.
    pdf_cat(sys.argv[1:], getattr(sys.stdout, 'buffer', sys.stdout))
Other answers
Is it possible, using Python, to merge separate PDF files?
Yes.
The following example merges all the files in one folder into a single new PDF file:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser
from glob import glob
from pyPdf import PdfFileReader, PdfFileWriter
import os

def merge(path, output_filename):
    output = PdfFileWriter()
    # Sort the matches so the page order does not depend on the file system.
    for pdffile in sorted(glob(path + os.sep + '*.pdf')):
        # Compare absolute paths so an earlier output file is skipped
        # even when glob returns it with a directory prefix.
        if os.path.abspath(pdffile) == os.path.abspath(output_filename):
            continue
        print("Parse '%s'" % pdffile)
        # The input file has to stay open until output.write() has run.
        document = PdfFileReader(open(pdffile, 'rb'))
        for i in range(document.getNumPages()):
            output.addPage(document.getPage(i))
    print("Start writing '%s'" % output_filename)
    with open(output_filename, "wb") as f:
        output.write(f)

if __name__ == "__main__":
    parser = ArgumentParser()
    # Add more options if you like
    parser.add_argument("-o", "--output",
                        dest="output_filename",
                        default="merged.pdf",
                        help="write merged PDF to FILE",
                        metavar="FILE")
    parser.add_argument("-p", "--path",
                        dest="path",
                        default=".",
                        help="path of source PDF files")
    args = parser.parse_args()
    merge(args.path, args.output_filename)
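The question also asks about repeating the merge for every folder in a directory, which the script above does not cover. A minimal sketch of that step, using only the standard library: the helper name `pdfs_by_folder` and the `reports` directory are hypothetical, not part of any answer here.

```python
import os


def pdfs_by_folder(root):
    """Map each folder under root (root included) to its sorted PDF paths."""
    groups = {}
    for folder, _subdirs, filenames in os.walk(root):
        pdfs = sorted(
            os.path.join(folder, name)
            for name in filenames
            if name.lower().endswith(".pdf")
        )
        if pdfs:  # skip folders that contain no PDFs
            groups[folder] = pdfs
    return groups


if __name__ == "__main__":
    # "reports" is a placeholder directory name; nothing is printed if it is missing.
    for folder, pdfs in pdfs_by_folder("reports").items():
        print(folder, len(pdfs))
```

Each list of paths could then be handed to whichever merge routine you prefer, writing one output file per folder.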
Use the right Python interpreter:
conda activate py_envs
pip install PyPDF2
Python code:
import os
from PyPDF2 import PdfMerger

# Set the folder that holds the PDF files
os.chdir('/ur/path/to/folder/')
cwd = os.path.abspath('')
files = os.listdir(cwd)

def merge_pdf_files():
    merger = PdfMerger()
    pdf_files = [x for x in files if x.endswith(".pdf")]
    # A plain loop reads better than a list comprehension used only for side effects
    for pdf in pdf_files:
        merger.append(pdf)
    with open("merged_pdf_all.pdf", "wb") as new_file:
        merger.write(new_file)
    merger.close()

if __name__ == "__main__":
    merge_pdf_files()
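Here is a time comparison of the most common answers for my specific use case: combining a list of 5 large single-page PDF files. I ran each test twice.
(Disclaimer: I ran this function within Flask; your mileage may vary.)
TL;DR
pdfrw is the fastest of the 3 PDF-combining libraries I tested.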
PyPDF2
import os
import time
from datetime import date
from PyPDF2 import PdfFileMerger

# all_pdf_obj is defined elsewhere in the Flask app; each item has a .filename
start = time.time()
merger = PdfFileMerger()
for pdf in all_pdf_obj:
    merger.append(
        os.path.join(
            os.getcwd(), pdf.filename  # full path
        )
    )
formatted_name = f'Summary_Invoice_{date.today()}.pdf'
merge_file = os.path.join(os.getcwd(), formatted_name)
merger.write(merge_file)
merger.close()
end = time.time()
print(end - start)  # run 1: 66.50084733963013, run 2: 68.2995400428772
PyMuPDF
import os
import time
from datetime import date
import fitz  # PyMuPDF

start = time.time()
result = fitz.open()
for pdf in all_pdf_obj:
    with fitz.open(os.path.join(os.getcwd(), pdf.filename)) as mfile:
        # Newer PyMuPDF releases name this method insert_pdf
        result.insertPDF(mfile)
formatted_name = f'Summary_Invoice_{date.today()}.pdf'
result.save(formatted_name)
end = time.time()
print(end - start)  # run 1: 2.7166640758514404, run 2: 1.694727897644043
pdfrw
import os
import time
from datetime import date
from pdfrw import PdfReader, PdfWriter

start = time.time()
writer = PdfWriter()
for pdf in all_pdf_obj:
    writer.addpages(PdfReader(os.path.join(os.getcwd(), pdf.filename)).pages)
formatted_name = f'Summary_Invoice_{date.today()}.pdf'
writer.write(formatted_name)
end = time.time()
print(end - start)  # run 1: 0.6040127277374268, run 2: 0.9576816558837891
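The three snippets repeat the same start/stop timing by hand. If you want to rerun the comparison on your own files, a small helper keeps the measurement identical for each library. This is only a sketch: `time_runs` is a hypothetical name, and it uses `time.perf_counter` rather than the `time.time` calls in the measurements above.

```python
import time


def time_runs(func, runs=2):
    """Call func() `runs` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in workload; swap in a function that performs one merge.
    print(time_runs(lambda: sum(range(100000))))
```

Wrapping each library's merge in its own function and passing it to `time_runs` gives one list of timings per library.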
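The pdfrw library can do this quite easily, assuming you don't need to preserve bookmarks and annotations, and your PDFs aren't encrypted. cat.py is an example concatenation script, and subset.py is an example page-subsetting script.
The relevant part of the concatenation script, assuming inputs is a list of input filenames and outfn is an output filename: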
from pdfrw import PdfReader, PdfWriter

writer = PdfWriter()
for inpfn in inputs:
    writer.addpages(PdfReader(inpfn).pages)
writer.write(outfn)
As you can see, it is easy to leave out the last page, for example:
writer.addpages(PdfReader(inpfn).pages[:-1])
Disclaimer: I am the primary pdfrw author.
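The slice above drops the last page. If the blank page sits somewhere else, the same idea can be generalized. The sketch below assumes that `PdfReader(...).pages` behaves like an ordinary Python list, as the slicing above suggests; `pages_without` and `merge_skipping_page` are hypothetical helper names, not part of pdfrw.

```python
def pages_without(pages, skip_index=-1):
    """Return a copy of `pages` with the page at `skip_index` removed."""
    remaining = list(pages)
    if remaining:  # an empty document has nothing to drop
        del remaining[skip_index]
    return remaining


def merge_skipping_page(inputs, outfn, skip_index=-1):
    """Concatenate `inputs` into `outfn`, dropping one page from each file."""
    # Imported here so pages_without() can be used without pdfrw installed.
    from pdfrw import PdfReader, PdfWriter

    writer = PdfWriter()
    for inpfn in inputs:
        writer.addpages(pages_without(PdfReader(inpfn).pages, skip_index))
    writer.write(outfn)
```

With the default `skip_index=-1` this matches the `pages[:-1]` slice for non-empty files; passing `0` would drop the first page instead.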