是否有可能,使用Python,合并单独的PDF文件?
假设是这样,我需要进一步扩展它。我希望循环通过目录中的文件夹,并重复此过程。
我可能是得过其实了,但是否可以排除每个pdf文件中包含的一页(我的报告生成总是创建一个额外的空白页)。
是否有可能,使用Python,合并单独的PDF文件?
假设是这样,我需要进一步扩展它。我希望循环通过目录中的文件夹,并重复此过程。
我可能是得过其实了,但是否可以排除每个pdf文件中包含的一页(我的报告生成总是创建一个额外的空白页)。
当前回答
我在linux终端上通过利用subprocess(假设目录中存在one.pdf和two.pdf)使用pdf unite,目的是将它们合并为three.pdf
import subprocess
subprocess.call(['pdfunite one.pdf two.pdf three.pdf'],shell=True)
其他回答
def pdf_merger(路径): """将pdf文件合并为一个pdf""" "
import logging
logging.basicConfig(filename = 'output.log', level = logging.DEBUG, format = '%(asctime)s %(levelname)s %(message)s' )
try:
import glob, os
import PyPDF2
os.chdir(path)
pdfs = []
for file in glob.glob("*.pdf"):
pdfs.append(file)
if len(pdfs) == 0:
logging.info("No pdf in the given directory")
else:
merger = PyPDF2.PdfFileMerger()
for pdf in pdfs:
merger.append(pdf)
merger.write('result.pdf')
merger.close()
except Exception as e:
logging.error('Error has happened')
logging.exception('Exception occured' + str(e))
使用字典以获得更大的灵活性(例如sort, dedup):
import os
from PyPDF2 import PdfFileMerger
# use dict to sort by filepath or filename
file_dict = {}
for subdir, dirs, files in os.walk("<dir>"):
for file in files:
filepath = subdir + os.sep + file
# you can have multiple endswith
if filepath.endswith((".pdf", ".PDF")):
file_dict[file] = filepath
# use strict = False to ignore PdfReadError: Illegal character error
merger = PdfFileMerger(strict=False)
for k, v in file_dict.items():
print(k, v)
merger.append(v)
merger.write("combined_result.pdf")
http://pieceofpy.com/2009/03/05/concatenating-pdf-with-python/提供了一个解决方案。
类似的:
from pyPdf import PdfFileWriter, PdfFileReader
def append_pdf(input,output):
[output.addPage(input.getPage(page_num)) for page_num in range(input.numPages)]
output = PdfFileWriter()
append_pdf(PdfFileReader(file("C:\\sample.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample1.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample2.pdf","rb")),output)
append_pdf(PdfFileReader(file("c:\\sample3.pdf","rb")),output)
output.write(file("c:\\combined.pdf","wb"))
------ 11月25日更新------
------似乎以上代码不再工作------
------请使用以下:------
from PyPDF2 import PdfFileMerger, PdfFileReader
import os
merger = PdfFileMerger()
file_folder = "C:\\My Ducoments\\"
root, dirs, files = next(os.walk(file_folder))
for path, subdirs, files in os.walk(root):
for f in files:
if f.endswith(".pdf"):
merger.append(file_folder + f)
merger.write(file_folder + "Economists-1.pdf")
您可以从PyPDF2模块使用pdffilemerge。
例如,要从路径列表中合并多个PDF文件,可以使用以下函数:
from PyPDF2 import PdfFileMerger
# pass the path of the output final file.pdf and the list of paths
def merge_pdf(out_path: str, extracted_files: list [str]):
merger = PdfFileMerger()
for pdf in extracted_files:
merger.append(pdf)
merger.write(out_path)
merger.close()
merge_pdf('./final.pdf', extracted_files)
这个函数从父文件夹中递归地获取所有文件:
import os
# pass the path of the parent_folder
def fetch_all_files(parent_folder: str):
target_files = []
for path, subdirs, files in os.walk(parent_folder):
for name in files:
target_files.append(os.path.join(path, name))
return target_files
# get a list of all the paths of the pdf
extracted_files = fetch_all_files('./parent_folder')
最后,使用这两个函数进行声明。可以包含多个文档的parent_folder_path,以及用于合并PDF的目的地的output_pdf_path:
# get a list of all the paths of the pdf
parent_folder_path = './parent_folder'
outup_pdf_path = './final.pdf'
extracted_files = fetch_all_files(parent_folder_path)
merge_pdf(outup_pdf_path, extracted_files)
你可以从这里获得完整的代码(来源):如何使用Python合并PDF文档
使用Pypdf或其后续版本PyPDF2:
作为PDF工具包构建的Pure-Python库。它能够: 逐页拆分文档, 逐页合并文件,
(以及更多)
下面是一个适用于这两个版本的示例程序。
#!/usr/bin/env python
import sys
try:
from PyPDF2 import PdfFileReader, PdfFileWriter
except ImportError:
from pyPdf import PdfFileReader, PdfFileWriter
def pdf_cat(input_files, output_stream):
input_streams = []
try:
# First open all the files, then produce the output file, and
# finally close the input files. This is necessary because
# the data isn't read from the input files until the write
# operation. Thanks to
# https://stackoverflow.com/questions/6773631/problem-with-closing-python-pypdf-writing-getting-a-valueerror-i-o-operation/6773733#6773733
for input_file in input_files:
input_streams.append(open(input_file, 'rb'))
writer = PdfFileWriter()
for reader in map(PdfFileReader, input_streams):
for n in range(reader.getNumPages()):
writer.addPage(reader.getPage(n))
writer.write(output_stream)
finally:
for f in input_streams:
f.close()
output_stream.close()
if __name__ == '__main__':
if sys.platform == "win32":
import os, msvcrt
msvcrt.setmode(sys.stdout.fileno(), os.O_BINARY)
pdf_cat(sys.argv[1:], sys.stdout)