PyPDF2 PdfFileMerger.Merge Class - python

I am trying to use this class to append a page from one PDF to another by specifying the page position.
Does anyone have experience with this? I couldn't find any example of using PdfFileMerger.merge over internet
with open(orig_pdf, 'rb') as orig, open(amend_pdf, 'rb') as new:
pdf = PdfFileMerger()
pdf.merge(2, new)
pdf.write('.pdf')

Consider using merge and passing the position, which is the page number you wish to add the pdf file
There are (probably) many ways of achieving the same results. Heres a basic working example:
from PyPDF2 import PdfFileMerger, PdfFileReader
orig_pdf = r'C:\temp\old.pdf'
amend_pdf = r'C:\temp\new.pdf'
with open(orig_pdf, 'rb') as orig, open(amend_pdf, 'rb') as new:
merger = PdfFileMerger()
merger.append(PdfFileReader(orig_pdf))
# Add amend_pdf after page 2
merger.merge(2, PdfFileReader(amend_pdf))
merger.write("results.pdf")
For more info, have a look at the official documentation https://pythonhosted.org/PyPDF2/PdfFileMerger.html

Related

How to turn a large PDF's pages to images efficiently? [duplicate]

I would like to take a multi-page pdf file and create separate pdf files per page.
I have downloaded reportlab and have browsed the documentation, but it seems aimed at pdf generation. I haven't yet seen anything about processing PDF files themselves.
Is there an easy way to do this in python?
from PyPDF2 import PdfWriter, PdfReader
inputpdf = PdfReader(open("document.pdf", "rb"))
for i in range(len(inputpdf.pages)):
output = PdfWriter()
output.add_page(inputpdf.pages[i])
with open("document-page%s.pdf" % i, "wb") as outputStream:
output.write(outputStream)
etc.
I missed here a solution where you split the PDF to two parts consisting of all pages so I append my solution if somebody was looking for the same:
from PyPDF2 import PdfFileWriter, PdfFileReader
def split_pdf_to_two(filename,page_number):
pdf_reader = PdfFileReader(open(filename, "rb"))
try:
assert page_number < pdf_reader.numPages
pdf_writer1 = PdfFileWriter()
pdf_writer2 = PdfFileWriter()
for page in range(page_number):
pdf_writer1.addPage(pdf_reader.getPage(page))
for page in range(page_number,pdf_reader.getNumPages()):
pdf_writer2.addPage(pdf_reader.getPage(page))
with open("part1.pdf", 'wb') as file1:
pdf_writer1.write(file1)
with open("part2.pdf", 'wb') as file2:
pdf_writer2.write(file2)
except AssertionError as e:
print("Error: The PDF you are cutting has less pages than you want to cut!")
The PyPDF2 package gives you the ability to split up a single PDF into multiple ones.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
pdf = PdfFileReader(path)
for page in range(pdf.getNumPages()):
pdf_writer = PdfFileWriter()
pdf_writer.addPage(pdf.getPage(page))
output_filename = '{}_page_{}.pdf'.format(fname, page+1)
with open(output_filename, 'wb') as out:
pdf_writer.write(out)
print('Created: {}'.format(output_filename))
Source: https://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/
I know that the code is not related to python, however i felt like posting this piece of R code which is simple, flexible and works amazingly. The PDFtools package in R is amazing in splitting merging PDFs at ease.
library(pdftools) #Rpackage
pdf_subset('D:\\file\\20.02.20\\22 GT 2017.pdf',
pages = 1:51, output = "subset.pdf")
The earlier answers with PyPDF2 for splitting pdfs are not working anymore with the latest version update. The authors recommend using pypdf instead and this version of PyPDF2==3.0.1 will be the last version of PyPDF2. The function needs to be modified as follows:
import os
from PyPDF2 import PdfReader, PdfWriter
def split_pdfs(input_file_path):
inputpdf = PdfReader(open(input_file_path, "rb"))
out_paths = []
if not os.path.exists("outputs"):
os.makedirs("outputs")
for i, page in enumerate(inputpdf.pages):
output = PdfWriter()
output.add_page(page)
out_file_path = f"outputs/{input_file_path[:-4]}_{i}.pdf"
with open(out_file_path, "wb") as output_stream:
output.write(output_stream)
out_paths.append(out_file_path)
return out_paths
Note: The same function will work with pypdf as well. Import PdfReader and PdfWriter from pypdf rather than PyPDF2.
import fitz
src = fitz.open("source.pdf")
for page in src:
tar = fitz.open() # output PDF for 1 page
# copy over current page
tar.insert_pdf(src, from_page=page.number, to_page=page.number)
tar.save(f"page-{page.number}.pdf")
tar.close()

Brand new looking for direction on PDF Page remover

I have been looking for a basic PDF page remover. What I am looking for is something that will let me put in the page numbers I want removed from the file I select. Still really new I am reading a few books to learn, but wanting to use this to make my job a bit easier as the current program is very stupid where I have to make sure to remember to subtract a number from the page given due to how PDF is.
I have to basic codes I would love some critical review of so I can hopefully refine it and make it to something that is easy to use. I thank anyone for their time and help loving how python reads and feels.
from PyPDF2 import PdfFileWriter, PdfFileReader
pages_delete = [4]
input = PdfFileReader('example.pdf', 'rb')
output = PdfFileWriter()
for i in range(input.getNumPages()):
if i not in pages_delete:
p = input.getPage(i)
output.addPage(p)
with open('pages_deleted.pdf', 'wb') as f:
output.write(f)
this is the second one
import PyPDF2 import PdfFileWriter, PdfFileReader
pdf1File = open('*file_name*')
pdf1Reader = PyPDF2.PdfFileReader(pdf1File)
pdfWriter = PyPDF2.PdfFileWriter()
for pageNum in range(pdf1Reader.numPages):
pageObj - pdf1Reader.getPage(pageNum)
pdfWriter.removePage(pageObj)
pdfOutputFile = open('*filename*-Edit')
pdfWriter.write(pdfOutputFile)
pdfOutFile.close()

Generate flattened PDF with Python

When I print a PDF from any of my source PDFs, the file size drops and removes the text boxes presents in form. In short, it flattens the file.
This is behavior I want to achieve.
The following code to create a PDF using another PDF as a source (the one I want to flatten), it writes the text boxes form as well.
Can I get a PDF without the text boxes, flatten it? Just like Adobe does when I print a PDF as a PDF.
My other code looks something like this minus some things:
import os
import StringIO
from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
directory = os.path.join(os.getcwd(), "source") # dir we are interested in
fif = [f for f in os.listdir(directory) if f[-3:] == 'pdf'] # get the PDFs
for i in fif:
packet = StringIO.StringIO()
can = canvas.Canvas(packet, pagesize=letter)
can.rotate(-90)
can.save()
packet.seek(0)
new_pdf = PdfFileReader(packet)
fname = os.path.join('source', i)
existing_pdf = PdfFileReader(file(fname, "rb"))
output = PdfFileWriter()
nump = existing_pdf.getNumPages()
page = existing_pdf.getPage(0)
for l in range(nump):
output.addPage(existing_pdf.getPage(l))
page.mergePage(new_pdf.getPage(0))
outputStream = file("out-"+i, "wb")
output.write(outputStream)
outputStream.close()
print fName + " written as", i
Summing up: I have a pdf, I add a text box to it, covering up info and adding new info, and then I print a pdf from that pdf. The text box becomes not editable or moveable any longer. I wanted to automate that process but everything I tried still allowed that text box to be editable.
If installing an OS package is an option, then you could use pdftk with its python wrapper pypdftk like this:
import pypdftk
pypdftk.fill_form('filled.pdf', out_file='flattened.pdf', flatten=True)
You would also need to install the pdftk package, which on Ubuntu could be done like this:
sudo apt-get install pdftk
The pypdftk library can by downloaded from PyPI:
pip install pypdftk
Update: pdftk was briefly removed from Ubuntu in version 18.04, but it seems it is back since 20.04.
A simple but more of a round about way it to covert the pdf to images than to put those image into a pdf.
You'll need pdf2image and PIL
Like So
from pdf2image import convert_from_path
from PIL import Image
images = convert_from_path('temp.pdf')
im1 = images[0]
images.pop(0)
pdf1_filename = "flattened.pdf"
im1.save(pdf1_filename, "PDF" ,resolution=100.0, save_all=True, append_images=images)
Edit:
I created a library to do this called fillpdf
pip install fillpdf
from fillpdf import fillpdfs
fillpdfs.flatten_pdf('input.pdf', 'newflat.pdf')
Per the Adobe Docs, you can change the Bit Position of the Editable Form Fields to 1 to make the field ReadOnly. I provided a full solution here, but it uses Django:
https://stackoverflow.com/a/55301804/8382028
Adobe Docs (page 552):
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
Use PyPDF2 to fill the fields, then loop through the annotations to change the bit position:
from io import BytesIO
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, NumberObject
# open the pdf
input_stream = open("YourPDF.pdf", "rb")
reader = PdfFileReader(input_stream, strict=False)
if "/AcroForm" in reader.trailer["/Root"]:
reader.trailer["/Root"]["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)}
)
writer = PdfFileWriter()
writer.set_need_appearances_writer()
if "/AcroForm" in writer._root_object:
# Acro form is form field, set needs appearances to fix printing issues
writer._root_object["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)}
)
data_dict = dict() # this is a dict of your form values
writer.addPage(reader.getPage(0))
page = writer.getPage(0)
# update form fields
writer.updatePageFormFieldValues(page, data_dict)
for j in range(0, len(page["/Annots"])):
writer_annot = page["/Annots"][j].getObject()
for field in data_dict:
if writer_annot.get("/T") == field:
# make ReadOnly:
writer_annot.update({NameObject("/Ff"): NumberObject(1)})
output_stream = BytesIO()
writer.write(output_stream)
# output_stream is your flattened PDF
A solution that goes for Windows as well, converts many pdf pages and flatens the chackbox values as well. For some reason #ViaTech code did not work in my pc (Windows7 python 3.8)
Followed #ViaTech indications and used extensively #hchillon code from this post
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import BooleanObject, NameObject, IndirectObject, TextStringObject, NumberObject
def set_need_appearances_writer(writer):
try:
catalog = writer._root_object
# get the AcroForm tree and add "/NeedAppearances attribute
if "/AcroForm" not in catalog:
writer._root_object.update({
NameObject("/AcroForm"): IndirectObject(len(writer._objects), 0, writer)})
need_appearances = NameObject("/NeedAppearances")
writer._root_object["/AcroForm"][need_appearances] = BooleanObject(True)
return writer
except Exception as e:
print('set_need_appearances_writer() catch : ', repr(e))
return writer
class PdfFileFiller(object):
def __init__(self, infile):
self.pdf = PdfFileReader(open(infile, "rb"), strict=False)
if "/AcroForm" in self.pdf.trailer["/Root"]:
self.pdf.trailer["/Root"]["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)})
# newvals and newchecks have keys have to be filled. '' is not accepted
def update_form_values(self, outfile, newvals=None, newchecks=None):
self.pdf2 = MyPdfFileWriter()
trailer = self.pdf.trailer['/Root'].get('/AcroForm', None)
if trailer:
self.pdf2._root_object.update({
NameObject('/AcroForm'): trailer})
set_need_appearances_writer(self.pdf2)
if "/AcroForm" in self.pdf2._root_object:
self.pdf2._root_object["/AcroForm"].update(
{NameObject("/NeedAppearances"): BooleanObject(True)})
for i in range(self.pdf.getNumPages()):
self.pdf2.addPage(self.pdf.getPage(i))
self.pdf2.updatePageFormFieldValues(self.pdf2.getPage(i), newvals)
for j in range(0, len(self.pdf.getPage(i)['/Annots'])):
writer_annot = self.pdf.getPage(i)['/Annots'][j].getObject()
for field in newvals:
writer_annot.update({NameObject("/Ff"): NumberObject(1)})
self.pdf2.updatePageFormCheckboxValues(self.pdf2.getPage(i), newchecks)
with open(outfile, 'wb') as out:
self.pdf2.write(out)
class MyPdfFileWriter(PdfFileWriter):
def __init__(self):
super().__init__()
def updatePageFormCheckboxValues(self, page, fields):
for j in range(0, len(page['/Annots'])):
writer_annot = page['/Annots'][j].getObject()
for field in fields:
writer_annot.update({NameObject("/Ff"): NumberObject(1)})
origin = ## Put input pdf path here
destination = ## Put output pdf path here, even if the file does not exist yet
newchecks = {} # A dict with all checkbox values that need to be changed
newvals = {'':''} # A dict with all entry values that need to be changed
# newvals dict has to be equal to {'':''} in case that no changes are needed
c = PdfFileFiller(origin)
c.update_form_values(outfile=destination, newvals=newvals, newchecks=newchecks)
print('PDF has been created\n')
I had trouble flattening a form that I had entered content into using pdfrw (How to Populate Fillable PDF's with Python) and found that I had to add an additional step using generate_fdf (pdftk flatten loses fillable field data).
os.system('pdftk '+outtemp+' generate_fdf output '+outfdf)
os.system('pdftk '+outtemp+' fill_form '+outfdf+' output '+outpdf)
I came to this solution because I was able to flatten a file just fine using ghostscript's pdf2ps followed by ps2pdf on my Mac, but the quality had low resolution when I ran it on an Amazon Linux instance. I couldn't figure out why that was the case and so moved to the pdftk solution.

How to append PDF pages using PyPDF2

Is anybody has experience merging two page of PDF file into one using python lib PyPDF2.
When I try page1.mergePage(page2) it results with page2 overlayed page1. How to make it to add page2 to the bottom of the page1?
As I'm searching the web for python pdf merging solution, I noticed that there's a general misconception with merging versus appending.
Most people call the appending action a merge but it's not. What you're describing in your question is really the intended use of mergePage which should be called applyPageOnTopOfAnother but that's a little long. What you are (were) looking for is really appending two files/pages into a new file.
Appending PDF files
Using the PdfFileMerger class and its append method.
Identical to the merge() method, but assumes you want to concatenate
all pages onto the end of the file instead of specifying a position.
Here's one way to do it taken from pypdf Merging multiple pdf files into one pdf:
from PyPDF2 import PdfFileMerger, PdfFileReader
# ...
merger = PdfFileMerger()
merger.append(PdfFileReader(file(filename1, 'rb')))
merger.append(PdfFileReader(file(filename2, 'rb')))
merger.write("document-output.pdf")
Appending specific PDF pages
And to append specific pages of different PDF files, use the PdfFileWriter class with the addPage method.
Adds a page to this PDF file. The page is usually acquired from a
PdfFileReader instance.
file1 = PdfFileReader(file(filename1, "rb"))
file2 = PdfFileReader(file(filename2, "rb"))
output = PdfFileWriter()
output.addPage(file1.getPage(specificPageIndex))
output.addPage(file2.getPage(specificPageIndex))
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
Merging two pages into one page
Using mergePage
Merges the content streams of two pages into one. Resource references
(i.e. fonts) are maintained from both pages. The mediabox/cropbox/etc
of this page are not altered. The parameter page’s content stream will
be added to the end of this page’s content stream, meaning that it
will be drawn after, or “on top” of this page.
file1 = PdfFileReader(file(filename1, "rb"))
file2 = PdfFileReader(file(filename2, "rb"))
output = PdfFileWriter()
page = file1.getPage(specificPageIndex)
page.mergePage(file2.getPage(specificPageIndex))
output.addPage(page)
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
If the 2 PDFs do not exist on your local machine, and instead are normally accessed/download via a URL (i.e. http://foo/bar.pdf & http://bar/foo.pdf), we can fetch both PDFs from remote locations and merge them together in memory in one-fell-swoop.
This eliminates the assumed step of downloading the PDF to begin with, and allows us to generalize beyond the simple case of both PDFs existing on disk. Specifically, it generalizes the solution to any HTTP-accessible PDF.
The example:
from PyPDF2 import PdfFileMerger, PdfFileReader
pdf_content_1 = requests.get('http://foo/bar.pdf').content
pdf_content_2 = requests.get('http://bar/foo.pdf').content
# Write to in-memory file-like buffers
pdf_buffer_1 = StringIO.StringIO().write(pdf_content_1)
pdf_buffer_2 = StringIO.StringIO().write(pdf_content_2)
pdf_merged_buffer = StringIO.StringIO()
merger = PdfFileMerger()
merger.append(PdfFileReader(pdf_buffer_1))
merger.append(PdfFileReader(pdf_buffer_2))
merger.write(pdf_merged_buffer)
# Option 1:
# Return the content of the buffer in an HTTP response (Flask example below)
response = make_response(pdf_merged_buffer.getvalue())
# Set headers so web-browser knows to render results as PDF
response.headers['Content-Type'] = 'application/pdf'
response.headers['Content-Disposition'] = \
'attachment; filename=%s.pdf' % 'Merged PDF'
return response
# Option 2: Write to disk
with open("merged_pdf.pdf", "w") as fp:
fp.write(pdf_merged_buffer.getvalue())
Did it this way:
reader = PyPDF2.PdfFileReader(open("input.pdf",'rb'))
NUM_OF_PAGES = reader.getNumPages()
page0 = reader.getPage(0)
h = page0.mediaBox.getHeight()
w = page0.mediaBox.getWidth()
newpdf_page = PyPDF2.pdf.PageObject.createBlankPage(None, w, h*NUM_OF_PAGES)
for i in range(NUM_OF_PAGES):
next_page = reader.getPage(i)
newpdf_page.mergeScaledTranslatedPage(next_page, 1, 0, h*(NUM_OF_PAGES-i-1))
writer = PdfFileWriter()
writer.addPage(newpdf_page)
with open('output.pdf', 'wb') as f:
writer.write(f)
It works when every page has the same height and width. Otherwise, it needs some modifications.
Maybe Emile Bergeron solution is better. Didn't try it.
The pdfrw library can do this. There is a 4up example in the examples directory that places 4 input pages on every output page, and a booklet example that takes 8.5x11 input and creates 11x17 output. Disclaimer -- I am the pdfrw author.
The code posted in this following link accomplished your objective.
Using PyPDF2 to merge files into multiple output files
I believe the trick is:
merger.append(input)

merging pdf files with pypdf

I am writing a script that parses an internet site (maya.tase.co.il) for links, downloads pdf file and merges them. It works mostly, but merging gives me different kinds of errors depending on the file. I cant seem to figure out why. I cut out the relevant code and built a test only for two specific files that are causing a problem. The script uses pypdf, but I am willing to try anything that works. Some files are encrypted, some are not.
def is_incry(pdf):
from pyPdf import PdfFileWriter, PdfFileReader
input=PdfFileReader(pdf)
try:
input.getNumPages()
return input
except:
input.decrypt("")
return input
def merg_pdf(to_keep,to_lose):
import os
from pyPdf import PdfFileWriter, PdfFileReader
if os.path.exists(to_keep):
in1=file(to_keep, "rb")
in2=file(to_lose, "rb")
input1 = is_incry(in1)
input2 = is_incry(in2)
output = PdfFileWriter()
loop1=input1.getNumPages()
for i in range(0,loop1):
output.addPage(input1.getPage(i))#
loop2=input2.getNumPages()
for i in range(0,loop2):
output.addPage(input2.getPage(i))#
outputStream = file("document-output.pdf", "wb")
output.write(outputStream)
outputStream.close()
pdflen=loop1+loop2
in1.close()
in2.close()
os.remove(to_lose)
os.remove(to_keep)
os.rename("document-output.pdf",to_keep)
else:
os.rename(to_lose,to_keep)
in1=file(to_keep, "rb")
input1 = PdfFileReader(in1)
try:
pdflen=input1.getNumPages()
except:
input1.decrypt("")
pdflen=input1.getNumPages()
in1.close()
#input1.close()
return pdflen
def test():
import urllib
urllib.urlretrieve ('http://mayafiles.tase.co.il/RPdf/487001-488000/P487028-01.pdf', 'temp1.pdf')
urllib.urlretrieve ('http://mayafiles.tase.co.il/RPdf/488001-489000/P488170-00.pdf', 'temp2.pdf')
merg_pdf('temp1.pdf','temp2.pdf')
test()
I thank anyone that even took the time to read this.
Al.
I once wrote a complex PDF generation/merging stuff which I have now open-sourced.
You can have a look at it: https://github.com/becomingGuru/nikecup/blob/master/reg/models.py#L71
def merge_pdf(self):
from pyPdf import PdfFileReader,PdfFileWriter
pdf_file = file_names['main_pdf']%settings.MEDIA_ROOT
pdf_obj = PdfFileReader(open(pdf_file))
values_page = PdfFileReader(open(self.make_pdf())).getPage(0)
mergepage = pdf_obj.pages[0]
mergepage.mergePage(values_page)
signed_pdf = PdfFileWriter()
for page in pdf_obj.pages:
signed_pdf.addPage(page)
signed_pdf_name = file_names['dl_done']%(settings.MEDIA_ROOT,self.phash)
signed_pdf_file = open(signed_pdf_name,mode='wb')
signed_pdf.write(signed_pdf_file)
signed_pdf_file.close()
return signed_pdf_name
It then works like a charm. Hope it helps.
I tried the documents with pyPdf - it looks like both the pdfs in this document are encrypted, and a blank password ("") is not valid.
Looking at the security settings in adobe, you're allowed to print and copy - however, when you run pyPdf's input.decrypt(""), the files are still encrypted (as in, input.getNumPages() after this still returns 0).
I was able to open this in OSX's preview, and save the pdfs without encryption, and then the assembly works fine. pyPdf deals in pages though, so I don't think you can do this through pyPdf. Either find the correct password, or perhaps use another application, such as using a batch job to OpenOffice, or another pdf plugin would work.

Categories