How to turn a large PDF's pages to images efficiently? [duplicate]

I would like to take a multi-page pdf file and create separate pdf files per page.
I have downloaded reportlab and have browsed the documentation, but it seems aimed at pdf generation. I haven't yet seen anything about processing PDF files themselves.
Is there an easy way to do this in python?

from PyPDF2 import PdfWriter, PdfReader
inputpdf = PdfReader(open("document.pdf", "rb"))
for i in range(len(inputpdf.pages)):
    output = PdfWriter()
    output.add_page(inputpdf.pages[i])
    with open("document-page%s.pdf" % i, "wb") as outputStream:
        output.write(outputStream)
etc.

None of the answers here split the PDF into two parts that together contain all the pages, so I am adding my solution in case somebody is looking for the same thing:
from PyPDF2 import PdfFileWriter, PdfFileReader
def split_pdf_to_two(filename, page_number):
    pdf_reader = PdfFileReader(open(filename, "rb"))
    try:
        assert page_number < pdf_reader.numPages
        pdf_writer1 = PdfFileWriter()
        pdf_writer2 = PdfFileWriter()
        # Pages before page_number go to the first part
        for page in range(page_number):
            pdf_writer1.addPage(pdf_reader.getPage(page))
        # The remaining pages go to the second part
        for page in range(page_number, pdf_reader.getNumPages()):
            pdf_writer2.addPage(pdf_reader.getPage(page))
        with open("part1.pdf", 'wb') as file1:
            pdf_writer1.write(file1)
        with open("part2.pdf", 'wb') as file2:
            pdf_writer2.write(file2)
    except AssertionError as e:
        print("Error: The PDF you are cutting has fewer pages than you want to cut!")
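The index arithmetic behind a two-way split can be checked without any PDF library. The sketch below is only an illustration: the helper name `split_ranges` is hypothetical, and it raises `ValueError` instead of using `assert`. Unlike the function above, it also rejects a split point of 0, on the assumption that an empty first part is not wanted.

```python
def split_ranges(num_pages, split_at):
    """Return the page-index ranges for the two parts of a document.

    split_at is the zero-based index of the first page of the second part.
    """
    if not 0 < split_at < num_pages:
        raise ValueError(
            f"split_at must be between 1 and {num_pages - 1}, got {split_at}"
        )
    return range(0, split_at), range(split_at, num_pages)


first, second = split_ranges(num_pages=10, split_at=4)
print(list(first))   # indices for part 1
print(list(second))  # indices for part 2
```

Each range can then drive the loop that copies pages into its own writer.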

The PyPDF2 package gives you the ability to split up a single PDF into multiple ones.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
path = 'document.pdf'  # assumed input path; the original snippet left it undefined
fname = os.path.splitext(os.path.basename(path))[0]  # base name used for output files
pdf = PdfFileReader(path)
for page in range(pdf.getNumPages()):
    pdf_writer = PdfFileWriter()
    pdf_writer.addPage(pdf.getPage(page))
    output_filename = '{}_page_{}.pdf'.format(fname, page+1)
    with open(output_filename, 'wb') as out:
        pdf_writer.write(out)
    print('Created: {}'.format(output_filename))
Source: https://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/

I know this is not Python, but I wanted to post this piece of R code because it is simple and flexible. The pdftools package in R makes splitting and merging PDFs easy.
library(pdftools) # R package
pdf_subset('D:\\file\\20.02.20\\22 GT 2017.pdf',
           pages = 1:51, output = "subset.pdf")

The earlier PyPDF2 answers for splitting PDFs no longer work with the latest release. The authors recommend using pypdf instead, and PyPDF2==3.0.1 will be the last version of PyPDF2. The function needs to be modified as follows:
import os
from PyPDF2 import PdfReader, PdfWriter
def split_pdfs(input_file_path):
    inputpdf = PdfReader(open(input_file_path, "rb"))
    out_paths = []
    if not os.path.exists("outputs"):
        os.makedirs("outputs")
    for i, page in enumerate(inputpdf.pages):
        output = PdfWriter()
        output.add_page(page)
        out_file_path = f"outputs/{input_file_path[:-4]}_{i}.pdf"
        with open(out_file_path, "wb") as output_stream:
            output.write(output_stream)
        out_paths.append(out_file_path)
    return out_paths
Note: The same function will work with pypdf as well. Import PdfReader and PdfWriter from pypdf rather than PyPDF2.
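For reference, here is a rough sketch of the same per-page split written against pypdf. The function names `page_file_name` and `split_pdf`, the zero-padded naming scheme, and the `outputs` default are my own assumptions, not part of any answer above. The pypdf import sits inside the function only so that the naming helper can be tried without the library installed.

```python
from pathlib import Path


def page_file_name(stem, index, total):
    """Build a zero-padded, one-based file name such as 'report_007.pdf'."""
    width = len(str(total))
    return f"{stem}_{index + 1:0{width}d}.pdf"


def split_pdf(input_path, out_dir="outputs"):
    """Write every page of input_path to its own file and return the paths."""
    # Imported here so the naming helper above is usable without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    input_path = Path(input_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    reader = PdfReader(input_path)
    total = len(reader.pages)
    out_paths = []
    for index, page in enumerate(reader.pages):
        writer = PdfWriter()
        writer.add_page(page)
        out_path = out_dir / page_file_name(input_path.stem, index, total)
        with open(out_path, "wb") as stream:
            writer.write(stream)
        out_paths.append(out_path)
    return out_paths


print(page_file_name("report", 6, 120))
```

Zero-padding keeps the output files in page order when a file manager sorts them by name, and using the path's stem avoids problems when the input path contains directories.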

import fitz
src = fitz.open("source.pdf")
for page in src:
    tar = fitz.open() # output PDF for 1 page
    # copy over current page
    tar.insert_pdf(src, from_page=page.number, to_page=page.number)
    tar.save(f"page-{page.number}.pdf")
    tar.close()

Related

PyPDF2 PdfFileMerger.Merge Class

I am trying to use this class to append a page from one PDF to another by specifying the page position.
Does anyone have experience with this? I couldn't find any example of using PdfFileMerger.merge on the internet.
with open(orig_pdf, 'rb') as orig, open(amend_pdf, 'rb') as new:
    pdf = PdfFileMerger()
    pdf.merge(2, new)
    pdf.write('.pdf')

Consider using merge and passing the position, which is the page number after which the new PDF should be inserted.
There are (probably) many ways of achieving the same result. Here's a basic working example:
from PyPDF2 import PdfFileMerger, PdfFileReader
orig_pdf = r'C:\temp\old.pdf'
amend_pdf = r'C:\temp\new.pdf'
with open(orig_pdf, 'rb') as orig, open(amend_pdf, 'rb') as new:
    merger = PdfFileMerger()
    merger.append(PdfFileReader(orig))
    # Add amend_pdf after page 2
    merger.merge(2, PdfFileReader(new))
    merger.write("results.pdf")
For more info, have a look at the official documentation https://pythonhosted.org/PyPDF2/PdfFileMerger.html
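To see what the position argument does to page order, it can help to model the pages as a plain Python list. This is only a model of the behaviour described above, not PyPDF2 code, and `merge_at` is a hypothetical helper name.

```python
def merge_at(existing_pages, position, new_pages):
    """Model of inserting new_pages into existing_pages at a page position."""
    return existing_pages[:position] + new_pages + existing_pages[position:]


original = ["old-1", "old-2", "old-3", "old-4"]
amendment = ["new-1", "new-2"]
print(merge_at(original, 2, amendment))
```

With position 2, the new pages land after the second original page, which matches the "after page 2" comment in the example.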

Applying PDF watermark via a for loop

I am working through the book 'Automate the boring stuff with Python' and I am trying to run the code to watermark a .pdf on all pages but the watermark only appears on the first page.
So the problem must either be in the loop or in the writing. Can anyone help me figure it out? Thank you
Running Python 3.5.0 on a windows 7 machine.
Code below:
import PyPDF2
minutesFile = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(minutesFile)
minutesFirstPage = pdfReader.getPage(0)
pdfWatermarkReader = PyPDF2.PdfFileReader(open('watermark.pdf', 'rb'))
minutesFirstPage.mergePage(pdfWatermarkReader.getPage(0))
pdfWriter = PyPDF2.PdfFileWriter()
pdfWriter.addPage(minutesFirstPage)
for pageNum in range(1, pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pdfWriter.addPage(pageObj)
resultPdfFile = open('watermarkedCover.pdf', 'wb')
pdfWriter.write(resultPdfFile)
minutesFile.close()
resultPdfFile.close()

import PyPDF2
template = PyPDF2.PdfFileReader(open('yourfile.pdf', 'rb'))
watermark = PyPDF2.PdfFileReader(open('watermarkfile.pdf', 'rb'))
output = PyPDF2.PdfFileWriter()
# Merge the watermark onto every page, not just the first one
for i in range(template.getNumPages()):
    page = template.getPage(i)
    page.mergePage(watermark.getPage(0))
    output.addPage(page)
with open('water_marked.pdf', 'wb') as file:
    output.write(file)

The fact that "the watermark only appears on the first page" isn't a bug, it's exactly what this code was designed to do. I see no attempt to modify the code to add the watermark to every page. Be honest about the situation and what effort you've put in to change it, even if "None".
Here's my rework of the code to watermark every page:
import PyPDF2
watermarkFile = open('watermark.pdf', 'rb')
pdfWatermarkReader = PyPDF2.PdfFileReader(watermarkFile)
minutesFile = open('meetingminutes.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(minutesFile)
pdfWriter = PyPDF2.PdfFileWriter()
for pageNum in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(pageNum)
    pageObj.mergePage(pdfWatermarkReader.getPage(0))
    pdfWriter.addPage(pageObj)
resultPdfFile = open('watermarkedCover.pdf', 'wb')
pdfWriter.write(resultPdfFile)
watermarkFile.close()
minutesFile.close()
resultPdfFile.close()
Try it out and see if it does what you want.

Reading pdf files line by line using python

I used the following code to read the PDF file, but it returns no text. What could be the reason?
from PyPDF2 import PdfFileReader
reader = PdfFileReader("example.pdf")
contents = reader.pages[0].extractText().split("\n")
print(contents)
The output is [u''] instead of reading the content.

import re
from PyPDF2 import PdfFileReader
reader = PdfFileReader("example.pdf")
for page in reader.pages:
    text = page.extractText()
    text_lower = text.lower()
    # Split into lines; iterating over the string itself would yield single characters
    for line in text_lower.splitlines():
        if re.search("abc", line):
            print(line)
I use it to iterate over a PDF page by page, search each line for key terms, and process the matches further.
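One detail worth checking separately from the PDF library: looping over a string visits single characters, so a line-based search has to split the text first. A minimal sketch, where `find_matching_lines` and the sample text are made up for illustration:

```python
import re


def find_matching_lines(text, pattern):
    """Return the lines of text that contain a match for pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [line for line in text.splitlines() if regex.search(line)]


sample = "Intro\nThe ABC protocol\nnothing here\nabc again"
print(find_matching_lines(sample, "abc"))

# Iterating the string directly yields characters, not lines:
print(list("ab\ncd"))
```

The text returned by a page's extraction call can be passed in place of `sample`.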

Maybe this can help you read the PDF.
import pyPdf
def getPDFContent(path):
    content = ""
    pages = 10  # read at most the first 10 pages
    p = open(path, "rb")
    pdf_content = pyPdf.PdfFileReader(p)
    # Do not read past the last page of shorter documents
    for i in range(0, min(pages, pdf_content.getNumPages())):
        content += pdf_content.getPage(i).extractText() + "\n"
    p.close()
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

I think you need to specify the drive letter; it's missing from your path. For example, "D:/Users/Rahul/Desktop/Dfiles/106_2015_34-76357.pdf". I tried it and could read the file without any problem.
Or, if you want to find the file path using the os module, you can try the following:
from PyPDF2 import PdfFileReader
import os
def find(name, path):
    for root, dirs, files in os.walk(path):
        if name in files:
            return os.path.join(root, name)
directory = find('106_2015_34-76357.pdf', 'D:/Users/Rahul/Desktop/Dfiles/')
f = open(directory, 'rb')
reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')
f.close()
print(contents)
The find function comes from Nadia Alramli's answer to "Find a file in python".

To read the files from multiple folders in a directory, the code below can be used.
This example is for reading PDF files:
import os
from tika import parser
path = "/usr/local/"  # path directory
directory = os.path.join(path)
for r, d, f in os.walk(directory):  # going through subdirectories
    for file in f:
        if file.lower().endswith(".pdf"):  # reading only PDF files
            file_join = os.path.join(r, file)  # getting full path
            file_data = parser.from_file(file_join)  # parsing the PDF file
            text = file_data['content']  # read the content
            print(text)  # print the content
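The directory walk itself can also be done with pathlib from the standard library. The sketch below only collects the PDF paths and leaves the parsing to whichever library is used; `find_pdfs` is a hypothetical helper, and the temporary directory exists only to make the example self-contained.

```python
import tempfile
from pathlib import Path


def find_pdfs(root):
    """Return all PDF files below root, sorted, matching the suffix case-insensitively."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Small throwaway tree to show the behaviour
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "sub" / "a.PDF").write_bytes(b"")
    (Path(tmp) / "notes.txt").write_text("x")
    print([p.name for p in find_pdfs(tmp)])
```

Comparing the lowercased suffix avoids matching names such as `report.pdf.bak`, which a plain substring test would accept.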

def getTextPDF(pdfFileName, password=''):
    """ Extract Text from pdf """
    import PyPDF2
    from nltk import sent_tokenize
    pdf_file = open(pdfFileName, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    if password != '':
        read_pdf.decrypt(password)
    text = []
    for i in range(0, read_pdf.getNumPages()):
        text.append(read_pdf.getPage(i).extractText())
    pdf_file.close()
    text = '\n'.join(text).replace("\n", '')
    text = sent_tokenize(text)
    return text

The issue was one of two things: (1) The text was not on page one - hence a user error. (2) PyPDF2 failed to extract the text - hence a bug in PyPDF2.
Sadly, the second one still happens for some PDFs.
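To tell the two cases apart, one option is to extract every page and look for the first one that yields any text. A small sketch of that check follows; the helper name is hypothetical, and the list stands in for the strings a reader's per-page extraction would return.

```python
def first_page_with_text(page_texts):
    """Return the index of the first page whose extracted text is not blank, or None."""
    for index, text in enumerate(page_texts):
        if text and text.strip():
            return index
    return None


# Stand-in for the text extracted from each page of a document
extracted = ["", "  \n", "Chapter 1\nIt was a dark night."]
print(first_page_with_text(extracted))
```

If this returns None for a PDF that visibly contains text, the extraction itself is failing, which points to case (2).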

Hello Rahul Pipalia,
If PyPDF2 is not installed in your Python environment, install it first and then use the module.
Installation steps for Ubuntu (install python-pypdf)
First, open a terminal.
Then type sudo apt-get install python-pypdf
Solution to your problem
Try the code below:
# Import library
import PyPDF2
# Give the name of the file you want to read, including the ".pdf" extension
pdf_file = open('Your_Pdf_File_Name.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
# Page to read, as a zero-based index below number_of_pages (0 is the first page)
page_number = 0
page = read_pdf.getPage(page_number)
page_content = page.extractText()
# Display content of the pdf
print(page_content)
Download the PDF from the link below and try this code:
https://www.dropbox.com/s/4qad66r2361hvmu/sample.pdf?dl=1
I hope my answer is helpful.
If you have any questions, please comment.

How to append PDF pages using PyPDF2

Does anybody have experience merging two pages of a PDF file into one using the Python library PyPDF2?
When I try page1.mergePage(page2), the result is page2 overlaid on page1. How can I add page2 to the bottom of page1 instead?

While searching the web for Python PDF merging solutions, I noticed that there's a general misconception about merging versus appending.
Most people call the appending action a merge, but it's not. What you're describing in your question is really the intended use of mergePage, which should be called applyPageOnTopOfAnother, but that's a little long. What you are (were) looking for is really appending two files/pages into a new file.
Appending PDF files
Using the PdfFileMerger class and its append method.
Identical to the merge() method, but assumes you want to concatenate
all pages onto the end of the file instead of specifying a position.
Here's one way to do it taken from pypdf Merging multiple pdf files into one pdf:
from PyPDF2 import PdfFileMerger, PdfFileReader
# ...
merger = PdfFileMerger()
merger.append(PdfFileReader(open(filename1, 'rb')))
merger.append(PdfFileReader(open(filename2, 'rb')))
merger.write("document-output.pdf")
Appending specific PDF pages
And to append specific pages of different PDF files, use the PdfFileWriter class with the addPage method.
Adds a page to this PDF file. The page is usually acquired from a
PdfFileReader instance.
file1 = PdfFileReader(open(filename1, "rb"))
file2 = PdfFileReader(open(filename2, "rb"))
output = PdfFileWriter()
output.addPage(file1.getPage(specificPageIndex))
output.addPage(file2.getPage(specificPageIndex))
with open("document-output.pdf", "wb") as outputStream:
    output.write(outputStream)
Merging two pages into one page
Using mergePage
Merges the content streams of two pages into one. Resource references
(i.e. fonts) are maintained from both pages. The mediabox/cropbox/etc
of this page are not altered. The parameter page’s content stream will
be added to the end of this page’s content stream, meaning that it
will be drawn after, or “on top” of this page.
file1 = PdfFileReader(open(filename1, "rb"))
file2 = PdfFileReader(open(filename2, "rb"))
output = PdfFileWriter()
page = file1.getPage(specificPageIndex)
# Draw the second page's content on top of the first page
page.mergePage(file2.getPage(specificPageIndex))
output.addPage(page)
with open("document-output.pdf", "wb") as outputStream:
    output.write(outputStream)

If the two PDFs do not exist on your local machine and are instead normally accessed or downloaded via a URL (e.g. http://foo/bar.pdf and http://bar/foo.pdf), we can fetch both PDFs from their remote locations and merge them in memory in one fell swoop.
This eliminates the assumed step of downloading the PDFs to disk first, and generalizes the solution beyond the simple case of both PDFs existing on disk to any HTTP-accessible PDF.
The example:
import io
import requests
from PyPDF2 import PdfFileMerger, PdfFileReader
pdf_content_1 = requests.get('http://foo/bar.pdf').content
pdf_content_2 = requests.get('http://bar/foo.pdf').content
# Wrap the downloaded bytes in in-memory file-like buffers
pdf_buffer_1 = io.BytesIO(pdf_content_1)
pdf_buffer_2 = io.BytesIO(pdf_content_2)
pdf_merged_buffer = io.BytesIO()
merger = PdfFileMerger()
merger.append(PdfFileReader(pdf_buffer_1))
merger.append(PdfFileReader(pdf_buffer_2))
merger.write(pdf_merged_buffer)
# Option 1:
# Return the content of the buffer in an HTTP response (Flask example below;
# this part belongs inside a view function, with make_response imported from flask)
response = make_response(pdf_merged_buffer.getvalue())
# Set headers so web-browser knows to render results as PDF
response.headers['Content-Type'] = 'application/pdf'
response.headers['Content-Disposition'] = \
    'attachment; filename=%s.pdf' % 'Merged PDF'
return response
# Option 2: Write to disk (binary mode, since the buffer holds bytes)
with open("merged_pdf.pdf", "wb") as fp:
    fp.write(pdf_merged_buffer.getvalue())
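A note on the in-memory buffers, since this is easy to get wrong: `write()` on a buffer returns the number of bytes written rather than the buffer, so a reader needs the buffer object itself. The standard-library sketch below shows the difference; the byte string is a made-up stand-in for downloaded PDF content.

```python
import io

pdf_bytes = b"%PDF-1.4 example payload"  # stand-in for requests.get(url).content

# write() returns the number of bytes written, not the buffer itself
written = io.BytesIO().write(pdf_bytes)
print(written)

# Passing the bytes to the constructor yields a buffer positioned at the start
buffer = io.BytesIO(pdf_bytes)
print(buffer.read(4))
```

A buffer built this way can be handed to any reader that accepts a file-like object opened in binary mode.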

Did it this way:
import PyPDF2
reader = PyPDF2.PdfFileReader(open("input.pdf",'rb'))
NUM_OF_PAGES = reader.getNumPages()
page0 = reader.getPage(0)
h = page0.mediaBox.getHeight()
w = page0.mediaBox.getWidth()
# One blank page tall enough to hold every input page stacked vertically
newpdf_page = PyPDF2.pdf.PageObject.createBlankPage(None, w, h*NUM_OF_PAGES)
for i in range(NUM_OF_PAGES):
    next_page = reader.getPage(i)
    # Page 0 goes at the top, so it gets the largest vertical offset
    newpdf_page.mergeScaledTranslatedPage(next_page, 1, 0, h*(NUM_OF_PAGES-i-1))
writer = PyPDF2.PdfFileWriter()
writer.addPage(newpdf_page)
with open('output.pdf', 'wb') as f:
    writer.write(f)
It works when every page has the same height and width. Otherwise, it needs some modifications.
Maybe Emile Bergeron's solution is better. I didn't try it.
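For pages of different sizes, the main modification is the offset arithmetic: the tall page needs the sum of all heights, and each page sits above everything that follows it. Here is a library-independent sketch of that calculation. `stack_layout` is a hypothetical helper and the sizes are example values in points; the resulting offsets would replace `h*(NUM_OF_PAGES-i-1)` in the loop above.

```python
def stack_layout(page_sizes):
    """Compute a vertical stacking layout for pages given as (width, height) pairs.

    Returns (canvas_width, canvas_height, y_offsets), where y_offsets[i] is the
    distance from the bottom of the canvas to the bottom edge of page i.
    PDF coordinates start at the bottom-left, so the first page gets the
    largest offset.
    """
    canvas_width = max(width for width, _ in page_sizes)
    canvas_height = sum(height for _, height in page_sizes)
    y_offsets = []
    remaining = canvas_height
    for _, height in page_sizes:
        remaining -= height
        y_offsets.append(remaining)
    return canvas_width, canvas_height, y_offsets


# Two portrait pages followed by one landscape page, in points
sizes = [(595, 842), (595, 842), (842, 595)]
print(stack_layout(sizes))
```

When all pages share one height, the offsets reduce to the same values as the original formula.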

The pdfrw library can do this. There is a 4up example in the examples directory that places 4 input pages on every output page, and a booklet example that takes 8.5x11 input and creates 11x17 output. Disclaimer -- I am the pdfrw author.

The code posted in this following link accomplished your objective.
Using PyPDF2 to merge files into multiple output files
I believe the trick is:
merger.append(input)

split a multi-page pdf file into multiple pdf files with python?

Updated solution for the latest release of PyPDF2 (3.0.0), splitting out a range of pages.
from PyPDF2 import PdfReader, PdfWriter
file_name = r'c:\temp\junk.pdf'
pages = (121, 130)  # first and last page to keep, one-based and inclusive
reader = PdfReader(file_name)
writer = PdfWriter()
page_range = range(pages[0], pages[1] + 1)
for page_num, page in enumerate(reader.pages, 1):
    if page_num in page_range:
        writer.add_page(page)
with open(f'{file_name}_page_{pages[0]}-{pages[1]}.pdf', 'wb') as out:
    writer.write(out)
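The one-based, inclusive range can also be written as a slice, which avoids visiting pages outside the range. A small sketch using a plain list as a stand-in for the page sequence; `select_pages` is a hypothetical helper, and it assumes the sequence supports slicing the way Python lists do.

```python
def select_pages(pages, first, last):
    """Return pages first..last, counted from 1 and inclusive on both ends."""
    if first < 1 or last < first:
        raise ValueError(f"invalid page range {first}-{last}")
    return list(pages[first - 1:last])


document = [f"page {n}" for n in range(1, 11)]  # stand-in for the reader's pages
print(select_pages(document, 3, 5))
```

A range that runs past the end of the document is simply cut off at the last page.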
