I am using Python to crop PDF pages.
Everything works fine, but how do I change the page size (width)?
This is my crop code:
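from PyPDF2 import PdfFileReader, PdfFileWriter

reader = PdfFileReader(open('my.pdf', 'rb'))
output = PdfFileWriter()
p = reader.getPage(1)  # second page; page indices start at 0
(w, h) = p.mediaBox.upperRight
p.mediaBox.upperRight = (w / 4, h)  # shrink the box to a quarter of its width
output.addPage(p)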
When I crop pages, I need to resize them as well. How can I do this?
This answer is long overdue, and older versions of PyPDF2 may not have had this functionality, but it's actually quite simple with version 1.26.0:
import PyPDF2

pdf_path = "YOUR PDF FILE PATH.pdf"
reader = PyPDF2.PdfFileReader(pdf_path)
page0 = reader.getPage(0)
page0.scaleBy(0.5)  # float scale factor; the page is modified in place
writer = PyPDF2.PdfFileWriter()  # create a writer to save the updated result
writer.addPage(page0)
with open("YOUR OUTPUT PDF FILE PATH.pdf", "wb") as f:
    writer.write(f)
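If the goal is a specific width rather than a fixed factor, the value passed to scaleBy can be derived from the page's current width. A minimal sketch in plain Python; the helper name `scale_factor_for_width` and the example widths (153 pt and 595 pt) are illustrative assumptions, not part of PyPDF2:

```python
def scale_factor_for_width(current_width, target_width):
    """Return the uniform factor that maps current_width onto target_width."""
    current_width = float(current_width)
    if current_width <= 0:
        raise ValueError("current_width must be positive")
    return float(target_width) / current_width


# Hypothetical numbers: a page cropped to 153 pt wide, to be brought up to 595 pt
factor = scale_factor_for_width(153, 595)
print(round(factor, 3))
```

The result could then be passed to page0.scaleBy(factor); the current width would come from the page's mediaBox (for example page0.mediaBox.getWidth()).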
Do you want to scale the image after you crop it? You can use p.scale(factor_x, factor_y) to do that.
You can also apply scaling, rotation or translation directly in the merge function call, by using the functions:
mergePage()
mergeRotatedPage()
mergeRotatedScaledPage()
mergeRotatedScaledTranslatedPage()
mergeScaledPage()
mergeScaledTranslatedPage()
mergeTransformedPage()
mergeTranslatedPage()
Or use addTransformation() on a page object.
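The merge variants above and addTransformation() all come down to a PDF transformation matrix of six numbers (a, b, c, d, e, f). As a rough illustration of what a scale-then-translate matrix does to a point, here is a sketch in plain Python; the helper names are made up for this example and are not PyPDF2 functions:

```python
def scale_translate_matrix(scale, tx, ty):
    """Six-number PDF matrix (a, b, c, d, e, f): scale first, then translate."""
    return (scale, 0.0, 0.0, scale, tx, ty)


def apply_matrix(matrix, x, y):
    """Map a point through the matrix: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# The point (10, 20) after scaling by 0.5 and moving by (100, 200)
m = scale_translate_matrix(0.5, 100, 200)
print(apply_matrix(m, 10, 20))
```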
I have lots of PDF files, each embedded with multiple images that need to be rotated.
I know I can extract the images, rotate them, and then reconstruct the PDF, but is there any way to add a PDF command so that the images rotate in place?
Ideally, a PDF library in Python that will allow me to do that.
Edit:
One important detail I would like to add is that each page can have multiple images, and each image needs to be rotated by a different angle. Think of the task as straightening the images in a PDF.
I would like to answer your question. The following rotates every page as a whole (not the individual images) by 180 degrees:
import PyPDF2

with open('original.pdf', 'rb') as pdf_in:
    pdf_reader = PyPDF2.PdfFileReader(pdf_in)
    pdf_writer = PyPDF2.PdfFileWriter()
    for pagenum in range(pdf_reader.numPages):
        page = pdf_reader.getPage(pagenum)
        page.rotateClockwise(180)  # angle in degrees; must be a multiple of 90
        pdf_writer.addPage(page)
    with open('rotated.pdf', 'wb') as pdf_out:
        pdf_writer.write(pdf_out)
Hope this solves your problem.
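Note that rotateClockwise turns whole pages and only in 90-degree steps, so it does not by itself straighten individual images. For an arbitrary angle, PDF expresses a rotation as a six-number matrix; the sketch below only shows the arithmetic for rotating about a chosen center point, in plain Python. The function names are hypothetical, and attaching such a matrix to one particular image inside a page's content stream depends on the library and is not shown:

```python
import math


def rotation_about(cx, cy, degrees):
    """Matrix (a, b, c, d, e, f) for a counter-clockwise rotation about (cx, cy)."""
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    e = cx - cos_t * cx + sin_t * cy
    f = cy - sin_t * cx - cos_t * cy
    return (cos_t, sin_t, -sin_t, cos_t, e, f)


def apply_matrix(matrix, x, y):
    """Map a point through the matrix: x' = a*x + c*y + e, y' = b*x + d*y + f."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


# The center stays fixed; other points move around it.
m = rotation_about(50, 50, 90)
center = apply_matrix(m, 50, 50)
corner = apply_matrix(m, 60, 50)
```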
I use FPDF to generate a PDF with Python, and I have a problem I am looking for a solution to. In a folder "images", there are pictures that I would like to display, each on ONE page. I did that, though maybe not elegantly. Unfortunately, I can't move the picture to the right. It looks like pdf_set_y won't work in the loop.
from fpdf import FPDF
from os import listdir
from os.path import isfile, join

onlyfiles = [f for f in listdir('../images') if isfile(join('../images', f))]
pdf = FPDF('L')
pdf.add_page()
pdf.set_font('Arial', 'B', 16)
onlyfiles = ['VI 1.png', 'VI 2.png', 'VI 3.png']
y = 10
for image in onlyfiles:
    pdf.set_x(10)
    pdf.set_y(y)
    pdf.cell(10, 10, 'please help me' + image, 0, 0, 'L')
    y = y + 210  # to go to the next page
    pdf.set_x(120)
    pdf.set_y(50)
    pdf.image('../images/' + image, w=pdf.w/2.0, h=pdf.h/2.0)
pdf.output('problem.pdf', 'F')
It would be great if you have a solution or some help for me. Thanks a lot.
greets alex
I see the issue. You want to specify the x and y location in the call to pdf.image(). That assessment is based on the documentation for image here: https://pyfpdf.readthedocs.io/en/latest/reference/image/index.html (In PyFPDF, set_y() also resets x to the left margin, which is why the set_x(120) call just before it has no effect.)
So you can do this instead (showing just the for loop here):
for image in onlyfiles:
    pdf.set_x(10)
    pdf.set_y(y)
    pdf.cell(10, 10, 'please help me' + image, 0, 0, 'L')
    y = y + 210  # to go to the next page
    # increase `x` from 120 to, say, 150 to move the image to the right
    pdf.image('../images/' + image, x=120, y=50, w=pdf.w/2.0, h=pdf.h/2.0)
    # new ->                        ^^^^^  ^^^^
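To pick an x value systematically instead of by trial and error, the offset can be computed from the page width (pdf.w in the code above) and the image width. A small sketch in plain Python; `centered_x` and `right_aligned_x` are hypothetical helpers, and 297 is assumed as the width in millimetres of a landscape A4 page:

```python
def centered_x(page_width, image_width):
    """x offset that centers an image of image_width on the page."""
    return (page_width - image_width) / 2.0


def right_aligned_x(page_width, image_width, margin=10.0):
    """x offset that places the image against the right margin."""
    return page_width - image_width - margin


page_w = 297.0          # assumed: landscape A4 width in mm
image_w = page_w / 2.0  # same width the loop above passes as w
print(centered_x(page_w, image_w), right_aligned_x(page_w, image_w))
```

Either value could be passed as x= in the pdf.image() call.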
You can check out the pdfme library. It's a powerful Python library for creating PDF documents. You can add URLs, footnotes, headers and footers, tables, and anything else you would need in a PDF document.
The only problem I see is that pdfme currently supports only the JPG format for images. But if that's not a problem, it will help you with your task.
Check the docs here.
Disclaimer: I am the author of pText, the library I will use in this solution.
Let's start by creating an empty Document:
pdf: Document = Document()
# create empty page
page: Page = Page()
# add page to document
pdf.append_page(page)
Next we are going to load an Image using Pillow
import requests
from PIL import Image as PILImage
im = PILImage.open(
    requests.get(
        "https://365psd.com/images/listing/fbe/ups-logo-49752.jpg", stream=True
    ).raw
)
Now we can create a LayoutElement Image and use its layout method:
from decimal import Decimal
Image(im).layout(
    page,
    bounding_box=Rectangle(Decimal(20), Decimal(724), Decimal(64), Decimal(64)),
)
Keep in mind the origin (in PDF space) is at the lower left corner.
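Because the origin is at the lower left, a position measured from the top of the page has to be flipped before it is used in a Rectangle. A sketch of that conversion, assuming an A4 page that is 842 points high (the helper name is hypothetical and not part of pText):

```python
from decimal import Decimal


def y_from_top(page_height, top_offset, element_height):
    """Lower-left y of an element whose top edge is top_offset below the page top."""
    return page_height - top_offset - element_height


# A 64 pt high image placed 54 pt below the top of an 842 pt high page
y = y_from_top(Decimal(842), Decimal(54), Decimal(64))
print(y)
```

With these assumed numbers the result is 724, the y value used in the bounding box above.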
Lastly, we need to store the Document:
# attempt to store PDF
with open("output.pdf", "wb") as out_file_handle:
    PDF.dumps(out_file_handle, pdf)
You can obtain pText either on GitHub or from PyPI.
There are many more examples; check them out to learn more about working with images.
I'm trying to extract text from an image, either specific fields or the whole text, which I would then parse.
The text in the image is in Hebrew.
I already tried the Tesseract library in Node.js, but it does not recognize Hebrew text well.
I also tried converting the image to a PDF and parsing the PDF, but that does not work well for Hebrew either.
Has anyone already tried to do this, maybe with Python or Node.js?
I'm trying to do something like the text detection in Google Cloud Vision.
Have you tried preprocessing the image you feed to Tesseract? If you haven't, I would try OpenCV line or contour detection, particularly the Hough Line Transform, and then clean the image up a bit. https://www.youtube.com/watch?v=lhMXDqQHf9g&list=PLQVvvaa0QuDeETZEOy4VdocT7TOjfSA8a&index=5 This guy doesn't do exactly what you need, but if you take the time to scroll through a bit, you can see how it can be useful.
Based on our conversation in the comments on the OP, here are some options for you to consider.
Option 1:
If you are working directly with PDFs as your input file:
import fitz  # PyMuPDF

input_file = '/path/to/your/pdfs/'
pdf_file = input_file
doc = fitz.open(pdf_file)
# Newer PyMuPDF releases spell these names page_count, load_page and get_text
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)
    # Returns a list of tuples (x0, y0, x1, y1, "line1\nline2\nline3...", ...)
    pageTextblocks = page.getText('blocks')
    pageTextblocks.sort(key=lambda block: block[3])  # order blocks by bottom edge
    for block in pageTextblocks:
        # The text content of each block; apply your own logic here to pick out the relevant data
        targetBlock = block[4]
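Sorting by block[3] alone orders blocks by their bottom edge; when several blocks share a row, a secondary key on the left edge keeps them in a predictable horizontal order (for right-to-left text such as Hebrew, that secondary key would be reversed). A sketch on hand-written tuples shaped like the ones described above; the sample data and the function name are invented:

```python
# Invented sample blocks in the shape (x0, y0, x1, y1, text)
blocks = [
    (300.0, 100.0, 500.0, 120.0, "second on row 1"),
    (50.0, 200.0, 250.0, 220.0, "row 2"),
    (50.0, 100.0, 250.0, 120.0, "first on row 1"),
]


def reading_order(blocks, right_to_left=False):
    """Sort top-to-bottom by bottom edge, then by horizontal position."""
    def key(block):
        x0, y1 = block[0], block[3]
        return (y1, -x0 if right_to_left else x0)
    return sorted(blocks, key=key)


texts = [b[4] for b in reading_order(blocks)]
print(texts)
```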
Option 2:
If you are working with an image as your input, you need to convert it to a PDF before processing it with the code snippet in Option 1:
doc = fitz.open(input_file)  # open the image file
pdfbytes = doc.convertToPDF()  # convert it to PDF data held in memory
pdf = fitz.open("pdf", pdfbytes)  # open that data as a PDF document
Note that a PDF created this way contains only the picture, without a text layer, so getting text out of it still requires OCR, as in Option 3.
One useful tip for processing images in PyMuPDF is to use a zoom factor for better resolution if the image is hard to recognize:
zoom = 1.2 # scale the image by 120%
mat = fitz.Matrix(zoom,zoom)
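The zoom factor relates directly to resolution: PDF page sizes are measured in points of 1/72 inch, so a zoom of 1.0 corresponds to 72 pixels per inch. A sketch of that arithmetic in plain Python; the helper names are hypothetical, and the pixel size is only an estimate since a renderer may round differently:

```python
def zoom_for_dpi(dpi):
    """Zoom factor that renders a page at roughly the given resolution."""
    return dpi / 72.0


def estimated_pixels(width_pt, height_pt, zoom):
    """Approximate pixel size of a page of the given size in points."""
    return (round(width_pt * zoom), round(height_pt * zoom))


zoom = zoom_for_dpi(300)  # assumption: about 300 dpi is a common target for OCR input
print(zoom, estimated_pixels(595, 842, zoom))
```

The resulting value would be used in place of 1.2 when building fitz.Matrix(zoom, zoom).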
Option 3:
A hybrid approach with PyMuPDF and pytesseract, since you've mentioned Tesseract. I am not sure whether this approach fits your need to extract Hebrew text, but it's an idea. The example works on PDFs.
import fitz
import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'/path/to/your/tesseract/cmd'
input_file = '/path/to/pdfs'
pdf_file = input_file
fullText = ""
doc = fitz.open(pdf_file)
zoom = 1.2
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.pageCount
for pageNo in range(noOfPages):
    page = doc.loadPage(pageNo)  # load one page by index
    pix = page.getPixmap(matrix=mat)
    output = '/path/to/save/image' + str(pageNo) + '.png'
    pix.writePNG(output)
    print('Converting PDFs to Image ... ' + output)
    # For Hebrew, pass lang='heb' if that language data is installed
    text_of_each_page = str(pytesseract.image_to_string(Image.open(output)))
    fullText += text_of_each_page
    fullText += '\n'
Hope this helps. If you need more information about PyMuPDF, click this link and it has a more detailed explanation to fit your needs.
Let's say you have a PDF page with various complex elements inside.
The objective is to crop a region of the page (to extract only one of the elements) and then paste it into another PDF page.
Here is a simplified version of my code:
import PyPDF2

def extract_tree(in_file, out_file):
    with open(in_file, 'rb') as infp:
        # Read the document that contains the tree (in its first page)
        reader = PyPDF2.PdfFileReader(infp)
        page = reader.getPage(0)
        # Crop the tree. Coordinates below are only referential
        page.cropBox.lowerLeft = [100, 200]
        page.cropBox.upperRight = [250, 300]
        # Create an empty document and add a single page containing only the cropped page
        writer = PyPDF2.PdfFileWriter()
        writer.addPage(page)
        with open(out_file, 'wb') as outfp:
            writer.write(outfp)

def insert_tree_into_page(tree_document, text_document):
    # Load the first page of the document containing 'text text text text...'
    text_page = PyPDF2.PdfFileReader(open(text_document, 'rb')).getPage(0)
    # Load the previously cropped tree (cropped using 'extract_tree')
    tree_page = PyPDF2.PdfFileReader(open(tree_document, 'rb')).getPage(0)
    # Overlay the text-page and the tree-crop
    text_page.mergeScaledTranslatedPage(page2=tree_page, scale='1.0', tx='100', ty='200')
    # Save the result into a new empty document
    output = PyPDF2.PdfFileWriter()
    output.addPage(text_page)
    with open('merged_document.pdf', 'wb') as outputStream:
        output.write(outputStream)

# First, crop the tree and save it into cropped_document.pdf
extract_tree('document1.pdf', 'cropped_document.pdf')
# Now merge document2.pdf with cropped_document.pdf
insert_tree_into_page('cropped_document.pdf', 'document2.pdf')
The method "extract_tree" seems to work. It generates a PDF file containing only the cropped region (in the example, the tree).
The problem is that when I try to paste the tree into the new page, the star and the house from the original image are pasted anyway.
I tried something that actually worked. Convert your first output (the PDF containing only the tree) to DOCX, then convert it back from DOCX to PDF before merging it with other PDF pages. It will work (only the tree will be merged).
May I ask how you implemented an interface that defines the bounds of the crop?
I had the exact same issue. In the end, the solution for me was to make a small edit to the source code of PyPDF2 (from this pull request, which never made it into the master branch). What you need to do is insert these lines into the method _mergePage of the class PageObject inside the file pdf.py:
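page2Content = ContentStream(page2Content, self.pdf)
# Rectangle path matching page2's trim box (a list rather than map(), so it also behaves under Python 3)
trim = page2.trimBox
rect = [FloatObject(v) for v in (trim.getLowerLeft_x(), trim.getLowerLeft_y(), trim.getWidth(), trim.getHeight())]
page2Content.operations.insert(0, [rect, "re"])
page2Content.operations.insert(1, [[], "W"])  # use that path as the clipping path
page2Content.operations.insert(2, [[], "n"])  # end the path without painting it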
(see the pull request for exactly where to put them). With that done, you can then crop the section of a pdf you want, and merge it with another page with no issues. There's no need to save the cropped section into a separate pdf, unless you want to.
from PyPDF2 import PdfFileReader, PdfFileWriter
tree_page = PdfFileReader(open('document1.pdf','rb')).getPage(0)
text_page = PdfFileReader(open('document2.pdf','rb')).getPage(0)
tree_page.cropBox.lowerLeft = [100,200]
tree_page.cropBox.upperRight = [250, 300]
text_page.mergeScaledTranslatedPage(page2=tree_page, scale='1.0', tx='100', ty='200')
output = PdfFileWriter()
output.addPage(text_page)
output.write(open('merged_document.pdf', 'wb'))
Maybe there's a better way of doing this that inserts that code without directly editing the source code. I'd be grateful if anyone finds a way to do it as this admittedly is a slightly dodgy hack.
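For reference, the three inserted operations correspond to the PDF content-stream operators re (append a rectangle to the path), W (use the path as the clipping path), and n (end the path without painting). The sketch below just formats that operator sequence as text for a given crop box, in plain Python; the function name is hypothetical and nothing here touches PyPDF2:

```python
def clip_operators(lower_left, upper_right):
    """Return the 're W n' operator sequence that clips to the given box."""
    x0, y0 = lower_left
    x1, y1 = upper_right
    width, height = x1 - x0, y1 - y0
    if width <= 0 or height <= 0:
        raise ValueError("upper_right must lie above and to the right of lower_left")
    return "%g %g %g %g re W n" % (x0, y0, width, height)


# The crop box used in the example above
ops = clip_operators((100, 200), (250, 300))
print(ops)
```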
First post here, although I have already spent days searching through various questions here. Python 3.6, Pillow, and TIFF processing.
I would like to automate one of our manual tasks by resizing some very big images to match A4 format. We work with the TIFF format, which sometimes (often) contains more than one page. So I wrote:
from PIL import Image, ImageSequence
...
def image_resize(path, dcinput, file):
    dcfake = read_config(configlocation)["resize"]["dcfake"]
    try:
        imagehandler = Image.open(path+file)
        imagehandler = imagehandler.resize((2496, 3495), Image.ANTIALIAS)
        imagehandler.save(dcinput+file, optimize=True, quality=95)
    except Exception:
        ...  # handler body not shown in the original post
But the (not so) obvious problem is that only the first page of the TIFF is converted. This is not exactly what I expected from this library. However, I dug around and found a way to enumerate each page of the TIFF and save it as a separate file:
imagehandler = Image.open(path+file)
for i, page in enumerate(ImageSequence.Iterator(imagehandler)):
    page = page.resize((2496, 3495), Image.ANTIALIAS)
    page.save(dcinput + "proces%i.tif" %i, optimize=True, quality=95, save_all=True)
Now I could use ImageMagick or some internal commands to combine the separate pages into one file, but this is not what I want to do, as it complicates the code.
My question: is there a unicorn that can help me with either:
1) resizing all pages of a given multi-page TIFF on the fly
2) building a TIFF from a few TIFFs
I'd like to use only Python modules.
Thx.
Take a look at this example. It will make every page of a TIF file four times smaller (by halving width and height of every page):
from PIL import Image
from PIL import ImageSequence
from PIL import TiffImagePlugin

INFILE = 'multipage_tif_example.tif'
OUTFILE = 'multipage_tif_resized.tif'

print('Resizing TIF pages')
pages = []
imagehandler = Image.open(INFILE)
for page in ImageSequence.Iterator(imagehandler):
    # resize() needs integer sizes, so use floor division
    new_size = (page.size[0] // 2, page.size[1] // 2)
    # LANCZOS is the filter that Image.ANTIALIAS was an alias for (ANTIALIAS was removed in Pillow 10)
    page = page.resize(new_size, Image.LANCZOS)
    pages.append(page)

print('Writing multipage TIF')
with TiffImagePlugin.AppendingTiffWriter(OUTFILE) as tf:
    for page in pages:
        page.save(tf)
        tf.newFrame()
It's supposed to work since late Pillow 3.4.x versions (works with version 5.1.0 on my machine).
Resources:
AppendingTiffWriter discussed here.
Sample TIF files can be downloaded here.
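For the original goal of fitting pages to A4, the target size per page can be computed so that the aspect ratio is preserved instead of halving every page. A sketch in plain Python; the helper names, the 300 dpi default, and the sample sizes are assumptions for illustration, and the resulting tuple is what would be passed to page.resize():

```python
A4_MM = (210, 297)  # A4 paper size in millimetres


def a4_pixels(dpi=300):
    """A4 size in whole pixels at the given resolution."""
    return tuple(round(mm / 25.4 * dpi) for mm in A4_MM)


def fit_within(size, box):
    """Largest integer size with the aspect ratio of `size` that fits in `box`."""
    width, height = size
    scale = min(box[0] / width, box[1] / height)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


target = a4_pixels()
print(target, fit_within((4960, 7016), target))
```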