I have multiple small PDF files (e.g. 3 cm x 2 cm) exported from Adobe InDesign.
I want to compose many of these into one new PDF that has the size of a whole page.
The small PDFs contain a plotter line in a special color which would get lost if I converted them into images.
How can I place these PDFs (at given positions) using Python without losing the special color?
I tried to read up on pypdf, PyPDF2 and reportlab, but I got lost and the examples I found did not work. I do not need the full code; a hint in the right direction would be enough (even in another language if necessary).
Thanks
Try:
cpdf in.pdf -stamp-on stamp.pdf -pos-left "x y" AND -stamp-on stamp2.pdf -pos-left "x2 y2" AND ..... -o out.pdf
where in.pdf is a blank PDF of the appropriate size, x, y, x2, y2, etc. are the required coordinates, and ..... stands for the parts of the command for the third, fourth, etc. stamps.
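With many stamps the command gets long, so it may be easier to generate it. Below is a minimal sketch that only assembles the argument list in the shape shown above; the function name `build_cpdf_command` and the sample coordinates are made up for illustration, and it assumes the coordinates are already in the units cpdf expects.

```python
import shlex

def build_cpdf_command(blank_pdf, stamps, output_pdf):
    """Return the cpdf argument list for stamping several PDFs onto one page.

    stamps is a list of (stamp_path, x, y) tuples; x and y are passed to
    -pos-left unchanged.
    """
    args = ["cpdf", blank_pdf]
    for index, (stamp_path, x, y) in enumerate(stamps):
        if index > 0:
            args.append("AND")  # cpdf chains operations with AND
        args += ["-stamp-on", stamp_path, "-pos-left", f"{x} {y}"]
    args += ["-o", output_pdf]
    return args

# Hypothetical file names and positions
command = build_cpdf_command(
    "in.pdf",
    [("stamp.pdf", 100, 200), ("stamp2.pdf", 300, 200)],
    "out.pdf",
)
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command, check=True)` if cpdf is installed and on the PATH.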
Here is sample code using PyPDF2. Note that it appends the pages of each file one after another; it does not place several PDFs on the same page.
from PyPDF2 import PdfFileMerger
# pdf_files is a list of paths to the PDF files to be merged
# output_pdf is the path of the merged PDF file
merger = PdfFileMerger()
for pdf in pdf_files:
    merger.append(pdf)
merger.write(output_pdf)
merger.close()
Apart from joining / merging pages from different documents into one output PDF, PyMuPDF also lets you embed pages from PDFs into an existing page of some target PDF (as if they were images).
You can select a subrectangle within the source page (a "clip") and the target rectangle of the target page. Multiple embedded source pages in the same target page are also supported.
This embedding maintains all original source page features - it will not be converted to an image.
Scaling between source / target rectangles will be done as necessary.
This snippet will take the full page number 0 of a source PDF and place it inside a rectangle of a page in the target PDF.
The source page will be positioned centered in the target rectangle.
import fitz # import PyMuPDF
source = fitz.open("source.pdf") # source PDF
target = fitz.open("target.pdf") # target PDF
target_page = target[pagenumber] # desired page in target
# if you want a new page, create it like so:
# target_page = target.new_page()
target_rect = fitz.Rect(100, 100, 300, 250) # show source page here
# if you want to cover the full target page, use
# target_rect = target_page.rect
target_page.show_pdf_page(
    target_rect,
    source,     # the source document
    0,          # the source page number
    clip=None,  # any subrectangle on source page
)
# if desired, more pages from other sources can be put on same target page
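For the original question (many small PDFs at given positions on one page), the same call can be repeated once per source file. The following is a sketch under a few assumptions: positions are given in centimetres measured from the top-left corner of the page (PyMuPDF's page coordinates start there), the page defaults to A4, and the names `rect_from_cm` and `compose` are invented for this example.

```python
CM_TO_PT = 72 / 2.54  # PDF units are points: 72 per inch, 2.54 cm per inch

def rect_from_cm(left_cm, top_cm, width_cm, height_cm):
    """Return (x0, y0, x1, y1) in points, measured from the top-left corner."""
    x0 = left_cm * CM_TO_PT
    y0 = top_cm * CM_TO_PT
    return (x0, y0, x0 + width_cm * CM_TO_PT, y0 + height_cm * CM_TO_PT)

def compose(placements, output_path, page_width_cm=21.0, page_height_cm=29.7):
    """Place page 0 of each small PDF on one new page.

    placements is a list of (pdf_path, left_cm, top_cm, width_cm, height_cm).
    """
    import fitz  # PyMuPDF; imported here so rect_from_cm works without it

    target = fitz.open()  # new, empty PDF
    page = target.new_page(
        width=page_width_cm * CM_TO_PT,
        height=page_height_cm * CM_TO_PT,
    )
    sources = []
    for pdf_path, left, top, width, height in placements:
        source = fitz.open(pdf_path)
        sources.append(source)
        rect = fitz.Rect(*rect_from_cm(left, top, width, height))
        page.show_pdf_page(rect, source, 0)  # embedded as vector content
    target.save(output_path)
    for source in sources:
        source.close()

# A 3 cm x 2 cm label placed 1 cm from the left and 1 cm from the top
print(rect_from_cm(1, 1, 3, 2))
```

A call such as `compose([("label1.pdf", 1, 1, 3, 2), ("label2.pdf", 5, 1, 3, 2)], "sheet.pdf")` would then write the combined page; the file names here are placeholders.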
I am using wkhtmltopdf to render a (Django-templated) HTML document to a single-page PDF file. I would like to either render it immediately with the correct height (which I've failed to do so far) or render it incorrectly and trim it. I'm using Python.
Attempt type 1:
wkhtmltopdf render to a very, very long single-page PDF with a lot of extra space using --page-height
Use pdfCropMargins to trim: crop(["-p4", "100", "0", "100", "100", "-a4", "0", "-28", "0", "0", "input.pdf"])
The PDF is rendered perfectly with 28 units of margin at the bottom, but I had to use the filesystem to execute the crop command. It seems that the tool expects an input file and output file, and also creates temporary files midway through. So I can't use it.
Attempt type 2:
wkhtmltopdf render to multi-page PDF with default parameters
Use PyPDF4 (or PyPDF2) to read the file and combine pages into a long, single page
The PDF is rendered fine-ish in most cases, however, sometimes a lot of extra white space can be seen on the bottom if by chance the last PDF page had very little content.
Ideal scenario:
The ideal scenario would involve a function that takes HTML and renders it into a single-page PDF with the expected amount of white space at the bottom. I would be happy with rendering the PDF using wkhtmltopdf, since it returns bytes, and later processing these bytes to remove any extra white space. But I don't want to involve the file system in this, as instead, I want to perform all operations in memory. Perhaps I can somehow inspect the PDF directly and remove the white space manually, or do some HTML magic to determine the render height before-hand?
What am I doing now:
Note that pdfkit is a wkhtmltopdf wrapper
# This is not valid HTML yet (it includes Django-specific template syntax)
template: Template = get_template("some-django-template.html")
# This is now valid HTML
rendered = template.render({
    "foo": "bar",
})
# This first renders the PDF from HTML normally (multiple pages),
# then counts how many pages were created to determine the required single-page height,
# then renders a single-page PDF from HTML using the page height and width arguments
return pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})
It's equivalent to Attempt type 2, except that I don't use PyPDF4 to stitch the pages together; instead I render again with wkhtmltopdf using a precomputed page height.
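To make the arithmetic in that one-liner explicit: the single page is simply made as tall as all the A4 pages of the first render stacked together. A small sketch, where `single_page_options` is a hypothetical helper name and the page count is assumed to come from the first render:

```python
A4_WIDTH_MM = 210
A4_HEIGHT_MM = 297

def single_page_options(num_pages):
    """Return wkhtmltopdf options for one page as tall as num_pages A4 pages."""
    if num_pages < 1:
        raise ValueError("num_pages must be at least 1")
    return {
        "page-height": f"{A4_HEIGHT_MM * num_pages}mm",
        "page-width": f"{A4_WIDTH_MM}mm",
    }

# A document that rendered to three A4 pages needs an 891 mm tall page
print(single_page_options(3))
```

Because the height is always a whole multiple of 297 mm, the leftover white space is whatever was unused on the last page of the first render, which is the effect described above.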
There might be better ways to do this, but this at least works.
I'm assuming that you are able to crop the PDF yourself, and all I'm doing here is determining how far down on the last page you still have content. If that assumption is wrong, I could probably figure out how to crop the PDF. Otherwise, you could just crop the image (easy in Pillow) and then convert that to PDF.
Also, if you have one big PDF, you might need to figure out how far down on the whole PDF the text ends. I'm only finding out how far down on the last page the content ends, but converting from one to the other is just simple arithmetic.
Tested code:
import pdfkit
from PyPDF2 import PdfFileReader
from io import BytesIO
# This library isn't named fitz on pypi,
# obtain this library with `pip install PyMuPDF==1.19.4`
import fitz
# `pip install Pillow==8.3.1`
from PIL import Image
import numpy as np
# However you arrive at valid HTML, it makes no difference to the solution.
rendered = "<html><head></head><body><h3>Hello World</h3><p>hello</p></body></html>"
# This first renders PDF from HTML normally (multiple pages)
# Then counts how many pages were created and determines the required single-page height
# Then renders a single-page PDF from HTML using the page height and width arguments
pdf_bytes = pdfkit.from_string(rendered, options={
    "page-height": f"{297 * PdfFileReader(BytesIO(pdfkit.from_string(rendered))).getNumPages()}mm",
    "page-width": "210mm"
})
# convert the pdf into an image.
pdf = fitz.open(stream=pdf_bytes, filetype="pdf")
# page_count replaces the deprecated pageCount alias
last_page = pdf[pdf.page_count - 1]
matrix = fitz.Matrix(1, 1)
image_pixels = last_page.get_pixmap(matrix=matrix, colorspace="GRAY")
image = Image.frombytes("L", (image_pixels.width, image_pixels.height), image_pixels.samples)
#Uncomment if you want to see.
#image.show()
# Now figure out where the end of the text is:
# First binarize. This might not be the most efficient way to do this.
# But it's how I do it.
THRESHOLD = 100
# I wrote this code ages ago and don't remember the details but
# basically, we treat every pixel > 100 as a white pixel,
# We convert the result to a true/false matrix
# And then invert that.
# The upshot is that, at the end, a value of "True"
# in the matrix will represent a black pixel in that location.
binary_matrix = np.logical_not(image.point( lambda p: 255 if p > THRESHOLD else 0 ).convert("1"))
# Now find the last row that contains a black pixel, starting at the bottom
row_count, column_count = binary_matrix.shape
last_row = 0
for i, row in enumerate(reversed(binary_matrix)):
    if any(row):
        last_row = i
        break
percentage_from_top = (1 - last_row / row_count) * 100
print(percentage_from_top)
# Now you know where the page ends.
# Go back and crop the PDF accordingly.
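For the last step, here is a dependency-free sketch of the same idea that also turns the result into a page height. It assumes the page was rasterized at 1 pixel per point (as with `fitz.Matrix(1, 1)` above) and that the pixels are available as rows of grayscale values; the function names and the 28-unit bottom margin are illustrative only.

```python
def content_bottom_fraction(rows, threshold=100):
    """Return the fraction of the page height, from the top, that holds content.

    rows is a list of rows of grayscale values (0 = black, 255 = white).
    A pixel counts as ink when its value is at or below threshold.
    Returns 0.0 for a completely blank page.
    """
    for index in range(len(rows) - 1, -1, -1):  # scan upward from the bottom
        if any(value <= threshold for value in rows[index]):
            return (index + 1) / len(rows)
    return 0.0

def cropped_height_pt(page_height_pt, fraction, bottom_margin_pt=0.0):
    """Return the height to keep, never more than the original page height."""
    return min(page_height_pt, page_height_pt * fraction + bottom_margin_pt)

# Toy 10-row "page" whose last ink is in the fourth row from the top
page = [[255] * 4 for _ in range(10)]
page[3][1] = 0
fraction = content_bottom_fraction(page)
print(fraction, cropped_height_pt(842, fraction, bottom_margin_pt=28))
```

The height is measured from the top of the page, so when applying it as a crop box remember that raw PDF coordinates start at the bottom-left corner.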
The source file is here. The fetch code is sify. It's just one jpg. If you can't download it, please contact bbliao#126.com.
However, this image doesn't work with the fpdf package, and I don't know why. You can try it.
So I have to use img2pdf. With the following code I converted this image to PDF successfully.
import os
import img2pdf
t = os.listdir()
with open('bb.pdf', 'wb') as f:
    f.write(img2pdf.convert(t))
However, when multiple images are combined into one PDF file, img2pdf just joins the images head to tail, so every page size equals its image size. In brief, the first page of the PDF is 30 cm x 40 cm, while the second is 20 cm x 10 cm, the third is 15 x 13, and so on. That's ugly.
I want the same page size (A4, for example) and the same image size on every page of the PDF, with one image per page.
Glancing at the documentation for img2pdf, it allows you to set the paper size by passing a layout function to the convert call:
import img2pdf
letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
layout = img2pdf.get_layout_fun(letter)
with open('test.pdf', 'wb') as f:
    f.write(img2pdf.convert(['image1.jpg', 'image2.jpg'], layout_fun=layout))
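The example above uses US letter; since the question asked for A4, the page size tuple just needs different numbers. The sketch below does the unit conversion by hand so the arithmetic is visible (img2pdf also provides an `mm_to_pt` helper that should give the same values); the local function here is only for illustration.

```python
MM_PER_INCH = 25.4
PT_PER_INCH = 72

def mm_to_pt(mm):
    """Convert millimetres to PDF points (1 pt = 1/72 inch)."""
    return mm * PT_PER_INCH / MM_PER_INCH

# A4 is 210 mm x 297 mm
a4 = (mm_to_pt(210), mm_to_pt(297))
print(tuple(round(side, 2) for side in a4))  # (595.28, 841.89)
```

Passing this `a4` tuple to `img2pdf.get_layout_fun` in place of `letter` should give every page the A4 size.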
Let's say you have a PDF page with various complex elements inside.
The objective is to crop a region of the page (to extract only one of the elements) and then paste it in another pdf page.
Here is a simplified version of my code:
import PyPDF2
def extract_tree(in_file, out_file):
    with open(in_file, 'rb') as infp:
        # Read the document that contains the tree (on its first page)
        reader = PyPDF2.PdfFileReader(infp)
        page = reader.getPage(0)
        # Crop the tree. The coordinates below are only for reference
        page.cropBox.lowerLeft = [100, 200]
        page.cropBox.upperRight = [250, 300]
        # Create an empty document and add a single page containing only the cropped page
        writer = PyPDF2.PdfFileWriter()
        writer.addPage(page)
        with open(out_file, 'wb') as outfp:
            writer.write(outfp)
def insert_tree_into_page(tree_document, text_document):
    # Load the first page of the document containing 'text text text text...'
    text_page = PyPDF2.PdfFileReader(open(text_document, 'rb')).getPage(0)
    # Load the previously cropped tree (cropped using 'extract_tree')
    tree_page = PyPDF2.PdfFileReader(open(tree_document, 'rb')).getPage(0)
    # Overlay the text page and the tree crop
    text_page.mergeScaledTranslatedPage(page2=tree_page, scale=1.0, tx=100, ty=200)
    # Save the result into a new empty document
    output = PyPDF2.PdfFileWriter()
    output.addPage(text_page)
    with open('merged_document.pdf', 'wb') as output_stream:
        output.write(output_stream)
# First, crop the tree and save it into cropped_document.pdf
extract_tree('document1.pdf', 'cropped_document.pdf')
# Now merge document2.pdf with cropped_document.pdf
insert_tree_into_page('cropped_document.pdf', 'document2.pdf')
The method "extract_tree" seems to be working. It generates a PDF file containing only the cropped region (in the example, the tree).
The problem is that when I try to paste the tree into the new page, the star and the house from the original image are pasted as well.
I tried something that actually worked. Convert your first output (the PDF containing only the tree) to docx, then convert it from docx back to PDF before merging it with the other PDF pages. It will work (only the tree will be merged).
May I ask how you implemented an interface that defines the bounds of the crop?
I had the exact same issue. In the end, the solution for me was to make a small edit to the source code of PyPDF2 (taken from this pull request, which never made it into the master branch). What you need to do is insert these lines into the method _mergePage of the class PageObject inside the file pdf.py:
page2Content = ContentStream(page2Content, self.pdf)
page2Content.operations.insert(0, [map(FloatObject, [page2.trimBox.getLowerLeft_x(), page2.trimBox.getLowerLeft_y(), page2.trimBox.getWidth(), page2.trimBox.getHeight()]), "re"])
page2Content.operations.insert(1, [[], "W"])
page2Content.operations.insert(2, [[], "n"])
(see the pull request for exactly where to put them). With that done, you can then crop the section of a pdf you want, and merge it with another page with no issues. There's no need to save the cropped section into a separate pdf, unless you want to.
from PyPDF2 import PdfFileReader, PdfFileWriter
tree_page = PdfFileReader(open('document1.pdf', 'rb')).getPage(0)
text_page = PdfFileReader(open('document2.pdf', 'rb')).getPage(0)
tree_page.cropBox.lowerLeft = [100, 200]
tree_page.cropBox.upperRight = [250, 300]
text_page.mergeScaledTranslatedPage(page2=tree_page, scale=1.0, tx=100, ty=200)
output = PdfFileWriter()
output.addPage(text_page)
with open('merged_document.pdf', 'wb') as output_stream:
    output.write(output_stream)
Maybe there's a better way of doing this that inserts that code without directly editing the source code. I'd be grateful if anyone finds a way to do it as this admittedly is a slightly dodgy hack.
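For anyone wondering what the three inserted operations do: in PDF content-stream terms they define a rectangle and make it the clipping path, so everything drawn afterwards is only visible inside it. This tiny sketch just prints the equivalent operator text; `clip_prefix` is a made-up helper and is not part of PyPDF2.

```python
def clip_prefix(x, y, width, height):
    """Return PDF content-stream operators that clip to a rectangle.

    "re" appends the rectangle to the current path, "W" marks it as the
    clipping path, and "n" ends the path without painting it.
    """
    return f"{x:g} {y:g} {width:g} {height:g} re W n"

# Box matching the crop used above: lower left (100, 200), upper right (250, 300)
print(clip_prefix(100, 200, 250 - 100, 300 - 200))  # 100 200 150 100 re W n
```

This is why the star and the house disappear with the patch: a crop box alone only tells viewers which part of the page to show, while the clipping path actually restricts what the merged content can paint.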
I searched Stack Overflow for this problem. The nearest links are:
How to set custom page size with Ghostscript
How to convert multiple, different-sized PostScript files to a single PDF?
But these did NOT solve my problem.
The question is plain and simple.
How can we combine multiple PDFs (with different page sizes) into one PDF in which all pages have the same size?
Example:
two input pdfs are:
hw1.pdf with single page of size 5.43x3.26 inch (found from adobe reader)
hw6.pdf with single page of size 5.43x6.51 inch
The pdfs can be found here:
https://github.com/bhishanpdl/Questions
The code is:
gs -sDEVICE=pdfwrite -r720 -g2347x3909 -dPDFFitPage -o homeworks.pdf hw1.pdf hw6.pdf
PROBLEM: The first page is portrait, and the second page is landscape.
QUESTION: How can we make both pages portrait?
NOTE:
-r720 is pixels/inch.
The size -g2347x3909 is found using python script:
import numpy as np
wd = int(np.floor(720 * 5.43))
ht = int(np.floor(720 * 3.26))
gsize = '-g' + str(ht) + 'x' + str(wd) + ' '
# this gives: gsize = -g2347x3909
Another Attempt
import subprocess
commands = ('gs -o homeworks.pdf -sDEVICE=pdfwrite -dDEVICEWIDTHPOINTS=674 '
            '-dDEVICEHEIGHTPOINTS=912 -dPDFFitPage '
            'hw1.pdf hw6.pdf')
subprocess.call(commands, shell=True)
This makes both pages portrait, but they do not have the same size.
The first page is smaller, and the second is full size when I open the output in Adobe Reader.
In general, how can we make size of all the pages same?
The reason (in the first example) that one of the pages is rotated is that it fits better that way round. Because Ghostscript is primarily intended as print software, the assumption is that you want to print the input. If the output is to a fixed media size, page fitting is requested, and the requested media size fits better (i.e. with less scaling) when rotated, then the content will be rotated.
In order to prevent that, you would need to rewrite the FitPage procedure, which is defined in /ghostpdl/Resource/Init/pdf_main.ps in the procedure pdf_PDF2PS_matrix. You can modify that procedure so that it does not rotate the page for a better fit.
In the second case you haven't set -dFIXEDMEDIA (-g implies -dFIXEDMEDIA, -dDEVICE...POINTS does not), so the media size requests in the PDF files will override the media size you set on the command line. Which is why the pages are not resized. Since the media is then the size requested by the PDF file, the page will fit without modification, thus -dPDFFitPage will do nothing. So you need to set -dFIXEDMEDIA if you use -dDEVICE...POINTS and any of the FitPage switches.
You would be better advised (as in your second attempt) to use -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS to set the media size, since these do not depend on the resolution (unlike -g), which can be overridden by PostScript input programs. You should not meddle with the resolution without a good reason, so don't set -r720.
Please be aware that this process does not 'merge', 'combine' or anything else which implies that the content of the input is unchanged in the output. You should read the documentation on the subject and understand the process before attempting to use this procedure.
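Putting the advice above together, the second attempt with -dFIXEDMEDIA added could be assembled like this from Python. This is only a sketch: `build_gs_command` is a hypothetical helper, the 674 x 912 point media size is taken from the question, and the automatic rotation for a better fit described above is not addressed.

```python
def build_gs_command(inputs, output, width_pt, height_pt):
    """Return a Ghostscript argument list that fits every page to one media size."""
    return [
        "gs",
        "-o", output,
        "-sDEVICE=pdfwrite",
        f"-dDEVICEWIDTHPOINTS={width_pt}",
        f"-dDEVICEHEIGHTPOINTS={height_pt}",
        "-dFIXEDMEDIA",  # keep the media size set above, not each PDF's own
        "-dPDFFitPage",  # scale each input page to fit that media
        *inputs,
    ]

command = build_gs_command(["hw1.pdf", "hw6.pdf"], "homeworks.pdf", 674, 912)
print(" ".join(command))
```

Passing the list to `subprocess.run(command, check=True)` avoids `shell=True`; that requires Ghostscript to be installed and reachable as `gs`.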
You have tagged this question "ghostscript" but I assume by your use of subprocess.call() that you are not averse to using Python.
The pagemerge canvas of the pdfrw Python library can do this. There are some examples of dealing with different sized pages in the examples directory and at the source of pagemerge.py. The fancy_watermark.py shows an example of dealing with different page sizes, in the context of applying watermarks.
pdfrw can rotate, scale, or simply position source pages on the output. If you want rotation or scaling, you can look in the examples directory. (Since this is for homework, for extra credit you can control the scaling and rotation by looking at the various page sizes. :) But if all you want is the second page to be extended to be as long as the first, you could do that with this bit of code:
from pdfrw import PdfReader, PdfWriter, PageMerge
pages = PdfReader('hw1.pdf').pages + PdfReader('hw6.pdf').pages
output = PdfWriter()
rects = [[float(num) for num in page.MediaBox] for page in pages]
height = max(x[3] - x[1] for x in rects)
width = max(x[2] - x[0] for x in rects)
mbox = [0, 0, width, height]
for page in pages:
    newpage = PageMerge()
    newpage.mbox = mbox              # Set boundaries of output page
    newpage.add(page)                # Add one old page to new page
    image = newpage[0]               # Get image of old page (first item)
    image.x = (width - image.w) / 2  # Center old page left/right
    image.y = (height - image.h)     # Move old page to top of output page
    output.addpage(newpage.render())
output.write('homeworks.pdf')
(Disclaimer: I am the primary pdfrw author.)
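To see what the positioning arithmetic in that loop produces for the two files in the question, here is a standalone calculation that does not need pdfrw. The page sizes are the ones reported in the question (in inches, converted to points), and `top_center_offset` is a name invented for this sketch.

```python
PT_PER_INCH = 72

def top_center_offset(page_size, canvas_size):
    """Return (x, y) that centers a page horizontally and aligns it to the top.

    Sizes are (width, height). The PDF origin is the bottom-left corner, so
    aligning to the top means y = canvas height - page height.
    """
    page_w, page_h = page_size
    canvas_w, canvas_h = canvas_size
    return ((canvas_w - page_w) / 2, canvas_h - page_h)

sizes_in = [(5.43, 3.26), (5.43, 6.51)]  # hw1.pdf and hw6.pdf
sizes_pt = [(w * PT_PER_INCH, h * PT_PER_INCH) for w, h in sizes_in]
# The output page is as wide as the widest input and as tall as the tallest
canvas = (max(w for w, _ in sizes_pt), max(h for _, h in sizes_pt))
offsets = [top_center_offset(size, canvas) for size in sizes_pt]
for name, offset in zip(["hw1.pdf", "hw6.pdf"], offsets):
    print(name, offset)
```

Both pages have the same width, so only the shorter hw1.pdf page is shifted: it moves up by the height difference so that it sits at the top of the taller output page.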
I'm using the open-source version of ReportLab with Python on Windows. My code loops through multiple PNG files and combines them to form a single PDF. Each PNG is stretched to the full LETTER spec (8.5x11).
The problem is that all the images saved to output.pdf are stacked on top of each other, and only the last image added is visible. Is there something that I need to add between each drawImage() to move on to a new page? Here's a simple linear view of what I'm doing -
WIDTH,HEIGHT = LETTER
canv = canvas.Canvas('output.pdf',pagesize=LETTER)
canv.setPageCompression(0)
page = Image.open('one.png')
canv.drawImage(ImageReader(page),0,0,WIDTH,HEIGHT)
page = Image.open('two.png')
canv.drawImage(ImageReader(page),0,0,WIDTH,HEIGHT)
page = Image.open('three.png')
canv.drawImage(ImageReader(page),0,0,WIDTH,HEIGHT)
canv.save()
[Follow up of the post's comment]
Use canv.showPage() after you use canv.drawImage(...) each time.
( http://www.reportlab.com/apis/reportlab/dev/pdfgen.html#reportlab.pdfgen.canvas.Canvas.showPage )
Consult the source documentation (for any tool you use, you should dig into the documentation on its website):
http://www.reportlab.com/apis/reportlab/dev/pdfgen.html
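Applied to the code in the question, that comes down to one showPage() call per image. A minimal sketch, assuming the images are PNG files on disk; the helper names are made up, and the ReportLab imports are kept inside `build_pdf` so the drawing loop itself can be read (and tried with any canvas-like object) on its own.

```python
def draw_one_image_per_page(canv, images, width, height):
    """Draw each image on its own page of a ReportLab-style canvas."""
    for image in images:
        canv.drawImage(image, 0, 0, width, height)
        canv.showPage()  # finish this page; the next drawImage starts a new one

def build_pdf(png_paths, output_path="output.pdf"):
    """Write one LETTER-sized page per PNG file."""
    # Imported lazily so the helper above can be exercised without ReportLab
    from reportlab.lib.pagesizes import LETTER
    from reportlab.lib.utils import ImageReader
    from reportlab.pdfgen import canvas

    width, height = LETTER
    canv = canvas.Canvas(output_path, pagesize=LETTER)
    readers = [ImageReader(path) for path in png_paths]
    draw_one_image_per_page(canv, readers, width, height)
    canv.save()
```

With the files from the question this would be called as `build_pdf(['one.png', 'two.png', 'three.png'])`.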