Lets say you have a pdf page with various complex elements inside.
The objective is to crop a region of the page (to extract only one of the elements) and then paste it in another pdf page.
Here is a simplified version of my code:
import PyPDF2
import PyPdf
def extract_tree(in_file, out_file):
with open(in_file, 'rb') as infp:
# Read the document that contains the tree (in its first page)
reader = pyPdf.PdfFileReader(infp)
page = reader.getPage(0)
# Crop the tree. Coordinates below are only referential
page.cropBox.lowerLeft = [100,200]
page.cropBox.upperRight = [250,300]
# Create an empty document and add a single page containing only the cropped page
writer = pyPdf.PdfFileWriter()
writer.addPage(page)
with open(out_file, 'wb') as outfp:
writer.write(outfp)
def insert_tree_into_page(tree_document, text_document):
# Load the first page of the document containing 'text text text text...'
text_page = PyPDF2.PdfFileReader(file(text_document,'rb')).getPage(0)
# Load the previously cropped tree (cropped using 'extract_tree')
tree_page = PyPDF2.PdfFileReader(file(tree_document,'rb')).getPage(0)
# Overlay the text-page and the tree-crop
text_page.mergeScaledTranslatedPage(page2=tree_page,scale='1.0',tx='100',ty='200')
# Save the result into a new empty document
output = PyPDF2.PdfFileWriter()
output.addPage(text_page)
outputStream = file('merged_document.pdf','wb')
output.write(outputStream)
# First, crop the tree and save it into cropped_document.pdf
extract_tree('document1.pdf', 'cropped_document.pdf')
# Now merge document2.pdf with cropped_document.pdf
insert_tree_into_page('cropped_document.pdf', 'document2.pdf')
The method "extract_tree" seems to be working. It generates a pdf file containing only the cropped region (in the example, the tree).
The problem in that when I try to paste the tree in the new page, the star and the house of the original image are pasted anyway
I tried something that actually worked. Try to convert your first output(pdf containing only the tree) to docx then convert it another time from docx to pdf before merging it with other pdf pages. It will work(only the tree will be merged).
Allow me to ask please, how did you implement an interface that define the bounds of the crop Au.
I had the exact same issue. In the end, the solution for me was to make a small edit to the source code of pyPDF2 (from this pull request, which never made it into the master branch). What you need to do is insert these lines into the method _mergePage of the class PageObject inside the file pdf.py:
page2Content = ContentStream(page2Content, self.pdf)
page2Content.operations.insert(0, [map(FloatObject, [page2.trimBox.getLowerLeft_x(), page2.trimBox.getLowerLeft_y(), page2.trimBox.getWidth(), page2.trimBox.getHeight()]), "re"])
page2Content.operations.insert(1, [[], "W"])
page2Content.operations.insert(2, [[], "n"])
(see the pull request for exactly where to put them). With that done, you can then crop the section of a pdf you want, and merge it with another page with no issues. There's no need to save the cropped section into a separate pdf, unless you want to.
from PyPDF2 import PdfFileReader, PdfFileWriter
tree_page = PdfFileReader(open('document1.pdf','rb')).getPage(0)
text_page = PdfFileReader(open('document2.pdf','rb')).getPage(0)
tree_page.cropBox.lowerLeft = [100,200]
tree_page.cropBox.upperRight = [250, 300]
text_page.mergeScaledTranslatedPage(page2=tree_page, scale='1.0', tx='100', ty='200')
output = PdfFileWriter()
output.addPage(text_page)
output.write(open('merged_document.pdf', 'wb'))
Maybe there's a better way of doing this that inserts that code without directly editing the source code. I'd be grateful if anyone finds a way to do it as this admittedly is a slightly dodgy hack.
Related
I have multiple PDF files with small sizes (e.g. 3cm x 2 cm) exported from Adobe Indesign.
I want to compose many of these into one new PDF which has the size of a whole page.
The small PDFs contain a plotter line in a special color which would get lost if I convert them into images.
How can I place these PDFs (at given positions) using python and without losing the special color.
I tried to read into pypdf, pypdf2 and reportlab but I got lost and the examples I found did not work. I do not need the full code, a hint into the right direction would be enough (even with another language if necessary).
Thanks
Try:
cpdf in.pdf -stamp-on stamp.pdf -pos-left "x y" AND -stamp-on stamp2.pdf -pos-left "x2 y2" AND ..... -o out.pdf
where in.pdf is a blank PDF of appropriate size, and x and y and x2 and y2 etc... are the coordinates required and ..... are parts of the command for the third, fourth etc. stamps.
Here is a sample code to do your task using PyPDF2.
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
for pdf in pdf_files:
merger.append(pdf) #pdf_files is a list of the pdf files (path) to be merged.
merger.write(output_pdf) #output_pdf is the path of the merged pdf file.
merger.close()
Apart from joining / merging pages from different documents into one output PDF, PyMuPDF lets you also embed pages from PDFs into an existing page of some target PDF (as if it were an image).
You can select a subrectangle within the source page (a "clip") and the target rectangle of the target page. Multiple embedded source pages in the same target page are also supported.
This embedding maintains all original source page features - it will not be converted to an image.
Scaling between source / target rectangles will be done as necessary.
This snippet will take the full page number 0 of a source PDF and place it inside a rectangle of a page in the target PDF.
The source page will be positioned centered in the target rectangle.
import fitz # import PyMuPDF
source = fitz.open("source.pdf") # source PDF
target = fitz.open("target.pdf") # target PDF
target_page = target[pagenumber] # desired page in target
# if you want a new page, create it like so:
# target_page = target.new_page()
target_rect = fitz.Rect(100, 100, 300, 250) # show source page here
# if you want to cover the full target page, use
# target_rect = target_page.rect
target_page.show_pdf_page(target_rect,
source, # the source document
0, # the source page number
clip=None, # any subrectangle on source page
)
# if desired, more pages from other sources can be put on same target page
The souce file is here.The fetch code is sify .It's just one jpg. If you can't download it, please contact bbliao#126.com.
However this image doesn't work with fpdf package, I don't know why. You can try it.
Thus I have to use the img2pdf. With the following code I converted this image to pdf successfully.
t=os.listdir()
with open('bb.pdf','wb') as f:
f.write(img2pdf.convert(t))
However, when multiple images are combined into one pdf file, the img2pdf just combine each image by head_to_tail. This causes every pagesize = imgaesize. Briefly, the first page of pdf is 30 cm*40 cm while the second is 20 cm*10 cm the third is 15*13...That's ugly.
I want the same pagesize(A4 for example) and the same imgsize in every page of the pdf. One page of pdf with one image.
Glancing at the documentation for img2pdf, it allows you to set the paper size by including layout details to the convert call:
import img2pdf
letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
layout = img2pdf.get_layout_fun(letter)
with open('test.pdf', 'wb') as f:
f.write(img2pdf.convert(['image1.jpg','image2.jpg'], layout_fun=layout))
I am reading text from one pdf recursively and doing some operation with the extracted text at each run and want to create a new pdf to save that edited text with each run ..
I tried below from PyPDF2..
import PyPDF2
output = PdfFileWriter()
pdf="pdfte.pdf"
Obj_pdfFile = open(pdf, 'rb')
pdfReader = PyPDF2.PdfFileReader(Obj_pdfFile,strict = False)
pages=pdfReader.numPages
for page in range(pages):
pageObj = pdfReader.getPage(page)
pdf_text=pageObj.extractText()
upper = pdf_text.upper()
#print(pdf_text)
output.addPage(input.getPage(upper)) . # I thought this will work but no use..
I know need to input "page" but basically looking how to save edited text in new pdf ... I know I am missing some code here that how to save in pdf etc but thats exactly what I need help, never worked with pdf..
Also, is there any better option to do this ?
PyPDF2 is amazing to handle pdf files as documents, but not as an editor. I wanted to do the same that you tried, but only made it posible with reportlab as many other answers here do. Note that here
output.addPage(input.getPage(upper)) . # I thought this will work but no use.
upper is a string, and getPage() is expecting a page from
PyPDF2.PdfFileReader(pdffile).getPage(0)
Here is that worked for me on python 2.7:
temp = StringIO()
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A6 #choose here your size
can = canvas.Canvas(temp, pagesize=A6)
can.drawString(10, 405, "Your string on this position")
can.save()
temp.seek(0)
lector = PyPDF2.PdfFileReader(temp)
output.addPage(lector.getPage(0)) #your pypdf2 writter
now output is your pdf with the string attached, hope someone finds it useful.
I've been able to read the content of PDFs with: PYMuPDF using code similar to the following:
myfile = r"C:\users\xxx\desktop\testpdf1.pdf"
doc =fitz.open(myfile)
page=doc[1]
text = page.getText("text")
to read the contents of PDF files, however I can't read text box annotations is there a way to do this?
Use firstAnnot on the page object. Once you have an annotation object it looks like you can call next on it and get the others. Note the example at the bottom of the Annot page.
I created a PDF from a Word document and added one text box and one sticky note. The following code printed the contents of each. Look inside info for other information you may want.
import fitz
pdf = fitz.open('WordTest.pdf')
page = pdf[0]
annot = page.firstAnnot
print(annot.info['content'])
next_annot = annot.next
print(next_annot.info['content'])
pdf.close()
I'm trying to get the image index from the .docx file using python-docx library. I'm able to extract the name of the image, image height and width. But not the index where it is in the word file
import docx
doc = docx.Document(filename)
for s in doc.inline_shapes:
print (s.height.cm,s.width.cm,s._inline.graphic.graphicData.pic.nvPicPr.cNvPr.name)
output
21.228 15.920 IMG_20160910_220903848.jpg
In fact I would like to know if there is any simpler way to get the image name , like s.height.cm fetched me the height in cm. My primary requirement is to get to know where the image is in the document, because I need to extract the image and do some work on it and then again put the image back to the same location
This operation is not directly supported by the API.
However, if you're willing to dig into the internals a bit and use the underlying lxml API it's possible.
The general approach would be to access the ImagePart instance corresponding to the picture you want to inspect and modify, then read and write the ._blob attribute (which holds the image file as bytes).
This specimen XML might be helpful:
http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/picture.html#specimen-xml
From the inline shape containing the picture, you get the <a:blip> element with this:
blip = inline_shape._inline.graphic.graphicData.pic.blipFill.blip
The relationship id (r:id generally, but r:embed in this case) is available at:
rId = blip.embed
Then you can get the image part from the document part
document_part = document.part
image_part = document_part.related_parts[rId]
And then the binary image is available for read and write on ._blob.
If you write a new blob, it will replace the prior image when saved.
You probably want to get it working with a single image and get a feel for it before scaling up to multiple images in a single document.
There might be one or two image characteristics that are cached, so you might not get all the finer points working until you save and reload the file, so just be alert for that.
Not for the faint of heart as you can see, but should work if you want it bad enough and can trace through the code a bit :)
You can also inspect paragraphs with a simple loop, and check which xml contains an image (for example if an xml contains "graphicData"), that is which is an image container (you can do the same with runs):
from docx import Document
image_paragraphs = []
doc = Document(path_to_docx)
for par in doc.paragraphs:
if 'graphicData' in par._p.xml:
image_paragraphs.append(par)
Than you unzip docx file, images are in the "images" folder, and they are in the same order as they will be in the image_paragraphs list. On every paragraph element you have many options how to change it. If you want to extract img process it and than insert it in the same place, than
paragraph.clear()
paragraph.add_run('your description, if needed')
run = paragraph.runs[0]
run.add_picture(path_to_pic, width, height)
So, I've never really written any answers here, but i think this might be the solution to your problem. With this little code you can see the position of your images given all the paragraphs. Hope it helps.
import docx
doc = docx.Document(filename)
paraGr = []
index = []
par = doc.paragraphs
for i in range(len(par)):
paraGr.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
index.append(i)
If you are using Python 3
pip install python-docx
import docx
doc = docx.Document(document_path)
P = []
I = []
par = doc.paragraphs
for i in range(len(par)):
P.append(par[i].text)
if 'graphicData' in par[i]._p.xml:
I.append(i)
print(I)
#returns list of index(Image_Reference)