import fitz
text_rectangle = fitz.Rect(450,20,550,120)
file_handle = fitz.open(input_file)
first_page = file_handle[0]
text = 'SAS Automation'
first_page.insertTextbox(text_rectangle, f'{text}')
file_handle.save(output_file)
The code above adds the text to the PDF mirrored, and I don't know why. I tried the insertText method and the morph argument of insertTextbox, but found no solution. You can see the result in the attached image of the output PDF file.
Any help? Thanks In Advance
This usually happens when the PDF page defines its own orientation or other coordinate changes. Calling page.clean_contents() standardizes the page and should be done before the first insertion of any item.
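A minimal sketch of that advice, assuming a recent PyMuPDF release where the methods use snake_case names (insert_textbox rather than insertTextbox). The function name and its parameters are made up for illustration:

```python
def insert_text_after_cleaning(input_file, output_file, text):
    """Clean the first page's content, then add a text box to it."""
    import fitz  # PyMuPDF; assumed to be installed

    doc = fitz.open(input_file)
    page = doc[0]
    # Standardize the existing page content before the first insertion.
    page.clean_contents()
    rect = fitz.Rect(450, 20, 550, 120)
    # insert_textbox returns a negative number if the text did not fit.
    unused_height = page.insert_textbox(rect, text)
    doc.save(output_file)
    doc.close()
    return unused_height
```

A negative return value means the rectangle was too small and nothing was written, which is worth checking before blaming the page itself.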
In my case, it seemed to be an issue with the PDF file itself.
I fixed it by generating another copy of the PDF file.
I opened it in Photoshop and saved it as PDF. You can also try "Print to PDF".
HTH
I tried updating my existing PDF file, but that did not solve the problem. In the end, I created a new PDF file instead:
file_handle = fitz.open()
first_page = file_handle.newPage()  # using file_handle[0] of the existing file caused the issue
Related
I am currently working on a project to extract the contents of a PDF. The code runs and I am able to extract the text, but the extracted text is not in the right order. It does not go from top to bottom, and the result is really confusing.
I looked online, but there was very little help on how to order the extracted text. Most tutorials gave the same result. For reference, this is the PDF that I am currently testing it on (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf
import PyPDF2
with open('pdftest2.pdf', 'rb') as pdfTest:
    reader = PyPDF2.PdfFileReader(pdfTest)
    page5 = reader.getPage(4)
    text = page5.extractText()
    print(text)
The extracted text would always start with the footer of the page and then go its way from bottom to top. I noticed in the next page it would start from top to bottom but only for a few certain sentences. Then it would extract text from a different position of the page instead of continuing from where it left off.
All of the text does get extracted, but the order in which it is extracted is all over the place. Is there any solution to this problem?
I had to deal with a similar problem, and it turned out that the pdfplumber module worked better than PyPDF. I guess it depends on the document itself, so you should try it.
Otherwise, another answer to your problem would be to treat the PDFs as images with the pdf2image module and extract the text from them using pytesseract. However, it might not be a perfect method, as the pdf2image function convert_from_path can take quite a long time to run.
I'll drop some code below in case you are interested.
First of all, make sure you install all the necessary dependencies as well as Tesseract and ImageMagick. You can find installation information on their websites. If you are working on Windows, there's a good Medium article here.
To convert PDFs to images using pdf2image:
Don't forget to add your poppler path if you are working on Windows. It should look something like r'C:\<your_path>\poppler-21.02.0\Library\bin'
import os
from pdf2image import convert_from_path

def pdftoimg(fic, output_folder, poppler_path):
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=poppler_path)
    # Iterate through all the pages stored above and save each one as a JPEG
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')
    # Remove the intermediate .ppm files left in output_folder
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))
To extract text from the image:
Your tesseract path is going to be something like that: r'C:\Program Files\Tesseract-OCR\tesseract.exe'
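import pytesseract
from PIL import Image

def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(img)))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    return text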
I recently started using PyMuPDF. Its licensing is a little confusing, but some of its methods can sort the text as it naturally appears (left to right, top to bottom). Something like page.get_text("words", sort=True) is all it takes.
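To illustrate what sorting into reading order means, here is a rough pure-Python sketch that orders word tuples shaped like the ones page.get_text("words") returns, that is (x0, y0, x1, y1, word, ...). The line_tolerance parameter and the sample coordinates are invented for the example, and this is not PyMuPDF's own sorting code:

```python
def words_in_reading_order(words, line_tolerance=3.0):
    """Return the word strings ordered top to bottom, then left to right.

    Each item in `words` is a tuple starting with (x0, y0, x1, y1, text).
    A word whose top edge is within `line_tolerance` of the first word of
    the current line is treated as part of that line.
    """
    lines = []  # each entry: [reference_y, [words on that line]]
    for word in sorted(words, key=lambda w: w[1]):
        if lines and abs(word[1] - lines[-1][0]) < line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word[1], [word]])
    ordered = []
    for _, line_words in lines:
        ordered.extend(w[4] for w in sorted(line_words, key=lambda w: w[0]))
    return ordered


# Footer first, body afterwards: the kind of order the question describes.
sample = [
    (50, 800, 90, 812, "Page"),
    (95, 800, 105, 812, "5"),
    (120, 100.5, 160, 112.5, "world"),
    (50, 100, 110, 112, "Hello"),
]
print(words_in_reading_order(sample))  # ['Hello', 'world', 'Page', '5']
```

Multi-column pages would need more than this, since a simple top-to-bottom sort interleaves the columns.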
I am trying to write a Python script that automates finding text in a PDF and highlighting it accordingly.
I am using the pymupdf module. It works for some PDFs. However, for the target PDF (drawings of components and property tables), it saves the output as a blank PDF with no data and some blank highlights.
import fitz
doc=fitz.open("c5.pdf")
page = doc[0]
text = "a"
text_instances = page.searchFor(text)
for inst in text_instances:
    highlight = page.addHighlightAnnot(inst)
doc.save("out.pdf", garbage=4, deflate=True, clean=True)
Your PDF probably contains elements which appear like text but are something else. It may be that they are just some type of graphics or image.
In that case the text search of course cannot find anything.
Please submit an issue on my repo for PyMuPDF with a sample PDF so that I can investigate this.
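One rough way to check this yourself, assuming a recent PyMuPDF release with snake_case method names. The helper name is hypothetical:

```python
def count_words_and_images(pdf_path, page_number=0):
    """Return (number of extractable words, number of images) for one page."""
    import fitz  # PyMuPDF; assumed to be installed

    doc = fitz.open(pdf_path)
    page = doc[page_number]
    word_count = len(page.get_text("words"))
    image_count = len(page.get_images(full=True))
    doc.close()
    return word_count, image_count
```

A word count of zero on a page that visibly shows text would suggest the "text" is really an image or vector graphics, in which case a text search has nothing to match.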
The source file is here. The fetch code is sify. It's just one JPG. If you can't download it, please contact bbliao#126.com.
However, this image doesn't work with the fpdf package, and I don't know why. You can try it.
So I have to use img2pdf. With the following code I converted this image to PDF successfully.
import os
import img2pdf
t = os.listdir()
with open('bb.pdf', 'wb') as f:
    f.write(img2pdf.convert(t))
However, when multiple images are combined into one PDF file, img2pdf just appends the images one after another, so every page size equals its image size. For example, the first page of the PDF is 30 cm * 40 cm, the second is 20 cm * 10 cm, the third is 15 cm * 13 cm, and so on. That's ugly.
I want the same pagesize(A4 for example) and the same imgsize in every page of the pdf. One page of pdf with one image.
Glancing at the documentation for img2pdf, it allows you to set the paper size by passing a layout function to the convert call:
import img2pdf
letter = (img2pdf.in_to_pt(8.5), img2pdf.in_to_pt(11))
layout = img2pdf.get_layout_fun(letter)
with open('test.pdf', 'wb') as f:
    f.write(img2pdf.convert(['image1.jpg','image2.jpg'], layout_fun=layout))
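For the A4 case the question asks about, the same idea might look like the sketch below. The helper names and the list of extensions are assumptions for illustration. Filtering by extension avoids handing non-image files from os.listdir() to img2pdf:

```python
import os

IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def list_images(folder):
    """Return the image files in `folder`, sorted by name."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )


def images_to_a4_pdf(folder, output_path):
    """Write one A4 page per image found in `folder`."""
    import img2pdf  # assumed to be installed

    # A4 is 210 mm x 297 mm; img2pdf works in PDF points.
    a4 = (img2pdf.mm_to_pt(210), img2pdf.mm_to_pt(297))
    layout = img2pdf.get_layout_fun(a4)
    with open(output_path, "wb") as f:
        f.write(img2pdf.convert(list_images(folder), layout_fun=layout))
```

With only a page size given, img2pdf should scale each image to fit the page while keeping its aspect ratio, so every page comes out as A4 with one image on it.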
I am reading text from a PDF page by page, doing some operation on the extracted text at each step, and I want to create a new PDF to save the edited text.
I tried the following with PyPDF2:
import PyPDF2
output = PyPDF2.PdfFileWriter()
pdf = "pdfte.pdf"
Obj_pdfFile = open(pdf, 'rb')
pdfReader = PyPDF2.PdfFileReader(Obj_pdfFile, strict=False)
pages = pdfReader.numPages
for page in range(pages):
    pageObj = pdfReader.getPage(page)
    pdf_text = pageObj.extractText()
    upper = pdf_text.upper()
    #print(pdf_text)
    output.addPage(input.getPage(upper))  # I thought this will work but no use..
I know I need to pass in a page, but basically I am looking for a way to save the edited text in a new PDF. I know I am missing the code that actually saves the PDF, and that is exactly what I need help with, as I have never worked with PDFs.
Also, is there a better way to do this?
PyPDF2 is great for handling PDF files as documents, but it is not an editor. I wanted to do the same thing you tried, and only managed it with reportlab, as many other answers here do. Note that here
output.addPage(input.getPage(upper))  # I thought this will work but no use.
upper is a string, but getPage() expects an integer page number, and addPage() expects a page object such as the one returned by
PyPDF2.PdfFileReader(pdffile).getPage(0)
Here is what worked for me on Python 2.7:
from StringIO import StringIO  # Python 2.7; on Python 3 use io.BytesIO
import PyPDF2
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A6  # choose your page size here
temp = StringIO()
can = canvas.Canvas(temp, pagesize=A6)
can.drawString(10, 405, "Your string on this position")
can.save()
temp.seek(0)
lector = PyPDF2.PdfFileReader(temp)
output.addPage(lector.getPage(0))  # output is your PyPDF2 writer
Now output is your PDF with the string attached. Hope someone finds it useful.
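If the goal is only to save the edited text as a new PDF, reportlab can also write the file directly. Below is a rough Python 3 sketch. The function names, margins, and the lines_per_page default are my own assumptions, and long lines are not wrapped:

```python
def chunk_lines(lines, lines_per_page):
    """Split a list of text lines into chunks of at most `lines_per_page`."""
    return [lines[i:i + lines_per_page] for i in range(0, len(lines), lines_per_page)]


def write_text_pdf(text, output_path, lines_per_page=50):
    """Write `text` to a new PDF, one text line per PDF line."""
    from reportlab.lib.pagesizes import A4  # reportlab assumed installed
    from reportlab.pdfgen import canvas

    _, page_height = A4
    pdf = canvas.Canvas(output_path, pagesize=A4)
    for chunk in chunk_lines(text.splitlines(), lines_per_page):
        # Start 40 points in from the left and down from the top edge.
        text_object = pdf.beginText(40, page_height - 40)
        for line in chunk:
            text_object.textLine(line)
        pdf.drawText(text_object)
        pdf.showPage()  # finish this page and start the next one
    pdf.save()
```

In the loop from the question, something like write_text_pdf(pdf_text.upper(), "edited.pdf") would then replace the addPage call, with a different file name per page if each page should go to its own file.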
I have been tasked with creating a PDF that lets the end user enter information into it and then print or save it. The PDF is rendered from a PDF template that has fillable fields. The problem is that every time I use a Python library (PyPDF2, pdfrw, reportlab, etc.) to create this PDF, it gets flattened and the fields are no longer fillable after export. Is there anything out there that will accomplish this? It doesn't really matter to me if I have to take the flat template file and render an HTML form onto it, so long as it works. The PDF was made in Acrobat, and I made sure to remove the password.
The pdf in question was created in acrobat pro. My python version is 2.7.
If anyone has done this before, that information would be super helpful. Thanks!
I am facing the same problem. I just found out that the forms in output.pdf are not fillable in Okular or Firefox, but can still be filled when the file is opened in google-chrome or chromium-browser.
I used these lines:
from PyPDF2 import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
c = canvas.Canvas('watermark.pdf')
c.drawImage('testplot.png', 350, 550, width=150, height=150)
c.save()
output_file = PdfFileWriter()
watermark = PdfFileReader(open("watermark.pdf", 'rb'))
input_file = PdfFileReader(file('template.pdf', 'rb'))
page_count = input_file.getNumPages()
for page_number in range(page_count):
    print "Plotting png to {} of {}".format(page_number, page_count)
    input_page = input_file.getPage(page_number)
    input_page.mergePage(watermark.getPage(0))
    output_file.addPage(input_page)
with open('output.pdf', 'wb') as outputStream:
    output_file.write(outputStream)