I am trying to write a Python script that automates finding text in a PDF and highlighting it.
I am using Python's PyMuPDF module. It works for some PDFs. However, for my target PDF (drawings of components and property tables), it saves the output as a blank PDF with no data and a few blank highlights.
import fitz
doc = fitz.open("c5.pdf")
page = doc[0]
text = "a"
# search_for / add_highlight_annot are the current names of searchFor / addHighlightAnnot
text_instances = page.search_for(text)
for inst in text_instances:
    highlight = page.add_highlight_annot(inst)
doc.save("out.pdf", garbage=4, deflate=True, clean=True)
Your PDF probably contains elements which appear like text but are something else. It may be that they are just some type of graphics or image.
In that case the text search of course cannot find anything.
Please submit an issue on my PyMuPDF repo with a sample PDF so that I can investigate this.
Related
I am new to programming (just do it for fun sometimes) and I am having trouble using PyMuPDF.
In VS Code, it returns no errors but the output is always just an empty list.
Here is the code:
import fitz
file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"
def extract_text_from_pdf(file_path):
    # Open the pdf file
    pdf_document = fitz.open(file_path)
    # Initialize an empty list to store the text
    text = []
    # Iterate through the pages
    for page in pdf_document:
        # Extract the text from the page
        page_text = page.get_text()
        # Append the text to the list
        text.append(page_text)
    # Close the pdf document
    pdf_document.close()
    # Return the list of text
    return text
if __name__ == '__main__':
    file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"
    text = extract_text_from_pdf(file_path)
Based on the name of the file, I'm going to guess this was an image that was converted to a PDF. In that case, the PDF does not contain any text. It just contains an image.
If you convert a Word document to a PDF, the words in the Word document are present in the PDF, along with instructions on what font to use and where to place them. But when you convert an image to a PDF, all you have are the bytes in the image. There is no text.
If you really want to explore this further, what you need is an OCR package (Optical Character Recognition). There are Python packages for doing that (like pytesseract), but they can be finicky.
FOLLOWUP
PyMuPDF can do OCR, if the Tesseract package is installed. You need to scan through the documentation.
https://pymupdf.readthedocs.io/en/latest/functions.html
One possibility is that the PDF file does not contain any text. PyMuPDF extracts the text that is embedded in the PDF and only runs OCR when you explicitly ask for it (which requires Tesseract), so if the PDF is image-only, page.get_text() returns an empty string for every page.
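A quick way to check whether a PDF actually has a text layer is to look at each page for extractable text and embedded images. This is only a rough sketch, assuming PyMuPDF is installed; classify_page and report_text_layers are hypothetical helper names, and the three labels are my own convention, not anything PyMuPDF defines.

```python
def classify_page(text, image_count):
    """Label a page from its extracted text and its number of embedded images."""
    if text.strip():
        return "text"
    if image_count > 0:
        return "image-only"
    return "no-text"


def report_text_layers(pdf_path):
    """Return one label per page of the PDF at pdf_path."""
    import fitz  # PyMuPDF, imported lazily so classify_page works without it

    labels = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # get_text() returns the embedded text, get_images() lists embedded images
            labels.append(classify_page(page.get_text(), len(page.get_images())))
    return labels
```

Pages labelled "image-only" are the ones that would need OCR. A "no-text" page may be blank, or its lettering may be drawn as vector graphics, which could be the case for technical drawings.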
I tried this in Python using fitz, but it simply highlights the text in the PDF. I can't figure out how to extract the X and Y coordinates of that text in the PDF file.
This is my code. Can anyone help me with this?
import fitz
### READ IN PDF
doc = fitz.open("C:\\Users\\vinit\\Desktop\\Input\\InputPDF.pdf")
for page in doc:
    ### SEARCH
    text = "Authorised Signature"
    text_instances = page.search_for(text)
    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.add_highlight_annot(inst)
        highlight.update()
### OUTPUT
doc.save("output.pdf", garbage=4, deflate=True, clean=True)
Or is there another method in Python to get the coordinates of a particular piece of text in a PDF?
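For what it's worth, the rectangles that search_for returns already carry the coordinates: each one unpacks into x0, y0 (top-left corner) and x1, y1 (bottom-right corner), in points. Here is a minimal sketch of collecting them, assuming PyMuPDF is installed; rect_to_coords and find_text_coords are hypothetical names I made up for illustration.

```python
def rect_to_coords(rect):
    """Turn a rectangle (x0, y0, x1, y1) into a dict of named floats."""
    x0, y0, x1, y1 = rect
    return {"x0": float(x0), "y0": float(y0), "x1": float(x1), "y1": float(y1)}


def find_text_coords(pdf_path, needle):
    """Return (page_number, coords) pairs for every hit of needle in the PDF."""
    import fitz  # PyMuPDF, imported lazily

    hits = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # search_for returns one rectangle per occurrence on this page
            for rect in page.search_for(needle):
                hits.append((page.number, rect_to_coords(rect)))
    return hits
```

Calling find_text_coords with the path and "Authorised Signature" from the question should give one entry per occurrence, with 0-based page numbers.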
I am currently working on a project to extract the contents of a PDF. The code runs smoothly and I am able to extract the text, but the extracted text is not in the right order. It does not go from top to bottom, and the result is really confusing.
I looked online, but there was very little help on how to order the text extraction. Most tutorials produced the same result. For reference, this is the PDF that I am currently testing it on (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf
import PyPDF2
with open('pdftest2.pdf', 'rb') as pdfTest:
    # PyPDF2 3.x names; older releases used PdfFileReader, getPage and extractText
    reader = PyPDF2.PdfReader(pdfTest)
    page5 = reader.pages[4]
    text = page5.extract_text()
    print(text)
The extracted text would always start with the footer of the page and then go its way from bottom to top. I noticed in the next page it would start from top to bottom but only for a few certain sentences. Then it would extract text from a different position of the page instead of continuing from where it left off.
All of the text does get extracted, but the order in which it is extracted is all over the place. Is there any solution for this problem?
I had to deal with a similar problem, and it turned out that the pdfplumber module worked better than PyPDF2. I guess it depends on the document itself, so you should try it.
Otherwise, another approach would be to treat the PDFs as images with the pdf2image module and extract the text from them using pytesseract. However, it might not be a perfect method, as pdf2image's convert_from_path can take quite a long time to run.
I'll drop some code below in case you are interested.
First of all, make sure you install all the necessary dependencies, as well as Tesseract and ImageMagick. You can find installation instructions on their websites. If you are working on Windows, there's a good Medium article here.
To convert PDFs to images using pdf2image:
Don't forget to add your poppler path if you are working on Windows. It should look something like this: r'C:\<your_path>\poppler-21.02.0\Library\bin'
import os
from pdf2image import convert_from_path
def pdftoimg(fic, output_folder, poppler_path):
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=poppler_path)
    image_counter = 0
    # Iterate through all the pages stored above
    for page in pages:
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')
        image_counter = image_counter + 1
    # Remove the intermediate .ppm files that pdf2image wrote to output_folder
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))
To extract text from the image:
Your tesseract path is going to be something like this: r'C:\Program Files\Tesseract-OCR\tesseract.exe'
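import pytesseract
from PIL import Image
def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    # Recognize the text in the image as a string
    text = str(pytesseract.image_to_string(Image.open(img)))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    return text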
I recently started using PyMuPDF. Its licensing is a little confusing, but some of its methods can sort the text as it naturally reads (left to right, top to bottom). Something like page.get_text("words", sort=True) is all it takes.
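To make the idea concrete, here is a rough sketch of rebuilding lines in reading order from the word tuples that page.get_text("words") returns, which start with (x0, y0, x1, y1, word, ...). The helper names words_to_lines and page_lines and the 3-point y_tolerance are my own assumptions, not part of PyMuPDF, and this simple grouping will interleave the columns of a multi-column layout.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group word tuples (x0, y0, x1, y1, text, ...) into lines, top to bottom."""
    lines = []  # each entry: [y0 of the line's first word, [(x0, text), ...]]
    for x0, y0, _x1, _y1, text, *_rest in sorted(words, key=lambda w: (w[1], w[0])):
        if lines and abs(y0 - lines[-1][0]) <= y_tolerance:
            # Close enough vertically: treat it as part of the current line
            lines[-1][1].append((x0, text))
        else:
            lines.append([y0, [(x0, text)]])
    # Within each line, order the words left to right
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


def page_lines(pdf_path, page_index=0):
    """Return the text lines of one page in top-to-bottom, left-to-right order."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return words_to_lines(doc[page_index].get_text("words"))
```

For page 5 of the PDF in the question, page_index would be 4.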
I've been able to read the content of PDFs with PyMuPDF using code similar to the following:
import fitz
myfile = r"C:\users\xxx\desktop\testpdf1.pdf"
doc = fitz.open(myfile)
page = doc[1]
text = page.get_text("text")
However, I can't read text box annotations. Is there a way to do this?
Use first_annot (firstAnnot in older PyMuPDF releases) on the page object. Once you have an annotation object, you can read its next attribute to get the following one, and so on. Note the example at the bottom of the Annot page.
I created a PDF from a Word document and added one text box and one sticky note. The following code printed the contents of each. Look inside info for other information you may want.
import fitz
pdf = fitz.open('WordTest.pdf')
page = pdf[0]
# first_annot is None when the page has no annotations
annot = page.first_annot
while annot is not None:
    print(annot.info['content'])
    # next is None after the last annotation on the page
    annot = annot.next
pdf.close()
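As an alternative to walking the chain by hand, page.annots() iterates over all annotations on a page. The following is only a sketch, assuming PyMuPDF is installed; annotation_contents and read_annotations are hypothetical helper names, and skipping annotations with empty content is my own choice.

```python
def annotation_contents(annots):
    """Collect (type_name, content) for each annotation that has non-empty content."""
    results = []
    for annot in annots:
        content = annot.info.get("content", "")
        if content:
            # annot.type is a tuple whose second item is the type name, e.g. "FreeText"
            results.append((annot.type[1], content))
    return results


def read_annotations(pdf_path, page_index=0):
    """Return the annotation contents of one page of the PDF."""
    import fitz  # PyMuPDF, imported lazily

    with fitz.open(pdf_path) as doc:
        return annotation_contents(doc[page_index].annots())
```

With the WordTest.pdf described above, read_annotations('WordTest.pdf') should return one entry for the text box and one for the sticky note.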
I have some keywords that I collected earlier, and I want to search for them in a PDF document with Python and highlight them. Is this feasible with a library like pdfMiner?
Yes, you can use the PyMuPDF library.
pip install PyMuPDF
Then use the following code:
import fitz
### READ IN PDF
doc = fitz.open(r"D:\XXXX\XXX.pdf")
page = doc[0]
text = "Amey"
text_instances = page.search_for(text)
### HIGHLIGHT
for inst in text_instances:
    print(inst, type(inst))
    highlight = page.add_highlight_annot(inst)
### OUTPUT
# PyMuPDF refuses a full save over the file it has open, so write to a new
# name (the _highlighted suffix is only an example)
doc.save(r"D:\XXXX\XXX_highlighted.pdf", garbage=4, deflate=True, clean=True)
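Since the question mentions several keywords, here is a rough sketch that extends the same idea to a list of keywords across all pages. It assumes PyMuPDF is installed; clean_keywords and highlight_keywords are hypothetical names, and the output path is whatever you pass in.

```python
def clean_keywords(keywords):
    """Strip whitespace, drop empties, and remove duplicates while keeping order."""
    seen = []
    for kw in keywords:
        kw = kw.strip()
        if kw and kw not in seen:
            seen.append(kw)
    return seen


def highlight_keywords(in_path, out_path, keywords):
    """Highlight every keyword on every page; return the hit count per keyword."""
    import fitz  # PyMuPDF, imported lazily

    counts = {kw: 0 for kw in clean_keywords(keywords)}
    with fitz.open(in_path) as doc:
        for page in doc:
            for kw in counts:
                for rect in page.search_for(kw):
                    page.add_highlight_annot(rect)
                    counts[kw] += 1
        # Save to a different file than the one that was opened
        doc.save(out_path, garbage=4, deflate=True, clean=True)
    return counts
```

The returned counts make it easy to spot keywords that were never found, which can hint at an image-only PDF.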