I've been able to read the content of PDFs with PyMuPDF using code similar to the following:
import fitz  # PyMuPDF

myfile = r"C:\users\xxx\desktop\testpdf1.pdf"
doc = fitz.open(myfile)
page = doc[1]
text = page.get_text("text")  # spelled getText in older PyMuPDF releases
This reads the contents of PDF files. However, I can't read text box annotations. Is there a way to do this?
Use firstAnnot on the page object (spelled first_annot in newer PyMuPDF releases). Once you have an annotation object, you can read its next attribute to get the following annotation, and so on for the others. Note the example at the bottom of the Annot page.
I created a PDF from a Word document and added one text box and one sticky note. The following code printed the contents of each. Look inside info for other information you may want.
import fitz
pdf = fitz.open('WordTest.pdf')
page = pdf[0]
annot = page.first_annot  # spelled firstAnnot in older PyMuPDF releases
print(annot.info['content'])
next_annot = annot.next
print(next_annot.info['content'])
pdf.close()
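To collect every annotation rather than just the first two, the chain of next attributes can be walked in a loop. The following is only a minimal sketch: the helper names annotation_contents and print_annotations are hypothetical, and the code assumes a page object that exposes first_annot and annotations that expose next and info as described above.

```python
def annotation_contents(page):
    """Return the 'content' text of every annotation on a page.

    `page` is assumed to behave like a PyMuPDF page: `first_annot` is the
    first annotation (or None), and each annotation's `next` attribute is
    the following one (or None after the last).
    """
    contents = []
    annot = page.first_annot
    while annot is not None:
        # `info` is a dictionary; 'content' holds the text box / note text.
        contents.append(annot.info.get("content", ""))
        annot = annot.next
    return contents


def print_annotations(path):
    """Print the annotation contents of every page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported lazily so the helper above stays dependency-free

    with fitz.open(path) as pdf:
        for page in pdf:
            for content in annotation_contents(page):
                print(page.number, content)
```

Calling print_annotations('WordTest.pdf') would then list the text box and the sticky note together with their page numbers.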
I am new to programming (just do it for fun sometimes) and I am having trouble using PyMuPDF.
In VS Code, it returns no errors but the output is always just an empty list.
Here is the code:
import fitz

file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"

def extract_text_from_pdf(file_path):
    # Open the pdf file
    pdf_document = fitz.open(file_path)
    # Initialize an empty list to store the text
    text = []
    # Iterate through the pages
    for page in pdf_document:
        # Extract the text from the page
        page_text = page.get_text()
        # Append the text to the list
        text.append(page_text)
    # Close the pdf document
    pdf_document.close()
    # Return the list of text
    return text

if __name__ == '__main__':
    file_path = "/Users/conor/Desktop/projects/png2pdf.pdf"
    text = extract_text_from_pdf(file_path)
    print(text)
Based on the name of the file, I'm going to guess this was an image that was converted to a PDF. In that case, the PDF does not contain any text. It just contains an image.
If you convert a Word document to a PDF, the words in the Word document are present in the PDF, along with instructions on what font to use and where to place them. But when you convert an image to a PDF, all you have are the bytes in the image. There is no text.
If you really want to explore this further, what you need is an OCR package (Optical Character Recognition). There are Python packages for doing that (like pytesseract), but they can be finicky.
FOLLOWUP
PyMuPDF can do OCR, if the Tesseract package is installed. You need to scan through the documentation.
https://pymupdf.readthedocs.io/en/latest/functions.html
One possibility is that the PDF file does not contain any text. PyMuPDF's get_text() only reads text that is actually stored in the PDF; it does not run OCR unless you explicitly request it. So if the PDF is image-only, or the text was drawn as graphics, no text will be extracted.
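Whether a page is image-only can be checked before reaching for OCR. The sketch below is one possible heuristic, not an official recipe: classify_page and report are hypothetical helper names, and the code assumes page objects offering get_text() and get_images() as PyMuPDF pages do.

```python
def classify_page(page):
    """Classify a page as 'text', 'image-only', or 'empty'.

    `page` is assumed to behave like a PyMuPDF page, offering
    `get_text()` and `get_images()`.
    """
    if page.get_text().strip():
        return "text"
    if page.get_images():
        # No extractable text but at least one embedded image:
        # most likely a scan or an image that was converted to PDF.
        return "image-only"
    return "empty"


def report(path):
    """Print the classification of every page of the PDF at `path`."""
    import fitz  # PyMuPDF; imported lazily so classify_page has no dependency

    with fitz.open(path) as pdf:
        for page in pdf:
            print(page.number, classify_page(page))
```

A page reported as 'image-only' is a candidate for OCR; a page reported as 'text' should already work with get_text().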
Convert Rect location from pymupdf to a page number
Suppose I search for certain text like "exam" and get the rectangle locations, then highlight the text in the PDFs at those locations. I now want to delete all pages that do not contain this text, so I use the doc.select() function to select the pages I want to keep before saving a new PDF that contains only the pages with highlighted text.
The Issue
You have to pass a list to the doc.select() function with the page numbers you want to keep.
So I tried passing the list of rectangle coordinates to this function, but I got the following error:
ValueError: bad page number(s)
I now understand that I must somehow convert the coordinates of the rectangles to page numbers.
But I do not know how to do this, and it is not mentioned anywhere in the docs (correct me if I am wrong).
Current code
from pathlib import Path
import fitz

directory = "pdfs"

# iterate over files in
# that directory
files = Path(directory).glob('*')
for file in files:
    doc = fitz.open(file)
    for page in doc:
        ### SEARCH
        text = "Exam"
        text_instances = page.search_for(text)

        ### HIGHLIGHT
        for inst in text_instances:
            highlight = page.add_highlight_annot(inst)
            highlight.update()

    ### OUTPUT
    doc.select(text_instances)
    doc.save("output.pdf", garbage=4, deflate=True, clean=True)
Pdf that I used for testing purposes:
pdf
I now understand that I must somehow convert the coordinates of the rectangles to page numbers. But I do not know how to do this, and it is not mentioned anywhere in the docs (correct me if I am wrong).
That is completely wrong!
The rectangles returned by text searches are locations on the current page and have nothing to do with page numbers.
You already are iterating over pages. If your search text has been found on some page, put that page's number in a list, then do your highlights.
When finished with a document, select() with the pages memorized, close document, empty the page selection, then continue with next document.
Something like that:
for filename in filenamelist:
    select_pages = []
    doc = fitz.open(filename)
    for page in doc:
        hits = page.search_for(text)
        if not hits:
            continue
        select_pages.append(page.number)
        for rect in hits:
            page.add_highlight_annot(rect)
    doc.select(select_pages)
    doc.save(...)
    doc.close()
I tried this in Python using fitz, but it simply highlights the text in the PDF. I don't understand how to extract the X and Y coordinates of that text in the PDF file.
This is my code. Can anyone help me with this?
import fitz

### READ IN PDF
doc = fitz.open("C:\\Users\\vinit\\Desktop\\Input\\InputPDF.pdf")

for page in doc:
    ### SEARCH
    text = "Authorised Signature"
    text_instances = page.search_for(text)

    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.add_highlight_annot(inst)
        highlight.update()

### OUTPUT
doc.save("output.pdf", garbage=4, deflate=True, clean=True)
Or is there another method in Python to get the coordinates of a particular piece of text in a PDF?
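The rectangles that search_for returns already carry the coordinates: each one has x0, y0 (top-left corner) and x1, y1 (bottom-right corner). A minimal sketch of reading them follows; hit_coordinates and print_hits are hypothetical helper names, and the page object is assumed to behave like a PyMuPDF page.

```python
def hit_coordinates(page, needle):
    """Return (x0, y0, x1, y1) for every occurrence of `needle` on `page`.

    Each rectangle returned by `search_for` has the attributes x0, y0
    (top-left corner) and x1, y1 (bottom-right corner).
    """
    return [(r.x0, r.y0, r.x1, r.y1) for r in page.search_for(needle)]


def print_hits(path, needle):
    """Print page number and coordinates of every hit in the PDF at `path`."""
    import fitz  # PyMuPDF; imported lazily so hit_coordinates has no dependency

    with fitz.open(path) as doc:
        for page in doc:
            for coords in hit_coordinates(page, needle):
                print(page.number, coords)
```

The coordinates are relative to the page on which the text was found, so the page number has to be kept alongside them.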
I am trying to write a Python script that automates the process of finding text in a PDF and highlighting it accordingly.
I am using the pymupdf module for Python. It works for some PDFs. However, for the target PDF (a drawing of components and property tables), it saves the output as a blank PDF with no data and some blank highlights.
import fitz

doc = fitz.open("c5.pdf")
page = doc[0]

text = "a"
# search_for / add_highlight_annot were spelled searchFor / addHighlightAnnot
# in older PyMuPDF releases
text_instances = page.search_for(text)

for inst in text_instances:
    highlight = page.add_highlight_annot(inst)

doc.save("out.pdf", garbage=4, deflate=True, clean=True)
Your PDF probably contains elements which appear like text but are something else. It may be that they are just some type of graphics or image.
In that case the text search of course cannot find anything.
Please submit an issue on my repo for PyMuPDF with a sample PDF so that I can investigate this.
Let's say you have a PDF page with various complex elements inside.
The objective is to crop a region of the page (to extract only one of the elements) and then paste it in another pdf page.
Here is a simplified version of my code:
# Uses the legacy PyPDF2 API (PdfFileReader, getPage, ...) throughout
import PyPDF2

def extract_tree(in_file, out_file):
    with open(in_file, 'rb') as infp:
        # Read the document that contains the tree (in its first page)
        reader = PyPDF2.PdfFileReader(infp)
        page = reader.getPage(0)
        # Crop the tree. Coordinates below are only referential
        page.cropBox.lowerLeft = [100, 200]
        page.cropBox.upperRight = [250, 300]
        # Create an empty document and add a single page containing only the cropped page
        writer = PyPDF2.PdfFileWriter()
        writer.addPage(page)
        # Write while the input file is still open, because pages are read lazily
        with open(out_file, 'wb') as outfp:
            writer.write(outfp)

def insert_tree_into_page(tree_document, text_document):
    with open(text_document, 'rb') as text_fp, open(tree_document, 'rb') as tree_fp:
        # Load the first page of the document containing 'text text text text...'
        text_page = PyPDF2.PdfFileReader(text_fp).getPage(0)
        # Load the previously cropped tree (cropped using 'extract_tree')
        tree_page = PyPDF2.PdfFileReader(tree_fp).getPage(0)
        # Overlay the text-page and the tree-crop
        text_page.mergeScaledTranslatedPage(page2=tree_page, scale='1.0', tx='100', ty='200')
        # Save the result into a new empty document
        output = PyPDF2.PdfFileWriter()
        output.addPage(text_page)
        with open('merged_document.pdf', 'wb') as output_stream:
            output.write(output_stream)

# First, crop the tree and save it into cropped_document.pdf
extract_tree('document1.pdf', 'cropped_document.pdf')
# Now merge document2.pdf with cropped_document.pdf
insert_tree_into_page('cropped_document.pdf', 'document2.pdf')
The method "extract_tree" seems to be working. It generates a pdf file containing only the cropped region (in the example, the tree).
The problem is that when I try to paste the tree into the new page, the star and the house of the original image are pasted anyway.
I tried something that actually worked. Try converting your first output (the PDF containing only the tree) to docx, then convert it back from docx to PDF before merging it with the other PDF pages. It will work (only the tree will be merged).
May I ask how you implemented an interface that defines the bounds of the crop?
I had the exact same issue. In the end, the solution for me was to make a small edit to the source code of PyPDF2 (from this pull request, which never made it into the master branch). What you need to do is insert these lines into the method _mergePage of the class PageObject inside the file pdf.py:
page2Content = ContentStream(page2Content, self.pdf)
page2Content.operations.insert(0, [map(FloatObject, [page2.trimBox.getLowerLeft_x(), page2.trimBox.getLowerLeft_y(), page2.trimBox.getWidth(), page2.trimBox.getHeight()]), "re"])
page2Content.operations.insert(1, [[], "W"])
page2Content.operations.insert(2, [[], "n"])
(see the pull request for exactly where to put them). With that done, you can then crop the section of a pdf you want, and merge it with another page with no issues. There's no need to save the cropped section into a separate pdf, unless you want to.
from PyPDF2 import PdfFileReader, PdfFileWriter
tree_page = PdfFileReader(open('document1.pdf','rb')).getPage(0)
text_page = PdfFileReader(open('document2.pdf','rb')).getPage(0)
tree_page.cropBox.lowerLeft = [100,200]
tree_page.cropBox.upperRight = [250, 300]
text_page.mergeScaledTranslatedPage(page2=tree_page, scale='1.0', tx='100', ty='200')
output = PdfFileWriter()
output.addPage(text_page)
output.write(open('merged_document.pdf', 'wb'))
Maybe there's a better way of doing this that inserts that code without directly editing the source code. I'd be grateful if anyone finds a way to do it as this admittedly is a slightly dodgy hack.
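For readers wondering what the three operations inserted by the patch do: in PDF content-stream syntax, re appends a rectangle to the current path, W marks that path as the clipping path, and n ends the path without painting it, so everything drawn afterwards is limited to the rectangle. The sketch below only builds that prefix as text to make the idea concrete. clip_prefix is a hypothetical helper, not part of PyPDF2, and it does not replace the patch.

```python
def clip_prefix(x, y, width, height):
    """Build the content-stream operators that clip drawing to a rectangle.

    (x, y) is the lower-left corner of the rectangle; the result mirrors
    the `re`, `W`, `n` operations the patch inserts before page2's content.
    """
    def fmt(value):
        # Plain decimal notation without trailing zeros (no exponent form).
        return f"{value:.4f}".rstrip("0").rstrip(".")

    numbers = " ".join(fmt(v) for v in (x, y, width, height))
    return f"{numbers} re W n\n".encode("ascii")


# Box from the example above: lower-left (100, 200), upper-right (250, 300).
prefix = clip_prefix(100, 200, 250 - 100, 300 - 200)
print(prefix)
```

Anything that follows such a prefix in the same content stream is only visible inside the rectangle, which is why the star and the house no longer show up after the patch.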