Convert Rect location from pymupdf to a page number - python

I search for certain text such as "exam" with PyMuPDF, get the rectangles where it occurs, and highlight the text in the PDF at those locations. I now want to delete every page that does not contain this text, so I use the doc.select() function to choose the pages to keep before saving a new PDF that contains only the pages with highlighted text.
The Issue
doc.select() has to be given the page numbers I want to keep.
What I tried was to pass it the rectangle coordinates returned by the search, but I got the following error:
ValueError: bad page number(s)
I now understand that I must somehow convert the coordinates of the rectangles to page numbers.
But I do not know how to do this, and it is not mentioned anywhere in the docs (correct me if I am wrong).
Current code
from pathlib import Path
import fitz
directory = "pdfs"
# iterate over files in
# that directory
files = Path(directory).glob('*')
for file in files:
    doc = fitz.open(file)
    for page in doc:
        ### SEARCH
        text = "Exam"
        text_instances = page.search_for(text)
        ### HIGHLIGHT
        for inst in text_instances:
            highlight = page.add_highlight_annot(inst)
            highlight.update()
    ### OUTPUT
    doc.select(text_instances)
    doc.save("output.pdf", garbage=4, deflate=True, clean=True)
Pdf that I used for testing purposes:
pdf

I now understand that I must somehow convert the coordinates of the rectangles to page numbers. But I do not know how to do this, and it is not mentioned anywhere in the docs (correct me if I am wrong).
That is completely wrong!
The rectangles returned by text searches are locations on the current page and have nothing to do with page numbers.
You are already iterating over the pages. If your search text is found on a page, put that page's number in a list, then add your highlights.
When finished with a document, call select() with the memorized page numbers, save and close the document, empty the page list, then continue with the next document.
Something like this:
for filename in filenamelist:
    select_pages = []  # page numbers with at least one hit; reset for every document
    doc = fitz.open(filename)
    for page in doc:
        hits = page.search_for(text)  # list of Rect objects on this page
        if not hits:
            continue
        select_pages.append(page.number)
        for rect in hits:
            page.add_highlight_annot(rect)
    doc.select(select_pages)
    doc.save(...)
    doc.close()
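A slightly fuller sketch of the same idea, for illustration only. The function names collect_hit_pages and keep_only_matching_pages, the return values, and the choice to skip documents without any hit are assumptions, not part of the answer above; the PyMuPDF calls are the ones already used in this thread.

```python
def collect_hit_pages(doc, needle):
    """Highlight every hit of `needle` and return the numbers of the pages that had one."""
    keep = []
    for page in doc:
        hits = page.search_for(needle)  # rectangles on THIS page only
        if not hits:
            continue
        keep.append(page.number)        # the page number comes from the page, not the rects
        for rect in hits:
            page.add_highlight_annot(rect)
    return keep


def keep_only_matching_pages(src_path, dst_path, needle):
    """Write a copy of src_path that contains only the pages mentioning `needle`.

    Returns False (and writes nothing) when no page matches.
    """
    import fitz  # PyMuPDF; imported here so collect_hit_pages can be tried without it

    doc = fitz.open(src_path)
    try:
        keep = collect_hit_pages(doc, needle)
        if not keep:
            return False                # assumption: skip documents without any hit
        doc.select(keep)                # expects page numbers, not rectangles
        doc.save(dst_path, garbage=4, deflate=True)
        return True
    finally:
        doc.close()
```

Saving each result under its own file name (rather than one fixed "output.pdf") keeps later documents from overwriting earlier ones.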

Related

How can I place a pdf asset into an empty PDF page?

I have multiple PDF files with small sizes (e.g. 3cm x 2 cm) exported from Adobe Indesign.
I want to compose many of these into one new PDF which has the size of a whole page.
The small PDFs contain a plotter line in a special color which would get lost if I convert them into images.
How can I place these PDFs (at given positions) using Python without losing the special color?
I tried to read into pypdf, pypdf2 and reportlab but I got lost and the examples I found did not work. I do not need the full code, a hint into the right direction would be enough (even with another language if necessary).
Thanks
Try:
cpdf in.pdf -stamp-on stamp.pdf -pos-left "x y" AND -stamp-on stamp2.pdf -pos-left "x2 y2" AND ..... -o out.pdf
where in.pdf is a blank PDF of appropriate size, and x and y and x2 and y2 etc... are the coordinates required and ..... are parts of the command for the third, fourth etc. stamps.
Here is some sample code to do your task using PyPDF2.
from PyPDF2 import PdfFileMerger
merger = PdfFileMerger()
for pdf in pdf_files:  # pdf_files is a list of the pdf files (path) to be merged
    merger.append(pdf)
merger.write(output_pdf)  # output_pdf is the path of the merged pdf file
merger.close()
Apart from joining / merging pages from different documents into one output PDF, PyMuPDF lets you also embed pages from PDFs into an existing page of some target PDF (as if it were an image).
You can select a subrectangle within the source page (a "clip") and the target rectangle of the target page. Multiple embedded source pages in the same target page are also supported.
This embedding maintains all original source page features - it will not be converted to an image.
Scaling between source / target rectangles will be done as necessary.
This snippet will take the full page number 0 of a source PDF and place it inside a rectangle of a page in the target PDF.
The source page will be positioned centered in the target rectangle.
import fitz # import PyMuPDF
source = fitz.open("source.pdf") # source PDF
target = fitz.open("target.pdf") # target PDF
target_page = target[pagenumber] # desired page in target
# if you want a new page, create it like so:
# target_page = target.new_page()
target_rect = fitz.Rect(100, 100, 300, 250) # show source page here
# if you want to cover the full target page, use
# target_rect = target_page.rect
target_page.show_pdf_page(
    target_rect,
    source,     # the source document
    0,          # the source page number
    clip=None,  # any subrectangle on source page
)
# if desired, more pages from other sources can be put on same target page
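Applied to the question (several small PDFs placed at given positions on one blank page), the snippet above might be extended roughly as follows. This is a sketch under assumptions: placement_box and compose_sheet are hypothetical helper names, the A4 default size and the centimetre-based interface are chosen for illustration, and positions are measured from the top-left corner of the page, which is how PyMuPDF measures coordinates.

```python
CM_TO_PT = 72 / 2.54  # PDF units are points: 72 per inch, 2.54 cm per inch


def placement_box(left_cm, top_cm, width_cm, height_cm):
    """Return (x0, y0, x1, y1) in points for a box given in centimetres."""
    x0 = left_cm * CM_TO_PT
    y0 = top_cm * CM_TO_PT
    return (x0, y0, x0 + width_cm * CM_TO_PT, y0 + height_cm * CM_TO_PT)


def compose_sheet(placements, out_path, page_width_cm=21.0, page_height_cm=29.7):
    """Place page 0 of each small PDF on one new page at its original size.

    placements: iterable of (pdf_path, left_cm, top_cm) tuples.
    """
    import fitz  # PyMuPDF

    sheet = fitz.open()  # new, empty PDF
    page = sheet.new_page(width=page_width_cm * CM_TO_PT,
                          height=page_height_cm * CM_TO_PT)
    sources = []
    for path, left_cm, top_cm in placements:
        src = fitz.open(path)
        sources.append(src)  # kept open until the sheet has been saved
        size = src[0].rect   # size of the small PDF in points
        box = placement_box(left_cm, top_cm,
                            size.width / CM_TO_PT, size.height / CM_TO_PT)
        page.show_pdf_page(fitz.Rect(*box), src, 0)  # embedded as PDF content, not as an image
    sheet.save(out_path)
    sheet.close()
    for src in sources:
        src.close()
```

Because the target box has the same size as the source page, no scaling takes place.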

How do I extract text in the right order from PDF using PyPDF2?

I am currently working on a project to extract the contents of a PDF. The code runs smoothly and I am able to extract the text, but the extracted text is not in the right order. It does not go from top to bottom and is really confusing.
I looked up online but there was very little help on how to order the text extraction. Most tutorials came up with the same result. For reference, this is the PDF that I am currently testing it on (page 5): https://www.pidm.gov.my/PIDM/files/13/134b5c79-5319-4199-ac68-99f62aca6047.pdf
import PyPDF2
with open('pdftest2.pdf', 'rb') as pdfTest:
    reader = PyPDF2.PdfFileReader(pdfTest)
    page5 = reader.getPage(4)
    text = page5.extractText()
    print(text)
The extracted text would always start with the footer of the page and then go its way from bottom to top. I noticed in the next page it would start from top to bottom but only for a few certain sentences. Then it would extract text from a different position of the page instead of continuing from where it left off.
All of the text does get extracted but the order of which it is extracted is all over the place. Is there any solution for this problem?
I had to deal with a similar problem, and it turned out that the module pdfplumber worked better than PyPDF. I guess it depends on the document itself, so you should try it.
Otherwise, another answer to your problem would be to treat the PDFs as images with the pdf2image module and extract the text from them using pytesseract. However, it might not be a perfect method, as the pdf2image function convert_from_path can take quite a long time to run.
I'll drop some code below in case you are interested.
First of all, make sure you install all necessary dependencies as well as Tesseract and ImageMagick. You can find installation information on their websites. If you are working with Windows there's a good Medium article here.
To convert PDFs to images using pdf2image:
Don't forget to add your poppler path if you are working on Windows. It should look something like this: r'C:\<your_path>\poppler-21.02.0\Library\bin'
import os
from pdf2image import convert_from_path
def pdftoimg(fic, output_folder, poppler_path):
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(fic, dpi=500, output_folder=output_folder, thread_count=9, poppler_path=poppler_path)
    # Iterate through all the pages stored above and save each one as a JPEG
    for image_counter, page in enumerate(pages):
        filename = "page_" + str(image_counter) + ".jpg"
        page.save(os.path.join(output_folder, filename), 'JPEG')
    # Remove the intermediate .ppm files that pdf2image wrote to output_folder
    for i in os.listdir(output_folder):
        if i.endswith('.ppm'):
            os.remove(os.path.join(output_folder, i))
To extract text from the image:
Your tesseract path is going to be something like this: r'C:\Program Files\Tesseract-OCR\tesseract.exe'
import pytesseract
from PIL import Image
def imgtotext(img, tesseract_path):
    # Point pytesseract at the Tesseract executable
    pytesseract.pytesseract.tesseract_cmd = tesseract_path
    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(img)))
    # Rejoin words that were hyphenated across line breaks
    text = text.replace('-\n', '')
    return text
I recently started using PyMuPDF. Its licensing is a little confusing, but some of its methods can sort the text as it naturally appears (left to right, top to bottom). Something like page.get_text("words", sort=True) is all it takes.
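To illustrate what putting words into reading order involves, here is a minimal sketch that does not call PyMuPDF at all. It assumes word tuples shaped like those returned by page.get_text("words"), that is (x0, y0, x1, y1, word, block_no, line_no, word_no). The name words_to_lines and the 3-point tolerance are hypothetical choices, and multi-column layouts would need more than this.

```python
def words_to_lines(words, tolerance=3.0):
    """Group word tuples into text lines, top to bottom and left to right.

    Words whose bottom edge (y1) lies within `tolerance` points of the current
    line's bottom edge are treated as belonging to that line.
    """
    lines = []  # list of (bottom_y, [(x0, text), ...])
    for x0, y0, x1, y1, text, *_ in sorted(words, key=lambda w: (w[3], w[0])):
        if lines and abs(lines[-1][0] - y1) <= tolerance:
            lines[-1][1].append((x0, text))
        else:
            lines.append((y1, [(x0, text)]))
    # Inside each line, order the words by their left edge
    return [" ".join(text for _, text in sorted(items)) for _, items in lines]
```

With a real document the input would come from something like words_to_lines(page.get_text("words")).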

Webscrape images and their alt text in Python when you know part of a URL

I want to webscrape a site and save some, but not all, of its images to my computer. I want to save about 5,600 images, so doing it manually would be difficult. All of the image URLs start with
https://assets.pokemon.com/assets/cms2/img/cards/
and then some other stuff that is specific to the image. How can I download only images that meet that criteria?
Also (sorry, this is kind of 2 questions in 1, but it's related), how can I save each image's alt text as the file name?
Thanks!
Also sorry if this is a dumb question, if you can't tell by the fact that I'm scraping pokemon.com, I'm not exactly a professional.
Here's what I ended up doing:
import requests
import urllib.request
contents = requests.get(url)  # GET request to the site
data = contents.text  # Get the HTML file as text
x = data.split("\"")  # Split it into an array using double quotes as separators (because all of the image URLs were in quotes)
for a in range(len(x)):  # Run this code for every member of the array
    if 'https://assets.pokemon.com/assets/cms2/img/cards' in x[a]:  # Check for that URL snippet (not the full URL; each full URL just started with it)
        link = x[a]  # If it matches, store that member of the array separately to be extracted
        name = x[a+2]  # Alt text was always 2 members of the array later; not sure if this is true for all sites
        path = "/Users/myName/Desktop/Poke/" + name + ".png"  # This is where I wanted to store the files
        urllib.request.urlretrieve(link, path)  # Retrieve the file from the link and save it to the path
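A variant that does not depend on the alt text sitting exactly two quoted strings after the URL, offered as a sketch. It assumes the images appear as ordinary img tags with src and alt attributes, which has not been checked against the site. CardImageCollector, find_card_images and safe_filename are hypothetical names; only the standard library is used.

```python
from html.parser import HTMLParser

PREFIX = "https://assets.pokemon.com/assets/cms2/img/cards/"


class CardImageCollector(HTMLParser):
    """Collect (src, alt) pairs of <img> tags whose src starts with a prefix."""

    def __init__(self, prefix=PREFIX):
        super().__init__()
        self.prefix = prefix
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        if src.startswith(self.prefix):
            self.found.append((src, attributes.get("alt") or ""))


def find_card_images(html_text):
    """Return a list of (url, alt_text) for the matching images in html_text."""
    collector = CardImageCollector()
    collector.feed(html_text)
    return collector.found


def safe_filename(alt_text, fallback="image"):
    """Turn alt text into a file name, replacing characters such as '/'."""
    cleaned = "".join(c if c.isalnum() or c in " -_" else "_" for c in alt_text).strip()
    return (cleaned or fallback) + ".png"
```

Each pair from find_card_images(data) could then be downloaded with urllib.request.urlretrieve(link, folder + safe_filename(alt)), where folder is whatever directory you want to store the files in.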

Python + PyPdf: Crop region of page and paste it in another page

Let's say you have a pdf page with various complex elements inside.
The objective is to crop a region of the page (to extract only one of the elements) and then paste it in another pdf page.
Here is a simplified version of my code:
import PyPDF2
import pyPdf
def extract_tree(in_file, out_file):
    with open(in_file, 'rb') as infp:
        # Read the document that contains the tree (in its first page)
        reader = pyPdf.PdfFileReader(infp)
        page = reader.getPage(0)
        # Crop the tree. Coordinates below are only referential
        page.cropBox.lowerLeft = [100,200]
        page.cropBox.upperRight = [250,300]
        # Create an empty document and add a single page containing only the cropped page
        writer = pyPdf.PdfFileWriter()
        writer.addPage(page)
        with open(out_file, 'wb') as outfp:
            writer.write(outfp)
def insert_tree_into_page(tree_document, text_document):
    # Load the first page of the document containing 'text text text text...'
    text_page = PyPDF2.PdfFileReader(open(text_document,'rb')).getPage(0)
    # Load the previously cropped tree (cropped using 'extract_tree')
    tree_page = PyPDF2.PdfFileReader(open(tree_document,'rb')).getPage(0)
    # Overlay the text-page and the tree-crop
    text_page.mergeScaledTranslatedPage(page2=tree_page,scale='1.0',tx='100',ty='200')
    # Save the result into a new empty document
    output = PyPDF2.PdfFileWriter()
    output.addPage(text_page)
    outputStream = open('merged_document.pdf','wb')
    output.write(outputStream)
# First, crop the tree and save it into cropped_document.pdf
extract_tree('document1.pdf', 'cropped_document.pdf')
# Now merge document2.pdf with cropped_document.pdf
insert_tree_into_page('cropped_document.pdf', 'document2.pdf')
The method "extract_tree" seems to be working. It generates a pdf file containing only the cropped region (in the example, the tree).
The problem is that when I try to paste the tree into the new page, the star and the house of the original image are pasted anyway.
I tried something that actually worked. Try converting your first output (the pdf containing only the tree) to docx, then convert it back from docx to pdf before merging it with the other pdf pages. It will work (only the tree will be merged).
May I ask how you implemented an interface that defines the bounds of the crop?
I had the exact same issue. In the end, the solution for me was to make a small edit to the source code of PyPDF2 (from this pull request, which never made it into the master branch). What you need to do is insert these lines into the method _mergePage of the class PageObject inside the file pdf.py:
page2Content = ContentStream(page2Content, self.pdf)
page2Content.operations.insert(0, [map(FloatObject, [page2.trimBox.getLowerLeft_x(), page2.trimBox.getLowerLeft_y(), page2.trimBox.getWidth(), page2.trimBox.getHeight()]), "re"])
page2Content.operations.insert(1, [[], "W"])
page2Content.operations.insert(2, [[], "n"])
(see the pull request for exactly where to put them). With that done, you can then crop the section of a pdf you want, and merge it with another page with no issues. There's no need to save the cropped section into a separate pdf, unless you want to.
from PyPDF2 import PdfFileReader, PdfFileWriter
tree_page = PdfFileReader(open('document1.pdf','rb')).getPage(0)
text_page = PdfFileReader(open('document2.pdf','rb')).getPage(0)
tree_page.cropBox.lowerLeft = [100,200]
tree_page.cropBox.upperRight = [250, 300]
text_page.mergeScaledTranslatedPage(page2=tree_page, scale='1.0', tx='100', ty='200')
output = PdfFileWriter()
output.addPage(text_page)
output.write(open('merged_document.pdf', 'wb'))
Maybe there's a better way of doing this that inserts that code without directly editing the source code. I'd be grateful if anyone finds a way to do it as this admittedly is a slightly dodgy hack.
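One possible alternative that avoids editing library source is the show_pdf_page method of PyMuPDF with its clip parameter, as described in an earlier answer on this page. The sketch below rests on assumptions: paste_region and pdf_box_to_top_left are hypothetical names, the source page is taken to be unrotated with a MediaBox starting at (0, 0), and the crop coordinates are converted from the bottom-left origin used by the cropBox values above to the top-left origin PyMuPDF works with.

```python
def pdf_box_to_top_left(lower_left, upper_right, page_height):
    """Convert a box given with a bottom-left origin to (x0, y0, x1, y1) with a top-left origin."""
    x0, y_low = lower_left
    x1, y_high = upper_right
    return (x0, page_height - y_high, x1, page_height - y_low)


def paste_region(src_path, dst_path, out_path, lower_left, upper_right, target_box):
    """Show only the clipped region of page 0 of src inside target_box on page 0 of dst.

    target_box is (x0, y0, x1, y1) in PyMuPDF (top-left origin) coordinates.
    """
    import fitz  # PyMuPDF

    src = fitz.open(src_path)
    dst = fitz.open(dst_path)
    clip = fitz.Rect(*pdf_box_to_top_left(lower_left, upper_right, src[0].rect.height))
    dst[0].show_pdf_page(fitz.Rect(*target_box), src, 0, clip=clip)
    dst.save(out_path)
    dst.close()
    src.close()
```

For the crop above this might be called as paste_region('document1.pdf', 'document2.pdf', 'merged_document.pdf', [100, 200], [250, 300], (100, 100, 250, 200)), where the target box is only an example.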

PyMuPDF - read/write text box

I've been able to read the content of PDFs with PyMuPDF, using code similar to the following:
import fitz
myfile = r"C:\users\xxx\desktop\testpdf1.pdf"
doc = fitz.open(myfile)
page = doc[1]
text = page.get_text("text")  # spelled getText in older PyMuPDF releases
However, I can't read text box annotations. Is there a way to do this?
Use first_annot (firstAnnot in older PyMuPDF releases) on the page object. Once you have an annotation object, it looks like you can use next on it to get the others. Note the example at the bottom of the Annot page.
I created a PDF from a Word document and added one text box and one sticky note. The following code printed the contents of each. Look inside info for other information you may want.
import fitz
pdf = fitz.open('WordTest.pdf')
page = pdf[0]
annot = page.first_annot  # first annotation on the page
print(annot.info['content'])
next_annot = annot.next  # the annotation after it
print(next_annot.info['content'])
pdf.close()
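When the number of annotations is not known in advance, the same chain can be followed in a loop. A small sketch, assuming the snake_case attribute name first_annot of newer PyMuPDF releases; annotation_contents is a hypothetical helper name.

```python
def annotation_contents(page):
    """Return the 'content' text of every annotation on a page, in chain order."""
    contents = []
    annot = page.first_annot          # None when the page has no annotations
    while annot is not None:
        contents.append(annot.info.get("content", ""))
        annot = annot.next            # None after the last annotation
    return contents
```

Usage would look like print(annotation_contents(pdf[0])) for the document opened above.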
