Create searchable (multipage) PDF with Python


I've found some guides online on how to make a scanned PDF searchable. However, I'm struggling to figure out how to do it for a multipage PDF.
My code takes a multipage PDF, converts each page into a JPG, runs OCR on each page, and then converts the result into a PDF. However, only the last page ends up in the output.
import pytesseract
from pdf2image import convert_from_path
pytesseract.pytesseract.tesseract_cmd = 'directory'
TESSDATA_PREFIX = 'directory'
tessdata_dir_config = '--tessdata-dir directory'
# Path of the pdf
PDF_file = r"pdf directory"
def pdf_text():
    # Store all the pages of the PDF in a variable
    pages = convert_from_path(PDF_file, 500)
    image_counter = 1
    for page in pages:
        # Declare file names
        filename = "page_"+str(image_counter)+".jpg"
        # Save the image of the page in system
        page.save(filename, 'JPEG')
        # Increment the counter to update filename
        image_counter = image_counter + 1
    # Variable to get count of total number of pages
    filelimit = image_counter-1
    outfile = "out_text.pdf"
    # Open the file in append mode so that all contents of all images are added to the same file
    f = open(outfile, "a")
    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):
        filename = "page_"+str(i)+".jpg"
        # Recognize the text as string in image using pytesseract
        result = pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config)
        f = open(outfile, "w+b")
        f.write(bytearray(result))
    f.close()
pdf_text()
How can I run this for all pages and output one merged PDF?

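I can't run it, but I think the whole problem is that you call open(..., 'w+b') inside the loop. That mode truncates the file, so each iteration removes the previous content and in the end only the last page is written.
You should open the file once before the loop, in binary append mode (open(outfile, "ab"), since the result is bytes), and close it after the loop.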
# --- before loop ---
f = open(outfile, "ab")
# --- loop ---
for i in range(1, filelimit+1):
    filename = f"page_{i}.jpg"
    result = pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config)
    f.write(bytearray(result))
# --- after loop ---
f.close()
BTW:
There is another problem: image_to_pdf_or_hocr creates a complete PDF each time, with its own header and trailer, so appending two results to one file can't produce a valid PDF. You have to use a dedicated module to merge PDFs; see Merge PDF files.
Something similar to
# --- before loop ---
from PyPDF2 import PdfFileMerger
import io
merger = PdfFileMerger()
# --- loop ---
for i in range(1, filelimit + 1):
    filename = "page_"+str(i)+".jpg"
    result = pytesseract.image_to_pdf_or_hocr(filename, lang="eng", config=tessdata_dir_config)
    pdf_file_in_memory = io.BytesIO(result)
    merger.append(pdf_file_in_memory)
# --- after loop ---
merger.write(outfile)
merger.close()
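As a sketch of the same idea without the temporary JPEG files: pdf2image already returns PIL images, and pytesseract accepts those directly, so each page can be OCR'd in memory and appended to a single writer. This assumes the pypdf package (the successor to PyPDF2) is installed and uses its PdfWriter.append method in place of PdfFileMerger. The function name build_searchable_pdf and the 300 DPI default are illustrative choices, not something taken from the question.

```python
import io


def build_searchable_pdf(pdf_path, out_path, dpi=300, lang="eng", config=""):
    """OCR every page of a scanned PDF and write one merged, searchable PDF.

    Returns the number of pages written.
    """
    # Third-party imports are kept local so this module imports without them.
    # Assumption: pdf2image, pytesseract and pypdf are installed.
    from pdf2image import convert_from_path
    import pytesseract
    from pypdf import PdfWriter

    writer = PdfWriter()
    page_count = 0
    for image in convert_from_path(pdf_path, dpi):
        # Each call returns the bytes of a complete one-page PDF.
        page_pdf = pytesseract.image_to_pdf_or_hocr(
            image, lang=lang, config=config, extension="pdf"
        )
        # Merge at the PDF level instead of concatenating raw bytes.
        writer.append(io.BytesIO(page_pdf))
        page_count += 1

    with open(out_path, "wb") as out_file:
        writer.write(out_file)
    return page_count
```

A call such as build_searchable_pdf("scan.pdf", "out_text.pdf", config=tessdata_dir_config) would then replace the two loops in the question.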

There are a number of potential issues here, and without being able to debug it's hard to say what the root cause is.
Are the JPGs being created successfully, and as separate files as expected?
I would suspect that pages = convert_from_path(PDF_file, 500) is not returning what you expect. Have you manually verified that the page images are being created?

Related

PDF range split

I am trying to split a PDF file by finding a keyword and then grabbing the page the keyword is on plus the following 4 pages, 5 pages in total, and writing them to their own PDF. The keyword is repeated further down the original PDF an unknown number of times, so the loop should then find the next occurrence, grab that page plus the 4 after it, and write those to another PDF.
Example: on the first pass the keyword is found on page 7, so I need page 7 and pages 8-11, and pages 7-11 go into one PDF file.
On the next pass the keyword is found on page 12, so I need page 12 and pages 13-16, and pages 12-16 go into their own PDF. At this point two separate PDFs have been created.
The code below finds the keyword and writes the page to its own PDF file, but only that one page. I'm not sure how to include the range.
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
path = "example.pdf"
fname = os.path.basename(path)
reader = PdfFileReader(path)
for page_number in range(reader.getNumPages()):
    writer = PdfFileWriter()
    writer.addPage(reader.getPage(page_number))
    text = reader.getPage(page_number).extractText()
    text_stripped = text.replace("\n", "")
    print(text_stripped)
    if text_stripped.find("Disregarded Branch") != (-1):
        output_filename = f"{fname}_page_{page_number + 1}.pdf"
        with open(output_filename, "wb") as out:
            writer.write(out)
        print(f"Created: {output_filename}")

Disclaimer: I am the author of borb, the library used in this answer.
I think your question comes down to 2 common functionalities:
find the location of a given piece of text
merge/split/extract pages from a PDF
For the first part, there is a good tutorial in the examples repo.
You can find it here. I'll repeat one of the examples here for completeness.
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction
def main():
    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("output.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])
    # check whether we have read a Document
    assert doc is not None
    # print the text on the first Page
    print(l.get_text_for_page(0))
if __name__ == "__main__":
    main()
This example extracts all the text from page 0 of the PDF. Of course, you could simply iterate over all pages and check whether a given page contains the keyword you're looking for.
For the second part, you can find a good example in the examples repository. This is the link. This example (and the subsequent one) takes you through the basics of frankensteining a PDF from various sources.
The example I copy/paste here shows how to build a PDF by alternately picking a page from input document 1 and input document 2.
import typing
from borb.pdf.document.document import Document
from borb.pdf.page.page import Page
from borb.pdf.pdf import PDF
def main():
    # open doc_001
    doc_001: typing.Optional[Document] = Document()
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc_001 = PDF.loads(pdf_file_handle)
    # open doc_002
    doc_002: typing.Optional[Document] = Document()
    with open("output_002.pdf", "rb") as pdf_file_handle:
        doc_002 = PDF.loads(pdf_file_handle)
    # create new document
    d: Document = Document()
    for i in range(0, 10):
        p: typing.Optional[Page] = None
        if i % 2 == 0:
            p = doc_001.get_page(i)
        else:
            p = doc_002.get_page(i)
        d.append_page(p)
    # write
    with open("output_003.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, d)
if __name__ == "__main__":
    main()

You've almost got it!
import os
from PyPDF2 import PdfFileReader, PdfFileWriter
def create_range_pdf(base_pdf_path, start, page_count=5):
    # Write the keyword page plus the following pages (5 in total) to a new PDF
    reader = PdfFileReader(base_pdf_path)
    writer = PdfFileWriter()
    for i in range(page_count):
        index = start + i
        # Skip indexes past the end of the document
        if index < len(reader.pages):
            page = reader.pages[index]
            writer.addPage(page)
    fname = os.path.basename(base_pdf_path)
    output_filename = f"{fname}_page_{start + 1}.pdf"
    with open(output_filename, "wb") as out:
        writer.write(out)
    print(f"Created: {output_filename}")
def main(base_pdf_path="example.pdf"):
    reader = PdfFileReader(base_pdf_path)
    for page_number, page in enumerate(reader.pages):
        text = page.extractText()
        text_stripped = text.replace("\n", "")
        print(text_stripped)
        if text_stripped.find("Disregarded Branch") != (-1):
            create_range_pdf(base_pdf_path, page_number)
if __name__ == "__main__":
    main()
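The page arithmetic can also be separated from the PDF library, which makes it easy to check on its own. The following is a minimal sketch, not part of the answer above: keyword_ranges is a hypothetical helper, and skipping a hit that falls inside a block already emitted is an assumption about the desired behaviour (the question does not say what should happen in that case).

```python
def keyword_ranges(hit_pages, total_pages, span=5):
    """Return (start, stop) page-index pairs, one per keyword hit.

    hit_pages: zero-based indexes of the pages that contain the keyword.
    stop is exclusive and is clamped to the end of the document.
    Hits that fall inside a range already emitted are skipped (assumption).
    """
    ranges = []
    covered_until = 0
    for start in sorted(hit_pages):
        if start < covered_until:
            continue  # keyword repeated inside the previous block
        stop = min(start + span, total_pages)
        ranges.append((start, stop))
        covered_until = stop
    return ranges


# Pages 7 and 12 of the question's example are indexes 6 and 11.
print(keyword_ranges([6, 11], total_pages=20))  # [(6, 11), (11, 16)]
```

Each pair can then drive a loop such as for index in range(start, stop): writer.addPage(reader.pages[index]), with one writer and one output file per pair.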

Delete all pdf files in a folder using python

I am trying to convert all PDF files to .jpg files and then remove the PDFs from the directory. I am able to convert all the PDFs to JPGs, but when I try to delete them, I get the error "The process is being used by another person".
Could you please help me?
Below is the code.
The script below converts all PDFs to JPEGs and stores them in the same location.
for fn in files:
    doc = fitz.open(pdffile)
    page = doc.loadPage(0) # number of page
    pix = page.getPixmap()
    fn1 = fn.replace('.pdf', '.jpg')
    output = fn1
    pix.writePNG(output)
    os.remove(fn) # one file at a time.
path = 'D:/python_ml/Machine Learning/New folder/Invoice/'
i = 0
for file in os.listdir(path):
    path_to_zip_file = os.path.join(path, folder)
    if file.endswith('.pdf'):
        os.remove(file)
        i += 1

As @K J noted in their comment, the problem is most probably that files are not being closed; indeed, your code never closes the doc object(s).
(Based on the line fitz.open(pdffile), I guess you are using the PyMuPDF library.)
The problematic fragment:
doc = fitz.open(pdffile)
page = doc.loadPage(0) # number of page
pix = page.getPixmap()
fn1 = fn.replace('.pdf', '.jpg')
output = fn1
pix.writePNG(output)
...should be adjusted, e.g., in the following way:
with fitz.open(pdffile) as doc:
    page = doc.loadPage(0) # number of page
    pix = page.getPixmap()
    output = fn.replace('.pdf', '.jpg')
    pix.writePNG(output)
(Side note: the fn1 variable seems to be completely redundant, so I got rid of it. Also, shouldn't pdffile be replaced with fn? What is pdffile, actually?)
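One more detail in the question's second loop: os.listdir(path) returns bare file names, so os.remove(file) looks for them in the current working directory rather than in path. A minimal sketch of the clean-up step using pathlib, assuming every document has already been closed as described above (remove_pdfs is a hypothetical helper name):

```python
from pathlib import Path


def remove_pdfs(folder):
    """Delete every .pdf file directly inside folder; return the names removed."""
    removed = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        # pdf_path already includes the folder, unlike a bare os.listdir() name.
        pdf_path.unlink()
        removed.append(pdf_path.name)
    return removed
```

Calling remove_pdfs(path) once, after the conversion loop has finished, keeps the deletion separate from the code that still has files open.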

How to make my Tesseract-OCR conversion code run faster

I have a conversion script that converts PDF files and image files to text files, but it takes forever to run. It took almost 48 hours to finish 2000 PDF documents. Right now I have a pool of around 12000+ documents that I need to convert, and at my previous rate I can't imagine how long that will take. Is there anything I can change in my code to make it run faster?
Here is the code that I used.
def tesseractOCR_pdf(pdf):
    filePath = pdf
    pages = convert_from_path(filePath, 500)
    # Counter to store images of each page of PDF to image
    image_counter = 1
    # Iterate through all the pages stored above
    for page in pages:
        # Declaring filename for each page of PDF as JPG
        # For each page, filename will be:
        # PDF page 1 -> page_1.jpg
        # PDF page 2 -> page_2.jpg
        # PDF page 3 -> page_3.jpg
        # ....
        # PDF page n -> page_n.jpg
        filename = "page_"+str(image_counter)+".jpg"
        # Save the image of the page in system
        page.save(filename, 'JPEG')
        # Increment the counter to update filename
        image_counter = image_counter + 1
    # Variable to get count of total number of pages
    filelimit = image_counter-1
    # Create an empty string for storing purposes
    text = ""
    # Iterate from 1 to total number of pages
    for i in range(1, filelimit + 1):
        # Set filename to recognize text from
        # Again, these files will be:
        # page_1.jpg
        # page_2.jpg
        # ....
        # page_n.jpg
        filename = "page_"+str(i)+".jpg"
        # Recognize the text as string in image using pytesseract
        text += str(((pytesseract.image_to_string(Image.open(filename)))))
        text = text.replace('-\n', '')
    # Delete all the jpg files created above
    for i in glob.glob("*.jpg"):
        os.remove(i)
    return text
def tesseractOCR_img(img):
    filePath = img
    text = str(pytesseract.image_to_string(filePath,lang='eng',config='--psm 6'))
    text = text.replace('-\n', '')
    return text
def Tesseract_ALL(docDir, txtDir, troubleDir):
    if docDir == "": docDir = os.getcwd() + "\\" #if no docDir passed in
    for doc in os.listdir(docDir): #iterate through docs in doc directory
        try:
            fileExtension = doc.split(".")[-1]
            if fileExtension == "pdf":
                pdfFilename = docDir + doc
                text = tesseractOCR_pdf(pdfFilename) #get string of text content of pdf
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w") #make text file
                textFile.write(text) #write text to text file
            else:
                # elif (fileExtension == "tif") | (fileExtension == "tiff") | (fileExtension == "jpg"):
                imgFilename = docDir + doc
                text = tesseractOCR_img(imgFilename) #get string of text content of img
                textFilename = txtDir + doc + ".txt"
                textFile = open(textFilename, "w") #make text file
                textFile.write(text) #write text to text file
        except:
            print("Error in file: "+ str(doc))
            shutil.move(os.path.join(docDir, doc), troubleDir)
    for filename in os.listdir(txtDir):
        fileExtension = filename.split(".")[-2]
        if fileExtension == "pdf":
            os.rename(txtDir + filename, txtDir + filename.replace('.pdf', ''))
        elif fileExtension == "tif":
            os.rename(txtDir + filename, txtDir + filename.replace('.tif', ''))
        elif fileExtension == "tiff":
            os.rename(txtDir + filename, txtDir + filename.replace('.tiff', ''))
        elif fileExtension == "jpg":
            os.rename(txtDir + filename, txtDir + filename.replace('.jpg', ''))
docDir = "/drive/codingstark/Project/pdf/"
txtDir = "/drive/codingstark/Project/txt/"
troubleDir = "/drive/codingstark/Project/trouble_pdf/"
Tesseract_ALL(docDir, txtDir, troubleDir)
Does anyone know how I can edit my code to make it run faster?

I think a process pool would be perfect for your case.
First you need to figure out which parts of your code can run independently of each other, then wrap them in a function.
Here is an example:
from concurrent.futures import ProcessPoolExecutor
def do_some_OCR(filename):
    pass
# file_list is your list of file paths
with ProcessPoolExecutor() as executor:
    for file in file_list:
        _ = executor.submit(do_some_OCR, file)
The code above submits one task per file to a pool of worker processes, so several files are processed in parallel.
You can find the official documentation here: https://docs.python.org/3/library/concurrent.futures.html#concurrent.futures.ProcessPoolExecutor
There is also a really awesome video that shows step by step how to use processes for exactly this: https://www.youtube.com/watch?v=fKl2JW_qrso
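If the results are needed back in the parent process, executor.map is a convenient alternative to submit. The following is a sketch under stated assumptions: ocr_one is a placeholder worker that only measures the path length, run_in_pool is a hypothetical helper, and the executor_cls parameter exists only so that the same code can be tried with ThreadPoolExecutor.

```python
from concurrent.futures import ProcessPoolExecutor


def ocr_one(path):
    """Placeholder worker: replace the body with the real per-file OCR call."""
    return path, len(path)


def run_in_pool(worker, paths, executor_cls=ProcessPoolExecutor, max_workers=None):
    """Apply worker to every path using a pool; return results in input order."""
    with executor_cls(max_workers=max_workers) as executor:
        return list(executor.map(worker, paths))
```

In a script, the call (for example results = run_in_pool(ocr_one, file_list)) should sit under an if __name__ == "__main__": guard, because on platforms that start worker processes by re-importing the main module an unguarded call would run again in every worker. The worker must also be a top-level function so it can be pickled.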

Here is a compact version of the function that removes the file-writing steps. Based on what I read in the APIs I think it should work, but I haven't tested it.
Note that I changed from a string to a list, because adding to a list is MUCH less costly than appending to a string (see this about join vs. concatenation:
How slow is Python's string concatenation vs. str.join?). The TL;DR is that string concatenation makes a new string every time, so with large strings you end up copying the data many times.
Also, calling replace on the whole string after each concatenation created yet another new string every iteration, so I moved it to operate on each page's string as it is generated. Note that if for some reason the '-\n' sequence was an artifact of the concatenation itself, the replace should be removed from where it is and placed here instead: return ''.join(pageText).replace('-\n',''). Be aware that this creates one new string with the join and then a whole new string with the replace.
def tesseractOCR_pdf(pdf):
    pages = convert_from_path(pdf, 500)
    # Create an empty list for storing purposes
    pageText = []
    # Iterate through all the pages stored above; each one is a PIL Image
    for page in pages:
        # Recognize the text in the image as a string using pytesseract
        # Add the text to a list while removing the -\n characters.
        pageText.append(str(pytesseract.image_to_string(page)).replace('-\n',''))
    return ''.join(pageText)
An even more compact one-liner version:
def tesseractOCR_pdf(pdf):
    # This takes each page of the pdf, extracts the text, removing -\n and combines the text.
    return ''.join([str(pytesseract.image_to_string(page)).replace('-\n', '') for page in convert_from_path(pdf, 500)])
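The difference between replacing per page and replacing after the join only shows up when a hyphen ends one page's text and the next page's text begins with a newline. A small sketch of the replace-after-join variant (join_pages is a hypothetical name, and the sample strings are made up for illustration):

```python
def join_pages(page_texts):
    """Join per-page OCR text and undo end-of-line hyphenation."""
    # Replacing after the join also catches a "-" that ends one page
    # when the next page's text starts with a newline.
    return "".join(page_texts).replace("-\n", "")


pages = ["co-", "\noperate"]
per_page = "".join(text.replace("-\n", "") for text in pages)
print(repr(per_page))           # 'co-\noperate'
print(repr(join_pages(pages)))  # 'cooperate'
```

Whether that boundary case occurs in real OCR output depends on the documents, so either placement of the replace may be fine in practice.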

How to read all pdf files in a directory and convert to text file using tesseract python 3?

The code below reads one PDF file and converts it to a text file.
But I want to read all PDF files in a directory and convert each one to a text file using Tesseract and Python 3.
from PIL import Image
import pytesseract
import sys
from pdf2image import convert_from_path
import os
pdf_filename = "pdffile_name.pdf"
txt_filename = "text_file_created.txt"
def tesseract(pdf_filename,txt_filename):
    PDF_file = pdf_filename
    pages = convert_from_path(PDF_file, 500)
    image_counter = 1
    for page in pages:
        pdf_filename = "page_"+str(image_counter)+".jpg"
        page.save(pdf_filename, 'JPEG')
        image_counter = image_counter + 1
    filelimit = image_counter-1
    outfile = txt_filename
    f = open(outfile, "a",encoding = "utf-8")
    for i in range(1, filelimit + 1):
        pdf_filename = "page_"+str(i)+".jpg"
        text = str(((pytesseract.image_to_string(Image.open(pdf_filename)))))
        text = text.replace('-\n', '')
        f.write(text)
    f.close()
    f1 = open(outfile, "r",encoding = "utf-8")
    text_list = f1.readlines()
    return text_list
tesseract(pdf_filename,txt_filename)
I have code for reading the PDF files in a directory, but I don't know how to combine it with the code above.
def readfiles():
    os.chdir(path)
    pdfs = []
    for file_list in glob.glob("*.pdf"):
        print(file_list)
        pdfs.append(file_list)
readfiles()

Simply convert the variable pdf_filename to a list using this code snippet:
import glob
pdf_filename = [f for f in glob.glob("your_preferred_path/*.pdf")]
This gets all the PDF files you want and stores them in a list.
Or simply use any of the methods posted here:
How do I list all files of a directory?
Once you do that, you have a list of PDF files.
Now iterate over the list of PDFs, one at a time, which will give you a list of text files.
You can use something like this code snippet:
for one_pdf in pdf_filename:
    # your code to convert the file, e.g. the tesseract() function from the question
    tesseract(one_pdf, one_pdf[:-4] + ".txt")
Hope this helps.
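Put together, the directory loop could look like the sketch below. It assumes a conversion callable that takes a PDF path and returns the recognized text as one string, for example a variant of the question's tesseract() function that returns the text instead of writing the file itself. convert_folder and its parameter names are hypothetical.

```python
from pathlib import Path


def convert_folder(pdf_dir, txt_dir, pdf_to_text):
    """Run pdf_to_text on each PDF in pdf_dir and save <name>.txt in txt_dir.

    pdf_to_text is any callable taking a PDF path and returning a string.
    Returns the list of text files written.
    """
    out_dir = Path(txt_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        txt_path.write_text(pdf_to_text(pdf_path), encoding="utf-8")
        written.append(txt_path)
    return written
```

Passing the converter in as an argument keeps the file handling separate from the OCR step, so the loop can be tried with a dummy function before running Tesseract on the whole directory.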

Highlight text content in pdf files using python and save a screenshot

I have a list of PDF files, and I need to highlight specific text on each page of these files and save a snapshot for each of the text instances.
So far I am able to highlight the text and save the entire page of a PDF file as a snapshot. But I want to find the position of the highlighted text and take a zoomed-in snapshot, which will be more detailed than the full-page snapshot.
I'm pretty sure there must be a solution to this problem. I am new to Python and hence have not been able to find it. I would be really grateful if someone could help me out with this.
I have tried the PyPDF2 and PyMuPDF libraries but couldn't figure out the solution. I also tried highlighting by providing coordinates, which works, but I couldn't find a way to get these coordinates as output.
[Image: sample snapshot from the code]
#import PyPDF2
import os
import fitz
from wand.image import Image
import csv
#import re
#from pdf2image import convert_from_path
check = r'C:\Users\Pradyumna.M\Desktop\Pradyumna\Automation\Intel Bytes\Create Source Docs\Sample Check 8 Apr 2019'
dir1 = check + '\\Source Docs\\'
dir2 = check + '\\Output\\'
dir = [dir1, dir2]
for x in dir:
    try:
        os.mkdir(x)
    except FileExistsError:
        print("Directory ", x, " already exists")
### READ PDF FILE
with open('upload1.csv', newline='') as myfile:
    reader = csv.reader(myfile)
    for row in reader:
        rowarray = '; '.join(row)
        src = rowarray.split("; ")
        file = check + '\\' + src[4] + '.pdf'
        print(file)
        #pdfFileObj = open(file,'rb')
        #pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
        #print("Total number of pages: " + str(pdfReader.numPages))
        doc = fitz.open(file)
        print(src[5])
        for i in range(int(src[5])-1, int(src[5])):
            i = int(i)
            page = doc[i]
            print("Processing page: " + str(i))
            text = src[3]
            #SEARCH TEXT
            print("Searching: " + text)
            text_instances = page.searchFor(text)
            for inst in text_instances:
                highlight = page.addHighlightAnnot(inst)
            file1 = check + '\\Output\\' + src[4] + '_output.pdf'
            print(file1)
            doc.save(file1, garbage=4, deflate=True, clean=True)
            ### Screenshot
            with(Image(filename=file1, resolution=150)) as source:
                images = source.sequence
                newfilename = check + "\\Source Docs\\" + src[0] + '.jpeg'
                Image(images[i]).save(filename=newfilename)
                print("Screenshot of " + src[0] + " saved")

"couldn't find a way to get these coordinates as output"
- you can get the coordinates out by doing this:
for inst in text_instances:
    print(inst)
Each inst is a fitz.Rect object, which contains the top-left and bottom-right coordinates of the piece of text that was found. All the information is available in the docs.
I managed to highlight points and also save a cropped region using the following snippet of code. I am using python 3.7.1 and my output for fitz.version is ('1.14.13', '1.14.0', '20190407064320').
import fitz
doc = fitz.open("foo.pdf")
inst_counter = 0
for pi in range(doc.pageCount):
    page = doc[pi]
    text = "hello"
    text_instances = page.searchFor(text)
    five_percent_height = (page.rect.br.y - page.rect.tl.y)*0.05
    for inst in text_instances:
        inst_counter += 1
        highlight = page.addHighlightAnnot(inst)
        # define a suitable cropping box which spans the whole page
        # and adds padding around the highlighted text
        tl_pt = fitz.Point(page.rect.tl.x, max(page.rect.tl.y, inst.tl.y - five_percent_height))
        br_pt = fitz.Point(page.rect.br.x, min(page.rect.br.y, inst.br.y + five_percent_height))
        hl_clip = fitz.Rect(tl_pt, br_pt)
        zoom_mat = fitz.Matrix(2, 2)
        pix = page.getPixmap(matrix=zoom_mat, clip = hl_clip)
        pix.writePNG(f"pg{pi}-hl{inst_counter}.png")
doc.close()
I tested this on a sample PDF that I peppered with "hello"; the script saved one cropped PNG per match.
I composed the solution out of the following pages of the documentation:
Tutorial page to get introduced into the library
page.searchFor to figure out the return type of the searchFor method
fitz.Rect to understand what the returned objects from page.searchFor are
Collection of Recipes page (called faq in the URL) to figure out how to crop and save part of a pdf page
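The clipping arithmetic in the script can be isolated into a plain function, which makes the clamping at the page edges easier to see. This is a sketch with plain tuples rather than fitz objects; padded_clip is a hypothetical name, and boxes are written as (x0, y0, x1, y1), the order in which fitz.Rect stores its coordinates.

```python
def padded_clip(page_box, hit_box, pad_fraction=0.05):
    """Return a full-width clip box around hit_box, padded vertically.

    Boxes are (x0, y0, x1, y1) tuples with y growing downward.
    The padding is a fraction of the page height and never leaves the page.
    """
    page_x0, page_y0, page_x1, page_y1 = page_box
    _, hit_y0, _, hit_y1 = hit_box
    pad = (page_y1 - page_y0) * pad_fraction
    top = max(page_y0, hit_y0 - pad)
    bottom = min(page_y1, hit_y1 + pad)
    return (page_x0, top, page_x1, bottom)
```

The four returned numbers can be passed to fitz.Rect(x0, y0, x1, y1) to build the clip argument for the pixmap call, with page_box taken from page.rect and hit_box from each inst.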
