Tesserocr vs Pytesseract Speed comparison - python

From what I've been able to gather online, when extracting text from multiple images in Python, the tesserocr library should be faster than pytesseract, since it does not have to initialize the Tesseract engine on every call; it just runs the recognition. However, I implemented the two functions below:
import tesserocr
import pytesseract
from PIL import Image

api = tesserocr.PyTessBaseAPI()

# tesserocr function
def tesserocr_extract(p):
    api.SetImageFile(p)
    text = api.GetUTF8Text()
    return text

# pytesseract function
def pytesseract_extract(p):
    pytesseract.pytesseract.tesseract_cmd = path_to_tesseract
    img = Image.open(p)
    # Extract text from image
    text = pytesseract.image_to_string(img)
    return text
When I use both functions to extract text from 20 images, tesserocr is always slower the first time around. When I extract text from the same set of images again, tesserocr is faster, maybe due to some image caching. I have also tried using tessdata_fast and observed the same result, and I tried api.SetImage(...) after loading the image with PIL, but it was still slower.
The images are mostly screenshots of websites that vary in size.
Am I doing something incorrectly, or is tesserocr simply slower than pytesseract for extracting text from multiple images?

Do not measure something you do not understand ("... maybe due to some image caching ..." suggests you do not really understand the code you posted above). Even if you get correct results (which you did not), you will not be able to interpret them.
If you were to analyse the differences between pytesseract and tesserocr, you would see that it is not possible for pytesseract to be faster than tesserocr: on every call, pytesseract has to perform several extra steps (such as spawning a new tesseract process) before reaching the state tesserocr starts from. In any case, on modern hardware the difference in speed is very small.
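For reference, a fairer benchmark reuses a single tesserocr API instance and times both libraries over the same files. A minimal sketch (assuming both libraries are installed, tesseract is on the PATH, and the placeholder paths are replaced with real images):

import time
import tesserocr
import pytesseract
from PIL import Image

# Placeholder paths; substitute the actual 20 screenshots.
image_paths = ["shot1.png", "shot2.png"]

# tesserocr: the engine is initialised once, then reused for every image.
with tesserocr.PyTessBaseAPI() as api:
    start = time.perf_counter()
    for p in image_paths:
        api.SetImageFile(p)
        api.GetUTF8Text()
    print("tesserocr:", time.perf_counter() - start)

# pytesseract: each call writes a temp file and spawns a new tesseract process.
start = time.perf_counter()
for p in image_paths:
    pytesseract.image_to_string(Image.open(p))
print("pytesseract:", time.perf_counter() - start)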

Related

Python: Basic OCR / Image to Text library

I have a very simple use case for OCR: take a .PNG image, extract the text from it, then print it to the console.
I'm after a lightweight library, and I am trying to avoid any system-level dependencies. Pytesseract is great, but deployment is a bit annoying for such a simple use case.
I have tried quite a few of them. They seem to be designed for more complex use cases.
Note: the white text is not necessarily suitable for OCR; I will change the image format to suit the OCR library.

How to convert a PDF to a JPG/PNG in Python with the highest possible quality?

I am trying to convert a PDF to an image so I can OCR it, but the quality is being degraded during the conversion.
There seem to be two main methods for converting a PDF to an image (JPG/PNG) with Python - pdf2image and ImageMagick/Wand.
# pdf2image (altering the dpi to 300/600 etc. does not seem to make a difference):
pages = convert_from_path("page.pdf", dpi=300)
for page in pages:
    page.save("page.jpg", 'JPEG')

# ImageMagick (Wand lib)
with Image(filename="page.pdf", resolution=300) as img:
    img.compression_quality = 100
    img.save(filename="page.jpg")
But if I simply take a screenshot of the PDF on a Mac, the quality is higher than using either Python conversion method.
A good way to see this is to run Tesseract OCR on the resulting images - both Python methods give average results, whereas the screenshot gives perfect results. (I've tried both PNG and JPG.)
Assume I have infinite time, computing power and storage space. I am only interested in image quality and OCR output. It's frustrating to have the perfect image just within reach, but not be able to generate it in code.
What is going on here? Is there a better way to convert a PDF? Is there a way I can get more direct control? Why does a screenshot do such a better job than an actual conversion?
You can use PyMuPDF and set the dpi you want:
import fitz
doc = fitz.open('some/pdf/path')
page = doc.load_page(0)
pixmap = page.get_pixmap(dpi=300)
img = pixmap.tobytes()
# Continue with whatever logic...
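If the goal is OCR rather than raw bytes, the rendered pixmap can be handed to pytesseract without touching the disk. A small sketch continuing from the code above (assuming pytesseract and Pillow are installed):

import io
import pytesseract
from PIL import Image

# tobytes() returns PNG-encoded bytes by default, so the pixmap
# can be opened with PIL and passed straight to Tesseract.
img = Image.open(io.BytesIO(pixmap.tobytes()))
print(pytesseract.image_to_string(img))

# Alternatively, write the rendered page to disk first:
# pixmap.save("page.png")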

Detecting Bangla characters using pytesseract

I am trying to detect Bangla characters from images of Bangla number plates using Python, so I decided to use pytesseract. For this purpose I have used the code below:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('input.png'),lang="ben")
print(text)
The problem is that the printed output is empty.
When I tried saving it to a text file, it came out like this:
Example Picture: (Link)
Expected output (should be something like, or at least close to, the following):
ঢাকা মেট্রো হ
৪৫ ২৩০৭
P.S: I have downloaded Bengali language data while installing Tesseract-OCR-64 and I am trying to run it in VS Code.
Can anyone help me to solve this problem or give me an idea of how to solve this problem?
The solution to this problem is:
You need to segment all the characters (you can take any approach you want, whether deep learning or image processing) and feed pytesseract only one character at a time, using only clear photos.
Reason: Tesseract can detect the Bangla language in pictures of reasonably good resolution. Its models for this language may be considerably weaker on small pictures (which is quite understandable).
Code:
### any deep learning approach or any image processing approach here
# load the segmented character
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
character = pytesseract.image_to_string(Image.open('char.png'),lang="ben")
print(character)
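As one concrete (hypothetical) way to fill in the segmentation step above, a classical image-processing pass with OpenCV could threshold the plate and crop each contour before handing it to pytesseract; the file name and noise threshold below are assumptions:

import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

# Binarise the plate so the characters become white blobs on black.
img = cv2.imread('plate.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Treat each external contour as one candidate character, left to right.
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in sorted(contours, key=lambda c: cv2.boundingRect(c)[0]):
    x, y, w, h = cv2.boundingRect(c)
    if w * h < 100:  # assumed threshold: skip small noise specks
        continue
    char_img = thresh[y:y + h, x:x + w]
    # --psm 10 tells Tesseract to treat the crop as a single character.
    print(pytesseract.image_to_string(char_img, lang='ben', config='--psm 10'))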

OCR for Bank Receipts

I am working on an OCR problem for bank receipts and I need to extract details like the date and account number. After processing the input, I am using Tesseract-OCR (via pytesseract in Python). I have obtained the hOCR output file, however I am not able to make sense of it. How do we extract information from the hOCR output file? Note that the receipt has numbers filled in boxes, like normal forms.
I used the code below for extraction. Should I use a different encoding?
import os

if os.path.isfile('output.hocr'):
    fp = open('output.hocr', 'r', encoding='UTF-8')
    text = fp.read()
    fp.close()
Note: The attached image is one example of the data. These images come from PDF files which I am converting into images programmatically.
I personally would use something like tesseract to do the OCR and then perhaps something like OpenCV with SURF for the tick boxes...
or even do edge detection with OpenCV and SURF for each section and OCR that specific area, to make it more robust by analyzing that specific area rather than the whole document.
You can simply provide the image as input, instead of processing it and creating an hOCR output file.
Try:
from PIL import Image
import pytesseract

im = Image.open("receipt.jpg")
text = pytesseract.image_to_string(im, lang='eng')
print(text)
This program takes the location of the image to be run through OCR, extracts the text from it, stores it in the variable text, and prints it out. If you want, you can also write the contents of text to a separate file.
P.S.: The image you are trying to process is far more complex than the images Tesseract is designed to deal with, so you may get incorrect results. I would definitely recommend optimizing the image before use: reducing the character set, pre-processing the image before passing it to OCR, upsampling it, keeping the dpi above 250, etc.
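If you later need the positions of the recognised words (for example, to pick out the boxed date or account number by location), pytesseract can return word-level bounding boxes directly, which is usually easier to work with than raw hOCR. A sketch:

import pytesseract
from PIL import Image

im = Image.open("receipt.jpg")

# image_to_data returns one entry per detected word, with pixel coordinates.
data = pytesseract.image_to_data(im, lang='eng', output_type=pytesseract.Output.DICT)
for word, x, y, w, h in zip(data['text'], data['left'], data['top'],
                            data['width'], data['height']):
    if word.strip():
        print(f"{word!r} at x={x}, y={y}, w={w}, h={h}")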

Text cannot be read using pyTesseract

I am trying to extract a logo from PDFs.
I am applying a GaussianBlur, finding the contours, and extracting only the image, but Tesseract cannot read the text from that image.
Removing the frame around the letters often helps Tesseract recognize text better. So, if you try your script with the following image, you'll have a better chance of reading the logo.
With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I could think of a few ways off the top of my head but I think the most generic solution is likely to be a pipeline where text detection algorithms and OCR are combined.
Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
You can also step up your tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here I'm sharing the links with you:
Getting started with Tesseract - Part I: Introduction
Getting started with Tesseract - Part II: Image Pre-processing
However, if you're also interested in this particular logo or font, you can try training Tesseract on this font by following the instructions given here.
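As a rough starting point for the pre-processing mentioned above, a typical pass with OpenCV might grayscale, binarise, and upscale the cropped logo before OCR (the file name is an assumption):

import cv2
import pytesseract

# Grayscale plus Otsu binarisation is a common first pre-processing step.
img = cv2.imread('logo.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Upscaling small text also tends to help Tesseract.
thresh = cv2.resize(thresh, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)

print(pytesseract.image_to_string(thresh))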
