Python: Basic OCR / Image to Text library

I have a very simple use case for OCR: take a .PNG image, extract the text from it, then print it to the console.
I'm after a lightweight library. I am trying to avoid any system-level dependencies. Pytesseract is great, but deployment is a bit annoying for such a simple use case.
I have tried quite a few libraries, but they seem to be designed for more complex use cases.
Note: white text is not necessarily suitable for OCR; I will change the image format to suit the OCR library.

Related

Best image format for face detection and face recognition with the DeepFace library

I'm using the DeepFace library for face recognition and detection.
I was wondering if there is a format (png, jpg, etc.) that is better than others for getting good results.
Is there a preferred image format for face recognition and face detection generally, and specifically in this library?
DeepFace is wrapped around several face recognition frameworks, so the answer to your question is case-dependent. However, none of the basic FR frameworks work with the original input images; they first convert them to grayscale, downsize them, turn them into numpy arrays, and so on, usually using OpenCV and PIL. So, my opinion is that the image file format does not matter; image file size and colour depth do matter.
This answer is based on an answer from 'Bhargav'.
In Python, images are processed as bitmaps using the color depth of the graphics system. Converting a PNG image to a bitmap is really fast (about 20x) compared to JPEGs.
In my case, I had the image and needed to save it before proceeding, so I saved it as PNG so I wouldn't lose quality (JPEG is lossy).
DeepFace currently accepts only two image input formats: PNG and JPEG. There is no way to use other image formats directly with its functions. If you want to use another input format, you need to convert it to PNG or JPEG first, which may cost you extra execution time.
If you want to improve face recognition and face detection with the DeepFace library, use some preprocessing filters.
Some of the filters you can try for better results (ultimate guide; see the sketch after this list):
Grayscale conversion
Face straightening
Face cropping (DeepFace automatically does this while processing, so there is no need)
Image resizing
Normalization
Image enhancement with PIL, like sharpening
Image equalization
Some basic filtering is already done by DeepFace. If your results are not accurate, which means the filtering done by DeepFace is not sufficient, you have to try each filter, something like a trial-and-error method, until you get good results.
Sharpening and grayscaling are the first methods to try.
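A minimal sketch of those first steps with PIL (the file name and the 224x224 target size are placeholders; adjust both to your model):
from PIL import Image, ImageFilter, ImageOps

img = Image.open("face.png")               # hypothetical input file
gray = ImageOps.grayscale(img)             # grayscale conversion
sharp = gray.filter(ImageFilter.SHARPEN)   # sharpening
equalized = ImageOps.equalize(sharp)       # histogram equalization
resized = equalized.resize((224, 224))     # resize; pick your model's input size
resized.save("face_preprocessed.png")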

Tesserocr vs Pytesseract Speed comparison

From what I've been able to gather online, when extracting text from multiple images in Python, the tesserocr library should be faster than pytesseract, as it doesn't have to initialize the tesseract framework each time; it just makes the prediction. However, I implemented the two functions below:
import tesserocr
import pytesseract
from PIL import Image

api = tesserocr.PyTessBaseAPI()

# tesserocr function
def tesserocr_extract(p):
    api.SetImageFile(p)
    text = api.GetUTF8Text()
    return text

# pytesseract function
def pytesseract_extract(p):
    pytesseract.pytesseract.tesseract_cmd = path_to_tesseract  # path defined elsewhere
    img = Image.open(p)
    # extract text from the image
    text = pytesseract.image_to_string(img)
    return text
When I use both functions to extract text from 20 images, the tesserocr library is always slower the first time around. When I extract the text from the same set of images again, the tesserocr library is faster, maybe due to some image caching. I have also tried using tessdata_fast and observed the same result. I also tried using api.SetImage(...) after loading the image with PIL, and it was still slower.
The images are mostly screenshots of websites that vary in size.
Am I doing something incorrectly, or is tesserocr simply slower than pytesseract for extracting text from multiple images?
Do not measure something you do not understand ("...maybe due to some image caching..." suggests you do not really understand the code you posted above). Even if you got correct results (which you did not), you would not be able to interpret them.
If you analyse the differences between pytesseract and tesserocr, you will see that it is not possible for pytesseract to be faster than tesserocr: it has to perform several extra steps to reach the same state as tesserocr. In any case, on modern hardware the difference in speed is very small.
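For what it's worth, a minimal timing sketch (assuming the two extraction functions from the question) that repeats the measurement several times, so that one-off warm-up costs don't dominate the comparison:
import time

# Hypothetical benchmark helper: times one extraction function over a
# list of image paths, repeated several times; the best run reduces
# noise from warm-up and OS file caching.
def benchmark(extract_fn, paths, runs=3):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        for p in paths:
            extract_fn(p)
        timings.append(time.perf_counter() - start)
    return min(timings)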

OCR for Bank Receipts

I am working on an OCR problem for bank receipts and I need to extract details like the date and account number. After processing the input, I am using Tesseract-OCR (via pytesseract in Python). I have obtained the hOCR output file, however I am not able to make sense of it. How do we extract information from the hOCR output file? Note that the receipt has numbers filled in boxes, like normal forms.
I used the code below for extraction. Should I use a different encoding?
import os

if os.path.isfile('output.hocr'):
    fp = open('output.hocr', 'r', encoding='UTF-8')
    text = fp.read()
    fp.close()
Note: the attached image is one example of the data. These images come from PDF files which I am converting into images programmatically.
I personally would use something like tesseract to do the OCR and then perhaps something like OpenCV with SURF for the tick boxes, or even do edge detection with OpenCV and SURF for each section and OCR that specific area. Analyzing a specific area rather than the whole document makes it more robust.
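If you do want to stick with the hOCR file, note that it is just XHTML, so any HTML parser can pull out the recognized words. A minimal sketch with BeautifulSoup (the ocrx_word class and the bbox entry in the title attribute are part of the hOCR spec):
from bs4 import BeautifulSoup

# Each recognized word in hOCR is a <span class="ocrx_word"> whose
# title attribute carries the bounding box and confidence.
with open('output.hocr', 'r', encoding='UTF-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

for word in soup.find_all('span', class_='ocrx_word'):
    bbox = word['title'].split(';')[0]  # e.g. "bbox 100 200 150 220"
    print(word.get_text(strip=True), bbox)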
You can simply provide the image as input, instead of processing it and creating an hOCR output file.
Try:
from PIL import Image
import pytesseract

im = Image.open("receipt.jpg")
text = pytesseract.image_to_string(im, lang='eng')
print(text)
This program takes the location of the image to be run through OCR, extracts the text from it, stores it in the variable text, and prints it out. If you want, you can store the contents of text in a separate file too.
P.S.: the image you are trying to process is far more complex than the images tesseract is designed to deal with, so you may get incorrect results. I would definitely recommend optimizing before use: reducing the character set, pre-processing the image before passing it to OCR, upsampling the image, having a DPI over 250, etc.
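A rough sketch of those suggestions (the file name is a placeholder; tessedit_char_whitelist is a standard tesseract config variable for restricting the character set):
from PIL import Image
import pytesseract

# Grayscale and upsample, then restrict recognition to the characters
# expected on a receipt (digits plus a few separators).
im = Image.open("receipt.jpg").convert("L")
im = im.resize((im.width * 2, im.height * 2))
text = pytesseract.image_to_string(
    im, config="-c tessedit_char_whitelist=0123456789/.-"
)
print(text)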

Text cannot be read using pyTesseract

I am trying to extract logos from PDFs.
I am applying GaussianBlur, finding the contours, and extracting only the image, but tesseract cannot read the text from that image.
Removing the frame around the letters often helps tesseract recognize text better. So, if you try your script with the following image, you'll have a better chance of reading the logo.
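One rough way to approximate that frame removal (a sketch, not the exact method used here; it assumes the frame is the largest external contour and crops just inside it):
import cv2
import pytesseract

img = cv2.imread('logo.png')  # hypothetical input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Take the largest external contour as the frame and crop inside it.
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
margin = 10  # step past the frame border
inner = img[y + margin:y + h - margin, x + margin:x + w - margin]
print(pytesseract.image_to_string(inner))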
With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I could think of a few ways off the top of my head but I think the most generic solution is likely to be a pipeline where text detection algorithms and OCR are combined.
Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
You can also step up your tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here I'm sharing the links with you:
Getting started with Tesseract - Part I: Introduction
Getting started with Tesseract - Part II: Image Pre-processing
However, if you're also interested in this particular logo, or font, you can also try training tesseract on this font by following the instructions given here.

Image cleaning before OCR application

I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. A couple of things I noticed about the accuracy of PyTesser:
1) File with icons, images and text: 5-10% accurate
2) File with only text (images and icons erased): 50-60% accurate
3) File with stretching (and this is the best part): stretching the file in 2) above on the x or y axis increased the accuracy by 10-20%
So apparently PyTesser does not take care of font dimension or image stretching. Although there is much theory to be read about image processing and OCR, are there any standard procedures of image cleanup (apart from erasing icons and images) that need to be done before applying PyTesser or other libraries, irrespective of the language?
Wow, this post is quite old now. I restarted my research on OCR these last couple of days. This time I ditched PyTesser and used the Tesseract engine with ImageMagick instead. Coming straight to the point, this is what I found:
1) You can increase the resolution with ImageMagick (there are a bunch of simple shell commands you can use, sketched below).
2) After increasing the resolution, the accuracy went up by 80-90%.
So the Tesseract engine is without doubt the best open source OCR engine on the market. No prior image cleaning was required here. The caveat is that it does not work well on files with a lot of embedded images, and I couldn't figure out a way to train Tesseract to ignore them. Also, the text layout and formatting in the image make a big difference. It works great with images that contain just text. Hope this helps.
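As a sketch of that resolution step (the exact ImageMagick invocation is my assumption, not the one used above; it calls the convert binary, which must be on PATH):
import subprocess

# Upsample the scan and stamp a higher DPI before handing it to OCR.
subprocess.run([
    'convert', 'page.png',
    '-resize', '200%',                             # upsample the raster
    '-units', 'PixelsPerInch', '-density', '300',  # set target DPI metadata
    'page_hires.png',
], check=True)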
As it turns out, the tesseract wiki has an article that answers this question in the best way I can imagine:
the illustrated guide to "Improving the quality of the [OCR] output".
The question "image processing to improve tesseract OCR accuracy" may also be of interest.
(initial answer, just for the record)
I haven't used PyTesser, but I have done some experiments with tesseract (version: 3.02.02).
If you invoke tesseract on a colored image, it first applies global Otsu's method to binarize it, and then the actual character recognition is run on the binary (black and white) image.
Image from: http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html
As can be seen, 'global Otsu' may not always produce a desirable result.
To better understand what tesseract 'sees', apply Otsu's method to your image yourself and then look at the resulting image.
In conclusion: the most straightforward method to improve the recognition ratio is to binarize images yourself (most likely you will have to find a good threshold by trial and error) and then pass those binarized images to tesseract.
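A minimal sketch of that do-it-yourself binarization with OpenCV (file names are placeholders; passing 0 as the threshold lets THRESH_OTSU pick it automatically):
import cv2

gray = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE)
# Otsu's method chooses the global threshold itself; inspect the
# output to see roughly what tesseract would work with internally.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite('binarized.png', binary)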
Somebody was kind enough to publish API docs for tesseract, so it is possible to verify the previous statements about the processing pipeline: ProcessPage -> GetThresholdedImage -> ThresholdToPix -> OtsuThresholdRectToPix.
Not sure if your intent is commercial use or not, but this works wonders if you're performing OCR on a bunch of similar images.
http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
[Before/after images: the original, and the result after pre-processing with the given arguments.]
I know it's not a perfect answer, but I'd like to share a video I saw from PyCon 2013 that might be applicable. It's a little light on implementation details, but it might give you some guidance or inspiration on how to solve or improve your problem.
Link to Video
Link to Presentation
And if you do decide to use ImageMagick to pre-process your source images a little, here is a question that points you to nice Python bindings for it.
On a side note: quite an important thing with Tesseract is that you need to train it, otherwise it won't be nearly as good or accurate as it's capable of being.
