How can I improve pytesseract detection for this image?

I'm trying to read the following images using pytesseract.
My program extracts these images from a video, and I want to use pytesseract to extract text from them. The program also has an option to draw boxes around the areas I want the OCR to look at; it then crops the image at those places and runs OCR on the crops.
Still, the output is not very accurate: sometimes random letters or text are included in the output, and sometimes the text is completely wrong. It's just not detecting the text. How can I improve the detection? Is there some preprocessing strategy I can use, or anything else?
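For context, here is a minimal sketch of the crop-and-OCR step described above, assuming OpenCV and pytesseract are installed; the file name and box coordinates are placeholders, not values from the actual program:

import cv2
import pytesseract

# Hypothetical frame extracted from the video
frame = cv2.imread("frame_0042.png")

# Hypothetical (x, y, w, h) boxes marking the regions the OCR should look at
boxes = [(100, 50, 200, 40), (320, 400, 150, 30)]

for x, y, w, h in boxes:
    roi = frame[y:y + h, x:x + w]
    # Grayscale + Otsu binarisation is a common minimal preprocessing step
    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    _, binarised = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # --psm 7 treats each crop as a single line of text
    print(pytesseract.image_to_string(binarised, config="--psm 7"))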

Related

OCR for Bank Receipts

I am working on an OCR problem for bank receipts, and I need to extract details like the date and account number. After processing the input, I am using Tesseract OCR (via pytesseract in Python). I have obtained the hOCR output file, but I am not able to make sense of it. How do we extract information from the hOCR output file? Note that the receipt has numbers filled into boxes, like normal forms.
I used the code below for extraction. Should I use a different encoding?
import os

# Read the hOCR output produced by Tesseract, if the file exists
if os.path.isfile('output.hocr'):
    with open('output.hocr', 'r', encoding='UTF-8') as fp:
        text = fp.read()
Note: the attached image is one example of the data. These images come from PDF files which I am converting into images programmatically.
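For reference, hOCR is just HTML with class attributes, so one way to pull out the recognized words and their coordinates is an ordinary HTML parser. A minimal sketch, assuming BeautifulSoup is installed and the file was produced by Tesseract:

from bs4 import BeautifulSoup

with open('output.hocr', encoding='UTF-8') as fp:
    soup = BeautifulSoup(fp.read(), 'html.parser')

# Each recognized word is a span with class 'ocrx_word';
# its bounding box is encoded in the title attribute
for word in soup.find_all('span', class_='ocrx_word'):
    bbox = word['title'].split(';')[0]  # e.g. 'bbox 100 200 150 230'
    print(word.get_text(strip=True), bbox)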
I personally would use Tesseract to do the OCR, and then perhaps OpenCV with SURF for the tick boxes...
Or even do edge detection with OpenCV and OCR each section separately. Analyzing a specific area rather than the whole document makes the result more robust.
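A rough sketch of that per-section idea, assuming OpenCV and pytesseract; the Canny thresholds and size filter below are guesses, not tuned values:

import cv2
import pytesseract

receipt = cv2.imread('receipt.jpg', cv2.IMREAD_GRAYSCALE)  # hypothetical scan

# Edge detection followed by contour extraction to find candidate sections
edges = cv2.Canny(receipt, 50, 150)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w > 40 and h > 15:  # skip tiny fragments
        section = receipt[y:y + h, x:x + w]
        print(pytesseract.image_to_string(section, config='--psm 7'))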
You can simply provide the image as input, instead of processing it and creating an hOCR output file.
Try:
from PIL import Image
import pytesseract

# Open the receipt image and run it through Tesseract's English model
im = Image.open("receipt.jpg")
text = pytesseract.image_to_string(im, lang='eng')
print(text)
This program takes the location of the image to be run through OCR, extracts the text from it, stores it in the variable text, and prints it out. If you want, you can also store the data from text in a separate file.
P.S.: the image you are trying to process is far more complex than the images Tesseract is designed to deal with, so you may get incorrect results. I would definitely recommend optimizing it before use: reduce the character set, preprocess the image before passing it to OCR, upsample the image, use a DPI over 250, and so on.
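A minimal sketch of those suggestions, assuming Pillow is installed; the scale factor, DPI value, and whitelist below are illustrative choices:

from PIL import Image
import pytesseract

im = Image.open('receipt.jpg').convert('L')  # grayscale

# Upsample 2x and tag the output with a DPI above 250, as suggested above
im = im.resize((im.width * 2, im.height * 2), Image.LANCZOS)
im.save('receipt_upsampled.png', dpi=(300, 300))

# Reduce the character set to digits and a few separators found on receipts
config = "-c tessedit_char_whitelist=0123456789/-."
print(pytesseract.image_to_string(Image.open('receipt_upsampled.png'), config=config))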

Text cannot be read using pyTesseract

I am trying to extract a logo from PDFs.
I am applying GaussianBlur, finding the contours, and extracting only the image, but Tesseract cannot read the text from that image.
Removing the frame around the letters often helps Tesseract recognize text better. So if you try your script with the following image, you'll have a better chance of reading the logo.
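One way to remove such a frame, sketched here under the assumption that the frame is the largest dark contour on a light background (file names and the stroke thickness are placeholders):

import cv2

img = cv2.imread('logo.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
# Assume the frame is the contour with the largest area and paint over it in white
frame = max(contours, key=cv2.contourArea)
cv2.drawContours(img, [frame], -1, (255, 255, 255), thickness=15)
cv2.imwrite('logo_no_frame.png', img)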
With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I can think of a few ways off the top of my head, but the most generic solution is likely to be a pipeline that combines a text detection algorithm with OCR.
Thus, you might want to check out this repository, which provides a text detection algorithm based on R-CNN.
You can also step up your Tesseract game by applying a few different image pre-processing techniques. I've recently written a fairly simple guide to Tesseract and some image pre-processing techniques; in case you'd like to check them out, here are the links:
Getting started with Tesseract - Part I: Introduction
Getting started with Tesseract - Part II: Image Pre-processing
However, if you're also interested in this particular logo, or font, you can also try training Tesseract on this font by following the instructions given here.

Making my own training set by distorting text (ANPR system)

I'm working on an ANPR (automatic number-plate recognition) system that converts registration plate images to text output. I previously tried (py)tesseract to do the OCR for me, but it wasn't giving sufficient results.
As my current training set I'm using this font, since all registration plate fonts are the same.
In my images, some of the resulting number plates will be at odd angles, so the plate isn't recognised correctly.
So I am asking: is there a way to distort each digit in many different ways, store those distortions in a numpy array in a file, and then apply machine learning techniques to them?
Something like this (though with different output):
https://archive.ics.uci.edu/ml/datasets/Letter+Recognition
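One way to generate such distortions is with OpenCV affine warps; this is a sketch, where the function name, parameter ranges, and file names are all illustrative:

import cv2
import numpy as np

def random_distortions(char_img, n=20, max_angle=15, max_shear=0.2):
    # Generate n randomly rotated and sheared copies of one character image
    h, w = char_img.shape
    samples = np.empty((n, h, w), dtype=char_img.dtype)
    for i in range(n):
        angle = np.random.uniform(-max_angle, max_angle)
        shear = np.random.uniform(-max_shear, max_shear)
        # Rotation about the centre, then a horizontal shear added to the matrix
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        M[0, 1] += shear
        samples[i] = cv2.warpAffine(char_img, M, (w, h), borderValue=255)
    return samples

# Distort one template digit and store the whole set in an .npy file
digit = cv2.imread('digit_7.png', cv2.IMREAD_GRAYSCALE)
np.save('digit_7_distorted.npy', random_distortions(digit))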
Thanks, I used a previous post to help me separate the characters and so on:
Recognize the characters of license plate
Thanks, any help would be appreciated.

Tesseract OCR not recognizing any character

I am working on a project that requires character recognition as part of it. I am using a handwriting dataset from IAM, so all the images are taken in more or less the same conditions. I am using pictures of words provided by the dataset and following these steps (a sketch of the final call follows the list):
Binarizing and thresholding
Dividing the word into its constituent characters
Resizing the extracted characters
Letting Tesseract figure out which English letter each one is
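A minimal sketch of that last call, assuming each segmented character has been saved as its own image (the file name is a placeholder):

import cv2
import pytesseract

char_img = cv2.imread('char_0.png', cv2.IMREAD_GRAYSCALE)

# --psm 10 tells Tesseract to treat the image as a single character;
# the whitelist cuts down on stray punctuation in the output
config = "--psm 10 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
print(pytesseract.image_to_string(char_img, config=config))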
What I'm trying to achieve is to store the characters of a person's document in folders categorized by letter, and maybe form a template from them later on. For this I need to know which character each one is.
Here's what I get as a result -
All the characters are properly segmented (in most cases). This is more of a Tesseract question than a Python question, but I'm using Python to write the script and calling Tesseract through the pytesseract wrapper.
I'm using OpenCV to manipulate the images. Images of these letter matrices are sent as input to Tesseract (handled by pytesseract). The input is not the issue, I assure you. Is there anything else I need to do for Tesseract to work?
None of the characters are recognized.
Tesseract doesn't handle handwritten text well. You should try either ABBYY OCR for that, or alternative free libraries like Lipi Toolkit.

How to extract the text part only from an image using opencv and python?

Here is the image of a water meter reading after pre-processing...
But whenever I use Tesseract to recognize the digits, it doesn't give appropriate output.
So I want to extract/segment only the digits as a region of interest and save them to a new image file, so that Tesseract can recognize them properly...
I am able to remove the extra noise in the image, which is why I am using this approach.
Is there any way to do that?
The unprocessed image is:
Before you try extracting the digits from this image, reduce the image size so that the digits are about 16 pixels in height. Secondly, restrict Tesseract's character whitelist to "0123456789" so that other characters like ",.;'/" aren't picked up (that is quite common in this type of picture). Lowering the image size should help Tesseract discard the noise rather than scan it in or mix it with the digits. This method certainly won't work 100% of the time on this kind of image, but clearing this kind of noise any other way would without a doubt be a challenge. Maybe you could provide us with the unprocessed image if you have one; let's see what is possible then.
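Put together, that advice might look like the sketch below; the input file name and scale factor are assumptions that need adjusting for your image:

import cv2
import pytesseract

img = cv2.imread('meter.png', cv2.IMREAD_GRAYSCALE)

# Scale the image so the digits end up roughly 16 px tall
scale = 16.0 / 60.0  # assumes the digits are about 60 px tall to begin with
small = cv2.resize(img, None, fx=scale, fy=scale, interpolation=cv2.INTER_AREA)

# Whitelist digits only and treat the input as a single text line
config = "--psm 7 -c tessedit_char_whitelist=0123456789"
print(pytesseract.image_to_string(small, config=config))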
