I am trying to detect Bangla characters in images of Bangla number plates using Python, so I decided to use pytesseract. For this purpose I have used the code below:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
text = pytesseract.image_to_string(Image.open('input.png'),lang="ben")
print(text)
The problem is that when I print the result, the output comes out empty.
When I tried dumping it into a text file, it showed something like this:
Example Picture: (Link)
Expected output (it should be something like, or at least close to, this):
ঢাকা মেট্রো হ
৪৫ ২৩০৭
P.S.: I downloaded the Bengali language data while installing Tesseract-OCR (64-bit), and I am running this in VS Code.
Can anyone help me solve this problem, or give me an idea of how to approach it?
The solution to this problem is:
You need to segment the individual characters (any approach will do, deep learning or classical image processing) and feed pytesseract one character at a time, using only clear crops.
Reason: Tesseract can detect Bangla in clear pictures of reasonably good resolution, but it appears to be much weaker on this language when the characters are small and low-resolution (which is quite understandable).
Code:
### any deep learning approach or any image processing approach here
# load the segmented character
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
character = pytesseract.image_to_string(Image.open('char.png'),lang="ben")
print(character)
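To make the image-processing route concrete, here is a minimal sketch (not the poster's actual pipeline) that binarizes the plate, treats each sufficiently large contour as a candidate character, and feeds every crop to Tesseract's Bengali model with --psm 10 (single character). The input filename, the size filter, and the upscaling factor are assumptions you would tune for your plates:
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'

# Binarize the plate so the characters become white blobs
img = cv2.imread('input.png', cv2.IMREAD_GRAYSCALE)
_, thresh = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Treat each sufficiently large external contour as a candidate character,
# ordered left to right
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])

for x, y, w, h in boxes:
    if w < 10 or h < 10:  # skip specks of noise; tune for your image size
        continue
    crop = img[y:y + h, x:x + w]
    crop = cv2.resize(crop, None, fx=3, fy=3, interpolation=cv2.INTER_CUBIC)
    # --psm 10 tells Tesseract to expect a single character
    print(pytesseract.image_to_string(crop, lang='ben', config='--psm 10').strip())
Keep in mind that Bangla letters within a word are joined by the headline (matra), so raw contours may give you whole words rather than single characters; that is one reason a learned segmentation step, as the answer suggests, will usually do better.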
Related
From what I've been able to gather online, when extracting text from multiple images in Python, the tesserocr library should be faster than pytesseract because it doesn't have to initialize the Tesseract framework for each image; it just makes the prediction. However, I implemented the two functions below:
import tesserocr
import pytesseract
from PIL import Image

api = tesserocr.PyTessBaseAPI()

# tesserocr function
def tesserocr_extract(p):
    api.SetImageFile(p)
    text = api.GetUTF8Text()
    return text

# pytesseract function
def pytesseract_extract(p):
    pytesseract.pytesseract.tesseract_cmd = path_to_tesseract
    img = Image.open(p)
    # Extract text from image
    text = pytesseract.image_to_string(img)
    return text
When I use both functions to extract text from 20 images, the tesserocr library is always slower the first time around. When I extract text from the same set of images again, tesserocr is faster, maybe due to some image caching. I have also tried using tessdata_fast and observed the same result. I also tried using api.SetImage(...) after loading the image with PIL, and it was still slower.
The images are mostly screenshots of websites that vary in size.
Am I doing something incorrectly, or is tesserocr simply slower than pytesseract for extracting text from multiple images?
Do not measure something you do not understand (... maybe due to some image caching ... suggests you do not really understand the code you posted above). Even if you get correct results (which you did not), you will not be able to interpret them.
If you were to analyse the differences between pytesseract and tesserocr, you would see that it is not possible for pytesseract to be faster than tesserocr (it has to perform several extra steps to reach the same state as tesserocr). In any case, on modern hardware the difference in speed is very small.
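If you do want to measure this properly, a minimal benchmark sketch along the lines below might help: it builds the tesserocr API once, runs a warm-up pass so one-off costs are not charged to whichever library goes first, and then times each library over the whole batch. The image_paths list is a placeholder for your 20 screenshots, and Tesseract is assumed to be on the PATH (otherwise set pytesseract.pytesseract.tesseract_cmd as in the question):
import time
import tesserocr
import pytesseract
from PIL import Image

api = tesserocr.PyTessBaseAPI()  # created once and reused for every image

def tesserocr_extract(p):
    api.SetImageFile(p)
    return api.GetUTF8Text()

def pytesseract_extract(p):
    return pytesseract.image_to_string(Image.open(p))

def time_batch(extract, paths):
    # One full pass over the whole batch, timed with a monotonic clock
    start = time.perf_counter()
    for p in paths:
        extract(p)
    return time.perf_counter() - start

image_paths = ['page1.png', 'page2.png']  # placeholder; use your 20 screenshots

# Warm-up pass so one-off costs (model loading, the OS file cache) are not
# charged to whichever library happens to run first
tesserocr_extract(image_paths[0])
pytesseract_extract(image_paths[0])

print('tesserocr  :', time_batch(tesserocr_extract, image_paths))
print('pytesseract:', time_batch(pytesseract_extract, image_paths))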
I started learning Python a while ago and have just started learning Tesseract to create a tool for my own use. I have a script that takes four screenshots of specific parts of the screen and then uses Tesseract to pull data from those images. It is mostly accurate and gets the words almost 100% of the time, but there are still some garbage letters and symbols in the results that I don't want.
Rather than trying to process the image (if that really is the easiest way, I could do it, but I still feel like it would let through more data that I don't want), I would like to keep only the words in the result that are in a dictionary I can provide.
import cv2
import pytesseract
import pyscreenshot as ImageGrab
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Grab the four regions of the screen and save them to disk
im = ImageGrab.grab(bbox=(580, 430, 780, 500))
im.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab.png')
im2 = ImageGrab.grab(bbox=(770, 430, 960, 500))
im2.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab2.png')
im3 = ImageGrab.grab(bbox=(950, 430, 1150, 500))
im3.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab3.png')
im4 = ImageGrab.grab(bbox=(1140, 430, 1320, 500))
im4.save(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab4.png')

# Reopen the saved screenshots and run OCR on each one
image = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab.png')
image2 = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab2.png')
image3 = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab3.png')
image4 = Image.open(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab4.png')

print(pytesseract.image_to_string(image))
print(pytesseract.image_to_string(image2))
print(pytesseract.image_to_string(image3))
print(pytesseract.image_to_string(image4))
This is what I've written above. It might not be as clean as it could be, but it does what I want for now. When I run the program against the test screenshot I've taken, I get the following:
Ballistica Prime Receiver
a
“Ze ij
Titania Prime Blueprint
—
‘|! stradavar Prime
Blueprint
My.
Bronco Prime Barrel
uby-
Here is my screenshot:
It picks up the words perfectly fine, but data like "uby-" and "‘|! " isn't needed, which is why I would like to clean it out by keeping only words that are in the dictionary. If there's an easier way to do this I'd love to know; I've only been using Tesseract for a day or so and don't know of another way to do it other than the aforementioned image processing.
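Filtering against a word list is straightforward once you have the raw OCR output: split it into tokens and keep only the ones found in the dictionary. A minimal sketch, assuming your allowed words live in a plain-text file with one word per line (the wordlist.txt name is made up):
import re
import pytesseract
from PIL import Image

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Load the set of acceptable words (one per line), compared case-insensitively
with open('wordlist.txt', encoding='utf-8') as f:
    dictionary = {line.strip().lower() for line in f if line.strip()}

def keep_dictionary_words(image_path):
    raw = pytesseract.image_to_string(Image.open(image_path))
    # Pull out alphabetic tokens and drop anything not in the dictionary
    words = re.findall(r"[A-Za-z']+", raw)
    return ' '.join(w for w in words if w.lower() in dictionary)

print(keep_dictionary_words(r'C:\Users\Charlie\Desktop\tesseract_images\imagegrab.png'))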
I am working on an OCR problem for bank receipts, and I need to extract details like the date and account number. After processing the input, I am using Tesseract-OCR (via pytesseract in Python). I have obtained the hOCR output file, but I am not able to make sense of it. How do I extract information from the hOCR output file? Note that the receipt has numbers filled in boxes, like normal forms.
I used the code below to read it. Should I use a different encoding?
import os

if os.path.isfile('output.hocr'):
    fp = open('output.hocr', 'r', encoding='UTF-8')
    text = fp.read()
    fp.close()
Note: The attached image is one example of the data. These images come from PDF files, which I am converting into images programmatically.
I personally would use something more like Tesseract to do the OCR and then perhaps something like OpenCV with SURF for the tick boxes...
Or even do edge detection with OpenCV and SURF for each section and OCR that specific area; analyzing a specific region rather than the whole document makes it more robust.
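As a rough illustration of that last point, here is a minimal sketch that OCRs just one region of the receipt instead of the whole page. The filename and coordinates are made up and would come from your own box/edge-detection step, and the whitelist restricts Tesseract to digits since the boxes hold numbers:
import cv2
import pytesseract

# Hypothetical coordinates for, say, the account-number boxes; in practice
# they would come from your contour/edge-detection step
x, y, w, h = 120, 340, 420, 60

img = cv2.imread('receipt.png', cv2.IMREAD_GRAYSCALE)
roi = img[y:y + h, x:x + w]

# Binarize the patch, then treat it as a single line of digits
# (--psm 7 = one text line; the whitelist works best on recent Tesseract versions)
_, roi = cv2.threshold(roi, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
config = '--psm 7 -c tessedit_char_whitelist=0123456789'
print(pytesseract.image_to_string(roi, config=config))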
You can simply provide the image as input, instead of processing it and creating an hOCR output file.
Try:
from PIL import Image
import pytesseract
im = Image.open("reciept.jpg")
text = pytesseract.image_to_string(im, lang = 'eng')
print(text)
This program takes the location of the image to be run through OCR, extracts the text from it, stores it in the variable text, and prints it out. If you want, you can also write the contents of text to a separate file.
P.S.: The image you are trying to process is far more complex than the images Tesseract is designed to deal with, so you may get incorrect results. I would definitely recommend optimizing it before use: reducing the character set, pre-processing the image before passing it to OCR, upsampling the image, keeping the DPI above 250, and so on.
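As a rough example of that kind of optimization, the sketch below converts the receipt to grayscale, upsamples it, applies a simple global threshold, and tells Tesseract the approximate DPI; the scale factor and the threshold value of 180 are assumptions you would tune per scan:
import pytesseract
from PIL import Image

# Grayscale, then upsample so the small print is bigger for Tesseract
im = Image.open("reciept.jpg").convert('L')
im = im.resize((im.width * 2, im.height * 2), Image.LANCZOS)

# Simple global threshold to pure black and white; tune 180 for your scans
im = im.point(lambda p: 255 if p > 180 else 0)

# Tell Tesseract the (approximate) resolution of the upscaled image
text = pytesseract.image_to_string(im, lang='eng', config='--dpi 300')
print(text)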
I am not planning to spam, and besides, Google has made this kind of captcha obsolete with reCAPTCHA. I am doing this as a project to learn more about OCR and eventually maybe neural networks.
So I have an image from a captcha. I have been able to make modest progress, but Tesseract isn't exactly well documented. Here is the code I have so far, and the results are below it.
from selenium import webdriver
from selenium.webdriver.common import keys
import time
import random
import pytesseract
from pytesseract import image_to_string
from PIL import Image, ImageEnhance, ImageFilter

def ParsePic():
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract.exe'
    im = Image.open("path\\screenshot.png")
    im = im.filter(ImageFilter.CONTOUR)
    im = im.filter(ImageFilter.DETAIL)
    enhancer = ImageEnhance.Contrast(im)
    im = enhancer.enhance(4)
    im = im.convert('L')
    im.save('temp10.png')
    text = image_to_string(Image.open('temp10.png'))
    print(text)
Original Image
Output
I understand that captchas were made specifically to defeat OCR, but I've read that this is no longer the case, and I'm interested in learning how it's done.
My question is, how do I make the background the same color, so the text becomes easily readable?
Late answer but anyway...
You are doing edge detection, but there are obviously too many edges in this image, so that will not work.
You will have to do something with the colors.
I don't know if this is true for every one of your captchas, but here you can just use contrast.
You can test this by opening your original in Paint (or any other image editor) and saving the image as "monochrome" (black and white only, NOT grayscale).
result:
without any other editing! (Even the question mark is gone.)
This would be ready to OCR right away.
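If you would rather do that monochrome step in code instead of Paint, a minimal sketch with Pillow could look like this (the threshold of 128 is an assumption to tune per captcha, and Tesseract is assumed to be configured as in your question):
from PIL import Image
from pytesseract import image_to_string

# Grayscale first, then a hard black/white split
# (the code equivalent of saving as "monochrome" in Paint)
im = Image.open("path\\screenshot.png").convert('L')
bw = im.point(lambda p: 255 if p > 128 else 0)
bw.save('monochrome.png')

print(image_to_string(bw))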
Maybe your other images are not this easy, but color/contrast is the way to go for you. If you need ideas on how to use color, contrast and other tricks to solve captchas, you can take a look at harder examples and how I solved them here: https://github.com/cracker0dks/CaptchaSolver
cheers
I am trying to extract logos from PDFs.
I am applying a Gaussian blur, finding the contours, and extracting only the image, but Tesseract cannot read the text from that image.
Removing the frame around the letters often helps Tesseract recognize text better. So, if you try your script with the following image, you'll have a better chance of reading the logo.
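One possible way to strip such a frame automatically (a sketch, not the only way) is to binarize the logo and erase any connected blob whose bounding box spans most of the image, on the assumption that the frame is one large dark component; the filename and the 90% ratio are placeholders:
import cv2
import pytesseract

# Hypothetical input file; binarize so the dark frame and letters become white blobs
img = cv2.imread('logo.png', cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Label connected components and blank out any blob whose bounding box spans
# most of the image; for a framed logo that blob is usually the frame itself
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
h, w = binary.shape
for i in range(1, n):
    bw, bh = stats[i, cv2.CC_STAT_WIDTH], stats[i, cv2.CC_STAT_HEIGHT]
    if bw > 0.9 * w and bh > 0.9 * h:
        binary[labels == i] = 0

# Back to dark text on a white background, which Tesseract handles best
cleaned = cv2.bitwise_not(binary)
print(pytesseract.image_to_string(cleaned))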
That said, a heuristic like this will not cover every logo. I can think of a few ways to handle the general case, but the most generic solution is likely a pipeline that combines a text detection algorithm with OCR.
Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
You can also step up your tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here I'm sharing the links with you:
Getting started with Tesseract - Part I: Introduction
Getting started with Tesseract - Part II: Image Pre-processing
However, if you're also interested in this particular logo or font, you can also try training Tesseract on that font by following the instructions given here.