I am trying to detect some numbers with Tesseract in Python. Below you will find my starting image and what I can get it down to. Here is the code I used to get there.
import pytesseract
import cv2
import numpy as np
pytesseract.pytesseract.tesseract_cmd = "C:\\Users\\choll\\AppData\\Local\\Programs\\Tesseract-OCR\\tesseract.exe"
image = cv2.imread(r'64normalwart.png')
lower = np.array([254, 254, 254])
upper = np.array([255, 255, 255])
image = cv2.inRange(image, lower, upper)
image = cv2.bitwise_not(image)
#Uses a language that should work with minecraft text, I have tried with and without, no luck
text = pytesseract.image_to_string(image, lang='mc')
print(text)
cv2.imwrite("Wartthreshnew.jpg", image)
cv2.imshow("Image", image)
cv2.waitKey(0)
I end up with black numbers on a white background, which seems pretty good, but Tesseract still cannot detect the numbers. I also noticed the numbers are pretty jagged, but I don't know how to fix that. Does anyone have recommendations for how I could get Tesseract to recognize these numbers?
Starting Image
What I end up with
Your problem is with the page segmentation mode. Tesseract segments every image in a different way. When you don't choose an appropriate PSM, it goes for mode 3, which is automatic and might not be suitable for your case. I've just tried your image and it works perfectly with PSM 6.
text = pytesseract.image_to_string(np.array(image), lang='eng', config='--psm 6')
These are all the PSMs available at the moment:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
Use pytesseract.image_to_string(img, config='--psm 8') or try different configs to see if the image gets recognized. Useful link: Pytesseract OCR multiple config options
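For example, a quick way to compare a few modes is a loop like the one below (a minimal sketch; it assumes the preprocessed black-on-white image from the question has been saved as Wartthreshnew.jpg and that pytesseract can already find your Tesseract install):
import cv2
import pytesseract

# Load the preprocessed black-on-white image from the question.
image = cv2.imread("Wartthreshnew.jpg")

# Try a few segmentation modes that make sense for short strings of digits.
for psm in (6, 7, 8, 13):
    text = pytesseract.image_to_string(image, config=f"--psm {psm}")
    print(psm, repr(text.strip()))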
I have very high resolution engineering drawings/circuit diagrams which contain text in a number of different regions. The aim is to extract the text from such images.
I am using pytesseract for this task. Applying pytesseract directly is not possible, as in that case text from different regions gets jumbled up in the output. So I am identifying the different bounding boxes containing the text and then iteratively passing these regions to pytesseract. The bounding box logic is working fine; however, sometimes I get no text from the cropped image, or only partial text. I would understand if the cropped images were low-res or blurred, but this is not the case. Please have a look at the attached couple of examples.
Image 1
Image 2
Here is my code to get the text:
import cv2
import pytesseract

source_img_simple = cv2.imread('image_name.tif')
source_img_simple_gray = cv2.cvtColor(source_img_simple, cv2.COLOR_BGR2GRAY)
img_text = pytesseract.image_to_string(source_img_simple_gray)
# Export the text file
with open('Output_OCR.txt', 'w') as text:
    text.write(img_text)
Actual result for the first image: no output (blank text file)
For the 2nd image: partial text (ALL MISCELLANEOUS PIPING AND CONNECTION SIZES)
I'd like to know how to improve the quality of the OCR. I'm open to using other tools as well (apart from pytesseract) if required, but I can't use APIs (Google, AWS, etc.) as that is a restriction. Note: I have gone through the post below and it is not a duplicate of my case, as I have black text on a white background:
Pytesseract dont reconize a very clear image
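For reference, the crop-and-OCR loop described above might look roughly like this (a sketch only; the box coordinates are hypothetical placeholders for whatever the bounding-box logic produces):
import cv2
import pytesseract

source_img = cv2.imread('image_name.tif')
gray = cv2.cvtColor(source_img, cv2.COLOR_BGR2GRAY)

# Hypothetical (x, y, w, h) regions -- in practice these come from the bounding-box detection step.
boxes = [(100, 200, 400, 60), (100, 300, 400, 60)]

with open('Output_OCR.txt', 'w') as text_file:
    for (x, y, w, h) in boxes:
        crop = gray[y:y + h, x:x + w]
        # A small white border around tight crops often helps Tesseract.
        crop = cv2.copyMakeBorder(crop, 10, 10, 10, 10, cv2.BORDER_CONSTANT, value=255)
        text_file.write(pytesseract.image_to_string(crop) + '\n')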
Since your images already look fairly clean, only minimal preprocessing is needed. A simple approach is to threshold and Gaussian blur to smooth the image before throwing it into Pytesseract. Here are the results after this simple processing and the output from Pytesseract.
SYSTEM CODE IS 3CAB, EXCEPT AS INDICATED.
For the 2nd image
ALL MISCELLANEOUS PIPING AND CONNECTION SIZES
SHALL BE 1 INCH. EXCEPT AS INDICATED.
We use the --psm 6 config flag since we want to treat the image as a single uniform block of text. Some additional configuration flags that could be useful are shown after the code.
Code
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load as grayscale, threshold, and smooth before OCR
image = cv2.imread('2.jpg', 0)
thresh = cv2.threshold(image, 150, 255, cv2.THRESH_BINARY_INV)[1]
result = cv2.GaussianBlur(thresh, (5, 5), 0)
result = 255 - result  # invert back to black text on a white background

data = pytesseract.image_to_string(result, lang='eng', config='--psm 6')
print(data)

cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()
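For example, flags that are sometimes useful look like this (a sketch only; the whitelist here is just an illustration, not something these drawings require):
# --oem selects the OCR engine, --psm the page segmentation mode, and the
# whitelist restricts which characters Tesseract may output.
custom_config = '--oem 3 --psm 6 -c tessedit_char_whitelist=ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.'
data = pytesseract.image_to_string(result, lang='eng', config=custom_config)
print(data)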
I want to read the text from an image.
I use pytesseract in Python.
Here is my code:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = Image.open(r'a.jpg')
image.resize((150, 50),Image.ANTIALIAS).save("pic.jpg")
image = Image.open("pic.jpg")
captcha = pytesseract.image_to_string(image).replace(" ", "").replace("-", "").replace("$", "")
image
However, it returns an empty string.
What should be the correct way?
Thanks.
I agree with @Jon Betts:
Tesseract is not very strong here; it is only good in clean, binary cases with the right settings.
CAPTCHAs are meant to fool OCR engines!
But if you really need to do it, you have to come up with a manual procedure for it.
I created the code below specifically for the type of CAPTCHA that you gave (but it's completely rigid and not generalized/optimized for all cases).
Pseudocode:
apply median blur
apply a threshold to keep only the blue colors (binary image output from this stage)
apply opening to remove small white noise pixels from the binary image
give the image to Tesseract with these options:
a limited whitelist of output characters
OEM 3: default engine mode
PSM 8: one word per image
Code
from PIL import Image
import pytesseract
import numpy as np
import cv2
img = cv2.imread('a.jpg')
img = cv2.medianBlur(img, 3)

# extract blue parts (OpenCV loads BGR, so channel 0 is blue and channel 2 is red)
img2 = np.zeros((img.shape[0], img.shape[1]), dtype=np.uint8)
cond = np.bitwise_and(img[:, :, 0] >= 100, img[:, :, 2] < 100)
img2[np.where(cond)] = 255
img = img2

# delete the noise with a morphological opening
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)

str1 = pytesseract.image_to_string(
    Image.fromarray(img),
    config='-c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyz0123456789 --oem 3 --psm 8')
cv2.imwrite("frame.png", img)
print(str1)
output
f2e4
image
In order to see the full options of Tesseract, type the following command: tesseract --help-extra, or refer to this_link
Tesseract is intended for performing OCR on text documents. In my experience it's good but a bit patchy even with very clean data.
In this case it appears you are trying to solve a CAPTCHA which is specifically designed to defeat OCR software. It's very likely you cannot use Tesseract to solve this issue, because:
It's not really designed for that
The scenario is adversarial:
The example is specifically designed to prevent what you are trying to do
If you could get it to work, the other party would likely change it to break again
If you want to proceed I would suggest:
Working on cleaning up the image before attempting to process it (can you get a nice, readable black and white image? see the sketch after this list)
Train your own recognition network using a lot of examples
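For the first suggestion, here is a minimal sketch of one way to get a black-and-white version with OpenCV (the file name matches the question, and the settings are guesses that would need tuning for this particular CAPTCHA):
import cv2

img = cv2.imread('a.jpg', 0)  # read directly as grayscale
img = cv2.medianBlur(img, 3)  # knock out speckle noise
# Otsu picks a threshold automatically; invert afterwards if the text comes out white.
_, bw = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
cv2.imwrite('a_bw.png', bw)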
I am currently facing a problem with pytesseract where the software is unable to detect a number in this image:
https://i.stack.imgur.com/kmH2R.png
This is taken from a bigger image with threshold filter applied.
For some reason, pytesseract doesn't want to recognise the 6 in this image. Any suggestions? Here is my code:
import cv2
import pytesseract
from PIL import Image

image = ...  # insert raw image here; my code takes a screenshot
image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
image = cv2.medianBlur(image, 3)
rel, gray = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
# If you want to use the image from above, start here.
image = Image.fromarray(image)
string = pytesseract.image_to_string(image)
print(string)
EDIT: With some further investigation, my code works fine with numbers containing 2 digits, but not with single digits.
pytesseract defaults to a page segmentation mode that looks for larger chunks of text; in order to have it detect a single character you need to run it with the option --psm 10 (PSM_SINGLE_CHAR). However, due to the black spots in the corners of the image you provided, it detects them as random dashes and returns nothing in that mode, since it thinks there are multiple characters, so in this case you need to use --psm 8 (PSM_SINGLE_WORD):
string = pytesseract.image_to_string(image, config='--psm 8')
The output from this will include those random characters, so you would need to strip them after pytesseract runs, or improve your bounding box around the numbers to remove any noise. Also, if all of the characters being detected are numbers, you can add '-c tessedit_char_whitelist=0123456789' after '--psm 8' to improve the detection.
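Putting the two together would look something like this:
string = pytesseract.image_to_string(image, config='--psm 8 -c tessedit_char_whitelist=0123456789')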
Another minor tip to simplify your code: cv2.imread has an option to read the image as black & white, so you don't need to run cvtColor afterwards; just do:
image = cv2.imread('/path/to/image/6.png', 0)
Also, you can create the PIL image object within your call to pytesseract, so that line simplifies to:
string = pytesseract.image_to_string(Image.fromarray(img), config='--psm 8')
as long as you have 'from PIL import Image' at the top of your script.
I'm working on a project that will recognize teams in a game (Overwatch) and record which players were on which team. It has a predefined list of who is playing; it only needs to recognize which image each player is located on. So far I have had success in capturing the images for each team and getting a rough output of the name for each player; however, it is getting several letters confused.
My input images:
And the output I get from OCR:
W THEMIGHTVMRT
ERSVZENVRTTR
ERSVLUCID
ERSVZRRVR
ERSVMEI
EFISVSDMBRR
ERSV RNR
ERSVZENVRTTR
EFISVZHRVR
ERSVMCCREE
ERSVMEI
EHSVRDRDHDG
From this, you can see that the OCR confuses "A" with "R" and "Y" with "V". I was able to get the font file that Overwatch uses and generate a .traineddata file using Train Your Tesseract - I'm aware that there is probably a better way of generating this file, though I'm not sure how.
My code:
from pytesseract import *
import pyscreenshot
pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata"'
team1 = pyscreenshot.grab(bbox=(50,450,530,810)) # X1, Y1, X2, Y2
team1.save("team1screenshot.png")
team1text = pytesseract.image_to_string(team1, config=tessdata_dir_config, lang='owf')
team2 = pyscreenshot.grab(bbox=(800,450,1280,810)) # X1, Y1, X2, Y2
team2.save("team2screenshot.png")
team2text = pytesseract.image_to_string(team2, config=tessdata_dir_config, lang='owf')
print(team1text)
print("------------------")
print(team2text)
How should I improve the recognition of these characters? Do I need a better .traineddata file, or is it a matter of better image processing?
Thanks for any help!
As @FlorianBrucker mentioned, doing a similarity test on the strings allows you (with some fine tuning) to find the correct string after the OCR step.
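A minimal sketch of that idea using only the standard library (the roster below is hypothetical; in practice use your predefined player list):
import difflib

# Hypothetical roster -- replace with the predefined list of who is playing.
roster = ['THEMIGHTYMRT', 'ZENYATTA', 'LUCIO', 'ZARYA', 'MEI', 'SOMBRA', 'MCCREE', 'ROADHOG']

def closest_player(ocr_name):
    # Return the roster entry most similar to the OCR output, or None if nothing is close enough.
    matches = difflib.get_close_matches(ocr_name.upper(), roster, n=1, cutoff=0.5)
    return matches[0] if matches else None

print(closest_player('ZENVRTTR'))  # hopefully 'ZENYATTA'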
You could try custom OCR configs to do a sparse text search, "Find as much text as possible in no particular order."
Set psm to 11 in the tesseract config.
See if you can do this:
tessdata_dir_config = "--oem 3 --psm 11"
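With the code from the question, that would look roughly like this (keeping the tessdata directory and the owf language the asker already uses):
tessdata_dir_config = '--tessdata-dir "C:\\Program Files (x86)\\Tesseract-OCR\\tessdata" --oem 3 --psm 11'
team1text = pytesseract.image_to_string(team1, config=tessdata_dir_config, lang='owf')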
To see a complete list of supported page segmentation modes (psm), use tesseract -h. Here's the list as of 3.21:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
I'm using the Python wrapper for Tesseract: https://github.com/madmaze/pytesseract
There you can configure Tesseract like this:
custom_oem_psm_config = r'--oem 3 --psm 6'
pytesseract.image_to_string(image, config=custom_oem_psm_config)
I am using a combination of pyautogui and pytesseract to capture small regions on the screen and then pull the number/text out of each region. I have written a script that has read the majority of captured images perfectly, but single-digit numbers seem to cause an issue for it. For example, small regions of an image containing numbers are saved to .png files; the numbers 11, 14, and 18 were pulled perfectly, but the number 7 just returns a blank string.
Question: What could be causing this to happen?
Code: Scaled down drastically to make it very easy to follow:
import pyautogui
import pytesseract
from PIL import Image

def get_text(image):
    return pytesseract.image_to_string(image)

answer2 = pyautogui.screenshot('answer2.png', region=(727, 566, 62, 48))
img = Image.open('answer2.png')
answer2 = get_text(img)
This code is repeated 4 times, once for each image; it worked for 11, 14, and 18, but not for 7.
Just to show the files being read, here is a screenshot of the images after they were saved through the screenshot command.
https://gyazo.com/0acbf5be2d970abeb29561113c171fbe
Here is a screenshot of what I am working from:
https://gyazo.com/311913217a1302382b700b07ad3e3439
I found the question Why pytesseract does not recognise single digits? and in the comments I found the option --psm 6.
I checked tesseract with the option --psm 6 and it can recognize the single digit in your image.
tesseract --psm 6 number-7.jpg result.txt
I checked pytesseract.image_to_string() with the option config='--psm 6' and it can recognize the single digit in your image too.
#!/usr/bin/env python3
from PIL import Image
import pytesseract
img = Image.open('number-7.jpg')
print(pytesseract.image_to_string(img, config='--psm 6'))