Python tesseract increase accuracy for OCR


I have quite simple pictures, but Tesseract does not give me correct results.
Code:
pytesseract.image_to_string(image, lang='eng')
The example picture gives this result:
SARVN PRIM E N EU ROPTICS\nBLU EPRINT
I have also tried adding my own words to the dictionary to see whether that improves anything, but it still fails.
pytesseract.image_to_string(image, lang='eng', config="--user-words words.txt")
My word list looks like this
SARYN
PRIME
NEUROPTICS
BLUEPRINT
How should I approach the problem? Maybe I have to convert the image before predicting? The text color can vary between a couple of colors, but the background is always black.

Try inverting the image and then applying a binarization/thresholding step to get black text on a white background before trying OCR.
See this post for tips on the binarization of an image in Python.
Of course, the better the quality and the sharper the text in the input image, the better your OCR results will be.
I used an external tool to change it to black on white and got the below image.
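The invert-then-threshold idea can be sketched in plain Python on a tiny grayscale grid, so the logic is visible without any image library. The function name `to_black_on_white` and the fixed threshold of 128 are assumptions made for illustration; on real images the same two steps would normally be done with an image library.

```python
def to_black_on_white(gray, threshold=128):
    """Invert a grayscale image, then binarize it.

    `gray` is a list of rows, each a list of 0-255 intensities.
    Light text on a dark background becomes black (0) text on white (255).
    """
    result = []
    for row in gray:
        out_row = []
        for value in row:
            inverted = 255 - value  # light text -> dark, dark background -> light
            out_row.append(255 if inverted >= threshold else 0)
        result.append(out_row)
    return result


# Tiny example: bright "text" pixels (200, 210) on a black background (0).
sample = [
    [0, 200, 0],
    [0, 210, 0],
]
binary = to_black_on_white(sample)
print(binary)
```

The fixed threshold is only a starting point; text in several colors may need a different value or an automatic method.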

I have a four-step solution:
1. Smooth the image
2. Apply simple-threshold
3. Take sentences line-by-line
4. Apply erosion to each individual sentence
Result of each stage (images not shown): Smoothing, Threshold, Upsample + Erode.
Pytesseract output:
SARYN PRIME NEUVROPTICS BLUEPRINT
Code:
import cv2
import pytesseract
img = cv2.imread('j0nNV.png')
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blr = cv2.GaussianBlur(gry, (3, 3), 0)
thr = cv2.threshold(blr, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
(h_thr, w_thr) = thr.shape[:2]
s_idx = 0
e_idx = int(h_thr/2)
for _ in range(0, 2):
    crp = thr[s_idx:e_idx, 0:w_thr]
    (h_crp, w_crp) = crp.shape[:2]
    crp = cv2.resize(crp, (w_crp*2, h_crp*2))
    crp = cv2.erode(crp, None, iterations=1)
    s_idx = e_idx
    e_idx = s_idx + int(h_thr/2)
    txt = pytesseract.image_to_string(crp)
    print(txt)
    cv2.imshow("crp", crp)
    cv2.waitKey(0)
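The loop above walks through the image in two equal-height strips by updating `s_idx` and `e_idx`. As a rough sketch, the same row ranges can be computed up front for any number of lines. The helper name `line_slices` is hypothetical, and it assumes the text lines are evenly spaced, which holds for the example image but not in general.

```python
def line_slices(height, n_lines):
    """Return (start, end) row ranges splitting `height` rows into equal strips.

    Uses integer division, so leftover rows at the bottom are ignored,
    matching the int(h_thr/2) step used in the loop above.
    """
    step = height // n_lines
    return [(i * step, (i + 1) * step) for i in range(n_lines)]


# Two strips for a 100-row image: rows 0-49 and rows 50-99.
slices = line_slices(100, 2)
print(slices)
```

Each pair can then be used as `thr[start:end, 0:w_thr]` to crop one line before resizing and eroding it.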

Related

Remove large blobs of noise while leaving text intact with opencv

I'm trying to process an image with OpenCV before feeding it to Tesseract for text recognition. I'm working with tags that have a lot of noise and a holographic pattern that varies greatly depending on the lighting conditions.
For this reason I tried different approaches.
First I convert the image to grayscale, then I apply a median blur to soften the noise, then apply an adaptive threshold for masking.
I end up with this result
However the image still has a lot of noise around the text which really messes with its recognition. I'm new to image processing and I'm a bit lost as to how to proceed.
Here's the code for the aforementioned approach:
def approach_3(img, blurIterations=13):
    img = rotate_img(img)
    img = resize(img)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
    median = cv2.medianBlur(gray, blurIterations)
    th2 = cv2.adaptiveThreshold(median, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                cv2.THRESH_BINARY, 31, 12)
    return th2

How to improve python/tesseract Image to Text accuracy?

How can I grab an image from a region and properly use tesseract to translate to text? I got this currently:
img = ImageGrab.grab(bbox =(1341,182, 1778, 213))
tesstr = pytesseract.image_to_string(np.array(img), lang ='eng')
print (tesstr)
The issue is that the result is badly wrong, because the text in the region is red on a blue background. How can I improve its accuracy? Here is an example of what it is trying to convert from image to text:

You should be familiar with the Tesseract documentation page "Improving the quality of the output" and try each of the methods it suggests. If you still can't achieve the desired result, look at these other methods:
Thresholding Operations using inRange
Changing Colorspaces
Image segmentation
To get the desired result, you need a binary mask of the image. Neither simple thresholding nor adaptive thresholding works for this input image.
To get the binary mask
Up-sample and convert input image to the HSV color-space
Set lower and higher color boundaries.
Result:
The OCR output for 0.37 version will be:
Day 20204, 16:03:12: Your ‘Metal Triangle Foundation’
was destroved!
Code:
import cv2
import numpy as np
import pytesseract
# Load the image
img = cv2.imread("b.png")
# Up-sample
img = cv2.resize(img, (0, 0), fx=2, fy=2)
# Convert to HSV color-space
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Get the binary mask
msk = cv2.inRange(hsv, np.array([0, 0, 123]), np.array([179, 255, 255]))
# OCR
txt = pytesseract.image_to_string(msk)
print(txt)
# Display
cv2.imshow("msk", msk)
cv2.waitKey(0)
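The bounds passed to `cv2.inRange` above cover the full hue range (0 to 179) and the full saturation range (0 to 255), so only the V channel actually filters pixels. For 8-bit images OpenCV defines V as the maximum of the B, G and R values, which means the mask reduces to a brightness test. The sketch below shows that logic in plain Python. `value_mask` is a hypothetical helper name, not an OpenCV function, and the sample pixel values are made up for illustration.

```python
def value_mask(pixels_bgr, min_value=123):
    """Mimic inRange(hsv, [0, 0, min_value], [179, 255, 255]) on BGR pixels.

    For 8-bit images the HSV value channel equals max(B, G, R), so a pixel is
    kept (255) when its brightest channel is at least `min_value`, else 0.
    """
    return [
        [255 if max(b, g, r) >= min_value else 0 for (b, g, r) in row]
        for row in pixels_bgr
    ]


# One bright red pixel (kept) and one darker blue pixel (dropped), in BGR order.
sample = [[(30, 20, 200), (120, 40, 30)]]
mask = value_mask(sample)
print(mask)
```

Seen this way, tuning the lower V bound is the main knob: raise it if background pixels leak into the mask, lower it if parts of the text disappear.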

The Tesseract API has an option for setting the DPI at which it examines the image when detecting text. The higher the DPI, the higher the precision, until diminishing returns set in, and more processing power is required. The DPI should not exceed the original image's DPI.

Pytesseract Image to String issue

Does anyone know how I can get these results better?
Total Kills: 15,230,550
Kill Details: (recorded after 2019/10,/Z3]
993,151 331,129
1,330,450 33,265,533
5,031,168
This is what it returns, but it is meant to match the image posted below. I am new to Python, so are there any parameters I can add to make it read the image better?
img = cv2.imread("kills.jpeg")
text = pytesseract.image_to_string(img)
print(text)
This is my code to read the image. Is there anything I can add to make it read better? The black boxes cover images that were interfering with the reading. I added the two black boxes to check whether the images behind them were causing the issue, but I still get the same result.

What is missing here is the page segmentation mode (psm). Set it explicitly when the default does not give the desired result.
If we look at your image, the only artifacts are the black columns. Other than that, the image looks like a binary image, which is suitable for Tesseract to recognize the characters and the digits.
Let's try reading the image with the psm set to 6.
6 Assume a single uniform block of text.
print(pytesseract.image_to_string(img, config="--psm 6"))
The result will be:
Total Kills: 75,230,550
Kill Details: (recorded after 2019/10/23)
993,161 331,129
1,380,450 33,265,533
5,031,168
Update
The second way to solve the problem is to get a binary mask and apply OCR to the mask features.
Binary-mask
Features of the binary-mask
As we can see, the result is slightly different from the input image. Now, when we apply OCR, the result will be:
Total Kills: 75,230,550
Kill Details: (recorded after 2019/10/23)
993,161 331,129
1,380,450 33,265,533
5,031,168
Code:
import cv2
import numpy as np
import pytesseract
# Load the image
img = cv2.imread("LuKz3.jpg")
# Convert to hsv
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Get the binary mask
msk = cv2.inRange(hsv, np.array([0, 0, 0]), np.array([179, 255, 154]))
# Extract
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=5)
res = 255 - cv2.bitwise_and(dlt, msk)
# OCR
txt = pytesseract.image_to_string(res, config="--psm 6")
print(txt)
# Display
cv2.imshow("res", res)
cv2.waitKey(0)

Reading numbers using PyTesseract

I am trying to read numbers from images and cannot find a way to get it to work consistently (not all images have numbers). These are the images:
(here is the link to the album in case the images are not working)
This is the command I'm using to run tesseract on the images: pytesseract.image_to_string(image, timeout=2, config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789'). I have tried multiple configurations, but this seems to work best.
As far as preprocessing goes, this works the best:
gray = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
gray = cv2.bilateralFilter(gray, 11, 17, 17)
im_bw = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)[1]
This works for all images except the 3rd one. To deal with the lines in the 3rd image, I tried getting the edges with cv2.Canny and a fairly large threshold. That works, but when I draw the edges back, Tesseract does not read the numbers correctly even though more than 95% of each number's edges are captured.
I have also tried resizing the image, using cv2.morphologyEx, blurring it etc. I cannot find a way to get it to work for each case.
Thank you.

cv2.resize has consistently worked for me with INTER_CUBIC interpolation.
Adding this last step to pre-processing would most likely solve your problem.
im_bw_scaled = cv2.resize(im_bw, (0, 0), fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
You could play around with the scale. I have used '4' above.
EDIT:
The following code worked very well with your images, even with special characters. Please try it out on the rest of your dataset. Scaling, OTSU and erosion were the best combination.
import cv2
import numpy
import pytesseract
pytesseract.pytesseract.tesseract_cmd = "<path to tesseract.exe>"
# Page segmentation mode, PSM was changed to 6 since each page is a single uniform text block.
custom_config = r'--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789'
# load the image as grayscale
img = cv2.imread("5.png",cv2.IMREAD_GRAYSCALE)
# Change all pixels to black, if they aren't white already (since all characters were white)
img[img != 255] = 0
# Scale it 10x
scaled = cv2.resize(img, (0,0), fx=10, fy=10, interpolation = cv2.INTER_CUBIC)
# Retained your bilateral filter
filtered = cv2.bilateralFilter(scaled, 11, 17, 17)
# Thresholded OTSU method
thresh = cv2.threshold(filtered, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
# Erode the image to bulk it up for tesseract
kernel = numpy.ones((5,5),numpy.uint8)
eroded = cv2.erode(thresh, kernel, iterations = 2)
pre_processed = eroded
# Feed the pre-processed image to tesseract and print the output.
ocr_text = pytesseract.image_to_string(pre_processed, config=custom_config)
if len(ocr_text) != 0:
    print(ocr_text)
else:
    print("No string detected")

Python - Pytesseract extracts incorrect text from image

I used the below code in Python to extract text from image,
import cv2
import numpy as np
import pytesseract
from PIL import Image
# Path of working folder on Disk
src_path = "<dir path>"
def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)
    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Write image after removed noise
    cv2.imwrite(src_path + "removed_noise.png", img)
    # Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
    # Write the image after apply opencv to do some ...
    cv2.imwrite(src_path + "thres.png", img)
    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
    # Remove template file
    #os.remove(temp)
    return result

print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))
print("------ Done -------")
But the output is incorrect. The input file is:
The output received is '0001' instead of 'D001'
The output received is '3001' instead of 'B001'
What code changes are required to retrieve the right characters from the image, and to train pytesseract to return the right characters for all font types in the image (including bold characters)?

@Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.
Still, you can improve the final output by applying some post-processing steps to the Tesseract output. Here are a few points to consider:
Try disabling the dictionary check feature in the Tesseract input parameters.
Use heuristics based on your dataset. From the sample images in the question, I guess the first character of each word/sequence is a letter, so you can replace a leading digit in the output with the most probable letter for your dataset,
for example '0' can be replaced with 'D' so '0001' -> 'D001', and similarly for other cases.
Tesseract also provides a character-level recognition confidence value, so you can use that information to replace characters with the candidate that has the highest confidence.
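The leading-character heuristic described above can be sketched as a small post-processing step on the OCR string. The look-alike table and the name `fix_leading_char` are assumptions for illustration, built only from the two misreadings reported in the question; a real table would have to be derived from your own dataset.

```python
# Assumed look-alike table: digit that Tesseract returned -> letter that was expected.
DIGIT_TO_LETTER = {"0": "D", "3": "B"}


def fix_leading_char(text, mapping=DIGIT_TO_LETTER):
    """Replace a leading digit with its look-alike letter, if one is known."""
    if text and text[0] in mapping:
        return mapping[text[0]] + text[1:]
    return text


# Strings that already start with a letter pass through unchanged.
fixed = [fix_leading_char(s) for s in ["0001", "3001", "D001"]]
print(fixed)
```

This only makes sense when every code is known to start with a letter; otherwise a genuine leading digit would be corrupted.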

Try different config parameters in the line below:
result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
For example:
result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')
Try changing the psm value and compare the results.
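Comparing psm values by hand gets tedious, so here is a rough sketch of a loop that tries several modes and collects the output of each. The helper `compare_psm` and the choice of modes are assumptions for illustration. The OCR function is passed in as a parameter, so `pytesseract.image_to_string` can be supplied directly, while the stand-in `fake_ocr` lets the sketch run without Tesseract installed.

```python
def compare_psm(image, ocr, modes=(3, 6, 7, 11)):
    """Run `ocr(image, config=...)` once per page segmentation mode.

    `ocr` is any callable accepting a `config` keyword, for example
    pytesseract.image_to_string. Returns {mode: stripped output text}.
    """
    results = {}
    for mode in modes:
        config = f"--psm {mode} --oem 3"
        results[mode] = ocr(image, config=config).strip()
    return results


# Stand-in OCR function that just echoes the config it was given.
def fake_ocr(image, config=""):
    return f"ran with {config}\n"


demo = compare_psm(None, fake_ocr, modes=(6, 7))
print(demo)
```

With pytesseract installed, a call such as `compare_psm(Image.open(img_path), pytesseract.image_to_string)` would return the text for each mode, which can then be checked against the expected 'D001' and 'B001'.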
-- Good Luck --
