Python - Pytesseract extracts incorrect text from image - python

I used the below code in Python to extract text from image,
import cv2
import numpy as np
import pytesseract
from PIL import Image
# Path of working folder on Disk
src_path = "<dir path>"
def get_string(img_path):
# Read image with opencv
img = cv2.imread(img_path)
# Convert to gray
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Apply dilation and erosion to remove some noise
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
# Write image after removed noise
cv2.imwrite(src_path + "removed_noise.png", img)
# Apply threshold to get image with only black and white
#img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
# Write the image after apply opencv to do some ...
cv2.imwrite(src_path + "thres.png", img)
# Recognize text with tesseract for python
result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
# Remove template file
#os.remove(temp)
return result
print '--- Start recognize text from image ---'
print get_string(src_path + "test.jpg")
print "------ Done -------"
But the output is incorrect.. The input file is,
The output received is '0001' instead of 'D001'
The output received is '3001' instead of 'B001'
What is the required code changes to retrieve the right Characters from image, also to train the pytesseract to return the right characters for all font types in image[including Bold characters]

#Maaaaa has pointed out the exact reason for incorrect text recognition by Tessearact.
But still you can improve your final output by applying some post processing steps on the tesseract output. Here are a few points that you can think about and use them if it helps:
Try disabling the dictionary check feature in Tesseract input parameters.
Use heuristic based information from your dataset. From the given sample images in question, i guess first character of each word/sequence is an alphabet so you can replace first digit in your output with most probable alphabet based on your dataset,
for example '0' can be replaced with D so '0001' -> 'D001', similarly for other cases too.
Tesseract also provides the character level recognition confidence value, so use that information to replace the characters with the one having highest confidence value.

Try different config parameters in below line
result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
Like as shown below:
result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"), config='--psm 1 --oem 3')
Try to change the psm value and compare the results
-- Good Luck --

Related

How to read numbers on screen efficiently (pytesseract)?

I'm trying to read numbers on the screen and for that I'm using pytesseract. The thing is, even though it works, it works slowly and doesn't give good results at all. for example, with this image:
I can make this thresholded image:
and it reads 5852 instead of 585, which is understandable, but sometimes it can be way worse with different thresholding. It can read 1 000 000 as 1 aaa eee for example, or 585 as 5385r (yes it even adds characters without any reason)
Isn't any way to force pytesseract to read only numbers or simply use something that works better than pytesseract?
my code:
from PIL import Image
from pytesseract import pytesseract as pyt
import test
pyt.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
def tti2(location) :
image_file = location
im = Image.open(image_file)
text = pyt.image_to_string(im)
print(text)
for character in "abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ*^&\n" :
text = text.replace(character, "")
return text
test.th("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TEST.png")
print(tti2("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TESTbis.png"))
code of "test" (it's for the thresholding) :
import cv2
from PIL import Image
def th(Path) :
img = cv2.imread(Path)
# If your image is not already grayscale :
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
threshold = 60 # to be determined
_, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
pil_img = Image.fromarray(img_binarized)
Path = Path.replace(".png","")
pil_img.save(Path+"bis.png")
A way to force pytesseract to read only numbers can be done using tessedit_char_whitelist config with only digits values.
You can try to improve results using Tesseract documentation:
Tesseract - Improving the quality of the output
Also i suggest you to use:
White for the background and black for characters font color.
Select desired tesseract psm mode. In the previous case i was using 7 psm mode to treat image as a single text line.
Use tessedit_char_whitelist config to specify only the characters that you are sarching for.
With that in mind, here is the code:
import cv2
import numpy as np
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
originalImage = cv2.imread('1.png')
grayImage = cv2.cvtColor(originalImage, cv2.COLOR_BGR2GRAY)
(_, blackAndWhiteImage) = cv2.threshold(grayImage, 127, 255, cv2.THRESH_BINARY_INV)
text = pytesseract.image_to_string(blackAndWhiteImage, config="--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789")
print('Text: ', text)
cv2.imshow('Image result', blackAndWhiteImage)
cv2.waitKey(0)
cv2.destroyAllWindows()
And the desired result:

Python Tesseract OCR isn't properly detecting both numbers

This is the image I'm using. This is the output that was returned with print(pytesseract.image_to_string(img)):
#& [COE] daKizz
6 Grasslands
Q Badlands
#g Swamp
VALUE: # 7,615,; HIRE)
I've tried using the config second parameter (from psm 0 - 13, and "digits") but it doesn't make much difference. I'm considering looking into training tesseract to recognize the type of numbers in an image like that but I'm hoping there are other options.
I've also tried cleaning up the picture but I'm getting the same output
How can I fix this?
I guess you are missing the image processing part.
From the documentation "Improving the quality of the output" you may be interested in Image Processing section.
One way of reaching the desired numbers is inRange thresholding
To apply in-range, we first need to convert the image in hsv colorspace. Since we want the image between the range of pixels.
The result will be:
Now if you read it with "digits" option:
53.8.
53.8.
453.3.
7.615.725.834
One possible question is: Why inRange threshold?
Well, if you try with the other methods (simple or adaptive) you will see inRange is more accurate for the given input image.
Another possible question is: Can we use the values of the cv2.inRange method for other images.
You may not get the desired result. For other images you need to find other range values.
Code:
import cv2
import pytesseract
import numpy as np
bgr_image = cv2.imread("vNYcf.jpg")
hsv_image = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_image, np.array([0, 180, 218]), np.array([60, 255, 255]))
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dilate = cv2.dilate(mask, kernel, iterations=1)
thresh = 255 - cv2.bitwise_and(dilate, mask)
text = pytesseract.image_to_string(thresh, config="--psm 6 digits")
print(text)

Pytesseract Image to String issue

Does anyone know how I can get these results better?
Total Kills: 15,230,550
Kill Details: (recorded after 2019/10,/Z3]
993,151 331,129
1,330,450 33,265,533
5,031,168
This is what it returns however it is meant to be the same as the image posted below, I am new to python so are there any parameters that I can add to make it read the image better?
img = cv2.imread("kills.jpeg")
text = pytesseract.image_to_string(img)
print(text)
This is my code to read the image, Is there anything I can add to make it read better? Also, the black boxes are to cover images that were interfering with the reading. I would like to also say that I have added the 2 black boxes to see if the images behind them were causing the issue, but I still get the same issue.
The missing knowledge is page-segmentation-mode (psm). You need to use them, when you can't get the desired result.
If we look at your image, the only artifacts are the black columns. Other than that, the image looks like a binary image. Suitable for tesseract to recognize the characters and the digits.
Lets try reading the image by setting the psm to 6.
6 Assume a single uniform block of text.
print(pytesseract.image_to_string(img, config="--psm 6")
The result will be:
Total Kills: 75,230,550
Kill Details: (recorded after 2019/10/23)
993,161 331,129
1,380,450 33,265,533
5,031,168
Update
The second way to solve the problem is getting binary mask and applying OCR to the mask features.
Binary-mask
Features of the binary-mask
As we can see the result is slightly different from the input image. Now when we apply OCR result will be:
Total Kills: 75,230,550
Kill Details: (recorded after 2019/10/23)
993,161 331,129
1,380,450 33,265,533
5,031,168
Code:
import cv2
import numpy as np
import pytesseract
# Load the image
img = cv2.imread("LuKz3.jpg")
# Convert to hsv
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Get the binary mask
msk = cv2.inRange(hsv, np.array([0, 0, 0]), np.array([179, 255, 154]))
# Extract
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=5)
res = 255 - cv2.bitwise_and(dlt, msk)
# OCR
txt = pytesseract.image_to_string(res, config="--psm 6")
print(txt)
# Display
cv2.imshow("res", res)
cv2.waitKey(0)

Reading numbers using PyTesseract

I am trying to read numbers from images and cannot find a way to get it to work consistently (not all images have numbers). These are the images:
(here is the link to the album in case the images are not working)
This is the command I'm using to run tesseract on the images: pytesseract.image_to_string(image, timeout=2, config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789'). I have tried multiple configurations, but this seems to work best.
As far as preprocessing goes, this works the best:
gray = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2GRAY)
gray = cv2.bilateralFilter(gray, 11, 17, 17)
im_bw = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY_INV)[1]
This works for all images except the 3rd one. To solve the problem of lines in the 3rd image, i tried getting the edges with cv2.Canny and a pretty large threshold which works, but when drawing them back, even though it gets more than 95% of each number's edges, tesseract does not read them correctly.
I have also tried resizing the image, using cv2.morphologyEx, blurring it etc. I cannot find a way to get it to work for each case.
Thank you.
cv2.resize has consistently worked for me with INTER_CUBIC interpolation.
Adding this last step to pre-processing would most likely solve your problem.
im_bw_scaled = cv2.resize(im_bw, (0, 0), fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
You could play around with the scale. I have used '4' above.
EDIT:
The following code worked with your images very well, even special characters. Please try it out with the rest of your dataset. Scaling, OTSU and erosion was the best combination.
import cv2
import numpy
import pytesseract
pytesseract.pytesseract.tesseract_cmd = "<path to tesseract.exe>"
# Page segmentation mode, PSM was changed to 6 since each page is a single uniform text block.
custom_config = r'--psm 6 --oem 3 -c tessedit_char_whitelist=0123456789'
# load the image as grayscale
img = cv2.imread("5.png",cv2.IMREAD_GRAYSCALE)
# Change all pixels to black, if they aren't white already (since all characters were white)
img[img != 255] = 0
# Scale it 10x
scaled = cv2.resize(img, (0,0), fx=10, fy=10, interpolation = cv2.INTER_CUBIC)
# Retained your bilateral filter
filtered = cv2.bilateralFilter(scaled, 11, 17, 17)
# Thresholded OTSU method
thresh = cv2.threshold(filtered, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)[1]
# Erode the image to bulk it up for tesseract
kernel = numpy.ones((5,5),numpy.uint8)
eroded = cv2.erode(thresh, kernel, iterations = 2)
pre_processed = eroded
# Feed the pre-processed image to tesseract and print the output.
ocr_text = pytesseract.image_to_string(pre_processed, config=custom_config)
if len(ocr_text) != 0:
print(ocr_text)
else: print("No string detected")

Text Detection of Labels using PyTesseract

A label detection tool that automatically identifies and alphabetically sorts the images based on equipment number (19-V1083AI). I used the pytesseract library to convert the image to a string after the contours of the equipment label were identified. Although the code runs correctly, it never outputs the equipment number. It's my first time using the pytesseract library and the goodFeaturesToTrack function. Any help would be greatly appreciated!
Original Image
import numpy as np
import cv2
import imutils #resizeimage
import pytesseract # convert img to string
from matplotlib import pyplot as plt
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Read the image file
image = cv2.imread('Car Images/s3.JPG')
# Resize the image - change width to 500
image = imutils.resize(image, width=500)
# Display the original image
cv2.imshow("Original Image", image)
cv2.waitKey(0)
# RGB to Gray scale conversion
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow("1 - Grayscale Conversion", gray)
cv2.waitKey(0)
# Noise removal with iterative bilateral filter(removes noise while preserving edges)
gray = cv2.bilateralFilter(gray, 11, 17, 17)
cv2.imshow("2 - Bilateral Filter", gray)
cv2.waitKey(0)
corners = cv2.goodFeaturesToTrack(gray,60,0.001,10)
corners = np.int0(corners)
for i in corners:
x,y = i.ravel()
cv2.circle(image,(x,y),0,255,-1)
coord = np.where(np.all(image == (255, 0, 0),axis=-1))
plt.imshow(image)
# Use tesseract to covert image into string
text = pytesseract.image_to_string(image, lang='eng')
print("Equipment Number is:", text)
plt.show()
Output Image2
Note: It worked with one of the images but not for the others
Output Image2
I found using a particular configuration option for PyTesseract will find your text -- and some noise. Here are the configuration options explained: https://stackoverflow.com/a/44632770/42346
For this task I chose: "Sparse text. Find as much text as possible in no particular order."
Since there's more "text" returned by PyTesseract you can use a regular expression to filter out the noise.
This particular regular expression looks for two digits, a hyphen, five digits or characters, a space, and then two digits or characters. This can be adjusted to your equipment number format as necessary, but I'm reasonably confident this is a good solution because there's nothing else like this equipment number in the returned text.
import re
import cv2
import pytesseract
image = cv2.imread('Fv0oe.jpg')
text = pytesseract.image_to_string(image, lang='eng', config='--psm 11')
for line in text.split('\n'):
if re.match(r'^\d{2}-\w{5} \w{2}$',line):
print(line)
Result (with no image processing necessary):
19-V1083 AI

Categories