How to read numbers on screen efficiently (pytesseract)? - python

I'm trying to read numbers on the screen, and for that I'm using pytesseract. The thing is, even though it works, it works slowly and doesn't give good results at all. For example, with this image:
I can make this thresholded image:
and it reads 5852 instead of 585, which is understandable, but sometimes it can be way worse with different thresholding: it can read 1 000 000 as 1 aaa eee, for example, or 585 as 5385r (yes, it even adds characters without any reason).
Is there any way to force pytesseract to read only numbers, or to use something that works better than pytesseract?
My code:
from PIL import Image
from pytesseract import pytesseract as pyt
import test

pyt.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'

def tti2(location):
    image_file = location
    im = Image.open(image_file)
    text = pyt.image_to_string(im)
    print(text)
    for character in "abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ*^&\n":
        text = text.replace(character, "")
    return text

test.th("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TEST.png")
print(tti2("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TESTbis.png"))
Code of "test" (it handles the thresholding):
import cv2
from PIL import Image

def th(path):
    img = cv2.imread(path)
    # If your image is not already grayscale:
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    threshold = 60  # to be determined
    _, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
    pil_img = Image.fromarray(img_binarized)
    path = path.replace(".png", "")
    pil_img.save(path + "bis.png")

You can force pytesseract to read only numbers by using the tessedit_char_whitelist config option with digit values only.
You can try to improve results using the Tesseract documentation:
Tesseract - Improving the quality of the output
I also suggest that you:
Use white for the background and black for the character font color.
Select the desired Tesseract psm mode. In this case I used psm mode 7 to treat the image as a single text line.
Use the tessedit_char_whitelist config option to specify only the characters you are searching for.
With that in mind, here is the code:
import cv2
import numpy as np
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
originalImage = cv2.imread('1.png')
grayImage = cv2.cvtColor(originalImage, cv2.COLOR_BGR2GRAY)
(_, blackAndWhiteImage) = cv2.threshold(grayImage, 127, 255, cv2.THRESH_BINARY_INV)
text = pytesseract.image_to_string(blackAndWhiteImage, config="--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789")
print('Text: ', text)
cv2.imshow('Image result', blackAndWhiteImage)
cv2.waitKey(0)
cv2.destroyAllWindows()
And the desired result:

Related

Problem to recognize characters in Pytesseract python

I'm working with this kind of image (Original_Image) and I'm having some problems applying character recognition. I tried some image treatments (grayscale, black and white, noise removal, ...) and got only bad results. This is part of the code; I'm working in Python.
import cv2
from matplotlib import pyplot as plt
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r"C:\Users\14231744700\AppData\Local\Tesseract-OCR\tesseract.exe"

image_file = '5295_down.bmp'
img = cv2.imread(image_file)
height, width, channels = img.shape

# The attached image is this one (img_cropped) and I want this data as a string to work on it
img_cropped = img[41*height//50:92*height//100, 2*width//14:81*width//100]
#cv2.imshow('Image_cropped', img_cropped)
#cv2.imwrite('image_cropped.png', img_cropped)
#cv2.waitKey(0)

def image_to_string(image):
    data = pytesseract.image_to_string(image, lang='eng', config='--psm 6')
    return data

image_to_string(img_cropped)
If someone knows about a preprocessing step or any other possibility to get better results, I'll be very thankful.

Pytesseract Wrong Number

I have a problem with the recognition: some of my input images that are visibly a 1 turn into a 4 after the .image_to_string() command.
My input image is this:
unedited img
I then run some preprocessing steps over it (grayscale, thresholding with Otsu, and enlarging the picture), leading to this:
preprocessed img
I also tried dilating the picture, with no change in the output.
After running:
custom_config = "-c tessedit_char_whitelist=0123456789LV --psm 13"
pytesseract.image_to_string(processed_img, config=custom_config)
The final result is a String Displaying:
4LV♀ and I don't understand what I can change to get a 1 instead of the 4.
Thanks in advance for your time.
The ♀ or \n\x0c appears because you need custom_config = "-c page_separator=''" as the config, because for some reason Tesseract adds it as the page separator; you don't need anything else in your config.
Getting the correct number is down to the processing, mainly the size. However, this code I found works best:
import pytesseract
from PIL import Image
import cv2
import numpy as np

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

imagepath = "./Pytesseract Wrong Number/kD3Zy.jpg"
read_img = Image.open(imagepath)
# convert PIL image to cv2 image locally
read_img = read_img.convert('RGB')
level_img = np.array(read_img)
level_img = level_img[:, :, ::-1].copy()
# convert to grayscale
level_img = cv2.cvtColor(level_img, cv2.COLOR_RGB2GRAY)
_, img_bin = cv2.threshold(level_img, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
level_img = cv2.bitwise_not(img_bin)
kernel = np.ones((2, 1), np.uint8)
# make the image bigger, because the characters need a height of at least 30 pixels
level_img = cv2.resize(level_img, (0, 0), fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
level_img = cv2.dilate(level_img, kernel, iterations=1)
# --debug--
#cv2.imshow("Debug", level_img)
#cv2.waitKey()
#cv2.destroyAllWindows()
#cv2.imwrite("1.png", level_img)
custom_config = "-c page_separator=''"
level = pytesseract.image_to_string(level_img, config=custom_config)
print(level)
If you want to save the processed image, uncomment #cv2.imwrite("1.png", level_img).
Try the settings "--psm 8 --oem 3". The full list of modes is in the Tesseract documentation, though psm 8 and oem 3 generally work fine.

Text Detection of Labels using PyTesseract

I'm building a label detection tool that automatically identifies and alphabetically sorts images based on the equipment number (19-V1083AI). I used the pytesseract library to convert the image to a string after the contours of the equipment label were identified. Although the code runs correctly, it never outputs the equipment number. It's my first time using the pytesseract library and the goodFeaturesToTrack function. Any help would be greatly appreciated!
Original Image
import numpy as np
import cv2
import imutils #resizeimage
import pytesseract # convert img to string
from matplotlib import pyplot as plt
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Read the image file
image = cv2.imread('Car Images/s3.JPG')
# Resize the image - change width to 500
image = imutils.resize(image, width=500)
# Display the original image
cv2.imshow("Original Image", image)
cv2.waitKey(0)
# RGB to Gray scale conversion
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow("1 - Grayscale Conversion", gray)
cv2.waitKey(0)
# Noise removal with iterative bilateral filter(removes noise while preserving edges)
gray = cv2.bilateralFilter(gray, 11, 17, 17)
cv2.imshow("2 - Bilateral Filter", gray)
cv2.waitKey(0)
corners = cv2.goodFeaturesToTrack(gray,60,0.001,10)
corners = np.int0(corners)
for i in corners:
    x, y = i.ravel()
    cv2.circle(image, (x, y), 0, 255, -1)
coord = np.where(np.all(image == (255, 0, 0),axis=-1))
plt.imshow(image)
# Use tesseract to convert image into string
text = pytesseract.image_to_string(image, lang='eng')
print("Equipment Number is:", text)
plt.show()
Output Image2
Note: It worked with one of the images but not for the others
I found that using a particular configuration option for PyTesseract will find your text -- and some noise. Here are the configuration options explained: https://stackoverflow.com/a/44632770/42346
For this task I chose: "Sparse text. Find as much text as possible in no particular order."
Since there's more "text" returned by PyTesseract you can use a regular expression to filter out the noise.
This particular regular expression looks for two digits, a hyphen, five digits or characters, a space, and then two digits or characters. This can be adjusted to your equipment number format as necessary, but I'm reasonably confident this is a good solution because there's nothing else like this equipment number in the returned text.
import re
import cv2
import pytesseract

image = cv2.imread('Fv0oe.jpg')
text = pytesseract.image_to_string(image, lang='eng', config='--psm 11')

for line in text.split('\n'):
    if re.match(r'^\d{2}-\w{5} \w{2}$', line):
        print(line)
Result (with no image processing necessary):
19-V1083 AI

What causes pytesseract to read either the top or bottom text-line of a dual-line image depending on whether opencv or pillow is used?

EDIT: I forgot to process the image, which solves the reading issue, thanks to Nathancy. I'm still wondering what makes Tesseract read only the top OR the bottom line of an unprocessed image (same image, two different outcomes).
Original:
I have an image that contains two lines of text:
random test image for pytesseract
When I open the image within Python (IDLE Python 3.6) with PIL Image and use pytesseract to extract a string, it only extracts the last/bottom line correctly; the upper line of text is scrambled garbage (see code section below). However, when I use opencv to open the image and use pytesseract to extract a string, it only extracts the top/upper line correctly whilst making a mess of the second/bottom line of text (see also code section below).
Here is the code:
>>> from PIL import Image, ImageFilter
>>> import pytesseract
>>> pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
>>> import cv2
>>> img = Image.open(r"C:\Users\user\MyImage.png")
>>> img2 = cv2.imread(r"C:\Users\user\MyImage.png", cv2.IMREAD_COLOR)
>>> print(pytesseract.image_to_string(img2))
Pet Sock has 448/600 HP left
A ae eee PER eats ae
>>> print(pytesseract.image_to_string(img))
Le TL
JHE has 329/350 HP left.
When I use pytesseract.image_to_boxes on both img and img2 it will show the same bounding box for certain locations with a different letter (only showing 2 extracted lines which contain an identical box)
>>> print(pytesseract.image_to_boxes(img2))
A 4 6 10 16 0
>>> print(pytesseract.image_to_boxes(img))
J 4 6 10 16 0
When I use pytesseract.image_to_data on both img and img2, it shows very high (95+) confidence on the line it reads correctly and very low (below 30) on the garbled line. (Excel table output of image_to_data; edit: the Excel tables are for img2 and img, respectively.)
I fiddled around with the psm config values (I have tried them all); except for producing more garbage on settings 5, 7, 8, 9, 10, and 13, and errors on 0 and 2, they gave no different results than the default (which is 3, I believe).
I must be making some rookie mistake but I can't get my head around why this is happening. If anyone can shine a light in the right direction it would be awesome.
The image was just a fitting, but random, image for an OCR test that I had laying around. No further intentions than experimenting with pytesseract.
Whenever performing OCR with Pytesseract, it is important to preprocess the image so that the text is in black with the background in white. We can do this with simple thresholding
Output from Pytesseract
Pet Sock has 448/600 HP left
JHE has 329/359 HP left.
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.waitKey()

Python - Pytesseract extracts incorrect text from image

I used the code below in Python to extract text from an image:
import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)
    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Write image after noise removal
    cv2.imwrite(src_path + "removed_noise.png", img)
    # Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
    # Write the image after applying opencv
    cv2.imwrite(src_path + "thres.png", img)
    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))  # or src_path + "thres.png"
    # Remove template file
    #os.remove(temp)
    return result

print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))
print("------ Done -------")
But the output is incorrect. The input file is:
The output received is '0001' instead of 'D001'.
The output received is '3001' instead of 'B001'.
What code changes are required to retrieve the right characters from the image, and how can I train pytesseract to return the right characters for all font types in an image (including bold characters)?
#Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.
But you can still improve your final output by applying some post-processing steps to the Tesseract output. Here are a few points that you can think about and use if they help:
Try disabling the dictionary check feature in the Tesseract input parameters.
Use heuristic-based information from your dataset. From the given sample images in the question, I guess the first character of each word/sequence is a letter, so you can replace the first digit in your output with the most probable letter based on your dataset; for example, '0' can be replaced with 'D', so '0001' -> 'D001', and similarly for other cases.
Tesseract also provides a character-level recognition confidence value, so use that information to replace characters with the one having the highest confidence value.
Try different config parameters in the line below:
result = pytesseract.image_to_string(Image.open(img_path))  # or src_path + "thres.png"
For example:
result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')
Try changing the psm value and compare the results.
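The heuristic replacement idea above can be sketched as a small post-processing step. The digit-to-letter map here is an assumption based only on the two examples in the question; extend it using the ID formats in your own dataset:

```python
# Digits that Tesseract confused with similar-looking letters in the
# question's examples. This map is an assumption; extend it for your data.
DIGIT_TO_LETTER = {"0": "D", "3": "B"}

def fix_leading_digit(token: str) -> str:
    """Replace a leading digit with its most probable look-alike letter,
    on the assumption that every code starts with a letter."""
    if token and token[0] in DIGIT_TO_LETTER:
        return DIGIT_TO_LETTER[token[0]] + token[1:]
    return token

print(fix_leading_digit("0001"))  # -> D001
print(fix_leading_digit("3001"))  # -> B001
```

Tokens that already start with a letter pass through unchanged, so the fix is safe to apply to every OCR token.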
-- Good Luck --
