Text Detection of Labels using PyTesseract


I am building a label detection tool that automatically identifies the equipment number in each image (for example, 19-V1083AI) and sorts the images alphabetically by it. After identifying the contours of the equipment label, I use the pytesseract library to convert the image to a string. The code runs without errors, but it never outputs the equipment number. This is my first time using the pytesseract library and the goodFeaturesToTrack function. Any help would be greatly appreciated!
Original Image
import numpy as np
import cv2
import imutils #resizeimage
import pytesseract # convert img to string
from matplotlib import pyplot as plt
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Read the image file
image = cv2.imread('Car Images/s3.JPG')
# Resize the image - change width to 500
image = imutils.resize(image, width=500)
# Display the original image
cv2.imshow("Original Image", image)
cv2.waitKey(0)
# BGR to grayscale conversion (cv2.imread loads images in BGR order)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
cv2.imshow("1 - Grayscale Conversion", gray)
cv2.waitKey(0)
# Noise removal with iterative bilateral filter(removes noise while preserving edges)
gray = cv2.bilateralFilter(gray, 11, 17, 17)
cv2.imshow("2 - Bilateral Filter", gray)
cv2.waitKey(0)
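# Detect up to 60 corners (quality level 0.001, minimum distance 10 px)
corners = cv2.goodFeaturesToTrack(gray, 60, 0.001, 10)
corners = corners.astype(int)  # np.int0 is unavailable in recent NumPy releases
for i in corners:
    x, y = i.ravel()
    cv2.circle(image, (int(x), int(y)), 0, 255, -1)
coord = np.where(np.all(image == (255, 0, 0), axis=-1))
plt.imshow(image)
# Use tesseract to convert image into string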
text = pytesseract.image_to_string(image, lang='eng')
print("Equipment Number is:", text)
plt.show()
Output Image2
Note: It worked for one of the images but not for the others.
Output Image2

I found that a particular PyTesseract configuration option will find your text, along with some noise. The configuration options are explained here: https://stackoverflow.com/a/44632770/42346
For this task I chose page segmentation mode 11: "Sparse text. Find as much text as possible in no particular order."
Since PyTesseract returns more "text" in this mode, you can use a regular expression to filter out the noise.
This particular regular expression looks for two digits, a hyphen, five word characters (letters, digits, or underscores), a space, and then two word characters. It can be adjusted to your equipment number format as necessary, but I'm reasonably confident this is a good solution because nothing else in the returned text resembles this equipment number.
import re
import cv2
import pytesseract
image = cv2.imread('Fv0oe.jpg')
text = pytesseract.image_to_string(image, lang='eng', config='--psm 11')
for line in text.split('\n'):
    if re.match(r'^\d{2}-\w{5} \w{2}$', line):
        print(line)
Result (with no image processing necessary):
19-V1083 AI
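The filtering step can also be separated from the OCR call so it can be tried on plain strings. The following is a minimal sketch, not the answer's exact code: the helper name `find_equipment_numbers` is hypothetical, and the pattern assumes the space before the last two characters is optional, since the question writes the number as 19-V1083AI while the OCR output contains a space.

```python
import re

# Assumed format: two digits, a hyphen, five word characters,
# an optional space, then two word characters.
EQUIPMENT_RE = re.compile(r"\b\d{2}-\w{5} ?\w{2}\b")


def find_equipment_numbers(ocr_text):
    """Return every substring of the OCR text that looks like an equipment number."""
    return [
        match.group(0)
        for line in ocr_text.splitlines()
        for match in EQUIPMENT_RE.finditer(line)
    ]


# Hand-written sample standing in for noisy OCR output
sample = "ACME Corp\n 19-V1083 AI \n~~ |\n19-V1083AI\n"
print(find_equipment_numbers(sample))
```

Searching within each line, rather than matching the whole line, tolerates stray whitespace or noise characters around the number.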

Related

How to read numbers on screen efficiently (pytesseract)?

I'm trying to read numbers on the screen using pytesseract. It works, but it is slow and the results are poor. For example, with this image:
I can make this thresholded image:
and it reads 5852 instead of 585, which is understandable, but with different thresholding it can be far worse. For example, it can read 1 000 000 as 1 aaa eee, or 585 as 5385r (it even adds characters for no apparent reason).
Is there any way to force pytesseract to read only numbers, or is there something that works better than pytesseract?
My code:
from PIL import Image
from pytesseract import pytesseract as pyt
import test
pyt.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract.exe'
def tti2(location):
    image_file = location
    im = Image.open(image_file)
    text = pyt.image_to_string(im)
    print(text)
    # Strip letters and a few symbols so that only digits remain
    for character in "abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ*^&\n":
        text = text.replace(character, "")
    return text
test.th("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TEST.png")
print(tti2("C:\\Users\\Utilisateur\\Pictures\\greenshot\\flea market sniper\\TESTbis.png"))
code of "test" (it's for the thresholding) :
import cv2
from PIL import Image
def th(Path):
    img = cv2.imread(Path)
    # If your image is not already grayscale:
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    threshold = 60  # to be determined
    _, img_binarized = cv2.threshold(img, threshold, 255, cv2.THRESH_BINARY)
    pil_img = Image.fromarray(img_binarized)
    # Save the result next to the input, with a "bis" suffix
    Path = Path.replace(".png", "")
    pil_img.save(Path + "bis.png")

To force pytesseract to read only numbers, use the tessedit_char_whitelist config option with only digit values.
You can try to improve the results by following the Tesseract documentation:
Tesseract - Improving the quality of the output
I also suggest that you:
Use white for the background and black for the character font color.
Select the desired Tesseract psm mode. In this case I used psm mode 7, which treats the image as a single text line.
Use the tessedit_char_whitelist config to specify only the characters that you are searching for.
With that in mind, here is the code:
import cv2
import numpy as np
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
originalImage = cv2.imread('1.png')
grayImage = cv2.cvtColor(originalImage, cv2.COLOR_BGR2GRAY)
(_, blackAndWhiteImage) = cv2.threshold(grayImage, 127, 255, cv2.THRESH_BINARY_INV)
text = pytesseract.image_to_string(blackAndWhiteImage, config="--psm 7 --oem 3 -c tessedit_char_whitelist=0123456789")
print('Text: ', text)
cv2.imshow('Image result', blackAndWhiteImage)
cv2.waitKey(0)
cv2.destroyAllWindows()
And the desired result:

Image processing to get numbers from an image using cv2 and pytesseract

I am trying to extract the values from photographs of a Ritter biogas counter; specifically, I want to get the numbers on the black meter. Here is an example:
I am trying to do this in Python, using the cv2 and pytesseract libraries. Currently my script looks like this:
import argparse
import cv2
import pytesseract
from matplotlib import pyplot as plt
# Parsing input arguments
parser = argparse.ArgumentParser(description='Analyze an image from Ritter counter and extract the measured gas volume')
parser.add_argument("--img", required=True, help="Route to image to be analyzed")
args = parser.parse_args()
img=str(args.img)
# Reading photo as a grayscale image
img = cv2.imread(img, 0)
print("Pixels (height x width):")
print(img.shape[:2])
# Cropping image
img = img[377:420, 295:660]
# Thresholding the image with a range of fixed thresholds
# (THRESH_TOZERO zeroes pixels at or below the threshold and keeps the rest)
for i in range(65, 85, 1):
    thresh = cv2.threshold(img, i, 255, cv2.THRESH_TOZERO)[1]
    data = pytesseract.image_to_string(thresh, lang='eng', config='--psm 6')
    plt.imshow(thresh)
    plt.title("Fixed: " + str(i) + "; Result: " + data)
    plt.show()
However, glare differences across the image and the white lines of flash reflection in the counter's glass make it hard to preprocess the image for pytesseract. This is currently my best result:
I have tried OpenCV's adaptive thresholding, with no better results. The solution should handle several images similar to the uploaded one, each with small differences in the intensity and angle of the light reflection.

Pytesseract Image to String issue

Does anyone know how I can get better results than these?
Total Kills: 15,230,550
Kill Details: (recorded after 2019/10,/Z3]
993,151 331,129
1,330,450 33,265,533
5,031,168
This is what it returns, but it is meant to match the image posted below. I am new to Python; are there any parameters I can add to make it read the image better?
img = cv2.imread("kills.jpeg")
text = pytesseract.image_to_string(img)
print(text)
This is my code to read the image. Is there anything I can add to make it read better? Also, I added the two black boxes to cover images that I thought were interfering with the reading, but I still get the same issue.

The missing piece is the page segmentation mode (psm). Try setting it when the default does not give the desired result.
Looking at your image, the only artifacts are the black columns. Other than that, it looks like a binary image, which is suitable for Tesseract to recognize the characters and digits.
Let's try reading the image with the psm set to 6:
6 Assume a single uniform block of text.
print(pytesseract.image_to_string(img, config="--psm 6"))
The result will be:
Total Kills: 75,230,550
Kill Details: (recorded after 2019/10/23)
993,161 331,129
1,380,450 33,265,533
5,031,168
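When it is unclear which page segmentation mode suits an image, it can help to run several and compare the output. The following is a rough sketch under stated assumptions: `build_config` and `compare_psm` are hypothetical helper names, and the default list of modes is an arbitrary starting point rather than a recommendation.

```python
def build_config(psm, oem=None, whitelist=None):
    """Assemble a Tesseract config string such as '--psm 6 --oem 3'."""
    parts = [f"--psm {psm}"]
    if oem is not None:
        parts.append(f"--oem {oem}")
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def compare_psm(image, modes=(3, 4, 6, 11)):
    """Run OCR once per page segmentation mode and return {mode: text}."""
    # Imported here so build_config can be used without Tesseract installed
    import pytesseract

    return {
        mode: pytesseract.image_to_string(image, config=build_config(mode))
        for mode in modes
    }


print(build_config(6))
print(build_config(7, oem=3, whitelist="0123456789"))
```

Calling `compare_psm(img)` on the image loaded above would return one OCR result per mode, which makes it easier to see which mode reads the digits correctly.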
Update
The second way to solve the problem is to compute a binary mask and apply OCR to the mask features.
Binary-mask
Features of the binary-mask
As we can see, the result is slightly different from the input image. Now when we apply OCR, the result will be:
Total Kills: 75,230,550
Kill Details: (recorded after 2019/10/23)
993,161 331,129
1,380,450 33,265,533
5,031,168
Code:
import cv2
import numpy as np
import pytesseract
# Load the image
img = cv2.imread("LuKz3.jpg")
# Convert to hsv
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
# Get the binary mask
msk = cv2.inRange(hsv, np.array([0, 0, 0]), np.array([179, 255, 154]))
# Extract
krn = cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3))
dlt = cv2.dilate(msk, krn, iterations=5)
res = 255 - cv2.bitwise_and(dlt, msk)
# OCR
txt = pytesseract.image_to_string(res, config="--psm 6")
print(txt)
# Display
cv2.imshow("res", res)
cv2.waitKey(0)

What causes pytesseract to read either the top or bottom text-line of a dual-line image depending on whether opencv or pillow is used?

EDIT: I forgot to preprocess the image, which solves the reading issue (thanks to Nathancy). I am still wondering what makes Tesseract read only the top OR the bottom line of an unprocessed image (same image, two different outcomes).
Original:
I have an image that contains two lines of text:
random test image for pytesseract
When I open the image within Python (IDLE, Python 3.6) with PIL Image and use pytesseract to extract a string, it only extracts the last/bottom line correctly. The upper line of text is scrambled garbage (see the code section below). However, when I use OpenCV to open the image and use pytesseract to extract a string, it only extracts the top/upper line correctly while making a mess of the second/bottom line of text (see also the code section below).
Here is the code:
>>> from PIL import Image, ImageFilter
>>> import pytesseract
>>> pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
>>> import cv2
>>> img = Image.open(r"C:\Users\user\MyImage.png")
>>> img2 = cv2.imread(r"C:\Users\user\MyImage.png", cv2.IMREAD_COLOR)
>>> print(pytesseract.image_to_string(img2))
Pet Sock has 448/600 HP left
A ae eee PER eats ae
>>> print(pytesseract.image_to_string(img))
Le TL
JHE has 329/350 HP left.
When I use pytesseract.image_to_boxes on both img and img2 it will show the same bounding box for certain locations with a different letter (only showing 2 extracted lines which contain an identical box)
>>> print(pytesseract.image_to_boxes(img2))
A 4 6 10 16 0
>>> print(pytesseract.image_to_boxes(img))
J 4 6 10 16 0
When I use pytesseract.image_to_data on both img and img2, it shows very high confidence (95+) on the line it reads correctly and very low confidence (30 or lower) on the garbled line. Excel table output of image_to_data (edit: the Excel tables are for img2 and img, respectively)
I fiddled with the psm config values (I have tried them all). Settings 5, 7, 8, 9, 10, and 13 produced more garbage, and settings 0 and 2 gave an error; the rest gave results no different from the default (which I believe is 3).
I must be making some rookie mistake, but I can't get my head around why this is happening. If anyone can point me in the right direction, it would be awesome.
The image was just a fitting, but random, image for an OCR test that I had lying around. No further intentions than experimenting with pytesseract.

Whenever you perform OCR with Pytesseract, it is important to preprocess the image so that the text is black on a white background. We can do this with simple thresholding (the code below uses Otsu's method).
Output from Pytesseract
Pet Sock has 448/600 HP left
JHE has 329/359 HP left.
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.waitKey()
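To check which line Tesseract struggles with, as the question does with image_to_data, the word-level confidences can be averaged per text line. The following is a sketch that assumes the dictionary layout returned by `pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)`, that is, parallel lists under keys such as 'text', 'conf', 'block_num', 'par_num', and 'line_num'. The helper name `mean_confidence_per_line` and the sample values are made up for illustration.

```python
from collections import defaultdict


def mean_confidence_per_line(data):
    """Group recognized words by text line and average their confidence."""
    words = defaultdict(list)
    rows = zip(
        data["text"], data["conf"],
        data["block_num"], data["par_num"], data["line_num"],
    )
    for text, conf, block, par, line in rows:
        conf = float(conf)  # some versions return confidences as strings
        if conf < 0 or not text.strip():
            continue  # structural rows (page, block, line) carry conf -1
        words[(block, par, line)].append((text, conf))
    return {
        key: (
            " ".join(text for text, _ in items),
            sum(conf for _, conf in items) / len(items),
        )
        for key, items in words.items()
    }


# Hand-made sample in the assumed layout (not real OCR output)
sample = {
    "text": ["", "Pet", "Sock", "", "A", "ae"],
    "conf": [-1, 96, 94, -1, 30, 20],
    "block_num": [1, 1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1, 1],
    "line_num": [1, 1, 1, 2, 2, 2],
}
for key, (text, conf) in mean_confidence_per_line(sample).items():
    print(key, repr(text), round(conf, 1))
```

A line whose average confidence is far below the others is a good candidate for extra preprocessing or a different psm setting.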

Python - Pytesseract extracts incorrect text from image

I used the code below in Python to extract text from an image:
import cv2
import numpy as np
import pytesseract
from PIL import Image
# Path of working folder on Disk
src_path = "<dir path>"
def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)
    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)
    # Write image after removed noise
    cv2.imwrite(src_path + "removed_noise.png", img)
    # Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
    # Write the image after apply opencv to do some ...
    cv2.imwrite(src_path + "thres.png", img)
    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
    # Remove template file
    #os.remove(temp)
    return result
print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))
print("------ Done -------")
But the output is incorrect. The input file is:
The output received is '0001' instead of 'D001'
The output received is '3001' instead of 'B001'
What code changes are required to retrieve the right characters from the image, and how can I train pytesseract to return the right characters for all font types in the image (including bold characters)?

@Maaaaa has pointed out the exact reason for the incorrect text recognition by Tesseract.
You can still improve your final output by applying some post-processing steps to the Tesseract output. Here are a few points to consider:
Try disabling the dictionary check feature in the Tesseract input parameters.
Use heuristic information from your dataset. From the sample images in the question, I guess the first character of each word/sequence is a letter, so you can replace a leading digit in your output with the most probable letter based on your dataset,
for example '0' can be replaced with 'D', so '0001' -> 'D001', and similarly for other cases.
Tesseract also provides a character-level recognition confidence value, so use that information to replace characters with the ones having the highest confidence value.
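The leading-character heuristic can be sketched as follows. The mapping contains only the two confusions reported in the question ('0' for 'D' and '3' for 'B') and is an assumption about this dataset: in other data a leading '0' could just as well be an 'O'. The helper name `fix_leading_character` is hypothetical.

```python
# Assumed digit-to-letter confusions, taken from the two examples in the question.
# Extend or change this mapping based on your own data.
LEADING_DIGIT_TO_LETTER = {"0": "D", "3": "B"}


def fix_leading_character(token, mapping=LEADING_DIGIT_TO_LETTER):
    """Replace a leading digit with its likely letter; leave other tokens unchanged."""
    if token and token[0] in mapping:
        return mapping[token[0]] + token[1:]
    return token


for raw in ["0001", "3001", "D001"]:
    print(raw, "->", fix_leading_character(raw))
```

Only the first character is rewritten, so digits in the rest of the token, such as the zeros in '0001', are left untouched.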

Try different config parameters in the line below:
result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))
As shown below:
result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')
Try changing the psm value and compare the results.
-- Good Luck --
