Pytesseract with custom font incorrectly classifying numbers

Pytesseract with custom font incorrectly classifying numbers - python

I am trying to detect prices using pytesseract.
However I am having very bad results.
I have one large image with several prices in different locations.
These locations are constant so I am cropping the image down and saving each area as a new image and then trying to detect the text.
I know the text will only contain 0123456789$¢.
I trained my new font using trainyourtesseract.com.
For example, I take this image.
Double it's size, and threshold it to get this.
Run it through tesseract and get an output of 8.
Any help would be appreciated.
def getnumber(self, img):
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh, grey = cv2.threshold(grey, 50, 255, cv2.THRESH_BINARY_INV)
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, grey)
text = pytesseract.image_to_string(Image.open(filename), lang='Droid',
config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789.$¢')
os.remove(filename)
return(text)

You're on the right track. When preprocessing the image for OCR, you want to get the text in black with the background in white. The idea is to enlarge the image, Otsu's threshold to get a binary image, then perform OCR. We use --psm 6 to tell Pytesseract to assume a single uniform block of text. Look here for more configuration options. Here's the processed image:
Result from OCR:
2¢
Code
import cv2
import pytesseract
import imutils
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Resize, grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=500)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Perform text extraction
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imwrite('thresh.png', thresh)
cv2.waitKey()
Machine specs:
Windows 10
opencv-python==4.2.0.32
pytesseract==0.2.7
numpy==1.14.5

Related

OCR not performing well on clean image | Python Pytesseract

I have been working on project which involves extracting text from an image. I have researched that tesseract is one of the best libraries available and I decided to use the same along with opencv. Opencv is needed for image manipulation.
I have been playing a lot with tessaract engine and it does not seems to be giving the expected results to me. I have attached the image as an reference. Output I got is:
1] =501 [
Instead, expected output is
TM10-50%L
What I have done so far:
Remove noise
Adaptive threshold
Sending it tesseract ocr engine
Are there any other suggestions to improve the algorithm?
Thanks in advance.
Snippet of the code:
import cv2
import sys
import pytesseract
import numpy as np
from PIL import Image
if __name__ == '__main__':
if len(sys.argv) < 2:
print('Usage: python ocr_simple.py image.jpg')
sys.exit(1)
# Read image path from command line
imPath = sys.argv[1]
gray = cv2.imread(imPath, 0)
# Blur
blur = cv2.GaussianBlur(gray,(9,9), 0)
# Binarizing
thres = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 5, 3)
text = pytesseract.image_to_string(thresh)
print(text)
Images attached.
First image is original image. Original image
Second image is what has been fed to tessaract. Input to tessaract

Before performing OCR on an image, it's important to preprocess the image. The idea is to obtain a processed image where the text to extract is in black with the background in white. For this specific image, we need to obtain the ROI before we can OCR.
To do this, we can convert to grayscale, apply a slight Gaussian blur, then adaptive threshold to obtain a binary image. From here, we can apply morphological closing to merge individual letters together. Next we find contours, filter using contour area filtering, and then extract the ROI. We perform text extraction using the --psm 6 configuration option to assume a single uniform block of text. Take a look here for more options.
Detected ROI
Extracted ROI
Result from Pytesseract OCR
TM10=50%L
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Grayscale, Gaussian blur, Adaptive threshold
image = cv2.imread('1.jpg')
original = image.copy()
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.adaptiveThreshold(blur, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV, 5, 5)
# Perform morph close to merge letters together
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (5,5))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=3)
# Find contours, contour area filtering, extract ROI
cnts, _ = cv2.findContours(close, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)[-2:]
for c in cnts:
area = cv2.contourArea(c)
if area > 1800 and area < 2500:
x,y,w,h = cv2.boundingRect(c)
ROI = original[y:y+h, x:x+w]
cv2.rectangle(image, (x, y), (x + w, y + h), (36,255,12), 3)
# Perform text extraction
ROI = cv2.GaussianBlur(ROI, (3,3), 0)
data = pytesseract.image_to_string(ROI, lang='eng', config='--psm 6')
print(data)
cv2.imshow('ROI', ROI)
cv2.imshow('close', close)
cv2.imshow('image', image)
cv2.waitKey()

is it possible to extracting text inside colored background region using opencv in python?

I having the following table area from the original image:
I'm trying extract the text,from this table.But when using threshold the whole gray regions get darkening.For example like below,
The threshold type which i did used,
thresh_value = cv2.threshold(original_gray, 128, 255, cv2.THRESH_BINARY_INV +cv2.THRESH_OTSU)[1]
is it possible to extract and change gray background into white and lets remain text pixel as it is if black then?

You should use adaptive thresholding in Python/OpenCV.
Input:
import cv2
import numpy as np
# read image
img = cv2.imread("text_table.jpg")
# convert img to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# do adaptive threshold on gray image
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 11, 11)
# write results to disk
cv2.imwrite("text_table_thresh.jpg", thresh)
# display it
cv2.imshow("thresh", thresh)
cv2.waitKey(0)
Result

Tesseract detecting 1 and 0 as L and O

In this image tesseract is detecting the text as LOOOPCS but it is 1000PCS. Command I am using is
tesseract "item_04.png" stdout --psm 6
I have tried all psm values 0 to 13
As per suggestions by other blogs and questions on SO and internet following clipping of image as well as thresholding is also tried.
Also tried -c tessedit_char_whitelist=PCS0123456789 but that gives 00PCS.
But I am not getting 1000PCS. Can someone try these and let me know what am I missing?
Edit:
As per suggestion given by #nathancy, tried using - cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU which worked on this 1 and 0 but failed for below image. It is being detected as LL8gPcs:

You need to preprocess the image. A simple approach is to Otsu's threshold then invert the image so the text is in black with the background in white. Here's the processed image and the result using Pytesseract OCR with --psm 6.
Result
1000PCS
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Grayscale, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Invert and perform text extraction
thresh = 255 - thresh
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.waitKey()

How to extract text or numbers from images using python

I want to extract text (mainly numbers) from images like this
I tried this code
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open('1.jpg')
text = pytesseract.image_to_string(img, lang='eng')
print(text)
but all i get is this
(hE PPAR)

When performing OCR, it is important to preprocess the image so the desired text to detect is in black with the background in white. To do this, here's a simple approach using OpenCV to Otsu's threshold the image which will result in a binary image. Here's the image after preprocessing:
We use the --psm 6 configuration setting to treat the image as a uniform block of text. Here's other configuration options you can try. Result from Pytesseract
01153521976
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.png', 0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.waitKey()

Pytesseract missing lines

I'm using pytesseract and cv2 in my project and want to read the text from the following image (after cv2 processing):
The text seems to be clearly visible and pytesseract is returning the lines correctly but is missing "Da-Dumm" and "Ploink".
# load the image
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255,
cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
gray = cv2.medianBlur(gray, 1)
# store grayscale image as a temp file to apply OCR
filename = "Screens/{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
# load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
text = pytesseract.image_to_string(Image.open(filename),lang="deu")

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pytesseract with custom font incorrectly classifying numbers - python

Related

OCR not performing well on clean image | Python Pytesseract

is it possible to extracting text inside colored background region using opencv in python?

Tesseract detecting 1 and 0 as L and O

How to extract text or numbers from images using python

Pytesseract missing lines

Categories

Resources