How to extract text or numbers from images using python

How to extract text or numbers from images using python - python

I want to extract text (mainly numbers) from images like this
I tried this code
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = Image.open('1.jpg')
text = pytesseract.image_to_string(img, lang='eng')
print(text)
but all i get is this
(hE PPAR)

When performing OCR, it is important to preprocess the image so the desired text to detect is in black with the background in white. To do this, here's a simple approach using OpenCV to Otsu's threshold the image which will result in a binary image. Here's the image after preprocessing:
We use the --psm 6 configuration setting to treat the image as a uniform block of text. Here's other configuration options you can try. Result from Pytesseract
01153521976
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('1.png', 0)
thresh = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.waitKey()

Related

py_tesseract doesnt recognize numbers in this image

I am trying to use py-tesseract on google colabs to parse the following image containing readings from a meter. However it fails to get me the result expected. Looks like i need to do some pre-processing on the image. I am new to py-tesseract. Can you please help what i need to do to get this to work?
Here is my current code followed by output seen and the image:
!sudo apt install tesseract-ocr
!pip install pytesseract
import pytesseract
import shutil
import os
import random
try:
from PIL import Image
except ImportError:
import Image
image_path='drive/MyDrive/cropped_image.jpg'
print(pytesseract.image_to_string(image_path, config='--psm 13 --oem=3'))
Output
ey
Image being parsed
Cropped_image
Thanks

Try this:
print(pytesseract.image_to_string(image_path, config='-1 eng --oem 2 --psm 12'))
Output:
194735787

First, you need to preprocess the image which helps to reduce the noise and help in text extraction.
Extract text area from the image
Convert the image to grayscale and sharpen the image
Apply adaptive threshold
Clean the image by performing morphological operations and invert the image
import cv2
import numpy as np
img = cv2.imread('qKiJi.jpg')
#crop the text extraction area
crop = img[200:350, 100:400]
gray = cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY)
sharpen_kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
sharpen = cv2.filter2D(gray, -1, sharpen_kernel)
thresh = cv2.threshold(sharpen, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2,2))
close = cv2.morphologyEx(thresh, cv2.MORPH_CLOSE, kernel, iterations=1)
result = 255 - close
cv2.imshow('img', crop)
cv2.imshow('sharpen', sharpen)
cv2.imshow('thresh', thresh)
cv2.imshow('close', close)
cv2.imshow('result', result)
cv2.waitKey()

Pytesseract with custom font incorrectly classifying numbers

I am trying to detect prices using pytesseract.
However I am having very bad results.
I have one large image with several prices in different locations.
These locations are constant so I am cropping the image down and saving each area as a new image and then trying to detect the text.
I know the text will only contain 0123456789$¢.
I trained my new font using trainyourtesseract.com.
For example, I take this image.
Double it's size, and threshold it to get this.
Run it through tesseract and get an output of 8.
Any help would be appreciated.
def getnumber(self, img):
grey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh, grey = cv2.threshold(grey, 50, 255, cv2.THRESH_BINARY_INV)
filename = "{}.png".format(os.getpid())
cv2.imwrite(filename, grey)
text = pytesseract.image_to_string(Image.open(filename), lang='Droid',
config='--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789.$¢')
os.remove(filename)
return(text)

You're on the right track. When preprocessing the image for OCR, you want to get the text in black with the background in white. The idea is to enlarge the image, Otsu's threshold to get a binary image, then perform OCR. We use --psm 6 to tell Pytesseract to assume a single uniform block of text. Look here for more configuration options. Here's the processed image:
Result from OCR:
2¢
Code
import cv2
import pytesseract
import imutils
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Resize, grayscale, Otsu's threshold
image = cv2.imread('1.png')
image = imutils.resize(image, width=500)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Perform text extraction
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imwrite('thresh.png', thresh)
cv2.waitKey()
Machine specs:
Windows 10
opencv-python==4.2.0.32
pytesseract==0.2.7
numpy==1.14.5

Tesseract detecting 1 and 0 as L and O

In this image tesseract is detecting the text as LOOOPCS but it is 1000PCS. Command I am using is
tesseract "item_04.png" stdout --psm 6
I have tried all psm values 0 to 13
As per suggestions by other blogs and questions on SO and internet following clipping of image as well as thresholding is also tried.
Also tried -c tessedit_char_whitelist=PCS0123456789 but that gives 00PCS.
But I am not getting 1000PCS. Can someone try these and let me know what am I missing?
Edit:
As per suggestion given by #nathancy, tried using - cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU which worked on this 1 and 0 but failed for below image. It is being detected as LL8gPcs:

You need to preprocess the image. A simple approach is to Otsu's threshold then invert the image so the text is in black with the background in white. Here's the processed image and the result using Pytesseract OCR with --psm 6.
Result
1000PCS
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Grayscale, Otsu's threshold
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
# Invert and perform text extraction
thresh = 255 - thresh
data = pytesseract.image_to_string(thresh, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.waitKey()

Pytesseract missing lines

I'm using pytesseract and cv2 in my project and want to read the text from the following image (after cv2 processing):
The text seems to be clearly visible and pytesseract is returning the lines correctly but is missing "Da-Dumm" and "Ploink".
# load the image
image = cv2.imread(args["image"])
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.threshold(gray, 0, 255,
cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
gray = cv2.medianBlur(gray, 1)
# store grayscale image as a temp file to apply OCR
filename = "Screens/{}.png".format(os.getpid())
cv2.imwrite(filename, gray)
# load the image as a PIL/Pillow image, apply OCR, and then delete the temporary file
text = pytesseract.image_to_string(Image.open(filename),lang="deu")

How to improve text extraction from an image?

I am using pytesseract to extract text from images. Before extracting text with pytesseract, I use Pillow and cv2 to reduce noise and enhance the image:
import numpy as np
import pytesseract
from PIL import Image, ImageFilter, ImageEnhance
import cv2
img = cv2.imread('ss.png')
img = cv2.resize(img, (0,0), fx=3, fy=3)
cv2.imwrite("new.png", img)
img1 = cv2.imread("new.png", 0)
#Apply dilation and erosion
kernel = np.ones((2, 2), np.uint8)
img1 = cv2.dilate(img1, kernel, iterations=1)
img1 = cv2.erode(img1, kernel, iterations=1)
img1 = cv2.adaptiveThreshold(img1,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY_INV,11,2)
cv2.imwrite("new1.png", img1)
img2 = Image.open("new1.png")
#Enhance the image
img2 = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
img2 = enhancer.enhance(2)
img2.save('new2.png')
result = pytesseract.image_to_string(Image.open("new2.png"))
print(result)
I mostly get good results, but when I use some low quality/resolution images, I do not get the expected output. Can I improve this in my code?
Example:
Input:
new1.png:
new2.png:
The string that I get from the console is play. What could I change in my algorithm, so that I get the whole string extracted?
Any help would be greatly appreciated.

This is a late answer, but I just came across this. we can use Pillow and cv2 to reduce noise and enhance the image before extracting text from images using pytesseract. I hope it would help someone in future.
#import required library
src_path = "C:/Users/chethan/Desktop/"
def get_string(img_path):
# Read image with opencv
img = cv2.imread(img_path)
# Convert to gray
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Apply dilation and erosion to remove some noise
kernel = np.ones((1, 1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)
# Write image after removed noise
cv2.imwrite(src_path + "removed_noise.png", img)
# Apply threshold to get image with only black and white
#img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)
# Write the image after apply opencv to do some ...
cv2.imwrite(src_path + "thres.png", img)
# Recognize text with tesseract for python
result = pytesseract.image_to_string(Image.open(src_path + "thres.png"))
# Recognize text with tesseract for python
result = pytesseract.image_to_string(Image.open(img_path))
# Remove template file
# os.remove(temp)
return result
print(get_string(src_path + "dummy.png"))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to extract text or numbers from images using python - python

Related

py_tesseract doesnt recognize numbers in this image

Pytesseract with custom font incorrectly classifying numbers

Tesseract detecting 1 and 0 as L and O

Pytesseract missing lines

How to improve text extraction from an image?

Categories

Resources