How to get text from image

How to get text from image - python

I want to read the text from an image.
I use pytesseract in Python.
Here is my code:
import pytesseract
from PIL import Image
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = Image.open(r'a.jpg')
image.resize((150, 50),Image.ANTIALIAS).save("pic.jpg")
image = Image.open("pic.jpg")
captcha = pytesseract.image_to_string(image).replace(" ", "").replace("-", "").replace("$", "")
image
However, it returns empty string.
What should be the correct way?
Thanks.

i agree with #Jon Betts
tesseract is not very strong in OCR, only good in binary cases with right settings
CAPTCHAs ment to fool OCRs!
but if you really need to do it, you need to come up with the manual procedure for it,
i created the code below specifically for the type of CAPTCHAs that you gave (but its completely rigid and is not generalized/optimized for all cases)
psudo code
apply median blur
apply a threshold to get Blue colors only (binary image output from this stage)
apply opening to reduce small white pixels in binary image
give the image to tesseract with options:
limited whitelist of output chars
OEM 3 : tesseract + cube
PSM 8 : one word per image
Code
from PIL import Image
import pytesseract
import numpy as np
import cv2
img = cv2.imread('a.jpg')
img = cv2.medianBlur(img, 3)
# extract blue parts
img2 = np.zeros((img.shape[0], img.shape[1]), dtype=np.uint8)
cond = np.bitwise_and(img[:, :, 0] >= 100, img[:, :, 2] < 100)
img2[np.where(cond)] = 255
img = img2
# delete the noise
kernel = cv2.getStructuringElement(cv2.MORPH_CROSS, (3, 3))
img = cv2.morphologyEx(img, cv2.MORPH_OPEN, kernel)
str1 = pytesseract.image_to_string(Image.fromarray(img),
config='-c tessedit_char_whitelist=abcedfghijklmnopqrtuvwxyz0123456789 -oem 3 -psm 8')
cv2.imwrite("frame.png", img)
print(str1)
output
f2e4
image
in order to see full options of tesseract, type the following command tesseract --help-extra or refere to this_link

Tesseract is intended for performing OCR on text documents. In my experience it's good but a bit patchy even with very clean data.
In this case it appears you are trying to solve a CAPTCHA which is specifically designed to defeat OCR software. It's very likely you cannot use Tesseract to solve this issue, because:
It's not really designed for that
The scenario is adversarial:
The example is specifically designed to prevent what you are trying to do
If you could get it to work, the other party would likely change it to break again
If you want to proceed I would suggest:
Working on cleaning up the image before attempting to process it (can you get a nice readable black and white image?)
Train your own recognition network using a lot of examples

Related

Reading text from image, facing problem because of the font

I am trying to read this image and do the arithmetic operation in the image. For some reason i am not able to read 7 because of the font it has. I am relatively new to image processing. Can you please help me with solution. I tried pixeliating the image, but that did not help.
import cv2
import pytesseract
from PIL import Image
img = cv2.imread('modules/visual_basic_math/temp2.png', cv2.IMREAD_GRAYSCALE)
thresh = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY_INV+cv2.THRESH_OTSU)[1]
print(pytesseract.image_to_string(img, config='--psm 6'))
Response i am getting is -
+44 849559
+46653% 14
+7776197
+6415995
+*9156346
x4463310
+54Q%433
+1664 20%

Right now, tesseract is a bit outdated. There are much more powerful libraries. I recommend PaddleOCR. To install it:
pip install paddlepaddle
pip install paddleocr
Then:
from paddleocr import PaddleOCR
ocr = PaddleOCR(use_angle_cls=True, lang='es')
predictions = ocr.ocr("ietDJpng")[0]
filtered_text = []
for pred in predictions:
filtered_text.append(pred[-1][0])
filtered_text = [t.replace(" ", "") for t in filtered_text] # Remove spaces
['+4487559', '+4665714', '+7776157', ':6415995', ':9156346', 'x4463310', '-54q7433', '+1664207']
The output is not completely correct (the division symbols are : and one of them is wrong). Also, it confuses a 9 with a q). However, the results are better and the use of the library is as comfortable as tesseract.
Hope it helps!

Pytesseract not working sometimes on perfectly clear Images

I have very high resolution Engineering drawings/Circuit diagrams which contain text in a number of different regions. The aim is to extract text from such Images.
I am using pytesseract for this task. Applying pytesseract directly is not possible as in that case text from different regions gets jumbled up in the output. So I am identifying different bounding boxes containing the text and then iteratively passing these regions to pytesseract. The bounding box logic is working fine however sometimes I get no text from the cropped Image or partial text only. I would understand if the cropped images were low res or blurred but this is not the case. Please have a look at the attached couple of examples.
Image 1
Image 2
Here is my code to get the text:
source_img_simple = cv2.imread('image_name.tif')
source_img_simple_gray = cv2.cvtColor(source_img_simple, cv2.COLOR_BGR2GRAY)
img_text = pytesseract.image_to_string(source_img_simple_gray)
# Export the text file
with open('Output_OCR.txt', 'w') as text:
text.write(img_text)
Actual result for first image- No output (Blank text file)
For 2nd Image- Partial text (ALL MISCELLANEOUS PIPING AND CONNECTION SIZES)
I'm trying to know how to improve the quality of the OCR. I'm open to using any other tools as well (apart from pytesseract) if required. But can't use API's (Google, AWS etc.) as that is a restriction. Note: I have gone through the below post and it is not duplicate of my case as I have black text on white background:
Pytesseract dont reconize a very clear image

Since your images already look clean, no preprocessing is needed. A simple approach is to threshold and Gaussian blur to smooth the image before throwing it into Pytesseract. Here's the results after simple processing and the output from Pytesseract
SYSTEM CODE IS 3CAB, EXCEPT AS INDICATED.
For the 2nd image
ALL MISCELLANEOUS PIPING AND CONNECTION SIZES
SHALL BE 1 INCH. EXCEPT AS INDICATED.
We use the --psm 6 config flag since we want to treat the image as a single uniform block of text. Here's some additional configuration flags that could be useful
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
image = cv2.imread('2.jpg',0)
thresh = cv2.threshold(image, 150, 255, cv2.THRESH_BINARY_INV)[1]
result = cv2.GaussianBlur(thresh, (5,5), 0)
result = 255 - result
data = pytesseract.image_to_string(result, lang='eng',config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('result', result)
cv2.waitKey()

Why does tesseract fail to read text off this simple image?

I have read mountains of posts on pytesseract, but I cannot get it to read text off a dead simple image; It returns an empty string.
Here is the image:
I have tried scaling it, grayscaling it, and adjusting the contrast, thresholding, blurring, everything it says in other posts, but my problem is that I don't know what the OCR wants to work better. Does it want blurry text? High contrast?
Code to try:
import pytesseract
from PIL import Image
print pytesseract.image_to_string(Image.open(IMAGE FILE))
As you can see in my code, the image is stored locally on my computer, hence Image.open()

Trying something along the lines of
import pytesseract
from PIL import Image
import requests
import io
response = requests.get('https://i.stack.imgur.com/J2ojU.png')
img = Image.open(io.BytesIO(response.content))
text = pytesseract.image_to_string(img, lang='eng', config='--psm 7')
print(text)
with --psm values equal or larger than 6 did yield "Gm" for me.
If the image is stored locally (and in your working directory), just drop the response variable and change the definition of text with the lines
image_name = "J2ojU.png" # or whatever appropriate
text = pytesseract.image_to_string(Image.open(image_name), lang='eng', config='--psm 7')

There are several reasons:
Edges are not sharp and continuous (By sharp I mean smooth, not with teeth)
Image is too small, you need to resize
Font is missing (not mandatory, but trained font incredibly improve possibility of recognition)
Based on points 1) and 2) I was able to recognize text.
1) I resized image 3x and 2) I blurred the image to make edges smooth
import pytesseract
import cv2
import numpy as np
import urllib
import requests
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
from PIL import Image
def url_to_image(url):
resp = urllib.request.urlopen(url)
image = np.asarray(bytearray(resp.read()), dtype="uint8")
image = cv2.imdecode(image, cv2.IMREAD_COLOR)
return image
url = 'https://i.stack.imgur.com/J2ojU.png'
img = url_to_image(url)
retval, img = cv2.threshold(img,200,255, cv2.THRESH_BINARY)
img = cv2.resize(img,(0,0),fx=3,fy=3)
img = cv2.GaussianBlur(img,(11,11),0)
img = cv2.medianBlur(img,9)
cv2.imshow('asd',img)
cv2.waitKey(0)
cv2.destroyAllWindows()
txt = pytesseract.image_to_string(img)
print('recognition:', txt)
>> recognition: Gm
Note:
This script is good for testing any image on web
Note 2:
All processing is based on your posted image
Note 3:
Text recognition is not easy. Every recognition requires special processing. If you try this steps with different image, it may not work at all. Important is to try a lot of recognition on images so you understand what tesseract wants

Is it possible to change a part of the background color of an image, when the image is a table?

I am using pytesseract, pillow,cv2 to OCR an image and get the text present in the image. Since my input is a scanned PDF document, I first converted it into an image (JPEG) format and then tried extracting the text. I am only half way there. The input is a table and the titles are not being displayed, since the titles have a black background. I also tried getstructuringelement but unable to figure out a way Here is what I did-
import cv2
import os
import numpy as np
import pytesseract
#import pillow
#Since scanned PDF can't be handled by pdf2image, convert the scanned PDF into a JPEG format using the below code-
filename = path
from pdf2image import convert_from_path
pages = convert_from_path(filename, 500) for page in pages:
page.save("dest", 'JPEG')
imgname = "path"
oriimg = cv2.imread(imgname,cv2.IMREAD_COLOR)
cv2.imshow("original image", oriimg)
cv2.waitKey(0)
#img = cv2.resize(oriimg,None,fx=0.5,fy=0.5,interpolation=cv2.INTER_CUBIC)
img = cv2.resize(oriimg,(700,1500),interpolation=cv2.INTER_AREA)
#here length height
cv2.imshow("lol", img)
cv2.waitKey(0)
cv2.imwrite("changed_dimensionsimgpath", img)
import PIL.Image
image = cv2.imread(imgname,cv2.IMREAD_COLOR)
grayedimg = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) grayedimg =
cv2.threshold(grayedimg, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
cv2.imwrite("H://newim.jpg", grayedimg)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files (x86)\Tesseract-
OCR\tesseract.exe"
text = pytesseract.image_to_string(PIL.Image.open("path"))
print(text)
My input table looks like below. The regions which have black background are not being identified by OCR and not being extracted as text. --

I have 3 possible ways from an image-analysis perspective
Splitting
You can split the images in two part. First part is just your normal flow (load image, detect text on it). The second flow you first take the negative of the image (255 - img) and than detect text.
The two results will need to be merged afterwards.
difference filter
You can first apply a difference filter/edge detection this will high everything with a high contrast BUT can alter the shape of the letters if done to extreme or if some letters are way bigger.
contour finding + filling
Again an edge detection but now very thin and followed with an contour detection. This will redraw all letter in one color.

PyTesseract not recognising images

I am currently facing a problem with pytesseract where the software is unable to detect a number in this image:
https://i.stack.imgur.com/kmH2R.png
This is taken from a bigger image with threshold filter applied.
For some reason, pytesseract doesn't want to recognise the 6 in this image. Any suggestions? Here is my code:
image = #Insert raw image here. My code takes a screenshot.
image = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
image = cv2.medianBlur(image, 3)
rel, gray = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
# If you want to use the image from above, start here.
image = Image.fromarray(image)
string = pytesseract.image_to_string(image)
print(string)
EDIT: With some further investigation, my code works fine wit numbers containing 2 digits. But not those with singular digits.

pytesseract defaults to a mode that looks for large chunks of text (PSM_SINGLE_BLOCK or --psm 6), in order to have it detect a single character you need to run it with the option --psm 10 (PSM_SINGLE_CHAR). However, due to the black spots in the corners of the image you provided it detects them as random dashes and returns nothing in this mode since it things there's multiple characters, so in this case you need to use --psm 8 (PSM_SINGLE_WORD):
string = pytesseract.image_to_string(image, config='--psm 8')
The output from this will include those random characters so you would need to strip them after pytesseract runs or improve your bounding box around the numbers to remove any noise. Also, if all of your characters being detected are numbers you can add '-c tessedit_char_whitelist=0123456789' after '--psm 8' to improve the detection.
Some other minor tips to simplify your code is that cv2.imread has an option to read the image as black & white so you don't need to run cvtColor afterwards, just do:
image = cv2.imread('/path/to/image/6.png', 0)
also you can create the PIL image object within your call to pytesseract, so that line simplifies to:
string = pytesseract.image_to_string(Image.fromarray(img), config='--psm 8')
as long as you have 'from PIL import Image' at the top of your script.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.