How to extract the text displayed in a dot matrix LED display - python

I need to extract the number from a display(LED dot matrix)
Sample Image:
I am using the example code given by pytesseract to test. But I am failing.
try:
from PIL import Image
except ImportError:
import Image
import pytesseract
# If you don't have tesseract executable in your PATH, include the following:
pytesseract.pytesseract.tesseract_cmd = r'<full_path_to_your_tesseract_executable>'
# Example tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'
# Simple image to string
print(pytesseract.image_to_string(Image.open('test.png')))
From the sample image I need the output as the number and alphabets given.
If my image have only one number totally filled in whole display that number alone should be in result.. Am I using the right tools? or should I look into opencv and machine learning to extract characters from display.

Related

cv2 to tesseract directly without saving

import pytesseract
from pdf2image import convert_from_path, convert_from_bytes
import cv2,numpy
def pil_to_cv2(image):
open_cv_image = numpy.array(image)
return open_cv_image[:, :, ::-1].copy()
path='OriginalsFile.pdf'
images = convert_from_path(path)
cv_h=[pil_to_cv2(i) for i in images]
img_header = cv_h[0][:160,:]
#print(pytesseract.image_to_string(Image.open('test.png'))) I only found this in tesseract docs
Hello, is there a way to read the img_header directly using pytesseract without saving it,
pytesseract docs
pytesseract.image_to_string() input format
As documentation explains pytesseract.image_to_string() needs a PIL image as input.
So you can convert your CV image into PIL one easily, like this:
from PIL import Image
... (your code)
print(pytesseract.image_to_string(Image.fromarray(img_header)))
if you really don't want to use PIL!
see:
https://github.com/madmaze/pytesseract/blob/master/src/pytesseract.py
pytesseract is an easy wrapper to run the tesseract command def run_and_get_output() line, you'll see that it saves your image into an temporary file, and then gives the address to the tesseract to run.
hence, you can do the same with opencv, just rewrite the pytesseract only .py file to do it with opencv, although; i don't see any performance improvements whatsoever.
The fromarray function allows you to load the PIL document into tesseract without saving the document to disk, but you should also ensure that you don`t send a list of pil images into tesseract. The convert_from_path function can generate a list of pil images if a pdf document contains multiple pages, therefore you need to send each page into tesseract individually.
import pytesseract
from pdf2image import convert_from_path
import cv2, numpy
def pil_to_cv2(image):
open_cv_image = numpy.array(image)
return open_cv_image[:, :, ::-1].copy()
doc = convert_from_path(path)
for page_number, page_data in enumerate(doc):
cv_h= pil_to_cv2(page_data)
img_header = cv_h[:160,:]
print(f"{page_number} - {pytesseract.image_to_string(Image.fromarray(img_header))}")

Python - Image to text enclosed in pentagon shape pytesseract

I am trying to ready Energy Efficiency Rating from EPC certificate using python. Usually EPC certificate comes in PDF format. I have converted PDF into image already and using pytesseract to get text from image. However I am not getting expected results.
Sample Image:
Expected output:
Current rating : 79, Potential rating : 79
What I have tried so far:
from pdf2image import convert_from_path
import pytesseract
from PIL import Image
pages = convert_from_path(r'my_file.pdf', 500)
img =pages[0].save(r'F:\Freelancer\EPC rating\fwdepcs\out.jpg', 'JPEG')
text = pytesseract.image_to_string(Image.open(r'F:\Freelancer\EPC rating\fwdepcs\out.jpg'))
However text does not capture 79.
I also tried cv2 pattern matching and shape detection, but those not worked for other reasons.
You say that you have convert this pdf to image file.
Use PIL(.crop()) or opencv to crop picture.And crop it like this:
And use PIL Image.convert("1"),maybe tesseract can catch this number.
If not,I think you can use jTessBoxEditor to train tesseract.

How to find a file/ data from a given data set in python- opencv image processing project?

I have a data set of images in an image processing project. I want to input an image and scan through the data set to recognize the given image. What module/ library/ approach( eg: ML) should I use to identify my image in my python- opencv code?
To find exactly the same image, you don't need any kind of ML. The image is just an array of pixels, so you can check if the array of the input image equals that of an image in your dataset.
import glob
import cv2
import numpy as np
# Read in source image (the one you want to match to others in the dataset)
source = cv2.imread('test.jpg')
# Make a list of all the images in the dataset (I assume they are images in a directory)
filelist = glob.glob(r'C:\Users\...\Images\*.JPG')
# Loop through the images, read them in and check if an image is equal to your source
for file in filelist:
img = cv2.imread(file)
if np.array_equal(source, img):
print("%s is the same image as source" %(file))
break

Extracting text/characters from Xray images using python

I am trying to extract the characters in the x-ray, I have tried using pytesseract to extract but couldn't succeed, I used a canny edge to remove the noise and extract, but still, I am not able to extract the text/chars. Can you please help/guide me to extract the text/chars
Try this tuotrial to locate the text:
https://www.pyimagesearch.com/2018/08/20/opencv-text-detection-east-text-detector/
Then once you locate you can isolate and use tesseract to recognize it.
If it's a DICOM file, you could use gdcm to get the attribute. It's available on python too.
pytesseract should be sufficient, if the file is in 'png' or 'jpg' form.
now suppose image is the name of your image. Please write the below code.
from PIL import Image
from pytesseract import image_to_string
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
im = Image.open('F:/kush/invert.jpg')
pytesseract.image_to_string(im, lang = 'eng')

Python - Read number in image with Pytesseract

I am using a combination of pyautogui and pytesseract to capture small regions on the screen and then pull the number/text out of the region. I have written script that has read the majority of captured images perfectly, but single digit numbers seem to cause an issue for it. For example small regions of an image containing numbers are saved to .png files the numbers 11, 14, and 18 were pulled perfectly, but the number 7 is just returning as a blank string.
Question: What could be causing this to happen?
Code: Scaled down drastically to make it every easy to follow:
def get_text(image):
return pytesseract.image_to_string(image)
answer2 = pyautogui.screenshot('answer2.png',region=(727, 566, 62, 48))
img = Image.open('answer2.png')
answer2 = get_text(img)
This code is repeated 4 times, once for each image, it worked for 11,14,18 but not for 7.
Just to slow the files being read here is a screenshot of the images after they were saved through the screenshot command.
https://gyazo.com/0acbf5be2d970abeb29561113c171fbe
here is a screenshot of what I am working from:
https://gyazo.com/311913217a1302382b700b07ad3e3439
I found question Why pytesseract does not recognise single digits? and in comments I found option --psm 6.
I checked tesseract with option --psm 6 and it can recognize single digit on your image.
tesseract --psm 6 number-7.jpg result.txt
I checked pytesseract.image_to_string() with option config='--psm 6' and it can recognize single digit on your image too.
#!/usr/bin/env python3
from PIL import Image
import pytesseract
img = Image.open('number-7.jpg')
print(pytesseract.image_to_string(img, config='--psm 6'))

Categories