Extraction of characters from forms with boxed field inputs - python

I am trying to extract characters from all fields in forms with boxes such as the one shown here:
Sample printed form
My current approach is as follows (a rough sketch of the pipeline is included after the steps):
Crop the field from the form based on some standard format.
Image pre-processing and finding contours around the boxes of fields.
Based on the number of boxes in that field, crop each small box and run character recognition on these cropped character images.
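Roughly, a sketch of this pipeline looks like the following (the field coordinates, box count and file name are placeholder values; the real ones come from the standard form layout):
import cv2

# Placeholder values for one field in the standard form layout
FIELD_X, FIELD_Y, FIELD_W, FIELD_H = 100, 200, 400, 60
NUM_BOXES = 8  # number of character boxes in this field

form = cv2.imread('form.png')

# Step 1: crop the field from the form based on the standard format
field = form[FIELD_Y:FIELD_Y + FIELD_H, FIELD_X:FIELD_X + FIELD_W]

# Step 2: pre-process the field crop
gray = cv2.cvtColor(field, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Step 3: crop each small box with straight vertical cuts and recognize it
box_width = FIELD_W // NUM_BOXES
for i in range(NUM_BOXES):
    char_crop = binary[:, i * box_width:(i + 1) * box_width]
    # char_crop is passed to the character recognition module here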
The boxes may be slightly tilted in the images. I use an alignment algorithm but it still does not always straighten the box edges. This can be seen in this image:
Aligned date crop
On such images, when I crop characters using straight lines (step 3 of the approach above), the box edges are also included in the crop, which confuses the character recognition module. For example, the digit '3' together with a box edge is sometimes recognized as '31'.
I want to use pre-trained models only, so I am looking for a better way to extract characters from the boxed fields properly.
I would highly appreciate any help provided by the SO community.

As the box edges are generally thinner than the text inside them (as in your case), we can leverage this.
By applying morphological closing (dilation followed by erosion) with a horizontal kernel, we can turn the thin vertical box edges white, which helps OCR. Some noise may remain after processing, but it shouldn't hinder OCR accuracy. The size of the kernel depends on the width of the border lines, so tweak it for your case.
Here is the sample code:
import cv2
import numpy as np

im = cv2.imread('sample_image.png')
im = cv2.cvtColor(im, cv2.COLOR_BGR2GRAY)

# Horizontal kernel: wide enough to close the thin vertical box edges,
# but narrower than the character strokes. Tweak for your line width.
k1 = (4, 1)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, k1)

# Closing (dilation followed by erosion) removes thin dark vertical lines
im = cv2.morphologyEx(im, cv2.MORPH_CLOSE, kernel, iterations=1)

# Binarize to clean up any remaining grey artefacts
_, im = cv2.threshold(im, thresh=200, maxval=255, type=cv2.THRESH_BINARY)
cv2.imwrite('sample_output.png', im)
And here are the images:
sample_image.png
sample_output.png
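If you then feed the cleaned crop to a pre-trained OCR engine (pytesseract is used here as one option; the digit whitelist and page segmentation mode are assumptions for a numeric field), the box edges should no longer be read as extra characters:
import cv2
import pytesseract

clean = cv2.imread('sample_output.png')
# Treat the crop as a single text line and restrict recognition to digits
text = pytesseract.image_to_string(
    clean, config='--psm 7 -c tessedit_char_whitelist=0123456789')
print(text.strip())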

Related

Extract Data from an Image with Python/OpenCV/Tesseract?

I'm trying to extract some contents from a cropped image. I tried pytesseract and OpenCV template matching, but the results are very poor. OpenCV template matching sometimes fails due to the poor quality of the icons, and Tesseract gives me a line of text with false characters.
I'm trying to grab the values like this:
0:26 83 1 1
Any thoughts or techniques?
A technique you could use would be to blur your image. From the looks of it, the image is already fairly low-res and blurry, so you wouldn't need to blur it very hard. Whenever I need a blur function in OpenCV, I normally choose Gaussian blur, since it blurs each pixel based on its surrounding pixels. Once the image is blurred, I would threshold, or adaptive threshold, the image. Once you have gotten this far, the image should be mostly hard lines with little bits of short lines mixed in between. Afterwards, dilate the thresholded image just enough for the areas with many hard edges to connect. Once the dilation has been performed, find the contours of that image and sort them based on their height within the image. Since I assume the position of those numbers won't change, you only have to sort your contours once. Afterwards, once you have sorted your contours, just create bounding boxes over them and read the text from there.
However, if you want to do this the quick and dirty way, you can always just manually create your own ROIs around each area you want to read and do it that way. A rough sketch of the first method follows the outline below.
First Method
Gaussian blur the image
Threshold the image
Dilate the image
Find Contours
Sort Contours based on height
Create bounding boxes around relevant contours
Second Method
Manually create ROI's around the area you want to read text from
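Here is a minimal sketch of the first method (the kernel sizes, threshold choice and area/sort criteria are assumptions you would tune to your HUD layout):
import cv2

image = cv2.imread('screenshot.png')  # hypothetical input crop
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# 1. Gaussian blur to smooth noise (the image is low-res already, so keep it light)
blur = cv2.GaussianBlur(gray, (3, 3), 0)

# 2. Threshold (Otsu picks the threshold automatically)
_, thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# 3. Dilate so nearby strokes connect into one blob per value
dilated = cv2.dilate(thresh, cv2.getStructuringElement(cv2.MORPH_RECT, (5, 3)))

# 4. Find contours (OpenCV 4.x returns two values)
contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# 5. Sort contours by their vertical position in the image
contours = sorted(contours, key=lambda c: cv2.boundingRect(c)[1])

# 6. Draw bounding boxes around the relevant contours
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 1)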

Recognizing digits with OpenCV and Python (Simple digit OCR)

So I'm trying to create a program that can see what number an image is and print the integer in the console. (I'm using python 3)
For example that the program recognizes that the following image (an actual image the program has to check) is number 2:
I've tried to just compare it with another image containing a 2 using cv2.matchTemplate(), but the blue pixels' RGB values are a little bit different in each image, and the image could be a bit larger or smaller. For example the following image:
It also has to tell it apart from all the other blue number images (0-9), for example the following one:
I've tried multiple template matching code samples and made a folder with number 0-9 images as templates, but each time almost every template gets matched somewhere in the number that needs to be recognized. For example, number 5 gets recognized in an image that is number 2. And if it doesn't match all of them, it matches the wrong one(s).
The ones I've tried:
Answer from this question
Both codes from this tutorial
And the one from this tutorial
but like I said before it comes with those problems.
I've also tried measuring the percentage of blue in each image, but those values were too close together to tell the numbers apart.
Does anyone have a solution? Am I being stupid for using cv2.matchTemplate(), and is there a much simpler option? (I don't mind using a library for it, because this is part of a bigger piece of code, but I prefer to code it myself instead of relying on libraries.)
Instead of using Template Matching, a better approach is to use Pytesseract OCR to read the number with image_to_string(). But before performing OCR, you need to preprocess the image. For optimal OCR performance, the preprocessed image should have the desired text/number/characters to OCR in black with the background in white. A simple preprocessing step is to convert the image to grayscale, Otsu's threshold to obtain a binary image, then invert the image. Here's a visualization of the preprocessing step:
Input image -> Grayscale -> Otsu's threshold -> Inverted image ready for OCR
Result from Pytesseract OCR
2
Here's the results with the other images:
2
5
We use the --psm 6 configuration option to assume a single uniform block of text. See here for more configuration options.
Code
import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
# Load image, grayscale, Otsu's threshold, then invert
image = cv2.imread('1.png')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
invert = 255 - thresh
# Perform OCR with Pytesseract
data = pytesseract.image_to_string(invert, lang='eng', config='--psm 6')
print(data)
cv2.imshow('thresh', thresh)
cv2.imshow('invert', invert)
cv2.waitKey()
Note: If you insist on using Template Matching, you need to use scale variant template matching. Take a look at how to isolate everything inside of a contour, scale it, and test the similarity to an image? and Python OpenCV line detection to detect X symbol in image for some examples. If you know for certain that your images are blue, then another approach would be to use color thresholding with cv2.inRange() to obtain a binary mask image then apply OCR on the image.
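For the color-thresholding route, a minimal sketch could look like this (the HSV bounds for the blue digits are assumptions you would tune to your images):
import cv2
import numpy as np
import pytesseract

image = cv2.imread('1.png')
hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)

# Assumed HSV range for the blue digits; adjust to your actual colors
lower_blue = np.array([90, 80, 80])
upper_blue = np.array([130, 255, 255])
mask = cv2.inRange(hsv, lower_blue, upper_blue)

# Tesseract expects dark text on a white background, so invert the mask
ocr_ready = 255 - mask
print(pytesseract.image_to_string(ocr_ready, config='--psm 6'))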
Given the lovely regular input, I expect that all you need is simple comparison to templates. Since you neglected to supply your code and output, it's hard to tell what might have gone wrong.
Very simply (a minimal sketch follows these steps) ...
Rescale your input to the size of your templates.
Calculate any straightforward matching evaluation of the input against each of the 10 templates. A simple matching count should suffice: how many pixels match between the two images.
The template with the highest score is the identification.
You might also want to set a lower threshold for declaring a match, perhaps based on how well that template matches each of the other templates: any identification has to clearly exceed the match between two different templates.
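A minimal sketch of that idea (the template file names are hypothetical, and all templates are assumed to share one size and to be binarized the same way as the input):
import cv2
import numpy as np

# Hypothetical file names: ten grayscale digit templates of identical size
templates = {d: cv2.imread(f'template_{d}.png', cv2.IMREAD_GRAYSCALE) for d in range(10)}

def classify(digit_img):
    # digit_img: grayscale crop of the unknown digit
    th, tw = next(iter(templates.values())).shape
    resized = cv2.resize(digit_img, (tw, th))
    _, binary = cv2.threshold(resized, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    scores = {}
    for digit, tmpl in templates.items():
        _, tmpl_bin = cv2.threshold(tmpl, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        # Matching count: how many pixels agree between input and template
        scores[digit] = int(np.count_nonzero(binary == tmpl_bin))
    return max(scores, key=scores.get)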
If you don't have access to an OCR engine, just know you can build your own OCR system via a KNN classifier. In this example, the implementation should not be very difficult, as you are only classifying numbers. OpenCV provides a very straightforward implementation of KNN.
The classifier is trained using features calculated from samples from known instances of classes. In this case, you have 10 classes (if you are working with digits 0 - 9), so you can prepare a "template" with your digits, extract some features, train the classifier and use it to classify new instances.
All of this can be done in OpenCV without extra libraries, and KNN (for this kind of application) has a more than acceptable accuracy rate. A minimal sketch is shown below.
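This is only a sketch using OpenCV's built-in KNN (the file names, the 20x20 sample size and the raw-pixel features are assumptions; in practice you would train on several samples per digit):
import cv2
import numpy as np

# Hypothetical training data: one 20x20 grayscale sample image per digit class
samples, labels = [], []
for digit in range(10):
    img = cv2.imread(f'digit_{digit}.png', cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (20, 20))
    samples.append(img.reshape(-1).astype(np.float32))  # raw pixels as features
    labels.append(digit)

responses = np.array(labels, dtype=np.float32)[:, np.newaxis]
knn = cv2.ml.KNearest_create()
knn.train(np.array(samples), cv2.ml.ROW_SAMPLE, responses)

# Classify a new digit image the same way the samples were prepared
query = cv2.resize(cv2.imread('unknown.png', cv2.IMREAD_GRAYSCALE), (20, 20))
query = query.reshape(1, -1).astype(np.float32)
_, result, _, _ = knn.findNearest(query, 1)
print(int(result[0][0]))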

Detect rectanglular signature fields in document scans using OpenCV

I am trying to extract the big rectangular boxes containing signatures from document images. Since I don't have training data (for deep learning), I want to cut these rectangular boxes (3 in all images) from the images using OpenCV.
Here is what I tried:
import numpy as np
import cv2
img = cv2.imread('S-0330-444-20012800.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray, 127, 255, 1)
contours, h = cv2.findContours(thresh, 1, 2)
for cnt in contours:
    approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
    if len(approx) == 4:
        cv2.drawContours(img, [cnt], 0, (26, 60, 232), -1)
cv2.imshow('img', img)
cv2.waitKey(0)
sample image
With the above code, I get a lot of small square-like contours (around 152 of them) and, of course, not the 3 boxes.
Replies appreciated. [sample image is attached]
I would suggest you read up on template matching. There is also a good OpenCV tutorial on this.
For your use case, the idea would be to generate a stereotyped image of a rectangular box with the same shape (width/height ratio) as the boxes found on your documents. Depending on whether your input images always show the document at the same scale or not, you would need to either resize the inputs to keep their magnification constant, or operate with a template bank (e.g. an array of box templates at various scalings).
Briefly, you would then cross-correlate the template box(es) with the input image and (in case of well-matched scaling) would find ideally relatively sharp peaks indicating the centers of your document boxes.
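A minimal sketch of the single-scale case (the template file and the 0.7 match threshold are assumptions; for varying scale you would loop over a bank of resized templates):
import cv2
import numpy as np

img = cv2.imread('S-0330-444-20012800.jpg', cv2.IMREAD_GRAYSCALE)
# Hypothetical stereotyped template of an empty signature box at the expected scale
template = cv2.imread('box_template.png', cv2.IMREAD_GRAYSCALE)
th, tw = template.shape

# Normalized cross-correlation; peaks mark likely box locations
result = cv2.matchTemplate(img, template, cv2.TM_CCOEFF_NORMED)
ys, xs = np.where(result >= 0.7)  # assumed match threshold
for x, y in zip(xs, ys):
    cv2.rectangle(img, (x, y), (x + tw, y + th), 0, 2)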
In the code above, use image pyramids (to merge unwanted contour noise) together with cv2.findContours. After that, filtering the list of contours by area with cv2.contourArea will keep only the bigger squares.
There is also an alternative solution. Looking at the images, we can see that the signature text is usually bigger than the printed text in that ROI, so we can filter out contours smaller than the signature contours and extract only the signature.
It is always good to remove noise before using cv2.findContours, e.g. with dilation, erosion, or blurring.
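A rough sketch of that filtering, building on the question's code (the area cutoff is an assumption to tune against your scans):
import cv2

img = cv2.imread('S-0330-444-20012800.jpg')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Downscale and upscale with an image pyramid to merge small contour noise
gray = cv2.pyrUp(cv2.pyrDown(gray))

_, thresh = cv2.threshold(gray, 127, 255, cv2.THRESH_BINARY_INV)
contours, _ = cv2.findContours(thresh, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Keep only large, roughly rectangular contours; the area cutoff is an assumption
MIN_AREA = 10000
for cnt in contours:
    if cv2.contourArea(cnt) > MIN_AREA:
        approx = cv2.approxPolyDP(cnt, 0.02 * cv2.arcLength(cnt, True), True)
        if len(approx) == 4:
            x, y, w, h = cv2.boundingRect(cnt)
            cv2.rectangle(img, (x, y), (x + w, y + h), (26, 60, 232), 2)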

Opencv - Extracting data from in-game images

I need some help with an OpenCV project I'm working on. I'm taking images from a computer game (in this case, Fortnite), and I would like to extract different elements from them, eg. timer value, quantities of materials, health and shield etc.
Currently I perform a series of image preprocessing functions until I get a binary image, followed by locating the contours in the image and then sending those contours to a machine learning algorithm (K-Nearest-Neighbours).
I am able to succeed in a lot of cases, but there are some images where I don't manage to find some of the contours, so I don't find all the data.
An important thing to note is that I use the same preprocessing pipeline for all images, because I'm looking for as robust a solution as I can manage.
I would like to know what I can do to improve the performance of my program:
Is KNN a good model for this sort of task, or are there other models that might give me better results?
Is there any way to recognise characters without locating contours?
How can I make my preprocessing pipeline as robust as possible, given the fact that there is a lot of variance in the background across all images?
My goal is to process the images as fast as possible, starting out at a minimum of 2 images per second.
Thanks in advance for any help or advice you can give me!
Here is an example image before preprocessing
Here is the image after preprocessing, in this example I cannot find the contour for the 4 on the right side.
Quite simply, enlarging the image might help, since it increases the dark border of the number.
I threw together some code that does that. The result could be improved, but my point here is to show that the 4 can now be detected as a contour. To increase efficiency I only selected contours within a certain size range.
Also, since it is part of the HUD, the location on screen is usually the same. If so, you can get a great performance increase by only selecting the area with the values (described here), as I have done manually.
Finally, since the numbers have a consistent shape, you could try matchShapes as an alternative to kNN to recognize the numbers (a minimal sketch is included after the code below). I don't know how they compare in performance though, so you'll have to try that out yourself.
Result:
Code:
import numpy as np
import cv2

# load image
img = cv2.imread("fn2.JPG")

# enlarge image
img = cv2.resize(img, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)

# convert to grayscale (imread returns BGR, so convert from BGR)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# create mask using threshold
ret, mask = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY)

# find contours in mask (OpenCV 4.x returns two values)
contours, hierarchy = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

# draw digit-sized contours on the image
for cnt in contours:
    if 200 < cv2.contourArea(cnt) < 3000:
        cv2.drawContours(img, [cnt], 0, (255, 0, 0), 2)

# show images
cv2.imshow("Mask", mask)
cv2.imshow("Image", img)
cv2.waitKey(0)
cv2.destroyAllWindows()
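As a follow-up to the matchShapes suggestion, here is a minimal sketch of comparing a detected contour against stored digit-template contours (the template file names and the acceptance threshold are assumptions):
import cv2

# Hypothetical templates: one binary image per digit, each containing a single shape
template_contours = {}
for digit in range(10):
    tmpl = cv2.imread(f'digit_{digit}.png', cv2.IMREAD_GRAYSCALE)
    _, tmpl_bin = cv2.threshold(tmpl, 127, 255, cv2.THRESH_BINARY)
    cnts, _ = cv2.findContours(tmpl_bin, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    template_contours[digit] = max(cnts, key=cv2.contourArea)

def recognize(cnt):
    # Lower matchShapes score means more similar shapes
    scores = {d: cv2.matchShapes(cnt, tc, cv2.CONTOURS_MATCH_I1, 0.0)
              for d, tc in template_contours.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] < 0.3 else None  # assumed acceptance threshold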

Best approach to slice text image into characters

I need to process some text images (images from reCAPTCHA). I want to slice each image into pieces, each being the bounding box of one character.
The images contain both light and dark font colors, and all images come with some white margin space.
For example:
I have preprocessed the images into grayscale and de-skewed them.
How can I proceed with slicing the image into characters?
How can I get rid of the white margin? Is there a convenient way to fill the margin with a color similar to the text background?
This problem can be solved with OpenCV by finding contours. Have a look at the findContours function in the OpenCV documentation; it helped me solve this problem. Use size ranges to limit the noise contours. A rough sketch of slicing out each character from the resulting contours follows the snippet below.
import cv2
image = cv2.imread('image.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
ret, thresh = cv2.threshold(gray, 150, 255, 0)
# OpenCV 4.x returns (contours, hierarchy)
contours, _ = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
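Continuing from the snippet above, a rough sketch of slicing each character out via its bounding box (the contour-area limits are assumptions used to filter noise):
# Keep only character-sized contours; the area range is an assumption to tune
boxes = [cv2.boundingRect(c) for c in contours if 50 < cv2.contourArea(c) < 5000]

# Sort left to right so the slices come out in reading order
boxes.sort(key=lambda b: b[0])

characters = [gray[y:y + h, x:x + w] for (x, y, w, h) in boxes]
for i, char_img in enumerate(characters):
    cv2.imwrite(f'char_{i}.png', char_img)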
