I am currently working on handwritten character recognition from a form image. Everything works pretty well so far, but I was hoping I could get some insight on extracting characters from an image of a boxed or "combed" field.
For example, after a specific field has been cropped and binarized (with Otsu's method), I'm left with something like this:
Binary Field Image
For character recognition, I have a CNN model trained on the EMNIST dataset. In order to predict the characters, I have to extract them one by one. What would be the best way to extract the characters from the boxes?
Currently, I am using a fairly trivial method: I find groupings of non-white horizontal and vertical lines of pixels that take up a certain proportion of the image width or height. For example, I find horizontal rows that consist of at least 90% non-white pixels and group those with consecutive y coordinates into a rectangle object representing one horizontal line of the box (which should yield two lines/rectangles, for the top and bottom). For vertical lines I do something similar, except I end up with {2 * charLength} lines. I then use these coordinates to crop out each character. However, it is not perfect.
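In rough code, the horizontal pass looks like this (a simplified sketch, not my exact code; the helper name and example image are illustrative, assuming ink is dark and background is white):

```python
import numpy as np

def find_line_rows(binary, min_fill=0.9):
    """Return runs of consecutive rows whose fraction of non-white
    (dark) pixels is at least min_fill -- candidate horizontal box lines."""
    ink = binary < 128                       # True where the pixel is dark
    fill = ink.mean(axis=1)                  # per-row fraction of ink
    rows = np.flatnonzero(fill >= min_fill)  # rows dense enough to be a line
    runs, start, prev = [], None, None
    for r in rows:
        if start is None:
            start = prev = r
        elif r == prev + 1:
            prev = r
        else:
            runs.append((int(start), int(prev)))
            start = prev = r
    if start is not None:
        runs.append((int(start), int(prev)))
    return runs

# Example: a white 10x10 image with dark lines at rows 1 and 8
img = np.full((10, 10), 255, dtype=np.uint8)
img[1, :] = 0
img[8, :] = 0
print(find_line_rows(img))  # [(1, 1), (8, 8)]
```

The same pass over `axis=0` gives the vertical lines. Being just a projection, it inherits the weaknesses listed below (a full-height "1" also produces a dense column).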
Here are some issues with this:
The field is not always perfectly straight (the rotation is slightly off). I am already applying SURF matching and a homography to the original image, which does a very good job, but it is not perfect.
If a user writes a "1" that takes up the entire height of the box, it will most likely be falsely detected as a vertical line of the box.
The coordinates don't always line up between the original image and the input image, so part of the field sometimes gets cropped out. To work around this, I currently extract some of the surrounding area of the field (as seen in the image), but this can also cause problems because the form can have other vertical and horizontal lines very close to some fields, which makes my current trivial method break.
Is there a better way to do this? One thing to note is that I have to keep performance in mind. I was thinking of doing SURF matching again for just the field image, but doing it for the entire form page already takes very long, so I am not sure I want to repeat it for every field that I read.
I was hoping someone would have suggestions. I am using OpenCV for image processing, but a solution in words is fine. Thank you.
I know this is a bit of a late response, but I ended up using OpenCV's contour feature to extract the character portions.
When OpenCV finds the contours of an image, it sets up a hierarchy of contours. The first level ended up being the very outer box, so I was able to just grab the contours of the next level down to extract the characters.
It didn't work 100% in the beginning, but after some additional image processing I was able to extract the characters properly for at least 99% of cases.
Related
I have a set of images of phone numbers. Unfortunately the image always has parentheses, ( and ), and a dash, -, embedded as shown below:
Mind you, this is just one variation of the overlapping problem. Sometimes the - will be overlapping with a 1, for example.
This is severely limiting my ability to OCR the number accurately. Using RETR_TREE doesn't improve things, because 1 and ( or 3 and ) get contoured as one object.
This seemed like a variant of a previous issue that uses groupRectangles(), but I am not finding any improvement. I'm wondering if anyone could direct me to where I might be able to solve this, or to any relevant SO questions.
Thanks.
I would try template matching the parentheses and hyphens. That would help you identify where the items of interest are in the image. The first step would be to crop out an image of just a parenthesis and a picture of a hyphen. If this works, you could then use the results to determine the best way to "mask" them out of the image. The OpenCV implementation of template matching returns a set of points representing a bounding box of the object of interest.
(https://docs.opencv.org/2.4/doc/tutorials/imgproc/histograms/template_matching/template_matching.html)
I'm creating an OCR application. It extracts handwritten characters from a boxed section in a scanned or photographed printed form, and reads it using a CNN.
It successfully extracts characters using contours, but there are cases where stray lines are also read as contours. These lines seem to be either mere noise or leftover pixels from when the boxed section was cropped. The boxed section itself is cropped using contours.
Basically, it works when the form is scanned with a good scanner and saved in PNG format. Otherwise, it doesn't work as well. I need it to account for JPEG files and poor cameras/scanners too.
This is then more of a question of what possible techniques I can use theoretically.
I'd like to either remove the lines or make the code ignore them.
I've tried:
"padding" the cropped boxed section by a negative number n. So it instead removes n pixels from each side. This can't be used too much though, as it also eats up the pixels of the character.
Using the morphological "close" operation. Modifying the kernel size does almost nothing significant, though.
Implementing a boxed-section-area to character-area ratio: if the ratio of a retrieved contour's area to the boxed section's area is not within range, the contour is ignored.
Here's what it looks like:
The grey parts outline the detected contours. The numbers indicate the index of each contour, in the order they were detected. Notice that strips of the lines are detected too; this is what I want to get rid of.
Besides the lines interfering with the model and making it spout nonsense trying to interpret them, there are some cases where they also seem to cause this error:
ValueError: cannot reshape array of size 339 into shape (1,28,28,1)
Maybe I'll start with investigating this in the meantime.
I've been hacking away at this for a couple of days now, but haven't been able to find a satisfactory solution. Essentially, my goal is to find the bounding boxes of characters in PDFs to eventually use as training data for an OCR system. This means I need clear and consistent bounding box extraction from generated PDFs (like those on arXiv, which actually contain text information, hence the ability to highlight text with the cursor). I've mainly been working with Python and PDFMiner.
Most of the solutions I've seen work at no lower a level than lines of text, and the issue I had there was that PDFs have such varying structures that even this wasn't reliable. I've been able to get character bounding boxes through HTML using pdftotext, but the boxes were mis-sized, most often cutting off the tails of characters, which are crucial for OCR training.
Thanks!
How to go from the image on the left to the image on the right programmatically using Python (and maybe some tools, like OpenCV)?
I made this one by hand using an online clipping tool. I am a complete noob at image processing (especially in practice). I was thinking of applying some edge or contour detection to create a mask, which I would then apply to the original image to paint everything except the region of interest black. But I failed miserably.
The goal is to preprocess a dataset of very similar images, in order to train a CNN binary classifier. I tried to train it by just cropping the image close to the region of interest, but the noise is so high that the CNN learned absolutely nothing.
Can someone help me do this preprocessing?
I used OpenCV's implementation of the watershed algorithm to solve your problem. You can find out how to use it by reading this great tutorial, so I will not explain it in a lot of detail.
I selected four points (markers). One is located in the region that you want to extract, one is outside it, and the other two are in the lower and upper parts of the interior that do not interest you. I then created an empty integer array (the so-called marker image) filled with zeros, and assigned a unique value to the pixels at each marker position.
The image below shows the marker positions and marker values, drawn on the original image:
I could also select more markers within the same area (for example several markers that belong to the area you want to extract) but in that case they should all have the same values (in this case 255).
Then I used watershed. The first input is the image that you provided and the second input is the marker image (zero everywhere except at marker positions). The algorithm stores the result in the marker image; the region that interests you is marked with the value of the region marker (in this case 255):
I set all pixels that did not have the value 255 to zero. I dilated the obtained image three times with a 3x3 kernel. Then I used the dilated image as a mask for the original image (I set all pixels outside the mask to zero), and this is the result I got:
You will probably need some kind of method that finds the markers automatically. The difficulty of this task depends heavily on the set of input images. In some cases the method can be really straightforward and simple (as in the tutorial linked above), but sometimes it can be a tough nut to crack. I can't recommend anything specific, though, because I don't know what your images look like in general (you only provided one). :)
I am working on a project where I have to read a document from an image. In the initial stage I will read machine-printed documents and then eventually move on to images of handwritten documents. However, I am doing this for learning purposes, so I don't intend to use APIs like Tesseract.
I intend to do in steps:
Preprocessing (blurring, thresholding, erosion & dilation)
Character Segmentation
OCR (or ICR in later stages)
So I am working on the character segmentation right now. I recently did it using horizontal and vertical histogram projections, but I was not able to get very good results for some fonts, such as the one in the image shown.
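For context, a minimal version of such a projection split (a sketch, not my exact code; it assumes a single line of dark text on a white background and cuts at completely blank columns):

```python
import numpy as np

def split_on_blank_columns(binary):
    """Cut a binarized text line (ink dark, background white) at runs
    of completely blank columns; returns one image slice per segment."""
    has_ink = (binary < 128).any(axis=0)    # per-column: any ink at all?
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x
        elif not ink and start is not None:
            segments.append((start, x))
            start = None
    if start is not None:
        segments.append((start, len(has_ink)))
    return [binary[:, a:b] for a, b in segments]

# Example: two 3-pixel-wide "characters" separated by blank columns
line = np.full((10, 12), 255, np.uint8)
line[2:8, 1:4] = 0
line[2:8, 6:9] = 0
pieces = split_on_blank_columns(line)
print([p.shape for p in pieces])  # [(10, 3), (10, 3)]
```

This also shows why it breaks on some fonts: touching or tightly kerned glyphs leave no blank column between them, so they come out as one segment.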
Is there any other method or algorithm to do the same?
Any help will be appreciated!
Edit 1:
The result I got after detecting blobs using cv2.SimpleBlobDetector.
The result I got after using cv2.findContours.
A first option is deskewing, i.e. measuring the skew angle. You can achieve this, for instance, with Gaussian filtering or erosion in the horizontal direction, so that the characters widen and come into contact. Then binarize and thin the result, or find the lower edges of the blobs (or directly the directions of the blobs). You will get slightly oblique line segments that give you the skew direction.
Once you know the skew direction, you can counter-rotate to perform deskewing. The vertical histogram will then reliably separate the lines, and you can use a horizontal histogram within each of them.
A second option, IMO much better, is to binarize the characters and perform blob detection. Proximity analysis of the bounding boxes will then allow you to determine chains of characters. These give you the lines; where the spacing is larger, they delimit the words.