OpenCV: How to find individual positions of connected letters in an image? - python

I have an image of text that I would like to segment so I can get the individual positions of the letters. However, some of the letters are touching each other, so I'm having trouble with segmentation:
I tried using some contour methods (which seem to work for license plates, etc), but it always ends up detecting the entire block of text. I also tried to apply some erosion to increase the gaps between the letters, but sometimes this causes parts of the letters to erode while the connections remain. I also tried blurring the image and then thresholding, but like before, the connections remain. Is there a way to get the positions of these letters?

Related

Removing line artifacts from an image

I'm creating an OCR application. It extracts handwritten characters from a boxed section in a scanned or photographed printed form, and reads it using a CNN.
It successfully extracts characters using contours, but there are cases where there are lines that are read too as contours. These lines seem to be the result of either mere noise, or leftover pixels when the boxed section is cropped. The boxed section is cropped using contours.
Basically, it works when the form is scanned with a good scanner, saved in PNG format. Otherwise, it won't work as well. I need it to account for JPEG files too and crap camera/scanners.
This is then more of a question of what possible techniques I can use theoretically.
I'd like to either remove lines, or make the code ignore it.
I've tried:
"padding" the cropped boxed section by a negative number n. So it instead removes n pixels from each side. This can't be used too much though, as it also eats up the pixels of the character.
use morphological operation "close". Modifying the kernel size does almost nothing significant, though.
implementing a boxed section area:character area ratio. If the retrieved contour area ratio to the boxed section area is not in the range, it's ignored.
Here's what it looks like:
The grey parts outline the detected contours. The numbers indicate the index of the contour, ordered by the order they are detected. Notice there are strips of lines detected too. I want to get rid of this.
Beside the lines interfering with the model and making it spout nonsense trying to interpret these, there are some cases where it also seems to cause this error:
ValueError: cannot reshape array of size 339 into shape (1,28,28,1)
Maybe I'll start with investigating this in the meantime.

Character extraction from image of combed field

I am currently working on handwritten character recognition from a form iamge. Everything works pretty well so far, but I was hoping I could get some insight on extracting character from an image of a boxed or a "combed" field
For example, after a specific field has been cropped and binazarized (with otu's method), I'm left with something like this:
Binary Field Image
For character recogntion, I have a trained CNN model using the emnist dataset. In order to predict the characters, I have to extract the characters one by one. What would be the best way to extract the characters from the boxes?
Currently, I am using a pretty trivial method of just find groupings of non-white lines of horizontal and vertical pixels that take up a certain number of pixels in relation to the image width and height. For example, I would find horizontal lines that consists of at least 90% non-white pixels and group the ones that have concurrent y coordinates to form a rectangle object which would be the horizontal lines found on the image (which should constist of two lines/rectangles, for top and bottom). For vertical lines I do a similar thing except I would end up with {2 * charLength} lines. I use these values to crop out each character. However, it is not perfect.
Here are some issues with this:
Field is not always perfectly straight (rotation is slightly off). I am already applying SURF and homography to the original image, which does a very good job but it is not perfect.
If a user writes a "1" that takes up the entire height of the box, it will most likely falsly indicate that as a vertical line of the box.
The coordinates don't always match up with the original image and the input image. Therefore, part of the field will be cropped out sometimes. To fix this, I am currently extracting a surrounding part of the field (as seen in the image) but this can also cause problems because the form can have other vertical and horizontal lines very close to some fields. This will cause my current trivial method to not work properly.
Is there a better way to do this? One thing is that I have to keep performance in mind. I was thinking of doing SURF matching again for just the field image, but doing it for the entire form page takes very long, so I am not sure if I want to do it again for each field that I am reading.
I was hoping someone would have suggestions. I am using OpenCV for image processing, but solution in words is fine. Thank you
I know this is a bit late response, but I ended up using the contour feature that OpenCV had to extract the character portion.
When OpenCV finds the contours of the images, it sets up a hierarchy system of contours. The first level ended up being the very outer box so I was able to just grab the contours of the next level to extract the characters.
It didn't work 100% in the beginning, but after some additional image processing I was able to extract the characters properly for at least 99% of cases.

Segmentation of lines, words and characters from a document's image

I am working on a project where I have to read the document from an image. In initial stage I will read the machine printed documents and then eventually move to handwritten document's image. However I am doing this for learning purpose, so I don't intend to use apis like Tesseract etc.
I intend to do in steps:
Preprocessing(Blurring, Thresholding, Erosion&Dilation)
Character Segmentation
OCR (or ICR in later stages)
So I am doing the character segmentation right now, I recently did it through the Horizontal and Vertical Histogram. I was not able to get very good results for some of the fonts, like the image as shown I was not able to get good results.
Is there any other method or algorithm to do the same?
Any help will be appreciated!
Edit 1:
The result I got after detecting blobs using cv2.SimpleBlobDetector.
The result I got after using cv2.findContours.
A first option is by deskewing, i.e. measuring the skew angle. You can achieve this for instance by Gaussian filtering or erosion in the horizontal direction, so that the characters widen and come into contact. Then binarize and thin or find the lower edges of the blobs (or directly the directions of the blobs). You will get slightly oblique line segments which give you the skew direction.
When you know the skew direction, you can counter-rotate to perform de-sekwing. The vertical histogram will then reliably separate the lines, and you can use an horizontal histogram in each of them.
A second option, IMO much better, is to binarize the characters and perform blob detection. Then proximity analysis of the bounding boxes will allow you to determine chains of characters. They will tell you the lines, and where spacing is larger, delimit the words.

How to detect and fix joined/overlapped letters (Segmentation for OCR) using OpenCV and Python?

I'm having problem segmenting the joined or overlapped texts, for an OCR program. I'm dealing with the Times New Roman font. In this font, the letters such as fb, fh, fi, fj, fk, fl, etc are joined with each other at the top. (See pic. below). This is mostly seen in serif fonts.
Letters joining in Times New Roman font and result of watershed algorithm:
Obviously the contour detection will give these two letters a single segmentation. So, I tried the watershed algorithm. As you can see in the above picture, it does detect the overlapping but I discovered another problem in itself. The thin part of the letter 'f' is also being divided into another segment, but I want the whole 'f'. I know this is because of the marker that I'm using. (See below)
Markers that I'm using for watershed:
Also, does anyone know how to detect whether there is overlapping of letters, so that I can apply watershed algorithm to the overlapping parts only.
So how to solve this issue? Am I using the correct method, i.e. watershed to solve this? Does anyone know a better solution to this?

Remove Captcha background

I entered a captcha-ed website I would like to get rid of. Here is some sample images
Since the background is static and the word is so computer-generated non distorted character, I believe it is very do-able. Since passing the image directly to Tesseract (OCR engine) doesn't come a positive result. I would like to remove the captcha background before OCR.
I tried multiple background removal methods using Python-PIL
Remove all non-black pixels, which remove the lines but it wouldn't remove the small solid black box.
Apply filter mentioned another StackOverflow post, which would not remove the small solid black box. Also it is less effective than method 1.
Method 1 and 2 would give me a image like this
It seems close but Tesseract couldn't recognize the character, even after the top and bottom dot row is removed.
Create a background mask, and apply the background mask to the image.
Here is the mask image
And this is the image with the mask applied and grey lines removed
However blindly applying this mask would generate some "white holes" in the captcha character. And still Tesseract failed to find out the words.
Are there any better methods removing the static background?
Lastly how could I split the filtered image into 6 image with single character? Thanks very much.
I can give you a few ideas to have a try.
After you have applied step 3, you may thicken the black edges in the images using PIL so as the fill the white holes. And I guess you are using python-tesseract. If so, please refer to Example 4 in https://code.google.com/p/python-tesseract/wiki/CodeSnippets
In order to extract the characters, you may refer to Numpy PIL Python : crop image on whitespace or crop text with histogram Thresholds. There are methods about analysing the histogram of the image so as to locate the position of the whitespaces from which you can infer the boundary.

Categories