The images that I have give me inconsistent results. My thought process is: my text is always in a white font; if I can switch the pixels of my text to black and turn everything else to white or transparent, I will have better success.
My question is, what library or language is best for this? Do I have to turn my white pixels into some unique RGB value, turn everything else to white or transparent, then find that unique RGB value and make it black? Any help is appreciated.
Yes, if you could make the text pixels black and all the rest of the document white, you would have better success. Although this is not always possible, there are processing steps that can help.
The median filter (and other low-pass filters) can be used to remove noise present in the image.
Erosion can also help to remove things that are not characters, such as thin lines and residual noise.
Aligning the text is also a good idea; OCR accuracy can drop considerably if the text is not aligned. To do this you could try the Hough transform followed by a rotation: use the Hough transform to find a line in your text, then rotate the image by the same angle as that line.
All of the processing steps mentioned can be done with OpenCV or scikit-image.
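For instance, a minimal OpenCV sketch of the three steps might look like this (the filename, kernel sizes and Hough parameters below are placeholders, not tuned values):

import cv2
import numpy as np

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # placeholder filename

# 1. Median filter to suppress salt-and-pepper noise.
denoised = cv2.medianBlur(img, 3)

# 2. Binarize, then erode to remove thin lines and small specks.
#    Assumes dark text on a light background; invert first if your text is white.
_, binary = cv2.threshold(denoised, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
cleaned = cv2.erode(binary, np.ones((2, 2), np.uint8), iterations=1)

# 3. Estimate the dominant text angle with the Hough transform, then rotate.
lines = cv2.HoughLinesP(cleaned, 1, np.pi / 180, threshold=100,
                        minLineLength=img.shape[1] // 3, maxLineGap=20)
if lines is not None:
    angles = [np.degrees(np.arctan2(y2 - y1, x2 - x1))
              for x1, y1, x2, y2 in lines[:, 0]]
    angle = np.median(angles)          # the line angle found by the Hough transform
    h, w = img.shape
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                              borderMode=cv2.BORDER_REPLICATE)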
It is also worth pointing out that there are many other ways to preprocess text, too many to mention here.
I have the following JPG image. I want to find the edges where the white page meets the black background, so that I can rotate the contents a few degrees clockwise. My aim is to straighten the text for use with Tesseract OCR conversion. I don't see the need to rotate the individual text blocks as I have seen in similar examples.
In the Canny Edge Detection docs, the third argument (200 in e.g. edges = cv.Canny(img,100,200)) is maxVal, and values above it are said to be 'sure to be edges'. Is there any way to determine these (min/max) values ahead of a trial-and-error approach?
I have used code examples which utilize the Python cv2 module, but their edge detection is set up for simpler applications.
Is there any approach I can use to take the text out of the equation? For example, only detecting edge lines greater than a specified length?
Any suggestions would be appreciated.
Below is an example of edge detection on the above image (same min/max values). The outer edge of the page is clearly defined. The image is high-contrast black and white with even lighting, so I can't see a need for an adaptive threshold; a simple global one is working. It's just a matter of what ratio to use.
I don't have the answer to this yet, but as an update: I now have the contours of the above doc.
I used the find-contours tutorial with some customization of the file loading. Note: removing the words gives a thinner/cleaner outline.
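One way to act on the "only keep edge lines greater than a specified length" idea would be to filter those contours by perimeter; a rough sketch (assuming OpenCV 4.x, the edges image from the Canny call above, and an arbitrary 1000-pixel cutoff):

import cv2 as cv

contours, _ = cv.findContours(edges, cv.RETR_EXTERNAL, cv.CHAIN_APPROX_SIMPLE)
# Keep only long contours such as the page outline; short text contours are dropped.
page_outline = [c for c in contours if cv.arcLength(c, False) > 1000]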
Consider Otsu. Its chief virtue is that it adapts the threshold to the illumination within the image. In your case, the blank margins might be the saving grace.
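A minimal sketch of Otsu binarization with OpenCV (the filename is a placeholder):

import cv2

gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)  # placeholder filename
# Otsu picks the threshold automatically from the image histogram.
otsu_thresh, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)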
Consider working on a series of 2x-reduced-resolution images, where each new pixel is the min() (or even the max()!) of the original four pixels. These reduced images might help you focus on the features that matter for your use case.
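A sketch of that 2x reduction in NumPy, assuming a 2-D greyscale array:

import numpy as np

def reduce_2x(img, func=np.min):
    # Halve the resolution: each new pixel is func() of the original 2x2 block.
    h, w = img.shape
    img = img[:h - h % 2, :w - w % 2]                    # crop to even dimensions
    return func(img.reshape(h // 2, 2, w // 2, 2), axis=(1, 3))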
The usual way to deskew scanned text is to binarize and then keep changing theta until the "sum of pixels across a raster row" is zero, or small, between lines. In particular, with few descenders and decent inter-line spacing, we will see "lots" of pixels on each line of text and "near zero" between text lines when theta matches the original printing orientation. This lets us recover (1) pixels per line and (2) inter-line spacing, assuming we've found a near-optimal theta.
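A sketch of that search, using scipy.ndimage for the rotation (the angle range and step are arbitrary, and "binary" is assumed to be a binarized NumPy array with nonzero text pixels):

import numpy as np
from scipy import ndimage

def deskew_angle(binary, angles=np.arange(-5, 5.25, 0.25)):
    # Try each theta; the best one gives row sums that swing between "lots"
    # on text lines and "near zero" between them, i.e. maximal variance.
    best_angle, best_score = 0.0, -1.0
    for theta in angles:
        rotated = ndimage.rotate(binary, theta, reshape=False, order=0)
        score = rotated.sum(axis=1).var()
        if score > best_score:
            best_angle, best_score = theta, score
    return best_angle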
In your particular case, focusing on the ... leader dots seems a promising approach to finding the globally optimal deskew correction angle. Discarding large rectangles of pixels in the left and right regions of the image could actually reduce noise and enhance the accuracy of such an approach.
I am quite new to Python and I am trying to write some code for image analysis.
Here is my initial image:
Initial image
After splitting the image into its RGB channels, converting each into a gradient, applying a threshold and merging them back together, I get the following image:
Gradient/Threshold
Now I have to draw contours around the black areas and get the size of the enclosed areas. I just don't know how to do it, since my attempts with findContours/drawContours in OpenCV have not been successful at all.
Maybe someone also knows an easier way to get that from the initial image.
Hope someone can help me here!
I am coding in Python 3.
Try adaptive thresholding on the grayscale version of the input image.
Also play with the last two parameters of the adaptive threshold. You will find good results, as I have shown in the image. (Tip: create a trackbar and play with the values; this is a quick and easy way to find the best values for these parameters.)
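A minimal sketch of that suggestion (assumes OpenCV 4.x; the filename is a placeholder, and the last two arguments, blockSize=51 and C=10, are exactly the parameters worth tuning):

import cv2

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # placeholder filename
# blockSize and C are the last two arguments; tune them, e.g. via cv2.createTrackbar.
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 51, 10)

# The black areas can then be measured with contours.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
areas = [cv2.contourArea(c) for c in contours]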
What functions should I use (and how should I use them) to crop out the center part of this image? I want to keep just the less-dense parts, not the dense borders.
Thanks!
In the end, I want to either count the tiny circles/dots (cells) in the areas or calculate the area of the less-dense parts, outlined in the second image. I've done this before with ImageJ by tracing out the area by hand, but it is a really tedious process with lots of images.
Original
Area traced
I've currently looked at SciPy, but it is a big library and I don't really know how to approach this. If someone could point me in the right direction, that would be great!
It would take me a bit longer to do in Python, but I tried a few ideas on the command line with ImageMagick, which is installed on most Linux distros and is available for free for macOS and Windows.
First, I trimmed your image to get rid of extraneous junk:
Then, the steps I did were:
discarded the alpha/transparency channel
converted to greyscale as there is no useful colour information
normalised to stretch the contrast and bring all pixels into the range 0-255
thresholded to find the cells
replaced each pixel by the mean of its surrounding 49x49 pixels (a box blur)
thresholded again at 90%
That command looks like this in Terminal/Command Prompt:
convert blobs.png -alpha off -colorspace gray -normalize -threshold 50% -statistic mean 49x49 -threshold 90% result.png
The result is:
If that approach looks promising for your other pictures we can work out a Python version pretty quickly, so let me know.
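For reference, a rough OpenCV translation of the same pipeline might look like this (a sketch; the 50%/90% thresholds and the 49x49 window are copied straight from the command above, not tuned):

import cv2

img = cv2.imread("blobs.png", cv2.IMREAD_GRAYSCALE)             # drop alpha, go grey
norm = cv2.normalize(img, None, 0, 255, cv2.NORM_MINMAX)         # stretch contrast
_, cells = cv2.threshold(norm, 127, 255, cv2.THRESH_BINARY)      # ~50% threshold
blurred = cv2.blur(cells, (49, 49))                              # mean of 49x49 box
_, result = cv2.threshold(blurred, 229, 255, cv2.THRESH_BINARY)  # ~90% threshold
cv2.imwrite("result.png", result)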
Of course, if you know other useful information about your image that could help improve things... maybe you know the density is always higher at the edges, for example.
In case anyone wants to see the intermediate steps, here is the image after grey scaling and normalising:
And here it is after blurring:
I am working on a project where I have to read a document from an image. In the initial stage I will read machine-printed documents and then eventually move on to images of handwritten documents. However, I am doing this for learning purposes, so I don't intend to use APIs like Tesseract etc.
I intend to do it in steps:
Preprocessing (Blurring, Thresholding, Erosion & Dilation)
Character Segmentation
OCR (or ICR in later stages)
So I am doing the character segmentation right now. I recently did it using horizontal and vertical histograms, but for some fonts, such as the one in the image shown, I was not able to get good results.
Is there any other method or algorithm to do the same?
Any help will be appreciated!
Edit 1:
The result I got after detecting blobs using cv2.SimpleBlobDetector.
The result I got after using cv2.findContours.
A first option is deskewing, i.e. measuring the skew angle. You can achieve this, for instance, by Gaussian filtering or erosion in the horizontal direction, so that the characters widen and come into contact. Then binarize and thin, or find the lower edges of the blobs (or directly the directions of the blobs). You will get slightly oblique line segments which give you the skew direction.
When you know the skew direction, you can counter-rotate to perform the deskewing. The vertical histogram will then reliably separate the lines, and you can use a horizontal histogram in each of them.
A second option, IMO much better, is to binarize the characters and perform blob detection. Proximity analysis of the bounding boxes will then allow you to determine chains of characters. These will tell you the lines and, where the spacing is larger, delimit the words.
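A sketch of that second option with OpenCV (assumes OpenCV 4.x, dark text on a light background, and a placeholder filename):

import cv2

gray = cv2.imread("document.png", cv2.IMREAD_GRAYSCALE)  # placeholder filename
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# One bounding box per blob (roughly, per character).
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
boxes = [cv2.boundingRect(c) for c in contours]   # (x, y, w, h)

# Proximity analysis: group boxes whose vertical ranges overlap into lines,
# sort each line left to right, and treat unusually wide horizontal gaps
# between neighbouring boxes as word breaks.
boxes.sort(key=lambda b: (b[1], b[0]))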
I came across a captcha-protected website, and I would like to get rid of the captcha. Here are some sample images:
Since the background is static and the word consists of computer-generated, non-distorted characters, I believe this is very doable. Because passing the image directly to Tesseract (the OCR engine) doesn't give a positive result, I would like to remove the captcha background before OCR.
I tried multiple background removal methods using Python-PIL:
Removing all non-black pixels, which removes the lines but does not remove the small solid black box.
Applying a filter mentioned in another Stack Overflow post, which also does not remove the small solid black box and is less effective than method 1.
Methods 1 and 2 give me an image like this:
It seems close, but Tesseract couldn't recognize the characters, even after the top and bottom rows of dots are removed.
Creating a background mask and applying it to the image.
Here is the mask image
And this is the image with the mask applied and grey lines removed
However, blindly applying this mask generates some "white holes" in the captcha characters, and Tesseract still failed to find the words.
Are there any better methods for removing the static background?
Lastly, how could I split the filtered image into 6 images, each containing a single character? Thanks very much.
I can give you a few ideas to try.
After you have applied step 3, you may thicken the black edges in the images using PIL so as to fill the white holes. I guess you are using python-tesseract; if so, please refer to Example 4 in https://code.google.com/p/python-tesseract/wiki/CodeSnippets
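For the thickening step, a PIL MinFilter grows the dark strokes on a light background, which can close small white holes (a sketch; the filename and filter size are placeholders):

from PIL import Image, ImageFilter

img = Image.open("captcha_masked.png").convert("L")   # placeholder filename
# MinFilter(3) replaces each pixel with the darkest pixel in a 3x3 window,
# so black strokes thicken and small white holes are filled in.
thickened = img.filter(ImageFilter.MinFilter(3))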
In order to extract the characters, you may refer to Numpy PIL Python : crop image on whitespace or crop text with histogram Thresholds. There are methods there for analysing the histogram of the image so as to locate the positions of the whitespace, from which you can infer the character boundaries.
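A sketch of that histogram idea, assuming black characters on a white background (the filename and the 128 cutoff are placeholders):

import numpy as np
from PIL import Image

img = np.array(Image.open("captcha_clean.png").convert("L"))  # placeholder filename
ink_per_column = (img < 128).sum(axis=0)   # count of dark pixels in each column

# Cut wherever a run of inked columns ends, i.e. at the whitespace between characters.
cuts, in_char, start = [], False, 0
for x, ink in enumerate(ink_per_column):
    if ink > 0 and not in_char:
        start, in_char = x, True
    elif ink == 0 and in_char:
        cuts.append((start, x))
        in_char = False
if in_char:
    cuts.append((start, len(ink_per_column)))

characters = [img[:, a:b] for a, b in cuts]   # ideally six slices, one per character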