Image cleaning before OCR application - python

I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. Couple of things I noticed about the accuracy of PyTesser:
File with icons, images and text - 5-10% accurate
File with only text(images and icons erased) - 50-60% accurate
File with stretching(And this is the best part) - Stretching file
in 2) above on x or y axis increased the accuracy by 10-20%
So apparently Pytesser does not take care of font dimension or image stretching. Although there is much theory to be read about image processing and OCR, are there any standard procedures of image cleanup(apart from erasing icons and images) that needs to be done before applying PyTesser or other libraries irrespective of the language?
...........
Wow, this post is quite old now. I started my research again on OCR these last couple of days. This time I chucked PyTesser and used the Tesseract Engine with ImageMagik instead. Coming straight to the point, this is what I found:
1) You can increase the resolution with ImageMagic(There are a bunch of simple shell commands you can use)
2) After increasing the resolution, the accuracy went up by 80-90%.
So the Tesseract Engine is without doubt the best open source OCR engine in the market. No prior image cleaning was required here. The caveat is that it does not work on files with a lot of embedded images and I coudn't figure out a way to train Tesseract to ignore them. Also the text layout and formatting in the image makes a big difference. It works great with images with just text. Hope this helped.

As it turns out, tesseract wiki has an article that answers this question in best way I can imagine:
Illustrated guide about "Improving the quality of the [OCR] output".
Question "image processing to improve tesseract OCR accuracy" may also be of interest.
(initial answer, just for the record)
I haven't used PyTesser, but I have done some experiments with tesseract (version: 3.02.02).
If you invoke tesseract on colored image, then it first applies global Otsu's method to binarize it and then actual character recognition is run on binary (black and white) image.
Image from: http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html
As it can be seen, 'global Otsu' may not always produce desirable result.
To better understand what tesseract 'sees' is to apply Otsu's method to your image and then look at the resulting image.
In conclusion: the most straightforward method to improve recognition ratio is to binarize images yourself (most likely you will have find good threshold by trial and error) and then pass those binarized images to tesseract.
Somebody was kind enough to publish api docs for tesseract, so it is possible to verify previous statements about processing pipeline: ProcessPage -> GetThresholdedImage -> ThresholdToPix -> OtsuThresholdRectToPix

Not sure if your intent is for commercial use or not, But this works wonders if your performing OCR on a bunch of like images.
http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
ORIGINAL
After Pre-Processing with given arguments.

I know it's not a perfect answer. But I'd like to share with you a video that I saw from PyCon 2013 that might be applicable. It's a little devoid of implementation details, but just might be some guidance/inspiration to you on how to solve/improve your problem.
Link to Video
Link to Presentation
And if you do decide to use ImageMagick to pre-process your source images a little. Here is question that points you to nice python bindings for it.
On a side note. Quite an important thing with Tesseract. You need to train it, otherwise it wont be nearly as good/accurate as it's capable of being.

Related

How accurate is pytesseract for reading computer generated images containing popular digital font?

I'm trying to translate images of texts using tesseract. The results seems accurate from my trials. However it seems that I can also train tesseract to be more accurate although complicated.
My question is, how reliable out-of-box tesseract for image to text function for digital images containing popular font like times new roman, arial, etc?
It usually depends on the content of the image - if there's some noise or just unrelated to text background (logos/tables/just random things) - the quality would drop, especially if the contrast of text vs noise is not big enough.
It also depends on the text size: if you have multiple text areas with different font size - you'd most likely need to process those separately (or figure out if different PSM mode could help you), so it would be hard to prepare a generic solution which would work in all cases.
In general - you can visit Tessereact: how to improve quality page and try to follow all the instructions there.

Text cannot be read using pyTesseract

I am trying to extract logo from the PDFs.
I am applying GaussianBlur, finding the contours and extracting only image. But Tesseract cannot read the text from that Image?
Removing the frame around the letters often helps tesseract recognize texts better. So, if you try your script with the following image, you'll have a better chance of reading the logo.
With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I could think of a few ways off the top of my head but I think the most generic solution is likely to be a pipeline where text detection algorithms and OCR are combined.
Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
You can also step up your tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here I'm sharing the links with you:
Getting started with Tesseract - Part I: Introduction
Getting started with Tesseract - Part II: Image Pre-processing
However, you're also interested in this particular logo, or font, you can also try training tesseract with this font by following the instructions given here.

Best algorithm for character recognition

I am trying to create a hardcoded subtitle ripper from a video.
So far i have done some pre-processing.
Get subtitle frame
Crop subtitle lines
Separate subtitle lines
Separate characters.
The major part that is character recognition, is still not done. I tried using tesseract but accuracy is around 60%. Also I tried training character images and then comparing them. But when I run on different resolution video, it failed badly.
Following are the results so far. (Original Image, Threshold, Text Enhancement, Separated characters)
I did go through K Means and comparing images using Structural Similarity. But nothing worked in my case. As you can see above the image text is very clear.
Edited:
Question: I want to improve accuracy to 95% or above as the text is similar across all video, i am able to get the clear text or characters as shown above. Which are the best approaches I can try in my case?
P.S: Language is croatian
I would suggest two things:
Play a bit more with image clean-up
Get better OCR. Tesseract is free, but not the best one. If your budget allows, you may look into some commertial ones. For example: OCRSDK.com This one has some free recognitions available, at least enough to play and see if it works for you.
I tried your latest picture (the one after all cleanings), on demo page, it was recognised almost completely right - see below. Much better than 60% of errors. I am sure that with better image prerpocessing you could improve accuracy even more.
Disclaimer: I work for ABBYY.

Python: Manipulating a 16-bit .tiff image in PIL &/or pygame: convert to 8-bit somehow?

Hello all,
I am working on a program which determines the average colony size of yeast from a photograph, and it is working fine with the .bmp images I tested it on. The program uses pygame, and might use PIL later.
However, the camera/software combo we use in my lab will only save 16-bit grayscale tiff's, and pygame does not seem to be able to recognize 16-bit tiff's, only 8-bit. I have been reading up for the last few hours on easy ways around this, but even the Python Imaging Library does not seem to be able to work with 16-bit .tiff's, I've tried and I get "IOError: cannot identify image file".
import Image
img = Image.open("01 WT mm.tif")
My ultimate goal is to have this program be user-friendly and easy to install, so I'm trying to avoid adding additional modules or requiring people to install ImageMagick or something.
Does anyone know a simple workaround to this problem using freeware or pure python? I don't know too much about images: bit-depth manipulation is out of my scope. But I am fairly sure that I don't need all 16 bits, and that probably only around 8 actually have real data anyway. In fact, I once used ImageMagick to try to convert them, and this resulted in an all-white image: I've since read that I should use the command "-auto-levels" because the data does not actually encompass the 16-bit range.
I greatly appreciate your help, and apologize for my lack of knowledge.
P.S.: Does anyone have any tips on how to make my Python program easy for non-programmers to install? Is there a way, for example, to somehow bundle it with Python and pygame so it's only one install? Can this be done for both Windows and Mac? Thank you.
EDIT: I tried to open it in GIMP, and got 3 errors:
1) Incorrect count for field "DateTime" (27, expecting 20); tag trimmed
2) Sorry, can not handle images with 12-bit samples
3) Unsupported layout, no RGBA loader
What does this mean and how do I fit it?
py2exe is the way to go for packaging up your application if you are on a windows system.
Regarding the 16bit tiff issue:
This example http://ubuntuforums.org/showthread.php?t=1483265 shows how to convert for display using PIL.
Now for the unasked portion question: When doing image analysis, you want to maintain the highest dynamic range possible for as long as possible in your image manipulations - you lose less information that way. As you may or may not be aware, PIL provides you with many filters/transforms that would allow you enhance the contrast of an image, even out light levels, or perform edge detection. A future direction you might want to consider is displaying the original image (scaled to 8 bit of course) along side a scaled image that has been processed for edge detection.
Check out http://code.google.com/p/pyimp/wiki/screenshots for some more examples and sample code.
I would look at pylibtiff, which has a pure python tiff reader.
For bundling, your best bet is probably py2exe and py2app.
This is actually a 2 part question:
1) 16 bit image data mangling for Python - I usually use GDAL + Numpy. This might be a bit too much for your requirements, you can use PIL + Numpy instead.
2) Release engineering Python apps can get messy. Depending on how complex your app is you can get away with py2deb, py2app and py2exe. Learning distutils will help too.

Designing an open source OCR engine specifically for rendered text (screenshots)

So my current personal project is to be able to automatically grab screenshots out of a game, OCR the text, and count the number of occurrences of given words.
Having spent all evening looking around at different OCR solutions, I've come to realize that the majority of OCR packages out there are designed for scanned text. If there are any packages that can read screen text reliably, they're well outside this hobbyist's budget.
I've been reading through some other questions, and the closest I found was OCR engines designed for screen-reading.
It seems to me that reading rendered text should be much easier than printed and scanned text. Lines are always straight, and any given letter will always appear with the exact same pixel representation (mostly, anyways). Also, why not use the actual font file (if you have it) as a cheat sheet to recognizing characters? We might actually reach 100% accuracy with a system like this.
Assuming you have the font file for a cheat sheet and your source image is perfectly square and has no noise, how would you go about recognizing characters from the screen?
(Problems I can foresee are ui lines and images that could confuse any crude attempt at pixel-guessing.)
If you already know of a free/open-source OCR package designed for screen-reading, please let me know. I kind of doubt that's going to show up though, as no other askers seem to have gotten a lead either.
A Python interface is preferred, but beggars can't be choosers.
EDIT:
To clarify, I'm looking for design suggestions for an OCR solution that is specifically designed to read text from screenshots. Popular tools like tesseract (mentioned in the question I linked) are hard to use at best because they are not designed for this kind of source file.
So I've been thinking about it and I feel that the best approach will be to count the number of pixels in each blob/glyph/character. This should really cut down on the number of tests I need to do to differentiate between glyphs.
Regretfully, I'll have to be very specific about fonts. The software will only be able to recognize fonts at the right dpi, for the right font face and weight, etc.
It isn't ideal, and I'd still like to see someone who knows more about this stuff design OCR for rendered text; but it will work for my limited case.
If your goal is to count occurrences of certain events in a game, OCR is really not the right way to be going about it. That said, if you are determined to use OCR, then tesseract-OCR is a well-known open source package for performing optical character recognition. I'm not really sure what you are getting at with respect to scanned vs. rendered text, but tesseract will probably do as good a job as any opensource package that is available. OCR is still a tricky art, so I wouldn't expect 100% accuracy.
This isn't exactly what you want, but you may want to look at Sikuli.

Categories