I am trying to extract logo from the PDFs.
I am applying GaussianBlur, finding the contours and extracting only image. But Tesseract cannot read the text from that Image?
Removing the frame around the letters often helps tesseract recognize texts better. So, if you try your script with the following image, you'll have a better chance of reading the logo.
With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I could think of a few ways off the top of my head but I think the most generic solution is likely to be a pipeline where text detection algorithms and OCR are combined.
Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
You can also step up your tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here I'm sharing the links with you:
Getting started with Tesseract - Part I: Introduction
Getting started with Tesseract - Part II: Image Pre-processing
However, you're also interested in this particular logo, or font, you can also try training tesseract with this font by following the instructions given here.
Related
I have a very simple use case for OCR.
.PNG Image then extract the text from the image. Then print to console.
I'm after a lite weight library. I am trying to avoid any system level dependency's. Pytesseract is great, but deployment is a bit annoying for such a simple use case.
I have tried quite a few of them. They seem to be designed for more complex use cases.
Note: White text is not necessarily suitable for OCR. I will change the image format to suit the OCR Library.
I'm trying to read the following images using pyTessaract.
The program I've made extracts these images from a video, I want to use pytessaract to extract text from them. The program also has the option to make boxes at the areas I want to the OCR to look at, it then crops the image at those places and runs OCR.
But still the output is not very accurate sometimes I get random letters / text included in the output, sometimes the text is completely random. It's just not detecting the text. How can I improve the detection ? Is there some preprocessing strategy I can use ? Or anything else ?
I am a beginner user at Python / programming world and I am trying to solve a problem.
I have a kind of keyword list. I want to look for these keywords at some folders which contain a lot of PDFs. PDFs are not character based, they are image based (they contain text as image). In other words, the PDFs are scanned via scanner at first decade of 2000s. So, I can not search a word in the PDF file. I could not use Windows search etc. I can control only with my eyes and this is time consuming & boring.
I researched the question on the internet and found some solutions. According to these solutions, I tried to write a code via Python. It worked but success rate is a bit low.
Firstly, my code converts the PDF file to image files (PyMuPDF package).
Secondly, my code reads text on these images and creates a text information as string (PIL, pytesseract packages)
Finally, the code searches keywords at this text information and returns True if a keyword is found.
Example;
keyword_list = ["a-001", "b-002", "c-003"]
pdf_list = ["a.pdf", "b.pdf", "c.pdf", ...., "z.pdf"]
Code should find a-001 at a.pdf file. Because I controlled via my eyes and a.pdf contains a-001. The code found actually.
Code should find b-002 at b.pdf file too. Because I controlled via my eyes and b.pdf contains b-001. The code could not find.
So my code's success rate is %50. When it finds, it finds true pdf file; I have no problem on that. Found PDF really contains what I am looking for. But sometimes, it could not detect the keyword at the PDF file which I can see clearly.
Do you have any better idea to solve this problem more accurately? I am not chasing %100 success rate, it is impossible. Because, some PDFs contain handwriting. But, most of them contain computer writing and they should be detected. Can I rise the success rate %75?
Your best chance is to extract the image with the highest possible resolution, which might mean not "converting the PDF to an image" but rather "parsing the PDF and extracting the image stream" (given it was 2000's scanned, it is probably a TIFF stream, at that). This is an example using PyMuPdf.
You can perhaps try and further improve the image by adjusting brightness, contrast and using filters such as "despeckling". With poorly scanned images I have had good results with sharpening filters, and there are some filters ("erode" and "washout") that might improve poor typewriting (I remember some "e"'s where the eye of the "e" was almost completely dark, and they got easily mistaken for "c"'s).
Then train Tesseract to improve recognition ratio. I am not sure of how this can be done with the Python interface, though.
I'm trying to translate images of texts using tesseract. The results seems accurate from my trials. However it seems that I can also train tesseract to be more accurate although complicated.
My question is, how reliable out-of-box tesseract for image to text function for digital images containing popular font like times new roman, arial, etc?
It usually depends on the content of the image - if there's some noise or just unrelated to text background (logos/tables/just random things) - the quality would drop, especially if the contrast of text vs noise is not big enough.
It also depends on the text size: if you have multiple text areas with different font size - you'd most likely need to process those separately (or figure out if different PSM mode could help you), so it would be hard to prepare a generic solution which would work in all cases.
In general - you can visit Tessereact: how to improve quality page and try to follow all the instructions there.
I have been experimenting with PyTesser for the past couple of hours and it is a really nice tool. Couple of things I noticed about the accuracy of PyTesser:
File with icons, images and text - 5-10% accurate
File with only text(images and icons erased) - 50-60% accurate
File with stretching(And this is the best part) - Stretching file
in 2) above on x or y axis increased the accuracy by 10-20%
So apparently Pytesser does not take care of font dimension or image stretching. Although there is much theory to be read about image processing and OCR, are there any standard procedures of image cleanup(apart from erasing icons and images) that needs to be done before applying PyTesser or other libraries irrespective of the language?
...........
Wow, this post is quite old now. I started my research again on OCR these last couple of days. This time I chucked PyTesser and used the Tesseract Engine with ImageMagik instead. Coming straight to the point, this is what I found:
1) You can increase the resolution with ImageMagic(There are a bunch of simple shell commands you can use)
2) After increasing the resolution, the accuracy went up by 80-90%.
So the Tesseract Engine is without doubt the best open source OCR engine in the market. No prior image cleaning was required here. The caveat is that it does not work on files with a lot of embedded images and I coudn't figure out a way to train Tesseract to ignore them. Also the text layout and formatting in the image makes a big difference. It works great with images with just text. Hope this helped.
As it turns out, tesseract wiki has an article that answers this question in best way I can imagine:
Illustrated guide about "Improving the quality of the [OCR] output".
Question "image processing to improve tesseract OCR accuracy" may also be of interest.
(initial answer, just for the record)
I haven't used PyTesser, but I have done some experiments with tesseract (version: 3.02.02).
If you invoke tesseract on colored image, then it first applies global Otsu's method to binarize it and then actual character recognition is run on binary (black and white) image.
Image from: http://scikit-image.org/docs/dev/auto_examples/plot_local_otsu.html
As it can be seen, 'global Otsu' may not always produce desirable result.
To better understand what tesseract 'sees' is to apply Otsu's method to your image and then look at the resulting image.
In conclusion: the most straightforward method to improve recognition ratio is to binarize images yourself (most likely you will have find good threshold by trial and error) and then pass those binarized images to tesseract.
Somebody was kind enough to publish api docs for tesseract, so it is possible to verify previous statements about processing pipeline: ProcessPage -> GetThresholdedImage -> ThresholdToPix -> OtsuThresholdRectToPix
Not sure if your intent is for commercial use or not, But this works wonders if your performing OCR on a bunch of like images.
http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
ORIGINAL
After Pre-Processing with given arguments.
I know it's not a perfect answer. But I'd like to share with you a video that I saw from PyCon 2013 that might be applicable. It's a little devoid of implementation details, but just might be some guidance/inspiration to you on how to solve/improve your problem.
Link to Video
Link to Presentation
And if you do decide to use ImageMagick to pre-process your source images a little. Here is question that points you to nice python bindings for it.
On a side note. Quite an important thing with Tesseract. You need to train it, otherwise it wont be nearly as good/accurate as it's capable of being.