I'm trying to convert scanned images to text with Tesseract OCR, and it works great except that my images contain two languages and Tesseract cannot detect both at once. I can either convert all the images as English (with the Arabic shown as garbage rather than Arabic script), or as Arabic (all the text comes out in Arabic, with the English as garbage).
I have tried to detect the language of the exported text with langdetect, but since the garbage characters are ASCII English letters, the detection fails.
I am sharing a sample image here; it would be great if someone could help me find a better solution to this issue.
Just update your code with this:
lang = 'eng+ara'
Here, ara stands for ara.traineddata.
One more thing: the Arabic trained data might not ship with your Tesseract install, so download ara.traineddata from the GitHub repository and paste it into the tessdata folder of Tesseract OCR.
I am also giving you the link for this traineddata: link.
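In code, this is a minimal sketch, assuming pytesseract and Pillow are installed and that ara.traineddata already sits in Tesseract's tessdata folder; the file name is a placeholder:

```python
def ocr_mixed(path, langs=("eng", "ara")):
    # Imported here so the function is the only part that needs these packages.
    import pytesseract
    from PIL import Image

    # Joining language codes with '+' makes Tesseract consider both
    # scripts in a single pass instead of forcing one of them.
    return pytesseract.image_to_string(Image.open(path), lang="+".join(langs))
```

For example, `ocr_mixed("scan.png")` runs a single pass with `lang="eng+ara"`.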
I am a beginner in Python and the programming world, and I am trying to solve a problem.
I have a list of keywords. I want to look for these keywords in folders that contain a lot of PDFs. The PDFs are not character-based; they are image-based (they contain text as images). In other words, the PDFs were scanned in the first decade of the 2000s, so I cannot search for a word inside a PDF file and could not use Windows search etc. I can only check with my eyes, and this is time-consuming and boring.
I researched the question on the internet and found some solutions. Based on them, I tried to write some Python code. It works, but the success rate is a bit low.
Firstly, my code converts the PDF file to image files (PyMuPDF package).
Secondly, my code reads the text on these images and builds it into a string (PIL, pytesseract packages).
Finally, the code searches keywords at this text information and returns True if a keyword is found.
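The three steps above can be sketched roughly like this, assuming PyMuPDF (fitz), pytesseract and Pillow are installed; the zoom factor and the normalization are my own assumptions, not the asker's exact code:

```python
import io

def normalize(s):
    # Lower-case and drop spaces so small OCR glitches matter less.
    return s.lower().replace(" ", "")

def find_keyword(text, keywords):
    """Return the first keyword contained in the OCR text, else None."""
    haystack = normalize(text)
    for kw in keywords:
        if normalize(kw) in haystack:
            return kw
    return None

def pdf_find_keyword(path, keywords, zoom=3):
    # Imported here so find_keyword stays usable without these packages.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    for page in fitz.open(path):
        # Rendering at a higher zoom factor usually improves OCR accuracy.
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        hit = find_keyword(pytesseract.image_to_string(img), keywords)
        if hit:
            return hit
    return None
```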
Example:
keyword_list = ["a-001", "b-002", "c-003"]
pdf_list = ["a.pdf", "b.pdf", "c.pdf", ...., "z.pdf"]
The code should find a-001 in the a.pdf file, because I checked with my eyes and a.pdf contains a-001. The code actually found it.
The code should also find b-002 in the b.pdf file, because I checked with my eyes and b.pdf contains b-002. The code could not find it.
So my code's success rate is 50%. When it finds something, it finds the right PDF file; I have no problem there, as a found PDF really contains what I am looking for. But sometimes it cannot detect a keyword in a PDF where I can see it clearly.
Do you have any better idea to solve this problem more accurately? I am not chasing a 100% success rate; that is impossible, because some PDFs contain handwriting. But most of them contain printed text, and those should be detected. Can I raise the success rate to 75%?
Your best chance is to extract the image at the highest possible resolution, which might mean not "converting the PDF to an image" but rather "parsing the PDF and extracting the image stream" (given it was scanned in the 2000s, it is probably a TIFF stream at that). This is an example using PyMuPDF.
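A minimal sketch of the "extract the image stream" idea with PyMuPDF (fitz); the output naming is a placeholder of my own:

```python
def out_name(page_no, img_no, ext):
    # Hypothetical naming scheme for the extracted files.
    return f"page{page_no}_img{img_no}.{ext}"

def extract_streams(pdf_path):
    import fitz  # imported here so out_name is usable without PyMuPDF

    doc = fitz.open(pdf_path)
    for page_no, page in enumerate(doc):
        for img_no, img in enumerate(page.get_images(full=True)):
            xref = img[0]                   # cross-reference id of the stream
            info = doc.extract_image(xref)  # the raw embedded image, no re-render
            with open(out_name(page_no, img_no, info["ext"]), "wb") as f:
                f.write(info["image"])
```

Unlike rendering the page to a bitmap, `extract_image` hands back the scan at whatever resolution it was embedded with.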
You can perhaps try to further improve the image by adjusting brightness and contrast and applying filters such as despeckling. With poorly scanned images I have had good results with sharpening filters, and there are some filters ("erode" and "washout") that might improve poor typewriting (I remember some "e"s whose eye was almost completely dark, so they got easily mistaken for "c"s).
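As a hedged sketch of those brightness/contrast/despeckle/sharpen ideas with Pillow (the enhancement factors are starting points, not tuned values):

```python
from PIL import Image, ImageEnhance, ImageFilter

def clean_scan(img):
    img = img.convert("L")                            # grayscale
    img = ImageEnhance.Brightness(img).enhance(1.1)   # lighten slightly
    img = ImageEnhance.Contrast(img).enhance(1.5)     # separate ink from paper
    img = img.filter(ImageFilter.MedianFilter(3))     # crude despeckle
    return img.filter(ImageFilter.SHARPEN)            # sharpen strokes
```

For example, `clean_scan(Image.open("scan.tiff"))` would return the cleaned image ready to hand to Tesseract.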
Then train Tesseract to improve the recognition rate. I am not sure how this can be done through the Python interface, though.
I am trying to extract logos from PDFs.
I am applying GaussianBlur, finding the contours, and extracting only the image, but Tesseract cannot read the text from that image.
Removing the frame around the letters often helps Tesseract recognize text better. So, if you try your script with the following image, you'll have a better chance of reading the logo.
With that said, you might ask how you could achieve this for this logo and other logos in a similar fashion. I could think of a few ways off the top of my head but I think the most generic solution is likely to be a pipeline where text detection algorithms and OCR are combined.
Thus, you might want to check out this repository that provides a text detection algorithm based on R-CNN.
You can also step up your tesseract game by applying a few different image pre-processing techniques. I've recently written a pretty simple guide to Tesseract and some image pre-processing techniques. In case you'd like to check them out, here I'm sharing the links with you:
Getting started with Tesseract - Part I: Introduction
Getting started with Tesseract - Part II: Image Pre-processing
However, if you're also interested in this particular logo, or font, you can also try training Tesseract on this font by following the instructions given here.
I am working on a project that requires character recognition as part of it. I am using the IAM handwriting dataset, so all the images are taken in more or less the same conditions. I am using the pictures of words provided by the dataset and following these steps:
Binarizing and thresholding
Dividing the word into the characters constituting it
Resizing the extracted character
Letting tesseract figure out which English letter it is
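For the last step, a minimal sketch of how the single-character call might look with pytesseract; the `--psm 10` (treat the image as a single character) and whitelist options are my suggestions, not the asker's code:

```python
import string

# --psm 10 tells Tesseract the image contains exactly one character;
# the whitelist restricts output to English letters.
CONFIG = ("--psm 10 -c tessedit_char_whitelist="
          + string.ascii_uppercase + string.ascii_lowercase)

def recognize_char(img):
    """img: a PIL image containing one segmented character."""
    import pytesseract  # imported here so CONFIG is inspectable without it
    return pytesseract.image_to_string(img, config=CONFIG).strip()
```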
What I'm trying to achieve is to store characters of a person's document in folders categorized by the alphabet and maybe form a template from them later on. For this I need to know which character it is.
Here's what I get as a result -
All the characters are properly segmented (in most cases). This is more of a tesseract question than it is a python question, but I'm using python to write the script and calling tesseract through the pytesseract wrapper.
I'm using OpenCV to manipulate the image. Images of these letter matrices are sent as input to tesseract (handled by pytesseract). The input is not an issue, I assure you. Is there anything else I need to do for tesseract to work?
None of these characters are recognized.
Tesseract doesn't support handwritten text well. You should try either ABBYY OCR for that, or free alternatives like Lipi Toolkit.
I need to write a desktop application that performs the following operations. I'm thinking of using Python as the programming language, but I'd be more than glad to switch if there's a more appropriate approach or library in another language.
The file I wish to capture is an HWP file, which only certain word processors can open.
1. Capture the entire HWP document in an image; it might span multiple pages (>10 and <15).
2. The HWP file contains an MCQ-formatted quiz.
3. Parse the data from the image, that is, separate out the questions and answers, and save them as separate image files.
I have looked into the following python library, but am still not able to figure out how to perform both 1 and 3.
https://pypi.python.org/pypi/pyscreenshot
Any help would be appreciated.
If I got it correctly, you need to extract text from an image.
For that you should use an OCR engine like Tesseract.
Before running the OCR, try to clear noise from the image.
To split the image, try adding some unique strings to distinguish between the quiz questions and answers.
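A sketch of those last steps, assuming OpenCV and pytesseract are installed; the question-number regex is a guess at the quiz's format and would need adjusting:

```python
import re

def split_questions(text):
    """Split OCR output into per-question chunks on leading numbers like '1.'."""
    parts = re.split(r"(?m)^\s*(?=\d+\.)", text)
    return [p.strip() for p in parts if p.strip()]

def ocr_page(path):
    # Imported here so split_questions stays usable without these packages.
    import cv2
    import pytesseract

    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.medianBlur(img, 3)  # clear salt-and-pepper noise before OCR
    return pytesseract.image_to_string(img)
```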
I'd like to interface an application by reading the text it displays.
I've had success in some applications when Windows isn't doing any font smoothing: I type in a phrase manually, render it in all the Windows fonts, and find a match; from there I can map each letter image to a letter by generating all the letters in the font.
This won't work if any font smoothing is being done, though, either by Windows or by the application. What's the state of the art like in OCRing computer-generated text? It seems like it should be easier than breaking CAPTCHAs or OCRing scanned text. Where can I find resources about this? So far I've only found articles on CAPTCHA breaking or OCRing scanned text.
I prefer solutions easily accessible from Python, though if there's a good one in some other lang I'll do the work to interface it.
I'm not exactly sure what you mean, but I think just reading the text with an OCR program would work well.
Tesseract is amazingly accurate for scanned documents, so a specific font would be a breeze for it to read. Here's my Python OCR solution: Python OCR Module in Linux?.
But you could generate each character as an image and find its locations on the screen image. It might work, but I have no idea how accurate it would be with smoothing.