Searching Information at PDFs

Searching Information at PDFs - python

I am a beginner user at Python / programming world and I am trying to solve a problem.
I have a kind of keyword list. I want to look for these keywords at some folders which contain a lot of PDFs. PDFs are not character based, they are image based (they contain text as image). In other words, the PDFs are scanned via scanner at first decade of 2000s. So, I can not search a word in the PDF file. I could not use Windows search etc. I can control only with my eyes and this is time consuming & boring.
I researched the question on the internet and found some solutions. According to these solutions, I tried to write a code via Python. It worked but success rate is a bit low.
Firstly, my code converts the PDF file to image files (PyMuPDF package).
Secondly, my code reads text on these images and creates a text information as string (PIL, pytesseract packages)
Finally, the code searches keywords at this text information and returns True if a keyword is found.
Example;
keyword_list = ["a-001", "b-002", "c-003"]
pdf_list = ["a.pdf", "b.pdf", "c.pdf", ...., "z.pdf"]
Code should find a-001 at a.pdf file. Because I controlled via my eyes and a.pdf contains a-001. The code found actually.
Code should find b-002 at b.pdf file too. Because I controlled via my eyes and b.pdf contains b-001. The code could not find.
So my code's success rate is %50. When it finds, it finds true pdf file; I have no problem on that. Found PDF really contains what I am looking for. But sometimes, it could not detect the keyword at the PDF file which I can see clearly.
Do you have any better idea to solve this problem more accurately? I am not chasing %100 success rate, it is impossible. Because, some PDFs contain handwriting. But, most of them contain computer writing and they should be detected. Can I rise the success rate %75?

Your best chance is to extract the image with the highest possible resolution, which might mean not "converting the PDF to an image" but rather "parsing the PDF and extracting the image stream" (given it was 2000's scanned, it is probably a TIFF stream, at that). This is an example using PyMuPdf.
You can perhaps try and further improve the image by adjusting brightness, contrast and using filters such as "despeckling". With poorly scanned images I have had good results with sharpening filters, and there are some filters ("erode" and "washout") that might improve poor typewriting (I remember some "e"'s where the eye of the "e" was almost completely dark, and they got easily mistaken for "c"'s).
Then train Tesseract to improve recognition ratio. I am not sure of how this can be done with the Python interface, though.

Related

Best algorithm for character recognition

I am trying to create a hardcoded subtitle ripper from a video.
So far i have done some pre-processing.
Get subtitle frame
Crop subtitle lines
Separate subtitle lines
Separate characters.
The major part that is character recognition, is still not done. I tried using tesseract but accuracy is around 60%. Also I tried training character images and then comparing them. But when I run on different resolution video, it failed badly.
Following are the results so far. (Original Image, Threshold, Text Enhancement, Separated characters)
I did go through K Means and comparing images using Structural Similarity. But nothing worked in my case. As you can see above the image text is very clear.
Edited:
Question: I want to improve accuracy to 95% or above as the text is similar across all video, i am able to get the clear text or characters as shown above. Which are the best approaches I can try in my case?
P.S: Language is croatian

I would suggest two things:
Play a bit more with image clean-up
Get better OCR. Tesseract is free, but not the best one. If your budget allows, you may look into some commertial ones. For example: OCRSDK.com This one has some free recognitions available, at least enough to play and see if it works for you.
I tried your latest picture (the one after all cleanings), on demo page, it was recognised almost completely right - see below. Much better than 60% of errors. I am sure that with better image prerpocessing you could improve accuracy even more.
Disclaimer: I work for ABBYY.

Capturing screenshot and parsing data from the captured image

I need to write a desktop application that performs the following operations. I'm thinking of using Python as the programming language, but I'd be more than glad to switch, if there's an appropriate approach or library in any other languages.
The file I wish to capture is an HWP file, that only certain word processors can run.
Capture the entire HWP document in an image, might span multiple pages (>10 and <15)
The HWP file contains an MCQ formatted quiz
Parse the data from the image that is separate out the questions and answers and save them as separate image files.
I have looked into the following python library, but am still not able to figure out how to perform both 1 and 3.
https://pypi.python.org/pypi/pyscreenshot
Any help would be appreciated.

If i got it correctly , you need to extract text from image.
For this one you should use an OCR like tesseract.
Before using an OCR, try to clear noises from image.
To split the image try to add some unique strings to distinguish between the quiz Q/A

Extract text from screen in python

Is there a library etc for extracting text from a png bitmap screen shot?
It is for a automizer and would (for example) be able to read buttons etc. I've checked Tesseract, but it seems to be made for pictures, not computer screen fonts.

If you're dealing with a small amount of possible matches (i.e.: you want to recognize two or three different buttons), the simplest way is to isolate those in a previous screenshot, save them to individual files, and then use some form of template matching, which is quite easy in opencv.
If, however, you need to actually perform recognition of the button text, you're going to need a OCR engine. Tesseract is a good candidate, if you can get it trained for your font (it's a lengthy process). As you mention, you'll need to do this if you're dealing with a small font, which tesseract is not originally trained to recognize. If you can't, there's a couple other engines usable in python around, like Ocropus

ocr'ing application text (not scanned, NOT captchas)

I'd like to interface an application by reading the text it displays.
I've had success in some applications when windows isn't doing any font smoothing by typing in a phrase manually, rendering it in all windows fonts, and finding a match - from there I can map each letter image to a letter by generating all letters in the font.
This won't work if any font smoothing is being done, though, either by Windows or by the application. What's the state of the art like in OCRing computer-generated text? It seems like it should be easier than breaking CAPTCHAs or OCRing scanned text. Where can I find resources about this? So far I've only found articles on CAPTCHA breaking or OCRing scanned text.
I prefer solutions easily accessible from Python, though if there's a good one in some other lang I'll do the work to interface it.

I'm not exactly sure what you mean, but I think just reading the text with an OCR program would work well.
Tesseract is amazingly accurate for scanned documents, so a specific font would be a breeze for it to read. Here's my Python OCR solution: Python OCR Module in Linux?.
But you could generate each character as an image and find the locations on the image. It (might) work, but I have no idea how accurate it would be with smoothing.

Designing an open source OCR engine specifically for rendered text (screenshots)

So my current personal project is to be able to automatically grab screenshots out of a game, OCR the text, and count the number of occurrences of given words.
Having spent all evening looking around at different OCR solutions, I've come to realize that the majority of OCR packages out there are designed for scanned text. If there are any packages that can read screen text reliably, they're well outside this hobbyist's budget.
I've been reading through some other questions, and the closest I found was OCR engines designed for screen-reading.
It seems to me that reading rendered text should be much easier than printed and scanned text. Lines are always straight, and any given letter will always appear with the exact same pixel representation (mostly, anyways). Also, why not use the actual font file (if you have it) as a cheat sheet to recognizing characters? We might actually reach 100% accuracy with a system like this.
Assuming you have the font file for a cheat sheet and your source image is perfectly square and has no noise, how would you go about recognizing characters from the screen?
(Problems I can foresee are ui lines and images that could confuse any crude attempt at pixel-guessing.)
If you already know of a free/open-source OCR package designed for screen-reading, please let me know. I kind of doubt that's going to show up though, as no other askers seem to have gotten a lead either.
A Python interface is preferred, but beggars can't be choosers.
EDIT:
To clarify, I'm looking for design suggestions for an OCR solution that is specifically designed to read text from screenshots. Popular tools like tesseract (mentioned in the question I linked) are hard to use at best because they are not designed for this kind of source file.

So I've been thinking about it and I feel that the best approach will be to count the number of pixels in each blob/glyph/character. This should really cut down on the number of tests I need to do to differentiate between glyphs.
Regretfully, I'll have to be very specific about fonts. The software will only be able to recognize fonts at the right dpi, for the right font face and weight, etc.
It isn't ideal, and I'd still like to see someone who knows more about this stuff design OCR for rendered text; but it will work for my limited case.

If your goal is to count occurrences of certain events in a game, OCR is really not the right way to be going about it. That said, if you are determined to use OCR, then tesseract-OCR is a well-known open source package for performing optical character recognition. I'm not really sure what you are getting at with respect to scanned vs. rendered text, but tesseract will probably do as good a job as any opensource package that is available. OCR is still a tricky art, so I wouldn't expect 100% accuracy.

This isn't exactly what you want, but you may want to look at Sikuli.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.