Text layout recognition with Python

I'm trying to sort through several thousand scanned files and file them into folders based on type (i.e., if a file is a scanned copy of formA, it should go in the formA folder; if it's a scanned copy of formB, it should go in the formB folder; and so on). I suspect the best way to match files to form types is by their text layout, but I'm totally new to image processing, so if there's a better approach, I'm all ears.
I'm working in Python. Any ideas on the best way to do this? PIL? OpenCV? ImageMagick?
Thanks in advance...

This library is probably of interest to you -
http://code.google.com/p/ocropus/
It's made by Googlers and lets you do OCR and layout analysis from Python.
I had some trouble installing it, but that was quite a while back, so things may have gotten fixed by now.

I don't know what format your scanned documents are in, but pdfminer can do layout analysis for PDFs. I'd guess it would fit the bill for your purpose, provided the documents are in reasonably decent PDF form (if you've just got "pure images", it won't do you any good).
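As a rough illustration, a layout pass with pdfminer looks something like this (a sketch assuming the modern pdfminer.six fork; "form.pdf" is a placeholder filename):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

# Dump each text block and its bounding box from the PDF's layout.
for page_layout in extract_pages("form.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            # bbox is (x0, y0, x1, y1) in page coordinates; the set of
            # block positions is a crude fingerprint of a form's layout.
            print(element.bbox, element.get_text().strip())

Comparing the set of block positions against a reference scan of each form type could then drive the folder sorting.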

Related

Searching for information in PDFs

I am a beginner in Python and the programming world, and I am trying to solve a problem.
I have a list of keywords. I want to search for these keywords in some folders that contain a lot of PDFs. The PDFs are not character based; they are image based (they contain text as images). In other words, the PDFs were scanned in the first decade of the 2000s, so I cannot search for a word in a PDF file, and Windows search etc. does not work. I can only check them with my eyes, and this is time consuming and boring.
I researched the question on the internet and found some solutions. Based on these, I tried to write some code in Python. It works, but the success rate is a bit low.
Firstly, my code converts the PDF file to image files (PyMuPDF package).
Secondly, it reads the text on those images and collects it into a string (PIL and pytesseract packages).
Finally, it searches that string for the keywords and returns True if one is found (the full pipeline is sketched below).
Example:
keyword_list = ["a-001", "b-002", "c-003"]
pdf_list = ["a.pdf", "b.pdf", "c.pdf", ...., "z.pdf"]
The code should find a-001 in a.pdf, because I checked by eye and a.pdf does contain a-001. The code indeed finds it.
The code should also find b-002 in b.pdf, because I checked by eye and b.pdf does contain b-002. The code could not find it.
So my code's success rate is about 50%. When it finds a match, it finds the right PDF file; I have no problem there, since the found PDF really contains what I am looking for. But sometimes it cannot detect a keyword in a PDF where I can see it clearly.
Do you have any better ideas to solve this problem more accurately? I am not chasing a 100% success rate; that is impossible, because some PDFs contain handwriting. But most of them contain printed text and should be detectable. Can I raise the success rate to 75%?
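For reference, a minimal sketch of the pipeline described above, assuming a recent PyMuPDF and pytesseract with the tesseract binary on the PATH (file names are the placeholders from the example):

import io
import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def find_keyword(pdf_path, keywords):
    doc = fitz.open(pdf_path)
    for page in doc:
        # Render at 300 dpi rather than the 72 dpi default; resolution
        # is usually the biggest lever for OCR accuracy.
        pix = page.get_pixmap(dpi=300)
        img = Image.open(io.BytesIO(pix.tobytes("png")))
        text = pytesseract.image_to_string(img).lower()
        for kw in keywords:
            if kw.lower() in text:
                return kw
    return None

print(find_keyword("a.pdf", ["a-001", "b-002", "c-003"]))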
Your best chance is to extract the image at the highest possible resolution, which might mean not "converting the PDF to an image" but rather parsing the PDF and extracting the embedded image stream (given it was scanned in the 2000s, it is probably a TIFF stream, at that). This can be done with PyMuPDF.
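A minimal sketch of that extraction, assuming a recent PyMuPDF (file names are placeholders):

import fitz  # PyMuPDF

# Pull the embedded scan out of each page instead of re-rendering the
# page, so no resolution is lost.
doc = fitz.open("b.pdf")
for page_index, page in enumerate(doc):
    for img in page.get_images(full=True):
        xref = img[0]                   # cross-reference number of the image
        info = doc.extract_image(xref)  # raw image bytes plus metadata
        ext = info["ext"]               # e.g. "tiff", "png", "jpeg"
        with open("page%d.%s" % (page_index, ext), "wb") as fh:
            fh.write(info["image"])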
You can perhaps further improve the image by adjusting brightness and contrast and applying filters such as despeckling. With poorly scanned images I have had good results with sharpening filters, and there are some filters ("erode" and "washout") that might improve poor typewriting (I remember some "e"s where the eye of the "e" was almost completely filled in, and they were easily mistaken for "c"s).
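A hedged sketch of that kind of clean-up with Pillow (the enhancement factors are arbitrary starting points to experiment with, not tuned values):

from PIL import Image, ImageEnhance, ImageFilter

img = Image.open("page0.tiff").convert("L")    # grayscale
img = ImageEnhance.Contrast(img).enhance(1.5)  # boost contrast
img = img.filter(ImageFilter.MedianFilter(3))  # crude despeckling
img = img.filter(ImageFilter.SHARPEN)          # sharpen strokes
img.save("page0_clean.tiff")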
Then train Tesseract to improve the recognition rate. I am not sure how this can be done from the Python interface, though.

How to conform text to a surface using a displacement map in Python?

I am working on a project where I need to programmatically add text to an image such that the text conforms to the underlying surface, so it looks much more realistic.
Here is an example of what I am talking about:
https://www.youtube.com/watch?v=huvysaySBrw
I am fairly new to programmatic image editing. I have gone through quite a bit of the documentation for ImageMagick and similar libraries, but haven't found anything that helps so far.
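One common approach is to offset each pixel of a rendered text layer by an amount sampled from a blurred grayscale copy of the surface photo. A rough sketch with Pillow and NumPy; the file names, font path, and strength constant are all assumptions:

import numpy as np
from PIL import Image, ImageDraw, ImageFilter, ImageFont

surface = Image.open("tshirt.jpg").convert("RGB")
w, h = surface.size

# Render the text onto a transparent layer the same size as the photo.
text_layer = Image.new("RGBA", (w, h), (0, 0, 0, 0))
draw = ImageDraw.Draw(text_layer)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 72)  # assumed font path
draw.text((w // 4, h // 2), "HELLO", font=font, fill=(255, 255, 255, 255))

# Displacement map: blurred luminance of the surface, centred on zero.
disp = np.asarray(surface.convert("L").filter(ImageFilter.GaussianBlur(4)),
                  dtype=np.float32)
disp = (disp - 128.0) / 128.0  # roughly -1 .. 1
strength = 12                  # maximum shift in pixels; tune to taste

# Sample the text layer at displaced coordinates.
src = np.asarray(text_layer)
ys, xs = np.mgrid[0:h, 0:w]
sx = np.clip((xs + disp * strength).astype(int), 0, w - 1)
sy = np.clip((ys + disp * strength).astype(int), 0, h - 1)
warped = Image.fromarray(src[sy, sx])

Image.alpha_composite(surface.convert("RGBA"), warped).save("out.png")

This is the same idea as ImageMagick's -displace operator; a real solution would also blend the text with the surface's shading to sell the effect.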

.asc viewer using Kivy

I want to develop a 3D file viewer in kivy and python that reads and displays .asc mesh files of the format:
x1,y1,z1
x2,y2,z2
........
xi,yi,zi
What I have thought of so far is to use a method similar to Processing's beginShape() to begin drawing a 3D shape, then use a for loop to append each point in turn.
I have also found the Kivy example that parses .obj files and then displays them. Do you have any ideas on how I can make a similar ascparser and try to display my files?
Any help is greatly appreciated
Your best strategy at the moment is probably to read the objparser code and try to understand what it is doing. The important thing is building a list of points and normals, which are passed to OpenGL via a Mesh with a custom vertex_format and custom shaders. In principle it wouldn't be very hard to do the same thing for your own file type just by comparison with the .obj code, though you will need some understanding of what's going on (you can read about OpenGL and read the Kivy source, if you haven't already) to make significant changes.
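A minimal sketch of such an ascparser, assuming the comma-separated x,y,z format from the question (in a real app the Mesh belongs inside a widget's canvas, and the custom-shader wiring from the obj example still applies):

from kivy.graphics import Mesh

def load_asc(path):
    # One "x,y,z" triple per line -> flat [x1, y1, z1, x2, y2, z2, ...]
    vertices = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                vertices.extend(float(v) for v in line.split(","))
    return vertices

vertices = load_asc("model.asc")
mesh = Mesh(
    vertices=vertices,
    indices=list(range(len(vertices) // 3)),
    fmt=[(b"v_pos", 3, "float")],  # one 3-float position per vertex
    mode="points",
)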
This is really an advanced topic right now: Kivy has very few pre-built wrappers for 3D OpenGL rendering. The backend is fully capable (which is why the 3D rendering example isn't that complex, for instance), but you probably do need some understanding of what's going on to accomplish a task like yours.
There are also a few other examples of 3d rendering in Kivy, which you might find helpful. nskrypnik has several repositories doing just this (see kivy-trackball, kivy-3dpicking, kivy-rotation3d), and seems to have begun implementing a proper 3d api in the kivy3 repo, though this is not complete and I suggest it as something you can learn about by reading, not something that can necessarily do what you want right now. The other nice example I've seen is a 3d inspector POC by tito, though it's just a proof of concept and not a polished product.

Extract text from screen in Python

Is there a library for extracting text from a PNG bitmap screenshot?
It is for an automation tool and would (for example) need to be able to read buttons. I've checked Tesseract, but it seems to be made for scanned pictures, not computer screen fonts.
If you're dealing with a small number of possible matches (i.e., you want to recognize two or three different buttons), the simplest way is to isolate those in a previous screenshot, save them to individual files, and then use some form of template matching, which is quite easy in OpenCV.
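A minimal sketch of that template matching with OpenCV's Python bindings (file names and the threshold are placeholders):

import cv2

screen = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)
button = cv2.imread("ok_button.png", cv2.IMREAD_GRAYSCALE)

# Slide the button template over the screenshot and score each position.
result = cv2.matchTemplate(screen, button, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)

if max_val > 0.9:  # tune the threshold for your UI
    print("button found at", max_loc)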
If, however, you need to actually recognize the button text, you're going to need an OCR engine. Tesseract is a good candidate if you can get it trained for your font (it's a lengthy process). As you mention, you'll need to do this if you're dealing with a small font, which Tesseract is not originally trained to recognize. If you can't, there are a couple of other engines usable from Python, like Ocropus.

Designing an open source OCR engine specifically for rendered text (screenshots)

So my current personal project is to be able to automatically grab screenshots out of a game, OCR the text, and count the number of occurrences of given words.
Having spent all evening looking around at different OCR solutions, I've come to realize that the majority of OCR packages out there are designed for scanned text. If there are any packages that can read screen text reliably, they're well outside this hobbyist's budget.
I've been reading through some other questions, and the closest I found was OCR engines designed for screen-reading.
It seems to me that reading rendered text should be much easier than reading printed-and-scanned text. Lines are always straight, and any given letter will always appear with the exact same pixel representation (mostly, anyway). Also, why not use the actual font file (if you have it) as a cheat sheet for recognizing characters? We might actually reach 100% accuracy with a system like this.
Assuming you have the font file for a cheat sheet and your source image is perfectly square and has no noise, how would you go about recognizing characters from the screen?
(Problems I can foresee are UI lines and images that could confuse any crude attempt at pixel-guessing.)
If you already know of a free/open-source OCR package designed for screen-reading, please let me know. I kind of doubt that's going to show up though, as no other askers seem to have gotten a lead either.
A Python interface is preferred, but beggars can't be choosers.
EDIT:
To clarify, I'm looking for design suggestions for an OCR solution specifically designed to read text from screenshots. Popular tools like Tesseract (mentioned in the question I linked) are hard to use at best, because they are not designed for this kind of source material.
So I've been thinking about it and I feel that the best approach will be to count the number of pixels in each blob/glyph/character. This should really cut down on the number of tests I need to do to differentiate between glyphs.
Regrettably, I'll have to be very specific about fonts. The software will only be able to recognize text rendered at the right DPI, in the right font face and weight, and so on.
It isn't ideal, and I'd still like to see someone who knows more about this stuff design OCR for rendered text; but it will work for my limited case.
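A hedged sketch of that pixel-count lookup with Pillow; the font path and size are assumptions, and glyphs that tie on count would need a secondary test such as an exact bitmap comparison:

from PIL import Image, ImageDraw, ImageFont
import string

font = ImageFont.truetype("game_font.ttf", 16)  # assumed font file
counts = {}
for ch in string.ascii_letters + string.digits:
    # Render the glyph alone and count its "ink" pixels.
    img = Image.new("L", (32, 32), 0)
    ImageDraw.Draw(img).text((0, 0), ch, font=font, fill=255)
    n = sum(1 for px in img.getdata() if px)
    counts.setdefault(n, []).append(ch)

# Glyphs sharing a pixel count are the ambiguous cases to resolve next.
for n, chars in sorted(counts.items()):
    if len(chars) > 1:
        print(n, chars)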
If your goal is to count occurrences of certain events in a game, OCR is really not the right way to be going about it. That said, if you are determined to use OCR, then tesseract-OCR is a well-known open source package for performing optical character recognition. I'm not really sure what you are getting at with respect to scanned vs. rendered text, but tesseract will probably do as good a job as any opensource package that is available. OCR is still a tricky art, so I wouldn't expect 100% accuracy.
This isn't exactly what you want, but you may want to look at Sikuli.
