ocr'ing application text (not scanned, NOT captchas)

I'd like to interface an application by reading the text it displays.
I've had success with some applications when Windows isn't doing any font smoothing: I type in a phrase manually, render it in every Windows font, and look for a match. From there I can map each letter image to a letter by generating all the letters in that font.
This won't work if any font smoothing is being done, though, either by Windows or by the application. What's the state of the art like in OCRing computer-generated text? It seems like it should be easier than breaking CAPTCHAs or OCRing scanned text. Where can I find resources about this? So far I've only found articles on CAPTCHA breaking or OCRing scanned text.
I prefer solutions easily accessible from Python, though if there's a good one in some other lang I'll do the work to interface it.

I'm not exactly sure what you mean, but I think just reading the text with an OCR program would work well.
Tesseract is amazingly accurate for scanned documents, so a specific font would be a breeze for it to read. Here's my Python OCR solution: "Python OCR Module in Linux?".
But you could generate each character as an image and find the locations on the image. It (might) work, but I have no idea how accurate it would be with smoothing.

Related

Searching Information at PDFs

I am a beginner in Python and programming, and I am trying to solve a problem.
I have a list of keywords that I want to look for in some folders containing a lot of PDFs. The PDFs are not character-based; they are image-based (they contain text as images). In other words, they were scanned with a scanner in the first decade of the 2000s, so I cannot search for a word in a PDF file, and Windows search etc. does not help. I can only check them by eye, which is time-consuming and boring.
I researched the question on the internet and found some solutions. Based on them, I wrote some code in Python. It works, but the success rate is a bit low.
First, my code converts the PDF file to image files (PyMuPDF package).
Second, it reads the text in these images and builds a string from it (PIL and pytesseract packages).
Finally, it searches this text for the keywords and returns True if a keyword is found.
Example:
keyword_list = ["a-001", "b-002", "c-003"]
pdf_list = ["a.pdf", "b.pdf", "c.pdf", ...., "z.pdf"]
The code should find a-001 in a.pdf, because I checked by eye and a.pdf contains a-001. It did find it.
The code should also find b-002 in b.pdf, because I checked by eye and b.pdf contains b-002. It could not find it.
So my code's success rate is 50%. When it finds a keyword, it reports the right PDF file; I have no problem there. The reported PDF really contains what I am looking for. But sometimes it fails to detect a keyword that I can clearly see in the PDF file.
Do you have a better idea for solving this problem more accurately? I am not chasing a 100% success rate; that is impossible, because some PDFs contain handwriting. But most of them contain computer-printed text, which should be detected. Can I raise the success rate to 75%?

Your best chance is to extract the image at the highest possible resolution, which might mean not "converting the PDF to an image" but rather "parsing the PDF and extracting the image stream" (given that it was scanned in the 2000s, it is probably a TIFF stream, at that). PyMuPDF can extract embedded images this way.
You can perhaps try and further improve the image by adjusting brightness, contrast and using filters such as "despeckling". With poorly scanned images I have had good results with sharpening filters, and there are some filters ("erode" and "washout") that might improve poor typewriting (I remember some "e"'s where the eye of the "e" was almost completely dark, and they got easily mistaken for "c"'s).
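The clean-up steps above can be sketched with Pillow, which the question already uses. This is only a minimal sketch: the function name `preprocess_for_ocr` is made up for illustration, and the particular filters, their order, and the 1% cutoff are assumptions to tune against your own scans, not a known-good recipe.

```python
from PIL import Image, ImageFilter, ImageOps


def preprocess_for_ocr(img):
    """Return a cleaned-up grayscale copy of a scanned page image."""
    gray = ImageOps.grayscale(img)
    # Stretch the histogram so faint print uses the full 0-255 range,
    # ignoring the darkest and brightest 1% of pixels (assumed cutoff).
    gray = ImageOps.autocontrast(gray, cutoff=1)
    # A 3x3 median filter removes isolated specks ("despeckling").
    gray = gray.filter(ImageFilter.MedianFilter(size=3))
    # Sharpen the edges of the remaining strokes.
    return gray.filter(ImageFilter.SHARPEN)


if __name__ == "__main__":
    # Synthetic stand-in for a scanned page: grey background, dark block.
    page = Image.new("RGB", (40, 20), (200, 200, 200))
    page.paste((0, 0, 0), (5, 5, 20, 15))
    cleaned = preprocess_for_ocr(page)
    print(cleaned.mode, cleaned.size)
```

The returned image can be passed to the OCR step in place of the raw page image. It is worth comparing OCR output with and without each filter, since sharpening can also amplify scanner noise.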
Then train Tesseract to improve the recognition rate. I am not sure how this can be done through the Python interface, though.
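Independently of image quality, the keyword search step from the question can be made more tolerant of typical OCR confusions (letter O read for zero, l or I read for 1, long dashes read for hyphens). The following is a rough sketch using only the standard library. The confusion table and the function names are assumptions for illustration, and the normalization trades some false positives for fewer misses.

```python
import re

# Assumed table of characters that OCR commonly confuses in part numbers.
# Text is lower-cased first, so "O" and "I" are covered by "o" and "i".
_CONFUSIONS = str.maketrans({
    "o": "0", "l": "1", "i": "1", "|": "1",
    "\u2010": "-", "\u2013": "-", "\u2014": "-", "_": "-",
})


def normalize(text):
    """Lower-case, fold look-alike characters, and drop all whitespace."""
    text = text.lower().translate(_CONFUSIONS)
    return re.sub(r"\s+", "", text)


def find_keywords(ocr_text, keywords):
    """Return {keyword: True/False} for a normalized substring search."""
    haystack = normalize(ocr_text)
    return {kw: normalize(kw) in haystack for kw in keywords}


if __name__ == "__main__":
    # "B\u2014OO2" imitates an OCR misreading of "b-002".
    print(find_keywords("Drawing no: B\u2014OO2 rev", ["a-001", "b-002"]))
```

Because both the OCR text and the keyword go through the same folding, a keyword such as b-002 still matches when Tesseract returns capital O's or a long dash.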

How does Photoshop convert type format to a rasterized layer?

I have been thinking of fonts quite recently. I find the whole process of a keystroke converted to a character displayed in a particular font quite fascinating. What fascinates me more is that each character is not an image but just the right bunch of pixels switched on (or off).
In Photoshop when I make a text layer, I am assuming it's like any other text layer in a word processor. There's a glyph attached to a character and that is displayed. So technically it's still not an 'image' so to speak and it can be treated as text in a word processor. However, when you rasterize the text layer, an image of the text is created with the font that was used. Can somebody tell me how Photoshop does this? I am assuming there should be a lookup table with the characters' graphics which Photoshop accesses to rasterize the layer.
I want to kind of create a program where I generate an image of the character that I am pressing (in C or Python or something like that). Is there a way to do this?

Adobe currently has publicly accessible documentation for the Photoshop file format. I've needed to extract information from PSD files (about a year ago, but actually the ancient CS2 version of Photoshop) so I can warn you that this isn't light reading, and there are some parts (at least in the CS2 documentation) that are incomplete or inaccurate. Usually, even when you have file format documentation, you need to do some reverse engineering to work with that file format.
Even so, see here for info about the TySh chunk from Photoshop 6.0 (not sure at a quick glance if it's still the current form for text - "type" to Photoshop).
Anyway, yes - text is stored as a sequence of character codes in memory and in the file. Fonts are basically collections of vector artwork, so that text can be converted to vector paths. That can be done either by dealing with the font files yourself, using an operating system call (there's definitely one for Windows, but I don't remember the name, it's bugging me now so I might figure it out later), or using a library.
Once you have the vector form, that's basically Bezier paths just like any other vector artwork, and can be rendered the same way.
Or to go directly from text to pixels, you just ask e.g. Windows to draw the text for you - perhaps to a memory DC (device context) if you don't want to draw to the screen.
FreeType is an open source library for working with fonts. It can definitely render to a bitmap. I haven't checked but it can probably convert text to vector paths too - after all it needs to do that as part of rendering to pixels anyway.
Cairo is another obvious library to look at for font handling and much more, but I've never used it directly myself.
wxWidgets is yet another obvious library to look at, and uses a memory-DC scheme similar to that for Windows, though I don't remember exact class/method names. Converting text to vectors might be outside wxWidgets scope, though.
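For the "generate an image of the character I am pressing" part in Python, one more option is Pillow, which uses FreeType to rasterize TrueType fonts. Below is a minimal sketch; the helper names `render_text` and `to_ascii` and the fixed canvas size are assumptions for illustration. It uses Pillow's built-in default font so that it runs without any font file.

```python
from PIL import Image, ImageDraw, ImageFont


def render_text(text, font=None, canvas=(256, 64)):
    """Rasterize text to an 8-bit grayscale image cropped to its ink."""
    font = font or ImageFont.load_default()
    img = Image.new("L", canvas, 0)  # black canvas
    ImageDraw.Draw(img).text((4, 4), text, fill=255, font=font)
    bbox = img.getbbox()  # bounding box of all non-zero pixels
    return img.crop(bbox) if bbox else img


def to_ascii(img, threshold=127):
    """Show which pixels are switched on, one text row per pixel row."""
    return "\n".join(
        "".join("#" if img.getpixel((x, y)) > threshold else "."
                for x in range(img.width))
        for y in range(img.height)
    )


if __name__ == "__main__":
    print(to_ascii(render_text("A")))
```

To use a specific font file instead, pass `ImageFont.truetype(path, size)` as the `font` argument, where `path` points to a .ttf or .otf file on your system.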

How to use PIL (pillow) to draw text in any language?

I'm rendering user input text on a background image with Python PIL (I'm using Pillow).
The code is simple:
draw = ImageDraw.Draw(im)
draw.text((x, y), text, font=font, fill=font_color)
The problem is that the user may input text in any language. How can I determine which font to use?
PS: I know I have to have font files first, so I searched and found Google Noto, downloaded all the fonts, and put them in /usr/local/share/fonts/. But these fonts are separated by language, so I still can't load a single font that can render all user input text.

Noto (which is literally just Adobe's Source Pro fonts with a different name because it's easier for Google to market it that way) isn't a single font, it's a family of fonts. When you go to download them, Google explicitly tells you that there are lots of different versions for lots of different target languages, for the two simple reasons that:
if you need to typeset the entire planet's set of known scripts, there are vastly more glyphs than fit in a single font (OpenType fonts have a hard limit of 65535 glyphs per file due to the fact that glyph IDs are encoded as USHORT fields. And fonts are compositional: the "letter" ℃ can actually be the letter C and the symbol °, so it relies on three glyphs: two real glyphs, and one virtual composition. You run out of space real fast that way), and
even if a font could fit all the glyphs, the same script may need to be rendered rather differently depending on the language it's used for, so even having a single font for both Chinese and Japanese, or for Arabic and Urdu, simply doesn't work. While OpenType fonts can cope with that by being told which variation sets to use, and which compositional rules to apply based on specific language tags, that is the kind of control that works great in InDesign or LaTeX, and is the worst thing for fonts that are going to be used in control-less contexts (like an Android webview, for instance).
So the proper solution is to grab all the fonts, and then pick the right one based on the {script, language} pair you're generating text for. Is that more complicated than what you're trying to do? Yes. Is it necessary? Equally yes =)
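As a rough sketch of picking a font from a {script, language} pair, the standard library's `unicodedata` can give a crude script guess from the first word of each character's Unicode name. Everything in the table below is an assumption for illustration: the font file names are placeholders for whatever Noto files you actually downloaded, and the language tag has to come from somewhere else (user settings, a language detector), because it cannot be derived from the characters alone.

```python
import unicodedata
from collections import Counter

# Hypothetical (script, language) -> font file table.
# A language of None is the fallback for that script.
FONTS = {
    ("LATIN", None): "NotoSans-Regular.ttf",
    ("CYRILLIC", None): "NotoSans-Regular.ttf",
    ("ARABIC", None): "NotoSansArabic-Regular.ttf",
    ("ARABIC", "ur"): "NotoNastaliqUrdu-Regular.ttf",
    ("CJK", None): "NotoSansCJKsc-Regular.otf",
    ("CJK", "ja"): "NotoSansCJKjp-Regular.otf",
    ("HIRAGANA", None): "NotoSansCJKjp-Regular.otf",
    ("KATAKANA", None): "NotoSansCJKjp-Regular.otf",
    ("HANGUL", None): "NotoSansCJKkr-Regular.otf",
}
DEFAULT_FONT = "NotoSans-Regular.ttf"


def dominant_script(text):
    """Guess the most common script from Unicode character names."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, spaces
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1  # e.g. "CYRILLIC", "CJK"
    return counts.most_common(1)[0][0] if counts else "LATIN"


def pick_font(text, lang=None):
    """Look up (script, lang) first, then the script-only fallback."""
    script = dominant_script(text)
    return FONTS.get((script, lang)) or FONTS.get((script, None), DEFAULT_FONT)


if __name__ == "__main__":
    print(pick_font("Hello"))
    print(pick_font("\u4e2d\u6587", lang="ja"))
```

The chosen file name can then be loaded with `ImageFont.truetype(path, size)` and passed to `draw.text`. Text that mixes scripts would need to be split into runs and drawn run by run, which this sketch does not attempt.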

Extract text from screen in python

Is there a library etc for extracting text from a png bitmap screen shot?
It is for an automation tool and would (for example) need to be able to read buttons etc. I've checked Tesseract, but it seems to be made for pictures, not computer screen fonts.

If you're dealing with a small amount of possible matches (i.e.: you want to recognize two or three different buttons), the simplest way is to isolate those in a previous screenshot, save them to individual files, and then use some form of template matching, which is quite easy in opencv.
If, however, you need to actually perform recognition of the button text, you're going to need an OCR engine. Tesseract is a good candidate, if you can get it trained for your font (it's a lengthy process). As you mention, you'll need to do this if you're dealing with a small font, which Tesseract is not originally trained to recognize. If you can't, there are a couple of other engines usable from Python, like Ocropus.
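To illustrate the template-matching idea without any dependencies, here is a pure-Python sketch of exact matching on tiny bitmaps, where each pixel is one character of a string. The data and the function name `find_template` are made up for illustration. In practice OpenCV's `cv2.matchTemplate` does the same job on real images, is much faster, and tolerates small pixel differences.

```python
def find_template(screen, template):
    """Yield (row, col) of every exact occurrence of template in screen.

    Both arguments are lists of equal-length strings, one character
    per pixel ("#" = set, "." = clear).
    """
    th, tw = len(template), len(template[0])
    for r in range(len(screen) - th + 1):
        for c in range(len(screen[0]) - tw + 1):
            # Compare the template row by row against the window at (r, c).
            if all(screen[r + i][c:c + tw] == template[i] for i in range(th)):
                yield (r, c)


# Hypothetical 10x5 "screenshot" containing the same small shape twice.
SCREEN = [
    "..........",
    ".##....##.",
    ".#.....#..",
    ".##....##.",
    "..........",
]
TEMPLATE = ["##", "#.", "##"]

if __name__ == "__main__":
    print(list(find_template(SCREEN, TEMPLATE)))  # [(1, 1), (1, 7)]
```

Exact matching like this only works when the button is rendered pixel-for-pixel identically every time, which is often true for unsmoothed UI elements but not for antialiased text.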

Designing an open source OCR engine specifically for rendered text (screenshots)

So my current personal project is to be able to automatically grab screenshots out of a game, OCR the text, and count the number of occurrences of given words.
Having spent all evening looking around at different OCR solutions, I've come to realize that the majority of OCR packages out there are designed for scanned text. If there are any packages that can read screen text reliably, they're well outside this hobbyist's budget.
I've been reading through some other questions, and the closest I found was OCR engines designed for screen-reading.
It seems to me that reading rendered text should be much easier than printed and scanned text. Lines are always straight, and any given letter will always appear with the exact same pixel representation (mostly, anyways). Also, why not use the actual font file (if you have it) as a cheat sheet to recognizing characters? We might actually reach 100% accuracy with a system like this.
Assuming you have the font file for a cheat sheet and your source image is perfectly square and has no noise, how would you go about recognizing characters from the screen?
(Problems I can foresee are UI lines and images that could confuse any crude attempt at pixel-guessing.)
If you already know of a free/open-source OCR package designed for screen-reading, please let me know. I kind of doubt that's going to show up though, as no other askers seem to have gotten a lead either.
A Python interface is preferred, but beggars can't be choosers.
EDIT:
To clarify, I'm looking for design suggestions for an OCR solution that is specifically designed to read text from screenshots. Popular tools like tesseract (mentioned in the question I linked) are hard to use at best because they are not designed for this kind of source file.

So I've been thinking about it and I feel that the best approach will be to count the number of pixels in each blob/glyph/character. This should really cut down on the number of tests I need to do to differentiate between glyphs.
Regretfully, I'll have to be very specific about fonts. The software will only be able to recognize fonts at the right dpi, for the right font face and weight, etc.
It isn't ideal, and I'd still like to see someone who knows more about this stuff design OCR for rendered text; but it will work for my limited case.
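A minimal sketch of that pixel-counting idea, assuming the glyphs have already been segmented into separate blobs: index the known glyph bitmaps by their number of set pixels, then compare a blob only against the glyphs that share its count. The 3x5 bitmaps and the function names below are hypothetical; real ones would be rendered from the actual font at the right size and weight.

```python
from collections import defaultdict

# Hypothetical 3x5 glyph bitmaps ("#" = set pixel).
GLYPHS = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
    "O": ("###", "#.#", "#.#", "#.#", "###"),
}


def ink(bitmap):
    """Number of set pixels in a bitmap."""
    return sum(row.count("#") for row in bitmap)


def build_index(glyphs):
    """Group glyphs by pixel count so lookups test few candidates."""
    index = defaultdict(list)
    for char, bitmap in glyphs.items():
        index[ink(bitmap)].append((char, bitmap))
    return index


def classify(blob, index):
    """Return the matching character, or None if the blob is unknown."""
    for char, bitmap in index.get(ink(blob), []):
        if bitmap == tuple(blob):  # exact comparison breaks ties
            return char
    return None


if __name__ == "__main__":
    index = build_index(GLYPHS)
    print(classify(("###", ".#.", ".#.", ".#.", ".#."), index))  # T
```

Note that "L" and "T" both have 7 set pixels here, so the pixel count alone is not enough; it narrows the candidates, and a full bitmap comparison decides among them.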

If your goal is to count occurrences of certain events in a game, OCR is really not the right way to go about it. That said, if you are determined to use OCR, then tesseract-OCR is a well-known open-source package for performing optical character recognition. I'm not really sure what you are getting at with respect to scanned vs. rendered text, but tesseract will probably do as good a job as any open-source package that is available. OCR is still a tricky art, so I wouldn't expect 100% accuracy.

This isn't exactly what you want, but you may want to look at Sikuli.
