I'm parsing some PDF files using the pdfminer library.
I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.
Is there a way to identify whether text is visible, since OCR software does place it on the page so that it can be selected?
Generally, the problem is distinguishing between two very different but similar-looking cases.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.
Your question is a bit confusing so I'm not really sure what is going to help you the most. However, you describe two ways to "hide" text from OCR. Both I think are detectable but one is much easier than the other.
Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it. Some additional comments:
Theoretically it could be something other than an image hiding it, but in your OCR case I would guess it's always an image.
Though an image may be overlapping it, it may also be transparent in some way. In that case, the text that is underneath may still shine through. In your case of a general OCR engine, probably not likely.
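The overlap check described above can be sketched with pdfminer.six's layout analysis. Note an assumption worth verifying: pdfminer's layout tree does not expose stacking order directly, so this only detects that text and an image overlap, not which one is on top. The geometry helper is pure Python so it runs without a PDF.

```python
def bboxes_overlap(a, b):
    """True if two (x0, y0, x1, y1) boxes intersect."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def text_overlapped_by_image(path):
    """Yield (page_number, text_bbox) for text boxes that an image overlaps.

    Sketch only: does not account for stacking order or transparency.
    """
    from pdfminer.high_level import extract_pages  # pdfminer.six
    from pdfminer.layout import LTImage, LTTextContainer, LTFigure

    for pageno, layout in enumerate(extract_pages(path), start=1):
        images, texts = [], []
        stack = list(layout)
        while stack:
            obj = stack.pop()
            if isinstance(obj, LTImage):
                images.append(obj.bbox)
            elif isinstance(obj, LTTextContainer):
                texts.append(obj.bbox)
            elif isinstance(obj, LTFigure):  # images often sit inside figures
                stack.extend(obj)
        for t in texts:
            if any(bboxes_overlap(t, i) for i in images):
                yield pageno, t
```

To decide "scanned page with OCR text behind it", you would additionally check that the overlapping image covers most of the page area.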
Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled + outlined, or invisible (there are other possibilities yet). In the PDF file you posted, you find this fragment:
BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
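Because the render mode is set by a literal `3 Tr` in the content stream, a quick (admittedly crude) check is to scan each page's decoded content stream for a `Tr` operator with operand 3. The regex below is a simplification that could be fooled by string literals containing `3 Tr`; a robust version would use a real content-stream tokenizer, and obtaining the decompressed stream in the first place is library-specific.

```python
import re

# Matches an integer operand followed by the Tr (text rendering mode) operator.
TR_PATTERN = re.compile(rb"\b(\d+)\s+Tr\b")

def has_invisible_text(content_stream: bytes) -> bool:
    """True if the (decoded) content stream sets render mode 3 (invisible)."""
    return any(m.group(1) == b"3" for m in TR_PATTERN.finditer(content_stream))

# Fragment from the posted file, abbreviated:
fragment = b"BT\n3 Tr\n0.00 Tc\n/F3 8.5 Tf\n(Chicken ) Tj\nET"
```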
It's worthwhile mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers that I looked at at some point didn't allow text selection if the text was "behind" the image.
I don't have a copy of the PDF 1.7 specification, but I suspect that the objects on a page are rendered in order, that is, the preceding objects end up covered up by succeeding objects.
Thus, you would have to iterate through the layout objects (See Performing Layout Analysis) and calculate where everything falls on the page, their dimensions, and their rendering order (and possibly their transparency).
As the pdfminer documentation mentions, PDF is evil.
Related
I am trying to put together a script to fix a large number of PDFs that have been exported from Autocad via their DWG2PDF print driver.
When using this driver all SHX fonts are rendered as shape data instead of text data, they do however have a comment inserted into the PDF at the expected location with the expected text.
So far in my script I have got it to run through the PDF and insert hidden text on top of each section, with the text squashed to the size of the comment; this gets me 90% of the way and gives me a document that is searchable.
Unfortunately the sizing of the comment regions is relatively coarse (integer based), which makes it difficult to accurately determine the orientation of short text, and results in unevenly sized boxes around text.
What I would like to be able to do is parse through the shape data in the PDF, collect anything within the bounds of the comment, and then determine a smaller and more accurate bounding box. However all the information I can find is by people trying to parse through text data, and I haven't been able to find anything at all in terms of shape data.
The below image is an example of the raw text in the PDF; the second image shows the comment bounding box in blue, with the red text being what I set to hidden to make the document searchable and copy/paste-able. I can get things a little better by shrinking the box by a fixed margin, but with small text items the low resolution of the comment-box coordinate data messes things up.
To get this far I am using a combination of PyPDF2 and reportlab, but am open to moving to different libraries.
I didn't end up finding a solution with PyPDF2. I was able to find an easy way to iterate over shape data in pdfminer.six, but then couldn't find a nice way in pdfminer to extract annotation data.
As such I am using one library to get the annotations, one to look at the shape data, and last of all a third library to add the hidden text on the new pdf. It runs pretty slowly as sheet complexity increases but is giving me good enough results, see image below where the rough green borders as found in the annotations are shrunk to the blue borders surrounding the text. Of course I don't draw the boundaries, and use invisible text for the actual program output, giving pretty good selectable/searchable text.
If anyone is interested in looping over the shape data in PDFs the below snippet should get you started.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTLine, LTCurve

for page_layout in extract_pages("TestSchem.pdf"):
    for element in page_layout:
        if isinstance(element, (LTCurve, LTLine)):
            print(element.bbox)
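Building on that loop, the shrink-to-fit step described above is pure geometry: collect the shape bboxes that fall inside a (coarse) annotation rectangle and take their union. The slack parameter accounts for the integer-rounded annotation coordinates; the sample rectangle below is hypothetical.

```python
def union_bbox(boxes):
    """Union (x0, y0, x1, y1) of a list of bboxes; None if the list is empty."""
    if not boxes:
        return None
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def contains(outer, inner, slack=0.5):
    """True if `inner` lies within `outer`, allowing a little slack for the
    integer-rounded annotation coordinates."""
    return (inner[0] >= outer[0] - slack and inner[1] >= outer[1] - slack and
            inner[2] <= outer[2] + slack and inner[3] <= outer[3] + slack)

def tighten(annotation_rect, shape_bboxes):
    """Shrink a coarse annotation rect to the union of the shapes inside it."""
    inside = [b for b in shape_bboxes if contains(annotation_rect, b)]
    return union_bbox(inside) or annotation_rect
```

Feed it the `element.bbox` values from the pdfminer loop and the annotation rectangles from whichever library you use to read them.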
It seems there's no out-of-the-box way to do this (I've skimmed the docs), but just in case: if I want to draw a sentence to an image with PIL but want control over individual words (or even individual letters), how can I do that?
I'm also open to solutions that use a library other than PIL.
The image below shows the kind of effect I'm going for.
There is no direct way to do this. However, you can determine the text size - http://pillow.readthedocs.io/en/5.2.x/reference/ImageDraw.html#PIL.ImageDraw.PIL.ImageDraw.ImageDraw.textsize. (Note that in newer Pillow releases textsize is deprecated in favour of textlength and textbbox.)
You could draw the first part of your text, get the size of that text to calculate the new position for the next part, and so on. That way, you could style each part of the text in any manner you see fit.
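A minimal Pillow sketch of that approach: draw each word separately and use `draw.textlength` (Pillow >= 8.0; older versions used `textsize`) to advance the x position. The words and colours are placeholders; for real use you would load a TrueType font.

```python
from PIL import Image, ImageDraw, ImageFont

img = Image.new("RGB", (400, 60), "white")
draw = ImageDraw.Draw(img)
font = ImageFont.load_default()  # swap in ImageFont.truetype(...) for real use

x, y = 10, 20
positions = []
for word, colour in [("Hello", "red"), ("brave", "green"), ("world", "blue")]:
    positions.append(x)
    draw.text((x, y), word, fill=colour, font=font)
    x += draw.textlength(word + " ", font=font)  # advance past word + space
```

Per-letter control works the same way, advancing by each glyph's length; for effects like per-letter rotation you would render each glyph onto its own small image and paste it back rotated.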
I've been hacking away at this for a couple of days now, but haven't been able to find a solution that is satisfactory. Essentially, my goal is to find the bounding boxes of characters from PDF to eventually use as training data for an OCR system. This means I need clear and consistent bounding box extraction from generated PDFs (like those at arxiv which actually have text information in them, hence the ability to highlight with cursor). I've been mainly working with python and PDFMiner.
Most of the solutions I've seen go no lower than lines of text, and the issue I had there was that PDFs have such varying structures that even this wasn't reliable. I've been able to get bounding boxes of characters through HTML using pdftotext, but the boxes were mis-sized, most often cutting off the tails of characters, which are crucial for OCR training.
Thanks!
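For reference, pdfminer.six does expose per-character boxes as `LTChar.bbox` once you descend the layout tree; whether those boxes hug the inked glyph tightly enough for OCR training is exactly the open question above, since they come from font metrics rather than rendered pixels. The padding helper is pure Python; the traversal is a hedged sketch.

```python
def pad_bbox(bbox, margin):
    """Grow an (x0, y0, x1, y1) box by `margin` on every side."""
    x0, y0, x1, y1 = bbox
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)

def char_bboxes(path):
    """Yield (character, bbox) pairs for every LTChar in the document.

    Caveat: LTChar boxes derive from font metrics, so descenders and other
    glyph parts may still fall outside them; padding is a crude mitigation.
    """
    from pdfminer.high_level import extract_pages  # pdfminer.six
    from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

    for layout in extract_pages(path):
        for tbox in layout:
            if not isinstance(tbox, LTTextContainer):
                continue
            for line in tbox:
                if not isinstance(line, LTTextLine):
                    continue
                for obj in line:
                    if isinstance(obj, LTChar):
                        yield obj.get_text(), obj.bbox
```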
I have been thinking of fonts quite recently. I find the whole process of a keystroke converted to a character displayed in a particular font quite fascinating. What fascinates me more is that each character is not an image but just the right bunch of pixels switched on (or off).
In Photoshop when I make a text layer, I am assuming it's like any other text layer in a word processor. There's a glyph attached to a character and that is displayed. So technically it's still not an 'image' so as to speak and it can be treated as a text in a word processor. However, when you rasterize the text layer, an image of the text is created with the font that was used. Can somebody tell me how Photoshop does this? I am assuming there should be a lookup table with the characters' graphics which Photoshop accesses to rasterize the layer.
I want to kind of create a program where I generate an image of the character that I am pressing (in C or Python or something like that). Is there a way to do this?
Adobe currently has publicly accessible documentation for the Photoshop file format. I've needed to extract information from PSD files (about a year ago, but actually the ancient CS2 version of Photoshop) so I can warn you that this isn't light reading, and there are some parts (at least in the CS2 documentation) that are incomplete or inaccurate. Usually, even when you have file format documentation, you need to do some reverse engineering to work with that file format.
Even so, see here for info about the TySh chunk from Photoshop 6.0 (not sure at a quick glance if it's still the current form for text - "type" to Photoshop).
Anyway, yes - text is stored as a sequence of character codes in memory and in the file. Fonts are basically collections of vector artwork, so that text can be converted to vector paths. That can be done either by dealing with the font files yourself, using an operating system call (there's definitely one for Windows, but I don't remember the name; it's bugging me now, so I might figure it out later), or using a library.
Once you have the vector form, that's basically Bezier paths just like any other vector artwork, and can be rendered the same way.
Or to go directly from text to pixels, you just ask e.g. Windows to draw the text for you - perhaps to a memory DC (device context) if you don't want to draw to the screen.
FreeType is an open source library for working with fonts. It can definitely render to a bitmap. I haven't checked but it can probably convert text to vector paths too - after all it needs to do that as part of rendering to pixels anyway.
Cairo is another obvious library to look at for font handling and much more, but I've never used it directly myself.
wxWidgets is yet another obvious library to look at, and uses a memory-DC scheme similar to that for Windows, though I don't remember exact class/method names. Converting text to vectors might be outside wxWidgets scope, though.
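In Python, Pillow (which uses FreeType under the hood for TrueType fonts) gets you from a key press to pixels in a few lines. This sketch uses the built-in bitmap default font so it runs anywhere; in practice you would use `ImageFont.truetype("somefont.ttf", size)` (filename is a placeholder).

```python
from PIL import Image, ImageDraw, ImageFont

def char_image(ch, size=(32, 32)):
    """Render a single character onto a fresh greyscale image."""
    img = Image.new("L", size, 255)  # white background
    ImageDraw.Draw(img).text((4, 4), ch, fill=0, font=ImageFont.load_default())
    return img

glyph = char_image("A")
```

Hooking this up to key presses is then just a matter of calling `char_image` with whatever character your input layer reports.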
I'm trying to write what more or less accounts for a PDF soft proof.
There are a few pieces of information that I would like to extract, but I have no clue how.
What I need to extract:
Bleed: I got this somewhat working with pyPdf, given that the document uses 72 dpi, which sadly isn't always the case. I need to be able to calculate the bleed in millimeters.

Print resolution (dpi): If I read the PDF spec[1] correctly this ought to always be 72 dpi, unless a page has UserUnit set, which was only introduced in PDF-1.6, but shouldn't print documents always be at least 300 dpi? I'm afraid that I misunderstood something… I'd also need the print resolution for images, if they can differ from the default page resolution, that is.

Text color: I don't have the slightest clue on how to extract this; the string 'text colour' only shows up once in the whole spec, without any explanation of how it is set.

Image colormodel: If I understand it correctly I can read this out in pyPdf with page['/Group']['/CS'], which can be:
- /DeviceRGB
- /DeviceCMY
- /DeviceCMYK
- /DeviceGray
- /DeviceRGBK
- /DeviceN

Font 'embeddedness': I read in another post on stackoverflow that I can just iterate over the font resources and if a resource has a '/FontFile' key that means that the font is embedded. Is this correct?
If other libs than pyPdf are better able to extract this info (or a combination
of them) they are more than welcome. So far I fumbled around with pyPdf, pdfrw
and pdfminer. All of which don't exactly have the most extensive documentation.
[1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
If I read the PDF spec[1] correctly this ought to always be 72 dpi, unless a page has UserUnit set, which was only introduced in PDF-1.6, but shouldn't print documents always be at least 300 dpi? I'm afraid that I misunderstood something…
You do misunderstand something. The default user space unit, which is 1/72 inch but can be changed on a per-page basis since PDF-1.6, does not define a print resolution; it merely defines what physical length one unit of user-space coordinates corresponds to by default (i.e. unless any size-changing transformation is active).
For printing, all data are converted into a device-dependent space whose resolution has nothing to do with the user space coordinates. Printing resolutions depend on the printing device and its drivers; they may also be limited by security settings that allow low-quality printing only.
I'd also need the print resolution for images, if they can differ from
the default page resolution, that is.
Images (well, bitmap images; PDF also has vector graphics) each come with their own resolution and may then be transformed (e.g. enlarged) before being rendered. For an "image printing resolution" you would therefore have to inspect each and every bitmap image and each and every page content in which it is inserted. And if the image is rotated, skewed and asymmetrically stretched, I wonder what number you would use as its resolution... ;)
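That caveat aside, for the common axis-aligned, unskewed case the effective resolution falls out of the image's pixel dimensions and the size at which it is placed on the page (in default user-space units of 1/72 inch):

```python
def effective_dpi(pixel_w, pixel_h, placed_w_pts, placed_h_pts):
    """Effective (x_dpi, y_dpi) of a bitmap placed at the given size in points.

    Only meaningful when the image is placed axis-aligned and unskewed;
    rotation or skew makes "the" resolution ill-defined.
    """
    return (pixel_w / (placed_w_pts / 72.0), pixel_h / (placed_h_pts / 72.0))
```

For example, a 600x600-pixel image placed in a 144x144-point (2x2-inch) square has an effective resolution of 300x300 dpi. The placed size comes from the current transformation matrix at the point where the image XObject is painted.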
Text color: I don't have the slightest clue on how to extract this, the string
'text colour' only shows up once in the whole spec, without any
explanation how it is set.
Have a look at section 9.2.3 in the spec:
The colour used for painting glyphs shall be the current colour in the
graphics state: either the nonstroking colour or the stroking colour
(or both), depending on the text rendering mode (see 9.3.6, "Text
Rendering Mode"). The default colour shall be black (in DeviceGray),
but other colours may be obtained by executing an appropriate
colour-setting operator or operators (see 8.6.8, "Colour Operators")
before painting the glyphs.
There you find a number of pointers to interesting sections. Be aware, though, that text is not simply coloured; it may also be rendered as a clip path applied to any background.
I read in another post on stackoverflow that I can just iterate over
the font resources and if a resource has a '/FontFile'-key that means
that the font is embedded. Is this correct?
I would advise a more precise analysis. There are other relevant keys, too, e.g. '/FontFile2' and '/FontFile3', and the correct one must be used.
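A sketch of that more precise check; the descriptor dict here stands in for whatever your library hands you for a font's /FontDescriptor, and the key names are the ones from the specification:

```python
# /FontFile: Type 1, /FontFile2: TrueType, /FontFile3: other formats
# (e.g. CFF/Type1C), per the PDF specification's font descriptor table.
EMBEDDED_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")

def is_embedded(font_descriptor):
    """True if a font descriptor dict carries an embedded font program."""
    return any(key in font_descriptor for key in EMBEDDED_KEYS)
```

Which of the three keys is the correct one depends on the font's subtype, so a thorough tool would also cross-check the key against the font dictionary's /Subtype entry.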
Don't underestimate your task... you should start by defining what each property you are after shall mean in a format like PDF, a mixed environment of rotated, stretched and skewed glyphs, vector graphics and bitmap images.