I'm trying to write what more or less amounts to a PDF soft proof.
There are a few pieces of information I would like to extract, but I have no clue how.
What I need to extract:
Bleed: I got this somewhat working with pyPdf, given
that the document uses 72 dpi, which sadly isn't
always the case. I need to be able to calculate
the bleed in millimeters.
Print resolution (dpi): If I read the PDF spec[1] correctly this ought to
always be 72 dpi, unless a page has UserUnit set,
which was only introduced in PDF-1.6, but shouldn't
print documents always be at least 300 dpi? I'm
afraid that I misunderstood something…
I'd also need the print resolution for images, if
they can differ from the default page resolution,
that is.
Text color: I don't have the slightest clue how to extract
this; the string 'text colour' only shows up once
in the whole spec, without any explanation of how
it is set.
Image colormodel: If I understand it correctly I can read this out
in pyPdf with page['/Group']['/CS'] (see the sketch
after this list), which can be one of:
- /DeviceRGB
- /DeviceCMYK
- /DeviceGray
- /DeviceN
Font 'embeddedness': I read in another post on Stack Overflow that I
can just iterate over the font resources, and if a
resource has a '/FontFile' key that means the font
is embedded. Is this correct?
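A minimal sketch of that colormodel probing (as referenced in the list above), using pypdf, the current successor of pyPdf; the file name is a placeholder. Besides the page group's /CS it also reads each image XObject's /ColorSpace, since images carry their own colour spaces:

from pypdf import PdfReader

reader = PdfReader("example.pdf")
for number, page in enumerate(reader.pages, start=1):
    group = page.get("/Group")
    if group is not None:
        print(f"page {number} group colour space:", group.get_object().get("/CS"))
    resources = page.get("/Resources")
    xobjects = resources.get_object().get("/XObject") if resources else None
    if not xobjects:
        continue
    for name, xobject in xobjects.get_object().items():
        xobject = xobject.get_object()
        if xobject.get("/Subtype") == "/Image":
            print(" ", name, xobject.get("/ColorSpace"))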
If libraries other than pyPdf (or a combination of them) are better able to
extract this info, they are more than welcome. So far I have fumbled around
with pyPdf, pdfrw and pdfminer, none of which exactly has the most extensive
documentation.
[1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
If I read the PDF spec[1] correctly this ought to always be 72 dpi,
unless a page has UserUnit set, which was only introduced in PDF-1.6,
but shouldn't print documents always be at least 300 dpi? I'm afraid
that I misunderstood something…
You do misunderstand something. The default user space unit, which is 1/72 inch but can be changed on a per-page basis since PDF-1.6, does not define a print resolution; it merely defines what length one unit in user-supplied coordinates corresponds to by default (i.e. unless any size-changing transformation is active).
For printing, all data are converted into a device-dependent space whose resolution has nothing to do with the user space coordinates. Printing resolutions depend on the printing device and its drivers; they may even be limited by security settings that allow low-quality printing only.
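As a minimal sketch of that relationship (pypdf here; the file name is a placeholder): to express a length given in default user space units in millimetres, multiply by UserUnit × 25.4 / 72.

from pypdf import PdfReader

def units_to_mm(value, user_unit=1.0):
    # one default user space unit is 1/72 inch; /UserUnit (PDF 1.6+)
    # scales that per page; 1 inch = 25.4 mm
    return value * user_unit * 25.4 / 72.0

reader = PdfReader("example.pdf")
page = reader.pages[0]
user_unit = float(page.get("/UserUnit", 1))
print(units_to_mm(float(page.mediabox.width), user_unit), "mm wide")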
I'd also need the print resolution for images, if they can differ from
the default page resolution, that is.
Images (well, bitmap images; PDF also contains vector graphics) each come with their individual resolution and may then be transformed (e.g. enlarged) before being rendered. For an "image printing resolution" you would, therefore, have to inspect each and every bitmap image and each and every page content in which it is inserted. And if the image is rotated, skewed, and asymmetrically stretched, I wonder what number you will use as resolution... ;)
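If you still want a number per image, one hedged approach (which simply ignores rotation and skew, as warned above) is to compare each bitmap's pixel dimensions with the size at which it is placed on the page; pdfminer.six exposes both, and the file name below is a placeholder:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage

def iter_images(container):
    # images usually sit inside LTFigure containers
    for element in container:
        if isinstance(element, LTImage):
            yield element
        elif isinstance(element, LTFigure):
            yield from iter_images(element)

for page_layout in extract_pages("example.pdf"):
    for image in iter_images(page_layout):
        px_w, px_h = image.srcsize              # pixel dimensions of the bitmap
        pt_w, pt_h = image.width, image.height  # placed size in points
        if px_w and px_h and pt_w and pt_h:
            print(image.name,
                  round(px_w / (pt_w / 72.0)), "x",
                  round(px_h / (pt_h / 72.0)), "dpi")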
Text color: I don't have the slightest clue how to extract this; the string
'text colour' only shows up once in the whole spec, without any
explanation of how it is set.
Have a look at section 9.2.3 in the spec:
The colour used for painting glyphs shall be the current colour in the
graphics state: either the nonstroking colour or the stroking colour
(or both), depending on the text rendering mode (see 9.3.6, "Text
Rendering Mode"). The default colour shall be black (in DeviceGray),
but other colours may be obtained by executing an appropriate
colour-setting operator or operators (see 8.6.8, "Colour Operators")
before painting the glyphs.
There you find a number of pointers to interesting sections. Be aware, though, that text is not simply coloured; it may also be rendered as a clip path applied to any background.
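As a sketch of following those pointers programmatically: recent versions of pdfminer.six attach the graphics state to each LTChar, so the colour in effect for each glyph can be read off the layout objects (treat the attribute names as an assumption to verify against your pdfminer version):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer

for page_layout in extract_pages("example.pdf"):
    for element in page_layout:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            for char in line:
                if isinstance(char, LTChar):
                    # ncs is the non-stroking colour space; the graphics
                    # state holds the colour values set before painting
                    print(char.get_text(),
                          char.ncs.name,
                          char.graphicstate.ncolor)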
I read in another post on Stack Overflow that I can just iterate over
the font resources, and if a resource has a '/FontFile' key that means
the font is embedded. Is this correct?
I would advise a more precise analysis. There are other relevant keys, too, e.g. '/FontFile2' and '/FontFile3', and the correct one must be used for the font type at hand.
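A sketch of such a more precise analysis with pypdf (the successor of pyPdf); the helper name is my own, and the Type0 branch reflects that composite fonts keep their FontDescriptor on the descendant font:

from pypdf import PdfReader

FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")

def is_embedded(font):
    font = font.get_object()
    if font.get("/Subtype") == "/Type0":
        # composite font: the descriptor lives on the descendant font
        font = font["/DescendantFonts"][0].get_object()
    descriptor = font.get("/FontDescriptor")
    if descriptor is None:
        return False  # e.g. one of the standard 14 fonts
    return any(key in descriptor.get_object() for key in FONT_FILE_KEYS)

reader = PdfReader("example.pdf")
for page in reader.pages:
    resources = page.get("/Resources")
    fonts = resources.get_object().get("/Font") if resources else None
    if not fonts:
        continue
    for name, font in fonts.get_object().items():
        print(name, "embedded" if is_embedded(font) else "not embedded")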
Don't underestimate your tasks... you should start by defining what the properties you are searching for shall mean in an environment like PDF, which mixes rotated, stretched, and skewed glyphs, vector graphics, and bitmap images.
Related
I am trying to put together a script to fix a large number of PDFs that have been exported from AutoCAD via its DWG2PDF print driver.
When using this driver all SHX fonts are rendered as shape data instead of text data; they do, however, have a comment inserted into the PDF at the expected location with the expected text.
So far in my script I have got it to run through the PDF and insert hidden text on top of each section, with the text squashed to the size of the comment, this gets me 90% of the way and gives me a document that is searchable.
Unfortunately the sizing of the comment regions is relatively coarse (integer-based), which makes it difficult to accurately determine the orientation of short text and results in unevenly sized boxes around the text.
What I would like to be able to do is parse through the shape data in the PDF, collect anything within the bounds of the comment, and then determine a smaller and more accurate bounding box. However all the information I can find is by people trying to parse through text data, and I haven't been able to find anything at all in terms of shape data.
The below image is an example of the raw text in the PDF, the second image shows the comment bounding box in blue, with the red text being what I am setting to hidden to make the document searchable, and copy/paste able. I can get things a little better by shrinking the box by a fixed margin, but with small text items the low resolution of the comment box coordinate data messes things up.
To get this far I am using a combination of PyPDF2 and reportlab, but am open to moving to different libraries.
I didn't end up finding a solution with PyPDF2; I was able to find an easy way to iterate over shape data in pdfminer.six, but then couldn't find a nice way in pdfminer to extract the annotation data.
As such I am using one library to get the annotations, one to look at the shape data, and last of all a third library to add the hidden text on the new pdf. It runs pretty slowly as sheet complexity increases but is giving me good enough results, see image below where the rough green borders as found in the annotations are shrunk to the blue borders surrounding the text. Of course I don't draw the boundaries, and use invisible text for the actual program output, giving pretty good selectable/searchable text.
If anyone is interested in looping over the shape data in PDFs the below snippet should get you started.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTLine, LTCurve

for page_layout in extract_pages("TestSchem.pdf"):
    for element in page_layout:
        if isinstance(element, (LTCurve, LTLine)):
            print(element.bbox)
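For the annotation side, a hedged companion sketch with pypdf, reading each comment's /Rect so it can be matched against the shape bounding boxes above:

from pypdf import PdfReader

reader = PdfReader("TestSchem.pdf")
for number, page in enumerate(reader.pages, start=1):
    annots = page.get("/Annots")
    if annots is None:
        continue
    for ref in annots.get_object():
        annot = ref.get_object()
        # /Rect is [x0, y0, x1, y1] in user space units (points)
        print(number, annot.get("/Subtype"), annot.get("/Contents"),
              annot.get("/Rect"))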
I have been thinking of fonts quite recently. I find the whole process of a keystroke converted to a character displayed in a particular font quite fascinating. What fascinates me more is that each character is not an image but just the right bunch of pixels switched on (or off).
In Photoshop when I make a text layer, I am assuming it's like any other text layer in a word processor: there's a glyph attached to a character and that is displayed. So technically it's still not an 'image' so to speak, and it can be treated as text in a word processor. However, when you rasterize the text layer, an image of the text is created with the font that was used. Can somebody tell me how Photoshop does this? I am assuming there should be a lookup table with the characters' graphics which Photoshop accesses to rasterize the layer.
I want to create a program that generates an image of the character I am pressing (in C or Python or something like that). Is there a way to do this?
Adobe currently has publicly accessible documentation for the Photoshop file format. I've needed to extract information from PSD files (about a year ago, but actually the ancient CS2 version of Photoshop) so I can warn you that this isn't light reading, and there are some parts (at least in the CS2 documentation) that are incomplete or inaccurate. Usually, even when you have file format documentation, you need to do some reverse engineering to work with that file format.
Even so, see here for info about the TySh chunk from Photoshop 6.0 (not sure at a quick glance if it's still the current form for text - "type" to Photoshop).
Anyway, yes - text is stored as a sequence of character codes in memory and in the file. Fonts are basically collections of vector artwork, so text can be converted to vector paths. That can be done either by dealing with the font files yourself, by using an operating system call (there's definitely one for Windows, but I don't remember the name; it's bugging me now so I might figure it out later), or by using a library.
Once you have the vector form, that's basically Bezier paths just like any other vector artwork, and can be rendered the same way.
Or to go directly from text to pixels, you just ask e.g. Windows to draw the text for you - perhaps to a memory DC (device context) if you don't want to draw to the screen.
FreeType is an open source library for working with fonts. It can definitely render to a bitmap. I haven't checked but it can probably convert text to vector paths too - after all it needs to do that as part of rendering to pixels anyway.
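For example, Pillow wraps FreeType and can render a single character to a bitmap in a few lines; the font path below is an assumption, so point it at any TrueType font on your system:

from PIL import Image, ImageDraw, ImageFont

font = ImageFont.truetype("DejaVuSans.ttf", 72)  # path is an assumption
char = "A"

image = Image.new("L", (100, 100), color=255)  # 8-bit white canvas
draw = ImageDraw.Draw(image)
draw.text((10, 10), char, font=font, fill=0)   # draw the glyph in black
image.save("A.png")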
Cairo is another obvious library to look at for font handling and much more, but I've never used it directly myself.
wxWidgets is yet another obvious library to look at, and uses a memory-DC scheme similar to that for Windows, though I don't remember exact class/method names. Converting text to vectors might be outside wxWidgets scope, though.
I'm parsing some PDF files using the pdfminer library.
I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.
Is there a way to identify whether the text is visible? OCR software does place it on the page so that it can be selected.
Generally the problem is distinguishing between two very different, but similar looking cases.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.
Your question is a bit confusing, so I'm not really sure what is going to help you the most. However, you describe two ways to "hide" text from OCR. Both, I think, are detectable, but one is much easier than the other.
Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it (see the sketch after these notes). Some additional comments:
Theoretically it could be something else than an image hiding it, but in your OCR case I would guess it's always an image.
Though an image may be overlapping it, it may also be transparent in some way. In that case, the text that is underneath may still shine through. In your case of a general OCR engine, probably not likely.
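A crude sketch of that overlap test with pdfminer.six (note that the high-level layout does not preserve stacking order, so this only flags text whose bounding box lies inside an image's bounding box; the file name is a placeholder):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTFigure, LTImage, LTTextContainer

def contains(outer, inner):
    # both arguments are (x0, y0, x1, y1) bounding boxes
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

for page_layout in extract_pages("scanned.pdf"):
    images, texts = [], []
    for element in page_layout:
        if isinstance(element, LTFigure):
            images += [e.bbox for e in element if isinstance(e, LTImage)]
        elif isinstance(element, LTTextContainer):
            texts.append(element.bbox)
    covered = sum(any(contains(img, txt) for img in images) for txt in texts)
    print(f"{covered} of {len(texts)} text boxes lie within an image")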
Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled + outlined, or invisible (there are other possibilities yet). In the PDF file you posted, you find this fragment:
BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
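A rough sketch of hunting for that instruction with pypdf's ContentStream (the file name is a placeholder): any "3 Tr" flags invisible text on the page.

from pypdf import PdfReader
from pypdf.generic import ContentStream

reader = PdfReader("scanned.pdf")
for number, page in enumerate(reader.pages, start=1):
    contents = page.get_contents()
    if contents is None:
        continue
    stream = ContentStream(contents, reader)
    for operands, operator in stream.operations:
        # "3 Tr" sets text rendering mode 3, i.e. invisible text
        if operator == b"Tr" and operands and int(operands[0]) == 3:
            print(f"page {number}: invisible (mode 3) text present")
            break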
It's worthwhile mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers that I looked at at some point didn't allow text selection if the text was "behind" the image.
I don't have a copy of the PDF 1.7 specification, but I suspect that the objects on a page are rendered in order, that is, the preceding objects end up covered up by succeeding objects.
Thus, you would have to iterate through the layout objects (See Performing Layout Analysis) and calculate where everything falls on the page, their dimensions, and their rendering order (and possibly their transparency).
As the pdfminer documentation mentions, PDF is evil.
I'm currently writing a little tool (Python + pyPdf) to test PDFs for printer conformity.
Alas, I already get confused by the first task: detecting whether the PDF has at least 3 mm of 'bleed' (a border around the pages where nothing is printed). I have already gathered that I can't detect the bleed for the complete document, since there doesn't seem to be a global one. On the pages, however, I can detect a total of five different boxes:
mediaBox
bleedBox
trimBox
cropBox
artBox
I read the pyPdf documentation concerning those boxes, but the only one I understood is the mediaBox which seems to represent the overall page size (i.e. the paper).
The bleedBox pretty obviously ought to define the bleed, but that doesn't always seem to be the case.
Another thing I noted was that, for instance with one PDF, all those boxes have the exact same size (implying no bleed at all) on each page, yet when I open it there's a huge amount of bleed; this leads me to think that the individual text elements have their own offset.
So, obviously, just calculating the bleed from mediaBox and bleedBox is not a viable option.
I would be more than delighted if anyone could shed some light on what those boxes actually are and what I can conclude from that (e.g. is one box always smaller than another one).
Bonus question: Can someone tell me what exactly the "default user space unit" mentioned in the documentation is? I'm pretty sure this refers to mm on my machine, but I'd like to enforce mm everywhere.
Quoting from the PDF specification ISO 32000-1:2008 as published by Adobe:
14.11.2 Page Boundaries
14.11.2.1 General
A PDF page may be prepared either for a finished medium, such as a
sheet of paper, or as part of a prepress process in which the content
of the page is placed on an intermediate medium, such as film or an
imposed reproduction plate. In the latter case, it is important to
distinguish between the intermediate page and the finished page. The
intermediate page may often include additional production-related
content, such as bleeds or printer marks, that falls outside the
boundaries of the finished page. To handle such cases, a PDF page
may define as many as five separate boundaries to control various
aspects of the imaging process:
The media box defines the boundaries of the physical medium on which
the page is to be printed. It may include any extended area
surrounding the finished page for bleed, printing marks, or other such
purposes. It may also include areas close to the edges of the medium
that cannot be marked because of physical limitations of the output
device. Content falling outside this boundary may safely be discarded
without affecting the meaning of the PDF file.
The crop box defines the region to which the contents of the page
shall be clipped (cropped) when displayed or printed. Unlike the other
boxes, the crop box has no defined meaning in terms of physical page
geometry or intended use; it merely imposes clipping on the page
contents. However, in the absence of additional information (such as
imposition instructions specified in a JDF or PJTF job ticket), the
crop box determines how the page’s contents shall be positioned on the
output medium. The default value is the page’s media box.
The bleed box (PDF 1.3) defines the region to which the contents of
the page shall be clipped when output in a production environment.
This may include any extra bleed area needed to accommodate the
physical limitations of cutting, folding, and trimming equipment. The
actual printed page may include printing marks that fall outside the
bleed box. The default value is the page’s crop box.
The trim box (PDF 1.3) defines the intended dimensions of the
finished page after trimming. It may be smaller than the media box to
allow for production-related content, such as printing instructions,
cut marks, or colour bars. The default value is the page’s crop box.
The art box (PDF 1.3) defines the extent of the page’s meaningful
content (including potential white space) as intended by the page’s
creator. The default value is the page’s crop box.
The page object dictionary specifies these boundaries in the MediaBox,
CropBox, BleedBox, TrimBox, and ArtBox entries, respectively (see
Table 30). All of them are rectangles expressed in default user space
units. The crop, bleed, trim, and art boxes shall not ordinarily
extend beyond the boundaries of the media box. If they do, they are
effectively reduced to their intersection with the media box. Figure
86 illustrates the relationships among these boundaries. (The crop box
is not shown in the figure because it has no defined relationship with
any of the other boundaries.)
Following that, the specification contains a nice graphic (Figure 86) showing those boxes in relation to each other.
The reasons why in many cases only the media box is set are:
- in the case of PDFs meant for electronic consumption (i.e. reading on a computer), the other boxes hardly matter; and
- even in the prepress context they aren't as necessary anymore as they used to be, cf. the article Pedro refers to in his comment.
Concerning your "bonus question": The user space unit is 1/72 inch by default; since PDF 1.6 it can be changed, though, to any (not necessarily integer) multiple of that size using the UserUnit entry in the page dictionary. Changing it in an existing PDF essentially scales the page, as the user space unit is the basic unit in the device-independent coordinate system of a page. Therefore, unless you want to update each and every command in the page descriptions referring to coordinates in order to keep the page dimensions, you won't want to enforce a millimeter user space unit... ;)
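To connect this back to the bleed question: a minimal sketch, assuming the bleed and trim boxes are actually set, that measures the per-side gap between them in millimetres (pypdf; the file name is a placeholder):

from pypdf import PdfReader

MM_PER_UNIT = 25.4 / 72.0  # for the default user space unit

reader = PdfReader("example.pdf")
for number, page in enumerate(reader.pages, start=1):
    scale = MM_PER_UNIT * float(page.get("/UserUnit", 1))
    bleed, trim = page.bleedbox, page.trimbox
    sides = {
        "left": float(trim.left) - float(bleed.left),
        "bottom": float(trim.bottom) - float(bleed.bottom),
        "right": float(bleed.right) - float(trim.right),
        "top": float(bleed.top) - float(trim.top),
    }
    print(f"page {number}:",
          {side: round(gap * scale, 2) for side, gap in sides.items()}, "mm")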
I want to extract the text information contained in a PostScript image file (the captions of my axis labels).
These images were generated with pgplot. I have tried ps2ascii and ps2txt on Ubuntu, but they didn't produce any useful results. Does anyone know of another method?
Thanks
It's likely that pgplot drew the text directly with lines rather than using text operators, especially since pgplot is designed to output to a huge range of devices, including plotters, where you would have to do this.
Edit:
If you have enough plots to be worth the effort then it's a very simple image processing task:
- Convert each page to something like TIFF, in monochrome. Threshold the image to binary; the text will be at the maximum pixel value.
- Use a template matching technique. If you have a limited set of possible labels then just match the entire label; you can even start with a template of the correct size and rotation. Then just flag each plot as containing label[1-n], no need to read the actual text.
- If you don't know the label then you can still do OCR fairly easily: just extract the region around the axis, rotate it for the vertical, and use Google's free OCR lib.
- If you have pgplot you can even build the training set for OCR, or the template images, directly rather than having to harvest them from the image list.
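As a rough sketch of that pipeline, here with OpenCV plus pytesseract (i.e. Tesseract) as my choice of tools, since the answer only names the techniques; the crop coordinates are placeholders:

import cv2
import pytesseract

# rasterise the PostScript first, e.g.:
#   gs -sDEVICE=pnggray -r300 -o page.png plot.ps
page = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(page, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

# crop the strip containing the vertical axis label and rotate it upright
label = binary[200:800, 0:60]
label = cv2.rotate(label, cv2.ROTATE_90_CLOCKWISE)

print(pytesseract.image_to_string(label))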