PDF bleed detection - python

I'm currently writing a little tool (Python + pyPdf) to test PDFs for printer conformity.
Alas I already get confused at the first task: Detecting if the PDF has at least 3mm 'bleed' (border around the pages where nothing is printed). I already got that I can't detect the bleed for the complete document, since there doesn't seem to be a global one. On the pages however I can detect a total of five different boxes:
mediaBox
bleedBox
trimBox
cropBox
artBox
I read the pyPdf documentation concerning those boxes, but the only one I understood is the mediaBox which seems to represent the overall page size (i.e. the paper).
The bleedBox pretty obviously ought to define the bleed, but that doesn't always seem to be the case.
Another thing I noticed: with one PDF, for instance, all those boxes have exactly the same size on each page (implying no bleed at all), but when I open it there's a huge amount of bleed. This leads me to think that the individual text elements have their own offset.
So, obviously, just calculating the bleed from mediaBox and bleedBox is not a viable option.
I would be more than delighted if anyone could shed some light on what those boxes actually are and what I can conclude from that (e.g. is one box always smaller than another one).
Bonus question: Can someone tell me what exactly the "default user space unit" mentioned in the documentation is? I'm pretty sure this refers to mm on my machine, but I'd like to enforce mm everywhere.

Quoting from the PDF specification ISO 32000-1:2008 as published by Adobe:
14.11.2 Page Boundaries
14.11.2.1 General
A PDF page may be prepared either for a finished medium, such as a
sheet of paper, or as part of a prepress process in which the content
of the page is placed on an intermediate medium, such as film or an
imposed reproduction plate. In the latter case, it is important to
distinguish between the intermediate page and the finished page. The
intermediate page may often include additional production-related
content, such as bleeds or printer marks, that falls outside the
boundaries of the finished page. To handle such cases, a PDF page
may define as many as five separate boundaries to control various
aspects of the imaging process:
The media box defines the boundaries of the physical medium on which
the page is to be printed. It may include any extended area
surrounding the finished page for bleed, printing marks, or other such
purposes. It may also include areas close to the edges of the medium
that cannot be marked because of physical limitations of the output
device. Content falling outside this boundary may safely be discarded
without affecting the meaning of the PDF file.
The crop box defines the region to which the contents of the page
shall be clipped (cropped) when displayed or printed. Unlike the other
boxes, the crop box has no defined meaning in terms of physical page
geometry or intended use; it merely imposes clipping on the page
contents. However, in the absence of additional information (such as
imposition instructions specified in a JDF or PJTF job ticket), the
crop box determines how the page’s contents shall be positioned on the
output medium. The default value is the page’s media box.
The bleed box (PDF 1.3) defines the region to which the contents of
the page shall be clipped when output in a production environment.
This may include any extra bleed area needed to accommodate the
physical limitations of cutting, folding, and trimming equipment. The
actual printed page may include printing marks that fall outside the
bleed box. The default value is the page’s crop box.
The trim box (PDF 1.3) defines the intended dimensions of the
finished page after trimming. It may be smaller than the media box to
allow for production-related content, such as printing instructions,
cut marks, or colour bars. The default value is the page’s crop box.
The art box (PDF 1.3) defines the extent of the page’s meaningful
content (including potential white space) as intended by the page’s
creator. The default value is the page’s crop box.
The page object dictionary specifies these boundaries in the MediaBox,
CropBox, BleedBox, TrimBox, and ArtBox entries, respectively (see
Table 30). All of them are rectangles expressed in default user space
units. The crop, bleed, trim, and art boxes shall not ordinarily
extend beyond the boundaries of the media box. If they do, they are
effectively reduced to their intersection with the media box. Figure
86 illustrates the relationships among these boundaries. (The crop box
is not shown in the figure because it has no defined relationship with
any of the other boundaries.)
Following that there is a nice graphic showing those boxes in relation to each other:
The reasons why in many cases only the media box is set, are
that in case of PDFs meant for electronic consumption (i.e. reading on a computer) the other boxes hardly matter; and
that even in the prepress context they aren't as necessary anymore as they used to be, cf. the article Pedro refers to in his comment.
Concerning your "bonus question": The user space unit is 1⁄72 inch by default; since PDF 1.6 it can be changed, though, to any (not necessarily integer) multiple of that size using the UserUnit entry in the page dictionary. Changing it in an existing PDF essentially scales the page, as the user space unit is the basic unit in the device-independent coordinate system of a page. Therefore, unless you want to update each and every command in the page descriptions referring to coordinates to keep the page dimensions, you won't want to enforce a millimeter user space unit... ;)
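To make the box defaults and the unit conversion above concrete, here is a minimal sketch, not a definitive implementation. It assumes the boxes have already been read from the page as (llx, lly, urx, ury) tuples, however your PDF library exposes them, and it treats "bleed" as the distance between the trim box and the bleed box. The names to_mm, bleed_mm and has_min_bleed are made up for illustration.

```python
MM_PER_INCH = 25.4
UNITS_PER_INCH = 72.0  # the default user space unit is 1/72 inch


def to_mm(units, user_unit=1.0):
    """Convert default user space units to millimetres."""
    return units * user_unit / UNITS_PER_INCH * MM_PER_INCH


def bleed_mm(media, crop=None, bleed=None, trim=None, user_unit=1.0):
    """Return the declared bleed as (left, bottom, right, top) in mm.

    Boxes are (llx, lly, urx, ury) tuples. Missing boxes fall back to the
    defaults quoted from the spec: crop -> media, bleed -> crop, trim -> crop.
    """
    crop = crop or media
    bleed = bleed or crop
    trim = trim or crop
    return (
        to_mm(trim[0] - bleed[0], user_unit),
        to_mm(trim[1] - bleed[1], user_unit),
        to_mm(bleed[2] - trim[2], user_unit),
        to_mm(bleed[3] - trim[3], user_unit),
    )


def has_min_bleed(margins, minimum_mm=3.0, tolerance=0.01):
    """True if every side declares at least minimum_mm of bleed."""
    return all(m >= minimum_mm - tolerance for m in margins)
```

This only measures what the boxes declare. When only the media box is set, every margin comes out as zero, and telling whether the content really extends past the trim edge would require analysing the page content itself.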

Related

Remove Black Rectangles from PDF using Python (PikePDF or PyPDF2)

Please help me surprise my wife with a useful PDF of her iMessage chain with her now deceased grandmother.
Apple Messages allows you to print conversations to PDF. You have to manually scroll to the top of the message on the Mac, over and over again. It took me 4 hours!
However, it places black rectangle over most photos. This seems to be a half-hearted attempt at privacy because the photo is still there under the black rectangle.
I can use Edit PDF in Adobe Acrobat and remove the main black rectangle, each of the four side rectangles and the four corner rectangles. But doing this for every image in this chain will take me an extraordinary amount of time as the chain goes back 4 years and they texted a lot.
I'm reasonably savvy with Python and have tried to work through using PikePDF and PyPDF2 to do this, but I can't make any headway owing to the complex structure of PDFs.
Note that the extraordinary long page size of the PDF is because when you PrintPDF in Messages it doesn't handle images across page breaks very well. So I set a custom page size of 200 inches height so there are far fewer page breaks.
Example PDF at link below. It is one page, with most of the content removed (using Edit PDF in Acrobat) for privacy, with one photo (of my baby son) which is blocked by the aforementioned black rectangle. The target document is 63 such pages (each 200 inches high) and 1.73 GB in size, so you can understand why doing this manually is a wee bit impractical.
Please help me, internet. It would mean so much to my wife!
Edit to include previously left out link to sample file: https://www.dropbox.com/s/t4dgwr5eylb4rfm/MessagesBlackBoxExample.pdf?dl=0

Getting bounding boxes of characters from PDF

I've been hacking away at this for a couple of days now, but haven't been able to find a solution that is satisfactory. Essentially, my goal is to find the bounding boxes of characters from PDF to eventually use as training data for an OCR system. This means I need clear and consistent bounding box extraction from generated PDFs (like those at arxiv which actually have text information in them, hence the ability to highlight with cursor). I've been mainly working with python and PDFMiner.
Most of the solutions I've seen work at no lower a level than lines of text, and the issue I had there was that PDFs had such varying structures that even this wasn't reliable. I've been able to get bounding boxes of characters through HTML using pdftotext, but the boxes were mis-sized, most often cutting off the tail ends of characters, which are crucial to OCR training.
Thanks!

Ghostscript is cutting off my ps to png

Questions about ghostscript and tkinter:
I have made a tkinter program and I want to convert it into an image; I want the image to have the same ratio as 8.5 x 12 in paper; I have read that it's 2550x3300 pixels.
How does this translate to canvas coordinates? For now I picked some numbers with a similar ratio, width=1275,height=1650
Also, what exactly is the canvas and its size? I thought it was the length of what I could write in, but then I can set the scroll regions even further than the given width and height.
Basic idea for the canvas code:
import tkinter as tk
from tkinter import Scrollbar, Button, HORIZONTAL, VERTICAL, TOP, RIGHT, LEFT, X, Y, BOTH

class Application(tk.Frame):
    def __init__(self, master):
        self.master = master
        # Canvas with a scroll region matching its nominal size
        self.canvas = tk.Canvas(master, width=1275, height=1650, bg='white', highlightthickness=0,
                                scrollregion=(0, 0, 1275, 1650))
        self.hbar = Scrollbar(master, orient=HORIZONTAL)
        self.hbar.pack(side=TOP, fill=X)
        self.hbar.config(command=self.canvas.xview)
        self.vbar = Scrollbar(master, orient=VERTICAL)
        self.vbar.pack(side=RIGHT, fill=Y)
        self.vbar.config(command=self.canvas.yview)
        self.canvas.config(width=1275, height=1650)
        self.canvas.config(xscrollcommand=self.hbar.set, yscrollcommand=self.vbar.set)
        self.canvas.pack(side=LEFT, expand=True, fill=BOTH)
        B1 = Button(master, text='add image', command=lambda: self.insert_image())
        B1.pack(side=TOP)
and here is my Ghostscript related code:
def _save(self):
    self.canvas.postscript(file="tmp.ps", colormode='color')
    args = [
        "ps2jpg",
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",
        "-sOutputFile=./ABC.png",
        "./tmp.ps"
    ]
    ghostscript.Ghostscript(*args)
And it seems to cut off at the left and right side.
Then I added parameters such as "-dFitPage", "-g1275x1650", "-dPSFitPage", or even "-g2550x3300" instead of "-g1275x1650", but that creates a different error, where the top of my canvas ends up in the middle of my saved image. What I want is the top of my canvas to appear at the top of my image.
Thank you.
OK, so firstly, the size you have quoted uses '-g', which is the number of pixels. Clearly the actual media size will then depend on the resolution. If I declare the size to be 600x600 and the resolution is 600 dpi then that's 1 inch by 1 inch; if the resolution is 300 dpi, then it's 2 inches by 2 inches.
So you can't say 8.5x12 inch (is that supposed to be one of the standard media sizes?) is 2550x3300 pixels without also stating the resolution. In fact that can't even be correct. If I assume that 3300 is correct for the 12 inch length, then that's a resolution of 275 dpi. If I then figure the width, it's 2550/275 = about 9.3 inches, not 8.5.
As it happens, the default resolution of the png16m device appears to be 72 dpi. So 2550 by 3300 pixels means that your media is roughly 35x46 inches. Not too surprising that you have scroll bars :-)
Of course, it's possible that your PostScript program alters the resolution, but since you haven't supplied it to look at, I can't tell.
Now, PostScript co-ordinate systems start (by default) at the bottom left corner, which is 0,0, and extend in both directions: positive numbers go up and right, negative numbers go down and left. Yes, it's entirely possible to specify that part or all of a drawing operation takes place off the media.
You can also alter the co-ordinate system too, but that's probably more complex than you want to get into.
Without seeing your PostScript program, I can't really say why it lies partially off the media, it may be that that's what the program is asking for.
Using FitPage will attempt to fit the requested image to the page; if it's too big it will scale it down (linearly, both directions equally) until both dimensions fit into the media. This will result in white space in one direction unless your media happens to be the same shape as the program requested. That smaller dimension is then centered. I don't recall exactly, but I think if the program's marks fit into the media, then it just centres them.
So basically, you need to get the dimensions correct to start with. Assuming you are happy with a 72 dpi output image, and that your media is genuinely 8.5x12 inches, then you can specify -g612x864. If the rendered image doesn't fit precisely then it's probable that your program makes marks off the media, is using a different media size, or 'something'. Can't say what without seeing the PostScript.
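The arithmetic behind these numbers is just inches times resolution. A small sketch, with made-up helper names, that derives a matching pair of -r and -g switches:

```python
def pixels_for(width_in, height_in, dpi=72):
    """Pixel dimensions of a media size given in inches at a resolution."""
    return round(width_in * dpi), round(height_in * dpi)


def geometry_switches(width_in, height_in, dpi=72):
    """Build matching Ghostscript -r (resolution) and -g (pixel size) switches."""
    width_px, height_px = pixels_for(width_in, height_in, dpi)
    return [f"-r{dpi}", f"-g{width_px}x{height_px}"]


print(pixels_for(8.5, 12))            # 8.5x12 inch at 72 dpi
print(pixels_for(8.5, 11, dpi=300))   # 8.5x11 inch at 300 dpi
print(geometry_switches(8.5, 12))
```

Note that 2550x3300 pixels is what 8.5x11 inches comes to at 300 dpi, which may be where the figure in the question came from.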
If you can share a simple PostScript file I can look at it (I can't use anything that requires me to use tkinter, sorry) and give you some more detailed guidance.
[EDIT]
So the output is actually an EPS, not a PostScript program, we can see this from the initial comments (any line beginning with '%' is a comment):
%!PS-Adobe-3.0 EPSF-3.0
%%Creator: Tk Canvas Widget
%%Title: Window .49823304L
%%CreationDate: Mon Aug 14 23:47:27 2017
%%BoundingBox: -171 85 785 707
%!PS tells us it's a PostScript program, -Adobe-3.0 tells us it conforms to version 3.0 of the Document Structuring Conventions (a way of creating PostScript programs that makes them easier to handle for software that is not a PostScript interpreter), EPSF tells us it's actually an EPS, and finally the trailing -3.0 declares that it conforms to version 3.0 of the EPS specification.
Now EPS files are not intended to be sent directly to a PostScript interpreter. They are supposed to be included inside other PostScript programs and used as a kind of 'black box'. This technique is often used when the object is something like a company logo which does not change, or when you send work to an outside agency (eg a freelance graphic artist) they may send it as an EPS for you to use.
The EPS conforms to certain rules regarding what it can and can't contain. One of the important things it cannot do is set the media size. Execution of setpagedevice can cause the device to reset the marked content, which would throw away any marks made before the media selection.
Additionally, the EPS doesn't know how big it's going to be when drawn on the final page. You could think of a logo being drawn large on the front page, then drawn small in the footer of each page, for example.
So what the EPS contains is a declaration of where it marks the page, this is given by the BoundingBox:
%%BoundingBox: -171 85 785 707
Now you will note that the BoundingBox of this EPS begins at -171,85 and extends to 785,707. So its width is 956 and its height is 622. Those are in PostScript units, which are 1/72 of an inch. So your EPS is actually about 13.3 inches wide by about 8.6 inches tall. Not only that, but it begins about 2.4 inches off the left edge of the media.
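The same check can be done programmatically. A rough sketch, assuming the EPS header is available as text and the %%BoundingBox comment holds four integers (it will not handle the "(atend)" form); read_bounding_box and describe are illustrative names only:

```python
import re

BBOX_RE = re.compile(
    r"^%%BoundingBox:\s*(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)", re.MULTILINE
)


def read_bounding_box(eps_text):
    """Return (llx, lly, urx, ury) from the %%BoundingBox comment."""
    match = BBOX_RE.search(eps_text)
    if match is None:
        raise ValueError("no numeric %%BoundingBox comment found")
    return tuple(int(value) for value in match.groups())


def describe(bbox):
    """Size of the box and how far it starts off the media, in inches."""
    llx, lly, urx, ury = bbox
    return {
        "width_in": (urx - llx) / 72,
        "height_in": (ury - lly) / 72,
        "left_overhang_in": max(0, -llx) / 72,
        "bottom_overhang_in": max(0, -lly) / 72,
    }


header = """%!PS-Adobe-3.0 EPSF-3.0
%%Creator: Tk Canvas Widget
%%BoundingBox: -171 85 785 707
"""
print(describe(read_bounding_box(header)))
```

A non-zero overhang means part of the drawing lies off the media unless the origin is translated before the EPS is rendered.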
This probably explains why you are having trouble getting the output you want: you probably are not setting the media correctly in Ghostscript, and are not translating the origin so that the left edge doesn't lie off the media.
It's the job of the application which places the EPS in its own PostScript output to translate and scale the co-ordinate system so that the EPS is at the required position and size on the final media.
So you have a choice; you can create a PostScript program which sets up the Current Transformation Matrix appropriately to scale and position the EPS, and then includes the EPS in its entirety, finishing with a showpage to actually render it (EPS files may not include a showpage for obvious reasons).
Or you can use the -dEPSCrop or -dEPSFitPage switches in Ghostscript (documented here), which will fit the page to the content, or the content to the page, respectively. Note that the precise behaviour of the FitPage switch depends on the exact version of Ghostscript you are using, which you haven't mentioned. The documentation there is for the current version, 9.21.
If you create a PostScript program yourself to do the work then you have complete control over how it's rendered; if you let Ghostscript do it then you have less control, but it's simple to do. Your choice really.
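For the simpler route of letting Ghostscript do the work, the argument list from the question could be adapted roughly like this. It is only a sketch: eps_to_png_args is a made-up helper, the file names are the ones from the question, and the exact result depends on your Ghostscript version.

```python
def eps_to_png_args(eps_path, png_path, dpi=72, fit_page=False):
    """Ghostscript arguments for rendering an EPS file to a PNG.

    -dEPSCrop sizes the page to the EPS BoundingBox; -dEPSFitPage scales
    the EPS to fit the current page size instead.
    """
    return [
        "gs",  # program name; the Python binding ignores this first entry
        "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",
        f"-r{dpi}",
        "-dEPSFitPage" if fit_page else "-dEPSCrop",
        f"-sOutputFile={png_path}",
        eps_path,
    ]


args = eps_to_png_args("./tmp.ps", "./ABC.png")
print(" ".join(args))
```

The list can be passed to ghostscript.Ghostscript(*args) as in the question, or to subprocess.run(args, check=True) if a gs executable is on the PATH (an assumption about your setup).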
NB the pastebin stops abruptly and makes no actual marks, so it's not a valid PostScript program. If you put the whole file on DropBox or something I could perhaps be more specific, but the gist is certainly covered above.

Tell if text of PDF is visible or not

I'm parsing some PDF files using the pdfminer library.
I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.
Is there a way to identify whether text is visible, given that OCR machines place it on the page so it can be selected?
Generally the problem is distinguishing between two very different, but similar looking cases.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.
Your question is a bit confusing so I'm not really sure what is going to help you the most. However, you describe two ways to "hide" text from OCR. Both I think are detectable but one is much easier than the other.
Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it. Some additional comments:
Theoretically it could be something else than an image hiding it, but in your OCR case I would guess it's always an image.
Though an image may be overlapping it, it may also be transparent in some way. In that case, the text that is underneath may still shine through. In your case of a general OCR engine, probably not likely.
Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled + outlined, or invisible (there are other possibilities yet). In the PDF file you posted, you find this fragment:
BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
It's worthwhile mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers that I looked at at some point didn't allow text selection if the text was "behind" the image.
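Spotting the invisible rendering mode can be sketched with a simple scan for the Tr operator. This assumes you already have the page's content stream decompressed and decoded to text (how you get it depends on your library), and it is deliberately naive: it does not track the graphics state, and a string literal that happens to contain "3 Tr" would fool it.

```python
import re

# Matches "<integer> Tr", the operator that sets the text rendering mode.
TR_RE = re.compile(r"(?<![\w.])(\d+)\s+Tr\b")


def rendering_modes(content):
    """Return every text rendering mode set in a decoded content stream."""
    return [int(mode) for mode in TR_RE.findall(content)]


def has_invisible_text(content):
    """True if mode 3 (neither filled nor stroked) is set anywhere."""
    return 3 in rendering_modes(content)


fragment = """BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
"""
print(rendering_modes(fragment), has_invisible_text(fragment))
```

A page where all text is drawn in mode 3 and a single image covers most of the page is a strong hint of a scan with an OCR layer, though that heuristic is my assumption rather than anything the spec guarantees.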
I don't have a copy of the PDF 1.7 specification, but I suspect that the objects on a page are rendered in order, that is, the preceding objects end up covered up by succeeding objects.
Thus, you would have to iterate through the layout objects (See Performing Layout Analysis) and calculate where everything falls on the page, their dimensions, and their rendering order (and possibly their transparency).
As the pdfminer documentation mentions, PDF is evil.
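The overlap test described in the answers above could be sketched as follows. PageObject is a hypothetical stand-in for whatever your parser yields, not a pdfminer class. You would have to fill in the bounding boxes and the painting order yourself, and whether a library's iteration order matches painting order needs checking. Transparency is ignored, and any partial overlap counts as "covered".

```python
from dataclasses import dataclass


@dataclass
class PageObject:
    kind: str    # "text" or "image"
    bbox: tuple  # (x0, y0, x1, y1) in user space units
    order: int   # position in the content stream; higher paints later


def overlaps(a, b):
    """True if two (x0, y0, x1, y1) rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def covered_text(objects):
    """Return the text objects that a later-painted image overlaps."""
    images = [obj for obj in objects if obj.kind == "image"]
    return [
        text for text in objects
        if text.kind == "text"
        and any(img.order > text.order and overlaps(text.bbox, img.bbox)
                for img in images)
    ]


page = [
    PageObject("text", (40, 700, 200, 712), order=0),   # painted first
    PageObject("image", (0, 0, 612, 792), order=1),     # full-page scan
    PageObject("text", (40, 650, 200, 662), order=2),   # painted on top
]
print([obj.order for obj in covered_text(page)])
```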

PDF 'advanced' information extraction

I'm trying to write what more or less amounts to a PDF soft proof.
There are a few pieces of information that I would like to extract, but I have no clue how to.
What I need to extract:
Bleed: I got this somewhat working with pyPdf, given
that the document uses 72 dpi, which sadly isn't
always the case. I need to be able to calculate
the bleed in millimeters.
Print resolution (dpi): If I read the PDF spec[1] correctly this ought to
always be 72 dpi, unless a page has UserUnit set,
which was only introduced in PDF-1.6, but shouldn't
print documents always be at least 300 dpi? I'm
afraid that I misunderstood something…
I'd also need the print resolution for images, if
they can differ from the default page resolution,
that is.
Text color: I don't have the slightest clue on how to extract
this, the string 'text colour' only shows up once
in the whole spec, without any explanation how it
is set.
Image colormodel: If I understand it correctly I can read this out
in pyPdf with page['/Group']['/CS'] which can be:
- /DeviceRGB
- /DeviceCMY
- /DeviceCMYK
- /DeviceGray
- /DeviceRGBK
- /DeviceN
Font 'embeddedness': I read in another post on stackoverflow that I
can just iterate over the font resources and if a
resource has a '/FontFile'-key that means that
the font is embedded. Is this correct?
If other libs than pyPdf are better able to extract this info (or a combination
of them) they are more than welcome. So far I fumbled around with pyPdf, pdfrw
and pdfminer. All of which don't exactly have the most extensive documentation.
[1] http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
If I read the PDF spec[1] correctly this ought to always be 72 dpi,
unless a page has UserUnit set, which was only introduced in PDF-1.6,
but shouldn't print documents always be at least 300 dpi? I'm afraid
that I misunderstood something…
You do misunderstand something. The default user space unit, which defaults to 1/72 inch but can be changed on a per-page basis since PDF-1.6, does not define a print resolution; it merely defines what length a unit in coordinates given by the user corresponds to by default (i.e. unless any size-changing transformation is active).
For printing, all data are converted into a device-dependent space whose resolution has nothing to do with the user space coordinates. Printing resolutions depend on the printing device and its drivers; they may be limited due to security settings allowing low-quality printing only.
I'd also need the print resolution for images, if they can differ from
the default page resolution, that is.
Images (well, bitmap images, in PDF there are also vector graphics) come each with their individual resolution and then may be transformed (e.g. enlarged) before being rendered. For an "image printing resolution" you'd, therefore, have to inspect each and every bitmap image and each and every page content in which it is inserted. And if the image is rotated, skewed and asymmetrically stretched, I wonder what number you will use as resolution... ;)
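For the simple case of an image that is only scaled and possibly rotated, an effective resolution can be estimated from its pixel dimensions and the transformation matrix in effect when it is painted. The sketch below assumes you have already extracted both; effective_dpi is a made-up name. Once skew is involved, the two numbers stop being meaningful in the way the paragraph above hints.

```python
import math


def effective_dpi(pixel_width, pixel_height, ctm, user_unit=1.0):
    """Estimate (horizontal, vertical) resolution of a placed bitmap.

    ctm is the (a, b, c, d, e, f) matrix in effect when the image is
    painted. An image occupies the unit square in its own space, so the
    lengths of the transformed unit vectors give its size on the page
    in user space units.
    """
    a, b, c, d, _e, _f = ctm
    width_in = math.hypot(a, b) * user_unit / 72
    height_in = math.hypot(c, d) * user_unit / 72
    return pixel_width / width_in, pixel_height / height_in


# A 600x300 pixel image placed 144x72 units (2x1 inch) large.
print(effective_dpi(600, 300, (144, 0, 0, 72, 36, 36)))
```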
Text color: I don't have the slightest clue on how to extract this, the string
'text colour' only shows up once in the whole spec, without any
explanation how it is set.
Have a look at section 9.2.3 in the spec:
The colour used for painting glyphs shall be the current colour in the
graphics state: either the nonstroking colour or the stroking colour
(or both), depending on the text rendering mode (see 9.3.6, "Text
Rendering Mode"). The default colour shall be black (in DeviceGray),
but other colours may be obtained by executing an appropriate
colour-setting operator or operators (see 8.6.8, "Colour Operators")
before painting the glyphs.
There you find a number of pointers to interesting sections. Be aware, though, text is not simply coloured; it may also be rendered as a clip path applied to any background.
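As a starting point, the nonstroking colour in effect for each piece of text can be tracked by walking a decoded content stream. This sketch is heavily simplified and rests on several assumptions: it only knows the g, rg and k operators and the Tj operator, it ignores q/Q state saving, the cs/sc/scn operators, TJ arrays, hex strings and nested parentheses, and it does not look at the text rendering mode or the stroking colour.

```python
import re

TOKEN_RE = re.compile(r"\((?:\\.|[^\\()])*\)|[^\s()]+")

# Nonstroking colour operators and the number of operands each takes.
FILL_OPS = {"g": ("DeviceGray", 1), "rg": ("DeviceRGB", 3), "k": ("DeviceCMYK", 4)}


def _is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False


def fill_colours_of_text(content):
    """Return (string, colour space, components) for each Tj operator."""
    colour = ("DeviceGray", (0.0,))  # default colour: black in DeviceGray
    operands = []
    found = []
    for token in TOKEN_RE.findall(content):
        if token in FILL_OPS:
            space, count = FILL_OPS[token]
            colour = (space, tuple(float(v) for v in operands[-count:]))
            operands = []
        elif token == "Tj":
            if operands:
                found.append((operands[-1],) + colour)
            operands = []
        elif token[0] in "(/" or _is_number(token):
            operands.append(token)  # string, name or number operand
        else:
            operands = []  # any other operator consumes its operands
    return found


stream = "BT /F1 12 Tf 1 0 0 rg (Red) Tj 0 0 0 1 k (Key black) Tj ET"
for entry in fill_colours_of_text(stream):
    print(entry)
```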
I read in another post on stackoverflow that I can just iterate over
the font resources and if a resource has a '/FontFile'-key that means
that the font is embedded. Is this correct?
I would advise a more precise analysis. There are other relevant keys, too, e.g. '/FontFile2' and '/FontFile3', and the correct one must be used.
Don't underestimate your tasks... you should start by defining what the properties you are looking for actually mean in a mixed environment of rotated, stretched and skewed glyphs, vector graphics and bitmap images like PDF.
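A sketch of such a check, using plain dicts as stand-ins for parsed font dictionaries. With a real library you would have to resolve indirect references first, and embedded_font_key is a name made up for illustration. The font-file entries live in the font descriptor, and composite (Type0) fonts keep their descriptor on the descendant font.

```python
FONT_FILE_KEYS = ("/FontFile", "/FontFile2", "/FontFile3")


def embedded_font_key(font):
    """Return the font-file key found for a font dictionary, or None."""
    candidates = [font] + list(font.get("/DescendantFonts", []))
    for candidate in candidates:
        descriptor = candidate.get("/FontDescriptor", {})
        for key in FONT_FILE_KEYS:
            if key in descriptor:
                return key
    return None


truetype = {"/Subtype": "/TrueType", "/FontDescriptor": {"/FontFile2": "stream"}}
composite = {
    "/Subtype": "/Type0",
    "/DescendantFonts": [{"/FontDescriptor": {"/FontFile3": "stream"}}],
}
not_embedded = {"/Subtype": "/Type1", "/BaseFont": "/Helvetica"}

for font in (truetype, composite, not_embedded):
    print(font["/Subtype"], embedded_font_key(font))
```

Which key is the "correct" one depends on the font type: /FontFile holds Type 1 font programs, /FontFile2 TrueType font programs, and /FontFile3 other formats identified by the stream's Subtype entry. A stricter check would compare the key found against the font's declared type.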
