Remove Black Rectangles from PDF using Python (PikePDF or PyPDF2)

Please help me surprise my wife with a useful PDF of her iMessage chain with her now deceased grandmother.
Apple Messages allows you to print conversations to PDF. You have to manually scroll to the top of the message on the Mac, over and over again. It took me 4 hours!
However, it places a black rectangle over most photos. This seems to be a half-hearted attempt at privacy because the photo is still there under the black rectangle.
I can use Edit PDF in Adobe Acrobat and remove the main black rectangle, each of the four side rectangles and the four corner rectangles. But doing this for every image in this chain will take me an extraordinary amount of time as the chain goes back 4 years and they texted a lot.
I'm reasonably savvy with Python and have tried to work through using PikePDF and PyPDF2 to do this, but I can't make any headway owing to the complex structure of PDFs.
Note that the extraordinarily long page size of the PDF is because printing to PDF from Messages doesn't handle images across page breaks very well. So I set a custom page height of 200 inches so there are far fewer page breaks.
Example PDF at the link below. It is one page, with most of the content removed (using Edit PDF in Acrobat) for privacy, and with one photo (of my baby son) that is blocked by the aforementioned black rectangle. The target document is 63 such pages (each 200 inches high) and 1.73 GB in size, so you can understand why doing this manually is a wee bit impractical.
Please help me, internet. It would mean so much to my wife!
Edit to include previously left out link to sample file: https://www.dropbox.com/s/t4dgwr5eylb4rfm/MessagesBlackBoxExample.pdf?dl=0

Related

Reading multiple invoices from an image using OCR/computer vision

I wish to extract key-value pairs from the following image that consists of 2 invoices.
Image example
I am using AWS Textract to achieve this; however, I'd like to be able to map the key-value pairs back to the invoices. For example, 'Cornbread SVC' should be mapped to bill #1 and '1 #1 CHKN PLATE' should be mapped to bill #2.
One approach I thought of was to pre-process the image: find the number of bills and their coordinates, then crop the image accordingly. So '5' bills in an image would yield the coordinates of '5' bills, and the original image would be cropped 5 times according to the different bill dimensions. Each bill would then be sent as a separate image to AWS Textract.
However, I have not been able to figure out a method to detect the number of bills in an image and their boundary coordinates.
Any help would be appreciated. I am open to using any other APIs or methods to achieve this.

As you've already mentioned, it would be necessary to split the bills before you do any OCR. There are some techniques to achieve this.
You could use OpenCV to detect white paper in the image. From my experience, this works when the background of the image is dark enough. It won't work when the picture is taken on, for example, a white table. The user experience with this approach therefore won't be satisfying: sometimes it works, sometimes it doesn't.
If it is a mobile app, you could ask your user to draw a rectangle around each receipt. A similar approach for a single document is used in mobile scanners.
The last option, which I prefer, is to use a scanning app/SDK and force the user to take a picture of a single receipt at a time. It may sound a bit rigid and uncool, but it works all the time. Let's face it: the more steps you have with a chance of failure, the more failures will happen. In the invoice data extraction process you have at least the following steps:
image capture
image processing
OCR - not 100% accurate
recognition of data (what is invoice number, etc.) - not 100% accurate
So at least two of these steps are not 100% accurate. Why add a new step that cannot work in 100% of cases when you can achieve the same result by taking separate images?

Tell if text of PDF is visible or not

I'm parsing some PDF files using the pdfminer library.
I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.
Is there a way to identify whether the text is visible, given that OCR machines place it on the page so it can be selected?
Generally the problem is distinguishing between two very different, but similar looking cases.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.

Your question is a bit confusing so I'm not really sure what is going to help you the most. However, you describe two ways to "hide" text from OCR. Both I think are detectable but one is much easier than the other.
Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it. Some additional comments:
Theoretically it could be something else than an image hiding it, but in your OCR case I would guess it's always an image.
Though an image may be overlapping it, it may also be transparent in some way. In that case, the text that is underneath may still shine through. In your case of a general OCR engine, probably not likely.
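The overlap check described above can be sketched in plain Python. This is only an illustration under assumptions: the `PageItem` record, its `kind`, `bbox` and `order` fields, and the `hidden_text` helper are hypothetical names, not part of pdfminer or any other library. The bounding boxes are assumed to have been computed already, and transparency is ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class PageItem:
    kind: str   # "text" or "image" (hypothetical labels)
    bbox: BBox
    order: int  # paint order: a higher value is painted later, i.e. on top


def overlaps(a: BBox, b: BBox) -> bool:
    """True if the two rectangles share any area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def hidden_text(items: List[PageItem]) -> List[PageItem]:
    """Return text items that are overlapped by an image painted after them."""
    images = [item for item in items if item.kind == "image"]
    return [
        text
        for text in items
        if text.kind == "text"
        and any(
            img.order > text.order and overlaps(img.bbox, text.bbox)
            for img in images
        )
    ]
```

A text item painted before a full-page image would be reported, while one painted after the same image would not.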
Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled + outlined, or invisible (there are other possibilities yet). In the PDF file you posted, you find this fragment:
BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
It's worthwhile mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers that I looked at at some point didn't allow text selection if the text was "behind" the image.
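As a rough illustration of spotting the "3 Tr" case, here is a minimal sketch that scans an already decoded content stream for strings shown while rendering mode 3 is active. It is not a full PDF parser: the `invisible_strings` name is hypothetical, the tokenizer only understands simple literal strings, and it ignores hex strings, TJ arrays, inline images and Form XObjects. It does account for the rendering mode being part of the graphics state that q/Q save and restore.

```python
import re

# A literal string "(...)" without nested parentheses, or any other
# whitespace-delimited token. Good enough for simple streams only.
TOKEN = re.compile(rb"\((?:\\.|[^\\()])*\)|[^\s()]+")
# Operators are treated as purely alphabetic-led tokens such as Tj, Tr, q, T*.
OPERATOR = re.compile(rb"[A-Za-z][A-Za-z0-9*]*")


def invisible_strings(content: bytes) -> list:
    """Return literal strings shown with Tj while text rendering mode is 3."""
    mode = 0       # the default text rendering mode is 0 (fill)
    saved = []     # modes saved by q and restored by Q
    operands = []  # operands collected since the last operator
    found = []
    for match in TOKEN.finditer(content):
        tok = match.group()
        if not OPERATOR.fullmatch(tok):
            operands.append(tok)
            continue
        if tok == b"Tr" and operands:
            mode = int(operands[-1])
        elif tok == b"q":
            saved.append(mode)
        elif tok == b"Q" and saved:
            mode = saved.pop()
        elif tok == b"Tj" and operands and mode == 3:
            # Strip the surrounding parentheses of the literal string.
            found.append(operands[-1][1:-1].decode("latin-1"))
        operands.clear()
    return found
```

Run on the fragment quoted above, this should report the single string "Chicken ". A real document would need a proper content stream parser from a PDF library instead of the regular expression.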

I don't have a copy of the PDF 1.7 specification, but I suspect that the objects on a page are rendered in order, that is, the preceding objects end up covered up by succeeding objects.
Thus, you would have to iterate through the layout objects (See Performing Layout Analysis) and calculate where everything falls on the page, their dimensions, and their rendering order (and possibly their transparency).
As the pdfminer documentation mentions, PDF is evil.

image rendering issue in psychopy

I am a long-time psychopy user, and I just upgraded to 1.81.03 (from 1.78.x). In one experiment, I present images (.jpgs) to the user and ask for a rating scale response. The code worked fine before the update, but now I am getting weird artifacts on some images. For example, here is one image I want to show:
But here is what shows up [screencapped]:
You can see that one border is missing. This occurs for many of my images, though it is not always the same border, and sometimes two or three borders are missing.
Does anyone have an idea about what might be going on?

I received this information from the psychopy-users group (Michael MacAskill):
As a general point, you should avoid using .jpgs for line art: they aren't designed for this (if you zoom in, in the internal corners of your square, you'll see the typical compression artefacts that their natural image-optimised compression algorithm introduces when applied to line art). .png format is optimal for line art. It is lossless and for this sort of image will still be very small file-size wise.
Graphics cards sometimes do scaling-up and then down-scaling of bitmaps, which can lead to issues like this with single-pixel width lines. Perhaps this is particularly the issue here because (I think) this image was supposed to be 255 × 255 pixels, and cards will sometimes scale up to the nearest power-of-two size (256 × 256) and then down again, so easy to see how the border might be trimmed.
I grabbed your image off SO, it seemed to have a surrounding border around the black line to make it 321 × 321 in total. I made that surround transparent and saved it as .png (another benefit of png vs jpg). It displays without problems (although a version cropped to just the precise dimensions of the black line did show the error you mentioned). (Also, the compression artefacts are still there, as I just made this png directly from the jpg). See attached file.
If this is the sort of simple stimulus you are showing, you might want to use ShapeStim/Polygon stimuli instead of bitmaps. They will always be drawn precisely, without any scaling issues, and there wouldn't be the need for any jiggery pokery.
Why this changed from 1.78 I'm not sure. The issue is also there in 1.82.00

PDF bleed detection

I'm currently writing a little tool (Python + pyPdf) to test PDFs for printer conformity.
Alas I already get confused at the first task: Detecting if the PDF has at least 3mm 'bleed' (border around the pages where nothing is printed). I already got that I can't detect the bleed for the complete document, since there doesn't seem to be a global one. On the pages however I can detect a total of five different boxes:
mediaBox
bleedBox
trimBox
cropBox
artBox
I read the pyPdf documentation concerning those boxes, but the only one I understood is the mediaBox which seems to represent the overall page size (i.e. the paper).
The bleedBox pretty obviously ought to define the bleed, but that doesn't always seem to be the case.
Another thing I noted was that, for instance with the PDF, all those boxes have the exact same size (implying no bleed at all) on each page, but when I open it there's a huge amount of bleed. This leads me to think that the individual text elements have their own offset.
So, obviously, just calculating the bleed from mediaBox and bleedBox is not a viable option.
I would be more than delighted if anyone could shed some light on what those boxes actually are and what I can conclude from that (e.g. is one box always smaller than another one).
Bonus question: Can someone tell me what exactly the "default user space unit" mentioned in the documentation is? I'm pretty sure this refers to mm on my machine, but I'd like to enforce mm everywhere.

Quoting from the PDF specification ISO 32000-1:2008 as published by Adobe:
14.11.2 Page Boundaries
14.11.2.1 General
A PDF page may be prepared either for a finished medium, such as a
sheet of paper, or as part of a prepress process in which the content
of the page is placed on an intermediate medium, such as film or an
imposed reproduction plate. In the latter case, it is important to
distinguish between the intermediate page and the finished page. The
intermediate page may often include additional production-related
content, such as bleeds or printer marks, that falls outside the
boundaries of the finished page. To handle such cases, a PDF page
may define as many as five separate boundaries to control various
aspects of the imaging process:
The media box defines the boundaries of the physical medium on which
the page is to be printed. It may include any extended area
surrounding the finished page for bleed, printing marks, or other such
purposes. It may also include areas close to the edges of the medium
that cannot be marked because of physical limitations of the output
device. Content falling outside this boundary may safely be discarded
without affecting the meaning of the PDF file.
The crop box defines the region to which the contents of the page
shall be clipped (cropped) when displayed or printed. Unlike the other
boxes, the crop box has no defined meaning in terms of physical page
geometry or intended use; it merely imposes clipping on the page
contents. However, in the absence of additional information (such as
imposition instructions specified in a JDF or PJTF job ticket), the
crop box determines how the page’s contents shall be positioned on the
output medium. The default value is the page’s media box.
The bleed box (PDF 1.3) defines the region to which the contents of
the page shall be clipped when output in a production environment.
This may include any extra bleed area needed to accommodate the
physical limitations of cutting, folding, and trimming equipment. The
actual printed page may include printing marks that fall outside the
bleed box. The default value is the page’s crop box.
The trim box (PDF 1.3) defines the intended dimensions of the
finished page after trimming. It may be smaller than the media box to
allow for production-related content, such as printing instructions,
cut marks, or colour bars. The default value is the page’s crop box.
The art box (PDF 1.3) defines the extent of the page’s meaningful
content (including potential white space) as intended by the page’s
creator. The default value is the page’s crop box.
The page object dictionary specifies these boundaries in the MediaBox,
CropBox, BleedBox, TrimBox, and ArtBox entries, respectively (see
Table 30). All of them are rectangles expressed in default user space
units. The crop, bleed, trim, and art boxes shall not ordinarily
extend beyond the boundaries of the media box. If they do, they are
effectively reduced to their intersection with the media box. Figure
86 illustrates the relationships among these boundaries. (The crop box
is not shown in the figure because it has no defined relationship with
any of the other boundaries.)
Following that there is a nice graphic showing those boxes in relation to each other:
The reasons why in many cases only the media box is set, are
that in case of PDFs meant for electronic consumption (i.e. reading on a computer) the other boxes hardly matter; and
that even in the prepress context they aren't as necessary anymore as they used to be, cf. the article Pedro refers to in his comment.
Concerning your "bonus question": The user space unit is 1⁄72 inch by default; since PDF 1.6 it can be changed, though, to any (not necessarily integer) multiple of that size using the UserUnit entry in the page dictionary. Changing it in an existing PDF essentially scales the page, as the user space unit is the basic unit in the device-independent coordinate system of a page. Therefore, unless you want to update each and every command in the page descriptions referring to coordinates to keep the page dimensions, you won't want to enforce a millimeter user space unit... ;)
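The box defaults and the unit conversion can be sketched in plain Python. This is an illustration under assumptions, not pyPdf's API: the page is modelled as a plain dict whose box entries are [llx, lly, urx, ury] lists in user space units, and `to_mm`, `resolve_boxes` and `bleed_mm` are hypothetical helper names. It only compares the declared boxes and says nothing about whether page content actually reaches into the bleed area.

```python
PT_PER_INCH = 72.0
MM_PER_INCH = 25.4


def to_mm(units: float, user_unit: float = 1.0) -> float:
    """Convert user space units to millimetres (UserUnit defaults to 1.0)."""
    return units * user_unit / PT_PER_INCH * MM_PER_INCH


def resolve_boxes(page: dict) -> dict:
    """Apply the quoted defaults: crop -> media; bleed, trim, art -> crop."""
    media = page["MediaBox"]
    crop = page.get("CropBox", media)
    return {
        "MediaBox": media,
        "CropBox": crop,
        "BleedBox": page.get("BleedBox", crop),
        "TrimBox": page.get("TrimBox", crop),
        "ArtBox": page.get("ArtBox", crop),
    }


def bleed_mm(page: dict) -> tuple:
    """Distance in mm from each trim edge out to the bleed box.

    Returned as (left, bottom, right, top).
    """
    boxes = resolve_boxes(page)
    unit = page.get("UserUnit", 1.0)
    bleed, trim = boxes["BleedBox"], boxes["TrimBox"]
    return (
        to_mm(trim[0] - bleed[0], unit),
        to_mm(trim[1] - bleed[1], unit),
        to_mm(bleed[2] - trim[2], unit),
        to_mm(bleed[3] - trim[3], unit),
    )
```

For a page that sets only the MediaBox, every box resolves to the same rectangle and the declared bleed comes out as zero on all four sides, which matches the observation in the question.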

How can I improve ReportLab image quality?

I'm building a label printer. It consists of a logo and some text, not tough. I have already spent 3 days trying to get the original SVG logo to draw to screen but the SVG is too complex, using too many gradients, etc.
So I have a high quality bitmapped logo (as a JPG or PNG) and I'm drawing that on a ReportLab canvas. The image in question is much larger than 85*123px. I did this hoping ReportLab would embed the whole thing and scale it accordingly. Here's how I'm doing it:
canvas.drawImage('logo.jpg', 22+xoffset, 460, 85, 123)
The problem is, my assumption was incorrect. It seems to scale it down to 85*123px at screen resolution and that means when it's printed, it doesn't look great.
Does ReportLab have any DPI commands for canvases or documents so I can keep the quality sane?

Having previously worked at the ReportLab company, I can tell you that raster images do not go through any automatic resampling/downscaling while being included in the PDF. The 85*123 dimensions you are using are not pixels, but points (pt) which are a physical unit like millimetres or inches.
I would suggest printing the PDF with different quality images to confirm this or otherwise zooming in very, very far using your PDF viewer. It will always look a bit fuzzy in a PDF viewer as the image is resampled twice (once in the imaging software and then again to the pixels available to the PDF viewer).
This is how I would calculate what size in pixels to make a raster image for it to print well at a given physical size:
Assume I want the picture to be 2 inches wide. There are 72 points in an inch, so the width in my code would be 144. I know that a good crisp resolution to print at is 300 dpi (dots per inch), so the raster image is saved at 600 px wide.
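The same arithmetic as a small sketch. The helper names are hypothetical, and 300 dpi is just the target suggested above, not a ReportLab setting.

```python
POINTS_PER_INCH = 72


def layout_points(inches: float) -> float:
    """Size to pass to drawImage for a given physical size in inches."""
    return inches * POINTS_PER_INCH


def source_pixels(points: float, dpi: float = 300) -> int:
    """Pixels the raster image needs to print at `dpi` when drawn at `points`."""
    return round(points / POINTS_PER_INCH * dpi)


def effective_dpi(pixels: int, points: float) -> float:
    """Print resolution of an image `pixels` wide when drawn `points` wide."""
    return pixels / (points / POINTS_PER_INCH)
```

With these, a 2 inch wide picture is drawn at 144 points and wants a 600 pixel source image. For the 85 point wide box in the question, the source image would need to be about 354 pixels wide to reach 300 dpi.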

One option that I thought of while writing the question is: increase the size of the PDF and let the printer sort things out.
If I just multiplied all my numbers by 5 and the printer did manage to figure things out, I'd have close to 350DPI... But I'm making quite an assumption.

I don't know if it will work for all but in my case it did.
I only needed to add a logo on the top so I used drawImage()
but shrank the size of the logo by a third
c.drawImage(company_logo,225,750,width=(483/3),height=(122/3))
I had to previously know the real company logo size so it does not get distorted.
I hope it helps!
