Parse PDF shape data in python - python

I am trying to put together a script to fix PDFs a large number of PDFs that have been exported from Autocad via their DWG2PDF print driver.
When using this driver all SHX fonts are rendered as shape data instead of text data, they do however have a comment inserted into the PDF at the expected location with the expected text.
So far in my script I have got it to run through the PDF and insert hidden text on top of each section, with the text squashed to the size of the comment, this gets me 90% of the way and gives me a document that is searchable.
Unfortunately the sizing of the comment regions is relatively course (integer based) which makes it difficult to accurately determine the orientation of short text, and results in uneven sized boxes around text.
What I would like to be able to do is parse through the shape data in the PDF, collect anything within the bounds of the comment, and then determine a smaller and more accurate bounding box. However all the information I can find is by people trying to parse through text data, and I haven't been able to find anything at all in terms of shape data.
The below image is an example of the raw text in the PDF, the second image shows the comment bounding box in blue, with the red text being what I am setting to hidden to make the document searchable, and copy/paste able. I can get things a little better by shrinking the box by a fixed margin, but with small text items the low resolution of the comment box coordinate data messes things up.
To get this far I am using a combination of PyPDF2 and reportlab, but am open to moving to different libraries.

I didn't end up finding a solution with PyPDF2, I was able to find an easy way to iterate over shape data in pdfminer.six, but then couldn't find a nice way in pdfminer to extract annotation data.
As such I am using one library to get the annotations, one to look at the shape data, and last of all a third library to add the hidden text on the new pdf. It runs pretty slowly as sheet complexity increases but is giving me good enough results, see image below where the rough green borders as found in the annotations are shrunk to the blue borders surrounding the text. Of course I don't draw the boundaries, and use invisible text for the actual program output, giving pretty good selectable/searchable text.
If anyone is interested in looping over the shape data in PDFs the below snippet should get you started.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTLine, LTCurve
for page_layout in extract_pages("TestSchem.pdf"):
for element in page_layout:
if isinstance(element, LTCurve) or isinstance(element, LTLine):
print(element.bbox)

Related

Extracting Data from Unstructured PDFs in Python

Like it says, I'm trying to find a method to extract data from PDFs in Python. I've explored a few solutions already, but I'm not finding any solution that fit the need.
The PDF I have is scanned in, but I can use Tesseract to turn it into a text pdf if necessary. The goal in the short term is to grab a few values from the PDF and store them. The large scale goal is to get a large number of these PDFs and perform this task automatically. I know how to store the data if I can get it out of the PDF, my problem is actually getting the values out.
I'm not at liberty to display the PDF, below is an example of what the document looks like.
Sorry for my crude art, I figured this would be easier than recreating an empty copy of the PDF, but I can make a better mock up if necessary. The fields I would like to extract are highlighted in red. Wherever it says TITLE: next to a field is where title would appear on the document, usually on a separate line, save for the field at the bottom.
I've tried using a few tools, notably Azure Cognitive Services and PyPDF2, however the issues I'm usually running into is that the output has each group of words as an individual line in the output, which does not work if the title of a form field is above it, like the example table below
left
center
right
One
Two
Three
The output returns left, then center, then right, then One, then Two, then Three. If the field for Two or One was left blank, searching for 3 rows below right would not give me the expected output.
I've run into a few other bugs with other solutions, like needing to have bounding boxes on my PDF for it to work, but I'm starting to run out of solutions to find, and I was wondering if anyone had any ideas for how I can get this task done.
There are multiple pages, however I only really need 1-2, and I only have 1 scanned with Tesseract. The format stays relatively the same, although each pdf is independently scanned in so there could be minor changes there.
Any and all help is greatly appreciated.

Getting bounding boxes of characters from PDF

I've been hacking away at this for a couple of days now, but haven't been able to find a solution that is satisfactory. Essentially, my goal is to find the bounding boxes of characters from PDF to eventually use as training data for an OCR system. This means I need clear and consistent bounding box extraction from generated PDFs (like those at arxiv which actually have text information in them, hence the ability to highlight with cursor). I've been mainly working with python and PDFMiner.
Most of the solutions I've seen are for now lower level than lines of text, and the issue I had there was that PDFs had such varying structures that this wasn't even reliable. I've been able to get bounding boxes of characters through html using pdftotext, but the boxes were mis-sized, most often cutting off the tail ends of characters which are crucial to OCR training.
Thanks!

Python pptx picture placeholder rotation makes picture vanish in PowerPoint (not in LibreOffice)

I want to insert pictures in a Picture Placeholder. Previously I get EXIF information to rotate if needed. This is my code:
placeholder = slide.placeholder[N]
photo.open()
ph_picture = placeholder.insert_picture(photo)
if exif['Orientation'] == LANDSCAPE_ORIENTATION:
ph_picture.rotation = 90.0
When I open the pptx generated file in LibreOffice, the images looks great. But in PowerPoint they vanish. Removing the lines that rotates the image makes that the images appears again.
Any clue what is happening or what is a better way to rotate the images?
(I've also tried with PIL but the result is even worst).
In general, the way forward in this kind of situation is to create a slide that looks the way you want using PowerPoint (by hand), then inspect the XML it creates to see how it approaches the feature.
Unfortunately, neither PowerPoint or OpenOffice are completely compliant with the spec and they exhibit unexpected behaviors, especially around the fringes, in the less-used features. So even though I expect you'll find that python-pptx is producing the expected XML in this situation, for some reason PowerPoint is not interpreting that as the spec would state.
But there's a very high likelihood that if you can get PowerPoint to rotate the image without disappearing it, that doing the same thing to the XML that it does will produce the result you're after. So that's where I'd start.
It is a little suspicious that it's a picture placeholder though, I'd first try it with a normal picture shape and see if it has the same problem. That's liable to narrow it down either way.

Tell if text of PDF is visible or not

I'm parsing some PDF files using the pdfminer library.
I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.
Is there a way to identify if text is visible, as OCR machines do place it on the page for selection.
Generally the problem is distinguishing between two very different, but similar looking cases.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.
Your question is a bit confusing so I'm not really sure what is going to help you the most. However, you describe two ways to "hide" text from OCR. Both I think are detectable but one is much easier than the other.
Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it. Some additional comments:
Theoretically it could be something else than an image hiding it, but in your OCR case I would guess it's always an image.
Though an image may be overlapping it, it may also be transparent in some way. In that case, the text that is underneath may still shine through. In your case of a general OCR engine, probably not likely.
Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled + outlined, or invisible (there are other possibilities yet). In the PDF file you posted, you find this fragment:
BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
It's worthwhile mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers that I looked at at some point didn't allow text selection if the text was "behind" the image.
I don't have a copy of the PDF 1.7 specification, but I suspect that the objects on a page are rendered in order, that is, the preceding objects end up covered up by succeeding objects.
Thus, you would have to iterate through the layout objects (See Performing Layout Analysis) and calculate where everything falls on the page, their dimensions, and their rendering order (and possibly their transparency).
As the pdfminer documentation mentions, PDF is evil.

Is there a way to extract text information from a postscript file? (.ps .eps)

I want to extract the text information contained in a postscript image file (the captions to my axis labels).
These images were generated with pgplot. I have tried ps2ascii and ps2txt on Ubuntu but they didn't produce any useful results. Does anyone know of another method?
Thanks
It's likely that pgplot drew the fonts in the text directly with lines rather than using text. Especially since pgplot is designed to output to a huge range of devices including plotters where you would have to do this.
Edit:
If you have enough plots to be worth
the effort than it's a very simple
image processing task. Convert each
page to something like tiff, in mono
chrome Threshold the image to binary,
the text will be max pixel value.
Use a template matching technique.
If you have a limited set of
possible labels then just match the
entire label, you can even start
with a template of the correct size
and rotation. Then just flag each
plot as containing label[1-n], no
need to read the actual text.
If you
don't know the label then you can
still do OCR fairly easily, just
extract the region around the axis,
rotate it for the vertical - and use
Google's free OCR lib
If you have pgplot you can even
build the training set for OCR or
the template images directly rather
than having to harvest them from the
image list

Categories