Extracting Data from Unstructured PDFs in Python

Like it says, I'm trying to find a method to extract data from PDFs in Python. I've explored a few solutions already, but I haven't found one that fits the need.
The PDF I have is scanned in, but I can use Tesseract to turn it into a text PDF if necessary. The goal in the short term is to grab a few values from the PDF and store them. The large scale goal is to get a large number of these PDFs and perform this task automatically. I know how to store the data once I can get it out of the PDF; my problem is actually getting the values out.
I'm not at liberty to display the PDF, but below is an example of what the document looks like.
Sorry for my crude art, I figured this would be easier than recreating an empty copy of the PDF, but I can make a better mock up if necessary. The fields I would like to extract are highlighted in red. Wherever it says TITLE: next to a field is where title would appear on the document, usually on a separate line, save for the field at the bottom.
I've tried using a few tools, notably Azure Cognitive Services and PyPDF2. The issue I usually run into is that each group of words comes out as an individual line in the output, which does not work if the title of a form field is above it, as in the example table below:
left    center    right
One     Two       Three
The output returns left, then center, then right, then One, then Two, then Three. If the field for Two or One was left blank, searching for 3 rows below right would not give me the expected output.
I've run into a few other bugs with other solutions, like needing to have bounding boxes on my PDF for it to work, but I'm starting to run out of solutions to find, and I was wondering if anyone had any ideas for how I can get this task done.
There are multiple pages, however I only really need 1-2, and I only have 1 scanned with Tesseract. The format stays relatively the same, although each pdf is independently scanned in so there could be minor changes there.
Any and all help is greatly appreciated.
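
One way around the reading-order problem described above is to keep the word coordinates that OCR tools can return (pytesseract's image_to_data, for example, reports left, top, width and height for each word) and look a value up by position instead of by line count. The following is only a minimal sketch under that assumption: the word dictionaries, the helper name value_below and the max_gap threshold are all hypothetical and would need tuning on real scans.

```python
def value_below(words, label, max_gap=60):
    """Return the text of the nearest word directly below `label`, or None.

    `words` is a list of dicts with keys text, left, top, width, height
    (pixel coordinates, origin at the top-left of the page).
    """
    anchor = next((w for w in words if w["text"] == label), None)
    if anchor is None:
        return None
    a_left, a_right = anchor["left"], anchor["left"] + anchor["width"]
    a_bottom = anchor["top"] + anchor["height"]

    candidates = []
    for w in words:
        if w is anchor:
            continue
        w_left, w_right = w["left"], w["left"] + w["width"]
        same_column = w_left < a_right and w_right > a_left
        gap = w["top"] - a_bottom  # vertical distance below the label
        if same_column and 0 <= gap <= max_gap:
            candidates.append((gap, w["text"]))
    return min(candidates)[1] if candidates else None


# Hypothetical OCR output for the example table, with "Two" left blank.
words = [
    {"text": "left", "left": 10, "top": 10, "width": 40, "height": 12},
    {"text": "center", "left": 110, "top": 10, "width": 60, "height": 12},
    {"text": "right", "left": 210, "top": 10, "width": 50, "height": 12},
    {"text": "One", "left": 12, "top": 30, "width": 35, "height": 12},
    {"text": "Three", "left": 212, "top": 30, "width": 50, "height": 12},
]
print(value_below(words, "right"))   # Three
print(value_below(words, "center"))  # None
```

Because the lookup is positional, a blank field returns None instead of shifting every later value up by one line.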

Related

Extracting text from specific coordinates of a PDF in python

I have some pre-determined coordinates that I want to look into a PDF to extract text from (some part on the top of the page). I've been trying to use the library pdfminer.six but it seems like the smallest unit for processing and extracting elements is a page.
I was thinking that in order to just get text from a small part of a page, it could get a little inefficient to go through and analyse the entire page when there are a large number of documents to process.
Is there any way to do so? Or is there some other library that can work with this use case, where I can pass in coordinates? Or am I getting the concept wrong fundamentally?
Thanks!

You can use visitor functions to do that:
https://pypdf2.readthedocs.io/en/latest/user/extract-text.html#example-1-ignore-header-and-footer
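
A rough sketch of what such a visitor could look like, modelled on the header/footer example in the linked PyPDF2 docs (PyPDF2 is now maintained as pypdf). The helper names make_region_visitor and extract_region are made up for illustration, and reading the position from the text matrix tm follows that docs example; depending on how the PDF was produced, the position may also need the cm matrix taken into account.

```python
def make_region_visitor(x0, y0, x1, y1):
    """Build a visitor that keeps only text whose origin is inside a box.

    Coordinates are in PDF user space (origin at the bottom-left).
    Returns (visitor, parts); `parts` fills up as the page is processed.
    """
    parts = []

    def visitor(text, cm, tm, font_dict, font_size):
        x, y = tm[4], tm[5]  # translation part of the text matrix
        if x0 <= x <= x1 and y0 <= y <= y1:
            parts.append(text)

    return visitor, parts


def extract_region(path, page_number, box):
    """Extract text from one page, limited to box = (x0, y0, x1, y1)."""
    # Imported here so the visitor factory above has no hard dependency.
    from PyPDF2 import PdfReader

    visitor, parts = make_region_visitor(*box)
    page = PdfReader(path).pages[page_number]
    page.extract_text(visitor_text=visitor)
    return "".join(parts)
```

Something like extract_region("example.pdf", 0, (0, 700, 612, 792)) would then return only the text in the top strip of a letter-size page (the file name is a placeholder). Note that the visitor is called while the whole page is being processed, so this filters the output rather than avoiding the page-level work.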

Parse PDF shape data in python

I am trying to put together a script to fix a large number of PDFs that have been exported from AutoCAD via its DWG2PDF print driver.
When using this driver, all SHX fonts are rendered as shape data instead of text data. The driver does, however, insert a comment into the PDF at the expected location with the expected text.
So far, my script runs through the PDF and inserts hidden text on top of each section, with the text squashed to the size of the comment. This gets me 90% of the way and gives me a document that is searchable.
Unfortunately, the sizing of the comment regions is relatively coarse (integer based), which makes it difficult to accurately determine the orientation of short text and results in unevenly sized boxes around text.
What I would like to be able to do is parse through the shape data in the PDF, collect anything within the bounds of the comment, and then determine a smaller and more accurate bounding box. However all the information I can find is by people trying to parse through text data, and I haven't been able to find anything at all in terms of shape data.
The below image is an example of the raw text in the PDF, the second image shows the comment bounding box in blue, with the red text being what I am setting to hidden to make the document searchable, and copy/paste able. I can get things a little better by shrinking the box by a fixed margin, but with small text items the low resolution of the comment box coordinate data messes things up.
To get this far I am using a combination of PyPDF2 and reportlab, but am open to moving to different libraries.

I didn't end up finding a solution with PyPDF2. I was able to find an easy way to iterate over shape data in pdfminer.six, but then couldn't find a nice way in pdfminer to extract annotation data.
As such I am using one library to get the annotations, one to look at the shape data, and last of all a third library to add the hidden text on the new pdf. It runs pretty slowly as sheet complexity increases but is giving me good enough results, see image below where the rough green borders as found in the annotations are shrunk to the blue borders surrounding the text. Of course I don't draw the boundaries, and use invisible text for the actual program output, giving pretty good selectable/searchable text.
If anyone is interested in looping over the shape data in PDFs the below snippet should get you started.
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTCurve, LTLine

# Walk every page and print the bounding box of each vector shape.
for page_layout in extract_pages("TestSchem.pdf"):
    for element in page_layout:
        # LTLine is a subclass of LTCurve; both are listed for clarity.
        if isinstance(element, (LTCurve, LTLine)):
            print(element.bbox)
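
For the shrinking step described above, a minimal sketch could look like the following. It is not the code I actually run: the function name tight_bbox and the sample numbers are invented for illustration, and it assumes both the comment box and the shape boxes are (x0, y0, x1, y1) tuples in the same coordinate space, as with pdfminer's bbox.

```python
def tight_bbox(comment_bbox, shape_bboxes):
    """Shrink a coarse comment box to the shapes that lie inside it.

    Returns the union of the contained shape boxes, or the comment box
    unchanged if no shape falls entirely inside it.
    """
    cx0, cy0, cx1, cy1 = comment_bbox
    inside = [
        (x0, y0, x1, y1)
        for x0, y0, x1, y1 in shape_bboxes
        if x0 >= cx0 and y0 >= cy0 and x1 <= cx1 and y1 <= cy1
    ]
    if not inside:
        return comment_bbox
    return (
        min(b[0] for b in inside),
        min(b[1] for b in inside),
        max(b[2] for b in inside),
        max(b[3] for b in inside),
    )


# Hypothetical numbers: an integer comment box and three stroke boxes,
# the last of which belongs to some other part of the drawing.
comment = (100, 200, 140, 212)
shapes = [
    (101.2, 201.5, 104.0, 209.8),
    (105.1, 201.5, 109.3, 209.8),
    (300, 300, 310, 310),
]
print(tight_bbox(comment, shapes))  # (101.2, 201.5, 109.3, 209.8)
```

Checking every shape against every comment is quadratic, which would be consistent with the slowdown on complex sheets; a spatial index could help, but I have not tried one.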

Searching Information at PDFs

I am a beginner in Python and programming, and I am trying to solve a problem.
I have a list of keywords. I want to look for these keywords in some folders that contain a lot of PDFs. The PDFs are not character based; they are image based (they contain text as images). In other words, the PDFs were scanned in the first decade of the 2000s, so I cannot search for a word in a PDF file and I cannot use Windows search, etc. I can only check them by eye, which is time consuming and boring.
I researched the question on the internet and found some solutions. Based on these, I tried to write some code in Python. It works, but the success rate is a bit low.
First, my code converts the PDF file to image files (PyMuPDF package).
Second, my code reads the text on these images and builds a string from it (PIL and pytesseract packages).
Finally, the code searches for the keywords in this text and returns True if a keyword is found.
Example:
keyword_list = ["a-001", "b-002", "c-003"]
pdf_list = ["a.pdf", "b.pdf", "c.pdf", ...., "z.pdf"]
The code should find a-001 in a.pdf, because I checked by eye and a.pdf contains a-001. The code did find it.
The code should find b-002 in b.pdf too, because I checked by eye and b.pdf contains b-002. The code could not find it.
So my code's success rate is 50%. When it finds a keyword, it finds the right PDF file; I have no problem there. A PDF it reports really does contain what I am looking for. But sometimes it cannot detect a keyword that I can clearly see in the PDF file.
Do you have a better idea for solving this problem more accurately? I am not chasing a 100% success rate; that is impossible, because some PDFs contain handwriting. But most of them contain printed text, and those should be detected. Can I raise the success rate to 75%?

Your best chance is to extract the image at the highest possible resolution, which might mean not "converting the PDF to an image" but rather "parsing the PDF and extracting the image stream" (given that it was scanned in the 2000s, it is probably a TIFF stream, at that). This is an example using PyMuPDF.
You can perhaps try and further improve the image by adjusting brightness, contrast and using filters such as "despeckling". With poorly scanned images I have had good results with sharpening filters, and there are some filters ("erode" and "washout") that might improve poor typewriting (I remember some "e"'s where the eye of the "e" was almost completely dark, and they got easily mistaken for "c"'s).
Then train Tesseract to improve the recognition rate. I am not sure how this can be done with the Python interface, though.
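
As a rough sketch of the "extract the image stream" idea, assuming each scanned page stores its scan as an embedded image: PyMuPDF's page.get_images() lists the images on a page and doc.extract_image(xref) returns the stored bytes together with their width, height and file extension. The helper names largest_image and extract_page_scans are made up here, and picking the image with the most pixels is just one plausible heuristic.

```python
def largest_image(images):
    """Pick the image dict with the most pixels (width * height)."""
    return max(images, key=lambda im: im["width"] * im["height"], default=None)


def extract_page_scans(pdf_path):
    """Yield (page_number, image_dict) for the largest image on each page."""
    import fitz  # PyMuPDF; imported lazily so largest_image() works without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            xrefs = [info[0] for info in page.get_images(full=True)]
            images = [doc.extract_image(xref) for xref in xrefs]
            best = largest_image([im for im in images if im])
            if best is not None:
                yield page.number, best
```

Each yielded dict carries the raw bytes under "image" and the format under "ext", so the bytes can be written to disk or opened with PIL and passed to pytesseract at the scan's native resolution.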

Tell if text of PDF is visible or not

I'm parsing some PDF files using the pdfminer library.
I need to know if the document is a scanned document, where the scanning machine places the scanned image on top and OCR-extracted text in the background.
Is there a way to identify whether the text is visible, given that OCR machines place it on the page for selection?
Generally the problem is distinguishing between two very different, but similar looking cases.
In one case there's an image of a scanned document that covers most of the page, with the OCR text behind it.
Here's the PDF as text with the image truncated: http://pastebin.com/a3nc9ZrG
In the other case there's a background image that covers most of the page with the text in front of it.
Telling them apart is proving difficult for me.

Your question is a bit confusing so I'm not really sure what is going to help you the most. However, you describe two ways to "hide" text from OCR. Both I think are detectable but one is much easier than the other.
Hidden text
Hidden text is regular or invisible text that is placed behind something else. In other words, you use the stacking order of objects to hide some of them. The only way you can detect this type of case is by figuring out where all of the text objects on the page are (calculating their bounding boxes isn't trivial but certainly possible) and then figuring out whether any of the images on the page overlaps that text and is in front of it. Some additional comments:
Theoretically it could be something else than an image hiding it, but in your OCR case I would guess it's always an image.
Though an image may be overlapping it, it may also be transparent in some way. In that case, the text that is underneath may still shine through. In your case of a general OCR engine, probably not likely.
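
To make the stacking-order check concrete, here is a small sketch. It assumes you have already collected the page objects in drawing order as (kind, bbox) pairs, which is the hard part and is not shown; the function names and the 0.99 threshold are invented for illustration, and transparency and clipping are ignored entirely.

```python
def overlap_fraction(text_bbox, image_bbox):
    """Fraction of the text box's area that the image box covers."""
    tx0, ty0, tx1, ty1 = text_bbox
    ix0, iy0, ix1, iy1 = image_bbox
    w = min(tx1, ix1) - max(tx0, ix0)
    h = min(ty1, iy1) - max(ty0, iy0)
    area = (tx1 - tx0) * (ty1 - ty0)
    if w <= 0 or h <= 0 or area <= 0:
        return 0.0
    return (w * h) / area


def hidden_text(objects, threshold=0.99):
    """Return text objects covered by an image drawn later on the page.

    `objects` is a list of (kind, bbox) pairs in drawing order, where
    kind is "text" or "image" and bbox is (x0, y0, x1, y1).
    """
    hidden = []
    for i, (kind, bbox) in enumerate(objects):
        if kind != "text":
            continue
        later_images = (b for k, b in objects[i + 1:] if k == "image")
        if any(overlap_fraction(bbox, b) >= threshold for b in later_images):
            hidden.append((kind, bbox))
    return hidden


page_image = (0, 0, 612, 792)
word = (50, 700, 200, 712)
scan_like = [("text", word), ("image", page_image)]        # text drawn first
background_like = [("image", page_image), ("text", word)]  # text drawn last
print(len(hidden_text(scan_like)))   # 1
print(hidden_text(background_like))  # []
```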
Invisible text
PDF supports invisible text. More precisely, PDF supports different text rendering modes; those rendering modes determine whether characters are filled, outlined, filled + outlined, or invisible (there are other possibilities yet). In the PDF file you posted, you find this fragment:
BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
That's an invisible chicken right there! The instruction "3 Tr" sets the text rendering mode to "3", which is equal to "invisible" or "neither stroked nor filled" as the PDF specification very elegantly puts it.
It's worthwhile mentioning that these two techniques can be used interchangeably by OCR engines. Placing invisible text on top of a scanned image is actually good practice because it means that most PDF viewers will allow you to select the text. Some PDF viewers that I looked at at some point didn't allow text selection if the text was "behind" the image.
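
A naive way to look for the "3 Tr" case is to scan a decoded content stream for Tr operators. This is only a sketch: it assumes the stream has already been decompressed into bytes, it does not parse string literals (so a "3 Tr" inside a text string would be a false positive), and the names below are my own.

```python
import re

# Matches the single-digit operand of the Tr (text rendering mode) operator.
TR_PATTERN = re.compile(rb"(?<![\w.])(\d)\s+Tr\b")


def rendering_modes(content_stream):
    """Return the text rendering modes set in a decoded content stream."""
    return [int(m) for m in TR_PATTERN.findall(content_stream)]


def has_invisible_text(content_stream):
    """True if the stream ever switches to rendering mode 3 (invisible)."""
    return 3 in rendering_modes(content_stream)


# The fragment quoted above.
fragment = b"""BT
3 Tr
0.00 Tc
/F3 8.5 Tf
1 0 0 1 42.48 762.96 Tm
(Chicken ) Tj
"""
print(rendering_modes(fragment))     # [3]
print(has_invisible_text(fragment))  # True
```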

I don't have a copy of the PDF 1.7 specification, but I suspect that the objects on a page are rendered in order, that is, the preceding objects end up covered up by succeeding objects.
Thus, you would have to iterate through the layout objects (See Performing Layout Analysis) and calculate where everything falls on the page, their dimensions, and their rendering order (and possibly their transparency).
As the pdfminer documentation mentions, PDF is evil.

best way to print data in columnar format?

I am using Python to read in data in a user-unfriendly format and transform it into an easier-to-read format. The records I am outputting are usually going to be just a last name, first name, and room code.
I would like to output a series of pages, each containing a contiguous subset of the total records, divided into multiple columns, each of which contains a contiguous subset of the total records on the page. (So in other words, you'd read down the first column, move to the next column, move to the next column, etc., and then start over on the next page...)
The problem I am facing now is that for output formats, I'm almost certainly limited to HTML (and Javascript, CSS, etc.) What is the best way to get the data into this columnar format? If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally, for instance, I could easily print tables of 5x20, but I don't know if there's a way to indicate a page break -- and I don't know if there's any way to calculate programmatically how many records will fit on the page.
How would you approach this?
EDIT: The reason I said that I was limited in output: I have to produce the file on one computer, then bring it to a different computer upon which we cannot install new software and on which the selection of existing software is not optimal. The file itself is only going to be used to make a physical printout (which is what the end users will actually work with), but my time on the computer that I can print from is going to be limited, so I need to have the file all ready to go and print right away without a lot of tweaking.
Right now I've managed to find a word processor that I can use on the target machine, so I'm going to see if I can target a format that the word processor uses.
EDIT: Once I knew there was a word processor I could use, I made a simple skeleton file with the settings that I wanted (column and tab settings, monospaced font in a small point size, etc.) and then measured how many characters I got per line of a column and how many lines I got per column. I've watched the runs pretty carefully to make sure that there weren't some strange lines that somehow overflowed the characters-per-line guideline (which shouldn't happen with monospaced font, of course, but how many times do you end up having to figure out why that thing that "shouldn't" happen is happening anyways?)
If there hadn't been a word processor on the target machine that I could use, I probably would have looked at PDF as an output format.

"If I knew for certain that the printable area of the paper would hold 20 records vertically and five horizontally"
You do know that.
You know the size of your paper. You know the size of your font. You can easily do the math.
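
Roughly, the math looks like this. The numbers are assumptions for the sake of the example (letter-size paper, 1 inch margins, 10 pt font with 1.2 line spacing, five columns), and the function names are made up.

```python
def lines_per_column(page_height_in, margin_in, font_pt, leading=1.2):
    """How many text lines fit between the top and bottom margins."""
    usable_pt = (page_height_in - 2 * margin_in) * 72  # 72 points per inch
    return int(usable_pt // (font_pt * leading))


def paginate(records, rows, cols):
    """Split records into pages of `cols` columns, each read top to bottom."""
    per_page = rows * cols
    pages = []
    for start in range(0, len(records), per_page):
        chunk = records[start:start + per_page]
        pages.append([chunk[c:c + rows] for c in range(0, len(chunk), rows)])
    return pages


rows = lines_per_column(page_height_in=11, margin_in=1, font_pt=10)
print(rows)  # 54

# 300 dummy records: one full page of 5 columns, then a short second page.
pages = paginate(list(range(300)), rows, cols=5)
print(len(pages), len(pages[0]), len(pages[1]))  # 2 5 1
```

Each page is a list of columns, and each column is a contiguous slice of the records, so reading down one column and then moving to the next preserves the original order.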
"almost certainly limited to HTML..." doesn't make much sense. Is this a web application? The page can have a "Previous" and "Next" button to step through the pages? Pick a size that looks good to you and display one page full with "Previous" and "Next" buttons.
If it's supposed to be one HTML page that prints correctly, that's hard. There are CSS things you can do, but you'll be happier creating a PDF file.
Get PyX or ReportLab and create a PDF that prints properly.
I -- personally -- have no patience with any of this. I try to put this kind of thing into a CSV file. My users can then open the CSV with a spreadsheet tool (OpenOffice.org has a good one) and then adjust the columns and print with it.
