Is there a way to automate specific data extraction from a number of pdf files and add them to an excel sheet? - python

Regularly I have to go through a list of pdf files and search for specific data and add them to an excel sheet for later review. As the number of pdf files are around 50 per month, it is both time taking and frustrating to do it manually.
Can the process be automated in windows by python or any other scripting language? I require to have all the pdf files in a folder and run the script which will generate an excel sheet with all the data added. The pdf files with which I work are tabular and have similar structures.

Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible and there are plenty of tools available to extract content from a PDF document. Text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it cares only about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask it for every paragraph or for the third sentence in the fifth paragraph. You can ask a library to get all of the text or all of the text in a specific location. And then you have to hope the library is able to extract the text you need in a legible format. Because there doesn't even have to be the case that you can copy and paste or otherwise extra understandable characters from a PDF file. Many PDF files don't even contain enough information for that.
So... If you have a certain type of document and you can test that it predictably behaves a certain way with a certain extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different all the time or the layout on the page is totally different every time than the answer is probably that you cannot reliably extract the information you want.
As a side note:
There are certain types of PDF documents that are easier to handle than others so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc...) are even required to be created this way.
Some PDF files are "tagged" which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc and if the tagging was done in a good way it could make the job of content extraction much easier.

You could use pdf2text2 in Python to extract data from your PDF.
Alternatively you can use pdftotext that is part of the Xpdf suite

Related

Read PDF summary with Python

I'm trying to read some PDF documents with Python.
I would like to extract a summary in the first page.
Does it exist a library able to do it?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google reveals that there are several text summarizer python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
If you're using MacOS, there is a built-in text summarizing Service. Right click on any selected text and click "Summarize" to activate. Though it seems hard to incorporate this into any automated process.

Converting PDF to any parse-able format

I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following in the line of #nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.

Processing a PDF for information extraction

I am working on a project where I have a pdf file which describes one of the health policy. What I need to do is extract the information from this PDF and try to save it in some form such that I can answer the questions related to the policy by extracting info from this PDf.
This PDF is too big, so I want to divide the PDF according to the different sections so that when a query related to some particular area comes in then I wont have to go through the entire document.
I tried solving this using some pdf converters which converts the PDFs into the HTMLs. But these converters wont convert the PDF to HTML properly so that headings will have heading tag. Also even if I convert this properly and get the proper sections out of the document, I am not getting how to store this data.(I mean in which form should I store this Data).
Is there any other solution with which I can achieve this. I am using Python and also I can use NLTK if needed. Also the format is not fixed for the PDfs, I mean to say my code should work on any kind of PDFs.
PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.

Hide information in a PDF file in Python

In Python, I have files generated by ReportLab. Now, i need to extract some pages from that PDF and hide confidential information.
I can create a PDF file with blacked-out spots and use pyPdf to mergePage, but people can still select and copy-paste the information under the blacked-out spots.
Is there a way to make those spots completely confidential?
Per example, I need to hide addresses on the pages, how would i do it?
Thanks,
Basically you'll have to remove the corresponding text drawing commands in the PDF's page content stream. It's much easier to generate the pages twice, once with the confidential information, once without them.
It might be possible (I don't know ReportLab enough) to specially craft the PDF in a way that the confidential information is easier accessible (e.g. as separate XObjects) for deletion. Still you'd have to do pretty low-level operations on the PDF -- which I would advise against.
(Sorry, I was not able to log on when I posted the question...)
Unfortunately, the document cannot be regenerated at will (context sensitive), and those PDF files (about 35) are 3000+ pages.
I was thinking about using pdf2ps and pdf2ps back, but there is a lot of quality.
pdf2ps -dLanguageLevel=3 input.pdf - | ps2pdf14 - output.pdf
And if i use "pdftops" instead, the text is still selectable. If there is a way to make it non-selectable like with "pdf2ps" but with better quality, it will do too.

Generating & Merging PDF Files in Python

I want to automatically generate booking confirmation PDF files in Python. Most of the content will be static (i.e. logos, booking terms, phone numbers), with a few dynamic bits (dates, costs, etc).
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
From doing a bit of search, it seems that I can use reportlab for creating content and pyPdf for merging PDF's together. Is this the best approach? Or is there a really funky way that I haven't come across yet?
Thanks!
From the user side, the simplest way to do this would be to start with a PDF file with the static content, and then using python to just add the dynamic parts. Is this a simple process?
Unfortunately no. There are several tools that are good at producing PDFs from scratch (most commonly for Python, ReportLab), but they don't generally load existing PDFs. You would have to include generating code for any boilerplate text, lines, blocks, shapes and images, rather than this being freely editable by the user.
On the other side there's pyPdf which can load PDFs, collate the pages, and extract some of the information, but can't really add new content. You can ‘merge’ pages into one, but you'd still have to create the extra information overlay as a page in ReportLab first.
Look into docutils and reSTructuredText. You could quickly write out your PDF document in reST and then compile the PDF using rst2pdf.py
I've used this, it creates very beautiful documents and the markup is extensible! Later you could take the same code and run it into rst2html to create a website out if it!
Take a look here:
http://docutils.sourceforge.net/docs/user/rst/quickref.html
http://code.google.com/p/rst2pdf/
Good luck
You could generate a document through, for example, TeX, or OpenOffice, or whatever gives you the most comfortable bindings and then print the document with a pdf printer.
This allows you not to have to figure out where to put fields precisely or figure out what to do if your content overflows the space allocated for it.

Categories