Processing a PDF for information extraction

Processing a PDF for information extraction - python

I am working on a project where I have a pdf file which describes one of the health policy. What I need to do is extract the information from this PDF and try to save it in some form such that I can answer the questions related to the policy by extracting info from this PDf.
This PDF is too big, so I want to divide the PDF according to the different sections so that when a query related to some particular area comes in then I wont have to go through the entire document.
I tried solving this using some pdf converters which converts the PDFs into the HTMLs. But these converters wont convert the PDF to HTML properly so that headings will have heading tag. Also even if I convert this properly and get the proper sections out of the document, I am not getting how to store this data.(I mean in which form should I store this Data).
Is there any other solution with which I can achieve this. I am using Python and also I can use NLTK if needed. Also the format is not fixed for the PDfs, I mean to say my code should work on any kind of PDFs.

PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.

Related

Is there any way of generating exact HTML page from a PDF file?

I am trying different python libraries like pdftotree, pdfminer, tabula etc. But could not get the exact results. I mean I can get text from PDF, Images and Tabular data in HTML, but not as maintained and organized as original PDF file. Can someone help me with something regarding this? I would be thankful.

Mostly yes. Translate the PDF to SVG, and embed the SVG in your web page.
SVG's image model (what it can represent and how) is a near-superset of the PDF image model (which is itself a superset of PostScript), though SVG lacks some of the print-specific features of PDF. There are probably quite a few PDF->SVG converters out there already. Googling "Pdf to SVG" turned up quite a few promising hits
There will be some complications:
Many PDF files are longer than 1 page. You might need to generate 10 SVG files for a single 10 page PDF file, and then build a web page around those 10 SVGs. Throw in some dynamic HTML to "turn pages" and you've got a good web-based PDF viewer.
There are parts of PDF that aren't within its image model at all... bookmarks, annotations (form fields, digital signatures), document metadata (author, creation date, etc), and so forth. Some of the non-image-model stuff is common enough that a PDF to SVG utility might handle it directly (links), while other stuff doesn't have an HTML equivalent and would be lost.
You could preserve the appearance of a digital signature, but the actual security represented by those visuals would be gone. Preserving that signature's appearance could be considered lying about the security.

Convert PDF to HTML to get bold and font size in python

I want to get bold and size of text from pdf, but I can't extract such information from pdf.
So, I want to convert pdf to html in python and I have tried every possible library which I know but none is working perfectly, my format is getting disturb in every library. I have tried pdfminer.six, pdf2htmlEX and many more but none is giving output in proper format. I know this question has been asked previously so many times but none is working perfectly for me as of now.
PDF link: https://github.com/sahib-s/Heading-Detection-PDF-Files/blob/master/1.pdf
Problem: In pdf very first page COURSE OBJECTIVE is in single line but after converting, it's splitting up in different tags. I'm having similar problem everywhere. I want to get headings in same tags.

Read PDF summary with Python

I'm trying to read some PDF documents with Python.
I would like to extract a summary in the first page.
Does it exist a library able to do it?

There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google reveals that there are several text summarizer python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
If you're using MacOS, there is a built-in text summarizing Service. Right click on any selected text and click "Summarize" to activate. Though it seems hard to incorporate this into any automated process.

Is there a way to automate specific data extraction from a number of pdf files and add them to an excel sheet?

Regularly I have to go through a list of pdf files and search for specific data and add them to an excel sheet for later review. As the number of pdf files are around 50 per month, it is both time taking and frustrating to do it manually.
Can the process be automated in windows by python or any other scripting language? I require to have all the pdf files in a folder and run the script which will generate an excel sheet with all the data added. The pdf files with which I work are tabular and have similar structures.

Yes. And no. And maybe.
The problem here is not extracting something from a PDF document. Extracting something is almost always possible and there are plenty of tools available to extract content from a PDF document. Text, images, whatever you need.
The major problem (and the reason for the "no" or "maybe") is that PDF in general is not a structured file format. It doesn't care about columns, paragraphs, tables, sentences or even words. In the general case it cares only about characters on a page in a specific location.
This means that in the general case you cannot query a PDF document and ask it for every paragraph or for the third sentence in the fifth paragraph. You can ask a library to get all of the text or all of the text in a specific location. And then you have to hope the library is able to extract the text you need in a legible format. Because there doesn't even have to be the case that you can copy and paste or otherwise extra understandable characters from a PDF file. Many PDF files don't even contain enough information for that.
So... If you have a certain type of document and you can test that it predictably behaves a certain way with a certain extraction engine, then yes, you can extract information from a PDF file.
If the PDF files you receive are different all the time or the layout on the page is totally different every time than the answer is probably that you cannot reliably extract the information you want.
As a side note:
There are certain types of PDF documents that are easier to handle than others so if you're lucky that might make your life easier. Two examples:
Many PDF files will in fact contain textual information in such a way that it can be extracted in a legible way. PDF files that follow certain standards (such as PDF/A-1a, PDF/A-2a or PDF/A-2u etc...) are even required to be created this way.
Some PDF files are "tagged" which means they contain additional structural information that allows you to extract information in an easier and more meaningful way. This structure would in fact identify paragraphs, images, tables etc and if the tagging was done in a good way it could make the job of content extraction much easier.

You could use pdf2text2 in Python to extract data from your PDF.
Alternatively you can use pdftotext that is part of the Xpdf suite

How to include page in PDF in PDF document in Python

I am using reportlab toolkit in Python to generate some reports in PDF format. I want to use some predefined parts of documents already published in PDF format to be included in generated PDF file. Is it possible (and how) to accomplish this in reportlab or in python library?
I know I can use some other tools like PDF Toolkit (pdftk) but I am looking for Python-based solution.

I'm currently using PyPDF to read, write, and combine existing PDF's and ReportLab to generate new content. Using the two package seemed to work better than any single package I was able to find.

If you want to place existing PDF pages in your Reportlab documents I recommend pdfrw. Unlike PageCatcher it is free.
I've used it for several projects where I need to add barcodes etc to existing documents and it works very well. There are a couple of examples on the project page of how to use it with Reportlab.
A couple of things to note though:
If the source PDF contains errors (due to the originating program following the PDF spec imperfectly for example), pdfrw may fail even though something like Adobe Reader has no apparent problems reading the PDF. pdfrw is currently not very fault tolerant.
Also, pdfrw works by being completely agnostic to the actual content of the PDF page you are placing. So for example, you wouldn't be able to use pdfrw inspect a page to see if it contains a certain string of text in the lower right-hand corner. However if you don't need to do anything like that you should be fine.

There is an add-on for ReportLab — PageCatcher.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.