Convert PDF to HTML to get bold and font size in Python

I want to get the boldness and font size of text from a PDF, but I can't extract that information from the PDF directly.
So I want to convert the PDF to HTML in Python. I have tried every library I know of, but none works perfectly; the formatting gets disturbed with every one. I have tried pdfminer.six, pdf2htmlEX and many more, but none gives output in the proper format. I know this question has been asked many times before, but none of the existing answers has worked for me so far.
PDF link: https://github.com/sahib-s/Heading-Detection-PDF-Files/blob/master/1.pdf
Problem: on the very first page of the PDF, COURSE OBJECTIVE is on a single line, but after converting it is split up into different tags. I'm having a similar problem everywhere. I want each heading to end up in a single tag.
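One way to avoid the HTML round-trip entirely is to read font information straight from the PDF with pdfminer.six: every LTChar layout object carries a font name and size, and bold text usually has "Bold" somewhere in the font name. A minimal sketch, assuming the linked 1.pdf sits in the working directory (the "Bold" substring check is a common heuristic, not a guarantee):

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

for page in extract_pages("1.pdf"):
    for element in page:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            # Gather font name and size for every character on the line
            chars = [c for c in line if isinstance(c, LTChar)]
            if not chars:
                continue
            fonts = {(c.fontname, round(c.size, 1)) for c in chars}
            bold = any("Bold" in c.fontname for c in chars)
            print(line.get_text().strip(), fonts, "BOLD" if bold else "")
```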

Related

Camelot cannot extract the entire table

I'm using Camelot to extract table information from a PDF that I converted from scanned to searchable using ocrmypdf (500 dpi).
Camelot seems to be able to identify the table and extract most of the data within it, but it is unable to extract the bottom half. In essence, it sees the top half of the table but seems unable to separate the text in the lower half.
This is the table from the PDF in question:
But when I use Camelot's visual debugging method, where I ask it to show me the words it will extract, it seems to recognize the bottom section of the table as one giant block.
Any guidance you can provide on improving Camelot's "vision" here would be helpful.
Apart from the block, the horizontal lines are also marked as text, which is odd.
Camelot uses pdfminer.six for text extraction, and you can pass LAParams (page 16 of the documentation) to camelot.read_pdf() through its layout_kwargs argument to tweak that.
You should also check out camelot.plot(table, kind="grid") to see whether the lines are recognized correctly. If not, that might be where the problem lies.
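A minimal sketch of both suggestions; the filename and the LAParams values are placeholders to tune, not tested settings:

```python
import camelot
import matplotlib.pyplot as plt

# LAParams passed through layout_kwargs control how pdfminer groups glyphs;
# char_margin and line_margin decide when nearby text merges into one block.
tables = camelot.read_pdf(
    "scanned.pdf",                                    # placeholder filename
    flavor="lattice",                                 # lattice mode uses ruling lines
    layout_kwargs={"char_margin": 1.0, "line_margin": 0.3},
)

# Visual debugging: check whether the table's ruling lines were detected.
camelot.plot(tables[0], kind="grid")
plt.show()
```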

How can I extract text from a PDF using Python, similarly to what the Chrome browser does?

I'm trying to extract text from PDF files (similar to a form). Currently, I open the file in Chrome, select/copy all the text, paste it into a txt file and process it into CSV using Python. Chrome gives me data that is quite structured and uniform, so every page of the PDF results in a similar block of text, allowing me to process it easily.
I'm trying to extract the text directly from the PDF to process it into CSV format, but I always get messy results due to the way the original PDF is generated. I've tried pdfminer and PyPDF2, but the results get messy when the form has a missing value in certain fields.
Maybe it's a general question, but how can I get a more structured result from my extraction?
Not all PDFs have embedded text; in some, the text lives inside embedded images. Hence, a common solution that works for all PDFs is to use OCR.
Step 1) Convert the PDF to an image.
Step 2) Use pytesseract to perform OCR: Use pytesseract OCR to recognize text from an image
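A minimal sketch of those two steps, assuming pdf2image (which needs poppler installed) for the conversion and "form.pdf" as a stand-in filename:

```python
import pytesseract
from pdf2image import convert_from_path

# Step 1: render each PDF page as a PIL image (requires poppler)
pages = convert_from_path("form.pdf", dpi=300)

# Step 2: OCR each page image with pytesseract
for i, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page)
    print(f"--- page {i} ---")
    print(text)
```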

Read PDF summary with Python

I'm trying to read some PDF documents with Python.
I would like to extract a summary from the first page.
Is there a library able to do this?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google search reveals several text summarizer Python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky, even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
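For the extraction half, a minimal sketch with pdfminer.six that pulls text from the first page only ("report.pdf" is a stand-in filename); the summarization step is left to whichever library you pick:

```python
from pdfminer.high_level import extract_text

# page_numbers is 0-based, so [0] restricts extraction to the first page
first_page_text = extract_text("report.pdf", page_numbers=[0])
print(first_page_text)
```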
If you're using macOS, there is a built-in text summarizing Service: right-click on any selected text and click "Summarize" to activate it. It seems hard to incorporate this into an automated process, though.

Exporting Jupyter Notebook as either PDF or HTML makes all HTML plaintext

I am just getting started with Jupyter Notebook and I'm running into an issue when exporting.
In my current notebook, I alternate between code cells and markdown cells (which explain my code).
In the markdown cells, sometimes I will use a little HTML to display a table or a list. I will also use the bold tag <b></b> to emphasize a particular portion of text.
My problem is, when I export this notebook to PDF (via the menu in Jupyter Notebook) all of my HTML gets saved as plaintext.
For example, when exporting to PDF, instead of displaying a table, the raw HTML is shown: <tr>Table</tr> <th>part1</th>, etc.
I've tried exporting to HTML instead, but even the HTML file displays the HTML as plaintext.
I tried downloading nbconvert (which is probably what I'm using via the Jupyter GUI anyway) and running it from the terminal, but I still get the same result.
Has anyone run into this problem before?
I tried to export it to HTML and it worked normally.
Where did you define your HTML? Did you use the Markdown cells?
Alternatives:
I don't have nbconvert, but what about exporting to HTML and using another tool to convert that to a PDF? (See the sketch below.)
Use Markdown; it provides tables. Link:
https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
Consider upgrading your notebook.
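A minimal sketch of the export-to-HTML route using nbconvert's Python API ("notebook.ipynb" is a stand-in filename); pair the resulting file with a browser's print-to-PDF, or an add-on, if you need a PDF without going through LaTeX:

```python
import nbformat
from nbconvert import HTMLExporter

# Read the notebook and render it to a standalone HTML document
nb = nbformat.read("notebook.ipynb", as_version=4)
body, _resources = HTMLExporter().from_notebook_node(nb)

with open("notebook.html", "w", encoding="utf-8") as f:
    f.write(body)
```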
I fixed this myself.
It turns out that somewhere in the cell there was a stray plaintext tag.
Although it did not run the entire length of the cell, the fact that the plaintext tag was there at all changed how the cell was rendered.
Next, I had strange formatting errors (text of different sizes and strangely emphasized) when using = as plain text in the cell. When opening the cell for editing, these = symbols were big, bold and blue. This comes from Markdown: a line of = signs under text marks that text as a heading.
This was solved by placing the = on the same line as other text.
I did have to convert the page to HTML, then use a Firefox add-on to convert that to PDF.
Converting to PDF from Jupyter Notebook uses LaTeX to transcribe the page, and all HTML is converted to plain text.
The page then appeared as normal, with HTML tables and normal HTML in the markdown cells. I just had to be careful with any extraneous tags.
If anyone else encounters this problem, check your HTML tags and make sure you are not accidentally triggering something in Markdown.

Processing a PDF for information extraction

I am working on a project where I have a PDF file that describes a health policy. What I need to do is extract the information from this PDF and save it in some form such that I can answer questions related to the policy by pulling info from it.
This PDF is too big, so I want to divide it according to its different sections, so that when a query related to some particular area comes in, I won't have to go through the entire document.
I tried solving this using PDF converters that convert PDFs into HTML. But these converters don't convert the PDF to HTML properly, so the headings don't get heading tags. Also, even if I convert it properly and get the proper sections out of the document, I don't know how to store this data (I mean, in which form should I store it?).
Is there any other solution with which I can achieve this? I am using Python, and I can also use NLTK if needed. The format is not fixed for the PDFs; my code should work on any kind of PDF.
PDFMiner is great in that it has a location for every bit of text it gets from the PDF. The text won't be nicely put into header tags or anything like that, but if your docs share a consistent PDF structure you might be able to get something working.
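A minimal sketch of what that looks like with pdfminer.six ("policy.pdf" is a stand-in filename): every text box comes with its bounding box, which you can combine with font-size heuristics to spot headings and split the document into sections:

```python
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer

for page_num, page in enumerate(extract_pages("policy.pdf"), start=1):
    for element in page:
        if isinstance(element, LTTextContainer):
            x0, y0, x1, y1 = element.bbox  # position of the text box on the page
            print(page_num, (x0, y0, x1, y1), element.get_text().strip())
```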
