I'm using Camelot to extract table information from a PDF that I converted from scanned to searchable using ocrmypdf (500 dpi).
Camelot identifies the table and extracts most of the data in it, but it fails on the bottom half. In essence, it sees the top half of the table but seems unable to separate the text in the lower half.
This is the table from the PDF in question:
But when I use Camelot's visual debugging to show me the words it will extract, it seems to treat the bottom section of the table as one giant block.
Any guidance you can provide on improving Camelot's "vision" here would be helpful.
Apart from the block, the horizontal lines are also marked as text, which is odd.
Camelot uses pdfminer.six for text extraction, and you can pass LAParams (page 16) to camelot.read_pdf() through its layout_kwargs argument to tweak that.
You should also check out camelot.plot(table, kind="grid") to see if the lines are recognized correctly. If not, that might be where the problem lies.
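A minimal sketch of both suggestions, in case it helps. It assumes camelot.read_pdf() forwards a layout_kwargs dict to pdfminer's LAParams and that kind= is the plot keyword in your Camelot version; the helper names, the lattice flavor and the margin values are my own placeholders to tune, not recommended settings.

```python
def build_layout_kwargs(char_margin=1.0, line_margin=0.3, word_margin=0.1):
    """Return pdfminer LAParams settings as a plain dict.

    Smaller margins make pdfminer less eager to merge neighbouring
    characters and lines into one large text block. The defaults here
    are illustrative starting points only.
    """
    return {
        "char_margin": char_margin,
        "line_margin": line_margin,
        "word_margin": word_margin,
    }


def read_tables(pdf_path, pages="1", layout_kwargs=None):
    """Read tables with Camelot, passing the layout settings to pdfminer."""
    import camelot  # third-party, imported lazily

    return camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor="lattice",  # assumption: the table has ruling lines
        layout_kwargs=layout_kwargs or build_layout_kwargs(),
    )


def debug_plots(table):
    """Show what Camelot sees: the text boxes and the detected grid."""
    import camelot
    import matplotlib.pyplot as plt

    camelot.plot(table, kind="text")
    camelot.plot(table, kind="grid")
    plt.show()
```

Typical use would be tables = read_tables("your_file.pdf") followed by debug_plots(tables[0]), changing one margin at a time and checking whether the giant block in the text plot breaks apart.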
I need to deal with tables in many Word files. Some of them are created in Word table format, which can be read using python-docx.
However, some of them are inserted from Excel, and I don't know why python-docx cannot read them. Here is the piece of code I wrote as a test. As you can see in the terminal, there is nothing in the list variable 'tables'.
from docx import Document

docFile = 'a.docx'
document = Document(docFile)
tables = document.tables  # list of tables python-docx finds in the document
print(tables)  # prints an empty list for the Excel-inserted tables
Can anyone help? Thanks a lot!
I'm fighting the same issue using Pages on OSX to create a .docx template. I've found that Format > Arrange > Object Placement needs to be set to Move with text for the table. Changing it to have any alignment or formatting causes the tables to disappear in Python and be read as paragraphs that contain nothing. Looking at the XML of both and the python-docx code, I'm suspicious of w:tblInd, but I'm not clued up enough to go much further. I see recent GitHub issues covering this, so hopefully it will get sorted.
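If you want to look at the XML yourself, here is a rough diagnostic sketch using only the standard library. It relies on python-docx's document.tables listing only the tables that sit directly in the document body, and on my assumption that Excel-inserted tables are usually stored as embedded objects under word/embeddings/ rather than as w:tbl elements. The function name inspect_docx is made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def inspect_docx(path_or_file):
    """Report where tables live inside a .docx package."""
    with zipfile.ZipFile(path_or_file) as package:
        names = package.namelist()
        root = ET.fromstring(package.read("word/document.xml"))

    body = root.find(f"{{{W}}}body")
    # Tables that are direct children of <w:body>
    top_level = len(body.findall(f"{{{W}}}tbl")) if body is not None else 0
    # Every <w:tbl> anywhere, including ones wrapped in frames or text boxes
    total = sum(1 for _ in root.iter(f"{{{W}}}tbl"))

    return {
        "top_level_tables": top_level,
        "nested_tables": total - top_level,
        "embedded_parts": [n for n in names if n.startswith("word/embeddings/")],
    }
```

Calling inspect_docx('a.docx') and finding zero top-level tables but some nested tables or embedded parts would explain the empty list: the data is there, just not where document.tables looks.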
example on OSX:
I'm trying to read some PDF documents with Python.
I would like to extract a summary in the first page.
Is there a library that can do this?
There are two parts to your problem: first you must extract the text from the PDF, and then run that through a summarizer.
There are many utilities to extract text from a PDF, though text in a PDF may not be stored in a 'logical' order.
(For instance, a page with two text columns might be stored with the first line of both columns, followed by the next, etc; rather than all the text of the first column, then the second column, as a human would read it.)
The PDFMiner library would seem to be ideal for extracting the text. A quick Google search reveals that there are several text summarizer Python libraries, though I haven't used any of them and can't attest to their abilities. But parsing human language is tricky - even for humans.
https://pypi.org/project/text-summarizer/
http://ai.intelligentonlinetools.com/ml/text-summarization/
If you're using macOS, there is a built-in text summarizing Service. Right-click on any selected text and click "Summarize" to activate it, though it seems hard to incorporate this into any automated process.
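To make the two parts concrete, here is a rough sketch. The first function uses pdfminer.six's extract_text() restricted to the first page. The second is a deliberately naive word-frequency summarizer I wrote as a stand-in for a real summarization library (it is not taken from either link above), so treat it as an illustration of the pipeline, not of summary quality.

```python
import re
from collections import Counter


def first_page_text(pdf_path):
    """Extract the text of the first page with pdfminer.six."""
    from pdfminer.high_level import extract_text  # third-party, lazy import

    return extract_text(pdf_path, page_numbers=[0])  # pages are zero-based


def summarize(text, max_sentences=3):
    """Keep the sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore very short words as a crude substitute for a stop-word list
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)
```

You would chain them as summarize(first_page_text("your_file.pdf")). Because of the ordering issue mentioned above, sentence splitting on multi-column pages may need extra cleanup first.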
I am planning to search for a specific heading in a document and then strike out all the content under that heading. The document has many headings, and each heading may contain paragraphs, tables and images, in any combination.
I have installed docx, and I was able to find the specific heading and strike out its paragraphs and tables.
Now I am not able to access the images under that heading. To indicate that an image is struck out, we are trying to blur it.
Problem 1: I am able to get the image ID (resource ID) and image name for all the images in the document. But I don't know how to get the resource ID for the images under a specific heading so that I can blur them.
Problem 2: I have enabled the Track Changes option using a VB macro from Python code. But the changes I made using docx (strikeout) are not highlighted for tracking.
These are two separate questions (or three, depending on how you count). I'll address the first one here; you can post the other as a separate new question. (Maybe: "How to use python-docx to track changes in a Word document".)
Regarding blurring the image, you have two challenges:
Identify images associated with a particular area in the document.
Blur the image.
There is no direct API support for either of these operations in python-docx. However, you can use python-docx to access the underlying XML and make the changes using lxml calls (which python-docx uses internally). Such efforts are commonly called "workaround functions", so if you search Google on 'python-docx OR python-pptx workaround function' you will find examples.
An in-line image is stored at the Run level. So you can iterate over all the runs in the section of interest and see if any of them have images. This analysis page from the python-docx project has some of the details you'll need: http://python-docx.readthedocs.io/en/latest/dev/analysis/features/shapes/shapes-inline.html
Basically you'd do something like this:
for run in runs:  # however you decide to get the runs
    r = run._element  # this is the `<w:r>` XML element for the run
    pics = r.xpath('.//w:drawing/wp:inline/a:graphic/a:graphicData/pic:pic')
    if not pics:
        continue  # no picture in this run, move on to the next one
    print(r.xml)  # if you want to see the XML for this run
This will print the XML for run elements containing a picture.
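For the "under a specific heading" part, something along these lines might work. It is only a sketch: it assumes the headings use the built-in "Heading N" styles, it only walks document.paragraphs (so pictures inside tables are missed), and the helper names are mine, not part of python-docx.

```python
def heading_level(paragraph):
    """Return N for a paragraph styled 'Heading N', else None."""
    name = paragraph.style.name
    if name.startswith("Heading "):
        suffix = name[len("Heading "):]
        if suffix.isdigit():
            return int(suffix)
    return None


def paragraphs_under_heading(paragraphs, heading_text):
    """Collect paragraphs after the named heading, up to the next
    heading of the same or a higher level."""
    section = []
    current = None  # level of the heading we are inside, once found
    for p in paragraphs:
        level = heading_level(p)
        if current is None:
            if level is not None and p.text.strip() == heading_text:
                current = level
            continue
        if level is not None and level <= current:
            break
        section.append(p)
    return section


def image_rids(paragraphs):
    """Return the relationship ids (rId...) of in-line pictures."""
    rids = []
    for p in paragraphs:
        for run in p.runs:
            # a:blip/@r:embed holds the rId of the image part
            rids.extend(run._element.xpath('.//pic:pic//a:blip/@r:embed'))
    return rids
```

With a real document you would call image_rids(paragraphs_under_heading(document.paragraphs, "Your heading")). Each rId should then map to its image part through document.part.related_parts[rId], whose blob attribute holds the image bytes.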
Regarding the actual blurring, there are two approaches I can think of:
Replace the current picture with a "blurred" version.
Change the transparency of the image in Word to make it look much lighter. This does not remove detail from the image and the actual image is still "behind", unchanged, if for example the user wanted to right click and pick "Save image...".
The second approach is much easier. You'll have to decide whether it meets your requirements.
Once you decide which way you want to go you can search for solutions to that problem or submit a new question focused on that topic.
I'm currently writing part of a program that takes an entire HTML page and converts it into a Word file. The page contains titles, paragraphs, graphics...
But my HTML page also contains tables (tr, th, td, table tags), which I parse into Word tables. My tables are too wide, so some words get cut into parts. I expected something like this:
|Myentireword|
but it comes out like this (because there are too many cells in one row):
|Myent|
|irewo|
|rd |
It's really ugly, and I see no solution other than turning my table into an image, but I can't find a way to do this. (The plan: use my HTML parser, create a table of strings, render it as an image, save the image, then load it into my Word document. I can do the first and second steps and load an image into my document, but I have no way of creating the image itself.)
I cannot change the output; it IS required to be a Word file. Our customer wants to be able to modify and save it, and many customers are not really experts. So I'm not able to use some PIL technique (also, it doesn't solve my problem).
I already have a working parser and have output for my HTML, so I can't use something like "create a table from html".
I'm not able to use CSS or PDF creation, as I said, so this approach won't work either: "Quality tables in python". But the question is exactly mine in that case!
I cannot use Matplotlib to create the table myself, because I do not know how to print into it, and I need something (if possible) to "merge" cells. If that's not possible, I'll do without. I just need something that works; it's really necessary, please!
I am working on a project where I have a PDF file that describes a health policy. I need to extract the information from this PDF and save it in some form so that I can answer questions about the policy from the extracted data.
The PDF is too big, so I want to divide it into its different sections. That way, when a query about a particular area comes in, I won't have to go through the entire document.
I tried solving this with some PDF converters that convert PDFs into HTML. But these converters don't convert the PDF to HTML properly, so headings don't get heading tags. Also, even if I convert it properly and get the right sections out of the document, I don't know how to store this data (I mean, in which form should I store it?).
Is there any other way to achieve this? I am using Python, and I can use NLTK if needed. Also, the format of the PDFs is not fixed; my code should work on any kind of PDF.
PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.
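As a rough illustration of using that layout information, the sketch below reads each text line together with its largest font size via pdfminer.six, then treats any line at or above a size threshold as a section heading. The threshold rule and the function names are my assumptions; a given PDF may need a different signal (bold font name, position on the page, numbering).

```python
def iter_lines(pdf_path):
    """Yield (text, max_font_size) for every text line in the PDF."""
    from pdfminer.high_level import extract_pages  # third-party, lazy import
    from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

    for page in extract_pages(pdf_path):
        for box in page:
            if not isinstance(box, LTTextContainer):
                continue
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if text and sizes:
                    yield text, max(sizes)


def split_sections(lines, heading_min_size):
    """Group body lines under the most recent heading-sized line."""
    sections = {}
    current = "(preamble)"  # text that appears before the first heading
    for text, size in lines:
        if size >= heading_min_size:
            current = text
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(text)
    return sections
```

Something like split_sections(iter_lines("policy.pdf"), heading_min_size=14) gives a dict keyed by heading text. That can be dumped to JSON or loaded into a database, so a query only needs to touch the relevant section.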