I need to deal with tables in many word files. Some of them are created in word table format, which can be read using python-docx.
However, some of them are inserted from excel. I don't know why python-docx cannot read them. Here is piece of code I wrote for test. As you can see in the terminal, there is nothings in the list variable 'tables'.
import docx
from docx import Document
docFile = 'a.docx'
document = Document(docFile)
tables = document.tables
print(tables)
Anyone can help? Thanks a lot!
I'm fighting the same issue using Pages on OSX to create a .docx template. I've found that Format > Arrange > Object Placement needs to be set to Move with text for the table, changing it to have any alignment or formatting causes the tables to disappear in python and be read as paragraphs that contain nothing. Looking at the XML of both and the python-docx code I'm suspicious of w:tblInd but I'm not clued up enough to go much further. I see recent GitHub issues covering this so hopefully will get sorted.
example on OSX:
Related
I'm currently struggling with the following point :
I work on a project that needed me to generate Documents (docx) for autodocumentation purposes, they contain text/images/formatting. These documents are singulars, it means that each part of my documentation (there are almost 15 parts) has its docx.
I have a global documentation template (docx) that expects to be filled by these singular documents (for some technical reason, I just can't directly generate a complete document) at specific locations
**Do you know a way to copy-paste these singular documents in the global documentation template ?
Thank you !
I tried using QuickParts from Word + MailMerge package but it seems that I can only fill these QuickParts with text.
I found ways to copy-paste a docx to another in a raw manner, I didn't find a way to copy-paste in a specific spot.
Im using Camelot to extract table information from a PDF that i have converted from scanned to searchable using ocrmypdf(500dpi).
Camelot seems to be able to identify the table and extract most of the data within the table but it seems to be unable to extract the bottom half. In essence, it sees the top half of the table but seems to be unable to separate the text from the lower half.
This is the table from the PDF in question:
But when i use the visual debugging method of Camelot where i ask it to show me the words it will extract it seems to recognize the bottom section of the table as one giant block
Any guidance you can provide on improving Camelots "vision" here would be helpful.
Apart from the block, the horizontal lines are also marked as text, which is odd.
Camelot uses pdfminer.six for text extraction and you can pass LAParams (page 16) to camelot.read_pdf() to tweak that.
You should also check out camelot.plot(table, type="grid") to see if the lines are recognized correctly. If not, that might be where the problem lies.
I'm pretty much stuck right now.
I wrote a parser in python3 using the python-docx library to extract all tables found in an existing .docx and store it in a python datastructure.
So far so good. Works as it should.
Now I have the problem that there are hyperlinks in these tables which I definitely need! Due to the structure (xml underneath) the docx library doesn't catch these. Neither the url nor the display text provided. I found many people having similar concerns about this, but most didn't seem to have 'just that' dilemma.
I thought about unpacking the .docx and scan the _ref document for the corresponding 'rid' and fill the actual data I have with the links found in the _ref xml.
Either way it seems seriously weary to do it that way, so I was wondering if there is a more pythonic way to do it or if somebody got good advise how to tackle this problem?
You can extract the links by parsing xml of docx file.
You can extract all text from the document by using document.element.getiterator()
Iterate all the tags of xml and extract its text. You will get all the missing data which python-docx failed to extract.
I am trying to take my data and put it in tables in either microsoft words or libreoffice writer.
I need to be able to change the background of cells within the table and I need to be able to change the page property to 'landscape'.
I have been looking for a library with simple code ( I am a beginner in coding ) but I did not find one for what I need to do.
Have you heard of anything for me ? If there are example on how to use it that would make it easier for me to learn it.
Check out this project
And here is a great quick-start guide
It's pretty simple to use, i haven't tested this, but it should work:
from docx import Document
document = Document()
r = 2 # Number of rows you want
c = 2 # Number of collumns you want
table = document.add_table(rows=r, cols=c)
table.style = 'LightShading-Accent1' # set your style, look at the help documentation for more help
for y in range(r):
for x in range(c):
cell.text = 'text goes here'
document.save('demo.docx') # Save document
It don't think you can set the page orientation property with this library, but what you could do is create a blank word document that is in landscape yourself, store it in the working directory and make a copy of it every time you generate this document.
The previous answer is a good one, but there is another way: create the document in Word then hack the xml in Python to insert the content you want. I have done this several times. In fact, my current invoicing program works this way.
Disadvantages: Conditional formatting and numbered lists will require some real xml knowledge.
Advantages: No limitations or intricate style definitions to manage. EVERYTHING is supported, because it's all done in Word.
Here's the workflow:
create a *.docx document with marker text where you want your headings, table cells, etc.
Use lxml to find those markers and copy their parent elements (along with their formatting).
Use those found elements to create templates
Insert your data into those templates, and assemble the whole thing like a jigsaw puzzle.
Zip your xml into a new *.docx file.
I have a short sample project showing the procedure. Docx2Python does most of the work for you.
https://github.com/ShayHill/replace_docx_tables
I am working on a project where I have a pdf file which describes one of the health policy. What I need to do is extract the information from this PDF and try to save it in some form such that I can answer the questions related to the policy by extracting info from this PDf.
This PDF is too big, so I want to divide the PDF according to the different sections so that when a query related to some particular area comes in then I wont have to go through the entire document.
I tried solving this using some pdf converters which converts the PDFs into the HTMLs. But these converters wont convert the PDF to HTML properly so that headings will have heading tag. Also even if I convert this properly and get the proper sections out of the document, I am not getting how to store this data.(I mean in which form should I store this Data).
Is there any other solution with which I can achieve this. I am using Python and also I can use NLTK if needed. Also the format is not fixed for the PDfs, I mean to say my code should work on any kind of PDFs.
PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.