Python, text mining, docx to table(CSV) - python

So I'm pretty new to Python and am probably asking an easy question. I'm looking for a way to extract chapter names, section names and text from a docx file and transfer them to a table in which the first row holds the chapter name, the second row the section name, and the third row the text of the chapter. At some point I also want to start a new line for each new paragraph of the text. I had the following steps in mind, but I sincerely doubt whether this is the right way to go:
1. Open Word document
1.a. Read Word document
1.b. Define headings, subheadings, footnotes & headers
2. Create new file
2.a. Create table with 9 rows
3. Name each row
4. Fill in header with predefined text: Legal document
"Part Title Chapter Section Subsection Article number Article text Article title Reference"
5. Define rankings of the categories
5.a. Give rankings to the table, row 1 contains document name
5.b. Row 2 contains chapter name, row 3 section name etc.
6. Read Word document from start to the first defined ranking
6.a. Copy the text of the defined ranking
6.b. Append to file the copied text into the correct row
I've looked into docx and xlml but I wonder whether it will give me the result I'm looking for.

You'll need the docx module plus either the csv or the openpyxl module. You'll also need to put in some effort: figure out how to differentiate between the things you want to store in the CSV, then put that detection and storage into a loop that stops when there is nothing more to process. That's about as much advice as you'll get for this type of question.
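A minimal sketch of that loop, assuming chapters and sections are marked with Word's built-in "Heading 1" and "Heading 2" styles (an assumption about the document, not something the question states). The helper names rows_from_paragraphs, read_docx_paragraphs and write_csv are made up for illustration; only read_docx_paragraphs needs python-docx.

```python
import csv


def rows_from_paragraphs(paragraphs, chapter_style="Heading 1",
                         section_style="Heading 2"):
    """Turn (style_name, text) pairs into (chapter, section, text) rows.

    One row is produced per body paragraph, tagged with the most recent
    chapter and section headings seen before it.
    """
    chapter, section = "", ""
    rows = []
    for style_name, text in paragraphs:
        text = text.strip()
        if not text:
            continue  # skip empty paragraphs
        if style_name == chapter_style:
            chapter, section = text, ""  # a new chapter resets the section
        elif style_name == section_style:
            section = text
        else:
            rows.append((chapter, section, text))
    return rows


def read_docx_paragraphs(path):
    """Yield (style_name, text) for each body paragraph (needs python-docx)."""
    from docx import Document  # imported here so the rest runs without it
    for paragraph in Document(path).paragraphs:
        yield paragraph.style.name, paragraph.text


def write_csv(rows, path):
    """Write the rows to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Chapter", "Section", "Text"])
        writer.writerows(rows)
```

With hypothetical file names, the pieces would be chained as write_csv(rows_from_paragraphs(read_docx_paragraphs("law.docx")), "law.csv"). Each paragraph gets its own row, which also covers the "new line for each new paragraph" wish; documents that mark headings some other way would need a different test than the style name.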

Related

How to get all the text in a nested table using python?

I have to extract all the text from a nested table (tables inside a table inside a table) in a Word document. I'm unable to do it using python-docx, maybe because of my lack of knowledge.
Please suggest some code examples.
You will want some sort of recursion. The basic idea is:
    def iter_paragraphs_of_tables(tables):
        for table in tables:
            for row in table.rows:
                for cell in row.cells:
                    yield from cell.paragraphs
                    yield from iter_paragraphs_of_tables(cell.tables)

    for paragraph in iter_paragraphs_of_tables(document.tables):
        print(paragraph.text)
This is Python 3; if you're on Python 2 you'll need to expand the yield from statements into, for example:
    yield from cell.paragraphs
    # --- becomes ---
    for paragraph in cell.paragraphs:
        yield paragraph
python-docx seems more like a library for writing and modifying docx files, so you may want to try PyPDF2 (https://pythonhosted.org/PyPDF2/); note that it reads PDF files, so the document would have to be saved as a PDF first. I don't really understand the table-inside-a-table part, though; I guess the table is nested in the Word document? If that's the case, just read the file with PyPDF2 and put the words you want to keep in a table. I wish you the best time reading.

How to wrap cell text in tables via docx library or xml?

I have been using the python-docx library and oxml to automate some changes to the tables in my Word document. Unfortunately, no matter what I do, I cannot wrap the text in the table cells.
I managed to manipulate the 'autofit' and 'fit-text' properties of my table, but none of them contributes to wrapping the text in the cells. I can see that there is a "w:noWrap" element in the XML of my Word document, and no matter what I do I cannot manipulate or remove it. I believe it is responsible for the word wrapping in my table.
For example, in this case I am adding a table. I can fit the text in a cell and set autofit to 'true', but I cannot for the life of me wrap the text:
from docx import Document
from docx.oxml import OxmlElement
from docx.oxml.ns import qn
doc = Document()
table = doc.add_table(5,5)
table.autofit = True # Does Autofit but not wrapping
tc = table.cell(0,0)._tc # As a test, fit text to cell 0,0
tcPr = tc.get_or_add_tcPr()
tcFitText = OxmlElement('w:tcFitText')
tcFitText.set(qn('w:val'),"true")
tcPr.append(tcFitText) #Does fitting but no wrapping
doc.save('demo.docx')
I would appreciate any help or hints.
The <w:noWrap> element appears to be a child of <w:tcPr>, the element that controls table cell properties.
You should be able to access it from the table cell element using XPath:
tc = table.cell(0, 0)._tc
noWraps = tc.xpath(".//w:noWrap")
The noWraps variable here will then be a list containing zero or more <w:noWrap> elements, in your case probably one.
Deleting it is probably the simplest approach, which you can accomplish like this:
    if noWraps:  # ---skip following code if list is empty---
        noWrap = noWraps[0]
        noWrap.getparent().remove(noWrap)
You can also take the approach of setting the value of the w:val attribute of the w:noWrap element, but then you have to get into specifying the Clark name of the attribute namespace, which adds some extra fuss and doesn't really produce a different outcome unless for some reason you want to keep that element around.
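To make the "Clark name" remark concrete, here is a small sketch of the attribute-setting approach. It uses the standard library's xml.etree.ElementTree on a hand-written cell fragment rather than python-docx, so the helper names clark and disable_no_wrap and the sample XML are illustrative assumptions. A Clark name is simply the attribute name written as {namespace-uri}local-name; in python-docx, the qn('w:val') helper imported in the question builds the same string.

```python
import xml.etree.ElementTree as ET

# The WordprocessingML namespace that the "w:" prefix stands for.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def clark(local_name):
    """Return the Clark-notation name '{namespace}local' for a w: name."""
    return "{%s}%s" % (W_NS, local_name)


def disable_no_wrap(tc):
    """Set w:val="false" on every w:noWrap under a w:tc; return the count."""
    changed = 0
    for no_wrap in tc.iter(clark("noWrap")):
        no_wrap.set(clark("val"), "false")
        changed += 1
    return changed


# A hand-written cell fragment standing in for real Word output.
CELL_XML = (
    '<w:tc xmlns:w="%s"><w:tcPr><w:noWrap/></w:tcPr>'
    '<w:p><w:r><w:t>long text</w:t></w:r></w:p></w:tc>' % W_NS
)

tc = ET.fromstring(CELL_XML)
count = disable_no_wrap(tc)
no_wrap = tc.find(".//" + clark("noWrap"))
print(count, no_wrap.get(clark("val")))
```

The element stays in the cell properties but is switched off, which is the "keep that element around" case; deleting it remains the shorter route.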

Python docx paragraph in textbox

Is there any way to access and manipulate text in an existing docx document in a textbox with python-docx?
I tried to find a keyword in all paragraphs in a document by iteration:
    from docx import Document

    doc = Document('test.docx')
    for paragraph in doc.paragraphs:
        if '<DATE>' in paragraph.text:
            print('found date: ', paragraph.text)
It is found if placed in normal text, but not inside a textbox.
A workaround for textboxes that contain only formatted text is to use a floating, formatted table. It can be styled almost like a textbox (frames, colours, etc.) and is easily accessible by the docx API.
    doc = Document('test.docx')
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                for paragraph in cell.paragraphs:
                    if '<DATE>' in paragraph.text:
                        print('found date: ', paragraph.text)
Not via the API, not yet at least. You'd have to uncover the XML structure it lives in and go down to the lxml level and perhaps XPath to find it. Something like this might be a start:
body = doc.element.body  # the underlying lxml <w:body> element, which supports xpath()
# assuming differentiating container element is w:textBox
text_box_p_elements = body.xpath('.//w:textBox//w:p')
I have no idea whether textBox is the actual element name here, you'd have to sort that out with the rest of the XPath path details, but this approach will likely work. I use similar approaches frequently to work around features that aren't built into the API yet.
opc-diag is a useful tool for inspecting the XML. The basic approach is to create a minimally small .docx file containing the type of thing you're trying to locate. Then use opc-diag to inspect the XML Word generates when you save the file:
$ opc browse test.docx document.xml
http://opc-diag.readthedocs.org/en/latest/index.html
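For reference, a sketch of the same idea using only the standard library. It assumes the text box content sits inside w:txbxContent elements, which is worth confirming against your own file with opc-diag as described above. The function names and the sample XML are made up for illustration, and the sample is a simplified, hand-written fragment rather than real Word output.

```python
import zipfile
import xml.etree.ElementTree as ET

# The WordprocessingML namespace that the "w:" prefix stands for.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{%s}" % W_NS


def textbox_paragraph_texts(document_xml):
    """Return the text of each w:p found inside a w:txbxContent element."""
    root = ET.fromstring(document_xml)
    texts = []
    for box in root.iter(W + "txbxContent"):
        for paragraph in box.iter(W + "p"):
            runs = [t.text or "" for t in paragraph.iter(W + "t")]
            texts.append("".join(runs))
    return texts


def read_document_xml(docx_path):
    """Read word/document.xml out of a .docx file, which is a zip archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


# Simplified stand-in for document.xml: one normal paragraph, one text box.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>outside</w:t></w:r></w:p>'
    '<w:p><w:r><w:pict><w:txbxContent>'
    '<w:p><w:r><w:t>Date: </w:t></w:r><w:r><w:t>&lt;DATE&gt;</w:t></w:r></w:p>'
    '</w:txbxContent></w:pict></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)

print(textbox_paragraph_texts(SAMPLE))
```

On a real file you would call textbox_paragraph_texts(read_document_xml('test.docx')). This only reads the text; Word may also store a text box twice (a drawing version plus a fallback), so duplicate hits are possible.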

How to automatically google search each row in a text file

I have a set of names in a text file. My goal is to search for each name (each row in the text file) on Google and record the very first link that appears in the search results. Is there any way to run this process automatically with a script? Otherwise I have to type 1000 names into Google one by one and list the first link :(
Is there a way? Yes. Is there a super quick and easy way? I'm not too sure about that.
What I would do given the task is use BeautifulSoup4 for web scraping. You could easily iterate through each line in your text file with a loop and convert each phrase into a Google-URL-friendly form.
Example: take the phrase "This is a test sentence", replace the spaces with "+" and then add it to the end of a default Google search URL. Like this:
https://www.google.com/?gws_rd=ssl#q=This+is+a+test+sentence
After that, find the id or class of the first link on the Google results page and code your program to return that information.
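The URL-building step can be sketched as follows. This covers only the "one search URL per line" part; fetching and parsing the results is left out, and automated queries may well be blocked or rate-limited. The sketch uses the /search?q= form as an assumption, because the part of a URL after "#" is not sent to the server by an HTTP client. The function name search_urls is made up for illustration.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.google.com/search"


def search_urls(lines):
    """Yield (name, url) for each non-empty line of text."""
    for line in lines:
        name = line.strip()
        if name:
            # urlencode escapes special characters and turns spaces into "+"
            yield name, SEARCH_URL + "?" + urlencode({"q": name})


for name, url in search_urls(["This is a test sentence\n", "\n", "Ada Lovelace\n"]):
    print(url)
```

With a real file, the list would be replaced by the open file object, for example search_urls(open("names.txt")) for a hypothetical names.txt.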

Reading text values in a PowerPoint table using pptx?

Using the pptx module,
tbl.cell(3,3).text
is a writable, but not a readable attribute. Is there a way to just read the text in a PowerPoint table? I'm trying to avoid COM and pptx is a great module, but lacks this particular feature.
Thanks!
At present, you'll need to go a level deeper to get text out of a cell using python-pptx. The cell.text 'getter' is on the roadmap.
Something like this should get it done:
    cell = tbl.cell(3, 3)
    # text_frame was spelled textframe in early python-pptx releases
    paragraphs = cell.text_frame.paragraphs
    for paragraph in paragraphs:
        for run in paragraph.runs:
            print(run.text)
Let me know if you need more to go on.
