Removing Paragraph From Cell In Python-Docx

I am attempting to create a table with a two-row header that uses a simple template format for all of the styling. The two-row header is required because I have headers that are the same under two primary categories. It appears that the only way to handle this within Word, so that the document formats and flows with a repeating header across pages, is to nest a two-row table in the header row of a main content table.
In python-docx, a table cell is always created with a single empty paragraph element. For my use case I need to remove this empty paragraph element entirely, not simply clear it with an empty string. Otherwise I get a line break above my nested table that ruins the illusion of a single table.
So the question is: how do you remove the empty paragraph?
If you know of a better way to handle the two-row header implementation, that would also be appreciated.

While Paragraph.delete() is not implemented yet in python-docx, there is a workaround function documented here: https://github.com/python-openxml/python-docx/issues/33#issuecomment-77661907
Note that a table cell must always end with a paragraph, so you'll need to add an empty one after your table; otherwise I believe you'll get a so-called "repair-step" error when you try to load the document.
It's probably worth a try without the extra paragraph just to confirm; I expect it would look better without it, but the last time I tried that I got the error.

As @scanny said before, you can delete a paragraph by passing it to a self-defined delete function.
I just want to add a supplement, in case you want to delete multiple paragraphs.
import docx
def delete_paragraph(paragraph):
    # remove the underlying <w:p> element from its parent
    p = paragraph._element
    p.getparent().remove(p)
    paragraph._p = paragraph._element = None
def remove_multiple_para(doc):
    i = 0
    while i < len(doc.paragraphs):
        if 'xxxx' in doc.paragraphs[i].text:
            # delete the related 4 lines (i-1 .. i+2), last one first,
            # clamped so the indexes stay inside the document
            last = min(i + 2, len(doc.paragraphs) - 1)
            for j in range(last, max(i - 2, -1), -1):
                delete_paragraph(doc.paragraphs[j])
        i += 1
    doc.save('outputDoc.docx')
doc = docx.Document('./inputDoc.docx')
remove_multiple_para(doc)

Related

How to parse a complex text file using Python string methods or regex and export into tabular form

As the title mentions, my issue is that I don't quite understand how to extract the data I need for my table (the columns I need are Date, Time, Courtroom, File Number, Defendant Name, Attorney, Bond, Charge, etc.).
I think regex is what I need, but my class did not go over this, so I am confused about how to parse the file in order to extract and output the correct data into an organized table.
I am supposed to turn my text file from this
https://pastebin.com/ZM8EPu0p
and export it into a more readable format like this- example output is below
Here is what I have so far.
def readFile(court):
    csv_rows = []
    # read and split txt file into pages & chunks of data by paragraph
    with open(court, "r") as file:
        data_chunks = file.read().split("\n\n")
        for chunk in data_chunks:
            chunk = chunk.strip  # .strip removes useless spaces
            if str(data_chunks[:4]).isnumeric():  # if first 4 characters are digits
                entry = None  # initialize an empty dictionary
            elif (
                str(data_chunks).isspace() and entry
            ):  # if we're on an empty line and the entry dict is not empty
                csv_rows.DictWriter(dialect="excel")  # turn csv_rows into needed output
                entry = {}
            else:
                # parse here?
                print(data_chunks)
    return csv_rows
readFile("/Users/mia/Desktop/School/programming/court.txt")

It is quite a lot of work to achieve that, but it is possible if you split it into a couple of sub-tasks.
First, your input looks like a text file, so you could parse it line by line, using readlines: https://www.w3schools.com/python/ref_file_readlines.asp
Then, I noticed that your data can be split into pages. You would need to prepare a lot of regular expressions, but you can start with one that identifies where each page starts. You may want to read this, as your expression might get quite complicated: https://www.w3schools.com/python/python_regex.asp
The goal of this step is to collect all lines from a page in some container (a list, a dict, whatever you find suitable).
Afterwards, write some code that parses the information page by page. For simplicity I suggest starting with something easy, like the columns for "no, file number and defendant".
Once you are getting data reliably, you can address the export part using pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
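The page-splitting step could be sketched roughly as follows. This is only an illustration: the PAGE_START pattern and the sample lines are invented here, since the real layout of the court file is not shown, so the regular expression would need to be adapted to whatever actually marks a new page.

```python
import re

# Hypothetical page marker; the real file's page header may look different.
PAGE_START = re.compile(r"^\s*Page\s+\d+", re.IGNORECASE)

def split_into_pages(lines):
    """Group lines into pages; a new page starts at each PAGE_START match."""
    pages = []
    current = []
    for line in lines:
        # Close the current page when the next page header shows up.
        if PAGE_START.match(line) and current:
            pages.append(current)
            current = []
        current.append(line.rstrip("\n"))
    if current:
        pages.append(current)
    return pages

# Invented sample input, standing in for the result of file.readlines().
sample = [
    "Page 1\n",
    "1 21CR000001 DOE,JOHN\n",
    "Page 2\n",
    "2 21CR000002 ROE,JANE\n",
]
pages = split_into_pages(sample)
```

Each entry in pages is then a plain list of lines that can be parsed on its own, one column at a time.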

MS Word document structure and COM calls and Python

I am using comtypes to call functions and create an MS Word document. This being my first time writing such a program, there is something I don't understand. What I want to do is:
Create sections in the document and call them A, B, ...
In each section create paragraphs that contain text. For section A call the paragraphs a1,a2,a3,...
Add formatting to each paragraph in each section; the formatting may be different for each paragraph
Below are some code fragments in VBA. VBA is used since the translation to comtypes is almost direct and there are more examples on the net for VBA.
Set myRange = ActiveDocument.Range(Start:= ...., End:= ...) 'start and end can be anything
ActiveDocument.Sections.Add Range:=myRange 'Section A
Set newRange = ActiveDocument.Range(Start:= ...., End:= ...) 'start and end can be anything
newRange.Paragraphs.Add
I get stuck selecting paragraph a1 and setting its text. What I am missing is a function that gets the collection of paragraphs in section A.

The following VBA, based on the code in the question, illustrates getting a Document object, adding a Section, getting the Paragraphs of that Section, getting the Paragraphs of any given Section in a document, and getting the first or any Paragraph from a Paragraphs collection.
Set doc = ActiveDocument 'More efficient if the Document object will be used more than once
Set section1 = doc.Sections.Add(Range:=myRange) 'Section A | Type Word.Section
Set section1Paras = section1.Paragraphs 'Type Word.Paragraphs
'OR
Set sectionParas = doc.Sections(n).Paragraphs 'where n = section index number
Set para = sectionParas.First 'OR =sectionParas(n) where n = paragraph index number

Possible to insert row at specific position with python-docx?

I'd like to insert a couple of rows in the middle of a table using python-docx. Is there any way to do it? I tried an approach similar to the one for inserting pictures, but it didn't work.
If not, I'd appreciate any hint on which module is a better fit for this task. Thanks.
Here is my attempt to mimic the idea for inserting a picture. It's WRONG. 'Run' object has no attribute 'add_row'.
from docx import Document
doc = Document('your docx file')
tables = doc.tables
p = tables[1].rows[4].cells[0].add_paragraph()
r = p.add_run()
r.add_row()
doc.save('test.docx')

The short answer is No. There's no Table.insert_row() method in the API.
A possible approach is to write a so-called "workaround function" that manipulates the underlying XML directly. You can get to any given XML element (e.g. <w:tbl> in this case or perhaps <w:tr>) from its python-docx proxy object. For example:
tbl = table._tbl
This gives you a starting point in the XML hierarchy. From there you can create a new element from scratch or by copying, and use lxml._Element API calls to place it in the right position in the XML.
It's a little bit of an advanced approach, but probably the simplest option. There are no other Python packages out there that provide a more extensive API as far as I know. The alternative would be to do something in Windows with their COM API or whatever from VBA, possibly IronPython. That would only work at small scale (desktop, not server) running Windows OS.
A search on python-docx workaround function and python-pptx workaround function will find you some examples.

You can add a row at the end of the table and then move it to another position as follows:
from docx import Document
doc = Document('your docx file')
t = doc.tables[0]
row0 = t.rows[0] # for example
row1 = t.rows[-1]
row0._tr.addnext(row1._tr)

Though there isn't a directly usable API for this according to the python-docx documentation, there is a simple solution that doesn't require importing any other libraries such as lxml yourself: just use the underlying element classes provided by python-docx, such as CT_Tbl and CT_Row.
These classes have common methods like addnext and addprevious, which conveniently add an element as a sibling right after or before the current element.
So the problem can be solved as below (tested on python-docx v0.8.10):
from docx import Document
doc = Document('your docx file')
tables = doc.tables
row = tables[1].rows[4]
tr = row._tr  # this is a CT_Row element
for new_tr in build_rows():  # build_rows should return a list/iterator of CT_Row instances
    tr.addnext(new_tr)
doc.save('test.docx')
This should solve the problem.

You can add a row in the last position this way:
from win32com import client
word = client.Dispatch('Word.Application')
doc = word.Documents.Open(r'yourFile.docx')
table = doc.Tables(1)  # index of the table you want to manipulate
table.Rows.Add()

addnext() from lxml.etree seems to be the better option and it works fine. The only problem is that I cannot set the height of the row, so please share an answer if you know how.
import copy
from docx.enum.table import WD_ROW_HEIGHT_RULE
current_row = table.rows[row_index]
current_row.height_rule = WD_ROW_HEIGHT_RULE.AUTO
tbl = table._tbl
# copy the row's XML, borders included, and insert the copy right after it
new_tr = copy.deepcopy(current_row._tr)
current_row._tr.addnext(new_tr)

I created a video here to demonstrate how to do this because it threw me for a loop the first time.
https://www.youtube.com/watch?v=nhReq_0qqVM
from docx import Document
document = Document("MyDocument.docx")
table = document.tables[0]
table.add_row()
for cell in table.rows[-1].cells:
    cell.text = "test text"
insertion_row = table.rows[4]._tr
insertion_row.addnext(table.rows[-1]._tr)
document.save("MyDocument.docx")
The python-docx module doesn't have a method for this, so the best workaround I've found is to create a new row at the bottom of the table and then use methods of the XML elements to move it to the position it is supposed to be in.
This creates a new row with every cell having the value "test text", and then places that row directly below insertion_row.

Parse a file in Python to first find a string, then parse the following strings until it finds another string

I am trying to scroll through a result file that one of our processes prints.
The objective is to look through various blocks and find a specific parameter. I tried to tackle this but can't find an efficient way that avoids parsing the file multiple times.
This is an example of the output file that I read:
ID:13123
Compound:xyz
... various parameters
RhPhase:abc
ID:543
Compound:lbm
... various parameters
ID:232355
Compound:dfs
... various parameters
RhPhase:cvb
I am looking for a specific ID that has a RhPhase in it, but since the file contains many more entries, I just want that specific ID. It may or may not have a RhPhase in it; if it has one, I get the value.
The only way I have figured out is to go through the whole file (which may be hundreds of blocks, to give an idea of the size) and build a dictionary of every ID that has a RhPhase; then, in a second step, I look through the dictionary and retrieve the value for a specific ID.
This feels so inefficient. I tried to do something different, but got stuck on how to keep track of the lines while scrolling through them, so that I can tell Python: read each line -> when you find the ID that I want, continue to read -> if you find RhPhase, get the value; otherwise stop at the next ID.
I am stuck here:
datafile=open("datafile.txt", "r")
for items in datafile.readline():
    if "ID:543" in items:
        [read more lines]
        [if "RhPhase" in lines:]
        [    rhphase=lines ]
        [elif "ID:" in lines ]
        [    rhphase=None ]
        [    break ]
Once I find the ID, I don't know how to continue to either look for the RhPhase string or find the next ID: string and stop everything (because this means that the ID does not have an associated RhPhase).
This would pass through the file once and just check for the specific ID, instead of parsing the whole thing once and then doing a second pass.
Is it possible to do so, or am I stuck with the double parsing?

Usually, you solve this kind of thing with a simple state machine: you read the lines until you find your ID; then you put your reader into a special state that checks for the parameter you want to extract. In your case, you only have two states, ID not found and ID found, so a simple boolean is enough:
foundId = False
with open('datafile.txt', 'r') as datafile:
    for line in datafile:
        if foundId:
            if line.startswith('RhPhase'):
                print('Found RhPhase for ID 543:')
                print(line)
                # end reading the file
                break
            elif line.startswith('ID:'):
                print('Error: Found another ID without finding RhPhase first')
                break
        # if we haven’t found the ID yet, keep looking for it
        elif line.startswith('ID:543'):
            foundId = True
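The same two-state idea can also be wrapped in a function that returns the value instead of printing it. The following is only a sketch: the name find_rhphase is made up for illustration, and it assumes every block starts with an "ID:" line and that the value follows "RhPhase:" on the same line, as in the sample from the question. Unlike a startswith check on 'ID:543', it compares the whole ID, so "543" does not also match "5431".

```python
import io

def find_rhphase(lines, target_id):
    """Return the RhPhase value of the block for target_id, or None."""
    in_block = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("ID:"):
            if in_block:
                # Reached the next block without seeing RhPhase.
                return None
            # Compare the full ID so that "543" does not match "5431".
            in_block = line[len("ID:"):].strip() == target_id
        elif in_block and line.startswith("RhPhase:"):
            return line[len("RhPhase:"):].strip()
    return None

# Sample data taken from the question, minus the elided parameters.
SAMPLE = (
    "ID:13123\nCompound:xyz\nRhPhase:abc\n"
    "ID:543\nCompound:lbm\n"
    "ID:232355\nCompound:dfs\nRhPhase:cvb\n"
)
value = find_rhphase(io.StringIO(SAMPLE), "232355")
```

Because the function returns as soon as it has an answer, the file is read at most once and never past the block of interest.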

Iterate over sections in a config file

I recently got introduced to the library configparser. I would like to be able to check if each section has at least one Boolean value set to 1. For example:
[Horizontal_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 0
The above would cause an error.
[Vertical_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 1
The above would pass. Below is some pseudo code of what I had in mind:
exit_test = False
for section in config_file:
    section_check = False
    for name in parser.options(section):
        if parser.getboolean(section, name):
            section_check = True
    if not section_check:
        print("ERROR:Please specify a setting in {} section of the config file".format(section))
        exit_test = True
if exit_test:
    exit(1)
Questions:
1) How do I perform the first for loop and iterate over the sections of the config file?
2) Is this a good way of doing this or is there a better way? (If there isn't please answer question one.)

Using ConfigParser, you first have to parse your config.
After parsing, you can get all sections using the .sections() method.
You can iterate over each section and use .items() to get all key/value pairs of each section.
for each_section in conf.sections():
    for (each_key, each_val) in conf.items(each_section):
        print(each_key)
        print(each_val)
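Putting this together with the check from the question, a minimal sketch could look like the following. It assumes every option in every section is meant to be a boolean (getboolean raises ValueError otherwise); the function name sections_without_true_option is invented for this example, and the config text is the one from the question.

```python
import configparser

CONFIG_TEXT = """
[Horizontal_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 0

[Vertical_Random_Readout_Size]
Small_Readout = 0
Medium_Readout = 0
Large_Readout = 1
"""

def sections_without_true_option(parser):
    """Return the names of sections in which no option is a true boolean."""
    failing = []
    for section in parser.sections():
        names = parser.options(section)
        if not any(parser.getboolean(section, name) for name in names):
            failing.append(section)
    return failing

parser = configparser.ConfigParser()
parser.read_string(CONFIG_TEXT)
failing = sections_without_true_option(parser)
for section in failing:
    print("ERROR: Please specify a setting in the {} section".format(section))
```

A script could then call exit(1) when failing is not empty, as in the pseudo code from the question.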

Best bet is to load ALL the lines in the file into some kind of array (I'm going to ignore the issue of how much memory that might use and whether to page through it instead).
Then from there you know that lines denoting headings follow a certain format, so you can iterate over your array to create an array of objects containing the heading name; the line index (zero based reference to master array) and whether that heading has a value set.
From there you can iterate over these objects in cross-reference to the master array, and for each heading check the next "n" lines (in the master array) between the current heading and the next.
At this point you're down to the individual config values for that heading, so you should easily be able to parse the line and detect a value, whereupon you can break from the loop if true or, for more robustness, run an exclusivity check on that heading's values to ensure ONLY one value is set.
Using this approach you have access to all the lines, with one object per heading, so your code remains flexible and functional. Optimise afterwards.
Hope that makes sense and is helpful.
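For illustration, the manual approach described above might be sketched like this. The helper name headings_missing_value is invented, and treating only the literal "1" as a set value is an assumption based on the sample config in the question.

```python
def headings_missing_value(lines):
    """Return the names of headings whose block has no option set to 1."""
    # Pass 1: record each heading with its zero-based index in the master list.
    headings = [
        {"name": line.strip()[1:-1], "index": i, "has_value": False}
        for i, line in enumerate(lines)
        if line.strip().startswith("[") and line.strip().endswith("]")
    ]
    # Pass 2: look only at the lines between a heading and the next one.
    for pos, heading in enumerate(headings):
        if pos + 1 < len(headings):
            end = headings[pos + 1]["index"]
        else:
            end = len(lines)
        for line in lines[heading["index"] + 1:end]:
            _key, sep, value = line.partition("=")
            if sep and value.strip() == "1":
                heading["has_value"] = True
                break
    return [h["name"] for h in headings if not h["has_value"]]

# The two sections from the question, as they would come out of readlines().
lines = [
    "[Horizontal_Random_Readout_Size]\n",
    "Small_Readout = 0\n",
    "Medium_Readout = 0\n",
    "Large_Readout = 0\n",
    "[Vertical_Random_Readout_Size]\n",
    "Small_Readout = 0\n",
    "Medium_Readout = 0\n",
    "Large_Readout = 1\n",
]
missing = headings_missing_value(lines)
```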

To complete the answer by @Nilesh and the comment from @PashMic, here is an example that really iterates over ALL sections, including DEFAULT:
all_section_names: list[str] = conf.sections()
all_section_names.append("DEFAULT")
for section_name in all_section_names:
    for key, value in conf.items(section_name):
        ...
Note that even if there is no real "DEFAULT" section, this will still work. There will just be no items retrieved by conf.items("DEFAULT").
