Iterating through cell sentences/paragraphs - docx tables

I want to iterate through the sentences/paragraphs within the cells of a docx table and perform different functions depending on their style tags, using the pywin32 module.
I can manually select a cell using
cell = table.Cell(Row = 1, Column = 2)
I tried something like for x in cell: # do something, but it fails because
<class 'win32com.client.CDispatch'> objects 'do not support enumeration'
I looked through the Word OM (object model) reference for a solution, but to no avail (I understand it is written for VBA, but it can still be very useful).

Here is a simple example that reads the content of the first row / first column of the first table in a document and prints it word by word:
import win32com.client as win32
import os
wordApp = win32.gencache.EnsureDispatch("Word.Application")
wordApp.Visible = False
doc = wordApp.Documents.Open(os.getcwd() + "\\Test.docx")
table = doc.Tables(1)
# Range.Text is a plain Python string, so split() yields the words
for word in table.Cell(Row = 1, Column = 1).Range.Text.split():
    print(word)
wordApp.Application.Quit(-1)
The cell's content is just a string, so you could just as easily split it into paragraphs using split('\r') or into sentences using split('.').
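The splitting described above can be sketched without Word at all, since it only operates on the string. This is a minimal sketch, assuming the text has already been read from Range.Text; the helper names and the sample string are hypothetical, and the sentence split is deliberately naive (it will mis-split abbreviations such as "e.g.").

```python
import re

# Word terminates a cell's Range.Text with an end-of-cell marker:
# a carriage return followed by the BEL character ('\r\x07').
END_OF_CELL = "\r\x07"


def cell_paragraphs(cell_text):
    """Split raw cell text into non-empty paragraph strings."""
    if cell_text.endswith(END_OF_CELL):
        cell_text = cell_text[: -len(END_OF_CELL)]
    return [p.strip() for p in cell_text.split("\r") if p.strip()]


def paragraph_sentences(paragraph):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s]


# Hypothetical cell content, standing in for table.Cell(...).Range.Text
sample = "First paragraph. It has two sentences.\rSecond paragraph!\r\x07"
paragraphs = cell_paragraphs(sample)
sentences = [paragraph_sentences(p) for p in paragraphs]
```

If you need each paragraph's style rather than only its text, the Word object model also exposes a Paragraphs collection on a Range; whether that fits your case is worth checking against the Word OM reference.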

Related

python-docx trailing whitespaces not showing correctly

Goal
I am trying to add text to a table cell, where the text is a combination of two strings separated by a variable number of spaces, so that the final text always has the same length and it appears as if the second string is right aligned.
I can use either format or ljust to combine the strings in Python.
period = "from Monday to Friday"
item_text = "Some txt"
item_text2 = "Some other txt"
t1 = "t1: {:<30}{:0}".format(item_text,period)
t2 = "t2: {:<30}{:0}".format(item_text2,period)
t3 = f"t3: {item_text.ljust(30)}{period}"
t4 = f"t4: {item_text2.ljust(30)}{period}"
from pprint import pprint
pprint(t1)
pprint(t2)
pprint(t3)
pprint(t4)
Text in python with variable space length between strings
However, if I add this text to a docx table, the space between the strings changes.
from docx import Document
from docx.shared import Cm
doc = Document()
# Creating a table object
table = doc.add_table(rows=2, cols=2, style="Table Grid")
table.rows[0].cells[0].text = f"{item_text.ljust(30)}{period}"
table.rows[1].cells[0].text = f"{item_text2.ljust(30)}{period}"
def set_col_widths(table):
    # Column widths in centimetres: first column 15 cm, second column 8 cm
    widths = tuple(Cm(val) for val in [15, 8])
    for row in table.rows:
        for idx, width in enumerate(widths):
            row.cells[idx].width = width
set_col_widths(table)
doc.save("test_whitespace.docx")
Text in word. Space between strings changed.
Note
I am aware that I could add a table to the table cell and left adjust the left and right adjust the right but that seems like way more code to write.
Question
Why is the spacing changing in the word document and how can I create the text differently to get the desired goal?

Splitting a docx by headings into separate files in Python

I want to write a program that grabs my docx files, iterates through them and splits each file into multiple separate files based on headings. Inside each of the docx there are a couple of articles, each with a 'Heading 1' and text underneath it.
So if my original file1.docx has 4 articles, I want it to be split into 4 separate files each with its heading and text.
I got to the part where it iterates through all of the files in a path where I hold the .docx files, and I can read the headings and text separately, but I can't seem to figure out a way how to merge it all and split it into separate files each with the heading and the text. I am using the python-docx library.
import glob
from docx import Document
headings = []
texts = []
def iter_headings(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Heading'):
            yield paragraph
def iter_text(paragraphs):
    for paragraph in paragraphs:
        if paragraph.style.name.startswith('Normal'):
            yield paragraph
for name in glob.glob('/*.docx'):
    document = Document(name)
    for heading in iter_headings(document.paragraphs):
        headings.append(heading.text)
    for paragraph in iter_text(document.paragraphs):
        texts.append(paragraph.text)
    print(texts)
How do I extract the text and heading for each article?
This is the XML reading python-docx gives me. The red braces mark what I want to extract from each file.
https://user-images.githubusercontent.com/17858776/51575980-4dcd0200-1eac-11e9-95a8-f643f87b1f40.png
I am open to any alternative suggestions on how to achieve what I want with different methods, or if there is an easier way to do it with PDF files.

I think the approach of using iterators is a sound one, but I'd be inclined to parcel them differently. At the top level you could have:
for paragraphs in iterate_document_sections(document):
    create_document_from_paragraphs(paragraphs)
Then iterate_document_sections() would look something like:
def iterate_document_sections(document):
    """Generate a sequence of paragraphs for each headed section in document.
    Each generated sequence has a heading paragraph in its first position,
    followed by one or more body paragraphs.
    """
    paragraphs = [document.paragraphs[0]]
    for paragraph in document.paragraphs[1:]:
        if is_heading(paragraph):
            # A new heading closes the previous section and starts the next one
            yield paragraphs
            paragraphs = [paragraph]
            continue
        paragraphs.append(paragraph)
    yield paragraphs
Something like this combined with portions of your other code should give you something workable to start with. You'll need an implementation of is_heading() and create_document_from_paragraphs().
Note that the term "section" here is used as in common publishing parlance to refer to a (section) heading and its subordinate paragraphs, and does not refer to a Word document section object (like document.sections).
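To make the grouping step concrete, here is a minimal sketch that runs without python-docx. The Para class is a hypothetical stand-in for a paragraph, and the is_heading() criterion (style name starts with "Heading") is an assumption borrowed from the question; with real python-docx paragraphs the check would presumably read paragraph.style.name instead.

```python
from dataclasses import dataclass


@dataclass
class Para:
    """Hypothetical stand-in for a python-docx paragraph (style name + text)."""
    style_name: str
    text: str


def is_heading(paragraph):
    """Assumed criterion: the style name starts with 'Heading'."""
    return paragraph.style_name.startswith("Heading")


def group_by_heading(paragraphs):
    """Yield lists of paragraphs, starting a new list at each heading."""
    group = []
    for paragraph in paragraphs:
        if is_heading(paragraph) and group:
            yield group
            group = []
        group.append(paragraph)
    if group:
        yield group


# Made-up sample content: two articles, each a heading plus body paragraphs
sample = [
    Para("Heading 1", "Article 1"),
    Para("Normal", "First body paragraph."),
    Para("Normal", "Second body paragraph."),
    Para("Heading 1", "Article 2"),
    Para("Normal", "Only body paragraph."),
]
groups = list(group_by_heading(sample))
titles = [g[0].text for g in groups]
```

Each group could then be handed to something like create_document_from_paragraphs() to write one output file per article.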

In fact, the solution above works well only if the documents contain no elements other than paragraphs (tables, for example).
Another possible solution is to iterate not only through the paragraphs but through all child XML elements of the document body. Once you find the start and end elements of a "sub-document" (paragraphs with headings in your example), you delete all XML elements that do not belong to that part (in effect cutting off the rest of the document). This way you preserve all styles, text, tables and other document elements and formatting.
It's not an elegant solution, and it means you have to keep a temporary copy of the full source document in memory.
This is my code:
import tempfile
from typing import Generator, Tuple, Union
from docx import Document
from docx.document import Document as DocType
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.oxml.xmlchemy import BaseOxmlElement
from docx.text.paragraph import Paragraph
def iterparts(doc_path:str, skip_first=True, bias:int=0) -> Generator[Tuple[int,DocType],None,None]:
    """Iterate over sub-documents by splitting source document into parts
    Split into parts by copying source document and cutting off irrelevant
    data.
    Args:
        doc_path (str): path to source *docx* file
        skip_first (bool, optional): skip first split point and wait for
            second occurrence. Defaults to True.
        bias (int, optional): split point bias. Defaults to 0.
    Yields:
        Generator[Tuple[int,DocType],None,None]: first element of each tuple
            indicates the number of a sub-document, if number is 0
            then there are no sub-documents
    """
    doc = Document(doc_path)
    counter = 0
    while doc:
        split_elem_idx = -1
        doc_body = doc.element.body
        cutted = [doc, None]
        for idx, elem in enumerate(doc_body.iterchildren()):
            if is_split_point(elem):
                if split_elem_idx == -1 and skip_first:
                    split_elem_idx = idx
                else:
                    cutted = split(doc, idx+bias) # bias=-1 (idx-1) keeps the previous paragraph
                    counter += 1
                    break
        yield (counter, cutted[0])
        doc = cutted[1]
def is_split_point(element:BaseOxmlElement) -> bool:
    """Split criteria
    Args:
        element (BaseOxmlElement): oxml element
    Returns:
        bool: whether the *element* is the beginning of a new sub-document
    """
    if isinstance(element, CT_P):
        p = Paragraph(element, element.getparent())
        return p.text.startswith("Some text")
    return False
def split(doc:DocType, cut_idx:int) -> Tuple[DocType,DocType]:
    """Split into parts by copying source document and cutting off
    irrelevant data.
    Args:
        doc (DocType): document to split; it is modified in place and
            becomes the second part
        cut_idx (int): index of the body element at which to cut
    Returns:
        Tuple[DocType,DocType]: (first_part, second_part)
    """
    tmpdocfile = write_tmp_doc(doc)
    second_part = doc
    second_elems = list(second_part.element.body.iterchildren())
    for i in range(0, cut_idx):
        remove_element(second_elems[i])
    first_part = Document(tmpdocfile)
    first_elems = list(first_part.element.body.iterchildren())
    for i in range(cut_idx, len(first_elems)):
        remove_element(first_elems[i])
    tmpdocfile.close()
    return (first_part, second_part)
def remove_element(elem: Union[CT_P,CT_Tbl]):
    elem.getparent().remove(elem)
def write_tmp_doc(doc:DocType):
    tmp = tempfile.TemporaryFile()
    doc.save(tmp)
    return tmp
Note that you should define is_split_point() according to your own split criteria.

docx center text in table cells

So I am starting to use Python's docx library. I create a table with multiple rows and only 2 columns; it looks like this:
Now, I would like the text in those cells to be centered horizontally. How can I do this? I've searched through docx API documentation but I only saw information about aligning paragraphs.

Here is code that does this by setting the alignment as you create the cells.
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
doc=Document()
table = doc.add_table(rows=0, cols=2)
row=table.add_row().cells
p=row[0].add_paragraph('left justified text')
p.alignment=WD_ALIGN_PARAGRAPH.LEFT
p=row[1].add_paragraph('right justified text')
p.alignment=WD_ALIGN_PARAGRAPH.RIGHT
Code by: bnlawrence
To align the text to the center, just change this to:
p.alignment=WD_ALIGN_PARAGRAPH.CENTER
Solution found here: Modify the alignment of cells in a table

Well, adding a paragraph works, but it adds a new paragraph to the cell, so in my case it wasn't an option. Instead, you can set the text of the existing cell and then change the alignment of its paragraph:
import docx.enum.text
row[0].text = "hey, beauty"
p = row[0].paragraphs[0]
p.alignment = docx.enum.text.WD_ALIGN_PARAGRAPH.CENTER
Note that in the top answer the leading "docx.enum.text" was missing :)

The most reliable way that I have found for setting the alignment of a table cell (or really any text property) is through styles. Define a style for center-aligned text in your document stub, either programmatically or through the Word UI. Then it just becomes a matter of applying the style to your text.
If you create the cell by setting its text property, you can just do
for col in table.columns:
    for cell in col.cells:
        cell.paragraphs[0].style = 'My Center Aligned Style'
If you have more advanced contents, you will have to add another loop to your function:
for col in table.columns:
    for cell in col.cells:
        for par in cell.paragraphs:
            par.style = 'My Center Aligned Style'
You can easily stick this code into a function that will accept a table object and a style name, and format the whole thing.
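A minimal sketch of such a function follows. The function name is hypothetical, and the demo uses SimpleNamespace stand-ins instead of a real table so that it runs without python-docx; the function only assumes the table.columns / column.cells / cell.paragraphs shape used in the loops above.

```python
from types import SimpleNamespace


def apply_paragraph_style(table, style_name):
    """Assign style_name to every paragraph in every cell of table.

    Works with any object shaped like a python-docx table
    (table.columns -> column.cells -> cell.paragraphs).
    Returns the number of paragraphs that were restyled.
    """
    count = 0
    for col in table.columns:
        for cell in col.cells:
            for par in cell.paragraphs:
                par.style = style_name
                count += 1
    return count


def _fake_cell(n_paragraphs):
    # Hypothetical stand-in for a table cell, used only for this demo
    return SimpleNamespace(
        paragraphs=[SimpleNamespace(style=None) for _ in range(n_paragraphs)]
    )


fake_table = SimpleNamespace(
    columns=[
        SimpleNamespace(cells=[_fake_cell(1), _fake_cell(2)]),
        SimpleNamespace(cells=[_fake_cell(1), _fake_cell(1)]),
    ]
)
restyled = apply_paragraph_style(fake_table, "My Center Aligned Style")
```

With a real document you would pass the table returned by add_table() and the name of a style that already exists in the document.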

In my case I used this:
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.shared import Pt
def addCellText(row_cells, index, text):
    # Left-aligned cell text at 10 pt
    row_cells[index].text = str(text)
    paragraph=row_cells[index].paragraphs[0]
    paragraph.alignment = WD_ALIGN_PARAGRAPH.LEFT
    font = paragraph.runs[0].font
    font.size= Pt(10)
def addCellTextRight(row_cells, index, text):
    # Right-aligned cell text at 10 pt
    row_cells[index].text = str(text)
    paragraph=row_cells[index].paragraphs[0]
    paragraph.alignment = WD_ALIGN_PARAGRAPH.RIGHT
    font = paragraph.runs[0].font
    font.size= Pt(10)

For total alignment to center I use this code:
from docx.enum.text import WD_ALIGN_PARAGRAPH
from docx.enum.table import WD_ALIGN_VERTICAL
for row in table.rows:
    for cell in row.cells:
        cell.paragraphs[0].alignment = WD_ALIGN_PARAGRAPH.CENTER
        cell.vertical_alignment = WD_ALIGN_VERTICAL.CENTER

from docx.enum.table import WD_TABLE_ALIGNMENT
table = document.add_table(3, 3)
table.alignment = WD_TABLE_ALIGNMENT.CENTER
For details, see:
http://python-docx.readthedocs.io/en/latest/api/enum/WdRowAlignment.html

Put Header with Python - docx

I am using Python-docx to create and write a Word document.
How can I put text in the document header using python-docx?
http://image.prntscr.com/image/8757b4e6d6f545a5ab6a08a161e4c55e.png
Thanks

UPDATE: This feature has been implemented since the time of this answer.
As other respondents have noted below, the Section object provides access to its header objects.
header = document.sections[0].header
Note that a section can have up to three headers (first_page, odd_pages, even_pages) and each section can have its own set of headers. The most common situation is that a document has a single header and a single section.
A header is like a document body or table cell in that it can contain tables and/or paragraphs and by default has a single empty paragraph (it cannot contain zero paragraphs).
header.paragraphs[0].text = "My header text"
This is explained in greater detail on this page in the documentation:
https://python-docx.readthedocs.io/en/latest/user/hdrftr.html
Original answer: Unfortunately, this feature is not implemented yet. The page @SamRogers linked to is part of the enhancement proposal (aka "analysis page"). The implementation is in progress, however, by @eupharis, so it might be available in a month or so. The ongoing pull request is here if you want to follow it: https://github.com/python-openxml/python-docx/pull/291

This feature has been implemented. See: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html
You can add text to the header of a word document using python-docx as follows:
header = document.sections[0].header
head = header.paragraphs[0]
head.text = 'Add Your Text'

This is what I've been using, and it works:
header = document.sections[0].header
header.add_paragraph('Test Header')
Header is a subclass of BlockItemContainer, from which it inherits the same content editing capabilities as Document, such as .add_paragraph().

(With respect to the fact that this question is old...)
I used a workaround in my project, where my "client" wanted different headers on different pages:
1. Create a document using python-docx, with section breaks.
2. Execute a Word macro file (*.docm) with two arguments: (1) fileName = path of the document, (2) docTitle = title of the document to be inserted in the footer.
3. The macro file opens the newly created document and adds the headers and footers that are already inside the macro file. This would need to be modified if the header and footer text needs to vary.
Python code:
import win32com.client
wd = win32com.client.Dispatch("Word.Application")
wd.Visible = False
doc = wd.Documents.Open(pathToDOCM) # path here
wd.Run("Main.RunMain",fileName, docTitle) # 2 args
doc.Close()
del wd
VBA code (inside the *.docm file):
Sub RunInside()
    Call RunMain("C:\Users\???\dokument.docx", "test")
End Sub
Sub RunMain(wordDocument As String, wordTitle As String)
    ' Create Headers
    Call CreateHeaders(wordDocument, wordTitle)
End Sub
Sub CreateHeaders(wordDocument As String, wordTitle As String)
    Dim i As Integer
    Dim outputName As String
    Dim aDoc As Document
    Dim oApp As Word.Application
    Dim oSec As Word.Section
    Dim oDoc As Word.Document
    Dim hdr1, hdr2 As HeaderFooter
    Dim ftr1, ftr2 As HeaderFooter
    'Create a new document in Word
    Set oApp = New Word.Application
    'Set oDoc = oApp.Documents.Add
    Set oDoc = oApp.Documents.Open(wordDocument)
    'Set aDoc as active document
    Set aDoc = ActiveDocument
    oDoc.BuiltInDocumentProperties("Title") = wordTitle
    For i = 1 To 9:
        Set hdr1 = aDoc.Sections(i).Headers(wdHeaderFooterPrimary)
        Set hdr2 = oDoc.Sections(i).Headers(wdHeaderFooterPrimary)
        Set ftr1 = aDoc.Sections(i).Footers(wdHeaderFooterPrimary)
        Set ftr2 = oDoc.Sections(i).Footers(wdHeaderFooterPrimary)
        If i > 1 Then
            With oDoc.Sections(i).Headers(wdHeaderFooterPrimary)
                .LinkToPrevious = False
            End With
            With oDoc.Sections(i).Footers(wdHeaderFooterPrimary)
                .LinkToPrevious = False
            End With
        End If
        hdr1.Range.Copy
        hdr2.Range.Paste
        ftr1.Range.Copy
        ftr2.Range.Paste
    Next i
    outputName = Left(wordDocument, Len(wordDocument) - 5)
    outputName = outputName + ".pdf"
    oDoc.SaveAs outputName, 17
    oDoc.Close SaveChanges:=wdSaveChanges
    Set oDoc = Nothing
    Set aDoc = Nothing
End Sub
Final remark:
The code loops through the sections and copies and pastes the headers and footers into each one. It also saves the document as a PDF.

For those of you looking to set custom headers with docx:
I had to use a couple of packages to get this to work. My use case was this: I was generating multiple templates and then merging them together. However, when I merged them with docx, the header from the master file (below) was applied to all sections, and all sections were marked as linkedToPrevious = True despite being False in the original files. Still, docx does a really nice job of appending files and having the result come out error-free on the other end, so I decided to find a way to make it work. Code for reference:
from docx import Document
from docxcompose.composer import Composer  # Composer comes from the docxcompose package
master = Document(files[0])
composer = Composer(master)
footnotes_doc = Document('chapters/footnotes.docx')
for file in files[1:]:
    mergeDoc = Document(file)
    composer.append(mergeDoc)
composer.append(footnotes_doc)
composer.save("chapters/combined.docx")
So now I have a master doc (combined.docx) with all the proper sections; however, the headers need to be adjusted. With docx you can't iterate over the document, get the section you are currently in, and adjust it or set its header linking to False. If you set it to False, you wipe the header completely. You can explicitly call a section and adjust it, but since everything after it is linked to previous, you change the rest of the document from that point on. So I pulled in win32com:
The code below gets the number of sections and then iterates through them backwards using win32com. This way, as you remove linkedToPrevious, you preserve the header in place.
import win32com.client
def getSections(document):
    sectionArray = {}
    sections = document.sections
    x = 1
    for section in sections:
        sectionArray[x] = section
        x += 1
    return sectionArray
start_doc = Document('chapters/combined.docx')
listArray = getSections(start_doc) # gets a dict of all sections, keyed from 1
keylist = list(reversed(sorted(listArray.keys()))) # now reverse it
word = win32com.client.gencache.EnsureDispatch("Word.Application")
word = win32com.client.DispatchEx("Word.Application")
word.Visible = False
# tell Word to open the document
word.Documents.Open(r'C:\path to\combined.docx')
# get a handle on the open document
doc = word.Documents(1)
try:
    for item in keylist:
        word.ActiveDocument.Sections(item).Headers(win32com.client.constants.wdHeaderFooterPrimary).LinkToPrevious=False
        word.ActiveDocument.Sections(item).Headers(win32com.client.constants.wdHeaderFooterEvenPages).LinkToPrevious=False
    word.ActiveDocument.SaveAs(r"c:\wherever\combined_1.docx")
    doc.Close()
    word.Quit()
except:
    doc.Close()
    word.Quit()
OK, so now the doc is primed for editing the headers, which we can do with docx easily and worry-free. First we need to parse the XML, which I access through docx and then feed to lxml, to get the location of the sections needed:
from lxml import etree
xml = str(start_doc._element.xml) #this gets the full XML using docx
tree = etree.fromstring(xml)
WORD_NAMESPACE='{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
TEXT = WORD_NAMESPACE + 't'
PARA = WORD_NAMESPACE + 'p'
SECT = WORD_NAMESPACE + 'sectPr'
sectionLoc = []
for item in tree.iter(PARA):
    for node in item.iter(TEXT):
        if 'Section' in node.text: #this is how I am identifying which headers I need to edit
            print(node.text)
            sectionLoc.append(node.text)
    for sect in item.iter(SECT):
        print(sect)
        sectionLoc.append('section')
        # print(etree.tostring(sect))
counter =0
sectionLocs = []
for index, item in enumerate(sectionLoc): #just some logic to get the correct section number from the xml parse
    if 'Section' in item:
        sectionLocs.append(counter)
        continue
    counter += 1
#ok now use those locations with docx to adjust the headers
#remember that start_doc here needs to be the new result from win32 process-
#so start_doc = Document('C:\path to\combined.docx') in this case
for item in sectionLocs:
    section = start_doc.sections[item]
    header = section.header
    para_new = header.paragraphs[0]
    para_new.text = 'TEST!'
start_doc.save('willthiswork.docx')
This is a lot of work. I bet there is a way to do it entirely with win32com, but I couldn't figure out how to get the relevant sections with it based on the content in the body of the page. The "sectPr" tags always come at the end of the page, so when combing the document for text that I know is on the page that needs a new header ("Section"), I know that the next section printed out is the one I want to edit, so I just get its location in the list.
I think this whole workflow is a hack, but it works, and I hope the sample code helps someone.
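The counting logic that maps keyword hits to section numbers can be looked at in isolation. This is a minimal sketch with a hypothetical function name and a made-up marker list standing in for sectionLoc; it assumes, as in the code above, that each sectPr element was recorded as the lowercase string 'section' and that the keyword match is case-sensitive.

```python
def section_indices(markers, keyword="Section"):
    """Return the index of the section that contains each keyword hit.

    markers is a document-order list of strings: either paragraph text that
    matched the keyword, or the literal 'section' for each sectPr element.
    """
    indices = []
    sections_seen = 0  # number of sectPr markers passed so far
    for marker in markers:
        if keyword in marker:  # case-sensitive, so 'section' does not match
            indices.append(sections_seen)
        else:
            sections_seen += 1
    return indices


# Hypothetical marker list, standing in for sectionLoc
sample_markers = ["section", "Section A", "section", "section", "Section B", "section"]
result = section_indices(sample_markers)
```

Each returned number is a zero-based index, which is why it can be used directly with start_doc.sections[...].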

You can use header.text, like this:
header = section.header
header.text = 'foobar'
See http://python-docx.readthedocs.io/en/latest/dev/analysis/features/header.html?highlight=header for more information.

import docx
document = docx.Document()
header_section = document.sections[0]
header = header_section.header
header_text = header.paragraphs[0]
header_text.text = "Header of document"
You can use \t either side of text to align it in the centre

How do I find the formatting for a subset of text in an Excel document cell

Using Python, I need to find all substrings in a given Excel sheet cell that are either bold or italic.
My problem is similar to this:
Using XLRD module and Python to determine cell font style (italics or not)
..but the solution is not applicable for me as I cannot assume that the same formatting holds for all content in the cell. The value in a single cell can look like this:
1. Some bold text Some normal text. Some italic text.
Is there a way to find the formatting of a range of characters in a cell using xlrd (or any other Python Excel module)?

Thanks to @Vyassa for all of the right pointers, I've been able to write the following code, which iterates over the rows in an XLS file and outputs style information for cells with "single" style information (e.g., the whole cell is italic) or style "segments" (e.g., part of the cell is italic, part of it is not).
import xlrd
# accessing Column 'C' in this example
COL_IDX = 2
book = xlrd.open_workbook('your-file.xls', formatting_info=True)
first_sheet = book.sheet_by_index(0)
for row_idx in range(first_sheet.nrows):
    text_cell = first_sheet.cell_value(row_idx, COL_IDX)
    text_cell_xf = book.xf_list[first_sheet.cell_xf_index(row_idx, COL_IDX)]
    # skip rows where cell is empty
    if not text_cell:
        continue
    print(text_cell, end=' ')
    text_cell_runlist = first_sheet.rich_text_runlist_map.get((row_idx, COL_IDX))
    if text_cell_runlist:
        print('(cell multi style) SEGMENTS:')
        segments = []
        for segment_idx in range(len(text_cell_runlist)):
            start = text_cell_runlist[segment_idx][0]
            # the last segment starts at given 'start' and ends at the end of the string
            end = None
            if segment_idx != len(text_cell_runlist) - 1:
                end = text_cell_runlist[segment_idx + 1][0]
            segment_text = text_cell[start:end]
            segments.append({
                'text': segment_text,
                'font': book.font_list[text_cell_runlist[segment_idx][1]]
            })
        # segments did not start at beginning, assume cell starts with text styled as the cell
        if text_cell_runlist[0][0] != 0:
            segments.insert(0, {
                'text': text_cell[:text_cell_runlist[0][0]],
                'font': book.font_list[text_cell_xf.font_index]
            })
        for segment in segments:
            print(segment['text'], end=' ')
            print('italic:', segment['font'].italic, end=' ')
            print('bold:', segment['font'].bold)
    else:
        print('(cell single style)', end=' ')
        print('italic:', book.font_list[text_cell_xf.font_index].italic, end=' ')
        print('bold:', book.font_list[text_cell_xf.font_index].bold)

xlrd can do this. You must call open_workbook() with the kwarg formatting_info=True; then sheet objects will have an attribute rich_text_runlist_map, which is a dictionary mapping cell coordinates ((row, col) tuples) to a runlist for that cell. A runlist is a sequence of (offset, font_index) pairs, where offset tells you where in the cell the font begins and font_index indexes into the workbook object's font_list attribute (the workbook object is what's returned by open_workbook()). That gives you a Font object describing the properties of the font, including bold, italics, typeface, size, etc.
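To illustrate how a runlist carves up a cell value, here is a minimal sketch that works on plain Python data and does not need xlrd. The function name, the sample runlist, and the font indices are all made up for illustration; in real use the runlist would come from rich_text_runlist_map and each font index would be looked up in the workbook's font_list.

```python
def split_by_runlist(text, runlist, default_font_index):
    """Split text into (substring, font_index) segments.

    runlist is a sequence of (offset, font_index) pairs, as described above.
    Text before the first offset gets default_font_index (the cell's own font).
    """
    if not runlist:
        return [(text, default_font_index)]
    segments = []
    first_offset = runlist[0][0]
    if first_offset > 0:
        segments.append((text[:first_offset], default_font_index))
    for i, (offset, font_index) in enumerate(runlist):
        # A run ends where the next one starts, or at the end of the text
        end = runlist[i + 1][0] if i + 1 < len(runlist) else len(text)
        segments.append((text[offset:end], font_index))
    return segments


# Hypothetical cell value and runlist; font indices are made up
sample_text = "Some bold text Some normal text. Some italic text."
sample_runlist = [(15, 0), (33, 7)]
sample_segments = split_by_runlist(sample_text, sample_runlist, default_font_index=6)
```

Checking bold or italic for a segment would then be a matter of reading those attributes from the Font object at the segment's font index.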

I don't know if you can do that with xlrd, but since you ask about any other Python Excel module: openpyxl cannot do this in version 1.6.1.
The rich text gets reconstructed away in the function get_string() in openpyxl/reader/strings.py. It would be relatively easy to set up a second table with 'raw' strings in that module.
