I am trying to parse a Word document using python-docx, but have trouble getting the correct styles of paragraphs. I have uploaded a simplified version of the file to Dropbox.
The document's 'Normal' style uses the 'Garamond' font, but this is overridden so that wherever I click in the file, the font is 'Calibri (Body)'.
When I use the 'Style inspector' in Word on the first line, it shows: "Paragraph formatting" is Normal + Plus: Centered, Left: 0 cm, Before: 0 pt, and "Text level formatting" is Default Paragraph Font + Plus: +Body (Calibri), 14 pt, Bold, Underline.
When I do the same on a non-bold text in the table, I get: "Paragraph formatting" is Normal + Plus: +Body (Calibri), Before: 0 pt, and "Text level formatting" is Default Paragraph Font + Plus: <none>.
That is, the font is changed at different levels inside and outside of the table. In both cases, however, I do not know how to get this information using python-docx:
import docx
doc = docx.Document('test.docx')
par = doc.paragraphs[0]
#par = doc.tables[0].cell(0,1).paragraphs[0]
print(f"'{par.style.name}'")
print(f"'{par.style.font.name}'")
print(f"'{par.runs[0].font.name}'")
print(f"'{par.runs[0].style.name}'")
print(f"'{par.runs[0].style.font.name}'")
c = doc.tables[0].cell(1,0)
for par in c.paragraphs:
    print(f"{len(par.runs)}", end=' ')
c.paragraphs[0].add_run('Very short summary')
doc.save('test_ed.docx')
returns
'Normal'
'Garamond'
'None'
'Default Paragraph Font'
'None'
1 0 0 0 0 0 0 0 0 1
In other words, I do not see any sign that the document actually uses the Calibri font.
It returns exactly the same if I use the second par definition (from the table).
Moreover, in the resulting test_ed.docx, the added line uses 'Garamond', even though Word shows the other empty paragraphs as using 'Calibri (Body)'.
So, my question is how to detect the actual format of the text and how to copy it to new paragraphs?
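One possible way to approximate the font is to walk the formatting chain by hand: direct run formatting first, then the run's character style, then the paragraph style, following base_style at each step. The sketch below assumes only the python-docx attributes already used above plus style.base_style; effective_font_name is a hypothetical helper, not part of python-docx. It does not look at document defaults or theme fonts such as '+Body (Calibri)', which is probably why font.name comes back as None here, so recovering those would likely require inspecting the underlying XML.

```python
def effective_font_name(run, paragraph):
    """Return the first explicitly set font name, or None if nothing sets one.

    Hypothetical helper: checks the run itself, then the run's character
    style chain, then the paragraph's style chain.
    """
    if run.font.name is not None:
        return run.font.name
    for style in (run.style, paragraph.style):
        # Follow base_style links until a style names a font or the chain ends.
        while style is not None:
            if style.font.name is not None:
                return style.font.name
            style = style.base_style
    return None
```

Given the output shown above, this would presumably report 'Garamond' for both paragraphs, that is, what the Normal style declares rather than what Word renders.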
How can I ask python-docx to write only some words in italics, not the entire sentence?
I have this code:
names = ["names1"]
dates = ["dates1"]
client = ["client1"]
from docx import Document
document = Document('filename.docx')
paragraphs = document.paragraphs
# names, dates and client are lists, so index them before concatenating
paragraphs[0].insert_paragraph_before(" To "+names[0]+" Date "+dates[0])
paragraphs[0].insert_paragraph_before(" ")
paragraphs[0].insert_paragraph_before(" From "+names[0]+" Ref "+client[0])
paragraphs[0].insert_paragraph_before(" ")
I know how to make an entire sentence italic, but not how to tell Python to make just one word italic.
Here, I would like to italicize To, Date, From and Ref, but only those four words, not the rest.
Do you have any idea how to do that?
Character formatting, such as bold and italic, is applied to a run. A paragraph is composed of zero or more runs.
When you specify the paragraph text as a parameter to the .add_paragraph() call (or .insert_paragraph_before() call), the resulting paragraph contains a single run containing all the specified text and having default formatting.
To do what you want, you will need to build up the paragraph text run by run, like so:
paragraph = paragraphs[0].insert_paragraph_before()
paragraph.add_run(" ")
run = paragraph.add_run("To")
run.italic = True
paragraph.add_run(" " + names[0] + " ")  # names is a list in the question
run = paragraph.add_run("Date")
run.italic = True
paragraph.add_run(" " + dates[0])  # dates is a list as well
You can do something like this:
p = document.add_paragraph()
p.add_run('To').italic = True
p.add_run(" "+names[0]+" ")
p.add_run('Date').italic = True
and so on.
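The same pattern can be wrapped in a small helper when several label/value pairs are needed. This is a minimal sketch: add_labelled_fields is a hypothetical name, and it assumes only that the paragraph object offers the add_run() method used above.

```python
def add_labelled_fields(paragraph, fields):
    """Append each (label, value) pair, with only the label in italics."""
    for label, value in fields:
        paragraph.add_run(" ")
        paragraph.add_run(label).italic = True  # character formatting lives on the run
        paragraph.add_run(" " + value)          # the value keeps default formatting
    return paragraph
```

With the question's data it could be called as add_labelled_fields(paragraphs[0].insert_paragraph_before(), [("To", names[0]), ("Date", dates[0])]).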
I have a Word docx that consists of lots of tables, and I'm having trouble going through all the tables and counting some details. I need to automate this. First, I need to read the tables that have the header "Test case details"; then I need to count the "Test Type" rows that have the value "black box". I have attached an image of the Word docx for reference. I need output like "Total no of Black box test: 200". I'm using Python 3.6. Please help.
sample image of docx
Sample code I tried:
from docx import Document

def table_test_automation(table):
    for row in table.rows:
        row_heading = row.cells[9].text
        if row_heading != 'Test Type':
            continue
        Black_box = row.cells[1].text
        return 1 if Black_box == 'Black Box' else 0
    return 0

document = Document('VRRPv3-PEGASUS.docx')
yes_count = 0
for table in document.tables:
    yes_count += table_test_automation(table)
print("Total No Of Black_box:",yes_count)
It's not clear what the contents of the first row are, so this might take some experimentation.
The place to start is by printing out the contents of the table heading cell:
table_heading_text = table.rows[0].cells[0].text
print(table_heading_text)
If that text is "Test Case Details", you can just test on that to qualify the table for further processing.
The thing that makes me suspicious is the disk icon. If that cell contains only an image, this approach isn't going to work.
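If the heading does turn out to be plain text, the count could be sketched as below. It assumes the heading sits in the first cell of each table and that 'Test Type' and its value sit in the first two cells of a row; these positions are guesses, since the real layout is only visible in the image. count_black_box is a hypothetical helper that relies only on table.rows, row.cells and cell.text.

```python
def count_black_box(tables, heading="Test case details"):
    """Count qualifying tables whose 'Test Type' row has the value 'Black Box'."""
    total = 0
    for table in tables:
        rows = list(table.rows)
        # Skip tables without the expected heading (assumed: first cell of first row).
        if not rows or rows[0].cells[0].text.strip().lower() != heading.lower():
            continue
        for row in rows:
            cells = row.cells
            # Assumed layout: label in column 0, value in column 1.
            if len(cells) >= 2 and cells[0].text.strip() == "Test Type":
                if cells[1].text.strip().lower() == "black box":
                    total += 1
                break  # at most one 'Test Type' row is counted per table
    return total
```

It could then be used as print("Total no of Black box test:", count_black_box(document.tables)).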
Part of the code below is taken from another example. I modified it a bit and use it to read an HTML file and write the contents to a spreadsheet.
As it's just a local file, using Selenium may be overkill, but I want to learn from this example.
from selenium import webdriver
import lxml.html as LH
import lxml.html.clean as clean
import xlwt
book = xlwt.Workbook(encoding='utf-8', style_compression = 0)
sheet = book.add_sheet('SeaWeb', cell_overwrite_ok = True)
driver = webdriver.PhantomJS()
ignore_tags=('script','noscript','style')
results = []
driver.get("source_file.html")
content = driver.page_source
cleaner = clean.Cleaner()
content = cleaner.clean_html(content)
doc = LH.fromstring(content)
for elt in doc.iterdescendants():
    if elt.tag in ignore_tags: continue
    text = elt.text or '' #question 1
    tail = elt.tail or '' #question 1
    words = ''.join((text,tail)).strip()
    if words: # extra question
        words = words.encode('utf-8') #question 2
        results.append(words) #question 3
        results.append('; ') #question 3
sheet.write (0, 0, results)
book.save("C:\\ source_output.xls")
1. In the lines text = elt.text or '' and tail = elt.tail or '', why can both .text and .tail contain text? And why is the or '' part important here?
2. The text in the HTML file contains special characters like ° (temperature degrees). With .encode('utf-8') the output is not right, neither in IDLE nor in the Excel spreadsheet. What's the alternative?
3. Is it possible to join the output into a string instead of a list? Currently, to append to the list, I have to call .append twice to add the text and the ; separator.
elt is an HTML node. It has attributes and a text section. lxml provides a way to extract the attributes and the text, using .text or .tail depending on where the text is.
<a attribute1='abc'>
some text ----> .text of <a> gets this
<p attributeP='def'> </p>
some tail ---> .tail of <p> gets this
</a>
The idea behind or '' is that when the current HTML node has no text or tail, the attribute is None. Concatenating or appending None later raises an error, so to avoid that, an empty string '' is used whenever the text or tail is None.
The degree character is a one-character Unicode string, but .encode('utf-8') turns it into a 2-byte UTF-8 byte string, \xc2\xb0. Read with the wrong encoding, those two bytes show up as Â° (which is \xc3\x82\xc2\xb0 if it is encoded a second time). So you do not have to encode the ° character yourself; the Python interpreter handles it correctly. If it does not, put the correct encoding declaration at the top of your Python script (see PEP 263):
# -*- coding: UTF-8 -*-
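The byte values can be checked directly. The following sketch (Python 3) shows the degree sign as a string, as UTF-8 bytes, and what happens when those bytes are misread as Latin-1, which is one common cause of a stray Â; the variable names are only for illustration.

```python
degree = "\u00b0"                    # the degree sign as a one-character string
encoded = degree.encode("utf-8")     # its UTF-8 form: two bytes
misread = encoded.decode("latin-1")  # the same bytes read with the wrong codec
double = misread.encode("utf-8")     # the misread text encoded once more

print(len(degree), encoded, misread, double)
```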
Yes, you can also join the output into a string; just use +, since string types have no append method, e.g.:
results = ''
results = results + 'whatever you want to join'
You can keep the list and combine your 2 lines:
results.append(words + '; ')
Note: I just checked the xlwt documentation, and sheet.write() accepts only strings, so you cannot pass results, which is a list.
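Questions 1 and 3 can be combined into one sketch: collect text and tail per element, then join once at the end. To stay self-contained it uses the standard library's xml.etree.ElementTree, whose .text and .tail behave like lxml's, instead of lxml and Selenium; collect_text and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

IGNORE_TAGS = {"script", "noscript", "style"}

def collect_text(markup):
    """Return all non-empty text/tail pieces of the markup joined by '; '."""
    root = ET.fromstring(markup)
    pieces = []
    for elt in root.iter():  # note: iter() also visits the root element
        if elt.tag in IGNORE_TAGS:
            continue
        # .text and .tail are None when absent, hence the "or ''".
        words = ((elt.text or "") + (elt.tail or "")).strip()
        if words:
            pieces.append(words)
    return "; ".join(pieces)

sample = "<div><p>20 °C</p> outside<script>x()</script><b>bold</b></div>"
print(collect_text(sample))
```

The result is a single string, which can be passed to sheet.write() as one cell value.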
A simple example for Q1
from lxml import etree
test = etree.XML("<main>placeholder</main>")
print(test.text)  # prints placeholder
print(test.tail)  # prints None
print(test.tail or '')  # prints empty string
test.text = "texter"
print(etree.tostring(test, encoding='unicode'))  # prints <main>texter</main>
test.tail = "tailer"
print(etree.tostring(test, encoding='unicode'))  # prints <main>texter</main>tailer
I am using the python-docx library to manipulate a Word document. However, I can't find how to center-align a line in the library's documentation, and I can't find it on Google either.
from docx import Document
document = Document()
p = document.add_paragraph('A plain paragraph having some ')
p.add_run('bold').bold = True
p.add_run(' and some ')
p.add_run('italic.').italic = True
How can I align the text in docx?
With the new version of python-docx 0.7 https://github.com/python-openxml/python-docx/commit/158f2121bcd2c58b258dec1b83f8fef15316de19
Add feature #51: Paragraph.alignment (read/write)
Now it is possible to align a paragraph as here: http://python-docx.readthedocs.org/en/latest/dev/analysis/features/par-alignment.html
paragraph = document.add_paragraph("This is a text")
paragraph.alignment = 0 # for left, 1 for center, 2 right, 3 justify ....
edit from comments
actually it is 0 for left, 1 for center, 2 for right
edit 2 from comments
You shouldn't hard-code magic numbers like this. Use WD_ALIGN_PARAGRAPH.CENTER to get the correct value for centering, etc. To do this, use the following import:
from docx.enum.text import WD_ALIGN_PARAGRAPH
paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
p = document.add_paragraph('A plain paragraph having some ', style='BodyText', breakbefore=False, jc='left')
# jc (string): paragraph alignment; possible values: left, center, right, both (justified), ...
For reference, see the documentation for def paragraph.
Using Python, I need to find all substrings in a given Excel sheet cell that are either bold or italic.
My problem is similar to this:
Using XLRD module and Python to determine cell font style (italics or not)
..but the solution is not applicable for me as I cannot assume that the same formatting holds for all content in the cell. The value in a single cell can look like this:
1. Some bold text Some normal text. Some italic text.
Is there a way to find the formatting of a range of characters in a cell using xlrd (or any other Python Excel module)?
Thanks to @Vyassa for all of the right pointers, I've been able to write the following code, which iterates over the rows in an XLS file and outputs style information for cells with "single" style information (e.g., the whole cell is italic) or style "segments" (e.g., part of the cell is italic, part of it is not).
import xlrd

# accessing Column 'C' in this example
COL_IDX = 2

book = xlrd.open_workbook('your-file.xls', formatting_info=True)
first_sheet = book.sheet_by_index(0)

for row_idx in range(first_sheet.nrows):
    text_cell = first_sheet.cell_value(row_idx, COL_IDX)
    text_cell_xf = book.xf_list[first_sheet.cell_xf_index(row_idx, COL_IDX)]
    # skip rows where cell is empty
    if not text_cell:
        continue
    print(text_cell, end=' ')
    text_cell_runlist = first_sheet.rich_text_runlist_map.get((row_idx, COL_IDX))
    if text_cell_runlist:
        print('(cell multi style) SEGMENTS:')
        segments = []
        for segment_idx in range(len(text_cell_runlist)):
            start = text_cell_runlist[segment_idx][0]
            # the last segment starts at given 'start' and ends at the end of the string
            end = None
            if segment_idx != len(text_cell_runlist) - 1:
                end = text_cell_runlist[segment_idx + 1][0]
            segment_text = text_cell[start:end]
            segments.append({
                'text': segment_text,
                'font': book.font_list[text_cell_runlist[segment_idx][1]]
            })
        # segments did not start at beginning, assume cell starts with text styled as the cell
        if text_cell_runlist[0][0] != 0:
            segments.insert(0, {
                'text': text_cell[:text_cell_runlist[0][0]],
                'font': book.font_list[text_cell_xf.font_index]
            })
        for segment in segments:
            print(segment['text'], end=' ')
            print('italic:', segment['font'].italic, end=' ')
            print('bold:', segment['font'].bold)
    else:
        print('(cell single style)', end=' ')
        print('italic:', book.font_list[text_cell_xf.font_index].italic, end=' ')
        print('bold:', book.font_list[text_cell_xf.font_index].bold)
xlrd can do this. You must call open_workbook() with the kwarg formatting_info=True; sheet objects will then have an attribute rich_text_runlist_map, which is a dictionary mapping cell coordinates ((row, col) tuples) to a runlist for that cell. A runlist is a sequence of (offset, font_index) pairs, where offset tells you where in the cell the font begins and font_index indexes into the workbook object's font_list attribute (the workbook object is what open_workbook() returns). That gives you a Font object describing the properties of the font, including bold, italics, typeface, size, etc.
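The runlist structure can be turned into text segments with plain slicing. A minimal sketch, independent of xlrd itself: split_runs is a hypothetical helper, runlist is a list of (offset, font_index) pairs as described, and default_font_index stands for the cell's own font.

```python
def split_runs(text, runlist, default_font_index):
    """Split text into (segment, font_index) pairs according to a runlist."""
    if not runlist:
        return [(text, default_font_index)]
    segments = []
    # Text before the first run keeps the cell's own font.
    if runlist[0][0] > 0:
        segments.append((text[:runlist[0][0]], default_font_index))
    for i, (offset, font_index) in enumerate(runlist):
        # A run ends where the next one starts; the last run goes to the end.
        end = runlist[i + 1][0] if i + 1 < len(runlist) else None
        segments.append((text[offset:end], font_index))
    return segments
```

Each font_index in the result could then be looked up in the workbook's font_list to read its bold and italic properties.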
I don't know if you can do that with xlrd, but since you ask about any other Python Excel module: openpyxl cannot do this in version 1.6.1.
The rich text gets reconstructed away in the function get_string() in openpyxl/reader/strings.py. It would be relatively easy to set up a second table with 'raw' strings in that module.