Python Docx Module merges tables when added subsequently to document

I'm using the python-docx module with Python 3.9.0 to create Word .docx files. The problem I have is the following:
A) I defined a table style named my_table_style
B) I open my template, add one table of that style to my document object and then I store the created file with the following code:
import os
from docx import Document
template_path = os.path.realpath(__file__).replace("test.py","template.docx")
my_file = Document(template_path)
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.save(template_path.replace("template.docx","test.docx"))
When I now open test.docx, it's all good, there's one table with one row saying "hello".
NOW, when I use this syntax to create two of these tables:
import os
from docx import Document
template_path = os.path.realpath(__file__).replace("test.py","template.docx")
my_file = Document(template_path)
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.add_table(1,1,style="my_table_style").rows[-1].cells[0].paragraphs[0].add_run("hello")
my_file.save(template_path.replace("template.docx","test.docx"))
Instead of getting two tables, each with one row saying "hello", I get one single table with two rows, each saying "hello". The formatting is however correct, according to my_table_style, so it seems that python-docx merges two subsequently added tables of the same table style. Is this normal behavior? How can I avoid that?
Cheers!
HINTS:
When I use print(len(my_file.tables)) to print the number of tables present in my_file, I actually get "2"! Also, when I change the style used in the second add_table line, it all works, so this seems to be related to using the same style. Any ideas, anyone?

Alright, I figured it out: this seems to be default behaviour in Word. As a workaround, I manually created a table style my_custom_style in template.docx and customized its border lines etc. so that a single table looks as if it were two separate tables.
Instead of using two add_table() statements, I then used
new_table = my_file.add_table(1,1,style = "my_custom_style")
first_row = new_table.rows[-1]
second_row = new_table.add_row()
You can access table styles defined in your template via python-docx simply by using the style name you gave the style when you created it in the Word template that you open as your Document object. Just make sure you tick the "add this table style to the word template" option when saving the style in Word. Everything is working now.

Related

Can python-docx preserve font color and styles when importing documents?

Essentially what I need to do is write a program that takes in many .docx files and puts them all in one, ordered in a certain way. I have importing working via:
import docx, os, glob
finaldocname = 'Midterm-All-Questions.docx'
finaldoc=docx.Document()
docstoworkon = glob.glob('*.docx')
if finaldocname in docstoworkon:
    docstoworkon.remove(finaldocname) #dont process final doc if it exists
for f in docstoworkon:
    doc=docx.Document(f)
    fullText=[]
    for para in doc.paragraphs:
        fullText.append(para.text) #generates a long text list
    # finaldoc.styles = doc.styles
    for l in fullText:
        # if l=='u\'\\n\'':
        if '#' in l:
            print('We got here!')
            if '#1 ' not in l: #check last two characters to see if this is the first question
                finaldoc.add_section() #only add a page break between questions
        finaldoc.add_paragraph(l)
    # finaldoc.add_page_break
# finaldoc.add_page_break
finaldoc.save(finaldocname)
But I need to preserve text styles, like font colors, sizes, italics, etc., and they aren't in this method since it just gets the raw text and dumps it. I can't find anything on the python-docx documentation about preserving text styles or importing in something other than raw text. Does anyone know how to go about this?

Styles are a bit difficult to work with in python-docx but it can be done.
See this explanation first to understand some of the problems with styles and Word.
The Long Way
When you read in a file as a Document() it will bring in all of the paragraphs and within each of these are the runs. These runs are chunks of text with the same style attached to them.
You can find out how many paragraphs or runs there are by doing len() on the object or you can iterate through them like you did in your example with paragraphs.
You can inspect the style of any given paragraph but runs may have different styles than the paragraph as a whole, so I would skip to the run itself and inspect the style there using paragraphs[0].runs[0].style which will give you a style object. You can inspect the font object beyond that which will tell you a number of attributes like size, italic, bold, etc.
Now to the long solution:
You should first create a new blank paragraph, then add_run() one by one with the text from your original. For each of these you can define a style attribute, but it would have to be a named style as described in the first link. You cannot apply a style object directly, as it won't copy the attributes over. But there is a way around that: check the attributes that you care about copying to the output and then ensure your new run applies the same attributes.
doc_out = docx.Document()
for para in doc.paragraphs:
    p = doc_out.add_paragraph()
    for run in para.runs:
        r = p.add_run(run.text)
        if run.bold:
            r.bold = True
        if run.italic:
            r.italic = True
        # etc
Obviously this is inefficient and not a great solution, but it will work to ensure you have copied the style appropriately.
Add New Styles
There is a way to add styles by name but because it isn't likely that the Word document you are getting the text and styles from is using named styles (rather than just applying bold, etc. to the words that you want), it is probably going to be a long road to adding a lot of slightly different styles or sometimes even the same ones.
Unfortunately that is the best answer I have for you on how to do this. Working with Word, Outlook, and Excel documents is not great in Python, especially for what you are trying to do.

Storing Scraped data in csv

I am learning web scraping using Scrapy and having a lot of fun with it. The only problem is that I can't save the scraped data the way I want to.
The code below scrapes reviews from Amazon. How can I store the data better?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import csv

class Oneplus6Spider(scrapy.Spider):
    name = 'oneplus6'
    allowed_domains = ['amazon.in']
    start_urls = ['https://www.amazon.in/OnePlus-Silk-White-128GB-Storage/product-reviews/B078BNQ2ZS/ref=cm_cr_arp_d_viewopt_sr?ie=UTF8&reviewerType=all_reviews&filterByStar=positive&pageNumber=1']

    def parse(self, response):
        writer = csv.writer(open('jack.csv','w+'))
        opinions = response.xpath('//*[@class="a-size-base a-link-normal review-title a-color-base a-text-bold"]/text()').extract()
        for opinion in opinions:
            yield({'Opinion':opinion})
        reviewers = response.xpath('//*[@class="a-size-base a-link-normal author"]/text()').extract()
        for reviewer in reviewers:
            yield({'Reviewer':reviewer})
        verified = response.xpath('//*[@class="a-size-mini a-color-state a-text-bold"]/text()').extract()
        for verified_buyer in verified:
            yield({'Verified_buyer':verified_buyer})
        ratings = response.xpath('//span[@class="a-icon-alt"]/text()').extract()
        for rating in ratings:
            yield({'Rating':rating[0]})
        model_bought = response.xpath('//a[@class="a-size-mini a-link-normal a-color-secondary"]/text()').extract()
        for model in model_bought:
            yield({'Model':model})
I tried Scrapy's default -o export option and also tried the csv module.
I am very new to the pandas and csv modules, and I can't figure out how to store the scraped data in a proper format: all the values end up in one single row.
I want the different values in different rows,
e.g.: Reviews|Rating|Model|
but I just can't figure out how to do it.
How can I do it?

Your code yields records of different types: they are all dict objects with a single key, but the key differs from record to record ("Opinion", "Reviewer", etc.).
In Scrapy, exporting data to CSV is handled by CsvItemExporter, and its _write_headers_and_set_fields_to_export method is what matters for your current problem, because the exporter needs to know the list of fields (column names) before writing the first item.
Specifically:
1. It first checks the fields_to_export attribute (configured by the FEED_EXPORT_FIELDS setting via the feed exporter (related code here)).
2. If that is unset:
2.a. If the first item is a dict, it uses all its keys as the column names.
2.b. If the first item is a scrapy.Item, it uses the keys from the item definition.
Thus there are several ways to resolve the problem:
You may define a scrapy.Item class with all the keys you need, and yield items of this type in your code (just fill in the one field you need, and leave the others empty, for any specific record).
Or, properly configure the FEED_EXPORT_FIELDS setting and leave the rest of your existing code unchanged.
I suppose the hints above are sufficient. Please let me know if you need further examples.
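The effect of fixing the column list before the first record is written can be illustrated with the standard library alone. This sketch does not use Scrapy's exporter; it assumes csv.DictWriter as a stand-in, and FIELDS, rows_from_columns and to_csv are hypothetical names. It also shows one way to get one review per row: combining the parallel lists with zip() so that each dict carries all fields.

```python
import csv
import io

# Column names known up front (the role FEED_EXPORT_FIELDS plays in Scrapy).
FIELDS = ["Opinion", "Reviewer", "Rating"]


def rows_from_columns(opinions, reviewers, ratings):
    """Combine parallel per-field lists into one dict per review."""
    return [
        {"Opinion": opinion, "Reviewer": reviewer, "Rating": rating}
        for opinion, reviewer, rating in zip(opinions, reviewers, ratings)
    ]


def to_csv(rows, fields=FIELDS):
    """Write dict rows as CSV text; keys missing from a row become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


full_rows = rows_from_columns(["Great phone"], ["Ann"], ["5"])
print(to_csv(full_rows))
print(to_csv([{"Opinion": "Great phone"}]))  # single-key record, other cells empty
```

In the spider, the equivalent of rows_from_columns would be to yield one dict per review from parse() instead of one dict per value.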

One of the easiest ways to set the CSV data format is to clean the data using Excel Power Query. Follow these steps:
1. Open the CSV file in Excel.
2. Select all values using Ctrl+A.
3. Click Table on the Insert tab and create a table.
4. After creating the table, click Data in the top menu and select From Table.
5. This opens the Power Query editor in a new Excel window.
6. Select any column and click Split Column.
7. From Split Column, select By Delimiter.
8. Now select the delimiter, such as comma or space.
9. Final step: select the advanced options, which offer two choices: split into rows or into columns.
You can do all kinds of data cleaning with Power Query; this is the easiest way to set up the data format according to your needs.

Change table length in word-document, python-docx, Python

I can't seem to find a way to change the length of the entire table in a Word document. I have only seen examples of ways to change the columns in the table, not the actual table itself.
Would be great if someone could tell me how to do it :)
Here is my code:
from docx import Document
document = Document()
table = document.add_table(rows=4, cols=2)
table.style = 'Table Grid'

The Table class has methods to add rows.
https://python-docx.readthedocs.io/en/latest/api/table.html#docx.table.Table.add_row

Found a solution to my problem. I got this from another user here on Stack Overflow, but I can't seem to find the link.
The original code is not mine; I only modified it a little.
from docx.shared import Cm

def ChangeWidthOfTable(table,width,column):
    # Set every cell in the first `column` columns to `width` centimetres
    for columnVarible in range(0,column):
        for cell in table.columns[columnVarible].cells:
            cell.width = Cm(width)

Possible to insert row at specific position with python-docx?

I'd like to insert a couple of rows in the middle of the table using python-docx. Is there any way to do it? I've tried to use a similar to inserting pictures approach but it didn't work.
If not, I'd appreciate any hint on which module is a better fit for this task. Thanks.
Here is my attempt to mimic the idea for inserting a picture. It's WRONG. 'Run' object has no attribute 'add_row'.
from docx import Document
doc = Document('your docx file')
tables = doc.tables
p = tables[1].rows[4].cells[0].add_paragraph()
r = p.add_run()
r.add_row()
doc.save('test.docx')

The short answer is No. There's no Table.insert_row() method in the API.
A possible approach is to write a so-called "workaround function" that manipulates the underlying XML directly. You can get to any given XML element (e.g. <w:tbl> in this case or perhaps <w:tr>) from its python-docx proxy object. For example:
tbl = table._tbl
Which gives you a starting point in the XML hierarchy. From there you can create a new element from scratch or by copying and use lxml._Element API calls to place it in the right position in the XML.
It's a little bit of an advanced approach, but probably the simplest option. There are no other Python packages out there that provide a more extensive API as far as I know. The alternative would be to do something in Windows with their COM API or whatever from VBA, possibly IronPython. That would only work at small scale (desktop, not server) running Windows OS.
A search on python-docx workaround function and python-pptx workaround function will find you some examples.

You can add a row to the end of the table and then move it to another position as follows:
from docx import Document
doc = Document('your docx file')
t = doc.tables[0]
row0 = t.rows[0] # for example
row1 = t.add_row() # new row, appended at the end of the table
row0._tr.addnext(row1._tr) # move it directly after row0

Though there isn't a directly usable api to achieve this according to the python-docx documentation, there is a simple solution without using any other libs such as lxml, just use the underlying data structure provided by python-docx, which are CT_Tbl, CT_Row, etc.
These classes do have common methods like addnext, addprevious which can conveniently add element as siblings right after/before current element.
So the problem can be solved as below, (tested on python-docx v0.8.10)
from docx import Document
doc = Document('your docx file')
tables = doc.tables
row = tables[1].rows[4]
tr = row._tr # this is a CT_Row element
for new_tr in build_rows(): # build_rows should return list/iterator of CT_Row instance
    tr.addnext(new_tr)
doc.save('test.docx')
This should solve the problem.

You can add a row in the last position this way:
from win32com import client
word = client.Dispatch("Word.Application")
doc = word.Documents.Open(r'yourFile.docx')
table = doc.Tables(1) # number of the table you want to manipulate
table.Rows.Add()

addnext() from lxml.etree seems to be the better option and it works fine. The only thing is that I cannot set the height of the row, so please share an answer if you know how!
import copy
from docx.enum.table import WD_ROW_HEIGHT_RULE
current_row = table.rows[row_index]
table.rows[row_index].height_rule = WD_ROW_HEIGHT_RULE.AUTO
tbl = table._tbl
border_copied = copy.deepcopy(current_row._tr)
tr = border_copied
current_row._tr.addnext(tr)

I created a video here to demonstrate how to do this because it threw me for a loop the first time.
https://www.youtube.com/watch?v=nhReq_0qqVM
from docx import Document
document = Document("MyDocument.docx")
Table = document.tables[0]
Table.add_row()
for cells in Table.rows[-1].cells:
    cells.text = "test text"
insertion_row = Table.rows[4]._tr
insertion_row.addnext(Table.rows[-1]._tr)
document.save("MyDocument.docx")
The python-docx module doesn't have a method for this, so the best workaround I've found is to create a new row at the bottom of the table and then use methods of the underlying XML elements to move it to the position where it is supposed to be.
This creates a new row in which every cell has the value "test text" and then places that row directly underneath insertion_row.

Word & Python - Create Table of Contents

I'm using the pywin32.client extension for python and building a Word document. I have tried a pretty good host of methods to generate a ToC but all have failed.
I think what I want to do is call the ActiveDocument object and create one with something like this example from the MSDN page:
Set myRange = ActiveDocument.Range(Start:=0, End:=0)
ActiveDocument.TablesOfContents.Add Range:=myRange, _
UseFields:=False, UseHeadingStyles:=True, _
LowerHeadingLevel:=3, _
UpperHeadingLevel:=1
Except in Python it would be something like:
wordObject.ActiveDocument.TablesOfContents.Add(Range=???,UseFields=False, UseHeadingStyles=True, LowerHeadingLevel=3, UpperHeadingLevel=1)
I've built everything so far using the 'Selection' object (example below) and wish to add this ToC after the first page break.
Here's a sample of what the document looks like:
objWord = win32com.client.Dispatch("Word.Application")
objDoc = objWord.Documents.Open('pathtotemplate.docx') #
objSel = objWord.Selection
#These seem to work but I don't know why...
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.Add(1,True)
objWord.ActiveDocument.Sections(1).Footers(1).PageNumbers.NumberStyle = 57
objSel.Style = objWord.ActiveDocument.Styles("Heading 1")
objSel.TypeText("TITLE PAGE AND STUFF")
objSel.InsertParagraph()
objSel.TypeText("Some data or another")
objSel.TypeParagraph()
objWord.Selection.InsertBreak()
####INSERT TOC HERE####
Any help would be greatly appreciated! In a perfect world I'd use the default first option which is available from the Word GUI but that seems to point to a file and be harder to access (something about templates).
Thanks

Edit your template manually in Word: add the ToC (which will be empty initially), any intro material, headers/footers etc., and then put a uniquely named bookmark at the place where you want your text content inserted (i.e. after the ToC). Then, in your code, create a new document based on the template (or open the template and save it under a different name), search for the bookmark and insert your content there. Save to a different filename.
This approach has all sorts of advantages. You can format your template in Word rather than writing all the formatting details in code, so when someone says they want the Normal font to be bigger/smaller/pink, you can update the styles just by editing the template. Make sure to use styles in your code and only apply direct formatting when it specifically differs from the default style.
I'm not sure how to make sure the ToC is actually generated; it might be updated automatically on every save.
