How to add the column headers on every page in python-docx? - python

I'm trying to write a Pandas data frame to a .docx file in Python. Since the document will usually run to more than one page, I want the column names of the data frame to be printed at the top of every new page.
Currently my code just prints the whole data frame as is:
# add the header row
for j in range(t01.shape[-1]):
    table.cell(0, j).text = t01.columns[j]
# add the rest of the data frame
for i in range(t01.shape[0]):
    for j in range(t01.shape[-1]):
        table.cell(i + 1, j).text = str(t01.values[i, j])

What you're probably looking for is Word's Repeat Header Rows functionality (a table-layout option in Word).
Since python-docx doesn't expose that functionality yet, you can set the flag yourself. The element you need in the OOXML schema is w:tblHeader: http://www.datypic.com/sc/ooxml/e-w_tblHeader-1.html
Note that rows declared as header rows repeat at the beginning of every page whenever the table can't fit on a single page. So what you need to do is declare the first row as a header row, which can be done like this:
from docx import Document
from docx.oxml import OxmlElement

doc = Document()
t = doc.add_table(rows=50, cols=2)
# set header values
t.cell(0, 0).text = 'A'
t.cell(0, 1).text = 'B'
# create the oxml element flag which indicates that a row is a header row
tbl_header = OxmlElement('w:tblHeader')
# get the first row's properties element if it exists, or create it
first_row_props = t.rows[0]._element.get_or_add_trPr()
first_row_props.append(tbl_header)  # now the first row is a header row
for i in range(1, len(t.rows)):
    for j in range(len(t.columns)):
        t.cell(i, j).text = f'i:{i}, j:{j}'
doc.save('t1.docx')
In the resulting document, the A/B row repeats at the top of every page the table spans.

Related

How to extract a Word table from multiple files using python docx

I am working on a project at work where I need to analyze over a thousand MS Word files, each containing the same table. From each table I just need to extract a few cells and turn them into a row; the rows will later be concatenated to create a dataframe for further analysis.
I tested Python's docx library on one file and it managed to read the table. However, after plugging the same code into a for loop that first builds a list of all the file names and then passes each one to the Document function, the output is just one table: the first table in the list of files.
I have a feeling I'm not looking at this the right way; I would appreciate any guidance as I'm completely stuck.
Following is the code I used; it consists mainly of code I stumbled upon on Stack Overflow:
import os
import pandas as pd
from docx import Document

file = [f for f in os.listdir() if f.endswith(".docx")]
for name in file:
    document = Document(name)
    table = document.tables[0]
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue
        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)
thanks
You are reinitializing the data list to [] (empty) for every document. So you carefully collect the row-data from a document and then in the next step throw it away.
If you move data = [] outside the loop then after iterating through the documents it will contain all the extracted rows.
data = []
for name in filenames:
    ...
    data.append(row_data)
print(data)
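The fix is the standard accumulator pattern: create the list once before the loop, grow it inside the loop, and only build the final structure afterwards. A minimal stdlib-only sketch, with the docx table-parsing step stubbed out by a hypothetical helper:

```python
def extract_rows(name):
    # hypothetical stand-in for the docx table-parsing code above
    return [{"file": name, "value": len(name)}]

data = []  # created once, before the loop
for name in ["a.docx", "b.docx"]:
    data.extend(extract_rows(name))  # rows from every file accumulate here

# data now holds the rows of all documents, ready for pd.DataFrame(data)
```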

Python-docx : Hyperlink text from markdown in table

I want to insert a hyperlink in the text within a DOCX table. Currently the data I have contains the hyperlinks in markdown format.
Is there a way to add this as a hyperlink in my DOCX table? I'm aware that this can be done for paragraphs, but was wondering if it could be done in tables.
Example code:
import pandas as pd
import docx

# Sample Data
df = pd.DataFrame(
    [["Some [string](www.google.com) is the link to google", "IDnum1", "China"],
     ["This is string 2", "IDnum3", "Australia"],
     ["some string3", "IDNum5", "America"]],
    columns=["customer string", "ID number", "Country"])

# create a new document
doc = docx.Document()
# add a table to the end and create a reference variable;
# the extra row is so we can add the header row
t = doc.add_table(df.shape[0] + 1, df.shape[1])
# add the header row
for j in range(df.shape[-1]):
    t.cell(0, j).text = df.columns[j]
# add the rest of the data frame
for i in range(df.shape[0]):
    for j in range(df.shape[-1]):
        t.cell(i + 1, j).text = str(df.values[i, j])
# save the doc
doc.save('./test.docx')
Any help will be greatly appreciated!
Each cell has one or more paragraphs on cell.paragraphs. Whatever works on a paragraph outside a table will also work on a paragraph inside a table.
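The first step is to pull the link text and URL out of the markdown string. A small regex sketch (the pattern and helper name are my own, not from any library):

```python
import re

# matches [text](url) anywhere in a string
MD_LINK = re.compile(r'\[([^\]]+)\]\(([^)]+)\)')

def split_markdown_link(text):
    """Return (before, link_text, url, after) for the first markdown
    link in text, or None if there is no link."""
    m = MD_LINK.search(text)
    if m is None:
        return None
    return text[:m.start()], m.group(1), m.group(2), text[m.end():]
```

You can then write the before/after parts as plain runs on the cell's paragraph and insert the link text between them using the well-known oxml recipe for adding hyperlinks to paragraphs, applied to cell.paragraphs[0].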

Python: print each excel row as its own ms word page

I've got an excel document with thousands of rows, each row represents a person's record. How can I use Python to extract each row and write that information into an MS Word page?
My goal is to create a doc containing a pseudo-narrative of each record on its own page.
You can read the content of the Excel file into a pandas data frame and then export the data frame to a Word document as a table. This is the generic syntax to achieve your objective:
import os
import pandas as pd
import win32com.client as win32  # needs PyWin32

xls_file = pd.ExcelFile('../data/example.xls')
df = xls_file.parse('Sheet1')

wordApp = win32.gencache.EnsureDispatch('Word.Application')
wordApp.Visible = False
doc = wordApp.Documents.Open(os.getcwd() + '\\template.docx')
rng = doc.Bookmarks("PUTTABLEHERE").Range
# creating the table; add one more row because you want
# the column names as a header
Table = rng.Tables.Add(rng, NumRows=df.shape[0] + 1, NumColumns=df.shape[1])
for col in range(df.shape[1]):
    # writing column names
    Table.Cell(1, col + 1).Range.Text = str(df.columns[col])
    for row in range(df.shape[0]):
        # writing each value of the data frame
        Table.Cell(row + 2, col + 1).Range.Text = str(df.iloc[row, col])
Hope this helps!
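Since the question asks for a pseudo-narrative per record rather than one big table, an alternative is to render each row as a text block and put a page break after it. A sketch with a hypothetical template and column names (adapt both to your sheet); the document-writing step is shown as a comment because it depends on your Word library of choice:

```python
import pandas as pd

def record_narrative(row):
    # hypothetical narrative template; adapt the field names to your sheet
    return f"{row['name']} is {row['age']} years old and lives in {row['city']}."

df = pd.DataFrame({"name": ["Ann", "Bo"], "age": [34, 41], "city": ["Oslo", "Perth"]})
pages = [record_narrative(row) for _, row in df.iterrows()]
# with python-docx you would then do, for each narrative:
#   doc.add_paragraph(text); doc.add_page_break()
```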

How to parse dataframes from an excel sheet with many tables (using Python, possibly Pandas)

I'm dealing with badly laid out excel sheets which I'm trying to parse and write into a database.
Every sheet can have multiple tables. The headers of the possible tables are known, but which tables will appear on any given sheet is not, and neither is their exact location (the tables don't align in a consistent way). For example, one layout might have two tables, while another has all the tables of the first, in different positions, plus an extra table.
What I do know:
All the possible table headers, so each individual table can be identified by its headers
Tables are separated by blank cells. They do not touch each other.
My question: is there a clean way to deal with this using a Python module such as pandas?
My current approach:
I'm currently converting to .csv and parsing each row. I split each row around blank cells, and process the first part of the row (should belong to the leftmost table). The remainder of the row is queued and later processed in the same manner. I then read this first_part and check whether or not it's a header row. If it is, I use it to identify which table I'm dealing with (this is stored in a global current_df). Subsequent rows which are not header rows are fed into this table (here I'm using pandas.DataFrame for my tables).
Code so far is below (mostly incomplete and untested, but it should convey the approach above):
from queue import Queue

class DFManager(object):
    # keeps track of the current table and its headers
    current_df = None
    current_headers = []

    def set_current_df(self, df, headers):
        self.current_headers = headers
        self.current_df = df

def split_row(row, separator):
    while row and row[0] == separator:
        row.pop(0)
    while row and row[-1] == separator:
        row.pop()
    if separator in row:
        split_index = row.index(separator)
        return row[:split_index], row[split_index:]
    else:
        return row, []

def process_df_row(row, dfmgr):
    df = df_with_header(row)  # returns the dataframe with these headers
    if df is None:  # not a header row, add it to the current df
        df = dfmgr.current_df
        add_row_to_df(row, df)
    else:
        dfmgr.set_current_df(df, row)

# this is passed the Excel sheet
def populate_dataframes(xl_sheet):
    dfmgr = DFManager()
    row_queue = Queue()
    for row in xl_sheet:
        row_queue.put(row)
    for row in iter(row_queue.get, None):
        if not row:
            continue
        first_part, remainder = split_row(row, '')  # blank cell as separator
        row_queue.put(remainder)
        process_df_row(first_part, dfmgr)
This is such a specific situation that there is likely no "clean" way to do this with a ready-made module.
One way to do this might use the header information you already have to find the starting indices of each table, something like this solution (Python Pandas - Read csv file containing multiple tables), but with an offset in the column direction as well.
Once you have the starting position of each table, you'll want to determine the widths (either known a priori or discovered by reading until the next blank column) and read those columns into a dataframe until the end of the table.
The benefit of an index-based method rather than a queue based method is that you do not need to re-discover where the separator is in each row or keep track of which row fragments belong to which table. It is also agnostic to the presence of >2 tables per row.
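Locating each table by its known headers can be sketched without pandas at all: scan the sheet as a 2-D list of cell values and record the coordinates where a known first-header appears. The function name and sample grid below are illustrative, not from any library:

```python
def find_table_origins(grid, known_first_headers):
    """Return {header: (row, col)} for each cell whose value is the
    first header of one of the known tables."""
    origins = {}
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value in known_first_headers:
                origins[value] = (r, c)
    return origins

sheet = [
    ["", "", "", ""],
    ["", "Name", "Age", ""],
    ["", "Ann", "34", ""],
    ["", "", "", ""],
    ["", "", "City", "Pop"],
]
```

Here find_table_origins(sheet, {"Name", "City"}) returns {"Name": (1, 1), "City": (4, 2)}. From each origin you read right until the next blank column to get the width, and down until a blank row to get the height.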
I wrote code to merge multiple vertically separated tables that share common headers. I am assuming a unique header name does not end with a dot followed by an integer (the suffix pandas-style deduplication adds).
import sys
import pandas as pd

def clean(input_file, output_file):
    try:
        df = pd.read_csv(input_file, skiprows=[1, 1])
        df = df.drop(df.columns[df.columns.str.contains('unnamed', case=False)], axis=1)
        df = rename_duplicate_columns(df)
    except OSError:
        print("Error: file not found\t", sys.exc_info()[0])
        return
    # columns whose names do not end with '.<n>' belong to the first table
    udf = df.loc[:, ~df.columns.str.match(r".*\.\d")]
    udf = udf.dropna(how='all')
    try:
        table_num = int(df.columns.values[-1].split('.')[-1])
        fdf = udf
        for i in range(1, table_num + 1):
            udfi = df.loc[:, df.columns.str.endswith(f'.{i}')]
            udfi.rename(columns=lambda x: '.'.join(x.split('.')[:-1]), inplace=True)
            udfi = udfi.dropna(how='all')
            fdf = pd.concat([fdf, udfi], ignore_index=True)
        fdf.to_csv(output_file)
    except ValueError:
        print("File contains only a single table")

def rename_duplicate_columns(df):
    cols = pd.Series(df.columns)
    for dup in cols[cols.duplicated()].unique():
        mask = df.columns.get_loc(dup)  # boolean mask of the duplicate positions
        cols[mask] = [dup + '.' + str(d_idx) if d_idx != 0 else dup
                      for d_idx in range(mask.sum())]
    df.columns = cols
    return df

clean(input_file, output_file)

Create variables no matter the nr of rows of the data in clipboard

I created a program that performs certain operations in a web-based application (with Selenium) using data from the Windows clipboard (rows of strings like QWERTY123). For each row, the program copies it from the clipboard, pastes it into a field and executes a task; the task should be performed the same way for every row. My problem is that sometimes the clipboard has 2 rows (as in the code below), other times 20, or 77, and so on. How can I modify my code to work no matter how many rows there are?
Please see what I have done here:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import tkinter as tk
path_to_Ie = 'C:\\Python34\\ChromeDriver\\ChromeDriver.exe'
browser = webdriver.Chrome(executable_path=path_to_Ie)
url = 'https://wwww.corp/'
browser.get(url)
browser.find_element_by_xpath("//*[@id='username']").send_keys("user")
browser.find_element_by_xpath("//*[@id='password']").send_keys("pass")
browser.find_element_by_xpath("//*[@id='login-link']").click()
browser.get('https://wwww.corp/soft.html')
time.sleep(2)
browser.find_element_by_xpath("//*[@id='content-column']/div[4]/form/div[1]/span/label").click()
time.sleep(2)
browser.find_element_by_xpath("//*[@id='agent_list_filter_id_2']").clear()
root = tk.Tk()
# keep the window from showing
root.withdraw()
# read the clipboard
machineName = root.clipboard_get()
one1, two2 = machineName.split('\n')  # generates variables for the data (2 rows) of clipboard
one = one1.replace(" ", "")
two = two2.replace(" ", "")
browser.find_element_by_xpath("//*[@id='agent_list_filter_id_2']").send_keys(one)  # send first row data
browser.find_element_by_xpath("//*[@id='content-column']/div[4]/form/div[1]/span/span[1]").click()
browser.find_element_by_xpath("//*[@id='action-select-all']/span/span").click()
browser.find_element_by_xpath("//*[@id='action-delete']/span/span").click()
browser.find_element_by_xpath("//*[@id='btn_save']").click()  # last command to delete the first row
browser.get('https://wwww.corp/soft.html')
time.sleep(2)
browser.find_element_by_xpath("//*[@id='agent_list_filter_id_2']").clear()
browser.find_element_by_xpath("//*[@id='agent_list_filter_id_2']").send_keys(two)  # send 2nd row data
browser.find_element_by_xpath("//*[@id='content-column']/div[4]/form/div[1]/span/span[1]").click()
browser.find_element_by_xpath("//*[@id='action-select-all']/span/span").click()
browser.find_element_by_xpath("//*[@id='action-delete']/span/span").click()
browser.find_element_by_xpath("//*[@id='btn_save']").click()  # last command to delete the 2nd row
Any input is welcome.
Thanks.
You don't need to assign each clipboard row to a variable - using split('\n') gives you a list with each row from the clipboard an item in that list. Using the list you can access each row by subscript, i.e. the position of the item in the list.
Also, you seem to want to remove all spaces from each clipboard row? I'm not sure whether this is what you intend: 'Hi there you' => 'Hithereyou'. You can do that in one go on the clipboard data before splitting it:
rows = root.clipboard_get().replace(' ', '').split('\n')
Or if you want to remove leading and trailing whitespace from each row after it has been split into rows:
rows = root.clipboard_get().split('\n')
rows = [row.strip() for row in rows] # removes leading and trailing whitespace from each row
print(rows)
print(rows[0]) # the first row
print(rows[1]) # the second row
Assuming the latter, if the clipboard contained e.g. (note the leading whitespace)
Some data from the first row
More data on the second row
The output would be:
['Some data from the first row', 'More data on the second row']
Some data from the first row
More data on the second row
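One robustness note: clipboard text copied on Windows often uses \r\n line endings, and a trailing newline yields an empty last row after splitting. A small sketch that handles both with str.splitlines() and a filter:

```python
clipboard_text = "  QWERTY123 \r\nABCDEF456\n"
rows = [row.strip() for row in clipboard_text.splitlines() if row.strip()]
# rows == ['QWERTY123', 'ABCDEF456'] regardless of \n vs \r\n endings
```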
So that covers the basics. Now, looking at the structure of your code, you do essentially the same thing for each clipboard row. That means that you can use a loop to iterate over each row and perform the required browser commands:
rows = root.clipboard_get().split('\n')
rows = [row.strip() for row in rows]
for row in rows:
    browser.get('https://wwww.corp/soft.html')
    time.sleep(2)
    browser.find_element_by_xpath("//*[@id='agent_list_filter_id_2']").clear()
    browser.find_element_by_xpath("//*[@id='agent_list_filter_id_2']").send_keys(row)
    browser.find_element_by_xpath("//*[@id='content-column']/div[4]/form/div[1]/span/span[1]").click()
    browser.find_element_by_xpath("//*[@id='action-select-all']/span/span").click()
    browser.find_element_by_xpath("//*[@id='action-delete']/span/span").click()
    browser.find_element_by_xpath("//*[@id='btn_save']").click()