I've got an Excel document with thousands of rows; each row represents a person's record. How can I use Python to extract each row and write that information onto a page in an MS Word document?
My goal is to create a document containing a pseudo-narrative of each record, with each record on its own page.
You can extract the content of the Excel file as a pandas DataFrame and then export the DataFrame into a Word document as a table. This is the generic syntax to achieve your objective:
import os
import pandas as pd
import win32com.client as win32  # needs PyWin32

# Read the Excel sheet into a DataFrame
xls_file = pd.ExcelFile('../data/example.xls')
df = xls_file.parse('Sheet1')

# Drive Word through COM automation
wordApp = win32.gencache.EnsureDispatch('Word.Application')
wordApp.Visible = False
doc = wordApp.Documents.Open(os.getcwd() + '\\template.docx')
rng = doc.Bookmarks("PUTTABLEHERE").Range

# Create the table; add one extra row so the column names can be written as a header
Table = rng.Tables.Add(rng, NumRows=df.shape[0] + 1, NumColumns=df.shape[1])
for col in range(df.shape[1]):
    # Write the column name into the header row
    Table.Cell(1, col + 1).Range.Text = str(df.columns[col])
    for row in range(df.shape[0]):
        # Write each value of the DataFrame into the body rows
        Table.Cell(row + 2, col + 1).Range.Text = str(df.iloc[row, col])
Hope this helps!
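If the goal stated in the question (each record as a pseudo-narrative on its own page) matters more than producing one big table, here is a minimal, untested sketch using python-docx instead of COM automation; the field wording is a placeholder you would adapt to your columns:

import pandas as pd
from docx import Document

df = pd.ExcelFile('../data/example.xls').parse('Sheet1')

doc = Document()
for _, record in df.iterrows():
    # one paragraph per field; adjust the wording/columns to match your data
    for col_name, value in record.items():
        doc.add_paragraph(f'{col_name}: {value}')
    doc.add_page_break()  # start the next record on a new page

doc.save('records.docx')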
I'm trying to print a pandas DataFrame as a .docx file in Python. Since the docx file will usually run to more than one page, I want the DataFrame's column names to be repeated at the top of every new page.
Currently my code just prints the whole DataFrame as is:
# add the header row
for j in range(t01.shape[-1]):
    table.cell(0, j).text = t01.columns[j]

# add the rest of the data frame
for i in range(t01.shape[0]):
    for j in range(t01.shape[-1]):
        table.cell(i + 1, j).text = str(t01.values[i, j])
What you're probably looking for is Word's Repeat Header Rows functionality (on the Table Layout tab in Word).
Since python-docx doesn't have that functionality yet, you can add the flag yourself. First, look it up in the OOXML schema: http://www.datypic.com/sc/ooxml/e-w_tblHeader-1.html
Rows declared as header rows repeat at the top of every page whenever the table can't fit on a single page. So what you need to do is declare the first row as a header row, which can be done like this:
from docx import Document
from docx.oxml import OxmlElement

doc = Document()
t = doc.add_table(rows=50, cols=2)

# set header values
t.cell(0, 0).text = 'A'
t.cell(0, 1).text = 'B'

tbl_header = OxmlElement('w:tblHeader')  # new oxml element flag indicating the row is a header row
first_row_props = t.rows[0]._element.get_or_add_trPr()  # get existing or create new table row properties element
first_row_props.append(tbl_header)  # now the first row is the header row

# fill the remaining rows
for i in range(1, len(t.rows)):
    for j in range(len(t.columns)):
        t.cell(i, j).text = f'i:{i}, j:{j}'

doc.save('t1.docx')
The end result: the 'A' / 'B' header row repeats at the top of every page the table spans.
I'm trying to extract data in the form of tables from Word files. I have to iterate through 500 Word files and extract a specific table from each one, but the table appears at a different position in each file. This is the code I have:
import pandas as pd
from docx.api import Document
import os

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(path + "\\" + filename)
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    table = wordDoc.tables[8]
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        row_data = tuple(text)
        data.append(row_data)

df = pd.DataFrame(data)
print(df)
This goes through all the files fine, but hits an error because some of the Word documents don't contain the table it looks for: it only indexes a fixed element, wordDoc.tables[8], so an IndexError appears. I want to change it so that it instead looks for a table with certain column titles:
CONTACT NAME POSITION LOCATION EMAIL TELEPHONE ASSET CLASS
Is there a way that I can modify the code shown to be able to find the tables I'm looking for?
Many thanks.
Instead of changing the logic to look up tables by certain column names, you can catch the IndexError and ignore it. This lets the loop continue without error when the table is not present in a document. It is done using try and except.
import pandas as pd
from docx.api import Document
import os

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    try:
        table = wordDoc.tables[8]
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            row_data = tuple(text)
            data.append(row_data)
    except IndexError:
        # this document has fewer than nine tables, so skip it
        continue

df = pd.DataFrame(data)
print(df)
Also, note that it is better to use os.path.join() when combining paths instead of concatenating the path strings.
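If you would rather select the table by its column titles, as asked, a rough sketch (untested; it assumes the titles sit in the first row of the target table and match exactly after stripping whitespace) could replace the fixed wordDoc.tables[8] lookup:

EXPECTED_HEADER = ["CONTACT NAME", "POSITION", "LOCATION", "EMAIL", "TELEPHONE", "ASSET CLASS"]

def find_table_by_header(word_doc, expected_header):
    # return the first table whose first-row cell texts match the expected titles
    for table in word_doc.tables:
        if not table.rows:
            continue
        first_row = [cell.text.strip() for cell in table.rows[0].cells]
        if first_row == expected_header:
            return table
    return None  # no matching table in this document

data = []
for wordDoc in worddocs_list:  # worddocs_list as built in the snippet above
    table = find_table_by_header(wordDoc, EXPECTED_HEADER)
    if table is None:
        continue
    for row in table.rows:
        data.append(tuple(cell.text for cell in row.cells))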
I am working on a project at work where I need to analyze over a thousand MS Word files, each containing the same table. From each table I just need to extract a few cells and turn them into a row; the rows will later be concatenated to create a DataFrame for further analysis.
I tested Python's docx library on one file and it managed to read the table. However, after plugging the same function into a for loop that builds a list of all the file names and passes each name to the Document function, the output is just one table, which is the first table in the list of files.
I have a feeling I'm not looking at this the right way; I would appreciate any guidance on this as I'm completely helpless now.
Following is the code I used; it consists mainly of code I stumbled upon on Stack Overflow:
import os
import pandas as pd
from docx import Document

file = [f for f in os.listdir() if f.endswith(".docx")]

for name in file:
    document = Document(name)
    table = document.tables[0]
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue
        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)
thanks
You are reinitializing the data list to [] (empty) for every document, so you carefully collect the row data from a document and then, on the next iteration, throw it away.
If you move data = [] outside the loop, then after iterating through the documents it will contain all the extracted rows.
data = []
for name in filenames:
    ...
    data.append(row_data)

print(data)
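Put back into the question's code, the corrected version looks roughly like this (same logic, with data initialized once and the result collected into a single DataFrame at the end):

import os
import pandas as pd
from docx import Document

filenames = [f for f in os.listdir() if f.endswith(".docx")]

data = []  # initialize once, before looping over the documents
for name in filenames:
    document = Document(name)
    table = document.tables[0]
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)  # header row becomes the dictionary keys
            continue
        data.append(dict(zip(keys, text)))

df = pd.DataFrame(data)  # one DataFrame row per table row, across all documents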
I was looking to insert a hyperlink into the text within a DOCX table. Currently the data I have contains the hyperlinks in markdown format.
Is there a way to add this as a hyperlink in my DOCX table? I'm aware that this can be done for paragraphs, but I was wondering if it could be done in tables.
Example code:
import pandas as pd
import docx

# Sample data
df = pd.DataFrame([["Some [string](www.google.com) is the link to google", 'IDnum1', "China"],
                   ["This is string 2", "IDnum3", "Australia"],
                   ["some string3", "IDNum5", "America"]],
                  columns=["customer string", "ID number", "Country"])

# DataFrame to docx table
# create a new document
doc = docx.Document()

# add a table to the end and create a reference variable
# extra row is so we can add the header row
t = doc.add_table(df.shape[0] + 1, df.shape[1])

# add the header row
for j in range(df.shape[-1]):
    t.cell(0, j).text = df.columns[j]

# add the rest of the data frame
for i in range(df.shape[0]):
    for j in range(df.shape[-1]):
        t.cell(i + 1, j).text = str(df.values[i, j])

# save the doc
doc.save('./test.docx')
Any help will be greatly appreciated!
Each cell has one or more paragraphs on cell.paragraphs. Whatever works on a paragraph outside a table will also work on a paragraph inside a table.
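python-docx has historically not offered a high-level way to add hyperlinks, so a common workaround is to build the w:hyperlink element by hand and append it to one of the cell's paragraphs. A rough, untested sketch (the add_hyperlink helper is an assumption, not part of python-docx):

import docx
from docx.opc.constants import RELATIONSHIP_TYPE as RT
from docx.oxml import OxmlElement
from docx.oxml.ns import qn

def add_hyperlink(paragraph, url, text):
    # register the target URL on the document part and get a relationship id
    r_id = paragraph.part.relate_to(url, RT.HYPERLINK, is_external=True)
    hyperlink = OxmlElement('w:hyperlink')
    hyperlink.set(qn('r:id'), r_id)
    run = OxmlElement('w:r')
    text_el = OxmlElement('w:t')
    text_el.text = text
    run.append(text_el)
    hyperlink.append(run)
    paragraph._p.append(hyperlink)

doc = docx.Document()
table = doc.add_table(rows=1, cols=1)
para = table.cell(0, 0).paragraphs[0]  # a cell always has at least one paragraph
para.add_run('Some ')
add_hyperlink(para, 'http://www.google.com', 'string')
para.add_run(' is the link to google')
doc.save('hyperlink_in_table.docx')

Note that the link text will not pick up Word's Hyperlink character style automatically; add run properties to the w:r element if you want the usual blue, underlined look. Splitting the markdown [text](url) out of your strings is a separate regex step.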
I don't understand how to import a Smartsheet sheet and convert it to a pandas DataFrame. I want to manipulate the data from Smartsheet; currently I go to Smartsheet, export to CSV, and import the CSV in Python, but I want to eliminate this step so that the process can run on a schedule.
import smartsheet
import pandas as pd
access_token ='#################'
smartsheet = Smartsheet(access_token)
sheet = smartsheet.sheets.get('Sheet 1')
pd.DataFrame(sheet)
Here is a simple method to convert a sheet to a dataframe:
def simple_sheet_to_dataframe(sheet):
    col_names = [col.title for col in sheet.columns]
    rows = []
    for row in sheet.rows:
        cells = []
        for cell in row.cells:
            cells.append(cell.value)
        rows.append(cells)
    data_frame = pd.DataFrame(rows, columns=col_names)
    return data_frame
The only issue with creating a DataFrame from Smartsheet is that for certain column types cell.value and cell.display_value differ. For example, contact columns will show either the name or the email address depending on which is used.
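If the rendered text is what you want in the DataFrame, a small variation (a sketch; it assumes display_value is populated for the column types you care about) is to prefer display_value and fall back to value:

import pandas as pd

def sheet_to_dataframe_display(sheet):
    col_names = [col.title for col in sheet.columns]
    rows = []
    for row in sheet.rows:
        # prefer the rendered text; fall back to the raw value when there is none
        rows.append([cell.display_value if cell.display_value is not None else cell.value
                     for cell in row.cells])
    return pd.DataFrame(rows, columns=col_names)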
Here is a snippet of what I use when I need to pull data from Smartsheet into pandas. Note that I've included garbage collection, as I regularly work with dozens of sheets at or near the 200,000-cell limit.
import smartsheet
import pandas as pd
import gc

configs = {'api_key': 0000000,
           'value_cols': ['Assigned User']}

class SmartsheetConnector:
    def __init__(self, configs):
        self._cfg = configs
        self.ss = smartsheet.Smartsheet(self._cfg['api_key'])
        self.ss.errors_as_exceptions(True)

    def get_sheet_as_dataframe(self, sheet_id):
        sheet = self.ss.Sheets.get_sheet(sheet_id)
        col_map = {col.id: col.title for col in sheet.columns}
        # columns: sheet id, row id, then cell values or display values
        data_frame = pd.DataFrame([[sheet.id, row.id] +
                                   [cell.value if col_map[cell.column_id] in self._cfg['value_cols']
                                    else cell.display_value for cell in row.cells]
                                   for row in sheet.rows],
                                  columns=['Sheet ID', 'Row ID'] +
                                          [col.title for col in sheet.columns])
        del sheet, col_map
        gc.collect()  # force garbage collection
        return data_frame

    def get_report_as_dataframe(self, report_id):
        rprt = self.ss.Reports.get_report(report_id, page_size=0)
        page_count = int(rprt.total_row_count / 10000) + 1
        col_map = {col.virtual_id: col.title for col in rprt.columns}
        data = []
        for page in range(1, page_count + 1):
            rprt = self.ss.Reports.get_report(report_id, page_size=10000, page=page)
            data += [[row.sheet_id, row.id] +
                     [cell.value if col_map[cell.virtual_column_id] in self._cfg['value_cols']
                      else cell.display_value for cell in row.cells]
                     for row in rprt.rows]
            del rprt
        data_frame = pd.DataFrame(data, columns=['Sheet ID', 'Row ID'] + list(col_map.values()))
        del col_map, page_count, data
        gc.collect()
        return data_frame
This adds additional columns for sheet and row IDs so that I can write back to Smartsheet later if needed.
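As an illustration of why those IDs are handy, writing a value back with the Smartsheet SDK might look roughly like this (a sketch; the token, column id, row id, and sheet id below are placeholders):

import smartsheet

ss = smartsheet.Smartsheet('api_key_here')  # placeholder token

# build a replacement cell and attach it to the row to update
new_cell = smartsheet.models.Cell()
new_cell.column_id = 1234567890123456  # placeholder: the target column's id
new_cell.value = 'Updated value'

updated_row = smartsheet.models.Row()
updated_row.id = 2345678901234567  # the 'Row ID' value from the dataframe
updated_row.cells.append(new_cell)

ss.Sheets.update_rows(3456789012345678, [updated_row])  # the 'Sheet ID' value from the dataframe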
Sheets cannot be retrieved by name the way your example code tries to. It is entirely possible for you to have multiple sheets with the same name, so you must retrieve a sheet by its sheetId number.
For example:
sheet = smartsheet_client.Sheets.get_sheet(4583173393803140) # sheet_id
http://smartsheet-platform.github.io/api-docs/#get-sheet
Smartsheet sheets have a lot of properties associated with them. You'll need to go through the rows and columns of your sheet to retrieve the information you're looking for, and construct it in a format your other system can recognize.
The API docs contain a listing of properties and examples. As a minimal example:
for row in sheet.rows:
    for cell in row.cells:
        # Do something with cell.object_value here
Get the sheet as a csv:
(https://smartsheet-platform.github.io/api-docs/?python#get-sheet-as-excel-pdf-csv)
smartsheet_client.Sheets.get_sheet_as_csv(
    1531988831168388,  # sheet_id
    download_directory_path)
Read the csv into a DataFrame:
(https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
pandas.read_csv
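Putting the two steps together (a sketch; get_sheet_as_csv saves the file into the given directory under the sheet's name, so the exact filename below is an assumption):

import os
import smartsheet
import pandas as pd

smartsheet_client = smartsheet.Smartsheet('access_token_here')  # placeholder token

download_dir = '/tmp/smartsheet'
os.makedirs(download_dir, exist_ok=True)

# download the sheet as a CSV file into download_dir
smartsheet_client.Sheets.get_sheet_as_csv(1531988831168388, download_dir)

# the file is named after the sheet; adjust to your sheet's actual name
df = pd.read_csv(os.path.join(download_dir, 'Sheet 1.csv'))
print(df.head())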
You can use the smartsheet-dataframe library.
It's very easy to use and allows sheets or reports to be delivered as a DataFrame.
pip install smartsheet-dataframe
Get a report as df
from smartsheet_dataframe import get_as_df, get_report_as_df

df = get_report_as_df(token='smartsheet_auth_token',
                      report_id=report_id_int)

Get a sheet as df

from smartsheet_dataframe import get_as_df, get_sheet_as_df

df = get_sheet_as_df(token='smartsheet_auth_token',
                     sheet_id=sheet_id_int)
replace 'smartsheet_auth_token' with your token (numbers and letters)
replace sheet_id_int with your sheet/report id (numbers only)