python-docx: Hyperlink text from markdown in a table

I want to insert a hyperlink into text inside a DOCX table. The data I have contains the hyperlinks in markdown format.
Is there a way to add these as hyperlinks in my DOCX table? I'm aware that this can be done for paragraphs, but I was wondering if it could be done in tables.
Example code:
import pandas as pd
import docx
# Sample Data
df = pd.DataFrame([["Some [string](www.google.com) is the link to google",'IDnum1',"China"],
["This is string 2","IDnum3","Australia"],
["some string3","IDNum5","America"]], columns = ["customer string","ID number","Country"])
# Docx DF to table
# create a new, empty document
doc = docx.Document()
# add a table to the end and create a reference variable
# extra row is so we can add the header row
t = doc.add_table(df.shape[0]+1, df.shape[1])
# add the header row
for j in range(df.shape[-1]):
    t.cell(0,j).text = df.columns[j]
# add the rest of the data frame
for i in range(df.shape[0]):
    for j in range(df.shape[-1]):
        t.cell(i+1,j).text = str(df.values[i,j])
# save the doc
doc.save('./test.docx')
Any help will be greatly appreciated!

Each cell has one or more paragraphs on cell.paragraphs. Whatever works on a paragraph outside a table will also work on a paragraph inside a table.
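As a rough sketch of how that could look for the markdown case: the text is split into plain and linked segments, and each segment is appended to the cell's first paragraph. The helper names (split_markdown_links, add_hyperlink, set_cell_markdown) are made up for this example, and the hyperlink part is the commonly used workaround of building the w:hyperlink element by hand, not a documented python-docx call for creating links. It only handles simple [text](url) links.

```python
import re

# Matches simple markdown links such as [string](www.google.com)
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def split_markdown_links(text):
    """Split text into (chunk, url) pairs; url is None for plain text."""
    segments = []
    pos = 0
    for match in MD_LINK.finditer(text):
        if match.start() > pos:
            segments.append((text[pos:match.start()], None))
        segments.append((match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(text):
        segments.append((text[pos:], None))
    return segments


def add_hyperlink(paragraph, text, url):
    """Append a clickable run to a paragraph (hypothetical helper)."""
    # Imported here so the parsing helper above works without python-docx.
    from docx.opc.constants import RELATIONSHIP_TYPE
    from docx.oxml import OxmlElement
    from docx.oxml.ns import qn

    # Register the URL as an external relationship of the document part.
    r_id = paragraph.part.relate_to(
        url, RELATIONSHIP_TYPE.HYPERLINK, is_external=True
    )
    hyperlink = OxmlElement("w:hyperlink")
    hyperlink.set(qn("r:id"), r_id)

    run = OxmlElement("w:r")
    run_props = OxmlElement("w:rPr")
    color = OxmlElement("w:color")
    color.set(qn("w:val"), "0563C1")  # blue, so the link is visible as a link
    underline = OxmlElement("w:u")
    underline.set(qn("w:val"), "single")
    run_props.append(color)
    run_props.append(underline)
    run.append(run_props)

    text_el = OxmlElement("w:t")
    text_el.text = text
    run.append(text_el)
    hyperlink.append(run)
    paragraph._p.append(hyperlink)


def set_cell_markdown(cell, text):
    """Write text into a table cell, turning markdown links into hyperlinks."""
    paragraph = cell.paragraphs[0]
    for chunk, url in split_markdown_links(text):
        if url is None:
            paragraph.add_run(chunk)
        else:
            add_hyperlink(paragraph, chunk, url)
```

In the loop from the question, t.cell(i+1,j).text = ... would then become set_cell_markdown(t.cell(i+1,j), str(df.values[i,j])). Note that a target such as www.google.com has no scheme, so Word may treat it as a relative link; prefixing http:// or https:// first is probably safer.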

Related

How to add the column headers on every page on python-docx?

I'm trying to write a Pandas data frame to a .docx file in Python. Since the table will usually span more than one page, I want the column names of the data frame to be repeated at the top of every new page.
Currently my code just writes the whole data frame as is:
# add the header rows.
for j in range(t01.shape[-1]):
    table.cell(0,j).text = t01.columns[j]
# add the rest of the data frame
for i in range(t01.shape[0]):
    for j in range(t01.shape[-1]):
        table.cell(i+1,j).text = str(t01.values[i,j])
What you're probably looking for is Word's Repeat Header Rows functionality.
Since python-docx doesn't have that functionality yet, you can add the flag yourself. First, look it up in the OOXML schema: http://www.datypic.com/sc/ooxml/e-w_tblHeader-1.html
Note that rows declared as header rows repeat at the beginning of every page if the table can't fit onto a single page. So what you need to do is declare the first row as a header row. That can be done like this:
from docx import Document
from docx.oxml import OxmlElement
doc = Document()
t = doc.add_table(rows=50, cols=2)
# set header values
t.cell(0, 0).text = 'A'
t.cell(0, 1).text = 'B'
# create a new oxml element flag which indicates that the row is a header row
tbl_header = OxmlElement('w:tblHeader')
# get the table row properties element, creating it if it does not exist
first_row_props = t.rows[0]._element.get_or_add_trPr()
# now the first row is the header row
first_row_props.append(tbl_header)
for i in range(1, len(t.rows)):
    for j in range(len(t.columns)):
        t.cell(i, j).text = f'i:{i}, j:{j}'
doc.save('t1.docx')

How do I extract tables from word via table titles?

I'm facing the problem of trying to extract data from word files in the form of tables. I have to iterate through 500 word files and extract a specific table in each file, but the table appears at a different point in each word file. This is the code I have:
import pandas as pd
from docx.api import Document
import os
os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(path+"\\"+filename)
    worddocs_list.append(wordDoc)
data = []
for wordDoc in worddocs_list:
    table = wordDoc.tables[8]
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        row_data = (text)
        data.append(row_data)
df = pd.DataFrame(data)
print(df)
This goes through all the files fine, but raises an IndexError because some of the Word documents do not have the table it looks for: it only ever looks at the fixed position wordDoc.tables[8]. I want to change this so that it instead looks for a table with certain column titles:
CONTACT NAME POSITION LOCATION EMAIL TELEPHONE ASSET CLASS
Is there a way that I can modify the code shown to be able to find the tables I'm looking for?
Many thanks.
Instead of changing the logic to look up tables with certain column names you can catch the index error and ignore it. This will enable you to continue without error when that table is not present in a document. This is done using try and except.
import pandas as pd
from docx.api import Document
import os
os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'
worddocs_list = []
for filename in os.listdir(path):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)
data = []
for wordDoc in worddocs_list:
    try:
        table = wordDoc.tables[8]
    except IndexError:
        # this document has fewer than nine tables, so skip it
        continue
    for row in table.rows:
        data.append([cell.text for cell in row.cells])
df = pd.DataFrame(data)
print(df)
Also, note that it is better to use os.path.join() when combining paths instead of concatenating the path strings.
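If you do want to look the table up by its column titles, as the question asks, a minimal sketch could compare the first row of each table against the expected titles. The function names here are made up for the example, and the way the titles are split into EXPECTED_HEADER is an assumption, since the question lists them on one line. Matching ignores case and surrounding whitespace.

```python
# Assumed split of the column titles from the question; adjust as needed.
EXPECTED_HEADER = [
    "CONTACT NAME", "POSITION", "LOCATION",
    "EMAIL", "TELEPHONE", "ASSET CLASS",
]


def header_cells(table):
    """Return the normalized text of the cells in a table's first row."""
    return [cell.text.strip().upper() for cell in table.rows[0].cells]


def find_table_by_header(tables, expected_header):
    """Return the first table whose first row matches, or None if none does."""
    expected = [title.strip().upper() for title in expected_header]
    for table in tables:
        if len(table.rows) == 0:
            continue  # nothing to compare against
        if header_cells(table) == expected:
            return table
    return None
```

Inside the loop, table = find_table_by_header(wordDoc.tables, EXPECTED_HEADER) would replace wordDoc.tables[8], with a check for None to skip documents that do not contain the table.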

Extract a table from a Word document which is between certain text, docx.api, Python

I am trying to extract tables from a document of about 100 pages that is updated every week. The table headings stay the same, but the data inside the tables changes every week. There are approx. 20-30 tables on different pages that need to be extracted. Every table has a heading above it and a line of text after it. How can I extract the table that sits between the heading and the ending text? For example, the table heading is
"This is a annual table x123"
<table>
and then the ending text is "the above table is xxxx"
This is one example. I need to search by the heading text for each table and then extract the table underneath it.
Currently the code I am using extracts all tables from the document.
from docx.api import Document
import pandas as pd
document = Document("C:/Users/user123/Desktop/Python/python_truncated_tables.docx")
tables = document.tables
rows = []
for table in document.tables:
    for row in table.rows:
        rows.append([cell.text for cell in row.cells])
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
df = pd.DataFrame(rows, columns=["Column1", "Column2","Column3","Column4","Column5", "Column6","Column7","Column8","Column9"])
df.to_excel("C:/Users/user123/Desktop/Python/pythonoutput1.xlsx")
print(df)

How to add text from documents in a folder to an array

Good afternoon. Unfortunately, I did not find an answer to a simple question. I have a folder of documents in PDF format. I can use Pandas to open one document and add its text to an array, where the first column is the folder name and the second is the text from the document. But how do you do this for all documents in a folder? Alas, I don't know.
category    text
test        first document
test        second document
test        ...
Assuming you put the code you already have into a function that takes in a file name and the DataFrame you have so far, it's pretty easy to do what you want:
import os
import pandas as pd
dataframe = pd.DataFrame()
files = os.listdir("[path/to/folder/]")
for file in files:
    dataframe = addFileToTable(file, dataframe)
If you're not sure how to add a new row to the dataframe:
def addFileToTable(file, dataframe):
    # Convert PDF to array
    # ...
    row = {"category" : array[0], "text" : array[1]}
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([dataframe, pd.DataFrame([row])], ignore_index=True)
    return df
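A variation that may scale better, sketched under some assumptions: collect one row per file in a plain list and build the DataFrame once at the end, instead of growing it inside the loop. The function name collect_documents and its extract_text parameter are hypothetical; extract_text stands for whatever code you already have that turns one PDF into text.

```python
from pathlib import Path


def collect_documents(folder, extract_text):
    """Return one {"category", "text"} row per PDF file in folder.

    extract_text is a caller-supplied function (hypothetical here) that
    takes a file path and returns the text of that document.
    """
    folder = Path(folder)
    rows = []
    for pdf_path in sorted(folder.glob("*.pdf")):
        rows.append({
            "category": folder.name,  # the folder name is the category
            "text": extract_text(pdf_path),
        })
    return rows
```

The result can be passed straight to pandas, for example pd.DataFrame(collect_documents(folder, my_pdf_reader)), where folder and my_pdf_reader are your own path and extraction function.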

Python: print each excel row as its own ms word page

I've got an excel document with thousands of rows, each row represents a person's record. How can I use Python to extract each row and write that information into an MS Word page?
My goal is to create a doc containing a pseudo-narrative of each record on its own page.
You can read the content of the Excel file into a Pandas data frame and then export the data frame into a Word document as a table. This is the general pattern:
import os
import pandas as pd
import win32com.client as win32  # Needs PyWin32
xls_file = pd.ExcelFile('../data/example.xls')
df = xls_file.parse('Sheet1')
wordApp = win32.gencache.EnsureDispatch('Word.Application')
wordApp.Visible = False
doc = wordApp.Documents.Open(os.getcwd()+'\\template.docx')
rng = doc.Bookmarks("PUTTABLEHERE").Range
# creating Table
# add one more row in table at word because you want to add column names as header
Table=rng.Tables.Add(rng,NumRows=df.shape[0]+1,NumColumns=df.shape[1])
for col in range(df.shape[1]):
    # Writing column names
    Table.Cell(1,col+1).Range.Text=str(df.columns[col])
    for row in range(df.shape[0]):
        # writing each value of data frame
        Table.Cell(row+1+1,col+1).Range.Text=str(df.iloc[row,col])
Hope this helps!
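If the goal is one record per page rather than one big table, a possible alternative is python-docx with a page break between records. This is only a sketch: record_to_lines and write_records are made-up names, and the "field: value" layout is an assumption about what the pseudo-narrative should look like.

```python
def record_to_lines(record):
    """Turn one record (a dict of field -> value) into lines of text."""
    return [f"{field}: {value}" for field, value in record.items()]


def write_records(records, out_path):
    """Write each record onto its own page of a new Word document."""
    # Imported here so record_to_lines can be used without python-docx.
    from docx import Document

    doc = Document()
    for index, record in enumerate(records):
        if index > 0:
            doc.add_page_break()  # start every record after the first on a new page
        for line in record_to_lines(record):
            doc.add_paragraph(line)
    doc.save(out_path)
```

With the data frame from above, this could be called as write_records(df.to_dict("records"), "records.docx"); the output file name is just an example.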
