Search through Word tables for certain text with python-docx

I have some code that reads a table in a Word document and makes a dataframe from it.
import numpy as np
import pandas as pd
from docx import Document

#### Time for some old fashioned user functions ####
def make_dataframe(f_name, table_loc):
    document = Document(f_name)
    table = document.tables[table_loc]
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    df = pd.DataFrame.from_dict(data)
    return df

SHRD_filename = "SHRD - 12485.docx"
SHDD_filename = "SHDD - 12485.docx"

df_SHRD = make_dataframe(SHRD_filename, 30)
df_SHDD = make_dataframe(SHDD_filename, -60)
The files are different: the SHRD file has 32 tables and the one I am looking for is the second to last, while the SHDD file has 280 tables and the one I need is the 60th from the end. But that may not always be the case.
How do I search through the tables in a document and start working on the one where cell (0, 0) equals 'Tag Numbers'?

You can iterate through the tables and check the text in the first cell. I have modified the output to return a list of dataframes, just in case more than one table is found. It will return an empty list if no table meets the criteria.
def make_dataframe(f_name, first_cell_string='tag number'):
    document = Document(f_name)
    # create a list of all of the table objects whose first cell's
    # text equals `first_cell_string`
    tables = [t for t in document.tables
              if t.cell(0, 0).text.lower().strip() == first_cell_string]
    # in case more than one table is found
    out = []
    for table in tables:
        data = []
        keys = None
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        out.append(pd.DataFrame.from_dict(data))
    return out
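The header/row zipping used above can be isolated into a small helper and exercised on plain lists. This is a minimal sketch; the name `rows_to_records` and the sample data are illustrative, not part of python-docx:

```python
def rows_to_records(rows):
    """Treat the first row as headers and map each later row to a dict."""
    keys = tuple(rows[0])
    return [dict(zip(keys, row)) for row in rows[1:]]

# Plain lists standing in for extracted cell text:
records = rows_to_records([["Tag Numbers", "Description"],
                           ["T-100", "Inlet valve"]])
# records == [{"Tag Numbers": "T-100", "Description": "Inlet valve"}]
```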

Related

How do I extract tables from word via table titles?

I'm facing the problem of trying to extract data from word files in the form of tables. I have to iterate through 500 word files and extract a specific table in each file, but the table appears at a different point in each word file. This is the code I have:
import pandas as pd
from docx.api import Document
import os

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(path + "\\" + filename)
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    table = wordDoc.tables[8]
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        row_data = tuple(text)
        data.append(row_data)

df = pd.DataFrame(data)
print(df)
This goes through all the files fine, but raises an IndexError for documents that do not contain the table it looks for, because it indexes a fixed element: wordDoc.tables[8]. I want to change it to instead look for a table with certain column titles:
CONTACT NAME POSITION LOCATION EMAIL TELEPHONE ASSET CLASS
Is there a way that I can modify the code shown to be able to find the tables I'm looking for?
Many thanks.
Instead of changing the logic to look up tables with certain column names you can catch the index error and ignore it. This will enable you to continue without error when that table is not present in a document. This is done using try and except.
import pandas as pd
from docx.api import Document
import os

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    try:
        table = wordDoc.tables[8]
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            row_data = tuple(text)
            data.append(row_data)
    except IndexError:
        continue

df = pd.DataFrame(data)
print(df)
Also, note that it is better to use os.path.join() when combining paths instead of concatenating the path strings.
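If you do want to select by column titles rather than swallow the IndexError, a small search helper works. This is a sketch under stated assumptions: the function name `find_table_by_headers` is my own, and it assumes the titles occupy the table's first row:

```python
def find_table_by_headers(tables, headers):
    """Return the first table whose header row matches `headers`.

    `tables` can be any iterable of python-docx Table objects
    (e.g. wordDoc.tables); matching ignores case and surrounding whitespace.
    """
    wanted = [h.strip().upper() for h in headers]
    for table in tables:
        first_row = [cell.text.strip().upper() for cell in table.rows[0].cells]
        if first_row == wanted:
            return table
    return None

# usage sketch:
# table = find_table_by_headers(wordDoc.tables,
#                               ["CONTACT NAME", "POSITION", "LOCATION",
#                                "EMAIL", "TELEPHONE", "ASSET CLASS"])
# if table is not None:
#     ...  # extract rows as before
```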

Extract Table information from .docx in the form of dictionary and dataframe using python

I am trying to read a Word document and extract all table information in the form of JSON or a dataframe of key/value pairs.
A sample image of the tables in the Word document is below.
Expected Output:
Below is the code I tried out but the mapping is not done properly.
big_data = []
for table in document.tables:
    data = []
    keys = None
    for i, column in enumerate(table.columns):
        text = (cell.text for cell in column.cells)
        if i == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    #print(data)
    big_data.append(data)
The output of the above code is huge, so I am pasting only a small sample of it here.
Could you please help me achieve the expected output?
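For key/value style tables (label in the first cell of each row, value in the second), iterating rows rather than columns usually gives the mapping directly. A minimal sketch, assuming a two-column layout; the helper name `table_to_dict` is mine:

```python
def table_to_dict(table):
    """Map each row of a two-column python-docx table to {key: value}."""
    return {row.cells[0].text.strip(): row.cells[1].text.strip()
            for row in table.rows}

# all_tables = [table_to_dict(t) for t in document.tables]
```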

How to extract a Word table from multiple files using python docx

I am working on a project at work where I need to analyze over a thousand MS Word files, each containing the same table. From each table I just need to extract a few cells and turn them into a row that will later be concatenated to create a dataframe for further analysis.
I tested Python's docx library on one file and it managed to read the table. However, after plugging the same code into a for loop that builds a list of all the file names and passes each one to the Document function, the output is just one table: the first table in the list of files.
I have a feeling I'm not looking at this the right way; I would appreciate any guidance as I'm completely stuck.
Following is the code I used; it consists mainly of code I stumbled upon on Stack Overflow:
import os
import pandas as pd
from docx import Document

file = [f for f in os.listdir() if f.endswith(".docx")]
for name in file:
    document = Document(name)
    table = document.tables[0]
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue
        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)
thanks
You are reinitializing the data list to [] (empty) for every document. So you carefully collect the row-data from a document and then in the next step throw it away.
If you move data = [] outside the loop then after iterating through the documents it will contain all the extracted rows.
data = []
for name in file:
    ...
    data.append(row_data)
print(data)
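Putting the fix together, the per-table extraction can live in a small helper so the loop over files stays minimal. A sketch under stated assumptions: the helper name `table_records` is mine, and it assumes the first table of each document carries the header row:

```python
def table_records(table):
    """Turn a python-docx table into a list of dicts keyed by the header row."""
    rows = iter(table.rows)
    keys = tuple(cell.text for cell in next(rows).cells)
    return [dict(zip(keys, (cell.text for cell in row.cells)))
            for row in rows]

# data = []
# for name in file:
#     data.extend(table_records(Document(name).tables[0]))
# df = pd.DataFrame(data)
```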

loop through words in a csv file and replace in python

I have a csv file with three columns, namely (cid, ccontent, value). I want to loop through each word in the ccontent column and translate the words individually.
I found this code for translating a whole row, but I want to translate each word, not the row:
How to write a function in Python that translates each row of a csv to another language?
from googletrans import Translator
import pandas as pd

headers = ['A', 'B', 'A_translation', 'B_translation']
data = pd.read_csv('./data.csv')
translator = Translator()

# Init empty dataframe with as many rows as `data`
df = pd.DataFrame(index=range(0, len(data)), columns=headers)

def translate_row(row):
    '''Translate elements A and B within `row`.'''
    a = translator.translate(row[0], dest='fr')
    b = translator.translate(row[1], dest='fr')
    return pd.Series([a.origin, b.origin, a.text, b.text], headers)

for i, row in enumerate(data.values):
    # Fill the empty dataframe with the resulting series.
    df.loc[i] = translate_row(row)

print(df)
Thank you
You can try something along these lines, using list comprehensions. Note that each cell must be split into words first; iterating over a string directly would iterate over its characters:
def translate_row(row):
    row0bywords = [translator.translate(w, dest='fr') for w in row[0].split()]
    row1bywords = [translator.translate(w, dest='fr') for w in row[1].split()]
    return row0bywords, row1bywords
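The word-splitting logic can be separated from the network call, which also makes it easy to test. A sketch with a stand-in translator; `translate_word` here is a hypothetical placeholder for `translator.translate(word, dest='fr').text`:

```python
def translate_word(word):
    # Stand-in for translator.translate(word, dest='fr').text,
    # so the split/rejoin logic can run without a network call.
    return word.upper()

def translate_cell(text):
    """Translate a cell word by word and rejoin with spaces."""
    return ' '.join(translate_word(w) for w in str(text).split())

# data['ccontent_translation'] = data['ccontent'].apply(translate_cell)
```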

Import Excel Tables into pandas dataframe

I would like to import Excel tables (made using the Excel 2007 and above table feature) from a workbook into separate dataframes. Apologies if this has been asked before, but from my searches I couldn't find what I wanted. I know you can easily do this using the read_excel function; however, this requires specifying a sheet name, or it returns a dict of dataframes, one per sheet.
Instead of specifying a sheet name, I was wondering whether there was a way of specifying a table name, or better yet returning a dict of dataframes, one for each table in the workbook.
I know this can be done by combining xlwings with pandas but was wondering whether this was built-into any of the pandas functions already (maybe ExcelFile).
Something like this:
import pandas as pd

xls = pd.ExcelFile('excel_file_path.xls')

# to read all tables into a map
tables_to_df_map = {}
for table_name in xls.table_names:
    tables_to_df_map[table_name] = xls.parse(table_name)
Although not exactly what I was after, I have found a way to get table names, with the caveat that you still have to specify the sheet name.
Here's an excerpt from the code that I'm currently using:
import pandas as pd
import openpyxl as op

wb = op.load_workbook(file_location)

# Connecting to the specified worksheet
ws = wb[sheetname]

# Initialising an empty list where the excel tables will be imported into
var_tables = []

# Importing table details from excel: Table_Name and Sheet_Range
for table in ws._tables:
    sht_range = ws[table.ref]
    data_rows = []
    i = 0
    j = 0
    for row in sht_range:
        j += 1
        data_cols = []
        for cell in row:
            i += 1
            data_cols.append(cell.value)
            if (i == len(row)) & (j == 1):
                data_cols.append('Table_Name')
            elif i == len(row):
                data_cols.append(table.name)
        data_rows.append(data_cols)
        i = 0
    var_tables.append(data_rows)

# Creating an empty list where all the dfs will be appended into
var_df = []

# Appending each table extracted from excel into the list
for tb in var_tables:
    df = pd.DataFrame(tb[1:], columns=tb[0])
    var_df.append(df)

# Merging all into one big df
df = pd.concat(var_df, axis=1)  # this merges on columns
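To get the dict-of-dataframes keyed by table name that the question asked for, newer openpyxl versions expose tables publicly via ws.tables (its .items() yields (name, range) pairs), avoiding the private _tables attribute. A sketch assuming a reasonably recent openpyxl; the function name `tables_to_dfs` is mine:

```python
import pandas as pd

def tables_to_dfs(ws):
    """Return {table_name: DataFrame} for every Excel table on worksheet `ws`,
    using each table's first row as the column headers."""
    out = {}
    for name, ref in ws.tables.items():  # e.g. ('Table1', 'A1:B3')
        rows = [[cell.value for cell in row] for row in ws[ref]]
        out[name] = pd.DataFrame(rows[1:], columns=rows[0])
    return out
```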
