I'm trying to extract data from Word files in the form of tables. I have to iterate through 500 Word files and extract a specific table from each file, but the table appears at a different point in each document. This is the code I have:
import pandas as pd
from docx.api import Document
import os

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(path + "\\" + filename)
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    table = wordDoc.tables[8]
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        row_data = tuple(text)
        data.append(row_data)

df = pd.DataFrame(data)
print(df)
This goes through all the files fine, but raises an error because some of the Word documents do not contain the table it looks for: it only looks up a fixed element, wordDoc.tables[8], so an IndexError appears. I want to change it so that it instead looks for a table with certain column titles:
CONTACT NAME POSITION LOCATION EMAIL TELEPHONE ASSET CLASS
Is there a way that I can modify the code shown to be able to find the tables I'm looking for?
Many thanks.
Instead of changing the logic to look up tables by column names, you can catch the IndexError and ignore it. This lets the loop continue without error when that table is not present in a document, using try and except.
import pandas as pd
from docx.api import Document
import os

os.chdir('C:\\Users\\user1\\test')
path = 'C:\\Users\\user1\\test'

worddocs_list = []
for filename in list(os.listdir(path)):
    wordDoc = Document(os.path.join(path, filename))
    worddocs_list.append(wordDoc)

data = []
for wordDoc in worddocs_list:
    try:
        table = wordDoc.tables[8]
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            row_data = tuple(text)
            data.append(row_data)
    except IndexError:
        # this document has no ninth table; skip it
        continue

df = pd.DataFrame(data)
print(df)
Also, note that it is better to use os.path.join() when combining paths instead of concatenating the path strings.
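That said, if you do want to locate the table by its column titles as originally asked, a small helper can scan document.tables for a matching header row. This is only a sketch: the header list is taken from the question, and find_contact_table is a hypothetical helper name.

```python
# Sketch: locate a table by its header row instead of a fixed index.
# EXPECTED_HEADERS comes from the question; find_contact_table is a
# hypothetical name, not part of python-docx.
EXPECTED_HEADERS = ["CONTACT NAME", "POSITION", "LOCATION",
                    "EMAIL", "TELEPHONE", "ASSET CLASS"]

def find_contact_table(doc, headers=EXPECTED_HEADERS):
    """Return the first table whose first-row cell texts match `headers`,
    or None if no table in the document qualifies."""
    for table in doc.tables:
        if not table.rows:
            continue
        first_row = [cell.text.strip().upper() for cell in table.rows[0].cells]
        if first_row == headers:
            return table
    return None
```

With python-docx, table = find_contact_table(wordDoc) would then replace wordDoc.tables[8], and documents without the table simply yield None instead of raising an IndexError.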
I have an Excel file with a column 'id' with product IDs and column 'imagelinks', where cells contain URL links to images for those products. Cells with links can contain several links, separated with commas.
I am trying to add the list of URLs from the Excel file to a list in Python, from which I can then download the images to my computer. However, I can't seem to return the list.
I have added an image of the Excel file (see link).
Here is my code so far:
import pandas as pd
import requests
import os

productID = '65212380'

df = pd.read_excel("OLD_AP_Web_imageLinks_CatID.xlsx")
imagelinks = []
df_productID = df[df["id"] == productID]
for row in df_productID.iterrows():
    imagelinks.append(df_productID['imagelinks'])

dest_dir = f'//Users/ljimac/Documents/Website/{productID}'
try:
    os.mkdir(dest_dir)
except FileExistsError as e:
    print('The file path already exists!')
os.chdir(dest_dir)

for image in imagelinks:
    file_name = image.split('/')[-2]
    with open(file_name, 'wb') as f:
        im = requests.get(image)
        f.write(im.content)
I actually resolved getting the list by using str.split().
Here is the code that is working for me for the first sheet:
df = pd.read_excel("OLD_AP_Web_imageLinks_CatID.xlsx")
imagelinks = []
# use str.split on the column
df['imagelinks'] = df['imagelinks'].str.split(',')
df_productID = df[df["id"] == productID]
for row in df_productID.iterrows():
    imagelinks.append(df.loc[0, 'imagelinks'])
print(imagelinks)
However, it only works for the first sheet in the Excel file, not for multiple sheets. As soon as I add sheet_name=None to the read_excel method, the str.split() stops working. I tried a workaround by saving the DataFrame to a new Excel file, but the search parameters still only work for the first sheet in the workbook.
Here is my latest code:
df = pd.read_excel("OLD_AP_Web_imageLinks_CatID.xlsx", sheet_name=None)
with pd.ExcelWriter('image_links.xlsx') as writer:  # <- HERE
    for name, df in df.items():
        df['imagelinks'] = df['imagelinks'].str.split(',')
        df.to_excel(writer, sheet_name=name, index=False)

imagelinks = []
df_new = pd.read_excel('image_links.xlsx', sheet_name=None)
df_productID = df[df_new['id'] == productID]
for row in df_productID.iterrows():
    imagelinks = (df.loc[0, 'imagelinks'])
print(imagelinks)
I can get the image links for the first sheet. If the product code is in any other sheet, I get this error:
File "/Users/ljimac/Documents/Coding_Projects/PWdemo/soup.py", line 63, in <module>
df_productID = df[df_new['id'] == productID]
KeyError: 'id'
What am I missing?
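Most likely the missing piece is that with sheet_name=None, pd.read_excel returns a dict of {sheet name: DataFrame}, not a single frame, so df_new['id'] raises KeyError. One way around it is to concatenate every sheet into one frame before filtering. A minimal sketch, with in-memory frames standing in for the workbook (column names taken from the question, the id values are made up):

```python
import pandas as pd

# "sheets" stands in for pd.read_excel("...", sheet_name=None),
# which returns a dict of {sheet_name: DataFrame}
sheets = {
    "Sheet1": pd.DataFrame({"id": ["65212380"], "imagelinks": ["a.jpg,b.jpg"]}),
    "Sheet2": pd.DataFrame({"id": ["99999999"], "imagelinks": ["c.jpg"]}),
}

# stack every sheet into one frame, then filter and split as before
all_rows = pd.concat(sheets.values(), ignore_index=True)
all_rows["imagelinks"] = all_rows["imagelinks"].str.split(",")

match = all_rows[all_rows["id"] == "99999999"]
imagelinks = match["imagelinks"].iloc[0]
print(imagelinks)  # -> ['c.jpg'], even though the id is on the second sheet
```

This also removes the need to write and re-read the intermediate image_links.xlsx file.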
I am working on a project at work where I need to analyze over a thousand MS Word files, each consisting of the same table. From each table I just need to extract a few cells and turn them into a row; these rows will later be concatenated to create a dataframe for further analysis.
I tested Python's docx library on one file and it managed to read the table. However, after plugging the same function inside a for loop that builds a list of all the file names and passes each one to the Document function, the output is just one table: the first table in the list of files.
I have a feeling I'm not looking at this the right way, and I would appreciate any guidance, as I'm completely stuck now.
The following is the code I used; it consists mainly of code I stumbled upon on Stack Overflow:
import os
import pandas as pd
from docx import Document

file = [f for f in os.listdir() if f.endswith(".docx")]
for name in file:
    document = Document(name)
    table = document.tables[0]
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue
        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)
thanks
You are reinitializing the data list to [] (empty) for every document. So you carefully collect the row-data from a document and then in the next step throw it away.
If you move data = [] outside the loop then after iterating through the documents it will contain all the extracted rows.
data = []
for name in filenames:
    ...
    data.append(row_data)
print(data)
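Put concretely, the corrected shape of the loop can be factored into a helper so the single data list is created once and shared across all documents. A sketch against the python-docx table API; collect_rows is a hypothetical name:

```python
def collect_rows(documents):
    """Accumulate header-keyed row dicts from the first table of every
    document into ONE shared list, instead of resetting it per document."""
    data = []                              # created once, before the file loop
    for document in documents:
        table = document.tables[0]
        keys = None
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)         # first row supplies the dict keys
                continue
            data.append(dict(zip(keys, text)))
    return data

# usage with python-docx:
#   rows = collect_rows(Document(f) for f in file)
```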
I have an Excel workbook that has a worksheet called 'Professional Staff'. On that sheet, there is a range of cells named 'ProStaff'. I retrieve a list of the values in those cells with this code:
import openpyxl

wb = openpyxl.load_workbook(filename='SOexample.xlsx', read_only=True)

# Get the ProStaff range values
ProStaffRange = wb.defined_names['ProStaff']
# returns a generator of (worksheet title, cell range) tuples
dests = ProStaffRange.destinations

# use generator to create a list of (sheet, cell) tuples
cells = []
for title, coord in dests:
    ws = wb[title]
    cells.append(ws[coord])

# Above was from the OpenPyXL website
# Below is my best attempt to retrieve the values from those cells
cellsStr = []
startChar = '.'
stopChar = '>'
for item in cells[0]:
    itemStr = str(item)
    cellsStr.append((itemStr.split("'")[1].strip(),
                     itemStr[itemStr.find(startChar) + 1:itemStr.find(stopChar)]))

for item in cellsStr:
    print(wb[item[0]][item[1]].value)
The string manipulation I do takes something like:
(<ReadOnlyCell 'Professional Staff'.A1>,)
and turns it into:
('Professional Staff', 'A1')
It seems to me that there should be a way to work with the ReadOnlyCell items directly in order to retrieve their values, but I haven't been able to figure out how.
Try this, modified from something I saw elsewhere; it works for single-cell named ranges:
from openpyxl import load_workbook

wb = load_workbook('filename.xlsx', data_only=True)
ws = wb['sheet_name']
val = ws[list(wb.defined_names['single_cell_named_range'].destinations)[0][1]].value
print(val)
I'm using Openpyxl 2.5.12.
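For the multi-cell case there is indeed no need to parse the repr() strings at all: ws[coord] on a range returns a tuple of row-tuples of cell objects, and each cell exposes its .value attribute directly, even in read-only mode. A sketch (range_values is a hypothetical helper name):

```python
def range_values(cell_rows):
    """Flatten the tuple-of-row-tuples returned by ws[coord] into a flat
    list of cell values, reading each cell's .value attribute directly."""
    return [cell.value for row in cell_rows for cell in row]

# in the original code this replaces the whole string-manipulation block:
#   print(range_values(cells[0]))
```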
I have some code that reads a table in a Word document and makes a dataframe from it.
import numpy as np
import pandas as pd
from docx import Document

#### Time for some old fashioned user functions ####
def make_dataframe(f_name, table_loc):
    document = Document(f_name)
    table = document.tables[table_loc]
    data = []
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        if i == 0:
            keys = tuple(text)
            continue
        row_data = dict(zip(keys, text))
        data.append(row_data)
    df = pd.DataFrame.from_dict(data)
    return df

SHRD_filename = "SHRD - 12485.docx"
SHDD_filename = "SHDD - 12485.docx"

df_SHRD = make_dataframe(SHRD_filename, 30)
df_SHDD = make_dataframe(SHDD_filename, -60)
The files are different: for instance, the SHRD file has 32 tables and the one I am looking for is the second to last, while the SHDD file has 280 tables and the one I am looking for is 60th from the end. But that may not always be the case.
How do I search through the tables in a document and start working on the one where cell [0, 0] equals 'Tag Numbers'?
You can iterate through the tables and check the text in the first cell. I have modified the output to return a list of dataframes, just in case more than one table is found. It will return an empty list if no table meets the criteria.
def make_dataframe(f_name, first_cell_string='tag numbers'):
    document = Document(f_name)
    # create a list of all of the table objects whose first-cell text
    # equals `first_cell_string`
    tables = [t for t in document.tables
              if t.cell(0, 0).text.lower().strip() == first_cell_string]
    # in the case that more than one table is found
    out = []
    for table in tables:
        data = []
        for i, row in enumerate(table.rows):
            text = (cell.text for cell in row.cells)
            if i == 0:
                keys = tuple(text)
                continue
            row_data = dict(zip(keys, text))
            data.append(row_data)
        out.append(pd.DataFrame.from_dict(data))
    return out
I've got an Excel document with thousands of rows; each row represents a person's record. How can I use Python to extract each row and write that information onto an MS Word page?
My goal is to create a doc containing a pseudo-narrative of each record on its own page.
You can read the content of the Excel file into a pandas DataFrame and then export the DataFrame into a Word document as a table. This is the generic syntax to achieve your objective:
import os
import pandas as pd
import win32com.client as win32  # needs PyWin32

xls_file = pd.ExcelFile('../data/example.xls')
df = xls_file.parse('Sheet1')

wordApp = win32.gencache.EnsureDispatch('Word.Application')
wordApp.Visible = False
doc = wordApp.Documents.Open(os.getcwd() + '\\template.docx')
rng = doc.Bookmarks("PUTTABLEHERE").Range

# creating the table
# add one more row because the column names go in as a header
Table = rng.Tables.Add(rng, NumRows=df.shape[0] + 1, NumColumns=df.shape[1])
for col in range(df.shape[1]):
    # writing the column name as this column's header
    Table.Cell(1, col + 1).Range.Text = str(df.columns[col])
    for row in range(df.shape[0]):
        # writing each value of the data frame (row 1 holds the header)
        Table.Cell(row + 2, col + 1).Range.Text = str(df.iloc[row, col])
Hope this helps!
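Since the goal is a pseudo-narrative per record rather than a table, another route is to render each DataFrame row to text first and then write one page per row (with python-docx, document.add_paragraph followed by document.add_page_break works on any platform, not just Windows). A sketch; the column names, sample data, and wording here are all hypothetical:

```python
import pandas as pd

def row_to_narrative(row):
    """Render one record as a short pseudo-narrative sentence
    (the field names here are hypothetical -- adapt to your columns)."""
    return f"{row['name']}, {row['title']}, is based in {row['city']}."

df = pd.DataFrame({
    "name": ["Ada Lovelace", "Alan Turing"],
    "title": ["Analyst", "Cryptographer"],
    "city": ["London", "Bletchley"],
})

paragraphs = [row_to_narrative(row) for _, row in df.iterrows()]
# with python-docx, each narrative could then go on its own page:
#   document.add_paragraph(text); document.add_page_break()
print(paragraphs[0])  # -> Ada Lovelace, Analyst, is based in London.
```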