How can I write to specific Excel columns using openpyxl? - python

I'm writing a Python script that needs to write collections of data down specific columns in an Excel document.
More specifically, I'm calling an API that returns a list of items. Each item in the list contains multiple fields of data (item name, item version, etc). I would like to iterate through each item in the list, then write selected fields of data down specific columns in Excel.
Currently, I'm iterating through the items list, then appending the fields of data I want as a list into an empty list (creating a list of lists). Once I'm done iterating through the list of items, I iterate through the list of lists and append to each row of the Excel document. This works, but makes writing to a specific column complicated.
Here is roughly the code that I currently have:
import requests
import json
from openpyxl import Workbook
def main():
    wb = Workbook()
    ws = wb.active
    r = requests.get(api_url) # Can't provide URL
    items_list = r.json()['items']
    filler_list = []
    for item in items_list:
        item_name = item['itemName']
        item_version = item['itemVersion']
        # etc...
        filler_list.append([item_name, item_version])
    for i in filler_list:
        ws.append(i)
    wb.save('output.xlsx')
if __name__ == "__main__":
    main()
The above code will write to the Excel document across row 1, then row 2, etc. for however many lists were appended to the filler list. What I would prefer to do is specify that I want every item name or item version to be added to whatever column letter I want. Is this possible with openpyxl? The main function would look something like this:
def main():
    wb = Workbook()
    ws = wb.active
    r = requests.get(api_url) # Can't provide URL
    items_list = r.json()['items']
    for item in items_list:
        item_name = item['itemName']
        item_version = item['itemVersion']
        # Add item name to next open cell in column B (any column)
        # Add item version to next open cell in column D (any column)
    wb.save('output.xlsx')

There are two general methods for accessing specific cells in openpyxl.
One is to use the cell() method of the worksheet, specifying the row and column numbers. For example:
ws.cell(row=1, column=1).value = 5
sets the first cell.
The second method is to index the worksheet using Excel-style cell notation. For example:
ws["A1"].value = 5
does the same thing as above.
If you're going to be setting the same column positions in each row, probably the simplest thing to do is map each column letter to the field that belongs in it, then loop through the items and write one row per item. This is a simplified example based on your code above:
columns = {"B": "itemName", "D": "itemVersion"}  # column letter -> field name; add more as needed
for row, item in enumerate(r.json()['items'], start=1):
    for col, field in columns.items():
        ws[f"{col}{row}"].value = item[field]
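As a rough sketch of the column-targeting idea, assuming each item is a dict with the itemName and itemVersion keys from the question: the helper below (column_cells is a hypothetical name, and the sample data is made up) pairs every selected field with the cell coordinate it should land in, one row per item.

```python
def column_cells(items, layout, start_row=1):
    """Yield (coordinate, value) pairs, one row per item.

    layout maps a column letter to the item field written in that column.
    """
    for row, item in enumerate(items, start=start_row):
        for col, field in layout.items():
            yield f"{col}{row}", item[field]


# Sample data standing in for the API response (made up for illustration).
items = [
    {"itemName": "widget", "itemVersion": "1.0"},
    {"itemName": "gadget", "itemVersion": "2.3"},
]
cells = list(column_cells(items, {"B": "itemName", "D": "itemVersion"}))
```

With openpyxl, each pair could then be written to the worksheet with ws[coord] = value before saving the workbook.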

Related

How to extract a Word table from multiple files using python docx

I am working on a project at work where I need to analyze over a thousand MS-Word files, each consisting of the same table. From each table I just need to extract a few cells and turn them into a row that will later be concatenated to create a dataframe for further analysis.
I tested Python's library docx on one file and it managed to read the table. However, after plugging the same function inside a for loop that begins by creating a variable consisting of all the file names and then passing that to the Document function, the output is just one table, which is the first table in the list of files.
I have a feeling I'm not looking at this the right way, I would appreciate any guidance on this as I'm completely helpless now.
Following is the code I used; it consists mainly of code I stumbled upon on Stack Overflow:
import os
import pandas as pd
from docx import Document
file = [f for f in os.listdir() if f.endswith(".docx")]
for name in file:
    document = Document(name)
    table = document.tables[0]
    data = []
    keys = None
    for i, row in enumerate(table.rows):
        text = (cell.text for cell in row.cells)
        # Establish the mapping based on the first row
        # headers; these will become the keys of our dictionary
        if i == 0:
            keys = tuple(text)
            continue
        # Construct a dictionary for this row, mapping
        # keys to values for this row
        row_data = dict(zip(keys, text))
        data.append(row_data)
thanks
You are reinitializing the data list to [] (empty) for every document. So you carefully collect the row-data from a document and then in the next step throw it away.
If you move data = [] outside the loop then after iterating through the documents it will contain all the extracted rows.
data = []
for name in filenames:
    ...
    data.append(row_data)
print(data)
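To make the effect of moving data = [] concrete, here is a minimal sketch that uses plain lists of rows as stand-ins for the first table of each document (the data and the rows_as_dicts helper are made up for illustration, so it runs without any .docx files):

```python
def rows_as_dicts(table):
    """Map each data row to a dict keyed by the header row (the first row)."""
    keys = tuple(table[0])
    return [dict(zip(keys, row)) for row in table[1:]]


# Stand-ins for the first table of each document (made-up data).
tables = [
    [["name", "score"], ["ann", "3"]],
    [["name", "score"], ["bob", "5"], ["cy", "7"]],
]

data = []  # created once, before the loop, so rows from every table accumulate
for table in tables:
    data.extend(rows_as_dicts(table))
```

A list of dicts like this can be passed to pandas.DataFrame(data) to build the dataframe.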

Data append to list using XLRD

I am able to import the rows of a particular column of a given sheet into a Python list. But the list looks like a Key:Value formatted list (not the one I need).
Here is my code:
import xlrd
excelList = []
def xcel(path):
    book = xlrd.open_workbook(path)
    impacted_files = book.sheet_by_index(2)
    for row_index in range(2, impacted_files.nrows):
        #if impacted_files.row_values(row_index) == 'LCR':
        excelList.append(impacted_files.cell(row_index, 1))
    print(excelList)
if __name__ == "__main__":
    xcel(path)
The output is like below:
[text:'LCR_ContractualOutflowsMaster.aspx', text:'LCR_CountryMaster.aspx', text:'LCR_CountryMasterChecker.aspx', text:'LCR_EntityMaster.aspx', text:'LCR_EntityMasterChecker.aspx', text:'LCR_EscalationMatrixMaster.aspx',....]
I want the list to have just values. Like this...
['LCR_ContractualOutflowsMaster.aspx', 'LCR_CountryMaster.aspx', 'LCR_CountryMasterChecker.aspx', 'LCR_EntityMaster.aspx', 'LCR_EntityMasterChecker.aspx', 'LCR_EscalationMatrixMaster.aspx',...]
I've tried pandas too (df.value.tolist() method). Yet the output is not what I visualize.
Please suggest a way.
Regards
You are accumulating a list of cells, and what you are seeing is the repr of each cell in your list. Cell objects have three attributes: ctype (an int that identifies the type of the cell's value), value (a Python object holding the cell's value) and xf_index. If you want only the values then try
excelList.append(impacted_files.cell(row_index, 1).value)
You can read more about cells in the documentation.
If you are willing to try one more library, this is how it can be done with openpyxl:
from openpyxl import load_workbook
book = load_workbook(path)
sh = book.worksheets[0]
print([cell.value for row in sh.iter_rows() for cell in row])
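One detail worth checking in a flattening comprehension like this is the order of the for clauses: they are written in the same order as the equivalent nested loops, outer loop first. A small sketch with plain lists standing in for worksheet rows (the data is made up):

```python
rows = [["a1", "b1"], ["a2", "b2"]]

# Outer loop (rows) first, then inner loop (cells in each row).
flat = [cell for row in rows for cell in row]

# The equivalent explicit loops, for comparison.
expected = []
for row in rows:
    for cell in row:
        expected.append(cell)
```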

Is there a better way to use OpenPyXL's defined_names feature to return values from an Excel Named Range?

I have an Excel workbook that has a worksheet called 'Professional Staff'. On that sheet, there is a range of cells named 'ProStaff'. I retrieve a list of the values in those cells with this code:
import openpyxl
wb = openpyxl.load_workbook(filename='SOexample.xlsx', read_only=True)
#Get the ProStaff range values
ProStaffRange = wb.defined_names['ProStaff']
#returns a generator of (worksheet title, cell range) tuples
dests = ProStaffRange.destinations
#use generator to create a list of (sheet, cell) tuples
cells = []
for title, coord in dests:
    ws = wb[title]
    cells.append(ws[coord])
#Above was from the OpenPyXL website
#Below is my best attempt to retrieve the values from those cells
cellsStr = []
startChar = '.'
stopChar = '>'
for item in cells[0]:
    itemStr = str(item)
    cellsStr.append( (itemStr.split("'")[1].strip(), itemStr[itemStr.find(startChar)+1:itemStr.find(stopChar)]) )
for item in cellsStr:
    print(wb[item[0]][item[1]].value)
The string manipulation I do takes something like:
(<ReadOnlyCell 'Professional Staff'.A1>,)
and turns it into:
('Professional Staff', 'A1')
It seems to me that there should be a way to work with the ReadOnlyCell items directly in order to retrieve their values, but I haven't been able to figure out how.
Try this, modified from something I saw elsewhere; it works for single-cell named ranges:
from openpyxl import load_workbook
wb = load_workbook('filename.xlsx', data_only=True)
ws = wb['sheet_name']
val = ws[list(wb.defined_names['single_cell_named_range'].destinations)[0][1]].value
print(val)
I'm using Openpyxl 2.5.12.
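For a multi-cell range, the printed form (<ReadOnlyCell 'Professional Staff'.A1>,) in the question suggests that each element of cells[0] is a one-cell tuple, so the values can probably be read through each cell's value attribute without any string parsing. The sketch below uses SimpleNamespace objects as stand-ins for the cells so that it runs without a workbook; range_values is a hypothetical helper name and the names are made up.

```python
from types import SimpleNamespace


def range_values(cell_rows):
    """Flatten a tuple of row tuples into a list of cell values."""
    return [cell.value for row in cell_rows for cell in row]


# Stand-in for ws[coord] on a three-row, one-column named range.
fake_range = (
    (SimpleNamespace(value="Alice"),),
    (SimpleNamespace(value="Bob"),),
    (SimpleNamespace(value="Carol"),),
)
names = range_values(fake_range)
```

With the question's variables, this would be called as range_values(cells[0]).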

Find list items in excel sheet with Python

I have the code below, which finds non-blank values in column J of an Excel worksheet. It does some things with each value, including getting the matching email address from column K. Then it emails the member using smtp.
What I'd like instead is to get the person's email from a Python list, which can be declared in the beginning of the code. I just can't figure out how to find the matching names in column J in the worksheet per the list, and then get the resulting email address from the list.
Please excuse any horrible syntax...this is my first stab at a major python project.
memlist = {'John Frank':'email@email.com',
           'Liz Poe':'email2@email.com'}
try:
    for i in os.listdir(os.getcwd()):
        if i.endswith(".xlsx") or i.endswith(".xls"):
            workbook = load_workbook(i, data_only=True)
            ws = workbook.get_sheet_by_name(wsinput)
            cell_range = ws['j3':'j7']
            for row in cell_range: # This is iterating through rows 3-7
                #for matching names in memlist
                for cell in row: # This iterates through the columns(cells) in that row
                    value = cell.value
                    if cell.value:
                        if cell.offset(row=0, column =-9).value.date() == (datetime.now().date() + timedelta(days=7)):
                            #print(cell.value)
                            email = cell.offset(row=0, column=1).value
                            name = cell.value.split(',',1)[0]
This is my attempt at an answer.
memlist is not a list; it is a dict, because it contains key: value pairs.
If you want to check that a certain key exists in a dict, you can use the dict.has_key(key) method in Python 2. It was removed in Python 3, where the in operator does the same job.
In memlist, the name is the key and the corresponding email is the value.
In your code, you could do this:
if memlist.has_key(cell.value): # For Python 2
    if ... # From your code
        email = memlist[cell.value]
In case you're using Python 3, you can search for the key like this:
if cell.value in memlist: # For Python 3
See if this works for you as I couldn't fully comprehend your question.
Shubham,
I used a part of your response in finding my own answer. Instead of the has_key method, I just used another for/in statement with a subsequent if statement.
My fear, however, is that with these multiple for's and if's, the code takes a long time to run and may not be the most efficient. But that's a problem for another day.
try:
    for i in os.listdir(os.getcwd()):
        if i.endswith(".xlsx") or i.endswith(".xls"):
            workbook = load_workbook(i, data_only=True)
            ws = workbook.get_sheet_by_name(wsinput)
            cell_range = ws['j3':'j7']
            for row in cell_range: # This is iterating through rows 3-7
                for cell in row: # This iterates through the columns(cells) in that row
                    value = cell.value
                    if cell.value:
                        if cell.offset(row=0, column =-9).value.date() == (datetime.now().date() + timedelta(days=7)):
                            for name, email in memlist.items():
                                if cell.value == name:
                                    #send the email
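The inner for/if over memlist.items() can likely be replaced by a direct dictionary lookup, which avoids scanning every entry for every cell. A minimal sketch, using made-up names and addresses and a plain list standing in for the values read from column J:

```python
memlist = {
    "John Frank": "john@example.com",
    "Liz Poe": "liz@example.com",
}

# Stand-in for the cell values read from column J (None marks a blank cell).
column_j = ["Liz Poe", None, "Sam Unknown", "John Frank"]

to_notify = []
for name in column_j:
    email = memlist.get(name)  # None when the name is not a key in memlist
    if email is not None:
        to_notify.append((name, email))
```

In the loop from the answer above, the same lookup would be memlist.get(cell.value), with the email sent only when the result is not None.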

Is it possible to get an Excel document's row count without loading the entire document into memory?

I'm working on an application that processes huge Excel 2007 files, and I'm using OpenPyXL to do it. OpenPyXL has two different methods of reading an Excel file - one "normal" method where the entire document is loaded into memory at once, and one method where iterators are used to read row-by-row.
The problem is that when I'm using the iterator method, I don't get any document meta-data like column widths and row/column count, and I really need this data. I assume this data is stored in the Excel document close to the top, so it shouldn't be necessary to load the whole 10MB file into memory to get access to it.
So, is there a way to get ahold of the row/column count and column widths without loading the entire document into memory first?
Adding on to what Hubro said, apparently get_highest_row() has been deprecated. Using the max_row and max_column properties returns the row and column count. For example:
wb = load_workbook(path, read_only=True) # older openpyxl versions used use_iterators=True
sheet = wb.worksheets[0]
row_count = sheet.max_row
column_count = sheet.max_column
The solution suggested in this answer has been deprecated, and might no longer work.
Taking a look at the source code of OpenPyXL (IterableWorksheet) I've figured out how to get the column and row count from an iterator worksheet:
wb = load_workbook(path, use_iterators=True)
sheet = wb.worksheets[0]
row_count = sheet.get_highest_row() - 1
column_count = letter_to_index(sheet.get_highest_column()) + 1
IterableWorksheet.get_highest_column returns a string with the column letter that you can see in Excel, e.g. "A", "B", "C" etc. Therefore I've also written a function to translate the column letter to a zero based index:
def letter_to_index(letter):
    """Converts a column letter, e.g. "A", "B", "AA", "BC" etc. to a zero based
    column index.
    A becomes 0, B becomes 1, Z becomes 25, AA becomes 26 etc.
    Args:
        letter (str): The column index letter.
    Returns:
        The column index as an integer.
    """
    letter = letter.upper()
    result = 0
    for index, char in enumerate(reversed(letter)):
        # Get the ASCII number of the letter and subtract 64 so that A
        # corresponds to 1.
        num = ord(char) - 64
        # Multiply the number with 26 to the power of `index` to get the correct
        # value of the letter based on its index in the string.
        final_num = (26 ** index) * num
        result += final_num
    # Subtract 1 from the result to make it zero-based before returning.
    return result - 1
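As a quick sanity check of the conversion above, here is a compact restatement of the same base-26 arithmetic together with a hypothetical inverse, index_to_letter, which is not part of the original answer. openpyxl also ships column_index_from_string and get_column_letter in openpyxl.utils, which work with 1-based indexes.

```python
def letter_to_index(letter):
    """Convert a column letter such as "A" or "AA" to a zero-based index."""
    result = 0
    for char in letter.upper():
        result = result * 26 + (ord(char) - ord("A") + 1)
    return result - 1


def index_to_letter(index):
    """Convert a zero-based column index back to its column letter."""
    index += 1  # switch to the 1-based numbering used by the letters
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters
```

The two functions round-trip, so index_to_letter(letter_to_index("BC")) gives back "BC".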
I still haven't figured out how to get the column sizes though, so I've decided to use a fixed-width font and automatically scaled columns in my application.
Python 3
import openpyxl as xl
wb = xl.load_workbook("Sample.xlsx", read_only=True)
# The two lines below select the same sheet if 'sheet' is the first sheet in the workbook.
sheet = wb['sheet']
sheet = wb.worksheets[0]
row_count = sheet.max_row
column_count = sheet.max_column
# This works for me.
This might be extremely convoluted and I might be missing the obvious, but without OpenPyXL filling in the column_dimensions in Iterable Worksheets (see my comment above), the only way I can see of finding the column size without loading everything is to parse the xml directly:
from xml.etree.ElementTree import iterparse
from openpyxl import load_workbook
wb = load_workbook("/path/to/workbook.xlsx", use_iterators=True)
ws = wb.worksheets[0]
xml = ws._xml_source
xml.seek(0)
for _, x in iterparse(xml):
    name = x.tag.split("}")[-1]
    if name == "col":
        print("Column %(max)s: Width: %(width)s" % x.attrib) # width = x.attrib["width"]
    if name == "cols":
        print("break before reading the rest of the file")
        break
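The same idea can be sketched with only the standard library, since an .xlsx file is a zip archive of XML parts. This is a rough sketch under stated assumptions, not a replacement for openpyxl: sheet_metadata is a hypothetical helper, the default part name xl/worksheets/sheet1.xml is typical but not guaranteed, and the dimension element is optional, so the first return value may be None.

```python
import io
import zipfile
from xml.etree.ElementTree import iterparse

# Namespace used by SpreadsheetML worksheet parts.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def sheet_metadata(xlsx, sheet_path="xl/worksheets/sheet1.xml"):
    """Return (dimension_ref, widths) read from the start of the sheet XML.

    widths maps (min_col, max_col) to the stored column width.
    """
    dimension = None
    widths = {}
    with zipfile.ZipFile(xlsx) as archive, archive.open(sheet_path) as xml:
        for _, elem in iterparse(xml, events=("start",)):
            if elem.tag == NS + "dimension":
                dimension = elem.get("ref")
            elif elem.tag == NS + "col" and elem.get("width") is not None:
                key = (int(elem.get("min")), int(elem.get("max")))
                widths[key] = float(elem.get("width"))
            elif elem.tag == NS + "sheetData":
                break  # cell data starts here; stop before reading the rows
    return dimension, widths


# Build a tiny stand-in archive in memory so the sketch runs without a real file.
sheet_xml = (
    '<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">'
    '<dimension ref="A1:C20"/>'
    '<cols><col min="1" max="2" width="12.5"/></cols>'
    '<sheetData><row r="1"/></sheetData>'
    "</worksheet>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", sheet_xml)
buffer.seek(0)
dimension, widths = sheet_metadata(buffer)
```

The dimension reference (for example "A1:C20") gives the used range as recorded by the program that saved the file, so it is only as reliable as that program.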
https://pythonhosted.org/pyexcel/iapi/pyexcel.sheets.Sheet.html
See row_range(), a utility function to get the row range.
If you use pyexcel, you can call row_range() to get the maximum number of rows.
Tested with Python 3.4.
Options using pandas.
Gets all sheetnames with count of rows and columns.
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
for sheet in sheetnames:
    df = xl.parse(sheet)
    dimensions = df.shape
    print(sheet, ' --> ', dimensions)
Single sheet count of rows and columns.
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
df = xl.parse(sheetnames[0]) # [0] get first tab/sheet.
dimensions = df.shape
print(f'sheetname: "{sheetnames[0]}" - -> {dimensions}')
output sheetname "Sheet1" --> (row count, column count)
