Find list items in an Excel sheet with Python

I have the following code, which finds non-blank values in column J of an Excel worksheet. It does some things with each one, including getting the matching email address from column K. Then it emails the member using smtp.
What I'd like instead is to get the person's email from a Python list declared at the beginning of the code. I just can't figure out how to find the matching names in column J of the worksheet per the list, and then get the resulting email address from the list.
Please excuse any horrible syntax... this is my first stab at a major Python project.
memlist = {'John Frank': 'email@email.com',
           'Liz Poe': 'email2@email.com'}

try:
    for i in os.listdir(os.getcwd()):
        if i.endswith(".xlsx") or i.endswith(".xls"):
            workbook = load_workbook(i, data_only=True)
            ws = workbook.get_sheet_by_name(wsinput)
            cell_range = ws['j3':'j7']
            for row in cell_range:  # this iterates through rows 3-7
                # for matching names in memlist
                for cell in row:  # this iterates through the columns (cells) in that row
                    value = cell.value
                    if cell.value:
                        if cell.offset(row=0, column=-9).value.date() == (datetime.now().date() + timedelta(days=7)):
                            # print(cell.value)
                            email = cell.offset(row=0, column=1).value
                            name = cell.value.split(',', 1)[0]

This is my attempt at an answer.
memlist is not a list; rather, it is a dict, because it contains key : value pairs.
If you want to check that a certain key exists in a dict, in Python 2 you can use the dict.has_key(key) method (it was removed in Python 3).
In memlist, the name is the key and the corresponding email is the value.
In your code, you could do this:
if memlist.has_key(cell.value):  # For Python 2
    if ...  # From your code
        email = memlist[cell.value]
In case you're using Python 3, you can search for the key with the in operator instead (this also works in Python 2 and is generally preferred):
if cell.value in memlist:  # For Python 3
See if this works for you as I couldn't fully comprehend your question.
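For illustration, here is a minimal, self-contained sketch of that membership-plus-lookup pattern (the names and addresses are placeholders, not the asker's real data):

```python
# Placeholder member dict: name -> email address
memlist = {'John Frank': 'email@email.com',
           'Liz Poe': 'email2@email.com'}

def lookup_email(name):
    # 'in' tests membership among the dict's keys (works in Python 2 and 3)
    if name in memlist:
        return memlist[name]
    return None

print(lookup_email('Liz Poe'))   # email2@email.com
print(lookup_email('Jane Doe'))  # None
```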

Shubham,
I used part of your response in finding my own answer. Instead of the has_key method, I just used another for/in statement with a subsequent if statement.
My fear, however, is that with these multiple fors and ifs, the code takes a long time to run and may not be the most efficient or optimal. But that's a worry for another day.
try:
    for i in os.listdir(os.getcwd()):
        if i.endswith(".xlsx") or i.endswith(".xls"):
            workbook = load_workbook(i, data_only=True)
            ws = workbook.get_sheet_by_name(wsinput)
            cell_range = ws['j3':'j7']
            for row in cell_range:  # this iterates through rows 3-7
                for cell in row:  # this iterates through the columns (cells) in that row
                    value = cell.value
                    if cell.value:
                        if cell.offset(row=0, column=-9).value.date() == (datetime.now().date() + timedelta(days=7)):
                            for name, email in memlist.items():
                                if cell.value == name:
                                    # send the email
                                    pass
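On the efficiency worry: the inner for name, email in memlist.items() scan can be collapsed into a single dict lookup, which is O(1) per cell instead of O(n). A sketch using the same placeholder dict:

```python
memlist = {'John Frank': 'email@email.com',
           'Liz Poe': 'email2@email.com'}

def email_for(cell_value):
    """Return the member's email, or None when the name isn't in memlist."""
    # dict.get combines the membership test and the lookup in one step
    return memlist.get(cell_value)

print(email_for('John Frank'))  # email@email.com
print(email_for('Unknown'))     # None
```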

Related

How can I write to specific Excel columns using openpyxl?

I'm writing a Python script that needs to write collections of data down specific columns in an Excel document.
More specifically, I'm calling an API that returns a list of items. Each item in the list contains multiple fields of data (item name, item version, etc). I would like to iterate through each item in the list, then write selected fields of data down specific columns in Excel.
Currently, I'm iterating through the items list, then appending the fields of data I want as a list into an empty list (creating a list of lists). Once I'm done iterating through the list of items, I iterate through the list of lists and append to each row of the Excel document. This works, but makes writing to a specific column complicated.
Here is roughly the code that I currently have:
import requests
import json
from openpyxl import Workbook

def main():
    wb = Workbook()
    ws = wb.active
    r = requests.get(api_url)  # Can't provide URL
    items_list = r.json()['items']
    filler_list = []
    for item in items_list:
        item_name = item['itemName']
        item_version = item['itemVersion']
        # etc...
        filler_list.append([item_name, item_version])
    for i in filler_list:
        ws.append(i)
    wb.save('output.xlsx')

if __name__ == "__main__":
    main()
The above code will write to the Excel document across row 1, then row 2, etc. for however many lists were appended to the filler list. What I would prefer to do is specify that I want every item name or item version to be added to whatever column letter I want. Is this possible with openpyxl? The main function would look something like this:
def main():
    wb = Workbook()
    ws = wb.active
    r = requests.get(api_url)  # Can't provide URL
    items_list = r.json()['items']
    for item in items_list:
        item_name = item['itemName']
        item_version = item['itemVersion']
        # Add item name to next open cell in column B (any column)
        # Add item version to next open cell in column D (any column)
    wb.save('output.xlsx')
There are two general methods for accessing specific cells in openpyxl.
One is to use the cell() method of the worksheet, specifying the row and column numbers. For example:
ws.cell(row=1, column=1).value = 5
sets the first cell.
The second method is to index the worksheet using Excel-style cell notation. For example:
ws["A1"].value = 5
does the same thing as above.
If you're going to be setting the same column positions in each row, probably the simplest thing to do is make items_list an iterator and loop through the columns you want to set. This is a simplified example from your code above:
columns = ["B", "D", "G"]  # etc.
items_list = iter(r.json()['items'])
row = 1
for col in columns:
    ws[f"{col}{row}"].value = next(items_list)
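If each field should instead go down its own fixed column (say names in B and versions in D), the coordinate bookkeeping is simple: the column letter is fixed per field and the row number advances once per item. The sketch below builds the reference-to-value mapping in plain Python (the item data is invented); each entry could then be written with ws[ref].value = val in openpyxl:

```python
# Invented sample data standing in for the API response
items_list = [
    {'itemName': 'widget', 'itemVersion': '1.0'},
    {'itemName': 'gadget', 'itemVersion': '2.3'},
]

# Fixed column per field; the row advances once per item
field_columns = {'itemName': 'B', 'itemVersion': 'D'}

cells = {}
for row, item in enumerate(items_list, start=1):
    for field, col in field_columns.items():
        cells[f"{col}{row}"] = item[field]

print(cells)
# {'B1': 'widget', 'D1': '1.0', 'B2': 'gadget', 'D2': '2.3'}
```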

How to append a large number of rows to a Google Sheet without going over the API Quota

I'm writing a Python program that takes data from a website and sorts it into different worksheets in a Google Sheet. The program works when appending and deleting a smaller number of rows, but when I try to insert or delete a larger number of rows I hit the Google API quota limit.
From what I can gather, I believe the solution would be to use the spreadsheets.values.batchUpdate() method, because it sends all the requests at once and, once they're verified as valid for the specific worksheet, executes them together. Unfortunately, there's no gspread function that calls this method, and I'm currently lost trying to use the raw API.
This is how I'm appending my data
sheet = vt.sheet1
........
with open('objectbuffer.csv', encoding='utf-8') as bfile:
    reader = csv.reader(bfile, delimiter=',')
    # gets every worksheet in the spreadsheet
    wkshts = vt.worksheets()
    # each row in the csv is evaluated
    # and is copied into any corresponding worksheet one at a time
    for row in reader:
        kwsFound = 0
        hasUploaded = False
        appendRow = [row[0], row[1], row[2], row[3], row[4], row[5]]
        # iterates through every dynamically created worksheet in the spreadsheet
        for sheets in wkshts:
            # if the title of the sheet is found anywhere within column "E"
            # then 1 is added to the number of keywords found
            if sheets.title in row[4]:
                kwsFound += 1
                kwSheet = sheets
        # if only one keyword (title of a worksheet) is found ANYWHERE in the row
        # then that row is copied to the worksheet with the name found
        if kwsFound == 1:
            kwSheet.append_row(appendRow, "USER_ENTERED")
            hasUploaded = True
        # if no keyword is found / more than one is found,
        # the row is copied to the conflicts worksheet (which is constant)
        if hasUploaded == False:
            conflicts.append_row(appendRow, "USER_ENTERED")
        # every row is always copied to the main worksheet
        sheet.append_row(appendRow, "USER_ENTERED")
The kwsFound/kwSheet logic is what sorts the data into the separate worksheets. Currently I'm using the gspread append_row function to append data one row at a time, which is what puts me over the API limit.
Bonus Question
This is how I'm deleting duplicate rows in my program. Since the delete requests are sent one at a time, this also pushes the program over the API quota.
allVal = sheet.get_all_values()
rowTot = len(allVal)
standard = ""
counter = 0
dcounter = 0
deleteRows = []
while counter < rowTot:
    if allVal[counter] == standard:
        deleteRows.append(counter)
    else:
        standard = allVal[counter]
    counter += 1
while dcounter < len(deleteRows):
    sheet.delete_row(deleteRows[dcounter] - dcounter)
    sleep(.5)
    dcounter += 1
Help making this into a batchUpdate would be appreciated
EDIT:
Here's an example of the csv that's generated from scraping my venmo profile. http://www.filedropper.com/csvexample Though I edited it to remove personal information. Here's an example of the output I want to go into google sheets http://www.filedropper.com/gsheetsoutputexample, with all the transactions on the main sheet, but if the title of one of the secondary worksheets shows up in the description (5th column of the csv) of a transaction a copy of that transaction data is also placed in that respective worksheet. If 2 or more worksheet titles show up in the description of a transaction (or none) a copy of that transaction is then sent to the conflicts worksheet. If google sheets quotas were unlimited then my code would function in the way described without the worry of interruption.
EDIT 2:
1.) What I want to do is check the values of column "E" and if the title of one of the worksheets is a substring of the value of column "E" the program appends the row to the specified worksheet. So in this case values of "food", "food!", and "I love food" would all be appended to the food worksheet.
2.) The worksheet names are not constant. The program I'm building is meant to be used by my friends, so I made it so that you add named worksheets to the spreadsheets through a gui so they can make their own categories to filter their data. Let me know if you have other questions or if I didn't clarify well enough
EDIT 3:
Added comments to the code above
So that you do not have to wait very long, here is the Google Sheets API doc. Get acquainted with it, and in the meantime I'll work on creating a solution for your specific situation.
For doing this with gspread, see its doc.
Try something like this:
#!/usr/bin/python3

# This dict represents the request body for batchUpdate(body)
thisDict = {
    "requests": [
        {
            # we append update commands to a list and then place that list here.
            # Next we send thisDict as the body of a batchUpdate.
        }
    ],
    "includeSpreadsheetInResponse": bool,  # todo: set this to a bool value
    "responseRanges": [
        string  # todo: set this to a string range
    ],
    "responseIncludeGridData": bool  # todo: set this to a bool value
}

# to contain our request objects
List = []

with open('objectbuffer.csv', encoding='utf-8') as bfile:
    reader = csv.reader(bfile, delimiter=',')
    wkshts = vt.worksheets()
    for row in reader:
        kwsFound = 0
        hasUploaded = False
        appendRow = [row[0], row[1], row[2], row[3], row[4], row[5]]
        for sheets in wkshts:
            if sheets.title in row[4]:
                kwsFound += 1
                kwSheet = sheets
        if kwsFound == 1:
            List.append("kwSheet.append_row(appendRow,'USER_ENTERED')")  # append command to list
            hasUploaded = True
        if hasUploaded == False:
            List.append("conflicts.append_row(appendRow,'USER_ENTERED')")  # append command to list
        List.append("sheet.append_row(appendRow,'USER_ENTERED')")  # append command to list

thisDict["requests"] = List  # set requests equal to the list of commands
spreadsheets.batchUpdate(thisDict)  # send the request body with a list of commands to execute
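A caveat on the sketch above: a list of command strings is not a valid request body; the API expects JSON objects. For writing values, the documented shape for spreadsheets.values.batchUpdate is a body with a valueInputOption and a data list of {range, values} entries (note it writes to the given ranges rather than appending; gspread's Worksheet.append_rows, available in recent versions, is the simpler way to append many rows in one request per sheet). Here is a minimal sketch of building such a body, with placeholder sheet names and rows:

```python
def build_values_batch_body(rows_by_sheet):
    """Build a spreadsheets.values.batchUpdate request body.

    rows_by_sheet maps a worksheet title (placeholder names here) to the
    list of rows destined for it. Field names follow the Sheets API
    reference for values.batchUpdate.
    """
    data = [
        {"range": "'%s'!A1" % title, "values": rows}
        for title, rows in sorted(rows_by_sheet.items())
    ]
    return {"valueInputOption": "USER_ENTERED", "data": data}

body = build_values_batch_body({
    "Main": [["01/01/2021", "lunch", "5.00"]],
    "food": [["01/01/2021", "lunch", "5.00"]],
})
print(body["valueInputOption"])  # USER_ENTERED
print(body["data"][0]["range"])  # 'Main'!A1
```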

Data append to list using XLRD

I am able to import the row data of a particular column of a certain sheet into a Python list. But the list comes out looking like a key:value formatted list, not the one I need.
Here is my code:
import xlrd

excelList = []

def xcel(path):
    book = xlrd.open_workbook(path)
    impacted_files = book.sheet_by_index(2)
    for row_index in range(2, impacted_files.nrows):
        # if impacted_files.row_values(row_index) == 'LCR':
        excelList.append(impacted_files.cell(row_index, 1))
    print(excelList)

if __name__ == "__main__":
    xcel(path)
The output is like below:
[text:'LCR_ContractualOutflowsMaster.aspx', text:'LCR_CountryMaster.aspx', text:'LCR_CountryMasterChecker.aspx', text:'LCR_EntityMaster.aspx', text:'LCR_EntityMasterChecker.aspx', text:'LCR_EscalationMatrixMaster.aspx',....]
I want the list to have just values. Like this...
['LCR_ContractualOutflowsMaster.aspx', 'LCR_CountryMaster.aspx', 'LCR_CountryMasterChecker.aspx', 'LCR_EntityMaster.aspx', 'LCR_EntityMasterChecker.aspx', 'LCR_EscalationMatrixMaster.aspx',...]
I've tried pandas too (df.value.tolist() method). Yet the output is not what I visualize.
Please suggest a way.
Regards
You are accumulating a list of cells, and what you are seeing is the repr of each cell in your list. Cell objects have three attributes: ctype, an int that identifies the type of the cell's value; value, which is a Python type holding the cell's value; and xf_index. If you want only the values, then try
excelList.append(impacted_files.cell(row_index, 1).value)
You can read more about cells in the documentation.
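The difference between a cell's repr and its value is easy to demonstrate with a stand-in class (the real Cell lives in xlrd; this mock only imitates its text:'...' repr for illustration):

```python
class Cell:
    """Mock of xlrd's Cell, imitating its text:'...' repr."""
    def __init__(self, value):
        self.value = value
    def __repr__(self):
        return "text:%r" % self.value

cells = [Cell('LCR_CountryMaster.aspx'), Cell('LCR_EntityMaster.aspx')]
print(cells)
# [text:'LCR_CountryMaster.aspx', text:'LCR_EntityMaster.aspx']
print([c.value for c in cells])
# ['LCR_CountryMaster.aspx', 'LCR_EntityMaster.aspx']
```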
If you are willing to try one more library, openpyxl, this is how it can be done:
from openpyxl import load_workbook

book = load_workbook(path)
sh = book.worksheets[0]
print([[cell.value for cell in row] for row in sh.iter_rows()])

Iterate through columns in Read-only workbook in openpyxl

I have a somewhat large .xlsx file - 19 columns, 5185 rows. I want to open the file, read all the values in one column, do some stuff to those values, and then create a new column in the same workbook and write out the modified values. Thus, I need to be able to both read and write in the same file.
My original code did this:
def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc)
    ws = wb["Sheet1"]
    # iterate through the columns to find the correct one
    for col in ws.iter_cols(min_row=1, max_row=1):
        for mycell in col:
            if mycell.value == "PerceivedSound.RESP":
                origCol = mycell.column
    # get the column letter for the first empty column to output the new values
    newCol = utils.get_column_letter(ws.max_column + 1)
    # iterate through the rows to get the value from the original column,
    # do something to that value, and output it in the new column
    for myrow in range(2, ws.max_row + 1):
        myrow = str(myrow)
        # do some stuff to make the new value
        cleanedResp = doStuff(ws[origCol + myrow].value)
        ws[newCol + myrow] = cleanedResp
    wb.save(doc)
However, python threw a memory error after row 3853 because the workbook was too big. The openpyxl docs said to use Read-only mode (https://openpyxl.readthedocs.io/en/latest/optimized.html) to handle big workbooks. I'm now trying to use that; however, there seems to be no way to iterate through the columns when I add the read_only = True param:
def readExcel(doc):
    wb = load_workbook(generalpath + exppath + doc, read_only=True)
    ws = wb["Sheet1"]
    for col in ws.iter_cols(min_row=1, max_row=1):
        # etc.
python throws this error:
AttributeError: 'ReadOnlyWorksheet' object has no attribute 'iter_cols'
If I change the final line in the above snippet to:
for col in ws.columns:
python throws the same error:
AttributeError: 'ReadOnlyWorksheet' object has no attribute 'columns'
Iterating over rows is fine (and is included in the documentation I linked above):
for col in ws.rows:
(no error)
This question asks about the AttributeError, but the solution there is to remove Read-only mode, which doesn't work for me because openpyxl won't read my entire workbook outside Read-only mode.
So: how do I iterate through columns in a large workbook?
And I haven't yet encountered this, but I will once I can iterate through the columns: how do I both read and write the same workbook, if said workbook is large?
Thanks!
If the worksheet has only around 100,000 cells then you shouldn't have any memory problems. You should probably investigate this further.
iter_cols() is not available in read-only mode because it would require constant and very inefficient reparsing of the underlying XML file. It is, however, relatively easy to convert rows into columns from iter_rows() using zip.
def _iter_cols(self, min_col=None, max_col=None, min_row=None,
               max_row=None, values_only=False):
    yield from zip(*self.iter_rows(
        min_row=min_row, max_row=max_row,
        min_col=min_col, max_col=max_col, values_only=values_only))

import types
for sheet in workbook:
    sheet.iter_cols = types.MethodType(_iter_cols, sheet)
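The zip trick at the heart of that shim can be seen in isolation: transposing row tuples yields column tuples, which is exactly what iter_cols is expected to produce.

```python
# Three rows of three cells each (plain values for illustration)
rows = [
    (1, 'a', 'x'),
    (2, 'b', 'y'),
    (3, 'c', 'z'),
]

# zip(*rows) groups the i-th element of every row, i.e. the i-th column
cols = list(zip(*rows))
print(cols)  # [(1, 2, 3), ('a', 'b', 'c'), ('x', 'y', 'z')]
```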
According to the documentation, ReadOnly mode only supports row-based reads (column reads are not implemented). But that's not hard to solve:
wb2 = Workbook(write_only=True)
ws2 = wb2.create_sheet()

# find what column I need
colcounter = 0
for row in ws.rows:
    for cell in row:
        if cell.value == "PerceivedSound.RESP":
            break
        colcounter += 1
    # cells are apparently linked to the parent workbook meta;
    # this will retain only values; you'll need a custom
    # row constructor if you want to retain more
    row2 = [cell.value for cell in row]
    ws2.append(row2)  # preserve the first row in the new file
    break  # stop after the first row

for row in ws.rows:
    row2 = [cell.value for cell in row]
    row2.append(doStuff(row2[colcounter]))
    ws2.append(row2)  # write a new row to the new wb

wb2.save('newfile.xlsx')
wb.close()
wb2.close()

# copy `newfile.xlsx` to `generalpath + exppath + doc`
# either using os.system, subprocess.Popen, or shutil.copy2()
You will not be able to write to the same workbook, but as shown above you can open a new workbook (in writeonly mode), write to it, and overwrite the old file using OS copy.

Finding Excel cell reference using Python

Context: I am writing a program which can pull values from a PDF and put them in the appropriate cell in an Excel file.
Question: I want to write a function which takes a column value (e.g. 2014) and a row value (e.g. 'COGS') as arguments and return the cell reference where those two intersect (e.g. 'C3' for 2014 COGS).
def find_correct_cell(year=2014, item='COGS'):
    # do something similar to what the =MATCH function in Excel does
    return cell_reference  # returns 'C3'
I have already tried using openpyxl like this to change the values of some random empty cells where I can store these values:
col_num = '=match(2014, A1:E1)'
row_num = '=match("COGS", A1:A5)'
But I want to grab those values without having to arbitrarily write to those random empty cells. Plus, even with this method, when I read those cells (F5 and F6) it reads the formulae in those cells and not the face value of 3.
Any help is appreciated, thanks.
Consider a translated VBA solution, as the Match function can adequately handle your needs. Python can access the Excel VBA object library through a COM interface with the win32com module. Please note this solution assumes you are using Excel for PC. The counterpart VBA function is included below.
VBA Function (native interface)
If the function below is placed in an Excel standard module, it can be called in a spreadsheet cell as =FindCell(..., ###)
' MATCHES ROW AND COL INPUT FOR CELL ADDRESS OUTPUT
Function FindCell(item As String, year As Integer) As String
    FindCell = Cells(Application.Match(item, Range("A1:A5"), 0), _
                     Application.Match(year, Range("A1:E1"), 0)).Address
End Function

Debug.Print FindCell("COGS", 2014)
' $C$3
Python Script (foreign interface, requiring all objects to be declared)
Try/except/finally is used to properly close the Excel process regardless of whether the script succeeds or fails.
import win32com.client

# MATCHES ROW AND COL INPUT FOR CELL ADDRESS OUTPUT
def FindCell(item, year):
    return(xlWks.Cells(xlApp.WorksheetFunction.Match(item, xlWks.Range("A1:A5"), 0),
                       xlApp.WorksheetFunction.Match(year, xlWks.Range("A1:E1"), 0)).Address)

try:
    xlApp = win32com.client.Dispatch("Excel.Application")
    xlWbk = xlApp.Workbooks.Open('C:/Path/To/Workbook.xlsx')
    xlWks = xlWbk.Worksheets("SHEETNAME")
    print(FindCell("COGS", 2014))
    # $C$3
except Exception as e:
    print(e)
finally:
    xlWbk.Close(False)
    xlApp.Quit
    xlWks = None
    xlWbk = None
    xlApp = None
There are a surprising number of details you need to get right to manipulate Excel files this way with openpyxl. First, it's worth knowing that the xlsx file contains two representations of each cell - the formula, and the current value of the formula. openpyxl can return either, and if you want values you should specify data_only=True when you open the file. Also, openpyxl is not able to calculate a new value when you change the formula for a cell - only Excel itself can do that. So inserting a MATCH() worksheet function won't solve your problem.
The code below does what you want, mostly in Python. It uses the "A1" reference style, and does some calculations to turn column numbers into column letters. This won't hold up well if you go past column Z. In that case, you may want to switch to numbered references to rows and columns. There's some more info on that here and here. But hopefully this will get you on your way.
Note: This code assumes you are reading a workbook called 'test.xlsx', and that 'COGS' is in a list of items in 'Sheet1!A2:A5' and 2014 is in a list of years in 'Sheet1!B1:E1'.
import openpyxl

def get_xlsx_region(xlsx_file, sheet, region):
    """Return a rectangular region from the specified file.

    The data are returned as a list of rows, where each row contains a list
    of cell values."""
    # 'data_only=True' tells openpyxl to return values instead of formulas.
    # 'read_only=True' makes openpyxl much faster (fast enough that it
    # doesn't hurt to open the file once for each region).
    wb = openpyxl.load_workbook(xlsx_file, data_only=True, read_only=True)
    reg = wb[sheet][region]
    return [[cell.value for cell in row] for row in reg]

# cache the lists of years and items
# get the first (only) row of the 'B1:E1' region
years = get_xlsx_region('test.xlsx', 'Sheet1', 'B1:E1')[0]
# get the first (only) column of the 'A2:A5' region
items = [r[0] for r in get_xlsx_region('test.xlsx', 'Sheet1', 'A2:A5')]

def find_correct_cell(year, item):
    # find the indexes for 2014 and 'COGS'
    year_col = chr(ord('B') + years.index(year))  # only works in the A:Z range
    item_row = 2 + items.index(item)
    cell_reference = year_col + str(item_row)
    return cell_reference

print(find_correct_cell(year=2014, item='COGS'))
# C3
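As the answer notes, the chr(ord('B') + ...) arithmetic breaks past column Z. A general 1-based number-to-letters converter is a few lines of base-26 arithmetic (openpyxl ships the equivalent as openpyxl.utils.get_column_letter; this standalone version just shows the logic):

```python
def column_letter(n):
    """Convert a 1-based column number to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while n > 0:
        # Excel lettering is base 26 with no zero digit, hence the n - 1
        n, rem = divmod(n - 1, 26)
        letters = chr(ord('A') + rem) + letters
    return letters

print(column_letter(3))    # C
print(column_letter(26))   # Z
print(column_letter(27))   # AA
print(column_letter(703))  # AAA
```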
