How can I make this Python (using openpyxl) program run faster?

Here is my code:
import openpyxl
import os

os.chdir('c:\\users\\Desktop')
wb = openpyxl.load_workbook(filename='excel.xlsx', data_only=True)
wb.create_sheet(index=0, title='Summary')
sumsheet = wb.get_sheet_by_name('Summary')
print('Creating Summary Sheet')

# loop through worksheets
print('Looping Worksheets')
for sheet in wb.worksheets:
    # find the headers of the columns needed
    for row in sheet.iter_rows():
        for cell in row:
            if cell.value == 'LowLimit':
                lowCol = cell.column
            if cell.value == 'HighLimit':
                highCol = cell.column
            if cell.value == 'MeasValue':
                measCol = cell.column
    # name the new columns
    sheet['O1'] = 'meas-low'
    sheet['P1'] = 'high-meas'
    sheet['Q1'] = 'Minimum'
    sheet['R1'] = 'Margin'
    # find how many rows the sheet has
    maxrow = sheet.max_row
    # subtraction using max row
    for i in range(2, maxrow + 1):
        if sheet[str(highCol) + str(i)].value == '---':
            sheet['O' + str(i)] = '=' + str(measCol) + str(i) + '-' + str(lowCol) + str(i)
            sheet['P' + str(i)] = '=9999'
            sheet['Q' + str(i)] = '=MIN(O' + str(i) + ':P' + str(i) + ')'
            sheet['R' + str(i)] = '=IF(AND(Q' + str(i) + '<3,Q' + str(i) + '>-3),"Marginal","")'
        elif sheet[str(lowCol) + str(i)].value == '---':
            sheet['O' + str(i)] = '=9999'
            sheet['P' + str(i)] = '=' + str(highCol) + str(i) + '-' + str(measCol) + str(i)
            sheet['Q' + str(i)] = '=MIN(O' + str(i) + ':P' + str(i) + ')'
            sheet['R' + str(i)] = '=IF(AND(Q' + str(i) + '<3,Q' + str(i) + '>-3),"Marginal","")'
        else:
            sheet['O' + str(i)] = '=' + str(measCol) + str(i) + '-' + str(lowCol) + str(i)
            sheet['P' + str(i)] = '=' + str(highCol) + str(i) + '-' + str(measCol) + str(i)
            sheet['Q' + str(i)] = '=MIN(O' + str(i) + ':P' + str(i) + ')'
            sheet['R' + str(i)] = '=IF(AND(Q' + str(i) + '<3,Q' + str(i) + '>-3),"Marginal","")'

print('Saving new wb')
os.chdir('C:\\Users\\hpj683\\Desktop')
wb.save('example.xlsx')
This runs perfectly fine, except that it takes four minutes to complete one workbook. Is there any way I can optimize my code to make it run faster? My research online suggested switching to read_only or write_only mode, but my code needs to both read and write the workbook, so neither of those worked.

The code would benefit from being broken down into separate functions. This will help you identify the slow bits and replace them bit by bit.
The following bits should not run once for every row:

- finding the headers
- calling ws.max_row, which is very expensive
- indexing like ws["C" + str(i)]; use ws.cell(row=i, column=3) instead
And if the nested loop is not a formatting error, why is it nested?
Also, you should look at the profile module to find out what is slow. You might want to watch my talk on profiling openpyxl from last year's PyCon UK.
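For illustration, here is a rough, untested sketch of that restructuring. It assumes the headers live in row 1 of each data sheet and a recent openpyxl where cell.column_letter gives the column letter (your code assumes an older version where cell.column was the letter); adjust names to your data:

import openpyxl

def find_headers(sheet):
    # scan only the header row instead of every cell of the sheet
    headers = {}
    for cell in sheet[1]:
        if cell.value in ('LowLimit', 'HighLimit', 'MeasValue'):
            headers[cell.value] = cell.column_letter
    return headers

def add_formulas(sheet):
    cols = find_headers(sheet)
    low, high, meas = cols['LowLimit'], cols['HighLimit'], cols['MeasValue']
    sheet['O1'], sheet['P1'] = 'meas-low', 'high-meas'
    sheet['Q1'], sheet['R1'] = 'Minimum', 'Margin'
    maxrow = sheet.max_row  # called once per sheet, outside the row loop
    for i in range(2, maxrow + 1):
        no_low = sheet[low + str(i)].value == '---'
        no_high = sheet[high + str(i)].value == '---'
        # O: meas-low, or 9999 when there is no low limit
        sheet.cell(row=i, column=15).value = (
            '=9999' if no_low else '={m}{r}-{l}{r}'.format(m=meas, l=low, r=i))
        # P: high-meas, or 9999 when there is no high limit
        sheet.cell(row=i, column=16).value = (
            '=9999' if no_high else '={h}{r}-{m}{r}'.format(h=high, m=meas, r=i))
        sheet.cell(row=i, column=17).value = '=MIN(O{r}:P{r})'.format(r=i)
        sheet.cell(row=i, column=18).value = (
            '=IF(AND(Q{r}<3,Q{r}>-3),"Marginal","")'.format(r=i))

wb = openpyxl.load_workbook('excel.xlsx', data_only=True)
wb.create_sheet(index=0, title='Summary')
for sheet in wb.worksheets:
    if sheet.title != 'Summary':  # the new summary sheet has no headers
        add_formulas(sheet)
wb.save('example.xlsx')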
Good luck!

Related

Openpyxl compare subsequent rows in column

I've been learning Python for the express purpose of creating a program that automates part of my job. I'm far enough along in the learning process to feel comfortable taking on a small part of the problem.
I want to create a function that combines two cells into one, keeping just one of their values (meaning I don't want to concatenate), if they are equal to each other. If they aren't, it should pass.
I don't know how to express this in a for loop. I really want to complete this project myself, but I need a jumping-off point. Any help is greatly appreciated.
I've created a virtual environment and have the following code. I understand indexing with for loops for lists, but don't know how it works with openpyxl. Again, I am very new to programming in general, but am excited to work on this problem. The issue I have now, which I haven't been able to find online, is how to refer to a cell's location and then refer to the subsequent cell's location.
from openpyxl.workbook import Workbook
from openpyxl import load_workbook

wb = Workbook()
ws = wb.active
# load existing spreadsheet
wb = load_workbook('input.xlsx')
column_i = ws['I']

def same_name():
    for cell in column_i:
        if cell.value[cell] == cell.value[cell + 1]:
            pass  # this comparison is as far as I've got
You can use iter_cols to loop through each column, then use merge_cells:
from openpyxl import load_workbook

def merge_subsequently(inpath, outpath):
    wb = load_workbook(inpath)
    ws = wb.active
    for col in ws.iter_cols():
        # compare each cell with the one directly below it
        for _1st, _2nd in zip(col, col[1:]):
            if _1st.value == _2nd.value:
                ws.merge_cells(start_row=_1st.row, start_column=_1st.column,
                               end_row=_2nd.row, end_column=_2nd.column)
    wb.save(outpath)

p1 = "/Desktop/input.xlsx"
p2 = "/Desktop/output.xlsx"
merge_subsequently(p1, p2)

Freeze Panes first two rows and column with openpyxl

Trying to freeze the first two rows and the first column with openpyxl; however, whenever I do, Excel says that the file is corrupted and nothing is frozen.
Current code:
import openpyxl
from openpyxl.worksheet.views import Pane

workbook = openpyxl.load_workbook(path)
worksheet = workbook[first_sheet]
freeze_panes = Pane(xSplit=2000, ySplit=3000, state="frozen", activePane="bottomRight")
worksheet.sheet_view.pane = freeze_panes
I took a look at the documentation, but there is little explanation of the parameter settings.
I came across this answer, but it fits a specific use case, so I wanted to ask a general question for future reference:
How to split Excel screen with Openpyxl?
To freeze the first two rows and the first column, use the sample code below; ws.freeze_panes works. Note that, as in Excel itself, you select the cell above and to the left of which everything should stay frozen. So, in your case, that cell is B3 (rows 1 and 2 above it, column A to its left). Hope this is what you are looking for.
import openpyxl
wb=openpyxl.load_workbook('Sample.xlsx')
ws=wb['Sheet1']
mycell = ws['B3']
ws.freeze_panes = mycell
wb.save('Sample.xlsx')

Deleting rows from a large file using openpyxl

I'm working with openpyxl on an .xlsx file which has around 10K products, some of which are "regular items" and some of which are products that need to be ordered when required. For this project I would like to delete all of the rows containing the items that need to be ordered.
I tested this with a small sample of the actual workbook and the code worked the way I wanted. However, when I tried it on the actual workbook with 10K rows it seems to take forever to delete those rows (it has been running for nearly an hour now).
Here's the code that I used:
import openpyxl

wb = openpyxl.load_workbook('prod.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')

def clean_workbook():
    for row in sheet:
        for cell in row:
            if cell.value == 'ordered':
                sheet.delete_rows(cell.row)
Is there a faster way of doing this with some tweaks to my code? Or is there a better way to read just the regular stock from the workbook, without deleting the unwanted items?
Deleting rows in loops can be slow because openpyxl has to update all the cells below the row being deleted. Therefore, you should do this as little as possible. One way is to collect a list of row numbers, check for contiguous groups, and then delete from the bottom using this list, as sketched below.
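A rough, untested sketch of that idea, reusing the sheet from your code (the 'ordered' flag is matched in any column, as in your loop):

rows_to_delete = sorted(
    {cell.row for row in sheet.iter_rows() for cell in row
     if cell.value == 'ordered'},
    reverse=True)

# walk from the bottom, grouping contiguous runs so each run
# costs only a single delete_rows() call
i = 0
while i < len(rows_to_delete):
    start = i
    while (i + 1 < len(rows_to_delete)
           and rows_to_delete[i + 1] == rows_to_delete[i] - 1):
        i += 1
    # rows_to_delete[i] is the topmost row of this contiguous run
    sheet.delete_rows(rows_to_delete[i], amount=i - start + 1)
    i += 1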
A better approach might be to loop through ws.values and write to a new worksheet, filtering out the relevant rows as you go. Copy any other relevant data, such as formatting, separately. Then you can delete the original worksheet and rename the new one.
ws1 = wb['My Sheet']
ws2 = wb.create_sheet('My Sheet New')

for row in ws1.values:
    if row[x] == "ordered":  # we can assume this is always the same column
        continue
    ws2.append(row)

del wb["My Sheet"]
ws2.title = "My Sheet"
For more sophisticated filtering you will probably want to load the values into a Pandas dataframe, make the changes and then write to a new sheet.
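For instance, a rough sketch of the pandas route; the column name "Status" and the file names are assumptions for illustration:

import pandas as pd

df = pd.read_excel('prod.xlsx', sheet_name='Sheet1')
kept = df[df['Status'] != 'ordered']  # keep only the regular items
kept.to_excel('prod_filtered.xlsx', index=False)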
You can open the workbook in read-only mode and read all the content into a list; modifying the list is a lot faster than modifying cells in the worksheet. After you modify the list, create a new workbook and write the list back out. I did it this way with my 100k-row Excel file.
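A minimal, untested sketch of that approach; the file names and the assumption that the 'ordered' flag sits in the fourth column (index 3) are mine, so adjust to your sheet:

from openpyxl import Workbook, load_workbook

wb_in = load_workbook('prod.xlsx', read_only=True)
rows = list(wb_in['Sheet1'].values)  # pull everything into memory once

wb_out = Workbook(write_only=True)
ws_out = wb_out.create_sheet('Sheet1')
for row in rows:
    if row[3] == 'ordered':  # skip items that must be ordered
        continue
    ws_out.append(row)
wb_out.save('prod_clean.xlsx')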

gspread - get_all_values() returns an empty list

If I call a sheet by name, the get_all_values function always gives me an empty list for a sheet that is definitely not empty.
import gspread
sheet = workbook.worksheet(sheet_name)
all_rows_list = sheet.get_all_values()
The only time get_all_values seems to return like it should is if I do the following:
all_rows_list = workbook.sheet1.get_all_values()
But the above works just for the first sheet and for no other, which is kind of useless for a workbook with more sheets.
What always works is reading row by row like
one_row_list = sheet.row_values(1) # first row
But the problem is that I'm trying to read a relatively big workbook with lots of sheets to figure out where I'm supposed to start writing, and reading row by row triggers a "RESOURCES EXHAUSTED" error very quickly.
So, am I doing something wrong or is get_all_values broken in gspread?
EDIT:
Added a screenshot.
gspread doesn't work well with sheets whose names could be confused with a cell reference in A1 notation (like X101 and AT8 in your case).
https://github.com/burnash/gspread/issues/554 is an older issue that describes the underlying problem (the symptoms in that issue are different, but I'm pretty sure the root problem is the same).
I'll copy the workaround you discovered yourself, which provides an explicit range:
ws.range("A1:C" + str(end_row))
That end_row is usually the row_count of the sheet.
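Fleshed out, a minimal sketch of that workaround; the spreadsheet title, the sheet name and service-account auth are assumptions for illustration:

import gspread

gc = gspread.service_account()  # assumes service-account credentials are set up
workbook = gc.open("my spreadsheet")  # hypothetical title
ws = workbook.worksheet("AT8")

end_row = ws.row_count
cells = ws.range("A1:C" + str(end_row))  # flat list of Cell objects
# regroup the flat list into rows of three columns (A..C)
rows = [cells[i:i + 3] for i in range(0, len(cells), 3)]
for row in rows:
    print([cell.value for cell in row])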

Iterating over rows in a column with XLRD

I have been able to get the column to output its values as a separated list. However, I need to retain these values and use them one by one to perform an Amazon lookup; the Amazon lookup is not the problem. Getting xlrd to give me one value at a time has been the problem. Is there also an efficient method of setting a timer in Python? The only answer I have found to the timer issue is recording the time the process started and counting from there; I would prefer just a timer. This question is somewhat two parts; here is what I have done so far.
I load the spreadsheet with xlrd using argv[1] and copy it to a new spreadsheet named by argv[2]; argv[3] needs to be the timer entity, however I am not that far yet.
I have tried:
import sys
import datetime
import os
import xlrd
from xlrd.book import colname
from xlrd.book import row
import xlwt
import xlutils
import shutil
import bottlenose

AMAZON_ACCESS_KEY_ID = "######"
AMAZON_SECRET_KEY = "####"

print "Executing ISBN Amazon Lookup Script -- please execute it as: python amazon.py input.xls output.xls 60 (seconds between database queries)"
print "Copying original XLS spreadsheet to the new spreadsheet file specified as the second argument on the command line."
print "Loading Amazon account information . . ."
amazon = bottlenose.Amazon(AMAZON_ACCESS_KEY_ID, AMAZON_SECRET_KEY)
response = amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
shutil.copy2(sys.argv[1], sys.argv[2])
print "Opening copied spreadsheet and beginning ISBN extraction . . ."
wb = xlrd.open_workbook(sys.argv[2])
print "Beginning Amazon lookup for the first ISBN number."
for row in colname(colx=2):
    print amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
I know this is a little vague. Should I perhaps try something like column = colname(colx=2), so that I could do for row in column:? Any help or direction is greatly appreciated.
The use of colname() in your code simply returns the name of the column (e.g. 'C' by default in your case, unless you've overridden the name). Also, colname operates outside the context of the contents of your workbook. I would think you want to work with a specific sheet from the workbook you are loading, and within that sheet reference the values of a column (2 in your example). Does that sound about right?
wb = xlrd.open_workbook(sys.argv[2])
sheet = wb.sheet_by_index(0)
for row in sheet.col(2):
    print amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
Although, looking at the call to amazon.ItemLookup(), you probably want to refer to row and not "row", as the latter is simply a string and the former is the actual contents of the variable named row from your for loop.
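Putting both fixes together, a rough, untested sketch; note that sheet.col(2) yields Cell objects, so you want their .value, and the time.sleep() call is my guess at the argv[3] timer you mentioned:

import time

wb = xlrd.open_workbook(sys.argv[2])
sheet = wb.sheet_by_index(0)
delay = float(sys.argv[3])  # seconds to wait between queries
for cell in sheet.col(2):
    isbn = str(cell.value)  # pass the cell's contents, not the string "row"
    print amazon.ItemLookup(ItemId=isbn, ResponseGroup="Offer Summaries",
                            SearchIndex="Books", IdType="ISBN")
    time.sleep(delay)  # simple timer between lookups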
