Iterating over rows in a column with XLRD - python

I have been able to get the column to output its values as a separated list. However, I need to retain these values and use them one at a time to perform an Amazon lookup. The Amazon lookup is not the problem; getting XLRD to give me one value at a time is. Is there also an efficient way to set a timer in Python? The only answer I have found to the timer issue is recording the time the process started and counting from there; I would prefer just a timer. This question has two parts. Here is what I have done so far.
I load the spreadsheet with xlrd using argv[1] and copy it to a new spreadsheet name given by argv[2]. argv[3] needs to be the timer interval, but I am not that far yet.
I have tried:
import sys
import datetime
import os
import xlrd
from xlrd.book import colname
from xlrd.book import row
import xlwt
import xlutils
import shutil
import bottlenose
AMAZON_ACCESS_KEY_ID = "######"
AMAZON_SECRET_KEY = "####"
print "Executing ISBN Amazon Lookup Script -- Please be sure to execute it python amazon.py input.xls output.xls 60(seconds between database queries)"
print "Copying original XLS spreadsheet to new spreadsheet file specified as the second argument on the command line."
print "Loading Amazon Account information . . "
amazon = bottlenose.Amazon(AMAZON_ACCESS_KEY_ID, AMAZON_SECRET_KEY)
response = amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
shutil.copy2(sys.argv[1], sys.argv[2])
print "Opening copied spreadsheet and beginning ISBN extraction. . ."
wb = xlrd.open_workbook(sys.argv[2])
print "Beginning Amazon lookup for the first ISBN number."
for row in colname(colx=2):
    print amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
I know this is a little vague. Should I perhaps try something like column = colname(colx=2), so that I could then do for row in column:? Any help or direction is greatly appreciated.

colname() in your code simply returns the name of a column (e.g. 'C' in your case, unless you've overridden the name). It is also called outside the context of your workbook's contents. I think you want to work with a specific sheet from the workbook you are loading, and within that sheet reference the values of a column (column 2 in your example). Does this sound about right?
wb = xlrd.open_workbook(sys.argv[2])
sheet = wb.sheet_by_index(0)
for row in sheet.col(2):
    print amazon.ItemLookup(ItemId="row", ResponseGroup="Offer Summaries", SearchIndex="Books", IdType="ISBN")
Looking at the call to amazon.ItemLookup(), though, you probably want to pass row rather than "row": the latter is simply a string, while the former is the actual contents of the variable named row from your for loop. Since sheet.col() returns Cell objects, the value itself is available as row.value.
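As a minimal sketch of how the two parts of the question might fit together (not a drop-in replacement for the script): xlrd's sheet.col_values() returns the plain values of a column, and time.sleep() is the usual way to pause between queries. The helper names normalize_isbn, column_values and lookup_all are made up for illustration, and the lookup callable stands in for the Amazon call.

```python
import time


def normalize_isbn(value):
    """Turn a cell value into an ISBN string.

    Excel stores numeric cells as floats, so 9780131103627.0 becomes
    "9780131103627"; text cells are only stripped of whitespace.
    """
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value).strip()


def column_values(path, colx=2, sheet_index=0):
    """Read one column from an .xls file (requires xlrd to be installed)."""
    import xlrd  # imported here so the rest of the sketch runs without it

    sheet = xlrd.open_workbook(path).sheet_by_index(sheet_index)
    return sheet.col_values(colx)


def lookup_all(values, lookup, delay_seconds=0.0):
    """Call lookup() once per non-empty value, pausing between calls."""
    results = []
    for value in values:
        isbn = normalize_isbn(value)
        if not isbn:
            continue  # skip blank cells
        if results and delay_seconds > 0:
            time.sleep(delay_seconds)  # wait before every call except the first
        results.append((isbn, lookup(isbn)))
    return results
```

With the script from the question, this could be called as lookup_all(column_values(sys.argv[2]), my_lookup, float(sys.argv[3])), where my_lookup is a small function wrapping amazon.ItemLookup(ItemId=isbn, ...).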

Related

How to read outline levels using Python `openpyxl`?

My organization has a clean export for bills of materials (BOM). I would like to automatically parse the Excel file to check the BOM for certain attributes.
At the moment, I'm using Python with openpyxl.
I can read the Excel workbook and worksheet just fine, but I cannot seem to find the attribute that contains the "outline level" of each row (I fully concede that I may be using the wrong terminology; another candidate term might be "group").
When I look at my file in Excel, I see this at the left of the screen:
I would like to extract the 1 2 3 4 5 from each of the rows and tell which grouping they are in.
My initial code is:
from pathlib import Path
import openpyxl as xl
path = Path('<path-to-my-file>.xlsx')
wb = xl.load_workbook(filename=path)
sh = wb.worksheets[0]
# ... would like to put outline level reading code here
From reading other questions, I suspect that I need to look at the row_dimension.group method of the worksheet, but I can't seem to get a handle on the syntax or the exact attribute that I'm looking for.
Thanks for the post. I was struggling with the same problem, and seeing your post gave me an idea!
I overcame it with the following code:
from pathlib import Path
import openpyxl as xl
path = Path('<path-to-my-file>.xlsx')
wb = xl.load_workbook(filename=path)
sh = wb.worksheets[0]
for row in sorted(sh.row_dimensions):
    # outlineLevel and outline_level are two names for the same value
    outline1 = sh.row_dimensions[row].outlineLevel
    outline2 = sh.row_dimensions[row].outline_level
    print(row, sh.row_dimensions[row], outline1, outline2)
Maybe you can use the following code to get each row's outline level as an integer. I use similar code, with a few more lines, to find the maximum outline level in a sheet.
for index in range(ws.min_row, ws.max_row + 1):
    row_level = ws.row_dimensions[index].outline_level + 1
Here row_level is the outline level, which you may use as required. Please double-check the + 1, though: if I remember correctly, you need to increase the stored value by one to get the true level.
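For what it's worth, here is a small sketch of collecting the levels for a whole sheet and finding the deepest one. It assumes ws is an openpyxl worksheet; the function names row_outline_levels and max_outline_level are made up for illustration. It returns the stored values without adding 1, so an ungrouped row comes back as 0.

```python
def row_outline_levels(ws):
    """Map row number -> stored outline level for every row of a worksheet.

    `ws` is assumed to be an openpyxl worksheet. A stored level of 0 means
    the row is not grouped.
    """
    levels = {}
    # range() excludes its end value, hence max_row + 1
    for index in range(ws.min_row, ws.max_row + 1):
        # "or 0" guards against a level that was stored as None
        levels[index] = ws.row_dimensions[index].outline_level or 0
    return levels


def max_outline_level(ws):
    """Deepest grouping level used in the sheet (0 if nothing is grouped)."""
    return max(row_outline_levels(ws).values(), default=0)
```

Whether to add 1 afterwards depends on whether you want the stored value or the number Excel shows on its outline buttons, so it seemed safer to leave that to the caller.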

Need help creating a loop that will go through row by row in Excel

I'm a beginner at Python, and I have been trying my hand at some projects. I have an excel spreadsheet that contains a column of URLs that I want to open, pull some data from, output to a different column on my spreadsheet, and then go down to the next URL and repeat.
I was able to write code that completes almost the entire process if I enter a single URL, but I'm not good at writing loops.
My list is only 10 cells long.
My question is: what code can I use to loop through a column until it hits a stopping point?
import urllib.request, csv, pandas as pd
from openpyxl import load_workbook
xl = pd.ExcelFile("filename.xlsx")
ws = xl.parse("Sheet1")
i = 0 # This is where I insert the row number for a specific URL
urlpage = str(ws['URLPage'][i]) # 'URLPage' is the name of the column in Excel
p = urlpage.replace(" ", "") # This line is for deleting whitespace in my URL
response = urllib.request.urlopen(p)
Also, as stated, I'm new to Python, so if you see where I can improve the code I already have, please let me know.
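One possible shape for such a loop, as a rough sketch only: it treats the first empty cell as the stopping point, which is an assumption about what "stopping point" means here. The name iter_urls is made up for illustration; the whitespace cleanup mirrors the replace() call in the code above.

```python
def iter_urls(cells):
    """Yield cleaned URLs from a column, stopping at the first empty cell."""
    for cell in cells:
        # pandas represents empty cells as NaN, and NaN is the only value
        # that is not equal to itself
        if cell is None or cell != cell:
            break
        url = str(cell).replace(" ", "")  # drop stray whitespace in the URL
        if not url:
            break
        yield url
```

With the DataFrame from the question this might be used as for url in iter_urls(ws['URLPage']): followed by the urllib.request.urlopen(url) call, so the row number no longer has to be entered by hand.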

gspread - get_all_values() returns an empty list

If I call a sheet by name, get_all_values function will always give me an empty list for a sheet that is definitely not empty.
import gspread
sheet = workbook.worksheet(sheet_name)
all_rows_list = sheet.get_all_values()
The only time get_all_values seems to return like it should is if I do the following:
all_rows_list = workbook.sheet1.get_all_values()
But the above works just for the first sheet and for no other, which is kind of useless for a workbook with more sheets.
What always works is reading row by row like
one_row_list = sheet.row_values(1) # first row
But the problem is that I'm trying to read a relatively big workbook with lots of sheets to figure out where I'm supposed to start writing, and it looks like reading row by row triggers "RESOURCES EXHAUSTED" error very fast.
So, am I doing something wrong or is get_all_values broken in gspread?
EDIT:
Added a screenshot.
gspread doesn't work well with sheets whose names could be mistaken for a cell reference in A1 notation (like X101 and AT8 in your case).
https://github.com/burnash/gspread/issues/554 is an older issue that describes the underlying problem (the symptoms in that issue are different, but I'm pretty sure the root problem is the same).
I'll copy the workaround of providing a range, which you've discovered yourself:
ws.range("A1:C"+str(end_row))
Here end_row is usually the row_count of the sheet.
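A rough sketch of that workaround, in case it helps: ws.range() returns a flat list of cells, so they need regrouping to look like the output of get_all_values(). The helper names a1_range and cells_to_rows are made up for illustration, and the cells are only assumed to expose row, col and value the way gspread's Cell objects do.

```python
def a1_range(end_row, last_col="C", first_cell="A1"):
    """Build a range string such as 'A1:C25'."""
    return "{}:{}{}".format(first_cell, last_col, end_row)


def cells_to_rows(cells):
    """Group a flat list of cells into a list of rows of values."""
    by_row = {}
    for cell in cells:
        by_row.setdefault(cell.row, {})[cell.col] = cell.value
    # sort by row number, then by column number within each row
    return [
        [columns[col] for col in sorted(columns)]
        for _, columns in sorted(by_row.items())
    ]
```

Usage would be something like cells_to_rows(ws.range(a1_range(ws.row_count))), which fetches the whole block in one request instead of one request per row.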

How to pull last cell in column using openpyxl in python

I created a small program that writes to an Excel file. I have another program that needs to read the last entry (in column A) every day. Since new data is imported into the Excel file every day, the cell that I need to capture changes.
Is there a way for me to grab the last cell in column A using openpyxl in Python?
I don't have much experience with this, so I wasn't sure where to start.
import openpyxl
wb = openpyxl.load_workbook('text.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
From https://openpyxl.readthedocs.io/en/stable/tutorial.html:
Try this; it should get the entire A column and take the last entry:
sheet['A'][-1]
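One caveat: sheet['A'] runs down to the sheet's last used row, so if another column is longer than column A, the last entry can be an empty cell. A small sketch that walks upward until it finds a value, assuming sheet is an openpyxl worksheet (the name last_filled_cell is made up for illustration):

```python
def last_filled_cell(ws, column=1):
    """Return (row, value) of the last non-empty cell in a column, or None."""
    for row in range(ws.max_row, 0, -1):  # start at the bottom, walk up
        value = ws.cell(row=row, column=column).value
        if value is not None:
            return row, value
    return None  # the column is completely empty
```

For column A of the sheet above this would be called as last_filled_cell(sheet).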

How can I make this python(using openpyxl) program run faster?

Here is my code:
import openpyxl
import os
os.chdir('c:\\users\\Desktop')
wb= openpyxl.load_workbook(filename= 'excel.xlsx',data_only = True)
wb.create_sheet(index=0,title='Summary')
sumsheet= wb.get_sheet_by_name('Summary')
print('Creating Summary Sheet')
#loop through worksheets
print('Looping Worksheets')
for sheet in wb.worksheets:
    for row in sheet.iter_rows():
        for cell in row:
            #find headers of columns needed
            if cell.value=='LowLimit':
                lowCol=cell.column
            if cell.value=='HighLimit':
                highCol=cell.column
            if cell.value=='MeasValue':
                measCol=cell.column
        #name new columns
        sheet['O1']='meas-low'
        sheet['P1']='high-meas'
        sheet['Q1']='Minimum'
        sheet['R1']='Margin'
        #find how many rows of each sheet
        maxrow=sheet.max_row
        i=0
        #subtraction using max row
        for i in range(2,maxrow+1):
            if sheet[str(highCol)+str(i)].value=='---':
                sheet['O'+str(i)]='='+str(measCol)+str(i)+'-'+str(lowCol)+str(i)
                sheet['P'+str(i)]='=9999'
                sheet['Q'+str(i)]='=MIN(O'+str(i)+':P'+str(i)+')'
                sheet['R'+str(i)]='=IF(AND(Q'+str(i)+'<3,Q'+str(i)+'>-3),"Marginal","")'
            elif sheet[str(lowCol)+str(i)].value=='---':
                sheet['O'+str(i)]='=9999'
                sheet['P'+str(i)]='='+str(highCol)+str(i)+'-'+str(measCol)+str(i)
                sheet['Q'+str(i)]='=MIN(O'+str(i)+':P'+str(i)+')'
                sheet['R'+str(i)]='=IF(AND(Q'+str(i)+'<3,Q'+str(i)+'>-3),"Marginal","")'
            else:
                sheet['O'+str(i)]='='+str(measCol)+str(i)+'-'+str(lowCol)+str(i)
                sheet['P'+str(i)]='='+str(highCol)+str(i)+'-'+str(measCol)+str(i)
                sheet['Q'+str(i)]='=MIN(O'+str(i)+':P'+str(i)+')'
                sheet['R'+str(i)]='=IF(AND(Q'+str(i)+'<3,Q'+str(i)+'>-3),"Marginal","")'
            ++i
print('Saving new wb')
import os
os.chdir('C:\\Users\\hpj683\\Desktop')
wb.save('example.xlsx')
This runs perfectly fine, except that it takes 4 minutes to process one Excel workbook. Is there any way I can optimize my code to make it run faster? My research online suggested switching to read_only or write_only mode; however, my code needs both to read from and to write to the workbook, so neither of those worked.
The code would benefit from being broken down into separate functions. This will help you identify the slow parts and replace them one by one.
The following should not be inside the loop that runs for every row:
- finding the headers
- calling ws.max_row, which is very expensive
- ws["C" + str(i)]; use ws.cell(row=i, column=3) instead
And if the nested loop is not a formatting error, then why is it nested?
You should also look at the profile module to find out what is slow. You might want to watch my talk on profiling openpyxl from last year's PyCon UK.
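To make the points above concrete, here is one way the per-sheet work might be split into functions. This is a sketch under assumptions, not a measured fix: it assumes the headers sit in row 1 and that ws is an openpyxl worksheet, and the names find_header_columns, margin_formulas and fill_sheet are made up for illustration. The formulas and the '---' checks are the ones from the question.

```python
HEADERS = ("meas-low", "high-meas", "Minimum", "Margin")
OUT_COLUMNS = (15, 16, 17, 18)  # columns O, P, Q, R


def find_header_columns(ws, wanted):
    """Scan only the first row and map header text -> 1-based column index."""
    found = {}
    for row in ws.iter_rows(min_row=1, max_row=1, min_col=1):
        for index, cell in enumerate(row, start=1):
            if cell.value in wanted:
                found[cell.value] = index
    return found


def margin_formulas(row, low_ref, high_ref, meas_ref, low_missing, high_missing):
    """Build the four formulas for one row; refs are addresses like 'C5'."""
    meas_low = "={}-{}".format(meas_ref, low_ref)
    high_meas = "={}-{}".format(high_ref, meas_ref)
    if high_missing:  # same branch order as the original if/elif/else
        high_meas = "=9999"
    elif low_missing:
        meas_low = "=9999"
    minimum = "=MIN(O{0}:P{0})".format(row)
    margin = '=IF(AND(Q{0}<3,Q{0}>-3),"Marginal","")'.format(row)
    return [meas_low, high_meas, minimum, margin]


def fill_sheet(ws):
    """Add the four computed columns to one worksheet."""
    cols = find_header_columns(ws, {"LowLimit", "HighLimit", "MeasValue"})
    last_row = ws.max_row  # looked up once, not once per row
    for column, title in zip(OUT_COLUMNS, HEADERS):
        ws.cell(row=1, column=column, value=title)
    for r in range(2, last_row + 1):
        low = ws.cell(row=r, column=cols["LowLimit"])
        high = ws.cell(row=r, column=cols["HighLimit"])
        meas = ws.cell(row=r, column=cols["MeasValue"])
        formulas = margin_formulas(
            r, low.coordinate, high.coordinate, meas.coordinate,
            low.value == "---", high.value == "---",
        )
        for column, formula in zip(OUT_COLUMNS, formulas):
            ws.cell(row=r, column=column, value=formula)
```

The main loop then becomes a call to fill_sheet(sheet) for each data sheet (skipping the empty Summary sheet, which has no headers), and each function can be profiled on its own.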
Good luck!
