I have an Excel spreadsheet containing data as follows:
Serial Number SAMPLE ID SAMPLE NAME
value value value
value value value
value value value......
Basically a table of entries. I do not know how many entries the table will have. I am writing Python code with xlrd to extract the values from Excel. The first thing I want to do is determine the number of entries present, so I use the following piece of code:
kicker = 0
counter = 0
rownum = 5
colnum = 1
while (kicker == 0):
    if sh.cell_value(rowx=rownum, colx=colnum) is None:
        kicker = 1
    else:
        counter = counter + 1
    rownum = rownum + 1
print("done")
The code scans through the values and successfully reads the entries that have a value in the first field. The problem is that when I get to the first row without a value in the first field, xlrd gives me a "list index out of range" error. So I read the last valid value, but as soon as I read the first empty cell, I get the error. How can I determine the number of entries in my "table" without having xlrd throw an out of range error?
You should query nrows instead of using a potentially endless loop.
counter = 0
colnum = 1
for rownum in range(5, sh.nrows):
    if sh.cell_type(rowx=rownum, colx=colnum) in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK):
        break  # first empty cell: the table ends here
    counter = counter + 1
print("done")
I looked up how to test for an empty cell here: How to detect if a cell is empty when reading Excel files using the xlrd library?
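One thing worth knowing about the loop in the question: xlrd reports an empty cell as an empty string, not None, so an `is None` test never fires and the loop keeps going until it runs past the last row. A minimal sketch of the counting step, written against a plain list so it can be tried without a workbook (`count_until_blank` is a hypothetical helper, not part of xlrd):

```python
def count_until_blank(values):
    """Count leading entries until the first blank ('' or None)."""
    count = 0
    for value in values:
        if value in ("", None):
            break
        count += 1
    return count


# Stand-in for one spreadsheet column, read from row 5 downwards.
column = ["A-001", "A-002", "A-003", "", "stray note"]
print(count_until_blank(column))
```

With xlrd this could be called as count_until_blank(sh.col_values(colnum, start_rowx=5)). Since col_values only returns cells that exist in the sheet, no index can go out of range.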
Related
I'm working with an xlsx file that is divided into sections by empty rows, and each section displays its information in a different manner, i.e. with different columns.
So I'm basically trying to find the section that I'm looking for ('Ação') and create a range from its next line, where the headers are, until the next empty row, so I can create a DataFrame from this range.
When I try to print the index, it returns a tuple containing the values of the row, but I couldn't find a way to return its index (an integer).
from openpyxl import load_workbook
data = '2019/02/07'
symbol = 'EQTL3'
ano = data[0:4]
mes = data[5:7]
dia = data[8:10]
file = "Fundo_{}{}{}.xlsx".format(ano, mes, dia)
wb = load_workbook(filename=file, read_only=False)
ws = wb["Fundo_{}{}{}".format(ano, mes, dia)]
for cell in ws['A']:
    if (cell.value == 'Ação'):
        x = int(cell.coordinate[1:]) + 1
        for index in ws.iter_rows(min_row=x, max_col=ws.max_column, max_row=ws.max_row, values_only=True):
            if (index[0] == None):
                y = ws._current_row
                break
I expect to receive an integer: the index of the last non-empty row.
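You can use enumerate for that.
Something like this (start=x makes row_idx match the worksheet row number):
for row_idx, row_of_cells in enumerate(ws.iter_rows(min_row=x, values_only=True), start=x):
    ...  # row_idx is the worksheet row number of row_of_cells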
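A sketch of how the enumerate idea could be completed to get the integer the question asks for. `last_filled_row` is a hypothetical helper. It accepts any iterable of row tuples, such as the one ws.iter_rows(min_row=x, values_only=True) yields, and the sample data below is made up:

```python
def last_filled_row(rows, first_row):
    """Return the worksheet row number of the last row before the first row
    whose first cell is None, or None if the very first row is already empty."""
    last = None
    for row_idx, row_of_cells in enumerate(rows, start=first_row):
        if row_of_cells[0] is None:
            break
        last = row_idx
    return last


# Stand-in for ws.iter_rows(min_row=12, values_only=True):
# a header row, two data rows, a blank separator row, then the next section.
section = [("Code", "Qty"), ("EQTL3", 100), ("PETR4", 250), (None, None), ("Other", None)]
print(last_filled_row(section, first_row=12))
```

Passing the same x to both min_row and first_row keeps the returned number aligned with the sheet, so it can be used directly as max_row when building the DataFrame range.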
I have a medium-sized Excel file, with about 25000 rows.
In the Excel file I check whether a specific column value is in a list, and if it is in the list I delete the row.
I'm using openpyxl.
The code:
count = 1
while count <= ws.max_row:
    if ws.cell(row=count, column=2).value in remove_list:
        ws.delete_rows(count, 1)
    else:
        count += 1
wb.save(src)
The code works, but it is very slow (it takes hours) to finish.
I know that there are read-only and write-only modes, but in my case I need both: first checking, then deleting.
I see you are using a list of rows which you need to delete. Instead, you can create "sequences" of rows to delete, thus changing a delete list like [2,3,4,5,6,7,8,45,46,47,48] to one like [[2, 7],[45, 4]]
i.e. Delete 7 rows starting at row 2, then delete 4 rows starting at row 45
Deleting in bulk is faster than 1 by 1. I deleted 6k rows in around 10 seconds
The following code will convert a list to a list of lists/sequences:
def get_sequences(list_of_ints):
    sequence_count = 1
    sequences = []
    for row in list_of_ints:
        next_item = None
        if list_of_ints.index(row) < (len(list_of_ints) - 1):
            next_item = list_of_ints[list_of_ints.index(row) + 1]
        if (row + 1) == next_item:
            sequence_count += 1
        else:
            first_in_sequence = list_of_ints[list_of_ints.index(row) - sequence_count + 1]
            sequences.append([first_in_sequence, sequence_count])
            sequence_count = 1
    return sequences
Then run another loop to delete
for sequence in sequences:
    sheet.delete_rows(sequence[0], sequence[1])
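Note that get_sequences calls list_of_ints.index(row) for every element, which rescans the list each time and picks the wrong position if a row number appears twice. A single-pass alternative, as a sketch that assumes the input is sorted in ascending order (`group_runs` is a hypothetical name):

```python
def group_runs(sorted_rows):
    """Collapse sorted row numbers into [start, count] runs of consecutive rows."""
    runs = []
    for row in sorted_rows:
        if runs:
            start, count = runs[-1]
            if row < start + count:
                continue              # duplicate of a row already in the run
            if row == start + count:
                runs[-1][1] += 1      # row extends the current run
                continue
        runs.append([row, 1])         # first row, or a gap: start a new run
    return runs


print(group_runs([2, 3, 4, 5, 6, 7, 8, 45, 46, 47, 48]))
```

The result has the same [start, count] shape as the output of get_sequences, so it can feed the same delete loop.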
Personally, I would do two things:
First, transform the list into a set so that looking up an item takes less time:
remove_set = set(remove_list)
...
if ws.cell(row=count, column=2).value in remove_set:
Then, I would avoid removing the rows in place, as it takes a lot of time to reorganise the data structures representing the sheet.
I would create a new blank worksheet and add to it only the rows which must be kept.
Then save the new worksheet, overwriting the original if you wish.
If it still takes too long, consider using a CSV format so you can treat the input data as text and output it the same way, re-importing the data later from the spreadsheet program (e.g. MS Excel).
Have a look at the official docs and at this tutorial to find out how to use the CSV library.
Further note: as spotted by @Charlie Clark, the calculation of
ws.max_row
may take some time as well and there is no need to repeat it.
To do that, the easiest solution is to work backwards from the last row down to the first, so that the deleted rows do not affect the position of the ones before them.
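A minimal sketch of the backwards walk, with the row count read once before the loop. To keep it runnable without a workbook, the sheet is simulated by a plain list, and the function name `delete_rows_backwards` and the `delete_row` callback are assumptions made for this illustration:

```python
def delete_rows_backwards(column_values, remove_set, delete_row):
    """Walk from the last row up to the first, calling delete_row(row_number)
    for every row whose value is in remove_set. Returns the number deleted."""
    deleted = 0
    last_row = len(column_values)        # read the row count once, up front
    for row in range(last_row, 0, -1):   # rows are 1-based, like worksheet rows
        if column_values[row - 1] in remove_set:
            delete_row(row)
            deleted += 1
    return deleted


# Demo on a plain list standing in for column 2 of the sheet.
sheet_rows = ["keep", "drop", "drop", "keep", "drop"]
snapshot = list(sheet_rows)
delete_rows_backwards(snapshot, {"drop"}, lambda row: sheet_rows.pop(row - 1))
print(sheet_rows)
```

With openpyxl, column_values could be collected as [row[0] for row in ws.iter_rows(min_col=2, max_col=2, values_only=True)] and ws.delete_rows passed as delete_row. Because deletion proceeds from the bottom, the row numbers still to be visited never shift.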
When a number of rows have to be deleted from a sheet, I create a list of these row numbers, e.g. remove_list and then I rewrite the sheet to a temporary sheet, excluding these rows. I delete the original sheet and rename the temporary sheet to the original sheet. See my function for doing this below:
def delete_excel_rows_with_openpyxl(workbook, sheet, remove_list):
    """ Delete rows with row numbers in remove_list from sheet contained in workbook """
    rows_to_remove = set(remove_list)  # fast lookup; leaves the caller's list untouched
    temp_sheet = workbook.create_sheet('TempSheet')
    destination_row_counter = 1
    # start=1 because worksheet rows are numbered from 1
    for source_row_counter, source_row in enumerate(sheet.iter_rows(min_row=1, max_row=sheet.max_row), start=1):
        if source_row_counter in rows_to_remove:
            continue  # do not copy row
        # copy row
        for column_count, cell in enumerate(source_row, start=1):
            temp_sheet.cell(row=destination_row_counter, column=column_count).value = cell.value
        destination_row_counter = destination_row_counter + 1
    sheet_title = sheet.title
    workbook.remove(sheet)  # remove_sheet() is deprecated in current openpyxl
    temp_sheet.title = sheet_title
    return workbook, temp_sheet
Adding on to ketdaddy's response. I tested it and noticed that when you use this sequence in a for loop as suggested, you need to update the row number in every loop to account for the deleted rows.
For example, when you get to the second step in the loop, the start row is not the original start row, it's the original start row minus the rows which were previously deleted.
This code will update ketdaddy's sequence to generate a sequence which takes this into account.
original_sequence = get_sequences(deleterows)
updated_sequence = []
cumdelete = 0
for start, delete in original_sequence:
    new_start = start - cumdelete
    cumdelete = cumdelete + delete
    updated_sequence.append([new_start, delete])
updated_sequence
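The adjustment can be checked without a workbook by replaying the deletions on a plain list. A small sketch (the helper names `shift_for_deleted` and `apply_deletes` are made up for this illustration, and the list only mimics what sheet.delete_rows does to row numbers):

```python
def shift_for_deleted(sequences):
    """Rewrite [start, count] pairs so each start accounts for rows removed earlier."""
    shifted = []
    removed_so_far = 0
    for start, count in sequences:
        shifted.append([start - removed_so_far, count])
        removed_so_far += count
    return shifted


def apply_deletes(rows, sequences):
    """Mimic sheet.delete_rows(start, count) on a list, using 1-based row numbers."""
    rows = list(rows)
    for start, count in sequences:
        del rows[start - 1:start - 1 + count]
    return rows


rows = list(range(1, 51))          # row numbers 1..50
sequences = [[2, 7], [45, 4]]      # delete rows 2-8 and rows 45-48
print(shift_for_deleted(sequences))
print(apply_deletes(rows, shift_for_deleted(sequences)))
```

In the printed result rows 2 to 8 and 45 to 48 are gone. Replaying the unadjusted sequences instead removes rows 52 to 55 of the original numbering, which do not exist in this 50-row list, so rows 45 to 48 survive.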
I would like to build a dict using a concatenation of column and row headers as the keys. In this format cell (0,0) is blank
I want the script to start at the second column heading, concatenate its string value to the second row value in the first column, then the third row, and so forth until there are no more rows. If the cell at the intersection of that column and row is blank, it should skip making a key. If it is not blank, it should make the key with the corresponding cell value as its value.
So, 0.5_20.00 would not be created; and .05_32.00: 9.00 would be created
Once it reaches the last row, it should move to the third column (column[2]) and do the same thing until there are no more columns.
import xlrd
#make dict from excel workbook, size and frequency
serParams = {}
wb = xlrd.open_workbook(r"S:\Shared\Service_Levels.xlsx")
sh = wb.sheet_by_index(0)
# Get row range value
row_range_value = 0
for i in sh.col(0):
    row_range_value += 1
row_range_value = row_range_value -1
print(row_range_value)
# Get column range value
column_range_value = 0
for i in sh.row(0):
    column_range_value += 1
column_range_value = column_range_value - 1
print(column_range_value)
# build the dict by using concatenated column and row headers as keys and
# corresponding values as key value
for i in range(0,column_range_value,1):
    for j in range(0,row_range_value,1):
        if sh.cell(i+1,j+1).value != '':
            key = str(sh.cell(i+1,j).value) +"_"+str(sh.cell(i,j+1).value)
            serParams[key] = sh.cell(i+1, j+1).value
Maybe the format of my data is causing the index error because the ranges exceed the table data with i+1 and j+1 once the loop reaches the end of the table? I tried to address this by subtracting 1 from the range values but I continue to get the index error.
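One likely cause is argument order: xlrd's sh.cell takes (rowx, colx), while the loop above passes the column counter i as the row. If the sheet has more columns than rows, i+1 then runs past the last row. A sketch of the intended dict-building logic on a plain list of rows, assuming keys of the form columnheader_rowheader (the helper `build_params` and the sample numbers are illustrative only):

```python
def build_params(grid):
    """grid[0] holds the column headers and grid[r][0] the row headers;
    grid[0][0] is the blank corner cell. Blank data cells are skipped."""
    params = {}
    column_headers = grid[0]
    for col in range(1, len(column_headers)):     # column by column
        for row in range(1, len(grid)):           # then down the rows
            value = grid[row][col]
            if value == "":                       # xlrd returns '' for empty cells
                continue
            key = "{}_{}".format(column_headers[col], grid[row][0])
            params[key] = value
    return params


grid = [
    ["",   0.5, 1.0],
    [20.0, "",  4.0],
    [32.0, 9.0, ""],
]
print(build_params(grid))
```

With xlrd the grid could be read as [sh.row_values(r) for r in range(sh.nrows)], which also removes the need to count rows and columns by hand.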
I am having a problem searching for the first empty cell in a certain column
in a 40k-line .xlsx file. As the search goes further, it becomes slower and slower. Is there a faster/instant method for finding the first empty cell in a column?
wb = load_workbook(filename = dest_filename,read_only=True)
sheet_ranges1 = wb[name]
i = 1
x = 0
sam = 0
cc = 0
brgyst =Street+Brgy
entrylist = [TotalNoConfig,TotalNoChannel,Rsl,Mode,RslNo,Year,IssuedDate,Carrier,CaseNo,Site,brgyst,Municipality,Province,Region,Longitude1,Longitude2,Longitude3,Latitude1,Latitude2,Latitude3,ConvertedLong,ConvertedLat,License,Cos,NoS,CallSign,PTSVC,PTSVCCS,Tx,Rx] #The values to be inputted in the entire row after searching the last empty cell in column J
listX1 = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N', 'O','P','Q','T','U','V','R','X','Y','Z','AA','AB','AM','AN','AP','FL'] #The columns in the file
eter = 0
while(x != 1):
    cellS = 'J'+str(i) #until there is no empty cell
    if(sheet_ranges1[cellS].value is None): #if found empty cell, insert the values
        x=1
        book = load_workbook(filename = dest_filename)
        sheet = book[name]
        rangeof = int(len(entrylist))
        while(cc<rangeof):
            cells = listX1[cc]+str(i)
            sheet[cells]= entrylist[cc]
            cc=cc+1
    else:
        x=0
        sam = sam+1
        i=i+1
wb.save(dest_filename)
wb.close()
In read-only mode every cell lookup causes the worksheet to be parsed again, so you should always use ws.iter_rows() for your work.
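A sketch of what that single pass could look like. The helper `first_empty_row` is hypothetical and takes any iterable of row tuples, so it can be tried here on a plain list:

```python
def first_empty_row(rows, column_index=0, first_row=1):
    """Return the row number of the first row whose cell at column_index is None.
    If every row is filled, return the row number just past the last one."""
    row_number = first_row - 1
    for row_number, row in enumerate(rows, start=first_row):
        if row[column_index] is None:
            return row_number
    return row_number + 1


# Stand-in for the rows of a single column streamed from the worksheet.
rows = [("a",), ("b",), (None,), ("d",)]
print(first_empty_row(rows))
```

For column J of the read-only sheet in the question, the call would be first_empty_row(sheet_ranges1.iter_rows(min_col=10, max_col=10, values_only=True)), which streams the worksheet once instead of re-parsing it for every cell.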
I have a script in Python that uses xlwings to open up an Excel file, and read and process the values of a certain column row by row. Here is the for statement:
for row in range(2, rownum):
I would like to repeat this function over every row in the sheet that actually contains something. It starts at 2 and ends at 'rownum'. My question is how to automatically count the number of rows and pass that value to 'rownum'. I'm sure xlwings has a way to do this, but I can't figure it out- perhaps the Autofit tool?
Thanks for the help!
It's all in the API documentation.
If you're only looking for the number of rows, you can obtain the total number of rows in your array/table by using the current_region property of your range and then getting the row of the last cell of this range. (This works only if your range is contiguous, with no empty rows or columns inside it.)
rownum = Range('A1').current_region.last_cell.row
Alternatively, you can use table instead of current_region, the range will just be a bit different.
Once you have that, you can just loop through the rows:
for i in range(1, rownum + 1): # The indexing starts at 1
    Range((i, 1)).value = ... # Will write to cell 'Ai'
But as mentioned in other answers, this multiplies the calls between Python and Excel, which is considerably slower. It is better to import the range, modify it, and export it back to Excel.
Unless I've missed something while reading their API documentation it doesn't seem possible. You might need to use other libraries, for example pandas:
import pandas as pd
df = pd.read_excel(excel_file_path, sheet_name="Sheet1")
print(len(df))
If you don't want to use another library just for that, you can do it the hard and ugly way:
last_row = 0
while True:
    # cell_value_at is a placeholder: replace it with however
    # xlwings accesses the value of the cell in a given row
    if cell_value_at(last_row + 1) is not None:
        last_row += 1
    else:
        break
print(last_row)
With xlwings, you would read in the Range first, then iterate over it:
rng = Range((startrow, startcol), (rownum, colnum)).value
for row in rng:
    ...
Then at the end, write the result back:
Range((startrow, startcol)).value = result_rng
This way you minimize the cross-application calls which are slow.
You might also want to use Range.table.
I had to make a counter because I am automating a bunch of things that are taken from Excel and filled in on different websites. This is just the "prototype" that I came up with to make sure I could do it.
import time
import xlwings as xw
wb = xw.Book(r'C:\Users\dd\Desktop\Testbook.xlsm')
Dudsht = wb.sheets['Dud']
lastcell = Dudsht.range(1,1).end('down').row #this just does ctrl+shift+down
print(lastcell) #just so you know how many rows you have. Mine was 30.
x = 2
for i in range(x, lastcell+1): #range of 2 to 30
    Dudsht.cells(i,2).value = 'y' #enters 'y' triggering formulas
    if Dudsht.cells(i,1).value == 'ERROR':
        Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 2
        continue #if there is an error it will highlight and skip an item
    time.sleep(.5) #this was just so I could see visually
    Dudsht.cells(i,2).value = 'x'
    print('Item' + str(i) + ' Complete') #e.g. Item2 Complete
    time.sleep(.5)
    Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 3 #highlights completed item
If there is no blank row, you can just use this:
len(Range('A1').vertical)
You don't need to know how many rows there are in the sheet.
import xlwings as xw
wb = xw.Book('20180301.xlsm')
sh = wb.sheets['RowData']
rownum = 2
while sh.range('A'+str(rownum)).value is not None:
    value = sh.range('A'+str(rownum)).value
    print(str(value))
    rownum += 1
This will print out the data in column A, from row 2 down to the first empty cell.
Clean solution from https://gist.github.com/Elijas/2430813d3ad71aebcc0c83dd1f130e33?permalink_comment_id=2088976#gistcomment-2088976:
# Row + Rows.Count is one past the last used row, hence the - 1 (same for columns)
used_range_rows = (active_sheet.api.UsedRange.Row, active_sheet.api.UsedRange.Row + active_sheet.api.UsedRange.Rows.Count - 1)
used_range_cols = (active_sheet.api.UsedRange.Column, active_sheet.api.UsedRange.Column + active_sheet.api.UsedRange.Columns.Count - 1)
used_range = xw.Range(*zip(used_range_rows, used_range_cols))
For counting rows in a column with empty cells in between:
import xlwings as xw
wb = xw.Book(loc)
sheet = wb.sheets['sheetname']
counter = 0
rownum = 1
while (rownum >= 1):
    if sheet.range('A'+str(rownum)).value is not None:
        counter += 1
    elif sheet.range('A'+str(rownum+1)).value is not None:
        counter += 1  # a single blank cell inside the data still counts as a row
    else:
        break  # two blank cells in a row: end of the data, do not count them
    rownum += 1
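Each sheet.range(...).value call in a loop like this is a separate round trip to Excel. An alternative sketch, assuming the column has first been read once into a Python list (for example with sheet.range('A1:A1000').value, where the end row 1000 is an arbitrary upper bound chosen for illustration), is to scan from the end for the last non-empty entry. `last_filled_index` is a hypothetical helper:

```python
def last_filled_index(values):
    """Return the 1-based position of the last entry that is not None, or 0 if none."""
    for position in range(len(values), 0, -1):
        if values[position - 1] is not None:
            return position
    return 0


# Stand-in for a column read in one call; None marks an empty cell.
column_a = ["id", 1.0, None, 3.0, None, None, 7.0, None, None]
print(last_filled_index(column_a))
```

Unlike the loop above, which stops at the first pair of consecutive blank cells, this finds the last filled row regardless of how large the gaps are, so the two can give different answers on the same column.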