Count rows in excel sheet in Python with xlwings - python

I have a script in Python that uses xlwings to open up an Excel file, and read and process the values of a certain column row by row. Here is the for statement:
for row in range(2, rownum):
I would like to repeat this function over every row in the sheet that actually contains something. It starts at 2 and ends at 'rownum'. My question is how to automatically count the number of rows and pass that value to 'rownum'. I'm sure xlwings has a way to do this, but I can't figure it out- perhaps the Autofit tool?
Thanks for the help!

It's all in the API documentation.
If you're only looking for the number of rows, you can obtain the total number of rows in your array/table by using the current_region property of your range and then getting the row of the last cell of this range. (This works only if your range is contiguous, with no empty rows or columns inside it.)
rownum = Range('A1').current_region.last_cell.row
Alternatively, you can use table instead of current_region, the range will just be a bit different.
Once you have that, you can just loop through the rows:
for i in range(1, rownum + 1):  # The indexing starts at 1
    Range((i, 1)).value = ...  # Will select cell 'Ai'
But as mentioned in other answers, this multiplies the calls between Python and Excel, which will be considerably slower. It is better to import the range, modify it, and export it back to Excel.

Unless I've missed something while reading their API documentation it doesn't seem possible. You might need to use other libraries, for example pandas:
import pandas as pd
df = pd.read_excel(excel_file_path, sheet_name="Sheet1")
print(len(df))
If you don't want to use another library just for that, you can do it the hard and ugly way:
last_row = 0
while True:
    if cell_value is not None:  # replace cell_value with however
                                # xlwings accesses a cell's value
        last_row += 1
    else:
        break
print(last_row)

With xlwings, you would read in the Range first, then iterate over it:
rng = Range((startrow, startcol), (rownum, colnum)).value
for row in rng:
    ...
Then at the end, write the result back:
Range((startrow, startcol)).value = result_rng
This way you minimize the cross-application calls which are slow.
You might also want to use Range.table.
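As a minimal sketch of this read-once, write-once pattern: the helper names `transform_rows` and `process_sheet`, the cell addresses, and the per-row function are illustrative assumptions, not part of xlwings. The pure-Python helper does the work on a list of rows, and the wrapper is assumed to receive an xlwings Sheet and makes one read call and one write call.

```python
def transform_rows(rows, fn):
    """Apply fn to every row (a list of cell values) and return the new rows."""
    return [fn(row) for row in rows]


def process_sheet(sheet, top_left, bottom_right, fn):
    """Read a block in one call, transform it in Python, write it back in one call.

    `sheet` is assumed to be an xlwings Sheet; the addresses are strings
    such as "A2" and "C100" (hypothetical values).
    """
    # ndim=2 asks xlwings for a list of rows even if the block is a single row or column
    rows = sheet.range(top_left, bottom_right).options(ndim=2).value
    # Assigning a nested list to the top-left cell writes the whole block at once
    sheet.range(top_left).value = transform_rows(rows, fn)


# Pure-Python usage, no Excel needed: double every non-empty value
doubled = transform_rows(
    [[1, 2], [3, None]],
    lambda row: [None if v is None else v * 2 for v in row],
)
print(doubled)  # [[2, 4], [6, None]]
```

Keeping the transformation separate from the Excel calls also makes it easy to test without opening a workbook.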

I had to make a counter because I am automating a bunch of things that are taken from Excel and filled into different websites. This is just the "prototype" that I came up with to make sure I could do it.
import time
import xlwings as xw
wb = xw.Book(r'C:\Users\dd\Desktop\Testbook.xlsm')
Dudsht = wb.sheets['Dud']
lastcell = Dudsht.range(1,1).end('down').row  # this just does ctrl+shift+down
print(lastcell)  # just so you know how many rows you have. Mine was 30.
x = 2
for i in range(x, lastcell+1):  # range of 2 to 30
    Dudsht.cells(i,2).value = 'y'  # enters 'y' triggering formulas
    if Dudsht.cells(i,1).value == 'ERROR':
        Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 2
        continue  # if there is an error it will highlight and skip an item
    time.sleep(.5)  # this was just so I could see visually
    Dudsht.cells(i,2).value = 'x'
    print('Item' + str(i) + ' Complete')  # e.g. Item2 Complete
    time.sleep(.5)
    Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 3  # highlights completed item

If there is no blank row, you can just use this:
len(Range('A1').vertical)

You don't need to know how many rows are in the sheet.
import xlwings as xw
wb = xw.Book('20180301.xlsm')
sh = wb.sheets['RowData']
rownum = 2
while sh.range('A'+str(rownum)).value is not None:
    value = sh.range('A'+str(rownum)).value
    print(str(value))
    rownum += 1
This will print out all data in column A.

Clean solution from https://gist.github.com/Elijas/2430813d3ad71aebcc0c83dd1f130e33?permalink_comment_id=2088976#gistcomment-2088976:
# UsedRange.Row and UsedRange.Column are the first used row/column, so subtract 1 to land on the last used cell
used_range_rows = (active_sheet.api.UsedRange.Row, active_sheet.api.UsedRange.Row + active_sheet.api.UsedRange.Rows.Count - 1)
used_range_cols = (active_sheet.api.UsedRange.Column, active_sheet.api.UsedRange.Column + active_sheet.api.UsedRange.Columns.Count - 1)
used_range = xw.Range(*zip(used_range_rows, used_range_cols))

For counting rows in a column with empty cells in between:
import xlwings as xw
wb = xw.Book(loc)
sheet = wb.sheets['sheetname']
counter = 0
rownum = 1
while True:
    if sheet.range('A'+str(rownum)).value is not None:
        counter += 1
    elif sheet.range('A'+str(rownum+1)).value is not None:
        counter += 1  # a single empty cell followed by data still counts as a row
    else:
        break  # two consecutive empty cells: treat this as the end of the data
    rownum += 1
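The loop above makes one or two Excel calls per row and stops at the first pair of consecutive empty cells. A different approach, sketched here under assumptions, is to read the column once and find the last non-empty value in plain Python, which tolerates gaps of any length. The function names are hypothetical, and the wrapper assumes an xlwings Sheet whose used range covers the data.

```python
def last_filled_row(values, first_row=1):
    """Return the sheet row number of the last non-empty value, or 0 if all are empty.

    `values` is a flat list of cell values (None for empty cells);
    `first_row` is the sheet row that values[0] came from.
    """
    for offset in range(len(values) - 1, -1, -1):
        if values[offset] is not None:
            return first_row + offset
    return 0


def count_rows_in_column_a(sheet):
    """Assumes `sheet` is an xlwings Sheet; reads column A in a single call."""
    bottom = sheet.used_range.last_cell.row
    # ndim=1 keeps the result a flat list even when only one cell is read
    values = sheet.range((1, 1), (bottom, 1)).options(ndim=1).value
    return last_filled_row(values)


# Pure-Python check with gaps in the data
print(last_filled_row(["a", "b", None, "d", None, None]))  # 4
```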

Related

Count and compare occurrences across different columns in different spreadsheets

I would like to know (in Python) how to count occurrences and compare values from different columns in different spreadsheets. After counting, I need to know whether those values fulfill a condition, e.g. if Ana (a user) from the first spreadsheet appears 1 time in the second spreadsheet and 5 times in the third one, I would like to add 1 to a variable X.
I am new to Python, but I have tried getting the .values() after using the Counter from collections. However, I am not sure whether the actual value Ana is being considered when iterating over the results of the Counter. All in all, I need to iterate over each element in spreadsheet one and see whether it appears one time in the second spreadsheet and five times in the third spreadsheet; if so, the variable X is incremented by one.
import csv
from collections import Counter
def XInputOutputs():
    list1 = []
    with open(file1, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list1.append(row[1])
    number_of_occurrences_in_list_1 = Counter(list1)
    list1_ocurrences = number_of_occurrences_in_list_1.values()
    list2 = []
    with open(file2, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list2.append(row[1])
    number_of_occurrences_in_list_2 = Counter(list2)
    list2_ocurrences = number_of_occurrences_in_list_2.values()
    X = 0
    for x,y in zip(list1_ocurrences, list2_ocurrences):
        if x == 1 and y == 5:
            X += 1
    return X
I tested with small spreadsheets, but this just works for pre-ordered values. If Ana appears after 100000 rows, everything is broken. I think it is needed to iterate each value (Ana) and check simultaneously in all the spreadsheets and sum the variable X.
I am at work, so I will be able to write a full answer only later.
If you can import modules, I suggest trying pandas: a really useful tool for managing data frames quickly and efficiently.
You can easily import a .csv spreadsheet with the
import pandas as pd
df = pd.read_csv()
method, then perform almost any kind of operation.
Check out this answer (I had little time to read it, but I hope it helps):
what is the most efficient way of counting occurrences in pandas?
UPDATE: then try with this
# not tested but should work
import os
import pandas as pd
# read all csv sheets from folder - I assume your folder is named "CSVs"
for files in os.walk("CSVs"):
    files = files[-1]
# here it's generated a list of dataframes
df_list = []
for file in files:
    df = pd.read_csv("CSVs/" + file)
    df_list.append(df)
name_i_wanna_count = ""  # this will be your query
columun_name = ""  # here insert the column you wanna analyze
count = 0
for df in df_list:
    # retrieve a series matching your query and then counts the elements inside
    matching_serie = df.loc[df[columun_name] == name_i_wanna_count]
    partial_count = len(matching_serie)
    count = count + partial_count
print(count)
I hope it helps
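The condition in the question (once in the second spreadsheet, five times in the third) can also be checked by looking users up by name in a Counter, so the order of the rows no longer matters. This is only a sketch under assumptions: the function names are hypothetical, the user is assumed to be in the second CSV column as in the question's code, and each distinct user from the first spreadsheet is assumed to count at most once.

```python
import csv
from collections import Counter


def read_column(path, index=1):
    """Return the values of one column of a CSV file as a list."""
    with open(path, newline="") as handle:
        return [row[index] for row in csv.reader(handle) if len(row) > index]


def count_matching_users(users1, users2, users3, times2=1, times3=5):
    """Count distinct users of users1 seen times2 times in users2 and times3 in users3."""
    counts2 = Counter(users2)
    counts3 = Counter(users3)
    # A Counter returns 0 for missing keys, so absent users simply fail the test
    return sum(
        1
        for user in set(users1)
        if counts2[user] == times2 and counts3[user] == times3
    )


first = ["Ana", "Bob", "Cid"]
second = ["Bob", "Ana", "Bob"]
third = ["Ana"] * 5 + ["Bob"] * 5
print(count_matching_users(first, second, third))  # 1 (only Ana matches)
```

With real files, the three lists would come from calls such as `read_column(file1)`, where `file1` is the path used in the question.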

Bad performance with big dataset

I have an Excel file which contains two columns and 743914 rows. What I want to do is iterate row by row, and if the combination of the two column values is found for the first time, assign a value next to it in a third column. Otherwise, the value is the one that I assigned to this combination the first time it was found. The problem is analogous to building a dictionary where the key is the combination of the two existing columns and the value is the third column. I have written the code below, which I have tested for 20 rows and works fine.
from openpyxl import load_workbook
wb = load_workbook('test.xlsx')
dicta = {}
i = 0
lista = []
listb = []
ws = wb.active
for row in ws.iter_rows(min_row=1, max_col=3, max_row=743914):
    for cell in row:
        i += 1
        if i%3 != 0:
            lista.append(cell.value)
        if i%3 == 0:
            if lista in listb:
                cell.value = dicta[tuple(lista)]
            else:
                cell.value = i
                dicta[tuple(lista)] = i
                listb.append(lista)
            lista = []
My problem is that when I scale up to 743914 rows, it seems to run forever and very inefficiently: it has already been running for 15 hours and hasn't terminated yet.
I don't think your problem is related to openpyxl but to the growth of your lists and the nested checks: if lista in listb looks suspicious, because it scans the whole list for every row, so the total work grows roughly quadratically with the number of rows. Your counter is also more or less uncontrolled.
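A minimal sketch of the dictionary-only version of this idea follows. The function name `assign_ids` is hypothetical, and it assumes the assigned value can be the number of the row where the pair first appears, instead of the cell counter used in the question. Lookups in a dict take roughly constant time, so the list `listb` is not needed at all.

```python
def assign_ids(pairs):
    """Give each distinct (a, b) pair the number of the row where it first appears."""
    first_seen = {}
    ids = []
    for row_number, pair in enumerate(pairs, start=1):
        # setdefault stores row_number only if the pair is new,
        # then returns whatever is stored for that pair
        ids.append(first_seen.setdefault(tuple(pair), row_number))
    return ids


print(assign_ids([("x", 1), ("y", 2), ("x", 1), ("z", 3), ("y", 2)]))
# [1, 2, 1, 4, 2]
```

With openpyxl, the pairs could come from the first two cells of each row and the returned values be written to the third cell of the same row.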

Quickly count non empty cells in large excel sheet

I'm trying to determine how much data is missing from a large excel sheet. The following code takes a prohibitive amount of time to complete. I've seen similar questions, but I'm not sure how to translate the answer to this case. Any help would be appreciated!
import openpyxl
wb = openpyxl.load_workbook('C://Users/Alec/Documents/Vertnet master list.xlsx', read_only = True)
sheet = wb.active
lat = 0
loc = 0
ele = 0
a = openpyxl.utils.cell.column_index_from_string('CF')
b = openpyxl.utils.cell.column_index_from_string('BU')
c = openpyxl.utils.cell.column_index_from_string('BX')
print('Workbook loaded')
for x in range(2, sheet.max_row):
    if sheet.cell(row = x, column = a).value:
        lat += 1
    if sheet.cell(row = x, column = b).value:
        loc += 1
    if sheet.cell(row = x, column = c).value:
        ele += 1
    print((x/sheet.max_row) * 100, '%')
print('Latitude: ', lat/sheet.max_row)
print('Location', loc/sheet.max_row)
print('Elevation', ele/sheet.max_row)
If you are simply trying to do the calc on a table on the sheet and not the entire sheet, you could make one adjustment to make it faster.
row = 1
Do Until IsEmpty(range("A1").offset(row,1).value)
if range("B"&row).value: lat += 1
if range("C"&row).value: loc += 1
if range("D"&row).value: ele += 1
row = row + 1
Loop
This would take you to the end of your defined table rather than the end of the whole sheet which is 90% of the reason it's taking you so long.
Hope this helps
Your problem is that, despite advice in the documentation to the contrary, you're using your own counters to access cells. In read-only mode each use of ws.cell() will force the worksheet to reparse the XML source for the worksheet. Simply use ws.iter_rows(min_col=a, max_col=c) to get the cells in the columns you're interested in.
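As a rough sketch of that advice: the function names are hypothetical, and `values_only=True` assumes a reasonably recent openpyxl release. The counting is kept in a pure function that takes any iterable of value tuples, and the wrapper streams the rows once with iter_rows. Note that min_col must be the lowest and max_col the highest of the columns of interest (BU and CF in the question), and that the truthiness test mirrors the question's `if ...value:`, so 0 and empty strings count as missing.

```python
def count_filled(rows, n_columns):
    """Count truthy values per column position over an iterable of value tuples.

    Returns (counts, total_rows).
    """
    counts = [0] * n_columns
    total = 0
    for row in rows:
        total += 1
        for position, value in enumerate(row[:n_columns]):
            if value:
                counts[position] += 1
    return counts, total


def count_filled_in_sheet(path, first_col, last_col):
    """Stream columns first_col..last_col (1-based) of the active sheet, skipping row 1."""
    # Imported here so that count_filled can be used without openpyxl installed
    from openpyxl import load_workbook

    sheet = load_workbook(path, read_only=True).active
    rows = sheet.iter_rows(min_row=2, min_col=first_col, max_col=last_col,
                           values_only=True)
    return count_filled(rows, last_col - first_col + 1)


counts, total = count_filled([(1, None, "a"), (None, None, "b"), (2, 0, "")], 3)
print(counts, total)  # [2, 0, 2] 3
```

The columns of interest are then positions within `counts`, offset from `first_col`.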

XLRD Out of Range Error

I have an Excel spreadsheet containing data as follows
Serial Number SAMPLE ID SAMPLE NAME
value value value
value value value
value value value......
Basically a table of entries. I do not know how many entries the table will have in it. Now I write Python code with xlrd to extract the values from Excel. The first thing that I want to do is determine the amount of entries present, so I use the following piece of code:
kicker = 0
counter = 0
rownum = 5
colnum = 1
while (kicker == 0):
    if sh.cell_value(rowx=rownum, colx=colnum) is None:
        kicker = 1
    else:
        counter = counter + 1
    rownum = rownum + 1
print("done")
The code scans through the values and successfully reads the entries that have a value in the first field. The problem is, when I get to the first row without a value in the first field, xlrd gives me a "list index out of range" error. Thus, I read the last valid value, but as soon as I read the first empty block, it gives the error. How can I determine the amount of entries in my "table" without having xlrd throw an out of range error?
You should query for nrows and not use a potentially endless loop.
kicker = 0
counter = 0
colnum = 1
for rownum in range(5, sh.nrows):
    if sh.cell_type(rowx=rownum, colx=colnum) in (xlrd.XL_CELL_EMPTY, xlrd.XL_CELL_BLANK):
        kicker = 1
        break  # stop at the first empty cell, as in the original loop
    else:
        counter = counter + 1
print("done")
For testing an empty cell, I looked up: How to detect if a cell is empty when reading Excel files using the xlrd library?

Is it possible to get an Excel document's row count without loading the entire document into memory?

I'm working on an application that processes huge Excel 2007 files, and I'm using OpenPyXL to do it. OpenPyXL has two different methods of reading an Excel file - one "normal" method where the entire document is loaded into memory at once, and one method where iterators are used to read row-by-row.
The problem is that when I'm using the iterator method, I don't get any document meta-data like column widths and row/column count, and I really need this data. I assume this data is stored in the Excel document close to the top, so it shouldn't be necessary to load the whole 10MB file into memory to get access to it.
So, is there a way to get ahold of the row/column count and column widths without loading the entire document into memory first?
Adding on to what Hubro said, apparently get_highest_row() has been deprecated. Using the max_row and max_column properties returns the row and column count. For example:
wb = load_workbook(path, read_only=True)  # called use_iterators=True in older openpyxl releases
sheet = wb.worksheets[0]
row_count = sheet.max_row
column_count = sheet.max_column
The solution suggested in this answer has been deprecated, and might no longer work.
Taking a look at the source code of OpenPyXL (IterableWorksheet) I've figured out how to get the column and row count from an iterator worksheet:
wb = load_workbook(path, use_iterators=True)
sheet = wb.worksheets[0]
row_count = sheet.get_highest_row() - 1
column_count = letter_to_index(sheet.get_highest_column()) + 1
IterableWorksheet.get_highest_column returns a string with the column letter that you can see in Excel, e.g. "A", "B", "C" etc. Therefore I've also written a function to translate the column letter to a zero based index:
def letter_to_index(letter):
    """Converts a column letter, e.g. "A", "B", "AA", "BC" etc. to a zero based
    column index.
    A becomes 0, B becomes 1, Z becomes 25, AA becomes 26 etc.
    Args:
        letter (str): The column index letter.
    Returns:
        The column index as an integer.
    """
    letter = letter.upper()
    result = 0
    for index, char in enumerate(reversed(letter)):
        # Get the ASCII number of the letter and subtract 64 so that A
        # corresponds to 1.
        num = ord(char) - 64
        # Multiply the number with 26 to the power of `index` to get the correct
        # value of the letter based on its index in the string.
        final_num = (26 ** index) * num
        result += final_num
    # Subtract 1 from the result to make it zero-based before returning.
    return result - 1
I still haven't figured out how to get the column sizes though, so I've decided to use a fixed-width font and automatically scaled columns in my application.
Python 3
import openpyxl as xl
wb = xl.load_workbook("Sample.xlsx", read_only=True)
# The two lines below do the same thing.
sheet = wb['sheet']
sheet = wb.worksheets[0]
row_count = sheet.max_row
column_count = sheet.max_column
# This works for me.
This might be extremely convoluted and I might be missing the obvious, but without OpenPyXL filling in the column_dimensions in Iterable Worksheets (see my comment above), the only way I can see of finding the column size without loading everything is to parse the xml directly:
from xml.etree.ElementTree import iterparse
from openpyxl import load_workbook
wb = load_workbook("/path/to/workbook.xlsx", use_iterators=True)
ws = wb.worksheets[0]
xml = ws._xml_source
xml.seek(0)
for _, x in iterparse(xml):
    name = x.tag.split("}")[-1]
    if name == "col":
        print("Column %(max)s: Width: %(width)s" % x.attrib)  # width = x.attrib["width"]
    if name == "cols":
        print("break before reading the rest of the file")
        break
https://pythonhosted.org/pyexcel/iapi/pyexcel.sheets.Sheet.html
See row_range(): utility function to get the row range.
If you use pyexcel, you can call row_range() to get the maximum number of rows.
Tested on Python 3.4.
Options using pandas.
Gets all sheetnames with count of rows and columns.
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
for sheet in sheetnames:
    df = xl.parse(sheet)
    dimensions = df.shape
    print(sheet, ' --> ', dimensions)
Single sheet count of rows and columns.
import pandas as pd
xl = pd.ExcelFile('file.xlsx')
sheetnames = xl.sheet_names
df = xl.parse(sheetnames[0]) # [0] get first tab/sheet.
dimensions = df.shape
print(f'sheetname: "{sheetnames[0]}" - -> {dimensions}')
output sheetname "Sheet1" --> (row count, column count)
