Bad performance with big dataset - python

I have an Excel file which contains two columns and 743914 rows. What I want to do is iterate row by row, and if the combination of the two values in a row is found for the first time, assign a value next to it in a third column. Otherwise, the value is the one I assigned next to this combination the first time it was found. The problem is analogous to building a dictionary where the key is the combination of the two existing columns and the value is the third column. I have written the code below, which I have tested for 20 rows and works fine.
from openpyxl import load_workbook
wb = load_workbook('test.xlsx')
dicta = {}
i = 0
lista = []
listb = []
ws = wb.active
for row in ws.iter_rows(min_row=1, max_col=3, max_row=743914):
    for cell in row:
        i += 1
        if i%3 != 0:
            lista.append(cell.value)
        if i%3 == 0:
            if lista in listb:
                cell.value = dicta[tuple(lista)]
            else:
                cell.value = i
                dicta[tuple(lista)] = i
                listb.append(lista)
            lista = []
My problem is that when I scale up to 743914 rows it seems to run forever and very inefficiently: it has already been running for 15 hours and hasn't terminated yet.

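I don't think your problem is related to openpyxl but to the growth of your lists and the nested checks: if lista in listb looks suspicious, because it scans the whole of listb for every row, so the total work grows quadratically with the number of rows. Your counter is also more or less uncontrolled.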
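As a rough sketch of the alternative (not the asker's code): a dictionary keyed by the pair makes each lookup roughly constant time, so the list scan disappears. The function name assign_ids and the sample data are hypothetical; with openpyxl the pairs would come from the first two cells of each row and the ids would be written to the third cell.

```python
def assign_ids(pairs):
    """Return one id per pair; a repeated pair reuses the id from its first occurrence."""
    seen = {}      # (first value, second value) -> id assigned when first seen
    result = []
    for row_number, pair in enumerate(pairs, start=1):
        key = tuple(pair)
        if key not in seen:
            seen[key] = row_number  # hypothetical choice: use the row number as the id
        result.append(seen[key])
    return result


# Hypothetical sample standing in for the two spreadsheet columns.
rows = [("x", 1), ("y", 2), ("x", 1), ("z", 3), ("y", 2)]
ids = assign_ids(rows)
print(ids)  # [1, 2, 1, 4, 2]
```

The dictionary alone is enough here; the separate listb from the question is not needed.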

Related

Is there a way to find the current row of the iteration using 'openpyxl' on Python?

I'm working with an xlsx file that is divided into sections by empty rows, and each section displays its information in a different manner, i.e. with different columns.
So I'm basically trying to find the section that I'm looking for ('Ação') and create a range from its next line, where the headers are, until the next empty row, so I can create a DataFrame from this range.
When I try to print the index, it returns a tuple containing the values of the row, but I couldn't find a way to return its index (an integer).
from openpyxl import load_workbook
data = '2019/02/07'
symbol = 'EQTL3'
ano = data[0:4]
mes = data[5:7]
dia = data[8:10]
file = "Fundo_{}{}{}.xlsx".format(ano, mes, dia)
wb = load_workbook(filename=file, read_only=False)
ws = wb["Fundo_{}{}{}".format(ano, mes, dia)]
for cell in ws['A']:
    if (cell.value == 'Ação'):
        x = int(cell.coordinate[1:]) + 1
        for index in ws.iter_rows(min_row=x, max_col=ws.max_column, max_row=ws.max_row, values_only=True):
            if (index[0] == None):
                y = ws._current_row
                break
I expect to receive an integer value with the index of the last non-empty row.
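You can use enumerate for that. Something like this (start=x makes row_idx the actual sheet row number, since the iteration begins at min_row=x):
for row_idx, row_of_cells in enumerate(ws.iter_rows(min_row=x, values_only=True), start=x):
    ...  # row_idx is an int, row_of_cells is the tuple of values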
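A minimal sketch of how the enumerate idea could give the last non-empty row of a section. A plain list of tuples stands in for ws.iter_rows(min_row=x, values_only=True); the function name last_filled_row and the sample data are hypothetical.

```python
def last_filled_row(rows, first_row):
    """Return the sheet row number of the last row before the first empty one.

    rows      -- iterable of value tuples, one per sheet row
    first_row -- sheet row number of the first tuple in rows
    """
    last = first_row - 1  # nothing seen yet
    for row_idx, values in enumerate(rows, start=first_row):
        if values[0] is None:  # first column empty: the section has ended
            break
        last = row_idx
    return last


# Hypothetical section starting at sheet row 10: header, two data rows, blank row.
section = [("Header", "h2"), ("a", 1), ("b", 2), (None, None), ("next", 0)]
y = last_filled_row(section, first_row=10)
print(y)  # 12
```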

Count and compare occurrences across different columns in different spreadsheets

I would like to know (in Python) how to count occurrences and compare values from different columns in different spreadsheets. After counting, I would need to know whether those values fulfill a condition, e.g. if Ana (a user) from the first spreadsheet appears 1 time in the second spreadsheet and 5 times in the third one, I would like to add 1 to a variable X.
I am new to Python, but I have tried getting the .values() after using Counter from collections. However, I am not sure that the actual value Ana is being considered when iterating over the results of the Counter. All in all, I need to iterate over each element in spreadsheet one and see whether it appears one time in the second spreadsheet and five times in the third spreadsheet; if so, the variable X is incremented by one.
import csv
from collections import Counter

def XInputOutputs():
    list1 = []
    with open(file1, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list1.append(row[1])
    number_of_occurrences_in_list_1 = Counter(list1)
    list1_ocurrences = number_of_occurrences_in_list_1.values()
    list2 = []
    with open(file2, 'r') as fr:
        r = csv.reader(fr)
        for row in r:
            list2.append(row[1])
    number_of_occurrences_in_list_2 = Counter(list2)
    list2_ocurrences = number_of_occurrences_in_list_2.values()
    X = 0
    for x,y in zip(list1_ocurrences, list2_ocurrences):
        if x == 1 and y == 5:
            X += 1
    return X
I tested with small spreadsheets, but this only works for pre-ordered values. If Ana appears after 100000 rows, everything breaks. I think I need to iterate over each value (Ana), check it in all the spreadsheets simultaneously, and update the variable X.
I am at work, so I will be able to write a full answer only later.
If you can import modules, I suggest you try pandas: a really useful tool to manage data frames quickly and efficiently.
You can easily import a .csv spreadsheet with the
import pandas as pd
df = pd.read_csv("your_file.csv")  # placeholder path
method, then perform almost any kind of operation.
Check out this answer; I had little time to read it, but I hope it helps:
what is the most efficient way of counting occurrences in pandas?
UPDATE: then try with this
# not tested but should work
import os
import pandas as pd
# read all csv sheets from folder - I assume your folder is named "CSVs"
_, _, files = next(os.walk("CSVs"))  # only the files directly inside "CSVs"
# here a list of dataframes is generated
df_list = []
for file in files:
    df = pd.read_csv("CSVs/" + file)
    df_list.append(df)
name_i_wanna_count = ""  # this will be your query
column_name = ""  # here insert the column you wanna analyze
count = 0
for df in df_list:
    # retrieve the rows matching your query and then count them
    matching_serie = df.loc[df[column_name] == name_i_wanna_count]
    partial_count = len(matching_serie)
    count = count + partial_count
print(count)
I hope it helps
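The Counter idea from the question can also work if the counts are looked up by name instead of being zipped by position. A minimal sketch under stated assumptions: the three lists stand in for the relevant column of each spreadsheet, each distinct user in the first list is counted at most once, and the function name count_matching_users is hypothetical.

```python
from collections import Counter


def count_matching_users(users, second, third, want_second=1, want_third=5):
    """Count distinct users that appear want_second times in second
    and want_third times in third."""
    second_counts = Counter(second)
    third_counts = Counter(third)
    x = 0
    for user in set(users):  # assumption: duplicates in the first sheet count once
        # A Counter returns 0 for a missing key, so absent users simply do not match.
        if second_counts[user] == want_second and third_counts[user] == want_third:
            x += 1
    return x


# Hypothetical column contents; the order of the values does not matter.
sheet1 = ["Ana", "Bob", "Eve"]
sheet2 = ["Bob", "Ana", "Bob"]
sheet3 = ["Ana", "Bob", "Ana", "Ana", "Eve", "Ana", "Ana"]
X = count_matching_users(sheet1, sheet2, sheet3)
print(X)  # 1 (only Ana appears once in sheet2 and five times in sheet3)
```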

Quickly count non empty cells in large excel sheet

I'm trying to determine how much data is missing from a large excel sheet. The following code takes a prohibitive amount of time to complete. I've seen similar questions, but I'm not sure how to translate the answer to this case. Any help would be appreciated!
import openpyxl
wb = openpyxl.load_workbook('C://Users/Alec/Documents/Vertnet master list.xlsx', read_only = True)
sheet = wb.active
lat = 0
loc = 0
ele = 0
a = openpyxl.utils.cell.column_index_from_string('CF')
b = openpyxl.utils.cell.column_index_from_string('BU')
c = openpyxl.utils.cell.column_index_from_string('BX')
print('Workbook loaded')
for x in range(2, sheet.max_row):
    if sheet.cell(row = x, column = a).value:
        lat += 1
    if sheet.cell(row = x, column = b).value:
        loc += 1
    if sheet.cell(row = x, column = c).value:
        ele += 1
    print((x/sheet.max_row) * 100, '%')
print('Latitude: ', lat/sheet.max_row)
print('Location', loc/sheet.max_row)
print('Elevation', ele/sheet.max_row)
If you are simply trying to do the calc on a table on the sheet and not the entire sheet, you could make one adjustment to make it faster.
row = 1
Do Until IsEmpty(Range("A1").Offset(row, 1).Value)
    If Not IsEmpty(Range("B" & row).Value) Then lat = lat + 1
    If Not IsEmpty(Range("C" & row).Value) Then loc = loc + 1
    If Not IsEmpty(Range("D" & row).Value) Then ele = ele + 1
    row = row + 1
Loop
This would take you to the end of your defined table rather than the end of the whole sheet which is 90% of the reason it's taking you so long.
Hope this helps
Your problem is that, despite advice in the documentation to the contrary, you're using your own counters to access cells. In read-only mode each use of ws.cell() will force the worksheet to reparse the XML source for the worksheet. Simply use ws.iter_rows(min_col=b, max_col=a) to get the cells in the columns you're interested in (BU is the leftmost of the three columns and CF the rightmost, so the range runs from b to a).
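A minimal sketch of the single-pass counting that this answer points to. It works on any iterable of row tuples so that it can be tried without a workbook; the function name count_filled, the helper make_row and the sample data are hypothetical. Unlike the truthiness test in the question, this version treats 0 as a filled cell.

```python
def count_filled(rows, positions):
    """Count non-empty values at the given tuple positions in one pass over rows."""
    counts = {name: 0 for name in positions}
    total = 0
    for values in rows:
        total += 1
        for name, pos in positions.items():
            if values[pos] not in (None, ""):
                counts[name] += 1
    return counts, total


# With openpyxl the rows could come from something like
#   ws.iter_rows(min_row=2, min_col=73, max_col=84, values_only=True)  # BU..CF
# which puts BU at position 0, BX at position 3 and CF at position 11.
def make_row(loc, ele, lat):
    """Build a 12-wide tuple shaped like one BU..CF row (hypothetical helper)."""
    row = [None] * 12
    row[0], row[3], row[11] = loc, ele, lat
    return tuple(row)


sample = [make_row("Peru", 120, -12.0), make_row(None, 300, None), make_row("Chile", None, -33.4)]
counts, total = count_filled(sample, {"location": 0, "elevation": 3, "latitude": 11})
for name, n in counts.items():
    print(name, n / total)
```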

Faster search method on the first empty cell in a column using openpyxl PYTHON 3.5

I am having a problem searching for the first empty cell in a certain column
of a 40k-line .xlsx file. As the search goes farther, it becomes slower and slower. Is there a faster/instant method for finding the first empty cell in a column?
wb = load_workbook(filename = dest_filename,read_only=True)
sheet_ranges1 = wb[name]
i = 1
x = 0
sam = 0
cc = 0
brgyst =Street+Brgy
entrylist = [TotalNoConfig,TotalNoChannel,Rsl,Mode,RslNo,Year,IssuedDate,Carrier,CaseNo,Site,brgyst,Municipality,Province,Region,Longitude1,Longitude2,Longitude3,Latitude1,Latitude2,Latitude3,ConvertedLong,ConvertedLat,License,Cos,NoS,CallSign,PTSVC,PTSVCCS,Tx,Rx] #The values to be inputted in the entire row after searching the last empty cell in column J
listX1 = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N', 'O','P','Q','T','U','V','R','X','Y','Z','AA','AB','AM','AN','AP','FL'] #The columns in the file
eter = 0
while(x != 1):
    cellS = 'J'+str(i) #until there is no empty cell
    if(sheet_ranges1[cellS].value is None): #if found empty cell, insert the values
        x=1
        book = load_workbook(filename = dest_filename)
        sheet = book[name]
        rangeof = int(len(entrylist))
        while(cc<rangeof):
            cells = listX1[cc]+str(i)
            sheet[cells]= entrylist[cc]
            cc=cc+1
    else:
        x=0
        sam = sam+1
        i=i+1
wb.save(dest_filename)
wb.close()
In read-only mode every cell lookup causes the worksheet to be parsed again, so you should always use ws.iter_rows() for your work.
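A minimal sketch of the single-pass search this suggests. It takes any iterable of column values so that it can be tried without a workbook; the function name first_empty_row and the sample list are hypothetical.

```python
def first_empty_row(column_values, first_row=1):
    """Return the sheet row number of the first None in column_values.

    If no value is empty, return the row just after the last one.
    """
    row_number = first_row - 1  # covers the case of an empty iterable
    for row_number, value in enumerate(column_values, start=first_row):
        if value is None:
            return row_number
    return row_number + 1


# With openpyxl the values of column J (column 10) could be streamed with
#   (row[0] for row in ws.iter_rows(min_col=10, max_col=10, values_only=True))
column_j = ["a", "b", None, "d"]
print(first_empty_row(column_j))  # 3
```

The worksheet is read only once, so the cost no longer grows with every lookup.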

Count rows in excel sheet in Python with xlwings

I have a script in Python that uses xlwings to open up an Excel file, and read and process the values of a certain column row by row. Here is the for statement:
for row in range(2, rownum):
I would like to repeat this function over every row in the sheet that actually contains something. It starts at 2 and ends at 'rownum'. My question is how to automatically count the number of rows and pass that value to 'rownum'. I'm sure xlwings has a way to do this, but I can't figure it out- perhaps the Autofit tool?
Thanks for the help!
It's all in the API documentation.
If you're only looking for the number of rows, you can obtain the total number of rows in your array/table by using the current_region property of your range and then getting the address of the last cell of this range. (It works only if your range is contiguous, with no empty rows/columns inside of it.)
rownum = Range('A1').current_region.last_cell.row
Alternatively, you can use table instead of current_region, the range will just be a bit different.
Once you have that, you can just loop through the rows:
for i in range(1, rownum + 1): # The indexing starts at 1
    Range((i, 1)).value = ... # Will select cell 'Ai'
But as mentioned in other answers, this multiplies the calls between Python and Excel, which will be considerably slower. It is better to import the range, modify it and export it back to Excel.
Unless I've missed something while reading their API documentation it doesn't seem possible. You might need to use other libraries, for example pandas:
import pandas as pd
df = pd.read_excel(excel_file_path, sheet_name="Sheet1")
print(len(df))
If you don't want to use another library just for that, you can do it the hard and ugly way:
last_row = 0
while True:
    if cell_value is not None:  # replace cell_value with however
                                # xlwings accesses the value of the cell in row last_row + 1
        last_row += 1
    else:
        break
print(last_row)
With xlwings, you would read in the Range first, then iterate over it:
rng = Range((startrow, startcol), (rownum, colnum)).value
for row in rng:
    ...
Then at the end, write the result back:
Range((startrow, startcol)).value = result_rng
This way you minimize the cross-application calls which are slow.
You might also want to use Range.table.
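Once the range has been read into Python in one call, finding the rows that actually contain something is plain list processing. A minimal sketch, assuming the values arrive as a list of rows (which is what xlwings returns for a range spanning several rows and columns; a single row or column comes back as a flat list instead). The function name last_used_row and the sample data are hypothetical.

```python
def last_used_row(values, first_row=1):
    """Return the sheet row number of the last row holding at least one value.

    values    -- list of rows, each row a list of cell values (None means empty)
    first_row -- sheet row number of values[0]
    """
    last = first_row - 1
    for row_number, row in enumerate(values, start=first_row):
        if any(cell is not None for cell in row):
            last = row_number
    return last


# Hypothetical data standing in for the result of reading a range's .value.
values = [["id", "name"], [1, "a"], [None, None], [2, "b"], [None, None]]
rownum = last_used_row(values)
print(rownum)  # 4
```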
I had to make a counter because I am automating a bunch of things that are taken from Excel and filled into different websites. This is just the "prototype" that I came up with to make sure I could do it.
import time
import xlwings as xw
wb = xw.Book(r'C:\Users\dd\Desktop\Testbook.xlsm')
Dudsht = wb.sheets['Dud']
lastcell = Dudsht.range(1,1).end('down').row #this just does ctrl+shift+down
print(lastcell) #just so you know how many rows you have. Mine was 30.
x = 2
for i in range(x, lastcell+1): #range of 2 to 30
    Dudsht.cells(i,2).value = 'y' #enters 'y' triggering formulas
    if Dudsht.cells(i,1).value == 'ERROR':
        Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 2
        continue #if there is an error it will highlight and skip an item
    time.sleep(.5) #this was just so I could see visually
    Dudsht.cells(i,2).value = 'x'
    print('Item' + str(i) + ' Complete') #Item1 Complete
    time.sleep(.5)
    Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 3 #highlights completed item
If there is no blank row, you can just use this:
len(Range('A1').vertical)
You don't need to know how many rows are in the sheet.
import xlwings as xw
wb = xw.Book('20180301.xlsm')
sh = wb.sheets['RowData']
rownum = 2
while sh.range('A'+str(rownum)).value is not None:
    value = sh.range('A'+str(rownum)).value
    print(str(value))
    rownum += 1
This will print out all data in column A.
Clean solution from https://gist.github.com/Elijas/2430813d3ad71aebcc0c83dd1f130e33?permalink_comment_id=2088976#gistcomment-2088976:
used_range_rows = (active_sheet.api.UsedRange.Row, active_sheet.api.UsedRange.Row + active_sheet.api.UsedRange.Rows.Count)
used_range_cols = (active_sheet.api.UsedRange.Column, active_sheet.api.UsedRange.Column + active_sheet.api.UsedRange.Columns.Count)
used_range = xw.Range(*zip(used_range_rows, used_range_cols))
For counting rows in a column with empty cells in between:
import xlwings as xw
wb = xw.Book(loc)
sheet = wb.sheets['sheetname']
counter = 0
rownum = 1
while (rownum >= 1):
    if sheet.range('A'+str(rownum)).value is not None:
        counter += 1
    elif sheet.range('A'+str(rownum)).value is None and sheet.range('A'+str(rownum+1)).value is not None:
        counter += 1
    elif sheet.range('A'+str(rownum)).value is None and sheet.range('A'+str(rownum+1)).value is None:
        counter += 1
        break
    rownum += 1
