Quickly count non-empty cells in a large Excel sheet - Python

I'm trying to determine how much data is missing from a large excel sheet. The following code takes a prohibitive amount of time to complete. I've seen similar questions, but I'm not sure how to translate the answer to this case. Any help would be appreciated!
import openpyxl
wb = openpyxl.load_workbook('C://Users/Alec/Documents/Vertnet master list.xlsx', read_only = True)
sheet = wb.active
lat = 0
loc = 0
ele = 0
a = openpyxl.utils.cell.column_index_from_string('CF')
b = openpyxl.utils.cell.column_index_from_string('BU')
c = openpyxl.utils.cell.column_index_from_string('BX')
print('Workbook loaded')
for x in range(2, sheet.max_row):
    if sheet.cell(row = x, column = a).value:
        lat += 1
    if sheet.cell(row = x, column = b).value:
        loc += 1
    if sheet.cell(row = x, column = c).value:
        ele += 1
    print((x/sheet.max_row) * 100, '%')
print('Latitude: ', lat/sheet.max_row)
print('Location', loc/sheet.max_row)
print('Elevation', ele/sheet.max_row)

If you are simply trying to do the calc on a table on the sheet and not the entire sheet, you could make one adjustment to make it faster.
row = 1
Do Until IsEmpty(range("A1").offset(row,1).value)
    if range("B"&row).value: lat += 1
    if range("C"&row).value: loc += 1
    if range("D"&row).value: ele += 1
    row = row + 1
Loop
This takes you to the end of your defined table rather than the end of the whole sheet, which is likely most of the reason it's taking so long.
Hope this helps

Your problem is that, despite advice in the documentation to the contrary, you're using your own counters to access cells. In read-only mode each call to ws.cell() forces openpyxl to reparse the XML source for the worksheet. Instead, use ws.iter_rows(min_col=b, max_col=a) to read only the span of columns you're interested in (BU is the leftmost of your three columns and CF the rightmost), and pick the cells you need out of each row.
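A minimal sketch of that single-pass approach, assuming the same three columns as in the question. The helper names count_non_empty and count_in_workbook are made up for illustration, and the counting is kept separate from openpyxl so it can be tried on plain tuples. Unlike the truthiness check in the question, this version treats 0 as a filled cell.

```python
def count_non_empty(rows, offsets):
    """Count non-empty values at each offset across an iterable of row tuples."""
    counts = [0] * len(offsets)
    total = 0
    for row in rows:
        total += 1
        for i, offset in enumerate(offsets):
            # Guard against rows that are shorter than expected.
            value = row[offset] if offset < len(row) else None
            if value is not None and value != "":
                counts[i] += 1
    return counts, total


def count_in_workbook(path):
    """Stream columns BU..CF of the active sheet once (requires openpyxl)."""
    from openpyxl import load_workbook
    from openpyxl.utils import column_index_from_string

    first = column_index_from_string("BU")
    last = column_index_from_string("CF")
    # Zero-based positions of CF, BU and BX inside each returned row.
    wanted = [column_index_from_string(name) - first for name in ("CF", "BU", "BX")]

    wb = load_workbook(path, read_only=True)
    try:
        rows = wb.active.iter_rows(
            min_row=2, min_col=first, max_col=last, values_only=True
        )
        return count_non_empty(rows, wanted)
    finally:
        wb.close()
```

Calling count_in_workbook(path) would return the three counts (latitude, location, elevation) together with the number of data rows, so the missing fraction for each column is 1 - count / total.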

Related

Inserting rows in QTableWidget more freely

I have a QTableWidget populated with data from an xlsx file. To insert rows at a position of my choosing, I must first supply a "Kod_Towaru" value to locate the row index, and then the number of rows to insert below it.
The code is:
columnHeaders = []
for j in range(self.ui.zestawienie_analiza_tab_2.model().columnCount()):
    columnHeaders.append(self.ui.zestawienie_analiza_tab_2.horizontalHeaderItem(j).text())
df = pd.DataFrame(columns=columnHeaders)
for row in range(self.ui.zestawienie_analiza_tab_2.rowCount()):
    for col in range(self.ui.zestawienie_analiza_tab_2.columnCount()):
        df.at[row, columnHeaders[col]] = self.ui.zestawienie_analiza_tab_2.item(row, col).text()
from openpyxl import Workbook
flag=False
wb = openpyxl.load_workbook('Zestawienie - NOWA WERSJA.xlsx')
sheet = wb['Zestawienie']
index = df.index
number_of_rows = len(index)
# find length of index
print(number_of_rows)
kod_towaru = self.ui.dodaj_normalia.text()
if index.size != 0:
    result = df.loc[df['Kod_Towaru'] == kod_towaru].index[0]
    print(result)
    amount = self.ui.ilosc_normalia.text()
    direct_amount = int(amount)
    sheet.insert_rows(idx=result+3, amount=direct_amount)
    wb.save('Zestawienie - NOWA WERSJA.xlsx')
But as you can see from the code above, this is very complex: I first have to insert blank rows into the xlsx file and then populate the same table again. And, as I said before, I must supply two values: Kod_Towaru, to find the index of the row, and the amount.
Is there a way to do it like in Excel, with just Ctrl +, a right mouse click, or something similar?

Faster method to find the first empty cell in a column using openpyxl (Python 3.5)

I am having a problem finding the first empty cell in a certain column of a 40k-line .xlsx file. The further the search goes, the slower it becomes. Is there a faster (ideally instant) way to find the first empty cell in a column?
wb = load_workbook(filename = dest_filename,read_only=True)
sheet_ranges1 = wb[name]
i = 1
x = 0
sam = 0
cc = 0
brgyst =Street+Brgy
entrylist = [TotalNoConfig,TotalNoChannel,Rsl,Mode,RslNo,Year,IssuedDate,Carrier,CaseNo,Site,brgyst,Municipality,Province,Region,Longitude1,Longitude2,Longitude3,Latitude1,Latitude2,Latitude3,ConvertedLong,ConvertedLat,License,Cos,NoS,CallSign,PTSVC,PTSVCCS,Tx,Rx] #The values to be inputted in the entire row after searching the last empty cell in column J
listX1 = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N', 'O','P','Q','T','U','V','R','X','Y','Z','AA','AB','AM','AN','AP','FL'] #The columns in the file
eter = 0
while(x != 1):
    cellS = 'J'+str(i) #until there is no empty cell
    if(sheet_ranges1[cellS].value is None): #if found empty cell, insert the values
        x=1
        book = load_workbook(filename = dest_filename)
        sheet = book[name]
        rangeof = int(len(entrylist))
        while(cc<rangeof):
            cells = listX1[cc]+str(i)
            sheet[cells]= entrylist[cc]
            cc=cc+1
    else:
        x=0
        sam = sam+1
        i=i+1
wb.save(dest_filename)
wb.close()
In read-only mode, every cell lookup causes the worksheet to be parsed again, so you should always use ws.iter_rows() for your work.
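As a rough sketch of what that could look like for column J: the function names first_empty_row and first_empty_row_in_column are hypothetical, and the scan itself is plain Python so it can be tried without a workbook. The openpyxl wrapper reads the column in one streaming pass instead of one lookup per row.

```python
def first_empty_row(values, start_row=1):
    """Return the 1-based row of the first empty value, or the row just past the data."""
    row = start_row
    for value in values:
        if value is None or value == "":
            return row
        row += 1
    return row


def first_empty_row_in_column(path, sheet_name, column=10):
    """Scan one column in a single pass (10 is column J); requires openpyxl."""
    from openpyxl import load_workbook

    wb = load_workbook(path, read_only=True)
    try:
        rows = wb[sheet_name].iter_rows(
            min_col=column, max_col=column, values_only=True
        )
        # Each row is a 1-tuple because only one column was requested.
        return first_empty_row(row[0] if row else None for row in rows)
    finally:
        wb.close()
```

The row number it returns can then be used to write the new entries after reopening the workbook in normal (writable) mode, as the question already does.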

Python to excel array broken into characters

When I write the data to Excel, every character ends up in its own cell, so Tuesday looks like T,u,e,s,d,a,y. The goal is for each cell to hold a whole word rather than a single character. There are many for loops and I am struggling to find the cause. Any ideas?
import requests
from pprint import pprint
from xml.dom.minidom import parseString
from openpyxl import Workbook
NMNorth2=[("Farmington"),("Gallup"),("Grants"),("Las_Vegas"),("Raton"),("Santa_Fe"), ("Taos"),("Tijeras"),("Tucumcari")]
NMNorth=[("NM", "Farmington"),("NM", "Gallup"),("NM", "Grants"),("NM", "Las_Vegas"),("NM", "Raton"),("NM", "Santa_Fe"), ("NM", "Taos"),("NM", "Tijeras"),("NM", "Tucumcari")]
wb = Workbook()
dest_filename = 'weather.xlsx'
ws1 = wb.active
ws1.title = "Weather"
for state, city in NMNorth:
    r = requests.get("http://api.wunderground.com/api/id/forecast/q/"+state+"/"+city+".json")
    data = r.json()
    forecast = data['forecast']['txt_forecast']['forecastday']
    for n in forecast:
        day = n['title']
        forecaststm = (n['fcttext'])
        columnVariable = 2
        for x in day:
            ws1.cell(row = 1, column = columnVariable).value = x
            columnVariable +=1
        for y in forecaststm:
            ws1.cell(row = 2, column = columnVariable).value = y
            columnVariable +=1
rowVariable = 2
ws1.cell(row = 1, column = 1).value = "City"
for state in NMNorth2:
    ws1.cell(row = rowVariable, column = 1).value = state
    rowVariable +=1
wb.save(filename = dest_filename)
The issue here is that Python treats strings as iterables. This bites you if you think you're iterating through a list of strings (or similar) and go one level too deep in nested for loops; the easiest way to spot it is to print what you're working with in each loop.
In your case, the below loop is taking each letter (x) in the day of the week (day), writing it to a column and then incrementing the column you're writing to (columnVariable):
for x in day:
    ws1.cell(row = 1, column = columnVariable).value = x
    columnVariable +=1
As an aside, camelCase isn't standard in Python; it's more common to use underscores, e.g. column_variable. See PEP 8.
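To make the difference concrete, here is a small illustration that needs no workbook. Both helper names are invented for this sketch, and a dict of column number to value stands in for the cells that would be written.

```python
def cells_per_character(text, start_col=2):
    """Mimics the loop above: iterating a string yields one cell per character."""
    return {col: ch for col, ch in enumerate(text, start=start_col)}


def cells_per_value(values, start_col=2):
    """Iterates a list of strings instead, so each cell gets a whole string."""
    return {col: value for col, value in enumerate(values, start=start_col)}


print(cells_per_character("Tuesday"))
print(cells_per_value(["Tuesday", "Tuesday Night"]))
```

In the question's code this would mean assigning day (and forecaststm) to a cell directly, one cell per forecast entry, rather than looping over the string.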

Count rows in excel sheet in Python with xlwings

I have a script in Python that uses xlwings to open up an Excel file, and read and process the values of a certain column row by row. Here is the for statement:
for row in range(2, rownum):
I would like to repeat this function over every row in the sheet that actually contains something. It starts at 2 and ends at 'rownum'. My question is how to automatically count the number of rows and pass that value to 'rownum'. I'm sure xlwings has a way to do this, but I can't figure it out- perhaps the Autofit tool?
Thanks for the help!
It's all in the API documentation.
If you're only looking for the number of rows, you can obtain the total number of rows in your array/table by using the current_region property of your range and then getting the address of the last cell of this range. (This works only if your range is contiguous, with no empty rows or columns inside it.)
rownum = Range('A1').current_region.last_cell.row
Alternatively, you can use table instead of current_region, the range will just be a bit different.
Once you have that, you can just loop through the rows:
for i in range(1, rownum + 1): # The indexing starts at 1
    Range((i, 1)).value = ... # Will select cell 'Ai'
But as mentioned in other answers, this multiplies the number of calls between Python and Excel, which is considerably slower. It is better to import the range, modify it, and export it back to Excel.
Unless I've missed something while reading their API documentation it doesn't seem possible. You might need to use other libraries, for example pandas:
import pandas as pd
df = pd.read_excel(excel_file_path, sheet_name="Sheet1")
print(len(df))
If you don't want to use another library just for that, you can do it the hard and ugly way:
last_row = 0
while True:
    if cell_value is not None: # replace cell_value with however
                               # xlwings accesses a cell's value
        last_row += 1
    else:
        break
print(last_row)
With xlwings, you would read in the Range first, then iterate over it:
rng = Range((startrow, startcol), (rownum, colnum)).value
for row in rng:
    ...
Then at the end, write the result back:
Range((startrow, startcol)).value = result_rng
This way you minimize the cross-application calls which are slow.
You might also want to use Range.table.
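A hedged sketch of the read-once pattern applied to the row-count question, assuming a recent xlwings where a Sheet exposes used_range. The names filled_row_count and count_rows are invented for this example, and the counting part is plain Python so it can be tried without Excel.

```python
def filled_row_count(values):
    """Count leading non-empty values in a column that was read in one call."""
    if not isinstance(values, list):
        # A single-cell range comes back as a scalar rather than a list.
        values = [values]
    count = 0
    for value in values:
        if value is None:
            break
        count += 1
    return count


def count_rows(sheet, column=1):
    """`sheet` is assumed to be an xlwings Sheet; the column is read in one call."""
    last_row = sheet.used_range.last_cell.row
    values = sheet.range((1, column), (last_row, column)).value
    return filled_row_count(values)
```

Because the whole column crosses the Python/Excel boundary once, the loop itself runs entirely in Python.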
I had to make a counter because I am automating a bunch of tasks that take values from Excel and fill them into different websites. This is just the "prototype" I came up with to make sure I could do it.
import time
import xlwings as xw
wb = xw.Book(r'C:\Users\dd\Desktop\Testbook.xlsm')
Dudsht = wb.sheets['Dud']
lastcell = Dudsht.range(1,1).end('down').row #this just does ctrl+shift+down
print(lastcell) #just so you know how many rows you have. Mine was 30.
x = 2
for i in range(x, lastcell+1): #range of 2 to 30
    Dudsht.cells(i,2).value = 'y' #enters 'y' triggering formulas
    if Dudsht.cells(i,1).value == 'ERROR':
        Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 2
        continue #if there is an error it will highlight the row and skip the item
    time.sleep(.5) #this was just so I could see visually
    Dudsht.cells(i,2).value = 'x'
    print('Item' + str(i) + ' Complete') #e.g. Item2 Complete
    time.sleep(.5)
    Dudsht.cells(i,1).api.EntireRow.Interior.ColorIndex = 3 #highlights completed item
If there is no blank row, you can just use this:
len(Range('A1').vertical)
You don't need to know how many rows are in the sheet.
import xlwings as xw
wb = xw.Book('20180301.xlsm')
sh = wb.sheets['RowData']
rownum = 2
while sh.range('A'+str(rownum)).value is not None:
    value = sh.range('A'+str(rownum)).value
    print(str(value))
    rownum += 1
This will print out all data in column A.
Clean solution from https://gist.github.com/Elijas/2430813d3ad71aebcc0c83dd1f130e33?permalink_comment_id=2088976#gistcomment-2088976:
# Last row/column = first + count - 1 (without the -1 the range is one cell too large)
used_range_rows = (active_sheet.api.UsedRange.Row, active_sheet.api.UsedRange.Row + active_sheet.api.UsedRange.Rows.Count - 1)
used_range_cols = (active_sheet.api.UsedRange.Column, active_sheet.api.UsedRange.Column + active_sheet.api.UsedRange.Columns.Count - 1)
used_range = xw.Range(*zip(used_range_rows, used_range_cols))
For counting rows in a column with empty cells in between:
import xlwings as xw
wb = xw.Book(loc)
sheet = wb.sheets['sheetname']
counter = 0
rownum = 1
while (rownum >= 1):
    if sheet.range('A'+str(rownum)).value is not None:
        counter += 1
    elif sheet.range('A'+str(rownum)).value is None and sheet.range('A'+str(rownum+1)).value is not None:
        counter += 1
    elif sheet.range('A'+str(rownum)).value is None and sheet.range('A'+str(rownum+1)).value is None:
        counter += 1
        break
    rownum += 1
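The loop above stops at the first pair of consecutive blank cells and makes up to two Excel calls per row. As an alternative sketch, if the column has already been read into a Python list in a single call, the last filled row can be found without assuming anything about the size of the gaps. The name last_filled_row is hypothetical.

```python
def last_filled_row(values, first_row=1):
    """Return the sheet row of the last non-empty value, or 0 if all are empty."""
    last = 0
    for offset, value in enumerate(values):
        if value is not None:
            last = first_row + offset
    return last


# Column A with gaps of one and two blank cells: the last filled row is 6.
print(last_filled_row(["a", None, "b", None, None, "c", None]))
```

Here values is assumed to be the list obtained from reading a one-column range with .value, and first_row is the sheet row that the list starts at.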

data validation range Django and xlsxwriter

I have been using Django and xlsxwriter on a project that I am working on. I want to use data_validation in Sheet1 to pull in the lists that I have printed in Sheet2. I get the lists to print, but am not seeing the data_validation in Sheet1 when I open the file. Any insight on what I am doing incorrectly is much appreciated!
wb = xlsxwriter.Workbook(TestCass)
sh_1 = wb.add_worksheet()
sh_2 = wb.add_worksheet()
col = 15
head_col = 0
for header in headers:
    sh_1.write(0,head_col,header)
    sh_2.write(0,head_col,header)
    list_row = 1
    list = listFunction(headerToModelDic[header])
    for entry in list:
        sh_2.write(list_row,col,entry)
        list_row += 1
    sh_1.data_validation(1,col,50,col,{'validate':'list','source':'=Sheet2!$A2:$A9'})
    col += 1
wb.close()
Note: The reason I am not pulling the list directly from the site is because it is too long (longer than 256 characters). Secondly, I ultimately would like the source range in the data validation to take in variables from sheet2, however I cannot get sheet 1 to have any sort of data validation as is so I figured I would start with the absolute values.
It looks like the data ranges are wrong in the example. You are writing out the list data in column P (col = 15) of Sheet2, but the data validation refers to column A ($A2:$A9).
Maybe in your full example there is data in that range, but in the example above there isn't.
I've modified your example slightly to a non-Django example with some sample data. I've also changed the data validation range to match the written data range:
import xlsxwriter
wb = xlsxwriter.Workbook('test.xlsx')
sh_1 = wb.add_worksheet()
sh_2 = wb.add_worksheet()
col = 15
head_col = 0
headers = ['Header 1']
for header in headers:
    sh_1.write(0,head_col,header)
    sh_2.write(0,head_col,header)
    list_row = 1
    list = [1, 2, 3, 4, 5]
    for entry in list:
        sh_2.write(list_row,col,entry)
        list_row += 1
    sh_1.data_validation(1,col,50,col,
                         {'validate':'list','source':'=Sheet2!$P2:$P6'})
    col += 1
wb.close()
And here is the output (the screenshot is not included here).
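For the follow-up goal in the question, building the source range from variables rather than hard-coding it, one possible sketch is below. The helpers col_to_name and list_source are written for this example only. XlsxWriter also ships comparable helpers, such as xl_range_abs in xlsxwriter.utility, which may be preferable in real code. Sheet names containing spaces would additionally need to be wrapped in single quotes.

```python
def col_to_name(col):
    """Convert a zero-based column index to Excel letters (0 -> A, 15 -> P)."""
    name = ""
    col += 1
    while col:
        col, rem = divmod(col - 1, 26)
        name = chr(ord("A") + rem) + name
    return name


def list_source(sheet, col, first_row, last_row):
    """Build an absolute reference from zero-based row and column indexes."""
    letter = col_to_name(col)
    return f"={sheet}!${letter}${first_row + 1}:${letter}${last_row + 1}"


# Rows 1..5 (zero-based) of column 15 match the data written in the example.
print(list_source("Sheet2", 15, 1, 5))
```

Inside the header loop, the 'source' value could then be built as list_source('Sheet2', col, 1, list_row - 1), since list_row ends one past the last row written.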
