Openpyxl enumerating over NoneType cells - python

My goal is to open a spreadsheet that had mismatched data entries and iterate through them row by row within the column to then process the information. When iterating with a for loop with range(2, ws.max_row), the iteration continues past the last instance of data (cell a36) and keeps iterating and returning None beyond where I would expect the ws.max_row argument to stop at cell a36.
This is what I tried:
for cell in range(2, ws.max_row):
    value = ws.cell(row=cell, column=1).value
    print(value)
The output displays the cell values containing data and then after cell a36 it continues to output empty cells such as:
DPS ‐ Hot Springs ‐ 206 S Chicago Street
None
None
None
None
None
A36 is the last cell containing any data and I thought that max_row was equivalent to the last row containing data. Does anyone understand what could be causing my for loop to continue afterwards?

If your columns are fully populated up till the last row (in your case row 36), why not use a while loop and a break condition?
Something along the lines of this:
row = 2
while True:
    value = ws.cell(row=row, column=1).value
    if value is None:
        break  # stop at the first empty cell
    print(value)
    row += 1
As stated by @stovfl and @Charlie Clark, max_row isn't necessarily the last row of your data that contains a value. It is better to be explicit.
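The stop-at-the-first-blank idea can also be written without a manual row counter by using `itertools.takewhile`. The sketch below works on a plain list of values so that it runs without a workbook; `values_until_blank` is a hypothetical helper name, and with openpyxl the input could be something like `[c.value for c in ws["A"][1:]]` (column A, skipping the header row).

```python
from itertools import takewhile


def values_until_blank(values):
    """Return the leading values, stopping at the first None (blank cell)."""
    return list(takewhile(lambda v: v is not None, values))


# Stand-in for the values of column A from row 2 downward.
column_a = ["DPS - Hot Springs", "DPS - Little Rock", None, None, None]
print(values_until_blank(column_a))
```

Like the while loop, this assumes the column has no gaps: a blank cell in the middle of the data ends the scan early.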

Your Excel file may contain cells beyond the last data cell that were edited at some point and are now empty (None).
There are two ways to solve it:
Set a break in your loop when cell.value equals the value you expect to mark the end.
Check your Excel file for unexpected empty cells below your data.

Related

How do I iterate this Data frame - First row has no column 1

Please help. I have the following data frame with 5 columns. The first row throws an Index out of bounds error even though there are supposed to be 5 columns - if you include the area marked in red. Not quite sure how to iterate this?:
index 4 is out of bounds for axis 0 with size 4
Here is the code I am using:
for index, row in df.iterrows():
    print(row.index[0], row.index[1], row.index[2], row.index[3], row.index[4])
In the end I had to adopt a crude method of first writing the data frame to a CSV file and then reading that file.
Fortunately the empty header (marked in red in the table) returns as a blank ('') that I can easily identify and replace with a value:
I am sure this is not the best solution, but here goes:
import csv

df.to_csv('x.csv')
with open('x.csv', newline='') as csvfile:
    dt = csv.reader(csvfile)
    for row in dt:
        print(row)
Output for row1/col1 is an empty string. This works for me.

How to compare two unordered columns by value with openpyxl and print the results for each row?

I have an excel file with two columns. The values in the columns are unordered. I know for sure that some of the cells from the source column exist in the target column, which is longer (240 rows compared to 191). How can I check if a value from the source column exists in the target column and then print to a column to the right from target column row by row i.e. "check" if the value from source exists and "missing" if it doesn't?
I assume it should follow this logic, but the value comparison itself seems tricky to me:
for (source_row, target_row) in zip(ws.iter_rows(min_row=2, max_col=3, max_row=240),
                                     ws.iter_rows(min_row=2, max_col=7, max_row=240)):
    for (source_cell, target_cell) in zip(source_row, target_row):
        if target_cell in source_row:  # doesn't seem to work
            ws.cell(column=10, row=target_cell.row).value = "check"
            break
        else:
            ws.cell(column=10, row=target_cell.row).value = "missing"
            break
For this kind of thing you should avoid nested loops. They will not necessarily work in this case, because you are only comparing cell with cell rather than a cell against a set of values. Nested loops can also, ahem, quickly become slow, and they make debugging and flow control more difficult.
If you just want to know which values are in each column:
source = set(r[0] for r in ws.iter_rows(min_row=2, max_row=240, min_col=3, max_col=3, values_only=True))
target = set(r[0] for r in ws.iter_rows(min_row=2, max_row=240, min_col=7, max_col=7, values_only=True))
missing_value = target - source
You can now loop over the source cells and do the comparison:
for row in ws.iter_rows(min_row=2, max_row=240, min_col=3, max_col=3):
    for c in row:
        value = "check" if c.value in target else "missing"
        c.offset(column=7).value = value  # column 3 + 7 = column 10
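The set-based membership test can be tried on plain lists before touching the worksheet. This is a minimal sketch: `label_membership` is a hypothetical helper, and the two lists stand in for column values read with `ws.iter_rows(..., values_only=True)`. Here each target value is labelled by whether it occurs in the source column, which is one reading of the question; swapping the arguments labels the source values instead.

```python
def label_membership(values, lookup_values):
    """Label each value "check" if it occurs in lookup_values, else "missing"."""
    lookup = set(lookup_values)  # building the set once keeps each lookup cheap
    return ["check" if v in lookup else "missing" for v in values]


source = ["apple", "pear", "plum"]
target = ["plum", "kiwi", "apple", "fig"]
print(label_membership(target, source))
```

Because the result has one label per input value, in order, it can be written back row by row next to the column it was computed for.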

python openpyxl find certain cell then return the next nonempty cell

I am using openpyxl to work with this Excel sheet. Once I find the cell that contains "Mandatory Field", I want to keep looking down that column to find the first nonempty value.
for row in ws.iter_rows():
    for cell in row[0:4]:
        if cell.value == 'Mandatory Field':
            print(cell.value)
This is what I have so far. I do not know how to tell it: now that you have found the cell with "Mandatory Field", return the value of the first non-empty cell below it. I am looking through 5 columns because I need to do this twice.
If you need to do this twice within a range of cells you should use a "sentinel" flag.
sentinel = False
for row in ws.iter_rows(max_col=5):
    for cell in row:
        if cell.value == "Mandatory Field":
            sentinel = True
        if sentinel:
            print(cell.offset(row=1).value)
            sentinel = False
Note that in the example you can avoid the use of a sentinel because of the offset() method but I'm including it as an example.
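If the cell directly below the marker can itself be empty, the search has to keep walking down the column. The following is a minimal sketch on a grid of plain values (for example `list(ws.iter_rows(max_col=5, values_only=True))`); `first_values_below` is a hypothetical helper name. It returns one result per occurrence of the marker, so it covers the case where the marker appears twice.

```python
def first_values_below(rows, marker="Mandatory Field"):
    """For every cell equal to marker, find the first non-empty value below it.

    rows is a list of equal-length tuples of cell values. The result is a list
    of (row_index, col_index, value) with zero-based indices; value is None
    when only empty cells follow the marker in that column.
    """
    found = []
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if value != marker:
                continue
            # Walk down the same column, starting one row below the marker.
            below = (rows[i][c] for i in range(r + 1, len(rows)))
            first = next((v for v in below if v not in (None, "")), None)
            found.append((r, c, first))
    return found


grid = [
    ("Mandatory Field", None, "Mandatory Field"),
    (None, "x", None),
    ("name", None, None),
    (None, None, 42),
]
print(first_values_below(grid))
```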

Openpyxl max_row and max_column wrongly reports a larger figure

My question concerns a function in a parsing script I'm developing. I am trying to write a Python function that finds the column number corresponding to a matched value in an Excel sheet. The sheet is created on the fly with openpyxl, and its first row holds headers (starting at the 3rd column) that each span 4 merged columns. In a subsequent function, I parse content to be added to the columns under the matching headers. (Additional info: the content I'm parsing is BLAST+ output. I'm trying to create a summary spreadsheet with the hit names as columns, each with subcolumns for hits, gaps, span and identity. The first two columns are the query contigs and their lengths.)
I had initially written a similar function for xlrd and it worked. But when I rewrote it for openpyxl, I found that the max_row and max_column properties report more rows and columns than are actually present. For instance, I have 20 rows in this pilot input, but max_row reports 82.
Note that I manually selected the empty rows & columns and right clicked and deleted them, as advised elsewhere in this forum. This didn't change the error.
def find_column_number(x):
    col = 0
    print("maxrow =", hrsh.max_row)
    print("maxcol =", hrsh.max_column)
    # openpyxl rows and columns are 1-based
    for rowz in range(1, hrsh.max_row + 1):
        print("now the row is", rowz)
        for colz in range(1, hrsh.max_column + 1):
            print("now the column is", colz)
            name = hrsh.cell(row=rowz, column=colz).value
            if name == x:
                col = colz
    return col
The issue with max_row and max_column has been discussed here: https://bitbucket.org/openpyxl/openpyxl/issues/514/cell-max_row-reports-higher-than-actual. I applied the suggestion from there, but max_row is still wrong.
for row in reversed(hrsh.rows):
    values = [cell.value for cell in row]
    if any(values):
        print("last row with data is {0}".format(row[0].row))
        maxrow = row[0].row
I then tried the suggestion at https://www.reddit.com/r/learnpython/comments/3prmun/openpyxl_loop_through_and_find_value_of_the/ and tried to get the column values. Once again, the script takes the empty columns into account and reports a higher number of columns than are actually present.
for currentRow in hrsh.rows:
    for currentCell in currentRow:
        print(currentCell.value)
Can you please help me resolve this error, or suggest another method to achieve my aim?
As noted in the bug report you linked to, there's a difference between a sheet's reported dimensions and whether these include empty rows or columns. If max_row and max_column are not reporting what you want to see, then you will need to write your own code to find the first completely empty row. The most efficient way, of course, would be to start from max_row and work backwards, but the following is probably sufficient:
for max_row, row in enumerate(ws, 1):
    if all(c.value is None for c in row):
        break
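The backwards scan mentioned above could look roughly like this. It is a sketch on plain row tuples rather than on a worksheet; `last_row_with_data` is a hypothetical helper, and with openpyxl the rows could come from `list(ws.iter_rows(values_only=True))`.

```python
def last_row_with_data(rows):
    """Return the 1-based number of the last row with a non-None value, or 0."""
    for number in range(len(rows), 0, -1):
        if any(value is not None for value in rows[number - 1]):
            return number
    return 0


rows = [
    ("contig", "length"),
    ("c1", 1200),
    (None, None),   # gap inside the data
    ("c2", 800),
    (None, None),   # trailing empty rows of the kind that inflate max_row
    (None, None),
]
print(last_row_with_data(rows))
```

Scanning from the end means a blank row in the middle of the data does not cut the result short, unlike a forward scan that stops at the first empty row.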
I confirm the bug found by the OP. I found newer posts reporting max_row being too large.
This bug cannot be fixed.
In my case, it appears when I set the value of all cells in a worksheet to None.
After this operation, the worksheet still reports the old dimensions.
A call to ws.calculate_dimension() does not change anything.
Closing and restarting excel still has openpyxl report the same wrong dimensions.
This is a problem because ws.append() starts at ws.max_row, and there is no way to override this behaviour. You end up with a worksheet that is blank and then, somewhere down, the data you appended appears.
The only remedy I found is to delete the entire rows by hand in Excel; openpyxl then shows the correct max_row.
I found out that this is linked to the member ws._cells not being empty as it should after setting all cells to None. However, the user cannot delete this dictionary as it is a private member.
I have the same behaviour with the latest version 3.0.3 of openpyxl. I use an XLSX file as a template (created from a XLS file), open it, add some data then save it with a different name. I find out that max_row is set to 49 and I don't know why.
However, after reading this line in the online documentation (https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html):
Do not create worksheets yourself, use openpyxl.workbook.Workbook.create_sheet() instead
I created my XLSX template directly from openpyxl simply as follows:
import openpyxl

wb = openpyxl.Workbook()
wb.save(filename="template.xlsx")
It works fine now (max_row=1). Hope it helps.
When you use openpyxl's max_row property to get the number of rows containing data, it sometimes counts empty rows as well. This is because max_row returns the highest row index of the sheet, not the number of rows that contain data.
Example: suppose an Excel or Google Sheets file is created with 10 rows of data and 5 of those rows are then cleared. max_row still returns 10, because the highest row index in the file remains 10.
To count the rows that actually contain data in openpyxl:
import openpyxl

def get_maximum_rows(*, sheet_object):
    # Count the rows that contain at least one non-empty cell
    rows = 0
    for row in sheet_object:
        if not all(col.value is None for col in row):
            rows += 1
    return rows

workbook = openpyxl.load_workbook(<filepath>)  # <filepath>: path to your .xlsx file
sheet_object = workbook.active
max_rows = get_maximum_rows(sheet_object=sheet_object)
Today I encountered the same problem. I had edited the .xlsx file that I use with openpyxl: I deleted all values from the rightmost column and found that max_column no longer gave the true last column. Then I deleted the columns themselves whose cell values had previously been cleared (right-click on column 'ID' and delete). Now it reports the correct value.
I used Dharman's approach and solved the problem.
I had an Excel file with more than 100k rows, from which I had deleted the duplicates.
At first, max_row reported the total number of rows before the deletion.
I used the workbook.save(filename='another_filename.xlsx') method to save the original Excel file as a new one.
Then I opened the new file (another_filename.xlsx) with openpyxl. max_row now reports the correct number.
In general, max_row and max_column can make your script slow to run; it may be better to detect a None yourself and store the row or column at that point.
Here is how I find the max column and max row by simply looping through the Excel sheet. With this code you can compare the values reported by openpyxl with those found by the loop.
from openpyxl import load_workbook

wb = load_workbook("Test.xlsx")
sheet = wb["Sheet 1"]
print("Python defined max_column " + str(sheet.max_column))
print("Python defined max_row " + str(sheet.max_row))

def get_maximum_cols():
    # Index of the first empty cell in row 2
    for i in range(1, 20000):
        if sheet.cell(row=2, column=i).value is None:
            return i
    return None  # no empty cell found within the search limit

def get_maximum_rows():
    # Index of the first empty cell in column 2
    for i in range(1, 20000):
        if sheet.cell(row=i, column=2).value is None:
            return i
    return None  # no empty cell found within the search limit

max_cols = get_maximum_cols()
max_rows = get_maximum_rows()
print('max column ' + str(max_cols))
print('max row ' + str(max_rows))
wb.save("Test.xlsx")
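The loops above stop at the first empty cell in row 2 or column 2, so a single gap ends the search early. A sketch that instead inspects every cell and records the furthest row and column holding a value is shown below; `data_extent` is a hypothetical name, and the input is assumed to be row tuples such as those from `sheet.iter_rows(values_only=True)`.

```python
def data_extent(rows):
    """Return (last_row, last_col), 1-based, over cells whose value is not None."""
    last_row = last_col = 0
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            if value is not None:
                last_row = r                 # r only grows, so this is the max
                last_col = max(last_col, c)
    return last_row, last_col


rows = [
    ("id", "name", None, None),
    (1, None, "x", None),
    (None, None, None, None),
    (2, "b", None, None),
    (None, None, None, None),
]
print(data_extent(rows))
```

This reads every cell once, which costs more than stopping at the first blank but does not depend on any particular row or column being fully populated.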

Looping and XLSXwriter formatting of a row

I have a workbook with a number of sheets that I want to format after it's created, and I want to alter the colors of the header row based on column. For example, I want the first 9 columns to be one color, then column 10 should be another, then all the rest should be a third color.
This is what I am looping through...it sort of works, but all the cells in row 0 end up the same color; the last color assigned always overwrites the previous columns.
visitFormat = mtbook.add_format({'bg_color':'#e9ccfc'})
cognotesFormat = mtbook.add_format({'bg_color':'#d2eff2'})
filedateFormat = mtbook.add_format({'bg_color':'#8cbcff'})
for worksheet in mtbook.worksheets():
    print(worksheet)
    # for every column
    for i in range(len(subreportCols)):
        # set header bgcolor based on current column (i)
        if [i] in range(0,11):
            useheader = visitFormat
        elif [i] == 10:
            useheader = cognotesFormat
        else:
            useheader = filedateFormat
        # Write the value from cell (first row, column=1) back into that cell with formatting applied
        worksheet.write(0, i, subreportCols[i], useheader)
I'm confused by this, since I thought it was writing each column separately. Do I need to do this cell by cell somehow?
Thank you!
Solved it through troubleshooting, leaving up in case it helps someone else (there is an "Answer Your Question" button, after all).
In this line:
if [i] in range(0,11):
...what I thought I was doing was using [i] as a reference to the i'th value in my list, but [i] actually creates a new one-element list containing i, which is never found in a range of integers and never equals 10. I swapped out [i] for just i, and that worked fine.
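The column-to-format decision can be isolated in a small function, which makes the index comparison easy to check on its own. This is a sketch under an assumption: the "first 9 columns" are taken as zero-based indices 0 to 8 and "column 10" as index 9, which differs from the bounds used in the code above. `header_color` is a hypothetical helper that returns the hex color; in the real script the result would select among the `add_format` objects.

```python
VISIT, COGNOTES, FILEDATE = "#e9ccfc", "#d2eff2", "#8cbcff"


def header_color(i):
    """Pick the header background color for zero-based column index i."""
    if i < 9:        # columns 1 to 9 (assumed bounds, see lead-in)
        return VISIT
    if i == 9:       # column 10
        return COGNOTES
    return FILEDATE  # every column after column 10


# A one-element list is not an integer, so it is not found in the range.
print([3] in range(0, 11))
print([header_color(i) for i in (0, 8, 9, 10)])
```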
