Openpyxl max_row and max_column wrongly report a larger figure - python

My question concerns a function in a parsing script I'm developing. I am trying to write a Python function that finds the column number corresponding to a matched value in an Excel sheet. The workbook is created on the fly with openpyxl, and its first row holds headers (starting from the 3rd column) that each span 4 columns merged into one. In a subsequent function, I parse some content to be added to the columns under the matching headers. (Additional info: the content I'm parsing is BLAST+ output. I'm trying to create a summary spreadsheet with the hit names in each column, with subcolumns for hits, gaps, span and identity. The first two columns are the query contigs and their lengths.)
I had initially written a similar function for xlrd and it worked. But when I try to rewrite it for openpyxl, I find that the max_row and max_column properties wrongly return a larger number of rows and columns than are actually present. For instance, I have 20 rows for this pilot input, but max_row reports 82.
Note that I manually selected the empty rows and columns, then right-clicked and deleted them, as advised elsewhere on this forum. This didn't change the error.
def find_column_number(x):
    col = 0
    print "maxrow = ", hrsh.max_row
    print "maxcol = ", hrsh.max_column
    for rowz in range(hrsh.max_row):
        print "now the row is ", rowz
        if(rowz > 0):
            pass
        for colz in range(hrsh.max_column):
            print "now the column is ", colz
            name = (hrsh.cell(row=rowz,column=colz).value)
            if(name == x):
                col = colz
    return col
The issue with max_row and max_column has been discussed here: https://bitbucket.org/openpyxl/openpyxl/issues/514/cell-max_row-reports-higher-than-actual I applied the suggestion from that thread, but max_row is still wrong.
for row in reversed(hrsh.rows):
    values = [cell.value for cell in row]
    if any(values):
        print("last row with data is {0}".format(row[0].row))
        maxrow = row[0].row
I then tried the suggestion at https://www.reddit.com/r/learnpython/comments/3prmun/openpyxl_loop_through_and_find_value_of_the/ and tried to get the column values. Once again, the script takes the empty columns into account and reports a higher number of columns than are actually present.
for currentRow in hrsh.rows:
    for currentCell in currentRow:
        print(currentCell.value)
Can you please help me resolve this error, or suggest another method to achieve my aim?

As noted in the bug report you linked to, a sheet's reported dimensions are not the same as the extent of its data: they can include empty rows or columns. If max_row and max_column are not reporting what you want to see, you will need to write your own code to find the first completely empty row. The most efficient way, of course, would be to start from max_row and work backwards, but the following is probably sufficient:
for max_row, row in enumerate(ws, 1):
    if all(c.value is None for c in row):
        break
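The backwards search mentioned above could be sketched roughly as follows. This is only an illustration, not part of openpyxl: the helper names last_row_with_data and last_column_with_data are made up for this example, and ws is assumed to be a worksheet opened in normal (not read-only) mode. Only ws.max_row, ws.max_column and ws.cell() are used.

```python
def last_row_with_data(ws):
    """Return the 1-based index of the last row holding a non-None value.

    Hypothetical helper: returns 0 when every cell is empty.
    """
    # Walk upwards from the reported max_row until a populated row is found.
    for row_idx in range(ws.max_row, 0, -1):
        for col_idx in range(1, ws.max_column + 1):
            if ws.cell(row=row_idx, column=col_idx).value is not None:
                return row_idx
    return 0


def last_column_with_data(ws):
    """Return the 1-based index of the last column holding a non-None value.

    Hypothetical helper: returns 0 when every cell is empty.
    """
    # Same idea, scanning columns from right to left.
    for col_idx in range(ws.max_column, 0, -1):
        for row_idx in range(1, ws.max_row + 1):
            if ws.cell(row=row_idx, column=col_idx).value is not None:
                return col_idx
    return 0
```

Unlike the forward loop, this does not stop at a blank row in the middle of the data. Only None is treated as empty here, so a cell holding an empty string still counts as data.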

I confirm the bug found by the OP. I found newer posts reporting max_row being too large.
This bug cannot be fixed.
In my case, it appears when I set the value of all cells in a worksheet to None.
After this operation, the worksheet still reports the old dimensions.
A call to ws.calculate_dimensions() does not change anything.
Closing and restarting Excel still leaves openpyxl reporting the same wrong dimensions.
This is a problem because ws.append() starts at ws.max_row, and there is no way to override this behaviour. You end up with a worksheet that is blank at the top and then, somewhere further down, the data you appended appears.
The only remedy I found for this bug is to delete the entire rows by hand in Excel. openpyxl then shows the correct max_row.
I found out that this is linked to the member ws._cells not being empty as it should after setting all cells to None. However, the user cannot delete this dictionary as it is a private member.

I have the same behaviour with the latest version 3.0.3 of openpyxl. I use an XLSX file as a template (created from a XLS file), open it, add some data then save it with a different name. I find out that max_row is set to 49 and I don't know why.
However after reading in the online documentation https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html this line:
Do not create worksheets yourself, use
openpyxl.workbook.Workbook.create_sheet() instead
I created my XLSX template directly from openpyxl simply as follows:
wb = openpyxl.Workbook()
wb.save(filename="template.xlsx")
It works fine now (max_row=1). Hope it helps.

When you use openpyxl's max_row property to get the number of rows containing data in a sheet, it sometimes counts empty rows as well. This is because max_row returns the maximum row index of the sheet, not the count of rows containing data.
Example: say an Excel/Google Sheets file is created with 10 rows of data and 5 of those rows are then removed. max_row still returns 10, because the maximum row index of the file is still 10: the file initially contained 10 rows.
So, to get the number of rows that actually contain data in openpyxl:
def get_maximum_rows(*, sheet_object):
    rows = 0
    for max_row, row in enumerate(sheet_object, 1):
        if not all(col.value is None for col in row):
            rows += 1
    return rows

import openpyxl
workbook = openpyxl.load_workbook(<filepath>)
sheet_object = workbook.active
max_rows = get_maximum_rows(sheet_object=sheet_object)

Today I encountered the same problem. I had edited the .xlsx file that I'm using with openpyxl: I deleted all the values from the rightmost column and found that max_column no longer gave the true last column. Then I deleted the columns themselves where the cell values had previously been cleared (right-click on column 'ID' and delete). Now it reports the correct value.

I used Dharman's approach and solved the problem.
I had an Excel file with more than 100k rows. I had deleted the duplicates in this file.
At first, max_row reported the total row count from before the deletion.
I used the workbook.save(filename='another_filename.xlsx') method to save the original Excel file to a new one.
Then I used openpyxl to open the new file (another_filename.xlsx). max_row reports the correct number now.

In general, relying on max_row and max_column can make your script slow to run; it may be better to detect the first None yourself and store that row or column index.

Here is how I find the max column and max row by simply looping through the Excel sheet. With this code, you can compare the values openpyxl reports with the ones found by the loop.
from openpyxl import load_workbook

wb = load_workbook("Test.xlsx")
sheet = wb["Sheet 1"]
print("Python defined max_column " + str(sheet.max_column))
print("Python defined max_row " + str(sheet.max_row))

def get_maximum_cols():
    # Scan row 2 and return the index of its first empty cell
    for i in range(1, 20000):
        if sheet.cell(row=2, column=i).value is None:
            return i

def get_maximum_rows():
    # Scan column 2 and return the index of its first empty cell
    for i in range(1, 20000):
        if sheet.cell(row=i, column=2).value is None:
            return i

max_cols = get_maximum_cols()
max_rows = get_maximum_rows()
print('max column ' + str(max_cols))
print('max row ' + str(max_rows))
wb.save("Test.xlsx")

Related

Openpyxl max_row shows different output after saving and reopening the same file

Today I encountered a very strange error with the `ws.max_row` property. After saving a `Workbook` to .xlsx and loading it again with `load_workbook()` (I always reopen the file for a final check), `ws.max_row` shows more rows than it did just before the save operation.
Additionally, after finding that `ws.max_row` had changed its value, I tried printing the cell values to find out what was happening. Please take a look at this code sample and the output it produces:
print('before save')
print(f'data_ws.max_row: {data_ws.max_row}')
for idx, cell in enumerate(data_ws['L']):
    print(f'Row idx: {idx}, Cell value: {cell.value}')
### save processed file
final_folder = 'Final file'
if not os.path.exists(final_folder):
    os.mkdir(final_folder)
data_workbook.save(f'{final_folder}/{data_file}')
wb = load_workbook(f'{final_folder}/{data_file}')
data_ws = wb.active
print('after save')
print(f'data_ws.max_row: {data_ws.max_row}')
for idx, cell in enumerate(data_ws['L']):
    print(f'Row idx: {idx}, Cell value: {cell.value}')
Code output:
https://pastecode.io/s/r09p7cz1
(I have cut some lines, because they were identical in both prints and nobody would like to scroll that much :D)
I am not posting my whole code here, because it is working fine and it is not the point here. Mainly, what my program does is replace one URL with a couple of other URLs correlated to that one. Sometimes I delete some rows if there are no correlations for a given URL.
I have discovered that, after the save operation, there are some new rows with None values at the end of the print. What's more intriguing, the last printed row after the None rows was the last row that had earlier been replaced with a couple of other rows, which are located (and shown in the print()) at the end of the file before the save operation. On top of that, the rows with None values and that last row with an actual value are not present in the .xlsx file.
For now I've made a workaround in which I manually check for None-valued rows, and it is working fine, but the problem intrigued me very much and I would love to hear some explanations for it.

Merging Specific Cells in an Excel Sheet with Python

I've been trying to merge cells that meet specific criteria with the cell next to it via a loop, but I'm not quite sure how to go about it.
For example, starting at row 7, if the cell has the word "Sample" in it, I want it to merge with the cell in the column next to it and I want to keep doing that until I get to the end of that row.
I'm currently using openpyxl for this.
Here is what I've tried (it does not work):
wb = load_workbook('Test.xlsx')
ws = wb.active
worksheet = wb['Example']
q_cells = []
for row_cells in worksheet.iter_rows(min_row = 7):
    for cell in row_cells:
        if cell.value == 'Sample':
            q_cells.append(cell.coordinate)
for item in q_cells:
    worksheet.merge_cells(item:item+1)
wb.save('merging.xlsx')
I'm not quite sure how best to proceed with this code. Any help would be appreciated!
merge_cells takes a string (e.g. "A2:A8") or a set of keyword arguments giving the bounds of the range. From the docs:
>>> ws.merge_cells('A2:D2')
>>> ws.unmerge_cells('A2:D2')
>>>
>>> # or equivalently
>>> ws.merge_cells(start_row=2, start_column=1, end_row=4, end_column=4)
>>> ws.unmerge_cells(start_row=2, start_column=1, end_row=4, end_column=4)
Source: https://openpyxl.readthedocs.io/en/stable/usage.html
It sounds like you will want to find your first cell and your last cell, and merge as such (here I'm using f-strings):
ws.merge_cells(f'{first_cell.coordinate}:{last_cell.coordinate}')
In openpyxl, the cells absorbed into a merge change from type 'Cell' to type 'MergedCell', and the sheet's merged ranges are tracked in ws.merged_cells, a 'MultiCellRange' of cell coordinates. Openpyxl will let you overlap merge ranges without throwing an error, but Excel won't let you open the resulting file without a warning (and will probably remove the later merges). If you want to merge, you have to specify the whole range.
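For the "Sample" case in the question, one possible approach is to pass the bounds as keyword arguments, so that no coordinate string has to be built. The sketch below rests on an assumption about what is wanted (each matching cell merged with the single cell to its right). merge_sample_cells is a hypothetical name, and the code assumes openpyxl 2.6 or later, where cell.column is an integer. A match that has already been absorbed into the previous merge is skipped, so the ranges never overlap.

```python
def merge_sample_cells(ws, keyword="Sample", min_row=7):
    """Merge every cell equal to `keyword` with its right-hand neighbour.

    Hypothetical helper; returns the (row, column) anchors that were merged.
    """
    merged = []
    for row in ws.iter_rows(min_row=min_row):
        skip_column = None  # column swallowed by the previous merge in this row
        for cell in row:
            if cell.column == skip_column:
                continue
            if cell.value == keyword:
                ws.merge_cells(start_row=cell.row, start_column=cell.column,
                               end_row=cell.row, end_column=cell.column + 1)
                merged.append((cell.row, cell.column))
                skip_column = cell.column + 1
    return merged
```

With the workbook from the question this would be called as merge_sample_cells(worksheet) before wb.save('merging.xlsx'). Keep in mind that a merged range only displays the value of its top-left cell, so whatever was in the right-hand cell is no longer shown.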

How to delete specific rows in excel with openpyxl python if condition is met

Using openpyxl, I am creating a Python script that will loop through the rows of data and find rows in which some of the columns are empty; these rows will be deleted. The range of rows is 3 to 1800.
I am not exactly sure how to delete these rows. Please see the code I have come up with so far.
What I was trying to achieve is to iterate through the rows and check whether the values in columns 4 and 7 are set to None. If so, I wanted to store the row number in a suitable collection (I need advice on which one would be best for this) and then create another loop that would delete those row numbers in reverse order, as I don't want to change the structure of the file by deleting rows with data.
I believe there may be an easier function to do this, but I could not find this particular answer.
for i in worksheet.iter_rows(min_row=3, max_row=1800):
    emp_id = i[4]
    full_name = i[7]
    if emp_id.value == None and full_name.value == None:
        print(worksheet[i].rows)
        rows_to_delete.append(i)
Your iteration looks good.
OpenPyXL offers worksheet.delete_rows(idx, amount=1), where idx is the index of the first row to delete (1-based) and amount is the number of rows to delete starting at that index.
So worksheet.delete_rows(3, 4) will delete rows 3, 4, 5, and 6, and worksheet.delete_rows(2, 1) will just delete row 2.
In your case, i is a tuple of cells rather than a row number, so you'd probably want to do something like worksheet.delete_rows(i[0].row, 1), working from the bottom of the sheet upwards so that earlier deletions don't shift the rows still to be deleted.
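Putting the two steps together, a minimal sketch might look like the following. It assumes the columns to check are 4 and 7 counted from 1 (note that i[4] and i[7] in the question are zero-based indexes, so they actually point at columns 5 and 8), and that iter_rows(values_only=True) is available (openpyxl 2.6 or later). delete_rows_missing_values is a hypothetical name.

```python
def delete_rows_missing_values(ws, columns=(4, 7), min_row=3, max_row=1800):
    """Delete rows whose cells in all of `columns` (1-based) are None.

    Hypothetical helper; returns the deleted row numbers, bottom-most first.
    """
    to_delete = []
    rows = ws.iter_rows(min_row=min_row, max_row=max_row, values_only=True)
    for row_number, values in enumerate(rows, start=min_row):
        # values is a plain tuple, so column n lives at index n - 1;
        # a column beyond the end of the tuple is treated as empty.
        if all(col > len(values) or values[col - 1] is None for col in columns):
            to_delete.append(row_number)

    # Delete from the bottom up so the remaining row numbers stay valid.
    for row_number in reversed(to_delete):
        ws.delete_rows(row_number, 1)
    return to_delete[::-1]
```

A plain list is enough as the collection, since the row numbers are gathered in ascending order and only need to be walked in reverse.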

How do I find the max row in Openpyxl

max_row returns a value higher than it should (the last row that has a value in it is row 7, but max_row returns 10), and if I iterate through a column to find the first row that has nothing in it, I get the same value as max_row.
This is easier to understand if you have worked with Excel from Java.
Excel cells have properties which define them as active or inactive. If you enter a value into a cell and then delete the value, the cell still remains active.
max_row returns the row number of the last active cell. Hence you get 10 rather than 7: even if the sheet now has data only up to row 7, it may once have had data up to row 10.
You can manually clear the cell in Excel (Editing->Clear->Clear All), making it inactive again. I'm not sure how to do the same via code in Python.
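As far as I know there is no direct openpyxl equivalent of Clear All, but one thing that may be worth trying from Python is deleting the stale rows at the bottom with ws.delete_rows(). The sketch below is an assumption rather than a confirmed fix: trim_trailing_empty_rows is a hypothetical name, and whether max_row shrinks afterwards should be checked on your openpyxl version.

```python
def trim_trailing_empty_rows(ws):
    """Delete the rows below the last row that still holds a value.

    Hypothetical helper; returns the number of rows removed.
    """
    last_used = 0
    # Walk upwards from max_row to the last row with any non-None value.
    for row_idx in range(ws.max_row, 0, -1):
        if any(ws.cell(row=row_idx, column=col_idx).value is not None
               for col_idx in range(1, ws.max_column + 1)):
            last_used = row_idx
            break

    surplus = ws.max_row - last_used
    if surplus > 0:
        # Remove everything after the last populated row in one call.
        ws.delete_rows(last_used + 1, surplus)
    return surplus
```

For the sheet described in the question (data up to row 7, max_row of 10) this would call ws.delete_rows(8, 3).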

openpyxl iterate through rows and apply formula

I am trying to iterate through the rows of a particular column in an Excel worksheet, apply a formula, and save the output. I'm struggling to get my code right and am not sure where to go next.
My code so far:
import openpyxl
wb = openpyxl.load_workbook('test-in.xlsx')
sheet = wb.worksheets[2]
maxRow = sheet.max_row
for row in range(2, maxRow):
    pass  # loop body not written yet
wb.save('test-out.xlsx')
So I'm not clear how to write my for loop to write the results of applying the =CLEAN(D2) formula, in column E. I can apply the formula to a single cell with:
sheet['I2'] = '=CLEAN(D2)'
However I'm not sure how I can incorporate this into my for loop!
Any help much appreciated...
Try this (max_row_num is your maxRow; in Python we usually do not use camelCase for variables). Note the + 1: range() excludes its end value, so without it the last row would be skipped.
for row_num in range(2, max_row_num + 1):
    sheet['E{}'.format(row_num)] = '=CLEAN(D{})'.format(row_num)
This is covered in the documentation: http://openpyxl.readthedocs.io/en/latest/tutorial.html#accessing-many-cells
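The same loop can be wrapped in a small function, sketched here with the end row made inclusive. fill_clean_formulas and its parameters are hypothetical names chosen for this example. The function relies only on item assignment such as sheet['E2'] = ..., which an openpyxl worksheet supports.

```python
def fill_clean_formulas(sheet, first_row, last_row, source="D", target="E"):
    """Write =CLEAN(<source><row>) into <target><row> for each row, inclusive.

    Hypothetical helper; returns the number of cells written.
    """
    # range() excludes its end value, hence last_row + 1
    for row_num in range(first_row, last_row + 1):
        sheet[f"{target}{row_num}"] = f"=CLEAN({source}{row_num})"
    return max(0, last_row - first_row + 1)
```

With the workbook from the question this would be called as fill_clean_formulas(sheet, 2, sheet.max_row) before wb.save('test-out.xlsx'). openpyxl stores the formula text as written; the result is computed when the file is opened in Excel.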
