I am trying to read these cells using openpyxl in Python. There are two merged cells called 'Name' and I am trying to write a for loop that reads the cells but skips over cells that have the same content. To do this, I compare each cell with the cell in the next column and the cell in the next row; if they are the same, I skip it. The problem is that only one cell of a merged cell holds the value and the rest read as None, so the loop compares None with 'Name' and doesn't skip the duplicate. The desired output is "Clear, Name" but instead I get "Clear, Name, Name". Is there a way to detect the duplicate even though it is a merged cell?
Here is my current approach:
origName2 = sheet.cell(row=(rowNum+1), column=colNum).value
origName1 = sheet.cell(row=rowNum, column=(colNum+1)).value
origName = sheet.cell(row=rowNum, column=colNum).value
if str(origName) == "None":
    pass
elif str(origName) == str(origName1):
    pass
elif str(origName) == str(origName2):
    pass
else:
    commands.append(origName)
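One possible way around the value comparison, sketched under the assumption of a recent openpyxl (2.6 or later, where every cell of a merged range except the top-left one is a `MergedCell` object): skip those placeholder cells by type instead of comparing values. The in-memory workbook and its layout are hypothetical, built only to mirror the "Clear, Name" example.

```python
from openpyxl import Workbook
from openpyxl.cell.cell import MergedCell

# Hypothetical in-memory sheet mirroring the example in the question.
wb = Workbook()
sheet = wb.active
sheet["A1"] = "Clear"
sheet["B1"] = "Name"
sheet.merge_cells("B1:C1")  # "Name" spans B1:C1; C1 becomes a MergedCell

commands = []
for row in sheet.iter_rows():
    for cell in row:
        # Cells inside a merged range (other than its top-left cell)
        # are placeholders without a value of their own: skip them.
        if isinstance(cell, MergedCell):
            continue
        if cell.value is not None:
            commands.append(cell.value)

print(commands)
```

Because the check is on the cell type rather than on neighbouring values, it should behave the same for horizontal and vertical merges, and it does not drop two genuinely separate cells that happen to contain the same text.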
My goal is to open a spreadsheet that had mismatched data entries and iterate through them row by row within the column to then process the information. When iterating with a for loop with range(2, ws.max_row), the iteration continues past the last instance of data (cell a36) and keeps iterating and returning None beyond where I would expect the ws.max_row argument to stop at cell a36.
This is what I tried:
for cell in range(2, ws.max_row):
    value = ws.cell(row=cell, column=1).value
    print(value)
The output displays the cell values containing data and then after cell a36 it continues to output empty cells such as:
DPS ‐ Hot Springs ‐ 206 S Chicago Street
None
None
None
None
None
A36 is the last cell containing any data and I thought that max_row was equivalent to the last row containing data. Does anyone understand what could be causing my for loop to continue afterwards?
If your columns are fully populated up till the last row (in your case row 36), why not use a while loop and a break condition?
Something along the lines of this:
row = 2
while True:
    value = ws.cell(row=row, column=1).value
    print(value)
    if value is None:  # stop at the first empty cell
        break
    row += 1
As stated by #stovfl and #Charlie Clark, max_row isn't necessarily the last row in your data that isn't None. Better to be explicit.
Maybe your Excel file has several empty (None) cells that were edited beyond the last data cell you expected.
There are 2 ways to solve it.
Set a break in your loop when cell.value equals the value you expect to stop at (for example, None).
Check whether your Excel file has unexpected empty cells beyond the data.
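The first of these options can also be written without an explicit loop counter. This is only a sketch: `values_until_blank` is a hypothetical helper name, and the list below stands in for the values of column A.

```python
from itertools import takewhile


def values_until_blank(values):
    """Return the leading values, stopping at the first None (empty cell)."""
    return list(takewhile(lambda v: v is not None, values))


# Stand-in for column A from row 2 downwards, with trailing empty cells.
column_a = ["DPS - Hot Springs", "Another entry", None, None]
print(values_until_blank(column_a))
```

With a real worksheet, the values could be fed in as `(row[0] for row in ws.iter_rows(min_row=2, max_col=1, values_only=True))`, assuming an openpyxl version that supports `values_only`. Note that `range(2, ws.max_row)` in the question also stops one row short of `max_row`, because `range` excludes its upper bound.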
I have an excel file with two columns. The values in the columns are unordered. I know for sure that some of the cells from the source column exist in the target column, which is longer (240 rows compared to 191). How can I check if a value from the source column exists in the target column and then print to a column to the right from target column row by row i.e. "check" if the value from source exists and "missing" if it doesn't?
I assume it should follow this logic, but the value comparison itself seems tricky to me:
for (source_row, target_row) in zip(ws.iter_rows(min_row=2, max_col=3, max_row=240),
                                    ws.iter_rows(min_row=2, max_col=7, max_row=240)):
    for (source_cell, target_cell) in zip(source_row, target_row):
        if target_cell in source_row:  # doesn't seem to work
            ws.cell(column=10, row=target_cell.row).value = "check"
            break
        else:
            ws.cell(column=10, row=target_cell.row).value = "missing"
            break
For this kind of thing you should avoid using nested loops. Not only will they not necessarily work in this case, because you are only checking cell against cell and not a cell against a set of values; nested loops can also, ahem, quickly become slow, and they make debugging and flow control more difficult.
If you just want to know which values are in each column, build a set from each:
source = set(r[0] for r in ws.iter_rows(min_row=2, max_row=240, min_col=3, max_col=3, values_only=True))
target = set(r[0] for r in ws.iter_rows(min_row=2, max_row=240, min_col=7, max_col=7, values_only=True))
missing_value = target - source
You can now loop over the source cells and do the comparison:
for row in ws.iter_rows(min_row=2, max_row=240, min_col=3, max_col=3):
    for c in row:
        value = "check" if c.value in target else "missing"
        c.offset(column=7).value = value  # column 3 + 7 = column 10
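For the direction described in the question (labelling each row of the target column), a self-contained sketch might look like the following. The in-memory workbook and its sample values are hypothetical; the column numbers (3 for source, 7 for target, 10 for the result) are taken from the question.

```python
from openpyxl import Workbook

# Hypothetical data: source in column 3, target in column 7, headers in row 1.
wb = Workbook()
ws = wb.active
ws.cell(row=1, column=3, value="source")
ws.cell(row=1, column=7, value="target")
for i, v in enumerate(["a", "b", "c"], start=2):
    ws.cell(row=i, column=3, value=v)
for i, v in enumerate(["c", "x", "a", "y"], start=2):
    ws.cell(row=i, column=7, value=v)

# Collect the source values once; set membership tests are cheap.
source = {
    r[0]
    for r in ws.iter_rows(min_row=2, min_col=3, max_col=3, values_only=True)
    if r[0] is not None
}

# Label every non-empty target cell in column 10 (three columns to its right).
for (cell,) in ws.iter_rows(min_row=2, min_col=7, max_col=7):
    if cell.value is None:
        continue
    cell.offset(column=3).value = "check" if cell.value in source else "missing"

labels = [ws.cell(row=r, column=10).value for r in range(2, 6)]
print(labels)
```

Filtering out None when building the set keeps the shorter source column's trailing empty cells from matching empty target cells.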
My question concerns a function that is part of a parsing script I'm developing. I am trying to write a Python function to find the column number corresponding to a matched value in Excel. The Excel file is created on the fly with openpyxl, and its first row has headers (from the 3rd column onwards) that each span 4 columns merged into one. In my subsequent function, I am parsing some content to be added to the columns corresponding to the matching headers. (Additional info: the content I'm parsing is blast+ output. I'm trying to create a summary spreadsheet with the hit names in each column, with subcolumns for hits, gaps, span and identity. The first two columns are the query contigs and their lengths.)
I had initially written a similar function for xlrd and it worked. But when I try to rewrite it for openpyxl, I find that the max_row and max_column properties return a larger number of rows and columns than are actually present. For instance, I have 20 rows for this pilot input, but it reports 82.
Note that I manually selected the empty rows & columns and right clicked and deleted them, as advised elsewhere in this forum. This didn't change the error.
def find_column_number(x):
    col = 0
    print("maxrow = ", hrsh.max_row)
    print("maxcol = ", hrsh.max_column)
    # openpyxl row and column indices are 1-based
    for rowz in range(1, hrsh.max_row + 1):
        print("now the row is ", rowz)
        for colz in range(1, hrsh.max_column + 1):
            print("now the column is ", colz)
            name = hrsh.cell(row=rowz, column=colz).value
            if name == x:
                col = colz
    return col
The issue with max_row and max_col, has been discussed here https://bitbucket.org/openpyxl/openpyxl/issues/514/cell-max_row-reports-higher-than-actual I applied the suggestion here. But the max_row is still wrong.
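for row in reversed(tuple(hrsh.rows)):  # rows may be a generator, so materialise it first
    values = [cell.value for cell in row]
    if any(values):
        print("last row with data is {0}".format(row[0].row))
        maxrow = row[0].row
        break  # stop at the first non-empty row found from the bottom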
I then tried the suggestion at https://www.reddit.com/r/learnpython/comments/3prmun/openpyxl_loop_through_and_find_value_of_the/, and tried to get the column values. Once again, the script takes the empty columns into account and reports a higher number of columns than are actually present.
for currentRow in hrsh.rows:
    for currentCell in currentRow:
        print(currentCell.value)
Can you please help me resolve this error, or suggest another method to achieve my aim?
As noted in the bug report you linked to, there's a difference between a sheet's reported dimensions and whether these include empty rows or columns. If max_row and max_column are not reporting what you want to see, then you will need to write your own code to find the first completely empty row. The most efficient way, of course, would be to start from max_row and work backwards, but the following is probably sufficient:
for max_row, row in enumerate(ws, 1):
    if all(c.value is None for c in row):
        break
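The backwards scan mentioned above could be sketched roughly as follows. `last_data_row` is a hypothetical helper name, and the in-memory workbook is only there to reproduce an inflated `max_row`.

```python
from openpyxl import Workbook


def last_data_row(ws):
    """Return the index of the last row holding any non-None value, or 0."""
    for row_idx in range(ws.max_row, 0, -1):
        # Fetch one row at a time as a tuple of plain values.
        row = next(ws.iter_rows(min_row=row_idx, max_row=row_idx, values_only=True))
        if any(v is not None for v in row):
            return row_idx
    return 0


wb = Workbook()
ws = wb.active
for value in ("a", "b", "c"):
    ws.append([value])

# Merely accessing a cell creates it, which inflates max_row.
ws.cell(row=10, column=1)

print(ws.max_row, last_data_row(ws))
```

Unlike the forward loop, this version does not stop at a blank row in the middle of the data; whether that is desirable depends on the sheet.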
I confirm the bug found by the OP. I found newer posts reporting max_row being too large.
This bug cannot be fixed.
In my case, it appears when I set the value of all cells in a worksheet to None.
After this operation, the worksheet still reports the old dimensions.
A call to ws.calculate_dimensions() does not change anything.
Closing and restarting excel still has openpyxl report the same wrong dimensions.
This is a problem because ws.append() starts at ws.max_row, and there is no way to override this behaviour. You end up with a worksheet that is blank and then, somewhere down, the data you appended appears.
The only remedy I found for this bug is to delete the entire rows by hand in Excel. openpyxl then shows the correct max_row.
I found out that this is linked to the member ws._cells not being empty as it should after setting all cells to None. However, the user cannot delete this dictionary as it is a private member.
I have the same behaviour with the latest version 3.0.3 of openpyxl. I use an XLSX file as a template (created from a XLS file), open it, add some data then save it with a different name. I find out that max_row is set to 49 and I don't know why.
However after reading in the online documentation https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html this line:
Do not create worksheets yourself, use
openpyxl.workbook.Workbook.create_sheet() instead
I created my XLSX template directly from openpyxl simply as follows:
wb = openpyxl.Workbook()
wb.save(filename="template.xlsx")
It works fine now (max_row=1). Hope it helps.
When using openpyxl's max_row property to get the number of rows containing data in the sheet, it sometimes counts empty rows as well. This is because max_row returns the maximum row index of the sheet, not the count of rows containing data.
Example: let's say an Excel/Google Sheets file is created with 10 rows of data and 5 rows of data are then removed. max_row still returns 10, because the maximum row index of the file is still 10: the file initially contained 10 rows.
So, to count the rows that actually contain data in openpyxl:
def get_maximum_rows(*, sheet_object):
    rows = 0
    for max_row, row in enumerate(sheet_object, 1):
        if not all(col.value is None for col in row):
            rows += 1
    return rows
import openpyxl
workbook = openpyxl.load_workbook(<filepath>)
sheet_object = workbook.active
max_rows = get_maximum_rows(sheet_object=sheet_object)
Today I encountered the same thing. I edited the .xlsx file which I'm using in openpyxl. I deleted all values from the rightmost column and found that max_column was not giving the exact maximum column. Then I deleted the columns whose cell values had previously been deleted (right-click on column 'ID' and delete). Now it reports the correct value.
I used Dharman's approach and solved the problem.
I had an Excel file with more than 100k rows. I had deleted the duplications in this file.
At first, the max_row reported the total row number before the deletion.
I used the workbook.save(filename='another_filename.xlsx') method to save the original Excel file to a new one.
Then I used openpyxl to open the new file (another_filename.xlsx). max_row reports the correct number now.
In general, relying on max_row and max_column can make your script slow to run; it may be better to detect the first None yourself and store that row or column.
Here is how I find the max column and max row by simply looping through the Excel sheet. With this code, you can compare the values openpyxl reports with the ones found by the loop.
from openpyxl import load_workbook
wb = load_workbook("Test.xlsx")
sheet = wb["Sheet 1"]
print("Python defined max_column " + str(sheet.max_column))
print("Python defined max_row " + str(sheet.max_row))
def get_maximum_cols():
    # Index of the first empty cell in row 2 (one past the last data column)
    for i in range(1, 20000):
        if sheet.cell(row=2, column=i).value is None:
            return i
    return None  # no empty cell found within the scanned range
def get_maximum_rows():
    # Index of the first empty cell in column 2 (one past the last data row)
    for i in range(1, 20000):
        if sheet.cell(row=i, column=2).value is None:
            return i
    return None  # no empty cell found within the scanned range
max_cols = get_maximum_cols()
max_rows = get_maximum_rows()
print('max column ' + str(max_cols))
print('max row ' + str(max_rows))
wb.save("Test.xlsx")
I thought this would be fairly simple but I am stuck. I am trying to repeat a row of data based upon a population field. For example, if the population is 921, the row needs to be repeated 921 times, and then move to the next row and repeat based upon the population. The csv file does have a header. I tried removing that and ran into problems so I put the header back.
i = 0
while i < pop:
    if pop == 'F21_64':
        break
    else:
        # writerow
        i += 1
I keep getting this error code. IndexError: list index out of range
You left a lot to assumption, but I'll try to answer. For one thing, it is not clear what you want to do with pop, other than that you want to do something as many times as its value (print to screen, output to file?). I will assume print to screen.
I imagine you are trying to do something like this (let's assume your population field is in column 2):
for row in rows[1:]:  # don't look at the header row
    pop = row.split(',')[1]  # isolate just the pop value
    popvalue = int(pop)  # convert to int
    for i in range(0, popvalue):  # for the number of the value...
        print(row)  # do the thing you want with the entire row
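If the goal is to write the repeated rows to a new CSV file rather than print them, a sketch with the standard `csv` module might look like this. It assumes that 'F21_64' is the header of the population column (a guess based on the comparison in the question); `repeat_rows` is a hypothetical helper name, and in-memory strings stand in for the real files.

```python
import csv
import io


def repeat_rows(infile, outfile, pop_column):
    """Copy the header once, then write each row as many times as its population."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    header = next(reader)
    writer.writerow(header)
    pop_idx = header.index(pop_column)
    for row in reader:
        if not row:  # skip blank lines, which would otherwise raise IndexError
            continue
        for _ in range(int(row[pop_idx])):
            writer.writerow(row)


# In-memory stand-ins for the input and output files (sample data is made up).
src = io.StringIO("name,F21_64\nA,2\n\nB,3\n")
dst = io.StringIO()
repeat_rows(src, dst, "F21_64")
print(dst.getvalue())
```

Looking up the column by header name avoids hard-coding its position. Blank lines and rows shorter than expected are one possible source of "list index out of range", though the question does not show enough code to be sure.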