Openpyxl Python - Vlookup Iterate through rows - python

I'm trying to automate a daily report we have, and I'm using a query to pull in data and writing it in Excel using openpyxl, and then doing a vlookup in openpyxl to match a cell value. Unfortunately I'm hung up on how to iterate through the rows to find the cell value to look up.
for row in ws['E5:E91']:
    for cell in row:
        cell.value = "=VLOOKUP(D5, 'POD data'!C1:D87, 2, FALSE)"
It works except I don't know how to change the D5 value to look up D6, D7, D8, etc. depending on the row I'm on. I'm honestly at a loss for how to best approach this. Obviously I don't feel like writing the formula out for every single row, and there's other columns I'd like to do this for once I get it.

Using your example, you can do:
for row in ws['E5:E91']:
    for cell in row:
        cell.value = "=VLOOKUP(D{0}, 'POD data'!C1:D87, 2, FALSE)".format(cell.row)
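Since the question mentions repeating this for other columns, the formula building can be pulled into a small helper. This is only a sketch: `build_vlookup_formulas` and its parameter names are hypothetical, and the column letters and lookup range are the ones from the question.

```python
def build_vlookup_formulas(first_row, last_row, key_col="D", target_col="E",
                           table="'POD data'!C1:D87"):
    """Map each target cell coordinate to a VLOOKUP formula keyed on the same row."""
    formulas = {}
    for row in range(first_row, last_row + 1):
        # The lookup key follows the row; the lookup table stays fixed.
        formulas[f"{target_col}{row}"] = (
            f"=VLOOKUP({key_col}{row}, {table}, 2, FALSE)"
        )
    return formulas


formulas = build_vlookup_formulas(5, 91)
print(formulas["E6"])  # =VLOOKUP(D6, 'POD data'!C1:D87, 2, FALSE)
```

With a worksheet `ws` opened through openpyxl, the result could be written back with `for coord, formula in formulas.items(): ws[coord] = formula`.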

Related

Merging Specific Cells in an Excel Sheet with Python

I've been trying to merge cells that meet specific criteria with the cell next to it via a loop, but I'm not quite sure how to go about it.
For example, starting at row 7, if the cell has the word "Sample" in it, I want it to merge with the cell in the column next to it and I want to keep doing that until I get to the end of that row.
I'm currently using openpyxl for this.
Here is what I've tried (it does not work):
wb = load_workbook('Test.xlsx')
ws = wb.active
worksheet = wb['Example']
q_cells = []
for row_cells in worksheet.iter_rows(min_row=7):
    for cell in row_cells:
        if cell.value == 'Sample':
            q_cells.append(cell.coordinate)
for item in q_cells:
    worksheet.merge_cells(item:item+1)
wb.save('merging.xlsx')
I'm not quite sure how best to proceed with this code. Any help would be appreciated!
merge_cells takes a string (eg: "A2:A8") or a set of values. From the docs:
>>> ws.merge_cells('A2:D2')
>>> ws.unmerge_cells('A2:D2')
>>>
>>> # or equivalently
>>> ws.merge_cells(start_row=2, start_column=1, end_row=4, end_column=4)
>>> ws.unmerge_cells(start_row=2, start_column=1, end_row=4, end_column=4)
Source: https://openpyxl.readthedocs.io/en/stable/usage.html
It sounds like you will want to find your first cell and your last cell, and merge as such (here I'm using f-strings):
ws.merge_cells(f'{first_cell.coordinate}:{last_cell.coordinate}')
openpyxl records merged ranges on the worksheet (ws.merged_cells is a 'MultiCellRange'), and each merge is specified as a particular range of cell coordinates. openpyxl will let you overlap merge ranges without throwing an error, but Excel won't let you open the resulting file without a warning (and will probably remove the later merges). If you want to merge, you have to specify the whole range.
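Applied to the question, one possible approach is to compute the merge spans first and pass them to merge_cells afterwards, using the keyword form quoted from the docs above. The sketch below is an illustration only: `sample_merge_spans` is a hypothetical helper, it works on plain row values rather than cell objects, and it assumes that a cell already absorbed by a merge should not start a new one (to avoid the overlapping ranges mentioned above).

```python
def sample_merge_spans(rows, first_row=7, target="Sample"):
    """Yield keyword arguments for ws.merge_cells(), one dict per matching cell.

    rows: iterable of row value sequences; the first one is worksheet row `first_row`.
    """
    for row_number, values in enumerate(rows, start=first_row):
        covered_until = 0  # last column already covered by a merge in this row
        for col_number, value in enumerate(values, start=1):
            if col_number <= covered_until:
                continue  # this cell is the right-hand half of the previous merge
            if value == target:
                yield dict(start_row=row_number, start_column=col_number,
                           end_row=row_number, end_column=col_number + 1)
                covered_until = col_number + 1


demo_rows = [
    ["Sample", None, "x"],
    ["a", "Sample", "Sample"],
]
spans = list(sample_merge_spans(demo_rows))
print(spans[0])
```

With openpyxl, the rows could come from `worksheet.iter_rows(min_row=7, values_only=True)`, and each span could be applied with `worksheet.merge_cells(**span)`.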

How to write on first empty row through xlwings?

The Excel file has content in A1, A2, and A3. I want Python to automatically write the output to the first empty cell in column A, i.e. it should write to A4.
Another example: let's say I have content in B1 through B130. Here I would like Python to write the result in cell B131.
How do I write a Python solution that performs this task in Excel through xlwings?
If your data is contiguous, get the end of the current region to find the last cell, then offset that cell by one row to get the next empty cell.
from xlwings import Range
cel = Range("A1:A2")
rng = cel.current_region
last_cel = rng.end("down")
empty_cell = last_cel.offset(1, 0)
Now you can do what you want with empty_cell.

Openpyxl max_row and max_column wrongly reports a larger figure

My question concerns a function that is part of a parsing script I'm developing. I am trying to write a Python function to find the column number corresponding to a matched value in Excel. The Excel file is created on the fly with openpyxl, and its first row holds headers (from the 3rd column onward) that each span 4 columns merged into one. In my subsequent function, I parse some content to be added to the columns corresponding to the matching headers. (Additional info: the content I'm parsing is BLAST+ output. I'm trying to create a summary spreadsheet with the hit names in each column, with subcolumns for hits, gaps, span and identity. The first two columns are the query contigs and their lengths.)
I had initially written a similar function for xlrd and it worked. But when I try to rewrite it for openpyxl, I find that max_row and max_column report more rows and columns than are actually present. For instance, I have 20 rows for this pilot input, but max_row reports 82.
Note that I manually selected the empty rows & columns and right clicked and deleted them, as advised elsewhere in this forum. This didn't change the error.
def find_column_number(x):
    col = 0
    print("maxrow =", hrsh.max_row)
    print("maxcol =", hrsh.max_column)
    # openpyxl rows and columns are 1-based
    for rowz in range(1, hrsh.max_row + 1):
        print("now the row is", rowz)
        for colz in range(1, hrsh.max_column + 1):
            print("now the column is", colz)
            name = hrsh.cell(row=rowz, column=colz).value
            if name == x:
                col = colz
    return col
The issue with max_row and max_column has been discussed here: https://bitbucket.org/openpyxl/openpyxl/issues/514/cell-max_row-reports-higher-than-actual. I applied the suggestion from that thread, but max_row is still wrong.
for row in reversed(hrsh.rows):
    values = [cell.value for cell in row]
    if any(values):
        print("last row with data is {0}".format(row[0].row))
        maxrow = row[0].row
I then tried the suggestion at https://www.reddit.com/r/learnpython/comments/3prmun/openpyxl_loop_through_and_find_value_of_the/ and tried to get the column values. Once again, the script takes the empty columns into account and reports a higher number of columns than are actually present.
for currentRow in hrsh.rows:
    for currentCell in currentRow:
        print(currentCell.value)
Can you please help me resolve this error, or suggest another method to achieve my aim?
As noted in the bug report you linked to, there's a difference between a sheet's reported dimensions and whether these include empty rows or columns. If max_row and max_column are not reporting what you want to see, then you will need to write your own code to find the first completely empty row. The most efficient way, of course, would be to start from max_row and work backwards, but the following is probably sufficient:
for max_row, row in enumerate(ws, 1):
    if all(c.value is None for c in row):
        break
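The backwards scan mentioned above could look roughly like the following. This is a sketch on plain row values rather than on a worksheet; `last_row_with_data` is a hypothetical helper, and it assumes the rows have already been read into a list (with openpyxl 2.6 or later, for example via `list(ws.iter_rows(values_only=True))`).

```python
def last_row_with_data(rows):
    """Return the 1-based index of the last row holding a non-None value, or 0."""
    # Walk from the bottom so trailing empty rows are skipped quickly.
    for index in range(len(rows), 0, -1):
        if any(value is not None for value in rows[index - 1]):
            return index
    return 0


demo_rows = [
    ("contig_1", 1200),
    (None, None),
    (0, None),       # 0 is a real value, so this row counts as data
    (None, None),
    (None, None),
]
print(last_row_with_data(demo_rows))  # 3
```

Unlike the forward loop, this version does not stop at an empty row in the middle of the data; it only ignores empty rows at the end.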
I confirm the bug found by the OP. I found newer posts reporting max_row being too large.
This bug cannot be fixed.
In my case, it appears when I set the value of all cells in a worksheet to None.
After this operation, the worksheet still reports the old dimensions.
A call to ws.calculate_dimensions() does not change anything.
Closing and restarting excel still has openpyxl report the same wrong dimensions.
This is a problem because ws.append() starts at ws.max_row, and there is no way to override this behaviour. You end up with a worksheet that is blank and then, somewhere down, the data you appended appears.
The only remedy I have found is to delete the entire rows by hand in Excel. openpyxl then shows the correct max_row.
I found out that this is linked to the member ws._cells not being empty as it should after setting all cells to None. However, the user cannot delete this dictionary as it is a private member.
I have the same behaviour with the latest version 3.0.3 of openpyxl. I use an XLSX file as a template (created from a XLS file), open it, add some data then save it with a different name. I find out that max_row is set to 49 and I don't know why.
However after reading in the online documentation https://openpyxl.readthedocs.io/en/stable/api/openpyxl.worksheet.worksheet.html this line:
Do not create worksheets yourself, use
openpyxl.workbook.Workbook.create_sheet() instead
I created my XLSX template directly from openpyxl simply as follows:
import openpyxl
wb = openpyxl.Workbook()
wb.save(filename="template.xlsx")
It works fine now (max_row=1). Hope it helps.
When you use openpyxl's max_row property to get the number of rows containing data, it sometimes counts empty rows as well. This is because max_row returns the highest row index in the sheet, not the count of rows containing data.
Example: suppose an Excel/Google Sheets file is created with 10 rows of data and 5 of those rows are then removed. max_row still returns 10, because the highest row index in the file is still 10, as the file initially contained 10 rows.
So, to count the rows that contain data in openpyxl:
import openpyxl
def get_maximum_rows(*, sheet_object):
    rows = 0
    for row in sheet_object:
        if not all(col.value is None for col in row):
            rows += 1
    return rows
workbook = openpyxl.load_workbook(filepath)  # filepath: path to your workbook
sheet_object = workbook.active
max_rows = get_maximum_rows(sheet_object=sheet_object)
Today I encountered the same problem. I edited the .xlsx file that I'm using with openpyxl: I deleted all values from the rightmost column and found that max_column was not giving the correct value. Then I deleted the columns whose cell values I had previously deleted (right-click on column 'ID' and delete). Now it reports the correct value.
I used Dharman's approach and solved the problem.
I had an Excel file with more than 100k rows. I had deleted the duplications in this file.
At first, the max_row reported the total row number before the deletion.
I used the workbook.save(filename='another_filename.xlsx') method to save the original Excel file to a new one.
Then I used openpyxl to open the new file (another_filename.xlsx). max_row now reports the correct number.
In general, relying on max_row and max_column can make your script slow to run; it may be better to detect a None value and store that row or column instead.
Here is how I find the max column and max row by simply looping through the Excel sheet. With this code, you can compare the values openpyxl reports against the values found by the loop.
from openpyxl import load_workbook
wb = load_workbook("Test.xlsx")
sheet = wb["Sheet 1"]
print("Python defined max_column " + str(sheet.max_column))
print("Python defined max_row " + str(sheet.max_row))
def get_maximum_cols():
    # Scan row 2 and return the first column whose cell is empty.
    for i in range(1, 20000):
        if sheet.cell(row=2, column=i).value is None:
            return i
def get_maximum_rows():
    # Scan column 2 and return the first row whose cell is empty.
    for i in range(1, 20000):
        if sheet.cell(row=i, column=2).value is None:
            return i
max_cols = get_maximum_cols()
max_rows = get_maximum_rows()
print('max column ' + str(max_cols))
print('max row ' + str(max_rows))
wb.save("Test.xlsx")

openpyxl - Iterate over columns and rows to grab data from middle of sheet

I'm reading the documentation for openpyxl, and I needed something a bit more specific and I wasn't sure if there's a way to do it using iter_rows or iter_cols.
In the docs, it said to do this to grab rows and cols:
for row in ws.iter_rows(min_row=1, max_col=3, max_row=2):
    for cell in row:
        print(cell)
or
for col in ws.iter_cols(min_row=1, max_col=3, max_row=2):
    for cell in col:
        print(cell)
Doing this will give me A1, B1, C1 and so on or A1, A2, B1, B2, and so on.
But is there a way to manipulate this so you can grab the data from another point in the sheet?
I'm trying to grab the cells from F3 to W3 for example. But I'm not sure how many rows there are, there could be 5, there could be 10. So I would need to grab F4 to W4 and so on until I reach the last one which could be F10 to W10 or something.
I understand how the iter_rows and iter_cols work but I haven't found a way to manipulate it to start elsewhere and to reach an end if there are no other values left? It appears I would have to define the max_rows to how many rows there are in the sheet. Is there a way for it to check for the max amount of rows itself?
The biggest question I have is just how to iterate through the rows starting in the middle of the sheet rather than the beginning. It doesn't have to use iter_rows or iter_cols, that's just the part I was reading up on most in the documentation.
Thank you in advance!
What's wrong with ws.iter_cols(min_row=3, min_col=6, max_col=23) for ws[F3:W…]? If no maximum is specified openpyxl will keep iterating as far as it can.
If you wish to be able to dynamically end when you've reached the end of data (so, if you end up with a sheet with more than 23 rows / columns, for example), you can set max_row=ws.max_row and max_col=ws.max_column
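The numbers 6 and 23 in that call are simply the positions of columns F and W. A small pure-Python sketch of the conversion follows; `column_index` is a hypothetical helper written for illustration, and openpyxl ships its own equivalent as `openpyxl.utils.column_index_from_string`.

```python
def column_index(letters):
    """Convert a column label such as 'F' or 'AA' to its 1-based index."""
    index = 0
    for char in letters.upper():
        # Column labels are a base-26 number with digits A=1 ... Z=26.
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index


print(column_index("F"), column_index("W"))  # 6 23
```

To walk F3:W3, F4:W4, and so on row by row, the same bounds could be passed to iter_rows, for example `ws.iter_rows(min_row=3, min_col=column_index("F"), max_col=column_index("W"))`.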

Using xlrd to get list of excel values in python

I am trying to read a list of values in a column in an excel sheet. However, the length of the column varies every time, so I don't know the length of the list. How do I get python to read all the values in a column and stop when the cells are empty using xlrd?
for i in range(worksheet.nrows):
will iterate through all the rows in the worksheet.
If you were interested in column 0, for example:
c0 = [worksheet.row_values(i)[0] for i in range(worksheet.nrows) if worksheet.row_values(i)[0]]
Or, even better, make this a generator:
column_generator = (worksheet.row_values(i)[0] for i in range(worksheet.nrows))
Then you can use itertools.takewhile for lazy evaluation. It stops at the first empty value, which gives better performance if you only want to read up to that point. The values are converted with str() before stripping so that numeric cells do not raise an error.
from itertools import takewhile
print(list(takewhile(lambda value: str(value).strip(), column_generator)))
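The stopping behaviour of takewhile can be checked without a workbook. The sketch below runs on a plain list standing in for a column's values; `values_until_blank` is a hypothetical helper, and it assumes a blank cell shows up as None, an empty string, or a whitespace-only string.

```python
from itertools import takewhile


def values_until_blank(column_values):
    """Return the leading values of a column, stopping at the first blank cell."""
    def has_content(value):
        # Numbers such as 0 or 0.0 count as content; only None and blank text stop the scan.
        return value is not None and str(value).strip() != ""

    return list(takewhile(has_content, column_values))


print(values_until_blank(["name", 2.0, 0, "", "ignored"]))  # ['name', 2.0, 0]
```

Everything after the first blank is ignored, even if later cells hold data, which is the behaviour the question asks for.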
