Python and Excel Formula - python

Complete beginner here but have a specific need to try and make my life easier with automating Excel.
I have a weekly report that contains a lot of useless columns and using Python I can delete these and rename them, with the code below.
from openpyxl import Workbook, load_workbook
wb = load_workbook('TestExcel.xlsx')
ws = wb.active
ws.delete_cols(1,3)
ws.delete_cols(3,8)
ws.delete_cols(4,3)
ws.insert_cols(3,1)
ws['A1'].value = "Full Name"
ws['C1'].value = "Email Address"
ws['C2'].value = '=B2&"#testdomain.com"'
wb.save('TestExcelUpdated.xlsx')
This does the job but I would like the formula to continue from B2 downwards (since the top row are headings).
ws['C2'].value = '=B2&"#testdomain.com"'
Obviously, in Excel it is just a case of dragging the formula down to the end of the column but I'm at a loss to get this working in Python. I've seen similar questions asked but the answers are over my head.
Would really appreciate a dummies guide.
Example of Excel report after Python code

one way to do this is by iterating over the rows in your worksheet.
for row in ws.iter_rows(min_row=2): #min_row ensures you skip your header row
row[2].value = '=B' + str(row[0].row) + '&"#testdomain.com"'
row[2].value selects the third column due to zero based indexing. row[0].row gets the number corresponding to the current row

Related

Copying/pasting a column of formulas using python

I have a very large excel file that I'm dealing with in python. I have a column where every cell is a different formula. I want to copy the formulas and paste them one column over from column GD to GE.
The issue is that I want to the formulas to update like they do in excel, its just that excel takes a very long time to copy/paste because the file I'm working with is very large.
Any ideas on possibly how to use openpyxl's translator to do this or anything else?
from openpyxl import load_workbook
import pandas as pd
#loads the excel file and is now saved under workbook#
workbook = load_workbook('file.xlsx')
#uses the individual sheets index(first sheet = 0) to work on one sheet at a time#
sheet= workbook.worksheets[8]
#inserts a column at specified index number#
sheet.insert_cols(187)
#naming the new columns#
sheet['GE2']= '20220531'
here is my updated code
from openpyxl import load_workbook
from openpyxl.formula.translate import Translator
#loads the excel file and is now saved under workbook#
workbook = load_workbook('file.xlsx')
#uses the individual sheets index(first sheet = 0) to work on one sheet at a time#
sheet= workbook.worksheets[8]
formula = sheet['GD3'].value
new_formula = Translator(formula, origin= 'GE3').translate_formula("GD3")
sheet['GD2'] = new_formula
for row in sheet.iter_rows(min_col=187, max_col=188):
old, new = row
if new.data_type != "f":
continue
new_formula = Translator(new.value, origin=old.coordinate).translate_formula(new.coordinate)
workbook.save('file.xlsx')
When you add or remove columns and rows, Openpyxl does not manage formulae for you. The reason for this is simple: where should it stop? Managing a "dependency graph" is exactly the kind of functionality that an application like MS Excel provides.
But it is quite easy to do this in your own code using the Formula Translator
# insert the column
formula = ws['GE1'].value
new_formula = Translator(formula, origin="GD1").translate_formula("GE1")
ws['GE1'] = new_formula
It should be fairly straightforward to create a loop for this (check the data type and use cell.coordinate to avoid potential typos or incorrect adjustments.
sheet.insert_cols(187)
for row in ws.iter_rows(min_col=187, max_col=188):
old, new = row
if new.data_type != "f"
continue
new_formula = Translator(new.value, origin=old.coordinate).translate_formula(new.coordinate)

Add Data Frame to Excel in Specific Cell

I am using openpyxl to write a pandas data frame to an excel spreadsheet. I am reading the excel workbook in from a template as the client wants a specific header, format, etc.
If I want to add the table starting at row 17, how do I modify the code below to achieve this? There are formatted, merged cells in rows 1 through 16 so when I try to save in current form it gives me AttributeError: 'MergedCell' object attribute 'value' is read-only. I saw a similar stack question but respondents advised just adding a bunch of blank cells before adding the data frame--a solution that won't work for me.
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows
wb = load_workbook('proto_temp.xlsx')
ws_rq = wb["Results Query"]
rows = dataframe_to_rows(df)
for r_idx, row in enumerate(rows, 1):
for c_idx, value in enumerate(row, 1):
ws_rq.cell(row=r_idx, column=c_idx, value=value)
wb.save("df_to_xl.xlsx")
Question:
Is there data after row 17?
If Answer == NO, then see below
for r in dataframe_to_rows(df, index=True, header=True):
ws.append(r)
If Answer == YES, then the answer is slightly more complicated.
Options
Structure your code so you run the code above, before other cells are written below.
Read XLSX into DF, make the changes and then write it back.
Lastly, you could loop through a range and write cell-by-cell (feels too computationally expensive - but I figured I'd suggest it anyways).
sheet.cell(row=2, column=2).value = 2
Hope this provides value :)

Finding Excel cell reference using Python

Here is the Excel file in question:
Context: I am writing a program which can pull values from a PDF and put them in the appropriate cell in an Excel file.
Question: I want to write a function which takes a column value (e.g. 2014) and a row value (e.g. 'COGS') as arguments and return the cell reference where those two intersect (e.g. 'C3' for 2014 COGS).
def find_correct_cell(year=2014, item='COGS'):
#do something similar to what the =match function in Excel does
return cell_reference #returns 'C3'
I have already tried using openpyxl like this to change the values of some random empty cells where I can store these values:
col_num = '=match(2014, A1:E1)'
row_num = '=match("COGS", A1:A5)'
But I want to grab those values without having to arbitrarily write to those random empty cells. Plus, even with this method, when I read those cells (F5 and F6) it reads the formulae in those cells and not the face value of 3.
Any help is appreciated, thanks.
Consider a translated VBA solution as the Match function can adequately handle your needs. Python can access the Excel VBA Object Library using a COM interface with the win32com module. Please note this solution assumes you are using Excel for PC. Below includes the counterpart VBA function.
VBA Function (native interface)
If below function is placed in Excel standard module, function can be called in spreadsheet cell =FindCell(..., ###)
' MATCHES ROW AND COL INPUT FOR CELL ADDRESS OUTPUT
Function FindCell(item As String, year As Integer) As String
FindCell = Cells(Application.Match(item, Range("A1:A5"), 0), _
Application.Match(year, Range("A1:E1"), 0)).Address
End Function
debug.Print FindCell("COGS", 2014)
' $C$3
Python Script (foreign interface, requiring all objects to be declared)
Try/Except/Finally is used to properly close the Excel process regardless of script success or fail.
import win32com.client
# MATCHES ROW AND COL INPUT FOR CELL ADDRESS OUTPUT
def FindCell(item, year):
return(xlWks.Cells(xlApp.WorksheetFunction.Match(item, xlWks.Range("A1:A5"), 0),
xlApp.WorksheetFunction.Match(year, xlWks.Range("A1:E1"), 0)).Address)
try:
xlApp = win32com.client.Dispatch("Excel.Application")
xlWbk = xlApp.Workbooks.Open('C:/Path/To/Workbook.xlsx')
xlWks = xlWbk.Worksheets("SHEETNAME")
print(FindCell("COGS", 2014))
# $C$3
except Exception as e:
print(e)
finally:
xlWbk.Close(False)
xlApp.Quit
xlWks = None
xlWbk = None
xlApp = None
There are a surprising number of details you need to get right to manipulate Excel files this way with openpyxl. First, it's worth knowing that the xlsx file contains two representations of each cell - the formula, and the current value of the formula. openpyxl can return either, and if you want values you should specify data_only=True when you open the file. Also, openpyxl is not able to calculate a new value when you change the formula for a cell - only Excel itself can do that. So inserting a MATCH() worksheet function won't solve your problem.
The code below does what you want, mostly in Python. It uses the "A1" reference style, and does some calculations to turn column numbers into column letters. This won't hold up well if you go past column Z. In that case, you may want to switch to numbered references to rows and columns. There's some more info on that here and here. But hopefully this will get you on your way.
Note: This code assumes you are reading a workbook called 'test.xlsx', and that 'COGS' is in a list of items in 'Sheet1!A2:A5' and 2014 is in a list of years in 'Sheet1!B1:E1'.
import openpyxl
def get_xlsx_region(xlsx_file, sheet, region):
""" Return a rectangular region from the specified file.
The data are returned as a list of rows, where each row contains a list
of cell values"""
# 'data_only=True' tells openpyxl to return values instead of formulas
# 'read_only=True' makes openpyxl much faster (fast enough that it
# doesn't hurt to open the file once for each region).
wb = openpyxl.load_workbook(xlsx_file, data_only=True, read_only=True)
reg = wb[sheet][region]
return [[cell.value for cell in row] for row in reg]
# cache the lists of years and items
# get the first (only) row of the 'B1:F1' region
years = get_xlsx_region('test.xlsx', 'Sheet1', 'B1:E1')[0]
# get the first (only) column of the 'A2:A6' region
items = [r[0] for r in get_xlsx_region('test.xlsx', 'Sheet1', 'A2:A5')]
def find_correct_cell(year, item):
# find the indexes for 'COGS' and 2014
year_col = chr(ord('B') + years.index(year)) # only works in A:Z range
item_row = 2 + items.index(item)
cell_reference = year_col + str(item_row)
return cell_reference
print find_correct_cell(year=2014, item='COGS')
# C3

openpyxl: Append data to first empty column cell

Background:
I have an excel workbook containing metadata which spread across various worksheets. I need to take the relevant columns of data from the various worksheets and combine them into a single worksheet. With the following code I have been able to create a new worksheet and add data to it.
# Open workbook and assign worksheet
try:
wb = openpyxl.load_workbook(metadata)
shtEditionLNM = wb.worksheets[0] # Edition date & latest NM
shtChartsTitles = wb.worksheets[1] # Charts & Titles
shtDepthHeight = wb.worksheets[4] # Depth & heights
shtChartProj = wb.worksheets[7] # Chart Projection
except:
raise SystemExit(0)
new = wb.create_sheet()
new.title = "MT_CHARTS INFO"
new.sheet_properties.tabColor = "1072BA"
shtMeta = wb.get_sheet_by_name("MT_CHARTS INFO")
for row in shtChartsTitles.rows:
shtMeta.append([row[0].value, row[1].value, row[2].value, row[4].value])
for row in shtEditionLNM.rows:
shtMeta.append([row[3].value, row[4].value])
wb.save('OW - Quarterly Extract of Metadata for Raster Charts Dec 2015.xlsx')
This works without any errors and I can see the data saved to my new workbook. However when I run a second loop and append values they are appended to cell A3169 whereas I actually want them to populate from E1.
My question boils down to 'is there a way I can append to a new column instead of a new row?'
Thanks in advance!
Not directly: ws.append() works with rows because this is the way the data is stored and thus the easiest to optimise for the read-only and write-only modes.
However, ws.cell(row=x, column=y, value=z) will allow you to do want you want. Version 2.4 (install from a checkout) will also let you work directly with columns by managing the assignment to cells for you: ws['E'] will return a tuple of the cells in the column up to the current ws.max_row; ws.iter_cols(min_col, min_row, max_col, max_row) will return a generator of columns as big as you need it.
Thank you Charlie,
Your answer gave me the direction I needed to get this done. Referring to this question:
how to write to a new cell in python using openpyxl
i've found out there are many ways to skin this cat - the method below is what I went for in the end!
x=0
for row in shtEditionLNM.rows:
x+=1
shtMeta.cell(coordinate="E{}".format(x)).value = row[3].value
shtMeta.cell(coordinate="F{}".format(x)).value = row[4].value
I am new to openpyxl, but I believe we can convert a list to a list of tuple of each element, and then pass that object into the sheet.append() function:
L1=[a,b,c,d.....]
L2=[]
for a in L1:
L2.append(tuple(a))
for a in L2:
sheet.append(L2)
Please feel free to correct me.

xlwings function to find the last row with data

I am trying to find the last row in a column with data. to replace the vba function: LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
I am trying this, but this pulls in all rows in Excel. How can I just get the last row.
from xlwings import Workbook, Range
wb = Workbook()
print len(Range('A:A'))
Consolidating the answers above, you can do it in one line:
wb.sheet.range(column + last cell value).Get End of section going up[non blank assuming the last cell is blank].row
Example code:
import xlwings as xw
from xlwings import Range, constants
wb = xw.Book(r'path.xlsx')
wb.sheets[0].range('A' + str(wb.sheets[0].cells.last_cell.row)).end('up').row
We can use Range object to find the last row and/or the last column:
import xlwings as xw
# open raw data file
filename_read = 'data_raw.csv'
wb = xw.Book(filename_read)
sht = wb.sheets[0]
# find the numbers of columns and rows in the sheet
num_col = sht.range('A1').end('right').column
num_row = sht.range('A1').end('down').row
# collect data
content_list = sht.range((1,1),(num_row,num_col)).value
print(content_list)
This is very much the same as crazymachu's answer, just wrapped up in a function. Since version 0.9.0 of xlwings you can do this:
import xlwings as xw
def lastRow(idx, workbook, col=1):
""" Find the last row in the worksheet that contains data.
idx: Specifies the worksheet to select. Starts counting from zero.
workbook: Specifies the workbook
col: The column in which to look for the last cell containing data.
"""
ws = workbook.sheets[idx]
lwr_r_cell = ws.cells.last_cell # lower right cell
lwr_row = lwr_r_cell.row # row of the lower right cell
lwr_cell = ws.range((lwr_row, col)) # change to your specified column
if lwr_cell.value is None:
lwr_cell = lwr_cell.end('up') # go up untill you hit a non-empty cell
return lwr_cell.row
Intuitively, the function starts off by finding the most extreme lower-right cell in the workbook. It then moves across to your selected column and then up until it hits the first non-empty cell.
You could try using Direction by starting at the very bottom and then moving up:
import xlwings
from xlwings.constants import Direction
wb = xlwings.Workbook(r'data.xlsx')
print(wb.active_sheet.xl_sheet.Cells(65536, 1).End(Direction.xlUp).Row)
Try this:
import xlwings as xw
cellsDown = xw.Range('A1').vertical.value
cellsRight = xw.Range('A1').horizontal.value
print len(cellsDown)
print len(cellsRight)
One could use the VBA Find function that is exposed through api property (use it to find anything with a star, and begin your search from the first cell).
Example:
row_cell = s.api.Cells.Find(What="*",
After=s.api.Cells(1, 1),
LookAt=xlwings.constants.LookAt.xlPart,
LookIn=xlwings.constants.FindLookIn.xlFormulas,
SearchOrder=xlwings.constants.SearchOrder.xlByRows,
SearchDirection=xlwings.constants.SearchDirection.xlPrevious,
MatchCase=False)
column_cell = s.api.Cells.Find(What="*",
After=s.api.Cells(1, 1),
LookAt=xlwings.constants.LookAt.xlPart,
LookIn=xlwings.constants.FindLookIn.xlFormulas,
SearchOrder=xlwings.constants.SearchOrder.xlByColumns,
SearchDirection=xlwings.constants.SearchDirection.xlPrevious,
MatchCase=False)
print((row_cell.Row, column_cell.Column))
Other methods outlined here seems to require no empty rows/columns between data.
source: https://gist.github.com/Elijas/2430813d3ad71aebcc0c83dd1f130e33
python 3.6, xlwings 0.11
Solutoin 1
To find last row with data, you should do some work both horizontally and vertically. You have to go through every column to determine which row is the last row.
import xlwings
workbook_all = xlwings.Book(r'path.xlsx')
objectiveSheet = workbook_all .sheets['some_sheet']
# lastCellContainData(), inspired of Stefan's answer.
def lastCellContainData(objectiveSheet,lastRow=None,lastColumn=None):
lastRow = objectiveSheet.cells.last_cell.row if lastRow==None else lastRow
lastColumn = objectiveSheet.cells.last_cell.column if lastColumn==None else lastColumn
lastRows,lastColumns = [],[]
for col in range(1,lastColumn):
lastRows.append(objectiveSheet.range((lastRow, col)).end('up').row)
# extract last row of every column, then max(). Or you can compare the next
# column's last row number to the last column's last row number. Here you get
# the last row with data, you can also go further get the last column with data:
for row in range(1,lastRow):
lastColumns.append(objectiveSheet.range((row, lastColumn)).end('left').column)
return max(lastRows),max(lastColumns)
lastCellContainData(objectiveSheet,lastRow=5000,lastColumn=300)
I added lastRow and lastColumn. To make the program more effective, you can set these parameters according to the approximate shape of the data you're dealing with.
Solution 2
xlwings is honored for being wrapper of pywin32. I don't know if your situation allows for keyboard or mouse. If so, first you ctrl+tab switch to the workbook, then ctrl+a to select the region containing data, then you call workbook_all.selection.rows.count.
another way:
When you know where right bottom cell of your data locates faintly, say AAA10000, just call objectiveSheet.range('A1:'+'AAA10000').current_region.rows.count
Update:
After a while none of the solutions were really intuitive to me, so I decided to compile the following:
Code:
import xlwings as Objxlwings
import xlwings.constants
def Return_RangeLastCell(ObjWS):
return ObjWS.api.Cells.SpecialCells(xlwings.constants.CellType.xlCellTypeLastCell)
I tried to keep consistency with the way to call it from Excel to keep it simple
Then on my main code, I just call it like so:
ObjWS=Objxlwings.Book('Book1.xlsx').sheets["Sheet1"]
print(Return_RangeLastCell(ObjWS).Column)
Interesting solutions. But maybe like this:
print(sheet.used_range.last_cell.row)
#Cody's answer will help under normal circumstances, but if your sheet have hidden rows at bottom like links: example, it will give the wrong row number.
Lets say, if your row counts of data is 10, and row[5:11] are hidden, i.e. actually last_row will be 10.
[code a] below will give you answer 5, [code b] below will give you answer 10.
code a:
ws = wb.sheets[your_sheet_name]
last_row = ws.range('A' + str(ws.cells.last_cell.row)).end('up').row # return 5
code b:
ws = wb.sheets[your_sheet_name]
last_row_1 = ws.used_range.last_cell.row # return 10

Categories