How to detect merged cells in an Excel sheet? - python

I'm trying to read data from an Excel sheet that contains merged cells.
When reading merged cells with openpyxl, the first cell of the merged range contains the value and the rest of the cells are empty.
For each cell, I would like to know whether it is merged and how many cells are merged with it, but I couldn't find any function that does so.
The sheet also has other empty cells, so I can't rely on emptiness to detect merges.

You can use merged_cells.ranges on the sheet (merged_cell_ranges was deprecated in version 2.5.0-b1 (2017-10-19) in favour of merged_cells.ranges). I can't seem to find a per-row equivalent. For example:
from openpyxl import load_workbook
wb = load_workbook(filename='a file name')
sheet_ranges = wb['Sheet1']
print(sheet_ranges.merged_cells.ranges)

To test whether a single cell is merged, you can check its class name:
cell = sheet_ranges.cell(row=15, column=14)
# the top-left cell of a merged range is a regular Cell, not a MergedCell
if type(cell).__name__ == 'MergedCell':
    print("Oh no, the cell is merged!")
else:
    print("This cell is not merged.")
To "unmerge" all cells, call unmerge_cells on each range:
# iterate over a copy, because unmerge_cells() modifies merged_cells.ranges
for items in list(sheet_ranges.merged_cells.ranges):
    print(items)
    sheet_ranges.unmerge_cells(str(items))

To test if a single cell is merged, I loop through sheet.merged_cells.ranges like @A. Lau suggests.
Unfortunately, checking the cell type like @0x4a6f4672 shows does not work any more.
Here is a function that shows you how to do this.
def testMerge(row, column):
    cell = sheet.cell(row, column)
    for mergedCell in sheet.merged_cells.ranges:
        if cell.coordinate in mergedCell:
            return True
    return False

The question asks about detecting merged cells and reading them, but so far the provided answers only deal with detecting and unmerging. Here is a function which returns the logical value of the cell, the value that the user would see as contained on a merged cell:
import sys
from openpyxl import load_workbook
from openpyxl.cell.cell import MergedCell
def cell_value(sheet, coord):
    cell = sheet[coord]
    if not isinstance(cell, MergedCell):
        return cell.value
    # "Oh no, the cell is merged!"
    for merged_range in sheet.merged_cells.ranges:
        if coord in merged_range:
            return merged_range.start_cell.value
    raise AssertionError('Merged cell is not in any merge range!')
workbook = load_workbook(sys.argv[1])
print(cell_value(workbook.active, sys.argv[2]))
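The original question also asks how many cells are merged together, which none of the snippets above report. A minimal sketch, assuming openpyxl 2.5 or later: the helper name merge_info and the sample workbook are made up for illustration, and the count is derived from the range bounds (min_row, max_row, min_col, max_col).

```python
from openpyxl import Workbook

def merge_info(sheet, coord):
    """Return (is_merged, cell_count, visible_value) for the cell at coord."""
    for merged in sheet.merged_cells.ranges:
        if coord in merged:
            rows = merged.max_row - merged.min_row + 1
            cols = merged.max_col - merged.min_col + 1
            # the value of a merged range lives in its top-left cell
            top_left = sheet.cell(row=merged.min_row, column=merged.min_col)
            return True, rows * cols, top_left.value
    return False, 1, sheet[coord].value

# Hypothetical in-memory workbook, used only to exercise the helper
wb = Workbook()
ws = wb.active
ws["A1"] = "header"
ws.merge_cells("A1:C2")   # 2 rows x 3 columns
ws["E5"] = 42

print(merge_info(ws, "B2"))
print(merge_info(ws, "E5"))
```

The loop is linear in the number of merged ranges, so for sheets with many merges it may be worth building a coordinate lookup once instead of scanning per cell.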

These all helped (thanks), but when I used the approaches with a couple of spreadsheets, not all of the cells I expected were unmerged. I had to loop and retest for merges to finally get them all. In my case, it took 4 passes to get everything to unmerge as expected:
mergedRanges = sheet_ranges.merged_cells.ranges
### How many times do we run unmerge?
i=0
### keep testing and removing ranges until they are all actually gone
while mergedRanges:
    for entry in mergedRanges:
        i+=1
        print("  unMerging: " + str(i) + ": " +str(entry))
        sheet_ranges.unmerge_cells(str(entry))
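A likely reason several passes were needed is that unmerge_cells() removes entries from merged_cells.ranges while the loop is still iterating over it, so some entries get skipped (depending on the openpyxl version, this could also raise an error). A sketch of a single-pass alternative that takes a snapshot first; the helper name unmerge_all and the sample ranges are assumptions for illustration.

```python
from openpyxl import Workbook

def unmerge_all(sheet):
    """Unmerge every merged range in one pass; return how many were removed."""
    # Snapshot the coordinates first, so the loop never iterates over the
    # collection that unmerge_cells() is modifying.
    snapshot = [merged.coord for merged in sheet.merged_cells.ranges]
    for ref in snapshot:
        sheet.unmerge_cells(ref)
    return len(snapshot)

# Hypothetical in-memory workbook with a few merged ranges
wb = Workbook()
ws = wb.active
for ref in ("A1:B2", "D1:D4", "A6:C6", "F2:G3"):
    ws.merge_cells(ref)

removed = unmerge_all(ws)
print(removed, len(ws.merged_cells.ranges))
```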

Related

Highlighting an excel cell using conditional formatting in python

I'm trying to create a program that reads an Excel file containing some data, asks the user for input, and highlights a cell if the input equals the data in that cell. The code currently works up to the point where the user's input is compared with the data.
However, I cannot find the correct way to highlight a cell when there is a match. What can I do about this? I tried conditional formatting with xlsxwriter, but realized while going through the documentation that it can only create new files, not modify existing ones. I also tried openpyxl, but to no avail. Sorry, I'm a bit new at this and would really appreciate some help fixing this code.
import pandas as pd
import xlsxwriter
from openpyxl import Workbook
from openpyxl.styles import Color, PatternFill, Font, Border
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.formatting.rule import ColorScaleRule, CellIsRule, FormulaRule
def checkIfValuesExists1(dfObj, listOfValues):
    ''' Check if given elements exists in dictionary or not'''
    resultDict = {}
    # Iterate over the list of elements one by one
    for elem in listOfValues:
        # Check if the element exists in dataframe values
        if elem in dfObj.values:
            resultDict[elem] = True
        else:
            resultDict[elem] = False
    # Returns a dictionary of values & their existence flag
    return resultDict
def main():
    import pandas as pd
    df=pd.read_excel("filename.xlsx")
    df
    # Create a DataFrame object
    empDfObj = pd.DataFrame(df, columns=['Name', 'LastName', 'MiddleName'])
    print('Contents of the dataframe :')
    print(empDfObj)
    print('**** Check if an element exists in DataFrame using in & not in operators ****')
    x= str(input("Enter your text: "))
    print('** Use in operator to check if an element exists in dataframe **')
    if x in empDfObj.values:
        df.style.apply(lambda y:['background:red' if x == df.Category else 'background:green' for x in df],axis=0)
        print('Element exists in Dataframe')
    else:
        print('Element does not exist in Dataframe')
    # Check if 'Hello' doesn't exist in DataFrame
    # if 'Hello' not in empDfObj.values:
    #print('Element does not exist in Dataframe')
    # Get the xlsxwriter workbook and worksheet objects.
    df.to_excel('output1.xlsx', engine='xlsxwriter')
if __name__ == '__main__':
    main()
To get the value of a cell, use .cell with the row and column you want, as answered here: Reading particular cell value from excelsheet in python
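Reading the value is only half of what the question asks; the matching cell still has to be highlighted. One way to do that is with openpyxl's PatternFill, sketched below under some assumptions: the helper name highlight_matches, the yellow color and the sample rows are invented for illustration, and an existing file would be opened with load_workbook rather than built in memory.

```python
from openpyxl import Workbook
from openpyxl.styles import PatternFill

def highlight_matches(sheet, text, color="FFFF00"):
    """Fill every cell whose value equals text; return the matched coordinates."""
    fill = PatternFill(fill_type="solid", start_color=color, end_color=color)
    matched = []
    for row in sheet.iter_rows():
        for cell in row:
            if cell.value is not None and str(cell.value) == text:
                cell.fill = fill
                matched.append(cell.coordinate)
    return matched

# Hypothetical data standing in for the asker's spreadsheet
wb = Workbook()
ws = wb.active
ws.append(["Name", "LastName", "MiddleName"])
ws.append(["Ada", "Lovelace", "King"])
ws.append(["Alan", "Turing", "Mathison"])

matched = highlight_matches(ws, "Turing")
print(matched)
# wb.save("output1.xlsx") would write the highlighted copy to disk
```

This sets a static fill on the cell rather than an Excel conditional-formatting rule, which is usually enough when the match is decided in Python.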

Data append to list using XLRD

I am able to import the rows of a particular column from a given sheet into a Python list. But the list looks like a key:value formatted list, which is not what I need.
Here is my code:
import xlrd
excelList = []
def xcel(path):
    book = xlrd.open_workbook(path)
    impacted_files = book.sheet_by_index(2)
    for row_index in range(2, impacted_files.nrows):
        #if impacted_files.row_values(row_index) == 'LCR':
        excelList.append(impacted_files.cell(row_index, 1))
    print(excelList)
if __name__ == "__main__":
    xcel(path)
The output is like below:
[text:'LCR_ContractualOutflowsMaster.aspx', text:'LCR_CountryMaster.aspx', text:'LCR_CountryMasterChecker.aspx', text:'LCR_EntityMaster.aspx', text:'LCR_EntityMasterChecker.aspx', text:'LCR_EscalationMatrixMaster.aspx',....]
I want the list to have just values. Like this...
['LCR_ContractualOutflowsMaster.aspx', 'LCR_CountryMaster.aspx', 'LCR_CountryMasterChecker.aspx', 'LCR_EntityMaster.aspx', 'LCR_EntityMasterChecker.aspx', 'LCR_EscalationMatrixMaster.aspx',...]
I've tried pandas too (the df.value.tolist() method), yet the output is not what I expect.
Please suggest a way.
Regards
You are accumulating a list of cells, and what you are seeing is the repr of each cell in your list. Cell objects have three attributes: ctype (an int that identifies the type of the cell's value), value (a Python object holding the cell's value) and xf_index. If you want only the values, then try
excelList.append(impacted_files.cell(row_index, 1).value)
You can read more about cells in the documentation.
If you are willing to try one more library, this is how it can be done with openpyxl:
from openpyxl import load_workbook
book = load_workbook(path)
sh = book.worksheets[0]
# the outer loop (rows) must come first in a nested comprehension
print([cell.value for row in sh.iter_rows() for cell in row])

Finding Excel cell reference using Python

Here is the Excel file in question:
Context: I am writing a program which can pull values from a PDF and put them in the appropriate cell in an Excel file.
Question: I want to write a function which takes a column value (e.g. 2014) and a row value (e.g. 'COGS') as arguments and return the cell reference where those two intersect (e.g. 'C3' for 2014 COGS).
def find_correct_cell(year=2014, item='COGS'):
    #do something similar to what the =match function in Excel does
    return cell_reference #returns 'C3'
I have already tried using openpyxl like this to change the values of some random empty cells where I can store these values:
col_num = '=match(2014, A1:E1)'
row_num = '=match("COGS", A1:A5)'
But I want to grab those values without having to arbitrarily write to those random empty cells. Plus, even with this method, when I read those cells (F5 and F6) it reads the formulae in those cells and not the face value of 3.
Any help is appreciated, thanks.
Consider a translated VBA solution, as the Match function can adequately handle your needs. Python can access the Excel VBA object library through a COM interface with the win32com module. Please note this solution assumes you are using Excel on Windows. The counterpart VBA function is included below.
VBA Function (native interface)
If the function below is placed in a standard Excel module, it can be called from a spreadsheet cell as =FindCell(..., ###)
' MATCHES ROW AND COL INPUT FOR CELL ADDRESS OUTPUT
Function FindCell(item As String, year As Integer) As String
    FindCell = Cells(Application.Match(item, Range("A1:A5"), 0), _
                     Application.Match(year, Range("A1:E1"), 0)).Address
End Function
Debug.Print FindCell("COGS", 2014)
' $C$3
Python Script (foreign interface, requiring all objects to be declared)
Try/Except/Finally is used to close the Excel process properly whether the script succeeds or fails.
import win32com.client
# MATCHES ROW AND COL INPUT FOR CELL ADDRESS OUTPUT
def FindCell(item, year):
    return xlWks.Cells(xlApp.WorksheetFunction.Match(item, xlWks.Range("A1:A5"), 0),
                       xlApp.WorksheetFunction.Match(year, xlWks.Range("A1:E1"), 0)).Address
try:
    xlApp = win32com.client.Dispatch("Excel.Application")
    xlWbk = xlApp.Workbooks.Open('C:/Path/To/Workbook.xlsx')
    xlWks = xlWbk.Worksheets("SHEETNAME")
    print(FindCell("COGS", 2014))
    # $C$3
except Exception as e:
    print(e)
finally:
    xlWbk.Close(False)
    xlApp.Quit()   # call the method explicitly so Excel actually exits
    xlWks = None
    xlWbk = None
    xlApp = None
There are a surprising number of details you need to get right to manipulate Excel files this way with openpyxl. First, it's worth knowing that the xlsx file contains two representations of each cell - the formula, and the current value of the formula. openpyxl can return either, and if you want values you should specify data_only=True when you open the file. Also, openpyxl is not able to calculate a new value when you change the formula for a cell - only Excel itself can do that. So inserting a MATCH() worksheet function won't solve your problem.
The code below does what you want, mostly in Python. It uses the "A1" reference style, and does some calculations to turn column numbers into column letters. This won't hold up well if you go past column Z. In that case, you may want to switch to numbered references to rows and columns. There's some more info on that here and here. But hopefully this will get you on your way.
Note: This code assumes you are reading a workbook called 'test.xlsx', and that 'COGS' is in a list of items in 'Sheet1!A2:A5' and 2014 is in a list of years in 'Sheet1!B1:E1'.
import openpyxl
def get_xlsx_region(xlsx_file, sheet, region):
    """ Return a rectangular region from the specified file.
    The data are returned as a list of rows, where each row contains a list
    of cell values"""
    # 'data_only=True' tells openpyxl to return values instead of formulas
    # 'read_only=True' makes openpyxl much faster (fast enough that it
    # doesn't hurt to open the file once for each region).
    wb = openpyxl.load_workbook(xlsx_file, data_only=True, read_only=True)
    reg = wb[sheet][region]
    return [[cell.value for cell in row] for row in reg]
# cache the lists of years and items
# get the first (only) row of the 'B1:E1' region
years = get_xlsx_region('test.xlsx', 'Sheet1', 'B1:E1')[0]
# get the first (only) column of the 'A2:A5' region
items = [r[0] for r in get_xlsx_region('test.xlsx', 'Sheet1', 'A2:A5')]
def find_correct_cell(year, item):
    # find the indexes for 'COGS' and 2014
    year_col = chr(ord('B') + years.index(year))   # only works in A:Z range
    item_row = 2 + items.index(item)
    cell_reference = year_col + str(item_row)
    return cell_reference
print(find_correct_cell(year=2014, item='COGS'))
# C3
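The chr/ord arithmetic above stops working past column Z. A small sketch of a lookup that handles multi-letter columns: column_letter is a hand-written helper (openpyxl.utils.get_column_letter performs the same conversion), and the years and items lists are placeholder data standing in for 'Sheet1!B1:E1' and 'Sheet1!A2:A5'.

```python
def column_letter(index):
    """Convert a 1-based column index to Excel letters (1 -> A, 27 -> AA)."""
    letters = ""
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters

def find_correct_cell(years, items, year, item, first_col=2, first_row=2):
    """Return the A1-style reference where the year column meets the item row."""
    col = first_col + years.index(year)   # years start in column B
    row = first_row + items.index(item)   # items start in row 2
    return column_letter(col) + str(row)

# Placeholder data; in practice these come from the workbook
years = [2013, 2014, 2015, 2016]
items = ["Revenue", "COGS", "Gross Profit", "SG&A"]

print(find_correct_cell(years, items, 2014, "COGS"))
```

list.index raises ValueError when the year or item is missing, which may be preferable to silently returning a wrong cell.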

openpyxl: Append data to first empty column cell

Background:
I have an Excel workbook containing metadata which is spread across various worksheets. I need to take the relevant columns of data from the various worksheets and combine them into a single worksheet. With the following code I have been able to create a new worksheet and add data to it.
# Open workbook and assign worksheet
try:
    wb = openpyxl.load_workbook(metadata)
    shtEditionLNM = wb.worksheets[0]    # Edition date & latest NM
    shtChartsTitles = wb.worksheets[1]  # Charts & Titles
    shtDepthHeight = wb.worksheets[4]   # Depth & heights
    shtChartProj = wb.worksheets[7]     # Chart Projection
except:
    raise SystemExit(0)
new = wb.create_sheet()
new.title = "MT_CHARTS INFO"
new.sheet_properties.tabColor = "1072BA"
shtMeta = wb.get_sheet_by_name("MT_CHARTS INFO")
for row in shtChartsTitles.rows:
    shtMeta.append([row[0].value, row[1].value, row[2].value, row[4].value])
for row in shtEditionLNM.rows:
    shtMeta.append([row[3].value, row[4].value])
wb.save('OW - Quarterly Extract of Metadata for Raster Charts Dec 2015.xlsx')
This works without any errors and I can see the data saved to my new workbook. However when I run a second loop and append values they are appended to cell A3169 whereas I actually want them to populate from E1.
My question boils down to 'is there a way I can append to a new column instead of a new row?'
Thanks in advance!
Not directly: ws.append() works with rows because this is the way the data is stored and thus the easiest to optimise for the read-only and write-only modes.
However, ws.cell(row=x, column=y, value=z) will allow you to do what you want. Version 2.4 (install from a checkout) will also let you work directly with columns by managing the assignment to cells for you: ws['E'] will return a tuple of the cells in the column up to the current ws.max_row; ws.iter_cols(min_col, min_row, max_col, max_row) will return a generator of columns as big as you need it.
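As a rough illustration of the ws.cell(row=..., column=..., value=...) approach, the sketch below writes a second block of data beside existing rows instead of below them. The helper name write_columns and the sample values are invented for the example.

```python
from openpyxl import Workbook

def write_columns(sheet, rows, first_column, first_row=1):
    """Write each sequence in rows side by side, starting at first_column."""
    for r_offset, values in enumerate(rows):
        for c_offset, value in enumerate(values):
            sheet.cell(row=first_row + r_offset,
                       column=first_column + c_offset,
                       value=value)

# Hypothetical data: columns A-D are filled with append(), E-F with cell()
wb = Workbook()
ws = wb.active
ws.append(["chart 1", "title 1", "scale 1", "panel 1"])
ws.append(["chart 2", "title 2", "scale 2", "panel 2"])
write_columns(ws, [("2015-12-01", "NM 48"), ("2015-11-15", "NM 46")],
              first_column=5)

print(ws["E1"].value, ws["F2"].value, ws.max_row)
```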
Thank you Charlie,
Your answer gave me the direction I needed to get this done. Referring to this question:
how to write to a new cell in python using openpyxl
I've found out there are many ways to skin this cat. The method below is what I went for in the end!
# row/column arguments are used here; newer openpyxl versions no longer
# accept a coordinate keyword in cell()
for x, row in enumerate(shtEditionLNM.rows, start=1):
    shtMeta.cell(row=x, column=5, value=row[3].value)  # column E
    shtMeta.cell(row=x, column=6, value=row[4].value)  # column F
I am new to openpyxl, but I believe we can wrap each element of a list in a one-element tuple and then pass each tuple to sheet.append(), which writes it as a new row:
L1=[a,b,c,d.....]
L2=[]
for a in L1:
    L2.append((a,))
for row in L2:
    sheet.append(row)
Please feel free to correct me.

xlwings function to find the last row with data

I am trying to find the last row in a column with data, to replace the VBA expression: LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
I am trying this, but it pulls in all rows in Excel. How can I get just the last row?
from xlwings import Workbook, Range
wb = Workbook()
print len(Range('A:A'))
Consolidating the answers above, you can do it in one line. The pattern is:
wb.sheets[...].range(<column letter> + <row number of the sheet's last cell>).end('up').row
That is, start at the very bottom of the column (assumed to be blank) and jump up to the first non-blank cell.
Example code:
import xlwings as xw
wb = xw.Book(r'path.xlsx')
last_row = wb.sheets[0].range('A' + str(wb.sheets[0].cells.last_cell.row)).end('up').row
print(last_row)
We can use Range object to find the last row and/or the last column:
import xlwings as xw
# open raw data file
filename_read = 'data_raw.csv'
wb = xw.Book(filename_read)
sht = wb.sheets[0]
# find the numbers of columns and rows in the sheet
num_col = sht.range('A1').end('right').column
num_row = sht.range('A1').end('down').row
# collect data
content_list = sht.range((1,1),(num_row,num_col)).value
print(content_list)
This is very much the same as crazymachu's answer, just wrapped up in a function. Since version 0.9.0 of xlwings you can do this:
import xlwings as xw
def lastRow(idx, workbook, col=1):
    """ Find the last row in the worksheet that contains data.
    idx: Specifies the worksheet to select. Starts counting from zero.
    workbook: Specifies the workbook
    col: The column in which to look for the last cell containing data.
    """
    ws = workbook.sheets[idx]
    lwr_r_cell = ws.cells.last_cell      # lower right cell
    lwr_row = lwr_r_cell.row             # row of the lower right cell
    lwr_cell = ws.range((lwr_row, col))  # change to your specified column
    if lwr_cell.value is None:
        lwr_cell = lwr_cell.end('up')    # go up until you hit a non-empty cell
    return lwr_cell.row
Intuitively, the function starts off by finding the most extreme lower-right cell in the workbook. It then moves across to your selected column and then up until it hits the first non-empty cell.
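To see why starting at the bottom and moving up is more robust than starting at A1 and moving down, here is a pure-Python model of the two strategies. It does not call xlwings; the two functions only imitate the idea behind end('down') and end('up') on a plain list (a simplification of Excel's real behaviour), and column_a is made-up data with gaps.

```python
def last_row_going_down(column):
    """Model of walking down from the first cell: stop before the first gap."""
    row = 0
    for value in column:
        if value is None:
            break
        row += 1
    return row

def last_row_going_up(column):
    """Model of starting at the bottom and moving up to the first non-empty cell."""
    for row in range(len(column), 0, -1):
        if column[row - 1] is not None:
            return row
    return 0

# Made-up column: data in rows 1-3 and 5-6, gaps in row 4 and at the end
column_a = ["id", 1, 2, None, 4, 5, None, None]

print(last_row_going_down(column_a))
print(last_row_going_up(column_a))
```

With a gap in row 4, the top-down walk reports row 3, while the bottom-up walk finds row 6, the true last row with data.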
You could try using Direction by starting at the very bottom and then moving up:
import xlwings
from xlwings.constants import Direction
wb = xlwings.Workbook(r'data.xlsx')
print(wb.active_sheet.xl_sheet.Cells(65536, 1).End(Direction.xlUp).Row)
Try this:
import xlwings as xw
cellsDown = xw.Range('A1').vertical.value
cellsRight = xw.Range('A1').horizontal.value
print len(cellsDown)
print len(cellsRight)
One could use the VBA Find function that is exposed through the api property: search for "*" (which matches anything), starting from the first cell and searching backwards.
Example:
# s is an xlwings Sheet
row_cell = s.api.Cells.Find(What="*",
                            After=s.api.Cells(1, 1),
                            LookAt=xlwings.constants.LookAt.xlPart,
                            LookIn=xlwings.constants.FindLookIn.xlFormulas,
                            SearchOrder=xlwings.constants.SearchOrder.xlByRows,
                            SearchDirection=xlwings.constants.SearchDirection.xlPrevious,
                            MatchCase=False)
column_cell = s.api.Cells.Find(What="*",
                               After=s.api.Cells(1, 1),
                               LookAt=xlwings.constants.LookAt.xlPart,
                               LookIn=xlwings.constants.FindLookIn.xlFormulas,
                               SearchOrder=xlwings.constants.SearchOrder.xlByColumns,
                               SearchDirection=xlwings.constants.SearchDirection.xlPrevious,
                               MatchCase=False)
print((row_cell.Row, column_cell.Column))
The other methods outlined here seem to require that there are no empty rows or columns within the data.
source: https://gist.github.com/Elijas/2430813d3ad71aebcc0c83dd1f130e33
python 3.6, xlwings 0.11
Solution 1
To find the last row with data, you have to work both horizontally and vertically: go through every column to determine which one has the lowest used row.
import xlwings
workbook_all = xlwings.Book(r'path.xlsx')
objectiveSheet = workbook_all.sheets['some_sheet']
# lastCellContainData(), inspired by Stefan's answer.
def lastCellContainData(objectiveSheet, lastRow=None, lastColumn=None):
    lastRow = objectiveSheet.cells.last_cell.row if lastRow is None else lastRow
    lastColumn = objectiveSheet.cells.last_cell.column if lastColumn is None else lastColumn
    lastRows, lastColumns = [], []
    for col in range(1, lastColumn):
        lastRows.append(objectiveSheet.range((lastRow, col)).end('up').row)
    # extract last row of every column, then max(). Or you can compare the next
    # column's last row number to the last column's last row number. Here you get
    # the last row with data, you can also go further get the last column with data:
    for row in range(1, lastRow):
        lastColumns.append(objectiveSheet.range((row, lastColumn)).end('left').column)
    return max(lastRows), max(lastColumns)
lastCellContainData(objectiveSheet, lastRow=5000, lastColumn=300)
I added the lastRow and lastColumn parameters. To make the program more efficient, you can set them according to the approximate shape of the data you're dealing with.
Solution 2
xlwings is known as a wrapper around pywin32. I don't know if your situation allows for keyboard or mouse input. If so, first press ctrl+tab to switch to the workbook, then ctrl+a to select the region containing data, then call workbook_all.selection.rows.count.
Another way:
When you know roughly where the bottom-right cell of your data is, say AAA10000, just call objectiveSheet.range('A1:'+'AAA10000').current_region.rows.count
Update:
After a while none of the solutions were really intuitive to me, so I decided to compile the following:
Code:
import xlwings as Objxlwings
import xlwings.constants
def Return_RangeLastCell(ObjWS):
    return ObjWS.api.Cells.SpecialCells(xlwings.constants.CellType.xlCellTypeLastCell)
I tried to keep consistency with the way to call it from Excel to keep it simple
Then on my main code, I just call it like so:
ObjWS=Objxlwings.Book('Book1.xlsx').sheets["Sheet1"]
print(Return_RangeLastCell(ObjWS).Column)
Interesting solutions. But maybe like this:
print(sheet.used_range.last_cell.row)
@Cody's answer will help under normal circumstances, but if your sheet has hidden rows at the bottom, it will give the wrong row number.
Say your data has 10 rows and rows 6 to 10 are hidden, so the actual last row is 10.
[code a] below will give you the answer 5; [code b] below will give you the answer 10.
code a:
ws = wb.sheets[your_sheet_name]
last_row = ws.range('A' + str(ws.cells.last_cell.row)).end('up').row # return 5
code b:
ws = wb.sheets[your_sheet_name]
last_row_1 = ws.used_range.last_cell.row # return 10
