I am trying to define the formatting that should be applied to each column of an Excel spreadsheet based on the column name.
For example, if the column name is 'count' then number_format needs to be used; if the column name is 'sale_date' then date_format needs to be used.
number_format = workbook.add_format({'num_format': '0', 'font_size': 12})
date_format = workbook.add_format({'num_format': 'dd/mm/yyyy hh:mm:ss', 'font_size': 12})
Using the above two formats in the respective columns as shown below:
worksheet1.write('A1', 'count', number_format)
worksheet1.write('B1', 'sale_date', date_format)
Could I make this dynamic based on the column name instead of defining the format per column label? Thanks
Update:
This is the loop that writes the header columns to the Excel spreadsheet:
for data in title:
    worksheet.write(row, col, data, number_format)
    col += 1
Comment: date_format = workbook.add_format({'num_format': 'dd/mm/yy'}) shows the date column as a raw serial number rather than a proper date.
The sample value shown is 42668 instead of "24-10-16".
This is the default behaviour of Excel for Windows: it stores dates as the number of days since its epoch, so a bare number only displays as a date once a date format is applied to an actual date value.
Documentation: XlsxWriter Working with Dates and Time
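For instance, a minimal sketch (file name made up) showing that a real datetime written with a date format displays as a date rather than the serial number:
import datetime
import xlsxwriter

workbook = xlsxwriter.Workbook('dates.xlsx')
worksheet = workbook.add_worksheet()
date_format = workbook.add_format({'num_format': 'dd/mm/yy'})

# A datetime object plus a date format displays "24/10/16";
# writing the bare serial number 42668 with no format shows 42668.
worksheet.write_datetime(0, 0, datetime.datetime(2016, 10, 24), date_format)

workbook.close()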
Comment: ...that I could use the appropriate format based on column name (namely count, sale_date)
You can use worksheet.set_column() to set a format for a whole column.
Documentation: XlsxWriter worksheet.set_column()
Precondition: the order of the column names/formats must be in sync with your table,
e.g. count == 'A', sale_date == 'B', and so on...
from collections import OrderedDict

_styles = OrderedDict([('count', number_format), ('sale_date', date_format), ('total', number_format), ('text', string_format)])

for col, key in enumerate(_styles):
    # chr(col + 65) maps 0 -> 'A', 1 -> 'B', ... (single letters A-Z only)
    A1_notation = '{c}:{c}'.format(c=chr(col + 65))
    worksheet.set_column(A1_notation, None, _styles[key])
    print("worksheet.set_column('{}', None, {})".format(A1_notation, _styles[key]))
Output:
worksheet.set_column('A:A', None, number_format)
worksheet.set_column('B:B', None, date_format)
worksheet.set_column('C:C', None, number_format)
worksheet.set_column('D:D', None, string_format)
For subsequent writes you don't need to pass a format; for example,
worksheet.write('A1', 123)
will default to the number_format set for column A:A.
Question: Could I make this dynamic based on the column name
You are not actually using a "column name" there; that is called A1 cell notation.
Set up a mapping dict, for example:
style_map = {'A': number_format, 'B': date_format}
Usage (note: this only works with single-letter columns, A to Z):
def write(A1_notation, value):
    worksheet1.write(A1_notation, value, style_map[A1_notation[0]])
For row-column notation (0, 0), use integer keys:
style_map = {0: number_format, 1: date_format}
Usage:
def write(row, col, value):
    worksheet1.write(row, col, value, style_map[col])
To keep A1 notation while looking up by column index, convert with xl_cell_to_rowcol:
from xlsxwriter.utility import xl_cell_to_rowcol

def write(A1_notation, value):
    worksheet1.write(A1_notation, value, style_map[xl_cell_to_rowcol(A1_notation)[1]])
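Putting the pieces together, a minimal self-contained sketch (file name and sample data made up; header names and formats taken from the question) that picks each column's format from its header name:
import xlsxwriter

workbook = xlsxwriter.Workbook('dynamic_formats.xlsx')
worksheet = workbook.add_worksheet()

number_format = workbook.add_format({'num_format': '0', 'font_size': 12})
date_format = workbook.add_format({'num_format': 'dd/mm/yyyy hh:mm:ss', 'font_size': 12})

# Map header names to formats; headers not listed fall back to None (no format).
format_map = {'count': number_format, 'sale_date': date_format}

title = ['count', 'sale_date']
for col, name in enumerate(title):
    worksheet.write(0, col, name)
    # Apply the matching format to the whole column, keyed by header name.
    worksheet.set_column(col, col, None, format_map.get(name))

workbook.close()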
Related
Let me describe my issue below:
I've got two Excel worksheets, one containing past data, the other current data. They both have the following structure:
Col_1   Col_2   KEY     Col_3   Etc.
abc     xyz     key_1   foo     ---
def     zyx     key_2   bar     ---
Now, the goal is to check whether the value for a given key changed between the past and current iterations and, if so, colour that cell's background (in the current-data worksheet). This check has to be done for all the columns.
As the KEY column is not the very first one, I've decided to use the XLOOKUP function and apply the formatting within a for loop. The full loop looks like this (in this example the KEY column is column C):
dark_blue = writer.book.add_format({'bg_color': '#3A67B8'})
old_sheet = "'" + "old_" + "sheet_name" + "'"

for col in range(last_col):
    col_name = xl_col_to_name(col)
    if col_name in unformatted_cols:  # skip formatting for certain columns
        continue
    else:
        apply_range = '{0}1:{0}1048576'.format(col_name)
        formula = "XLOOKUP(C1, {1}!C1:C1048576, {1}!{0}1:{0}1048576) <> XLOOKUP(C1, C1:C1048576, {0}1:{0}1048576)".format(col_name, old_sheet)
        active_sheet.conditional_format(apply_range, {'type': 'formula',
                                                      'criteria': formula,
                                                      'format': dark_blue})
Now, my problem is that when I open the output this conditional formatting doesn't work. If, however, I go to Conditional Formatting -> Manage Rules -> Edit Rule and, without any editing, press OK and then Apply, it starts working correctly.
Does anyone know how to make this rule work properly without this manual intervention?
All my other conditional formatting rules, though simpler, work exactly as intended.
# This is the formula that I see in Python for the first loop iteration
=XLOOKUP(C1, 'old_sheet_name'!C1:C1048576, 'old_sheet_name'!A1:A1048576) <> XLOOKUP(C1, C1:C1048576, A1:A1048576)
# This formula I see in Excel for the same first column
=XLOOKUP(C1, 'old_sheet_name'!C:C, 'old_sheet_name'!A:A) <> XLOOKUP(C1, C:C, A:A)
The reason that XLOOKUP doesn't work in your formula is that it is classified by Excel as a "Future Function", i.e. a function added after the original file format. In order to use it you need to prefix it with _xlfn.
This is explained in the XlsxWriter docs on Formulas added in Excel 2010 and later.
Here is a working example:
import xlsxwriter
workbook = xlsxwriter.Workbook('conditional_format.xlsx')
worksheet1 = workbook.add_worksheet('old_sheet_name')
worksheet2 = workbook.add_worksheet('new_sheet_name')
worksheet1.write(0, 0, 'Foo')
format1 = workbook.add_format({'bg_color': '#C6EFCE',
                               'font_color': '#006100'})

xlookup_formula = '=_xlfn.XLOOKUP(C1, old_sheet_name!C:C, old_sheet_name!A:A) <> _xlfn.XLOOKUP(C1, C:C, A:A)'

worksheet2.conditional_format('D1:D10',
                              {'type': 'formula',
                               'criteria': xlookup_formula,
                               'format': format1})
workbook.close()
Suppose I have a saved Excel document (test.xlsx) with 26007.930562 in every data cell, where the column headers (general, currency, accounting, number) name the Excel number format applied to the cells beneath them,
and I run the following Python code:
from openpyxl.utils import get_column_letter, column_index_from_string
import win32com.client # https://www.youtube.com/watch?v=rh039flfMto
ExcelApp = win32com.client.GetActiveObject('Excel.Application')
wb = ExcelApp.Workbooks('test.xlsx')
ws = wb.Worksheets(1)
excelRange = ws.Range(ws.Cells(1, 1), ws.Cells(2, 4))
listVals = [[*row] for row in excelRange.Value]
print(listVals)
I get the following output:
[['general', 'currency', 'accounting', 'number'], [26007.930562, Decimal('26007.9306'), Decimal('26007.9306'), 26007.930562]]
Notice the loss of precision for the "currency" and "accounting" formats: the values come back as Decimal objects rounded to four decimal places, dropping the remaining digits.
Is it possible to read in currency and accounting formatted cells while still keeping full precision? If so, how?
(By "currency formatting" I mean Excel's built-in Currency cell format.)
EDIT:
BigBen's solution works in this example. But if you have dates, Value2 doesn't treat them as dates, which causes errors in Python when you intend to handle them as dates. I ended up having to write this instead:
import pywintypes  # needed for the date-type check

listVals = []  # https://stackoverflow.com/a/71375004
for rowvalue, rowvalue2 in zip([[*row] for row in excelRange.Value], [[*row] for row in excelRange.Value2]):
    rowlist = []
    for value, value2 in zip(rowvalue, rowvalue2):
        if isinstance(value, pywintypes.TimeType):
            rowlist.append(value)  # keep the .Value date object
        else:
            rowlist.append(value2)  # keep the full-precision .Value2 number
    listVals.append(rowlist)
I'm sure there's a faster / more efficient way to do it than that but I don't know what it is.
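A somewhat more compact version of the same idea (an untested sketch, same assumptions: excelRange is a win32com Range and pywintypes is available) would be:
import pywintypes

# Prefer .Value (a date object) for date cells and .Value2 (full precision) otherwise.
listVals = [
    [v if isinstance(v, pywintypes.TimeType) else v2 for v, v2 in zip(row, row2)]
    for row, row2 in zip(excelRange.Value, excelRange.Value2)
]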
Use .Value2 instead of .Value:
listVals = [[*row] for row in excelRange.Value2]
Result:
[['general', 'currency', 'accounting', 'number'], [26007.93056, 26007.93056, 26007.93056, 26007.93056]]
First time posting here, I apologize if this question has been asked before - I can't find anything that applies.
Is there a way to read the underlying data from an Excel PivotTable into a pandas DataFrame? For several years I've had an Excel Auto_Open macro that downloads several Excel files and double-clicks on the "Grand Total" row in order to extract all of the data, which ultimately gets imported into a database. This is done because the owners of the source data refuse to grant access to the database itself.
This macro has never been the ideal scenario and we need to move it to a better method soon. I have extensive SQL knowledge but have only recently begun to learn Python.
I have been able to read worksheets using OpenPyXl, but these files do not contain the source data on a separate worksheet by default; the PivotCache must be extracted to a new sheet first. What I would like to do, if possible, is read the Excel PivotCache into a pandas DataFrame and either save that output as a CSV or load it directly into our database. It seems this is not possible with OpenPyXl and that I'll probably need to use win32com.client.
Does anybody have any experience with this, and know if it's even possible? Any pointers for where I might get started? I've tried several items from the Excel Object model (PivotCache, GetData, etc etc) but either I don't know how to use them or they don't return what I need.
Any help would be much appreciated. Thanks!
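(For reference, the double-click my macro performs corresponds to Range.ShowDetail in the Excel object model, so the win32com route would presumably look something like the rough sketch below. The file path and sheet positions are hypothetical, and it still materialises a temporary detail sheet.)
import pandas as pd
import win32com.client

excel = win32com.client.Dispatch('Excel.Application')
wb = excel.Workbooks.Open(r'C:\path\to\report.xlsx')  # hypothetical path

pt = wb.Worksheets(1).PivotTables(1)
body = pt.DataBodyRange
# Setting ShowDetail on the bottom-right (Grand Total) cell mimics the
# double-click: Excel dumps the underlying records onto a new worksheet.
body.Cells(body.Rows.Count, body.Columns.Count).ShowDetail = True

detail = wb.ActiveSheet.UsedRange.Value  # tuple of tuples, header row first
df = pd.DataFrame(list(detail[1:]), columns=detail[0])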
This answer is very late, but I came up with it while struggling with the same issue, and some of the comments above helped me nail it.
In essence, the steps one can take to solve this with openpyxl are:
1. Use openpyxl to get the openpyxl.pivot.table.TableDefinition object from the desired pivot table (let's call it my_pivot_table)
2. Get cached fields and their values from my_pivot_table.cache.cacheFields
3. Get row data as dicts, in two sub-steps:
3.1) Get all cached rows and their values from my_pivot_table.cache.records.r. Cache fields in these records are stored as indexes into my_pivot_table.cache.cacheFields
3.2) Replace the cache-field indexes in each record with their actual values, by "joining" cache.records.r and cache.cacheFields
4. Convert the list of row dicts into a pandas DataFrame
Below you will find a copy of the code that implements this solution. Since the structure of these Excel objects is somewhat complex, the code will probably look rather cryptic (sorry about that). To address this, I'm adding minimal examples of the main objects being manipulated further below, so people can get a better sense of what is going on and what objects are being returned.
This was the simplest approach I could find to achieve this. I hope it is still useful for someone, albeit some tweaking may be needed for individual cases.
"Bare" code
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        fields_map[field.name] = [f.v for f in field.sharedItems._fields]

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are stored as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
Results:
>>> df.head(2)
FUEL YEAR REGION STATE UNIT Jan Feb (...)
0 GASOLINE (m3) 2000.0 S TEXAS m3 9563.263 9563.263 (...)
1 GASOLINE (m3) 2000.0 NE NEW YORK m3 3065.758 9563.263 (...)
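From here, saving to CSV or loading into a database is standard pandas (the connection string below is just a placeholder; to_sql needs SQLAlchemy installed):
from sqlalchemy import create_engine

df.to_csv('pivot_cache_dump.csv', index=False)

engine = create_engine('postgresql://user:password@host/dbname')  # hypothetical
df.to_sql('pivot_cache', engine, if_exists='replace', index=False)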
Some of the objects' details
Object pivot_table
This is an object of type openpyxl.pivot.table.TableDefinition. It is quite complex. A small glimpse of it:
<openpyxl.pivot.table.TableDefinition object>
Parameters:
name='Tabela dinâmica1', cacheId=36, dataOnRows=True, dataPosition=None, (A LOT OF OMITTED STUFF...)
Parameters:
ref='B52:W66', firstHeaderRow=1, firstDataRow=2, firstDataCol=1, rowPageCount=2, colPageCount=1, pivotFields=[<openpyxl.pivot.table.PivotField object>
Parameters: (A LOT OF OMITTED STUFF...)
Object fields_map (from cache.cacheFields)
This is a dict with column name and their available values:
{'YEAR': [2000.0, 2001.0, 2002.0, 2003.0, 2004.0, 2005.0, 2006.0, 2007.0, 2008.0,
2009.0, 2010.0, 2011.0, 2012.0, 2013.0, 2014.0, 2015.0, 2016.0, 2017.0,
2018.0, 2019.0, 2020.0],
'FUEL': ['GASOLINE (m3)', 'AVIATION GASOLINE (m3)', 'KEROSENE (m3)'],
'STATE': ['TEXAS', 'NEW YORK', 'MAINE', (...)],
'REGION': ['S', 'NE', 'N', (...)]}
Object row_dict (before mapping)
Each row is a dict with column names and their values. Raw values for cache fields are not stored here; they are represented by their indexes in cache.cacheFields (see above):
{'YEAR': 0, # <<<--- 0 stands for index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 0, # <<<--- index in fields_map
'Dec': 10818.094,
'STATE': 0, # <<<--- index in fields_map
(...)
'UNIT': 'm3'}
Object row_dict (after mapping)
After extracting the raw values for the cache fields from their indexes, we have a dict that represents all values of a row:
{'YEAR': 2000.0, # extracted column value from index in fields_map
'Jan': 10719.983,
'Feb': 12482.281,
'FUEL': 'GASOLINE (m3)', # extracted from fields_map
'Dec': 10818.094,
'STATE': 'TEXAS', # extracted from fields_map
(...)
'UNIT': 'm3'}
Building on @PMHM's excellent answer, I have modified the code to handle source data with blank cells. The piece of code that needed modification is the following:
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l
The complete code (mostly copy/paste from above) is therefore:
import numpy as np
import pandas as pd
from openpyxl import load_workbook
from openpyxl.pivot.fields import Missing

file_path = 'path/to/your/file.xlsx'
workbook = load_workbook(file_path)
worksheet = workbook['Plan1']

# Name of desired pivot table (the same name that appears within Excel)
pivot_name = 'Tabela dinâmica1'

# Extract the pivot table object from the worksheet
pivot_table = [p for p in worksheet._pivots if p.name == pivot_name][0]

# Extract a dict of all cache fields and their respective values
fields_map = {}
for field in pivot_table.cache.cacheFields:
    if field.sharedItems.count > 0:
        # take care of cases where f.v raises an AttributeError because the cell is empty
        # fields_map[field.name] = [f.v for f in field.sharedItems._fields]
        l = []
        for f in field.sharedItems._fields:
            try:
                l += [f.v]
            except AttributeError:
                l += [""]
        fields_map[field.name] = l

# Extract all rows from cache records. Each row is initially parsed as a dict
column_names = [field.name for field in pivot_table.cache.cacheFields]
rows = []
for record in pivot_table.cache.records.r:
    # If some field in the record is missing, we replace it by NaN
    record_values = [
        field.v if not isinstance(field, Missing) else np.nan for field in record._fields
    ]
    row_dict = {k: v for k, v in zip(column_names, record_values)}
    # Shared fields are stored as an index, so we replace the field index by its value
    for key in fields_map:
        row_dict[key] = fields_map[key][row_dict[key]]
    rows.append(row_dict)

df = pd.DataFrame.from_dict(rows)
How do I write a format to a range of cells?
What I am doing is looping over the column names in a list from Oracle and formatting the columns as dates where the column name starts with "DT". But I also want to give the entire data range borders.
I would really like to apply the date format to the columns and then separately apply the borders, but the last format applied wins: applying the borders overwrites the date formatting on the columns.
Ideally I want to blast the data range with borders and then apply date formats to the date columns while retaining the borders.
Can you select a range and then apply formatting or do range intersections as you can in VBA?
# Generate EXCEL File
xl_filename = "DQ_Valid_Status_Check.xlsx"
workbook = xlsxwriter.Workbook(xl_filename)
# Add a bold format to use to highlight cells.
bold = workbook.add_format({'bold': True})
date_format = workbook.add_format(
    {'num_format': 'dd-mmm-yyyy hh:mm:ss'})
border = workbook.add_format()
border.set_bottom()
border.set_top()
border.set_left()
border.set_right()
worksheet_info = workbook.add_worksheet()
worksheet_info.name = "Information"
worksheet_info.write('A1', 'Report Description:', bold)
worksheet_info.write('B1', 'ARIEL Data Quality Report for Checking Authorisation Status of Marketing Applications')
worksheet_info.write('A2', 'Report Date:', bold)
worksheet_info.write('B2', datetime.datetime.now(), date_format)
worksheet_data = workbook.add_worksheet()
worksheet_data.name = "DQ Report"
worksheet_data.write_row('A1', col_names)
for i in range(len(results)):
    print("result " + str(i) + ' of ' + str(len(results)))
    print(results[i])
    worksheet_data.write_row('A' + str(i + 2), results[i])
    #worksheet_data.set_row(i + 2, None, border)

# add borders
for i in range(len(results)):
    worksheet_data.set_row(i + 2, None, border)

# format date columns
for i in range(len(col_names)):
    col_name = col_names[i]
    if col_name.startswith("DT"):
        print(col_name)
        worksheet_data.set_column(i, i, None, date_format)
workbook.close()
According to the FAQ, it is not currently possible to format a range of cells at once, but a future feature might allow this.
You could create Format objects containing multiple format properties and apply your custom format to each cell as you write to it. See "Creating and using a Format Object".
To apply borders to all columns at once you can do something like:
border = workbook.add_format({'border':2})
worksheet_data.set_column(first_col=0, last_col=10, cell_format=border)
And to retain the border format you can modify your date_format to:
date_format = workbook.add_format(
{'num_format': 'dd-mmm-yyyy hh:mm:ss',
'border': 2})
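Putting the two together, something like this sketch should work (col_names comes from your Oracle query; border width 2 is just an example):
border = workbook.add_format({'border': 2})
date_border = workbook.add_format({'num_format': 'dd-mmm-yyyy hh:mm:ss',
                                   'border': 2})

# Borders on the whole data range first, then the combined
# date-plus-border format on the DT columns so the borders survive.
worksheet_data.set_column(0, len(col_names) - 1, None, border)
for i, col_name in enumerate(col_names):
    if col_name.startswith("DT"):
        worksheet_data.set_column(i, i, None, date_border)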
I'm reading a CSV file into a data frame and then using DataNitro to let users modify the data based on inputs in Excel cells. This works fine except, it seems, when every value in a df column is NaN. The first step is for the user to enter the UID of the entity whose data they wish to access. The CSV is read with the UIDs as the index.
This is the code:
class InterAction:
    def __init__(self):
        self.PD_CL = pd.read_csv(r"C:\Users\rcreedon\Desktop\DataInProg\ContactList.csv", index_col='UID')

    def CheckCL_UID(self):
        self.UID = str(CellVal)
        if self.UID in self.PD_CL.index.values:
            return 'True'
        else:
            return "ERROR, the Factory Code you have entered is not in the Contact List"

    def UpdateContactDetails(self, Cell_GMNum, Cell_CNum, Cell_GMNam, Cell_CNam, Cell_GMDesig, Cell_CDesig):
        if not Cell_GMNum.is_empty():
            self.PD_CL['Cnum_gm'][self.UID] = str(Cell_GMNum.value)
        if not Cell_CNum.is_empty():
            self.PD_CL['Cnum_upd'][self.UID] = str(Cell_CNum.value)
        if not Cell_GMNam.is_empty():
            self.PD_CL['Cnam_gm'][self.UID] = str(Cell_GMNam.value)
        if not Cell_CNam.is_empty():
            self.PD_CL['Cnam_upd'][self.UID] = str(Cell_CNam.value)
        if not Cell_GMDesig.is_empty():
            self.PD_CL['Cdesig_gm'][self.UID] = str(Cell_GMDesig.value)

Inter = InterAction()
Cell("InputSheet", 5, 2).value = Inter.CheckCL_UID()
Inter.UpdateContactDetails(Cell("InputSheet", 3, 7), Cell("InputSheet", 4, 7), Cell("InputSheet", 5, 7), Cell("InputSheet", 6, 7), Cell("InputSheet", 7, 7), Cell("InputSheet", 8, 7))
With a UID of 'MP01', which is in the CSV data frame index, running this produced a composite error relating to the user input in the GMDesig cell. It ended in:
ValueError: ['M' 'P' '0' '1'] not contained in index
I noticed that the CDesig_gm column in the excel file was the only column with no values and was consequently read into the data frame as a column of NaNs. When I added a meaningless value to one of the cells in the csv and re-ran the program it worked fine.
What is happening here? I'm stumped.
Thanks
You might be getting a TypeError when you try to change the column value. Add this to your code:
if not Cell_GMDesig.is_empty():
    self.PD_CL['Cdesig_gm'] = self.PD_CL['Cdesig_gm'].astype(str)  # cast to string first
    self.PD_CL['Cdesig_gm'][self.UID] = str(Cell_GMDesig.value)
(Some more details: When Pandas reads in a CSV, it selects a datatype for each column. A blank column is read in as a column of floats, and writing a string to one of the entries will fail.
Putting in junk data lets pandas know that the column shouldn't be numerical, so the write succeeds.)
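Alternatively, you can sidestep the issue at read time by forcing the dtype, and use .loc for the assignment (a sketch; the column name and path are taken from the question):
import pandas as pd

# Force the all-blank column to be read as strings rather than floats.
PD_CL = pd.read_csv(r"C:\Users\rcreedon\Desktop\DataInProg\ContactList.csv",
                    index_col='UID', dtype={'Cdesig_gm': str})

# .loc avoids the chained indexing used in the question.
PD_CL.loc['MP01', 'Cdesig_gm'] = 'some value'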