PANDAS: Stylize Dataframe [duplicate] - python

I'm trying to output a Pandas dataframe into an excel file using xlsxwriter. However I'm trying to apply some rule-based formatting; specifically trying to merge cells that have the same value, but having trouble coming up with how to write the loop. (New to Python here!)
See below for output vs output expected:
(As you can see based off the image above I'm trying to merge cells under the Name column when they have the same values).
Here is what I have thus far:
#This is the logic you use to merge cells in xlsxwriter (just an example)
worksheet.merge_range('A3:A4','value you want in merged cells', merge_format)
#Merge Car type Loop thought process...
#1.Loop through data frame where row n Name = row n -1 Name
#2.Get the length of the rows that have the same Name
#3.Based off the length run the merge_range function from xlsxwriter, worksheet.merge_range('range_found_from_loop','Name', merge_format)
for row_index in range(1, len(car_report)):
    if car_report.loc[row_index, 'Name'] == car_report.loc[row_index-1, 'Name']:
        # find starting point based off index, then get range by adding number of rows to starting point. For example, let's say rows 0-2 are similar: I would get 'A0:A2', which I can then put in the code below
        # from there apply worksheet.merge_range('A0:A2','[input value]', merge_format)
        pass
Any help is greatly appreciated!
Thank you!

Your logic is almost correct; however, I approached your problem slightly differently:
1) Sort the column so that all equal values are grouped together.
2) Reset the index (using reset_index(), probably with the argument drop=True).
3) Capture the rows where a new value starts. To do that, create a list and add row 1 to it, because the first group always starts there.
4) Then iterate over the rows in that list and check some conditions:
4a) If a value occupies only one row, the merge_range method will fail because it cannot merge a single cell. In that case we need to replace merge_range with the write method.
4b) With this algorithm you'll get an IndexError when handling the last value of the list (each start row is compared with the value at the next index position, and the last value of the list has no next position). So if we get an IndexError (which means we are at the last value), we merge or write up to the last row of the dataframe.
4c) Finally, I did not take into account columns that contain blank or null cells. In that case the code needs to be adjusted.
Lastly, the code might look a bit confusing: keep in mind that pandas indexes the first data row as 0 (the headers are separate), while in xlsxwriter the header row is row 0 and the first data row is row 1.
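The bookkeeping in steps 3 and 4 can be isolated into a small pure function that is easy to check before any Excel file is written. The sketch below is only an illustration: the name `find_runs` and the `header_rows` parameter are made up for this example, and it assumes the values are already sorted or grouped as in step 1.

```python
from itertools import groupby


def find_runs(values, header_rows=1):
    """Return (first_row, last_row, value) for each run of equal consecutive values.

    Rows are worksheet rows: position 0 in `values` lands on row `header_rows`,
    which accounts for the header that to_excel writes on row 0.
    """
    runs = []
    row = header_rows
    for value, group in groupby(values):
        length = len(list(group))
        runs.append((row, row + length - 1, value))
        row += length
    return runs


names = ['Tesla', 'Tesla', 'Toyota', 'Ford', 'Ford', 'Ford']
print(find_runs(names))
# [(1, 2, 'Tesla'), (3, 3, 'Toyota'), (4, 6, 'Ford')]
```

A run whose first and last rows are equal would go through worksheet.write, and any longer run through worksheet.merge_range, as described in 4a.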
Here is a working example to achieve exactly what you want to do:
import pandas as pd
# Create a test df
df = pd.DataFrame({'Name': ['Tesla','Tesla','Toyota','Ford','Ford','Ford'],
                   'Type': ['Model X','Model Y','Corolla','Bronco','Fiesta','Mustang']})
# Create the list where we'll capture the cells that appear for the 1st time,
# add the 1st row and start checking from the 2nd row until the end of the df
startCells = [1]
for row in range(2, len(df)+1):
    if df.loc[row-1,'Name'] != df.loc[row-2,'Name']:
        startCells.append(row)
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
merge_format = workbook.add_format({'align': 'center', 'valign': 'vcenter', 'border': 2})
lastRow = len(df)
for row in startCells:
    try:
        endRow = startCells[startCells.index(row)+1]-1
        if row == endRow:
            worksheet.write(row, 0, df.loc[row-1,'Name'], merge_format)
        else:
            worksheet.merge_range(row, 0, endRow, 0, df.loc[row-1,'Name'], merge_format)
    except IndexError:
        if row == lastRow:
            worksheet.write(row, 0, df.loc[row-1,'Name'], merge_format)
        else:
            worksheet.merge_range(row, 0, lastRow, 0, df.loc[row-1,'Name'], merge_format)
writer.close()  # writer.save() was removed in pandas 2.0
Output:

Alternate Approach:
One can use the unique() function to find the index assigned to each unique value (car name in this example). Using the above test data,
import pandas as pd
# Create a test df
df = pd.DataFrame({'Name': ['Tesla','Tesla','Toyota','Ford','Ford','Ford'],
                   'Type': ['Model X','Model Y','Corolla','Bronco','Fiesta','Mustang']})
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
merge_format = workbook.add_format({'align': 'center', 'valign': 'vcenter', 'border': 2})
for car in df['Name'].unique():
    # find indices and add one to account for header
    u = df.loc[df['Name']==car].index.values + 1
    if len(u) < 2:
        pass  # do not merge cells if there is only one car name
    else:
        # merge cells using the first and last indices; write the car name itself,
        # since u is already shifted by one and no longer matches the df index
        worksheet.merge_range(u[0], 0, u[-1], 0, car, merge_format)
writer.close()

I think this is a better answer to your problem
import pandas as pd
df = pd.DataFrame({'Name': ['Tesla','Tesla','Toyota','Ford','Ford','Ford'],
                   'Type': ['Model X','Model Y','Corolla','Bronco','Fiesta','Mustang']})
# Map every row position to its 'Name' and group the positions by name
# (sort=False keeps the order of appearance; equal names must sit in consecutive rows)
positions = pd.Series(range(len(df)), index=df['Name'])
# Find the first ('min') and last ('max') row position of each group
bounds = positions.groupby(level=0, sort=False).agg(['min', 'max'])
# Create an empty list to store the merge ranges
merge_ranges = []
# Add one to each row for the header and skip groups with a single row
for name, row in bounds.iterrows():
    if row['min'] < row['max']:
        merge_ranges.append((int(row['min']) + 1, 0, int(row['max']) + 1, 0, name))
# Write the dataframe to an excel file and apply the merge ranges
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
merge_format = workbook.add_format({'align': 'center', 'valign': 'vcenter'})
for merge_range in merge_ranges:
    worksheet.merge_range(*merge_range, merge_format)
writer.close()

Alternate approach: other than xlsxwriter, you can also use a pivot table.
dataframe=pd.pivot_table(df,index=[column name...])
dataframe.to_excel('filename.xlsx')

Should "just work" with set_index() and to_excel()
my_index_cols = ['Name'] # this can also be a list of multiple columns
df.set_index(my_index_cols).to_excel('filename.xlsx', index=True, header=None)
see also: https://stackoverflow.com/a/68208815/2098573

Related

Custom Excel column using pandas [duplicate]

I am being asked to generate some Excel reports. I am currently using pandas quite heavily for my data, so naturally I would like to use the pandas.ExcelWriter method to generate these reports. However the fixed column widths are a problem.
The code I have so far is simple enough. Say I have a dataframe called df:
writer = pd.ExcelWriter(excel_file_path, engine='openpyxl')
df.to_excel(writer, sheet_name="Summary")
I was looking over the pandas docs, and I don't really see any options to set column widths. Is there a trick to make it such that the columns auto-adjust to the data? Or is there something I can do after the fact to the xlsx file to adjust the column widths?
(I am using the OpenPyXL library, and generating .xlsx files - if that makes any difference.)
Inspired by user6178746's answer, I have the following:
# Given a dict of dataframes, for example:
# dfs = {'gadgets': df_gadgets, 'widgets': df_widgets}
writer = pd.ExcelWriter(filename, engine='xlsxwriter')
for sheetname, df in dfs.items():  # loop through `dict` of dataframes
    # note: to_excel also writes the index in column 0 by default,
    # so pass index=False here or use idx + 1 below
    df.to_excel(writer, sheet_name=sheetname)  # send df to writer
    worksheet = writer.sheets[sheetname]  # pull worksheet object
    for idx, col in enumerate(df):  # loop through all columns
        series = df[col]
        max_len = max((
            series.astype(str).map(len).max(),  # len of largest item
            len(str(series.name))  # len of column name/header
        )) + 1  # adding a little extra space
        worksheet.set_column(idx, idx, max_len)  # set column width
writer.close()
Dynamically adjust all the column lengths
writer = pd.ExcelWriter('/path/to/output/file.xlsx')
df.to_excel(writer, sheet_name='sheetName', index=False, na_rep='NaN')
for column in df:
    column_length = max(df[column].astype(str).map(len).max(), len(column))
    col_idx = df.columns.get_loc(column)
    writer.sheets['sheetName'].set_column(col_idx, col_idx, column_length)
writer.close()
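The width calculation itself does not depend on pandas or on the Excel engine, so it can be pulled out and checked on plain lists. A minimal sketch, assuming a hypothetical helper named `best_width`; the `padding` and `max_width` defaults are arbitrary choices for this example, not values taken from any library:

```python
def best_width(header, values, padding=1, max_width=50):
    """Width in characters that fits the header and the longest value, capped at max_width."""
    longest_value = max((len(str(v)) for v in values), default=0)
    return min(max(longest_value, len(str(header))) + padding, max_width)


print(best_width('Name', ['Tesla', 'Toyota', 'Ford']))    # 7
print(best_width('A rather long header', [1, 22, 333]))   # 21
```

Inside the loop above it could be used as set_column(col_idx, col_idx, best_width(column, df[column])). Counting characters is only an approximation of the rendered width, which also depends on the font.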
Manually adjust a column using Column Name
col_idx = df.columns.get_loc('columnName')
writer.sheets['sheetName'].set_column(col_idx, col_idx, 15)
Manually adjust a column using Column Index
writer.sheets['sheetName'].set_column(col_idx, col_idx, 15)
In case any of the above is failing with
AttributeError: 'Worksheet' object has no attribute 'set_column'
make sure to install xlsxwriter:
pip install xlsxwriter
For a more comprehensive explanation you can read the article How to Auto-Adjust the Width of Excel Columns with Pandas ExcelWriter on TDS.
I'm posting this because I just ran into the same issue and found that the official documentation for XlsxWriter and pandas still lists this functionality as unsupported. I hacked together a solution that solved the issue I was having: I iterate through each column and use worksheet.set_column to set the column width to the max length of the contents of that column.
One important note, however. This solution does not fit the column headers, simply the column values. That should be an easy change though if you need to fit the headers instead. Hope this helps someone :)
import pandas as pd
import sqlalchemy as sa
import urllib.parse
read_server = 'serverName'
read_database = 'databaseName'
# quote_plus lives in urllib.parse on Python 3
read_params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER="+read_server+";DATABASE="+read_database+";TRUSTED_CONNECTION=Yes")
read_engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % read_params)
#Output some SQL Server data into a dataframe
my_sql_query = """ SELECT * FROM dbo.my_table """
my_dataframe = pd.read_sql_query(my_sql_query,con=read_engine)
#Set destination directory to save excel.
xlsFilepath = r'H:\my_project' + "\\" + 'my_file_name.xlsx'
writer = pd.ExcelWriter(xlsFilepath, engine='xlsxwriter')
#Write excel to file using pandas to_excel
my_dataframe.to_excel(writer, startrow = 1, sheet_name='Sheet1', index=False)
#Indicate workbook and worksheet for formatting
workbook = writer.book
worksheet = writer.sheets['Sheet1']
#Iterate through each column and set the width == the max length in that column. A padding length of 2 is also added.
for i, col in enumerate(my_dataframe.columns):
    # find length of column i
    column_len = my_dataframe[col].astype(str).str.len().max()
    # Setting the length if the column header is larger
    # than the max column value length
    column_len = max(column_len, len(col)) + 2
    # set the column length
    worksheet.set_column(i, i, column_len)
writer.close()
There is a nice package that I started to use recently called StyleFrame.
It takes a DataFrame and lets you style it very easily.
By default the column width is auto-adjusted.
For example:
from StyleFrame import StyleFrame
import pandas as pd
df = pd.DataFrame({'aaaaaaaaaaa': [1, 2, 3],
'bbbbbbbbb': [1, 1, 1],
'ccccccccccc': [2, 3, 4]})
excel_writer = StyleFrame.ExcelWriter('example.xlsx')
sf = StyleFrame(df)
sf.to_excel(excel_writer=excel_writer, row_to_add_filters=0,
columns_and_rows_to_freeze='B2')
excel_writer.save()
you can also change the columns width:
sf.set_column_width(columns=['aaaaaaaaaaa', 'bbbbbbbbb'],
width=35.3)
UPDATE 1
In version 1.4 best_fit argument was added to StyleFrame.to_excel.
See the documentation.
UPDATE 2
Here's a sample of code that works for StyleFrame 3.x.x
from styleframe import StyleFrame
import pandas as pd
columns = ['aaaaaaaaaaa', 'bbbbbbbbb', 'ccccccccccc', ]
df = pd.DataFrame(data={
'aaaaaaaaaaa': [1, 2, 3, ],
'bbbbbbbbb': [1, 1, 1, ],
'ccccccccccc': [2, 3, 4, ],
}, columns=columns,
)
excel_writer = StyleFrame.ExcelWriter('example.xlsx')
sf = StyleFrame(df)
sf.to_excel(
excel_writer=excel_writer,
best_fit=columns,
columns_and_rows_to_freeze='B2',
row_to_add_filters=0,
)
excel_writer.save()
There is probably no automatic way to do it right now, but as you use openpyxl, the following line (adapted from another answer by user Bufke on how to do it manually) lets you specify a sane value (in character widths):
writer.sheets['Summary'].column_dimensions['A'].width = 15
You can do this with pandas and xlsxwriter; the code below works in Python 3.x. For more details on working with XlsxWriter and pandas, this link might be useful: https://xlsxwriter.readthedocs.io/working_with_pandas.html
import pandas as pd
writer = pd.ExcelWriter(excel_file_path, engine='xlsxwriter')
df.to_excel(writer, sheet_name="Summary")
workbook = writer.book
worksheet = writer.sheets["Summary"]
#set the column width as per your requirement
worksheet.set_column('A:A', 25)
writer.close()
I found that it was more useful to adjust the column width based on the column header rather than the column content.
Using df.columns.values.tolist() I generate a list of the column headers and use the lengths of these headers to determine the width of the columns.
See full code below:
import pandas as pd
import xlsxwriter
writer = pd.ExcelWriter(filename, engine='xlsxwriter')
df.to_excel(writer, index=False, sheet_name=sheetname)
workbook = writer.book  # Access the workbook
worksheet = writer.sheets[sheetname]  # Access the Worksheet
header_list = df.columns.values.tolist()  # Generate list of headers
for i in range(0, len(header_list)):
    worksheet.set_column(i, i, len(header_list[i]))  # Set column widths based on len(header)
writer.close()  # Save the excel file
At work, I am always writing dataframes to Excel files. So instead of writing the same code over and over, I have created a module. Now I just import it and use it to write and format the Excel files. There is one downside though: it takes a long time if the dataframe is extra large.
So here is the code:
import os
import pandas as pd

def result_to_excel(output_name, dataframes_list, sheet_names_list, output_dir):
    out_path = os.path.join(output_dir, output_name)
    writerReport = pd.ExcelWriter(out_path, engine='xlsxwriter',
                                  datetime_format='yyyymmdd', date_format='yyyymmdd')
    workbook = writerReport.book
    # loop through the list of dataframes to save every dataframe into a new sheet in the excel file
    for i, dataframe in enumerate(dataframes_list):
        sheet_name = sheet_names_list[i]  # choose the sheet name from sheet_names_list
        dataframe.to_excel(writerReport, sheet_name=sheet_name, index=False, startrow=0)
        # Add a header format.
        header_format = workbook.add_format({
            'bold': True,
            'border': 1,
            'fg_color': '#0000FF',
            'font_color': 'white'})
        # Write the column headers with the defined format.
        worksheet = writerReport.sheets[sheet_name]
        for col_num, col_name in enumerate(dataframe.columns.values):
            worksheet.write(0, col_num, col_name, header_format)
        worksheet.autofilter(0, 0, 0, len(dataframe.columns) - 1)
        worksheet.freeze_panes(1, 0)
        # loop through the columns in the dataframe to get the width of the column
        for j, col in enumerate(dataframe.columns):
            max_width = max([len(str(s)) for s in dataframe[col].values] + [len(col) + 2])
            # define a max width so the column does not get too wide
            if max_width > 50:
                max_width = 50
            worksheet.set_column(j, j, max_width)
    writerReport.close()
    return out_path
Combining the other answers and comments and also supporting multi-indices:
def autosize_excel_columns(worksheet, df):
    autosize_excel_columns_df(worksheet, df.index.to_frame())
    autosize_excel_columns_df(worksheet, df, offset=df.index.nlevels)

def autosize_excel_columns_df(worksheet, df, offset=0):
    for idx, col in enumerate(df):
        series = df[col]
        max_len = max((
            series.astype(str).map(len).max(),
            len(str(series.name))
        )) + 1
        worksheet.set_column(idx+offset, idx+offset, max_len)

sheetname=...
df.to_excel(writer, sheet_name=sheetname, freeze_panes=(df.columns.nlevels, df.index.nlevels))
worksheet = writer.sheets[sheetname]
autosize_excel_columns(worksheet, df)
writer.close()
You can solve the problem by calling the following function, where writer is the open ExcelWriter, df is the dataframe whose sizes you want, and sheetname is the sheet in Excel where you want the modifications to take place:
def auto_width_columns(writer, df, sheetname):
    worksheet = writer.sheets[sheetname]
    for i, col in enumerate(df.columns):
        column_len = max(df[col].astype(str).str.len().max(), len(col) + 2)
        worksheet.set_column(i, i, column_len)
import openpyxl
..
for col in _ws.columns:
    max_length = 0
    # column letter of the first cell in this column, e.g. 'A' or 'AB'
    col_name = col[0].column_letter
    for cell in col:
        # measure the string form so numbers and dates are counted too
        if cell.value is not None:
            max_length = max(max_length, len(str(cell.value)))
    adjusted_width = max_length + 2
    _ws.column_dimensions[col_name].width = adjusted_width
Yes, there is something you can do to the xlsx file afterwards to adjust the column widths.
Use xlwings to autofit columns. It's a pretty simple solution; see the last six lines of the example code. The advantage of this procedure is that you don't have to worry about font size, font type or anything else.
Requirement: Excel installation.
import pandas as pd
import xlwings as xw
path = r"test.xlsx"
# Export your dataframe in question (any dataframe will do).
df = pd.DataFrame({'A': [1.0, 2.0], 'B': ['some text', 'some longer text']})
df.to_excel(path)
# Autofit all columns with xlwings.
with xw.App(visible=False) as app:
    wb = xw.Book(path)
    for ws in wb.sheets:
        ws.autofit(axis="columns")
    wb.save(path)
    wb.close()
The easiest solution is to specify the width of the columns in the set_column method.
for worksheet in writer.sheets.values():
    worksheet.set_column(0, last_column_value, required_width_constant)
This function works for me; it also fixes the index width.
def write_to_excel(writer, X, sheet_name, sep_only=False):
    #writer=writer object
    #X=dataframe
    #sheet_name=name of sheet
    #sep_only=True:write only as separate excel file, False: write as sheet to the writer object
    #note: output_folder and prefix_excel_save are expected to be defined at module level
    if sheet_name=="":
        print("specify sheet_name!")
    else:
        X.to_excel(f"{output_folder}{prefix_excel_save}_{sheet_name}.xlsx")
        if not sep_only:
            X.to_excel(writer, sheet_name=sheet_name)
            #fix column widths
            worksheet = writer.sheets[sheet_name]  # pull worksheet object
            for idx, col in enumerate(X.columns):  # loop through all columns
                series = X[col]
                max_len = max((
                    series.astype(str).map(len).max(),  # len of largest item
                    len(str(series.name))  # len of column name/header
                )) + 1  # adding a little extra space
                worksheet.set_column(idx+1, idx+1, max_len)  # set column width (+1 because the index is in column 0)
            #fix index width
            max_len=pd.Series(X.index.values).astype(str).map(len).max()+1
            worksheet.set_column(0, 0, max_len)
        if sep_only:
            print(f'{sheet_name} is written as separate file')
        else:
            print(f'{sheet_name} is written as separate file')
            print(f'{sheet_name} is written as sheet')
    return writer
Call example:
writer = write_to_excel(writer, dataframe, "Statistical_Analysis")
I may be a bit late to the party, but this code works when using 'openpyxl' as your engine; sometimes pip install xlsxwriter won't solve the issue. The code below works like a charm. Edit any part as you wish.
import functools
from decimal import Decimal
import openpyxl

def text_length(text):
    """
    Get the effective text length in characters, taking into account newlines
    """
    if not text:
        return 0
    lines = text.split("\n")
    return max(len(line) for line in lines)

def _to_str_for_length(v, decimals=3):
    """
    Like str() but rounds decimals to predefined length
    """
    if isinstance(v, float):
        # Round to [decimal] places
        return str(Decimal(v).quantize(Decimal('1.' + '0' * decimals)).normalize())
    else:
        return str(v)

def auto_adjust_xlsx_column_width(df, writer, sheet_name, margin=3, length_factor=1.0, decimals=3, index=False):
    sheet = writer.sheets[sheet_name]
    _to_str = functools.partial(_to_str_for_length, decimals=decimals)
    # Compute & set column width for each column
    for column_name in df.columns:
        # Convert the values of the column to strings and take the longest length
        column_length = max(df[column_name].apply(_to_str).map(text_length).max(), text_length(column_name)) + 5
        # Get index of column in XLSX
        # Column index is +1 if we also export the index column
        col_idx = df.columns.get_loc(column_name)
        if index:
            col_idx += 1
        # Set width of column to (column_length + margin)
        sheet.column_dimensions[openpyxl.utils.cell.get_column_letter(col_idx + 1)].width = column_length * length_factor + margin
    # Compute column width of index column (if enabled)
    if index:  # If the index column is being exported
        index_length = max(df.index.map(_to_str).map(text_length).max(), text_length(df.index.name))
        sheet.column_dimensions["A"].width = index_length * length_factor + margin
An openpyxl version based on #alichaudry's code.
The code 1) loads an excel file, 2) adjusts column widths and 3) saves it.
import pandas as pd

def auto_adjust_column_widths(excel_file : "Excel File Path", extra_space = 1) -> None:
    """
    Adjusts column widths of the excel file and replaces it with the adjusted one.
    Adjusting columns is based on the lengths of columns values (including column names).
    Parameters
    ----------
    excel_file :
        excel_file to adjust column widths.
    extra_space :
        extra column width in addition to the value-based-widths
    """
    from openpyxl import load_workbook
    from openpyxl.utils import get_column_letter
    wb = load_workbook(excel_file)
    for ws in wb:
        df = pd.DataFrame(ws.values,)
        # Series.iteritems() was removed in pandas 2.0; items() is the replacement
        for i, r in (df.astype(str).applymap(len).max(axis=0) + extra_space).items():
            ws.column_dimensions[get_column_letter(i+1)].width = r
    wb.save(excel_file)

Bound xlsxwriter set_row to last column

I was using set_row to apply background-color formatting to a table given an "if" condition (described here). It colored the entire row, while the table has only 15 columns, so I came up with a workaround (kudos to SO) using conditional formatting:
(max_row, max_col) = df.shape
format1 = workbook.add_format({"bg_color": "#FFFFFF"})
format2 = workbook.add_format({"bg_color": "#E4DFEC"})
tmp_format = format1
tmp_val = 0
for i in range(0, max_row):
    if df.loc[i]["chain_id"] != tmp_val:
        tmp_format = format2 if tmp_format == format1 else format1
        tmp_val = df.loc[i]["chain_id"]
    worksheet.conditional_format(
        i + 1,
        0,
        i + 1,
        max_col - 1,
        {
            "type": "formula",
            "criteria": '=$A1<>"mustbeabetterway"',
            "format": tmp_format,
        },
    )
Not only is it super inelegant, but it also creates thousands of conditional formats, which makes the Excel workbook laggy.
There must be a better way to color a row between column indexes.
There are several ways to format the file. I have been using for loops (not ideal for very large dataframes, but they still get the job done). Basically, I iterate through the rows and columns up to the point I want (usually the last row or last column) and apply the format to every cell using the worksheet's write method (for more info have a look here: https://xlsxwriter.readthedocs.io/worksheet.html#worksheet-write ). You do not need conditional formatting unless you want to highlight different values with specific colors.
import pandas as pd
df = pd.DataFrame({'Column A': [1,2,3,4],
                   'Column B': ['a','b','c','d'],
                   'Column C': ['A','B','C','D']})
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Define your formats
format_red = workbook.add_format({'bg_color': '#FFC7CE'})
format_yellow = workbook.add_format({'bg_color': '#FFEB9C', 'italic': True})
format_green = workbook.add_format({'bg_color': '#C6EFCE', 'bold': True})
# Format the entire first column until the dataframe's last cell
for row in range(0, df.shape[0]):
    worksheet.write(row+1, 0, df.iloc[row,0], format_red)
# Format the entire row from 2nd column until the dataframe's last column
for col in range(1, df.shape[1]):
    worksheet.write(2, col, df.iloc[1,col], format_green)
# Format the entire row from 1st column until the dataframe's last column
for col in range(0, df.shape[1]):
    worksheet.write(4, col, df.iloc[3,col], format_yellow)
writer.close()
Initial output:
Final output:
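For the banding in the question (switch color whenever chain_id changes), the decision of which format a row gets can be computed up front and then applied cell by cell with write. A minimal sketch of that decision step only; `band_flags` is a hypothetical helper name, not part of xlsxwriter or pandas:

```python
def band_flags(keys):
    """Return 0 or 1 per row, flipping each time the key differs from the previous row."""
    keys = list(keys)
    flags = []
    band = 0
    for i, key in enumerate(keys):
        if i > 0 and key != keys[i - 1]:
            band = 1 - band  # a new group starts, so switch to the other band
        flags.append(band)
    return flags


print(band_flags([7, 7, 8, 9, 9, 9]))
# [0, 0, 1, 0, 0, 0]
```

With a list such as formats = [format1, format2], each data cell could then be written as worksheet.write(row + 1, col, value, formats[flag]) for col in range(max_col), which stops the fill at the table's last column and needs no conditional-format rules.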

How to flag an anomaly in a data frame (row wise)?

Python newbie here. I would like to flag sporadic numbers that are obviously off from the rest of the row.
In simple terms, flag numbers that do not seem to belong in their row. Numbers in the 100s and 100000s are considered 'off from the rest'.
import pandas as pd
# initialise data of lists.
data = {'A':['R1', 'R2', 'R3', 'R4', 'R5'],
'B':[12005, 18190, 1021, 13301, 31119,],
'C':[11021, 19112, 19021,15, 24509 ],
'D':[10022,19910, 19113,449999, 25519],
'E':[14029, 29100, 39022, 24509, 412271],
'F':[52119,32991,52883,69359,57835],
'G':[41218, 52991,1021,69152,79355],
'H': [43211,7672991,56881,211,77342],
'J': [31211,42901,53818,62158,69325],
}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.describe()
I am trying to do something exactly like this
# I need help with step 1
#my code/pseudocode
# step 1: identify the values in each row that are don't belong to the group
# step 2: flag the identified values and export to excel
style_df = .applymap(lambda x: "background-color: yellow" if x else "") # flags the values that meets the criteria
with pd.ExcelWriter("flagged_data.xlsx", engine="openpyxl") as writer:
df.style.apply(lambda x: style_df, axis=None).to_excel(writer,index=False)
I used two conditions here one to check less than 1000 and another one for greater than 99999. Based on this condition, the code will highlight outliers in red color.
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_conditional.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Get the xlsxwriter workbook and worksheet objects.
workbook = writer.book
worksheet = writer.sheets['Sheet1']
# Add a format. Light red fill with dark red text.
format1 = workbook.add_format({'bg_color': '#FFC7CE',
                               'font_color': '#9C0006'})
# Column 0 holds the index and column 1 holds 'A', so the numbers are in columns 2 to 9
first_row = 1
first_col = 2
last_row = len(df)
last_col = 9
worksheet.conditional_format(first_row, first_col, last_row, last_col,
                             {'type': 'cell',
                              'criteria': '<',
                              'value': 1000,
                              'format': format1})
worksheet.conditional_format(first_row, first_col, last_row, last_col,
                             {'type': 'cell',
                              'criteria': '>',
                              'value': 99999,
                              'format': format1})
# Close the Pandas Excel writer and output the Excel file.
writer.close()
If you don't need machine-learning outlier detection or a Hampel filter and you already know the limits of your filter, you can simply do:
def highlight_outliers(s):
    # force to numeric and coerce string to NaN
    s = pd.to_numeric(s, errors='coerce')
    indexes = (s<1500)|(s>1000000)
    return ['background-color: yellow' if v else '' for v in indexes]
styled = df.style.apply(highlight_outliers, axis=1)
styled.to_excel("flagged_data.xlsx", index=False)
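If the limits are not known in advance, one possible row-relative rule is to compare each value with the median of its own row. This is only a sketch under stated assumptions: the helper name `flag_row` and the factor of 5 are choices made for this example (the factor happens to suit the sample data in the question), and the rule assumes positive numbers.

```python
from statistics import median


def flag_row(values, factor=5):
    """Flag values more than `factor` times above or below the row median."""
    mid = median(values)
    return [v > mid * factor or v < mid / factor for v in values]


row_r4 = [13301, 15, 449999, 24509, 69359, 69152, 211, 62158]  # row R4, columns B to J
print(flag_row(row_r4))
# [False, True, True, False, False, False, True, False]
```

The resulting booleans can be turned into 'background-color: yellow' strings in the same way as in the function above, applied with axis=1 to the numeric columns only.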
I guess you could define a little better what you consider "off from the rest". This is very important when working with data.
Do you want to flag the outliers of your column B distribution for example? You could simply do a calculation of quartiles for your distributions and append those to a dict of some kind, those which are either below the lowest quartile or over the highest quartile. But you obviously would need more than those 5 rows you showed.
There are whole fields dedicated to identification of outliers using machine learning as well. The assumptions you are taking to define what should be considered "off from the rest" are very important.
Read this if you'd like more info on specifics of outlier detection:
https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-1e0b2c19e561

Color every other 2 columns of a dataframe into an excel?

I have a huge dataframe and I need to display it into an excel sheet such that every other 2 columns are colored except the 1st column.
For example:
If there are columns 1 to 100,
column 2,3 must be red
then 4,5 non colored
then 6,7 again red
then 8,9 non colored
and it goes on and on till last column of the dataframe.
In Excel, select the columns containing your data or the entire spreadsheet. Click Conditional Formatting on the Home ribbon. Click New Rule. Click Use a formula to determine which cells to format. In the formula box enter =OR(MOD(COLUMN(A1),4)=2,MOD(COLUMN(A1),4)=3). Click the Format button. Select the Fill tab. Set the fill color to what you want. Hit OK a few times and you should be done.
This fills the cells whose column number is equal to 2 or 3 modulo 4.
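The same rule can be checked quickly outside Excel. A small sketch, where `is_banded` is a hypothetical name used only here and column numbers are 1-based as in Excel's COLUMN():

```python
def is_banded(column_number):
    """True for 1-based column numbers that are 2 or 3 modulo 4."""
    return column_number % 4 in (2, 3)


colored = [c for c in range(1, 13) if is_banded(c)]
print(colored)
# [2, 3, 6, 7, 10, 11]
```

Columns 2-3, 6-7 and 10-11 are colored while 1, 4-5, 8-9 and 12 are not, which matches the pattern asked for in the question.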
I came up with the following solution:
import pandas as pd
import numpy as np
columns = 13
data = np.array([np.arange(10)]*columns).T
df = pd.DataFrame(data=data)
df = df.fillna(0)  # with 0s rather than NaNs
writer = pd.ExcelWriter('pandas_conditional.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
format1 = workbook.add_format({'bg_color': '#FFC7CE'})
for col in range(2, columns+1, 4):
    worksheet.set_column(col, col + 1, cell_format=format1)
writer.close()
Iterate from index 2 (the second data column, because the dataframe index occupies worksheet column 0) up to columns+1, color two columns at once, and then move four indices further. The only problem right now is that it colors the whole column (even the unfilled cells); I'll look for a solution for that later.
Output:
You need to translate integer indices to Excel-like labels with a function and use conditional_format if you want to color only the cells that contain text:
import pandas as pd
import numpy as np
columns = 13
data = np.array([np.arange(10)]*columns).T
df = pd.DataFrame(data=data)
df = df.fillna(0)  # with 0s rather than NaNs
writer = pd.ExcelWriter('pandas_conditional.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1')
workbook = writer.book
worksheet = writer.sheets['Sheet1']
format1 = workbook.add_format({'bg_color': '#FFC7CE'})
def colnum_string(n):
    string = ""
    n += 1  # just because we have index saved in first col
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        string = chr(65 + remainder) + string
    return string
for col in range(2, columns+1, 4):
    str1 = colnum_string(col)+"2"  # omitting header, 1 if header
    str2 = colnum_string(col+1)+str(11)  # number of rows+1 (header)
    ids = str1+":"+str2
    print(ids)
    worksheet.conditional_format(ids, {'type': 'no_blanks',
                                       'format': format1})
writer.close()
Output of the second code:
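The index-to-label conversion is easy to get wrong by one, so it helps to check it in isolation. A minimal sketch with 0-based indices throughout; `column_letter` and `cell_range` are hypothetical helper names for this example:

```python
def column_letter(index):
    """Convert a 0-based column index to an Excel column label (0 -> 'A', 26 -> 'AA')."""
    n = index + 1
    label = ""
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        label = chr(65 + remainder) + label
    return label


def cell_range(first_row, first_col, last_row, last_col):
    """Build an 'A1:B2' style range from 0-based row and column indices."""
    return f"{column_letter(first_col)}{first_row + 1}:{column_letter(last_col)}{last_row + 1}"


print(column_letter(0), column_letter(25), column_letter(26))  # A Z AA
print(cell_range(1, 2, 10, 3))                                 # C2:D11
```

XlsxWriter also ships equivalent helpers in xlsxwriter.utility (xl_col_to_name and xl_range), and conditional_format accepts the numeric form (first_row, first_col, last_row, last_col) directly, so the string conversion can be skipped entirely.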
If you only have a problem with the iteration, then:
for i in range(2, columns, 4):
    setColumnColor(i)
    setColumnColor(i+1)
You start coloring from the 2nd column and color 2 columns at a time, then iterate over the columns in steps of 4.
But if your problem is finding a method to set dataframe colors, then this is a thread for you: Colouring one column of pandas dataframe
You can do it in Excel:
Start by converting the range to a table (Ctrl+T).
Then switch to the Design tab, untick Banded Rows and select Banded Columns instead.
Right-click on the table style.
Click Duplicate.
Click First Column Stripe.
Set Stripe Size to 2.
Note: for details see https://www.ablebits.com/office-addins-blog/2014/03/13/alternate-row-column-colors-excel/#alternating-row-tables

Merge rows based on value (pandas to excel - xlsxwriter)

I'm trying to output a Pandas dataframe into an excel file using xlsxwriter. However I'm trying to apply some rule-based formatting; specifically trying to merge cells that have the same value, but having trouble coming up with how to write the loop. (New to Python here!)
See below for output vs output expected:
(As you can see based off the image above I'm trying to merge cells under the Name column when they have the same values).
Here is what I have thus far:
#This is the logic you use to merge cells in xlsxwriter (just an example)
worksheet.merge_range('A3:A4','value you want in merged cells', merge_format)
#Merge Car type Loop thought process...
#1.Loop through data frame where row n Name = row n -1 Name
#2.Get the length of the rows that have the same Name
#3.Based off the length run the merge_range function from xlsxwriter, worksheet.merge_range('range_found_from_loop','Name', merge_format)
for row_index in range(1,len(car_report)):
    if car_report.loc[row_index, 'Name'] == car_report.loc[row_index-1, 'Name']:
        #find starting point based off index, then get range by adding number of rows to starting point. for example lets say rows 0-2 are similar I would get 'A0:A2' which I can then put in the code below
        #from there apply worksheet.merge_range('A0:A2','[input value]', merge_format)
Any help is greatly appreciated!
Thank you!
Your logic is almost correct, but I took a slightly different approach:
1) Sort the column so that all equal values are grouped together.
2) Reset the index (using reset_index(), possibly with drop=True).
3) Capture the rows where a new value starts. To do that, create a list and put row 1 in it, because the first group always starts there.
4) Iterate over the rows in that list and check a few conditions:
4a) If a value occupies only one row, merge_range will not work because it cannot merge a single cell. In that case, use the write method instead of merge_range.
4b) With this algorithm you'll get an IndexError when handling the last entry of the list, because the code looks up the entry at the next index position and there is none. So when an IndexError occurs (meaning we are at the last entry), merge or write down to the last row of the dataframe.
4c) Finally, I did not account for blank or null cells in the column. If there are any, the code needs to be adjusted.
The code might look a bit confusing: keep in mind that pandas rows are 0-indexed with the headers kept separately, while in the xlsxwriter worksheet the header is row 0 and the first data row is row 1.
Here is a working example that does exactly what you want:
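The run detection and the row offset described above can be sketched on their own, without pandas or xlsxwriter. This is only an illustration under stated assumptions: `merge_spans` and `header_rows` are hypothetical names, the input is a plain list that is already grouped (steps 1 and 2), and blank or null cells (step 4c) are not handled.

```python
from itertools import groupby


def merge_spans(values, header_rows=1):
    """Return (first_row, last_row, value) for each run of equal values.

    Rows are worksheet rows: header_rows is added so that the first
    data row lands on row 1 when the header occupies row 0.
    """
    spans = []
    row = header_rows
    for value, run in groupby(values):
        length = sum(1 for _ in run)  # number of consecutive equal values
        spans.append((row, row + length - 1, value))
        row += length
    return spans


names = ['Tesla', 'Tesla', 'Toyota', 'Ford', 'Ford', 'Ford']
for first, last, value in merge_spans(names):
    # A single-row span cannot be merged, so it would be written instead.
    action = 'write' if first == last else 'merge_range'
    print(action, first, last, value)
```

A span with `first == last` corresponds to `worksheet.write(first, 0, value, merge_format)`, and any other span to `worksheet.merge_range(first, 0, last, 0, value, merge_format)`. Because the spans are computed up front, no IndexError handling is needed for the last group.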
import pandas as pd
# Create a test df
df = pd.DataFrame({'Name': ['Tesla','Tesla','Toyota','Ford','Ford','Ford'],
                   'Type': ['Model X','Model Y','Corolla','Bronco','Fiesta','Mustang']})
# Create the list where we'll capture the cells that appear for the 1st time,
# add the 1st row and start checking from the 2nd row until the end of the df
startCells = [1]
for row in range(2,len(df)+1):
    if (df.loc[row-1,'Name'] != df.loc[row-2,'Name']):
        startCells.append(row)
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
merge_format = workbook.add_format({'align': 'center', 'valign': 'vcenter', 'border': 2})
lastRow = len(df)
for row in startCells:
    try:
        # The group ends one row before the next group starts
        endRow = startCells[startCells.index(row)+1]-1
        if row == endRow:
            worksheet.write(row, 0, df.loc[row-1,'Name'], merge_format)
        else:
            worksheet.merge_range(row, 0, endRow, 0, df.loc[row-1,'Name'], merge_format)
    except IndexError:
        # Last group: it runs until the last row of the dataframe
        if row == lastRow:
            worksheet.write(row, 0, df.loc[row-1,'Name'], merge_format)
        else:
            worksheet.merge_range(row, 0, lastRow, 0, df.loc[row-1,'Name'], merge_format)
writer.close()  # writer.save() was removed in pandas 2.0
Output:
Alternate Approach:
One can use the unique() function to find the index assigned to each unique value (car name in this example). Using the above test data,
import pandas as pd
# Create a test df
df = pd.DataFrame({'Name': ['Tesla','Tesla','Toyota','Ford','Ford','Ford'],
                   'Type': ['Model X','Model Y','Corolla','Bronco','Fiesta','Mustang']})
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
merge_format = workbook.add_format({'align': 'center', 'valign': 'vcenter', 'border': 2})
for car in df['Name'].unique():
    # find indices and add one to account for header
    u = df.loc[df['Name']==car].index.values + 1
    if len(u) < 2:
        pass  # do not merge cells if there is only one car name
    else:
        # merge cells using the first and last indices; u is already shifted
        # by one, so write car directly instead of looking up df.loc[u[0]]
        worksheet.merge_range(u[0], 0, u[-1], 0, car, merge_format)
writer.close()  # writer.save() was removed in pandas 2.0
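Note that taking the first and last index of each unique value only gives a valid merge range if all rows with the same name are adjacent, which is why the first answer sorts the column beforehand. A minimal check for that precondition might look like the sketch below. `is_grouped` is a hypothetical helper, not part of pandas or xlsxwriter, and it assumes the values are hashable.

```python
from itertools import groupby


def is_grouped(values):
    """Return True if every distinct value forms one contiguous block."""
    # groupby yields one key per run of consecutive equal values, so a
    # value that starts two separate runs shows up twice in run_keys.
    run_keys = [key for key, _ in groupby(values)]
    return len(run_keys) == len(set(run_keys))


print(is_grouped(['Tesla', 'Tesla', 'Toyota', 'Ford']))  # prints True
print(is_grouped(['Tesla', 'Ford', 'Tesla']))            # prints False
```

With a dataframe, this could be called as `is_grouped(df['Name'].tolist())` before merging. If it returns False, sort the column and reset the index first.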
I think this is a better answer to your problem:
import pandas as pd
df = pd.DataFrame({'Name': ['Tesla','Tesla','Toyota','Ford','Ford','Ford'],
                   'Type': ['Model X','Model Y','Corolla','Bronco','Fiesta','Mustang']})
# Use the groupby() function to group the rows by 'Name'
# (sort=False keeps the groups in their original order)
grouped = df.groupby('Name', sort=False)
# Create an empty list to store the merge ranges
merge_ranges = []
# Iterate over the groups and add the merge ranges to the list.
# This assumes a default 0-based index and that equal names are adjacent.
for name, group in grouped:
    # Add 1 because the header occupies worksheet row 0
    start_row = int(group.index[0]) + 1
    end_row = int(group.index[-1]) + 1
    # merge_range needs at least two cells, so skip single-row groups
    if end_row > start_row:
        merge_ranges.append((start_row, 0, end_row, 0, name))
# Write the dataframe to an excel file and apply the merge ranges
writer = pd.ExcelWriter('test.xlsx', engine='xlsxwriter')
df.to_excel(writer, sheet_name='Sheet1', index=False)
workbook = writer.book
worksheet = writer.sheets['Sheet1']
merge_format = workbook.add_format({'align': 'center', 'valign': 'vcenter'})
for merge_range in merge_ranges:
    worksheet.merge_range(*merge_range, merge_format)
writer.close()  # writer.save() was removed in pandas 2.0
Alternate approach: instead of merging cells with xlsxwriter, you can also use a pivot table.
dataframe=pd.pivot_table(df,index=[column name...])
dataframe.to_excel('test.xlsx')  # write the pivoted frame, not the original df
This should "just work" with set_index() and to_excel():
my_index_cols = ['Name'] # this can also be a list of multiple columns
df.set_index(my_index_cols).to_excel('filename.xlsx', index=True, header=None)
See also: https://stackoverflow.com/a/68208815/2098573
