I need to color rows in my DataFrame according to a condition on one of its columns. For example:
import pandas as pd

test = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [5, 3, 2, 1, 4]})

def color(score):
    return "background-color: " + ("#ffff00;" if score < 4 else "#ff0000;")

test.style.applymap(color, subset=["A"])
But this way only column A gets colored. I can also write it like this:
def color1(score):
    return "background-color: #ffff00;"

def color2(score):
    return "background-color: #ff0000;"

temp = test.style.applymap(color1, subset=(test[test["A"] < 4].index, slice(None)))
temp.applymap(color2, subset=(test[test["A"] >= 4].index, slice(None)))
This colors the rows correctly, but I am stuck with multiple calls to applymap. Is there any way to achieve my goal without the multiple calls?
In the first example, with subset=["A"] you are telling the Styler to apply the function only to column A. If you remove that argument, it will apply to the entire DataFrame.
import pandas as pd

test = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": [5, 3, 2, 1, 4]})

def color(score):
    return "background-color: " + ("#ffff00;" if score < 4 else "#ff0000;")

test.style.applymap(color)
If you want to apply different formatting to different columns, you can easily design something that handles one column at a time.
def color2(score):
    return "background-color: #80FFff;" if score < 4 else None

test.style.applymap(color, subset='A') \
    .applymap(color2, subset='B')
If the mask is more complicated, or you need to set several styles at once, create a DataFrame of styles and pass it to Styler.apply with axis=None:
def highlight(x):
    c1 = "background-color: #ffff00;"
    c2 = "background-color: #ff0000;"
    m = x["A"] < 4
    # DataFrame of styles, same shape as the data
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # set whole rows by condition
    df1.loc[m, :] = c1
    df1.loc[~m, :] = c2
    # for checking the DataFrame of styles
    print(df1)
    return df1

test.style.apply(highlight, axis=None)
I have a pandas dataframe with MultiIndex. The indexes of the rows are 'time' and 'type' while the columns are built from tuples. The dataframe stores the information about the price and size of three cryptocurrencies pairs (either info about trades or about the best_bids). The details are not really important, but the dataframe looks like this
I would like to change the color of the rows for which 'type' == 'Buy Trade' (let's say I want to make the text of these rows green, and red otherwise).
How can I do it?
You can download the csv of the dataframe from here https://github.com/doogersPy/files/blob/main/dataframe.csv and then load the dataframe with
df = pd.read_csv('dataframe.csv',index_col=[0,1], header=[0,1])
I have tried a similar method presented in this other question, but df.style.apply does not work with non-unique MultiIndexes (as in my case). In my dataframe, there are entries with the same time value.
In fact, I have tried the following code
def highlight(ob):
    c1 = "background-color: #008000;"
    c2 = "background-color: #ff0000;"
    m = ob.index.get_level_values('type') == 'Buy Trade'
    # DataFrame of styles
    df1 = pd.DataFrame('', index=ob.index, columns=ob.columns)
    # set whole rows by condition
    df1.loc[m, :] = c1
    df1.loc[~m, :] = c2
    return df1

df.style.apply(highlight, axis=None)
but I get the error
KeyError: 'Styler.apply and .applymap are not compatible with
non-unique index or columns.'
I have solved it with the following method:
col = df.reset_index().columns
idx = df.reset_index().index

def highlight(ob):
    c_g = "color: #008000;"  # green
    c_r = "color: #ff0000;"  # red
    c_b = "color: #000000;"  # black
    mBuy = (ob['type'] == 'Buy Trade')
    mSell = (ob['type'] == 'Sell Trade')
    mOB = (ob['type'] == 'OB Update')
    # DataFrame of styles
    df1 = pd.DataFrame('', index=idx, columns=col)
    # set whole rows by condition
    df1.loc[mBuy] = c_g
    df1.loc[mSell] = c_r
    df1.loc[mOB] = c_b
    return df1

df.reset_index().style.apply(highlight, axis=None)
I have an issue with my code for highlighting specific cells when I export my DataFrame to an Excel file. The cells to highlight with a background color are the outliers of each column. The outliers are calculated with a for loop over the columns.
Here is the code where I calculate the outliers for each column:
for col in dfmg.columns.difference(['Sbj', 'expertise', 'gender']):
    Q1c = dfmg[col].quantile(0.25)
    Q3c = dfmg[col].quantile(0.75)
    IQRc = Q3c - Q1c
    lowc = Q1c - 1.5 * IQRc
    uppc = Q3c + 1.5 * IQRc
Then I created this function to define how to highlight the cells:
def colors(v):
    for v in dfmg[col]:
        if v < lowc or v > uppc:
            color = 'yellow'
    return 'background-color: %s' % color
And I apply my function to a new df:
df_colored = dfmg.style.applymap(colors)
The problem is that when I export df_colored, everything is yellow! Where am I wrong?
Thanks for help!
You can create a DataFrame of styles and apply it to the whole DataFrame. I modified your solution to avoid the loops:
def highlight(x):
    c1 = 'background-color: yellow'
    c2 = ''
    cols = x.columns.difference(['Sbj', 'expertise', 'gender'])
    Q1 = x[cols].quantile(0.25)
    Q3 = x[cols].quantile(0.75)
    IQR = Q3 - Q1
    low = Q1 - 1.5 * IQR
    up = Q3 + 1.5 * IQR
    mask = x[cols].lt(low) | x[cols].gt(up)
    # DataFrame with the same index and column names as the original, filled with empty strings
    df1 = pd.DataFrame(c2, index=x.index, columns=x.columns)
    # modify the values of df1's columns by the boolean mask
    df1[cols] = df1[cols].mask(mask, c1)
    return df1

dfmg.style.apply(highlight, axis=None)
I've several hundred pandas dataframes, and the number of rows is not exactly the same in all of them: some have 600 rows while others have only 540.
I have two samples with exactly the same number of dataframes, and I want to read all the dataframes (around 2000) from both samples. This is what the data looks like, and I can read the files like this:
5113.440 1 0.25846 0.10166 27.96867 0.94852 -0.25846 268.29305 5113.434129
5074.760 3 0.68155 0.16566 120.18771 3.02654 -0.68155 101.02457 5074.745627
5083.340 2 0.74771 0.13267 105.59355 2.15700 -0.74771 157.52406 5083.337081
5088.150 1 0.28689 0.12986 39.65747 2.43339 -0.28689 164.40787 5088.141849
5090.780 1 0.61464 0.14479 94.72901 2.78712 -0.61464 132.25865 5090.773443
# first sample
path_to_files = '/home/Desktop/computed_2d_blaze/'
lst = []
for filen in [x for x in os.listdir(path_to_files) if '.ares' in x]:
    df = pd.read_table(path_to_files + filen, skiprows=0, usecols=(0, 1, 2, 3, 4, 8),
                       names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave'],
                       delimiter=r'\s+')
    df = df.sort_values('stlines', ascending=False)
    df = df.drop_duplicates('wave')
    df = df.reset_index(drop=True)
    lst.append(df)
# second sample
path_to_files1 = '/home/Desktop/computed_1d/'
lst1 = []
for filen in [x for x in os.listdir(path_to_files1) if '.ares' in x]:
    df1 = pd.read_table(path_to_files1 + filen, skiprows=0, usecols=(0, 1, 2, 3, 4, 8),
                        names=['wave', 'num', 'stlines', 'fwhm', 'EWs', 'MeasredWave'],
                        delimiter=r'\s+')
    df1 = df1.sort_values('stlines', ascending=False)
    df1 = df1.drop_duplicates('wave')
    df1 = df1.reset_index(drop=True)
    lst1.append(df1)
Now the data is stored in lists, and since the number of rows differs between the dataframes I can't subtract them directly.
How can I subtract them correctly? And afterwards I want to take the average (mean) of the residuals to make a dataframe.
You shouldn't use apply. Just use Boolean masking:
mask = df['waves'].between(lower_outlier, upper_outlier)
df[mask].plot(x='waves', y='stlines')
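The snippet above assumes lower_outlier and upper_outlier already exist. A minimal sketch of computing them with the 1.5×IQR rule used elsewhere in this thread (the data and names here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"waves": [1.0, 2.0, 3.0, 4.0, 100.0],
                   "stlines": [5.0, 4.0, 3.0, 2.0, 1.0]})

# Tukey fences: 1.5 * IQR below Q1 and above Q3
q1 = df["waves"].quantile(0.25)
q3 = df["waves"].quantile(0.75)
iqr = q3 - q1
lower_outlier = q1 - 1.5 * iqr
upper_outlier = q3 + 1.5 * iqr

# keep only rows whose 'waves' value lies inside the fences
mask = df["waves"].between(lower_outlier, upper_outlier)
clean = df[mask]
```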
One solution that comes to mind is writing a function that finds outliers based on upper and lower bounds and then slicing the dataframes based on the outlier index, e.g.:
df1 = pd.DataFrame({'wave': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'stlines': [0.1, 0.2, 0.3, 0.4, 0.5]})

def outlier(value, upper, lower):
    """Check whether a value lies within the given bounds."""
    in_bounds = (value <= upper) and (value >= lower)
    return in_bounds

# Boolean mask: True where values in df1's wave column are within bounds
outlier_index = df1.wave.apply(lambda x: outlier(x, 4, 1))
# Return df2 without the rows at out-of-bounds positions
df2[outlier_index]
# Return df1 without the rows at out-of-bounds positions
df1[outlier_index]
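On the subtraction part of the original question: since the rows that survive drop_duplicates differ between files, one option (assuming 'wave' identifies matching rows across the two samples) is to index both frames by 'wave', let pandas align them, and then average the residual. A sketch with made-up numbers:

```python
import pandas as pd

df1 = pd.DataFrame({"wave": [5074.76, 5083.34, 5088.15],
                    "stlines": [0.68, 0.74, 0.28]})
df2 = pd.DataFrame({"wave": [5074.76, 5083.34],
                    "stlines": [0.60, 0.70]})

# align on 'wave'; rows missing from either frame become NaN
residual = df1.set_index("wave")["stlines"] - df2.set_index("wave")["stlines"]
mean_residual = residual.mean()  # NaN rows are skipped by default
```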
Is there a way to remove columns or rows after applying a style in pandas, and to re-sort them?
styled = df.style.apply(colorize, axis=None)
#remove _x columns
yonly = list(sorted(set(styled.columns) - set(df.filter(regex='_x$').columns)))
###Remove columns that end with "_x" here
styled.to_excel('styled.xlsx', engine='openpyxl', freeze_panes=(1,1))
Most things I tried were unavailable, i.e.:
styled.reindex(columns=yonly) returned AttributeError: 'Styler' object has no attribute 'reindex'
styled.columns = yonly returned AttributeError: 'list' object has no attribute 'get_indexer'
styled = styled[yonly] returned TypeError: 'Styler' object is not subscriptable
Follow-up from Colour specific cells from two columns that don't match, using python pandas style.where (or otherwise) and export to excel
After @jezrael's comment to remove the columns before styling and colouring, I got my answer :)
The solution was to pass an extra argument that makes the original dataframe df available, and to colour only the "_y" columns of df_tmp. :)
df = pd.DataFrame({
    'config_dummy1': ["dummytext"] * 10,
    'a_y': ["a"] * 10,
    'config_size_x': ["textstring"] * 10,
    'config_size_y': ["textstring"] * 10,
    'config_dummy2': ["dummytext"] * 10,
    'a_x': ["a"] * 10
})
df.at[5, 'config_size_x'] = "xandydontmatch"
df.at[9, 'config_size_y'] = "xandydontmatch"
df.at[0, 'a_x'] = "xandydontmatch"
df.at[3, 'a_y'] = "xandydontmatch"
print(df)

def color(x, extra):
    c1 = 'color: #ffffff; background-color: #ba3018'
    df1 = pd.DataFrame('', index=x.index, columns=x.columns)
    # select only columns ending with _x and _y, sorted
    cols = sorted(extra.filter(regex='_x$|_y$').columns)
    # loop over the pairs and assign the style by mask
    for colx, coly in zip(cols[::2], cols[1::2]):
        # paired columns
        m = extra[colx] != extra[coly]
        df1.loc[m, [coly]] = c1
    return df1

yonly = list(sorted(set(df.columns) - set(df.filter(regex='_x$').columns)))
df_tmp = df[yonly]
df_tmp.style.apply(color, axis=None, extra=df).to_excel('styled.xlsx', engine='openpyxl')
Thank you wonderful people of SO! :D
I'm not sure about re-sorting, but if you only want to remove (hide) some columns, use the Styler's hide_columns method.
For example, to hide columns 'A' and 'B':
styled.hide_columns(['A', 'B'])
I had a similar scenario in which I had to color the background of one dataframe based on another dataframe. I created a function that colors cells based on the value ranges of the other dataframe, as follows:
def colval(val, z1):
    df1 = pd.DataFrame('', index=val.index, columns=val.columns)  # dataframe of styles
    rows, cols = z1.shape
    for x in range(rows):
        for y in range(1, cols):
            # check the range in the dependent dataframe
            # and color the other one
            if z1.iloc[x, y] >= 4.5:
                df1.iloc[x, y] = 'background-color: red'
            elif z1.iloc[x, y] <= -4.5:
                df1.iloc[x, y] = 'background-color: yellow'
    return df1

df_tocol.style.apply(colval, axis=None, z1=diff_df)
Hope this helps!
I have some trouble processing a big CSV with pandas. The CSV consists of an index and about 450 other columns in groups of 3, something like this:
cola1 colb1 colc1 cola2 colb2 colc2 cola3 colb3 colc3
1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1 stra_1 ctrlb_1 retc_1
2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2 stra_2 ctrlb_2 retc_2
3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3 stra_3 ctrlb_3 retc_3
For each trio of columns I would like to analyze column B (a sort of "control" field) and, depending on its value, return a result computed from columns A and C.
Finally I need to return a concatenation of all the resulting columns, from 150 down to 1.
I already tried apply, but it seems too slow (10 minutes to process 50k rows).
df['Path'] = df.apply(lambda x: getFullPath(x), axis=1)
with an example function you can find here:
https://pastebin.com/S9QWTGGV
I tried extracting a list of unique combinations of cola, colb, colc, preprocessing that list, and applying map to generate the results, which speeds it up a little:
for i in range(1, 151):
    df['Concat' + str(i)] = df['cola' + str(i)] + '|' + df['colb' + str(i)] + '|' + df['colc' + str(i)]

concats = []
for i in range(1, 151):
    concats.append('Concat' + str(i))

ret = df[concats].values.ravel()
uniq = list(set(ret))

mapping = {}
for member in ret:
    mapping[member] = getPath2(member)

for i in range(1, MAX_COLS + 1):
    df['Res' + str(i)] = df['Concat' + str(i)].map(mapping)

df['Path'] = df.apply(getFullPath2, axis=1)
The functions getPath2 and getFullPath2 are defined, as an example, here:
https://pastebin.com/zpFF2wXD
But it still seems a bit slow (6 minutes to process everything).
Do you have any suggestions on how I could speed up the CSV processing?
I don't even know if my way of concatenating columns is the best one :) I tried Series.str.cat but couldn't work out how to chain only some columns instead of the full df.
Thanks very much!
Mic
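On the Series.str.cat aside above: str.cat can chain just a subset of columns by passing the remaining columns as its others argument. A small sketch (the column names mirror the example, the data is made up):

```python
import pandas as pd

df = pd.DataFrame({"cola1": ["stra_1"], "colb1": ["ctrlb_1"],
                   "colc1": ["retc_1"], "other": ["ignored"]})

cols = ["cola1", "colb1", "colc1"]  # only the columns to chain
# str.cat accepts a DataFrame of 'others', so the full df is not needed
joined = df[cols[0]].str.cat(df[cols[1:]], sep="|")
```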
Amended answer: I see from your criteria that you actually have multiple controls on each column. I think what works is to split these into three dataframes and apply your mapping as follows:
import pandas as pd

series = {
    'cola1': pd.Series(['D_1', 'C_1', 'E_1'], index=[1, 2, 3]),
    'colb1': pd.Series(['ret1', 'ret1', 'ret2'], index=[1, 2, 3]),
    'colc1': pd.Series(['B_1', 'C_2', 'B_3'], index=[1, 2, 3]),
    'cola2': pd.Series(['D_1', 'C_1', 'E_1'], index=[1, 2, 3]),
    'colb2': pd.Series(['ret3', 'ret1', 'ret2'], index=[1, 2, 3]),
    'colc2': pd.Series(['B_2', 'A_1', 'A_3'], index=[1, 2, 3]),
    'cola3': pd.Series(['D_1', 'C_1', 'E_1'], index=[1, 2, 3]),
    'colb3': pd.Series(['ret2', 'ret2', 'ret1'], index=[1, 2, 3]),
    'colc3': pd.Series(['A_1', 'B_2', 'C_3'], index=[1, 2, 3]),
}
your_df = pd.DataFrame(series, index=[1, 2, 3],
                       columns=['cola1', 'colb1', 'colc1',
                                'cola2', 'colb2', 'colc2',
                                'cola3', 'colb3', 'colc3'])

# Split your dataframe into three frames, one per column type
bframes = your_df[[col for col in your_df.columns if 'colb' in col]]
aframes = your_df[[col for col in your_df.columns if 'cola' in col]]
cframes = your_df[[col for col in your_df.columns if 'colc' in col]]
for df in [bframes, aframes, cframes]:
    df.columns = ['col1', 'col2', 'col3']

# Mapping criteria
def map_colb(c):
    if c == 'ret1':
        return 'A'
    elif c == 'ret2':
        return None
    else:
        return 'F'

def map_cola(a):
    if a.startswith('D_'):
        return 'D'
    else:
        return 'E'

def map_colc(c):
    if c.startswith('B_'):
        return 'B'
    elif c.startswith('C_'):
        return 'C'
    elif c.startswith('A_'):
        return None
    else:
        return 'F'

# Apply the mapping to each frame
aframes = aframes.applymap(map_cola)
bframes = bframes.applymap(map_colb)
cframes = cframes.applymap(map_colc)

# The trick here is filling None's from left to right in order of precedence
final = bframes.fillna(cframes.fillna(aframes))

# Then just combine them using whatever delimiter you like
# final.values.tolist() turns each row into a list
pathlist = ['|'.join(item) for item in final.values.tolist()]
This gives the result:
In [70]: pathlist
Out[70]: ['A|F|D', 'A|A|B', 'B|E|A']