How can I plot standard error bars with seaborn relplot? - python

I am studying different variables and I want to plot the results together with their standard error.
I filter the data because, depending on what I want to analyse, I am interested in plotting just one mineral, or just one material, etc. I mention this because it matters for the error bars. I could not get seaborn to plot the error bars from the raw data (I passed ci='' to the seaborn function, but it does not work), so I calculated the mean and standard error in Excel and plot those directly. The table below contains the averages and standard errors that I use in the script.
Adding ci to the seaborn call does nothing, so I want to add the error bars externally in a second step, but with ax.errorbar() I cannot plot the standard error either.
import os
import io
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FormatStrFormatter

# means and standard errors pre-computed in Excel
Results = pd.read_excel('results.xlsx', sheet_name='Sheet1', usecols="A:J")
df = pd.DataFrame(Results)

# keep only the minerals of interest
RR_filtered = Results[(Results['Mineral '] == 'IC60') | (Results['Mineral '] == 'MinFree')]

# material A
R_filtered = RR_filtered[RR_filtered['Material'] == 'A']
palette = ["#fdae61", "#abd9e9"]
sns.set_palette(palette)
ax1 = sns.relplot(data=R_filtered, x="Impeller speed (rpm)", y="Result",
                  col="Media size ", hue="Mineral content (g/g fibre)",
                  palette=palette, size="Media size ", sizes=(50, 200))

# material B
R2_filtered = RR_filtered[RR_filtered['Material'] == 'B']
ax2 = sns.relplot(data=R2_filtered, x="Impeller speed (rpm)", y="Result",
                  col="Media size ", hue="Mineral content (g/g fibre)",
                  palette=palette, size="Media size ", sizes=(50, 200))
plt.show()
The data (means and standard errors calculated in Excel):
Media size Material Impeller speed (rpm) Energy input (kWh/t) Mineral Mineral content (g/g fibre) Result ster
1.7 A 400 3000 IC60 4 3.42980002276166 0.21806853183829
1.7 A 650 3000 IC60 4 5.6349292302978 0.63877270588513
1.7 A 900 3000 IC60 4 6.1386616444364 0.150420705145224
1.7 A 1150 3000 IC60 4 5.02677117937851 1.05459146256349
1.7 A 1400 3000 IC60 4 3.0654271029038 0.917937247698497
3 A 400 3000 IC60 4 8.06973541574516 2.07869756201064
3 A 650 3000 IC60 4 4.69110601906018 1.21725878149246
3 A 900 3000 IC60 4 10.2119514553564 1.80680816945106
3 A 1150 3000 IC60 4 7.3271067522139 0.438931805677489
3 A 1400 3000 IC60 4 4.86901883487513 2.04826541508181
1.7 A 400 3000 MinFree 0 1.30614274245145 0.341512517371074
1.7 A 650 3000 MinFree 0 0.80632268273782 0.311762840996982
1.7 A 900 3000 MinFree 0 1.35958635068886 0.360649049944933
1.7 A 1150 3000 MinFree 0 1.38784671261469 0.00524838126778526
1.7 A 1400 3000 MinFree 0 1.12365621425779 0.561737044169193
3 A 400 3000 MinFree 0 4.61104587078813 0.147526557483362
3 A 650 3000 MinFree 0 4.40934493149759 0.985706944001226
3 A 900 3000 MinFree 0 5.06333415444978 0.00165055503033251
3 A 1150 3000 MinFree 0 3.85940865344646 0.731238210429852
3 A 1400 3000 MinFree 0 3.75572328102963 0.275897272330075
3 A 400 3000 GIC 4 6.05239906571977 0.0646300937591957
3 A 650 3000 GIC 4 7.9023202316634 0.458062146361444
3 A 900 3000 GIC 4 6.97774277141699 0.171777036954104
3 A 1150 3000 GIC 4 11.0705742735252 1.3960974547215
3 A 1400 3000 GIC 4 9.37948091546579 0.0650589433632627
1.7 A 869 3000 IC60 4 2.39416757908564 0.394947207603093
3 A 859 3000 IC60 4 10.2373958352881 1.55162686552938
1.7 A 885 3000 BHX 4 87.7569689333017 10.2502550323564
3 A 918 3000 BHX 4 104.135074642339 4.77467275433362
1.7 B 400 3000 MinFree 0 1.87573877068556 0.34648345153664
1.7 B 650 3000 MinFree 0 1.99555403904079 0.482200923313764
1.7 B 900 3000 MinFree 0 2.54989484285768 0.398071770532481
1.7 B 1150 3000 MinFree 0 3.67636872311402 0.662270521850053
1.7 B 1400 3000 MinFree 0 3.5664978541551 0.164453275639932
3 B 400 3000 MinFree 0 2.62948341485392 0.0209463845730038
3 B 650 3000 MinFree 0 3.0066638279753 0.305024483713006
3 B 900 3000 MinFree 0 2.79255446831386 0.472851866083359
3 B 1150 3000 MinFree 0 5.64970870330824 0.251859240942665
3 B 1400 3000 MinFree 0 7.40595580787647 0.629256778750272
1.7 B 400 3000 IC60 4 0.38040036521839 0.231869270120922
1.7 B 650 3000 IC60 4 0.515922221163329 0.434661621954815
1.7 B 900 3000 IC60 4 3.06358032815653 0.959408177590503
1.7 B 1150 3000 IC60 4 4.04800689693192 0.255594912271896
1.7 B 1400 3000 IC60 4 3.69967975589305 0.469944383688801
3 B 400 3000 IC60 4 1.35706340378197 0.134829945730943
3 B 650 3000 IC60 4 1.91317966458018 1.77106692180411
3 B 900 3000 IC60 4 0.874227487043329 0.493348110823194
3 B 1150 3000 IC60 4 2.71732337235447 0.0703901684702626
3 B 1400 3000 IC60 4 4.96743231003956 0.45853815499614
3 B 400 3000 GIC 4 0.325743752029247 0.325743752029247
3 B 650 3000 GIC 4 3.12776074994155 0.452049425276085
3 B 900 3000 GIC 4 3.25564762321322 0.319567445434468
3 B 1150 3000 GIC 4 5.99730462724499 1.03439035936441
3 B 1400 3000 GIC 4 7.51312624370307 0.38399627585515

Tested in python 3.11, pandas 1.5.2, matplotlib 3.6.2, seaborn 0.12.1
Sample DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = {'Media size': [1.7, 1.7, 1.7, 1.7, 1.7, 3.0, 3.0, 3.0, 3.0, 3.0, 1.7, 1.7, 1.7, 1.7, 1.7, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 1.7, 3.0, 1.7, 3.0, 1.7, 1.7, 1.7, 1.7, 1.7, 3.0, 3.0, 3.0, 3.0, 3.0, 1.7, 1.7, 1.7, 1.7, 1.7, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
'Material': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
'Impeller speed (rpm)': [400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400, 869, 859, 885, 918, 400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400, 400, 650, 900, 1150, 1400],
'Energy input (kWh/t)': [3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000, 3000],
'Mineral': ['IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'GIC', 'GIC', 'GIC', 'GIC', 'GIC', 'IC60', 'IC60', 'BHX', 'BHX', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'MinFree', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'IC60', 'GIC', 'GIC', 'GIC', 'GIC', 'GIC'],
'Mineral content (g/g fibre)': [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], 'Result': [3.42980002276166, 5.6349292302978, 6.1386616444364, 5.02677117937851, 3.0654271029038, 8.06973541574516, 4.69110601906018, 10.2119514553564, 7.3271067522139, 4.86901883487513, 1.30614274245145, 0.80632268273782, 1.35958635068886, 1.38784671261469, 1.12365621425779, 4.61104587078813, 4.40934493149759, 5.06333415444978, 3.85940865344646, 3.75572328102963, 6.05239906571977, 7.9023202316634, 6.97774277141699, 11.0705742735252, 9.37948091546579, 2.39416757908564, 10.2373958352881, 87.7569689333017, 104.135074642339, 1.87573877068556, 1.99555403904079, 2.54989484285768, 3.67636872311402, 3.5664978541551, 2.62948341485392, 3.0066638279753, 2.79255446831386, 5.64970870330824, 7.40595580787647, 0.38040036521839, 0.515922221163329, 3.06358032815653, 4.04800689693192, 3.69967975589305, 1.35706340378197, 1.91317966458018, 0.874227487043329, 2.71732337235447, 4.96743231003956, 0.325743752029247, 3.12776074994155, 3.25564762321322, 5.99730462724499, 7.51312624370307],
'ster': [0.21806853183829, 0.63877270588513, 0.150420705145224, 1.05459146256349, 0.917937247698497, 2.07869756201064, 1.21725878149246, 1.80680816945106, 0.438931805677489, 2.04826541508181, 0.341512517371074, 0.311762840996982, 0.360649049944933, 0.0052483812677852, 0.561737044169193, 0.147526557483362, 0.985706944001226, 0.0016505550303325, 0.731238210429852, 0.275897272330075, 0.0646300937591957, 0.458062146361444, 0.171777036954104, 1.3960974547215, 0.0650589433632627, 0.394947207603093, 1.55162686552938, 10.2502550323564, 4.77467275433362, 0.34648345153664, 0.482200923313764, 0.398071770532481, 0.662270521850053, 0.164453275639932, 0.0209463845730038, 0.305024483713006, 0.472851866083359, 0.251859240942665, 0.629256778750272, 0.231869270120922, 0.434661621954815, 0.959408177590503, 0.255594912271896, 0.469944383688801, 0.134829945730943, 1.77106692180411, 0.493348110823194, 0.0703901684702626, 0.45853815499614, 0.325743752029247, 0.452049425276085, 0.319567445434468, 1.03439035936441, 0.38399627585515]}
df = pd.DataFrame(data)
Map plt.errorbar onto sns.relplot
# filter the dataframe by Mineral
filtered = df[(df['Mineral']=='IC60') | (df['Mineral']=='MinFree')]
# plot the filtered dataframe
g = sns.relplot(data=filtered, x="Impeller speed (rpm)", y="Result", col="Media size", row='Material', hue="Mineral content (g/g fibre)", size="Media size", sizes=(50, 200))
# add the errorbars
g.map(plt.errorbar, "Impeller speed (rpm)", "Result", "ster", marker="none", color='r', ls='none')
Specify color for each group of errorbars
plt.errorbar only accepts a single value for its color parameter. In order to match the colors of the palette, the data for each facet must be selected, and the correct color for that group passed to color.
Note that error bars smaller than the markers will not be visible.
# create a palette dictionary for the unique values in the hue column
palette = dict(zip(filtered['Mineral content (g/g fibre)'].unique(), ["#fdae61", "#abd9e9"]))
# plot the filtered dataframe
g = sns.relplot(data=filtered, x="Impeller speed (rpm)", y="Result", col="Media size", row='Material', hue="Mineral content (g/g fibre)", size="Media size", sizes=(50, 200), palette=palette)
# iterate through each facet of the facetgrid
for (material, media), ax in g.axes_dict.items():
    # select the data for the facet
    data = filtered[filtered['Material'].eq(material) & filtered['Media size'].eq(media)]
    # select the data for each hue group
    for group, selected in data.groupby('Mineral content (g/g fibre)'):
        # plot the errorbar with the correct color for each group
        ax.errorbar(data=selected, x="Impeller speed (rpm)", y="Result", yerr="ster",
                    marker="none", color=palette[group], ls='none')
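If the raw replicate measurements are available (rather than the pre-computed means and ster values), seaborn 0.12+ can compute and draw the standard error itself, but only for line plots. A minimal sketch, assuming a hypothetical long-form frame raw with one row per replicate and the same column names as above:
import seaborn as sns
# raw is assumed to hold one row per replicate measurement (it is not part of the posted data)
g = sns.relplot(data=raw, x="Impeller speed (rpm)", y="Result",
                col="Media size", row="Material", hue="Mineral content (g/g fibre)",
                kind="line",               # the errorbar option only applies to line plots
                errorbar="se",             # standard error of the mean
                err_style="bars",          # draw bars instead of a shaded band
                marker="o", linestyle="")  # show markers without connecting lines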

Related

How To Categorize a List

Having this list:
list_price = ['1800','5060','6300','6800','10800','3000','7100']
how do I categorize the list into the bins (1000, 2000, 3000, 4000, 5000, 6000, 7000, 000)?
Example:
2000: 1800
7000: 6800, 6300
And count them, e.g. 2000 (1), 7000 (2), if possible using pandas.
Using rounding to the upper thousand:
list_price = ['1800','5060','6300','6800','10800','3000','7100']
out = (pd.Series(list_price).astype(int)
         .sub(1).floordiv(1000)
         .add(1).mul(1000)
         .value_counts()
       )
output:
7000     2
2000     1
6000     1
11000    1
3000     1
8000     1
dtype: int64
Intermediate without value_counts:
0     2000
1     6000
2     7000
3     7000
4    11000
5     3000
6     8000
dtype: int64
I assumed the 000 at the end of the categories means 10000. Try:
s = pd.Series(list_price).astype(int)
cut = pd.cut(s, bins=(1000, 2000, 3000, 4000, 5000, 6000, 7000, 10000))
s.groupby(cut).count()

Map counts of a numerical column from a new DataFrame to the bin range column of training data

I am trying to get the counts of the Age column and append them to the bin-range column I created. I can do this for the training df and want to do the same with the prediction data. How do I map the counts of the Age column from the prediction data onto the Age_bin column in my training data? The first image is my output DF, the second one is the sample DF. I can get the counts using value_counts() on the file I am reading.
[Images in the original post: bin and count from the training data; the training data; the prediction data; the final output.]
The Data
import pandas as pd
data = {
    0: 0,
    11: 1500,
    12: 1000,
    22: 3000,
    32: 35000,
    34: 40000,
    44: 55000,
    65: 7000,
    80: 8000,
    100: 1000000,
}
df = pd.DataFrame(data.items(), columns=['Age', 'Salary'])
Age Salary
0 0 0
1 11 1500
2 12 1000
3 22 3000
4 32 35000
5 34 40000
6 44 55000
7 65 7000
8 80 8000
9 100 1000000
The Code
bins = [-0.1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# create a "binned" column
df['binned'] = pd.cut(df['Age'], bins)
# add bin count
df['count'] = df.groupby('binned')['binned'].transform('count')
The Output
Age Salary binned count
0 0 0 (-0.1, 10.0] 1
1 11 1500 (10.0, 20.0] 2
2 12 1000 (10.0, 20.0] 2
3 22 3000 (20.0, 30.0] 1
4 32 35000 (30.0, 40.0] 2
5 34 40000 (30.0, 40.0] 2
6 44 55000 (40.0, 50.0] 1
7 65 7000 (60.0, 70.0] 1
8 80 8000 (70.0, 80.0] 1
9 100 1000000 (90.0, 100.0] 1
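To bring the counts in from a separate prediction frame instead of from the training data itself, one approach is to cut the prediction ages with the same bin edges, count them per bin, and map those counts back onto the training rows. A sketch, assuming a hypothetical prediction frame pred with an Age column; converting the intervals to strings keeps the mapping straightforward:
# hypothetical prediction data; only the Age column is needed
pred = pd.DataFrame({'Age': [5, 15, 16, 33, 47, 81]})
# count the prediction ages per bin, using the same bin edges as above
pred_counts = pd.cut(pred['Age'], bins).astype(str).value_counts()
# map those counts onto the training rows via their bin label; bins with no predictions get 0
df['pred_count'] = df['binned'].astype(str).map(pred_counts).fillna(0).astype(int)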

Order columns in pandas dataframe

I have pandas dataframe as below:
import pandas as pd
import numpy as np
df = pd.DataFrame({'CATEGORY': [1, 1, 2, 2],
                   'GROUP': ['A', 'A', 'B', 'B'],
                   'XYZ': [3000, 2500, 3000, 3000],
                   'VAL': [3000, 2500, 3000, 3000],
                   'A_CLASS': [3000, 2500, 3000, 3000],
                   'B_CAL': [3000, 4500, 3000, 1000],
                   'C_CLASS': [3000, 2500, 3000, 3000],
                   'A_CAL': [3000, 2500, 3000, 3000],
                   'B_CLASS': [3000, 4500, 3000, 500],
                   'C_CAL': [3000, 2500, 3000, 3000],
                   'ABC': [3000, 2500, 3000, 3000]})
df
CATEGORY GROUP XYZ VAL A_CLASS B_CAL C_CLASS A_CAL B_CLASS C_CAL ABC
1 A 3000 3000 3000 3000 3000 3000 3000 3000 3000
1 A 2500 2500 2500 4500 2500 2500 4500 2500 2500
2 B 3000 3000 3000 3000 3000 3000 3000 3000 3000
2 B 3000 3000 3000 1000 3000 3000 500 3000 3000
I want the columns in the following order in my final dataframe:
GROUP, CATEGORY, all columns with suffix "_CAL", all columns with suffix "_CLASS", all other fields
My expected output:
GROUP CATEGORY B_CAL A_CAL C_CAL A_CLASS C_CLASS B_CLASS XYZ VAL ABC
A 1 3000 3000 3000 3000 3000 3000 3000 3000 3000
A 1 4500 2500 2500 2500 2500 4500 2500 2500 2500
B 2 3000 3000 3000 3000 3000 3000 3000 3000 3000
B 2 1000 3000 3000 3000 3000 500 3000 3000 3000
Fun with sorted:
first = ['GROUP', 'CATEGORY']
cols = sorted(df.columns.difference(first),
              key=lambda x: (not x.endswith('_CAL'), not x.endswith('_CLASS')))
df[first + cols]
GROUP CATEGORY A_CAL B_CAL C_CAL A_CLASS B_CLASS C_CLASS ABC VAL \
0 A 1 3000 3000 3000 3000 3000 3000 3000 3000
1 A 1 2500 4500 2500 2500 4500 2500 2500 2500
2 B 2 3000 3000 3000 3000 3000 3000 3000 3000
3 B 2 3000 1000 3000 3000 500 3000 3000 3000
XYZ
0 3000
1 2500
2 3000
3 3000
You just need to play with strings
cols = df.columns
cols_sorted = ["GROUP", "CATEGORY"] + \
    [col for col in cols if col.endswith('_CAL')] + \
    [col for col in cols if col.endswith('_CLASS')]
cols_sorted += sorted([col for col in cols if col not in cols_sorted])
df = df[cols_sorted]
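For reference, with the sample frame above this puts the columns in the following order (inside each suffix group the columns simply keep their original order):
print(df.columns.tolist())
# ['GROUP', 'CATEGORY', 'B_CAL', 'A_CAL', 'C_CAL',
#  'A_CLASS', 'C_CLASS', 'B_CLASS', 'ABC', 'VAL', 'XYZ']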

Grouping data into bins

I want to subset the following data frame df into bins of a size 50:
ID FREQ
0 358081 6151
1 431511 952
2 410632 350
3 398149 220
4 177791 158
5 509179 151
6 485346 99
7 536655 50
8 389180 51
9 406622 45
10 410191 112
The result should be this one:
FREQ_BIN QTY_IDs
>200 3
150-200 2
100-150 1
50-100 3
<50 1
How can I do it? Should I use groupBy or any other approach?
You could use pd.cut.
import numpy as np

df.groupby(pd.cut(df.FREQ,
                  bins=[-np.inf, 50, 100, 150, 200, np.inf],
                  right=False)
          ).size()
right=False ensures that we take half-open intervals, as your expected output suggests; unlike with np.digitize, we need to include np.inf in the bins for the "infinite" endpoints.
Demo
>>> df.groupby(pd.cut(df.FREQ,
...                   bins=[-np.inf, 50, 100, 150, 200, np.inf],
...                   right=False)
...           ).size()
FREQ
[-inf, 50) 1
[50, 100) 3
[100, 150) 1
[150, 200) 2
[200, inf) 4
dtype: int64
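If you also want the bin labels from your expected output (>200, 150-200, and so on), pd.cut accepts a labels argument; a small sketch along those lines:
labels = ['<50', '50-100', '100-150', '150-200', '>200']
out = df.groupby(pd.cut(df.FREQ,
                        bins=[-np.inf, 50, 100, 150, 200, np.inf],
                        right=False,
                        labels=labels)
                ).size()[::-1]   # reverse so the >200 bin comes first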

How to calculate percentages for particular rows for given columns using python pandas?

student,total,m1,m2,m3
a,500,120,220,160
b,600,180,120,200
This is my dataframe, and I just want to calculate the m1, m2, m3 columns as percentages of the total column. I need output like the following dataframe:
student,total,m1,m2,m3,m1(%),m2(%),m3(%)
a,500,120,220,160,24,44,32
...
For example, the m1(%) column is calculated as (m1/total)*100.
I think you can use div:
df = pd.DataFrame({'total': {0: 500, 1: 600},
                   'm1': {0: 120, 1: 180},
                   'm3': {0: 160, 1: 200},
                   'student': {0: 'a', 1: 'b'},
                   'm2': {0: 220, 1: 120}},
                  columns=['student', 'total', 'm1', 'm2', 'm3'])
print(df)
student total m1 m2 m3
0 a 500 120 220 160
1 b 600 180 120 200
df[['m1(%)','m2(%)','m3(%)']] = df[['m1','m2','m3']].div(df.total, axis=0)*100
print(df)
student total m1 m2 m3 m1(%) m2(%) m3(%)
0 a 500 120 220 160 24.0 44.0 32.000000
1 b 600 180 120 200 30.0 20.0 33.333333
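The sample output in the question shows whole-number percentages (24, 44, 32); if that is what's wanted, a rounding step can be chained onto the same assignment (a small sketch, replacing the line above):
df[['m1(%)', 'm2(%)', 'm3(%)']] = (df[['m1', 'm2', 'm3']]
                                   .div(df.total, axis=0)
                                   .mul(100)
                                   .round()
                                   .astype(int))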
