Connecting means in seaborn box plot - python

I want to connect the means of my box plots with a line. I can draw the basic plot, but the line does not pass through the box plot means, and the box plots are offset from their positions on the x axis. There is a similar post, but it does not cover connecting the means: "Python: seaborn pointplot and boxplot in one plot but shifted on the x-axis".
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'pre_score': [4, 24, 31, 2, 3,25, 94, 57, 62, 70,5, 43, 23, 23, 51]
}
data = pd.DataFrame(raw_data, columns = ['first_name', 'pre_score'])
   first_name  pre_score
0       Jason          4
1       Molly         24
2        Tina         31
3        Jake          2
4         Amy          3
5       Jason         25
6       Molly         94
7        Tina         57
8        Jake         62
9         Amy         70
10      Jason          5
11      Molly         43
12       Tina         23
13       Jake         23
14        Amy         51
sns.set_style("ticks")
ax = sns.stripplot(x='first_name', y='pre_score', hue='first_name', jitter=True, dodge=True, size=6, zorder=0, alpha=0.5, linewidth =1, data=data)
ax = sns.boxplot(x='first_name', y='pre_score', hue='first_name', dodge=True, showfliers=True, linewidth=0.8, showmeans=True, data=data)
ax = sns.lineplot(x='first_name', y='pre_score', color='k', data=data.groupby(['first_name'], as_index=False).mean())
fig_size = [18.0, 10.0]
plt.rcParams["figure.figsize"] = fig_size
handles, labels = ax.get_legend_handles_labels()
legend_len = labels.__len__()
ax.legend(handles[int(legend_len/2):legend_len], labels[int(legend_len/2):legend_len], bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.1);
As the plot shows, the sns.lineplot line does not follow the means, and the box plots are offset from the names on the x axis.
How can I fix this?

When working with seaborn plots, I strongly recommend always providing order= (and hue_order= if applicable) to avoid nasty surprises where the categories do not show up in a consistent order between calls.
To answer your question, you can replace the lineplot with a pointplot, which automatically aggregates the values by category and connects them with a line. Note that the code below also drops hue= and dodge=True, since dodging by a hue that equals the x variable appears to be what shifts the boxes away from their tick positions.
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'pre_score': [4, 24, 31, 2, 3,25, 94, 57, 62, 70,5, 43, 23, 23, 51]
}
data = pd.DataFrame(raw_data, columns = ['first_name', 'pre_score'])
# define the order in which the categories will be plotted on the x-axis
order = np.sort(data['first_name'].unique()) # you could also create a list by hand if you want a specific order
sns.set_style("ticks")
ax = sns.stripplot(x='first_name', y='pre_score', order=order, jitter=True, size=6, zorder=0, alpha=0.5, linewidth =1, data=data)
ax = sns.boxplot(x='first_name', y='pre_score', order=order, showfliers=True, linewidth=0.8, showmeans=True, data=data)
ax = sns.pointplot(x='first_name', y='pre_score', order=order, data=data, ci=None, color='black')
If for some reason you don't want to or cannot use a seaborn function that takes an order argument, then aggregate by hand in pandas, and reindex() with your order to make sure the values appear in the right order in the dataframe before plotting with the tool of your choice.
For instance, you could replace the call to pointplot() above with:
# calculate the means and make sure they are in the same order as the boxplots
means = data.groupby('first_name')['pre_score'].mean().reindex(order)
ax.plot(means.index, means.values, 'ko-', lw=3)
and get the same result.
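As a quick check of the reindexing step, here is a minimal sketch (pandas and numpy only, using the data from the question) that prints each mean next to the x position of its category. It relies on seaborn's categorical plots placing the i-th entry of order at x = i; the name positions is introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

raw_data = {
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'] * 3,
    'pre_score': [4, 24, 31, 2, 3, 25, 94, 57, 62, 70, 5, 43, 23, 23, 51],
}
data = pd.DataFrame(raw_data)

# same ordering as used for the boxplot
order = np.sort(data['first_name'].unique())

# aggregate by hand and force the means into that order
means = data.groupby('first_name')['pre_score'].mean().reindex(order)

# illustrative: the x position at which each category is drawn
positions = np.arange(len(order))

for pos, (name, value) in zip(positions, means.items()):
    print(pos, name, round(value, 2))
```

Because the means are reindexed with the same order passed to the boxplot, the k-th mean belongs to the k-th box, whatever order the names have in the raw data.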

Related

How to plot distributions for several bivariate groups of variable using Python

I am analysing data which is organised as following:
There is a separate pandas data frame for each group (A, B and C).
Each data frame representing a group has 4 subgroups (columns), and the rows hold their corresponding observations.
For example, a single group of data looks like:
subgroup-1  subgroup-2  subgroup-3  subgroup-4
12          4           NaN         9
15          3           4           NaN
16          8           3           11
17          12          8           13
11          17          12          14
I want to visualise the distribution of each subgroup across the different groups. Can anyone tell me which options (chart types) are available in Python for this? Thanks.
I tried histograms and density plots, but all of them work only for 2 variables.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# pandas Dataframes
group_A = pd.DataFrame(np.random.rand(50, 4) , columns=['subgroup-1' , 'subgroup-2' , 'subgroup-3' , 'subgroup-4'])
group_B = pd.DataFrame(np.random.rand(50, 4) , columns=['subgroup-1' , 'subgroup-2' , 'subgroup-3' , 'subgroup-4'])
group_C = pd.DataFrame(np.random.rand(50, 4) , columns=['subgroup-1' , 'subgroup-2' , 'subgroup-3' , 'subgroup-4'])
def plot_hist(subgroup):
    np.random.seed(19680801)
    n_bins = 10
    x = np.dstack([group_A[subgroup] , group_B[subgroup] , group_C[subgroup]])[0]
    fig, axes = plt.subplots(nrows=2, ncols=2)
    ax0, ax1, ax2, ax3 = axes.flatten()
    ax0.hist(x, n_bins, density=True, histtype='bar', label = ['A', 'B', 'C'])
    ax0.legend(prop={'size': 10})
    ax0.set_title('bars with legend')
    ax1.hist(x, n_bins, density=True, histtype='bar', stacked=True)
    ax1.set_title('stacked bar')
    ax2.hist(x, n_bins, histtype='step', stacked=True, fill=False)
    ax2.set_title('stack step (unfilled)')
    # Make a multiple-histogram of data-sets with different length.
    x_multi = [np.random.randn(n) for n in [10000, 5000, 2000]]
    ax3.hist(x_multi, n_bins, histtype='bar')
    ax3.set_title('different sample sizes')
    fig.tight_layout()
    plt.show()

plot_hist('subgroup-1')
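Another option, sketched here under the assumption that all groups share the same subgroup columns, is to stack the frames into one long-format table and let seaborn draw one box per group within each subgroup. The column names group, subgroup and value are made up for this illustration, and the random numbers merely stand in for the real measurements.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(0)
cols = ['subgroup-1', 'subgroup-2', 'subgroup-3', 'subgroup-4']

# stand-in data: one frame per group, same columns in each
groups = {name: pd.DataFrame(rng.random((50, 4)), columns=cols) for name in 'ABC'}

# stack the groups and reshape to one row per observation
long_df = (
    pd.concat(groups, names=['group', 'obs'])
    .reset_index(level='group')
    .melt(id_vars='group', var_name='subgroup', value_name='value')
    .dropna(subset=['value'])  # real data contains NaN, which carries no information here
)

fig, ax = plt.subplots(figsize=(10, 5))
sns.boxplot(data=long_df, x='subgroup', y='value', hue='group', ax=ax)
ax.set_title('Distribution of each subgroup by group')
# call plt.show() or fig.savefig(...) to view the figure
```

With the data in this shape, swapping sns.boxplot for sns.violinplot gives a density-style view of the same comparison.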

How to make a multi-level chart column label by hue

This is a continuation of this question. But now I have a bar-chart with hue.
Here's what I have:
df = pd.DataFrame({'age': ['20-30', '20-30', '20-30', '30-40', '30-40', '30-40', '40-50', '40-50', '40-50', '50-60', '50-60', '50-60'],
'expenses':['50$', '100$', '200$', '50$', '100$', '200$', '50$', '100$', '200$', '50$', '100$', '200$'],
'users': [59, 42, 57, 68, 47, 98, 75, 73, 54, 81, 52, 43],
'buyers': [22, 35, 18, 27, 12, 57, 19, 29, 31, 47, 10, 5],
'percentage': [37.2881, 83.3333, 31.5789, 39.7058, 25.5319, 58.1632, 25.3333, 39.7260, 57.4074, 58.0246, 19.2307, 11.6279]})
index  age    expenses  users  buyers  percentage
0      20-30  50$       59     22      37.2881
1      20-30  100$      42     35      83.3333
2      20-30  200$      57     18      31.5789
3      30-40  50$       68     27      39.7058
4      30-40  100$      47     12      25.5319
5      30-40  200$      98     57      58.1632
6      40-50  50$       75     19      25.3333
7      40-50  100$      73     29      39.726
8      40-50  200$      54     31      57.4074
9      50-60  50$       81     47      58.0246
10     50-60  100$      52     10      19.2307
11     50-60  200$      43     5       11.6279
fig, ax = plt.subplots(figsize=(20, 10))
# Plot the all users
sns.barplot(x='age', y='users', data=df, hue='expenses', palette='Blues', edgecolor='grey', alpha=0.7, ax=ax)
# Plot the buyers
sns.barplot(x='age', y='buyers', data=df, hue='expenses', palette='Blues', edgecolor='darkgrey', hatch='//', ax=ax)
plt.show()
I need to get the same chart as before. However, with hue, the code:
# extract the separate containers
c1, c2 = ax.containers
# annotate with the users values
ax.bar_label(c1, fontsize=13)
# annotate with the buyer and percentage values
l2 = [f"{v.get_height()}: {df.loc[i, 'percentage']}%" for i, v in enumerate(c2)]
ax.bar_label(c2, labels=l2, fontsize=8, label_type='center', fontweight='bold')
no longer works.
I would be glad for any hints.
Each object in ax.containers represents the bars for a single hue group.
When using bar_label, the annotations for each bar in '50$', then '100$', and then '200$' are added.
I think it's easier to select the correct data by annotating the 'buyers' group separately.
The answer to your previous question selects the data from the entire dataframe, but here Boolean indexing is used to select only a segment of the dataframe. Using print(data) in each loop will help with understanding.
fig, ax = plt.subplots(figsize=(20, 10))
# plot the all users
sns.barplot(x='age', y='users', data=df, hue='expenses', palette='Blues', edgecolor='grey', alpha=0.7, ax=ax)
# annotate the bars in the 3 containers (1 container per hue group)
for c in ax.containers:
    ax.bar_label(c)
# plot the 'buyers', which adds 3 more containers to ax
sns.barplot(x='age', y='buyers', data=df, hue='expenses', palette='Blues', edgecolor='darkgrey', hatch='//', ax=ax)
# iterate through the last 3 new containers containing the hatched groups
for c in ax.containers[3:]:
    # get the hue label, which will be used to select the data group
    hue_label = c.get_label()
    # select the data based on hue_label
    data = df.loc[df.expenses.eq(hue_label), ['buyers', 'percentage']]
    # customize the labels
    labels = [f"{v.get_height()}: {data.iloc[i, 1]:0.2f}%" for i, v in enumerate(c)]
    # add the labels
    ax.bar_label(c, labels=labels)
plt.show()
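To see what the Boolean indexing in the loop selects, the following sketch (pandas only, no plotting) builds equivalent label strings directly from the dataframe rather than from the bar heights. It assumes the rows of each expenses group are already ordered by age, as they are in the example df; labels_by_hue is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'age': ['20-30'] * 3 + ['30-40'] * 3 + ['40-50'] * 3 + ['50-60'] * 3,
    'expenses': ['50$', '100$', '200$'] * 4,
    'buyers': [22, 35, 18, 27, 12, 57, 19, 29, 31, 47, 10, 5],
    'percentage': [37.2881, 83.3333, 31.5789, 39.7058, 25.5319, 58.1632,
                   25.3333, 39.7260, 57.4074, 58.0246, 19.2307, 11.6279],
})

labels_by_hue = {}
for hue_label in df['expenses'].unique():
    # keep only the rows that belong to this hue group
    data = df.loc[df['expenses'].eq(hue_label), ['buyers', 'percentage']]
    # one label per age bin, in the order the bars are drawn
    labels_by_hue[hue_label] = [
        f"{b}: {p:0.2f}%" for b, p in zip(data['buyers'], data['percentage'])
    ]

for hue_label, labels in labels_by_hue.items():
    print(hue_label, labels)
```

Each list holds one entry per age bin, which matches the number of bars in the corresponding container.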

How to add column labels to graphs

I was wondering whether every graph in this example can be annotated automatically, using the column headers as labels.
import seaborn as sns
import pandas as pd
d = {'a': [100, 125, 300, 520],..., 'z': [250, 270, 278, 248]}
df = pd.DataFrame(data=d, index=[25, 26, 26, 30])
a ... z
25 100 ... 250
26 125 ... 270
26 300 ... 278
30 520 ... 248
When I use this code, I only get the column headers as a legend. However, I want the labels to be directly beside/above my graphs.
sns.lineplot(data=df, dashes=False, estimator=None)
Is this what you are looking for?
ax = sns.lineplot(data=df, dashes=False, estimator=None, legend=False)
# Series.iteritems() was removed in pandas 2.0; items() works in old and new versions
for label, pos in df.iloc[0].items():
    ax.annotate(label, (df.index[0], pos*1.05), ha='left', va='bottom')
Something like:
ax = sns.lineplot(data=df, dashes=False, estimator=None)
for c, l in zip(df.columns, ax.lines):
    y = l.get_ydata()
    ax.annotate(f'{c}', xy=(1.01,y[-1]), xycoords=('axes fraction', 'data'),
                ha='left', va='center', color=l.get_color())
Source: https://stackoverflow.com/a/62703420/15239951
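For reference, here is a self-contained variant of the end-of-line labelling approach using plain matplotlib. The dictionary in the question is elided, so the small frame below is made-up sample data (the columns a, b, z and a non-duplicated index are placeholders). Each label is placed just to the right of the axes, at the height of the last point of its line.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# made-up sample data; the original dictionary is elided in the question
df = pd.DataFrame(
    {'a': [100, 125, 300, 520], 'b': [180, 210, 190, 400], 'z': [250, 270, 278, 248]},
    index=[25, 26, 27, 30],
)

fig, ax = plt.subplots()
for column in df.columns:
    (line,) = ax.plot(df.index, df[column])
    ax.annotate(
        column,
        xy=(1.01, df[column].iloc[-1]),      # x in axes fraction, y in data units
        xycoords=('axes fraction', 'data'),
        ha='left', va='center',
        color=line.get_color(),              # match the label to its line
    )

fig.subplots_adjust(right=0.9)  # leave room for the labels outside the axes
```

Mixing 'axes fraction' for x with 'data' for y keeps the labels in a tidy column regardless of the x range.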

Pandas dataframe | groupby plotting | stacked and side by side graph

I come from an R ggplot2 background and am a bit confused by plotting in matplotlib.
Here is my dataframe:
languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})
language county count
0 en us 32
1 cs ch 432
2 es sp 43
3 pt br 55
4 hi in 6
5 en fr 23
6 es ar 455
7 es pr 23
Now I want to plot:
1. A stacked bar chart where the x axis shows the language and the y axis shows the count. The total bar height shows the total count for that language, and the stacked segments show the countries for that language.
2. A side-by-side bar chart with the same parameters, except that the countries are shown next to each other instead of stacked.
Most examples plot directly from the dataframe, but I want to build the plot step by step in a script so that I have more control over it and can edit whatever I want, something like this script:
ind = np.arange(df.languages.nunique())
width = 0.35
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(ind, df.languages, width, color='r')
ax.bar(ind, df.count, width,bottom=df.languages, color='b')
ax.set_ylabel('Count')
ax.set_title('Score y language and country')
ax.set_xticks(ind, df.languages)
ax.set_yticks(np.arange(0, 81, 10))
ax.legend(labels=[df.countries])
plt.show()
By the way, here is my pandas pivot code for the same plot:
df.pivot(index = "Language", columns = "Country", values = "count").plot.bar(figsize=(15,10))
plt.xticks(rotation = 0,fontsize=18)
plt.xlabel('Language' )
plt.ylabel('Count ')
plt.legend(fontsize='large', ncol=2,handleheight=1.5)
plt.show()
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})
# build one row per language with the number of countries and the summed count
modified = {}
modified['language'] = np.unique(df.language)
country_count = []
total_count = []
for x in modified['language']:
    country_count.append(len(df[df['language']==x]))
    total_count.append(df[df['language']==x]['count'].sum())
modified['country_count'] = country_count
modified['total_count'] = total_count
mod_df = pd.DataFrame(modified)
print(mod_df)
ind = mod_df.language
width = 0.35
p1 = plt.bar(ind,mod_df.total_count, width)
p2 = plt.bar(ind,mod_df.country_count, width,
bottom=mod_df.total_count)
plt.ylabel("Total count")
plt.xlabel("Languages")
plt.legend((p1[0], p2[0]), ('Total Count', 'Country Count'))
plt.show()
The code first reshapes the dataframe into the one below:
language country_count total_count
0 cs 1 432
1 en 2 55
2 es 3 521
3 hi 1 6
4 pt 1 55
In the resulting plot, the country count is small compared with the total count, so the stacked country-count segment is hard to see.
import seaborn as sns
import matplotlib.pyplot as plt
figure, axis = plt.subplots(1,1,figsize=(10,5))
sns.barplot(x="language",y="count",data=df,ci=None)#,hue='county')
axis.set_title('Score y language and country')
axis.set_ylabel('Count')
axis.set_xlabel("Language")
sns.countplot(x=df.language,data=df)
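If the goal is one bar per language whose segments are the individual countries, one step-by-step way to build it is sketched below. It reads the question as asking for per-country counts inside each language bar, which is an assumption. The data is pivoted first, and then the bottom of each segment is accumulated by hand; pivot, bottom and width are illustrative names. The second axes shows the side-by-side version, obtained by shifting each country's bars.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'language': ['en', 'cs', 'es', 'pt', 'hi', 'en', 'es', 'es'],
    'county': ['us', 'ch', 'sp', 'br', 'in', 'fr', 'ar', 'pr'],
    'count': [32, 432, 43, 55, 6, 23, 455, 23],
})

# one row per language, one column per country, 0 where a pair does not occur
pivot = df.pivot_table(index='language', columns='county', values='count',
                       aggfunc='sum', fill_value=0)

fig, (ax_stack, ax_side) = plt.subplots(1, 2, figsize=(14, 5))
x = np.arange(len(pivot))

# stacked: each country's segment starts where the previous ones ended
bottom = np.zeros(len(pivot))
for country in pivot.columns:
    heights = pivot[country].to_numpy()
    ax_stack.bar(x, heights, width=0.6, bottom=bottom, label=country)
    bottom += heights

# side by side: shift each country's bars by a fraction of the slot width
width = 0.8 / len(pivot.columns)
for i, country in enumerate(pivot.columns):
    ax_side.bar(x + i * width, pivot[country].to_numpy(), width=width, label=country)

ax_stack.set_xticks(x)
ax_side.set_xticks(x + 0.4 - width / 2)  # centre the tick under each group of bars
for ax in (ax_stack, ax_side):
    ax.set_xticklabels(pivot.index)
    ax.set_xlabel('Language')
    ax.set_ylabel('Count')
    ax.legend(title='Country', ncol=2)
```

After the stacked loop finishes, bottom holds the total count per language, which is the full height of each stacked bar.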

How can I iterate through a CSV file and plot it to a boxplot by each column representing a second in Python?

Say I have a csv file like so:
20 30 33 54 12 56
90 54 66 12 88 11
33 22 63 86 12 65
11 44 65 34 23 26
I want to create a boxplot in which each column represents one second on the x-axis, with the actual data on the y-axis. So 20, 90, 33, 11 would form the box at 1 second, 30, 54, 22, 44 the box at 2 seconds, and so on. The real csv file has more data than this, and I am not sure how many data sets it contains, so I can't hard-code anything.
This is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/user/Desktop/test.csv', header = None)
fig = plt.figure()
ax = fig.add_subplot()
plt.xlabel('Time (s)')
plt.ylabel('ms')
df.boxplot()
plt.show()
Try this:
axes = df.groupby(df.columns//10, axis=1).boxplot(subplots=True,
figsize=(12,18))
plt.xlabel('Time (s)')
plt.ylabel('ms')
plt.show()
If you want to set y limits of the subplots:
for ax in axes.flatten():
    ax.set_ylim(0,100)
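If all columns should appear in a single plot, with the column position as the time in seconds, a minimal sketch could look like this. It assumes the file is whitespace-separated as in the sample (drop the sep argument for a comma-separated file) and that the first column corresponds to 1 second; io.StringIO stands in for the real file path.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# stand-in for the file on disk
csv_text = """20 30 33 54 12 56
90 54 66 12 88 11
33 22 63 86 12 65
11 44 65 34 23 26
"""

df = pd.read_csv(io.StringIO(csv_text), header=None, sep=r"\s+")

# label the columns 1..n so the x axis reads as seconds, however many columns there are
df.columns = range(1, df.shape[1] + 1)

fig, ax = plt.subplots()
df.boxplot(ax=ax)  # one box per column
ax.set_xlabel('Time (s)')
ax.set_ylabel('ms')
```

Nothing is hard-coded: the number of boxes follows from df.shape[1], so the same code handles a file with any number of columns.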
