How to make a multi-level chart column label by hue - python

This is a continuation of this question, but now I have a bar chart with hue.
Here's what I have:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({'age': ['20-30', '20-30', '20-30', '30-40', '30-40', '30-40', '40-50', '40-50', '40-50', '50-60', '50-60', '50-60'],
                   'expenses': ['50$', '100$', '200$', '50$', '100$', '200$', '50$', '100$', '200$', '50$', '100$', '200$'],
                   'users': [59, 42, 57, 68, 47, 98, 75, 73, 54, 81, 52, 43],
                   'buyers': [22, 35, 18, 27, 12, 57, 19, 29, 31, 47, 10, 5],
                   'percentage': [37.2881, 83.3333, 31.5789, 39.7058, 25.5319, 58.1632, 25.3333, 39.7260, 57.4074, 58.0246, 19.2307, 11.6279]})
      age expenses  users  buyers  percentage
0   20-30      50$     59      22     37.2881
1   20-30     100$     42      35     83.3333
2   20-30     200$     57      18     31.5789
3   30-40      50$     68      27     39.7058
4   30-40     100$     47      12     25.5319
5   30-40     200$     98      57     58.1632
6   40-50      50$     75      19     25.3333
7   40-50     100$     73      29     39.7260
8   40-50     200$     54      31     57.4074
9   50-60      50$     81      47     58.0246
10  50-60     100$     52      10     19.2307
11  50-60     200$     43       5     11.6279
fig, ax = plt.subplots(figsize=(20, 10))
# Plot the all users
sns.barplot(x='age', y='users', data=df, hue='expenses', palette='Blues', edgecolor='grey', alpha=0.7, ax=ax)
# Plot the buyers
sns.barplot(x='age', y='buyers', data=df, hue='expenses', palette='Blues', edgecolor='darkgrey', hatch='//', ax=ax)
plt.show()
I need to get the same chart as in the previous question. With hue, however, this code:
# extract the separate containers
c1, c2 = ax.containers
# annotate with the users values
ax.bar_label(c1, fontsize=13)
# annotate with the buyer and percentage values
l2 = [f"{v.get_height()}: {df.loc[i, 'percentage']}%" for i, v in enumerate(c2)]
ax.bar_label(c2, labels=l2, fontsize=8, label_type='center', fontweight='bold')
no longer works.
I would be glad for any hints.

Each object in ax.containers represents the bars for a single hue group. With three hue levels, each barplot call adds three containers, so ax.containers holds six objects after both calls and c1, c2 = ax.containers can no longer unpack them.
bar_label annotates one container at a time, so the annotations are added for every bar in '50$', then '100$', and then '200$'.
I think it's easier to select the correct data by annotating the 'buyers' group separately.
The answer to your previous question selects the data from the entire dataframe, but here Boolean indexing is used to select only the rows for one hue group. Adding print(data) inside the loop will help with understanding.
fig, ax = plt.subplots(figsize=(20, 10))
# plot the all users
sns.barplot(x='age', y='users', data=df, hue='expenses', palette='Blues', edgecolor='grey', alpha=0.7, ax=ax)
# annotate the bars in the 3 containers (1 container per hue group)
for c in ax.containers:
    ax.bar_label(c)
# plot the 'buyers', which adds 3 more containers to ax
sns.barplot(x='age', y='buyers', data=df, hue='expenses', palette='Blues', edgecolor='darkgrey', hatch='//', ax=ax)
# iterate through the last 3 new containers containing the hatched groups
for c in ax.containers[3:]:
    # get the hue label, which will be used to select the data group
    hue_label = c.get_label()
    # select the data based on hue_label
    data = df.loc[df.expenses.eq(hue_label), ['buyers', 'percentage']]
    # customize the labels
    labels = [f"{v.get_height()}: {data.iloc[i, 1]:0.2f}%" for i, v in enumerate(c)]
    # add the labels
    ax.bar_label(c, labels=labels)
plt.show()
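To see why the Boolean selection lines up with the bars, here is a minimal pandas-only sketch of the label-building step, with no plotting involved. It assumes, as in the sample data, that the rows of each hue group are already sorted by age, so the i-th selected row corresponds to the i-th bar in that group's container. The name labels_by_hue is only for illustration, and the buyer counts are taken from the dataframe rather than from v.get_height().

```python
import pandas as pd

df = pd.DataFrame({
    'age': ['20-30', '20-30', '20-30', '30-40', '30-40', '30-40',
            '40-50', '40-50', '40-50', '50-60', '50-60', '50-60'],
    'expenses': ['50$', '100$', '200$'] * 4,
    'users': [59, 42, 57, 68, 47, 98, 75, 73, 54, 81, 52, 43],
    'buyers': [22, 35, 18, 27, 12, 57, 19, 29, 31, 47, 10, 5],
    'percentage': [37.2881, 83.3333, 31.5789, 39.7058, 25.5319, 58.1632,
                   25.3333, 39.7260, 57.4074, 58.0246, 19.2307, 11.6279],
})

# Build one list of labels per hue group (illustrative name, not a library API).
labels_by_hue = {}
for hue_label in df['expenses'].unique():
    # Boolean indexing keeps only the rows of this hue group, in age order.
    data = df.loc[df['expenses'].eq(hue_label), ['buyers', 'percentage']]
    labels_by_hue[hue_label] = [
        f"{buyers}: {pct:0.2f}%"
        for buyers, pct in zip(data['buyers'], data['percentage'])
    ]

for hue_label, labels in labels_by_hue.items():
    print(hue_label, labels)
```

Each list has one entry per age group, which is the number of bars in a single hue container.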

Related

Plotting grouped multi-index data with a For loop

I am trying to produce multiple plots from a for loop.
My dataframe is multi-indexed as below:
          temperature  depth
ID Month
33 2              150     95
   3              148     79
   4              148     54
   5              155     77
55 2              168     37
   3              172     33
   4              107     32
   5              155     77
61 2              168     37
   3              172     33
   4              107     32
   5              155     77
I want to loop through each ID and plot:
Temperature as a line against Month (x-axis)
Depth as a bar against Month (x-axis)
I want these to be on the same plot.
This is what I have so far:
# group the dataframe
grp = df.groupby([df.index.get_level_values(0), df.index.get_level_values(1)])
# create empty plots
fig, ax = plt.subplots()
# create an empty plot for combining with ax
ax2 = ax.twinx()
# for loop
for ID, group in grp:
    ax.bar(df.index.get_level_values(1), group["temperature"], color='blue', label='Release')
    ax2.plot(df.index.get_level_values(1), group["depth"], color='green', label='Hold')
    ax.set_xticklabels(df.index.get_level_values(1))
    plt.savefig("value{y}.png".format(y=ID))
    next
dataframe reprex:
import pandas as pd
index = pd.MultiIndex.from_product([[33, 55, 61],['2','3','4', '5']], names=['ID','Month'])
df = pd.DataFrame([[150, 95],
[148, 79],
[148, 54],
[155, 77],
[168, 37],
[172, 33],
[107, 32],
[155, 77],
[168, 37],
[172, 33],
[107, 32],
[155, 77]],
columns=['temperature', 'depth'], index=index)
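
One possible approach, given as a hedged sketch rather than a definitive answer: group on the ID level only, create a fresh figure for each ID, and take the x values from the group's own Month level instead of from the whole dataframe. The temporary output folder out_dir and the list saved are assumptions added for illustration, and the Agg backend is selected only so the sketch runs without a display.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

index = pd.MultiIndex.from_product([[33, 55, 61], ['2', '3', '4', '5']],
                                   names=['ID', 'Month'])
df = pd.DataFrame([[150, 95], [148, 79], [148, 54], [155, 77],
                   [168, 37], [172, 33], [107, 32], [155, 77],
                   [168, 37], [172, 33], [107, 32], [155, 77]],
                  columns=['temperature', 'depth'], index=index)

out_dir = Path(tempfile.mkdtemp())  # hypothetical output folder
saved = []

# Grouping on the ID level alone gives one group (four months) per ID.
for id_value, group in df.groupby(level='ID'):
    months = group.index.get_level_values('Month').tolist()
    fig, ax = plt.subplots()
    ax2 = ax.twinx()  # second y-axis sharing the same x-axis
    # Depth as bars, temperature as a line, both against Month.
    ax.bar(months, group['depth'].to_numpy(), color='blue', label='depth')
    ax2.plot(months, group['temperature'].to_numpy(), color='green',
             marker='o', label='temperature')
    ax.set_xlabel('Month')
    ax.set_ylabel('depth')
    ax2.set_ylabel('temperature')
    ax.set_title(f'ID {id_value}')
    path = out_dir / f'value{id_value}.png'
    fig.savefig(path)
    plt.close(fig)  # release the figure before the next ID
    saved.append(path)
```

Creating the figure inside the loop keeps each ID on its own plot. With a single figure created before the loop, every iteration would draw on top of the previous one.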

Plot one seaborn plot from two dataframes

I am trying to plot two dataframes with seaborn in one figure.
Given this test data:
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df['Name'] = 'Adam'
df.iloc[::5, 4] = 'Berta'
df.head(10)
    A   B   C   D   Name
0  40  75  45   6  Berta
1  52  98  55  44   Adam
2  57  61  70  17   Adam
3  52   5  20  28   Adam
4  63  53  74  49   Adam
5  53  28  97  26  Berta
6  64  38  73  56   Adam
7  25  65  34  64   Adam
8  95  91  92  60   Adam
9   6  54   5  58   Adam
and
df1 = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df1['Location'] = 'New York'
df1.iloc[::5, 4] = 'Tokyo'
df1.head(10)
    A   B   C   D  Location
0  89  16  23  15     Tokyo
1   7  35  26  21  New York
2  64  94  51  61  New York
3  84  16  15  36  New York
4  55  62   0   2  New York
5  73  93   4   1     Tokyo
6  93  11  27  69  New York
7  14  52  50  45  New York
8  26  77  86  32  New York
9  21  10  68  11  New York
A) For the first plot, I would like a relplot or scatterplot where both dataframes share the same x and y axes but use a different "hue". If I try:
sb.relplot(data=df, x='Name', y='C', hue="Name", height=8.27, aspect=11.7/8.27)
sb.relplot(data=df1, x='Location', y='C', hue="Location", height=8.27, aspect=11.7/8.27)
plt.show()
The latter plot either overwrites the first or creates a new figure. Any ideas?
B) Now we have the same y-axis (let's say "amount"), but different x-axes (strings).
I found this: How to overlay two seaborn relplots? It looks pretty good, but if I try:
fig, ax = plt.subplots()
sb.scatterplot(x="Name", y='A', data=df, hue="Name", ax=ax)
ax2 = ax.twinx()
sb.scatterplot(data=df1, x='Location', y='A', hue="Location", ax =ax2)
plt.show()
then the second scatterplot is drawn over the values of the first one, overwriting the names on the x-axis. I would like the second scatterplot to be added on the right instead. Is this possible?
In my opinion, it doesn't make sense to concatenate the two dataframes.
Thanks very much!
Having gathered all the questions you asked, I assume you want either two subplots in one row (one per DataFrame) or both sets of data on a single plot.
As for the 'A' plot:
fig, ax = plt.subplots(1, 2, figsize=(8, 4), sharey=True)
sb.scatterplot(data=df, x='Name', y='A', hue='Name',
               ax=ax[0])
sb.scatterplot(data=df1, x='Location', y='A', hue='Location',
               ax=ax[1])
plt.show()
Here I created both fig and ax using plt.subplots(), specifying the number of rows (1), the number of columns (2), and a shared y-axis, so that each scatter plot can be placed on its own subplot. (I did not bother with the legend location and other decorations.)
As for the 'B' plot, if you want everything on one plot, you can try:
fig, ax = plt.subplots(1, 1, figsize=(8, 4))
sb.scatterplot(data=df, x='Name', y='A', hue='Name', palette=['blue', 'orange'],
               ax=ax)
sb.scatterplot(data=df1, x='Location', y='A', hue='Location', palette=['red', 'green'],
               ax=ax)
ax.set_xlabel('Name/Location')
plt.show()
Here I made a single subplot and assigned both scatter plots to it. It might require color mapping and renaming the x-axis.

Matplotlib error plotting interval bins for discretized values from pandas dataframe

An error is raised when I try to plot an interval.
I created intervals for my age column, and now I want to show on a chart how the age intervals compare to the revenue.
My code:
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
clients['tranche'] = pd.cut(clients.age, bins)
clients.head()
  client_id  sales  revenue  birth  age sex   tranche
0       c_1     39   558.18   1955   66   m  (60, 70]
1      c_10     58  1353.60   1956   65   m  (60, 70]
2     c_100      8   254.85   1992   29   m  (20, 30]
3    c_1000    125  2261.89   1966   55   f  (50, 60]
4    c_1001    102  1812.86   1982   39   m  (30, 40]
# Plot a scatter tranche x revenue
df = clients.groupby('tranche')[['revenue']].sum().reset_index().copy()
plt.scatter(df.tranche, df.revenue)
plt.show()
But an error is raised, ending with:
TypeError: float() argument must be a string or a number, not 'pandas._libs.interval.Interval'
How can I use an interval for plotting?
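You'll need to add labels. (I tried converting the intervals to str using .astype(str), but that did not seem to work in 3.9.)
If you do the following, it works fine. Note that pd.cut needs one label per bin, i.e. one fewer than the number of bin edges, and that the labels have to be applied to clients, before the groupby:
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]  # '10-20', '20-30', ..., '90-100'
clients['tranche'] = pd.cut(clients.age, bins, labels=labels)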
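
As a hedged illustration of the labelling idea, the sketch below rebuilds a small clients frame from the five rows shown in the question (only the columns it needs) and derives one string label per bin from the bin edges. The names revenue_by_tranche and x are assumptions for illustration. observed=False is passed so that empty bins are kept in the result.

```python
import pandas as pd

# Minimal stand-in for the question's data (five sample rows only).
clients = pd.DataFrame({
    'client_id': ['c_1', 'c_10', 'c_100', 'c_1000', 'c_1001'],
    'revenue': [558.18, 1353.60, 254.85, 2261.89, 1812.86],
    'age': [66, 65, 29, 55, 39],
})

bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# pd.cut needs len(bins) - 1 labels: one per interval.
labels = [f'{lo}-{hi}' for lo, hi in zip(bins[:-1], bins[1:])]
clients['tranche'] = pd.cut(clients['age'], bins, labels=labels)

# Sum the revenue per age bracket, keeping empty brackets too.
revenue_by_tranche = (clients.groupby('tranche', observed=False)['revenue']
                      .sum()
                      .reset_index())

# Plain strings for the x-axis instead of Interval objects.
x = revenue_by_tranche['tranche'].astype(str)
print(revenue_by_tranche)
```

Passing x and revenue_by_tranche['revenue'] to plt.scatter then plots string categories rather than Interval objects.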

Connecting means in seaborn box plot

I want to connect the means of box plots. I can do the basic part, but the line does not connect the box plot means, and the box plots are offset from the x-axis ticks. A similar post, which does not connect the means: Python: seaborn pointplot and boxplot in one plot but shifted on the x-axis
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'pre_score': [4, 24, 31, 2, 3,25, 94, 57, 62, 70,5, 43, 23, 23, 51]
}
data = pd.DataFrame(raw_data, columns = ['first_name', 'pre_score'])
   first_name  pre_score
0       Jason          4
1       Molly         24
2        Tina         31
3        Jake          2
4         Amy          3
5       Jason         25
6       Molly         94
7        Tina         57
8        Jake         62
9         Amy         70
10      Jason          5
11      Molly         43
12       Tina         23
13       Jake         23
14        Amy         51
sns.set_style("ticks")
ax = sns.stripplot(x='first_name', y='pre_score', hue='first_name', jitter=True, dodge=True, size=6, zorder=0, alpha=0.5, linewidth =1, data=data)
ax = sns.boxplot(x='first_name', y='pre_score', hue='first_name', dodge=True, showfliers=True, linewidth=0.8, showmeans=True, data=data)
ax = sns.lineplot(x='first_name', y='pre_score', color='k', data=data.groupby(['first_name'], as_index=False).mean())
fig_size = [18.0, 10.0]
plt.rcParams["figure.figsize"] = fig_size
handles, labels = ax.get_legend_handles_labels()
legend_len = labels.__len__()
ax.legend(handles[int(legend_len/2):legend_len], labels[int(legend_len/2):legend_len], bbox_to_anchor=(1.01, 1), loc=2, borderaxespad=0.1);
As we can see, the sns.lineplot line does not follow the means, and the box plots are offset from the names on the x-axis.
How can I fix this?
When dealing with seaborn plots, I would strongly recommend always providing an order= (and hue_order= if applicable) to avoid nasty surprises with the categories not showing up in a consistent order between calls.
For the purpose of your question, you can replace the lineplot with a pointplot, which automatically aggregates the values by category and connects them with a line:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy','Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'pre_score': [4, 24, 31, 2, 3,25, 94, 57, 62, 70,5, 43, 23, 23, 51]
}
data = pd.DataFrame(raw_data, columns = ['first_name', 'pre_score'])
# define the order in which the categories will be plotted on the x-axis
order = np.sort(data['first_name'].unique()) # you could also create a list by hand if you want a specific order
sns.set_style("ticks")
ax = sns.stripplot(x='first_name', y='pre_score', order=order, jitter=True, size=6, zorder=0, alpha=0.5, linewidth =1, data=data)
ax = sns.boxplot(x='first_name', y='pre_score', order=order, showfliers=True, linewidth=0.8, showmeans=True, data=data)
ax = sns.pointplot(x='first_name', y='pre_score', order=order, data=data, ci=None, color='black')
If for some reason you don't want to or cannot use a seaborn function that takes an order argument, then aggregate by hand in pandas, and reindex() with your order to make sure the values appear in the right order in the dataframe before plotting with the tool of your choice.
For instance, you could replace the call to pointplot() above with:
means = data.groupby('first_name')['pre_score'].mean().reindex(order) # calculate the means and ensure they are
# displayed in the same order as the boxplots
ax.plot(means.index, means.values, 'ko-', lw=3)
and get exactly the same result.
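
To make the reindex() step concrete, here is a small pandas-only sketch using the question's data. The hand-written order list is a hypothetical example of a custom category order. The point is only that the aggregated means come back in that order, ready to be plotted against the same x positions as the boxes.

```python
import pandas as pd

data = pd.DataFrame({
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'] * 3,
    'pre_score': [4, 24, 31, 2, 3, 25, 94, 57, 62, 70, 5, 43, 23, 23, 51],
})

# Hypothetical custom order, written by hand instead of np.sort(...).
order = ['Tina', 'Jason', 'Amy', 'Molly', 'Jake']

# groupby sorts the keys alphabetically; reindex puts them back in `order`.
means = data.groupby('first_name')['pre_score'].mean().reindex(order)
print(means)
```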

How can I iterate through a CSV file and plot it to a boxplot by each column representing a second in Python?

Say I have a csv file like so:
20 30 33 54 12 56
90 54 66 12 88 11
33 22 63 86 12 65
11 44 65 34 23 26
I want to create a boxplot where each column is one second, with the seconds on the x-axis and the actual data on the y-axis. So 20, 90, 33, 11 will be plotted at 1 second, 30, 54, 22, 44 at 2 seconds, and so on. The real csv file has more data than this, and I don't know how many columns it has, so I can't hard-code anything.
This is what I have so far:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('/user/Desktop/test.csv', header = None)
fig = plt.figure()
ax = fig.add_subplot()
plt.xlabel('Time (s)')
plt.ylabel('ms')
df.boxplot()
plt.show()
Try this:
axes = df.groupby(df.columns//10, axis=1).boxplot(subplots=True,
                                                  figsize=(12,18))
plt.xlabel('Time (s)')
plt.ylabel('ms')
plt.show()
If you want to set y limits of the subplots:
for ax in axes.flatten():
    ax.set_ylim(0,100)
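
For the simpler case of one box per column on a single axes, a hedged sketch follows. It assumes the file is whitespace-separated as shown in the question (hence sep=r"\s+"; drop that argument for a comma-separated file) and reads the sample from an in-memory string instead of the real path. Renumbering the columns from 1 makes each x tick the second it represents, however many columns the file has.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the csv file, using the sample rows from the question.
raw = """20 30 33 54 12 56
90 54 66 12 88 11
33 22 63 86 12 65
11 44 65 34 23 26
"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+", header=None)
# Label the columns 1..n so that column k is "second k"; nothing is hard-coded.
df.columns = range(1, df.shape[1] + 1)

fig, ax = plt.subplots()
df.boxplot(ax=ax, grid=False)  # one box per column
ax.set_xlabel('Time (s)')
ax.set_ylabel('ms')
```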
