Let's assume I have a dataframe and I'm looking at 2 columns of it (2 series).
Using one of the columns ("no_employees" below), can someone kindly help me figure out how to create 6 different pie charts or bar charts (1 for each grouping of no_employees) that illustrate the value counts for the Yes/No values in the treatment column? I'll use matplotlib or seaborn, whichever you feel is easiest.
I'm using the following line of code to generate the output below:
dataframe_title.groupby(['no_employees']).treatment.value_counts()
But now I'm stuck. Do I use seaborn? .plot? This seems like it should be easy, and I know there are some cases where I can make subplots=True, but I'm really confused. Thank you so much.
no_employees    treatment
1-5             Yes          88
                No           71
100-500         Yes          95
                No           80
26-100          Yes         149
                No          139
500-1000        No           33
                Yes          27
6-25            No          162
                Yes         127
More than 1000  Yes         146
                No          135
The importance of data encoding:
The purpose of data visualization is to more easily convey information (e.g. in this case, the relative number of 'treatments' per category)
The bar chart accommodates easily displaying the important information
how many in each group said 'Yes' or 'No'
the relative sizes of each group
A pie plot is more commonly used to display a single sample, where the groups within the sample sum to 100%.
Wikipedia: Pie Chart
Research has shown that comparison by angle is less accurate than comparison by length, in that people are less able to discern differences.
Statisticians generally regard pie charts as a poor method of displaying information, and they are uncommon in scientific literature.
This data is not well represented by a pie plot, because each company size is a separate population, which will require 6 pie plots to be correctly represented.
The data can be placed into a pie plot, as others have shown, but that doesn't mean it should be.
Regardless of the type of plot, the data must be in the correct shape for the plot API.
Tested with pandas 1.3.0, seaborn 0.11.1, and matplotlib 3.4.2
Set up a test DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # for sample data only
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
data = {'no_employees': np.random.choice(cats, size=(1000,)),
        'treatment': np.random.choice(['Yes', 'No'], size=(1000,))}
df = pd.DataFrame(data)
# set a categorical order for the x-axis to be ordered
df.no_employees = pd.Categorical(df.no_employees, categories=cats, ordered=True)
no_employees treatment
0 26-100 No
1 1-5 Yes
2 >1000 No
3 100-500 Yes
4 500-1000 Yes
Plotting with pandas.DataFrame.plot():
This requires grouping the dataframe to get .value_counts, and unstacking with pandas.DataFrame.unstack.
# to get the dataframe in the correct shape, unstack the groupby result
dfu = df.groupby(['no_employees']).treatment.value_counts().unstack()
treatment     No  Yes
no_employees
1-5           78   72
6-25          83   86
26-100        83   76
100-500       91   84
500-1000      78   83
>1000         95   91
# plot
ax = dfu.plot(kind='bar', figsize=(7, 5), xlabel='Number of Employees in Company', ylabel='Count', rot=0)
ax.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
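If one panel per company size is still wanted, as the question asks, the same counts can be drawn with subplots=True. The following is only a sketch under assumptions: it rebuilds the synthetic data from the setup above, uses pd.crosstab instead of groupby/unstack to get the counts, and selects the Agg backend purely so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only so the sketch runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data, generated the same way as in the setup above
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
df = pd.DataFrame({'no_employees': np.random.choice(cats, size=1000),
                   'treatment': np.random.choice(['Yes', 'No'], size=1000)})

# Count Yes/No per company size; reindex puts the sizes in their natural order
counts = pd.crosstab(df['no_employees'], df['treatment']).reindex(cats)

# Transpose so each company size becomes a column, then draw one bar chart per column
axes = counts.T.plot(kind='bar', subplots=True, layout=(2, 3), figsize=(10, 6),
                     sharey=True, legend=False, rot=0)

# Label each panel explicitly with its company size
for ax, cat in zip(axes.flat, counts.index):
    ax.set_title(cat)
    ax.set_ylabel('Count')

plt.tight_layout()
```

Sharing the y-axis keeps bar heights comparable across panels, which is the main thing lost when the groups are split into separate charts.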
Plotting with seaborn
seaborn is a high-level API for matplotlib.
seaborn.barplot()
Requires a DataFrame in a tidy (long) format, which is done by grouping the dataframe to get .value_counts, and resetting the index with pandas.Series.reset_index
May also be done with the figure-level interface using sns.catplot() with kind='bar'
# groupby, get value_counts, and reset the index
dft = df.groupby(['no_employees']).treatment.value_counts().reset_index(name='Count')
no_employees treatment Count
0 1-5 No 78
1 1-5 Yes 72
2 6-25 Yes 86
3 6-25 No 83
4 26-100 No 83
5 26-100 Yes 76
6 100-500 No 91
7 100-500 Yes 84
8 500-1000 Yes 83
9 500-1000 No 78
10 >1000 No 95
11 >1000 Yes 91
# plot
p = sns.barplot(x='no_employees', y='Count', data=dft, hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
seaborn.countplot()
Uses the original dataframe, df, without any transformations.
May also be done with the figure-level interface using sns.catplot() with kind='count'
p = sns.countplot(data=df, x='no_employees', hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
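The figure-level interface mentioned above can also split the counts into one panel per company size. This is a hedged sketch rather than part of the tested answer: it rebuilds the synthetic data, and the col_wrap, height, and title choices are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only so the sketch runs headless
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data, generated the same way as in the setup above
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
df = pd.DataFrame({'no_employees': np.random.choice(cats, size=1000),
                   'treatment': np.random.choice(['Yes', 'No'], size=1000)})

# catplot returns a FacetGrid; col= creates one panel per company size
g = sns.catplot(data=df, x='treatment', col='no_employees', col_order=cats,
                col_wrap=3, kind='count', order=['Yes', 'No'], height=3)
g.set_axis_labels('treatment', 'Count')
g.set_titles('{col_name}')
```

Because catplot works on the untransformed df, no groupby or reshaping step is needed here.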
Output of barplot and countplot
Let's reshape the dataframe and plot with subplots=True (df1 here is a grouped frame with a 'Pct' column; its construction is not shown in this answer):
df_chart = df1.unstack()['Pct']
axs = df_chart.plot.pie(subplots=True, figsize=(4,9), layout=(2,1), legend=False, title=df_chart.columns.tolist())
ax_flat = axs.flatten()
for ax in ax_flat:
    ax.yaxis.label.set_visible(False)
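Since df1 is not defined in the snippet above, here is a self-contained sketch of the same idea applied to the synthetic no_employees/treatment data. The percentage table built with pd.crosstab is an assumption standing in for df1, and the 2x3 layout is chosen only because there are six company sizes.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only so the sketch runs headless
import numpy as np
import pandas as pd

# Synthetic data, generated the same way as in the setup above
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
df = pd.DataFrame({'no_employees': np.random.choice(cats, size=1000),
                   'treatment': np.random.choice(['Yes', 'No'], size=1000)})

# Percentage of Yes/No within each company size: rows are treatment,
# columns are company sizes, and every column sums to 100
pct = pd.crosstab(df['treatment'], df['no_employees'], normalize='columns') * 100
pct = pct[cats]  # put the columns in their natural order

# One pie per column (company size)
axs = pct.plot.pie(subplots=True, layout=(2, 3), figsize=(9, 6), legend=False,
                   autopct='%.0f%%', title=pct.columns.tolist())

# Hide the y-axis label that pandas adds to each pie
for ax in axs.flatten():
    ax.yaxis.label.set_visible(False)
```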
Related
I would like to create a scatterplot in Seaborn that sets the hue parameter so that values are coloured based on what feature they are from. (All features have values 0 to 100)
For example, supposing I have the following dataframe:
Happiness Kindness Sadness
1 100 70 0
2 60 50 1
3 34 32 10
4 23 65 54
5 43 54 87
When plotting the values, I would like to set all the values under happiness to red, kindness to blue, sadness to green. However, with Seaborn's scatterplot, the hue parameter only accepts one variable under the dataframe. Documentation here: http://man.hubwiz.com/docset/Seaborn.docset/Contents/Resources/Documents/generated/seaborn.scatterplot.html
I want to know if there are any workarounds to the only one variable being accepted feature so I can use the hue parameter to distinguish between the different variables.
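One possible workaround, offered as a sketch rather than a confirmed answer, is to reshape the frame to long format with melt so that the feature name becomes a single column that hue can use. Plotting the values against the row index is an assumption, since the question does not say what the x-axis should be; the column names 'feature' and 'value' are also made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only so the sketch runs headless
import pandas as pd
import seaborn as sns

# The example frame from the question
df = pd.DataFrame({'Happiness': [100, 60, 34, 23, 43],
                   'Kindness': [70, 50, 32, 65, 54],
                   'Sadness': [0, 1, 10, 54, 87]},
                  index=[1, 2, 3, 4, 5])

# Wide -> long: one row per (index, feature) pair
long_df = df.reset_index().melt(id_vars='index', var_name='feature', value_name='value')

# hue now refers to one column whose values are the original column names
palette = {'Happiness': 'red', 'Kindness': 'blue', 'Sadness': 'green'}
ax = sns.scatterplot(data=long_df, x='index', y='value', hue='feature', palette=palette)
```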
I have a dataframe that looks like:
Date Faculty Target Avg
2012-01-01 Arts 80 60
2012-01-01 Science 70 60
2012-02-01 Arts 91 89
2012-02-01 Gym 80 89
.
.
2012-07-01 Arts 83 67
2012-07-01 Science 72 67
2012-08-01 Arts 81 83
2012-08-01 Science 70 83
I want to plot all Faculty on a single scatter plot with each of their respective Target values (Y-Axis) and Avg values (X-Axis).
I'm trying to use (pseudo code) a scatterplot like:
ax1 = data.plot(kind='scatter', x='Avg', y='Target(Arts)', color='r', label='Arts')
ax2 = data.plot(kind='scatter', x='Avg', y='Target(Science)', color='g', ax=ax1, label='Science')
ax3 = data.plot(kind='scatter', x='Avg', y='Target(Gym)', color='b', ax=ax1, label='Gym')
I'd like all Faculties (there are 28 of them total) on the same plot for every Target value (marked by different colors) but there are too many to manually enter with loc (or at least I'd like to avoid this). I can't use iloc to count by index because each number of Faculty counts is different on each date.
Is there a simple way to do this?
You can groupby the Faculty, and iterate over the groups, plotting each one:
g = df.groupby('Faculty')
for faculty, data in g:
    plt.scatter(data['Avg'], data['Target'], label=faculty)
plt.xlabel('Avg')
plt.ylabel('Target')
plt.legend()
plt.show()
I have two datasets. Both have different numbers of observations. Is it possible to generate a scatter plot between features from different datasets?
For example, I want to generate a scatter plot between the submission_day column of dataset 1 and the score column of dataset 2.
I am not sure how to do that using python packages.
For example consider the following two datasets:
id_student submission_day
23hv 100
24hv 99
45hv 10
56hv 16
53hv 34
id_student score
23hv 59
25gf 20
24hv 56
45hv 76
I think you need to merge the two into one DataFrame and then use DataFrame.plot.scatter:
df = df1.merge(df2, on='id_student')
print(df)
id_student submission_day score
0 23hv 100 59
1 24hv 99 56
2 45hv 10 76
df.plot.scatter(x='submission_day', y='score')
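The merge above keeps only students present in both datasets, which is why the result has 3 rows. A self-contained sketch using the question's sample data, with an extra outer merge (an addition for illustration, not part of the answer) to show which students were dropped:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only so the sketch runs headless
import pandas as pd

# The two sample datasets from the question
df1 = pd.DataFrame({'id_student': ['23hv', '24hv', '45hv', '56hv', '53hv'],
                    'submission_day': [100, 99, 10, 16, 34]})
df2 = pd.DataFrame({'id_student': ['23hv', '25gf', '24hv', '45hv'],
                    'score': [59, 20, 56, 76]})

# Inner join (the default for merge): only ids found in both frames survive
merged = df1.merge(df2, on='id_student', how='inner')

# Outer join with indicator=True adds a '_merge' column showing where each id came from
outer = df1.merge(df2, on='id_student', how='outer', indicator=True)
unmatched = outer[outer['_merge'] != 'both']

# Scatter plot of the matched pairs
ax = merged.plot.scatter(x='submission_day', y='score')
```

Checking unmatched before plotting makes it clear how many observations the scatter plot silently leaves out.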
This is my pandas dataframe df:
ab channel booked
0 control book_it 466
1 control contact_me 536
2 control instant 17
3 treatment book_it 494
4 treatment contact_me 56
5 treatment instant 22
I want to plot 3 groups of bar chart (according to channel):
for each channel:
plot control booked value vs treatment booked value.
Hence I should get 6 bars in 3 groups, where each group has the control and treatment booked values.
So far I was only able to plot booked, but not grouped by ab:
ax = df_conv['booked'].plot(kind='bar',figsize=(15,10), fontsize=12)
ax.set_xlabel('dim_contact_channel',fontsize=12)
ax.set_ylabel('channel',fontsize=12)
plt.show()
This is what I want (only 4 bars are shown, but this is the gist):
Pivot the dataframe so control and treatment values are in separate columns.
df.pivot(index='channel', columns='ab', values='booked').plot(kind='bar')
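A self-contained sketch of the pivot approach using the question's data; the axis labels and rot=0 are assumptions added for readability.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only so the sketch runs headless
import pandas as pd

# The dataframe from the question
df = pd.DataFrame({'ab': ['control'] * 3 + ['treatment'] * 3,
                   'channel': ['book_it', 'contact_me', 'instant'] * 2,
                   'booked': [466, 536, 17, 494, 56, 22]})

# Long -> wide: one row per channel, one column per ab group
wide = df.pivot(index='channel', columns='ab', values='booked')

# Each row becomes a group of bars, each column one bar within the group
ax = wide.plot(kind='bar', rot=0)
ax.set_xlabel('channel')
ax.set_ylabel('booked')
```

pandas draws one bar series per column, so the control and treatment bars sit side by side within each channel.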
I'm wondering how to get the proper legend labels to show up in a DataFrame plot after grouping with the groupby method. One would expect the legend labels to be the group names; instead they show up as redundant dependent variable names.
For example, I have a DataFrame that looks like this:
ECMO Year Sex Runs p_ecmo
102 111 2011 M 2106 0.052707
104 31 2012 F 1801 0.017213
105 42 2012 M 2664 0.015766
107 59 2013 F 1039 0.056785
108 72 2013 M 1386 0.051948
And I am trying to group by sex, and plot p_ecmo by year. Thus,
ecmo_by_year[ecmo_by_year.Year>=1996].groupby('Sex').plot('Year', 'p_ecmo', grid=False)
plt.ylabel('Proportion ECMO')
plt.legend()
But what I get is this, which is not helpful. The legend should display the group name, not the variable.
It doesn't look like you need groupby since you are not splitting, combining, or applying on p_ecmo
You can use a pivot_table to get the desired result.
pt = ecmo_by_year.pivot_table(['p_ecmo'], index='Year', columns='Sex').fillna(0)
pt['p_ecmo'].plot()
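A self-contained sketch of the pivot_table approach on the five sample rows from the question, using the index= and columns= keywords of current pandas. Passing values as a plain string (rather than a list) is a small variation assumed here so that the result has flat columns and can be plotted directly.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen only so the sketch runs headless
import pandas as pd

# The sample rows from the question
ecmo_by_year = pd.DataFrame({'ECMO': [111, 31, 42, 59, 72],
                             'Year': [2011, 2012, 2012, 2013, 2013],
                             'Sex': ['M', 'F', 'M', 'F', 'M'],
                             'Runs': [2106, 1801, 2664, 1039, 1386],
                             'p_ecmo': [0.052707, 0.017213, 0.015766, 0.056785, 0.051948]})

# One row per Year, one column per Sex; missing combinations become 0
pt = ecmo_by_year.pivot_table(values='p_ecmo', index='Year', columns='Sex').fillna(0)

# Each column is drawn as its own line, labelled with the group name
ax = pt.plot(grid=False)
ax.set_ylabel('Proportion ECMO')

# Collect the legend labels to confirm they are the group names
labels = [t.get_text() for t in ax.get_legend().get_texts()]
```

Because the groups are columns of a single frame, the legend entries are the values of Sex rather than the name of the plotted variable.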