This is my pandas dataframe df:
ab channel booked
0 control book_it 466
1 control contact_me 536
2 control instant 17
3 treatment book_it 494
4 treatment contact_me 56
5 treatment instant 22
I want to plot 3 groups of bars (one group per channel):
for each channel:
plot the control booked value vs. the treatment booked value.
So I should get 6 bars in 3 groups, where each group shows the control and treatment booked values.
So far I have only been able to plot booked, but not grouped by ab:
ax = df_conv['booked'].plot(kind='bar',figsize=(15,10), fontsize=12)
ax.set_xlabel('dim_contact_channel',fontsize=12)
ax.set_ylabel('channel',fontsize=12)
plt.show()
This is what I want (only 4 bars shown, but this is the gist):
Pivot the dataframe so control and treatment values are in separate columns.
df.pivot(index='channel', columns='ab', values='booked').plot(kind='bar')
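As a minimal, self-contained sketch of the pivot approach, the snippet below rebuilds the question's data by hand and plots it. The variable name `wide` and the axis labels are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "ab": ["control"] * 3 + ["treatment"] * 3,
    "channel": ["book_it", "contact_me", "instant"] * 2,
    "booked": [466, 536, 17, 494, 56, 22],
})

# One row per channel, one column per ab group
wide = df.pivot(index="channel", columns="ab", values="booked")
print(wide)

# Each row becomes a group of bars, each column one bar within the group
ax = wide.plot(kind="bar", rot=0)
ax.set_xlabel("channel")
ax.set_ylabel("booked")
```

In a script, call `plt.show()` from `matplotlib.pyplot` afterwards to display the figure.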
Let's explore an example with the well known mpg dataset:
First level is the index, let's say model_year.
Second level is the quantitative variable which in this case is a simple count.
Third level is a categorical variable. In our example, that would be the origin.
At this point I am expecting to see a regular countplot composed of indexes, bars and colors. Pretty standard!
import seaborn as sns
mpg = sns.load_dataset('mpg')
sns.countplot(x='model_year', data=mpg, hue='origin')
Ok! So what do I mean by "four levels of information"?
Here comes the actual question.
Let's say I want to stack the bars with the count per cylinder. In other words, each segment of the bar would represent the amount of produced cars for a specific number of cylinders.
For instance, for the year 70 the usa (blue bar) would have two segments (of 22 cars, 18 had 8 cylinders and 4 had 6 cylinders). Europe and Japan would have only one segment, since only 4 cylinders cars were produced in these countries, in that year.
mpg.groupby(['model_year', 'origin', 'cylinders'])['mpg'].count()
origin cylinders
europe 4 5
japan 4 2
usa 6 4
8 18
Name: mpg, dtype: int64
It is worth mentioning that the identification strategy (hatched maybe?) of the fourth information level should be consistent along the plot (equal for the same amount of cylinders) and a second legend will be necessary.
How can I achieve this with pandas and matplotlib, preferably without additional modules?
Let's assume I have a dataframe and I'm looking at 2 columns of it (2 series).
Using one of the columns - "no_employees" below - Can someone kindly help me figure out how to create 6 different pie charts or bar charts (1 for each grouping of no_employees) that illustrate the value counts for the Yes/No values in the treatment column? I'll use matplotlib or seaborn, whatever you feel is easiest.
I'm using the following line of code to generate the output below.
dataframe_title.groupby(['no_employees']).treatment.value_counts()
But now I'm stuck. Do I use seaborn? .plot? This seems like it should be easy, and I know there are some cases where I can make subplots=True, but I'm really confused. Thank you so much.
no_employees treatment
1-5 Yes 88
No 71
100-500 Yes 95
No 80
26-100 Yes 149
No 139
500-1000 No 33
Yes 27
6-25 No 162
Yes 127
More than 1000 Yes 146
No 135
The importance of data encoding:
The purpose of data visualization is to more easily convey information (e.g. in this case, the relative number of 'treatments' per category)
The bar chart accommodates easily displaying the important information
how many in each group said 'Yes' or 'No'
the relative sizes of each group
A pie plot is more commonly used to display a sample, where the groups within the sample, sum to 100%.
Wikipedia: Pie Chart
Research has shown that comparison by angle, is less accurate than comparison by length, in that people are less able to discern differences.
Statisticians generally regard pie charts as a poor method of displaying information, and they are uncommon in scientific literature.
This data is not well represented by a pie plot, because each company size is a separate population, which will require 6 pie plots to be correctly represented.
The data can be placed into a pie plot, as others have shown, but that doesn't mean it should be.
Regardless of the type of plot, the data must be in the correct shape for the plot API.
Tested with pandas 1.3.0, seaborn 0.11.1, and matplotlib 3.4.2
Setup a test DataFrame
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np # for sample data only
np.random.seed(365)
cats = ['1-5', '6-25', '26-100', '100-500', '500-1000', '>1000']
data = {'no_employees': np.random.choice(cats, size=(1000,)),
'treatment': np.random.choice(['Yes', 'No'], size=(1000,))}
df = pd.DataFrame(data)
# set a categorical order for the x-axis to be ordered
df.no_employees = pd.Categorical(df.no_employees, categories=cats, ordered=True)
no_employees treatment
0 26-100 No
1 1-5 Yes
2 >1000 No
3 100-500 Yes
4 500-1000 Yes
Plotting with pandas.DataFrame.plot():
This requires grouping the dataframe to get .value_counts, and unstacking with pandas.DataFrame.unstack.
# to get the dataframe in the correct shape, unstack the groupby result
dfu = df.groupby(['no_employees']).treatment.value_counts().unstack()
treatment No Yes
no_employees
1-5 78 72
6-25 83 86
26-100 83 76
100-500 91 84
500-1000 78 83
>1000 95 91
# plot
ax = dfu.plot(kind='bar', figsize=(7, 5), xlabel='Number of Employees in Company', ylabel='Count', rot=0)
ax.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
Plotting with seaborn
seaborn is a high-level API for matplotlib.
seaborn.barplot()
Requires a DataFrame in a tidy (long) format, which is done by grouping the dataframe to get .value_counts, and resetting the index with pandas.Series.reset_index
May also be done with the figure-level interface using sns.catplot() with kind='bar'
# groupby, get value_counts, and reset the index
dft = df.groupby(['no_employees']).treatment.value_counts().reset_index(name='Count')
no_employees treatment Count
0 1-5 No 78
1 1-5 Yes 72
2 6-25 Yes 86
3 6-25 No 83
4 26-100 No 83
5 26-100 Yes 76
6 100-500 No 91
7 100-500 Yes 84
8 500-1000 Yes 83
9 500-1000 No 78
10 >1000 No 95
11 >1000 Yes 91
# plot
p = sns.barplot(x='no_employees', y='Count', data=dft, hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
seaborn.countplot()
Uses the original dataframe, df, without any transformations.
May also be done with the figure-level interface using sns.catplot() with kind='count'
p = sns.countplot(data=df, x='no_employees', hue='treatment')
p.legend(title='treatment', bbox_to_anchor=(1, 1), loc='upper left')
p.set(xlabel='Number of Employees in Company')
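The figure-level interface mentioned above can be sketched as follows. This is only an illustration: the six-row DataFrame is made up for the example, and only its column names follow the answer.

```python
import pandas as pd
import seaborn as sns

# Small made-up sample; only the column names follow the answer above
df = pd.DataFrame({
    "no_employees": ["1-5", "1-5", "6-25", "6-25", "6-25", "26-100"],
    "treatment": ["Yes", "No", "No", "No", "Yes", "Yes"],
})

# Figure-level counterpart of sns.countplot(); returns a FacetGrid, not an Axes
g = sns.catplot(data=df, x="no_employees", hue="treatment", kind="count")
g.set_axis_labels("Number of Employees in Company", "Count")
```

Because `catplot` returns a `FacetGrid`, labels are set through the grid's methods (or through `g.ax` when there is a single facet) rather than on an Axes returned directly.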
Output of barplot and countplot
Let's reshape the dataframe and plot with subplots=True:
df_chart = df1.unstack()['Pct']
axs = df_chart.plot.pie(subplots=True, figsize=(4,9), layout=(2,1), legend=False, title=df_chart.columns.tolist())
ax_flat = axs.flatten()
for ax in ax_flat:
    ax.yaxis.label.set_visible(False)
Output:
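If pie charts are wanted for the no_employees question despite the caveats above, a rough sketch could look like the following. The counts are typed in by hand from the question's output, and the names `counts` and `pies` as well as the 2x3 layout are assumptions made for illustration.

```python
import pandas as pd

# Counts copied from the question's groupby output
counts = pd.DataFrame(
    {"Yes": [88, 95, 149, 27, 127, 146], "No": [71, 80, 139, 33, 162, 135]},
    index=["1-5", "100-500", "26-100", "500-1000", "6-25", "More than 1000"],
)

# plot.pie(subplots=True) draws one pie per column, so transpose to make
# each company size a column and Yes/No the wedges
pies = counts.T
axs = pies.plot.pie(subplots=True, layout=(2, 3), figsize=(12, 8),
                    legend=False, autopct="%.0f%%")

for ax, name in zip(axs.flatten(), pies.columns):
    ax.set_title(name)
    ax.set_ylabel("")  # hide the column name that pandas draws as a y-label
```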
From my original data frame, I used the group-by to create the new df as shown below, which has the natural disaster subtype counts for each country.
However, I'm unsure how to, for example, select 4 specific countries and set them as variables in a 2 by 2 plot.
The X-axis will be the disaster subtype name, with the Y being the value count, however, I can't quite figure out the right code to select this information.
This is how I grouped the countries -
c_grp = df_geo.groupby(['Country'])
c_val = pd.DataFrame(c_grp['Disaster Subtype'].value_counts())
c_val = c_val.rename(columns={'Disaster Subtype': 'Num of Disaster'})
c_val.head(40)
Output:
Country Disaster Subtype
Afghanistan Riverine flood 45
Ground movement 33
Flash flood 32
Avalanche 19
Drought 8
Bacterial disease 7
Convective storm 6
Landslide 6
Cold wave 5
Viral disease 5
Mudslide 3
Severe winter conditions 2
Forest fire 1
Locust 1
Parasitic disease 1
Albania Ground movement 16
Riverine flood 8
Severe winter conditions 3
Convective storm 2
Flash flood 2
Heat wave 2
Avalanche 1
Coastal flood 1
Drought 1
Forest fire 1
Viral disease 1
Algeria Ground movement 21
Riverine flood 20
Flash flood 8
Bacterial disease 2
Cold wave 2
Forest fire 2
Coastal flood 1
Drought 1
Heat wave 1
Landslide 1
Locust 1
American Samoa Tropical cyclone 4
Flash flood 1
Tsunami 1
However, let's say I want to select four of these countries and draw 4 plots, one per country, showing the number of each type of disaster in that country. I know I would need something along the lines of what's below, but I'm unsure how to set the x and y variables for each. If there is a more efficient way to set the variables or plot, that would be great. Usually I would just use loc or iloc, but I need to be more specific with the selection.
fig, ax = plt.subplots(2, 2, figsize=(16, 10))
X1 = c_val.loc['Country'] == 'Afghanistan' #This doesn't work, just need something similar
y1 = c_val.loc['Num of Disasters']
X2 =
y2 =
X3 =
y3 =
X4 =
y4 =
ax[0,0].bar(X1,y1,width=.4, color=['#A2BDF2'])
ax[0,1].bar(X2,y2,width=.4,color=['#A2BDF2'])
ax[1,0].bar(X3,y3,width=.4,color=['#A2BDF2'])
ax[1,1].bar(X4,y4,width=.4,color=['#A2BDF2'])
IIUC, a simple way is to use catplot from the seaborn package:
# Python env: pip install seaborn
# Anaconda env: conda install seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.catplot(x='Disaster Subtype', y='Num of Disaster', col='Country',
data=df, col_wrap=2, kind='bar')
g.set_xticklabels(rotation=90)
g.tight_layout()
plt.show()
Update
How can I select the specific countries to be plotted in each subplot?
subdf = df.loc[df['Country'].isin(['Albania', 'Algeria'])]
g = sns.catplot(x='Disaster Subtype', y='Num of Disaster', col='Country',
data=subdf, col_wrap=2, kind='bar')
...
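For a matplotlib-only version that stays closer to the 2x2 layout in the question, a sketch along these lines may help. It rebuilds a small part of `c_val` by hand from the output shown in the question; the `rows` list and the `countries` selection are illustrative, and the real `c_val` has many more rows.

```python
import pandas as pd
import matplotlib.pyplot as plt

# A few counts copied from the question's output
rows = [
    ("Afghanistan", "Riverine flood", 45), ("Afghanistan", "Ground movement", 33),
    ("Afghanistan", "Flash flood", 32),
    ("Albania", "Ground movement", 16), ("Albania", "Riverine flood", 8),
    ("Algeria", "Ground movement", 21), ("Algeria", "Riverine flood", 20),
    ("American Samoa", "Tropical cyclone", 4), ("American Samoa", "Flash flood", 1),
    ("American Samoa", "Tsunami", 1),
]
c_val = (pd.DataFrame(rows, columns=["Country", "Disaster Subtype", "Num of Disaster"])
           .set_index(["Country", "Disaster Subtype"]))

countries = ["Afghanistan", "Albania", "Algeria", "American Samoa"]
fig, axes = plt.subplots(2, 2, figsize=(16, 10))

for ax, country in zip(axes.flat, countries):
    # Selecting on the first index level leaves the subtypes as the index
    sub = c_val.loc[country]["Num of Disaster"]
    ax.bar(sub.index, sub.values, width=0.4, color="#A2BDF2")
    ax.set_title(country)
    ax.tick_params(axis="x", rotation=90)

fig.tight_layout()
```

Looping over `axes.flat` avoids writing separate X1/y1 ... X4/y4 variables: each pass selects one country from the first index level and uses the remaining index (the subtype names) as the x values.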
I would like to plot a stacked bar plot from a csv file in python. I have three columns of data
year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data.
I need to go through all the data and plot a stacked bar plot categorized by year: the x axis is the word, the y axis is the frequency, and the legend colors show the year. I want to see how each word evolved year by year.
Some of the technology words are used repeatedly every year, so the stacked bar graph should add those values on top of each other. For example, the word gfh first plots 14 for the year 2017, and then for the year 2014 I want gfh to plot a value of 12 (in a different color) on top of the 2017 segment. How do I do this?
So far I have read the csv file in my code, but I don't understand how to go over all the rows and stack the words appropriately (as some words repeat through all the years). The years are in random order in the csv, but I sorted them by year to make it easier.
I am just learning Python and trying to understand this plotting routine, since I have 40 years of data and ~20 words. I thought a stacked bar plot would be the best way to represent them, but any other visualisation method is also welcome. Any help is highly appreciated.
This can be done using pandas:
import pandas as pd
df = pd.read_csv("file.csv")
# Aggregate data
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
# Create list to sort by
sorter = (
    df.groupby(["word"], as_index=False)
    .agg({"frequency": "sum"})
    .sort_values("frequency")["word"]
    .values
)
# Pivot, reindex, and plot
df = df.pivot(index="word", columns="year", values="frequency")
df = df.reindex(sorter)
df.plot.bar(stacked=True)
Which outputs:
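To try the idea without a csv file, here is a self-contained sketch using only the sample rows from the question. Note that it swaps the groupby-then-pivot steps above for `pivot_table`, which aggregates and reshapes in one call; that substitution, the name `wide`, and the `fill_value=0` choice are assumptions made for this example rather than the answer's own method.

```python
import pandas as pd

# Sample rows copied from the question (stand-in for pd.read_csv("file.csv"))
df = pd.DataFrame({
    "year": [2018, 2017, 2018, 2015, 2014, 2014, 2013, 2012, 2011, 2010, 2009, 2019],
    "word": ["xyz", "gfh", "sdd", "fdh", "sss", "gfh", "gfh", "gfh", "wer", "krj", "krj", "bfg"],
    "frequency": [12, 14, 10, 1, 3, 12, 2, 4, 5, 4, 4, 4],
})

# pivot_table sums duplicates while reshaping: one row per word, one column per year
wide = df.pivot_table(index="word", columns="year", values="frequency",
                      aggfunc="sum", fill_value=0)

# Order the words by their total frequency across all years
wide = wide.loc[wide.sum(axis=1).sort_values().index]

# Each word is one bar; each year contributes one colored segment
ax = wide.plot.bar(stacked=True)
ax.set_ylabel("frequency")
```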
I have a DataFrame with the following contents, where the first row holds the column names:
id,year,type,sale
1,1998,a,5
2,2000,b,10
3,1999,c,20
4,2001,b,15
5,2001,a,25
6,1998,b,5
...
I want to draw two figures, the first one is like
The second one is like
The figures in my draft might not be to scale. I am a newbie to Python, and I understand that its plotting functionality is powerful. I believe there must be an easy way to plot such figures.
The Pandas library provides simple and efficient tools to analyze and plot DataFrames.
This assumes that the pandas library is installed and that the data are in a .csv file (matching the example you provided).
1. import the pandas library and load the data
import pandas as pd
data = pd.read_csv('filename.csv')
You now have a pandas DataFrame as follows:
id year type sale
0 1 1998 a 5
1 2 2000 b 10
2 3 1999 c 20
3 4 2001 b 15
4 5 2001 a 25
5 6 1998 b 5
2. Plot the "sale" vs "type"
This is easily achieved by:
data.plot('type', 'sale', kind='bar')
which results in
If you want the sale for each type to be summed, data.groupby('type').sum().plot(y='sale', kind='bar') will do the trick (see #3 for explanation)
3. Plot the "sale" vs "year"
This is basically the same command, except that you have to first sum all the sale in the same year using the groupby pandas function.
data.groupby('year').sum().plot(y='sale', kind='bar')
This will result in
Edit:
4. Unstack the different type per year
You can also unstack the different 'type' per year for each bar by using groupby on 2 variables
data.groupby(['year', 'type']).sum().unstack().plot(y='sale', kind='bar', stacked=True)
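To see what the unstacked table looks like before it is plotted, here is a small sketch built from the six sample rows in the question. It selects the "sale" column before summing so that the "id" column is not summed along with it; that selection, the name `by_year_type`, and `fill_value=0` are choices made for this example.

```python
import pandas as pd

# Sample rows copied from the question
data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "year": [1998, 2000, 1999, 2001, 2001, 1998],
    "type": ["a", "b", "c", "b", "a", "b"],
    "sale": [5, 10, 20, 15, 25, 5],
})

# Sum only "sale", then move "type" from the index into the columns
by_year_type = data.groupby(["year", "type"])["sale"].sum().unstack(fill_value=0)
print(by_year_type)

# One bar per year, one stacked segment per type
ax = by_year_type.plot(kind="bar", stacked=True)
ax.set_ylabel("sale")
```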
Note:
See the Pandas Documentation on visualization for more information about achieving the layout you want.