Let's explore an example with the well-known mpg dataset:
The first level is the index, say model_year.
The second level is the quantitative variable, which in this case is a simple count.
The third level is a categorical variable. In our example, that would be the origin.
At this point I expect to see a regular countplot composed of indexes, bars and colors. Pretty standard!
import seaborn as sns
mpg = sns.load_dataset('mpg')
sns.countplot(x='model_year', data=mpg, hue='origin')
Ok! So what do I mean by "four levels of information"?
Here comes the actual question.
Let's say I want to stack the bars with the count per cylinder. In other words, each segment of a bar would represent the number of cars produced with a specific number of cylinders.
For instance, for the year 70 the usa (blue bar) would have two segments (of 22 cars, 18 had 8 cylinders and 4 had 6 cylinders). Europe and Japan would have only one segment each, since only 4-cylinder cars were produced in these regions that year.
mpg[mpg['model_year'] == 70].groupby(['origin', 'cylinders'])['mpg'].count()
origin  cylinders
europe  4             5
japan   4             2
usa     6             4
        8            18
Name: mpg, dtype: int64
It is worth mentioning that the identification strategy (hatching, maybe?) for the fourth information level should be consistent across the plot (the same for the same number of cylinders), and a second legend will be necessary.
How can I achieve this with pandas and matplotlib, preferably without additional modules?
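One possible approach, sketched here with pandas and matplotlib only: count per (model_year, origin, cylinders), draw one bar position per origin within each year, and stack the cylinder counts with a fixed hatch per cylinder value. The `cars` frame below is a small invented stand-in for the mpg data, and the color and hatch choices are assumptions, not a definitive implementation.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# Hypothetical stand-in for the mpg data (values invented for illustration).
cars = pd.DataFrame({
    "model_year": [70, 70, 70, 70, 70, 70, 71, 71, 71, 71],
    "origin": ["usa", "usa", "usa", "europe", "japan", "usa",
               "usa", "europe", "japan", "japan"],
    "cylinders": [8, 8, 6, 4, 4, 8, 6, 4, 4, 4],
})

# Count cars per (year, origin); each cylinder value becomes a column.
counts = (cars.groupby(["model_year", "origin", "cylinders"]).size()
              .unstack("cylinders", fill_value=0))

years = counts.index.get_level_values("model_year").unique()
origins = counts.index.get_level_values("origin").unique()
colors = {o: f"C{i}" for i, o in enumerate(origins)}               # one color per origin
hatches = dict(zip(counts.columns, ["", "//", "xx", "..", "oo"]))  # one hatch per cylinder value

width = 0.8 / len(origins)
x = np.arange(len(years))
fig, ax = plt.subplots()
for i, origin in enumerate(origins):
    # Rows of this origin, aligned to every year (missing years count as 0).
    sub = counts.xs(origin, level="origin").reindex(years, fill_value=0)
    bottom = np.zeros(len(years))
    for cyl in counts.columns:
        heights = sub[cyl].to_numpy()
        ax.bar(x + i * width, heights, width, bottom=bottom,
               color=colors[origin], hatch=hatches[cyl], edgecolor="black")
        bottom += heights

ax.set_xticks(x + width * (len(origins) - 1) / 2)
ax.set_xticklabels([str(y) for y in years])

# Two legends: colors for origin, hatches for cylinders.
first = ax.legend(handles=[Patch(facecolor=colors[o], label=o) for o in origins],
                  title="origin", loc="upper left")
ax.add_artist(first)  # keep the first legend when the second one is created
ax.legend(handles=[Patch(facecolor="white", edgecolor="black",
                         hatch=hatches[c], label=str(c)) for c in counts.columns],
          title="cylinders", loc="upper right")
```

With the real data, `cars` would be replaced by the mpg frame; call `plt.show()` to display the figure.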
I am fairly new to Python and its terminology and can be clumsy at describing the problem. Sorry for that.
What I have is three cities that produced three fruits over two years, and I need to draw the single static chart that best summarizes the data.
The fact that the dataframe has three categorical variables (city, fruits and year) and one measure confuses me.
At first I tried a stacked bar chart, but if I put the fruits in the bars and the cities on the x-axis, I cannot find a place for the year.
I tried the pivot method to convert the year values into measures, but then I could not get any further with two measures.
I mainly used Matplotlib.
Any help is appreciated.
import pandas as pd
data = {
'city':['amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','amsterdam','paris','paris','paris','paris','paris','paris','berlin','berlin','berlin','berlin','berlin','berlin'],
'fruits':['apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas','apples','oranges','bananas'],
'year':[2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001,2000,2000,2000,2001,2001,2001],
'amount':[384,289,347,242,390,274,175,334,245,116,252,366,255,400,300,240,600,180]
}
df=pd.DataFrame(data)
df.head()
        city   fruits  year  amount
0  amsterdam   apples  2000     384
1  amsterdam  oranges  2000     289
2  amsterdam  bananas  2000     347
3  amsterdam   apples  2001     242
4  amsterdam  oranges  2001     390
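One way to fit all three categorical variables into a single static chart, as a rough sketch: put (city, year) pairs on the x-axis and stack the fruits inside each bar. The frame is rebuilt here from the values in the question; the variable name `wide` and the figure size are illustrative choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Same data as in the question, built in city -> year -> fruit order.
idx = pd.MultiIndex.from_product(
    [["amsterdam", "paris", "berlin"], [2000, 2001], ["apples", "oranges", "bananas"]],
    names=["city", "year", "fruits"],
)
amount = [384, 289, 347, 242, 390, 274, 175, 334, 245,
          116, 252, 366, 255, 400, 300, 240, 600, 180]
df = pd.DataFrame({"amount": amount}, index=idx).reset_index()

# One row per (city, year), one column per fruit.
wide = df.pivot_table(index=["city", "year"], columns="fruits",
                      values="amount", aggfunc="sum")

# Each bar is a (city, year) pair; the fruit amounts are stacked inside it.
ax = wide.plot.bar(stacked=True, figsize=(8, 4))
ax.set_ylabel("amount")
```

If comparing years matters more than comparing fruits, swapping the roles (fruits in the index, year in the columns) is an equally valid layout.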
From my original data frame, I used the group-by to create the new df as shown below, which has the natural disaster subtype counts for each country.
However, I'm unsure how to, for example, select 4 specific countries and set them as variables in a 2 by 2 plot.
The X-axis will be the disaster subtype name, with the Y being the value count, however, I can't quite figure out the right code to select this information.
This is how I grouped the countries -
c_grp = df_geo.groupby(['Country'])
c_val = c_grp['Disaster Subtype'].value_counts().to_frame('Num of Disaster')
c_val.head(40)
Output:
Country         Disaster Subtype          Num of Disaster
Afghanistan     Riverine flood            45
                Ground movement           33
                Flash flood               32
                Avalanche                 19
                Drought                    8
                Bacterial disease          7
                Convective storm           6
                Landslide                  6
                Cold wave                  5
                Viral disease              5
                Mudslide                   3
                Severe winter conditions   2
                Forest fire                1
                Locust                     1
                Parasitic disease          1
Albania         Ground movement           16
                Riverine flood             8
                Severe winter conditions   3
                Convective storm           2
                Flash flood                2
                Heat wave                  2
                Avalanche                  1
                Coastal flood              1
                Drought                    1
                Forest fire                1
                Viral disease              1
Algeria         Ground movement           21
                Riverine flood            20
                Flash flood                8
                Bacterial disease          2
                Cold wave                  2
                Forest fire                2
                Coastal flood              1
                Drought                    1
                Heat wave                  1
                Landslide                  1
                Locust                     1
American Samoa  Tropical cyclone           4
                Flash flood                1
                Tsunami                    1
However, let's say I want to select four of these countries and draw four plots, one per country, showing the number of each type of disaster in that country. I know I would need something along the lines of the code below, but I'm unsure how to set the x and y variables for each one. If there is a more efficient way to set the variables or plot, that would be great. Usually I would just use loc or iloc, but here I need to be more specific with the selection.
fig, ax = plt.subplots(2, 2, figsize=(16, 10))
X1 = c_val.loc['Country'] == 'Afghanistan' #This doesn't work, just need something similar
y1 = c_val.loc['Num of Disasters']
X2 =
y2 =
X3 =
y3 =
X4 =
y4 =
ax[0,0].bar(X1,y1,width=.4, color=['#A2BDF2'])
ax[0,1].bar(X2,y2,width=.4,color=['#A2BDF2'])
ax[1,0].bar(X3,y3,width=.4,color=['#A2BDF2'])
ax[1,1].bar(X4,y4,width=.4,color=['#A2BDF2'])
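With matplotlib alone, one option is to select each country from the first level of the MultiIndex and loop over the axes instead of defining X1..X4 by hand. This is a minimal sketch: the `c_val` frame below is a small stand-in built from a few of the counts shown above, and the country list is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small stand-in for c_val, using a few counts from the output above.
c_val = pd.DataFrame(
    {"Num of Disaster": [45, 33, 32, 16, 8, 21, 20, 4, 1, 1]},
    index=pd.MultiIndex.from_tuples(
        [("Afghanistan", "Riverine flood"), ("Afghanistan", "Ground movement"),
         ("Afghanistan", "Flash flood"), ("Albania", "Ground movement"),
         ("Albania", "Riverine flood"), ("Algeria", "Ground movement"),
         ("Algeria", "Riverine flood"), ("American Samoa", "Tropical cyclone"),
         ("American Samoa", "Flash flood"), ("American Samoa", "Tsunami")],
        names=["Country", "Disaster Subtype"],
    ),
)

countries = ["Afghanistan", "Albania", "Algeria", "American Samoa"]
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
for ax, country in zip(axes.flat, countries):
    # Cross-section for one country: a Series indexed by disaster subtype.
    counts = c_val.xs(country, level="Country")["Num of Disaster"]
    ax.bar(list(counts.index), counts.to_numpy(), width=0.4, color="#A2BDF2")
    ax.set_title(country)
    ax.tick_params(axis="x", labelrotation=90)
fig.tight_layout()
```

Changing the `countries` list is enough to plot a different set of four.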
IIUC, a simple way is to use catplot from the seaborn package:
# Python env: pip install seaborn
# Anaconda env: conda install seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Assumption: df is the grouped frame flattened into columns, e.g. df = c_val.reset_index()
g = sns.catplot(x='Disaster Subtype', y='Num of Disaster', col='Country',
                data=df, col_wrap=2, kind='bar')
g.set_xticklabels(rotation=90)
g.tight_layout()
plt.show()
Update
How can I select the specific countries to be plotted in each subplot?
subdf = df.loc[df['Country'].isin(['Albania', 'Algeria'])]
g = sns.catplot(x='Disaster Subtype', y='Num of Disaster', col='Country',
data=subdf, col_wrap=2, kind='bar')
...
I have two dataframes, Region_education_0 and Region_education_1:
Region_education_0
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
5      Melanesia              605
Region_education_1
index  Region                 ConvertedComp
1      Australia/New Zealand  122573.834171
2      Caribbean              53562.111111
3      Central Asia           134422.000000
4      East Asia              112492.507042
Index 5 (Melanesia) is not present in Region_education_1 because of a condition. I want to compare the two dataframes and plot the result, so I tried this:
from matplotlib.pyplot import *
Region_education_combined=Region_education_0.merge(Region_education_1,left_on="Region",right_on="Region")
Region_education_combined.columns=["Region","Max of Bachelors Higher Ed","Higher Formal Education"]
Region_education_combined['Diff_HigherEd_Vals'] = Region_education_combined['Higher Formal Education'] - Region_education_combined['Max of Bachelors Higher Ed']
print(Region_education_combined)
comp_df.style.bar(subset=['Diff_HigherEd_Vals'], align='mid', color=['#d65f5f', '#5fba7d'])
index  Max of Bachelors Higher Ed  Higher Formal Education  Diff_HigherEd_Vals
1      151698.500659               122573.834171            -29124.666488
2      28413.753425                53562.111111             53562.111111
3      3944.750000                 5883.000000              1938.250000
4      45091.041667                27052.384615             -18038.657051
The Region column is missing from the output. To include it, I tried
comp_df.style.bar(subset=['Diff_HigherEd_Vals','Region'], align='mid', color=['#d65f5f', '#5fba7d'])
and
comp_df.style.bar(Region_education_combined, align='mid', color=['#d65f5f', '#5fba7d'])
Is there any way to include Region in the final output?
Also, I left "Index 5, Melanesia" from the Region_education_0 dataframe out. Is there any way to include that in the output too?
You can keep the missing Region by passing how="outer" when you call merge, like this:
Region_education_combined=Region_education_0.merge(Region_education_1,left_on="Region",right_on="Region",how="outer")
Note that in this case the table will contain NaN wherever a row has no match. In your case, Melanesia will have NaN in the Higher Formal Education column. To avoid problems, you can set a default value like this:
Region_education_combined["Higher Formal Education"] = Region_education_combined["Higher Formal Education"].fillna(0)
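As a quick illustration of the outer merge, here is a self-contained sketch. The two frames, their values, and the `_0`/`_1` suffixes are made up for the example; only the `how="outer"` merge and the `fillna` step come from the answer above.

```python
import pandas as pd

# Toy frames with the same shape as in the question (values are invented).
region_0 = pd.DataFrame({
    "Region": ["Australia/New Zealand", "Caribbean", "Melanesia"],
    "ConvertedComp": [120000.0, 50000.0, 605.0],
})
region_1 = pd.DataFrame({
    "Region": ["Australia/New Zealand", "Caribbean"],
    "ConvertedComp": [150000.0, 28000.0],
})

# how="outer" keeps regions that appear in only one of the frames.
combined = region_0.merge(region_1, on="Region", how="outer", suffixes=("_0", "_1"))

# Melanesia has no match in region_1, so its value is NaN until filled.
combined["ConvertedComp_1"] = combined["ConvertedComp_1"].fillna(0)
combined["Diff"] = combined["ConvertedComp_1"] - combined["ConvertedComp_0"]
print(combined)
```

Because Region is an ordinary column of `combined`, it stays visible when the frame is styled; `style.bar(subset=[...])` only restricts which columns get bars drawn in them.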
I would like to draw a stacked bar plot from a CSV file in Python. I have three columns of data:
year word frequency
2018 xyz 12
2017 gfh 14
2018 sdd 10
2015 fdh 1
2014 sss 3
2014 gfh 12
2013 gfh 2
2012 gfh 4
2011 wer 5
2010 krj 4
2009 krj 4
2019 bfg 4
... 300+ rows of data.
I need to go through all the data and draw a stacked bar plot categorized by year: the x-axis is the word, the y-axis is the frequency, and the legend color indicates the year. I want to see how each word evolved year by year. Some of the technology words are used in every year, so the stacked bars should add each year's value on top of the previous ones. For example, the word gfh plots 14 for 2017, and then for 2014 I want gfh to add a segment of 12 (in a different color) on top of the 2017 one. How do I do this?
So far I have loaded the CSV file in my code, but I don't understand how to go over all the rows and stack the words appropriately (since some words repeat across the years). The years are in random order in the CSV, but I sorted them by year to make it easier. I am just learning Python and trying to understand this plotting routine. I have 40 years of data and about 20 words, so I thought a stacked bar plot would be the best way to represent them. Any other visualization method is also welcome. Any help is highly appreciated.
This can be done using pandas:
import pandas as pd
df = pd.read_csv("file.csv")
# Aggregate data
df = df.groupby(["word", "year"], as_index=False).agg({"frequency": "sum"})
# Create list to sort by
sorter = (
    df.groupby(["word"], as_index=False)
    .agg({"frequency": "sum"})
    .sort_values("frequency")["word"]
    .values
)
# Pivot, reindex, and plot
df = df.pivot(index="word", columns="year", values="frequency")
df = df.reindex(sorter)
df.plot.bar(stacked=True)
Which outputs a stacked bar chart with one bar per word and one colored segment per year (image not included).
This is my pandas dataframe df:
          ab     channel  booked
0    control     book_it     466
1    control  contact_me     536
2    control     instant      17
3  treatment     book_it     494
4  treatment  contact_me      56
5  treatment     instant      22
I want to plot 3 groups of bars (one group per channel):
for each channel:
plot the control booked value vs the treatment booked value.
Hence I should get 6 bars in 3 groups, where each group has the control and treatment booked values.
So far I was only able to plot booked, but not grouped by ab:
ax = df_conv['booked'].plot(kind='bar',figsize=(15,10), fontsize=12)
ax.set_xlabel('dim_contact_channel',fontsize=12)
ax.set_ylabel('channel',fontsize=12)
plt.show()
This is what I want (it only shows 4 bars, but this is the gist):
Pivot the dataframe so control and treatment values are in separate columns.
df.pivot(index='channel', columns='ab', values='booked').plot(kind='bar')
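For reference, a self-contained version of the same idea using the data from the question might look like this. The axis labels and `rot=0` are cosmetic assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Data from the question.
df = pd.DataFrame({
    "ab": ["control"] * 3 + ["treatment"] * 3,
    "channel": ["book_it", "contact_me", "instant"] * 2,
    "booked": [466, 536, 17, 494, 56, 22],
})

# One row per channel, one column per ab group.
wide = df.pivot(index="channel", columns="ab", values="booked")

# Each row becomes a group of bars; each column becomes one bar in the group.
ax = wide.plot(kind="bar", rot=0)
ax.set_xlabel("channel")
ax.set_ylabel("booked")
```

The legend is created from the column names, so control and treatment are labeled automatically.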