Pandas dataframe | groupby plotting | stacked and side by side graph - python

I am coming from an R ggplot2 background and am a bit confused by matplotlib plotting.
Here is my dataframe:
languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})
language county count
0 en us 32
1 cs ch 432
2 es sp 43
3 pt br 55
4 hi in 6
5 en fr 23
6 es ar 455
7 es pr 23
Now I want to plot:
A stacked bar chart where the x axis shows the language and the y axis shows the count: the total height of each bar is the total count for that language, and the stacked bars show the number of countries for that language.
A side-by-side chart with the same parameters, only with the countries shown side by side instead of stacked.
Most of the examples plot directly from the dataframe with the pandas plot methods, but I want to build the plot in a sequential script so that I have more control over it and can edit whatever I want, something like this script:
ind = np.arange(df.languages.nunique())
width = 0.35
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
ax.bar(ind, df.languages, width, color='r')
ax.bar(ind, df.count, width,bottom=df.languages, color='b')
ax.set_ylabel('Count')
ax.set_title('Score y language and country')
ax.set_xticks(ind, df.languages)
ax.set_yticks(np.arange(0, 81, 10))
ax.legend(labels=[df.countries])
plt.show()
By the way, this is my pandas pivot code for the same plot:
df.pivot(index = "Language", columns = "Country", values = "count").plot.bar(figsize=(15,10))
plt.xticks(rotation = 0,fontsize=18)
plt.xlabel('Language' )
plt.ylabel('Count ')
plt.legend(fontsize='large', ncol=2,handleheight=1.5)
plt.show()

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

languages = ['en','cs','es', 'pt', 'hi', 'en', 'es', 'es']
counties = ['us','ch','sp', 'br', 'in', 'fr', 'ar', 'pr']
count = [32, 432,43,55,6,23,455,23]
df = pd.DataFrame({'language': languages,'county': counties, 'count' : count})

# build one row per language: number of countries and total count
modified = {}
modified['language'] = np.unique(df.language)
country_count = []
total_count = []
for x in modified['language']:
    country_count.append(len(df[df['language']==x]))
    total_count.append(df[df['language']==x]['count'].sum())
modified['country_count'] = country_count
modified['total_count'] = total_count
mod_df = pd.DataFrame(modified)
print(mod_df)

# stack the country count on top of the total count
ind = mod_df.language
width = 0.35
p1 = plt.bar(ind, mod_df.total_count, width)
p2 = plt.bar(ind, mod_df.country_count, width,
             bottom=mod_df.total_count)
plt.ylabel("Total count")
plt.xlabel("Languages")
plt.legend((p1[0], p2[0]), ('Total Count', 'Country Count'))
plt.show()
The code first reshapes the dataframe into the one below:
language country_count total_count
0 cs 1 432
1 en 2 55
2 es 3 521
3 hi 1 6
4 pt 1 55
This is the plot:
As the value of country count is small, you cannot clearly see the stacked country count.

import seaborn as sns
import matplotlib.pyplot as plt
figure, axis = plt.subplots(1,1,figsize=(10,5))
sns.barplot(x="language",y="count",data=df,ci=None)#,hue='county')
axis.set_title('Score y language and country')
axis.set_ylabel('Count')
axis.set_xlabel("Language")
sns.countplot(x=df.language,data=df)
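
If the stacked segments are meant to be one per country (that is one reading of the question, not the only one), a step-by-step matplotlib version could look like the sketch below. It pivots the data to one column per country and then calls ax.bar once per country, so every bar can be styled individually. The names wide, ax_stacked and ax_grouped are made up for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "language": ["en", "cs", "es", "pt", "hi", "en", "es", "es"],
    "county": ["us", "ch", "sp", "br", "in", "fr", "ar", "pr"],
    "count": [32, 432, 43, 55, 6, 23, 455, 23],
})

# One row per language, one column per country; missing pairs become 0.
wide = df.pivot_table(index="language", columns="county", values="count",
                      aggfunc="sum", fill_value=0)

x = np.arange(len(wide.index))
fig, (ax_stacked, ax_grouped) = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

# Stacked: each country's bar starts where the previous ones ended.
bottom = np.zeros(len(wide.index))
for country in wide.columns:
    heights = wide[country].to_numpy()
    ax_stacked.bar(x, heights, width=0.6, bottom=bottom, label=country)
    bottom += heights

# Side by side: shift each country's bars by one bar width within the slot.
n = len(wide.columns)
width = 0.8 / n
for k, country in enumerate(wide.columns):
    ax_grouped.bar(x + k * width, wide[country].to_numpy(), width=width,
                   label=country)

ax_stacked.set_xticks(x)
ax_stacked.set_xticklabels(wide.index)
ax_grouped.set_xticks(x + (n - 1) * width / 2)  # centre of each group
ax_grouped.set_xticklabels(wide.index)
ax_stacked.set_ylabel("Count")
ax_stacked.legend(title="country", ncol=2)
```

Call plt.show() or fig.savefig(...) afterwards to see the result. In the side-by-side panel most slots stay empty because each country occurs with only one language; that is a property of this data rather than of the approach.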

Related

Sorting the dataframe based on the count of one column and plot

I have two columns in my data frame:
winner opening_shortname
0 White Slav Defense
1 Black Nimzowitsch Defense
2 White King's Pawn Game
3 White Queen's Pawn Game
4 White Philidor Defense
... ... ...
20053 White Dutch Defense
20054 Black Queen's Pawn
20055 White Queen's Pawn Game
20056 White Pirc Defense
20057 Black Queen's Pawn Game
I want to create the plot below: the top 10 openings and the proportion (%) of each winner colour.
topk = 10
z = df.groupby(['opening_shortname', 'winner']).size().unstack()
ax = z.loc[z.sum(1).sort_values().tail(topk).index].plot.barh(color=['black', 'white'], edgecolor='black')
ax.xaxis.set_visible(False)
This sorts by prevalence of opening and limits to the top k (e.g. 10 in the OP's question). The "proportion (%)" mention in the question is ambiguous: the plot provided clearly shows decreasing totals from the top opening to the next ones, and the horizontal axis is removed.
Anyway, on the sample data you provided:
Assuming your dataframe is named df, you can groupby+count+unstack. Then sort on the sum and take the top 10 to plot:
df2 = (df.assign(count=1)
         .groupby(['winner', 'opening_shortname'])
         .count()
         .unstack(level=0)
         .droplevel(0, axis=1)
      )
# plot part
# sort_values() is ascending, so the 10 most frequent openings are the tail
idx = df2.sum(axis=1).sort_values().tail(10).index
(df2.div(df2.sum(axis=1), axis=0)  # calculate the proportion
    .fillna(0)
    .loc[idx, ['White', 'Black']]
    .plot.barh(color=['w', 'k'], edgecolor='k')
)
output:
First of all, you should re-shape your dataframe through:
df = df.groupby(by = ['opening_shortname', 'winner']).size().reset_index().rename(columns = {'opening_shortname': 'opening_shortname', 'winner': 'winner', 0: 'count'}).sort_values(['count', 'opening_shortname', 'winner'], ascending = False, ignore_index = True)
So you will get a dataframe like (fake data):
opening_shortname winner count
0 Queen's Pawn Game White 141
1 Queen's Pawn Game Black 132
2 Queen's Pawn White 57
3 Queen's Pawn Black 57
4 King's Pawn Game Black 57
5 Dutch Defense Black 53
6 Sicilian Defense White 51
7 Sicilian Defense Black 50
8 Nimzowitsch Defense White 46
9 Nimzowitsch Defense Black 45
10 Philidor Defense Black 44
11 Slav Defense White 43
12 Pirc Defense White 42
13 Slav Defense Black 39
14 Pirc Defense Black 38
15 King's Pawn Game White 38
16 Dutch Defense White 36
17 Philidor Defense White 31
Then you can plot your data, for example using seaborn.barplot:
sns.barplot(ax = ax, data = df, x = 'count', y = 'opening_shortname', hue = 'winner', palette = ['white', 'black'], edgecolor = 'black')
Complete Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r'data/data.csv')
df = df.groupby(by = ['opening_shortname', 'winner']).size().reset_index().rename(columns = {'opening_shortname': 'opening_shortname', 'winner': 'winner', 0: 'count'}).sort_values(['count', 'opening_shortname', 'winner'], ascending = False, ignore_index = True)
fig, ax = plt.subplots()
sns.barplot(ax = ax, data = df, x = 'count', y = 'opening_shortname', hue = 'winner', palette = ['white', 'black'], edgecolor = 'black')
plt.show()
If, in place of count, you want to plot the relative proportion, then you can add one line to the above code:
df['count'] = df['count']/df.groupby('opening_shortname')['count'].transform('sum')
Complete Code
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(r'data/data.csv')
df = df.groupby(by = ['opening_shortname', 'winner']).size().reset_index().rename(columns = {'opening_shortname': 'opening_shortname', 'winner': 'winner', 0: 'count'}).sort_values(['count', 'opening_shortname', 'winner'], ascending = False, ignore_index = True)
df['count'] = df['count']/df.groupby('opening_shortname')['count'].transform('sum')
fig, ax = plt.subplots()
sns.barplot(ax = ax, data = df, x = 'count', y = 'opening_shortname', hue = 'winner', palette = ['white', 'black'], edgecolor = 'black')
plt.show()
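
As a compact variant of the proportion step, pd.crosstab can build the count table directly. The sketch below uses a handful of made-up rows (not the real chess data) and a small topk so the result is easy to check by hand; counts, top and share are names chosen for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy rows for illustration only; the real dataframe has about 20,000 rows.
df = pd.DataFrame({
    "winner": ["White", "Black", "White", "White", "Black", "White"],
    "opening_shortname": ["Slav Defense", "Slav Defense", "Dutch Defense",
                          "Slav Defense", "Dutch Defense", "Pirc Defense"],
})

topk = 2  # use 10 for the plot in the question

# Rows: openings, columns: winner colour, cells: number of games.
counts = pd.crosstab(df["opening_shortname"], df["winner"])

# Most frequent openings first.
top = counts.sum(axis=1).nlargest(topk).index

# Divide each row by its total to get the share per colour.
share = counts.loc[top].div(counts.loc[top].sum(axis=1), axis=0)

# Reverse the order so the most frequent opening ends up at the top of barh.
ax = share.loc[top[::-1], ["White", "Black"]].plot.barh(
    stacked=True, color=["white", "black"], edgecolor="black")
ax.set_xlabel("proportion of games")
```

Stacking the two shares makes every bar the same length, so the plot shows proportion only; the ordering still carries the prevalence of each opening.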

How can I plot a secondary y-axis with seaborn's barplot?

I'm trying to plot the data (see below) with company_name on the x-axis, status_mission_2_y on one y-axis and percentage on the other y-axis. I have tried using the twinx() function but I can't get it to work.
Please can you help? Thanks in advance!
def twinplot(data):
    x_ = data.columns[0]
    y_ = data.columns[1]
    y_2 = data.columns[2]
    data1 = data[[x_, y_]]
    data2 = data[[x_, y_2]]
    plt.figure(figsize=(15, 8))
    ax = sns.barplot(x=x_, y=y_, data=data1)
    ax2 = ax.twinx()
    g2 = sns.barplot(x=x_, y=y_2, data=data2, ax=ax2)
    plt.show()
data = ten_company_missions_failed
twinplot(data)
company_name percentage status_mission_2_y
EER 1 1
Ghot 1 1
Trv 1 1
Sandia 1 1
Test 1 1
US Navy 0.823529412 17
Zed 0.8 5
Gov 0.75 4
Knight 0.666666667 3
Had 0.666666667 3
Seaborn plots the two bar plots with the same color and at the same x-positions, so the second set of bars covers the first.
The following example code shrinks the bar widths, moving the bars belonging to ax to the left and the bars of ax2 to the right. To differentiate the right bars, semi-transparency (alpha=0.7) and hatching are used.
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import pandas as pd
import seaborn as sns
from io import StringIO
data_str = '''company_name percentage status_mission_2_y
EER 1 1
Ghot 1 1
Trv 1 1
Sandia 1 1
Test 1 1
"US Navy" 0.823529412 17
Zed 0.8 5
Gov 0.75 4
Knight 0.666666667 3
Had 0.666666667 3'''
data = pd.read_csv(StringIO(data_str), sep=r'\s+')  # delim_whitespace=True is deprecated in recent pandas
x_ = data.columns[0]
y_ = data.columns[1]
y_2 = data.columns[2]
data1 = data[[x_, y_]]
data2 = data[[x_, y_2]]
plt.figure(figsize=(15, 8))
ax = sns.barplot(x=x_, y=y_, data=data1)
width_scale = 0.45
for bar in ax.containers[0]:
    bar.set_width(bar.get_width() * width_scale)
ax.yaxis.set_major_formatter(PercentFormatter(1))
ax2 = ax.twinx()
sns.barplot(x=x_, y=y_2, data=data2, alpha=0.7, hatch='xx', ax=ax2)
for bar in ax2.containers[0]:
    x = bar.get_x()
    w = bar.get_width()
    bar.set_x(x + w * (1- width_scale))
    bar.set_width(w * width_scale)
plt.show()
A simpler alternative could be to combine a barplot on ax and a lineplot on ax2.
plt.figure(figsize=(15, 8))
ax = sns.barplot(x=x_, y=y_, data=data1)
ax.yaxis.set_major_formatter(PercentFormatter(1))
ax2 = ax.twinx()
sns.lineplot(x=x_, y=y_2, data=data2, marker='o', color='crimson', lw=3, ax=ax2)
plt.show()
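
Another option, if seaborn is not required, is to draw both bar sets with plain matplotlib and offset them by half a bar width, which avoids editing the patches after the fact. This is only a sketch under that assumption; the colors and the names left and right are arbitrary choices.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

data = pd.DataFrame({
    "company_name": ["EER", "Ghot", "Trv", "Sandia", "Test",
                     "US Navy", "Zed", "Gov", "Knight", "Had"],
    "percentage": [1, 1, 1, 1, 1, 0.823529412, 0.8, 0.75,
                   0.666666667, 0.666666667],
    "status_mission_2_y": [1, 1, 1, 1, 1, 17, 5, 4, 3, 3],
})

x = np.arange(len(data))  # one slot per company
width = 0.4

fig, ax = plt.subplots(figsize=(15, 8))
ax2 = ax.twinx()  # second y-axis sharing the same x-axis

# Left half of each slot on the primary axis, right half on the secondary.
left = ax.bar(x - width / 2, data["percentage"], width,
              color="tab:blue", label="percentage")
right = ax2.bar(x + width / 2, data["status_mission_2_y"], width,
                color="tab:orange", label="status_mission_2_y")

ax.set_xticks(x)
ax.set_xticklabels(data["company_name"])
ax.yaxis.set_major_formatter(PercentFormatter(1))
ax.set_ylabel("percentage")
ax2.set_ylabel("status_mission_2_y")
ax.legend(handles=[left, right], loc="upper right")
```

Because the two bar sets live on different axes, a single legend has to be given both bar containers explicitly, as done in the last line.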

Python Matplotlib bars subplots by Category and Aggregation

I have a table like this:
data = {'Category':["Toys","Toys","Toys","Toys","Food","Food","Food","Food","Food","Food","Food","Food","Furniture","Furniture","Furniture"],
'Product':["AA","BB","CC","DD","SSS","DDD","FFF","RRR","EEE","WWW","LLLLL","PPPPPP","LPO","NHY","MKO"],
'QTY':[100,200,300,50,20,800,300,450,150,320,400,1000,150,900,1150]}
df = pd.DataFrame(data)
df
Out:
Category Product QTY
0 Toys AA 100
1 Toys BB 200
2 Toys CC 300
3 Toys DD 50
4 Food SSS 20
5 Food DDD 800
6 Food FFF 300
7 Food RRR 450
8 Food EEE 150
9 Food WWW 320
10 Food LLLLL 400
11 Food PPPPP 1000
12 Furniture LPO 150
13 Furniture NHY 900
14 Furniture MKO 1150
So, I need to make bars subplots like this (Sum Products in each Category):
My problem is that I can't figure out how to combine categories, series, and aggregation.
I managed to split them into 3 subplots (1 always stays blank), but I cannot combine them ...
import matplotlib.pyplot as plt
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))
df['Category'].value_counts().plot.bar(
ax=axarr[0][0], fontsize=12, color='b'
)
axarr[0][0].set_title("Category", fontsize=18)
df['Product'].value_counts().plot.bar(
ax=axarr[1][0], fontsize=12, color='b'
)
axarr[1][0].set_title("Product", fontsize=18)
df['QTY'].value_counts().plot.bar(
ax=axarr[1][1], fontsize=12, color='b'
)
axarr[1][1].set_title("QTY", fontsize=18)
plt.subplots_adjust(hspace=.3)
plt.show()
Out
What do I need to add to combine them?
This would be a lot easier with seaborn and FacetGrid
import pandas as pd
import seaborn as sns
data = {'Category':["Toys","Toys","Toys","Toys","Food","Food","Food","Food","Food","Food","Food","Food","Furniture","Furniture","Furniture"],
'Product':["AA","BB","CC","DD","SSS","DDD","FFF","RRR","EEE","WWW","LLLLL","PPPPPP","LPO","NHY","MKO"],
'QTY':[100,200,300,50,20,800,300,450,150,320,400,1000,150,900,1150]}
df = pd.DataFrame(data)
g = sns.FacetGrid(df, col='Category', sharex=False, sharey=False, col_wrap=2, height=3, aspect=1.5)
g.map_dataframe(sns.barplot, x='Product', y='QTY')
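
If the target figure is one panel with the summed QTY per category plus one panel per category (an assumption, since the picture from the question is not available here), a plain matplotlib and pandas sketch could be:

```python
import pandas as pd
import matplotlib.pyplot as plt

data = {'Category': ["Toys", "Toys", "Toys", "Toys", "Food", "Food", "Food",
                     "Food", "Food", "Food", "Food", "Food",
                     "Furniture", "Furniture", "Furniture"],
        'Product': ["AA", "BB", "CC", "DD", "SSS", "DDD", "FFF", "RRR", "EEE",
                    "WWW", "LLLLL", "PPPPPP", "LPO", "NHY", "MKO"],
        'QTY': [100, 200, 300, 50, 20, 800, 300, 450, 150, 320, 400, 1000,
                150, 900, 1150]}
df = pd.DataFrame(data)

# Aggregation step: total quantity per category.
totals = df.groupby("Category")["QTY"].sum()

fig, axes = plt.subplots(2, 2, figsize=(12, 8))
axes = axes.ravel()  # flatten the 2x2 grid to index panels 0..3

# Panel 0: one bar per category with the summed quantity.
totals.plot.bar(ax=axes[0], rot=0, title="Category (sum of QTY)")

# Panels 1..3: the products of one category each.
for ax, (name, group) in zip(axes[1:], df.groupby("Category")):
    ax.bar(group["Product"], group["QTY"])
    ax.set_title(name)

fig.tight_layout()
```

Iterating over df.groupby("Category") yields the category name together with its rows, which is what ties the category, the product series and the aggregation together.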

Create a Radar Chart in python for each row of a Panda dataframe

I am using pandas to assign a score to some gamers.
I computed, using the same KPIs, some attributes for every gamer and now I have, for each player, a row with the results.
The dataframe looks like this (the only difference is that it has more columns) :
| Name | Speed | ATK |
| G1 | 0.32 | 0.89 |
| G4 | 0.31 | 0.76 |
I thought it would be nice to plot a radar chart (https://en.wikipedia.org/wiki/Radar_chart)
for each row using matplotlib (if possible).
How would you do it?
Is there a better alternative to matplotlib?
Thanks.
To get this spider-look, you need at least three columns. So, I've added a Random column to your dataframe:
import pandas as pd
df = pd.DataFrame({"Name": ["G1", "G4"],
"Speed": [0.32, 0.31],
"ATK": [0.89, 0.76],
"Random": [0.4, 0.8]})
print(df)
# Name Speed ATK Random
#0 G1 0.32 0.89 0.4
#1 G4 0.31 0.76 0.8
Now, let's see how to plot this simple dataframe. The following code is adapted from this blog post:
# import necessary modules
import numpy as np
import matplotlib.pyplot as plt
from math import pi
# obtain df information
categories = list(df)[1:]
values = df.mean(numeric_only=True).values.flatten().tolist()  # skip the non-numeric Name column
values += values[:1] # repeat the first value to close the circular graph
angles = [n / float(len(categories)) * 2 * pi for n in range(len(categories))]
angles += angles[:1]
# define plot
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 8),
subplot_kw=dict(polar=True))
plt.xticks(angles[:-1], categories, color='grey', size=12)
plt.yticks(np.arange(0.5, 2, 0.5), ['0.5', '1.0', '1.5'],
color='grey', size=12)
plt.ylim(0, 2)
ax.set_rlabel_position(30)
# draw radar-chart:
for i in range(len(df)):
    val_c1 = df.loc[i].drop('Name').values.flatten().tolist()
    val_c1 += val_c1[:1]
    ax.plot(angles, val_c1, linewidth=1, linestyle='solid',
            label=df.loc[i]["Name"])
    ax.fill(angles, val_c1, alpha=0.4)
# add legend and show plot
plt.legend(loc='upper right', bbox_to_anchor=(0.1, 0.1))
plt.show()
Which results in a graph like so:
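
The code above overlays all players on one chart. Since the question asks for one chart per row, a variant with one polar subplot per player could look like the sketch below. It assumes the scores lie between 0 and 1 (hence the fixed radial limit) and reuses the extra Random column from above.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Name": ["G1", "G4"],
                   "Speed": [0.32, 0.31],
                   "ATK": [0.89, 0.76],
                   "Random": [0.4, 0.8]})

categories = [c for c in df.columns if c != "Name"]

# One angle per category, then repeat the first to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(categories), endpoint=False).tolist()
angles += angles[:1]

# squeeze=False keeps a 2-D array of axes even when there is a single row.
fig, axes = plt.subplots(1, len(df), figsize=(5 * len(df), 5),
                         subplot_kw=dict(polar=True), squeeze=False)

for ax, (_, row) in zip(axes.ravel(), df.iterrows()):
    values = row[categories].astype(float).tolist()
    values += values[:1]  # close the polygon
    ax.plot(angles, values, linewidth=1)
    ax.fill(angles, values, alpha=0.4)
    ax.set_xticks(angles[:-1])
    ax.set_xticklabels(categories)
    ax.set_ylim(0, 1)  # assumption: scores are normalised to [0, 1]
    ax.set_title(row["Name"])

fig.tight_layout()
```

For many players, a grid with several rows and columns would scale better than a single row of subplots.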

Matplotlib: subplot bar charts with height labels

I have a dataframe, df, like:
continent | country | counts
------------------------------
East Asia | Hong Kong | 33
East Asia | Japan | 51
Europe | Austria | 10
Europe | Belgium | 3
Europe | Denmark | 15
I want to plot two vertical bar charts, one for each continent, side by side, sharing the same y axis. I've gotten 90% of the way, except for adding the heights of the bars to the subplots. My code so far:
continents_ls = list(set(df["continent"]))
# continents_ls = ["East Asia", "Europe"]
fig, ax = plt.subplots(1, len(continents_ls), figsize=(30, len(continents_ls)*5), sharey=True)
for i in range(len(continents_ls)):
    d_temp = df.loc[df["continent"] == continents_ls[i]].groupby("country").size().to_frame().reset_index()
    # d_temp is the partition containing info for just one continent
    d_temp.columns = ["country", "count"] # name the 'count' column
    idx = list(d_temp["country"]) # get the list of countries in that continent
    ht_arr = list(d_temp["count"])
    ax[i].bar(left=range(len(ht_arr)), height=ht_arr)
    ax[i].set_xticks(np.arange(len(idx)))
    ax[i].set_xticklabels(idx, size=8, rotation=45)
    ax[i].set_title(continents_ls[i], size=23)
    ax[i].set_yticklabels(ht_arr, minor=False)
plt.tight_layout()
plt.show()
I've seen examples here and there with labels, but these tend to apply to just one bar chart, not several subplots.
You could do this with only a slight modification to your code. Using this answer: https://stackoverflow.com/a/34598688/42346
for i in range(len(continents_ls)):
    d_temp = df.loc[df["continent"] == continents_ls[i]].groupby("country").size().to_frame().reset_index()
    # d_temp is the partition containing info for just one continent
    d_temp.columns = ["country", "count"] # name the 'count' column
    idx = list(d_temp["country"]) # get the list of countries in that continent
    ht_arr = list(d_temp["count"])
    ax[i].bar(x=range(len(ht_arr)), height=ht_arr) # newer Matplotlib names this argument x, not left
    ax[i].set_xticks(np.arange(len(idx)))
    ax[i].set_xticklabels(idx, size=8, rotation=45)
    ax[i].set_title(continents_ls[i], size=23)
    ax[i].set_yticklabels(ht_arr, minor=False)
    if i == 0: # only for the first barplot
        for p in ax[i].patches:
            ax[i].annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()), ha='center', va='center', xytext=(0, 10), textcoords='offset points')
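
On Matplotlib 3.4 or newer, Axes.bar_label can write the heights without looping over the patches. The sketch below labels every subplot and, unlike the code in the question (which counts rows per country with size()), assumes the counts column itself holds the bar heights.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "continent": ["East Asia", "East Asia", "Europe", "Europe", "Europe"],
    "country": ["Hong Kong", "Japan", "Austria", "Belgium", "Denmark"],
    "counts": [33, 51, 10, 3, 15],
})

continents = sorted(df["continent"].unique())
fig, axes = plt.subplots(1, len(continents), figsize=(10, 4),
                         sharey=True, squeeze=False)

labels_by_continent = {}
for ax, continent in zip(axes.ravel(), continents):
    part = df[df["continent"] == continent]
    bars = ax.bar(part["country"], part["counts"])
    # bar_label places one text per bar, just above its top edge.
    texts = ax.bar_label(bars, padding=3)
    labels_by_continent[continent] = [t.get_text() for t in texts]
    ax.set_title(continent)
    ax.tick_params(axis="x", labelrotation=45)

fig.tight_layout()
```

The labels_by_continent dictionary is only there to make the label texts easy to inspect; it is not needed for the plot itself.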
