Matplotlib stacked bar chart set column order - python

I am plotting a stacked bar chart for a few fixed quantities using the usual:
bar = df.plot.barh(x='Intervention', y={'Remuneration','Training','Supervision'}, stacked=True )
I however noticed that depending on the input dataset, matplotlib changes the order of the plotted columns. For instance, sometimes it plots Remuneration as the first component of the stacked bar chart, whereas in other occasions it changes it to Training or Supervision. To be honest, I haven't been able to figure out what is the order being used. Ideally I'd like to keep always the same order as I have a list of colors to be used. Is there any way to force this re-ordering? Eg that the stacked bar chart always appears as Remuneration-Training-Supervision?

To set a custom order for the stacked bars, make the columns a CategoricalIndex whose categories are listed in the order you want, then sort the columns by that index. This puts the three categories in the order you need. A small example is below.
Data 'df'
Intervention  Remuneration  Training  Supervision
A             21            4         12
B             41            5         21
C             33            6         7
Code
#Convert Intervention as index, so columns are the categories
df = df.set_index('Intervention')
#Set categories... Order will be Remuneration-Training-Supervision
df.columns=pd.CategoricalIndex(df.columns.values, ordered=True, categories=['Remuneration','Training','Supervision'])
#Sort the data
df = df.sort_index(axis=1)
#...and plot
bar = df.plot.barh(stacked=True)
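A possibly simpler route, offered as an assumption about the cause rather than something confirmed in the question: the `y` argument in the question is written with curly braces, which makes it a Python set, and sets do not guarantee iteration order. Passing a list instead should keep the stacking order fixed. A minimal sketch using the sample data above; the colour list is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    'Intervention': ['A', 'B', 'C'],
    'Remuneration': [21, 41, 33],
    'Training': [4, 5, 6],
    'Supervision': [12, 21, 7],
})

# A list keeps its order; a set ({...}) does not guarantee one
order = ['Remuneration', 'Training', 'Supervision']
colors = ['tab:blue', 'tab:orange', 'tab:green']  # hypothetical colours, one per column

ax = df.plot.barh(x='Intervention', y=order, stacked=True, color=colors)

# The legend entries follow the order of the list passed as y
legend_labels = [t.get_text() for t in ax.get_legend().get_texts()]
print(legend_labels)
```

With a fixed list, each colour stays paired with the same column regardless of the input dataset.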

Related

How to create grouped and stacked bars

I have a very large dataset with many subsidiaries serving three customer groups in various countries, something like this (in reality there are many more subsidiaries and dates):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'subsidiary': ['EU','EU','EU','EU','EU','EU','EU','EU','EU','US','US','US','US','US','US','US','US','US'],'date': ['2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05'],'business': ['RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC','RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC'],'value': [500.36,600.45,700.55,750.66,950.89,1300.13,100.05,120.00,150.01,800.79,900.55,1000,3500.79,5000.36,4500.25,50.17,75.25,90.33]})
print(df)
I'd like to make an analysis per subsidiary by producing a stacked bar chart. To do this, I started by defining the x-axis to be the unique months and by defining a subset per business type in a country like this:
x=df['date'].drop_duplicates()
EUCORP = df[(df['subsidiary']=='EU') & (df['business']=='CORP')]
EURETAIL = df[(df['subsidiary']=='EU') & (df['business']=='RETAIL')]
EUPUBLIC = df[(df['subsidiary']=='EU') & (df['business']=='PUBLIC')]
I can then make a bar chart per business type:
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35)
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35)
However, if I try to stack all three together in one chart, I keep failing:
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35, bottom=EURETAIL)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35, bottom=EURETAIL+EUCORP)
plt.show()
I always receive the below error message:
ValueError: Missing category information for StrCategoryConverter; this might be caused by unintendedly mixing categorical and numeric data
ConversionError: Failed to convert value(s) to axis units: subsidiary date business value
0 EU 2019-03 RETAIL 500.36
1 EU 2019-04 RETAIL 600.45
2 EU 2019-05 RETAIL 700.55
I tried converting the months into the dateformat and/or indexing it, but it actually confused me further...
I would really appreciate any help on any of the following, as I have already spent many hours trying to figure this out (I am still a Python noob, sorry):
How can I fix the error to create a stacked bar chart?
Assuming the error can be fixed, is this the most efficient way to create the bar chart (e.g. do I really need to create three sub-dfs per subsidiary, or is there a more elegant way)?
Would it be possible to code a loop that produces a stacked bar chart by country, so that I don't need to create one per subsidiary?
As an FYI, stacked bars are not the best option, because they can make it difficult to compare bar values and can easily be misinterpreted. The purpose of a visualization is to present data in an easily understood format; make sure the message is clear. Side-by-side bars are often a better option.
Side-by-side stacked bars are a difficult manual process to construct. It's better to use a figure-level method like seaborn.catplot, which will create a single, easy-to-read data visualization.
Bar plot ticks are located by a 0-indexed range (not datetimes) and the dates are just labels, so it is not necessary to convert them to a datetime dtype.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
seaborn
import seaborn as sns
sns.catplot(kind='bar', data=df, col='subsidiary', x='date', y='value', hue='business')
Create grouped and stacked bars
See Stacked Bar Chart and Grouped bar chart with labels
The issue with the creation of the stacked bars in the OP is that bottom is being set to the entire dataframe for that group, instead of only the values that make up the bar height.
"Do I really need to create three sub-dfs per subsidiary?" Yes, a DataFrame is needed for every group, so 6 in this case.
Creating the data subsets can be automated using a dict-comprehension to unpack the .groupby object into a dict.
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])} to create a dict of DataFrames
Access the values like: data['EUCORP'].value
Automating the plot creation is more arduous: as can be seen below, x depends on how many stacks of bars there are at each tick, and bottom depends on the values of the bars already plotted beneath.
import numpy as np
import matplotlib.pyplot as plt
labels=df['date'].drop_duplicates() # set the dates as labels
x0 = np.arange(len(labels)) # create an array of values for the ticks that can perform arithmetic with width (w)
# create the data groups with a dict comprehension and groupby
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])}
# build the plots
subs = df.subsidiary.unique()
stacks = len(subs) # how many stacks in each group for a tick location
business = df.business.unique()
# set the width
w = 0.35
# this needs to be adjusted based on the number of stacks; each location needs to be split into the proper number of locations
x1 = [x0 - w/stacks, x0 + w/stacks]
fig, ax = plt.subplots()
for x, sub in zip(x1, subs):
    bottom = 0
    for bus in business:
        height = data[f'{sub}{bus}'].value.to_numpy()
        ax.bar(x=x, height=height, width=w, bottom=bottom)
        bottom += height
ax.set_xticks(x0)
_ = ax.set_xticklabels(labels)
As you can see, small values are difficult to discern, and using ax.set_yscale('log') does not work as expected with stacked bars (e.g. it does not make small values more readable).
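For a single subsidiary, the bottom fix described above can be sketched as follows. This is a minimal example on a reduced copy of the question's EU data, not the only way to do it. The values are pulled out as plain arrays because adding the pandas Series directly would align on the row index, and the subsets have different indices.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Reduced version of the question's data: the EU subsidiary only
df = pd.DataFrame({
    'subsidiary': ['EU'] * 9,
    'date': ['2019-03', '2019-04', '2019-05'] * 3,
    'business': ['RETAIL'] * 3 + ['CORP'] * 3 + ['PUBLIC'] * 3,
    'value': [500.36, 600.45, 700.55, 750.66, 950.89, 1300.13,
              100.05, 120.00, 150.01],
})

x = df['date'].drop_duplicates().to_numpy()

# Plain arrays of values, so they can be summed position by position
retail = df.loc[df['business'] == 'RETAIL', 'value'].to_numpy()
corp = df.loc[df['business'] == 'CORP', 'value'].to_numpy()
public = df.loc[df['business'] == 'PUBLIC', 'value'].to_numpy()

fig, ax = plt.subplots()
ax.bar(x, retail, width=0.35, label='RETAIL')
# bottom is the running total of the bars already drawn, not a DataFrame
ax.bar(x, corp, width=0.35, bottom=retail, label='CORP')
ax.bar(x, public, width=0.35, bottom=retail + corp, label='PUBLIC')
ax.legend()
```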
Create only stacked bars
As mentioned by @r-beginners, use .pivot, or .pivot_table, to reshape the dataframe to a wide form to create stacked bars where each x-axis label is a tuple ('date', 'subsidiary').
Use .pivot if there are no repeat values for each category
Use .pivot_table, if there are repeat values that must be combined with aggfunc (e.g. 'sum', 'mean', etc.)
# reshape the dataframe
dfp = df.pivot(index=['date', 'subsidiary'], columns=['business'], values='value')
# plot stacked bars
dfp.plot(kind='bar', stacked=True, rot=0, figsize=(10, 4))
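The .pivot_table case is not shown above, so here is a hedged sketch of it. The data is hypothetical: it repeats one date/subsidiary/business combination, which .pivot would reject with a ValueError, and combines the repeats with aggfunc='sum'.

```python
import pandas as pd

# Hypothetical data: RETAIL appears twice for EU in 2019-03
dup = pd.DataFrame({
    'subsidiary': ['EU', 'EU', 'EU', 'EU', 'EU'],
    'date': ['2019-03', '2019-03', '2019-03', '2019-04', '2019-04'],
    'business': ['RETAIL', 'RETAIL', 'CORP', 'RETAIL', 'CORP'],
    'value': [100.0, 50.0, 200.0, 75.0, 300.0],
})

# pivot_table combines the repeated entries; here they are summed
dfp = dup.pivot_table(index=['date', 'subsidiary'], columns='business',
                      values='value', aggfunc='sum')

# plot stacked bars, one bar per (date, subsidiary)
ax = dfp.plot(kind='bar', stacked=True, rot=0, figsize=(10, 4))
```

Whether 'sum' or 'mean' is the right aggregation depends on what the repeated rows represent.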

Setting same color of a category in both category plots

I am creating two category plots using Seaborn. One category plot has 6 categories whereas the second category plot has 5 categories to plot. There are 3 categories in both plots that are the same. I want to set up the same color for each of the categories that are common in both plots. I am using sns.set_palette('coolwarm') to set the color of both plots but the same categories in both plots have different colors. Is there any way of setting the same color of a category that appears in both plots?
It should work if you put them together in a single dataframe, use sns.catplot(), and separate your plots with the col= argument:
import numpy as np
import pandas as pd
import seaborn as sns
np.random.seed(111)
d1 = pd.DataFrame({'x':np.random.randint(1,4,50),
'y':np.random.randn(50),
'z':np.random.choice(['a','b','c','d','e','f'],50),
'data':'d1'
})
d2 = pd.DataFrame({'x':np.random.randint(1,4,50),
'y':np.random.randn(50),
'z':np.random.choice(['d','e','f','g','h'],50),
'data':'d2'
})
df = pd.concat([d1,d2])
df['z'] = pd.Categorical(df['z'],ordered=True)
sns.catplot(data=df,x='x',y='y',hue='z',col='data',palette='coolwarm')
sns.catplot(data=df,x='x',hue='z',col='data',kind='count',palette='coolwarm')
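If the two plots have to stay separate, one alternative is to build a single category-to-colour mapping and pass it as palette to both calls. This is a sketch under assumptions: the data below is hypothetical and the names levels and palette are only illustrative.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: two datasets sharing the categories d, e and f
rng = np.random.default_rng(111)
d1 = pd.DataFrame({'x': rng.integers(1, 4, 50),
                   'y': rng.normal(size=50),
                   'z': rng.choice(list('abcdef'), 50)})
d2 = pd.DataFrame({'x': rng.integers(1, 4, 50),
                   'y': rng.normal(size=50),
                   'z': rng.choice(list('defgh'), 50)})

# One colour per category, taken over the union of both datasets
levels = sorted(set(d1['z']) | set(d2['z']))
palette = dict(zip(levels, sns.color_palette('coolwarm', n_colors=len(levels))))

# The same dict is used for both plots, so shared categories share a colour
g1 = sns.catplot(data=d1, x='x', y='y', hue='z', palette=palette)
g2 = sns.catplot(data=d2, x='x', y='y', hue='z', palette=palette)
```

The dict may contain categories that a given plot does not use; every category that is plotted must have an entry.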

How to plot all the rows in each column of a pandas df to a separate bar plot

I'm trying to plot each column of df_new on a separate bar plot, but I keep getting all of the columns on each chart. As I have 33 columns, the result should be 33 bar plots, each showing row 1 and row 2 for the corresponding column.
Update: to make it clearer, the image (made in Excel) shows how each plot should look; the next one would be 'r(0,0)', and so on.
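One possible approach, sketched with a hypothetical stand-in for df_new since the real data is not shown: pandas can draw each column on its own axes when subplots=True is passed. Only the column name 'r(0,0)' comes from the question; the other names and all values are made up.

```python
import pandas as pd

# Hypothetical stand-in for df_new: two rows, one column per quantity
df_new = pd.DataFrame(
    {'r(0,0)': [0.2, 0.5], 'r(0,1)': [0.7, 0.1],
     'r(1,0)': [0.4, 0.9], 'r(1,1)': [0.3, 0.6]},
    index=['row 1', 'row 2'],
)

# subplots=True gives each column its own axes;
# layout needs enough cells for every column
axes = df_new.plot.bar(subplots=True, layout=(2, 2), figsize=(8, 6),
                       legend=False, rot=0)
```

For 33 columns the layout would need at least 33 cells, for example layout=(7, 5), with a correspondingly larger figsize.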

I want to create a pie chart using a dataframe column in python

I want to create a Pie chart using single column of my dataframe, say my column name is 'Score'. I have stored scores in this column as below :
Score
.92
.81
.21
.46
.72
.11
.89
Now I want to create a pie chart with the range in percentage.
Say 0-0.4 is 30% , 0.4-0.7 is 35 % , 0.7+ is 35% .
I am using the code below:
df1['bins'] = pd.cut(df1['Score'],bins=[0,0.5,1], labels=["0-50%","50-100%"])
df1 = df.groupby(['Score', 'bins']).size().unstack(fill_value=0)
df1.plot.pie(subplots=True,figsize=(8, 3))
With the above code I am getting the pie chart, but I don't know how I can do this using percentages.
Cutting the dataframe up into bins is the right first step. After which, you can use value_counts with normalize=True in order to get relative frequencies of values in the bins column. This will let you see percentage of data across ranges that are defined in the bins.
In terms of plotting the pie chart, I'm not sure if I understood correctly, but it seemed like you would like to display the correct legend values and the percentage values in each slice of the pie.
pandas.DataFrame.plot is a good place to see all parameters that can be passed into the plot method. You can specify what are your x and y columns to use, and by default, the dataframe index is used as the legend in the pie plot.
To show the percentage values per slice, you can use the autopct parameter as well. As mentioned in this answer, you can use all the normal matplotlib plt.pie() flags in the plot method as well.
Bringing everything together, this is the resultant code and the resultant chart:
import pandas as pd
df = pd.DataFrame({'Score': [0.92,0.81,0.21,0.46,0.72,0.11,0.89]})
df['bins'] = pd.cut(df['Score'], bins=[0,0.4,0.7,1], labels=['0-0.4','0.4-0.7','0.7-1'], right=True)
# Name the result explicitly: pandas >= 2.0 calls it 'proportion', older versions 'bins'
bin_percent = (df['bins'].value_counts(normalize=True) * 100).rename('percent').to_frame()
plot = bin_percent.plot.pie(y='percent', figsize=(5, 5), autopct='%1.1f%%')

Change distance between bar groups in grouped bar chart (plotting with Pandas)

I have a Dataframe with 14 rows and 7 columns where the columns represent groups and the rows represent months. I am trying to create a grouped bar plot such that at each month (on the x-axis) I will have the values for each of the groups as bars. The code is simply
ax = df.plot.bar(width=1,color=['b','g','r','c','orange','purple','y']);
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
ax.set_xticklabels(months2,rotation=45)
Which produces the following result:
I would like to make the individual bars in each group wider but without them overlapping and I would also like to increase the distance between each group of bars so that there is enough space in the plot.
It might be worth mentioning that the index of the dataframe is 0,...,13.
Help would be greatly appreciated!
TH
If you want to pack 10 apples in a box and want the apples to have more space between them you have two options: (1) take a larger box, or (2) use smaller apples.
(1) See: How do you change the size of figures drawn with matplotlib?
(2) Change the width argument.
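Both options can be combined in the plotting call itself. A minimal sketch, assuming made-up data with the question's shape (14 rows, 7 columns); the figsize and width values are only starting points to adjust:

```python
import numpy as np
import pandas as pd

# Hypothetical data with the question's shape: 14 months x 7 groups
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((14, 7)),
                  columns=[f'group {i}' for i in range(1, 8)])

# (1) larger box: a wider figure via figsize
# (2) smaller apples: width < 1 leaves a gap between neighbouring groups
ax = df.plot.bar(width=0.8, figsize=(14, 5))
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
fig = ax.get_figure()
```

Here width is the total width of one group of bars, so each individual bar gets width divided by the number of columns. With width=1, as in the question, neighbouring groups touch.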
