Stacked barplot with total and edited axis limit - python

I'm trying to do a stacked barplot, but it seems to be pretty tricky with seaborn. I have this data:
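import numpy as np
import pandas as pd
import seaborn as sns
x = pd.DataFrame({"Groups" : np.random.choice(["Group1", "Group2", "Group3"], 100),
"Sex" : np.random.choice(["Masculine", "Femenine"], 100)})
x = x.groupby(["Groups", "Sex"]).size().reset_index(name="count")
# share of each sex within its group, in percent (only the numeric "count" column is transformed)
x["percent (%)"] = x.groupby("Groups")["count"].transform(lambda s: s / s.sum() * 100).round(1)
x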
And I have this plot:
sns.barplot(x="Groups", y="percent (%)", hue="Sex", data=x);
However, I want each group to be a single stacked bar, the y-axis to run from 0 to 1, and an extra "Group4" bar showing the total. When I try to set the axis limits, I get an error because this seaborn plot doesn't allow it. Every seaborn stacked barplot I have found also uses one column per group, holding that group's values, whereas I have all the groups in a single column. Any ideas?
I'm looking for a simple solution (with or without seaborn) that doesn't change the structure of the data (except for adding the "total" group; I don't know whether it's easier to add the total to the data or to compute it inside the plot).

Not sure what group4 would look like, here's a stacked bar graph:
x = pd.DataFrame({"Groups" : np.random.choice(["Group1", "Group2", "Group3"], 100),
"Sex" : np.random.choice(["Masculine", "Femenine"], 100)})
xf = x.groupby(["Groups"])['Sex'].value_counts().unstack('Groups')
xf['Total'] = xf.sum(1)
xf.div(xf.sum()).T.plot.bar(stacked=True)
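A minimal sketch of the same idea that starts from the long-format count table in the question and also fixes the y-axis to the 0 to 1 range. The names raw, long_counts, wide and shares are illustrative and not from the original post, and the random data uses a seeded generator only so the sketch is reproducible.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Same kind of data as in the question, with a seeded generator for reproducibility.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "Groups": rng.choice(["Group1", "Group2", "Group3"], 100),
    "Sex": rng.choice(["Masculine", "Femenine"], 100),
})
long_counts = raw.groupby(["Groups", "Sex"]).size().reset_index(name="count")

# Wide table: one row per group, one column per sex.
wide = long_counts.pivot(index="Groups", columns="Sex", values="count").fillna(0)
wide.loc["Total"] = wide.sum(axis=0)          # extra bar with the overall counts
shares = wide.div(wide.sum(axis=1), axis=0)   # each row now sums to 1

ax = shares.plot.bar(stacked=True)
ax.set_ylim(0, 1)
ax.set_ylabel("proportion")
```

The long table itself is left untouched; the pivot happens only on the way to the plot.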
Output:

Related

Plot Bar Graph with different Parametes in X Axis

I have a DataFrame like the one below, with Actual and Predicted columns. I want to compare Actual vs. Predicted one-to-one in a bar plot. I have a confidence value for the Predicted column, and the default confidence for Actual is 1. Each row should be a single bar group: the Actual and Predicted values go on the x-axis, with the corresponding confidence score as the y value.
I can't get the expected plot because the x-axis values are not aligned or grouped together for each row.
Actual Predicted Confidence
0 A A 0.90
1 B C 0.30
2 C C 0.60
3 D D 0.75
Expected Bar plot.
Any hint would be appreciated. Please let me know if further details are required.
What I have tried so far.
df_actual = pd.DataFrame()
df_actual['Key']= df['Actual'].copy()
df_actual['Confidence'] = 1
df_actual['Identifier'] = 'Actual'
df_predicted=pd.DataFrame()
df_predicted = df[['Predicted', 'Confidence']]
df_predicted = df_predicted.rename(columns={'Predicted': 'Key'})
df_predicted['Identifier'] = 'Predicted'
df_combined = pd.concat([df_actual,df_predicted], ignore_index=True)
df_combined
fig = px.bar(df_combined, x="Key", y="Confidence", color='Identifier',
barmode='group', height=400)
fig.show()
I have found that adjusting the data first makes it easier to get the plot I want. I have used Seaborn, hope that is ok. Please see if this code works for you. I have considered that the df mentioned above is already available. I created df2 so that it aligns to what you had shown in the expected figure. Also, I used index as the X-axis column so that the order is maintained... Some adjustments to ensure xtick names align and the legend is outside as you wanted it.
Code
vals= []
conf = []
for x, y, z in zip(df.Actual, df.Predicted, df.Confidence):
vals += [x, y]
conf += [1, z]
df2 = pd.DataFrame({'Values': vals, 'Confidence':conf}).reset_index()
ax=sns.barplot(data = df2, x='index', y='Confidence', hue='Values',dodge=False)
ax.set_xticklabels(['Actual', 'Predicted']*4)
plt.legend(bbox_to_anchor=(1.0,1))
plt.show()
Plot
Update - grouping Actual and Predicted bars
Hi @Mohammed - As we have already used up hue, I don't think there is an easy way to do this with Seaborn. You would need to use matplotlib and adjust the bar positions, xtick positions, etc. Below is the code that will do this. You can change Set1 to another color map to change the colors. I have also added a black outline, as bars of the same color were blending into one another. Further, I had to rotate the x labels, as they were on top of one another. You can change this as per your requirements. Hope this helps...
vals = df[['Actual','Predicted']].melt(value_name='texts')['texts']
conf = [1]*4 + list(df.Confidence)
ident = ['Actual', 'Predicted']*4
df2 = pd.DataFrame({'Values': vals, 'Confidence':conf, 'Identifier':ident}).reset_index()
uvals, uind = np.unique(df2["Values"], return_inverse=True)
cmap = plt.get_cmap("Set1")  # plt.cm.get_cmap is no longer available in recent matplotlib releases
fig, ax=plt.subplots()
l = len(df2)
pos = np.arange(0,l) % (l//2) + (np.arange(0,l)//(l//2)-1)*0.4
ax.bar(pos, df2["Confidence"], width=0.4, align="edge", ec="k",color=cmap(uind) )
handles=[plt.Rectangle((0,0),1,1, color=cmap(i), ec="k") for i in range(len(uvals))]
ax.legend(handles=handles, labels=list(uvals), prop ={'size':10}, loc=9, ncol=8)
pos=pos+0.2
pos.sort()
ax.set_xticks(pos)
ax.set_xticklabels(df2["Identifier"][:l], rotation=45,ha='right', rotation_mode="anchor")
ax.set_ylim(0, 1.2)
plt.show()
Output plot
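The position arithmetic above can be hard to follow, so here is a simplified sketch of the same grouped layout, assuming the four-row df from the question: each row i gets an Actual bar centred at i - width/2 and a Predicted bar centred at i + width/2. The names actual_pos, predicted_pos and color_of are illustrative and not from the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Actual": ["A", "B", "C", "D"],
    "Predicted": ["A", "C", "C", "D"],
    "Confidence": [0.90, 0.30, 0.60, 0.75],
})

width = 0.4
rows = np.arange(len(df))
actual_pos = rows - width / 2      # left bar of each pair
predicted_pos = rows + width / 2   # right bar of each pair

# One color per distinct label, shared by both columns.
labels = sorted(set(df["Actual"]) | set(df["Predicted"]))
cmap = plt.get_cmap("Set1")
color_of = {label: cmap(i) for i, label in enumerate(labels)}

fig, ax = plt.subplots()
ax.bar(actual_pos, np.ones(len(df)), width=width, edgecolor="k",
       color=[color_of[v] for v in df["Actual"]])
ax.bar(predicted_pos, df["Confidence"], width=width, edgecolor="k",
       color=[color_of[v] for v in df["Predicted"]])

# Interleave the positions so the ticks read Actual, Predicted, Actual, ...
ticks = np.column_stack([actual_pos, predicted_pos]).ravel()
ax.set_xticks(ticks)
ax.set_xticklabels(["Actual", "Predicted"] * len(df), rotation=45, ha="right")
ax.set_ylim(0, 1.2)
```

A legend for the labels can be added with proxy handles, as in the code above.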
I updated @Redox's answer to get the exact output.
df_ = pd.DataFrame({'Labels': df.reset_index()[['Actual', 'Predicted', 'index']].values.ravel(),
'Confidence': np.array(list(zip(np.repeat(1, len(df)), df['Confidence'].values, np.repeat(0, len(df))))).ravel()})
df_.loc[df_['Labels'].astype(str).str.isdigit(), 'Labels'] = ''
plt.figure(figsize=(15, 6))
ax=sns.barplot(data = df_, x=df_.index, y='Confidence', hue='Labels',dodge=False, ci=None)
ax.set_xticklabels(['Actual', 'Predicted', '']*len(df))
plt.setp(ax.get_xticklabels(), rotation=90)
ax.tick_params(labelsize=14)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
Output:
Removed the loop to improve performance.
Added blank bar values so that it looks like a grouped chart.

Plotting overlay of another column of data in seaborn with less mentions but duplicate graph values

I couldn't find a good way to word my issue in the title, so I'll explain it a bit better here. I'm making a swarm plot in seaborn with Sentiment on the y-axis and Symbol on the x-axis. Each symbol is mentioned a certain number of times, so its points get pushed outwards to show a larger spread of mentions along the x-axis. I'm trying to overlay another column of data, 'Avg. Sentiment'. I only need that point plotted once per symbol, but since the average is repeated for every mention, it essentially creates a line on the graph where the average would be, almost like a duplicated value.
I only need the value once. I can't just use some function from pandas or seaborn to plot an average, because I plan on using a custom weighted average point that has already been computed.
Here is the code to output and test the graph:
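import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.colors import LinearSegmentedColormap
np.random.seed(5)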
df = pd.DataFrame({
'Symbol': ['AMC', 'GME', 'BB', 'SPY', 'SPCE'],
'Mentions': [100, 75, 50, 25, 20],
'Avg.Sentiment':[.8,.7,.6,.5,.4]
})
df['Sentiment'] = df['Mentions'].apply(lambda x: (np.random.random(x) * 2) - 1)
df = df.explode('Sentiment')
pos = [0.0, 1.0]
colors = ['#FF5000', '#00C805']
cmap = LinearSegmentedColormap.from_list("",list(zip(pos,colors)))
matplotlib.colormaps.register(cmap, name="newmap")  # replaces matplotlib.cm.register_cmap, which recent matplotlib releases no longer provide
sns.set_style("darkgrid")
sns.set(rc={'figure.figsize':(32,14)})
sns.set(font_scale=2.0)
dplot = sns.swarmplot(x="Symbol", y="Avg.Sentiment", color='black', data=df, marker='X', size=10)
dplot= sns.swarmplot(x="Symbol", y="Sentiment", hue='Sentiment',palette="newmap", data=df)
dplot.get_legend().remove()
plt.show()
I've found the solution: plt.scatter lets you plot single points from the same data frame, so in my case:
plt.scatter(x="Symbol", y="Avg.Sentiment", data=df, color='black', marker='X')
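Because the exploded frame repeats each average once per mention, the scatter call above draws many identical points on top of each other. A possible refinement, sketched here under the assumption that one marker per symbol is wanted, is to drop the duplicates first. The names base and avg_points are illustrative, the data is rebuilt with a seeded generator rather than np.random.seed, and the sketch draws on its own axes instead of on top of the swarm plot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
base = pd.DataFrame({
    "Symbol": ["AMC", "GME", "BB", "SPY", "SPCE"],
    "Mentions": [100, 75, 50, 25, 20],
    "Avg.Sentiment": [0.8, 0.7, 0.6, 0.5, 0.4],
})

# One row per mention, like the exploded frame in the question.
df = base.loc[base.index.repeat(base["Mentions"])].reset_index(drop=True)
df["Sentiment"] = rng.random(len(df)) * 2 - 1

# Keep a single (Symbol, Avg.Sentiment) row per symbol for the overlay.
avg_points = df[["Symbol", "Avg.Sentiment"]].drop_duplicates()

fig, ax = plt.subplots()
ax.scatter(avg_points["Symbol"], avg_points["Avg.Sentiment"],
           color="black", marker="X", s=100)
```

The same avg_points frame could be passed to the plt.scatter call from the answer through its data argument.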

Couldn't align X axis values with bars on top of them using seaborn barplot with hue [duplicate]

My graph is ending up looking like this:
I took the original titanic dataset and sliced some columns and created a new dataframe via the following code.
Cabin_group = titanic[['Fare', 'Cabin', 'Survived']] #selecting certain columns from dataframe
Cabin_group.Cabin = Cabin_group.Cabin.str[0] #cleaning the Cabin column
Cabin_group = Cabin_group.groupby('Cabin', as_index =False).Survived.mean()
Cabin_group.drop([6,7], inplace = True) #drop Cabin G and T as instances are too low
Cabin_group['Status']= ('Poor', 'Rich', 'Rich', 'Medium', 'Medium', 'Poor') #giving each Cabin a status value.
So my new dataframe `Cabin_group` ends up looking like this:
Cabin Survived Status
0 A 0.454545 Poor
1 B 0.676923 Rich
2 C 0.574468 Rich
3 D 0.652174 Medium
4 E 0.682927 Medium
5 F 0.523810 Poor
Here is how I tried to plot the dataframe
fig, ax = plt.subplots(1, 1, figsize = (10,4))  # plt.subplots returns a (figure, axes) pair
sns.barplot(x ='Cabin', y='Survived', hue ='Status', data = Cabin_group )
plt.show()
So a couple of things are off with this graph.
First, the bars for A, D, E and F are shifted away from their respective x-axis labels. Second, the bars themselves appear thinner than in my usual barplots.
I'm not sure how to shift the bars to their proper place, or how to control the width of the bars.
Thank you.
This can be achieved by passing dodge=False to sns.barplot. It is supported in recent versions of seaborn.
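A short sketch of that suggestion, assuming the Cabin_group values shown in the question (rebuilt here by hand as cabin_group so the snippet is self-contained):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

cabin_group = pd.DataFrame({
    "Cabin": ["A", "B", "C", "D", "E", "F"],
    "Survived": [0.454545, 0.676923, 0.574468, 0.652174, 0.682927, 0.523810],
    "Status": ["Poor", "Rich", "Rich", "Medium", "Medium", "Poor"],
})

fig, ax = plt.subplots(figsize=(10, 4))
# dodge=False stops seaborn from reserving one slot per hue level at every x position,
# so each bar is centred on its tick and drawn at full width.
sns.barplot(x="Cabin", y="Survived", hue="Status", data=cabin_group,
            dodge=False, ax=ax)
```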
The bars are not aligned because seaborn expects 3 bars for each x value (one for each distinct value of Status) and only one is provided. I think one solution is to map a color to the Status. As far as I know, it is not possible to do that easily. However, here is an example of how to do it. I'm not sure about it, since it seems complicated just to map a color to a category (and the legend is not displayed).
# Creating a color mapping
Cabin_group['Color'] = pd.Series(pd.factorize(Cabin_group['Status'])[0], index=Cabin_group.index).map(
lambda x: sns.color_palette()[x])
g = sns.barplot(x ='Cabin', y='Survived', data=Cabin_group, palette=Cabin_group['Color'].tolist())
When I see how simple it is in R... Unfortunately, the ggplot implementation in Python does not allow plotting a geom_bar with stat = 'identity'.
library(tidyverse)
Cabin_group %>% ggplot() +
geom_bar(aes(x = Cabin, y= Survived, fill = Status),
stat = 'identity')
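For comparison, a plain matplotlib sketch of the color-per-Status idea that also shows a legend. The status_colors mapping is a hypothetical choice of colors, and cabin_group is rebuilt by hand from the table in the question.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.patches import Patch

cabin_group = pd.DataFrame({
    "Cabin": ["A", "B", "C", "D", "E", "F"],
    "Survived": [0.454545, 0.676923, 0.574468, 0.652174, 0.682927, 0.523810],
    "Status": ["Poor", "Rich", "Rich", "Medium", "Medium", "Poor"],
})

# Hypothetical color choice: one color per Status value.
status_colors = {"Poor": "tab:red", "Medium": "tab:orange", "Rich": "tab:green"}

fig, ax = plt.subplots(figsize=(10, 4))
ax.bar(cabin_group["Cabin"], cabin_group["Survived"],
       color=cabin_group["Status"].map(status_colors).tolist())
ax.set_xlabel("Cabin")
ax.set_ylabel("Survived")

# Proxy patches give one legend entry per Status instead of one per bar.
ax.legend(handles=[Patch(color=c, label=s) for s, c in status_colors.items()],
          title="Status")
```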

Plotly: How to display individual value on histogram?

I am trying to make dynamic plots with plotly. I want to plot a count of data that have been aggregated (using groupby).
I want to facet the plot by color (and maybe even by column). The problem is that I want the value count to be displayed on each bar. With histogram, I get smooth bars but I can't find how to display the count:
With a bar plot I can display the count, but I don't get smooth bars, and the count does not appear once for the whole bar but for each segment composing that bar.
Here is my code for the barplot
val = pd.DataFrame(data2.groupby(["program", "gender"])["experience"].value_counts())
px.bar(x=val.index.get_level_values(0), y=val, color=val.index.get_level_values(1), barmode="group", text=val)
It's basically the same for the histogram.
Thank you for your help!
px.histogram does not seem to have a text attribute. So if you're willing to do any binning before producing your plot, I would use px.bar. Normally, you apply text to your barplot using px.bar(... text = <something>). But this gives the result you've described, with text for all subcategories of your data. Since we know that px.bar adds data and annotations in the order in which the source is organized, we can simply set the text on the last subcategory applied using fig.data[-1].text = sums. The only challenge that remains is some data munging to retrieve the correct sums.
Plot:
Complete code with data example:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# data
df = pd.DataFrame({'x':['a', 'b', 'c', 'd'],
'y1':[1, 4, 9, 16],
'y2':[1, 4, 9, 16],
'y3':[6, 8, 4.5, 8]})
df = df.set_index('x')
# calculations
# column sums for transposed dataframe
sums= []
for col in df.T:
sums.append(df.T[col].sum())
# change dataframe format from wide to long for input to plotly express
df = df.reset_index()
df = pd.melt(df, id_vars = ['x'], value_vars = df.columns[1:])
fig = px.bar(df, x='x', y='value', color='variable')
fig.data[-1].text = sums
fig.update_traces(textposition='inside')
fig.show()
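The per-bar sums can also be taken from the long-format frame with a groupby instead of the loop. This is a pandas-only sketch on the same example data; wide, long_df and totals are illustrative names.

```python
import pandas as pd

wide = pd.DataFrame({"x": ["a", "b", "c", "d"],
                     "y1": [1, 4, 9, 16],
                     "y2": [1, 4, 9, 16],
                     "y3": [6, 8, 4.5, 8]})

# Long format, as expected by plotly express.
long_df = wide.melt(id_vars="x", var_name="variable", value_name="value")

# One total per x value; sort=False keeps the order in which x values first appear,
# which is the order of the bars within each trace.
totals = long_df.groupby("x", sort=False)["value"].sum()
print(totals.tolist())
```

With the figure from the code above, totals.tolist() could then be assigned to fig.data[-1].text.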
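If your first graph was made with the graph objects library, you can try:
# Use textposition='auto' to draw each value directly on its bar.
# go.Bar takes no color or barmode argument: barmode belongs to the layout,
# and coloring by gender would need one trace per gender.
counts = val.iloc[:, 0]  # the single count column of val
fig = go.Figure(data=[go.Bar(x=val.index.get_level_values(0), y=counts,
text=counts, textposition='auto')])
fig.update_layout(barmode="group")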

Plotting series using seaborn

category = df.category_name_column.value_counts()
I have the above series which returns the values:
CategoryA,100
CategoryB,200
I am trying to plot the top 5 category names in X - axis and values in y-axis
head = (category.head(5))
sns.barplot(x = head ,y=df.category_name_column.value_counts(), data=df)
It does not print the "names" of the categories in the X-axis, but the count. How to print the top 5 names in X and Values in Y?
You can pass in the series' index & values to x & y respectively in sns.barplot. With that the plotting code becomes:
sns.barplot(x=head.index, y=head.values)
"I am trying to plot the top 5 category names in X"
Calling category.head(5) returns the first five entries of the series category. Because value_counts() sorts by frequency in descending order by default, those are already the five most frequent categories. If the series was built with sort=False or reordered afterwards, sort it first and then call head(5), like this:
category = df.category_name_column.value_counts()
head = category.sort_values(ascending=False).head(5)
The previously accepted solution is deprecated in seaborn, so another workaround could be as follows:
Convert series to dataframe
category = df.category_name_column.value_counts()
category_df = category.reset_index()
category_df.columns = ['categories', 'frequency']
Use barplot
ax = sns.barplot(x = 'categories', y = 'frequency', data = category_df)
Although this is not exactly a plot of a series, it is a workaround that's officially supported by seaborn.
For more barplot examples please refer here:
https://seaborn.pydata.org/generated/seaborn.barplot.html
https://stackabuse.com/seaborn-bar-plot-tutorial-and-examples/
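A small pandas-only sketch of the data preparation discussed in this thread, on made-up category data: it builds the top 5 table with named columns and shows that value_counts() already returns the counts in descending order. The sample values and the names top5 and category_df are illustrative.

```python
import pandas as pd

# Made-up data: F appears 6 times, E 5, C 4, B 3, A 2 and D once.
df = pd.DataFrame({"category_name_column":
                   ["A"] * 2 + ["B"] * 3 + ["C"] * 4 + ["D"] + ["E"] * 5 + ["F"] * 6})

category = df["category_name_column"].value_counts()  # sorted by count, descending
top5 = category.nlargest(5)                            # explicit about "top 5"

# Name the index and the values in one step, independent of the pandas version.
category_df = top5.rename_axis("categories").reset_index(name="frequency")
print(category_df)
```

category_df can then be passed to sns.barplot(x='categories', y='frequency', data=category_df) as in the workaround above.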
