I have a very huge dataset with a lot of subsidiaries serving three customer groups in various countries, something like this (in reality there are much more subsidiaries and dates):
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'subsidiary': ['EU','EU','EU','EU','EU','EU','EU','EU','EU','US','US','US','US','US','US','US','US','US'],'date': ['2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05','2019-03','2019-04', '2019-05'],'business': ['RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC','RETAIL','RETAIL','RETAIL','CORP','CORP','CORP','PUBLIC','PUBLIC','PUBLIC'],'value': [500.36,600.45,700.55,750.66,950.89,1300.13,100.05,120.00,150.01,800.79,900.55,1000,3500.79,5000.36,4500.25,50.17,75.25,90.33]})
print(df)
I'd like to make an analysis per subsidiary by producing a stacked bar chart. To do this, I started by defining the x-axis to be the unique months and by defining a subset per business type in a country like this:
x=df['date'].drop_duplicates()
EUCORP = df[(df['subsidiary']=='EU') & (df['business']=='CORP')]
EURETAIL = df[(df['subsidiary']=='EU') & (df['business']=='RETAIL')]
EUPUBLIC = df[(df['subsidiary']=='EU') & (df['business']=='PUBLIC')]
I can then make a bar chart per business type:
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35)
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35)
However, if I try to stack all three together in one chart, I keep failing:
plotEURETAIL = plt.bar(x=x, height=EURETAIL['value'], width=.35)
plotEUCORP = plt.bar(x=x, height=EUCORP['value'], width=.35, bottom=EURETAIL)
plotEUPUBLIC = plt.bar(x=x, height=EUPUBLIC['value'], width=.35, bottom=EURETAIL+EUCORP)
plt.show()
I always receive the below error message:
ValueError: Missing category information for StrCategoryConverter; this might be caused by unintendedly mixing categorical and numeric data
ConversionError: Failed to convert value(s) to axis units: subsidiary date business value
0 EU 2019-03 RETAIL 500.36
1 EU 2019-04 RETAIL 600.45
2 EU 2019-05 RETAIL 700.55
I tried converting the months into the dateformat and/or indexing it, but it actually confused me further...
I would really appreciate any help/support on any of the following, as I a already spend a lot of hours to try to figure this out (I am still a python noob, sry):
How can I fix the error to create a stacked bar chart?
Assuming, the error can be fixed, is this the most efficient way to create the bar chart (e.g. do I really need to create three sub-dfs per subsidiary, or is there a more elegant way?)
Would it be possible to code an iteration, that produces a stacked bar chart by country, so that I don't need to create one per subsidiary?
As an FYI, stacked bars are not the best option, because they can make it difficult to compare bar values and can easily be misinterpreted. The purpose of a visualization is to present data in an easily understood format; make sure the message is clear. Side-by-side bars are often a better option.
Side-by-side stacked bars are a difficult manual process to construct, it's better to use a figure-level method like seaborn.catplot, which will create a single, easy to read, data visualization.
Bar plot ticks are located by 0 indexed range (not datetimes), the dates are just labels, so it is not necessary to convert them to a datetime dtype.
Tested in python 3.8.11, pandas 1.3.2, matplotlib 3.4.3, seaborn 0.11.2
seaborn
import seaborn as sns
sns.catplot(kind='bar', data=df, col='subsidiary', x='date', y='value', hue='business')
Create grouped and stacked bars
See Stacked Bar Chart and Grouped bar chart with labels
The issue with the creation of the stacked bars in the OP is bottom is being set on the entire dataframe for that group, instead of only the values that make up the bar height.
do I really need to create three sub-dfs per subsidiary. Yes, a DataFrame is needed for every group, so 6, in this case.
Creating the data subsets can be automated using a dict-comprehension to unpack the .groupby object into a dict.
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])} to create a dict of DataFrames
Access the values like: data['EUCORP'].value
Automating the plot creation is more arduous, as can be seen x depends on how many groups of bars for each tick, and bottom depends on the values for each subsequent plot.
import numpy as np
import matplotlib.pyplot as plt
labels=df['date'].drop_duplicates() # set the dates as labels
x0 = np.arange(len(labels)) # create an array of values for the ticks that can perform arithmetic with width (w)
# create the data groups with a dict comprehension and groupby
data = {''.join(k): v for k, v in df.groupby(['subsidiary', 'business'])}
# build the plots
subs = df.subsidiary.unique()
stacks = len(subs) # how many stacks in each group for a tick location
business = df.business.unique()
# set the width
w = 0.35
# this needs to be adjusted based on the number of stacks; each location needs to be split into the proper number of locations
x1 = [x0 - w/stacks, x0 + w/stacks]
fig, ax = plt.subplots()
for x, sub in zip(x1, subs):
bottom = 0
for bus in business:
height = data[f'{sub}{bus}'].value.to_numpy()
ax.bar(x=x, height=height, width=w, bottom=bottom)
bottom += height
ax.set_xticks(x0)
_ = ax.set_xticklabels(labels)
As you can see, small values are difficult to discern, and using ax.set_yscale('log') does not work as expected with stacked bars (e.g. it does not make small values more readable).
Create only stacked bars
As mentioned by #r-beginners, use .pivot, or .pivot_table, to reshape the dataframe to a wide form to create stacked bars where the x-axis is a tuple ('date', 'subsidiary').
Use .pivot if there are no repeat values for each category
Use .pivot_table, if there are repeat values that must be combined with aggfunc (e.g. 'sum', 'mean', etc.)
# reshape the dataframe
dfp = df.pivot(index=['date', 'subsidiary'], columns=['business'], values='value')
# plot stacked bars
dfp.plot(kind='bar', stacked=True, rot=0, figsize=(10, 4))
Related
I want to create sort of Stacked Bar Chart [don't know the proper name]. I hand drew the graph [for years 2016 and 2017] and attached it here.
The code to create the df is below:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = [[2016.0, 0.4862, 0.4115, 0.3905, 0.3483, 0.1196],
[2017.0, 0.4471, 0.4096, 0.3725, 0.2866, 0.1387],
[2018.0, 0.4748, 0.4016, 0.3381, 0.2905, 0.2012],
[2019.0, 0.4705, 0.4247, 0.3857, 0.3333, 0.2457],
[2020.0, 0.4755, 0.4196, 0.3971, 0.3825, 0.2965]]
cols = ['attribute_time', '100-81 percentile', '80-61 percentile', '60-41 percentile', '40-21 percentile', '20-0 percentile']
df = pd.DataFrame(data, columns=cols)
#set seaborn plotting aesthetics
sns.set(style='white')
#create stacked bar chart
df.set_index('attribute_time').plot(kind='bar', stacked=True)
The data doesn't need to stack on top of each other. The code will create a stacked bar chart, but that's not exactly what needs to be displayed. The percentile needs to have labeled horizontal lines indicating the percentile on the x axis for that year. Does anyone have recommendations on how to achieve this goal? Is it a sort of modified stacked bar chart that needs to be visualized?
My approach to this is to represent the data as a categorical scatter plot (stripplot in Seaborn) using horizontal lines rather than points as markers. You'll have to make some choices about exactly how and where you want to plot things, but this should get you started!
I first modified the data a little bit:
df['attribute_time'] = df['attribute_time'].astype('int') # Just to get rid of the decimals.
df = df.melt(id_vars = ['attribute_time'],
value_name = 'pct_value',
var_name = 'pct_range')
Melting the DataFrame takes the wide data and makes it long instead, so the columns are now year, pct_value, and pct_range and there is a row for each data point.
Next is the plotting:
fig, ax = plt.subplots()
sns.stripplot(data = df,
x = 'attribute_time',
y = 'pct_value',
hue = 'pct_range',
jitter = False,
marker = '_',
s = 40,
linewidth = 3,
ax = ax)
Instead of labeling each point with the range that it belongs to, I though it would be a lot cleaner to separate them into ranges by color.
The jitter is used when there are lots of points for a given category that might overlap to try and prevent them from touching. In this case, we don't need to worry about that so I turned the jitter off. The marker style is designated here as hline.
The s parameter is the horizontal width of each line, and the linewidth is the thickness, so you can play around with those a bit to see what works best for you.
Text is added to the figure using the ax.text method as follows:
for year, value in zip(df['attribute_time'],df['pct_value']):
ax.text(year - 2016,
value,
str(value),
ha = 'center',
va = 'bottom',
fontsize = 'small')
The figure coordinates are indexed starting from 0 despite the horizontal markers displaying the years, so the x position of the text is shifted left by the minimum year (2016). The y position is equal to the value, and the text itself is a string representation of the value. The text is centered above the line and sits slightly above it due to the vertical anchor being on the bottom.
There's obviously a lot you can tweak to make it look how you want with sizing and labeling and stuff, but hopefully this is at least a good start!
I have pandas dataframe where I have nested 4 categories (50,60,70,80) within two categories (positive, negative) and I would like to plot with seaborn kdeplot of a column (eg., A_mean...) based on groupby. What I want to achieve is this (this was done by splitting the pandas to a list). I went over several posts, this code (Multiple single plots in seaborn with pandas groupby data) works for one level but not for the two if I want to plot this for each Game_RS:
for i, group in df_hb_SLR.groupby('Condition'):
sns.kdeplot(data=group['A_mean_per_subject'], shade=True, color='blue', label = 'label name')
I tried to use this one (Seaborn groupby pandas Series) but the first answer did not work for me:
sns.kdeplot(df_hb_SLR.A_mean_per_subject, groupby=df_hb_SLR.Game_RS)
AttributeError: 'Line2D' object has no property 'groupby'
and the pivot answer I was not able to make work.
Is there a direct way from seaborn or any better way directly from pandas Dataframe?
My data are accessible in csv format under this link -- data and I load them as usual:
df_hb_SLR = pd.read_csv('data.csv')
Thank you for help.
Here is a solution using seaborn's FacetGrid, which makes this kind of things really easy
g = sns.FacetGrid(data=df_hb_SLR, col="Condition", hue='Game_RS', height=5, aspect=0.5)
g = g.map(sns.kdeplot, 'A_mean_per_subject', shade=True)
g.add_legend()
The downside of FacetGrid is that it creates a new figure, so If you'd like to integrate those plots into a larger ensemble of subplots, you could achieve the same result using groupby() and some looping:
group1 = "Condition"
N1 = len(df_hb_SLR[group1].unique())
group2 = 'Game_RS'
target = 'A_mean_per_subject'
height = 5
aspect = 0.5
colour = ['gray', 'blue', 'green', 'darkorange']
fig, axs = plt.subplots(1,N1, figsize=(N1*height*aspect,N1*height*aspect), sharey=True)
for (group1Name,df1),ax in zip(df_hb_SLR.groupby(group1),axs):
ax.set_title(group1Name)
for (group2Name,df2),c in zip(df1.groupby(group2), colour):
sns.kdeplot(df2[target], shade=True, label=group2Name, ax=ax, color = c)
I want to create a Pie chart using single column of my dataframe, say my column name is 'Score'. I have stored scores in this column as below :
Score
.92
.81
.21
.46
.72
.11
.89
Now I want to create a pie chart with the range in percentage.
Say 0-0.4 is 30% , 0.4-0.7 is 35 % , 0.7+ is 35% .
I am using the below code using
df1['bins'] = pd.cut(df1['Score'],bins=[0,0.5,1], labels=["0-50%","50-100%"])
df1 = df.groupby(['Score', 'bins']).size().unstack(fill_value=0)
df1.plot.pie(subplots=True,figsize=(8, 3))
With the above code I am getting the Pie chart, but i don’t know how i can do this using percentage.
my pie chart look like this for now
Cutting the dataframe up into bins is the right first step. After which, you can use value_counts with normalize=True in order to get relative frequencies of values in the bins column. This will let you see percentage of data across ranges that are defined in the bins.
In terms of plotting the pie chart, I'm not sure if I understood correctly, but it seemed like you would like to display the correct legend values and the percentage values in each slice of the pie.
pandas.DataFrame.plot is a good place to see all parameters that can be passed into the plot method. You can specify what are your x and y columns to use, and by default, the dataframe index is used as the legend in the pie plot.
To show the percentage values per slice, you can use the autopct parameter as well. As mentioned in this answer, you can use all the normal matplotlib plt.pie() flags in the plot method as well.
Bringing everything together, this is the resultant code and the resultant chart:
df = pd.DataFrame({'Score': [0.92,0.81,0.21,0.46,0.72,0.11,0.89]})
df['bins'] = pd.cut(df['Score'], bins=[0,0.4,0.7,1], labels=['0-0.4','0.4-0.7','0.7-1'], right=True)
bin_percent = pd.DataFrame(df['bins'].value_counts(normalize=True) * 100)
plot = bin_percent.plot.pie(y='bins', figsize=(5, 5), autopct='%1.1f%%')
Plot of Pie Chart
It seems like plotting a line connecting the mean values of box plots would be a simple thing to do, but I couldn't figure out how to do this plot in pandas.
I'm using this syntax to do the boxplot so that it automatically generate the box plot for Y vs. X device without having to do external manipulation of the data frame:
df.boxplot(column='Y_Data', by="Category", showfliers=True, showmeans=True)
One way I thought of doing is to just do a line plot by getting the mean values from the boxplot, but I'm not sure how to extract that information from the plot.
You can save the axis object that gets returned from df.boxplot(), and plot the means as a line plot using that same axis. I'd suggest using Seaborn's pointplot for the lines, as it handles a categorical x-axis nicely.
First let's generate some sample data:
import pandas as pd
import numpy as np
import seaborn as sns
N = 150
values = np.random.random(size=N)
groups = np.random.choice(['A','B','C'], size=N)
df = pd.DataFrame({'value':values, 'group':groups})
print(df.head())
group value
0 A 0.816847
1 A 0.468465
2 C 0.871975
3 B 0.933708
4 A 0.480170
...
Next, make the boxplot and save the axis object:
ax = df.boxplot(column='value', by='group', showfliers=True,
positions=range(df.group.unique().shape[0]))
Note: There's a curious positions argument in Pyplot/Pandas boxplot(), which can cause off-by-one errors. See more in this discussion, including the workaround I've employed here.
Finally, use groupby to get category means, and then connect mean values with a line plot overlaid on top of the boxplot:
sns.pointplot(x='group', y='value', data=df.groupby('group', as_index=False).mean(), ax=ax)
Your title mentions "median" but you talk about category means in your post. I used means here; change the groupby aggregation to median() if you want to plot medians instead.
You can get the value of the medians by using the .get_data() property of the matplotlib.lines.Line2D objects that draw them, without having to use seaborn.
Let bp be your boxplot created as bp=plt.boxplot(data). Then, bp is a dict containing the medians key, among others. That key contains a list of matplotlib.lines.Line2D, from which you can extract the (x,y) position as follows:
bp=plt.boxplot(data)
X=[]
Y=[]
for m in bp['medians']:
[[x0, x1],[y0,y1]] = m.get_data()
X.append(np.mean((x0,x1)))
Y.append(np.mean((y0,y1)))
plt.plot(X,Y,c='C1')
For an arbitrary dataset (data), this script generates this figure. Hope it helps!
I have a dataset of 5000 products with 50 features. One of the column is 'colors' and there are more than 100 colors in the column. I'm trying to plot a bar chart to show only the top 10 colors and how many products there are in each color.
top_colors = df.colors.value_counts()
top_colors[:10].plot(kind='barh')
plt.xlabel('No. of Products');
Using Seaborn:
sns.factorplot("colors", data=df , palette="PuBu_d");
1) Is there a better way to do this?
2) How can i replicate this with Seaborn?
3) How do i plot such that the highest count is at the top (i.e black at the very top of the bar chart)
An easy trick might be to invert the y axis of your plot, rather than futzing with the data:
s = pd.Series(np.random.choice(list(string.uppercase), 1000))
counts = s.value_counts()
ax = counts.iloc[:10].plot(kind="barh")
ax.invert_yaxis()
Seaborn barplot doesn't currently support horizontally oriented bars, but if you want to control the order the bars appear in you can pass a list of values to the x_order param. But I think it's easier to use the pandas plotting methods here, anyway.
If you want to use pandas then you can first sort:
top_colors[:10].sort(ascending=0).plot(kind='barh')
Seaborn already styles your pandas plots, but you can also use:
sns.barplot(top_colors.index, top_colors.values)