Seaborn boxplot and lineplot not showing properly - python

I'm trying to overlay a seaborn lineplot on a seaborn boxplot.
The result is somewhat "shocking" :)
It seems the two graphs are drawn in the same figure but kept separate:
the box plot is compressed on the left side and the line plot is compressed on the right side.
Note that if I run the two graphs separately, they work fine.
I cannot figure out how to make it work.
Thank you in advance for any help.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
mydata = pd.DataFrame({
'a':[2012, 2012, 2012, 2012, 2013, 2013, 2013, 2013, 2014, 2014, 2014, 2014, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016, 2017, 2017, 2017, 2017, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2020, 2020, 2020, 2020],
'v':[383.00, 519.00, 366.00, 436.00, 1348.00, 211.00, 139.00, 614.00, 365.00, 365.00, 383.00, 602.00, 994.00, 719.00, 589.00, 365.00, 990.00, 1142.00, 262.00, 1263.00, 507.00, 222.00, 363.00, 274.00, 195.00, 730.00, 730.00, 592.00, 479.00, 607.00, 292.00, 657.00, 453.00, 691.00, 673.00, 705]
})
means =mydata.groupby('a').v.mean().reset_index()
fig, ax = plt.subplots(figsize=(15,8))
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.lineplot(data=means, x='a', y='v', ax=ax)
plt.show()

Surprisingly, I did not find a duplicate for this question with a good answer, so I elevate my comment to one. Arise, Sir Comment:
Instead of lineplot, you should use pointplot
...
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.pointplot(data=means, x='a', y='v', ax=ax)
plt.show()
Sample output:
Pointplot is the categorical equivalent of lineplot, and boxplot treats its x variable as categorical. The seaborn documentation on relational and categorical plotting explains the distinction in more detail.
The question came up why there is no problem with lineplot for the following data:
mydata = pd.DataFrame({'a':["m1", "m1", "m1", "m2", "m2", "m2", "m2", "m3", "m3", "m3", "m3", "m4", "m4", "m4", "m4"], 'v':[11.37, 11.31, 10.93, 9.43, 9.62, 6.61, 9.31, 11.27, 8.47, 11.86, 8.77, 8.8, 9.58, 12.26, 10] })
means =mydata.groupby('a').v.mean().reset_index()
print(means)
fig, ax = plt.subplots(figsize=(15,8))
sns.boxplot(data=mydata, x='a', y='v', ax=ax, showfliers=False)
sns.lineplot(data=means, x='a', y='v', ax=ax)
plt.show()
Output:
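The difference is that this example does not have any ambiguity for lineplot. Boxplot always treats x as categorical and draws the boxes at positions 0, 1, 2, ..., whatever the values are. Seaborn lineplot can use both categorical and numerical data. With the numerical years, it plots the means at x = 2012 to 2020, far to the right of the boxes, which is why each plot is squeezed into one side of the figure; with the strings "m1" to "m4", it falls back to categorical positions that coincide with the boxes. Seemingly, the code tries first to plot x as numerical data and, if this is not possible, uses it as a categorical variable (I don't know the source code). This is probably a good design decision by seaborn, because the alternative (not accepting categorical data) would cause far more problems than the rare case of plotting categorical and numerical data into the same figure. A warning by seaborn would be a good thing, though.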
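To illustrate the point about positions, here is a minimal sketch (not part of the original answer) that draws the means with plain matplotlib at the positions 0, 1, 2, ... where boxplot places its boxes. It uses a shortened version of the data and assumes that boxplot orders numeric categories ascending, which matches the sorted output of groupby.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Shortened version of the question's data (three years only).
mydata = pd.DataFrame({
    "a": [2012, 2012, 2013, 2013, 2014, 2014],
    "v": [383.0, 519.0, 1348.0, 211.0, 365.0, 602.0],
})
means = mydata.groupby("a")["v"].mean().reset_index()

fig, ax = plt.subplots()
sns.boxplot(data=mydata, x="a", y="v", ax=ax, showfliers=False)

# Assumption: the boxes sit at 0, 1, 2, ... in ascending order of "a",
# so the means are drawn at those positions rather than at 2012, 2013, ...
positions = list(range(len(means)))
mean_line, = ax.plot(positions, means["v"], marker="o", color="black")
```

This is essentially what pointplot does for you, so pointplot remains the simpler choice.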


Problem in showing the Data Labels in the Bar Chart of catplot [duplicate]

I plotted a catplot in seaborn like this
import seaborn as sns
import pandas as pd
data = {'year': [2016, 2013, 2014, 2015, 2016, 2013, 2014, 2015, 2016, 2013, 2014, 2015, 2016, 2013, 2014, 2015, 2016, 2013, 2014, 2015], 'geo_name': ['Michigan', 'Michigan', 'Michigan', 'Michigan', 'Washtenaw County, MI', 'Washtenaw County, MI', 'Washtenaw County, MI', 'Washtenaw County, MI', 'Ann Arbor, MI', 'Ann Arbor, MI', 'Ann Arbor, MI', 'Ann Arbor, MI', 'Philadelphia, PA', 'Philadelphia, PA', 'Philadelphia, PA', 'Philadelphia, PA', 'Ann Arbor, MI Metro Area', 'Ann Arbor, MI Metro Area', 'Ann Arbor, MI Metro Area', 'Ann Arbor, MI Metro Area'], 'geo': ['04000US26', '04000US26', '04000US26', '04000US26', '05000US26161', '05000US26161', '05000US26161', '05000US26161', '16000US2603000', '16000US2603000', '16000US2603000', '16000US2603000', '16000US4260000', '16000US4260000', '16000US4260000', '16000US4260000', '31000US11460', '31000US11460', '31000US11460', '31000US11460'], 'income': [50803.0, 48411.0, 49087.0, 49576.0, 62484.0, 59055.0, 60805.0, 61003.0, 57697.0, 55003.0, 56835.0, 55990.0, 39770.0, 37192.0, 37460.0, 38253.0, 62484.0, 59055.0, 60805.0, 61003.0], 'income_moe': [162.0, 163.0, 192.0, 186.0, 984.0, 985.0, 958.0, 901.0, 2046.0, 1688.0, 1320.0, 1259.0, 567.0, 424.0, 430.0, 511.0, 984.0, 985.0, 958.0, 901.0]}
df = pd.DataFrame(data)
g = sns.catplot(x='year', y='income', data=df, kind='bar', hue='geo_name', legend=True)
g.fig.set_size_inches(15,8)
g.fig.subplots_adjust(top=0.81,right=0.86)
I am getting an output like shown below
I want to add the values of each bar on its top in K representation. For example
in 2013 the bar for Michigan is at 48411 so I want to add the value 48.4K on top of that bar. Likewise for all the bars.
Updated as of matplotlib v3.4.2
Use matplotlib.pyplot.bar_label
See the matplotlib: Bar Label Demo page for additional formatting options.
Tested with pandas v1.2.4, which is using matplotlib as the plot engine.
Use the fmt parameter for simple formats, and labels parameter for customized string formatting.
See Adding value labels on a matplotlib bar chart for other plotting options related to the new method.
For single plot only
g = sns.catplot(x='year', y='income', data=df, kind='bar', hue='geo_name', legend=True)
g.fig.set_size_inches(15, 8)
g.fig.subplots_adjust(top=0.81, right=0.86)
# extract the matplotlib axes_subplot objects from the FacetGrid
ax = g.facet_axis(0, 0)
# iterate through the axes containers
for c in ax.containers:
    labels = [f'{(v.get_height() / 1000):.1f}K' for v in c]
    ax.bar_label(c, labels=labels, label_type='edge')
For single or multiple plots
import matplotlib.pyplot as plt
g = sns.catplot(x='year', y='income', data=df, kind='bar', col='geo_name', col_wrap=3, legend=True)
g.fig.set_size_inches(15, 8)
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Bar Count with Annotations')
# iterate through axes
for ax in g.axes.ravel():
    # add annotations
    for c in ax.containers:
        labels = [f'{(v.get_height() / 1000):.1f}K' for v in c]
        ax.bar_label(c, labels=labels, label_type='edge')
    ax.margins(y=0.2)
plt.show()
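As a stripped-down illustration of the labels parameter, here is a sketch that uses plain matplotlib bars instead of a catplot. The three income values come from the question's 2013 data; the category names are shortened for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Three of the 2013 income values from the question.
names = ["Michigan", "Washtenaw", "Ann Arbor"]
incomes = [48411.0, 59055.0, 55003.0]

fig, ax = plt.subplots()
bars = ax.bar(names, incomes)

# Build the custom "K" strings and hand them to bar_label via `labels`.
k_labels = [f"{v / 1000:.1f}K" for v in incomes]
texts = ax.bar_label(bars, labels=k_labels, label_type="edge")
ax.margins(y=0.2)  # leave head room so the labels are not clipped
```

For a plain numeric label, a %-style format string such as fmt='%.0f' can be passed instead of labels.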
Alternatively, we can take the FacetGrid returned by sns.catplot(), select its axes, and loop over the bars (ax.patches), writing each bar's height in the required format with ax.text().
g = sns.catplot(x='year', y='income', data=df, kind='bar', hue='geo_name', legend=True)
g.fig.set_size_inches(16,8)
g.fig.subplots_adjust(top=0.81,right=0.86)
ax = g.facet_axis(0,0)
for p in ax.patches:
    ax.text(p.get_x() - 0.01,
            p.get_height() * 1.02,
            '{0:.1f}K'.format(p.get_height()/1000),  # format the value in K representation
            color='black',
            rotation='horizontal',
            size='large')
It's a rough solution but it does the trick.
We add the text to the axes object created by the plot.
The Y position is simple, as it corresponds exactly to the data value. We can add 500 to each value so that the label sits just above the bar.
The X position is in category units: the first group of bars (2013) is centered at 0, and consecutive groups are one unit apart. Each group leaves a buffer of 0.1 on either side, so its 5 bars share a width of 0.8 and each bar is 0.16 wide.
g = sns.catplot(x='year', y='income', data=df, kind='bar', hue='geo_name', legend=True)
#flatax=g.axes.flatten()
#g.axes[0].text=('1')
g.fig.set_size_inches(15,8)
g.fig.subplots_adjust(top=0.81,right=0.86)
g.ax.text(-0.5,51000,'X=-0.5')
g.ax.text(-0.4,49000,'X=-0.4')
g.ax.text(0,49000,'X=0')
g.ax.text(0.5,51000,'X=0.5')
g.ax.text(0.4,49000,'X=0.4')
g.ax.text(0.6,47000,'X=0.6')
The text is, by default, left-aligned (i.e. it starts at the x value we set). See the matplotlib text documentation if you want to play with the text (change font, alignment, etc.).
We can then find the right placement for each label, knowing that the 3rd bar of each group will always be centered on the unit (0, 1, 2, 3 for the four years).
g = sns.catplot(x='year', y='income', data=df, kind='bar', hue='geo_name', legend=True)
#flatax=g.axes.flatten()
#g.axes[0].text=('1')
g.fig.set_size_inches(15,8)
g.fig.subplots_adjust(top=0.81,right=0.86)
g.ax.text(-0.4,48411+500,'48.4K')
g.ax.text(-0.24,59055+500,'59.1K')
g.ax.text(-0.08,55003+500,'55.0K')
g.ax.text(0.08,37192+500,'37.2K')
g.ax.text(0.24,59055+500,'59.1K')
Of course, instead of manually labelling everything, you should loop through the data and create the labels automatically. The years are sorted here because seaborn orders numeric categories ascending:
for i, yr in enumerate(sorted(df['year'].unique())):
    for j, gn in enumerate(df['geo_name'].unique()):
        income = df.loc[(df['year'] == yr) & (df['geo_name'] == gn), 'income'].iloc[0]
        g.ax.text(i - 0.4 + j * 0.16, income + 500, f'{income / 1000:.1f}K')
The x position of each label is i-0.4+(j*0.16), and yr and gn select the matching income value.
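The offset arithmetic can be checked on its own, without plotting. The sketch below assumes the layout described in this answer (groups one unit apart, 0.8 units of bars per group, five hue levels); the helper names are made up for the example.

```python
# Assumed layout from the answer: each group of bars spans 0.8 category units.
GROUP_WIDTH = 0.8
N_HUES = 5  # five geo_name levels in the example

bar_width = GROUP_WIDTH / N_HUES  # 0.16 per bar


def bar_left_edge(group_index, hue_index):
    """Left edge of a bar: group centre minus half the group width, plus the hue offset."""
    return group_index - GROUP_WIDTH / 2 + hue_index * bar_width


def bar_center(group_index, hue_index):
    """Centre of the same bar, half a bar width to the right of its left edge."""
    return bar_left_edge(group_index, hue_index) + bar_width / 2


# Left edges of the five bars in the first group (2013).
first_group_edges = [round(bar_left_edge(0, j), 2) for j in range(N_HUES)]
```

The resulting left edges are the same x values used for the manual labels above, and the third bar of every group is centred on the group's integer position.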

Plot multiple distplot in seaborn Facetgrid

I have a dataframe which looks like below:
df:
RY MAJ_CAT Value
2016 Cause Unknown 0.00227
2016 Vegetation 0.04217
2016 Vegetation 0.04393
2016 Vegetation 0.07878
2016 Defective Equip 0.00137
2018 Cause Unknown 0.00484
2018 Defective Equip 0.01546
2020 Defective Equip 0.05169
2020 Defective Equip 0.00515
2020 Cause Unknown 0.00050
I want to plot the distribution of the value over the given years. So I used distplot of seaborn by using following code:
year_2016 = df[df['RY']==2016]
year_2018 = df[df['RY']==2018]
year_2020 = df[df['RY']==2020]
sns.distplot(year_2016['value'].values, hist=False,rug=True)
sns.distplot(year_2018['value'].values, hist=False,rug=True)
sns.distplot(year_2020['value'].values, hist=False,rug=True)
In the next step I want to plot the same value distribution over the given year w.r.t MAJ_CAT. So I decided to use Facetgrid of seaborn, below is the code :
g = sns.FacetGrid(df,col='MAJ_CAT')
g = g.map(sns.distplot,df[df['RY']==2016]['value'].values, hist=False,rug=True)
g = g.map(sns.distplot,df[df['RY']==2018]['value'].values, hist=False,rug=True)
g = g.map(sns.distplot,df[df['RY']==2020]['value'].values, hist=False,rug=True)
However, when I ran the above code, it threw the following error:
KeyError: "None of [Index([(0.00227, 0.04217, 0.043930000000000004, 0.07877999999999999, 0.00137, 0.0018800000000000002, 0.00202, 0.00627, 0.00101, 0.07167000000000001, 0.01965, 0.02775, 0.00298, 0.00337, 0.00088, 0.04049, 0.01957, 0.01012, 0.12065, 0.23699, 0.03639, 0.00137, 0.03244, 0.00441, 0.06748, 0.00035, 0.0066099999999999996, 0.00302, 0.015619999999999998, 0.01571, 0.0018399999999999998, 0.03425, 0.08046, 0.01695, 0.02416, 0.08975, 0.0018800000000000002, 0.14743, 0.06366000000000001, 0.04378, 0.043, 0.02997, 0.0001, 0.22799, 0.00611, 0.13960999999999998, 0.38871, 0.018430000000000002, 0.053239999999999996, 0.06702999999999999, 0.14103, 0.022719999999999997, 0.011890000000000001, 0.00186, 0.00049, 0.13947, 0.0067, 0.00503, 0.00242, 0.00137, 0.00266, 0.38638, 0.24068, 0.0165, 0.54847, 1.02545, 0.01889, 0.32750999999999997, 0.22526, 0.24516, 0.12791, 0.00063, 0.0005200000000000001, 0.00921, 0.07665, 0.00116, 0.01042, 0.27046, 0.03501, 0.03159, 0.46748999999999996, 0.022090000000000002, 2.2972799999999998, 0.69021, 0.22529000000000002, 0.00147, 0.1102, 0.03234, 0.05799, 0.11744, 0.00896, 0.09556, 0.03202, 0.01347, 0.00923, 0.0034200000000000003, 0.041530000000000004, 0.04848, 0.00062, 0.0031100000000000004, ...)], dtype='object')] are in the [columns]"
I am not sure where I am making the mistake. Could anyone please help me fix the issue?
setup the dataframe
import pandas as pd
import numpy as np
import seaborn as sns
# setup dataframe of synthetic data
np.random.seed(365)
data = {'RY': np.random.choice([2016, 2018, 2020], size=400),
        'MAJ_CAT': np.random.choice(['Cause Unknown', 'Vegetation', 'Defective Equip'], size=400),
        'Value': np.random.random(size=400)}
df = pd.DataFrame(data)
Updated Answer
From seaborn v0.11, use sns.displot with kind='kde' and rug=True.
sns.displot is a figure-level interface for drawing distribution plots onto a FacetGrid.
Plotting all 'MAJ_CAT' together
sns.displot(data=df, x='Value', hue='RY', kind='kde', palette='tab10', rug=True)
Plotting 'MAJ_CAT' separately
sns.displot(data=df, col='MAJ_CAT', x='Value', hue='RY', kind='kde', palette='tab10', rug=True)
Original Answer
In seaborn v0.11, distplot is deprecated.
distplot
Consolidate the original code to generate the distplot:
for year in df.RY.unique():
    values = df.Value[df.RY == year]
    sns.distplot(values, hist=False, rug=True)
facetgrid
Configure the mapping properly: g.map expects the name of a column ('Value'), not an array of values, which is what caused the KeyError. Also add hue to the FacetGrid:
g = sns.FacetGrid(df, col='MAJ_CAT', hue='RY')
p1 = g.map(sns.distplot, 'Value', hist=False, rug=True).add_legend()

Add a date event on a line chart in Python

So, I have a line chart that shows a random sales data from 2010 to 2020. But, I want to add a vertical line, or some visual resource to indicate something important that happened in 2014, for example. How can I do that in Python? Any library would do!
Try using plt.axvline() from matplotlib:
import matplotlib.pyplot as plt
x = [2015, 2016, 2017, 2018, 2019, 2020]
y = [1000, 1200, 2500, 1000, 1100, 250]
plt.plot(x, y)
plt.title("Sales line chart")
plt.xlabel("year")
plt.ylabel('Sales')
# draw a vertical line at x = 2019
plt.axvline(x=2019, label='line at x = {}'.format(2019), c='red')
plt.legend()  # show the label given to the vertical line
plt.show()
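For the 2014 event mentioned in the question, a variant with a short text note next to the line might look like the sketch below. The sales figures are made up for illustration, and the wording of the note is a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up yearly sales for 2010 to 2020.
years = list(range(2010, 2021))
sales = [900, 950, 1100, 1050, 1300, 1000, 1200, 2500, 1000, 1100, 250]

fig, ax = plt.subplots()
ax.plot(years, sales, marker="o")

# Vertical dashed line marking the event year.
event_year = 2014
event_line = ax.axvline(x=event_year, color="red", linestyle="--")

# Text note placed just to the right of the line, near the top of the data.
note = ax.annotate("important event",
                   xy=(event_year, max(sales)),
                   xytext=(5, 0), textcoords="offset points",
                   color="red", va="top")

ax.set_xlabel("Year")
ax.set_ylabel("Sales")
```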

Seaborn chart converging on same point not visible

I have a dataframe with two columns, VOL and INVOL, and for a particular year their values are the same. Hence, when plotting in seaborn, I cannot see the value of one column where the two converge.
For example:
My dataframe is
When I use seaborn with the code below
df5_test = df5_test.melt('FY', var_name='cols', value_name='vals')
g = sns.catplot(x="FY", y="vals", hue='cols', data=df5_test, kind='point')
the chart does not show both points at 0.06.
I tried using pandas plotting, having the same result.
Please advise what I should do. Thanks in advance.
Your plot looks legitimate. The two lines overlap perfectly wherever the data is exactly the same (2016 to 2018 here). You could plot the two lines separately and add or subtract a small value from one of them to shift it slightly. For example:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.DataFrame({'FY': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
'VOL_PCT': [0, 0.08, 0.07, 0.06, 0, 0, 0.06],
'INVOL_PC': [0, 0, 0, 0, 0, 0, 0.06]})
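# plot (x and y are passed as keywords; recent seaborn versions no longer accept them positionally)
fig, ax = plt.subplots()
sns.lineplot(x=df.FY, y=df.VOL_PCT, ax=ax)
sns.lineplot(x=df.FY + .01, y=df.INVOL_PC - .001, ax=ax)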
In addition, given the type of your data, you could also consider using stack plots. For example:
fig, ax = plt.subplots()
labels = ['VOL_PCT', 'INVOL_PC']
ax.stackplot(df.FY, df.VOL_PCT, df.INVOL_PC, labels=labels)
ax.legend(loc='upper left');
Ref. Stackplot
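Another option that stays close to the code in the question is the dodge parameter of seaborn's point plots, which shifts the hue levels slightly apart along the categorical axis. A minimal sketch, reusing the example data from this answer:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'FY': [2012, 2013, 2014, 2015, 2016, 2017, 2018],
                   'VOL_PCT': [0, 0.08, 0.07, 0.06, 0, 0, 0.06],
                   'INVOL_PC': [0, 0, 0, 0, 0, 0, 0.06]})

# Long format: one row per (FY, column) pair, as in the question.
long_df = df.melt('FY', var_name='cols', value_name='vals')

# dodge=True offsets the two hue levels horizontally so equal values stay visible.
g = sns.catplot(x='FY', y='vals', hue='cols', data=long_df, kind='point', dodge=True)
```

Unlike the manual offset above, this leaves the plotted y values unchanged.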

Python, Seaborn: Plotting frequencies with zero-values

I have a Pandas series with values for which I'd like to plot counts. This creates roughly what I want:
dy = sns.countplot(x=rated.year, color="#53A2BE")
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
plt.show()
The problem comes with missing data. There are 31 years with ratings, but over a timespan of 42 years. That means there should be some empty bins, which are not being displayed. Is there a way to configure this in Seaborn/Matplotlib? Should I use another type of graph, or is there another fix for this?
I thought about looking into whether it is possible to configure it as a time series, but I have the same problem with rating scales. So, on a 1-10 scale the count for e.g. 4 might be zero, and therefore '4' is not in the Pandas data series, which means it also does not show up in the graph.
The result I'd like is the full scale on the x-axis, with counts (for steps of one) on the y-axis, and showing zero/empty bins for missing instances of the scale, instead of simply showing the next bin for which data is available.
EDIT:
The data (rated.year) looks something like this:
import pandas as pd
rated = pd.DataFrame(data = [2016, 2004, 2007, 2010, 2015, 2016, 2016, 2015,
2011, 2010, 2016, 1975, 2011, 2016, 2015, 2016,
1993, 2011, 2013, 2011], columns = ["year"])
It has more values, but the format is the same. As you can see in..
rated.year.value_counts()
..there are quite a few x values for which count would have to be zero in the graph. Currently plot looks like:
I solved the problem by using the solution suggested by @mwaskom in the comments to my question, i.e. passing an 'order' to countplot that lists all valid values for year, including those with a count of zero. This is the code that produces the graph:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
rated = pd.DataFrame(data = [2016, 2004, 2007, 2010, 2015, 2016, 2016, 2015,
2011, 2010, 2016, 1975, 2011, 2016, 2015, 2016,
1993, 2011, 2013, 2011], columns = ["year"])
dy = sns.countplot(x=rated.year, color="#53A2BE", order=list(range(rated.year.min(), rated.year.max() + 1)))
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
plt.show()
Alternatively, consider a seaborn barplot built from a reindexed series cast to a dataframe:
# REINDEXED DATAFRAME (columns are named explicitly so the result does not depend on the pandas version)
rated_ser = (rated['year'].value_counts()
             .reindex(range(rated.year.min(), rated.year.max() + 1), fill_value=0)
             .rename_axis('year')
             .reset_index(name='count'))
# SNS BAR PLOT
dy = sns.barplot(x='year', y='count', data=rated_ser, color="#53A2BE")
dy.set_xticklabels(dy.get_xticklabels(), rotation=90) # ROTATE LABELS, 90 DEG.
axes = dy.axes
dy.set(xlabel='Release Year', ylabel = "Count")
dy.spines['top'].set_color('none')
dy.spines['right'].set_color('none')
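The reindexing step can be checked on its own, without plotting. This sketch uses the 20 sample years from the question and only pandas:

```python
import pandas as pd

# The 20 sample years from the question.
years = pd.Series([2016, 2004, 2007, 2010, 2015, 2016, 2016, 2015,
                   2011, 2010, 2016, 1975, 2011, 2016, 2015, 2016,
                   1993, 2011, 2013, 2011], name="year")

# Count each year, then reindex over the full range so missing years get a count of 0.
counts = (years.value_counts()
          .reindex(range(years.min(), years.max() + 1), fill_value=0))
```

The result has one entry per year from 1975 to 2016, which is what either the order argument or the barplot needs in order to draw the empty bins.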
