How to change the transparency of the confidence interval in a relplot? - python

I found an answer for regplots, but I can't get the same code to work for relplots. I want to change the transparency of the confidence intervals while keeping the lines of my graph darker, but the alpha input for relplots makes the entire graph more translucent.
My code:
cookie = sns.relplot(
data=data, x="Day", y="Touches",
hue='Sex', ci=68, kind="line", col = 'Event'
)
The line that works for regplots is
plt.setp(cookie.collections[1], alpha=0.4)

While, regplot returns one ax (subplot), relplot returns a complete grid of subplots (a FacetGrid). Often, the return value is grabbed into a variable named g (calling it cookie can make things very confusing when comparing with code from the documents).
You can loop through the individual axes of the FacetGrid and make the change for each of them:
import matplotlib.pyplot as plt
import seaborn as sns
fmri = sns.load_dataset('fmri')
g = sns.relplot(data=fmri, x="timepoint", y="signal", col="region",
hue="event", style="event", kind="line")
for ax in g.axes.flat:
ax.collections[-1].set_alpha(0.4)
plt.show()
PS: As mentioned in the comments, if you want to change the alpha of all confidence intervals (instead of just one of them, as in the code of the question), you can use sns.relplot(..., err_kws={"alpha": .4}).

Related

Two seaborn plots with different scales displayed on same plot but bars overlap

I am trying to include 2 seaborn countplots with different scales on the same plot but the bars display as different widths and overlap as shown below. Any idea how to get around this?
Setting dodge=False, doesn't work as the bars appear on top of each other.
The main problem of the approach in the question, is that the first countplot doesn't take hue into account. The second countplot won't magically move the bars of the first. An additional categorical column could be added, only taking on the 'weekend' value. Note that the column should be explicitly made categorical with two values, even if only one value is really used.
Things can be simplified a lot, just starting from the original dataframe, which supposedly already has a column 'is_weeked'. Creating the twinx ax beforehand allows to write a loop (so writing the call to sns.countplot() only once, with parameters).
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_style('dark')
# create some demo data
data = pd.DataFrame({'ride_hod': np.random.normal(13, 3, 1000).astype(int) % 24,
'is_weekend': np.random.choice(['weekday', 'weekend'], 1000, p=[5 / 7, 2 / 7])})
# now, make 'is_weekend' a categorical column (not just strings)
data['is_weekend'] = pd.Categorical(data['is_weekend'], ['weekday', 'weekend'])
fig, ax1 = plt.subplots(figsize=(16, 6))
ax2 = ax1.twinx()
for ax, category in zip((ax1, ax2), data['is_weekend'].cat.categories):
sns.countplot(data=data[data['is_weekend'] == category], x='ride_hod', hue='is_weekend', palette='Blues', ax=ax)
ax.set_ylabel(f'Count ({category})')
ax1.legend_.remove() # both axes got a legend, remove one
ax1.set_xlabel('Hour of Day')
plt.tight_layout()
plt.show()
use plt.xticks(['put the label by hand in your x label'])

Changing x-labels and width while using catplot in seaborn

I have a sample dataset as follows;
pd.DataFrame({'Day_Duration':['Evening','Evening','Evening','Evening','Evening','Morning','Morning','Morning',
'Morning','Morning','Night','Night','Night','Night','Night','Noon','Noon','Noon',
'Noon','Noon'],'place_category':['Other','Italian','Japanese','Chinese','Burger',
'Other','Juice Bar','Donut','Bakery','American','Other','Italian','Japanese','Burger',\
'American','Other','Italian','Burger','American','Salad'],'Percent_delivery':[14.03,10.61,9.25,8.19,6.89,19.58,10.18,9.14,8.36,6.53,13.60,8.42,\
8.22,7.66,6.67,17.71,10.62,8.44,8.33,7.50]})
The goal is to draw faceted barplot with Day_duration serving as facets, hence 4 facets in total. I used the following code to achieve the same,
import seaborn as sns
#g = sns.FacetGrid(top5_places, col="Day_Duration")
g=sns.catplot(x="place_category", y="Percent_delivery", hue='place_category',col='Day_Duration',\
data=top5_places,ci=None,kind='bar',height=4, aspect=.7)
g.set_xticklabels(rotation=90)
Attached is the figure I got;
Can I kindly get help with 2 things, first is it possible to get only 5 values on the x-axis for each facet(rather than seeing all the values for each facet), second, is there a way to make the bars a bit wider. Help is appreciated.
Because you're using hue the api applies a unique color to each value of place_category, but it also expects each category to be in the plot, as shown in your image.
The final figure is a FacetGrid. Using subplot is the manual way of creating one.
In order to plot only the top n categories for each Day_Duration, each plot will need to be done individually, with a custom color map.
cmap is a dictionary with place categories as keys and colors as values. It's used so there will be one legend and each category will be colored the same for each plot.
Because we're not using the legend automatically generated by the plot, one needs to be created manually.
patches uses Patch to create each item in the legend. (e.g. the rectangle, associated with color and name).
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
# create a color map for unique values or place
place_cat = df.place_category.unique()
colors = sns.color_palette('husl', n_colors=10)
cmap = dict(zip(place_cat, colors))
# plot a subplot for each Day_Duration
plt.figure(figsize=(16, 6))
for i, tod in enumerate(df.Day_Duration.unique(), 1):
data = df[df.Day_Duration == tod].sort_values(['Percent_delivery'], ascending=False)
plt.subplot(1, 4, i)
p = sns.barplot(x='place_category', y='Percent_delivery', data=data, hue='place_category', palette=cmap)
p.legend_.remove()
plt.xticks(rotation=90)
plt.title(f'Day Duration: {tod}')
plt.tight_layout()
patches = [Patch(color=v, label=k) for k, v in cmap.items()]
plt.legend(handles=patches, bbox_to_anchor=(1.04, 0.5), loc='center left', borderaxespad=0)
plt.show()

multiple boxplots, side by side, using matplotlib from a dataframe

I'm trying to plot 60+ boxplots side by side from a dataframe and I was wondering if someone could suggest some possible solutions.
At the moment I have df_new, a dataframe with 66 columns, which I'm using to plot boxplots. The easiest way I found to plot the boxplots was to use the boxplot package inside pandas:
boxplot = df_new.boxplot(column=x, figsize = (100,50))
This gives me a very very tiny chart with illegible axis which I cannot seem to change the font size for, so I'm trying to do this natively in matplotlib but I cannot think of an efficient way of doing it. I'm trying to avoid creating 66 separate boxplots using something like:
fig, ax = plt.subplots(nrows = 1,
ncols = 66,
figsize = (10,5),
sharex = True)
ax[0,0].boxplot(#insert parameters here)
I actually do not not how to get the data from df_new.describe() into the boxplot function, so any tips on this would be greatly appreciated! The documentation is confusing. Not sure what x vectors should be.
Ideally I'd like to just give the boxplot function the dataframe and for it to automatically create all the boxplots by working out all the quartiles, column separations etc on the fly - is this even possible?
Thanks!
I tried to replace the boxplot with a ridge plot, which takes up less space because:
it requires half of the width
you can partially overlap the ridges
it develops vertically, so you can scroll down all the plot
I took the code from the seaborn documentation and adapted it a little bit in order to have 60 different ridges, normally distributed; here the code:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import itertools
sns.set(style="white", rc={"axes.facecolor": (0, 0, 0, 0)})
# # Create the data
n = 20
x = list(np.random.randn(1, 60)[0])
g = [item[0] + item[1] for item in list(itertools.product(list('ABCDEFGHIJ'), list('123456')))]
df = pd.DataFrame({'x': n*x,
'g': n*g})
# Initialize the FacetGrid object
pal = sns.cubehelix_palette(10, rot=-.25, light=.7)
g = sns.FacetGrid(df, row="g", hue="g", aspect=15, height=.5, palette=pal)
# Draw the densities in a few steps
g.map(sns.kdeplot, "x", clip_on=False, shade=True, alpha=1, lw=1.5, bw=.2)
g.map(sns.kdeplot, "x", clip_on=False, color="w", lw=2, bw=.2)
g.map(plt.axhline, y=0, lw=2, clip_on=False)
# Define and use a simple function to label the plot in axes coordinates
def label(x, color, label):
ax = plt.gca()
ax.text(0, .2, label, fontweight="bold", color=color,
ha="left", va="center", transform=ax.transAxes)
g.map(label, "x")
# Set the subplots to overlap
g.fig.subplots_adjust(hspace=-.25)
# Remove axes details that don't play well with overlap
g.set_titles("")
g.set(yticks=[])
g.despine(bottom=True, left=True)
plt.show()
This is the result I get:
I don't know if it will be good for your needs, in any case keep in mind that keeping so many distributions next to each other will always require a lot of space (and a very big screen).
Maybe you could try dividing the distrubutions into smaller groups and plotting them a little at a time?

Adding second legend to scatter plot

Is there a way to add a secondary legend to a scatterplot, where the size of the scatter is proportional to some data?
I have written the following code that generates a scatterplot. The color of the scatter represents the year (and is taken from a user-defined df) while the size of the scatter represents variable 3 (also taken from a df but is raw data):
import pandas as pd
colors = pd.DataFrame({'1985':'red','1990':'b','1995':'k','2000':'g','2005':'m','2010':'y'}, index=[0,1,2,3,4,5])
fig = plt.figure()
ax = fig.add_subplot(111)
for i in df.keys():
df[i].plot(kind='scatter',x='variable1',y='variable2',ax=ax,label=i,s=df[i]['variable3']/100, c=colors[i])
ax.legend(loc='upper right')
ax.set_xlabel("Variable 1")
ax.set_ylabel("Variable 2")
This code (with my data) produces the following graph:
So while the colors/years are well and clearly defined, the size of the scatter is not.
How can I add a secondary or additional legend that defines what the size of the scatter means?
You will need to create the second legend yourself, i.e. you need to create some artists to populate the legend with. In the case of a scatter we can use a normal plot and set the marker accordingly.
This is shown in the below example. To actually add a second legend we need to add the first legend to the axes, such that the new legend does not overwrite the first one.
import matplotlib.pyplot as plt
import matplotlib.colors
import numpy as np; np.random.seed(1)
import pandas as pd
plt.rcParams["figure.subplot.right"] = 0.8
v = np.random.rand(30,4)
v[:,2] = np.random.choice(np.arange(1980,2015,5), size=30)
v[:,3] = np.random.randint(5,13,size=30)
df= pd.DataFrame(v, columns=["x","y","year","quality"])
df.year = df.year.values.astype(int)
fig, ax = plt.subplots()
for i, (name, dff) in enumerate(df.groupby("year")):
c = matplotlib.colors.to_hex(plt.cm.jet(i/7.))
dff.plot(kind='scatter',x='x',y='y', label=name, c=c,
s=dff.quality**2, ax=ax)
leg = plt.legend(loc=(1.03,0), title="Year")
ax.add_artist(leg)
h = [plt.plot([],[], color="gray", marker="o", ms=i, ls="")[0] for i in range(5,13)]
plt.legend(handles=h, labels=range(5,13),loc=(1.03,0.5), title="Quality")
plt.show()
Have a look at http://matplotlib.org/users/legend_guide.html.
It shows how to have multiple legends (about halfway down) and there is another example that shows how to set the marker size.
If that doesn't work, then you can also create a custom legend (last example).

How to use seaborn pointplot and violinplot in the same figure? (change xticks and marker of pointplot)

I am trying to create violinplots that shows confidence intervals for the mean. I thought an easy way to do this would be to plot a pointplot on top of the violinplot, but this is not working since they seem to be using different indices for the xaxis as in this example:
import matplotlib.pyplot as plt
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.dropna(inplace=True)
fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True, figsize=(12,4))
#ax1
sns.pointplot("who", "age", data=titanic, join=False,n_boot=10, ax=ax1)
#ax2
sns.violinplot(titanic.age, groupby=titanic.who, ax=ax2)
#ax3
sns.pointplot("who", "age", data=titanic, join=False, n_boot=10, ax=ax3)
sns.violinplot(titanic.age, groupby=titanic.who, ax=ax3)
ax3.set_xlim([-0.5,4])
print(ax1.get_xticks(), ax2.get_xticks())
gives: [0 1 2] [1 2 3]
Why are these plots not assigning the same xtick numbers to the 'who'-variable and is there any way I can change this?
I also wonder if there is anyway I can change the marker for pointplot, because as you can see in the figure, the point is so big so that it covers the entire confidence interval. I would like just a horizontal line if possible.
I'm posting my final solution here. The reason I wanted to do this kind of plot to begin with, was to display information about the distribution shape, shift in means, and outliers in the same figure. With mwaskom's pointers and some other tweaks I finally got what I was looking for.
The left hand figure is there as a comparison with all data points plotted as lines and the right hand one is my final figure. The thick grey line in the middle of the violin is the bootstrapped 99% confidence interval of the mean, which is the white horizontal line, both from pointplot. The three dotted lines are the standard 25th, 50th and 75th percentile and the lines outside that are the caps of the whiskers of a boxplot I plotted on top of the violin plot. Individual data points are plotted as lines beyond this points since my data usually has a few extreme ones that I need to remove manually like the two points in the violin below.
For now, I am going to to continue making histograms and boxplots in addition to these enhanced violins, but I hope to find that all the information is accurately captured in the violinplot and that I can start and rely on it as my main initial data exploration plot. Here is the final code to produce the plots in case someone else finds them useful (or finds something that can be improved). Lots of tweaking to the boxplot.
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
#change the linewidth which to get a thicker confidence interval line
mpl.rc("lines", linewidth=3)
df = sns.load_dataset("titanic")
df.dropna(inplace=True)
x = 'who'
y = 'age'
fig, (ax1,ax2) = plt.subplots(1,2, sharey=True, figsize=(12,6))
#Left hand plot
sns.violinplot(df[y], groupby=df[x], ax=ax1, inner='stick')
#Right hand plot
sns.violinplot(df[y], groupby=df[x], ax=ax2, positions=0)
sns.pointplot(df[x],df[y], join=False, ci=99, n_boot=1000, ax=ax2, color=[0.3,0.3,0.3], markers=' ')
df.boxplot(y, by=x, sym='_', ax=ax2, showbox=False, showmeans=True, whiskerprops={'linewidth':0},
medianprops={'linewidth':0}, flierprops={'markeredgecolor':'k', 'markeredgewidth':1},
meanprops={'marker':'_', 'color':'w', 'markersize':6, 'markeredgewidth':1.5},
capprops={'linewidth':1, 'color':[0.3,0.3,0.3]}, positions=[0,1,2])
#One could argue that this is not beautiful
labels = [item.get_text() + '\nn=' + str(df.groupby(x).size().loc[item.get_text()]) for item in ax2.get_xticklabels()]
ax2.set_xticklabels(labels)
#Clean up
fig.suptitle('')
ax2.set_title('')
fig.set_facecolor('w')
Edit: Added 'n='
violinplot takes a positions argument that you can use to put the violins somewhere else (they currently just inherit the default matplotlib boxplot positions).
pointplot takes a markers argument that you can use to change how the point estimate is rendered.

Categories