I am toying around with seaborn violinplot, trying to make a single "violin" with each half being a different distribution, to be easily compared.
Modifying the simple example from here by changing the x axis to x="smoker", I got the following graph (linked below).
import seaborn as sns
sns.set(style="whitegrid", palette="pastel", color_codes=True)
# Load the example tips dataset
tips = sns.load_dataset("tips")
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="smoker", y="total_bill", hue="smoker",
split=True, inner="quart", data=tips)
sns.despine(left=True)
This is the resulting graph
I would like the graph to show not two separate halves but one single violin, with each half showing a different distribution in its own colour.
Is it possible to do this with seaborn, or maybe with another library?
Thanks!
This happens because x="smoker" puts two categories on the x axis, smoker "Yes" and smoker "No", and each category gets its own violin position.
What you really want is to plot all the data at one position. To do this, specify a single value for the x axis.
import seaborn as sns
sns.set(style="whitegrid", palette="pastel", color_codes=True)
# Load the example tips dataset
tips = sns.load_dataset("tips")
# Use one constant x label so both halves share a single violin
sns.violinplot(x=['Data']*len(tips), y="total_bill", hue="smoker",
               split=True, inner="quart",
               palette={"Yes": "y", "No": "b"},
               data=tips)
sns.despine(left=True)
This outputs the following:
I'm trying to create a categorical plot in which the size of each marker reflects some magnitude of the corresponding sample, as in the following example using the preloaded tips data (upper plot, https://i.stack.imgur.com/pRn0x.png):
import seaborn as sns
sns.set(style="whitegrid")
tips = sns.load_dataset("tips")
ax = sns.stripplot("day", "total_bill", data=tips, palette="Set2", size=tips["size"]*5, edgecolor="gray", alpha=.25)
But when I try the same with my own data, all the markers have the same size (lower plot https://i.stack.imgur.com/pRn0x.png):
import seaborn as sns
import pandas as pd
df = pd.read_csv("python_plot_test3.csv")
sns.set(style="whitegrid")
ax = sns.stripplot("log10p_value","term_name", data=df, palette="Set2", size=df['precision'], edgecolor="gray", alpha=.50)
I suspected the data types were different, but that does not seem to be the case. The only difference I can see is that printing df['precision'] returns the name and dtype, whereas printing tips["size"] also returns its length.
Could someone give me a hint? I found how to change it in scatter plots, but nothing on categorical plots.
My data:
term_name,log10p_value,precision
muscle structure development,33.34122617,15
anatomical structure morphogenesis,32.91330177,5
muscle system process,31.61813233,11
regulation of multicellular organismal process,30.84862451,25
system development,29.16494157,36
muscle cell differentiation,28.79114555,11
Okay, so it looks like relplot is the right kind of function for this. At first I assumed it was only for continuous data, but it can also handle categorical data. I still don't understand why stripplot worked with the example data, though.
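A minimal sketch of that idea, using the six rows posted above. It uses scatterplot (the axes-level function that relplot draws with by default) rather than relplot itself, and the sizes=(40, 400) range and the figure size are arbitrary choices of mine, not something from the original post:

```python
import io

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# The sample rows from the question, inlined so the sketch is self-contained
csv_text = """term_name,log10p_value,precision
muscle structure development,33.34122617,15
anatomical structure morphogenesis,32.91330177,5
muscle system process,31.61813233,11
regulation of multicellular organismal process,30.84862451,25
system development,29.16494157,36
muscle cell differentiation,28.79114555,11
"""
df = pd.read_csv(io.StringIO(csv_text))

sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(9, 4))

# size= maps the "precision" column to marker area;
# sizes= gives the (smallest, largest) marker size (assumed values)
sns.scatterplot(data=df, x="log10p_value", y="term_name",
                size="precision", sizes=(40, 400),
                edgecolor="gray", alpha=0.5, ax=ax)
fig.tight_layout()
```

Call plt.show() afterwards to display the figure. The same size= and sizes= arguments can be passed to relplot if a figure-level plot is preferred.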
I have a boxplot below (made with seaborn) where the "box" part is too squashed. How do I change the scale along the y-axis so that the box is more presentable while still keeping all the outliers in the plot?
Many thanks.
You can do two things here.
Make the plot bigger
Change the range of the y-axis
Since you want to keep the outliers, rescaling the y-axis may not be that effective. You haven't given any data or code examples. So I'll just add a way to make your figure bigger.
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")
# Make the figure bigger and rescale the y-axis
fig = plt.figure(figsize=(20, 15))
ax = sns.boxplot(x="day", y="total_bill", data=tips)
ax.set_ylim(0, 100)
You could set the axis after the plot:
import seaborn as sns
df = sns.load_dataset('iris')
a = sns.boxplot(y=df["sepal_length"])
a.set(ylim=(0,10))
Additionally, you could try dropping the outliers from the plot by passing showfliers=False to boxplot.
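As a rough illustration of the showfliers option, here is a sketch on made-up data (the values and the three extreme points are invented for the example). It draws the same column twice, once with outliers and once without, so the effect on the y-axis range is visible:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data (an assumption): a tight cluster plus three extreme values
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(10, 2, 200), [80, 95, 120]])
df = pd.DataFrame({"value": values})

fig, (ax_all, ax_trim) = plt.subplots(1, 2, figsize=(8, 4))

# Left: default behaviour, outliers stretch the axis and squash the box
sns.boxplot(y=df["value"], ax=ax_all)
ax_all.set_title("with outliers")

# Right: outliers hidden, so the axis autoscales to the whiskers
sns.boxplot(y=df["value"], showfliers=False, ax=ax_trim)
ax_trim.set_title("showfliers=False")
```

Note that this hides the outliers rather than keeping them, so it only helps when leaving them out of the picture is acceptable.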
Currently I have a few plots using Facet Grids in seaborn. I have the following code:
g = sns.FacetGrid(masterdata1,col = "courseName")
g=g.map(plt.scatter, "SubjectwisePercentage", "SemesterPercentage")
The above code plots subjectwisepercentage vs semesterpercentage, for different courses across a semester. How can I plot the different scatter plots in a single plot, instead of multiple plots across the facet grid? In the single plot, the plotted points for each course should be a different color.
There are links online that specify how to plot different datasets in a single plot. However I need to use the same dataset. Therefore I need to specify col="courseName", or something equivalent, to plot course wise data in a single plot. I am not sure of how to accomplish this. Thank you in advance for your help.
You can try seaborn's scatterplot. It lets you define x, y, hue, style, and even size, which gives up to a 5D view of your data. People sometimes map hue and style to the same variable for better-looking graphs.
Sample code (mostly not mine, since the seaborn documentation explains pretty much everything):
import seaborn as sns
import matplotlib.pyplot as plt
sns.set(style="ticks", color_codes=True)
tips = sns.load_dataset("tips")
# g = sns.FacetGrid(tips, col="sex", hue="time", palette="Set1",
# hue_order=["Dinner", "Lunch"])
# g= (g.map(plt.scatter, "total_bill", "tip")).add_legend()
# sns.scatterplot(data=tips, x="total_bill", y="tip", hue='time', style='sex')
sns.scatterplot(data=tips, x="total_bill", y="tip", hue='time', style='sex', size='size')
plt.show()
The matplotlib scatter plot can also be helpful, since you can plot several datasets on the same axes with different markers, colors, and sizes.
See this example.
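A minimal sketch of the matplotlib route, assuming a DataFrame shaped like the one in the question. The column names come from the question, but the course names and values are made up here. Each course is drawn with its own scatter call, so matplotlib's color cycle gives each one a different color:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in for masterdata1 (the values and course names are invented)
rng = np.random.default_rng(0)
masterdata1 = pd.DataFrame({
    "courseName": np.repeat(["Math", "Physics", "Chemistry"], 20),
    "SubjectwisePercentage": rng.uniform(40, 100, 60),
    "SemesterPercentage": rng.uniform(40, 100, 60),
})

fig, ax = plt.subplots()

# One scatter call per course; all of them share the same axes
for course, group in masterdata1.groupby("courseName"):
    ax.scatter(group["SubjectwisePercentage"],
               group["SemesterPercentage"],
               label=course, alpha=0.7)

ax.set_xlabel("SubjectwisePercentage")
ax.set_ylabel("SemesterPercentage")
legend = ax.legend(title="courseName")
```

With seaborn, passing hue="courseName" to scatterplot should give an equivalent single plot without the explicit loop.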
I have 2 datasets (df3 and df4 which respectively hold information for total head and efficiency) with a common independent variable (flow rate).
I am looking to plot both of them in the same graph but the dependent variables have different y-axes. I initially used lmplot() for the polynomial order functionality but this was unsuccessful in having both plots appear in one window. I would like assistance with combining both my scatter plot and regression plots into one plot which shows the overlap between the datasets.
I have used the following approach to generate my charts:
ax2.scatter(df3['Flow_Rate_(KG/S)'], df2['Efficiency_%'], color='pink')
ax2.scatter(df4['Flow_Rate_(KG/S)'], df4['Total Head'], color='teal')
plt.show()
The reason why it is important for the lines to be plotted against each other is that to monitor pump performance, we need to have both the total head (M) and efficiency % of the pump to understand the relationship and subsequent degradation of performance.
The only other way I could think of is to write the polynomial functions as equations to be put into arguments in the plot function and have them drawn out as such. I haven't yet tried this but thought I'd ask if there are any other alternatives before I head down this pathway.
Let me try to rephrase the problem: You have two datasets with common independent values, but different dependent values (f(x), g(x) respectively). You want to plot them both in the same graph, but the dependent values have totally different ranges. Therefore you want to have two different y axes, one for each dataset. The data should be plotted as a scatter plot and a regression line should be shown for each of them; you are more interested in seeing the regression line than in knowing or calculating the regression curve itself. Hence you tried to use seaborn lmplot, but you were unable to get both datasets into the same graph.
In case the above is the problem you want to solve, the answer could be the following.
lmplot essentially plots a regplot to an axes grid. Because you don't need that axes grid here, using a regplot may make more sense. You may then create an axes and a twin axes and plot one regplot to each of them.
import numpy as np; np.random.seed(42)
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df1 = pd.DataFrame({"x": np.sort(np.random.rand(30)),
"f": np.sort(np.random.rayleigh(size=30))})
df2 = pd.DataFrame({"x": np.sort(np.random.rand(30)),
"g": 500-0.1*np.sort(np.random.rayleigh(20,size=30))**2})
fig, ax = plt.subplots()
ax2 = ax.twinx()
sns.regplot(x="x", y="f", data=df1, order=2, ax=ax)
sns.regplot(x="x", y="g", data=df2, order=2, ax=ax2)
ax2.legend(handles=[a.lines[0] for a in [ax,ax2]],
labels=["f", "g"])
plt.show()
I am trying to create violin plots that show confidence intervals for the mean. I thought an easy way to do this would be to plot a pointplot on top of the violinplot, but this does not work because they seem to use different indices for the x axis, as in this example:
import matplotlib.pyplot as plt
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic.dropna(inplace=True)
fig, (ax1,ax2,ax3) = plt.subplots(1,3, sharey=True, figsize=(12,4))
#ax1
sns.pointplot("who", "age", data=titanic, join=False,n_boot=10, ax=ax1)
#ax2
sns.violinplot(titanic.age, groupby=titanic.who, ax=ax2)
#ax3
sns.pointplot("who", "age", data=titanic, join=False, n_boot=10, ax=ax3)
sns.violinplot(titanic.age, groupby=titanic.who, ax=ax3)
ax3.set_xlim([-0.5,4])
print(ax1.get_xticks(), ax2.get_xticks())
gives: [0 1 2] [1 2 3]
Why are these plots not assigning the same xtick numbers to the 'who' variable, and is there any way I can change this?
I also wonder if there is any way to change the marker for pointplot because, as you can see in the figure, the point is so big that it covers the entire confidence interval. I would like just a horizontal line if possible.
I'm posting my final solution here. The reason I wanted to do this kind of plot to begin with, was to display information about the distribution shape, shift in means, and outliers in the same figure. With mwaskom's pointers and some other tweaks I finally got what I was looking for.
The left hand figure is there as a comparison with all data points plotted as lines, and the right hand one is my final figure. The thick grey line in the middle of the violin is the bootstrapped 99% confidence interval of the mean, which is the white horizontal line, both from pointplot. The three dotted lines are the standard 25th, 50th and 75th percentiles, and the lines outside those are the caps of the whiskers of a boxplot I plotted on top of the violin plot. Individual data points are plotted as lines beyond these points, since my data usually has a few extreme ones that I need to remove manually, like the two points in the violin below.
For now, I am going to continue making histograms and boxplots in addition to these enhanced violins, but I hope to find that all the information is accurately captured in the violinplot so that I can start to rely on it as my main initial data exploration plot. Here is the final code to produce the plots in case someone else finds them useful (or finds something that can be improved). Most of the tweaking went into the boxplot.
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
# Change the line width to get a thicker confidence interval line
mpl.rc("lines", linewidth=3)
df = sns.load_dataset("titanic")
df.dropna(inplace=True)
x = 'who'
y = 'age'
fig, (ax1,ax2) = plt.subplots(1,2, sharey=True, figsize=(12,6))
#Left hand plot
sns.violinplot(df[y], groupby=df[x], ax=ax1, inner='stick')
#Right hand plot
sns.violinplot(df[y], groupby=df[x], ax=ax2, positions=0)
sns.pointplot(df[x],df[y], join=False, ci=99, n_boot=1000, ax=ax2, color=[0.3,0.3,0.3], markers=' ')
df.boxplot(y, by=x, sym='_', ax=ax2, showbox=False, showmeans=True, whiskerprops={'linewidth':0},
medianprops={'linewidth':0}, flierprops={'markeredgecolor':'k', 'markeredgewidth':1},
meanprops={'marker':'_', 'color':'w', 'markersize':6, 'markeredgewidth':1.5},
capprops={'linewidth':1, 'color':[0.3,0.3,0.3]}, positions=[0,1,2])
#One could argue that this is not beautiful
labels = [item.get_text() + '\nn=' + str(df.groupby(x).size().loc[item.get_text()]) for item in ax2.get_xticklabels()]
ax2.set_xticklabels(labels)
#Clean up
fig.suptitle('')
ax2.set_title('')
fig.set_facecolor('w')
Edit: Added 'n='
violinplot takes a positions argument that you can use to put the violins somewhere else (they currently just inherit the default matplotlib boxplot positions).
pointplot takes a markers argument that you can use to change how the point estimate is rendered.
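For readers on a more recent seaborn release, the overlay might look roughly like the sketch below. It rests on several assumptions: the data is synthetic (the group names mimic the 'who' column but the ages are invented), the column-name call style (x=, y=, data=) replaces the groupby= form used in the question, and positions is not used at all because, as far as I know, both functions now place categories at 0, 1, 2, ... by default. The markers="_" setting draws the point estimate as a short horizontal line.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the titanic columns used above (values are invented)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "who": np.repeat(["man", "woman", "child"], 50),
    "age": np.concatenate([rng.normal(33, 12, 50),
                           rng.normal(32, 12, 50),
                           rng.normal(7, 4, 50)]),
})
order = ["man", "woman", "child"]

fig, ax = plt.subplots()

# Violin outlines only; the same explicit order is given to both plots
sns.violinplot(x="who", y="age", data=df, order=order,
               inner=None, color="lightgray", ax=ax)

# Mean with a bootstrapped confidence interval, drawn as a horizontal
# tick ("_") with no line joining the groups
sns.pointplot(x="who", y="age", data=df, order=order,
              markers="_", linestyles="none", n_boot=200,
              color="black", ax=ax)
```

Passing the same order to both calls keeps the categories aligned even if one of the plots would otherwise sort them differently.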