Sawtooth look in violin plot [duplicate]

The following code gives me a very nice violinplot (and boxplot within).
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
foo = np.random.rand(100)
sns.violinplot(foo)
plt.boxplot(foo)
plt.show()
So far so good. However, when I look at foo, the variable does not contain any negative values. The seaborn plot seems misleading here. The normal matplotlib boxplot gives something closer to what I would expect.
How can I make violinplots with a better fit (not showing false negative values)?

As the comments note, this is a consequence (I'm not sure I'd call it an "artifact") of the assumptions underlying Gaussian KDE. As has been mentioned, this is somewhat unavoidable, and if your data don't meet those assumptions, you might be better off just using a boxplot, which shows only points that exist in the actual data.
However, in your response you ask about whether it could be fit "tighter", which could mean a few things.
One answer might be to change the bandwidth of the smoothing kernel. You do that with the bw argument, which is actually a scale factor; the bandwidth that will be used is bw * data.std():
data = np.random.rand(100)
sns.violinplot(y=data, bw=.1)
Another answer might be to truncate the violin at the extremes of the datapoints. The KDE will still be fit with densities that extend past the bounds of your data, but the tails will not be shown. You do that with the cut parameter, which specifies how many units of bandwidth past the extreme values the density should be drawn. To truncate, set it to 0:
sns.violinplot(y=data, cut=0)
By the way, the API for violinplot is going to change in 0.6, and I'm using the development version here, but both the bw and cut arguments exist in the current released version and behave more or less the same way.
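For completeness, a minimal self-contained sketch combining both suggestions (a narrower bandwidth plus truncation at the data extremes); the specific values are just examples, using the argument names from the seaborn versions discussed above:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

data = np.random.rand(100)  # strictly non-negative, like foo above
# bw scales the kernel bandwidth; cut=0 stops the violin at the data extremes.
sns.violinplot(y=data, bw=.1, cut=0)
plt.show()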

Related

Python/Seaborn: What does the inside horizontal distribution of the data points mean, or is it random?

It seems that the inside distribution of the data points is almost random every time you plot (using Seaborn). Is this just for readability, or does it have another meaningful purpose?
I am using Python 3.0 and the Seaborn-provided dataset 'tips' for this question.
import seaborn as sns
tips = sns.load_dataset("tips")
After running the same code (below) twice, I see differences in the distribution of the inside points. Here is the code, which you can run a couple of times:
ax = sns.stripplot(x="day", y="total_bill", data=tips, alpha=.55,
                   palette='Set1', jitter=True, linewidth=1)
Now, if you compare the plots (after running the code twice, for example), you will notice that the distribution of the points is not the same in the two plots.
Why are the points not distributed identically across two separate runs? Also, judging those points on the horizontal scale, is there a reason why (for example) one red point is further left than another red point, or is it simply for readability?
Thank you in advance!
After a bit more research, I believe that the distribution of the data points is random but uniform (thank you @ImportanceOfBeingErnest for pointing to the code). So, answering my own questions: there is no hidden meaning in the distribution, and the horizontal range is simply chosen for visibility; it changes or stays the same depending on whether a seed is set.
I do think that both displays are identical along the vertical axis (i.e. both distributions are equal, since they represent the same scatter of the same dataset). The slight visual differences come from the positions along the horizontal (categorical day) axis, and those come from the jitter option (jitter=True), which adds a small random horizontal offset around the category each point belongs to. Jitter helps distinguish points with the same total_bill value (which would otherwise be superimposed exactly), so the difference between runs comes from jitter being enabled, and it is there purely for readability.
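If reproducible jitter is wanted between runs, one option (not from the original thread, and assuming this seaborn version draws its jitter offsets from NumPy's global random state) is to fix the seed before plotting:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

tips = sns.load_dataset("tips")

# Seeding NumPy's global RNG before each call should make the horizontal
# jitter offsets identical across runs (assumption: the jitter is drawn
# from np.random in this seaborn version).
np.random.seed(0)
ax = sns.stripplot(x="day", y="total_bill", data=tips, alpha=.55,
                   palette='Set1', jitter=True, linewidth=1)
plt.show()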

Seaborn -- map diagonal with asymmetric axes

I have some data where it makes sense to plot certain, but not all, variables against each other, and in particular where it only makes sense to plot KDEs for certain variables. This seems, generally, like a good use case for seaborn's PairGrid. However, I cannot use sb.PairGrid.map_diag for the variables for which I do want a KDE when it seems like I ought to be able to.
The following code works as I imagine it would:
import seaborn as sb
import pandas as pd
iris=sb.load_dataset('iris')
pgiris = sb.PairGrid(data=iris,
                     x_vars=['sepal_width','petal_width','sepal_length'],
                     y_vars=['sepal_width','petal_width','sepal_length'])
pgiris.map_diag(sb.kdeplot)
Let's imagine, though, that it doesn't make sense to plot sepal_length on both axes:
pgiris = sb.PairGrid(data=iris,
                     x_vars=['sepal_width','petal_width','sepal_length'],
                     y_vars=['sepal_width','petal_width'])
pgiris.map_diag(sb.kdeplot)
Under seaborn 0.9.0 and Python 3.6.7, this throws a TypeError for reasons I do not understand; a cursory reading suggests that no axes get assigned to the grid's diag_axes attribute. Oddly, the map_offdiag methods seem to work just fine, so I don't think this is intended to fail for asymmetric PairGrids.
How do I properly map functions to diagonal elements of asymmetric PairGrids?
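No answer was given here, but as a hypothetical workaround sketch (untested against seaborn 0.9.0, so treat the details as assumptions): the grid exposes its axes array together with x_vars and y_vars, so one could skip map_diag and draw the KDE manually on the cells where the row and column variable coincide:
import seaborn as sb
import matplotlib.pyplot as plt

iris = sb.load_dataset('iris')
pgiris = sb.PairGrid(data=iris,
                     x_vars=['sepal_width', 'petal_width', 'sepal_length'],
                     y_vars=['sepal_width', 'petal_width'])
pgiris.map_offdiag(plt.scatter)

# Draw a KDE wherever the row and column variable are the same, instead of
# relying on map_diag. Note the density curve shares the row's y scale, so
# it may need rescaling or a twin axis to be readable.
for i, y_var in enumerate(pgiris.y_vars):
    for j, x_var in enumerate(pgiris.x_vars):
        if x_var == y_var:
            sb.kdeplot(iris[x_var], ax=pgiris.axes[i, j], legend=False)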

matplotlib legend performance issue

I am using Jupyter-notebook with python 3.6.2 and matplotlib to plot some data.
When I plot my data, I want to add a legend to the plot (basically to know which line is which).
However, calling plt.legend takes a lot of time (almost as much as the plot itself), even though to my understanding it should be nearly instant.
Minimal toy problem that reproduces the issue:
import numpy as np
import matplotlib.pyplot as plt
# Toy, useless data (one million x 4)
my_data = np.random.rand(1000000,4)
plt.plot(my_data)
#plt.legend(['A','C','G','T'])
plt.show()
The data here is just random and useless, but it reproduces my problem:
If I uncomment the plt.legend line, the run takes almost double the time.
Why? Shouldn't the legend just look at the plot, see that 4 plots have been made, and draw a box assigning each color to the corresponding string?
Why is a simple legend taking so much time?
Am I missing something?
Replicating the answer by @bnaecker, so that this question has an answer:
By default, the legend will be placed in the "best" location, which requires computing how many points from each line are inside a potential legend box. If there are many points, this can take a while. Drawing is much faster when specifying a location other than "best", e.g. plt.legend(loc=3).
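A minimal sketch of that fix (the explicit location below is just an example); specifying a location up front skips the search over every data point that the default "best" placement performs:
import numpy as np
import matplotlib.pyplot as plt

my_data = np.random.rand(1000000, 4)
plt.plot(my_data)
# 'lower left' (equivalently loc=3) avoids the "best"-location search.
plt.legend(['A', 'C', 'G', 'T'], loc='lower left')
plt.show()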

How does matplotlib determine its x limits?

I am currently trying to find out on what basis matplotlib sets its automatic plot limits.
The question arose when I plotted some x_values against some y_values.
For the x_values the following holds: min(x_values) = -801.01 and max(x_values) = 798.80. The limits set by matplotlib are (-1000, 800).
As the data is almost symmetrical around 0, I would like it to be plotted symmetrically around 0. Is there any way I can tell matplotlib to automatically center the plot? Also, matplotlib seems to set the "resolution" of its limits to 200 in this case, which seems a bit high to me.
Of course I could set limits manually, but I want to avoid that if possible.
PS: I don't know if it matters, but I plot the values somewhere and later add the Line2D object to the figure.
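No answer was given here, but one hypothetical way to center the autoscaled view on 0 (a sketch, not matplotlib's own autoscaling logic, and with invented stand-in data) is to mirror the larger of the two autoscaled extremes:
import numpy as np
import matplotlib.pyplot as plt

# Stand-in data with roughly the range described in the question.
x_values = np.random.uniform(-801.01, 798.80, 500)
y_values = np.random.rand(500)

fig, ax = plt.subplots()
ax.plot(x_values, y_values, '.')

# Mirror the wider autoscaled bound so the view is symmetric around 0.
left, right = ax.get_xlim()
bound = max(abs(left), abs(right))
ax.set_xlim(-bound, bound)
plt.show()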

Matplotlib: Avoid congestion in X axis

I'm using this code to plot a cumulative frequency plot:
plot = ocum.plot(x='index', y='cdf', yticks=np.arange(0.0, 1.05, 0.1))
plot.set_xlabel("Data usage")
plot.set_ylabel("CDF")
fig = plot.get_figure()
fig.savefig("overall.png")
It appears as follows and is very crowded around the initial part, due to the spread of my data. How can I make it clearer? (Uploaded to postimg because I don't have enough reputation points.)
http://postimg.org/image/ii5z4czld/
I hope that I understood what you want: give more space to the visualization of the "CDF" development for smaller "data usage" values, right? Typically, you would achieve this by changing your x-axis scale from linear to logarithmic. Head over to Plot logarithmic axes with matplotlib in python to see different ways to achieve that. The simplest, in your case, might be to replace plot() with semilogx().
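A minimal sketch of that suggestion; ocum stands in for the cumulative-frequency DataFrame from the question, so its construction below is invented:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for ocum: heavily skewed "data usage" values.
usage = np.sort(np.random.lognormal(mean=3.0, sigma=1.5, size=1000))
ocum = pd.DataFrame({'index': usage,
                     'cdf': np.arange(1, len(usage) + 1) / len(usage)})

plot = ocum.plot(x='index', y='cdf', yticks=np.arange(0.0, 1.05, 0.1),
                 logx=True)  # pandas' shortcut for a logarithmic x axis
# Plain matplotlib equivalent: plt.semilogx(ocum['index'], ocum['cdf'])
plot.set_xlabel("Data usage")
plot.set_ylabel("CDF")
plot.get_figure().savefig("overall.png")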
