Seaborn data visualization misunderstanding of densities?

I was playing around with the seaborn library for data visualization and trying to display a standard normal distribution. The basics in this case look something like:
import numpy as np
import seaborn as sns
n=1000
N= np.random.randn(n)
fig=sns.displot(N,kind="kde")
Which behaves as expected. My problem starts when I try to plot multiple distributions at the same time. I tried the brute-force approach of N2= np.random.randn(n//2) and fig=sns.displot((N,N2),kind="kde"), which returns two distributions (as wanted), but the one with the smaller sample size looks significantly different (and flatter). Regardless of the sample size, a proper density plot (or normalized histogram) should have an area of one under the curve, but this is clearly not the case.
Knowing that seaborn works with pandas Dataframes, I've tried with the more elaborate (and generally bad and inefficient, but I hope clear) code below to attempt again multiple distributions on the same graph:
import numpy as np
import seaborn as sns
import pandas as pd
n=10000
N_1= np.reshape(np.random.randn(n),(n,1))
N_2= np.reshape(np.random.randn(int(n/2)),(int(n/2),1))
N_3= np.reshape(np.random.randn(int(n/4)),(int(n/4),1))
A_1 = np.reshape(np.array(['n1' for _ in range(n)]),(n,1))
A_2 = np.reshape(np.array(['n2' for _ in range(int(n/2))]),(int(n/2),1))
A_3 = np.reshape(np.array(['n3' for _ in range(int(n/4))]),(int(n/4),1))
F_1=np.concatenate((N_1,A_1),1)
F_2=np.concatenate((N_2,A_2),1)
F_3=np.concatenate((N_3,A_3),1)
F= pd.DataFrame(data=np.concatenate((F_1,F_2,F_3),0),columns=["datar","cat"])
F["datar"]=F.datar.astype('float')
fig=sns.displot(F,x="datar",hue="cat",kind="kde")
This again shows very different (almost scaled) distributions, confirming that the result is not consistent with what I was expecting (namely, roughly overlapping distributions). Am I misunderstanding how this graph works? Is there a completely different approach to drawing multiple distributions on the same graph that I am missing?

Seaborn works happily with and without dataframes. Columns of dataframes get converted to numpy arrays in order to draw the plots.
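sns.displot(..., kind="kde") delegates to sns.kdeplot(), which has a parameter common_norm defaulting to True. With common_norm=True, each density is scaled by its share of the observations so that the areas under all curves together sum to one, which is why the curves for the smaller samples look flatter. Setting it to False normalizes each curve independently, so every curve has an area of one.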
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
n = 10000
N_1 = np.random.randn(n)
N_2 = np.random.randn(n // 2) + 2
N_3 = np.random.randn(n // 4) + 4
sns.displot((N_1, N_2, N_3), kind="kde", common_norm=False)
plt.show()
Note that the default common_norm=True makes sense for kdeplot, because you can always get independently normalized curves by calling kdeplot once per dataset. There is also a useful option multiple (defaulting to 'layer'), which can be set to 'stack' or to 'fill'.
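To make the scaling concrete, here is a minimal numerical sketch of the idea behind common_norm. It does not call seaborn; it uses scipy.stats.gaussian_kde as a stand-in and applies the weighting by hand, so the weights and the grid are assumptions made for illustration rather than seaborn's actual code path.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = {
    "n1": rng.standard_normal(10_000),
    "n2": rng.standard_normal(5_000),
    "n3": rng.standard_normal(2_500),
}
total = sum(len(values) for values in samples.values())
grid = np.linspace(-8, 8, 2001)

areas_independent = {}  # analogous to common_norm=False
areas_common = {}       # analogous to common_norm=True
for name, values in samples.items():
    density = gaussian_kde(values)(grid)   # integrates to about 1 on its own
    weight = len(values) / total           # this sample's share of all observations
    areas_independent[name] = trapezoid(density, grid)
    areas_common[name] = trapezoid(weight * density, grid)

print(areas_independent)
print(areas_common, sum(areas_common.values()))
```

Each independently normalized curve has an area of about one, while the weighted curves have areas proportional to the sample sizes and only sum to one together.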

Related

Plotting each Cluster value percentage individually

So I have been working on this problem for a bit and seem to be stuck..so I am asking for some guidance here.
This is my code
from clusteval import clusteval
from sklearn.datasets import make_blobs
import pandas as pd
import matplotlib.pyplot as plt
X, labels = make_blobs(n_samples=50, centers=2, n_features=5, cluster_std=1)
X = abs(X)
X = pd.DataFrame(X, columns=['Feature_1','Feature_2','Feature_3','Feature_4','Feature_5'])
ce = clusteval('kmeans', metric='euclidean', linkage='complete')
results = ce.fit(X)
X['Cluster_labels'] = results['labx']
X.groupby('Cluster_labels').Feature_1.value_counts(normalize=True).plot(kind='bar')
plt.tight_layout()
plt.show()
This produces this image:
This image is really close to what I want, but notice that both clusters show up in the same graph. I would like to produce the same kind of graph for one cluster at a time. Essentially, for every cluster I have, I want a graph like this. So if I had 10 clusters, I would have 10 graphs, each showing the percentage of each value within that cluster and that cluster only.
Any guidance or help is appreciated. Thanks.

I can suggest two alternative plots. Both would benefit from visual refinement (label all axes, clean up underscores, pick nicer font sizes, etc.) but hopefully are useful starting points.
Using pandas:
axes = X.hist('Feature_1', by='Cluster_labels')
for ax in axes:
    ax.set_title('Cluster_labels = ' + ax.get_title())
Using seaborn:
import seaborn as sns
sns.displot(X,
            x='Feature_1',
            col='Cluster_labels',
            binwidth=0.5)
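If one separate figure per cluster is wanted, with the bars showing shares within that cluster only, a plain groupby loop is another option. The sketch below uses a small synthetic DataFrame as a hypothetical stand-in for the clustered data and assumes the feature takes a handful of discrete values; the column names simply mirror the question.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical stand-in for the clustered data: a discrete feature plus a label.
df = pd.DataFrame({
    "Feature_1": rng.integers(0, 5, size=200),
    "Cluster_labels": rng.integers(0, 3, size=200),
})

shares = {}
for label, group in df.groupby("Cluster_labels"):
    # Fraction of each value within this cluster only.
    share = group["Feature_1"].value_counts(normalize=True).sort_index()
    shares[label] = share

    fig, ax = plt.subplots()  # one figure per cluster
    share.plot(kind="bar", ax=ax)
    ax.set_title(f"Cluster {label}")
    ax.set_xlabel("Feature_1")
    ax.set_ylabel("Share within cluster")
    fig.tight_layout()
```

Call plt.show() afterwards to display the figures, or fig.savefig(...) inside the loop to write one file per cluster.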

How to rotate Seaborn heatmap in python?

The default settings of seaborn.heatmap give:
the x-axis starts from an origin of 0 then increases towards the right
the y-axis starts from an origin of 9 then increases upward
This is odd compared to matplotlib.pyplot.pcolormesh, which gives a y-axis that starts from an origin of 0 that moves upward, like what we'd intuitively want since it only makes sense for origins to be (0,0), not (0,9)!
How to make the y-axis of heatmap also start from an origin of 0, instead of 9, moving upward? (while of course re-orienting the data correspondingly)
I tried transposing the input data, but this doesn't look right and the axes don't change. I don't think it's a flip about the y-axis that's needed, but a simple rotating of the heatmap.

You can flip the y-axis using ax.invert_yaxis():
import seaborn as sns
import numpy as np
np.random.seed(0)
sns.set_theme()
uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data)
ax.invert_yaxis()
If you want to do the rotation you describe, you have to transpose the matrix first:
import seaborn as sns
import numpy as np
np.random.seed(0)
sns.set_theme()
uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data.T)
ax.invert_yaxis()
The reason for the difference is that they are assuming different coordinate systems. pcolormesh is assuming that you want to access the elements using cartesian coordinates i.e. [x, y] and it displays them in the way you would expect. heatmap is assuming you want to access the elements using array coordinates i.e. [row, col], so the heatmap it gives has the same layout as if you print the array to the console.
Why do they use different coordinate systems? I would be speculating but I think it's due to the ages of the 2 libraries. matplotlib, particularly its older commands is a port from Matlab, so many of the assumptions are the same. seaborn was developed for Python much later, specifically aimed at statistical visualization, and after pandas was already existent. So I would guess that mwaskom chose the layout to replicate how a DataFrame looks when you print it to the screen.
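The two conventions can be compared side by side with matplotlib alone. The sketch below assumes that the heatmap layout is equivalent to a pcolormesh with an inverted y-axis; that matches the behaviour described above, but treat it as an illustration rather than a statement about seaborn's internals.

```python
import numpy as np
import matplotlib.pyplot as plt

data = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

fig, (ax_cart, ax_array) = plt.subplots(1, 2)

# Cartesian convention: row 0 is drawn at the bottom, y increases upward.
ax_cart.pcolormesh(data)
ax_cart.set_title("row 0 at the bottom")

# Array convention: same mesh, but with the y-axis inverted so that
# row 0 is at the top, like the printed array.
ax_array.pcolormesh(data)
ax_array.invert_yaxis()
ax_array.set_title("row 0 at the top")

cartesian_inverted = ax_cart.yaxis_inverted()
array_inverted = ax_array.yaxis_inverted()
print(cartesian_inverted, array_inverted)
```

Calling invert_yaxis() a second time on the right-hand axes would bring it back to the Cartesian orientation, which is what the first snippet in this answer does to a heatmap.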

You can make the tick labels start from 0 at the lower left by passing yticklabels explicitly. Note that this only relabels the ticks; the data itself is not flipped. Does this fit your question?
import seaborn as sns
import numpy as np
np.random.seed(0)
sns.set_theme()
uniform_data = np.random.rand(10, 12)
ax = sns.heatmap(uniform_data, yticklabels=[9,8,7,6,5,4,3,2,1,0])

Only point or scatter plot allowed in Python: plotting eigenvalues in a loop

I have the following simple code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import linalg
for i in range(10):
    M=np.array([[i**2,0],[0,i**3]]) # 2 x 2 matrix
    eval,evec=np.linalg.eig(M)
    # Plotting first and second eigenvalues
    # Style 1
    plt.plot(i,eval[0])
    plt.plot(i,eval[1])
    # Doesn't work
    # Style 2
    plt.plot(i,eval[0], '-r')
    plt.plot(i,eval[1], '-b')
    # Doesn't work
    # Style 3
    plt.plot(i,eval[0], 'ro-')
    plt.plot(i,eval[1], 'bs')
    # Does work
# Does work
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('plot.png')
plt.show()
While plotting with three different styles, only the third style (i.e. point or scatter plots) works successfully. Hence I have very limited customization options. Any way out?
Also, can these three differently styled plots be saved into three different files without creating separately three for-loops?
Output attached below.

Move the plotting outside the loop where the computation occurs. To draw connected lines, plot needs arrays of values: each plt.plot(i, value) call inside the loop draws a separate one-point line, which is invisible unless a marker is set.
import numpy as np
import matplotlib.pyplot as plt

yvals = []
for i in range(10):
    M = np.array([[i**2, 0], [0, i**3]])  # 2 x 2 matrix
    eval_, evec = np.linalg.eig(M)
    yvals.append(eval_)  # collect both eigenvalues for this i

yvals = np.array(yvals)  # shape (10, 2)
xvals = np.arange(10)
plt.plot(xvals, yvals[:, 0], '-r')  # first eigenvalue as a connected red line
plt.plot(xvals, yvals[:, 1], '-b')  # second eigenvalue as a connected blue line
plt.show()
All of your plotting styles should work now.
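As for saving the three styles into three files without three computation loops: once the eigenvalues are collected, one short loop over style definitions is enough. This is a sketch under a few assumptions: the style names and file names are made up, the styles are expressed as keyword arguments rather than format strings, and the files go to a temporary directory so nothing in the working directory is overwritten.

```python
import tempfile
from pathlib import Path

import numpy as np
import matplotlib.pyplot as plt

# Compute all eigenvalues once.
xvals = np.arange(10)
yvals = np.array([np.linalg.eig(np.diag([i**2, i**3]))[0] for i in xvals])

# Hypothetical style table: one pair of keyword dicts per output file.
styles = {
    "style1": ({}, {}),
    "style2": ({"color": "r", "linestyle": "-"},
               {"color": "b", "linestyle": "-"}),
    "style3": ({"color": "r", "marker": "o", "linestyle": "-"},
               {"color": "b", "marker": "s", "linestyle": "none"}),
}

out_dir = Path(tempfile.mkdtemp())
saved = []
for name, (first_kwargs, second_kwargs) in styles.items():
    fig, ax = plt.subplots()  # a fresh figure per style
    ax.plot(xvals, yvals[:, 0], **first_kwargs)
    ax.plot(xvals, yvals[:, 1], **second_kwargs)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    path = out_dir / f"plot_{name}.png"
    fig.savefig(path)
    plt.close(fig)
    saved.append(path)

print(saved)
```

Replace out_dir with any directory of your choice to keep the files.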

Python/Scipy kde fit, scaling

I have a Series in Python and I'd like to fit a density to its histogram. Question: is there a slick way to use the values from np.histogram() to achieve this result? (see Update below)
My current problem is that the kde fit I perform has (seemingly) unwanted kinks, as depicted in the second plot below. I was hoping for a kde fit that is monotone decreasing based on a histogram, which is the first figure depicted. Below I've included my current code. Thanks in advance
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import gaussian_kde as kde
df[var].hist()
plt.show() # shows the original histogram
density = kde(df[var])
xs = np.arange(0, df[var].max(), 0.1)
ys = density(xs)
plt.plot(xs, ys) # a pdf with kinks
Alternatively, is there a slick way to use
count, div = np.histogram(df[var])
and then scale the count array to apply kde() to it?
Update
Based on cel's comment below (should've been obvious, but I missed it!), I was implicitly under-binning in this case using the default params in pandas.DataFrame.hist(). In the updated plot I used
df[var].hist(bins=100)
I'll leave this post up in case others find it useful but won't mind if it gets taken down as 'too localized' etc.

If you increase the bandwidth using the bw_method parameter, then the kde will look smoother. This example comes from Justin Peel's answer; the code has been modified to take advantage of the bw_method:
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
density1 = gaussian_kde(data)
bandwidth = 1.5
density2 = gaussian_kde(data, bw_method=bandwidth)
xs = np.linspace(0,8,200)
plt.plot(xs,density1(xs), label='bw_method=None')
plt.plot(xs,density2(xs), label='bw_method={}'.format(bandwidth))
plt.legend(loc='best')
plt.show()
yields
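For reference, a scalar bw_method is used directly as the bandwidth factor, which multiplies the sample standard deviation to give the kernel width; with bw_method=None the factor comes from Scott's rule, n**(-1/5) for one-dimensional data. The sketch below only inspects those numbers for the same data; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import gaussian_kde

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8

default_kde = gaussian_kde(data)               # Scott's rule
wide_kde = gaussian_kde(data, bw_method=1.5)   # explicit factor

n = len(data)
scott_factor = n ** (-1 / 5)  # Scott's rule for 1-D data

# Kernel standard deviation = factor * sample standard deviation (ddof=1).
sample_std = np.std(data, ddof=1)
default_sigma = default_kde.factor * sample_std
wide_sigma = wide_kde.factor * sample_std

print(default_kde.factor, scott_factor)
print(default_sigma, wide_sigma)
```

Here the explicit factor of 1.5 is roughly three times the default one, which is why the second curve is so much smoother.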

The problem was under-binning, as mentioned by cel (see the comments above). It was clarifying to set bins=100 in pd.DataFrame.hist(), which defaults to bins=10.
See also:
http://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width
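On the original question of scaling the output of np.histogram so it can be compared with a KDE: passing density=True returns bar heights that already integrate to one, so no manual scaling is needed. A minimal sketch, using exponential draws as a hypothetical stand-in for df[var]:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=5_000)  # stand-in for df[var]

kde = gaussian_kde(values)
areas = {}
for bins in (10, 100):
    # density=True normalizes the counts so that sum(height * width) == 1.
    heights, edges = np.histogram(values, bins=bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    areas[bins] = float(np.sum(heights * np.diff(edges)))
    gap = float(np.max(np.abs(kde(centers) - heights)))
    print(f"bins={bins:3d}  area={areas[bins]:.3f}  max |kde - hist| = {gap:.3f}")
```

Plotting heights against centers for both bin counts next to the KDE curve shows how much of the apparent shape depends on the binning.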

Quantile-Quantile Plot using SciPy

How would you create a qq-plot using Python?
Assume that you have a large set of measurements and are using some plotting function that takes XY-values as input. The function should plot the quantiles of the measurements against the corresponding quantiles of some distribution (normal, uniform...).
The resulting plot then lets us evaluate whether our measurements follow the assumed distribution or not.
http://en.wikipedia.org/wiki/Quantile-quantile_plot
Both R and Matlab provide ready-made functions for this, but I am wondering what the cleanest method for implementing it in Python would be.

Update: As folks have pointed out this answer is not correct. A probplot is different from a quantile-quantile plot. Please see those comments and other answers before you make an error in interpreting or conveying your distributions' relationship.
I think that scipy.stats.probplot will do what you want. See the documentation for more detail.
import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Result

Using qqplot of statsmodels.api is another option:
Very basic example:
import numpy as np
import statsmodels.api as sm
import pylab
test = np.random.normal(0,1, 1000)
sm.qqplot(test, line='45')
pylab.show()
Result:
Documentation and more examples are available in the statsmodels documentation for qqplot.

If you need to do a QQ plot of one sample vs. another, statsmodels includes qqplot_2samples(). Like Ricky Robinson in a comment above, this is what I think of as a QQ plot, as opposed to a probability plot, which compares a sample against a theoretical distribution.
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot_2samples.html
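If statsmodels is not available, the points of a two-sample QQ plot can be computed with NumPy alone by evaluating both samples at the same probabilities. This is a hand-rolled sketch, not the statsmodels implementation; the helper name qq_two_samples and the choice of 100 mid-point probabilities are assumptions.

```python
import numpy as np

def qq_two_samples(a, b, n_quantiles=100):
    """Return matching quantiles of two samples (the sizes may differ)."""
    probs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    return np.quantile(a, probs), np.quantile(b, probs)

rng = np.random.default_rng(0)
a = rng.normal(loc=20, scale=5, size=1000)
b = a + 3.0  # same shape as a, shifted by a constant

qa, qb = qq_two_samples(a, b)
print(qb[:5] - qa[:5])
```

Plotting qb against qa gives the QQ plot; a pure shift like the one above shows up as a line parallel to the diagonal.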

I came up with this. Maybe you can improve it. Especially the method of generating the quantiles of the distribution seems cumbersome to me.
You could replace np.random.normal with any other distribution from np.random to compare data against other distributions.
#!/usr/bin/env python3
import numpy as np

measurements = np.random.normal(loc=20, scale=5, size=100000)

def qq_plot(data, sample_size):
    """Return sorted data (column 0) next to sorted normal draws (column 1)."""
    qq = np.ones([sample_size, 2])
    np.random.shuffle(data)  # shuffles in place, so the slice below is a random subsample
    qq[:, 0] = np.sort(data[0:sample_size])
    qq[:, 1] = np.sort(np.random.normal(size=sample_size))
    return qq

print(qq_plot(measurements, 1000))
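One way to make the quantile generation less cumbersome is to compute the theoretical quantiles exactly instead of drawing random numbers for them. The sketch below swaps in scipy.stats.norm.ppf evaluated at the plotting positions (i - 0.5) / n; that choice of positions is one common convention and an assumption here, as is the helper name normal_qq.

```python
import numpy as np
from scipy import stats

def normal_qq(data):
    """Return (theoretical normal quantiles, ordered data) for a QQ plot."""
    ordered = np.sort(np.asarray(data, dtype=float))
    n = ordered.size
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    theoretical = stats.norm.ppf(probs)      # exact standard normal quantiles
    return theoretical, ordered

rng = np.random.default_rng(0)
measurements = rng.normal(loc=20, scale=5, size=1000)

theoretical, ordered = normal_qq(measurements)
# For normal data the points lie close to a line: slope ~ scale, intercept ~ loc.
slope, intercept = np.polyfit(theoretical, ordered, 1)
print(slope, intercept)
```

Any other distribution in scipy.stats can be used by replacing stats.norm with it, since they all provide ppf.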

To add to the confusion around Q-Q plots and probability plots in the Python and R worlds, this is what the SciPy manual says:
"probplot generates a probability plot, which should not be confused
with a Q-Q or a P-P plot. Statsmodels has more extensive functionality
of this type, see statsmodels.api.ProbPlot."
If you try out scipy.stats.probplot, you'll see that indeed it compares a dataset to a theoretical distribution. Q-Q plots, OTOH, compare two datasets (samples).
R has functions qqnorm, qqplot and qqline. From the R help (Version 3.6.3):
qqnorm is a generic function the default method of which produces a
normal QQ plot of the values in y. qqline adds a line to a
“theoretical”, by default normal, quantile-quantile plot which passes
through the probs quantiles, by default the first and third quartiles.
qqplot produces a QQ plot of two datasets.
In short, R's qqnorm offers the same functionality that scipy.stats.probplot provides with the default setting dist="norm". But the fact that they called it qqnorm and that it's supposed to "produce a normal QQ plot" may easily confuse users.
Finally, a word of warning. These plots don't replace proper statistical testing and should be used for illustrative purposes only.
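To see what probplot actually computes, it can be called without a plot argument, in which case it only returns the numbers. A short sketch on seeded data (the seed and sample size are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
measurements = rng.normal(loc=20, scale=5, size=1000)

# osm: theoretical quantiles of the chosen distribution
# osr: the ordered sample values
# slope, intercept, r: least-squares line through the points and its correlation
(osm, osr), (slope, intercept, r) = stats.probplot(measurements, dist="norm")

print(slope, intercept, r)
```

For normally distributed data the slope and intercept come out close to the scale and location of the sample, and r is close to one.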

It exists now in the statsmodels package:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.graphics.gofplots.qqplot.html

You can use bokeh
from bokeh.plotting import figure, show
from scipy.stats import probplot
# pd_series is the series you want to plot
series1 = probplot(pd_series, dist="norm")
p1 = figure(title="Normal QQ-Plot", background_fill_color="#E8DDCB")
p1.scatter(series1[0][0],series1[0][1], fill_color="red")
show(p1)

import numpy as np
import pylab
import scipy.stats as stats
measurements = np.random.normal(loc = 20, scale = 5, size=100)
stats.probplot(measurements, dist="norm", plot=pylab)
pylab.show()
Here probplot draws the measurements against the normal distribution, which is specified with dist="norm".

How big is your sample? Here is another option to test your data against any distribution, using the OpenTURNS library. In the example below, I generate a sample x of 1,000,000 numbers from a Uniform distribution and test it against a Normal distribution.
You can replace x by your data if you reshape it as x= [[x1], [x2], .., [xn]]
import openturns as ot
x = ot.Uniform().getSample(1000000)
g = ot.VisualTest.DrawQQplot(x, ot.Normal())
g
In my Jupyter Notebook, I see:
If you are writing a script, you can display the graph explicitly:
from openturns.viewer import View
import matplotlib.pyplot as plt
View(g)
plt.show()
