howto get fit parameters from seaborn distplot fit=? - python

I'm using seaborn distplot (data, fit=stats.gamma)
How do I get the fit parameters returned?
Here is an example:
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
df = pd.read_csv ('RequestSize.csv')
import matplotlib.pyplot as plt
reqs = df['12 web pages']
reqs = reqs.dropna()
reqs = reqs[np.logical_and (reqs > np.percentile (reqs, 0), reqs < np.percentile (reqs, 95))]
dist = sns.distplot (reqs, fit=stats.gamma)

Use the object you passed to distplot:
stats.gamma.fit(reqs)

I confirm the above is true - the sns.distplot fit method is equivalent to the fit method in scipy.stats so you can get the parameters from there, e.g.:
from scipy import stats
ax = sns.distplot(e_t_hat, bins=20, kde=False, fit=stats.norm);
plt.title('Distribution of Cointegrating Spread for Brent and Gasoil')
# Get the fitted parameters used by sns
(mu, sigma) = stats.norm.fit(e_t_hat)
print "mu={0}, sigma={1}".format(mu, sigma)
# Legend and labels
plt.legend(["normal dist. fit ($\mu=${0:.2g}, $\sigma=${1:.2f})".format(mu, sigma)])
plt.ylabel('Frequency')
# Cross-check this is indeed the case - should be overlaid over black curve
x_dummy = np.linspace(stats.norm.ppf(0.01), stats.norm.ppf(0.99), 100)
ax.plot(x_dummy, stats.norm.pdf(x_dummy, mu, sigma))
plt.legend(["normal dist. fit ($\mu=${0:.2g}, $\sigma=${1:.2f})".format(mu, sigma),
"cross-check"])

Related

How to get (random) number from Gaussian Mixture model(GMM)?

I want to get a list of random numbers from GMM.
I've plotted a distribution like this.
And I was wondering whether there is some way to get the value from the distribution curve, such as getting number 0.94, 0.96, 0.95.... How to get it?
I am not sure whether the belowing is useful for u or not. I got the code from Minimal reproduction
from matplotlib import rc
from sklearn import mixture
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
import matplotlib.ticker as tkr
import scipy.stats as stats
# x = open("prueba.dat").read().splitlines()
# create the data
x = np.concatenate((np.random.normal(5, 5, 1000),np.random.normal(10, 2, 1000)))
f = np.ravel(x).astype(np.float)
f=f.reshape(-1,1)
g = mixture.GaussianMixture(n_components=3,covariance_type='full')
g.fit(f)
weights = g.weights_
means = g.means_
covars = g.covariances_
plt.hist(f, bins=100, histtype='bar', density=True, ec='red', alpha=0.5)
# f_axis = f.copy().ravel()
# f_axis.sort()
# plt.plot(f_axis,weights[0]*stats.norm.pdf(f_axis,means[0],np.sqrt(covars[0])).ravel(), c='red')
plt.rcParams['agg.path.chunksize'] = 10000
plt.grid()
plt.show()
Thanks!
According to the documentation you can simply do g.sample(n_samples=64) on the fitted model to generate 64 samples

How to use Gaussian Mixture Model (GMM) for peak decomposition?

I have generated some data points as a linear mixture of three 1D bell shape Gaussian distributions with different parameters (mean and variance) using following code:
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
sns.set_style("darkgrid")
%matplotlib inline
from sklearn.mixture import GaussianMixture
x = np.linspace(start=-40,stop=40, num=1000)
y1 = stats.norm.pdf(x, loc=1,scale=1.5) # First Gaussian distribution
y2 = stats.norm.pdf(x, loc=5,scale=2.5) # Second Gaussian distribution
y3 = stats.norm.pdf(x, loc=-15,scale= 10) # Third Gaussian distribution
Y = y1+y2+y3
fig = plt.figure(figsize=(7, 5),dpi=300)
plt.plot(x,y1,lw=2,label='First component')
plt.plot(x,y2,lw=2,label='Second component')
plt.plot(x,y3,lw=2,label='Third component')
plt.plot(x,Y,lw=3,label= 'Linear Mixture')
plt.legend(loc='best',facecolor="white")
plt.show()
I tried to decompose these 3 peaks in a reverse process using sklearn.mixture.GaussianMixture. But it does not return my expected mean and variance of every Gaussian function.
model = GaussianMixture(n_components=3).fit(Y.reshape(-1,1))
print(model.means_)
print(model.covariances_)

Can you plot interquartile range as the error band on a seaborn lineplot?

I'm plotting time series data using seaborn lineplot (https://seaborn.pydata.org/generated/seaborn.lineplot.html), and plotting the median instead of mean. Example code:
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator = np.median, data=fmri)
I want the error bands to show the interquartile range as opposed to the confidence interval. I know I can use ci = "sd" for standard deviation, but is there a simple way to add the IQR instead? I cannot figure it out.
Thank you!
I don't know if this can be done with seaborn alone, but here's one way to do it with matplotlib, keeping the seaborn style. The describe() method conveniently provides summary statistics for a DataFrame, among them the quartiles, which we can use to plot the medians with inter-quartile-ranges.
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
fmri_stats = fmri.groupby(['timepoint']).describe()
x = fmri_stats.index
medians = fmri_stats[('signal', '50%')]
medians.name = 'signal'
quartiles1 = fmri_stats[('signal', '25%')]
quartiles3 = fmri_stats[('signal', '75%')]
ax = sns.lineplot(x, medians)
ax.fill_between(x, quartiles1, quartiles3, alpha=0.3);
You can calculate the median within lineplot like you have done, set ci to be none and fill in using ax.fill_between()
import numpy as np
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(x="timepoint", y="signal", estimator = np.median,
data=fmri,ci=None)
bounds = fmri.groupby('timepoint')['signal'].quantile((0.25,0.75)).unstack()
ax.fill_between(x=bounds.index,y1=bounds.iloc[:,0],y2=bounds.iloc[:,1],alpha=0.1)
This option is possible since version 0.12 of seaborn, see here for the documentation.
pip install --upgrade seaborn
The estimator specifies the point by the name of pandas method or callable, such as 'median' or 'mean'.
The errorbar is an option to plot a distribution spread by a string, (string, number) tuple, or callable. In order to mark the median value and fill the area between the interquartile, you would need the params:
import seaborn as sns; sns.set()
import matplotlib.pyplot as plt
fmri = sns.load_dataset("fmri")
ax = sns.lineplot(data=fmri, x="timepoint", y="signal", estimator=np.median,
errorbar=lambda x: (np.quantile(x, 0.25), np.quantile(x, 0.75)))
You can now!
estimator="median", errobar=("pi",0.5)
https://seaborn.pydata.org/tutorial/error_bars

How to change plot properties of statsmodels qqplot? (Python)

So I am plotting a normal Q-Q plot using statsmodels.graphics.gofplots.qqplot().
The module uses matplotlib.pyplot to create figure instance. It plots the graph well.
However, I would like to plot the markers with alpha=0.3.
Is there a way to do this?
Here is a sample of code:
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
test = np.random.normal(0,1, 1000)
sm.qqplot(test, line='45')
plt.show()
And the output figure:
You can use statsmodels.graphics.gofplots.ProbPlot class which has qqplot method to pass matplotlib pyplot.plot **kwargs.
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
test = np.random.normal(0, 1, 1000)
pp = sm.ProbPlot(test, fit=True)
qq = pp.qqplot(marker='.', markerfacecolor='k', markeredgecolor='k', alpha=0.3)
sm.qqline(qq.axes[0], line='45', fmt='k--')
plt.show()
qqplot returns a figure object which can be used to get the lines which can then be modified using set_alpha
fig = sm.qqplot(test, line='45');
# Grab the lines with blue dots
dots = fig.findobj(lambda x: hasattr(x, 'get_color') and x.get_color() == 'b')
[d.set_alpha(0.3) for d in dots]
Obviously you have a bit of overlap of the dots so even though they have a low alpha value, where they are piled on top of one another they look to be more opaque.

Major Difference in 2D kernel Density Plots: Seaborn and R

I am trying to plot data using the 2D kernel density plot of Seaborn's jointplot function (using statsmodels' KDEMultivariate function to calculate a data-driven bandwidth). I've plotted a 2D kernel density in R using the same data and the result looks very good (using the 'ks' package), while the Seaborn plot looks very very different.
I am using the same exact data and the same exact bandwidth for each (taking the bandwidth given by KDEMultivariant and passing that to the R method).
Here is the input.csv data used: https://app.box.com/s/ot7d36t44wrr85pusp5657pc1w2kf5hj
Below are the code used in each and output images from each.
Python / Seaborn:
import matplotlib.pyplot as plt
import statsmodels.api as sm
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde", stat_func=None, bw=[bw_ml_x.bw, bw_ml_y.bw])
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
sns.plt.show()
Img for Seaborn plot:
The bandwidth given by bw_ml_x.bw and bw_ml_y.bw is placed in a 2 x 2 R matrix H, where H[1,1] = bw_ml_x.bw, H[2,2] = bw_ml.y.bw, and other values set to zero.
R:
library(ks)
fhat <- kde(x=as.data.frame(data[1], data[2]), H=H)
plot(fhat, display="filled.contour2", cont=seq(10,90,by=10))
Img for R plot:
Looking at your Seaborn/Python plot, many of the points cluster along the (0,n) region and the (1,1) region of your space, just as the KDE of the R plot shows. This indicates that Seaborn and R are looking at the same data; we simply need to reformulate the call to the kde in Seaborn in order to visualize the KDE gradients.
If you modify your Python call to match the documentation for Kernel Density Estimation in Seaborn you'll get a proper 2d-kdf out of Python:
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pandas as pd
import seaborn as sns
data = pd.read_csv("input.csv", dtype={'x': float, 'y': float}, skiprows=0)
bw_ml_x = sm.nonparametric.KDEMultivariate(data=data['x'], var_type='c', bw='cv_ml')
bw_ml_y = sm.nonparametric.KDEMultivariate(data=data['y'], var_type='c', bw='cv_ml')
g = sns.jointplot(x='x', y='y', data=data, kind="kde")
g.plot_joint(plt.scatter, c="w")
g.ax_joint.collections[0].set_alpha(0)
sns.plt.show()
This accords with the R plot (though the kernel estimators seem to be slightly different, which would account for the variation in gradients between the plots):

Categories