How does scipy.stats distribution fitting work exactly? - python

I'm interested in the tail distribution of some given data, so I tried using scipy.stats to fit my data to a Gaussian, Generalized extreme value distribution, and a Generalized Pareto distribution.
This is what the data looks like:
Data Histogram
This is what I tried
data=df.loc[:,'X']
v=np.ceil(np.log2(len(data))) + 1 # Sturges' rule: gives an "adequate" number of bins for visualizing the data's distribution
y,x=np.histogram(data,bins=int(v),density=True) #"clustering" our data for the plot
plt.hist(data, bins=11, density=True)
plt.title("Histogram")
plt.show()
x = (x + np.roll(x, -1))[:-1] / 2.0 #This takes the mid point of every "bin" interval as the reference x-axis point for its corresponding y probability
# =============================================================================
# Fitting our data and plotting the PDFs
# =============================================================================
fit1=stats.genextreme.fit(data,floc=0) #The fit method finds the optimal parameters (using MLE) for your data fitting a chosen probability distribution
fit2=stats.norm.fit(data)
fit3=stats.genpareto.fit(data,floc=0)
fit4=stats.weibull_min.fit(data,floc=0)
fit5=stats.exponweib.fit(data,floc=0)
fit6=stats.gumbel_r.fit(data,floc=0)
fit7=stats.gumbel_l.fit(data,floc=0)
....
At first I got some strange results because I didn't fix the location parameter at 0 (floc=0), and I still don't understand exactly why.
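For reference, here is a minimal sketch of the difference between fixing the location and merely suggesting a starting value for it. The data are synthetic (an assumption standing in for df['X']); as far as I understand the fit API, floc=0 pins the location at exactly 0, while loc=0 is only an initial guess that the optimizer may move.

```python
import numpy as np
from scipy import stats

# Synthetic positive data (assumed stand-in for the real column).
data = stats.weibull_min.rvs(1.5, loc=0, scale=2.0, size=500, random_state=0)

# floc=0 holds the location fixed at exactly 0; only shape and scale are estimated.
c_fixed, loc_fixed, scale_fixed = stats.weibull_min.fit(data, floc=0)

# loc=0 is only a starting guess; the location is still a free parameter.
c_free, loc_free, scale_free = stats.weibull_min.fit(data, loc=0)

print("fixed location:", c_fixed, loc_fixed, scale_fixed)
print("free location: ", c_free, loc_free, scale_free)
```

With a free location, three parameters are estimated instead of two, which is one plausible reason the unconstrained fits can look strange on data that start near zero.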
What surprised me the most, though, is that genextreme and weibull_min gave me different results, when I thought Weibull was a special case of the generalized extreme value distribution with a positive shape parameter.
Especially since the Weibull fit seems to work better here.
Here is the Weibull Fit:
Weibull Fit
And this is the GEV Fit:
GEV Fit
Actually the GEV Fit was similar to the Gumbel_r one:
Gumbel_r Fit
I read that one can deduce whether weibull_min or weibull_max should be used just from the shape of the data's histogram. How can one do that?
Since I am interested in extreme positive values (the tail distribution), shouldn't I be using weibull_max, since that's the limiting distribution of the maximum?
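A small numerical check may help with the genextreme versus Weibull confusion. This is a sketch under the assumption that SciPy's documented parameterization is used: genextreme takes a shape c whose sign is opposite to the ξ found in many textbooks, and for c > 0 its CDF is exp(-(1 - c*x)**(1/c)) on x <= 1/c. Under that convention the Weibull-type member coincides with a shifted and scaled weibull_max, not with weibull_min.

```python
import numpy as np
from scipy import stats

c = 0.5                          # SciPy shape; c > 0 is the bounded-above case
x = np.linspace(-3.0, 1.9, 50)   # points inside the support x <= 1/c = 2

gev_cdf = stats.genextreme(c).cdf(x)

# weibull_max with shape 1/c, shifted and scaled by 1/c, gives the same CDF.
wmax_cdf = stats.weibull_max(1.0 / c, loc=1.0 / c, scale=1.0 / c).cdf(x)
same_as_weibull_max = np.allclose(gev_cdf, wmax_cdf)

# weibull_min with the same numbers lives on x >= loc and is a different distribution.
wmin_cdf = stats.weibull_min(1.0 / c, loc=1.0 / c, scale=1.0 / c).cdf(x)
same_as_weibull_min = np.allclose(gev_cdf, wmin_cdf)

print(same_as_weibull_max, same_as_weibull_min)
```

So a genextreme fit and a weibull_min fit are not expected to agree: weibull_min is bounded below and models minima, while the Weibull-type GEV is bounded above.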

Related

Fitting a theoretical distribution to a sampled empirical CDF with scipy stats

I have a plot for the CDF distribution of packet losses. I thus do not have the original data or the CDF model itself but samples from the CDF curve. (The data is extracted from plots published in literature.)
I want to find which distribution and with what parameters offers the closest fit to the CDF samples.
I've seen that Scipy stats distributions offer fit(data) method but all examples apply to raw data points. PDF/CDF is subsequently drawn from the fitted parameters. Using fit with my CDF samples does not give sensible results.
Am I right in assuming that fit() cannot be directly applied to data samples from an empirical CDF?
What alternatives could I use to find a matching known distribution?
I'm not sure exactly what you're trying to do. When you say you have a CDF, what does that mean? Do you have some data points, or the function itself? It would be helpful if you could post more information or some sample data.
If you have some data points and know the distribution, it's not hard to do using scipy. If you don't know the distribution, you could just iterate over all distributions until you find one that works reasonably well.
We can define functions of the form required for scipy.optimize.curve_fit. I.e., the first argument should be x, and then the other arguments are parameters.
I use this function to generate some test data based on the CDF of a normal random variable with a bit of added noise.
import numpy as np
import scipy.optimize
import scipy.stats
n = 100
x = np.linspace(-4,4,n)
f = lambda x,mu,sigma: scipy.stats.norm(mu,sigma).cdf(x)
data = f(x,0.2,1) + 0.05*np.random.randn(n)
Now, use curve_fit to find parameters.
mu,sigma = scipy.optimize.curve_fit(f,x,data)[0]
This gives output
>> mu,sigma
0.1828320963531838, 0.9452044983927278
We can plot the original CDF (orange), noisy data, and fit CDF (blue) and observe that it works pretty well.
Note that curve_fit can take some additional parameters, and that the output gives additional information about how good of a fit the function is.
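To make that last remark concrete, here is a hedged sketch of the extra curve_fit arguments and outputs. The helper name normal_cdf, the seed, and the bound values are my own choices for illustration; p0 is the starting guess, bounds keeps sigma positive, and the square roots of the diagonal of the returned covariance matrix give approximate standard errors.

```python
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(0)
x = np.linspace(-4, 4, 100)

def normal_cdf(x, mu, sigma):
    """Model function in the form curve_fit expects: x first, then parameters."""
    return stats.norm.cdf(x, loc=mu, scale=sigma)

# Noisy samples of a normal CDF, mirroring the test data above.
y = normal_cdf(x, 0.2, 1.0) + 0.05 * rng.standard_normal(x.size)

popt, pcov = optimize.curve_fit(
    normal_cdf, x, y,
    p0=(0.0, 1.0),                               # starting guess for (mu, sigma)
    bounds=([-np.inf, 1e-6], [np.inf, np.inf]),  # keep sigma strictly positive
)
perr = np.sqrt(np.diag(pcov))                    # approximate one-sigma uncertainties
residual_ss = np.sum((y - normal_cdf(x, *popt)) ** 2)

print("estimates:", popt)
print("std errors:", perr)
print("residual sum of squares:", residual_ss)
```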
@tch Thank you for the answer. I read up on the technique and successfully applied it. I wanted to apply the fit to all continuous distributions supported by scipy.stats, so I ended up doing the following:
fitted = []
failed = []
for d in dist_list:
    dist_name = d[0] #fetch the distribution name
    dist_object = getattr(ss, dist_name) #fetch the distribution object
    param_default = d[1] #fetch the default distribution parameters
    # For distributions with only location and scale set those to the default loc=0 and scale=1
    if not param_default:
        param_default = (0,1)
    # Computed parameters of fitted distribution
    try:
        param,cov = curve_fit(dist_object.cdf,data_in,data_out,p0=param_default,method='trf')
        # Only take distributions which do not result in zero covariance as those are not a valid fit
        if np.any(cov):
            fitted.append((dist_name,param))
    # Capture which distributions are not possible to be fitted (variety of reasons)
    except (NotImplementedError,RuntimeError) as e:
        failed.append((dist_name,e))
In the above, the empirical cdf distribution is captured in data_out which holds the sampled cdf values for a range of data_in data points. The list dist_list holds for each distribution in scipy.stats.rv_continuous the name of the distribution as first element and a list of the default parameters as second element. Default parameters I extract from scipy.stats._distr_params.
Some distributions cannot be fitted and raise an error. I keep those in the failed list.
Finally, I generate a list fitted which holds for each successfully fitted distribution the estimated parameters.
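One way to pick a winner from such a list is to rank the candidates by their residual sum of squares against the sampled CDF. The sketch below is only illustrative: the three candidate names, the synthetic data_in/data_out arrays, and the positive-scale bounds are assumptions, and it restricts itself to location-scale distributions so that every candidate takes the same two parameters.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

# Synthetic CDF samples (assumed stand-in for values read off a published plot).
data_in = np.linspace(-4, 4, 60)
data_out = stats.norm.cdf(data_in, loc=0.2, scale=1.0)

candidates = ["norm", "logistic", "gumbel_r"]   # hypothetical short list
results = []
for name in candidates:
    dist = getattr(stats, name)
    try:
        params, _ = curve_fit(
            dist.cdf, data_in, data_out,
            p0=(0.0, 1.0),                               # loc, scale
            bounds=([-np.inf, 1e-6], [np.inf, np.inf]),  # keep scale positive
        )
    except (RuntimeError, ValueError):
        continue                                         # skip fits that do not converge
    sse = float(np.sum((dist.cdf(data_in, *params) - data_out) ** 2))
    results.append((name, sse, params))

# Smallest residual sum of squares first.
ranking = sorted(results, key=lambda item: item[1])
for name, sse, params in ranking:
    print(f"{name:10s} SSE={sse:.3e} params={np.round(params, 3)}")
```

Comparing raw residuals like this favors distributions with more parameters, so for a list with mixed parameter counts a penalized criterion would probably be fairer.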

Pareto distribution and whether a chart conforms to it

I have a figure as shown below, and I want to know whether or not it conforms to the Pareto distribution. It's a cumulative plot.
And, I want to find out the point in x axis which marks the point for the 80-20 rule, i.e the x-axis point which bifurcates the plot into 20 percent having 80 percent of the wealth.
Also, I'm really confused by the scipy.stats Pareto function, would be great if someone can give some intuitive explanation on that, since the documentation is pretty confusing.
scipy.stats.pareto is a distribution object; its rvs method draws random samples from the Pareto distribution.
To know whether your distribution conforms to a Pareto distribution, you should perform a Kolmogorov-Smirnov test.
Draw a random sample from the Pareto distribution using pareto.rvs(shape, size=1000), where shape is the estimated shape parameter of your Pareto distribution, and use scipy.stats.kstest to perform the test:
pareto_smp = pareto.rvs(shape, size=1000)
D, p_value = scipy.stats.kstest(pareto_smp, values)
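An alternative that avoids drawing a second random sample is a one-sample test against the fitted CDF. This is a sketch with synthetic data (the values array and the true shape 2.5 are assumptions). Note that estimating the parameters from the same data tends to make the reported p-value optimistic.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the observed values.
values = stats.pareto.rvs(2.5, size=2000, random_state=1)

# Estimate shape and scale with the location held at 0.
b_hat, loc_hat, scale_hat = stats.pareto.fit(values, floc=0)

# One-sample Kolmogorov-Smirnov test against the fitted Pareto CDF.
D, p_value = stats.kstest(values, "pareto", args=(b_hat, loc_hat, scale_hat))

print("shape:", b_hat, "scale:", scale_hat)
print("KS statistic:", D, "p-value:", p_value)
```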
Nobody can determine just by inspection whether an observed dataset follows a particular distribution. Based on your situation, this is what you need:
Fit an empirical distribution using statsmodels' ECDF (statsmodels.distributions.empirical_distribution.ECDF).
Then compare it (nonparametrically) with your data to see whether the null hypothesis can be rejected.
For the 20/80 rule:
Rescale your X to the range [0, 1] and simply pick 0.2 on the x-axis.
source: https://arxiv.org/pdf/1306.0100.pdf
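For the 80-20 part, a short sketch may help. It relies on the standard Lorenz-curve result for a Pareto law with shape alpha > 1, namely that the top fraction p holds a share p**(1 - 1/alpha) of the total. The function names are my own, and the empirical version simply sums the largest values.

```python
import numpy as np

def top_share(alpha, p):
    """Theoretical share of the total held by the top fraction p (Pareto, alpha > 1)."""
    return p ** (1.0 - 1.0 / alpha)

def empirical_top_share(values, p):
    """Share of the total held by the largest ceil(p * n) observations."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    k = int(np.ceil(p * v.size))
    return v[:k].sum() / v.sum()

alpha_80_20 = np.log(5) / np.log(4)        # about 1.16, the shape behind the 80-20 rule
print(top_share(alpha_80_20, 0.2))         # share held by the top 20% under that shape
print(top_share(2.0, 0.2))                 # a lighter tail concentrates less
print(empirical_top_share([1, 1, 1, 1, 6], 0.2))
```

If the empirical share of the top 20% in your data is far from 0.8, the 80-20 rule does not hold for it, whether or not a Pareto law fits.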

Very low p-values in Python Kolmogorov-Smirnov Goodness of Fit Test

I have a set of data and fit the corresponding histogram by a lognormal distribution.
I first calculate the optimal parameters for the lognormal function, and then plot the histogram and the lognormal function. This gives quite good results:
import numpy as np
import scipy as sp
import scipy.stats  # makes sp.stats available
import matplotlib.pyplot as plt
num_data = len(data)
x_axis = np.linspace(min(data), max(data), num_data)
number_of_bins = 240
# density=False replaces the removed normed=False argument
histo, bin_edges = np.histogram(data, number_of_bins, density=False)
shape, location, scale = sp.stats.lognorm.fit(data)
plt.hist(data, number_of_bins, density=False)
# the scaling factor scales the normalized lognormal function up to the size
# of the histogram:
scaling_factor = len(data)*(max(data)-min(data))/number_of_bins
plt.plot(x_axis, scaling_factor*sp.stats.lognorm.pdf(x_axis, shape, location, scale), 'r-')
# adjust the axes dimensions:
plt.axis([bin_edges[0]-10, bin_edges[-1]+10, 0, histo.max()*1.1])
However, when performing the Kolmogorov-Smirnov test on the data versus the fitting function, I get way too low p-values (of the order of e-32):
lognormal_ks_statistic, lognormal_ks_pvalue = sp.stats.kstest(
    data,
    lambda k: sp.stats.lognorm.cdf(k, shape, location, scale),
    args=(),
    N=len(data),
    alternative='two-sided',
    mode='approx')
print(lognormal_ks_statistic)
print(lognormal_ks_pvalue)
This is not normal, since we see from the plot that the fitting is quite accurate... does anybody know where I made a mistake?
Thanks a lot!!
Charles
This simply means that your data isn't exactly log-normal. Based on the histogram, you have a lot of data points for the K-S test to use. This means that if your data is even slightly different from what would be expected under a log-normal distribution with those parameters, the K-S test will indicate that the data isn't drawn from a log-normal.
Where is the data from? If it is from an organic source, or any source other than specifically drawing random numbers from a lognormal distribution, I would expect an extremely small p-value, even if the fit looks great. This certainly isn't a problem, though, as long as the fit is sufficiently good for your purposes.
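The sample-size effect is easy to reproduce. In the sketch below (synthetic data, my own helper name), gamma-distributed values stand in for data that look roughly lognormal in a histogram but are not exactly lognormal. The lognormal parameters are the closed-form maximum-likelihood estimates with the location fixed at 0, which is an assumption that differs from the three-parameter fit in the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def ks_against_fitted_lognormal(sample):
    """K-S test of a sample against a lognormal fitted with location fixed at 0."""
    logs = np.log(sample)
    shape, scale = logs.std(), np.exp(logs.mean())   # MLE when loc = 0
    return stats.kstest(sample, "lognorm", args=(shape, 0.0, scale))

# Same underlying (non-lognormal) distribution, two very different sample sizes.
small = rng.gamma(shape=4.0, size=200)
large = rng.gamma(shape=4.0, size=50_000)

stat_small, p_small = ks_against_fitted_lognormal(small)
stat_large, p_large = ks_against_fitted_lognormal(large)

print(f"n=200:   D={stat_small:.3f}, p={p_small:.3g}")
print(f"n=50000: D={stat_large:.3f}, p={p_large:.3g}")
```

The K-S statistic D stays small in both cases, so it is often more informative than the p-value when the question is whether the fit is good enough for practical purposes.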

How to infer the parameters of a 1D gaussian distribution using PyMC?

I'm pretty new to PyMC, and I'm trying desperately to infer the parameters of an underlying Gaussian distribution that best fits a distribution of observed data that I have, not with a pre-built normal distribution, but with a more general method that uses histograms of the simulated data to build pdfs. But so far I can't get my code to converge, and I don't know why...
So here's a summary of what my code does.
I have a dataset of 5000 points distributed normally (mean=5,sigma=2). I want to retrieve these values (mean, sigma) with a bayesian inference (using MCMC).
I have a data simulator that generates for each iteration of the MCMC process a normal distribution of 5000 points with a random mean and sigma (uniform prior)
From the simulated distribution of points I compute a numpy histogram normed to 1 representing the pdf of the distribution (Nbins=int(sqrt(5000))). I then compute the mean and standard deviation of this distribution.
What I want is the set of parameters that will allow me to build a simulated distribution that best fits the observed data.
I use the most general definition of the log likelihood, that is:
ln L(θ|x)=∑ln(f(xi|θ)) (the likelihood function being defined as the probability distribution of the observed data given the parameters θ)
Then I interpolate linearly the histogram values for every bin center. Therefore I have a continuous pdf for the simulated distribution. So here f is the interpolated function I made from the histogram of the simulation.
I sum the log(f(xi)) contributions for every (real) data point and return the loglikelihood value at the end.
But some (real) data points are so far off the mean of the simulated distribution that f(xi)=0. For these points the code raises a math domain error (Reminder: log(0)=-inf). So I artificially set the pdf to a small epsilon for the points where it's usually set to 0.
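Independently of PyMC, the likelihood described in the steps above can be sketched in plain NumPy. The function name, the seed, and the value of eps are my own choices, and np.interp is used here for the linear interpolation rather than the spline class in my actual code.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(5.0, 2.0, 5000)   # the "real" data

def simulated_loglike(mean, sigma, observed, rng, n_sim=5000, eps=1e-12):
    """Log-likelihood of observed data under a pdf built from a simulated histogram."""
    sim = rng.normal(mean, sigma, n_sim)
    hist, edges = np.histogram(sim, bins=int(np.sqrt(n_sim)), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Linear interpolation between bin centers; zero outside the simulated range.
    pdf = np.interp(observed, centers, hist, left=0.0, right=0.0)
    # Floor the pdf at eps so that log(0) never occurs.
    return np.sum(np.log(np.maximum(pdf, eps)))

ll_true = simulated_loglike(5.0, 2.0, observed, rng)
ll_off = simulated_loglike(8.0, 2.0, observed, rng)
print(ll_true, ll_off)
```

Because each call draws a fresh simulation, the value is noisy even for fixed parameters, which is worth keeping in mind when an MCMC sampler compares successive evaluations.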
But here's the thing. The loglikelihood is not computed for every iteration. And actually it is not computed at all, in the present architecture of my code. So that's why the MCMC process is not converging. But... I don't know why.
It turns out that building custom likelihood functions does not seem to be a very common approach in the PyMC community, where people usually prefer to use pre-built distributions. I'm having trouble finding help on these matters, so ideas and suggestions will be deeply appreciated :)
import numpy as np
import matplotlib.pyplot as plt
import math
import pymc as pm
from scipy.interpolate import InterpolatedUnivariateSpline
# Generate the data
np.random.seed(0)
N=5000
true_mean=5.
true_sigma = 2.
data = np.random.normal(true_mean,true_sigma,N)
#prior
m=pm.Uniform('m', lower=4, upper=6)
s=pm.Uniform('s', lower=1, upper=3)
@pm.deterministic
def data_simulator(mean_input=m,sig_input=s):
    out=np.empty(4,dtype=object)
    datasim = np.random.normal(mean_input,sig_input,N)
    hist, bin_edges = np.histogram(datasim, bins=int(math.sqrt(len(datasim))), density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:])/2
    m_sim=np.mean(datasim)
    s_sim=np.std(datasim)
    out[0]=m_sim
    out[1]=s_sim
    out[2]=bin_centers
    out[3]=hist
    return out
@pm.stochastic(observed=True)
def logp(value=data,mean_output=data_simulator.value[0],sigma_output=data_simulator.value[1],bin_centers_sim=data_simulator.value[2],hist_sim=data_simulator.value[3]):
    interp_sim=InterpolatedUnivariateSpline(bin_centers_sim,hist_sim,k=1,ext=0) #returns the extrapolated values
    logp=np.sum(np.log(interp_sim(value)))
    print 'logp=',logp
    return logp
model = pm.Model({"mean": m,"sigma":s,"data_simulator":data_simulator,"loglikelihood":loglikelihood})
#Run the MCMC sampler
mcmc = pm.MCMC(model)
mcmc.sample(iter=10000, burn=5000)
#Plot the marginals
pm.Matplot.plot(mcmc)

Fitting a histogram with skewed gaussian

I want to fit histograms with a skewed gaussian.
I take my data from a text file:
rate, err = loadtxt('hist.dat', unpack = True)
and then plot them as a histogram:
plt.hist(rate, bins= 128)
This histogram has a skewed gaussian shape, that I would like to fit.
I can do it with a simple gaussian, because scipy has the function included, but not with a skewed. How can I proceed?
Ideally, the fit would also return a goodness-of-fit measure.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful. This has a Skewed Gaussian model built in. Your problem might be as simple as
from lmfit.models import SkewedGaussianModel
xvals, yvals = read_your_histogram()
model = SkewedGaussianModel()
# set initial parameter values
params = model.make_params(amplitude=10, center=0, sigma=1, gamma=0)
# adjust parameters to best fit data.
result = model.fit(yvals, params, x=xvals)
print(result.fit_report())
pylab.plot(xvals, yvals)
pylab.plot(xvals, result.best_fit)
This will report the values and uncertainties for the parameters amplitude, center, sigma (for the normal Gaussian), and gamma, the skewness factor.
There are several answers out there for using the .fit() method of scipy.stats.skewnorm, but in my experience that method is not robust: it does accept starting guesses (positional shape arguments plus the loc and scale keywords), yet the result remains sensitive to them. This lmfit package is better, but I will add that a non-zero baseline may still throw it off. To get it to work on my particular dataset, I first used scipy.optimize.curve_fit with an ordinary Gaussian, which was the quickest way to get the baseline, then subtracted it and refit with lmfit to get the skew.
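For comparison, here is a hedged sketch of the scipy.stats.skewnorm route on raw (unbinned) values, with explicit starting guesses and a Kolmogorov-Smirnov statistic as a rough goodness-of-fit measure. The synthetic rate array and the starting values are assumptions. As far as I know, fit takes starting shape values positionally and loc/scale starting values as keywords.

```python
import numpy as np
from scipy import stats

# Synthetic skewed sample (assumed stand-in for the 'rate' column).
rate = stats.skewnorm.rvs(4.0, loc=0.0, scale=1.0, size=5000, random_state=0)

# Maximum-likelihood fit; 1.0 is the starting skewness, loc/scale are starting guesses.
a_hat, loc_hat, scale_hat = stats.skewnorm.fit(
    rate, 1.0, loc=rate.mean(), scale=rate.std()
)

# One-sample K-S test against the fitted distribution.
D, p_value = stats.kstest(rate, "skewnorm", args=(a_hat, loc_hat, scale_hat))

print("skewness a:", a_hat, "loc:", loc_hat, "scale:", scale_hat)
print("KS statistic:", D, "p-value:", p_value)
```

This fits the sample itself rather than the binned histogram, so it does not handle a non-zero baseline. For that case the curve_fit or lmfit approach on bin counts is probably the better tool.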
