I want to fit a histogram with a skewed Gaussian.
I take my data from a text file:
rate, err = loadtxt('hist.dat', unpack = True)
and then plot them as a histogram:
plt.hist(rate, bins=128)
This histogram has a skewed gaussian shape, that I would like to fit.
I can fit it with a simple Gaussian, because scipy includes that function, but not with a skewed one. How can I proceed?
Ideally the fit would also return a goodness-of-fit measure.
You might find lmfit (http://lmfit.github.io/lmfit-py/) useful. This has a Skewed Gaussian model built in. Your problem might be as simple as
from lmfit.models import SkewedGaussianModel
import pylab

xvals, yvals = read_your_histogram()

model = SkewedGaussianModel()

# set initial parameter values
params = model.make_params(amplitude=10, center=0, sigma=1, gamma=0)

# adjust parameters to best fit the data
result = model.fit(yvals, params, x=xvals)
print(result.fit_report())

pylab.plot(xvals, yvals)
pylab.plot(xvals, result.best_fit)
pylab.show()
This will report the values and uncertainties for the parameters amplitude, center, sigma (for the normal Gaussian), and gamma, the skewness factor.
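If you need the individual numbers programmatically, they can be read back from the result object (a small sketch; lmfit stores the fitted parameters on result.params):

# pull one parameter's value and uncertainty out of the fit result
gamma = result.params['gamma'].value
gamma_err = result.params['gamma'].stderr
print(gamma, gamma_err)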
There are several answers out there that use the .fit() method of scipy.stats.skewnorm, but that method is hard to steer with initial parameters and is not robust. This lmfit package works better, but I will add that a non-zero baseline may still throw it off. To get it to work on my particular dataset, I first used scipy.optimize.curve_fit with an ordinary Gaussian, which was the quickest way to estimate the baseline, then subtracted the baseline and refit with lmfit to get the skew.
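Roughly, that two-step idea looks like this (a sketch, not the exact code used; gauss_with_offset is a hypothetical helper, and xvals/yvals are assumed to be the histogram bin centers and counts):

import numpy as np
from scipy.optimize import curve_fit
from lmfit.models import SkewedGaussianModel

# hypothetical helper: ordinary Gaussian plus a constant baseline
def gauss_with_offset(x, a, x0, sigma, c):
    return a * np.exp(-(x - x0)**2 / (2 * sigma**2)) + c

popt, pcov = curve_fit(gauss_with_offset, xvals, yvals,
                       p0=[yvals.max(), xvals[np.argmax(yvals)], 1.0, yvals.min()])
baseline = popt[3]

# subtract the estimated baseline, then refit with the skewed model
model = SkewedGaussianModel()
params = model.make_params(amplitude=popt[0], center=popt[1], sigma=popt[2], gamma=0)
result = model.fit(yvals - baseline, params, x=xvals)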
I'm interested in the tail distribution of some given data, so I tried using scipy.stats to fit my data to a Gaussian, Generalized extreme value distribution, and a Generalized Pareto distribution.
This is what the data looks like:
Data Histogram
This is what I tried
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

data = df.loc[:, 'X']

# Sturges' rule: a common formula for an "adequate" number of bins
v = np.ceil(np.log2(len(data))) + 1

# bin the data for the plot
y, x = np.histogram(data, bins=int(v), density=True)

plt.hist(data, bins=11, density=True)
plt.title("Histogram")
plt.show()

# take the midpoint of every bin as the x-axis reference point for its y probability
x = (x + np.roll(x, -1))[:-1] / 2.0

# =============================================================================
# Fitting our data and plotting the PDFs
# =============================================================================
# fit() finds the optimal parameters (via MLE) for the chosen distribution
fit1 = stats.genextreme.fit(data, floc=0)
fit2 = stats.norm.fit(data)
fit3 = stats.genpareto.fit(data, floc=0)
fit4 = stats.weibull_min.fit(data, floc=0)
fit5 = stats.exponweib.fit(data, floc=0)
fit6 = stats.gumbel_r.fit(data, floc=0)
fit7 = stats.gumbel_l.fit(data, floc=0)
....
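The fit plots below were produced by overlaying each fitted PDF on the histogram, roughly like this (a sketch showing the pattern for the weibull_min fit; the other distributions are analogous):

# overlay a fitted PDF on the histogram for visual comparison
plt.hist(data, bins=11, density=True, alpha=0.5)
plt.plot(x, stats.weibull_min.pdf(x, *fit4), 'r-', label='weibull_min fit')
plt.legend()
plt.show()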
At first I got some strange results because I hadn't fixed the location parameter to 0; I still don't fully understand why.
What surprised me the most, though, is that genextreme and weibull_min gave me different results, when I thought Weibull was a special case of the generalized extreme value distribution with positive shape parameter.
Especially since the Weibull fit seems to work better here.
Here is the Weibull Fit:
Weibull Fit
And this is the GEV Fit:
GEV Fit
Actually the GEV Fit was similar to the Gumbel_r one:
Gumbel_r Fit
I read that one can deduce whether weibull_min or weibull_max should be used just from the shape of the data's histogram. How can one do that?
Since I am interested in extreme positive values (the tail distribution), shouldn't I be using weibull_max, since that's the limiting distribution of the maximum?
I have a plot of the CDF of packet losses. I thus do not have the original data or the CDF model itself, only samples from the CDF curve. (The data is extracted from plots published in the literature.)
I want to find which distribution and with what parameters offers the closest fit to the CDF samples.
I've seen that scipy.stats distributions offer a fit(data) method, but all the examples apply it to raw data points, and the PDF/CDF is subsequently drawn from the fitted parameters. Using fit with my CDF samples does not give sensible results.
Am I right in assuming that fit() cannot be directly applied to data samples from an empirical CDF?
What alternatives could I use to find a matching known distribution?
I'm not sure exactly what you're trying to do. When you say you have a CDF, what does that mean? Do you have some data points, or the function itself? It would be helpful if you could post more information or some sample data.
If you have some data points and know the distribution, it's not hard to do with scipy. If you don't know the distribution, you could just iterate over all distributions until you find one which works reasonably well.
We can define functions of the form required for scipy.optimize.curve_fit. I.e., the first argument should be x, and then the other arguments are parameters.
I use this function to generate some test data based on the CDF of a normal random variable with a bit of added noise.
import numpy as np
import scipy.stats
import scipy.optimize

n = 100
x = np.linspace(-4, 4, n)
f = lambda x, mu, sigma: scipy.stats.norm(mu, sigma).cdf(x)

# test data: CDF of a normal random variable plus a bit of noise
data = f(x, 0.2, 1) + 0.05*np.random.randn(n)
Now, use curve_fit to find parameters.
mu,sigma = scipy.optimize.curve_fit(f,x,data)[0]
This gives output
>>> mu, sigma
(0.1828320963531838, 0.9452044983927278)
We can plot the original CDF (orange), the noisy data, and the fitted CDF (blue) and observe that it works pretty well.
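A minimal sketch of that comparison plot:

import matplotlib.pyplot as plt

plt.plot(x, data, '.', label='noisy data')
plt.plot(x, f(x, 0.2, 1), label='original CDF')
plt.plot(x, f(x, mu, sigma), label='fitted CDF')
plt.legend()
plt.show()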
Note that curve_fit can take some additional parameters, and that the output gives additional information about how good of a fit the function is.
@tch Thank you for the answer. I read up on the technique and successfully applied it. I wanted to apply the fit to all continuous distributions supported by scipy.stats, so I ended up doing the following:
import numpy as np
import scipy.stats as ss
from scipy.optimize import curve_fit

fitted = []
failed = []
for d in dist_list:
    dist_name = d[0]                      # fetch the distribution name
    dist_object = getattr(ss, dist_name)  # fetch the distribution object
    param_default = d[1]                  # fetch the default distribution parameters
    # for distributions with only location and scale, default to loc=0 and scale=1
    if not param_default:
        param_default = (0, 1)
    # compute the parameters of the fitted distribution
    try:
        param, cov = curve_fit(dist_object.cdf, data_in, data_out, p0=param_default, method='trf')
        # only keep distributions whose covariance is not all zero;
        # an all-zero covariance indicates the fit is not valid
        if np.any(cov):
            fitted.append((dist_name, param))
    # capture distributions that cannot be fitted (for a variety of reasons)
    except (NotImplementedError, RuntimeError) as e:
        failed.append((dist_name, e))
In the above, the empirical CDF is captured in data_out, which holds the sampled CDF values for the range of data points in data_in. The list dist_list holds, for each distribution in scipy.stats.rv_continuous, the name of the distribution as the first element and a list of its default parameters as the second element. The default parameters are extracted from scipy.stats._distr_params.
Some distributions cannot be fitted and raise an error; I keep those in the failed list.
Finally, I build a list fitted which holds, for each successfully fitted distribution, the estimated parameters.
I have a database of features, a 2D np.array (2000 samples, each containing 100 features, i.e. 2000 x 100). I want to fit Gaussian distributions to my database using Python. My code is the following:
data = load_my_data() # loads a np.array with size 2000x200
clf = mixture.GaussianMixture(n_components= 50, covariance_type='full')
clf.fit(data)
I am not sure about the parameters, for example covariance_type, or how I can check whether the fit succeeded or not.
EDIT: I debugged the code to inspect clf.means_ and apparently it produced a matrix of shape n_components x size_of_features (50 x 20). Is there a way I can check that the fitting was successful, or plot the data? What are the alternatives to Gaussian mixtures (mixtures of exponentials, for example; I cannot find any available implementation)?
I think you are using the sklearn package.
Once you have fit the model, type
print(clf.means_)
If it prints the means, the model has been fitted; if it raises an error, it has not.
Hope this helps you.
You can do dimensionality reduction using PCA to 3D space (say) and then plot the means and the data, as sketched below.
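A rough sketch of that idea (assuming data and clf from the question; here projecting to 2D for a simple scatter plot):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# project the data and the fitted component means into the same 2D PCA space
pca = PCA(n_components=2)
data_2d = pca.fit_transform(data)
means_2d = pca.transform(clf.means_)

plt.scatter(data_2d[:, 0], data_2d[:, 1], s=5, alpha=0.3, label='data')
plt.scatter(means_2d[:, 0], means_2d[:, 1], c='red', marker='x', label='component means')
plt.legend()
plt.show()

After fitting, you can also check clf.converged_ as a quick sanity check that the EM optimization actually converged.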
It is always preferable to narrow down the set of candidate distributions before trying to identify the best one (in other words, use a Cullen & Frey graph to reject the unlikely candidates), and then run goodness-of-fit tests to select the best result.
You can just create a list of all available distributions in scipy. An example with two distributions and random data:
import numpy as np
import scipy.stats as st

data = np.random.random(10000)

# specify all candidate distributions here
distributions = [st.laplace, st.norm]

mles = []
for distribution in distributions:
    pars = distribution.fit(data)
    mle = distribution.nnlf(pars, data)  # negative log-likelihood of the fit
    mles.append(mle)

results = [(distribution.name, mle) for distribution, mle in zip(distributions, mles)]

best_fit = sorted(zip(distributions, mles), key=lambda d: d[1])[0]
print('Best fit reached using {}, MLE value: {}'.format(best_fit[0].name, best_fit[1]))
I understand you may want to do a regression between two different distributions rather than fit them to an analytic curve. If that is the case, you may be interested in plotting one against the other, doing a linear (or polynomial) regression, and checking the coefficients.
A linear regression of the two distributions can then tell you whether they are linearly related or not.
Linear Regression using Scipy documentation
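A rough sketch of that idea (hypothetical example: regress the sorted values of one sample against the sorted values of another, Q-Q style, using scipy.stats.linregress):

import numpy as np
import scipy.stats as st

sample_a = np.sort(np.random.normal(size=1000))
sample_b = np.sort(np.random.laplace(size=1000))

# regress one set of order statistics against the other and inspect the fit
res = st.linregress(sample_a, sample_b)
print(res.slope, res.intercept, res.rvalue**2)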
I have a set of data and fit the corresponding histogram with a lognormal distribution.
I first calculate the optimal parameters for the lognormal function, and then plot the histogram and the lognormal function. This gives quite good results:
import scipy as sp
import scipy.stats          # make sp.stats available
import numpy as np
import matplotlib.pyplot as plt

num_data = len(data)
x_axis = np.linspace(min(data), max(data), num_data)
number_of_bins = 240
histo, bin_edges = np.histogram(data, number_of_bins, density=False)

shape, location, scale = sp.stats.lognorm.fit(data)

plt.hist(data, number_of_bins, density=False)

# the scaling factor scales the normalized lognormal PDF up to the size
# of the histogram:
scaling_factor = len(data)*(max(data)-min(data))/number_of_bins
plt.plot(x_axis, scaling_factor*sp.stats.lognorm.pdf(x_axis, shape, location, scale), 'r-')

# adjust the axes dimensions:
plt.axis([bin_edges[0]-10, bin_edges[-1]+10, 0, histo.max()*1.1])
However, when performing the Kolmogorov-Smirnov test on the data versus the fitted function, I get way too low p-values (of the order of 1e-32):
lognormal_ks_statistic, lognormal_ks_pvalue = sp.stats.kstest(
    data,
    lambda k: sp.stats.lognorm.cdf(k, shape, location, scale),
    args=(),
    N=len(data),
    alternative='two-sided',
    mode='approx')
print(lognormal_ks_statistic)
print(lognormal_ks_pvalue)
This doesn't seem right, since we can see from the plot that the fit is quite accurate... does anybody know where I made a mistake?
Thanks a lot!!
Charles
This simply means that your data isn't exactly log-normal. Based on the histogram, you have a lot of data points for the K-S test to use. This means that if your data is even slightly different from what would be expected from a log-normal distribution with those parameters, the K-S test will indicate the data isn't drawn from log-normal.
Where is the data from? If it is from an organic source, or any source other than specifically drawing random numbers from a lognormal distribution, I would expect an extremely small p-value, even if the fit looks great. This certainly isn't a problem, though, as long as the fit is sufficiently good for your purposes.
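To illustrate the point, here is a small hedged sketch with hypothetical data: draw a large sample that is only slightly perturbed relative to a lognormal, fit it, and run the K-S test. Even though the histogram and the fitted PDF would look nearly identical by eye, with this many points a tiny systematic deviation drives the p-value toward zero.

import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
n = 100000
# lognormal sample with a small additive perturbation
sample = st.lognorm.rvs(0.5, size=n, random_state=rng) + 0.05 * rng.standard_normal(n)

shape, loc, scale = st.lognorm.fit(sample)
stat, pvalue = st.kstest(sample, lambda k: st.lognorm.cdf(k, shape, loc, scale))
print(stat, pvalue)  # the statistic stays small, yet the p-value is typically minuscule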
I want to fit an array of data (in the program called "data", of size "n") with a Gaussian function and I want to get the estimations for the parameters of the curve, namely the mean and the sigma. Is the following code, which I found on the Web, a fast way to do that? If so, how can I actually get the estimated values of the parameters?
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

y = data
n = len(y)            # the number of data points
x = np.arange(n)

# initial guesses: weighted mean and width of the data
mean = np.sum(x*y)/np.sum(y)
sigma = np.sqrt(np.sum(y*(x-mean)**2)/np.sum(y))

def gaus(x, a, x0, sigma, c):
    return a*np.exp(-(x-x0)**2/(sigma**2)) + c

popt, pcov = curve_fit(gaus, x, y, p0=[1, mean, sigma, 0.0])
print(popt)
print(pcov)

plt.plot(x, y, 'b+:', label='data')
plt.plot(x, gaus(x, *popt), 'ro:', label='fit')
plt.legend()
plt.title('Fig. 3 - Fit')
plt.xlabel('q')
plt.ylabel('data')
plt.show()
To answer your first question, "Is the following code, which I found on the Web, a fast way to do that?"
The code that you have is in fact the right way to proceed with fitting your data, when you believe it is Gaussian and you know the fitting function, except that you could change the return function to
a*exp(-(x-x0)**2/(sigma**2))
since I believe that for a Gaussian function you don't need the constant c parameter.
A common use of least-squares minimization is curve fitting, where one has a parametrized model function meant to explain some phenomena and wants to adjust the numerical values for the model to most closely match some data. With scipy, such problems are commonly solved with scipy.optimize.curve_fit.
To answer your second question, "If so, how can I actually get the estimated values of the parameters?"
You can go to the link provided for scipy.optimize.curve_fit and find that the best-fit parameters reside in your popt variable. In your example, popt will contain the mean and sigma of your data. In addition to the best-fit parameters, pcov will contain the covariance matrix, which carries the uncertainties on your mean and sigma. To obtain the 1-sigma standard deviations, take the square root of the diagonal of the covariance matrix, i.e. np.sqrt(np.diag(pcov)).
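For example, a minimal sketch (continuing from the popt and pcov of the fit above):

# 1-sigma uncertainties on a, x0, sigma and c from the covariance matrix
perr = np.sqrt(np.diag(pcov))
a_fit, x0_fit, sigma_fit, c_fit = popt
print("mean  = {:.3f} +/- {:.3f}".format(x0_fit, perr[1]))
print("sigma = {:.3f} +/- {:.3f}".format(sigma_fit, perr[2]))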