I'm trying to fit a uniform distribution to a set of data. This is what I have tried, based on normal-distribution fitting. I'm not sure whether this implementation is correct. Can you please advise?
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import uniform
# data: 1-D array of samples to fit
mu, std = uniform.fit(data)
plt.hist(data, density=True, alpha=0.6, color='#6495ED')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = uniform.pdf(x, mu, std)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f, std = %.2f" % (mu, std)
plt.title(title)
plt.show()
That's generally right. Note, however, that the parameters of the uniform distribution are general location and scale parameters (specifically, the lower boundary and the width, respectively) and should not be named mu and std, which are specific to the normal distribution. That doesn't affect the correctness of the code, just its readability.
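For example, a minimal sketch with clearer names (assuming data is your 1-D array of samples):
from scipy.stats import uniform
# uniform.fit returns (loc, scale): the lower boundary and the width
loc, scale = uniform.fit(data)
print("lower = %.2f, width = %.2f, upper = %.2f" % (loc, scale, loc + scale))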
I would use OpenTURNS's UniformFactory: the build method returns a distribution which has a drawPDF method.
import openturns as ot
data = [1., 2., 3., 4., 5., 6., 7., 8.]
sample = ot.Sample(data, 1)  # wrap the flat list as a sample of dimension 1
distribution = ot.UniformFactory().build(sample)
graph = distribution.drawPDF()
This produces a plot of the fitted uniform PDF.
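drawPDF returns a Graph object; a sketch of one way to render it, using the matplotlib-based viewer that ships with OpenTURNS:
from openturns.viewer import View
# render the Graph returned by drawPDF with matplotlib
View(graph).show()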
How would I calculate the 95% confidence interval of a list whose distribution looks like this?
[figure: histogram showing the distribution of the data]
When I use various python libraries to do so, the confidence interval returned is unreasonably small.
You have what appears to be a log-normal type of distribution. I would do something like this:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import lognorm
fig, ax = plt.subplots(1, 1)
# Shape parameter of the log-normal:
s = 0.9  # 0.954
# Calculate the first four moments:
mean, var, skew, kurt = lognorm.stats(s, moments='mvsk')
# Display the probability density function (pdf):
x = np.linspace(lognorm.ppf(0.01, s), lognorm.ppf(0.99, s), 100)
ax.plot(x, lognorm.pdf(x, s), 'r-', lw=5, alpha=0.6, label='lognorm pdf')
# Check accuracy of cdf and ppf (prints True):
vals = lognorm.ppf([0.001, 0.5, 0.999], s)
print(np.allclose([0.001, 0.5, 0.999], lognorm.cdf(vals, s)))
ax.legend(loc='best', frameon=False)
plt.show()
# Get the 95% confidence interval:
y = lognorm.interval(0.95, s, loc=0, scale=1)
print(y)
Result:
(0.1713636133687539, 5.835544549636219)
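Note that s, loc and scale are fixed by hand above; for your own sample you would first estimate them from the data. A minimal sketch, assuming data is your 1-D array of observations:
from scipy.stats import lognorm
# Estimate shape, location and scale by maximum likelihood
s, loc, scale = lognorm.fit(data)
# 95% interval of the fitted distribution
low, high = lognorm.interval(0.95, s, loc=loc, scale=scale)
print(low, high)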
I have an array of velocity data in directions V_x and V_y. I've plotted a histogram for the velocity norm using the code below,
plt.hist(V_norm_hist, bins=60, density=True, rwidth=0.95)
which gives a histogram of the velocity norm. Now I also want to add a Rayleigh distribution curve on top of this, but I can't get it to work. I've been trying different combinations using scipy.stats.rayleigh, but the scipy documentation isn't really intuitive, so I can't get it to function properly...
What exactly do the lines
mean, var, skew, kurt = rayleigh.stats(moments='mvsk')
and
x = np.linspace(rayleigh.ppf(0.01),rayleigh.ppf(0.99), 100)
ax.plot(x, rayleigh.pdf(x),'r-', lw=5, alpha=0.6, label='rayleigh pdf')
do?
You might need to first follow the link to rv_continuous, from which rayleigh is subclassed, and from there to ppf to find out that ppf is the 'percent point function', i.e. the inverse of the cdf. x0 = ppf(0.01) gives the point below which 1% of the total probability 'weight' has accumulated, and similarly x1 = ppf(0.99) is where 99% of the 'weight' has accumulated. np.linspace(x0, x1, 100) divides the range from x0 to x1 into 100 short intervals. As a continuous distribution can have infinite support, the x0 and x1 limits are needed to show only the interesting interval.
rayleigh.pdf(x) gives the value of the pdf at x, i.e. an indication of how probable each x is.
rayleigh.stats(moments='mvsk') where moments is composed of letters [‘mvsk’] defines which moments to compute: ‘m’ = mean, ‘v’ = variance, ‘s’ = (Fisher’s) skew, ‘k’ = (Fisher’s) kurtosis.
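As a quick sanity check, the standard Rayleigh (loc=0, scale=1) has the closed-form percent point function ppf(q) = sqrt(-2 ln(1 - q)); a small sketch verifying that against scipy:
import numpy as np
from scipy.stats import rayleigh
q = np.array([0.01, 0.5, 0.99])
# closed-form ppf of the standard Rayleigh
closed_form = np.sqrt(-2 * np.log(1 - q))
print(np.allclose(rayleigh.ppf(q), closed_form))  # True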
To plot both the histogram and the distribution on the same plot, we need to know the parameters of the Rayleigh distribution that correspond to your sample (loc and scale). Furthermore, the pdf and the histogram need to share the same x and y scales. For x we can take the limits of the histogram bins. For y we can scale up the pdf, knowing that the total area under the pdf is supposed to be 1, while the histogram bars are proportional to the number of entries.
If you do know that loc is 0 but don't know the scale, the Wikipedia article gives a formula that connects the scale to the mean of your samples:
estimated_rayleigh_scale = samples.mean() / np.sqrt(np.pi / 2)
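Alternatively, scipy can estimate the same parameter directly by maximum likelihood; a sketch, assuming samples is your 1-D array of norms (floc=0 fixes the location at zero):
from scipy.stats import rayleigh
# fit only the scale, keeping loc fixed at 0
loc, estimated_scale = rayleigh.fit(samples, floc=0)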
Supposing a loc of 0 and a scale of 0.08, the code would look like:
from matplotlib import pyplot as plt
import numpy as np
from scipy.stats import rayleigh
N = 1000
# V = np.random.uniform(0, 0.1, 2*N).reshape((N,2))
# V_norm = (np.linalg.norm(V, axis=1))
scale = 0.08
V_norm_hist = scale * np.sqrt(-2 * np.log(np.random.uniform(0, 1, N)))  # synthetic Rayleigh sample via inverse-transform sampling
fig, ax = plt.subplots(1, 1)
num_bins = 60
_binvalues, bins, _patches = plt.hist(V_norm_hist, bins=num_bins, density=False, rwidth=1, ec='white', label='Histogram')
x = np.linspace(bins[0], bins[-1], 100)
binwidth = (bins[-1] - bins[0]) / num_bins
scale = V_norm_hist.mean() / np.sqrt(np.pi / 2)  # estimate the scale from the sample mean
plt.plot(x, rayleigh(loc=0, scale=scale).pdf(x)*len(V_norm_hist)*binwidth, lw=5, alpha=0.6, label=f'Rayleigh pdf (s={scale:.3f})')
plt.legend()
plt.show()
I have some data to which I have fitted a normal distribution, using the fit function of the scipy.stats.norm object, like so:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x = np.random.normal(size=50000)
fig, ax = plt.subplots()
nbins = 75
mu, sigma = norm.fit(x)
n, bins, patches = ax.hist(x, nbins, density=True, facecolor='grey', alpha=0.5, label='before')
y0 = norm.pdf(bins, mu, sigma)  # line of best fit
ax.plot(bins, y0, 'k--', linewidth=2, label='fit before')
ax.set_title(r'$\mu$={}, $\sigma$={}'.format(mu, sigma))
plt.show()
I would now like to extract the uncertainty/error in the fitted mu and sigma values. How can I go about this?
You can use scipy.optimize.curve_fit. This method returns not only the estimated optimal values of the parameters but also the corresponding covariance matrix. From its documentation:
popt : array
Optimal values for the parameters so that the sum of the squared residuals
of f(xdata, *popt) - ydata is minimized
pcov : 2d array
The estimated covariance of popt. The diagonals provide the variance of the parameter estimate. To compute one standard deviation errors on the parameters use perr = np.sqrt(np.diag(pcov)).
How the sigma parameter affects the estimated covariance depends on absolute_sigma argument, as described above.
If the Jacobian matrix at the solution doesn’t have a full rank, then ‘lm’ method returns a matrix filled with np.inf, on the other hand ‘trf’ and ‘dogbox’ methods use Moore-Penrose pseudoinverse to compute the covariance matrix.
You can calculate the standard deviation errors of the parameters from the square roots of the diagonal elements of the covariance matrix as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy.optimize import curve_fit
x = np.random.normal(size=50000)
fig, ax = plt.subplots()
nbins = 75
n, bins, patches = ax.hist(x, nbins, density=True, facecolor='grey', alpha=0.5, label='before')
centers = (0.5*(bins[1:]+bins[:-1]))
pars, cov = curve_fit(lambda x, mu, sig : norm.pdf(x, loc=mu, scale=sig), centers, n, p0=[0,1])
ax.plot(centers, norm.pdf(centers,*pars), 'k--',linewidth = 2, label='fit before')
ax.set_title(r'$\mu={:.4f}\pm{:.4f}$, $\sigma={:.4f}\pm{:.4f}$'.format(pars[0], np.sqrt(cov[0, 0]), pars[1], np.sqrt(cov[1, 1])))
plt.show()
This results in a plot of the histogram with the fitted curve, with the fitted parameters and their uncertainties shown in the title.
See also lmfit (https://github.com/lmfit/lmfit-py), which gives an easier interface and reports uncertainties in fitted variables. To fit data to a normal distribution, see http://lmfit.github.io/lmfit-py/builtin_models.html#example-1-fit-peak-data-to-gaussian-lorentzian-and-voigt-profiles and use something like
from lmfit.models import GaussianModel
model = GaussianModel()
# create parameters with initial guesses:
params = model.make_params(center=9, amplitude=40, sigma=1)
# xdata, ydata: your arrays of x positions and observed values
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
The report will include the 1-sigma errors like
[[Variables]]
sigma: 1.23218358 +/- 0.007374 (0.60%) (init= 1.0)
center: 9.24277047 +/- 0.007374 (0.08%) (init= 9.0)
amplitude: 30.3135620 +/- 0.157126 (0.52%) (init= 40.0)
fwhm: 2.90157055 +/- 0.017366 (0.60%) == '2.3548200*sigma'
height: 9.81457817 +/- 0.050872 (0.52%) == '0.3989423*amplitude/max(1.e-15, sigma)'
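To get the values programmatically rather than from the printed report, each fitted parameter carries its best-fit value and standard error; for example, using the result object above:
# best-fit value and 1-sigma uncertainty of the fitted sigma
print(result.params['sigma'].value, result.params['sigma'].stderr)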
I'm trying to fit a histogram, but the fit only works with normalised data, i.e. with the option density=True in the histogram. Is there a way of doing this with scipy.stats (or some other method)? Here is an MWE using a uniform distribution:
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import uniform
data = []
for i in range(1000):
    data.append(random.uniform(-1, 1))
loc, scale = uniform.fit(data)
x = np.linspace(-1, 1, 1000)
y = uniform.pdf(x, loc, scale)
plt.hist(data, bins=100, density=False)
plt.plot(x, y, 'r-')
plt.show()
I also tried defining my own function (below), but I'm getting a bad fit.
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy import optimize
data = []
for i in range(1000):
    data.append(random.uniform(-1, 1))
def unif(x, avg, sig):
    return avg * x + sig
y, base = np.histogram(data, bins=100)
x = [0.5 * (base[i] + base[i + 1]) for i in range(len(base) - 1)]
popt, pcov = optimize.curve_fit(unif, x, y)
x_fit = np.linspace(x[0], x[-1], 100)
y_fit = unif(x_fit, *popt)
plt.hist(data, bins=100, density=False)
plt.plot(x_fit, y_fit, 'r-')
plt.show()
Note that it is generally a bad idea to fit a distribution to the histogram. Compared to the raw data, the histogram contains less information, so the fit will most likely be worse. Thus, the first MWE in the question actually contains the best approach: simply normalize the histogram and it will match the distribution of the data: plt.hist(data, bins=100, density=True).
However, it seems you actually want to work with the unnormalized histogram. In that case, take the normalization that the histogram would normally apply and apply its inverse to the fitted distribution. The documentation describes the normalization as
n / (len(x) * dbin)
which is a terse way of saying: divide each count by the number of observations times the bin width.
Multiplying the fitted pdf by len(data) * bin_width, the inverse of that factor, therefore gives the expected count per bin:
loc, scale = uniform.fit(data)
x = np.linspace(-1, 1, 1000)
y = uniform.pdf(x, loc, scale)
n_bins = 100
bin_width = np.ptp(data) / n_bins  # ptp = max - min, i.e. the range of the data
plt.hist(data, bins=n_bins, density=False)
plt.plot(x, y * len(data) * bin_width, 'r-')
The second MWE is interesting because you describe the line as a bad fit, but actually it is a very good fit :). You simply overfit the histogram: although you expect a horizontal line (one degree of freedom), you fit an arbitrary line (two degrees of freedom). So if you want a horizontal line, fit a horizontal line, and don't be surprised to get something else if you fit something else...
def unif(x, sig):
    return 0 * x + sig  # slope is zero -> horizontal line
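Re-running the fit from the second MWE with this one-parameter function (reusing the x and y computed there) is then a one-liner:
# fit only the height of the horizontal line to the bin counts
popt, pcov = optimize.curve_fit(unif, x, y)
print(popt[0])  # ~10, the average count per bin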
However, there is a much simpler way of obtaining the height of the unnormalized uniform distribution. Just average the histogram over all bins:
y, base = np.histogram(data,bins=100)
y_hat = np.mean(y)
print(y_hat)
# 10.0
Or, even simpler, use the theoretical value of len(data) / n_bins == 10.
So I am trying to plot a histogram of my data and I seem to be a little confused here. I am using matplotlib in Python. Here is the code from their website:
import numpy as np
import matplotlib.pyplot as plt
mu = 100     # mean
sigma = 15   # standard deviation
x = mu + sigma * np.random.randn(10000)
num_bins = 50  # the matplotlib example uses 50 bins
# the histogram of the data
n, bins, patches = plt.hist(x, num_bins, density=True, facecolor='green', alpha=0.5)
I am confused as to what the x-axis should be for my use. I have calculated the standard deviation and the mean, but I am uncertain whether I should replace np.random.randn(10000) with my actual data or not.
Just put your data into the x variable; that's all. You do not need to compute the mean or the variance yourself.
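A minimal sketch, assuming data is a 1-D NumPy array of your measurements:
import matplotlib.pyplot as plt
from scipy.stats import norm
x = data  # your actual measurements instead of the random sample
n, bins, patches = plt.hist(x, 50, density=True, facecolor='green', alpha=0.5)
# optional: overlay a normal pdf fitted to the data
mu, sigma = norm.fit(x)
plt.plot(bins, norm.pdf(bins, mu, sigma), 'k--')
plt.show()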