How to create dataset for fitting a function in scipy stats? - python

I want to fit some data to a Pareto distribution using the scipy.stats library. I am not sure if the issue might be numerical, so just to be safe: I have values measured for the dependent variable (let's call them 'pushes') against the independent variable ('minutes'), starting at a few thousand minutes and sampled every ten minutes thereafter (with the exception of a few points that were removed during data cleaning).
e.g.
2780.0 362.0
2800.0 376.0
2810.0 393.0
...
The best info I can find says something like
from scipy.stats import pareto
result = pareto.fit(data)
and I have no idea how this data is to be formatted in this case. I've tried the following but all result in errors.
result = pareto.fit(zip(minutes, pushes))
result = pareto.fit(pushes)
The error is usually
Warning: invalid value encountered in double_scalars
I would greatly appreciate some guidance, thank you.

As I mentioned in the comments above, pareto.fit() is not what you're looking for.
The .fit() methods of the continuous distributions in scipy.stats obtain an estimate of the parameters of the distribution that maximise the probability of observing some particular set of sample values. Therefore, pareto.fit() wants only a single array argument containing the samples you want to fit the distribution to. The other keyword arguments control various aspects of the fitting process, for example by specifying initial values for the distribution parameters.
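For contrast, here is a minimal sketch of what pareto.fit() does expect: a flat array of samples drawn from the distribution itself (the shape parameter 2.5 and the floc=0 choice are illustrative assumptions, not values from your problem):

from scipy.stats import pareto

# Draw synthetic samples from a Pareto distribution with shape parameter b = 2.5.
samples = pareto.rvs(b=2.5, size=1000, random_state=0)

# Fit the distribution to the samples; fixing loc at 0 is a common choice.
b_hat, loc_hat, scale_hat = pareto.fit(samples, floc=0)
print(b_hat, loc_hat, scale_hat)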
What you're actually trying to do is to fit the relationship between some independent variable x and some dependent variable y, i.e.
y_fit = f(x, params)
What you need to do is:
Choose some functional form for f. From your description, the plot of y vs x resembles the probability density function for a Pareto distribution, so perhaps either this or a decaying exponential might be appropriate.
Find the set of params that minimize some measure of the difference between y and y_fit (usually the sum of squared differences). You could use scipy.optimize.curve_fit or scipy.optimize.minimize to do this.
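A minimal sketch of that second step using scipy.optimize.curve_fit, with a decaying exponential as the assumed functional form and synthetic stand-ins for the minutes/pushes arrays (the parameter names and values are purely illustrative):

import numpy as np
from scipy.optimize import curve_fit

# Assumed model: a decaying exponential with an additive offset.
def model(x, amplitude, rate, offset):
    return amplitude * np.exp(-rate * x) + offset

# Synthetic stand-ins for the measured data.
rng = np.random.default_rng(0)
minutes = np.arange(2780.0, 4000.0, 10.0)
pushes = 400.0 * np.exp(-0.001 * minutes) + 50.0 + rng.normal(0, 2, minutes.size)

# p0 is a rough initial guess for (amplitude, rate, offset).
params, pcov = curve_fit(model, minutes, pushes, p0=(400.0, 0.001, 0.0))
y_fit = model(minutes, *params)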

Related

why am I getting OptimizeError while trying to fit a gaussian/lorentzian to data using curve_fit?

I am trying to fit a data set, which may be a Gaussian or a Lorentzian, with the scipy.optimize curve_fit function.
I am getting the following warning:
OptimizeWarning: Covariance of the parameters could not be estimated
The data set (plot omitted here) looks like it may fit a Gaussian.
My code is:
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c, d):
    return a * np.exp(-((x - b)**2) / c) + d

def lorentzian(x, a, b, c):
    return a / (((x - b)**2 + a**2) * np.pi) + c

x, y_data = np.loadtxt('in 0.6 out 0.6.dat', unpack=True)
popt, pcov = curve_fit(lorentzian, x, y_data)
thank you!
You're getting this warning because the fitting algorithm couldn't find an appropriate solution: no matter where it moved the parameters, the fit quality didn't change. If you provide an initial guess, you're more likely to reach the solution. Since the function parameters can be read fairly easily off the curve, you could provide most of them. For example, the centre (which you called b) is around 545.5. Wikipedia also has a relationship for the value at the maximum, for a slightly different form of your equation that lacks the c parameter shifting the curve upwards. Providing the guess p0 = (0.1, 545.5, 0) and a bound of (0, 1E10) gets you something much closer to your results, yet still unsatisfactory (next time, please provide the data array; I had to use a point extractor to plot this).
Now, notice how you're supposed to reach a maximum value of 40, yet that seems unattainable by your model. I took the liberty of normalizing your model simply by dividing it by its maximum value and trying to fit again. I don't know if this is the appropriate normalization, but this is just to illustrate the difference. This time, the result is much more satisfactory:
Yet I think a Lorentzian is a bit too narrow for your curve (especially evident if you set c to 0), which looks much more like a Gaussian (given that you provided its definition but didn't use it, I guess you were planning to try it anyway).
Note how I didn't have to normalize y.
In summary:
Provide an initial guess to your fitting algorithm, and bounds if possible.
Always plot your models and data to see what's going on.
Be aware of the limits of your model: which values it can or can't reach. Use this to evaluate whether a fit is even possible in the first place.
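A minimal sketch of those suggestions applied to the Lorentzian above, using the p0 and bounds values mentioned in the answer and synthetic stand-in data (since the original data file isn't available):

import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, a, b, c):
    return a / (((x - b)**2 + a**2) * np.pi) + c

# Synthetic stand-in for the question's data, with a peak near x = 545.5.
rng = np.random.default_rng(1)
x = np.linspace(540, 551, 300)
y_data = lorentzian(x, 0.2, 545.5, 1.0) + rng.normal(0, 0.05, x.size)

# Initial guess (a, b, c) and bounds as suggested above; bounds=(0, 1e10)
# constrains all three parameters to be non-negative.
popt, pcov = curve_fit(lorentzian, x, y_data, p0=(0.1, 545.5, 0), bounds=(0, 1e10))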

What exactly is the variance on the parameters of SciPy curve fit? (Python)

I'm currently using the curve_fit function of the scipy.optimize package in Python, and I know that if you take the square root of the diagonal entries of the covariance matrix that you get from curve_fit, you get the standard deviation of the parameters that curve_fit calculated. What I'm not sure about is what exactly this standard deviation means. It's an approximation using a Hessian matrix as far as I understand, but what would the exact calculation be? The standard deviation of a Gaussian bell curve tells you what percentage of the area lies within a certain range of the curve, so I assumed that for curve_fit it tells you how many data points lie between certain parameter values, but apparently that isn't right...
I'm sorry if this should be basic knowledge for curve fitting, but I really can't figure out what the standard deviations mean. They express an error on the parameters, but those parameters are calculated as the best possible fit for the function; it's not as if there's a whole collection of optimal parameters whose average we take, from which we would also get a standard deviation. There's only one optimal value, so what is there to compare it with? I guess my question really comes down to this: how can I manually and accurately calculate these standard deviations, and not just get an approximation using a Hessian matrix?
The variance in the fitted parameters represents the uncertainty in the best-fit value based on the quality of the fit of the model to the data. That is, it describes by how much the value could change away from the best-fit value and still have a fit that is almost as good as the best-fit value.
With the standard definition of chi-square,
chi_square = ( ( (data - model)/epsilon )**2 ).sum()
and reduced_chi_square = chi_square / (ndata - nvarys) (where data is the array of data values, model is the array of the calculated model, epsilon is the uncertainty in the data, ndata is the number of data points, and nvarys is the number of variables), a good fit should have reduced_chi_square around 1, or chi_square around ndata - nvarys. (Note: not 0 -- the fit will not be perfect, as there is noise in the data.)
The variance in the best-fit value for a variable gives the amount by which you can change the value (and re-optimize all other values) and increase chi-square by 1. That gives the so-called '1-sigma' value of the uncertainty.
As you say, these values are expressed in the diagonal terms of the covariance matrix returned by scipy.optimize.curve_fit (the off-diagonal terms give the correlations between variables: if a value for one variable is changed away from its optimal value, how would the others respond to make the fit better?). This covariance matrix is built from the trial values and derivatives near the solution as the fit is being done -- it captures the "curvature" of the parameter space (i.e., how much chi-square changes when a variable's value changes).
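As a small self-contained illustration (a straight-line fit to synthetic data, not the original problem), the 1-sigma uncertainties come straight from the diagonal of that covariance matrix:

import numpy as np
from scipy.optimize import curve_fit

def line(x, m, b):
    return m * x + b

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, x.size)

popt, pcov = curve_fit(line, x, y)
perr = np.sqrt(np.diag(pcov))   # 1-sigma uncertainty of each best-fit parameter
print(popt, perr)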
You can calculate these uncertainties by hand, but the lmfit library (https://lmfit.github.io/lmfit-py/) has routines to more explicitly explore the confidence intervals of variables from least-squares minimization or curve-fitting; these are described in more detail at https://lmfit.github.io/lmfit-py/confidence.html. It's probably easiest to use lmfit for the curve-fitting rather than trying to re-implement the confidence-interval code for curve_fit.

fitting beta distribution (in python) - clarification please

I am fitting a beta distribution with beta.fit(W). The values of W do not reach the [0, 1] boundaries. My question is the following: do I need to force [0, 1] bounds with beta.fit(W, loc=min(W), scale=max(W) - min(W)), or may I assume that as long as the data is within the [0, 1] range, the fitting "will be fine"? Obviously, scaling the data should give different values of a and b. Which one is the "correct" one?
This question is related to:
https://stats.stackexchange.com/questions/68983/beta-distribution-fitting-in-scipy
Unfortunately, no valid answer is given there on what to do when the data is already within the expected range...
I tried to fit data generated with known values of a and b and neither technique gave a good fit, although scaling seemed to help a bit.
Thanks
When not passing the floc and fscale parameters, fit tries to estimate them. If you know that the data are in a specific interval you should make that additional information known to the fit function (by setting the parameters yourself) in order to improve the fit. You can also give initial guesses for α, β and the scale parameters (via the loc and scale keyword arguments); SciPy's default guessing function seems to be quite sophisticated, though.
Deriving floc and fscale from the limits of the sample set is not a good idea because the beta distribution is zero at the interval boundaries for most values of α and β, which means that you are creating large discrepancies between the data and all possible fits.
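A short sketch of that advice, with synthetic data standing in for W (the shape values 2 and 5 are arbitrary):

from scipy.stats import beta

# Synthetic data known to lie in [0, 1].
W = beta.rvs(2, 5, size=2000, random_state=0)

# Fix the support to [0, 1] instead of letting fit() estimate loc and scale.
a_hat, b_hat, loc, scale = beta.fit(W, floc=0, fscale=1)
print(a_hat, b_hat, loc, scale)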

python scipy.stats pdf and expect functions

I was wondering if someone could please explain what the following functions in scipy.stats do:
rv_continuous.expect
rv_continuous.pdf
I have read the documentation but I am still confused.
Here is my task, quite simple in theory, but I am still confused with what these functions do.
So, I have a list of areas: 16383 values. I want to find the probability that the variable area takes any value between a smaller value, called "inf", and a larger value, "sup".
So, what I thought is:
scipy.stats.rv_continuous.pdf(a) #a being the list of areas
scipy.stats.rv_continuous.expect(pdf, lb = inf, ub = sup)
So that I can get the probability that any area is between inf and sup.
Can anyone explain in a simple way what these functions do, and give a hint on how to compute the integral of f(a) between inf and sup, please?
Thanks
Blaise
rv_continuous is a base class for all of the probability distributions implemented in scipy.stats. You would not call methods on rv_continuous yourself.
Your question is not entirely clear about what you want to do, so I will assume that you have an array of 16383 data points drawn from some unknown probability distribution. From the raw data, you will need to estimate the cumulative distribution, find the values of that cumulative distribution at the sup and inf values, and subtract to find the probability that a value drawn from the unknown distribution lies between them.
There are lots of ways to estimate the unknown distribution from the data depending on how much modelling you want to do and how many assumptions you want to make. At the more complicated end of the spectrum, you could try to fit one of the standard parametric probability distributions to the data. For example, if you had a suspicion that your data were lognormally distributed, you could use scipy.stats.lognorm.fit(data, floc=0) to find the parameters of the lognormal distribution that fit your data. Then you could use scipy.stats.lognorm.cdf(sup, *params) - scipy.stats.lognorm.cdf(inf, *params) to estimate the probability of the value being between those values.
In the middle are the non-parametric forms of distribution estimation like histograms and kernel density estimates. For example, scipy.stats.gaussian_kde(data).integrate_box_1d(inf, sup) is an easy way to make this estimate using a Gaussian kernel density estimate of the unknown distribution. However, kernel density estimates aren't always appropriate and require some tweaking to get right.
The simplest thing you could do is just count the number of data points that fall between inf and sup and divide by the total number of data points that you have. This only works well with a largish number of points (which you have) and with bounds that aren't too far in the tails of the data.
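A sketch of the simplest two approaches, with a synthetic stand-in for the array of areas and arbitrary inf/sup bounds:

import numpy as np
from scipy import stats

# Synthetic stand-in for the 16383 area values.
rng = np.random.default_rng(0)
areas = rng.lognormal(mean=1.0, sigma=0.5, size=16383)
inf, sup = 2.0, 5.0   # hypothetical bounds

# 1. Empirical estimate: fraction of samples falling between inf and sup.
p_empirical = np.mean((areas >= inf) & (areas <= sup))

# 2. Kernel density estimate of the unknown distribution, integrated over [inf, sup].
p_kde = stats.gaussian_kde(areas).integrate_box_1d(inf, sup)

print(p_empirical, p_kde)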
The cumulative distribution function (CDF) might give you what you want.
Then the probability P of being between two values is
P(inf < area < sup) = cdf(sup) - cdf(inf)
They are all related. The pdf is the "density" of the probabilities: it must be non-negative and integrate to 1. I think of it as indicating how likely something is. The expectation is a generalisation of the idea of an average:
E[x] = sum(x * P(x))
(or the integral of x times the pdf for a continuous variable).
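As a tiny illustration with a fitted normal distribution (the choice of distribution is arbitrary here), the CDF difference and the expect() method with a constant function give the same probability:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(10.0, 2.0, size=1000)   # stand-in for the area values
mu, sigma = stats.norm.fit(data)
inf, sup = 8.0, 12.0

p_cdf = stats.norm.cdf(sup, mu, sigma) - stats.norm.cdf(inf, mu, sigma)

# expect() integrates func * pdf; with func = 1 it reduces to the same probability.
p_expect = stats.norm.expect(lambda x: 1.0, loc=mu, scale=sigma, lb=inf, ub=sup)
print(p_cdf, p_expect)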

Maximum Likelihood Estimate pseudocode

I need to code a Maximum Likelihood Estimator to estimate the mean and variance of some toy data. I have a vector with 100 samples, created with numpy.random.randn(100). The data should have a zero-mean, unit-variance Gaussian distribution.
I checked Wikipedia and some extra sources, but I am a little bit confused since I don't have a statistics background.
Is there any pseudo code for a maximum likelihood estimator? I get the intuition of MLE but I cannot figure out where to start coding.
Wikipedia says to take the argmax of the log-likelihood. What I understand is: I need to calculate the log-likelihood using different parameters, and then I'll take the parameters which gave the maximum probability. What I don't get is: where do I find the parameters in the first place? If I randomly try different means and variances to get a high probability, when should I stop trying?
I just came across this, and I know it's old, but I'm hoping that someone else benefits from it. Although the previous comments gave pretty good descriptions of what ML optimization is, no one gave pseudo-code to implement it. Python has a minimizer in SciPy that will do this. Here's pseudo-code for a linear regression.
# Import the packages
import numpy as np
from scipy.optimize import minimize
import scipy.stats as stats

# Set up your x values
x = np.linspace(0, 100, num=100)

# Set up your observed y values with a known slope (2.4), intercept (5), and sd (4)
yObs = 5 + 2.4*x + np.random.normal(0, 4, 100)

# Define the likelihood function where params is a list of initial parameter estimates
def regressLL(params):
    # Resave the initial parameter guesses
    b0 = params[0]
    b1 = params[1]
    sd = params[2]

    # Calculate the predicted values from the initial parameter guesses
    yPred = b0 + b1*x

    # Calculate the negative log-likelihood as the negative sum of the log of a normal
    # PDF where the observed values are normally distributed around the mean (yPred)
    # with a standard deviation of sd
    logLik = -np.sum(stats.norm.logpdf(yObs, loc=yPred, scale=sd))

    # Tell the function to return the NLL (this is what will be minimized)
    return logLik

# Make a list of initial parameter guesses (b0, b1, sd)
initParams = [1, 1, 1]

# Run the minimizer
results = minimize(regressLL, initParams, method='nelder-mead')

# Print the results. They should be really close to your actual values
print(results.x)
This works great for me. Granted, this is just the basics. It doesn't profile or give CIs on the parameter estimates, but it's a start. You can also use ML techniques to find estimates for, say, ODEs and other models, as I describe here.
I know this question is old; hopefully you've figured it out since then, but hopefully someone else will benefit.
If you do maximum likelihood calculations, the first step you need to take is the following: assume a distribution that depends on some parameters. Since you generated your data (you even know your parameters), you "tell" your program to assume a Gaussian distribution. However, you don't tell your program your parameters (0 and 1); you leave them unknown a priori and compute them afterwards.
Now, you have your sample vector (let's call it x; its elements are x[0] to x[99]) and you have to process it. To do so, you have to compute the following (f denotes the probability density function of the Gaussian distribution):
f(x[0]) * ... * f(x[99])
As you can see in the linked article, f employs two parameters (the Greek letters µ and σ). You now have to calculate the values of µ and σ in such a way that f(x[0]) * ... * f(x[99]) takes the maximum possible value.
When you've done that, µ is your maximum likelihood value for the mean, and σ is the maximum likelihood value for the standard deviation.
Note that I don't explicitly tell you how to compute the values for µ and σ, since this is quite a mathematical procedure that I don't have at hand (and probably would not understand); I just tell you the technique to get the values, which can be applied to other distributions as well.
Since you want to maximize the original term, you can "simply" maximize the logarithm of the original term instead - this saves you from dealing with all those products and turns the original term into a sum.
If you really want to calculate it, you can do some simplifications that lead to the following term (hope I didn't mess up anything):
log L(µ, σ) = -(n/2) * log(2 * π * σ**2) - (1/(2 * σ**2)) * sum((x[i] - µ)**2)
Now, you have to find values for µ and σ such that the above beast is maximal. Doing that is a very nontrivial task called nonlinear optimization.
One simplification you could try is the following: Fix one parameter and try to calculate the other. This saves you from dealing with two variables at the same time.
You need a numerical optimisation procedure. Not sure if anything is implemented in Python, but if it is then it'll be in numpy or scipy and friends.
Look for things like 'the Nelder-Mead algorithm', or 'BFGS'. If all else fails, use Rpy and call the R function 'optim()'.
These functions work by searching the function space and trying to work out where the maximum is. Imagine trying to find the top of a hill in fog. You might just try always heading up the steepest way. Or you could send some friends off with radios and GPS units and do a bit of surveying. Either method could lead you to a false summit, so you often need to do this a few times, starting from different points. Otherwise you may think the south summit is the highest when there's a massive north summit overshadowing it.
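A minimal sketch of that numerical route for the toy data in the question, minimizing the negative Gaussian log-likelihood with Nelder-Mead (the starting guess [0.5, 2.0] is arbitrary):

import numpy as np
from scipy import stats
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.standard_normal(100)   # the toy data: zero mean, unit variance

# Negative log-likelihood of a Gaussian with parameters (mu, sigma).
def nll(params):
    mu, sigma = params
    if sigma <= 0:             # keep the search inside the valid region
        return np.inf
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(nll, x0=[0.5, 2.0], method='Nelder-Mead')
mu_hat, sigma_hat = result.x
print(mu_hat, sigma_hat)       # should come out close to 0 and 1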
As joran said, the maximum likelihood estimates for the normal distribution can be calculated analytically. The answers are found by finding the partial derivatives of the log-likelihood function with respect to the parameters, setting each to zero, and then solving both equations simultaneously.
In the case of the normal distribution, you would differentiate the log-likelihood with respect to the mean (mu) and then with respect to the variance (sigma^2) to get two equations, both set equal to zero. After solving these equations for mu and sigma^2, you'll get the sample mean and sample variance as your answers.
See the wikipedia page for more details.
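For the Gaussian case the result is just the sample mean and the 1/n sample variance, so in code the maximum likelihood estimates are:

import numpy as np

x = np.random.randn(100)
mu_mle = np.mean(x)
var_mle = np.var(x)   # np.var uses ddof=0 by default, i.e. the 1/n (ML) estimator
print(mu_mle, var_mle)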
