I want to fit part of my data with Python, using lmfit (it is not a must!). I'd like to have a dynamic range of data to be fitted, meaning two fitting parameters that define the part of my data to be fitted (call them the lower and upper boundaries).
The reason is that I have many data sets. The fitting range varies from one to the next, and I cannot define a model that fits the whole range of data. On the other hand, I cannot go through each data set and define the fitting range by hand.
Is it possible at all? I thought of multiplying my model by a pulse function, but that would have to affect the original data as well, and as far as I understand I cannot tell lmfit to multiply it into the data. So I am out of ideas!
The number of observations (data points or length of the residual array) returned by the model function or function to be minimized has to be the same throughout an individual fit. Of course, this can change between successive fits. So, you could try multiple fits for each data set with different ranges, perhaps set based on the previous fit.
I think that your idea of using "where the fit is bad" to determine where not to fit is somewhat suspect, and you would want to make sure it does not lead to absurd results. If, for example, the range were automatically reduced so far that Ndata = Nvariables + 1, you could probably get a very low chi-square compared to Ndata = 100*Nvariables.
Without knowing the particulars, I think you would be better off coming up with criteria for selecting the data range that depend on the data alone, and not on a fit to it.
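As a minimal sketch of what I mean: derive the range from the data itself (here, a simple threshold on the signal; the threshold, the Gaussian model, and the variable names are all illustrative) and fit only the sliced arrays, so each individual fit still sees a fixed-length residual:

import numpy as np
from lmfit.models import GaussianModel

def fit_in_data_driven_window(x, y, threshold=0.1):
    # select the fit range from the data alone: keep points above 10% of the maximum
    mask = y > threshold * y.max()
    model = GaussianModel()
    params = model.guess(y[mask], x=x[mask])
    # the sliced arrays have a fixed length for this fit, which is all lmfit requires
    return model.fit(y[mask], params, x=x[mask])

# applied to many data sets (xs and ys are lists of arrays, names are illustrative):
# results = [fit_in_data_driven_window(x, y) for x, y in zip(xs, ys)]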
I was trying to fit a function of the form A*cos(w t)*cos(O t) to this dataset using SciPy's curve_fit, but it either fails (doesn't find a fit) or the fit is not good.
Here is my code:

import numpy as np
from scipy.optimize import curve_fit

def PEND(x, A, O, w):  # the function I want to fit
    y = A * np.cos(O * x) * np.cos(w * x)
    return y

# t and x are the data arrays (loaded elsewhere)
Guess = [2.1, 4.39822971502571, 0.029]
parameters, covariance = curve_fit(PEND, xdata=t, ydata=x, p0=Guess,
                                   bounds=([1.9, 0.1, 0.001], [2.2, 20, 1]))
A, O, w = parameters
xfit = PEND(t, A, O, w)
Result:
[1.9 4.40327678 0.02658705]
I have tried changing the guess, the variables I fit, the function, the bounds, etc. many times, and the best I got was with the code above, resulting in:
[Image: resulting fit] [Image: close-up]
As you can see, the fit is not satisfactory. I know the model is not perfect, as the amplitude falls off gradually, but my problem doesn't change whether I fit the whole data set or only the first 1/3 of it. As you can also see, the second frequency goes down by quite a lot in the fit, which is odd, since my guess was already a bit too low. The amplitude also goes down to its lower bound, and if I do not set the bounds as I do, it drops to essentially zero while the frequencies barely change. I believe the program tries too hard to fit A and doesn't fit the frequencies at all. If I take my best guess of the amplitude and exclude it from the fit, I get a runtime error saying no fit was found.
What can I do to fit this well?
I would like to know if in Python, and more precisely in the lmfit library, there is an option for fitting data by parts. I would like to fit data defined in different ranges and then obtain a unique fit.
Thank you
Without a more concrete example, it is hard to give a concrete answer. But, if I understand your question correctly, you are looking to do a fit to one specific region of your data, then a fit (probably with a different functional form) to another region of your data, and then perhaps combine the multiple regions to get a final fit.
If that is correct, then yes, this can be done with lmfit (and probably with other libraries as well). Let's say you want to fit data that is peak-like on top of an exponentially decaying background. First, isolate a region around the peak (it doesn't have to be perfect) and fit a peak (say, a Gaussian) to it. Then fit an exponential decay to all the data except the peak area. (Aside: numpy.where can be very useful in identifying the regions.) Finally, combine the two models and fit the whole curve to peak + background.
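A minimal sketch of that workflow with lmfit's built-in models (the synthetic data, the peak window of 6 to 10, and the prefixes are all illustrative choices):

import numpy as np
from lmfit.models import GaussianModel, ExponentialModel

# synthetic data: exponential background plus a Gaussian peak (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 20, 401)
y = (5.0 * np.exp(-x / 7.0)
     + 3.0 * np.exp(-(x - 8.0) ** 2 / (2 * 0.5 ** 2))
     + rng.normal(scale=0.05, size=x.size))

peak = GaussianModel(prefix='g_')
bkg = ExponentialModel(prefix='e_')

# 1) fit the peak on a window around it (numpy.where picks the region)
in_peak = np.where((x > 6) & (x < 10))
peak_fit = peak.fit(y[in_peak], peak.guess(y[in_peak], x=x[in_peak]), x=x[in_peak])

# 2) fit the background on everything except the peak window
out_peak = np.where((x < 6) | (x > 10))
bkg_fit = bkg.fit(y[out_peak], bkg.guess(y[out_peak], x=x[out_peak]), x=x[out_peak])

# 3) combine the two and refit the whole curve, starting from the partial results
model = peak + bkg
params = model.make_params()
params.update(peak_fit.params)
params.update(bkg_fit.params)
final = model.fit(y, params, x=x)
print(final.fit_report())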
If that is too vague and doesn't point you in the right direction, please make the question more specific.
I am fitting a beta distribution with beta.fit(W). The values of W do not reach the [0, 1] boundaries. My question is the following: do I need to force [0, 1] bounds with beta.fit(W, loc=min(W), scale=max(W)-min(W)), or may I assume that, as long as the data are within the [0, 1] range, the fit "will be fine"? Obviously, scaling the data gives different values of a and b. Which one is the "correct" one?
This question is related to:
https://stats.stackexchange.com/questions/68983/beta-distribution-fitting-in-scipy
Unfortunately, no valid answer is given there on what to do when the data are within the expected range.
I tried to fit data generated with known values of a and b and neither technique gave a good fit, although scaling seemed to help a bit.
Thanks
When not passing the floc and fscale parameters, fit tries to estimate them. If you know that the data are in a specific interval you should make that additional information known to the fit function (by setting the parameters yourself) in order to improve the fit. You can also give initial guesses for α, β and the scale parameters (via the loc and scale keyword arguments); SciPy's default guessing function seems to be quite sophisticated, though.
Deriving floc and fscale from the limits of the sample set is not a good idea because the beta distribution is zero at the interval boundaries for most values of α and β, which means that you are creating large discrepancies between the data and all possible fits.
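For illustration, a small sketch of the two calls (the sample here is synthetic and the shape parameters are arbitrary):

import numpy as np
from scipy.stats import beta

# synthetic sample from a known beta distribution on [0, 1]
W = beta.rvs(2.0, 5.0, size=1000, random_state=np.random.default_rng(0))

# let fit() estimate loc and scale as well (four free parameters)
a1, b1, loc1, scale1 = beta.fit(W)

# fix the support to [0, 1] so only a and b are estimated
a2, b2, loc2, scale2 = beta.fit(W, floc=0, fscale=1)

print(a1, b1, loc1, scale1)
print(a2, b2)  # loc2 == 0.0 and scale2 == 1.0 by construction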
I have written python (2.7.3) code wherein I aim to create a weighted sum of 16 data sets, and compare the result to some expected value. My problem is to find the weighting coefficients which will produce the best fit to the model. To do this, I have been experimenting with scipy's optimize.minimize routines, but have had mixed results.
Each of my individual data sets is stored as a 15x15 ndarray, so their weighted sum is also a 15x15 array. I define my own 'model' of what the sum should look like (also a 15x15 array), and quantify the goodness of fit between my result and the model using a basic least squares calculation.
R=np.sum(np.abs(model/np.max(model)-myresult)**2)
'myresult' is produced as a function of some set of parameters 'wts'. I want to find the set of parameters 'wts' which will minimise R.
To do so, I have been trying this:
res = minimize(get_best_weightings,wts,bounds=bnds,method='SLSQP',options={'disp':True,'eps':100})
Where my objective function is:
def get_best_weightings(wts):
    # wts is a flat array of 32 values: 16 real weights followed by 16 imaginary weights
    wts_tr = wts[0:16]
    wts_ti = wts[16:32]
    # portlist, originalwtsr, originalwtsi, modelbeam and make_weighted_beam
    # are defined at module level
    for i, j in enumerate(portlist):
        originalwtsr[j] = wts_tr[i]
        originalwtsi[j] = wts_ti[i]
    realwts = originalwtsr
    imagwts = originalwtsi
    myresult = make_weighted_beam(realwts, imagwts, 1)
    R = np.sum(np.abs(modelbeam / np.max(modelbeam) - myresult) ** 2)
    return R
The input (wts) is an ndarray of shape (32,), and the output, R, is just a scalar, which should get smaller as my fit gets better. By my understanding, this is exactly the sort of problem ("Minimization of scalar function of one or more variables.") that scipy.optimize.minimize is designed to solve (http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.optimize.minimize.html).
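(For reference, a stripped-down call of the same shape, with a toy objective and purely illustrative bounds:)

import numpy as np
from scipy.optimize import minimize

def toy_objective(wts):
    # scalar measure of fit quality for a 32-element parameter vector (toy example)
    return np.sum((wts - np.arange(32.0)) ** 2)

x0 = np.zeros(32)
bnds = [(-4000, 4000)] * 32
res = minimize(toy_objective, x0, bounds=bnds, method='SLSQP')
print(res.success, res.x[:4])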
However, when I run the code, although the optimization routine seems to iterate over different values of all the elements of wts, only a few of them seem to 'stick'. That is, all but four of the values are returned unchanged from my initial guess. To illustrate, I plot my initial guess for wts (in blue) and the optimized values (in red). You can see that for most elements the two lines overlap.
Image: http://imgur.com/p1hQuz7
Changing just these few parameters is not enough to get a good answer, and I can't understand why the other parameters aren't also being optimised. I suspect that maybe I'm not understanding the nature of my minimization problem, so I'm hoping someone here can point out where I'm going wrong.
I have experimented with a variety of minimize's built-in methods (I am by no means committed to SLSQP, or certain that it's the most appropriate choice), and with a variety of step sizes eps. The bounds I am using for all my parameters are (-4000, 4000). I only have SciPy 0.11, so I haven't tested the basinhopping routine to find the global minimum (that needs 0.12). I have looked at scipy.optimize.brute, but haven't tried implementing it yet; I thought I'd check whether anyone can steer me in a better direction first.
Any advice appreciated! Sorry for the wall of text and the possibly (probably?) idiotic question. I can post more of my code if necessary, but it's pretty long and unpolished.
I have a list of many float numbers, representing the duration of an operation performed several times.
For each type of operation, the numbers follow a different trend.
I'm aware of the many random generators provided by Python modules such as numpy.random.
For example, there are binomial, exponential, normal, Weibull, and so on.
I'd like to know if there's a way, given a list of values, to find the random generator that fits it best.
That is, the generator (with its parameters) that best matches the trend of the numbers in the list.
That's because I'd like to automate the generation of the time durations of each operation, so that I can simulate it over n years without having to find by hand which method best fits which list of numbers.
EDIT: In other words, trying to clarify the problem:
I have a list of numbers, and I'm trying to find the probability distribution that best fits it. The only problem I see is that each probability distribution has input parameters that may affect the result, so I'll have to figure out how to set these parameters automatically to get the best fit.
Any ideas?
You might find it better to think about this in terms of probability distributions, rather than thinking about random number generators. You can then think in terms of testing goodness of fit for your different distributions.
As a starting point, you might try constructing probability plots for your samples. Probably the easiest in terms of the math behind it would be to consider a Q-Q plot. Using the random number generators, create a sample of the same size as your data. Sort both of these, and plot them against one another. If the distributions are the same, then you should get a straight line.
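A rough sketch of that recipe (the gamma sample stands in for your measured durations; the exponential candidate, the sample size, and the names are all illustrative):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.0, scale=3.0, size=500)                   # your measured durations
candidate = rng.exponential(scale=sample.mean(), size=sample.size)   # candidate distribution

# sort both samples and plot them against one another (a Q-Q plot)
q_sample = np.sort(sample)
q_candidate = np.sort(candidate)

plt.plot(q_candidate, q_sample, '.', label='Q-Q points')
lims = [0, max(q_sample.max(), q_candidate.max())]
plt.plot(lims, lims, 'k--', label='perfect agreement')
plt.xlabel('candidate quantiles')
plt.ylabel('sample quantiles')
plt.legend()
plt.show()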
Edit: To find appropriate parameters for a statistical model, maximum likelihood estimation is a standard approach. Depending on how many samples of numbers you have and the precision you require, you may well find that just playing with the parameters by hand will give you a "good enough" solution.
Why using random numbers for this is a bad idea has already been explained. It seems to me that what you really need is to fit the distributions you mentioned to your points (for example, with a least squares fit), then check which one fits the points best (for example, with a chi-squared test).
EDIT: Adding a reference to the numpy least-squares fitting example.
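A hedged sketch of that approach, fitting an exponential density to a histogram by least squares and then computing a Pearson chi-squared statistic (the synthetic data, the bin count, and the names are all made up for illustration):

import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

data = stats.expon.rvs(scale=3.0, size=1000, random_state=0)

counts, edges = np.histogram(data, bins=30)
centers = 0.5 * (edges[:-1] + edges[1:])
bin_width = edges[1] - edges[0]

def expected_counts(x, lam):
    # expected counts per bin for an exponential pdf with rate lam
    return data.size * bin_width * lam * np.exp(-lam * x)

(lam_fit,), _ = curve_fit(expected_counts, centers, counts, p0=[0.5])

# Pearson chi-squared statistic of observed vs. expected counts
expected = expected_counts(centers, lam_fit)
chi2 = np.sum((counts - expected) ** 2 / np.maximum(expected, 1e-9))
print(lam_fit, chi2)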
Given a parameterized univariate distribution (e.g. the exponential depends on lambda, the gamma depends on theta and k), the standard way to find the parameter values that best fit a given sample of numbers is the maximum likelihood procedure. It is not a least squares procedure, which would require binning and thus losing information! Some Wikipedia distribution articles give expressions for the maximum likelihood estimates of the parameters, but many do not, and even the ones that do are missing expressions for error bars and covariances. If you know calculus, you can derive these results yourself by expressing the log likelihood of your data set in terms of the parameters, setting its first derivative to zero to maximize it, and using the inverse of the curvature matrix (the negative Hessian of the log likelihood) at the maximum as the covariance matrix of your parameters.
Given two different fits to two different parameterized distributions, the way to compare them is called the likelihood ratio test. Basically, you just pick the one with the larger log likelihood.
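As a sketch of that comparison in Python, assuming SciPy (the gamma/exponential pairing and the synthetic data are only for illustration):

import numpy as np
from scipy import stats

# synthetic sample; in practice this would be your list of operation durations
data = stats.gamma.rvs(a=2.0, scale=3.0, size=500, random_state=0)

# maximum likelihood fits of two candidate distributions (support fixed at 0)
gamma_params = stats.gamma.fit(data, floc=0)
expon_params = stats.expon.fit(data, floc=0)

# total log likelihood of the data under each fitted distribution
loglik_gamma = np.sum(stats.gamma.logpdf(data, *gamma_params))
loglik_expon = np.sum(stats.expon.logpdf(data, *expon_params))

print('gamma log-likelihood:', loglik_gamma)
print('expon log-likelihood:', loglik_expon)   # the larger value wins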
Gabriel, if you have access to Mathematica, parameter estimation is built in:
In[43]:= data = RandomReal[ExponentialDistribution[1], 10]
Out[43]= {1.55598, 0.375999, 0.0878202, 1.58705, 0.874423, 2.17905, \
0.247473, 0.599993, 0.404341, 0.31505}
In[44]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MaximumLikelihood"]
Out[44]= ExponentialDistribution[1.21548]
In[45]:= EstimatedDistribution[data, ExponentialDistribution[la],
ParameterEstimator -> "MethodOfMoments"]
Out[45]= ExponentialDistribution[1.21548]
However, it is also easy to work out by hand what the maximum likelihood method requires the parameter to be.
In[48]:= Simplify[
D[LogLikelihood[ExponentialDistribution[la], {x}], la], x > 0]
Out[48]= 1/la - x
Hence the maximum likelihood estimate for the exponential distribution is obtained by setting the sum of (1/la - x_i) over the data to zero, which gives la = 1/Mean[data]. Similar equations can be worked out for other distribution families and coded in the language of your choice.
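In Python the same estimate is essentially a one-liner (sketch; data stands for a 1-D array of samples):

import numpy as np

data = np.random.default_rng(0).exponential(scale=1.0, size=1000)  # illustrative sample
la_hat = 1.0 / np.mean(data)   # maximum likelihood estimate of the exponential rate
print(la_hat)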