Numpy polyfit find least divergent - python

I am using numpy polyfit to create a number of plots which show a line of best fit. This works fine, but I am wondering: is it possible to ascertain WHICH one of my plots has the "straightest" line?
I'm not sure what the correct term is...
I guess, from the data points given, which set of data is least divergent?
ie:
X = [1,2,3,4,5,6,7,8,9,10]
Y = [1,2,3,4,5,6,7,8,9,10]
this would be giving me a perfect fit... how can I find which is the most perfect fit?

Fitting algorithms such as regression have an accuracy metric called RMSE (root mean square error), which shows how much the fitted curve deviates from the data points. It is explained well here.
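A minimal sketch of that idea, assuming each plot's data is available as a pair of lists: fit each set with np.polyfit, compute the RMSE of the residuals, and pick the set with the smallest value. The helper name fit_rmse and the two example data sets are made up for illustration.

```python
import numpy as np

def fit_rmse(x, y, deg=1):
    """Fit a polynomial of degree `deg` and return the RMSE of the residuals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coeffs = np.polyfit(x, y, deg)
    residuals = y - np.polyval(coeffs, x)
    return np.sqrt(np.mean(residuals ** 2))

# Hypothetical data sets: one perfectly linear, one that zig-zags around a line.
datasets = {
    "perfect": ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
    "noisy": ([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 3, 2, 5, 4, 7, 6, 9, 8, 11]),
}

scores = {name: fit_rmse(x, y) for name, (x, y) in datasets.items()}
best = min(scores, key=scores.get)  # smallest RMSE = closest to its fitted line
print(scores, best)
```

Note that RMSE is in the units of Y, so it only compares fairly between data sets on a similar scale; a normalized measure such as R² may be preferable otherwise.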

Related

scipy curve fit not working for a standing wave

I was trying to fit an A*cos(wt)cos(Ot) function to this dataset:
dataset,
using the scipy curve_fit function, but it either fails (doesn't find a fit) or the fit is not good.
Here is my code
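import numpy as np
from scipy.optimize import curve_fit

def PEND(x, A, O, w):  # the function I want to fit
    y = A * np.cos(O * x) * np.cos(w * x)
    return y

Guess = [2.1, 4.39822971502571, 0.029]
# t and x are the measured arrays from the dataset (not shown here)
parameters, covariance = curve_fit(PEND, xdata=t, ydata=x, p0=Guess, bounds=([1.9, 0.1, 0.001], [2.2, 20, 1]))
A = parameters[0]
O = parameters[1]
w = parameters[2]
xfit = PEND(t, A, O, w)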
Result:
[1.9 4.40327678 0.02658705]
I have tried changing the Guess, the variables I fit, the function, the bounds etc. many times, and the best I got was with the code above, resulting in:
Resulting fit
closeup
As you can see, the fit is not satisfactory. I do know the model is not perfect, as the amplitude falls off gradually, but my problem doesn't change whether I do it on the whole data set or on the first 1/3 of the data set. As you can also see, the second frequency goes down by quite a lot in the fit, which is weird, as mine was a bit too low to begin with. Also, the amplitude goes down to the minimum, and if I do not set the bounds like I do, it goes down to basically zero, while the frequencies barely change. I believe that the program tries to fit A too much and doesn't fit the frequencies at all. If I take my best guess of the amplitude and exclude it from the fit, I get a "runtime exceeded, no fit found" error.
What can I do to fit this well?

why am I getting OptimizeError while trying to fit a gaussian/lorentzian to data using curve_fit?

I am trying to fit a data set which may fit a gaussian or lorentzian, with scipy optimize curve_fit function.
I am getting the error:
"OptimizeWarning: Covariance of the parameters could not be estimated
warnings.warn('Covariance of the parameters could not be estimated',"
the data set looks like this:
enter image description here
which, as you can see, may fit a gaussian.
My code is:
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c, d):
    func = a * np.exp(-((x - b) ** 2) / c) + d
    return func

def lorentzian(x, a, b, c):
    func = a / (((x - b) ** 2 + a ** 2) * np.pi) + c
    return func

x, y_data = np.loadtxt('in 0.6 out 0.6.dat', unpack=True)
popt, pcov = curve_fit(lorentzian, x, y_data)
thank you!
You're getting this warning because the fitting algorithm couldn't find an appropriate solution. No matter where it moved the parameters, the fit quality didn't change. If you provide an initial guess, you're more likely to reach the solution. Given that the function parameters are relatively easy to obtain by glancing at the curves, you could provide most of them. For example, the center (which you called b) is around 545.5. Wikipedia also has a relationship for the value at the maximum for a slightly different form of your equation, which lacks the c parameter that shifts the curve upwards. Providing the guess p0 = (0.1, 545.5, 0) and a bound of (0, 1E10), you get something much closer to your results, yet still unsatisfactory (next time, provide the data array; I had to use a point extractor to plot this)
Now, notice how you're supposed to reach a maximum value of 40, yet that seems unattainable by your model. I took the liberty of normalizing your model simply by dividing it by its maximum value and trying to fit again. I don't know if this is the appropriate normalization, but this is just to illustrate the difference. This time, the result is much more satisfactory:
Yet I think a lorentzian is a bit too narrow for your curve (especially evident if you set c to 0), which looks much more like a gaussian (given you provided its definition but didn't use it, I guess you probably would have used it in the future).
Note how I didn't have to normalize y.
In summary:
Provide an initial guess to your fitting algorithm, and bounds if possible.
Always plot your models and data to see what's going on.
Be aware of the limits of your models, what values it can or can't reach. Use this to evaluate if a fit is even possible in the first place.
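As a rough sketch of the first point, here is the gaussian from the question fitted with an initial guess and bounds. The data is synthetic (the original file is not available), and the way p0 and bounds are derived from the data is only one plausible heuristic, not the method used for the plots above.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, b, c, d):
    # Same form as in the question: amplitude a, center b, width term c, offset d.
    return a * np.exp(-((x - b) ** 2) / c) + d

# Synthetic stand-in for the measured data (assumed peak near 545.5, height ~40).
rng = np.random.default_rng(0)
x = np.linspace(540.0, 551.0, 200)
y = gaussian(x, 40.0, 545.5, 2.0, 1.0) + rng.normal(0.0, 0.5, x.size)

# Initial guess read off the data: peak height, peak position, a width, the baseline.
p0 = (y.max() - y.min(), x[np.argmax(y)], 1.0, y.min())
# Bounds: positive amplitude and width, center inside the measured range.
bounds = ([0.0, x.min(), 1e-6, -np.inf], [np.inf, x.max(), np.inf, np.inf])

popt, pcov = curve_fit(gaussian, x, y, p0=p0, bounds=bounds)
perr = np.sqrt(np.diag(pcov))  # one-sigma parameter uncertainties
print(popt, perr)
```

Plotting gaussian(x, *popt) over the data is still the quickest check that the result makes sense.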

K means clustering on unevenly sized clusters

I have to use k-means clustering (I am using scikit-learn) on a dataset that looks like this
But when I apply k-means, it doesn't give me the centroids I expect, and it classifies points incorrectly.
Also, what would be a way to find the points that are not correctly classified in scikit-learn?
Here is the code.
km = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10)
km.fit(Train_data.values)
plt.plot(km.cluster_centers_[:,0],km.cluster_centers_[:,1],'ro')
plt.show()
Here Train_data is a pandas DataFrame with 2 features and 3500 samples, and the code gives the following.
It might have happened because of a bad choice of initial centroids, but what could be the solution?
First of all, I hope you noticed that the ranges on the X and Y axes are different in the two figures. So the first centroid (sorted by X value) isn't that bad. The second and third ones come out that way because of the large number of outliers. They are probably each taking half of both rightmost clusters. Also, the output of k-means depends on the initial choice of centroids, so see if different runs, or setting the init parameter to 'random', improve the results. Another way to improve the result would be to remove all the points having fewer than some n neighbors within a radius of distance d. To implement that efficiently you would probably need a kd-tree, or just use DBSCAN provided by sklearn here and see if it works better.
Also, K-Means++ is likely to pick outliers as initial cluster centers, as explained here. So you may want to change the init parameter in KMeans to 'random', perform multiple runs, and take the best centroids.
Since your data is 2-D, it is easy to check whether points are classified correctly or not. Use the mouse to 'pick' the coordinates of the approximate centroids (see here) and just compare the clusters obtained from the picked coordinates to those obtained from k-means.
I got a solution for this.
The problem was scaling.
I just scaled both axes using
sklearn.preprocessing.scale
And this is my result
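For reference, a minimal sketch of what that scaling step can look like. The data here is synthetic (two features on very different scales, three made-up cluster centers), not the original Train_data, and the purity check is only there to show that the clusters are recovered after scaling.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale

# Synthetic 2-D data: feature 0 spans ~0..2, feature 1 spans thousands.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 5000.0], [2.0, 0.0]])
true_labels = np.repeat(np.arange(3), 300)
noise = rng.normal(size=(900, 2)) * np.array([0.1, 500.0])
data = centers[true_labels] + noise

# Standardize each feature to zero mean and unit variance before clustering,
# so that neither axis dominates the Euclidean distance.
data_scaled = scale(data)

km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(data_scaled)

# Fraction of each true group that ended up in a single k-means cluster.
purity = [np.bincount(labels[true_labels == k], minlength=3).max() / 300
          for k in range(3)]
print(km.cluster_centers_, purity)
```

The centroids in km.cluster_centers_ are in scaled coordinates; to plot them on the original axes they would need to be transformed back (for example by using StandardScaler and its inverse_transform instead of scale).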

Checking for randomness using the Chi-Square Test

I'm running a simulation for a class project that relies heavily on random number generators, and as a result we're asked to test the random number generator to see just how "random" it is using the Chi-Square statistic. After looking through some posts here, I used the following code to find the answer:
from random import randint
import numpy as np
from scipy.stats import chisquare
numIterations = 1000  # I've run it with other numbers as well
observed = []
for i in range(0, numIterations):
    observed.append(randint(0, 100))
data = np.array(observed)
print("(chi squared statistic, p-value) with", numIterations, "samples generated: ", chisquare(data))
However, I'm getting a p-value of zero when numIterations is greater than 10, which doesn't really make sense considering the null hypothesis is that the data is uniform. Am I misinterpreting the results? Or is my code simply wrong?
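A chi-square test compares how many items you observed in each bin with how many you expected to have in that bin. It does so by summing, across all bins, the squared deviation between observed and expected counts divided by the expected count. You can't just feed it raw data: passing the raw samples makes chisquare treat each random number as a bin count, which is why the p-value collapses to zero. You need to bin the data first, using something like numpy.histogram (scipy.stats.histogram, which older answers mention, has been removed from SciPy).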
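A short sketch of the binned version, assuming the values are integers from 0 to 100 inclusive as in the question. It uses NumPy's random generator with a fixed seed instead of random.randint purely to keep the example reproducible, and np.bincount to get one bin per possible value.

```python
import numpy as np
from scipy.stats import chisquare

num_values = 101          # randint(0, 100) can return 0..100 inclusive
num_samples = 10000       # roughly 99 expected hits per bin

rng = np.random.default_rng(0)
samples = rng.integers(0, num_values, size=num_samples)

# Count how often each value occurred: these counts are the "observed" frequencies.
counts = np.bincount(samples, minlength=num_values)

# With no f_exp given, chisquare assumes all bins are equally likely (uniform).
statistic, p_value = chisquare(counts)
print(statistic, p_value)
```

A small p-value would be evidence against uniformity; with very few samples per bin the chi-square approximation is unreliable, so the sample size should be large relative to the number of bins.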
Depending on what distribution you're going for, you can test for it. Remember that having more samples will approximate the distribution better (if you could take an infinite number of samples you would have the actual distribution). Since in real life we can't run our number generators an infinite number of times, we only deal with approximations, so we bin the distribution (see how many numbers fall into each bin, http://en.wikipedia.org/wiki/Bean_machine). Now, if you ran your bean machine and found that one of the bins was significantly higher than the expected distribution (in this case Gaussian), then you would say that the process is not Gaussian. It is the same with chi-squared, except that the shape is different from a Gaussian because you're sampling multiple normal (Gaussian) distributions. If you want to find out whether your data is normal/Gaussian (think of shapes; the shapes are determined by the distribution's parameters, i.e. mean, std, kurtosis), here is an example of how to do that: http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/chi-square-test-for-normality/
I don't know what your data is, so I can't really tell you what to look for. All in all, you need to know what kind of statistical data you have been given, then try to fit it to a model (in this case chi-squared), then ask yourself whether it matches the model (the curve; you're probably trying to find out whether it's Gaussian/normal or not, which you can do with the chi-squared test). You should google chi-squared, Gaussian, normal, etc.

curve fitting in scipy is of poor quality. How can I improve it?

I'm fitting a set of results to a predicted function. The function might be interpreted as linear, but I might have to change it a little, so I am doing curve fitting instead of linear regression. I use the curve_fit function in scipy. Here is how I use it
kappa = 1
alpha=2
popt,pcov = curve_fit(fitFunc1,self.X[0:3],self.Y[0:3],sigma=self.Err[0:3],p0=[kappa,alpha])
and here is fitFunc1
import numpy as np
from numpy import log, pi

def fitFunc1(X, kappa, alpha):
    out = []
    for x in X:
        y = log(kappa)
        y += 4 * log(pi)
        y += alpha * x
        y -= 2 * log(2)
        out.append(-y)
    return np.array(out)
Here is an example of the fit. The green line is a matlab fit. The red one is a scipy fit. I carry out the fit over the first three points.
You are using non-linear fitting routines to fit the data, not linear least-squares as invoked by A\b. The result is that the matlab and/or scipy minimization routines are getting stuck in local minima during the optimizations, leading to different results.
You should get the same results (to within numerical precision) if you apply logs to the raw data prior to linear fitting with A\b (in matlab).
edit
Inspecting function fitFunc1 it looks like the x/y data have already been transformed prior to the fit within scipy.
I performed a linear fit with the data shown, using matlab. The result of linear least squares with the operation polyfit(x,y,1) (essentially a linear fit) is very similar to the scipy result:
In any case, the data looks piecewise linear, so a better solution may be to attempt a piecewise linear fit. On the other hand, the log transformation can do all sorts of unwanted stuff, so performing nonlinear fits on the original data without performing a log transform may be the best solution.
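To illustrate the linear-fit point in Python: fitFunc1 is a straight line in x with slope -alpha and intercept -(log(kappa) + 4*log(pi) - 2*log(2)), so np.polyfit can recover both parameters directly. This is only a sketch on made-up, noise-free data; the vectorized fit_func1 and the "true" parameter values are assumptions for the example.

```python
import numpy as np

# Constant part of fitFunc1 that does not depend on the fit parameters.
C = 4 * np.log(np.pi) - 2 * np.log(2)

def fit_func1(x, kappa, alpha):
    # Vectorized equivalent of fitFunc1 from the question.
    return -(np.log(kappa) + C + alpha * x)

# Hypothetical data generated from known parameters.
kappa_true, alpha_true = 1.5, 2.0
x = np.array([0.0, 1.0, 2.0, 3.0])
y = fit_func1(x, kappa_true, alpha_true)

# Ordinary linear least squares: y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

# Map the line coefficients back to the model parameters.
alpha_hat = -slope
kappa_hat = np.exp(-intercept - C)
print(kappa_hat, alpha_hat)
```

With measurement errors, np.polyfit accepts weights through its w argument (typically 1/sigma), which plays a similar role to sigma in curve_fit.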
If you don't mind having a little bit of extra work I suggest using PyMinuit or iMinuit, both are minimisation packages based on Seal Minuit.
Then you can minimise a Chi Sq function or maximise the likelihood of your data in relation to your fit function. They also provide all the errors and everything you would like to know about the fit.
Hope this helps! xD
