Python: generating a "curve fit score"

Python: generating a "curve fit score" - python

I am working on a project in which I am trying to model the movement of an object in a kymograph. In order to do so, I fit a curve to each line of pixels in an image, and append the location of the vertex to approximately model the location of the object in the image. Below is a sample image.
As you can see, early in the time series (at the top of the image) the position of the object is nicely focused and easily modeled with a Gaussian curve. However, closer to the end of the time series (at the bottom of the image), the peak is much more diffuse. I suspect that the data at the bottom of the image will be fit much more closely by a curve modeling a Poisson distribution (image below, right) while the data at the top/middle of the image will be fit much more closely by a Gaussian or polynomial curve (image below, left).
Is there any way to, for each line of pixels, fit more than one curve to the same data and then score each for a least-squares fit? This way, I could (hopefully) switch models midway through an image to accommodate changing behaviors of the object that I am trying to track. My current code is below:
from PIL import Image
def populateData(picture) :
"""Open an image and populate a list of lists with the grayscale value"""
im = Image.open(picture)
size = im.size
width = size[0]
height = size[1]
allPixels = list(im.getdata())
pixelList = [allPixels[width*i :
width * (i+1)] for i in range(height)]
return(pixelList)
rawData = populateData("testTop.tif")
import numpy as np
from scipy.optimize import curve_fit
def findVertex(listOfRows) :
xList = []
for row in listOfRows :
x = np.arange(len(row))
ffunc = lambda x, a, x0, s: a*np.exp(-0.5*(x-x0)**2/s**2)
p, _ = curve_fit(ffunc, x, row, p0=[100,5,2])
x0 = p[1]
xList.append(x0)
xArray = np.array(xList)
return(xArray)
xValues = findVertex(rawData)
def buildRows(listOfRows) :
yArray = np.arange(len(listOfRows))
return(yArray)
yValues = buildRows(rawData)
from matplotlib import pyplot as plt
from scipy import ndimage
image = ndimage.imread("testTop.tif",flatten=True)
fig = plt.figure()
axes = fig.add_subplot(111)
axes.imshow(image)
axes.plot(xValues, yValues, 'k-')
axes.set_title('testLine')
axes.grid()
axes.set_xlabel('x')
axes.set_ylabel('time')
plt.show()
EDIT:
This is the file I used as an input (testTop.tif)

You will need to work out some form of goodness of fit between the fit and your data. Taking the sum of the squared differences between your current fit (a Gaussian) and your data divided by the variance.
sumerrsq = 0.
for i in range(yValues.shape[0]):
sumerrsq += np.power(yValues[i] - xValues[i],2)
goodfit = np.sqrt(sumerrsq/var)
I think you can use use the second output from curve fit (the covariance) to get the variance,
p, pcov = curve_fit(ffunc, x, row, p0=[100,5,2])
var = np.diag(pcov)
You can then check the value of goodfit and if it is not sufficient, switch to a different distribution. In using a different distribution, you may need to use a different estimation of error (this assumes the errors are normally distributed).
Note, without the data (and not being sure what array was which) I couldn't test any of this code...

According to the curve_fit docs:
To compute one standard deviation errors on the parameters use perr =
np.sqrt(np.diag(pcov)).
So if that's the value you're trying to compare, then you could take that second returned value from curve_fit (the one you are currently assigning to _), use it to calculate perr as above, and compare the error between multiple curves.

I would suggest you work with a 2D fit model. A 1d Gaussian distribution is the basis but the mean and variance depend on position and time. You then would fit the model against the 2d image data.
In case you want to stay with your approach, it looks like it's just the starting value for mean and variance which you need to tweak in order to get a better fit for the lines with large times.
To your question, you can model any score function you want, so you could do something like:
def score(x,y):
if x < 10:
return x**2 - y
else:
return x - y
So in order to work with two different models in different ranges, follow this example.

Related

How to fit multiple curves to a single scatter plot of data?

I have data from distinct curves, and want to fit each of them individually. However, the data is mixed into a single array, so first I believe I need a way to separate the data.
I know that each of the individual curves is under the family A/x+B. As of now I cut out each of the curves by hand and curve fit, but would like to automate this process, have the computer separate these curves a fit them. I attempted to use machine learning, but didn't know where to start, what packages to use. I am using python, but can also use C++, in fact I hope to transfer it to C++ by the end. Where do you think I should start, is it worth it to use unsupervised machine learning, or is there a better way to separate the data?
The expected curves:
An example of the data

Well, you sure do have an interesting problem.
I see that there are curves with Y-axis values that are considerably larger than the rest of them. I would simply take the first N-values with the largest Y-axis values and then fit them to an exponential decay curve (or that other curve you mention). You can then simply take the points that most fit that curve and then leave the other points alone.
Except...
This is a terrible way to extrapolate data. Doing this, you are cherry-picking the data you want. This is falsifying information and is very bad.
Your best bet is to create a single curve that all points fit too if you cannot isolate all of those points into separate curves with external information.
But...
We do know some information: a valid function must have only 1 output given a single input.
If the X-Axis is discreet, this means you can create a lookup table of Outputs given the input. This allows you to count how many curves there are associated with the specific X-value (which could be a time unit). In other words, you have to have external information to separate points locally. You can then reorder the points in increasing Y-value, and now you have your separate curves defined in discrete points.
Basically, this is an unsolvable problem in the general sense, but in your specific application, there might be extra rules that further define the domain and range such that you can do data filtering.
One more thing...
I am making these statements with the assumption that the (X,Y) values are floats that cannot maintain accuracy after some mathematical operations.
If you are using things like unum numbers, you might be able to keep enough information in the decimal such that your fitting functions can differentiate between points without extra filtering.
This case is more of a hope than anything, as adopting a new number representation to get more accuracy to isolate sampled points is a stretch at best.
Just for completeness, there are some mathematical libraries that might help you.
Boost.uBLAS
Eigen
LAPACK++
Hopefully, I have given you enough information to allow you to solve your problem.

I extracted data from the plot for analysis. Here is example code that loads, separates, fits and plots the three data sets. It works when the separate data files are appended into a single text file.
import numpy, scipy, matplotlib
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
##########################################################
# data load and separation section
datafilename = 'temp.dat'
textdata = open(datafilename, 'rt').read()
xLists = [[], [], []]
yLists = [[], [], []]
previousY = 0.0 # initialize
whichList = -1 # initialize
datalines = textdata.split('\n')
for line in datalines:
if not line: # allow for blank lines in data file
continue
spl = line.split()
x = float(spl[0])
y = float(spl[1])
if y > previousY + 50.0: # this separator must be greater than max noise
whichList += 1
previousY = y
xLists[whichList].append(x)
yLists[whichList].append(y)
##########################################################
# curve fitting section
def func(x, a, b):
return a / x + b
parameterLists = []
for curveIndex in range(len(xLists)):
# these are the same as the scipy defaults
initialParameters = numpy.array([1.0, 1.0])
xData = numpy.array(xLists[curveIndex], dtype=float)
yData = numpy.array(yLists[curveIndex], dtype=float)
# curve fit the test data
fittedParameters, pcov = curve_fit(func, xData, yData, initialParameters)
parameterLists.append(fittedParameters)
##########################################################
# graphics output section
def ModelAndScatterPlot(graphWidth, graphHeight):
f = plt.figure(figsize=(graphWidth/100.0, graphHeight/100.0), dpi=100)
axes = f.add_subplot(111)
for curveIndex in range(len(xLists)):
# first the raw data as a scatter plot
axes.plot(xLists[curveIndex], yLists[curveIndex], 'D')
# create data for each fitted equation plot
xModel = numpy.linspace(min(xLists[curveIndex]), max(xLists[curveIndex]))
yModel = func(xModel, *parameterLists[curveIndex])
# now the model as a line plot
axes.plot(xModel, yModel)
axes.set_xlabel('X Data') # X axis data label
axes.set_ylabel('Y Data') # Y axis data label
plt.show()
plt.close('all') # clean up after using pyplot
graphWidth = 800
graphHeight = 600
ModelAndScatterPlot(graphWidth, graphHeight)

The idea:
create N naive, easy to calculate, sufficiently precise(for clustering), approximations. Then "classify" each data-point to the closest such approximation.
This is done like this:
The approximations are analytical approximations using these two equations I derived:
where (x1,y1) and (x2,y2) are coordinates of two points on the curve.
To get these two points I assumed that (1) the first points(according to the x-axis) are distributed equally between the different real curves. And (2) the 2 first points of each real curve, are smaller or bigger than the 2 first points of each other real curve. Thus sorting them and dividing into N groups will successfully cluster the first *2*N* points. If these assumptions are false you can still manually classify the 2 first points of each real curve and the rest will be classified automatically (this is actually the first approach I implemented).
Then cluster rest of the points to each point's closest approximation. Closest meaning with the smallest error.
Edit: A stronger approach for the initial approximation could be by calculating A and B for a couple of pairs of points and using their mean A and B as the approximation. And maybe even possibly doing K-means on these points/approximations.
The Code:
import numpy as np
import matplotlib.pyplot as plt
# You should probably edit this variable
NUM_OF_CURVES = 4
# <data> should be a 1-D array containing the Y values of the series
# <x_of_data> should be a 1-D array containing the corresponding X values of the series
data, x_of_data = np.loadtxt('...')
# clustering of first 2*num_of_curves points
# I started at NUM_OF_CURVES instead of 0 because my xs started at 0.
# The range (0:NUM_OF_CURVES*2) will probably be better for you.
raw_data = data[NUM_OF_CURVES:NUM_OF_CURVES*3]
raw_xs = x_of_data[NUM_OF_CURVES:NUM_OF_CURVES*3]
sort_ind = np.argsort(raw_data)
Y = raw_data[sort_ind].reshape(NUM_OF_CURVES,-1).T
X = raw_xs[sort_ind].reshape(NUM_OF_CURVES,-1).T
# approximation of A and B for each curve
A = ((Y[0]*Y[1])*(X[0]-X[1]))/(Y[1]-Y[0])
B = (A / Y[0]) - X[0]
# creating approximating curves
f = []
for i in range(NUM_OF_CURVES):
f.append(A[i]/(x_of_data+B[i]))
curves = np.vstack(f)
# clustering the points to the approximating curves
raw_clusters = [[] for _ in range(NUM_OF_CURVES)]
for i in range(len(data)):
raw_clusters[np.abs(curves[:,i]-data[i]).argmin()].append((x_of_data[i],data[i]))
# changing the clusters to np.arrays of the shape (2,-1)
# where row 0 contains the X coordinates and row 1 the Y coordinates
clusters = []
for i in range(len(raw_clusters)):
clusters.append(np.array(list(zip(*raw_clusters[i]))))
Example:
raw series:
separated series:

Least-square fitting for points in 2d doesn't pass through symmetrical axis

I'm trying to draw the best fitting line for given (x,y) data points.
Here shows data points (red pixels) and estimated line (green), I obtained using following library.
import numpy as np
m, c = np.linalg.lstsq(A, y)[0]
Documentation for used library module
We can see data points are roughly symmetrically distributed. Problem is why is this line not having the gradient similar to the long symmetric axis through the data points? Can you please explain can this result is correct? Then, how it gives minimum error? (Line is drawn correctly using gradient returned by the lstsq method). Thank you.
EDIT
Here is the code I'm trying. Input image can be downloaded from here. In this code I've not forced the line to pass through the center of the pixel distribution. (Note: here I've used polyfit instead of lstsq. Both gives same results)
import numpy as np
import cv2
import math
img = cv2.imread('points.jpg',1);
h, w = img.shape[:2]
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
points = np.argwhere(gray>10) # get (x,y) pairs where red pixels exist
y = points[:,0]
x = points[:,1]
m, c = np.polyfit(x, y, 1) # calculate least square fit line
# calculate two cordinates (x1,y1),(x2,y2) on the line
angle = np.arctan(m)
x1, y1, length = 0, int(c), 500
x2 = int(round(math.ceil(x1 + length * np.cos(angle)),0))
y2 = int(round(math.ceil(y1 + length * np.sin(angle)),0))
# draw line on the color image
cv2.line(img, (x1, y1), (x2, y2), (0,255,0), 1, cv2.LINE_8)
# show output the image
cv2.namedWindow("Display window", cv2.WINDOW_AUTOSIZE);
cv2.imshow("Display window", img);
cv2.waitKey(0);
cv2.destroyAllWindows()
How can I have the line pass through the longest symmetric axis of the pixel distribution? Can I use principle component analysis?

It's hard to say why this would be the case. The bottom line is that I can't see the data you're using, and I can't see what the calculated slope and y intercept are for the data you're using.
Here are a couple of things that could explain what we're seeing:
(1) The density of data points is actually quite different than it appears to a casual glance and everything is working properly.
(2) You're sending the wrong arguments to the least squares function and you've got a GIGO situation. (I haven't used numpy's least squares algorithm, so I can't check this.)
(3) The scatter plot and the line plot don't agree on the scale of the axes.
(4) The least squares function in question is broken.
(5) You're not passing the same data to the least squares algorithm as you're passing to the plotting routine.
(6) The data formatting is funky so that the scatter plot and least squares routines are interpreting your data differently.
I can't know which of these is the problem, and unless it's (3), I expect we'd need more data to be able to distinguish between these possibilities.
Here's how I'd proceed if I were you: (1) Create a small artificial data set that sits on a line and pass it to the least squares function and see if it spits out the right numbers. See if these look right when plotted or not. (2) If this looks okay, record the output of the least squares algorithm, see if you can find another least squares program to calculate the slope and y intercept and compare them. If they're the same, it's probably not the routine, it's probably something to do with plotting.
If you get this far and it's still a mystery, let us know what you've found and maybe we can make another suggestion.
Good luck.

If the red dots truly represent your data, you are probably applying your linear regression function in a way that forces the line through the origin. How do i know? When using linear regression on two variables x and y, the line will intercept a few specific points. For example the average of x, and the average of y. Also, depending on your specifications, a calculated or specified intercept of the y axis. If all variables of x and y are positive, you will have a line that looks like yours if the line is forced through the origin. Not much more can be said before you provide som reproducible data and code.
EDIT:
I didn't have much luck with the reproducble sample provided, so I built an example with random numbers to elaborate on my original answer. I think statsmodels is a decent library for linear regression analysis. First, I'll address this earlier comment:
If all variables of x and y are positive, you will have a line that looks like yours if the line is forced through the origin.
You'll see an increasing effect of this the larger your numbers are (the further away from the origin your numbers are). Using sm.OLS(y,sm.add_constant(x)).fit() and sm.OLS(y,x).fit() for two different sets of numbers will show you exactly what I mean. First, I'll run a regression on the dataset below without an estimated constant (the line goes through the origin). This will give us a plot that at resembles your original plot:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
regLine_origin = x*results1.params[0]
# PLot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.patch.set_facecolor('black')
plt.show()
Next, I'll include a constant in the regression. Now, the yellow line will represent what I think you were after in your question:
# Libraries
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Data
np.random.seed(123)
x = np.random.normal(size=2500) + 100
y = x * 2 + np.random.normal(size=2500) + 100
# Regression
results1 = sm.OLS(y,x).fit()
results2 = sm.OLS(y,sm.add_constant(x)).fit()
regLine_origin = x*results1.params[0]
regLine_constant = results2.params[0] + x*results2.params[1]
# PLot
fig, ax = plt.subplots()
ax.scatter(x, y, c='red', s=4)
ax.scatter(x, regLine_origin, c = 'green', s = 1)
ax.scatter(x, regLine_constant, c = 'yellow', s = 1)
ax.patch.set_facecolor('black')
plt.show()
And lastly, we can take a look at what happens when the numbers are closer to the origin. So to speak. Here, I'll remove the +100 part when the numbers are produced:
# The following is changed in the snippet above:
# Data
x = np.random.normal(size=2500)
y = x * 2 + np.random.normal(size=2500)
And that's why I think your original regression line is set to go through the origin. Have a look at the statsmodels package. Here you can study the details of the estimate by running print(results2.summary()):
And as you've already seen in the snippets above, you'll have direct access to the regression coefficients by using results2.params.
Edit2: My explanation still isn't 100% valid. The x and y values will have to differ a bit in size to see this effect. You'll certainly find situations where the line goes through the origin no matter the size of the numbers.
Have a look at the different x labels, and you'll see what I mean.

Parameters of cosine squared scipy optimize curvefit are incorrect in python

I am trying to fit a cosine squared to a data array from an optics interferometry intensity measurement. Unfortunately, the fit returns amplitudes and periods that are way off. Only once I received a more reasonable fit by selecting the first 200 data points from the array (and some other selections). Those fit parameters were used as initial guesses to extend the fit to the entire array, which gave back a plot similar to the image.
import csv
import numpy as np
import matplotlib.pyplot as plt
import scipy as sy
from numpy import genfromtxt
from scipy.optimize import curve_fit
# reads the data from the csv file
csvfile ="</home/pi/Desktop/molecularpolOutput_No2.csv>"
csv = genfromtxt ('molecularpolOutput_No2.csv', delimiter=",")
# defines the data as variables
pressure = csv[100:200,2]
intensity = csv[100:200,3]
temperature = csv[:,1]
pi = 3.14
P = pressure
# defines the function and initial fit parameters
def func(P, T, a, b, c):
return a*np.cos((2*pi*P)/T+b)**2+c
p0 = sy.array([2200, 45, 4000, 85])
# fits the function
coeffs, pcov = curve_fit(func, pressure, intensity, p0)
I = func(P, coeffs[0], coeffs[1], coeffs[2], coeffs[3])
print 'period =',(coeffs[0]), 'Pa'
# plots the data and the function
fig = plt.figure(figsize=(10, 3), dpi=100)
plt.plot(pressure, intensity, linestyle="none", marker=".")
plt.plot(pressure, I)
plt.xlabel('Pressure (Pa)')
plt.ylabel('Relative intensity')
plt.title('interference intensity plot of Newtons rings ')
plt.show()
I would expect the fit to be correct for both a large and small data array. However, as the figures show, extending the array messes with both the amplitude and period. The fit which looks ok, also gives values for the period comparable to other experiments. The data generated by the photoresistor is not precisely linear but I assume this should not be the problem for curve_fit. Is their something I can change in the code to get the fit working? I already tried this code: How do I fit a sine curve to my data with pylab and numpy?
update
A least square curve fit in Matlab gives the same problem. Should I try another method to fit the curve or is it the data that causes the problem?
Matlab Code:
%% Opens excel file
filename = 'vpnat_1.xlsx';
Pr = xlsread(filename,'D1:D500');
I = xlsread(filename, 'E1:E500');
P = Pr;
% defines figure size relative to screen
scrsz = get(groot,'ScreenSize');
figure('Position',[1 scrsz(4)/2 scrsz(3)/2 scrsz(4)/4])
%% fit & plots
hold on
scatter(P,I,'.'); % scatter plot
%% defines parameter guesses
Im = mean(I);
Iu = max(I);
Il = min(I);
Ia = Iu-Il;
Ip = 2000;
Id = -4000;
a_0 = [Ia; Ip; Id; Im]; % initial guesses
fun = #(a,P) a(1).*(cos((2*pi*P)./a(2)+a(3)).^2)+a(4); % defines function
fcn = #(a) sum((fun(a,P)-I).^2); % finds best fit
s = fminsearch(fcn, a_0);
plot(P,fun(s,P)) % plots fitted function
hold off

I solved the problem by using Matlab. It appears that the parameters were to poorly defined for curve_fit in python to find a least squares whithin its given boundaries (Constrain on number of iterations?).
Matlab appeared to accept a larger margin of error in the initial parameters and therefore found a fit for all selections of data. Using the fit parameters from matlab as initial parameters in Python returns a proper fit. The problem in python could be prevented by computing the guesses for the parameters to get a better start.

Problems trying to calculate FWHM with scipy.interpolate

I am having problems trying to find the FWHM of some data. I initially tried to fit a curve using interpolate.interp1d. With this I was able to create a function that when I entered an x value it would return an interpolated y value. The issue is that I need the inverse of this functionality. In other words, I want to switch my independent and dependent variables. When I try to switch them, I get errors because the independent data has to be sorted. If I sort the data, I will lose the indexes, and therefore lose the shape of my graph.
I tried:
x = np.linspace(0, line.shape[0], line.shape[0])
self.x_curve = interpolate.interp1d(x, y, 'linear')
where y is my data.
To get the inverse, I tried:
self.x_curve = interpolate.interp1d(sorted(y), x, 'linear')
but the values are off.
I then moved on and tried to use UnivariateSpline and get the roots to find the FWHM (from this question here: Finding the full width half maximum of a peak), but the roots() method keeps giving me an empty list [].
This is what I used:
x_curve = interpolate.UnivariateSpline(x, y)
r = x_curve.roots()
print(r)
Here is an image of the data (with the UnivariateSpline):
Any ideas? Thanks.

Using UnivariateSpline.roots() to get FWHM will only work if you shift the data so that its value is 0 at FWHM.
Seeing that the background of the data is noisy, I'd first estimate the baseline. For example:
y_baseline = y[(x<200) & (x>350)].mean()
(adjust the limits for x as you see fit). Then shift the data so that the middle of the baseline and the peak is at 0. Seeing that your data has a minimum and not a maximum as in the example, I'm using y.min():
y_shifted = y - (y.min()+y_baseline)/2.0
Now fit a spline to this shifted data and roots() should be able to find the roots, the difference of which is the FWHM.
x_curve = interpolate.UnivariateSpline(x, y_shifted, s=0)
x_curve.roots()
Increase the s parameter if you want to estimate the FWHM from smoothed data.

Fourier smoothing of data set

I am following this link to do a smoothing of my data set.
The technique is based on the principle of removing the higher order terms of the Fourier Transform of the signal, and so obtaining a smoothed function.
This is part of my code:
N = len(y)
y = y.astype(float) # fix issue, see below
yfft = fft(y, N)
yfft[31:] = 0.0 # set higher harmonics to zero
y_smooth = fft(yfft, N)
ax.errorbar(phase, y, yerr = err, fmt='b.', capsize=0, elinewidth=1.0)
ax.plot(phase, y_smooth/30, color='black') #arbitrary normalization, see below
However some things do not work properly.
Indeed, you can check the resulting plot :
The blue points are my data, while the black line should be the smoothed curve.
First of all I had to convert my array of data y by following this discussion.
Second, I just normalized arbitrarily to compare the curve with data, since I don't know why the original curve had values much higher than the data points.
Most importantly, the curve is like "specular" to the data point, and I don't know why this happens.
It would be great to have some advices especially to the third point, and more generally how to optimize the smoothing with this technique for my particular data set shape.

Your problem is probably due to the shifting that the standard FFT does. You can read about it here.
Your data is real, so you can take advantage of symmetries in the FT and use the special function np.fft.rfft
import numpy as np
x = np.arange(40)
y = np.log(x + 1) * np.exp(-x/8.) * x**2 + np.random.random(40) * 15
rft = np.fft.rfft(y)
rft[5:] = 0 # Note, rft.shape = 21
y_smooth = np.fft.irfft(rft)
plt.plot(x, y, label='Original')
plt.plot(x, y_smooth, label='Smoothed')
plt.legend(loc=0)
plt.show()
If you plot the absolute value of rft, you will see that there is almost no information in frequencies beyond 5, so that is why I choose that threshold (and a bit of playing around, too).
Here the results:

From what I can gather you want to build a low pass filter by doing the following:
Move to the frequency domain. (Fourier transform)
Remove undesired frequencies.
Move back to the time domain. (Inverse fourier transform)
Looking at your code, instead of doing 3) you're just doing another fourier transform. Instead, try doing an actual inverse fourier transform to move back to the time domain:
y_smooth = ifft(yfft, N)
Have a look at scipy signal to see a bunch of already available filters.
(Edit: I'd be curious to see the results, do share!)

I would be very cautious in using this technique. By zeroing out frequency components of the FFT you are effectively constructing a brick wall filter in the frequency domain. This will result in convolution with a sinc in the time domain and likely distort the information you want to process. Look up "Gibbs phenomenon" for more information.
You're probably better off designing a low pass filter or using a simple N-point moving average (which is itself a LPF) to accomplish the smoothing.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.