Generalized least square on large dataset

Generalized least square on large dataset - python

I'd like to linearly fit the data that were NOT sampled independently. I came across generalized least square method:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
The equation is Matlab format; X and Y are coordinates of the data points, and V is a "variance matrix".
The problem is that due to its size (1000 rows and columns), the V matrix becomes singular, thus un-invertable. Any suggestions for how to get around this problem? Maybe using a way of solving generalized linear regression problem other than GLS? The tools that I have available and am (slightly) familiar with are Numpy/Scipy, R, and Matlab.

Instead of:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
Use
b= (X'/V *X)\X'/V*Y
That is, replace all instances of X*(Y^-1) with X/Y. Matlab will skip calculating the inverse (which is hard, and error prone) and compute the divide directly.
Edit: Even with the best matrix manipulation, some operations are not possible (for example leading to errors like you describe).
An example of that which may be relevant to your problem is if try to solve least squares problem under the constraint the multiple measurements are perfectly, 100% correlated. Except in rare, degenerate cases this cannot be accomplished, either in math or physically. You need some independence in the measurements to account for measurement noise or modeling errors. For example, if you have two measurements, each with a variance of 1, and perfectly correlated, then your V matrix would look like this:
V = [1 1; ...
1 1];
And you would never be able to fit to the data. (This generally means you need to reformulate your basis functions, but that's a longer essay.)
However, if you adjust your measurement variance to allow for some small amount of independence between the measurements, then it would work without a problem. For example, 95% correlated measurements would look like this
V = [1 0.95; ...
0.95 1 ];

You can use singular value decomposition as your solver. It'll do the best that can be done.
I usually think about least squares another way. You can read my thoughts here:
http://www.scribd.com/doc/21983425/Least-Squares-Fit
See if that works better for you.
I don't understand how the size is an issue. If you have N (x, y) pairs you still only have to solve for (M+1) coefficients in an M-order polynomial:
y = a0 + a1*x + a2*x^2 + ... + am*x^m

Related

How do you fit a polynomial to a data set?

I'm working on two functions. I have two data sets, eg [[x(1), y(1)], ..., [x(n), y(n)]], dataSet and testData.
createMatrix(D, S) which returns a data matrix, where D is the degree and S is a vector of real numbers [s(1), s(2), ..., s(n)].
I know numpy has a function called polyfit. But polyfit takes in three variables, any advice on how I'd create the matrix?
polyFit(D), which takes in the polynomial of degree D and fits it to the data sets using linear least squares. I'm trying to return the weight vector and errors. I also know that there is lstsq in numpy.linag that I found in this question: Fitting polynomials to data
Is it possible to use that question to recreate what I'm trying?
This is what I have so far, but it isn't working.
def createMatrix(D, S):
x = []
y = []
for i in dataSet:
x.append(i[0])
y.append(i[1])
polyfit(x, y, D)
What I don't get here is what does S, the vector of real numbers, have to do with this?
def polyFit(D)
I'm basing a lot of this on the question posted above. I'm unsure about how to get just w though, the weight vector. I'll be coding the errors, so that's fine I was just wondering if you have any advice on getting the weight vectors themselves.

It looks like all createMatrix is doing is creating the two vectors required by polyfit. What you have will work, but, the more pythonic way to do it is
def createMatrix(dataSet, D):
D = 3 # set this to whatever degree you're trying
x, y = zip(*dataSet)
return polyfit(x, y, D)
(This S/O link provides a detailed explanation of the zip(*dataSet) idiom.)
This will return a vector of coefficients that you can then pass to something like poly1d to generate results. (Further explanation of both polyfit and poly1d can be found here.)
Obviously, you'll need to decide what value you want for D. The simple answer to that is 1, 2, or 3. Polynomials of higher order than cubic tend to be rather unstable and the intrinsic errors make their output rather meaningless.
It sounds like you might be trying to do some sort of correlation analysis (i.e., does y vary with x and, if so, to what extent?) You'll almost certainly want to just use linear (D = 1) regression for this type of analysis. You can try to do a least squares quadratic fit (D = 2) but, again, the error bounds are probably wider than your assumptions (e.g. normality of distribution) will tolerate.

Estimation of fundamental matrix or essential matrix from feature matching

I am estimating the fundamental matrix and the essential matrix by using the inbuilt functions in opencv.I provide input points to the function by using ORB and brute force matcher.These are the problems that i am facing:
1.The essential matrix that i compute from in built function does not match with the one i find from mathematical computation using fundamental matrix as E=k.t()FK.
2.As i vary the number of points used to compute F and E,the values of F and E are constantly changing.The function uses Ransac method.How do i know which value is the correct one??
3.I am also using an inbuilt function to decompose E and find the correct R and T from the 4 possible solutions.The value of R and T also change with the changing E.More concerning is the fact that the direction vector T changes without a pattern.Say it was in X direction at a value of E,if i change the value of E ,it changes to Y or Z.Y is this happening????.Has anyone else had the same problem.???
How do i resolve this problem.My project involves taking measurements of objects from images.
Any suggestions or help would be welcome!!

Both F and E are defined up to a scale factor. It may help to normalize the matrices, e. g. by dividing by the last element.
RANSAC is a randomized algorithm, so you will get a different result every time. You can test how much it varies by triangulating the points, or by computing the reprojection errors. If the results vary too much, you may want to increase the number of RANSAC trials or decrease the distance threshold, to make sure that RANSAC converges to the correct solution.

Yes, Computing Fundamental Matrix gives a different matrix every time as it is defined up to a scale factor.
It is a Rank 2 matrix with 7DOF(3 rot, 3 trans, 1 scaling).
The fundamental matrix is a 3X3 matrix, F33(3rd col and 3rd row) is scale factor.
You make ask why do we append matrix with constant at F33, Because of (X-Left)F(x-Right)=0, This is a homogenous equation with infinite solutions, we are adding a constraint by making F33 constant.

Multiple integrations of a data-function over a small portion of its domain - accuracy and efficiency

As part of my research, I'm required to integrate data-defined functions over small subsets of their domain, many times over. This is a long post: answers to any of the three questions below will be acknowledged!
For example, let's say I have a large 1-D domain x, and a function y to be integrated over some subset of x.
In a typical example, the domain x would have 1000 grid points, where at step 'i' my data function y[i,:] is a numpy array the same size as x. Typically y will be a numpy array of shape (1000,1000).
Now, for each value of i, I need to employ quadrature, for many points in x, to find the integral of y[i,arr] over arr, where arr is a subdomain of x.
Here's my first problem: when arr is small (say 3 points), methods like scipy.integrate.cumtrapz won't give a good approximation - there are only three values in y[i,arr].
At each step i, one has to do such integrations for arr's ranging from size 3 to approximately size 200. The output of these integrations are used to update y[i+1,:], so I believe much error is being introduced due to my current use of cumtrapz.
Edit: Many thanks to #Fabian Rost, who provided an answer to what was Question 1: whether errors were in fact being introduced. He also proposed a using linear interpolation as in Question 2 below, and an estimate for how long such a technique would take. I guess what really remains is whether there is a faster technique than that proposed.
My proposed solution is to:
Create a new interpolating object, say y2, for y[i,arr2], where arr2 is a sufficiently larger subdomain than arr.
Create a new linspace x2 corresponding to the intersection of arr with arr2, then use an existing function-integration method like scipy.integrate.quadrature to integrate y2 over x2.
The result from step 2 is probably a really good approximation to the integral for y[i,arr].
Question 2: Are all these steps necessary? That is, is there a built-in that will do all this for me?
Question 3: I believe if I want to avoid errors, I have to do these interpolations->integrations, over 1000 iterations, for at least 200 subdomains at each iteration. This can clearly become quite costly. Is there a more Pythonic way to do this?
Answers to any of these questions are GREATLY appreciated! Thanks so much for reading.

Assuming linear interpolation is a good model you could define a continuous data function and integrate using scipy.integrate.quad like that:
import scipy as sp
import scipy.integrate
xp = sp.linspace(0, 1000, 1000)
yp = sp.randn(1000)
datafunc = lambda x: sp.interp(x, xp, yp)
sp.integrate.quad(datafunc, 3, 1000)
Depending on the domain size the integration take 2 to 4 ms on my machine. That would mean something like 4 hours for 1000 * 200 integrations which I think is OK, if you only need to do it once. But the time will heavily depend on your data.

Python circle fitting to data points less sensitive to random noise

I have a set of measured radii (t+epsilon+error) at an equally spaced angles.
The model is circle of radius (R) with center at (r, Alpha) with added small noise and some random error values which are much bigger than noise.
The problem is to find the center of the circle model (r,Alpha) and the radius of the circle (R). But it should not be too much sensitive to random error (in below data points at 7 and 14).
Some radii could be missing therefore the simple mean would not work here.
I tried least square optimization but it significantly reacts on error.
Is there a way to optimize least deltas but not the least squares of delta in Python?
Model:
n=36
R=100
r=10
Alpha=2*Pi/6
Data points:
[95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75, 90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43, 105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93, 111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97, 96.05, 95.29]

It seems like your main problem here is going to be removing outliers. There are a couple of ways to do this, but for your application, your best bet is to probably just to remove items based on their distance from the median (Since the median is much less sensitive to outliers than the mean.)
If you're using numpy that would looks like this:
def remove_outliers(data_points, margin=1.5):
nd = np.abs(data_points - np.median(data_points))
s = nd/np.median(nd)
return data_points[s<margin]
After which you should run least squares.
If you're not using numpy you can do something similar with native python lists:
def median(points):
return sorted(points)[len(points)/2] # evaluates to an int in python2
def remove_outliers(data_points, margin=1.5):
m = median(data_points)
centered_points = [abs(point - m) for point in data_points]
centered_median = median(centered_points)
ratios = [datum/centered_median for datum in centered_points]
return [point for i, point in enumerate(data_points) if ratios[i]>margin]
If you're looking to just not count outliers as highly you can just calculate the mean of your dataset, which is just a linear equivalent of the least-squares optimization.
If you're looking for something a little better I might suggest throwing your data through some kind of low pass filter, but I don't think that's really needed here.
A low-pass filter would probably be the best, which you can do as follows: (Note, alpha is a number you will have to fiddle with to get your desired output.)
def low_pass(data, alpha):
new_data = [data[0]]
for i in range(1, len(data)):
new_data.append(alpha * data[i] + (1 - alpha) * new_data[i-1])
return new_data
At which point your least squares optimization should work fine.

Replying to your final question
Is there a way to optimize least deltas but not the least squares of delta in Python?
Yes, pick an optimization method (for example downhill simplex implemented in scipy.optimize.fmin) and use the sum of absolute deviations as a merit function. Your dataset is small, I suppose that any general purpose optimization method will converge quickly. (In case of non-linear least squares fitting it is also possible to use general purpose optimization algorithm, but it's more common to use the Levenberg-Marquardt algorithm which minimizes sums of squares.)
If you are interested when minimizing absolute deviations instead of squares has theoretical justification see Numerical Recipes, chapter Robust Estimation.
From practical side, the sum of absolute deviations may not have unique minimum.
In the trivial case of two points, say, (0,5) and (1,9) and constant function y=a, any value of a between 5 and 9 gives the same sum (4). There is no such problem when deviations are squared.
If minimizing absolute deviations would not work, you may consider heuristic procedure to identify and remove outliers. Such as RANSAC or ROUT.

Using Numpy to find the average distance in a set of points

I have an array of points in unknown dimensional space, such as:
data=numpy.array(
[[ 115, 241, 314],
[ 153, 413, 144],
[ 535, 2986, 41445]])
and I would like to find the average euclidean distance between all points.
Please note that I have over 20,000 points, so I would like to do this as efficiently as possible.
Thanks.

If you have access to scipy, you could try the following:
scipy.spatial.distance.cdist(data,data)

Well, I don't think that there is a super fast way to do this, but this should do it:
tot = 0.
for i in xrange(data.shape[0]-1):
tot += ((((data[i+1:]-data[i])**2).sum(1))**.5).sum()
avg = tot/((data.shape[0]-1)*(data.shape[0])/2.)

Now that you've stated your goal of finding the outliers, you are probably better off computing the sample mean and, with that, the sample variance, since both those operations will give you an O(nd) operation. With that, you should be able to find outliers (e.g. excluding points further from the mean than some fraction of the std. dev.), and that filtering process should be possible to perform in O(nd) time for a total of O(nd).
You might be interested in a refresher on Chebyshev's inequality.

Is it ever worthwhile to optimize without a working solution? Also, computation of a distance matrix over the entire data set rarely needs to be fast because you only do it once--when you need to know a distance between two points, you just look it up, it's already calculated.
So if you don't have a place to start, here's one. If you want to do this in Numpy without the need to write any inline fortran or C, that should be no problem, though perhaps you want to include this small vector-based virtual machine called "numexpr" (available on PyPI, trivial to intall) which in this case gave a 5x performance boost versus Numpy alone.
Below i've calculated a distance matrix for 10,000 points in 2D space (a 10K x 10k matrix giving the distance between all 10k points). This took 59 seconds on my MBP.
import numpy as NP
import numexpr as NE
# data are points in 2D space (x, y)--obviously, this code can accept data of any dimension
x = NP.random.randint(0, 10, 10000)
y = NP.random.randint(0, 10, 10000)
fnx = lambda q : q - NP.reshape(q, (len(q), 1))
delX = fnx(x)
delY = fnx(y)
dist_mat = NE.evaluate("(delX**2 + delY**2)**0.5")

There's no getting around the number of evaluations:
Sum[n-i, {i, 0, n}] = http://www.equationsheet.com/latexrender/pictures/27744c0bd81116aa31c138ab38a2aa87.gif
But you can save yourself the expense of all those square roots if you can get by with an approximate result. It depends on your needs.
If you're going to calculate an average, I would advise you to not try putting all the values into an array before calculating. Just calculate the sum (and sum of squares if you need standard deviation as well) and throw away each value as you calculate it.
Since
and
, I don't know if this means you have to multiply by two somewhere.

If you want a fast and inexact solution, you could probably adapt the Fast Multipole Method algorithm.
Points that are separated by a small distance have a smaller contribution to the final average distance, so it would make sense to group points into clusters and compare the clusters distances.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.