sparse least square regression - python

I am trying to fit a linear regression Ax = b where A is a sparse matrix and b is a sparse vector. I tried scipy.sparse.linalg.lsqr, but apparently b needs to be a NumPy (dense) array. Indeed, if I run
import scipy.sparse
import scipy.sparse.linalg

A = [list(range(0, 10)) for i in range(0, 15)]  # 15 x 10 dense list of lists
A = scipy.sparse.coo_matrix(A)                  # convert to a sparse matrix
b = list(range(0, 15))
b = scipy.sparse.coo_matrix(b)                  # sparse 1 x 15 row vector
scipy.sparse.linalg.lsqr(A, b)
I end up with:
AttributeError: squeeze not found
While
scipy.sparse.linalg.lsqr(A,b.toarray())
seems to work.
Unfortunately, in my case b is a 1.5 billion x 1 vector and I simply can't use a dense array. Does anybody know a workaround or other libraries for running linear regression with a sparse matrix and vector?

It seems that the documentation specifically asks for a NumPy array. However, given the scale of your problem, maybe it's easier to use the closed-form solution of linear least squares?
Given that you want to solve Ax = b, you can form the normal equations and solve those instead. In other words, you'd solve min ||Ax - b||_2.
The closed-form solution is x = (A.T * A)^{-1} * A.T * b.
Of course, this closed-form solution comes with its own requirements (specifically, that A has full column rank).
You can solve for x using spsolve or if that's too expensive, then using an iterative solver (like Conjugate Gradients) to get an inexact solution.
The code would be:
import scipy.sparse
import scipy.sparse.linalg
import scipy.linalg

A = scipy.sparse.rand(1500, 1000, 0.5, format='csr')  # create a random instance
b = scipy.sparse.rand(1500, 1, 0.5, format='csr')
x = scipy.sparse.linalg.spsolve(A.T * A, (A.T * b).toarray().ravel())
x_lsqr = scipy.sparse.linalg.lsqr(A, b.toarray())  # just for comparison
print(scipy.linalg.norm(x_lsqr[0] - x))
which, on a few random instances, consistently gave me values less than 1e-7.
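If forming A.T * A explicitly is too expensive, here is a minimal sketch of the Conjugate Gradients route mentioned above, assuming the same random A and b; a LinearOperator applies A.T * A through two sparse matrix-vector products per iteration, so neither A.T * A nor a dense b is ever materialized:
import scipy.sparse
import scipy.sparse.linalg

A = scipy.sparse.rand(1500, 1000, 0.5, format='csr')
b = scipy.sparse.rand(1500, 1, 0.5, format='csr')

# Apply A.T A lazily: two sparse mat-vecs per CG iteration, A.T A never formed.
AtA = scipy.sparse.linalg.LinearOperator(
    (A.shape[1], A.shape[1]), matvec=lambda v: A.T @ (A @ v))
Atb = (A.T @ b).toarray().ravel()  # only n_features x 1, so dense is cheap

x_cg, info = scipy.sparse.linalg.cg(AtA, Atb)
print(info)  # 0 indicates convergence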

Apparently billions of observations is too much for my machine. I ended up:
1. Changing the algorithm to Stochastic Gradient Descent (SGD): faster with many observations.
2. Removing completely sparse examples (i.e. rows whose features and label are all zero).
Indeed, the update rule of SGD with a least-squares loss function is always zero for the observations in point 2. This reduced the observations from billions to millions, which turned out to be feasible under SGD on my machine (see the sketch below).
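For reference, a minimal sketch of that setup, assuming scikit-learn's SGDRegressor and chunked input (the chunk generator here is a stand-in, not my actual data pipeline):
import numpy as np
from sklearn.linear_model import SGDRegressor

sgd = SGDRegressor(loss='squared_error')  # SGD with least-squares loss

def iter_chunks():
    # Stand-in generator; in practice read (X, y) chunks from disk.
    rng = np.random.default_rng(0)
    for _ in range(10):
        yield rng.random((1000, 20)), rng.random(1000)

for X_chunk, y_chunk in iter_chunks():
    # Drop completely sparse examples: all-zero features and zero label.
    keep = (X_chunk != 0).any(axis=1) | (y_chunk != 0)
    if keep.any():
        sgd.partial_fit(X_chunk[keep], y_chunk[keep])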

Related

not able to resolve LinAlgError: Last 2 dimensions of the array must be square [duplicate]

I need to solve a set of simultaneous equations of the form Ax = B for x. I've used the numpy.linalg.solve function, inputting A and B, but I get the error 'LinAlgError: Last 2 dimensions of the array must be square'. How do I fix this?
Here's my code:
import numpy

A = numpy.matrix([[v1x, v2x], [v1y, v2y], [v1z, v2z]])  # v* values computed earlier
print(A)
B = [(p2x - p1x - nmag[0]), (p2y - p1y - nmag[1]), (p2z - p1z - nmag[2])]
print(B)
x = numpy.linalg.solve(A, B)
The values of the matrix/vector are calculated earlier in the code and this works fine, but the values are:
A =
(-0.56666301, -0.52472909)
(0.44034147, 0.46768087)
(0.69641397, 0.71129036)
B =
(-0.38038602567630364, -24.092279373295057, 0.0)
x should have the form (x1,x2,0)
In case you still haven't found an answer, or in case someone in the future has this question.
To solve Ax=b:
numpy.linalg.solve uses LAPACK gesv. As mentioned in the documentation of LAPACK, gesv requires A to be square:
LA_GESV computes the solution to a real or complex linear system of equations AX = B, where A is a square matrix and X and B are rectangular matrices or vectors. Gaussian elimination with row interchanges is used to factor A as A = P*L*U, where P is a permutation matrix, L is unit lower triangular, and U is upper triangular. The factored form of A is then used to solve the above system.
If the matrix A is not square, it means that you have either more variables than equations or the other way around. In these situations, you can have no solution or an infinite number of solutions. What determines the solution space is the rank of the matrix compared to the number of columns. Therefore, you first have to check the rank of the matrix.
That being said, you can use another method to solve your system of linear equations. I suggest having a look at factorization methods like LU or QR or even SVD. In LAPACK you can use getrs; in Python you can do different things:
first do the factorization like QR and then feed the resulting matrices to a method like scipy.linalg.solve_triangular
solve the least-squares using numpy.linalg.lstsq
Also have a look here where a simple example is formulated and solved.
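For instance, a minimal sketch of the lstsq route using the numbers quoted in the question:
import numpy

A = numpy.array([[-0.56666301, -0.52472909],
                 [ 0.44034147,  0.46768087],
                 [ 0.69641397,  0.71129036]])
B = numpy.array([-0.38038602567630364, -24.092279373295057, 0.0])

# lstsq handles a non-square A by minimizing ||Ax - B||_2.
x, residuals, rank, sv = numpy.linalg.lstsq(A, B, rcond=None)
print(x)  # the two coefficients x1, x2; the expected third component is 0 by construction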
A square matrix is a matrix with the same number of rows and columns. The matrix you are passing is 3 by 2. Add a column of zeroes to fix this problem.

Is t-SNE's computational bottleneck its memory complexity?

I've been exploring different dimensionality reduction algorithms, specifically PCA and t-SNE. I'm taking a small subset of the MNIST dataset (with ~780 dimensions) and attempting to reduce the raw data down to three dimensions to visualize as a scatter plot. t-SNE is described in great detail here.
I'm using PCA as an intermediate dimensional reduction step prior to T-SNE, as described by the original creators of T-SNE on the source code from their website.
I'm finding that t-SNE takes forever to run (10-15 minutes to go from a 2000 x 25 to a 2000 x 3 feature space), while PCA runs relatively quickly (a few seconds for a 2000 x 780 => 2000 x 20).
Why is this the case? My theory is that in the PCA implementation (directly from the primary author's source code in Python), he uses NumPy dot products between X and X.T:
import numpy as Math

def pca(X=Math.array([]), no_dims=50):
    """Runs PCA on the NxD array X in order to reduce its dimensionality to no_dims dimensions."""
    print("Preprocessing the data using PCA...")
    (n, d) = X.shape
    X = X - Math.tile(Math.mean(X, 0), (n, 1))  # center the data
    (l, M) = Math.linalg.eig(Math.dot(X.T, X))  # eigendecomposition of the scatter matrix
    Y = Math.dot(X, M[:, 0:no_dims])            # project onto the top eigenvectors
    return Y
As far as I recall, this is significantly more efficient than element-by-element operations, and it also means that only 2N elements (where N is the number of rows) need to be in memory at a time (one row of X and one column of X.T).
However, I don't think this is the root reason. T-SNE definitely also contains vector operations, for example, when calculating the pairwise distances D:
D = Math.add(Math.add(-2 * Math.dot(X, X.T), sum_X).T, sum_X);
Or, when calculating P (higher dimension) and Q (lower dimension). In t-SNE, however, you have to create two N x N matrices to store the pairwise distances between each pair of data points, one for the original high-dimensional representation and the other for the reduced-dimensional one.
In computing your gradient, you also have to create another N x N matrix called PQ, which is P - Q.
It seems to me that the memory complexity here is the bottleneck. t-SNE requires on the order of 3N^2 memory. There is no way this can fit in cache, so the algorithm experiences significant cache misses and needs to go to main memory to retrieve the values.
Is this correct? How do I explain to a client or a reasonable non-technical person why t-SNE is slower than PCA?
The co-author's Python implementation is found here.
The main reason for t-SNE being slower than PCA is that no analytical solution exists for the criterion being optimised. Instead, a solution must be approximated through gradient descent iterations.
In practice, this means lots of for loops. Not least the main iteration for-loop at line 129, which runs up to max_iter=1000 times. Additionally, the x2p function iterates over all data points with a for loop.
The reference implementation is optimised for readability, not for computational speed. The authors link to an optimised Torch implementation as well, which should speed up the computation a lot. If you want to stay in pure Python, I recommend the implementation in Scikit-Learn, which should also be a lot faster.
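For example, a minimal sketch of the same PCA-then-t-SNE pipeline in scikit-learn (random data standing in for the MNIST subset):
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(2000, 780)                     # stand-in for the MNIST subset

X_pca = PCA(n_components=25).fit_transform(X)     # fast: closed-form eigendecomposition
X_3d = TSNE(n_components=3).fit_transform(X_pca)  # slow: iterative gradient descent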
t-SNE tries to lower the dimensionality while preserving the distributions of distances between elements.
This requires computing distances between all the points. The pairwise distance matrix has N^2 entries, where N is the number of examples.

Estimation of fundamental matrix or essential matrix from feature matching

I am estimating the fundamental matrix and the essential matrix using the built-in functions in OpenCV. I provide input points to the functions using ORB and a brute-force matcher. These are the problems I am facing:
1. The essential matrix that I compute from the built-in function does not match the one I find from mathematical computation using the fundamental matrix as E = K.t() * F * K.
2. As I vary the number of points used to compute F and E, the values of F and E keep changing. The function uses the RANSAC method. How do I know which value is the correct one?
3. I am also using a built-in function to decompose E and find the correct R and T from the 4 possible solutions. The values of R and T also change with the changing E. More concerning is the fact that the direction vector T changes without a pattern: say it was in the X direction for one value of E; if I change E, it changes to Y or Z. Why is this happening? Has anyone else had the same problem?
How do I resolve this problem? My project involves taking measurements of objects from images.
Any suggestions or help would be welcome!
Both F and E are defined only up to a scale factor. It may help to normalize the matrices, e.g. by dividing by the last element.
RANSAC is a randomized algorithm, so you will get a different result every time. You can test how much it varies by triangulating the points, or by computing the reprojection errors. If the results vary too much, you may want to increase the number of RANSAC trials or decrease the distance threshold, to make sure that RANSAC converges to the correct solution.
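As a minimal sketch of that normalization (F and K are random stand-ins here; in practice they come from cv2.findFundamentalMat and your camera calibration):
import numpy as np

def normalize(M):
    return M / M[2, 2]  # divide by the last element to remove the free scale

F = np.random.rand(3, 3)            # stand-in fundamental matrix
K = np.array([[700.0, 0.0, 320.0],  # stand-in camera intrinsics
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])
E_from_F = normalize(K.T @ F @ K)   # now comparable to normalize(E) from OpenCV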
Yes, computing the fundamental matrix gives a different matrix every time, as it is defined only up to a scale factor.
It is a rank-2 matrix with 7 DOF (3 rot, 3 trans, 1 scaling).
The fundamental matrix is a 3x3 matrix; F33 (3rd row, 3rd column) is the scale factor.
You may ask why we fix the matrix entry F33 to a constant: because (x_left)' * F * (x_right) = 0 is a homogeneous equation with infinitely many solutions, we add a constraint by making F33 constant.

Multiple linear regression in python without fitting the origin?

I found this chunk of code on http://rosettacode.org/wiki/Multiple_regression#Python, which does a multiple linear regression in Python. Printing b in the following code gives you the coefficients of x1, ..., xN. However, this code fits the line through the origin (i.e. the resulting model does not include a constant term).
All I'd like to do is the exact same thing except I do not want to fit the line through the origin, I need the constant in my resulting model.
Any idea if it's a small modification to do this? I've searched and found numerous documents on multiple regression in Python, except they are lengthy and overly complicated for what I need. This code works perfectly, except I need a model that includes an intercept rather than being forced through the origin.
import numpy as np
from numpy.random import random

n = 100
k = 10
y = np.mat(random((1, n)))            # 1 x n response
X = np.mat(random((k, n)))            # k x n predictors (observations as columns)
b = y * X.T * np.linalg.inv(X * X.T)  # normal-equations solution
print(b)
Any help would be appreciated. Thanks.
You only need to add a row of all 1's to X.
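A minimal sketch of that change applied to the Rosetta Code snippet (same random data):
import numpy as np
from numpy.random import random

n = 100
k = 10
y = np.mat(random((1, n)))
X = np.mat(np.vstack([random((k, n)), np.ones((1, n))]))  # extra all-ones row
b = y * X.T * np.linalg.inv(X * X.T)
print(b)  # k slope coefficients followed by the intercept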
Maybe a more stable approach would be to use a least squares algorithm anyway. This can also be done in numpy in a few lines. Read the documentation about numpy.linalg.lstsq.
Here you can find an example implementation:
http://glowingpython.blogspot.de/2012/03/linear-regression-with-numpy.html
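A minimal sketch of the lstsq route with an intercept, using random stand-in data:
import numpy as np

n, k = 100, 10
X = np.random.random((n, k))          # observations as rows here
y = np.random.random(n)

A = np.column_stack([X, np.ones(n)])  # the ones column yields the intercept
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef[:-1], coef[-1])            # slopes, then intercept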
What you have written out, b = y * X.T * np.linalg.inv(X * X.T), is the solution to the normal equations, which gives the least-squares fit with a multi-linear model. swang's response is correct (and EMS's elaboration): you need to add a row of 1's to X. If you want some idea of why it works theoretically, keep in mind that you are finding b_i such that
y_j = sum_i b_i x_{ij}.
By adding a row of 1's, you are setting x_{(k+1)j} = 1 for all j, which means that you are finding b_i such that:
y_j = (sum_i b_i x_{ij}) + b_{k+1}
because the (k+1)-st x_{ij} term is always equal to one. Thus, b_{k+1} is your intercept term.

Generalized least square on large dataset

I'd like to linearly fit the data that were NOT sampled independently. I came across generalized least square method:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
The equation is Matlab format; X and Y are coordinates of the data points, and V is a "variance matrix".
The problem is that due to its size (1000 rows and columns), the V matrix becomes singular and thus un-invertible. Any suggestions for how to get around this problem? Maybe using a way of solving the generalized linear regression problem other than GLS? The tools that I have available and am (slightly) familiar with are Numpy/Scipy, R, and Matlab.
Instead of:
b=(X'*V^(-1)*X)^(-1)*X'*V^(-1)*Y
Use
b = (X'/V*X)\(X'/V*Y)
That is, replace all instances of X*(Y^-1) with X/Y. MATLAB will skip calculating the inverse (which is expensive and error-prone) and compute the division directly.
Edit: Even with the best matrix manipulation, some operations are not possible (and lead to errors like the one you describe).
An example that may be relevant to your problem is trying to solve a least-squares problem under the constraint that multiple measurements are perfectly, 100% correlated. Except in rare, degenerate cases this cannot be accomplished, either mathematically or physically. You need some independence in the measurements to account for measurement noise or modeling errors. For example, if you have two measurements, each with a variance of 1, and perfectly correlated, then your V matrix would look like this:
V = [1 1; ...
1 1];
And you would never be able to fit to the data. (This generally means you need to reformulate your basis functions, but that's a longer essay.)
However, if you adjust your measurement variance to allow for some small amount of independence between the measurements, then it would work without a problem. For example, 95% correlated measurements would look like this:
V = [1 0.95; ...
0.95 1 ];
You can use singular value decomposition as your solver. It'll do the best that can be done.
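A minimal sketch of that route in NumPy (random stand-ins for X, Y, and V); pinv computes the Moore-Penrose pseudoinverse via SVD, so it tolerates a singular V where a plain inverse fails:
import numpy as np

n, k = 1000, 3
X = np.random.rand(n, k)
Y = np.random.rand(n)
V = np.eye(n)           # stand-in variance matrix

Vi = np.linalg.pinv(V)  # SVD-based pseudoinverse, robust to singular V
b = np.linalg.solve(X.T @ Vi @ X, X.T @ Vi @ Y)
print(b)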
I usually think about least squares another way. You can read my thoughts here:
http://www.scribd.com/doc/21983425/Least-Squares-Fit
See if that works better for you.
I don't understand how the size is an issue. If you have N (x, y) pairs, you still only have to solve for (M+1) coefficients in an M-th order polynomial:
y = a0 + a1*x + a2*x^2 + ... + am*x^m
