fit through origin via matrix algebra - python

Usually I use the following code to carry out a linear fit or a quadratic fit. Sometimes it is necessary to weight the model by 1/x^2, using weight=2. I would like to know if I can force the model through the origin by adding some matrix algebra (obviously in the weight=0 case). Thanks.
import numpy
from pylab import *

data = loadtxt('...')
degree = 1
weight = 0
x, y, w = data[:,0], data[:,1], 1/data[:,0]**weight
n = len(data)
d = degree + 1

# build the design matrix: f[i,j] = x_i**j
f = zeros(n*d).reshape((n,d))
for i in range(0, n):
    for j in range(0, d):
        f[i,j] = x[i]**j

# weighted normal equations: (f^T q f) coeffs = f^T q y
q = diag(w)
fT = dot(transpose(f), q)
fTx = dot(fT, f)
fTy = dot(fT, y)
coeffs = dot(inv(fTx), fTy)

For the weight=0 case, get rid of the constant term in your feature vector by changing for j in range(0,d) to for j in range(1,d).
For larger values of your weight term, the weights associated with 1/x^p terms would have to be zero, which probably won't happen in the ordinary least squares solution.
For best numpy practices, I would suggest that you replace zeros(n*d).reshape((n,d)) with zeros( (n,d) ) and dot(inv(fTx),fTy) with linalg.solve(fTx,fTy).
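Putting those two suggestions together, here is a minimal sketch of the through-origin fit, using made-up data for illustration:

import numpy as np

# made-up data for illustration only
x = np.array([1., 2., 3., 4., 5.])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
degree = 1
d = degree + 1

# start the powers at 1 instead of 0, so there is no constant column
f = np.column_stack([x**j for j in range(1, d)])

# unweighted normal equations, solved without forming an explicit inverse
coeffs = np.linalg.solve(f.T @ f, f.T @ y)
print(coeffs)  # slope only; the fit is forced through the origin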

Related

Gurobi/python euclidean/manhattan distance calculations inside of a constraint, addConstr(...)

Is it possible to use the sklearn pairwise_distances function inside an addConstr(...) for computing the distance between 2 D-dimensional points in the constraint? I'd like to do something like this:
for i in range(N):
    for j in range(N):
        constr1 = m.addConstr(pairwise_distances(X[i,:], Y[j,:]) <= 50, name="constr1")
        # X and Y are N by D numpy arrays
I know that Gurobi only allows optimization functions that are linear, piecewise linear, or quadratic (as far as I know), so most other functions like sqrt aren't implementable (even in constraints). So trying to manually implement distance calculations involving square roots or absolute values may not be possible in Gurobi constraints, though if I'm wrong, I'd like to be corrected.
Also how would I do this if Y wasn't a pre-set numpy array but actually a gurobi variable I'm trying to find?
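Since Gurobi does accept quadratic constraints, one direction suggested by the reasoning above is to compare the squared Euclidean distance against 50**2, avoiding the square root entirely. A hedged sketch, where the model setup and variable bounds are illustrative rather than from the question:

import gurobipy as gp
import numpy as np

N, D = 5, 3
X = np.random.rand(N, D) * 100  # fixed points

m = gp.Model()
# the case where Y is a Gurobi variable to be found, not a pre-set array
Y = m.addVars(N, D, lb=0.0, ub=100.0, name="Y")

for i in range(N):
    for j in range(N):
        # squared distance is a quadratic expression, which addConstr accepts
        sqdist = gp.quicksum((X[i, k] - Y[j, k]) * (X[i, k] - Y[j, k])
                             for k in range(D))
        m.addConstr(sqdist <= 50**2, name="constr1_%d_%d" % (i, j))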

How does numpy.linalg.eigh compare to numpy.linalg.svd?

Problem description
For a square matrix one can obtain the SVD
X = U S V'
decomposition simply by using numpy.linalg.svd:
u, s, vh = numpy.linalg.svd(X)
or by using numpy.linalg.eigh to compute the eigendecomposition of the Hermitian matrices X'X and XX'.
Do they use the same algorithm? Do they call the same LAPACK routine?
Is there any difference in terms of speed and stability?
Indeed, numpy.linalg.svd and numpy.linalg.eigh do not call the same LAPACK routine: numpy.linalg.eigh relies on LAPACK's dsyevd(), while numpy.linalg.svd makes use of LAPACK's dgesdd().
The common point between these routines is the use of Cuppen's divide and conquer algorithm, first designed to solve tridiagonal eigenvalue problems. For instance, dsyevd() only handles Hermitian matrices and, if eigenvectors are required, performs the following steps:
Reduce the matrix to tridiagonal form using dsytrd()
Compute the eigenvectors of the tridiagonal matrix using the divide and conquer algorithm, through dstedc()
Apply the Householder reflections reported by dsytrd() using dormtr()
On the contrary, to compute the SVD, dgesdd() performs the following steps in the case job==A (both U and VT required):
Bidiagonalize A using dgebrd()
Compute the SVD of the bidiagonal matrix using the divide and conquer algorithm, through dbdsdc()
Revert the bidiagonalization using the matrices P and Q returned by dgebrd(), applying dormbr() twice: once for U and once for V
While the actual operations performed by LAPACK are very different, the strategies are globally similar. It may stem from the fact that computing the SVD of a general matrix A is similar to performing the eigendecomposition of the symmetric matrix A^T A.
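That connection is easy to verify numerically: the singular values of A are the square roots of the eigenvalues of A^T A. A small sketch with random data:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))

u, s, vh = np.linalg.svd(A)
w = np.linalg.eigvalsh(A.T @ A)  # eigenvalues in ascending order

# singular values squared match the eigenvalues of A^T A
print(np.allclose(np.sort(s**2), w))  # True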
Regarding the accuracy and performance of the LAPACK divide and conquer SVD, see this survey of SVD methods:
They often achieve the accuracy of QR-based SVD, though it is not proven
The worst case is O(n^3) if no deflation occurs, but it often proves better than that
The memory requirement is 8 times the size of the matrix, which can become prohibitive
Regarding the symmetric eigenvalue problem, the complexity is (4/3)n^3 (but often proves better than that) and the memory footprint is about 2n^2 plus the size of the matrix. Hence, the best choice is likely numpy.linalg.eigh if your matrix is symmetric.
The actual complexity can be computed for your particular matrices using the following code:
import time
import numpy as np
from matplotlib import pyplot as plt
from scipy.optimize import curve_fit

# see https://stackoverflow.com/questions/41109122/fitting-a-curve-to-a-power-law-distribution-with-curve-fit-does-not-work
def func_powerlaw(x, m, c):
    return np.log(np.abs(x**m * c))

timeev = []
timesvd = []
size = []
for n in range(10, 600):
    print(n)
    size.append(n)
    # populate A: a 1D diffusion (tridiagonal) matrix
    A = np.zeros((n, n))
    for j in range(n):
        A[j, j] = 2.
        if j > 0:
            A[j-1, j] = -1.
        if j < n-1:
            A[j+1, j] = -1.
    # time the eigendecomposition
    Aev = A.copy()
    start = time.time()
    w, v = np.linalg.eigh(Aev, UPLO='L')
    end = time.time()
    timeev.append(end - start)
    # time the SVD
    Asvd = A.copy()
    start = time.time()
    u, s, vh = np.linalg.svd(Asvd)
    end = time.time()
    timesvd.append(end - start)

# fit a power law to the second half of the measurements
half = len(size) // 2
poptev, pcov = curve_fit(func_powerlaw, size[half:], np.log(timeev[half:]), p0=[2.1, 1e-7], maxfev=8000)
print(poptev)
poptsvd, pcov = curve_fit(func_powerlaw, size[half:], np.log(timesvd[half:]), p0=[2.1, 1e-7], maxfev=8000)
print(poptsvd)

fig, ax = plt.subplots()
ax.plot(size, timeev, label="eigh")
ax.plot(size, [np.exp(func_powerlaw(x, poptev[0], poptev[1])) for x in size], label="eigh-adjusted complexity: " + str(poptev[0]))
ax.plot(size, timesvd, label="svd")
ax.plot(size, [np.exp(func_powerlaw(x, poptsvd[0], poptsvd[1])) for x in size], label="svd-adjusted complexity: " + str(poptsvd[0]))
ax.set_xlabel('n')
ax.set_ylabel('time, s')
ax.legend(loc="lower right")
ax.set_yscale("log")
fig.tight_layout()
plt.savefig('eigh.jpg')
plt.show()
For such 1D diffusion matrices, eigh outperforms svd, but the actual complexities are similar, slightly lower than n^3: something like n^2.5.
The accuracy could be checked in the same way.
No, they do not use the same algorithm, as they do different things. They are somewhat related but also very different. Let's start with the fact that you can do an SVD on m x n matrices, where m and n don't need to be the same.
Which routine is called depends on your version of numpy. Here are the eigenvalue routines in LAPACK for double precision:
http://www.netlib.org/lapack/explore-html/d9/d8e/group__double_g_eeigen.html
And the according SVD routines:
http://www.netlib.org/lapack/explore-html/d1/d7e/group__double_g_esing.html
There are differences in routines. Big differences. If you care about the details, they are specified very well in the Fortran headers. In many cases it makes sense to find out what kind of matrix you have in front of you in order to make a good choice of routine. Is the matrix symmetric/Hermitian? Is it in upper diagonal form? Is it positive semidefinite? ...
There are ginormous differences in runtime. But as a rule of thumb, EIGs are cheaper than SVDs. It also depends on convergence speed, which in turn depends a lot on the condition number of the matrix, in other words, how ill-conditioned the matrix is, ...
SVDs are usually very robust but slow algorithms, oftentimes used for inversion, speed optimisation through truncation, and principal component analysis, really with the expectation that the matrix you are dealing with is just a pile of shitty rows ;)

sparse least squares regression

I am trying to fit a linear regression Ax = b where A is a sparse matrix and b a sparse vector. I tried scipy.sparse.linalg.lsqr, but apparently b needs to be a numpy (dense) array. Indeed, if I run
import scipy.sparse
import scipy.sparse.linalg

A = [list(range(0, 10)) for i in range(0, 15)]
A = scipy.sparse.coo_matrix(A)
b = list(range(0, 15))
b = scipy.sparse.coo_matrix(b)
scipy.sparse.linalg.lsqr(A, b)
I end up with:
AttributeError: squeeze not found
While
scipy.sparse.linalg.lsqr(A,b.toarray())
seems to work.
Unfortunately, in my case b is a 1.5 billion x 1 vector and I simply can't use a dense array. Does anybody know a workaround, or other libraries for running linear regression with a sparse matrix and vector?
It seems that the documentation specifically asks for a numpy array. However, given the scale of your problem, maybe it's easier to use the closed-form solution of linear least squares?
Given that you want to solve Ax = b, you can form the normal equations and solve those instead; in other words, you solve min ||Ax - b||.
The closed-form solution is x = (A^T A)^{-1} A^T b.
Of course, this closed-form solution comes with its own requirements (specifically, on the rank of the matrix A).
You can solve for x using spsolve or, if that's too expensive, using an iterative solver (like conjugate gradients) to get an inexact solution.
The code would be:
import scipy.sparse
import scipy.sparse.linalg
import scipy.linalg

A = scipy.sparse.rand(1500, 1000, 0.5)  # create a random instance
b = scipy.sparse.rand(1500, 1, 0.5)
x = scipy.sparse.linalg.spsolve(A.T * A, A.T * b)
x_lsqr = scipy.sparse.linalg.lsqr(A, b.toarray())  # just for comparison
print(scipy.linalg.norm(x_lsqr[0] - x))
which on a few random instances consistently gave me values less than 1e-7.
Apparently billions of observations is too much for my machine. I ended up:
Changing the algorithm to Stochastic Gradient Descent (SGD): faster with many observations
Removing completely sparse examples (i.e. rows whose features and label are all zero)
Indeed, the update rule of SGD with a least squares loss function is always zero for the observations removed in step 2. This reduced the number of observations from billions to millions, which turned out to be feasible with SGD on my machine.
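The poster does not say which SGD implementation was used; as one possible sketch, assuming a recent scikit-learn, SGDRegressor accepts sparse input directly, and all-zero rows can be dropped from a CSR matrix beforehand:

import numpy as np
import scipy.sparse
from sklearn.linear_model import SGDRegressor  # one possible SGD implementation

# synthetic sparse data for illustration
A = scipy.sparse.rand(100000, 50, 0.01, format='csr')
b = np.asarray(A.sum(axis=1)).ravel()

# step 2: remove completely sparse examples (rows with no nonzeros)
nonzero_rows = np.diff(A.indptr) > 0
A, b = A[nonzero_rows], b[nonzero_rows]

# step 1: least squares loss fitted by SGD
reg = SGDRegressor(loss="squared_error", max_iter=1000)
reg.fit(A, b)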

How to compute scipy sparse matrix determinant without turning it to dense?

I am trying to figure out the fastest method to find the determinant of sparse symmetric and real matrices in Python, using the scipy sparse module, but I am really surprised that there is no determinant function. I am aware I could use LU factorization to compute the determinant, but I don't see an easy way to do it, because the return of scipy.sparse.linalg.splu is an object and instantiating dense L and U matrices is not worth it: I may as well do sp.linalg.det(A.todense()) where A is my scipy sparse matrix.
I am also a bit surprised that others have not faced the problem of efficient determinant computation within scipy. How would one use splu to compute the determinant?
I looked into pySparse and scikits.sparse.cholmod. The latter is not practical right now for me: it needs package installations, and I am also not sure how fast the code is before I go to all the trouble.
Any solutions? Thanks in advance.
Here are some references I provided as part of an answer here.
I think they address the actual problem you are trying to solve:
notes for an implementation in the Shogun library
Erlend Aune, Daniel P. Simpson: Parameter estimation in high dimensional Gaussian distributions, particularly section 2.1 (arxiv:1105.5256)
Ilse C.F. Ipsen, Dean J. Lee: Determinant Approximations (arxiv:1105.0437)
Arnold Reusken: Approximation of the determinant of large sparse symmetric positive definite matrices (arxiv:hep-lat/0008007)
Quoting from the Shogun notes:
The usual technique for computing the log-determinant term in the likelihood expression relies on Cholesky factorization of the matrix, i.e. Σ = LL^T (L is the lower triangular Cholesky factor), and then using the diagonal entries of the factor to compute log(det(Σ)) = 2 ∑_{i=1}^{n} log(L_ii). However, for sparse matrices, as covariance matrices usually are, the Cholesky factors often suffer from fill-in phenomena - they turn out to be not so sparse themselves. Therefore, for large dimensions this technique becomes infeasible because of a massive memory requirement for storing all these irrelevant non-diagonal coefficients of the factor. While ordering techniques have been developed to permute the rows and columns beforehand in order to reduce fill-in, e.g. approximate minimum degree (AMD) reordering, these techniques depend largely on the sparsity pattern and are therefore not guaranteed to give a better result.
Recent research shows that using a number of techniques from complex analysis, numerical linear algebra and greedy graph coloring, we can, however, approximate the log-determinant up to an arbitrary precision [Aune et al., 2012]. The main trick lies within the observation that we can write log(det(Σ)) as trace(log(Σ)), where log(Σ) is the matrix-logarithm.
The "standard" way to solve this problem is with a cholesky decomposition, but if you're not up to using any new compiled code, then you're out of luck. The best sparse cholesky implementation is Tim Davis's CHOLMOD, which is licensed under the LGPL and thus not available in scipy proper (scipy is BSD).
You can use scipy.sparse.linalg.splu to obtain sparse matrices for the lower (L) and upper (U) triangular matrices of an M=LU decomposition:
import numpy as np
from scipy.sparse.linalg import splu

lu = splu(M)
The determinant det(M) can be then represented as:
det(M) = det(LU) = det(L)det(U)
The determinant of triangular matrices is just the product of the diagonal terms:
diagL = lu.L.diagonal()
diagU = lu.U.diagonal()
d = diagL.prod()*diagU.prod()
However, for large matrices underflow or overflow commonly occurs, which can be avoided by working with the logarithms.
diagL = diagL.astype(np.complex128)
diagU = diagU.astype(np.complex128)
logdet = np.log(diagL).sum() + np.log(diagU).sum()
Note that I invoke complex arithmetic to account for negative numbers that might appear in the diagonals. Now, from logdet you can recover the determinant:
det = np.exp(logdet) # usually underflows/overflows for large matrices
whereas the sign of the determinant can be calculated directly from diagL and diagU (important for example when implementing Crisfield's arc-length method):
sign = swap_sign*np.sign(diagL).prod()*np.sign(diagU).prod()
where swap_sign is a term to account for the number of row permutations in the LU decomposition. Thanks to Luiz Felippe Rodrigues, it can be calculated as:
def minimumSwaps(arr):
    """
    Minimum number of swaps needed to order a
    permutation array
    """
    # from https://www.thepoorcoder.com/hackerrank-minimum-swaps-2-solution/
    a = dict(enumerate(arr))
    b = {v: k for k, v in a.items()}
    count = 0
    for i in a:
        x = a[i]
        if x != i:
            y = b[i]
            a[y] = x
            b[x] = y
            count += 1
    return count

swap_sign = (-1)**minimumSwaps(lu.perm_r)
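Putting the pieces together, a small helper along these lines could look as follows (the function name is mine, and it follows the answer in using only lu.perm_r for the sign):

import numpy as np
from scipy.sparse.linalg import splu

def sparse_sign_logdet(M):
    """Return (sign, log|det(M)|) of a sparse matrix via SuperLU."""
    lu = splu(M.tocsc())
    diagL = lu.L.diagonal()
    diagU = lu.U.diagonal()
    # sum of logs of absolute values avoids under/overflow of the raw product
    logdet = np.log(np.abs(diagL)).sum() + np.log(np.abs(diagU)).sum()
    sign = ((-1)**minimumSwaps(lu.perm_r)
            * np.sign(diagL).prod() * np.sign(diagU).prod())
    return sign, logdet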
Things start to go wrong with the determinant of the sparse tridiagonal (-1, 2, -1) matrix around N=1e6, using both SuperLU and CHOLMOD.
The determinant should be N+1.
It is probably propagation of error when calculating the product of the U diagonal:
from scipy.sparse import diags
from scipy.sparse.linalg import splu
from sksparse.cholmod import cholesky
from math import exp

n = int(5e6)
K = diags([-1.], -1, shape=(n, n)) + diags([2.], shape=(n, n)) + diags([-1.], 1, shape=(n, n))

lu = splu(K.tocsc())
diagL = lu.L.diagonal()
diagU = lu.U.diagonal()
det = diagL.prod() * diagU.prod()
print(det)

factor = cholesky(K.tocsc())
ld = factor.logdet()
print(exp(ld))
Output:
4999993.625461911
4999993.625461119
Even if U is accurate to about 1 part in 10^13, this might be expected:
n=int(5e6)
print(n*diags([1-0.00000000000025],0,shape=(n,n)).diagonal().prod())
4999993.749444371

Multiple linear regression in python without fitting the origin?

I found this chunk of code on http://rosettacode.org/wiki/Multiple_regression#Python, which does a multiple linear regression in Python. print(b) in the following code gives you the coefficients of x1, ..., xN. However, this code fits the line through the origin (i.e. the resulting model does not include a constant).
All I'd like to do is the exact same thing, except I do not want to fit the line through the origin; I need the constant in my resulting model.
Any idea if it's a small modification to do this? I've searched and found numerous documents on multiple regression in Python, except they are lengthy and overly complicated for what I need. This code works perfectly, except I need a model with an intercept rather than one through the origin.
import numpy as np
from numpy.random import random
n=100
k=10
y = np.mat(random((1,n)))
X = np.mat(random((k,n)))
b = y * X.T * np.linalg.inv(X*X.T)
print(b)
Any help would be appreciated. Thanks.
You only need to add a row of all 1s to X.
Maybe a more stable approach would be to use a least squares algorithm anyway. This can also be done in numpy in a few lines. Read the documentation about numpy.linalg.lstsq.
Here you can find an example implementation:
http://glowingpython.blogspot.de/2012/03/linear-regression-with-numpy.html
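For example, a least squares fit with an intercept via numpy.linalg.lstsq might look like the sketch below (random data for illustration; note that lstsq expects observations as rows, whereas the question's X stores them as columns):

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 10
X = rng.random((n, k))  # observations as rows
y = rng.random(n)

# append a column of ones so the last coefficient is the intercept
Xc = np.hstack([X, np.ones((n, 1))])
coef, residuals, rank, sv = np.linalg.lstsq(Xc, y, rcond=None)
print(coef[-1])  # the intercept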
What you have written out, b = y * X.T * np.linalg.inv(X * X.T), is the solution to the normal equations, which gives the least squares fit with a multi-linear model. swang's response is correct (as is EMS's elaboration): you need to add a row of 1's to X. If you want some idea of why it works theoretically, keep in mind that you are finding b_i such that
y_j = sum_i b_i x_{ij}.
By adding a row of 1's, you are setting x_{(k+1)j} = 1 for all j, which means that you are finding b_i such that:
y_j = (sum_i b_i x_{ij}) + b_{k+1}
because the (k+1)-st x term is always equal to one. Thus, b_{k+1} is your intercept term.
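In code, with the question's setup, a sketch using plain ndarrays and @ instead of np.mat:

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 10
y = rng.random((1, n))
X = rng.random((k, n))

# the extra row of ones supplies the intercept term b_{k+1}
X1 = np.vstack([X, np.ones((1, n))])
b = y @ X1.T @ np.linalg.inv(X1 @ X1.T)
print(b[0, -1])  # the intercept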
