How to generate a sparse orthogonal matrix in Python?

How can one generate a random sparse orthogonal matrix?
I know there are sparse matrices in the scipy library, but they are generally non-orthogonal. One can exploit QR factorization, but it does not necessarily preserve sparsity.

As a preliminary thought, you could partition the matrix into diagonal blocks, fill those blocks with orthogonal factors obtained via QR, and then permute rows/columns. The resulting matrices will remain orthogonal. Alternatively, you could define some sparsity pattern for Q and try to minimize f(Q, xi) subject to QQ^T=I, where f is some (preferably) convex function that adds entropy through the random variable xi. Can't say anything about the efficacy of either method since I haven't actually tried them.
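Here is a minimal sketch of the block-diagonal idea, assuming NumPy/SciPy; the block size, block count, and use of random Gaussian blocks are illustrative choices, not part of the original suggestion:

import numpy as np
import scipy.sparse as sp

def random_sparse_orthogonal(n_blocks=100, block_size=10, seed=0):
    """Sparse orthogonal matrix from random orthogonal diagonal blocks plus permutations."""
    rng = np.random.default_rng(seed)
    blocks = []
    for _ in range(n_blocks):
        # Orthogonal factor of a random Gaussian block
        q, _ = np.linalg.qr(rng.standard_normal((block_size, block_size)))
        blocks.append(sp.csr_matrix(q))
    Q = sp.block_diag(blocks, format="csr")
    n = Q.shape[0]
    # Row and column permutations preserve orthogonality
    Q = Q[rng.permutation(n), :][:, rng.permutation(n)]
    return Q

Q = random_sparse_orthogonal()
print(Q.nnz / (Q.shape[0] * Q.shape[1]))              # density stays at 1/n_blocks
print(abs(Q @ Q.T - sp.identity(Q.shape[0])).max())   # ~1e-15, i.e. Q Q^T = I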
EDIT: A bit more about the second method. f can really be any function. One choice might be similarity of the non-zero elements to a random Gaussian vector (or any other random variate): f = ||vec(Q) - x||_2^2, x ~ N(0, sigma * I). You could handle this using any general constrained optimizer. The problem, of course, is that not every sparsity pattern S is guaranteed to have a (full-rank) orthogonal filling. If you have the memory, L1 regularization (or a smooth approximation) could encourage sparsity in a dense matrix variable: g(Q) = f(Q) + P(Q), where P is any sparsity-inducing penalty function. Check out Wen & Yin (2010), "A Feasible Method for Optimization with Orthogonality Constraints", for an algorithm specifically designed for optimization of general (differentiable) functions over (dense) orthogonal matrices, and Liu, Wu & So (2015), "Quadratic Optimization with Orthogonality Constraints", for a more theoretical evaluation of several line/arc search algorithms for quadratic functions. If memory is a problem, you could generate each row/column separately using sparse basis pursuit, for which there are many algorithms depending on the nature of your problem. See Qu, Sun & Wright (2015), "Finding a sparse vector in a subspace: linear sparsity using alternating directions", and Bian et al. (2015), "Sparse null space basis pursuit and analysis dictionary learning for high-dimensional data analysis", for algorithm details, though in both cases you will have to incorporate/replace constraints to promote orthogonality to all previously generated vectors.
It's also worth noting there are sparse QR algorithms that return Q as the product of sparse/structured matrices. If you are concerned about storage space alone, this might be the simplest method to create large, efficient orthogonal operators.

Related

Computing top eigenvalues, operator norm of sparse matrix

I have a large sparse square non-normal matrix: 73080 rows, but only 6 nonzero entries per row (all equal to 1.0). I'd like to compute the two largest eigenvalues, as well as the operator (2-) norm, ideally with Python. The natural way for me to store this matrix is with scipy's csr_matrix, especially since I'll be multiplying it with other sparse matrices. However, I don't see a good way to compute the relevant statistics: scipy.sparse.linalg's norm method doesn't have the 2-norm implemented, converting to a dense matrix seems like it would be a bad idea, and running scipy.sparse.linalg.eigs seems to run extremely, maybe prohibitively, slowly, and in any event it computes lots of data that I just don't need. I suppose I could subtract off the spectral projector corresponding to the top eigenvalue, but then I'd still need to know the top eigenvalue of the new matrix, which I'd like to do with an out-of-the-box method if at all possible, and in any event this wouldn't continue to work after multiplying with other large sparse matrices.
However, these kinds of computations seem to be doable: the top of page 6 of this paper seems to have data on the eigenvalues of ~10000-row matrices. If this is not feasible in Python, is there another way I should try to do this? Thanks in advance.
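One hedged way to attack this with SciPy (not from the thread; the matrix below is only an illustrative stand-in for the real 73080-row one): get the two largest-magnitude eigenvalues with eigs and the operator 2-norm as the largest singular value with svds.

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Illustrative stand-in for the real 73080 x 73080 csr_matrix with 6 ones per row
rng = np.random.default_rng(0)
n = 5000
rows = np.repeat(np.arange(n), 6)
cols = rng.integers(0, n, size=6 * n)
A = sp.csr_matrix((np.ones(6 * n), (rows, cols)), shape=(n, n))

# Two largest-magnitude eigenvalues (the matrix is non-normal, so they may be complex)
top_two = spla.eigs(A, k=2, which="LM", return_eigenvectors=False)

# Operator 2-norm = largest singular value
sigma_max = spla.svds(A, k=1, return_singular_vectors=False)[0]

print(top_two, sigma_max)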

Efficient way to solve matrix equation in Python

Right now I am using numpy.linalg.solve to solve my matrix equation, but the fact that I am using it on a 5000*17956 matrix makes it really time-consuming. It runs really slowly, and it has taken me more than an hour to solve. The running time is probably O(n^3) for solving a matrix equation, but I never thought it would be that slow. Is there any way to solve it faster in Python?
My code is something like the following, to solve for a in the equation B^T * U^T = B^T * B * a, where m is the number of test cases (in my case over 5000), B is an m*17956 data matrix, and U is 1*m.
import numpy as np

# B (m*17956 data matrix) and U (1*m) are assumed to be loaded already
C = 0.005                      # hyperparameter term for regularization
I = np.identity(17956)         # 17956*17956 identity matrix
rhs = np.dot(B.T, U.T)         # (17956*m) * (m*1) = 17956*1
lhs = np.dot(B.T, B) + C*I     # (17956*m) * (m*17956) = 17956*17956
a = np.linalg.solve(lhs, rhs)  # solve B.T B a = B.T u for a (17956*1)
Update (2 July 2018): The updated question asks about the impact of a regularization term and the type of data in the matrices. In general, this can make a large impact in terms of the datatypes a particular CPU is most optimized for (as a rough rule of thumb, AMD is better with vectorized integer math and Intel is better with vectorized floating point math when all other things are held equal), and the presence of a large number of zero values can allow for the use of sparse matrix libraries. In this particular case though, the changes on the main diagonal (well under 1% of all the values in consideration) will have a negligible impact in terms of runtime.
TLDR;
An hour is reasonable (a cubic regression suggests that this would take around 83 minutes on my machine -- a low-end chromebook).
The pre-processing to generate lhs and rhs account for almost none of that time.
You won't be able to solve that exact problem much faster than with numpy.linalg.solve.
If m is small as you suggest and if B is invertible, you can instead solve the equation U.T=Ba in a minute or less.
If this is part of a larger problem, this costly intermediate step might be able to be simplified away from a mathematical framework.
Performance bottlenecks really should be addressed with profiling to figure out which step is causing the issues.
Since this comes from real-world data, you might be able to get away with fewer features (either directly or through a reduction step like PCA, NMF, or LLE), depending on the end goal.
As mentioned in another answer, if the matrix is sufficiently sparse you can get away with sparse linear algebra routines to great effect (many natural language processing data sources are like this).
Since the output is a 1D vector, I would use np.dot(U, B).T instead of np.dot(B.T, U.T). Transposes are neat that way. This avoids doing the transpose on a big matrix like B, though since you have a cubic operation as the dominant step this doesn't matter much for your problem.
Depending on whether you need the original data anymore and if the matrices involved have any other special properties, you might be able to fiddle with the parameters in scipy.linalg.solve instead for a gain (see the sketch after this list).
I've had mixed success replacing large matrix equations with block matrix equations falling back on numpy routines. That approach typically saves 5-20% over numpy approaches and takes 1% or so off scipy approaches on my system. I haven't fully explored the reason for the discrepancy.
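For example, a minimal sketch (data and shapes are made-up stand-ins) of exploiting the fact that B.T @ B + C*I is symmetric positive definite for C > 0, so scipy.linalg.solve can be pointed at a Cholesky-based path via assume_a:

import numpy as np
from scipy.linalg import solve

rng = np.random.default_rng(0)
m, p = 500, 1796                      # scaled-down stand-ins for 5000 and 17956
B = rng.standard_normal((m, p))
U = rng.standard_normal((1, m))

C = 0.005
lhs = B.T @ B + C * np.identity(p)    # symmetric positive definite for C > 0
rhs = B.T @ U.T

# assume_a="pos" lets LAPACK use a Cholesky factorization instead of a general LU solve
a = solve(lhs, rhs, assume_a="pos")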
Assuming your matrix is sparse, the scipy.sparse.linalg module will be useful. Here is the documentation for the whole module, and here is the documentation for spsolve.
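As a hedged illustration of that route, assuming the regularized normal equations are assembled as sparse matrices (shapes and density here are placeholders):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

rng = np.random.default_rng(0)
m, p = 500, 1796
B = sp.random(m, p, density=0.01, format="csr", random_state=0)
U = rng.standard_normal((1, m))

C = 0.005
lhs = (B.T @ B + C * sp.identity(p)).tocsc()   # csc is the format spsolve works with best
rhs = B.T @ U.T                                # dense (p, 1) right-hand side

a = spsolve(lhs, rhs)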

Optimize search for the convex hull

I have a large matrix, or 2D array, M of floats. Right now, my matrix has 10,000 rows and 31 columns. Each row in this matrix represents a vector. I am looking to compute the convex hull of the set of rows.
Since this matrix is quite large, I am looking for a fast approach. My current approach uses this package which can be as slow as O(n²), where n is the number of vectors. My goal is to scale this algorithm to even larger matrices.
Are there faster approaches than the O(n²) speed?
I prefer to use Python, but I'm not looking for code. I'm looking for a general algorithm that I can code on my own.
For fixed dimension d, Chazelle[1] gave an optimal algorithm in 1993 that requires O(n^[d/2]) in the worst case, where n is the number of points and [d/2] denotes integer division by 2. The scipy.spatial package uses QHull which can be as slow as O(n^2) for 2D and 3D points. The O(n^2) bound does not hold for arbitrary dimension.
Most practical algorithms that compute convex hulls in arbitrary dimension implement randomized incremental construction, which is what QHull does as well, I believe. Computing the convex hull in high dimension is generally considered a hard problem. Check out this FAQ[2].
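For reference, a minimal QHull call through SciPy, which is what the scipy.spatial package provides; the dimensionality here is reduced to 5 as a placeholder, since 31 dimensions would be very expensive:

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
points = rng.standard_normal((10_000, 5))   # stand-in for the 10,000 x 31 matrix

hull = ConvexHull(points)                   # QHull, randomized incremental construction
print(hull.vertices.shape)                  # indices of the rows that lie on the hull
print(hull.simplices.shape)                 # facets of the hull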
I programmed an algorithm in O(n log h) in C# that I call Ouellet Convex Hull. All the code is provided in the link, and much more.
Today I finalized the "Online" version, which enables you to feed it one point at a time and stay in O(log h) per point. It is faster than Chan's by at least 2 times in general usage.
The article does not talk about the "online" portion (I will start to write something tomorrow), but the code is accessible on GitHub in the project OuelletConvexHullAvl2Online.
Usage:
OuelletConvexHullAvl2Online.ConvexHullOnline convexHullOnline = new OuelletConvexHullAvl2Online.ConvexHullOnline();
foreach (Point pt in points)
{
    convexHullOnline.DynamicallyAddAnotherPointToConvexHullIfAppropriate(pt);
}
return convexHullOnline.GetResultsAsArrayOfPoint();
Convex hull computation is lower-bounded by Ω(n · log n) in the general case (arbitrary number of dimensions), since it can essentially be reduced to a sorting problem.
However, you are working in 2D, a special case known as the "planar convex hull". Thus, you can use Chan's algorithm, which improves the bound to O(n · log h), where n is the total number of points and h is the number of points on the convex hull (solutions of this kind are known as "output-sensitive algorithms" for obvious reasons).
Finally, just a remark: your algorithm, as well as many other alternatives, has average complexity O(n · log n), although in the worst case it grows to O(n^2). Thus, it is somewhat misleading to state that it has quadratic complexity, since the worst case tends to be very infrequent.

Logistic regression python solvers' definitions

I am using the logistic regression function from sklearn, and was wondering what each of the solver is actually doing behind the scenes to solve the optimization problem.
Can someone briefly describe what "newton-cg", "sag", "lbfgs" and "liblinear" are doing?
Well, I hope I'm not too late for the party! Let me first try to establish some intuition before digging into loads of information (warning: this is not a brief comparison, TL;DR)
Introduction
A hypothesis h(x) takes an input and gives us the estimated output value.
This hypothesis can be as simple as a one-variable linear equation, up to a very complicated and long multivariate equation, depending on the type of algorithm we're using (e.g. linear regression, logistic regression, etc.).
Our task is to find the best Parameters (a.k.a Thetas or Weights) that give us the least error in predicting the output. We call the function that calculates this error a Cost or Loss Function, and apparently, our goal is to minimize the error in order to get the best-predicted output!
One more thing to recall: the relation between a parameter value and its effect on the cost function (i.e. the error) looks like a bowl-shaped curve (i.e. quadratic; recall this because it's important).
So if we start at any point on that curve and keep taking the derivative (i.e. the slope of the tangent line) at each point we stop at (assuming it's a univariate problem; otherwise, if we have multiple features, we take the partial derivatives), we will end up at the so-called global optimum, as shown in this image:
If we take the partial derivative at the minimum cost point (i.e. global optima) we find the slope of the tangent line = 0 (then we know that we reached our target).
That’s valid only if we have a Convex Cost Function, but if we don’t, we may end up stuck at what is called Local Optima; consider this non-convex function:
Now you should have an intuition about the relationship between what we are doing and the terms: derivative, tangent line, cost function, hypothesis, etc.
Side Note: The above-mentioned intuition is also related to the Gradient Descent Algorithm (see later).
Background
Linear Approximation:
Given a function, f(x), we can find its tangent at x=a. The equation of the tangent line L(x) is: L(x)=f(a)+f′(a)(x−a).
Take a look at the following graph of a function and its tangent line:
From this graph we can see that near x=a, the tangent line and the function have nearly the same graph. On occasion, we will use the tangent line, L(x), as an approximation to the function, f(x), near x=a. In these cases, we call the tangent line the "Linear Approximation" to the function at x=a.
Quadratic Approximation:
Same as a linear approximation, but this time we are dealing with a curve where the tangent line alone is not a good enough approximation near the point of interest.
Instead, we use the parabola as it's shown in the following graph:
In order to fit a good parabola, the parabola and the function should have the same value, the same first derivative, AND the same second derivative at the point a. The formula will be (just out of curiosity): Qa(x) = f(a) + f'(a)(x-a) + f''(a)(x-a)^2/2
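A small numeric sanity check of the two approximations, using f(x) = exp(x) at a = 0 as an arbitrary example:

import numpy as np

f = np.exp
a, x = 0.0, 0.5
fa = f1a = f2a = f(a)                            # exp is its own first and second derivative

L = fa + f1a * (x - a)                           # linear approximation
Q = fa + f1a * (x - a) + f2a * (x - a)**2 / 2    # quadratic approximation

print(f(x), L, Q)   # 1.6487..., 1.5, 1.625 -- the quadratic approximation is closer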
Now we should be ready to do the comparison in detail.
Comparison between the methods
1. Newton’s Method
Recall the motivation for the gradient descent step at x: we minimize the quadratic function (i.e. Cost Function).
Newton’s method uses in a sense a better quadratic function minimisation.
It's better because it uses the quadratic approximation (i.e. first AND second partial derivatives).
You can imagine it as a twisted Gradient Descent with the Hessian (the Hessian is a square matrix of second-order partial derivatives of order n X n).
Moreover, the geometric interpretation of Newton's method is that at each iteration one approximates f(x) by a quadratic function around xn, and then takes a step towards the maximum/minimum of that quadratic function (in higher dimensions, this may also be a saddle point). Note that if f(x) happens to be a quadratic function, then the exact extremum is found in one step.
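To make the update concrete, here is a toy Newton step on a hand-picked convex quadratic (this only illustrates the "solve against the Hessian" idea, not what sklearn's newton-cg does internally):

import numpy as np

def grad(w):
    # Gradient of f(w) = w0^2 + 3*w1^2, a simple convex quadratic
    return np.array([2.0 * w[0], 6.0 * w[1]])

def hessian(w):
    return np.array([[2.0, 0.0], [0.0, 6.0]])

w = np.array([5.0, -3.0])
d = np.linalg.solve(hessian(w), -grad(w))   # Newton step: solve H d = -grad
w = w + d
print(w)   # [0. 0.] -- one step lands on the exact minimum of a quadratic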
Drawbacks:
It’s computationally expensive because of the Hessian Matrix (i.e. second partial derivatives calculations).
It is attracted to saddle points, which are common in multivariable optimization (i.e. points where the partial derivatives disagree over whether the input should be a maximum or a minimum!).
2. Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm:
In a nutshell, it is an analogue of Newton’s Method, yet here the Hessian matrix is approximated using updates specified by gradient evaluations (or approximate gradient evaluations). In other words, using estimation to the inverse Hessian matrix.
The term Limited-memory simply means it stores only a few vectors that represent the approximation implicitly.
If I dare say, when the dataset is small, L-BFGS performs relatively the best compared to the other methods, especially because it saves a lot of memory. However, there are some "serious" drawbacks: if it is unsafeguarded, it may not converge to anything.
Side note: This solver has become the default solver in sklearn LogisticRegression since version 0.22, replacing LIBLINEAR.
3. A Library for Large Linear Classification:
It's a library for linear classification that supports logistic regression and linear support vector machines.
The solver uses a Coordinate Descent (CD) algorithm that solves optimization problems by successively performing approximate minimization along coordinate directions or coordinate hyperplanes.
LIBLINEAR is the winner of the ICML 2008 large-scale learning challenge. It applies automatic parameter selection (a.k.a. L1 regularization) and is recommended when you have a high-dimensional dataset (recommended for solving large-scale classification problems).
Drawbacks:
It may get stuck at a non-stationary point (i.e. non-optima) if the level curves of a function are not smooth.
Also cannot run in parallel.
It cannot learn a true multinomial (multiclass) model; instead, the optimization problem is decomposed in a “one-vs-rest” fashion, so separate binary classifiers are trained for all classes.
Side note: According to Scikit Documentation: The “liblinear” solver was the one used by default for historical reasons before version 0.22. Since then, the default use is Limited-memory Broyden–Fletcher–Goldfarb–Shanno Algorithm.
4. Stochastic Average Gradient:
The SAG method optimizes the sum of a finite number of smooth convex functions. Like stochastic gradient (SG) methods, the SAG method's iteration cost is independent of the number of terms in the sum. However, by incorporating a memory of previous gradient values, the SAG method achieves a faster convergence rate than black-box SG methods.
It is faster than other solvers for large datasets when both the number of samples and the number of features are large.
Drawbacks:
It only supports L2 penalization.
This is not really a drawback, but more like a comparison: although SAG is suitable for large datasets, with a memory cost of O(N), it can be less practical for very large N (as the most recent gradient evaluation for each function needs to be maintained in the memory). This is usually not a problem, but a better option would be SVRG 1, 2 which is unfortunately not implemented in scikit-learn!
5. SAGA:
The SAGA solver is a variant of SAG that also supports the non-smooth penalty L1 option (i.e. L1 Regularization). This is therefore the solver of choice for sparse multinomial logistic regression. It also has a better theoretical convergence compared to SAG.
Drawbacks:
This is not really a drawback, but more like a comparison: SAGA is similar to SAG with regard to memory cost. That is, it's suitable for large datasets, yet in edge cases where the dataset is very large, SVRG 1, 2 would be a better option (unfortunately not implemented in scikit-learn)!
Side note: According to Scikit Documentation: The SAGA solver is often the best choice.
Please note the attributes "Large" and "Small" used in Scikit-Learn and in this comparison are relative. AFAIK, there is no universal unanimous and accurate definition of the dataset boundaries to be considered as "Large", "Too Large", "Small", "Too Small"...etc!
Summary
The following table is taken from Scikit Documentation
Updated Table from the same link above (accessed 02/11/2021):
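For completeness, a short usage sketch showing where the solver choice enters in scikit-learn (the dataset and settings are arbitrary):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=50, random_state=0)

# Since 0.22 the default solver is "lbfgs"; the others can be requested explicitly
for solver in ("lbfgs", "newton-cg", "liblinear", "sag", "saga"):
    clf = LogisticRegression(solver=solver, max_iter=5000)
    clf.fit(X, y)
    print(solver, clf.score(X, y))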
I would like to add my two cents to the terrific answer given by Yahia
My goal is to establish an intuition for how to get from the full gradient descent method to SG, then to SAG, and then to SAGA.
On the Stochastic Gradient (SG) methods.
SG takes advantage of the fact that commonly used loss functions can be written as a sum of per-sample loss functions:
f(w) = (1/n) * sum_{i=1..n} f_i(w),
where w is the weight vector being optimized.
The gradient vector is then written as a sum of per-sample gradient vectors:
grad f(w) = (1/n) * sum_{i=1..n} grad f_i(w).
E.g. the least squares error has this form:
f_i(w) = (x_i^T w - y_i)^2,
where x_i are the features of the i-th sample and y_i is the i-th ground truth value (target, dependent variable).
And the logistic regression loss has this form (with labels y_i in {-1, +1}):
f_i(w) = log(1 + exp(-y_i * x_i^T w)).
SG
The main idea of stochastic gradient descent is that instead of computing the gradient of the whole loss function, we can compute the gradient of f_i(w), the loss function for a single randomly chosen sample, and descend in the direction of that sample gradient instead of the full gradient of f(w). This is much faster. The reasoning is that a uniformly randomly chosen sample gradient is an unbiased estimate of the gradient of the whole loss function.
In practice, SG descent has a worse convergence rate than full gradient descent: roughly O(1/k) for strongly convex objectives versus the linear (geometric) rate of full gradient descent, where k is the number of iterations. But it has faster convergence in terms of the number of flops (simple arithmetic operations), as each iteration requires computing only one gradient instead of n. It also suffers from high variance (indeed, we may not necessarily descend when picking a random i; we may just as well ascend).
SAG
SAG achieves the convergence rate of full gradient descent without making each iteration more expensive in flops than SG (except by a constant factor).
The SAG algorithm for minimizing f(w) is straightforward (for dense feature matrices).
At step 0, pick a point w_0 (leaving aside how you pick it). Initialize n memory cells g_i = 0 for saving the gradients of each f_i at later steps.
At step k, update the weights with an average of the lagged gradients taken from the memory cells (lagged, as they are not all refreshed at every step):
w_{k+1} = w_k - (alpha/n) * sum_{i=1..n} g_i.
Then pick an index j uniformly at random from 1..n and refresh only that single memory cell:
g_j = grad f_j(w_k).
It seems that we're computing the whole sum of lagged gradients at each step, but the nice part is that we can store the cumulative sum S = sum_i g_i as a variable and make a cheap update to it at every step.
We may rewrite the update step a little,
w_{k+1} = w_k - (alpha/n) * S, with S <- S + grad f_j(w_k) - g_j_old,
and see that the sum is updated by the amount grad f_j(w_k) - g_j_old, i.e. only the freshly computed gradient and the one it replaces are needed.
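Here is a toy, unoptimized sketch of the SAG loop just described, for a least squares loss on dense data (the step size and iteration count are arbitrary, and none of the sparse-data tricks are included):

import numpy as np

def sag_least_squares(X, y, step=0.002, n_iter=20000, seed=0):
    """Toy SAG for f(w) = (1/n) * sum_i (x_i.w - y_i)^2 on dense data."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    grads = np.zeros((n, d))   # memory cells: last seen gradient of each f_i
    grad_sum = np.zeros(d)     # running sum of the memory cells
    for _ in range(n_iter):
        w -= step * grad_sum / n                # step along the average lagged gradient
        j = rng.integers(n)
        g_new = 2.0 * (X[j] @ w - y[j]) * X[j]  # fresh gradient of f_j at the current w
        grad_sum += g_new - grads[j]            # cheap update of the running sum
        grads[j] = g_new
    return w

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = X @ w_true
print(np.linalg.norm(sag_least_squares(X, y) - w_true))   # should be close to 0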
However, when we take this descent step we're no longer going in the direction of an unbiased estimate of the full gradient at step k. We're going in the direction of a reduced-variance estimate (in part because we're making a small step), but a biased one. I think this is an important and beautiful thing to understand, so I will cite an excerpt from the SAGA paper:
Suppose that we want to use Monte Carlo samples to estimate EX and that we can compute efficiently EY for another random variable Y that is highly correlated with X. One variance reduction approach is to use the following estimator θ as an approximation to EX: θ := α(X − Y) + EY, for a step size α ∈ [0, 1]. We have that Eθ is a convex combination of EX and EY: Eθ = αEX + (1 − α)EY. The standard variance reduction approach uses α = 1 and the estimate is unbiased, Eθ = EX. The variance of θ is: Var(θ) = α^2 * [Var(X) + Var(Y) − 2 Cov(X, Y)], and so if Cov(X, Y) is big enough, the variance of θ is reduced compared to X, giving the method its name. By varying α from 0 to 1, we increase the variance of θ towards its maximum value (which usually is still smaller than the one for X) while decreasing its bias towards zero.
So we applied a more or less standard variance reduction approach to get from SG to SAG. The variance reduction constant α is equal to 1/n in the SAG algorithm. If Y is the randomly picked lagged gradient g_j (the memory cell) and X is the fresh gradient grad f_j(w_k), the update
uses the estimate of the full gradient in the form (1/n)*(X − Y) + EY, where EY = (1/n) * sum_i g_i is the average of the memory cells.
We mentioned that SG suffers from high variance. So we may say that SAG is SG with a clever method of variance reduction applied to it. I don't want to diminish the significance of the findings: picking a suitable random variable Y is not simple. Now we can play with the variance reduction constant. What if we take a variance reduction constant of 1 and therefore use an unbiased estimate of the full gradient?
SAGA
This is the main idea of SAGA. Take the SAG algorithm and apply an unbiased estimate of the full gradient with variance reduction constant α = 1.
The update step gets bigger and becomes
w_{k+1} = w_k - gamma * (grad f_j(w_k) - g_j_old + (1/n) * sum_{i=1..n} g_i_old),
where gamma is the step size.
Due to the lack of bias, the proof of convergence becomes simple and has better constants than in the SAG case. It also allows for an additional trick enabling l1 regularization: the proximal operator.
Proximal gradient descent step in SAGA
If you don't need l1 regularisation you can skip this part, as there is a whole mathematical theory of proximal operators behind it.
The proximal operator is a generalization of gradient descent in some sense. (An operator is just a function from a vector to a vector; the gradient is an operator, for example.) For a continuous convex function h(u), the proximal operator is defined as
prox_h(v) = argmin_u ( h(u) + (1/2) * ||u - v||_2^2 ).
In other words, it is the same as finding the minimum of h(u), but with a penalty for going too far from the initial point v. The proximal operator is a function from R^n to R^n (vector to vector, just like the gradient) parametrized by h. It is non-expansive (i.e. the distance between x and y does not get bigger after applying the proximal operator to x and y). Its fixed point (prox_h(x*) = x*) is the solution of the optimization problem min_u h(u). Applied iteratively, the proximal operator actually converges to its fixed point (although this is generally not true for arbitrary non-expansive operators, e.g. not true for a rotation). So the simplest algorithm to find a minimum using the proximal operator is just to apply the operator repeatedly, x_{k+1} = prox_h(x_k), and this is similar to gradient descent in some sense. Here is why:
Suppose a differentiable convex function h and, instead of the gradient descent update, a similar backward Euler update: x_{k+1} = x_k - gamma * grad h(x_{k+1}). This update can be viewed as a proximal operator update x_{k+1} = prox_{gamma*h}(x_k), since for the proximal operator we need to find the x minimizing gamma*h(x) + (1/2)*||x - x_k||_2^2, i.e. find x such that gamma * grad h(x) + x - x_k = 0, so x = x_k - gamma * grad h(x).
OK, but why even consider replacing one minimization problem with another (computing the proximal operator is a minimization problem inside a minimization problem)? The answer is that for the most common loss functions the proximal operator either has a closed form or has an efficient approximation method. Take the l1 regularizer h(u) = lambda * ||u||_1. Its proximal operator is called the soft-thresholding operator, and it has the simple component-wise form prox_h(v)_i = sign(v_i) * max(|v_i| - lambda, 0).
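A tiny NumPy sketch of that soft-thresholding operator:

import numpy as np

def soft_threshold(v, lam):
    """Proximal operator of lam * ||.||_1, applied component-wise."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.0, 0.4, 3.0]), 0.5))
# [-1.5 -0.  0.  0.  2.5] -- small entries are zeroed out, which is how l1 promotes sparsity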
Now back to SAGA.
Assume we minimize g(x) + h(x) where g(x) is a smooth convex function and h(x) is a non-smooth convex function (e.g. l1 regularization) but for which we are able to efficiently compute the proximal operator. So the algorithm could first make a gradient descent step for g to reduce g and then apply the proximal operator of h to the result to reduce h. This is the additional trick in SAGA and it is called proximal gradient descent.
Why SAG and SAGA are well suited for very large dataset
Here I am not sure what the sklearn authors meant. Making a guess: a very large dataset probably means that the feature matrix is sparse (has many zeros).
Now let's consider a linearly-parameterized loss function
f(w) = (1/n) * sum_{i=1..n} f_i(w), where each sum term has the special form f_i(w) = phi_i(x_i^T w),
and phi_i is a function of a single variable. Notice that both the cross-entropy loss and the least squares loss have this form. By the chain rule,
grad f_i(w) = phi_i'(x_i^T w) * x_i,
so it's evident that the per-sample gradient is exactly as sparse as x_i.
SAG (and SAGA) apply a clever trick for sparse matrices. The idea is that the weight vector does not need to be updated in every index at every step. The update may be skipped for the indices of the weight vector that correspond to zero features in the sample randomly chosen at step k.
There are other clever tricks in SAG and SAGA. But if you made it so far I invite you to look at the original papers 1 and 2. They are well written.

Performing Decomposition on Sparse Matrices in Python

I'm trying to decompose signals into components (matrix factorization) on a large sparse matrix in Python using the sklearn library.
I made use of scipy's scipy.sparse.csc_matrix to construct my matrix of data. However, I'm unable to perform any analysis such as factor analysis or independent component analysis. The only things I'm able to do are use TruncatedSVD or scipy's scipy.sparse.linalg.svds and perform PCA.
Does anyone know any work-arounds to doing ICA or FA on a sparse matrix in python? Any help would be much appreciated! Thanks.
Given:
M = UΣV^t
The drawback of SVD is that the matrices U and V^t are dense. It doesn't really matter that the input matrix is sparse; U and V^t will be dense. Also, the computational complexity of SVD is O(n^2*m) or O(m^2*n), where n is the number of rows and m the number of columns in the input matrix M, depending on which one is bigger.
It is worth mentioning that SVD gives you the optimal low-rank solution, and if you can live with a somewhat larger loss, measured by the Frobenius norm, you might want to consider the CUR algorithm. It scales to larger datasets with O(n*m):
M ≈ C U R
where C and R are now SPARSE matrices (actual columns and rows of the sparse input).
If you want to look at a python implementation, take a look at pymf. But be a bit careful about that exact implementations since it seems, at this point in time, there is an open issue with the implementation.
Even if the input matrix is sparse, the output will not be a sparse matrix. If the system cannot handle a dense matrix, then it will not be able to handle the results either.
It is usually a best practice to use coo_matrix to establish the matrix and then convert it using .tocsc() to manipulate it.
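A short sketch of that workflow, plus TruncatedSVD, which accepts SciPy sparse input directly (the tiny matrix here is only a placeholder):

import numpy as np
import scipy.sparse as sp
from sklearn.decomposition import TruncatedSVD

# Build in COO form (cheap to construct), then convert for fast arithmetic
rows = np.array([0, 0, 1, 3, 4])
cols = np.array([0, 2, 2, 1, 4])
vals = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
M = sp.coo_matrix((vals, (rows, cols)), shape=(5, 5)).tocsc()

# TruncatedSVD works on sparse input, which is why it succeeds where FA/ICA did not
svd = TruncatedSVD(n_components=2, random_state=0)
components = svd.fit_transform(M)
print(components.shape)   # (5, 2)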
