I have a large matrix, or 2D array, M of floats. Right now, my matrix has 10,000 rows and 31 columns. Each row in this matrix represents a vector. I am looking to compute the convex hull of the set of rows.
Since this matrix is quite large, I am looking for a fast approach. My current approach uses this package which can be as slow as O(n²), where n is the number of vectors. My goal is to scale this algorithm to even larger matrices.
Are there faster approaches than the O(n²) speed?
I prefer to use Python, but I'm not looking for code. I'm looking for a general algorithm that I can code on my own.
For fixed dimension d, Chazelle[1] gave an optimal algorithm in 1993 that requires O(n^[d/2]) in the worst case, where n is the number of points and [d/2] denotes integer division by 2. The scipy.spatial package uses QHull which can be as slow as O(n^2) for 2D and 3D points. The O(n^2) bound does not hold for arbitrary dimension.
Most practical algorithms that compute convex hulls in arbitrary dimension implement randomized incremental construction, which is what QHull does as well, I believe. Computing the convex hull in high dimension is generally considered a hard problem. Check out this FAQ[2].
I programmed an algorithm in O(n log h) in C# than I call Ouellet Convex Hull. All the code is provided in the link and much more.
Today I finalized the "Online" version which enable you to feed it one point at the time and stay in O (log h) per point. It is faster than Chan by at least 2 times in general usage.
The article does not talk about "online" portion (I will start to write something tomorrow). But the code is accessible on GitHub in project : OuelletConvexHullAvl2Online.
Usage:
OuelletConvexHullAvl2Online.ConvexHullOnline convexHullOnline = new OuelletConvexHullAvl2Online.ConvexHullOnline();
foreach (Point pt in points)
{
convexHullOnline.DynamicallyAddAnotherPointToConvexHullIfAppropriate(pt);
}
return convexHullOnline.GetResultsAsArrayOfPoint();
Convex Hull complexity is low-bounded at ω(n · log n) in the general case (arbitrary number of dimensions), since it can be kind of reduced to a sorting algorithm.
However, you are woking in 2D, a special case known as "planar convex hull". Thus, you can use Chan's algorithm that improves the bound to ω(n · log h), where n is the total number of points and h is the number of points in the convex hull (this kind of solutions are known as "sensitive output algorithms" for obvious reasons).
Finally, just a remark: your algorithm, as well as many other alternatives, have average complexity ω(n · log n), although in the worst case grows up to ω(n^2). Thus, it is kind of wrong to state that it has quadratic complexity, since the worst case tends to be very infrequent.
Related
Recently, I've been interested in Data analysis.
So I researched about how to do machine-learning project and do it by myself.
I learned that scaling is important in handling features.
So I scaled every features while using Tree model like Decision Tree or LightGBM.
Then, the result when I scaled had worse result.
I searched on the Internet, but all I earned is that Tree and Ensemble algorithm are not sensitive to variance of the data.
I also bought a book "Hands-on Machine-learning" by O'Relly But I couldn't get enough explanation.
Can I get more detailed explanation for this?
Though I don't know the exact notations and equations, the answer has to do with the Big O Notation for the algorithms.
Big O notation is a way of expressing the theoretical worse time for an algorithm to complete over extremely large data sets. For example, a simple loop that goes over every item in a one dimensional array of size n has a O(n) run time - which is to say that it will always run at the proportional time per size of the array no matter what.
Say you have a 2 dimensional array of X,Y coords and you are going to loop across every potential combination of x/y locations, where x is size n and y is size m, your Big O would be O(mn)
and so on. Big O is used to compare the relative speed of different algorithms in abstraction, so that you can try to determine which one is better to use.
If you grab O(n) over the different potential sizes of n, you end up with a straight 45 degree line on your graph.
As you get into more complex algorithms you can end up with O(n^2) or O(log n) or even more complex. -- generally though most algorithms fall into either O(n), O(n^(some exponent)), O(log n) or O(sqrt(n)) - there are obviously others but generally most fall into this with some form of co-efficient in front or after that modifies where they are on the graph. If you graph each one of those curves you'll see which ones are better for extremely large data sets very quickly
It would entirely depend on how well your algorithm is coded, but it might look something like this: (don't trust me on this math, i tried to start doing it and then just googled it.)
Fitting a decision tree of depth ‘m’:
Naïve analysis: 2m-1 trees -> O(2m-1 n d log(n)).
each object appearing only once at a given depth: O(m n d log n)
and a Log n graph ... well pretty much doesn't change at all even with sufficiently large numbers of n, does it?
so it doesn't matter how big your data set is, these algorithms are very efficient in what they do, but also do not scale because of the nature of a log curve on a graph (the worst increase in performance for +1 n is at the very beginning, then it levels off with only extremely minor increases to time with more and more n)
Do not confuse trees and ensembles (which may be consist from models, that need to be scaled).
Trees do not need to scale features, because at each node, the entire set of observations is divided by the value of one of the features: relatively speaking, to the left everything is less than a certain value, and to the right - more. What difference then, what scale is chosen?
Right now I am using the numpy.linalg.solve to solve my matrix, but the fact that I am using it to solve a 5000*17956 matrix makes it really time consuming. It runs really slow and It have taken me more than an hour to solve. The running time for this is probably O(n^3) for solving matrix equation but I never thought it would be that slow. Is there any way to solve it faster in Python?
My code is something like that, to solve a for the equation BT * UT = BT*B a, where m is the number of test cases (in my case over 5000), B is a data matrix m*17956, and u is 1*m.
C = 0.005 # hyperparameter term for regulization
I = np.identity(17956) # 17956*17956 identity matrix
rhs = np.dot(B.T, U.T) # (17956*m) * (m*1) = 17956*1
lhs = np.dot(B.T, B)+C*I # (17956*m) * (m*17956) = 17956*17956
a = np.linalg.solve(lhs, rhs) # B.T u = B.T B a, solve for a (17956*1)
Update (2 July 2018): The updated question asks about the impact of a regularization term and the type of data in the matrices. In general, this can make a large impact in terms of the datatypes a particular CPU is most optimized for (as a rough rule of thumb, AMD is better with vectorized integer math and Intel is better with vectorized floating point math when all other things are held equal), and the presence of a large number of zero values can allow for the use of sparse matrix libraries. In this particular case though, the changes on the main diagonal (well under 1% of all the values in consideration) will have a negligible impact in terms of runtime.
TLDR;
An hour is reasonable (a cubic regression suggests that this would take around 83 minutes on my machine -- a low-end chromebook).
The pre-processing to generate lhs and rhs account for almost none of that time.
You won't be able to solve that exact problem much faster than with numpy.linalg.solve.
If m is small as you suggest and if B is invertible, you can instead solve the equation U.T=Ba in a minute or less.
If this is part of a larger problem, this costly intermediate step might be able to be simplified away from a mathematical framework.
Performance bottlenecks really should be addressed with profiling to figure out which step is causing the issues.
Since this comes from real-world data, you might be able to get away with fewer features (either directly or through a reduction step like PCA, NMF, or LLE), depending on the end goal.
As mentioned in another answer, if the matrix is sufficiently sparse you can get away with sparse linear algebra routines to great effect (many natural language processing data sources are like this).
Since the output is a 1D vector, I would use np.dot(U, B).T instead of np.dot(B.T, U.T). Transposes are neat that way. This avoids doing the transpose on a big matrix like B, though since you have a cubic operation as the dominant step this doesn't matter much for your problem.
Depending on whether you need the original data anymore and if the matrices involved have any other special properties, you might be able to fiddle with the parameters in scipy.linalg.solve instead for a gain.
I've had mixed success replacing large matrix equations with block matrix equations falling back on numpy routines. That approach typically saves 5-20% over numpy approaches and takes 1% or so off scipy approaches on my system. I haven't fully explored the reason for the discrepancy.
Assuming your matrix is sparse, the scipy.sparse.linalg module will be useful. Here is the documentation for the whole module, and here is the documentation for spsolve.
How can one generate random sparse orthogonal matrix?
I know there is a sparse matrices in scipy library but they are generally non-orthogonal. One can exploit QR-factorization, but it is not necessarily preserves sparsity.
As a preliminary thought you could partition the matrix into diagonal blocks, fill those blocks with QR and then permute rows/columns. The resulting matrices will remain orthagonal. Alternatively, you could define some sparsity pattern for Q and try to minimize f(Q, xi) subject to QQ^T=I where f is some (preferably) convex function that adds entropy through the random variable xi. Can't say anything about the efficacy of either method since I haven't actually tried them.
EDIT: A bit more about the second method. f can really be any function. One choice might be similarity of the non-zero elements to a random gaussian vector (or any other random variate): f = ||vec(Q) - x||_2^2, x ~ N(0, sigma * I). You could handle this using any general constrained optimizer. The problem of course, is that not every pattern S is guaranteed to have a (full rank) orthogonal filling. If you have the memory, L1 regularization (or a smooth approximation) could encourage sparsity in a dense matrix variable: g(Q) = f(Q) + P(Q) where P is any sparsity-inducing penalty function. Check out Wen & Yen (2010) "A feasible Method for Optimization with Orthogonality Constraints" for an algorithm specifically designed for optimization of general (differentiable) functions over (dense) orthogonal matrices and Liu, Wu, So (2015) "Quadratic Optimization with Orthogonality Constraints" for more theorical evaluation of several line/arc search algorithms for quadratic functions. If memory is a problem, you could generate each row/column separately using sparse basis pursuit, for which there are many algorithms depending on the nature of your problem. See Qu, Sun and Wright (2015) "Finding a sparse vector in a subspace: linear sparsity using alternate directions" and Bian et al (2015) "Sparse null space basis pursuit and analysis dictionary learning for high-dimensional data analysis" for algorithm details, though in both cases you will have to incorporate/replace constraints to promote orthogonality to all previous vectors.
It's also worth noting there are sparse QR algorithms that return Q as the product of sparse/structured matrices. If you are concerned about storage space alone, this might be the simplest method to create large, efficient orthogonal operators.
I need to find the diameter of the points cloud (two points with maximum distance between them) in 3-dimensional space. As a temporary solution, right now I'm just iterating through all possible pairs and comparing the distance between them, which is a very slow, O(n^2) solution.
I believe it can be done in O(n log n). It's a fairly easy task in 2D (just find the convex hull and then apply the rotating calipers algorithm), but in 3D I can't imagine how to use rotating calipers, since there is no way to order the points.
Is there any simple way to do it (or ready-to-use implementation in python or C/C++)?
PS: There are similar questions on StackOverflow, but the answers that I found only refers to Rotating Calipers (or similar) algorithms, which works fine in 2D situation but not really clear how to implement in 3D (or higher dimensionals).
While O(n log n) expected time algorithms exist in 3d, they seem tricky to implement (while staying competitive to brute-force O(n^2) algorithms).
An algorithm is described in Har-Peled 2001. The authors provide a source code than can optionally be used for optimal computation. I was not able to download the latest version, the "old" version could be enough for your purpose, or you might want to contact the authors for the code.
An alternative approach is presented in Malandain & Boissonnat 2002 and the authors provide code. Altough this algorithm is presented as approximate in higher dimensions, it could fit your purpose. Note that their code provide an implementation of Har-Peled's method for exact computation that you might also check.
In any case, in a real-world usage you should always check that your algorithm remains competitive with respect to the naïve O(n^2) approach.
Until now I used numpy.linalg.eigvals to calculate the eigenvalues of quadratic matrices with at least 1000 rows/columns and, for most cases, about a fifth of its entries non-zero (I don't know if that should be considered a sparse matrix). I found another topic indicating that scipy can possibly do a better job.
However, since I have to calculate the eigenvalues for hundreds of thousands of large matrices of increasing size (possibly up to 20000 rows/columns and yes, I need ALL of their eigenvalues), this will always take awfully long. If I can speed things up, even just the tiniest bit, it would most likely be worth the effort.
So my question is: Is there a faster way to calculate the eigenvalues when not restricting myself to python?
#HighPerformanceMark is correct in the comments, in that the algorithms behind numpy (LAPACK and the like) are some of the best, but perhaps not state of the art, numerical algorithms out there for diagonalizing full matrices. However, you can substantially speed things up if you have:
Sparse matrices
If your matrix is sparse, i.e. the number of filled entries is k, is such that k<<N**2 then you should look at scipy.sparse.
Banded matrices
There are numerous algorithms for working with matrices of a specific banded structure.
Check out the solvers in scipy.linalg.solve.banded.
Largest Eigenvalues
Most of the time, you don't really need all of the eigenvalues. In fact, most of the physical information comes from the largest eigenvalues and the rest are simply high frequency oscillations that are only transient. In that case you should look into eigenvalue solutions that quickly converge to those largest eigenvalues/vectors such as the Lanczos algorithm.
An easy way to maybe get a decent speedup with no code changes (especially on a many-core machine) is to link numpy to a faster linear algebra library, like MKL, ACML, or OpenBLAS. If you're associated with an academic institution, the excellent Anaconda python distribution will let you easily link to MKL for free; otherwise, you can shell out $30 (in which case you should try the 30-day trial of the optimizations first) or do it yourself (a mildly annoying process but definitely doable).
I'd definitely try a sparse eigenvalue solver as well, though.