Is there a way to know exactly how much memory I will need to do PCA/kPCA in Python?
For example, if I have a matrix of N rows and M columns:
What memory will I need if N = 100, 10000, 100000?
And does M have an effect on the memory needed for PCA/kPCA?
Interesting question. From what I can glean from the paper Online principal component analysis in high dimension: which algorithm to choose? by Cardot & Degras (2015), the complexity of PCA is dominated by the SVD (singular value decomposition) step. Thus, the space complexity of 'batch' (in-memory) PCA is, using your notation, at least O(NM) for the data plus O(M^2) for the covariance matrix, i.e. O(NM + M^2) in total.
This is all backed up by this tutorial paper by Li et al. (Note that your N is their m and your M is their n.) They give more detail though, showing that about 3 copies are made of each matrix, and they show how the choice of algorithm matters.
This answer to a question about NumPy's SVD implementation might help you compute the exact memory footprint of the process. Or check out a tool like memory-profiler, which will also give you an exact answer.
In short, the memory required depends on the algorithm but is very roughly 3 × (NM + M^2) floats. You can assume that each float (double precision) in those arrays needs 8 bytes of memory.
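To turn that rule of thumb into numbers for your cases, here is a minimal sketch (the value M = 1000 is just an assumed example; plug in your own column count):

def pca_memory_gb(n, m):
    # ~3 copies of the N x M data plus the M x M covariance matrix, 8 bytes per double
    return 3 * (n * m + m * m) * 8 / 1e9

for n in (100, 10_000, 100_000):
    print(n, round(pca_memory_gb(n, m=1000), 2), "GB")  # M = 1000 is an assumed example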
I'm trying to solve a system of equations with a 1 million x 1 million square matrix and a right-hand-side vector of 1 million entries.
To do this, I'm using np.linalg.solve(matrix, answers) but it's taking a very long time.
Is there a way to speed it up?
Thanks @Chris, but that doesn't answer the question, since I've also tried using the SciPy module and it still takes a very long time to solve. I don't think my computer can hold that much data in RAM.
OK, for clarity: I've just found out that the matrix I'm trying to solve is a Hilbert matrix.
Please reconsider the need for solving such a HUGE system unless your system is very sparse.
Indeed, it is barely possible even to store the input and output on a PC storage device: the dense input matrix takes 8 TB with double-precision values, and computing the result requires temporary storage of a similar size (at least 8 TB for a dense factorization). Sparse matrices can help a lot if your input matrix is almost entirely zeros, but you need the matrix to contain >99.95% zeros in order to store it in your RAM.
Furthermore, the time complexity of solving a system is O(n m min(n, m)), so O(n^3) in your case (see: this post). That means roughly 10^18 operations. A basic mainstream processor does not exceed 0.5 TFlops; in fact, my relatively good i5-9600KF reaches 0.3 TFlops in the computationally intensive LINPACK benchmark. This means the computation would take about a month assuming it is bounded only by the speed of a mainstream processor. In practice, solving a large system of equations is known to be memory bound, so it will be much slower, because RAM is a bottleneck in modern computers (see: memory wall). So on a mainstream PC this should take from several months to a year, assuming the computation can be done in RAM, which, as said before, is not possible for a dense system. Since even high-end SSDs are about an order of magnitude slower than the RAM of a good PC, you should expect the computation to take several years. Not to mention that a 20 TB high-end SSD is very expensive, and you would have to plan for power outages and OS failures over such a long computation... Again, sparse matrices can help a lot, but note that sparse solvers are significantly slower per entry than dense ones, so they only pay off when the number of non-zero entries is quite small.
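A quick back-of-envelope check of those numbers (a minimal sketch; the 0.3 TFlop/s throughput is the LINPACK figure quoted above, used here as an assumption):

n = 1_000_000
bytes_per_double = 8
storage_tb = n * n * bytes_per_double / 1e12   # ~8 TB for the dense matrix alone
flops = n ** 3                                  # ~1e18 operations for the solve
days = flops / 0.3e12 / 86400                   # at ~0.3 TFlop/s, ignoring memory stalls
print(f"storage: {storage_tb:.0f} TB, compute: {days:.0f} days")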
Such systems are solved on supercomputers (or at least large computing clusters), not regular PCs. This requires distributed computing and tools like MPI and distributed linear solvers. A whole field of research works on making them efficient on large-scale systems.
Note that computing an approximate solution can be faster, but the storage problem has to be solved first...
I saw a post on Stack Overflow where someone showed that the CSR representation of a vector/matrix was slower than the ordinary dense format for various NumPy computations. The speed seems to depend on the computation and on how sparse the vectors or matrices are.
I have lots of sparse vectors (on average about 66% of the entries are 0) of which I would like to take dot products. Note that all elements in my vectors are either 0 or 1. Which representation is better for this (e.g. CSR, a normal dense vector, etc.) in terms of computational speed? Does it depend on how sparse my vectors are? If so, is there a certain sparsity (%) after which one is better than the other?
Any help with this issue is much appreciated! Thanks in advance!
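One way to settle this for your own data is simply to time both representations at the sparsity level you care about. A minimal sketch (the vector length, the 66% zero rate, and the repetition count are all assumptions):

import timeit
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
n = 10_000
x = (rng.random(n) > 0.66).astype(np.float64)   # ~66% zeros, entries are 0 or 1
y = (rng.random(n) > 0.66).astype(np.float64)
x_csr, y_csr = csr_matrix(x), csr_matrix(y)      # CSR stores each vector as a 1 x n matrix

print("dense:", timeit.timeit(lambda: x @ y, number=1000))
print("csr:  ", timeit.timeit(lambda: (x_csr @ y_csr.T).toarray(), number=1000))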
Right now I am using numpy.linalg.solve to solve my system, but with a 5000*17956 matrix it is really time consuming. It runs really slowly, and it has taken me more than an hour to solve. The running time is probably O(n^3) for solving a matrix equation, but I never thought it would be that slow. Is there any way to solve it faster in Python?
My code is something like this, to solve for a in the equation B.T U.T = B.T B a, where m is the number of test cases (over 5000 in my case), B is an m*17956 data matrix, and U is 1*m.
import numpy as np

# B: m x 17956 data matrix, U: 1 x m target vector (both assumed already loaded)
C = 0.005                       # regularization hyperparameter
I = np.identity(17956)          # 17956*17956 identity matrix
rhs = np.dot(B.T, U.T)          # (17956*m) * (m*1) = 17956*1
lhs = np.dot(B.T, B) + C*I      # (17956*m) * (m*17956) = 17956*17956
a = np.linalg.solve(lhs, rhs)   # solve (B.T B + C I) a = B.T U.T for a (17956*1)
Update (2 July 2018): The updated question asks about the impact of the regularization term and of the type of data in the matrices. In general, the datatype can make a large impact, since CPUs are optimized differently for different datatypes (as a rough rule of thumb, AMD is better with vectorized integer math and Intel is better with vectorized floating-point math, all other things being equal), and the presence of a large number of zero values can allow the use of sparse matrix libraries. In this particular case though, the changes on the main diagonal (well under 1% of all the values in consideration) will have a negligible impact on runtime.
TLDR;
An hour is reasonable (a cubic regression suggests that this would take around 83 minutes on my machine -- a low-end chromebook).
The pre-processing to generate lhs and rhs account for almost none of that time.
You won't be able to solve that exact problem much faster than with numpy.linalg.solve.
If m is small as you suggest and if B is invertible, you can instead solve the equation U.T=Ba in a minute or less.
If this is part of a larger problem, this costly intermediate step might be possible to simplify away with a bit of mathematical reframing.
Performance bottlenecks really should be addressed with profiling to figure out which step is causing the issues.
Since this comes from real-world data, you might be able to get away with fewer features (either directly or through a reduction step like PCA, NMF, or LLE), depending on the end goal.
As mentioned in another answer, if the matrix is sufficiently sparse you can get away with sparse linear algebra routines to great effect (many natural language processing data sources are like this).
Since the output is a 1D vector, I would use np.dot(U, B).T instead of np.dot(B.T, U.T). Transposes are neat that way. This avoids doing the transpose on a big matrix like B, though since you have a cubic operation as the dominant step this doesn't matter much for your problem.
Depending on whether you need the original data anymore and whether the matrices involved have any other special properties, you might be able to fiddle with the parameters in scipy.linalg.solve instead for a gain (see the sketch after this list).
I've had mixed success replacing large matrix equations with block matrix equations falling back on numpy routines. That approach typically saves 5-20% over numpy approaches and takes 1% or so off scipy approaches on my system. I haven't fully explored the reason for the discrepancy.
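Regarding that scipy.linalg.solve point: since lhs = B.T B + C*I is symmetric positive definite, you can hint that to the solver. A minimal sketch, assuming lhs and rhs from the question and a SciPy version that supports the assume_a keyword:

from scipy.linalg import solve

# Symmetric positive definite system, so a Cholesky-based path is valid
# and usually faster than the general LU factorization.
a = solve(lhs, rhs, assume_a='pos')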
Assuming your matrix is sparse, the scipy.sparse.linalg module will be useful. Here is the documentation for the whole module, and here is the documentation for spsolve.
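A minimal sketch of that route, assuming lhs and rhs from the question and that lhs really is mostly zeros (otherwise the conversion just wastes time and memory):

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

lhs_sparse = csr_matrix(lhs)   # only worthwhile if lhs is overwhelmingly zeros
a = spsolve(lhs_sparse, rhs)   # sparse direct solve of lhs_sparse @ a = rhs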
Until now I used numpy.linalg.eigvals to calculate the eigenvalues of square matrices with at least 1000 rows/columns and, in most cases, about a fifth of their entries non-zero (I don't know if that should be considered a sparse matrix). I found another topic indicating that SciPy can possibly do a better job.
However, since I have to calculate the eigenvalues for hundreds of thousands of large matrices of increasing size (possibly up to 20000 rows/columns, and yes, I need ALL of their eigenvalues), this will always take an awfully long time. If I can speed things up, even just the tiniest bit, it would most likely be worth the effort.
So my question is: Is there a faster way to calculate the eigenvalues when not restricting myself to python?
@HighPerformanceMark is correct in the comments, in that the algorithms behind numpy (LAPACK and the like) are some of the best, though perhaps not state-of-the-art, numerical algorithms out there for diagonalizing full matrices. However, you can substantially speed things up if you have:
Sparse matrices
If your matrix is sparse, i.e. the number of filled entries k is such that k << N**2, then you should look at scipy.sparse.
Banded matrices
There are numerous algorithms for working with matrices of a specific banded structure.
Check out the routines in scipy.linalg such as scipy.linalg.solve_banded (and, for eigenvalues, scipy.linalg.eig_banded).
Largest Eigenvalues
Most of the time, you don't really need all of the eigenvalues. In fact, most of the physical information comes from the largest eigenvalues, and the rest are simply high-frequency oscillations that are only transient. In that case you should look into eigenvalue solvers that quickly converge to those largest eigenvalues/vectors, such as the Lanczos algorithm.
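A minimal sketch of that last option via SciPy's ARPACK wrapper (Lanczos-based); A_dense stands in for one of your matrices, k=6 is an arbitrary choice, and note that eigsh requires the matrix to be symmetric/Hermitian:

from scipy.sparse import csr_matrix
from scipy.sparse.linalg import eigsh

A = csr_matrix(A_dense)   # A_dense: one of your large symmetric matrices, ~20% non-zero
# Returns only the k largest-magnitude eigenvalues, far cheaper than
# diagonalizing the full matrix with numpy.linalg.eigvals.
largest = eigsh(A, k=6, which='LM', return_eigenvectors=False)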
An easy way to maybe get a decent speedup with no code changes (especially on a many-core machine) is to link numpy to a faster linear algebra library, like MKL, ACML, or OpenBLAS. If you're associated with an academic institution, the excellent Anaconda python distribution will let you easily link to MKL for free; otherwise, you can shell out $30 (in which case you should try the 30-day trial of the optimizations first) or do it yourself (a mildly annoying process but definitely doable).
I'd definitely try a sparse eigenvalue solver as well, though.
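To see which BLAS/LAPACK your current NumPy build is already linked against before spending money or time, one quick check:

import numpy as np
np.show_config()   # prints the BLAS/LAPACK libraries (e.g. MKL or OpenBLAS) NumPy was built with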
I am trying to apply hierarchical clustering to my dataset, which consists of 14039 user vectors. Each vector has 10 features, where each feature is basically the frequency of tags tagged by that user.
I am using the SciPy API for clustering.
Now I need to calculate the pairwise distances between these 14039 users and pass this distance matrix to the linkage function.
import numpy as np
import scipy.cluster.hierarchy as sch

# allUserVector: (14039, 10) array of per-user tag frequencies
Y = sch.distance.pdist(allUserVector, 'cosine')
np.set_printoptions(threshold=np.inf)  # print every entry instead of a summary
print(Y)
But my program gives me a MemoryError while calculating the distance matrix itself:
File "/usr/lib/pymodules/python2.7/numpy/core/numeric.py", line 1424, in array_str
return array2string(a, max_line_width, precision, suppress_small, ' ', "", str)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 306, in array2string
separator, prefix)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 210, in _array2string
format_function = FloatFormat(data, precision, suppress_small)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 392, in __init__
self.fillFormat(data)
File "/usr/lib/pymodules/python2.7/numpy/core/arrayprint.py", line 399, in fillFormat
non_zero = absolute(data.compress(not_equal(data, 0) & ~special))
MemoryError
Any idea how to fix this? Is my dataset too large? I would have thought that clustering 14k users shouldn't be so demanding that it causes a MemoryError.
I am running it on an i3 with 4 GB of RAM.
I need to apply DBSCAN clustering too, but that also needs a distance matrix as input.
Any suggestions appreciated.
Edit: I get the error only when I print Y. Any ideas why?
Well, hierarchical clustering doesn't make that much sense for large datasets. It's actually mostly a textbook example, in my opinion. The problem with hierarchical clustering is that it doesn't really build sensible clusters. It builds a dendrogram, but with 14000 objects the dendrogram becomes pretty much unusable. And very few implementations of hierarchical clustering have non-trivial methods to extract sensible clusters from the dendrogram. Plus, in the general case, hierarchical clustering has complexity O(n^3), which makes it scale really badly to large datasets.
DBSCAN technically does not need a distance matrix. In fact, when you use a distance matrix, it will be slow, as computing the distance matrix already is O(n^2). And even then, you can save the O(n^2) memory cost for DBSCAN by computing the distances on the fly, at the cost of computing each distance twice. DBSCAN visits each point once, so there is next to no benefit from using a distance matrix except the symmetry gain. And technically, you could do some neat caching tricks to reduce even that, since DBSCAN only needs to know which objects are below the epsilon threshold. When epsilon is chosen reasonably, managing the neighbor sets on the fly will use significantly less memory than O(n^2), at the same CPU cost as computing the distance matrix.
Any really good implementation of DBSCAN (it is spelled all uppercase, by the way, as it is an abbreviation, not a scan), however, should have support for index structures and then run in O(n log n) time.
On http://elki.dbs.ifi.lmu.de/wiki/Benchmarking they run DBSCAN on a dataset of 110250 objects with 8 dimensions; the non-indexed variant takes 1446 seconds, the one with an index just 219. That is about 7 times faster, including index buildup. (It's not Python, however.) Similarly, OPTICS is 5 times faster with the index. And in my experiments their k-means implementation was around 6x faster than WEKA's k-means while using much less memory. Their single-link hierarchical clustering is also an optimized O(n^2) implementation; actually, it is the only one I've seen so far that is not the naive O(n^3) matrix-editing approach.
If you are willing to go beyond python, that might be a good choice.
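If you do stay in Python, here is a hedged sketch of DBSCAN without a precomputed distance matrix; scikit-learn is an assumption here, as are the eps and min_samples values, which you would need to tune:

from sklearn.cluster import DBSCAN

# allUserVector: the (14039, 10) array of tag frequencies from the question.
# Passing the raw vectors with metric='cosine' means you never have to build
# and hand over a full 14039 x 14039 distance matrix yourself.
labels = DBSCAN(eps=0.3, min_samples=10, metric='cosine').fit_predict(allUserVector)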
It's possible that you really are running out of RAM. Finding pairwise distances between N objects means storing N^2 distances. In your case, N^2 is going to be 14039 ^ 2 = 1.97 * 10^8. If we assume that each distance takes only four bytes (which is almost certainly not the case, as they have to be held in some sort of data structure which may have non-constant overhead) that works out to 800 megabytes. That's a lot of memory for the interpreter to be working with. 32-bit architectures only allow up to 2 GB of process memory, and just your raw data is taking up around 50% of that. With the overhead of the data structure you could be looking at usage much higher than that -- I can't say how much because I don't know the memory model behind SciPy/numpy.
I would try breaking your data sets up into smaller sets, or not constructing the full distance matrix. You can break it down into more manageable chunks (say, 14 subsets of around 1000 elements) and do nearest-neighbor between each chunk and all of the vectors -- then you're looking at loading an order of magnitude less into memory at any one time (14000 * 1000, 14 times instead of 14000 * 14000 once).
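A minimal sketch of that chunked idea (the chunk size of 1000 and the cosine metric follow the numbers above; what you do with each block depends on your goal):

from scipy.spatial.distance import cdist

# Only a 1000 x 14039 block (~112 MB of float64) is in memory at a time,
# instead of the full 14039 x 14039 matrix.
chunk = 1000
for start in range(0, len(allUserVector), chunk):
    block = cdist(allUserVector[start:start + chunk], allUserVector, metric='cosine')
    # ... find nearest neighbours within this block, then let it be garbage-collected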
Edit: agf is completely right on both counts: I missed your edit, and the problem probably comes about when it tries to construct the giant string that represents your matrix. If it's printing floating-point values, and we assume 10 characters are printed per element and the string is stored with one byte per character, then you're looking at roughly 2 GB of memory usage just for the string.