python: Not enough memory to perform factorization

I am using the Python sparse module to compute an eigenvalue problem. It is a very big sparse matrix, which ends up with a large memory requirement. The strange thing is that I am running on a cluster with 256 GB of memory, which should definitely be enough for my problem, yet I still get the "not enough memory" error below. Could anyone give me a hint on how to work around this issue?
Not enough memory to perform factorization.
Traceback (most recent call last):
File "init_Re620eta1_40X_2Z_omega10.py", line 158, in <module>
exec_stabDiagBatchFFfollow_2D(geometry,baseFlowFolder,baseFlowVarb,baseFlowMethod,h,y_max_factor_EVP,Ny_EVP,Nz_EVP,x_p_stabDiag,x_p_orig,eigSolver,noEigs2solv,noEigs2save,SIGMA0,arnoldiTol,OmegaTol,disc_y,disc_z,y_i_factor_EVP,z_i_factor_EVP,periodicZ,BC_top,customComment,BETA,ALPHA_min,ALPHA_max,noALPHA,ALPHA_start,xp_start,u_0,nu_0,y_cut,ParallelFlowA,comm,rank,RESTART,saveJobStep,saveResultFormat)
File "/lustre/cray/ws8/ws/iagyonwu-Re620eta1/omega10DNS_wTS_icoPerbFoam/LST_functions_linstab2D_Temperal_mpi_multiTracking4.py", line 3893, in exec_stabDiagBatchFFfollow_2D
OMEGA, eigVecs = sp.linalg.eigs(L0, k=noEigs2solv, sigma=SIGMA, v0=options_v0, tol=arnoldiTol)
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1288, in eigs
symmetric=False, tol=tol)
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1046, in get_OPinv_matvec
return SpLuInv(A.tocsc()).matvec
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 907, in __init__
self.M_lu = splu(M)
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 267, in splu
ilu=False, options=_options)
MemoryError
The typical size of the matrix is 50,000 x 50,000, and the structure of L0 is shown below:
[figure: sparsity structure of the sparse matrix L0]
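One workaround worth trying (my own sketch, not from the original thread): when sigma is given, eigs performs a shift-invert step and internally calls splu on L0 - sigma*I, and it is this SuperLU factorization that runs out of memory because of fill-in. Supplying OPinv yourself as a LinearOperator backed by an iterative solver (here GMRES with an incomplete-LU preconditioner) avoids the full factorization; the function name and tolerances below are placeholders:

import scipy.sparse as sp
import scipy.sparse.linalg as spla

def shift_invert_eigs(L0, sigma, k, tol=1e-8):
    # Build the shifted operator that shift-invert ARPACK needs to solve against.
    n = L0.shape[0]
    A = (L0 - sigma * sp.identity(n, format='csc')).tocsc()
    # Incomplete LU: far less fill-in (and memory) than the full splu() factorization.
    ilu = spla.spilu(A, drop_tol=1e-5, fill_factor=20)
    M = spla.LinearOperator((n, n), matvec=ilu.solve)
    def solve(b):
        x, info = spla.gmres(A, b, M=M)
        if info != 0:
            raise RuntimeError("GMRES did not converge (info=%d)" % info)
        return x
    OPinv = spla.LinearOperator((n, n), matvec=solve)
    # eigs accepts OPinv when sigma is given, so no internal splu() is performed.
    return spla.eigs(L0, k=k, sigma=sigma, OPinv=OPinv, tol=tol)

Whether GMRES converges quickly depends on how well the incomplete LU preconditions this particular L0, so the drop_tol and fill_factor values above are only starting points.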

Related

PCA on a large and high-dimensional dataset

I am trying to perform PCA on a large dataset (410,000 samples and 32,000 features) in Python, but sklearn.decomposition.PCA does not work, as the underlying LAPACK implementation can't handle as much data as I have. It throws the following error.
Traceback (most recent call last):
File "main.py", line 47, in <module>
model.fit(x_std.transform(deep_data))
File "/home/lib/python3.6/site-
packages/sklearn/decomposition/_pca.py", line 344, in fit
self._fit(X)
File "/home/lib/python3.6/site-
packages/sklearn/decomposition/_pca.py", line 416, in _fit
return self._fit_full(X, n_components)
File "/home/lib/python3.6/site-
packages/sklearn/decomposition/_pca.py", line 447, in _fit_full
U, S, V = linalg.svd(X, full_matrices=False)
File "/home/lib/python3.6/site-
packages/scipy/linalg/decomp_svd.py", line 125, in svd
compute_uv=compute_uv, full_matrices=full_matrices)
File "/home/lib/python3.6/site-
packages/scipy/linalg/lapack.py", line 605, in _compute_lwork
raise ValueError("Too large work array required -- computation cannot "
ValueError: Too large work array required -- computation cannot be performed with standard 32-bit LAPACK.
I have also tried sklearn.decomposition.IncrementalPCA, but since I don't have any issues with RAM it did not solve my problem; it only introduced more, as it does not allow me to have all 32,000 components if my batch size is smaller than that.
Is there any other implementation of PCA that can handle this much data? I don't necessarily need all 410,000 samples, but I need at least 32,000 so that I can analyze all principal components.
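One workaround to try (my own sketch, not an answer from the thread): the error comes from the workspace-size query for LAPACK's divide-and-conquer SVD driver (gesdd), which scipy.linalg.svd uses by default. The plain gesvd driver needs a much smaller work array, so doing the PCA manually via scipy.linalg.svd with lapack_driver='gesvd' may get past the 32-bit limit, at the cost of speed. The array below is a small placeholder standing in for the real 410,000 x 32,000 data:

import numpy as np
from scipy import linalg

X = np.random.rand(5000, 320)          # placeholder for the real data matrix

Xc = X - X.mean(axis=0)                # center the data, as PCA does internally
U, S, Vt = linalg.svd(Xc, full_matrices=False, lapack_driver='gesvd')

components = Vt                        # principal axes, one per row
explained_variance = (S ** 2) / (X.shape[0] - 1)
scores = U * S                         # samples projected onto the components

If a leading subset of components turns out to be enough after all, scikit-learn's PCA(svd_solver='randomized') avoids the dense LAPACK call entirely.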

Not enough memory to perform factorization - expm / scipy.sparse.linalg.splu

I have been trying to use TieDIE. In a few words, this software includes an algorithm that finds a significant subnetwork when you pass it some query nodes and a network. With smaller networks it works just fine, but the network I am interested in is quite big: it has 21,988 nodes and 360,474 edges. TieDIE generates an initial network kernel using scipy (Matlab is also an option for generating this kernel, but I do not own a license). During the generation of this kernel I get the following error:
Not enough memory to perform factorization.
Traceback (most recent call last):
File "Trials.py", line 44, in <module>
diffuser = SciPYKernel(network_path)
File "lib/kernel_scipy.py", line 83, in __init__
self.kernel = expm(time_T*L)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 602, in expm
return _expm(A, use_exact_onenorm='auto')
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 665, in _expm
X = _solve_P_Q(U, V, structure=structure)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 699, in _solve_P_Q
return spsolve(Q, P)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 198, in spsolve
Afactsolve = factorized(A)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 440, in factorized
return splu(A).solve
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 309, in splu
ilu=False, options=_options)
MemoryError
The most interesting thing is that I am using a cluster computer with 64 CPUs and 700 GB of RAM, and according to ps monitoring the software peaks at about 1.3% memory usage (~10 GB) at some point during execution before crashing. I have been told that there is no limit on RAM usage, so I really have no clue what could be happening.
Maybe someone here could help me find an alternative to scipy, or a way to solve this.
Is it possible that the memory error occurs because only one node is being used? If that is the case, how could I distribute the work across the nodes?
Thanks in advance.
That's right, for a very large network like that you'll need high memory on a single node. The easiest solution is of course a workaround, either:
(1) Is there any way you can reduce the size of your input network while still capturing the relevant biology? Maybe just keep the nodes within 2 steps of your input nodes (see the sketch after this answer)?
(2) Use the new Cytoscape API to do the diffusion for you: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005598 (https://github.com/idekerlab/heat-diffusion)
(3) Use PageRank instead of computing a heat kernel (not ideal, as we've shown that diffusion tends to work better on biological networks).
Hope this helps!
-Evan Paull (TieDIE developer/lead author)
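To make option (1) concrete, here is a minimal sketch (mine, not part of the original answer) that trims the network to everything within 2 steps of the query nodes using networkx before handing it to TieDIE; the file names and query genes are placeholders:

import networkx as nx

G = nx.read_edgelist("network.tsv")          # placeholder: the full 21,988-node network
query_nodes = {"TP53", "EGFR"}               # placeholder query set

keep = set()
for n in query_nodes:
    if n in G:
        # ego_graph collects every node within 'radius' hops of n
        keep |= set(nx.ego_graph(G, n, radius=2).nodes())

G_small = G.subgraph(keep).copy()
nx.write_edgelist(G_small, "network_2step.tsv", data=False)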

Python raises MemoryError despite 16 GB of swap

I'm getting a MemoryError while running some large matrix operations (chroma, CQT, MFCC extraction) with numpy (1.81), scipy (0.17.0) and librosa (0.4.2) on a Jetson TK1 with ~2 GB of RAM and a 16 GB swap file.
Any help is much appreciated!
ERROR MESSAGE
Traceback (most recent call last):
File "./analyze_structure.py", line 480, in <module>
args.cutoff, args.order, args.sr, args.feature, bool(args.as_diff))
File "./analyze_structure.py", line 452, in plotData
tracks)
File "./analyze_structure.py", line 178, in plotStructure
feat, beat_times = extractChroma(filename, file_ext)
File "./analyze_structure.py", line 75, in extractChroma
hop_length=HOP_LENGTH)
File "/usr/local/lib/python2.7/dist-packages/librosa-0.4.2-py2.7.egg/librosa/feature/spectral.py", line 800, in chroma_stft
tuning = estimate_tuning(S=S, sr=sr, bins_per_octave=n_chroma)
File "/usr/local/lib/python2.7/dist-packages/librosa-0.4.2-py2.7.egg/librosa/core/pitch.py", line 82, in estimate_tuning
pitch, mag = piptrack(y=y, sr=sr, S=S, n_fft=n_fft, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/librosa-0.4.2-py2.7.egg/librosa/core/pitch.py", line 270, in piptrack
util.localmax(S * (S > (threshold * S.max(axis=0)))))
File "/usr/local/lib/python2.7/dist-packages/librosa-0.4.2-py2.7.egg/librosa/util/utils.py", line 820, in localmax
x_pad = np.pad(x, paddings, mode='edge')
File "/usr/lib/python2.7/dist-packages/numpy/lib/arraypad.py", line 1364, in pad
newmat = _prepend_edge(newmat, pad_before, axis)
File "/usr/lib/python2.7/dist-packages/numpy/lib/arraypad.py", line 175, in _prepend_edge
axis=axis)
MemoryError
The Jetson TK1 is a 32-bit processor. It doesn't have sufficient virtual address space to access more than 4 GB of RAM from one process.
The kernel can leverage your 16 GB page file to provide 4 GB of RAM to many separate processes, but this still does not expose more than 4 GB of addresses to a single process. It simply allows separate processes to individually use up to 4 GB of RAM (on Linux, you'll most likely have a 2 GB or 3 GB limit depending on your kernel settings).
You should split your work into smaller pieces or use a platform with more address space available.
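One way to split the work (my own sketch, not from this answer): compute the chroma features on fixed-length segments of the file instead of the whole signal at once, using librosa.load's offset and duration arguments, then concatenate the frames. The function and parameter names here are placeholders, and note that newer librosa versions spell the get_duration argument path= instead of filename=:

import numpy as np
import librosa

def chroma_in_chunks(filename, sr=22050, hop_length=512, chunk_s=30.0):
    total_s = librosa.get_duration(filename=filename)   # path= in newer librosa
    chunks = []
    offset = 0.0
    while offset < total_s:
        # load and process only chunk_s seconds of audio at a time
        y, _ = librosa.load(filename, sr=sr, offset=offset, duration=chunk_s)
        chunks.append(librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop_length))
        offset += chunk_s
    return np.concatenate(chunks, axis=1)

Frames near the chunk boundaries will differ slightly from a whole-file computation, because each chunk is windowed independently.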
I believe that's because your processor is 32-bit:
The board has the following devices on-board:
NVIDIA Tegra124 (Tegra K1 32-bit)
A 32-bit install of Python only has about 2 GB of usable address space (as with any 32-bit application by default). Try to refactor your code accordingly.
No amount of swap space will help this, and relying on swap for large calculations is a really bad idea since it takes a long time. Swap is meant for accidental overflows, and not to be relied on.

Memory error for a decision tree with 66k features, using scikit-learn

Problem Statement
I am using a document with 1,600,000 lines and ~66k features.
I am using the bag-of-words approach to build a decision tree.
The following code works fine for a 1,000-line document, but throws a MemoryError for the actual 1,600,000-line document.
My server has 64 GB of RAM.
Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly? Or
are there any options to reduce the default float64 dtype?
Kindly help me with this.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import tree

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(corpus)      # sparse TF-IDF matrix (float64 by default)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.todense(), corpus2)       # .todense() materializes the full matrix -> MemoryError
Error:
Traceback (most recent call last):
File "test123.py", line 103, in <module>
clf = clf.fit(X_train.todense(),corpus2)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
return np.asmatrix(self.toarray())
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
In short, is there any method to use a classification tree on a large dataset with 66k features?
Add dtype=np.float32, e.g. vec = TfidfVectorizer(..., dtype=np.float32).
As for sparse vs. dense, I have a similar problem.
GradientBoostingClassifier, RandomForestClassifier and DecisionTreeClassifier need dense data; for that reason I use SVC.
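As a concrete sketch (mine, and it assumes a scikit-learn version whose tree estimators accept scipy sparse input, as current releases do), the matrix can stay sparse and in float32 with no .todense() call at all; the corpus and labels below are tiny placeholders:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

corpus = ["cats purr loudly", "dogs bark loudly"]    # placeholder documents
labels = ["cat", "dog"]                              # placeholder labels (the original corpus2)

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english', dtype=np.float32)
X_train = vectorizer.fit_transform(corpus)           # stays a sparse CSR matrix in float32

clf = DecisionTreeClassifier()
clf.fit(X_train, labels)                             # fit directly on the sparse matrix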

python scipy sparse matrix SVD with error "ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration"

I was using scipy to do a sparse matrix SVD on some large data.
The matrix is around 200,000 x 8,000,000 in size, with 1.19% non-zero entries.
The machine I was using has 160 GB of memory, so I suppose memory shouldn't be an issue.
So here is some code I used:
from scipy import *
from scipy.sparse import *
import scipy.sparse.linalg as slin
from numpy import *
K=1500
coom=coo_matrix((value,(row,col)),shape=(M,N))
coom=coom.astype('float32')
u,s,v=slin.svds(coom,K,ncv=8*K)
The error message looks like this:
Traceback (most recent call last):
File "sparse_svd.py", line 35, in <module>
u,s,v=slin.svds(coom,K,ncv=2*K+1)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 731, in svds
eigvals, eigvec = eigensolver(XH_X, k=k, tol=tol**2)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 680, in eigsh
params.iterate()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 278, in iterate
raise ArpackError(self.info)
scipy.sparse.linalg.eigen.arpack.arpack.ArpackError: ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV.
When K=1000 (i.e. the number of eigenvalues is 1000), everything is OK. When I try K>=1250, the error begins to appear.
I have also tried various ncv values and still get the same error message...
Any suggestions and help are appreciated.
Thanks a lot :)
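Since increasing ncv has not helped, one alternative that sidesteps ARPACK's restart trouble for large K (my own sketch, not from this thread) is a randomized SVD; sklearn.utils.extmath.randomized_svd works directly on scipy sparse matrices. The matrix and K below are small placeholders for the 200,000 x 8,000,000 data and K=1500:

import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.utils.extmath import randomized_svd

# small placeholder with roughly the same density as the real matrix
coom = sparse_random(2000, 8000, density=0.0119, format='csr',
                     dtype=np.float32, random_state=0)

K = 50                                   # stands in for K=1500
U, s, Vt = randomized_svd(coom, n_components=K, n_iter=5, random_state=0)

The accuracy of the trailing singular values depends on n_iter and n_oversamples, so those are worth increasing if the smallest of the K values matter.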
