PCA on a large, high-dimensional dataset - python

I am trying to perform PCA on a large dataset (410 000 entries and 32 000 features) in Python, but sklearn.decomposition.PCA does not work: the underlying LAPACK implementation can't handle as much data as I have. It throws the following error:
Traceback (most recent call last):
File "main.py", line 47, in <module>
model.fit(x_std.transform(deep_data))
File "/home/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 344, in fit
self._fit(X)
File "/home/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 416, in _fit
return self._fit_full(X, n_components)
File "/home/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 447, in _fit_full
U, S, V = linalg.svd(X, full_matrices=False)
File "/home/lib/python3.6/site-packages/scipy/linalg/decomp_svd.py", line 125, in svd
compute_uv=compute_uv, full_matrices=full_matrices)
File "/home/lib/python3.6/site-packages/scipy/linalg/lapack.py", line 605, in _compute_lwork
raise ValueError("Too large work array required -- computation cannot "
ValueError: Too large work array required -- computation cannot be performed with standard 32-bit LAPACK.
I have also tried sklearn.decomposition.IncrementalPCA, but since I don't have any issues with RAM it did not solve my problem; it only introduced new ones, because it does not let me keep all 32 000 components when the batch size is smaller than that.
Is there any other implementation of PCA that can handle this much data? I don't necessarily need all 410 000 samples, but I need at least 32 000 so that I can analyze all principal components.
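One direction that might work (a sketch of my own, not from the original post): since the number of features (32 000) is far smaller than the number of samples, the principal components can be obtained from the 32 000 x 32 000 feature covariance matrix with numpy's symmetric eigendecomposition, avoiding the full LAPACK SVD of the 410 000 x 32 000 data matrix entirely. A minimal sketch, assuming x_std and deep_data are the scaler and raw data from the traceback above:

import numpy as np

# Hedged sketch: PCA via the feature covariance matrix instead of a full SVD.
X = x_std.transform(deep_data)              # standardized (n_samples, n_features) array
# The covariance matrix is only 32 000 x 32 000 (~8 GB in float64); it can also
# be accumulated over row chunks of X if X itself is too large to hold at once.
cov = (X.T @ X) / (X.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric eigendecomposition, ascending order
order = np.argsort(eigvals)[::-1]           # reorder to descending explained variance
explained_variance = eigvals[order]
components = eigvecs[:, order].T            # one principal axis per row, as in sklearn
scores = X @ components.T                   # samples projected onto all 32 000 components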

Related

How to find the source of a MemoryError in Python?

I'm running a hyperparameter optimization with Hyperopt for a neural network. While doing so, after some iterations, I get a MemoryError exception.
So far I have tried clearing all variables after they had been used (assigning None or empty lists to them; is there a better way to do this?) and printing all locals(), dirs() and globals() together with their sizes, but those counts never increase and the sizes are quite small.
The structure looks like this:
def create_model(params):
    ## load data from temp files
    ## pre-process data accordingly
    ## train the NN with cross-validation, clearing Keras' session every time
    ## save stats and clean all variables (assigning None or empty lists to them)

def Optimize():
    for model in models:  # I have multiple models
        ## load data
        ## save data to temp files
        trials = Trials()
        best_run = fmin(create_model,
                        space,
                        algo=tpe.suggest,
                        max_evals=100,
                        trials=trials)
After X iterations (sometimes it completes the first 100 and moves on to the second model) it throws a memory error.
My guess is that some variables remain in memory and I'm not clearing them, but I wasn't able to detect them.
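One way to narrow this down (my own suggestion, not part of the original post) is the standard-library tracemalloc module, which can compare memory snapshots between iterations and point to the source lines whose allocations keep growing:

import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# ... run a single create_model(params) evaluation here ...

after = tracemalloc.take_snapshot()
# Print the ten source lines whose allocated memory grew the most.
for stat in after.compare_to(before, 'lineno')[:10]:
    print(stat)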
EDIT:
Traceback (most recent call last):
File "Main.py", line 32, in <module>
optimal = Optimize(training_sets)
File "/home/User1/Optimizer/optimization2.py", line 394, in Optimize
trials=trials)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 307, in fmin
return_argmin=return_argmin,
File "/usr/local/lib/python3.5/dist-packages/hyperopt/base.py", line 635, in fmin
return_argmin=return_argmin)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 320, in fmin
rval.exhaust()
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 199, in exhaust
self.run(self.max_evals - n_done, block_until_done=self.async)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 173, in run
self.serial_evaluate()
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 92, in serial_evaluate
result = self.domain.evaluate(spec, ctrl)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/base.py", line 840, in evaluate
rval = self.fn(pyll_rval)
File "/home/User1/Optimizer/optimization2.py", line 184, in create_model
x_train, x_test = x[train_indices], x[val_indices]
MemoryError
It took me a couple of days to figure this out so I'll answer my own question to save whoever encounters this issue some time.
Usually, when using Hyperopt for Keras, the suggested return of the create_model function is something like this:
return {'loss': -acc, 'status': STATUS_OK, 'model': model}
But for large models with many evaluations, you don't want to return and keep every model in memory; all you need is the set of hyperparameters that gave the lowest loss.
By simply removing the model from the returned dict, the issue of memory increasing with each evaluation is resolved.
return {'loss': -acc, 'status': STATUS_OK}
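A small usage note (my addition, not part of the original answer): once fmin has finished, the winning hyperparameter values can be recovered from best_run and the search space with hyperopt's space_eval, so no trained model ever needs to be kept in memory:

from hyperopt import space_eval

# best_run holds the raw values/indices returned by fmin; space_eval maps them
# back onto the hyperparameter space defined in `space`.
best_params = space_eval(space, best_run)
print(best_params)
# The best model can then be retrained once from these parameters instead of
# being stored for every evaluation.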

python: Not enough memory to perform factorization

I am using Python's sparse module (scipy.sparse) to solve an eigenvalue problem. It involves a very large sparse matrix, which leads to a large memory requirement. The strange thing is that I am using a cluster with 256 GB of memory, which should definitely be enough for my problem, yet I still get the "not enough memory" error below. I am wondering if anyone could give me a hint on how to work around this issue?
Not enough memory to perform factorization.
Traceback (most recent call last):
File "init_Re620eta1_40X_2Z_omega10.py", line 158, in <module>
exec_stabDiagBatchFFfollow_2D(geometry,baseFlowFolder,baseFlowVarb,baseFlowMethod,h,y_max_factor_EVP,Ny_EVP,Nz_EVP,x_p_stabDiag,x_p_orig,eigSolver,noEigs2solv,noEigs2save,SIGMA0,arnoldiTol,OmegaTol,disc_y,disc_z,y_i_factor_EVP,z_i_factor_EVP,periodicZ,BC_top,customComment,BETA,ALPHA_min,ALPHA_max,noALPHA,ALPHA_start,xp_start,u_0,nu_0,y_cut,ParallelFlowA,comm,rank,RESTART,saveJobStep,saveResultFormat)
File "/lustre/cray/ws8/ws/iagyonwu-Re620eta1/omega10DNS_wTS_icoPerbFoam/LST_functions_linstab2D_Temperal_mpi_multiTracking4.py", line 3893, in exec_stabDiagBatchFFfollow_2D
OMEGA, eigVecs = sp.linalg.eigs(L0, k=noEigs2solv, sigma=SIGMA, v0=options_v0, tol=arnoldiTol)
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1288, in eigs
symmetric=False, tol=tol)
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1046, in get_OPinv_matvec
return SpLuInv(A.tocsc()).matvec
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 907, in __init__
self.M_lu = splu(M)
File "/opt/python/3.6.1.1/lib/python3.6/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 267, in splu
ilu=False, options=_options)
MemoryError
The typical size of the matrix is 50000x50000, and here is the structure of L0:
[figure: sparsity pattern of the sparse matrix L0]
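For context, a sketch of my own (not from the original post): passing sigma to scipy.sparse.linalg.eigs switches ARPACK into shift-invert mode, and scipy then LU-factorizes (L0 - sigma*I) with SuperLU, which is exactly the splu step that runs out of memory in the traceback. One workaround sometimes suggested is to supply the shift-invert operator yourself through the OPinv argument, applying (L0 - sigma*I)^-1 with an iterative solver and an incomplete-LU preconditioner instead of a full factorization; the tolerances and fill factor below are illustrative assumptions only.

import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hedged sketch: replace the exact SuperLU factorization that eigs() builds
# internally for shift-invert with an approximate one (ILU + GMRES).
n = L0.shape[0]
A_shift = (L0 - SIGMA * sp.identity(n, format='csc')).tocsc()
ilu = spla.spilu(A_shift, drop_tol=1e-5, fill_factor=20)   # incomplete LU: far less fill-in than splu
precond = spla.LinearOperator((n, n), matvec=ilu.solve)

def apply_shift_inverse(b):
    # Approximately solve (L0 - sigma*I) x = b with preconditioned GMRES.
    x, info = spla.gmres(A_shift, b, M=precond, tol=1e-8)
    return x

OPinv = spla.LinearOperator((n, n), matvec=apply_shift_inverse)
OMEGA, eigVecs = spla.eigs(L0, k=noEigs2solv, sigma=SIGMA, OPinv=OPinv,
                           v0=options_v0, tol=arnoldiTol)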

Not enough memory to perform factorization expm scipy.sparse.linalg.splu

I have been trying to use TieDIE. In a few words, this software includes an algorithm that finds a significant subnetwork when you pass it some query nodes and a network. It works just fine with smaller networks, but the network I am interested in is quite big: it has 21988 nodes and 360474 edges. TieDIE generates an initial network kernel using scipy (Matlab is also an option for generating this kernel, but I do not own a license). During the generation of this kernel I get the following error:
Not enough memory to perform factorization.
Traceback (most recent call last):
File "Trials.py", line 44, in <module>
diffuser = SciPYKernel(network_path)
File "lib/kernel_scipy.py", line 83, in __init__
self.kernel = expm(time_T*L)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 602, in expm
return _expm(A, use_exact_onenorm='auto')
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 665, in _expm
X = _solve_P_Q(U, V, structure=structure)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 699, in _solve_P_Q
return spsolve(Q, P)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 198, in spsolve
Afactsolve = factorized(A)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 440, in factorized
return splu(A).solve
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 309, in splu
ilu=False, options=_options)
MemoryError
The most interesting thing is that I am using a cluster computer with 64 CPUs and 700 GB of RAM, and according to ps the software peaks at about 1.3% memory usage (~10 GB) at some point during execution before crashing. I have been told that there is no limit on RAM usage, so I really have no clue about what could be happening.
Maybe someone here could help me find an alternative to scipy, or a way to solve this.
Is it possible that the memory error occurs because only one node is being used? If that is the case, how could I distribute the work across the nodes?
Thanks in advance.
That's right, for a very large network like that you'll need high memory on a single node. The easiest solution is of course a workaround, one of the following:
(1) Is there any way you can reduce the size of your input network while still capturing the relevant biology? Maybe just keep all the nodes within 2 steps of your input nodes?
(2) Use the new Cytoscape API to do the diffusion for you: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005598 (https://github.com/idekerlab/heat-diffusion)
(3) Use PageRank instead of computing a heat kernel (not ideal, as we've shown that diffusion tends to work better on biological networks); a rough sketch of this option follows below.
Hope this helps!
-Evan Paull (TieDIE developer/lead author)
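To illustrate option (3) with a sketch of my own (not part of the original answer), a personalized PageRank over the same network can be computed with networkx, restarting the walk at the query nodes; the names network and query_nodes and the cutoff of 500 nodes are illustrative assumptions.

import networkx as nx

# Hedged sketch of option (3): personalized PageRank as a cheaper stand-in
# for the heat-diffusion kernel.
G = nx.Graph()
G.add_edges_from(network)          # `network` assumed to be an iterable of (source, target) pairs

# Restart only at the query nodes; alpha is the usual damping factor.
personalization = {node: (1.0 if node in query_nodes else 0.0) for node in G}
scores = nx.pagerank(G, alpha=0.85, personalization=personalization)

# Keep the highest-scoring nodes as the candidate subnetwork.
top_nodes = sorted(scores, key=scores.get, reverse=True)[:500]
subnetwork = G.subgraph(top_nodes)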

DBSCAN handling big data crashes and memory error [duplicate]

This question already has answers here:
scikit-learn DBSCAN memory usage
(5 answers)
Closed 5 years ago.
I am running DBSCAN on a dataset of 400K data points. Here is the error I get:
Traceback (most recent call last):
File "/myproject/DBSCAN_section.py", line 498, in perform_dbscan_on_data
db = DBSCAN(eps=2, min_samples=5).fit(data)
File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 266, in fit
**self.get_params())
File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 138, in dbscan
return_distance=False)
File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/neighbors/base.py", line 621, in radius_neighbors
return_distance=return_distance)
File "sklearn/neighbors/binary_tree.pxi", line 1491, in sklearn.neighbors.kd_tree.BinaryTree.query_radius (sklearn/neighbors/kd_tree.c:13013)
MemoryError
How can I fix this? Is there any limit to the amount of data DBSCAN can process?
My example is based on: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
My data is in X, Y coordinate format:
11.342276,11.163416
11.050597,10.745579
10.798838,10.559784
11.249279,11.445535
11.385767,10.989214
10.825875,10.530120
10.598493,11.236947
10.571042,10.830799
11.454966,11.295484
11.431454,11.200208
10.774908,11.102601
10.602692,11.395169
11.324441,11.088243
10.731538,10.695864
10.537385,10.923226
11.215886,11.391537
Should I convert my data to a sparse CSR matrix? If so, how?
sklearn's DBSCAN needs O(n*k) memory, where k is the number of neighbors within epsilon. For a large data set and a large epsilon, this will be a problem.
For small data sets, it is faster in Python because it does more of the work in Cython, outside of the slow interpreter; that is the variation the sklearn authors chose.
For now, consider using a smaller epsilon, too.
But this is not what the original DBSCAN paper proposed, and other implementations such as ELKI's are known to scale to millions of points. It queries one point at a time, so it needs only O(n+k) memory.
It also has OPTICS, which is reported to work very well on coordinate data.
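If staying within scikit-learn is a requirement, one possible lower-memory route (my suggestion, not part of the answer above, and it needs scikit-learn 0.21 or newer) is sklearn.cluster.OPTICS with a DBSCAN-style cluster extraction, since it processes neighborhoods point by point rather than all at once:

from sklearn.cluster import OPTICS

# Hedged sketch: approximate DBSCAN(eps=2, min_samples=5) with OPTICS, capping
# the neighborhood radius at max_eps and extracting DBSCAN-like labels at eps=2.
# `data` is assumed to be the (400000, 2) array of X, Y coordinates from the question.
optics = OPTICS(min_samples=5, max_eps=2.0, cluster_method='dbscan', eps=2.0)
labels = optics.fit_predict(data)   # -1 marks noise, as in DBSCAN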

python scipy sparse matrix SVD with ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration

I was using scipy to do a sparse matrix SVD on some large data.
The matrix is around 200,000 x 8,000,000 in size, with 1.19% non-zero entries.
The machine I was using has 160 GB of memory, so I suppose memory shouldn't be an issue.
Here is some of the code I used:
from scipy import *
from scipy.sparse import *
import scipy.sparse.linalg as slin
from numpy import *

K = 1500
# row, col, value and the shape (M, N) are built elsewhere from the data
coom = coo_matrix((value, (row, col)), shape=(M, N))
coom = coom.astype('float32')
u, s, v = slin.svds(coom, K, ncv=8*K)
The error message looks like this:
Traceback (most recent call last):
File "sparse_svd.py", line 35, in <module>
u,s,v=slin.svds(coom,K,ncv=2*K+1)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 731, in svds
eigvals, eigvec = eigensolver(XH_X, k=k, tol=tol**2)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 680, in eigsh
params.iterate()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 278, in iterate
raise ArpackError(self.info)
scipy.sparse.linalg.eigen.arpack.arpack.ArpackError: ARPACK error 3: No shifts could be applied during a cycle of the Implicitly restarted Arnoldi iteration. One possibility is to increase the size of NCV relative to NEV.
When K=1000 (i.e. the number of eigenvalues is 1000), everything is OK. When I try K>=1250, the error begins to appear.
I have also tried various ncv values, but I still get the same error message...
Any suggestions and help appreciated.
Thanks a lot :)
