DBSCAN handling big data crashes and memory error [duplicate] - python

This question already has answers here:
scikit-learn DBSCAN memory usage
(5 answers)
Closed 5 years ago.
I am running DBSCAN on a dataset of 400K data points. Here is the error I get:
Traceback (most recent call last):
File "/myproject/DBSCAN_section.py", line 498, in perform_dbscan_on_data
db = DBSCAN(eps=2, min_samples=5).fit(data)
File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 266, in fit
**self.get_params())
File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/cluster/dbscan_.py", line 138, in dbscan
return_distance=False)
File "/usr/local/Python/2.7.13/lib/python2.7/site-packages/sklearn/neighbors/base.py", line 621, in radius_neighbors
return_distance=return_distance)
File "sklearn/neighbors/binary_tree.pxi", line 1491, in sklearn.neighbors.kd_tree.BinaryTree.query_radius (sklearn/neighbors/kd_tree.c:13013)
MemoryError
How can I fix this? Is there a limit on how much data DBSCAN can handle?
My example is based on: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
My data is in X, Y coordinate format:
11.342276,11.163416
11.050597,10.745579
10.798838,10.559784
11.249279,11.445535
11.385767,10.989214
10.825875,10.530120
10.598493,11.236947
10.571042,10.830799
11.454966,11.295484
11.431454,11.200208
10.774908,11.102601
10.602692,11.395169
11.324441,11.088243
10.731538,10.695864
10.537385,10.923226
11.215886,11.391537
Should I convert my data to a sparse CSR matrix? If so, how?

scikit-learn's DBSCAN needs O(n*k) memory, where k is the number of neighbors within epsilon. For a large data set and a large epsilon, this will be a problem.
For a small data set, this approach is faster in Python, because more of the work is done in Cython, outside of the slow interpreter. The sklearn authors deliberately chose this variation.
For now, consider using a smaller epsilon, too.
But this is not what the original DBSCAN paper proposed, and other implementations such as ELKI's are known to scale to millions of points. ELKI queries one point at a time, so it needs only O(n+k) memory.
ELKI also has OPTICS, which is reported to work very well on coordinate data.
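If you want to stay within scikit-learn, one workaround (a rough sketch, assuming a reasonably recent scikit-learn release and a hypothetical points.csv input file) is to precompute a sparse radius-neighbors graph with a smaller eps and feed it to DBSCAN with metric='precomputed', so only the within-eps distances are ever held in memory:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

# Load the X, Y coordinates (hypothetical file name).
data = np.loadtxt("points.csv", delimiter=",")

# Build a sparse graph that stores only the pairwise distances within eps.
# A smaller eps keeps this graph, and therefore memory use, small.
eps = 0.5
graph = NearestNeighbors(radius=eps).fit(data).radius_neighbors_graph(data, mode="distance")

# Recent scikit-learn versions accept a sparse precomputed distance matrix.
labels = DBSCAN(eps=eps, min_samples=5, metric="precomputed").fit(graph).labels_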

Related

Extreme Learning Machine Implementation Error

I am new to the extreme learning machine (ELM) and am trying to implement it. I am using sample code from this link.
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as a objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
# split data in training and testing sets
tr_set, te_set = elm.split_sets(data, training_percent=.8, perm=True)
#train and test
tr_result = elmk.train(tr_set)
te_result = elmk.test(te_set)
print(te_result.get_accuracy)
So far, I have only run the portion of the code below, and I got an error:
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as a objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
Here is the error:
Traceback (most recent call last):
File "C:/Users/Mahsa/PycharmProjects/test/ELM.py", line 16, in <module>
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\elm\elmk.py", line 489, in search_param
param_kernel=param_ranges[1])
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 212, in minimize
pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 245, in optimize
solution, report = solver.optimize(f, maximize, pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\solvers\CMAES.py", line 139, in optimize
sigma=self.sigma)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\deap\cma.py", line 90, in __init__
self.dim = len(self.centroid)
TypeError: len() of unsized object
I searched a lot for a solution, and it seems many people have had this problem but no one has offered them a solution. Can anyone help me, please?
BTW, I know someone posted the same question on Stack Overflow, but since he didn't receive any answers I am asking it again.

Not enough memory to perform factorization expm scipy.sparse.linalg.splu

I have been trying to use TieDIE. In a few words, this software includes an algorithm that finds a significant subnetwork when you pass it some query nodes and a network. With smaller networks it works just fine, but the network I am interested in is quite big: it has 21988 nodes and 360474 edges. TieDIE generates an initial network kernel using scipy (Matlab is also an option for generating this kernel, but I do not own a license). During the generation of this kernel I get the following error:
Not enough memory to perform factorization. Traceback (most recent call last):
File "Trials.py",
line 44, in <module> diffuser = SciPYKernel(network_path)
File "lib/kernel_scipy.py",
line 83, in __init__ self.kernel = expm(time_T*L)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py",
line 602, in expm return _expm(A, use_exact_onenorm='auto')
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py",
line 665, in _expm X = _solve_P_Q(U, V, structure=structure)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py",
line 699, in _solve_P_Q return spsolve(Q, P)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py",
line 198, in spsolve Afactsolve = factorized(A)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py",
line 440, in factorized return splu(A).solve
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py",
line 309, in splu ilu=False, options=_options)
MemoryError
What is most interesting is that I am using a cluster computer with 64 CPUs and 700GB of RAM, and the software peaks at 1.3% memory usage (~10GB) according to ps monitoring at some point during execution, then crashes later. I have been told there is no limit on RAM usage... So I really have no clue what could be happening.
Maybe someone here could help me find an alternative to scipy, or a way to solve this.
Is it possible that the memory error occurs because only one node is being used? If that is the case, how could I distribute the work across the nodes?
Thanks in advance.
That's right, for a very large network like that you'll need high memory on a single node. The easiest solution is of course a workaround, either:
(1) Is there any way you can reduce the size of your input network while still capturing the relevant biology? Maybe just look at all the nodes within 2 steps of your input nodes?
(2) Use the new Cytoscape API to do the diffusion for you: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005598 (https://github.com/idekerlab/heat-diffusion)
(3) Use PageRank instead of computing a heat kernel (not ideal, as we've shown that diffusion tends to work better on biological networks).
Hope this helps!
-Evan Paull (TieDIE developer/lead author)
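For option (1), a minimal sketch with networkx (the file names and query genes are hypothetical placeholders) that keeps only the nodes within 2 steps of the query nodes:

import networkx as nx

# Full network and query nodes (hypothetical file name and gene symbols).
G = nx.read_edgelist("network_edges.txt")
query_nodes = ["TP53", "EGFR"]

# Collect every node within 2 steps of any query node, then take the induced subgraph.
keep = set()
for q in query_nodes:
    if q in G:
        keep.update(nx.single_source_shortest_path_length(G, q, cutoff=2))

subnetwork = G.subgraph(keep).copy()
nx.write_edgelist(subnetwork, "reduced_network.txt")
print(subnetwork.number_of_nodes(), subnetwork.number_of_edges())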

Meanshift in scikit learn (python) doesn't understand datatype

I have a dataset which has 7265 samples and 132 features.
I want to use the mean-shift algorithm from scikit-learn, but I ran into this error:
Traceback (most recent call last):
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
labels, centers = getClusters(data,clusters)
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
ms.fit(np.array(dataarray))
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
cluster_all=self.cluster_all)
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
return self._fit(X)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
My code:
dataarray = np.array(data)
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
If I check the datatype of the data variable I see:
print isinstance( dataarray, np.ndarray )
>>> True
The bandwidth is 0.925538333061 and the dataarray.dtype is float64
I'm using scikit-learn 0.14.1.
I can cluster with other algorithms in scikit-learn (I tried kmeans and dbscan). What am I doing wrong?
EDIT:
The data can be found here:
(pickle format) : http://ojtwist.be/datatocluster.p
and : http://ojtwist.be/datatocluster.npz
That's a bug in the scikit-learn project. It is documented here.
There is a float -> int cast during the fitting process that can crash in some cases (by placing the seed points at the corner of the bins instead of at the center). There is some code in the link to fix the problem.
If you don't want to dig into the scikit-learn code (and want to keep your code compatible with other machines), I suggest you normalize your data before passing it to MeanShift.
Try this:
>>> from sklearn import preprocessing
>>> data2 = preprocessing.scale(dataarray)
Then use data2 in your code.
It worked for me.
If you don't want to do either of those, this is a great opportunity to contribute to the project by making a pull request with the fix :)
Edit: You probably want to retain the information needed to "descale" the results of MeanShift later. So use a StandardScaler object instead of the scale function, as sketched below.
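A minimal sketch of that variant, assuming dataarray is the array from the question:

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MeanShift, estimate_bandwidth

# Scale the data, keeping the scaler around so results can be mapped back later.
scaler = StandardScaler()
data2 = scaler.fit_transform(dataarray)

bandwidth = estimate_bandwidth(data2, quantile=0.2, n_samples=len(data2))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(data2)

# "Descale" the cluster centers back into the original coordinate system.
original_centers = scaler.inverse_transform(ms.cluster_centers_)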
Good luck!

Using sparse matrices/online learning in Naive Bayes (Python, scikit)

I'm trying to do Naive Bayes on a dataset that has over 6,000,000 entries, each with 150k features. I've tried to implement the code from the following link:
Implementing Bag-of-Words Naive-Bayes classifier in NLTK
The problem (as I understand it) is that when I try to run the train method with a dok_matrix as its parameter, it cannot find iterkeys (I've paired the rows with an OrderedDict of labels):
Traceback (most recent call last):
File "skitest.py", line 96, in <module>
classif.train(add_label(matr, labels))
File "/usr/lib/pymodules/python2.6/nltk/classify/scikitlearn.py", line 92, in train
for f in fs.iterkeys():
File "/usr/lib/python2.6/dist-packages/scipy/sparse/csr.py", line 88, in __getattr__
return _cs_matrix.__getattr__(self, attr)
File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 429, in __getattr__
raise AttributeError, attr + " not found"
AttributeError: iterkeys not found
My question is: is there a way to avoid using a sparse matrix by teaching the classifier entry by entry (online), or is there a sparse matrix format I could use efficiently in this case instead of dok_matrix? Or am I missing something obvious?
Thanks for anyone's time. :)
EDIT, 6th Sep:
Found the iterkeys, so at least the code runs. It's still too slow: it has been running for several hours on a dataset of size 32k and still hasn't finished. Here's what I have at the moment:
from collections import OrderedDict
import numpy as np
from scipy.sparse import dok_matrix
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# lentweets and foldsize are defined elsewhere in the script
matr = dok_matrix((6000000, 150000), dtype=np.float32)
labels = OrderedDict()
# collect the data into the matrix
pipeline = Pipeline([('nb', MultinomialNB())])
classif = SklearnClassifier(pipeline)
add_label = lambda lst, lab: [(lst.getrow(x).todok(), lab[x])
                              for x in xrange(lentweets-foldsize)]
classif.train(add_label(matr[:(lentweets-foldsize),0], labels))
readrow = [matr.getrow(x + foldsize).todok() for x in xrange(lentweets-foldsize)]
data = np.array(classif.batch_classify(readrow))
The problem might be that taking each row doesn't exploit the sparseness of the vector, but instead walks through each of the 150k entries. As a continuation of the issue, does anyone know how to use this Naive Bayes with sparse matrices, or is there any other way to optimize the above code?
Check out the document classification example in scikit-learn. The trick is to let the library handle the feature extraction for you. Skip the NLTK wrapper, as it's not intended for such large datasets.(*)
If you have the documents in text files, then you can just hand those text files to the TfidfVectorizer, which creates a sparse matrix from them:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(input='filename')
X = vect.fit_transform(list_of_filenames)
You now have a training set X in the CSR sparse matrix format that you can feed to a Naive Bayes classifier, provided you also have a list of labels y (perhaps derived from the filenames, if you encoded the class in them):
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()
nb.fit(X, y)
If it turns out this doesn't work because the set of documents is too large (unlikely since the TfidfVectorizer was optimized for just this number of documents), look at the out-of-core document classification example, which demonstrates the HashingVectorizer and the partial_fit API for minibatch learning. You'll need scikit-learn 0.14 for this to work.
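A rough sketch of that out-of-core route (the iter_minibatches generator and the label set are hypothetical placeholders; older scikit-learn releases spell the non-negative option non_negative=True instead of alternate_sign=False):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Keep hashed features non-negative, as MultinomialNB requires.
vect = HashingVectorizer(n_features=2**20, alternate_sign=False)
nb = MultinomialNB()
classes = [0, 1]  # all labels must be known up front for partial_fit

for texts, y in iter_minibatches():  # yields (list of raw documents, list of labels)
    X = vect.transform(texts)        # hashing is stateless, so no fitting step needed
    nb.partial_fit(X, y, classes=classes)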
(*) I know, because I wrote that wrapper. Like the rest of NLTK, it's intended for educational purposes. I also worked on performance improvements in scikit-learn, and some of the code I'm advertising is my own.

Factors Analysis using MDP in Python

Excuse my ignorance, I'm very new to Python. I'm trying to perform factor analysis in Python using MDP (though I can use another library if there's a better solution).
I have an m by n matrix (called matrix) and I tried to do:
import mdp
mdp.nodes.FANode()(matrix)
but I get back an error. I'm guessing maybe my matrix isn't formed properly? My goal is to find out how many components are in the data and which rows load onto which components.
Here is the traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "mdp/signal_node.py", line 630, in __call__
return self.execute(x, *args, **kwargs)
File "mdp/signal_node.py", line 611, in execute
self._pre_execution_checks(x)
File "mdp/signal_node.py", line 480, in _pre_execution_checks
self.train(x)
File "mdp/signal_node.py", line 571, in train
self._check_input(x)
File "mdp/signal_node.py", line 429, in _check_input
if not x.ndim == 2:
AttributeError: 'list' object has no attribute 'ndim'
Does anyone have any idea what's going on, and feel like explaining it to a Python newbie?
I have absolutely no experience with MDP, but it looks like it expects your matrices to be passed as a NumPy array instead of a list. NumPy is a package for high-performance scientific computing. You can go to the NumPy home page and install it. After doing so, try altering your code to this:
import mdp, numpy
mdp.nodes.FANode()(numpy.array(matrix))
As Stephen said, the data must be a NumPy array. More precisely, it must be a 2D array, with the first index representing the different samples and the second index representing the data dimensions (using the wrong order here can lead to the "singular matrix" error).
You should also take a look at the MDP documentation, which should answer all your questions. If that doesn't help, there is the MDP user mailing list.
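A minimal sketch of the expected call, assuming matrix is the list from the question and that each row is a sample (output_dim and the attribute name below are taken from the MDP docs as I recall them, so treat them as assumptions):

import mdp
import numpy as np

# MDP expects a 2D array of shape (n_samples, n_dims);
# transpose first if your rows are variables rather than observations.
x = np.asarray(matrix, dtype="float64")

fa = mdp.nodes.FANode(output_dim=3)  # hypothetical number of factors
latent = fa(x)                       # calling the node trains it and returns the latent variables
print(fa.A)                          # estimated loading (mixing) matrix, one row per observed dimension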
