Extreme Learning Machine Implementation Error - python

I am new to the extreme learning machine (ELM) and trying to implement its code. I am using a sample code from this link.
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as an objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
# split data in training and testing sets
tr_set, te_set = elm.split_sets(data, training_percent=.8, perm=True)
#train and test
tr_result = elmk.train(tr_set)
te_result = elmk.test(te_set)
print(te_result.get_accuracy)
So far, I just ran the below portion of the code and I got an error:
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as an objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
Here is the error:
Traceback (most recent call last):
File "C:/Users/Mahsa/PycharmProjects/test/ELM.py", line 16, in <module>
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\elm\elmk.py", line 489, in search_param
param_kernel=param_ranges[1])
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 212, in minimize
pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 245, in optimize
solution, report = solver.optimize(f, maximize, pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\solvers\CMAES.py", line 139, in optimize
sigma=self.sigma)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\deap\cma.py", line 90, in __init__
self.dim = len(self.centroid)
TypeError: len() of unsized object
I searched a lot to find the solution but it seems many people had this problem and no one offered them any solution. Can anyone help me please?
BTW, I know someone posted the same question on Stack Overflow, but since he didn't receive any answers I am asking it again.

Related

How to find the source of a MemoryError in Python?

I'm running a hyperparameter optimization using Hyperopt for a neural network. While doing so, after some iterations, I get a MemoryError exception.
So far, I tried clearing all variables after they had been used (assigning None or empty lists to them, is there a better way for this?) and printing all locals(), dirs() and globals() with their sizes, but those counts never increase and the sizes are quite small.
The structure looks like this:
from hyperopt import fmin, tpe, Trials

def create_model(params):
    # load data from temp files
    # pre-process data accordingly
    # train NN with cross-validation, clearing Keras' session every time
    # save stats and clean all variables (assigning None or empty lists to them)
    ...

def Optimize():
    for model in models:  # I have multiple models
        # load data
        # save data to temp files
        trials = Trials()
        best_run = fmin(create_model,
                        space,
                        algo=tpe.suggest,
                        max_evals=100,
                        trials=trials)
After X number of iterations (sometimes it completes the first 100 and moves on to the second model) it throws a memory error.
My guess is that some variables remain in memory and I'm not clearing them, but I wasn't able to detect them.
EDIT:
Traceback (most recent call last):
File "Main.py", line 32, in <module>
optimal = Optimize(training_sets)
File "/home/User1/Optimizer/optimization2.py", line 394, in Optimize
trials=trials)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 307, in fmin
return_argmin=return_argmin,
File "/usr/local/lib/python3.5/dist-packages/hyperopt/base.py", line 635, in fmin
return_argmin=return_argmin)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 320, in fmin
rval.exhaust()
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 199, in exhaust
self.run(self.max_evals - n_done, block_until_done=self.async)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 173, in run
self.serial_evaluate()
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 92, in serial_evaluate
result = self.domain.evaluate(spec, ctrl)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/base.py", line 840, in evaluate
rval = self.fn(pyll_rval)
File "/home/User1/Optimizer/optimization2.py", line 184, in create_model
x_train, x_test = x[train_indices], x[val_indices]
MemoryError
It took me a couple of days to figure this out so I'll answer my own question to save whoever encounters this issue some time.
Usually, when using Hyperopt for Keras, the suggested return of the create_model function is something like this:
return {'loss': -acc, 'status': STATUS_OK, 'model': model}
But with large models and many evaluations, you don't want to return every model and keep it in memory; all you need is the set of hyperparameters that gave the lowest loss.
By simply removing the model from the returned dict, the issue of memory increasing with each evaluation is resolved.
return {'loss': -acc, 'status': STATUS_OK}
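The effect can be reproduced without Hyperopt or Keras. The sketch below is only an illustration: FakeModel, objective and count_live_models are hypothetical names, the string "ok" stands in for STATUS_OK, and a plain list plays the role of the Trials object, on the assumption that Trials keeps every returned result dict for the whole run.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for a large trained model."""

    def __init__(self):
        self.weights = [0.0] * 1000


def objective(acc, keep_model, live):
    """Mimic create_model: build a model and return a result dict."""
    model = FakeModel()
    live.add(model)  # weak reference only, does not keep the model alive
    result = {"loss": -acc, "status": "ok"}
    if keep_model:
        result["model"] = model  # strong reference travels with the result
    return result


def count_live_models(n_evals, keep_model):
    """Run n_evals evaluations and count the models still in memory."""
    live = weakref.WeakSet()
    history = []  # assumed to behave like Trials: every result is stored
    for i in range(n_evals):
        history.append(objective(i / n_evals, keep_model, live))
    gc.collect()
    return len(live)


print(count_live_models(5, keep_model=True))
print(count_live_models(5, keep_model=False))
```

With the model in the returned dict, every evaluated model is still referenced at the end of the run; without it, each model can be freed as soon as the evaluation returns.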

Not enough memory to perform factorization expm scipy.sparse.linalg.splu

I have been trying to use TieDIE. In short, this software includes an algorithm that finds significant subnetworks when you pass it some query nodes and a network. With smaller networks it works just fine, but the network I am interested in is quite big: it has 21988 nodes and 360474 edges. TieDIE generates an initial network kernel using scipy (Matlab is also an option for generating this kernel, but I do not own a license). During the generation of this kernel I get the following error:
Not enough memory to perform factorization.
Traceback (most recent call last):
File "Trials.py", line 44, in <module>
diffuser = SciPYKernel(network_path)
File "lib/kernel_scipy.py", line 83, in __init__
self.kernel = expm(time_T*L)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 602, in expm
return _expm(A, use_exact_onenorm='auto')
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 665, in _expm
X = _solve_P_Q(U, V, structure=structure)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 699, in _solve_P_Q
return spsolve(Q, P)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 198, in spsolve
Afactsolve = factorized(A)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 440, in factorized
return splu(A).solve
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 309, in splu
ilu=False, options=_options)
MemoryError
The most interesting thing is that I am using a cluster computer with 64 CPUs and 700 GB of RAM, and according to ps monitoring the software peaks at 1.3% of memory usage (~10 GB) at some point during execution and crashes later. I have been told that there is no limit on RAM usage, so I really have no clue what could be happening.
Maybe someone here could help me find an alternative to scipy, or a way to solve this.
Is it possible that the memory error occurs because just one node is being used? If that is the case, how could I distribute the work across the nodes?
Thanks in advance.
That's right, for a very large network like that you'll need high memory on a single node. The easiest solution is of course a workaround, either:
(1) Is there any way you can reduce the size of your input network while still capturing the relevant biology? Maybe just look at all the nodes within 2 steps of your input nodes?
(2) Use the new Cytoscape API to do the diffusion for you: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005598 (https://github.com/idekerlab/heat-diffusion)
(3) Use PageRank instead of computing a heat kernel (not ideal, as we've shown that Diffusion tends to work better on biological networks).
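Option (1) could look roughly like the sketch below. It is not part of TieDIE: k_hop_subnetwork is a hypothetical helper, and it assumes the network is available as a list of (source, target) pairs that can be treated as undirected for the purpose of trimming.

```python
from collections import deque


def k_hop_subnetwork(edges, query_nodes, max_hops=2):
    """Return the edges whose endpoints both lie within max_hops of a query node."""
    # Build an undirected adjacency map from the edge list.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search from all query nodes at once, tracking hop distance.
    distance = {node: 0 for node in query_nodes if node in neighbors}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbors[node]:
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    kept = set(distance)
    return [(a, b) for a, b in edges if a in kept and b in kept]


# Toy network: a chain A-B-C-D-E plus an unrelated edge X-Y.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("X", "Y")]
small = k_hop_subnetwork(edges, ["A"], max_hops=2)
print(small)
```

On the toy network only the edges A-B and B-C survive, because D, E, X and Y are more than two steps away from the query node A.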
Hope this helps!
-Evan Paull (TieDIE developer/lead author)

Memory error for decision tree with 66k features, using scikit python packages

Problem Statement
I am using a document of 1600000 lines and ~66k features.
I am using the bag-of-words approach to build a decision tree.
The following code works fine for a 1000-line document,
but throws a memory error for the actual 1600000-line document.
My server has 64 GB of RAM.
Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly? Or
are there any options to reduce the default type float64?
Kindly help me with this.
Code:
from sklearn import tree
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(corpus)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.todense(), corpus2)
Error:
Traceback (most recent call last):
File "test123.py", line 103, in <module>
clf = clf.fit(X_train.todense(),corpus2)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
return np.asmatrix(self.toarray())
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
In short, are there any methods to use a classification tree on a large data set with 66k features?
Add dtype=np.float32, e.g.: vec = TfidfVectorizer(..., dtype=np.float32)
As for sparse versus dense, I have a similar problem.
In the scikit-learn version I use, GradientBoostingClassifier, RandomForestClassifier and DecisionTreeClassifier need dense data; for that reason I use SVC.
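A quick back-of-the-envelope check shows why .todense() fails here, and why switching to float32 alone is unlikely to be enough. The sketch below only does arithmetic on the figures quoted in the question (1600000 rows, roughly 66k columns); dense_size_gb is a hypothetical helper name.

```python
def dense_size_gb(n_rows, n_cols, bytes_per_value):
    """Memory needed to hold a dense n_rows x n_cols array, in GB (10^9 bytes)."""
    return n_rows * n_cols * bytes_per_value / 1e9


n_docs, n_features = 1_600_000, 66_000  # figures quoted in the question
float64_gb = dense_size_gb(n_docs, n_features, 8)  # 8 bytes per float64
float32_gb = dense_size_gb(n_docs, n_features, 4)  # 4 bytes per float32
print(f"float64: {float64_gb:.1f} GB, float32: {float32_gb:.1f} GB")
# prints "float64: 844.8 GB, float32: 422.4 GB"
```

Both figures are far above the 64 GB available, so whichever dtype is chosen the data has to stay sparse, or the number of features has to shrink.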

Meanshift in scikit learn (python) doesn't understand datatype

I have a dataset which has 7265 samples and 132 features.
I want to use the meanshift algorithm from scikit learn but I ran into this error:
Traceback (most recent call last):
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 130, in <module>
labels, centers = getClusters(data,clusters)
File "C:\Users\OJ\Dropbox\Dt\Code\visual\facetest\facetracker_video.py", line 34, in getClusters
ms.fit(np.array(dataarray))
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 280, in fit
cluster_all=self.cluster_all)
File "C:\python2.7\lib\site-packages\sklearn\cluster\mean_shift_.py", line 137, in mean_shift
nbrs = NearestNeighbors(radius=bandwidth).fit(sorted_centers)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 642, in fit
return self._fit(X)
File "C:\python2.7\lib\site-packages\sklearn\neighbors\base.py", line 180, in _fit
raise ValueError("data type not understood")
ValueError: data type not understood
My code:
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

dataarray = np.array(data)
bandwidth = estimate_bandwidth(dataarray, quantile=0.2, n_samples=len(dataarray))
ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
ms.fit(dataarray)
labels = ms.labels_
cluster_centers = ms.cluster_centers_
If I check the datatype of the data variable I see:
print isinstance( dataarray, np.ndarray )
>>> True
The bandwidth is 0.925538333061 and the dataarray.dtype is float64
I'm using scikit learn 0.14.1
I can cluster with other algorithms in scikit-learn (I tried kmeans and dbscan). What am I doing wrong?
EDIT:
The data can be found here:
(pickle format) : http://ojtwist.be/datatocluster.p
and : http://ojtwist.be/datatocluster.npz
That's a bug in the scikit-learn project. It is documented here.
There is a float -> int cast during the fitting process that can crash in some cases (by placing the seed points at the corners of the bins instead of at their centers). There is some code in the link to fix the problem.
If you don't want to get your hands into the scikit-learn code (and want to keep your code compatible with other machines), I suggest you normalize your data before passing it to MeanShift.
Try this:
>>> from sklearn import preprocessing
>>> data2 = preprocessing.scale(dataarray)
And then use data2 in your code.
It worked for me.
If you don't want to do either solution, it is a great opportunity to contribute to the project, making a pull request with the solution :)
Edit: You probably want to retain the information needed to "descale" the results of MeanShift. So use a StandardScaler object instead of a function to scale.
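To make the "descale" idea concrete, here is a minimal pure-Python sketch of what has to be remembered: the per-feature mean and standard deviation. The helpers fit_scaler, scale and descale are hypothetical names, not scikit-learn functions; with scikit-learn, a fitted StandardScaler keeps the same statistics and its inverse_transform method plays the role of descale.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Return per-column (means, stds) for a list of equal-length rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def scale(rows, means, stds):
    """Standardize every column to zero mean and unit variance."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def descale(rows, means, stds):
    """Map standardized values (e.g. cluster centers) back to original units."""
    return [[x * s + m for x, m, s in zip(row, means, stds)] for row in rows]


data = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
means, stds = fit_scaler(data)
scaled = scale(data, means, stds)         # what would be passed to MeanShift
restored = descale(scaled, means, stds)   # same operation applies to the centers
```

Cluster centers found on the scaled data live in the scaled space, so they need the same descale step before they can be compared with the original features.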
Good luck!

Python NLTK Maximum Entropy Classifier Error

I'm currently using NLTK's Naive Bayes classifier, however I also wanted to try out the Max Ent classifier. It seems from the documentation that it should take the same format for the feature set as the Naive Bayes, but for some reason I am getting this error when I try it:
File "/usr/lib/python2.7/site-packages/nltk/classify/maxent.py", line 323, in train
gaussian_prior_sigma, **cutoffs)
File "/usr/lib/python2.7/site-packages/nltk/classify/maxent.py", line 1453, in train_maxent_classifier_with_scipy
model.fit(algorithm=algorithm)
File "/usr/lib64/python2.7/site-packages/scipy/maxentropy/maxentropy.py", line 1026, in fit
return model.fit(self, self.K, algorithm)
File "/usr/lib64/python2.7/site-packages/scipy/maxentropy/maxentropy.py", line 226, in fit
callback=callback)
File "/usr/lib64/python2.7/site-packages/scipy/optimize/optimize.py", line 636, in fmin_cg
gfk = myfprime(x0)
File "/usr/lib64/python2.7/site-packages/scipy/optimize/optimize.py", line 176, in function_wrapper
return function(x, *args)
File "/usr/lib64/python2.7/site-packages/scipy/maxentropy/maxentropy.py", line 420, in grad
G = self.expectations() - self.K
ValueError: shape mismatch: objects cannot be broadcast to a single shape
I'm not sure what this means, but I am using exactly the same input as when I run Naive Bayes, and that works. (The training data is represented as a list of pairs, the first member of which is a featureset and the second of which is a classification label.) Any ideas?
Thanks!
I also encountered this problem with NLTK. While I was unable to resolve it satisfactorily (i.e. get Maxent working using scipy), I was able to train a maxent classifier in NLTK when I used a different algorithm. Try training with
me_classifier = nltk.MaxentClassifier.train(trainset,algorithm="iis")
or one of the other acceptable values for algorithm, like "gis" or "megam".
This issue is also dependent on what version of scipy you are using.
NLTK makes use of scipy.maxentropy which was deprecated in scipy 0.10 and removed in 0.11, see the docs for it: http://docs.scipy.org/doc/scipy-0.10.0/reference/maxentropy.html#
I did create an issue for that on github: https://github.com/nltk/nltk/issues/307
You must install nltk before you can classify.
Use the code below to classify using maximum entropy in Python:
me_classifier = nltk.MaxentClassifier.train(trainset,algorithm="gis")
print(me_classifier.classify(testing))
