I'm running a hyperparameter optimization with Hyperopt for a neural network. After some iterations, it raises a MemoryError exception.
So far, I have tried clearing all variables after they are used (assigning None or empty lists to them; is there a better way to do this?) and printing all locals(), dir()s and globals() with their sizes, but those counts never increase and the sizes are quite small.
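The size-printing I mention looks roughly like this (a minimal sketch, not my exact code; note sys.getsizeof only counts the object itself, not anything it references):
import sys

# rough size report of the current namespace
for name, obj in sorted(locals().items()):
    print(name, type(obj).__name__, sys.getsizeof(obj))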
The structure looks like this:
def create_model(params):
    ## load data from temp files
    ## pre-process data accordingly
    ## Train NN with crossvalidation clearing Keras' session every time
    ## save stats and clean all variables (assigning None or empty lists to them)

def Optimize():
    for model in models:  # I have multiple models
        ## load data
        ## save data to temp files
        trials = Trials()
        best_run = fmin(create_model,
                        space,
                        algo=tpe.suggest,
                        max_evals=100,
                        trials=trials)
After X iterations (sometimes it completes the first 100 and moves on to the second model) it throws a MemoryError.
My guess is that some variables remain in memory and I'm not clearing them, but I haven't been able to detect which ones.
EDIT:
Traceback (most recent call last):
File "Main.py", line 32, in <module>
optimal = Optimize(training_sets)
File "/home/User1/Optimizer/optimization2.py", line 394, in Optimize
trials=trials)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 307, in fmin
return_argmin=return_argmin,
File "/usr/local/lib/python3.5/dist-packages/hyperopt/base.py", line 635, in fmin
return_argmin=return_argmin)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 320, in fmin
rval.exhaust()
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 199, in exhaust
self.run(self.max_evals - n_done, block_until_done=self.async)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 173, in run
self.serial_evaluate()
File "/usr/local/lib/python3.5/dist-packages/hyperopt/fmin.py", line 92, in serial_evaluate
result = self.domain.evaluate(spec, ctrl)
File "/usr/local/lib/python3.5/dist-packages/hyperopt/base.py", line 840, in evaluate
rval = self.fn(pyll_rval)
File "/home/User1/Optimizer/optimization2.py", line 184, in create_model
x_train, x_test = x[train_indices], x[val_indices]
MemoryError
It took me a couple of days to figure this out, so I'll answer my own question to save whoever encounters this issue some time.
Usually, when using Hyperopt with Keras, the suggested return value of the create_model function is something like this:
return {'loss': -acc, 'status': STATUS_OK, 'model': model}
But with large models and many evaluations you don't want to return, and keep in memory, every model; all you need is the set of hyperparameters that gave the lowest loss.
Simply removing the model from the returned dict resolves the issue of memory growing with each evaluation:
return {'loss': -acc, 'status': STATUS_OK}
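For illustration, a minimal sketch of an objective function along these lines (the data loading, model architecture, and params keys here are placeholders, not my original code):
from hyperopt import STATUS_OK
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

def create_model(params):
    # hypothetical loader standing in for "load data from temp files"
    x_train, y_train, x_val, y_val = load_data()

    model = Sequential()
    model.add(Dense(int(params['units']), activation='relu', input_dim=x_train.shape[1]))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train, y_train, epochs=5, batch_size=32, verbose=0)

    acc = model.evaluate(x_val, y_val, verbose=0)[1]

    # free the TF graph/session before the next evaluation
    K.clear_session()

    # return only the numbers Hyperopt needs -- no 'model' key
    return {'loss': -acc, 'status': STATUS_OK}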
Related
I am trying to perform a PCA analysis on a large dataset (410 000 entries and 32 000 features) in Python, but sklearn.decomposition.PCA does not work, as the underlying LAPACK implementation can't handle as much data as I have. It throws the following error:
Traceback (most recent call last):
File "main.py", line 47, in <module>
model.fit(x_std.transform(deep_data))
File "/home/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 344, in fit
self._fit(X)
File "/home/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 416, in _fit
return self._fit_full(X, n_components)
File "/home/lib/python3.6/site-packages/sklearn/decomposition/_pca.py", line 447, in _fit_full
U, S, V = linalg.svd(X, full_matrices=False)
File "/home/lib/python3.6/site-packages/scipy/linalg/decomp_svd.py", line 125, in svd
compute_uv=compute_uv, full_matrices=full_matrices)
File "/home/lib/python3.6/site-packages/scipy/linalg/lapack.py", line 605, in _compute_lwork
raise ValueError("Too large work array required -- computation cannot "
ValueError: Too large work array required -- computation cannot be performed with standard 32-bit LAPACK.
I have also tried sklearn.decomposition.IncrementalPCA, but since I don't have any issues with RAM it did not solve my problem; it only introduced more, because it does not allow me to keep all 32 000 components if my batch size is smaller than that.
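Roughly what that attempt looks like (a minimal sketch, not my exact code; load_chunk is a hypothetical loader):
from sklearn.decomposition import IncrementalPCA

# keeping all 32 000 components forces every batch to have at least 32 000 rows
ipca = IncrementalPCA(n_components=32000, batch_size=32000)

for start in range(0, 410000, 32000):
    chunk = load_chunk(start, 32000)   # hypothetical loader returning a (rows, 32000) array
    ipca.partial_fit(chunk)            # each chunk must have >= n_components rows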
Is there any other implementation of PCA that can handle this much data? I don't necessarily need all 410 000 samples, but I need at least 32 000 so that I can analyze all the principal components.
I am trying to run a language modeling program. When I use training data with 15,000 sentences in a document, the program runs properly. But when I try bigger data (10 times bigger), I encounter the error below:
Traceback (most recent call last):
File "<ipython-input-2-aa5ef9098286>", line 1, in <module>
runfile('C:/Users/cerdas/Documents/Bil/Lat/lstm-plato-lm/platolm.py', wdir='C:/Users/cerdas/Documents/Bil/Lat/lstm-plato-lm')
File "C:\Users\cerdas\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile
execfile(filename, namespace)
File "C:\Users\cerdas\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/cerdas/Documents/Bil/Lat/lstm-plato-lm/platolm.py", line 35, in <module>
y = to_categorical(y, num_classes=vocab_size)
File "C:\Users\cerdas\Anaconda3\lib\site-packages\keras\utils\np_utils.py", line 30, in to_categorical
categorical = np.zeros((n, num_classes), dtype=np.float32)
MemoryError
Here is the suspected line of code:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
and the relevant line in np_utils:
categorical = np.zeros((n, num_classes), dtype=np.float64)
I have tried to find a solution for a similar problem, and I found that I should change categorical_crossentropy to sparse_categorical_crossentropy. I did that, but it still fails with the same traceback.
Thanks
If you switch to the sparse categorical cross-entropy loss, then you don't need the to_categorical call, which is what is actually raising the error. Sparse categorical cross-entropy should work for this.
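A minimal sketch of the change (the model, vocabulary size, and data here are placeholders, not the original code); the point is that y stays an array of integer word indices, so the huge (n, vocab_size) one-hot matrix is never allocated:
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size = 50000      # placeholder vocabulary size
seq_length = 10         # placeholder input sequence length

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=seq_length))
model.add(LSTM(100))
model.add(Dense(vocab_size, activation='softmax'))

# sparse loss takes integer class indices directly -- no to_categorical
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

X = np.random.randint(0, vocab_size, size=(1000, seq_length))   # dummy sequences
y = np.random.randint(0, vocab_size, size=(1000, 1))            # integer targets, shape (n, 1)
model.fit(X, y, batch_size=128, epochs=1)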
I think this error is expected. The real issue here is that you don't have enough space to allocate (1) the parameter matrix of the decision layer and/or (2) the intermediate tensors.
The parameter matrix has shape input_feat_dim x output_num_classes. As you can see, this matrix consumes a huge amount of memory when the vocabulary is large.
To train the network, we also need to keep intermediate tensors for backpropagation, which are even bigger: batch_size x input_feat_dim x output_num_classes.
So one quick thing you can try is to reduce your batch_size to a tenth of its current value. Of course, you can't make the batch size too small; in that case, you may want to accumulate gradients until you have seen enough samples (see the sketch below).
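A minimal sketch of gradient accumulation, assuming TensorFlow 2 / tf.keras rather than the exact Keras setup in the question; the model, data, and accum_steps are placeholders just to show the pattern:
import numpy as np
import tensorflow as tf

vocab_size, feat_dim = 10000, 128
model = tf.keras.Sequential([
    tf.keras.layers.Dense(feat_dim, activation='relu', input_shape=(feat_dim,)),
    tf.keras.layers.Dense(vocab_size)             # logits over the vocabulary
])
optimizer = tf.keras.optimizers.Adam()
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

accum_steps = 8                                   # effective batch = 8 * small per-step batch
accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]

x = np.random.rand(256, feat_dim).astype('float32')            # dummy data
y = np.random.randint(0, vocab_size, size=(256,))
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(8)  # small per-step batch

for step, (xb, yb) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(yb, model(xb, training=True)) / accum_steps
    grads = tape.gradient(loss, model.trainable_variables)
    accum_grads = [a + g for a, g in zip(accum_grads, grads)]
    if (step + 1) % accum_steps == 0:             # apply once enough samples are seen
        optimizer.apply_gradients(zip(accum_grads, model.trainable_variables))
        accum_grads = [tf.zeros_like(v) for v in model.trainable_variables]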
I am new to extreme learning machines (ELM) and am trying to implement one. I am using the sample code from this link.
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as a objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
# split data in training and testing sets
tr_set, te_set = elm.split_sets(data, training_percent=.8, perm=True)
#train and test
tr_result = elmk.train(tr_set)
te_result = elmk.test(te_set)
print(te_result.get_accuracy)
So far, I have run only the portion of the code below, and I got an error:
import elm
# load dataset
data = elm.read("iris.data")
# create a classifier
elmk = elm.ELMKernel()
# search for best parameter for this dataset
# define "kfold" cross-validation method, "accuracy" as a objective function
# to be optimized and perform 10 searching steps.
# best parameters will be saved inside 'elmk' object
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
Here is the error:
Traceback (most recent call last):
File "C:/Users/Mahsa/PycharmProjects/test/ELM.py", line 16, in <module>
elmk.search_param(data, cv="kfold", of="accuracy", eval=10)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\elm\elmk.py", line 489, in search_param
param_kernel=param_ranges[1])
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 212, in minimize
pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\api.py", line 245, in optimize
solution, report = solver.optimize(f, maximize, pmap=pmap)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\optunity\solvers\CMAES.py", line 139, in optimize
sigma=self.sigma)
File "C:\Users\Mahsa\AppData\Local\Programs\Python\Python37\lib\site-packages\deap\cma.py", line 90, in __init__
self.dim = len(self.centroid)
TypeError: len() of unsized object
I have searched a lot for a solution, but it seems many people have had this problem and no one has offered them a solution. Can anyone help me, please?
BTW, I know someone posted the same question on Stack Overflow, but since they didn't receive any answers, I am asking it again.
I have been trying to use TieDIE. In a few words, this software includes an algorithm that finds significant subnetworks when you pass it some query nodes and a network. With smaller networks it works just fine, but the network I am interested in is quite big: it has 21,988 nodes and 360,474 edges. TieDIE generates an initial network kernel using scipy (Matlab is also an option for generating this kernel, but I do not own a license). During the generation of this kernel I get the following error:
Not enough memory to perform factorization.
Traceback (most recent call last):
File "Trials.py", line 44, in <module>
diffuser = SciPYKernel(network_path)
File "lib/kernel_scipy.py", line 83, in __init__
self.kernel = expm(time_T*L)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 602, in expm
return _expm(A, use_exact_onenorm='auto')
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 665, in _expm
X = _solve_P_Q(U, V, structure=structure)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/matfuncs.py", line 699, in _solve_P_Q
return spsolve(Q, P)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 198, in spsolve
Afactsolve = factorized(A)
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 440, in factorized
return splu(A).solve
File "/home/agmoreno/TieDIE-trials/TieDIE/local/lib/python2.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py", line 309, in splu
ilu=False, options=_options)
MemoryError
The most interesting thing is that I am using a cluster computer with 64 CPUs and 700 GB of RAM, and according to ps monitoring the software peaks at 1.3% of memory usage (~10 GB) at some point of the execution before crashing. I have been told that there is no limit on the usage of RAM... so I really have no clue about what could be happening.
Maybe someone here could help me find an alternative to scipy or a way to solve this.
Is it possible that the memory error comes from only one node being used? If that is the case, how could I distribute the work across the nodes?
Thanks in advance.
That's right, for a very large network like that you'll need high memory on a single node. The easiest solution is of course a workaround, either:
(1) Is there any way you can reduce the size of your input network while still capturing the relevant biology? Maybe just keep the nodes within 2 steps of your input nodes (see the sketch after this list).
(2) Use the new Cytoscape API to do the diffusion for you: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005598 (https://github.com/idekerlab/heat-diffusion)
(3) Use PageRank instead of computing a heat kernel (not ideal, as we've shown that Diffusion tends to work better on biological networks).
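For option (1), a minimal sketch of extracting the 2-step neighborhood with networkx (my illustration, not part of TieDIE; the edge-list file and query nodes are placeholders):
import networkx as nx

G = nx.read_edgelist("network_edges.tsv", delimiter="\t")   # the full 21 988-node network
query_nodes = ["TP53", "EGFR"]                               # placeholder query set

keep = set()
for n in query_nodes:
    if n in G:
        # all nodes within radius 2 of this query node
        keep.update(nx.ego_graph(G, n, radius=2).nodes())

subnet = G.subgraph(keep).copy()
print(subnet.number_of_nodes(), "nodes,", subnet.number_of_edges(), "edges")
nx.write_edgelist(subnet, "reduced_edges.tsv", delimiter="\t", data=False)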
Hope this helps!
-Evan Paull (TieDIE developer/lead author)
Problem Statement
I am using a document of 1,600,000 lines and ~66k features.
I am using the bag-of-words approach to build a decision tree.
The following code works fine for a 1,000-line document,
but it throws a MemoryError for the actual 1,600,000-line document.
My server has 64 GB of RAM.
Instead of using .todense() or .toarray(), is there any way to use the sparse matrix directly? Or,
is there any option to reduce the default float64 dtype?
Kindly help me with this.
Code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import tree

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(corpus)
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train.todense(), corpus2)
Error:
Traceback (most recent call last):
File "test123.py", line 103, in <module>
clf = clf.fit(X_train.todense(),corpus2)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/base.py", line 458, in todense
return np.asmatrix(self.toarray())
File "/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 550, in toarray
return self.tocoo(copy=False).toarray()
File "/usr/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 219, in toarray
B = np.zeros(self.shape, dtype=self.dtype)
MemoryError
In short, is there any method to use a classification tree on a large dataset with 66k features?
Add dtype=np.float32 to the vectorizer, e.g. vec = TfidfVectorizer(..., dtype=np.float32).
As for sparse vs. dense, I had a similar problem.
GradientBoostingClassifier, RandomForestClassifier and DecisionTreeClassifier need dense data; for that reason I use SVC.
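A minimal sketch combining both suggestions, keeping the matrix sparse and in float32; the corpus, labels, and the choice of LinearSVC (a linear SVM that accepts sparse input, swapped in for SVC here) are my placeholders, not from the original post:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

corpus = ["cheap offer buy today", "project meeting schedule notes"]   # stand-in documents
labels = [0, 1]                                                        # stand-in class labels

vec = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english', dtype=np.float32)
X = vec.fit_transform(corpus)   # stays a scipy.sparse matrix in float32, never densified

clf = LinearSVC()               # accepts sparse input directly, so no .todense() call
clf.fit(X, labels)
print(clf.predict(vec.transform(["new offer today"])))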