Making big Machine Learning models on a 4 GB RAM laptop - Python

I want to create a digit recognition classifier using the MNIST dataset. But whenever I train the classifier, my laptop freezes due to the large size of the dataset. I somehow managed to train the classifier and dump it into a joblib file, which is 377 MB. Now, whenever I load the classifier to compute the accuracy or predict the test values, my laptop freezes and I have to restart it.
I also tried the following so that I don't have to keep the classifier around in memory and use up RAM:
from joblib import load
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(load('Digit Recognizer.joblib'), x_train, y_train, cv=3, scoring='accuracy')
But this also didn't work and my laptop froze. How can I work with such a large classifier with only 4 GB of RAM?
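One thing worth trying here (not from the original post) is joblib's memory mapping, so the arrays inside the dumped estimator are paged in from disk instead of copied into RAM. A minimal sketch, assuming the dump was saved without compression; x_test is an assumed name for the test matrix, which the question does not show:

from joblib import load

# mmap_mode='r' memory-maps the numpy arrays stored in the file, so they are
# read from disk on demand instead of loaded into RAM all at once.
# Note: this only works if the file was dumped without compression.
clf = load('Digit Recognizer.joblib', mmap_mode='r')
predictions = clf.predict(x_test)  # x_test is assumed, not defined in the question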

Related

Saving MultinomialNB Model to Disk takes too much time and memory

I have a dataset (around 120k entries, 8 MB, 4 columns, one of which is text). I ran a MultinomialNB classifier on the text column in order to predict its class (another column).
I did that with the pipeline below (the text column goes through a cleaning step, including stopword removal, prior to the pipeline).
text_clf_comp = Pipeline([('vect', CountVectorizer(ngram_range=(1, 6))),
                          ('tfidf', TfidfTransformer(use_idf=False)),
                          ('clf', MultinomialNB(alpha=0.01))])
text_clf_comp = text_clf_comp.fit(X_train_comp, y_train)
The parameters were optimized using GridSearchCV.
Fitting the pipeline takes 17 s and the model is very good at predicting.
The problem occurs when I try to save the model using joblib or pickle: it creates a 300 MB file and takes 7 minutes to run. That doesn't make sense, considering the time to train and the size of the data.
saved_model=joblib.dump(text_clf_comp,'saved_model.joblib')
I created an LSTM model that takes about 1 hour to train, yet saving it took only a couple of seconds and 2 MB.
Right now it is faster to retrain my MultinomialNB classifier every time than to save and load it.
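Most of that 300 MB is likely the CountVectorizer vocabulary (ngrams up to length 6 over 120k documents), so capping the vocabulary and compressing the dump usually shrinks both the file and the dump time. A rough sketch, not the asker's exact settings; the ngram_range and max_features values below are arbitrary assumptions and may change accuracy:

import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

text_clf_comp = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 3), max_features=200_000)),  # smaller vocabulary (assumed values)
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB(alpha=0.01)),
])
text_clf_comp = text_clf_comp.fit(X_train_comp, y_train)

# compress=3 trades a little dump/load time for a much smaller file on disk
joblib.dump(text_clf_comp, 'saved_model.joblib', compress=3)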

sklearn, joblib: Loading large model causes memory error

I have trained and saved an sklearn RandomForestClassifier in the following way:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y)
import joblib
joblib.dump(clf, "D:/mymodel.gz", compress=3)
The final model (as saved on HDD) is about 6 GB large.
When I try to load the model again using:
clf_loaded = joblib.load("D:/mymodel.gz")
pred = clf_loaded.predict(X)
I get a memory error even though my available RAM is just under 60 GB.
It seems that joblib.load() requires a huge amount of RAM in excess of the original file size (6 GB).
Q: What are alternative options to save an sklearn model locally and load it again without consuming massive RAM space?
As far as I know, pickle will not outperform joblib and, per the docs, is not recommended for sklearn models.
I checked some other questions such as this, this, this, and this but could not find a workable solution.
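One workaround worth trying (an assumption on my part, not something from the post) is to dump the forest uncompressed and memory-map it at load time, since joblib ignores mmap_mode for compressed files:

import joblib

# Dump without compression so the file can be memory-mapped later.
joblib.dump(clf, "D:/mymodel.joblib")

# mmap_mode='r' maps the estimator's numpy arrays straight from disk,
# so prediction does not need the whole model resident in RAM at once.
clf_loaded = joblib.load("D:/mymodel.joblib", mmap_mode="r")
pred = clf_loaded.predict(X)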

The number of samples in each k-fold on GPU is different from the number of samples in each k-fold on CPU

I am running the same code, training the same CNN model on the same dataset, on both GPU and CPU, and I am using k-fold cross-validation. The problem is that k-fold does not seem to work properly on the GPU: on the CPU, about 700 samples are used for training in each fold after cross-validation, but on the GPU only 27 samples are used for training in each fold.
I don't know what the problem is. Could someone please help me with this?
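The k-fold split itself is plain index arithmetic done by scikit-learn on the CPU, so it cannot differ between devices; a quick sanity check (with a stand-in array, since the real data isn't shown) is to print the fold sizes and compare them with whatever the training loop reports. The 27 could be batches per epoch rather than samples, though that is only a guess:

import numpy as np
from sklearn.model_selection import KFold

X = np.random.rand(1000, 32)  # stand-in for the real dataset
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# The split is identical on any machine; if the fold sizes printed here match
# on both setups, the difference lies in the training loop, not in k-fold.
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")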

Why is cross_val_predict so much slower than fit for KNeighborsClassifier?

Running locally on a Jupyter notebook and using the MNIST dataset (28k entries, 28x28 pixels per image), the following takes 27 seconds.
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_jobs=1)
knn_clf.fit(pixels, labels)
However, the following takes 1722 seconds, in other words ~64 times longer:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(knn_clf, pixels, labels, cv = 3, n_jobs=1)
My naive understanding is that cross_val_predict with cv=3 is doing 3-fold cross validation, so I'd expect it to fit the model 3 times, and so take at least ~3 times longer, but I don't see why it would take 64x!
To check if it was something specific to my environment, I ran the same code in a Colab notebook - the difference was less extreme (15x), but still way above the ~3x I expected.
What am I missing? Why is cross_val_predict so much slower than just fitting the model?
In case it matters, I'm running scikit-learn 0.20.2.
KNN is also called a lazy algorithm because during fitting it does nothing but store the input data; there is no learning at all.
The actual distance calculation happens at predict time, for each test datapoint. So when using cross_val_predict, KNN has to predict on the validation data points, which is what drives the computation time up.
cross_val_predict does a fit and a predict, so it might take longer than just fitting, but I did not expect it to take 64 times longer.
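To see where the time goes, it helps to time fit and predict separately: the fit only stores the data, while each of the 3 folds must compute distances from roughly a third of the samples to the other two thirds. A rough timing sketch, assuming pixels and labels are the arrays from the question:

import time
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Split off a third of the data to mimic one fold of cv=3.
X_tr, X_val, y_tr, y_val = train_test_split(pixels, labels, test_size=1/3, random_state=0)

knn = KNeighborsClassifier(n_jobs=1)

t0 = time.time()
knn.fit(X_tr, y_tr)        # just stores the training data -> fast
print("fit:", time.time() - t0)

t0 = time.time()
knn.predict(X_val)         # distance search for every validation point -> slow
print("predict one fold:", time.time() - t0)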

Memory error of fitting sklearn.KNN in Python

I am fitting a kNN model in Ubuntu, Python, and sklearn with 32 GB DDR4 RAM.
I want to load the model without fitting it again, so I decided to dump the fitted model after training.
The input is a large dataset: 1M+ rows and 1052 columns. (Most of the columns are dummy variables for categorical features, taking values in {0, 1}.)
After the KNN was fitted, I got a MemoryError when trying to dump the model with pickle.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 14)
knn = knn.fit(x,y)
import pickle
pickle.dump(knn, open('KNN.sav', 'wb'))
In other posts, the solution is always to use higher specifications and more RAM. I don't think that is a good solution in the long run.
Updated question: Is there any way to limit model training to 26 of the 32 GB and leave 6 GB free for dumping? Or any other method to save the fitted model?
There's a question to ask first - do you really need such a huge training dataset? The idea is that the model may not need the whole dataset. There is a good code example in sklearn for this. Or use a cloud instance for working with such a huge dataset.
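One way to check whether the whole dataset is really needed (my reading of the suggestion above, using sklearn's learning_curve) is to watch how the cross-validated score grows with the training size; if it plateaus early, a subsample is enough and the dumped model stays small. The train_sizes values below are arbitrary assumptions:

import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# x, y are the arrays from the question.
train_sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=14),
    x, y,
    train_sizes=np.linspace(0.05, 0.5, 5),  # 5% .. 50% of the data (assumed range)
    cv=3,
    n_jobs=1,
)
for size, score in zip(train_sizes, val_scores.mean(axis=1)):
    print(f"{size} samples -> mean CV accuracy {score:.3f}")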
