sklearn, joblib: Loading large model causes memory error - python

I have trained and saved an sklearn RandomForestClassifier in the following way:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y)
import joblib
joblib.dump(clf, "D:/mymodel.gz", compress=3)
The final model (as saved on the HDD) is about 6 GB in size.
When I try to load the model again using:
clf_loaded = joblib.load("D:/mymodel.gz")
pred = clf_loaded.predict(X)
I get a memory error even though my available RAM is just under 60 GB.
It seems that joblib.load() requires an amount of RAM far in excess of the original file size (6 GB).
Q: What are alternative options to save an sklearn model locally and load it again without consuming massive RAM space?
As far as I know, pickle will not outperform joblib and, according to the docs, is not recommended for sklearn models.
I checked some other questions such as this, this, this, and this but could not find a workable solution.
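One approach that is often suggested for this situation (a sketch, not tested on a model of this size; the .joblib filename is just illustrative) is to dump the estimator uncompressed and memory-map its arrays at load time, so the large numpy buffers inside the forest are read from disk on demand instead of being decompressed into RAM all at once:
import joblib
# Dump without compression; memory mapping only works on uncompressed files
joblib.dump(clf, "D:/mymodel.joblib")
# Load with mmap_mode so the array buffers stay on disk and are paged in as needed
clf_loaded = joblib.load("D:/mymodel.joblib", mmap_mode="r")
pred = clf_loaded.predict(X)
The uncompressed file will be larger than the 6 GB .gz, but peak RAM during loading should drop because the arrays are not materialized eagerly.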

Related

How to set `n_jobs` in sklearn ElasticNet

I am trying to run ElasticNet from scikit-learn on a machine with multiple CPUs. However, I need the ElasticNet fit to use only one CPU, since I need to run other fitting routines in parallel on the remaining CPUs. Whenever the thread containing ElasticNet starts the fit, it quickly takes over any free capacity on all CPUs instead of just the one it's called on. Because other routines are already running on these machines, ElasticNet oversubscribes them and slows everything down tremendously, including itself. I need these routines to run in parallel, so I cannot simply run the ElasticNet fit serially ahead of time.
Unlike some other regression estimators in sklearn (linear, logistic, ...), ElasticNet has no n_jobs argument. Reading the documentation, it appears that ElasticNet defaults to the n_jobs specified in joblib.parallel_backend, which itself defaults to n_jobs=-1, i.e. all available CPUs.
I am trying to figure out the proper method for specifying n_jobs in parallel_backend so that it will override the default for ElasticNet. Following are three attempts to change n_jobs that have not worked so far.
Attempt 1
from joblib import parallel_backend
from sklearn.linear_model import ElasticNet
with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    model.fit(predictors, response)
Attempt 2
from sklearn.utils import parallel_backend
from sklearn.linear_model import ElasticNet
with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    model.fit(predictors, response)
Neither Attempt 1 nor Attempt 2 throws an error, but neither appears to change n_jobs from the default of using every available CPU: ElasticNet still takes over all available capacity across all CPUs and quickly oversubscribes the machines.
Attempt 3
This is my first time using joblib directly, so I've been reading the documentation on parallelization with joblib. Most of the example routines placed inside the parallel_backend context are wrapped in the Parallel() helper class.
Following the examples, I modified Attempt 1 in the following way:
from joblib import parallel_backend
from sklearn.linear_model import ElasticNet
with parallel_backend('loky', n_jobs=1):
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                       normalize=False, copy_X=True, max_iter=10000, tol=10,
                       random_state=42, precompute=False, warm_start=False,
                       positive=False, selection='cyclic')
    Parallel(n_jobs=1)(model.fit(predictors, response))
However when running Attempt 3, I get the following error message:
TypeError: 'ElasticNet' object is not iterable
Does anyone know how to set n_jobs=1 for sklearn's ElasticNet? There must be some way to do this because ElasticNetCV has n_jobs as a possible argument. Any help with this is greatly appreciated!
In addition to the first solution, you can manually limit MKL to a single thread (the equivalent of setting MKL_NUM_THREADS=1) with:
import mkl
mkl.set_num_threads(1)
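If the oversubscription comes from the BLAS/OpenMP thread pools rather than from joblib workers (which seems plausible here, since plain ElasticNet does not expose n_jobs), another option is to cap those pools just around the fit. A sketch using threadpoolctl, assuming that package is installed and reusing the variables from the question:
from threadpoolctl import threadpool_limits
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False,
                   max_iter=10000, tol=10, random_state=42, selection='cyclic')
# Restrict MKL/OpenBLAS/OpenMP to one thread for the duration of the fit only
with threadpool_limits(limits=1):
    model.fit(predictors, response)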

Making big Machine Learning models on 4 GB RAM Laptop

I want to create a digit recognition classifier using the MNIST dataset, but whenever I train the classifier, my laptop freezes because of the large size of the dataset. Somehow I managed to train the classifier anyway and dumped it into a joblib file, which is 377 MB in size. Now, whenever I load the classifier to compute accuracy or predict the test values, my laptop freezes and I have to restart it.
I also used the following, so that I don't have to store the classifier in a variable and use as much RAM:
from joblib import load
from sklearn.model_selection import cross_val_score
accuracy = cross_val_score(load('Digit Recognizer.joblib'), x_train, y_train, cv=3, scoring='accuracy')
But this also didn't work and my laptop froze. How can I work with such a large classifier with just 4 GB of RAM?
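One commonly suggested way to stay inside a small RAM budget with MNIST-sized data is incremental learning on mini-batches instead of a single large fit. A minimal sketch using SGDClassifier.partial_fit (this swaps the original classifier for a linear model and assumes x_train and y_train are already loaded as arrays):
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(random_state=0)
classes = np.unique(y_train)
batch_size = 1000
# Feed the data in small chunks so only one batch is held in working memory at a time
for start in range(0, len(x_train), batch_size):
    batch = slice(start, start + batch_size)
    clf.partial_fit(x_train[batch], y_train[batch], classes=classes)
A linear model like this also dumps to a file of a few hundred kilobytes instead of hundreds of megabytes, which helps with the loading problem as well.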

How to build re-usable scikit-learn pipeline for Random Forest Classifier?

I am trying to understand how scikit-learn pipelines work. I have some dummy data and I am trying to fit a Random Forest model to iris data. Here is some code
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import sklearn.externals
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
Divide data into train and test and create a pipeline with 2 steps
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target,random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
pipeline = Pipeline([('feature_selection', SelectKBest(chi2, k=2)), ('classification', RandomForestClassifier()) ])
print(type(pipeline))
(112, 4) (38, 4) (112,) (38,)
<class 'sklearn.pipeline.Pipeline'>
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
However, pipeline.fit(X_train, y_train) works fine.
In a normal scenario, without any pipeline code, what I usually do is take an ML model, call fit_transform() on my training dataset, and transform() on my unseen dataset to generate predictions.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Another thing is the RF model itself: I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline for printing a model summary.
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
Indeed, RandomForestClassifier does not transform data because it is a model, not a transformer. Pipelines implement either transform or predict (and its variants) depending on whether the last estimator is a transformer or a model.
So, generally, you'll want to call just pipeline.fit(X_train, y_train); then in testing or production you'll call pipeline.predict(X_test) (or predict_proba, or ...), which internally will transform with the first step(s) and predict with the last step.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Yes; see sklearn Model Persistence for more details and recommendations.
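A minimal sketch of that round trip with joblib (after the pipeline has been fitted with pipeline.fit(X_train, y_train); the filename is arbitrary):
import joblib
# Persist the fitted pipeline, then reload it elsewhere for scoring
joblib.dump(pipeline, 'iris_pipeline.joblib')
restored = joblib.load('iris_pipeline.joblib')
print(restored.score(X_test, y_test))
print(restored.predict(X_test[:5]))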
Another thing is the RF model itself: I can get a model summary using the RF model's methods, but I don't see any methods on my pipeline for printing a model summary.
You can access individual steps of a pipeline in a few ways; see the sklearn docs on accessing Pipeline steps:
pipeline.named_steps.classification
pipeline['classification']
pipeline[-1]
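All three expressions return the same RandomForestClassifier step, so its usual attributes are available through the pipeline, for example:
rf = pipeline.named_steps.classification
print(rf.get_params())          # hyperparameters of the RF step
print(rf.n_estimators)          # number of trees
print(rf.feature_importances_)  # available once the pipeline has been fitted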

scikit-learn linear regression memory overflow for MNIST

I am trying to perform basic linear regression of MNIST data using the scikit-learn module. It seems to crash with MemoryError. What am I doing wrong? The shape of the training dataset is (60000, 784).
import numpy as np
from tensorflow.keras.datasets import mnist
from sklearn import linear_model
(xTrain, yTrain), (xTest, yTest) = mnist.load_data()
xTrain2D = xTrain.reshape((len(xTrain), -1))
xTest2D = xTest.reshape((len(xTest), -1))
reg = linear_model.LinearRegression()
reg.fit(xTrain2D, yTrain)
The problem is with the implementation in sklearn: older versions of sklearn have this issue because of how they manage resources. Try upgrading sklearn.
Another viable option is to run this code on Kaggle or Google Colab.
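If upgrading does not help, a lower-memory route that is sometimes suggested (a sketch, not verified against this exact crash) is to scale and cast the uint8 pixels yourself and use an iterative solver such as SGDRegressor, which avoids the dense least-squares factorization that LinearRegression performs:
import numpy as np
from sklearn.linear_model import SGDRegressor

# Explicit cast and scaling; sklearn would otherwise create its own float copy during fit
xTrain2D_f = xTrain2D.astype(np.float32) / 255.0
reg = SGDRegressor(max_iter=1000, tol=1e-3)
reg.fit(xTrain2D_f, yTrain)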

Memory error when fitting sklearn KNN in Python

I am fitting a kNN model with sklearn in Python on Ubuntu, on a machine with 32 GB of DDR4 RAM.
I want to be able to load the model later without fitting it again, so I decided to dump the fitted model after fitting.
The input is a large dataset: 1M+ rows and 1052 columns. (Most of the columns are dummy variables for categorical features, with values in {0, 1}.)
After the KNN was fitted, I got a MemoryError when trying to dump the model with pickle.
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 14)
knn = knn.fit(x,y)
import pickle
pickle.dump(knn, open('KNN.sav', 'wb'))
In other posts, the solution is always to use a machine with higher specs and more RAM. I don't think that is a good solution in the long run.
Updated question: Is there any way to limit the model training to 26 of the 32 GB and leave 6 GB free for dumping? Or is there some other method to save the fitted model?
There's a question to ask first: do you really need such a huge training dataset? The model may not need the whole dataset; there is a good code example of this in the sklearn documentation. Alternatively, use a cloud instance for working with datasets this large.
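One way to check the "do you really need the whole dataset" question empirically is a learning curve on growing subsamples. A small sketch (the train_sizes values are illustrative, and it is worth running this on a random subsample first, since each point refits the model):
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# If the validation score plateaus well before the full training size,
# the model can be fitted (and dumped) on a much smaller subsample.
sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=14), x, y,
    train_sizes=np.linspace(0.05, 0.5, 5), cv=3, scoring='accuracy', n_jobs=1)
print(sizes)
print(val_scores.mean(axis=1))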
