I am trying to perform basic linear regression on the MNIST data using the scikit-learn module. It seems to crash with MemoryError. What am I doing wrong? The shape of the training dataset is (60000, 784).
import numpy as np
from tensorflow.keras.datasets import mnist
from sklearn import linear_model

# Load MNIST and flatten each 28x28 image into a 784-long feature vector.
(xTrain, yTrain), (xTest, yTest) = mnist.load_data()
xTrain2D = xTrain.reshape((len(xTrain), -1))
xTest2D = xTest.reshape((len(xTest), -1))

# Fit an ordinary least squares model on the flattened training images.
reg = linear_model.LinearRegression()
reg.fit(xTrain2D, yTrain)
The problem is with the scikit-learn implementation: older versions of sklearn have issues with resource management here, so try upgrading scikit-learn to the latest version.
Another viable option is to run this code on Kaggle or Google Colab.
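If upgrading alone does not help, a minimal sketch of my own (an assumption beyond the original answer, reusing the variables from the question): flattening and casting the uint8 pixels to float32 before fitting may roughly halve peak memory, since integer input is otherwise converted to float64 internally.
# Downcast to float32 before fitting (my own suggestion, not part of the answer above).
xTrain2D = xTrain.reshape((len(xTrain), -1)).astype(np.float32)
xTest2D = xTest.reshape((len(xTest), -1)).astype(np.float32)
reg = linear_model.LinearRegression()
reg.fit(xTrain2D, yTrain)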
I have trained and saved an sklearn RandomForestClassifier in the following way:
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0, random_state=0, shuffle=False)
clf = RandomForestClassifier(max_depth=2, random_state=0).fit(X, y)
import joblib
joblib.dump(clf, "D:/mymodel.gz", compress=3)
The final model (as saved on the HDD) is about 6 GB in size.
When I try to load the model again using:
clf_loaded = joblib.load("D:/mymodel.gz")
pred = clf_loaded.predict(X)
I get a memory error even though my available RAM is just under 60 GB.
It seems that joblib.load() requires a huge amount of RAM in excess of the original file size (6 GB).
Q: What are alternative options to save an sklearn model locally and load it again without consuming massive RAM space?
As far as I know, pickle will not outperform joblib, and according to the docs it is not recommended for sklearn models.
I checked some other questions such as this, this, this, and this but could not find a workable solution.
I am trying to understand how scikit-learn pipelines work. I have some dummy data, and I am trying to fit a Random Forest model to the iris data. Here is some code:
from sklearn import datasets
from sklearn import svm
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import sklearn.externals
import joblib
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
# Load the Iris dataset
iris = datasets.load_iris()
Divide data into train and test and create a pipeline with 2 steps
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
pipeline = Pipeline([('feature_selection', SelectKBest(chi2, k=2)), ('classification', RandomForestClassifier()) ])
print(type(pipeline))
(112, 4) (38, 4) (112,) (38,)
<class 'sklearn.pipeline.Pipeline'>
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
However, pipeline.fit(X_train, y_train) works fine.
In a normal scenario, without any pipeline code, what I have usually done is take an ML model, apply fit_transform() to my training dataset, and then transform() to my unseen dataset to generate predictions.
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any method on my pipeline for printing a model summary.
But when I execute pipeline.fit_transform(X_train, y_train), I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'transform'
Indeed, RandomForestClassifier does not transform data because it is a model, not a transformer. Pipelines implement either transform or predict (and its variants) depending on whether the last estimator is a transformer or a model.
So, generally, you'll want to call just pipeline.fit(X_train, y_train); then, in testing or production, you'll call pipeline.predict(X_test) (or predict_proba, or ...), which internally will transform with the first step(s) and predict with the last step.
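For example, with the pipeline from the question (a minimal sketch):
# Fit feature selection and the forest in one call, then score the held-out data.
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)       # transforms X_test, then predicts
print(pipeline.score(X_test, y_test))   # mean accuracy on the test split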
How do I set up something similar using pipelines in sklearn? I want to SAVE my pipeline and then perform scoring by LOADING it back. Can I do it using pickle?
Yes; see sklearn Model Persistence for more details and recommendations.
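For instance, a minimal sketch using joblib, one of the options described there (the file name is just an example):
import joblib

# Persist the fitted pipeline (feature selection + random forest) as one object.
joblib.dump(pipeline, 'iris_pipeline.joblib')

# Later, in a scoring script: load it back and predict on new data.
loaded = joblib.load('iris_pipeline.joblib')
print(loaded.predict(X_test[:5]))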
Another thing is regarding the RF model itself. I can get a model summary using the RF model's methods, but I don't see any method on my pipeline for printing a model summary.
You can access individual steps of a pipeline in a few ways; see sklearn Pipeline accessing steps:
pipeline.named_steps.classification
pipeline['classification']
pipeline[-1]
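For example, to pull out the fitted random forest and print its usual summaries (a sketch; the attributes come from RandomForestClassifier itself and are available once the pipeline has been fitted):
rf = pipeline.named_steps['classification']   # same object as pipeline[-1]
print(rf.n_estimators)                        # number of trees in the forest
print(rf.feature_importances_)                # importances of the k selected features
print(rf.estimators_[0])                      # the first individual decision tree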
As we know, for the support vector machine we can use SVC as well as SGDClassifier with a hinge loss implementation. Is SGDClassifier with the hinge loss implementation faster than SVC? Why?
Links to both implementations in scikit-learn:
SVC
SGDClassifier
I read on the scikit-learn documentation page that SVC uses an algorithm from the libsvm library for optimization, while SGDClassifier uses SGD (obviously).
Maybe it is better to start trying some practical cases and read the code. Let's start...
First of all, if we read the documentation of SGDClassifier, it says that only the linear SVM is used:
Linear classifiers (SVM, logistic regression, a.o.) with SGD training
What if instead of using the usual SVC, we use the LinearSVC?
Similar to SVC with parameter kernel=’linear’, but implemented in terms of liblinear rather than libsvm, so it has more flexibility in the choice of penalties and loss functions and should scale better to large numbers of samples.
Let's add an example for the three types of algorithms:
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.svm import LinearSVC
from sklearn import datasets
import numpy as np

iris = datasets.load_iris()   # loaded but not actually used below

# Random toy data: 20000 samples with 2 features and a boolean label.
X = np.random.rand(20000, 2)
Y = np.random.choice(a=[False, True], size=20000)   # 1-D, to avoid a shape warning

svc = SVC(kernel='linear')
sgd = SGDClassifier(loss='hinge')   # hinge is already the default loss for SGDClassifier
svcl = LinearSVC(loss='hinge')      # LinearSVC defaults to squared_hinge, so set it explicitly
Using Jupyter and the %%time magic we get the execution times (you can measure them in similar ways in plain Python, but this is how I did it):
%%time
svc.fit(X, Y)
Wall time: 5.61 s
%%time
sgd.fit(X, Y)
Wall time: 24ms
%%time
svcl.fit(X, Y)
Wall time: 26.5ms
As we can see, there is a huge difference between SVC and the other two, while LinearSVC and SGDClassifier take more or less the same time. Their timings keep differing slightly, but this will always happen, since the two implementations do not share the same code.
If you are interested in each implementation, I suggest you read the code on GitHub (the new GitHub code-reading tool is really good!):
Code of LinearSVC
Code of SGDClassifier
I think it's because of the batch size used in SGD: if you used a full batch with the SGD classifier, it should take about the same time as the SVM, but changing the batch size can lead to faster convergence.
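Note that scikit-learn's SGDClassifier does not expose a batch-size parameter in fit, but you can emulate mini-batches yourself with partial_fit; a rough sketch of the idea (my own illustration, not part of the answer above):
from sklearn.linear_model import SGDClassifier
import numpy as np

X = np.random.rand(20000, 2)
y = np.random.randint(0, 2, size=20000)

clf = SGDClassifier(loss='hinge')
batch_size = 256               # illustrative value
classes = np.array([0, 1])     # partial_fit needs the full label set up front
for start in range(0, len(X), batch_size):
    clf.partial_fit(X[start:start + batch_size],
                    y[start:start + batch_size],
                    classes=classes)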
The sklearn SVM is computationally expensive compared to the sklearn SGD classifier with loss='hinge', hence we use the SGD classifier, which is faster. This holds only for a linear SVM; if we are using the 'rbf' kernel, SGD is not suitable.
I am solving the classic regression problem using the Python language and the scikit-learn library. It's simple:
from sklearn.ensemble import GradientBoostingRegressor

ml_model = GradientBoostingRegressor()
ml_params = {}
ml_model.fit(X_train, y_train)
where y_train is a one-dimensional array-like object.
Now I would like to expand the functionality of the task, to get not a single target value but a set of them. The training set of samples X_train will remain the same.
An intuitive solution to the problem is to train several models, where X_train is the same for all of them but y_train is specific to each model. This is definitely a working solution, but it seems inefficient to me.
When searching for alternatives, I came across concepts such as Multi-Target Regression. As far as I understand, such functionality is not implemented in scikit-learn.
How can I solve a multi-target regression problem in Python in an efficient way? Thanks!
It depends on the problem you are solving, the training data you have, and the algorithm you choose. It's really hard to suggest anything without knowing all the details. You could try a random forest as a starting point: it's a powerful and robust algorithm that is fairly resistant to overfitting when you don't have much data, and it can be used for multi-target regression. Here is a working example:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
X, y = make_regression(n_targets=2)
print('Feature vector:', X.shape)
print('Target vector:', y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print('Build and fit a regressor model...')
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print('Done. Score', score)
Output:
Feature vector: (100, 100)
Target vector: (100, 2)
Build and fit a regressor model...
Done. Score 0.4405974071273537
This algorithm natively supports multi-target regression. For estimators that don't, you can use MultiOutputRegressor, which simply fits one regressor per target.
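For example, to keep the GradientBoostingRegressor from the question (which on its own only supports a single target), a minimal sketch with MultiOutputRegressor:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_targets=2)

# One gradient boosting model is fitted per target column under the hood.
model = MultiOutputRegressor(GradientBoostingRegressor())
model.fit(X, y)
print(model.predict(X[:3]).shape)   # (3, 2): one prediction per target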
Another alternative to the random forest approach would be to use an adapted version of Support Vector Regression that fits multi-target regression problems. The advantage over fitting SVR with MultiOutputRegressor is that this method takes the underlying correlations between the multiple targets into account and hence should perform better.
A working implementation with a paper reference can be found here
I am trying to build decision trees and regression trees with Python. I am using sci-kit, but am open to alternatives.
What I don't understand about this library is whether a training and a validation subset can be provided, so that the library builds the model on the training subset, tests it on the validation subset, and stops splitting based on some rule (typically, when additional splits don't improve performance on the validation subset; this prevents overfitting).
For example, this is what the JMP software does (http://www.jmp.com/support/help/Validation_2.shtml#1016975).
I found no mention of how to use a validation subset in the official website (http://scikit-learn.org/stable/modules/tree.html), nor on the internet.
Any help would be most welcome! Thanks!
There is a fairly rich set of cross-validation routines and examples in the scikit-learn cross-validation section of the user guide.
Note that lots of progress seems to have been made in cross validation between scikit-learn versions 0.14 and 0.15, so I recommend upgrading to 0.15 if you haven't already.
As jme notes in his comment, some of the cross-validation capability has also been incorporated into the grid search and pipeline capabilities of scikit-learn.
For completeness in answering your question, here is a simple example; more advanced k-fold, leave-one-out, leave-p-out, shuffle-split, etc. are all available:
import numpy as np
from sklearn import cross_validation
from sklearn import datasets
from sklearn import svm
iris = datasets.load_iris()
iris.data.shape, iris.target.shape
((150, 4), (150,))
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
X_train.shape, y_train.shape
((90, 4), (90,))
X_test.shape, y_test.shape
((60, 4), (60,))
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
0.96...
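And for the k-fold variant mentioned above, a short sketch in the same style (it reuses the cross_validation module imported earlier; in recent scikit-learn releases these helpers live in sklearn.model_selection instead):
# 5-fold cross validation of the same linear SVC on the full iris data;
# each fold is held out once for scoring while the rest is used for training.
clf = svm.SVC(kernel='linear', C=1)
scores = cross_validation.cross_val_score(clf, iris.data, iris.target, cv=5)
print(scores)
print(scores.mean())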
I hope this helps... Good luck!