Adding and removing SVM parameters without having to totally retrain - python

I have a support vector machine trained on ~300,000 examples. Training this model takes roughly 1.5-2 hours, and I have pickled (serialized) it. Currently, I want to add or remove a couple of the parameters of the model. Is there a way to do this without having to retrain the entire model? I am using sklearn in Python.

If you are using SVC from sklearn then the answer is no. There is no way to do it; this implementation is purely batch-training based. If you are training a linear SVM using SGDClassifier from sklearn then the answer is yes, as you can simply start the optimization from the previous solution (when removing a feature, drop the corresponding weight; when adding one, initialize the new weight arbitrarily, e.g. to zero).
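As a rough sketch of that idea (the dataset, feature index, and hyperparameters below are invented for illustration), you can drop a column and its weight, then restart SGDClassifier from the old solution via the coef_init/intercept_init arguments of fit:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Initial training: hinge loss + L2 penalty makes this a linear SVM
clf = SGDClassifier(loss="hinge", penalty="l2", random_state=0)
clf.fit(X, y)

# Remove feature 5: drop the data column and the corresponding weight,
# then warm-start the optimization from the previous solution
drop = 5
X_reduced = np.delete(X, drop, axis=1)
coef_init = np.delete(clf.coef_, drop, axis=1)

clf2 = SGDClassifier(loss="hinge", penalty="l2", random_state=0)
clf2.fit(X_reduced, y, coef_init=coef_init, intercept_init=clf.intercept_)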

Related

How to add training data to the model after initial training?

I am trying to add data to my scikit-learn model after it has already been trained. For example, I have the data that I used in the beginning (there are about 250 examples). After that, I need to train the model one more time by calling the fit function, and so on. The only thing that came to my mind was to append the new values to the existing data array every time and train the model again, but this is resource-intensive and takes more time.
Is there another way to train the machine learning model?
model = LinearRegression().fit(test, result)
model.predict(task)
### and here I want to add some data, for example one or two examples like:
model.addFit(one_test, one_result)
The short answer in your case (using the sklearn.linear_model.LinearRegression model) is no: it is not possible to add one or two more examples and train without adding these to the original training set and fitting it all at the same time. Under the hood, the model simply uses Ordinary Least Squares (described here), which requires the complete matrix of training data on which to fit your model. However, this algorithm is very fast, and in the case of ~hundreds of training examples, it would be very quick to re-calculate the parameters of the model with each new couple of examples.
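For completeness, here is a minimal sketch of that suggestion (array names and shapes are invented): append the incoming examples to the training data and refit, which for a few hundred rows is effectively instantaneous:
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.random.rand(250, 3)
y_train = np.random.rand(250)
model = LinearRegression().fit(X_train, y_train)

# One or two new examples arrive: stack them onto the data and refit
X_new = np.random.rand(2, 3)
y_new = np.random.rand(2)
X_train = np.vstack([X_train, X_new])
y_train = np.concatenate([y_train, y_new])
model = LinearRegression().fit(X_train, y_train)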

In Leave One Out Cross Validation, How can I Use `shap.Explainer()` Function to Explain a Machine Learning Model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP), which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different, as in each iteration I am training on a different dataset (one participant's data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to explain the performance of a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and also several questions (e.g. this one) on SO, but I failed to find the answer to this problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply shap here for each model, there will be 250 outputs! Thus, I want to get a single output which presents the performance of the 250 models.
You seem to be training a model on 250 datapoints while doing LOOCV. This is about choosing a model with hyperparams that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparams (note, 250-fold LOOCV is already overkill; would you do that with 250,000 rows?); rather, you are trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to differ much from what a single random train/test split would give?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparams may be different, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
It doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
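If you do want a single output, one way (sketched below under assumed data and model names) is to explain each fold's model on its held-out row, stack the per-row SHAP values into one (250, n_features) matrix, and summarize that matrix once:
import numpy as np
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

X = np.random.rand(250, 10)  # placeholder for the 250 participants' features
y = np.random.rand(250)

rows = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor(n_estimators=50).fit(X[train_idx], y[train_idx])
    explainer = shap.Explainer(model)
    rows.append(explainer(X[test_idx]).values)  # SHAP values for the held-out row

shap_matrix = np.vstack(rows)                # one (250, n_features) matrix
mean_abs = np.abs(shap_matrix).mean(axis=0)  # one global importance per feature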

Machine learning parameter tuning using partitioned benchmark dataset

I know this is very basic, but I'm really confused and I would like to understand parameter tuning better.
I'm working on a benchmark dataset that is already partitioned into three splits (training, development, and testing), and I would like to tune my classifier's parameters using GridSearchCV from sklearn.
What is the correct partition to tune the parameters on? Is it the development split or the training split?
I've seen researchers in the literature mention that they "tuned the parameters using GridSearchCV on the development split"; another example is found here.
Do they mean they trained on the training split and then tested on the development split? Or do ML practitioners usually mean they perform GridSearchCV entirely on the development split?
I'd really appreciate a clarification. Thanks.
Usually in a 3-way split you train a model using the training set, then you validate it on the development set (also called the validation set) to tune hyperparameters, and then, after all the tuning is complete, you perform a final evaluation of the model on a previously unseen testing set (also known as the evaluation set).
In a two-way split you just have a train set and a test set, so you perform tuning and evaluation on the same test set.
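If the development split is fixed by the benchmark, one way to make GridSearchCV respect it (a sketch with invented data; PredefinedSplit is the relevant sklearn helper) is to mark training rows with -1 and development rows with 0:
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC

X_train, y_train = np.random.rand(80, 5), np.random.randint(0, 2, 80)
X_dev, y_dev = np.random.rand(20, 5), np.random.randint(0, 2, 20)

X = np.vstack([X_train, X_dev])
y = np.concatenate([y_train, y_dev])
# -1 = always in the training fold, 0 = in the development (validation) fold
test_fold = np.concatenate([-np.ones(len(X_train)), np.zeros(len(X_dev))])

search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=PredefinedSplit(test_fold))
search.fit(X, y)
print(search.best_params_)
# The untouched testing split is then used once for the final evaluation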

How to update an SVM model with new data

I have two data sets of different sizes:
1) Data set 1 is high-dimensional, with 4500 samples (sketches).
2) Data set 2 is low-dimensional, with 1000 samples (real data).
I assume that both data sets have the same distribution.
I want to train a non-linear SVM model using sklearn on the first data set (as a pre-training), and after that I want to update the model on a part of the second data set (to fit the model).
How can I do this kind of update in sklearn? How can I update an SVM model?
In sklearn you can do this only for a linear kernel, using SGDClassifier (with an appropriate selection of loss/penalty terms: loss should be hinge and penalty L2). Incremental learning is supported through the partial_fit method, and it is not implemented for either SVC or LinearSVC.
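A minimal sketch of that, with invented data shapes standing in for the two data sets:
import numpy as np
from sklearn.linear_model import SGDClassifier

X_sketches, y_sketches = np.random.rand(4500, 30), np.random.randint(0, 2, 4500)
X_real, y_real = np.random.rand(1000, 30), np.random.randint(0, 2, 1000)

# hinge loss + L2 penalty makes SGDClassifier a linear SVM
clf = SGDClassifier(loss="hinge", penalty="l2")
# classes must be given on the first call to partial_fit
clf.partial_fit(X_sketches, y_sketches, classes=np.array([0, 1]))

# Later: update the same model on (part of) the second data set
clf.partial_fit(X_real, y_real)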
Unfortunately, in practice, fitting an SVM in incremental fashion for such small datasets is rather useless. An SVM has an easily obtainable global solution, so you do not need pretraining of any form; in fact it should not matter at all, if you are thinking about pretraining in the neural-network sense. If correctly implemented, an SVM should completely forget the previous dataset. Why not learn on the whole data in one pass? This is what an SVM is supposed to do, unless you are working with some non-convex modification of SVM (then pretraining makes sense).
To sum up:
From a theoretical and practical point of view there is no point in pretraining an SVM. You can either learn only on the second dataset, or on both at the same time. Pretraining is only reasonable for methods which suffer from local minima (or hard convergence of any kind) and thus need to start near the actual solution to be able to find a reasonable model (like neural networks). SVM is not one of them.
You can use incremental fitting (although in sklearn it is very limited) for efficiency reasons, but for such small datasets you will be just fine fitting the whole dataset at once.

Break up Random forest classification fit into pieces in python?

I have almost 900,000 rows of data that I want to run through scikit-learn's Random Forest Classifier. The problem is that when I try to create the model my computer freezes completely, so what I want to try is fitting the model 50,000 rows at a time, but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
rfc.fit(X, Y)
# What I want is something like:
model = rfc.fit(X.iloc[0:50000], Y.iloc[0:50000])
model = rfc.fit(X.iloc[0:100000], Y.iloc[0:100000])
model = rfc.fit(X.iloc[0:150000], Y.iloc[0:150000])
# ... and so on
Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine and using n_jobs=-1 (or a combination of all three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
# First build 100 trees on the first chunk
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.iloc[0:50000], Y.iloc[0:50000])
# Add another 100 estimators on the second chunk
clf.set_params(n_estimators=200)
clf.fit(X.iloc[0:100000], Y.iloc[0:100000])
# And so forth...
clf.set_params(n_estimators=300)
clf.fit(X.iloc[0:150000], Y.iloc[0:150000])
Another possibility is to fit a new classifier on each chunk and then simply average the predictions from all classifiers, or merge the trees into one big random forest as described here.
Another method, similar to the one linked in Andreus' answer, is to grow the trees in the forest individually.
I did this a while back: basically I trained a number of DecisionTreeClassifiers one at a time on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single Python script. If you use this method and run into that issue, there's a work-around I posted in the linked question.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle

iris = load_iris()
# Shuffle so every slice contains samples from each class, which a sequence
# of warm_start calls to fit requires (see the note below)
X, y = shuffle(iris.data, iris.target, random_state=0)

# Grow the forest 10 trees at a time, each batch fit on a different slice
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[50:100], y[50:100])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[100:150], y[100:150])
print(rfc.score(X, y))
Below is the differentiation between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance, as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.
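A compact contrast of the two, using assumed toy data (a random forest for warm_start, an SGDClassifier for partial_fit):
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

X, y = np.random.rand(200, 4), np.random.randint(0, 2, 200)

# warm_start: (roughly) the same data, parameters change between calls to fit
rf = RandomForestClassifier(n_estimators=10, warm_start=True).fit(X, y)
rf.n_estimators = 20
rf.fit(X, y)  # keeps the first 10 trees and adds 10 more

# partial_fit: parameters stay fixed, the mini-batch of data changes
sgd = SGDClassifier()
sgd.partial_fit(X[:100], y[:100], classes=np.array([0, 1]))
sgd.partial_fit(X[100:], y[100:])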
Some algorithms in scikit-learn implement a partial_fit() method, which is what you are looking for. There are random forest algorithms that can do this, but I believe the scikit-learn implementation is not one of them.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets and assemble a really big forest at the end:
Combining random forest models in scikit learn
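The approach from that linked question, roughly sketched (data and chunk sizes invented): fit a separate forest per chunk and concatenate their estimators_ lists into one ensemble:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, random_state=0)
chunks = [(X[i:i + 1000], y[i:i + 1000]) for i in range(0, 3000, 1000)]

forests = [RandomForestClassifier(n_estimators=50, random_state=k).fit(Xc, yc)
           for k, (Xc, yc) in enumerate(chunks)]

# Merge: keep the first forest and append the other forests' trees to it
combined = forests[0]
for f in forests[1:]:
    combined.estimators_ += f.estimators_
combined.n_estimators = len(combined.estimators_)
print(combined.score(X, y))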
