How to add training data to the model after initial training? - python

I am trying to add data to my scikit-learn model after it has already been trained. For example, I have the data I used at the beginning (about 250 samples). Later I need to train this model again with a few more examples, and so on. The only approach that came to mind was to append the new values to the existing data array and retrain the model from scratch each time, but this is very resource-intensive and takes more time.
Is there another way to train the machine learning model?
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit(test, result)
model.predict(task)
### and here I want to add some data, for example one or two examples like:
model.addFit(one_test, one_result)  # no such method exists -- this is what I am looking for

The short answer in your case (using the sklearn.linear_model.LinearRegression model) is no: it is not possible to add one or two more examples and train on them without appending them to the original training set and fitting it all at once. Under the hood, the model simply uses Ordinary Least Squares (described here), which requires the complete matrix of training data on which to fit your model. However, this algorithm is very fast, and with on the order of hundreds of training examples it is very quick to re-calculate the parameters of the model with each new couple of examples.
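A minimal sketch of that re-fitting approach (the names test, result, one_test, one_result are carried over from the question; one_test is assumed to be a 2-D array of shape (n_new, n_features)):

import numpy as np
from sklearn.linear_model import LinearRegression

# initial fit on the full training set
model = LinearRegression().fit(test, result)

# when one or two new examples arrive, append them to the stored arrays
# and simply re-fit; with a few hundred rows this is effectively instantaneous
test = np.vstack([test, one_test])
result = np.concatenate([result, one_result])
model = LinearRegression().fit(test, result)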

Related

Why does an xgboost BDT model constructed with the histogram tree method depend on the training data ordering?

I was using Python xgboost to train some models (with binary logistic) on some data (50k events in total) and I used the histogram tree method for the training (tree_method="hist"). I shuffled the events in the data and used them for the training. It turned out that the models built are slightly different depending on the order of the events, and a result based on the corresponding prediction on a validation set (different from the training set) could vary by up to 5%. As a double check I also used lightgbm, and this effect is present there as well. It seems to be a problem of the histogram method, because if I use the exact method in xgboost (tree_method="exact") the problem disappears.
Does anyone know why a BDT model based on the histogram method depends on the event order?
I tried to look for the reference paper but was totally lost.
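For reference, a rough sketch of the kind of ordering check described above (X_train, y_train, X_val, y_val are assumed placeholders; the parameters are illustrative, not the original setup):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
perm = rng.permutation(len(X_train))

params = {"objective": "binary:logistic", "tree_method": "hist", "seed": 42}

# train once on the original ordering and once on a shuffled ordering
d_orig = xgb.DMatrix(X_train, label=y_train)
d_perm = xgb.DMatrix(X_train[perm], label=y_train[perm])
d_val = xgb.DMatrix(X_val, label=y_val)

m_orig = xgb.train(params, d_orig, num_boost_round=200)
m_perm = xgb.train(params, d_perm, num_boost_round=200)

# with tree_method="hist" the two predictions can differ slightly;
# with tree_method="exact" they should agree
print(np.abs(m_orig.predict(d_val) - m_perm.predict(d_val)).max())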

In Leave One Out Cross Validation, How can I Use `shap.Explainer()` Function to Explain a Machine Learning Model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP), which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of Leave One Out Cross Validation (LOOCV), the ML model will be different, as in each iteration I am training on a different dataset (one participant's data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and also several questions (e.g. this one) on SO, but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply shap here for each model, there will be 250 outputs! Thus, I want to get a single output which presents the performance of the 250 models.
You seem to train a model on 250 data points while doing LOOCV. This is about choosing a model with hyperparams that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparams -- note, 250-fold LOOCV is already overkill; will you do that with 250'000 rows? -- you are rather trying to understand which features influence the output, in what direction and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values. But do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of Leave One Out Cross Validation (LOOCV), the ML model will be different, as in each iteration I am training on a different dataset (one participant's data will be different).
In each iteration of LOOCV the model is still the same (same features; hyperparams may be different, depending on your definition of iteration). It's still the same dataset (same features).
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
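If you do still want a single aggregate picture across all LOOCV folds, a sketch along these lines is possible (X and y are assumed to be numpy arrays; since each held-out participant is explained exactly once, the per-fold SHAP rows can be stacked into one matrix for a single summary plot):

import numpy as np
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

shap_rows, test_rows = [], []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor().fit(X[train_idx], y[train_idx])  # refit (and, if needed, reselect features) per fold
    explainer = shap.Explainer(model)
    shap_rows.append(explainer(X[test_idx]).values)          # SHAP values for the single held-out sample
    test_rows.append(X[test_idx])

# one row of SHAP values per participant, combined into a single summary
all_shap = np.vstack(shap_rows)
all_X = np.vstack(test_rows)
shap.summary_plot(all_shap, all_X)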

Question about fine-tuning a model to increase the number of classes w/ additional data using TensorFlow Custom Object Detection

Using TensorFlow's Custom Object Detection API w/ SSD MobileNet V2 FPNLite 320x320 as the base, I was able to train my model to successfully detect classes A and B using Training Data 1 (about 200 images). This performed well on Test Set 1, which only has images of classes A and B.
I wanted to add several classes to the model, so I constructed a separate dataset, Training Data 2 (about 300 images). This dataset contains labeled data for class B and the new classes C, D and E. However, it does NOT include data for class A. Upon training the model on this data, it performed well on Test Set 2, which contained only images of B, C, D and E (however, the accuracy on B did not go up despite the extra data).
Concerned, I checked the accuracy of the model on Test Set 1 again, and as I had assumed, the model didn't recognize class A at all. In this case I'm assuming I didn't actually refine the model but instead retrained the model completely.
My Question: Am I correct in assuming I cannot refine the model on a completely separate set of data, and instead if I want to add more classes to my trained model that I must combine Training Set 1 and Training Set 2 and train on the entirety of the data?
Thank you!
It mostly depends on your hyperparameters, namely your learning rate and the number of epochs trained. Higher learning rates will make the model forget the old data faster. Also, be sure not to overfit your data; keep a validation set as well. Models that have overfit the training data tend to be very sensitive to weight (and data) perturbations.
TLDR: if not trained on all the data, ML models tend to forget old data in favor of new data.
There are a lot of "moving parts". I propose the following:
Take the "SSD MobileNet V2 FPNLite 320x320" as a base model without its last classification layer (argument include_top=False when loading the model), and freeze its parameters with basemodel.trainable=False.
Add a new prediction layer with prediction_layer=tf.keras.layers.Dense(1) and complete the other required steps (details step by step at https://www.tensorflow.org/tutorials/images/transfer_learning).
After the procedure above, verify that you understand which parameters of the new network (including the "old" convolutional part and your own new prediction layer) are trainable and which are not. Change the hyperparameters if needed.
Next, train the network using standard procedures.
Directly use the final number of classes from your idea (25). If you have no data yet for all classes, do not worry: generate some random images for the purpose, and of course take into account that the results are not valid for the classes with no appropriate data.
For simplicity, divide the data -- in principle independently of the number of classes -- into training and test data, and nothing more complicated at first. As the amount of data increases, the statistics will diminish problems with sampling. And while training, monitor how the amount of data increases the performance of the classification.
So -- in a nutshell -- 1) build the network, 2) select which parameters to train, 3) train with one dataset, and 4) test with another; a rough sketch of steps 1 and 2 follows below.
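A minimal Keras sketch of steps 1 and 2 above, following the transfer-learning tutorial linked earlier (the input shape and the 25-class head are assumptions made in this answer, not the original Object Detection API training config, which is normally driven by its pipeline.config instead):

import tensorflow as tf

NUM_CLASSES = 25  # final number of classes assumed in this answer

# step 1: load a MobileNetV2 backbone without its classification head and freeze it
basemodel = tf.keras.applications.MobileNetV2(
    input_shape=(320, 320, 3), include_top=False, weights="imagenet")
basemodel.trainable = False

# step 2: add a new, trainable prediction head on top of the frozen features
model = tf.keras.Sequential([
    basemodel,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"])

# model.summary() shows which parameters are trainable and which are frozen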
And finally, a direct answer to the question in the title and at the end of the question:
-In my experience, first get all the performance you can out of the base model by training only the last layers of the network. Once you are sure no more performance can be found this way, begin to fine-tune the convolutional layers, tuning the hyperparameters carefully.
-You can refine the model using only your own new data; this is the special benefit and art of transfer learning.

How to use data augmentation with cross validation

I need to apply data augmentation to what would be my training data in each cross-validation split. The problem is that I am using cross-validation, so I can't find a reference for how to adjust my model to use data augmentation. My cross-validation indexes my data somewhat by hand.
There are articles and general content about data augmentation, but very little of it generalizes to cross-validation with data augmentation.
I need to apply data augmentation (simply rotating and adding zoom) to the training data, cross-validate for the best weights and save them, but I wouldn't know how.
This example can be copy-pasted for better reproducibility; in short, how would I employ data augmentation and also save the weights with the best accuracy?
When training machine learning models, you should not test the model on the samples used during the training phase (if you care about realistic results).
Cross validation is a method for estimating model accuracy. The essence of the method is that you split your available labeled data into several parts (or folds), then use one part as a test set while training the model on all the rest, repeating this procedure for each part in turn. This way you essentially test your model on all the available data, without hurting training too much. There is an implicit assumption that the data distribution is the same in all folds. As a rule of thumb, the number of cross-validation folds is usually 5 or 7. This depends on the amount of labeled data at your disposal: if you have lots of data, you can afford to leave less data to train the model and increase the test set size. The higher the number of folds, the better the accuracy estimate you can achieve, as the training part increases, and the more time you have to invest in the procedure. In the extreme case you have a leave-one-out procedure: train on everything but one single sample, effectively making the number of folds equal to the number of data samples.
So for a 5-fold CV you train 5 different models, which have a large overlap of training data. As a result, you should get 5 models that have similar performance. (If that is not the case, you have a problem ;) ) After you have the test results, you throw away all 5 models you have trained and train a new model on all the available data, assuming its performance will be the mean of the values you got during the CV phase.
Now about the augmented data. You should not allow data obtained by augmenting the training part to leak into the test set. Each data point created from the training part should be used only for training; the same applies to the test set.
So you should split your original data into k folds (for example using KFold or GroupKFold), then create augmented data for each fold and concatenate it with the original. Then you follow the regular CV procedure; see the sketch below.
In your case, you can simply pass each group (such as x_group1) through the augmenting procedure before concatenating them, and you should be fine.
Please note that splitting the data in a linear way can lead to unbalanced data sets and is not the best way of splitting the data. You should consider the functions I've mentioned above.
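A minimal sketch of that split-then-augment order, assuming numpy arrays X and y plus placeholder augment() and build_model() functions (none of these names come from the original code):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in kf.split(X):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # augment ONLY the training fold (rotation, zoom, ...) and add it to the originals;
    # the test fold stays untouched so augmented copies never leak into evaluation
    X_aug, y_aug = augment(X_train, y_train)   # placeholder augmentation function
    X_train = np.concatenate([X_train, X_aug])
    y_train = np.concatenate([y_train, y_aug])

    model = build_model()                      # placeholder model constructor
    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))         # evaluate on the un-augmented test fold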

How to implement incremental learning using the naive Bayes algorithm in Python?

I have implemented an ML model using the naive Bayes algorithm, and I want to implement incremental learning. The issue I am facing: when I train my model, preprocessing generates 1500 features; after a month, using a feedback mechanism, I want to train my model with new data which might contain some new features, possibly fewer or more than the 1500 of my previous dataset. If I use fit_transform to get the new features, my existing feature set gets lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # re-fitting replaces my older feature set
classifier = GaussianNB()
classifier.partial_fit(X, y)
# does not fit because the new feature count is not equal to the previous feature count
You could use just transform() for the CountVectorizer() and then partial_fit() for naive Bayes, as in the following, for incremental learning. Remember, transform extracts the same set of features that you learned from the training dataset.
X = cv.transform(corpus).toarray()  # same vocabulary as the original fit; GaussianNB needs a dense array
classifier.partial_fit(X, y)
But you cannot revamp the features from scratch and continue the incremental learning; the number of features needs to stay consistent for any model to learn incrementally.
If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means creating a new model for the entire available data. You could adopt this if your dataset is small enough to keep in memory!
You cannot do this with CountVectorizer. You will need to fix the number of features for partial_fit() in GaussianNB.
You can, however, use a different preprocessor (in place of CountVectorizer) that maps the inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which is recommended by the scikit-learn authors for exactly the scenario you mention. When initializing it, you need to specify the number of features you want; in most cases the default value is enough to avoid collisions between the hashes of different words. You may try experimenting with different numbers. Try it and check the performance. If it is not on par with CountVectorizer, you can do what @AI_Learning suggests and build a new model on the whole data (old+new).
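A minimal sketch of the HashingVectorizer route (corpus, y, new_corpus, new_y are assumed placeholders; n_features is kept small here only because GaussianNB needs dense arrays, and the first partial_fit call must receive the full list of classes):

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

# fixed-size feature space: any future text hashes into the same 2**10 columns,
# so the feature count never changes between batches
hv = HashingVectorizer(n_features=2**10, alternate_sign=False)
classifier = GaussianNB()

# initial batch
X = hv.transform(corpus).toarray()
classifier.partial_fit(X, y, classes=np.unique(y))  # classes are required on the first call only

# a month later: new feedback data, possibly containing unseen words
X_new = hv.transform(new_corpus).toarray()
classifier.partial_fit(X_new, new_y)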
