I'm building a naive Bayes classifier using scikit-learn, and so far things are going well when I have a fixed body of data to train on. However, for the particular project I'm working on, there will be new data coming in every day that ideally would become part of the training set.
I'm aware that you can pickle the classifier to store it for later use, but is there any way to "update" the classifier with new data?
Re-training the classifier from scratch every day is obviously an option, but that would require pulling a lot of historical data each time, over an ever-growing time period.
Use the partial_fit method on the naive Bayes estimator.
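A minimal sketch of that, assuming a bag-of-words style feature matrix and using MultinomialNB as the naive Bayes estimator (the arrays here are illustrative placeholders). Note that the first call to partial_fit must be given every class the model will ever see:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()

# Day 1: the first call to partial_fit must list all possible classes.
X_day1 = np.array([[2, 1, 0], [0, 1, 3]])
y_day1 = np.array([0, 1])
clf.partial_fit(X_day1, y_day1, classes=np.array([0, 1]))

# Day 2 and onwards: just call partial_fit on the new batch; no historical data needed.
X_day2 = np.array([[1, 0, 2]])
y_day2 = np.array([1])
clf.partial_fit(X_day2, y_day2)

print(clf.predict(np.array([[0, 1, 2]])))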
Related
I have implemented an ML model using the naive Bayes algorithm, and I want to add incremental learning. The issue I am facing: when I train my model, preprocessing generates 1500 features. A month later, using a feedback mechanism, I want to train the model on new data, which might contain new features, possibly fewer or more than the original 1500. If I use fit_transform to get the new features, my existing feature set gets lost.
I have been using partial_fit, but the issue with partial_fit is that it requires the same number of features as the previous model. How do I make it learn incrementally?
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()  # re-fitting the vectorizer replaces my older feature set
classifier = GaussianNB()
classifier.partial_fit(X, y)
# fails because the new feature count is not equal to the previous feature count
You could use just transform() on the CountVectorizer and then partial_fit() on the naive Bayes classifier for incremental learning, like the following. Remember, transform() extracts the same set of features that you learned from the original training dataset.
X = cv.transform(corpus).toarray()  # same feature space as before; dense array for GaussianNB
classifier.partial_fit(X, y)
But you cannot revamp the features from scratch and continue the incremental learning. In other words, the number of features needs to stay consistent for any model to do incremental learning.
If you think your new dataset has significantly different features compared to the older one, use cv.fit_transform() and then classifier.fit() on the complete dataset (both old and new), which means creating a new model for all the available data. You could adopt this approach if your dataset is small enough to keep in memory!
You cannot do this with CountVectorizer. You will need a fixed number of features for partial_fit() to work with GaussianNB.
Now, you can use a different preprocessor (in place of CountVectorizer) which maps the inputs (old and new) to the same feature space. Have a look at HashingVectorizer, which the scikit-learn authors recommend for exactly the scenario you mentioned. When initializing it, you will need to specify the number of features you want; in most cases the default value is enough to avoid collisions between the hashes of different words, but you can experiment with different numbers. Try it and check the performance. If it is not on par with CountVectorizer, you can do what @AI_Learning suggests and fit a new model on the whole data (old + new).
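Here is a hedged sketch of that idea (the corpus, labels, and n_features value are illustrative placeholders): HashingVectorizer fixes the dimensionality up front, so every future batch, even one containing unseen words, lands in the same feature space and partial_fit keeps working.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import GaussianNB

# n_features fixes the dimensionality once and for all; 2**12 is just an example.
vectorizer = HashingVectorizer(n_features=2**12)
classifier = GaussianNB()

# First batch (placeholder corpus and labels).
corpus_old = ["good product", "bad service"]
y_old = np.array([1, 0])
X_old = vectorizer.transform(corpus_old).toarray()  # GaussianNB needs dense input
classifier.partial_fit(X_old, y_old, classes=np.array([0, 1]))

# A later batch with previously unseen words still hashes into the same space.
corpus_new = ["excellent support", "terrible quality"]
y_new = np.array([1, 0])
X_new = vectorizer.transform(corpus_new).toarray()
classifier.partial_fit(X_new, y_new)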
I am trying to use sklearn to build an Isolation Forest machine learning program to go through a ton of data. I can only store the past 10 days of data, so I was wondering:
When I use the "fit" function on new data that comes in, does it refit the model while still taking into account what it learned from the old data (which it no longer has access to)? Or does it completely recreate the model?
In general, only the estimators implementing the partial_fit method are able to do this. Unfortunately, IsolationForest is not one of them.
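If in doubt, you can check for the method directly; a quick illustrative snippet:

from sklearn.ensemble import IsolationForest
from sklearn.linear_model import SGDClassifier

print(hasattr(IsolationForest(), "partial_fit"))  # False: calling fit() again rebuilds the model from the new data only
print(hasattr(SGDClassifier(), "partial_fit"))    # True: supports incremental updates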
I have a real-time data feed of patient health data that I connect to with Python. I want to run some sklearn algorithms over this data feed so that I can predict in real time if someone is going to get sick. Is there a standard way in which one connects real-time data to sklearn? I have traditionally had static datasets and never an incoming stream, so this is quite new to me. If anyone has some general rules/processes/tools for this, that would be great.
With most algorithms, training is slow and predicting is fast. Therefore it is better to train offline using training data, and then use the trained model to predict each new case in real time.
Obviously you might decide to train again later if you acquire more/better data. However there is little benefit in retraining after every case.
It is feasible to train the model on a static dataset and predict classifications for incoming data with that model. Retraining the model on each new set of patient data is not so feasible; it also breaks the train/test paradigm for evaluating an ML model.
Trained models can be saved to a file and imported into the code used for real-time prediction.
In Python's scikit-learn, this is done via the pickle package.
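For example, a minimal sketch of that workflow (the model, data, and file name are illustrative placeholders):

import pickle
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Offline training script: fit on historical data and save the model.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Real-time prediction script: load once, then predict on each incoming record.
with open("model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict(X[:1]))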
In R, the equivalent is saving the model object with saveRDS.
yay... my first time answering an ML question!
I am scraping approximately 200,000 websites, looking for certain types of media posted on the websites of small businesses. I have a pickled LinearSVC, which I've trained to predict the probability that a link found on a web page contains media of the type that I'm looking for, and it performs rather well (overall accuracy around 95%). However, I would like the scraper to periodically update the classifier with new data as it scrapes.
So my question is, if I have loaded a pickled sklearn LinearSVC, is there a way to add in new training data without re-training the whole model? Or do I have to load all of the previous training data, add the new data, and train an entirely new model?
You cannot add data to an SVM and get the same result as if you had added it to the original training set. You can either retrain with the extended training set, starting from the previous solution (which should be faster), or train on the new data only and completely diverge from the previous solution.
There are only a few models that can do what you would like to achieve here, for example Ridge Regression or Linear Discriminant Analysis (and their kernelized counterparts, Kernel Ridge Regression or Kernel Fisher Discriminant, or "extreme" counterparts, ELM or EEM), which have the property of being able to add new training data "on the fly".
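A hedged sketch of the straightforward "retrain on the extended training set" route with LinearSVC, then re-pickling the result (the file names and arrays are illustrative placeholders; it assumes you kept the old training data around):

import pickle
import numpy as np
from sklearn.svm import LinearSVC

X_old = np.load("training_features.npy")    # previously used training data (assumed saved)
y_old = np.load("training_labels.npy")
X_new = np.random.rand(50, X_old.shape[1])  # placeholder for newly scraped, labelled links
y_new = np.random.randint(0, 2, 50)

clf = LinearSVC()
clf.fit(np.vstack([X_old, X_new]), np.concatenate([y_old, y_new]))

with open("link_classifier.pkl", "wb") as f:
    pickle.dump(clf, f)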
I have two data sets of different sizes.
1) Data set 1 is high-dimensional, with 4500 samples (sketches).
2) Data set 2 is low-dimensional, with 1000 samples (real data).
I assume that both data sets have the same distribution.
I want to train a non-linear SVM model using sklearn on the first data set (as pre-training), and after that I want to update the model on part of the second data set (to fine-tune it).
How can I implement this kind of update in sklearn? How can I update an SVM model?
In sklearn you can do this only for a linear kernel, using SGDClassifier (with an appropriate selection of loss/penalty terms: the loss should be hinge and the penalty L2). Incremental learning is supported through the partial_fit method, which is not implemented for either SVC or LinearSVC.
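A hedged sketch of that approach, with placeholder arrays standing in for the two data sets: an SVM-like linear model via SGDClassifier, pre-trained on data set 1 and then updated on part of data set 2 with partial_fit.

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss="hinge", penalty="l2")  # hinge + L2 = linear SVM objective

X_sketches = np.random.rand(4500, 128)           # placeholder for data set 1 (sketches)
y_sketches = np.random.randint(0, 2, 4500)
clf.partial_fit(X_sketches, y_sketches, classes=np.array([0, 1]))

X_real = np.random.rand(300, 128)                # placeholder for part of data set 2 (real data)
y_real = np.random.randint(0, 2, 300)
clf.partial_fit(X_real, y_real)                  # incremental update; linear kernel only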
Unfortunately, in practice, fitting an SVM incrementally on such small datasets is rather useless. An SVM has an easily obtainable global solution, so you do not need pretraining of any form; in fact, it should not matter at all if you are thinking about pretraining in the neural network sense. If correctly implemented, an SVM will completely forget the previous dataset. Why not learn on the whole data in one pass? This is what SVM is supposed to do. Unless you are working with some non-convex modification of SVM (then pretraining makes sense).
To sum up:
From a theoretical and practical point of view, there is no point in pretraining an SVM. You can either learn only on the second dataset, or on both at the same time. Pretraining is only reasonable for methods that suffer from local minima (or hard convergence of any kind) and thus need to start near the actual solution to be able to find a reasonable model (like neural networks). SVM is not one of them.
You can use incremental fitting (although in sklearn it is very limited) for efficiency reasons, but for such a small dataset you will be just fine fitting the whole dataset at once.
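If you do go the simple route, a sketch (again with placeholder arrays) is just concatenating both data sets and fitting a kernel SVM once:

import numpy as np
from sklearn.svm import SVC

X_sketches = np.random.rand(4500, 128)  # placeholder for data set 1
y_sketches = np.random.randint(0, 2, 4500)
X_real = np.random.rand(1000, 128)      # placeholder for data set 2
y_real = np.random.randint(0, 2, 1000)

model = SVC(kernel="rbf")
model.fit(np.vstack([X_sketches, X_real]), np.concatenate([y_sketches, y_real]))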