How to continuously train our pre-trained model on real time data? - python

I have some sensors that collect data from a cement factory and send it to AWS IoT. The data is then run through a pre-trained model, which predicts the quality of the cement from several parameters. The data arrives at one-second intervals.
Since the data arrives in real time, I want to train the model incrementally as it comes in.
Can anybody suggest how to train the model continuously?

You could aggregate a certain number of training samples and then use .partial_fit() to update your model.
.partial_fit() is the incremental learning option in scikit-learn; it is available only on estimators that implement it, such as SGDRegressor or SGDClassifier.
If your incremental data does not fit in RAM, then it's worth trying the dask-ml wrapper for incremental learning.
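A minimal sketch of the buffering-plus-.partial_fit() idea described above. The three features, the `next_batch` helper, and the batch size of 60 (roughly one minute of one-second readings) are assumptions for illustration, not details from the question; synthetic data stands in for the AWS IoT feed.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])  # hypothetical sensor-to-quality relationship


def next_batch(n=60):
    """Hypothetical stand-in for n buffered one-second sensor readings."""
    X = rng.normal(size=(n, 3))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y


scaler = StandardScaler()             # also supports partial_fit
model = SGDRegressor(random_state=0)  # linear model trained by SGD

for _ in range(200):
    X, y = next_batch()
    scaler.partial_fit(X)                      # update running mean/variance
    model.partial_fit(scaler.transform(X), y)  # one pass over this batch only

# Evaluate on data the model has not seen
X_test, y_test = next_batch(500)
r2 = model.score(scaler.transform(X_test), y_test)
print(f"R^2 on fresh data: {r2:.3f}")
```

Each call to .partial_fit() touches only the current batch, so memory use stays constant no matter how long the stream runs.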

Related

How to partially train a pre-trained model on additional data?

In my case, I would like to tune/adjust the model's parameter values weekly.
I pre-trained the model with Keras on 100K data rows and saved it.
Then, as new data is collected (10K rows), I need to tune the model parameters but don't want to retrain on the whole dataset (110K).
How can I fit the model on just the new data? load model -> model.fit(10K_data)?
Yes, that is correct: you train only on the new dataset (10K) with model.fit(10K_data). I recommend reducing the learning rate for the retraining, since you only want a minor update to the parameters while keeping the earlier learning intact (that is, leveraging the earlier learning).
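A rough sketch of the load -> fit-with-a-smaller-learning-rate workflow, assuming TensorFlow's bundled Keras. The tiny network, the synthetic data, the file name, and the learning rates 1e-3 and 1e-4 are all placeholders, not values from the question. Recompiling is just one way to change the learning rate; note that it also resets the optimizer state.

```python
import os
import tempfile

import numpy as np
from tensorflow import keras

rng = np.random.default_rng(0)


def make_data(n):
    """Synthetic stand-in for the real rows (hypothetical 4-feature regression)."""
    X = rng.normal(size=(n, 4)).astype("float32")
    w = np.array([1.0, -2.0, 0.5, 3.0], dtype="float32")
    return X, (X @ w).reshape(-1, 1)


# Original training run (a small stand-in for the 100K rows)
X_old, y_old = make_data(1000)
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
model.fit(X_old, y_old, epochs=3, batch_size=32, verbose=0)

path = os.path.join(tempfile.mkdtemp(), "pretrained.keras")  # hypothetical file name
model.save(path)

# Weekly update: load, lower the learning rate, fit on the new rows only
loaded = keras.models.load_model(path)
weights_before = [w.copy() for w in loaded.get_weights()]
loaded.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mse")

X_new, y_new = make_data(100)  # stand-in for the 10K new rows
history = loaded.fit(X_new, y_new, epochs=3, batch_size=32, verbose=0)
print("loss per epoch on new data:", history.history["loss"])
```

Holding out part of the old data and checking the loss on it after each weekly update is a simple way to see whether the model is drifting away from what it learned earlier.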

Will it lead to Overfitting / Curse of Dimensionality

Dataset contains :
15000 Observations/Rows
3000 Features/Columns
Can I train a machine learning model on this dataset?
Yes, you can apply an ML model, but first you need to understand your problem statement and what each of the features in the dataset means. If the dataset is large, try splitting it into two clusters, or take a smaller sample to analyze what your data is telling you.
That is where population and sampling come into practical use.
You have to check whether the accuracy on the training set and on the test set is about the same. If it is not, your model is memorizing instead of learning, and this is where regularization comes into the picture.
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
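The sanity check described above could look roughly like this. The data is synthetic and much smaller than 15000 x 3000, and logistic regression is only an example model; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in: many features, only a few of them informative
X, y = make_classification(
    n_samples=500, n_features=100, n_informative=10, random_state=0
)

# C is the inverse regularization strength; smaller C means stronger regularization
model = LogisticRegression(C=1.0, max_iter=1000)

# 5-fold cross-validation, keeping the training scores for comparison
cv = cross_validate(model, X, y, cv=5, return_train_score=True)

train_acc = cv["train_score"].mean()
val_acc = cv["test_score"].mean()
gap = train_acc - val_acc

print(f"mean train accuracy:      {train_acc:.3f}")
print(f"mean validation accuracy: {val_acc:.3f}")
print(f"gap (train - validation): {gap:.3f}")
```

A large gap between the two numbers suggests overfitting; low scores on both suggest the model is underfitting or the features carry little signal.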

Real time data using sklearn

I have a real time data feed of health patient data that I connect to with python. I want to run some sklearn algorithms over this data feed so that I can predict in real time if someone is going to get sick. Is there a standard way in which one connects real time data to sklearn? I have traditionally had static datasets and never an incoming stream so this is quite new to me. If anyone has sort of some general rules/processes/tools used that would be great.
With most algorithms training is slow and predicting is fast. Therefore it is better to train offline using training data; and then use the trained model to predict each new case in real time.
Obviously you might decide to train again later if you acquire more/better data. However there is little benefit in retraining after every case.
It is feasible to train the model on a static dataset and then use it to classify incoming data. Retraining the model with each new set of patient data is much less practical, and it also breaks the train/test separation used to evaluate an ML model.
Trained models can be saved to a file and loaded in the code used for real-time prediction.
In Python with scikit-learn, this is done with the pickle module.
In R, saveRDS writes the model object to a file (an .rds file; save() is the function that produces .rda files).
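A small sketch of the train-offline, predict-online pattern with pickle. The random forest, the five features, the file name, and the `predict_one` helper are hypothetical; random numbers stand in for the patient feed.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Offline step: train on a static dataset and save the model to disk
X_train = rng.normal(size=(300, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical path
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Online step: load the model once, then predict each record as it arrives
with open(model_path, "rb") as f:
    live_model = pickle.load(f)


def predict_one(record):
    """Predict the class of a single incoming record (1-D sequence of features)."""
    row = np.asarray(record, dtype=float).reshape(1, -1)
    return int(live_model.predict(row)[0])


incoming = rng.normal(size=(10, 5))  # stand-in for the live feed
predictions = [predict_one(r) for r in incoming]
print(predictions)
```

Only unpickle files you created yourself, and load them with the same scikit-learn version that was used for training; pickle can execute arbitrary code and is not guaranteed to be portable across versions.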
yay... my first answering a ML question!

add training data to existing LinearSVC

I am scraping approximately 200,000 websites, looking for certain types of media posted on the websites of small businesses. I have a pickled LinearSVC, which I've trained to predict the probability that a link found on a web page contains media of the type that I'm looking for, and it performs rather well (overall accuracy around 95%). However, I would like the scraper to periodically update the classifier with new data as it scrapes.
So my question is, if I have loaded a pickled sklearn LinearSVC, is there a way to add in new training data without re-training the whole model? Or do I have to load all of the previous training data, add the new data, and train an entirely new model?
You cannot add data to an SVM and get the same result as if you had added it to the original training set. You can either retrain on the extended training set starting from the previous solution (which should be faster), or train on the new data only and diverge completely from the previous solution.
There are only a few models that can do what you would like to achieve here, for example Ridge Regression or Linear Discriminant Analysis (and their kernelized counterparts, Kernel Ridge Regression or Kernel Fisher Discriminant, or their "extreme" counterparts, ELM or EEM), which have the property of being able to add new training data "on the fly".
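To illustrate why ridge regression allows this, here is a sketch (not a scikit-learn API) that stores only the sufficient statistics X^T X + alpha*I and X^T y. The class name `IncrementalRidge` is hypothetical, and the model has no intercept term, which is a simplifying assumption.

```python
import numpy as np


class IncrementalRidge:
    """Ridge regression without intercept, updated from sufficient statistics."""

    def __init__(self, n_features, alpha=1.0):
        self.A = alpha * np.eye(n_features)  # accumulates X^T X + alpha * I
        self.b = np.zeros(n_features)        # accumulates X^T y

    def add(self, X, y):
        """Fold a new batch into the model; old data is not needed again."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.A += X.T @ X
        self.b += X.T @ y
        return self

    @property
    def coef_(self):
        # Closed-form ridge solution for all data seen so far
        return np.linalg.solve(self.A, self.b)

    def predict(self, X):
        return np.asarray(X, dtype=float) @ self.coef_


rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(50, 3)), rng.normal(size=(20, 3))
w = np.array([1.0, -2.0, 0.5])
y1, y2 = X1 @ w, X2 @ w

inc = IncrementalRidge(n_features=3, alpha=1.0).add(X1, y1).add(X2, y2)

# Reference: solve once on the concatenated data
X_all, y_all = np.vstack([X1, X2]), np.concatenate([y1, y2])
batch_coef = np.linalg.solve(X_all.T @ X_all + 1.0 * np.eye(3), X_all.T @ y_all)

same = np.allclose(inc.coef_, batch_coef)
print("incremental matches full retrain:", same)
```

Because the two accumulated quantities are plain sums over rows, adding a batch later gives the same coefficients as training once on everything, which is the property an SVM lacks.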

How to use previously trained data for new test data in Python

I use Gaussian process regression in Python. I have a large training dataset and want to predict on test data. The training data will not change, but the test data will. My question is whether it is possible to save the results of training so that, whenever new test data comes in, I can quickly predict its target without retraining from scratch. I would appreciate any help.
Thanks,
Jay
