Real-time data using sklearn - Python

I have a real-time feed of patient health data that I connect to with Python. I want to run some sklearn algorithms over this feed so that I can predict in real time whether someone is going to get sick. Is there a standard way to connect real-time data to sklearn? I have traditionally worked with static datasets, never an incoming stream, so this is quite new to me. If anyone can share the general rules, processes, or tools used, that would be great.

With most algorithms, training is slow and predicting is fast. It is therefore better to train offline on historical training data, and then use the trained model to predict each new case in real time.
Obviously you might decide to retrain later if you acquire more or better data, but there is little benefit in retraining after every case.
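A minimal sketch of that split, assuming a labelled historical dataset and a stream that yields one feature vector per patient (X, y, data_feed and the model choice are all placeholders here):

# Offline: train and evaluate once on static, labelled data.
# X, y and data_feed are placeholders for your own data and stream.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestClassifier()
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Online: predict each incoming case as it arrives; no retraining in the loop.
for features in data_feed:
    prediction = model.predict([features])[0]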

It is feasible to train the model on a static dataset and then predict classifications for incoming data with it. Retraining the model on each new set of patient data, not so much; that also breaks the train/test methodology for evaluating an ML model.
Trained models can be saved to a file and imported in the code used for real-time prediction.
In Python/scikit-learn this is done with the pickle module (or joblib, which handles models carrying large NumPy arrays more efficiently).
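For example, a rough sketch, where model is whatever fitted estimator you trained and new_cases stands in for incoming feature rows:

import pickle

# In the training script: persist the fitted model.
with open('model.pickle', 'wb') as f:
    pickle.dump(model, f)

# In the real-time prediction code: load it once at startup, then reuse it.
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)
predictions = model.predict(new_cases)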
In R, a model can be saved to an .rds file with saveRDS and reloaded with readRDS.
Yay... my first time answering an ML question!

Related

How to continuously train our pre-trained model on real time data?

I have some sensors that collect data from a cement factory and send it to AWS IoT. The data is then run through a pre-trained model, which predicts the quality of the cement based on some parameters. The data arrives at one-second intervals.
Since the data is coming in real time, I want to train the model incrementally.
Can anybody suggest how to train the model continuously?
You could aggregate a certain number of training samples and then use .partial_fit() to update your model.
.partial_fit() is the incremental-learning option available on several scikit-learn estimators.
If your incremental data will not fit in RAM, it is worth trying the dask-ml wrapper for incremental learning.
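A rough sketch of that buffering approach with SGDRegressor, one of the scikit-learn estimators that implements .partial_fit(); sensor_stream and the numeric quality target are assumptions about your setup:

from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
buffer_X, buffer_y = [], []

for features, quality in sensor_stream:   # stand-in for your AWS IoT feed
    buffer_X.append(features)
    buffer_y.append(quality)
    if len(buffer_X) >= 60:               # e.g. aggregate one minute of 1 Hz readings
        model.partial_fit(buffer_X, buffer_y)
        buffer_X, buffer_y = [], []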

How to launch a Machine Learning model?

First of all, thank you for taking the time to read my question. I have built a machine-learning model on a dataset (the famous one about cancer) and I want to know how to predict the results for new data. I think I have to keep training on data (often) to make my predictions more accurate, but for predicting new data, is it as simple as changing the test data (the y variable) to the new data?
Thank you so much for your time; any help would be appreciated.
You are probably using the SVC class from sklearn.svm.
After fitting the model with the fit method, you can predict new data with the predict method. See here.
By the way: for support vector machines you don't have to fit your data multiple times; maybe you are confusing that with neural networks.
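A minimal sketch, using the breast cancer dataset that ships with scikit-learn (assuming that is the "famous" cancer dataset you mean):

from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
model = SVC()
model.fit(X, y)                 # train once on the labelled data

# Predict new, unlabelled cases: same number of features, no y needed.
new_patients = X[:5]            # stand-in for genuinely new measurements
print(model.predict(new_patients))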
If you mean that you are changing the number of features in your test data, then you cannot do that:
the number of features has to be the same in the training and test sets.
However, if your test data contain a category of some categorical variable that was not present in the training data, it is better to train your model with one extra category such as "NONE" or "Others" for each such feature.
That way, when you encounter a new category in your test data, you map it to "NONE" or "Others" and run prediction on your trained model.
This way it will not break your model; see the sketch below.
I hope I understood your question correctly.
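A rough sketch of that mapping with pandas (the column and category names are made up):

import pandas as pd

train = pd.DataFrame({'blood_type': ['A', 'B', 'O', 'Others']})  # fallback class seen in training
test = pd.DataFrame({'blood_type': ['A', 'AB']})                 # 'AB' never appeared in training

known = set(train['blood_type'])
test['blood_type'] = test['blood_type'].where(test['blood_type'].isin(known), 'Others')
print(test['blood_type'].tolist())   # ['A', 'Others']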

How to use previously trained data for new test data in Python

I use Gaussian process regression in Python. I have big training data and am trying to predict test data. The training data will not vary, but the test data will. My question is whether it is possible to save the results of training so that, whenever new test data comes in, I can quickly predict its targets without retraining all over again. I would appreciate any help.
Thanks,
Jay
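The persist-and-reload approach from the answers above works here too; a rough sketch with GaussianProcessRegressor and joblib, where X_train, y_train and X_new are placeholders for your data:

from joblib import dump, load
from sklearn.gaussian_process import GaussianProcessRegressor

# Train once on the fixed training data, then persist the fitted model.
gpr = GaussianProcessRegressor()
gpr.fit(X_train, y_train)          # X_train, y_train: your static training set
dump(gpr, 'gpr.joblib')

# Whenever new test data arrives, load and predict without retraining.
gpr = load('gpr.joblib')
y_new = gpr.predict(X_new)         # X_new: the incoming test data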

Updating a NaiveBayes Classifier (in scikit-learn) over time

I'm building a Naive Bayes classifier using scikit-learn, and so far things are going well when I have a fixed body of data to train on. However, for the particular project I'm working on, new data will come in every day that ideally would become part of the training set.
I'm aware that you can pickle the classifier to store it for later use, but is there any way to "update" the classifier with new data?
Re-training the classifier from scratch every day is obviously an option, but that would require pulling an ever-growing amount of historical data each time.
Use the partial_fit method on the naive Bayes estimator.
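A rough sketch of a daily update cycle, assuming MultinomialNB (one of the scikit-learn naive Bayes classes that implements partial_fit) and that all class labels are known up front; the data variables are placeholders:

import pickle
from sklearn.naive_bayes import MultinomialNB

# Initial training: every class must be declared on the first partial_fit call.
clf = MultinomialNB()
clf.partial_fit(X_initial, y_initial, classes=[0, 1])
with open('nb.pickle', 'wb') as f:
    pickle.dump(clf, f)

# Each day after: load yesterday's model, fold in only the new data, save again.
with open('nb.pickle', 'rb') as f:
    clf = pickle.load(f)
clf.partial_fit(X_today, y_today)
with open('nb.pickle', 'wb') as f:
    pickle.dump(clf, f)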

Save Naive Bayes Classifier in memory

I am new to NLTK and machine learning. I'm using Python with the NLTK Naive Bayes Classifier. I have created a Naive Bayes classifier for text classification using NLTK and saved it to disk. I am also able to load it when needed to classify some test data using this Python code:
import pickle
f = open('classifier.pickle', 'rb')   # pickle files must be opened in binary mode
classifier = pickle.load(f)
f.close()
But my problem is that whenever new test data comes in, I have to load this classifier into memory again and again, which takes a lot of time (2-3 minutes) because of its large size. Also, if I run two instances of the same sentiment-analysis program, they will take double the RAM, because each program loads the classifier separately. My questions are: is there any technique to keep this classifier in memory so that the sentiment-analysis programs can read it directly, or any other method to minimize the classifier's load time?
Thanks in advance for your help.
You can't have it both ways. You can either keep pickling/unpickling one at a time to use less RAM, or store both in memory, using twice as much RAM but reducing load times and disk I/O wait times.
Are the two classifiers trained on different training data, or are you using the same classifier in parallel? It sounds like the latter from your mention of "two instances"; in that case you may want to look into threading, so that one loaded classifier can serve both sets of data. Some parallelism can be achieved by classifying part of the data, then doing other work such as results processing while the other thread classifies, and repeating; see the sketch at the end of this answer.
My expertise in this comes from having started an open source NLTK based sentiment analysis system: https://bitbucket.org/tommyjcarpenter/evopminer.
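A rough sketch of that threading approach: the pickled classifier is loaded once and shared, since classification only reads the model; batch_a and batch_b stand in for your two data sets:

import pickle
from concurrent.futures import ThreadPoolExecutor

with open('classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)       # loaded once, shared by both workloads

def classify_batch(batch):
    # classify() only reads the model, so sharing one copy is safe
    return [classifier.classify(featureset) for featureset in batch]

with ThreadPoolExecutor(max_workers=2) as pool:
    future_a = pool.submit(classify_batch, batch_a)
    future_b = pool.submit(classify_batch, batch_b)
    print(future_a.result(), future_b.result())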
