predicting new non-standardized data with classifier trained on standardized data - python

I have some data with, say, L features. I have standardized them using StandardScaler() by doing a fit_transform on X_train. While predicting, I did clf.predict(scaler.transform(X_test)). So far so good... now if I want to pickle the model for later reuse, how would I go about predicting on new data in the future with this saved model? The new (future) data will not be standardized, and I didn't pickle the scaler.
Is there anything else that I have to do before pickling the model the way I am doing it right now (to be able to predict on non-standardized data)?
reddit post: https://redd.it/4iekc9
Thanks. :)

To solve this problem you should use a pipeline: the first stage is the scaler, the second is your model. Then you can pickle the whole pipeline and have fun with your new data.
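A minimal sketch of that idea, with SVC standing in for whatever classifier you use and X_train, y_train, X_new as placeholder arrays:

    import pickle
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Scaler and classifier bundled together; fit() fits both stages in order.
    pipe = Pipeline([("scaler", StandardScaler()), ("clf", SVC())])
    pipe.fit(X_train, y_train)

    # Pickle the whole pipeline, scaler included.
    with open("model.pkl", "wb") as f:
        pickle.dump(pipe, f)

    # Later: load it and predict on raw, non-standardized data;
    # predict() runs the input through the saved scaler first.
    with open("model.pkl", "rb") as f:
        pipe = pickle.load(f)
    predictions = pipe.predict(X_new)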

Related

How to train a trained model with new examples in scikit-learn?

I'm working on a machine learning classification task in which I have trained many models with different algorithms in scikit-learn, and the Random Forest classifier performed the best. Now I want to train the model further with new examples, but if I call the fit method on the same model with the new examples, it starts training from the beginning and erases the old parameters.
So, how can I train the trained model by training it with new examples in scikit-learn?
Reading online gave me the idea of pickling and unpickling the model, but I don't see how that would help.
You should use incremental learning and estimators implementing the partial_fit API.
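For example, a minimal sketch with SGDClassifier, one of the estimators implementing partial_fit (X_train/y_train and the later batch X_new/y_new are placeholders):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier()
    # The full set of classes must be passed on the first call,
    # since later batches may not contain every class.
    clf.partial_fit(X_train, y_train, classes=np.unique(y_train))

    # Later, when new labelled examples arrive, update the model in place:
    clf.partial_fit(X_new, y_new)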
RandomForestClassifier has a warm_start flag: with it set to True, calling fit again with a larger n_estimators adds new trees instead of rebuilding the forest. Note that this will not give the same results as if you had trained on both sets at once.
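A sketch of the warm_start route (placeholder data; the increment of 50 trees is arbitrary):

    from sklearn.ensemble import RandomForestClassifier

    clf = RandomForestClassifier(n_estimators=100, warm_start=True)
    clf.fit(X_old, y_old)

    # Grow extra trees on the new data; the existing 100 trees are kept as-is.
    clf.n_estimators += 50
    clf.fit(X_new, y_new)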
Append the new data to your existing dataset and train over the whole thing. You might want to reserve some of the new data for your test set.

How to launch a Machine Learning model?

First of all, thank you for taking the time to read my question. I have built a machine learning model with a dataset (the famous one about cancer) and I want to know how to predict results for new variables. I think that I have to keep training the model (often) to make the predictions more accurate, but for predicting new data, is it as simple as changing the test data (the y variable) to the new data?
Thank you so much for taking your time, and any help would be appreciated.
You are probably using the SVC class from sklearn.svm.
After fitting the model with the fit method, you can predict new data with the predict method.
By the way: For Support Vector Machines you don't have to fit your data multiple times. Maybe you are confusing that with neural networks.
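For instance, a minimal sketch (X_train, y_train, and X_new are placeholder arrays):

    from sklearn.svm import SVC

    clf = SVC()
    clf.fit(X_train, y_train)      # fit once on the training data

    # Predict on new data; it must have the same number of features as X_train.
    predictions = clf.predict(X_new)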
If you mean changing the number of features in your test data, then you cannot do that.
The number of features has to be the same in the training and test sets.
However, if your test data contains a category of some categorical variable that was not present in the training data, it is better to train your model with one extra category such as "NONE" or "Others" for each such feature.
That way, when you encounter a new category in your test data, you can map it to "NONE" or "Others" and run the prediction on your trained model.
This way it will not break your model.
I hope I understand your question correctly.
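A small illustration of that remapping, assuming pandas DataFrames train_df/test_df and a hypothetical categorical column "color":

    import pandas as pd

    # Categories seen during training.
    known = set(train_df["color"].unique())

    # Replace any category unseen during training with "Others",
    # a category the model was trained with.
    test_df["color"] = test_df["color"].where(
        test_df["color"].isin(known), "Others"
    )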

Using pretrained feature-extractor in pipeline

I'm new to scikit-learn, and can't find the answer to what I think is a common use-case. I have lots of unlabelled data, and only some labelled data. I want to first train a transformer for feature-extraction, then save that part of the pipeline somewhere. Then, I'd like to create a pipeline of the feature-extractor plus a classifier, which I'll train on labelled data.
But, I don't want the feature-extractor to re-fit on the new data -- I want it to keep its parameters from when I trained it on the unlabelled data.
Can somebody point me in the direction of how best to do this? Thank you.
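One minimal way to get that behaviour, sketched here with a hypothetical PCA feature-extractor: fit and save the transformer once on the unlabelled data, then apply it manually and fit only the classifier, so the extractor is never re-fit:

    import pickle
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    # Fit the feature extractor on the unlabelled data and save it.
    extractor = PCA(n_components=50)
    extractor.fit(X_unlabelled)
    with open("extractor.pkl", "wb") as f:
        pickle.dump(extractor, f)

    # Later: load the frozen extractor, transform the labelled data,
    # and fit only the classifier on the extracted features.
    with open("extractor.pkl", "rb") as f:
        extractor = pickle.load(f)

    clf = LogisticRegression()
    clf.fit(extractor.transform(X_labelled), y_labelled)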

Do you need to store your old data to refit a model in sklearn?

I am trying to use sklearn to build an Isolation Forest machine learning program to go through a ton of data. I can only store the past 10 days of data, so I was wondering:
When I use the "fit" function on new data that comes in, does it update the model it learned from the old data, even though it no longer has access to that data? Or does it recreate the model completely?
In general, only estimators implementing the partial_fit method can update a model incrementally; a plain call to fit always retrains from scratch on whatever data you pass it. Unfortunately, IsolationForest is not one of them.

Real time data using sklearn

I have a real time data feed of health patient data that I connect to with python. I want to run some sklearn algorithms over this data feed so that I can predict in real time if someone is going to get sick. Is there a standard way in which one connects real time data to sklearn? I have traditionally had static datasets and never an incoming stream, so this is quite new to me. If anyone has some general rules/processes/tools they use, that would be great.
With most algorithms, training is slow and predicting is fast, so it is better to train offline on your training data, then use the trained model to predict each new case in real time.
Obviously you might decide to retrain later if you acquire more or better data, but there is little benefit in retraining after every case.
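A sketch of that split, assuming a model pickled offline to "model.pkl" and a hypothetical handle_record function called once per incoming observation:

    import pickle

    # Load the model that was trained offline.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    def handle_record(record):
        # predict() expects a 2-D array of shape (n_samples, n_features),
        # so wrap the single observation in a list.
        return model.predict([record])[0]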
It is feasible to train the model on a static dataset and predict classifications for incoming data with that model. Retraining the model with each new set of patient data, not so much; that also breaks the train/test methodology for evaluating an ML model.
Trained models can be saved to a file and imported in the code used for real-time prediction.
In Python/scikit-learn this is done via the pickle module.
In R, you can save a model object with saveRDS.
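For scikit-learn models, joblib is a common alternative to pickle, often more efficient for objects holding large NumPy arrays; a minimal sketch:

    import joblib

    joblib.dump(model, "model.joblib")   # save after training
    model = joblib.load("model.joblib")  # reload for real-time prediction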
Yay... my first time answering an ML question!
