How to launch a Machine Learning model? - python

First of all, thank you for taking the time to read my question. I have built a machine learning model on a dataset (the famous one about cancer) and I want to know how to predict results for new data. I think I have to keep retraining the model (often) to get more accurate predictions, but for predicting new data, is it as simple as changing the test data (y variable) to the new data?
Thank you so much for your time; any help would be appreciated.

You are probably using the SVC class from sklearn.svm.
After fitting the model with the fit method you can predict new data with the predict method. See here.
By the way: For Support Vector Machines you don't have to fit your data multiple times. Maybe you are confusing that with neural networks.
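A minimal sketch of that fit/predict flow follows. It assumes the "famous" cancer dataset is the breast cancer dataset bundled with scikit-learn, and the scaling step is an addition of this example, not something the question mentions. Note that new data is passed as features (X), not as the y variable.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: the bundled breast cancer dataset stands in for the asker's data.
X, y = load_breast_cancer(return_X_y=True)

# Hold back some rows to play the role of "new" data.
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit once; scaling is included because SVMs are sensitive to feature scale.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Predicting new data only needs the feature matrix, with the same columns
# (same number and order of features) as the training data.
predictions = model.predict(X_new)
print(predictions[:10])
print("accuracy on held-out rows:", model.score(X_new, y_new))
```

The labels y_new are used here only to check the predictions; for truly new data they would not be available.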

If you mean that you are changing the number of features in your test data, then you cannot do that.
The number of features has to be the same in the training and test sets.
However, if your test data contain a categorical value that was not present in the training data, then it is better to train your model with one extra category such as "NONE" or "Others" for all your categorical features.
This way, when you encounter a new categorical value in your test data, you change it to "NONE" or "Others" and predict with your trained model.
This way it will not break your model.
I hope I understand your question correctly.
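The "Others" idea above could be sketched like this with pandas. The column name, the sample values, and the helper collapse_unseen are hypothetical and only serve the illustration.

```python
import pandas as pd


def collapse_unseen(values, known_categories, fallback="Others"):
    """Replace any category not seen during training with the fallback label."""
    return values.where(values.isin(known_categories), fallback)


# Hypothetical training data that already contains the catch-all category.
train = pd.DataFrame({"color": ["red", "blue", "Others", "red"]})
# Hypothetical new data with a category ("green") the model never saw.
new = pd.DataFrame({"color": ["blue", "green", "red"]})

known = set(train["color"])
new["color"] = collapse_unseen(new["color"], known)
print(new["color"].tolist())  # ['blue', 'Others', 'red']
```

After this step the new data only contains categories the model was trained on, so the same encoding used at training time can be applied to it.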

Related

How to add training data to the model after initial training?

I am trying to add data for my scikit-learn model after it has already been trained. For example, I have the data that I used in the beginning (there are about 250 of them). After that, I need to train this model one more time by calling the function, and so on. The only thing that came to my mind was to add new values to the existing data array every time and train the model again, but this is very resource-intensive and takes more time.
Is there another way to train the machine learning model?
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(test, result)
model.predict(task)
### and here I want to add some data, for example one or two examples like:
model.addFit(one_test, one_result)  # wished-for method, not part of scikit-learn
The short answer in your case (using the sklearn.linear_model.LinearRegression model) is no: it is not possible to add one or two more examples and train without adding them to the original training set and fitting it all at the same time. Under the hood, the model simply uses Ordinary Least Squares (described here), which requires the complete matrix of training data on which to fit your model. However, this algorithm is very fast, and with only hundreds of training examples it would be very quick to recalculate the parameters of the model with every couple of new examples.
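As a rough sketch of the "append and refit" approach described above, using synthetic data as a stand-in for the asker's roughly 250 examples (the coefficients and shapes are made up for the illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
true_coef = np.array([1.5, -2.0, 0.5])  # made-up coefficients for the demo

# Initial training set of 250 examples with 3 features.
X = rng.normal(size=(250, 3))
y = X @ true_coef + 4.0
model = LinearRegression().fit(X, y)

# Two new examples arrive later.
X_extra = rng.normal(size=(2, 3))
y_extra = X_extra @ true_coef + 4.0

# Append them to the stored training data and refit from scratch.
X = np.vstack([X, X_extra])
y = np.concatenate([y, y_extra])
model = LinearRegression().fit(X, y)

print(X.shape)
print(model.coef_, model.intercept_)
```

If refitting ever becomes too expensive, scikit-learn estimators that implement partial_fit (for example SGDRegressor) can be updated with new batches. That is a different algorithm from Ordinary Least Squares, so its results will not match LinearRegression exactly.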

In Leave One Out Cross Validation, How can I Use `shap.Explainer()` Function to Explain a Machine Learning Model?

Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and also several questions (e.g. this one) on SO, but I failed to find an answer to the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply shap here for each model, there will be 250 outputs! Thus, I want to get a single output which will present the performance of the 250 models.
You seem to be training a model on 250 data points while doing LOOCV. That process is about choosing a model and hyperparameters that ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters -- note that 250-fold LOOCV is already overkill; would you do that with 250,000 rows? -- you are rather trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values. But do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features, hyperparams may be different, depending on your definition of iteration). It's still the same dataset (same features)
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed resulting model to SHAP explainer and you'll get what you want.

Will it lead to Overfitting / Curse of Dimensionality

The dataset contains:
15000 Observations/Rows
3000 Features/Columns
Can I train a machine learning model on this dataset?
Yes, you can apply an ML model, but first you need to understand your problem statement and what each of the features in the dataset represents. If the dataset is big, try splitting it into 2 clusters, or take a smaller sample to analyze what your data is telling you.
That is where population and sampling come into practical use.
You have to check whether the accuracy on the training set and on the test set is about the same; if it is not, your model is memorizing instead of learning, and this is where regularization comes into the picture.
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
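The cross-validation sanity check described above might look like the following sketch. The synthetic dataset is a much smaller stand-in for the 15000 x 3000 data, and the logistic regression model and its settings are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in: more features than rows makes overfitting easy to see.
X, y = make_classification(
    n_samples=300, n_features=600, n_informative=10, random_state=0
)

# C controls regularization strength (smaller C means stronger regularization).
model = LogisticRegression(C=1.0, max_iter=1000)

# return_train_score=True reports accuracy on the training folds as well,
# so the two can be compared directly.
scores = cross_validate(model, X, y, cv=5, return_train_score=True)
train_acc = scores["train_score"].mean()
test_acc = scores["test_score"].mean()

print(f"mean train accuracy: {train_acc:.3f}")
print(f"mean test accuracy:  {test_acc:.3f}")
```

A large gap between the two numbers suggests overfitting; lowering C, selecting fewer features, or gathering more data are the usual things to try next.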

Real time data using sklearn

I have a real time data feed of health patient data that I connect to with python. I want to run some sklearn algorithms over this data feed so that I can predict in real time if someone is going to get sick. Is there a standard way in which one connects real time data to sklearn? I have traditionally had static datasets and never an incoming stream so this is quite new to me. If anyone has sort of some general rules/processes/tools used that would be great.
With most algorithms training is slow and predicting is fast. Therefore it is better to train offline using training data; and then use the trained model to predict each new case in real time.
Obviously you might decide to train again later if you acquire more/better data. However there is little benefit in retraining after every case.
It is feasible to train the model from a static dataset and predict classifications for incoming data with the model. Retraining the model with each new set of patient data is much less so. It also breaks the train/test approach to evaluating an ML model.
Trained models can be saved to file and imported in the code used for real time prediction.
In Python with scikit-learn, this can be done with the pickle package.
In R, the model object can be saved to a file with saveRDS.
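For the Python route, a minimal sketch of "train offline, save, then predict on incoming records" could look like this. The bundled breast cancer dataset, the file name, and the incoming_records generator are hypothetical stand-ins for the real patient feed.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# --- Offline: train once on a static dataset and save the model ---
X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X[:500], y[:500])

model_path = os.path.join(tempfile.gettempdir(), "model.pkl")  # hypothetical path
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# --- Online: load the saved model and predict each incoming record ---
with open(model_path, "rb") as f:
    loaded = pickle.load(f)


def incoming_records():
    """Hypothetical stand-in for a live feed: yields one feature row at a time."""
    for row in X[500:505]:
        yield row


# predict expects a 2D array, so each single record is reshaped to (1, n_features).
live_predictions = [
    int(loaded.predict(row.reshape(1, -1))[0]) for row in incoming_records()
]
print(live_predictions)
```

Only load pickle files from sources you trust, and load them with the same scikit-learn version that was used to save them where possible.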
yay... my first answering a ML question!

How to use previously trained data for new test data in Python

I use Gaussian process regression in Python. I have a large training dataset and want to predict targets for test data. The training data will not vary, but the test data will. My question is whether it is possible to save the results of training so that, whenever new test data come in, I can quickly predict their targets without retraining all over again. I would appreciate any help.
Thanks,
Jay
