My independent variable is a datetime object and my dependent variable is an float. Currently, I have a keras model that predicts accurately, but I found out that model.predict() only returns predictions for the values that are already known. Is there a method I can call to tell the program to use the model to predict unknown values? If there isn't please give me instructions about how to predict these unknown values.
Currently, I have a Keras model that predicts accurately, but I found out that model.predict() only returns predictions for the values that are already known
That is incorrect. A predict statement doesn't just 'search and return' results from training data. That's not how machine learning works at all. The whole reason that you build models and have a train and test dataset is to ensure you have a model that is generalizable (i.e. can be used to make predictions on unseen data, assuming the observation is coming from the same underlying distribution that the model is trained on)
In your specific case, you are using a DateTime variable an independent, which means you should refrain from using variable such as year, which are non-recurring since you can use it to make predictions about the future (model learns patterns in 2019 but 2020 may be out of its vocabulary and thus years after that are not feasible to use for predictions.)
Instead, you should engineer some features from your DateTime variable and use recurring variables which may show reveal some patterns in the dependent variable. These variables are like days of the week, months, seasons, hours of the day. Depending on what your dependent variable is, you can surely find some patterns in these.
All of this totally depends on what you are trying to model and what is the goal of the model.predict() w.r.t your problem statement. Please elaborate if possible so that people can give you more specific answers.
Your assumption is incorrect. model.predict is specifically intended to use a trained model to make predictions on a data set typically not used previously for example a test set and not a training or validation set. To use it you need to create a data set to feed to model.predict. See answer here. on how to provide input to model.predict
Related
I am trying to add data for my scikit-learn model after it has already been trained. For example, I have the data that I used in the beginning (there are about 250 of them). After that, I need to train this model one more time by calling the function, and so on. The only thing that came to my mind was to add new values to the existing data array every time and train the model again, but this is very resource-intensive and takes more time.
Is there another way to train the machine learning model?
model = LinearRegression().fit(test, result)
reg.predict(task)
### and here I want to add some data, for example one or two examples like:
model.addFit(one_test, one_result)
The short answer in your case (using the sklearn.linear_model.LinearRegression model) is no, it is not possible to add one or two more examples and train without adding this to the original training set and fitting it all at the same time. Under the hood, the model is simply using Ordinary Least Squares (described here) which requires the complete matrix of training data on which to fit your model. However, this algorithm is very fast and in the case of ~ hundreds of training examples, it would be very quick to re-calculate the parameters of the model with each new couple examples.
I use the tensorflow library to solve the time series problem.
I get the dimensions or properties by subtracting the current value from the previous value (according to this article)
In this article, there is the data needed for forecasting. It chooses a value for training and a value for testing that there are no problems.
But my question is how can I predict the future? Suppose if I want to forecast 5 months later there will be no dimensions or attributes to send to the forecast function.
--If you have a better source, please introduce it ...Thanks in advance
If you have a lot of data it could be possible, it means that your model knows a lot of data and can generalize with new data and it can find a knowed pattern. If you have a poor model it will throws bad predictions because the new input is new and the model can't find a knowed pattern
Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP) which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different). Also, the model will be different as I am doing feature selection in each iteration.
Then, My Question
In LOOCV, How can I use shap.Explainer() function of shap library to present the performance of a machine learning model? It can be noted that I have checked several tutorials (e.g. this one, this one) and also several questions (e.g. this one) of SO. But I failed to find the answer of the problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there is 250 participants' data, if I apply shap here for each model, there will be 250 output! Thus, I want to get a single output which will present the performance of the 250 models.
You seem to train model on a 250 datapoints while doing LOOCV. This is about choosing a model with hyperparams that will ensure best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparams -- note, 250 LOOCV is already overkill. Will you do that with 250'000 rows? -- you are rather trying to understand which features influence output in what direction and by how much.
Training has it's own limitations (availability of data, if new data resembles the data the model was trained on, if the model good enough to pick up peculiarities of data and generalize well etc), but don't overestimate explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values. But do you expect the result to be much more different from a single random train/test split?
Note as well:
However, in each iteration of the Leave One Out Cross Validation (LOOCV), the ML model will be different as in each iteration, I am training on a different dataset (1 participant’s data will be different).
In each iteration of LOOCV the model is still the same (same features, hyperparams may be different, depending on your definition of iteration). It's still the same dataset (same features)
Also, the model will be different as I am doing feature selection in each iteration.
Doesn't matter. Feed resulting model to SHAP explainer and you'll get what you want.
First of all thank you for taking your time to read my question. I have done a Machine Learning model with a dataset (The famous one about Cancer) and I want to know how can I do to predict the results for new variables. I think that I have to keep training the data (often) to have more accured data to use in my prediction but for predicting new data, ¿Is as simple as changing the test data (y variable) to the new data?
Thank you so much for taking your time and any help would be appreciate it.
You are probably using the SVC class from sklearn.svm.
After fitting the model with the fit method you can predict new data with the predict method. See here.
By the way: For Support Vector Machines you don't have to fit your data multiple times. Maybe you are confusing that with neural networks.
If you are talking in the sense that you are changing the number of features in your test data then you cannot do that.
The number of features has to be the same in training and test set.
However, if your test data have some class of categorical variable which was not there in training data then its better you train your model with one extra category as "NONE" of "Others" for all your features.
This way when you encounter new class of categorical variable in your test data then you changed it to "NONE" or "Others" and do prediction on your trained model.
This way it will not break your model.
I hope I understand your question correctly.
Idea behind nullifying/ignoring a feature from test set is to understand how important is it considered by the model to predict the target variable (by comparing the evaluation metric's value). For numerical variables, I thought of setting them to 0, assuming the multiplication (with weights) would be 0 and thus it would get eliminated from the set. Is this approach right, else what should be done?
I am using tensorflow's DNNRegressor for modelling.
For deep models there's no general input-independent way to do this kind of feature ablation you want (take a pretrained model and just change that feature's representation on the test set).
Instead I recommend you do training time ablation: train different variations of your model with different feature combinations and compare their validation set performances. This will actually tell you how much each feature helps.