How to test new data against a trained model? - python

I'm a beginner in Machine Learning. At first, my model gave me an accuracy of 85.82%, which was good. But now I would like to test the model against totally new data, and I can't figure out what to add to the code, as I can only get the test accuracy when the model is tested with validation data.
The following is my code:

Create a new data set in EXACTLY the same manner in which you created the original test set. The term exactly is important: if you pre-processed your training and validation data, you must apply the same pre-processing to this new data.
Another approach is to split the data into 3 groups: train, validation and test. You can do that with train_test_split as follows.
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, train_size=.8, random_state=1)
X_valid, X_test, Y_valid, Y_test = train_test_split(X_temp, Y_temp, train_size=.5, shuffle=False)
This will use 80% of your input set for training, 10% for validation and 10% for testing. In your code, what you called the test set is actually the validation set, so when you ran model.evaluate you got the same accuracy as the validation accuracy. Now, in model.fit, set validation_data=(X_valid, Y_valid). Your test set is then independent of the validation set, so when you run model.evaluate you should get a somewhat different accuracy than the validation accuracy.
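For example, here is a minimal sketch of that workflow, assuming model is your already compiled Keras model (compiled with an accuracy metric) and X, Y hold all of your data:
from sklearn.model_selection import train_test_split

# 80% train, 10% validation, 10% test
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, train_size=.8, random_state=1)
X_valid, X_test, Y_valid, Y_test = train_test_split(X_temp, Y_temp, train_size=.5, shuffle=False)

# monitor the validation set during training
model.fit(X_train, Y_train, validation_data=(X_valid, Y_valid), epochs=10, batch_size=32)

# evaluate once, at the very end, on the independent test set
test_loss, test_acc = model.evaluate(X_test, Y_test, verbose=0)
print("Test accuracy:", test_acc)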

Related

Should you FIT train, test or all x and y values for a LinearRegression?

I have seen so many examples of LinearRegression, and they are all so different. The question is: should I fit train, test, or all data to the model? Every example had a different way of handling the regression...
This is the split of data, no problem here:
X = data[['day']].values
y = data[['ozone']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
But when I fit the model, what option should I choose?
model = LinearRegression()
1. model.fit(X_train,y_train)
2. model.fit(X_test,y_test)
3. model.fit(data[['day']].values, data[['ozone']].values) #X and y
Also, I must say that the resulting plots show the best results when using the third method. Is this the correct approach?
The reason so many examples out there are so different is because context trumps method. For example, if I'm running a physics experiment to verify a theoretical equation, then in essence the theoretical equation itself is the "test set" and I would want to use as much data as I could (i.e. use all the data) to reduce bias and variance in the estimate. So, if your ozone problem is very well supported by theoretical physical reasoning and you just want to solve for some coefficients (i.e. physical constants) then you want to use the entire dataset to nail down those coefficients as best as possible. In the statistical sense, the physical motivation acts as a "prior," and can be extremely well known (for more about this viewpoint I suggest Kruschke's Bayesian Analysis book).
On the other hand, if you have no idea what sort of effects could be driving the ozone measurements you're getting, and you want to solve for an unknown mapping (using a linear basis set that you assume will work to describe that mapping) then you should hold out some level of your actual measurements to see how well that mapping can generalize.
Lots of "machine learning" these days is primarily data driven because we're coming to the point where we have loads of accessible data, and as such when you take courses that describe methods of data fitting and descriptors (e.g. linear regression) they often come from the entirely data driven context. Whether physically driven or data driven, the methods are very similar and the way you use the methods can even blend together in the middle of either extreme.
More to your question of how to code it: if you take the data-driven approach of dividing a train and test set, what you are really doing is fitting your model to some random sample of "training data"; but since you have nothing to compare to later, you need to see how well that fit generalizes to some more data, the "test data". So fit on your "train data" and then predict or evaluate on your "test data" to see how well your model or mapping works on "unseen" data.
E.g. (expanding on your code)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = data[['day']].values
y = data[['ozone']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
model = make_pipeline(PolynomialFeatures(3), LinearRegression())  # if you think a 3rd order poly basis ought to work
model.fit(X_train, y_train)
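Then, to see how that fit generalizes, predict or score on the held-out portion (a short sketch; for this pipeline, score returns the R² on the test data):
y_pred = model.predict(X_test)    # predictions on data the model never saw
r2 = model.score(X_test, y_test)  # R² of the fitted pipeline on the test set
print("Test R^2:", r2)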
The quick answer is: train your model on the train sample.
Whatever your model is (linear regression or anything else), you always want to make sure it is not over-fitted, meaning it will still perform well on unseen data. That's why you should always train your model on a subset of the full dataset (the training set) and use the test set to assess the model's performance (R², or the metric best suited to your application).
So you should:
X = data[['day']].values
y = data[['ozone']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
model = LinearRegression()
model.fit(X_train,y_train)
And then:
y_pred = model.predict(X_test)
From there you can compare y_pred to y_test.
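For instance, a minimal sketch of that comparison with scikit-learn's metrics (mean squared error and R²):
from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, y_pred)  # average squared error on the unseen test data
r2 = r2_score(y_test, y_pred)             # proportion of variance explained on the test data
print("Test MSE:", mse)
print("Test R^2:", r2)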
You fit your model on the train set, i.e. the features X_train and the target y_train. So in your case, it is option 1:
model.fit(X_train,y_train)
Once your model is trained, you can test it on X_test and compare the y_pred that results from running the model on the test set to y_test.
The reason you get the 'best plots' for your metric while using option 3 is that you are essentially training on the whole dataset. If you then test on a subset of that, you will naturally get a better score, since you are testing your model on data it has already seen during training. You should never do that.

How does validation_split work in training a neural network model?

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, verbose=2)
In the above line of code, model is a sequential Keras model that has layers and has been compiled.
What is the use of the parameter validation_data? The model is going to train on the X_train and y_train data, so based on y_train the parameters are adjusted and back-propagation is done.
What is the use of validation_data, and why is different data, in this case the testing data, provided?
During training, the (x_train, y_train) data is used to adjust the trainable parameters of the model. However, we don't know whether the model is overfit or underfit, or whether it is going to do well when new data is provided. That is why we have validation data (x_test, y_test): to test the accuracy of the model on unseen data.
Depending on the training and validation accuracy, we can decide:
- whether the model is overfit/underfit,
- whether to collect more data,
- whether we need to implement a regularization technique,
- whether we need to use data augmentation techniques,
- whether we need to tune hyperparameters, etc.
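As an illustration, here is a minimal sketch of reading the two accuracy curves from the History object returned by fit (assuming the model was compiled with metrics=['accuracy']; older Keras versions use the keys 'acc' and 'val_acc'):
history = model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, verbose=2)

train_acc = history.history['accuracy']    # accuracy on the training data, per epoch
val_acc = history.history['val_accuracy']  # accuracy on the validation data, per epoch

# training accuracy that keeps rising while validation accuracy stalls or drops suggests overfitting
print("Final train accuracy:", train_acc[-1])
print("Final validation accuracy:", val_acc[-1])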

How to make sure keras model outputs same accuracy?

I set the following:
np.random.seed(7)
# split data to train, validate, test (60%, 20%, 20%)
train, validate, test = np.split(data, [int(.6*len(data)), int(.8*len(data))])
history = model.fit(train, train, epochs=1, batch_size=32, verbose=1, shuffle=True,
validation_data=(validate, validate), callbacks=[cb])
score = model.evaluate(test, test, verbose=1)
shuffle=True shouldn't matter here since I'm only training for one epoch.
Now from what I've read this should ensure that my model's accuracy is always the same after training from scratch, but the accuracy results for various runs are 48%, 48%, 56%, 48%, 56%, 47.5% and so on. So I'm wondering if there is something else I have to do to ensure that the resulting accuracy stays the same?
The parameters of the model are initialized differently every time you fit the model, even if it is on the same data, so it will result in different accuracies. If you are insistent on getting the same accuracy, run the model once, save the parameters to a file, and then load them again when you run the code. Refer to the Keras documentation for more details.
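A minimal sketch of that save-and-reload approach, assuming model is the compiled Keras model from the question (the file name is just an example):
# first run: train once and persist the learned parameters
history = model.fit(train, train, epochs=1, batch_size=32, verbose=1, shuffle=True,
                    validation_data=(validate, validate), callbacks=[cb])
model.save_weights('model_weights.h5')

# later runs: build and compile the same architecture, then reload the saved weights
model.load_weights('model_weights.h5')
score = model.evaluate(test, test, verbose=1)  # deterministic, since no new training happens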

Training a model by looping through the train_test_split and training without looping

I am new to Python and Keras, so please bear with my question.
I recently created a model in Keras, trained it, and got the mean squared error (MSE) after prediction. I used the train_test_split function on the data set.
Next, I created a loop with 50 iterations and applied it to the above model. However, I kept the train_test_split function (random_state not specified) within the loop, so that in every iteration I would have a new set of X_train, y_train, X_test and y_test values. I obtained 50 MSE values as output and calculated their mean and standard deviation.
My query is: did I do the right thing by placing the train_test_split function within the loop? Does it affect my goal, which was to see the different MSE values generated for my data set?
If I had placed the train_test_split function outside my loop and performed the above activity, wouldn't the X_train, y_train, X_test and y_test values remain the same throughout all of my 50 iterations? Wouldn't this cause an overfitting problem for my model?
I would really appreciate your feedback.
My code snippet:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
MSE=np.zeros(50)
for i in range(50):
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3)
    model = regression_model()
    model.fit(predictors_train, target_train, validation_data=(predictors_test, target_test), epochs=50, verbose=0)
    model.evaluate(predictors_test, target_test, verbose=0)
    target_predicted = model.predict(predictors_test)
    MSE[i] = metrics.mean_squared_error(target_test, target_predicted)
    print("Test set MSE for {} cycle:{}".format(i+1, MSE[i]))
The method you are implementing is a form of cross-validation; it allows your model to get a better "view" of your data and reduces the chance that your training data was "too perfect" or "too noisy".
So putting train_test_split in the loop will generate new training splits from your original data, and by averaging the outputs you will get what you want.
If you put train_test_split outside the loop, the training data will remain the same for the whole training loop, resulting in overfitting, like you said.
However, train_test_split is random, so you can get two random splits that are very similar, which makes this method sub-optimal.
A better way is by using the k-fold cross validation :
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)  # seed defined elsewhere
MSE = []
for train, test in kfold.split(X, y):
    model = regression_model()
    model.fit(X[train], y[train], validation_data=(X[test], y[test]), epochs=50, verbose=0)
    model.evaluate(X[test], y[test], verbose=0)
    target_predicted = model.predict(X[test])
    MSE.append(metrics.mean_squared_error(y[test], target_predicted))
    print("Test set MSE for {} cycle:{}".format(len(MSE), MSE[-1]))
print("Mean MSE for {}-fold cross validation : {}".format(len(MSE), np.mean(MSE)))
This method will split your training data into 10 folds and will fit your model using a different fold as validation at each iteration.
You can have more info here : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
Hope this will help you !
EDIT FOR PRECISION
Indeed, don't use this method on your TEST data, but only on your VALIDATION data!
Your model must never have seen your TEST data before!
You don't want to use the test set during training at all. You will be tweaking the model to the point where it starts "overfitting" even the test set, and your error estimates will be too optimistic.
Yes, if you place train_test_split outside of that for loop, your sets will stay the same for the whole training, and it can lead to overfitting. That is why you have a validation set, which is not used for training but for validation, mostly to find out whether your model is overfitting the train set or not. If it is overfitting, you should address that by tweaking your model (making it less complex, implementing regularization, early stopping...).
But don't train your model on the same data you use for testing. Training on the validation set is a different story, and it is normally done when implementing K-fold cross-validation.
So the general steps to follow are:
1. Split your dataset into a test set and the "other" set; put the test set away and don't show it to your model until you are ready for final testing => only once you have already trained and tuned your model.
2. Choose whether you want to implement k-fold cross-validation or not. If not, then split your data into a training set and a validation set and use them throughout the whole training => training set for training and validation set for validating.
3. If you want to implement k-fold cross-validation, then follow step 2, measure the error metric that you want to track, pick the other set again, split it into a different training set and validation set, and do the whole training again. Repeat this multiple times and take the average of the error metrics measured during these cycles to get a better (average) error estimate.
4. Tune your model and repeat steps 2 and 3 until you are happy with the results.
5. Measure the error of your final model on the test set to see whether it generalizes well.
Note that while implementing k-fold cross validation is generally a good idea, this approach might be infeasible for larger neural networks because it can dramatically increase the time it takes to train them. If that is the case, you might want to stick with just one training set and one validation set or set k (in k-folds) to some low number such as 3.
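For illustration, here is a minimal sketch of steps 1 and 2 without k-fold cross-validation (the variable names are only illustrative; X and y are assumed to be your full feature and target arrays):
from sklearn.model_selection import train_test_split

# step 1: carve out a hold-out test set and put it aside until the final evaluation
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# step 2: split the remainder into a training set and a validation set
X_train, X_valid, y_train, y_valid = train_test_split(X_other, y_other, test_size=0.25, random_state=1)

# train on (X_train, y_train), tune against (X_valid, y_valid),
# and touch (X_test, y_test) only once at the very end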

Choose random validation data set

I have a numpy array of data that is generated over time by a simulation. Based on this, I'm using TensorFlow and Keras to train a neural network, and my question refers to this line of code in my model:
model.fit(X1, Y1, epochs=1000, batch_size=100, verbose=1, shuffle=True, validation_split=0.2)
After reading the Keras documentation, I found out that the validation data set (in this case 20% of the original data) is sliced from the end. As I'm generating data over ongoing time, I obviously don't want the last part to be sliced off, because it would not be representative for validation. I'd rather have the validation data chosen randomly from the whole data set. For this purpose I am currently shuffling my whole data set (inputs and outputs for the ANN in unison) before training to obtain random validation data.
I feel like I don't want to ruin the time component in my data, which is why I'm looking for a way to choose the validation set randomly without having to shuffle the whole data set. Also, I'd like to know what you think of not shuffling time-continuous data. Again, I'm not asking about the nature of the validation split; I just want to know how to modify the manner in which the validation data is selected.
As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using validation_split, you need to shuffle your dataset in advance.
Alternatively, you can simply use the sklearn train_test_split() method:
x_train, x_valid, y_train, y_valid = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
This method has an argument named "shuffle" which determines whether to shuffle the data prior to the split (it is set to True by default).
However, an even better split of the data can be obtained by using the "stratify" argument, which will provide a similar distribution of labels among the validation and training sets:
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size=0.2,
                                                    random_state=0,
                                                    stratify=y)
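Either way, once you have a random split you can hand the validation set to Keras explicitly through validation_data instead of validation_split. A sketch, assuming X1, Y1 are the arrays from the question and model is already compiled (note that stratify is generally only meaningful for classification labels):
from sklearn.model_selection import train_test_split

# random 80/20 split instead of slicing the last 20% off the end of the data
x_train, x_valid, y_train, y_valid = train_test_split(X1, Y1, test_size=0.2, random_state=0)

# pass the randomly chosen validation set explicitly; validation_split is no longer needed
model.fit(x_train, y_train, epochs=1000, batch_size=100, verbose=1, shuffle=True,
          validation_data=(x_valid, y_valid))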
