Choose random validation data set - python

I have a numpy array of data generated over ongoing time by a simulation. Based on this data I'm using TensorFlow and Keras to train a neural network, and my question refers to this line of code in my model:
model.fit(X1, Y1, epochs=1000, batch_size=100, verbose=1, shuffle=True, validation_split=0.2)
After reading the Keras documentation I found out that the validation data set (in this case 20% of the original data) is sliced from the end. As I'm generating data over ongoing time, I obviously don't want the last part to be sliced off, because it would not be representative for validation. I'd rather have the validation data chosen randomly from the whole data set. For this purpose I am currently shuffling my whole data set (inputs and outputs for the ANN in unison) before training in order to get random validation data.
I feel like I don't want to ruin the time component in my data, which is why I'm looking for a way to just choose the validation set randomly without having to shuffle the whole data set. I'd also like to know what you think about not shuffling time-continuous data. Again, I'm not asking about the nature of the validation split; I just want to know how to modify the way the validation data is selected.

As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.
Or, you can simply use sklearn's train_test_split() method:
x_train, x_valid, y_train, y_valid = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
This method has an argument named "shuffle" which determines whether to shuffle the data prior to the split (it is set to True by default).
Furthermore, you can often get a better split by using the "stratify" argument, which keeps a similar distribution of labels between the training and validation sets:
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                     test_size=0.2,
                                                     random_state=0,
                                                     stratify=y)
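If you want to pick the validation samples at random while leaving the training data in its original time order, a minimal sketch (using the X1, Y1 and model from the question, and numpy for the index selection) could look like this:
import numpy as np

n_samples = len(X1)
# Randomly pick 20% of the indices for validation
val_idx = np.random.choice(n_samples, size=int(0.2 * n_samples), replace=False)
# The remaining indices come back sorted, so the training data keeps its time order
train_idx = np.setdiff1d(np.arange(n_samples), val_idx)

model.fit(X1[train_idx], Y1[train_idx],
          epochs=1000, batch_size=100, verbose=1,
          shuffle=True,  # Keras still shuffles the training batches each epoch
          validation_data=(X1[val_idx], Y1[val_idx]))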

Related

Is there any way to see the x_test data and labels after the train_test_split operation

I have been searching Google etc. for a solution to this challenge for a few weeks now.
What is it?
I am trying to visualise the data that ends up in the Xtest variable via the train_test_split() function below.
Either as text/string output or as the actual image stored in that variable at that given time; both would be very helpful.
For now I am using 80 videos in an 80 : 20 training : testing split, where the validation takes 20% of the training data.
I selected various types of data for the training to see how well the model is at predicting the outcome.
So in the end I have just 16 videos for Testing for now.
WHAT I AM TRYING TO SOLVE ==> which videos are those?
I have no way of knowing which videos were chosen in that group of 16 for processing.
To solve this, I am trying to pass in the video label so that it can present an ID of the specific selection within the Xtest data variable.
WHY I AM DOING THIS
The model is being challenged by a selection of videos that I have no control over.
If I can identify these videos, I can analyse the data and enhance the model's performance accordingly.
The confusion matrix does the same: it presents me with just 4 misclassifications, but which 4 misclassifications they are, I have no clue.
That is not a good approach, hence me asking these questions.
** THE CODE ** where I am at
X_train, Xtest, Y_train, Ytest = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=1, stratify=Y, shuffle=True)
#print(Xtest)
history = model.fit(X_train, Y_train, validation_split=0.20, batch_size=args.batch,epochs=args.epoch, verbose=1, callbacks=[rlronp], shuffle=True)
predict_labels = model.predict(Xtest, batch_size=args.batch,verbose=1,steps=None,callbacks=None,max_queue_size=10,workers=1,use_multiprocessing=False)
print('This is prediction labels', predict_labels)  # This has no video label identifiers
This is working fine, but I cannot draw a hypothesis until I see what's within the Xtest variable.
All I am getting is an array of data with no labels.
For example: Xtest has 16 videos after the split operations:
is it vid04.mp4, vid34.mp4, vid21.mp4, vid34.mp4, vid74.mp4, vid54.mp4, vid71.mp4, vid40.mp4, vid06.mp4, vid27.mp4, vid32.mp4, vid18.mp4, vid66.mp4, vid42.mp4, vid8.mp4, vid14.mp4, etc???!?!??!?!
This is what I really want to see!!!
Please assist me in understanding the process and where I am going wrong.
Thanx in advance for acknowledging my challenge!
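One common way to find out which videos land in the test split (just a sketch, assuming a video_names list of filenames aligned with X and Y, which is not in the original code) is to pass the identifiers through train_test_split as an extra array so they stay in sync with the data:
from sklearn.model_selection import train_test_split

# video_names is a hypothetical list like ["vid04.mp4", "vid34.mp4", ...], same length and order as X and Y
X_train, Xtest, Y_train, Ytest, names_train, names_test = train_test_split(
    X, Y, video_names,
    train_size=0.8, test_size=0.2, random_state=1, stratify=Y, shuffle=True)

print('Videos in the test split:', names_test)  # the 16 filenames that ended up in Xtest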

How to test new data against a trained model?

I'm a beginner in Machine Learning. At first, my model gave me an accuracy of 85.82%, which was good. But now I would like to test the model against totally new data, and I can't figure out what to add to the code, as I can only get the test accuracy when the model is tested with validation data.
The following is my code:
Create a new data set in EXACTLY the same manner in which you created the original test set. The term exactly is important: if you pre-processed your training and validation data, you must do the same pre-processing on this new data.
Another approach is to split the data into 3 groups: train, validation and test. You can do that with train_test_split as follows.
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, train_size=0.8, random_state=1)
X_valid, X_test, Y_valid, Y_test = train_test_split(X_temp, Y_temp, train_size=0.5, shuffle=False)
This will take your input set and use 80% of it for training, 10% for validation and 10% for testing. In your code, what you called the test set is actually the validation set, so when you did model.evaluate you got the same accuracy as the validation accuracy. So now, in model.fit, set validation_data=(X_valid, Y_valid). Your test set is then independent of the validation set, so when you run model.evaluate you should get a somewhat different accuracy than the validation accuracy.
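Putting that together, a minimal sketch of the training and final evaluation (assuming the model is already compiled with metrics=['accuracy']; epochs and batch_size are only illustrative):
# Train with the explicit validation set
history = model.fit(X_train, Y_train,
                    validation_data=(X_valid, Y_valid),
                    epochs=50, batch_size=32, verbose=1)

# Only at the very end, evaluate once on the untouched test set
test_loss, test_acc = model.evaluate(X_test, Y_test, verbose=0)
print('Test accuracy:', test_acc)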

Should you FIT train, test or all x and y values for a LinearRegression?

I have seen so many examples for LinearRegression, and they are all so different. The question is: should I fit the train data, the test data, or all of the data to the model? Every example had a different way of handling the regression...
This is the split of data, no problem here:
X = data[['day']].values
y = data[['ozone']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
But when I fit the model, what option should I choose?
model = LinearRegression()
1. model.fit(X_train,y_train)
2. model.fit(X_test,y_test)
3. model.fit(data[['day']].values, data[['ozone']].values) #X and y
Also, I must say that the offered plots show the best results using the third method. Is this the correct approach?
The reason so many examples out there are so different is because context trumps method. For example, if I'm running a physics experiment to verify a theoretical equation, then in essence the theoretical equation itself is the "test set" and I would want to use as much data as I could (i.e. use all the data) to reduce bias and variance in the estimate. So, if your ozone problem is very well supported by theoretical physical reasoning and you just want to solve for some coefficients (i.e. physical constants) then you want to use the entire dataset to nail down those coefficients as best as possible. In the statistical sense, the physical motivation acts as a "prior," and can be extremely well known (for more about this viewpoint I suggest Kruschke's Bayesian Analysis book).
On the other hand, if you have no idea what sort of effects could be driving the ozone measurements you're getting, and you want to solve for an unknown mapping (using a linear basis set that you assume will work to describe that mapping) then you should hold out some level of your actual measurements to see how well that mapping can generalize.
Lots of "machine learning" these days is primarily data driven because we're coming to the point where we have loads of accessible data, and as such when you take courses that describe methods of data fitting and descriptors (e.g. linear regression) they often come from the entirely data driven context. Whether physically driven or data driven, the methods are very similar and the way you use the methods can even blend together in the middle of either extreme.
More to your question and how to code it, if you take the data driven approach of dividing a train and test set then what you are really doing is saying you want to fit your model to some random sample of "training data" but since you have nothing to compare to later you need to see how well that fit generalizes to some more data, the "test data." So fit your "train data" and then predict or evaluate on your "test data" to see how well your model or mapping works on "unseen" data.
E.g. (expanding on your code):
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
X = data[['day']].values
y = data[['ozone']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
model = make_pipeline(PolynomialFeatures(3), LinearRegression())  # if you think a 3rd-order poly basis ought to work
model.fit(X_train, y_train)
The quick answer is: train your model on the train sample.
Whatever your model is (linear regression or anything else), you always want to make sure your model is not over-fitted, meaning it will still perform well on unseen data. That's why you should always train your model on a subset of the full dataset (the training dataset), and use the testing set to assess the model's performance (R² or whichever metric is best suited for your application).
So you should:
X = data[['day']].values
y = data[['ozone']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=False)
model = LinearRegression()
model.fit(X_train,y_train)
And then:
y_pred = model.predict(X_test)
From there you can compare y_pred to y_test.
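For instance, a quick way to quantify that comparison (using sklearn.metrics; not part of the original answer):
from sklearn.metrics import mean_squared_error, r2_score

print('MSE:', mean_squared_error(y_test, y_pred))
print('R^2:', r2_score(y_test, y_pred))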
You fit your model on the train sets, so the features X_train and the target y_train. So in your case, it is option 1:
model.fit(X_train,y_train)
Once your model is trained, you can test it on X_test, comparing the y_pred that results from running the model on the test set to y_test.
The reason you get the 'best plots' for your metric when using option 3 is that you are essentially training on the whole dataset. If you then test on a subset of that, you will naturally get a better score, as you are testing your model on data it has already seen during training. You should never do that.

Training a model by looping through the train_test_split and training without looping

I am new to python and Keras please bear with my question.
I recently created a model in Keras, trained it and got the mean squared error (MSE) after prediction. I used the train_test_split function on the data set.
Next I created a loop with 50 iterations and applied it to the above-mentioned model. However I kept the train_test_split function (random_state not specified) within the loop, such that in every iteration I would have a new set of X_train, y_train, X_test and y_test values. I obtained 50 MSE values as output and calculated their mean and standard deviation.
My query is: did I do the right thing by placing the train_test_split function within the loop? Does it affect my goal, which was to see the different MSE values generated for my data set?
If I had placed the train_test_split function outside my loop and performed the above-said activity, wouldn't the X_train, y_train, X_test and y_test values remain the same throughout all of my 50 iterations? Wouldn't this cause an overfitting problem for my model?
I would really appreciate your feedback.
My code snippet:
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
MSE=np.zeros(50)
for i in range(50):
    predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3)
    model = regression_model()
    model.fit(predictors_train, target_train, validation_data=(predictors_test, target_test), epochs=50, verbose=0)
    model.evaluate(predictors_test, target_test, verbose=0)
    target_predicted = model.predict(predictors_test)
    MSE[i] = metrics.mean_squared_error(target_test, target_predicted)
    print("Test set MSE for {} cycle: {}".format(i + 1, MSE[i]))
The method you are implementing is called cross-validation; it allows your model to get a better "view" of your data and reduces the chance that your training data was "too perfect" or "too noisy".
So putting train_test_split in the loop will generate new training splits from your original data, and by averaging the outputs you will get what you want.
If you put train_test_split outside the loop, the training data will remain the same for all of your training iterations, resulting in overfitting like you said.
However, train_test_split is random, so two of the random splits can end up very similar, which makes this method sub-optimal.
A better way is to use k-fold cross-validation:
from sklearn.model_selection import StratifiedKFold
# Note: StratifiedKFold expects discrete class labels; for a continuous regression target, use KFold instead.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)  # seed: any fixed integer for reproducibility
MSE = []
for i, (train, test) in enumerate(kfold.split(X, Y)):
    model = regression_model()
    model.fit(X[train], Y[train], validation_data=(X[test], Y[test]), epochs=50, verbose=0)
    model.evaluate(X[test], Y[test], verbose=0)
    target_predicted = model.predict(X[test])
    MSE.append(metrics.mean_squared_error(Y[test], target_predicted))
    print("Test set MSE for {} cycle: {}".format(i + 1, MSE[i]))
print("Mean MSE for {}-fold cross validation: {}".format(len(MSE), np.mean(MSE)))
This method will create 10 folds of your training data and will fit your model using a different one as the validation fold at each iteration.
You can have more info here : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html
Hope this will help you !
EDIT FOR PRECISION
Indeed, don't use this method on your TEST data, but only on your VALIDATION data!
Your model must never see your TEST data before the final evaluation!
You don't want to use the test set during training at all. You would be tweaking the model to the point where it starts "overfitting" even the test set, and your error estimates would be too optimistic.
Yes, if you place train_test_split outside of that for loop, your sets will stay the same for the whole training and it can lead to overfitting. That is why you have a validation set, which is not used for training but for validation, mostly to find out whether your model is overfitting the train set or not. If it is overfitting, you should solve it by tweaking your model (making it less complex, implementing regularization, early stopping...).
But don't train your model on the same data you use for testing. Training on the validation data is a different story, and it is normally done when implementing k-fold cross-validation.
So the general steps to follow are (a minimal code sketch of the first two steps follows at the end of this answer):
split your dataset into test set and the "other" set, put the test set away and don't show it to your model until you are ready for final testing => only when you have already trained and tuned your model
choose whether you want to implement k-fold cross-validation or not. If not, then split your data into training and validation set and use them throughout the whole training => training set for training and validation set for validating
if you want to implement k-fold cross-validation then follow step 2, measure the error metric that you want to track, pick the other set again, split it into a different training set and validation set, and do the whole training again. Repeat this multiple times and take the average of the error metrics measured during these cycles to get a better (average) error estimate
tune your model and repeat the steps 2 and 3 until you are happy with the results
measure the error of your final model on the test set to see whether it generalizes well
Note that while implementing k-fold cross validation is generally a good idea, this approach might be infeasible for larger neural networks because it can dramatically increase the time it takes to train them. If that is the case, you might want to stick with just one training set and one validation set or set k (in k-folds) to some low number such as 3.
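As mentioned above, here is a minimal sketch of the first two steps (hold out the test set first, then split the rest into training and validation; the names and sizes are only illustrative):
from sklearn.model_selection import train_test_split

# Step 1: put the test set aside; the model never sees it until the final testing
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: split the remaining data into a training set and a validation set
# (0.25 of the remaining 80% gives a 60/20/20 train/validation/test split overall)
X_train, X_valid, y_train, y_valid = train_test_split(X_other, y_other, test_size=0.25, random_state=42)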

Is there a keras method to split data?

I think the title is self-explanatory, but to ask it in detail: there's sklearn's train_test_split() method, which works like X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y). It means: the method will split the data with a 0.3 : 0.7 ratio and will try to make the percentage of labels in both sets equal. Is there a Keras equivalent of this?
Now there is, using the Keras Dataset class (tf.data.Dataset). I'm running keras-2.2.4-tf along with the new tensorflow release.
Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices. Then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset. Then use all but the first 400 as training and the first 400 as validation.
ds = ds_in.shuffle(buffer_size=rec_count)
ds_train = ds.skip(400)
ds_validate = ds.take(400)
An instance of the Dataset class is a natural container to pass around for the Keras models. I copied the concept from a tensorflow or keras training example but can't seem to find it again.
The canned datasets using the load_data method create numpy.ndarray classes so they are a little different but can be easily converted to a keras Dataset. I suspect this hasn't been done because so much existing code would break.
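For example, a small sketch of that conversion (assuming the MNIST arrays from load_data and an arbitrary validation size of 400):
import tensorflow as tf
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Wrap the numpy arrays in a Dataset, shuffle, and carve off a validation split
ds = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(buffer_size=len(X_train))
ds_validate = ds.take(400).batch(32)
ds_train = ds.skip(400).batch(32)

# Both can then be passed straight to Keras:
# model.fit(ds_train, validation_data=ds_validate, epochs=...)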
Unfortunately, the answer (despite our wish) is No! There are some existing datasets like MNIST etc. which can be directly loaded:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
The fact that these load in an already-split way gives a false hope of a general method, but unfortunately one isn't present here, though you may be interested in using the scikit-learn wrappers for Keras.
There is an almost identical question on Data Science SE.
