Is there a keras method to split data? - python

I think the title is self-explanatory, but to ask it in detail: sklearn has the method train_test_split(), which works like this: X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y). It means the method will split the data with a 0.3 : 0.7 (test : train) ratio and will try to keep the percentage of each label equal in both parts. Is there a Keras equivalent of this?

Now there is, using the Keras/TensorFlow Dataset class. I'm running keras-2.2.4-tf along with the new TensorFlow release.
Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices, then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset, then use all but the first 400 as training and the first 400 as validation:
ds = ds_in.shuffle(buffer_size=rec_count)  # shuffle the full dataset once
ds_train = ds.skip(400)                    # everything after the first 400 records
ds_validate = ds.take(400)                 # the first 400 records
An instance of the Dataset class is a natural container to pass around for Keras models. I copied the concept from a TensorFlow or Keras training example but can't seem to find it again.
The canned datasets loaded via the load_data method return numpy.ndarray objects, so they are a little different, but they can easily be converted to a Dataset. I suspect this hasn't been done because so much existing code would break.
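For instance, a minimal sketch of converting the numpy arrays returned by mnist.load_data() into a Dataset and carving off a validation subset (the buffer size, seed and 5000-record validation size here are arbitrary choices, not from the original answer):
import tensorflow as tf
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

ds = tf.data.Dataset.from_tensor_slices((x_train, y_train))  # wrap the numpy arrays
ds = ds.shuffle(buffer_size=10000, seed=42)                  # shuffle once
ds_validate = ds.take(5000)                                  # first 5000 shuffled records go to validation
ds_train = ds.skip(5000)                                     # the rest is used for training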

Unfortunately, the answer (despite our wish) is no! There are some built-in datasets, such as MNIST, which can be loaded directly in pre-split form:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
This direct loading in pre-split form can give the false hope that a general method exists, but unfortunately Keras does not provide one, though you may be interested in using the scikit-learn wrappers for Keras.
There is an almost identical question on Data Science SE.
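For reference, a minimal sketch of that wrapper route in older TF/Keras releases (build_model is a placeholder for a function that returns a compiled Keras model; newer releases moved this wrapper into the separate scikeras package):
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import train_test_split

clf = KerasClassifier(build_fn=build_model, epochs=10, batch_size=32, verbose=0)

# Split with sklearn as usual, then use the wrapped model like any sklearn estimator
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
clf.fit(X_train, Y_train)
print(clf.score(X_test, Y_test))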

Related

Is there any way to see the x_test data and labels after the train_test_split function's operations

I have been searching Google etc. for a solution to this challenge for a few weeks now.
What is it?
I am trying to visualise the data that ends up in the Xtest variable via the train_test_split() call below.
Either as text/string output or as the actual image held in that variable at that given time; both would be very helpful.
For now I am using 80 videos in a Training 80 : Testing 20 split, where the validation is taking 20% of Training.
I selected various types of data for the training to see how well the model is at predicting the outcome.
So in the end I have just 16 videos for Testing for now.
WHAT I AM TRYING TO SOLVE: which videos those actually are!
I have no way of knowing what selection of videos were chosen in that group of 16 for processing.
To solve this, I am trying to pass in the video label so that it can present an ID of the specific selection within the XTEST data variable.
WHY I AM DOING THIS
The model is being challenged by a selection of videos that I have no control over.
If I can identify these videos, I can analyse the data and enhance the model's performance accordingly.
The confusion matrix does the same: it shows me just 4 misclassifications, but I have no clue which videos those 4 misclassifications are.
That is not a good approach, hence me asking these questions.
** THE CODE ** where I am at
X_train, Xtest, Y_train, Ytest = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=1, stratify=Y, shuffle=True)
#print(Xtest)
history = model.fit(X_train, Y_train, validation_split=0.20, batch_size=args.batch,epochs=args.epoch, verbose=1, callbacks=[rlronp], shuffle=True)
predict_labels = model.predict(Xtest, batch_size=args.batch,verbose=1,steps=None,callbacks=None,max_queue_size=10,workers=1,use_multiprocessing=False)
print('This is prediction labels', predict_labels)  # This has no video label identifiers
This is working fine, but I cannot draw a hypothesis until I see what's within the Xtest variable.
All I am getting is an array of data with no labels.
For example, Xtest has 16 videos after the split operation:
is it vid04.mp4, vid34.mp4, vid21.mp4, vid34.mp4, vid74.mp4, vid54.mp4, vid71.mp4, vid40.mp4, vid06.mp4, vid27.mp4, vid32.mp4, vid18.mp4, vid66.mp4, vid42.mp4, vid8.mp4, vid14.mp4, etc.?
This is what I really want to see!!!
Please help me understand the process and where I am going wrong.
Thanks in advance for acknowledging my challenge!
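One common way to recover which videos end up in Xtest (not from the original post, just a sketch assuming video_ids is a list of filenames aligned row-for-row with X and Y) is to pass the identifiers through train_test_split as an extra array, since it splits every array it is given with the same indices:
from sklearn.model_selection import train_test_split

# video_ids is assumed, e.g. ["vid04.mp4", "vid34.mp4", ...], one entry per sample in X
X_train, Xtest, Y_train, Ytest, ids_train, ids_test = train_test_split(
    X, Y, video_ids,
    train_size=0.8, test_size=0.2,
    random_state=1, stratify=Y, shuffle=True)

print('Videos in Xtest:', ids_test)  # the 16 filenames that were held out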

Is there a straightforward approach to using a trained machine learning model on a brand new set of data in Python

I have noticed similar questions on this topic when searching the Internet; however, most of the answers point to generating random data to explain the approach to a viable solution and do not seem to explain what I am trying to understand in Python, sklearn and LogisticRegression.
I am trying to learn and understand machine learning model prediction. I visited Kaggle and downloaded the Titanic data to play with and build a survival prediction model. I was able to build a logistic regression to train my model and save it for later.
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data_train[['Sex', 'Pclass', 'Age', 'Relatives', 'Fare']], data_train.Survived, test_size=0.33, random_state=0)
# print(X_train.shape)
clf = LogisticRegression(random_state=0).fit(X_train, y_train)
# save the model to disk with joblib
filename = 'final_model_Joblib.sav'
joblib.dump(clf, filename)
I would now like to use this model on a brand new Titanic data set, attempting to predict survival, which does not exist in this new data set.
How would I go about importing my trained model on this new Titanic data set to make the prediction, where X_test and y_test represent my new Titanic data without survival data?
# load the model from disk
loaded_model = joblib.load(filename)
result = loaded_model.score(X_test, y_test)
print(result)
Well, the whole purpose of training a model is to predict on unseen data, provided the features and the class distribution are the same in your training data and the unseen data.
Once you dump a model using joblib or pickle, it serializes the model (converts it into a Python byte-stream object), and when you load it you get the same object back. According to the sklearn docs you can use loaded_model.predict(x) to get class predictions on unseen data, or the score function to get the accuracy score of your model. For more info, you can check this: https://www.geeksforgeeks.org/saving-a-machine-learning-model/
Hope this answers your question.
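A minimal sketch of that on a new, unlabelled Titanic set (data_new is a placeholder DataFrame with the same five feature columns used for training; score() only works when true labels exist, so the new data gets predict() instead):
import joblib

loaded_model = joblib.load('final_model_Joblib.sav')

X_new = data_new[['Sex', 'Pclass', 'Age', 'Relatives', 'Fare']]  # no 'Survived' column here
predictions = loaded_model.predict(X_new)          # predicted 0/1 survival per passenger
probabilities = loaded_model.predict_proba(X_new)  # class probabilities, if needed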

sklearn preprocessing.scale() function, when to use it?

I'm building a neural network using sklearn.neural_network.MLPClassifier:
clf = sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(11, 11, 11), max_iter=500)
Before training it, I'm creating new fetchers (features) from the existing ones using preprocessing.scale(), like so:
labels = someDataBase.loadLabels()
fetchers = someDataBase.loadFetchers()
fetchers = preprocessing.scale(fetchers)
and from them, using the train_test_split function, creating the train and test values, like so:
X_train, X_test, y_train, y_test = train_test_split(fetchers, labels, test_size=0.2)
then I feed it to the fit function of the MLPClassifier
clf.fit(X_train, y_train)
Now that I have a trained neural network, I want to use it to predict based on new fetchers, using the predict method of the MLPClassifier. These fetchers are not the test ones; they are totally new values. Should I be using preprocessing.scale() again and then feeding them into the predict method, or just use them as they are?
Your approach may give different scaling factors each time. preprocessing.scale() is meant for single, one-off scaling jobs, not for ones requiring a consistent transformation.
I suggest you rather use sklearn.preprocessing.StandardScaler. It is quite well documented and has examples here.
Call fit_transform method when training, and just transform when predicting.
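A minimal sketch of that pattern (fetchers_new is a placeholder for your totally new, unseen values):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data only
X_test_scaled = scaler.transform(X_test)        # reuse those same scaling factors
clf.fit(X_train_scaled, y_train)

# Later, for completely new fetchers:
new_scaled = scaler.transform(fetchers_new)     # same transformation, no refitting
predictions = clf.predict(new_scaled)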

Choose random validation data set

Given a numpy array consisting of data generated over time by a simulation. Based on this, I'm using TensorFlow and Keras to train a neural network, and my question refers to this line of code in my model:
model.fit(X1, Y1, epochs=1000, batch_size=100, verbose=1, shuffle=True, validation_split=0.2)
After reading the Keras documentation, I found out that the validation data set (in this case 20% of the original data) is sliced from the end. As I'm generating data over time, I obviously don't want the last part to be sliced off, because it would not be representative for validation. I'd rather have the validation data chosen randomly from the whole data set. For this purpose I am currently shuffling my whole data set (ANN inputs and outputs in unison) before training to get random validation data.
I feel like I don't want to ruin the time component in my data, which is why I'm searching for a way to choose the validation set randomly without having to shuffle the whole data set. I'd also like to know what you think of not shuffling time-continuous data. Again, I'm not asking about the nature of the validation split; I just want to know how to modify the way the validation data is selected.
As you mentioned, Keras simply takes the last x samples of the dataset, so if you want to keep using it, you need to shuffle your dataset in advance.
Or, you can simply use the sklearn train_test_split() method:
x_train, x_valid, y_train, y_valid = sklearn.model_selection.train_test_split(x, y, test_size=0.2)
This method has an argument named "shuffle" which determines whether to shuffle the data prior to the split (it is set to True by default).
Moreover, a better split of the data can be obtained by using the "stratify" argument, which gives a similar distribution of labels in the validation and training sets:
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                     test_size=0.2,
                                                     random_state=0,
                                                     stratify=y)
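Either way, the randomly chosen hold-out set can then be passed to Keras explicitly via validation_data instead of validation_split (a sketch reusing the epochs and batch size from the question):
model.fit(x_train, y_train,
          epochs=1000, batch_size=100, verbose=1, shuffle=True,
          validation_data=(x_test, y_test))  # validation set chosen randomly above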

Do the two kinds of xgboost interface work exactly the same?

I'm currently working on an in-class competition on Kaggle.
I have read the official Python API reference, and I'm kind of confused about the two kinds of interface, especially regarding grid search, cross-validation and early stopping.
In the XGBoost API, I can use xgb.cv(), which splits the whole dataset into train and validation parts to cross-validate, to tune good hyperparameters and then get the best_iteration.
Thus I can set num_boost_round to the best_iteration. To make maximal use of the data, I train on the whole dataset again with the well-tuned hyperparameters and then use that model to classify. The only defect is that I have to write the grid-search code myself.
ATTENTION: this cross-validation set changes at each fold, so the training result will have no particular bias toward any part of the data.
But with the sklearn interface, it seems that I cannot get best_iteration from clf.fit() as I do with the xgb model. Indeed, the fit() method has early_stopping_rounds and eval_set to implement the early-stopping part. Many people implement the code like this:
X_train, X_test, y_train, y_test = train_test_split(train, target_label, test_size=0.2, random_state=0)
clf = GridSearchCV(xgb_model, para_grid, scoring='roc_auc', cv=5, \
verbose=True, refit=True, return_train_score=False)
clf.fit(X_train, y_train, early_stopping_rounds=30, eval_set=[(X_test, y_test)])
....
clf.predict(something)
But the problem is that I have split the data into two parts at the start. The eval set will not change at each fold, so maybe the result will be biased toward this random part of the whole dataset. The same problem also occurs in the grid search: the final parameters may tend to fit X_test and y_test more.
I'm fond of GridSearchCV in sklearn, but I also want the eval_set to change at each fold, just as xgb.cv does. I believe that would make full use of the data while preventing overfitting.
What should I do?
I have thought of two ways:
using the XGB API, and writing the grid search myself;
using the sklearn API, and changing the eval_set manually at each fold.
Are there any more convenient methods?
As you have summarised, both approaches have advantages and disadvantages.
xgb.cv will use the left-out fold for early stopping, thus you do not need an additional split into a validation/train sample to determine when to trigger early stopping.
GridSearchCV (or maybe try out RandomizedSearchCV) will handle the parameter grid and the optimal choice for you.
Note that it is not a problem to use a fixed sub-sample for early stopping in all CV folds, so I do not think you have to do anything like "change the eval_set manually at each fold". The evaluation sample used for early stopping does not directly affect the model parameters; it is only used to decide when the evaluation metric on a hold-out sample stops improving. For the final model you can drop early stopping: you can see where the model stops with the optimal hyperparameters using the aforementioned split, then use that number of trees as a fixed parameter in the final model fit.
So in the end it is a matter of taste, as in both cases you will need to compromise on something. IMO, the sklearn API is the optimal choice, as it lets you use the rest of the sklearn tools (e.g. for data pre-processing) naturally in a pipeline within CV, and it gives a homogeneous interface to model training for various approaches. But in the end it is up to you.
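A minimal sketch of that last suggestion, following the fit() signature used in the question (tuned_params is a placeholder dict holding the grid-searched hyperparameters, without n_estimators):
from xgboost import XGBClassifier

# 1) With the tuned hyperparameters, find the stopping point on the held-out split
probe = XGBClassifier(n_estimators=1000, **tuned_params)
probe.fit(X_train, y_train,
          early_stopping_rounds=30, eval_set=[(X_test, y_test)], verbose=False)
best_n = probe.best_iteration  # rounds after which the eval metric stopped improving

# 2) Refit on all the data with that number of trees fixed and no early stopping
final_model = XGBClassifier(n_estimators=best_n, **tuned_params)
final_model.fit(train, target_label)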
