What does this code mean?
Can you explain it to me:
features_train, labels_train, features_test, labels_test = makeTerrainData()
def submitAccuracy():
    return acc
In machine learning development you split your available data into train and test sets and, if possible, an additional validation set. You do this to test for overfitting and to ensure your model generalizes to unseen observations. The final validation set is useful because, often without realizing it, people optimize their parameters against accuracy on the test partition, and in doing so they are effectively giving the model hints about that data. The validation set lets you check that this hasn't happened and that your model isn't overfit.
Going only by the code provided, features_train likely corresponds to the actual data used to develop the model, in the train partition. The labels are the categories you are trying to predict.
The test partition is simply a random sample of your available data. Features/labels are the same as above.
You want to build the model off of the training data, and assess accuracy on the test partition.
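The split described above can be sketched in a few lines of plain Python. This is only an illustration: three_way_split, the 60/20/20 proportions, and the toy data are assumptions made for the example, not part of the code in the question. Libraries such as scikit-learn offer an equivalent helper (train_test_split).

```python
import random

def three_way_split(rows, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Shuffle rows and cut them into train / validation / test partitions."""
    rng = random.Random(seed)       # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical toy data: (features, label) pairs.
data = [((x, x % 7), x % 2) for x in range(100)]
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))  # 60 20 20
```

The model is fitted on train only; val and test are held back so that accuracy on them reflects data the model has not seen.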
Sebastian Raschka provides a marvelous overview of machine learning in Python. The code samples and some explanations can be found at https://github.com/rasbt/python-machine-learning-book/tree/master/code
Related
Hey, I am training a CNN model and was wondering what will happen if I use the same data for validation and test?
Does the model train on the validation data as well? (Does my model see the validation data?) Or are the error and accuracy just calculated and taken into account for training?
You use your validation set to tune your model. This means you don't train on this data, but the model still takes it into account indirectly. For example, you use it to tune the model's hyperparameters.
For a good evaluation, the test set should be data that is completely unknown to the model.
Take a look at this article for more information. Here are the parts most relevant to your question:
A validation dataset is a sample of data held back from training your
model that is used to give an estimate of model skill while tuning
model’s hyperparameters.
The validation dataset is different from the test dataset that is also
held back from the training of the model, but is instead used to give
an unbiased estimate of the skill of the final tuned model when
comparing or selecting between final models.
If you use the same set for validation and test, your model may overfit (since it has seen the test data before the final test stage).
The dataset contains:
15000 observations/rows
3000 features/columns
Can I train a machine learning model on this dataset?
Yes, you can apply an ML model, but first you need to understand your problem statement and what each feature in the dataset represents. If the dataset is large, try splitting it into two clusters, or take a smaller sample, to explore what the data tells you.
This is where population and sampling come into practical use.
Check that accuracy on the training set and on the test set is similar. If it is not, your model is memorizing instead of learning, and this is where regularization comes into the picture.
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
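The cross-validation sanity check can be sketched as follows. Everything here is an assumption made for illustration: the toy one-feature data, the helper names, and the choice of a 1-nearest-neighbour classifier, which memorizes its training set and so makes the train/held-out gap easy to see. It is a minimal sketch, not a recipe for the 15000 x 3000 dataset in the question.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train_indices, held_out_indices) for each of k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, folds[i]

def predict_1nn(train_x, train_y, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, xs, ys):
    hits = sum(predict_1nn(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)

# Hypothetical toy data: one feature, label is (x > 0.5) with about 20% of labels flipped.
rng = random.Random(1)
features = [rng.random() for _ in range(200)]
labels = [int(x > 0.5) ^ int(rng.random() < 0.2) for x in features]

train_scores, held_out_scores = [], []
for train_idx, held_idx in k_fold_indices(len(features), k=5):
    tx = [features[j] for j in train_idx]
    ty = [labels[j] for j in train_idx]
    hx = [features[j] for j in held_idx]
    hy = [labels[j] for j in held_idx]
    train_scores.append(accuracy(tx, ty, tx, ty))      # score on data the model was fitted on
    held_out_scores.append(accuracy(tx, ty, hx, hy))   # score on the unseen fold

mean_train = sum(train_scores) / len(train_scores)
mean_held_out = sum(held_out_scores) / len(held_out_scores)
print(f"train accuracy: {mean_train:.2f}, held-out accuracy: {mean_held_out:.2f}")
```

A large gap between the two numbers is the overfitting signature described above; similar numbers suggest the model generalizes.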
I'm wondering whether the result on the test set is used to optimize the model's weights. I'm trying to build a model, but my problem is that I don't have much data because the subjects are medical research patients. The number of patients is limited in my case (61), and I have 5 feature vectors per patient. What I tried was to create a deep learning model by excluding one subject and using the excluded subject as the test set. My problem is that there is large variability in the subject features: my model fits the training set (60 subjects) well, but not the 1 excluded subject.
So I'm wondering whether the test set (in my case the excluded subject) could be used in some way to make the model converge so that it better classifies the excluded subject?
You should not use the test data from your dataset in your training process. If you don't have enough training data, one approach that is used a lot these days (especially for medical images) is data augmentation, so I highly recommend using this technique in your training process. "How to use Deep Learning when you have Limited Data" is a good tutorial on data augmentation.
No, you shouldn't use your test set for training; keeping it separate is what lets you detect overfitting. Following cross-validation principles, you need to split your data into three datasets: a training set that you use to train your model, a validation set to try different values of your hyperparameters, and a test set for the final test of your model. If you use all your data for training, you have no unseen data left to show whether your model has overfit.
One thing to remember: deep learning works well when you have large and very rich datasets.
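The leave-one-subject-out scheme from the question can be written so that all 5 feature vectors of a patient are held out together and never influence training. The function name and the subject_ids layout below are assumptions made for illustration; scikit-learn's LeaveOneGroupOut provides a comparable splitter.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx

# Assumed layout: 61 subjects, 5 feature vectors each, one subject id per row.
subject_ids = [s for s in range(61) for _ in range(5)]
splits = list(leave_one_subject_out(subject_ids))
print(len(splits), len(splits[0][1]), len(splits[0][2]))  # 61 300 5
```

In each fold the model is trained from scratch on the 300 training rows and only scored on the 5 held-out rows; averaging the score over all 61 folds gives a more stable estimate than a single excluded subject.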
I am just curious why I have to scale the testing set using the testing set, and not using the training set, when I'm training a model such as a CNN.
Or am I wrong, and I still have to scale it using the training set?
Also, can I train a CNN on a dataset that contains positive and negative elements as the first input of the network?
Any answers with references would be really appreciated.
We usually have 3 types of datasets for training a model:
Training Dataset
Validation Dataset
Test Dataset
Training Dataset
This should be an evenly distributed dataset that covers all varieties of the data. If you train for too many epochs, the model will get used to the training dataset and will only give proper predictions on the training dataset; this is called overfitting. The only way to keep a check on overfitting is to have other datasets that the model has never been trained on.
Validation Dataset
This can be used to fine-tune the model's hyperparameters.
Test Dataset
This is the dataset that the model has not been trained on and that has never been part of deciding the hyperparameters. It shows how the model really performs.
If scaling and normalization are used, the testing set should use the same parameters used during training.
A good answer that covers this: https://datascience.stackexchange.com/questions/27615/should-we-apply-normalization-to-test-data-as-well
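As a minimal sketch of "the same parameters used during training": compute the mean and standard deviation on the training column only, then reuse them for the test column. The helper names and the numbers are made up for the example; scikit-learn's StandardScaler follows the same fit-on-train, transform-both pattern.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn scaling parameters from the training data only."""
    mu = mean(column)
    sigma = pstdev(column) or 1.0   # fall back to 1.0 for a constant column
    return mu, sigma

def transform(column, mu, sigma):
    """Standardize a column with previously learned parameters."""
    return [(v - mu) / sigma for v in column]

train_col = [2.0, 4.0, 6.0, 8.0]   # hypothetical training feature
test_col = [5.0, 10.0]             # hypothetical test feature

mu, sigma = fit_scaler(train_col)                # parameters come from train only
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)     # reuse them; do not refit on test
print(mu, round(sigma, 3), [round(v, 3) for v in test_scaled])
```

The scaled test values need not have zero mean or unit variance. That is expected: they are expressed in the units the model learned during training.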
Also, some models tend to require normalization and others do not.
Neural network architectures are normally robust and might not need normalization.
Scaling depends on the requirement as well as on the data you have. Test data gets scaled with test data only, because test data doesn't have the target variable (one less feature in the test data). If we scale our training data together with new test data, the model will not be able to correlate it with any target variable and will thus fail to learn. So the key difference is the existence of the target variable.
I know this will be very basic, however I'm really confused and I would like to understand parameter tuning better.
I'm working on a benchmark dataset that is already partitioned into three splits (training, development, and testing), and I would like to tune my classifier parameters using GridSearchCV from sklearn.
What is the correct partition to tune the parameters on? Is it the development or the training split?
I've seen researchers in the literature mentioning that they "tuned the parameters using GridSearchCV on the development split"; another example is found here.
Do they mean they trained on the training split and then tested on the development split? Or do ML practitioners usually mean they perform the GridSearchCV entirely on the development split?
I'd really appreciate a clarification. Thanks.
Usually in a 3-way split you train a model on the training set, then validate it on a development set (also called a validation set) to tune hyperparameters, and then, after all the tuning is complete, you perform a final evaluation of the model on a previously unseen testing set (also known as an evaluation set).
In a two-way split you just have a train set and a test set, so you perform tuning/evaluation on the same test set.
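The 3-way procedure can be sketched without any library: fit on the training split, pick the hyperparameter that scores best on the development split, then score once on the test split. The k-nearest-neighbour classifier, the candidate values of k, and the toy data are all assumptions made for the example. With GridSearchCV itself, one option is to pass a fixed split through scikit-learn's PredefinedSplit instead of letting it run k-fold cross-validation.

```python
import random
from collections import Counter

def knn_predict(train, x, k):
    """Majority vote among the k training rows closest to x."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, rows, k):
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)

rng = random.Random(0)

def make_rows(n):
    """Hypothetical data: label is (x > 0.5) with about 20% of labels flipped."""
    rows = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5) ^ int(rng.random() < 0.2)
        rows.append((x, y))
    return rows

train, dev, test = make_rows(300), make_rows(100), make_rows(100)

candidate_ks = [1, 5, 15, 31]                        # the "grid" of hyperparameter values
dev_scores = {k: accuracy(train, dev, k) for k in candidate_ks}
best_k = max(dev_scores, key=dev_scores.get)         # chosen on the development split only
test_score = accuracy(train, test, best_k)           # reported once, after tuning is done
print(best_k, dev_scores, test_score)
```

The test split plays no part in choosing best_k, which is what keeps the final score an honest estimate.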