I am fairly new to machine learning. I have a biological dataset like this:
index  position  B  y
1      1001      0  0.567
2      1010      0  0.682
3      1012      1  0.346
4      1016      1
5      1020      1  0.875
6      1040      0
7      1044      0  1.00
8      1047      1  0.101
9      1056      0  0.00
I am trying to predict the missing y values. I am using KNN regression for this. I have read that a train-validation-test split is better than a train-test split followed by cross-validation. I have some questions:
When I train my model, should I exclude the rows where y is not known?
How do I work with the validation and test sets?
After I have trained my model, should I take only the rows where y is not known and predict those values?
Is the accuracy and error rate on the test dataset the accuracy of my model?
I have a lot of questions. I tried to figure these out by watching tutorials but couldn't get a complete picture. Any help will be appreciated. Thank you.
I will try to answer your questions one by one.
When you train a model, you tell it which target value is correct for a given set of features (the columns of your dataset), so you always need the target value y in the training phase; rows without a known y cannot be used for training.
Using cross-validation means you do not waste your test set (remember, you can use it only once!) and it improves your training by helping you avoid overfitting.
You should test your trained model on the test set: you predict on the test set (whose target values are held back) and look at the results (see the sketch after this list).
The test set, if used only once, gives you the accuracy of the model. Keep in mind that there are cases where accuracy is misleading (*).
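To make the first and third points concrete for the dataset in the question, here is a hedged sketch assuming the data sits in a pandas DataFrame with the columns shown above (the KNN settings are only illustrative):

    import pandas as pd
    from sklearn.neighbors import KNeighborsRegressor

    # The small table from the question; None marks the missing y values.
    df = pd.DataFrame({
        "position": [1001, 1010, 1012, 1016, 1020, 1040, 1044, 1047, 1056],
        "B":        [0, 0, 1, 1, 1, 0, 0, 1, 0],
        "y":        [0.567, 0.682, 0.346, None, 0.875, None, 1.00, 0.101, 0.00],
    })

    # Only rows with a known y can be used for fitting.
    known = df[df["y"].notna()]
    unknown = df[df["y"].isna()]

    model = KNeighborsRegressor(n_neighbors=3)
    model.fit(known[["position", "B"]], known["y"])

    # Predict y only for the rows where it is missing.
    df.loc[df["y"].isna(), "y_pred"] = model.predict(unknown[["position", "B"]])
    print(df)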
In the end, the process for training a model is more or less this:
Split your dataset into an 80% training set and a 20% test set (or 70-30, depending on how much data you have).
From the training set, build a training set and a validation set, for example using this (this is just a tip); a validation set of 10-15% is typical.
Train your model on the training set and validate it (very important!) on the validation set.
Hold back (and save somewhere else) the y column of the test set and use your trained model to predict on the test set.
With the values it produces, compute metrics (such as MSE) between these predictions and the values you saved in the previous step to see how good your model is.
Remember that this is a guideline; in practice it is more complex, as you will see the deeper you get into the subject. A rough code sketch of these steps follows below.
(*) For example, if you train your model on samples that are almost all alike and the test set is composed of 99% similar samples and 1% different ones, you will get an accuracy of 99%. That sounds like a lot, but it is useless if the model can only predict one class. So keep each step in mind depending on your case.
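A minimal sketch of those steps with scikit-learn, assuming the features X and targets y are already prepared as arrays (the toy data, the split percentages, and the KNN regressor are only illustrative):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_squared_error

    # Toy regression data; replace with your own feature matrix X and target y.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

    # 80% train / 20% test, then 15% of the training part as validation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=0)

    model = KNeighborsRegressor(n_neighbors=5)
    model.fit(X_tr, y_tr)

    # The validation set guides choices such as n_neighbors;
    # the held-back test targets are used once, at the very end.
    print("validation MSE:", mean_squared_error(y_val, model.predict(X_val)))
    print("test MSE:", mean_squared_error(y_test, model.predict(X_test)))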
Related
I am training a neural network. For training I use 80% of my data and divide it into a number of mini-batches. I train on each mini-batch, then update the parameters, until all the data has been visited. I repeat the whole procedure for a number of epochs.
The question is about the remaining 10% + 10% of the data: how should I handle the validation set during this process? Should I use rotating mini-batches for the validation set as well?
I think this question is more or less answered here: What is the meaning of batch_size for validation?
Since you don't train the model on it, the batch size does not affect the results. In other words, since you don't apply mini-batch gradient descent while validating your model on the validation set, it does not really matter. It may have an impact memory-wise, though.
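A small illustration of that point with Keras (the model and data are purely made up; the claim is only that evaluation metrics do not depend on the batch size, while memory use does):

    import numpy as np
    from tensorflow import keras

    # Toy model and validation data, just to illustrate the point.
    X_val = np.random.rand(1000, 20).astype("float32")
    y_val = np.random.randint(0, 2, size=1000)

    model = keras.Sequential([
        keras.Input(shape=(20,)),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", metrics=["accuracy"])

    # No gradient updates happen during evaluation, so the batch size only
    # changes how many samples go through the network at once (memory),
    # not the reported loss or accuracy (up to floating-point rounding).
    print(model.evaluate(X_val, y_val, batch_size=32, verbose=0))
    print(model.evaluate(X_val, y_val, batch_size=1000, verbose=0))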
I have a model that I trained for 100 epochs:
(plot: model after 100 epochs)
I then saved the model and trained it for another 100 epochs (200 epochs in total):
(plot: model after the additional 100 epochs, 200 in total)
My question is: is my model not overfitting? Is it optimal?
Overfitting is when a model captures patterns that won't recur in the future. This leads to a decrease in prediction accuracy.
You need to test your model on data that has not been seen in training or validation to determine if it is overfitting or not.
Overfitting is when your model scores very highly on your training set and poorly on a validation or test set (or on real-life post-training predictions).
When you are training your model, make sure that you have split your training dataset into two subsets: one for training and one for validation. If you see that your validation accuracy decreases as training goes on, it means that your CNN has "overfitted" to the training set specifically and will not generalize well.
There are many ways to combat overfitting that should be used while training your model. Gathering more data and using aggressive dropout are popular ways to ensure that a model is not overfitting. Check out this article for a good description of your problem and possible solutions.
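A hedged Keras sketch of that advice (the architecture and data are hypothetical; the point is the validation split, the Dropout layers, and stopping when validation accuracy stops improving):

    import numpy as np
    from tensorflow import keras

    # Placeholder training data; substitute your own dataset.
    X_train = np.random.rand(500, 32).astype("float32")
    y_train = np.random.randint(0, 2, size=500)

    model = keras.Sequential([
        keras.Input(shape=(32,)),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(64, activation="relu"),
        keras.layers.Dropout(0.5),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

    # Stop when validation accuracy has not improved for 10 epochs
    # and keep the weights from the best epoch.
    early_stop = keras.callbacks.EarlyStopping(
        monitor="val_accuracy", patience=10, restore_best_weights=True
    )

    history = model.fit(
        X_train, y_train,
        validation_split=0.2,  # hold out 20% of the training data for validation
        epochs=200,
        callbacks=[early_stop],
        verbose=0,
    )
    print("stopped after", len(history.history["loss"]), "epochs")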
I'm trying to understand how GridSearchCV's logic works. I looked here, at the official documentation, and at the source code, but I couldn't figure out the following:
What is the general logic behind GridSearchCV?
Clarifications:
If I use the default cv = 5, what are the % splits of the input data into train, validation, and test?
How often does GridSearchCV perform such a split, and how does it decide which observations belong to train / validation / test?
Since cross-validation is being done, where does any averaging come into play for the hyperparameter tuning? That is, is the optimal hyperparameter value the one that optimizes some sort of average?
This question here shares my concern, but I don't know how up-to-date the information is and I am not sure I understand all the information there. For example, according to the OP, my understanding is that:
The test set is 25% of the input data set and is created once.
The union of the train set and validation set is correspondingly created once and this union is 75% of the original data.
Then, the procedure creates 5 (because cv = 5) further splits of this 75% into 60% train and 15% validation
The optimized hyperparameter value is the one that optimizes the average of some metric over these 5 splits.
Is this understanding correct and still applicable now? And how does the procedure do the original 25%-75% split?
First you split your data into train and test. The test set is left out for evaluation after training and optimization of the model. GridSearchCV takes the remaining 75% of your data and splits it into 5 slices. First it trains on 4 slices and validates on 1; then it takes another combination of 4 slices, bringing the previously held-out slice back into training, and validates on a new slice, and so on, 5 times in total.
Then the performance of each run can be seen, plus their average, to understand how your model behaves overall.
Since you are doing a grid search, the best_params_ will be saved at the end of your modeling to predict your test set.
So to summarize, the best parameters will be chosen and used for your model after the whole training; therefore, you can easily use them to predict(X_test).
Read more here.
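A hedged sketch of that workflow with scikit-learn (the estimator and parameter grid are only examples; the 75/25 split and cv=5 mirror the numbers discussed above):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.neighbors import KNeighborsClassifier

    # Example data; replace with your own X and y.
    X, y = make_classification(n_samples=400, n_features=10, random_state=0)

    # The initial train/test split is done by you, not by GridSearchCV.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    param_grid = {"n_neighbors": [3, 5, 7, 9]}

    # cv=5: the 75% training portion is split into 5 folds; every candidate is
    # trained on 4 folds and validated on the remaining one, 5 times, and the
    # mean validation score decides the best parameters.
    grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
    grid.fit(X_train, y_train)

    print("best parameters:", grid.best_params_)
    print("mean CV score of the best candidate:", grid.best_score_)

    # With refit=True (the default) the best estimator is refit on all of
    # X_train, so the fitted grid can predict the held-out test set directly.
    print("test accuracy:", grid.score(X_test, y_test))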
Usually, if you don't perform CV, the model will try to optimize its weights with preset parameters, and the left-out test set will help assess the model's performance. However, for real model training it is very important to re-split the training data into train and validation, where you use the validation set to tune the hyperparameters of the model (manually). That said, over-tuning the model to get the best performance on the validation set is cheating.
Theoretical K-Folds
More details
I have written an ML-based intrusion predictor. In the learning process I used training and test data, both labeled, to evaluate the accuracy and generate confusion matrices. I got good accuracy and now I want to test it with new (unlabeled) data. How do I do that?
Okay, so say you do test on unlabeled data and your algorithm predicts some output X. How can you check the accuracy? How can you check whether it is correct or not? That is the only thing that matters in prediction: how your program works on data it has not seen before.
The short answer is, you can't. You need to split your data into:
Training 70%
Validation 10%
Test 20%
All of these should be labeled, and accuracy, confusion matrix, F-measure and anything else should be computed on the labeled test data that your program has not seen before. You train on the training data, and every once in a while you check the performance on the validation data to see if it is doing well or if you need to make adjustments. At the very end you check on the test data. This is supervised learning; you always need labeled data.
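A hedged sketch of that split and evaluation with scikit-learn (the data and classifier are stand-ins; only the 70/10/20 split and the metrics mirror the answer):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

    # Example labeled data; replace with your own features X and labels y.
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    # 70% train, 10% validation, 20% test: first split off the 20% test set,
    # then take 1/8 of the remaining 80% as validation, which is 10% overall.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=0)

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)

    # Use the validation set while developing; report the test set only once, at the end.
    print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))

    y_pred = clf.predict(X_test)
    print("test accuracy:", accuracy_score(y_test, y_pred))
    print("test F1:", f1_score(y_test, y_pred))
    print("confusion matrix:\n", confusion_matrix(y_test, y_pred))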
I am just getting started with machine learning and am exploring different algorithms. I took a binary classification problem from the internet and tried applying various machine learning techniques.
First I ran a naive Bayes classifier on it and found a success rate of about 75%. I then tried logistic regression and found a staggering success rate of 90%. I tried applying regularisation to my classifier, and here is the curve I found when I varied lambda (the regularisation parameter) over 30 values. The red plot is the training set and the blue one is the validation set. As you can see, the error in both curves increases with lambda. I think this suggests that my hypothesis is underfit to begin with and that the underfitting gets worse as lambda increases. Is this the correct way to interpret it?
Either way, to tackle the underfitting it would make sense to try a more complicated model, so I turned to a neural network. My problem has 31 features, and I chose a network with two hidden layers of 10 nodes each.
After training I found that it classifies only 65% of the training data correctly. That is worse than both the naive Bayes classifier and the logistic regression. How often does this happen? Is it more likely that there is something wrong with my implementation of the neural network?
It is also interesting to note that the neural network seems to converge after just 25-30 iterations, whereas my logistic regression took 300 iterations to converge. I did consider the possibility that the neural network might be getting stuck in a local minimum, but according to Andrew Ng's excellent machine learning course, which I am following, that is rather unlikely.
From what the course explained, a neural network generally gives better predictions than logistic regression, but you may run into problems with overfitting. However, I don't think that is the problem here, since the 65% success rate is on the training set.
Do I need to go over my neural network implementation, or is this something that can happen?
First, please try larger hidden layers, such as 200 nodes each. Then update your result so we can see what the critical problem is.
When you use a neural network to classify your data, it actually fits a vector space that is suitable for the task. In this case, since your data has 31 dimensions, at least a 32-dimensional space can perfectly classify your data as long as no sample appears in both the positive and the negative class. So if you get bad performance on the training set, just enlarge your neural network until you get a 100% result on the training set; then you can start to think about the generalization problem.
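A quick way to try that suggestion with scikit-learn (the data here is synthetic with 31 features to match the question; the layer sizes follow the advice above and the rest is illustrative):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-in for the 31-feature binary classification problem.
    X, y = make_classification(n_samples=2000, n_features=31, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

    # Wider hidden layers (200 units each) as suggested above; scaling the
    # inputs also tends to matter a lot for neural networks.
    clf = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(200, 200), max_iter=1000, random_state=0),
    )
    clf.fit(X_train, y_train)

    print("training accuracy:  ", clf.score(X_train, y_train))
    print("validation accuracy:", clf.score(X_val, y_val))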