My dataset contains:
15000 Observations/Rows
3000 Features/Columns
Can I train a machine learning model on this dataset?
Yes, you can train an ML model on it, but before that you need to understand your problem statement and the meaning of every feature available in the dataset. If the dataset is large, try splitting it into a couple of clusters, or else work with a small sample first to analyze what your data is telling you.
This is where the ideas of population and sampling become practically useful.
You also have to check that the accuracy on the training set and the test set are comparable; if they are not, your model is memorizing instead of learning, and this is where regularization in machine learning comes into the picture.
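As a rough illustration of that last point, here is a minimal sketch, assuming scikit-learn and a small synthetic matrix standing in for the real 15000 x 3000 data, of fitting an L2-regularized model and comparing training and test accuracy:

```python
# Minimal sketch: L2-regularized logistic regression on a wide dataset.
# The synthetic matrix below is a small stand-in for the real 15000 x 3000 data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 300))        # placeholder features
y = rng.integers(0, 2, size=2000)       # placeholder binary target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# C is the inverse regularization strength; a smaller C means a stronger penalty,
# which helps when there are many features relative to the signal.
clf = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
clf.fit(X_train, y_train)

print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
# A large gap between the two scores suggests memorization rather than learning.
```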
No one can answer this based on the information you provided. The simplest approach is to run a sanity check in the form of cross validation. Does your model perform well on unseen data? If it does, it is probably not overfit. If it does not, check if the model is performing well on the training data. A model that performs well on training data but not on unseen data is the definition of a model being overfit.
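For instance, a quick sanity check with k-fold cross-validation might look like this sketch, assuming scikit-learn and synthetic data standing in for your own:

```python
# Cross-validation sanity check; X and y are synthetic stand-ins for your data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
cv_scores = cross_val_score(model, X, y, cv=5)   # accuracy on held-out folds
train_score = model.fit(X, y).score(X, y)        # accuracy on data the model has seen

print("mean CV accuracy:", cv_scores.mean())
print("training accuracy:", train_score)
# A training accuracy far above the cross-validation accuracy is the
# classic signature of an overfit model.
```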
I am working with the YOLOv3 model for an object detection task. I am using pre-trained weights that were generated for the COCO dataset; however, I have my own data for the problem I am working on. To my knowledge, using those trained weights as a starting point for my own model should not have any effect on the performance of the model once it is trained on an entirely different dataset (right?).
My question is: will the model give "honest" results if I train it multiple times and test it on the same test set each time, or would it have better performance since it has already been exposed to those test images during an earlier experiment? I've heard people say things like "the model has already seen that data", does that apply in my case?
For hyper-parameter selection, evaluation of different models, or evaluating during training, you should always use a validation set.
You are not allowed to use the test set until the end!
The whole purpose of test data is to get an estimation of the performance after the deployment. When you use it during training to train your model or evaluate your model, you expose that data. For example, based on the accuracy of the test set, you decide to increase the number of layers.
Now, your performance will be increased on the test set. However, it will come with a price!
Your estimate on the test set becomes biased, and you are no longer able to use that estimate to talk about the data your model will see after deployment.
For example, suppose you want to train an object detector for self-driving cars and you exposed the test set during training. Then you cannot use the accuracy on the test set to talk about the performance of the object detector when you put it in a car and sell it to a customer.
There is an old sentence related to this matter:
If you torture the data enough, it will confess.
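To make this concrete, a minimal sketch of keeping the test set out of the loop, assuming scikit-learn and placeholder data, could look like this:

```python
# Three-way split: the test set is carved off first and touched only once.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out the test set immediately and do not look at it again until the end.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Split the remainder into training and validation sets.
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# Fit on (X_train, y_train), choose hyperparameters with (X_val, y_val),
# and evaluate (X_test, y_test) exactly once for the final, unbiased estimate.
```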
I modelled an LSTM-based text generator using a dataset I have. The purpose of the model is to predict the ends of sentences. My training is showing a validation accuracy of around 81%. When reading through a couple of articles, I found that, unlike in a classification problem, I should worry more about loss than accuracy. Is this the case, and if so, what would be an ideal loss value? Right now my loss is around 1.5+.
There is no universal minimum accuracy threshold for machine learning or deep learning problems. As many say: garbage in, garbage out.
Good-quality data combined with a decent model will give you good accuracy.
Generally, accuracy benchmarks are set for standard public datasets such as SQuAD, RACE, SWAG, GLUE, and many more.
Usually, state-of-the-art models report their performance on these datasets and establish accuracy benchmarks specific to them.
Coming to your problem: whether the model is performing well depends on the evaluation metric you are using, and in NLP interpreting the loss is a bit tricky. In your case you are trying to predict the end of a sentence, which has no fixed length, because the same information can be expressed in multiple ways with a varying number of words.
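One common way to make the cross-entropy loss easier to interpret, assuming it is the average per-token loss in nats, is to convert it to perplexity; the number below is simply the value quoted in the question:

```python
# Convert an average per-token cross-entropy loss (in nats) to perplexity.
import math

loss = 1.5                      # value reported in the question
perplexity = math.exp(loss)     # ~4.48
print(perplexity)
# Roughly speaking, the model is about as uncertain as if it were choosing
# uniformly among 4-5 equally likely next tokens.
```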
Judging by the validation and test accuracy, your model looks decent, but before pushing the accuracy further you should also check for overfitting; the model should not be biased towards your data.
You can also try different metrics to evaluate the model and compare the results yourself.
I hope this answers your question, Happy Learning!
I'm wondering whether the result on the test set can be used to optimize the model's weights. I'm trying to build a model, but the issue is that I don't have much data, because the samples come from medical research patients. The number of patients in my case is limited (61), and I have 5 feature vectors per patient. What I tried is to train a deep learning model while excluding one subject, and I used the excluded subject as the test set. My problem is that there is large variability in the subject features, and my model fits the training set (60 subjects) well but not the one excluded subject.
So I'm wondering whether the test set (in my case the excluded subject) could be used in some way to make the model converge towards better classification of the excluded subject?
You should not use the test portion of your dataset in your training process. If your training data is not enough, one approach that is used a lot these days (especially for medical images) is data augmentation, so I highly recommend using this technique in your training process. How to use Deep Learning when you have Limited Data is a good tutorial about data augmentation.
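For example, a minimal data-augmentation sketch, assuming TensorFlow/Keras and image data (the array names and sizes below are placeholders), could look like this:

```python
# Data augmentation sketch with Keras; x_train and y_train are placeholders.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

x_train = np.random.rand(60, 128, 128, 3).astype("float32")   # fake images
y_train = np.random.randint(0, 2, size=(60,))                  # fake labels

datagen = ImageDataGenerator(
    rotation_range=15,       # small random rotations
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # random mirroring
)

# The generator yields randomly transformed copies of the training images,
# effectively enlarging the training set without ever touching the test subject.
batches = datagen.flow(x_train, y_train, batch_size=16)
x_batch, y_batch = next(batches)
print(x_batch.shape)   # (16, 128, 128, 3)
```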
No, you shouldn't use your test set for training, to prevent overfitting. Following cross-validation principles, you need to split your data into three sets: a training set that you use to train your model, a validation set to test different values of your hyperparameters, and a test set to finally evaluate your model. If you use all your data for training, your model will obviously overfit.
One thing to remember: deep learning works well when you have a large and very rich dataset.
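Since the question describes holding out one subject at a time, a grouped cross-validation sketch (assuming scikit-learn; all names and sizes are placeholders) may also help make the splitting advice concrete:

```python
# Leave-one-subject-out evaluation: rows from the same patient share a group
# label, so no subject appears in both the training and the test fold.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

n_patients, vectors_per_patient, n_features = 61, 5, 20
X = np.random.rand(n_patients * vectors_per_patient, n_features)
y = np.random.randint(0, 2, size=n_patients * vectors_per_patient)
groups = np.repeat(np.arange(n_patients), vectors_per_patient)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print("mean accuracy over held-out subjects:", scores.mean())
```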
I am just curious why I would scale the test set using the test set itself, and not using the training set, when training a model such as a CNN.
Or am I wrong, and should I still scale it using the training set's statistics?
Also, can I feed a dataset that contains both positive and negative values as the first input of the CNN?
Any answers with references will be really appreciated.
We usually have 3 types of datasets for training a model:
Training Dataset
Validation Dataset
Test Dataset
Training Dataset
This should be an evenly distributed dataset which covers all varieties of data. If you train for too many epochs, the model will get used to the training dataset and will only give proper predictions on the training dataset; this is called overfitting. The only way to keep a check on overfitting is by having other datasets which the model has never been trained on.
Validation Dataset
This can be used to fine-tune the model's hyperparameters (see the sketch after this section).
Test Dataset
This is the dataset which the model has never been trained on and which has never been part of deciding the hyperparameters; it shows how the model is really performing.
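As a rough illustration (assuming scikit-learn; the data, the model, and the candidate values are placeholders), using the validation set to pick a hyperparameter while saving the test set for the very end could look like this:

```python
# Pick a hyperparameter on the validation set; touch the test set only once.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:                      # candidate hyperparameter values
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)                 # validation set drives the choice
    if score > best_score:
        best_C, best_score = C, score

final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_trainval, y_trainval)
print("test accuracy:", final_model.score(X_test, y_test))   # reported once, at the end
```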
If scaling or normalization is used, the test set should be transformed with the same parameters that were computed on the training data.
A good answer on this topic: https://datascience.stackexchange.com/questions/27615/should-we-apply-normalization-to-test-data-as-well
Also, some models tend to require normalization and others do not.
The Neural Network architectures are normally robust and might not need normalization.
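For illustration, a minimal sketch assuming scikit-learn's StandardScaler (the arrays are placeholders):

```python
# Fit the scaler on the training data only, then reuse its parameters on the test data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(100, 5)   # placeholder training features
X_test = np.random.rand(20, 5)     # placeholder test features

scaler = StandardScaler().fit(X_train)     # mean and std learned from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # the same training-derived parameters are reused
```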
Scaling data depends upon the requirement as well as the data you have. Test data gets scaled using the test data only, because the test data doesn't have the target variable (one less column than the training data). If we scaled our training data together with new test data, our model would not be able to correlate with any target variable and would thus fail to learn. So the key difference is the existence of the target variable.
I am studying ensemble machine learning, and when reading some articles online I encountered 2 questions.
1.
In this article, it mentions
Instead, model 2 may have a better overall performance on all the data
points, but it has worse performance on the very set of points where
model 1 is better. The idea is to combine these two models where they
perform the best. This is why creating out-of-sample predictions have
a higher chance of capturing distinct regions where each model
performs the best.
But I still don't get the point: why wouldn't training on all of the training data avoid the problem?
2.
From this article, in the prediction section, it mentions
Simply, for a given input data point, all we need to do is to pass it
through the M base-learners and get M number of predictions, and send
those M predictions through the meta-learner as inputs
But in the training process we use k-fold training data to train the M base-learners, so should I also train the M base-learners on all of the training data before using them for prediction?
Assume red and blue are the best models you could find.
One works better in region 1, the other in region 2.
Now you would also train a classifier to predict which model to use, i.e., you would try to learn the two regions.
Do the validation on the outside. You can overfit if you give the two inner models access to data that the meta model does not see.
The idea in ensembles is that a group of weak predictors can outperform a strong predictor. So, if we train different models with different predictive behaviour and use majority rule for the final result of our ensemble, that result is often better than training one single model. Assume, for example, that the data consist of two distinct patterns, one linear and one quadratic. Then a single classifier may either overfit or produce inaccurate results.
You can read this tutorial to learn more about ensembles and bagging and boosting.
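As a rough sketch of the majority-rule idea (assuming scikit-learn; the base models and the synthetic data are placeholders):

```python
# Majority-vote ensemble over three different kinds of base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),      # linear decision boundary
        ("tree", DecisionTreeClassifier(max_depth=5)),  # non-linear, axis-aligned splits
        ("svm", SVC()),                                 # non-linear, kernel-based
    ],
    voting="hard",   # majority rule over the three predictions
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```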
1) "But I still cannot get the point, why not train all training data can avoid the problem?" - We will hold that data for validation purpose, just like the way we do in K-fold
2) "so should I also train M base-learner based on all train data for the input to predict?" - If you give same data to all the learners then the output of all of them would be same and there is no use in creating them. So we will give a subset of data to each learner.
For question 1, I will show why we train two separate models by considering the opposite approach.
Suppose you train one model with all the data points. During training, whenever the model sees a data point belonging to the red class, it will try to fit itself so that it classifies red points with minimal error. The same is true for data points belonging to the blue class. So during training the model keeps leaning towards one specific kind of data point (either red or blue). In the end, the model tries to fit itself so that it does not make too many mistakes on either kind of point, and the final model is an average model.
But if you instead train two models on the two different datasets, then each model is trained on a specific dataset and does not have to care about data points that belong to the other class.
It will be clearer with the following metaphor.
Suppose there are two people, each specialized in a completely different job. Now, when a job comes in, imagine telling them that both of them have to do it and that each needs to do 50% of it. Think about what kind of result you will get at the end. Now also think about what the result could be if you instead told each person to work only on the job at which they are best.
For question 2, you have to split the training dataset into M datasets, and during training give those M datasets to the M base-learners.
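As a hedged illustration of the prediction scheme quoted from the article, scikit-learn's StackingClassifier trains the meta-learner on out-of-fold predictions of the base learners; the data and models below are placeholders, and this is only one common way to implement the idea:

```python
# Stacking: base learners produce out-of-fold predictions via internal CV,
# and the meta-learner (final_estimator) is trained on those predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svm", SVC())],
    final_estimator=LogisticRegression(max_iter=1000),  # the meta-learner
    cv=5,   # 5-fold out-of-fold predictions feed the meta-learner
)
stack.fit(X_train, y_train)
print("stacked accuracy:", stack.score(X_test, y_test))
```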