I have written an ML-based intrusion prediction model. During the learning process I used labeled training and test data to evaluate the accuracy and generate confusion matrices. I got good accuracy, and now I want to test the model on new (unlabeled) data. How do I do that?
Okay, so say you do test on unlabeled data and your algorithm predicts some output X. How can you check the accuracy, how can you check whether that prediction is correct or not? That is the only thing that matters in prediction: how your program performs on data it has not seen before.
The short answer is: you can't. You need to split your labeled data into:
Training 70%
Validation 10%
Test 20%
All of these should be labeled, and accuracy, the confusion matrix, F-measure and anything else should be computed on the labeled test data that your program has not seen before. You train on the training data, and every once in a while you check the performance on the validation data to see whether the model is doing well or whether you need to make adjustments. At the very end you check on the test data. This is supervised learning: you always need labeled data.
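As a rough sketch of that workflow in scikit-learn (the RandomForestClassifier and the X, y arrays are placeholders, and the ratios just mirror the list above):

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# X, y are placeholders for your labeled features and labels.
# Hold out 20% as the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# Split the remaining 80% into 70% train / 10% validation (0.125 * 80% = 10%).
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.125, random_state=42)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, clf.predict(X_val)))

# Only at the very end, evaluate once on the untouched test set.
print("test accuracy:", accuracy_score(y_test, clf.predict(X_test)))
print(confusion_matrix(y_test, clf.predict(X_test)))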
I have a question regarding the model.fit method from the scikit-learn library, and about overfitting.
Does the generic sklearn method model.fit(X, y) return the score after fitting the model to the specified training data?
Also, is it overfitting when performance on the test data degrades as more training data is used to learn the model?
model.fit(X, y) doesn't explicitly give you the score; it returns the fitted estimator itself, so if you assign it to a variable, that variable holds the trained model with all its learned parameters. You can get the score by calling model.score(X, y).
Overfitting, in simple words, is increasing the variance of your model to the point where it fails to generalize. There are ways to reduce overfitting, such as feature engineering, normalization, regularization, and ensemble methods.
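To illustrate the first point, a small sketch (LogisticRegression and the X, y arrays are just placeholders):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = LogisticRegression()
fitted = model.fit(X_train, y_train)   # fit returns the estimator itself, not a score
print(fitted is model)                 # True

print("train score:", model.score(X_train, y_train))
print("test score:", model.score(X_test, y_test))   # a large train/test gap suggests overfitting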
I'm trying to understand how GridSearchCV's logic works. I looked at the official documentation and the source code, but I couldn't figure out the following:
What is the general logic behind GridSearchCV?
Clarifications:
If I use the default cv = 5, what are the percentage splits of the input data into train, validation, and test?
How often does GridSearchCV perform such a split, and how does it decide which observations belong to train / validation / test?
Since cross-validation is being done, where does any averaging come into play for the hyperparameter tuning? That is, is the optimal hyperparameter value the one that optimizes some sort of average?
This question here shares my concern, but I don't know how up to date the information is, and I am not sure I understand all of it. For example, according to the OP, my understanding is that:
The test set is 25% of the input data set and is created once.
The union of the train set and validation set is correspondingly created once and this union is 75% of the original data.
Then, the procedure creates 5 (because cv = 5) further splits of this 75%, each time into 60% train and 15% validation (as percentages of the original data).
The optimal hyperparameter value is the one that optimizes the average of some metric over these 5 splits.
Is this understanding correct and still applicable now? And how does the procedure do the original 25%-75% split?
First you split your data into train and test sets. The test set is left out for evaluating and assessing the model after training. GridSearchCV then takes the 75% training portion and splits it into 5 folds. It trains on 4 folds and validates on the remaining one, then trains on a different 4 folds and validates on a fold that was previously used for training, and so on, 5 times in total, so that every fold serves as the validation set exactly once.
The performance of each run can then be inspected, along with their average, to understand how your model behaves overall.
Since you are doing a grid search, the best_params_ found at the end of the search are saved and used to predict on your test set.
So to summarize, the best parameters are chosen and used for your final model after the whole search, so you can simply call predict(X_test).
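A minimal sketch of that flow, assuming an SVC estimator and an arbitrary parameter grid for illustration:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Hold out 25% as the final test set; GridSearchCV never sees it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV on the 75% training portion
search.fit(X_train, y_train)

print(search.best_params_)                      # parameters with the best mean CV score
print(search.cv_results_["mean_test_score"])    # average validation score per parameter combination
y_pred = search.predict(X_test)                 # uses the model refit with best_params_

By default refit=True, so after the search GridSearchCV retrains on the whole training portion with the best parameters, which is why predict(X_test) works directly on the search object.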
Usually, if you don't perform CV, the model optimizes its weights with preset hyperparameters, and the left-out test set helps to assess the model's performance. However, for real model training it is very important to re-split the training data into train and validation sets, where you use the validation set to tune the hyperparameters of the model (manually). Over-tuning the model to get the best possible performance on the validation set is cheating, though: you end up overfitting the validation set.
I'm trying to build a classification model and my target is not binary. The correlations of my features with my target are all weak (mostly around 0.1). I have preprocessed my data and applied all of the algorithms I chose to it (SVM, kNN, naive Bayes, logistic regression, decision tree, gradient boosting, random forest). I evaluated all of the models with sklearn's metrics.accuracy_score just to see how well they perform on my data, but all of them scored around 0.1 to 0.2. The target is the productline column.
My questions
How could this happen?
How to tackle this issue?
Is there any other algorithm that could achieve a better score?
What's the accuracy if you use a dummy classifier? The accuracy of the models you have tried should be at least as high as that of the dummy classifier.
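For reference, a baseline with scikit-learn's DummyClassifier might look roughly like this (X and y stand in for your features and the productline target):

from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

baseline = DummyClassifier(strategy="most_frequent")  # always predicts the majority class
baseline.fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))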
"How could this happen?" If there's no relationship between the features and the target variable, the model isn't going to return good results.
I'm not sure about the details of your dataset, but you can try to:
1) Get more data
2) Get more features
3) Do some feature engineering
4) Clean your dataset if you haven't; there might be outliers or wrong inputs affecting your results
I am working on a text classification problem. I have a huge amount of data, and when I try to fit the data into the machine learning model it causes a memory error. Is there any way I can fit the data in parts to avoid the memory error?
Additional information
I am using the LinearSVC model.
I have training data of 1.1 million rows.
I have vectorized the text data using TF-IDF.
The shape of the vectorized data is (1121063, 4235687), which has to be fitted into the model.
Or is there any other way out of this problem?
Unfortunately, I don't have any reproducible code for the same.
Thanks in advance.
The simple answer is not to use what I assume is the scikit-learn implementation of LinearSVC, and instead use an algorithm/implementation that allows training in batches. The most common of these are neural networks, but several other algorithms exist. In scikit-learn, look for classifiers with a partial_fit method, which will allow you to fit your classifier in batches; the scikit-learn documentation has a list of them.
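A hedged sketch of what batch training with partial_fit could look like, assuming SGDClassifier with a hinge loss and a HashingVectorizer; the batches iterable and the class names are placeholders:

import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

vectorizer = HashingVectorizer(n_features=2**20)   # stateless, so it works batch by batch
clf = SGDClassifier(loss="hinge")                  # hinge loss behaves like a linear SVM
classes = np.array(["class_a", "class_b"])         # all labels must be known up front (placeholder names)

for text_batch, y_batch in batches:                # 'batches' is however you chunk your data
    X_batch = vectorizer.transform(text_batch)
    clf.partial_fit(X_batch, y_batch, classes=classes)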
You could also try what's suggested in the docstring of sklearn.svm.SVC (the second part; the first suggestion is to use LinearSVC, which you already did):
"For large datasets consider using sklearn.svm.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer."
If you check SGDClassifier, you can set the parameter warm_start=True so that it won't lose its state when you iterate through your dataset:
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(warm_start=True)
for X_batch, y_batch in batches:   # 'batches' is however you chunk your data
    clf.fit(X_batch, y_batch)      # warm_start reuses the previous coefficients as the starting point
Additionally, you could reduce the dimensionality of your dataset by removing some words from your TF-IDF model. Check the max_df and min_df parameters: they remove words whose document frequency is higher or lower than the given threshold, which can be a proportion or an absolute count.
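For example (the thresholds and the corpus variable are arbitrary placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

# Drop words that appear in more than 50% of documents or in fewer than 5 documents.
vectorizer = TfidfVectorizer(max_df=0.5, min_df=5)
X = vectorizer.fit_transform(corpus)   # 'corpus' is your list of documents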
How do you determine the optimal number of iterations when training a neural network?
One way of doing it is to split your training data into a training set and a validation set. During training, the error on the training set should decrease steadily. The error on the validation set will decrease and at some point start to increase again. At that point the network starts to overfit the training data, which means the model adapts to the random variations in the data rather than learning the true regularities. You should retain the model with the overall lowest validation error. This is called early stopping.
Alternatively, you can use Dropout. With a high enough Dropout probability, you can essentially train for as long as you want, and overfitting will not be a significant issue.
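A minimal sketch of both ideas, here expressed with Keras as one possible framework (the architecture, dropout rate, patience, and the X_train, y_train arrays are placeholders):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),                    # dropout regularization
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Stop when the validation loss stops improving and keep the best weights seen so far.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
model.fit(X_train, y_train, validation_split=0.2, epochs=200, callbacks=[early_stop])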