GridSearchCV: internal logic [closed] - python

I'm trying to understand how GridSearchCV's logic works. I looked here, at the official documentation, and at the source code, but I couldn't figure out the following:
What is the general logic behind GridSearchCV?
Clarifications:
If I use the default cv = 5, what are the percentage splits of the input data into train, validation, and test?
How often does GridSearchCV perform such a split, and how does it decide which observations belong to train / validation / test?
Since cross-validation is being done, where does any averaging come into play for the hyperparameter tuning? I.e., is the optimal hyperparameter value one that optimizes some sort of average?
This question here shares my concern, but I don't know how up-to-date the information is and I am not sure I understand all the information there. For example, according to the OP, my understanding is that:
The test set is 25% of the input data set and is created once.
The union of the train set and validation set is correspondingly created once and this union is 75% of the original data.
Then, the procedure creates 5 (because cv = 5) further splits of this 75% into 60% train and 15% validation
The optimized hyperparameter value is one that optimizes the average of some metric over these 5 splits.
Is this understanding correct and still applicable now? And how does the procedure do the original 25%-75% split?

First you split your data into train and test. The test set is left out until after training and optimization of the model. GridSearchCV takes the 75% of your data (the training portion) and splits it into 5 slices. First it trains on 4 slices and validates on the remaining one; then it trains on another 4 slices, bringing the previously held-out slice back into training and validating on a new slice; and so on, 5 times in total.
Then the performance of each run can be inspected, along with their average, to understand how your model behaves overall.
Since you are doing a grid search, the best parameters (best_params_) will be saved at the end of your modeling to predict your test set.
So to summarize, the best parameters will be chosen and used for your model after the whole training; therefore, you can easily use them to predict(X_test).
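As a rough sketch of that flow (the estimator, the parameter grid, and the synthetic data below are assumptions for illustration, not something prescribed by GridSearchCV itself):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# Hold out 25% as the test set; GridSearchCV never sees it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# cv=5: the training portion is split into 5 folds, and each candidate parameter
# combination is scored by the mean of its scores over the 5 validation folds.
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)           # parameters with the best mean cross-validation score
print(grid.score(X_test, y_test))  # model refit on all of X_train, evaluated once on the test set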
Read more here.
Usually, if you don't perform CV, the model will try to optimize its weights with preset parameters, and the left-out test set will help to assess the model's performance. However, for real model training, it is highly important to re-split the training data into train and validation sets, where you use the validation set to tune the hyperparameters of the model (manually). However, over-tuning the model to get the best performance on the validation set is cheating.
Theoretical K-Folds
More details

Related

Training a model in machine learning and predicting some values [closed]

I am fairly new to machine learning. I have a biological dataset like this:
index  position  B  y
1      1001      0  0.567
2      1010      0  0.682
3      1012      1  0.346
4      1016      1
5      1020      1  0.875
6      1040      0
7      1044      0  1.00
8      1047      1  0.101
9      1056      0  0.00
I am trying to predict the y values that are missing. I am using KNN regression for this. I have read that train-test-validation splitting is better than train-test splitting followed by cross-validation. I have some questions:
When I train my model, should I exclude the rows where y is not known?
How do I work with the validation and test sets?
After I have trained my model, should I take only the rows where y is not known and predict the values?
Are the accuracy and error rate on the test dataset the accuracy of my model?
I have a lot of questions. I wanted to figure these out by watching tutorials but couldn't get a complete picture. Any help will be appreciated. Thank you.
I will try to answer your questions one by one.
No, when you train a model, you should tell it which target value is right for a fixed set of features (the columns in your dataset). So you always use the target value y in the training phase.
Using cross-validation means you do not waste your test set (remember, you can use it only once!) and you improve your training by avoiding overfitting.
You should test your trained model with the test set. So you actually predict on your test set (which does not have target values) and you look at the results.
The test set, if used once, provides you with an accuracy for the model. Mind that there are cases in which accuracy is pointless (*).
In the end, the process for training a model is more or less this:
Split your dataset into an 80% training set and a 20% test set (or 70-30, depending on how much data you have).
From the training set, build a training set and a validation set using this (this is just a tip). (Validation set of 10-15%.)
Train your model with the training set and do validation (very important!) with the validation set.
Discard (and save somewhere else) the y column of the test set and use your trained model to predict on the test set.
With the values it produces, compute metrics (such as MSE) between these values and the ones you saved in the previous step, and see how good your model is.
Remember that this is a guideline; in practice it is much more complex. You will see this the more you get into the subject.
(*) For example, if you train your model with samples that are almost all the same, and the test set is composed of 99% samples that are similar and 1% that are different, you will get an accuracy of 99%. That's a lot, but it's useless, since the model can only predict one class. So, of course, mind each step depending on your case.
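To make that workflow concrete for the dataset in the question, here is a minimal sketch (the file name, the feature columns, KNeighborsRegressor, and the 80/20 split are assumptions for illustration only): rows with a known y are used for training and evaluation, and rows with a missing y are predicted at the end.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

df = pd.read_csv("data.csv")  # hypothetical file with columns: position, B, y

# Only rows with a known y can be used for training and evaluation.
known = df[df["y"].notna()]
unknown = df[df["y"].isna()]

X = known[["position", "B"]]
y = known["y"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X_train, y_train)

# Error on the held-out labelled rows gives an estimate of model quality.
print(mean_squared_error(y_test, model.predict(X_test)))

# Finally, predict y for the rows where it is missing.
print(model.predict(unknown[["position", "B"]]))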

How to fit data into a machine learning model in parts? [closed]

I am working on a text classification problem. I have a huge amount of data, and when I try to fit the data into the machine learning model it causes a memory error. Is there any way I can fit the data in parts to avoid the memory error?
Additional information
I am using the LinearSVC model.
I have training data of 1.1 million rows.
I have vectorized the text data using TF-IDF.
The shape of the vectorized data is (1121063, 4235687), which has to be fitted into the model.
Or is there any other way out of this problem?
Unfortunately, I don't have any reproducible code for the same.
Thanks in advance.
The simple answer is not to use what I assume is the scikit-learn implementation of LinearSVC, and instead use some algorithm/implementation that allows training in batches. The most common of these are neural networks, but several other algorithms exist. In scikit-learn, look for classifiers with the partial_fit method, which will allow you to fit your classifier in batches. See e.g. this list.
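As a rough sketch of what batch-wise training with partial_fit can look like (SGDClassifier, the toy corpus, and the batch size are assumptions for illustration):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

texts = ["spam spam offer", "meeting at noon", "win a prize now", "project status update"] * 1000
labels = np.array([1, 0, 1, 0] * 1000)

# Vectorize once; the result is a sparse matrix, so it is usually the model fitting that runs out of memory.
X = TfidfVectorizer().fit_transform(texts)

clf = SGDClassifier(loss="hinge")  # hinge loss gives a linear SVM, similar in spirit to LinearSVC
classes = np.unique(labels)        # partial_fit must be told all classes on the first call

batch_size = 500
for start in range(0, X.shape[0], batch_size):
    clf.partial_fit(X[start:start + batch_size],
                    labels[start:start + batch_size],
                    classes=classes)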
You could also try what's suggested in the docstring of sklearn.svm.SVC (the second part; the first part is using LinearSVC, which you already did):
For large datasets consider using :class:`sklearn.svm.LinearSVC` or :class:`sklearn.linear_model.SGDClassifier` instead, possibly after a :class:`sklearn.kernel_approximation.Nystroem` transformer.
If you check SGDClassifier(), you can set the parameter warm_start=True so that when you iterate through your dataset it won't lose its state:
clf = SGDClassifier(warm_start=True)
for X_batch, y_batch in batches:  # batches: your data split into chunks
    clf.fit(X_batch, y_batch)     # with warm_start=True, each fit starts from the previous solution
Additionally, you could reduce the dimension of your dataset by removing some words from your TF-IDF model. Check the max_df and min_df parameters: they remove words whose document frequency is higher or lower than a threshold, which can be given as a proportion or as an absolute count.
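For instance, a minimal sketch of pruning the vocabulary this way (the corpus and the thresholds are arbitrary placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["some example document", "another example document", "a third short text"]  # placeholder corpus

# Drop terms that appear in more than 95% of documents (too common)
# or in fewer than 2 documents (too rare); this can shrink the feature matrix considerably.
vectorizer = TfidfVectorizer(max_df=0.95, min_df=2)
X = vectorizer.fit_transform(texts)
print(X.shape, vectorizer.get_feature_names_out())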

How to predict unlabeled test data using a trained machine learning model? [closed]

I have written an ML-based intrusion predictor. In the learning process, I used labeled training and test data to evaluate the accuracy and generate confusion matrices. I got good accuracy, and now I want to test it with new (unlabeled) data. How do I do that?
Okay, so say you do test on unlabeled data and your algorithm predicts some output X. How can you check the accuracy? How can you check whether it is correct or not? This is the only thing that matters in prediction: how your program works on data it has not seen before.
The short answer is: you can't. You need to split your data into:
Training 70%
Validation 10%
Test 20%
All of these should be labeled, and the accuracy, confusion matrix, F-measure, and anything else should be computed on the labeled test data that your program has not seen before. You train on the training data, and every once in a while you check the performance on the validation data to see if it is doing well or if you need to make adjustments. At the very end, you check on the test data. This is supervised learning; you always need labeled data.
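A minimal sketch of that 70/10/20 split and of the final prediction step (train_test_split, RandomForestClassifier, and the synthetic data are assumptions for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# 70% train, 10% validation, 20% test -- all of it labeled.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=2/3, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(accuracy_score(y_val, clf.predict(X_val)))      # checked repeatedly while tuning
print(confusion_matrix(y_test, clf.predict(X_test)))  # computed once, at the very end

# On genuinely new, unlabeled data you can only predict; accuracy cannot be measured.
new_unlabeled = X_test[:5]  # stand-in for new samples without labels
print(clf.predict(new_unlabeled))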

What does the "fit" method in scikit-learn do? [closed]

Could you please explain what the "fit" method in scikit-learn does? Why is it useful?
In a nutshell: fitting is equal to training. Then, after it is trained, the model can be used to make predictions, usually with a .predict() method call.
To elaborate: Fitting your model to (i.e. using the .fit() method on) the training data is essentially the training part of the modeling process. It finds the coefficients for the equation specified via the algorithm being used (take for example umutto's linear regression example, above).
Then, for a classifier, you can classify incoming data points (from a test set, or otherwise) using the predict method. Or, in the case of regression, your model will interpolate/extrapolate when predict is used on incoming data points.
It should also be noted that sometimes the "fit" nomenclature is used for non-machine-learning methods, such as scalers and other preprocessing steps. In that case, you are merely "applying" the specified function to your data, as with a min-max scaler, TF-IDF, or another transformation.
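A small sketch of both uses of fit described above (LinearRegression and MinMaxScaler are just illustrative choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# fit = train: learn the coefficients of the model from the training data.
model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # roughly [2.0] and 0.0
print(model.predict([[5.0]]))         # predict on an incoming data point

# fit on a preprocessing step: learn the per-column min and max ...
scaler = MinMaxScaler().fit(X)
# ... and transform applies the learned scaling to any data.
print(scaler.transform(X))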
Note: here are a couple of references...
fit method in python sklearn
http://scikit-learn.org/stable/tutorial/basic/tutorial.html

Outlier-Detection in scikit-learn using Transformers in a pipeline [closed]

I'm wondering if it is possible to include scikit-learn outlier detections like isolation forests in scikit-learn's pipelines?
So the problem here is that we want to fit such an object only on the training data and do nothing on the test data. Particularly, one might want to use cross-validation here.
What could a solution look like?
Build a class that inherits from TransformerMixin (and BaseEstimator for parameter tuning).
Now define a fit_transform function that stores whether the function has already been called or not. If it hasn't been called yet, the function fits the outlier detector on the data and predicts which samples are outliers. If the function has been called before, the outlier detection has already been applied to the training data, so we assume that we are now dealing with the test data, which we simply return.
Does such an approach have a chance to work or am I missing something here?
Your problem is basically the outlier detection problem.
Fortunately, scikit-learn provides some functions to predict whether a sample in your train set is an outlier or not.
How does it work? If you look at the documentation, it basically says:
One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.
sklearn provides some functions that allow you to estimate the shape of your data. Take a look at elliptic envelope and isolation forests.
As far as I am concerned, I prefer to use the IsolationForest algorithm, which returns the anomaly score of each sample in your train set. Then you can take the flagged samples out of your training set.
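A minimal sketch of that idea (the contamination value and the synthetic data are placeholders, not recommendations):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X_train = np.vstack([rng.normal(0, 1, size=(100, 2)),    # regular samples
                     rng.uniform(-6, 6, size=(5, 2))])   # a few likely outliers
y_train = rng.randint(0, 2, size=X_train.shape[0])

# Fit only on the training data; fit_predict returns +1 for inliers and -1 for outliers.
iso = IsolationForest(contamination=0.05, random_state=0)
inlier_mask = iso.fit_predict(X_train) == 1

# Drop the flagged outliers before training the actual model.
X_train_clean, y_train_clean = X_train[inlier_mask], y_train[inlier_mask]
print(X_train.shape, "->", X_train_clean.shape)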
