Extract information from each iteration of GridSearchCV to compute confusion matrix - python

I am working on a machine learning project using sklearn's GridSearchCV. From the output of the grid search I need to find all parameter settings whose positive predictive value (PPV) is greater than 0.95 and compute the confusion matrix for each of them. I have implemented a custom score function for the grid that computes the PPV, and inside that score function I compute the confusion matrix and write it to a text file. However, because of the parallel execution, I can't keep track of which matrix belongs to which parameter setting.
Is there a way that I can accomplish this?
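One possible workaround (a sketch, not the only way): instead of writing matrices to a file inside the scorer, expose each confusion-matrix cell as its own scorer via multi-metric scoring. GridSearchCV then stores the per-fold means in cv_results_, keyed by parameter setting, so the matrices can be reconstructed afterwards regardless of parallel execution. The SVC estimator, the small parameter grid, and X, y below are placeholders for illustration.

import pandas as pd
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# One scorer per confusion-matrix cell, plus the PPV itself.
def tn(y, y_pred): return confusion_matrix(y, y_pred)[0, 0]
def fp(y, y_pred): return confusion_matrix(y, y_pred)[0, 1]
def fn(y, y_pred): return confusion_matrix(y, y_pred)[1, 0]
def tp(y, y_pred): return confusion_matrix(y, y_pred)[1, 1]
def ppv(y, y_pred):
    cm = confusion_matrix(y, y_pred)
    return cm[1, 1] / (cm[1, 1] + cm[0, 1]) if (cm[1, 1] + cm[0, 1]) else 0.0

scoring = {'tn': make_scorer(tn), 'fp': make_scorer(fp),
           'fn': make_scorer(fn), 'tp': make_scorer(tp),
           'ppv': make_scorer(ppv)}

grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}, scoring=scoring,
                    refit='ppv', cv=5, n_jobs=-1)
grid.fit(X, y)  # X, y: your training data (assumed already loaded)

# Keep only parameter settings with mean PPV > 0.95 and read off their matrices.
results = pd.DataFrame(grid.cv_results_)
good = results[results['mean_test_ppv'] > 0.95]
for _, row in good.iterrows():
    print(row['params'],
          [[row['mean_test_tn'], row['mean_test_fp']],
           [row['mean_test_fn'], row['mean_test_tp']]])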

Related

How to get Confusion Matrix or True/False Positive/Negative labels from Google VertexAI's aiplatform library?

I'm trying to retrieve metrics and other model evaluations from Google Cloud's VertexAI via their Python library; specifically, I'm having trouble accessing a confusion matrix for a text classification model, or at least the counts of true positives/false positives/false negatives/true negatives, so I can build a confusion matrix myself.
I've been able to extract log loss, precision, recall, F1, AUPRC and confidence thresholds via .get_model_evaluation() in aiplatform or .GetModelEvaluationRequest() in aiplatform_v1. I can also get the same set of metrics at the label level by using the corresponding evaluation slice methods.
However, I can't find anywhere in their documentation how to generate either a confusion matrix or the underlying counts used to compute the metrics above. The evaluation UI lists items by true/false positive/negative, but I can't figure out a way to retrieve those values. A confusion matrix is also listed in their evaluation guide for text classification and in the ClassificationEvaluationMetrics schema YAML in GCS that the guide links to, but I can't find it on the model evaluation object. Is it possible to pull a confusion matrix and/or the labels used in training/evaluation using Python?
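One thing worth trying (a sketch under assumptions: the project/location/model IDs are placeholders, and whether a 'confusionMatrix' entry is actually populated depends on the model type and SDK version) is to dump the keys of the evaluation's metrics object and look for the field named in the ClassificationEvaluationMetrics schema.

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")
model = aiplatform.Model("projects/my-project/locations/us-central1/models/123")

evaluation = model.get_model_evaluation()   # first evaluation by default
metrics = evaluation.metrics                # behaves like a dict of the schema fields (assumed)
print(sorted(metrics.keys()))               # check whether 'confusionMatrix' is present

cm = metrics.get("confusionMatrix")
if cm:
    labels = [spec["displayName"] for spec in cm["annotationSpecs"]]
    rows = [list(row) for row in cm["rows"]]  # rows[i][j]: items with true label i predicted as j
    print(labels)
    print(rows)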

What is the difference between a linear regression classifier and linear regression for extracting a confidence interval?

I am a beginner with machine learning. I want to use time series linear regression to extract a confidence interval for my dataset. I don't need to use the linear regression as a classifier. First, what is the difference between the two cases? Second, is there a different way to implement each of them in Python?
The main difference is that a classifier computes a probability for a label, while a regression computes a quantitative output.
Generally, a classifier is used to compute the probability of a label, and a regression is used to compute a quantity. For instance, if you want to estimate the price of a flat from some criteria you will use a regression; if you want to assign a label (luxurious, modest, ...) to the same flat from the same criteria you will use a classifier.
But using a regression to compute a threshold that separates the observed labels is a common technique too. That is the case of the linear SVM, which computes a boundary between labels, called the decision boundary. Warning: the main drawback with a linear model is precisely that it is linear, meaning the boundary will necessarily be a straight line (hyperplane) separating the labels. Sometimes that is good enough, sometimes it is not.
Logistic regression is an exception because it actually computes a probability; its name is misleading.
For regression, when you want a quantitative output, you can use a confidence interval to get an idea of the error. In classification there is no confidence interval; even with a linear SVM it would be nonsensical. You can use the decision function, but it is difficult to interpret in practice, or use the predicted probabilities, count how often the label is wrong, and compute an error rate. There are plenty of such ratios depending on your problem; it is bluntly the subject of a whole book, actually.
Anyway, if you are modelling a time series, as far as I know your goal is a quantitative output, so you do not need a classifier, as you said. As for extracting the interval, it depends entirely on the object you use to compute it in Python, i.e. on the attributes that object exposes, and therefore on the library too. So it would be much easier to answer if you indicated which libraries and objects you are using.
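For the quantitative case, one common way to get confidence intervals around a linear fit of a time series is ordinary least squares in statsmodels. A minimal sketch, assuming toy data in arrays t and y (replace them with your own series):

import numpy as np
import statsmodels.api as sm

# Assumed toy data: y is a noisy linear trend over the time index t.
t = np.arange(100)
y = 2.0 * t + np.random.normal(scale=5.0, size=t.shape)

X = sm.add_constant(t)              # intercept + time regressor
model = sm.OLS(y, X).fit()

print(model.conf_int(alpha=0.05))   # 95% CIs for intercept and slope
pred = model.get_prediction(X)
print(pred.conf_int(alpha=0.05)[:5])  # 95% CI for the fitted mean at each time point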

What are the initial estimates taken in Logistic regression in Scikit-learn for the first iteration?

I am trying out logistic regression from scratch in Python (finding probability estimates, the cost function, and applying gradient descent to maximize the likelihood). But I am confused about which estimates to take for the first iteration. I took all the estimates as 0 (including the intercept), but the results differ from those we get in scikit-learn. I want to know which initial estimates scikit-learn uses for logistic regression.
First of all, scikit-learn's LogisticRegression uses regularization, so unless you apply that too it is unlikely you will get exactly the same estimates. If you really want to test your method against scikit-learn's, it is better to use their gradient descent implementation of logistic regression, which is SGDClassifier. Make sure you set loss='log' (called 'log_loss' in recent versions) for logistic regression and alpha=0 to remove regularization, but you will still need to adjust the number of iterations and the learning rate (eta0), as their implementation is likely to differ slightly from yours.
To answer the specific question about the initial estimates: I don't think it matters, but most commonly you set everything to 0 (including the intercept), and it should converge just fine.
Also bear in mind that gradient descent (GD) models can be hard to tune, and you may need to apply some scaling (like StandardScaler) to your data beforehand, as very large feature values are likely to destabilize the gradient steps. Scikit-learn's implementation adjusts for that.
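To set up the comparison described above, something along these lines should work (a sketch; X and y are assumed to be your data, and the learning rate is an arbitrary choice you will need to tune):

from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Scale first, as suggested above, so the gradient steps behave.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # X, y: your data (assumed already loaded)

# Plain (unregularized) logistic regression fitted by SGD.
clf = SGDClassifier(loss='log_loss',    # 'log' in older scikit-learn releases
                    alpha=0.0,          # no regularization
                    learning_rate='constant', eta0=0.01,
                    max_iter=1000, tol=1e-4, random_state=0)
clf.fit(X_scaled, y)

print(clf.intercept_, clf.coef_)   # compare against your from-scratch estimates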

SGDClassifier with constraints

I am trying to do logistic regression on a huge data set using scikit-learn's SGDClassifier (I am using partial_fit, to be precise). The coefficients I obtain have mixed signs, whereas I would like to force the classifier to use only positive values (I know it may not be the best approach methodologically, but it is what would be OK for now).
My question is:
Is there any way to impose constraints on coefficients using SGDClassifier?
Thanks for your time
This is not possible with SGDClassifier in its current implementation.
If you wanted to implement this, you would have to add a penalty, call it e.g. 'positivity', that enforces the constraint by placing infinite cost on negative values.
It may be possible to implement this following e.g. Duchi 2009 (though I think there are follow-ups in newer literature that could be more up to the job). What you need to do at every mini-batch is project onto the positive orthant: simply set to 0 any negative coefficient values that occur after a gradient step on the logistic loss.
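A minimal sketch of that projection idea on top of partial_fit (not a built-in feature; "batches" is an assumed mini-batch iterator over your data, and only the coefficients, not the intercept, are clipped):

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(loss='log_loss', random_state=0)  # 'log' in older versions

classes = np.array([0, 1])
for X_batch, y_batch in batches:            # batches: your mini-batch iterator (assumed)
    clf.partial_fit(X_batch, y_batch, classes=classes)
    # Project onto the positive orthant: zero out any negative coefficients.
    np.maximum(clf.coef_, 0.0, out=clf.coef_)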

k-fold Cross Validation for determining k in k-means?

In a document clustering process, as a data pre-processing step, I first applied singular value decomposition to obtain U, S and Vt, and then, by choosing a suitable number of singular values, I truncated Vt, which now gives me a good document-document correlation from what I read here. I am now clustering the columns of the matrix Vt to group similar documents together; for this I chose k-means, and the initial results looked acceptable to me (with k = 10 clusters), but I wanted to dig a bit deeper into choosing the value of k itself. To determine the number of clusters k in k-means, I was advised to look at cross-validation.
Before implementing it I wanted to figure out whether there is a built-in way to achieve it using numpy or scipy. Currently, the way I am performing k-means is simply to use the function from scipy.
import numpy as np
from numpy.linalg import svd
from scipy.cluster.vq import whiten, kmeans2

# Preprocess the data and compute the SVD
U, S, Vt = svd(A)  # A is the TF-IDF representation of the original term-document matrix
# Obtain the document-document correlations from Vt.
# The 50 here is the rank threshold chosen after examining a scree plot of S.
docvectors = np.transpose(Vt[0:50, :])
# Prepare the data and run k-means
whitened = whiten(docvectors)
res, idx = kmeans2(whitened, 10, iter=20)
Assuming my methodology is correct so far (please correct me if I am missing some step), at this stage, what is the standard way of using the output to perform cross-validation? Any reference/implementations/suggestions on how this would be applied to k-means would be greatly appreciated.
To run k-fold cross validation, you'd need some measure of quality to optimize for. This could be either a classification measure such as accuracy or F1, or a specialized one such as the V-measure.
Even the clustering quality measures that I know of need a labeled dataset ("ground truth") to work; the difference with classification is that you only need part of your data to be labeled for the evaluation, while the k-means algorithm can make use of all the data to determine the centroids and thus the clusters.
V-measure and several other scores are implemented in scikit-learn, as well as generic cross validation code and a "grid search" module that optimizes according to a specified measure of evaluation using k-fold CV. Disclaimer: I'm involved in scikit-learn development, though I didn't write any of the code mentioned.
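For reference, the V-measure is a one-liner in scikit-learn once you have (partial) ground-truth labels. A sketch, assuming whitened is the matrix from the question and labels_true / labeled_idx are the ground-truth labels and the row indices they correspond to:

from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

# whitened: matrix from the question; labels_true, labeled_idx: assumed ground truth.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(whitened)
print(v_measure_score(labels_true, km.labels_[labeled_idx]))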
Indeed, to do traditional cross validation with F1-score or V-measure as the scoring function you would need some labeled data as ground truth. But in that case you could just count the number of classes in the ground-truth dataset and use that as your optimal value for k, hence no need for cross-validation.
Alternatively you could use a cluster stability measure as an unsupervised performance evaluation and do some kind of cross validation procedure for that. However this is not yet implemented in scikit-learn, even though it's still on my personal todo list.
You can find additional info on this approach in the following answer on metaoptimize.com/qa. In particular you should read Clustering Stability: An Overview by Ulrike von Luxburg.
The page linked below uses withinss to find an optimal number of clusters. withinss is an attribute of the kmeans object returned in R, and it can be used to look for a minimum "error":
https://www.statmethods.net/advstats/cluster.html
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
This formula isn't exactly what you need, but I'm working on a variant myself. The model will still change from run to run, but it would at least be the best model out of a bunch of iterations.
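A rough Python equivalent of that R snippet, using scikit-learn's inertia_ (the within-cluster sum of squares) to draw the same elbow plot (a sketch; whitened is the matrix from the question):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 16)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(whitened).inertia_
       for k in ks]

plt.plot(ks, wss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Within-cluster sum of squares')
plt.show()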
