I have trained a binary classification model using an AWS built-in algorithm with SageMaker and want to evaluate the model using AUC and a confusion matrix. However, I see that SageMaker's Training and Hyperparameter Tuning jobs only accept the Accuracy metric.
Is there a way in SageMaker to add a custom metric for a built-in image classification algorithm?
As I understand it, AUC, confusion matrix, precision, recall, and F1 are good metrics for a binary classifier, so why are they missing from the AWS built-in image classification algorithm?
Is there a way to batch transform my test data and compute these metrics to evaluate the model, since accuracy alone is not enough for evaluation?
SageMaker built-in algorithms cannot accept custom metrics; they work only with their built-in metrics.
A confusion matrix is not a metric, it's a visualization. Also note that the image classifier is not a binary classifier; it's a general classifier that can have a large number of labels. Regarding the other metrics, I can't speak on behalf of the AWS teams :)
Yes, using Batch Transform or real-time endpoints to create predictions for your own custom analytics is a good idea. For example, in this blog post an ephemeral endpoint is created to produce predictions and a confusion matrix for the built-in Linear Learner: https://aws.amazon.com/blogs/machine-learning/build-multiclass-classifiers-with-amazon-sagemaker-linear-learner/
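Once a Batch Transform job has written its predictions, you can compute the metrics offline with sklearn. A minimal sketch, assuming the true labels and the predicted probabilities have been downloaded into two local files (the file names and the one-value-per-line format are hypothetical):

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    # Hypothetical files: one value per line, in the same order
    # as the batch transform input.
    y_true = np.loadtxt("test_labels.csv", dtype=int)
    y_score = np.loadtxt("batch_transform_output.csv")  # predicted P(class == 1)

    # AUC is computed from the raw scores/probabilities.
    print("AUC:", roc_auc_score(y_true, y_score))

    # The confusion matrix needs hard labels, so threshold (0.5 here; tune as needed).
    y_pred = (y_score >= 0.5).astype(int)
    print(confusion_matrix(y_true, y_pred))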
In a previous project, I needed to frame an image classification task as a regression problem. I implemented the regression model in TensorFlow as a standard Sequential model whose last layer is a single-unit Dense layer with no activation function. To measure performance, I need to use standard classification metrics, such as accuracy and Cohen's kappa.
However, I can't use those metrics directly because my model is a regression model, so I need to clip and round the output before feeding it to the metrics. I work around this by defining my own metric, but that workaround is not practical. Therefore, I'm thinking about contributing to TensorFlow by implementing a custom transformation_function that transforms y_pred with a Tensor lambda function before it is stored in the __update_state method. After reading the source code, I have doubts about this idea. So I'm asking you, fellow TensorFlow users/contributors: what is the best practice for transforming y_pred before feeding it to a metric? Is this functionality already implemented in the newest version?
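To make the question concrete, my current workaround looks roughly like this (a sketch; the class name and the num_classes handling are mine), and it has to be repeated for every metric I want:

    import tensorflow as tf

    class RoundedAccuracy(tf.keras.metrics.Accuracy):
        """Accuracy for a regression head: clip and round y_pred to valid labels."""
        def __init__(self, num_classes, name="rounded_accuracy", **kwargs):
            super().__init__(name=name, **kwargs)
            self.num_classes = num_classes

        def update_state(self, y_true, y_pred, sample_weight=None):
            # Turn the continuous regression output into an integer label
            # before delegating to the underlying classification metric.
            y_pred = tf.round(tf.clip_by_value(y_pred, 0, self.num_classes - 1))
            return super().update_state(y_true, y_pred, sample_weight)

It is then passed as metrics=[RoundedAccuracy(num_classes=5)] in model.compile.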
Thank you!
I built decision tree and logistic regression models and I am satisfied with the results. How do I use them on unsupervised data?
Also: will I always need to use StandardScaler on new data?
While your question is too broad for SO, I still want to give some short advice:
You need supervised data only for the training stage of your model. Once you have a trained model, you can make predictions on unsupervised data (i.e. data that has no labels/targets) and the model returns the predicted labels. Usually you do this with the predict method.
Important point: to use the predict method, you must pass data to the model in the same form as during training - the same set of features and the same number of features (excluding labels/targets, of course).
The same goes for preprocessing - if you used a StandardScaler on the training data, you must use it on the new data too - the SAME StandardScaler (i.e. call the transform method of the scaler already fitted on the training data), as in the sketch below.
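A minimal sketch of both points (X_train, y_train, and X_new are placeholders for your own arrays):

    from sklearn.linear_model import LogisticRegression
    from sklearn.preprocessing import StandardScaler

    # Fit the scaler and the model on the labeled training data ONCE.
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)

    # For new, unlabeled data: reuse the SAME fitted scaler (transform,
    # not fit_transform), with the same features in the same order.
    predicted_labels = model.predict(scaler.transform(X_new))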
The philosophy of using StandardScaler or other normalisation, in short: use it for linear models (including your logistic regression). Read about it here, for example: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
But for trees it is not necessary. Example: https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6
I am trying to perform cross-validation on NMF to find the best parameters to use. I tried using sklearn's cross-validation but got an error stating that NMF does not have a scoring method. Could anyone here help me with that? Thank you all.
A property of NMF is that it is an unsupervised (machine learning) method. This generally means that there is no labeled data that can serve as a 'gold standard'.
In the case of NMF you cannot define the 'desired' outcome beforehand.
Cross-validation in sklearn is designed for supervised machine learning, in which you have labeled data by definition.
What cross-validation does is hold out sets of labeled data, train a model on the data that is left over, and evaluate the model on the held-out set. Any metric can be used for this evaluation - for example accuracy, precision, recall, or F-measure - and computing these measures requires labeled data.
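If your goal is just to compare NMF parameter settings, a common pragmatic substitute (not real cross-validation) is to compare the reconstruction error that NMF itself minimizes. A sketch, where X stands for your non-negative data matrix:

    from sklearn.decomposition import NMF

    # Lower reconstruction error is better, but note that it always
    # decreases as n_components grows, so it cannot pick the "right"
    # rank on its own.
    for n_components in (5, 10, 20):
        model = NMF(n_components=n_components, init="nndsvda",
                    max_iter=500, random_state=0).fit(X)
        print(n_components, model.reconstruction_err_)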
I'm trying a range of online classifiers in the scikit-learn library to train a model on huge data. I found that there are many classifiers supporting partial_fit, which allows incremental learning. I want to use ridge regression in this setting, but could not find it among the implementations. Is there an alternative model in sklearn that can do this?
sklearn.linear_model.SGDClassifier: its loss parameter can be 'hinge', 'log', 'modified_huber', 'squared_hinge', or 'perceptron'.
sklearn.linear_model.SGDRegressor: its default loss is 'squared_loss'; the possible values are 'squared_loss', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'.
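So for ridge regression specifically, SGDRegressor with the default squared loss and an L2 penalty optimizes essentially the same objective as Ridge while supporting incremental learning. A minimal sketch (batches stands for your own mini-batch iterator):

    from sklearn.linear_model import SGDRegressor

    # Squared loss (the default) + L2 penalty ~ ridge regression,
    # trained one mini-batch at a time.
    reg = SGDRegressor(penalty="l2", alpha=1e-4)
    for X_batch, y_batch in batches:
        reg.partial_fit(X_batch, y_batch)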
I have a very big dataset that cannot be loaded in memory.
I want to use this dataset as the training set of a scikit-learn classifier - for example a LogisticRegression.
Is it possible to perform mini-batch training of a scikit-learn classifier where I provide the mini-batches myself?
I believe that some of the classifiers in sklearn have a partial_fit method. This method allows you to pass minibatches of data to the classifier, such that a gradient descent step is performed for each minibatch. You would simply load a minibatch from disk, pass it to partial_fit, release the minibatch from memory, and repeat.
If you are particularly interested in doing this for logistic regression, then you'll want to use SGDClassifier, which can be set to use logistic regression when loss = 'log' (renamed to 'log_loss' in newer scikit-learn versions).
You pass the features and labels for your minibatch to partial_fit in the same way that you would use fit, with one caveat: on the first call you must also pass the full list of classes, since a single minibatch may not contain every label:

    clf.partial_fit(X_minibatch, y_minibatch, classes=all_classes)
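Putting it together, a minimal sketch of the loop, assuming the data has been pre-split into .npy chunk files with the label in the last column (the file names are hypothetical):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    clf = SGDClassifier(loss="log_loss")  # 'log' in older scikit-learn versions
    classes = np.array([0, 1])  # every label that can ever appear

    for path in ["chunk_0.npy", "chunk_1.npy", "chunk_2.npy"]:
        batch = np.load(path)  # load one minibatch from disk
        X_minibatch, y_minibatch = batch[:, :-1], batch[:, -1]
        clf.partial_fit(X_minibatch, y_minibatch, classes=classes)
        del batch  # release the minibatch from memory before the next one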
Update:
I recently came across the dask-ml library which would make this task very easy by combining dask arrays with partial_fit. There is an example on the linked webpage.
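For reference, usage looks roughly like this (a sketch based on my reading of the dask-ml docs; Incremental calls the wrapped estimator's partial_fit once per chunk):

    import dask.array as da
    from dask_ml.wrappers import Incremental
    from sklearn.linear_model import SGDClassifier

    # A chunked dask array stands in for data too big for memory; in
    # practice you would build it from your own files rather than randomly.
    X = da.random.random((1_000_000, 20), chunks=(100_000, 20))
    y = (X[:, 0] > 0.5).astype(int)

    inc = Incremental(SGDClassifier(loss="log_loss"))
    inc.fit(X, y, classes=[0, 1])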
Have a look at the scaling strategies included in the sklearn documentation:
http://scikit-learn.org/stable/modules/scaling_strategies.html
A good example is provided here:
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html