How to perform cross-validation on NMF in Python

I am trying to perform cross-validation on NMF to find the best parameters to use. I tried using sklearn's cross-validation but I get an error stating that NMF does not have a scoring method. Could anyone here help me with that? Thank you all

A property of NMF is that it is an unsupervised (machine learning) method. This generally means that there is no labeled data that can serve as a 'gold standard'.
In the case of NMF you cannot define the 'desired' outcome beforehand.
The cross-validation in sklearn is designed for supervised machine learning, in which you have labeled data by definition.
What cross-validation does is hold out sets of labeled data, train a model on the data that is left over, and evaluate this model on the held-out set. Any metric can be used for this evaluation, for example accuracy, precision, recall and F-measure, and computing these measures requires labeled data.
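For illustration, a minimal sketch (iris data purely as a stand-in) of the difference: sklearn's cross-validation can score a supervised estimator against the held-out labels, while NMF has no labels to score against and no score method to fall back on:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Supervised case: the labels y let each held-out fold be scored (accuracy here).
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5))

# Unsupervised case: NMF has no score method and there are no labels to score
# against, so the line below raises the error from the question.
# cross_val_score(NMF(n_components=2), X, cv=5)
```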

Related

How do I use my supervised ML model with unsupervised data?

I made a decision tree and a logistic regression model. I am satisfied with the results. How do I use them on unsupervised data?
Also: will I always need to apply StandardScaler to new data?
While your question is too broad for SO, I still want to give some short advice:
You need supervised data only for the training stage of your model. Once you have a trained model, you can make predictions on unsupervised data (i.e. data that has no labels/targets) and the model returns predicted labels. Usually you can do this with the predict method.
Important point: to use the predict method, the data passed to the model must have the same form as during training - the same set of features and the same number of features (excluding labels/targets, of course).
The same goes for preprocessing - if you used StandardScaler on the training data, you must use it for new data too - the SAME StandardScaler (i.e. call the transform method of the scaler that was already fitted on the training data); see the sketch below.
The philosophy of using StandardScaler or some other normalisation, in short: use it for linear models (including your logistic regression). Read about it here, for example: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
But for trees it is not necessary. Example: https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6
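A short sketch of the scale-then-predict workflow described above, with random stand-in data in place of your own:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))                 # labeled training data (toy)
y_train = (X_train[:, 0] > 0).astype(int)
X_new = rng.normal(size=(10, 3))                    # new, unlabeled data

# Fit the scaler and the model on the training data only.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)

# New data must have the same features, in the same order, as in training.
# Reuse the SAME fitted scaler: call transform(), never fit() again.
predicted_labels = model.predict(scaler.transform(X_new))
print(predicted_labels)
```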

What is the best way to implement novelty detection using Scikit-learn when I have unpolluted training data (outlier-free)?

I have a labeled database with a separate class "-1" in which all the outliers are.
I am currently using sklearn's LocalOutlierFactor and OneClassSVM, fitting them on outlier-free training data and then testing them on test data containing outliers. My objective is to check whether new, unseen examples are outliers before classifying them with a classification model.
It appears that since the training data I use to fit the models is free of outliers, and since the examples in each class are very similar, I get the best results (precision and recall) on the test data if I set the contamination hyperparameter of the LocalOutlierFactor as low as possible, something like 10**-100. The higher I set this value, the more false outliers my model detects on new data.
I observe similar behaviour with OneClassSVM: the hyperparameters gamma and nu have to be extremely low for the model to give me the best results.
According to Scikit-learn the training data shall not be polluted by outliers to perform novelty detection. This is explained at the top here.
Given this, my question is whether I am missing something or whether my approach is legitimate. I don't get why I even have to set the contamination hyperparameter if the training data is not supposed to be polluted. It is perfectly clean in my case, contamination can't be set to 0, and it seems weird to have to set it manually to such a low value.
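Roughly, the setup looks like this (synthetic stand-in data; parameter values as described above; note that novelty=True is needed on LocalOutlierFactor to call predict on unseen data):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))                     # outlier-free training data
X_test = np.vstack([rng.normal(size=(20, 2)),           # inliers
                    rng.uniform(-8, 8, size=(5, 2))])   # outliers

# novelty=True is required so that predict() can be called on new data.
lof = LocalOutlierFactor(novelty=True, contamination=1e-100).fit(X_train)
ocsvm = OneClassSVM(gamma=1e-3, nu=1e-3).fit(X_train)

# +1 = inlier, -1 = outlier
print(lof.predict(X_test))
print(ocsvm.predict(X_test))
```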

sklearn calibrated classifier with random forest

Scikit-learn has very useful classifier wrappers, CalibratedClassifier and CalibratedClassifierCV, which try to make sure that the predict_proba function of a classifier really predicts a probability and not just an arbitrary number (albeit perhaps a well-ranked one) between zero and one.
However, when using random forests it is customary to use oob_decision_function_ to determine the performance on the training data, but this is no longer available when using the calibrated models. The calibration should therefore work well for new data but not for the training data. How can we evaluate performance on the training data to determine, e.g., overfitting?
Apparently there really was no solution to this, and so I made a pull request to scikit-learn.
The problem was that the out-of-bag predictions are created during learning. In CalibratedClassifierCV each of the sub-classifiers therefore does have its own oob decision function, but that decision function is calculated only on its training fold of the data. It is therefore necessary to store each oob prediction (keeping nan values for samples that are not in the fold), convert all the predictions using the calibration transformation, and then average the calibrated oob predictions to create an updated oob prediction.
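A rough sketch of that idea (this is not the code from the pull request; the fold handling, the isotonic calibrator and the function name are my own illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import KFold

def calibrated_oob_predictions(X, y, n_splits=5, seed=0):
    """Per-fold OOB probabilities, calibrated on the held-out fold, nan-averaged."""
    oob = np.full((n_splits, len(y)), np.nan)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for i, (train_idx, test_idx) in enumerate(cv.split(X)):
        rf = RandomForestClassifier(oob_score=True, random_state=seed)
        rf.fit(X[train_idx], y[train_idx])
        raw = rf.oob_decision_function_[:, 1]       # OOB P(class 1) on this fold's training part
        calib = IsotonicRegression(out_of_bounds="clip")
        # Fit the calibration mapping on the held-out fold, as in CalibratedClassifierCV.
        calib.fit(rf.predict_proba(X[test_idx])[:, 1], y[test_idx])
        oob[i, train_idx] = calib.predict(raw)      # calibrated OOB values, nan elsewhere
    return np.nanmean(oob, axis=0)                  # average over folds, ignoring nan

X, y = make_classification(n_samples=300, random_state=0)
print(calibrated_oob_predictions(X, y)[:5])
```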
As mentioned, I created a pull request at https://github.com/scikit-learn/scikit-learn/pull/11175. It will probably be a while before it is merged into the package, though, so if anyone really needs to use it then feel free to use my fork of scikit-learn at https://github.com/yishaishimoni/scikit-learn.

How to update an SVM model with new data

I have two data sets of different sizes:
1) Data set 1 is high-dimensional, with 4500 samples (sketches).
2) Data set 2 is low-dimensional, with 1000 samples (real data).
I assume that both data sets have the same distribution.
I want to train a non-linear SVM model using sklearn on the first data set (as a pre-training), and after that I want to update the model on part of the second data set (to fit the model).
How can I implement this kind of update in sklearn? How can I update an SVM model?
In sklearn you can do this only for a linear kernel, using SGDClassifier (with an appropriate choice of loss/penalty terms: the loss should be hinge and the penalty L2). Incremental learning is supported through the partial_fit method, and this is implemented for neither SVC nor LinearSVC.
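A minimal sketch of that incremental route, with random stand-in arrays for the two data sets (keep in mind this is a linear model, not the non-linear SVM asked about):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(4500, 10)), rng.integers(0, 2, 4500)  # stand-in for data set 1
X2, y2 = rng.normal(size=(1000, 10)), rng.integers(0, 2, 1000)  # stand-in for data set 2

# hinge loss + l2 penalty makes SGDClassifier a linear SVM trained by SGD.
clf = SGDClassifier(loss="hinge", penalty="l2")

# "Pre-train" on data set 1; classes must be given on the first partial_fit call.
clf.partial_fit(X1, y1, classes=np.unique(y1))

# Later, update the same model incrementally on (part of) data set 2.
clf.partial_fit(X2, y2)
```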
Unfortunately, in practice, fitting an SVM incrementally on such small datasets is rather pointless. An SVM has an easily obtainable global solution, so you do not need pretraining of any form; in fact it should not matter at all, if you are thinking about pretraining in the neural-network sense. If correctly implemented, the SVM should completely forget the previous dataset. Why not learn on the whole data in one pass? That is what an SVM is supposed to do. Unless you are working with some non-convex modification of SVM (then pretraining makes sense).
To sum up:
From a theoretical and practical point of view there is no point in pretraining an SVM. You can either learn on the second dataset only, or on both at the same time. Pretraining is only reasonable for methods that suffer from local minima (or hard convergence of any kind) and thus need to start near the actual solution to find a reasonable model (like neural networks). SVM is not one of them.
You can use incremental fitting (although in sklearn it is very limited) for efficiency reasons, but for such a small dataset you will be just fine fitting the whole thing at once.

When using multiple classifiers - How to measure the ensemble's performance? [SciKit Learn]

I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics - classifying protein sequences as Neuropeptide precursor sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor.)
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc. on the training set for 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (linear kernel), SVM (RBF kernel) and GRB) and a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and get it from a combination of classifiers. (That is, to see what the ROC curve is for each classifier alone [already known], and then to see the ROC curve or AUC on the training data for combinations of classifiers.)
(I currently filter the predictions using predict_proba with the Random Forests and ExtraTrees methods, then filter out results with a predicted score below an arbitrary cutoff of 0.85. An additional layer of filtering is "how many classifiers agree on this protein's positive classification".)
Thank you very much!!
(The website implementation, where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/ . The whole shebang is implemented using scikit-learn and Python. Citations and all!)
To evaluate the performance of the ensemble, follow the same approach as you would normally. However, you will want to create the 10-fold data set partitions first, and for each fold train all the members of the ensemble on that fold's training data, measure the accuracy of the ensemble's vote on the held-out data, rinse and repeat with the other folds, and then compute the overall accuracy of the ensemble. So the key difference is not to train the individual algorithms with their own k-fold cross-validation when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
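One way to set this up in scikit-learn is to wrap the members in a VotingClassifier and evaluate it with the same CV splitter used for the individual models; a sketch with toy data and a reduced set of members:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)   # stand-in for your features

members = [("rf", RandomForestClassifier(random_state=0)),
           ("et", ExtraTreesClassifier(random_state=0)),
           ("svm", SVC(kernel="rbf", random_state=0))]
ensemble = VotingClassifier(members, voting="hard")          # simple majority vote

# Use the same folds for every model so the scores are directly comparable.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, model in members + [("ensemble", ensemble)]:
    scores = cross_val_score(model, X, y, cv=cv)
    print(name, scores.mean())
```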
An alternative approach (again taking care to evaluate the ensemble as described above) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
You can use a linear regression for stacking. For each of the 10 folds, you can split the data into:
8 training sets
1 validation set
1 test set
Optimise the hyper-parameters for each algorithm using the training set and the validation set, then stack your predictions by fitting a linear regression - or a logistic regression - over the validation set. Your final model will be p = a_0 + a_1 p_1 + … + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k and a_k is the weight of model k. You can also use the predicted labels directly if a model doesn't give you probabilities.
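A sketch of that blending step on toy data (the test split is omitted for brevity; a random forest and an RBF SVM stand in for your base models):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, random_state=0)   # stand-in data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Base models (k = 1..K) fitted on the training split.
base_models = [RandomForestClassifier(random_state=0),
               SVC(kernel="rbf", probability=True, random_state=0)]
for m in base_models:
    m.fit(X_train, y_train)

# p_k: each base model's predicted probability on the validation split.
P_val = np.column_stack([m.predict_proba(X_val)[:, 1] for m in base_models])

# The blender learns the weights a_k (and intercept a_0) of
# p = a_0 + a_1 p_1 + ... + a_K p_K, here via logistic regression.
blender = LogisticRegression().fit(P_val, y_val)
print("a_0:", blender.intercept_, "a_k:", blender.coef_)
```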
If your models are of the same kind, you can optimise the parameters of the models and the weights at the same time.
If there are obvious differences in the data, you can use different bins with different parameters for each; for example, one bin could be short sequences and the other long sequences, or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, the section on blending. In 2008 and 2009 they used more advanced techniques, which may also be interesting for you.
