I have trained neural networks whose predictions I am trying to average using EnsembleVoteClassifier from mlxtend.classifier. The problem is that my neural networks don't share the same input (I applied feature reduction and feature selection algorithms randomly and stored the results in separate variables, so I have something like X_test_algo1, X_test_algo2 and X_test_algo3, and Y_test).
I am trying to do a weighted average of the predictions, but as I said, the models don't share the same X, and I didn't find any example of this in the documentation. How can I average the predictions of my three models model1, model2 and model3?
eclf = EnsembleVoteClassifier(clfs=[model1, model2, model3], weights=[1, 1, 1], refit=False)
names = ['NN1', 'NN2', 'NN3', 'Ensemble']
eclf.fit(X_train_algo1, Ytrain)  # ???? which X should go here?
If it's not possible, that is okay. I am only interested in how to calculate the formulas for Hard Voting, Soft Voting and Weighted Voting; a pointer to another, more flexible library, or the explicit expressions of the formulas, would be helpful too.
Why would you need a library to do that?
Simply pass the same examples through all your neural networks and get the predictions (either logits or probabilities or labels).
Hard voting: choose the label predicted most often across the classifiers.
Soft voting: average the probabilities predicted by the classifiers and choose the label with the highest average probability.
Weighted voting: either of the above can be weighted. Just assign a weight to each classifier and multiply its predictions by it. Weights are usually normalized to the (0, 1] range.
In principle you could also sum the logits and choose the label with the highest sum.
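A minimal sketch of all three, assuming Keras-style models whose predict() returns class probabilities of shape (n_samples, n_classes), and reusing the variable names from the question:

```python
import numpy as np

# Each model gets its own reduced/selected feature set.
probas = [
    model1.predict(X_test_algo1),
    model2.predict(X_test_algo2),
    model3.predict(X_test_algo3),
]

weights = np.array([1.0, 1.0, 1.0])   # one weight per classifier
weights = weights / weights.sum()     # normalise the weights

# Soft (weighted) voting: weighted average of probabilities, then argmax.
avg_proba = np.tensordot(weights, np.stack(probas), axes=1)   # (n_samples, n_classes)
soft_vote = np.argmax(avg_proba, axis=1)

# Hard voting: each model casts its label, the (optionally weighted) majority wins.
labels = np.stack([np.argmax(p, axis=1) for p in probas])     # (n_models, n_samples)
n_classes = probas[0].shape[1]
one_hot = np.eye(n_classes)[labels]                           # (n_models, n_samples, n_classes)
hard_vote = np.argmax(np.tensordot(weights, one_hot, axes=1), axis=1)
```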
Oh, and weight averaging is a different technique: it requires you to have the same model, and it is usually done on the same initialization at different training timesteps. You can read about it in this blog post.
Related
I am implementing a feed-forward neural network for a specific clustering problem.
I'm not sure if it is possible or even makes sense, but the network consists of multiple layers followed by a clustering layer (say, k-means) used to calculate the clustering loss.
The NN layers act as a feature extractor, while the last layer is only used to calculate the loss (for example, by calculating the similarity score among different data points).
Actually, this network architecture is part of a bigger auto-encoder similar to what is discussed in this paper.
The question here is: can I define a custom loss function in TensorFlow/Keras that receives the output of the NN and computes the clustering loss? And how?
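For context, a custom Keras loss does receive the network output as y_pred, so one hedged sketch of a clustering-style loss (not the paper's method) is to penalise the squared distance of each embedding to the nearest of a fixed set of centroids; the `centroids` array below is a hypothetical (k, embedding_dim) input you would supply yourself, e.g. from running k-means on an initial embedding:

```python
import tensorflow as tf

def make_clustering_loss(centroids):
    centroids = tf.constant(centroids, dtype=tf.float32)      # (k, d)

    def clustering_loss(y_true, y_pred):
        # y_pred: (batch, d) embeddings produced by the feature-extractor layers.
        # y_true is ignored; a dummy target still has to be passed to fit().
        # Pairwise squared distances to every centroid: (batch, k).
        d2 = tf.reduce_sum(
            tf.square(tf.expand_dims(y_pred, 1) - centroids), axis=-1)
        # Distance to the closest centroid, averaged over the batch.
        return tf.reduce_mean(tf.reduce_min(d2, axis=1))

    return clustering_loss

# model.compile(optimizer="adam", loss=make_clustering_loss(initial_centroids))
```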
I've recently started exploring TensorFlow's feature columns for myself.
If I understood the documentation right, feature columns are just a 'frame' for transformations applied just before the data is fed to the model. So, if I want to use them, I define some feature columns, create a DenseFeatures layer from them, and when I fit data into the model, all features go through that DenseFeatures layer, are transformed, and are then fed into the first Dense layer of my NN.
My question is: is it possible to somehow check the correlations of the transformed features with my target variable?
For example, I have a categorical feature corresponding to the day of the week (Mon/Tue/.../Sun), which I could encode as 1/2/.../7. Its correlation with my target will not be the same as the correlation of the categorical feature column (e.g. an indicator column), because the model doesn't understand that 7 is the maximum of the possible sequence, whereas as a category it becomes a one-hot encoded feature with precise boundaries.
Let me know if anything is unclear.
I'd be grateful for any help!
TensorFlow does not provide a built-in feature-importance measure the way scikit-learn or XGBoost do.
However, you can test the importance (or correlation) of a feature with respect to the target in TensorFlow as follows.
1) Shuffle the values of the particular feature whose correlation with the target you want to test. That is, if your feature is fea1, the value at df['fea1'][0] might become the value at df['fea1'][4], the value at df['fea1'][2] might become the value at df['fea1'][3], and so on.
2) Now fit the model to the modified training data and check the accuracy on the validation data.
3) If the accuracy drops drastically, your feature had a high correlation with the target; if the accuracy barely changes, the feature isn't of great importance (high error = high importance).
You can do the same with the other features in your training data.
This can take some time and effort; a minimal sketch of the procedure is below.
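A hedged sketch of steps 1-3, assuming a pandas DataFrame `train_df`, a target column name, a held-out validation split (X_val, y_val), and a hypothetical `build_and_fit()` helper that builds, compiles (with an accuracy metric) and fits your Keras model:

```python
import numpy as np

def shuffled_feature_score(train_df, target, feature, X_val, y_val, build_and_fit):
    shuffled = train_df.copy()
    # Step 1: shuffle just this one feature's values across rows.
    shuffled[feature] = np.random.permutation(shuffled[feature].values)

    # Step 2: refit the model on the modified training data.
    model = build_and_fit(shuffled.drop(columns=[target]), shuffled[target])

    # Step 3: score on untouched validation data; a large drop versus the
    # baseline accuracy suggests the feature mattered.
    _, accuracy = model.evaluate(X_val, y_val, verbose=0)
    return accuracy
```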
Is there a way to assign weights to features before training a neural network model? The meaning of "weight" here is similar to that in linear regression: a representation of how much to trust each feature. This is unrelated to node weights or to feature-importance estimation after training.
Something similar to this publication.
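For illustration only (not the linked publication's method), one simple interpretation is to scale each input feature by a fixed "trust" weight before it reaches the trainable layers; the feature count and weights below are hypothetical:

```python
import numpy as np
import tensorflow as tf

n_features = 10                                    # hypothetical
trust = np.linspace(1.0, 0.1, n_features)          # hypothetical per-feature trust weights

inputs = tf.keras.Input(shape=(n_features,))
# Non-trainable elementwise scaling of the inputs by the trust weights.
scaled = tf.keras.layers.Lambda(
    lambda x: x * tf.constant(trust, dtype=tf.float32))(inputs)
hidden = tf.keras.layers.Dense(32, activation="relu")(scaled)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
model = tf.keras.Model(inputs, outputs)
```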
So I am using scikit-learn to do some relatively basic machine learning in Python. I am trying to train a model that takes in some feature values and returns a 0 or a 1. In my specific case, an output of 0 means the model doesn't think a Facebook post will be shared more than 10 times, whereas a 1 means the model predicts the post will be shared more than 10 times.
I have trained a few different models using techniques like logistic regression, neural networks and stochastic gradient descent. Once I have trained these models, I test them, and for each model type (logistic regression, neural network, etc.) I look at how many 1 predictions it made and how many of those were correct.
Now the problem I am faced with emerges. Say the logistic regression model, when tested on 3000 items of test data, predicted that 30 of the posts would get more than 10 shares (i.e. returned 1), and it was correct 97% of the time when it predicted 1. This is all well and good, but I would be more than willing to trade some accuracy for more positive predictions. For example, if I could generate 200 predictions of 1 with 80% accuracy, I would make that tradeoff in a heartbeat.
What are the methods that I could use to go about doing this and how would it be done? Is it even possible?
This is basically the precision-recall tradeoff problem.
For logistic regression, you can lower the decision threshold to get higher recall at the cost of lower precision.
You can read more about it here: http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
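For example, a minimal sketch (the 0.3 threshold is arbitrary, and X_train, y_train, X_test stand in for your own data):

```python
from sklearn.linear_model import LogisticRegression

# Instead of clf.predict(), which uses a fixed 0.5 threshold, take the
# predicted probability of class 1 and apply a lower threshold so the model
# makes more positive predictions (higher recall, lower precision).
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

proba_pos = clf.predict_proba(X_test)[:, 1]   # P(post shared more than 10 times)
threshold = 0.3                               # lower threshold -> more 1s
y_pred = (proba_pos >= threshold).astype(int)
```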
I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics: classifying protein sequences as neuropeptide precursor sequences. Here's the original article if anyone's interested, as well as the code used to generate the features and to train a single predictor.)
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc. on the training set with 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (linear kernel), SVM (RBF kernel) and GRB) and to use a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or for their combined (voted) predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results to get it for a combination of classifiers. (That is, to see the ROC curve for each classifier alone [already known], and then the ROC curve or AUC on the training data for combinations of classifiers.)
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then arbitrarily discard results with a predicted score below 0.85. An additional layer of filtering is "how many classifiers agree on this protein's positive classification".)
Thank you very much!!
(The website implementation, where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/ . The whole shebang is implemented using scikit-learn and Python. Citations and all!)
To evaluate the performance of the ensemble, follow the same approach you normally would. However, create the 10-fold dataset partitions first, and for each fold train every member of the ensemble on that same training fold, measure the accuracy on the held-out fold, rinse and repeat with the other folds, and then compute the overall accuracy of the ensemble. So the key difference is not to run k-fold cross-validation on the individual algorithms separately when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
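For reference, here is a minimal sketch of this evaluation using scikit-learn's VotingClassifier (which refits every member on each training fold, so no member sees the held-out fold), assuming a recent scikit-learn and a feature matrix X with labels y; the estimators and scoring metric are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200)),
        ("et", ExtraTreesClassifier(n_estimators=200)),
        ("svm_lin", SVC(kernel="linear", probability=True)),
        ("svm_rbf", SVC(kernel="rbf", probability=True)),
    ],
    voting="soft",   # or "hard" for a plain majority vote
)

# 10-fold CV of the whole ensemble, evaluated like any single classifier.
scores = cross_val_score(ensemble, X, y, cv=10, scoring="roc_auc")
print(scores.mean(), scores.std())
```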
An alternative approach (again making sure the ensemble never sees the test data) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses of these other classifiers. This is termed "stacking".
You can use a linear regression for stacking. With 10 folds, you can split the data into:
8 folds for training
1 fold for validation
1 fold for testing
Optimise the hyper-parameters of each algorithm using the training and validation sets, then stack your predictions by fitting a linear regression (or a logistic regression) on the validation set. Your final model will be p = a_0 + a_1 p_1 + … + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k, and a_k is the weight of model k. You can also use the predicted labels directly if a model doesn't give you probabilities.
If your models are all of the same type, you can optimise the model parameters and the weights at the same time.
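A minimal sketch of this kind of blending, assuming a list `base_models` of scikit-learn classifiers that expose predict_proba and a training set X, y; out-of-fold probabilities stand in for the validation-set predictions p_1 … p_K:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# One column of out-of-fold positive-class probabilities per base model.
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=10, method="predict_proba")[:, 1]
    for m in base_models
])

# The logistic regression learns the weights a_1..a_K (and the intercept a_0).
blender = LogisticRegression().fit(oof, y)
print(blender.intercept_, blender.coef_)
```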
If you have obvious differences in your data, you can use different bins with different parameters for each: for example, one bin for short sequences and another for long sequences, or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as you would for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, in particular the section on blending. In 2008 and 2009 they used more advanced techniques, which may also be interesting for you.