I have a data set of ~150 samples where, for each sample, I have 11 inputs and 3 outputs. I tried to build a full regression model that takes the 11 inputs and is trained to predict the 3 outputs. The issue is that with so few samples, training a full model is almost impossible. For this reason I am experimenting with simpler regression methods, such as linear regression in Python's sklearn. From what I can find, most regression models either support one input to predict one output or many inputs to predict one output.
Question: Are there any types of regression that support many inputs to predict many outputs? Or are there any regression types at all that may be better suited to my needs?
Thank you for any help!
Have you considered simply performing separate linear regressions for each dependent variable?
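For what it's worth, sklearn's LinearRegression already accepts a 2-D target, so a single fit covers the 11-inputs-to-3-outputs case, while the MultiOutputRegressor wrapper does the per-target version explicitly. A minimal sketch with made-up placeholder data (only the array shapes mirror the question):

# Minimal sketch with synthetic placeholder data (only the shapes mirror the question).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 11))   # ~150 samples, 11 inputs
Y = rng.normal(size=(150, 3))    # 3 outputs per sample

# Option 1: one model, 2-D target. LinearRegression handles this natively.
joint = LinearRegression().fit(X, Y)
print(joint.predict(X[:2]).shape)        # (2, 3)

# Option 2: the wrapper fits one independent regression per output column.
per_target = MultiOutputRegressor(LinearRegression()).fit(X, Y)
print(per_target.predict(X[:2]).shape)   # (2, 3)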
Also, you need to decide which inputs are theoretically significant (in terms of explaining the variation in the output) and then test for statistical significance to determine which ones should be retained in the model.
Additionally, test for multicollinearity to determine if you have variables which are statistically related and inadvertently influencing the output. Then, test in turn for serial correlation (if the data is a time series) and heteroscedasticity.
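As a concrete illustration of the multicollinearity check, here is a minimal sketch with synthetic data (statsmodels assumed available; the deliberately collinear column and the VIF rule of thumb are illustrative, not from the post):

# Minimal sketch with synthetic data: variance inflation factors flag inputs that are
# largely explained by the other inputs; values well above ~5-10 are a common warning sign.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(150, 11)), columns=[f"x{i}" for i in range(11)])
X["x10"] = X["x0"] + 0.1 * rng.normal(size=150)   # deliberately collinear column

Xc = sm.add_constant(X)                           # VIF expects an intercept column
vif = pd.Series(
    [variance_inflation_factor(Xc.values, i) for i in range(1, Xc.shape[1])],
    index=X.columns,
)
print(vif.sort_values(ascending=False))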
The approach you are describing risks "garbage in, garbage out" and overfitting, since you don't seem to be screening the inputs themselves for their relevance in predicting the output.
Related
It's not clear to me why some resources online demonstrate a multi-target Random Forest regression as being instantiated as either
model = MultiOutputRegressor(RandomForestRegressor())
versus:
model = RandomForestRegressor()
when both seemingly generate multiple regressed outputs. Can anyone clarify?
The internal models are different, but they are both multioutput regressors.
MultiOutputRegressor fits one random forest for each target. Each tree inside is then predicting just one of your outputs.
Without the wrapper, RandomForestRegressor fits trees targeting all the outputs at once. The split criteria are based on the average impurity reduction across the outputs. See the User Guide.
The latter may be better computationally, since fewer trees are being built. It can also make use of the fact that the several outputs for a given input may well be correlated. That's all discussed in the user guide as well.
Some conjecture on my part: on the other hand, if the several outputs for a given input are not correlated, internal splits that are good for one output may be lousy for the other outputs, so simply averaging the impurity reductions might not work as well. I think in that case increasing the tree complexity can alleviate the issue (but will also take more computation).
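A minimal sketch with synthetic data contrasting the two instantiations from the question: both accept a 2-D target and return predictions of the same shape, but the wrapper builds one forest per target while the plain estimator builds a single forest splitting on the averaged criterion.

# Minimal sketch with synthetic data; both variants return (n_samples, n_outputs) predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = rng.normal(size=(200, 3))

# One forest per target: 3 forests x 100 trees = 300 trees in total.
wrapped = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=100, random_state=0)
).fit(X, Y)

# A single forest whose splits use the criterion averaged over all 3 targets: 100 trees.
native = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y)

print(wrapped.predict(X[:2]).shape, native.predict(X[:2]).shape)  # (2, 3) (2, 3)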
Background of the Problem
I want to explain the outcome of machine learning (ML) models using SHapley Additive exPlanations (SHAP), which is implemented in the shap library of Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of Leave-One-Out Cross-Validation (LOOCV), the ML model will be different, since in each iteration I am training on a different dataset (one participant's data is left out). Also, the model will be different because I am doing feature selection in each iteration.
Then, My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and also several questions (e.g. this one) on SO, but I failed to find an answer to this problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants, if I apply SHAP to each model, there will be 250 outputs! Thus, I want to get a single output which presents the performance of the 250 models.
You seem to be training a model on 250 data points while doing LOOCV. LOOCV is about choosing a model, with its hyperparameters, that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters (note, LOOCV with 250 folds is already overkill; would you do that with 250,000 rows?); rather, you are trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still only an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to be much different from a single random train/test split?
Note as well:
However, in each iteration of Leave-One-Out Cross-Validation (LOOCV), the ML model will be different, since in each iteration I am training on a different dataset (one participant's data is left out).
In each iteration of LOOCV the model is still the same (same features; the hyperparameters may differ, depending on your definition of an iteration). It's still the same dataset (same features).
Also, the model will be different because I am doing feature selection in each iteration.
It doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
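As an illustration, a minimal sketch with synthetic data (xgboost and shap assumed installed, and the same feature set assumed in every fold; none of the names come from the original setup). One common way to get a single picture out of LOOCV is to explain each fold's model on its held-out row, stack the per-row SHAP values into one matrix, and draw a single summary plot:

# Minimal sketch: one SHAP row per held-out participant, combined into one summary plot.
import numpy as np
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))                     # pretend: 50 participants, 6 features
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

rows = []
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor(n_estimators=50).fit(X[train_idx], y[train_idx])
    explainer = shap.Explainer(model)            # a tree explainer is picked automatically
    rows.append(explainer(X[test_idx]).values)   # SHAP values for the held-out row

all_shap = np.vstack(rows)                       # one (n_samples, n_features) matrix
shap.summary_plot(all_shap, X)                   # a single plot covering all folds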
Scikit-learn allows sample weights to be provided to linear, logistic, and ridge regressions (among others), but not to elastic net or lasso regressions. By sample weights, I mean each element of the input to fit on (and the corresponding output) is of varying importance, and should have an effect on the estimated coefficients proportional to its weight.
Is there a way I can manipulate my data before passing it to ElasticNet.fit() to incorporate my sample weights?
If not, is there a fundamental reason it is not possible?
Thanks!
You can read some discussion about this in sklearn's issue-tracker.
It basically reads like:
not that hard to do (theory-wise)
it's a pain keeping all of the basic sklearn APIs intact and supporting all possible cases (dense vs. sparse data)
As you can see in this thread and the linked one about adaptive lasso, there is not much activity there (probably because not many people care and the related paper is not popular enough; but that's only a guess).
Depending on your exact task (size? sparseness?), you could build your own optimizer quite easily based on scipy.optimize, supporting this kind of sample weighting (it will be a bit slower, but robust and precise)!
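If you only need to weight the data-fit term, the usual data-manipulation workaround is to scale each row of X and y by sqrt(w_i) before calling ElasticNet.fit(). A minimal sketch with synthetic data; note this leaves the alpha penalty untouched and needs extra care with the intercept, so treat it as an approximation rather than a drop-in replacement for true sample weights:

# Minimal sketch with synthetic data: scaling each row by sqrt(w_i) makes the squared-error
# term of the elastic-net objective equal to the weighted one. The alpha * penalty term is
# NOT rescaled, so keep the weights at mean 1 to leave alpha on a comparable scale, and
# handle the intercept yourself (here the data has none, hence fit_intercept=False).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)
w = rng.uniform(0.5, 2.0, size=100)        # per-sample importance weights

w = w / w.mean()                           # keep the overall loss scale unchanged
sqrt_w = np.sqrt(w)
X_w = X * sqrt_w[:, None]                  # scale every row of X by sqrt(w_i)
y_w = y * sqrt_w                           # and the corresponding target

model = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X_w, y_w)
print(model.coef_)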
I have two data sets of different sizes.
1) Data set 1 is of high dimension, with 4500 samples (sketches).
2) Data set 2 is of low dimension, with 1000 samples (real data).
I assume that both data sets have the same distribution.
I want to train a non-linear SVM model using sklearn on the first data set (as a pre-training), and after that I want to update the model on part of the second data set (to fit the model).
How can I do this kind of update in sklearn? How can I update an SVM model?
In sklearn you can do this only for a linear kernel, using SGDClassifier (with an appropriate selection of loss/penalty terms: the loss should be hinge and the penalty L2). Incremental learning is supported through the partial_fit method, which is implemented for neither SVC nor LinearSVC.
Unfortunately, in practice fitting an SVM incrementally on such small datasets is rather useless. The SVM has an easily obtainable global solution, so you do not need pretraining of any form; in fact it should not matter at all, if you are thinking about pretraining in the neural-network sense. If correctly implemented, the SVM should completely forget the previous dataset. Why not learn on the whole data in one pass? That is what an SVM is supposed to do, unless you are working with some non-convex modification of SVM (then pretraining makes sense).
To sum up:
From a theoretical and practical point of view there is no point in pretraining an SVM. You can either learn only on the second dataset, or on both at the same time. Pretraining is only reasonable for methods which suffer from local minima (or hard convergence of any kind) and thus need to start near the actual solution to find a reasonable model (like neural networks). SVM is not one of them.
You can use incremental fitting (although in sklearn it is very limited) for efficiency reasons, but for such a small dataset you will be just fine fitting the whole dataset at once.
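For completeness, a minimal sketch (synthetic data) of the SGDClassifier route mentioned above: hinge loss plus an L2 penalty gives a linear SVM, and partial_fit lets you keep updating the same model with new batches of data.

# Minimal sketch with synthetic data standing in for the two data sets.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(4500, 20)), rng.integers(0, 2, size=4500)  # "sketches"
X2, y2 = rng.normal(size=(1000, 20)), rng.integers(0, 2, size=1000)  # "real data"

clf = SGDClassifier(loss="hinge", penalty="l2")      # linear SVM via SGD
clf.partial_fit(X1, y1, classes=np.array([0, 1]))    # classes required on the first call
clf.partial_fit(X2, y2)                              # later update on the second data set

print(clf.score(X2, y2))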
I have a classification problem (predicting whether a sequence belongs to a class or not), for which I decided to use multiple classification methods, in order to help filter out the false positives.
(The problem is in bioinformatics: classifying protein sequences as Neuropeptide precursor sequences. Here's the original article if anyone's interested, and the code used to generate features and to train a single predictor.)
Now, the classifiers have roughly similar performance metrics (83-94% accuracy/precision/etc. on the training set with 10-fold CV), so my 'naive' approach was to simply use multiple classifiers (Random Forests, ExtraTrees, SVM (linear kernel), SVM (RBF kernel) and GRB) and to use a simple majority vote.
My question is:
How can I get the performance metrics for the different classifiers and/or their votes predictions?
That is, I want to see if using the multiple classifiers improves my performance at all, or which combination of them does.
My intuition is maybe to use the ROC score, but I don't know how to "combine" the results and get it for a combination of classifiers. (That is, to see what the ROC curve is for each classifier alone [already known], then to see the ROC curve or AUC on the training data for combinations of classifiers.)
(I currently filter the predictions using "predict probabilities" with the Random Forests and ExtraTrees methods, then arbitrarily filter out results with a predicted score below 0.85. An additional layer of filtering is "how many classifiers agree on this protein's positive classification".)
Thank you very much!!
(The website implementation, where we're using the multiple classifiers: http://neuropid.cs.huji.ac.il/ . The whole shebang is implemented using scikit-learn and Python. Citations and all!)
To evaluate the performance of the ensemble, simply follow the same approach you would normally. However, you will want to fix the 10-fold data set partitions first, and for each fold train every member of the ensemble on that same training split, measure the ensemble's accuracy on the held-out fold, rinse and repeat with the other folds, and then average to get the accuracy of the ensemble. So the key difference is not to cross-validate the individual algorithms independently when evaluating the ensemble. The important thing is not to let the ensemble see the test data, either directly or by letting one of its algorithms see the test data.
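A minimal sketch of that protocol with synthetic data and arbitrary hyperparameters: the same fixed 10-fold partition is used for every individual classifier and for the hard majority vote, so no model is ever trained on the fold it is tested on.

# Minimal sketch: one shared 10-fold partition for the members and the majority vote.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)   # the shared partition

members = [
    ("rf", RandomForestClassifier(random_state=0)),
    ("et", ExtraTreesClassifier(random_state=0)),
    ("svm_rbf", SVC(kernel="rbf", random_state=0)),
]
ensemble = VotingClassifier(estimators=members, voting="hard")

for name, clf in members + [("majority vote", ensemble)]:
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")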
Note also that RF and Extra Trees are already ensemble algorithms in their own right.
An alternative approach (again making sure the ensemble never sees the test data) is to take the probabilities and/or labels output by your classifiers and feed them into another classifier (say a DT, RF, SVM, or whatever) that produces a prediction by combining the best guesses from these other classifiers. This is termed "stacking".
You can use a linear regression for stacking. For each of the 10 folds, you can split the data into:
8 folds for training
1 fold for validation
1 fold for testing
Optimise the hyper-parameters for each algorithm using the training set and validation set, then stack your predictions by fitting a linear regression (or a logistic regression) over the validation set. Your final model will be p = a_0 + a_1 p_1 + … + a_K p_K, where K is the number of classifiers, p_k is the probability given by model k, and a_k is the weight of model k. You can also use the predicted outcomes directly, if a model doesn't give you probabilities.
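A minimal sketch of that blending step with synthetic data (arbitrary base models and split sizes, not the ones from the question): the base classifiers are fit on the training portion, their validation-set probabilities become the inputs p_k, and a logistic regression learns the weights a_k.

# Minimal sketch: fit base models on the training split, blend their probabilities
# with a logistic regression fitted on the validation split, score on the test split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

base = [
    RandomForestClassifier(random_state=0),
    GradientBoostingClassifier(random_state=0),
    SVC(kernel="rbf", probability=True, random_state=0),
]
for clf in base:
    clf.fit(X_train, y_train)

# Columns are p_1 ... p_K: each base model's probability for the positive class.
P_val = np.column_stack([clf.predict_proba(X_val)[:, 1] for clf in base])
P_test = np.column_stack([clf.predict_proba(X_test)[:, 1] for clf in base])

blender = LogisticRegression().fit(P_val, y_val)      # learns the weights a_k
print("stacked AUC:", roc_auc_score(y_test, blender.predict_proba(P_test)[:, 1]))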
If your models are the same, you can optimise the parameters of the models and the weights at the same time.
If there are obvious differences in your data, you can use different bins with different parameters for each. For example, one bin could be short sequences and the other long sequences, or different types of proteins.
You can use whatever metric you want, as long as it makes sense, just as for non-blended algorithms.
You may want to look at the 2007 BellKor solution to the Netflix challenge, the section on Blending. In 2008 and 2009 they used more advanced techniques, which may also be interesting for you.