Multi-target regression using scikit-learn - python

I am solving a classic regression problem using Python and the scikit-learn library. It's simple:
ml_model = GradientBoostingRegressor()
ml_params = {}
ml_model.fit(X_train, y_train)
where y_train is a one-dimensional array-like object.
Now I would like to expand the task so that it predicts not a single target value but a set of them. The training set of samples X_train will remain the same.
An intuitive solution is to train several models, where X_train is the same for all of them but y_train is specific to each model. This definitely works, but it seems inefficient to me.
When searching for alternatives, I came across the concept of multi-target regression. As far as I understand, such functionality is not implemented in scikit-learn.
How can I solve a multi-target regression problem in Python efficiently? Thanks)

It depends on the problem you are solving, the training data you have, and the algorithm you choose to find a solution. It's really hard to suggest anything without knowing all the details. You could try a random forest as a starting point. It's a very powerful and robust algorithm that resists overfitting when you don't have much data, and it can also be used for multi-target regression. Here is a working example:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
X, y = make_regression(n_targets=2)
print('Feature vector:', X.shape)
print('Target vector:', y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
print('Build and fit a regressor model...')
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print('Done. Score', score)
Output:
Feature vector: (100, 100)
Target vector: (100, 2)
Build and fit a regressor model...
Done. Score 0.4405974071273537
This algorithm natively supports multi-target regression. For estimators that don't, you can use MultiOutputRegressor, which simply fits one regressor per target.
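For example, the GradientBoostingRegressor from the question doesn't natively handle multiple targets, but it can be wrapped in sklearn.multioutput.MultiOutputRegressor. A minimal sketch on the same kind of synthetic data as above:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor

X, y = make_regression(n_targets=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

# One GradientBoostingRegressor is fitted per target column.
model = MultiOutputRegressor(GradientBoostingRegressor())
model.fit(X_train, y_train)
print('Score:', model.score(X_test, y_test))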

Another alternative to the random forest approach would be an adapted version of Support Vector Regression that handles multi-target regression problems. The advantage over fitting SVR with MultiOutputRegressor is that this method takes the underlying correlations between the multiple targets into account and hence should perform better.
A working implementation with a paper reference can be found here.
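For comparison, the per-target SVR baseline mentioned above (the one that ignores correlations between targets) would look roughly like this sketch, reusing X_train and y_train from the random forest example; the adapted multi-target SVR itself is not part of scikit-learn:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Baseline: one independent SVR per target; correlations between targets are ignored.
svr_baseline = MultiOutputRegressor(SVR(kernel='rbf', C=1.0))
svr_baseline.fit(X_train, y_train)
print('Baseline score:', svr_baseline.score(X_test, y_test))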

Related

Improve speed of scikit-learn multinomial logistic regression

I am trying to train a logistic regression model in scikit-learn and it is taking very long to train, around 2 hours. The size of the dataset is 21613 x 19. I am new to scikit-learn, so I don't know whether my code is wrong or whether it just takes that long to train. Any suggestion on how to improve the training speed would be very much appreciated!
The code used to train is below:
# get the LogisticRegression estimator
from sklearn.linear_model import LogisticRegression
# training the model
# apply algorithm to data using fit()
clf = LogisticRegression(solver='newton-cg',multi_class='multinomial')
clf.fit(X_train,y_train)
If you have a specific reason for using this solver, one thing you can do is parallelize the computations by setting the n_jobs=-1 argument.
If you're open to using other solvers, you can use faster solvers with a one-versus-rest strategy. For instance:
clf = LogisticRegression(solver='liblinear', multi_class='ovr')
It's all in the documentation, which can help guide your choice of solver:
solver : {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, default=’lbfgs’
Algorithm to use in the optimization problem.
For small datasets, ‘liblinear’ is a good choice, whereas ‘sag’ and
‘saga’ are faster for large ones.
For multiclass problems, only ‘newton-cg’, ‘sag’, ‘saga’ and ‘lbfgs’
handle multinomial loss; ‘liblinear’ is limited to one-versus-rest
schemes.
‘newton-cg’, ‘lbfgs’, ‘sag’ and ‘saga’ handle L2 or no penalty
‘liblinear’ and ‘saga’ also handle L1 penalty
‘saga’ also supports ‘elasticnet’ penalty
‘liblinear’ does not support setting penalty='none'
It's probably that slow because of the solver you have chosen. newton-cg is a Newton method; it is slow for large datasets because it computes the second derivatives (the Hessian). Use a different solver such as sag or saga; they are fast for big datasets.
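A sketch of that change, keeping the rest of the code from the question (the max_iter value is only an illustrative guess; first-order solvers may need more iterations, and they converge faster on scaled features):
from sklearn.linear_model import LogisticRegression

# 'saga' is a first-order solver that scales to large datasets and supports
# the multinomial loss.
clf = LogisticRegression(solver='saga', multi_class='multinomial', max_iter=1000)
clf.fit(X_train, y_train)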
You might want to change your solver. The documentation says that scikit-learn has 5 different solvers: 'sag', 'saga', 'newton-cg', 'lbfgs', and 'liblinear' (the last is not suitable for the multinomial case).
import time
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Set training and validation sets
X, y = make_classification(n_samples=1000000, n_features=19, n_classes = 8, n_informative=8)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1000)
# Solvers
solvers = ['newton-cg', 'sag', 'saga', 'lbfgs']
for sol in solvers:
    start = time.time()
    logreg = LogisticRegression(solver=sol, multi_class='multinomial')
    logreg.fit(X_train, y_train)
    end = time.time()
    print(sol + " Fit Time: ", end - start)
On a MacBook with 16 GB RAM and 8 cores, the timings show that choosing the right solver for a problem can save a lot of time (code adapted from here). To determine which solver is right for your problem, check out the table in the documentation to learn more (notice that 'newton-cg' is not faster for large datasets).

Is it possible to tune the linear regression (hyper)parameter in sklearn

I'm starting to learn a bit of scikit-learn and ML in general, and I'm running into a problem.
I've created a model using linear regression.
The .score is good (above 0.8), but I want to get it higher (perhaps up to 0.9).
I've searched the documentation of sklearn and googled this question but I cannot seem to find the answer.
My question is: is it possible to tune the LinearRegression model?
And if so, where can I find out how?
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#----- Forecast in hours -----#
forecast_out = 48

#----- Import and prep data -----#
# (pandas is used here to build the `data` DataFrame and the X and y arrays)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

#----- Linear Regression -----#
lr = LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
lr.fit(x_train, y_train)
lr_confidence = lr.score(x_test, y_test)
print("lr confidence: ", lr_confidence)

x_forecast = np.array(data.drop(['Prediction'], axis=1))[-forecast_out:]
lr_prediction = lr.predict(x_forecast)
There is always room for improvement. The LinearRegression model does have parameters: use .get_params() to find out the parameter names and their default values, and then use .set_params(**params) to set values from a dictionary.
GridSearchCV and RandomizedSearchCV can help you tune them more systematically, and faster, than doing it by hand.
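A minimal sketch of that workflow, assuming the x_train and y_train arrays from the question (fit_intercept is essentially the only parameter of plain LinearRegression worth searching over):
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

lr = LinearRegression()
print(lr.get_params())              # inspect parameter names and current values
lr.set_params(fit_intercept=False)  # set parameters via keyword arguments

# Grid search over the few options LinearRegression actually exposes.
grid = GridSearchCV(LinearRegression(),
                    param_grid={'fit_intercept': [True, False]},
                    cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_, grid.best_score_)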
This is a very open-ended question and you should just look up the documentation. It's all there, really, trust me - I've looked. Just Google LinearRegression documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
It seems that sklearn.linear_model.LinearRegression does not have hyperparameters that can be tuned. Instead, consider sklearn.linear_model.SGDRegressor, which provides many possibilities for tuning hyperparameters.
Its documentation can be found here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html .
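A hedged sketch of what tuning SGDRegressor could look like, again assuming the x_train and y_train arrays from the question; the parameter values are only illustrative, and the scaler is included because SGD is sensitive to feature scale:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features inside a pipeline, then search over the SGD hyperparameters.
pipe = make_pipeline(StandardScaler(), SGDRegressor(max_iter=2000))
param_grid = {
    'sgdregressor__penalty': ['l2', 'l1', 'elasticnet'],
    'sgdregressor__alpha': [1e-5, 1e-4, 1e-3, 1e-2],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)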
No, it is not possible.
To tune hyperparameters for linear regression, try Lasso, Ridge, or ElasticNet instead.
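For instance, the *CV variants tune the regularization strength for you; a short sketch, again assuming the x_train and y_train arrays from the question:
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

# RidgeCV / LassoCV pick the regularization strength alpha internally.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(x_train, y_train)
lasso = LassoCV(cv=5).fit(x_train, y_train)
print('ridge alpha:', ridge.alpha_, 'lasso alpha:', lasso.alpha_)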

Using statsmodels OLS on a test-set

I would like to use a technique from scikit-learn, namely ShuffleSplit, to benchmark my linear regression model with a sequence of randomized test and train sets. This is well established and works great for LinearRegression in scikit-learn using:
from sklearn.linear_model import LinearRegression
LM = LinearRegression()
LM.fit(X[train_index], Y[train_index])
train_score = LM.score(X[train_index], Y[train_index])
test_score = LM.score(X[test_index], Y[test_index])
The score one gets here is only the R² value and nothing more. The statsmodels OLS implementation for linear models gives a very rich set of scores, among which are adjusted R², AIC, BIC, etc. However, here one can only fit the model on the training data to get these scores. Is there a way to get them for the test set as well?
So in my example:
from sklearn.model_selection import ShuffleSplit
from statsmodels.regression.linear_model import OLS
ss = ShuffleSplit(n_splits=40, train_size=0.15, random_state=42)
for train_index, test_index in ss.split(X):
    regr = OLS(Y[train_index], X[train_index]).fit()
    train_score_AIC = regr.aic
Is there a way to add something like:
test_score_AIC = regr.test(Y[test_index], X[test_index]).aic
Most of those measures are goodness-of-fit measures that are built into the model/results classes and are only available for the training data or estimation sample.
Many of those measures are not well defined as out-of-sample, predictive accuracy measures, or I have never seen definitions that would fit that case.
Specifically, loglike is a method of the model and can only be evaluated at the attached training sample.
related issues:
https://github.com/statsmodels/statsmodels/issues/2572
https://github.com/statsmodels/statsmodels/issues/1282
It would be possible to partially work around the current limitations of statsmodels, but none of those workarounds are currently supported and unit tested.
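As a rough sketch of one such partial workaround (using the X and Y arrays from the question): you can always predict on the held-out rows yourself and compute simple out-of-sample measures by hand, although this does not reproduce statsmodels' own AIC/BIC machinery:
import numpy as np
from sklearn.model_selection import ShuffleSplit
from statsmodels.regression.linear_model import OLS

ss = ShuffleSplit(n_splits=40, train_size=0.15, random_state=42)
for train_index, test_index in ss.split(X):
    regr = OLS(Y[train_index], X[train_index]).fit()
    train_score_AIC = regr.aic                 # in-sample, as before

    # Out-of-sample: predict on the test rows and score the predictions manually.
    y_pred = regr.predict(X[test_index])
    resid = Y[test_index] - y_pred
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((Y[test_index] - np.mean(Y[test_index])) ** 2)
    test_r2 = 1.0 - ss_res / ss_tot            # plain out-of-sample R², not adjusted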

Evaluate Loss Function Value Getting From Training Set on Cross Validation Set

I am following Andrew Ng's instructions for evaluating an algorithm in classification:
Find the value of the loss function on the training set.
Compare it with the value of the loss function on the cross-validation set.
If both are close enough and small, go to the next step (otherwise, there is a bias or variance problem, etc.).
Make a prediction on the test set, using the resulting thetas (i.e. weights) produced in the previous step, as a final confirmation.
I am trying to apply this using the scikit-learn library; however, I am really lost and sure that I am doing something wrong (I didn't find anything similar online):
from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets
def main():
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))
    print(results)
Error
ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not sure this is even the right way to start. Any help is very much appreciated.
Given the clarifications you provide in the comments and that you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and go for the accuracy instead:
from sklearn import model_selection, svm
from sklearn import datasets
iris = datasets.load_iris()
kfold = model_selection.KFold(n_splits=10, random_state=42)
model= svm.SVC(kernel='linear', C=1)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy")  # change
As already mentioned in the comments, inclusion of log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
This kind of error appears often when you do cross-validation.
Basically, your data is split into n_splits = 10 folds, and some classes are missing from some of these splits. For example, your 9th split may not have any training examples for class number 2.
So when you evaluate your loss, the classes present in your predictions and in the test fold do not match: you cannot compute the loss if y_true has 3 classes and your model was trained to predict only 2.
What do you do in this case?
You have four possibilities:
Shuffle your data: KFold(n_splits=10, random_state=42, shuffle=True)
Make n_splits bigger.
Provide the list of labels explicitly to the loss function, as follows:
args_loss = { "labels": [0,1,2] }
make_scorer(log_loss, greater_is_better=False,**args_loss)
Cherry-pick your splits to make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does. A sketch combining options 1 and 3 is shown right after this list.
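A hedged sketch combining options 1 and 3 on the code from the question. Note that log_loss also needs class probabilities, so probability estimates are enabled on the SVC and needs_proba=True is passed to make_scorer (in recent scikit-learn versions this argument is replaced by response_method='predict_proba'):
from sklearn import datasets, model_selection, svm
from sklearn.metrics import make_scorer, log_loss

iris = datasets.load_iris()

# Option 1: shuffle, so every fold is likely to contain all three classes.
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=42)

# Option 3: pass the full label list to log_loss via make_scorer.
model = svm.SVC(kernel='linear', C=1, probability=True, random_state=42)
scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True,
                     labels=[0, 1, 2])

results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring=scorer)
print(results)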
Just for future readers who are following Andrew's course:
K-Fold is not really applicable for this purpose, because what we mainly want is to evaluate the thetas (i.e. weights) produced by a certain algorithm with some parameters on the cross-validation set, by comparing the cost functions J(train) vs J(CV) to determine whether the model suffers from bias, from variance, or is OK.
K-Fold, by contrast, is mainly for testing the prediction on the CV folds using the weights produced from training the model on the training set.
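A minimal sketch of that train-vs-CV comparison on the same data, using a single held-out cross-validation set rather than K-Fold (the 70/30 split and the SVC settings are only illustrative):
from sklearn import datasets, svm
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_cv, y_train, y_cv = train_test_split(iris.data, iris.target,
                                                test_size=0.3, random_state=42,
                                                stratify=iris.target)

model = svm.SVC(kernel='linear', C=1, probability=True, random_state=42)
model.fit(X_train, y_train)

# Compare the cost on the training set with the cost on the cross-validation set.
j_train = log_loss(y_train, model.predict_proba(X_train), labels=[0, 1, 2])
j_cv = log_loss(y_cv, model.predict_proba(X_cv), labels=[0, 1, 2])
print('J(train) =', j_train, 'J(CV) =', j_cv)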

Python - scikit-learn: how to specify a validation subset in decision and regression trees?

I am trying to build decision trees and regression trees with Python. I am using scikit-learn, but am open to alternatives.
What I don't understand about this library is whether a training and a validation subset can be provided, so that the library builds the model on the training subset, tests it on the validation subset, and stops splitting based on some rule (typically when additional splits don't improve performance on the validation subset; this prevents overfitting).
For example, this is what the JMP software does (http://www.jmp.com/support/help/Validation_2.shtml#1016975).
I found no mention of how to use a validation subset in the official website (http://scikit-learn.org/stable/modules/tree.html), nor on the internet.
Any help would be most welcome! Thanks!
There is a fairly rich set of cross-validation routines and examples in the cross-validation section of the scikit-learn user guide.
Note that a lot of progress has been made on cross validation between scikit-learn versions 0.14 and 0.15, so I recommend upgrading to 0.15 if you haven't already.
As jme notes in his comment, some of the cross-validation capability has also been incorporated into the grid search and pipeline capabilities of scikit-learn.
For completeness in answering your question, here is a simple example; more advanced k-fold, leave-one-out, leave-p-out, shuffle-split, etc. are all available:
import numpy as np
from sklearn import cross_validation  # sklearn.model_selection in newer scikit-learn versions
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape
# ((150, 4), (150,))

X_train, X_test, y_train, y_test = cross_validation.train_test_split(iris.data,
                                                                     iris.target,
                                                                     test_size=0.4,
                                                                     random_state=0)
X_train.shape, y_train.shape
# ((90, 4), (90,))
X_test.shape, y_test.shape
# ((60, 4), (60,))

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
# 0.96...
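Beyond the simple split above, the grid-search route mentioned earlier is the usual scikit-learn way to pick the tree size from validation performance, rather than stopping splits during growth as JMP does. A sketch using the newer model_selection API, with illustrative parameter values:
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Cross-validated search over tree size instead of validation-based early stopping.
search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                      param_grid={'max_depth': [2, 3, 4, 5, 6, None],
                                  'min_samples_leaf': [1, 5, 10, 20]},
                      cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.score(X_test, y_test))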
I hope this helps... Good luck!
