How can I pass sample weights to Pipelines within a VotingRegressor? - python

I'm trying to pass sample weights to a scikit learn ensemble with the following structure, but I can't find a way to navigate the interaction between the VotingRegressor and the Pipeline.
ensemble = VotingRegressor([
    ('m1Pipeline', Pipeline([('getFeaturesModel1', feature_transformer_1), ('m1', Model1())])),
    ('m2Pipeline', Pipeline([('getFeaturesModel2', feature_transformer_2), ('m2', Model2())]))
])
It's designed this way because I need to provide specific features to the first model, and specific features to the second model (which are different from the first model), and I need to average their outputs.
First, I tried passing an overall sample weight, since both underlying models support it, and therefore I expected the VotingRegressor to accept it at the top level:
ensemble.fit(X,Y,sample_weight=weights)
ValueError: Pipeline.fit does not accept the sample_weight parameter. You can pass parameters to specific steps of your pipeline using the stepname__parameter format, e.g. Pipeline.fit(X, y, logisticregression__sample_weight=sample_weight).
I then tried to pass the sample weights to the individual models:
ensemble.fit(X,Y,m1Pipeline__m1__sample_weight=weights,m2Pipeline__m2__sample_weight=weights)
fit_params={'vr__en__modelEN__sample_weight':weights,'vr__rf__modelRF__sample_weight':weights})
TypeError: fit() got an unexpected keyword argument 'model1Pipeline__model1__sample_weight'
By the way, I tried extending this design pattern, which does work:
ensemble = VotingRegressor([('model1', Model1()), ('model2', Model2())])
ensemble.fit(X,Y,sample_weight=weights)
Any suggestions on how I can accomplish this would be much appreciated!
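One possible workaround, offered only as a sketch: give the VotingRegressor a Pipeline subclass whose fit() accepts sample_weight directly and forwards it to the final step using the stepname__parameter format mentioned in the error message. The class name SampleWeightPipeline and the helper select are hypothetical. LinearRegression and Ridge stand in for Model1 and Model2, and a ColumnTransformer stands in for the feature transformers. This assumes metadata routing is left at its default (disabled) setting.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline


class SampleWeightPipeline(Pipeline):
    """Hypothetical helper: a Pipeline whose fit() accepts sample_weight."""

    def fit(self, X, y=None, sample_weight=None, **fit_params):
        if sample_weight is not None:
            # Forward the weights to the last step as <step name>__sample_weight.
            final_step_name = self.steps[-1][0]
            fit_params[f"{final_step_name}__sample_weight"] = sample_weight
        return super().fit(X, y, **fit_params)


def select(columns):
    # Stand-in for feature_transformer_1 / feature_transformer_2.
    return ColumnTransformer([("select", "passthrough", columns)])


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=50)
weights = rng.uniform(0.5, 2.0, size=50)

ensemble = VotingRegressor([
    ("m1Pipeline", SampleWeightPipeline([
        ("getFeaturesModel1", select([0, 1])), ("m1", LinearRegression())])),
    ("m2Pipeline", SampleWeightPipeline([
        ("getFeaturesModel2", select([2, 3])), ("m2", Ridge())])),
])

# The top-level sample_weight now reaches the final step of each pipeline.
ensemble.fit(X, Y, sample_weight=weights)
predictions = ensemble.predict(X)
```

Because the subclass exposes sample_weight in its fit() signature, the VotingRegressor treats each pipeline like any other estimator that supports sample weights. The transformer steps do not receive the weights in this sketch.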

Related

Custom Criterion for DecisionTreeRegressor in sklearn

I want to use a DecisionTreeRegressor for multi-output regression, but I want to use a different "importance" weight for each output (e.g. predicting y1 accurately is twice as important as predicting y2).
Is there a way of including these weights directly in the DecisionTreeRegressor of sklearn? If not, how can I create a custom MSE criterion with different weights for each output in sklearn?
I am afraid you can only provide one set of weights when you fit:
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor.fit
More disappointingly, since only one set of weights is allowed, the algorithms in sklearn are all built around a single weight set.
As for custom criterion:
There is a similar issue in scikit-learn
https://github.com/scikit-learn/scikit-learn/issues/17436
A potential solution is to create a criterion class mimicking an existing one (e.g. MAE) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
However, if you look at the code in detail, you will find that all the weight-related variables hold a single weight set, which is not specific to the individual outputs (tasks).
So to customize this, you may need to hack a lot of code, including:
Hack the fit function to accept a 2D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_classes.py#L142
Bypass the input checks (otherwise you have to keep hacking...)
Modify the tree builder to allow the weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L111
This is painful: there are a lot of related variables, and you would have to change double to double*.
Modify the Criterion class to accept a 2D array of weights
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L976
In init, reset and update, you have to keep attributes such as self.weighted_n_node_samples specific to each output (task).
To be honest, I think this is really difficult to implement. It may be worth raising an issue with the scikit-learn maintainers.
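A lighter-weight alternative, which is not part of the answer above and is only a sketch: with the default squared-error criterion, the impurity of a multi-output tree combines the per-output squared errors, so multiplying output column k by sqrt(w_k) before fitting makes that output's squared error count w_k times as much when splits are chosen. Predictions are then divided by the same factors to return to the original units. The data and the name output_weights are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic two-output regression problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Y = np.column_stack([
    X[:, 0] + 0.1 * rng.normal(size=200),       # y1
    X[:, 1] ** 2 + 0.1 * rng.normal(size=200),  # y2
])

# Assumed importance weights: y1 matters twice as much as y2.
output_weights = np.array([2.0, 1.0])
scale = np.sqrt(output_weights)

# Fit on rescaled targets so squared errors on y1 are weighted by 2.
tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X, Y * scale)

# Leaf values are means of the scaled targets, so dividing undoes the scaling.
Y_pred = tree.predict(X) / scale
```

This only changes which splits the tree prefers. It does not apply to criteria that are not based on squared error.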

How to get the number of features from a fitted scikit-learn model?

I am trying to extract the number of features from a model after I have fitted it to my data.
I have looked through the model's attribute listing and found ways to get the number only for specific models (e.g. looking at the dimensions of the support vectors for an SVM), but I didn't find a general way that I could use for any type of model.
Say I have my dataset of instances and corresponding classes
X, y # dataset
and use an arbitrary model from the scikit-learn library to fit this data
model.fit(X,y)
Later I want to use this model to find the dimensions of the original dataset, something in the way of
model.n_features_
Is there a quick and general way to do this?
There is no single common attribute for all classifiers in sklearn.
I would recommend the following:
For any sklearn.linear_model estimator, or sklearn.svm.SVC with a linear kernel, you can use the following approach.
>>> clf.coef_.shape[-1]
For any tree-based model (DecisionTreeClassifier/RandomForestClassifier/GradientBoostingClassifier), you can use the following, although n_features_ was deprecated in version 1.0 and later removed:
>>> clf.n_features_
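Update: recent scikit-learn versions expose common attributes on fitted estimators:
n_features_in_: int
Number of features seen during fit. New in version 0.24.
feature_names_in_: ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings. New in version 1.0.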
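The approaches above can be combined into one small helper. This is a minimal sketch: the function name get_n_features is hypothetical, and it assumes that estimators lacking n_features_in_ expose a coef_ array whose last dimension is the number of features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def get_n_features(model):
    """Return the number of features a fitted model was trained on."""
    # Preferred: the common attribute set by recent scikit-learn versions.
    n_features = getattr(model, "n_features_in_", None)
    if n_features is not None:
        return int(n_features)
    # Fallback for linear models: last dimension of the coefficient array.
    coef = getattr(model, "coef_", None)
    if coef is not None:
        return int(np.asarray(coef).shape[-1])
    raise AttributeError("Could not determine the number of features.")


# Synthetic data with 5 features, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] > 0).astype(int)

n_linear = get_n_features(LogisticRegression().fit(X, y))
n_forest = get_n_features(RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y))
```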

Simple way to load specific sample using Pytorch dataloader

I am currently training a 3D CNN for binary classification with relatively sparse labels (~ 1% of voxels in label data correspond to target class).
In order to perform basic sanity checks during the training (e.g. does the network learn at all?) it would be handy to present the network with a small, handpicked subset of training examples having an above-average fraction of target class labels.
As suggested by the PyTorch documentation, I implemented my own dataset class (inheriting from torch.utils.data.Dataset) which provides training examples via its __getitem__ method to the torch.utils.data.DataLoader.
In the pytorch tutorials I found, the DataLoader is used as an iterator to generate the training loop like so:
for i, data in enumerate(self.dataloader):
    # Get training data
    inputs, labels = data
    # Train the network
    # [...]
What I am wondering now is whether there exists a simple way to load a single specific training example, or a couple of them (using the linear index understood by the Dataset's __getitem__ method). However, DataLoader does not have a __getitem__ method, and repeatedly calling __next__ until I reach the desired index does not seem elegant.
Apparently one possible way to solve this would be to define a custom sampler or batch_sampler inheriting from the abstract torch.utils.data.Sampler. But this seems over the top to retrieve a few specific samples.
I suppose I am overlooking something very simple and obvious here. Any advice appreciated!
Just in case anyone with a similar question comes across this at some point:
The quick-and-dirty workaround I ended up using was to bypass the dataloader in the training loop by directly accessing its associated dataset attribute. Suppose we want to quickly check whether our network learns at all by repeatedly presenting it with a single, handpicked training example with linear index sample_idx (as defined by the dataset class).
Then one can do something like this:
for i, _ in enumerate(self.dataloader):
    # Get training data
    # inputs, labels = data
    inputs, labels = self.dataloader.dataset[sample_idx]
    inputs = inputs.unsqueeze(0)
    labels = labels.unsqueeze(0)
    # Train the network
    # [...]
EDIT:
One brief remark, since some people seem to be finding this workaround helpful: when using this hack I found it crucial to instantiate the DataLoader with num_workers=0. Otherwise, memory segmentation errors might occur, in which case you could end up with very weird-looking training data.
If you have defined
train_set = torchvision.datasets.CIFAR10(root='~/datasets/', train=True,
download=True, transform=(transform['train']))
then you can do something like
train_set.data[index], where index is the index of the specific example you want.
Now you can redefine your Dataset class with a new dataset that includes only these specific examples, and there you have it.
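Another option that avoids writing a custom sampler, shown here only as a sketch: torch.utils.data.Subset wraps an existing dataset and exposes just the chosen indices, so a regular DataLoader can iterate over the handpicked examples. ToyDataset and the index list handpicked are made up for illustration.

```python
import torch
from torch.utils.data import DataLoader, Dataset, Subset


class ToyDataset(Dataset):
    """Illustrative dataset: input i is the value i, label is i modulo 2."""

    def __init__(self, n=100):
        self.inputs = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        self.labels = torch.arange(n) % 2

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.labels[idx]


dataset = ToyDataset()
handpicked = [3, 17, 42]  # linear indices as understood by __getitem__

# The loader only ever sees the handpicked examples, in the given order.
loader = DataLoader(Subset(dataset, handpicked),
                    batch_size=len(handpicked), shuffle=False)

inputs, labels = next(iter(loader))
```

The training loop itself stays unchanged, since the loader is still an ordinary DataLoader.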

How to train statsmodels.tsa.ARIMA model with multiple series

The usual way to fit an ARIMA model with the statsmodels python package is:
model = statsmodels.tsa.ARMA(series, order=(2,2))
result = model.fit(trend='nc', disp=1)
However, I have multiple time series to train on, say from the same underlying process. How can I do that?
When you say multiple time series data, it is not clear whether they are of the same type. There is no straightforward way to specify multiple series in an ARMA model. However, you could use the optional 'exog' variable to supply the second series.
Please refer for the actual definition of ARMA model.
model = statsmodels.tsa.ARMA(endog = series1, exog=series2, order=(2,2))
Please refer for the explanation of the endog, exog variables.
Please see a working example of how this could be implemented

GridSearchCV: passing weights to a scorer

I am trying to find an optimal parameter set for an XGB_Classifier using GridSearchCV.
Since my data is very unbalanced, both fitting and scoring (in cross_validation) must be performed using weights, therefore I have to use a custom scorer, which takes a 'weights' vector as a parameter.
However, I can't find a way to have GridSearchCV pass 'weights' vector to a scorer.
There were some attempts to add this functionality to gridsearch:
https://github.com/ndawe/scikit-learn/commit/3da7fb708e67dd27d7ef26b40d29447b7dc565d7
But they were not merged into master and now I am afraid that this code is not compatible with upstream changes.
Has anyone faced a similar problem and is there any 'easy' way to cope with it?
You could manually balance your training dataset as in the answer to Scikit-learn balanced subsampling
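A minimal sketch of what manual balancing by undersampling could look like, using only NumPy. The function name balanced_subsample is hypothetical and the data is synthetic: every class is reduced to the size of the smallest class by sampling without replacement.

```python
import numpy as np


def balanced_subsample(X, y, random_state=0):
    """Undersample so that every class has as many rows as the rarest class."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = counts.min()
    # Draw n_per_class row indices from each class, without replacement.
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]


# Synthetic imbalanced data: 17 negatives and 3 positives.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 17 + [1] * 3)

X_bal, y_bal = balanced_subsample(X, y)
```

The balanced arrays can then be passed to GridSearchCV as usual, at the cost of discarding part of the majority class.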
