ConvergenceWarning: Regressors in active set degenerate - python

I am running various regressions in Python with a large number of variables. To get a sparser variable selection, I implemented a relaxed Lasso (https://relaxedlasso.readthedocs.io/en/latest/content.html#implementation).
The code works fine: I get a sparser variable selection and reasonable R² scores. The code is shown below:
# best_alpha and best_theta are placeholder names for the values selected by RelaxedLassoLarsCV
relaxed_lasso = RelaxedLassoLars(alpha=best_alpha, theta=best_theta).fit(X_train, y_train.values.ravel())
print("Training set score: {:.2f}".format(relaxed_lasso.score(X_train, y_train.values.ravel())))
print("Test set score: {:.2f}".format(relaxed_lasso.score(X_test, y_test)))
print("Number of features used: {}".format(np.sum(relaxed_lasso.coef_ != 0)))
But I always get the following warning:
ConvergenceWarning: Regressors in active set degenerate. Dropping a regressor, after 10 iterations, i.e. alpha=1.165e-05, with an active set of 10 regressors, and the smallest cholesky pivot element being 1.825e-08. Reduce max_iter or increase eps parameters.
I would like to know what this means for the validity of my result. I couldn't find any meaningful explanation online, which is why I'm asking here.
Thanks!

Related

How are the votes of individual trees calculated for Random Forest and Extra Trees in Sklearn?

I have been constructing my own Extra Trees (XT) classifier in Rust for binary classification. To verify the correctness of my classifier, I have been comparing it against Sklearn's implementation of XT, but I constantly get different results. At first I thought there must be a bug in my code, but now I realize it is not a bug but a different method of calculating votes among the trees in the ensemble. In my code, each tree votes based on the most frequent classification in a leaf's subset of data. For example, if we traverse a tree and arrive at a leaf node that has 40 classifications of 0 and 60 classifications of 1, the tree classifies the data as 1.
Looking at Sklearn's documentation for XT (as seen here), I read the following line regarding the predict method:
The predicted class of an input sample is a vote by the trees in the forest, weighted by their probability estimates. That is, the predicted class is the one with highest mean probability estimate across the trees.
While this gives me some idea about how individual trees vote, I still have more questions. Perhaps an exact mathematical expression of how these weights are calculated would help, but I have yet to find one in the documentation.
I will provide more details in the upcoming paragraphs, but I wish to ask my question concisely here. How are these weights calculated at a high level, and what is the mathematics behind them? Is there a way to change how individual XT trees calculate their votes?
---------------------------------------- Additional Details -----------------------------------------------
For my current tests, this is how I build my classifier:
classifier = ExtraTreesClassifier(n_estimators=5, criterion='gini',
                                  max_depth=1, max_features=5, random_state=0)
To predict unseen transactions X, I use classifier.predict(X). Digging through the source code of predict (seen here, line 630-ish), I see that this is all the code that executes for binary classification:
proba = self.predict_proba(X)
if self.n_outputs_ == 1:
    return self.classes_.take(np.argmax(proba, axis=1), axis=0)
What this code is doing is relatively obvious to me. It merely determines the most likely classification of transactions by taking the argmax of proba. What I fail to understand is how this proba value is made in the first place. I believe that the predict_proba method that predict uses is defined here at line 650-ish. Here is what I believe the relevant source code to be:
check_is_fitted(self)
# Check data
X = self._validate_X_predict(X)
# Assign chunk of trees to jobs
n_jobs, _, _ = _partition_estimators(self.n_estimators, self.n_jobs)
# avoid storing the output of every estimator by summing them here
all_proba = [np.zeros((X.shape[0], j), dtype=np.float64)
             for j in np.atleast_1d(self.n_classes_)]
lock = threading.Lock()
Parallel(n_jobs=n_jobs, verbose=self.verbose,
         **_joblib_parallel_args(require="sharedmem"))(
    delayed(_accumulate_prediction)(e.predict_proba, X, all_proba,
                                    lock)
    for e in self.estimators_)
for proba in all_proba:
    proba /= len(self.estimators_)
if len(all_proba) == 1:
    return all_proba[0]
else:
    return all_proba
I fail to understand what exactly is being calculated here. This is where my trail goes a bit cold and I get confused, and find myself in need of help.
Trees can predict probability estimates, according to the training sample proportions in each leaf. In your example, the probability of class 0 is 0.4, and 0.6 for class 1.
Random forests and extremely randomized trees in sklearn perform soft voting: each tree predicts the class probabilities as above, and then the ensemble just averages those across trees. That produces a probability for each class, and the predicted class is the one with the largest probability.
In the code, the relevant bit is _accumulate_prediction, which just sums the probability estimates, followed by the division by the number of estimators.
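To make the difference between this soft voting and per-tree majority voting concrete, here is a minimal pure-Python sketch. The leaf proportions are made-up numbers, and soft_vote and hard_vote are hypothetical helper names, not scikit-learn functions.

```python
from collections import Counter


def soft_vote(tree_probas):
    """Average per-tree class probabilities, then pick the most probable class."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    mean_proba = [sum(p[k] for p in tree_probas) / n_trees for k in range(n_classes)]
    return mean_proba.index(max(mean_proba)), mean_proba


def hard_vote(tree_probas):
    """Each tree votes for its own most frequent class; the majority wins."""
    votes = [p.index(max(p)) for p in tree_probas]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical leaf proportions [P(class 0), P(class 1)] reached in three trees.
tree_probas = [[0.10, 0.90], [0.55, 0.45], [0.55, 0.45]]

soft_class, mean_proba = soft_vote(tree_probas)
hard_class = hard_vote(tree_probas)
print("mean probabilities:", mean_proba)
print("soft vote:", soft_class, "| hard vote:", hard_class)
```

Here two of the three trees lean towards class 0, so majority voting returns 0, while the averaged probabilities favour class 1. A mismatch of this kind would be consistent with the discrepancy described in the question.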

MinMax scaling the target

I applied a linear regression on some features to predict the target, with 10-fold cross-validation.
MinMax scaling was applied to both the features and the target.
Then the features were standardized.
When I run the model, the r2 is 0.65 and the MSE is 0.02.
But when I use the target as it is, without MinMax scaling, I get the same r2 but the MSE increases a lot, to 18.
My question is: do we have to treat the target the same way as the features in terms of data preprocessing? And which of the values above is correct? The MSE gets quite a bit bigger without scaling the target.
Some people say we have to scale the target too, while others say no.
Thanks in advance.
Whether you scale your target or not will change the 'meaning' of your error. For example, consider 2 different targets, one ranged [0, 100] and another one [0, 10000]. If you run models against them (with no scaling), MSE of 20 would mean different things for the two models. In the former case it will be disastrous, while in the latter case it will be pretty decent.
So the fact that you get lower MSE with target range [0, 1] than the original is not surprising.
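At the same time, the r2 value is independent of the target's range: it is one minus the ratio of the residual sum of squares to the total sum of squares, and rescaling the target multiplies both by the same factor (provided the predictions rescale with it, as they do for linear regression).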
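A small pure-Python sketch of both effects, using made-up targets and predictions and assuming the predictions pass through the same min-max transform as the target. Under that assumption the MSE shrinks by the squared range of the target while r2 stays put. If scaling is the only difference between the two runs in the question, the ratio 18 / 0.02 = 900 would be consistent with a target range of about 30.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions on the original scale.
y_true = [10.0, 20.0, 25.0, 40.0]
y_pred = [12.0, 18.0, 29.0, 36.0]

lo, hi = min(y_true), max(y_true)


def scale(v):
    """Min-max transform fitted on the target."""
    return (v - lo) / (hi - lo)


y_true_s = [scale(v) for v in y_true]
y_pred_s = [scale(v) for v in y_pred]

mse_raw, mse_scaled = mse(y_true, y_pred), mse(y_true_s, y_pred_s)
r2_raw, r2_scaled = r2(y_true, y_pred), r2(y_true_s, y_pred_s)
print("MSE raw / scaled:", mse_raw, mse_scaled)
print("MSE ratio vs squared range:", mse_raw / mse_scaled, (hi - lo) ** 2)
print("r2 raw / scaled:", r2_raw, r2_scaled)
```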
Scaling allows you to compare model performance for different targets, among other things.
Also for some model types (like NNs) scaling would be more important.
Hope it helps!

Intuition behind nloglikelihood value in xgboost poisson run

When I use count:poisson instead of rmse, I see nloglikelihood values. I am not sure how to compare those numbers with rmse or mae.
Clearly, the lower the value the better, but I am not getting the intuitive sense of error that rmse or mae gives.
For example: train-poisson-nloglik:2.01885 val-poisson-nloglik:2.02898
Can we say here that the actual values differ by an error of 2.02?
Can someone explain with a small example?
Thanks.
There is a good post on the computation of the value here
Just to be more exhaustive, the value is:
mean(log(factorial(label)) + preds - label*log(preds))
If you compare with the true formula of the negative log-likelihood, it should be the sum instead of the mean. I guess that they chose to take the mean so that the train and the test values are more comparable.
Finally, to answer the question: the likelihood is the probability of observing the data under the distribution with a specific parameter. In the Poisson model, the parameters are just the set of predictions. So the better your predictions, the greater that probability and the smaller the associated negative log-likelihood.
rmse and mae are based on the expected difference between the prediction and the truth, whereas the negative log-likelihood looks at a probability, so a value of 2.02 is not an error of 2.02 in the units of the target.
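As a small example, here is a pure-Python sketch of the mean Poisson negative log-likelihood, with the constant term log(label!) computed via math.lgamma. The counts and predictions are made up, and poisson_nloglik is a hypothetical helper name rather than an XGBoost function.

```python
import math


def poisson_nloglik(labels, preds):
    """Mean Poisson negative log-likelihood; preds are predicted means (> 0)."""
    terms = [
        math.lgamma(y + 1.0) + mu - y * math.log(mu)  # lgamma(y + 1) == log(y!)
        for y, mu in zip(labels, preds)
    ]
    return sum(terms) / len(terms)


labels = [1, 2, 4, 7]              # made-up observed counts
good_preds = [1.5, 2.0, 3.0, 8.0]  # close to the labels
poor_preds = [6.0, 6.0, 1.0, 1.0]  # far from the labels

good = poisson_nloglik(labels, good_preds)
poor = poisson_nloglik(labels, poor_preds)
# Predictions exactly equal to the labels give the lowest value for this data.
perfect = poisson_nloglik(labels, [float(y) for y in labels])
print("good:", good, "poor:", poor, "perfect:", perfect)
```

Even the perfect predictions do not bring the value to zero, which is one way to see that the number is not a distance between predictions and labels. It is mainly useful for comparing models or iterations on the same data.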

Can I set feature priors in sklearn Bayesian classifier?

I have done some simple Bayesian classification
from sklearn.naive_bayes import BernoulliNB

X = [[1,0,0], [1,1,0]] ### there are more data of course
Y = [1,0]
classifier = BernoulliNB()
classifier.fit(X, Y)
Now I have got some "insider tips" that the first element in every X is more important than the others.
Can I incorporate this knowledge before I train the model please?
If sklearn doesn't allow it, is there any other classifier or other library that allows us to incorporate our prior before model training please?
I do not know the answer to question 2, but I can answer question 1.
The approach suggested in the comments, "multiply the first element for each observation by different values", is wrong.
When you are using BernoulliNB or a binomial model, you incorporate prior knowledge by adding it to the sample (the data).
Say you are flipping a coin and you know that it is rigged towards heads. Then you add samples that show more heads. If your prior knowledge says 70% heads and 30% tails, you can add 100 samples in total, 70 heads and 30 tails, to your data X.
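The coin example can be sketched in a few lines of plain Python. The observed counts are made up, and head_probability is a hypothetical helper name used only for illustration.

```python
def head_probability(observed_heads, observed_flips, prior_heads=0, prior_flips=0):
    """Estimate P(heads) from observed flips plus optional pseudo-samples."""
    return (observed_heads + prior_heads) / (observed_flips + prior_flips)


# Made-up observations: 3 heads in 10 flips.
data_only = head_probability(3, 10)
# Add 100 pseudo-samples encoding the prior belief: 70 heads, 30 tails.
with_prior = head_probability(3, 10, prior_heads=70, prior_flips=100)
print("data only:", data_only, "| with prior samples:", with_prior)
```

The more pseudo-samples are added relative to the real data, the more strongly the estimate is pulled towards the prior belief.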
Think about what the algorithm is actually doing. Naive Bayes performs the following classification:
p(class = k | data) ~ p(class = k) * p(data | class = k)
In words: The (posterior) probability of an observation being in class k is proportional to the probability of any observation being in class k (that's the prior) times the probability of seeing the observation, given it came from class k (the likelihood).
Usually when we don't know anything, we assume that p(class = k) just reflects the distribution of the observed data.
In your case, you're saying that you have some information, in addition to the observed data, that leads you to believe that the prior, p(class = k) should be amended. This is perfectly legitimate. In fact, that's the beauty of Bayesian inference. Whatever your prior knowledge is, you should incorporate that into this term. So in your case, perhaps that's increasing the probability of being in a particular class (i.e. increasing its weight as suggested in the comments), if you know that it's more likely to occur than the data suggests.
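A minimal sketch of how amending the prior term changes the result, in plain Python. The likelihood values and both priors are made-up numbers, and posterior is a hypothetical helper. In scikit-learn, BernoulliNB exposes a class_prior argument for this term, but the sketch does not use the library.

```python
def posterior(priors, likelihoods):
    """Normalise prior * likelihood over the classes (Bayes' rule)."""
    unnorm = {k: priors[k] * likelihoods[k] for k in priors}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Hypothetical p(data | class = k) for a single observation.
likelihoods = {0: 0.20, 1: 0.10}

# Prior taken from the class frequencies in the observed data.
empirical = posterior({0: 0.5, 1: 0.5}, likelihoods)
# Prior amended with outside knowledge that class 1 is more common.
informed = posterior({0: 0.2, 1: 0.8}, likelihoods)
print("empirical prior:", empirical)
print("informed prior: ", informed)
```

With the same likelihoods, the amended prior is enough to flip the most probable class from 0 to 1.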

how to set the number of features to use in random selection sklearn

I am using sklearn RandomForest Classifier/Bag classifier for learning and I am not getting the expected results when compared to Java/Weka Machine Learning library.
In Weka, I am learning the model with - Random forest of 10 trees, each constructed while considering 6 random features. (setNumFeatures need to be set and default is 10 trees)
In sklearn - I am not sure how to specify the number of features to randomly consider while constructing a random forest of 10 trees. This is what I am doing:
rf_classifier = RandomForestClassifier(n_estimators=num_trees, max_features=6)
rf_classifier = rf_classifier.fit(train_file, train_file_label)
for items in rf_classifier.estimators_:
    classifier_list.append(items)
I saw the docs and there is a parameter - max_features but I am not sure if that serves the purpose. I get this error when I am trying to calculate entropy:
# code to calculate voting entropy for all features (unlabeled data)
vote_count_for_features = list(classifier_list[0].predict(feature_data_arr))
for i in range(1, len(classifier_list)):
    res_temp = []
    res_temp = list(classifier_list[i].predict(feature_data_arr))
    vote_count_for_features = [x + y for x, y in zip(vote_count_for_features, res_temp)]
If I set that parameter to 6, then my code fails with the error message:
Number of features of the model must match the input. Model n_features
is 6 and input n_features is 31
Inputs: Sample set of 1 million records with 31 features. When I run Weka, the number of rules extracted is around 1000, whereas when I run the same thing through sklearn I get hardly 70 rules.
I am new to Python and sklearn and I am trying to figure out where I am going wrong. (The Weka code has been tested well and gives 95% precision, 80% recall, so I am assuming that's good.)
Note: I have used sklearn imputer to impute missing values using 'mean' strategy whereas Weka has ways to handle NaN.
This is what I am trying to achieve: Learn Random Forest on a sample set, extract rules, evaluate rules and then apply on the bigger set
Any suggestions or input will really help me debug through the issue and solve it quickly.
I think the issue is that the individual trees get confused since they only use 6 features, but you give them 31. You can try to get the prediction to work by setting check_input = False:
list(classifier_list[i].predict(feature_data_arr, check_input = False))
