I am trying to take the sum of two fitted models so that the result is another LinearRegression-type object. I have fitted the two models using the standard LinearRegression estimator from sklearn.
from sklearn.linear_model import LinearRegression
reg_1 = LinearRegression().fit(X1, y)
reg_2 = LinearRegression().fit(X2, y)
and I want to be able to produce something like
reg = reg_1 + reg_2
such that I can still do standard operations such as
reg.predict(X3)
Is there an easy way to do this? Clearly I can obtain the coefficients of both reg_1 and reg_2, so if I could define reg using those it would work, but I couldn't see a way to do it.
Since your reason for doing this is that "they are just different datasets with the same features", I would recommend simply appending the datasets and fitting one model on all the data.
But if this isn't possible for some reason, you could do it by manually setting the coef_ and intercept_ attributes of a third linear model to the averages of the first two, such as:
import numpy as np

reg = LinearRegression()
reg.coef_ = np.array([np.mean(t) for t in zip(reg_1.coef_, reg_2.coef_)])
reg.intercept_ = np.mean([reg_1.intercept_, reg_2.intercept_])
Then you can just use the reg.predict(X3) method as usual to make predictions from the combined averages of the 2 linear models' terms.
There are dangers in this approach, though. If, for example, one of the datasets used to fit the original models is much larger than the other, the smaller dataset's intercept and coefficients would be over-weighted in the combined model, and you would probably want to weight the terms by sample size when averaging them.
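For example, a minimal sketch of a sample-size weighted combination (n1, n2, w1 and w2 are names introduced here for illustration, assuming X1 and X2 are still available):
import numpy as np
from sklearn.linear_model import LinearRegression

n1, n2 = len(X1), len(X2)
w1, w2 = n1 / (n1 + n2), n2 / (n1 + n2)

# a third model whose terms are the sample-size weighted averages of the first two
reg = LinearRegression()
reg.coef_ = w1 * reg_1.coef_ + w2 * reg_2.coef_
reg.intercept_ = w1 * reg_1.intercept_ + w2 * reg_2.intercept_
reg.predict(X3)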
I have a dataset with 1400 observations and 19 columns. The target variable has values 1 (the value I am most interested in) and 0. The class distribution is imbalanced (70:30).
Using the code below I am getting suspicious values (all 1s). I cannot figure out whether this is due to overfitting, to the imbalanced data, or to feature selection (I used Pearson correlation, since all values are numeric/boolean).
I suspect that the steps I followed are wrong.
import numpy as np
import math
import sklearn.metrics as metrics
from sklearn.metrics import f1_score, classification_report
from sklearn.tree import DecisionTreeClassifier
y = df['Label']
X = df.drop('Label',axis=1)
def create_cv(X, y):
    if type(X) != np.ndarray:
        X = X.values
        y = y.values
    test_size = 1/5
    proportion_of_true = y[y==1].shape[0] / y.shape[0]
    num_test_samples = math.ceil(y.shape[0] * test_size)
    num_test_true_labels = math.floor(num_test_samples * proportion_of_true)
    num_test_false_labels = math.floor(num_test_samples - num_test_true_labels)
    y_test = np.concatenate([y[y==0][:num_test_false_labels], y[y==1][:num_test_true_labels]])
    y_train = np.concatenate([y[y==0][num_test_false_labels:], y[y==1][num_test_true_labels:]])
    X_test = np.concatenate([X[y==0][:num_test_false_labels], X[y==1][:num_test_true_labels]], axis=0)
    X_train = np.concatenate([X[y==0][num_test_false_labels:], X[y==1][num_test_true_labels:]], axis=0)
    return X_train, X_test, y_train, y_test
X_train,X_test,y_train,y_test=create_cv(X,y)
X_train,X_crossv,y_train,y_crossv=create_cv(X_train,y_train)
tree = DecisionTreeClassifier(max_depth = 5)
tree.fit(X_train, y_train)
y_predict_test = tree.predict(X_test)
print(classification_report(y_test, y_predict_test))
f1_score(y_test, y_predict_test)
Output:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00        24
           1       1.00      1.00      1.00        70
    accuracy                           1.00        94
   macro avg       1.00      1.00      1.00        94
weighted avg       1.00      1.00      1.00        94
Has anyone experienced similar issues when building a classifier on imbalanced data, using CV and/or undersampling? I am happy to share the whole dataset, in case you want to replicate the output.
What I would like is a clear set of steps to follow, showing what I am doing wrong.
I know that, to reduce overfitting and handle imbalanced data, there are methods such as random sampling (over/under), SMOTE, and CV. My idea is:
Split the data into train/test, taking the imbalance into account
Perform CV on the train set
Apply undersampling only on a test fold
After the model has been chosen with the help of CV, undersample the train set and train the classifier
Estimate the performance on the untouched test set (f1-score)
as also outlined in this question: CV and under sampling on a test fold.
I think the steps above should make sense, but happy to receive any feedback that you might have on this.
When you have imbalanced data you have to perform stratification. The usual way is to oversample the class that has fewer samples.
Another option is to train your algorithm with less data. If you have a good dataset, that should not be a problem. In this case you first grab the samples from the less represented class and use the size of that set to compute how many samples to take from the other class:
This code may help you split your dataset that way:
import random
import pandas as pd

def split_dataset(dataset: pd.DataFrame, train_share=0.8):
    """Splits the dataset into training and test sets"""
    all_idx = range(len(dataset))
    train_count = int(len(all_idx) * train_share)
    train_idx = random.sample(all_idx, train_count)
    test_idx = list(set(all_idx).difference(set(train_idx)))
    train = dataset.iloc[train_idx]
    test = dataset.iloc[test_idx]
    return train, test

def split_dataset_stratified(dataset, target_attr, positive_class, train_share=0.8):
    """Splits the dataset as in `split_dataset` but with stratification"""
    data_pos = dataset[dataset[target_attr] == positive_class]
    data_neg = dataset[dataset[target_attr] != positive_class]
    if len(data_pos) < len(data_neg):
        train_pos, test_pos = split_dataset(data_pos, train_share)
        train_neg, test_neg = split_dataset(data_neg, len(train_pos)/len(data_neg))
        # set.difference makes the test set larger
        test_neg = test_neg.iloc[0:len(test_pos)]
    else:
        train_neg, test_neg = split_dataset(data_neg, train_share)
        train_pos, test_pos = split_dataset(data_pos, len(train_neg)/len(data_pos))
        # set.difference makes the test set larger
        test_pos = test_pos.iloc[0:len(test_neg)]
    # DataFrame.append is removed in recent pandas, so use pd.concat instead
    train = pd.concat([train_pos, train_neg]).sample(frac=1).reset_index(drop=True)
    test = pd.concat([test_pos, test_neg]).sample(frac=1).reset_index(drop=True)
    return train, test
Usage:
train_ds, test_ds = split_dataset_stratified(data, target_attr, positive_class)
You can now perform cross validation on train_ds and evaluate your model in test_ds.
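If it helps, here is a small sketch of that follow-up step; the decision tree and the f1 scoring are my own assumptions, not part of the answer above:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_tr = train_ds.drop(target_attr, axis=1)
y_tr = train_ds[target_attr]

# 5-fold cross-validation on the training set only
clf = DecisionTreeClassifier(max_depth=5)
print(cross_val_score(clf, X_tr, y_tr, cv=5, scoring='f1'))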
There is another solution at the model level: using models that support sample weights, such as gradient boosted trees. Of those, CatBoost is usually the best, as its training method leads to less leakage (as described in their article).
Example code:
from catboost import CatBoostClassifier

y = df['Label']
X = df.drop('Label', axis=1)

# scale_pos_weight is the weight multiplier for class 1; to balance the classes
# it is commonly set to (number of negative samples) / (number of positive samples)
label_ratio = (y == 0).sum() / (y == 1).sum()

model = CatBoostClassifier(scale_pos_weight=label_ratio)
model.fit(X, y)
And so forth.
This works because CatBoost assigns a weight to each sample, so you can set the class weights in advance (scale_pos_weight).
This is better than downsampling, and is effectively equivalent to oversampling (but requires less memory).
Also, a major part of treating imbalanced data is making sure your metrics are weighted as well, or at least well defined, since you might want equal (or deliberately skewed) performance across the classes.
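For instance, with sklearn's standard averaging options (reusing y_test and y_predict_test from the question's code):
from sklearn.metrics import f1_score

f1_score(y_test, y_predict_test, average=None)        # per-class F1 scores
f1_score(y_test, y_predict_test, average='weighted')  # support-weighted average
f1_score(y_test, y_predict_test, average='macro')     # unweighted mean over classes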
And if you want a more visual output than sklearn's classification_report, you can use one of the Deepchecks built-in checks (disclosure - I'm one of the maintainers):
from deepchecks.checks import PerformanceReport
from deepchecks import Dataset
PerformanceReport().run(Dataset(train_df, label='Label'), Dataset(test_df, label='Label'), model)
Your implementation of the stratified train/test split is not optimal, as it lacks randomness. Data often comes in batches, so it is not good practice to take sequences of data as is, without shuffling.
As #sturgemeister mentioned, a 3:7 class ratio is not critical, so you should not worry too much about class imbalance. When you artificially change the class balance in training, for some algorithms you will need to compensate by multiplying by the class priors.
as for your "perfect" results either your model overtrained or the model is indeed classifies the data perfectly. Use different train/test split to check this.
Another point: your test set contains only 94 data points, which is definitely not 1/5 of 1400. Check your numbers.
To get realistic estimates you need lots of test data, which is why you should apply a cross-validation strategy.
As a general strategy for 5-fold CV I suggest the following:
Split your data into 5 folds with respect to the labels (this is called a stratified split; you can use the StratifiedKFold or StratifiedShuffleSplit functions).
Take 4 folds and train your model on them. If you want to use under/oversampling, modify the data in those 4 training folds only.
Apply the model to the remaining fold. Do not under/oversample the data in this test fold; this way you get a realistic performance estimate. Save the results.
Repeat steps 2 and 3 for all test folds (5 times in total). Important: do not change the model's parameters (e.g. tree depth) between folds; they should be the same for all splits.
Now every data point has been tested by a model that was not trained on it; this is the core idea of cross-validation. Concatenate all the saved results and estimate the performance, as in the sketch below.
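A minimal sketch of that loop, assuming X and y are numpy arrays (e.g. X.values and y.values); the undersampling step is my own illustration and can be dropped:
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
rng = np.random.default_rng(0)
oof_pred = np.empty_like(y)  # out-of-fold predictions

for train_idx, test_idx in skf.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # undersample the majority class in the 4 training folds only
    pos, neg = np.where(y_tr == 1)[0], np.where(y_tr == 0)[0]
    n = min(len(pos), len(neg))
    keep = np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])
    # keep the hyperparameters identical for every split
    clf = DecisionTreeClassifier(max_depth=5)
    clf.fit(X_tr[keep], y_tr[keep])
    # predict on the untouched test fold
    oof_pred[test_idx] = clf.predict(X[test_idx])

print(f1_score(y, oof_pred))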
Cross-validation or held-out set
First of all, you are not doing cross-validation. You are splitting your data into train/validation/test sets, which is good, and often sufficient when the number of training samples is large (say, >2e4). However, when the number of samples is small, which is your case, cross-validation becomes useful.
It is explained in depth in scikit-learn's documentation. You start by holding out a test set from your data, as your create_cv function does. Then you split the rest of the training data into e.g. 3 folds, and for each i in {1, 2, 3} you train on the folds j != i and evaluate on fold i. The documentation explains it with prettier, colourful figures; you should have a look! It can be cumbersome to implement by hand, but fortunately scikit-learn does it out of the box.
As for the dataset being unbalanced, it is a very good idea to keep the same ratio of labels in each set. But again, you can let scikit handle it for you!
Purpose
Also, the purpose of cross-validation is to choose the right values for the hyperparameters. You want the right amount of regularization: not too much (under-fitting) nor too little (over-fitting). If you're using a decision tree, the maximum depth (or the minimum number of samples per leaf) is the right knob for controlling the regularization of your model.
Conclusion
Simply use GridSearchCV. You will have cross-validation and label balance done for you.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/5, stratify=y)
tree = DecisionTreeClassifier()
parameters = {'min_samples_leaf': [1, 5, 10]}
clf = GridSearchCV(tree, parameters, cv=5)  # an integer cv uses a stratified K-fold split for classifiers, see the documentation
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())
You can also replace the cv argument with a fancier splitter, such as StratifiedGroupKFold (no overlap of groups between folds).
I would also advise looking at tree ensembles such as random forests, which are less interpretable but are said to perform better in practice.
I just wanted to add thresholding and cost-sensitive learning to the list of possible approaches mentioned by the others. The former is well described here and consists of finding a new threshold for separating the positive and negative classes (it is generally 0.5, but it can be treated as a hyperparameter). The latter consists of weighting the classes to cope with their imbalance. This article was really useful to me for understanding how to deal with imbalanced datasets; it also covers cost-sensitive learning with a specific explanation using a decision tree as the model, and nicely reviews all the other approaches, including Adaptive Synthetic Sampling, informed undersampling, etc.
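As a rough sketch of both ideas on the question's decision tree (the 'balanced' weighting and the 0.4 threshold are illustrative choices, not recommendations):
from sklearn.tree import DecisionTreeClassifier

# cost-sensitive learning: weight classes inversely to their frequencies
tree = DecisionTreeClassifier(max_depth=5, class_weight='balanced')
tree.fit(X_train, y_train)

# thresholding: tune the cut-off on the predicted probability of class 1
proba = tree.predict_proba(X_test)[:, 1]
threshold = 0.4  # treat this as a hyperparameter, chosen on a validation fold
y_pred = (proba >= threshold).astype(int)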
When creating regression models for this housing dataset, we can plot the residuals as a function of the real values.
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

X = housing[['lotsize']]
y = housing[['price']]

model = LinearRegression()
model.fit(X, y)
plt.scatter(y, model.predict(X) - y)
We can clearly see that the difference (prediction - real value) is mainly positive for lower prices, and the difference is negative for higher prices.
This is expected for linear regression, because the model is optimized for squared error (so the sign of the residual is not taken into account).
But when doing KNN
from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X, y)
We can find a similar plot.
In this case, what interpretation can we give, and how can we improve the model?
EDIT: we can use all the other predictors; the results are similar.
housing = housing.replace(to_replace='yes', value=1, regex=True)
housing = housing.replace(to_replace='no', value=0, regex=True)
X = housing[['lotsize','bedrooms','stories','bathrms','driveway','recroom',
             'fullbase','gashw','airco','garagepl','prefarea']]
The following graph is for KNN with 3 neighbors. With 3 neighbors one would expect overfitting, so I can't figure out why this trend appears.
If you look at the fit:
plt.scatter(X,y)
plt.plot(X,model.predict(X), '--k')
You get negative values for higher values of y because there is a cluster of data around x=8000 with high y values that deviate a lot from what you expect.
Now if you do a KNN, bear in mind that your independent variable is only one-dimensional: you are defining neighbours based on lotsize alone, and the mean of each neighbourhood is used as the predicted value. The high outlier values around x=8000 will be grouped with points whose prices are lower, which makes the difference (prediction minus real value) negative.
If you plot this out:
plt.scatter(X,y)
plt.scatter(X,model.predict(X))
How can you improve the model? With only one predictor there's not much you can do; maybe bin lotsize into categories, but I doubt that changes much. Most likely you need other variables to explain what is causing that bump around lotsize = 8000; then you can model the dependent variable better.
Is there a way to apply sklearn's Isolation Forest to a 1D array or list? All the examples I came across are for data with 2 dimensions or more.
I have currently developed a model with three features, and the example code snippet is shown below:
# dataframe of three columns
df_data = datafr[['col_A', 'col_B', 'col_C']]
w_train = df_data[:700]
w_test = df_data[700:-2]

from sklearn.ensemble import IsolationForest

# fit the model
clf = IsolationForest(max_samples='auto')
clf.fit(w_train)

# test it using the test set
y_pred_test = clf.predict(w_test)
The reference I mainly relied upon: IsolationForest example | scikit-learn
df_data is a data frame with three columns. I am actually looking to find outliers in 1-dimensional (list) data.
The other question is: how do you tune an Isolation Forest model? One way is to increase the contamination value to reduce false positives. But how should the other parameters be used, such as n_estimators, max_samples, max_features, verbose, etc.?
It won't make sense to apply Isolation forest to 1D array or list. This is because in that case it would simply be a one to one mapping from feature to target.
You can read the official documentation to get a better idea of how the different parameters help:
contamination
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
Try experimenting with different values in the range (0, 0.5] to see which one gives the best results.
max_features
The number of features to draw from X to train each base estimator.
Try values like 5, 6, 10, etc. (any int of your choice) and validate with the final test data.
n_estimators: try multiple values like 10, 20, 50, etc. to see which works best.
You can also use GridSearchCV to automate this process of parameter tuning, experimenting with different values to see which combination gives the best results.
Try this:
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

my_scoring_func = make_scorer(f1_score)
parameters = {'n_estimators': [10, 30, 50, 80],
              'max_features': [0.1, 0.2, 0.3, 0.4],
              'contamination': [0.1, 0.2, 0.3]}

iso_for = IsolationForest(max_samples='auto')
clf = GridSearchCV(iso_for, parameters, scoring=my_scoring_func)
Then use clf to fit the data. Note, though, that GridSearchCV requires both X and y (i.e. training data and labels) for its fit method.
Note: you can read this blog post for further reference if you wish to use GridSearchCV with Isolation Forest; otherwise you can manually try different values and plot graphs to see the results.
When using logistic regression in R, the data input for the glm function (family = binomial) can be given in several formats (see ?family), and specifically in the format of:
......
For the binomial and quasibinomial families the response can be
specified in one of three ways:
......
As a numerical vector with values between 0 and 1, interpreted as the
proportion of successful cases (with the total number of cases given
by the weights)....
I have aggregated data that represents the proportion of successes out of trials (a number between 0 and 1) together with the corresponding weights. I'm interested in applying logistic regression to it, which would be trivial in R.
Unfortunately I can't use R in this project, and I would like to use scikit-learn to estimate the logistic regression coefficients. More precisely, I'm looking to apply sklearn.linear_model.LogisticRegression with a form of input that lets me pass the proportions and weights, in a similar fashion to what is available in R.
example:
from sklearn import linear_model
import pandas as pd
df = pd.DataFrame([[1,1,1,0], [1,1,1,0],[1,1,1,1],[2,2,1,1] , [2,2,1,1],[2,2,1,0] , [3,3,1,0] ],columns=['a', 'b','Trials','Success'])
logistic = linear_model.LogisticRegression()
#this works
logistic.fit(X=df[['a','b','Trials']] , y=df.Success)
logistic.predict_proba(df[['a','b','Trials']])
prob_to_success = logistic.predict_proba(df[['a','b','Trials']])[:,1]
prob_to_success
Out[51]: array([ 0.45535843, 0.45535843, 0.45535843, 0.42212169, 0.42212169,
0.42212169, 0.38957565])
# How can I use the following data?
df_agg = df.groupby(['a','b'], as_index=False)[['Trials','Success']].sum()
df_agg["Prop"] = df_agg.Success / (df_agg.Trials)
df_agg
#I want to use Prop & Trials as weights in df_agg
Thanks in advance!
Convert to log-odds form and use linear regression on the transformed values. sklearn doesn't seem to have a quasi-binomial option for logistic regression; as you said, it is trivial in R, but sklearn doesn't appear to offer anything of the sort.
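A rough sketch of that idea on the df_agg frame from the question (the epsilon clip is my own addition, to avoid infinite log-odds when a group's proportion is exactly 0 or 1):
import numpy as np
from sklearn.linear_model import LinearRegression

eps = 1e-3
p = df_agg['Prop'].clip(eps, 1 - eps)
log_odds = np.log(p / (1 - p))

# weight each aggregated row by its number of trials
lin = LinearRegression()
lin.fit(df_agg[['a', 'b']], log_odds, sample_weight=df_agg['Trials'])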
If you want to use weights, you can use them in the fit function of LogisticRegression:
fit(X, y, sample_weight=None)
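One way to mimic R's proportion-plus-weights input with sample_weight is to expand each aggregated row into a weighted "success" row and a weighted "failure" row; this is a sketch of that idea, not an official sklearn recipe:
import pandas as pd
from sklearn.linear_model import LogisticRegression

successes = df_agg.assign(y=1, w=df_agg['Success'])
failures = df_agg.assign(y=0, w=df_agg['Trials'] - df_agg['Success'])
expanded = pd.concat([successes, failures], ignore_index=True)
expanded = expanded[expanded['w'] > 0]  # drop rows with zero weight

logistic = LogisticRegression()
logistic.fit(expanded[['a', 'b']], expanded['y'], sample_weight=expanded['w'])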
I have a binary prediction model trained with the logistic regression algorithm. I want to know which features (predictors) are more important for the decision between the positive and negative class. I know there is a coef_ attribute in the scikit-learn package, but I don't know whether it is enough to judge the importance. Another thing is how I can evaluate the coef_ values in terms of importance for the negative and positive classes. I also read about standardized regression coefficients, but I don't know what they are.
Let's say there are features like tumor size, tumor weight, etc. used to decide whether a test case is malignant or not. I want to know which of the features are more important for the malignant and not-malignant predictions. Does that make sense?
One of the simplest options to get a feeling for the "influence" of a given parameter in a linear classification model (logistic regression being one of those) is to consider the magnitude of its coefficient times the standard deviation of the corresponding parameter in the data.
Consider this example:
import numpy as np
from sklearn.linear_model import LogisticRegression
x1 = np.random.randn(100)
x2 = 4*np.random.randn(100)
x3 = 0.5*np.random.randn(100)
y = (3 + x1 + x2 + x3 + 0.2*np.random.randn(100)) > 0
X = np.column_stack([x1, x2, x3])
m = LogisticRegression()
m.fit(X, y)
# The estimated coefficients will all be around 1:
print(m.coef_)
# Those values, however, will show that the second parameter
# is more influential
print(np.std(X, 0)*m.coef_)
An alternative way to get a similar result is to examine the coefficients of the model fit on standardized parameters:
m.fit(X / np.std(X, 0), y)
print(m.coef_)
Note that this is the most basic approach and a number of other techniques for finding feature importance or parameter influence exist (using p-values, bootstrap scores, various "discriminative indices", etc).
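As one example of those alternatives, permutation importance is available directly in scikit-learn (the repeat count below is an arbitrary choice, reusing m, X and y from the example above):
from sklearn.inspection import permutation_importance

result = permutation_importance(m, X, y, n_repeats=30, random_state=0)
print(result.importances_mean)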
I am pretty sure you would get more interesting answers at https://stats.stackexchange.com/.