Python scikit-learn (find deviance)

The deviance between Y and its expected value E(Y), estimated by the model constructed in (c), measures the goodness of fit of the model. The lower the deviance, the better the model. The equation for calculating it is below.
D = 2\sum_{i=1}^{n}\left\{ Y\log\!\left[\frac{Y}{\mathbb{E}(Y)}\right] - \left[ Y - \mathbb{E}(Y) \right] \right\}
If Y = 0, the expression log[Y/E(Y)] is taken as zero. Write your own Python program to compute D without using the score() function of the scikit-learn package.
How do I go about doing this? Please help!

What you have is the deviance for a model fitted assuming a Poisson distribution; you can check Wikipedia for how this definition is derived. Using the example from the PoissonRegressor documentation:
from sklearn import linear_model
import numpy as np
clf = linear_model.PoissonRegressor()
X = [[1, 2], [2, 3], [3, 4], [4, 3]]
y = [12, 17, 22, 21]
clf.fit(X, y)
pred = clf.predict(X)  # the model's estimate of E(Y), used below
The deviance is:
def calculate_dev(y_true, y_pred):
    return (2 * (y_true * np.log(y_true / y_pred) - (y_true - y_pred))).sum()
D = calculate_dev(y,pred)
D
0.03453083031027196
Compare with the score() function, which the documentation defines as 1 - dev(model)/dev(null):
clf.score(X, y)
0.99048551488916
nullD = calculate_dev(y,np.mean(y))
1 - D / nullD
0.99048551488916
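Note that the question also says the log term should be treated as zero whenever Y = 0; calculate_dev above would produce nan in that case. A minimal zero-safe sketch (my own addition, not part of the original answer):
def calculate_dev_safe(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # replace Y*log[Y/E(Y)] with 0 wherever Y == 0, as the question requires
    log_term = np.where(y_true > 0,
                        y_true * np.log(np.where(y_true > 0, y_true, 1.0) / y_pred),
                        0.0)
    return (2 * (log_term - (y_true - y_pred))).sum()
For data with no zero counts this returns the same value as calculate_dev.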

Related

Interaction between sample_weight and min_samples_split in decision tree

In sklearn.ensemble.RandomForestClassifier, if we define both sample_weight and min_samples_split, does the sample weight impact min_samples_split? For example, if min_samples_split = 20 and every data point has a sample weight of 2, do 10 data points satisfy the min_samples_split condition?
No, see the source; min_samples_split does not take into consideration sample weights. Compare to min_samples_leaf and its weighted cousin min_weight_fraction_leaf (source).
Your example suggests an easy experiment to check:
from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.array([1, 2, 3]).reshape(-1, 1)
y = [0, 0, 1]
tree = DecisionTreeClassifier()
tree.fit(X, y)
print(len(tree.tree_.feature)) # number of nodes
# 3
tree.set_params(min_samples_split=10)
tree.fit(X, y)
print(len(tree.tree_.feature))
# 1
tree.set_params(min_samples_split=10)
tree.fit(X, y, sample_weight=[20, 20, 20])
print(len(tree.tree_.feature))
# 1; the sample weights don't count to make
# each sample "large" enough for min_samples_split
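For contrast, here is a small sketch (my own example, not part of the original answer) suggesting that min_weight_fraction_leaf does take sample weights into account, reusing X and y from above; the expected node counts are noted in the comments:
tree = DecisionTreeClassifier(min_weight_fraction_leaf=0.5)
tree.fit(X, y)  # equal default weights: any split would leave a 1/3-weight leaf
print(len(tree.tree_.feature))
# expected 1: no split satisfies the 0.5 weight fraction per leaf
tree.fit(X, y, sample_weight=[1, 1, 2])
print(len(tree.tree_.feature))
# expected 3: the split at x <= 2.5 gives two leaves with weight fraction 0.5 each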

Selecting data points in the neighbourhood of support vectors

I have been thinking about this but am not sure how to do it. I have imbalanced binary data and would like to use an SVM to select just a subset of the majority-class data points nearest to the support vector. Thereafter, I can fit a binary classifier on this "balanced" data.
To illustrate what I mean, an MWE:
# packages import
from collections import Counter
from sklearn.datasets import make_classification
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC
import seaborn as sns
# sample data
X, y = make_classification(n_samples=100, n_features=2, n_redundant=0,
n_clusters_per_class=1, weights=[0.9], flip_y=0, random_state=1)
# class distribution summary
print(Counter(y))
Counter({0: 91, 1: 9})
# fit svm model
svc_model = SVC(kernel='linear', random_state=32)
svc_model.fit(X, y)
plt.figure(figsize=(10, 8))
# Plotting our two-features-space
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
# Constructing a hyperplane using a formula.
w = svc_model.coef_[0] # w consists of 2 elements
b = svc_model.intercept_[0] # b consists of 1 element
x_points = np.linspace(-1, 1) # generating x-points from -1 to 1
y_points = -(w[0] / w[1]) * x_points - b / w[1] # getting corresponding y-points
# Plotting a red hyperplane
plt.plot(x_points, y_points, c='r')
The two classes are well separated by the hyperplane. We can see the support vectors for both classes (even better for class 1).
Since the minority class 1 has 9 data points, I want to down-sample the majority class 0 by selecting its support vector and the 8 other data points nearest to it, so that the class distribution becomes {0: 9, 1: 9}, ignoring all other data points of class 0. I will then use this to fit a binary classifier like LR (or even SVC).
My question is how to select the data points of class 0 nearest to that class's support vector, in a way that balances them against the data points of the minority class 1.
This can be achieved as follows: get the support vector for class 0 (sv0), iterate over all data points in class 0 (X[y == 0]), compute each point's distance (d) to the support vector, sort the distances, take the 9 points with the smallest values, and concatenate them with the points of class 1 to create the down-sampled data (X_ds, y_ds).
sv0 = svc_model.support_vectors_[0]
distances = []
for i, x in enumerate(X[y == 0]):
    d = np.linalg.norm(sv0 - x)
    distances.append((i, d))
distances.sort(key=lambda tup: tup[1])
index = [i for i, d in distances][:9]
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))
plt.plot(x_points[19:-29], y_points[19:-29], c='r')
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, s=50)
plt.scatter(X_ds[y_ds == 0][:,0], X_ds[y_ds == 0][:,1], color='yellow', alpha=0.4)
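As a side note, the loop can be replaced by a vectorized computation; a short sketch giving the same selection, assuming the same sv0, X, and y as above:
# distances from every class-0 point to the class-0 support vector
dists = np.linalg.norm(X[y == 0] - sv0, axis=1)
index = np.argsort(dists)[:9]  # the support vector itself plus its 8 nearest neighbours
X_ds = np.concatenate((X[y == 0][index], X[y == 1]))
y_ds = np.concatenate((y[y == 0][index], y[y == 1]))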

Absurd solution using Gurobi Python in regression

So I am new to Gurobi and decided to start working with it on a well-known problem such as regression. I found this official notebook, where an L0-penalized regression model was solved, and I took just the regression part out of it. However, when I solve this problem in Gurobi, I get a really strange solution, totally different from the actual correct regression solution.
The code I am running is:
import gurobipy as gp
from gurobipy import GRB
import numpy as np
from sklearn.datasets import load_boston
from itertools import product
boston = load_boston()
x = boston.data
x = x[:, [0, 2, 4, 5, 6, 7, 10, 11, 12]] # select non-categorical variables
response = boston.target
samples, dim = x.shape
regressor = gp.Model()
# Append a column of ones to the feature matrix to account for the y-intercept
x = np.concatenate([x, np.ones((samples, 1))], axis=1)
# Decision variables
beta = regressor.addVars(dim + 1, name="beta") # Beta
# Objective Function (OF): minimize 1/2 * RSS using the fact that
# if x* is a minimizer of f(x), it is also a minimizer of k*f(x) iff k > 0
Quad = np.dot(x.T, x)
lin = np.dot(response.T, x)
obj = sum(0.5 * Quad[i, j] * beta[i] * beta[j] for i, j in product(range(dim + 1), repeat=2))
obj -= sum(lin[i] * beta[i] for i in range(dim + 1))
obj += 0.5 * np.dot(response, response)
regressor.setObjective(obj, GRB.MINIMIZE)
regressor.optimize()
beta_sol_gurobi = np.array([beta[i].X for i in range(dim+1)])
The solution provided by this code is
array([1.22933632e-14, 2.40073891e-15, 1.10109084e-13, 2.93142174e+00,
6.14486489e-16, 3.93021623e-01, 5.52707727e-15, 8.61271603e-03,
1.55963041e-15, 3.19117429e-13])
While the true linear regression solution should be
from sklearn import linear_model
lr = linear_model.LinearRegression()
lr.fit(x, response)
lr.coef_
lr.intercept_
That yields,
array([-5.23730841e-02, -3.35655253e-02, -1.39501039e+01, 4.40955833e+00,
-7.33680982e-03, -1.24312668e+00, -9.59615262e-01, 8.60275557e-03,
-5.17452533e-01])
29.531492975441015
So the Gurobi solution is completely different. Any guess or suggestion on what's happening? Am I doing anything wrong here?
PS: I know this problem can be solved using other packages, or even other optimization frameworks, but I am especially interested in solving it with Gurobi's Python API, since I want to start using Gurobi on some more complex problems.
The wrong result is due to your decision variables. Since Gurobi assumes a lower bound of 0 for all variables by default, you need to set the lower bound explicitly:
beta = regressor.addVars(dim + 1, lb = -GRB.INFINITY, name="beta") # Beta
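After re-running the script with this corrected declaration, the fitted values should match ordinary least squares. A rough sanity check might look like this (my own sketch, not part of the original answer; it assumes lr was fitted on the same augmented x as in the question):
beta_sol_gurobi = np.array([beta[i].X for i in range(dim + 1)])
pred_gurobi = x @ beta_sol_gurobi   # x already contains the column of ones
pred_sklearn = lr.predict(x)        # LinearRegression fit from the question
print(np.allclose(pred_gurobi, pred_sklearn, atol=1e-3))  # should print True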

How to get errors for coefficients from a linear regression?

I've been able to calculate the coefficients of a linear regression, but is there a way to get the associated errors of the coefficients? My code is shown below.
from scipy.interpolate import *
from numpy import *
import numpy as np  # needed for the np. prefix used below
x = np.array([4, 12, 56, 58.6,67, 89])
y = np.array([5, 6, 7, 16,18, 19])
degrees = [0,1] # list of degrees of x to use
matrix = np.stack([x**d for d in degrees], axis=-1)
coeff = np.linalg.lstsq(matrix, y)[0]
print("Coefficients", coeff)
fit = np.dot(matrix, coeff)
print("Linear regression", fit)
p1=polyfit(x,y,1)
Output:
Coefficients for y=a +bx [3.70720668 0.17012128]
Linear fit [ 4.38769182 5.74866209 13.23399857 13.67631391 15.10533269 18.84800093]
Errors are not shown! How to calculate the errors?
You can generate the "predicted" values for y, call them y_pred, and compare them to y to get the residual errors.
predicted_line = poly1d(coeff[::-1])  # poly1d expects the highest-degree coefficient first
y_pred = predicted_line(x)
errors = y - y_pred
Although I like David Moseler's solution, if you want an error measure to evaluate the goodness of your regression, you could use the R2 score (which uses the squared error), already implemented in sklearn:
from sklearn.linear_model import LinearRegression
import numpy as np
x = np.array([4, 12, 56, 58.6,67, 89]).reshape(-1, 1)
y = np.array([5, 6, 7, 16,18, 19])
reg = LinearRegression().fit(x, y)
reg.score(x, y) # R2 score
# 0.7481301984276703
If the R2 value is near 1, the model is a good one.
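If what you are actually after are the standard errors of the coefficients themselves (rather than residuals or R2), one option is the covariance matrix that np.polyfit returns with cov=True; a brief sketch under that assumption:
import numpy as np
x = np.array([4, 12, 56, 58.6, 67, 89])
y = np.array([5, 6, 7, 16, 18, 19])
# fit y = b*x + a and request the covariance matrix of the estimates
p, cov = np.polyfit(x, y, 1, cov=True)
stderr = np.sqrt(np.diag(cov))  # standard errors of [slope, intercept]
print("slope, intercept:", p)
print("standard errors:", stderr)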

How to configure lasso regression to not penalize certain variables?

I'm trying to use lasso regression in Python.
I'm currently using the Lasso function in the scikit-learn library.
I want my model not to penalize certain variables while training (penalizing only the rest of the variables).
Below is my current code for training
rg_mdt = linear_model.LassoCV(alphas=np.array(10**np.linspace(0, -4, 100)), fit_intercept=True, normalize=True, cv=10)
rg_mdt.fit(df_mdt_rgmt.loc[df_mdt_rgmt.CLUSTER_ID == k].drop(['RESPONSE', 'CLUSTER_ID'], axis=1), df_mdt_rgmt.loc[df_mdt_rgmt.CLUSTER_ID == k, 'RESPONSE'])
df_mdt_rgmt is the data mart, and I'm trying to keep the coefficients for certain columns non-zero.
glmnet in R provides a 'penalty factor' parameter that lets me do this, but how can I do that in Python's scikit-learn?
Below is the code I have in R
get.Lassomodel <- function(TB.EXP, TB.RSP){
  VT.PEN <- rep(1, ncol(TB.EXP))
  VT.PEN[which(colnames(TB.EXP) == "DC_RATE")] <- 0
  VT.PEN[which(colnames(TB.EXP) == "FR_PRICE_PW_REP")] <- 0
  VT.GRID <- 10^seq(0, -4, length=100)
  REG.MOD <- cv.glmnet(as.matrix(TB.EXP), as.matrix(TB.RSP), alpha=1,
                       lambda=VT.GRID, penalty.factor=VT.PEN, nfolds=10, intercept=TRUE)
  return(REG.MOD)
}
I'm afraid you can't. Of course it's not a theoretical issue, just a design decision.
My reasoning is based on the available API: while sometimes there are undocumented functions, this time I don't think there is what you need, because the user guide already presents the problem in the single-penalty-factor form alpha * ||w||_1.
Depending on your setting, you might modify sklearn's code (I would be a bit wary of the coordinate-descent tuning) or even implement a customized objective using scipy.optimize (although the latter might be a bit slower).
Here is an example showing the scipy.optimize approach. I simplified the problem by removing the intercept.
""" data """
import numpy as np
from sklearn import datasets
diabetes = datasets.load_diabetes()
A = diabetes.data[:150]
y = diabetes.target[:150]
alpha=0.1
weights=np.ones(A.shape[1])
""" sklearn """
from sklearn import linear_model
clf = linear_model.Lasso(alpha=alpha, fit_intercept=False)
clf.fit(A, y)
""" scipy """
from scipy.optimize import minimize
def lasso(x):  # following sklearn's definition from the user guide!
    return (1. / (2 * A.shape[0])) * np.square(np.linalg.norm(A.dot(x) - y, 2)) + alpha * np.linalg.norm(weights * x, 1)
""" Test with weights = 1 """
x0 = np.zeros(A.shape[1])
res = minimize(lasso, x0, method='L-BFGS-B', options={'disp': False})
print('Equal weights')
print(lasso(clf.coef_), clf.coef_[:5])
print(lasso(res.x), res.x[:5])
""" Test scipy-based with special weights """
weights[[0, 3, 5]] = 0.0
res = minimize(lasso, x0, method='L-BFGS-B', options={'disp': False})
print('Specific weights')
print(lasso(res.x), res.x[:5])
Output:
Equal weights
12467.4614224 [-524.03922009 -75.41111354 820.0330707 40.08184085 -307.86020107]
12467.6514697 [-526.7102518 -67.42487561 825.70158417 40.04699607 -271.02909258]
Specific weights
12362.6078842 [ -6.12843589e+02 -1.51628334e+01 8.47561732e+02 9.54387812e+01
-1.02957112e-05]
