I'm calculating the weights for a linear regression with weight decay, i.e. I am trying to find beta = (X'X + lambda*I)^-1 X'Y, where X has n rows of D features each and Y is a vector of outputs, one for each row of X.
I've been fitting without a bias term by using:
import numpy as np

def wd_fit(A, y, lamb=0):
    n_col = A.shape[1]
    # Solve (A'A + lambda*I) beta = A'y; lstsq returns (solution, residuals, rank, singular values)
    return np.linalg.lstsq(A.T.dot(A) + lamb * np.identity(n_col), A.T.dot(y), rcond=None)
I'd like to also calculate a bias or intercept term for the fit, instead of having it pass through the origin. I'd like to keep the same call to lstsq, so if there's some matrix transform I can carry out, that would be ideal. My inclination is to append a column of 1s somewhere, so that X_mod, say, would then have D+1 features, where the last one corresponds to the intercept value, but I'm not quite sure where that column should go or even if it's correct.
If you don't want to mean-center your variables, adding a column of ones will work and is a perfectly acceptable solution.
The bias term will just be the coefficient at the position of the added column.
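For example, a minimal sketch, assuming X and Y are already NumPy arrays and using the wd_fit above:

n = X.shape[0]
X_mod = np.column_stack([X, np.ones(n)])   # append a column of 1s as the last feature
beta, *rest = wd_fit(X_mod, Y, lamb=0.1)   # lstsq returns (solution, residuals, rank, singular values)
coefs = beta[:-1]       # weights for the original D features
intercept = beta[-1]    # bias/intercept = coefficient of the appended column

Note that with lamb > 0 the intercept is penalized along with the other weights here; some ridge formulations leave the bias term out of the penalty.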
I am using following package: https://www.statsmodels.org/dev/generated/statsmodels.tsa.vector_ar.var_model.VAR.html
I am trying to find out which of my input features has the biggest influence on the forecast that I am performing.
I am performing a multi-variate time-series forecast.
I differenced all variables until they became stationary.
I now want to find out which variables have the biggest influence in defining the output. (The idea is similar to extracting feature importances from a random forest.)
As far as I understand, in VAR the coefficients relate to the lags of each variable up to the model order.
You can write the formula as something like

Y(t) = alpha + B1 * Y(t-1) + ... + Bp * Y(t-p)

or, spelled out per variable for the first lag,

Y(t) = alpha + B1_var1 * var1(t-1) + B1_var2 * var2(t-1) + ... + B1_varX * varX(t-1)

with 'B' being the coefficient values and 'p' the model order (= number of lags).
Based on this formula, it would seem that the bigger the value of the coefficient, the bigger its effect on the output.
However, my data is not normalized / standardized (which is not necessary in VAR), and there are coefficients for lags 1 through p (i.e. up to the model order).
Based on above formula, there are p coefficients for each input variable.
And each input variable is scaled differently.
Is there an obvious / simple way in defining the most important features in a VAR?
Or a metric provided by the VAR algorithm that helps me define the most important features?
Right now, what I am thinking of doing is this:

Influence_var1 = B1_var1 * var1(t-1) + B2_var1 * var1(t-2) + ... + Bp_var1 * var1(t-p)

to determine the influence of var1, and then do the same for each input variable up to varX.
But I am fairly certain I don't have to do all of that and that there are simpler solutions.
The VAR system is estimated with the package above; its summary function shows all the fitted coefficients.
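For reference, here is a minimal sketch of the per-variable contribution you describe, assuming a statsmodels VAR fitted on a DataFrame df of the differenced variables (df, p and the choice of "var1" as the target are hypothetical placeholders):

import numpy as np
from statsmodels.tsa.api import VAR

results = VAR(df).fit(p)                 # df: DataFrame of differenced variables, p: chosen order
coefs = results.coefs                    # shape (p, K, K): coefs[lag, equation, variable]

target = df.columns.get_loc("var1")      # index of the equation whose forecast we explain
last_obs = df.values[-p:][::-1]          # row 0 = values at t-1, row 1 = t-2, ..., row p-1 = t-p

# Contribution of each input variable to the one-step-ahead forecast of the target equation
influence = {
    name: sum(coefs[lag, target, v] * last_obs[lag, v] for lag in range(p))
    for v, name in enumerate(df.columns)
}
print(influence)

Since each variable is on a different scale, the raw coefficient sizes alone are not comparable across variables; multiplying by the lagged values, as in your formula, at least puts the contributions on the scale of the forecast itself.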
I've read a related post on manually calculating R-squared values after using scipy.optimize.curve_fit(). However, they calculate an R-squared value when their function follows the power-law (f(x) = a*x^b). I'm trying to do the same but get negative R-squared values.
Here is my code:
import numpy as np
from scipy.optimize import curve_fit

def powerlaw(x, a, b):
    '''Generic power law function.'''
    return a * x**b

X = s_lt[4:]  # independent variable (Pandas Series)
Y = s_lm[4:]  # dependent variable (Pandas Series)

popt, pcov = curve_fit(powerlaw, X, Y)

residuals = Y - powerlaw(X, *popt)
ss_res = np.sum(residuals**2)            # residual sum of squares
ss_tot = np.sum((Y - np.mean(Y))**2)     # total sum of squares
r_squared = 1 - (ss_res / ss_tot)        # R-squared value
print("R-squared of power-law fit =", r_squared)
I got an R-squared value of -0.057....
From my understanding, it's not good to use R-squared values for non-linear functions, but I expected to get a much higher R-squared value than a linear model due to overfitting. Did something else go wrong?
See "The R-squared and nonlinear regression: a difficult marriage?", and also "When is R squared negative?".
Basically, we have two problems:
nonlinear models do not have an intercept term, at least, not in the usual sense;
the equality SS_tot = SS_reg + SS_res may not hold.
The first reference above calls your statistic a "pseudo-R-squared" (in the case of non-linear models), and notes that it may be lower than 0.
To further understand what's going on you probably want to plot your data Y as a function of X, the predicted values from the power law as a function of X, and the residuals as a function of X.
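For instance, a quick matplotlib sketch of those plots, assuming X, Y, popt, powerlaw and np from your code:

import matplotlib.pyplot as plt

pred = powerlaw(X, *popt)
order = np.argsort(X.values)             # sort by X so the fitted curve is drawn left to right

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X, Y, label="data")
axes[0].plot(X.values[order], pred.values[order], color="red", label="power-law fit")
axes[0].set_xlabel("X"); axes[0].set_ylabel("Y"); axes[0].legend()

axes[1].scatter(X, Y - pred)             # residuals vs X
axes[1].axhline(0, color="gray", linestyle="--")
axes[1].set_xlabel("X"); axes[1].set_ylabel("residual")

plt.tight_layout()
plt.show()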
For non-linear models I have sometimes calculated the sum of squared deviation from zero, to examine how much of that is explained by the model. Something like this:
pred = powerlaw(X, *popt)
ss_total = np.sum(Y**2) # Not deviation from mean.
ss_resid = np.sum((Y - pred)**2)
pseudo_r_squared = 1 - ss_resid/ss_total
Calculated this way, pseudo_r_squared can potentially be negative (if the model is really bad, worse than just guessing the data are all 0), but if pseudo_r_squared is positive I interpret it as the amount of "variation from 0" explained by the model.
I'm using a logistic regression to estimate the probability of scoring a goal in soccer/football. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is always a must, I've scaled my features before fitting my model. I've used the MinMaxScaler, which scales all features to the range [0, 1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thought is that the second feature is the most important, followed by the first. Is this always true?
I read that "In other words, for a one-unit increase in the 'the second feature', the expected change in log odds is 4.05722387." on this site, but there, their features were normalized with a mean of 50 and some std deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in the literature on my topic that this is indeed true. So this confuses me, of course.
My questions are:
Which of my features is the most important, and what is the best methodology to find it (and why)?
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I put 1 meter through the MinMaxScaler, see what comes out and use that as 'the one unit increase'?
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Which of my features is the most important, and what is the best methodology to find it (and why)?
Look at the several versions of marginal effects calculations. For example, see the overview/discussion in the blog post "Stata's example resources for R".
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I put 1 meter through the MinMaxScaler, see what comes out and use that as 'the one unit increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for the scaling when you talk about a one-unit increase/decrease in X changing the probability, odds ratio, etc.
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Yes, it's just that the features x are in scaled units.
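As an illustration, a minimal sketch of both points, assuming a scikit-learn LogisticRegression (model) fitted on features transformed by a fitted MinMaxScaler (scaler); the feature index 0 for the 'meter' feature and the scaled row X_scaled_row are hypothetical:

import numpy as np

# X_scaled = (x - x_min)/(x_max - x_min), so a change of 1 meter in feature 0
# corresponds to scaler.scale_[0] = 1/(x_max - x_min) in scaled units.
one_meter_scaled = scaler.scale_[0]

# Change in log-odds (and the odds ratio) for a 1-meter increase in feature 0,
# holding the other features fixed:
delta_log_odds = model.coef_[0][0] * one_meter_scaled
odds_ratio = np.exp(delta_log_odds)

# The final probability computed by hand matches predict_proba:
fx = model.intercept_[0] + X_scaled_row.dot(model.coef_[0])
p_goal = 1.0 / (1.0 + np.exp(-fx))       # same as model.predict_proba([X_scaled_row])[0, 1]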
I'm running a GLM and have to hand over discrete values that come from the variable*coefficient to our IT department.
That said, I'm not sure how to calculate the slopes in a piecewise regression model using the bs() function from patsy.
Let's say I have the following model:
y ~ bs(length, degree=1, knots=[32])
This gives me two rows of the standard statsmodels parameters (coefficients, p-values, standard errors, etc.).
Those values are,
variable                                 coeff
bs(length, degree=1, knots=[32])[0]      .3763
bs(length, degree=1, knots=[32])[1]      .4335
I can also run it like this:
y ~ length + np.maximum(length-32,0)
Which yields
variable                        coeff
length                          .0118
np.maximum(length - 32, 0)      -.0074
What I don't understand is that when I run a test set through both of these models, they yield the same predictions.
I'm not sure what patsy is doing in the background in either case, and I'm wondering, to answer my question, whether:
slope 1 for length should come straight from the exponent of the first coefficient, and
slope 2 for length is the exponent of (coefficient1 + coeff2). If that's the case, does that rule apply to both types of syntax?
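For what it's worth, a minimal sketch of the second parameterisation, assuming an identity link for simplicity and a hypothetical DataFrame df with columns y and length (with a log-link GLM you would exponentiate the resulting slopes, as you describe):

import numpy as np
import statsmodels.formula.api as smf

res = smf.ols("y ~ length + np.maximum(length - 32, 0)", data=df).fit()
b0, b_len, b_hinge = res.params

# Below the knot (length < 32) the fitted slope is b_len;
# above the knot the hinge term switches on, so the slope becomes b_len + b_hinge.
slope_below = b_len
slope_above = b_len + b_hinge

The bs(length, degree=1, knots=[32]) specification fits the same piecewise-linear function with a different basis, which is why the two forms give identical predictions even though the coefficients look different.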
I was going through the code for the SVM loss and its derivative. I understood the loss, but I cannot understand how the gradient is being computed in a vectorized manner.
import numpy as np

def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_train = X.shape[0]

    scores = X.dot(W)
    yi_scores = scores[np.arange(scores.shape[0]), y]
    margins = np.maximum(0, scores - np.matrix(yi_scores).T + 1)
    margins[np.arange(num_train), y] = 0
    loss = np.mean(np.sum(margins, axis=1))
    loss += 0.5 * reg * np.sum(W * W)
Understood up to here. After this point I cannot understand why we sum the binary matrix row-wise and then put the negative of that sum at the correct-class positions:
    binary = margins
    binary[margins > 0] = 1
    row_sum = np.sum(binary, axis=1)              # per sample: how many classes violate the margin
    binary[np.arange(num_train), y] = -row_sum.T  # correct class gets minus that count
    dW = np.dot(X.T, binary)

    # Average
    dW /= num_train
    # Regularize
    dW += reg * W

    return loss, dW
Let us recap the scenario and the loss function first, so we are on the same page:
Given are P sample points in N-dimensional space in the form of a PxN matrix X, so the points are the rows of this matrix. Each point in X is assigned to one out of M categories. These are given as a vector Y of length P that has integer values between 0 and M-1.
The goal is to predict the classes of all points by M linear classifiers (one for each category) given in the form of a weight matrix W of shape NxM, so the classifiers are the columns of W. To predict the categories of all samples X, the scalar products between all points and all weight vectors are formed. This is the same as matrix multiplying X and W, yielding a score matrix Y0 whose rows are ordered like the elements of Y, so each row corresponds to one sample. The predicted category for each sample is simply the one with the largest score.
There are no bias terms so I presume there is some kind of symmetry or zero mean assumption.
Now, to find a good set of weights we want a loss function that is small for good predictions and large for bad predictions, and that lets us do gradient descent. One of the most straightforward ways is to punish, for each sample i, each score that is larger than the score of the correct category for that sample, and let the penalty grow linearly with the difference. So if we write A[i] for the set of categories j whose score exceeds that of the correct category, Y0[i, j] > Y0[i, Y[i]] (in the actual code a margin of 1 is added here), the loss for sample i could be written as
sum_{j in A[i]} (Y0[i, j] - Y0[i, Y[i]])
or equivalently if we write #A[i] for the number of elements in A[i]
(sum_{j in A[i]} Y0[i, j]) - #A[i] Y0[i, Y[i]]
The partial derivatives with respect to the score are thus simply
dloss/dY0[i, j] = -#A[i]   if j == Y[i]
                =  1       if j in A[i]
                =  0       otherwise
which is precisely what the first four lines you say you don't understand compute.
The next line applies the chain rule: since Y0 = X·W, we have dloss/dW = X^T · dloss/dY0, which is exactly dW = np.dot(X.T, binary).
It remains to divide by the number of samples to get a per-sample loss, and to add the derivative of the regularization term, which is easy since the regularization is just a componentwise quadratic function.
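If it helps, a quick numerical gradient check is an easy way to convince yourself that the vectorized expression is right; a minimal sketch with random data (the shapes here are arbitrary):

import numpy as np

np.random.seed(0)
P, N, M = 20, 5, 4                       # samples, features, classes
X = np.random.randn(P, N)
y = np.random.randint(0, M, size=P)
W = np.random.randn(N, M) * 0.01
reg = 0.1

loss, dW = svm_loss_vectorized(W, X, y, reg)

# Compare one entry of dW against a centred finite difference of the loss
i, j = 2, 1
h = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[i, j] += h
W_minus[i, j] -= h
num_grad = (svm_loss_vectorized(W_plus, X, y, reg)[0] -
            svm_loss_vectorized(W_minus, X, y, reg)[0]) / (2 * h)
print(dW[i, j], num_grad)                # the two values should agree to several decimals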
Personally, I found it much easier to understand the whole gradient calculation by looking at the analytic derivation of the loss function in more detail. To extend the given answer, I would like to point to the derivatives of the loss function with respect to the weights, as follows:
1) Loss gradient w.r.t. w_yi (the correct class):

dL_i/dw_yi = -( sum_{j != yi} 1[ w_j·x_i - w_yi·x_i + 1 > 0 ] ) * x_i

Hence, we count the cases where w_j does not meet the margin requirement and sum those cases up. The negative of this sum is then placed at the position of the correct class w_yi. (We later need to multiply this value by x_i; this is what your code does in line 5.)
2) Loss gradient w.r.t. w_j (the incorrect classes, j != yi):

dL_i/dw_j = 1[ w_j·x_i - w_yi·x_i + 1 > 0 ] * x_i

where 1[·] is the indicator function: 1 if the condition is true, else 0.
In other words, "programatically" we need to apply equation (2) to all cases where the margin requirement is not met, and adding the negative sum of all unmet requirements to the true class column (as in (1)).
So what you do in the first four lines of your code is determine the cases where the margin is not met and add the negative sum of those cases to the correct-class column. In the 5th line, you do the final step where you multiply by the x_i's (via X.T), and this completes the gradient calculation as in (1) and (2).
I hope this makes it easier to understand; let me know if anything remains unclear.