What does coef_ store?
coef_ is an attribute of the fitted Lasso regression model, which is often used for feature selection.
It stores the parameter vector, i.e. the weights you multiply each feature by to get your predicted values. Essentially, coef_ contains the parameters of your model (excluding the regularisation term and the intercept (w0) - see below).
When using Lasso regression you are performing linear regression with regularisation.
Without regularisation your predicted value for a given instance is of the form:
y = w0 + w1 * x1 + w2 * x2 + … + wn * xn
where y is your predicted value and your parameter vector is w = [w0, w1, w2, ..., wn] and your feature vector for the training instance is x = [x1, x2, ..., xn]. When you are performing your regression you are changing the values of the parameter vector (or 'weight vector') to get the predicted values which minimize the differences between the predicted values and the target values. This is achieved by minimizing a cost function (a measure of how far your predictions vary from the true values).
With (lasso) regularisation you simply add the l1 norm of the parameter vector to the cost function that you minimize, which helps keep the values of the parameter vector as small as possible (and drives some of them to exactly zero, which is why Lasso is useful for feature selection).
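For illustration, here is a minimal sketch (with made-up data, assuming scikit-learn) showing where coef_ and intercept_ live on a fitted Lasso model:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # 100 instances, 5 features
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)        # the weights w1..wn, one per feature (some driven to 0)
print(model.intercept_)   # the w0 term, stored separately from coef_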
Related
I'm trying to implement the backward pass for an NN with a final layer of Softmax and loss function of Cross Entropy. I'm following the notes in this article (particularly the "Matrix Multiplication" section).
I'd first like to make sure I'm calculating the derivative of the error with respect to the final outputs correctly. I'm working on the MNIST classification problem, and so y represents a one-hot encoding of the target and y_hat is my predicted probabilities.
def cross_entropy(y, y_hat):
    value = np.log2(np.sum(y*y_hat))
    return value

def d_cross_entropy(y, y_hat):
    return -y/y_hat*np.log(2)
I'm a lot more confused on getting the gradient of Softmax. If we say that A = Softmax(Wx+b), then taking the gradient of A with respect to X is more difficult because Ai does not just depend on Xi but on all elements of the X vector. This means that rather than getting a simple 10-dimensional dA/dX term, I get a 10x10 matrix, which throws off the matrix multiplication. I tried taking the sum to reduce this to a 10-dimensional vector, but this seems incorrect.
def softmax(x):
    exp = np.exp(x)
    return exp/np.sum(exp)

def d_softmax(x):
    softmax_x = softmax(x)
    jacobian = np.outer(softmax_x, -softmax_x)
    adj = np.eye(x.shape[0])*softmax_x
    jacobian += adj
    return jacobian.reshape((x.shape[0], x.shape[0])).sum()
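For reference, the usual way that 10x10 Jacobian enters the chain rule is as a matrix-vector product with the upstream gradient rather than a plain sum; here is a small sketch of that idea, assuming a natural-log cross entropy rather than the log2 version above:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))           # shift for numerical stability
    return e / np.sum(e)

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)  # (n, n) matrix of dA_i/dz_j

z = np.random.randn(10)                 # logits
y = np.zeros(10); y[3] = 1.0            # one-hot target
a = softmax(z)
dL_da = -y / a                          # derivative of -sum(y*log(a)) w.r.t. a
dL_dz = softmax_jacobian(z).T @ dL_da   # chain rule: contract Jacobian with upstream gradient
print(np.allclose(dL_dz, a - y))        # the well-known simplification: True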
I'm trying to wrap my head around the categorical cross entropy loss. Looking at the implementation of the cross entropy loss in Keras:
# scale preds so that the class probas of each sample sum to 1
output = output / math_ops.reduce_sum(output, axis, True)
# Compute cross entropy from probabilities.
epsilon_ = _constant_to_tensor(epsilon(), output.dtype.base_dtype)
output = clip_ops.clip_by_value(output, epsilon_, 1. - epsilon_)
return -math_ops.reduce_sum(target * math_ops.log(output), axis)
I do not see where the delta = output - target is calculated.
See here.
What am I missing?
I think you might be confusing two different concepts / events here.
The categorical cross entropy loss is a measure of the error of your model, as calculated by:
def categorical_crossentropy(target, output, from_logits=False, axis=-1):
<etc>
This just returns an array of losses, one per sample; it is a measure of how far your model's predicted class probabilities are from the true labels.
The next step after calculating the loss (part of the forward propagation phase) is to then start backpropagation, i.e. we want to find the influence that each weight/bias matrix has on the loss you've calculated above, so that we can perform the update step.
The first step is then to calculate dL/dz, i.e. the derivative of the loss function with respect to the linear function (z = Wx + b), which itself is the product dL/da * da/dz (i.e. the derivative of the loss w.r.t. the activation times the derivative of the activation w.r.t. the linear function).
The link you posted is the derivative of the activation function w.r.t. the linear function. This blog does a decent job of explaining how all the parts fit together; the activation function they use is a sigmoid rather than a softmax, but the overall pieces fit together in the same way.
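As a concrete illustration (a small NumPy sketch with simplified stand-ins, not the Keras internals), the delta = output - target term never appears in the forward loss computation; it only shows up once you differentiate the loss through the softmax during backpropagation:

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])            # logits from the linear layer z = Wx + b
target = np.array([0.0, 1.0, 0.0])        # one-hot true label
output = softmax(z)

loss = -np.sum(target * np.log(output))   # forward pass: just a scalar loss, no delta
delta = output - target                   # backward pass: dL/da * da/dz simplifies to this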
resid = df['Actual'] - df['Predicted']
resid_mean = resid.mean()
print(resid_mean)
Output:
250.8173868583906
Is my model predicting values correctly or not?
Linear regression involves minimising the mean squared error (Q) to find the best-fitting slope (a) and intercept (b). That is, Q is minimized at the values of a and b for which ∂Q / ∂a = 0 and ∂Q / ∂b = 0.
The sum of the residuals, and therefore their mean, is always zero for the data that you regressed on; this follows directly from the ∂Q / ∂b = 0 condition above.
So, unless you are checking the residual mean on data not used in training, there appears to be some mistake in the linear regression procedure you employed.
Detailed proof available here: http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf
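A quick empirical check (a minimal sketch with made-up data, assuming scikit-learn) that the training residuals of a linear regression with an intercept average out to zero:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.3]) + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)   # fits an intercept by default
resid = y - model.predict(X)           # residuals on the training data
print(resid.mean())                    # ~0, up to floating-point error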
I'm writing a gradient descent function for a multi-class classifier using softmax. I'm a bit confused about how regularization should work in the gradient function. I've specified my matrix, X, such that the first column is populated by ones, and w is a matrix where each row corresponds to the weights of features and each column corresponds to a label. I understand that the bias term/intercept should not be regularized. However, I'm not clear on how to leave the bias term out.
Some of the code I'm learning from has the following in the function to calculate the gradient:
scores = np.dot(X, w)
predictions = softmax_function(scores)
gradient = -np.dot(X.T, y_actual - predictions) / len(y_actual)
regularizer = np.hstack((np.zeros((w.shape[0], 1)), w[:, 1:w.shape[1]]))
return (gradient, regularizer)
Then, when w is updated at the end of the epoch:
w_new = w_old - learning_rate*(gradient+regularizer*lambd)
So, here's my question. In the code above, why is hstack() used to populate the first column in the regularization term with zeros? It seems like we'd want to use vstack() to make the first row in the regularizer zeros, since the bias weights are going to be the first row.
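For reference, a shape check under the assumptions stated above (first column of X is ones, w has one row per feature including the bias row and one column per label); the names here are illustrative only:

import numpy as np

m, n, k = 5, 3, 4                                          # samples, features, classes
X = np.hstack([np.ones((m, 1)), np.random.randn(m, n)])    # shape (m, n+1); column 0 is all ones
w = np.random.randn(n + 1, k)                              # rows = features (row 0 pairs with the ones column), columns = labels

scores = X @ w                                             # shape (m, k)

# The regularizer from the snippet zeroes out the first *column* of w:
regularizer = np.hstack((np.zeros((w.shape[0], 1)), w[:, 1:]))
print(scores.shape, regularizer.shape, (regularizer[:, 0] == 0).all())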
I tried to compare the logistic regression results from statsmodels with sklearn's LogisticRegression results. Actually, I tried to compare with the R results as well.
I set C=1e6 (effectively no penalty), but I got almost the same coefficients except for the intercept.
model = sm.Logit(Y, X).fit()
print(model.summary())
==> intercept = 5.4020
model = LogisticRegression(C=1e6,fit_intercept=False)
model = model.fit(X, Y)
==> intercept = 2.4508
So I read the user guide, where it says: "Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function."
What does this mean? Is this why sklearn's LogisticRegression gave a different intercept value?
Please help me.
LogisticRegression is in some aspects similar to the Perceptron Model and LinearRegression.
You multiply your weights with the data points and compare the result to a threshold value b:
w_1 * x_1 + ... + w_n*x_n > b
This can be rewritten as:
-b + w_1 * x_1 + ... + w_n*x_n > 0
or
w_0 * 1 + w_1 * x_1 + ... + w_n*x_n > 0
For linear regression we keep this as it is; for the perceptron we feed it to a chosen function; and here, for logistic regression, we pass it to the logistic function.
Instead of learning n parameters, n+1 are now learned. For the perceptron the extra parameter is called the bias; for regression it is the intercept.
For linear regression it's easy to understand geometrically. In the 2D case you can think about this as shifting the decision boundary by w_0 in the y direction.
or y = m*x vs y = m*x + c
So now the decision boundary does not go through (0,0) anymore.
For the logistic function it is similar: the intercept shifts the decision boundary away from the origin.
Implementation-wise, what happens is that you add one more weight and a constant column of 1s to the X values, and then proceed as normal:
if fit_intercept:
    intercept = np.ones((X_train.shape[0], 1))
    X_train = np.hstack((intercept, X_train))

weights = np.zeros(X_train.shape[1])
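To connect this back to the question above: a sketch (using the standard statsmodels and scikit-learn APIs, with made-up data standing in for the question's X and Y) of how to make the two intercepts comparable, since sm.Logit does not add a constant column on its own:

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                                    # stand-in for the question's X
Y = (1.0 + 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=500) > 0).astype(int)

# statsmodels: the constant column has to be added explicitly
sm_model = sm.Logit(Y, sm.add_constant(X)).fit()
print(sm_model.params)                                           # first entry is the intercept

# sklearn: let it add the intercept itself, and keep the penalty negligible
sk_model = LogisticRegression(C=1e6, fit_intercept=True).fit(X, Y)
print(sk_model.intercept_, sk_model.coef_)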