Compute the gradient of the SVM loss function - python

I am trying to implement the SVM loss function and its gradient.
I found some example projects that implement these two, but I could not figure out how they use the loss function when computing the gradient.
Here is the formula of the loss function (reconstructed from the code below; s = X[i].dot(W) are the class scores for sample i):

L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + delta),   with delta = 1

What I cannot understand is how I can use the loss function's result while computing the gradient.
The example project computes the gradient as follows:
for i in range(num_train):
    scores = X[i].dot(W)                      # class scores for sample i
    correct_class_score = scores[y[i]]
    for j in range(num_classes):
        if j == y[i]:
            continue                          # skip the correct class
        margin = scores[j] - correct_class_score + 1  # note delta = 1
        if margin > 0:                        # margin violated: contributes to loss
            loss += margin
            dW[:, j] += X[i]                  # gradient w.r.t. the violating class
            dW[:, y[i]] -= X[i]               # gradient w.r.t. the correct class
dW holds the gradient result, and X is the array of training data.
But I don't understand how the derivative of the loss function results in this code.

The method to calculate the gradient in this case is calculus (analytically, NOT numerically!). So we differentiate the loss function with respect to w_{y_i} like this:

dL_i / dw_{y_i} = - ( sum_{j != y_i} 1(w_j^T x_i - w_{y_i}^T x_i + delta > 0) ) x_i

and with respect to w_j when j != y_i:

dL_i / dw_j = 1(w_j^T x_i - w_{y_i}^T x_i + delta > 0) x_i

The 1(...) is just the indicator function: it is 1 when the condition inside is true and 0 otherwise, so a term only contributes when the margin is violated. When you write this in code, the example you provided is the answer.
Since you are using the cs231n example, you should definitely check the course notes and videos if needed.
Hope this helps!
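A good way to convince yourself that the analytic gradient matches the code is a numerical gradient check. A minimal sketch, assuming a hypothetical svm_loss(W, X, y) that wraps the loop above and returns (loss, dW):

import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    # Central-difference estimate of d loss_fn / dW, one entry at a time.
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old = W[ix]
        W[ix] = old + h
        f_plus = loss_fn(W)
        W[ix] = old - h
        f_minus = loss_fn(W)
        W[ix] = old                               # restore the original entry
        grad[ix] = (f_plus - f_minus) / (2 * h)
        it.iternext()
    return grad

# num_dW = numerical_gradient(lambda W: svm_loss(W, X, y)[0], W)
# print(np.max(np.abs(num_dW - dW)))              # should be close to 0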

If the margin is less than or equal to zero, the loss contribution is zero, so the gradient of W is also zero. If the margin is greater than zero, then the gradient of W is the partial derivative of the loss.

If we don't keep these two lines of code:

    dW[:, j] += X[i]
    dW[:, y[i]] -= X[i]

we get only the loss value, with no gradient.

Related

Build Linear Regression model in Python without SK Learn

I am trying to build a Linear Regression model without using the SK learn package. For a single independent variable I have the code below (given by my professor):
import numpy as np

def get_gradient(w, x, y):
    y_estimate = (np.power(x, 1).dot(w)).flatten()           # hypothesis
    error = (y.flatten() - y_estimate)
    mse = (1.0 / len(x)) * np.sum(np.power(error, 2))        # mean squared error
    gradient = -(1.0 / len(x)) * error.dot(np.power(x, 1))   # gradient
    return gradient, mse
w = np.random.randn(2)   # random initialization
alpha = 0.5              # learning rate
tolerance = 1e-3         # param for stopping the loop
print("Initial values of Weights:")
print(w[1], w[0])
# Perform Gradient Descent
iterations = 1
while True:
    gradient, error = get_gradient(w, train_x, train_y)
    new_w = w - alpha * gradient
    # Stopping condition
    if np.sum(abs(new_w - w)) < tolerance:
        print("Converged")
        break
    # Print error every 10 iterations
    if iterations % 10 == 0:
        print("Iteration: %d - Error: %.4f" % (iterations, error))
        print("Updated Weights : {:f} , {:f}".format(w[1], w[0]))
    iterations += 1
    w = new_w
print("Final Weights : {:f} , {:f}".format(w[1], w[0]))
print("Test Cost =", get_gradient(w, test_x, test_y)[1])
This works fine with one variable (plus the intercept). Note: x and y were each converted to NumPy arrays for the purposes of this code.
But now I wish to use this on the Boston Housing Dataset which has 13 Independent variables/features.
I cannot figure out how to use this for multiple variables. Specifically, I cannot figure out how to:
Have each weight work with only the values in the column of its own variable, apply gradient descent, and recalculate the weight. If I run the gradient descent function with 13 weights inside the weights array, it ends up (I think) multiplying all 13 weights with ALL rows in EACH variable. I think this is what is happening because, initially, I got 'error' as infinity after only a few iterations, and the program was thrown into an infinite loop.
I tried running a for loop over each weight in the weights array (13 variables) and THEN calculating the gradient for each of the 13 variables. I then put each weight in a list (later converted to a NumPy array) before calling the get_gradient function in the while loop. This worked at the time, but the error reached infinity and the program went into an infinite loop just like before. In any case, this should not even be a solution, since by using a for loop I am making the first variable receive all the relevant weight first.
I also tried dividing all the weights and errors by 13, just to get the program to run so I could tweak it from there. But no success there either.
Strangely, I cannot replicate the for-loop version anymore; I keep getting one error or another - shape errors, value errors, etc.
Please help.
If any further information is required - please do let me know.
Thank you so much!
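A minimal sketch of the usual fix, under the assumption that train_x is a (P, 13) array of raw features and train_y a length-P vector: error.dot(x) in get_gradient already produces one partial derivative per column, so the same function handles 13 features once w has one entry per column and x carries an intercept column. Divergence like the infinite error described above is typical when columns are unscaled, so they are standardized first:

import numpy as np

# Assumed shapes: train_x is (P, 13), train_y is (P,).
X = (train_x - train_x.mean(axis=0)) / train_x.std(axis=0)  # standardize each column
X = np.hstack([np.ones((X.shape[0], 1)), X])                # prepend intercept column -> (P, 14)
w = np.random.randn(X.shape[1])                             # one weight per column

# The existing while loop then works unchanged; one step looks like:
gradient, error = get_gradient(w, X, train_y)  # gradient has shape (14,), one entry per weight
w = w - alpha * gradient                       # alpha may need to be smaller than 0.5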

Stochastic Gradient Descent vs. Gradient Descent for the x**2 function

I would like to understand the difference between SGD and GD using the simplest example of a function: y = x**2
The GD function is here:
def gradient_descent(
    gradient, start, learn_rate, n_iter=50, tolerance=1e-06
):
    vector = start
    for _ in range(n_iter):
        diff = -learn_rate * gradient(vector)
        if np.all(np.abs(diff) <= tolerance):
            break
        vector += diff
    return vector
And in order to find the minimum of the x**2 function we should do the following (the answer is almost 0, which is correct):

gradient_descent(gradient=lambda v: 2 * v, start=10.0, learn_rate=0.2)
As I understand it, in classical GD the gradient is calculated exactly from all the data points. What are "all the data points" in the implementation I showed above?
And further, how should we modify this function in order to call it SGD? (SGD uses one single data point for calculating the gradient. Where is the "one single point" in the gradient_descent function?)
The function minimized in your example does not depend on any data, so it is not helpful to illustrate the difference between GD and SGD.
Consider this example:
import numpy as np

rng = np.random.default_rng(7263)
y = rng.normal(loc=10, scale=4, size=100)

def loss(y, mean):
    return 0.5 * ((y - mean)**2).sum()

def gradient(y, mean):
    return (mean - y).sum()

def mean_gd(y, learning_rate=0.005, n_iter=15, start=0):
    """Estimate the mean of y using gradient descent"""
    mean = start
    for i in range(n_iter):
        mean -= learning_rate * gradient(y, mean)
        print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
    return mean

def mean_sgd(y, learning_rate=0.005, n_iter=15, start=0):
    """Estimate the mean of y using stochastic gradient descent"""
    mean = start
    for i in range(n_iter):
        rng.shuffle(y)
        for single_point in y:
            mean -= learning_rate * gradient(single_point, mean)
        print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
    return mean

mean_gd(y)
mean_sgd(y)
y.mean()
Two (very simplistic) versions of GD and SGD are used to estimate the mean of a random sample y. Estimating the mean is achieved by minimizing the squared loss.
As you understood correctly, in GD each update uses the gradient computed on the whole dataset and in SGD we look at a single random point at a time.
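For completeness, mini-batch SGD sits between these two extremes, updating on small random subsets. A minimal sketch in the same style, reusing rng, gradient, and loss from above (the batch_size value is an arbitrary choice):

def mean_minibatch_sgd(y, learning_rate=0.005, n_iter=15, start=0, batch_size=10):
    """Estimate the mean of y using mini-batch stochastic gradient descent."""
    mean = start
    for i in range(n_iter):
        rng.shuffle(y)
        for k in range(0, len(y), batch_size):
            batch = y[k:k + batch_size]
            # gradient() sums over the batch; divide to get the average gradient
            mean -= learning_rate * gradient(batch, mean) / len(batch)
        print(f'Iter {i} mean {mean:0.2f} loss {loss(y, mean):0.2f}')
    return mean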

Vectorized SVM gradient

I was going through the code for the SVM loss and derivative. I understand the loss, but I cannot understand how the gradient is being computed in a vectorized manner:
def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_train = X.shape[0]
    scores = X.dot(W)
    yi_scores = scores[np.arange(scores.shape[0]), y]
    margins = np.maximum(0, scores - np.matrix(yi_scores).T + 1)
    margins[np.arange(num_train), y] = 0
    loss = np.mean(np.sum(margins, axis=1))
    loss += 0.5 * reg * np.sum(W * W)
I understand everything up to here. After this point, I cannot understand why we sum the binary matrix row-wise and subtract that sum in the correct-class column:
    binary = margins
    binary[margins > 0] = 1
    row_sum = np.sum(binary, axis=1)
    binary[np.arange(num_train), y] = -row_sum.T
    dW = np.dot(X.T, binary)
    # Average
    dW /= num_train
    # Regularize
    dW += reg * W
    return loss, dW
Let us recap the scenario and the loss function first, so we are on the same page:
Given are P sample points in N-dimensional space in the form of a PxN matrix X, so the points are the rows of this matrix. Each point in X is assigned to one out of M categories. These are given as a vector Y of length P that has integer values between 0 and M-1.
The goal is to predict the classes of all points by M linear classifiers (one for each category) given in the form of an NxM weight matrix W, so the classifiers are the columns of W. To predict the categories of all samples X, the scalar products between all points and all weight vectors are formed, which is the same as matrix multiplying X and W, yielding a score matrix Y0 whose rows are ordered like the elements of Y; each row corresponds to one sample. The predicted category for each sample is simply the one with the largest score.
There are no bias terms so I presume there is some kind of symmetry or zero mean assumption.
Now, to find a good set of weights we want a loss function that is small for good predictions and large for bad predictions, and that lets us do gradient descent. One of the most straightforward ways is to punish, for each sample i, each category whose score comes within the margin delta = 1 of the correct category's score (or exceeds it), and to let the penalty grow linearly with the violation. So if we write A[i] for the set of violating categories j, i.e. those with Y0[i, j] + 1 > Y0[i, Y[i]], the loss for sample i can be written as

sum_{j in A[i]} (Y0[i, j] - Y0[i, Y[i]] + 1)

or equivalently, if we write #A[i] for the number of elements in A[i],

(sum_{j in A[i]} (Y0[i, j] + 1)) - #A[i] * Y0[i, Y[i]]
The partial derivatives with respect to the scores are thus simply

                    | -#A[i]   if j == Y[i]
dloss / dY0[i, j] = {  1       if j in A[i]
                    |  0       otherwise
which is precisely what the first four lines you say you don't understand compute.
The next line applies the chain rule dloss/dW = dloss/dY0 dY0/dW.
It remains to divide by the number of samples to get a per-sample loss, and to add the derivative of the regularization term, which is easy since the regularization is just a componentwise quadratic function.
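To make this concrete, here is a tiny example with made-up numbers:

import numpy as np

# Toy setup: P = 2 samples, M = 3 classes.
Y0 = np.array([[3.0, 5.0, 1.0],    # scores for sample 0 (correct class: 0)
               [2.0, 4.5, 4.0]])   # scores for sample 1 (correct class: 2)
Y = np.array([0, 2])

margins = np.maximum(0, Y0 - Y0[np.arange(2), Y][:, None] + 1)
margins[np.arange(2), Y] = 0
# margins == [[0., 3. , 0.],       class 1 violates the margin for sample 0
#             [0., 1.5, 0.]]       class 1 violates the margin for sample 1

binary = (margins > 0).astype(float)
row_sum = binary.sum(axis=1)           # #A[i] for each sample: [1., 1.]
binary[np.arange(2), Y] = -row_sum     # -#A[i] goes into the correct-class column
# binary == [[-1., 1.,  0.],
#            [ 0., 1., -1.]]   == dloss/dY0, exactly the piecewise formula above
# dW then follows as X.T.dot(binary) / num_train (the chain rule step).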
Personally, I found it much easier to understand the whole gradient calculation by looking at the analytic derivation of the loss function in more detail. To extend the answer above, I would like to point to the derivatives of the loss function with respect to the weights:

1) Loss gradient w.r.t. w_yi (the correct class):

dL_i / dw_{y_i} = - ( sum_{j != y_i} 1(w_j^T x_i - w_{y_i}^T x_i + delta > 0) ) x_i

Hence, we count the cases where w_j does not meet the margin requirement and sum those cases up. The negative of this sum is then used as the value at the position of the correct class, w_yi. (We later need to multiply this value by x_i; this is what your code does in line 5.)
2) Loss gradient w.r.t. w_j (the incorrect classes):

dL_i / dw_j = 1(w_j^T x_i - w_{y_i}^T x_i + delta > 0) x_i

where 1 is the indicator function: 1 if the condition is true, else 0.
In other words, "programmatically" we need to apply equation (2) to all cases where the margin requirement is not met, and add the negative sum of all unmet requirements to the true-class column (as in (1)).
So what you do in the first 4 lines of your code is to determine the cases where the margin is not met, and to add the negative sum of these cases to the correct-class column (the one indexed by y). In the 5th line you do the final step, multiplying by the x_i's, which completes the gradient calculation as in (1) and (2).
I hope this makes it easier to understand; let me know if anything remains unclear. (source)

Gradient Descent Algorithm in Python

I am trying to write a gradient descent function in python as part of a multivariate linear regression exercise. It runs, but does not compute the correct answer. My code is below. I've been trying for weeks to finish this problem but have made zero progress.
I believe that I understand the concept of gradient descent to optimize a multivariate linear regression function and also that the 'math' is correct. I believe that the error is in my code, but I am still learning python. Your help is very much appreciated.
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    from math import sqrt
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        predictions = np.dot(feature_matrix, weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            derivative = -2 * np.dot(errors[i], feature_matrix[i])
            gradient_sum_squares = gradient_sum_squares + np.dot(derivative, derivative)
            weights[i] = weights[i] - step_size * derivative[i]
        gradient_magnitude = sqrt(gradient_sum_squares)
        print(gradient_magnitude)
        if gradient_magnitude < tolerance:
            converged = True
    return weights
Feature matrix is:
sales = gl.SFrame.read_csv('kc_house_data.csv', column_type_hints={
    'bathrooms': float, 'waterfront': int, 'sqft_above': int, 'sqft_living15': float,
    'grade': int, 'yr_renovated': int, 'price': float, 'bedrooms': float, 'zipcode': str,
    'long': float, 'sqft_lot15': float, 'sqft_living': float, 'floors': str, 'condition': int,
    'lat': float, 'date': str, 'sqft_basement': int, 'yr_built': int, 'id': str,
    'sqft_lot': int, 'view': int})
I'm calling the function as:
train_data, test_data = sales.random_split(.8, seed=0)
simple_features = ['sqft_living']
my_output = 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
simple_weights = regression_gradient_descent(simple_feature_matrix, output, initial_weights, step_size, tolerance)
Note: get_numpy_data is just a function to convert everything into arrays, and it works as intended.
Update: I fixed the formula to

derivative = 2 * np.dot(errors, feature_matrix)

and it seems to have worked. The derivation of this formula in my online course used

-2 * np.dot(errors, feature_matrix)

and I'm not sure why that formula did not provide the correct answer.
The step size seems too small, and the tolerance unusually big. Perhaps you meant to use them the other way around?
In general, the step size is determined by a trial-and-error procedure: the "natural" step size α = 1 might lead to divergence, so one could try lowering the value (e.g. taking α = 1/2, α = 1/4, etc.) until convergence is achieved. Don't start with a very small step size.
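A minimal sketch of that trial-and-error procedure, where run_gd is a hypothetical helper that runs gradient descent with the given step size and reports whether it converged:

alpha = 1.0
while True:
    converged, final_error = run_gd(alpha)  # hypothetical: returns (bool, float)
    if converged:
        break
    alpha /= 2                              # diverged: halve the step size and retry
print("step size:", alpha)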

What's wrong with the ridge regression gradient descent function?

I coded the Python function, but the prediction does not match reality: the price it predicts is negative. However, I can't find where it is wrong. Is my computation of derivative[i] and weights[i] correct? Please help.
The following is a helper function used by the gradient descent function in the picture:
def feature_derivative_ridge(errors, feature, weight, l2_penalty, feature_is_constant):
    # If feature_is_constant is True, derivative is twice the dot product of errors and feature
    if feature_is_constant:
        derivative = 2 * np.dot(errors, feature)
    # Otherwise, derivative is twice the dot product plus 2*l2_penalty*weight
    else:
        derivative = 2 * np.dot(errors, feature) + 2 * l2_penalty * weight
    return derivative
Oh, I have found the answer.
First:

errors = output - predictions

should be:

errors = predictions - output

Then:

weights[i] = (1 - ...

should be:

weights[i] = weights[i] - step_size * derivative[i]

(recall the formula)
Finally, the output is correct.
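For reference, a minimal sketch of the full descent loop with both fixes applied (a reconstruction, not the function from the picture; max_iterations and treating column 0 as the intercept are assumptions):

import numpy as np

def ridge_regression_gradient_descent(feature_matrix, output, initial_weights,
                                      step_size, l2_penalty, max_iterations=100):
    weights = np.array(initial_weights, dtype=float)
    for _ in range(max_iterations):
        predictions = feature_matrix.dot(weights)
        errors = predictions - output                    # fix 1
        for i in range(len(weights)):
            is_constant = (i == 0)                       # assumed intercept column
            derivative = feature_derivative_ridge(errors, feature_matrix[:, i],
                                                  weights[i], l2_penalty, is_constant)
            weights[i] = weights[i] - step_size * derivative  # fix 2
    return weights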
