Question
In CS231 Computing the Analytic Gradient with Backpropagation which is first implementing a Softmax Classifier, the gradient from (softmax + log loss) is divided by the batch size (number of data being used in a cycle of forward cost calculation and backward propagation in the training).
Please help me understand why it needs to be divided by the batch size.
The chain rule to get the gradient should be below. Where should I incorporate the division?
Derivative of Softmax loss function
Code
N = 100 # number of points per class
D = 2 # dimensionality
K = 3 # number of classes
X = np.zeros((N*K,D)) # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8') # class labels
#Train a Linear Classifier
# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
# gradient descent loop
num_examples = X.shape[0]
for i in range(200):
# evaluate class scores, [N x K]
scores = np.dot(X, W) + b
# compute the class probabilities
exp_scores = np.exp(scores)
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True) # [N x K]
# compute the loss: average cross-entropy loss and regularization
correct_logprobs = -np.log(probs[range(num_examples),y])
data_loss = np.sum(correct_logprobs)/num_examples
reg_loss = 0.5*reg*np.sum(W*W)
loss = data_loss + reg_loss
if i % 10 == 0:
print "iteration %d: loss %f" % (i, loss)
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples # <---------------------- Why?
# backpropate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # regularization gradient
# perform a parameter update
W += -step_size * dW
b += -step_size * db
It's because you are averaging the gradients instead of taking directly the sum of all the gradients.
You could of course not divide for that size, but this division has a lot of advantages. The main reason is that it's a sort of regularization (to avoid overfitting). With smaller gradients the weights cannot grow out of proportions.
And this normalization allows comparison between different configuration of batch sizes in different experiments (How can I compare two batch performances if they are dependent to the batch size?)
If you divide for that size the gradients sum it could be useful to work with greater learning rates to make the training faster.
This answer in the crossvalidated community is quite useful.
Came to notice that the dot in dW = np.dot(X.T, dscores) for the gradient at W is Σ over the num_sample instances. Since the dscore, which is probability (softmax output), was divided by the num_samples, did not understand that it was normalization for dot and sum part later in the code. Now understood divide by num_sample is required (may still work without normalization if the learning rate is trained though).
I believe the code below explains better.
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
# backpropate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores) / num_examples
db = np.sum(dscores, axis=0, keepdims=True) / num_examples
Related
I could not understand well especially how gradients were computed with regards to matrix transposes. My question is for DW2 but if you want also to discuss about the computation of the other gradients and extend my question I am open to discussion. Mathematically things seem a little bit different but this code is reliable and on github so I trust this code.
from __future__ import print_function
from builtins import range
from builtins import object
import numpy as np
import matplotlib.pyplot as plt
from past.builtins import xrange
class TwoLayerNet(object):
"""
A two-layer fully-connected neural network. The net has an input dimension of
D* (correction), a hidden layer dimension of H, and performs classification over C classes.
We train the network with a softmax loss function and L2 regularization on the
weight matrices. The network uses a ReLU nonlinearity after the first fully
connected layer.
In other words, the network has the following architecture:
input - fully connected layer - ReLU - fully connected layer - softmax
The outputs of the second fully-connected layer are the scores for each class.
"""
def __init__(self, input_size, hidden_size, output_size, std=1e-4):
"""
Initialize the model. Weights are initialized to small random values and
biases are initialized to zero. Weights and biases are stored in the
variable self.params, which is a dictionary with the following keys:
W1: First layer weights; has shape (D, H)
b1: First layer biases; has shape (H,)
W2: Second layer weights; has shape (H, C)
b2: Second layer biases; has shape (C,)
Inputs:
- input_size: The dimension D of the input data.
- hidden_size: The number of neurons H in the hidden layer.
- output_size: The number of classes C.
"""
self.params = {}
self.params['W1'] = std * np.random.randn(input_size, hidden_size)
self.params['b1'] = np.zeros(hidden_size)
self.params['W2'] = std * np.random.randn(hidden_size, output_size)
self.params['b2'] = np.zeros(output_size)
def loss(self, X, y=None, reg=0.0):
"""
Compute the loss and gradients for a two layer fully connected neural
network.
Inputs:
- X: Input data of shape (N, D). Each X[i] is a training sample.
- y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
an integer in the range 0 <= y[i] < C. This parameter is optional; if it
is not passed then we only return scores, and if it is passed then we
instead return the loss and gradients.
- reg: Regularization strength.
Returns:
If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
the score for class c on input X[i].
If y is not None, instead return a tuple of:
- loss: Loss (data loss and regularization loss) for this batch of training
samples.
- grads: Dictionary mapping parameter names to gradients of those parameters
with respect to the loss function; has the same keys as self.params.
"""
# Unpack variables from the params dictionary
W1, b1 = self.params['W1'], self.params['b1']
W2, b2 = self.params['W2'], self.params['b2']
N, D = X.shape
# Compute the forward pass
scores = None
#############################################################################
# TODO: Perform the forward pass, computing the class scores for the input. #
# Store the result in the scores variable, which should be an array of #
# shape (N, C). #
#############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# perform the forward pass and compute the class scores for the input
# input - fully connected layer - ReLU - fully connected layer - softmax
# define lamba function for relu
relu = lambda x: np.maximum(0, x)
# a1 = X x W1 = (N x D) x (D x H) = N x H
a1 = relu(X.dot(W1) + b1) # activations of fully connected layer #1
# store the result in the scores variable, which should be an array of
# shape (N, C).
# scores = a1 x W2 = (N x H) x (H x C) = N x C
scores = a1.dot(W2) + b2 # output of softmax
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# If the targets are not given then jump out, we're done
if y is None:
return scores
# Compute the loss
loss = None
#############################################################################
# TODO: Finish the forward pass, and compute the loss. This should include #
# both the data loss and L2 regularization for W1 and W2. Store the result #
# in the variable loss, which should be a scalar. Use the Softmax #
# classifier loss. #
#############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# shift values for 'scores' for numeric reasons (over-flow cautious)
# figure out the max score across all classes
# scores.shape is N x C
scores -= scores.max(axis = 1, keepdims = True)
# probs.shape is N x C
probs = np.exp(scores)/np.sum(np.exp(scores), axis = 1, keepdims = True)
loss = -np.log(probs[np.arange(N), y])
# loss is a single number
loss = np.sum(loss)
# Right now the loss is a sum over all training examples, but we want it
# to be an average instead so we divide by N.
loss /= N
# Add regularization to the loss.
loss += reg * (np.sum(W1 * W1) + np.sum(W2 * W2))
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# Backward pass: compute gradients
grads = {}
#############################################################################
# TODO: Compute the backward pass, computing the derivatives of the weights #
# and biases. Store the results in the grads dictionary. For example, #
# grads['W1'] should store the gradient on W1, and be a matrix of same size #
#############################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# since dL(i)/df(k) = p(k) - 1 (if k = y[i]), where f is a vector of scores for the given example
# i is the training sample and k is the class
dscores = probs.reshape(N, -1) # dscores is (N x C)
dscores[np.arange(N), y] -= 1
# since scores = a1.dot(W2), we get dW2 by multiplying a1.T and dscores
# W2 is H x C so dW2 should also match those dimensions
# a1.T x dscores = (H x N) x (N x C) = H x C
dW2 = np.dot(a1.T, dscores)
# Right now the gradient is a sum over all training examples, but we want it
# to be an average instead so we divide by N.
dW2 /= N
# b2 gradient: sum dscores over all N and C
db2 = dscores.sum(axis = 0)/N
# since a1 = X.dot(W1), we get dW1 by multiplying X.T and da1
# W1 is D x H so dW1 should also match those dimensions
# X.T x da1 = (D x N) x (N x H) = D x H
# first get da1 using scores = a1.dot(W2)
# a1 is N x H so da1 should also match those dimensions
# dscores x W2.T = (N x C) x (C x H) = N x H
da1 = dscores.dot(W2.T)
da1[a1 == 0] = 0 # set gradient of units that did not activate to 0
dW1 = X.T.dot(da1)
# Right now the gradient is a sum over all training examples, but we want it
# to be an average instead so we divide by N.
dW1 /= N
# b1 gradient: sum da1 over all N and H
db1 = da1.sum(axis = 0)/N
# Add regularization loss to the gradient
dW1 += 2 * reg * W1
dW2 += 2 * reg * W2
grads = {'W1': dW1, 'b1': db1, 'W2': dW2, 'b2': db2}
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
return loss, grads
def train(self, X, y, X_val, y_val,
learning_rate=1e-3, learning_rate_decay=0.95,
reg=5e-6, num_iters=100,
batch_size=200, verbose=False):
"""
Train this neural network using stochastic gradient descent.
Inputs:
- X: A numpy array of shape (N, D) giving training data.
- y: A numpy array f shape (N,) giving training labels; y[i] = c means that
X[i] has label c, where 0 <= c < C.
- X_val: A numpy array of shape (N_val, D) giving validation data.
- y_val: A numpy array of shape (N_val,) giving validation labels.
- learning_rate: Scalar giving learning rate for optimization.
- learning_rate_decay: Scalar giving factor used to decay the learning rate
after each epoch.
- reg: Scalar giving regularization strength.
- num_iters: Number of steps to take when optimizing.
- batch_size: Number of training examples to use per step.
- verbose: boolean; if true print progress during optimization.
"""
num_train = X.shape[0]
iterations_per_epoch = max(num_train / batch_size, 1)
# Use SGD to optimize the parameters in self.model
loss_history = []
train_acc_history = []
val_acc_history = []
for it in range(num_iters):
X_batch = None
y_batch = None
#########################################################################
# TODO: Create a random minibatch of training data and labels, storing #
# them in X_batch and y_batch respectively. #
#########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# generate random indices
indices = np.random.choice(num_train, batch_size)
X_batch, y_batch = X[indices], y[indices]
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# Compute loss and gradients using the current minibatch
loss, grads = self.loss(X_batch, y=y_batch, reg=reg)
loss_history.append(loss)
#########################################################################
# TODO: Use the gradients in the grads dictionary to update the #
# parameters of the network (stored in the dictionary self.params) #
# using stochastic gradient descent. You'll need to use the gradients #
# stored in the grads dictionary defined above. #
#########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
self.params['W1'] -= learning_rate * grads['W1']
self.params['W2'] -= learning_rate * grads['W2']
self.params['b1'] -= learning_rate * grads['b1']
self.params['b2'] -= learning_rate * grads['b2']
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
if verbose and it % 100 == 0:
print('iteration %d / %d: loss %f' % (it, num_iters, loss))
# Every epoch, check train and val accuracy and decay learning rate.
if it % iterations_per_epoch == 0:
# Check accuracy
train_acc = (self.predict(X_batch) == y_batch).mean()
val_acc = (self.predict(X_val) == y_val).mean()
train_acc_history.append(train_acc)
val_acc_history.append(val_acc)
# Decay learning rate
learning_rate *= learning_rate_decay
return {
'loss_history': loss_history,
'train_acc_history': train_acc_history,
'val_acc_history': val_acc_history,
}
def predict(self, X):
"""
Use the trained weights of this two-layer network to predict labels for
data points. For each data point we predict scores for each of the C
classes, and assign each data point to the class with the highest score.
Inputs:
- X: A numpy array of shape (N, D) giving N D-dimensional data points to
classify.
Returns:
- y_pred: A numpy array of shape (N,) giving predicted labels for each of
the elements of X. For all i, y_pred[i] = c means that X[i] is predicted
to have class c, where 0 <= c < C.
"""
y_pred = None
###########################################################################
# TODO: Implement this function; it should be VERY simple! #
###########################################################################
# *****START OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
# define lamba function for relu
relu = lambda x: np.maximum(0, x)
# activations of fully connected layer #1
a1 = relu(X.dot(self.params['W1']) + self.params['b1'])
# output of softmax
# scores = a1 x W2 = (N x H) x (H x C) = N x C
scores = a1.dot(self.params['W2']) + self.params['b2']
y_pred = np.argmax(scores, axis = 1)
# *****END OF YOUR CODE (DO NOT DELETE/MODIFY THIS LINE)*****
return y_pred
With regards to above code, I could not understand how DW2 was computed well. I took picture of the point I need to clarify and need an explanation for the difference.enter image description here
My ideas
I am new to PyTorch and I would like to implement linear regression partly with PyTorch and partly on my own. I want to use squared features for my regression:
import torch
# init
x = torch.tensor([1,2,3,4,5])
y = torch.tensor([[1],[4],[9],[16],[25]])
w = torch.tensor([[0.5], [0.5], [0.5]], requires_grad=True)
iterations = 30
alpha = 0.01
def forward(X):
# feature transformation [1, x, x^2]
psi = torch.tensor([[1.0, x[0], x[0]**2]])
for i in range(1, len(X)):
psi = torch.cat((psi, torch.tensor([[1.0, x[i], x[i]**2]])), 0)
return torch.matmul(psi, w)
def loss(y, y_hat):
return ((y-y_hat)**2).mean()
for i in range(iterations):
y_hat = forward(x)
l = loss(y, y_hat)
l.backward()
with torch.no_grad():
w -= alpha * w.grad
w.grad.zero_()
if i%10 == 0:
print(f'Iteration {i}: The weight is:\n{w.detach().numpy()}\nThe loss is:{l}\n')
When I execute my code, the regression doesn't learn the correct features and the loss increases permanently. The output is the following:
Iteration 0: The weight is:
[[0.57 ]
[0.81 ]
[1.898]]
The loss is:25.450000762939453
Iteration 10: The weight is:
[[ 5529.5835]
[22452.398 ]
[97326.12 ]]
The loss is:210414632960.0
Iteration 20: The weight is:
[[5.0884394e+08]
[2.0662339e+09]
[8.9567642e+09]]
The loss is:1.7820802835250162e+21
Does somebody know, why my model is not learning?
UPDATE
Is there a reason why it performs so poorly? I thought it's because of the low number of training data. But also with 10 data points, it is not performing well :
You should normalize your data. Also, since you're trying to fit x -> ax² + bx + c, c is essentially the bias. It should be wiser to remove it from the training data (I'm referring to psi here) and use a separate parameter for the bias.
What could be done:
normalize your input data and targets with mean and standard deviation.
separate the parameters into w (a two-component weight tensor) and b (the bias).
you don't need to construct psi on every inference since x is identical.
you can build psi with torch.stack([torch.ones_like(x), x, x**2], 1), but here we won't need the ones, as we've essentially detached the bias from the weight tensor.
Here's how it would look like:
x = torch.tensor([1,2,3,4,5]).float()
psi = torch.stack([x, x**2], 1).float()
psi = (psi - psi.mean(0)) / psi.std(0)
y = torch.tensor([[1],[4],[9],[16],[25]]).float()
y = (y - y.mean(0)) / y.std(0)
w = torch.tensor([[0.5], [0.5]], requires_grad=True)
b = torch.tensor([0.5], requires_grad=True)
iterations = 30
alpha = 0.02
def loss(y, y_hat):
return ((y-y_hat)**2).mean()
for i in range(iterations):
y_hat = torch.matmul(psi, w) + b
l = loss(y, y_hat)
l.backward()
with torch.no_grad():
w -= alpha * w.grad
b -= alpha * b.grad
w.grad.zero_()
b.grad.zero_()
if i%10 == 0:
print(f'Iteration {i}: The weight is:\n{w.detach().numpy()}\nThe loss is:{l}\n')
And the results:
Iteration 0: The weight is:
[[0.49954653]
[0.5004535 ]]
The loss is:0.25755801796913147
Iteration 10: The weight is:
[[0.49503425]
[0.5049657 ]]
The loss is:0.07994867861270905
Iteration 20: The weight is:
[[0.49056274]
[0.50943726]]
The loss is:0.028329044580459595
I have had used linear regression using ML packages in python, but for sake of self gratification, I coded it from scratch. The loss starts at around 0.90 and keeps increasing (not learning) for some reason. I do not understand what mistake I may have committed.
Standardised the dataset as part of preprocessing
Initialise weight matrix with MLE estimate for parameter W i.e., (X^TX)^-1X^TY
Compute the output
Calculate gradient of loss function SSE (Sum of Squared Error) wrt param W and bias B
Use the gradients to update the parameters using gradient descent.
import preprocess as pre
import numpy as np
import matplotlib.pyplot as plt
data = pre.load_file('airfoil_self_noise.dat')
data = pre.organise(data,"\t","\r\n")
data = pre.standardise(data,data.shape[1])
t = np.reshape(data[:,5],[-1,1])
data = data[:,:5]
N = data.shape[0]
M = 5
lr = 1e-3
# W = np.random.random([M,1])
W = np.dot(np.dot(np.linalg.inv(np.dot(data.T,data)),data.T),t)
data = data.T # Examples are arranged in columns [features,N]
b = np.random.rand()
epochs = 1000000
loss = np.zeros([epochs])
for epoch in range(epochs):
if epoch%1000 == 0:
lr /= 10
# Obtain the output
y = np.dot(W.T,data).T + b
sse = np.dot((t-y).T,(t-y))
loss[epoch]= sse/N
var = sse/N
# log likelihood
ll = (-N/2)*(np.log(2*np.pi))-(N*np.log(np.sqrt(var)))-(sse/(2*var))
# Gradient Descent
W_grad = np.zeros([M,1])
B_grad = 0
for i in range(N):
err = (t[i]-y[i])
W_grad += err * np.reshape(data[:,i],[-1,1])
B_grad += err
W_grad /= N
B_grad /= N
W += lr * W_grad
b += lr * B_grad
print("Epoch: %d, Loss: %.3f, Log-Likelihood: %.3f"%(epoch,loss[epoch],ll))
plt.figure()
plt.plot(range(epochs),loss,'-r')
plt.show()
Now if you run the above code you are likely not to find anything wrong since I am doing W += lr * W_grad instead of W -= lr * W_grad. I would like to know why this is the case because it is the gradient descent formula to subtract the gradient from old weight matrix. The error constantly increase when I do it. What is that I am missing ?
Found it. The problem was I took the gradient of loss function from a slide which apparently was not right (at least it wasn't entirely wrong, instead it was already pointing to the steepest descent), which when I subtracted from weights it started pointing to the direction of greatest increase. This was what that gave rise to what I observed.
I did the partial derivative of loss function to clarify, and got this:
W_grad += data[:,i].reshape([-1,1])*(y[i]-t[i]).reshape([])
This points to the direction of greatest increase and when I multiply it with -lr it starts pointing to the steepest descent, and started working properly.
I'm trying to implement Gradient Descent (GD) (not stochastic one) for logistic regression in Python 3x. And have some troubles.
Logistic regression is defined as follows (1):
logistic regression formula
Formulas for gradients are defined as follows (2):
gradient descent for logistic regression
Description of data:
X is (Nx2)-matrix of objects (consist of positive and negative float numbers)
y is (Nx1)-vector of class labels (-1 or +1)
Task:
Implement gradient descent 1) with L2-regularization; and 2) without regularization. Desired results: vectors of weights.
Parameters: regularization rate C=10 for regularized regression and C=0 for unregularized regression; gradient step k=0.1; max.number of iterations = 10000; tolerance = 1e-5.
Note: GD is converged if distance between weighs vectors from current and previous steps is less than tolerance (1e-5).
Here is my implementation:
k - gradient step;
C - regularization rate.
import numpy as np
def sigmoid(z):
result = 1./(1. + np.exp(-z))
return result
def distance(vector1, vector2):
vector1 = np.array(vector1, dtype='f')
vector2 = np.array(vector2, dtype='f')
return np.linalg.norm(vector1-vector2)
def GD(X, y, C, k=0.1, tolerance=1e-5, max_iter=10000):
X = np.matrix(X)
y = np.matrix(y)
l=len(X)
w1, w2 = 0., 0. # weights (look formula (2) in the beginning of question)
difference = 1.
iteration = 1
while(difference > tolerance):
hypothesis = y*(X*np.matrix([w1, w2]).T)
w1_updated = w1 + (k/l)*np.sum(y*X[:,0]*(1.-(sigmoid(hypothesis)))) - k*C*w1
w2_updated = w2 + (k/l)*np.sum(y*X[:,1]*(1.-(sigmoid(hypothesis)))) - k*C*w2
difference = distance([w1, w2], [w1_updated, w2_updated])
w1, w2 = w1_updated, w2_updated
if(iteration >= max_iter):
break;
iteration = iteration + 1
return [w1_updated, w2_updated] #vector of weights
Respectively:
# call for UNregularized GD: C=0
w = GD(X, y, C=0., k=0.1)
and
# call for regularized GD: C=10
w_reg = GD(X, y, C=10., k=0.1)
Here are the resuls (weights-vectors):
# UNregularized GD
[0.035736331265589463, 0.032464572442830832]
# regularized GD
[5.0979561973044096e-06, 4.6312243707352652e-06]
However, it should be (right answers for self-control):
# UNregularized GD
[0.28801877, 0.09179177]
# regularized GD
[0.02855938, 0.02478083]
!!! Please, can you tell me whats going wrong here? I'm sitting with this problem for three days in a row and still have no idea.
Thank you in advance.
First of all, the sigmoid functions should be
def sigmoid(Z):
A=1/(1+np.exp(-Z))
return A
Try to run it again with this formula. Then, what is L?
I am watching some videos for Stanford CS231: Convolutional Neural Networks for Visual Recognition but do not quite understand how to calculate analytical gradient for softmax loss function using numpy.
From this stackexchange answer, softmax gradient is calculated as:
Python implementation for above is:
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
for j in range(num_classes):
p = np.exp(f_i[j])/sum_i
dW[j, :] += (p-(j == y[i])) * X[:, i]
Could anyone explain how the above snippet work? Detailed implementation for softmax is also included below.
def softmax_loss_naive(W, X, y, reg):
"""
Softmax loss function, naive implementation (with loops)
Inputs:
- W: C x D array of weights
- X: D x N array of data. Data are D-dimensional columns
- y: 1-dimensional array of length N with labels 0...K-1, for K classes
- reg: (float) regularization strength
Returns:
a tuple of:
- loss as single float
- gradient with respect to weights W, an array of same size as W
"""
# Initialize the loss and gradient to zero.
loss = 0.0
dW = np.zeros_like(W)
#############################################################################
# Compute the softmax loss and its gradient using explicit loops. #
# Store the loss in loss and the gradient in dW. If you are not careful #
# here, it is easy to run into numeric instability. Don't forget the #
# regularization! #
#############################################################################
# Get shapes
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
# Compute vector of scores
f_i = W.dot(X[:, i]) # in R^{num_classes}
# Normalization trick to avoid numerical instability, per http://cs231n.github.io/linear-classify/#softmax
log_c = np.max(f_i)
f_i -= log_c
# Compute loss (and add to it, divided later)
# L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
sum_i = 0.0
for f_i_j in f_i:
sum_i += np.exp(f_i_j)
loss += -f_i[y[i]] + np.log(sum_i)
# Compute gradient
# dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j)-Ind{y_i = j} )]
# Here we are computing the contribution to the inner sum for a given i.
for j in range(num_classes):
p = np.exp(f_i[j])/sum_i
dW[j, :] += (p-(j == y[i])) * X[:, i]
# Compute average
loss /= num_train
dW /= num_train
# Regularization
loss += 0.5 * reg * np.sum(W * W)
dW += reg*W
return loss, dW
Not sure if this helps, but:
is really the indicator function , as described here. This forms the expression (j == y[i]) in the code.
Also, the gradient of the loss with respect to the weights is:
where
which is the origin of the X[:,i] in the code.
I know this is late but here's my answer:
I'm assuming you are familiar with the cs231n Softmax loss function.
We know that:
So just as we did with the SVM loss function the gradients are as follows:
Hope that helped.
A supplement to this answer with a small example.
I came across this post and still was not 100% clear how to arrive at the partial derivatives.
For that reason I took another approach to get to the same results - maybe it is helpful to others too.