After IVlad gave me really useful feedback, I tried modifying my code; the modified part now looks like this:
syn0 = (2*np.random.random((784,len(train_sample))) - 1)/8
syn1 = (2*np.random.random((len(train_sample),10)) - 1)/8

for i in xrange(10000):
    # forward propagation
    l0 = train_sample
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))

    # calculate error
    l2_error = train_tag_bool - l2

    if (i % 1000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))

    # multiply the error by the slope of the sigmoid at the current values
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)

    # update weights with learning rate alpha and weight decay beta
    syn1 += alpha * (l1.T.dot(l2_delta) - beta*syn1)
    syn0 += alpha * (l0.T.dot(l1_delta) - beta*syn0)
Note that the tags (true labels) are now in a 3000 x 10 matrix: each row is a sample and the ten columns indicate which digit that sample represents. (The variable is called train_tag_bool; now that I think about it, it isn't really in boolean format, so the naming is a bit off, but for the sake of the discussion I'll keep it this way for now.)
In this project I'm using only one hidden layer between the input and output layers, hoping it will be sufficient to do the job. I have added a learning rate and weight decay, and made the initial weights a bit smaller.
I used the code from the website to calculate the error rate, which is
np.mean(np.abs(l2_error))
and the result came out to be 0.1. I'm not sure what to make of it.
Also, I looked into the l2 layer (supposedly the output layer that gives the prediction), and the values are all extremely small (< 10^-9 for the largest value of each sample, and the smallest can reach 10^-85). This is after only 5 iterations, though, but I doubt things would be any different after 1k loops or more. If I take the max of each row, it's always the 9th element (representing the digit '9'), which is totally wrong.
I'm stuck on this problem again. Overflow is and has been the biggest challenge of my whole ML experience (back then with MATLAB, not NumPy), and I've yet to find a way to deal with it.
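As an aside on the overflow issue, one common trick is to clip the sigmoid's input before exponentiating, so np.exp never overflows. A minimal sketch of such a drop-in nonlin (the clipping bounds are arbitrary and this is not part of the original code):

def nonlin_stable(x, deriv=False):
    # same interface as nonlin(); deriv=True still expects the sigmoid output
    if deriv:
        return x * (1 - x)
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))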
train_tag_bool code:
train_tag_bool = np.array([[0]*10]*len(train_tag)).astype('float64')
for i in range(len(train_tag)):
    if train_tag[i] == 0:
        train_tag_bool[i][0] = 1
    elif train_tag[i] == 1:
        train_tag_bool[i][1] = 1
    elif train_tag[i] == 2:
        train_tag_bool[i][2] = 1
    elif train_tag[i] == 3:
        train_tag_bool[i][3] = 1
    elif train_tag[i] == 4:
        train_tag_bool[i][4] = 1
    elif train_tag[i] == 5:
        train_tag_bool[i][5] = 1
    elif train_tag[i] == 6:
        train_tag_bool[i][6] = 1
    elif train_tag[i] == 7:
        train_tag_bool[i][7] = 1
    elif train_tag[i] == 8:
        train_tag_bool[i][8] = 1
    elif train_tag[i] == 9:
        train_tag_bool[i][9] = 1
Brute force, I know, but that's the least of my concerns right now. The result is a 3000 x 10 matrix with a 1 marking the digit of each sample; the first column represents digit 0 and the last represents 9.
For example, [0 0 0 0 0 0 1 0 0 0] represents 6 and [1 0 0 0 0 0 0 0 0 0] represents 0.
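For reference, the same 3000 x 10 one-hot matrix can be built without the elif chain. This is just a sketch that assumes train_tag holds integer labels 0-9 (the helper name labels is mine):

labels = np.asarray(train_tag).astype(int).ravel()
train_tag_bool = np.zeros((len(labels), 10), dtype='float64')
train_tag_bool[np.arange(len(labels)), labels] = 1.0
# or, equivalently: train_tag_bool = np.eye(10, dtype='float64')[labels]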
The original code:
import cPickle, gzip
import numpy as np

# from deeplearning.net
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()

# sigmoid function
def nonlin(x, deriv=False):
    if deriv == True:
        return x*(1-x)
    return 1/(1+np.exp(-x))

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# need to decrease the sample size or else computer dies
train_sample = train_set[0][0:3000]
train_tag = train_set[1][0:3000]
train_tag = train_tag.reshape(len(train_tag), 1)

# train_set's dimension for the pixels is 50000 (samples) x 784 (28x28 for each sample)
# therefore the coefficients should be 784 x len(train_sample)
syn0 = 2*np.random.random((784,len(train_sample))) - 1
syn1 = 2*np.random.random((len(train_sample),1)) - 1

for i in xrange(10000):
    # forward propagation
    l0 = train_sample
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))

    # calculate error
    l2_error = train_tag - l2

    if (i % 1000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))

    # multiply the error by the slope of the sigmoid at the current values
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)

    # update weights
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
Reference:
http://iamtrask.github.io/2015/07/12/basic-python-network/
http://yann.lecun.com/exdb/mnist/
I can't currently run the code, but there are a few things that stand out. I'm surprised it works well even on the toy problems used on the blog.
Before we start, you'll need more output neurons: 10 to be exact.
syn1 = 2*np.random.random((len(train_sample), 10)) - 1
And your labels (y) had better be length-10 arrays, with a 1 at the position of the correct digit and 0 elsewhere.
First of all, one thing I always attempt by default is to use float64 wherever possible... which almost never changes anything, so I'm not sure if you should get into this habit or not. Probably not.
Second, that code has no learning rate that you can set. This means that the learning rate is implicitly 1, which is huge for your problem, where people use 0.01 or even much less. To add a learning rate alpha, do:
syn1 += alpha * l1.T.dot(l2_delta)
syn0 += alpha * l0.T.dot(l1_delta)
And set it to at most 0.01. You'll have to fiddle with it for best results.
Third, it's usually better to initialize the net with small weights. [0, 1) might be too big. Try:
syn0 = (np.random.random((784,len(train_sample))) - 0.5) / 4
syn1 = (np.random.random((len(train_sample),1)) - 0.5) / 4
There are more involved initialization schemes that you can search for if you're interested, but I've gotten decent results with the above.
Fourth, regularization. The easiest to implement is probably weight decay. Implementing weight decay with a coefficient lambd (lambda itself is a reserved word in Python) can be done like this:
syn1 += alpha * l1.T.dot(l2_delta) - alpha * lambd * syn1
syn0 += alpha * l0.T.dot(l1_delta) - alpha * lambd * syn0
Common values are also < 0.1 or even < 0.01.
Dropout can also help, but it's a bit harder to implement and understand if you're just starting out, in my opinion. It's also more useful for deeper nets AFAIK. So maybe leave this for last.
Fifth, maybe also use momentum (explained in the weight decay link), which should decrease the learning time for your network. Also tune the number of iterations: you don't want too many, but not too few either.
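A classical momentum update on top of the decayed update from the previous points could look like the sketch below; velocity0, velocity1 and mu are names I'm introducing just for illustration, and lambd is the weight-decay coefficient from above:

# initialise once, next to the weights
velocity0 = np.zeros_like(syn0)
velocity1 = np.zeros_like(syn1)
mu = 0.9  # momentum coefficient, typically somewhere in 0.5-0.99

# inside the training loop, instead of the plain updates
velocity1 = mu * velocity1 + alpha * (l1.T.dot(l2_delta) - lambd * syn1)
velocity0 = mu * velocity0 + alpha * (l0.T.dot(l1_delta) - lambd * syn0)
syn1 += velocity1
syn0 += velocity0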
Sixth, look into softmax for the output layer.
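A numerically stable softmax for that output layer could be sketched like this; the row-wise max is subtracted only so np.exp never sees large positive numbers (an illustration, not part of the original answer):

def softmax(x):
    shifted = x - np.max(x, axis=1, keepdims=True)  # stability trick
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

# l2 = softmax(np.dot(l1, syn1))  # instead of the sigmoid on the output layer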
Seventh, look into tanh instead of your current nonlin sigmoid function.
If you apply these incrementally, you should start getting some meaningful results. I think regularization and smaller initial weights should help with the overflow errors.
Update:
I have changed the code as shown below. After only 100 training epochs, accuracy is 84.79%. Not too bad for barely tweaking anything.
I added bias neurons, momentum and weight decay, used fewer hidden units (it was way too slow with what you had), changed to the tanh function, and a few other things.
You should be able to tweak it some more from here. I use Python 3.4, so I had to change a few things to get it to run, but it's nothing major.
import pickle, gzip
import numpy as np

# from deeplearning.net
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
f.close()

# tanh nonlinearity (deriv=True expects the activation value, not the pre-activation)
def nonlin(x, deriv=False):
    if deriv == True:
        return 1 - x*x
    return np.tanh(x)

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

def make_proper_pairs_from_set(data_set):
    data_set_x, data_set_y = data_set
    data_set_y = np.eye(10)[:, data_set_y].T   # one-hot encode the labels
    return data_set_x, data_set_y

train_x, train_y = make_proper_pairs_from_set(train_set)
test_x, test_y = make_proper_pairs_from_set(test_set)
print(len(train_y))

# train_set's dimensions for the pixels are 50000 (samples) x 784 (28x28 for each sample)
# changed to 200 hidden neurons, should be plenty
# the extra row in each weight matrix accounts for the bias neuron
syn0 = (2*np.random.random((785,200)) - 1) / 10
syn1 = (2*np.random.random((201,10)) - 1) / 10

velocities0 = np.zeros(syn0.shape)
velocities1 = np.zeros(syn1.shape)

alpha = 0.01      # learning rate
beta = 0.0001     # weight decay
momentum = 0.99
m = len(train_x)  # number of training samples

# moved the forward propagation to a function and added bias neurons
def forward_prop(set_x, m):
    l0 = np.c_[np.ones((m, 1)), set_x]
    l1 = nonlin(np.dot(l0, syn0))
    l1 = np.c_[np.ones((m, 1)), l1]
    l2 = nonlin(np.dot(l1, syn1))
    return l0, l1, l2, l2.argmax(axis=1)

num_epochs = 100
for i in range(num_epochs):
    # forward propagation
    l0, l1, l2, _ = forward_prop(train_x, m)

    # calculate error
    l2_error = l2 - train_y
    print("Error " + str(i) + ": " + str(np.mean(np.abs(l2_error))))

    # multiply the error by the derivative of the nonlinearity
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)
    l1_delta = l1_delta[:, 1:]   # drop the bias column

    # update weights
    # divide gradients by the number of samples
    grad0 = l0.T.dot(l1_delta) / m
    grad1 = l1.T.dot(l2_delta) / m

    v0 = velocities0
    v1 = velocities1
    velocities0 = velocities0 * momentum - alpha * grad0
    velocities1 = velocities1 * momentum - alpha * grad1

    # divide regularization by number of samples
    # because L2 regularization reduces to this
    syn1 += -v1 * momentum + (1 + momentum) * velocities1 - alpha * beta * syn1 / m
    syn0 += -v0 * momentum + (1 + momentum) * velocities0 - alpha * beta * syn0 / m

# find accuracy on test set
predictions = []
corrects = []

# you can eliminate this loop too with a bit of work, but this part is very fast anyway
for i in range(len(test_x)):
    _, _, _, rez = forward_prop([test_x[i, :]], 1)
    predictions.append(rez[0])
    corrects.append(test_y[i].argmax())

predictions = np.array(predictions)
corrects = np.array(corrects)
print(np.sum(predictions == corrects) / len(test_x))
Update 2:
If you increase the learning rate to 0.05 and the epochs to 1000, you get 95.43% accuracy.
Seeding the random number generator with the current time, adding more hidden neurons (or hidden layers), and more parameter tweaking can get this simple model to about 98% accuracy, AFAIK. The problem is that it's slow to train.
Also, this methodology isn't really sound: I optimized the parameters to increase accuracy on the test set, so I might be overfitting to the test set. You should use cross-validation or the validation set.
Anyway, as you can see, there are no overflow errors. If you want to discuss things in more detail, feel free to drop me an e-mail (address in profile).
Related
I am working with a real estate dataset of about 21 thousand rows; the training set has 15129 rows and there are 15 features. The task is to implement linear regression manually using SGD and compare the feature weights with the weights that the sklearn linear regression model gives us. (All data is normalized using sklearn's StandardScaler.)
def gradient3(X, y):
    X = pd.DataFrame(X)
    y = pd.DataFrame(y)
    w1 = np.random.randn(len(X.axes[1]))
    w2 = np.random.randn(len(X.axes[1]))
    b = 0
    eps = 0.001
    alpha = 1
    counter = 1
    lmbda = 0.1
    while np.linalg.norm(w1 - w2) > eps:
        # choosing a random index
        rand_index = np.random.randint(len(X.axes[0]))
        X_tr = X.loc[rand_index].values
        y_tr = y.loc[rand_index].values

        # calculating the new w
        err = X_tr.dot(w1) + b - y_tr
        loss_w = 2 * err * X_tr + (lmbda * w1)
        loss_b = 2 * err
        w2 = w1.copy()
        w1 = w1 - alpha * loss_w
        b = b - alpha * loss_b

        # reducing alpha
        counter += 1
        alpha = 1/counter
    return w1, b
I tried to implement SGD and expected to get a list of feature weights w and a bias value b. The problem is that the program sometimes just goes into an infinite loop, and sometimes it gives me absolutely chaotic weights; it depends on my learning rate parameter (alpha) and how fast it decreases. I don't quite understand what exactly the problem is. Maybe SGD just doesn't work with this dataset and I need mini-batches, maybe I missed something in the algorithm, or maybe I'm implementing regularization incorrectly.
I would be very grateful if someone could point out what is wrong with my implementation.
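For reference, the mini-batch variant mentioned above could look roughly like the sketch below; the function name minibatch_step, batch_size, and the use of plain numpy arrays instead of DataFrames are my own assumptions, not code from the question:

import numpy as np

def minibatch_step(X, y, w, b, alpha, lmbda, batch_size=32):
    # X: (n_samples, n_features), y: (n_samples,)
    idx = np.random.choice(len(X), size=batch_size, replace=False)
    X_b, y_b = X[idx], y[idx]
    err = X_b.dot(w) + b - y_b                      # (batch_size,)
    grad_w = 2 * X_b.T.dot(err) / batch_size + lmbda * w
    grad_b = 2 * err.mean()
    return w - alpha * grad_w, b - alpha * grad_b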
Question
In CS231n's Computing the Analytic Gradient with Backpropagation, which first implements a Softmax classifier, the gradient from (softmax + log loss) is divided by the batch size (the number of data points used in one cycle of forward cost calculation and backpropagation during training).
Please help me understand why it needs to be divided by the batch size.
The chain rule used to get the gradient should be as shown below. Where should I incorporate the division?
Derivative of Softmax loss function
Code
N = 100  # number of points per class
D = 2    # dimensionality
K = 3    # number of classes
X = np.zeros((N*K,D))             # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8')  # class labels

# Train a Linear Classifier

# initialize parameters randomly
W = 0.01 * np.random.randn(D,K)
b = np.zeros((1,K))

# some hyperparameters
step_size = 1e-0
reg = 1e-3  # regularization strength

# gradient descent loop
num_examples = X.shape[0]
for i in range(200):

    # evaluate class scores, [N x K]
    scores = np.dot(X, W) + b

    # compute the class probabilities
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # [N x K]

    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_examples),y])
    data_loss = np.sum(correct_logprobs)/num_examples
    reg_loss = 0.5*reg*np.sum(W*W)
    loss = data_loss + reg_loss
    if i % 10 == 0:
        print "iteration %d: loss %f" % (i, loss)

    # compute the gradient on scores
    dscores = probs
    dscores[range(num_examples),y] -= 1
    dscores /= num_examples  # <---------------------- Why?

    # backpropagate the gradient to the parameters (W,b)
    dW = np.dot(X.T, dscores)
    db = np.sum(dscores, axis=0, keepdims=True)
    dW += reg*W  # regularization gradient

    # perform a parameter update
    W += -step_size * dW
    b += -step_size * db
It's because you are averaging the gradients instead of directly taking the sum of all the gradients.
You could of course skip that division, but it has a lot of advantages. The main reason is that it acts as a sort of regularization (to avoid overfitting): with smaller gradients the weights cannot grow out of proportion.
This normalization also allows comparison between different batch-size configurations across experiments (how could you compare two batch performances if they depend on the batch size?).
If you divide the summed gradients by the batch size, you can also work with larger learning rates to make the training faster.
This answer on the Cross Validated community is quite useful.
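A tiny numpy illustration of that scaling argument (the numbers are made up, purely to show that the averaged gradient's magnitude does not depend on the batch size while the summed one does):

import numpy as np

np.random.seed(0)
# pretend per-example gradients with a systematic component
per_example_grads = 0.5 + 0.1 * np.random.randn(1000)

for batch_size in (10, 100, 1000):
    g = per_example_grads[:batch_size]
    print(batch_size, "sum:", round(g.sum(), 2), "mean:", round(g.mean(), 3))
# the summed gradient grows roughly linearly with the batch size,
# while the averaged one stays on the same scale, so a single learning
# rate works across batch sizes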
I came to notice that the dot product in dW = np.dot(X.T, dscores) for the gradient at W is a sum (Σ) over the num_examples instances. Since dscores, which is the probability (softmax output), was divided by num_examples, I hadn't realized that this was the normalization for the dot-product and sum parts later in the code. Now I understand that dividing by num_examples is required (it may still work without the normalization if the learning rate is tuned, though).
I believe the code below explains better.
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
# backpropagate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores) / num_examples
db = np.sum(dscores, axis=0, keepdims=True) / num_examples
I made a simple network to find broken lines, and I had a very strange training run. The loss, keras.losses.binary_crossentropy, was decreasing steadily for around 1500 epochs; then suddenly it shot up and plateaued.
What are some reasons this happens? Optimizers, loss function, network structure?
I checked the weights, and none of them have a NaN value. The input data is 250,000+ 32x32 images with lines on them, plus the same stack of images where the lines have a few pixels removed so they're "broken".
Here is the model creation code:
input_shape = (1, 32, 32)
kernel_shape = (16, 16)
keras.backend.set_image_data_format("channels_first")
n_filters = 64

input_layer = engine.Input(input_shape)
active_1 = layers.Activation("relu")(input_layer)
conv_1 = layers.Conv2D(n_filters, kernel_shape)(active_1)
conv_2 = layers.Conv2D(2*n_filters, kernel_shape)(conv_1)
pool_1 = layers.MaxPooling2D()(conv_2)

s = tupleFromShape(pool_1.shape)
p = 1
for d in s:
    p *= d
shaped_1 = layers.Reshape((p,))(pool_1)

dense_1 = layers.Dense(2)(shaped_1)
out = layers.Activation("softmax")(dense_1)

model = engine.Model(input_layer, out)
model.save("broken-lines-start.h5")
And the training code:
full = #numpy array (c, slices, 32, 32)
broken = #numpy array (c, slices, 32, 32)

full = full[0]
broken = broken[0]

n = len(full) - 1024
n2 = len(broken) - 1024

random.shuffle(full)
random.shuffle(broken)

optimizer = keras.optimizers.Adam(0.00001)
loss_function = keras.losses.binary_crossentropy

model.compile(optimizer=optimizer, loss=loss_function)

batch_size = 256
steps = n//batch_size + n2//batch_size

model.fit_generator(generator=getDataGenerator(full[:n], broken[:n2], batch_size),
                    steps_per_epoch=steps,
                    epochs=4680,
                    validation_data=getDataGenerator(full[n:], broken[n2:], batch_size),
                    validation_steps=2048//batch_size,
                    callbacks=[saves_last_epoch_and_best_epoch])

model.save("broken-lines-trained.h5")
The generator code:
def getDataGenerator(solid, broken, batch_size=128):
    zed = [([chunk], [1, 0]) for chunk in solid] + [([chunk], [0, 1]) for chunk in broken]
    random.shuffle(zed)
    xbatch = []
    ybatch = []
    while True:
        for i in range(len(zed)):
            x, y = zed[i]
            xbatch.append(x)
            ybatch.append(y)
            if len(xbatch) == batch_size:
                yield numpy.array(xbatch), numpy.array(ybatch)
                xbatch = []
                ybatch = []
I have greatly improved this model, and it hasn't exhibited this behavior since, but I would still like to understand why this happened.
Subsequent things I have tried:
Changing the loss function to logcosh -> works.
Changing the epsilon value of the Adam optimizer -> still blows up.
Changing the optimizer to SGD -> blows up faster, without the initial decrease.
One possible issue might be the Adam optimizer -- it is known to "explode" when you train it for a long time.
Let's look at the update formulas of Adam:
t <- t + 1
lr_t <- learning_rate * sqrt(1 - beta2^t) / (1 - beta1^t)
m_t <- beta1 * m_{t-1} + (1 - beta1) * g
v_t <- beta2 * v_{t-1} + (1 - beta2) * g * g
variable <- variable - lr_t * m_t / (sqrt(v_t) + epsilon)
where m and v are estimates of the first moment (the mean) and the second moment (the uncentered variance) of the gradients, respectively. When you train the model for a long time, v can become very small.
By default, according to the TensorFlow docs, beta1 = 0.9 and beta2 = 0.999, so m changes more quickly than v. That means m can become large again while v cannot catch up, and you end up dividing a large number by a very small one, which explodes.
Try increasing the epsilon parameter, which is 1e-08 by default. Experiment with values like 0.01 or 0.001, depending on your model.
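In Keras that just means passing a larger epsilon when constructing the optimizer, for example (the value here is only illustrative):

optimizer = keras.optimizers.Adam(0.00001, epsilon=0.001)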
I have used linear regression via ML packages in Python, but for the sake of self-gratification, I coded it from scratch. The loss starts at around 0.90 and keeps increasing (not learning) for some reason. I do not understand what mistake I may have made.
Standardised the dataset as part of preprocessing.
Initialised the weight matrix with the MLE estimate for the parameter W, i.e., (X^T X)^-1 X^T Y.
Computed the output.
Calculated the gradient of the loss function SSE (Sum of Squared Errors) w.r.t. the parameter W and bias B.
Used the gradients to update the parameters via gradient descent.
import preprocess as pre
import numpy as np
import matplotlib.pyplot as plt

data = pre.load_file('airfoil_self_noise.dat')
data = pre.organise(data, "\t", "\r\n")
data = pre.standardise(data, data.shape[1])

t = np.reshape(data[:,5], [-1,1])
data = data[:,:5]

N = data.shape[0]
M = 5
lr = 1e-3

# W = np.random.random([M,1])
W = np.dot(np.dot(np.linalg.inv(np.dot(data.T,data)),data.T),t)
data = data.T  # Examples are arranged in columns [features, N]
b = np.random.rand()

epochs = 1000000
loss = np.zeros([epochs])

for epoch in range(epochs):
    if epoch % 1000 == 0:
        lr /= 10

    # Obtain the output
    y = np.dot(W.T, data).T + b
    sse = np.dot((t-y).T, (t-y))
    loss[epoch] = sse/N
    var = sse/N

    # log likelihood
    ll = (-N/2)*(np.log(2*np.pi)) - (N*np.log(np.sqrt(var))) - (sse/(2*var))

    # Gradient Descent
    W_grad = np.zeros([M,1])
    B_grad = 0
    for i in range(N):
        err = (t[i]-y[i])
        W_grad += err * np.reshape(data[:,i], [-1,1])
        B_grad += err
    W_grad /= N
    B_grad /= N
    W += lr * W_grad
    b += lr * B_grad

    print("Epoch: %d, Loss: %.3f, Log-Likelihood: %.3f" % (epoch, loss[epoch], ll))

plt.figure()
plt.plot(range(epochs), loss, '-r')
plt.show()
Now if you run the above code, you will likely find nothing wrong, since I am doing W += lr * W_grad instead of W -= lr * W_grad. I would like to know why this is the case, because the gradient descent formula says to subtract the gradient from the old weight matrix. The error constantly increases when I do that. What is it that I am missing?
Found it. The problem was that I took the gradient of the loss function from a slide that apparently wasn't right (or at least wasn't entirely wrong: it was already pointing in the direction of steepest descent), so when I subtracted it from the weights it started pointing in the direction of greatest increase. That is what gave rise to what I observed.
I did the partial derivative of loss function to clarify, and got this:
W_grad += data[:,i].reshape([-1,1])*(y[i]-t[i]).reshape([])
This points in the direction of greatest increase; when I multiply it by -lr it points in the direction of steepest descent, and things started working properly.
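For what it's worth, the same gradient can be computed without the inner loop. This is only a sketch under the shapes used above (data is [features, N] after the transpose, y and t are [N, 1]; the factor of 2 is absorbed into the learning rate):

W_grad = data.dot(y - t) / N    # shape (M, 1); points in the direction of greatest increase
B_grad = np.mean(y - t)
W -= lr * W_grad                # subtract, the usual gradient-descent sign
b -= lr * B_grad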
I am watching some videos for Stanford CS231n: Convolutional Neural Networks for Visual Recognition, but do not quite understand how to calculate the analytic gradient for the softmax loss function using numpy.
From this stackexchange answer, the softmax gradient is calculated as
dw_j = 1/num_train * \sum_i [x_i * (p(y_i = j) - Ind{y_i = j})]
The Python implementation for the above is:
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
    for j in range(num_classes):
        p = np.exp(f_i[j])/sum_i
        dW[j, :] += (p-(j == y[i])) * X[:, i]
Could anyone explain how the above snippet works? A detailed implementation of the softmax loss is also included below.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs:
    - W: C x D array of weights
    - X: D x N array of data. Data are D-dimensional columns
    - y: 1-dimensional array of length N with labels 0...K-1, for K classes
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W, an array of same size as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # Compute the softmax loss and its gradient using explicit loops.           #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################

    # Get shapes
    num_classes = W.shape[0]
    num_train = X.shape[1]

    for i in range(num_train):
        # Compute vector of scores
        f_i = W.dot(X[:, i])  # in R^{num_classes}

        # Normalization trick to avoid numerical instability, per http://cs231n.github.io/linear-classify/#softmax
        log_c = np.max(f_i)
        f_i -= log_c

        # Compute loss (and add to it, divided later)
        # L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
        sum_i = 0.0
        for f_i_j in f_i:
            sum_i += np.exp(f_i_j)
        loss += -f_i[y[i]] + np.log(sum_i)

        # Compute gradient
        # dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j) - Ind{y_i = j})]
        # Here we are computing the contribution to the inner sum for a given i.
        for j in range(num_classes):
            p = np.exp(f_i[j])/sum_i
            dW[j, :] += (p-(j == y[i])) * X[:, i]

    # Compute average
    loss /= num_train
    dW /= num_train

    # Regularization
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg*W

    return loss, dW
Not sure if this helps, but:
Ind{y_i = j}
is really the indicator function, as described here. This forms the expression (j == y[i]) in the code.
Also, the gradient of the loss with respect to the weights is
\nabla_{w_j} L_i = (p_j - Ind{y_i = j}) * x_i
where
p_j = e^{f_j} / \sum_k e^{f_k} and the scores are f = W x_i,
so the factor x_i is the origin of the X[:, i] in the code.
I know this is late, but here's my answer:
I'm assuming you are familiar with the cs231n Softmax loss function. We know that
L_i = -log( e^{f_{y_i}} / \sum_j e^{f_j} )
So, just as we did with the SVM loss function, the gradients are
\nabla_{w_j} L_i = (p_j - Ind{y_i = j}) * x_i, with p_j = e^{f_j} / \sum_k e^{f_k}
Hope that helped.
A supplement to this answer with a small example.
I came across this post and was still not 100% clear on how to arrive at the partial derivatives.
For that reason I took another approach to get to the same results; maybe it is helpful to others too.
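One quick way to convince yourself that the formula above is right is a numerical gradient check: build the analytic gradient (p - Ind{y}) * x for a tiny random problem and compare it to central finite differences of the loss. This is only a sketch with made-up shapes and names, not the approach described above:

import numpy as np

np.random.seed(0)
C, D = 3, 5                      # classes, feature dimension
W = 0.01 * np.random.randn(C, D)
x = np.random.randn(D)
y = 1                            # true class label

def softmax_loss(W):
    f = W.dot(x)
    f = f - f.max()              # numerical stability
    p = np.exp(f) / np.exp(f).sum()
    return -np.log(p[y])

# analytic gradient: dL/dw_j = (p_j - Ind{j == y}) * x
f = W.dot(x)
f = f - f.max()
p = np.exp(f) / np.exp(f).sum()
analytic = (p - np.eye(C)[y])[:, None] * x[None, :]

# numerical gradient by central differences
h = 1e-5
numeric = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += h
    Wm[idx] -= h
    numeric[idx] = (softmax_loss(Wp) - softmax_loss(Wm)) / (2 * h)

print(np.max(np.abs(analytic - numeric)))   # should be tiny, around 1e-10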