My neural network is stuck at 11.35 percent accuracy and I am unable to trace the error.
I am following this code, https://github.com/MLForNerds/DL_Projects/blob/main/mnist_ann.ipynb, which I found in a YouTube video.
Here is my code for the neural network (I have defined Xavier weight initialization in a module called nn):
"""1. 784 neurons in input layer
2. 128 neurons in hidden layer 1
3. 64 neurons in hidden layer 2
4. 10 neurons in output layer"""
def softmax(input):
y = np.exp(input - input.max())
activated = y/ np.sum(y, axis=0)
return activated
def softmax_grad(x):
exps = np.exp(x-x.max())
return exps / np.sum(exps,axis = 0) * (1 - exps /np.sum(exps,axis = 0))
def sigmoid(input):
activated = 1/(1 + np.exp(-input))
return activated
def sigmoid_grad(input):
grad = input*(1-input)
return grad
class DenseNN:
def __init__(self,d0,d1,d2,d3):
self.params = {'w1': nn.Xavier.initialize(d0, d1),
'w2': nn.Xavier.initialize(d1, d2),
'w3': nn.Xavier.initialize(d2, d3)}
def forward(self,a0):
params = self.params
params['a0'] = a0
params['z1'] = np.dot(params['w1'],params['a0'])
params['a1'] = sigmoid(params['z1'])
params['z2'] = np.dot(params['w2'],params['a1'])
params['a2'] = sigmoid(params['z2'])
params['z3'] = np.dot(params['w3'],params['a2'])
params['a3'] = softmax(params['z3'])
return params['a3']
def backprop(self,y_true,y_pred):
params = self.params
w_change = {}
error = softmax_grad(params['z3'])*((y_pred - y_true)/y_true.shape[0])
w_change['w3'] = np.outer(error,params['a2'])
error = np.dot(params['w3'].T,error)*sigmoid_grad(params['a2'])
w_change['w2'] = np.outer(error,params['a1'])
error = np.dot(params['w2'].T,error)*sigmoid_grad(params['a1'])
w_change['w1'] = np.outer(error,params['a0'])
return w_change
def update_weights(self,learning_rate,w_change):
self.params['w1'] -= learning_rate*w_change['w1']
self.params['w2'] -= learning_rate*w_change['w2']
self.params['w3'] -= learning_rate*w_change['w3']
def train(self,epochs,lr):
for epoch in range(epochs):
for i in range(60000):
a0 = np.array([x_train[i]]).T
o = np.array([y_train[i]]).T
y_pred = self.forward(a0)
w_change = self.backprop(o,y_pred)
self.update_weights(lr,w_change)
# print(self.compute_accuracy()*100)
# print(calc_mse(a3, o))
print((self.compute_accuracy())*100)
def compute_accuracy(self):
'''
This function does a forward pass of x, then checks if the indices
of the maximum value in the output equals the indices in the label
y. Then it sums over each prediction and calculates the accuracy.
'''
predictions = []
for i in range(10000):
idx = i
a0 = x_test[idx]
a0 = np.array([a0]).T
#print("acc a1",np.shape(a1))
o = y_test[idx]
o = np.array([o]).T
#print("acc o",np.shape(o))
output = self.forward(a0)
pred = np.argmax(output)
predictions.append(pred == np.argmax(o))
return np.mean(predictions)
Here is the code for loading the data:
import numpy as np
import pandas as pd

# load dataset csv
train_data = pd.read_csv('../Datasets/MNIST/mnist_train.csv')
test_data = pd.read_csv('../Datasets/MNIST/mnist_test.csv')

# train data
x_train = train_data.drop('label', axis=1).to_numpy()
y_train = pd.get_dummies(train_data['label']).values

# test data
x_test = test_data.drop('label', axis=1).to_numpy()
y_test = pd.get_dummies(test_data['label']).values

fac = 0.99 / 255
x_train = np.asfarray(x_train) * fac + 0.01
x_test = np.asfarray(x_test) * fac + 0.01

# train_labels = np.asfarray(train_data[:, :1])
# test_labels = np.asfarray(test_data[:, :1])

# printing dimensions
print(np.shape(x_train))  # (60000, 784)
print(np.shape(y_train))  # (60000, 10)
print(np.shape(x_test))   # (10000, 784)
print(np.shape(y_test))   # (10000, 10)
print((x_train))
Kindly help.
I am a newbie in machine learning, so any help would be appreciated. I am unable to figure out where I am going wrong. Most of the code is almost the same as https://github.com/MLForNerds/DL_Projects/blob/main/mnist_ann.ipynb, but that notebook manages to get 60 percent accuracy.
EDIT
I found the mistake:
Thanks to Bartosz Mikulski.
The problem was with how the weights were initialized in my Xavier weight-initialization algorithm.
I changed the weight-initialization code to this:
self.params = {
    'w1': np.random.randn(d1, d0) * np.sqrt(1. / d1),
    'w2': np.random.randn(d2, d1) * np.sqrt(1. / d2),
    'w3': np.random.randn(d3, d2) * np.sqrt(1. / d3),
    'b1': np.random.randn(d1, 1) * np.sqrt(1. / d1),
    'b2': np.random.randn(d2, 1) * np.sqrt(1. / d2),
    'b3': np.random.randn(d3, 1) * np.sqrt(1. / d3),
}
Then I got the output shown in the screenshot "After changing weights initialization". After also adding the bias parameters, I got the output shown in the screenshot "After changing weights initialization and adding bias".
The one problem that I can see is that you are using only weights but no biases. Biases are very important because they allow your model to shift the position of the decision plane (boundary) in the solution space; with weights alone you can only rotate it.
I guess this is basically the best fit you can get without biases. A dense layer is essentially a linear function, w*x + b, and you are missing the b. See the PyTorch documentation for an example: https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#linear.
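To make that concrete, here is a minimal sketch of what the forward pass looks like once bias vectors b1, b2, b3 (shaped like the ones in the edit above) are added; this is an illustration, not the exact notebook code:

def forward(self, a0):
    params = self.params
    params['a0'] = a0
    # each layer now computes w*x + b instead of just w*x
    params['z1'] = np.dot(params['w1'], a0) + params['b1']
    params['a1'] = sigmoid(params['z1'])
    params['z2'] = np.dot(params['w2'], params['a1']) + params['b2']
    params['a2'] = sigmoid(params['z2'])
    params['z3'] = np.dot(params['w3'], params['a2']) + params['b3']
    params['a3'] = softmax(params['z3'])
    return params['a3']

The backprop and update steps then need matching gradients for b1, b2, b3; for fully connected layers these are just the per-layer error vectors (summed over the batch if you train on more than one example at a time).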
Also, can you show your Xavier initialization? In your case even simple normally distributed values would be enough as initialization; there is no need to rush into more advanced topics.
I would also suggest starting from a smaller problem (for example the Iris dataset) with no hidden layers (just a simple linear regression that learns by gradient descent). Then you can expand it by adding hidden layers, and then by trying harder problems with the code you already have.
Question
In CS231n's "Computing the Analytic Gradient with Backpropagation", which first implements a Softmax classifier, the gradient from (softmax + log loss) is divided by the batch size (the number of samples used in one cycle of forward cost calculation and backward propagation during training).
Please help me understand why it needs to be divided by the batch size.
The chain rule to get the gradient should be as below. Where should I incorporate the division?
Derivative of Softmax loss function
Code
N = 100  # number of points per class
D = 2    # dimensionality
K = 3    # number of classes
X = np.zeros((N*K, D))            # data matrix (each row = single example)
y = np.zeros(N*K, dtype='uint8')  # class labels

# Train a Linear Classifier

# initialize parameters randomly
W = 0.01 * np.random.randn(D, K)
b = np.zeros((1, K))

# some hyperparameters
step_size = 1e-0
reg = 1e-3  # regularization strength

# gradient descent loop
num_examples = X.shape[0]
for i in range(200):

    # evaluate class scores, [N x K]
    scores = np.dot(X, W) + b

    # compute the class probabilities
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)  # [N x K]

    # compute the loss: average cross-entropy loss and regularization
    correct_logprobs = -np.log(probs[range(num_examples), y])
    data_loss = np.sum(correct_logprobs) / num_examples
    reg_loss = 0.5 * reg * np.sum(W*W)
    loss = data_loss + reg_loss
    if i % 10 == 0:
        print "iteration %d: loss %f" % (i, loss)

    # compute the gradient on scores
    dscores = probs
    dscores[range(num_examples), y] -= 1
    dscores /= num_examples  # <---------------------- Why?

    # backpropagate the gradient to the parameters (W,b)
    dW = np.dot(X.T, dscores)
    db = np.sum(dscores, axis=0, keepdims=True)
    dW += reg*W  # regularization gradient

    # perform a parameter update
    W += -step_size * dW
    b += -step_size * db
It's because you are averaging the gradients instead of directly taking the sum of all the gradients.
You could of course skip that division, but it has several advantages. The main reason is that it acts as a sort of regularization (to avoid overfitting): with smaller gradients the weights cannot grow out of proportion.
This normalization also allows comparison between different batch-size configurations across experiments (how could you compare two runs if their gradient scales depended on the batch size?).
If you divide the summed gradients by the batch size, you can also work with larger learning rates, which makes training faster.
This answer in the Cross Validated community is quite useful.
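Written out, the division comes straight from the fact that the data loss is defined as a mean over the batch (a reconstruction in standard notation, with N = num_examples, s the scores and p the softmax probabilities):

$$
L = \frac{1}{N}\sum_{i=1}^{N} L_i + \frac{\lambda}{2}\sum W^2,
\qquad L_i = -\log p_{i,y_i},
\qquad p_{i,k} = \frac{e^{s_{i,k}}}{\sum_j e^{s_{i,j}}}
$$

$$
\frac{\partial L_i}{\partial s_{i,k}} = p_{i,k} - \mathbf{1}[k = y_i]
\quad\Longrightarrow\quad
\frac{\partial L}{\partial s_{i,k}} = \frac{1}{N}\bigl(p_{i,k} - \mathbf{1}[k = y_i]\bigr)
$$

Because every per-example gradient inherits the 1/N factor from the mean, dividing dscores by num_examples is exactly where that factor enters the code.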
I came to notice that the dot product in dW = np.dot(X.T, dscores), the gradient at W, is a sum (Σ) over the num_examples instances. Since dscores (the softmax probabilities with 1 subtracted at the correct class) had already been divided by num_examples, I had not realized that this division was the normalization for that dot product and for the sum later in the code. Now I understand that dividing by num_examples is required (although it may still work without the normalization if the learning rate is tuned).
I believe the code below explains it better.
# compute the gradient on scores
dscores = probs
dscores[range(num_examples),y] -= 1
# backpropagate the gradient to the parameters (W,b)
dW = np.dot(X.T, dscores) / num_examples
db = np.sum(dscores, axis=0, keepdims=True) / num_examples
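A quick numerical check (a standalone sketch with random data, not part of the course code) confirms that dividing dscores first or dividing dW and db afterwards gives identical gradients, since all the operations involved are linear:

import numpy as np

np.random.seed(0)
N, D, K = 100, 2, 3
X = np.random.randn(N, D)
y = np.random.randint(0, K, size=N)
scores = np.random.randn(N, K)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# version 1: divide dscores, then backpropagate
d1 = probs.copy()
d1[range(N), y] -= 1
d1 /= N
dW1 = X.T.dot(d1)
db1 = d1.sum(axis=0, keepdims=True)

# version 2: backpropagate, then divide dW and db
d2 = probs.copy()
d2[range(N), y] -= 1
dW2 = X.T.dot(d2) / N
db2 = d2.sum(axis=0, keepdims=True) / N

print(np.allclose(dW1, dW2), np.allclose(db1, db2))  # True True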
I have used linear regression via ML packages in Python before, but for the sake of self-gratification I coded it from scratch. The loss starts at around 0.90 and keeps increasing (not learning) for some reason, and I do not understand what mistake I may have made. The steps I follow are:
1. Standardise the dataset as part of preprocessing.
2. Initialise the weight matrix with the MLE estimate for the parameter W, i.e. (X^T X)^-1 X^T Y.
3. Compute the output.
4. Calculate the gradient of the loss function SSE (Sum of Squared Errors) with respect to the parameter W and bias b.
5. Use the gradients to update the parameters with gradient descent.
import preprocess as pre
import numpy as np
import matplotlib.pyplot as plt

data = pre.load_file('airfoil_self_noise.dat')
data = pre.organise(data, "\t", "\r\n")
data = pre.standardise(data, data.shape[1])

t = np.reshape(data[:, 5], [-1, 1])
data = data[:, :5]
N = data.shape[0]
M = 5
lr = 1e-3

# W = np.random.random([M,1])
W = np.dot(np.dot(np.linalg.inv(np.dot(data.T, data)), data.T), t)
data = data.T  # Examples are arranged in columns [features, N]
b = np.random.rand()
epochs = 1000000
loss = np.zeros([epochs])

for epoch in range(epochs):
    if epoch % 1000 == 0:
        lr /= 10

    # Obtain the output
    y = np.dot(W.T, data).T + b
    sse = np.dot((t-y).T, (t-y))
    loss[epoch] = sse / N
    var = sse / N

    # log likelihood
    ll = (-N/2)*(np.log(2*np.pi)) - (N*np.log(np.sqrt(var))) - (sse/(2*var))

    # Gradient Descent
    W_grad = np.zeros([M, 1])
    B_grad = 0
    for i in range(N):
        err = (t[i] - y[i])
        W_grad += err * np.reshape(data[:, i], [-1, 1])
        B_grad += err
    W_grad /= N
    B_grad /= N
    W += lr * W_grad
    b += lr * B_grad
    print("Epoch: %d, Loss: %.3f, Log-Likelihood: %.3f" % (epoch, loss[epoch], ll))

plt.figure()
plt.plot(range(epochs), loss, '-r')
plt.show()
Now, if you run the above code you will likely not find anything wrong, since I am doing W += lr * W_grad instead of W -= lr * W_grad. I would like to know why this is the case, because the gradient descent formula says to subtract the gradient from the old weight matrix; the error increases constantly when I do that. What am I missing?
Found it. The problem was that I took the gradient of the loss function from a slide, which apparently was not quite right (or rather, not entirely wrong: it was already pointing along the steepest descent), so when I subtracted it from the weights, the update pointed in the direction of greatest increase. That is what gave rise to what I observed.
I did the partial derivative of the loss function to clarify, and got this:
W_grad += data[:, i].reshape([-1, 1]) * (y[i] - t[i]).reshape([])
This points in the direction of greatest increase, and when I multiply it by -lr it points along the steepest descent, and the training started working properly.
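For reference, here is that partial derivative worked out for the averaged squared error used above (a reconstruction; t, y, W, b match the variables in the code, and the 1/2 factor only rescales the gradient by a constant absorbed into the learning rate):

$$
L = \frac{1}{2N}\sum_{i=1}^{N}\bigl(t_i - y_i\bigr)^2,
\qquad y_i = W^{\top}x_i + b
$$

$$
\frac{\partial L}{\partial W} = -\frac{1}{N}\sum_{i=1}^{N}\bigl(t_i - y_i\bigr)\,x_i
= \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - t_i\bigr)\,x_i
$$

The quantity the original loop accumulated, (1/N) Σ (t_i − y_i) x_i, is the negative of this gradient, which is why adding it (W += lr * W_grad) walked downhill, while the corrected form (y_i − t_i) x_i has to be subtracted.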
I want to perform SGD on the following neural network:
Training set size = 200000
input layer size = 784
hidden layer size = 50
output layer size = 10
I have an algorithm that performs batch gradient descent. I guess that to perform SGD, the cost function should be modified to do its calculations on a single training example (an array of size 784), and theta should then be updated for each training example. Is this the correct way of implementing SGD? If yes, I am not able to get the following cost function (written for batch gradient descent) to work for a single training example; how can I make it run on one example? If no, what is the correct way to implement SGD on a neural network?
Python function to calculate the cost function and the gradient of theta for batch gradient descent:
def cost(theta, X, y, lamb):
    # get theta1 and theta2 from unrolled theta vector
    th1 = (theta[0:(hiddenLayerSize*(inputLayerSize+1))].reshape((inputLayerSize+1, hiddenLayerSize))).T
    th2 = (theta[(hiddenLayerSize*(inputLayerSize+1)):].reshape((hiddenLayerSize+1, outputLayerSize))).T

    # matrices to store gradient of theta1 & theta2
    th1_grad = np.zeros(th1.shape)
    th2_grad = np.zeros(th2.shape)

    I = np.identity(outputLayerSize, int)
    Y = np.zeros((realTrainSetSize, outputLayerSize))
    # expand y[i] to the size of the output layer
    for i in range(0, realTrainSetSize):
        Y[i] = I[y[i]]

    # add bias unit to each training example and perform forward prop and backprop
    A1 = np.hstack([np.ones((realTrainSetSize, 1)), X])
    Z2 = A1 @ (th1.T)
    A2 = np.hstack([np.ones((len(Z2), 1)), sigmoid(Z2)])
    Z3 = A2 @ (th2.T)
    H = A3 = sigmoid(Z3)

    penalty = (lamb/(2*trainSetSize))*(sum(sum(np.delete(th1, 0, 1)**2)) + sum(sum(np.delete(th2, 0, 1)**2)))
    J = (1/2)*sum(sum(np.multiply(-Y, log(H)) - np.multiply((1-Y), log(1-H))))

    # backprop
    sigma3 = A3 - Y
    sigma2 = np.multiply(sigma3 @ theta2, sigmoidGradient(np.hstack([np.ones((len(Z2), 1)), Z2])))
    sigma2 = np.delete(sigma2, 0, 1)
    delta_1 = sigma2.T @ A1  # getting dimension mismatch error
    delta_2 = sigma3.T @ A2

    # calculation of gradient of theta1 and theta2
    th1_grad = np.divide(delta_1, trainSetSize) + (lamb/trainSetSize)*(np.hstack([np.zeros((len(th1), 1)), np.delete(th1, 0, 1)]))
    th2_grad = np.divide(delta_2, trainSetSize) + (lamb/trainSetSize)*(np.hstack([np.zeros((len(th2), 1)), np.delete(th2, 0, 1)]))

    # unroll gradients of theta1 and theta2
    theta_grad = np.concatenate(((th1_grad.T).ravel(), (th2_grad.T).ravel()))

    return (J, theta_grad)
I am getting a dimension mismatch error while calculating delta_1 and delta_2 when I call this function with a single training example, but it works fine when called with the entire training batch.
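For what it's worth, a typical SGD loop keeps each single example two-dimensional (shape 1 x 784 rather than a flat 784 vector) so that the matrix operations inside the cost function still line up. A minimal sketch under that assumption, reusing the names from the question (num_epochs and learning_rate are illustrative):

for epoch in range(num_epochs):
    for i in np.random.permutation(X.shape[0]):
        X_i = X[i:i+1, :]  # the slice keeps the 2-D shape (1, 784)
        y_i = y[i:i+1]
        J, theta_grad = cost(theta, X_i, y_i, lamb)   # cost of a single example
        theta = theta - learning_rate * theta_grad    # update after every example

Note that cost() as posted also relies on the globals realTrainSetSize and trainSetSize, so those would have to reflect the size of whatever batch is actually passed in (1 here) for the hstack shapes and the averaging to stay consistent.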
After @IVlad gave me really useful feedback, I tried modifying my code, and the modified part now looks like this:
syn0 = (2*np.random.random((784, len(train_sample))) - 1)/8
syn1 = (2*np.random.random((len(train_sample), 10)) - 1)/8

for i in xrange(10000):
    # forward propagation
    l0 = train_sample
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))

    # calculate error
    l2_error = train_tag_bool - l2

    if (i % 1000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))

    # apply sigmoid to the error
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)

    # update weights
    syn1 += alpha * (l1.T.dot(l2_delta) - beta*syn1)
    syn0 += alpha * (l0.T.dot(l1_delta) - beta*syn0)
Note that the tags (true labels) are now in a <3000 x 10> matrix: each row is a sample, and the ten columns indicate which digit that sample represents. (As for train_tag_bool, now that I think about it, it is not really in boolean format, so the naming is a bit poor, but for the sake of the discussion I'll keep it this way for now.)
In this project I'm using only one hidden layer between the input and output layers, hoping it will be sufficient to complete the job. I have applied a learning rate and weight decay, as well as making the initial weights a bit smaller.
I used the code from the website when calculating the error rate, which is
np.mean(np.abs(l2_error))
and the result came out to be 0.1. I'm not sure what to take from that.
Also, I looked at the l2 layer (supposedly the output layer that gives the prediction), and the values are all extremely small (< 10^-9 for the largest value of each sample, and the smallest can reach 10^-85). This is after only 5 iterations, but I doubt things would be any different had I run it for 1k loops or more. If I return the max of each row, it's always the 9th element (representing the digit '9'), which is totally wrong.
I'm stuck again on this problem. The overflow problem is, and has been, the biggest challenge of my whole ML experience (back then MATLAB, not NumPy), and I've yet to find a way to deal with it.
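For reference, the kind of check I actually care about is classification accuracy rather than mean absolute error, i.e. something like this (assuming l2 and train_tag_bool as defined above):

# predicted digit = argmax of each output row, true digit = argmax of each one-hot label row
predicted = l2.argmax(axis=1)
actual = train_tag_bool.argmax(axis=1)
print("training accuracy:", np.mean(predicted == actual))

Given that the argmax is currently always the 9th element, this would come out near chance level.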
train_tag_bool code:
train_tag_bool = np.array([[0]*10]*len(train_tag)).astype('float64')
for i in range(len(train_tag)):
    if train_tag[i] == 0:
        train_tag_bool[i][0] = 1
    elif train_tag[i] == 1:
        train_tag_bool[i][1] = 1
    elif train_tag[i] == 2:
        train_tag_bool[i][2] = 1
    elif train_tag[i] == 3:
        train_tag_bool[i][3] = 1
    elif train_tag[i] == 4:
        train_tag_bool[i][4] = 1
    elif train_tag[i] == 5:
        train_tag_bool[i][5] = 1
    elif train_tag[i] == 6:
        train_tag_bool[i][6] = 1
    elif train_tag[i] == 7:
        train_tag_bool[i][7] = 1
    elif train_tag[i] == 8:
        train_tag_bool[i][8] = 1
    elif train_tag[i] == 9:
        train_tag_bool[i][9] = 1
Brute force, I know, but that's the least of my concerns right now. The result is a 3000 x 10 matrix with a 1 in the position corresponding to each sample's digit; the first element represents digit 0 and the last represents 9.
e.g. [0 0 0 0 0 0 1 0 0 0] represents 6, and [1 0 0 0 0 0 0 0 0 0] represents 0.
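(For the record, the same one-hot matrix can be built in one line with NumPy indexing, a trick that also shows up in the updated code in the answer below; a sketch assuming train_tag holds integer labels 0-9:)

# each row of the 10x10 identity matrix is exactly one of the one-hot vectors
train_tag_bool = np.eye(10, dtype='float64')[train_tag.astype(int).ravel()]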
The original code:
import cPickle, gzip
import numpy as np

# from deeplearning.net
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = cPickle.load(f)
f.close()

# sigmoid function
def nonlin(x, deriv=False):
    if (deriv == True):
        return x*(1-x)
    return 1/(1+np.exp(-x))

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

# need to decrease the sample size or else computer dies
train_sample = train_set[0][0:3000]
train_tag = train_set[1][0:3000]
train_tag = train_tag.reshape(len(train_tag), 1)

# train_set's dimension for the pixels are 50000(samples) x 784 (28x28 for each sample)
# therefore the coefficients should be 784x50000 to make the hidden layer 50k x 50k
syn0 = 2*np.random.random((784, len(train_sample))) - 1
syn1 = 2*np.random.random((len(train_sample), 1)) - 1

for i in xrange(10000):
    # forward propagation
    l0 = train_sample
    l1 = nonlin(np.dot(l0, syn0))
    l2 = nonlin(np.dot(l1, syn1))

    # calculate error
    l2_error = train_tag - l2

    if (i % 1000) == 0:
        print "Error:" + str(np.mean(np.abs(l2_error)))

    # apply sigmoid to the error
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)

    # update weights
    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)
Reference:
http://iamtrask.github.io/2015/07/12/basic-python-network/
http://yann.lecun.com/exdb/mnist/
I can't currently run the code, but there are a few things that stand out. I'm surprised it works well even on the toy problems used on the blog.
Before we start, you'll need more output neurons: 10 to be exact.
syn1 = 2*np.random.random((len(train_sample), 10)) - 1
And your labels (y) had better be length-10 arrays, with a 1 at the position of the correct digit and 0 elsewhere.
First of all, one thing I always attempt by default is to use float64 wherever possible... which almost never changes anything, so I'm not sure if you should get into this habit or not. Probably not.
Second, that code has no learning rate that you can set. This means that the learning rate is implicitly 1, which is huge for your problem, where people use 0.01 or even much less. To add a learning rate alpha, do:
syn1 += alpha * l1.T.dot(l2_delta)
syn0 += alpha * l0.T.dot(l1_delta)
And set it to at most 0.01. You'll have to fiddle with it for best results.
Third, it's usually better to initialize the net with small weights. [0, 1) might be too big. Try:
syn0 = (np.random.random((784,len(train_sample))) - 0.5) / 4
syn1 = (np.random.random((len(train_sample),1)) - 0.5) / 4
There are more involved initialization schemes that you can search for if you're interested, but I've gotten decent results with the above.
Fourth, regularization. The easiest to implement is probably weight decay. Implementing weight decay with a coefficient lambd (lambda itself is a reserved word in Python) can be done like this:
syn1 += alpha * l1.T.dot(l2_delta) - alpha * lambd * syn1
syn0 += alpha * l0.T.dot(l1_delta) - alpha * lambd * syn0
Common values are also < 0.1 or even < 0.01.
Dropout can also help, but it's a bit harder to implement and understand if you're just starting out, in my opinion. It's also more useful for deeper nets AFAIK. So maybe leave this for last.
Fifth, maybe also use momentum (explained in the weight decay link), which should decrease the learning time for your network. Also tune the number of iterations: you don't want too many, but not too few either.
Sixth, look into softmax for the output layer.
Seventh, look into tanh instead of your current nonlin sigmoid function.
If you apply these incrementally, you should start getting some meaningful results. I think regularization and smaller initial weights should help with the overflow errors.
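Regarding point six, a numerically stable row-wise softmax for an [N x 10] score matrix could look like the sketch below (just an illustration; the updated code further down sticks with tanh outputs instead). The max subtraction is also the standard way around the kind of overflow issues mentioned in the question.

import numpy as np

def softmax_rows(scores):
    # subtract each row's max before exponentiating to avoid overflow
    shifted = scores - scores.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)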
Update:
I have changed the code like this. After only 100 training epochs, accuracy is 84.79%. Not too bad with barely tweaking anything.
I have added bias neurons, momentum, and weight decay, used fewer hidden units (it was way too slow with what you had), changed to the tanh function, and made a few other tweaks.
You should be able to tweak it some more from here. I use Python 3.4, so I had to change a few things to get it to run, but it's nothing major.
import pickle, gzip
import numpy as np

# from deeplearning.net
# Load the dataset
f = gzip.open('mnist.pkl.gz', 'rb')
train_set, valid_set, test_set = pickle.load(f, encoding='latin1')
f.close()

# tanh activation (replaces the original sigmoid)
def nonlin(x, deriv=False):
    if (deriv == True):
        return 1 - x*x
    return np.tanh(x)

# seed random numbers to make calculation
# deterministic (just a good practice)
np.random.seed(1)

def make_proper_pairs_from_set(data_set):
    data_set_x, data_set_y = data_set
    data_set_y = np.eye(10)[:, data_set_y].T
    return data_set_x, data_set_y

train_x, train_y = make_proper_pairs_from_set(train_set)
train_x = train_x
train_y = train_y
test_x, test_y = make_proper_pairs_from_set(test_set)
print(len(train_y))

# train_set's dimension for the pixels is 50000 (samples) x 784 (28x28 for each sample)
# changed to 200 hidden neurons, should be plenty
syn0 = (2*np.random.random((785, 200)) - 1) / 10
syn1 = (2*np.random.random((201, 10)) - 1) / 10

velocities0 = np.zeros(syn0.shape)
velocities1 = np.zeros(syn1.shape)

alpha = 0.01
beta = 0.0001
momentum = 0.99
m = len(train_x)  # number of training samples

# moved the forward propagation to a function and added bias neurons
def forward_prop(set_x, m):
    l0 = np.c_[np.ones((m, 1)), set_x]
    l1 = nonlin(np.dot(l0, syn0))
    l1 = np.c_[np.ones((m, 1)), l1]
    l2 = nonlin(np.dot(l1, syn1))
    return l0, l1, l2, l2.argmax(axis=1)

num_epochs = 100
for i in range(num_epochs):
    # forward propagation
    l0, l1, l2, _ = forward_prop(train_x, m)

    # calculate error
    l2_error = l2 - train_y

    print("Error " + str(i) + ": " + str(np.mean(np.abs(l2_error))))

    # apply the activation derivative to the error
    l2_delta = l2_error * nonlin(l2, deriv=True)
    l1_error = l2_delta.dot(syn1.T)
    l1_delta = l1_error * nonlin(l1, deriv=True)
    l1_delta = l1_delta[:, 1:]

    # update weights
    # divide gradients by the number of samples
    grad0 = l0.T.dot(l1_delta) / m
    grad1 = l1.T.dot(l2_delta) / m

    v0 = velocities0
    v1 = velocities1
    velocities0 = velocities0 * momentum - alpha * grad0
    velocities1 = velocities1 * momentum - alpha * grad1

    # divide regularization by number of samples
    # because L2 regularization reduces to this
    syn1 += -v1 * momentum + (1 + momentum) * velocities1 - alpha * beta * syn1 / m
    syn0 += -v0 * momentum + (1 + momentum) * velocities0 - alpha * beta * syn0 / m

# find accuracy on test set
predictions = []
corrects = []

for i in range(len(test_x)):  # you can eliminate this loop too with a bit of work, but this part is very fast anyway
    _, _, _, rez = forward_prop([test_x[i, :]], 1)
    predictions.append(rez[0])
    corrects.append(test_y[i].argmax())

predictions = np.array(predictions)
corrects = np.array(corrects)

print(np.sum(predictions == corrects) / len(test_x))
Update 2:
If you increase the learning rate to 0.05 and the epochs to 1000, you get 95.43% accuracy.
Seeding the random number generator with the current time, adding more hidden neurons (or hidden layers), and more parameter tweaking can get this simple model to about 98% accuracy AFAIK. The problem is that it's slow to train.
Also, this methodology isn't really sound: I optimized the parameters to increase accuracy on the test set, so I might be overfitting the test set. You should use cross-validation or the validation set.
Anyway, as you can see, there are no overflow errors. If you want to discuss things in more detail, feel free to drop me an e-mail (address in profile).