Improving accuracy of multinomial logistic regression model built from scratch - python

I am currently working on creating a multi-class classifier using numpy, and I finally got a working model using softmax, as follows:
class MultinomialLogReg:
    def fit(self, X, y, lr=0.00001, epochs=1000):
        self.X = self.norm_x(np.insert(X, 0, 1, axis=1))
        self.y = y
        self.classes = np.unique(y)
        self.theta = np.zeros((len(self.classes), self.X.shape[1]))
        self.o_h_y = self.one_hot(y)

        for e in range(epochs):
            preds = self.probs(self.X)
            l, grad = self.get_loss(self.theta, self.X, self.o_h_y, preds)

            if e%10000 == 0:
                print("epoch: ", e, "loss: ", l)

            self.theta -= (lr*grad)

        return self

    def norm_x(self, X):
        for i in range(X.shape[0]):
            mn = np.amin(X[i])
            mx = np.amax(X[i])
            X[i] = (X[i] - mn)/(mx-mn)
        return X

    def one_hot(self, y):
        Y = np.zeros((y.shape[0], len(self.classes)))
        for i in range(Y.shape[0]):
            to_put = [0]*len(self.classes)
            to_put[y[i]] = 1
            Y[i] = to_put
        return Y

    def probs(self, X):
        return self.softmax(np.dot(X, self.theta.T))

    def get_loss(self, w, x, y, preds):
        m = x.shape[0]
        loss = (-1 / m) * np.sum(y * np.log(preds) + (1-y) * np.log(1-preds))
        grad = (1 / m) * (np.dot((preds - y).T, x)) #And compute the gradient for that loss
        return loss, grad

    def softmax(self, z):
        return np.exp(z) / np.sum(np.exp(z), axis=1).reshape(-1,1)

    def predict(self, X):
        X = np.insert(X, 0, 1, axis=1)
        return np.argmax(self.probs(X), axis=1)
        #return np.vectorize(lambda i: self.classes[i])(np.argmax(self.probs(X), axis=1))

    def score(self, X, y):
        return np.mean(self.predict(X) == y)
And I had several questions:
Is this a correct multinomial logistic regression implementation?
It takes 100,000 epochs with a learning rate of 0.1 for the loss to drop to 1–0.5 and for the accuracy on the test set to reach 70–90%. Would this be considered bad performance?
What are some ways of improving performance or speeding up training (so that fewer epochs are needed)?
I saw this cost function online which gives better accuracy. It looks like cross-entropy, but it is different from the cross-entropy optimization equations I have seen. Can someone explain how the two differ:
error = preds - self.o_h_y
grad = np.dot(error.T, self.X)
self.theta -= (lr*grad)

This looks right, but I think the preprocessing you perform in the fit function should be done outside of the model.
It's hard to know whether this is good or bad. While the loss landscape is convex, the time it takes to reach a minimum varies between problems. One way to make sure you've reached the optimal solution is to add a threshold that tests the size of the gradient norm, which becomes small when you're close to the optimum. Something like np.linalg.norm(grad) < 1e-8.
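In the poster's fit() loop that could look roughly like the sketch below; train is an illustrative standalone function (not part of the original class) that re-implements the same softmax gradient step with the norm-based stopping test:

import numpy as np

def train(X, Y_onehot, lr=0.1, tol=1e-8, max_epochs=1_000_000):
    theta = np.zeros((Y_onehot.shape[1], X.shape[1]))
    for e in range(max_epochs):
        logits = X @ theta.T
        logits -= logits.max(axis=1, keepdims=True)        # numerical stability
        P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        grad = (P - Y_onehot).T @ X / X.shape[0]            # averaged gradient
        if np.linalg.norm(grad) < tol:                      # tiny gradient: near the optimum
            print("converged at epoch", e)
            break
        theta -= lr * grad
    return theta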
You can use a better optimizer, such as Newton's method, or a quasi-Newton method, such as LBFGS. I would start with Newton's method as it's easier to implement. LBFGS is a non-trivial algorithm that approximates the Hessian required to perform Newton's method.
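As a sketch of the quasi-Newton route, scipy's L-BFGS-B can drive the same softmax loss and gradient; nll_and_grad and fit_lbfgs are illustrative names, not part of the original code:

import numpy as np
from scipy.optimize import minimize

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)        # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll_and_grad(theta_flat, X, Y):
    """Average cross-entropy loss and its gradient for softmax regression."""
    k, d = Y.shape[1], X.shape[1]
    theta = theta_flat.reshape(k, d)
    P = softmax(X @ theta.T)
    m = X.shape[0]
    loss = -np.sum(Y * np.log(P + 1e-12)) / m
    grad = (P - Y).T @ X / m
    return loss, grad.ravel()

def fit_lbfgs(X, Y_onehot):
    k, d = Y_onehot.shape[1], X.shape[1]
    res = minimize(nll_and_grad, np.zeros(k * d), args=(X, Y_onehot),
                   jac=True, method="L-BFGS-B")  # jac=True: fun returns (loss, grad)
    return res.x.reshape(k, d)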
It's the same; the gradients just aren't being averaged. Since you're performing gradient descent, the averaging is only a constant factor that can be folded into the learning rate, which has to be tuned anyway. In general, I think averaging makes it a bit easier to find a stable learning rate across different splits of the same dataset.
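A tiny check of that equivalence on random data (all names here are illustrative): scaling the learning rate by the number of samples makes the summed and averaged gradients take exactly the same step.

import numpy as np

rng = np.random.default_rng(0)
preds = rng.random((5, 3))
onehot = np.eye(3)[rng.integers(0, 3, 5)]
X = rng.random((5, 4))
m = X.shape[0]

grad_sum = (preds - onehot).T @ X       # gradient summed over samples
grad_mean = grad_sum / m                # gradient averaged over samples

lr = 0.1
step_a = lr * grad_sum                  # un-averaged gradient, learning rate lr
step_b = (lr * m) * grad_mean           # averaged gradient, learning rate lr * m
assert np.allclose(step_a, step_b)      # identical updates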
A question for you: when you evaluate your test set, are you preprocessing it the same way you preprocess the training set in your fit function?
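For example, a minimal sketch of fitting the scaling statistics on the training set and reusing them at test time; MinMaxScaler, X_train and X_test are hypothetical names, and the scaling here is per feature (per column):

import numpy as np

class MinMaxScaler:
    def fit(self, X):
        self.mn = X.min(axis=0)
        self.mx = X.max(axis=0)
        return self

    def transform(self, X):
        return (X - self.mn) / (self.mx - self.mn + 1e-12)

scaler = MinMaxScaler().fit(X_train)    # statistics come from the training set only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # the same statistics are reused at test time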

Related

Neural networks very bad accuracy when using more than one hidden layer

I have created the following neural network:
def init_weights(m, n=1):
    """
    initialize a matrix/vector of weights with xavier initialization
    :param m: out dim
    :param n: in dim
    :return: matrix/vector of random weights
    """
    limit = (6 / (n * m)) ** 0.5
    weights = np.random.uniform(-limit, limit, size=(m, n))
    if n == 1:
        weights = weights.reshape((-1,))
    return weights

def softmax(v):
    exp = np.exp(v)
    return exp / np.tile(exp.sum(1), (v.shape[1], 1)).T

def relu(x):
    return np.maximum(x, 0)

def sign(x):
    return (x > 0).astype(int)

class Model:
    """
    A class for neural network model
    """
    def __init__(self, sizes, lr):
        self.lr = lr
        self.weights = []
        self.biases = []
        self.memory = []
        for i in range(len(sizes) - 1):
            self.weights.append(init_weights(sizes[i + 1], sizes[i]))
            self.biases.append(init_weights(sizes[i + 1]))

    def forward(self, X):
        self.memory = [X]
        X = np.dot(self.weights[0], X.T).T + self.biases[0]
        for W, b in zip(self.weights[1:], self.biases[1:]):
            X = relu(X)
            self.memory.append(X)
            X = np.dot(W, X.T).T + b
        return softmax(X)

    def backward(self, y, y_pred):
        # calculate the errors for each layer
        y = np.eye(y_pred.shape[1])[y]
        errors = [y_pred - y]
        for i in range(len(self.weights) - 1, 0, -1):
            new_err = sign(self.memory[i]) * \
                np.dot(errors[0], self.weights[i])
            errors.insert(0, new_err)
        # update weights
        for i in range(len(self.weights)):
            self.weights[i] -= self.lr * \
                np.dot(self.memory[i].T, errors[i]).T
            self.biases[i] -= self.lr * errors[i].sum(0)
The data has 10 classes. When using a single hidden layer the accuracy is almost 40%. When using 2 or 3 hidden layers, the accuracy drops to around 9-10% from the first epoch and stays there. The accuracy on the train set is also in that range. Is there a problem with my implementation that could cause such a thing?
You asked about improving the accuracy of a machine learning model, which is a very broad and ambiguous problem, because the answer varies between model types and data types.
In your case the model is a neural network whose accuracy depends on several factors. You are trying to optimize accuracy on the basis of activation functions, weights, or the number of hidden layers alone, which is not the right approach. To increase accuracy you have to consider other factors too; a basic checklist could be the following:
Increase Hidden Layers
Change Activation Functions
Experiment with initial weight initialization
Normalize Training Data
Scale Training Data
Check for Class Imbalance
Right now you are trying to achieve state-of-the-art accuracy by tuning very few factors. I don't know your dataset, since you haven't shown the preprocessing code, but I recommend double-checking it: correctly normalizing the dataset may increase accuracy; check whether the features should be scaled; and, most importantly, check whether one class is far more frequent than the others, because such an imbalance will also lead to poor accuracy metrics.
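As a concrete starting point, here is a minimal sketch (illustrative names) of two of the checks above: standardizing the features using training-set statistics only, and counting the samples per class to spot imbalance:

import numpy as np

def normalize(X_train, X_test):
    """Standardize features using statistics of the training set only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12
    return (X_train - mu) / sigma, (X_test - mu) / sigma

def class_counts(y):
    """Return how many samples each class has, to spot imbalance."""
    labels, counts = np.unique(y, return_counts=True)
    return dict(zip(labels.tolist(), counts.tolist()))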
For more details check this; it contains the mathematical proof and an explanation of how these things affect your ML model's accuracy.

Python regularized gradient descent for logistic regression

I'm trying to implement gradient descent (GD) (not the stochastic version) for logistic regression in Python 3.x, and I'm having some trouble.
Logistic regression is defined as follows (1):

a(x) = 1 / (1 + exp(-w1*x1 - w2*x2))

Formulas for the gradient steps are defined as follows (2):

w1 := w1 + k * (1/l) * sum_i[ y_i * x_i1 * (1 - 1/(1 + exp(-y_i*(w1*x_i1 + w2*x_i2)))) ] - k*C*w1
w2 := w2 + k * (1/l) * sum_i[ y_i * x_i2 * (1 - 1/(1 + exp(-y_i*(w1*x_i1 + w2*x_i2)))) ] - k*C*w2
Description of data:
X is an (Nx2)-matrix of objects (consisting of positive and negative float numbers)
y is (Nx1)-vector of class labels (-1 or +1)
Task:
Implement gradient descent 1) with L2-regularization; and 2) without regularization. Desired results: vectors of weights.
Parameters: regularization rate C=10 for regularized regression and C=0 for unregularized regression; gradient step k=0.1; max.number of iterations = 10000; tolerance = 1e-5.
Note: GD has converged if the distance between the weight vectors from the current and previous steps is less than the tolerance (1e-5).
Here is my implementation:
k - gradient step;
C - regularization rate.
import numpy as np

def sigmoid(z):
    result = 1./(1. + np.exp(-z))
    return result

def distance(vector1, vector2):
    vector1 = np.array(vector1, dtype='f')
    vector2 = np.array(vector2, dtype='f')
    return np.linalg.norm(vector1 - vector2)

def GD(X, y, C, k=0.1, tolerance=1e-5, max_iter=10000):
    X = np.matrix(X)
    y = np.matrix(y)
    l = len(X)
    w1, w2 = 0., 0.  # weights (look formula (2) in the beginning of question)
    difference = 1.
    iteration = 1
    while(difference > tolerance):
        hypothesis = y*(X*np.matrix([w1, w2]).T)
        w1_updated = w1 + (k/l)*np.sum(y*X[:,0]*(1.-(sigmoid(hypothesis)))) - k*C*w1
        w2_updated = w2 + (k/l)*np.sum(y*X[:,1]*(1.-(sigmoid(hypothesis)))) - k*C*w2
        difference = distance([w1, w2], [w1_updated, w2_updated])
        w1, w2 = w1_updated, w2_updated
        if(iteration >= max_iter):
            break
        iteration = iteration + 1
    return [w1_updated, w2_updated]  # vector of weights
Respectively:
# call for UNregularized GD: C=0
w = GD(X, y, C=0., k=0.1)
and
# call for regularized GD: C=10
w_reg = GD(X, y, C=10., k=0.1)
Here are the results (weight vectors):
# UNregularized GD
[0.035736331265589463, 0.032464572442830832]
# regularized GD
[5.0979561973044096e-06, 4.6312243707352652e-06]
However, it should be (right answers for self-control):
# UNregularized GD
[0.28801877, 0.09179177]
# regularized GD
[0.02855938, 0.02478083]
Please, can you tell me what's going wrong here? I've been sitting with this problem for three days in a row and still have no idea.
Thank you in advance.
First of all, the sigmoid function should be

def sigmoid(Z):
    A = 1/(1 + np.exp(-Z))
    return A
Try to run it again with this formula. Then, what is L?
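For reference, a minimal vectorized sketch of update (2) using plain numpy arrays instead of np.matrix; gd is an illustrative name, and whether it reproduces the reference weights depends on the actual data:

import numpy as np

def gd(X, y, C, k=0.1, tolerance=1e-5, max_iter=10000):
    w = np.zeros(2)
    for _ in range(max_iter):
        margins = y * (X @ w)                          # y_i * <w, x_i>, shape (N,)
        coeff = y * (1.0 - 1.0 / (1.0 + np.exp(-margins)))
        w_new = w + (k / len(y)) * (X.T @ coeff) - k * C * w
        if np.linalg.norm(w_new - w) < tolerance:      # convergence check from the task
            return w_new
        w = w_new
    return w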

CS231n: How to calculate gradient for Softmax loss function?

I am watching some videos for Stanford CS231n: Convolutional Neural Networks for Visual Recognition, but do not quite understand how to calculate the analytical gradient of the softmax loss function using numpy.
From this stackexchange answer, the softmax gradient is calculated as:

dL_i/dw_j = (p_j - 1{y_i = j}) * x_i,   where   p_j = e^{f_j} / \sum_k e^{f_k}

The Python implementation for the above is:
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
    for j in range(num_classes):
        p = np.exp(f_i[j])/sum_i
        dW[j, :] += (p-(j == y[i])) * X[:, i]
Could anyone explain how the above snippet works? The detailed implementation of the softmax loss is also included below.
def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)
    Inputs:
    - W: C x D array of weights
    - X: D x N array of data. Data are D-dimensional columns
    - y: 1-dimensional array of length N with labels 0...K-1, for K classes
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W, an array of same size as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    #############################################################################
    # Compute the softmax loss and its gradient using explicit loops.           #
    # Store the loss in loss and the gradient in dW. If you are not careful     #
    # here, it is easy to run into numeric instability. Don't forget the        #
    # regularization!                                                           #
    #############################################################################
    # Get shapes
    num_classes = W.shape[0]
    num_train = X.shape[1]
    for i in range(num_train):
        # Compute vector of scores
        f_i = W.dot(X[:, i])  # in R^{num_classes}

        # Normalization trick to avoid numerical instability, per http://cs231n.github.io/linear-classify/#softmax
        log_c = np.max(f_i)
        f_i -= log_c

        # Compute loss (and add to it, divided later)
        # L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
        sum_i = 0.0
        for f_i_j in f_i:
            sum_i += np.exp(f_i_j)
        loss += -f_i[y[i]] + np.log(sum_i)

        # Compute gradient
        # dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j)-Ind{y_i = j} )]
        # Here we are computing the contribution to the inner sum for a given i.
        for j in range(num_classes):
            p = np.exp(f_i[j])/sum_i
            dW[j, :] += (p-(j == y[i])) * X[:, i]

    # Compute average
    loss /= num_train
    dW /= num_train

    # Regularization
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg*W

    return loss, dW
Not sure if this helps, but:

1{y_i = j}

is really the indicator function, as described here. This forms the expression (j == y[i]) in the code.
Also, the gradient of the loss with respect to the weights is:

\nabla_{w_j} L_i = (dL_i/df_j) * \nabla_{w_j} f_j

where

\nabla_{w_j} f_j = x_i

which is the origin of the X[:,i] in the code.
I know this is late, but here's my answer:
I'm assuming you are familiar with the cs231n Softmax loss function.
We know that:

L_i = -log( e^{f_{y_i}} / \sum_j e^{f_j} )

So, just as we did with the SVM loss function, the gradients are as follows:

dL_i/dw_{y_i} = (p_{y_i} - 1) * x_i
dL_i/dw_j     = p_j * x_i          for j != y_i,   with   p_j = e^{f_j} / \sum_k e^{f_k}

Hope that helped.
A supplement to this answer with a small example.
I came across this post and still was not 100% clear how to arrive at the partial derivatives.
For that reason I took another approach to get to the same results - maybe it is helpful to others too.
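In that spirit, here is a small numerical check, not from the original answer, that the analytic gradient (p - indicator) * x matches a finite-difference gradient of the softmax loss:

import numpy as np

def loss(W, x, y):
    f = W @ x
    f -= f.max()                                    # stability shift
    return -f[y] + np.log(np.exp(f).sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
y = 1

# analytic gradient: (p_j - 1{j == y}) * x
f = W @ x
f -= f.max()
p = np.exp(f) / np.exp(f).sum()
dW = np.outer(p - (np.arange(3) == y), x)

# numerical gradient by central differences
num = np.zeros_like(W)
eps = 1e-6
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        num[i, j] = (loss(Wp, x, y) - loss(Wm, x, y)) / (2 * eps)

print(np.max(np.abs(dW - num)))                     # should be tiny (around 1e-9)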

Issue with computing gradient for Rnn in Theano

I am playing with vanilla Rnn's, training with gradient descent (non-batch version), and I am having an issue with the gradient computation for the (scalar) cost; here's the relevant portion of my code:
class Rnn(object):

    # ............ [skipping the trivial initialization]

    def recurrence(x_t, h_tm_prev):
        h_t = T.tanh(T.dot(x_t, self.W_xh) +
                     T.dot(h_tm_prev, self.W_hh) + self.b_h)
        return h_t

    h, _ = theano.scan(
        recurrence,
        sequences=self.input,
        outputs_info=self.h0
    )

    y_t = T.dot(h[-1], self.W_hy) + self.b_y
    self.p_y_given_x = T.nnet.softmax(y_t)
    self.y_pred = T.argmax(self.p_y_given_x, axis=1)

    def negative_log_likelihood(self, y):
        return -T.mean(T.log(self.p_y_given_x)[:, y])


def testRnn(dataset, vocabulary, learning_rate=0.01, n_epochs=50):
    # ............ [skipping the trivial initialization]
    index = T.lscalar('index')
    x = T.fmatrix('x')
    y = T.iscalar('y')

    rnn = Rnn(x, n_x=27, n_h=12, n_y=27)
    nll = rnn.negative_log_likelihood(y)

    cost = T.lscalar('cost')
    gparams = [T.grad(cost, param) for param in rnn.params]
    updates = [(param, param - learning_rate * gparam)
               for param, gparam in zip(rnn.params, gparams)
               ]

    train_model = theano.function(
        inputs=[index],
        outputs=nll,
        givens={
            x: train_set_x[index],
            y: train_set_y[index]
        },
    )

    sgd_step = theano.function(
        inputs=[cost],
        outputs=[],
        updates=updates
    )

    done_looping = False
    while (epoch < n_epochs) and (not done_looping):
        epoch += 1
        tr_cost = 0.
        for idx in xrange(n_train_examples):
            tr_cost += train_model(idx)
        # perform sgd step after going through the complete training set
        sgd_step(tr_cost)
For some reasons I don't want to pass the complete (training) data to train_model(..); instead, I want to pass individual examples one at a time. Now the problem is that each call to train_model(..) returns the cost (negative log-likelihood) of that particular example, and then I have to aggregate all the costs (over the complete training data-set), take the derivative, and perform the relevant update to the weight parameters in sgd_step(..). For obvious reasons, with my current implementation I am getting this error: theano.gradient.DisconnectedInputError: grad method was asked to compute the gradient with respect to a variable that is not part of the computational graph of the cost, or is used only by a non-differentiable operator: W_xh. Now I don't understand how to make 'cost' a part of the computational graph (since in my case I have to wait for it to be aggregated), or is there a better/more elegant way to achieve the same thing?
Thanks.
It turns out one cannot bring a symbolic variable into the Theano graph if it is not part of the computational graph. Therefore, I had to change the way data is passed to train_model(..); passing the complete training data instead of individual examples fixed the issue.
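For illustration, a rough sketch of what that fix can look like, assuming the poster's Rnn class is adapted so that x and y cover the whole training set; the key point is that T.grad is taken on the symbolic nll expression instead of on a fresh T.lscalar('cost'):

import theano
import theano.tensor as T

x = T.fmatrix('x')
y = T.ivector('y')                                 # label vector for the whole set
rnn = Rnn(x, n_x=27, n_h=12, n_y=27)               # poster's class, assumed adapted

cost = rnn.negative_log_likelihood(y)              # symbolic expression in the graph
gparams = [T.grad(cost, param) for param in rnn.params]
updates = [(param, param - 0.01 * gparam)
           for param, gparam in zip(rnn.params, gparams)]

train_model = theano.function(inputs=[x, y], outputs=cost, updates=updates)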

Logistic Regression with regularization in python failing to minimize

I'm implementing logistic regression based on the Coursera documentation, both in python and Octave.
In Octave, I managed to do it and achieve the right training accuracy, but in Python, since I don't have access to fminunc, I cannot figure out a workaround.
Currently, this is my code:
df = pandas.DataFrame.from_csv('ex2data2.txt', header=None, index_col=None)
df.columns = ['x1', 'x2', 'y']

y = df[df.columns[-1]].as_matrix()
m = len(y)
y = y.reshape(m, 1)

X = df[df.columns[:-1]]
X = X.as_matrix()

from sklearn.preprocessing import PolynomialFeatures
feature_mapper = PolynomialFeatures(degree=6)
X = feature_mapper.fit_transform(X)

def sigmoid(z):
    return 1/(1+np.power(np.e, z))

def cost_function_reg(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    reg = (_lambda / (2.0*m)) * shifted_theta.T.dot(shifted_theta)
    J = ((1.0/m)*(-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1-h)))) + reg
    return J

def gradient(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    gradR = _lambda*shifted_theta
    gradR.shape = (gradR.shape[0], 1)
    grad = (1.0/m)*(X.T.dot(h-y)+gradR)
    return grad.flatten()

from scipy.optimize import *
theta = fmin_ncg(cost_f, initial_theta, fprime=gradient)

predictions = predict(theta, X)
accuracy = np.mean(np.double(predictions == y)) * 100
print 'Train Accuracy: %.2f' % accuracy
The output is:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 0.693147
Iterations: 0
Function evaluations: 22
Gradient evaluations: 12
Hessian evaluations: 0
Train Accuracy: 50.85
In octave, the accuracy is: 83.05.
Any help is appreciated.
There were two problems on that implementation:
The first one: fmin_ncg is not ideal for this minimization. I had used it in the previous exercise, but it was failing to find theta with this gradient function, which is identical to the one in Octave.
Switching to
theta = fmin_bfgs(cost_function_reg, initial_theta)
fixed that issue.
The second issue was that the accuracy was being miscalculated.
Once I optimized with fmin_bfgs and reached a cost matching the Octave result (0.529), the two sides of (predictions == y) had different shapes ((118,) and (118, 1)), so broadcasting yielded a (118, 118) matrix instead of a vector.
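A minimal sketch of the shape fix: flatten both sides so the comparison is elementwise instead of broadcasting to an MxM matrix (predict, theta, X and y are as in the question above):

predictions = predict(theta, X).reshape(-1)
accuracy = np.mean(predictions == y.reshape(-1)) * 100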
