My classifier produces soft classifications, and I want to select an optimal threshold (that is, one that maximizes accuracy) from its outputs on the training cases, then use that threshold to produce hard classifications. While the problem is relatively easy in general, I find it hard to optimise the code so that the computation does not take forever. Below is code that essentially recreates the optimisation procedure on some dummy data. Could you please point me in a direction that could improve performance?
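import numpy as np
from sklearn.metrics import accuracy_score
y_pred = np.random.rand(400000)
y_true = np.random.randint(2, size=400000)
# Brute force: one full accuracy pass per unique score
accs = [(accuracy_score(y_true, y_pred > t), t) for t in np.unique(y_pred)]
train_acc, train_thresh = max(accs, key=lambda pair: pair[0])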
I realize that I could sort both y_pred and y_true prior to the loop, and use that to my advantage when binarizing y_pred but that didn't bring much improvement (unless I did something wrong).
Any help would be much appreciated.
Sort y_pred in descending order, reorder y_true to match, and map the labels to +1/-1. Then use Kadane's algorithm (which here reduces to a running prefix sum, because the subarray must start at index 0) to find an index i such that the subarray of labels from 0 to i has maximum sum. Your optimal threshold b is then b = (y_pred[i] + y_pred[i+1]) / 2. For separable data this is the same output that a hard-margin SVM would give you, that is, the hyperplane (or for your 1-dimensional case, a threshold) that maximizes the margin between classes.
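A minimal sketch of the sort-and-accumulate idea, assuming binary 0/1 labels and the same decision rule as the question (`y_pred > t`, with `t` drawn from the observed scores). The function name `best_threshold` is made up for illustration. It sorts in ascending order and uses cumulative sums, so the whole search costs one sort instead of one accuracy pass per threshold.

```python
import numpy as np

def best_threshold(y_true, y_pred):
    """Return (accuracy, threshold) maximizing the accuracy of `y_pred > threshold`."""
    y_true = np.asarray(y_true).astype(int)
    y_pred = np.asarray(y_pred, dtype=float)
    order = np.argsort(y_pred, kind="stable")
    scores, labels = y_pred[order], y_true[order]

    # With threshold scores[i], items 0..i are predicted 0 and the rest 1.
    true_neg = np.cumsum(1 - labels)              # negatives at or below the cut
    true_pos = labels.sum() - np.cumsum(labels)   # positives above the cut
    correct = true_neg + true_pos

    # Tied scores cannot be separated, so only the last index of each
    # distinct score is a valid cut point.
    valid = np.flatnonzero(np.r_[scores[1:] != scores[:-1], True])
    best = valid[np.argmax(correct[valid])]
    return correct[best] / labels.size, scores[best]

# Same dummy data shape as in the question
rng = np.random.default_rng(0)
y_pred = rng.random(400_000)
y_true = rng.integers(0, 2, size=400_000)
train_acc, train_thresh = best_threshold(y_true, y_pred)
print(train_acc, train_thresh)
```

If you prefer the midpoint between two neighbouring scores as the threshold, you can average `scores[best]` with the next larger score; the accuracy on the training data is unchanged.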
I wrote a helper function in python:
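def opt_threshold_acc(y_true, y_pred):
    # Pair each label with its score and sort by score, ascending
    A = list(zip(y_true, y_pred))
    A = sorted(A, key=lambda x: x[1])
    total = len(A)
    # Start with the threshold below every score: everything is predicted 1
    tp = len([1 for x in A if x[0]==1])
    tn = 0
    th_acc = []
    for x in A:
        th = x[1]
        # Raising the threshold to this score flips this sample's prediction to 0
        if x[0] == 1:
            tp -= 1
        else:
            tn += 1
        acc = (tp + tn) / total
        th_acc.append((th, acc))
    return max(th_acc, key=lambda x: x[1])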
I am currently working on creating a multi class classifier using numpy and finally got a working model using softmax as follows:
import numpy as np

class MultinomialLogReg:
    def fit(self, X, y, lr=0.00001, epochs=1000):
        self.X = self.norm_x(np.insert(X, 0, 1, axis=1))
        self.y = y
        self.classes = np.unique(y)
        self.theta = np.zeros((len(self.classes), self.X.shape[1]))
        self.o_h_y = self.one_hot(y)
        for e in range(epochs):
            preds = self.probs(self.X)
            l, grad = self.get_loss(self.theta, self.X, self.o_h_y, preds)
            if e%10000 == 0:
                print("epoch: ", e, "loss: ", l)
            self.theta -= (lr*grad)
        return self

    def norm_x(self, X):
        for i in range(X.shape[0]):
            mn = np.amin(X[i])
            mx = np.amax(X[i])
            X[i] = (X[i] - mn)/(mx-mn)
        return X

    def one_hot(self, y):
        Y = np.zeros((y.shape[0], len(self.classes)))
        for i in range(Y.shape[0]):
            to_put = [0]*len(self.classes)
            to_put[y[i]] = 1
            Y[i] = to_put
        return Y

    def probs(self, X):
        return self.softmax(np.dot(X, self.theta.T))

    def get_loss(self, w,x,y,preds):
        m = x.shape[0]
        loss = (-1 / m) * np.sum(y * np.log(preds) + (1-y) * np.log(1-preds))
        grad = (1 / m) * (np.dot((preds - y).T, x)) #And compute the gradient for that loss
        return loss,grad

    def softmax(self, z):
        return np.exp(z) / np.sum(np.exp(z), axis=1).reshape(-1,1)

    def predict(self, X):
        X = np.insert(X, 0, 1, axis=1)
        return np.argmax(self.probs(X), axis=1)
        #return np.vectorize(lambda i: self.classes[i])(np.argmax(self.probs(X), axis=1))

    def score(self, X, y):
        return np.mean(self.predict(X) == y)
And had several questions:
Is this a correct multinomial logistic regression implementation?
It takes 100,000 epochs with a learning rate of 0.1 for the loss to fall to between 1 and 0.5 and to reach an accuracy of 70 to 90% on the test set. Would this be considered bad performance?
What are some ways to improve performance or speed up training (so that it needs fewer epochs)?
I saw this cost function online, which gives better accuracy. It looks like cross-entropy, but it is different from the cross-entropy optimization equations I have seen. Can someone explain how the two differ?
error = preds - self.o_h_y
grad = np.dot(error.T, self.X)
self.theta -= (lr*grad)
This looks right, but I think the preprocessing you perform in the fit function should be done outside of the model.
It's hard to know whether this is good or bad. While the loss landscape is convex, the time it takes to reach a minimum varies between problems. One way to check that you've reached the optimal solution is to add a stopping test on the size of the gradient norm, which is small when you're close to the optimum. Something like np.linalg.norm(grad) < 1e-8.
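A rough sketch of such a stopping test, assuming plain batch gradient descent on softmax regression. The function name `fit_softmax_gd`, the synthetic data, and the values of `lr`, `tol` and `max_epochs` are all placeholders for illustration, not taken from the question.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_gd(X, Y, lr=0.5, tol=1e-4, max_epochs=20000):
    """Gradient descent that stops once the gradient norm drops below tol."""
    theta = np.zeros((Y.shape[1], X.shape[1]))
    for epoch in range(max_epochs):
        grad = (softmax(X @ theta.T) - Y).T @ X / X.shape[0]
        grad_norm = np.linalg.norm(grad)
        if grad_norm < tol:
            break                           # close enough to a stationary point
        theta -= lr * grad
    return theta, epoch, grad_norm

# Synthetic, overlapping 3-class data (assumption: bias column already added)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
true_theta = rng.normal(size=(3, 3))
y = np.array([rng.choice(3, p=p) for p in softmax(X @ true_theta.T)])
Y = np.eye(3)[y]                            # one-hot targets

theta, last_epoch, grad_norm = fit_softmax_gd(X, Y)
print(last_epoch, grad_norm)
```

If the loop runs out of epochs before the norm falls below `tol`, that is a hint that the learning rate or the feature scaling needs attention.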
You can use a better optimizer, such as Newton's method, or a quasi-Newton method, such as LBFGS. I would start with Newton's method as it's easier to implement. LBFGS is a non-trivial algorithm that approximates the Hessian required to perform Newton's method.
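If you would rather not implement the optimizer yourself, one option is to hand the loss and gradient to SciPy. The sketch below assumes `scipy.optimize.minimize` with `method="L-BFGS-B"` and `jac=True` (meaning the objective returns both the loss and the gradient). The helper names and the synthetic data are made up for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def loss_and_grad(theta_flat, X, Y):
    """Mean cross-entropy and its gradient for a flattened theta."""
    theta = theta_flat.reshape(Y.shape[1], X.shape[1])
    log_p = log_softmax(X @ theta.T)
    loss = -np.sum(Y * log_p) / X.shape[0]
    grad = (np.exp(log_p) - Y).T @ X / X.shape[0]
    return loss, grad.ravel()

# Synthetic, overlapping 3-class data (assumption: bias column already added)
rng = np.random.default_rng(0)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
true_theta = rng.normal(size=(3, 3))
p = np.exp(log_softmax(X @ true_theta.T))
y = np.array([rng.choice(3, p=row) for row in p])
Y = np.eye(3)[y]

result = minimize(loss_and_grad, np.zeros(9), args=(X, Y),
                  jac=True, method="L-BFGS-B")
theta = result.x.reshape(3, 3)
print(result.nit, result.fun)
```

The optimizer works on a flat parameter vector, which is why theta is reshaped inside the objective and again after the call.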
It's the same, except that the gradients aren't being averaged. Since you're performing gradient descent, the averaging is a constant factor that can be ignored, because a properly tuned learning rate is required anyway. In general, I think averaging makes it a bit easier to obtain a stable learning rate over different splits of the same dataset.
A question for you: When you evaluate your test set, are you preprocessing them the same way you do the training set in your fit function?
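To make the point about preprocessing concrete: in the posted class, predict does not call norm_x, so the test data is not scaled the way the training data is. Here is a minimal sketch of keeping the scaling outside the model, assuming column-wise min-max scaling with statistics taken from the training set only. The function names `fit_min_max` and `apply_min_max` are hypothetical.

```python
import numpy as np

def fit_min_max(X_train):
    """Column-wise minimum and range, computed on the training data only."""
    X_train = np.asarray(X_train, dtype=float)
    mn = X_train.min(axis=0)
    span = X_train.max(axis=0) - mn
    span[span == 0] = 1.0          # constant columns: avoid division by zero
    return mn, span

def apply_min_max(X, mn, span):
    """Scale any data set with the statistics learned from the training set."""
    return (np.asarray(X, dtype=float) - mn) / span

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]])
X_test = np.array([[2.0, 40.0]])

mn, span = fit_min_max(X_train)
X_train_scaled = apply_min_max(X_train, mn, span)
X_test_scaled = apply_min_max(X_test, mn, span)   # same mn and span as training
print(X_test_scaled)
```

Test values can fall outside [0, 1] with this scheme, which is expected because the statistics come from the training set.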
I want to choose an optimal threshold that maximises accuracy. Using a for loop, I computed the accuracy for every threshold and appended it to a list, so I can print the values and see which threshold gives the maximum accuracy. However, I would like to write code that outputs only the maximum accuracy and its threshold, or that stores the maximum accuracy and its threshold in separate variables.
lm__pred_train = lm_.predict(X_train)

def fn_accuracy(actuals, predictions):
    return np.mean(actuals == predictions)

thresholds = np.arange(0, 1, 0.001)
accuracy = []
for th in thresholds:
    acc = np.round(fn_accuracy(lm__pred_train > th, y_train), 3)
    accuracy.append(acc)
    print(th, acc)
I'm kind of a newbie and still get lost in for loops, so I would really appreciate any help with this issue.
lm__pred_train = lm_.predict(X_train)

def fn_accuracy(actuals, predictions):
    return np.mean(actuals == predictions)

thresholds = np.arange(0, 1, 0.001)
# Best accuracy seen so far and the threshold that produced it
max_acc, max_th = -1.0, None
for th in thresholds:
    acc = np.round(fn_accuracy(lm__pred_train > th, y_train), 3)
    if acc > max_acc:
        max_acc, max_th = acc, th
print(max_th, max_acc)
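If you want to avoid the explicit loop, the same search can be sketched with broadcasting and np.argmax. This assumes 0/1 labels and a 1-D array of scores. The function name `best_threshold_on_grid` and the synthetic `scores` array (a stand-in for `lm_.predict(X_train)`) are made up for illustration.

```python
import numpy as np

def best_threshold_on_grid(scores, y_true, thresholds):
    """Return (threshold, accuracy) for the grid point with the highest accuracy."""
    y_bool = np.asarray(y_true).astype(bool)
    # Row k holds the hard predictions obtained with thresholds[k]
    preds = np.asarray(scores)[None, :] > np.asarray(thresholds)[:, None]
    accs = (preds == y_bool[None, :]).mean(axis=1)
    best = int(np.argmax(accs))        # ties resolve to the lowest threshold
    return thresholds[best], accs[best]

# Hypothetical stand-in data: noisy scores that are higher for class 1
rng = np.random.default_rng(0)
y_train = rng.integers(0, 2, size=500)
scores = np.clip(0.5 * y_train + rng.normal(0.25, 0.2, size=500), 0, 1)

max_th, max_acc = best_threshold_on_grid(scores, y_train, np.arange(0, 1, 0.001))
print(max_th, max_acc)
```

The intermediate boolean matrix has one row per threshold and one column per sample, so for very large training sets the loop version may be kinder to memory.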
I am watching some videos for Stanford CS231: Convolutional Neural Networks for Visual Recognition but do not quite understand how to calculate analytical gradient for softmax loss function using numpy.
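From this stackexchange answer, the softmax gradient is calculated as (in the notation of the code comments below):

$$\nabla_{w_j} L = \frac{1}{N} \sum_{i=1}^{N} x_i \left( p(y_i = j \mid x_i) - \mathbb{1}\{y_i = j\} \right)$$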
Python implementation for above is:
num_classes = W.shape[0]
num_train = X.shape[1]
for i in range(num_train):
    for j in range(num_classes):
        p = np.exp(f_i[j])/sum_i
        dW[j, :] += (p-(j == y[i])) * X[:, i]
Could anyone explain how the above snippet works? A detailed implementation of softmax is also included below.
import numpy as np

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops)

    Inputs:
    - W: C x D array of weights
    - X: D x N array of data. Data are D-dimensional columns
    - y: 1-dimensional array of length N with labels 0...K-1, for K classes
    - reg: (float) regularization strength

    Returns:
    a tuple of:
    - loss as single float
    - gradient with respect to weights W, an array of same size as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    # Compute the softmax loss and its gradient using explicit loops.
    # Store the loss in loss and the gradient in dW. If you are not careful
    # here, it is easy to run into numeric instability. Don't forget the
    # regularization!

    # Get shapes
    num_classes = W.shape[0]
    num_train = X.shape[1]

    for i in range(num_train):
        # Compute vector of scores
        f_i = W.dot(X[:, i]) # in R^{num_classes}

        # Normalization trick to avoid numerical instability, per
        log_c = np.max(f_i)
        f_i -= log_c

        # Compute loss (and add to it, divided later)
        # L_i = - f(x_i)_{y_i} + log \sum_j e^{f(x_i)_j}
        sum_i = 0.0
        for f_i_j in f_i:
            sum_i += np.exp(f_i_j)
        loss += -f_i[y[i]] + np.log(sum_i)

        # Compute gradient
        # dw_j = 1/num_train * \sum_i[x_i * (p(y_i = j)-Ind{y_i = j} )]
        # Here we are computing the contribution to the inner sum for a given i.
        for j in range(num_classes):
            p = np.exp(f_i[j])/sum_i
            dW[j, :] += (p-(j == y[i])) * X[:, i]

    # Compute average
    loss /= num_train
    dW /= num_train

    # Regularization
    loss += 0.5 * reg * np.sum(W * W)
    dW += reg*W

    return loss, dW
Not sure if this helps, but:
The Ind{y_i = j} term is really the indicator function, as described here. This forms the expression (j == y[i]) in the code.
Also, the gradient of the loss with respect to the weights is:
which is the origin of the X[:,i] in the code.
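For readers who want the intermediate step spelled out, here is a sketch of the chain rule behind the inner loop, using the notation of the code comments: f = W x_i is the score vector, p_j is the softmax probability of class j, and the indicator is 1 when y_i = j and 0 otherwise. It covers the per-sample term only, before averaging and regularization.

```latex
\begin{aligned}
L_i &= -f_{y_i} + \log \sum_k e^{f_k}, \qquad f = W x_i, \\
\frac{\partial L_i}{\partial f_j}
    &= \frac{e^{f_j}}{\sum_k e^{f_k}} - \mathbb{1}\{y_i = j\}
     = p_j - \mathbb{1}\{y_i = j\}, \\
\frac{\partial f_j}{\partial w_j} &= x_i, \\
\frac{\partial L_i}{\partial w_j}
    &= \left( p_j - \mathbb{1}\{y_i = j\} \right) x_i .
\end{aligned}
```

The last line is what `dW[j, :] += (p-(j == y[i])) * X[:, i]` accumulates for each sample i and class j.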
I know this is late but here's my answer:
I'm assuming you are familiar with the cs231n Softmax loss function.
We know that:
So just as we did with the SVM loss function the gradients are as follows:
Hope that helped.
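As a way to convince yourself that the gradient expression is right, here is a sketch of a vectorized version of the same loss, checked against a finite-difference estimate. It assumes the conventions of softmax_loss_naive above (W is C x D, X is D x N). The function name `softmax_loss_vectorized` and the random data are assumptions for illustration.

```python
import numpy as np

def softmax_loss_vectorized(W, X, y, reg):
    """Same loss and gradient as the looped version, without Python loops."""
    num_train = X.shape[1]
    cols = np.arange(num_train)
    scores = W @ X                                    # C x N
    scores -= scores.max(axis=0, keepdims=True)       # numerical stability
    exp_scores = np.exp(scores)
    probs = exp_scores / exp_scores.sum(axis=0, keepdims=True)

    loss = -np.log(probs[y, cols]).mean() + 0.5 * reg * np.sum(W * W)

    dscores = probs.copy()
    dscores[y, cols] -= 1.0                           # p - indicator
    dW = dscores @ X.T / num_train + reg * W
    return loss, dW

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 5))
X = rng.normal(size=(5, 20))
y = rng.integers(0, 3, size=20)
loss, dW = softmax_loss_vectorized(W, X, y, reg=0.1)

# Central finite difference on a single weight
h = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += h
W_minus[1, 2] -= h
numeric = (softmax_loss_vectorized(W_plus, X, y, 0.1)[0]
           - softmax_loss_vectorized(W_minus, X, y, 0.1)[0]) / (2 * h)
print(abs(numeric - dW[1, 2]))   # should be very small
```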
A supplement to this answer with a small example.
I came across this post and still was not 100% clear how to arrive at the partial derivatives.
For that reason I took another approach to get to the same results - maybe it is helpful to others too.
So I've the following numpy arrays.
X validation set, X_val: (47151, 32, 32, 1)
y validation set (labels), y_val_dummy: (47151, 5, 10)
y validation prediction set, y_pred: (47151, 5, 10)
When I run the code, it seems to take forever. Can someone suggest why? I believe it's a code efficiency problem. I can't seem to complete the process.
y_pred_array = model.predict(X_val)
correct_preds = 0
# Iterate over sample dimension
for i in range(X_val.shape[0]):
    pred_list_i = [y_pred_array[i] for y_pred in y_pred_array]
    val_list_i = [y_val_dummy[i] for y_val in y_val_dummy]
    matching_preds = [pred.argmax(-1) == val.argmax(-1) for pred, val in zip(pred_list_i, val_list_i)]
    correct_preds += int(np.all(matching_preds))
total_acc = correct_preds / float(X_val.shape[0])
Your main problem is that you're generating a massive number of very large lists for no real reason.
for i in range(X_val.shape[0]):
    # this line generates a 47151 x 5 x 10 array every time
    pred_list_i = [y_pred_array[i] for y_pred in y_pred_array]
What's happening is that iterating over an n-dimensional numpy array iterates over the slowest-varying index (i.e. the leftmost), so every list comprehension is operating on 47K entries.
Marginally better would be
for i in range(X_val.shape[0]):
    pred_list_i = [y_pred for y_pred in y_pred_array[i]]
    val_list_i = [y_val for y_val in y_val_dummy[i]]
    matching_preds = [pred.argmax(-1) == val.argmax(-1) for pred, val in zip(pred_list_i, val_list_i)]
    correct_preds += int(np.all(matching_preds))
But you're still copying a lot of arrays for no real purpose. The following code should do the same, without the useless copying.
correct_preds = 0.0
for pred, val in zip(y_pred_array, y_val_dummy):
    correct_preds += all(p.argmax(-1) == v.argmax(-1)
                         for p, v in zip(pred, val))
total_accuracy = correct_preds / X_val.shape[0]
This assumes that your criterion for a correct prediction is accurate.
You can probably avoid the explicit loop entirely with a couple of calls to np.argmax, but you'll have to work that out on your own.
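For reference, one possible loop-free version is sketched below. It assumes both arrays have shape (n_samples, 5, 10), as stated in the question, and that a sample counts as correct only when all 5 positions match. The function name `exact_match_accuracy` and the small demo arrays are made up for illustration.

```python
import numpy as np

def exact_match_accuracy(y_pred, y_true):
    """Fraction of samples whose argmax matches at every position."""
    pred_labels = y_pred.argmax(axis=-1)      # (n_samples, 5)
    true_labels = y_true.argmax(axis=-1)      # (n_samples, 5)
    return np.all(pred_labels == true_labels, axis=1).mean()

# Small demo: 8 samples, 5 positions, 10 classes
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=(8, 5))
y_val_dummy = np.eye(10)[labels]              # one-hot targets, shape (8, 5, 10)
y_pred = y_val_dummy.copy()
# Corrupt the first position of the first 3 samples
y_pred[:3, 0] = np.roll(y_pred[:3, 0], 1, axis=-1)

print(exact_match_accuracy(y_pred, y_val_dummy))   # 0.625
```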
I'm implementing logistic regression based on the Coursera documentation, both in python and Octave.
In Octave, I managed to do it and achieve the right training accuracy, but in python, since I don't have access to fminunc, I cannot figure out a work around.
Currently, this is my code:
import numpy as np
import pandas

df = pandas.read_csv('ex2data2.txt', header=None)
df.columns = ['x1', 'x2', 'y']
y = df[df.columns[-1]].to_numpy()
m = len(y)
y = y.reshape(m, 1)
X = df[df.columns[:-1]]
X = X.to_numpy()

from sklearn.preprocessing import PolynomialFeatures
feature_mapper = PolynomialFeatures(degree=6)
X = feature_mapper.fit_transform(X)

def sigmoid(z):
    return 1/(1+np.power(np.e, -z))

def cost_function_reg(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    reg = (_lambda / (2.0*m))*shifted_theta.T.dot(shifted_theta)
    J = ((1.0/m)*(-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1-h)))) + reg
    return J

def gradient(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    gradR = _lambda*shifted_theta
    gradR.shape = (gradR.shape[0], 1)
    grad = (1.0/m)*(X.T.dot(h - y) + gradR)
    return grad.flatten()

from scipy.optimize import *
theta = fmin_ncg(cost_function_reg, initial_theta, fprime=gradient)
predictions = predict(theta, X)
accuracy = np.mean(np.double(predictions == y)) * 100
print('Train Accuracy: %.2f' % accuracy)
The output is:
Warning: Desired error not necessarily achieved due to precision loss.
         Current function value: 0.693147
         Iterations: 0
         Function evaluations: 22
         Gradient evaluations: 12
         Hessian evaluations: 0
Train Accuracy: 50.85
In octave, the accuracy is: 83.05.
Any help is appreciated.
There were two problems with that implementation:
The first is that fmin_ncg is not ideal for this minimization. I had used it on the previous exercise, but here it failed to find theta with that gradient function, which is identical to the one in Octave.
Switching to
theta = fmin_bfgs(cost_function_reg, initial_theta)
fixed that issue.
The second issue was that the accuracy was being miscalculated.
Once I optimized with fmin_bfgs and achieved a cost that matched the Octave result (0.529), the operands of (predictions == y) had different shapes ((118, 118) and (118,1)), yielding an M x M matrix instead of a vector.
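One common way to end up with an M x M comparison result is broadcasting a 1-D array against a column vector. The toy arrays below are an assumption for illustration (they are not the data from the exercise), but they show the effect and one way to avoid it by flattening both sides before comparing.

```python
import numpy as np

y = np.array([[1], [0], [1], [1]])       # shape (4, 1), like y after reshape(m, 1)
predictions = np.array([1, 0, 0, 1])     # shape (4,), e.g. computed from a 1-D theta

wrong = predictions == y                 # broadcasts to shape (4, 4)
right = predictions.ravel() == y.ravel() # element-wise, shape (4,)

print(wrong.shape, right.shape)          # (4, 4) (4,)
print(np.mean(right) * 100)              # 75.0
```

Taking the mean of the broadcast matrix averages over all M x M pairings rather than over the M samples, which is why the accuracy looked wrong.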