I'm implementing logistic regression based on the Coursera documentation, both in Python and Octave.
In Octave, I managed to do it and achieve the right training accuracy, but in Python, since I don't have access to fminunc, I cannot figure out a workaround.
Currently, this is my code:
df = pandas.DataFrame.from_csv('ex2data2.txt', header=None, index_col=None)
df.columns = ['x1', 'x2', 'y']
y = df[df.columns[-1]].as_matrix()
m = len(y)
y = y.reshape(m, 1)
X = df[df.columns[:-1]]
X = X.as_matrix()
from sklearn.preprocessing import PolynomialFeatures
feature_mapper = PolynomialFeatures(degree=6)
X = feature_mapper.fit_transform(X)
def sigmoid(z):
    return 1/(1+np.power(np.e, z))
def cost_function_reg(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    reg = (_lambda / (2.0*m))* shifted_theta.T.dot(shifted_theta)
    J = ((1.0/m)*(-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1-h)))) + reg
    return J
def gradient(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    gradR = _lambda*shifted_theta
    gradR.shape = (gradR.shape[0], 1)
    grad = (1.0/m)*(X.T.dot(h-y)+gradR)
    return grad.flatten()
from scipy.optimize import *
theta = fmin_ncg(cost_f, initial_theta, fprime=gradient)
predictions = predict(theta, X)
accuracy = np.mean(np.double(predictions == y)) * 100
print 'Train Accuracy: %.2f' % accuracy
The output is:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 0.693147
Iterations: 0
Function evaluations: 22
Gradient evaluations: 12
Hessian evaluations: 0
Train Accuracy: 50.85
In octave, the accuracy is: 83.05.
Any help is appreciated.
There were two problems with that implementation:
First, fmin_ncg is not ideal for this minimization. I had used it in the previous exercise, but here it failed to find theta with that gradient function, which is identical to the one in Octave.
Switching to
theta = fmin_bfgs(cost_function_reg, initial_theta)
fixed that issue.
The second issue was that the accuracy was being miscalculated.
Once I optimized with fmin_bfgs and achieved a cost that matched the Octave result (0.529), the two operands of (predictions == y) had different shapes ((118, 118) and (118,1)), yielding an MxM matrix instead of a vector.
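The shape problem described above can be reproduced on a tiny example. This is only a sketch: the three-element arrays are made up for illustration, and it assumes a 1-D predictions vector compared against a column vector, which is one common way to end up with an MxM result.

```python
import numpy as np

# Hypothetical toy data: 1-D predictions, column-vector labels.
predictions = np.array([1, 0, 1])      # shape (3,)
y = np.array([[1], [0], [0]])          # shape (3, 1)

# Broadcasting compares every prediction with every label: shape (3, 3).
wrong = predictions == y

# Flattening y first gives the intended element-wise comparison: shape (3,).
right = predictions == y.ravel()

accuracy = np.mean(right) * 100
print(wrong.shape, right.shape, accuracy)
```

Checking `.shape` on both operands before comparing them is a cheap way to catch this kind of silent broadcasting.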
I am currently working on creating a multi class classifier using numpy and finally got a working model using softmax as follows:
class MultinomialLogReg:
    def fit(self, X, y, lr=0.00001, epochs=1000):
        self.X = self.norm_x(np.insert(X, 0, 1, axis=1))
        self.y = y
        self.classes = np.unique(y)
        self.theta = np.zeros((len(self.classes), self.X.shape[1]))
        self.o_h_y = self.one_hot(y)
        for e in range(epochs):
            preds = self.probs(self.X)
            l, grad = self.get_loss(self.theta, self.X, self.o_h_y, preds)
            if e%10000 == 0:
                print("epoch: ", e, "loss: ", l)
            self.theta -= (lr*grad)
        return self
    def norm_x(self, X):
        for i in range(X.shape[0]):
            mn = np.amin(X[i])
            mx = np.amax(X[i])
            X[i] = (X[i] - mn)/(mx-mn)
        return X
    def one_hot(self, y):
        Y = np.zeros((y.shape[0], len(self.classes)))
        for i in range(Y.shape[0]):
            to_put = [0]*len(self.classes)
            to_put[y[i]] = 1
            Y[i] = to_put
        return Y
    def probs(self, X):
        return self.softmax(np.dot(X, self.theta.T))
    def get_loss(self, w,x,y,preds):
        m = x.shape[0]
        loss = (-1 / m) * np.sum(y * np.log(preds) + (1-y) * np.log(1-preds))
        grad = (1 / m) * (np.dot((preds - y).T, x)) #And compute the gradient for that loss
        return loss,grad
    def softmax(self, z):
        return np.exp(z) / np.sum(np.exp(z), axis=1).reshape(-1,1)
    def predict(self, X):
        X = np.insert(X, 0, 1, axis=1)
        return np.argmax(self.probs(X), axis=1)
        #return np.vectorize(lambda i: self.classes[i])(np.argmax(self.probs(X), axis=1))
    def score(self, X, y):
        return np.mean(self.predict(X) == y)
I have several questions:
Is this a correct multinomial logistic regression implementation?
It takes 100,000 epochs with a learning rate of 0.1 for the loss to fall to between 0.5 and 1 and to get an accuracy of 70 - 90 % on the test set. Would this be considered bad performance?
What are some ways of improving performance or speeding up training (so that fewer epochs are needed)?
I saw this cost function online, which gives better accuracy. It looks like cross-entropy, but it differs from the cross-entropy optimization equations I have seen. Can someone explain how the two differ:
error = preds - self.o_h_y
grad = np.dot(error.T, self.X)
self.theta -= (lr*grad)
This looks right, but I think the preprocessing you perform in the fit function should be done outside of the model.
It's hard to know whether this is good or bad. While the loss landscape is convex, the time it takes to reach a minimum varies between problems. One way to check that you've reached the optimal solution is to add a threshold on the gradient norm, which is small when you're close to the optimum. Something like np.linalg.norm(grad) < 1e-8.
You can use a better optimizer, such as Newton's method, or a quasi-Newton method, such as LBFGS. I would start with Newton's method as it's easier to implement. LBFGS is a non-trivial algorithm that approximates the Hessian required to perform Newton's method.
It's the same; the gradients just aren't being averaged. Since you're performing gradient descent, the averaging is a constant factor that can be ignored, because a properly tuned learning rate is required anyway. In general, I think averaging makes it a bit easier to find a learning rate that stays stable across different splits of the same dataset.
A question for you: when you evaluate your test set, are you preprocessing it the same way you preprocess the training set in your fit function?
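The claim that averaging only rescales the learning rate can be checked numerically. The sketch below uses a binary logistic model for brevity rather than the softmax model from the question; the data, the `run` helper, and the learning rates are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data.
rng = np.random.default_rng(1)
m = 50
X = rng.normal(size=(m, 3))
y = rng.integers(0, 2, size=(m, 1)).astype(float)

def run(lr, average, steps=100):
    """Plain gradient descent with either a summed or an averaged gradient."""
    theta = np.zeros((3, 1))
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ theta) - y)   # summed over all samples
        if average:
            grad = grad / m                     # averaged over all samples
        theta = theta - lr * grad
    return theta

theta_avg = run(lr=0.1, average=True)
theta_sum = run(lr=0.1 / m, average=False)      # same step, rescaled learning rate
print(np.allclose(theta_avg, theta_sum))
```

With the summed gradient, the learning rate that works depends on the number of samples, which is one reason the averaged form tends to be easier to tune.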
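The two suggestions above, a gradient-norm stopping test and Newton's method, can be sketched together. This is a minimal illustration for the binary case only, not the multinomial model from the question; the synthetic data, `true_theta`, and the function name `newton_logreg` are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logreg(X, y, tol=1e-8, max_iter=25):
    """Newton's method for unregularized binary logistic regression."""
    theta = np.zeros(X.shape[1])
    for it in range(1, max_iter + 1):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) / len(y)
        # Stop once the gradient norm is small, i.e. we are near the optimum.
        if np.linalg.norm(grad) < tol:
            break
        # Hessian of the averaged cross-entropy loss: X^T diag(p(1-p)) X / m.
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X / len(y)
        theta = theta - np.linalg.solve(hessian, grad)
    return theta, it

# Hypothetical, non-separable toy data with an intercept column.
rng = np.random.default_rng(0)
X = np.insert(rng.normal(size=(200, 2)), 0, 1.0, axis=1)
true_theta = np.array([0.3, 1.0, -1.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)

theta, n_iter = newton_logreg(X, y)
print(theta, n_iter)
```

On well-behaved data like this, Newton's method typically needs only a handful of iterations, compared with the many thousands of epochs reported for plain gradient descent. On perfectly separable data the Hessian becomes ill-conditioned, so some regularization would be needed.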
I have been trying to replicate the results of sklearn's logistic regression model. There seems to be a minor difference in the cost, so I think I have made a mistake somewhere.
Here is the code using sklearn:
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.30)
classifier = sklearn.linear_model.LogisticRegression(max_iter = 10000)
classifier.fit(X_train,Y_train)
function = np.dot(X_train, classifier.coef_.reshape(-1,1)) + classifier.intercept_ - Y_train.reshape(-1,1)
activation = 1 / (1 + np.exp(-function))
cost = np.sum(-(Y_train.reshape(-1,1) * np.log(activation) + (1-Y_train.reshape(-1,1)) * np.log(1-activation)))/Y_train.shape[0]
print(cost)
cost = 0.09494712120076532
My attempt at replicating the logistic regression model manually:
## My Logistic Model:
## Preprocessing
ones_train = np.ones((X_train.shape[0],1)) # Vector of ones for train X
ones_test = np.ones((X_test.shape[0],1)) # Vector of ones for test X
X_train_rev = np.concatenate((X_train,ones_train),axis = 1) # Append Intercept column in the training X data
X_test_rev = np.concatenate((X_test,ones_test),axis = 1) # Append Intercept column in the test X data
Y_train = Y_train.reshape(-1,1) # Reshape Y_train
Y_test = Y_test.reshape(-1,1) # Reshape Y_test
m = X_train_rev.shape[0]
n = X_train_rev.shape[1]
# Parameterization
alpha = 0.001 # Learning_Rate
# Initialization
coefficient = np.random.randn(1,n) # Initialisation of coefficients including intercept
# Update Gradients
for i in range(10000):
    Z = np.dot(X_train_rev, coefficient.reshape(-1,1)) - Y_train.reshape(-1,1) # Compute Z
    A = 1 / (1 + np.exp(-Z)) # Compute A
    cost = np.sum(-(Y_train.reshape(-1,1) * np.log(A) + (1-Y_train.reshape(-1,1)) * np.log(1-A)))/X_train_rev.shape[0] # Compute cost
    if i % 1000 == 0: print(cost) # Print cost
    grad = np.dot((A - Y_train).T,X_train_rev) # Compute Gradient
    coefficient = coefficient - (alpha * grad) # adjust coefficients including intercept
Cost after every 1000th iteration:
1.4916689930810232
0.0875458497191988
0.0875181643157349
0.08751717663190926
0.08751704000862144
0.08751701549904194
0.08751701099400619
0.08751701016449238
0.08751701001176625
0.08751700998365015
There is a difference of approximately 0.01 between sklearn and my manual implementation. I guess this counts as a big difference in ML. I re-ran the code multiple times, and the cost from the manual implementation is consistently about 0.01 lower than the one sklearn gives.
Here are the coefficients learnt by the two models. They are quite different:
Thank you all
I recently found gpytorch (https://github.com/cornellius-gp/gpytorch). It seems to be a great package for integrating GPR into PyTorch. First tests were also positive. With gpytorch, GPU power as well as intelligent algorithms can be used to improve performance compared to other packages such as scikit-learn.
However, I found that it is much harder to estimate the hyperparameters that are needed. In scikit-learn that happens in the background and is very robust. I would like to get some feedback from the community about the reasons, and to discuss whether there might be a better way to estimate these parameters than the one provided by the example in the gpytorch documentation.
For comparison, I took the code of an example provided on the official gpytorch page (https://github.com/cornellius-gp/gpytorch/blob/master/examples/03_Multitask_GP_Regression/Multitask_GP_Regression.ipynb) and modified it in two places:
I use a different kernel (gpytorch.kernels.MaternKernel(nu=2.5) instead of gpytorch.kernels.RBFKernel())
I use a different output function
In the following, I first provide the code using gpytorch. Subsequently, I provide the code for scikit-learn. Finally, I compare the results.
Importing (for gpytorch and scikit-learn):
import math
import torch
import numpy as np
import gpytorch
Generating data (for gpytorch and scikit-learn):
n = 20
train_x = torch.zeros(pow(n, 2), 2)
for i in range(n):
    for j in range(n):
        # Each coordinate varies from 0 to 1 in n steps
        train_x[i * n + j][0] = float(i) / (n-1)
        train_x[i * n + j][1] = float(j) / (n-1)
train_y_1 = (torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01))/4
train_y_2 = torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01)
train_y = torch.stack([train_y_1, train_y_2], -1)
test_x = torch.rand((n, len(train_x.shape)))
test_y_1 = (torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01))/4
test_y_2 = torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01)
test_y = torch.stack([test_y_1, test_y_2], -1)
Now comes the estimation as described in the provided example from the cited documentation:
torch.manual_seed(2) # For a more robust comparison
class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=2
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.MaternKernel(nu=2.5), num_tasks=2, rank=1
        )
    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = MultitaskGPModel(train_x, train_y, likelihood)
# Find optimal model hyperparameters
model.train()
likelihood.train()
# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()}, # Includes GaussianLikelihood parameters
], lr=0.1)
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
n_iter = 50
for i in range(n_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    # print('Iter %d/%d - Loss: %.3f' % (i + 1, n_iter, loss.item()))
    optimizer.step()
# Set into eval mode
model.eval()
likelihood.eval()
# Make predictions
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()
test_results_gpytorch = np.median((test_y - mean) / test_y, axis=0)
In the following, I provide the code for scikit-learn, which is a little more convenient:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, Matern
kernel = 1.0 * Matern(length_scale=0.1, length_scale_bounds=(1e-5, 1e5), nu=2.5) \
+ WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0).fit(train_x.numpy(),
train_y.numpy())
# x_interpolation = test_x.detach().numpy()[np.newaxis, :].transpose()
y_mean_interpol, y_std_norm = gp.predict(test_x.numpy(), return_std=True)
test_results_scitlearn = np.median((test_y.numpy() - y_mean_interpol) / test_y.numpy(), axis=0)
Finally I compare the results:
comparisson = (test_results_scitlearn - test_results_gpytorch)/test_results_scitlearn
print('Variable 1: scitkit learn is more accurate my factor: ' + str(abs(comparisson[0])))
print('Variable 2: scitkit learn is more accurate my factor: ' + str(comparisson[1]))
Unfortunately, I did not find an easy way to fix the seed for scikit-learn. The last time I ran the code, it returned:
Variable 1: scitkit learn is more accurate my factor: 11.362540360431087
Variable 2: scitkit learn is more accurate my factor: 29.64760087022618
In the case of gpytorch, I assume that the optimizer gets stuck in a local optimum. But I cannot think of a more robust optimization algorithm that still uses PyTorch.
I am looking forward to suggestions!
Lazloo
(I also answered your question on the GitHub issue you created for it here)
Primarily this happened because you used different models in sklearn and gpytorch. In particular, sklearn learns independent GPs in the multi-output setting by default (see e.g., the discussion here). In GPyTorch, you used the multitask GP method introduced in Bonilla et al, 2008. Correcting for this difference yields:
test_results_gpytorch = [5.207913e-04 -8.469360e-05]
test_results_scitlearn = [3.65288816e-04 4.79017145e-05]
I have used linear regression through ML packages in Python, but for my own satisfaction I coded it from scratch. The loss starts at around 0.90 and keeps increasing (the model is not learning) for some reason. I do not understand what mistake I have made. The steps are:
Standardise the dataset as part of preprocessing
Initialise the weight matrix with the MLE estimate for parameter W, i.e., (X^TX)^-1X^TY
Compute the output
Calculate gradient of loss function SSE (Sum of Squared Error) wrt param W and bias B
Use the gradients to update the parameters using gradient descent.
import preprocess as pre
import numpy as np
import matplotlib.pyplot as plt
data = pre.load_file('airfoil_self_noise.dat')
data = pre.organise(data,"\t","\r\n")
data = pre.standardise(data,data.shape[1])
t = np.reshape(data[:,5],[-1,1])
data = data[:,:5]
N = data.shape[0]
M = 5
lr = 1e-3
# W = np.random.random([M,1])
W = np.dot(np.dot(np.linalg.inv(np.dot(data.T,data)),data.T),t)
data = data.T # Examples are arranged in columns [features,N]
b = np.random.rand()
epochs = 1000000
loss = np.zeros([epochs])
for epoch in range(epochs):
    if epoch%1000 == 0:
        lr /= 10
    # Obtain the output
    y = np.dot(W.T,data).T + b
    sse = np.dot((t-y).T,(t-y))
    loss[epoch]= sse/N
    var = sse/N
    # log likelihood
    ll = (-N/2)*(np.log(2*np.pi))-(N*np.log(np.sqrt(var)))-(sse/(2*var))
    # Gradient Descent
    W_grad = np.zeros([M,1])
    B_grad = 0
    for i in range(N):
        err = (t[i]-y[i])
        W_grad += err * np.reshape(data[:,i],[-1,1])
        B_grad += err
    W_grad /= N
    B_grad /= N
    W += lr * W_grad
    b += lr * B_grad
    print("Epoch: %d, Loss: %.3f, Log-Likelihood: %.3f"%(epoch,loss[epoch],ll))
plt.figure()
plt.plot(range(epochs),loss,'-r')
plt.show()
If you run the above code, you will probably not notice anything wrong, because I am doing W += lr * W_grad instead of W -= lr * W_grad. I would like to know why this works, since the gradient descent formula says to subtract the gradient from the old weight matrix. The error constantly increases when I subtract. What am I missing?
Found it. The problem was that I took the gradient of the loss function from a slide, and that expression was not the gradient itself but its negative, so it already pointed in the direction of steepest descent. Subtracting it from the weights therefore moved them in the direction of greatest increase, which gave rise to what I observed.
To clarify, I worked out the partial derivative of the loss function myself and got this:
W_grad += data[:,i].reshape([-1,1])*(y[i]-t[i]).reshape([])
This points in the direction of greatest increase; when I multiply it by -lr, the step points in the direction of steepest descent, and training started working properly.
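The sign convention can be made explicit with a vectorized version of the update. This is a sketch under assumptions, not the original script: it uses synthetic data in place of airfoil_self_noise.dat, stores examples in rows, and starts from zero weights instead of the closed-form estimate.

```python
import numpy as np

# Hypothetical synthetic regression problem (stands in for the real dataset).
rng = np.random.default_rng(0)
N, M = 200, 5
X = rng.normal(size=(N, M))                     # examples in rows
true_W = rng.normal(size=(M, 1))
t = X @ true_W + 0.7 + 0.01 * rng.normal(size=(N, 1))

W = np.zeros((M, 1))
b = 0.0
lr = 0.1
losses = []
for epoch in range(500):
    y = X @ W + b
    err = y - t                                 # prediction minus target
    losses.append(float(np.mean(err ** 2)))
    # Gradient of the mean squared error: points towards greatest increase.
    W_grad = 2.0 * X.T @ err / N
    b_grad = 2.0 * float(err.mean())
    # Subtracting the gradient therefore moves downhill.
    W -= lr * W_grad
    b -= lr * b_grad

print(losses[0], losses[-1])
```

If `err` is defined as `t - y` instead, the same expression yields the negative gradient, and the update has to use `+=`, which is the situation described in the question.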
I'm trying to implement gradient descent (GD, not the stochastic variant) for logistic regression in Python 3.x, and I am having some trouble.
Logistic regression is defined as follows (1):
[image: logistic regression formula]
Formulas for gradients are defined as follows (2):
[image: gradient descent for logistic regression]
Description of data:
X is an (Nx2) matrix of objects (positive and negative float numbers)
y is an (Nx1) vector of class labels (-1 or +1)
Task:
Implement gradient descent 1) with L2 regularization and 2) without regularization. Desired results: the weight vectors.
Parameters: regularization rate C=10 for regularized regression and C=0 for unregularized regression; gradient step k=0.1; max. number of iterations = 10000; tolerance = 1e-5.
Note: GD has converged when the distance between the weight vectors from the current and previous steps is less than the tolerance (1e-5).
Here is my implementation:
k - gradient step;
C - regularization rate.
import numpy as np
def sigmoid(z):
    result = 1./(1. + np.exp(-z))
    return result
def distance(vector1, vector2):
    vector1 = np.array(vector1, dtype='f')
    vector2 = np.array(vector2, dtype='f')
    return np.linalg.norm(vector1-vector2)
def GD(X, y, C, k=0.1, tolerance=1e-5, max_iter=10000):
    X = np.matrix(X)
    y = np.matrix(y)
    l=len(X)
    w1, w2 = 0., 0. # weights (look formula (2) in the beginning of question)
    difference = 1.
    iteration = 1
    while(difference > tolerance):
        hypothesis = y*(X*np.matrix([w1, w2]).T)
        w1_updated = w1 + (k/l)*np.sum(y*X[:,0]*(1.-(sigmoid(hypothesis)))) - k*C*w1
        w2_updated = w2 + (k/l)*np.sum(y*X[:,1]*(1.-(sigmoid(hypothesis)))) - k*C*w2
        difference = distance([w1, w2], [w1_updated, w2_updated])
        w1, w2 = w1_updated, w2_updated
        if(iteration >= max_iter):
            break
        iteration = iteration + 1
    return [w1_updated, w2_updated] #vector of weights
Respectively:
# call for UNregularized GD: C=0
w = GD(X, y, C=0., k=0.1)
and
# call for regularized GD: C=10
w_reg = GD(X, y, C=10., k=0.1)
Here are the results (weight vectors):
# UNregularized GD
[0.035736331265589463, 0.032464572442830832]
# regularized GD
[5.0979561973044096e-06, 4.6312243707352652e-06]
However, it should be (right answers for self-control):
# UNregularized GD
[0.28801877, 0.09179177]
# regularized GD
[0.02855938, 0.02478083]
Can you please tell me what is going wrong here? I have been stuck on this problem for three days in a row and still have no idea.
Thank you in advance.
First of all, the sigmoid function should be
def sigmoid(Z):
A=1/(1+np.exp(-Z))
return A
Try running it again with this formula. Also, what is l in your code?
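One thing worth checking, offered as a possibility rather than a confirmed diagnosis: `*` between np.matrix objects is matrix multiplication, so products such as y*X[:,0] in the question's code are summed over all samples before the sigmoid is applied. The sketch below applies the same update rule element-wise with plain arrays. The function name `gd_logistic` and the synthetic data are made up for illustration, so the printed weights are not the reference answers from the question.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gd_logistic(X, y, C=0.0, k=0.1, tolerance=1e-5, max_iter=10000):
    """Gradient descent for labels in {-1, +1}, using element-wise products."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float).ravel()
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        margin = y * (X @ w)                    # one margin per sample
        # Same update as in the question: data term minus k*C*w.
        step = (k / n) * (X.T @ (y * (1.0 - sigmoid(margin)))) - k * C * w
        w_new = w + step
        converged = np.linalg.norm(w_new - w) < tolerance
        w = w_new
        if converged:
            break
    return w

# Hypothetical, non-separable toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = np.where(X @ np.array([1.0, 0.5]) + rng.normal(size=300) > 0, 1.0, -1.0)

w = gd_logistic(X, y, C=0.0)
w_reg = gd_logistic(X, y, C=10.0)
print(w, w_reg)
```

Using ndarrays with `@` for matrix products and `*` for element-wise products keeps the two operations visibly distinct, which avoids this class of bug.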