Why does gpytorch seem to be less accurate than scikit-learn? - python

I recently found gpytorch (https://github.com/cornellius-gp/gpytorch). It seems to be a great package for integrating GPR into PyTorch. First tests were also positive. Using gpytorch, the GPU power as well as intelligent algorithms can be used to improve performance in comparison to other packages such as scikit-learn.
However, I found that it is much harder to estimate the hyperparameters that are needed. In scikit-learn, that happens in the background and is very robust. I would like to get some feedback from the community about the reasons, and to discuss whether there might be a better way to estimate these parameters than the one provided by the example in the gpytorch documentation.
For comparison, I took the code of an example provided on the official gpytorch page (https://github.com/cornellius-gp/gpytorch/blob/master/examples/03_Multitask_GP_Regression/Multitask_GP_Regression.ipynb) and modified it in two ways:
I use a different kernel (gpytorch.kernels.MaternKernel(nu=2.5) instead of gpytorch.kernels.RBFKernel())
I used a different output function
In the following, I first provide the code using gpytorch. Subsequently, I provide the code for scikit-learn. Finally, I compare the results.
Importing (for gpytorch and scikit-learn):
import math
import torch
import numpy as np
import gpytorch
Generating data (for gpytorch and scikit-learn):
n = 20
train_x = torch.zeros(pow(n, 2), 2)
for i in range(n):
    for j in range(n):
        # Each coordinate varies from 0 to 1 in n steps
        train_x[i * n + j][0] = float(i) / (n-1)
        train_x[i * n + j][1] = float(j) / (n-1)
train_y_1 = (torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01))/4
train_y_2 = torch.sin(train_x[:, 0]) + torch.cos(train_x[:, 1]) * (2 * math.pi) + torch.randn_like(train_x[:, 0]).mul(0.01)
train_y = torch.stack([train_y_1, train_y_2], -1)
test_x = torch.rand((n, len(train_x.shape)))
test_y_1 = (torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01))/4
test_y_2 = torch.sin(test_x[:, 0]) + torch.cos(test_x[:, 1]) * (2 * math.pi) + torch.randn_like(test_x[:, 0]).mul(0.01)
test_y = torch.stack([test_y_1, test_y_2], -1)
Now comes the estimation as described in the provided example from the cited documentation:
torch.manual_seed(2) # For a more robust comparison
class MultitaskGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super(MultitaskGPModel, self).__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.MultitaskMean(
            gpytorch.means.ConstantMean(), num_tasks=2
        )
        self.covar_module = gpytorch.kernels.MultitaskKernel(
            gpytorch.kernels.MaternKernel(nu=2.5), num_tasks=2, rank=1
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        return gpytorch.distributions.MultitaskMultivariateNormal(mean_x, covar_x)
likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = MultitaskGPModel(train_x, train_y, likelihood)
# Find optimal model hyperparameters
model.train()
likelihood.train()
# Use the adam optimizer
optimizer = torch.optim.Adam([
    {'params': model.parameters()},  # Includes GaussianLikelihood parameters
], lr=0.1)
# "Loss" for GPs - the marginal log likelihood
mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)
n_iter = 50
for i in range(n_iter):
    optimizer.zero_grad()
    output = model(train_x)
    loss = -mll(output, train_y)
    loss.backward()
    # print('Iter %d/%d - Loss: %.3f' % (i + 1, n_iter, loss.item()))
    optimizer.step()
# Set into eval mode
model.eval()
likelihood.eval()
# Make predictions
with torch.no_grad(), gpytorch.settings.fast_pred_var():
    predictions = likelihood(model(test_x))
    mean = predictions.mean
    lower, upper = predictions.confidence_region()
test_results_gpytorch = np.median((test_y - mean) / test_y, axis=0)
In the following, I provide the code for scikit-learn, which is a little bit more convenient ^^:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, Matern
kernel = 1.0 * Matern(length_scale=0.1, length_scale_bounds=(1e-5, 1e5), nu=2.5) \
    + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, alpha=0.0).fit(train_x.numpy(),
                                                            train_y.numpy())
# x_interpolation = test_x.detach().numpy()[np.newaxis, :].transpose()
y_mean_interpol, y_std_norm = gp.predict(test_x.numpy(), return_std=True)
test_results_scitlearn = np.median((test_y.numpy() - y_mean_interpol) / test_y.numpy(), axis=0)
Finally I compare the results:
comparisson = (test_results_scitlearn - test_results_gpytorch)/test_results_scitlearn
print('Variable 1: scikit-learn is more accurate by a factor of: ' + str(abs(comparisson[0])))
print('Variable 2: scikit-learn is more accurate by a factor of: ' + str(comparisson[1]))
Unfortunately, I did not find an easy way to fix the seed for scikit-learn. The last time I ran the code, it returned:
Variable 1: scikit-learn is more accurate by a factor of: 11.362540360431087
Variable 2: scikit-learn is more accurate by a factor of: 29.64760087022618
In the case of gpytorch, I assume that the optimizer gets stuck in a local optimum. But I cannot think of a more robust optimization algorithm that still uses PyTorch.
I am looking forward to suggestions!
Lazloo

(I also answered your question on the GitHub issue you created for it here.)
Primarily, this happened because you used different models in sklearn and gpytorch. In particular, sklearn learns independent GPs in the multi-output setting by default (see, e.g., the discussion here). In GPyTorch, you used the multitask GP method introduced in Bonilla et al., 2008. Correcting for this difference yields:
test_results_gpytorch = [5.207913e-04 -8.469360e-05]
test_results_scitlearn = [3.65288816e-04 4.79017145e-05]
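For reference, one way to get independent per-task GPs in GPyTorch (closer to sklearn's default behaviour) is a batch-independent model. Below is a minimal sketch along the lines of GPyTorch's batch-independent multitask example, reusing the training loop from the question; the class name is just for illustration:

import torch
import gpytorch

class BatchIndependentGPModel(gpytorch.models.ExactGP):
    def __init__(self, train_x, train_y, likelihood):
        super().__init__(train_x, train_y, likelihood)
        # One independent GP per task via a batch dimension of size 2
        self.mean_module = gpytorch.means.ConstantMean(batch_shape=torch.Size([2]))
        self.covar_module = gpytorch.kernels.ScaleKernel(
            gpytorch.kernels.MaternKernel(nu=2.5, batch_shape=torch.Size([2])),
            batch_shape=torch.Size([2])
        )

    def forward(self, x):
        mean_x = self.mean_module(x)
        covar_x = self.covar_module(x)
        # Pack the batch of independent GPs into one multitask distribution
        return gpytorch.distributions.MultitaskMultivariateNormal.from_batch_mvn(
            gpytorch.distributions.MultivariateNormal(mean_x, covar_x)
        )

likelihood = gpytorch.likelihoods.MultitaskGaussianLikelihood(num_tasks=2)
model = BatchIndependentGPModel(train_x, train_y, likelihood)
# Train exactly as in the question (Adam + ExactMarginalLogLikelihood)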

Related

How does a gradient backpropagate through random samples?

I'm learning about policy gradients and I'm having a hard time understanding how the gradient passes through a random operation. From here: It is not possible to directly backpropagate through random samples. However, there are two main methods for creating surrogate functions that can be backpropagated through.
They have an example of the score function:
probs = policy_network(state)
# Note that this is equivalent to what used to be called multinomial
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()
I tried to create an example of this:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.distributions import Normal
import matplotlib.pyplot as plt
from tqdm import tqdm
softplus = torch.nn.Softplus()
class Model_RL(nn.Module):
    def __init__(self):
        super(Model_RL, self).__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 30)
        self.fc3 = nn.Linear(30, 2)

    def forward(self, x):
        x1 = self.fc1(x)
        x = torch.relu(x1)
        x2 = self.fc2(x)
        x = torch.relu(x2)
        x3 = softplus(self.fc3(x))
        return x3, x2, x1
# basic
net_RL = Model_RL()
features = torch.tensor([1.0])
x = torch.tensor([1.0])
y = torch.tensor(3.0)
baseline = 0
baseline_lr = 0.1
epochs = 3
opt_RL = optim.Adam(net_RL.parameters(), lr=1e-3)
losses = []
xs = []
for _ in tqdm(range(epochs)):
    out_RL = net_RL(x)
    mu, std = out_RL[0]
    dist = Normal(mu, std)
    print(dist)
    a = dist.sample()
    log_p = dist.log_prob(a)
    out = features * a
    reward = -torch.square((y - out))
    baseline = (1-baseline_lr)*baseline + baseline_lr*reward
    loss = -(reward-baseline)*log_p
    opt_RL.zero_grad()
    loss.backward()
    opt_RL.step()
    losses.append(loss.item())
This seems to work fine, which again I don't understand: how does the gradient pass through when they state that it can't pass through the random operation (and yet somehow it does)?
Now, since the gradient can't flow through the random operation, I tried to replace
mu, std = out_RL[0] with mu, std = out_RL[0].detach(), and that caused the error:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn. If the gradient doesn't pass through the random operation, I don't understand why detaching a tensor before that operation would matter.
It is indeed true that sampling is not a differentiable operation per se. However, there exist two (broad) ways to mitigate this - [1] the REINFORCE way and [2] the reparameterization way. Since your example is related to [1], I will restrict my answer to REINFORCE.
What REINFORCE does is remove the sampling operation from the computation graph entirely. The sampling still happens, but it remains outside the graph. So, your statement
.. how does the gradient passes through a random operation ..
isn't correct. The gradient does not pass through any random operation. Let's look at your example:
mu, std = out_RL[0]
dist = Normal(mu, std)
a = dist.sample()
log_p = dist.log_prob(a)
Computation of a does not involve creating a computation graph. It is technically equivalent to plugging in some offline data from a dataset (as in supervised learning)
mu, std = out_RL[0]
dist = Normal(mu, std)
# a = dist.sample()
a = torch.tensor([1.23, 4.01, -1.2, ...], device='cuda')
log_p = dist.log_prob(a)
Since we don't have offline data beforehand, we create it on the fly, and that is all the .sample() method does.
So, there is no random operation in the graph. log_p depends on mu and std deterministically, just like in any standard computation graph. If you cut the connection like this
mu, std = out_RL[0].detach()
.. of course it is going to complain.
Also, do not get confused by this operation
dist = Normal(mu, std)
log_p = dist.log_prob(a)
as it does not contain any randomness by itself. It is merely a shortcut for writing out the tedious log-likelihood formula of the Normal distribution.
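As a quick sanity check, log_prob can be compared against the closed-form Normal log-density; a minimal sketch:

import math
import torch
from torch.distributions import Normal

mu, std = torch.tensor(0.5), torch.tensor(2.0)
a = torch.tensor(1.23)

dist = Normal(mu, std)
# log N(a | mu, std) written out by hand
manual = (-(a - mu) ** 2 / (2 * std ** 2)
          - torch.log(std) - 0.5 * math.log(2 * math.pi))
print(torch.allclose(dist.log_prob(a), manual))  # True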
@ayandas explained the first way very well.
The second way, the reparameterization method, is quite different.
In contrast to sample(), the reparameterized rsample() returns a sample that keeps the computation graph intact.
It does this by drawing a parameter-free random value and transforming it with the parameters of the distribution (for a Normal: mu + std * eps with eps ~ N(0, 1)), so gradients can flow back into those parameters.
Check this explanation with the simple sketch below.
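A small sketch with a scalar Normal to illustrate the difference:

import torch
from torch.distributions import Normal

mu = torch.tensor(0.0, requires_grad=True)
std = torch.tensor(1.0, requires_grad=True)
dist = Normal(mu, std)

a = dist.sample()        # detached draw: no grad_fn, gradients cannot flow back
print(a.requires_grad)   # False

b = dist.rsample()       # reparameterized draw: b = mu + std * eps, eps ~ N(0, 1)
print(b.requires_grad)   # True
(b ** 2).backward()
print(mu.grad, std.grad)  # both gradients are populated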

Improving accuracy of multinomial logistic regression model built from scratch

I am currently working on creating a multi-class classifier using numpy, and finally got a working model using softmax as follows:
class MultinomialLogReg:
    def fit(self, X, y, lr=0.00001, epochs=1000):
        self.X = self.norm_x(np.insert(X, 0, 1, axis=1))
        self.y = y
        self.classes = np.unique(y)
        self.theta = np.zeros((len(self.classes), self.X.shape[1]))
        self.o_h_y = self.one_hot(y)
        for e in range(epochs):
            preds = self.probs(self.X)
            l, grad = self.get_loss(self.theta, self.X, self.o_h_y, preds)
            if e%10000 == 0:
                print("epoch: ", e, "loss: ", l)
            self.theta -= (lr*grad)
        return self

    def norm_x(self, X):
        for i in range(X.shape[0]):
            mn = np.amin(X[i])
            mx = np.amax(X[i])
            X[i] = (X[i] - mn)/(mx-mn)
        return X

    def one_hot(self, y):
        Y = np.zeros((y.shape[0], len(self.classes)))
        for i in range(Y.shape[0]):
            to_put = [0]*len(self.classes)
            to_put[y[i]] = 1
            Y[i] = to_put
        return Y

    def probs(self, X):
        return self.softmax(np.dot(X, self.theta.T))

    def get_loss(self, w, x, y, preds):
        m = x.shape[0]
        loss = (-1 / m) * np.sum(y * np.log(preds) + (1-y) * np.log(1-preds))
        grad = (1 / m) * (np.dot((preds - y).T, x))  # And compute the gradient for that loss
        return loss, grad

    def softmax(self, z):
        return np.exp(z) / np.sum(np.exp(z), axis=1).reshape(-1,1)

    def predict(self, X):
        X = np.insert(X, 0, 1, axis=1)
        return np.argmax(self.probs(X), axis=1)
        # return np.vectorize(lambda i: self.classes[i])(np.argmax(self.probs(X), axis=1))

    def score(self, X, y):
        return np.mean(self.predict(X) == y)
And I had several questions:
Is this a correct multinomial logistic regression implementation?
It takes 100,000 epochs with a learning rate of 0.1 for the loss to get down to 1 - 0.5 and to reach an accuracy of 70 - 90 % on the test set. Would this be considered bad performance?
What are some ways of improving performance or speeding up training (so that fewer epochs are needed)?
I saw this cost function online which gives better accuracy; it looks like cross-entropy, but it is different from the cross-entropy equations I have seen. Can someone explain how the two differ:
error = preds - self.o_h_y
grad = np.dot(error.T, self.X)
self.theta -= (lr*grad)
This looks right, but I think the preprocessing you perform in the fit function should be done outside of the model.
It's hard to know whether this is good or bad. While the loss landscape is convex, the time it takes to reach a minimum varies across problems. One way to ensure you've obtained the optimal solution is to add a threshold that tests the size of the gradient norm, which is small when you're close to the optimum. Something like np.linalg.norm(grad) < 1e-8.
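For example, a minimal sketch of such a stopping criterion inside the fit loop above:

for e in range(epochs):
    preds = self.probs(self.X)
    l, grad = self.get_loss(self.theta, self.X, self.o_h_y, preds)
    self.theta -= (lr * grad)
    if np.linalg.norm(grad) < 1e-8:  # gradient is essentially zero: stop early
        break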
You can use a better optimizer, such as Newton's method, or a quasi-Newton method, such as LBFGS. I would start with Newton's method as it's easier to implement. LBFGS is a non-trivial algorithm that approximates the Hessian required to perform Newton's method.
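If you'd rather not implement either from scratch, one option is to hand the same softmax objective to scipy's L-BFGS. A minimal sketch (the names loss_and_grad, X, and Y_one_hot are placeholders; X is assumed to already contain the bias column and Y_one_hot the one-hot labels):

import numpy as np
from scipy.optimize import minimize

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(theta_flat, X, Y_one_hot):
    k, d = Y_one_hot.shape[1], X.shape[1]
    theta = theta_flat.reshape(k, d)
    preds = softmax(X @ theta.T)
    m = X.shape[0]
    loss = -np.sum(Y_one_hot * np.log(preds + 1e-12)) / m   # mean cross-entropy
    grad = (preds - Y_one_hot).T @ X / m                    # its gradient w.r.t. theta
    return loss, grad.ravel()

# Example call, with X of shape (m, d) and Y_one_hot of shape (m, k):
# k, d = Y_one_hot.shape[1], X.shape[1]
# res = minimize(loss_and_grad, np.zeros(k * d), args=(X, Y_one_hot),
#                jac=True, method='L-BFGS-B')
# theta = res.x.reshape(k, d)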
It's the same, except the gradients aren't being averaged. Since you're performing gradient descent, the 1/m averaging is just a constant factor that can be folded into a properly tuned learning rate anyway. In general, I think averaging makes it a bit easier to find a stable learning rate across different splits of the same dataset.
A question for you: when you evaluate your test set, are you preprocessing it the same way you preprocess the training set in your fit function?

How do I resolve the error resulting from this Simple Linear Regression in Python and Numpy?

So, I'm trying to make a simple neural network for linear regression using only python and numpy.
I have solved most of the original problems and it works well except that the network's error only increases.
My code:
import numpy as np
from matplotlib import pyplot as plt
class Regression:
    def __init__(self, size):
        self.W = np.random.random(size)
        self.b = np.random.random(size[1])

    def test(self, X):
        return X @ self.W + self.b

    def train(self, X, Y, epochs=50, lr=0.2):
        self.error_list = []
        for i in range(epochs):
            pred = self.test(X)
            error = (Y - pred)**2
            error_pred = -2*(Y - pred)
            pred_W = X.T
            pred_b = np.ones_like(self.b)
            error_W = pred_W @ error_pred
            error_b = np.sum(error_pred * pred_b, 0)
            self.W -= error_W * lr
            self.b -= error_b * lr
            self.error_list.append(np.mean(error))
        plt.plot(self.error_list)
        plt.title("Training Loss")
        plt.show()
if __name__ == "__main__":
    nn = Regression([2,1])
    X = np.array([[0,0],
                  [0,1],
                  [1,0],
                  [1,1]])
    Y = np.sum(X,1).reshape(-1,1)
    nn.train(X,Y,100)
    print(nn.test([[1,2],
                   [2,3]]))
This is the final output:
[[5.23598775e+18]
[7.47065723e+18]]
See plot here:
Error plot (PNG)
Do you think you can fix it?
Try tuning your model training using different learning rates. lr=0.2 is too aggressive and causes your model to diverge. For lr=0.1, the model looks to be learning OK.
Due to the large learning rate, your model is unable to find the minimum.
In your code, use an adaptive learning rate:
lr = lr - lr/epochs
Add this at the end of each iteration of your for loop, as in the sketch below.
Also, initialize your learning rate at a smaller value.
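A minimal sketch of the train method with both changes (smaller initial learning rate and decay at the end of each epoch), keeping your attribute names:

def train(self, X, Y, epochs=50, lr=0.1):        # start from a smaller learning rate
    self.error_list = []
    for i in range(epochs):
        pred = self.test(X)
        error = (Y - pred)**2
        error_pred = -2*(Y - pred)
        error_W = X.T @ error_pred               # gradient of the loss w.r.t. the weights
        error_b = np.sum(error_pred, 0)          # gradient of the loss w.r.t. the bias
        self.W -= error_W * lr
        self.b -= error_b * lr
        lr = lr - lr/epochs                      # decay the learning rate each epoch
        self.error_list.append(np.mean(error))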

PyTorch does not converge when approximating square function with linear model

I'm trying to learn some PyTorch and am referencing this discussion here
The author provides a minimum working piece of code that illustrates how you can use PyTorch to solve for an unknown linear function that has been polluted with random noise.
This code runs fine for me.
However, when I change the function such that I want t = X^2, the parameters do not seem to converge.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
# Let's make some data for a linear regression.
A = 3.1415926
b = 2.7189351
error = 0.1
N = 100 # number of data points
# Data
X = Variable(torch.randn(N, 1))
# (noisy) Target values that we want to learn.
t = X * X + Variable(torch.randn(N, 1) * error)
# Creating a model, making the optimizer, defining loss
model = nn.Linear(1, 1)
optimizer = optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
# Run training
niter = 50
for _ in range(0, niter):
    optimizer.zero_grad()
    predictions = model(X)
    loss = loss_fn(predictions, t)
    loss.backward()
    optimizer.step()
    print("-" * 50)
    print("error = {}".format(loss.data[0]))
    print("learned A = {}".format(list(model.parameters())[0].data[0, 0]))
    print("learned b = {}".format(list(model.parameters())[1].data[0]))
When I execute this code, the new A and b parameters are seemingly random, so it does not converge. I thought this should converge because you can approximate any function with a slope and an offset. My theory is that I'm using PyTorch incorrectly.
Can anyone identify a problem with my t = X * X + Variable(torch.randn(N, 1) * error) line of code?
You cannot fit a 2nd-degree polynomial with a linear function. You cannot expect more than random results (since you have random samples from the polynomial).
What you can do is use two inputs, x and x^2, and fit on them:
model = nn.Linear(2, 1) # you have 2 inputs now
X_input = torch.cat((X, X**2), dim=1) # have 2 inputs per entry
# ...
predictions = model(X_input) # 2 inputs -> 1 output
loss = loss_fn(predictions, t)
# ...
# learning t = c*x^2 + a*x + b
print("learned a = {}".format(list(model.parameters())[0].data[0, 0]))
print("learned c = {}".format(list(model.parameters())[0].data[0, 1]))
print("learned b = {}".format(list(model.parameters())[1].data[0]))

Logistic Regression with regularization in python failing to minimize

I'm implementing logistic regression based on the Coursera documentation, both in Python and Octave.
In Octave, I managed to do it and achieved the right training accuracy, but in Python, since I don't have access to fminunc, I cannot figure out a workaround.
Currently, this is my code:
df = pandas.DataFrame.from_csv('ex2data2.txt', header=None, index_col=None)
df.columns = ['x1', 'x2', 'y']
y = df[df.columns[-1]].as_matrix()
m = len(y)
y = y.reshape(m, 1)
X = df[df.columns[:-1]]
X = X.as_matrix()
from sklearn.preprocessing import PolynomialFeatures
feature_mapper = PolynomialFeatures(degree=6)
X = feature_mapper.fit_transform(X)
def sigmoid(z):
    return 1/(1+np.power(np.e, z))

def cost_function_reg(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    reg = (_lambda / (2.0*m)) * shifted_theta.T.dot(shifted_theta)
    J = ((1.0/m)*(-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1-h)))) + reg
    return J

def gradient(theta):
    _theta = theta.copy().reshape(-1, 1)
    shifted_theta = np.insert(_theta[1:], 0, 0)
    h = sigmoid(np.dot(X, _theta))
    gradR = _lambda*shifted_theta
    gradR.shape = (gradR.shape[0], 1)
    grad = (1.0/m)*(X.T.dot(h-y)+gradR)
    return grad.flatten()
from scipy.optimize import *
theta = fmin_ncg(cost_f, initial_theta, fprime=gradient)
predictions = predict(theta, X)
accuracy = np.mean(np.double(predictions == y)) * 100
print 'Train Accuracy: %.2f' % accuracy
The output is:
Warning: Desired error not necessarily achieved due to precision loss.
Current function value: 0.693147
Iterations: 0
Function evaluations: 22
Gradient evaluations: 12
Hessian evaluations: 0
Train Accuracy: 50.85
In Octave, the accuracy is 83.05.
Any help is appreciated.
There were two problems with that implementation:
The first one: fmin_ncg is not ideal for this minimization. I had used it in the previous exercise, but here it was failing to find theta with that gradient function, which is identical to the one in Octave.
Switching to
theta = fmin_bfgs(cost_function_reg, initial_theta)
fixed that issue.
The second issue was that the accuracy was being miscalculated.
Once I optimized with fmin_bfgs and achieved a cost that matched the Octave result (0.529), predictions and y had mismatched shapes, so (predictions == y) broadcast to a matrix that was MxM ((118, 118)) instead of a vector.
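For completeness, a minimal sketch of the shape fix for the accuracy computation (assuming predict returns a flat array of 0/1 labels and y has shape (m, 1)):

predictions = predict(theta, X).reshape(-1, 1)   # align shapes so the comparison stays element-wise
accuracy = np.mean(predictions == y) * 100
print('Train Accuracy: %.2f' % accuracy)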
