Given a Gaussian Process Model with multidimensional features and scalar observations, how do I compute derivatives of the output wrt to each input, in GPyTorch or GPFlow (or scikit-learn)?
If I understand your question correctly, the following should give you what you want in GPflow with TensorFlow:
import numpy as np
import tensorflow as tf
import gpflow
### Set up toy data & model -- change as appropriate:
X = np.linspace(0, 10, 5)[:, None]
Y = np.random.randn(5, 1)
data = (X, Y)
kernel = gpflow.kernels.SquaredExponential()
model = gpflow.models.GPR(data, kernel)
Xtest = np.linspace(-1, 11, 7)[:, None] # where you want to predict
### Compute gradient of prediction with respect to input:
# TensorFlow can only compute gradients with respect to tensor objects,
# so let's convert the inputs to a tensor:
Xtest_tensor = tf.convert_to_tensor(Xtest)
with tf.GradientTape(
persistent=True # this allows us to compute different gradients below
) as tape:
# By default, only Variables are watched. For gradients with respect to tensors,
# we need to explicitly watch them:
tape.watch(Xtest_tensor)
mean, var = model.predict_f(Xtest_tensor) # or any other predict function
grad_mean = tape.gradient(mean, Xtest_tensor)
grad_var = tape.gradient(var, Xtest_tensor)
Related
I've noticed that Tensorflow's automatic differentiation does not give the same values as finite differences when the loss function converts the input to a numpy array to calculate the output value. Here's a minimum working example of the problem:
import tensorflow as tf
import numpy as np
def lossFn(inputTensor):
# Input is a rank-2 square tensor
return tf.linalg.trace(inputTensor # inputTensor)
def lossFnWithNumpy(inputTensor):
# Same function, but converts input to a numpy array before performing the norm
inputArray = inputTensor.numpy()
return tf.linalg.trace(inputArray # inputArray)
N = 2
tf.random.set_seed(0)
randomTensor = tf.random.uniform([N, N])
# Prove that the two functions give the same output; evaluates to exactly zero
print(lossFn(randomTensor) - lossFnWithNumpy(randomTensor))
theoretical, numerical = tf.test.compute_gradient(lossFn, [randomTensor])
# These two values match
print(theoretical[0])
print(numerical[0])
theoretical, numerical = tf.test.compute_gradient(lossFnWithNumpy, [randomTensor])
# The theoretical value is [0 0 0 0]
print(theoretical[0])
print(numerical[0])
The function tf.test.compute_gradients computes the 'theoretical' gradient using automatic differentiation, and the numerical gradient using finite differences. As the code shows, if I use .numpy() in the loss function the automatic differentiation does not calculate the gradient.
Could anybody explain the reason for this?
From the guide : Introduction to Gradients and Automatic Differentiation
The tape can't record the gradient path if the calculation exits TensorFlow. For example:
x = tf.Variable([[1.0, 2.0],
[3.0, 4.0]], dtype=tf.float32)
with tf.GradientTape() as tape:
x2 = x**2
# This step is calculated with NumPy
y = np.mean(x2, axis=0)
# Like most ops, reduce_mean will cast the NumPy array to a constant tensor
# using `tf.convert_to_tensor`.
y = tf.reduce_mean(y,axis=0)
print(tape.gradient(y, x))
outputs None
The numpy value will be cast back as a constant tensor in the call to tf.linalg.trace, which Tensorflow cannot compute gradients on.
My inducing points are set to trainable but do not change when I call opt.minimize(). Why is it and what does it mean? Does it mean, the model is not learning?
What is the difference between tf.optimizers.Adam(lr) and gpflow.optimizers.Scipy?
The following is the simple classification example adapted from the documentation. When I run this code example with gpflow's Scipy optimizer then I get the trained results and the values for inducing variables keep changing. But when I use Adam optimizer then I get only a straight line prediction, and the values for inducing points remain the same. It indicates that the model is not learning with Adam optimizer.
plot of data before training
plot of data after training with Adam
plot of data after training with gpflow optimizer Scipy
The link for the example is https://gpflow.readthedocs.io/en/develop/notebooks/advanced/multiclass_classification.html
import numpy as np
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore') # ignore DeprecationWarnings from tensorflow
import matplotlib.pyplot as plt
import gpflow
from gpflow.utilities import print_summary, set_trainable
from gpflow.ci_utils import ci_niter
from tensorflow2_work.multiclass_classification import plot_posterior_predictions, colors
np.random.seed(0) # reproducibility
# Number of functions and number of data points
C = 3
N = 100
# RBF kernel lengthscale
lengthscale = 0.1
# Jitter
jitter_eye = np.eye(N) * 1e-6
# Input
X = np.random.rand(N, 1)
kernel_se = gpflow.kernels.SquaredExponential(lengthscale=lengthscale)
K = kernel_se(X) + jitter_eye
# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(C)).T
# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
print(Y.shape)
# One-hot encoding
Y_hot = np.zeros((N, C), dtype=bool)
Y_hot[np.arange(N), Y] = 1
data = (X, Y)
plt.figure(figsize=(12, 6))
order = np.argsort(X.reshape(-1,))
print(order.shape)
for c in range(C):
plt.plot(X[order], f[order, c], '.', color=colors[c], label=str(c))
plt.plot(X[order], Y_hot[order, c], '-', color=colors[c])
plt.legend()
plt.xlabel('$X$')
plt.ylabel('Latent (dots) and one-hot labels (lines)')
plt.title('Sample from the joint $p(Y, \mathbf{f})$')
plt.grid()
plt.show()
# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
# Robustmax Multiclass Likelihood
invlink = gpflow.likelihoods.RobustMax(C) # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(C, invlink=invlink) # Multiclass likelihood
Z = X[::5].copy() # inducing inputs
#print(Z)
m = gpflow.models.SVGP(kernel=kernel, likelihood=likelihood,
inducing_variable=Z, num_latent_gps=C, whiten=True, q_diag=True)
# Only train the variational parameters
set_trainable(m.kernel.kernels[1].variance, True)
set_trainable(m.inducing_variable, True)
print(m.inducing_variable.Z)
print_summary(m)
training_loss = m.training_loss_closure(data)
opt.minimize(training_loss, m.trainable_variables)
print(m.inducing_variable.Z)
print_summary(m.inducing_variable.Z)
print(m.inducing_variable.Z)
# %%
plot_posterior_predictions(m, X, Y)
The example given in the question isn't copy&pastable, but it seems like you simply exchange opt = gpflow.optimizers.Scipy() with opt = tf.optimizers.Adam(). The minimize() method of gpflow's Scipy optimizer runs one call of scipy.optimize.minimize, which by default runs to convergence (you can also specify a maximum number of iterations by passing, e.g., options=dict(maxiter=100) to the minimize() call).
In contrast, the minimize() method of TensorFlow optimizers runs only a single optimization step. To run more steps, say iter = 100, you need to manually write a loop:
for _ in range(iter):
opt.minimize(model.training_loss, model.trainable_variables)
For this to actually run fast, you also need to wrap the optimization step in tf.function:
#tf.function
def optimization_step():
opt.minimize(model.training_loss, model.trainable_variables)
for _ in range(iter):
optimization_step()
This runs exactly iter steps - in TensorFlow you have to handle convergence checks yourself, your model may or may not be converged after this many steps.
So in your usage, you only ran one step - this did change the parameters, but presumably too little to notice the difference. (You could see a larger effect in one step by making the learning rate much higher, though that would not be a good idea for actually optimizing the model with many steps.)
Usage of the Adam optimizer with GPflow models is demonstrated in the notebook on stochastic variational inference, though it also works for non-stochastic optimization.
Note that, in any case, all parameters such as inducing point locations are set trainable by default, so your call to set_trainable(..., True) doesn't affect what's going on here.
I'd like to train a neural network in Python and Keras using a metric learning custom loss function. The loss minimizes the distances of the outputs for similar inputs and maximizes the distances between dissimilar ones. The part considering similar inputs is:
# function to create a pairwise similarity matrix, i.e
# L[i,j] == 1 for similar samples i, j and 0 otherwise
def build_indicator_matrix(y_, thr=0.1):
# y_: contains the labels of the samples,
# samples are similar in case of same label
# prevent checking equality of floats --> check if absolute
# differences are below threshold
lbls_diff = K.expand_dims(y_, axis=0) - K.expand_dims(y_, axis=1)
lbls_thr = K.less(K.abs(lbls_diff), thr)
# cast bool tensor back to float32
L = K.cast(lbls_thr, 'float32')
# POSSIBLE WORKAROUND
#L = K.sum(L, axis=2)
return L
# function to compute the (squared) Euclidean distances between all pairs
# of samples, store in DIST[i,j] the distance between output y_pred[i,:] and y_pred[j,:]
def compute_pairwise_distances(y_pred):
DIFF = K.expand_dims(y_pred, axis=0) - K.expand_dims(y_pred, axis=1)
DIST = K.sum(K.square(DIFF), axis=-1)
return DIST
# function to compute the average distance between all similar samples
def my_loss(y_true, y_pred):
# y_true: contains true labels of the samples
# y_pred: contains network outputs
L = build_indicator_matrix(y_true)
DIST = compute_pairwise_distances(y_pred)
return K.mean(DIST * L, axis=1)
For training, I pass a numpy array y of shape (n,) as target variable to my_loss. However, I found (using the computational graph in TensorBoard) that the tensorflow backend creates a 2D variable out of y (displayed shape ? x ?), and hence L in build_indicator_matrix is not 2 but 3-dimensional (shape ? x ? x ? in TensorBoard). This causes net.evaulate() and net.fit() to compute wrong results.
Why does tensorflow create a 2D rather than a 1D array? And how does this affect net.evaluate() and net.fit()?
As quick workarounds I found that either replacing the build_indicator_matrix() with static numpy code for computing L , or collapsing the "fake" dimension with the line L = K.sum(L, axis=2) solves the problem. In the latter case, however, the output of K.eval(build_indicator_matrix(y)) is of only of shape (n,) and not (n,n), so I do not understand why this workaround still yields correct results. Why does tensorflow introduce an additional dimension?
My library versions are:
keras: 2.2.4
tensorflow: 1.8.0
numpy: 1.15.0
This is because evaluate and fit work in batches.
The first dimension you see in tensorboard is the batch dimension, unknown in advance and therefore denoted ?.
When using custom metrics, remember the tensors (y_true and y_pred) you get are the ones corresponding to the batch.
For more info, show us how you call both those functions.
I would like do following steps for a trainning:
create a tensor for linear: target=Weight * X
select top target values and drop all the rest samples.
get corresponding labels, which is Y.
use GradientDescentOptimizer, minimize sum(Y) and get fitted variable(W)
code
from tensorflow.python.framework import ops
import numpy as np
import tensorflow as tf
sess = tf.Session()
X=tf.placeholder(shape=[None,2], dtype=tf.float32)
Y=tf.placeholder(shape=[None,1], dtype=tf.float32)
W = tf.Variable(tf.random_normal(shape=[2, 1]), dtype=tf.float32)
target=tf.matmul(X, W)
flattened=tf.reshape(target,[-1])
selected_targets, keys=tf.nn.top_k(flattened, k=100)
#get corresponding Y
selected_y = tf.gather(Y, keys)
#now we have top 100 selected_targets, and selected_y, train and evaluate W, and fit minimal sum(Y)
train_target = tf.reduce_sum(selected_y) #But if use selected_targets instead of selected_y, it would run successfully, why?
optimizer = tf.train.GradientDescentOptimizer(1)
train = optimizer.minimize(train_target)
# training
x_vals = np.random.rand(1000,2)
y_vals = np.random.rand(1000,1)
sess.run(tf.global_variables_initializer())
sess.run(tf.local_variables_initializer())
sess.run(train, {X: x_vals, Y:y_vals})
print(sess.run([W]))
I got this error:
ValueError: No gradients provided for any variable, check your graph
for ops that do not support gradients, between variables
[""] and loss
Tensor("Sum:0", shape=(), dtype=float32).
Anyone could help on this issue? I found it happens when applying tf.nn.top_k on tensors. But why?
It says this because there are no gradients. The graph you create is not differentiable, because the keys are the only thing connecting the loss to the W variable and the keys are non-differentiable integers.
I was implementing a simple k-nearest-neighbor algorithm from sklearn on google cloud platform ml engine. I used a custom metric to calculate the distance between two input vectors so that the distance is the weighted sum of elements in the element-wise squared difference between two vectors. The code is below:
import os.path
from sklearn import neighbors
import numpy as np
from six.moves import cPickle as pickle
import tensorflow as tf
from tensorflow.python.lib.io import file_io
flags = tf.app.flags
FLAGS = flags.FLAGS
flags.DEFINE_string('input_dir', 'input', 'Input Directory.')
flags.DEFINE_string('input_train_data','train_data','Input Training Data File Name.')
pickle_file = os.path.join(FLAGS.input_dir, FLAGS.input_train_data)
def mydist(x, y):
return np.dot((x - y) ** 2, weight)
with file_io.FileIO(pickle_file, 'r') as f:
save = pickle.load(f)
train_dataset, train_labels, valid_dataset, valid_labels = save['train_dataset'], save['train_labels'], save[
'valid_dataset'], save['valid_labels']
train_data = train_dataset[:1000]
train_label = train_labels[:1000]
test_data = valid_dataset[:100]
weight = [1.0]* len(train_dataset[1])
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric=lambda x, y: mydist(x, y))
knn.fit(train_data, train_label)
predict = knn.predict(test_data)
print(predict)
train_dataset is a numpy array of shape (86667,13) and valid_dataset has shape (8000,13). Train_labels has shape (86667,1) and valid_labels (8000,1). For some reason I got a dimension mismatch:
line 15, in mydist return np.dot((x - y) ** 2, weight) ValueError: shapes
(10,) and (13,) not aligned: 10 (dim 0) != 13 (dim 0)
Both x and y in the input of the custom metric should have size 13 yet somehow they have size 10. Can anyone explain what is wrong here?
You are taking the distance between the wrong terms. You cannot take the distance between the labels and the train features. These are of two different dimensions. You need to calculate the distance between any two feature points say x1 and x2, not between the label and it's feature point (say x1 and y1). Secondly, when declaring the KNeighborsRegressor object, you have specified the parameter incorrectly. In the 'metric' parameter you specify the 'string' or the 'DistanceMetric' Object. First you have to make a distance metric object and then pass it as metric. So, this is the correct way of calling your function:
my_metric=DistanceMetric.get_metric('myfunc',func=mydist)
knn = neighbors.KNeighborsRegressor(weights='distance', n_neighbors=20, metric='myfunc')
Sklearn will itself take care of how the parameters are to be passed in the distance function. I am assuming that the weights variable is global for your code to function correctly.