Differentiation Issue in Predictive Alignment for Attention Implementation - python

I am trying to implement local-p attention based on this paper: https://arxiv.org/pdf/1508.04025.pdf Specifically, equation (9) derives an alignment position by taking the sigmoid of some non-linear functions and then multiplying the result by the number of timesteps. Since the sigmoid returns values between 0 and 1, this multiplication yields a valid index between 0 and the number of timesteps. I can soft-round this to infer the predicted position; however, I couldn't find a way to convert it to an integer for use within slicing/indexing operations, since tf.cast() is not differentiable. Another problem is that the derived positions have shape (B, 1), i.e. one aligned position for each example in the batch. See below to understand these operations:
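For reference, equation (9) of the paper (restated here; S is the source sequence length and h_t is the current target hidden state) computes the aligned position as p_t = S * sigmoid(v_p^T tanh(W_p h_t)), which is what the call method below implements.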
"""B = batch size, S = sequence length (num. timesteps), V = vocabulary size, H = number of hidden dimensions"""
class LocalAttention(Layer):
def __init__(self, size, window_width=None, **kwargs):
super(LocalAttention, self).__init__(**kwargs)
self.size = size
self.window_width = window_width # 2*D
def build(self, input_shape):
self.W_p = Dense(units=input_shape[2], use_bias=False)
self.W_p.build(input_shape=(None, None, input_shape[2])) # (B, 1, H)
self._trainable_weights += self.W_p.trainable_weights
self.v_p = Dense(units=1, use_bias=False)
self.v_p.build(input_shape=(None, None, input_shape[2])) # (B, 1, H)
self._trainable_weights += self.v_p.trainable_weights
super(Attention, self).build(input_shape)
def call(self, inputs):
sequence_length = inputs.shape[1]
## Get h_t, the current (target) hidden state ##
target_hidden_state = Lambda(function=lambda x: x[:, -1, :])(inputs) # (B, H)
## Get h_s, source hidden states ##
aligned_position = self.W_p(target_hidden_state) # (B, H)
aligned_position = Activation('tanh')(aligned_position) # (B, H)
aligned_position = self.v_p(aligned_position) # (B, 1)
aligned_position = Activation('sigmoid')(aligned_position) # (B, 1)
aligned_position = aligned_position * sequence_length # (B, 1)
Let's say the aligned_position tensor has elements [24.2, 15.1, 12.3] for a batch size B = 3, for simplicity. Then, the source hidden states are derived from the input hidden states (B=3, S, H) such that for the first example we take timesteps starting from 24, hence something along the lines of first_batch_states = Lambda(function=lambda x: x[:, 24:, :])(inputs) and so on. Note that the implementation of local-p attention is more complicated than this, but I simplified it here. Hence, the main challenge is converting 24.2 to 24 without losing differentiability, or using some sort of mask operation to get the indexes through a dot product. The mask operation is preferred, as we will have to do this for each example in the batch, and having a loop inside a custom Keras layer is not neat. Do you have any ideas on how to accomplish this task? I will appreciate any answers and comments!

There are two ways I found to go about solving this problem.
1. Apply a Gaussian distribution based on the aligned position from the original question to the attention weights, which keeps the process differentiable, as @Siddhant suggested:
gaussian_estimation = lambda s: tf.exp(-tf.square(s - aligned_position) /
                                       (2 * tf.square(self.window_width / 2)))
gaussian_factor = gaussian_estimation(0)
for i in range(1, sequence_length):
    gaussian_factor = Concatenate()([gaussian_factor, gaussian_estimation(i)])
# Adjust weights via gaussian_factor: (B, S*) to allow differentiability
attention_weights = attention_weights * gaussian_factor  # (B, S*)
It should be noted that there is no hard slicing operation involved here, only a soft re-weighting according to distance.
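For reference, a vectorized sketch of the same Gaussian weighting (my own variant, not the answer's exact code; it assumes aligned_position has shape (B, 1) and window_width = 2*D as above), which avoids the Python loop over timesteps:

positions = tf.range(sequence_length, dtype=tf.float32)           # (S,)
gaussian_factor = tf.exp(-tf.square(positions - aligned_position) /
                         (2 * tf.square(self.window_width / 2)))  # (B, S) via broadcasting
attention_weights = attention_weights * gaussian_factor           # (B, S)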
2. Keep the top n values and zero out the rest, as suggested by @Vlad here: How to implement a custom keras layer that only keeps the top n values and zeros out all the rest?
aligned_position = self.W_p(inputs)                           # (B, S, H)
aligned_position = Activation('tanh')(aligned_position)       # (B, S, H)
aligned_position = self.v_p(aligned_position)                 # (B, S, 1)
aligned_position = Activation('sigmoid')(aligned_position)    # (B, S, 1)

## Only keep top D values out of the sigmoid activation, and zero-out the rest ##
aligned_position = tf.squeeze(aligned_position, axis=-1)      # (B, S)
top_probabilities = tf.nn.top_k(input=aligned_position,
                                k=self.window_width,
                                sorted=False)                  # (values:(B, D), indices:(B, D))
onehot_vector = tf.one_hot(indices=top_probabilities.indices,
                           depth=sequence_length)              # (B, D, S)
onehot_vector = tf.reduce_sum(onehot_vector, axis=1)           # (B, S)
aligned_position = Multiply()([aligned_position, onehot_vector])   # (B, S)
aligned_position = tf.expand_dims(aligned_position, axis=-1)       # (B, S, 1)
source_hidden_states = Multiply()([inputs, aligned_position])      # (B, S*=S(D), H)

## Scale back to approximately original hidden state values ##
aligned_position += 1                                           # (B, S, 1)
source_hidden_states /= aligned_position                        # (B, S*=S(D), H)
It should be noted that here we are instead applying the dense layers to all hidden source states to get a shape of (B,S,1) instead of (B,1) for aligned_position. I believe this is as close as we can get to what the paper suggests.
Anybody who is trying to implement attention mechanisms can check my repo https://github.com/uzaymacar/attention-mechanisms. Layers here are designed for many-to-one sequence tasks, but can be adapted to other forms with minor tweaks.

Related

How can I convert a 1112x42 tensor to a 1112x42x42 Cholesky Decomposition tensor in LibTorch (C++)?

I have a 1112x42 tensor and I would like to do a Cholesky decomposition on it in LibTorch, the C++ version of PyTorch.
I would like the result to be a 1112x42x42 tensor, which is a batch of 1112 Cholesky-decomposed 42x42 matrices. I could run a loop over each 1x42 slice, but this would be too time consuming. I want to apply the decomposition to the 1112x42 matrix in one go.
My attempt does not work because 1112x42 is not square.
torch::linalg::cholesky(torch::rand({1112, 42}));
I found a working example of how to convert a 1112x42 tensor to a Cholesky-decomposed matrix in Python below. It is within the forward method of a PyTorch neural net that I cannot recreate, because the things that make this possible in Python are not possible in C++ (as far as I can tell). I will include it anyway in case it inspires an answer.
class ActorContinuous(nn.Module):
    def __init__(self):
        super(ActorContinuous, self).__init__()
        self.da = 42
        self.ds = 385
        self.lin1 = nn.Linear(self.ds, 256)
        self.lin2 = nn.Linear(256, 256)
        self.mean_layer = nn.Linear(256, self.da)
        self.cholesky_layer = nn.Linear(256, (self.da * (self.da + 1)) // 2)

    def forward(self, state):
        """
        forwards input through the network
        :param state: (B, ds)
        :return: mean vector (B, da) and cholesky factorization of covariance matrix (B, da, da)
        """
        device = state.device
        B = 1112
        da = 42
        x = F.relu(self.lin1(state))
        x = F.relu(self.lin2(x))
        mean = torch.sigmoid(self.mean_layer(x))  # (B, da)
        cholesky_vector = self.cholesky_layer(x)  # (B, (da*(da+1))//2)
        cholesky_diag_index = torch.arange(da, dtype=torch.long) + 1
        cholesky_diag_index = (cholesky_diag_index * (cholesky_diag_index + 1)) // 2 - 1
        cholesky_vector[:, cholesky_diag_index] = F.softplus(cholesky_vector[:, cholesky_diag_index])
        tril_indices = torch.tril_indices(row=da, col=da, offset=0)
        cholesky = torch.zeros(size=(B, da, da), dtype=torch.float32).to(device)
        cholesky[:, tril_indices[0], tril_indices[1]] = cholesky_vector
        return mean, cholesky

act = ActorContinuous()
m, c = act.forward(torch.rand(1112, 385))
c.shape
# returns the desired Cholesky-decomposed matrices of torch.Size([1112, 42, 42])

Pytorch correlation matrix of batches

I have a tensor input of dimensions (B, C, H, W) and I would like to find a correlation matrix of the input. The code I am using is:
def corr(x):
    """
    x: [B, C, H, W]
    """
    # [B, C, H, W] -> [B, C, H * W]
    x = x.view((x.size(0), x.size(1), -1))
    # estimated covariance
    x = x - x.mean(dim=-1, keepdim=True)
    factor = 1 / (x.shape[-1] - 1)
    cov = factor * (x @ x.transpose(-1, -2))
    return torch.div(cov, torch.diagonal(cov, dim1=-2, dim2=-1))
I rechecked myself and it looks like I am getting good results for the cov variable in the function, but when I try to normalize it to get the correlation, the result's range is very strange: there are values above 1 and below -1, and overall the solution does not seem right.
Any suggestions on how to solve the problem?
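For what it's worth, a minimal sketch of the standard normalization (my own suggestion, not the asker's code): divide the covariance by the outer product of the per-channel standard deviations rather than by the diagonal alone.

import torch

def corr(x):
    """x: [B, C, H, W] -> correlation matrices of shape [B, C, C]."""
    x = x.view(x.size(0), x.size(1), -1)                      # [B, C, H*W]
    x = x - x.mean(dim=-1, keepdim=True)
    cov = (x @ x.transpose(-1, -2)) / (x.shape[-1] - 1)       # [B, C, C]
    std = torch.sqrt(torch.diagonal(cov, dim1=-2, dim2=-1))   # [B, C]
    return cov / (std.unsqueeze(-1) * std.unsqueeze(-2) + 1e-8)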

Convolutional layer in Python using Numpy

I am trying to implement a convolutional layer in Python using Numpy.
The input is a 4-dimensional array of shape [N, H, W, C], where:
N: Batch size
H: Height of image
W: Width of image
C: Number of channels
The convolutional filter is also a 4-dimensional array of shape [F, F, Cin, Cout], where
F: Height and width of a square filter
Cin: Number of input channels (Cin = C)
Cout: Number of output channels
Assuming a stride of one along all axes, and no padding, the output should be a 4-dimensional array of shape [N, H - F + 1, W - F + 1, Cout].
My code is as follows:
import numpy as np

def conv2d(image, filter):
    # Height and width of output image
    Hout = image.shape[1] - filter.shape[0] + 1
    Wout = image.shape[2] - filter.shape[1] + 1
    output = np.zeros([image.shape[0], Hout, Wout, filter.shape[3]])
    for n in range(output.shape[0]):
        for i in range(output.shape[1]):
            for j in range(output.shape[2]):
                for cout in range(output.shape[3]):
                    output[n, i, j, cout] = np.multiply(
                        image[n, i:i+filter.shape[0], j:j+filter.shape[1], :],
                        filter[:, :, :, cout]).sum()
    return output
This works perfectly, but uses four for loops and is extremely slow. Is there a better way of implementing a convolutional layer that takes 4-dimensional input and filter, and returns a 4-dimensional output, using Numpy?
This is a straightforward implementation of this kind of Keras-like (?) convolution. It might be hard for beginners to understand because it uses a lot of broadcasting and stride tricks.
from numpy.lib.stride_tricks import as_strided

def conv2d(a, b):
    a = as_strided(a, (len(a), a.shape[1]-len(b)+1, a.shape[2]-b.shape[1]+1,
                       len(b), b.shape[1], a.shape[3]), a.strides[:3] + a.strides[1:])
    return np.einsum('abcijk,ijkd', a, b[::-1, ::-1])
BTW: if you are doing convolution with a very big kernel, use a Fourier-based algorithm instead.
EDIT: The [::-1,::-1] should be removed in the case where the convolution does not involve flipping the kernel first (as in TensorFlow).
EDIT: np.tensordot(a, b, axes=3) performs much better than np.einsum("abcijk,ijkd", a, b), and is highly recommended.
So, the function becomes:
from numpy.lib.stride_tricks import as_strided

def conv2d(a, b):
    Hout = a.shape[1] - b.shape[0] + 1
    Wout = a.shape[2] - b.shape[1] + 1
    a = as_strided(a, (a.shape[0], Hout, Wout, b.shape[0], b.shape[1], a.shape[3]),
                   a.strides[:3] + a.strides[1:])
    return np.tensordot(a, b, axes=3)
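As an illustrative usage (my own check; conv2d_naive is a hypothetical name for the loop implementation from the question), the two versions should agree and produce the expected output shape:

import numpy as np

a = np.random.rand(2, 8, 8, 3)   # [N, H, W, C]
b = np.random.rand(3, 3, 3, 5)   # [F, F, Cin, Cout]
out = conv2d(a, b)               # strided/tensordot version above
print(out.shape)                 # (2, 6, 6, 5), i.e. [N, H-F+1, W-F+1, Cout]
# np.allclose(out, conv2d_naive(a, b))  # compare against the loop version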

How to add the grad method to a theano Op?

I have created a theano.Op that returns the distance between each pair of the two collections of inputs, wrapping scipy's cdist:
class Cdist(theano.Op):
    __props__ = ()

    def __init__(self):
        #self.fn = scipy_cdist2
        super(Cdist, self).__init__()

    def make_node(self, x, w):
        #print('make_node')
        return gof.Apply(self, [x, w], [x.type()])

    def perform(self, node, inputs, output_storage):
        #print('perform')
        x, w = inputs[0], inputs[1]
        z = output_storage[0]
        z[0] = distance.cdist(x, w, 'euclidean')
It works, but now I want to add the grad method. I have read the guide and the documentation about the grad method, but I still don't understand how it works. For example, in the guide, to get the gradient of an Op that returns a*x + b, they use:
def grad(self, inputs, output_grads):
    return [a * output_grads[0] + b]
Why? I'm going to quote what is written in the documentation about grad:
If the output list of the op is [f_1, ..., f_n], then the list output_gradients is [grad_{f_1}(C), grad_{f_2}(C), ..., grad_{f_n}(C)]. If inputs consists of the list [x_1, ..., x_m], then Op.grad should return the list [grad_{x_1}(C), grad_{x_2}(C), ..., grad_{x_m}(C)], where (grad_{y}(Z))_i = \frac{\partial Z}{\partial y_i} (and i can stand for multiple dimensions).
So they are telling me that I have to write the gradient? But in the example they make a combination of output_grads and integer values. I'm really not understanding.
There's nothing wrong with the docs. In the grad method you should write a symbolic expression, as opposed to the perform method, where you write a numerical expression.
The grad method is called from theano.grad, while perform is called inside the compiled function.
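In other words (my own restatement, not part of the original answer): if the Op computes output f from input x and C is the final cost, then grad receives g = grad_{f}(C) and should return the symbolic expression for grad_{x}(C) = g * \frac{\partial f}{\partial x} (the chain rule), built from Theano operations rather than numerical code.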
For example, assuming euclidean distance:
def grad(self, inputs, out_grads):
    x, y = inputs    # matrices of shape [mA, n] and [mB, n]
    g, = out_grads   # matrix of shape [mA, mB]
    diff = x.dimshuffle(0, 'x', 1) - y.dimshuffle('x', 0, 1)  # [mA, mB, n] tensor
    z = T.sqrt(T.sum(T.sqr(diff), axis=2, keepdims=True))
    diff = g * diff / z
    return [T.sum(diff, axis=1), -T.sum(diff, axis=0)]
For this particular case, I'd suggest writing an L_op instead of grad. L_op additionally reuses the output of the forward Op.
def L_op(self, inputs, outputs, out_grads):
    x, y = inputs   # matrices of shape [mA, n] and [mB, n]
    z, = outputs    # matrix of shape [mA, mB]
    g, = out_grads  # idem
    diff = x.dimshuffle(0, 'x', 1) - y.dimshuffle('x', 0, 1)  # [mA, mB, n] tensor
    diff = g.dimshuffle(0, 1, 'x') * diff / z.dimshuffle(0, 1, 'x')
    return [T.sum(diff, axis=1), -T.sum(diff, axis=0)]
Well, the grad expressions are probably wrong but you get the idea.
As you can see, we are calling symbolic functions such as dimshuffle. However, there are cases where you want to write a separate Op class for the gradient, either because the symbolic graph is too inefficient or because you want a custom gradient.
For example:
class CDistGrad(theano.Op):
    def __init__(...):
        # <...>
        pass

    def c_code(...):
        # implement this in case you want more performance
        pass

    def perform(...):
        # <...>
        pass

    def make_node(...):
        # <...>
        pass

class CDist(theano.Op):
    # <...>
    def grad(self, inputs, output_grads):
        return CDistGrad()(*inputs, *output_grads)
Still, a symbolic expression is used in the grad method; a custom Op just replaces the vanilla Theano expression.
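As a hedged usage sketch (my addition, not from the answer): once Cdist defines grad (or L_op), theano.grad can differentiate through it like any built-in Op:

import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix('x')
w = T.dmatrix('w')
dist = Cdist()(x, w)                 # (mA, mB) pairwise distances
cost = dist.sum()
gx, gw = theano.grad(cost, [x, w])   # invokes Cdist.grad symbolically
f = theano.function([x, w], [cost, gx, gw])
print(f(np.random.rand(3, 2), np.random.rand(4, 2)))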

Error in backpropagation python neural net

Darn thing just won't learn. Sometimes weights seem to become nan.
I haven't played with different numbers of hidden layers/inputs/outputs but the bug appears consistent across different sizes of hidden layer.
from __future__ import division
import numpy
import matplotlib.pyplot
import random

class Net:
    def __init__(self, *sizes):
        sizes = list(sizes)
        sizes[0] += 1
        self.sizes = sizes
        self.weights = [numpy.random.uniform(-1, 1, (sizes[i+1], sizes[i])) for i in range(len(sizes)-1)]

    @staticmethod
    def activate(x):
        return 1/(1+numpy.exp(-x))

    def y(self, x_):
        x = numpy.concatenate(([1], numpy.atleast_1d(x_.copy())))
        o = [x]  # o[i] is the (activated) output of hidden layer i, "hidden layer 0" is inputs
        for weight in self.weights[:-1]:
            x = weight.dot(x)
            x = Net.activate(x)
            o.append(x)
        o.append(self.weights[-1].dot(x))
        return o

    def __call__(self, x):
        return self.y(x)[-1]

    def delta(self, x, t):
        o = self.y(x)
        delta = [(o[-1]-t) * o[-1] * (1-o[-1])]
        for i, weight in enumerate(reversed(self.weights)):
            delta.append(weight.T.dot(delta[-1]) * o[-i-2] * (1-o[-i-2]))
        delta.reverse()
        return o, delta

    def train(self, inputs, outputs, epochs=100, rate=.1):
        for epoch in range(epochs):
            pairs = zip(inputs, outputs)
            random.shuffle(pairs)
            for x, t in pairs:  # shuffle? subset?
                o, d = self.delta(x, t)
                for layer in range(len(self.sizes)-1):
                    self.weights[layer] -= rate * numpy.outer(o[layer+1], d[layer])

n = Net(1, 4, 1)
x = numpy.linspace(0, 2*3.14, 10)
t = numpy.sin(x)
matplotlib.pyplot.plot(x, t, 'g')
matplotlib.pyplot.plot(x, map(n, x), 'r')
n.train(x, t)
print n.weights
matplotlib.pyplot.plot(x, map(n, x), 'b')
matplotlib.pyplot.show()
I haven't looked for a particular bug in your code, but can you please try the following things to narrow down your problem further? Otherwise it is very tedious to find the needle in the haystack.
1) Please try to use a real dataset to have an idea what to expect, e.g., MNIST, and/or standardize your data, because your weights may become NaN if they become too small.
2) Try different learning rates and plot the cost function vs. epochs to check if you are converging. It should look somewhat like this (note that I used minibatch learning and averaged the minibatch chunks for each epoch).
3) I see that you are using a sigmoid activation, your implementation is correct, but to make it numerically more stable, replace 1.0 / (1.0 + np.exp(-z)) by expit(z) from scipy.special (same function but more efficient).
4) Implement gradient checking. Here, you compare the analytical gradient to a numerically approximated gradient, e.g. the one-sided difference quotient (J(w + eps) - J(w)) / eps. An even better approach, which yields a more accurate approximation of the gradient, is to compute the symmetric (or centered) difference quotient given by the two-point formula (J(w + eps) - J(w - eps)) / (2 * eps); see the sketch below.
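A minimal NumPy sketch of such a gradient check (my own addition, not the answerer's code; loss_fn, the flat parameter vector w, and the epsilon value are placeholders):

import numpy as np

def numerical_gradient(loss_fn, w, eps=1e-5):
    """Approximate d(loss)/dw for a flat parameter vector w via centered differences."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += eps
        w_minus[i] -= eps
        grad[i] = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
    return grad

# Compare against the backprop gradient, e.g.:
# assert np.allclose(numerical_gradient(loss, w.ravel()), analytic_grad.ravel(), atol=1e-4)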
PS: If you are interested and find it useful, I have a working vanilla NumPy neural net implemented here.
I fixed it! Thanks for all the suggestions. I worked out numeric partials and found that my o and deltas were correct, but I was multiplying the wrong ones. That's why I now take numpy.outer(d[layer+1], o[layer]) instead of numpy.outer(d[layer], o[layer+1]).
I was also skipping the update on one layer. That's why I changed for layer in range(self.hidden_layers) to for layer in range(self.hidden_layers+1).
I'll add that I caught a bug just before posting originally. My output layer delta was incorrect because my net (intentionally) doesn't activate the final outputs, but my delta was computed as though it did.
I debugged primarily with a one-hidden-layer, one-hidden-unit net, then moved to a model with 2 inputs, 3 hidden layers of 2 neurons each, and 2 outputs.
from __future__ import division
import numpy
import scipy
import scipy.special
import matplotlib.pyplot
#from pylab import *
#numpy.random.seed(23)

def nmap(f, x):
    return numpy.array(map(f, x))

class Net:
    def __init__(self, *sizes):
        self.hidden_layers = len(sizes)-2
        self.weights = [numpy.random.uniform(-1, 1, (sizes[i+1], sizes[i])) for i in range(self.hidden_layers+1)]

    @staticmethod
    def activate(x):
        return scipy.special.expit(x)
        #return 1/(1+numpy.exp(-x))

    @staticmethod
    def activate_(x):
        s = scipy.special.expit(x)
        return s*(1-s)

    def y(self, x):
        o = [numpy.array(x)]  # o[i] is the (activated) output of hidden layer i, "hidden layer 0" is inputs and not activated
        for weight in self.weights[:-1]:
            o.append(Net.activate(weight.dot(o[-1])))
        o.append(self.weights[-1].dot(o[-1]))
        # for weight in self.weights:
        #     o.append(Net.activate(weight.dot(o[-1])))
        return o

    def __call__(self, x):
        return self.y(x)[-1]

    def delta(self, x, t):
        x = numpy.array(x)
        t = numpy.array(t)
        o = self.y(x)
        #delta = [(o[-1]-t) * o[-1] * (1-o[-1])]
        delta = [o[-1]-t]
        for i, weight in enumerate(reversed(self.weights)):
            delta.append(weight.T.dot(delta[-1]) * o[-i-2] * (1-o[-i-2]))
        delta.reverse()  # surely i need this
        return o, delta

    def train(self, inputs, outputs, epochs=1000, rate=.1):
        errors = []
        for epoch in range(epochs):
            for x, t in zip(inputs, outputs):  # shuffle? subset?
                o, d = self.delta(x, t)
                for layer in range(self.hidden_layers+1):
                    grad = numpy.outer(d[layer+1], o[layer])
                    self.weights[layer] -= rate * grad
        return errors

    def rmse(self, inputs, outputs):
        return ((outputs - nmap(self, inputs))**2).sum()**.5/len(inputs)

n = Net(1, 8, 1)
X = numpy.linspace(0, 2*3.1415, 10)
T = numpy.sin(X)
Y = map(n, X)
Y = numpy.array([y[0,0] for y in Y])
matplotlib.pyplot.plot(X, T, 'g')
matplotlib.pyplot.plot(X, Y, 'r')
print 'output successful'
print n.rmse(X, T)
errors = n.train(X, T)
print 'tried to train successfully'
print n.rmse(X, T)
Y = map(n, X)
Y = numpy.array([y[0,0] for y in Y])
matplotlib.pyplot.plot(X, Y, 'b')
matplotlib.pyplot.show()
