I have n vectors that need to be influenced by each other and should produce n output vectors with the same dimensionality d. I believe this is what torch.nn.MultiheadAttention does. But the forward function expects query, key and value as inputs. According to this blog, I need to initialize a random weight matrix of shape (d x d) for each of q, k and v, multiply each of my vectors by these weight matrices and get three (n x d) matrices. Now, are the q, k and v expected by torch.nn.MultiheadAttention just these three matrices, or do I have it mistaken?
When you want to use self-attention, just pass your input tensor into torch.nn.MultiheadAttention as the query, key and value.
attention = torch.nn.MultiheadAttention(<input-size>, <num-heads>)
x, _ = attention(x, x, x)
The PyTorch module returns the output states (same shape as the input) and the attention weights used in the attention computation.
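For example, here is a minimal self-attention sketch (hypothetical sizes; it assumes a recent PyTorch with the batch_first option, and note that the q, k and v projection matrices are created and learned inside the module, so you do not build them yourself):

import torch

n, d = 10, 16                          # n vectors of dimensionality d (hypothetical sizes)
x = torch.randn(1, n, d)               # (batch, sequence, embedding)

attention = torch.nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
out, weights = attention(x, x, x)      # self-attention: query = key = value = x

print(out.shape)                       # torch.Size([1, 10, 16]) -- same shape as the input
print(weights.shape)                   # torch.Size([1, 10, 10]) -- weights averaged over heads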
I am trying to generate a matrix (a tensor object in PyTorch) that is similar to a Gram matrix, except I need to apply a kernel function instead of the inner product to my input matrix.
Nested for loops like the ones below work:
N = x.shape[0]  # x.shape = (N, d)
G = torch.zeros((N, N))
for i in range(N):
    for j in range(N):
        G[i][j] = K(x[i], x[j])
where x is my input tensor whose shape is (N,d) and the kernel function K(a,b) yields a real value after performing some math. For example:
def K(a, b):
    return ((1 + (a * b)).sum()).pow(2)  # second-degree polynomial
I want to generate this matrix G without having to change the kernel function K() and, of course, without for loops!
My initial attempt was a lambda approach, but the code below obviously doesn't work, as it only yields a list of K(x[i], x[i]).
G = torch.tensor(list(map(lambda a, b: K(a, b), x, x)))
How can I use the lambda function to yield an N-by-N matrix?
What would be some other ways to tackle this problem?
Any insight would be appreciated.
You can calculate G from x simply with:
G = (1 + torch.matmul(x, x.T)).pow(2)
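As a quick sanity check (a minimal sketch with hypothetical N and d, reusing the second-degree polynomial kernel from the question), the vectorized expression matches the double loop:

import torch

def K(a, b):
    return ((1 + (a * b)).sum()).pow(2)  # second-degree polynomial kernel

N, d = 8, 4                 # hypothetical sizes
x = torch.randn(N, d)

# loop version
G_loop = torch.zeros((N, N))
for i in range(N):
    for j in range(N):
        G_loop[i][j] = K(x[i], x[j])

# vectorized version: x @ x.T computes all pairwise inner products at once
G_vec = (1 + torch.matmul(x, x.T)).pow(2)

print(torch.allclose(G_loop, G_vec))  # True (up to floating-point error)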
I'm doing the online Computer Vision course by UMich and am new to PyTorch. One of the assignment questions is on batch matrix multiplication, where we have to find the batch matrix product with and without the bmm function. Here is the code.
def batched_matrix_multiply(x, y, use_loop=True):
    """
    Perform batched matrix multiplication between the tensor x of shape (B, N, M)
    and the tensor y of shape (B, M, P).

    If use_loop=True, then you should use an explicit loop over the batch
    dimension B. If loop=False, then you should instead compute the batched
    matrix multiply without an explicit loop using a single PyTorch operator.

    Inputs:
    - x: Tensor of shape (B, N, M)
    - y: Tensor of shape (B, M, P)
    - use_loop: Whether to use an explicit Python loop.

    Hint: torch.stack, bmm

    Returns:
    - z: Tensor of shape (B, N, P) where z[i] of shape (N, P) is the result of
      matrix multiplication between x[i] of shape (N, M) and y[i] of shape
      (M, P). It should have the same dtype as x.
    """
    z = None
    #############################################################################
    # TODO: Implement this function                                             #
    #############################################################################
    # Replace "pass" statement with your code
    z = torch.zeros(x.shape[0], x.shape[1], y.shape[2])
    if use_loop == True:
        for i in range(x.shape[0]):
            z[i] = torch.mm(x[i], y[i])
    else:
        z = torch.bmm(x, y)
    #############################################################################
    #                            END OF YOUR CODE                               #
    #############################################################################
    return z
I've managed to do it without bmm, but without using the torch.stack hint. I initialized a matrix of zeros z with the dimensions of the output matrix and performed an ordinary matrix multiplication for each batch element inside the for loop.
I'd like to know what the more efficient answer using torch.stack is.
Great question, I just tried solving this myself for two hours. Here's my solution using the torch.stack hint, and it really speeds up the computation, as needed.
if use_loop == False:
    z = torch.bmm(x, y)
else:
    # multiply each batch slice with @ and stack the results along a new
    # batch dimension; this also gives z the same dtype as x automatically
    z = torch.stack([x[i] @ y[i] for i in range(x.shape[0])])
Hope this helped!
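As a quick check (a minimal sketch with hypothetical shapes), the torch.stack path agrees with torch.bmm and keeps the dtype of x:

import torch

B, N, M, P = 4, 3, 5, 2                   # hypothetical sizes
x = torch.randn(B, N, M, dtype=torch.float64)
y = torch.randn(B, M, P, dtype=torch.float64)

z_loop = torch.stack([x[i] @ y[i] for i in range(B)])
z_bmm = torch.bmm(x, y)

print(torch.allclose(z_loop, z_bmm))      # True
print(z_loop.dtype == x.dtype)            # True -- stacking preserves the dtype of x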
I have two tensors named x_t and x_k with the following shapes: NxHxW and KxNxHxW respectively, where K is the number of autoencoders used to reconstruct x_t (if you have no idea what that is, assume they're K different nets aiming to predict x_t; this probably has nothing to do with the question anyway), N is the batch size, H the matrix height, and W the matrix width.
I'm trying to apply the Kullback-Leibler divergence to both tensors (after broadcasting x_t as x_k along the Kth dimension) using PyTorch's nn.functional.kl_div method.
However, it does not seem to be working as I expected. I'm looking to calculate the kl_div between each observation in x_t and x_k, resulting in a tensor of size KxN (i.e., the kl_div of each observation for each of the K autoencoders).
The actual output is a single value if I use the reduction argument, and the same tensor size (i.e., KxNxHxW) if I do not use it.
Has anyone tried something similar?
Reproducible example:
import torch
import torch.nn.functional as F
#                      K   N  H  W
x_t = torch.randn(        10, 5, 5)
x_k = torch.randn(     3, 10, 5, 5)
x_broadcasted = x_t.expand_as(x_k)
loss = F.kl_div(x_t, x_k, reduction="none") # or "batchmean", or there are many options
It's unclear to me what exactly constitutes a probability distribution in your model. With reduction='none', kl_div, given log(x_n) and y_n, computes kl_div = y_n * (log(y_n) - log(x_n)), which is the "summed" part of the actual Kullback-Leibler divergence. Summation (or, in other words, taking the expectation) is up to you. If your point is that H, W are the two dimensions over which you want to take expectation, it's as simple as
loss = F.kl_div(x_t, x_k, reduction="none").sum(dim=(-1, -2))
This gives a tensor of shape [K, N]. If your network output is to be interpreted differently, you need to specify more precisely which are the event dimensions and which are the sample dimensions of your distribution.
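For example, here is a minimal sketch (it assumes each HxW map is meant to be one probability distribution, so the values are normalized with log_softmax/softmax over the flattened H*W axis before calling kl_div; that normalization step is my assumption, not part of the original code):

import torch
import torch.nn.functional as F

K, N, H, W = 3, 10, 5, 5
x_t = torch.randn(N, H, W)
x_k = torch.randn(K, N, H, W)

# treat each HxW map as one distribution: kl_div expects the input as
# log-probabilities and the target as probabilities
log_p = F.log_softmax(x_t.reshape(N, -1), dim=-1)      # (N, H*W)
q = F.softmax(x_k.reshape(K, N, -1), dim=-1)           # (K, N, H*W)

loss = F.kl_div(log_p.expand_as(q), q, reduction="none").sum(dim=-1)
print(loss.shape)  # torch.Size([3, 10]) -> one value per (autoencoder, observation) pair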
I'm watching the YouTube videos of Stanford's cs231n and trying to do the assignments as an exercise. While doing the SVM one, I ran into the following piece of code:
def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape)  # initialize the gradient as zero

    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W)  # This line
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1  # note delta = 1
            if margin > 0:
                loss += margin
Here's the line I'm having trouble with:
scores = X[i].dot(W)
This is doing the product xW; shouldn't it be Wx, by which I mean W.dot(X[i])?
Because W has shape (D, C) and X[i] has shape (D,), you can't compute W.dot(X[i]) directly: the dimensions aren't aligned. You would have to transpose W first, since W.T has shape (C, D) and W.T.dot(X[i]) yields the (C,) vector of class scores.
Since W.T.dot(X[i]) == X[i].dot(W) for a 1-D sample X[i], the implementation simply reverses the order of the dot product instead of taking the transpose. Effectively, this just comes down to a decision about how the inputs are arranged. In this case the (somewhat arbitrary) decision was made to store the samples as rows of X, which makes x·W the natural way to compute the scores rather than W·x.
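A quick shape check (a minimal sketch with hypothetical D, C and N) illustrates the equivalence:

import numpy as np

D, C, N = 4, 3, 5                       # hypothetical sizes
W = np.random.randn(D, C)
X = np.random.randn(N, D)

scores_a = X[0].dot(W)                  # shape (C,): one score per class
scores_b = W.T.dot(X[0])                # shape (C,): same values, reversed order
print(np.allclose(scores_a, scores_b))  # True

# W.dot(X[0]) would raise a ValueError: shapes (D, C) and (D,) are not aligned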
I have two matrices A and B, both of dimensions NxKxD, and I want to get a matrix C of dimensions NxKxDxD, where C[n, k] = A[n, k] x B[n, k].T (here "x" means the product of matrices of dimensions Dx1 and 1xD, so the result must be DxD). My code currently looks like this (here A = B = X):
def square(X):
    N, K, D = X.shape  # take the sizes from X so the function is self-contained
    out = np.zeros((N, K, D, D))
    for n in range(N):
        for k in range(K):
            out[n, k] = np.dot(X[n, k, :, np.newaxis], X[n, k, np.newaxis, :])
    return out
It may be slow for big N and K because of Python's for loops. Is there some way to do this multiplication in a single NumPy operation?
It seems you are not using np.dot for sum-reduction, but just for expansion that results in broadcasting. So, you can simply extend the array to have one more dimension with the use of np.newaxis/None and let the implicit broadcasting help out.
Thus, an implementation would be -
X[...,None]*X[...,None,:]
More info on broadcasting, specifically how to add new axes, can be found in this other post.
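As a quick check (a minimal sketch with hypothetical N, K and D), the broadcasted product matches the loop version; np.einsum is an equivalent way to spell the same outer product, in case that reads more clearly:

import numpy as np

N, K, D = 4, 3, 5                         # hypothetical sizes
X = np.random.randn(N, K, D)

# loop version
out_loop = np.zeros((N, K, D, D))
for n in range(N):
    for k in range(K):
        out_loop[n, k] = np.dot(X[n, k, :, None], X[n, k, None, :])

# broadcasting version: (N, K, D, 1) * (N, K, 1, D) -> (N, K, D, D)
out_bcast = X[..., None] * X[..., None, :]

# einsum alternative spelling the outer product explicitly
out_einsum = np.einsum('nki,nkj->nkij', X, X)

print(np.allclose(out_loop, out_bcast))   # True
print(np.allclose(out_loop, out_einsum))  # True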