I'm trying to understand this code from lightaime's GitHub page. It is a vectorized softmax method. What confuses me is "softmax_output[range(num_train), list(y)]"
What does this expression mean?
import numpy as np

def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized implementation
    Inputs have dimension D, there are C classes, and we operate on minibatches of N examples.
    Inputs:
    W: A numpy array of shape (D, C) containing weights.
    X: A numpy array of shape (N, D) containing a minibatch of data.
    y: A numpy array of shape (N,) containing training labels; y[i] = c means that X[i] has label c, where 0 <= c < C.
    reg: (float) regularization strength
    Returns a tuple of:
    loss as single float
    gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)
    num_classes = W.shape[1]
    num_train = X.shape[0]
    scores = X.dot(W)
    shift_scores = scores - np.max(scores, axis = 1).reshape(-1,1)
    softmax_output = np.exp(shift_scores)/np.sum(np.exp(shift_scores), axis = 1).reshape(-1,1)
    loss = -np.sum(np.log(softmax_output[range(num_train), list(y)]))
    loss /= num_train
    loss += 0.5* reg * np.sum(W * W)
    dS = softmax_output.copy()
    dS[range(num_train), list(y)] += -1
    dW = (X.T).dot(dS)
    dW = dW/num_train + reg* W
    return loss, dW
This expression means: index into the array softmax_output of shape (N, C) and extract from it only the values related to the training labels y.
A two-dimensional numpy.array can be indexed with two lists containing appropriate values (i.e. they should not cause an index error).
range(num_train) creates an index for the first axis, which allows selecting a specific value in each row with the second index, list(y). You can find this in the NumPy documentation on indexing (integer array indexing).
The first index, range(num_train), has a length equal to the first dimension of softmax_output (= N). It points to each row of the matrix; then for each row it selects the target value via the corresponding value from the second part of the index, list(y).
Example:
softmax_output = np.array( # dummy values, not softmax
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]]
)
num_train = 4 # length of the array
y = [2, 1, 0, 2] # labels; values for indexing along the second axis
softmax_output[range(num_train), list(y)]
Out:
[3, 5, 7, 12]
So it selects the third element from the first row, the second from the second row, etc. That's how it works.
(P.S. Did I misunderstand you, and are you interested in "why", not "how"?)
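The loss here is the cross-entropy, defined by the following equation:
$$L = -\sum_{i=1}^{N} \sum_{c} y_{ic} \log p_{ic}$$
Here, $y_{ic}$ is 1 for the class that datapoint $i$ belongs to and 0 for all other classes, and $p_{ic}$ is the softmax output for datapoint $i$ and class $c$. Thus we are only interested in the softmax output for each datapoint's own class, and the above equation can be rewritten as
$$L = -\sum_{i=1}^{N} \log p_{i, c_i}$$
where $c_i$ is the class of datapoint $i$.
The following code represents this equation: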
loss = -np.sum(np.log(softmax_output[range(num_train), list(y)]))
The code softmax_output[range(num_train), list(y)] is used to select the softmax outputs for the respective classes. range(num_train) represents all the training samples and list(y) represents their respective classes.
This indexing is nicely explained by Mikhail in his answer.
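The equivalence between summing over one-hot labels and picking one entry per row can be checked numerically. The following is a minimal sketch with made-up scores and labels (the toy values and the names one_hot, loss_full and loss_indexed are assumptions for illustration, not part of the original code):

```python
import numpy as np

# Hypothetical toy values: 4 samples, 3 classes.
scores = np.array([[1.0, 2.0, 3.0],
                   [1.0, 0.0, -1.0],
                   [0.5, 0.5, 0.5],
                   [2.0, -2.0, 0.0]])
y = np.array([2, 1, 0, 2])
num_train, num_classes = scores.shape

# Numerically stable softmax, following the same steps as the function above.
shift_scores = scores - np.max(scores, axis=1, keepdims=True)
softmax_output = np.exp(shift_scores) / np.sum(np.exp(shift_scores), axis=1, keepdims=True)

# Full form: one-hot labels times log-probabilities, summed over all classes.
one_hot = np.eye(num_classes)[y]
loss_full = -np.sum(one_hot * np.log(softmax_output))

# Reduced form: pick only the probability of the true class in each row.
loss_indexed = -np.sum(np.log(softmax_output[np.arange(num_train), y]))

print(np.isclose(loss_full, loss_indexed))
```

Both forms give the same number because every term with a zero one-hot entry drops out of the sum.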
Related
Is there any way to implement max pooling according to the norm of the sub-vectors in a group in PyTorch? Specifically, this is what I want to implement:
Input:
x: a 2-D float tensor, shape #Nodes * dim
cluster: a 1-D long tensor, shape #Nodes
Output:
y, a 2-D float tensor, and:
y[i]=x[k] where k=argmax_{cluster[k]=i}(torch.norm(x[k],p=2)).
I tried torch.scatter with reduce="max", but this only works for dim=1 and x[i]>0.
Can someone help me to solve the problem?
I don't think there's any built-in function to do what you want. Basically this would be some form of scatter_reduce on the norm of x, but instead of selecting the max norm you want to select the row corresponding to the max norm.
A straightforward implementation may look something like this
"""
input
x: float tensor of size [NODES, DIMS]
cluster: long tensor of size [NODES]
output
float tensor of size [cluster.max()+1, DIMS]
"""
num_clusters = cluster.max().item() + 1
y = torch.zeros((num_clusters, DIMS), dtype=x.dtype, device=x.device)
for cluster_id in torch.unique(cluster):
    x_cluster = x[cluster == cluster_id]
    y[cluster_id] = x_cluster[torch.argmax(torch.norm(x_cluster, dim=1), dim=0)]
This should work just fine if cluster.max() is relatively small. If there are many clusters, though, this approach has to unnecessarily create a mask over cluster for every unique cluster id. To avoid this you can make use of argsort. The best I could come up with in pure Python was the following.
num_clusters = cluster.max().item() + 1
x_norm = torch.norm(x, dim=1)
cluster_sortidx = torch.argsort(cluster)
cluster_ids, cluster_counts = torch.unique_consecutive(cluster[cluster_sortidx], return_counts=True)
end_indices = torch.cumsum(cluster_counts, dim=0).cpu().tolist()
start_indices = [0] + end_indices[:-1]
y = torch.zeros((num_clusters, DIMS), dtype=x.dtype, device=x.device)
for cluster_id, a, b in zip(cluster_ids, start_indices, end_indices):
    indices = cluster_sortidx[a:b]
    y[cluster_id] = x[indices[torch.argmax(x_norm[indices], dim=0)]]
For example, in random tests with NODES = 60000, DIMS = 512, cluster.max()=6000 the first version takes about 620ms while the second version takes about 78ms.
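The remaining Python loop can in principle be removed as well. The sketch below is one possible loop-free variant, not the answer's method: it sorts rows by norm, then stably sorts by cluster id, so the last row of each cluster group is the one with the largest norm. The function name max_norm_pool is hypothetical, and the sketch assumes a PyTorch version in which torch.sort accepts stable=True; it has not been benchmarked against the timings above.

```python
import torch

def max_norm_pool(x, cluster):
    """Hypothetical helper: for each cluster id, return the row of x with the largest L2 norm."""
    num_clusters = int(cluster.max().item()) + 1
    x_norm = torch.norm(x, dim=1)

    # Step 1: order all rows by ascending norm.
    _, by_norm = torch.sort(x_norm, stable=True)
    # Step 2: stable sort by cluster id, which keeps the norm order inside each cluster.
    sorted_cluster, by_cluster = torch.sort(cluster[by_norm], stable=True)
    perm = by_norm[by_cluster]  # row indices grouped by cluster, ascending norm within a group

    # The last position of each group holds the max-norm row of that cluster.
    cluster_ids, counts = torch.unique_consecutive(sorted_cluster, return_counts=True)
    last = torch.cumsum(counts, dim=0) - 1

    # Clusters with no members keep a zero row, as in the loop versions.
    y = torch.zeros((num_clusters, x.shape[1]), dtype=x.dtype, device=x.device)
    y[cluster_ids] = x[perm[last]]
    return y
```

If several rows in a cluster share exactly the same maximal norm, this variant may pick a different one of them than torch.argmax does.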
I'm trying to work with a custom Feedforward implementation that takes varying rows of the input and performs some sort of operation on it.
For example, imagine if the function, f, just sums the rows and columns of an input Tensor:
f = lambda x: torch.sum(x) # sum across all dimensions, producing a scalar
Now, for the input Tensor I have an (n, m) matrix and I want to map the function f over all the rows except the row under consideration. For example, here is the vanilla implementation that works:
d = [] # append the values to d
my_tensor = torch.rand(3, 5, requires_grad=True) # = (n, m)
n = my_tensor.shape[0] # number of rows
indices = list(range(n)) # list of indices
for i in range(n): # loop through the indices
    curr_range = indices[:i] + indices[i+1:] # fetch all indices except for the current one
    d.append(f(my_tensor[curr_range])) # calculate sum over all elements excluding row i
This produces n values (an (n, 1) matrix once collected), which is what I want. The problem is that PyTorch cannot auto-differentiate over this, and I'm getting errors about a missing grad because I have non-primitive Torch operations:
RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
It turns out that casting the output using x = torch.tensor(d, requires_grad=True) did the trick in the feedforward.
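One caveat worth checking: torch.tensor(d, requires_grad=True) builds a new leaf tensor from the values in d, so the error disappears but gradients do not propagate back to my_tensor. A minimal sketch, assuming the goal is to backpropagate into my_tensor, is to combine the list with torch.stack instead (the names out and out_vec are illustrative):

```python
import torch

f = lambda x: torch.sum(x)  # sum across all dimensions, producing a scalar

n, m = 3, 5
my_tensor = torch.rand(n, m, requires_grad=True)
indices = list(range(n))

d = []
for i in range(n):
    curr_range = indices[:i] + indices[i + 1:]  # all rows except row i
    d.append(f(my_tensor[curr_range]))

# torch.stack keeps each entry attached to the autograd graph.
out = torch.stack(d)  # shape (n,)
out.sum().backward()

# Each element appears in n - 1 of the sums, so every gradient entry is n - 1.
print(my_tensor.grad)

# The same values without a loop: total sum minus each row's own sum.
out_vec = my_tensor.sum() - my_tensor.sum(dim=1)
```

For this particular f, the loop-free form gives the same result and is also differentiable.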
I have two tensors in PyTorch, z is a 3d tensor of shape (n_samples, n_features, n_views) in which n_samples is the number of samples in the dataset, n_features is the number of features for each sample, and n_views is the number of different views that describe the same (n_samples, n_features) feature matrix, but with other values.
I have another 2d tensor b, of shape (n_samples, n_views), whose purpose is to rescale all the features of the samples across the different views. In other words, it encapsulates the importance of the features of each view for the same sample.
For example:
import torch
z = torch.Tensor(([[2,3], [1,1], [4,5]],
[[2,2], [1,2], [7,7]],
[[2,3], [1,1], [4,5]],
[[2,3], [1,1], [4,5]]))
b = torch.Tensor(([1, 0],
[0, 1],
[0.2, 0.8],
[0.5, 0.5]))
print(z.shape, b.shape)
>>>torch.Size([4, 3, 2]) torch.Size([4, 2])
I want to obtain a third tensor r of shape (n_samples, n_features) as a result of operations between z and b.
One possible solution is:
b = b.unsqueeze(1)
r = z * b
r = torch.sum(r, dim=-1)
print(r, r.shape)
>>>tensor([[2.0000, 1.0000, 4.0000],
[2.0000, 2.0000, 7.0000],
[2.8000, 1.0000, 4.8000],
[2.5000, 1.0000, 4.5000]]) torch.Size([4, 3])
Is it possible to achieve that same result using torch.matmul()? I've tried many times to permute the dimensions of the two tensors, but to no avail.
Yes, that's possible. If you have multiple batch dimensions in both operands, you can use broadcasting. In this case the last two dimensions of each operand are interpreted as the matrix size. (I recommend looking it up in the documentation.)
So you need an additional dimension for your vectors b, to make each of them an n x 1 "matrix" (column vector):
# original implementation
b1 = b.unsqueeze(1)
r1 = z * b1
r1 = torch.sum(r1, dim=-1)
print(r1.shape)
# using torch.matmul
r2 = torch.matmul(z, b.unsqueeze(2))[...,0]
print(r2.shape)
print((r1-r2).abs().sum()) # should be zero if we do the same operation
Alternatively, torch.einsum also makes this very straightforward.
# using torch.einsum
r3 = torch.einsum('ijk,ik->ij', z, b)
print((r1-r3).abs().sum()) # should be zero if we do the same operation
einsum is a very powerful operation that can do a lot of things: you can permute tensor dimensions, sum along them, or perform scalar products, all with or without broadcasting. It is derived from the Einstein summation convention mostly used in physics. The rough idea is that you give every dimension of your operands a name, and then, using these names, define what the output should look like. I think it is best to read the documentation. In our case we have a 4 x 3 x 2 tensor as well as a 4 x 2 tensor. So let's call the dimensions of the first tensor ijk. Here i and k should be considered the same as the dimensions of the second tensor, so this one can be described as ik. Finally, the output should clearly be ij (it must be a 4 x 3 tensor). From this "signature" ijk, ik -> ij it is clear that the dimension i is preserved, and the dimension k must be "summed/multiplied" away (scalar product).
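To make the three uses mentioned above concrete, here is a small sketch on toy tensors (the tensors a and v and the result names are assumptions chosen for illustration):

```python
import torch

a = torch.arange(6.0).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
v = torch.tensor([1.0, 2.0, 3.0])

# Permute dimensions: swapping the index names transposes the matrix.
a_t = torch.einsum('ij->ji', a)

# Sum along a dimension: j is missing from the output, so it is summed away.
row_sums = torch.einsum('ij->i', a)

# Scalar product: the shared index i is multiplied and summed, leaving a scalar.
dot = torch.einsum('i,i->', v, v)

# Matrix-vector product: j is shared and summed, i is kept.
mat_vec = torch.einsum('ij,j->i', a, v)
```

The rule in every case is the same: an index that appears in the inputs but not in the output is summed over.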
I'm watching the YouTube videos of Stanford's cs231n, and trying to do the assignments as an exercise. While doing the SVM one I ran into the following piece of code:
def svm_loss_naive(W, X, y, reg):
    """
    Structured SVM loss function, naive implementation (with loops).
    Inputs have dimension D, there are C classes, and we operate on minibatches
    of N examples.
    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c means
      that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength
    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    dW = np.zeros(W.shape) # initialize the gradient as zero
    # compute the loss and the gradient
    num_classes = W.shape[1]
    num_train = X.shape[0]
    loss = 0.0
    for i in range(num_train):
        scores = X[i].dot(W) # This line
        correct_class_score = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct_class_score + 1 # note delta = 1
            if margin > 0:
                loss += margin
Here's the line I'm having trouble with:
scores = X[i].dot(W)
This is doing the product xW; shouldn't it be Wx? By that I mean W.dot(X[i]).
Because the array shapes are (D, C) and (N, D) for W and X respectively, you can't compute W·X directly without transposing them both first (they must be (C, D)·(D, N) for matrix multiplication).
Since X.dot(W) == W.T.dot(X.T).T, the implementation simply reverses the order of the dot product instead of taking the transpose of each array. Effectively, this just comes down to a decision about how the inputs are arranged. In this case the (somewhat arbitrary) decision was made to arrange the samples and features in a more intuitive way (one sample per row), at the cost of writing the dot product as x·W.
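The relationship between the two orderings can be verified with random arrays. This is a minimal sketch; the sizes N, D, C and the variable names are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, C = 4, 5, 3
X = rng.standard_normal((N, D))   # one sample per row
W = rng.standard_normal((D, C))   # one class per column

# Row convention used in the assignment: (N, D) @ (D, C) -> (N, C).
scores_rows = X.dot(W)

# Column convention "Wx": (C, D) @ (D, N) -> (C, N).
scores_cols = W.T.dot(X.T)

# The two results are transposes of each other.
print(np.allclose(scores_rows, scores_cols.T))

# For a single sample, both orderings give the same C scores.
single_xw = X[0].dot(W)
single_wx = W.T.dot(X[0])
```

For a 1-D sample vector, X[0].dot(W) and W.T.dot(X[0]) are identical, so the choice is purely about how W is laid out.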
Suppose I have a matrix (2D tensor) X whose shape is (batch_size x num_labels), storing the label scores for each sample. Now I want to extract the true labels' scores, where the true labels are stored in another 1D tensor y whose shape is (batch_size).
What can I do?
I know that in Theano or NumPy it can be done with a single expression: X[y].
But in TensorFlow, what is the most convenient or cheapest way to achieve that?
X = tf.get_variable("X",[batch_size,num_labels])
y = tf.placeholder(tf.int32,[batch_size])
Note 0 <= y[i] <= num_labels - 1. The output z should be a 1D tensor where z[i] = X[i][y[i]].
I understand that X is a matrix containing probabilities for each class and batch instance, and that you want to get the probability of the true label. I propose one solution, though it may not be the optimal one:
# Create mask for values
increasing = tf.range(start=0, limit=tf.shape(X)[0], delta=1)
# Concatenate batch index and true label
# Note that in Tensorflow < 1.0.0 you must call tf.pack
mask = tf.stack([increasing, y], axis=1)
# Extract values
masked = tf.gather_nd(params=X, indices=mask)
Hope it helps.
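For comparison, the same gather can be sketched in NumPy, which also shows what the stacked (row, label) index pairs represent. Note that a plain X[y] in NumPy selects whole rows rather than one element per row, so the row index is needed there too. The toy values below are assumptions for illustration:

```python
import numpy as np

# Hypothetical scores: 3 samples, 3 labels.
X = np.array([[0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4],
              [0.9, 0.05, 0.05]])
y = np.array([1, 2, 0])

# Pairs of (row index, true label), analogous to the stacked mask above.
indices = np.stack([np.arange(X.shape[0]), y], axis=1)

# Pick X[i, y[i]] for every row i.
z = X[indices[:, 0], indices[:, 1]]

# By contrast, X[y] returns entire rows, giving a (3, 3) array.
rows = X[y]
```

Here z holds one score per sample, while rows has the same shape as X.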