Sinusoidal embedding - Attention is all you need

Sinusoidal embedding - Attention is all you need - python

In Attention Is All You Need, the authors implement a positional embedding (which adds information about where a word is in a sequence). For this, they use a sinusoidal embedding:
PE(pos,2i) = sin(pos/10000**(2*i/hidden_units))
PE(pos,2i+1) = cos(pos/10000**(2*i/hidden_units))
where pos is the position and i is the dimension. It must result in an embedding matrix of shape [max_length, embedding_size], i.e., given a position in a sequence, it returns the tensor of PE[position,:].
I found the Kyubyong's implementation, but I do not fully understand it.
I tried to implement it in numpy the following way:
hidden_units = 100 # Dimension of embedding
vocab_size = 10 # Maximum sentence length
# Matrix of [[1, ..., 99], [1, ..., 99], ...]
i = np.tile(np.expand_dims(range(hidden_units), 0), [vocab_size, 1])
# Matrix of [[1, ..., 1], [2, ..., 2], ...]
pos = np.tile(np.expand_dims(range(vocab_size), 1), [1, hidden_units])
# Apply the intermediate funcitons
pos = np.multiply(pos, 1/10000.0)
i = np.multiply(i, 2.0/hidden_units)
matrix = np.power(pos, i)
# Apply the sine function to the even colums
matrix[:, 1::2] = np.sin(matrix[:, 1::2]) # even
# Apply the cosine function to the odd columns
matrix[:, ::2] = np.cos(matrix[:, ::2]) # odd
# Plot
im = plt.imshow(matrix, cmap='hot', aspect='auto')
I don't understand how this matrix can give information on the position of inputs. Could someone first tell me if this is the right way to compute it and second what is the rationale behind it?
Thank you.

I found the answer in a pytorch implementation:
# keep dim 0 for padding token position encoding zero vector
position_enc = np.array([
[pos / np.power(10000, 2*i/d_pos_vec) for i in range(d_pos_vec)]
if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
position_enc[1:, 0::2] = np.sin(position_enc[1:, 0::2]) # dim 2i
position_enc[1:, 1::2] = np.cos(position_enc[1:, 1::2]) # dim 2i+1
return torch.from_numpy(position_enc).type(torch.FloatTensor)
where d_pos_vec is the embedding dimension and n_position the max sequence length.
EDIT:
In the paper, the authors say that this representation of the embedding matrix allows "the model to extrapolate to sequence lengths longer than the ones encountered during training".
The only difference between two positions is the pos variable. Check the image below for a graphical representation.

Related

Defining custom layer for tensorflow model

I'm trying to build a model in tensorflow that should take a number of points with n dimensions and find a set of hyperplanes that form a hull around one set of points while including as little of another set of points.
To do this I would input a Matrix of size [n,np] with n denoting dimensions and np denoting the number of points each defined in n dimensions. Like:
[[ 0.04370488 -0.09842589 -0.01787493]
[ 0.1415032 0.05342565 0.63025913]
[-0.84298323 -0.91433908 -0.9716289 ]
[ 0.19159608 -0.68356499 0.55441537]
[ 0.34797942 0.55592542 -0.74667198]]
As a last layer I would like to have n+1 hyperplanes that are each defined by two vectors, one of them pointing to a point on the hyperplane, the other being the normal vector of the hyperplane. In three dimensions this would give me 4 hyperplanes each defined by 2 vectors with 3 dimensions. So I guess this would be a 4x2x3 matrix or 24 values. Like:
[[0, 0, 0] [1, 0, 0]]
[[0, 0, 0] [0, 1, 0]]
[[0, 0, 0] [0, 0, 1]]
[[5, 5, 5] [-1, -1, -1]]
I was thinking this layer to either be the output of the model OR
to be used in calculating whether a point is on the in- or outside of the hull. Which could just be encoded as 0 or 1
For now I have a barebones model where I managed to input a Matrix with the correct size but couldn't yet manage to write a loss function or custom layer that makes it possible to evaluate whether a point is in or outside of the hull
The ys array is a 800,1 array containing labels for each point saying it is either a point that should be in the hull or outside the hull.
from tensorflow import keras
def in_convex_hull(point, plane_point, plane_normal):
if np.dot(plane, (point - a)) == 1:
return true
return false
def custom_loss(actual, pred):
loss = 0
return loss
def custom_layer():
return
model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[800,3])])
model.add(keras.layers.Dense(1000))
model.add(keras.layers.Dense(24))
model.compile(optimizer='Adam', loss='BinaryCrossentropy', metrics = ["accuracy"])
xs = np.array([np.random.rand(800,3) for i in range(1)])
ys = np.array([np.eye(1)[np.random.choice(1, 800)]])
history = model.fit(xs, ys, epochs=10, batch_size=1, verbose = 1)
Any pointers on how this setup could be achieved is greatly appreciated

pairwise/rowwise comparison of pytorch tensor

I have a 2D tensor representing integer coordinates on a grid.
And I would like to check my tensor for any occurences of a specific coordinate (x,y)
A psuedo-code example:
positions = torch.arange(20).repeat(2).view(-1,2)
xy_dst1 = torch.tensor((5,7))
xy_dst2 = torch.tensor((4,5))
positions == xy_dst1 # should give none
positions == xy_dst2 # should give index 2 and 12
My only solution so far is to convert the tensors to lists or tuples and then go through them iteratively, but with the conversions back and forth and the iterations that can't be a very good solution.
Does anyone know of a better solution that stays in the tensor framework?

Try
def check(positions, xy):
return (positions == xy.view(1, 2)).all(dim=1).nonzero()
print(check(positions, xy_dst1))
# Output: tensor([], size=(0, 1), dtype=torch.int64)
print(check(positions, xy_dst2))
# Output:
# tensor([[ 2],
# [12]])

Fast way to do consecutive one-to-all calculations on Numpy arrays without a for-loop?

I'm working on an optimization problem, but to avoid getting into the details, I'm going to provide a simple example of a bug that's been giving me headaches for a few days.
Say I have a 2D numpy array with observed x-y coordinates:
from scipy.optimize import distance
x = np.array([1,2], [2,3], [4,5], [5,6])
I also have a list of x-y coordinates to compare to these points (y):
y = np.array([11,13], [12, 14])
I have a function that takes the sum of manhattan differences between a value of x and all of the values in y:
def find_sum(ref_row, comp_rows):
modeled_counts = []
y = ref_row * len(comp_rows)
res = list(map(distance.cityblock, ref_row, comp_rows))
modeled_counts.append(sum(res))
return sum(modeled_counts)
Essentially, what I would like to do is find the sum of manhattan distances for every item in y with each item in x (so basically for each item in x, find the sum of the Manhattan distances between that (x,y) pair and every (x,y) pair in y).
I've tried this out with the following line of code:
z = list(map(find_sum, x, y))
However, z is of length 2 (like y), and not 4 like x. Is there a way to ensure that z is the result of consecutive one-to-all calculations? That is, I'd like to calculate the sum of all of the manhattan differences between x[0] and every set in y, and so on and so forth, so the length of z should be equal to the length of x.
Is there a simple way to do this without a for loop? My data is rather large (~ 4 million rows), so I'd really appreciate fast solutions. I'm fairly new to Python programming, so any explanations about why the solution works and is fast would be appreciated as well, but definitely isn't required!
Thanks!

This solution implements the distance in numpy, as I think it is a good example of broadcasting, which is a very useful thing to know if you need to use arrays and matrices.
By definition of Manhattan distance, you need to evaluate the sum of the absolute value of difference between each column. However, the first column of x, x[:, 0], has shape (4,) and the first column of y, y[:, 0], has shape (2,), so they are not compatible in the sense of applying subtraction: the broadcasting property says that each shape is compared starting with the trailing dimensions and two dimensions are compatible when they are equal or one of them is 1. Sadly, none of them are true for your columns.
However, you can add a new dimension of value 1 using np.newaxis, so
x[:, 0]
is array([1, 2, 4, 5]), but
x[:, 0, np.newaxis]
is
array([[1],
[2],
[4],
[5]])
and its shape is (4 ,1). Now, a matrix of shape (4, 1) subtracted by an array of shape 2 results in a matrix of shape (4, 2), by numpy's broadcasting treatment:
4 x 1
2
= 4 x 2
You can obtain the differences for each column:
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
and evaluate the sum of their absolute values:
np.abs(first_column_difference) + np.abs(second_column_difference)
which results in a (4, 2) matrix. Now, you want to sum the values for each row, so that you have 4 values:
np.sum(np.abs(first_column_difference) + np.abs(second_column_difference), axis=1)
which results in array([73, 69, 61, 57]). The rule is simple: the parameter axis will eliminate that dimension from the result, therefore using axis=1 for a (4, 2) matrix generates 4 values -- if you use axis=0, it will generate 2 values.
So, this will solve your problem:
x = np.array([[1, 2], [2, 3], [4, 5], [5, 6]])
y = np.array([[11, 13], [12, 43]])
first_column_difference = x[:, 0, np.newaxis] - y[:, 0]
second_column_difference = x[:, 1, np.newaxis] - y[:, 1]
z = np.abs(first_column_difference) + np.abs(second_column_difference)
print(np.sum(z, axis=1))
You can also skip the intermediate steps for each column and evaluate everything at once (it is a little bit harder to understand, so I prefer the method described above to explain what is happening):
print(np.abs(x[:, np.newaxis] - y).sum(axis=(1, 2)))
It is a general case for an n-dimensional Manhattan distance: if x is (u, n) and y is (v, n), it generates u rows by broadcasting (u, 1, n) by (v, n) = (u, v, n), then applying sum to eliminate the second and third axis.

Here is how you can do it using numpy broadcast with simplified explanation
Adjust Shape For Broadcasting
import numpy as np
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
## using np.newaxis as index add a new dimension at that position
## : give all the elements on that dimension
start_points = start_points[np.newaxis, :, :]
dest_points = dest_points[:, np.newaxis, :]
## Now lets check he shape of the point arrays
print('start_points.shape: ', start_points.shape) # (1, 4, 2)
print('dest_points.shape', dest_points.shape) # (2, 1, 2)
Lets try to understand
last element of shape represent x and y of a point, size 2
we can think of start_points as having 1 row and 4 columns of points
we can think of dest_points as having 2 rows and 1 columns of points
We can think start_points and dest_points as matrix or a table of points of size (1X4) and (2X1)
We clearly see that size are not compatible. What will happen if we perform arithmatic
operation between them? Here is where a smart part of numpy comes, called broadcast.
It will repeat rows of start_points to match that of dest_point making matrix of (2X4)
It will repeat columns of dest_point to match that of start_points making matrix of (2X4)
Result is arithmetic operation between every pair of elements on start_points and dest_points
Calculate the distance
diff_x_y = start_points - dest_points
print(diff_x_y.shape) # (2, 4, 2)
abs_diff_x_y = np.abs(start_points - dest_points)
man_distance = np.sum(abs_diff_x_y, axis=2)
print('man_distance:\n', man_distance)
sum_distance = np.sum(man_distance, axis=0)
print('sum_distance:\n', sum_distance)
Oneliner
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
np.sum(np.abs(start_points[np.newaxis, :, :] - dest_points[:, np.newaxis, :]), axis=(0,2))
Here is more detail explanation of broadcasting if you want to understand it more

With so many rows you can make substantial savings by using a smart algorithm. Let us for simplicity assume there is just one dimension; once we have established the algorithm, getting back to the general case is a simple matter of summing over coordinates.
The naive algorithm is O(mn) where m,n are the sizes of sets X,Y. Our algorithm is O((m+n)log(m+n)) so it scales much better.
We first have to sort the union of X and Y by coordinate and then form the cumsum over Y. Next, we find for each x in X the number YbefX of y in Y to its left and use it to look up the corresponding cumsum item YbefXval. The summed distances to all y to the left of x are YbefX times coordinate of x minus YbefXval, the distances to all y to the right are sum of all y coordinates minus YbefXval minus n - YbefX times coordinate of x.
Where does the saving come from? Sorting coordinates enables us to recycle the summations we have done before, instead of starting each time from scratch. This uses the fact that up to a sign we always sum the same y coordinates and going from left to right the signs flip one by one.
Code:
import numpy as np
from scipy.spatial.distance import cdist
from timeit import timeit
def pp(X,Y):
(m,k),(n,k) = X.shape,Y.shape
XY = np.concatenate([X.T,Y.T],1)
idx = XY.argsort(1)
Xmsk = idx<m
Ymsk = ~Xmsk
Xidx = np.arange(k)[:,None],idx[Xmsk].reshape(k,m)
Yidx = np.arange(k)[:,None],idx[Ymsk].reshape(k,n)
YbefX = Ymsk.cumsum(1)[Xmsk].reshape(k,m)
YbefXval = XY[Yidx].cumsum(1)[np.arange(k)[:,None],YbefX-1]
YbefXval[YbefX==0] = 0
XY[Xidx] = ((2*YbefX-n)*XY[Xidx]) - 2*YbefXval + Y.sum(0)[:,None]
return XY[:,:m].sum(0)
def summed_cdist(X,Y):
return cdist(X,Y,"minkowski",p=1).sum(1)
# demo
m,n,k = 1000,500,10
X,Y = np.random.randn(m,k),np.random.randn(n,k)
print("same result:",np.allclose(pp(X,Y),summed_cdist(X,Y)))
print("sort :",timeit(lambda:pp(X,Y),number=1000),"ms")
print("scipy cdist:",timeit(lambda:summed_cdist(X,Y),number=100)*10,"ms")
Sample run, comparing smart algo "sort" to naive algo implemented using cdist library function:
same result: True
sort : 1.4447695480193943 ms
scipy cdist: 36.41934019047767 ms

implementing softmax method in python

I'm trying to understand this code from lightaime's Github page. It is a vetorized softmax method. What confuses me is "softmax_output[range(num_train), list(y)]"
What does this expression mean?
def softmax_loss_vectorized(W, X, y, reg):
"""
Softmax loss function, vectorize implementation
Inputs have dimension D, there are C classes, and we operate on minibatches of N examples.
Inputs:
W: A numpy array of shape (D, C) containing weights.
X: A numpy array of shape (N, D) containing a minibatch of data.
y: A numpy array of shape (N,) containing training labels; y[i] = c means that X[i] has label c, where 0 <= c < C.
reg: (float) regularization strength
Returns a tuple of:
loss as single float
gradient with respect to weights W; an array of same shape as W
"""
# Initialize the loss and gradient to zero.
loss = 0.0
dW = np.zeros_like(W)
num_classes = W.shape[1]
num_train = X.shape[0]
scores = X.dot(W)
shift_scores = scores - np.max(scores, axis = 1).reshape(-1,1)
softmax_output = np.exp(shift_scores)/np.sum(np.exp(shift_scores), axis = 1).reshape(-1,1)
loss = -np.sum(np.log(softmax_output[range(num_train), list(y)]))
loss /= num_train
loss += 0.5* reg * np.sum(W * W)
dS = softmax_output.copy()
dS[range(num_train), list(y)] += -1
dW = (X.T).dot(dS)
dW = dW/num_train + reg* W
return loss, dW

This expression means: slice an array softmax_output of shape (N, C) extracting from it only values related to the training labels y.
Two dimensional numpy.array can be sliced with two lists containing appropriate values (i.e. they should not cause an index error)
range(num_train) creates an index for the first axis which allows to select specific values in each row with the second index - list(y). You can find it in the numpy documentation for indexing.
The first index range_num has a length equals to the first dimension of softmax_output (= N). It points to each row of the matrix; then for each row it selects target value via corresponding value from the second part of an index - list(y).
Example:
softmax_output = np.array( # dummy values, not softmax
[[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]]
)
num_train = 4 # length of the array
y = [2, 1, 0, 2] # a labels; values for indexing along the second axis
softmax_output[range(num_train), list(y)]
Out:
[3, 5, 7, 12]
So, it selects third element from the first row, second from the second row etc. That's how it works.
(p.s. Do I misunderstand you and you interested in "why", not "how"?)

The loss here is defined by following equation
Here, y is 1 for the class datapoint belongs and 0 for all other classes. Thus we are only interested in softmax outputs for datapoint class. Thus above equation can be rewritten as
Thus then following code representing above equation.
loss = -np.sum(np.log(softmax_output[range(num_train), list(y)]))
The code softmax_output[range(num_train), list(y)] is used to select softmax outputs for respective classes. range(num_train) represents all the training samples and list(y) represents respective classes.
This indexing is nicely explained Mikhail in his answer.

Summarize ndarray by 2d array in Python

I want to summarize a 3d array dat using indices contained in a 2d array idx.
Consider the example below. For each margin along dat[:, :, i], I want to compute the median according to some index idx. The desired output (out) is a 2d array, whose rows record the index and columns record the margin. The following code works but is not very efficient. Any suggestions?
import numpy as np
dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])
out = np.empty((3, 3))
for i in np.unique(idx):
out[i,] = np.median(dat[idx==i], axis = 0)
print(out)
Output:
[[ 1.5 2.5 3.5]
[ 6. 7. 8. ]
[ 9. 10. 11. ]]

To visualize the problem better, I will refer to the 2x2 dimensions of the array as the rows and columns, and the 3 dimension as depth. I will refer to vectors along the 3rd dimension as "pixels" (pixels have length 3), and planes along the first two dimensions as "channels".
Your loop is accumulating a set of pixels selected by the mask idx == i, and taking the median of each channel within that set. The result is an Nx3 array, where N is the number of distinct incides that you have.
One day, generalized ufuncs will be ubiquitous in numpy, and np.median will be such a function. On that day, you will be able to use reduceat magic1 to do something like
unq, ind = np.unique(idx, return_inverse=True)
np.median.reduceat(dat.reshape(-1, dat.shape[-1]), np.r_[0, np.where(np.diff(unq[ind]))[0]+1])
1 See Applying operation to unevenly split portions of numpy array for more info on the specific type of magic.
Since this is not currently possible, you can use scipy.ndimage.median instead. This version allows you to compute medians over a set of labeled areas in an array, which is exactly what you have with idx. This method assumes that your index array contains N densely packed values, all of which are in range(N). Otherwise the reshaping operations will not work properly.
If that is not the case, start by transforming idx:
_, ind = np.unique(idx, return_inverse=True)
idx = ind.reshape(idx.shape)
OR
idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
Since you are actually computing a separate median for each region and channel, you will need to have a set of labels for each channel. Flesh out idx to have a distinct set of indices for each channel:
chan = dat.shape[-1]
offset = idx.max() + 1
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
Now index has an identical set of regions defined in each channel, which you can use in scipy.ndimage.median:
out = scipy.ndimage.median(dat, index, index=range(offset * chan)).reshape(chan, offset).T
The input labels must be densely packed from zero to offset * chan for index=range(offset * chan) to work properly, and the reshape operation to have the right number of elements. The final transpose is just an artifact of how the labels are arranged.
Here is the complete product, along with an IDEOne demo of the result:
import numpy as np
from scipy.ndimage import median
dat = np.arange(12).reshape(2, 2, 3)
idx = np.array([[0, 0], [1, 2]])
def summarize(dat, idx):
idx = np.unique(idx, return_inverse=True)[1].reshape(idx.shape)
chan = dat.shape[-1]
offset = idx.max() + 1
index = np.stack([idx + i * offset for i in range(chan)], axis=-1)
return median(dat, index, index=range(offset * chan)).reshape(chan, offset).T
print(summarize(dat, idx))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Sinusoidal embedding - Attention is all you need - python

Related

Defining custom layer for tensorflow model

pairwise/rowwise comparison of pytorch tensor

Fast way to do consecutive one-to-all calculations on Numpy arrays without a for-loop?

implementing softmax method in python

Summarize ndarray by 2d array in Python

Categories

Resources