Pytorch pad packed data based on each sequence's length

Pytorch pad packed data based on each sequence's length - python

I have a packed data and each sequences' length.
Example:
data = torch.tensor([4, 1, 3, 5, 2, 6])
lengths = torch.tensor([2,1,3])
I want to create a pad 2-D (batch_size,max_lengths) matrix like:
output = torch.tensor([[4,1,0], #length=2
[3,0,0],#length=1
[5,2,6])#length=3
And due to my training purpose, this operation should be able to track backward gradient.

Related

How to convert sparse to dense adjacency matrix?

I am trying to convert a sparse adjacency matrix/list that only contains the indices of the non-zero elements ([[rows], [columns]]) to a dense matrix that contains 1s at the indices and otherwise 0s. I found a solution using to_dense_adj from Pytorch geometric (Documentation). But this does not exactly what I want, since the shape of the dense matrix is not as expected. Here is an example:
sparse_adj = torch.tensor([[0, 1, 2, 1, 0], [0, 1, 2, 3, 4]])
So the dense matrix should be of size 5x3 (the second array "stores" the columns; with non-zero elements at (0,0), (1,1), (2,2),(1,3) and (0,4)) because the elements in the first array are lower or equal than 2.
However,
dense_adj = to_dense(sparse_adj)[0]
outputs a dense matrix, but of shape (5,5). Is it possible to define the output shape or is there a different solution to get what I want?
Edit: I have a solution to convert it back to the sparse representation now that works
dense_adj = torch.sparse.FloatTensor(sparse_adj, torch.ones(5), torch.Size([3,5])).to_dense()
ind = dense_adj.nonzero(as_tuple=False).t().contiguous()
sparse_adj = torch.stack((ind[1], ind[0]), dim=0)
Or is there any alternative way that is better?

You can acheive this by first constructing a sparse matrix with torch.sparse then converting it to a dense matrix. For this you will need to provide torch.sparse.FloatTensor a 2D tensor of indices, a tensor of values as well as a output size:
sparse_adj = torch.tensor([[0, 1, 2, 1, 0], [0, 1, 2, 3, 4]])
torch.sparse.FloatTensor(sparse_adj, torch.ones(5), torch.Size([3,5])).to_dense()
You can get the size of the output matrix dynamically with
sparse_adj.max(axis=1).values + 1
So it becomes:
torch.sparse.FloatTensor(
sparse_adj,
torch.ones(sparse_adj.shape[1]),
(sparse_adj.max(axis=1).values + 1).tolist())

Randomly select rows from numpy array based on a condition

Let's say I have 2 arrays of arrays, labels is 1D and data is 5D note that both arrays have the same first dimension.
To simplify things let's say labels contain only 3 arrays :
labels=np.array([[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]])
And let's say I have a datalist of data arrays (length=3) where each array has a 5D shape where the first dimension of each one is the same as the arrays of the labels array.
In this example, datalist has 3 arrays of shapes : (8,3,100,10,1), (5,3,100,10,1) and (10,3,100,10,1) respectively. Here, the first dimension of each of these arrays is the same as the lengths of each array in label.
Now I want to reduce the number of zeros in each array of labels and keep the other values. Let's say I want to keep only 3 zeros for each array. Therefore, the length of each array in labels as well as the first dimension of each array in data will be 6, 4 and 8.
In order to reduce the number of zeros in each array of labels, I want to randomly select and keep only 3. Now these same random selected indexes will be used then to select the correspondant rows from data.
For this example, the new_labels array will be something like this :
new_labels=np.array([[0,0,1,1,2,0],[4,0,0,0],[0,3,2,1,0,1,7,0]])
Here's what I have tried so far :
all_ind=[] #to store indexes where value=0 for all arrays
indexes_to_keep=[] #to store the random selected indexes
new_labels=[] #to store the final results
for i in range(len(labels)):
ind=[] #to store indexes where value=0 for one array
for j in range(len(labels[i])):
if (labels[i][j]==0):
ind.append(j)
all_ind.append(ind)
for k in range(len(labels)):
indexes_to_keep.append(np.random.choice(all_ind[i], 3))
aux= np.zeros(len(labels[i]) - len(all_ind[i]) + 3)
....
....
Here, how can I fill **aux** with the values ?
....
....
new_labels.append(aux)
Any suggestions ?

Playing with numpy arrays of different lenghts is not a good idea therefore you are required to iterate each item and perform some method on it. Assuming you want to optimize that method only, masking might work pretty well here:
def specific_choice(x, n):
'''leaving n random zeros of the list x'''
x = np.array(x)
mask = x != 0
idx = np.flatnonzero(~mask)
np.random.shuffle(idx) #dynamical change of idx value, quite fast
idx = idx[:n]
mask[idx] = True
return x[mask] # or mask if you need it
Iteration of list is faster than one of array so effective usage would be:
labels = [[0,0,0,1,1,2,0,0],[0,4,0,0,0],[0,3,0,2,1,0,0,1,7,0]]
output = [specific_choice(n, 3) for n in labels]
Output:
[array([0, 1, 1, 2, 0, 0]), array([0, 4, 0, 0]), array([0, 3, 0, 2, 1, 1, 7, 0])]

Extract patches similar to that of max pooling or separable convolution

I am trying to create a custom layer that is similar to Max Pooling or the first step of a separable convolution.
For example with a 2-Tensor in which I want to extract the non-overlapping 2x2 patches:
if I have the [4,4] tensor
[[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9,10,11],
[12,13,14,15]]
I want to end up with the following [2,2,4] Tensor
[[[ 0, 1, 4, 5],[ 2, 3, 6, 7]],
[[ 8, 9,12,13],[10,11,14,15]]]
For a 3-Tensor, I want something similar but to also separate out the 3rd dimension. tf.extract_image_patches almost does what I want, but it folds the "depth" dimension into each patch.
Ideally if I had a tensor of shape [32,64,7] and wanted to extract all the [2,2] patches out of it: I would end up with a shape of [16,32,7,4]
To be clear, I just want to extract the patches, not to actually do max pooling nor separable convolution.
Since I am not actually augmenting the data, I suspect that you can do it with some tf.reshape trickery... Is there any nice way to achieve this in tensorflow without resorting to slicing+stitching/for loops?
Also, what is the correct terminology for this operation? Windowing? Tiling?

Turns out this is really easy to do with tf.transpose. The solution that ended up working for me is:
#Assume x is in BHWC form
def pool(x,size=2):
channels = x.get_shape()[-1]
x = tf.extract_image_patches(
x,
ksizes=[1,size,size,1],
strides=[1,size,size,1],
rates=[1,1,1,1],
padding="SAME"
)
x = tf.reshape(x,[-1],x.get_shape()[1:3]+[size**2,channels])
x = tf.transpose(x,[0,1,2,4,3])
return x

How to replicate numpy.choose() in tensorflow?

I'm trying to efficiently replicate numpy's ndarray.choose() method.
Here's a numpy example of what I'm looking for:
b = np.arange(15).reshape(3, 5)
c = np.array([1,0,4])
c.choose(b.T) # trying to replicate in tensorflow
-> array([ 1, 5, 14])
The best I've been able to do with this is generate a batch_size square matrix (which is huge if batch size is huge) and take the diagonal of it:
tf_b = tf.constant(b)
tf_c = tf.constant(c)
sess.run(tf.diag_part(tf.gather(tf.transpose(tf_b), tf_c)))
-> array([ 1, 5, 14])
Is there a way to do this that is just linear in the first dimension (instead of squared)?

Yeah, there's an easier way to do this. Flatten your b array to 1-d, so it's [0, 1, 2, ..., 13, 14]. Take an array of indices that are in the range of the number of 'choices' you are taking (3 in your case). That will be [0, 1, 2]. Multiply this range by the second dimension of your original shape, which is the number of options for each choice (5 in your case). That gives you [0, 5, 10]. Then add your indices to this to obtain [1, 5, 14]. Now you're good to call tf.gather().
Here is some code that I've taken from here that does a similar thing for RNN outputs. Yours will be slightly different, but the idea is the same.
index = tf.range(0, batch_size) * max_length + (length - 1)
flat = tf.reshape(output, [-1, out_size])
relevant = tf.gather(flat, index)
return relevant
In a big picture, the operation is pretty straightforward. You use the range operation to get the index of the beginning of each row, then add the index of where you are in each row. I think doing it in 1D is easiest, so that's why we flatten it.

PyBrain addSample multi-dimensional array

In all of the examples it seems that addSample(input, target) is used with 1 dimensional arrays, such as:
INPUT = 5
OUTPUT = 1
input = [5, 5, 5, 5, 5]
target = [1]
ds = SequentialDataSet(5, 1)
#add data using addSample
How does one do this when the input is multi-dimensional in this way:
input = [[5, 5, 5, 5, 5], [5, 5, 5, 5, 5]]
target = [1]
How does one use addSample with such structures? I tried this:
ds = SequentialDataSet(2, 1)
ds.addSample(input, target)
and get the error message:
Could not broadcast input array from shape (2, 5) into shape 2.
Meaning the SequentialDataSet(2, 1) does not work for this structure, but SequentialDataSet((2, 5), 1) also errors. This should be easy but I cannot find the answer.

It looks like you're trying to train some sort of Feed Forward network, perhaps a multi-layer perceptron? 5 layers in, one or more hidden layers, and a single output layer but it's not clear so this is a leap on my end.
Either way your input layer should be a single array. If you have a structure, or multi-dimensional array you'll need to collapse it and feed it in as a single set of data. So for your 5x2 suggestion you'd simply have 10 elements on the input, and you would be responsible for "parsing" your input structures consistently as they're fed into the network. For a 5x5 structure you'd have 25 inputs etc.
In my experience a big part of the success/challenge with ANNs is structuring the data in so that the input form is normalized and represented in a way that the network can mathematically find a pattern with.

According to the post linked beneath you should just input one array:
Pybrain multi dimensional data input
For SequentialDataSet I used this example:
data = [(1,2), (1,3), (10,2), (2,0), (2,9), (4,3), (1,2), (10,5)]
ds = SequentialDataSet(2,2)
for sample, next_sample in zip(data, cycle(test_data[1:])):
ds.addSample(sample, next_sample)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Pytorch pad packed data based on each sequence's length - python

Related

How to convert sparse to dense adjacency matrix?

Randomly select rows from numpy array based on a condition

Extract patches similar to that of max pooling or separable convolution

How to replicate numpy.choose() in tensorflow?

PyBrain addSample multi-dimensional array

Categories

Resources