I'm trying to implement a Seq2Seq model with attention in CNTK, something very similar to CNTK Tutorial 204. However, several small differences lead to various issues and error messages, which I don't understand. There are many questions here, which are probably interconnected and all stem from some single thing I don't understand.
Note (in case it's important): my input data comes from MinibatchSourceFromData, created from NumPy arrays that fit in RAM; I don't store it in a CTF file.
ins = C.sequence.input_variable(input_dim, name="in", sequence_axis=inAxis)
y = C.sequence.input_variable(label_dim, name="y", sequence_axis=outAxis)
Thus, the shapes are [#, *](input_dim) and [#, *](label_dim).
Question 1: When I run the CNTK 204 Tutorial and dump its graph into a .dot file using cntk.logging.plot, I see that its input shapes are [#](-2,). How is this possible?
Where did the sequence axis (*) disappear to?
How can a dimension be negative?
Question 2: In the same tutorial, we have attention_axis = -3. I don't understand this. In my model there are 2 dynamic axes and 1 static axis, so the "third to last" axis would be #, the batch axis. But attention definitely shouldn't be computed over the batch axis.
I hoped that looking at the actual axes in the tutorial code would help me understand this, but the [#](-2,) issue above made this even more confusing.
Setting attention_axis to -2 gives the following error:
RuntimeError: Times: The left operand 'Placeholder('stab_result', [#, outAxis], [128])'
rank (1) must be >= #axes (2) being reduced over.
during creation of the training-time model:
def train_model(m):
    @C.Function
    def model(ins: InputSequence[Tensor[input_dim]],
              labels: OutputSequence[Tensor[label_dim]]):
        past_labels = Delay(initial_state=C.Constant(seq_start_encoding))(labels)
        return m(ins, past_labels)  # <<<<<<<<<<<<<< HERE
    return model
where stab_result is a Stabilizer right before the final Dense layer in the decoder. I can see in the dot-file that there are spurious trailing dimensions of size 1 that appear in the middle of the AttentionModel implementation.
Setting attention_axis to -1 gives the following error:
RuntimeError: Binary elementwise operation ElementTimes: Left operand 'Output('Block346442_Output_0', [#, outAxis], [64])'
shape '[64]' is not compatible with right operand
'Output('attention_weights', [#, outAxis], [200])' shape '[200]'.
where 64 is my attention_dim and 200 is my attention_span. As I understand it, the elementwise * inside the attention model definitely shouldn't be conflating these two, so -1 is not the right axis here.
Question 3: Is my understanding above correct? What should be the right axis and why is it causing one of the two exceptions above?
Thanks for the explanations!
First, some good news: A couple of things have been fixed in the AttentionModel in the latest master (will be generally available with CNTK 2.2 in a few days):
You don't need to specify an attention_span or an attention_axis. If you don't specify them and leave them at their default values, the attention is computed over the whole sequence. In fact these arguments have been deprecated.
If you do the above, the 204 notebook runs 2x faster, so the 204 notebook no longer uses these arguments.
A bug has been fixed in the AttentionModel and it now faithfully implements the Bahdanau et al. paper.
Regarding your questions:
The dimension is not negative. We use certain negative numbers in various places to mean certain things: -1 is a dimension that will be inferred once, based on the first minibatch; -2 is, I think, the shape of a placeholder; and -3 is a dimension that will be inferred with each minibatch (such as when you feed variable-sized images to convolutions). I think if you print the graph after the first minibatch, you should see that all shapes are concrete.
attention_axis is an implementation detail that should have been hidden. Basically, attention_axis=-3 will create a shape of (1, 1, 200), attention_axis=-4 will create a shape of (1, 1, 1, 200), and so on. In general, anything greater than -3 is not guaranteed to work, and anything less than -3 just adds more 1s without any clear benefit. The good news, of course, is that you can just ignore this argument in the latest master.
TL;DR: If you are on master (or starting with CNTK 2.2 in a few days), replace AttentionModel(attention_dim, attention_span=200, attention_axis=-3) with AttentionModel(attention_dim). It is faster and does not contain confusing arguments. Starting from CNTK 2.2 the original API is deprecated.
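Concretely, a minimal before/after sketch (attention_dim is set here to the question's value of 64; adjust to your model):

from cntk.layers import AttentionModel

attention_dim = 64   # the value from the question; adjust to your model

# Old call, with the now-deprecated arguments:
# attention = AttentionModel(attention_dim, attention_span=200, attention_axis=-3)

# New call (master / CNTK 2.2+): attention is computed over the whole sequence.
attention = AttentionModel(attention_dim)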
Following the example from the PyTorch docs, I am trying to solve a problem where the padding is inconsistent rather than at the end of the tensor for each batch (in other words, no pun intended, I have a left-censored and right-censored problem across my batches):
# Data structure example from docs
seq = torch.tensor([[1,2,0], [3,0,0], [4,5,6]])
# Data structure of my problem
inconsistent_seq = torch.tensor([[1,2,0], [0,3,0], [0,5,6]])
lens = ...?
packed = pack_padded_sequence(seq, lens, batch_first=True, enforce_sorted=False)
How can I solve the problem of masking these padded 0’s when running them through an LSTM using (preferably) PyTorch functionality?
I "solved" this by essentially reindexing my data and padding left-censored data with 0's (which makes sense for my problem). I also injected an extra tensor into the input dimension to track this padding. I then masked the right-censored data using the pack_padded_sequence method from the PyTorch library. Found a good source here:
https://www.kdnuggets.com/2018/06/taming-lstms-variable-sized-mini-batches-pytorch.html
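For what it's worth, below is a minimal sketch of the reindexing idea, assuming 0 is the padding value and that the relative order of the real tokens should be preserved. left_align is a hypothetical helper, not part of PyTorch:

import torch
from torch.nn.utils.rnn import pack_padded_sequence

def left_align(batch, pad_value=0):
    # Shift the non-pad entries of each row to the front and compute lengths.
    aligned = torch.full_like(batch, pad_value)
    lengths = (batch != pad_value).sum(dim=1)
    for i, row in enumerate(batch):
        vals = row[row != pad_value]
        aligned[i, :len(vals)] = vals
    return aligned, lengths

inconsistent_seq = torch.tensor([[1, 2, 0], [0, 3, 0], [0, 5, 6]])
seq, lens = left_align(inconsistent_seq)
# seq -> [[1, 2, 0], [3, 0, 0], [5, 6, 0]], lens -> [2, 1, 2]

# pack_padded_sequence expects a feature dimension, so add one for this toy example.
packed = pack_padded_sequence(seq.unsqueeze(-1).float(), lens,
                              batch_first=True, enforce_sorted=False)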
I am using astropy in Jupyter notebooks to process FITS files. I am using a third-party application called pyKLIP.
Can someone explain the construction of this error message (quoted verbatim below), specifically why it has THREE SETS of parentheses? I have found no other example of this error message with three sets of parentheses, and that makes it harder to decipher what it needs.
ValueError: operands could not be broadcast together with shapes (109,109) (2,) (109,109)
The (109,109) can only be the resolution of a FITS image, held inside a numpy array of 91 individual frames of 109 rows by 109 columns.
The (2,) I cannot yet figure out, because I cannot get print statements to print inside the third-party functions. The only 2 that I know of is an array of 91 sets of an x center and y center.
The spec going into the third party application calls for:
input = Array of shape (N,y,x) for N images of shape (y,x)
centers = Array of shape (N,2) for N input centers in the format [x_cent, y_cent]
If I print these members as size and shape I get:
dataset.input.size : 1081171
dataset.input.shape : (91, 109, 109)
dataset.centers.size : 182
dataset.centers.shape : (91, 2)
Any pointers to decipher this broadcast error are greatly appreciated. Thank you.
When searching for help on errors like this, focus on the first part of the message: "could not be broadcast together". The THREE SETS of parentheses are specific to your case. Knowing the broadcasting rules, it's obvious that these 3 arrays won't work together; your task is to identify which ones.
The fact that you found:
dataset.input.shape : (91, 109, 109)
dataset.centers.shape : (91, 2)
suggests that your code (or what's being called) is doing some sort of 'batch' iteration on the first dimension of these arrays, doing something with
dataset.input[i]
dataset.centers[i]
But without the full traceback we can't say more. Even with a traceback it can be hard to trace the immediate error back to its source in your code.
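To make the rules concrete, here is a minimal sketch that reproduces a three-shape broadcast error with your shapes; np.where is just one example of a three-operand broadcast, not necessarily the operation pyklip actually performs:

import numpy as np

frame = np.zeros((109, 109))   # one image frame
center = np.zeros(2)           # a single (x, y) center

# All three operands of np.where are broadcast together; the trailing
# dimension 2 matches neither 109 nor 1, so the broadcast fails and the
# error message reports all three shapes at once.
try:
    np.where(frame > 0, center, frame)
except ValueError as e:
    print(e)
# operands could not be broadcast together with shapes (109,109) (2,) (109,109)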
I would prefer to credit the individual whose recommendation led to the solution.
Iguananaut said get a debugger. I researched debuggers, decided Python Visual Debugger was the ticket, then discovered it required Anaconda. Installed Anaconda, then pyklip for Anaconda, then made my own print statements inside Anaconda's pyklip modules, and for whatever reason, Anaconda's pyklip modules were able to execute my print statements.
So, I got to see that the system wanted to add TWO separate ONE-instance two-element tuples, but was being fed a ONE-instance two-element tuple plus an NNN-instance array of two-element tuples. This was precisely the line where the failure message said it was.
I will have to check more, but I think the spec might have been vague enough to allow wrong conclusions. I think that "centers = Array of shape (N,2) for N input centers in the format [x_cent, y_cent]" should say the equivalent of "shape (1,2)". (IIRC tuples have no shape members.)
Can anyone speak to how or why Anaconda's installation of the third-party library pyklip allows print statements inside pyklip code to execute successfully, whereas the GitHub repo version of pyklip did not?
The material answer to the opening question is that the 109-row by 109-column image needs each cell value shifted up or down by a very small amount to effect a re-centering of the image. It could not get to that point because dx and dy, the differences between the TWO separate ONE-instance two-element tuples, could not be calculated correctly.
I am trying to get the result of a single convolution over an image using tf.keras.backend.conv2d.
The specifications of the input are 227 pixels by 227 pixels, with a channel size of 3 (an RGB image).
The filter size I would like to use is 11x11 and a stride of 4. There is no zero padding included.
I am not married to the idea of using tf.keras.backend.conv2d. I am willing to change methods/packages, just as long as I get a convolved image with the specified requirements above.
Here is the chunk of code I'm trying to make work:
import tensorflow as tf
from tensorflow import keras
import cv2
image = cv2.imread('pic.jpg')
tf.keras.backend.conv2d(image,11,strides=4,data_format="channels_last",dilation_rate=(1))
I get this error message
InvalidArgumentError: cannot compute Conv2D as input #1(zero-based) was expected to be a double tensor but is a int32 tensor [Op:Conv2D] name: convolution/
If there is anything I can add to clarify, please let me know. I can post the entirety of the code, but most of it is irrelevant, at least in my opinion.
Thank you to whoever takes their time to help me!
You are using the wrong function. What you are using is the convolution op, which takes an input and a filter tensor and performs the convolution. As such, the second argument should be the filter tensor itself. You are trying to pass 11 as the filter tensor, which obviously doesn't make sense. What I suspect you want is tf.keras.layers.Conv2D, which takes care of creating the filter according to some specification and then wraps the convolution op as well. Try this:
conv_layer = tf.keras.layers.Conv2D(1, 11, 4)
result = conv_layer(image)
This creates an 11x11 filter and a convolution op with stride 4; the second line then calls the op. I put 1 as the number of filters (first argument) since I don't know what exactly you are trying to do.
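For completeness, here is a minimal sketch of that route under the stated specs (one 11x11 filter, stride 4, no padding), assuming TF 2.x eager execution and that pic.jpg really is 227x227x3; cv2 returns uint8, so the image is cast to float and given a batch dimension first:

import numpy as np
import cv2
import tensorflow as tf

image = cv2.imread('pic.jpg')                    # (227, 227, 3), dtype uint8
x = image.astype(np.float32)[np.newaxis, ...]    # add batch dim -> (1, 227, 227, 3)

# One 11x11 filter, stride 4, no zero padding ('valid').
conv_layer = tf.keras.layers.Conv2D(filters=1, kernel_size=11, strides=4,
                                    padding='valid', data_format='channels_last')
result = conv_layer(x)
print(result.shape)   # (1, 55, 55, 1), since (227 - 11) / 4 + 1 = 55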
In Tensorflow, how would I go about selecting between a python list of Tensors in the middle of my graph as an input to the rest of the graph?
Basically, I have a python list of Tensors that are candidates to be used as inputs in the rest of the graph. I want to select one of them without adding extra dependencies that require all of the Tensors in the list to be computed (I think that would happen if I used tf.cond). How can I select one of them? I can't do it at the python level because I choose the tensor based on a value computed from a placeholder. So for example:
x = tf.placeholder(tf.int32, shape=(num_steps, None))
y = tf.placeholder(tf.int32, shape=(None,))
lengths = tf.placeholder(tf.int32, shape=(None,))
# Pretend there is a bunch of lines of code here
output_index = max_sequence_length = tf.reduce_max(lengths)
final_output = potential_outputs[output_index] # won't work, output_index is Tensor
# Pretend the rest of the model uses final_output
More info if you want it:
I am unrolling an RNN and I want to unroll only to the maximum length of the sequence. When this is less than the number of unrolling steps, there is a lot of wasted computation. dynamic_rnn and static_rnn do not meet my needs, so I am trying to come up with my own custom method of unrolling the graph.
To index in TensorFlow, use tf.slice.
It should be noted that, based on the code you provided, I don't think you are indexing the outputs correctly with the tf.reduce_max function, since it returns the actual maximum value across a given axis, which may not be an integer (but I'm not sure how your network works). You may be looking for tf.argmax, which returns the index of the maximum value. The issue with that, however, is that TensorFlow does not have a gradient defined for tf.argmax, so that function cannot be a learned part of your network.
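As a rough sketch of the slicing idea (the names below are stand-ins for your own tensors, and tf.gather is shown as an equivalent alternative):

import tensorflow as tf

# Hypothetical stand-ins: three candidate output tensors of identical shape.
potential_outputs = [tf.fill([4], float(i)) for i in range(3)]
lengths = tf.constant([1, 2, 0])

output_index = tf.reduce_max(lengths)    # scalar int32 tensor, here 2
stacked = tf.stack(potential_outputs)    # shape (3, 4)

# Slice out one row with a tensor-valued begin index...
begin = tf.stack([output_index, 0])
final_output = tf.squeeze(tf.slice(stacked, begin, [1, -1]), axis=0)

# ...or, equivalently, gather it by index:
# final_output = tf.gather(stacked, output_index)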
With this code:
X = numpy.array(range(0,5))
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
I get
tuple index out of range
self.n_features = obs[0].shape[1]
So what exactly are you supposed to pass to .fit()? The hidden states AND emissions in a tuple? If so, in what order? The documentation is less than helpful.
I noticed it likes being passed tuples as this does not give an error:
X = numpy.column_stack([range(0,5),range(0,5)])
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
Edit:
Let me clarify a bit: the documentation indicates that the expected input is:
List of array-like observation sequences (shape (n_i, n_features)).
This would almost suggest that you pass a tuple for each sample indicating, in a binary fashion, which observations are present. However, their example indicates otherwise:
# pack diff and volume for training
X = np.column_stack([diff, volume])
hence the confusion
It would appear the GaussianHMM function is for multivariate-emission-only HMM problems, hence the requirement to have >1 emission vectors. When the documentation refers to 'n_features' they are not referring to the number of ways emissions can express themselves but the number of orthogonal emission vectors.
Hence, "features" (the orthogonal emission vectors) are not to be confused with "symbols" which, in sklearn's parlance (which is likely shared with the greater hmm community for all I know), refer to what actual unique values the system is capable of emitting.
For univariate emission-vector problems, use MultinomialHMM.
Hope that clarifies things for anyone else who wants to use this stuff without becoming the world's foremost authority on HMMs :)
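For reference, a minimal sketch of a call that does not raise the error, using the multivariate packing from the question's edit (the import path is assumed; adjust it to wherever GaussianHMM comes from in your setup):

import numpy as np
from hmmlearn.hmm import GaussianHMM   # assumed import path

# Two parallel observation streams stacked into shape (5, 2), i.e.
# n_features = 2, so obs[0].shape[1] is defined and fit() succeeds.
X = np.column_stack([range(0, 5), range(0, 5)])

model = GaussianHMM(n_components=3, covariance_type='full', n_iter=1000)
model.fit([X])   # the list-of-sequences API used throughout this thread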
I realize this is an old thread, but the problem in the example code is still there. I believe the example is now at this link and it is still giving the same error:
tuple index out of range
self.n_features = obs[0].shape[1]
The offending line of code is:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit(X)
Which should be:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit([X])