The aim of this question is to ask for a bare-minimal guide to get someone up to speed with TensorFlow 1 and TensorFlow 2. TF has been through major revisions and is evolving at a rapid pace, yet I feel there isn't a coherent guide that explains the differences between TF1 and TF2.
For reference when I say,
v1 or TF1 - I refer to TF 1.15.0
v2 or TF2 - I refer to TF 2.0.0
The questions I have are,
How does TF1/TF2 work? What are their key differences?
What are the different datatypes/data structures in TF1 and TF2?
What is Keras and how does it fit into all this? What kinds of APIs does Keras provide for implementing deep learning models? Can you provide examples of each?
What are the most recurring warnings/errors I have to look out for while using TF and Keras?
Performance differences between TF1 and TF2
How does TF1/TF2 work? And their differences
TF1
TF1 follows an execution style known as define-then-run. This is opposed to define-by-run, which is, for example, the normal Python execution style. But what does that mean? Define-then-run means that just because you called/defined something, it is not executed. You have to explicitly execute what you defined.
TF has this concept of a Graph. First you define all the computations you need (e.g. all the layer computations of a neural network, loss computation and an optimizer that minimizes the loss - these are represented as ops or operations). After you define the computation/data-flow graph you execute bits and pieces of this using a Session. Let's see a simple example in action.
# Graph generation
tf_a = tf.placeholder(dtype=tf.float32)
tf_b = tf.placeholder(dtype=tf.float32)
tf_c = tf.add(tf_a, tf.math.multiply(tf_b, 2.0))
# Execution
with tf.Session() as sess:
    c = sess.run(tf_c, feed_dict={tf_a: 5.0, tf_b: 2.0})
    print(c)
The computational graph (also known as data flow graph) will look like below.
tf_a      tf_b    tf.constant(2.0)
   \         \       /
    \      tf.math.multiply
     \        /
      tf.add
        |
       tf_c
Analogy: Think about you making a cake. You download the recipe from the internet. Then you start following the steps to actually make the cake. The recipe is the Graph and the process of making the cake is what the Session does (i.e. execution of the graph).
TF2
TF2 follows immediate execution style or define-by-run. You call/define something, it is executed. Let's see an example.
a = tf.constant(5.0)
b = tf.constant(3.0)
c = a + (b * 2.0)
print(c.numpy())
Woah! It looks so clean compared to the TF1 example. Everything looks so Pythonic.
Analogy: Now think that you are in a hands-on cake workshop. You are making cake as the instructor explains. And the instructor explains what the result of each step is immediately. So, unlike in the previous example you don't have to wait till you bake the cake to see if you got it right (which is a reference to the fact that you cannot debug code). But you get instant feedback on how you are doing (you know what this means).
Does that mean TF2 doesn't build a graph? Panic attack
Well yes and no. There are two features in TF2 you should know about: eager execution and AutoGraph functions.
Tip: To be exact, TF1 also had eager execution (off by default); it can be enabled using tf.enable_eager_execution(). TF2 has eager execution on by default.
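For illustration, a minimal sketch of what enabling it in TF 1.15 looks like (it has to be called at program startup, before any graph is built):

import tensorflow as tf  # TF 1.15

tf.enable_eager_execution()  # call once at startup, before building any graphs
print(tf.add(5.0, 2.0).numpy())  # 7.0 - runs immediately, no Session needed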
Eager execution
Eager execution can immediately execute Tensors and Operations. This is what you observed in the TF2 example. But the flip side is that it does not build a graph. So if, for example, you use eager execution to implement and run a neural network, it will be very slow (as neural networks do very repetitive tasks (forward computation - loss computation - backward pass) over and over again).
AutoGraph
This is where the AutoGraph feature comes to the rescue. AutoGraph is one of my favorite features in TF2. What it does is analyse a function in which you are doing "TensorFlow" stuff and build the graph for you (mind blown). So, for example, if you do the following, TensorFlow builds the graph.
@tf.function
def do_silly_computation(x, y):
    a = tf.constant(x)
    b = tf.constant(y)
    c = a + (b * 2.0)
    return c

print(do_silly_computation(5.0, 3.0).numpy())
So all you need to do is define a function which takes the necessary inputs and returns the correct output. Most importantly, add the @tf.function decorator, as that's the trigger for TensorFlow AutoGraph to analyse the given function.
Warning: AutoGraph is not a silver bullet and not to be used naively. There are various limitations of AutoGraph too.
Differences between TF1 and TF2
TF1 requires a tf.Session() object to execute the graph and TF2 doesn't
In TF1 the unreferenced variables were not collected by the Python GC, but in TF2 they are
TF1 does not promote code modularity as you need the full graph defined before starting the computations. However, with the AutoGraph function code modularity is encouraged
What are different datatypes in TF1 and TF2?
You've already seen a lot of the main data types. But you might have questions about what they do and how they behave. Well, this section is all about that.
TF1 Data types / Data structures
tf.placeholder: This is how you provide inputs to the computational graph. As the name suggests, it does not have a value attached to it. Rather, you feed a value at runtime. tf_a and tf_b are examples of these. Think of this as an empty box. You fill it with water/sand/fluffy teddy bears depending on the need.
tf.Variable: This is what you use to define the parameters of your neural network. Unlike placeholders, variables are initialized with some value. But their value can also be changed over time. This is what happens to the parameters of a neural network during back propagation.
tf.Operation: Operations are various transformations you can execute on Placeholders, Tensors and Variables. For example, tf.add() and tf.math.multiply() are operations. These operations return a Tensor (most of the time). If you want proof of an op that doesn't return a Tensor, check this out.
tf.Tensor: This is similar to a variable in the sense that it has an initial value. However, once they are defined, their value cannot be changed (i.e. they are immutable). For example, tf_c in the previous example is a tf.Tensor.
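To make these concrete, here is a minimal TF1 sketch (the variable names are mine, purely for illustration) that uses a placeholder, a Variable and an op that produces a Tensor together:

import tensorflow as tf  # TF 1.15

tf_x = tf.placeholder(dtype=tf.float32, shape=[None, 4])  # input fed at runtime
tf_w = tf.Variable(tf.ones(shape=[4, 2]))                 # mutable model parameter
tf_y = tf.matmul(tf_x, tf_w)                              # op whose result is a tf.Tensor

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())           # variables need explicit initialization
    print(sess.run(tf_y, feed_dict={tf_x: [[1., 2., 3., 4.]]}))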
TF2 Data types / Data structures
tf.Variable
tf.Tensor
tf.Operation
In terms of behavior, not much has changed in the data types going from TF1 to TF2. The main difference is that tf.placeholder is gone. You can also have a look at the full list of data types.
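If you're wondering what replaces tf.placeholder in TF2, you typically just pass values in as function arguments. A minimal sketch mirroring the earlier example (the function name compute is mine, for illustration):

import tensorflow as tf

@tf.function
def compute(a, b):
    # a and b play the role of the old placeholders tf_a and tf_b
    return a + (b * 2.0)

print(compute(tf.constant(5.0), tf.constant(2.0)).numpy())  # 9.0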
What is Keras and how does that fit in all these?
Keras used to be a separate library providing high-level implementations of components (e.g. layers and models) that are mainly used for deep learning models. But in later versions of TensorFlow, Keras was integrated into TensorFlow (as tf.keras).
So as I explained, Keras hides a lot of the unnecessary intricacies you would have to deal with if you were to work with bare-bones TensorFlow. Keras offers two main things for implementing NNs: Layer objects and Model objects. Keras also has the two most common model APIs that let you develop models: the Sequential API and the Functional API. Let's see how different Keras and TensorFlow are in a quick example. Let's build a simple CNN.
Tip: Keras allows you to achieve what you can achieve with TF much more easily. But Keras also provides capabilities that are not yet strong in TF (e.g. text processing capabilities).
height = 64
width = 64
n_channels = 3
n_outputs = 10
Keras (Sequential API) example
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(2,2),
                 activation='relu', input_shape=(height, width, n_channels)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(filters=64, kernel_size=(2,2), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.summary()
Pros
Straight-forward to implement simple models
Cons
Cannot be used to implement complex models (e.g. models with multiple inputs)
Keras (Functional API) example
inp = Input(shape=(height, width, n_channels))
out = Conv2D(filters=32, kernel_size=(2,2), activation='relu',input_shape=(height, width, n_channels))(inp)
out = MaxPooling2D(pool_size=(2,2))(out)
out = Conv2D(filters=64, kernel_size=(2,2), activation='relu')(out)
out = MaxPooling2D(pool_size=(2,2))(out)
out = Flatten()(out)
out = Dense(n_outputs, activation='softmax')(out)
model = Model(inputs=inp, outputs=out)
model.compile(loss='binary_crossentropy', optimizer='adam')
model.summary()
Pros
Can be used to implement complex models involving multiple inputs and outputs
Cons
Requires a very good understanding of the shapes of the inputs/outputs and of what each layer expects as input
TF1 example
# Input
tf_in = tf.placeholder(shape=[None, height, width, n_channels], dtype=tf.float32)
# 1st conv and max pool
conv1 = tf.Variable(tf.initializers.glorot_uniform()([2,2,3,32]))
tf_out = tf.nn.conv2d(tf_in, filters=conv1, strides=[1,1,1,1], padding='SAME') # 64,64
tf_out = tf.nn.max_pool2d(tf_out, ksize=[2,2], strides=[1,2,2,1], padding='SAME') # 32,32
# 2nd conv and max pool
conv2 = tf.Variable(tf.initializers.glorot_uniform()([2,2,32,64]))
tf_out = tf.nn.conv2d(tf_out, filters=conv2, strides=[1,1,1,1], padding='SAME') # 32, 32
tf_out = tf.nn.max_pool2d(tf_out, ksize=[2,2], strides=[1,2,2,1], padding='SAME') # 16, 16
tf_out = tf.reshape(tf_out, [-1, 16*16*64])
# Dense layer
dense = tf.Variable(tf.initializers.glorot_uniform()([16*16*64, n_outputs]))
tf_out = tf.matmul(tf_out, dense)
Pros
Is very good for cutting edge research involving atypical operations (e.g. changing the sizes of layers dynamically)
Cons
Poor readability
Caveats and Gotchas
Here I will list a few things you have to watch out for when using TF and Keras (coming from my experience).
TF1 - Forgetting to feed all the dependent placeholders to compute the result
tf_a = tf.placeholder(dtype=tf.float32)
tf_b = tf.placeholder(dtype=tf.float32)
tf_c = tf.add(tf_a, tf.math.multiply(tf_b, 2.0))
with tf.Session() as sess:
    c = sess.run(tf_c, feed_dict={tf_a: 5.0})
    print(c)
InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder_8' with dtype float
[[node Placeholder_8 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
The reason you get an error here is that you haven't fed a value to tf_b. So make sure you feed values to all the dependent placeholders in order to compute a result.
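For reference, a corrected version that feeds both placeholders would be:

with tf.Session() as sess:
    # tf_c depends on both tf_a and tf_b, so both must be fed
    c = sess.run(tf_c, feed_dict={tf_a: 5.0, tf_b: 2.0})
    print(c)  # 9.0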
TF1 - Be very very careful of data types
tf_a = tf.placeholder(dtype=tf.int32)
tf_b = tf.placeholder(dtype=tf.float32)
tf_c = tf.add(tf_a, tf.math.multiply(tf_b, 2.0))
with tf.Session() as sess:
    c = sess.run(tf_c, feed_dict={tf_a: 5, tf_b: 2.0})
    print(c)
TypeError: Input 'y' of 'Add' Op has type float32 that does not match type int32 of argument 'x'.
Can you spot the error? You have to match data types when passing them to operations. If the types differ, use the tf.cast() operation to cast your data to a compatible data type.
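For example, one possible fix for the snippet above (a minimal sketch) is to cast the int32 placeholder before the addition:

tf_a = tf.placeholder(dtype=tf.int32)
tf_b = tf.placeholder(dtype=tf.float32)
# cast tf_a to float32 so both inputs of tf.add have the same dtype
tf_c = tf.add(tf.cast(tf_a, tf.float32), tf.math.multiply(tf_b, 2.0))

with tf.Session() as sess:
    c = sess.run(tf_c, feed_dict={tf_a: 5, tf_b: 2.0})
    print(c)  # 9.0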
Keras - Understand what input shape each layer expects
model = Sequential()
model.add(Conv2D(filters=32, kernel_size=(2,2),
                 activation='relu', input_shape=(height, width)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(filters=64, kernel_size=(2,2), activation='relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(n_outputs, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam')
ValueError: Input 0 of layer conv2d_8 is incompatible with the layer: expected ndim=4, found ndim=3. Full shape received: [None, 64, 64]
Here, you have defined an input shape of [None, height, width] (once you add the batch dimension). But Conv2D expects a 4D input [None, height, width, n_channels], which is why you get the error above (the corrected first layer is shown after the list below). Some commonly misunderstood/error-prone layers are:
Conv2D layer - Expects a 4D input [None, height, width, n_channels]. To know about the convolution layer/operation in more detail have a look at this answer
LSTM layer - Expects a 3D input [None, timesteps, n_dim]
ConvLSTM2D layer - Expects a 5D input [None, timesteps, height, width, n_channels]
Concatenate layer - Except for the axis the data is concatenated on, all other dimensions need to be the same
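For completeness, the corrected first layer for the snippet above simply includes the channel dimension:

model = Sequential()
# Conv2D now receives the 4D [None, height, width, n_channels] input it expects
model.add(Conv2D(filters=32, kernel_size=(2,2),
                 activation='relu', input_shape=(height, width, n_channels)))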
Keras - Feeding in the wrong input/output shape during fit()
height=64
width = 64
n_channels = 3
n_outputs = 10
Xtrain = np.random.normal(size=(500, height, width, 1))
Ytrain = np.random.choice([0,1], size=(500, n_outputs))
# Build the model
# fit network
model.fit(Xtrain, Ytrain, epochs=10, batch_size=32, verbose=0)
ValueError: Error when checking input: expected conv2d_9_input to have shape (64, 64, 3) but got array with shape (64, 64, 1)
You should know this one. We are feeding an input of shape [batch size, height, width, 1] when we should be feeding a [batch size, height, width, 3] input.
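The fix is to make sure the data actually has the channel dimension the model was built for, e.g.:

# Generate (or load) inputs with the 3 channels the model expects
Xtrain = np.random.normal(size=(500, height, width, n_channels))
Ytrain = np.random.choice([0, 1], size=(500, n_outputs))

model.fit(Xtrain, Ytrain, epochs=10, batch_size=32, verbose=0)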
Performance differences between TF1 and TF2
This has already been discussed here, so I will not reiterate what's in there.
Things I wish I could have talked about but couldn't
I'm leaving this with some links to further reading.
tf.data.Dataset
tf.RaggedTensor
Related
I’m currently trying to use a pretrained DenseNet in my model. I’m following this tutorial: https://pytorch.org/hub/pytorch_vision_densenet/, and it works well: with an input of [1,3,244,244] it returns a [1,1000] tensor, exactly as expected.
However, currently I’m using this code to load a pretrained Densenet into my model, and use it as a “feature extraction” model. This is the code in the init function
base_model = torch.hub.load('pytorch/vision:v0.10.0', 'densenet121', pretrained=True)
self.base_model = nn.Sequential(*list(base_model.children())[:-1])
And it is being used like this in the forward function
x = self.base_model(x)
This, however, taking the same input, returns a tensor of size [1, 1024, 7, 7]. I cannot figure out what is going wrong. I think it is due to the fact that DenseNet connects all the layers together, but I do not know how to make it work the same way. Any tips on how to use a pretrained DenseNet in my own model?
Generally, nn.Modules have logic inside the forward definition, which means it won't be accessible by just converting the model to a sequential block. Most notably, you can generally find downsampling and/or flattening occurring between the CNN and the classifier layer(s) of the network. This is the case for DenseNet.
If you look at Torchvision's forward implementation of DenseNet here you will see:
def forward(self, x: Tensor) -> Tensor:
    features = self.features(x)
    out = F.relu(features, inplace=True)
    out = F.adaptive_avg_pool2d(out, (1, 1))
    out = torch.flatten(out, 1)
    out = self.classifier(out)
    return out
You can see how the tensor outputted by the CNN self.features (shaped (*, 1024, 7, 7)) is processed through a ReLU, Adaptive average pool, and flatten before being fed to the classifier (the last layer).
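So instead of slicing children() into an nn.Sequential, one option is to wrap the features sub-module and reproduce those three steps yourself. A minimal sketch (the class name is mine, for illustration):

import torch
import torch.nn.functional as F
from torchvision import models

class DenseNetFeatures(torch.nn.Module):
    """Feature extractor that mirrors DenseNet's own forward(), minus the classifier."""
    def __init__(self):
        super().__init__()
        self.features = models.densenet121(pretrained=True).features  # the CNN part only

    def forward(self, x):
        out = self.features(x)                    # (*, 1024, 7, 7)
        out = F.relu(out, inplace=True)
        out = F.adaptive_avg_pool2d(out, (1, 1))  # (*, 1024, 1, 1)
        return torch.flatten(out, 1)              # (*, 1024)

# feats = DenseNetFeatures()(torch.randn(1, 3, 224, 224))  # -> torch.Size([1, 1024])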
This code comes from https://www.kaggle.com/dkaraflos/1-geomean-nn-and-6featlgbm-2-259-private-lb. The goal of this competition is to use seismic signals to predict the timing of laboratory earthquakes. The person in this link won first place among more than 4,000 teams.
def get_model():
    inp = Input(shape=(1, train_sample.shape[1]))
    x = BatchNormalization()(inp)
    x = LSTM(128, return_sequences=True)(x)  # LSTM as first layer performed better than Dense.
    x = Convolution1D(128, (2), activation='relu', padding="same")(x)
    x = Convolution1D(84, (2), activation='relu', padding="same")(x)
    x = Convolution1D(64, (2), activation='relu', padding="same")(x)
    x = Flatten()(x)
    x = Dense(64, activation="relu")(x)
    x = Dense(32, activation="relu")(x)

    # Outputs
    ttf = Dense(1, activation='relu', name='regressor')(x)  # Time to Failure
    tsf = Dense(1)(x)  # Time Since Failure
    classifier = Dense(1, activation='sigmoid')(x)  # Binary for TTF < 0.5 seconds

    model = models.Model(inputs=inp, outputs=[ttf, tsf, classifier])
    opt = optimizers.Nadam(lr=0.008)

    # We are fitting to 3 targets simultaneously: Time to Failure (TTF), Time Since Failure (TSF),
    # and Binary for TTF < 0.5 seconds.
    # We weight the model to optimize heavily for TTF.
    # Optimizing for TSF and Binary TTF < 0.5 helps to reduce overfitting, and helps for generalization.
    model.compile(optimizer=opt, loss=['mae', 'mae', 'binary_crossentropy'],
                  loss_weights=[8, 1, 1], metrics=['mae'])
    return model
However, according to my derivation, I think x = Convolution1D(128, (2), activation='relu', padding="same")(x) and x = Dense(128, activation='relu')(x) have the same effect, because the convolution kernel performs convolution on the sequence with a time step of 1. In principle, it is very similar to a fully connected layer. Why use Conv1D here instead of directly using a fully connected layer? Is my derivation wrong?
1) Assuming you would input a sequence to the LSTM (the normal use case):
It would not be the same since the LSTM returns a sequence (return_sequences=True), thereby not reducing the input dimensionality. The output shape is therefore (Batch, Sequence, Hid). This is being fed to the Convolution1D layer which performs convolution on the Sequence dimension, i.e. on (Sequence, Hid). So in effect, the purpose of the 1D Convolutions is to extract local 1D subsequences/patches after the LSTM.
If we had return_sequences=False, the LSTM would return the final state h_t. To ensure the same behavior as a Dense layer, you need a fully connected convolutional layer, i.e. a kernel size of Sequence length, and we need as many filters as we have Hid in the output shape. This would then make the 1D Convolution equivalent to a Dense layer.
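To make that concrete, here is a small sketch (the layer sizes are mine, purely for illustration) showing that a Conv1D whose kernel spans the whole sequence has exactly the same number of parameters as a Dense layer on the flattened sequence:

from tensorflow.keras import layers, models

seq_len, hid, n_filters = 10, 64, 64  # illustrative sizes

# Conv1D whose kernel covers the entire sequence: a single, fully connected output step
conv_model = models.Sequential([
    layers.Conv1D(n_filters, kernel_size=seq_len, input_shape=(seq_len, hid)),
])

# The equivalent Dense layer applied to the flattened sequence
dense_model = models.Sequential([
    layers.Flatten(input_shape=(seq_len, hid)),
    layers.Dense(n_filters),
])

# Both have seq_len * hid * n_filters + n_filters = 41024 trainable parameters
print(conv_model.count_params(), dense_model.count_params())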
2) Assuming you do not input a sequence to the LSTM (your example):
In your example, the LSTM is used as a replacement for a Dense layer. It serves the same function, though it gives you a slightly different result as the gates do additional transformations (even though we have no sequence).
Since the Convolution is then performed on (Sequence, Hid) = (1, Hid), it is indeed operating per timestep. Since we have 128 inputs and 128 filters, it is fully connected and the kernel size is large enough to operate on the single element. This meets the above defined criteria for a 1D Convolution to be equivalent to a Dense layer, so you're correct.
As a side note, this type of architecture is something you would typically get with a Neural Architecture Search. The "replacements" used here are not really commonplace and not generally guaranteed to be better than the more established counterparts. In a lot of cases, using Reinforcement Learning or Evolutionary Algorithms can however yield slightly better accuracy using "untraditional" solutions since very small performance gains can just happen by chance and don't have to necessarily reflect back on the usefulness of the architecture.
Although not new to Machine Learning, I am still relatively new to Neural Networks, more specifically how to implement them (In Keras/Python). Feedforwards and Convolutional architectures are fairly straightforward, but I am having trouble with RNNs.
My X data consists of variable length sequences, each data-point in that sequence having 26 features. My y data, although of variable length, each pair of X and y have the same length, e.g:
X_train[0].shape: (226,26)
y_train[0].shape: (226,)
X_train[1].shape: (314,26)
y_train[1].shape: (314,)
X_train[2].shape: (189,26)
y_train[2].shape: (189,)
And my objective is to classify each item in the sequence into one of 39 categories.
What I can gather thus far from reading example code, is that we do something like the following:
encoder_inputs = Input(shape=(None, 26))
encoder = GRU(256, return_state=True)
encoder_outputs, state_h = encoder(encoder_inputs)
decoder_inputs = Input(shape=(None, 39))
decoder_gru= GRU(256, return_sequences=True)
decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
decoder_dense = Dense(39, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(loss=keras.losses.categorical_crossentropy,
optimizer=keras.optimizers.Adadelta(),
metrics=['accuracy'])
Which makes sense to me, because each of the sequences have different lengths.
So with a for loop that loops over all sequences, we use None in the input shape of the first GRU layer because we are unsure what the sequence length will be, and then return the hidden state state_h of that encoder. With the second GRU layer returning sequences, and the initial state being the state returned from the encoder, we then pass the outputs to a final softmax activation layer.
Obviously something is flawed here because I get:
decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 458, in __iter__
    "Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
This link points to a proposed solution, but I don't understand why you would add encoder states to a tuple for as many layers you have in the network.
I'm really looking for help in being able to successfully write this RNN to do this task, but also understanding. I am very interested in RNNs and want to understand them more in depth so I can apply them to other problems.
As an extra note, each sequence is of shape (sequence_length, 26), but I expand the dimension to be (1, sequence_length, 26) for X and (1, sequence_length) for y, and then pass them in a for loop to be fit, with the decoder_target_data one step ahead of the current input:
for idx in range(X_train.shape[0]):
    X_train_s = np.expand_dims(X_train[idx], axis=0)
    y_train_s = np.expand_dims(y_train[idx], axis=0)
    y_train_s1 = np.expand_dims(y_train[idx+1], axis=0)

    encoder_input_data = X_train_s
    decoder_input_data = y_train_s
    decoder_target_data = y_train_s1

    model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
              epochs=50,
              validation_split=0.2)
With other networks I have written (feedforward and CNN), I specify the model by adding layers on top of Keras's Sequential class. Because of the inherent complexity of RNNs, I see the general format of using Keras's Input class like above and retrieving hidden states (and cell states for LSTM) etc. to be logical, but I have also seen them built from Keras's Sequential class. Although those were many-to-one type tasks, I would be interested in how you would write it that way too.
The problem is that the decoder_gru layer does not return its state, therefore you should not use _ as the return value for the state (i.e. just remove , _):
decoder_outputs = decoder_gru(decoder_inputs, initial_state=state_h)
Since the input and output lengths are the same and there is a one to one mapping between the elements of input and output, you can alternatively construct the model this way:
inputs = Input(shape=(None, 26))
gru = GRU(64, return_sequences=True)(inputs)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
Now you can make this model more complex (i.e. increase its capacity) by stacking multiple GRU layers on top of each other:
inputs = Input(shape=(None, 26))
gru = GRU(256, return_sequences=True)(inputs)
gru = GRU(128, return_sequences=True)(gru)
gru = GRU(64, return_sequences=True)(gru)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
Further, instead of using GRU layers, you can use LSTM layers, which have more representational capacity (of course this may come at the cost of increased computational cost). And don't forget that when you increase the capacity of the model you increase the chance of overfitting as well. So you must keep that in mind and consider solutions that prevent overfitting (e.g. adding regularization).
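For example, a minimal sketch of the LSTM variant with some dropout-based regularization (the sizes and rates here are only illustrative):

from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(None, 26))
# dropout/recurrent_dropout are one simple way to regularize the recurrent layers
x = LSTM(128, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(inputs)
x = LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)(x)
outputs = Dense(39, activation='softmax')(x)
model = Model(inputs, outputs)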
Side note: If you have a GPU available, then you can use CuDNNGRU (or CuDNNLSTM) layer instead, which has been optimized for GPUs so it runs much faster compared to GRU.
I want to use Inception-v3 with pretrained weights on ImageNet to take inputs that are not just 3 channel RGB images but have more channels, such that the dimension is (224, 224, x!=3), and then assigning a self-defined set of weights to the following Conv2D layer. I was trying to change the input layer and the subsequent Conv2D layer such that it suits my needs, but I could not find a structured way of doing so.
I tried building a custom Conv2D tensor with Conv2D(...)(input) and assigning that to the corresponding layer of Inception, but this fails because it requires actual layers, while the above instruction yields a tensor. For what it matters, Conv2D(...)(Input) and Inception.layers[1].output yield the same correct output (which they should, since I just want to change the input dimensions and weights); the question is how to wrap the new Conv2D input-output mapping as a layer and replace it in Inception?
I could try hacking my way through this, but generally I wondered if there is a swift and elegant way of reassigning certain layers in those pretrained models with custom specifications.
Thank you!
Edit:
What works is inserting these lines at line 394 of the inception_v3.py from Keras, disabling the exception for more than 3 channel inputs and then simply calling the constructor with the desired input. (Note that Original calls the original InceptionV3 constructor)
Code:
# `model` here is the patched InceptionV3 built with the 20-channel input shape
original_model = Original(weights='imagenet', include_top=False, input_shape=(299, 299, 3))
weights = model.get_weights()
original_weights = original_model.get_weights()

# Copy all weights except the first conv kernel, whose channel count differs
for i in range(1, len(original_weights)):
    weights[i] = original_weights[i]

# Average the original 3-channel kernel over the channel axis and replicate it 20 times
averaged_weights = np.mean(weights[0], axis=2)[:, :, None, :]
replicated_weights = np.repeat(averaged_weights, 20, axis=2)
weights[0] = replicated_weights
Then I can call
InceptionV3(weights='imagenet', include_top=False, input_shape=(299, 299, 20))
This works and gives the desired result, but it seems very hacky.
I am brand new to deep learning, so I'm reading through Deep Learning with Keras by Antonio Gulli and learning a lot. I want to start using some of the concepts. I want to try to implement a neural network with a 1-dimensional convolutional layer that feeds into a bidirectional recurrent layer (like the paper below). All the tutorials or code snippets I've encountered do not implement anything remotely similar to this (e.g. image recognition) or use an older version of keras with different functions and usage.
What I'm trying to do is a variation of this paper:
(1) convert DNA sequences to one-hot encoding vectors; ✓
(2) use a 1 dimensional convolutional neural network; ✓
(3) with max pooling; ✓
(4) send the output to a bidirectional RNN; ⓧ
(5) classify the input;
I cannot figure out how to get the shapes to match up on the Bidirectional RNN. I can't even get an ordinary RNN to work at this stage. How can I restructure the incoming layers to work with a Bidirectional RNN?
Note:
The original code came from https://github.com/uci-cbcl/DanQ/blob/master/DanQ_train.py but I simplified the output layer to just do binary classification. This process was described (kind of) in https://github.com/fchollet/keras/issues/3322 but I cannot get it to work with the updated keras. The original code (and the 2nd link) work on a very large dataset, so I am generating some fake data to illustrate the concept. They are also using an older version of keras where key functionality changes have been made since then.
# Imports
import tensorflow as tf
import numpy as np
from tensorflow.python.keras._impl.keras.layers.core import *
from tensorflow.python.keras._impl.keras.layers import Conv1D, MaxPooling1D, SimpleRNN, Bidirectional, Input
from tensorflow.python.keras._impl.keras.models import Model, Sequential
# Set up TensorFlow backend
K = tf.keras.backend
K.set_session(tf.Session())
np.random.seed(0) # For keras?
# Constants
NUMBER_OF_POSITIONS = 40
NUMBER_OF_CLASSES = 2
NUMBER_OF_SAMPLES_IN_EACH_CLASS = 25
# Generate sequences
# https://pastebin.com/GvfLQte2
# Build model
# ===========
# Input Layer
input_layer = Input(shape=(NUMBER_OF_POSITIONS,4))
# Hidden Layers
y = Conv1D(100, 10, strides=1, activation="relu", )(input_layer)
y = MaxPooling1D(pool_size=5, strides=5)(y)
y = Flatten()(y)
y = Bidirectional(SimpleRNN(100, return_sequences = True, activation="tanh", ))(y)
y = Flatten()(y)
y = Dense(100, activation='relu')(y)
# Output layer
output_layer = Dense(NUMBER_OF_CLASSES, activation="softmax")(y)
model = Model(input_layer, output_layer)
model.compile(optimizer="adam", loss="categorical_crossentropy", )
model.summary()
# ~/anaconda/lib/python3.6/site-packages/tensorflow/python/keras/_impl/keras/layers/recurrent.py in build(self, input_shape)
# 1049 input_shape = tensor_shape.TensorShape(input_shape).as_list()
# 1050 batch_size = input_shape[0] if self.stateful else None
# -> 1051 self.input_dim = input_shape[2]
# 1052 self.input_spec[0] = InputSpec(shape=(batch_size, None, self.input_dim))
# 1053
# IndexError: list index out of range
You don't need to restructure anything at all to get the output of a Conv1D layer into an LSTM layer.
So, the problem is simply the presence of the Flatten layer, which destroys the shape.
These are the shapes used by Conv1D and LSTM:
Conv1D: (batch, length, channels)
LSTM: (batch, timeSteps, features)
Length is the same as timeSteps, and channels is the same as features.
Using the Bidirectional wrapper won't change a thing either. It will only duplicate your output features.
Classifying.
If you're going to classify the entire sequence as a whole, your last LSTM must use return_sequences=False. (Or you may use some flatten + dense instead after)
If you're going to classify each step of the sequence, all your LSTMs should have return_sequences=True. You should not flatten the data after them.
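Putting this together for your model, a minimal sketch (using the plain tf.keras API rather than the _impl imports, and assuming you want one label per whole sequence) could look like this:

from tensorflow.keras import layers, models

NUMBER_OF_POSITIONS = 40
NUMBER_OF_CLASSES = 2

input_layer = layers.Input(shape=(NUMBER_OF_POSITIONS, 4))
y = layers.Conv1D(100, 10, strides=1, activation="relu")(input_layer)  # (batch, 31, 100)
y = layers.MaxPooling1D(pool_size=5, strides=5)(y)                     # (batch, 6, 100)
# No Flatten: the Bidirectional RNN consumes the 3D (batch, steps, features) tensor directly
y = layers.Bidirectional(layers.SimpleRNN(100, activation="tanh", return_sequences=False))(y)
y = layers.Dense(100, activation="relu")(y)
output_layer = layers.Dense(NUMBER_OF_CLASSES, activation="softmax")(y)

model = models.Model(input_layer, output_layer)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()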