This seems to be one of the most common questions about LSTMs in PyTorch, but I am still unable to figure out what should be the input shape to PyTorch LSTM.
Even after following several posts (1, 2, 3) and trying out the solutions, it doesn't seem to work.
Background: I have encoded text sequences (variable length) in a batch of size 12, and the sequences are padded and packed using the pack_padded_sequence functionality. MAX_LEN for each sequence is 384 and each token (or word) in the sequence has a dimension of 768. Hence my batch tensor could have one of the following shapes: [12, 384, 768] or [384, 12, 768].
The batch will be my input to the PyTorch rnn module (lstm here).
According to the PyTorch documentation for LSTMs, the input dimensions are (seq_len, batch, input_size), which I understand as follows:
seq_len - the number of time steps in each input stream (feature vector length).
batch - the size of each batch of input sequences.
input_size - the dimension for each input token or time step.
lstm = nn.LSTM(input_size=?, hidden_size=?, batch_first=True)
What should be the exact input_size and hidden_size values here?
You have explained the structure of your input, but you haven't made the connection between your input dimensions and the LSTM's expected input dimensions.
Let's break down your input (assigning names to the dimensions):
batch_size: 12
seq_len: 384
input_size / num_features: 768
That means the input_size of the LSTM needs to be 768.
The hidden_size does not depend on your input, but rather on how many features the LSTM should create. That number is used for the hidden state as well as the output, since the output consists of the hidden states of the last layer at each time step. You have to decide how many features you want to use for the LSTM.
Finally, for the input shape, setting batch_first=True requires the input to have the shape [batch_size, seq_len, input_size], in your case that would be [12, 384, 768].
import torch
import torch.nn as nn
# Size: [batch_size, seq_len, input_size]
input = torch.randn(12, 384, 768)
lstm = nn.LSTM(input_size=768, hidden_size=512, batch_first=True)
output, _ = lstm(input)
output.size() # => torch.Size([12, 384, 512])
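Since the question mentions padded, variable-length sequences, here is a minimal sketch of how packing fits around the LSTM. The sizes and the lengths tensor are made-up values chosen so the example runs quickly; they are not taken from the question.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Small, made-up sizes (assumption) so the example runs quickly.
batch_size, max_len, input_size, hidden_size = 3, 5, 8, 4

# Padded batch: [batch_size, max_len, input_size]
padded = torch.randn(batch_size, max_len, input_size)
# True (unpadded) length of each sequence; hypothetical values.
lengths = torch.tensor([5, 3, 2])

lstm = nn.LSTM(input_size=input_size, hidden_size=hidden_size, batch_first=True)

# Pack so the LSTM does not process the padded time steps.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)
packed_output, (h_n, c_n) = lstm(packed)

# Unpack back to a padded tensor: [batch_size, max_len, hidden_size]
output, output_lengths = pad_packed_sequence(
    packed_output, batch_first=True, total_length=max_len
)

print(output.shape)  # torch.Size([3, 5, 4])
print(h_n.shape)     # torch.Size([1, 3, 4])
```

Positions past a sequence's true length come back as zeros in output, and h_n holds the hidden state at each sequence's last real time step.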
When an image is passed through CNN layers and then an LSTM layer, the feature map shape changes like this:
BCHW -> BCHW (B x C x 1 x W)
The CNN's output should have a height of 1.
Then squeeze the height dimension:
BCHW -> BCW
For an RNN the dimension names change: [batch, seq_len, input_size] corresponds to [batch, width, channel] for an image.
BCW -> BWC, which is the batch_first tensor layout for an LSTM layer (as in PyTorch).
Finally:
BWC is [batch, seq_len, channel].
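The shape changes described above can be sketched as follows. The feature map sizes (4, 64, 1, 25) and the hidden size are hypothetical, and the CNN itself is replaced by a random tensor that already has height 1.

```python
import torch
import torch.nn as nn

# Hypothetical CNN output whose height is already 1: [B, C, 1, W]
features = torch.randn(4, 64, 1, 25)

seq = features.squeeze(2)               # [B, C, W]: drop the height axis
seq = seq.permute(0, 2, 1).contiguous() # [B, W, C]: width -> seq_len, channels -> input_size

lstm = nn.LSTM(input_size=64, hidden_size=32, batch_first=True)
output, _ = lstm(seq)

print(seq.shape)     # torch.Size([4, 25, 64])
print(output.shape)  # torch.Size([4, 25, 32])
```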
My input is an array of 64 integers.
model = Sequential()
model.add( Input(shape=(68,), name="input"))
model.add(Conv1D(64, 2, activation="relu", padding="same", name="convLayer"))
I have 10,000 of these arrays in my training set. Am I supposed to be specifying this in order for Conv1D to work?
I am getting the dreaded
ValueError: Input 0 of layer convLayer is incompatible with the layer: : expected min_ndim=3, found ndim=2. Full shape received: [None, 68]
error and I really don't understand what I need to do.
Don't let the name confuse you. The layer tf.keras.layers.Conv1D needs the following shape: (time_steps, features). If your dataset is made of 10,000 samples with each sample having 64 values, then your data has the shape (10000, 64), which is not directly applicable to the tf.keras.layers.Conv1D layer. You are missing the time_steps dimension. What you can do is use the tf.keras.layers.RepeatVector, which repeats your array input n times, in the example 5. This way your Conv1D layer gets an input of the shape (5, 64). Check out the documentation for more information:
time_steps = 5
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=(64,), name="input"))
model.add(tf.keras.layers.RepeatVector(time_steps))
model.add(tf.keras.layers.Conv1D(64, 2, activation="relu", padding="same", name="convLayer"))
As a side note, you should ask yourself if using a tf.keras.layers.Conv1D layer is the right option for your use case. This layer is usually used for NLP and other time series tasks. For example, in sentence classification, each word in a sentence is usually mapped to a high-dimensional word vector representation, as seen in the image. This results in data with the shape (time_steps, features).
If you want to use character one hot encoded embeddings it would look something like this:
This is a simple example of one single sample with the shape (10, 10) --> 10 characters along the time series dimension and 10 features. It should help you understand the tutorial I mentioned a bit better.
The Conv1D layer does temporal convolution, that is, along the first dimension (not the batch dimension of course), so you should put something like this:
time_steps = 5
model = tf.keras.Sequential()
model.add(tf.keras.layers.Input(shape=(time_steps, 64), name="input"))
model.add(tf.keras.layers.Conv1D(64, 2, activation="relu", padding="same", name="convLayer"))
You will need to slice your data into time_steps temporal slices to feed the network.
However, if your arrays don't have a temporal structure, then conv1D is not the layer you are looking for.
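One possible reading of "slicing into temporal slices" is sketched below with NumPy. It assumes a hypothetical recording of 100 consecutive time steps with 64 features each, and uses non-overlapping windows; overlapping windows would be an equally valid choice.

```python
import numpy as np

time_steps = 5
features = 64

# Hypothetical recording: 100 consecutive time steps, 64 features each.
series = np.random.rand(100, features)

# Non-overlapping windows: drop any leftover steps, then reshape.
n_windows = series.shape[0] // time_steps
windows = series[: n_windows * time_steps].reshape(n_windows, time_steps, features)

print(windows.shape)  # (20, 5, 64)
```

Each entry of windows then matches the Input(shape=(time_steps, 64)) layer above, with the first axis acting as the batch dimension.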
I'm working in the field of machine learning.
To build a stronger network, I'm going to adopt techniques based on Conv1D.
The input data is a one-dimensional list, so I thought that Conv1D would be the best choice.
What would happen if the input size is (1, 740)? Would it be okay for the input channel to be 1?
I mean, I have a feeling that the Conv1D output for a (1, 740) tensor should be the same as that of a simple Linear network.
Of course I'll also include other conv1d layer, like below.
self.conv1 = torch.nn.Conv1d(in_channels=1, out_channels=64, kernel_size=5)
self.conv2 = torch.nn.Conv1d(in_channels=64,out_channels=64, kernel_size=5)
self.conv3 = torch.nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5)
self.conv4 = torch.nn.Conv1d(in_channels=64, out_channels=64, kernel_size=5)
Would it make sense when an input channel is 1?
Thanks in advance. :)
I think it's fine.
Note that the input of Conv1D should be (B, N, M), where B is the batch size, N is the number of channels (e.g. for RGB is 3) and M is the number of features.
The out_channels refers to the number of filters to use; with kernel_size=5 each filter is one-dimensional with length 5 (not 5x5). Look at the output shape of the following code:
import torch
import torch.nn as nn
k = nn.Conv1d(1, 64, kernel_size=5)
input = torch.randn(1, 1, 740)
print(k(input).shape) # -> torch.Size([1, 64, 736])
The width drops from 740 to 736 because no padding is used, so the dimension isn't kept: 740 - 5 + 1 = 736.
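To make the arithmetic explicit, here is a small helper (the function name is my own) that follows the output-length formula given in the nn.Conv1d documentation. It is plain Python, so it only predicts shapes and does not run a convolution.

```python
def conv1d_output_length(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output width of a 1-D convolution, per the formula in the nn.Conv1d docs."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv1d_output_length(740, kernel_size=5))             # 736
print(conv1d_output_length(740, kernel_size=5, padding=2))  # 740

# Four stacked layers with kernel_size=5 and no padding, as in the question.
length = 740
for _ in range(4):
    length = conv1d_output_length(length, kernel_size=5)
print(length)  # 724
```

With padding=2 the width is preserved, which is one way to keep the dimension if that matters for later layers.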
The nn.Conv1d layer takes an input of shape (b, c, w) (where b is the batch size, c the number of channels, and w the input width). Its kernel is one-dimensional. It slides the same kernel across the width dimension (batch and channel axes aside), like a 'sliding window', so the same operation is applied at every position of the input (whether the convolution is 1D, 2D, or 3D). As such, the number of weights depends only on kernel_size and the channel counts, not on w. This is the main characteristic of a convolution layer.
Conv1d makes it possible to extract a feature regardless of where it is located in the input data: at the beginning or at the end of your w-width input. This makes sense if your input is temporal (an input sequence over time) or spatial data (an image).
On the other hand, nn.Linear takes a 1D tensor as input (per sample) and returns another 1D tensor. You could consider w to be the number of input neurons. You would end up having w*output_dim weights (plus output_dim biases). If your input contains components which are independent of one another (like a one-hot or multi-hot encoding), then a fully connected layer such as nn.Linear would be preferred.
These two behave differently. When using nn.Linear in scenarios where you should use nn.Conv1d, training would ideally end up with neurons sharing equal weights, if that makes sense... but in practice it probably won't. Fully connected layers were used in the past in deep learning for computer vision. Today convolutions are used because they are much more efficient and better suited to these types of tasks.
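As a rough illustration of the efficiency point, the parameter counts can be compared by hand. The helper names are hypothetical, and the Linear layer with 64 outputs is just an assumed comparison point, not something from the question.

```python
def conv1d_params(in_channels, out_channels, kernel_size):
    # One weight per (out_channel, in_channel, kernel position), plus one bias per out_channel.
    return out_channels * in_channels * kernel_size + out_channels

def linear_params(in_features, out_features):
    # One weight per (input, output) pair, plus one bias per output.
    return in_features * out_features + out_features

print(conv1d_params(1, 64, 5))  # 384
print(linear_params(740, 64))   # 47424
```

The convolution's count does not change if the input width grows, whereas the fully connected layer's count scales with it.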
I would like to create a 'Sequential' model (a Time Series model as you might have guessed), that takes 20 days of past data with a feature size of 2, and predict 1 day into the future with the same feature size of 2.
I found out you need to specify the batch size for a stateful LSTM model, so if I specify a batch size of 32 for example, the final output shape of the model is (32, 2), which I think means the model is predicting 32 days into the future rather than 1.
How would I go on fixing it?
Also, asking before I arrive to the problem; if I specify a batch size of 32 for example, but I want to predict on an input of shape (1, 20, 2), would the model predict correctly or what, since I changed to batch size from 32 to 1. Thank you.
You don't need to specify batch_size. But you should feed a 3-D tensor:
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras import Model, Sequential
features = 2
dim = 128
new_model = Sequential([
    LSTM(dim, stateful=True, return_sequences=True),
    Dense(2)
])
number_of_sequences = 1000
sequence_length = 20
input = tf.random.uniform([number_of_sequences, sequence_length, features], dtype=tf.float32)
output = new_model(input) # shape is (number_of_sequences, sequence_length, features)
predicted = output[:,-1] # last time step only; shape is (number_of_sequences, features)
Shape of (32, 2) means that your sequence length is 32.
Batch size is a parameter of training (how many sequences should be fed to the model before backpropagating the error; see the stochastic gradient descent method). It doesn't affect your data, which should be 3-D: (number of sequences, length of sequence, features).
If you need to predict only one sequence - just feed tensor of shape (1, 20, 2) to the model.
In keras documentation, input tensor for dense layer takes the input as:
Input shape
nD tensor with shape: (batch_size, ..., input_dim). The most common
situation would be a 2D input with shape (batch_size, input_dim).
To my understanding, batch size in input tensor is the amount of examples you give for training or predicting.
For the batch_size in model.fit,
batch_size: Integer or None.
Number of samples per gradient update. If unspecified, batch_size will
default to 32.
So are the 2 batch size doing the same thing, reducing the input data so as to prevent memory from filling up completely?
Also, I understand that the batch_size in input shape is optional, as keras puts a None if not specified. Is specifying batch_size necessary in model.fit?
Both batch_size arguments are referring to the same thing, i.e. what you described as how many examples to feed into the model at once.
As for your other question, specifying batch_size in model.fit is not necessary. The official Keras documentation (https://keras.io/models/model/) says of model.fit: "batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32". This is similar to the input shape, where the batch dimension is optional as well.
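To see what "number of samples per gradient update" means in practice, the number of updates per epoch can be worked out by hand. This is plain arithmetic, not a Keras call; the function name and the 10,000-sample dataset size are assumptions for illustration.

```python
import math

def updates_per_epoch(num_samples, batch_size=32):
    # The last batch may be smaller, hence the ceiling.
    return math.ceil(num_samples / batch_size)

print(updates_per_epoch(10000))       # 313  (default batch_size of 32)
print(updates_per_epoch(10000, 100))  # 100
```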
I am learning deep learning and am trying to understand the PyTorch code given below. I'm struggling to understand how the probability calculation works. Can someone break it down in layman's terms? Thanks a ton.
ps = model.forward(images[0,:])
# Hyperparameters for our network
input_size = 784
hidden_sizes = [128, 64]
output_size = 10
# Build a feed-forward network
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], output_size),
                      nn.Softmax(dim=1))
print(model)
# Forward pass through the network and display output
images, labels = next(iter(trainloader))
images.resize_(images.shape[0], 1, 784)
print(images.shape)
ps = model.forward(images[0,:])
I'm a layman so I'll help you with the layman's terms :)
input_size = 784
hidden_sizes = [128, 64]
output_size = 10
These are parameters for the layers in your network. Each neural network consists of layers, and each layer has an input and an output shape.
Specifically input_size deals with the input shape of the first layer. This is the input_size of the entire network. Each sample that is input into the network will be a 1 dimension vector that is length 784 (array that is 784 long).
hidden_sizes deals with the shapes inside the network. We will cover this a little later.
output_size deals with the output shape of the last layer. This means that our network will output a 1 dimensional vector that is length 10 for each sample.
Now to break up model definition line by line:
model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
The nn.Sequential part simply defines a network, each argument that is input defines a new layer in that network in that order.
nn.Linear(input_size, hidden_sizes[0]) is an example of such a layer. It is the first layer of our network; it takes in an input of size input_size and outputs a vector of size hidden_sizes[0]. The size of the output is considered "hidden" in that it is not the input or the output of the whole network. It is "hidden" because it's located inside the network, far from the input and output ends that you interact with when you actually use it.
This is called Linear because it applies a linear transformation by multiplying the input by its weight matrix and adding its bias vector to the result. (Y = Ax + b, Y = output, x = input, A = weights, b = bias).
nn.ReLU(),
ReLU is an example of an activation function. What this function does is apply some sort of transformation to the output of the last layer (the layer discussed above), and outputs the result of that transformation. In this case the function being used is the ReLU function, which is defined as ReLU(x) = max(x, 0). Activation functions are used in neural networks because they create non-linearities. This allows your model to model non-linear relationships.
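To make the two operations described so far concrete, here is a NumPy sketch of one linear layer followed by ReLU. The weights are random stand-ins (an assumption); a real nn.Linear learns them during training.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=784)         # one flattened input sample
A = rng.normal(size=(128, 784))  # weight matrix (random stand-in)
b = rng.normal(size=128)         # bias vector (random stand-in)

y = A @ x + b                    # linear layer: Y = Ax + b
h = np.maximum(y, 0)             # ReLU(x) = max(x, 0), applied element-wise

print(y.shape, h.shape)  # (128,) (128,)
print(h.min() >= 0)      # True
```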
nn.Linear(hidden_sizes[0], hidden_sizes[1]),
From what we discussed above, this is another example of a layer. It takes an input of size hidden_sizes[0] (the same shape as the output of the previous layer) and outputs a 1D vector of length hidden_sizes[1].
nn.ReLU(),
Applies the ReLU function again.
nn.Linear(hidden_sizes[1], output_size)
Same as the above two layers, but our output shape is the output_size this time.
nn.Softmax(dim=1))
Another activation function. This activation function turns the logits outputted by nn.Linear into an actual probability distribution. This lets the model output the probability for each class. At this point our model is built.
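How softmax turns logits into probabilities can be sketched in a few lines of NumPy. The three logit values are made up; subtracting the maximum is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exps = np.exp(shifted)             # make every value positive
    return exps / exps.sum()           # normalize so the values sum to 1

logits = np.array([2.0, 1.0, 0.1])     # hypothetical raw scores for 3 classes
probs = softmax(logits)

print(probs)        # each value lies between 0 and 1
print(probs.sum())  # 1.0 up to floating-point rounding
```

The largest logit gets the largest probability, so the predicted class is the index of the maximum.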
# Forward pass through the network and display output
images, labels = next(iter(trainloader))
images.resize_(images.shape[0], 1, 784)
print(images.shape)
These lines simply preprocess the training data and put it into the correct format (each image is flattened into a vector of length 784).
ps = model.forward(images[0,:])
This passes the first image through the model (forward pass) and applies the operations previously discussed in each layer. The resulting output, ps, holds the probability for each class.
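Putting it all together, here is a self-contained sketch of the same network. Since trainloader is not defined in the question, a random tensor stands in for one batch of 64 images of size 1x28x28 (an assumption), and view is used instead of the in-place resize_.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10), nn.Softmax(dim=1),
)

# Random stand-in for one batch from trainloader: 64 images of 1x28x28.
images = torch.randn(64, 1, 28, 28)
images = images.view(images.shape[0], 1, 784)  # flatten each image to 784 values

ps = model(images[0, :])  # input shape [1, 784]
print(ps.shape)           # torch.Size([1, 10])
print(ps.sum().item())    # approximately 1.0
```

Because the weights are untrained, the ten probabilities are arbitrary here; training is what makes the highest probability land on the correct class.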