This question is a follow-up to my previous question, Multi-feature causal CNN - Keras implementation. Several things are still unclear to me, which I think warrants a new question. The model here was built according to the accepted answer to that post.
I am trying to apply a causal CNN model to multivariate time-series data with a lookback of 10 time steps and 5 features.
lookback, features = 10, 5
What should filters and kernel be set to?
What is the effect of filters and kernel on the network?
Are these just an arbitrary number - i.e. number of neurons in ANN layer?
Or will they have an effect on how the net interprets the time-steps?
What should dilations be set to?
Is this just an arbitrary number or does this represent the lookback of the model?
filters = 32
kernel = 5
dilations = 5
dilation_rates = [2 ** i for i in range(dilations)]
model = Sequential()
model.add(InputLayer(input_shape=(lookback, features)))
model.add(Reshape(target_shape=(features, lookback, 1), input_shape=(lookback, features)))
According to the previously mentioned answer, the input needs to be reshaped according to the following logic:
After Reshape 5 input features are now treated as the temporal layer for the TimeDistributed layer
When Conv1D is applied to each input feature, it thinks the shape of the layer is (10, 1)
with the default "channels_last", therefore...
10 time-steps is the temporal dimension
1 is the "channel", the new location for the feature maps
# Add causal layers
for dilation_rate in dilation_rates:
    model.add(TimeDistributed(Conv1D(filters=filters,
                                     kernel_size=kernel,
                                     padding='causal',
                                     dilation_rate=dilation_rate,
                                     activation='elu')))
According to the mentioned answer, the model needs to be reshaped, according to the following logic:
Stack feature maps on top of each other so each time step can look at all features produced earlier - (10 time steps, 5 features * 32 filters)
Next, causal layers are now applied to the 5 input features dependently.
Why were they initially applied independently?
Why are they now applied dependently?
model.add(Reshape(target_shape=(lookback, features * filters)))
next_dilations = 3
dilation_rates = [2 ** i for i in range(next_dilations)]
for dilation_rate in dilation_rates:
    model.add(Conv1D(filters=filters,
                     kernel_size=kernel,
                     padding='causal',
                     dilation_rate=dilation_rate,
                     activation='elu'))
model.add(MaxPool1D())
model.add(Flatten())
model.add(Dense(units=1, activation='linear'))
model.summary()
SUMMARY
What should filters and kernel be set to?
Will they have an effect on how the net interprets the time-steps?
What should dilations be set to in order to represent a lookback of 10?
Why are causal layers initially applied independently?
Why are they applied dependently after reshape?
Why not apply them dependently from the beginning?
===========================================================================
FULL CODE
# Imports assumed (tf.keras); the original post did not show them
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (InputLayer, Reshape, TimeDistributed,
                                     Conv1D, MaxPool1D, Flatten, Dense)

lookback, features = 10, 5
filters = 32
kernel = 5
dilations = 5
dilation_rates = [2 ** i for i in range(dilations)]

model = Sequential()
model.add(InputLayer(input_shape=(lookback, features)))
model.add(Reshape(target_shape=(features, lookback, 1), input_shape=(lookback, features)))

# Add causal layers (applied to each feature independently)
for dilation_rate in dilation_rates:
    model.add(TimeDistributed(Conv1D(filters=filters,
                                     kernel_size=kernel,
                                     padding='causal',
                                     dilation_rate=dilation_rate,
                                     activation='elu')))

model.add(Reshape(target_shape=(lookback, features * filters)))

# Add causal layers (applied to all stacked feature maps together)
next_dilations = 3
dilation_rates = [2 ** i for i in range(next_dilations)]
for dilation_rate in dilation_rates:
    model.add(Conv1D(filters=filters,
                     kernel_size=kernel,
                     padding='causal',
                     dilation_rate=dilation_rate,
                     activation='elu'))
    model.add(MaxPool1D())

model.add(Flatten())
model.add(Dense(units=1, activation='linear'))
model.summary()
===========================================================================
EDIT:
Daniel, thank you for your answer.
Question:
If you can explain "exactly" how you're structuring your data, what is the original data and how you're transforming it into the input shape, if you have independent sequences, if you're creating sliding windows, etc. A better understanding of this process could be achieved.
Answer:
I hope I understand your question correctly.
Each feature is a sequence array of time-series data. They are independent, as in, they are not an image, however, they correlate with each other somewhat.
That is why I am trying to use WaveNet, which is very good at predicting a single time series; my problem, however, requires multiple features.
Comments about the given answer
Questions:
Why are causal layers initially applied independently?
Why are they applied dependently after reshape?
Why not apply them dependently from the beginning?
That answer is sort of strange. I'm not an expert, but I don't see the need to keep independent features with a TimeDistributed layer. But I also cannot say whether it gives a better result or not. At first I'd say it's just unnecessary. But it might bring extra intelligence though, given that it might see relations that involve distant steps between two features instead of just looking at "same steps". (This should be tested)
Nevertheless, there is a mistake in that approach.
The reshapes that are intended to swap lookback and feature sizes are not doing what they are expected to do. The author of the answer clearly wants to swap axes (keeps the interpretation of what is feature, what is lookback), which is different from reshape (mixes everything and data loses meaningfulness)
A correct approach would need actual axis swapping, like model.add(Permute((2,1))) instead of the reshapes.
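The difference between reshaping and swapping axes can be checked on a small array. The sketch below uses NumPy instead of Keras (an assumption, so that it stays self-contained): `x.T` plays the role of `Permute((2, 1))` and `x.reshape(...)` the role of `Reshape`. The array sizes and values are made up for illustration.

```python
import numpy as np

lookback, features = 4, 3

# One sample: rows are time steps, columns are features.
# Feature f at step t holds the value 10 * f + t, so it is easy to trace.
x = np.array([[10 * f + t for f in range(features)] for t in range(lookback)])

swapped = x.T                              # true axis swap -> (features, lookback)
reshaped = x.reshape(features, lookback)   # same target shape, different content

# After the swap, row 0 is feature 0 followed over time.
feature0_swapped = swapped[0].tolist()
# After the reshape, row 0 mixes all features from the first time steps.
feature0_reshaped = reshaped[0].tolist()

print(feature0_swapped)
print(feature0_reshaped)
```

Both results have the same shape, which is why the mistake goes unnoticed: no error is raised, but the rows no longer correspond to a single feature.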
So, I don't know these answers, but nothing seems to create that need.
One sure thing is: you will certainly want the dependent part. A model will not get anywhere near the intelligence of your original model if it doesn't consider relations between features. (Unless you're lucky enough to have completely independent data.)
Now, explaining the relation between LSTM and Conv1D
An LSTM can be directly compared to a Conv1D and the shapes used are exactly the same, and they mean virtually the same, as long as you're using channels_last.
That said, the shape (samples, input_length, features_or_channels) is the correct shape for both LSTM and Conv1D. In fact, features and channels are exactly the same thing in this case. What changes is how each layer works regarding the input length and calculations.
Concept of filters and kernels
The kernel is the entire tensor inside the conv layer that is multiplied with the inputs to get the results. Its shape combines the spatial size (kernel_size), the number of input channels (inferred automatically from the input) and the number of filters (output features).
There is not a number of kernels, but there is a kernel_size. The kernel size is how many steps in the length will be joined together for each output step. (This tutorial is great for understanding 2D convolutions regarding what it does and what the kernel size is - just imagine 1D images instead -- this tutorial doesn't show the number of "filters" though, it's like 1-filter animations)
The number of filters relates directly to the number of output features; they're exactly the same thing.
What should filters and kernel be set to?
So, if your LSTM layer is using units=256, meaning it will output 256 features, you should use filters=256, meaning your convolution will output 256 channels/features.
This is not a rule, though: you may find that using more or fewer filters brings better results, since the layers do different things after all. There is also no need for all layers to have the same number of filters! Here you should tune the parameter: test which numbers work best for your goal and data.
Now, kernel size is something that can't be compared to the LSTM. It's a new thing added to the model.
The number 3 is sort of a very common choice. It means that the convolution will take three time steps to produce one time step. Then slide one step to take another group of three steps to produce the next step and so on.
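What "taking three time steps to produce one time step" means can be sketched in plain Python. This is a single-feature, single-filter illustration with made-up weights (the name `conv1d_valid` and the values are hypothetical), not a description of how Keras implements the layer.

```python
def conv1d_valid(sequence, weights):
    """Slide `weights` over `sequence` one step at a time, without padding."""
    k = len(weights)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]   # kernel_size consecutive steps
        outputs.append(sum(w * x for w, x in zip(weights, window)))
    return outputs


sequence = [1, 2, 3, 4, 5, 6]
weights = [1, 0, -1]   # hypothetical kernel with kernel_size = 3

result = conv1d_valid(sequence, weights)
print(result)
```

Six input steps and a kernel of size 3 give four output steps, since the window fits in four positions.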
Dilations
The dilation rate sets the spacing between the steps that the convolution filter reads.
A convolution with dilation_rate=1 takes kernel_size consecutive steps to produce one step.
A convolution with dilation_rate = 2 takes, for instance, steps 0, 2 and 4 to produce a step. Then takes steps 1,3,5 to produce the next step and so on.
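As a rough illustration, the input steps read for one output step can be listed explicitly. The helper below is a hypothetical sketch that assumes causal padding, so output step t only reads step t and earlier ones.

```python
def causal_taps(t, kernel_size, dilation_rate):
    """Input steps that a causal dilated kernel reads to produce output step t."""
    taps = [t - i * dilation_rate for i in range(kernel_size)]
    # Steps before the start of the sequence fall into the zero padding.
    return sorted(s for s in taps if s >= 0)


print(causal_taps(4, kernel_size=3, dilation_rate=1))
print(causal_taps(4, kernel_size=3, dilation_rate=2))
print(causal_taps(5, kernel_size=3, dilation_rate=2))
```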
What should dilations be set to to represent lookback of 10?
range = 1 + (kernel_size - 1) * dilation_rate
So, with a kernel size = 3:
Dilation = 0 (dilation_rate=1): the kernel spans 3 steps
Dilation = 1 (dilation_rate=2): the kernel spans 5 steps
Dilation = 2 (dilation_rate=4): the kernel spans 9 steps
Dilation = 3 (dilation_rate=8): the kernel spans 17 steps
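The formula above covers a single layer. When several stride-1 layers are stacked, their ranges add up, so the whole stack sees 1 + (kernel_size - 1) * sum(dilation_rates) steps. The sketch below applies both formulas; the function names are made up for illustration, and the doubling dilation rates follow the code in the question.

```python
def layer_range(kernel_size, dilation_rate):
    """Steps spanned by one dilated convolution layer."""
    return 1 + (kernel_size - 1) * dilation_rate


def stacked_range(kernel_size, dilation_rates):
    """Steps seen by a stack of stride-1 dilated layers."""
    return 1 + (kernel_size - 1) * sum(dilation_rates)


def layers_needed(kernel_size, lookback):
    """Smallest number of layers with rates 1, 2, 4, ... covering `lookback`."""
    n = 1
    while stacked_range(kernel_size, [2 ** i for i in range(n)]) < lookback:
        n += 1
    return n


single = [layer_range(3, 2 ** i) for i in range(4)]
print(single)
print(stacked_range(3, [1, 2, 4]))           # kernel 3, three layers
print(layers_needed(3, lookback=10))         # kernel 3
print(layers_needed(5, lookback=10))         # kernel 5, as in the question
print(stacked_range(5, [1, 2, 4, 8, 16]))    # the question's first stack
```

Under these assumptions, a lookback of 10 is already covered by three layers with kernel_size=3, or by two layers with kernel_size=5. Five layers with kernel_size=5 reach much further back than the 10 available steps.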
My question to you
If you can explain "exactly" how you're structuring your data, what is the original data and how you're transforming it into the input shape, if you have independent sequences, if you're creating sliding windows, etc. A better understanding of this process could be achieved.
Related
Problem definition
Dear community, I need your help in implementing an LSTM neural network for a classification problem of panel data using Keras. The panel data I am manipulating consists of ids (let's call it id), a timestep for each id (t), n time-varying covariates and a binary outcome y. Each id contains a number of timesteps, and for each timestep I have my covariates and a unique outcome (0 or 1). I have reason to believe that each covariate for each id can have a certain degree of autocorrelation and can hence be considered a small time series of t steps. For simplicity, I assume that each id has a fixed number of observations t, with t not a big number (about 10 or so).
Data
Below is a toy example of what the data might look like in my case. In this example, the parameters are 2 individuals, 4 timesteps each, 4 covariates and each observation has a unique binary outcome. Covariates may be considered as (short) timeseries since they might be autocorrelated.
print(df)
[out]:
              A         B         C         D    y
id  t
id1 1  1.054127  0.346052  1.091299 -0.058137  0.0
    2  0.621390 -0.204682 -1.056786  0.864572  0.0
    3  1.275124  2.473959  0.264029 -1.047810  0.0
    4 -0.328441 -0.135891  0.148498  0.470876  1.0
id2 1  0.362969  0.777082  0.197423 -2.225296  0.0
    2  0.227134  0.086731  0.550267 -0.361482  0.0
    3  0.223526  0.556242 -0.160042  0.675871  1.0
    4  0.070125  0.156659 -2.922709 -1.143887  1.0
I have reason to assume that, for id1, the target at timestep 4 is conditional on the three previous timesteps for that same individual (id1). In addition, the target variable y may contain more than one value of 1 for each individual (as outlined in the case of id2 above). I do not have reason to believe that the data from one individual would affect the result of another (as in many behavior analysis scenarios, since every individual is unique).
Prediction problem
What I would like to do is to predict a single outcome for a new individual for whom I have those 4 rows of observation. In other words, based on the historical data of an individual, I would like to know if said individual is likely to have an outcome 1 or 0. If I understand correctly, this can be achieved using an LSTM (alternatively, an RNN) with some data manipulation.
Things I have tried so far
To start simple, I have tried aggregating every set of id rows into a single row with a single outcome and applied a typical statistical learning approach such as boosted trees and got a model as good as random.
I looked into shaping it as a survival analysis problem, in vain. I would not be interested in any estimation of a survival function unlike tutorials on how to handle panel data in the medical field (nor would I have access to such data).
I have tried reshaping my data such that the input is a 3D array in the form of [observations, timesteps, features] where observations are unique ids for an LSTM like so in python :
# separate into features and target
df_feat = df.drop("y", axis = 1)
df_target = df[["y"]]
# get reshaped values for 3D tensor
n_samples = len(df_feat.index.get_level_values('id').unique().tolist())
n_timesteps = 4
n_features = df_feat.shape[1]
# reshape input array to be 3D
X_3D = df_feat.to_numpy().reshape(n_samples, n_timesteps, n_features)
print(X_3D.shape)
[out]:
(2, 4, 4)
However, at this point I get confused as to what my learning instances for the LSTM are and what the outcome y should be shaped like. I have tried having a shape like one outcome per training instance by taking only the last observation for each id (so y=[1,1] and y.shape = (2,) in the toy example above) which technically makes an LSTM script run... but does not capture prior information. Below is the code for such LSTM:
# imports assumed from the calls used below; the original post did not show them
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def train_lstm(X_train, y_train, X_valid, y_valid, save_name='best_lstm.h5'):
    # starts a sequential model
    model = Sequential()
    # add first lstm hidden layer with 64 units and default keras params
    model.add(LSTM(64, input_shape = (X_train.shape[1], X_train.shape[2]), return_sequences=True))
    # add a second hidden lstm layer with 128 units and default keras params
    model.add(LSTM(128, return_sequences = True))
    # add one last hidden layer
    model.add(LSTM(64))
    # add one dense layer with 2 units and a sigmoid activation function
    model.add(Dense(2, activation = 'sigmoid'))
    # define adam optimiser with learning rate
    opt = tf.keras.optimizers.Adam(learning_rate = 0.01)
    # compile model with binary cross entropy as loss function and accuracy as metrics
    model.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])
    # define early stopping and best model checkpoint parameters
    es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 0, patience = 20)
    mc = ModelCheckpoint(save_name, monitor = 'val_accuracy', mode = 'max', verbose = 0, save_best_only = True)
    # train the model using fit method (target vector is one-hot encoded as required by keras)
    history = model.fit(X_train, tf.one_hot(y_train, depth = 2),
                        validation_data = (X_valid, tf.one_hot(y_valid, depth = 2)),
                        epochs = 100, callbacks = [es, mc])
    return history
It runs and it makes predictions the way I want them to (for one id of previous history, we can predict one outcome) but results in poor performance since it fails to capture outcomes prior to the last.
I have carefully read and followed this nicely written medium article by Alexander Laskorunsky which remotely resembles what I am trying to do, and slides the window of K-length frames to capture the prior outcomes (and not just the last as I have done which makes more sense). However, in Alexander's case, he does not consider panel data but rather a multivariate timeseries classification that uses n_timesteps to predict the target using all predictors and all rows even if it overlaps (so not using panel data).
Questions
Am I right to believe that I need a many to one LSTM architecture?
How may I divide and reshape training and testing samples such that a new, previously unseen individual which would not be related in any way to other ids can be classified?
Should each id be considered as one sample / training instance? Should each id be split into training and testing sets and concatenate all training and testing sets to feed to an LSTM architecture?
Would you be so kind as to provide code snippets on how to correctly split and reshape my data as well as a simple LSTM architecture using keras (or maybe modify my own function above in case I coded it wrong)? No need for basic preprocessing and encoding variables.
Any help or advice / tutorials / articles regarding what architecture is most suitable for that kind of problem is greatly appreciated and thank you in advance for your help!
Ciao,
This is the second part of a problem I'm facing with 1D CNNs. The first part is this:
How does it works the input_shape variable in Conv1d in Keras?
I'm using this code:
from keras.models import Sequential
from keras.layers import Dense, Conv1D
import numpy as np
N_FEATURES=5
N_TIMESTEPS=10
X = np.random.rand(100, N_FEATURES)
Y = np.random.randint(0,2, size=100)
model = Sequential()
model.add(Conv1D(filters=32, kernel_size=N_TIMESTEPS, activation='relu', input_shape=(N_TIMESTEPS, N_FEATURES)))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Now, what do I want to do?
I want to train a 1D CNN on a time series with 5 features. Actually, I want to work with time windows of length N_TIMESTEPS rather than with the time series itself. This means that I want to slide a sort of "magnifier" of dimension N_TIMESTEPS x N_FEATURES over the time series to work locally. That's why I've decided to use a CNN.
Here comes the first question. It is not clear at all whether I have to transform the time series into a tensor myself or whether Keras will do it for me, since I've specified the kernel_size variable.
In case I must provide a tensor I would do something like this:
X_tensor = []
for i in range(len(X) - N_TIMESTEPS):
    # slice from X (the original indexed the still-empty X_tensor list)
    X_tensor.append(X[i:i + N_TIMESTEPS, :])
X_tensor = np.asarray(X_tensor)
In this case I should of course also provide a Y_tensor vector computed from Y according to some criteria. Suppose I already have this boolean Y_tensor vector with the same length as X_tensor, which is len(X)-N_TIMESTEPS.
Y_tensor = np.random.randint(0,2,len(X)-N_TIMESTEPS)
Now if I try to feed the model I get one of the most common errors for 1D CNNs, which is:
ValueError: Error when checking input: expected conv1d_4_input to have 3 dimensions, but got array with shape (100, 5)
After looking at a dozen posts about it, I still cannot understand what I did wrong. This is what I've tried:
model.fit(X,Y)
model.fit(np.expand_dims(X, axis=0),Y)
model.fit(np.expand_dims(X, axis=2),Y)
model.fit(X_tensor,Y_tensor)
In all of these cases I always get the same error (with different dimensional values in the final tuple).
Questions:
What does Keras expect from my data? Can I feed the model the whole time series, or do I have to slice it into a tensor?
How do I have to feed the model in terms of data structure? I.e. do I have to specify the dimension of the data in some special way?
Can you help me? I find this one of the most confusing points of the CNN implementation in Keras: there are different posts with different solutions that do not fit the structure of my data (even though it has a very common structure, in my opinion).
Note: Some posts suggest passing the length of the data in the input_shape variable. This makes no sense to me, since I should not have to provide the dimension of the data (which is variable) to the model. The only thing I should give it, according to the theory, is the filter dimension and the number of features (namely the dimension of the matrix that will roll over the time series).
Thanks,
am
Simply, Conv1D requires 3 dimensions:
Number of series (1)
Number of steps (100 - your entire data)
Number of features (5)
So, model.fit(np.expand_dims(X, axis=0),Y) is correct for X.
Now, if X is (1, 100, 5), naturally your input_shape=(100,5).
If your Y has 100 steps, then you need to make sure your Conv1D will output 100 steps. You need padding='same', otherwise it will become 91. (I suggest you actually work with 91, since you want a result for each 10 steps and probably don't want border effects spoiling your results)
Y must also follow the same rules for shape:
Number of series (1)
Number of steps (100 if padding='same'; 91 if padding='valid')
Number of features (1 = Dense output)
So, Y = Y.reshape((1,-1,1)).
Since you have only one class (true/false), it's pointless to use 'categorical_crossentropy'. You should go with 'binary_crossentropy'.
In general, your overall idea of using this convolution with kernel_size=10 to simulate sliding windows of 10 steps will work as expected (whether it will be efficient or not is another question, answered only by trying).
If you want better networks for sequences, you should probably try LSTM layers. The dimensions work exactly the same way. You will need return_sequences=False.
The main difference is that you will need to separate the data as you did in that loop. Then:
X.shape == (91, 10, 5)
Y.shape == (91, 1)
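The shapes mentioned above can be reproduced with NumPy alone. This is a minimal sketch: the helper name `conv_output_steps` is made up, and aligning each label with the last step of its window is an assumption, since the question does not say how Y_tensor is derived from Y.

```python
import numpy as np

N_FEATURES, N_TIMESTEPS = 5, 10
X = np.random.rand(100, N_FEATURES)
Y = np.random.randint(0, 2, size=100)


def conv_output_steps(n_steps, kernel_size, padding):
    """Output length of a stride-1 Conv1D for 'same' or 'valid' padding."""
    return n_steps if padding == "same" else n_steps - kernel_size + 1


n_windows = conv_output_steps(len(X), N_TIMESTEPS, "valid")

# Option 1 (Conv1D over the whole series): a single sample of 100 steps.
X_conv = np.expand_dims(X, axis=0)

# Option 2 (LSTM or explicit windows): one sample per sliding window.
X_windows = np.stack([X[i:i + N_TIMESTEPS] for i in range(n_windows)])

# Assumption: each window is labelled with the Y value at its last step.
Y_windows = Y[N_TIMESTEPS - 1:].reshape(-1, 1)

print(X_conv.shape, X_windows.shape, Y_windows.shape)
```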
I think you don't have a clear idea on how 1d convolutional neural networks work:
If you want to predict the y values from the time series x and you have just 1 time series, your approach won't work. The network needs lots of samples to train, and having just 1 will allow it to easily memorize the input instead of learning. For example, if the time series is the humidity of a given day, and y is the chance of rain at a specific timestep, what you have now is the data for just one day (timesteps being, for example, hours of the day). In order for the network to learn you need to gather data for many days, ending up with datasets of shape x=(n_days, timesteps, features) y=(n_days, timesteps, 1).
If you describe your actual problem, there's a better chance of getting more helpful answers.
[Edit] If you stick to your code and use just one time series, you are better off with other methods that don't involve deep learning. You could split your time series at regular intervals, obtaining n samples that would allow your network to train, but unless you have a very long time series that may not be a valid alternative.
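Splitting one series at regular intervals could look like the sketch below. The chunk length of 10 and the helper name `split_into_chunks` are assumptions for illustration; leftover steps at the end that do not fill a whole chunk are dropped.

```python
import numpy as np


def split_into_chunks(series, chunk_len):
    """Cut a (steps, features) series into non-overlapping chunks."""
    n_chunks = len(series) // chunk_len
    trimmed = series[:n_chunks * chunk_len]   # drop the incomplete tail
    return trimmed.reshape(n_chunks, chunk_len, series.shape[1])


X = np.random.rand(100, 5)
chunks = split_into_chunks(X, chunk_len=10)
print(chunks.shape)
```

With 100 steps this yields only 10 samples, which illustrates why the approach needs a very long series to be useful.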
I have made some model that in the end will output a tensor of 3 x 10. The reason why it's 3 x 10 is because the vocabulary size is 10, and there are 3 elements in a sequence (this is a sequence multilabel classification problem). This tensor will need to somehow be softmaxed to a 1x10 tensor. Can someone give me explanations about the methods that are available and maybe some example in Keras?
I saw some merging methods in Keras like average or add. Those could be useful in this case, but they seem to need two or more tensors as input. Therefore I probably need to split the 3 x 10 tensor into 3 tensors of 1 x 10 each and average them. Maybe there are better ways to achieve this?
A simple way to achieve what you want is to use a final 1x1 convolution layer.
A layer with a 1×1 convolution kernel merges your 3x10 tensor into a 1x10 one and simultaneously learns the fusion weights during training. For this to work, the 3 sequence elements must lie on the channel axis of the tensor passed to the layer.
Add this layer :
output = Conv2D(1, (1, 1), activation='your_activation')(your_3x10_tensor)
Hope this is the solution you were looking for !
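Numerically, a 1x1 convolution with a single output filter is a weighted sum over the channel axis. The NumPy sketch below shows that computation followed by a softmax; the weights are made-up constants here, whereas in a network they would be learned, and the `softmax` helper is written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 10))   # 3 sequence elements x 10 vocabulary entries

# Hypothetical fusion weights and bias; a 1x1 conv layer would learn these.
w = np.array([0.5, 0.3, 0.2])
b = 0.0

merged = w @ scores + b             # weighted sum over the 3 elements -> (10,)


def softmax(z):
    e = np.exp(z - z.max())         # subtract the max for numerical stability
    return e / e.sum()


probs = softmax(merged)
print(merged.shape, probs.sum())
```

Setting all three weights to 1/3 reproduces the plain average mentioned in the question, so averaging is a special case of this layer.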
I have been working on creating a convolutional neural network from scratch, and am a little confused on how to treat kernel size for hidden convolutional layers. For example, say I have an MNIST image as input (28 x 28) and put it through the following layers.
Convolutional layer with kernel_size = (5,5) with 32 output channels
new dimension of throughput = (32, 28, 28)
Max Pooling layer with pool_size (2,2) and step (2,2)
new dimension of throughput = (32, 14, 14)
If I now want to create a second convolutional layer with kernel size = (5x5) and 64 output channels, how do I proceed? Does this mean that I only need two new filters (2 x 32 existing channels) or does the kernel size change to be (32 x 5 x 5) since there are already 32 input channels?
Since the initial input was a 2D image, I do not know how to conduct convolution for the hidden layer since the input is now 3 dimensional (32 x 14 x 14).
You need 64 kernels, each of size (32, 5, 5).
The depth (number of channels) of each kernel, 32 in this case, 3 for an RGB image, 1 for grayscale, must always match the input depth. In a trained network each of those channel slices has its own learned values; copying the same values into every slice, as in the example below, is only a simplification for the explanation.
E.g. if you have a 3x3 kernel like [-1 0 1; -2 0 2; -1 0 1] and want to convolve it with an input of depth N (N channels), you copy this 3x3 kernel N times along the third dimension. The math then follows the 1-channel case: at each position you multiply the kernel values with the input values under the kernel window, sum over all N channels, and get the value of a single entry (pixel). So the output of one kernel is a matrix with 1 channel. How much depth should the input to the next layer have? That is the number of kernels you should apply. Hence in your case the weights have size (64 x 32 x 5 x 5), i.e. 64 kernels with 32 channels each, every channel being a 5x5 slice.
("I am not a very confident english speaker hope you get what I said, it would be nice if someone edit this :)")
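The weight shapes for the two layers in the question can be worked out with a few lines of arithmetic. The helper below is a hypothetical sketch that assumes a standard convolution with one bias per output channel.

```python
def conv2d_params(in_channels, out_channels, kernel_h, kernel_w, bias=True):
    """Weight tensor shape and total parameter count of a standard 2D conv layer."""
    weight_shape = (out_channels, in_channels, kernel_h, kernel_w)
    n_weights = out_channels * in_channels * kernel_h * kernel_w
    n_bias = out_channels if bias else 0
    return weight_shape, n_weights + n_bias


layer1 = conv2d_params(in_channels=1, out_channels=32, kernel_h=5, kernel_w=5)
layer2 = conv2d_params(in_channels=32, out_channels=64, kernel_h=5, kernel_w=5)
print(layer1)
print(layer2)
```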
You essentially answered your own question. YOU are building the network solver. It seems like your convolutional layer output is [channels out] = [channels in] * [number of kernels]. I had to infer this from the wording of your question. In general, this is how it works: you specify the kernel size of the layer and how many kernels to use. Since you have one input channel you are essentially saying that there are 32 kernels in your first convolution layer. That is 32 unique 5x5 kernels. Each of these kernels will be applied to the one input channel. More in general, each of the layer kernels (32 in your example) is applied to each of the input channels. And that is the key. If you build code to implement the convolution layer according to these generalities, then your subsequent convolution layers are done. In the next layer you specify two kernels per channel. In your example there would be 32 input channels, the hidden layer has 2 kernels per channel, and the output would be 64 channels.
You could then down sample by applying a pooling layer, then flatten the 64 channels [turn a matrix into a vector by stacking the columns or rows], and pass it as a column vector into a fully connected network. That is the basic scheme of convolutional networks.
The work comes when you try to code up backpropagation through the convolutional layers. But the OP didn’t ask about that. I’ll just say this, you will come to a place where you have the stored input matrix (one channel), you have a gradient from a lower layer in the form of a matrix and is the size of the layer kernel, and you need to backpropagate it up to the next convolutional layer.
The simple approach is to rotate your stored channel matrix by 180 degrees and then convolve it with the gradient. The explanation for this is long and tedious, too much to write here, and not a lot on the internet explains it well.
A more sophisticated approach is to apply “correlation” between the input gradient and the stored channel matrix. Note I specifically said “correlation” as opposed to “convolution” and that is key. If you think they are “almost” the same thing, then I recommend you take some time and learn about the differences.
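The difference between the two operations is easiest to see in 1D: convolution flips the kernel before sliding it, correlation does not. The NumPy sketch below uses made-up values; in 2D the flip corresponds to the 180 degree rotation mentioned above.

```python
import numpy as np

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])

corr = np.correlate(signal, kernel, mode="valid")   # kernel used as-is
conv = np.convolve(signal, kernel, mode="valid")    # kernel flipped internally

# Convolving with the reversed kernel gives the correlation result again.
conv_flipped = np.convolve(signal, kernel[::-1], mode="valid")

print(corr, conv, conv_flipped)
```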
If you would like to have a look at my CNN solver here's a link to the project. It's C++ and no documentation, sorry :) It's all in a header file called layer.h, find the class FilterLayer2D. I think the code is pretty readable (what programmer doesn't think his code is readable :) )
https://github.com/sraber/simplenet.git
I also wrote a paper on basic fully connected networks. I wrote it so that I wouldn't forget what I learned in my self-study. Maybe you'll get something out of it. It's at this link:
http://www.raberfamily.com/scottblog/scottblog.htm
I am wondering how LSTMs work in Keras. In this tutorial for example, as in many others, you can find something like this:
model.add(LSTM(4, input_shape=(1, look_back)))
What does the "4" mean? Is it the number of neurons in the layer? By neuron, I mean something that gives a single output for each instance.
Actually, I found this brilliant discussion but wasn't really convinced by the explanation in the reference given.
In the diagram, one can see num_units illustrated, and I think I am not wrong in saying that each of these units is a very atomic LSTM unit (i.e. the 4 gates). However, how are these units connected? If I am right (but not sure), x_(t-1) is of size nb_features, so each feature would be an input of a unit and num_units must be equal to nb_features, right?
Now, let's talk about Keras. I have read this post and the accepted answer, and I am still confused. Indeed, the answer says:
Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from the unit
In which case? I have trouble reconciling this with the previous reference...
Moreover, it says,
LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length.
Okay, but how do I define a full LSTM layer? Is it the input_shape that implicitly creates as many blocks as the number of time_steps (which, as I understand it, is the first element of the input_shape parameter in my piece of code)?
Thanks for enlightening me.
EDIT: would it also be possible to explain clearly how to reshape data of, say, size (n_samples, n_features) for a stateful LSTM model? How should time_steps and batch_size be handled?
First, units in LSTM is NOT the number of time_steps.
Each LSTM cell (present at a given time_step) takes an input x and forms a hidden state vector a; the length of this hidden state vector is what Keras calls units.
You should keep in mind that there is only one RNN cell created by the code
keras.layers.LSTM(units, activation='tanh', …… )
and the RNN operations are repeated Tx times by the class itself.
I've linked this to help you understand it better with some very simple code.
You can (sort of) think of it exactly as you think of fully connected layers. Units are neurons.
The dimension of the output is the number of neurons, as with most of the well known layer types.
The difference is that in LSTMs, these neurons will not be completely independent of each other, they will intercommunicate due to the mathematical operations lying under the cover.
Before going further, it might be interesting to take a look at this very complete explanation of LSTMs, their inputs/outputs and the usage of stateful=True/False: Understanding Keras LSTMs. Notice that your input shape should be input_shape=(look_back, 1). The input shape is (time_steps, features).
While this is a series of fully connected layers:
hidden layer 1: 4 units
hidden layer 2: 4 units
output layer: 1 unit
This is a series of LSTM layers:
Where input_shape = (batch_size, arbitrary_steps, 3)
Each LSTM layer will keep reusing the same units/neurons over and over until all the arbitrary timesteps in the input are processed.
The output will have shape:
(batch, arbitrary_steps, units) if return_sequences=True.
(batch, units) if return_sequences=False.
The memory states will have a size of units.
The inputs processed from the last step will have size of units.
To be really precise, there will be two groups of units, one working on the raw inputs, the other working on already processed inputs coming from the last step. Due to the internal structure, each group will have a number of parameters 4 times bigger than the number of units (this 4 is not related to the image, it's fixed).
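The "4 times" factor can be checked by counting parameters. The sketch below assumes the standard LSTM layout with four gate blocks, each with input weights, recurrent weights and a bias; the function name is made up for illustration.

```python
def lstm_param_count(input_dim, units):
    """Trainable parameters of a standard LSTM layer."""
    input_weights = units * input_dim    # group working on the raw inputs
    recurrent_weights = units * units    # group working on the previous step's output
    bias = units
    return 4 * (input_weights + recurrent_weights + bias)


# The three stacked layers described here: 3 features -> 4 units -> 4 units -> 1 unit.
print(lstm_param_count(input_dim=3, units=4))
print(lstm_param_count(input_dim=4, units=4))
print(lstm_param_count(input_dim=4, units=1))
```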
Flow:
Takes an input with n steps and 3 features
Layer 1:
For each time step in the inputs:
Uses 4 units on the inputs to get a size 4 result
Uses 4 recurrent units on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
output features = 4
Layer 2:
Same as layer 1
Layer 3:
For each time step in the inputs:
Uses 1 unit on the inputs to get a size 1 result
Uses 1 unit on the outputs of the previous step
Outputs the last (return_sequences=False) or all (return_sequences = True) steps
The number of units is the size (length) of the internal vector states, h and c of the LSTM. That is no matter the shape of the input, it is upscaled (by a dense transformation) by the various kernels for the i, f, and o gates. The details of how the resulting latent features are transformed into h and c are described in the linked post. In your example, the input shape of data
(batch_size, timesteps, input_dim)
will be transformed to
(batch_size, timesteps, 4)
if return_sequences is true, otherwise only the last h will be emitted, making it (batch_size, 4). I would recommend using a much higher latent dimension, perhaps 128 or 256 for most problems.
I would put it this way - there are 4 LSTM "neurons" or "units", each with 1 Cell State and 1 Hidden State for each timestep they process. So for an input of 1 timestep processing , you will have 4 Cell States, and 4 Hidden States and 4 Outputs.
Actually the correct way to say this is: for an input of one timestep you have 1 Cell State (a vector of size 4), 1 Hidden State (a vector of size 4) and 1 Output (a vector of size 4).
So if you feed in a timeseries with 20 steps, you will have 20 (intermediate) Cell States, each of size 4. That is because the inputs in LSTM are processed sequentially, 1 after the other. Similarly you will have 20 Hidden States, each of size 4.
Usually, your output will be the output of the LAST step (a vector of size 4). However, if you want the outputs of each intermediate step (remember you have 20 timesteps to process), you can set return_sequences=True. In that case you will get 20 vectors of size 4, each telling you what the output was after the corresponding step was processed.
If you set return_state=True, you additionally get the last Hidden State (size 4) and the last Cell State (size 4).
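The sizes described above can be traced with a small NumPy forward pass. This is a simplified sketch of a standard LSTM cell with random, untrained weights; the names `lstm_forward`, `W`, `U` and `b` are made up here, and it is not the Keras implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_forward(x, W, U, b, units):
    """x: (timesteps, input_dim). Returns all hidden states, last h, last c."""
    h = np.zeros(units)   # hidden state, size = units
    c = np.zeros(units)   # cell state, size = units
    hidden_states = []
    for x_t in x:                         # steps are processed one after the other
        z = x_t @ W + h @ U + b           # all four gate blocks at once: (4 * units,)
        i, f, g, o = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        g = np.tanh(g)
        c = f * c + i * g                 # new cell state
        h = o * np.tanh(c)                # new hidden state, also the step output
        hidden_states.append(h)
    return np.stack(hidden_states), h, c


rng = np.random.default_rng(0)
timesteps, input_dim, units = 20, 3, 4
x = rng.normal(size=(timesteps, input_dim))
W = rng.normal(size=(input_dim, 4 * units))   # weights on the raw inputs
U = rng.normal(size=(units, 4 * units))       # weights on the previous hidden state
b = np.zeros(4 * units)

all_h, last_h, last_c = lstm_forward(x, W, U, b, units)
print(all_h.shape, last_h.shape, last_c.shape)
```

Here `all_h` corresponds to the output with return_sequences=True, `last_h` to the default output, and `last_h` together with `last_c` to the extra states returned with return_state=True.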