Unrolling, timesteps, batchsize and hidden unit

I read this blog here to understand the theoretical background, but after reading it I am a bit confused about what **1) timesteps, 2) unrolling, 3) number of hidden units and 4) batch size** are. Maybe someone could explain this on a code basis as well: when I look into the model config, the code below does not unroll, so what is the timestep doing in this case? Let's say I have data with a length of 2,000 points, split into 40 time steps, with one feature. The number of hidden units is, e.g., 100. The batch size is not defined. What is happening in the model?
model = Sequential()
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)
Is the code below still an encoder-decoder model without a RepeatVector?
model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(n_timesteps_in, n_features)))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)

"Unroll" is just a mechanism to process the LSTMs in a way that makes them faster by occupying more memory. (The details are unknown for me... but it certainly has no influence in steps, shapes, etc.)
When you say "2000 points split in 40 time steps", I have absolutely no idea of what is going on.
The data must be meaningfully structured and saying "2000" data points is really lacking a lot of information.
Data structured for LSTMs is:
I have a certain number of individual sequences (data evolving with time)
Each sequence has a number of time steps (measures in time)
In each step we measured a number of different vars with different meanings (features)
Example:
2000 users in a website
They used the site for 40 days
In each day I measured the number of times they clicked a button
I can plot how this data evolves with time daily (each day is a step)
So, if you have 2000 sequences (also called "samples" in Keras), each sequence with length of 40 steps, and one single feature per step, this will happen:
Dimensions
Batch size is defined as 32 by default in the fit method. The model will process batches containing 32 sequences/users until it reaches 2000 sequences/users.
input_shape is required to be (40, 1); the batch size is left free and is chosen in fit.
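As a quick check of the batch arithmetic, here is a minimal sketch in plain Python. It assumes the 2000 sequences from the example and the default batch size of 32 mentioned above; the variable names are only illustrative.

```python
import math

n_sequences = 2000   # samples (users) in the example above
batch_size = 32      # default used by fit() when batch_size is not given

# Batches per epoch: the last batch is smaller when the number of
# samples is not a multiple of the batch size.
n_batches = math.ceil(n_sequences / batch_size)
last_batch = n_sequences - (n_batches - 1) * batch_size

print(n_batches, last_batch)  # 63 16
```

So one epoch consists of 62 full batches of 32 sequences plus one final batch of 16.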
Steps
Your LSTMs will try to understand how clicks vary in time, step by step. That's why they're recurrent, they calculate things for a step and feed these things into the next step, until all 40 steps are processed. (You won't see this processing, though, it's internal)
With return_sequences=True, you will get the output for all steps.
Without it, you will get only the output for the last step.
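To make the step-by-step processing concrete, here is a rough NumPy sketch of a plain recurrent layer. It is not an LSTM (the gates are left out) and the weights are random placeholders, so it only illustrates how the state is passed along the 40 steps and what return_sequences changes about the output shape.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, features, units = 32, 40, 1, 100

x = rng.normal(size=(batch, steps, features))     # (samples, time steps, features)
W_in = rng.normal(size=(features, units)) * 0.1   # input weights (random placeholders)
W_rec = rng.normal(size=(units, units)) * 0.1     # recurrent weights (random placeholders)

h = np.zeros((batch, units))                      # state carried from step to step
outputs = []
for t in range(steps):
    # Each step sees the current input and the state from the previous step.
    h = np.tanh(x[:, t, :] @ W_in + h @ W_rec)
    outputs.append(h)

all_steps = np.stack(outputs, axis=1)   # what return_sequences=True would give
last_step = all_steps[:, -1, :]         # what return_sequences=False would give

print(all_steps.shape)  # (32, 40, 100)
print(last_step.shape)  # (32, 100)
```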
The model
The model will process 32 parallel (and independent) sequences/users together in each batch.
The first LSTM layer will process the entire sequence in recurrent steps and return a final result. (The sequence is killed, there are no steps left because you didn't use return_sequences=True)
Output shape = (batch, 100)
You create a new sequence with RepeatVector, but this sequence is constant in time.
Output shape = (batch, 40, 100)
The next LSTM layer processes this constant sequence and produces an output sequence, with all 40 steps
Output shape = (batch, 40, 100)
The TimeDistributed(Dense) will process each of these steps, but independently (in parallel), not recursively as the LSTMs would do.
Output shape = (batch, 40, n_features)
The output will be the total group of 2000 sequences (that were processed in groups of 32), each with 40 steps and n_features output features.
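The shapes around RepeatVector and TimeDistributed(Dense) can be reproduced with NumPy. This is only a sketch: the encoder output is a random stand-in, the second LSTM is skipped, and the Dense weights are placeholders. It shows that the repeated sequence is constant in time and that the same Dense weights are applied to every step.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, steps, units, n_features = 32, 40, 100, 1

encoded = rng.normal(size=(batch, units))   # stand-in for the first LSTM's output

# RepeatVector(steps): copy the same vector along a new time axis.
repeated = np.repeat(encoded[:, np.newaxis, :], steps, axis=1)

# TimeDistributed(Dense(n_features)): one weight matrix used for every step.
W = rng.normal(size=(units, n_features))
b = np.zeros(n_features)
decoded = np.tanh(repeated @ W + b)

print(repeated.shape)  # (32, 40, 100)
print(decoded.shape)   # (32, 40, 1)
print(np.array_equal(repeated[:, 0, :], repeated[:, -1, :]))  # True: constant in time
```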
Cells, features, units
Everything is independent.
Input features is one thing, output features is another. There is no requirement for Dense to use the same number of features used in input_shape, unless that's what you want.
When you use 100 units in the LSTM layer, it will produce an output sequence of 100 features, shape (batch, 40, 100) with return_sequences=True. If you use 200 units, it will produce an output sequence with 200 features, shape (batch, 40, 200). This is computing power: more units give the model more capacity to fit the data, at the cost of more weights to train.
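To see how the number of units translates into trainable weights, the following sketch counts the parameters of a standard LSTM layer with bias (four gate/candidate blocks, each with input weights, recurrent weights and a bias). The helper name lstm_param_count is made up for this illustration.

```python
def lstm_param_count(units: int, input_features: int) -> int:
    """Trainable weights of a standard LSTM layer with bias."""
    # Four blocks (input, forget, output gates and the cell candidate),
    # each with input weights, recurrent weights and a bias vector.
    return 4 * (units * input_features + units * units + units)

print(lstm_param_count(100, 1))    # 40800  (first LSTM, one input feature)
print(lstm_param_count(200, 1))    # 161600 (doubling the units)
print(lstm_param_count(100, 100))  # 80400  (second LSTM, fed 100 features)
```

Doubling the units roughly quadruples the weights, because the recurrent matrix grows with the square of the unit count.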
Something strange in the model:
You should replace:
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
With only:
model.add(LSTM(100, return_sequences=True,input_shape=(n_timesteps_in, n_features)))
Not returning sequences in the first layer and then creating a constant sequence with RepeatVector is sort of destroying the work of your first LSTM.

Related

How to make array have same number of samples?

I am a beginner using CNN and Keras and I am trying to make a program to predict whether someone could develop diabetes using data in a CSV file. I think I am getting confused with how to reshape the array as I am receiving the error:
ValueError: Data cardinality is ambiguous:
x sizes: 8
y sizes: 768
Make sure all arrays contain the same number of samples
Here is the code:
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten
# read in the csv file using pandas
data = pd.read_csv("diabetes.csv")
# extract the input and output columns from the dataframe
X = data.drop(columns=['Outcome'])
y = data['Outcome']
# reshape the input data into the shape expected by a CNN
X = X.values.reshape(8, 768, 1)
# create a Sequential model in Keras
model = Sequential()
# add a 2D convolutional layer with 32 filters and a kernel size of 3x3
model.add(Conv2D(32, kernel_size=(3, 3), activation="relu", input_shape=(8, 768, 1)))
# add a flatten layer to flatten the output from the convolutional layer
model.add(Flatten())
# add a fully-connected layer with 64 units and a ReLU activation
model.add(Dense(64, activation="relu"))
# add a fully-connected layer with 10 units and a softmax activation
model.add(Dense(10, activation="softmax"))
# compile the model using categorical crossentropy loss and an Adam optimizer
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
# fit the model using the input and output data
model.fit(X, y)
# print prediction
print(model.predict(10, 139, 80, 0, 0, 27.1, 1.441, 57))

TL;DR: you probably don't want a CNN in this case.
First off, I’m assuming your data looks something like the following; if that’s not the case, the rest of the post may be way off target:
[image: screenshot of the data table]
So there are 768 rows or patients, 8 inputs for each row, and 1 output (known as the label).
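The "Data cardinality is ambiguous" error from the question follows directly from this layout: Keras takes the first axis as the sample axis. A small NumPy sketch with placeholder zeros in the same shape as the data described above:

```python
import numpy as np

# Placeholder arrays with the same shapes as the CSV described above.
X = np.zeros((768, 8))   # 768 patients, 8 input columns
y = np.zeros(768)        # one label per patient

bad = X.reshape(8, 768, 1)   # the reshape used in the question

print(len(bad), len(y))  # 8 768   -> "x sizes: 8, y sizes: 768"
print(len(X), len(y))    # 768 768 -> one row of X per label
```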
Convolutional layers are used when there is an input signal that you wish to analyze. In 2d, this would be something like a grid of pixels, or in 1d it might be time series data. Your data is neither – each row of the data represents a single 8-dimensional data point (i.e. a single patient) at a single point in time, so you very likely don’t want to use a convolutional layer at all.
For more information, you can read up on the differences between convnets and fully connected neural networks here: https://ai.stackexchange.com/questions/5546/what-is-the-difference-between-a-convolutional-neural-network-and-a-regular-neur?rq=1
“CNN, in specific, has one or more layers of convolution units. A convolution unit receives its input from multiple units from the previous layer which together create a proximity. Therefore, the input units (that form a small neighborhood) share their weights.
The convolution units (as well as pooling units) are especially beneficial as:
• They reduce the number of units in the network (since they are many-to-one mappings). This means, there are fewer parameters to learn which reduces the chance of overfitting as the model would be less complex than a fully connected network.
• They consider the context/shared information in the small neighborhoods. This feature is very important in many applications such as image, video, text, and speech processing/mining as the neighboring inputs (eg pixels, frames, words, etc) usually carry related information.”
A very naïve, very basic NN for a problem like this would just use Dense, i.e. fully connected layers.
In Keras, you can do the following:
model = Sequential()
model.add(Dense(64, activation="relu", input_shape=(8,)))
model.add(Dense(1, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
Note that the last layer is a single neuron, since you have only one output. If you were classifying images as one of say 10 categories (dog, cat, bird, etc), then you would use 10 output nodes in the last layer, softmax them, and use categorical cross entropy. Here, with a single condition, you only need a single output node and note that the loss function should probably be binary crossentropy – i.e. you’re trying to detect the presence or absence of the condition.
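To illustrate what the single sigmoid output and binary crossentropy compute, here is a NumPy sketch of the forward pass of a Dense(64) -> Dense(1) network. The weights are untrained random placeholders, so the probability itself is meaningless; the point is the shapes, including the fact that a single patient must be passed as one row of shape (1, 8).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_crossentropy(y_true, p):
    # Loss for one prediction p (probability of the positive class).
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Untrained placeholder weights with the shapes of Dense(64) -> Dense(1).
W1, b1 = rng.normal(size=(8, 64)) * 0.01, np.zeros(64)
W2, b2 = rng.normal(size=(64, 1)) * 0.01, np.zeros(1)

# One patient = one row with 8 values, so the batch has shape (1, 8).
patient = np.array([[10, 139, 80, 0, 0, 27.1, 1.441, 57]])

hidden = np.maximum(0.0, patient @ W1 + b1)   # ReLU
prob = sigmoid(hidden @ W2 + b2)              # probability of the condition

print(patient.shape, prob.shape)  # (1, 8) (1, 1)
```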
Hope this helps.

How to visualise the structure of a TensorFlow model

I have a simple model for a project which takes in some weather data and predicts a temperature. The structure is the following:
24 inputs with 10 features (one input per hour so 24 hours of input data)
1 output being a temperature
The model works, I'm all fine with that, however I need to present the way the model works and I'm unsure how some values I see about the model describe it. I'm struggling to visually represent the inner structure (as in the neurons or nodes).
Here is the model (The static input_shape is not set statically in my code, it is purely to help answer the question):
forward_layer = tf.keras.layers.LSTM(units=32, return_sequences=True)
backward_layer = tf.keras.layers.LSTM(units=32, return_sequences=True, go_backwards=True)
bilstm_model = tf.keras.models.Sequential([
    tf.keras.layers.Bidirectional(forward_layer, backward_layer=backward_layer, input_shape=(24, 10)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32, return_sequences=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(units=1)
])
Note that the reason I separated the layers for the Bidirectional is that I am running this with TFLite and had issues if the two weren't pre-defined, but the inner workings are the same as far as I understand it.
Now as some experts have probably already figured out just looking at this, the input shape is (32, 24, 10) and the output shape is (32, 1) where the input is under the (batch_size, sequence_length, features) format and the output is under the (batch_size, units) format.
Although I feel I understand what the numbers represent, I can't quite wrap my head around how the model would "look". I mainly struggle to differentiate the structure during training and during predictions.
Here are the ideas I have (note that I will be describing the model based on neurons):
Input is a 24x10 DataFrame
The input is therefore 240 values which can be seen as 24 sets of 10 neurons ? (Not sure I'm right here)
return_sequences means that the 240 values propagate throughout the model ? (Still not sure here)
The 'neuron' structure would resemble this:
Input(240 neurons) -> BiLSTM(240 neurons) ->
Dropout(240 neurons, 20% drop rate) -> LSTM(240 neurons) ->
Dropout(240 neurons, 20% drop rate) -> LSTM(240 neurons) ->
Dropout(240 neurons, 20% drop rate) -> LSTM(240 neurons) ->
Dropout(? neurons, 20% drop rate) -> Dense(1 neurons) = Output
If I'm not mistaken, the Dropout layer isn't a layer strictly speaking but it stops (in this case) 20% of the input neurons (or output, I'm not sure) from activating 20% of the output neurons.
I'd really appreciate the help on visualising the structure so thanks in advance to any brave soul ready to help me out
EDIT
Here is an example of what I am trying to get out of the numbers. Note that this image ONLY represents the first BiLSTM layer, not the whole model.
The x ? in the image represent what I'm trying to understand, ie how many layers are there and how many neurons are there in each layer
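One way to approach the question is to trace the per-sample output shape of each layer. The sketch below does this in plain Python under two assumptions: Bidirectional uses its default merge mode, which concatenates the forward and backward outputs, and the Dropout layers are left out because they do not change shapes. The helper name trace_shapes is made up; calling summary() on the real model should report the same shapes with a leading batch dimension.

```python
def trace_shapes(steps=24, features=10, units=32):
    """Per-sample output shape after each layer (batch axis omitted)."""
    return [
        ("Input", (steps, features)),
        # Forward and backward LSTM outputs are concatenated per time step.
        ("Bidirectional(LSTM)", (steps, 2 * units)),
        ("LSTM, return_sequences=True", (steps, units)),
        ("LSTM, return_sequences=True", (steps, units)),
        # Only the last time step is kept, so the time axis disappears.
        ("LSTM, return_sequences=False", (units,)),
        ("Dense(1)", (1,)),
    ]

for name, shape in trace_shapes():
    print(f"{name:30s} {shape}")
```

Under these assumptions the sequence keeps its 24 steps until the last LSTM, and the width per step is set by the number of units (64 after the bidirectional layer, then 32), not by the 240 input values.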

How to reshape input for ConvLSTM2D to not overfit?

I have a time series problem with 15 minutes as a timestep. The complete data runs from 2016-09-01 00:00:15 to 2016-12-31 23:45:00.
I have 6 variables (v1, v2, v3, v4, v5, v6) in the data frame and I want to predict the sixth variable (v6) for the next timestep.
I prepare the data set with 5 time lags: for a row at time t, I create the values at (t-1) to (t-5) as lags for v1 to v6.
So in total, I have 30 features (5 lags for 6 variables).
I also normalize the values by PowerTransformer.
scaler_x = PowerTransformer()
scaler_y = PowerTransformer()
train_X = scaler_x.fit_transform(train_X)
train_y = scaler_y.fit_transform(train_y.reshape(-1,1))
The initial shapes of train_X and train_y are:
(11253, 30) , (11253, 1)
That is 11253 rows, each with 30 input variables and a single target variable. Then I reshape this to fit my ConvLSTM2D as below:
# define the number of subsequences and the length of subsequences
n_steps, n_length = 5, 6 #I take into account of past 5 steps for the 6 variables
n_features=1
#reshape for ConvLSTM
# reshape into subsequences [samples, time steps, rows, cols, channels]
train_X = train_X.reshape(train_X.shape[0], n_steps, 1, n_length, n_features)
train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
The ConvLSTM2D architecture looks like below :
model = Sequential()
model.add(ConvLSTM2D(filters=64, kernel_size=(1,3), activation='relu', input_shape=(n_steps, 1, n_length, n_features)))
model.add(Flatten())
model.add(RepeatVector(1))
model.add(LSTM(50, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(20, activation='relu')))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mse', optimizer='adam')
# fit network
model.fit(train_X, train_y, epochs=epochs, batch_size=batch_size, verbose=0)
But this model gives a very bad result (It is overfitting a lot). I suspect that my inputs are not given correctly to the ConvLSTM2D.
Is my reshaping correct? Any help is appreciated.
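Whether the reshape is correct depends on how the 30 lag columns are ordered, which the question does not state. The NumPy sketch below uses a hypothetical single row in which every entry is encoded as 10*lag + variable, so the layout can be read back after reshaping. If the columns are ordered lag by lag, the reshape from the question puts one lag in each time step; if they are ordered variable by variable, the same reshape mixes lags and variables and a transpose is needed first.

```python
import numpy as np

n_steps, n_length = 5, 6   # 5 lags, 6 variables
lags, variables = range(5, 0, -1), range(1, 7)

# Hypothetical column orders; each value is 10 * lag + variable.
lag_major = np.array([[10 * l + v for l in lags for v in variables]])
var_major = np.array([[10 * l + v for v in variables for l in lags]])

# The reshape used in the question.
a = lag_major.reshape(1, n_steps, 1, n_length, 1)
b = var_major.reshape(1, n_steps, 1, n_length, 1)

print(a[0, 0, 0, :, 0])  # [51 52 53 54 55 56] -> lag 5 of v1..v6, as intended
print(b[0, 0, 0, :, 0])  # [51 41 31 21 11 52] -> lags and variables mixed

# For variable-major columns, split as (variables, lags) and swap the axes.
fixed = var_major.reshape(1, n_length, n_steps).transpose(0, 2, 1)
fixed = fixed.reshape(1, n_steps, 1, n_length, 1)
print(fixed[0, 0, 0, :, 0])  # [51 52 53 54 55 56]
```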
EDIT:
I have realized my input is being given correctly to the Network but the issue is it is overfitting a lot.
My hyperparameters are below :
#hyper-parameter
epochs=100
batch_size=64
adam_opt = keras.optimizers.Adam(lr=0.001)
I even tried 50 and 10 epochs; the issue is the same.

In my personal experience there are a few things I've picked up about using ConvLSTM2D.
I would first check to see if the model is training at all. Based on your question I am unsure how the loss changes as your model trains, if at all. If there is some variation, you need to perform a grid search (playing around with the number of layers and filters).
I also found my models needed to train for a long time to perform well, see the Keras example on ConvLSTM2d where 300 epochs are needed to train a model to perform an arguably simple task : https://keras.io/examples/conv_lstm/. A case I worked on needed a similar amount of epochs to train.
Check different loss functions and optimizers (even though I think mse and adam are good for this type of problem)
Normalize your data differently, you may want to normalize your data statistically as shown in this keras example : https://www.tensorflow.org/tutorials/keras/regression
From personal experience, you might want more layers for this specific problem. See keras ConvLSTM2d example above for this
I see how you want to format your data, and though it may work, a more straightforward solution may work better. You might want to try giving (v1,v2,v3,v4,v5) and predicting for v6. You may have to use large batch sizes for this.
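The "statistical" normalization suggested above is usually a z-score: subtract the mean and divide by the standard deviation of each column. Here is a minimal NumPy sketch on random placeholder data (the array names and sizes are made up); the statistics are computed on the training split only and then reused for the test split.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(1000, 6))   # placeholder data
test = rng.normal(loc=50.0, scale=10.0, size=(200, 6))     # placeholder data

# Column-wise statistics from the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_norm = (train - mean) / std
test_norm = (test - mean) / std   # reuse the training statistics

print(train_norm.mean(axis=0).round(6))  # approximately 0 per column
print(train_norm.std(axis=0).round(6))   # approximately 1 per column
```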

Loss function for class imbalanced multi-class classifier in Keras

I am trying to apply deep learning to a multi-class classification problem with high class imbalance between target classes (10K, 500K, 90K, 30K). I want to write a custom loss function.
This is my current model:
model = Sequential()
model.add(LSTM(
    units=10,  # number of units returned by LSTM
    return_sequences=True,
    input_shape=(timestamps, nb_features),
    dropout=0.2,
    recurrent_dropout=0.2
))
model.add(TimeDistributed(Dense(1)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes,
                activation='softmax'))
model.compile(loss="categorical_crossentropy",
              metrics=['accuracy'],
              optimizer='adadelta')
Unfortunately, all predictions belong to class 1!!! The model always predicts 1 for any input...
Appreciate any pointers on how I can solve this task.
Update:
Dimensions of input data:
94981 train sequences
29494 test sequences
X_train shape: (94981, 20, 18)
X_test shape: (29494, 20, 18)
y_train shape: (94981, 4)
y_test shape: (29494, 4)
Basically in the train data I have 94981 samples. Each sample contains a sequence of 20 timestamps. There are 18 features.
The imbalance between target classes (10K, 500K, 90K, 30K) is just an example. I have similar proportions in my real dataset.

First of all, you have ~100k samples. Start with something smaller, like 100 samples and multiple epochs, and see whether your model overfits this smaller training dataset (if it can't, you either have an error in your code or the model is not capable of modeling the dependencies [I would go with the second case]). Seriously, start with this one. And remember to represent all of your classes in this small dataset.
Secondly, hidden size of LSTM may be too small, you have 18 features for each sequence and sequences have length of 20, while your hidden is only 10. And you apply dropout to top it off and regularize the network even further.
Furthermore, you may want to add some dense outputs units instead of merely returning a linear layer of size 10 x 1 for each timestamp.
Last but not least, you may want to upsample the underrepresented classes. Class 0 would have to be repeated say 50 times (or maybe 25), class 2 something around 4 times and the remaining minority class around 10-15 times, so the network is trained on them.
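The repeat factors can be derived from the class counts. The sketch below assumes the counts from the question map to class indices 0 to 3 in the order given (10K, 500K, 90K, 30K), which the question does not confirm, and balances every class against the largest one; smaller factors, as suggested above, are a softer variant of the same idea.

```python
# Assumed mapping of the counts from the question to class indices.
class_counts = {0: 10_000, 1: 500_000, 2: 90_000, 3: 30_000}

largest = max(class_counts.values())

# How many times each class would be repeated to match the largest class.
repeat_factor = {c: round(largest / n) for c, n in class_counts.items()}

print(repeat_factor)  # {0: 50, 1: 1, 2: 6, 3: 17}
```

The same ratios can also be used as per-class loss weights instead of duplicating rows, for example through the class_weight argument of fit in Keras.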
Oh, and use cross-validation for your hyperparameters like the hidden size, number of dense units etc.
Plus I don't know for how many epochs you've been training this network, or what your test dataset looks like (it is entirely possible that it consists only of the first class if you haven't done stratification).
I think this will get you started, hit me up with any doubts in the comments.
EDIT: When it comes to metrics, you may want to check something different than mere accuracy; maybe F1 score and your loss monitoring + accuracy to see how it performs. There are other available choices, for inspiration you can check sklearn's documentation as they provide quite a few options.
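A small plain-Python sketch shows why accuracy alone is misleading here. It uses toy labels in roughly the proportions from the question (scaled down; the class indices are an assumption) and a "model" that always predicts class 1, then compares accuracy with the macro-averaged F1 score. In practice sklearn's f1_score with average="macro" computes the same quantity.

```python
def f1_per_class(y_true, y_pred, n_classes):
    """F1 score of each class, 0.0 when a class is never predicted or present."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy labels in the proportions 10K : 500K : 90K : 30K, scaled down.
y_true = [0] * 1 + [1] * 50 + [2] * 9 + [3] * 3
y_pred = [1] * len(y_true)   # always predicts class 1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = sum(f1_per_class(y_true, y_pred, 4)) / 4

print(round(accuracy, 3))  # 0.794
print(round(macro_f1, 3))  # 0.221
```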

Training on multiple time-series of various length using recurrent layers in Keras

TL;DR - I have a couple of thousand speed-profiles (time-series where the speed of a car has been sampled) and I am unsure how to configure my models such that I can perform arbitrary forecasting (i.e. predict t+n samples given a sample t).
I have read numerous explanations (1, 2, 3, 4, 5) about how Keras implements statefulness in their recurrent layers, and how one should reset/not reset between iterations, etc..
However, I am unable to acquire the model shape that I want (I think).
As for now, I am only working with a subset of my profiles (denoted as routes in the code below).
Number of training routes: 90
Number of testing routes: 10
The routes vary in length, hence, the first thing I do is to iterate through all routes and pad them with 0, so they are all the same length. (I have assumed this is required, if I am wrong please let me know.) After the padding I convert the routes into a format better suited for the supervised learning task, as described HERE. In this case I have opted to forecast the succeeding 5 steps of the current sample.
The result is a tensor, as:
Shape of training_data: (90, 3186, 6) == (nb_routes, nb_samples/route, nb_timesteps)
which is split into X and y for training as:
Shape of X: (90, 3186, 1)
Shape of y: (90, 3186, 5)
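The conversion into a supervised format can be sketched with a sliding window in plain Python. The function name to_supervised and the toy speed values are made up for this illustration; each row holds one past sample followed by the five samples to forecast, which is then split into X and y as above.

```python
def to_supervised(series, n_past=1, n_future=5):
    """Slide a window over the series: n_past inputs followed by n_future targets."""
    width = n_past + n_future
    return [series[i:i + width] for i in range(len(series) - width + 1)]

speeds = [0, 3, 7, 12, 15, 14, 10, 4]   # toy speed profile of one route

rows = to_supervised(speeds)
X = [r[:1] for r in rows]   # current sample
y = [r[1:] for r in rows]   # the succeeding 5 samples

print(X)     # [[0], [3], [7]]
print(y[0])  # [3, 7, 12, 15, 14]
```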
My goal is to have the model take one route at the time and train on it. I have created a model like this:
# Create model
model = Sequential()
# Add recurrent layer
model.add(SimpleRNN(nb_cells, batch_input_shape=(1, X.shape[1], X.shape[2]), stateful=True))
# Add dense layer at the end to acquire correct kind of forecast
model.add(Dense(y.shape[2]))
# Compile model
model.compile(loss="mean_squared_error", optimizer="adam", metrics = ["accuracy"])
# Fit model
for _ in range(nb_epochs):
    model.fit(X, y,
              validation_split=0.1,
              epochs=1,
              batch_size=1,
              verbose=1,
              shuffle=False)
    model.reset_states()
Which would imply that I have a model with nb_cells layers, the input of the model is (number_of_samples, number_of_timesteps) i.e. (3186, 1) and the output of the model is (number_of_timesteps_lagged) i.e. (5).
However, when running the above I get the following error:
ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (90, 3186, 5)
I have tried different ways to solve the above, but I have been unsuccessful.
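The error message can be read as a shape mismatch. A recurrent layer without return_sequences=True returns only its last step, so the Dense layer produces a 2-D output per batch, while the targets are 3-D. The plain-Python sketch below traces this; the helper name and the value 64 for nb_cells are placeholders, and it assumes Dense acts on the last axis only, as it does in Keras.

```python
def rnn_dense_output_shape(batch, steps, cells, dense_units, return_sequences):
    """Output shape of a recurrent layer followed by a Dense layer."""
    rnn_out = (batch, steps, cells) if return_sequences else (batch, cells)
    # Dense replaces only the size of the last axis.
    return rnn_out[:-1] + (dense_units,)

target_batch = (1, 3186, 5)   # one route of y from the question

without_seq = rnn_dense_output_shape(1, 3186, 64, 5, return_sequences=False)
with_seq = rnn_dense_output_shape(1, 3186, 64, 5, return_sequences=True)

print(without_seq)  # (1, 5): 2 dimensions, does not match the 3-D target
print(with_seq)     # (1, 3186, 5): matches one route of y
```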
I have also tried other ways of structuring my data and my model. For instance, merging my routes such that instead of (90, 3186, 6) I had (286740, 6). I simply took the data for each route and put it after the other. After fiddling with my model I got this to run, and I get a result that is quite good, but I really want to understand how this works, and I think the solution I am attempting above is better (if I can get it to work).
Update
Note: I am still looking for feedback.
I have reached a "solution" which I think does the trick.
I have abandoned the padding and instead opted for a one sample at the time approach. The reason being that I am trying to acquire a network that allows me to predict by providing the network with one sample at the time. I want to give the network sample t and have it predict t+1, t+2, ...,t+n, so it is my understanding that I must train the network on one sample at the time. I also assume that using:
stateful will allow me to keep the hidden state of the cells unspoiled between batches (meaning that I can determine the batch size to be len(route))
return_sequences will allow me to get the output vector that I desire
The changed code is given below. Unlike the original question, the shape of the input data is now (90,) (i.e. 90 routes of various length) but each training route still has only one feature per sample, and each label route has five samples per feature (the lagged time).
# Create model
model = Sequential()
# Add nn_type cells
model.add(SimpleRNN(nb_cells, return_sequences=True, stateful=True, batch_input_shape=(1, 1, nb_past_obs)))
# Add dense layer at the end to acquire correct kind of forecast
model.add(Dense(nb_future_obs))
# Compile model
model.compile(loss="mean_squared_error", optimizer="adam", metrics = ["accuracy"])
# Fit model
for e in range(nb_epochs):
    for r in range(len(training_data)):
        route = training_data[r]
        for s in range(len(route)):
            X = route[s, :nb_past_obs].reshape(1, 1, nb_past_obs)
            y = route[s, nb_past_obs:].reshape(1, 1, nb_future_obs)
            model.fit(X, y,
                      epochs=1,
                      batch_size=1,
                      verbose=0,
                      shuffle=False)
        model.reset_states()
return model
