Adding LSTM layers in Keras produces input error - python

First of all apologies if I word this wrong, I'm relatively new to TensorFlow
I am designing a model for simple classification of a dataset, each column is an attribute and the final column is the class. I split these and generate a dataframe in the usual way.
If I generated a model with dense layers, it works great:
def baseline_model():
# create model
model = Sequential()
model.add(Dense(30, input_dim=len(dataframe.columns)-2, activation='sigmoid'))
model.add(Dense(20,activation='sigmoid'))
model.add(Dense(unique, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
return model
If I were to add, say, an LSTM layer to the model:
def baseline_model():
# create model
model = Sequential()
model.add(Dense(30, input_dim=len(dataframe.columns)-2, activation='sigmoid'))
#this bit here >
model.add(LSTM(20, return_sequences=True))
model.add(Dense(20,activation='sigmoid'))
model.add(Dense(unique, activation='softmax'))
# Compile model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
return model
I get the following error when I execute the code:
ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=2
I'm not sure where these variables are coming from although, maybe it's the classes? I have three classes of data ('POSITIVE', 'NEGATIVE', 'NEUTRAL') mapped to sets of well over 2,000 attribute values - they're statistical extractions of timewindowed EEG brainwave data from multiple electrodes classed by emotional state.
Note:
The 'input_dim=len(dataframe.columns)-2' produces the number of attributes (inputs), I do this as I'd like the script to work with CSV datasets of different sizes on the fly
Also, there are no tabs in my code pasted but it is indented and will compile
The full code is pasted here: https://pastebin.com/1aXp9uDA for presentation purposes. Apologies in advance for the terrible practices! This is just an initial project, I do plan on cleaning it all up later on!

In your original code you have an input dimension of 2 which is shaped (batch, feature). When you add an LSTM, you're telling Keras you want to do the classification given the last N timesteps, hence you need an input that is shaped (batch, timestep, feature). It's easy to think that an LSTM will look back across all inputs in the batch but unfortunately it doesn't. You have to manually organize your data to present all the timestep elements together.
To split up your data you generally do a sliding window of length N (where N is how many values you wish the LSTM to look back across). You can slide the window N steps each time, meaning there's no overlap of the data or you can simply slide it one sample, meaning you get multiple copies of your data. There are numerous blogs on how to do this. Take a look at this one How to Reshape Input Data for LSTM.
You also have one other issue. For your LSTM, you probably want "return_sequences=False". With this equal to True, you would need to have an output "Y" value for each element of your timestep. You probably want your "Y" value to represent the next value in your time-series. Keep this in mind when organizing your data.
The above link provides some nice examples or you can search for more in-depth ones. If you follow those it should be clear how to reorganize things for an LSTM.

Related

How can I use RNN for classification of variable length sequence?

I have a dataset of which have variable length sentences in each sample. I know this seems to be a very basic question which one should be able to solve just by googling. But I am really new to it and I did whatever search I was able to do.
I came up with some solutions which don't involve padding or truncation.
First was to convert each sample (in form of strings) to a sequence of vectors. I got "Value error: Unable to convert to tensor" errors. I also tried convert_to_tensor method but still same error. The type of each sample was list of numpy arrays; where length of list varied from sample to sample (based on number of tokens in sentence. Model tried for same is below:
inputs = keras.Input(shape=(None, emb_size))
gru_out = keras.layers.Bidirectional(keras.layers.GRU(gru_size, return_sequences=False))(inputs)
gru_out = keras.layers.Flatten()(gru_out)
predictions = keras.layers.Dense(num_classes, activation='sigmoid')(gru_out)
model = keras.Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=keras.optimizers.Adam(), loss='binary_crossentropy', metrics=['accuracy'])
Second Convert the tokenized words to tokens (integers) and feed it to the model which already have embedding weights embedded in embedding layer.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=emb_size, weights=[embedding_matrix], trainable=False))
model.add(tf.keras.layers.GRU(32, return_sequences=True))
model.add(tf.keras.layers.GRU(32, return_sequences=True))
model.add(tf.keras.layers.GRU(32))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
Here the type of embedding matrix was list of tensorflow tensors. Input was pandas series of samples where each sample was numpy array if integers (tokens). labels in both these cases were in form of Series of integers. This method gave error "ValueError: Data cardinality is ambiguous: Make sure all arrays contain the same number of samples.", I got confused by this error because I have read somewhere that GRU cells can be used to train models with variable size input.
There are other methods like padding sequences to some length before feeding to model. I was able to make it work but I don't want to use padding or truncation.
Another method which worked was when I used a model with text vectorization layer and embedding layerand fed the model directly the sentences. But, here I felt a logical gap as I needed to adapt the vectorizer but if a new word comes while prediction the integers assigned for every word will get changed making everything useless. For out of vocabulary word embdding vector is not a problem since I am using pretrained model for obtaining vector.
I don't know why it should be so tricky for sucha simple task to do. I must be doing something very stupid, kindly guide me out.

Most suitable Machine Learning algorithm for this problem?

I have a dataset and i want to decide on which ML algorithm to apply to my given problem.
Customers are to fill out an assessment questionnaire of 50 questions. Examples of the questions are, what is your job, previous job history, how much do you earn, have you been rejected for a loan etc, and the end goal is to decide whether they should be rejected or not.
I have circa 500 entries for my algorithm to learn from and have pre-processed my dataset and converted the inputs into a numpy array and wondering what would be the best algorithm to use? Should i use a classification algorithm or a neural network in tensorflow and if the latter, what would be the layers I should use?
Thanks
How about beginning with xgboost or random forest? - So plain "old" ML?
The advantage would be that you could visualize the decision tree of the model once trained.
If using a NN in tensorflow (or even easier: keras with tensorflow backend), you could go with a MLP (multi layer perceptron), since the questions answers have fixed position in the input. You don't need many layers.
Important is that you normalize your input data columnwise, so that the input numbers are not much bigger/smaller than +1/-1, respectively. Introductory books often miss this point, though important.
Since your target labeling is "accept" or "reject", binary classifier will do it (also in the machine learning approach). (You use 0 and 1 as labels).
For NN, you don't need for such kind of classification that many layers or neurons. Try the smallest network first. let's say 10 neurons in first layer, then 7 neurons in the next layer (probably even less) and then 1 output neuron for the binary decision.
With Keras this would be:
from keras.models import Sequential
from keras.layers import Dense
def create_mlp(n_input = 500): # number of columns of input data 500 here
model = Sequential()
model.add(Dense(10, input_dim=n_input, kernel_initializer='normal', activation='relu')) # init = kernel_initializer
model.add(Dense(7, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['acc'])
return model
model = create_mlp(500) # this will generate the correct NN compiled.
Your data frame (or Numpy input array must have as rows the samples,
the columns are for each answer for a question 1 column.
the answers you have to encode in a numeric form. Numbers should be small - the best between -1 and 1. NNs don't like big numbers. Thus column-wise normalization can help.)
That's it. I learned all this stuff last year. Good luck for learning. It will be tons of fun!

LSTM: How to effectively comprehend and use return_sequences=False?

I'm performing a multivariate time series analysis. I intend on running this network many, many times with different batch sizes and hyperparameters. Therefore I have been keeping things general so far and assuming
B := Number of batches
T := Number of timesteps per batch
N := Number of features per timestep
My end goal is to call model.predict on one batch (that hasn't been used to train, of course). Thus it will look like this
prediction = model.predict(unknown)
unknown.shape = (T, N)
prediction.shape = (N,)
I've been trying out loads of networks.
I can't get this to work
model.add(LSTM(N, input_shape=(T, N), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(N, input_shape=(T, N), return_sequences=False))
model.add(Dense(N))
However, I CAN get this to work
model.add(LSTM(N, input_shape=(T, N), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(N, input_shape=(T, N), return_sequences=True)) # True here
model.add(Dense(N))
What's wrong with return_sequences=True if the output includes that last timestep, which is the f(t+1) that I'm looking for? I suppose nothing but this doesn't fix the fundamental problem which is that I don't fully understand how to build these networks. Like I said, I want to keep things as general as possible and deeply understand the networks before I start diving into experiments.
I will be running experiments with very different values for B, T and N to see which works the best.
I've tried to read the docs and whatnot but I'm quite stuck and getting frustrated so I thought I would turn to the community.
Hope the following code helps
eg if number of nodes in each layer is 150
model.add(LSTM(150, return_sequences=True,input_shape=(T,N)))
model.add(LSTM(150))
model.add(Dense(1))
And remove input_shape=(T, N), in second lstm in your code.You have specified that it expects 2D array(T,N) but the first LSTM provides a 3d array consisting of (1,T,N)
There is no need for specifying the input_shape argument in the second recurrent layer. This parameter is expressly required when the recurrent layer is the first layer in the network, but not otherwise.
If you wish to specify the input batch shape at each recurrent layer, remember to set the parameter stateful = True to render your LSTMs stateful(and only use this argument at the first layer of your network).

Keras Dense Net Overfitting

I am attempting to use keras to build an activity classifier from accelerometer signals. However, I am experiencing extreme overfitting of the data even with the most simplistic of models.
The input data is of shape (10,3) and contains roughly .1 second of data from the accelerometer in 3 dimensions. The model is simply
model = Sequential()
model.add(Flatten(input_shape=(10,3)))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
The model should output the label [1,0] for walking activities and [0,1] for non-walking activities. After training I get 99.8% accuracy (if only it was real...). When I attempt to predict on data that wasn't used for training, I get 50% accuracy, verifying that the net isn't really "learning" anything except to predict a single class value.
The data is being prepared from 100hz triaxial accelerometer signals. I am not preprocessing the data in any way except for windowing it into bins on length 10 that overlap with the previous bin by 50%. What measures can I take to make the network produce actual predictions? I have tried increasing the window size but the results remain the same. Any advice/general tips are greatly appreciated.
Ian
Try adding some hidden layers and dropout layers to your network. You could create a simple Multi Layer Perceptron (MLP) with a couple of extra lines in between your Flatten layer and Dense layer:
model.add(Dense(64, activation='relu', input_dim=30))
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
Or check out this guide. which explains how to create a simple MLP.
Without any hidden layers your model will not actually be 'learning' from the input data, rather it will be mapping the input number of features to the output number of features.
The more layers you add, the more intermediate features and patterns it should extract from the input data which should lead to better model predictions for test data. There will be a lot of trial and error to design the best model as too many layers can result in over fitting.
You have not provided information about how you train the model so that may be the cause of the issue as well. You must ensure that the data is spit into training, testing and validation sets. Some possible split ratios to use for training, validation, test data are: 60%:20%:20%, or 70%:15%:15%. This is ultimately something that you must also decide.
The problem of overfitting was caused by the input data type. The values passed to the classifier should have been float values with 2 decimal places. Somewhere along the way, some of these values had been augmented and had significantly more than 2 decimal places. That is, the input should have looked like
[9.81, 10.22, 11.3]
but instead looked like
[9.81000000012, 10.220010431, 11.3000000101]
The classifier was making its prediction based on this feature, which is obviously not the desired behavior! Lessoned learned - make sure the data preparation is consistent! Thanks to #umutto for the suggestions of random forests, the simple structure was helpful for diagnosing purposes.

Keras: specify input dropout layer that always keeps certain features

I'm training a neural net using Keras in Python for time-series climate data (predicting value X at time t=T), and tried adding a (20%) dropout layer on the inputs, which seemed to limit overfitting and cause a slight increase in performance. However, after I added a new and particularly useful feature (the value of the response variable at time of prediction t=0), I found massively increased performance by removing the dropout layer. This makes sense to me, since I can imagine how the neural net would "learn" the importance of that one feature and base the rest of its training around adjusting that value (i.e, "how do these other features affect how the response at t=0 changes by time t=T").
In addition, there are a few other features that I think should be present for all epochs. That said, I am still hopeful that a dropout layer could improve the model performance-- it just needs to not drop out certain features, like X at t_0: I need a dropout layer that will only drop out certain features.
I have searched for examples of doing this, and read the Keras documentation here, but can't seem to find a way to do it. I may be missing something obvious, as I'm still not familiar with how to manually edit layers. Any help would be appreciated. Thanks!
Edit: sorry for any lack of clarity. Here is the code where I define the model (p is the number of features):
def create_model(p):
model = Sequential()
model.add(Dropout(0.2, input_shape=(p,))) # % of features dropped
model.add(Dense(1000, input_dim=p, kernel_initializer='normal'
, activation='sigmoid'))
model.add(Dense(30, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal',activation='linear'))
model.compile(loss=cost_fn, optimizer='adam')
return model
The best way I can think of applying dropout only to specific features is to simply separate the features in different layers.
For that, I suggest you simply divide your inputs in essential features and droppable features:
from keras.layers import *
from keras.models import Model
def create_model(essentialP,droppableP):
essentialInput = Input((essentialP,))
droppableInput = Input((droppableP,))
dropped = Dropout(0.2)(droppableInput) # % of features dropped
completeInput = Concatenate()([essentialInput, dropped])
output = Dense(1000, kernel_initializer='normal', activation='sigmoid')(completeInput)
output = Dense(30, kernel_initializer='normal', activation='relu')(output)
output = Dense(1, kernel_initializer='normal',activation='linear')(output)
model = Model([essentialInput,droppableInput],output)
model.compile(loss=cost_fn, optimizer='adam')
return model
Train the model using two inputs. You have to manage your inputs before training:
model.fit([essential_train_data,droppable_train_data], predictions, ...)
I don't see any harm to using dropout in the input layer. The usage/effect would be a little different than normal of course. The effect would be similar to adding synthetic noise to an input signal; only the feature/pixel/whatever would be entirely unknown[zeroed out] instead of noisy. And inserting synthetic noise into the input is one of the oldest ways to improve robustness; certainly not bad practice as long as you think about whether it makes sense for your data set.
This question has already an accepted answer but it seems to me you are using dropout in a bad way.
Dropout is only for the hidden layers, not for the input layer !
Dropout act as a regularizer, and prevent the hidden layer complex coadaptation, quoting Hinton paper "Our work extends this idea by showing that dropout can be effectively applied in the hidden layers as well and that it can be interpreted as a form of model averaging" (http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
Dropout can be seen as training several different models with your data and averaging the prediction at test time. If you prevent your models to have all the inputs during training, it will perform badly, especially if one input is crucial. What you want is actually avoid overfitting, meaning you prevent too complex models during the training phase (so each of your models will select the most important features first) before testing.
It is common practice to drop some of the features in ensemble learning but it is control and not stochastic like for dropout. It also works for neural networks as hidden layers have (often) way more neurons as inputs, and so dropout follows the law of big numbers, as for a small number of inputs, you can have in some bad case almost all your inputs dropped.
In conlusion: it is a bad practice to use dropout in the input layer of a neural network.

Categories