I am trying to train the following RNN in TensorFlow. It takes an 11-D numeric vector as input and outputs a sequence of 10 multiclass probability vectors over 14 mutually exclusive classes.
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.SimpleRNN(30, return_sequences=False, input_shape=[1, 11]),
    keras.layers.RepeatVector(10),
    keras.layers.SimpleRNN(30, return_sequences=True),
    keras.layers.SimpleRNN(14, return_sequences=True, activation="softmax")
])
model.compile(loss="categorical_crossentropy",
              optimizer="adam")
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_split=0.2)
However, even for a small dataset of 10 points, it takes hundreds of epochs to fit. As you can see in the figure, the loss barely goes down with the training epochs:
When I try to train on the real training set, the loss simply does not move. Any ideas on how to successfully train this model?
You can find the first 10 datapoints here
And the first 100 datapoints here
To load the data just use:
import pickle

with open('train10.pickle', 'rb') as f:
    X_train, y_train = pickle.load(f)
Thank you very much for your help
EDIT:
To provide additional context, what I have in this problem is a continuous numeric embedding in 11-D to start with, and the output is a sequence of one-hot encodings, so you can think of this problem as training a decoder, or doing a decompression, to get a sort of "word" back from a point in the numeric space (each one-hot vector in the output could be thought of as a "letter"). I previously tried to train a non-recurrent network outputting the full list of one-hot encodings (the whole "word") at once, but the performance was also very poor. I just do not see what the bottleneck is: whether it is the dimensionality of the numeric embedding, the training algorithm, etc. My tinkering so far with types of layers, numbers of layers, and learning rates did not produce substantial improvements. I am open to sharing the whole dataset if you think that can help. Thank you very much!
Each machine learning problem is unique and it is very difficult to say exactly what the issue is without having access to the full data set. Some possibilities are:
The model specification is suboptimal - try varying the number of hidden layers, the number of neurons in each layer, using GRU/LSTM layers instead of SimpleRNN, adding some dropout layers, etc. (see the sketch after this list).
The training algorithm needs to be adjusted - try using a different optimizer, a different batch size, a different train-test split ratio etc.
The input data needs more (or less) preprocessing - try normalizing/standardizing the input features if you haven't already.
You need to do more work on feature engineering - think deeply about all potential relationships between the input data and the target, and try combining columns to create ratios etc. While the NN can theoretically figure this out for itself, it is often effective to try and reduce the work it has to do in this respect.
Your problem may just be difficult or even unsolvable. There may just be no strong relationship between the input and the target.
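To make the first point concrete, here is a minimal sketch of one possible variation, keeping the shapes from the question (11-D input, 10 output steps, 14 classes) but swapping SimpleRNN for GRU, widening the layers, and adding dropout; the specific sizes are assumptions, not values known to work on this data.
from tensorflow import keras

model = keras.models.Sequential([
    keras.layers.GRU(64, input_shape=[1, 11]),
    keras.layers.RepeatVector(10),
    keras.layers.GRU(64, return_sequences=True),
    keras.layers.Dropout(0.2),
    keras.layers.TimeDistributed(keras.layers.Dense(14, activation="softmax"))
])
model.compile(loss="categorical_crossentropy", optimizer="adam")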
I'm training an LSTM neural network to predict volatility (a time series) in Keras. At the moment, my network is specified as follows:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l2

model = Sequential()
model.add(LSTM(10, input_shape=(1, 1), kernel_regularizer=l2(0.0001)))
model.add(Dense(1, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, validation_split=0.2, epochs=100, batch_size=16)
Here, I have a lot of parameters I could cross-validate:
Number of units in the LSTM?
More layers?
Regularizer (L1 or L2, and how much)?
Activation function?
Optimizer?
Batch size?
However, cross-validating all of these parameters would take a huge amount of computation time, so how do I determine the correct specification for each of them?
As far as I know, grid search might be the best approach. However, you can narrow the search space by examining your data. If you don't have much data, go for a smaller model rather than a big one (or else it will overfit); this alone shrinks the search space a bit. Some say fewer layers with more units work well for low-resource data, but even that is not guaranteed.
Regularization can be good or bad depending on the task; you'll never know whether a setting is right unless you experiment with it.
For batch size, it is common to experiment with values from 16 to 512 (or higher if your memory allows). A larger batch size trains faster but consumes more memory; a smaller batch size makes the updates noisier, so the loss decreases along a more erratic path.
For the optimizer, if you want to drop one grid-search dimension, just use Adam; it works quite well for most tasks.
All in all, no one can guarantee that tuning a given hyperparameter will yield a performance gain; everything has to be experimented with and recorded. That's why there is so much research on hyperparameter tuning.
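If you do go the grid-search route, a reduced search over the one or two settings you suspect matter most keeps it tractable. A rough sketch, reusing the model from the question and assuming X_train/y_train are already defined; the search ranges here are arbitrary:
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.regularizers import l2

best_config, best_loss = None, float("inf")
for units in [5, 10, 20]:
    for batch_size in [16, 64, 256]:
        model = Sequential()
        model.add(LSTM(units, input_shape=(1, 1), kernel_regularizer=l2(0.0001)))
        model.add(Dense(1, activation='relu'))
        model.compile(loss='mean_squared_error', optimizer='adam')
        history = model.fit(X_train, y_train, validation_split=0.2,
                            epochs=100, batch_size=batch_size, verbose=0)
        val_loss = min(history.history['val_loss'])  # best validation loss for this config
        if val_loss < best_loss:
            best_config, best_loss = (units, batch_size), val_loss
print("best (units, batch_size):", best_config, "val_loss:", best_loss)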
I'm trying to do text classification with scikit-learn.
I have text that does not classify well. I think I can improve the predictions by adding data I can deduce in the form of an array of integers.
For example, sample 1 would come with [3, 1, 5, 2] and sample 2 would come with [2, 1, 4, 2]. This would also be true of the test data.
The idea is that the classifier could use both the text and the numbers to classify the data.
I've read the scikit-learn documentation and I can't find how to do it. It must be possible, because everything that gets classified is, internally, a vector of numbers, so adding another vector of numbers should not be much of a problem, but I can't figure out how. partial_fit adds more samples; it does not add more information about the existing samples. Is there a way to do what I'm trying to do? I tried to combine GaussianNB with SGDClassifier, but it turns out I don't know how to do that. (Was it a bad idea?)
What should I do?
I think you can add these new features as extra dimensions of your training data: modify the training matrix by appending your new features before calling SGD.
A simple/naive way would be:
For example, if my training data with two samples were
X = [ [1,2,3], [8,9,0] ]
And my new features for each sample was
new_feature_X = [ [11,22,33] , [77,88,0] ]
My new training data would be:
X_new = [ [1,2,3,11,22,33] , [8,9,0,77,88,0] ]
Then you call SGD.fit(X_new, labels)
As far as my SGD knowledge goes, I don't think there is any other way to combine the two feature sets.
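To make that concrete, here is a minimal sketch with scikit-learn, assuming the text is vectorized with TfidfVectorizer; scipy.sparse.hstack appends the hand-made integer features as extra columns before fitting SGDClassifier. The toy data is just a placeholder.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

texts = ["first document", "second document"]    # your raw text samples
extra = np.array([[3, 1, 5, 2], [2, 1, 4, 2]])    # your deduced integer features
labels = [0, 1]

text_X = TfidfVectorizer().fit_transform(texts)   # sparse text features
X_new = hstack([text_X, extra])                   # text columns + extra columns

clf = SGDClassifier().fit(X_new, labels)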
The idea is that the classifier could use both the text and the numbers to classify the data.
I find a neural network to be much more suitable for this. You could use two input layers, one for text vectors, and one for the numbers and feed them together into a network to get the output.
I tried to combine GaussianNB with SGDClassifier, but it turns out I don't know how to do that. (Was it a bad idea?)
SGD means stochastic gradient descent. Is it possible to take the gradient of Naive Bayes? What is the corresponding cost function?
What should I do?
Ensemble. Train two separate classifiers, one on your text data and another on your new handcrafted features, then take the average of their prediction probabilities. You could also train multiple classifiers and take their votes (a sketch of this appears after the Keras example below). This tutorial is great for that.
Try out the MLPClassifier. I used it a while ago and found it works pretty well with text.
Neural networks. It's pretty easy with Keras.
Read the research literature. There is a pretty good chance academia has done some work on your dataset; try to read some of it. Google Scholar and Semantic Scholar are great places to find published research.
from keras.layers import Input, Dense, Concatenate
from keras.models import Model
# This returns a tensor
text_input_vec = Input(shape=(784,))
new_numeric_feature = Input(shape=(4,))
# feed your text to a dense layer
dense1 = Dense(64, activation='relu')(text_input_vec)
# feed your numeric feature to another dense layer
dense2 = Dense(64, activation='relu')(new_numeric_feature)
# concatenate/combine the output of both
concat = Concatenate(axis=-1)([dense1,dense2])
# use the above to predict the label of your text. Layer below
# assumes you have 2 classes
predictions = Dense(2, activation='softmax')(concat)
model = Model(inputs=[text_input_vec,new_numeric_feature], outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
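And for option 1 (the ensemble), a minimal sketch of averaging the probabilities of two separate classifiers, one on TF-IDF text vectors and one on the hand-made integer features; the toy data and the choice of LogisticRegression/GaussianNB are just placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

texts = ["first document", "second document"]        # your raw text
extra = np.array([[3, 1, 5, 2], [2, 1, 4, 2]])        # your deduced integer features
labels = np.array([0, 1])

vec = TfidfVectorizer()
text_X = vec.fit_transform(texts)

text_clf = LogisticRegression().fit(text_X, labels)   # classifier on the text
num_clf = GaussianNB().fit(extra, labels)             # classifier on the numbers

# average the predicted class probabilities of both models
avg_proba = (text_clf.predict_proba(text_X) + num_clf.predict_proba(extra)) / 2
prediction = avg_proba.argmax(axis=1)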
I'm trying to use an LSTM to predict information from timestep sequences.
My data looks like this: I have a few different samples of relatively long sequences (>100,000 timesteps), and I'm trying to solve an N-class classification problem where each sample is labeled with a different ID. Now I'm trying to understand how to properly prepare my data so the LSTM trains on each sample individually.
In the most basic case, I just feed each sample completely to the network:
from keras import models, layers

model = models.Sequential()
model.add(layers.Embedding(FEATURES_NUMBER, 30))
model.add(layers.LSTM(32, return_sequences=True))
model.compile(optimizer="adam",
              loss='categorical_crossentropy', metrics=['acc'])
history = model.fit(train_data,
                    train_labels,
                    epochs=10,
                    batch_size=128,
                    validation_data=(validation_data, validation_labels)
                    )
Where train_data is of shape: (4, 100000, 1).
But I'm being told by many blog posts (like this one) that training an LSTM on very long sequences might harm the training. So I don't understand how to properly split the data with respect to the LSTM's internal state.
I can split each 100,000-step sequence into 500-step sub-sequences, so my data would have shape (800, 500, 1). But can I tell the LSTM to still make sense of the larger sequences (keep its internal state across sub-sequences of the same larger sequence and re-initialize it when switching to a new sequence)?
I'd be happy if someone could shed some light over that process!
This question has been here for a while; not sure if you are still interested in my humble two cents. Here is the thing about the LSTM: as elegantly designed as it is, it still suffers from vanishing and/or exploding gradients. Adding a forget gate on the cell state alleviates the problem, because when the forget gate outputs 0, backpropagation stops "flowing" backwards at that gate. However, this does not mean the LSTM is free from exploding/vanishing gradients when you feed an arbitrarily long sequence into it. Just imagine a case where the forget gate outputs 1 fifty times in a row: the gradients for the last output under BPTT would look very similar to those of the original RNN, since the gates are effectively "turned off".
Why would that happen? Because your initial set of parameters is likely not optimal, and because a longer sequence requires BPTT through more time steps, which makes the chances of that happening higher. You can try training your LSTM on your segmented data subsets.
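If you want to carry the LSTM state across the 500-step windows of one long sequence and reset it between sequences, a stateful model is one way. Below is a rough sketch assuming train_data has shape (4, 100000, 1), train_labels holds one one-hot ID per sequence, and FEATURES_NUMBER / N_CLASSES are defined in your setup; the Dense softmax head is my assumption about the label format, not part of your original model.
from keras import models, layers

SUB_LEN = 500

model = models.Sequential()
# stateful=True needs a fixed batch size, so declare it up front (1 sequence at a time)
model.add(layers.Embedding(FEATURES_NUMBER, 30, batch_input_shape=(1, SUB_LEN)))
model.add(layers.LSTM(32, stateful=True))
model.add(layers.Dense(N_CLASSES, activation="softmax"))
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["acc"])

for epoch in range(10):
    for seq, label in zip(train_data, train_labels):        # one long sequence and its ID
        for window in seq.reshape(-1, SUB_LEN):              # (200, 500): windows of 500 steps
            # the cell state is carried over from the previous window of this sequence
            model.train_on_batch(window.reshape(1, SUB_LEN), label.reshape(1, -1))
        model.reset_states()                                  # fresh state before the next sequence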
I am attempting to use Keras to build an activity classifier from accelerometer signals. However, I am experiencing extreme overfitting of the data even with the simplest of models.
The input data has shape (10, 3) and contains roughly 0.1 seconds of data from the accelerometer in 3 dimensions. The model is simply:
from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=(10, 3)))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
The model should output the label [1,0] for walking activities and [0,1] for non-walking activities. After training I get 99.8% accuracy (if only it was real...). When I attempt to predict on data that wasn't used for training, I get 50% accuracy, verifying that the net isn't really "learning" anything except to predict a single class value.
The data is being prepared from 100 Hz triaxial accelerometer signals. I am not preprocessing the data in any way, except for windowing it into bins of length 10 that overlap the previous bin by 50%. What measures can I take to make the network produce actual predictions? I have tried increasing the window size, but the results remain the same. Any advice/general tips are greatly appreciated.
Ian
Try adding some hidden layers and dropout layers to your network. You could create a simple Multi Layer Perceptron (MLP) with a couple of extra lines in between your Flatten layer and Dense layer:
model.add(Dense(64, activation='relu'))  # input size is inferred from the Flatten layer above
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
Or check out this guide, which explains how to create a simple MLP.
Without any hidden layers your model will not actually be "learning" from the input data; rather, it will just be mapping the input features directly onto the output features.
The more layers you add, the more intermediate features and patterns it should extract from the input data, which should lead to better model predictions on test data. There will be a lot of trial and error in designing the best model, as too many layers can result in overfitting.
You have not provided information about how you train the model, so that may also be the cause of the issue. You must ensure that the data is split into training, validation, and test sets. Some possible split ratios for training, validation, and test data are 60%:20%:20% or 70%:15%:15%; this is ultimately something you must decide.
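A quick sketch of one such split (70%/15%/15%) with scikit-learn; X and y here stand for your windowed accelerometer data and labels, and the epoch count is arbitrary.
from sklearn.model_selection import train_test_split

# carve out 70% for training, then split the remaining 30% evenly
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)

model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=20)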
The problem of overfitting was caused by the input data type. The values passed to the classifier should have been float values with 2 decimal places. Somewhere along the way, some of these values had been augmented and had significantly more than 2 decimal places. That is, the input should have looked like
[9.81, 10.22, 11.3]
but instead looked like
[9.81000000012, 10.220010431, 11.3000000101]
The classifier was making its predictions based on this artifact, which is obviously not the desired behavior! Lesson learned - make sure the data preparation is consistent! Thanks to @umutto for the suggestion of random forests; the simple structure was helpful for diagnostic purposes.
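For anyone hitting the same issue: assuming the features sit in a NumPy array X, rounding during data preparation keeps every value at the intended precision, e.g.:
import numpy as np

X = np.round(X, decimals=2)   # e.g. 9.81000000012 -> 9.81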
I'm getting started with machine learning tools and I'd like to learn more about what the heck I'm doing. For instance, the script:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, BatchNormalization
from keras.initializers import RandomUniform
import numpy
numpy.random.seed(13)
RandomUniform(seed=13)
model = Sequential()
model.add(Dense(6, input_dim=6))
model.add(BatchNormalization())
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(11))
model.add(Activation('tanh'))
model.add(Dropout(0.01))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(optimizer='sgd', loss='mean_absolute_error', metrics=['accuracy'])
data = numpy.loadtxt('train', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
model.fit(X, Y, batch_size=1, epochs=1000)
data = numpy.loadtxt('test', delimiter=' ')
X = data[:, 0:6]
Y = data[:, 6]
score = model.evaluate(X, Y, verbose=1)
print ('\n\nThe error is:\n', score, "\n")
print('\n\nPrediction:\n')
Y = model.predict(X, batch_size=1, verbose=1)
print('\nResult:\n', Y, '\n')
It's a Frankenstein I made from some examples I found on the internet and I have many unanswered questions about it:
The file train has 60 rows. Is 1000 epochs too few? Is it too many? Can I get underfitting/overfitting?
What does the result I get from model.evaluate() mean? I know it's the loss but, if I get a [7.0506157875061035, 0.0], does it mean that my model has a 7% error?
And last, I'm getting predictions of 0.99875391, 0.99875391, 0.9362126, 0.99875391, 0.99875391, 0.99875391, 0.93571019 when the expected values were around 7.86, 3.57, 8.93, 6.57, 11.7, 8.53, 9.06, which means it's a really bad prediction. Clearly there's a lot I'm doing wrong. Could you guys give me a few pointers?
I know it all depends on the type of data I'm using, but is there anything I shouldn't do at all? Or maybe something I should be doing?
1
There is never a ready answer for how many epochs is a good number. It varies wildly depending on the size of your data, your model, and what you want to achieve. Normally, small models require fewer epochs and bigger models require more. Yours seems small enough, and 1000 epochs seems way too many.
It also depends on the learning rate, a parameter given to the optimizer that defines how large the steps are that your model takes when updating its weights. A bigger learning rate means fewer epochs, but there is a chance you simply never find a good point because you keep adjusting the weights past it. A smaller learning rate means more epochs and better learning.
Normally, if the loss stops decreasing, you're approaching a point where further training is not useful anymore. (Of course, there may be problems with the model too; there is really no simple answer for this one.)
To detect overfitting you need, besides the training data (X and Y), another group of test data (say Xtest and Ytest, for instance).
Then you use it in model.fit(X,Y, validation_data=(Xtest,Ytest), ...)
Test data is not used for training; it's kept separate just to see whether your model can make good predictions on data it has never seen during training.
If the training loss goes down, but the validation loss doesn't, you're overfitting (roughly, your model is capable of memorizing the training data without really understanding it).
Underfitting, on the contrary, happens when you never achieve the accuracy you expect (of course we always wish for 100% accuracy with no mistakes, but good models typically land in the 90s, some applications go above 99%, and some do worse; again, it's very subjective).
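A small sketch of how to check this in practice, reusing the names above (the model, X, Y, Xtest, Ytest) and assuming matplotlib is available: fit with validation data, then compare the two loss curves from the returned History object.
import matplotlib.pyplot as plt

history = model.fit(X, Y, validation_data=(Xtest, Ytest), epochs=1000, batch_size=1)

plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')  # a widening gap suggests overfitting
plt.legend()
plt.show()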
2
model.evaluate() gives you the losses and the metrics you added in the compile method.
The loss value is something your model will always try to decrease during training. It roughly measures how far your model is from the exact values. There is no universal rule for what a loss value means; it could even be negative (though Keras usually uses positive losses). The point is: it must decrease during training, which means your model is improving.
The accuracy value measures how many of your model's predictions match the true values (Y). It seems your accuracy is 0%; your model is getting everything wrong. (You can see that from the values you typed.)
3
In your model, you used activation functions. These squash the results so they don't get too big, which avoids overflow problems, propagating numeric errors, etc.
It's very common to work with values within such bounds.
tanh - outputs values between -1 and 1
sigmoid - outputs values between 0 and 1
Well, if you use a sigmoid activation in the last layer, your model will never output 3, for instance. It tries, but the maximum value is 1.
What you should do is prepare your targets (Y) so they lie between 0 and 1. (This is the best approach for classification problems, and it is often done with images too.)
But if you actually want numerical values, then you should just remove the activation and let the output be free to reach higher values. (It all depends on what you want to achieve with your model)
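A small sketch of both options above; MinMaxScaler and the example target values are just placeholders for how you might apply the idea.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Option A: keep the sigmoid in the last layer and squeeze the targets into [0, 1];
# predictions are mapped back with scaler.inverse_transform(...).
Y = np.array([7.86, 3.57, 8.93, 6.57, 11.7, 8.53, 9.06])   # the expected values quoted above
scaler = MinMaxScaler()
Y_scaled = scaler.fit_transform(Y.reshape(-1, 1))
# ... model.fit(X, Y_scaled, ...) ... then Y_pred = scaler.inverse_transform(model.predict(X))

# Option B: replace the final Activation('sigmoid') with nothing (a plain Dense(1))
# so the model can output unbounded values.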
An epoch is a single pass through the full training set. In my mind 1000 seems like a lot, but you'd have to check for overfitting and evaluate the predictions. There are many ways of checking and controlling for overfitting in a model; if you understand the methods described here, coding them in Keras should be no problem.
According to the documentation .evaluate returns:
Scalar test loss (if the model has no metrics) or list of scalars (if the model computes other metrics)
So these are the evaluation metrics of your model; they tell you how good your model is given some notion of "good". Those metrics depend on the model and the type of data you've used. Some explanation of them can be found here and here. As mentioned in the documentation,
The attribute model.metrics_names will give you the display labels for the scalar outputs.
So you can know what metric you are looking at. It is easier to do that interactively through the console (ipython, bpython) or Jupyter notebook.
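For instance, a quick sketch (reusing the model, X and Y from your script) to see which number is which:
score = model.evaluate(X, Y, verbose=1)
print(list(zip(model.metrics_names, score)))   # e.g. [('loss', 7.05...), ('acc', 0.0)]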
I can't see your data, but if you are doing a classification problem, as suggested by metrics=['accuracy'], then loss='mean_absolute_error' doesn't make sense, since it is meant for regression problems. To learn more about those I refer you to here and here, which discuss classification and regression problems with Keras.
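If it helps, here is a minimal sketch of the two self-consistent setups; the toy models are just placeholders, not your architecture.
from keras.models import Sequential
from keras.layers import Dense

# Regression setup: linear output, a regression loss, no accuracy metric.
reg_model = Sequential([Dense(1, input_dim=6)])
reg_model.compile(optimizer='sgd', loss='mean_absolute_error')

# Classification setup: sigmoid output, a cross-entropy loss, accuracy is meaningful.
clf_model = Sequential([Dense(1, input_dim=6, activation='sigmoid')])
clf_model.compile(optimizer='sgd', loss='binary_crossentropy', metrics=['accuracy'])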
PS: question 3 is not related to software per se, but to the theoretical construct supporting the software. In such cases, I'd recommend asking them at Cross Validated.