How to update Keras LSTM weights to avoid concept drift - python

I'm trying to update a Keras LSTM to avoid concept drift. For that I'm following the approach proposed in this paper [1], in which an anomaly score is computed and used to update the network weights. In the paper they use the L2 norm to compute the anomaly score and then update the model weights. As stated in the paper:
RNN Update: The anomaly score a_t is then used to update the network weights W_{t-1} to obtain W_t using backpropagation through time (BPTT):
W_t = W_{t-1} - η ∇a_t(W_{t-1}), where η is the learning rate.
I'm trying to update the LSTM network weights, and although I have seen some improvement in the model's performance for forecasting multi-step-ahead multi-sensor data, I'm not sure whether the improvement comes from the updates handling the concept drift or simply from the model being refitted with the newest data.
Here is an example model:
model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(n_neurons, input_shape=(n_seq, n_features)))
model.add(tf.keras.layers.Dense(n_pred_seq * n_features))
model.add(tf.keras.layers.Reshape((n_pred_seq, n_features)))
model.compile(optimizer='adam', loss='mse')
And here is how I'm updating the model:
from math import sqrt
from sklearn.metrics import mean_squared_error

y_pred = model.predict_on_batch(x_batch)
up_y = data_y[i:i + 1]  # keep the batch dimension for fit()
# Anomaly score a_t: RMSE between the observed and the predicted window
a_score = sqrt(mean_squared_error(data_y[i,].flatten(), y_pred[0].flatten()))
w = model.layers[0].get_weights()  # only get weights for the LSTM layer
for l in range(len(w)):
    w[l] = w[l] - (w[l]*0.001*a_score)  # 0.001 = learning rate
model.layers[0].set_weights(w)
model.fit(x_batch, up_y, epochs=1, verbose=1)
model.reset_states()
I'm wondering whether this is the correct way to update the LSTM network and how BPTT is applied after updating the weights.
P.S.: I have also seen other methods to detect concept drift, such as the ADWIN method from the skmultiflow package, but I found this one especially interesting because it also deals with anomalies: it updates the model slightly when new data with concept drift comes in and almost ignores the updates when anomalous data comes in.
[1] Saurav, S., Malhotra, P., TV, V., Gugulothu, N., Vig, L., Agarwal, P., & Shroff, G. (2018, January). Online Anomaly Detection with Concept Drift Adaptation using Recurrent Neural Networks. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 78-87). ACM.

I personally think it's a valid method. Whether updating the network weights this way is appropriate depends on what you're doing, but the way you do it looks fine.
Maybe another way to do it is to implement your own loss function and embed the anti-drift parameter into it, but it might be a little complicated.
Regarding BPTT, I think it's applied as normal, but you have different "starting points": the weights you've just updated.
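Regarding the custom-loss idea above, here is a minimal sketch of one way it could look, assuming you keep a running anomaly score in a non-trainable tf.Variable; the variable name and the down-weighting scheme are only illustrative, not the paper's exact formulation:
import tensorflow as tf

# Hypothetical: a running anomaly score kept in a non-trainable variable
# so the loss function can read the latest value.
anomaly_score = tf.Variable(0.0, trainable=False, dtype=tf.float32)

def drift_aware_mse(y_true, y_pred):
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    # Down-weight the loss when the current anomaly score is high, so that
    # clearly anomalous samples barely move the weights (illustrative scheme).
    return mse / (1.0 + anomaly_score)

model.compile(optimizer='adam', loss=drift_aware_mse)

# Before each incremental fit, push in the latest score:
# anomaly_score.assign(a_score)
# model.fit(x_batch, up_y, epochs=1, verbose=0)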

Looking at the second block of your code, I believe you are not calculating the gradient properly. Specifically, the update w[l] = w[l] - (w[l]*0.001*a_score) seems wrong to me: here you are multiplying the weights by the anomaly score. However, the original update equation W_t = W_{t-1} - η ∇a_t(W_{t-1}) means taking the gradient of the loss a_t with respect to W_{t-1}; it does not mean multiplying a_t by W_{t-1}.
To apply the online update correctly, you just need to sample your stream sequentially and apply model.fit() as usual.
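For illustration, a minimal sketch of such an online loop; the GradientTape variant follows the paper's update W_t = W_{t-1} - η ∇a_t(W_{t-1}) more literally. The arrays data_x/data_y mirror the question's code and are otherwise assumptions:
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)  # eta in the paper

for i in range(len(data_x)):
    x_batch = data_x[i:i + 1]   # one window from the stream
    y_true = data_y[i:i + 1]

    with tf.GradientTape() as tape:
        y_pred = model(x_batch, training=True)
        # Anomaly score a_t as a differentiable loss (here mean squared error);
        # BPTT propagates its gradient through the unrolled LSTM.
        a_t = tf.reduce_mean(tf.square(tf.cast(y_true, y_pred.dtype) - y_pred))

    grads = tape.gradient(a_t, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    # Or, for the plain "just call fit on the newest sample" route:
    # model.fit(x_batch, y_true, epochs=1, verbose=0)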
Hope this helps.

Related

Keras GradientTape: Calculating gradients with respect to the output node

For starters: this question does not ask for help regarding reinforcement learning (RL); RL is only used as an example.
The Keras documentation contains an example actor-critic reinforcement learning implementation using Gradient Tape. Basically, they've created a model with two separate outputs: one for the actor (n actions) and one for the critic (1 reward). The following lines describe the backpropagation process (found somewhere in the code example):
# Backpropagation
loss_value = sum(actor_losses) + sum(critic_losses)
grads = tape.gradient(loss_value, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
Despite the fact that the actor and critic losses are calculated differently, they sum up those two losses to obtain the final loss value used for calculating the gradients.
When looking at this code example, one question came to my mind: is there a way to calculate the gradients of the output layer with respect to the corresponding losses, i.e. calculate the gradients of the first n output nodes based on the actor loss and the gradient of the last output node using the critic loss? To my understanding, this would be much more convenient than adding both (different!) losses and updating the gradients based on this cumulative approach. Do you agree?
Well, after some research I found the answer myself: it is possible to extract the trainable variables of a given layer based on the layer name. Then we can apply tape.gradient and optimizer.apply_gradients to the extracted set of trainable variables. My current solution is pretty slow, but it works; I just need to figure out how to improve its runtime.
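For illustration, a rough sketch of that approach, assuming the two heads are Dense layers named 'actor_head' and 'critic_head' and that the loss computations are wrapped in helper functions (all of these names are hypothetical):
import tensorflow as tf

# Hypothetical layer names; adapt to how your model's output layers are named.
actor_vars = model.get_layer('actor_head').trainable_variables
critic_vars = model.get_layer('critic_head').trainable_variables

with tf.GradientTape(persistent=True) as tape:
    action_probs, critic_value = model(state_batch)
    actor_loss = compute_actor_loss(action_probs)    # assumed helper
    critic_loss = compute_critic_loss(critic_value)  # assumed helper

# Each loss only produces gradients for its own head's variables.
actor_grads = tape.gradient(actor_loss, actor_vars)
critic_grads = tape.gradient(critic_loss, critic_vars)
optimizer.apply_gradients(zip(actor_grads, actor_vars))
optimizer.apply_gradients(zip(critic_grads, critic_vars))
del tape  # a persistent tape must be released manually

Note that this only updates the head layers; any shared trunk layers would still need a combined loss, or two separate gradient calls over the trunk variables.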

Predicting sine with ANN using Keras [duplicate]

This question already has an answer here:
Is deep learning bad at fitting simple non linear functions outside training scope (extrapolating)?
(1 answer)
Closed 4 years ago.
I am new to machine learning so I will apologize in advance if this question is somewhat recurrent, as I haven't been able to find a satisfying answer to my problem.
As a pedagogical exercise I have been trying to train an ANN to predict a sine wave. My problem is that although my neural network learns the shape of the sine accurately on the training data, it somewhat fails to do so on the validation set and for larger inputs. I start by defining my input and output as
import numpy as np

x = np.arange(800).reshape(-1,1) / 50
y = np.sin(x)/2
The rest of the code goes as
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(20, input_shape=(1,), activation='tanh', use_bias=True))
model.add(Dense(20, activation='tanh', use_bias=True))
model.add(Dense(1, activation='tanh', use_bias=True))
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.005), metrics=['mean_squared_error'])
history = model.fit(x, y, validation_split=0.2, epochs=2000, batch_size=400, verbose=0)
I then devise a test, which is defined as
x1= np.arange(1600).reshape(-1,1) / 50
y1 = np.sin(x1)/2
prediction = model.predict(x1, verbose=1)
So the problem is that the ANN clearly starts to fail on the validation set and cannot predict a continuation of the sine wave.
Weird behaviour for validation set:
Failure to predict anything other than the training set:
So, what am I doing wrong? Is the ANN incapable of continuing the sine wave? I have tried to fine-tune most of the available parameters without success. Most of the FAQs regarding similar issues point to overfitting, but I haven't been able to solve this.
Congratulations on stumbling on one of the fundamental issues of deep learning on your first try :)
What you did is correct, and, indeed, the ANN (in its current form) is incapable of continuing the sine wave.
However, you can see signs of overfitting in your MSE graph, starting around epoch 800, when the validation error starts to increase.
As pointed out above, your NN is not capable of "grasping" the cyclic nature of your data.
You can think of a DNN made only of dense layers as a smarter version of linear regression: the reason to use a DNN is to have high-level non-linear abstract features that can be "learned" by the network itself, instead of engineering features by hand. On the other hand, these features are mostly hard to describe and understand.
So, in general, DNNs are good at predicting unknown points "in the middle"; the farther your x is from the training set, the less accurate the prediction will be. Again, in general.
To predict things that are cyclic in nature, you should either use a more sophisticated architecture or pre-process your data, e.g. by exploiting its "seasonality" or "base frequency".
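For example, here is a minimal preprocessing sketch, assuming the base frequency of the signal is known or has been estimated: feeding the network sin/cos of the phase instead of the raw index maps inputs outside the training range back onto values it has already seen.
import numpy as np

idx = np.arange(1600).reshape(-1, 1)    # raw index, as in the question
y = np.sin(idx / 50) / 2

omega = 1.0 / 50                        # assumed/estimated base angular frequency

# Cyclic encoding of the input: an index far outside the training range
# still lands on sin/cos values the network has seen during training.
features = np.hstack([np.sin(omega * idx), np.cos(omega * idx)])

# model.fit(features[:800], y[:800], ...)   # train on the first part only
# model.predict(features[800:])             # extrapolation becomes interpolation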

Python Keras LSTM learning converges too fast on high loss

This is more of a deep learning conceptual problem, and if this is not the right platform I'll take it elsewhere.
I'm trying to use a Keras LSTM sequential model to learn sequences of text and map them to a numeric value (a regression problem).
The thing is, the learning always converges too quickly to a high loss (both training and testing). I've tried all possible hyperparameters, and I have a feeling it's a local minimum issue that causes the model's high bias.
My questions are basically :
How to initialize weights and bias given this problem?
Which optimizer to use?
How deep should I make the network? (I'm afraid that if I use a very deep network, the training time will be unbearable and the model variance will grow.)
Should I add more training data?
Input and output are normalized with minmax.
I am using SGD with momentum, currently 3 LSTM layers (126,256,128) and 2 dense layers (200 and 1 output neuron)
I have printed the weights after a few epochs and noticed that many weights are zero and the rest basically have a value of 1 (or very close to it).
Here are some plots from TensorBoard:
Fast convergence to a very high loss could possibly mean you are facing an exploding-gradients problem. Try a much lower learning rate like 1e-5 or 1e-6. You can also try techniques like gradient clipping to limit your gradients in the case of high learning rates.
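In Keras both of these can be set directly on the optimizer; a sketch using SGD with momentum, as in the question:
from tensorflow.keras.optimizers import SGD

# Much lower learning rate plus clipping of the global gradient norm.
opt = SGD(learning_rate=1e-5, momentum=0.9, clipnorm=1.0)
# Alternatively, clipvalue=0.5 clips each gradient element individually.

model.compile(optimizer=opt, loss='mse')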
Answer 1
Another reason could be the initialization of the weights; try the three methods below:
The method described in this paper (He initialization): https://arxiv.org/abs/1502.01852
Xavier (Glorot) initialization
Random initialization
In many cases the first initialization method works best.
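In Keras these roughly correspond to the he_normal (the paper above), glorot_uniform (Xavier, the default for most layers) and random_normal initializers; a sketch:
from tensorflow.keras.layers import Dense, LSTM

# 1. He initialization (https://arxiv.org/abs/1502.01852)
Dense(200, activation='relu', kernel_initializer='he_normal')

# 2. Xavier/Glorot initialization (the Keras default)
Dense(200, activation='tanh', kernel_initializer='glorot_uniform')

# 3. Plain random (normal) initialization
LSTM(128, kernel_initializer='random_normal')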
Answer 2
You can try different optimizers like
Momentum optimizer
SGD or Gradient descent
Adam optimizer
The choice of your optimizer should be based on the choice of your loss function. For example, for a logistic regression problem with MSE as the loss function, gradient-based optimizers may struggle to converge.
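Swapping optimizers in Keras is a one-line change at compile time; a quick sketch:
from tensorflow.keras.optimizers import SGD, Adam

# SGD with momentum (momentum optimizer / plain gradient descent)
model.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9), loss='mse')

# Adam
model.compile(optimizer=Adam(learning_rate=0.001), loss='mse')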
Answer 3
How deep or wide your network should be again depends entirely on which type of network you are using and what the problem is.
As you said, you are using a sequential model with LSTMs to learn sequences of text. Your choice of model is a good fit for this problem; you can also try 4-5 LSTM layers.
Answer 4
If your gradients are going to either 0 or infinity, you have vanishing or exploding gradients, or it may simply mean early convergence. Try gradient clipping with a proper learning rate and the first weight-initialization technique.
I am sure this will definitely solve your problem.
Consider reducing your batch_size.
With a large batch_size, it could be that at some point the gradient no longer finds enough variation in your data's stochasticity, and for that reason it converges earlier.

TensorFlow RandomForest vs Deep learning

I am using TensorFlow to train a model which has 1 output for 4 inputs. The problem is one of regression.
I found that when I use RandomForest to train the model, it quickly converges and also runs well on the test data. But when I use a simple neural network for the same problem, the loss (root mean square error) does not converge; it gets stuck at a particular value.
I tried increasing/decreasing the number of hidden layers and increasing/decreasing the learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field, but the literature that I have read so far strongly asserts that the neural network should work better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging, it means that the optimizer is stuck in a local minimum of your loss function.
I don't know what optimizer you are using but try increasing the momentum or even the learning rate slightly.
Another strategy employed often is the learning rate decay, which reduces your learning rate by a factor every several epochs. This can also help you not get stuck in a local minima early in the training phase, while achieving maximum accuracy towards the end of training.
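For illustration, a sketch of learning-rate decay in Keras, either baked into the optimizer or applied per epoch via a callback; the decay factors are just examples:
import tensorflow as tf

# Option 1: decay schedule baked into the optimizer.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=0.9)
opt = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

# Option 2: callback that halves the learning rate every 20 epochs.
lr_callback = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch, lr: lr * 0.5 if epoch and epoch % 20 == 0 else lr)

model.compile(optimizer=opt, loss='mse')
# model.fit(x, y, epochs=100, callbacks=[lr_callback])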
Otherwise you could try selecting an adaptive optimizer (Adam, Adagrad, Adadelta, etc.) that takes care of hyperparameter selection for you.
This is a very good post comparing different optimization techniques.
Deep neural networks need a significant amount of data to perform adequately. Be sure you have lots of training data or your model will overfit.
A useful rule when beginning to train models is not to start with the more complex methods: begin with, for example, a linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (initialize them from a normal distribution)
As a previous poster said, reduce the learning rate
Do some additional checking on the data: look for NaNs and outliers, as the current models could be more sensitive to noise. Remember: garbage in, garbage out (a quick check is sketched below).
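A quick data-sanity sketch for the last point, assuming the features and targets live in NumPy arrays X and y; the 5-sigma threshold is just an example:
import numpy as np

# Check for NaNs / infinities in inputs and targets.
print(np.isnan(X).any(), np.isinf(X).any())
print(np.isnan(y).any(), np.isinf(y).any())

# Flag rows lying more than 5 standard deviations from the column mean.
z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("potential outlier rows:", np.where((z > 5).any(axis=1))[0])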

Efficient way to know if an image is related to the dataset that was used to train a convolutional neural network

Currently I'm using VGG16 + Keras + Theano with the transfer learning methodology to recognize plant classes. It works just fine and gives me good accuracy. But the next problem I'm trying to solve is to find a way of identifying whether an input image contains a plant at all. I don't want to have another classifier to do it, because it's not really efficient.
So I did some searching and found that we can get activations from the last model layer (before the activation layer) and analyze them.
from keras import backend as K

model = util.load_model()  # VGG16 model
model.load_weights(path_to_weights)

def get_activations(m, layer, X_batch):
    x = [m.layers[0].input, K.learning_phase()]
    y = [m.get_layer(layer).output]
    act_fn = K.function(x, y)
    activations = act_fn([X_batch, 0])  # 0 = test phase
    # Trying to get some features from the activations
    # to understand how we can identify whether an image is relevant
    for l in activations[0]:
        not_nulls = [v for v in l if v > 0]
        # Percentage of activated neurons
        c1 = float(len(not_nulls)) / len(l)
        n_activated = len(not_nulls)
        print('c1: {}, n_activated: {}'.format(c1, n_activated))
    return activations

get_activations(model, 'the_latest_layer_name', inputs)
From the above code I've noticed that when we have a completely irrelevant image, the number of activated neurons is bigger than for images that contain plants:
For images that were used for model training, the percentage of activated neurons is 19%-23%
For images that contain unknown plant species: 20%-26%
For irrelevant images: 24%-28%
This is not really a good feature for deciding whether an image is relevant, as the percentage ranges overlap.
So, is there a good way to resolve this issue?
Thanks to Feras's idea in the comment above and after some trials, I've come up with a solution that solves this problem with accuracy of up to 99.99%.
Steps are:
Train your model on a dataset;
Store activations (see the method above for how to get them) by predicting relevant and non-relevant images using the model trained in the previous step. You should get the activations from the penultimate layer: for VGG16 it's the last of the two Dense(4096) layers, for InceptionV3 an extra penultimate Dense(1024) layer, and for ResNet50 an extra penultimate Dense(2048) layer.
Solve a binary classification problem using the stored activations. I've tried a simple flat NN and logistic regression. Both had good accuracy (the flat NN was a bit more accurate), but I chose logistic regression as it's simpler, faster and consumes less memory and CPU/GPU; a rough sketch is below.
This process should be repeated each time your model is retrained, since the final CNN weights are different each time, and what worked previously will not work the same way next time.
So as a result we have another small model for solving the problem.
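For illustration, a rough sketch of steps 2-3, assuming the penultimate-layer activations for relevant and irrelevant images have already been collected into two NumPy arrays (the array names are illustrative):
import numpy as np
from sklearn.linear_model import LogisticRegression

# relevant_acts / irrelevant_acts: activations gathered with the method above,
# e.g. shape (n_images, 4096) for VGG16's penultimate Dense layer.
X = np.vstack([relevant_acts, irrelevant_acts])
y = np.concatenate([np.ones(len(relevant_acts)), np.zeros(len(irrelevant_acts))])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# At inference time, extract the same activations for a new image and ask the
# small model whether the image is relevant to the training domain:
# is_relevant = clf.predict(new_activation.reshape(1, -1))[0] == 1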
