There is a paper "Shakeout: A New Approach to Regularized Deep Neural Network Training" which can be found here: http://ieeexplore.ieee.org/abstract/document/7920425/
This paper introduces a new regularization technique that can replace dropout layers while offering more functionality. I am working on a deep learning problem and want to implement the Shakeout technique for it, but the problem is that I could not fully understand the actual pipeline from the paper. There is a lot of mathematics that I am still struggling to understand.
So far, I have seen one open-source implementation, which is based on Caffe, but I am a new practitioner of deep learning and am just learning to use CNTK, so it is not feasible for me to start working with Caffe.
Has anyone implemented Shakeout in CNTK?
Or could someone provide pseudo-code for Shakeout?
Shakeout implementation on Caffe: https://github.com/kgl-prml/shakeout-for-caffe
Github Issue: https://github.com/kgl-prml/shakeout-for-caffe/issues/1
From a quick look at the paper, a dense layer combined with a shakeout layer would look like the following:
import cntk as C

def DenseWithShakeout(rate, c, outputs):
    # parameters of the underlying dense layer
    weights = C.parameter((C.InferredDimension, outputs), init=C.glorot_uniform())
    bias = C.parameter(outputs)
    def shakeout(x):
        r = C.dropout(x, rate)
        signs = weights / C.abs(weights)  # one day CNTK should add an actual sign operation
        return C.times(r, weights) + c * C.times(r - x, signs) + bias
    return shakeout
This can be used inside a C.layers.Sequential() statement, e.g.
model = C.layers.Sequential([C.layers.Dense(100), DenseWithShakeout(0.2, 1, 10)])
will create a two-layer network with a shakeout layer in the middle. Note, I have not actually tried this on any real problem.
I'm trying to update a Keras LSTM to avoid concept drift. For that, I'm following the approach proposed in this paper [1], in which an anomaly score is computed (using the L2 norm) and applied to update the network weights. As stated in the paper:
RNN Update: The anomaly score \alpha_t is then used to update the network W_{t-1} to obtain W_t using backpropagation through time (BPTT):
W_t = W_{t-1} - \eta \nabla \alpha_t(W_{t-1}), where \eta is the learning rate
I'm trying to update the LSTM network weights, and although I have seen some improvements in the model performance for forecasting multi-step-ahead multi-sensor data, I'm not sure whether the improvement comes from the updates dealing with concept drift or just from the model being refitted with the newest data.
Here is an example model:
import tensorflow as tf

model = tf.keras.Sequential()
model.add(tf.keras.layers.LSTM(n_neurons, input_shape=(n_seq, n_features)))
model.add(tf.keras.layers.Dense(n_pred_seq * n_features))
model.add(tf.keras.layers.Reshape((n_pred_seq, n_features)))
model.compile(optimizer='adam', loss='mse')
And here is the way in which I'm updating the model:
from math import sqrt
from sklearn.metrics import mean_squared_error

y_pred = model.predict_on_batch(x_batch)
up_y = data_y[i,]
a_score = sqrt(mean_squared_error(data_y[i,].flatten(), y_pred[0, :]))
w = model.layers[0].get_weights()  # only get weights for the LSTM layer
for l in range(len(w)):
    w[l] = w[l] - (w[l] * 0.001 * a_score)  # 0.001 = learning rate
model.layers[0].set_weights(w)
model.fit(x_batch, up_y, epochs=1, verbose=1)
model.reset_states()
I'm wondering whether this is the correct way to update the LSTM network, and how BPTT is applied after updating the weights.
P.S.: I have also seen other methods to detect concept drift, such as the ADWIN method from the skmultiflow package, but I found this one especially interesting because it also deals with anomalies: the model is updated slightly when new data with concept drift comes in, and the updates are almost ignored when anomalous data comes in.
[1] Saurav, S., Malhotra, P., TV, V., Gugulothu, N., Vig, L., Agarwal, P., & Shroff, G. (2018, January). Online Anomaly Detection with Concept Drift Adaptation using Recurrent Neural Networks. In Proceedings of the ACM India Joint International Conference on Data Science and Management of Data (pp. 78-87). ACM.
I personally think it's a valid method. How you update the network weights depends on what you're doing, and the way you're doing it seems fine.
Another way to do it might be to implement your own loss function and embed the anti-drift parameter into it, though that could get a little complicated.
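For example, a rough sketch of that idea (untested; alpha_t is a hypothetical non-trainable variable you would refresh with each new anomaly score):

import tensorflow as tf

# alpha_t holds the latest anomaly-based weighting; refresh it before each fit()
alpha_t = tf.Variable(1.0, trainable=False)

def drift_aware_mse(y_true, y_pred):
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    return alpha_t * mse  # a small alpha_t damps the update for anomalous data

model.compile(optimizer='adam', loss=drift_aware_mse)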
Regarding BPTT, I think it's applied as normal, but with different "starting points": the weights you've just updated.
Looking at the second block of your code, I believe you are not calculating the gradient properly. Specifically, the update w[l] = w[l] - (w[l]*0.001*a_score) seems wrong to me: here you are multiplying the weights by the anomaly score. However, the original gradient update equation, W_t = W_{t-1} - \eta \nabla \alpha_t(W_{t-1}), means computing the gradient of W_{t-1} using the loss \alpha_t; it does not mean multiplying \alpha_t by W_{t-1}.
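For illustration, a manual version of that update might look like the following sketch (assuming y_true holds the ground truth for the batch; this is not from the paper's code):

import tensorflow as tf

with tf.GradientTape() as tape:
    y_pred = model(x_batch, training=True)
    a_score = tf.reduce_mean(tf.square(y_true - y_pred))  # the loss \alpha_t
# gradient of \alpha_t with respect to the weights, not \alpha_t times the weights
grads = tape.gradient(a_score, model.trainable_variables)
for w, g in zip(model.trainable_variables, grads):
    w.assign_sub(0.001 * g)  # W_t = W_{t-1} - \eta * grad, with \eta = 0.001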
To apply the online update correctly, you just need to sample your stream sequentially and call model.fit() as usual, since fit() computes the gradients and applies the update for you.
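In other words, a loop like this already performs the W_t update from the paper (x_stream and y_stream are hypothetical iterables over your data stream):

for x_batch, y_batch in zip(x_stream, y_stream):
    # fit() computes the gradient of the MSE loss and applies
    # W_t = W_{t-1} - \eta * grad via BPTT internally
    model.fit(x_batch, y_batch, epochs=1, verbose=0)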
Hope this helps.
I would like to design a neural network for a multi-task deep learning task. Within the Keras API we can use either the Sequential or the Functional approach to build such a neural network. Below I provide the code I used to build a network with two outputs using both approaches:
Sequential
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

seq_model = Sequential()
seq_model.add(LSTM(32, input_shape=(10,2)))
seq_model.add(Dense(8))
seq_model.add(Dense(2))
seq_model.summary()
Functional
from tensorflow.keras.layers import Input
from tensorflow.keras.models import Model

input1 = Input(shape=(10,2))
lay1 = LSTM(32)(input1)
lay2 = Dense(8)(lay1)
out1 = Dense(1)(lay2)
out2 = Dense(1)(lay2)
func_model = Model(inputs=input1, outputs=[out1, out2])
func_model.summary()
When I look at the summary outputs for both models, each contains an identical number of trainable parameters:
Up until now this looks fine; however, I start doubting myself when I plot both models (using keras.utils.plot_model), which results in the following graphs:
Personally, I do not know how to interpret these. When using a multi-task learning approach, I want all neurons (in my case 8) of the layer before the output layer to connect to both output neurons. For me this clearly shows in the Functional API (where I have two Dense(1) instances), but it is not very clear from the Sequential API. Nevertheless, the number of trainable parameters is identical, suggesting that in the Sequential API the last layer is also fully connected to both neurons in the Dense output layer.
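A quick parameter-count check also supports this: the Sequential model's Dense(2) head has 8×2 weights + 2 biases = 18 parameters, and the Functional model's two Dense(1) heads have (8×1 + 1) × 2 = 18 parameters in total, which only works out if all 8 neurons feed each output.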
Could anybody explain to me the differences between those two examples, or are those fully identical and result in the same neural network architecture? Also, which one would be preferred in this case?
Thank you a lot in advance.
The difference between the Sequential and Functional Keras APIs:
The Sequential API allows you to create models layer-by-layer for most problems. It is limited in that it does not allow you to create models that share layers or have multiple inputs or outputs.
The Functional API allows you to create models that have a lot more flexibility, as you can easily define models where layers connect to more than just the previous and next layers. In fact, you can connect layers to (literally) any other layer. As a result, creating complex networks such as Siamese networks and residual networks becomes possible.
To answer your question:
No, these APIs are not the same, but it is normal that the number of trainable parameters is the same.
Which one to use? It depends on what you want to do with this network. What are you training for? What do you want the output to be?
I recommend this link to get a better grasp of the concept.
Sequential Models & Functional Models
I hope I helped you understand better.
Both models are (in theory) equivalent, as the two output nodes do not have any interaction between them.
It is just that the required outputs have a different shape: [(batch_size, 2)] vs. [(batch_size,), (batch_size,)], and thus the loss will be different.
The total loss is averaged for the sequential model in this example, whereas it is summed up for the functional model with two outputs (at least with a default loss such as MSE).
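If you want the two to match numerically (assuming the default 'mse' losses), one option is to down-weight each head's loss, e.g.:

# halve each head's MSE so the summed total equals the mean over both
# outputs that the sequential model computes
func_model.compile(optimizer='adam', loss=['mse', 'mse'],
                   loss_weights=[0.5, 0.5])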
Of course, you can also adapt the functional model to be exactly equivalent to the sequential model:
out1 = Dense(2)(lay2)
#out2 = Dense(1)(lay2)
func_model = Model(inputs=input1, outputs=out1)
Maybe you will also need some activations after the Dense layers.
Both networks are functionally equivalent. Dense layers are fully connected by definition, which is the most basic and simple design assumed for "normal" neural networks when nothing else is specified. The exact learned parameters and behavior may vary slightly between implementations. The graph is ambiguous only because it does not show the connections of the individual neurons (which may number in the millions); instead, it provides a symbolic representation of the connectivity under the layer's name (Dense), in this case indicating a fully connected layer.
I expect that the sequential model (or the equivalent functional model using one dense layer with two neurons as the output) would be faster, because it can use a simplified optimization path, but I have not tested this and I have no knowledge of the compile-time optimizations performed by TensorFlow.
I'm learning MXNet at the moment and working on a problem using neural nets. I'm interested in observing the curvature of my loss function with respect to the network weights, but as best I can tell, higher-order gradients are not supported for neural network functions at the moment. Is there any (possibly hacky) way that I could still do this?
You can follow the discussion here
The gist of it is that not all operators support higher order gradients at the moment.
In Gluon you can try the following:
import mxnet as mx

with mx.autograd.record():
    output = net(x)
    loss = loss_func(output)
    # [z] is the parameter(s) you want higher-order gradients for
    dz = mx.autograd.grad(loss, [z], create_graph=True)
dz[0].backward()  # now the actual parameters should have second-order gradients
Taken from this forum thread
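For a self-contained illustration, here is a minimal sketch I would expect to work, using sin, one of the operators that already has higher-order gradient support (untested):

import mxnet as mx
from mxnet import autograd, nd

x = nd.array([1.0, 2.0, 3.0])
x.attach_grad()
with autograd.record():
    y = nd.sin(x)  # dy/dx = cos(x), d2y/dx2 = -sin(x)
    dx = autograd.grad(y, [x], create_graph=True, retain_graph=True)[0]
dx.backward()      # differentiate the gradient itself
print(x.grad)      # approximately -sin(x), the second-order gradient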
Currently I'm using VGG16 + Keras + Theano through the transfer learning methodology to recognize plant classes. It works just fine and gives me good accuracy. But the next problem I'm trying to solve is finding a way of identifying whether an input image contains a plant at all. I don't want to have yet another classifier to do this, because it would not be really efficient.
So I did some searching and found that we can get the activations from the last model layer (before the activation layer) and analyze them.
from keras import backend as K

model = util.load_model()  # VGG16 model
model.load_weights(path_to_weights)

def get_activations(m, layer, X_batch):
    x = [m.layers[0].input, K.learning_phase()]
    y = [m.get_layer(layer).output]
    get_activations = K.function(x, y)
    activations = get_activations([X_batch, 0])
    # trying to get some features from the activations
    # to understand how we can identify whether an image is relevant
    for l in activations[0]:
        not_nulls = [x for x in l if x > 0]
        # percentage of activated neurons
        c1 = float(len(not_nulls)) / len(l)
        n_activated = len(not_nulls)
        print('c1:{}, n_activated:{}'.format(c1, n_activated))
    return activations

get_activations(model, 'the_latest_layer_name', inputs)
From the above code I've noticed that for a very irrelevant image, the number of activated neurons is bigger than for images that contain plants:
For images that were used for model training, the number of activated neurons is 19%-23%
For images that contain unknown plant species, 20%-26%
For irrelevant images, 24%-28%
This is not really a good feature for deciding whether an image is relevant, as the percentage ranges intersect.
So, is there a good way to resolve this issue?
Thanks to Feras's idea in the comment above. After some trials, I've come up with a solution that handles this problem with accuracy of up to 99.99%.
Steps are:
Train your model on a dataset;
Store activations (see the method above for how to get them) by predicting relevant and non-relevant images using the trained model from the previous step. You should get activations from the penultimate layer: for VGG16 it's the last of the two Dense(4096) layers, for InceptionV3 an extra penultimate Dense(1024) layer, and for ResNet50 an extra penultimate Dense(2048) layer.
Solve a binary classification problem using the stored activation data. I've tried a simple flat NN and logistic regression. Both had good accuracy (the flat NN was a bit more accurate), but I chose logistic regression as it's simpler, faster, and consumes less memory and CPU/GPU. A minimal sketch of this step follows below.
This process should be repeated each time your model is retrained, as the final CNN weights are different each time, so what worked previously will not work the next time.
So as a result, we have another small model for solving the problem.
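For reference, step 3 could be as simple as the following sketch (scikit-learn; X_act and y_rel are hypothetical arrays of stored activations and relevance labels):

from sklearn.linear_model import LogisticRegression

# X_act: penultimate-layer activations, e.g. shape (n_samples, 4096) for VGG16
# y_rel: 1 for relevant (plant) images, 0 for irrelevant ones
clf = LogisticRegression()
clf.fit(X_act, y_rel)

# at inference time, classify the activation vector of a new image
is_relevant = clf.predict(new_activation.reshape(1, -1))[0]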
I would like to design a deep net with one (or more) convolutional layers (CNN) and one or more fully connected hidden layers on top.
For deep networks with fully connected layers there are methods in Theano for unsupervised pre-training, e.g., using denoising autoencoders or RBMs.
My question is: How can I implement (in theano) an unsupervised pre-training stage for convolutional layers?
I do not expect a full implementation as an answer, but I would appreciate a link to a good tutorial or a reliable reference.
This paper describes an approach for building a stacked convolutional autoencoder. Based on that paper and some Google searches, I was able to implement the described network. Basically, everything you need is described in the Theano convolutional network and denoising autoencoder tutorials, with one crucial exception: how to reverse the max-pooling step in the convolutional network. I was able to work that out using a method from this discussion; the trickiest part is figuring out the right dimensions for W_prime, as these will depend on the feed-forward filter sizes and the pooling ratio. Here is my inverting function:
import numpy as np
import theano.tensor as T
from theano.tensor.nnet import conv
from theano.sandbox.neighbours import neibs2images

def get_reconstructed_input(self, hidden):
    """Computes the reconstructed input given the values of the hidden layer."""
    # a 'full' convolution with W_prime reverses the forward 'valid' convolution
    repeated_conv = conv.conv2d(input=hidden, filters=self.W_prime, border_mode='full')
    # repeat each value poolsize times to undo the max-pooling step
    multiple_conv_out = [repeated_conv.flatten()] * np.prod(self.poolsize)
    stacked_conv_neibs = T.stack(*multiple_conv_out).T
    stretch_unpooling_out = neibs2images(stacked_conv_neibs, self.pl, self.x.shape)
    rectified_linear_activation = lambda x: T.maximum(0.0, x)
    return rectified_linear_activation(stretch_unpooling_out + self.b_prime.dimshuffle('x', 0, 'x', 'x'))