tf.estimator.LinearClassifier output weights interpretation

tf.estimator.LinearClassifier output weights interpretation - python

I am new to tensorflow and machine learning and I am training a tf.estimator.LinearClassifier on the classic MNIST data set.
After the training process I am reading the output weights and biases using classifier.get_variable_names() I get:
"['global_step', 'linear/linear_model/bias_weights', 'linear/linear_model/bias_weights/part_0/Adagrad', 'linear/linear_model/pixels/weights', 'linear/linear_model/pixels/weights/part_0/Adagrad']"
My question is: what is the difference between linear/linear_model/bias_weights and linear/linear_model/bias_weights/part_0/Adagrad? They are both of the same size.
The only explanation I can imagine is that linear/linear_model/bias_weights and linear/linear_model/bias_weights/part_0/Adagrad represent respectively the weights at the beginning and at the end of the training process.
However, I'm not sure about that and I can't find anything on line.

linear/linear_model/bias_weights are your trained model weights.
linear/linear_model/bias_weights/part_0/Adagrad comes from you using the AdaGrad optimizer. The special feature of this optimizer is that it keeps a "memory" of past gradients and uses this to rescale the gradients at each training step. See the AdaGrad paper if you want to know more (very mathy).
The important part is that linear/linear_model/bias_weights/part_0/Adagrad stores this "memory". It is returned because it is technically a tf.Variable in your program, however it is not an actual variable/weight in your model. Only linear/linear_model/bias_weights is. Of course the same holds for linear/linear_model/pixels/weights.

Related

Why does the magnitudes of output during inference correlate with the batch size during training?

I have to say this might be one of the weirdest problems I've ever met.
I was implementing ResNet to perform 10-classification over cifr-10 with tensorflow. Everything seemed to be fine with the training phase -- loss decreased steadily, and accuracy on training set kept increasing to over 90%, however, the results were totally abnormal during inference.
I have analyzed my code very carefully and ruled out the possibility of making mistakes when feeding the data or saving/loading the model. So the only difference between the training phase and the test phase lies in batch normalization layers.
For BN layers, I used tf.layers.batch_normalization directly and I thought I've paid attention to every pitfall in using tf.layers.batch_normalization.
Specifically, I've included the dependency for train_op as follows,
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
self.train_op = optimizer.minimize(self.losses)
Also, for saving and loading the model, I've specified var_list as tf.global_variables(). Moreover, I used training=True for training and training=False for test.
Nevertheless, the accuracy during inference was only around 10%, even when applied to the same data used for training. And when I output the last layer of the network (i.e., the 10-dimension vector input to softmax), I found that the magnitude of each item in the 10-dimension vector during training was always 1e0 or 1e-1, while for inference, it could be 1e4 or even 1e5. The strangest part was that I found the magnitude of the 10-dimension vector during inference correlated with the batch size used in training, i.e., the bigger the batch size, the smaller the magnitude.
Besides, I also found that the magnitudes of moving_mean and moving_variance of BN layers correlated with the batch size too, but why was this even possible? I thought moving_mean means the mean of the entire training population, and so was moving_variance. So why was there anything to do with the batch size?
I think there must be something that I don't know about using BN with tensorflow. This problem is really gonna drive me crazy! I've never expected to deal with such a problem in tensorflow, considering how convenient it is to use BN with PyTorch!

The problem has been solved!
I read the source code of tensorflow. Based on my understanding, the value of momentum in tf.layers.batch_normalization should be 1 - 1/num_of_batches. The default value is 0.99, which means the default value is most suitable when there are 100 batches in training data.
I didn't find any documents mentioned this. Hope this can be helpful to someone who would have the same problem with BN in tensorflow!

reconstruct inputs from outputs in regression neural networks in tensorflow

Say we train a multilayer NN in tensorflow for a regression task (i.e. multi input and multi output case). Then we have new instances and we apply the trained model and of course we get the corresponding outputs. Is there a way to backpropagate the outputs and reconstruct the inputs in tensorflow in an easy/efficient manner? What I am thinking is to then use the difference of the original and the reconstructed inputs of the new instances as a QC measure i.e. if the reconstructed inputs are not close enough to the originals then we have a problem etc. I hope I am making myself clear.

No, unfortunately you cannot take a trained model and try to get the corresponding input. The reason for this is that you have infinite possible solutions for each output.
Furthermore, backpropagation is not passing an output backwards through the network. Its the idea of determining what parameters in the model are contributing to what extent to loss function. This will not give the inputs to these hidden layers, but the extent at which the weights affected your decision.

should model.compile() be run prior to using model.load_weights(), if model has been only slightly changed say dropout?

With training & validation through a dataset for nearly 24 epochs, intermittently 8 epochs at once and saving weights cumulatively after each interval.
I observed a constant declining train & test-loss for first 16 epochs, post which the training loss continues to fall whereas test loss rises so i think it's the case of Overfitting.
For which i tried to resume training with weights saved after 16 epochs with change in hyperparameters say increasing dropout_rate a little.
Therefore i reran the dense & transition blocks with new dropout to get identical architecture with same sequence & learnable parameters count.
Now when i'm assigning previous weights to my new model(with new dropout) with model.load_weights() and compiling thereafter.
i see the training loss is even higher, that should be initially (blatantly with increased inactivity of random nodes during training) but later also it's performing quite unsatisfactory,
so i'm suspecting maybe compiling after loading pretrained weights might have ruined the performance?
what's reasoning & recommended sequence of model.load_weights() & model.compile()? i'd really appreciate any insights on above case.

The model.compile() method does not touch the weights in any way.
Its purpose is to create a symbolic function adding the loss and the optimizer to the model's existing function.
You can compile the model as many times as you want, whenever you want, and your weights will be kept intact.
Possible consequences of compile
If you got a model, well trained for some epochs, it's optimizer (depending on what type and parameters you chose for it) will also be trained for that specific epochs.
Compiling will make you lose the trained optimizer, and your first training batches might experience some bad results due to learning rates not suited to the current state of the model.
Other than that, compiling doesn't cause any harm.

How to increment training Theano saved models?

I have a trained model by Theano, and there are new training data I want to increase the mode, How could I do this?

Initialize the model with the pre-trained weights and perform gradient updates for the new examples, but you do have to take care of the learning rate and other parameters (depending on your optimizer). You may also try storing optimizer's parameter as well, initializing the optimizer with those values of parameters to make sure new training data does not drastically change the trained model.

Convolutional Neural Network accuracy with Lasagne (regression vs classification)

I have been playing with Lasagne for a while now for a binary classification problem using a Convolutional Neural Network. However, although I get okay(ish) results for training and validation loss, my validation and test accuracy is always constant (the network always predicts the same class).
I have come across this, someone who has had the same problem as me with Lasagne. Their solution was to setregression=True as they are using Nolearn on top of Lasagne.
Does anyone know how to set this same variable within Lasagne (as I do not want to use Nolearn)? Further to this, does anyone have an explanation as to why this needs to happen?

Looking at the code of the NeuralNet class from nolearn, it looks like the parameter regression is used in various places, but most of the times it affects how the output value and loss are computed.
In case of regression=False (the default), the network outputs the class with the maximum probability, and computes the loss with the categorical crossentropy.
On the other hand, in case of regression=True, the network outputs the probabilities of each class, and computes the loss with the squared error on the output vector.
I am not an expert in deep learning and CNN, but the reason this may have worked is that in case of regression=True and if there is a small error gradient, applying small changes to the network parameters may not change the predicted class and the associated loss, and may lead the algorithm to "think" that it has converged. But if instead you look at the class probabilities, small parameter changes will affect the probabilities and the resulting mean squared error, and the network will continue down this path which may eventually change the predictions.
This is just a guess, it is hard to tell without seeing the code and the dataset.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.