When does Tensorflow update weights and biases? - python

When does tensorflow update weights and biases in the for loop?
Below is the code from TensorFlow's GitHub repository, mnist_softmax.py:
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
When does tensorflow update weights and biases?
1. Does it update them when running sess.run()?
2. If so, does that mean that, in this program, tf updates the weights and biases 1000 times?
3. Or does it update them only after finishing the whole for loop?
4. If 2. is correct, my next question is: does tf update the model using different training data every time (since it uses next_batch(100))? There are 1000*100 training data points in total, but each data point is considered only once. Am I correct, or did I misunderstand something?
5. If 3. is correct, is it weird that the model would be trained after just one update step?
I think I must be misunderstanding something. It would be really great if anyone could give me a hint or point me to some material.

1. It updates the weights every time you run the train_step.
2. Yes, it is updating the weights 1000 times in this program.
3. See above.
4. Yes, you are correct: it loads a mini-batch containing 100 points at once and uses it to compute gradients.
5. It's not weird at all. You don't necessarily need to see the same data again and again; all that is required is that you have enough data for the network to converge. You can iterate multiple times over the same data if you want, but since this model doesn't have many parameters, it converges in a single epoch.
Tensorflow works by creating a graph of the computations that are required for computing the output of a network. Each of the basic operations, like matrix multiplication, addition, or anything else you can think of, is a node in this computation graph. In the tensorflow mnist example that you are following, lines 40-46 define the network architecture:
x: placeholder
y_: placeholder
W: Variable - This is learnt during training
b: Variable - This is also learnt during training
The network is a simple linear (softmax regression) model where the prediction is made using y = W*x + b (see line 43).
Next, you configure the training procedure for your network. This code uses cross-entropy as the loss function to minimize (see line 57). The minimization is done using the gradient descent algorithm (see line 59).
At this point, your network is fully constructed. Now you need to run these nodes so that actual computation is performed (no computation has been performed up to this point).
In the loop where sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys}) is executed, tf computes the value of train_step, which causes the GradientDescentOptimizer to take one step towards minimizing the cross_entropy, and this is how training progresses.
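To make that concrete, here is a minimal TF1-style sketch of the whole example (closely following mnist_softmax.py, though the exact tutorial file may differ in details):
from tensorflow.examples.tutorials.mnist import input_data
import tensorflow as tf

mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

# Graph construction: nothing is computed yet, nodes are only added to the graph.
x = tf.placeholder(tf.float32, [None, 784])   # input images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot labels
W = tf.Variable(tf.zeros([784, 10]))          # learnt during training
b = tf.Variable(tf.zeros([10]))               # learnt during training
y = tf.matmul(x, W) + b                       # the model's logits

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_xs, batch_ys = mnist.train.next_batch(100)
        # Each run of train_step applies one gradient-descent update to W and b,
        # so the weights are updated 1000 times in total.
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})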

Related

Why is training loss oscillating up and down?

I am using the TF2 research object detection API with the pre-trained EfficientDet D3 model from the TF2 model zoo. During training on my own dataset I notice that the total loss is jumping up and down, for example from 0.5 to 2.0 a few steps later, and then back to 0.75.
So all in all this training does not seem to be very stable. I thought the problem might be the learning rate, but as you can see in the charts above, I set the LR to decay during the training; it goes down to a really small value of 1e-15, so I don't see how this can be the problem (at least in the 2nd half of the training).
Also, when I smooth the curves in Tensorboard, as in the 2nd image above, one can see the total loss going down, so the direction is correct, even though it's still at quite a high value. I would be interested in why I can't achieve better results with my training set, but I guess that is another question. First I would be really interested in why the total loss is going up and down so much during the whole training. Any ideas?
PS: The pipeline.config file for my training can be found here.
In your config it states that your batch size is 2. This is tiny and will cause a very volatile loss.
Try increasing your batch size substantially; try a value of 256 or 512. If you are constrained by memory, try increasing it via gradient accumulation.
Gradient accumulation is the process of synthesising a larger batch by combining the backwards passes from smaller mini-batches. You would run multiple backwards passes before updating the model's parameters.
Typically, a training loop would look like this (I'm using pytorch-like syntax for illustrative purposes):
for model_inputs, truths in iter_batches():
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
With gradient accumulation, you'll put several batches through and then update the model. This simulates a larger batch size without requiring the memory to actually put a large batch size through all at once:
accumulations = 10
for i, (model_inputs, truths) in enumerate(iter_batches()):
    predictions = model(model_inputs)
    loss = get_loss(predictions, truths)
    # The loss is often divided by `accumulations` here so that the accumulated
    # gradient matches the scale of one large batch.
    loss.backward()
    if (i + 1) % accumulations == 0:  # update once every `accumulations` mini-batches
        optimizer.step()
        optimizer.zero_grad()
Reading
https://medium.com/huggingface/training-larger-batches-practical-tips-on-1-gpu-multi-gpu-distributed-setups-ec88c3e51255
How to accumulate gradients in tensorflow?
https://towardsdatascience.com/how-to-easily-use-gradient-accumulation-in-keras-models-fa02c0342b60
Understanding accumulated gradients in PyTorch

Why do the magnitudes of the output during inference correlate with the batch size during training?

I have to say this might be one of the weirdest problems I've ever met.
I was implementing ResNet to perform 10-class classification over CIFAR-10 with tensorflow. Everything seemed to be fine during the training phase: the loss decreased steadily, and accuracy on the training set kept increasing to over 90%; however, the results were totally abnormal during inference.
I have analyzed my code very carefully and ruled out the possibility of making mistakes when feeding the data or saving/loading the model. So the only difference between the training phase and the test phase lies in batch normalization layers.
For the BN layers, I used tf.layers.batch_normalization directly and I thought I had paid attention to every pitfall in using tf.layers.batch_normalization.
Specifically, I've included the dependency for train_op as follows,
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    self.train_op = optimizer.minimize(self.losses)
Also, for saving and loading the model, I've specified var_list as tf.global_variables(). Moreover, I used training=True for training and training=False for test.
Nevertheless, the accuracy during inference was only around 10%, even when applied to the same data used for training. And when I output the last layer of the network (i.e., the 10-dimension vector input to softmax), I found that the magnitude of each item in the 10-dimension vector during training was always 1e0 or 1e-1, while for inference, it could be 1e4 or even 1e5. The strangest part was that I found the magnitude of the 10-dimension vector during inference correlated with the batch size used in training, i.e., the bigger the batch size, the smaller the magnitude.
Besides, I also found that the magnitudes of moving_mean and moving_variance of the BN layers correlated with the batch size too. But why is this even possible? I thought moving_mean is supposed to be the mean of the entire training population, and likewise for moving_variance. So why would they have anything to do with the batch size?
I think there must be something that I don't know about using BN with tensorflow. This problem is really gonna drive me crazy! I've never expected to deal with such a problem in tensorflow, considering how convenient it is to use BN with PyTorch!
The problem has been solved!
I read the source code of tensorflow. Based on my understanding, the value of momentum in tf.layers.batch_normalization should be 1 - 1/num_of_batches. The default value is 0.99, which means the default is most suitable when there are about 100 batches of training data.
I didn't find any documentation that mentions this. I hope this can be helpful to someone who has the same problem with BN in tensorflow!
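As an illustration of that rule of thumb, the momentum can be set explicitly when building the layer. This is only a sketch: num_batches, h and is_training are hypothetical names, and the "batches per epoch" interpretation follows the reasoning above rather than official documentation.
import tensorflow as tf

num_batches = 500                       # hypothetical: mini-batches per training epoch
bn_momentum = 1.0 - 1.0 / num_batches   # 0.998 here, instead of the default 0.99

h = tf.layers.batch_normalization(
    h,                      # the layer's input tensor
    momentum=bn_momentum,   # how slowly moving_mean / moving_variance are updated
    training=is_training)   # True during training, False at inference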

tf.estimator.LinearClassifier output weights interpretation

I am new to tensorflow and machine learning and I am training a tf.estimator.LinearClassifier on the classic MNIST data set.
After the training process I read the output weights and biases using classifier.get_variable_names() and I get:
"['global_step', 'linear/linear_model/bias_weights', 'linear/linear_model/bias_weights/part_0/Adagrad', 'linear/linear_model/pixels/weights', 'linear/linear_model/pixels/weights/part_0/Adagrad']"
My question is: what is the difference between linear/linear_model/bias_weights and linear/linear_model/bias_weights/part_0/Adagrad? They are both of the same size.
The only explanation I can imagine is that linear/linear_model/bias_weights and linear/linear_model/bias_weights/part_0/Adagrad represent respectively the weights at the beginning and at the end of the training process.
However, I'm not sure about that and I can't find anything on line.
linear/linear_model/bias_weights are your trained model weights.
linear/linear_model/bias_weights/part_0/Adagrad comes from you using the AdaGrad optimizer. The special feature of this optimizer is that it keeps a "memory" of past gradients and uses this to rescale the gradients at each training step. See the AdaGrad paper if you want to know more (very mathy).
The important part is that linear/linear_model/bias_weights/part_0/Adagrad stores this "memory". It is returned because it is technically a tf.Variable in your program; however, it is not an actual variable/weight in your model. Only linear/linear_model/bias_weights is. Of course the same holds for linear/linear_model/pixels/weights.
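For intuition, here is a minimal NumPy sketch of an AdaGrad-style update (illustrative only, not the tf.estimator internals); the running sum of squared gradients is the "memory" that the */Adagrad slot variable stores:
import numpy as np

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update; `accum` plays the role of the */Adagrad variable."""
    accum = accum + grad ** 2                   # accumulate squared gradients
    w = w - lr * grad / (np.sqrt(accum) + eps)  # per-parameter rescaled step
    return w, accum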

Trained neural network produces different predictions with same data (TensorFlow)

I have trained a neural network with TensorFlow. After training I saved it and loaded it again in a new .py file to avoid retraining by accident. As I was testing it with some extra data I found out that it predicts different things for the same data. Shouldn't it theoretically compute the same thing for the same data?
Some information
feed forward net
4 hidden layers with 900 neurons each
5000 training epochs
reached accuracy of ~80%
data was normalized using normalize from sklearn.preprocessing
cost function: tensorflow.nn.softmax_cross_entropy_with_logits
optimizer: tf.train.AdamOptimizer
I am giving my network the data as a matrix, the same way I did for training (each row containing a data sample, with as many columns as there are input neurons).
Out of ten prediction cycles with the same data my network produces different results in at least 2 cycles (the maximum observed so far is 4).
How can this be? In theory, all that happens are data-processing calculations of the form W_i*x_i + b_i. As my x_i, W_i and b_i no longer change, how come the prediction varies? Could there be a mistake in my model reloading routine?
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    result = sess.run(tf.argmax(prediction.eval(feed_dict={x: input_data}), 1))
    print(result)
So this was a really stupid mistake on my part. It now works fine when loading the model from a save. The problem was caused by the global variables initializer; if you leave it out, it will work fine. The previously found information may prove useful for someone, so I will leave it here. The solution is now:
saver = tf.train.Saver()
with tf.Session() as sess:
    saver.restore(sess, 'path to your saved file C:x/y/z/model/model.ckpt')
After this you can go on as usual. I do not really know why the variables initializer prevents this from working. As I see it, it should be something like: initialize all variables so that they exist, with random values, and then go to the saved file and use the values from there, but apparently something else happens...
So I have been doing some testing and found out the following about this problem.
As I was trying to reuse my created model, I had to use tf.global_variables_initializer(). By doing so it overwrote my imported graph and all the values became random, which explains the different network outputs. This still left me with a problem to solve: how do I load my network? The workaround I am currently using is far from optimal, but it at least allows me to use my saved model. TensorFlow allows one to give unique names to the functions and tensors used. By doing so I could access them through the graph:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph('path to .meta')
    saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
    graph = tf.get_default_graph()
    graph.get_tensor_by_name('name:0')
Using this method I could access all my saved values, but they were separated! It meant that I had one weight and one bias per operation used, which led to a bunch of new variables. If you do not know the names, use the following:
print(graph.get_all_collection_keys())
This prints the collection names (our variables are stored in collections):
print(graph.get_collection('name'))
This lets us access the collection and see what the names/keys for our variables are.
This led to another problem: I could no longer use my model, because the global variables initializer had overwritten everything. I therefore had to redefine the whole model manually with the weights and biases that I had retrieved previously.
Unfortunately, this is the only thing I could come up with. If anyone has a better idea, please let me know.
The whole (mistaken) setup looked like this:
# imports...
# placeholders for data...

def my_network(data):
    ## network definition with tf functions ##
    return output

def train_my_net():
    prediction = my_network(data)
    # cost function
    # optimizer
    with tf.Session() as sess:
        for epoch in range(num_epochs):  # however many epochs you want
            pass  # training routine
        # save

def use_my_net():
    prediction = my_network(data)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())  # <-- this overwrote the restored values
        saver = tf.train.import_meta_graph('path to .meta')
        saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
        print(sess.run(prediction.eval(feed_dict={placeholder: data})))
        graph = tf.get_default_graph()
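For completeness, a minimal corrected loading sketch might look like the following. It assumes the graph is rebuilt with the same my_network function and the same input placeholder x as during training, so the variable names match the checkpoint; the key point is that tf.global_variables_initializer() is never run after restoring:
def use_my_net_fixed(input_data):
    prediction = my_network(x)  # rebuild the same graph; x is the training placeholder
    saver = tf.train.Saver()
    with tf.Session() as sess:
        # No tf.global_variables_initializer() here: restore() assigns the
        # saved values to every variable in the rebuilt graph.
        saver.restore(sess, tf.train.latest_checkpoint('path to checkpoints'))
        return sess.run(prediction, feed_dict={x: input_data})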

Tensorflow: how it trains the model?

When working with Tensorflow, the first step is to build a dataflow graph and then run it in a session. In the examples I have practised with, such as the MNIST tutorial, the loss function and the optimizer are defined first, with the following code (and the MLP model is defined before that):
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1])) #define cross entropy error function
loss = tf.reduce_mean(cross_entropy, name='xentropy_mean') #define loss
optimizer = tf.train.GradientDescentOptimizer(learning_rate) #define optimizer
global_step = tf.Variable(0, name='global_step', trainable=False) #global step counter (not trainable)
train_op = optimizer.minimize(loss, global_step=global_step) #train operation in the graph
The training process:
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
for i in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
That's how Tensorflow does the training in this case. But my question is: how does Tensorflow know which weights it needs to train and update? In the training code, we only pass the output y to cross_entropy, but for the optimizer or the loss we didn't pass any information about the structure directly. In addition, we use a dictionary to feed batch data to train_step, but train_step doesn't use the data directly. How does Tensorflow know where to use these data as input?
My own guess is that all those variables and constants are stored as tensors, and operations such as tf.matmul() should be a "subclass" of Tensorflow's operation class (I haven't checked the code yet). There might be some mechanism for Tensorflow to recognise the relations among tensors (tf.Variable(), tf.constant()) and operations (tf.mul(), tf.div()...); I guess it could check a tf.xxxx() function's superclass to find out whether it is a tensor or an operation. This assumption raises my second question: should I use Tensorflow's tf.xxx functions as much as possible to make sure Tensorflow builds a correct dataflow graph, even when they are sometimes more complicated than normal Python methods, or when some functions are supported better in Numpy than in Tensorflow?
My last question is: is there any link between Tensorflow and C++? I heard someone say that Tensorflow is faster than plain Python because it uses C or C++ as a backend. Is there any mechanism to transform Tensorflow Python code into C/C++?
I'd also be grateful if someone could share some debugging habits for coding with Tensorflow, since currently I just set up some terminals (Ubuntu) to test each part/function of my code.
You do pass information about your structure to Tensorflow when you define your loss with:
loss = tf.reduce_mean(cross_entropy, name='xentropy_mean')
Notice that with Tensorflow you build a graph of operations, and every operation you use in your code is a node in the graph.
When you define your loss you are passing the operation stored in cross_entropy, which depends on y_ and y. y_ is a placeholder for your labels, whereas y is the result of y = tf.nn.softmax(tf.matmul(x, W) + b). See where I am going? The operation loss contains all the information it needs to build the model and process the input, because it depends on the operation cross_entropy, which depends on y_ and y, which in turn depend on the input x and the model weights W.
So when you call
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
Tensorflow knows perfectly well which operations need to be computed to run train_step, and it knows exactly where in the operations graph to put the data you are passing through feed_dict.
As for how Tensorflow knows which variables should be trained, the answer is easy: it trains every tf.Variable() in the operations graph that is trainable. Notice how, when you define global_step, you set trainable=False because you don't want to compute gradients with respect to that variable.
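If you want to see this explicitly, minimize() is essentially compute_gradients() followed by apply_gradients(), and the gradients are taken with respect to the trainable variables. A small sketch (TF1-style, reusing the names from the tutorial code above):
optimizer = tf.train.GradientDescentOptimizer(0.5)
# compute_gradients returns (gradient, variable) pairs for the trainable variables;
# gradients flow to the ones the loss depends on (here W and b), while
# global_step is excluded because it was created with trainable=False.
grads_and_vars = optimizer.compute_gradients(cross_entropy)
train_step = optimizer.apply_gradients(grads_and_vars, global_step=global_step)

print([v.name for v in tf.trainable_variables()])  # lists W and b, but not global_step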
