Training loss not decreasing when training - tensorflow gpu - python

I am training a graph neural network on a cluster node with a single Titan RTX GPU. I am using tensorflow-gpu 1.15, and it recognizes the GPU successfully. The training involves tensor operations of type float64, and the training set consists of 256K sparse block-circulant matrices of moderate size. I evaluate 256 samples per run, and the batch size is set to 32.
When I look at the loss graph in TensorBoard, I notice that even after evaluating more than 100K samples (after 24 hours of training) my training loss is not decaying at all: it looks noisy and quite flat. This is the plot from TensorBoard:
The loss is measured as the Frobenius norm of an error matrix and is supposed to decay. I am using the Adam optimizer with a learning rate of 10^-3.
Any insights on why it is behaving like this? It is basically not learning anything.
I did a quick profiling to see which operations are the slowest, but cannot find something significant.
Could it be the GPU that I am using, with performance lost to the heavy memory allocation of float64? When I check the GPU usage, 60% of the memory is allocated (and I have the option to release it after each operation).
Any suggestion or tips?
I have been using:
Tensorflow-gpu 1.15,
CUDA 10.0.130,
NCCL 2.4.7-CUDA-10.0.130,
cuDNN 7.6.3-CUDA-10.0.130.
Running on a remote server with 4 Titan RTX GPUs (I am using 1 of them).

The tf.float64 type is not the problem as long as you select the correct optimizer. I ran this in version 1 compatibility mode, 'tf.compat.v1.disable_eager_execution()'.
Select the correct input data and target optimizer.
Select the correct tf.Variable.
Select the optimized equation or methods.
The input may require specific methods to convert it to tf.float64 in compatibility mode 'tf.compat.v1.disable_eager_execution()'.
Each session run uses the same input and updates the same variables, arrays, or feed_dict.
The purpose of the optimizer depends on whether you need to find similarities or to find the categories of a group.
Sample: a similarity scope looks for re-occurrence, whereas comparing all pixels treats even a small change as a difference.
import os
from os.path import exists

import tensorflow as tf
import matplotlib.pyplot as plt
from skimage.transform import resize
import numpy as np

# Version 1 compatibility (graph) mode, as noted above
tf.compat.v1.disable_eager_execution()

# Expected device list:
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
config = tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Variables
learning_rate = 0.1
global_step = 0
IMG_SIZE = (32, 32)
history = [ ]
history_Y = [ ]
list_file = [ ]
list_label = [ ]

for file in os.listdir("F:\\datasets\\downloads\\dark\\train") :
    image = plt.imread( "F:\\datasets\\downloads\\dark\\train\\" + file )
    image = resize(image, (32, 32))
    image = np.reshape( image, (1, 32, 32, 3) )
    list_file.append( image )

# Only the learning rate is set here; the other Adam arguments keep their defaults
optimizer = tf.compat.v1.train.AdamOptimizer(learning_rate=learning_rate)

var1 = tf.Variable(255.0, dtype=tf.dtypes.float64)
var2 = tf.Variable(10.0, dtype=tf.dtypes.float64)
X_var = tf.compat.v1.get_variable('X', dtype = tf.float64, initializer = tf.random.normal((1, 32, 32, 3), dtype=tf.dtypes.float64))
y_var = tf.compat.v1.get_variable('Y', dtype = tf.float64, initializer = tf.random.normal((1, 32, 32, 3), dtype=tf.dtypes.float64))
Z = tf.nn.l2_loss((var1 - X_var) ** 2 + (var2 - y_var) ** 2, name="loss")
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1)
loss = tf.reduce_mean(input_tensor=tf.square(Z))
training_op = optimizer.minimize(loss)
previous_train_loss = 0

with tf.compat.v1.Session() as sess:
    # Variables must be initialized before the first training step
    sess.run(tf.compat.v1.global_variables_initializer())
    image = list_file[0]
    X = image
    Y = image
    for i in range(1000):
        global_step = global_step + 1
        train_loss, temp = sess.run([loss, training_op], feed_dict={X_var:X, y_var:Y})
        history.append( train_loss )
        if global_step % 2 == 0 :
            var2 = var2 - 0.001
        if global_step % 4 == 0 and train_loss <= previous_train_loss :
            var1 = var1 - var2 + 0.5
        print( 'steps: ' + str(i) )
        print( 'train_loss: ' + str(train_loss) )
        previous_train_loss = train_loss

# Graph
history = history[:-1]
plt.plot(history, label='train_loss')
plt.legend(loc='lower right')
plt.show()
Without cosine similarity: all pixels are compared, so even a small change is treated as meaningful.
With cosine similarity: re-occurrences within a series are treated as belonging to the same thread.


Why does my Colab session run out of RAM?

I'm building a model for image deblurring based on the model described in this paper using Keras. I train the model on Colab using the following training code:
x_train, y_train = load_h5_dataset()

def train(batch_size=16, epoch_num=5, critic_updates=5, log_dir='drive/MyDrive/train_logs'):
    g = make_resnet_generator_model()
    d = make_discriminator_model()
    gan = make_gan(g, d)
    d_opt = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    gan_opt = Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
    d.trainable = True
    d.compile(optimizer=d_opt, loss=wasserstein_loss)
    d.trainable = False
    loss = [perceptual_loss, wasserstein_loss]
    loss_weights = [100, 1]
    gan.compile(optimizer=gan_opt, loss=loss, loss_weights=loss_weights)
    d.trainable = True
    output_true_batch, output_false_batch = np.ones((batch_size, 1)), -np.ones((batch_size, 1))
    writer = tf.summary.create_file_writer(log_dir)
    for epoch in tqdm(range(epoch_num)):
        print(f"Epoch {epoch + 1}/{epoch_num}...")
        permuted_indexes = np.random.permutation(x_train.shape[0])
        d_losses = []
        gan_losses = []
        x_train = dataset['sharp_img']
        for index in range(int(x_train.shape[0] / batch_size)):
            batch_indexes = permuted_indexes[index * batch_size:(index + 1) * batch_size]
            image_blur_batch = x_train[batch_indexes]
            image_full_batch = y_train[batch_indexes]
            generated_images = g.predict(x=image_blur_batch, batch_size=batch_size)
            for _ in range(critic_updates):
                d_loss_real = d.train_on_batch(image_full_batch, output_true_batch)
                d_loss_fake = d.train_on_batch(generated_images, output_false_batch)
                d_loss = 0.5 * np.add(d_loss_fake, d_loss_real)
            d.trainable = False
            gan_loss = gan.train_on_batch(image_blur_batch, [image_full_batch, output_true_batch])
            d.trainable = True
        write_logs(writer, ['d_loss', 'gan_loss'], [np.mean(d_losses), np.mean(gan_losses)], epoch)
        save_weights(d, g, epoch, int(np.mean(gan_losses)))
In the training code above, the perceptual loss is calculated using a VGG16 network pretrained on ImageNet. The function load_h5_dataset() loads a dataset saved as an .hdf5 file. I encounter two problems when executing this code:
When I run it on Colab, it keeps running out of RAM and the execution stops. However, the dataset is only 6GB, which is well below the RAM available on Colab.
When I run this code on my local machine (which has 16GB of RAM and an NVIDIA GeForce GTX 1660 Ti with 6GB of memory), I encounter this error: tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[16,256,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Conv2D]
Can someone have a look at my code and see what is going wrong here? Thank you very much.
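For scale, the size of the tensor named in the error message can be worked out directly from its shape. The sketch below assumes that "type float" means 4-byte float32 values; the helper name tensor_bytes is made up for illustration.

```python
from math import prod

def tensor_bytes(shape, bytes_per_element=4):
    """Memory needed for one dense tensor; float32 (4 bytes) is assumed."""
    return prod(shape) * bytes_per_element

# Shape taken from the ResourceExhaustedError message above.
size = tensor_bytes([16, 256, 128, 128])
print(size, "bytes =", size / 2**20, "MiB")   # 268435456 bytes = 256.0 MiB
```

A single activation of this shape takes 256 MiB, and training keeps many activations alive for backpropagation, so a 6GB card can fill up quickly at batch size 16.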
Can you check this issue?
You can also delete variables you no longer need:
del whatevervariable
and then that RAM will be freed.
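As a minimal sketch of what del does: removing the last reference to an object makes it eligible for collection, after which its memory can be reused. The class Blob and its size are hypothetical stand-ins for a large array; the weak reference is there only to observe whether the object is still alive.

```python
import gc
import weakref

class Blob:
    """Hypothetical stand-in for a large array held in RAM."""
    def __init__(self, n_bytes):
        self.data = bytearray(n_bytes)

blob = Blob(10 * 1024 * 1024)   # roughly 10 MB
probe = weakref.ref(blob)       # observes the object without keeping it alive

del blob                        # drop the only strong reference
gc.collect()                    # also clears any reference cycles

print("freed:", probe() is None)
```

This only helps if no other name (a list, a closure, a notebook output cache) still references the same object.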

Loss & accuracy don't improve in Xception (image classification)

As a trial, I'm implementing Xception in TensorFlow to classify images without using pretrained weights.
However, the accuracy is too low compared to the original paper.
Could somebody share any advice to address this problem?
I prepared 500 of the 1000 ImageNet classes and trained the ready-made Xception model on this data from scratch.
I tried the same learning rate and optimizer as used in the original paper.
– Optimizer: SGD
– Momentum: 0.9
– Initial learning rate: 0.045
– Learning rate decay: decay of rate 0.94 every 2 epochs
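The schedule in the list above can be written as a function of the epoch index. The sketch below is one reading of "decay of rate 0.94 every 2 epochs", namely a staircase that changes only every second epoch; the function names are illustrative, and a smooth variant that decays a little at every epoch is shown for comparison.

```python
INITIAL_LR = 0.045
DECAY_RATE = 0.94
DECAY_EVERY = 2  # epochs

def staircase_lr(epoch):
    """Learning rate that drops by DECAY_RATE once every DECAY_EVERY epochs."""
    return INITIAL_LR * DECAY_RATE ** (epoch // DECAY_EVERY)

def smooth_lr(epoch):
    """Same decay applied continuously, a little at every epoch."""
    return INITIAL_LR * DECAY_RATE ** (epoch / DECAY_EVERY)

for epoch in range(5):
    print(epoch, round(staircase_lr(epoch), 6), round(smooth_lr(epoch), 6))
```

The two variants agree on even epochs and differ slightly on odd ones, so the choice between them is unlikely to explain a large accuracy gap on its own.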
However, this did not work well.
I know it is better to use all 1000 classes rather than only 500; however, I couldn't prepare storage for them.
Did it affect the performance of my code?
Here is my code.
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras import layers, losses, models, optimizers, callbacks, applications, preprocessing

# scheduler
def scheduler(epoch, lr):
    return 0.045*0.94**(epoch/2.0)
lr_decay = callbacks.LearningRateScheduler(scheduler)

# early stopping
EarlyStopping = callbacks.EarlyStopping(monitor='val_loss', min_delta=0, patience=500, verbose=0, mode='auto', restore_best_weights=True)

# build xception
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.cast(inputs, tf.float32)
x = tf.keras.applications.xception.preprocess_input(x) #preprocess image
x = applications.xception.Xception(weights=None, include_top=False,)(x, training=True)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(nb_class)(x)
outputs = layers.Softmax()(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=optimizers.SGD(momentum=0.9, nesterov=True),
              loss = 'categorical_crossentropy',
              metrics= ['accuracy'])

# fitting data
history = model.fit(image_gen(df_train_chunk, 224, 224, ), #feed images with a generator
                    batch_size = 32,
                    steps_per_epoch = 64,
                    validation_data = image_gen(df_valid_chunk, 224, 224, ), #feed images with a generator
                    validation_steps = 64,
                    callbacks = [lr_decay, EarlyStopping],
                    )
My results are below. In the original paper, the accuracy reached around 0.8.
In contrast, the performance of my code is too poor.
Some might wonder whether my generator is wrong, so I put my generator code and its result below.
import numpy as np
from PIL import Image, ImageEnhance, ImageOps

def image_gen(df_data, h, w, shuffle=True):
    nb_class = len(np.unique(df_data['Class']))
    while True:
        if shuffle:
            df_data = df_data.sample(frac=1)
        for i in range(len(df_data)):
            X = Image.open((df_data.iloc[i]).loc['Path'])
            X = X.convert('RGB')
            X = X.resize((w,h))
            X = preprocessing.image.img_to_array(X)
            X = np.expand_dims(X, axis=0)
            klass = (df_data.iloc[i]).loc['Class']
            y = np.zeros(nb_class)
            y[klass] = 1
            y = np.expand_dims(y, axis=0)
            yield X, y

train_gen = image_gen(df_train_chunk, 224, 224, )
for i in range(5):
    X, y = next(train_gen)
    print('\n\n class: ', y.argmax(-1))
the result is below.
When you chose only 500 labels, did you choose the first 500?
The softmax output starts from 0, so make sure your labels also run from 0 to 499.
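One way to guarantee this, sketched here with made-up class ids, is to map whatever ids were selected onto consecutive integers before one-hot encoding. The helper name make_label_map is hypothetical.

```python
def make_label_map(class_ids):
    """Map arbitrary class ids to consecutive indices 0..N-1."""
    return {cid: idx for idx, cid in enumerate(sorted(set(class_ids)))}

# Made-up example: three classes picked from a larger label set.
raw_labels = [7, 412, 7, 998, 412]
label_map = make_label_map(raw_labels)
dense_labels = [label_map[c] for c in raw_labels]

print(label_map)      # {7: 0, 412: 1, 998: 2}
print(dense_labels)   # [0, 1, 0, 2, 1]
```

With ids remapped this way, an index such as y[klass] = 1 in a one-hot vector of length N can never fall outside the vector.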

Puzzling mxnet performance under Python vs Mathematica

I compared mxnet performance between Mathematica and Python and observed a difference of more than an order of magnitude. I would like advice on how to improve performance under Python.
My NN is an MLP for regression, with 3 float inputs, fully connected layers of 8, 16, 24, and 8 neurons, and 2 float outputs. Sigmoid is used everywhere except on the input and output neurons. The optimizer used in Mathematica is Adam, so I used it in Python too, with the same parameters. The training dataset contains 4215 records mapping xyY colors to Munsell Hue and Chroma.
Mathematica is version 11.2 released in 2017 and Mathematica uses mxnet under the hood for deep learning tasks. On the Python side, I use the latest release with mxnet-mkl and I checked that MKLDNN is enabled.
My Mathematica licence runs on an MS Surface Pro notebook with Windows 10, i7-7660U, 2.5GHz, 2 cores, 4 hyperthreads, AVX2. I ran Python on this computer for comparison.
Here are the times for learning loops of 32768 epochs at each batch size:
Batch size:   128    256    512    1024   2048   4096
Mathematica:  8m12s  5m14s  3m34s  2m57s  3m4s   3m48s
PythonMxNet:  286m   163m   93m    65m    49m    47m
I tried the mxnet environment variable optimization tricks suggested by Intel but only got 120% slower times.
I also switched all the arrays from float64 to float32, on the hypothesis that MKL could process twice as many operations in the same amount of time (excluding overhead, of course) with SIMD registers, but did not notice even a slight improvement.
The reason I switched my NN work from Mathematica to Python is that I wanted to train the NN on different and more powerful computers. I also don't like having my notebook tied up with NN learning tasks.
How should I interpret those results?
What may be the cause of those performance differences?
Is there anything I can do to gain some performance under Python?
Or is this simply the unavoidable overhead imposed by the Python interpreter?
The script for generating the NN:
def get_MunsellNet( Layers, NbInputs ):
    net = nn.HybridSequential()
    for l in range( len( Layers ) ):
        if l == 0:
            # first hidden layer declares the input width explicitly
            net.add( nn.Dense( Layers[ l ], activation = 'sigmoid', dtype = mu.DType, in_units = NbInputs ) )
        else:
            net.add( nn.Dense( Layers[ l ], activation = 'sigmoid', dtype = mu.DType ) )
    net.add( nn.Dense( 2, dtype = mu.DType ) )
    net.initialize( mx.init.Uniform(), ctx = ctx )
    return net
The NN is created with this:
mu.DType = 'f8'
NbInputs = 3
Norm = 'None' # Possible normalizer are: 'None', 'Unit', 'RMS', 'RRMS', 'Std'
Train_Dataset = mnr.Build_HCTrainData( NbInputs, Norm, Datasets = [ 'all.dat', 'fill.dat' ] )
Test_Dataset1 = mnr.Build_HCTestData( 'real.dat', NbInputs )
Test_Dataset2 = mnr.Build_HCTestData( 'test.dat', NbInputs )
Layers = [ 8, 16, 24, 8 ]
Net = mnn.get_MunsellNet( Layers, NbInputs )
Loss_Fn = mx.gluon.loss.L2Loss()
Learning_Rate = 0.0005
Optimizer = 'Adam'
Batch_Size = 4096
Epochs = 500000
And trained with this:
if __name__ == '__main__':
    global Train_Data_Loader
    Train_Data_Loader = mx.gluon.data.DataLoader( Train_Dataset, batch_size = Batch_Size, shuffle = True, num_workers = mnn.NbWorkers )
    Trainer = mx.gluon.Trainer( Net.collect_params(), Optimizer, {'learning_rate': Learning_Rate} )
    Estimator = estimator.Estimator( net = Net,
                                     loss = Loss_Fn,
                                     trainer = Trainer,
                                     context = mnn.ctx )
    LossRecordHandler = mnu.ProgressRecorder( Epochs, Test_Dataset1, Test_Dataset2, NbInputs, Net_Name, Epochs / 10 * 8 )
    for n in range( 10 ):
        Train_Metric = Estimator.prepare_loss_and_metrics()
        Net.initialize( force_reinit = True )
        # ignore warnings for nightly test on CI only
        with warnings.catch_warnings():
            warnings.simplefilter( "ignore" )
            Estimator.fit( train_data = Train_Data_Loader,
                           epochs = Epochs,
                           event_handlers = [ LossRecordHandler ] )
Let me know if you need more code clips.
In general, Python overhead can be large, especially when compute kernels are launched frequently with small inputs. The benchmark above gives some evidence that a large share of the gap is due to this overhead: as the batch size increases from 128 to 4096 (32x), the mxnet-to-Mathematica time ratio decreases from ~34.9 to ~12.4, which can be read as a speed-up from fewer invocations. I will update the answer once there are more details on this question.
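The ratios can be recomputed from the timing table in the question. This is a small sketch that converts every entry to seconds first; the helper name to_seconds is made up for illustration.

```python
def to_seconds(minutes, seconds=0):
    """Convert a minutes/seconds timing to seconds."""
    return minutes * 60 + seconds

batch_sizes = [128, 256, 512, 1024, 2048, 4096]
mathematica = [to_seconds(8, 12), to_seconds(5, 14), to_seconds(3, 34),
               to_seconds(2, 57), to_seconds(3, 4), to_seconds(3, 48)]
python_mxnet = [to_seconds(m) for m in (286, 163, 93, 65, 49, 47)]

# How many times slower the Python run is at each batch size.
ratios = [p / m for p, m in zip(python_mxnet, mathematica)]
for size, ratio in zip(batch_sizes, ratios):
    print(f"batch {size:5d}: {ratio:.1f}x slower")
```

The ratio falls steadily as the batch size grows, which is the pattern one would expect if a roughly fixed per-batch cost dominates the Python run.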

Speed of Logistic Regression on MNIST with Tensorflow

I am taking the CS 20SI: Tensorflow for Deep Learning Research course from Stanford. I have a question regarding the following code:
import time
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Step 1: Read in data
# using TF Learn's built in function to load MNIST data to the folder data/mnist
MNIST = input_data.read_data_sets("/data/mnist", one_hot=True)

# Batched logistic regression
learning_rate = 0.01
batch_size = 128
n_epochs = 25

X = tf.placeholder(tf.float32, [batch_size, 784], name = 'image')
Y = tf.placeholder(tf.float32, [batch_size, 10], name = 'label')
#w = tf.Variable(tf.random_normal(shape = [int(shape[1]), int(Y.shape[1])], stddev = 0.01), name='weights')
#b = tf.Variable(tf.zeros(shape = [1, int(Y.shape[1])]), name='bias')
w = tf.Variable(tf.random_normal(shape=[784, 10], stddev=0.01), name="weights")
b = tf.Variable(tf.zeros([1, 10]), name="bias")

logits = tf.matmul(X,w) + b
entropy = tf.nn.softmax_cross_entropy_with_logits( logits=logits, labels=Y)
loss = tf.reduce_mean(entropy) #computes the mean over examples in the batch
optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(loss)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    n_batches = int(MNIST.train.num_examples/batch_size)
    for i in range(n_epochs):
        start_time = time.time()
        for _ in range(n_batches):
            X_batch, Y_batch = MNIST.train.next_batch(batch_size)
            opt, loss_ = sess.run([optimizer, loss], feed_dict = {X: X_batch, Y:Y_batch})
        end_time = time.time()
        print('Epoch %d took %f'%(i, end_time - start_time))
This code performs logistic regression on the MNIST dataset. The author states:
Running on my Mac, the batch version of the model with batch size 128
runs in 0.5 second
However, when I run it, each epoch takes around 2 seconds, giving a total execution time of around a minute. Is it reasonable for this example to take that long? I currently have a Ryzen 1700 without OC (3.0GHz) and a GTX 1080 GPU without OC.
I tried this code on GTX Titan X (Maxwell) and got around 0.5 seconds per epoch. I would expect that GTX 1080 should be able to get similar results.
Try using the latest tensorflow and cuda/cudnn versions. Make sure no limiting environment variables are set (which GPUs are visible, how much memory tensorflow can use, etc.). You can try running a micro-benchmark to check that you can achieve the stated FLOPS of your card, e.g. Testing GPU with tensorflow matrix multiplication

How could I use batch normalization in TensorFlow?

I would like to use batch normalization in TensorFlow. I found the related C++ source code under core/ops/. However, I did not find it documented.
BN has different semantics in MLP and CNN, so I am not sure what exactly this BN does.
I did not find a method called MovingMoments either.
Update July 2016 The easiest way to use batch normalization in TensorFlow is through the higher-level interfaces provided in either contrib/layers, tflearn, or slim.
Previous answer if you want to DIY:
The documentation string for this has improved since the release - see the docs comment in the master branch instead of the one you found. It clarifies, in particular, that it's the output from tf.nn.moments.
You can see a very simple example of its use in the batch_norm test code. For a more real-world use example, I've included below the helper class and use notes that I scribbled up for my own use (no warranty provided!):
"""A helper class for managing batch normalization state.

This class is designed to simplify adding batch normalization
to your model by managing the state variables associated with it.

Important use note: The function get_assigner() returns
an op that must be executed to save the updated state.
A suggested way to do this is to make execution of the
model optimizer force it, e.g., by:

    update_assignments = tf.group(bn1.get_assigner(),
                                  bn2.get_assigner())
    with tf.control_dependencies([optimizer]):
        optimizer = tf.group(update_assignments)
"""
import tensorflow as tf
class ConvolutionalBatchNormalizer(object):
    """Helper class that groups the normalization logic and variables.

    Use:
        ewma = tf.train.ExponentialMovingAverage(decay=0.99)
        bn = ConvolutionalBatchNormalizer(depth, 0.001, ewma, True)
        update_assignments = bn.get_assigner()
        x = bn.normalize(y, train=training?)
        (the output x will be batch-normalized).
    """

    def __init__(self, depth, epsilon, ewma_trainer, scale_after_norm):
        # mean and variance hold running state, so they are not trained
        self.mean = tf.Variable(tf.constant(0.0, shape=[depth]),
                                trainable=False)
        self.variance = tf.Variable(tf.constant(1.0, shape=[depth]),
                                    trainable=False)
        self.beta = tf.Variable(tf.constant(0.0, shape=[depth]))
        self.gamma = tf.Variable(tf.constant(1.0, shape=[depth]))
        self.ewma_trainer = ewma_trainer
        self.epsilon = epsilon
        self.scale_after_norm = scale_after_norm

    def get_assigner(self):
        """Returns an EWMA apply op that must be invoked after optimization."""
        return self.ewma_trainer.apply([self.mean, self.variance])

    def normalize(self, x, train=True):
        """Returns a batch-normalized version of x."""
        if train:
            mean, variance = tf.nn.moments(x, [0, 1, 2])
            assign_mean = self.mean.assign(mean)
            assign_variance = self.variance.assign(variance)
            with tf.control_dependencies([assign_mean, assign_variance]):
                return tf.nn.batch_norm_with_global_normalization(
                    x, mean, variance, self.beta, self.gamma,
                    self.epsilon, self.scale_after_norm)
        else:
            mean = self.ewma_trainer.average(self.mean)
            variance = self.ewma_trainer.average(self.variance)
            local_beta = tf.identity(self.beta)
            local_gamma = tf.identity(self.gamma)
            return tf.nn.batch_norm_with_global_normalization(
                x, mean, variance, local_beta, local_gamma,
                self.epsilon, self.scale_after_norm)
Note that I called it a ConvolutionalBatchNormalizer because it pins the use of tf.nn.moments to sum across axes 0, 1, and 2, whereas for non-convolutional use you might only want axis 0.
Feedback appreciated if you use it.
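To illustrate the note about axes, the following pure-Python sketch computes a per-channel mean the way a reduction over axes 0, 1, and 2 of a batch-height-width-channel tensor would. It is only an illustration with a tiny made-up input, not the TensorFlow implementation, and the function name channel_means is hypothetical.

```python
def channel_means(batch):
    """Mean over batch, height and width for each channel of a nested
    list shaped [batch][height][width][channels]."""
    n_channels = len(batch[0][0][0])
    sums = [0.0] * n_channels
    count = 0
    for image in batch:
        for row in image:
            for pixel in row:
                count += 1
                for c, value in enumerate(pixel):
                    sums[c] += value
    return [s / count for s in sums]

# Made-up input: 2 images, 1x2 pixels each, 2 channels.
batch = [
    [[[1.0, 10.0], [3.0, 30.0]]],
    [[[5.0, 50.0], [7.0, 70.0]]],
]
print(channel_means(batch))  # [4.0, 40.0]
```

For a fully connected layer the input is shaped [batch][features], so the same statistic reduces over axis 0 only and yields one mean per feature.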
As of TensorFlow 1.0 (February 2017) there's also the high-level tf.layers.batch_normalization API included in TensorFlow itself.
It's super simple to use:
# Set this to True for training and False for testing
training = tf.placeholder(tf.bool)
x = tf.layers.dense(input_x, units=100)
x = tf.layers.batch_normalization(x, training=training)
x = tf.nn.relu(x)
...except that it adds extra ops to the graph (for updating its mean and variance variables) in such a way that they won't be dependencies of your training op. You can either just run the ops separately:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
sess.run([train_op, extra_update_ops], ...)
or add the update ops as dependencies of your training op manually, then just run your training op as normal:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(extra_update_ops):
train_op = optimizer.minimize(loss)
...
sess.run([train_op], ...)
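Why these update ops matter can be seen in a small simulation of the moving average that batch normalization keeps for inference. This is a plain-Python sketch, not TensorFlow code; the decay of 0.99 and the batch means are assumptions chosen for illustration.

```python
def update_moving_mean(moving_mean, batch_mean, decay=0.99):
    """One exponential-moving-average step for an inference statistic."""
    return decay * moving_mean + (1.0 - decay) * batch_mean

batch_means = [5.0] * 500          # pretend every training batch has mean 5.0

with_updates = 0.0                 # the update step runs alongside training
without_updates = 0.0              # the update step is never run
for m in batch_means:
    with_updates = update_moving_mean(with_updates, m)

print(round(with_updates, 3), without_updates)
```

If the update ops are never run, the stored statistics stay at their initial values, so normalization at inference time would use the wrong mean and variance even though training itself looked fine.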
The following works fine for me, it does not require invoking EMA-apply outside.
import numpy as np
import tensorflow as tf
from tensorflow.python import control_flow_ops

def batch_norm(x, n_out, phase_train, scope='bn'):
    """
    Batch normalization on convolutional maps.
    Args:
        x:           Tensor, 4D BHWD input maps
        n_out:       integer, depth of input maps
        phase_train: boolean tf.Variable, true indicates training phase
        scope:       string, variable scope
    Return:
        normed:      batch-normalized maps
    """
    with tf.variable_scope(scope):
        beta = tf.Variable(tf.constant(0.0, shape=[n_out]),
                           name='beta', trainable=True)
        gamma = tf.Variable(tf.constant(1.0, shape=[n_out]),
                            name='gamma', trainable=True)
        batch_mean, batch_var = tf.nn.moments(x, [0,1,2], name='moments')
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([batch_mean, batch_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(batch_mean), tf.identity(batch_var)

        # training: update the EMA and use batch statistics; inference: read the EMA
        mean, var = tf.cond(phase_train,
                            mean_var_with_update,
                            lambda: (ema.average(batch_mean), ema.average(batch_var)))
        normed = tf.nn.batch_normalization(x, mean, var, beta, gamma, 1e-3)
    return normed

import math
n_in, n_out = 3, 16
ksize = 3
stride = 1
phase_train = tf.placeholder(tf.bool, name='phase_train')
input_image = tf.placeholder(tf.float32, name='input_image')
# stddev is left at its default here (an assumption)
kernel = tf.Variable(tf.truncated_normal([ksize, ksize, n_in, n_out]))
conv = tf.nn.conv2d(input_image, kernel, [1,stride,stride,1], padding='SAME')
conv_bn = batch_norm(conv, n_out, phase_train)
relu = tf.nn.relu(conv_bn)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for i in range(20):
        test_image = np.random.rand(4,32,32,3)
        sess_outputs = session.run([relu],
                                   {input_image: test_image, phase_train: True})
There is also an "official" batch normalization layer coded by the developers. They don't have very good docs on how to use it but here is how to use it (according to me):
from tensorflow.contrib.layers.python.layers import batch_norm as batch_norm

def batch_norm_layer(x,train_phase,scope_bn):
    bn_train = batch_norm(x, decay=0.999, center=True, scale=True,
                          is_training=True,
                          reuse=None, # is this right?
                          scope=scope_bn)
    bn_inference = batch_norm(x, decay=0.999, center=True, scale=True,
                              is_training=False,
                              reuse=True, # is this right?
                              scope=scope_bn)
    z = tf.cond(train_phase, lambda: bn_train, lambda: bn_inference)
    return z
To actually use it, you need to create a placeholder for train_phase that indicates whether you are in the training or inference phase (as in train_phase = tf.placeholder(tf.bool, name='phase_train')). Its value can be fed during inference or training with a tf.Session, as in:
test_error = sess.run(..., feed_dict={x: batch_xtest, y_:batch_ytest, train_phase: False})
or during training:
sess.run(..., feed_dict={x: batch_xs, y_:batch_ys, train_phase: True})
I'm pretty sure this is correct according to the discussion in github.
Seems there is another useful link:
You can simply use the built-in batch_norm layer:
batch_norm = tf.cond(is_train,
lambda: tf.contrib.layers.batch_norm(prev, activation_fn=tf.nn.relu, is_training=True, reuse=None),
lambda: tf.contrib.layers.batch_norm(prev, activation_fn =tf.nn.relu, is_training=False, reuse=True))
where prev is the output of your previous layer (can be both fully-connected or a convolutional layer) and is_train is a boolean placeholder. Just use batch_norm as the input to the next layer, then.
Since someone recently edited this, I'd like to clarify that this is no longer an issue.
This answer does not seem correct. When phase_train is set to False, it still updates the EMA mean and variance. This can be verified with the following code snippet.
x = tf.placeholder(tf.float32, [None, 20, 20, 10], name='input')
phase_train = tf.placeholder(tf.bool, name='phase_train')
# generate random noise to pass into batch norm
x_gen = tf.random_normal([50,20,20,10])
pt_false = tf.Variable(tf.constant(True))
#generate a constant variable to pass into batch norm
y = x_gen.eval()
[bn, bn_vars] = batch_norm(x, 10, phase_train)
train_step = lambda: bn.eval({x:x_gen.eval(), phase_train:True})
test_step = lambda: bn.eval({x:y, phase_train:False})
test_step_c = lambda: bn.eval({x:y, phase_train:True})
# Verify that this is different as expected, two different x's have different norms
# Verify that this is same as expected, same x's (y) have same norm
# THIS IS DIFFERENT but should be the same; it should only be reading from the ema.
Using the TensorFlow built-in batch_norm layer, below is the code to load data, build a network with one hidden ReLU layer and L2 regularization, and introduce batch normalization for both the hidden and output layers. This runs fine and trains fine. Just FYI, this example is mostly built upon the data and code from the Udacity DeepLearning course.
P.S. Yes, parts of it were discussed one way or another in answers earlier but I decided to gather in one code snippet everything so that you have example of whole network training process with Batch Normalization and its evaluation
# These are all the modules we'll be using later. Make sure you can import them
# before proceeding further.
from __future__ import print_function
import numpy as np
import tensorflow as tf
from six.moves import cPickle as pickle
pickle_file = '/home/maxkhk/Documents/Udacity/DeepLearningCourse/SourceCode/tensorflow/examples/udacity/notMNIST.pickle'
with open(pickle_file, 'rb') as f:
    save = pickle.load(f)
    train_dataset = save['train_dataset']
    train_labels = save['train_labels']
    valid_dataset = save['valid_dataset']
    valid_labels = save['valid_labels']
    test_dataset = save['test_dataset']
    test_labels = save['test_labels']
    del save # hint to help gc free up memory
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
image_size = 28
num_labels = 10
def reformat(dataset, labels):
    dataset = dataset.reshape((-1, image_size * image_size)).astype(np.float32)
    # Map 2 to [0.0, 1.0, 0.0 ...], 3 to [0.0, 0.0, 1.0 ...]
    labels = (np.arange(num_labels) == labels[:,None]).astype(np.float32)
    return dataset, labels
train_dataset, train_labels = reformat(train_dataset, train_labels)
valid_dataset, valid_labels = reformat(valid_dataset, valid_labels)
test_dataset, test_labels = reformat(test_dataset, test_labels)
print('Training set', train_dataset.shape, train_labels.shape)
print('Validation set', valid_dataset.shape, valid_labels.shape)
print('Test set', test_dataset.shape, test_labels.shape)
def accuracy(predictions, labels):
    return (100.0 * np.sum(np.argmax(predictions, 1) == np.argmax(labels, 1))
            / predictions.shape[0])
#for NeuralNetwork model code is below
#We will use SGD for training to save our time. Code is from Assignment 2
#beta is the new parameter - controls level of regularization.
#Feel free to play with it - the best one I found is 0.001
#notice, we introduce L2 for both biases and weights of all layers
batch_size = 128
beta = 0.001
#building tensorflow graph
graph = tf.Graph()
with graph.as_default():
    # Input data. For the training data, we use a placeholder that will be fed
    # at run time with a training minibatch.
    tf_train_dataset = tf.placeholder(tf.float32,
                                      shape=(batch_size, image_size * image_size))
    tf_train_labels = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
    tf_valid_dataset = tf.constant(valid_dataset)
    tf_test_dataset = tf.constant(test_dataset)
    #introduce batchnorm
    tf_train_dataset_bn = tf.contrib.layers.batch_norm(tf_train_dataset)
    #now let's build our new hidden layer
    #that's how many hidden neurons we want
    num_hidden_neurons = 1024
    #its weights
    hidden_weights = tf.Variable(
        tf.truncated_normal([image_size * image_size, num_hidden_neurons]))
    hidden_biases = tf.Variable(tf.zeros([num_hidden_neurons]))
    #now the layer itself. It multiplies data by weights, adds biases
    #and takes ReLU over result
    hidden_layer = tf.nn.relu(tf.matmul(tf_train_dataset_bn, hidden_weights) + hidden_biases)
    #adding the batch normalization layer
    hidden_layer_bn = tf.contrib.layers.batch_norm(hidden_layer)
    #time to go for output linear layer
    #out weights connect hidden neurons to output labels
    #biases are added to output labels
    out_weights = tf.Variable(
        tf.truncated_normal([num_hidden_neurons, num_labels]))
    out_biases = tf.Variable(tf.zeros([num_labels]))
    #compute output
    out_layer = tf.matmul(hidden_layer_bn,out_weights) + out_biases
    #our real output is a softmax of prior result
    #and we also compute its cross-entropy to get our loss
    #Notice - we introduce our L2 here
    loss = (tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=out_layer, labels=tf_train_labels) +
        beta*tf.nn.l2_loss(hidden_weights) +
        beta*tf.nn.l2_loss(hidden_biases) +
        beta*tf.nn.l2_loss(out_weights) +
        beta*tf.nn.l2_loss(out_biases)))
    #now we just minimize this loss to actually train the network
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
    #nice, now let's calculate the predictions on each dataset for evaluating the
    #performance so far
    # Predictions for the training, validation, and test data.
    train_prediction = tf.nn.softmax(out_layer)
    valid_relu = tf.nn.relu( tf.matmul(tf_valid_dataset, hidden_weights) + hidden_biases)
    valid_prediction = tf.nn.softmax( tf.matmul(valid_relu, out_weights) + out_biases)
    test_relu = tf.nn.relu( tf.matmul( tf_test_dataset, hidden_weights) + hidden_biases)
    test_prediction = tf.nn.softmax(tf.matmul(test_relu, out_weights) + out_biases)
#now is the actual training on the ANN we built
#we will run it for some number of steps and evaluate the progress after
#every 500 steps
#number of steps we will train our ANN
num_steps = 3001
#actual training
with tf.Session(graph=graph) as session:
    # variables must be initialized before the first training step
    tf.global_variables_initializer().run()
    for step in range(num_steps):
        # Pick an offset within the training data, which has been randomized.
        # Note: we could use better randomization across epochs.
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        # Generate a minibatch.
        batch_data = train_dataset[offset:(offset + batch_size), :]
        batch_labels = train_labels[offset:(offset + batch_size), :]
        # Prepare a dictionary telling the session where to feed the minibatch.
        # The key of the dictionary is the placeholder node of the graph to be fed,
        # and the value is the numpy array to feed to it.
        feed_dict = {tf_train_dataset : batch_data, tf_train_labels : batch_labels}
        _, l, predictions = session.run(
            [optimizer, loss, train_prediction], feed_dict=feed_dict)
        if (step % 500 == 0):
            print("Minibatch loss at step %d: %f" % (step, l))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            print("Validation accuracy: %.1f%%" % accuracy(
                valid_prediction.eval(), valid_labels))
    print("Test accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
So a simple example of the use of this batchnorm class:
from bn_class import *
with tf.name_scope('Batch_norm_conv1') as scope:
    ewma = tf.train.ExponentialMovingAverage(decay=0.99)
    bn_conv1 = ConvolutionalBatchNormalizer(num_filt_1, 0.001, ewma, True)
    update_assignments = bn_conv1.get_assigner()
    a_conv1 = bn_conv1.normalize(a_conv1, train=bn_train)
    h_conv1 = tf.nn.relu(a_conv1)
