I am trying to classify the ETH Food-101 dataset using squeezenet in Caffe2. My model is imported from the Model Zoo and I made two types of modifications to the model:
1) Changing the dimensions of the last layer to have 101 outputs
2) The images from the database are in NHWC form and I just flipped the dimensions of the weights to match. (I plan on changing this)
The Food101 dataset has 75,000 images for training and I am currently using a batch size of 128 and a starting learning rate of -0.01 with a gamma of 0.999 and stepsize of 1. What I noticed is that for the first 2000 iterations of the network the accuracy hovered around 1/128 and this took an hour or so to complete.
I added all the weights to the model.params so they can get updated during gradient descent(except for data) and reinitialized all weights as Xavier and biases to constant. I would expect the accuracy to grow fairly quickly in the first hundred to thousand iterations and then tail off as the number of iterations grow. In my case, the learning is staying constant around 0.
When I look at the gradient file I find that the average is on the order of 10^-6 with a standard deviation of 10^-7. This explains the slow learning rate, but I haven't been able to get the gradient to start much higher.
These are the gradient statistics for the first convolution after a few iterations
Min Max Avg Sdev
-1.69821e-05 2.10922e-05 1.52149e-06 5.7707e-06
-1.60263e-05 2.01478e-05 1.49323e-06 5.41754e-06
-1.62501e-05 1.97764e-05 1.49046e-06 5.2904e-06
-1.64293e-05 1.90508e-05 1.45681e-06 5.22742e-06
Here are the core parts of my code:
#init_path is path to init_net protobuf
#pred_path is path to pred_net protobuf
def main(init_path, pred_path):
ws.ResetWorkspace()
data_folder = '/home/myhome/food101/'
#some debug code here
arg_scope = {"order":"NCHW"}
train_model = model_helper.ModelHelper(name="food101_train", arg_scope=arg_scope)
if not debug:
data, label = AddInput(
train_model, batch_size=128,
db=os.path.join(data_folder, 'food101-train-nchw-leveldb'),
db_type='leveldb')
init_net_def, pred_net_def = update_squeeze_net(init_path, pred_path)
#print str(init_net_def)
train_model.param_init_net.AppendNet(core.Net(init_net_def))
train_model.net.AppendNet(core.Net(pred_net_def))
ws.RunNetOnce(train_model.param_init_net)
add_params(train_model, init_net_def)
AddTrainingOperators(train_model, 'softmaxout', 'label')
AddBookkeepingOperators(train_model)
ws.RunNetOnce(train_model.param_init_net)
if debug:
ws.FeedBlob('data', data)
ws.FeedBlob('label', label)
ws.CreateNet(train_model.net)
total_iters = 10000
accuracy = np.zeros(total_iters)
loss = np.zeros(total_iters)
# Now, we will manually run the network for 200 iterations.
for i in range(total_iters):
#try:
conv1_w = ws.FetchBlob('conv1_w')
print conv1_w[0][0]
ws.RunNet("food101_train")
#except RuntimeError:
# print ws.FetchBlob('conv1').shape
# print ws.FetchBlob('pool1').shape
# print ws.FetchBlob('fire2/squeeze1x1_w').shape
# print ws.FetchBlob('fire2/squeeze1x1_b').shape
#softmax = ws.FetchBlob('softmaxout')
#print softmax[i]
#print softmax[i][0][0]
#print softmax[i][0][:5]
#print softmax[64*i]
accuracy[i] = ws.FetchBlob('accuracy')
loss[i] = ws.FetchBlob('loss')
print accuracy[i], loss[i]
My add_params function initializes the weights as follows
#ops allows me to only initialize the weights of specific ops because i initially was going to do last layer training
def add_params(model, init_net_def, ops=[]):
def add_param(op):
for output in op.output:
if "_w" in output:
weight_shape = []
for arg in op.arg:
if arg.name == 'shape':
weight_shape = arg.ints
weight_initializer = initializers.update_initializer(
None,
None,
("XavierFill", {}))
model.create_param(
param_name=output,
shape=weight_shape,
initializer=weight_initializer,
tags=ParameterTags.WEIGHT)
elif "_b" in output:
weight_shape = []
for arg in op.arg:
if arg.name == 'shape':
weight_shape = arg.ints
weight_initializer = initializers.update_initializer(
None,
None,
("ConstantFill", {}))
model.create_param(
param_name=output,
shape=weight_shape,
initializer=weight_initializer,
I find that my loss function fluctuates when I use the full training set, but If i use just one batch and iterate over it several times I find that the loss function goes down but very slowly.
While SqueezeNet has 50x fewer parameters than AlexNet, it is still a very large network. The original paper does not mention a training time, but the SqueezeNet-based SQ required 22 hours to train using two Titan X graphics cards - and that was with some of the weights pre-trained! I haven't gone over your code in detail, but what you describe is expected behavior - your network is able to learn on the single batch, just not as quickly as you expected.
I suggest reusing as many of the weights as possible instead of reinitializing them, just as the creators of SQ did. This is known as transfer learning, and it works because many of the lower-level features (lines, curves, basic shapes) in an image are the same regardless of the image's content, and reusing the weights for these layers saves the network from having to re-learn them from scratch.
Related
I am training a GCN (Graph Convolutional Network) on Cora dataset.
The Cora dataset has the following attributes:
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Number of training nodes: 140
Training node label rate: 0.05
Is undirected: True
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Since my code is very long, I only put the relevent parts of my code here. Firstly, I split the Cora dataset as follows:
def to_mask(index, size):
mask = torch.zeros(size, dtype=torch.bool)
mask[index] = 1
return mask
def cora_splits(data, num_classes):
indices = []
for i in range(num_classes):
# returns all indices of the elements = i from data.y tensor
index = (data.y == i).nonzero().view(-1)
# returns a random permutation of integers from 0 to index.size(0).
index = index[torch.randperm(index.size(0))]
# indices is a list of tensors and it has a length of 7
indices.append(index)
# select 20 nodes from each class for training
train_index = torch.cat([i[:20] for i in indices], dim=0)
rest_index = torch.cat([i[20:] for i in indices], dim=0)
rest_index = rest_index[torch.randperm(len(rest_index))]
data.train_mask = to_mask(train_index, size=data.num_nodes)
data.val_mask = to_mask(rest_index[:500], size=data.num_nodes)
data.test_mask = to_mask(rest_index[500:], size=data.num_nodes)
return data
The train is as follows (taken from here with few modifications):
def train(model, optimizer, data, epoch):
t = time.time()
model.train()
optimizer.zero_grad()
output = model(data)
loss_train = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
acc_train = accuracy(output[data.train_mask], data.y[data.train_mask])
loss_train.backward()
optimizer.step()
loss_val = F.nll_loss(output[data.val_mask], data.y[data.val_mask])
acc_val = accuracy(output[data.val_mask], data.y[data.val_mask])
def accuracy(output, labels):
preds = output.max(1)[1].type_as(labels)
correct = preds.eq(labels).double()
correct = correct.sum()
return correct / len(labels)
When I ran my code with 200 epochs in 10 runs I gained:
tensor([0.7690, 0.8030, 0.8530, 0.8760, 0.8600, 0.8550, 0.8850, 0.8580, 0.8940, 0.8830])
Val Loss: 0.5974, Test Accuracy: 0.854 ± 0.039
where each value in the tensor belongs to the model accurracy of each run and the mean accuracy of all those 10 runs is 0.854 with std ± 0.039.
As it can be observed, the accuracy from the first run to the 10th one is increasing substantially. Therefore, I think the model is overfitting. One reason of overfitting is that in the code, the test data has been seen by the model in the training time since in the train function, there is a line output = model(data) so the model is trained over the whole data. What I intend to do is to train my model only on a part of the data (something similar to data[data.train_mask]) but the problem is I cannot pass data[data.train_mask], due to the forward function of the GCN model (from this repository):
def forward(self, data):
x, edge_index = data.x, data.edge_index
x = F.relu(self.conv1(x, edge_index))
for conv in self.convs:
x = F.relu(conv(x, edge_index))
x = F.relu(self.lin1(x))
x = F.dropout(x, p=0.5, training=self.training)
x = self.lin2(x)
return F.log_softmax(x, dim=-1)
If I pass data[data.train_mask] to the GCN model, then in the above forward function in line x, edge_index = data.x, data.edge_index, x and edge_index cannot be retrieved from data[data.train_mask]. Therefore, I need to find a way to split the Cora dataset in a way that I can pass a specefic part of it with the nodes, edge-index and other attributes to the model. My question is how to do it?
Also, any suggestion about k-fold cross validation is much appreciated.
I guess you are a little confused by the nature of transductive learning and the question you asked doesn't actually address the problem you are facing.
As it can be observed, the accuracy from the first run to the 10th one
is increasing substantially. Therefore, I think the model is
overfitting.
Not necessarily, increasing test accuracy could be a normal behavior when your model is learning from the training samples. The learning can last for several dozens of epochs due to the complexity and non-convexity of loss function. The best signal to tell overfit is when your training accuracy increase but test accuracy decreases significantly.
One reason of overfitting is that in the code, the test data has been
seen by the model in the training time since in the train function,
there is a line output = model(data) so the model is trained over the
whole data.
The model indeed has seen the entire graph(adjacency matrix) in the training, but it only sees the labels of nodes in the training set and knows nothing about the labels of nodes in the test set. This is exactly what transductive learning does.
In the end, if you are 100% sure you want to avoid the paradigm of transductive learning, then you might need to write your own split algorithm to achieve that. But I would like to remind that in the real-world use case, transduction is perfectly suitable. An example is to predict the potential links between social network users, where we have the whole network structure as input and want to simply run edge prediction--> transduction. Thus it doesn't make a lot of sense to avoid it.
Depending on your task, you could take a look of how Stellargraph's EdgeSplitter class(docs) and scikit-learn’s train_test_split function (docs) achive the split.
Node classification
If your task is a node classification task, this Node classification with Graph Convolutional Network (GCN) is a good example of how to load data and do train-test-split. It took Cora dataset as an example. The most important steps are the following:
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()
train_subjects, test_subjects = model_selection.train_test_split(
node_subjects, train_size=140, test_size=None, stratify=node_subjects
)
val_subjects, test_subjects = model_selection.train_test_split(
test_subjects, train_size=500, test_size=None, stratify=test_subjects
)
train_gen = generator.flow(train_subjects.index, train_targets)
val_gen = generator.flow(val_subjects.index, val_targets)
test_gen = generator.flow(test_subjects.index, test_targets)
Basically, it's the same as train-test-split with a normal classification task, except what we split here is nodes.
Edge classification
If your task is edge classification, you could have a look at this Link prediction example: GCN on the Cora citation dataset. The most relevant code for train-test-split is
# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G, and obtain the
# reduced graph G_test with the sampled links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
p=0.1, method="global", keep_connected=True
)
# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G_test, and obtain the
# reduced graph G_train with the sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
p=0.1, method="global", keep_connected=True
)
# For training we create a generator on the G_train graph, and make an
# iterator over the training links using the generator’s flow() method:
train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)
test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = train_gen.flow(edge_ids_test, edge_labels_test)
Here the splitting algorithm behind EdgeSplitter class(docs) is more complex, it needs to maintain the graph structure while doing the split, such as keeping the graph connectivity for example. For more details, cf source code for EdgeSplitter
I'm new to ANN, but I've managed to train a convolutional model successfully (using some legacy tensorflow v1 code) up to ~90% accuracy or so on my data. But when I evaluate (test) it on any given batch, the result is somewhat random, even though it's 90% correct. I've tried to re-evaluate the data N times and averaging (using N's between 1 and 25), but still each evaluation differs from the others between 3% to 10% of the data points.
Is there any way to make the evaluation predictable, so that the evaluation of an input batch X always yield the exact same result Y every time I run it (once training is done)?
I'm not sure if it's relevant, but my layers are batch normalized like so:
inp = tf.identity(inp)
channels = inp.get_shape()[-1]
offset = tf.compat.v1.get_variable(
'offset', [channels],
dtype=tf.float32,
initializer=tf.compat.v1.zeros_initializer())
scale = tf.compat.v1.get_variable(
'scale', [channels],
dtype=tf.float32,
initializer=tf.compat.v1.random_normal_initializer(1.0, 0.02))
mean, variance = tf.nn.moments(x=inp, axes=[0, 1], keepdims=False)
variance_epsilon = 1e-5
normalized = tf.nn.batch_normalization(
inp, mean, variance, offset, scale, variance_epsilon=variance_epsilon)
The scale part is initialized with random data, but I assume that gets loaded when I do tf.compat.v1.train.Saver().restore(session, checkpoint_fname)?
I am assuming you are testing the model on your training batches?
You can't equate the accuracy of a portion of your total training dataset to the accuracy of the whole.
Think of it like a regression problem. If you only take a part of the dataset, there is no guarantee that it would average out close to the full dataset.
If you want consistent accuracy, evaluate on the full dataset.
Recently, I try to learn how to use Tensorflow on multiple GPU by reading the official tutorial. However, there is something that I am confused about. The following code is part of the official tutorial, which calculates the loss on single GPU.
def tower_loss(scope, images, labels):
# Build inference Graph.
logits = cifar10.inference(images)
# Build the portion of the Graph calculating the losses. Note that we will
# assemble the total_loss using a custom function below.
_ = cifar10.loss(logits, labels)
# Assemble all of the losses for the current tower only.
losses = tf.get_collection('losses', scope)
# Calculate the total loss for the current tower.
total_loss = tf.add_n(losses, name='total_loss')
# Attach a scalar summary to all individual losses and the total loss; do the
# same for the averaged version of the losses.
for l in losses + [total_loss]:
# Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
# session. This helps the clarity of presentation on tensorboard.
loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
tf.summary.scalar(loss_name, l)
return total_loss
The training process is as the following.
def train():
with tf.device('/cpu:0'):
# Create a variable to count the number of train() calls. This equals the
# number of batches processed * FLAGS.num_gpus.
global_step = tf.get_variable(
'global_step', [],
initializer=tf.constant_initializer(0), trainable=False)
# Calculate the learning rate schedule.
num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
FLAGS.batch_size / FLAGS.num_gpus)
decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
global_step,
decay_steps,
cifar10.LEARNING_RATE_DECAY_FACTOR,
staircase=True)
# Create an optimizer that performs gradient descent.
opt = tf.train.GradientDescentOptimizer(lr)
# Get images and labels for CIFAR-10.
images, labels = cifar10.distorted_inputs()
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
[images, labels], capacity=2 * FLAGS.num_gpus)
# Calculate the gradients for each model tower.
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
# Dequeues one batch for the GPU
image_batch, label_batch = batch_queue.dequeue()
# Calculate the loss for one tower of the CIFAR model. This function
# constructs the entire CIFAR model but shares the variables across
# all towers.
loss = tower_loss(scope, image_batch, label_batch)
# Reuse variables for the next tower.
tf.get_variable_scope().reuse_variables()
# Retain the summaries from the final tower.
summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop about 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new batch image from batch_queue and calculate every gradient. I think this process is serialized instead of parallel. If there anything wrong with my own understanding? By the way, I can also use the iterator to feed image to my model rather than the dequeue right?
Thank you everybody!
This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
# Dequeues one batch for the GPU
image_batch, label_batch = batch_queue.dequeue()
# Calculate the loss for one tower of the CIFAR model. This function
# constructs the entire CIFAR model but shares the variables across
# all towers.
loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how via tf.get_variable_scope().reuse_variables() you're telling the graph that the variables used for the graph GPU must be shared among all (i.e., all graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network once (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you'll start the actual training (I guess you missed that piece of the code when copying it here) each GPU will pull its own batch and produce the per-tower loss. I guess these losses are averaged somewhere in the subsequent code and the average is the loss passed to the optimizer.
Up until the point where the tower losses are averaged together, everything is independent from the other devices, so getting the batch and computing the loss can be done in parallel. Then the gradients and parameter update is done only once, variables are updated and the cycle repeats.
So, to answer your question, no, per-batch loss computation is not serialized, but since this is synchronous distributed computation you need to collect all losses from all GPUs before being allowed to continue with gradients computation and parameters update, so you still have some part of the graph that cannot be independent.
I'm using the implementation found in http://adventuresinmachinelearning.com/word2vec-keras-tutorial/ to learn something about word2Vec. What I am not understanding is why isn't the loss function decreasing?
Iteration 119200, loss=0.7305528521537781
Iteration 119300, loss=0.6254740953445435
Iteration 119400, loss=0.8255964517593384
Iteration 119500, loss=0.7267132997512817
Iteration 119600, loss=0.7213149666786194
Iteration 119700, loss=0.6156617999076843
Iteration 119800, loss=0.11473365128040314
Iteration 119900, loss=0.6617216467857361
The net, from my understanding, is a standard one used in this task:
input_target = Input((1,))
input_context = Input((1,))
embedding = Embedding(vocab_size, vector_dim, input_length=1, name='embedding')
target = embedding(input_target)
target = Reshape((vector_dim, 1))(target)
context = embedding(input_context)
context = Reshape((vector_dim, 1))(context)
dot_product = Dot(axes=1)([target, context])
dot_product = Reshape((1,))(dot_product)
output = Dense(1, activation='sigmoid')(dot_product)
model = Model(inputs=[input_target, input_context], outputs=output)
model.compile(loss='binary_crossentropy', optimizer='rmsprop') #adam??
Words come from a vocabulary of size 10000 from http://mattmahoney.net/dc/text8.zip (english text)
What I notice is that some words are somewhat learned in time like the context for numbers and articles is easily guessed, yet the loss is quite stuck around 0.7 from the beginning, and as iterations goes it only fluctuates randomly.
The training part is made like this (which I sense strange since the absence of the standard fit method)
arr_1 = np.zeros((1,))
arr_2 = np.zeros((1,))
arr_3 = np.zeros((1,))
for cnt in range(epochs):
idx = np.random.randint(0, len(labels)-1)
arr_1[0,] = word_target[idx]
arr_2[0,] = word_context[idx]
arr_3[0,] = labels[idx]
loss = model.train_on_batch([arr_1, arr_2], arr_3)
if cnt % 100 == 0:
print("Iteration {}, loss={}".format(cnt, loss))
Am i missing something important about these type of net? What is not written is implemented exactly like the link above
I followed the same tutorial and the loss drops after the algorithm went through a sample again. Note that the loss function is calculated only for the current target and context word pair. In the code example from the tutorial one epoch is only one sample, therefore you would need more than the number of target and context words to come to a point where the loss drops.
I implemented the training part with the following line
model.fit([word_target, word_context], labels, epochs=5)
Be warned that this can take a long time depending on how large the corpus is. The train_on_batch function gives you more control in training and you can vary the batch size or select samples you choose at every step of the training.
I am using tensorflow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with batch size of 1. The recognition rate increased for the first few epochs, however, after several epochs the recognition rate started to drop greatly. (In the first 20 epochs, the recognition rate goes up to ~94%. Afterwards, the recognition rate went from 90->50->40->30->20). What is the reason for this?
Also, with a batch size of 1, the performance is worse than when using a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references but there seems to be contradictory results on whether small or large batch sizes achieve better performance. What would be this case in this situation?
Edit: I also added a figure of the recognition rate of the training dataset and the test dataset.Recognition rate vs. epoch
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)
#parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')
#model of neural network
def neural_network_model(data):
hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, n_nodes_hl1]) , name='l1_w'),
'biases': tf.Variable(tf.random_normal([n_nodes_hl1]) , name='l1_b')}
hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]) , name='l2_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl2]) , name='l2_b')}
hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]) , name='l3_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl3]) , name='l3_b')}
output_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]) , name='lo_w'),
'biases' :tf.Variable(tf.random_normal([n_classes]) , name='lo_b')}
l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
l2 = tf.nn.relu(l2)
l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
l3 = tf.nn.relu(l3)
output = tf.matmul(l3,output_layer['weights']) + output_layer['biases']
return output
#train neural network
def train_neural_network(x):
prediction = neural_network_model(x)
cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)
hm_epoches = 100
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(hm_epoches):
epoch_loss=0
for batch in range (10000):
epoch_x, epoch_y=mnist.train.next_batch(batch_size)
_,c =sess.run([optimizer, cost], feed_dict = {x:epoch_x, y:epoch_y})
epoch_loss += c
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
print(epoch_loss)
print('Accuracy_test:', accuracy.eval({x:mnist.test.images, y:mnist.test.labels}))
print('Accuracy_train:', accuracy.eval({x:mnist.train.images, y:mnist.train.labels}))
train_neural_network(x)
DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot", optimum performance followed by shortest training time. In my experience, it's been best to increase the batch size until my time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size the best used available RAM.
There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.