Edgy accuracy and loss plots in tensorboard

Edgy accuracy and loss plots in tensorboard - python

I use tensorboard to visualize learning of my neural networks. Currently, I am working with datasets of such size, that equal, say 80%/20% split into training and development collections leaves me with the data that does not fit into memory during processing, so I have to use batches for development step.
When I do so, I end up with following shapes on the tensorboard charts:
Light-green line represents batched summary, while the dark green one represents unbatched development processing. I used datasets of same size and split for those two charts.
Here is the code, that is used to populate summaries
# accuracy definition in network
network.correct_predictions = tf.equal(predictions, tf.argmax(input_y, 1))
network.accuracy = tf.reduce_mean(tf.cast(correct_predictions, "float"), name="accuracy")
network.loss = tf.reduce_mean(_losses)
# accuracy summary
acc_summary = tf.scalar_summary("accuracy", network.accuracy)
loss_summary = tf.scalar_summary("loss", network.loss)
# dev data summary writer
dev_summary_operation = tf.merge_summary([loss_summary, acc_summary])
dev_summary_writer = tf.train.SummaryWriter(dev_summary_dir, session.graph)
# dev step
def dev_step(x_batch, y_batch)
feed_dict = {network.input_x: x_batch, network.input_y: y_batch}
step, summaries, loss, accuracy = session.run([global_step, dev_summary_operation, network.loss, network.accuracy], feed_dict)
dev_summary_writer.add_summary(summaries, step)
Same approach during training step gives normal lines (those, that are not edgy) since training is batched in both cases.
In both cases (batched or not) development summary gets updated once in 100 training batches.
Suggestions of reasons for such behaviour and possible fixes are more than welcome and highly appreciated.

Related

How to split the Cora dataset to train a GCN model only on training part?

I am training a GCN (Graph Convolutional Network) on Cora dataset.
The Cora dataset has the following attributes:
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Number of training nodes: 140
Training node label rate: 0.05
Is undirected: True
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Since my code is very long, I only put the relevent parts of my code here. Firstly, I split the Cora dataset as follows:
def to_mask(index, size):
mask = torch.zeros(size, dtype=torch.bool)
mask[index] = 1
return mask
def cora_splits(data, num_classes):
indices = []
for i in range(num_classes):
# returns all indices of the elements = i from data.y tensor
index = (data.y == i).nonzero().view(-1)
# returns a random permutation of integers from 0 to index.size(0).
index = index[torch.randperm(index.size(0))]
# indices is a list of tensors and it has a length of 7
indices.append(index)
# select 20 nodes from each class for training
train_index = torch.cat([i[:20] for i in indices], dim=0)
rest_index = torch.cat([i[20:] for i in indices], dim=0)
rest_index = rest_index[torch.randperm(len(rest_index))]
data.train_mask = to_mask(train_index, size=data.num_nodes)
data.val_mask = to_mask(rest_index[:500], size=data.num_nodes)
data.test_mask = to_mask(rest_index[500:], size=data.num_nodes)
return data
The train is as follows (taken from here with few modifications):
def train(model, optimizer, data, epoch):
t = time.time()
model.train()
optimizer.zero_grad()
output = model(data)
loss_train = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
acc_train = accuracy(output[data.train_mask], data.y[data.train_mask])
loss_train.backward()
optimizer.step()
loss_val = F.nll_loss(output[data.val_mask], data.y[data.val_mask])
acc_val = accuracy(output[data.val_mask], data.y[data.val_mask])
def accuracy(output, labels):
preds = output.max(1)[1].type_as(labels)
correct = preds.eq(labels).double()
correct = correct.sum()
return correct / len(labels)
When I ran my code with 200 epochs in 10 runs I gained:
tensor([0.7690, 0.8030, 0.8530, 0.8760, 0.8600, 0.8550, 0.8850, 0.8580, 0.8940, 0.8830])
Val Loss: 0.5974, Test Accuracy: 0.854 ± 0.039
where each value in the tensor belongs to the model accurracy of each run and the mean accuracy of all those 10 runs is 0.854 with std ± 0.039.
As it can be observed, the accuracy from the first run to the 10th one is increasing substantially. Therefore, I think the model is overfitting. One reason of overfitting is that in the code, the test data has been seen by the model in the training time since in the train function, there is a line output = model(data) so the model is trained over the whole data. What I intend to do is to train my model only on a part of the data (something similar to data[data.train_mask]) but the problem is I cannot pass data[data.train_mask], due to the forward function of the GCN model (from this repository):
def forward(self, data):
x, edge_index = data.x, data.edge_index
x = F.relu(self.conv1(x, edge_index))
for conv in self.convs:
x = F.relu(conv(x, edge_index))
x = F.relu(self.lin1(x))
x = F.dropout(x, p=0.5, training=self.training)
x = self.lin2(x)
return F.log_softmax(x, dim=-1)
If I pass data[data.train_mask] to the GCN model, then in the above forward function in line x, edge_index = data.x, data.edge_index, x and edge_index cannot be retrieved from data[data.train_mask]. Therefore, I need to find a way to split the Cora dataset in a way that I can pass a specefic part of it with the nodes, edge-index and other attributes to the model. My question is how to do it?
Also, any suggestion about k-fold cross validation is much appreciated.

I guess you are a little confused by the nature of transductive learning and the question you asked doesn't actually address the problem you are facing.
As it can be observed, the accuracy from the first run to the 10th one
is increasing substantially. Therefore, I think the model is
overfitting.
Not necessarily, increasing test accuracy could be a normal behavior when your model is learning from the training samples. The learning can last for several dozens of epochs due to the complexity and non-convexity of loss function. The best signal to tell overfit is when your training accuracy increase but test accuracy decreases significantly.
One reason of overfitting is that in the code, the test data has been
seen by the model in the training time since in the train function,
there is a line output = model(data) so the model is trained over the
whole data.
The model indeed has seen the entire graph(adjacency matrix) in the training, but it only sees the labels of nodes in the training set and knows nothing about the labels of nodes in the test set. This is exactly what transductive learning does.
In the end, if you are 100% sure you want to avoid the paradigm of transductive learning, then you might need to write your own split algorithm to achieve that. But I would like to remind that in the real-world use case, transduction is perfectly suitable. An example is to predict the potential links between social network users, where we have the whole network structure as input and want to simply run edge prediction--> transduction. Thus it doesn't make a lot of sense to avoid it.
Depending on your task, you could take a look of how Stellargraph's EdgeSplitter class(docs) and scikit-learn’s train_test_split function (docs) achive the split.
Node classification
If your task is a node classification task, this Node classification with Graph Convolutional Network (GCN) is a good example of how to load data and do train-test-split. It took Cora dataset as an example. The most important steps are the following:
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()
train_subjects, test_subjects = model_selection.train_test_split(
node_subjects, train_size=140, test_size=None, stratify=node_subjects
)
val_subjects, test_subjects = model_selection.train_test_split(
test_subjects, train_size=500, test_size=None, stratify=test_subjects
)
train_gen = generator.flow(train_subjects.index, train_targets)
val_gen = generator.flow(val_subjects.index, val_targets)
test_gen = generator.flow(test_subjects.index, test_targets)
Basically, it's the same as train-test-split with a normal classification task, except what we split here is nodes.
Edge classification
If your task is edge classification, you could have a look at this Link prediction example: GCN on the Cora citation dataset. The most relevant code for train-test-split is
# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G, and obtain the
# reduced graph G_test with the sampled links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
p=0.1, method="global", keep_connected=True
)
# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G_test, and obtain the
# reduced graph G_train with the sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
p=0.1, method="global", keep_connected=True
)
# For training we create a generator on the G_train graph, and make an
# iterator over the training links using the generator’s flow() method:
train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)
test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = train_gen.flow(edge_ids_test, edge_labels_test)
Here the splitting algorithm behind EdgeSplitter class(docs) is more complex, it needs to maintain the graph structure while doing the split, such as keeping the graph connectivity for example. For more details, cf source code for EdgeSplitter

training by batches leads to more over-fitting

I'm training a sequence to sequence (seq2seq) model and I have different values to train on for the input_sequence_length.
For values 10 and 15, I get acceptable results but when I try to train with 20, I get memory errors so I switched the training to train by batches but the model over-fit and the validation loss explodes, and even with the accumulated gradient I get the same behavior, so I'm looking for hints and leads to more accurate ways to do the update.
Here is my training function (only with batch section) :
if batch_size is not None:
k=len(list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )))
for epoch in range(num_epochs):
optimizer.zero_grad()
epoch_loss=0
for i in list(np.arange(0,(X_train_tensor_1.size()[0]//batch_size-1), batch_size )): # by using equidistant batch till the last one it becomes much faster than using the X.size()[0] directly
sequence = X_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, input_size).to(device)
labels = y_train_tensor[i:i+batch_size,:,:].reshape(-1, sequence_length, output_size).to(device)
# Forward pass
outputs = model(sequence)
loss = criterion(outputs, labels)
epoch_loss+=loss.item()
# Backward and optimize
loss.backward()
optimizer.step()
epoch_loss=epoch_loss/k
model.eval
validation_loss,_= evaluate(model,X_test_hard_tensor_1,y_test_hard_tensor_1)
model.train()
training_loss_log.append(epoch_loss)
print ('Epoch [{}/{}], Train MSELoss: {}, Validation : {} {}'.format(epoch+1, num_epochs,epoch_loss,validation_loss))
EDIT:
here are the parameters that I'm training with :
batch_size = 1024
num_epochs = 25000
learning_rate = 10e-04
optimizer=torch.optim.Adam(model.parameters(), lr=learning_rate)
criterion = nn.MSELoss(reduction='mean')

Batch size affects regularization. Training on a single example at a time is quite noisy, which makes it harder to overfit. Training on batches smoothes everything out, which makes it easier to overfit. Translating back to regularization:
Smaller batches add regularization.
Larger batches reduce regularization.
I am also curious about your learning rate. Every call to loss.backward() will accumulate the gradient. If you have set your learning rate to expect a single example at a time, and not reduced it to account for batch accumulation, then one of two things will happen.
The learning rate will be too high for the now-accumulated gradient, training will diverge, and both training and validation errors will explode.
The learning rate won't be too high, and nothing will diverge. The model will just train more quickly and effectively. If the model is too large for the data being fit, then training error will go to 0 but validation error will explode due to overfitting.
Update
Here is a bit more detail regarding the gradient accumulation.
Every call to loss.backward() will accumulate gradient, until you reset it with optimizer.zero_grad(). It will be acted on when you call optimizer.step(), based on whatever it has accumulated.
The way your code is written, you call loss.backward() for every pass through the inner loop, then you call optimizer.step() in the outer loop before resetting. So the gradient has been accumulated, that is summed, over all examples in the batch and not just one example at a time.
Under most assumptions, that will make the batch-accumulated gradient larger than the gradient for a single example. If the gradients are all aligned, for B batches, it will be larger by B times. If the gradients are i.i.d. then it will be more like sqrt(B) times larger.
If you do not account for this, then you have effectively increased your learning rate by that factor. Some of that will be mitigated by the smoothing effect of larger batches, which can then tolerate a higher learning rate. Larger batches reduce regularization, larger learning rates add it back. But that will not be a perfect match to compensate, so you will still want to adjust accordingly.
In general, whenever you change your batch size you will also want to re-tune your learning rate to compensate.
Leslie N. Smith has written some excellent papers on a methodical approach to hyperparameter tuning. A great place to start is A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay. He recommends you start by reading the diagrams, which are very well done.

TensorFlow on multiple GPU

Recently, I try to learn how to use Tensorflow on multiple GPU by reading the official tutorial. However, there is something that I am confused about. The following code is part of the official tutorial, which calculates the loss on single GPU.
def tower_loss(scope, images, labels):
# Build inference Graph.
logits = cifar10.inference(images)
# Build the portion of the Graph calculating the losses. Note that we will
# assemble the total_loss using a custom function below.
_ = cifar10.loss(logits, labels)
# Assemble all of the losses for the current tower only.
losses = tf.get_collection('losses', scope)
# Calculate the total loss for the current tower.
total_loss = tf.add_n(losses, name='total_loss')
# Attach a scalar summary to all individual losses and the total loss; do the
# same for the averaged version of the losses.
for l in losses + [total_loss]:
# Remove 'tower_[0-9]/' from the name in case this is a multi-GPU training
# session. This helps the clarity of presentation on tensorboard.
loss_name = re.sub('%s_[0-9]*/' % cifar10.TOWER_NAME, '', l.op.name)
tf.summary.scalar(loss_name, l)
return total_loss
The training process is as the following.
def train():
with tf.device('/cpu:0'):
# Create a variable to count the number of train() calls. This equals the
# number of batches processed * FLAGS.num_gpus.
global_step = tf.get_variable(
'global_step', [],
initializer=tf.constant_initializer(0), trainable=False)
# Calculate the learning rate schedule.
num_batches_per_epoch = (cifar10.NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN /
FLAGS.batch_size / FLAGS.num_gpus)
decay_steps = int(num_batches_per_epoch * cifar10.NUM_EPOCHS_PER_DECAY)
# Decay the learning rate exponentially based on the number of steps.
lr = tf.train.exponential_decay(cifar10.INITIAL_LEARNING_RATE,
global_step,
decay_steps,
cifar10.LEARNING_RATE_DECAY_FACTOR,
staircase=True)
# Create an optimizer that performs gradient descent.
opt = tf.train.GradientDescentOptimizer(lr)
# Get images and labels for CIFAR-10.
images, labels = cifar10.distorted_inputs()
batch_queue = tf.contrib.slim.prefetch_queue.prefetch_queue(
[images, labels], capacity=2 * FLAGS.num_gpus)
# Calculate the gradients for each model tower.
tower_grads = []
with tf.variable_scope(tf.get_variable_scope()):
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
# Dequeues one batch for the GPU
image_batch, label_batch = batch_queue.dequeue()
# Calculate the loss for one tower of the CIFAR model. This function
# constructs the entire CIFAR model but shares the variables across
# all towers.
loss = tower_loss(scope, image_batch, label_batch)
# Reuse variables for the next tower.
tf.get_variable_scope().reuse_variables()
# Retain the summaries from the final tower.
summaries = tf.get_collection(tf.GraphKeys.SUMMARIES, scope)
However, I am confused about the for loop about 'for i in xrange(FLAGS.num_gpus)'. It seems that I have to get a new batch image from batch_queue and calculate every gradient. I think this process is serialized instead of parallel. If there anything wrong with my own understanding? By the way, I can also use the iterator to feed image to my model rather than the dequeue right?
Thank you everybody!

This is a common misconception with Tensorflow's coding model.
What you are showing here is the computation graph's construction, NOT the actual execution.
The block:
for i in xrange(FLAGS.num_gpus):
with tf.device('/gpu:%d' % i):
with tf.name_scope('%s_%d' % (cifar10.TOWER_NAME, i)) as scope:
# Dequeues one batch for the GPU
image_batch, label_batch = batch_queue.dequeue()
# Calculate the loss for one tower of the CIFAR model. This function
# constructs the entire CIFAR model but shares the variables across
# all towers.
loss = tower_loss(scope, image_batch, label_batch)
translates to:
For each GPU device (`for i in range..` & `with device...`):
- build operations needed to dequeue a batch
- build operations needed to run the batch through the network and compute the loss
Note how via tf.get_variable_scope().reuse_variables() you're telling the graph that the variables used for the graph GPU must be shared among all (i.e., all graphs on the multiple devices "reuse" the same variables).
None of this actually runs the network once (note how there is no sess.run()): you're just giving instructions on how data must flow.
Then, when you'll start the actual training (I guess you missed that piece of the code when copying it here) each GPU will pull its own batch and produce the per-tower loss. I guess these losses are averaged somewhere in the subsequent code and the average is the loss passed to the optimizer.
Up until the point where the tower losses are averaged together, everything is independent from the other devices, so getting the batch and computing the loss can be done in parallel. Then the gradients and parameter update is done only once, variables are updated and the cycle repeats.
So, to answer your question, no, per-batch loss computation is not serialized, but since this is synchronous distributed computation you need to collect all losses from all GPUs before being allowed to continue with gradients computation and parameters update, so you still have some part of the graph that cannot be independent.

Why does recognition rate drop after multiple online training epochs?

I am using tensorflow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with batch size of 1. The recognition rate increased for the first few epochs, however, after several epochs the recognition rate started to drop greatly. (In the first 20 epochs, the recognition rate goes up to ~94%. Afterwards, the recognition rate went from 90->50->40->30->20). What is the reason for this?
Also, with a batch size of 1, the performance is worse than when using a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references but there seems to be contradictory results on whether small or large batch sizes achieve better performance. What would be this case in this situation?
Edit: I also added a figure of the recognition rate of the training dataset and the test dataset.Recognition rate vs. epoch
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)
#parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')
#model of neural network
def neural_network_model(data):
hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, n_nodes_hl1]) , name='l1_w'),
'biases': tf.Variable(tf.random_normal([n_nodes_hl1]) , name='l1_b')}
hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]) , name='l2_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl2]) , name='l2_b')}
hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]) , name='l3_w'),
'biases' :tf.Variable(tf.random_normal([n_nodes_hl3]) , name='l3_b')}
output_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]) , name='lo_w'),
'biases' :tf.Variable(tf.random_normal([n_classes]) , name='lo_b')}
l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
l2 = tf.nn.relu(l2)
l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
l3 = tf.nn.relu(l3)
output = tf.matmul(l3,output_layer['weights']) + output_layer['biases']
return output
#train neural network
def train_neural_network(x):
prediction = neural_network_model(x)
cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer().minimize(cost)
hm_epoches = 100
with tf.Session() as sess:
sess.run(tf.global_variables_initializer())
for epoch in range(hm_epoches):
epoch_loss=0
for batch in range (10000):
epoch_x, epoch_y=mnist.train.next_batch(batch_size)
_,c =sess.run([optimizer, cost], feed_dict = {x:epoch_x, y:epoch_y})
epoch_loss += c
correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
print(epoch_loss)
print('Accuracy_test:', accuracy.eval({x:mnist.test.images, y:mnist.test.labels}))
print('Accuracy_train:', accuracy.eval({x:mnist.train.images, y:mnist.train.labels}))
train_neural_network(x)

DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot", optimum performance followed by shortest training time. In my experience, it's been best to increase the batch size until my time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size the best used available RAM.

There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.

Logging training and validation loss in tensorboard

I'm trying to learn how to use tensorflow and tensorboard. I have a test project based on the MNIST neural net tutorial.
In my code, I construct a node that calculates the fraction of digits in a data set that are correctly classified, like this:
correct = tf.nn.in_top_k(self._logits, labels, 1)
correct = tf.to_float(correct)
accuracy = tf.reduce_mean(correct)
Here, self._logitsis the inference part of the graph, and labels is a placeholder that contains the correct labels.
Now, what I would like to do is evaluate the accuracy for both the training set and the validation set as training proceeds. I can do this by running the accuracy node twice, with different feed_dicts:
train_acc = tf.run(accuracy, feed_dict={images : training_set.images, labels : training_set.labels})
valid_acc = tf.run(accuracy, feed_dict={images : validation_set.images, labels : validation_set.labels})
This works as intended. I can print the values, and I can see that initially, the two accuracies will both increase, and eventually the validation accuracy will flatten out while the training accuracy keeps increasing.
However, I would also like to get graphs of these values in tensorboard, and I can not figure out how to do this. If I simply add a scalar_summary to accuracy, the logged values will not distinguish between training set and validation set.
I also tried creating two identical accuracy nodes with different names and running one on the training set and one on the validation set. I then add a scalar_summary to each of these nodes. This does give me two graphs in tensorboard, but instead of one graph showing the training set accuracy and one showing the validation set accuracy, they are both showing identical values that do not match either of the ones printed to the terminal.
I am probably misunderstanding how to solve this problem. What is the recommended way of separately logging the output from a single node for different inputs?

There are several different ways you could achieve this, but you're on the right track with creating different tf.summary.scalar() nodes. Since you must explicitly call SummaryWriter.add_summary() each time you want to log a quantity to the event file, the simplest approach is probably to fetch the appropriate summary node each time you want to get the training or validation accuracy:
accuracy = tf.reduce_mean(correct)
training_summary = tf.summary.scalar("training_accuracy", accuracy)
validation_summary = tf.summary.scalar("validation_accuracy", accuracy)
summary_writer = tf.summary.FileWriter(...)
for step in xrange(NUM_STEPS):
# Perform a training step....
if step % LOG_PERIOD == 0:
# To log training accuracy.
train_acc, train_summ = sess.run(
[accuracy, training_summary],
feed_dict={images : training_set.images, labels : training_set.labels})
writer.add_summary(train_summ, step)
# To log validation accuracy.
valid_acc, valid_summ = sess.run(
[accuracy, validation_summary],
feed_dict={images : validation_set.images, labels : validation_set.labels})
writer.add_summary(valid_summ, step)
Alternatively, you could create a single summary op whose tag is a tf.placeholder(tf.string, []) and feed the string "training_accuracy" or "validation_accuracy" as appropriate.

Another way to do it, is to use a second file writer. So you are able to use the merge_summaries command.
train_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/train',
sess.graph)
test_writer = tf.summary.FileWriter(FLAGS.summaries_dir + '/test')
tf.global_variables_initializer().run()
Here is the complete documentation. This works for me fine : TensorBoard: Visualizing Learning

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.