What is the best way to duplicate a TensorFlow graph and keep it up to date?
Ideally I want to put the duplicated graph on another device (e.g. move it from GPU to CPU) and then update the copy from time to time.
Short answer: You probably want checkpoint files (permalink).
Long answer:
Let's be clear about the setup here. I'll assume that you have two devices, A and B, and you are training on A and running inference on B.
Periodically, you'd like to update the parameters on the device running inference with new parameters found during training on the other.
The tutorial linked above is a good place to start. It shows you how tf.train.Saver objects work, and you shouldn't need anything more complicated here.
Here is an example:
import tensorflow as tf
def build_net(graph, device):
    with graph.as_default():
        with graph.device(device):
            # Input placeholders
            inputs = tf.placeholder(tf.float32, [None, 784])
            labels = tf.placeholder(tf.float32, [None, 10])
            # Initialization
            w0 = tf.get_variable('w0', shape=[784, 256], initializer=tf.contrib.layers.xavier_initializer())
            w1 = tf.get_variable('w1', shape=[256, 256], initializer=tf.contrib.layers.xavier_initializer())
            w2 = tf.get_variable('w2', shape=[256, 10], initializer=tf.contrib.layers.xavier_initializer())
            b0 = tf.Variable(tf.zeros([256]))
            b1 = tf.Variable(tf.zeros([256]))
            b2 = tf.Variable(tf.zeros([10]))
            # Inference network
            h1 = tf.nn.relu(tf.matmul(inputs, w0) + b0)
            h2 = tf.nn.relu(tf.matmul(h1, w1) + b1)
            output = tf.nn.softmax(tf.matmul(h2, w2) + b2)
            # Training network
            cross_entropy = tf.reduce_mean(-tf.reduce_sum(labels * tf.log(output), reduction_indices=[1]))
            optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
            # Your checkpoint function
            saver = tf.train.Saver()
            return tf.initialize_all_variables(), inputs, labels, output, optimizer, saver
The code for the training program:
def programA_main():
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

    # Build training network on device A
    graphA = tf.Graph()
    init, inputs, labels, _, training_net, saver = build_net(graphA, '/cpu:0')

    with tf.Session(graph=graphA) as sess:
        sess.run(init)
        for step in xrange(1, 10000):
            batch = mnist.train.next_batch(50)
            sess.run(training_net, feed_dict={inputs: batch[0], labels: batch[1]})
            if step % 100 == 0:
                saver.save(sess, '/tmp/graph.checkpoint')
                print 'saved checkpoint'
...and code for an inference program:
def programB_main():
    from tensorflow.examples.tutorials.mnist import input_data
    mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

    # Build inference network on device B
    graphB = tf.Graph()
    init, inputs, _, inference_net, _, saver = build_net(graphB, '/cpu:0')

    with tf.Session(graph=graphB) as sess:
        batch = mnist.test.next_batch(50)

        saver.restore(sess, '/tmp/graph.checkpoint')
        print 'loaded checkpoint'
        out = sess.run(inference_net, feed_dict={inputs: batch[0]})
        print out[0]

        import time; time.sleep(2)

        saver.restore(sess, '/tmp/graph.checkpoint')
        print 'loaded checkpoint'
        out = sess.run(inference_net, feed_dict={inputs: batch[0]})
        print out[1]
If you fire up the training program and then the inference program, you'll see the inference program produces two different outputs (from the same input batch). This is a result of it picking up the parameters that the training program has checkpointed.
Now, this program obviously isn't your end point. We don't do any real synchronization, and you'll have to decide what "periodic" means with respect to checkpointing. But this should give you an idea of how to sync parameters from one network to another.
One final warning: this does not mean that the two networks are necessarily deterministic. There are known non-deterministic elements in TensorFlow (e.g., this), so be wary if you need exactly the same answer. But this is the hard truth about running on multiple devices.
Good luck!
I'll try to go with a pretty simplified answer, to see if the general approach is what the OP is describing:
I'd implement it via the tf.train.Saver object.
Suppose you have your weights in variables W1, W2, and b1:
mysaver = tf.train.Saver({'w1': W1, 'w2': W2, 'b1': b1})
In the train loop you can add, every n iterations:
mysaver.save(session_var, 'model1', global_step=step)
And then in the loading instance, when needed, you run:
mysaver.restore(other_session_object, tf.train.latest_checkpoint('.'))
Hope this is similar to the solution you are asking for.
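Putting those pieces together, a minimal end-to-end sketch might look like the following (the shapes, the 'model1' prefix and the session names are just placeholders, and the actual training op is elided):
import tensorflow as tf

W1 = tf.Variable(tf.random_normal([784, 256]), name='w1')
W2 = tf.Variable(tf.random_normal([256, 10]), name='w2')
b1 = tf.Variable(tf.zeros([256]), name='b1')

mysaver = tf.train.Saver({'w1': W1, 'w2': W2, 'b1': b1})

# Training side: checkpoint every n steps
with tf.Session() as session_var:
    session_var.run(tf.global_variables_initializer())
    for step in range(1000):
        # ... run your training op here ...
        if step % 100 == 0:
            mysaver.save(session_var, 'model1', global_step=step)

# Loading side: restore the latest checkpoint into another session
with tf.Session() as other_session_object:
    mysaver.restore(other_session_object, tf.train.latest_checkpoint('.'))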
Simply do the round trip tf.Graph > tf.GraphDef > tf.Graph:
import tensorflow as tf
def copy_graph(graph: tf.Graph) -> tf.Graph:
    with tf.Graph().as_default() as copied_graph:
        graph_def = graph.as_graph_def(add_shapes=True)
        tf.graph_util.import_graph_def(graph_def)
    return copied_graph
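A quick usage sketch (my own addition, assuming TF 1.x): note that this copies only the graph structure, not the current variable values, and that import_graph_def prefixes imported node names with 'import/' by default.
g1 = tf.Graph()
with g1.as_default():
    x = tf.placeholder(tf.float32, shape=(), name='x')
    y = tf.square(x, name='y')

g2 = copy_graph(g1)
# Imported nodes carry the default 'import/' name prefix
x_copy = g2.get_tensor_by_name('import/x:0')
y_copy = g2.get_tensor_by_name('import/y:0')

with tf.Session(graph=g2) as sess:
    print(sess.run(y_copy, feed_dict={x_copy: 3.0}))  # 9.0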
Related
Before marking my question as a duplicate, I want you to understand that I have gone through a lot of questions, but none of the solutions there were able to clear my doubts and solve my problem. I have a trained neural network which I want to save, and later use to test against a test dataset.
I tried saving and restoring it, but I am not getting the expected results. Restoring doesn't seem to work; maybe I am using it wrongly, and it is just using the values given by the global variable initializer.
This is the code I am using for saving the model.
sess.run(tf.initializers.global_variables())
#num_epochs = 7
for epoch in range(num_epochs):
    start_time = time.time()
    train_accuracy = 0
    train_loss = 0
    val_loss = 0
    val_accuracy = 0
    for bid in range(int(train_data_size/batch_size)):
        X_train_batch = X_train[bid*batch_size:(bid+1)*batch_size]
        y_train_batch = y_train[bid*batch_size:(bid+1)*batch_size]
        sess.run(optimizer, feed_dict={x: X_train_batch, y: y_train_batch, prob: 0.50})
        train_accuracy = train_accuracy + sess.run(model_accuracy, feed_dict={x: X_train_batch, y: y_train_batch, prob: 0.50})
        train_loss = train_loss + sess.run(loss_value, feed_dict={x: X_train_batch, y: y_train_batch, prob: 0.50})
    for bid in range(int(val_data_size/batch_size)):
        X_val_batch = X_val[bid*batch_size:(bid+1)*batch_size]
        y_val_batch = y_val[bid*batch_size:(bid+1)*batch_size]
        val_accuracy = val_accuracy + sess.run(model_accuracy, feed_dict={x: X_val_batch, y: y_val_batch, prob: 0.75})
        val_loss = val_loss + sess.run(loss_value, feed_dict={x: X_val_batch, y: y_val_batch, prob: 0.75})
    train_accuracy = train_accuracy/int(train_data_size/batch_size)
    val_accuracy = val_accuracy/int(val_data_size/batch_size)
    train_loss = train_loss/int(train_data_size/batch_size)
    val_loss = val_loss/int(val_data_size/batch_size)
    end_time = time.time()
    saver.save(sess, './blood_model_x_v2', global_step=epoch)
After saving the model, the files are written in my working directory something like this.
blood_model_x_v2-2.data-00000-of-00001
blood_model_x_v2-2.index
blood_model_x_v2-2.meta
Similarly, v2-3 and so on up to v2-6, and then a 'checkpoint' file. I then tried restoring it using this code snippet (after initializing), but I am getting different results from the expected ones. What am I doing wrong?
saver = tf.train.import_meta_graph('blood_model_x_v2-5.meta')
saver.restore(test_session,tf.train.latest_checkpoint('./'))
According to the TensorFlow docs:
Restore
Restores previously saved variables.
This method runs the ops added by the constructor for restoring
variables. It requires a session in which the graph was launched. The
variables to restore do not have to have been initialized, as
restoring is itself a way to initialize variables.
Let's see an example:
We save the model similar to this:
import tensorflow as tf
# Prepare to feed input, i.e. feed_dict and placeholders
w1 = tf.placeholder("float", name="w1")
w2 = tf.placeholder("float", name="w2")
b1 = tf.Variable(2.0, name="bias")
feed_dict = {w1: 4, w2: 8}
# Define a test operation that we will restore
w3 = tf.add(w1, w2)
w4 = tf.multiply(w3, b1, name="op_to_restore")
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Create a saver object which will save all the variables
saver = tf.train.Saver()
# Run the operation by feeding input
print (sess.run(w4, feed_dict))
# Prints 24, which is (w1 + w2) * b1
# Now, save the graph
saver.save(sess, './ckpnt/my_test_model', global_step=1000)
And then load the trained model with:
import tensorflow as tf
sess = tf.Session()
# First let's load meta graph and restore weights
saver = tf.train.import_meta_graph('./ckpnt/my_test_model-1000.meta')
saver.restore(sess, tf.train.latest_checkpoint('./ckpnt'))
# Now, let's access the saved placeholder variables and
# create a feed_dict to feed new data
graph = tf.get_default_graph()
w1 = graph.get_tensor_by_name("w1:0")
w2 = graph.get_tensor_by_name("w2:0")
feed_dict = {w1: 13.0, w2: 17.0}
# Now, access the op that you want to run.
op_to_restore = graph.get_tensor_by_name("op_to_restore:0")
print (sess.run(op_to_restore, feed_dict))
# This will print 60 which is calculated
# using new values of w1 and w2 and saved value of b1.
As you can see, we do not run the variable initializer in the restoring part. There is a better way to save and restore a model with tf.train.Checkpoint, which allows you to check whether the model was restored correctly or not.
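For illustration, here is a minimal sketch of that object-based tf.train.Checkpoint API (this assumes a recent TF release with eager execution enabled; the './ckpnt_obj' directory name is made up):
import tensorflow as tf

b1 = tf.Variable(2.0, name="bias")
ckpt = tf.train.Checkpoint(bias=b1)
manager = tf.train.CheckpointManager(ckpt, './ckpnt_obj', max_to_keep=3)

save_path = manager.save()

# Restore and verify that everything we declared was actually matched
status = ckpt.restore(manager.latest_checkpoint)
status.assert_existing_objects_matched()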
I am new to TensorFlow and I am trying to implement a simple feed-forward network for regression, just for learning purposes. The complete executable code is as follows.
The regression mean squared error is around 6, which is quite large. It is a little unexpected because the function to regress is linear and simple, 2*x+y, and I expect better performance.
I am asking for help to check if I did anything wrong in the code. I carefully checked the matrix dimensions, so that should be fine, but it is possible that I misunderstand something and the network or the session is not properly configured. For example, should I run the training session multiple times instead of just once (the code below enclosed by # TRAINING #)? I see in some examples that they feed the data in piece by piece and run the training progressively, whereas I run the training just once and feed in all the data.
If the code is good, maybe this is a modeling issue, but I really don't expect to use a complicated network for such a simple regression.
import tensorflow as tf
import numpy as np
from sklearn.metrics import mean_squared_error
# inputs are points from a 100x100 grid in domain [-2,2]x[-2,2], total 10000 points
lsp = np.linspace(-2,2,100)
gridx,gridy = np.meshgrid(lsp,lsp)
inputs = np.dstack((gridx,gridy))
inputs = inputs.reshape(-1,inputs.shape[-1]) # reshapes the grid into a 10000x2 matrix
feature_size = inputs.shape[1] # feature_size is 2, features are the 2D coordinates of each point
input_size = inputs.shape[0] # input_size is 10000
# a simple function f(x)=2*x[0]+x[1] to regress
f = lambda x: 2 * x[0] + x[1]
label_size = 1
labels = f(inputs.transpose()).reshape(-1,1) # reshapes labels as a column vector
ph_inputs = tf.placeholder(tf.float32, shape=(None, feature_size), name='inputs')
ph_labels = tf.placeholder(tf.float32, shape=(None, label_size), name='labels')
# just one hidden layer with 16 units
hid1_size = 16
w1 = tf.Variable(tf.random_normal([hid1_size, feature_size], stddev=0.01), name='w1')
b1 = tf.Variable(tf.random_normal([hid1_size, label_size]), name='b1')
y1 = tf.nn.relu(tf.add(tf.matmul(w1, tf.transpose(ph_inputs)), b1))
# the output layer
wo = tf.Variable(tf.random_normal([label_size, hid1_size], stddev=0.01), name='wo')
bo = tf.Variable(tf.random_normal([label_size, label_size]), name='bo')
yo = tf.transpose(tf.add(tf.matmul(wo, y1), bo))
# defines optimizer and predictor
lr = tf.placeholder(tf.float32, shape=(), name='learning_rate')
loss = tf.losses.mean_squared_error(ph_labels,yo)
optimizer = tf.train.GradientDescentOptimizer(lr).minimize(loss)
predictor = tf.identity(yo)
# TRAINING
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
_, c = sess.run([optimizer, loss], feed_dict={lr:0.05, ph_inputs: inputs, ph_labels: labels})
# TRAINING
# gets the regression results
predictions = np.zeros((input_size,1))
for i in range(input_size):
    predictions[i] = sess.run(predictor, feed_dict={ph_inputs: inputs[i, None]}).squeeze()
# prints regression MSE
print(mean_squared_error(predictions, labels))
You're right, you understood the problem by yourself.
The problem is, in fact, that you're running the optimization step only once. Hence you're doing a single update step of your network parameters, and therefore the cost won't decrease.
I just changed the training session of your code in order to make it work as expected (100 training steps):
# TRAINING
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)

for i in range(100):
    _, c = sess.run(
        [optimizer, loss],
        feed_dict={
            lr: 0.05,
            ph_inputs: inputs,
            ph_labels: labels
        })
    print("Train step {} loss value {}".format(i, c))
# TRAINING
and at the end of the training I get:
Train step 99 loss value 0.04462708160281181
0.044106700712455045
I'm having trouble loading a model to resume training.
I'm using a simple two-layer NN (fully connected) on a CIFAR dataset for practice.
NN Setup:
#full_connected_layers
import tensorflow as tf
import numpy as np
# input -> hidden -> output
def inference(data_samples, image_pixels, hidden_units, classes, reg_constant):
    with tf.variable_scope('Layer1'):
        # Define the variables
        weights = tf.get_variable(
            name='weights',
            shape=[image_pixels, hidden_units],
            initializer=tf.truncated_normal_initializer(
                stddev=1.0 / np.sqrt(float(image_pixels))),
            regularizer=tf.contrib.layers.l2_regularizer(reg_constant)
        )
        biases = tf.Variable(tf.zeros([hidden_units]), name='biases')
        # Define the layer's output
        hidden = tf.nn.relu(tf.matmul(data_samples, weights) + biases)

    with tf.variable_scope('Layer2'):
        # Define variables
        weights = tf.get_variable('weights', [hidden_units, classes],
            initializer=tf.truncated_normal_initializer(
                stddev=1.0 / np.sqrt(float(hidden_units))),
            regularizer=tf.contrib.layers.l2_regularizer(reg_constant))
        biases = tf.Variable(tf.zeros([classes]), name='biases')
        # Define the layer's output
        logits = tf.matmul(hidden, weights) + biases

    # Define summary operation for the 'logits' variable
    tf.summary.histogram('logits', logits)
    return logits
def loss(logits, labels):
    '''Calculates the loss from logits and labels.
    Args:
        logits: Logits tensor, float - [batch size, number of classes].
        labels: Labels tensor, int64 - [batch size].
    Returns:
        loss: Loss tensor of type float.
    '''
    with tf.name_scope('Loss'):
        # Operation to determine the cross entropy between logits and labels
        cross_entropy = tf.reduce_mean(
            tf.nn.sparse_softmax_cross_entropy_with_logits(
                logits=logits, labels=labels, name='cross_entropy'))
        # Operation for the loss function
        loss = cross_entropy + tf.add_n(tf.get_collection(
            tf.GraphKeys.REGULARIZATION_LOSSES))
        # Add a scalar summary for the loss
        tf.summary.scalar('loss', loss)
    return loss
def training(loss, learning_rate):
    # Create a variable to track the global step
    global_step = tf.Variable(0, name='global_step', trainable=False)
    train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
        loss, global_step=global_step)
    #train_step = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(
    #    loss, global_step=global_step)
    return train_step
def evaluation(logits, labels):
    with tf.name_scope('Accuracy'):
        # Operation comparing prediction with true label
        correct_prediction = tf.equal(tf.argmax(logits, 1), labels)
        # Operation calculating the accuracy of the predictions
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
        # Summary operation for the accuracy
        tf.summary.scalar('train_accuracy', accuracy)
    return accuracy
I saved the model like this:
if (i + 1) % 500 == 0:
    saver.save(sess, MODEL_DIR, global_step=i)
    print('Saved checkpoint')
Saved model files
Within this directory:
C:\Users\Moondra\Desktop\CIFAR - PROJECT\parameters_no_changes
I have the following files as well as model.ckpt-499.index etc:
model.ckpt-999.meta
model.ckpt-999.index
model.ckpt-999.data-00000-of-00001
My attempt at loading the model
import numpy as np
import tensorflow as tf
import time
from datetime import datetime
import os
import data_helpers
import full_connected_layers
import itertools
learning_rate = .0001
max_steps = 3000
batch_size = 400
checkpoint = r'C:\Users\Moondra\Desktop\CIFAR - PROJECT\parameters_no_changes\model.ckpt-999'
with tf.Session() as sess:
    saver = tf.train.import_meta_graph(r'C:\Users\Moondra\Desktop\CIFAR - PROJECT' +
                                       '\\parameters_no_changes\model.ckpt-999.meta')
    saver.restore(sess, checkpoint)

data_sets = data_helpers.load_data()
images = tf.get_default_graph().get_tensor_by_name('images:0')        # image placeholder
labels = tf.get_default_graph().get_tensor_by_name('image-labels:0')  # placeholder
loss = tf.get_default_graph().get_tensor_by_name('Loss/add:0')
#global_step = tf.get_default_graph().get_tensor_by_name('global_step/initial_value_1:0')
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
accuracy = tf.get_default_graph().get_tensor_by_name('Accuracy/Mean:0')

with tf.Session() as sess:
    #sess.run(tf.global_variables_initializer())
    zipped_data = zip(data_sets['images_train'], data_sets['labels_train'])
    batches = data_helpers.gen_batch(list(zipped_data), batch_size, max_steps)

    for i in range(max_steps):
        # Get next input data batch
        batch = next(batches)
        images_batch, labels_batch = zip(*batch)
        feed_dict = {
            images: images_batch,
            labels: labels_batch
        }
        if i % 100 == 0:
            train_accuracy = sess.run(accuracy, feed_dict=feed_dict)
            print('Step {:d}, training accuracy {:g}'.format(i, train_accuracy))
        ts, loss_ = sess.run([train_step, loss], feed_dict=feed_dict)
Errors and confusion
1) Should I be using latest_checkpoint to restore, like this:
saver.restore(sess, tf.train.latest_checkpoint('./'))
I see some tutorials that just point to the folder holding the .data and .index files.
2) Which brings me to the second question: what should I be using as the second parameter of saver.restore? Currently I'm just pointing to the folder/dir that holds those files.
3) I'm purposely not initializing any variables, as I was told that would overwrite the stored weight and bias values. This seems to be leading to this error:
FailedPreconditionError (see above for traceback): Attempting to use uninitialized value Layer1/weights
[[Node: Layer1/weights/read = Identity[T=DT_FLOAT, _class=["loc:#Layer1/weights"], _device="/job:localhost/replica:0/task:0/cpu:0"](Layer1/weights)]]
4) However, if I do initialize all variables via this code:
sess.run(tf.global_variables_initializer())
my model seems to start training from scratch (and not resume training).
Does that mean I'm supposed to load all weights and biases via get_tensor explicitly? If so, how do I deal with models that have 20-plus layers?
5) When I run this command:
for i in tf.get_default_graph().get_operations():
    print(i.values)
I see many global_step tensors/operations:
<bound method Operation.values of <tf.Operation 'global_step/initial_value' type=Const>>
<bound method Operation.values of <tf.Operation 'global_step' type=VariableV2>>
<bound method Operation.values of <tf.Operation 'global_step/Assign' type=Assign>>
<bound method Operation.values of <tf.Operation 'global_step/read' type=Identity>>
I was trying to load this variable into my current graph, but didn't know which one I'm supposed to get using get_tensor_by_name. Most of them were resulting in a 'does not exist' error.
6) Same with loss: which loss op am I supposed to load into my graph with get_tensor_by_name?
These are the options:
<bound method Operation.values of <tf.Operation 'Loss/Const' type=Const>>
<bound method Operation.values of <tf.Operation 'Loss/Mean' type=Mean>>
<bound method Operation.values of <tf.Operation 'Loss/AddN' type=AddN>>
<bound method Operation.values of <tf.Operation 'Loss/add' type=Add>>
<bound method Operation.values of <tf.Operation 'Loss/loss/tags' type=Const>>
<bound method Operation.values of <tf.Operation 'Loss/loss' type=ScalarSummary>>
7) Lastly, I see a lot of gradient operations when I look at all the nodes of the graph, but I don't see any nodes related to train_step (the Python variable I created that points to the GradientDescentOptimizer). Does that mean I don't need to load it into this graph via get_tensor?
Thank you.
I usually do this sequence of operations:
Initialize
Restore
This translates to this kind of code:
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, tf.train.latest_checkpoint('./'))
    ...
It will avoid the non-initialized error, and the restore will overwrite with the values from the checkpoint.
1/ In the folder where you save your checkpoint, there should be a file named 'checkpoint' which contains the name of your latest checkpoint.
I normally read this file to find the latest checkpoint.
2/ I use checkpoint_directory/global_step.
With this, tf will create 4 files in the checkpoint_directory:
global_step.data-00000-of-00001
global_step.index
global_step.meta
checkpoint
3/ 4/ I'm pretty sure you don't need to pre-initialize the graph before loading, at least I don't do it.
There is some difference: instead of import_meta_graph, I rebuild the whole graph every time I load, but I'm sure it's not an issue to load before you initialize.
5/ Be careful not to mistake operations for tensors and you are good to go. A tensor name should be op_name:0, which means the tensor is output 0 of the operation op_name.
6/ 7/ Well, let me just tell you how I resume my checkpoint. This is probably not the correct way, but it really saves me from the burden of get_tensor_by_name. Seriously get_tensor_by_name can be a real pita sometimes.
Normally my loading process goes through: rebuild the graph, load the checkpoint, create some new tensors if needed, then initialize the variables that are not in the checkpoint.
build_net()
saver = tf.train.Saver()
saver.restore(session, checkpoint_dir/global_step)
add_loss_and_optimizer()
initialize_all_uninitialized_tensor
checkpoint_dir/global_step comes from the checkpoint file if you want the latest checkpoint, or you can use a different global_step to get the specific checkpoint that you want to load.
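For reference, one way to do the "initialize everything the checkpoint did not cover" step is to ask the session which variables are still uninitialized after saver.restore. This helper is my own sketch, not part of the answer above:
def initialize_uninitialized(session):
    # Variables restored from the checkpoint report as initialized; init the rest
    global_vars = tf.global_variables()
    init_flags = session.run([tf.is_variable_initialized(v) for v in global_vars])
    not_initialized = [v for v, flag in zip(global_vars, init_flags) if not flag]
    if not_initialized:
        session.run(tf.variables_initializer(not_initialized))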
I am using TensorFlow 1.0 and I have developed a simple program to measure performance. I have a silly model as follows:
def model(example_batch):
    h1 = tf.layers.dense(inputs=example_batch, units=64, activation=tf.nn.relu)
    h2 = tf.layers.dense(inputs=h1, units=2)
    return h2
and a simple function to run the simulation:
def testPerformanceFromMemory(model, iter=1000, num_cores=2):
    example_batch = tf.placeholder(np.float32, shape=(64, 128))
    for core in range(num_cores):
        with tf.device('/gpu:%d' % core):
            prediction = model(example_batch)
    init_op = tf.global_variables_initializer()
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    sess.run(init_op)
    tf.train.start_queue_runners(sess=sess)
    input_array = np.random.random((64, 128))
    for step in range(iter):
        myprediction = sess.run(prediction, feed_dict={example_batch: input_array})
If I run the Python script and then run the nvidia-smi command, I can see that GPU0 is running with a high percentage of usage, but GPU1 is at 0% usage.
I read this: https://www.tensorflow.org/tutorials/using_gpu and this: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py but I don't know why my example doesn't run on multiple GPUs.
PS: If I download the CIFAR-10 example from the TensorFlow repository, it runs in multi-GPU mode.
Edit: As mrry says, I am overwriting prediction, so I post here the corrected version:
def testPerformanceFromMemory(model, iter=1000, num_cores=2):
    example_batch = tf.placeholder(np.float32, shape=(64, 128))
    prediction = []
    for core in range(num_cores):
        with tf.device('/gpu:%d' % core):
            prediction.append([model(example_batch)])
    init_op = tf.global_variables_initializer()
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))
    sess.run(init_op)
    tf.train.start_queue_runners(sess=sess)
    input_array = np.random.random((64, 128))
    for step in range(iter):
        myprediction = sess.run(prediction, feed_dict={example_batch: input_array})
Looking at your program, you are creating several parallel subgraphs (often called "towers") on different GPU devices, but overwriting the prediction tensor in each iteration of the first for loop:
for core in range(num_cores):
    with tf.device('/gpu:%d' % core):
        prediction = model(example_batch)

# ...

for step in range(iter):
    myprediction = sess.run(prediction, feed_dict={example_batch: input_array})
As a result, when you call sess.run(prediction, ...) you will only be running the subgraph that was created in the final iteration of the first for loop, which only runs on one GPU.
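If you want a single sess.run call to exercise every GPU, keep all of the tower outputs (as in the edited code in the question) or combine them explicitly. A hedged sketch, reusing model, example_batch, num_cores, sess and input_array from the question's code:
predictions = []
for core in range(num_cores):
    with tf.device('/gpu:%d' % core):
        predictions.append(model(example_batch))

# Running `combined` (or the `predictions` list itself) executes every tower
combined = tf.reduce_mean(tf.stack(predictions), axis=0)
myprediction = sess.run(combined, feed_dict={example_batch: input_array})
Note that each tower builds its own copy of the tf.layers.dense variables unless you add variable sharing; that is a separate concern from the one addressed above.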
I am new to neural networks. I have gone through the TensorFlow MNIST ML Beginners tutorial,
used the TensorFlow basic MNIST tutorial,
and am trying to get a prediction using an external image.
I have updated the MNIST example provided by TensorFlow.
On top of that I have added a few things:
1. Saving trained models locally.
2. Loading the saved models.
3. Preprocessing the image into 28 x 28.
I have attached the image for reference.
1. While training the model, save it locally, so I can reuse it at any point of time.
2. After training, load the model.
3. Create an external image via GIMP which contains any one value ranging from 0-9.
4. Use OpenCV to convert the image into a 28 x 28 image and invert the bits as well.
5. Then try to predict.
I am able to train the model and save it properly.
I am getting predictions which are not right.
Find my code below.
1. TrainSimple.py
# Load MNIST Data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
from random import randint
from scipy import misc
# Start TensorFlow InteractiveSession
import tensorflow as tf
sess = tf.InteractiveSession()
# Placeholders
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
# Variables
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
sess.run(tf.initialize_all_variables())
# Predicted Class and Cost Function
y = tf.nn.softmax(tf.matmul(x,W) + b)
cross_entropy = -tf.reduce_sum(y_*tf.log(y))
saver = tf.train.Saver() # defaults to saving all variables
# GradientDescentOptimizer
train_step = tf.train.GradientDescentOptimizer(0.01).minimize(cross_entropy)
# Train the Model
for i in range(40000):
    if (i + 1) == 40000:
        saver.save(sess, "/Users/xxxx/Desktop/TensorFlow/"+"/model.ckpt", global_step=i)
    batch = mnist.train.next_batch(50)
    train_step.run(feed_dict={x: batch[0], y_: batch[1]})
# Evaluate the Model
correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels}))
2. loadImageAndPredict.py
from random import randint
from scipy import misc
import numpy as np
import cv2
def preProcess(invert_file):
    print "preprocessing the images" + invert_file
    image = cv2.imread(invert_file, 0)
    ret, image_thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    l, b = image.shape
    fr = 0
    lr = 0
    fc = 0
    lc = 0
    i = 0
    while len(set(image_thresh[i,])) == 1:
        i += 1
    fr = i
    i = 0
    while len(set(image_thresh[-1+i,])) == 1:
        i -= 1
    lr = i + l
    j = 0
    while len(set(image_thresh[0:, j])) == 1:
        j += 1
    fc = j
    j = 0
    while len(set(image_thresh[0:, -1+j])) == 1:
        j -= 1
    lc = j + b
    image_crop = image_thresh[fr:lr, fc:lc]
    image_padded = cv2.copyMakeBorder(image_crop, 5, 5, 5, 5, cv2.BORDER_CONSTANT, value=255)
    image_resized = cv2.resize(image_padded, (28, 28))
    image_resized = (255 - image_resized)
    cv2.imwrite(invert_file, image_resized)
import tensorflow as tf
sess = tf.InteractiveSession()
# Placeholders
x = tf.placeholder(tf.float32, shape=[None, 784])
y_ = tf.placeholder(tf.float32, shape=[None, 10])
# # Variables
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
# Predicted Class and Cost Function
y = tf.nn.softmax(tf.matmul(x,W) + b)
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
saver = tf.train.Saver() # defaults to saving all variables - in this case w and b
# Train the Model
# GradientDescentOptimizer
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
flag_1 = 0
# create an an array where we can store 1 picture
images = np.zeros((1,784))
# and the correct values
correct_vals = np.zeros((1,10))
preProcess("4_white.png")
gray = cv2.imread("4_white.png", 0)
flatten = gray.flatten() / 255.0
"""
we need to store the flatten image and generate
the correct_vals array
correct_val for a digit (9) would be
[0,0,0,0,0,0,0,0,0,1]
"""
images[0] = flatten
# print images[0]
print len(images[0])
sess.run(tf.initialize_all_variables())
ckpt = tf.train.get_checkpoint_state("/Users/xxxx/Desktop/TensorFlow")
if ckpt and ckpt.model_checkpoint_path:
    saver.restore(sess, ckpt.model_checkpoint_path)
my_classification = sess.run(tf.argmax(y, 1), feed_dict={x: [images[0]]})
print 'Neural Network predicted', my_classification[0], "for your digit"
I am not sure what mistake I have made.
Thinking that a simple model might not work, I have used this convolutional code to predict:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/mnist/convolutional.py
Even that does not predict properly :(
Some things to check:
Does your training loss go down?
Do you get high accuracy on training dataset?
Do you get high accuracy on the validation dataset (a part of the training set set aside)?
Do you get high accuracy on your target dataset?
If your training loss does not go down (1.), then you are not learning, and you need to try different hyper-parameters, such as learning rates.
If you have high (2.) and low (3.), you are overfitting, and need to train for less time or use a higher regularization penalty. If you have high (3.) and low (4.), your training set is not representative of your actual dataset. You need to make your training set more representative, or at least harder, for instance by adding distortions.
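As a concrete way to run checks 2-4 above in the question's MNIST setup (this assumes the accuracy op, the x and y_ placeholders, and the mnist object from the training script are in scope; check 4 would use your own preprocessed images instead of mnist.test):
train_acc = accuracy.eval(feed_dict={x: mnist.train.images, y_: mnist.train.labels})
val_acc = accuracy.eval(feed_dict={x: mnist.validation.images, y_: mnist.validation.labels})
test_acc = accuracy.eval(feed_dict={x: mnist.test.images, y_: mnist.test.labels})
print("train %.3f  validation %.3f  test %.3f" % (train_acc, val_acc, test_acc))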