(I'm using TensorFlow 1.0 and Python 2.7)
I'm having trouble getting an Estimator to work with queues. If I use the deprecated SKCompat interface with custom data files and a given batch size, the model trains properly. I'm trying to use the new interface with an input_fn that batches features out of TFRecord files (equivalent to my custom data files). The script runs properly, but the loss value stops changing after 200 or 300 steps. It seems that the model is looping over a small input batch (which would explain why the loss converges so fast).
I have a 'run.py' script that looks like the following:
import tensorflow as tf
from tensorflow.contrib import learn, metrics
#[...]
evalMetrics = {'accuracy': learn.MetricSpec(metric_fn=metrics.streaming_accuracy)}
runConfig = learn.RunConfig(save_summary_steps=10)
estimator = learn.Estimator(model_fn=myModel,
                            params=myParams,
                            model_dir='/tmp/myDir',
                            config=runConfig)

session = tf.Session(graph=tf.get_default_graph())
with session.as_default():
    tf.global_variables_initializer()
    coordinator = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=session, coord=coordinator)
    estimator.fit(input_fn=lambda: inputToModel(trainingFileList), steps=10000)
    estimator.evaluate(input_fn=lambda: inputToModel(evalFileList), steps=10000, metrics=evalMetrics)
    coordinator.request_stop()
    coordinator.join(threads)
session.close()
My inputToModel function looks like this:
import tensorflow as tf

def inputToModel(fileList):
    features = {'rawData': tf.FixedLenFeature([100], tf.float32),
                'label': tf.FixedLenFeature([], tf.int64)}
    tensorDict = tf.contrib.learn.read_batch_record_features(fileList,
                                                             batch_size=100,
                                                             features=features,
                                                             randomize_input=True,
                                                             reader_num_threads=4,
                                                             num_epochs=1,
                                                             name='inputPipeline')
    tf.local_variables_initializer()
    data = tensorDict['rawData']
    labelTensor = tensorDict['label']
    inputTensor = tf.reshape(data, [-1, 10, 10, 1])
    return inputTensor, labelTensor
Any help or suggestions are welcome!
Try to use: tf.global_variables_initializer().run()
I want to do a similar thing, but I do not know how to use the Estimator API with multi-threading. There is also an Experiment class for serving - it might be useful.
Delete the lines session = tf.Session(graph=tf.get_default_graph()) and session.close(), and try:

with tf.Session() as sess:
    tf.global_variables_initializer().run()
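For what it's worth, a minimal sketch of run.py with these suggestions applied might look like the following. It reuses myModel, myParams, inputToModel and the file lists from the question, and is only an illustration of the proposed restructuring, not a verified fix:

import tensorflow as tf
from tensorflow.contrib import learn, metrics

estimator = learn.Estimator(model_fn=myModel,
                            params=myParams,
                            model_dir='/tmp/myDir',
                            config=learn.RunConfig(save_summary_steps=10))

with tf.Session() as sess:
    # Actually run the init ops (note the .run()) instead of only constructing them.
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()  # the num_epochs counter in the input pipeline is a local variable
    coordinator = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)
    estimator.fit(input_fn=lambda: inputToModel(trainingFileList), steps=10000)
    estimator.evaluate(input_fn=lambda: inputToModel(evalFileList), steps=10000,
                       metrics={'accuracy': learn.MetricSpec(metric_fn=metrics.streaming_accuracy)})
    coordinator.request_stop()
    coordinator.join(threads)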
I used OpenAI to train a deepq model. After doing
saver = tf.train.Saver()
saver.save(tf.get_default_session(), 'my_deepq')
I got the following files:
my_deepq.data-00000-of-00001
my_deepq.index
checkpoint
my_deepq.meta
I then need to load this model in two different systems (C++ and Python) to do inference.
For the Python part, I tried:
import tensorflow as tf
tf.reset_default_graph()
imported_graph = tf.train.import_meta_graph('my_deepq.meta')
with tf.Session() as sess:
    imported_graph.restore(sess, './my_deepq')
The code ran, but I am not sure where the model was loaded or how to do the inference. Could someone please advise?
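For reference, the usual pattern after import_meta_graph()/restore() is to look tensors up by name in the restored default graph and run them in the same session. A sketch, where 'My_Input:0' and 'My_Output:0' are hypothetical names standing in for whatever the deepq graph actually calls its input and output tensors:

import tensorflow as tf

tf.reset_default_graph()
imported_graph = tf.train.import_meta_graph('my_deepq.meta')

with tf.Session() as sess:
    imported_graph.restore(sess, './my_deepq')   # the variable values now live in this session
    graph = tf.get_default_graph()
    # Hypothetical tensor names; list the real ones with
    # [op.name for op in graph.get_operations()].
    input_tensor = graph.get_tensor_by_name('My_Input:0')
    output_tensor = graph.get_tensor_by_name('My_Output:0')
    result = sess.run(output_tensor, feed_dict={input_tensor: observation})  # 'observation' = your input data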
For the C++ side, I will do something like:
tensorflow::Session *my_sess;
tensorflow::Status status = tensorflow::NewSession(options, &my_sess);
tensorflow::GraphDef graph_def;
status = ReadBinaryProto(tensorflow::Env::Default(), model_path, &graph_def);
status = my_sess->Create(graph_def);
std::vector<tensorflow::Tensor> output_tensors;
status = my_sess->Run({{"My_Input", input_tensor}}, {"My_Output"}, {}, &output_tensors);
This approach requires the model to be in BinaryProto format, but I am not sure how to save my model in BinaryProto in Python. Could anyone please advise? Thank you!
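For reference, one common way to produce a single binary GraphDef that the C++ ReadBinaryProto() call can load is to freeze the checkpoint into a self-contained graph. A sketch, assuming 'My_Output' is the actual name of the output node in your graph:

import tensorflow as tf

with tf.Session() as sess:
    saver = tf.train.import_meta_graph('my_deepq.meta')
    saver.restore(sess, './my_deepq')
    # Bake the variable values into the graph as constants, then serialize the
    # result as a binary protobuf readable by the C++ API.
    frozen_graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, output_node_names=['My_Output'])
    tf.train.write_graph(frozen_graph_def, '.', 'my_deepq_frozen.pb', as_text=False)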
I am fine-tuning an Inception model via tensorflow with the below setup, and am feeding batches via the tf.data Dataset API. However, every time I attempt to train this model (before successfully retrieving any batches), I get an OutOfRangeError claiming that the iterator is exhausted:
Caught OutOfRangeError. Stopping Training. End of sequence
[[node IteratorGetNext (defined at <ipython-input-8-c768436e70d8>:13) = IteratorGetNext[output_shapes=[[?,224,224,3], [?,1]], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](OneShotIterator)]]
I created a function to feed in hard-coded batches as the result of get_batch, and this runs and converges without any issues, leading me to believe that the graph and session code is working properly. I also tested the get_batch function iterating in a session, and this causes no errors. The behavior I would expect is that restarting training (especially after resetting the notebook, etc.) would produce a fresh iterator over the dataset.
Code to train model:
with tf.Graph().as_default():
    tf.logging.set_verbosity(tf.logging.INFO)
    images, labels = get_batch(filenames=tf_train_record_path + train_file)

    # Create the model, use the default arg scope to configure the batch norm parameters.
    with slim.arg_scope(inception.inception_v1_arg_scope()):
        logits, ax = inception.inception_v1(images, num_classes=1, is_training=True)

    # Specify the loss function:
    tf.losses.mean_squared_error(labels, logits)
    total_loss = tf.losses.get_total_loss()
    tf.summary.scalar('losses/Total_Loss', total_loss)

    # Specify the optimizer and create the train op:
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
    train_op = slim.learning.create_train_op(total_loss, optimizer)

    # Run the training:
    final_loss = slim.learning.train(
        train_op,
        logdir=train_dir,
        init_fn=get_init_fn(),
        number_of_steps=1)
Code to get batches using Dataset
def get_batch(filenames):
    dataset = tf.data.TFRecordDataset(filenames=filenames)
    dataset = dataset.map(parse)
    dataset = dataset.batch(2)
    iterator = dataset.make_one_shot_iterator()
    data_X, data_y = iterator.get_next()
    return data_X, data_y
This previously asked question resembles the issue I am experiencing; however, I am not using a batch_join call. I am not sure if this is an issue with slim.learning.train, restoring from a checkpoint, or scope. Any help would be appreciated!
Your input pipeline looks ok. The problem might be with a damaged TFRecords file. You can try your code with random data, or use your images as numpy arrays with tf.data.Dataset.from_tensor_slices().
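A minimal sketch of the numpy-array route with random stand-in data (the shapes match the output_shapes in the error message above; nothing here comes from the question itself):

import numpy as np
import tensorflow as tf

images_np = np.random.uniform(size=(10, 224, 224, 3)).astype(np.float32)
labels_np = np.random.uniform(size=(10, 1)).astype(np.float32)

dataset = tf.data.Dataset.from_tensor_slices((images_np, labels_np))
dataset = dataset.repeat().batch(2)  # repeat() keeps the one-shot iterator from running out
images, labels = dataset.make_one_shot_iterator().get_next()
# These tensors can be passed to inception_v1() in place of get_batch()'s output.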
Also, your parse function may cause problems. Try printing your image/label with sess.run.
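Something like the following quick check (a sketch reusing parse and the file path from the question) will show whether the records decode at all:

import tensorflow as tf

dataset = tf.data.TFRecordDataset(filenames=[tf_train_record_path + train_file])
dataset = dataset.map(parse).batch(2)
image_batch, label_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    img_vals, label_vals = sess.run([image_batch, label_batch])
    print(img_vals.shape, label_vals)  # expect (2, 224, 224, 3) and two labels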
And I'd advise using the Estimator API instead of slim.learning.train. It is much more convenient, and slim will be deprecated soon.
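As a rough illustration of that suggestion (model_fn is a placeholder for a model function you would still have to write; the input side can reuse the question's get_batch logic):

import tensorflow as tf

def input_fn():
    dataset = tf.data.TFRecordDataset(filenames=[tf_train_record_path + train_file])
    dataset = dataset.map(parse).batch(2)
    return dataset.make_one_shot_iterator().get_next()

estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir=train_dir)
estimator.train(input_fn=input_fn, steps=1)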
Is it possible to create a hub module from existing checkpoints without changing the training code?
Yes, absolutely. You need a session with (1) a Module and (2) the proper values in its variables. It doesn't matter if those come from actual training or merely restoring a checkpoint. Given a Python library for model building that knows nothing about TensorFlow Hub, you can have a tool on the side for export to a Hub Module that looks like:
import tensorflow as tf
import tensorflow_hub as hub
import your_library as build_model_body

def module_fn():
    inputs = tf.placeholder(...)
    logits = build_model_body(inputs)
    hub.add_signature(inputs=inputs, outputs=logits)

def main(_):
    spec = hub.create_module_spec(module_fn)
    # Supply a checkpoint trained on a model from the same Python code.
    checkpoint_path = "..."
    # Output will be written here:
    export_path = "..."
    with tf.Graph().as_default():
        module = hub.Module(spec)
        init_fn = tf.contrib.framework.assign_from_checkpoint_fn(
            checkpoint_path, module.variable_map)
        with tf.Session() as session:
            init_fn(session)
            module.export(export_path, session=session)
Fine points to note:
build_model_body() should transform inputs to outputs (say, pixels to feature vectors) as suitable for a Hub module, but not include data reading, or loss and optimizers. For transfer learning, these are best left to the consumer of the module. Some refactoring may be required.
Supplying the module.variable_map is essential: it translates the plain variable names created by running build_model_body() by itself into the variable names created by instantiating the Module, which live in the scope module/state.
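For completeness, a hedged sketch of consuming the exported directory afterwards (the input shape here is a made-up example, not something implied by the answer):

import tensorflow as tf
import tensorflow_hub as hub

with tf.Graph().as_default():
    module = hub.Module(export_path)  # the directory written by module.export() above
    inputs = tf.placeholder(tf.float32, [None, 224, 224, 3])  # hypothetical input shape
    outputs = module(inputs)
    with tf.Session() as session:
        session.run(tf.global_variables_initializer())
        session.run(tf.tables_initializer())
        # session.run(outputs, feed_dict={inputs: ...})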
I am an absolute beginner to TensorFlow.
If I have a picture (or set of pictures) that I would like to attempt to classify using the code from the Cifar10 TensorFlow tutorial, how would I do so?
I have absolutely no idea where to start.
Train the model using the base CIFAR10 dataset exactly as per the tutorial.
Create a new graph with your own inputs - probably easiest to just use a tf.placeholder and feed the data as per below, but there are lots of other ways.
Start a session, load previously saved weights.
Run the session (with a feed_dict if you're using a placeholder as above).
import tensorflow as tf
import numpy as np
import cifar10  # the cifar10 module from the TensorFlow models tutorial

train_dir = '/tmp/cifar10_train'  # or use FLAGS as in the train example
batch_size = 8
height = 32
width = 32

image = tf.placeholder(shape=(batch_size, height, width, 3), dtype=tf.uint8)
std_img = tf.image.per_image_standardization(image)
logits = cifar10.inference(std_img)
predictions = tf.argmax(logits, axis=-1)

def get_image_data_batches():
    n_batches = 100
    for i in range(n_batches):
        yield (np.random.uniform(size=(batch_size, height, width, 3)) * 255).astype(np.uint8)

def do_stuff_with(logit_vals, prediction_vals):
    pass

with tf.Session() as sess:
    # restore variables
    saver = tf.train.Saver()
    saver.restore(sess, tf.train.latest_checkpoint(train_dir))
    # run inference
    for batch_data in get_image_data_batches():
        logit_vals, prediction_vals = sess.run([logits, predictions], feed_dict={image: batch_data})
        do_stuff_with(logit_vals, prediction_vals)
There are better ways of getting data into the graph (see tf.data.Dataset), but I believe tf.placeholders are the easiest way for learning and getting something up and running initially.
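If you do want the tf.data route later, a minimal sketch might look like this (all_images is a stand-in for however you load your own pictures; nothing here is from the tutorial):

import numpy as np
import tensorflow as tf

all_images = np.random.uniform(size=(100, 32, 32, 3)).astype(np.uint8)  # stand-in data
dataset = tf.data.Dataset.from_tensor_slices(all_images).batch(batch_size)
image_batch = dataset.make_one_shot_iterator().get_next()
std_batch = tf.map_fn(tf.image.per_image_standardization,
                      tf.cast(image_batch, tf.float32))
# std_batch can then be fed to cifar10.inference() instead of the placeholder above.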
Also check out tf.estimator.Estimators for a cleaner way of managing sessions. It's very different from the way it's done in this tutorial and slightly less flexible, but for standard networks they save you writing a lot of boilerplate code.
I am training neural nets with TensorFlow, and the model's training is working using a custom implementation of batch gradient descent. I have a logging function which records validation error, and it gets down to about 2.6%. I'm saving the model every 10 epochs using a tf.train.Saver.
However, when I load the variables into memory again using a tf.train.Saver with the same script, the model performs poorly, roughly as it does when the weights are randomly initialized. I have inspected the convolutional filters in the checkpoint, however, and they don't seem to be random.
I have not included all of my code, since it's around 400 lines long, but I've included what seem to be the important sections here and summarized the other functionality.
class ModelTrainer:
    def __init__(self, ...hyperparameters...):
        # Initialize datasets and hyperparameters
        for each gpu:
            # Create the loss function and gradients assigned to this gpu using tf.device("/gpu:n")
        with tf.device("/cpu:0"):
            # Average and clip the gradients from the gpus
            # Create this batch gradient descent operation for each trainable variable
            variable.assign_sub(learning_rate * averaged_and_clipped_gradient).op

    def train(self, ...hyperparameters...):
        saver = tf.train.Saver(tf.all_variables(), max_to_keep=30)
        init = tf.initialize_all_variables()
        sess = tf.Session()
        if starting_point is not None:  # Used to evaluate existing models
            saver.restore(sess, starting_point)
        else:
            sess.run(init)
        for i in range(number_of_batches):
            # ... Get training batch ...
            gradients = sess.run(calculate_gradients, feed_dict=training_batch)
            # Average the "gradients" variable across multiple batches
            # (must be done because of GPU memory limitations)
            if i % meta_batch_size == 0:
                sess.run(apply_gradients_operators,
                         feed_dict=gradients_that_have_been_averaged_across_multiple_batches)
            # Log validation error
            if i % save_after_n_batches == 0:
                saver.save(sess, "some-filename", global_step=self.iter_num)
As expected, running these two functions creates a set of checkpoint files called "some-filename-40001" or whatever other iteration number the training is at when that file is saved. Unfortunately, when I load these checkpoints back in using the starting_point parameter, they perform on par with random initialization.
Initially I assumed it was something to do with the way I'm training the model, since I haven't found anyone else with this issue, but the validation error behaves as expected.
Edit: More odd results. After more experimentation, I have found that when I load the saved model using the code:
with tf.Session() as sess:
    saver = tf.train.import_meta_graph("saved-checkpoint-40.meta")
    saver.restore(sess, "saved-checkpoint-40")
    # ... Use model in some way ...
I get different, but still incorrect results.
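One way to narrow this down (a debugging sketch, not something from the question) is to dump the tensors stored in the checkpoint and compare them with what sess.run() reports for the corresponding variables after restore():

from tensorflow.python.tools import inspect_checkpoint as chkp

# Prints the name and value of every tensor saved in the checkpoint file, so the
# restored in-session values can be checked against what was actually saved.
chkp.print_tensors_in_checkpoint_file("saved-checkpoint-40",
                                      tensor_name="", all_tensors=True)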