Training tensorflow im2txt fails with truncated record at

Training tensorflow im2txt fails with truncated record at - python

im2txt trains for a few thousand steps then halts with the following error.
I've checked the training files and they appear OK.
Running on Ubuntu 16.04, TF r.0.11, GPU mode GTX 970 4Gb.
Not sure if it is lack of RAM?
INFO:tensorflow:global step 56396: loss = 2.4654 (0.41 sec/step)
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors.DataLossError'>, truncated record at 369740238
[[Node: ReaderRead = ReaderRead[_class=["loc:#TFRecordReader", "loc:#filename_queue"], _device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReader, filename_queue)]]
Caused by op u'ReaderRead', defined at:
File "/home/john/Developer/tensorflow/tensorflow/models/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 114, in <module>
tf.app.run()
File "/home/john/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "/home/john/Developer/tensorflow/tensorflow/models/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/train.py", line 65, in main
model.build()
File "/home/john/Developer/tensorflow/tensorflow/models/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 352, in build
self.build_inputs()
File "/home/john/Developer/tensorflow/tensorflow/models/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/show_and_tell_model.py", line 153, in build_inputs
num_reader_threads=self.config.num_input_reader_threads)
File "/home/john/Developer/tensorflow/tensorflow/models/im2txt/bazel-bin/im2txt/train.runfiles/im2txt/im2txt/ops/inputs.py", line 115, in prefetch_input_data
_, value = reader.read(filename_queue)
File "/home/john/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 277, in read
return gen_io_ops._reader_read(self._reader_ref, queue_ref, name=name)
File "/home/john/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 211, in _reader_read
queue_handle=queue_handle, name=name)
File "/home/john/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 748, in apply_op
op_def=op_def)
File "/home/john/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2403, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/john/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1305, in __init__
self._traceback = _extract_stack()
DataLossError (see above for traceback): truncated record at 369740238
[[Node: ReaderRead = ReaderRead[_class=["loc:#TFRecordReader", "loc:#filename_queue"], _device="/job:localhost/replica:0/task:0/cpu:0"](TFRecordReader, filename_queue)]]
INFO:tensorflow:global step 56397: loss = 2.5540 (0.40 sec/step)

I have the same problem, not sure why. I did not see any error when creating tfrecords. During training, the error comes out near the end of the records. BTW I am using tf 0.11rc

Related

Does checkpointing with torch.save fail with hugging face -- if not what is the right way to checkpoint and load a hugging face (HF) model?

Does torch.save work on hugging face models (I am using vit)? I assumed yes.
My error:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 499, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 116] Stale file handle
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1815, in <module>
main()
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1748, in main
train(args=args)
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1795, in train
meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 213, in meta_train_iterations_ala_l2l
log_train_val_stats(args, args.it, step_name, train_loss, train_acc, training=True)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 55, in log_train_val_stats
_log_train_val_stats(args=args,
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 113, in _log_train_val_stats
save_for_supervised_learning(args, ckpt_filename='ckpt.pt')
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/checkpointing_uu/supervised_learning.py", line 54, in save_for_supervised_learning
torch.save({'training_mode': args.training_mode,
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 380, in save
return
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 259, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 2736460544 vs 2736460432
my code:
# - ckpt
args_pickable: Namespace = uutils.make_args_pickable(args)
# note not saving any objects, to make sure checkpoint is loadable later with no problems
torch.save({'training_mode': args.training_mode,
'it': args.it,
'epoch_num': args.epoch_num,
# 'args': args_pickable, # some versions of this might not have args!
# decided only to save the dict version to avoid this ckpt not working, make it args when loading
'args_dict': vars(args_pickable), # some versions of this might not have args!
'model_state_dict': get_model_from_ddp(args.model).state_dict(),
'model_str': str(args.model), # added later, to make it easier to check what optimizer was used
'model_hps': args.model_hps,
'model_option': args.model_option,
'opt_state_dict': args.opt.state_dict(),
'opt_str': str(args.opt),
'opt_hps': args.opt_hps,
'opt_option': args.opt_option,
'scheduler_str': str(args.scheduler),
'scheduler_state_dict': try_to_get_scheduler_state_dict(args.scheduler),
'scheduler_hps': args.scheduler_hps,
'scheduler_option': args.scheduler_option,
},
pickle_module=pickle,
f=args.log_root / ckpt_filename)
if this is not the right way to checkpoint hugging face (HF) models, what is?
cross: hf discussion forum: https://discuss.huggingface.co/t/torch-save-with-hugging-face-models-fails/25034

How to use TPUEstimator.export_saved_model with Tensorflow 1.12?

export_saved_model used on TPUEstimator raises TypeError: Failed to convert object of type to Tensor with Tensorflow 1.12.0. Am I using it incorrectly or if it is a bug is there some workaround?
I would like to train a model on TPU using TPUEstimator and then use the trained model locally on CPU. I cannot use the graph saved during training directly, but I need to use export_saved_model instead (Github issue).
export_saved_model on TPUEstimator works correctly with Tensorflow 1.13.0rc0, however it fails with current Tensorflow 1.12.0 (another Github issue). At the moment, however, TPUs with Tensorflow 1.13 are not available at Google Cloud and TPUs with Tensorflow 1.12 are not compatible, so upgrading Tensorflow to 1.13 is not an option.
The relevant code is:
def serving_input_receiver_fn():
feature = tf.placeholder(tf.float32, shape=[None, None, None, 2])
return tf.estimator.export.TensorServingInputReceiver(feature, feature)
estimator.export_saved_model(FLAGS.export_dir, serving_input_receiver_fn)
Expected result.
The model should be exported correctly. This happens with Tensorflow 1.13.0rc0 or with TPUEstimator replaced with Estimator. The former can be reproduced using this colab).
Actual result.
Exporting fails with TypeError: Failed to convert object of type and the traceback included below. This can be reproduced with this colab.
...
WARNING:tensorflow:From /Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py:1044: calling SavedModelBuilder.add_meta_graph_and_variables (from tensorflow.python.saved_model.builder_impl) with legacy_init_op is deprecated and will be removed in a future version.
Instructions for updating:
Pass your op to the equivalent parameter main_op instead.
INFO:tensorflow:Assets added to graph.
INFO:tensorflow:No assets to write.
WARNING:tensorflow:rewrite_for_inference (from tensorflow.contrib.tpu.python.tpu.tpu) is experimental and may change or be removed at any time, and without warning.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Running infer on CPU
ERROR:tensorflow:Operation of type Placeholder (policy_labels) is not supported on the TPU. Execution will fail if this op is used in the graph.
ERROR:tensorflow:Operation of type Placeholder (sat_labels) is not supported on the TPU. Execution will fail if this op is used in the graph.
INFO:tensorflow:Done calling model_fn.
Traceback (most recent call last):
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 527, in make_tensor_proto
str_values = [compat.as_bytes(x) for x in proto_values]
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 527, in <listcomp>
str_values = [compat.as_bytes(x) for x in proto_values]
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/util/compat.py", line 61, in as_bytes
(bytes_or_text,))
TypeError: Expected binary or unicode string, got dict_values([<tf.Tensor 'sat_prob:0' shape=(?,) dtype=float32>, <tf.Tensor 'policy_prob:0' shape=(?, ?, 2) dtype=float32>])
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "neurosat_tpu.py", line 253, in <module>
tf.app.run()
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "neurosat_tpu.py", line 248, in main
estimator.export_saved_model(FLAGS.export_dir, serving_input_receiver_fn)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 734, in export_saved_model
strip_default_attrs=True)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 663, in export_savedmodel
mode=model_fn_lib.ModeKeys.PREDICT)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 789, in _export_saved_model_for_mode
strip_default_attrs=strip_default_attrs)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 907, in _export_all_saved_models
mode=model_fn_lib.ModeKeys.PREDICT)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2188, in _add_meta_graph_for_mode
check_variables=False))
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/estimator/estimator.py", line 984, in _add_meta_graph_for_mode
config=self.config)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2192, in _call_model_fn
return self._call_model_fn_for_inference(features, labels, mode, config)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2253, in _call_model_fn_for_inference
new_tensors.append(array_ops.identity(t))
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 81, in identity
return gen_array_ops.identity(input, name=name)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3454, in identity
"Identity", input=input, name=name)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 513, in _apply_op_helper
raise err
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 510, in _apply_op_helper
preferred_dtype=default_dtype)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1146, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 229, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/constant_op.py", line 208, in constant
value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/Users/michal/.virtualenvs/deepsat/lib/python3.6/site-packages/tensorflow/python/framework/tensor_util.py", line 531, in make_tensor_proto
"supported type." % (type(values), values))
TypeError: Failed to convert object of type <class 'dict_values'> to Tensor. Contents: dict_values([<tf.Tensor 'sat_prob:0' shape=(?,) dtype=float32>, <tf.Tensor 'policy_prob:0' shape=(?, ?, 2) dtype=float32>]). Consider casting elements to a supported type.

Adding argument export_to_tpu=False to TPUEstimator constructor prevents the error in Tensorflow 1.12:
estimator = tf.contrib.tpu.TPUEstimator(..., export_to_tpu=False)
export_to_tpu=False disables exporting the TPU version of the model, but CPU version is still exported and this is sufficient to run the model locally. With Tensorflow 1.13 the bug is fixed and the flag is not necessary.
The answer is based on the Github thread linked in the question.

Can't restore pre-trained network with Tensorflow

I'm stuck with restoring pre-trained network with Tensorflow....
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
sess=tf.Session()
saver = tf.train.import_meta_graph('./model/20170512-110547/model-20170512-110547.meta')
saver.restore(sess,'./model/20170512-110547/')
I'd like to use pre-trained network which was trained for face recognition, and then wanna add some layers for transfer learning.
(I downloaded the model from here. https://github.com/davidsandberg/facenet)
When I execute the code above, it shows the error,
WARNING:tensorflow:The saved meta_graph is possibly from an older release:
'model_variables' collection should be of type 'byte_list', but instead is of type 'node_list'.
Traceback (most recent call last):
File "/Users/user/Desktop/desktop/Python/HCR/Transfer_face/test.py", line 7, in <module>
saver.restore(sess,'./model/20170512-110547/')
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1560, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./model/20170512-110547/
[[Node: save/RestoreV2_491 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_491/tensor_names, save/RestoreV2_491/shape_and_slices)]]
Caused by op u'save/RestoreV2_491', defined at:
File "/Users/user/Desktop/desktop/Python/HCR/Transfer_face/test.py", line 6, in <module>
saver = tf.train.import_meta_graph('./model/20170512-110547/model-20170512-110547.meta')
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1698, in import_meta_graph
**kwargs)
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/meta_graph.py", line 656, in import_scoped_meta_graph
producer_op_list=producer_op_list)
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 313, in import_graph_def
op_def=op_def)
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/Users/user/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
NotFoundError (see above for traceback): Unsuccessful TensorSliceReader constructor: Failed to find any matching files for ./model/20170512-110547/
[[Node: save/RestoreV2_491 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_arg_save/Const_0_0, save/RestoreV2_491/tensor_names, save/RestoreV2_491/shape_and_slices)]]
I can't understand why the system can't find pre-trained data...
And the directory structure is as below
USER-no-MacBook-Pro:Transfer_face user$ ls -R
model test.py
./model:
20170512-110547
./model/20170512-110547:
20170512-110547.pb
model-20170512-110547.ckpt-250000.index
model-20170512-110547.ckpt-250000.data-00000-of-00001
model-20170512-110547.meta

Import the .pb file.
import tensorflow as tf
from tensorflow.python.framework import tensor_util
with tf.gfile.GFile('20170512-110547.pb', "rb") as f:
graph_def = tf.GraphDef()
graph_def.ParseFromString(f.read())
#import into default graph
tf.import_graph_def(graph_def)
#print some data
wts = [n for n in graph_def.node if n.op == 'Const']
for n in wts:
print(tensor_util.MakeNdarray(n.attr['value'].tensor))
Linked questions:
Import a simple Tensorflow frozen_model.pb file and make prediction in C++
get the value weights from .pb file by Tensorflow
Related documentation: GraphDef

You need use the ckpt path "./model/20170512-110547/model-20170512-110547.ckpt-250000" instead of the folder path.

Running tensorflow with file on HDFS (cannot find libhdfs.so)

I am running into an error "libhdfs.so: cannot open shared object file: No such file or directory" (stack trace below) while trying to run a python script invoking a Tensorflow reader on a file stored in HDFS. I am running the script on a node on the cluster which has Tensorflow in a virtualenv, activated at the time of execution. I set the following environment variables before execution:
export HADOOP_HDFS_HOME=$HADOOP_HDFS_HOME:/opt/cloudera/parcels/CDH
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/cloudera/parcels/CDH/lib/libhdfs.so
export
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:${JAVA_HOME}/jre/lib/amd64/server
I execute the script as such:
CLASSPATH=$($LD_LIBRARY_PATH} classpath --glob) python TEST.py
This is the code in the script:
filename_queue = tf.train.string_input_producer([
"hdfs://hostname:port/user/hdfs/test.avro" ])
reader =
tf.WholeFileReader() key, value = reader.read(filename_queue)
with tf.Session() as sess:
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(coord=coord)
sess.run([key,value])
coord.request_stop()
coord.join(threads)
Below is the stack trace of the error. Any ideas on what is causing this error is appreciated (I have already checked that the LD_LIBRARY_PATH variable has an explicit pointer to the libhdfs.so file prior to execution, unable to figure out why it still cannot find the file).
Traceback (most recent call last):
File "TEST.py", line 25, in <module>
sess.run([key,value])
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 767, in run
run_metadata_ptr)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 965, in _run
feed_dict_string, options, run_metadata)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1015, in _do_run
target_list, options, run_metadata)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1035, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: libhdfs.so: cannot open shared object file: No such file or directory
[[Node: ReaderReadV2 = ReaderReadV2[_device="/job:localhost/replica:0/task:0/cpu:0"](WholeFileReaderV2, input_producer)]]
Caused by op u'ReaderReadV2', defined at:
File "TEST.py", line 19, in <module>
key, value = reader.read(filename_queue)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/io_ops.py", line 272, in read
return gen_io_ops._reader_read_v2(self._reader_ref, queue_ref, name=name)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/ops/gen_io_ops.py", line 410, in _reader_read_v2
queue_handle=queue_handle, name=name)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 763, in apply_op
op_def=op_def)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2395, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/username/tensorflow/local/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1264, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): libhdfs.so: cannot open shared object file: No such file or directory

I also encountered this issue the solution for me was copying this file to:
$HADOOP_HDFS_HOME/lib/native
If you don't know the location of this file execute the following command to find it's location:
sudo updatedb
locate libhdfs.so
This will give you the location of the file. Next copy the file to $HADOOP_HDFS_HOME/lib/native:
cp locationOflibhdfs.so $HADOOP_HDFS_HOME/lib/native
Note: Replace locationOflibhdfs.so with the location of the libhdfs.so file.

Not able to run fully_connected_feed.py in Tensorflow

I am following the tutorial of TensorFlow Mechanics 101 (version 0.7.0). As per the document, I download the two files (mnist.py and fully_connected_feed.py) and save them to the same directory on my local machine.
When I run the following command:
$ python /FULL_PATH_TO_fully_connected_feed.py/fully_connected_feed.py
...I get this error: OSError: [Errno 2] No such file or directory: ''. The full output and stack trace are below:
...
...
Step 800: loss = 0.56 (0.005 sec)
Step 900: loss = 0.51 (0.004 sec)
Traceback (most recent call last):
File "./fully_connected_feed.py", line 228, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_app.py", line 30, in run
sys.exit(main(sys.argv))
File "./fully_connected_feed.py", line 224, in main
run_training()
File "./fully_connected_feed.py", line 199, in run_training
saver.save(sess, FLAGS.train_dir, global_step=step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 970, in save
self.export_meta_graph(meta_graph_file_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 990, in export_meta_graph
as_text=as_text)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1315, in export_meta_graph
os.path.basename(filename), as_text=as_text)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py", line 70, in write_graph
gfile.MakeDirs(logdir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_gfile.py", line 295, in MakeDirs
os.makedirs(path, mode)
File "/usr/lib/python2.7/os.py", line 160, in makedirs
mkdir(name, mode)
OSError: [Errno 2] No such file or directory: ''

This is a bug in the 0.7.0 release of TensorFlow, which was fixed in a recent commit and will appear in a bugfix release shortly. The issue is caused when the --train_dir flag doesn't contain a directory name component.
In the meantime, you can avoid this issue by passing the flag --train_dir=./ when you run the example.

This should be a comment to mrry's post (I'm missing reputation)
Changing line #42 from fully_connected_feed.py to
flags.DEFINE_string('train_dir', './data', 'Directory to put the training data.')
solved the problem for me. I'm also on 0.7.0 and was able to run all other mnist examples.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Training tensorflow im2txt fails with truncated record at - python

I have the same problem, not sure why. I did not see any error when creating tfrecords. During training, the error comes out near the end of the records. BTW I am using tf 0.11rc

Related

Does checkpointing with torch.save fail with hugging face -- if not what is the right way to checkpoint and load a hugging face (HF) model?

How to use TPUEstimator.export_saved_model with Tensorflow 1.12?

Can't restore pre-trained network with Tensorflow

Running tensorflow with file on HDFS (cannot find libhdfs.so)

Not able to run fully_connected_feed.py in Tensorflow

Categories

Resources