Not able to run fully_connected_feed.py in Tensorflow - python

I am following the TensorFlow Mechanics 101 tutorial (version 0.7.0). As the document instructs, I downloaded the two files (mnist.py and fully_connected_feed.py) and saved them to the same directory on my local machine.
When I run the following command:
$ python /FULL_PATH_TO_fully_connected_feed.py/fully_connected_feed.py
...I get this error: OSError: [Errno 2] No such file or directory: ''. The full output and stack trace are below:
...
...
Step 800: loss = 0.56 (0.005 sec)
Step 900: loss = 0.51 (0.004 sec)
Traceback (most recent call last):
File "./fully_connected_feed.py", line 228, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_app.py", line 30, in run
sys.exit(main(sys.argv))
File "./fully_connected_feed.py", line 224, in main
run_training()
File "./fully_connected_feed.py", line 199, in run_training
saver.save(sess, FLAGS.train_dir, global_step=step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 970, in save
self.export_meta_graph(meta_graph_file_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 990, in export_meta_graph
as_text=as_text)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1315, in export_meta_graph
os.path.basename(filename), as_text=as_text)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py", line 70, in write_graph
gfile.MakeDirs(logdir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_gfile.py", line 295, in MakeDirs
os.makedirs(path, mode)
File "/usr/lib/python2.7/os.py", line 160, in makedirs
mkdir(name, mode)
OSError: [Errno 2] No such file or directory: ''

This is a bug in the 0.7.0 release of TensorFlow, which was fixed in a recent commit and will appear in a bugfix release shortly. The error occurs when the value of the --train_dir flag doesn't contain a directory component.
In the meantime, you can avoid this issue by passing the flag --train_dir=./ when you run the example.
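To illustrate why a path without a directory component triggers this error, here is a minimal sketch using only the standard library. The helper name ensure_parent_dir and the file name data-999.meta are hypothetical, chosen for illustration; this is not TensorFlow's actual fix, just the general pattern of falling back to the current directory.

```python
import os


def ensure_parent_dir(path):
    """Create the parent directory of `path`; a bare filename maps to '.'."""
    # os.path.dirname returns "" for a bare filename, and os.makedirs("")
    # raises OSError (errno 2), which matches the error in the traceback.
    parent = os.path.dirname(path) or "."
    os.makedirs(parent, exist_ok=True)
    return parent


# Hypothetical checkpoint file name with no directory component.
print(repr(os.path.dirname("data-999.meta")))
print(ensure_parent_dir("data-999.meta"))
```

Passing --train_dir=./ has a similar effect: the path handed to the directory-creation call is no longer an empty string.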

This should be a comment on mrry's post, but I don't have enough reputation.
Changing line 42 of fully_connected_feed.py to
flags.DEFINE_string('train_dir', './data', 'Directory to put the training data.')
solved the problem for me. I'm also on 0.7.0 and was able to run all other mnist examples.

Related

Does checkpointing with torch.save fail with hugging face -- if not what is the right way to checkpoint and load a hugging face (HF) model?

Does torch.save work on hugging face models (I am using vit)? I assumed yes.
My error:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 499, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 116] Stale file handle
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1815, in <module>
main()
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1748, in main
train(args=args)
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1795, in train
meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 213, in meta_train_iterations_ala_l2l
log_train_val_stats(args, args.it, step_name, train_loss, train_acc, training=True)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 55, in log_train_val_stats
_log_train_val_stats(args=args,
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 113, in _log_train_val_stats
save_for_supervised_learning(args, ckpt_filename='ckpt.pt')
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/checkpointing_uu/supervised_learning.py", line 54, in save_for_supervised_learning
torch.save({'training_mode': args.training_mode,
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 380, in save
return
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 259, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 2736460544 vs 2736460432
My code:
# - ckpt
args_pickable: Namespace = uutils.make_args_pickable(args)
# note not saving any objects, to make sure checkpoint is loadable later with no problems
torch.save({'training_mode': args.training_mode,
'it': args.it,
'epoch_num': args.epoch_num,
# 'args': args_pickable, # some versions of this might not have args!
# decided only to save the dict version to avoid this ckpt not working, make it args when loading
'args_dict': vars(args_pickable), # some versions of this might not have args!
'model_state_dict': get_model_from_ddp(args.model).state_dict(),
'model_str': str(args.model), # added later, to make it easier to check what model was used
'model_hps': args.model_hps,
'model_option': args.model_option,
'opt_state_dict': args.opt.state_dict(),
'opt_str': str(args.opt),
'opt_hps': args.opt_hps,
'opt_option': args.opt_option,
'scheduler_str': str(args.scheduler),
'scheduler_state_dict': try_to_get_scheduler_state_dict(args.scheduler),
'scheduler_hps': args.scheduler_hps,
'scheduler_option': args.scheduler_option,
},
pickle_module=pickle,
f=args.log_root / ckpt_filename)
If this is not the right way to checkpoint Hugging Face (HF) models, what is?
Cross-posted on the HF discussion forum: https://discuss.huggingface.co/t/torch-save-with-hugging-face-models-fails/25034

Jupyterlab: Cannot open the cloned notebook (File Load Error for <filename.ipynb> Unhandled error)

Description
When I tried to open the cloned notebook, I got the file load error quoted in the title.
When I looked at my command line, this error showed up:
Traceback (most recent call last):
File "C:\Users\msi\Anaconda3\lib\site-packages\tornado\web.py", line 1704, in _execute
result = await result
File "C:\Users\msi\Anaconda3\lib\site-packages\jupyter_server\services\contents\handlers.py", line 248, in post
checkpoint = await ensure_async(cm.create_checkpoint(path))
File "C:\Users\msi\Anaconda3\lib\site-packages\jupyter_server\services\contents\manager.py", line 520, in create_checkpoint
return self.checkpoints.create_checkpoint(self, path)
File "C:\Users\msi\Anaconda3\lib\site-packages\jupyter_server\services\contents\filecheckpoints.py", line 59, in create_checkpoint
self._copy(src_path, dest_path)
File "C:\Users\msi\Anaconda3\lib\site-packages\jupyter_server\services\contents\fileio.py", line 245, in _copy
copy2_safe(src, dest, log=self.log)
File "C:\Users\msi\Anaconda3\lib\site-packages\jupyter_server\services\contents\fileio.py", line 47, in copy2_safe
shutil.copyfile(src, dst)
File "C:\Users\msi\Anaconda3\lib\shutil.py", line 261, in copyfile
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\msi\\My study\\AIT Coursework\\AT82.01 Computer Programming for Data Science and Artificial Intelligence\\Python-for-DS-AI\\Lectures\\02-MLScratch\\03-Naive Bayesian\\.ipynb_checkpoints\\01 - Supervised Learning - Classification - Naive Bayesian - Gaussian-checkpoint.ipynb'
[W 2021-08-23 17:00:07.499 ServerApp] Unhandled error
It worked perfectly when I imported it into Colab, so the notebook file is not corrupted.
I still don't understand what the problem is; a missing checkpoint file shouldn't cause an error in the first place.
Does anyone have a clue about what happened?
Never mind, I fixed it. The cause was that my file name was too long.
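For anyone hitting the same error, here is a rough sketch for checking whether the checkpoint path Jupyter tries to write gets too long. It assumes the limit involved is the classic Windows MAX_PATH of 260 characters, and it rebuilds the checkpoint path from the pattern visible in the traceback (.ipynb_checkpoints plus a -checkpoint suffix). The function names are made up for illustration.

```python
import ntpath

# Classic Windows path limit (MAX_PATH); it counts the terminating NUL.
MAX_PATH = 260


def checkpoint_path(notebook_path):
    """Build the checkpoint path using the pattern seen in the traceback."""
    directory, filename = ntpath.split(notebook_path)
    stem, ext = ntpath.splitext(filename)
    return ntpath.join(directory, ".ipynb_checkpoints", stem + "-checkpoint" + ext)


def exceeds_max_path(path):
    """Return True if `path` does not fit within the classic MAX_PATH limit."""
    return len(path) + 1 > MAX_PATH


# Hypothetical short path, for illustration only.
ckpt = checkpoint_path("C:\\nb\\demo.ipynb")
print(ckpt, exceeds_max_path(ckpt))
```

Note that the checkpoint path is longer than the notebook path itself, so a notebook can be just short enough to exist on disk while its checkpoint copy cannot be created.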

Error while training tensorflow object detection about checkpoint error

I have a problem with the training step in TensorFlow.
Specs:
tensorflow-gpu= 2.2.0
python= 3.7.9
cuda= 10.1
cudnn= 7.6.- (I don't remember the exact version, but it is compatible with this CUDA version).
models= ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8 and efficientdet_d7_coco17_tpu-32
reference: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
When I start training, it gives this error:
Traceback (most recent call last):
File "model_main_tf2.py", line 113, in <module>
tf.compat.v1.app.run()
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\absl\app.py", line 300, in run
_run_main(main, args)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "model_main_tf2.py", line 110, in main
record_summaries=FLAGS.record_summaries)
File "C:\TensorFlow\models\research\object_detection\model_lib_v2.py", line 578, in train_loop
ckpt, manager_dir, max_to_keep=checkpoint_max_to_keep)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 635, in __init__
recovered_state = get_checkpoint_state(directory)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 279, in get_checkpoint_state
coord_checkpoint_filename)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 320, in read_file_to_string
return f.read()
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 116, in read
self._preread_check()
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 79, in _preread_check
self.__name, 1024 * 512)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 122: invalid start byte
My training command (I checked that these paths are correct):
python model_main_tf2.py --logtostderr --model_dir=pre-trained-models/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8 --pipeline_config_path=pre-trained-models/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8/pipeline.config
The checkpoint path and the pipeline path are all correct. I tried training with two different models and still can't solve it. How can I solve this problem?
I researched utf-8 errors but couldn't find a solution. Thank you for helping. :))
I ran into the same issue and struggled with it overnight.
I have the same environment as yours, and I tried the mask_rcnn model.
In a nutshell, it can be fixed by changing the output_dir.
Before:
----mask_rcnn_model (output_dir)
--------checkpoint (folder)
--------saved_model (folder)
--------my_pipeline.config
After:
----output_dir
----mask_rcnn_model
--------checkpoint (folder)
--------saved_model (folder)
--------my_pipeline.config
Hint from: https://github.com/tensorflow/models/issues/8892
It worked in my situation.
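A possible explanation, offered as an assumption rather than a confirmed diagnosis: the traceback shows get_checkpoint_state reading a file inside the directory passed as --model_dir, while the pre-trained model folder contains a folder named checkpoint. If TensorFlow expects a plain state file with that name, pointing --model_dir at the pre-trained folder would make it read the wrong thing. The sketch below only inspects the directory layout; the helper name is hypothetical and it does not call any TensorFlow API.

```python
import os
import tempfile


def checkpoint_entry_kind(model_dir):
    """Report whether `model_dir/checkpoint` is a file, a directory, or missing."""
    entry = os.path.join(model_dir, "checkpoint")
    if os.path.isdir(entry):
        return "directory"
    if os.path.isfile(entry):
        return "file"
    return "missing"


# Illustration with throwaway directories mimicking the two layouts above.
pretrained = tempfile.mkdtemp()
os.makedirs(os.path.join(pretrained, "checkpoint"))  # like the downloaded model
fresh_output = tempfile.mkdtemp()                     # separate, empty output_dir

print(checkpoint_entry_kind(pretrained))
print(checkpoint_entry_kind(fresh_output))
```

If the check reports "directory" for the folder you pass as --model_dir, using a separate, empty output directory as described above is the safer choice.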

best_local_affine_kernel.cu [WinError 126] The specified module could not be found on Windows 10

I am running the FastPhotoStyle code on Windows 10 with Python 3.7, CUDA 10.0 and cuda 9.1. Although I made the suggested change for newer Python versions (passing bytes instead of strings), I am still getting the same error. Can you please suggest a fix for this issue?
Resize image: (803,538)->(803,538)
Resize image: (960,540)->(960,540)
Elapsed time in stylization: 2.325060
Elapsed time in propagation: 83.987388
Elapsed time in post processing: 0.015629
Traceback (most recent call last):
File "demo.py", line 47, in <module>
no_post=args.no_post
File "D:\TrainImages\FastPhotoStyle-master\process_stylization.py", line 135, in stylization
out_img = smooth_filter(out_img, cont_pilimg, f_radius=15, f_edge=1e-1)
File "D:\TrainImages\FastPhotoStyle-master\smooth_filter.py", line 402, in smooth_filter
best_ = smooth_local_affine(output_, input_, 1e-7, 3, H, W, f_radius, f_edge)
File "D:\TrainImages\FastPhotoStyle-master\smooth_filter.py", line 333, in smooth_local_affine
program = Program(src.encode('utf-8'), 'best_local_affine_kernel.cu'.encode('utf-8'))
File "C:\Users\SD\Anaconda3\lib\site-packages\pynvrtc\compiler.py", line 49, in __init__
self._interface = NVRTCInterface(lib_name)
File "C:\Users\SD\Anaconda3\lib\site-packages\pynvrtc\interface.py", line 87, in __init__
self._load_nvrtc_lib(lib_path)
File "C:\Users\SD\Anaconda3\lib\site-packages\pynvrtc\interface.py", line 109, in _load_nvrtc_lib
self.lib = cdll.LoadLibrary(name)
File "C:\Users\SD\Anaconda3\lib\ctypes\__init__.py", line 434, in LoadLibrary
return self.dlltype(name)
File "C:\Users\SD\Anaconda3\lib\ctypes\__init__.py", line 356, in __init__
self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found
I have already changed the strings to bytes:
program = Program(src.encode('utf-8'), 'best_local_affine_kernel.cu'.encode('utf-8'))
ptx = program.compile(['-I/usr/local/cuda/include'.encode('utf-8')])
Please check the documentation here: https://github.com/NVIDIA/FastPhotoStyle/blob/master/TUTORIAL.md
The link describes the setup on Ubuntu, but it also lists prerequisite Python modules that you should have installed on your machine.

What train_dir to use for Tensorflow imagenet_train to train from scratch?

I am following the page below:
https://github.com/tensorflow/models/tree/master/inception
I got to the point where I have to run:
bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data
However, I got the error below:
Traceback (most recent call last):
File "/home/demo/anaconda3/envs/tensorflow/models/inception/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 41, in <module>
tf.app.run()
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/demo/anaconda3/envs/tensorflow/models/inception/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 35, in main
tf.gfile.DeleteRecursively(FLAGS.train_dir)
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 420, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: /tmp/imagenet_train
My DATA_DIR is /tmp/imagenet_data, from the previous step: bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}"
But what should my train_dir be? The doc doesn't mention it. It looks like an empty folder is incorrect.
For me, it works if I set the path of --train_dir=/tmp. Also, you have the processed dataset in the same directory. The --train_dir and --data_dir should not coincide with each other.
Location of where to place the ImageNet data DATA_DIR=$HOME/imagenet-data
Can you tell me if you are still running into problems after changing the directory?
--train_dir is the path to an empty directory where the model checkpoints and events files are stored as the model is trained.
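As a rough sketch of the advice above, the helper below prepares a dedicated, empty training directory and refuses to reuse the data directory. The function name and the example paths are hypothetical. Since the traceback shows the script calling DeleteRecursively on train_dir, a directory reserved for training output seems the safer choice.

```python
import os
import tempfile


def prepare_train_dir(train_dir, data_dir):
    """Create `train_dir` if needed; require it to be empty and not `data_dir`."""
    if os.path.abspath(train_dir) == os.path.abspath(data_dir):
        raise ValueError("train_dir and data_dir must be different directories")
    os.makedirs(train_dir, exist_ok=True)
    if os.listdir(train_dir):
        raise ValueError("train_dir is not empty: %s" % train_dir)
    return train_dir


# Illustration under a throwaway base directory (paths are made up).
base = tempfile.mkdtemp()
data_dir = os.path.join(base, "imagenet_data")
os.makedirs(data_dir)
train_dir = prepare_train_dir(os.path.join(base, "imagenet_train"), data_dir)
print(train_dir)
```

The returned path can then be passed as --train_dir, with --data_dir pointing at the preprocessed dataset.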
