Checkpoint error while training TensorFlow object detection - python

I have a problem with the training step of the TensorFlow Object Detection API.
Specs:
tensorflow-gpu = 2.2.0
python = 3.7.9
cuda = 10.1
cudnn = 7.6.- (I don't remember the exact version, but it is compatible with this CUDA version)
models = ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8 and efficientdet_d7_coco17_tpu-32
reference: https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html
When I start training, it gives this error:
Traceback (most recent call last):
File "model_main_tf2.py", line 113, in <module>
tf.compat.v1.app.run()
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\absl\app.py", line 300, in run
_run_main(main, args)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\absl\app.py", line 251, in _run_main
sys.exit(main(argv))
File "model_main_tf2.py", line 110, in main
record_summaries=FLAGS.record_summaries)
File "C:\TensorFlow\models\research\object_detection\model_lib_v2.py", line 578, in train_loop
ckpt, manager_dir, max_to_keep=checkpoint_max_to_keep)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 635, in __init__
recovered_state = get_checkpoint_state(directory)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\training\checkpoint_management.py", line 279, in get_checkpoint_state
coord_checkpoint_filename)
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 320, in read_file_to_string
return f.read()
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 116, in read
self._preread_check()
File "C:\Users\Nurullah\.conda\envs\tensorflow\lib\site-packages\tensorflow\python\lib\io\file_io.py", line 79, in _preread_check
self.__name, 1024 * 512)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 122: invalid start byte
My training command (I checked that these paths are correct):
python model_main_tf2.py --logtostderr --model_dir=pre-trained-models/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8 --pipeline_config_path=pre-trained-models/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8/pipeline.config
The checkpoint path and the pipeline path are both correct. I tried training with two different models and could not get either to work. How can I solve this problem?
I researched UTF-8 errors but couldn't find a solution. Thank you for helping. :))

I ran into the same issue and struggled with it overnight.
I have the same environment as yours, and I tried a mask_rcnn model.
In a nutshell, it can be fixed by changing the output_dir to a directory separate from the pre-trained model folder.
Original:
----mask_rcnn_model (output_dir)
--------checkpoint (folder)
--------saved_model (folder)
--------my_pipeline.config
After:
----output_dir
----mask_rcnn_model
--------checkpoint (folder)
--------saved_model (folder)
--------my_pipeline.config
Hint from: https://github.com/tensorflow/models/issues/8892
It worked in my situation.
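A likely explanation for why this helps, going by the traceback in the question: the checkpoint manager reads a state file named checkpoint inside the directory passed as --model_dir, while the downloaded pre-trained model contains a folder with that same name. The sketch below is a small pre-flight check under that assumption. The function name model_dir_conflicts is hypothetical and not part of the Object Detection API.

```python
from pathlib import Path


def model_dir_conflicts(model_dir):
    """Return True if model_dir already holds a 'checkpoint' folder.

    Assumption: training expects 'checkpoint' in model_dir to be a small
    text state file, so a folder with that name suggests model_dir points
    at a downloaded pre-trained model instead of an output directory.
    """
    return (Path(model_dir) / "checkpoint").is_dir()


if __name__ == "__main__":
    # The path from the question; replace it with your own --model_dir value.
    candidate = "pre-trained-models/ssd_resnet101_v1_fpn_1024x1024_coco17_tpu-8"
    if model_dir_conflicts(candidate):
        print("Use a separate, empty output directory for --model_dir.")
    else:
        print("No 'checkpoint' folder found in", candidate)
```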

Related

Tensorflow object detection model_main_tf2.py trying to run tf1 code

I have a simple code snippet that runs model_main_tf2.py, but I am getting an error that module 'tensorflow.compat.v1.io' has no attribute 'GFile'. It seems the program is trying to run the TensorFlow 1.x codebase. I am using TensorFlow 2.8 and every package is up to date. I am using a Colab environment.
%env PYTHONPATH="$/env/python:/content/drive/MyDrive/tfod/Tensorflow/models:/content/drive/MyDrive/tfod/Tensorflow/models/research:/content/drive/MyDrive/tfod/Tensorflow/models/research/slim:/content/drive/MyDrive/tfod/Tensorflow/models/research/object_detection/protos"
TRAINING_SCRIPT = os.path.join(paths['APIMODEL_PATH'], 'research', 'object_detection', 'model_main_tf2.py')
command1 = "python {} --model_dir={} --pipeline_config_path={} --num_train_steps=10".format(TRAINING_SCRIPT, paths['CHECKPOINT_PATH'],files['PIPELINE_CONFIG'])
print(command1)
! {command1}
I have the traceback log as below. Any help is greatly appreciated.
Traceback (most recent call last):
File "/content/drive/MyDrive/tfod/Tensorflow/models/research/object_detection/model_main_tf2.py", line 115, in <module>
tf.compat.v1.app.run()
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/platform/app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/content/drive/MyDrive/tfod/Tensorflow/models/research/object_detection/model_main_tf2.py", line 112, in main
record_summaries=FLAGS.record_summaries)
File "/content/drive/MyDrive/tfod/Tensorflow/models/research/object_detection/model_lib_v2.py", line 505, in train_loop
pipeline_config_path, config_override=config_override)
File "/content/drive/MyDrive/tfod/Tensorflow/models/research/object_detection/utils/config_util.py", line 137, in get_configs_from_pipeline_file
with tf.io.GFile(pipeline_config_path, "r") as f:
AttributeError: module 'tensorflow.compat.v1.io' has no attribute 'GFile'
Replace
tf.gfile.GFile
with
tf.io.gfile.GFile

Rasa App breaks in Pycharm but works fine in terminal

Whenever I try to run my Rasa app using the run button in PyCharm, or try to use the debugger, I get the following error:
Traceback (most recent call last):
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/pykwalify/core.py", line 76, in __init__
self.source = yaml.load(stream)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/main.py", line 933, in load
loader = Loader(stream, version, preserve_quotes=preserve_quotes)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/loader.py", line 50, in __init__
Reader.__init__(self, stream, loader=self)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 85, in __init__
self.stream = stream # type: Any # as .read is called
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 130, in stream
self.determine_encoding()
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 190, in determine_encoding
self.update_raw()
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/ruamel/yaml/reader.py", line 297, in update_raw
data = self.stream.read(size)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 473: ordinal not in range(128)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/matthewspeck/project/trainer_app/app.py", line 25, in <module>
parser=False, core=True)
File "/Users/matthewspeck/project/trainer_app/rasa_model.py", line 165, in make_rasa_model
rasa_config=rasa_config
File "/Users/matthewspeck/project/trainer_app/rasa_model.py", line 66, in __init__
self._parser = create_agent(use_rasa_nlu=True, load_models=True)
File "/Users/matthewspeck/project/trainer_app/rasa.py", line 32, in create_agent
domain = create_domain()
File "/Users/matthewspeck/project/trainer_app/rasa.py", line 83, in create_domain
domain = ClarifyDomain.load(domain_path)
File "/Users/project/clarification/domain.py", line 39, in load
domain = TemplateDomain.load(filename)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/rasa_core/domain.py", line 404, in load
cls.validate_domain_yaml(filename)
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/rasa_core/domain.py", line 438, in validate_domain_yaml
schema_files=[schema_file])
File "/Users/matthewspeck/anaconda3/envs/proj_env/lib/python3.6/site-packages/pykwalify/core.py", line 78, in __init__
raise CoreError(u"Unable to load any data from source yaml file")
pykwalify.errors.CoreError: <CoreError: error code 3: Unable to load any data from source yaml file: Path: '/'>
Process finished with exit code 1
However, when I run the app from my terminal, or from my text editor (I use VSCode), It runs with no problems whatsoever. I've looked online and every answer I see has something to do with Rasa, but nothing mentions problems with PyCharm.
I've also checked that the yaml for the domain is properly formatted, and it is. Anyone have any idea why I would be getting this error in PyCharm, but not in any other environment, and how I could fix it?
I believe your problem was fixed in Rasa Core version 0.12 (changelog: https://github.com/RasaHQ/rasa_core/blob/master/CHANGELOG.rst#0120---2018-11-11).
I recommend upgrading to a newer version of Rasa Core, which parses the training data correctly.
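The first traceback shows the domain file being decoded with the 'ascii' codec, which suggests (as an assumption, not something confirmed in the question) that the PyCharm run configuration starts Python with a different default encoding than the terminal does. A minimal sketch for comparing the two environments and reproducing the failure on a byte sequence that begins with 0xe2; the helper name decodes_with is hypothetical:

```python
import locale

# A right single quotation mark (U+2019) encodes to three UTF-8 bytes,
# the first of which is 0xe2, the byte named in the traceback.
sample = "don\u2019t".encode("utf-8")


def decodes_with(data, encoding):
    """Return True if data can be decoded with the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # Run this both from PyCharm and from the terminal and compare.
    print("preferred encoding:", locale.getpreferredencoding(False))
    print("ascii decodes sample:", decodes_with(sample, "ascii"))
    print("utf-8 decodes sample:", decodes_with(sample, "utf-8"))
```

If the preferred encoding differs between the two environments, that would be consistent with the file loading in one and failing in the other.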

UnicodeDecodeError When I use cuda to train dataset

I used Chainer to train on some images, but I get an error.
I don't know whether it's a genuine UnicodeDecodeError or a problem with my CuPy installation.
P:\dcgans\chainer-DCGAN\chainer-DCGAN>python DCGAN.py
Traceback (most recent call last):
File "DCGAN.py", line 279, in <module>
train_dcgan_labeled(gen, dis)
File "DCGAN.py", line 171, in train_dcgan_labeled
zvis = (xp.random.uniform(-1, 1, (100, nz), dtype=np.float32))
File "P:\Python35\lib\site-packages\cupy\random\distributions.py", line 132, in uniform
return rs.uniform(low, high, size=size, dtype=dtype)
File "P:\Python35\lib\site-packages\cupy\random\generator.py", line 235, in uniform
rand = self.random_sample(size=size, dtype=dtype)
File "P:\Python35\lib\site-packages\cupy\random\generator.py", line 153, in random_sample
RandomState._1m_kernel(out)
File "cupy/core/elementwise.pxi", line 552, in cupy.core.core.ElementwiseKernel.__call__ (cupy\core\core.cpp:43810)
File "cupy/util.pyx", line 39, in cupy.util.memoize.decorator.ret (cupy\util.cpp:1480)
File "cupy/core/elementwise.pxi", line 409, in cupy.core.core._get_elementwise_kernel (cupy\core\core.cpp:42156)
File "cupy/core/elementwise.pxi", line 12, in cupy.core.core._get_simple_elementwise_kernel (cupy\core\core.cpp:34787)
File "cupy/core/elementwise.pxi", line 32, in cupy.core.core._get_simple_elementwise_kernel (cupy\core\core.cpp:34609)
File "cupy/core/carray.pxi", line 87, in cupy.core.core.compile_with_cache (cupy\core\core.cpp:34264)
File "P:\Python35\lib\site-packages\cupy\cuda\compiler.py", line 133, in compile_with_cache
base = _empty_file_preprocess_cache[env] = preprocess('', options)
File "P:\Python35\lib\site-packages\cupy\cuda\compiler.py", line 99, in preprocess
pp_src = pp_src.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 27-28: invalid continuation byte
It seems nvcc generated non-UTF-8 output and CuPy failed to decode it.
This is a bug in CuPy (I posted an issue: #378).
A possible workaround for the time being is to replace 'utf-8' in cupy/cuda/compiler.py, at the line pp_src = pp_src.decode('utf-8'), with an encoding that matches your environment. For example, in a Japanese environment 'cp932' should work, and 'cp936' should probably work for Simplified Chinese.
You could also try locale.getdefaultlocale()[1] as a universal solution (be sure to import locale).
Update: The fix has been merged and should be included in the upcoming CuPy v1.0.3.
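The workaround can be sketched outside CuPy as a decode-with-fallback helper. This is an illustration only, not the code that was merged into CuPy: decode_tool_output is a hypothetical name, and it uses locale.getpreferredencoding(False) in place of locale.getdefaultlocale()[1], which is deprecated in recent Python versions.

```python
import locale


def decode_tool_output(raw, fallback=None):
    """Decode bytes as UTF-8, falling back to a locale-specific encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Use the given encoding, or the platform's preferred one.
        encoding = fallback or locale.getpreferredencoding(False)
        # errors="replace" keeps the call from raising on stray bytes.
        return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # U+30BD (katakana "so") encoded as cp932 is not valid UTF-8.
    shift_jis_bytes = "\u30bd".encode("cp932")
    print(ascii(decode_tool_output(shift_jis_bytes, fallback="cp932")))
```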

What train_dir to use for Tensorflow imagenet_train to train from scratch?

I am following the page below:
https://github.com/tensorflow/models/tree/master/inception
I got to the point where I have to run:
bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data
However, I got the error below:
Traceback (most recent call last):
File "/home/demo/anaconda3/envs/tensorflow/models/inception/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 41, in <module>
tf.app.run()
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "/home/demo/anaconda3/envs/tensorflow/models/inception/bazel-bin/inception/imagenet_train.runfiles/inception/inception/imagenet_train.py", line 35, in main
tf.gfile.DeleteRecursively(FLAGS.train_dir)
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 420, in delete_recursively
pywrap_tensorflow.DeleteRecursively(compat.as_bytes(dirname), status)
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/contextlib.py", line 24, in __exit__
self.gen.next()
File "/home/demo/anaconda3/envs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.FailedPreconditionError: /tmp/imagenet_train
My DATA_DIR is /tmp/imagenet_data, from the previous step: bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}"
But what should my train_dir be? The doc doesn't mention it. It looks like an empty folder is incorrect.
For me, it works if I set --train_dir=/tmp. Also, you have the processed dataset in the same directory. The --train_dir and --data_dir should not coincide with each other.
Location of where to place the ImageNet data DATA_DIR=$HOME/imagenet-data
Can you tell me if you are still running into problems after changing the directory?
--train_dir is the path to an empty directory where the model checkpoints and events files are stored as the model is trained.
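The two conditions mentioned in the answers (train_dir differs from data_dir, and train_dir is empty) can be checked before launching training. A minimal sketch using only the standard library; check_train_dir is a hypothetical helper, not part of the inception scripts:

```python
import os


def check_train_dir(train_dir, data_dir):
    """Return a list of problems found with the chosen directories."""
    problems = []
    if os.path.abspath(train_dir) == os.path.abspath(data_dir):
        problems.append("train_dir and data_dir are the same directory")
    if os.path.isdir(train_dir) and os.listdir(train_dir):
        problems.append("train_dir is not empty")
    return problems


if __name__ == "__main__":
    # Paths taken from the question; adjust them to your setup.
    for problem in check_train_dir("/tmp/imagenet_train", "/tmp/imagenet_data"):
        print("warning:", problem)
```

An empty result only means these two checks passed; it does not rule out other causes of the FailedPreconditionError.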

Not able to run fully_connected_feed.py in Tensorflow

I am following the TensorFlow Mechanics 101 tutorial (version 0.7.0). As per the document, I downloaded the two files (mnist.py and fully_connected_feed.py) and saved them to the same directory on my local machine.
When I run the following command:
$ python /FULL_PATH_TO_fully_connected_feed.py/fully_connected_feed.py
...I get this error: OSError: [Errno 2] No such file or directory: ''. The full output and stack trace are below:
...
...
Step 800: loss = 0.56 (0.005 sec)
Step 900: loss = 0.51 (0.004 sec)
Traceback (most recent call last):
File "./fully_connected_feed.py", line 228, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_app.py", line 30, in run
sys.exit(main(sys.argv))
File "./fully_connected_feed.py", line 224, in main
run_training()
File "./fully_connected_feed.py", line 199, in run_training
saver.save(sess, FLAGS.train_dir, global_step=step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 970, in save
self.export_meta_graph(meta_graph_file_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 990, in export_meta_graph
as_text=as_text)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1315, in export_meta_graph
os.path.basename(filename), as_text=as_text)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py", line 70, in write_graph
gfile.MakeDirs(logdir)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/default/_gfile.py", line 295, in MakeDirs
os.makedirs(path, mode)
File "/usr/lib/python2.7/os.py", line 160, in makedirs
mkdir(name, mode)
OSError: [Errno 2] No such file or directory: ''
This is a bug in the 0.7.0 release of TensorFlow, which was fixed in a recent commit and will appear in a bugfix release shortly. The issue is caused when the --train_dir flag doesn't contain a directory name component.
In the meantime, you can avoid this issue by passing the flag --train_dir=./ when you run the example.
This should be a comment to mrry's post (I'm missing reputation)
Changing line 42 of fully_connected_feed.py to
flags.DEFINE_string('train_dir', './data', 'Directory to put the training data.')
solved the problem for me. I'm also on 0.7.0 and was able to run all other mnist examples.
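Both workarounds give the path a directory component. A short sketch of the difference, assuming the save path's directory part is extracted with something equivalent to os.path.dirname (dir_component is a hypothetical helper used only for illustration):

```python
import os


def dir_component(train_dir):
    """Return the directory part of a path, as os.path.dirname sees it."""
    return os.path.dirname(train_dir)


for value in ("data", "./data", "./"):
    print(repr(value), "->", repr(dir_component(value)))

# An empty directory component is what breaks directory creation:
try:
    os.makedirs("")
except OSError as exc:
    print("os.makedirs('') failed:", exc)
```

A bare name such as data has an empty directory part, while ./data and ./ both yield a usable one.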
