Goal: TFX -> TF Lite Converter -> Deploy models on mobile/IoT devices
I am currently learning TensorFlow Extended (TFX) with its Chicago Taxi Pipeline example.
The pipeline has finished running (although through a lot of hardships), and the Pusher component has emitted a TensorFlow SavedModel file (.pb).
However, a new problem comes up here:
With TensorFlow nightly/1.13.1 (I tried both) and Python 2.7.6, I can generate, save, and load a SavedModel (a model for the MNIST digit data, just to test the utilities) with some simple Python code such as saved_model.simple_save and saved_model.loader.load, but I keep running into errors when I apply the same code to the models the TFX Pusher emits, as shown below.
(Maybe I did something wrong with the TFX pipeline?)
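For reference, a minimal sketch of the kind of save/load round trip that does work for me (TF 1.x-style APIs; the toy model and export path are placeholders, not my actual MNIST test code):

import tensorflow as tf

# Build a trivial graph and export it with simple_save.
with tf.Session(graph=tf.Graph()) as sess:
    x = tf.placeholder(tf.float32, [None, 784], name="x")
    y = tf.layers.dense(x, 10, name="y")
    sess.run(tf.global_variables_initializer())
    tf.saved_model.simple_save(sess, "/tmp/simple_model",
                               inputs={"x": x}, outputs={"y": y})

# Loading this back with the "serve" tag succeeds.
with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ["serve"], "/tmp/simple_model")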
The code I used:
import tensorflow as tf

with tf.Session(graph=tf.Graph()) as sess:
    tf.compat.v1.saved_model.loader.load(
        sess, ["serve"],
        "/home/tigerpaws/taxi/serving_model/taxi_simple/1553187887")  # "/home/tigerpaws/saved_model_example/model"
    graph = tf.get_default_graph()
Error:
KeyError Traceback (most recent call last)
<ipython-input-11-a6978b82c3d2> in <module>()
1 with tf.Session(graph=tf.Graph()) as sess:
----> 2 tf.compat.v1.saved_model.loader.load(sess, ["serve"], "/home/tigerpaws/taxi/serving_model/taxi_simple/1553187887")#"/home/tigerpaws/saved_model_example/model")
3 graph=tf.get_default_graph()
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/util/deprecation.pyc in new_func(*args, **kwargs)
322 'in a future version' if date is None else ('after %s' % date),
323 instructions)
--> 324 return func(*args, **kwargs)
325 return tf_decorator.make_decorator(
326 func, new_func, 'deprecated',
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/saved_model/loader_impl.pyc in load(sess, tags, export_dir, import_scope, **saver_kwargs)
267 """
268 loader = SavedModelLoader(export_dir)
--> 269 return loader.load(sess, tags, import_scope, **saver_kwargs)
270
271
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/saved_model/loader_impl.pyc in load(self, sess, tags, import_scope, **saver_kwargs)
418 with sess.graph.as_default():
419 saver, _ = self.load_graph(sess.graph, tags, import_scope,
--> 420 **saver_kwargs)
421 self.restore_variables(sess, saver, import_scope)
422 self.run_init_ops(sess, tags, import_scope)
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/saved_model/loader_impl.pyc in load_graph(self, graph, tags, import_scope, **saver_kwargs)
348 with graph.as_default():
349 return tf_saver._import_meta_graph_with_return_elements( # pylint: disable=protected-access
--> 350 meta_graph_def, import_scope=import_scope, **saver_kwargs)
351
352 def restore_variables(self, sess, saver, import_scope=None):
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/training/saver.pyc in _import_meta_graph_with_return_elements(meta_graph_or_file, clear_devices, import_scope, return_elements, **kwargs)
1455 import_scope=import_scope,
1456 return_elements=return_elements,
-> 1457 **kwargs))
1458
1459 saver = _create_saver_from_imported_meta_graph(
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/meta_graph.pyc in import_scoped_meta_graph_with_return_elements(meta_graph_or_file, clear_devices, graph, import_scope, input_map, unbound_inputs_col_name, restore_collections_predicate, return_elements)
804 input_map=input_map,
805 producer_op_list=producer_op_list,
--> 806 return_elements=return_elements)
807
808 # Restores all the other collections.
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/util/deprecation.pyc in new_func(*args, **kwargs)
505 'in a future version' if date is None else ('after %s' % date),
506 instructions)
--> 507 return func(*args, **kwargs)
508
509 doc = _add_deprecated_arg_notice_to_docstring(
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/importer.pyc in import_graph_def(graph_def, input_map, return_elements, name, op_dict, producer_op_list)
397 if producer_op_list is not None:
398 # TODO(skyewm): make a copy of graph_def so we're not mutating the argument?
--> 399 _RemoveDefaultAttrs(op_dict, producer_op_list, graph_def)
400
401 graph = ops.get_default_graph()
/home/tigerpaws/anaconda/lib/python2.7/site-packages/tensorflow/python/framework/importer.pyc in _RemoveDefaultAttrs(op_dict, producer_op_list, graph_def)
157 # Remove any default attr values that aren't in op_def.
158 if node.op in producer_op_dict:
--> 159 op_def = op_dict[node.op]
160 producer_op_def = producer_op_dict[node.op]
161 # We make a copy of node.attr to iterate through since we may modify
KeyError: u'BucketizeWithInputBoundaries'
There was also another attempt: I tried to convert the SavedModel into a GraphDef (frozen graph) so I could give the converter another try.
The conversion needs an output_node_names argument, which I don't know;
nor could I find where the model is saved in the code (where I might have been able to spot the output node names).
Any ideas on the problem, or alternative approaches? Thanks in advance.
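As an aside, for anyone who wants to try the frozen-graph route: a sketch of one way to look for candidate output tensor names is to parse saved_model.pb directly with the protobuf classes, which does not require the graph's custom ops to be registered (the path below is a placeholder):

from tensorflow.core.protobuf import saved_model_pb2

# Read the SavedModel proto off disk without building a graph.
sm = saved_model_pb2.SavedModel()
with open("/path/to/saved_model/saved_model.pb", "rb") as f:
    sm.ParseFromString(f.read())

# Print each signature's output tensor names.
for mg in sm.meta_graphs:
    for key, sig in mg.signature_def.items():
        print(key, {k: v.name for k, v in sig.outputs.items()})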
Sorry for the confusion caused; the problem is actually caused by reading the SavedModel file.
The SavedModel contains an op, BucketizeWithInputBoundaries, that is not defined in op_dict.
This is still on Google's TODO list, noted in comments in two of their scripts (here and here; GitHub links):
# TODO(jyzhao): BucketizeWithInputBoundaries error without this.
Importing the script below solves the problem:
from tensorflow.contrib.boosted_trees.python.ops import quantile_ops # pylint: disable=unused-import
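Putting it together, the loading code now works (the path is the one from my question; the unused import is what registers the op):

import tensorflow as tf
# Importing this module registers BucketizeWithInputBoundaries with the runtime.
from tensorflow.contrib.boosted_trees.python.ops import quantile_ops  # pylint: disable=unused-import

with tf.Session(graph=tf.Graph()) as sess:
    tf.compat.v1.saved_model.loader.load(
        sess, ["serve"],
        "/home/tigerpaws/taxi/serving_model/taxi_simple/1553187887")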
Related
I am trying to create a feature extractor using from torchvision.models.feature_extraction import create_feature_extractor.
The model I am trying to use is from vit_pytorch (link: https://github.com/lucidrains/vit-pytorch). The problem I face is that when I create a model from this library:
from vit_pytorch import ViT
from torchvision.models.feature_extraction import create_feature_extractor

model = ViT(image_size=28,
            patch_size=7,
            num_classes=10,
            dim=16,
            depth=6,
            heads=16,
            mlp_dim=256,
            dropout=0.1,
            emb_dropout=0.1,
            channels=1)

random_layer_name = 'transformer.layers.1.1.fn.net.4'

feature_extractor = create_feature_extractor(model,
                                             return_nodes=[random_layer_name])
When calling create_feature_extractor() on this model, I always get this error:
RuntimeError Traceback (most recent call last)
Cell In[17], line 2
1 # torch.fx.wrap('len')
----> 2 feature_extractor = create_feature_extractor(model,
3 return_nodes=['transformer.layers.1.1.fn.net.4'])
File ~/Mokslas/AI/venv/lib/python3.10/site-packages/torchvision/models/feature_extraction.py:485, in create_feature_extractor(model, return_nodes, train_return_nodes, eval_return_nodes, tracer_kwargs, suppress_diff_warning)
483 # Instantiate our NodePathTracer and use that to trace the model
484 tracer = NodePathTracer(**tracer_kwargs)
--> 485 graph = tracer.trace(model)
487 name = model.__class__.__name__ if isinstance(model, nn.Module) else model.__name__
488 graph_module = fx.GraphModule(tracer.root, graph, name)
File ~/Mokslas/AI/venv/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py:756, in Tracer.trace(self, root, concrete_args)
749 for module in self._autowrap_search:
750 _autowrap_check(
751 patcher, module.__dict__, self._autowrap_function_ids
752 )
753 self.create_node(
754 "output",
755 "output",
--> 756 (self.create_arg(fn(*args)),),
757 {},
758 type_expr=fn.__annotations__.get("return", None),
759 )
761 self.submodule_paths = None
762 finally:
File ~/Mokslas/AI/venv/lib/python3.10/site-packages/vit_pytorch/vit.py:115, in ViT.forward(self, img)
114 def forward(self, img):
--> 115 x = self.to_patch_embedding(img)
116 b, n, _ = x.shape
118 cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
File ~/Mokslas/AI/venv/lib/python3.10/site-packages/torch/fx/_symbolic_trace.py:734, in Tracer.trace.<locals>.module_call_wrapper(mod, *args, **kwargs)
727 return _orig_module_call(mod, *args, **kwargs)
729 _autowrap_check(
730 patcher,
731 getattr(getattr(mod, "forward", mod), "__globals__", {}),
732 self._autowrap_function_ids,
733 )
--> 734 return self.call_module(mod, forward, args, kwargs)
File ~/Mokslas/AI/venv/lib/python3.10/site-packages/torchvision/models/feature_extraction.py:83, in NodePathTracer.call_module(self, m, forward, args, kwargs)
...
--> 396 raise RuntimeError("'len' is not supported in symbolic tracing by default. If you want "
397 "this call to be recorded, please call torch.fx.wrap('len') at "
398 "module scope")
RuntimeError: 'len' is not supported in symbolic tracing by default. If you want this call to be recorded, please call torch.fx.wrap('len') at module scope
It doesn't matter which model I choose from that library, or which layer or layers I choose to output: I always get the same error.
I have tried adding torch.fx.wrap('len'), but the problem persists. I know I could work around this with forward hooks (see the sketch below), but is there a way to solve the problem so that I can still use the create_feature_extractor() functionality?
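For reference, a minimal sketch of the hook-based workaround mentioned above (assuming the dotted layer path from my example resolves on your setup; get_submodule requires torch >= 1.9):

import torch
from vit_pytorch import ViT

model = ViT(image_size=28, patch_size=7, num_classes=10, dim=16,
            depth=6, heads=16, mlp_dim=256, dropout=0.1,
            emb_dropout=0.1, channels=1)

features = {}

def hook(module, inputs, output):
    # Stash the layer's output under a readable key.
    features['transformer.layers.1.1.fn.net.4'] = output.detach()

# Resolve the dotted path to the actual submodule and attach the hook.
layer = model.get_submodule('transformer.layers.1.1.fn.net.4')
handle = layer.register_forward_hook(hook)

_ = model(torch.randn(1, 1, 28, 28))  # a forward pass populates `features`
handle.remove()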
I want to load FaceNet in Keras but I am getting errors.
The model facenet_keras.h5 is ready, but I can't load it.
you can get facenet_keras.h5 from this link:
https://drive.google.com/drive/folders/1pwQ3H4aJ8a6yyJHZkTwtjcL4wYWQb7bn
My tensorflow version is:
tensorflow.__version__
'2.2.0'
and when I try to load the model:
from tensorflow.keras.models import load_model
load_model('facenet_keras.h5')
I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-6-2a20f38e8217> in <module>
----> 1 load_model('facenet_keras.h5')
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/saving/save.py in load_model(filepath, custom_objects, compile)
182 if (h5py is not None and (
183 isinstance(filepath, h5py.File) or h5py.is_hdf5(filepath))):
--> 184 return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
185
186 if sys.version_info >= (3, 4) and isinstance(filepath, pathlib.Path):
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/saving/hdf5_format.py in load_model_from_hdf5(filepath, custom_objects, compile)
175 raise ValueError('No model found in config file.')
176 model_config = json.loads(model_config.decode('utf-8'))
--> 177 model = model_config_lib.model_from_config(model_config,
178 custom_objects=custom_objects)
179
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/saving/model_config.py in model_from_config(config, custom_objects)
53 '`Sequential.from_config(config)`?')
54 from tensorflow.python.keras.layers import deserialize # pylint: disable=g-import-not-at-top
---> 55 return deserialize(config, custom_objects=custom_objects)
56
57
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/layers/serialization.py in deserialize(config, custom_objects)
103 config['class_name'] = _DESERIALIZATION_TABLE[layer_class_name]
104
--> 105 return deserialize_keras_object(
106 config,
107 module_objects=globs,
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/utils/generic_utils.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
367
368 if 'custom_objects' in arg_spec.args:
--> 369 return cls.from_config(
370 cls_config,
371 custom_objects=dict(
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/network.py in from_config(cls, config, custom_objects)
984 ValueError: In case of improperly formatted config dict.
985 """
--> 986 input_tensors, output_tensors, created_layers = reconstruct_from_config(
987 config, custom_objects)
988 model = cls(inputs=input_tensors, outputs=output_tensors,
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/network.py in reconstruct_from_config(config, custom_objects, created_layers)
2017 # First, we create all layers and enqueue nodes to be processed
2018 for layer_data in config['layers']:
-> 2019 process_layer(layer_data)
2020 # Then we process nodes in order of layer depth.
2021 # Nodes that cannot yet be processed (if the inbound node
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/network.py in process_layer(layer_data)
1999 from tensorflow.python.keras.layers import deserialize as deserialize_layer # pylint: disable=g-import-not-at-top
2000
-> 2001 layer = deserialize_layer(layer_data, custom_objects=custom_objects)
2002 created_layers[layer_name] = layer
2003
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/layers/serialization.py in deserialize(config, custom_objects)
103 config['class_name'] = _DESERIALIZATION_TABLE[layer_class_name]
104
--> 105 return deserialize_keras_object(
106 config,
107 module_objects=globs,
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/utils/generic_utils.py in deserialize_keras_object(identifier, module_objects, custom_objects, printable_module_name)
367
368 if 'custom_objects' in arg_spec.args:
--> 369 return cls.from_config(
370 cls_config,
371 custom_objects=dict(
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/layers/core.py in from_config(cls, config, custom_objects)
988 def from_config(cls, config, custom_objects=None):
989 config = config.copy()
--> 990 function = cls._parse_function_from_config(
991 config, custom_objects, 'function', 'module', 'function_type')
992
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/layers/core.py in _parse_function_from_config(cls, config, custom_objects, func_attr_name, module_attr_name, func_type_attr_name)
1040 elif function_type == 'lambda':
1041 # Unsafe deserialization from bytecode
-> 1042 function = generic_utils.func_load(
1043 config[func_attr_name], globs=globs)
1044 elif function_type == 'raw':
~/.local/lib/python3.8/site-packages/tensorflow/python/keras/utils/generic_utils.py in func_load(code, defaults, closure, globs)
469 except (UnicodeEncodeError, binascii.Error):
470 raw_code = code.encode('raw_unicode_escape')
--> 471 code = marshal.loads(raw_code)
472 if globs is None:
473 globs = globals()
ValueError: bad marshal data (unknown type code)
Thank you.
The possible solutions to this error are:
The model might have been built and saved in Python 2.x while you are using Python 3.x. The solution is to use the same Python version with which the model was built and saved.
Use the same version of Keras (and possibly TensorFlow) on which your model was built and saved.
The saved model might contain custom objects. If so, you need to load the model with:
new_model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
If you can recreate the architecture (i.e. you have the original code used to generate it), you can instantiate the model from that code and then use model.load_weights('your_model_file.hdf5') to load in the weights, as in the sketch below. This isn't an option if you don't have the code that created the original architecture.
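A minimal sketch of that last option, where build_facenet() is a hypothetical stand-in for your original model-building code:

model = build_facenet()  # hypothetical: recreate the architecture from the original code
model.load_weights('facenet_keras.h5')  # read only the weights, skipping config deserialization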
For more details, please refer to this GitHub issue. For more details on saving and loading a model with custom objects, please refer to this TensorFlow documentation and this Stack Overflow answer.
I changed the Python version (3.10 to 3.7) and that solved it for me.
I am new to the world of parallelization, and I encountered a very odd bug while trying to run a function that loads the same .npy file on several cores.
My code is of the form:
import os
from pathlib import Path
from joblib import Parallel, delayed
import multiprocessing
import numpy as np

num_cores = multiprocessing.cpu_count()

mydir = 'path/of/your/choice'
myfile = 'myArray.npy'
mydir = Path(mydir)
myfile = mydir / myfile
os.chdir(mydir)

myarray = np.zeros(12345)
np.save(myfile, myarray)

def foo(myfile, x):
    # function loading myArray and working with it
    arr = np.load(myfile)
    return arr + x

if __name__ == '__main__':
    foo_results = Parallel(n_jobs=num_cores, backend="threading")(
        delayed(foo)(myfile, i) for i in range(10))
In my case, this script would run fine about 40% of the way, then return
--> 17 arr=np.load(mydir/'myArray.npy')
ValueError: cannot reshape array of size 0 into shape (12345,)
What blows my mind is that if I enter %pdb debug mode and actually run arr=np.load(mydir/'myArray.npy'), it works! So I assume the issue stems from all the parallel processes running foo and trying to load the same numpy array at the same time (in debug mode, all the processes are paused and only the code I execute actually runs).
This very minimal example actually works, presumably because the function is so simple that joblib handles it gracefully, but my real code is too long and complicated to post here. First of all, has anyone encountered a similar issue in the past? If no one manages to identify the problem, I will post my whole script.
Thanks for your help!
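For what it's worth, a minimal sketch of one way to serialize access to the file under the threading backend (a hypothetical workaround, assuming concurrent access to the .npy file really is the culprit):

import threading
import numpy as np

_load_lock = threading.Lock()

def foo_locked(myfile, x):
    # Only one thread at a time may touch the file.
    with _load_lock:
        arr = np.load(myfile)
    return arr + x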
-------------------- EDIT ------------------
Given that there doesn't seem to be an easy answer with the toy code I posted, here are the full error logs. I played around with the backends following @psarka's recommendation, and for some reason the following error arises with the default loky backend (again, no problem running the code in a non-parallel manner):
/media/maxime/ut_data/Dropbox/NeuroPyxels/npyx/corr.py in ccg_stack(dp, U_src, U_trg, cbin, cwin, normalize, all_to_all, name, sav, again, periods)
541
542 ccg_results=Parallel(n_jobs=num_cores)(\
--> 543 delayed(ccg)(*ccg_inputs[i]) for i in tqdm(range(len(ccg_inputs)), desc=f'Computing ccgs over {num_cores} cores'))
544 for ((i1, u1, i2, u2), CCG) in zip(ccg_ids,ccg_results):
545 if i1==i2:
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/parallel.py in __call__(self, iterable)
1052
1053 with self._backend.retrieval_context():
-> 1054 self.retrieve()
1055 # Make sure that we get a last message telling us we are done
1056 elapsed_time = time.time() - self._start_time
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/parallel.py in retrieve(self)
931 try:
932 if getattr(self._backend, 'supports_timeout', False):
--> 933 self._output.extend(job.get(timeout=self.timeout))
934 else:
935 self._output.extend(job.get())
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
540 AsyncResults.get from multiprocessing."""
541 try:
--> 542 return future.result(timeout=timeout)
543 except CfTimeoutError as e:
544 raise TimeoutError from e
~/miniconda3/envs/npyx/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
426 raise CancelledError()
427 elif self._state == FINISHED:
--> 428 return self.__get_result()
429
430 self._condition.wait(timeout)
~/miniconda3/envs/npyx/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
ValueError: Cannot load file containing pickled data when allow_pickle=False
But this arises with the threading backend (the one originally used in my question), which is more informative; again, it is possible to actually run train = np.load(Path(dprm,fn)) in debug mode:
/media/maxime/ut_data/Dropbox/NeuroPyxels/npyx/corr.py in ccg_stack(dp, U_src, U_trg, cbin, cwin, normalize, all_to_all, name, sav, again, periods)
541
542 ccg_results=Parallel(n_jobs=num_cores, backend='threading')(\
--> 543 delayed(ccg)(*ccg_inputs[i]) for i in tqdm(range(len(ccg_inputs)), desc=f'Computing ccgs over {num_cores} cores'))
544 for ((i1, u1, i2, u2), CCG) in zip(ccg_ids,ccg_results):
545 if i1==i2:
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/parallel.py in __call__(self, iterable)
1052
1053 with self._backend.retrieval_context():
-> 1054 self.retrieve()
1055 # Make sure that we get a last message telling us we are done
1056 elapsed_time = time.time() - self._start_time
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/parallel.py in retrieve(self)
931 try:
932 if getattr(self._backend, 'supports_timeout', False):
--> 933 self._output.extend(job.get(timeout=self.timeout))
934 else:
935 self._output.extend(job.get())
~/miniconda3/envs/npyx/lib/python3.7/multiprocessing/pool.py in get(self, timeout)
655 return self._value
656 else:
--> 657 raise self._value
658
659 def _set(self, i, obj):
~/miniconda3/envs/npyx/lib/python3.7/multiprocessing/pool.py in worker(inqueue, outqueue, initializer, initargs, maxtasks, wrap_exception)
119 job, i, func, args, kwds = task
120 try:
--> 121 result = (True, func(*args, **kwds))
122 except Exception as e:
123 if wrap_exception and func is not _helper_reraises_exception:
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/_parallel_backends.py in __call__(self, *args, **kwargs)
593 def __call__(self, *args, **kwargs):
594 try:
--> 595 return self.func(*args, **kwargs)
596 except KeyboardInterrupt as e:
597 # We capture the KeyboardInterrupt and reraise it as
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/parallel.py in __call__(self)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
~/miniconda3/envs/npyx/lib/python3.7/site-packages/joblib-1.0.1-py3.7.egg/joblib/parallel.py in <listcomp>(.0)
261 with parallel_backend(self._backend, n_jobs=self._n_jobs):
262 return [func(*args, **kwargs)
--> 263 for func, args, kwargs in self.items]
264
265 def __reduce__(self):
/media/maxime/ut_data/Dropbox/NeuroPyxels/npyx/corr.py in ccg(dp, U, bin_size, win_size, fs, normalize, ret, sav, verbose, periods, again, trains)
258 if verbose: print("File {} not found in routines memory.".format(fn))
259 crosscorrelograms = crosscorrelate_cyrille(dp, bin_size, win_size, sortedU, fs, True,
--> 260 periods=periods, verbose=verbose, trains=trains)
261 crosscorrelograms = np.asarray(crosscorrelograms, dtype='float64')
262 if crosscorrelograms.shape[0]<len(U): # no spikes were found in this period
/media/maxime/ut_data/Dropbox/NeuroPyxels/npyx/corr.py in crosscorrelate_cyrille(dp, bin_size, win_size, U, fs, symmetrize, periods, verbose, trains)
88 U=list(U)
89
---> 90 spike_times, spike_clusters = make_phy_like_spikeClustersTimes(dp, U, periods=periods, verbose=verbose, trains=trains)
91
92 return crosscorr_cyrille(spike_times, spike_clusters, win_size, bin_size, fs, symmetrize)
/media/maxime/ut_data/Dropbox/NeuroPyxels/npyx/corr.py in make_phy_like_spikeClustersTimes(dp, U, periods, verbose, trains)
46 for iu, u in enumerate(U):
47 # Even lists of strings can be dealt with as integers by being replaced by their indices
---> 48 trains_dic[iu]=trn(dp, u, sav=True, periods=periods, verbose=verbose) # trains in samples
49 else:
50 assert len(trains)>1
/media/maxime/ut_data/Dropbox/NeuroPyxels/npyx/spk_t.py in trn(dp, unit, sav, verbose, periods, again, enforced_rp)
106 if op.exists(Path(dprm,fn)) and not again:
107 if verbose: print("File {} found in routines memory.".format(fn))
--> 108 train = np.load(Path(dprm,fn))
109
110 # if not, compute it
~/miniconda3/envs/npyx/lib/python3.7/site-packages/numpy-1.21.0rc2-py3.7-linux-x86_64.egg/numpy/lib/npyio.py in load(file, mmap_mode, allow_pickle, fix_imports, encoding)
443 # Try a pickle
444 if not allow_pickle:
--> 445 raise ValueError("Cannot load file containing pickled data "
446 "when allow_pickle=False")
447 try:
ValueError: Cannot load file containing pickled data when allow_pickle=False
The original error ValueError: cannot reshape array of size 0 into shape (12345,) doesn't show up anymore for some reason.
I am trying to train a PyTorch model using SageMaker in local mode, but whenever I call estimator.fit the code hangs indefinitely and I have to interrupt the notebook kernel. This happens both on my local machine and in SageMaker Studio. When I use EC2, though, the training runs normally.
Here is the call to the estimator, and the stack trace once I interrupt the kernel:
import sagemaker
from sagemaker.pytorch import PyTorch

bucket = "bucket-name"
role = sagemaker.get_execution_role()
training_input_path = f"s3://{bucket}/dataset/path"

sagemaker_session = sagemaker.LocalSession()
sagemaker_session.config = {"local": {"local_code": True}}
output_path = "file://."

estimator = PyTorch(
    entry_point="train.py",
    source_dir="src",
    hyperparameters={"max-epochs": 1},
    framework_version="1.8",
    py_version="py3",
    instance_count=1,
    instance_type="local",
    role=role,
    output_path=output_path,
    sagemaker_session=sagemaker_session,
)

estimator.fit({"training": training_input_path})
Stack trace:
---------------------------------------------------------------------------
KeyboardInterrupt Traceback (most recent call last)
<ipython-input-9-35cdd6021288> in <module>
----> 1 estimator.fit({"training": training_input_path})
/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in fit(self, inputs, wait, logs, job_name, experiment_config)
678 self._prepare_for_training(job_name=job_name)
679
--> 680 self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
681 self.jobs.append(self.latest_training_job)
682 if wait:
/opt/conda/lib/python3.7/site-packages/sagemaker/estimator.py in start_new(cls, estimator, inputs, experiment_config)
1450 """
1451 train_args = cls._get_train_args(estimator, inputs, experiment_config)
-> 1452 estimator.sagemaker_session.train(**train_args)
1453
1454 return cls(estimator.sagemaker_session, estimator._current_job_name)
/opt/conda/lib/python3.7/site-packages/sagemaker/session.py in train(self, input_mode, input_config, role, job_name, output_config, resource_config, vpc_config, hyperparameters, stop_condition, tags, metric_definitions, enable_network_isolation, image_uri, algorithm_arn, encrypt_inter_container_traffic, use_spot_instances, checkpoint_s3_uri, checkpoint_local_path, experiment_config, debugger_rule_configs, debugger_hook_config, tensorboard_output_config, enable_sagemaker_metrics, profiler_rule_configs, profiler_config, environment, retry_strategy)
572 LOGGER.info("Creating training-job with name: %s", job_name)
573 LOGGER.debug("train request: %s", json.dumps(train_request, indent=4))
--> 574 self.sagemaker_client.create_training_job(**train_request)
575
576 def _get_train_request( # noqa: C901
/opt/conda/lib/python3.7/site-packages/sagemaker/local/local_session.py in create_training_job(self, TrainingJobName, AlgorithmSpecification, OutputDataConfig, ResourceConfig, InputDataConfig, **kwargs)
184 hyperparameters = kwargs["HyperParameters"] if "HyperParameters" in kwargs else {}
185 logger.info("Starting training job")
--> 186 training_job.start(InputDataConfig, OutputDataConfig, hyperparameters, TrainingJobName)
187
188 LocalSagemakerClient._training_jobs[TrainingJobName] = training_job
/opt/conda/lib/python3.7/site-packages/sagemaker/local/entities.py in start(self, input_data_config, output_data_config, hyperparameters, job_name)
219
220 self.model_artifacts = self.container.train(
--> 221 input_data_config, output_data_config, hyperparameters, job_name
222 )
223 self.end_time = datetime.datetime.now()
/opt/conda/lib/python3.7/site-packages/sagemaker/local/image.py in train(self, input_data_config, output_data_config, hyperparameters, job_name)
200 data_dir = self._create_tmp_folder()
201 volumes = self._prepare_training_volumes(
--> 202 data_dir, input_data_config, output_data_config, hyperparameters
203 )
204 # If local, source directory needs to be updated to mounted /opt/ml/code path
/opt/conda/lib/python3.7/site-packages/sagemaker/local/image.py in _prepare_training_volumes(self, data_dir, input_data_config, output_data_config, hyperparameters)
487 os.mkdir(channel_dir)
488
--> 489 data_source = sagemaker.local.data.get_data_source_instance(uri, self.sagemaker_session)
490 volumes.append(_Volume(data_source.get_root_dir(), channel=channel_name))
491
/opt/conda/lib/python3.7/site-packages/sagemaker/local/data.py in get_data_source_instance(data_source, sagemaker_session)
52 return LocalFileDataSource(parsed_uri.netloc + parsed_uri.path)
53 if parsed_uri.scheme == "s3":
---> 54 return S3DataSource(parsed_uri.netloc, parsed_uri.path, sagemaker_session)
55 raise ValueError(
56 "data_source must be either file or s3. parsed_uri.scheme: {}".format(parsed_uri.scheme)
/opt/conda/lib/python3.7/site-packages/sagemaker/local/data.py in __init__(self, bucket, prefix, sagemaker_session)
183 working_dir = "/private{}".format(working_dir)
184
--> 185 sagemaker.utils.download_folder(bucket, prefix, working_dir, sagemaker_session)
186 self.files = LocalFileDataSource(working_dir)
187
/opt/conda/lib/python3.7/site-packages/sagemaker/utils.py in download_folder(bucket_name, prefix, target, sagemaker_session)
286 raise
287
--> 288 _download_files_under_prefix(bucket_name, prefix, target, s3)
289
290
/opt/conda/lib/python3.7/site-packages/sagemaker/utils.py in _download_files_under_prefix(bucket_name, prefix, target, s3)
314 if exc.errno != errno.EEXIST:
315 raise
--> 316 obj.download_file(file_path)
317
318
/opt/conda/lib/python3.7/site-packages/boto3/s3/inject.py in object_download_file(self, Filename, ExtraArgs, Callback, Config)
313 return self.meta.client.download_file(
314 Bucket=self.bucket_name, Key=self.key, Filename=Filename,
--> 315 ExtraArgs=ExtraArgs, Callback=Callback, Config=Config)
316
317
/opt/conda/lib/python3.7/site-packages/boto3/s3/inject.py in download_file(self, Bucket, Key, Filename, ExtraArgs, Callback, Config)
171 return transfer.download_file(
172 bucket=Bucket, key=Key, filename=Filename,
--> 173 extra_args=ExtraArgs, callback=Callback)
174
175
/opt/conda/lib/python3.7/site-packages/boto3/s3/transfer.py in download_file(self, bucket, key, filename, extra_args, callback)
305 bucket, key, filename, extra_args, subscribers)
306 try:
--> 307 future.result()
308 # This is for backwards compatibility where when retries are
309 # exceeded we need to throw the same error from boto3 instead of
/opt/conda/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
107 except KeyboardInterrupt as e:
108 self.cancel()
--> 109 raise e
110
111 def cancel(self):
/opt/conda/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
104 # however if a KeyboardInterrupt is raised we want want to exit
105 # out of this and propogate the exception.
--> 106 return self._coordinator.result()
107 except KeyboardInterrupt as e:
108 self.cancel()
/opt/conda/lib/python3.7/site-packages/s3transfer/futures.py in result(self)
258 # possible value integer value, which is on the scale of billions of
259 # years...
--> 260 self._done_event.wait(MAXINT)
261
262 # Once done waiting, raise an exception if present or return the
/opt/conda/lib/python3.7/threading.py in wait(self, timeout)
550 signaled = self._flag
551 if not signaled:
--> 552 signaled = self._cond.wait(timeout)
553 return signaled
554
/opt/conda/lib/python3.7/threading.py in wait(self, timeout)
294 try: # restore state no matter what (e.g., KeyboardInterrupt)
295 if timeout is None:
--> 296 waiter.acquire()
297 gotit = True
298 else:
KeyboardInterrupt:
SageMaker Studio does not natively support local mode. Studio apps are themselves Docker containers, so they would need privileged access to be able to build and run Docker containers.
As an alternative, you can create a remote Docker host on an EC2 instance and set up Docker on your Studio app. There is quite a bit of networking and package installation involved, but the solution gives you full Docker functionality. Additionally, as of version 2.80.0 of the SageMaker Python SDK, local mode is supported when you are using a remote Docker host.
sdocker, a SageMaker Studio Docker CLI extension (see this repo), can simplify deploying the above solution in two simple steps (it only works for Studio domains in VPCOnly mode), and it has an easy-to-follow example here.
UPDATE:
There is now a UI extension (see repo) which can make the experience much smoother and easier to manage.
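As an illustration only (this is not the sdocker setup itself): once a remote Docker daemon is reachable, the standard way to point Docker clients, including the one the SageMaker SDK shells out to, at it is the DOCKER_HOST environment variable; the endpoint below is a placeholder:

import os

# Hypothetical endpoint: substitute your EC2 host and the port your daemon listens on.
os.environ["DOCKER_HOST"] = "tcp://10.0.0.12:2375"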
I am working with OpenAI gym to train an actor-critic network, where one network provides the action and the second network provides the expected value. However, I keep getting the error TypeError: Fetch argument None has invalid type <class 'NoneType'> when I attempt to get the gradients from the network so that I can store them and apply the updates later. The error only appears when I run the critic network alongside the actor, or when I run a second actor network. I have defined them with different tf.variable_scope values and passed them the same session, so it seems to me that it ought to work, and I can't figure out why it doesn't. I came across other posts here, here, and here, yet they don't address my issue.
My network is given below (for brevity I cut out the layers and other methods that are working; the actor network is nearly identical at this level of abstraction, just with a different loss function. I can provide more code if deemed necessary):
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected

# Define critic network
class critic(object):
    def __init__(self, sess, scope):
        self.sess = sess
        self.scope = scope
        # (self.n_inputs, self.n_out, etc. are set elsewhere; elided for brevity)
        with tf.variable_scope(self.scope):
            # Network inputs, outputs, rewards, optimizer, etc...
            self.state = tf.placeholder(tf.float32, [None, self.n_inputs],
                                        name='state')
            self.returns = tf.placeholder(tf.float32, [None], name='returns')
            # Single, linear layer
            self.output = fully_connected(self.state, self.n_out,
                                          activation_fn=None,
                                          weights_initializer=None)
            self.est_state_value = tf.squeeze(self.output)
            # Define loss function
            self.loss = tf.squared_difference(self.est_state_value, self.returns)
            self.trainable_variables = tf.trainable_variables()
            self.gradients = tf.gradients(self.loss, self.trainable_variables)

    # Methods for prediction, updating, etc...
And the get_grads method, which is intended to return the network gradients, is causing the problems:
def get_grads(self, states, actions, returns):
    grads = self.sess.run([self.gradients],
                          feed_dict={
                              self.state: states,
                              self.actions: actions,
                              self.returns: returns
                          })[0]
    return grads
When running the algorithm, it throws the error on the second get_grads call:
import numpy as np
import tensorflow as tf

tf.reset_default_graph()
sess = tf.Session()
act = actor(sess, scope='actor')
crit = critic(sess, scope='critic')
init = tf.global_variables_initializer()
act.sess.run(init)
crit.sess.run(init)

# Randomized data for example
rewards = np.ones(10)
actions = np.random.choice([0, 1], 10)
states = np.random.normal(size=(10, 4))

act.get_grads(states, actions, rewards)
crit.get_grads(states, rewards)
This made me think that perhaps it was due to similar naming conventions between the two networks, so I tried making changes there, using two separate tf.Session() objects, and other things, but the problem persists. If I run just a single network (the actor or the critic), everything executes fine and it learns properly. So I'm not sure what's going on here or how to fix it. I'd be grateful for any help.
Full traceback:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-78-c56d39a21e63> in <module>()
13
14 act.get_grads(states, actions, rewards)
---> 15 crit.get_grads(states, rewards)
<ipython-input-76-031f8b9688f5> in get_grads(self, states, returns)
53 feed_dict={
54 self.state: states,
---> 55 self.returns: returns
56 })
57 return grads
...\tensorflow\python\client\session.py in run(self, fetches, feed_dict, options, run_metadata)
903 try:
904 result = self._run(None, fetches, feed_dict, options_ptr,
--> 905 run_metadata_ptr)
906 if run_metadata:
907 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
...\tensorflow\python\client\session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1120 # Create a fetch handler to take care of the structure of fetches.
1121 fetch_handler = _FetchHandler(
-> 1122 self._graph, fetches, feed_dict_tensor, feed_handles=feed_handles)
1123
1124 # Run request and get response.
...\client\session.py in __init__(self, graph, fetches, feeds, feed_handles)
425 """
426 with graph.as_default():
--> 427 self._fetch_mapper = _FetchMapper.for_fetch(fetches)
428 self._fetches = []
429 self._targets = []
...\tensorflow\python\client\session.py in for_fetch(fetch)
243 elif isinstance(fetch, (list, tuple)):
244 # NOTE(touts): This is also the code path for namedtuples.
--> 245 return _ListFetchMapper(fetch)
246 elif isinstance(fetch, dict):
247 return _DictFetchMapper(fetch)
...\tensorflow\python\client\session.py in __init__(self, fetches)
350 """
351 self._fetch_type = type(fetches)
--> 352 self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
353 self._unique_fetches, self._value_indices = _uniquify_fetches(self._mappers)
354
...\tensorflow\python\client\session.py in <listcomp>(.0)
350 """
351 self._fetch_type = type(fetches)
--> 352 self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
353 self._unique_fetches, self._value_indices = _uniquify_fetches(self._mappers)
354
...\tensorflow\python\client\session.py in for_fetch(fetch)
243 elif isinstance(fetch, (list, tuple)):
244 # NOTE(touts): This is also the code path for namedtuples.
--> 245 return _ListFetchMapper(fetch)
246 elif isinstance(fetch, dict):
247 return _DictFetchMapper(fetch)
...\python\client\session.py in __init__(self, fetches)
350 """
351 self._fetch_type = type(fetches)
--> 352 self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
353 self._unique_fetches, self._value_indices = _uniquify_fetches(self._mappers)
354
...\client\session.py in <listcomp>(.0)
350 """
351 self._fetch_type = type(fetches)
--> 352 self._mappers = [_FetchMapper.for_fetch(fetch) for fetch in fetches]
353 self._unique_fetches, self._value_indices = _uniquify_fetches(self._mappers)
354
...\client\session.py in for_fetch(fetch)
240 if fetch is None:
241 raise TypeError('Fetch argument %r has invalid type %r' % (fetch,
--> 242 type(fetch)))
243 elif isinstance(fetch, (list, tuple)):
244 # NOTE(touts): This is also the code path for namedtuples.
TypeError: Fetch argument None has invalid type <class 'NoneType'>
Although I had been calling self.trainable_variables = tf.trainable_variables() within a unique tf.variable_scope(self.scope), that call collects every trainable variable in the graph, not just the ones in the current scope. Because I initialized the networks sequentially, the first network picked up only its own variables, but the second network ended up with the variables of both networks in self.trainable_variables. The fix was simply to be explicit about which variables belong to each network by passing the scope:
self.trainable_variables = tf.trainable_variables(self.scope)
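To make the failure mode concrete, here is a minimal standalone sketch of the mechanism (not my actual networks): tf.trainable_variables() with no argument returns variables from every scope, tf.gradients() returns None for any variable the loss does not depend on, and sess.run() refuses to fetch None.

import tensorflow as tf

tf.reset_default_graph()
with tf.variable_scope('actor'):
    a = tf.get_variable('w', shape=[1])
with tf.variable_scope('critic'):
    c = tf.get_variable('w', shape=[1])
    loss = tf.reduce_sum(c * c)

# Unscoped: [actor/w, critic/w] -> gradients [None, <Tensor>]; fetching the
# None is what raises "Fetch argument None has invalid type".
grads_all = tf.gradients(loss, tf.trainable_variables())

# Scoped: [critic/w] -> gradients [<Tensor>], safe to fetch.
grads_scoped = tf.gradients(loss, tf.trainable_variables('critic'))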