I am saving and loading a model using torch.save() and torch.load() commands.
While loading a fine-tuned simple transformer model in Docker Container, I am facing this error which I am not able to resolve:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
File "/usr/local/lib/python3.7/dist-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 161, in __setstate__
self.sp_model.Load(self.vocab_file)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece.py", line 177, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "/home/jupyter/.cache/huggingface/transformers/9df9ae4442348b73950203b63d1b8ed2d18eba68921872aee0c3a9d05b9673c6.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8": No such file or directory Error #2
If anyone has any idea about it, please let me know.
I am using:
torch ==1.7.1+cu101
sentence-transformers 0.3.9
simpletransformers 0.51.15
transformers 4.4.2
tensorflow 2.2.0
I suggest using state_dict objects - the Python dictionaries as they can be easily saved, updated and restored giving you a flexibility for restoring the model later. Here are the recommended Save/Load methods for saving models with state_dict:
Save
torch.save(model.state_dict(), PATH)
Load
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH))
model.eval()
Related
Does torch.save work on hugging face models (I am using vit)? I assumed yes.
My error:
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 379, in save
_save(obj, opened_zipfile, pickle_module, pickle_protocol)
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 499, in _save
zip_file.write_record(name, storage.data_ptr(), num_bytes)
OSError: [Errno 116] Stale file handle
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1815, in <module>
main()
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1748, in main
train(args=args)
File "/shared/rsaas/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1795, in train
meta_train_iterations_ala_l2l(args, args.agent, args.opt, args.scheduler)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/training/meta_training.py", line 213, in meta_train_iterations_ala_l2l
log_train_val_stats(args, args.it, step_name, train_loss, train_acc, training=True)
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 55, in log_train_val_stats
_log_train_val_stats(args=args,
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/logging_uu/wandb_logging/supervised_learning.py", line 113, in _log_train_val_stats
save_for_supervised_learning(args, ckpt_filename='ckpt.pt')
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/checkpointing_uu/supervised_learning.py", line 54, in save_for_supervised_learning
torch.save({'training_mode': args.training_mode,
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 380, in save
return
File "/home/miranda9/miniconda3/envs/metalearning_gpu/lib/python3.9/site-packages/torch/serialization.py", line 259, in __exit__
self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:298] . unexpected pos 2736460544 vs 2736460432
my code:
# - ckpt
args_pickable: Namespace = uutils.make_args_pickable(args)
# note not saving any objects, to make sure checkpoint is loadable later with no problems
torch.save({'training_mode': args.training_mode,
'it': args.it,
'epoch_num': args.epoch_num,
# 'args': args_pickable, # some versions of this might not have args!
# decided only to save the dict version to avoid this ckpt not working, make it args when loading
'args_dict': vars(args_pickable), # some versions of this might not have args!
'model_state_dict': get_model_from_ddp(args.model).state_dict(),
'model_str': str(args.model), # added later, to make it easier to check what optimizer was used
'model_hps': args.model_hps,
'model_option': args.model_option,
'opt_state_dict': args.opt.state_dict(),
'opt_str': str(args.opt),
'opt_hps': args.opt_hps,
'opt_option': args.opt_option,
'scheduler_str': str(args.scheduler),
'scheduler_state_dict': try_to_get_scheduler_state_dict(args.scheduler),
'scheduler_hps': args.scheduler_hps,
'scheduler_option': args.scheduler_option,
},
pickle_module=pickle,
f=args.log_root / ckpt_filename)
if this is not the right way to checkpoint hugging face (HF) models, what is?
cross: hf discussion forum: https://discuss.huggingface.co/t/torch-save-with-hugging-face-models-fails/25034
I have a checkpoint file saved after training a model in Pytorch. I have to inspect it in a different module so I tried to load the checkpoint using the following code.
map_location = lambda storage, loc: storage
checkpoint = torch.load("model.pt", map_location=map_location)
But it is raising ModuleNotFoundError issue, which I couldn't find a way to resolve.
The error traceback :
Traceback (most recent call last):
File "main.py", line 11, in <module>
model = loadmodel(hook_feature)
File "/home/../model_loader.py", line 21, in loadmodel
checkpoint = torch.load(settings.MODEL_FILE, map_location=map_location)
File "/home/../.conda/envs/envreporting/lib/python3.6/site-packages/torch/serialization.py", line 584, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/home/../.conda/envs/envreporting/lib/python3.6/site-packages/torch/serialization.py", line 842, in _load
result = unpickler.load()
ModuleNotFoundError: No module named 'parse_config'
I couldn't find an already existing issue relevant to this one.
Is it possible that you have used https://github.com/victoresque/pytorch-template for training the model ? In that case, the project also saves its config in the checkpoint and you also need to import parse_config.py file in order to load it.
I am unable to download and use a model I saved earlier from a online-repository. Here's the code:
model = Model().double() # Model is defined in another class
state_dict = torch.hub.load_state_dict_from_url(r'https://filebin.net/j2977ux7kts41aft/checkpoint_best.pt?t=wjbujfoo')
model.load_state_dict(state_dict)
model.eval()
Which gives me the following error:
Traceback (most recent call last):
File "/path/file.py", line 47, in <module>
state_dict = torch.hub.load_state_dict_from_url(r'https://filebin.net/j2977ux7kts41aft/checkpoint_best.pt?t=wjbujfoo')
File "anaconda3/envs/torch_env/lib/python3.6/site-packages/torch/hub.py", line 466, in load_state_dict_from_url
return torch.load(cached_file, map_location=map_location)
File "/anaconda3/envs/torch_env/lib/python3.6/site-packages/torch/serialization.py", line 386, in load
return _load(f, map_location, pickle_module, **pickle_load_args)
File "anaconda3/envs/torch_env/lib/python3.6/site-packages/torch/serialization.py", line 563, in _load
magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '\x0a'.
The model resides in:
https://filebin.net/j2977ux7kts41aft/checkpoint_best.pt?t=wjbujfoo
Note that I can perfectly download it manually, and then use torch.load(path) to load it without errors, but I need to do it from code! Could it be that the serializing when downloading from url somehow messes up the pickle encoding?
Edit: I don't have to use filebin, any online-storage that supports what I try to do will suffice.
This code with link from 'download button' and 'map_location' parameter works fine for me:
state_dict = torch.hub.load_state_dict_from_url(r'https://filebin.net/j2977ux7kts41aft/checkpoint_best.pt?t=wjbujfoo', map_location=torch.device('cpu'))
The problem was indeed within the environment configuration. I created the model with PyTorch 1.0.2 and then updated to 1.2.0 in order to use torch.hub. This gave me the pickle error. After training a new model in 1.2.0, the error is now gone.
Hope this help someone in the future :)
I have trained a Keras model with Tensorflow backend. It was saved with model.save. I now want to reload the model using model_load, however, I get the following error:
Traceback (most recent call last):
File "<ipython-input-235-387752c910a4>", line 1, in <module>
load_model('MyModel.h5')
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\models.py", line 243, in load_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\models.py", line 317, in model_from_config
return layer_module.deserialize(config, custom_objects=custom_objects)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
printable_module_name='layer')
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
list(custom_objects.items())))
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\engine\topology.py", line 2514, in from_config
process_layer(layer_data)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\engine\topology.py", line 2500, in process_layer
custom_objects=custom_objects)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
printable_module_name='layer')
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
list(custom_objects.items())))
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\models.py", line 1367, in from_config
if 'class_name' not in config[0] or config[0]['class_name'] == 'Merge':
KeyError: 0
From what I read, there seems to be a bug in Keras when a model that was trained with an older version of Keras is loaded with a recent version. So there might be a version mismatch. However, I couldn't find a report that corresponds to my situation. Downgrading Keras or retraining is not an option.
Did anyone have this issue and maybe even found a solution? I would appreciate it a lot!
Thanks!
For future reference: It is an issue in the config files. Keras 2.2.4 has a fix for this:
Keras 2.2.4
#fchollet fchollet released this on Oct 3 ยท 79 commits to master since this release
Assets 2
This is a bugfix release, addressing two issues:
Ability to save a model when a file with the same name already exists.
Issue with loading legacy config files for the Sequential model.
So I ended up creating a new virtual environment with the most recent TF and Keras versions.
I am trying to do the following:
Run kmeans clustering using tensorflow (1.8.0)
Save the model using kmeans.export_savedmodel
Use the model using tf.saved_model.loader.load
I am using the exact script at: https://www.tensorflow.org/api_docs/python/tf/contrib/factorization/KMeansClustering
I am using following code for saving the model:
Input Reciever:
def serving_input_receiver_fn():
feature_spec = {"x": tf.FixedLenFeature(dtype=tf.float32, shape=[2])}
model_placeholder = tf.placeholder(dtype=tf.string,shape=[None],name='input')
receiver_tensors = {"model_inputs": model_placeholder}
features = tf.parse_example(model_placeholder, feature_spec)
return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)
Export:
kmeans.export_savedmodel("/path/", serving_input_receiver_fn)
To import I use:
tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING],"/path")
On last step I run into this issue:
Traceback (most recent call last):
File "restore_model.py", line 6, in <module>
tf.saved_model.loader.load(sess, [tf.saved_model.tag_constants.SERVING], "/Users/z001t3k/work/codebase/ContentPipeline/cep-scripts/cep/datacollection/algorithms/cluster_model/1525963476")
File "/Users/z001t3k/python_virtualenvs/tensorflow/lib/python2.7/site-packages/tensorflow/python/saved_model/loader_impl.py", line 219, in load
saver = tf_saver.import_meta_graph(meta_graph_def_to_load, **saver_kwargs)
File "/Users/z001t3k/python_virtualenvs/tensorflow/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1955, in import_meta_graph
**kwargs)
File "/Users/z001t3k/python_virtualenvs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/meta_graph.py", line 743, in import_scoped_meta_graph
producer_op_list=producer_op_list)
File "/Users/z001t3k/python_virtualenvs/tensorflow/lib/python2.7/site-packages/tensorflow/python/util/deprecation.py", line 432, in new_func
return func(*args, **kwargs)
File "/Users/z001t3k/python_virtualenvs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 460, in import_graph_def
_RemoveDefaultAttrs(op_dict, producer_op_list, graph_def)
File "/Users/z001t3k/python_virtualenvs/tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/importer.py", line 227, in _RemoveDefaultAttrs
op_def = op_dict[node.op]
KeyError: u'NearestNeighbors'
Tensorflow is having trouble locating the NearestNeighbors op, which is part of the graph you're loading. Ops defined in contrib are loaded dynamically when you import the corresponding contrib package in Python.
So just add
import tensorflow.contrib.factorization
before loading the SavedModel.