Parallel Keras model training using Python multiprocessing

I am training multiple Keras MLP models simultaneously on a 64-core CPU workstation, using the Python multiprocessing pool to assign one model in training to each CPU.
For each model being trained I use an EarlyStopping and a ModelCheckpoint callback, defined in this manner:
es = EarlyStopping(monitor='val_mse', mode='min', verbose=VERBOSE_ALL, patience=10)
mc = ModelCheckpoint('best_model.h5', monitor='val_mse', mode='min', verbose=VERBOSE_ALL, save_best_only=True)
With a single model, training runs through without any problems.
When I start using the multiprocessing pool, however, I run into issues with the callbacks. An HDF5 model-saving error comes up:
Traceback (most recent call last):
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\callbacks.py", line 1029, in _save_model
self.model.save(filepath, overwrite=True)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1008, in save
signatures, options)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\saving\save.py", line 112, in save_model
model, filepath, overwrite, include_optimizer)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\tensorflow_core\python\keras\saving\hdf5_format.py", line 92, in save_model_to_hdf5
f = h5py.File(filepath, mode='w')
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\h5py\_hl\files.py", line 394, in __init__
swmr=swmr)
File "C:\Users\ICN_admin\Anaconda3\lib\site-packages\h5py\_hl\files.py", line 176, in make_fid
fid = h5f.create(name, h5f.ACC_TRUNC, fapl=fapl, fcpl=fcpl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5f.pyx", line 105, in h5py.h5f.create
OSError: Unable to create file (file signature not found)
This error occurs more or less sporadically; I can catch it with an exception handler and repeat the model training.
But is there a way to work around this issue by setting flags or by using a different checkpoint file format?
Tensorflow version: 2.1.0
Keras version: 2.3.1
Library imports:
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import ModelCheckpoint
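One likely culprit (an assumption, not confirmed in the original post): every worker writes to the same best_model.h5, so two processes can end up truncating the same HDF5 file at once. A minimal sketch of giving each process its own checkpoint path; build_model() and the training arrays are hypothetical placeholders for your existing code:
from multiprocessing import Pool
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def train_one(worker_id):
    # build_model(), x_train, y_train, x_val, y_val stand in for whatever
    # your training code already provides.
    model = build_model()
    es = EarlyStopping(monitor='val_mse', mode='min', patience=10)
    # A unique filename per process avoids concurrent writes to one file.
    mc = ModelCheckpoint('best_model_{}.h5'.format(worker_id),
                         monitor='val_mse', mode='min', save_best_only=True)
    model.fit(x_train, y_train, validation_data=(x_val, y_val),
              callbacks=[es, mc])

if __name__ == '__main__':
    with Pool(processes=64) as pool:
        pool.map(train_one, range(64))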

Related

Error while loading fine-tuned simpletransformer model in Docker Container

I am saving and loading a model using the torch.save() and torch.load() commands.
While loading a fine-tuned simpletransformers model in a Docker container, I am facing this error, which I am not able to resolve:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 594, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 853, in _load
result = unpickler.load()
File "/usr/local/lib/python3.7/dist-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 161, in __setstate__
self.sp_model.Load(self.vocab_file)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece.py", line 367, in Load
return self.LoadFromFile(model_file)
File "/usr/local/lib/python3.7/dist-packages/sentencepiece.py", line 177, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "/home/jupyter/.cache/huggingface/transformers/9df9ae4442348b73950203b63d1b8ed2d18eba68921872aee0c3a9d05b9673c6.00628a9eeb8baf4080d44a0abe9fe8057893de20c7cb6e6423cddbf452f7d4d8": No such file or directory Error #2
If anyone has any idea about it, please let me know.
I am using:
torch ==1.7.1+cu101
sentence-transformers 0.3.9
simpletransformers 0.51.15
transformers 4.4.2
tensorflow 2.2.0
I suggest using state_dict objects: they are plain Python dictionaries, so they can easily be saved, updated, and restored, giving you flexibility when restoring the model later. Here are the recommended save/load methods for saving models with a state_dict:
Save
torch.save(model.state_dict(), PATH)
Load
model = TheModelClass(*args, **kwargs)  # rebuild the architecture first
model.load_state_dict(torch.load(PATH))
model.eval()  # set dropout and batch-norm layers to inference mode

Saved model cannot load layer which contains custom method

I have a model which applies a custom function in the output layer, but the path to this function is static. Whenever I try to load the model on a different system, it cannot find the function because it searches the wrong path: it uses the path at which the function was located on the system where I originally saved the model.
Here is an example of the simplified model:
from tensorflow.keras.models import Model
from tensorflow.keras.losses import mse, mean_squared_error
from tensorflow.keras.layers import Input, LSTM, Dense, Lambda
from tensorflow.keras.optimizers import RMSprop
from helper_functions import poly_transfer
Input_layer = Input(shape=(x_train.shape[1],x_train.shape[2]))
hidden_layer1 = LSTM(units=45, return_sequences=False,stateful=False)(Input_layer)
hidden_layer3 = Dense(25,activation='relu')(hidden_layer1)
speed_out = Lambda(poly_transfer)(hidden_layer3 )
model = Model(inputs=[Input_layer], outputs=[speed_out])
model.compile(loss=mse,
              optimizer=RMSprop(lr=0.0005),
              metrics=['mae', 'mse'])
The function I am speaking of is poly_transfer in the output layer.
If I load my model with tensorflow.keras.models.load_model, it searches, as described, in the wrong directory for poly_transfer and I get the error SystemError: unknown opcode.
Is there a way to tell tensorflow.keras.models.load_model where helper_functions.py (the script containing poly_transfer) lies on a different system?
I use tensorflow 2.0.0.
Edit
This is the error. Please note that the path d:/test_data_pros/restructured/helper_functions.py only existed on the system the model was trained on. The system on which I load the model has the same script with the same function, but naturally at a different path.
2020-12-22 19:43:10.841197: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-12-22 19:43:10.844407: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2699905000 Hz
2020-12-22 19:43:10.844874: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x475d920 executing computations on platform Host. Devices:
2020-12-22 19:43:10.844906: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Host, Default Version
XXX lineno: 11, opcode: 160
Traceback (most recent call last):
File "/home/ebike/workspaces/ebike2x_ws/src/pred_trajectory_pkg/src/trajectory_prediction_node.py", line 122, in <module>
LSTM = lstm_s_g_model(t_pred)
File "/home/ebike/workspaces/ebike2x_ws/src/pred_trajectory_pkg/src/vehicle_models.py", line 126, in __init__
self.model = load_model('/home/ebike/workspaces/ebike2x_ws/src/pred_trajectory_pkg/ml_models/model_test_vivek_150ep.h5')
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/save.py", line 146, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 168, in load_model_from_hdf5
custom_objects=custom_objects)
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/saving/model_config.py", line 55, in model_from_config
return deserialize(config, custom_objects=custom_objects)
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/serialization.py", line 102, in deserialize
printable_module_name='layer')
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/utils/generic_utils.py", line 191, in deserialize_keras_object
list(custom_objects.items())))
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/network.py", line 906, in from_config
config, custom_objects)
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/network.py", line 1852, in reconstruct_from_config
process_node(layer, node_data)
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/network.py", line 1799, in process_node
output_tensors = layer(input_tensors, **kwargs)
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 842, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/home/ebike/.local/lib/python3.6/site-packages/tensorflow_core/python/keras/layers/core.py", line 795, in call
return self.function(inputs, **arguments)
File "d:/test_data_pros/restructured/helper_functions.py", line 11, in poly_transfer
from pyproj import Proj, transform
SystemError: unknown opcode
The problem has nothing to do with paths. When you saved your model, your custom function was serialized as bytecode and saved inside the HDF5 file by Keras, but this format is specific to a Python version, so the file can only be loaded with the same Python version (it may work with newer versions, but not with older versions of Python).
So if you load your model on the same version of Python, it should work fine.
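If retraining is an option, one way to avoid this in the future (a sketch, not part of the original answer) is to wrap the helper in a custom layer instead of a Lambda: custom layers are saved by class name and resolved through custom_objects at load time, so no bytecode gets baked into the HDF5 file.
from tensorflow.keras.layers import Layer
from helper_functions import poly_transfer  # imported on the loading system

class PolyTransfer(Layer):
    # Thin wrapper so that only the class name, not marshalled bytecode,
    # is stored in the saved model.
    def call(self, inputs):
        return poly_transfer(inputs)

# At load time, map the stored name back to the local class:
# model = load_model('model.h5', custom_objects={'PolyTransfer': PolyTransfer})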

How can I convert Tensorflow frozen graph to TF Lite model?

I am using Faster RCNN (the repo that I am using can be found in the link) to detect cars in a video frame. I used Keras 2.2.3 and Tensorflow 1.15.0. I want to deploy and run it on my Android device. Each part of Faster RCNN is implemented in Keras, and in order to deploy it on Android I want to convert them to TF Lite models. The final network, the classifier, has a custom layer called RoiPoolingConv, and I cannot convert the final network to TF Lite. At first, I tried the following:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_keras_model_file(
    'model_classifier_with_architecture.h5',
    custom_objects={'RoiPoolingConv': RoiPoolingConv})
tfmodel = converter.convert()
open('model_cls.tflite', 'wb').write(tfmodel)
This gives the following error:
Traceback (most recent call last):
File "Keras-FasterRCNN/model_to_tflite.py", line 26, in <module>
custom_objects={"RoiPoolingConv": RoiPoolingConv})
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/lite/python/lite.py", line 747, in from_keras_model_file
keras_model = _keras.models.load_model(model_file, custom_objects)
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/keras/saving/save.py", line 146, in load_model
return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 212, in load_model_from_hdf5
custom_objects=custom_objects)
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/keras/saving/model_config.py", line 55, in model_from_config
return deserialize(config, custom_objects=custom_objects)
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/keras/layers/serialization.py", line 89, in deserialize
printable_module_name='layer')
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/keras/utils/generic_utils.py", line 192, in deserialize_keras_object
list(custom_objects.items())))
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 1131, in from_config
process_node(layer, node_data)
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/network.py", line 1089, in process_node
layer(input_tensors, **kwargs)
File "/home/alp/.local/lib/python3.6/site-packages/keras/engine/base_layer.py", line 475, in __call__
previous_mask = _collect_previous_mask(inputs)
File "/home/alp/.local/lib/python3.6/site-packages/keras/engine/base_layer.py", line 1441, in _collect_previous_mask
mask = node.output_masks[tensor_index]
AttributeError: 'Node' object has no attribute 'output_masks'
The workaround I tried was to convert the Keras models to TensorFlow frozen graphs and then do the TF Lite conversion on those frozen graphs. This yields the following error:
Traceback (most recent call last):
File "/home/alp/.local/bin/toco_from_protos", line 11, in <module>
sys.exit(main())
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/lite/toco/python/toco_from_protos.py", line 59, in main
app.run(main=execute, argv=[sys.argv[0]] + unparsed)
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/alp/.local/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/alp/.local/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/alp/.local/lib/python3.6/site-packages/tensorflow/lite/toco/python/toco_from_protos.py", line 33, in execute
output_str = tensorflow_wrap_toco.TocoConvert(model_str, toco_str, input_str)
Exception: We are continually in the process of adding support to TensorFlow Lite for more ops. It would be helpful if you could inform us of how this conversion went by opening a github issue at https://github.com/tensorflow/tensorflow/issues/new?template=40-tflite-op-request.md
and pasting the following:
Some of the operators in the model are not supported by the standard TensorFlow Lite runtime. If those are native TensorFlow operators, you might be able to use the extended runtime by passing --enable_select_tf_ops, or by setting target_ops=TFLITE_BUILTINS,SELECT_TF_OPS when calling tf.lite.TFLiteConverter(). Otherwise, if you have a custom implementation for them you can disable this error with --allow_custom_ops, or by setting allow_custom_ops=True when calling tf.lite.TFLiteConverter(). Here is a list of builtin operators you are using: ADD, CAST, CONCATENATION, CONV_2D, DEPTHWISE_CONV_2D, FULLY_CONNECTED, MUL, PACK, RESHAPE, RESIZE_BILINEAR, SOFTMAX, STRIDED_SLICE. Here is a list of operators for which you will need custom implementations: AddV2.
Is there a way to convert a model with a custom layer to a TF Lite model?
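The second traceback itself points at one route: let the converter fall back to TensorFlow kernels for unsupported builtins and pass truly custom ops through. A sketch against the TF 1.15 frozen-graph API; the file name and tensor names are hypothetical placeholders for your graph:
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_frozen_graph(
    'model_classifier_frozen.pb',   # placeholder: your frozen graph
    input_arrays=['input_1'],       # placeholder: your input tensor names
    output_arrays=['dense_class'])  # placeholder: your output tensor names
# Run unsupported native TF ops (e.g. AddV2) via the select-TF-ops fallback.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS,
                                       tf.lite.OpsSet.SELECT_TF_OPS]
# Pass any remaining custom ops through, to be implemented in the runtime.
converter.allow_custom_ops = True
open('model_cls.tflite', 'wb').write(converter.convert())
Note that SELECT_TF_OPS requires linking the Flex delegate into the Android app, and allow_custom_ops only defers the problem: the custom ops still need an implementation registered with the TF Lite runtime on the device.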

Trained Keras Model fails to load with load_model

I have trained a Keras model with the Tensorflow backend. It was saved with model.save. I now want to reload the model using load_model; however, I get the following error:
Traceback (most recent call last):
File "<ipython-input-235-387752c910a4>", line 1, in <module>
load_model('MyModel.h5')
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\models.py", line 243, in load_model
model = model_from_config(model_config, custom_objects=custom_objects)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\models.py", line 317, in model_from_config
return layer_module.deserialize(config, custom_objects=custom_objects)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
printable_module_name='layer')
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
list(custom_objects.items())))
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\engine\topology.py", line 2514, in from_config
process_layer(layer_data)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\engine\topology.py", line 2500, in process_layer
custom_objects=custom_objects)
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
printable_module_name='layer')
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
list(custom_objects.items())))
File "C:\Anaconda\envs\tensorflow\lib\site-packages\keras\models.py", line 1367, in from_config
if 'class_name' not in config[0] or config[0]['class_name'] == 'Merge':
KeyError: 0
From what I read, there seems to be a bug in Keras when a model that was trained with an older version of Keras is loaded with a more recent version, so there might be a version mismatch. However, I couldn't find a report that corresponds to my situation. Downgrading Keras or retraining is not an option.
Has anyone had this issue and maybe even found a solution? I would appreciate it a lot!
Thanks!
For future reference: it is an issue in the config files. Keras 2.2.4 has a fix for this. From the release notes:
Keras 2.2.4
This is a bugfix release, addressing two issues:
Ability to save a model when a file with the same name already exists.
Issue with loading legacy config files for the Sequential model.
So I ended up creating a new virtual environment with the most recent TF and Keras versions.
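A quick sanity check (a suggestion, not from the original answer) before retrying load_model in the new environment:
import keras
import tensorflow as tf
# load_model needs Keras >= 2.2.4 for the legacy-config fix.
print('Keras:', keras.__version__)
print('TensorFlow:', tf.__version__)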

RuntimeError during processing .jpg file (exception 20) retrain inceptionV3 tensorflow

My system setup:
OS: Ubuntu 16.04LTS
GPU: GTX1060
tensorflow version: tensorflow-gpu (1.6.0)
I am trying to retrain an InceptionV3 classifier model which I trained on the MSCeleb-1M dataset using https://github.com/tensorflow/models/blob/master/research/slim/train_image_classifier.py.
I then tried to retrain it using custom images and classes with https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/image_retraining/retrain.py.
I noticed that the script targets an outdated InceptionV3 architecture, so I modified the bottleneck tensor and input tensor names to match the nodes of my retrained InceptionV3 model. However, when feeding my own images into the retrain script, I keep hitting this error:
INFO:tensorflow:Creating bottleneck at /home/m360/MachineLearning/models/msceleb-small-inception-v3/bottleneck/tulips/5524946579_307dc74476.jpg_inception_v3.txt
Traceback (most recent call last):
File "tensorflow/examples/image_retraining/retrain.py", line 1486, in <module>
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 126, in run
_sys.exit(main(argv))
File "tensorflow/examples/image_retraining/retrain.py", line 1187, in main
bottleneck_tensor, FLAGS.architecture)
File "tensorflow/examples/image_retraining/retrain.py", line 500, in cache_bottlenecks
resized_input_tensor, bottleneck_tensor, architecture)
File "tensorflow/examples/image_retraining/retrain.py", line 442, in get_or_create_bottleneck
bottleneck_tensor)
File "tensorflow/examples/image_retraining/retrain.py", line 397, in create_bottleneck_file
str(e)))
RuntimeError: Error during processing file /home/m360/MachineLearning/my_dataset/flower_photos/tulips/5524946579_307dc74476.jpg (20)
I do not understand what has gone wrong, as there is no documentation about this specific exception code anywhere online as far as I have searched. I suspect it might be a problem with the decode_jpeg function in the script, but I could not crack my head around it.
Please help to enlighten me. Thank you very much.
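create_bottleneck_file in retrain.py catches the underlying decode error and re-raises it as this generic RuntimeError, so one way to narrow it down (a sketch, assuming the glob pattern matches your dataset layout) is to run TensorFlow's own JPEG decoder over the folder and see which files it rejects:
import glob
import tensorflow as tf

path_ph = tf.placeholder(tf.string)
decoded = tf.image.decode_jpeg(tf.read_file(path_ph), channels=3)

with tf.Session() as sess:
    for path in glob.glob('/home/m360/MachineLearning/my_dataset/flower_photos/*/*.jpg'):
        try:
            sess.run(decoded, feed_dict={path_ph: path})
        except tf.errors.InvalidArgumentError as err:
            # Files that land here are corrupt or not actually JPEGs.
            print('undecodable:', path, '--', err.message)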
