I have been trying to use Azure Machine Learning for faster model training.
I am submitting a training .py file, and within that script I access my training data; however, I am getting errors related to that.
I have tried the following code:
import os
import numpy as np
from azureml.core import Workspace, Dataset

subscription_id = 'my_id'
resource_group = 'my_resource_group'
workspace_name = 'my_workspace'

workspace = Workspace(subscription_id, resource_group, workspace_name)
dataset = Dataset.get_by_name(workspace, name='my-dataset')

with dataset.mount() as mount_context:
    print(os.listdir(mount_context.mount_point))
    data = np.load('my-data.npy')
But the training run fails, with the following error in the output logs:
File "train.py", line 29, in <module>
data = np.load('my-data.npy')
File "/azureml-envs/azureml_167f4dd4c85f61389bb53e00383dafbe/lib/python3.6/site-packages/numpy/lib/npyio.py", line 416, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'my-data.npy'
I assume I am mounting my dataset incorrectly on the remote machine, but I am unsure what the correct way to mount it, or to submit the training job, is.
Does the print statement return the directory listing correctly?
Here is a sample notebook that shows how to load data in training: https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/work-with-data/datasets-tutorial/scriptrun-with-data-input-output
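If the listing does show the file, the remaining issue is that np.load is given a bare file name, which is resolved against the working directory rather than the mount. A minimal sketch of the fix, assuming my-data.npy sits at the root of the mounted dataset:

import os
import numpy as np

with dataset.mount() as mount_context:
    # The dataset is mounted at a temporary path, not the current working
    # directory, so build the full path from mount_point.
    data = np.load(os.path.join(mount_context.mount_point, 'my-data.npy'))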
Related
I recently deployed Jupyter into Kubernetes, and now I want to read my data and do the data cleaning. While I'm running:
data = pd.read_csv("home/ghofrane21/data/Les indices des prix.csv", header=None)
I get this error:
FileNotFoundError: [Errno 2] File home/ghofrane21/data/Les indices des prix.csv does not exist
The file already exists.
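One thing worth checking (an assumption, since the screenshot showing the file is not reproduced here): the path has no leading slash, so pandas resolves it relative to the notebook's current working directory inside the pod. A minimal sketch using an absolute path instead:

import os
import pandas as pd

# Show where the notebook is actually running from, then read the file by its
# absolute path rather than a path relative to the working directory.
print(os.getcwd())
data = pd.read_csv("/home/ghofrane21/data/Les indices des prix.csv", header=None)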
I have fine-tuned a BERT model for semantic role labeling, using AllenNLP. This produces a model directory (the serialization directory, if I recall correctly) that contains the following:
best.th
config.json
meta.json
metrics_epoch_0.json
metrics_epoch_10.json
metrics_epoch_11.json
metrics_epoch_12.json
metrics_epoch_13.json
metrics_epoch_14.json
metrics_epoch_1.json
metrics_epoch_2.json
metrics_epoch_3.json
metrics_epoch_4.json
metrics_epoch_5.json
metrics_epoch_6.json
metrics_epoch_7.json
metrics_epoch_8.json
metrics_epoch_9.json
metrics.json
model_state_e14_b0.th
model_state_e15_b0.th
model.tar.gz
out.log
training_state_e14_b0.th
training_state_e15_b0.th
vocabulary
Where vocabulary is a folder with labels.txt and non_padded_namespaces.txt.
I'd now like to use this fine-tuned BERT model as the initialization when learning a related task, event extraction, using this library: https://github.com/wilsonlau-uw/BERT-EE (i.e. I want to exploit some transfer learning). The config.ini file has a line for fine_tuned_path, where I can specify an already-fine-tuned model that I want to use. I provided the path to the AllenNLP serialization directory, and I got the following error:
2022-04-05 13:07:28,112 - INFO - setting seed 23
2022-04-05 13:07:28,113 - INFO - loading fine tuned model in /data/projects/SRL/ser_pure_clinical_bert-large_thyme_and_ontonotes/
Traceback (most recent call last):
File "main.py", line 65, in <module>
model = BERT_EE()
File "/data/projects/SRL/BERT-EE/model.py", line 88, in __init__
self.__build(self.use_fine_tuned)
File "/data/projects/SRL/BERT-EE/model.py", line 118, in __build
self.__get_pretrained(self.fine_tuned_path)
File "/data/projects/SRL/BERT-EE/model.py", line 110, in __get_pretrained
self.__model = BERT_EE_model.from_pretrained(path)
File "/home/richier/anaconda3/envs/allennlp/lib/python3.7/site-packages/transformers/modeling_utils.py", line 1109, in from_pretrained
f"Error no file named {[WEIGHTS_NAME, TF2_WEIGHTS_NAME, TF_WEIGHTS_NAME + '.index', FLAX_WEIGHTS_NAME]} found in "
OSError: Error no file named ['pytorch_model.bin', 'tf_model.h5', 'model.ckpt.index', 'flax_model.msgpack'] found in directory /data/projects/SRL/ser_pure_clinical_bert-large_thyme_and_ontonotes/ or `from_tf` and `from_flax` set to False.
Of course, the serialization directory doesn't have any of those files, hence the error. I tried unzipping model.tar.gz but it only has:
config.json
weights.th
vocabulary/
vocabulary/.lock
vocabulary/labels.txt
vocabulary/non_padded_namespaces.txt
meta.json
Digging into the codebase of the GitHub repo I linked above, I can see that BERT_EE_model inherits from BertPreTrainedModel from the transformers library, so the trick would seem to be getting the AllenNLP model into a format that BertPreTrainedModel.from_pretrained() can load...?
Any help would be greatly appreciated!
I believe I have figured this out. Basically, I had to re-load my model archive, access the underlying model and tokenizer, and then save those:
from allennlp.models.archival import load_archive
# These imports are not used directly, but they register the SRL model and dataset
# reader classes so that load_archive can reconstruct the archived model.
from allennlp_models.structured_prediction import SemanticRoleLabeler, srl, srl_bert

archive = load_archive('ser_pure_clinical_bert-large_thyme_and_ontonotes/model.tar.gz')

bert_model = archive.model.bert_model  # type is transformers.models.bert.modeling_bert.BertModel
bert_model.save_pretrained('ser_pure_clinical_bert-large_thyme_and_ontonotes_save_pretrained/')

bert_tokenizer = archive.dataset_reader.bert_tokenizer
bert_tokenizer.save_pretrained('ser_pure_clinical_bert-large_thyme_and_ontonotes_save_pretrained/')
(This last part is probably less interesting to most folks, but also note that in the config.ini I mentioned, the directory 'ser_pure_clinical_bert-large_thyme_and_ontonotes_save_pretrained' needed to be passed to pretrained_model_name_or_path, not to fine_tuned_path.)
I was attempting to create an RNN that would generate text from Shakespearean literature, as taught by this TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/text_generation
When I attempted to load the weights, my program would crash with the error message: AttributeError: 'NoneType' object has no attribute 'endswith'
Here is the line of code that crashes the program:
model.load_weights(tf.train.latest_checkpoint(check_dir))
Here is the pastebin of my code: https://pastebin.com/KqmD0phL
Here is the full error message:
Traceback (most recent call last):
File "D:/Python/PycharmProjects/untitled/Shakespeare.py", line 118, in <module>
main()
File "D:/Python/PycharmProjects/untitled/Shakespeare.py", line 108, in main
model.load_weights(tf.train.latest_checkpoint(check_dir))
File "C:\Users\marco\venv\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 182, in load_weights
return super(Model, self).load_weights(filepath, by_name)
File "C:\Users\marco\venv\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1335, in load_weights
if _is_hdf5_filepath(filepath):
File "C:\Users\marco\venv\lib\site-packages\tensorflow_core\python\keras\engine\network.py", line 1645, in _is_hdf5_filepath
return (filepath.endswith('.h5') or filepath.endswith('.keras') or
AttributeError: 'NoneType' object has no attribute 'endswith'
I just ran into the same problem. In my case, the reason for this error message was an invalid path for the directory containing the model training checkpoints. So I propose checking whether this line
check_dir = './training_checkpoints'
in your code is right. You could at least try changing it to the full path of the directory containing the checkpoint data:
check_dir = '/full/path/to/training_checkpoints'
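The underlying cause is that tf.train.latest_checkpoint() returns None when it cannot find a checkpoint in the given directory, and load_weights() then calls .endswith() on that None. A minimal sketch of a defensive check (paths hypothetical, and `model` built as in the tutorial):

import os
import tensorflow as tf

check_dir = './training_checkpoints'  # adjust to wherever the checkpoints were written
latest = tf.train.latest_checkpoint(check_dir)
if latest is None:
    # No checkpoint found; this is exactly what leads to the 'NoneType' .endswith error.
    raise FileNotFoundError("No checkpoint found in " + os.path.abspath(check_dir))
model.load_weights(latest)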
I am also facing the same issue.
You can try this while loading weights:
latest = tf.train.latest_checkpoint(checkpoint_dir)
print(latest)  # verify that a checkpoint path was actually found (not None)
model.load_weights(latest)
This loads the model with the weights from the latest checkpoint.
I had the exact same issue with a different tutorial. From what I can tell, there appears to be a disparity between TensorFlow-specific calls and tf.keras calls.
I found a mention in another post about saving with the Keras API and loading with the Keras API, which makes sense to me.
I hope this helps.
I used:
callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='.' + os.sep + 'logs',
                                   histogram_freq=0,
                                   embeddings_freq=0,
                                   update_freq='epoch',
                                   profile_batch=0),
    # added this (which doesn't profile) to get the example to work
    # When saving a model's weights, tf.keras defaults to the checkpoint format.
    # Pass save_format='h5' to use HDF5 (or pass a filename that ends in .h5).
    tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_prefix,
                                       # save_weights_only=True,
                                       verbose=1),
    PrintLR()
]
then loaded the model explicitly with:
# The tutorial indicates to save weights only, but I found this to be a problem / concern
# between TensorFlow and Keras calls, so save the whole model (who cares anyway).
# model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
# Load the specific model by name:
model = tf.keras.models.load_model(checkpoint_dir + os.sep + 'ckpt_12.h5')
eval_loss, eval_acc = model.evaluate(eval_dataset)
print('Eval loss: {}, Eval Accuracy: {}'.format(eval_loss, eval_acc))
I'm using Python to retrieve a Blob image from Azure storage and then send it to Custom Vision for a prediction.
This is the code:
import io

from azure.storage.blob import BlockBlobService
from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient

block_blob_service = BlockBlobService(
    account_name=account_name,
    account_key=account_key
)

image_data = io.BytesIO()
block_blob_service.get_blob_to_stream(
    container_name,
    blob_name,
    image_data,
    max_connections=2
)

predictor = CustomVisionPredictionClient(
    cv_prediction_key,
    endpoint=cv_endpoint
)

# This call breaks with the below error message
results = predictor.predict_image(
    cv_project_id,
    image_data.getvalue(),
    iteration_id=cv_iteration_id
)
However, executing the predict_image function results in the following error:
System.Private.CoreLib: Exception while executing function: Functions.ReloadPostgres. System.Private.CoreLib: Result: Failure
Exception: HttpOperationError: Operation returned an invalid status code 'Resource Not Found'
Stack: File "~/.local/share/virtualenvs/py_func_app-GVYYSfCn/lib/python3.6/site-packages/azure/functions_worker/dispatcher.py", line 288, in _handle__invocation_request
self.__run_sync_func, invocation_id, fi.func, args)
File "~/.pyenv/versions/3.6.8/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "~/.local/share/virtualenvs/py_func_app-GVYYSfCn/lib/python3.6/site-packages/azure/functions_worker/dispatcher.py", line 347, in __run_sync_func
return func(**params)
File "~/py_func_app/ReloadPostgres/__init__.py", line 14, in main
data_handler.fetch_prediction_data()
File "~/py_func_app/Shared_Code/data_handler.py", line 127, in fetch_prediction_data
cv_handler.predict_image(image_data.getvalue(), cv_model)
File "~/py_func_app/Shared_Code/custom_vision.py", line 30, in predict_image
raise e
File "~/py_func_app/Shared_Code/custom_vision.py", line 26, in predict_image
iteration_id=cv_model.cv_iteration_id
File "~/.local/share/virtualenvs/py_func_app-GVYYSfCn/lib/python3.6/site-packages/azure/cognitiveservices/vision/customvision/prediction/custom_vision_prediction_client.py", line 215, in predict_image
raise HttpOperationError(self._deserialize, response)
Below I am providing a similar example of a Custom Vision prediction using an image URL; you can change it to use an image file:
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 19 11:04:54 2019
#author: moverm
"""
#from azure.storage.blob import BlockBlobService
from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient
#block_blob_service = BlockBlobService(
# account_name=account_name,
# account_key=account_key
#)
#
#fp = io.BytesIO()
#block_blob_service.get_blob_to_stream(
# container_name,
# blob_name,
# fp,
# max_connections=2
#)
predictor = CustomVisionPredictionClient(
"prediction-key",
endpoint="https://southcentralus.api.cognitive.microsoft.com"
)
# This call breaks with the below error message
#results = predictor.predict_image(
# 'prediction-key',
# image_data.getvalue(),
# iteration_id=cv_iteration_id
#)
test_img_url = "https://pointsprizes-blog.s3-accelerate.amazonaws.com/316.jpg"
results = predictor.predict_image_url("project-Id", "Iteration-Id", url=test_img_url)
# Display the results.
for prediction in results.predictions:
    print("\t" + prediction.tag_name + ": {0:.2f}%".format(prediction.probability * 100))
Basically, the issue is related to the endpoint. Use https://southcentralus.api.cognitive.microsoft.com as the endpoint.
It should work, and you should be able to see the prediction probability.
Hope it helps.
I tried to reproduce your issue and got a similar one, which was caused by using the incorrect endpoint from the Azure portal when I created a Cognitive Service in the Japan East region.
The endpoint shown in the portal is https://japaneast.api.cognitive.microsoft.com/customvision/training/v1.0 for version 1, but the azure-cognitiveservices-vision-customvision PyPI page points out the correct endpoint, which should be https://{AzureRegion}.api.cognitive.microsoft.com.
So I got a similar issue to yours when using the incorrect endpoint. My code is the same as yours; the only difference is the running environment: yours runs on Azure Functions, while mine is a console script.
Meanwhile, according to the source code custom_vision_prediction_client.py of the Azure Cognitive Services SDK for Custom Vision, you can see the line base_url = '{Endpoint}/customvision/v2.0/Prediction', which concatenates the endpoint you pass with /customvision/v2.0/Prediction to generate the real endpoint for calling the prediction API.
Therefore, as #MohitVerma-MSFT said, use https://<your cognitive service region>.api.cognitive.microsoft.com for the current version of the Python package.
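Applied to the code in the question, a minimal sketch (keys and IDs are the placeholders from the question, the region is an assumption, and image_data is the blob stream built exactly as in the question):

from azure.cognitiveservices.vision.customvision.prediction import CustomVisionPredictionClient

predictor = CustomVisionPredictionClient(
    cv_prediction_key,
    # Region-level base URL only; the SDK appends /customvision/v2.0/Prediction itself.
    endpoint="https://southcentralus.api.cognitive.microsoft.com"
)
results = predictor.predict_image(
    cv_project_id,
    image_data.getvalue(),
    iteration_id=cv_iteration_id
)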
As an additional note, there is an announcement of an important update for customvision.ai that you should be aware of; it may affect your current code soon.
While loading my dataset using python code on the AWS server using Spyder, I get the following error:
File "<ipython-input-19-7b2e7b5812b3>", line 1, in <module>
ffemq12 = load_h2odataframe_returns(femq12) #; ffemq12 = add_fold_column(ffemq12)
File "D:\Ashwin\do\init_sm.py", line 106, in load_h2odataframe_returns
fr=h2o.H2OFrame(python_obj=returns)
File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 106, in __init__
column_names, column_types, na_strings, skipped_columns)
File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 147, in _upload_python_object
self._upload_parse(tmp_path, destination_frame, 1, separator, column_names, column_types, na_strings, skipped_columns)
File "C:\Program Files\Anaconda2\lib\site-packages\h2o\frame.py", line 321, in _upload_parse
ret = h2o.api("POST /3/PostFile", filename=path)
File "C:\Program Files\Anaconda2\lib\site-packages\h2o\h2o.py", line 104, in api
return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
File "C:\Program Files\Anaconda2\lib\site-packages\h2o\backend\connection.py", line 415, in request
raise H2OConnectionError("Unexpected HTTP error: %s" % e)
I am running this Python code in Spyder on the AWS server. The code works fine for up to half the dataset (1.5 GB of 3 GB) but throws an error if I increase the data size. I tried increasing the RAM from 61 GB to 122 GB, but it still gives me the same error.
Loading the data file
femq12 = pd.read_csv(r"H:\Ashwin\dta\datafile.csv")
ffemq12 = load_h2odataframe_returns(femq12)
Initializing h2o
h2o.init(nthreads = -1,max_mem_size="150G")
Loading h2o
Connecting to H2O server at http://127.0.0.1:54321... successful.
-------------------------- ----------------------------------------
H2O cluster uptime:        01 secs
H2O cluster timezone:      UTC
H2O data parsing timezone: UTC
H2O cluster version:       3.22.1.3
H2O cluster version age:   18 days
H2O cluster total nodes:   1
H2O cluster free memory:   133.3 Gb
H2O cluster total cores:   16
H2O cluster allowed cores: 16
H2O cluster status:        accepting new members, healthy
H2O connection proxy:
H2O internal security:     False
H2O API Extensions:        Algos, AutoML, Core V3, Core V4
Python version:            2.7.15 final
-------------------------- ----------------------------------------
I suspect it is a memory issue. But even after increasing RAM and max_mem_size, the dataset is not loading.
Any ideas to fix the error would be appreciated. Thank you.
Solution: Don't use pd.read_csv() and h2o.H2OFrame(); instead, use h2o.import_file() directly.
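For example, a minimal sketch (the path is the one from your snippet and the init parameters are the ones you already use; the file must be readable by the H2O cluster itself, see the note about S3 below):

import h2o

h2o.init(nthreads=-1, max_mem_size="150G")
# import_file asks the H2O cluster to read and parse the file itself,
# instead of building a pandas frame and uploading it over an HTTP POST.
ffemq12 = h2o.import_file(r"H:\Ashwin\dta\datafile.csv")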
The error message is on the POST /3/PostFile REST command which, as far as I can tell from your code and log snippets, means it is uploading to localhost. That is horribly inefficient.
(If not localhost, i.e. your datafile.csv is on your computer, which is outside of AWS, then upload it to S3 first. If you are doing some data munging on your computer, do that, then save it as a new file, and upload that to S3. It doesn't have to be S3: it could be the hard disk if you only have a single machine in your H2O cluster.)
For some background see also my recent answers at https://stackoverflow.com/a/54568511/841830 and https://stackoverflow.com/a/54459577/841830. (I've not marked this as a duplicate because, though the advice is the same in each case, the reason is a bit different; here I wonder if you are hitting a limit on maximum HTTP POST file size, perhaps at 2 GB? I suppose it could also be running out of disk space from all the multiple temp copies being made.)