Python ML Deployment Fails on Azure Container Instance

I have the same problem as
Why does my ML model deployment in Azure Container Instance still fail?
but the solution there does not work for me. In addition, I get errors like the ones below:
code": "AciDeploymentFailed",
"message": "Aci Deployment failed with exception: Your container application
crashed. This may be caused by errors in your scoring file's init()
function.\nPlease check the logs for your container instance: anomaly-detection-2.
From the AML SDK, you can run print(service.get_logs()) if you have service object
to fetch the logs. \nYou can also try to run image
mlad046a4688.azurecr.io/anomaly-detection-
2#sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally.
Please refer to http://aka.ms/debugimage#service-launch-fails for more
information.",
"details": [
{
"code": "CrashLoopBackOff",
"message": "Your container application crashed. This may be caused by errors in
your scoring file's init() function.\nPlease check the logs for your container
instance: anomaly-detection-2. From the AML SDK, you can run
print(service.get_logs()) if you have service object to fetch the logs. \nYou can
also try to run image mlad046a4688.azurecr.io/anomaly-detection-
2#sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally.
Please refer to http://aka.ms/debugimage#service-launch-fails for more
information."
}
]
}
It keeps pointing to the scoring file, but I am not sure what is wrong here:
import json
import numpy as np
import os
import pickle
import joblib
#from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from azureml.core.authentication import AzureCliAuthentication
from azureml.core import Model, Workspace
import logging

logging.basicConfig(level=logging.DEBUG)

def init():
    global model
    from sklearn.externals import joblib
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path(model_name='admlpkl')
    print(model_path)
    model = joblib.load(model_path)
    #ws = Workspace.from_config(auth=cli_auth)
    #logging.basicConfig(level=logging.DEBUG)
    #modeld = ws.models['admlpkl']
    #model = Model.deserialize(ws, modeld)

def run(raw_data):
    # data = np.array(json.loads(raw_data)['data'])
    # make prediction
    data = json.loads(raw_data)
    y_hat = model.predict(data)
    #r = json.dumps(y_hat)  # fails: a numpy array is not JSON serializable
    r = json.dumps(y_hat.tolist())
    return r
The model has a dependency on another file, which I have added to the image configuration:
image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file='conda_dependencies.yml',
                                                  dependencies=['modeling.py'])
The logs are too abstract and do not really help with debugging. I am able to create the image, but provisioning the service fails.
Any inputs will be appreciated.

Have you registered the model 'admlpkl' in your workspace using the register() function on the model object? If not, there will be no model path, and that can cause the failure.
See this section on model registration: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where#registermodel
Please follow the steps below to register and deploy the model to ACI.
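For reference, here is a minimal sketch of that flow with the AML SDK. The local path admlpkl.pkl is a placeholder for wherever your pickled model lives, and image stands for the ContainerImage built from your image_config; adjust both to your setup.

from azureml.core import Workspace
from azureml.core.model import Model
from azureml.core.webservice import AciWebservice, Webservice

ws = Workspace.from_config()

# Register the serialized model under the name the scoring script expects
model = Model.register(workspace=ws,
                       model_path='admlpkl.pkl',  # placeholder local path to your pickled model
                       model_name='admlpkl')

# Deploy the image built from image_config to ACI
aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Webservice.deploy_from_image(workspace=ws,
                                       name='anomaly-detection-2',
                                       image=image,  # the ContainerImage built from image_config
                                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.get_logs())  # inspect these if init() still crashes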

Related

Azure Machine Learning Studio Designer Error: code_expired

I am trying to register a data set via the Azure Machine Learning Studio designer but keep getting an error. Here is my code, used in an "Execute Python Script" module:
import pandas as pd
from azureml.core.dataset import Dataset
from azureml.core import Workspace

def azureml_main(dataframe1 = None, dataframe2 = None):
    ws = Workspace.get(name = <my_workspace_name>, subscription_id = <my_id>, resource_group = <my_RG>)
    ds = Dataset.from_pandas_dataframe(dataframe1)
    ds.register(workspace = ws,
                name = "data set name",
                description = "example description",
                create_new_version = True)
    return dataframe1,
But I get the following error in the Workspace.get line:
Authentication Exception: Unknown error occurred during authentication. Error detail: Unexpected polling state code_expired.
Since I am inside the workspace and in the designer, I do not usually need to do any kind of authentication (or even reference the workspace). Can anybody offer some direction? Thanks!
When you're inside an "Execute Python Script" module or a PythonScriptStep, the authentication for fetching the workspace is already done for you (unless you're trying to authenticate to a different Azure ML workspace):
from azureml.core import Run
run = Run.get_context()
ws = run.experiment.workspace
You should be able to use that ws object to register a Dataset.
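Putting the two together, a minimal sketch of the module might look like this (it keeps the Dataset.from_pandas_dataframe call from the question; the registration arguments are the question's own and should be adapted):

from azureml.core import Run
from azureml.core.dataset import Dataset

def azureml_main(dataframe1=None, dataframe2=None):
    # The designer has already authenticated this run, so derive the workspace from it
    run = Run.get_context()
    ws = run.experiment.workspace
    ds = Dataset.from_pandas_dataframe(dataframe1)
    ds.register(workspace=ws,
                name="data set name",
                description="example description",
                create_new_version=True)
    return dataframe1,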

Interactive Login coming for child run during hyperparameter tuning (hyperdrive) in Azure ML Notebook

I have created a train.py script in Azure that does the data cleaning, wrangling, and classification using XGBoost. I then created an ipynb file to do hyperparameter tuning by calling the train.py script.
The child runs keep asking me to perform a manual interactive login for every run. Please see the image.
I did the interactive login for many runs, but it still asks me every time.
Here is the code in ipynb file:
from azureml.core import Workspace, Environment, Experiment, ScriptRunConfig
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal, RandomParameterSampling,
                                      uniform, choice)
from azureml.widgets import RunDetails

subscription_id = 'XXXXXXXXXXXXXXXXXX'
resource_group = 'XXXXXXXXXXXXXXX'
workspace_name = 'XXXXXXXXXXXXXXX'
workspace = Workspace(subscription_id, resource_group, workspace_name)

myenv = Environment(workspace=workspace, name="myenv")
conda_dep = CondaDependencies()
conda_dep.add_pip_package("numpy")
conda_dep.add_pip_package("pandas")
conda_dep.add_pip_package("nltk")
conda_dep.add_pip_package("sklearn")
conda_dep.add_pip_package("xgboost")
myenv.python.conda_dependencies = conda_dep

experiment_name = 'experiments_xgboost_hyperparams'
experiment = Experiment(workspace, experiment_name)

compute_cluster_name = 'shan'
try:
    compute_target = ComputeTarget(workspace=workspace, name=compute_cluster_name)
    print('Found the compute cluster')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2", max_nodes=4)
    compute_target = ComputeTarget.create(workspace, compute_cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

early_termination_policy = BanditPolicy(slack_factor=0.01)
ps = RandomParameterSampling({
    'learning_rate': uniform(0.1, 0.9),
    'max_depth': choice(range(3, 8)),
    'n_estimators': choice(300, 400, 500, 600)
})

script_run_config = ScriptRunConfig(source_directory='.', script='train.py', compute_target=compute_target, environment=myenv)
# script_run_config.run_config.target = compute_target

# Create a HyperDriveConfig using the run config, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(run_config=script_run_config,
                                     hyperparameter_sampling=ps,
                                     policy=early_termination_policy,
                                     primary_metric_name="accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=10,
                                     max_concurrent_runs=4)
hyperdrive = experiment.submit(config=hyperdrive_config)
RunDetails(hyperdrive).show()
hyperdrive.wait_for_completion(show_output=True)
This just keeps asking me interactive login for every child run.
You need to implement an authentication method to avoid interactive authentication.
The issue comes from this line:
workspace = Workspace(subscription_id, resource_group, workspace_name)
The Azure ML SDK tries to access the workspace based only on its name, the subscription id, and the associated resource group. It does not know whether you have access to it, which is why it asks you to authenticate through a URL.
I would suggest implementing authentication through a service principal; you can find the official documentation here.
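As a minimal sketch, it would look like the following; the tenant id, client id, and secret are placeholders for your own service principal:

from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core import Workspace

# Placeholder credentials for your own service principal
sp_auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",
    service_principal_id="<client-id>",
    service_principal_password="<client-secret>")

workspace = Workspace(subscription_id, resource_group, workspace_name, auth=sp_auth)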

How to log from a custom ai platform model

I recently deployed a custom model to Google Cloud's AI Platform, and I am trying to debug some parts of my preprocessing logic. However, my print statements are not being logged to the Stackdriver output. I have also tried using the logging client imported from google.cloud, to no avail. Here is my custom prediction file:
import os
import pickle
import numpy as np
from sklearn.datasets import load_iris
import tensorflow as tf
from google.cloud import logging

class MyPredictor(object):
    def __init__(self, model, preprocessor):
        self.logging_client = logging.Client()
        self._model = model
        self._preprocessor = preprocessor
        self._class_names = ["Snare", "Kicks", "ClosedHH", "ClosedHH", "Clap", "Crash", "Perc"]

    def predict(self, instances, **kwargs):
        log_name = "Here I am"
        logger = self.logging_client.logger(log_name)
        text = 'Hello, world!'
        logger.log_text(text)
        print('Logged: {}'.format(text), kwargs.get("sr"))
        inputs = np.asarray(instances)
        outputs = self._model.predict(inputs)
        if kwargs.get('probabilities'):
            return outputs.tolist()
            #return "[]"
        else:
            return [self._class_names[index] for index in np.argmax(outputs.tolist(), axis=1)]

    @classmethod
    def from_path(cls, model_dir):
        model_path = os.path.join(model_dir, 'model.h5')
        model = tf.keras.models.load_model(model_path, custom_objects={"adam": tf.keras.optimizers.Adam,
            "categorical_crossentropy": tf.keras.losses.categorical_crossentropy, "lr": 0.01, "name": "Adam"})
        preprocessor_path = os.path.join(model_dir, 'preprocessor.pkl')
        with open(preprocessor_path, 'rb') as f:
            preprocessor = pickle.load(f)
        return cls(model, preprocessor)
I can't find anything online about why my logs are not showing up in Stackdriver (neither the print statements nor the logging library calls). Has anyone faced this issue?
Thanks,
Nikita
NOTE: If you have enough rep to create tags please add the google-ai-platform tag to this post. I think it would really help people who are in my position. Thanks!
From the documentation:
If you want to enable online prediction logging, you must configure it when you create a model resource or when you create a model version resource, depending on which type of logging you want to enable. There are three types of logging, which you can enable independently:
Access logging, which logs information like timestamp and latency for each request to Stackdriver Logging. You can enable access logging when you create a model resource.
Stream logging, which logs the stderr and stdout streams from your prediction nodes to Stackdriver Logging, and can be useful for debugging. This type of logging is in beta, and it is not supported by Compute Engine (N1) machine types. You can enable stream logging when you create a model resource.
Request-response logging, which logs a sample of online prediction requests and responses to a BigQuery table. This type of logging is in beta. You can enable request-response logging by creating a model version resource, then updating that version.
For your use case, please use the following template to log custom information into StackDriver:
Model
gcloud beta ai-platform models create {MODEL_NAME} \
--regions {REGION} \
--enable-logging \
--enable-console-logging
Model version
gcloud beta ai-platform versions create {VERSION_NAME} \
--model {MODEL_NAME} \
--origin gs://{BUCKET}/{MODEL_DIR} \
--python-version 3.7 \
--runtime-version 1.15 \
--package-uris gs://{BUCKET}/{PACKAGES_DIR}/custom-model-0.1.tar.gz \
--prediction-class=custom_prediction.CustomModelPrediction \
--service-account custom@project_id.iam.gserviceaccount.com
I tried this and it worked fine:
- I made some modifications to the constructor due to the @classmethod decorator.
- Create a service account and grant it the "Stackdriver Debugger User" role; use it during model version creation.
- Add the google-cloud-logging library to your setup.py.
- Consider the extra cost of enabling Stackdriver logging.
- When using log_struct, check that the correct type is passed (if using str, make sure you convert bytes to str in Python 3 using .decode('utf-8')).
- Define the project_id parameter during Stackdriver client creation, logging.Client(project=...), otherwise you will get:
ERROR:root:Prediction failed: 400 Name "projects//logs/my-custom-prediction-log" is missing the parent component. Expected the form projects/[PROJECT_ID]/logs/[ID]"
Code below:
%%writefile cloud_logging.py
import os
import pickle
import numpy as np
from datetime import date
from google.cloud import logging
import tensorflow.keras as keras

LOG_NAME = 'my-custom-prediction-log'

class CustomModelPrediction(object):
    def __init__(self, model, processor, client):
        self._model = model
        self._processor = processor
        self._client = client

    def _postprocess(self, predictions):
        labels = ['negative', 'positive']
        return [
            {
                "label": labels[int(np.round(prediction))],
                "score": float(np.round(prediction, 4))
            } for prediction in predictions]

    def predict(self, instances, **kwargs):
        logger = self._client.logger(LOG_NAME)
        logger.log_struct({'instances': instances})
        preprocessed_data = self._processor.transform(instances)
        predictions = self._model.predict(preprocessed_data)
        labels = self._postprocess(predictions)
        return labels

    @classmethod
    def from_path(cls, model_dir):
        client = logging.Client(project='project_id')  # Change to your project
        model = keras.models.load_model(
            os.path.join(model_dir, 'keras_saved_model.h5'))
        with open(os.path.join(model_dir, 'processor_state.pkl'), 'rb') as f:
            processor = pickle.load(f)
        return cls(model, processor, client)

# Verify model locally
from cloud_logging import CustomModelPrediction
classifier = CustomModelPrediction.from_path('.')
requests = ["God I hate the north", "god I love this"]
response = classifier.predict(requests)
response
Then I check with the sample library:
python snippets.py my-custom-prediction-log list
Listing entries for logger my-custom-prediction-log:
* 2020-02-19T19:51:45.809767+00:00: {u'instances': [u'God I hate the north', u'god I love this']}
* 2020-02-19T19:57:18.615159+00:00: {u'instances': [u'God I hate the north', u'god I love this']}
To visualize the logs, go to Stackdriver > Logging and select Global and your log name; if you want to see the model logs, you should be able to select Cloud ML Model Version.
You can use my files here: model and pre-processor
If you just want your print statements to work without using the logging method above, you can add the flush flag to your print call (stdout is block-buffered when it is not attached to a terminal, so unflushed prints may never reach the log):
print("logged", flush=True)

ValueError while deploying tensorflow model to Amazon SageMaker

I want to deploy my trained TensorFlow model to Amazon SageMaker. I am following the official guide here: https://aws.amazon.com/blogs/machine-learning/deploy-trained-keras-or-tensorflow-models-using-amazon-sagemaker/ to deploy my model using a Jupyter notebook.
But when I try to use code:
predictor = sagemaker_model.deploy(initial_instance_count=1, instance_type='ml.t2.medium')
It gives me the following error message:
ValueError: Error hosting endpoint sagemaker-tensorflow-2019-08-07-22-57-59-547: Failed Reason: The image '520713654638.dkr.ecr.us-west-1.amazonaws.com/sagemaker-tensorflow:1.12-cpu-py3 ' does not exist.
I think the tutorial does not tell me to create an image, and I do not know what to do.
import boto3, re
from sagemaker import get_execution_role

role = get_execution_role()

# make a tar ball of the model data files
import tarfile
with tarfile.open('model.tar.gz', mode='w:gz') as archive:
    archive.add('export', recursive=True)

# create a new s3 bucket and upload the tarball to it
import sagemaker
sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')

from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role=role,
                                  framework_version='1.12',
                                  entry_point='train.py',
                                  py_version='py3')

%%time
# here I fail to deploy the model and get the error message
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')
https://github.com/aws/sagemaker-python-sdk/issues/912#issuecomment-510226311
As mentioned in the issue:
Python 3 isn't supported using the TensorFlowModel object, as the container uses the TensorFlow Serving API library in conjunction with the GRPC client to handle making inferences; however, the TensorFlow Serving API isn't supported in Python 3 officially, so there are only Python 2 versions of the containers when using the TensorFlowModel object.
If you need Python 3, then you will need to use the Model object defined in #2 above. The inference script format will change if you need to handle pre- and post-processing: https://github.com/aws/sagemaker-tensorflow-serving-container#prepost-processing.
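For illustration, a minimal sketch of that route using the TensorFlow Serving based Model class (presumably the Model object the issue refers to); it reuses the session and role from the question, and assumes the S3 path matches the upload above:

from sagemaker.tensorflow.serving import Model

# TensorFlow Serving container; works with Python 3 since no Python entry point runs in the container
sagemaker_model = Model(model_data='s3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                        role=role,
                        framework_version='1.12')
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')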

FileUploadMiscError while persisting output file from Azure Batch

I'm facing the following error while trying to persist log files to Azure Blob storage from an Azure Batch execution: "FileUploadMiscError - A miscellaneous error was encountered while uploading one of the output files". This error doesn't give much information about what might be going wrong. I tried checking the Microsoft documentation for this error code, but it doesn't mention this particular one.
Below is the relevant code for adding the task to Azure Batch, which I have ported from C# to Python, for persisting the log files.
Note: The container that I have configured gets created when the task is added, but there's no blob inside.
import datetime
import logging
import os

import azure.storage.blob.models as blob_model
import yaml
from azure.batch import models
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.common.cloudstorageaccount import CloudStorageAccount
from dotenv import load_dotenv

LOG = logging.getLogger(__name__)

def add_tasks(batch_client, job_id, task_id, io_details, blob_details):
    task_commands = "This is a placeholder. Actual code has an actual task. This gets completed successfully."
    LOG.info("Configuring the blob storage details")
    base_blob_service = BaseBlobService(
        account_name=blob_details['account_name'],
        account_key=blob_details['account_key'])
    LOG.info("Base blob service created")
    base_blob_service.create_container(
        container_name=blob_details['container_name'], fail_on_exist=False)
    LOG.info("Container present")
    container_sas = base_blob_service.generate_container_shared_access_signature(
        container_name=blob_details['container_name'],
        permission=blob_model.ContainerPermissions(write=True),
        expiry=datetime.datetime.now() + datetime.timedelta(days=1))
    LOG.info(f"Container SAS created: {container_sas}")
    container_url = base_blob_service.make_container_url(
        container_name=blob_details['container_name'], sas_token=container_sas)
    LOG.info(f"Container URL created: {container_url}")
    # fpath = task_id + '/output.txt'
    fpath = task_id
    LOG.info(f"Creating output file object:")
    out_files_list = list()
    out_files = models.OutputFile(
        file_pattern=r"../stderr.txt",
        destination=models.OutputFileDestination(
            container=models.OutputFileBlobContainerDestination(
                container_url=container_url, path=fpath)),
        upload_options=models.OutputFileUploadOptions(
            upload_condition=models.OutputFileUploadCondition.task_completion))
    out_files_list.append(out_files)
    LOG.info(f"Output files: {out_files_list}")
    LOG.info(f"Creating the task now: {task_id}")
    task = models.TaskAddParameter(
        id=task_id, command_line=task_commands, output_files=out_files_list)
    batch_client.task.add(job_id=job_id, task=task)
    LOG.info(f"Added task: {task_id}")
There is a bug in Batch's OutputFile handling that causes uploads to containers to fail if the full container URL includes any query-string parameters other than the ones included in the SAS token. Unfortunately, the azure-storage-blob Python module includes an extra query-string parameter when generating the URL via make_container_url.
This issue was just raised to us, and a fix will be released in the coming weeks, but an easy workaround is, instead of using make_container_url to craft the URL, to craft it yourself like so: container_url = 'https://{}/{}?{}'.format(blob_service.primary_endpoint, blob_details['container_name'], container_sas).
The resulting URL should look something like this: https://<account>.blob.core.windows.net/<container>?se=2019-01-12T01%3A34%3A05Z&sp=w&sv=2018-03-28&sr=c&sig=<sig> - specifically, it shouldn't have restype=container in it (which is what the azure-storage-blob package is including).
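Dropped into the snippet from the question (using its base_blob_service variable), the workaround would look roughly like this:

# Workaround: build the container URL by hand so it carries only the SAS query string,
# avoiding the extra restype=container parameter that make_container_url appends
container_url = 'https://{}/{}?{}'.format(
    base_blob_service.primary_endpoint,
    blob_details['container_name'],
    container_sas)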
