Using SageMaker Debugger in local mode - python

Abstract
I'm trying to test the SageMaker Debugger examples from amazon-sagemaker-examples in 'local mode', aiming to reproduce the same debugging results I get on an AWS SageMaker Notebook instance.
What I've done
I added a few lines to one of the examples so that it runs in 'local mode', following amazon-sagemaker-local-mode-example. The example is tf-mnist-builtin-rule.ipynb.
Commented lines are the original code from the example:
import subprocess
# import boto3
from sagemaker.local import LocalSession
from sagemaker.tensorflow import TensorFlow

instance_type = 'local'
try:
    if subprocess.call("nvidia-smi") == 0:
        instance_type = "local_gpu"
except:
    pass

session = LocalSession()
session.config = {'local': {'local_code': True}}
# session = boto3.session.Session()
# region = session.region_name

role = 'arn:aws:iam::111111111111:role/service-role/AmazonSageMaker-ExecutionRole-20200101T000001'

estimator = TensorFlow(
    # role=sagemaker.get_execution_role(),
    role=role,
    instance_count=1,
    # instance_type="ml.p3.8xlarge",
    instance_type=instance_type,
    # image_uri=f"763104351884.dkr.ecr.{region}.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04",
    framework_version='2.4.1',
    py_version="py37",
    max_run=3600,
    source_dir="./src",
    entry_point="tf-resnet50-cifar10.py",
    # Debugger Parameters
    rules=built_in_rules,
    sagemaker_session=session,
)
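Note that built_in_rules is defined earlier in the notebook. For context, a minimal sketch of what such a list looks like with the Debugger built-in rules API (the exact rules configured in tf-mnist-builtin-rule.ipynb may differ):

from sagemaker.debugger import Rule, rule_configs

# A sketch only; substitute the built-in rules the notebook actually configures.
built_in_rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.loss_not_decreasing()),
]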
Problem:
I can't see the debugging report in real time, unlike on an AWS SageMaker Notebook instance.
The example says the wait=False option (see the figure) lets the notebook proceed while the job is still running, but this does not work in 'local mode'.
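For reference, a minimal sketch of the non-blocking call the example describes (training inputs, if any, are elided here):

# On a SageMaker Notebook instance this returns immediately, so later cells can
# poll the Debugger rule status; in local mode it blocks until the training
# container exits, which is the behavior described above.
estimator.fit(wait=False)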
Question:
Any idea how to make the wait=False option work in 'local mode'?
Thanks for reading.

Related

Azure Machine Learning Studio Designer Error: code_expired

I am trying to register a data set via the Azure Machine Learning Studio designer but keep getting an error. Here is my code, used in an "Execute Python Script" module:
import pandas as pd
from azureml.core.dataset import Dataset
from azureml.core import Workspace

def azureml_main(dataframe1 = None, dataframe2 = None):
    ws = Workspace.get(name = <my_workspace_name>, subscription_id = <my_id>, resource_group = <my_RG>)
    ds = Dataset.from_pandas_dataframe(dataframe1)
    ds.register(workspace = ws,
                name = "data set name",
                description = "example description",
                create_new_version = True)
    return dataframe1,
But I get the following error in the Workspace.get line:
Authentication Exception: Unknown error occurred during authentication. Error detail: Unexpected polling state code_expired.
Since I am inside the workspace and in the designer, I do not usually need to do any kind of authentication (or even reference the workspace). Can anybody offer some direction? Thanks!
When you're inside an "Execute Python Script" module or PythonScriptStep, the authentication for fetching the workspace is already done for you (unless you're trying to authenticate to a different Azure ML workspace):
from azureml.core import Run
run = Run.get_context()
ws = run.experiment.workspace
You should be able to use that ws object to register a Dataset.
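Putting the two together, a minimal sketch of the module (the registration call itself stays as in the question):

from azureml.core import Run

def azureml_main(dataframe1=None, dataframe2=None):
    # Inside the designer, the run context already carries an authenticated
    # workspace reference, so no explicit Workspace.get() is needed.
    run = Run.get_context()
    ws = run.experiment.workspace
    # ... register the dataset against `ws` as in the question ...
    return dataframe1,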

Interactive login prompt for every child run during hyperparameter tuning (HyperDrive) in Azure ML Notebook

I have created a train.py script in Azure that does the data cleaning, wrangling and classification using XGBoost. I then created an ipynb file to do hyperparameter tuning by calling the train.py script.
The child runs keep asking me to perform a manual interactive login for every run. Please see the image.
I did the interactive login for many runs, but it still asks me every time.
Here is the code in ipynb file:
# Imports implied by the snippet but not shown in the question:
from azureml.core import Workspace, Environment, Experiment
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.widgets import RunDetails

subscription_id = 'XXXXXXXXXXXXXXXXXX'
resource_group = 'XXXXXXXXXXXXXXX'
workspace_name = 'XXXXXXXXXXXXXXX'
workspace = Workspace(subscription_id, resource_group, workspace_name)

myenv = Environment(workspace=workspace, name="myenv")

from azureml.core.conda_dependencies import CondaDependencies
conda_dep = CondaDependencies()
conda_dep.add_pip_package("numpy")
conda_dep.add_pip_package("pandas")
conda_dep.add_pip_package("nltk")
conda_dep.add_pip_package("sklearn")
conda_dep.add_pip_package("xgboost")
myenv.python.conda_dependencies = conda_dep

experiment_name = 'experiments_xgboost_hyperparams'
experiment = Experiment(workspace, experiment_name)

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

compute_cluster_name = 'shan'
try:
    compute_target = ComputeTarget(workspace=workspace, name=compute_cluster_name)
    print('Found the compute cluster')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2", max_nodes=4)
    compute_target = ComputeTarget.create(workspace, compute_cluster_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

early_termination_policy = BanditPolicy(slack_factor=0.01)

from azureml.train.hyperdrive import RandomParameterSampling
from azureml.train.hyperdrive import uniform, choice
ps = RandomParameterSampling({
    'learning_rate': uniform(0.1, 0.9),
    'max_depth': choice(range(3, 8)),
    'n_estimators': choice(300, 400, 500, 600)
})

from azureml.core import ScriptRunConfig
script_run_config = ScriptRunConfig(source_directory='.', script='train.py',
                                    compute_target=compute_target, environment=myenv)
# script_run_config.run_config.target = compute_target

# Create a HyperDriveConfig using the run config, hyperparameter sampler, and policy.
hyperdrive_config = HyperDriveConfig(run_config=script_run_config,
                                     hyperparameter_sampling=ps,
                                     policy=early_termination_policy,
                                     primary_metric_name="accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=10,
                                     max_concurrent_runs=4)

hyperdrive = experiment.submit(config=hyperdrive_config)
RunDetails(hyperdrive).show()
hyperdrive.wait_for_completion(show_output=True)
This keeps asking me to do an interactive login for every child run.
You need to implement an authentication method to avoid interactive authentication.
The issue comes from this line:
workspace = Workspace(subscription_id, resource_group, workspace_name)
The Azure ML SDK tries to access a Workspace based only on its name, the subscription id and the associated resource group. It does not know whether you have access to it, which is why it asks you to authenticate through a URL.
I would suggest implementing authentication through a service principal; you can find the official documentation here.
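For illustration, a minimal sketch of service-principal authentication (the tenant id, client id and secret are placeholders, and the service principal must already have been granted access to the workspace):

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

sp_auth = ServicePrincipalAuthentication(
    tenant_id="<tenant-id>",                       # placeholder
    service_principal_id="<client-id>",            # placeholder
    service_principal_password="<client-secret>")  # placeholder; read from env/Key Vault in practice

workspace = Workspace(subscription_id, resource_group, workspace_name, auth=sp_auth)

With auth passed explicitly, neither the parent run nor the child runs should fall back to the interactive device-login flow.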

Python ML Deployment Fails on Azure Container Instance

I have the same problem as
Why does my ML model deployment in Azure Container Instance still fail?
but the above solution does not work for me. Besides, I get additional errors like the ones below:
code": "AciDeploymentFailed",
"message": "Aci Deployment failed with exception: Your container application
crashed. This may be caused by errors in your scoring file's init()
function.\nPlease check the logs for your container instance: anomaly-detection-2.
From the AML SDK, you can run print(service.get_logs()) if you have service object
to fetch the logs. \nYou can also try to run image
mlad046a4688.azurecr.io/anomaly-detection-
2#sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally.
Please refer to http://aka.ms/debugimage#service-launch-fails for more
information.",
"details": [
{
"code": "CrashLoopBackOff",
"message": "Your container application crashed. This may be caused by errors in
your scoring file's init() function.\nPlease check the logs for your container
instance: anomaly-detection-2. From the AML SDK, you can run
print(service.get_logs()) if you have service object to fetch the logs. \nYou can
also try to run image mlad046a4688.azurecr.io/anomaly-detection-
2#sha256:fcbba67cf683626291c1bd084f31438fcd641ddaf80f9bdf8cea274d22d1fcb5 locally.
Please refer to http://aka.ms/debugimage#service-launch-fails for more
information."
}
]
}
It keeps pointing to the scoring file, but I'm not sure what is wrong here:
import json  # needed for json.loads/json.dumps in run() below
import numpy as np
import os
import pickle
import joblib
#from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from azureml.core.authentication import AzureCliAuthentication
from azureml.core import Model, Workspace
import logging

logging.basicConfig(level=logging.DEBUG)

def init():
    global model
    from sklearn.externals import joblib
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path(model_name='admlpkl')
    print(model_path)
    model = joblib.load(model_path)
    #ws = Workspace.from_config(auth=cli_auth)
    #logging.basicConfig(level=logging.DEBUG)
    #modeld = ws.models['admlpkl']
    #model=Model.deserialize(ws, modeld)

def run(raw_data):
    # data = np.array(json.loads(raw_data)['data'])
    # make prediction
    data = json.loads(raw_data)
    y_hat = model.predict(data)
    #r = json.dumps(y_hat.tolist())
    r = json.dumps(y_hat)
    return r
The model has a dependency on another file, which I have added in:

image_config = ContainerImage.image_configuration(execution_script="score.py",
                                                  runtime="python",
                                                  conda_file='conda_dependencies.yml',
                                                  dependencies=['modeling.py'])

The logs are too abstract and don't really help with debugging. I am able to create the image, but provisioning the service fails.
Any input will be appreciated.
Have you registered the model 'admlpkl' in your workspace using the register() function on the model object? If not, there will be no model path and that can cause failure.
See this section on model registration: https://learn.microsoft.com/en-us/azure/machine-learning/service/how-to-deploy-and-where#registermodel
Please follow the steps below to register and deploy the model to ACI.
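The original answer's steps aren't reproduced here; as a rough sketch, assuming the model file and scoring script match the question (paths and names are placeholders), the register-and-deploy flow might look like this:

from azureml.core import Workspace, Model, Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# Register the trained model file under the name score.py looks up
model = Model.register(workspace=ws,
                       model_path="admlpkl.pkl",  # placeholder local path
                       model_name="admlpkl")

env = Environment.from_conda_specification("myenv", "conda_dependencies.yml")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(ws, "anomaly-detection-2", [model], inference_config, aci_config)
service.wait_for_deployment(show_output=True)
print(service.get_logs())

If the deployment still crashes, print(service.get_logs()) usually shows the traceback from init(), as the error message suggests.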

TensorFlow Serving: Update model_config (add additional models) at runtime

I'm busy configuring a TensorFlow Serving client that asks a TensorFlow Serving server to produce predictions on a given input image, for a given model.
If the model being requested has not yet been served, it is downloaded from a remote URL to a folder where the server's models are located. (The client does this). At this point I need to update the model_config and trigger the server to reload it.
This functionality appears to exist (based on https://github.com/tensorflow/serving/pull/885 and https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/model_service.proto#L22), but I can't find any documentation on how to actually use it.
I am essentially looking for a python script with which I can trigger the reload from client side (or otherwise to configure the server to listen for changes and trigger the reload itself).
It took me ages of trawling through pull requests to finally find a code example for this. For the next person who has the same question as me, here is an example of how to do it. (You'll need the tensorflow_serving package for this; pip install tensorflow-serving-api.)
Based on this pull request (which at the time of writing hadn't been accepted and was closed since it needed review): https://github.com/tensorflow/serving/pull/1065
from tensorflow_serving.apis import model_service_pb2_grpc
from tensorflow_serving.apis import model_management_pb2
from tensorflow_serving.config import model_server_config_pb2
import grpc

def add_model_config(host, name, base_path, model_platform):
    channel = grpc.insecure_channel(host)
    stub = model_service_pb2_grpc.ModelServiceStub(channel)
    request = model_management_pb2.ReloadConfigRequest()
    model_server_config = model_server_config_pb2.ModelServerConfig()

    # Create a config to add to the list of served models
    config_list = model_server_config_pb2.ModelConfigList()
    one_config = config_list.config.add()
    one_config.name = name
    one_config.base_path = base_path
    one_config.model_platform = model_platform

    model_server_config.model_config_list.CopyFrom(config_list)
    request.config.CopyFrom(model_server_config)
    print(request.IsInitialized())
    print(request.ListFields())

    response = stub.HandleReloadConfigRequest(request, 10)
    if response.status.error_code == 0:
        print("Reload succeeded")
    else:
        print("Reload failed!")
        print(response.status.error_code)
        print(response.status.error_message)

add_model_config(host="localhost:8500",
                 name="my_model",
                 base_path="/models/my_model",
                 model_platform="tensorflow")
This adds a model to the TF Serving server and to the existing config file conf_filepath: use the arguments name, base_path and model_platform for the new model. It keeps the original models intact.
Note a small difference from Karl's answer: using MergeFrom instead of CopyFrom.
pip install tensorflow-serving-api

import grpc
from google.protobuf import text_format
from tensorflow_serving.apis import model_service_pb2_grpc, model_management_pb2
from tensorflow_serving.config import model_server_config_pb2

def add_model_config(conf_filepath, host, name, base_path, model_platform):
    with open(conf_filepath, 'r+') as f:
        config_ini = f.read()
    channel = grpc.insecure_channel(host)
    stub = model_service_pb2_grpc.ModelServiceStub(channel)
    request = model_management_pb2.ReloadConfigRequest()
    model_server_config = model_server_config_pb2.ModelServerConfig()
    config_list = model_server_config_pb2.ModelConfigList()
    model_server_config = text_format.Parse(text=config_ini, message=model_server_config)

    # Create a config to add to the list of served models
    one_config = config_list.config.add()
    one_config.name = name
    one_config.base_path = base_path
    one_config.model_platform = model_platform

    model_server_config.model_config_list.MergeFrom(config_list)
    request.config.CopyFrom(model_server_config)

    response = stub.HandleReloadConfigRequest(request, 10)
    if response.status.error_code == 0:
        with open(conf_filepath, 'w+') as f:
            f.write(request.config.__str__())
        print("Updated TF Serving conf file")
    else:
        print("Failed to update model_config_list!")
        print(response.status.error_code)
        print(response.status.error_message)
While the solutions mentioned here work fine, there is one more method you can use to hot-reload your models: the --model_config_file_poll_wait_seconds flag.
As mentioned in the documentation:
Set the --model_config_file_poll_wait_seconds flag to instruct the server to periodically check for a new config file at the --model_config_file filepath.
So you just have to update the config file at that path, and tf-serving will load any new models and unload any models removed from the config file.
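For example, the server might be launched like this (port, paths and poll interval are placeholders):

tensorflow_model_server \
    --port=8500 \
    --model_config_file=/models/models.config \
    --model_config_file_poll_wait_seconds=60

The config file itself is a plain-text protobuf, so adding or removing a model_config_list entry is picked up at the next poll:

model_config_list {
  config {
    name: "my_model"
    base_path: "/models/my_model"
    model_platform: "tensorflow"
  }
}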
Edit 1: I looked at the source code, and it seems that the flag has been present since a very early version of tf-serving, but there have been instances where some users were not able to use it (see this). So try to use the latest version if possible.
If you're using the method described in this answer, please note that you're actually launching multiple tensorflow model server instances instead of a single model server, effectively making the servers compete for resources instead of working together to optimize tail latency.

FileUploadMiscError while persisting output file from Azure Batch

I'm facing the following error while trying to persist log files to Azure Blob storage from Azure Batch execution - "FileUploadMiscError - A miscellaneous error was encountered while uploading one of the output files". This error doesn't give a lot of information as to what might be going wrong. I tried checking the Microsoft Documentation for this error code, but it doesn't mention this particular error code.
Below is the relevant code for adding the task to Azure Batch that I have ported from C# to Python for persisting the log files.
Note: The container that I have configured gets created when the task is added, but there's no blob inside.
import datetime
import logging
import os

import azure.storage.blob.models as blob_model
import yaml
from azure.batch import models
from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.common.cloudstorageaccount import CloudStorageAccount
from dotenv import load_dotenv

LOG = logging.getLogger(__name__)

def add_tasks(batch_client, job_id, task_id, io_details, blob_details):
    task_commands = "This is a placeholder. Actual code has an actual task. This gets completed successfully."
    LOG.info("Configuring the blob storage details")
    base_blob_service = BaseBlobService(
        account_name=blob_details['account_name'],
        account_key=blob_details['account_key'])
    LOG.info("Base blob service created")
    base_blob_service.create_container(
        container_name=blob_details['container_name'], fail_on_exist=False)
    LOG.info("Container present")
    container_sas = base_blob_service.generate_container_shared_access_signature(
        container_name=blob_details['container_name'],
        permission=blob_model.ContainerPermissions(write=True),
        expiry=datetime.datetime.now() + datetime.timedelta(days=1))
    LOG.info(f"Container SAS created: {container_sas}")
    container_url = base_blob_service.make_container_url(
        container_name=blob_details['container_name'], sas_token=container_sas)
    LOG.info(f"Container URL created: {container_url}")
    # fpath = task_id + '/output.txt'
    fpath = task_id
    LOG.info("Creating output file object:")
    out_files_list = list()
    out_files = models.OutputFile(
        file_pattern=r"../stderr.txt",
        destination=models.OutputFileDestination(
            container=models.OutputFileBlobContainerDestination(
                container_url=container_url, path=fpath)),
        upload_options=models.OutputFileUploadOptions(
            upload_condition=models.OutputFileUploadCondition.task_completion))
    out_files_list.append(out_files)
    LOG.info(f"Output files: {out_files_list}")
    LOG.info(f"Creating the task now: {task_id}")
    task = models.TaskAddParameter(
        id=task_id, command_line=task_commands, output_files=out_files_list)
    batch_client.task.add(job_id=job_id, task=task)
    LOG.info(f"Added task: {task_id}")
There is a bug in Batch's OutputFile handling which causes it to fail to upload to containers if the full container URL includes any query-string parameters other than the ones included in the SAS token. Unfortunately, the azure-storage-blob Python module includes an extra query string parameter when generating the URL via make_container_url.
This issue was just raised to us, and a fix will be released in the coming weeks. In the meantime, an easy workaround is, instead of using make_container_url to craft the URL, to craft it yourself like so: container_url = 'https://{}/{}?{}'.format(blob_service.primary_endpoint, blob_details['container_name'], container_sas).
The resulting URL should look something like this: https://<account>.blob.core.windows.net/<container>?se=2019-01-12T01%3A34%3A05Z&sp=w&sv=2018-03-28&sr=c&sig=<sig> - specifically, it shouldn't have restype=container in it (which is what the azure-storage-blob package is including).
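Dropped into the add_tasks function above, the workaround replaces the make_container_url call like so (this reuses the answer's format string; primary_endpoint already contains the account's blob host):

# Workaround: build the container URL by hand so no extra query-string
# parameters (e.g. restype=container) end up alongside the SAS token.
container_url = 'https://{}/{}?{}'.format(
    base_blob_service.primary_endpoint,
    blob_details['container_name'],
    container_sas)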
