Sagemaker not outputting Tensorboard logs to S3 during training - python

I'm training a model with TensorFlow on Amazon SageMaker, and I'd like to be able to monitor training progress while the job is running. During training, however, no TensorBoard files are written to S3; the files are only uploaded once the training job has completed. After training has completed, I can download the files and see that TensorBoard logged values correctly throughout training, even though S3 was only updated once, after training completed.
Why isn't SageMaker uploading the TensorBoard data to S3 throughout the training process?
Here is the code from my SageMaker notebook that kicks off the training job:
import sagemaker
from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig, CollectionConfig, TensorBoardOutputConfig
import time
bucket = 'my-bucket'
output_prefix = 'training-jobs'
model_name = 'my-model'
dataset_name = 'my-dataset'
dataset_path = f's3://{bucket}/datasets/{dataset_name}'
output_path = f's3://{bucket}/{output_prefix}'
job_name = f'{model_name}-{dataset_name}-training-{time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())}'
s3_checkpoint_path = f"{output_path}/{job_name}/checkpoints" # Checkpoints are updated live as expected
s3_tensorboard_path = f"{output_path}/{job_name}/tensorboard" # Tensorboard data isn't appearing here until the training job has completed
tensorboard_output_config = TensorBoardOutputConfig(
    s3_output_path=s3_tensorboard_path,
    container_local_output_path='/opt/ml/output/tensorboard'  # I have confirmed this is the unaltered path being provided to tf.summary.create_file_writer()
)
role = sagemaker.get_execution_role()
estimator = TensorFlow(entry_point='main.py', source_dir='./', role=role, max_run=60*60*24*5,
                       output_path=output_path,
                       checkpoint_s3_uri=s3_checkpoint_path,
                       tensorboard_output_config=tensorboard_output_config,
                       instance_count=1, instance_type='ml.g4dn.xlarge',
                       framework_version='2.3.1', py_version='py37', script_mode=True)
estimator.fit({'train': dataset_path}, wait=True, job_name=job_name)

There is an issue on the TensorFlow GitHub related to the S3 client in version 2.3.1, which is the version you are using. Check the CloudWatch logs for an error like
OP_REQUIRES failed at whole_file_read_ops.cc:116 : Failed precondition: AWS Credentials have not been set properly. Unable to access the specified S3 location
If you see that, the suggested fix is to add the s3:GetObjectVersion permission for the bucket. Alternatively, to confirm that it is a TensorFlow issue, you can try a different framework version.
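For reference, a minimal sketch of what that extra permission could look like (the bucket name is a placeholder, and it's written as a Python dict only to match the rest of the post; attach the statement to the SageMaker execution role's policy):
# Hypothetical IAM statement for the SageMaker execution role; the key addition
# suggested above is s3:GetObjectVersion alongside the usual read permission.
extra_statement = {
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:GetObjectVersion"],
    "Resource": "arn:aws:s3:::my-bucket/*",  # placeholder bucket
}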

First, some speculation without any facts: SageMaker may work like other systems that sync files between the local drive and S3. Such systems often check that a file hasn't been modified recently before syncing it, so they don't copy a file while someone is still writing to it. TensorBoard log files are written to constantly until shutdown, which could result in them never being copied.
I have run into the same problem with SageMaker Docker containers. I've tried two ways to circumvent it, and both seemed to work.
The first is to periodically create a new log file: e.g. every 30 minutes, call tf.summary.create_file_writer(...) again to switch to a new log file. The old file is synced to S3 once it's no longer in use (see the sketch below).
The second is to write logs directly to S3: tf.summary.create_file_writer('s3://bucket/dir/'). This gets the information into S3 more immediately.
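A rough sketch of the first workaround (the loop, the loss value and the log directory are placeholders, not code from the original post); for the second workaround you would simply pass the s3:// path as the log directory instead:
import time
import tensorflow as tf

log_dir = '/opt/ml/output/tensorboard'   # or 's3://bucket/dir/' for the second workaround
rotate_every = 30 * 60                   # seconds between log-file rotations
writer = tf.summary.create_file_writer(log_dir)
last_rotation = time.time()

for step in range(1000):                 # stand-in for your real training loop
    loss = float(step)                   # stand-in for the loss from your training step
    with writer.as_default():
        tf.summary.scalar('loss', loss, step=step)
    if time.time() - last_rotation > rotate_every:
        writer.close()                                   # the finished events file can now be synced to S3
        writer = tf.summary.create_file_writer(log_dir)  # start a fresh events file
        last_rotation = time.time()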

Related

Is it possible to dynamically create compute/clusters at runtime in Azure ML?

I'm looking to dynamically create compute clusters at runtime for an Azure ML pipeline.
A simplistic version of the pipeline looks like this:
# imports (Azure ML SDK v1); ws is an existing Workspace handle
from azureml.core import Experiment
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep
# create the compute
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=1)
cpu_cluster = ComputeTarget.create(ws, 'test-cluster', compute_config)
cpu_cluster.wait_for_completion(show_output=True)
# construct the step
step_1 = PythonScriptStep(script_name='test_script.py', name='test_step', compute_target=cpu_cluster)
# validate the pipeline and publish
pipeline = Pipeline(ws, steps=[step_1])
pipeline.validate()
# run the experiment
experiment = Experiment(workspace=ws, name=experiment_name)
pipeline_run = experiment.submit(config=pipeline)
pipeline_run.wait_for_completion()
This works perfectly fine when I run the driver script locally; however, when I publish the pipeline and execute it from ADF, the compute clusters don't get created.
UserError: Response status code does not indicate success: 400 (Unknown compute target 'test-cluster'.). Unknown compute target 'test-cluster'.
Any guidance or suggestions welcome.
Yes, compute targets can be created dynamically. Check the documentation below for the procedure.
Document
Downloads the project snapshot to the compute target from the Blob storage associated with the workspace.
Builds a Docker image corresponding to each step in the pipeline.
Downloads the Docker image for each step to the compute target from the container registry.
Configures access to Dataset and OutputFileDatasetConfig objects. For as_mount() access mode, FUSE is used to provide virtual access. If mount isn't supported or if the user specified access as as_upload(), the data is instead copied to the compute target.
Runs the step in the compute target specified in the step definition.
Creates artifacts, such as logs, stdout and stderr, metrics, and output specified by the step. These artifacts are then uploaded and kept in the user's default datastore.
The above points are also available in the document link shared.
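As a sketch of what dynamic creation typically looks like in the SDK (the cluster name and VM size are taken from the question; the try/except get-or-create pattern is a common convention, not something quoted from the linked document):
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

# Reuse the cluster if it already exists in the workspace, otherwise create it at runtime.
try:
    cpu_cluster = ComputeTarget(workspace=ws, name='test-cluster')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=1)
    cpu_cluster = ComputeTarget.create(ws, 'test-cluster', compute_config)
    cpu_cluster.wait_for_completion(show_output=True)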

Does MLflow allow to log artifacts from remote locations like S3?

My setting
I have developed an environment for ML experiments that looks like the following: training happens in the AWS cloud with SageMaker Training Jobs. The trained model is stored in the /opt/ml/model directory, which is reserved by SageMaker to pack models as a .tar.gz in SageMaker's own S3 bucket. Several evaluation metrics are computed during training and testing, and recorded to an MLflow infrastructure consisting of an S3-based artifact store (see Scenario 4). Note that this is a different S3 bucket than SageMaker's.
A very useful feature from MLflow is that any model artifacts can be logged to a training run, so data scientists have access to both metrics and more complex outputs through the UI. These outputs include (but are not limited to) the trained model itself.
A limitation is that, as I understand it, the MLflow API for logging artifacts only accepts as input a local path to the artifact itself, and will always upload it to its artifact store. This is suboptimal when the artifacts are stored somewhere outside MLflow, as you have to store them twice. A transformer model may weigh more than 1GB.
My questions
Is there a way to pass an S3 path to MLflow and make it count as an artifact, without having to download it locally first?
Is there a way to avoid pushing a copy of an artifact to the artifact store? If my artifacts already reside in another remote location, it would be ideal to just have a link to such location in MLflow and not a copy in MLflow storage.
You can use a Tracking Server with S3 as the artifact store backend.
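As for linking rather than copying: as the question notes, log_artifact expects a local file and uploads it, so a common workaround is to record the existing S3 location as run metadata instead. A minimal sketch with placeholder URIs (recording the URI as a tag is a convention, not an official "remote artifact" feature):
import mlflow

# Placeholder tracking server (Scenario 4: remote server with an S3 artifact store).
mlflow.set_tracking_uri("http://my-tracking-server:5000")

with mlflow.start_run():
    mlflow.log_metric("test_accuracy", 0.93)  # dummy value for illustration
    # Instead of re-uploading the 1GB+ model, record where SageMaker already stored it.
    mlflow.set_tag("model_s3_uri", "s3://sagemaker-bucket/jobs/my-job/output/model.tar.gz")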

Can we use data directly from RDS or df as a data source for training job in Sagemaker, rather than pulling it from from s3 or EFS?

I am using the SageMaker platform for model development and deployment. Data is read from RDS tables and then split into train and test DataFrames.
To create the training job in SageMaker, I found that it only accepts S3 and EFS as data sources. That means I would need to write the train and test data back to S3, which duplicates the data between RDS and S3.
I would like to pass the DataFrame from RDS directly as a parameter to the training job code. Is there any way to pass a DataFrame to the fit method?
image="581132636225.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-ols-model:latest"
model_output_folder = "model-output"
print(image)
tree = sagemaker.estimator.Estimator(
image,
role,
1,
"ml.c4.2xlarge",
output_path="s3://{}/{}".format(sess.default_bucket(), model_output_folder),
sagemaker_session=sess,
)
**tree.fit({'train': "s3_path_having_test_data"}, wait=True)**
The training data must be read from Amazon S3, Amazon EFS or Amazon FSx for Lustre.
One advantage of this is being able to reproduce your training results later on, as the input data is frozen in time (unless deleted), as opposed to a live DB.
Typical code:
import os
import boto3

train_df.to_csv("train.csv", header=False, index=False)
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/train.csv")
).upload_file("train.csv")
s3_path_having_test_data = "s3://{}/{}/train".format(bucket, prefix)
tree.fit({'train': s3_path_having_test_data}, wait=True)

Google colab cache google drive content

I have a dataset on Google Drive that's about 20GB.
I use a generator to pull the dataset into my Keras/TF models, and the overhead of loading the files (for every batch) is insane.
I want to prefetch the content in one operation and then simply fetch it from the local VM disk.
I tried this:
from google.colab import drive
drive.mount('/content/drive')
!mkdir -p $RAW_NOTEBOOKS_DIR
!cp $RAW_NOTEBOOKS_DIR $LOCAL_NOTEBOOKS_DIR
However, this snippet finishes executing instantly, so it obviously didn't download the data, which was the intent of the cp command (copying from Drive to local).
Is this at all possible?
RAW_NOTEBOOKS_DIR = "/content/drive/My\ Drive/Colab\ Notebooks"
There's a good example on Google Codelabs for doing this; they write the dataset to local TFRecord files:
https://codelabs.developers.google.com/codelabs/keras-flowers-tpu/#0
you can find more info here:
https://keras.io/examples/keras_recipes/tfrecord/
So instead of reading the data from Google Drive every time, you only need to read it once, write it to the local VM as TFRecords, and then pass those to the model for training.
If you follow the guides, it's pretty straightforward.
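A minimal sketch of that one-time conversion, assuming an image dataset sitting in a Drive folder (all paths and the feature layout here are hypothetical; adapt them to your data):
import tensorflow as tf

drive_dir = "/content/drive/My Drive/my-dataset"   # hypothetical Drive folder
local_tfrecord = "/content/my-dataset.tfrecord"    # written once to the fast local VM disk

# One slow pass over Drive: serialize every sample into a single local TFRecord file.
with tf.io.TFRecordWriter(local_tfrecord) as writer:
    for path in tf.io.gfile.glob(f"{drive_dir}/*.jpg"):
        image_bytes = tf.io.gfile.GFile(path, "rb").read()
        example = tf.train.Example(features=tf.train.Features(feature={
            "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        }))
        writer.write(example.SerializeToString())

# Training then streams from local disk instead of hitting Drive for every batch.
dataset = tf.data.TFRecordDataset(local_tfrecord).batch(32).prefetch(tf.data.AUTOTUNE)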

Reset hadoop aws keys to upload to another s3 bucket under different username

Sorry for the horrible question title, but here is my scenario:
I have a PySpark Databricks notebook in which I am loading other notebooks.
One of these notebooks sets some Redshift configuration for reading data from Redshift (some temp S3 buckets). I cannot change any of this configuration.
Under this configuration, both of these return True. This becomes relevant later, when I try to reset the keys.
sc._jsc.hadoopConfiguration().get("fs.s3n.awsAccessKeyId") == None
sc._jsc.hadoopConfiguration().get("fs.s3n.awsSecretAccessKey") == None
I have an Apache Spark model which I need to store in my own S3 bucket, which is a different bucket than the one configured for Redshift.
I am pickling other objects and storing them in AWS using boto3, and that works properly, but I don't think Spark models can be pickled like other objects. So I have to use the model's save method with an S3 URL, and for that I am setting AWS credentials like this, which works (as long as no one else on the same cluster is messing with the AWS configuration):
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId",
AWS_ACCESS_KEY_ID)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)
After I save this model, I also need to read other data from Redshift, and here it fails with the following error. I think the Redshift S3 configuration gets changed by the code above.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1844.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1844.0 (TID 63816, 10.0.63.188, executor 3): com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 3219CD268DEE5F53; S3 Extended Request ID: rZ5/zi2B+AsGuKT0iW1ATUyh9xw7YAt9RULoE33WxTaHWUWqHzi1+0sRMumxnnNgTvNED30Nj4o=), S3 Extended Request ID: rZ5/zi2B+AsGuKT0iW1ATUyh9xw7YAt9RULoE33WxTaHWUWqHzi1+0sRMumxnnNgTvNED30Nj4o=
Now my question is: why am I not able to read the data again? How can I reset the Redshift S3 configuration back to the way it was before I set the keys explicitly to save the model to S3?
What I also don't understand is that the AWS values were initially None, but when I try to reset them to None myself, it returns an error saying
The value of property fs.s3n.awsAccessKeyId must not be null
Right now I am considering a workaround in which I save the model locally on Databricks, zip it, and upload the zip to S3, but that is just a patch. I would like to do it properly.
Sorry for using the quote box for the code; multiline code formatting was not working for some reason.
Thank you in Advance!!!
Re-import the notebook that sets up the Redshift connectivity, or find where it is set and copy that code.
If you don't have privileges to modify the notebooks you are importing, then I'd guess you also don't have privileges to set IAM roles on the cluster. If you could use roles, you wouldn't need AWS keys at all.
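If re-running the setup notebook isn't convenient, a rough alternative sketch (not something the answer above prescribes) is to snapshot the existing values before overwriting them and restore them afterwards; Hadoop's Configuration.unset() covers keys that were originally absent, since set() rejects None as the question notes. The bucket path below is a placeholder:
conf = sc._jsc.hadoopConfiguration()

# Snapshot whatever is there before overwriting (these were None in the question).
old_key = conf.get("fs.s3n.awsAccessKeyId")
old_secret = conf.get("fs.s3n.awsSecretAccessKey")

conf.set("fs.s3n.awsAccessKeyId", AWS_ACCESS_KEY_ID)
conf.set("fs.s3n.awsSecretAccessKey", AWS_SECRET_ACCESS_KEY)
model.save("s3n://my-own-bucket/models/my-model")  # placeholder destination

# Restore: set() rejects None, so unset keys that were originally absent.
for name, old in [("fs.s3n.awsAccessKeyId", old_key), ("fs.s3n.awsSecretAccessKey", old_secret)]:
    if old is None:
        conf.unset(name)
    else:
        conf.set(name, old)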
