The requirement is that I have to invoke a SageMaker endpoint from Lambda to get predictions (which is easy), but I also have to do some extra processing for variable importance using packages such as XGBoost and SHAP.
I am able to hit the endpoint and get variable importance using the SageMaker Jupyter notebook. Now, I want to replicate the same thing on AWS lambda.
1) How can I run Python code on AWS Lambda with package dependencies for Pandas, XGBoost and SHAP (total package size greater than 500 MB)? The unzipped deployment package size is greater than 250 MB, so Lambda does not allow the deployment. I even tried creating the Lambda function from Cloud9 and got the same error due to the size restriction. I have also tried Lambda layers, but no luck.
2) Is there a way for me to run the code with such big packages on or through Lambda, bypassing the 250 MB deployment package size limit?
3) Is there a way to trigger a SageMaker notebook execution through Lambda which would do the calculations and return the output back to Lambda?
Try uploading your dependencies to a Lambda layer. FYI: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
In addition to using multiple layers for your dependencies, you may want to reduce the size of the *.so files with the Linux strip command, which discards symbols from compiled object files that may not be necessary in production.
In order to strip all *.so files:
use a Linux/Docker container with access to your dependencies directory
cd to your dependencies directory
Run
find . -name "*.so" -exec strip {} \;
This will execute the strip command on every *.so file in the current working directory, recursively.
It helped me reduce one of my dependencies from 94 MB to just 7 MB.
I found the 250 MB limit on AWS Lambda deployment size to be draconian. A single file, libxgboost.so from the xgboost package, is already around 140 MB, which leaves only 110 MB for everything else. That makes AWS Lambda useless for anything but simple "hello world" stuff.
As an ugly workaround you can store the xgboost package somewhere on S3, copy it to the /tmp folder from the Lambda invocation routine, and point your Python path to it. The allowed /tmp space is a bit higher (512 MB), so it might work.
I am not sure, though, whether the /tmp folder is cleaned between Lambda function runs.
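A minimal sketch of that /tmp workaround, assuming a zipped site-packages archive was uploaded to S3 beforehand (the bucket name, key, and the xgboost import at the end are placeholders):

```python
import os
import sys
import zipfile

PKG_DIR = "/tmp/python-deps"  # /tmp is larger than the 250 MB unzipped package limit


def fetch_deps_from_s3(bucket, key, archive_path="/tmp/deps.zip"):
    """Download a zipped site-packages archive from S3 (hypothetical bucket/key)."""
    import boto3  # imported lazily so the rest of the module works without it
    boto3.client("s3").download_file(bucket, key, archive_path)
    return archive_path


def add_deps_to_path(archive_path, target_dir=PKG_DIR):
    """Extract the archive once per container and prepend it to sys.path."""
    if not os.path.isdir(target_dir):  # a warm container may have it extracted already
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(target_dir)
    if target_dir not in sys.path:
        sys.path.insert(0, target_dir)


def handler(event, context):
    add_deps_to_path(fetch_deps_from_s3("my-deps-bucket", "xgboost-deps.zip"))
    import xgboost  # import only after sys.path has been patched
    return {"xgboost_version": xgboost.__version__}
```

Since /tmp only survives between invocations when the same container is reused (a warm start), the code treats an already-extracted directory as a cache but never relies on it being there.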
You can try using SageMaker Inference Pipelines to do pre-processing before making the actual predictions. Basically, you can use the same pre-processing script used for training for inference as well. When the pipeline model is deployed, the full set of containers with pre-processing tasks installs and runs on each EC2 instance in the endpoint or transform job. Feature processing and inference are executed with low latency because the containers deployed in an inference pipeline are co-located on the same EC2 instance (endpoint). You can refer to the documentation here.
The following blog posts/notebooks cover this feature in detail:
Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/inference_pipeline_sparkml_xgboost_abalone/inference_pipeline_sparkml_xgboost_abalone.ipynb
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/inference_pipeline_sparkml_blazingtext_dbpedia/inference_pipeline_sparkml_blazingtext_dbpedia.ipynb
Related
I have to do large scale feature engineering on some data. My current approach is to spin up an instance using SKLearnProcessor and then scale the job by choosing a larger instance size or increasing the number of instances. I require using some packages that are not installed on Sagemaker instances by default and so I want to install the packages using .whl files.
Another hurdle is that the Sagemaker role does not have internet access.
import boto3
import sagemaker
from sagemaker import get_execution_role
from sagemaker.sklearn.processing import SKLearnProcessor

sess = sagemaker.Session()
sess.default_bucket()
region = boto3.session.Session().region_name
role = get_execution_role()

sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                     role=role,
                                     sagemaker_session=sess,
                                     instance_type="ml.t3.medium",
                                     instance_count=1)

sklearn_processor.run(code='script.py')
Attempted resolutions:
1. Upload the packages to a CodeCommit repository and clone the repo into the SKLearnProcessor instances. Failed with the error fatal: could not read Username for 'https://git-codecommit.eu-west-1.amazonaws.com': No such device or address. I tried cloning the repo into a SageMaker notebook instance and it works, so it's not a problem with my script.
2. Use a bash script to copy the packages from S3 using the CLI. The bash script I used is based off this post. But the packages never get copied, and an error message is not thrown.
3. I also looked into using the package s3fs, but it didn't seem suitable for copying the wheel files.
Alternatives
My client is hesitant to spin up containers from custom docker images. Are there any alternatives?
2. Use a bash script to copy the packages from s3 using the CLI. The bash script I used is based off this post. But the packages never get copied, and an error message is not thrown.
This approach seems sound.
You may be better off overriding the command field on the SKLearnProcessor to /bin/bash and running a bash script, say install_and_run_my_python_code.sh, that installs the wheels containing your Python dependencies and then runs your main Python entry-point script.
Additionally, instead of downloading your code with AWS S3 calls in a bash script, you could use a ProcessingInput to download it; this is what the SKLearnProcessor itself does to distribute your entry-point script.py across all the instances.
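A sketch of what that entry-point script might contain. All paths and file names here are assumptions: the wheels are delivered to the container by a ProcessingInput (so no internet access is needed at install time), and the destination directories are placeholders you would match to your own ProcessingInput configuration.

```shell
#!/bin/bash
# install_and_run_my_python_code.sh -- hypothetical entry point for the processing job
set -e  # stop immediately if any step fails

# Wheels staged into the container by a ProcessingInput (destination is an assumption)
pip install /opt/ml/processing/input/deps/*.whl

# Hand off to the real feature-engineering script
python /opt/ml/processing/input/code/script.py
```

The ProcessingInput for the wheels would point its source at an S3 prefix holding the .whl files and its destination at /opt/ml/processing/input/deps, so the install step reads only from the local filesystem.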
Keen to run a library of python code, which uses "RAY", on AWS Lambda / a serverless infrastructure.
Is this possible?
What I am after:
- Ability to run python code (with RAY library) on serverless (AWS Lambda), utilising many CPUs/GPUs
- Run the code from a local machine IDE (PyCharm)
- Have graphics (eg. Matplotlib) display on the local machine / in the local browser
One consideration is that RAY does not run on Windows.
Please let me know if this is doable (and if possible, best approach to set up).
Thank you!
AWS Lambda
AWS Lambda doesn't have GPU support and is poorly suited for distributed training of neural networks. Its maximum run time is 15 minutes, and it doesn't have enough memory to hold a dataset (maybe only a small part of it).
You may want AWS Lambda for lightweight inference jobs after your neural network/ML model was trained.
As AWS Lambda autoscales, it would be well suited for tasks like single-image classification with an immediate response for multiple users.
Ray
What you should be after for parallel and distributed training are AWS EC2 instances. For deep learning, p3 instances might be a good choice due to their Tesla V100 offering. For more CPU-heavy loads, c5 instances might be a good fit.
When it comes to Ray, it indeed doesn't support Windows, but it does support Docker (see the installation guide). You may log into a container with Ray preconfigured, after mounting/copying your source code into the container, with this command:
docker run -t -i ray-project/deploy
and run it from there. For Docker installation on Windows see here. It should be doable this way. If not, use some other Docker image like ubuntu, set up everything you need (Ray and other libraries) and run from within the container (or better yet, make the container executable so it outputs to your console as you wanted).
Failing that, you may manually log into a small AWS EC2 instance, set up your environment there, and run from it as well.
You may wish to check this friendly introduction to settings and the Ray documentation for info on how to configure your exact use case.
import boto3, json

# pass profile to boto3
boto3.setup_default_session(profile_name='default')

lam = boto3.client('lambda', region_name='us-east-1')

payload = {
    "arg1": "val1",
    "arg2": "val2"
}
payloadJSON = json.dumps(payload)

lam.invoke(FunctionName='some_lambda', InvocationType='Event', LogType='None', Payload=payloadJSON)
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/lambda.html#Lambda.Client.invoke
If you have a creds file, you can cat ~/.aws/credentials to find the profile name to use for the session setup.
https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
Is there any way to shrink the size of the deployment package in Python?
I'm using Fbprophet and other libraries, which makes the package exceed Lambda's maximum size limit. Can we shrink those libraries and deploy the .zip to AWS Lambda?
Lambda Layers sound like what you want - they allow you to have a separate package with all your dependencies, which you can then reference in your code.
From their docs:
Layers let you keep your deployment package small, which makes development easier. You can avoid errors that can occur when you install and package dependencies with your function code. For Node.js, Python, and Ruby functions, you can develop your function code in the Lambda console as long as you keep your deployment package under 3 MB.
For more info on layers see: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
I have built a deployment package with pandas, numpy, etc. for my sample code to run. The size is some 46 MB. My doubt is: do I have to zip my code and re-upload the entire deployment package to AWS S3 for every simple code update too?
Is there any other way by which I can avoid the 45 MB upload to S3 every time and just upload the few KBs of changed code?
I would recommend creating a layer in AWS lambda.
First you need to create an instance of Amazon Linux (using the AMI specified in https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html - at this time (26th of March 2019) it is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2) or a docker container with the same environment as the Lambda execution environment.
I personally do it with docker.
For example, to create a layer for Python 3.6, I would run
sudo docker run --rm -it -v "$PWD":/var/task lambci/lambda:build-python3.6 bash
Then I would create a folder python/lib/python3.6/site-packages in /var/task in the docker container (so it will be accessible later on in the corresponding directory on the host machine where I started docker),
run pip3 install <your packages here> -t python/lib/python3.6/site-packages,
then zip up the folder python, upload it as a layer, and use it in my AWS Lambda function.
NB! The paths in the zip file should look like "python/lib/python3.6/site-packages/{your package names}"
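The steps above, consolidated into the commands you would run inside that container (the package name is a placeholder):

```shell
# inside the lambci/lambda:build-python3.6 container, working in /var/task
mkdir -p python/lib/python3.6/site-packages

# install the dependencies into the layer directory layout
pip3 install pandas -t python/lib/python3.6/site-packages

# zip so that paths inside the archive start with python/lib/python3.6/site-packages/...
zip -r layer.zip python
```

The resulting layer.zip is what you upload as the layer.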
Now the heavy dependencies are in a separate layer and you don't have to re-upload them every time you update the function; you only need to update the code.
Split the application into two parts. The first part would be the lambda function, which includes only your application code. The second part is a lambda layer. The lambda layer will include only the dependencies and be uploaded once.
A lambda layer can be uploaded and attached to the lambda function. When your function is invoked, AWS will combine the lambda function with the lambda layer and then execute the entire package.
When updating your code, you will only need to update the lambda function. Since the package is much smaller you can edit it using the web editor, or you can zip it and upload it directly to lambda using the cli tools.
Example: aws lambda update-function-code --function-name Yourfunctionname --zip-file fileb://Lambda_package.zip
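The layer side has analogous CLI calls; a sketch, with layer name, runtime, and ARN as placeholders:

```shell
# publish the dependency layer once
aws lambda publish-layer-version --layer-name my-deps \
    --zip-file fileb://layer_package.zip --compatible-runtimes python3.8

# attach it to the function (use the LayerVersionArn returned by the previous call)
aws lambda update-function-configuration --function-name Yourfunctionname \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:my-deps:1
```

After that, code-only updates go through update-function-code and the layer is left untouched.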
Here are video instructions and examples on creating a lambda layer for dependencies. It demonstrates using the pymysql library, but you can install any of your libraries there.
https://geektopia.tech/post.php?blogpost=Create_Lambda_Layer_Python
I'm having some issues with a Dataflow pipeline defined with the Apache Beam Python SDK. If I step through my code it reaches the pipeline.run() step, which I assume means the execution graph was successfully defined. However, the job never registers on the Dataflow monitoring tool, which, from this, makes me think it never reaches the pipeline validation step.
I'd like to know more about what happens between these two steps to help in debugging the issue. I see output indicating packages in my requirements.txt and apache-beam are getting pip installed and it seems like something is getting pickled before being sent to Google's servers. Why is that? If I already have apache-beam downloaded, why download it again? What exactly is getting pickled?
I'm not looking for a solution to my problem here, just trying to understand the process better.
During graph construction, Dataflow checks for errors and any illegal operations in the pipeline. Once the checks pass, the execution graph is translated into JSON and transmitted to the Dataflow service, where the JSON graph is validated and becomes a job.
However, if the pipeline is executed locally, the graph is not translated to JSON or transmitted to the Dataflow service, so it will not show up as a job in the monitoring tool; it runs on the local machine [1]. You can follow the documentation to configure the local machine [2].
[1] https://cloud.google.com/dataflow/service/dataflow-service-desc#pipeline-lifecycle-from-pipeline-code-to-dataflow-job
[2] https://cloud.google.com/dataflow/pipelines/specifying-exec-params#configuring-pipelineoptions-for-local-execution
The packages from requirements.txt are downloaded using pip download and staged to the staging location. Dataflow uses this staging location as a cache: when pip install -r requirements.txt is called on the Dataflow workers, packages are looked up there first, which reduces calls to PyPI.
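The fork between the two execution paths is decided purely by the pipeline options. A minimal sketch of the two option sets (project id and bucket names are placeholders, and apache_beam itself is deliberately not imported here):

```python
# Options that send the pipeline to the Dataflow service: the graph is
# checked locally, translated to JSON, and submitted as a job.
dataflow_args = [
    "--runner=DataflowRunner",
    "--project=my-gcp-project",                   # placeholder project id
    "--staging_location=gs://my-bucket/staging",  # where packages get staged/cached
    "--temp_location=gs://my-bucket/temp",
]

# Options that run the same pipeline on the local machine: no JSON
# translation, no job in the monitoring tool.
local_args = ["--runner=DirectRunner"]


def is_remote(args):
    """True if these options would submit a job to the Dataflow service."""
    return "--runner=DataflowRunner" in args
```

With DirectRunner selected, pipeline.run() executes entirely in-process, which matches the symptom of a job never appearing in the monitoring tool.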