I'm working on an ML project that uses AWS Lambda for building models and generating predictions. The Lambdas are written in Python and use several ML libraries such as pandas, numpy, and scikit-learn.
These Lambdas use shared code that is packaged as a Lambda layer.
I use AWS CDK for deployment. The CDK code is written in TypeScript; don't ask why I mix Python and TypeScript, it's not relevant here.
The size of the package (Lambda code + layer) exceeds the maximum allowed 250 MB because of the ML libraries.
After AWS announced support for container image Lambdas, I decided to try it out to get around the 250 MB limit. However, I didn't find any good example that fits my situation, so I'm trying to build it myself.
The CDK code looks like this:
...
// Create a lambda layer from code
// Code is located in lambda-code/ml directory and it looks
// like any Python package with main ML and DB connection functions
const mlLayer = new PythonLayerVersion(this, 'mlLayer', {
  entry: './lambda-code/ml/',
})
...
// Lambda function is specified like
const classifyTransactionLambda = new DockerImageFunction(this, 'classifyTransactionLambda', {
  code: DockerImageCode.fromImageAsset('./lambda-code/classify'),
  memorySize: 512,
  layers: [mlLayer],
  tracing: Tracing.ACTIVE,
  environment: {
    BUCKET_NAME: mlModelsBucket.bucketName,
    ENV: env
  }
});
...
The structure of the code looks like this:
Dockerfile in classify lambda:
# Use the python lambda image from AWS ECR
FROM public.ecr.aws/lambda/python:3.7
COPY requirements.txt ./
RUN pip3 install -r requirements.txt
COPY index.py ./
CMD ["index.classify_transaction_handler"]
When I run cdk deploy I'm getting the following error:
This lambda function uses a runtime that is incompatible with this layer (FROM_IMAGE is not in [python3.7])
Has anyone run into a problem like this? Does this error mean that the mlLayer layer is not compatible with the classifyTransactionLambda function?
Any help would be much appreciated!
At this point, the Lambda documentation states:
Functions defined as container images do not support layers. When you build a container image, you can package your preferred runtimes and dependencies as a part of the image.
https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
So I modified my build to copy all of the layer/library code into each Lambda function's directory before building the image.
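As an illustration (the exact script and paths will differ per project), the copy step can be as simple as a small helper run before cdk deploy:
# copy_shared_code.py - run before `cdk deploy` so each function's Docker build
# context already contains the shared ML package instead of relying on a layer.
# Paths are illustrative, based on the directory layout shown above.
import shutil
from pathlib import Path

SHARED = Path("lambda-code/ml")                  # the package that used to be the layer
FUNCTION_DIRS = [Path("lambda-code/classify")]   # add the other function directories here

for fn_dir in FUNCTION_DIRS:
    target = fn_dir / "ml"
    shutil.rmtree(target, ignore_errors=True)    # drop any stale copy
    shutil.copytree(SHARED, target)              # shared code becomes part of the image context
Each function's Dockerfile then needs a corresponding COPY for the ml/ directory, and the layers property is dropped from the DockerImageFunction, so the shared code ends up inside the image itself.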
I want to use this repo. I have created and activated a virtualenv and installed the required dependencies.
I get an error when I run pytest.
The file binance_cdk/app.py describes the following tasks:
App (PSVM method) entry point of the program.
Note:
Steps to set up CDK:
1) Install npm.
2) cdk init (creates an empty project).
3) Add in your infrastructure code.
4) Run cdk synth.
5) cdk bootstrap <aws_account>/
6) Run cdk deploy ---> this creates a CloudFormation template (.yml) and the AWS resources are created as per the mentioned stack.
I'm stuck on step 3: what do I add in this infrastructure code? And if I want to use this with Amazon SageMaker, which I am not familiar with, do I even bother doing this in my local terminal, or do I do the whole process on SageMaker regardless?
Thank you in advance for your time and answers!
The infrastructure code is the Python code that you write for the resources you want to provision. In the example you provided, the infra code creates a Lambda function (a minimal sketch follows the links below). You can do this locally on your machine; the question is what you want to achieve with SageMaker. If you want to create an endpoint, then follow the CDK Python docs for SageMaker to identify the steps for creating one. Here are two guides: the first is an introduction to the AWS CDK and getting started; the second is an example of using the CDK with SageMaker to create an endpoint for inference.
CDK Python Starter: https://towardsdatascience.com/build-your-first-aws-cdk-project-18b1fee2ed2d
CDK SageMaker Example: https://github.com/philschmid/cdk-samples/tree/master/sagemaker-serverless-huggingface-endpoint
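For reference, here is a minimal sketch (CDK v2, Python) of infrastructure code that just provisions a Lambda function, similar to what the linked example does; construct names and asset paths below are placeholders, not taken from the binance_cdk repo:
# infra_stack.py - illustrative only; names and paths are placeholders.
from aws_cdk import Stack, aws_lambda as _lambda
from constructs import Construct

class MyInfraStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The resource you want CloudFormation to create: a simple Python Lambda.
        _lambda.Function(
            self, "ExampleFunction",
            runtime=_lambda.Runtime.PYTHON_3_9,
            handler="index.handler",                 # index.py containing a handler() function
            code=_lambda.Code.from_asset("lambda"),  # local directory with index.py
        )
cdk synth turns this class into a CloudFormation template and cdk deploy creates the resources; a SageMaker endpoint is defined the same way, just using the SageMaker constructs shown in the second guide.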
Is there any way to shrink the size of a deployment package in Python?
I'm using fbprophet and other libraries that push the package past the Lambda max size limit, so can we shrink those libraries and deploy the .zip to AWS Lambda?
Lambda Layers sound like what you want: they allow you to have a separate package with all your dependencies, which you can then reference in your code.
From their docs:
Layers let you keep your deployment package small, which makes development easier. You can avoid errors that can occur when you install and package dependencies with your function code. For Node.js, Python, and Ruby functions, you can develop your function code in the Lambda console as long as you keep your deployment package under 3 MB.
For more info on layers, see: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
The requirement is that I have to invoke a SageMaker endpoint from Lambda to get predictions (which is easy), but I also have to do some extra processing for variable importance using packages such as XGBoost and SHAP.
I am able to hit the endpoint and get variable importance using the SageMaker Jupyter notebook. Now, I want to replicate the same thing on AWS Lambda.
1) How do I run Python code on AWS Lambda with package dependencies for pandas, XGBoost and SHAP (total package size greater than 500 MB)? The unzipped deployment package size is greater than 250 MB, so Lambda does not allow the deployment. I even tried creating the Lambda function from Cloud9 and got the same error due to size restrictions. I have also tried Lambda layers, but no luck.
2) Is there a way for me to run code with such big packages on or through Lambda, bypassing the 250 MB deployment package size limit?
3) Is there a way to trigger a SageMaker notebook execution from Lambda that would do the calculations and return the output back to Lambda?
Try uploading your dependencies as a Lambda layer. FYI: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
In addition to using multiple layers for your dependencies, you may want to shrink the *.so files with the Linux strip command, which discards symbols from compiled object files that may not be necessary in production.
To strip all *.so files:
1) Use a Linux/Docker container with access to your dependencies directory.
2) cd to your dependencies directory.
3) Run:
find . -name "*.so" -exec strip {} \;
This executes the strip command on every *.so file in the current working directory, recursively.
It helped me reduce one of my dependencies' object files from 94 MB to just 7 MB.
I found the 250 MB limit on AWS Lambda package size to be draconian. A single file, libxgboost.so from the xgboost package, is already around 140 MB, which leaves only about 110 MB for everything else. That makes AWS Lambda useless for anything but simple "hello world" stuff.
As an ugly workaround, you can store the xgboost package somewhere on S3, copy it to the /tmp folder from the Lambda invocation routine, and point your Python path to it. The allowed /tmp space is a bit higher (512 MB), so it might work.
I am not sure, though, whether the /tmp folder is cleaned between Lambda invocations.
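A rough sketch of that workaround (bucket name, key, and paths below are placeholders, not a tested recipe):
# Illustrative only: download the heavy packages from S3 into /tmp on demand
# and make them importable. Bucket, key and paths are made up.
import os
import sys
import zipfile

import boto3

PKG_DIR = "/tmp/python-deps"

def _load_heavy_deps():
    # /tmp survives for warm invocations of the same container, so the download
    # only happens on cold starts; the directory check covers both cases.
    if not os.path.isdir(PKG_DIR):
        boto3.client("s3").download_file(
            "my-ml-artifacts-bucket", "deps/xgboost-deps.zip", "/tmp/deps.zip"
        )
        with zipfile.ZipFile("/tmp/deps.zip") as zf:
            zf.extractall(PKG_DIR)
    if PKG_DIR not in sys.path:
        sys.path.insert(0, PKG_DIR)

def handler(event, context):
    _load_heavy_deps()
    import xgboost  # only importable after sys.path has been patched
    return {"xgboost_version": xgboost.__version__}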
You can try using SageMaker Inference Pipelines to do the pre-processing before making the actual predictions. Basically, you can use the same pre-processing script used for training for inference as well. When the pipeline model is deployed, the full set of containers with the pre-processing tasks installs and runs on each EC2 instance in the endpoint or transform job. Feature processing and inference are executed with low latency because the containers deployed in an inference pipeline are co-located on the same EC2 instance (endpoint). You can refer to the documentation here; a rough sketch of the deployment follows the links below.
The following blog posts/notebooks cover this feature in detail:
Preprocess input data before making predictions using Amazon SageMaker inference pipelines and Scikit-learn
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/inference_pipeline_sparkml_xgboost_abalone/inference_pipeline_sparkml_xgboost_abalone.ipynb
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/advanced_functionality/inference_pipeline_sparkml_blazingtext_dbpedia/inference_pipeline_sparkml_blazingtext_dbpedia.ipynb
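As a rough sketch of what such a pipeline deployment looks like with the SageMaker Python SDK (model artifacts, image URI, and role ARN below are placeholders, not taken from the linked notebooks):
# Illustrative only: chain a Scikit-learn pre-processing container with an
# XGBoost model behind a single endpoint. Names, paths and ARNs are placeholders.
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

preprocessor = SKLearnModel(
    model_data="s3://my-bucket/preprocess/model.tar.gz",  # fitted pre-processing artifact
    role=role,
    entry_point="preprocess.py",          # the same script used during training
    framework_version="0.23-1",
)

xgb = Model(
    image_uri="<xgboost-inference-image-uri>",         # placeholder container image
    model_data="s3://my-bucket/xgboost/model.tar.gz",   # trained model artifact
    role=role,
)

pipeline = PipelineModel(
    name="preprocess-then-xgboost",
    role=role,
    models=[preprocessor, xgb],  # containers are co-located on the same instance
)
pipeline.deploy(initial_instance_count=1, instance_type="ml.m5.large")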
I have built a deployment package with pandas, numpy, etc. for my sample code to run. The size is about 46 MB. My question is: do I have to zip the code and upload the entire deployment package to AWS S3 again for every simple code change?
Is there any other way by which I can avoid the 45 MB S3 upload every time and just upload the few KB of code?
I would recommend creating a layer in AWS Lambda.
First you need to create an instance of Amazon Linux (using the AMI specified in https://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html - at the time of writing (26th of March 2019) it is amzn-ami-hvm-2017.03.1.20170812-x86_64-gp2) or a Docker container with the same environment as the Lambda execution environment.
I personally do it with Docker.
For example, to create a layer for Python 3.6, I would run
sudo docker run --rm -it -v "$PWD":/var/task lambci/lambda:build-python3.6 bash
Then I would create a folder python/lib/python3.6/site-packages in /var/task inside the Docker container (so it will be accessible later in the same directory on the host machine where I started Docker),
run pip3 install <your packages here> -t python/lib/python3.6/site-packages,
zip up the python folder, upload it as a layer, and use it in my AWS Lambda function.
NB! The paths in the zip file should look like "python/lib/python3.6/site-packages/{your package names}".
Now the heavy dependencies are in a separate layer and you don't have to re-upload them every time you update the function; you only need to update the code.
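If you prefer to script the upload step, a minimal boto3 sketch could look like this (layer name, zip path, and runtime below are placeholders):
# Illustrative only: publish the zipped python/ folder as a new layer version.
import boto3

lambda_client = boto3.client("lambda")

with open("my-deps-layer.zip", "rb") as f:  # the zip containing the python/ folder
    response = lambda_client.publish_layer_version(
        LayerName="my-heavy-deps",          # placeholder layer name
        Content={"ZipFile": f.read()},
        CompatibleRuntimes=["python3.6"],
    )

print(response["LayerVersionArn"])  # attach this ARN to your Lambda function
For very large archives you may need to upload the zip to S3 first and pass S3Bucket/S3Key in Content instead of ZipFile.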
Split the application into two parts. The first part is the Lambda function, which includes only your application code. The second part is a Lambda layer, which includes only the dependencies and is uploaded once.
A Lambda layer can be uploaded and attached to the Lambda function. When your function is invoked, AWS combines the Lambda function with the Lambda layer and then executes the entire package.
When updating your code, you only need to update the Lambda function. Since that package is much smaller, you can edit it using the web editor, or you can zip it and upload it directly to Lambda using the CLI tools.
Example: aws lambda update-function-code --function-name Yourfunctionname --zip-file fileb://Lambda_package.zip
Here are video instructions and examples on creating a Lambda layer for dependencies. It demonstrates using the pymysql library, but you can install any of your libraries there.
https://geektopia.tech/post.php?blogpost=Create_Lambda_Layer_Python
I want to package and deploy a simple project on AWS Lambda, using Zappa, but without the Zappa requirements overhead.
Given this simple scenario:
lambda_handler.py
def handle(event, context):
    print('Hello World')
I have a deploy.sh script that does that:
#!/usr/bin/env bash
source venv/bin/activate
zappa package -o lambda.zip
aws lambda update-function-code --function-name lambda-example --zip-file fileb://./lambda.zip
This works, BUT the final lambda.zip is way bigger than it needs to be:
I know that for this specific case Zappa is not needed, but in the real project I'm using some libraries that require https://github.com/Miserlou/lambda-packages, and using Zappa is the simplest way to install them.
How do I generate the Python Lambda package without this overhead?
First, you can use slim_handler, which allows you to upload packages larger than 50 MB. Second, as @bddb already mentioned, you can exclude some files such as .pyc, zip, etc. with the exclude property. Please find more details here:
https://github.com/Miserlou/Zappa#package
Here is an example of how your zappa_settings.json could look:
{
    "dev": {
        ...
        "slim_handler": false, // Useful if project >50M. Set true to just upload a small handler to Lambda and load actual project from S3 at runtime. Default false.
        "exclude": ["*.gz", "*.rar"], // A list of regex patterns to exclude from the archive. To exclude boto3 and botocore (available in an older version on Lambda), add "boto3*" and "botocore*".
    }
}