I have a GitHub repository containing an AWS Lambda function. I am currently using Travis CI to build, test, and then deploy this function to Lambda if all the tests succeed, using:
deploy:
provider: lambda
(other settings here)
My function has the following dependencies specified in its requirements.txt
Algorithmia
numpy
networkx
opencv-python
I have set the build script for Travis CI to build in the working directory using the command below, so that the dependencies are properly copied over to my AWS Lambda function.
pip install --target=$TRAVIS_BUILD_DIR -r requirements.txt
The problem is that while the build in Travis CI succeeds and everything is deployed to the Lambda function successfully, testing my Lambda function results in the following error:
Unable to import module 'mymodule':
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
My best guess as to why this is happening is that numpy is being built on the Ubuntu distribution of Linux that Travis CI uses, but the Amazon Linux that it runs on when executing as a Lambda function isn't able to run it properly. There are numerous forum posts and blog posts such as this one detailing that Python modules that need to build C/C++ extensions must be built on an EC2 instance.
My question is: it's a real hassle to add another complication to the CD pipeline and to mess around with EC2 instances. Has Amazon come up with a better way to do this (because there really should be one), or is there some way to have everything compiled properly in Travis CI or another CI solution?
Also, I suppose it's possible that I've misidentified the problem and that there is some other reason why importing numpy is failing. If anyone has suggestions on how to resolve this, that would be great!
EDIT: As suggested by @jordanm, it looks like it may be possible to load a Docker container with the amazonlinux image when running Travis CI and then perform my build and test inside that container. Unfortunately, while that is certainly easier than using EC2, I don't think I can use the normal Lambda deploy tools in Travis CI; I'll have to write my own deploy script using the AWS CLI, which is a bit of a pain. Any other ideas, or ways to make this smoother? Ideally I would be able to specify which Docker image my builds run on in Travis CI, as their default build environment already uses Docker... but they don't seem to support that functionality yet: https://github.com/travis-ci/travis-ci/issues/7726
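For what it's worth, the kind of deploy script I'd have to write would be something along these lines (just a sketch; the function name and region are placeholders, not what I actually use):
cd $TRAVIS_BUILD_DIR
zip -r lambda_deploy.zip . -x '*.git*'
aws lambda update-function-code \
    --function-name myFunction \
    --zip-file fileb://lambda_deploy.zip \
    --region us-west-1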
After quite a bit of tinkering I think I've found something that works. I thought I'd post it here in case others have the same problem.
I decided to use Wercker as they have quite a generous free tier and allow you to customize the Docker image for your builds.
It turns out there is a Docker image that has been created to replicate the exact environment that Lambda functions are executed in! See: https://github.com/lambci/docker-lambda When running your builds in this Docker container, extensions are built properly so they can execute successfully on Lambda.
In case anyone does want to use Wercker, here's the wercker.yml I used; it may be helpful as a template:
box: lambci/lambda:build-python3.6
build:
  steps:
    - script:
        name: Install Dependencies
        code: |
          pip install --target=$WERCKER_SOURCE_DIR -r requirements.txt
          pip install pytest
    - script:
        name: Test code
        code: pytest
    - script:
        name: Cleaning up
        code: find $WERCKER_SOURCE_DIR \( -name \*.pyc -o -name \*.pyo -o -name __pycache__ \) -prune -exec rm -rf {} +
    - script:
        name: Create ZIP
        code: |
          cd $WERCKER_SOURCE_DIR
          zip -r $WERCKER_ROOT/lambda_deploy.zip . -x *.git*
deploy:
  box: golang:latest
  steps:
    - arjen/lambda:
        access_key: $AWS_ACCESS_KEY
        secret_key: $AWS_SECRET_KEY
        function_name: yourFunction
        region: us-west-1
        filepath: $WERCKER_ROOT/lambda_deploy.zip
Although I appreciate you may not want to add further complications to your project, you could potentially use a Python-focused Lambda management tool for setting up your builds and deployments, say something like Gordon. You could also just use this tool to do your deployment from inside the Amazon Linux Docker container running within Travis.
If you wish to change CI providers, CodeShip allows you to build with any Docker container of your choice, and then deploy to Lambda.
Wercker also runs full Docker-based builds and has many user-submitted deploy "steps", some of which support deployment to Lambda.
Related
I am aware of this popular topic; however, I am running into a different outcome when installing a Python app using pip with git+https versus python setup.py.
I am building a Docker image. Into an image that already contains several other Python apps, I am trying to install this custom webhook.
Using git+https
RUN /venv/bin/pip install git+https://github.com/alerta/alerta-contrib.git#subdirectory=webhooks/sentry
This seems to install the webhook the right way, as the relevant endpoint is later discoverable.
What is more, when I exec into the running container and search for the relevant files, I see the following:
./venv/lib/python3.7/site-packages/sentry_sdk
./venv/lib/python3.7/site-packages/__pycache__/alerta_sentry.cpython-37.pyc
./venv/lib/python3.7/site-packages/sentry_sdk-0.15.1.dist-info
./venv/lib/python3.7/site-packages/alerta_sentry.py
./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg-info
In my second approach, I just copy this directory locally, and in my Dockerfile I do:
COPY sentry /app/sentry
RUN /venv/bin/python /app/sentry/setup.py install
This does not install the webhook appropriately, and what is more, in the respective container I see a different file layout:
./venv/lib/python3.7/site-packages/sentry_sdk
./venv/lib/python3.7/site-packages/sentry_sdk-0.15.1.dist-info
./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg
./alerta_sentry.egg-info
./dist/alerta_sentry-5.0.0-py3.7.egg
(the sentry_sdk-related files are presumably irrelevant)
Why does the second approach fail to install the webhook appropriately?
Should these two options yield the same result?
What finally worked is the following
RUN /venv/bin/pip install /app/sentry/
I don't know the subtle differences between these two installation modes.
I did notice, however, that /venv/bin/python /app/sentry/setup.py install did not produce an alerta_sentry.py but only the .egg file, i.e. ./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg
On the other hand, /venv/bin/pip install /app/sentry/ unpacked (?) the .egg, creating the ./venv/lib/python3.7/site-packages/alerta_sentry.py
I also don't know why the second installation option (i.e. the one creating the .egg file) was not working at runtime.
I'm loading a pickled machine learning model in my Lambda handler, so I need sklearn (I get "ModuleNotFoundError: No module named 'sklearn'" if it's not included).
So I created a new deployment package in Docker with sklearn.
But when I tried to upload the new lambda.zip file, I could not save the Lambda function. I get the error: Unzipped size must be smaller than 262144000 bytes
I did some googling and found two suggestions: (1) using CFLAGS with pip and (2) using Lambda Layers.
I don't think Layers will work. Moving parts of my deployment package to layers won't reduce the total size (and AWS documentation states "The total unzipped size of the function and all layers can't exceed the unzipped deployment package size limit of 250 MB").
CFLAGS sounds promising, but I've never worked with CFLAGS before and I'm getting errors.
I'm trying to add the flags: -Os -g0 -Wl,--strip-all
Pre-CFLAGS, my Docker pip command was: pip3 install requests pandas s3fs datetime bs4 sklearn -t ./
First I tried: pip3 install requests pandas s3fs datetime bs4 sklearn -t -Os -g0 -Wl,--strip-all ./
That produced errors of the variety "no such option: -g"
Then I tried CFLAGS = -Os -g0 -Wl,--strip-all pip3 install requests pandas s3fs datetime bs4 sklearn -t ./ and CFLAGS = -Os -g0 -Wl,--strip-all
But they produced the error "CFLAGS: command not found"
Can anyone help me understand how to use CFLAGS?
Also, I'm familiar with the saying "beggars can't be choosers" so any advice would be appreciated.
That said, I'm a bit of a noob so if you could help me with CFLAGS in the context of my Docker deployment package workflow it'd be most appreciated.
My docker workflow is:
docker run -it olivierhervieu/amazonlinux-python36-onbuild
mkdir deploy
cd deploy
pip3 install requests pandas s3fs datetime bs4 sklearn -t ./
zip -r lambda.zip *
This kinda is an answer (I was able to shrink my deployment package and get my Lambda deployed) and kinda not an answer (I still don't know how to use CFLAGS).
A lot of googling eventually led me to this article which included a link to this list of modules that come pre-installed in the AWS Lambda Python environment.
My deployment package contained several of the modules that already exist in the AWS Lambda environment and thus do not need to be included in deployment packages.
The modules that saved the most space for me were Boto3 and Botocore. I didn't explicitly add these in my Docker environment but they made their way into my deployment package anyway (I'm guessing that S3FS depends on these modules and when installing S3FS they are also added).
I was also able to remove a lot of smaller modules (datetime, dateutil, docutils, six, etc).
With these modules removed, my package was under the 250 MB limit and I was able to deploy.
Were I still not under the limit - I wasn't sure if that would be enough - I was going to try another suggestion from the linked article above: removing .py files from the deployment package (you don't need both .pyc and .py files).
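If you do want to try that route, here's a rough sketch of the idea (assuming the package was built into ./deploy, and that your handler file, called lambda_function.py here as a placeholder, should be kept as-is):
cd deploy
python3 -m compileall -b .                                   # write .pyc files next to the .py sources
find . -name '*.py' -not -name 'lambda_function.py' -delete  # drop the sources, keep the handler
find . -type d -name '__pycache__' -exec rm -rf {} +         # the legacy-layout .pyc files remain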
Hope this helps with your Lambda deployment package size!
These days you would use a Docker container image for your Lambda, as its size can be up to 10 GB, which is far greater than traditional Lambda functions deployed using deployment packages and layers. From AWS:
You can now package and deploy AWS Lambda functions as a container image of up to 10 GB.
Thus you could create a Lambda container with sklearn plus any other files and dependencies that you require, up to a total size of 10 GB.
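A minimal sketch of such an image (the base image tag, handler file name, and handler entry point are assumptions; adjust them to your setup):
FROM public.ecr.aws/lambda/python:3.9
COPY requirements.txt .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
COPY lambda_function.py ${LAMBDA_TASK_ROOT}
CMD ["lambda_function.handler"]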
We faced this exact problem ourselves but with Spacy rather than sklearn.
You're going about it the right way by looking at not deploying packages already included in the AWS runtime, but note that sometimes this still won't get you under the limit (especially for ML purposes, where large models have to be included as part of the dependency).
In these instances, another option is to save any external static files (e.g. models) which are used by the library in a private S3 bucket and then read them in at runtime. For example, as described by this answer.
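As a minimal sketch of that pattern, assuming a pickled sklearn model in a private bucket (the bucket and key names here are made up):
import pickle
import boto3

s3 = boto3.client("s3")

# Load once at module level so warm invocations reuse the model
obj = s3.get_object(Bucket="my-model-bucket", Key="models/model.pkl")
model = pickle.loads(obj["Body"].read())

def handler(event, context):
    # 'features' is an illustrative event field, not part of any real schema
    return {"prediction": model.predict([event["features"]]).tolist()}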
Incidentally, if you're using the Serverless Framework to deploy your Lambdas, you should check out the serverless-python-requirements plugin, which allows you to implement the steps you've described, such as specifying packages not to deploy with the function and building 'slim' versions of the dependencies (automatically stripping out .so files, __pycache__ and dist-info directories, as well as .pyc and .pyo files).
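For example, a rough sketch of the relevant serverless.yml section (option names as I remember them from the plugin's docs; double-check against the plugin version you install):
custom:
  pythonRequirements:
    slim: true      # strip the dependencies as described above
    noDeploy:       # skip packages already provided by the Lambda runtime
      - boto3
      - botocore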
Good luck :)
We had the same problem and it was very difficult to make it work. We ended up buying this layer that includes scikit-learn, pandas, numpy and scipy.
https://www.awslambdas.com/layers/3/aws-lambda-scikit-learn-numpy-scipy-python38-layer
There is another layer that includes xgboost as well.
I found this article that mentions the use of CFLAGS. In the comments, a guy named Jesper explained how to use CFLAGS as quoted:
If anyone else doubts how to add the CFLAGS to pip here is how I did it:
Before running pip do this (I did this in Ubuntu):
export CFLAGS="$CFLAGS -Os -g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib"
export CXXFLAGS="$CXXFLAGS -Os -g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib"
Then run pip like (for numpy, but do whatever other package you want):
pip install numpy --no-cache-dir --compile --global-option=build_ext
I have a simple batch Apache Beam pipeline. When run locally with DirectRunner it works fine, but with DataflowRunner it fails to install one dependency from requirements.txt. The reason is that the specific package is for Python 3, and the workers are (apparently) running the pipeline with Python 2.
The pipeline is done and working fine locally (DirectRunner) with Python 3.7.6. I'm using the latest Apache Beam SDK (apache-beam==2.16.0 in my requirements.txt).
One of the modules required by my pipeline is:
from lbcapi3 import api
So my requirements.txt sent to GCP has a line with:
lbcapi3==1.0.0
That module (lbcapi3) is in PyPI, but it's only targeted for Python 3.x. When I run the pipeline in Dataflow I get:
ERROR: Could not find a version that satisfies the requirement lbcapi3==1.0.0 (from -r requirements.txt (line 27)) (from versions: none)\r\nERROR: No matching distribution found for lbcapi3==1.0.0 (from -r requirements.txt (line 27))\r\n'
That makes me think that the Dataflow worker is running the pipeline with Python 2.x to install the dependencies in requirements.txt.
Is there a way to specify the Python version to be used by a Google Dataflow pipeline (the workers)?
I tried adding this as the first line of my file api-etl.py, but it didn't work:
#!/usr/bin/env python3
Thanks!
Follow the instructions in the quickstart to get up and running with your pipeline. When installing the Apache Beam SDK, make sure to install version 2.16 (since this is the first version that officially supports Python 3). Please check your version.
You can use the Apache Beam SDK with Python versions 3.5, 3.6, or 3.7 if you are keen to migrate from Python 2.x environments.
For more information, refer to this documentation. Also, take a look at the preinstalled dependencies.
Edited, after additional information was provided:
I have reproduced the problem on Dataflow. I see two solutions.
You can use the --extra_package option, which allows staging local packages in an accessible way. Instead of listing the local package in requirements.txt, create a tarball of the local package (e.g. my_package.tar.gz) and use the --extra_package option to stage it.
Clone the repository from GitHub:
$ git clone https://github.com/6ones/lbcapi3.git
$ cd lbcapi3/
Build the tarball with the following command:
$ python setup.py sdist
The last few lines will look like this:
Writing lbcapi3-1.0.0/setup.cfg
creating dist
Creating tar archive
removing 'lbcapi3-1.0.0' (and everything under it)
Then, run your pipeline with the following command-line option:
--extra_package /path/to/package/package-name
In my case:
--extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
Make sure that all of the required options are provided in the command (job_name, project, runner, staging_location, temp_location):
python prediction/run.py --runner DataflowRunner --project $PROJECT --staging_location $BUCKET/staging --temp_location $BUCKET/temp --job_name $PROJECT-prediction-cs --setup_file prediction/setup.py --model $BUCKET/model --source cs --input $BUCKET/input/images.txt --output $BUCKET/output/predict --extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
The error you faced would disappear.
The second solution is to list the additional libraries that your app is using in a setup.py file; refer to the documentation.
Create a setup.py file for your project:
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    install_requires=[],
    packages=setuptools.find_packages(),
)
You can get rid of the requirements.txt file and instead add all packages contained in requirements.txt to the install_requires field of the setup() call.
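For instance, a sketch for the pipeline above might look like this (the package name and version are placeholders; lbcapi3==1.0.0 is taken from your requirements.txt):
import setuptools

setuptools.setup(
    name='api-etl',
    version='0.0.1',
    install_requires=['lbcapi3==1.0.0'],
    packages=setuptools.find_packages(),
)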
The simple answer is that when deploying your pipeline, you need to make sure that your local environment is on Python 3.5, 3.6, or 3.7. If it is, then the Dataflow workers will have the same version once your job is launched.
I am trying to connect Firebase with an AWS Lambda. I am using their firebase-admin SDK. I have installed and created the dependency package as described here. But I am getting this error on Lambda:
Unable to import module 'index':
Failed to import the Cloud Firestore library for Python.
Make sure to install the "google-cloud-firestore" module.
I have previously also tried setting up a similar function using Node.js, but I received an error message because gRPC was not configured. I think that this error message might be stemming from that same problem. I don't know how to fix this. I have tried:
pip install grpcio -t path/to/...
and installing google-cloud-firestore, but neither fixed the problem. When I run the code from my terminal, I get no errors.
Part of the problem here is that grpcio compiles a platform-specific dynamic module: cygrpc.cpython-37m-darwin.so (in my case). According to this response, you cannot import dynamic modules from a zip file: https://stackoverflow.com/a/58140801
Updating to Python 3.8 fixed this for me.
As Alex DeBrie mentioned in his article on serverless.com,
The plugins section registers the plugin with the Framework. In the custom section, we tell the plugin to use Docker when installing packages with pip. It will use a Docker container that's similar to the Lambda environment so the compiled extensions will be compatible. You will need Docker installed for this to work.
Which means the environment differs between local and Lambda, so the compiled extensions would differ. If you use a container to hold the packages installed by pip, the container mimics the environment of Lambda, and everything runs well.
If you use the Serverless Framework to deploy your Python app to AWS Lambda, add these lines to the serverless.yml file:
...
plugins:
  - serverless-python-requirements
...
custom:
  pythonRequirements:
    dockerizePip: non-linux
    dockerImage: mlupin/docker-lambda:python3.9-build
...
Then serverless-python-requirements will automatically spin up a Docker container based on the mlupin/docker-lambda:python3.9-build image.
This container mimics the Lambda environment and lets pip install and compile everything in it, so the compiled extensions will be compatible.
This worked in my case. Hope this helps.
I'm trying to install and run pandas on AWS Lambda. I've used the recommended zip method of packaging my code file model_a.py and related Python libraries (pip install pandas -t /path/to/dir/) and uploaded the zip to Lambda. When I try to run a test, this is the error message I get:
Unable to import module 'model_a': C extension:
/var/task/pandas/hashtable.so: undefined symbol: PyFPE_jbuf not built.
If you want to import pandas from the source directory, you may need
to run 'python setup.py build_ext --inplace' to build the C extensions
first.
Looks like an error in a variable defined in hashtable.so that comes with the pandas installer. Googling for this did not turn up any relevant articles. There were some references to a failure in numpy installation but nothing concrete. Would appreciate any help in troubleshooting this! Thanks.
I would advise you to use Lambda layers for additional libraries. The size of a Lambda function package is limited, but layers can be used up to 250 MB (more here).
AWS has open-sourced a good package, including pandas, for dealing with data in Lambdas. AWS has also packaged it, making it convenient for Lambda layers. You can find instructions here.
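If you build a layer yourself instead, a rough sketch of publishing it and attaching it with the AWS CLI (the layer name, runtime, and resulting ARN are placeholders) would be:
# the zip should contain the libraries under a top-level python/ directory
aws lambda publish-layer-version \
    --layer-name pandas-layer \
    --zip-file fileb://pandas-layer.zip \
    --compatible-runtimes python3.8
aws lambda update-function-configuration \
    --function-name model_a \
    --layers arn:aws:lambda:us-east-1:123456789012:layer:pandas-layer:1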
I have successfully run pandas code on Lambda before. If your development environment is not binary-compatible with the Lambda environment, you will not be able to simply run pip install pandas -t /some/dir and package it up into a Lambda .zip file. Even if you are developing on Linux, you may still run into compatibility issues.
So, how do you get around this? The solution is actually pretty simple: run your pip install on a Lambda-like container and use the pandas module that it downloads/builds instead. When I did this, I had a build script that would spin up an instance of the lambci/lambda container on my local system (a clone of the AWS Lambda container in Docker), bind my local build folder to /build, and run pip install pandas -t /build/. Once that's done, kill the container and you have the Lambda-compatible pandas module in your local build folder, ready to zip up and send to AWS along with the rest of your code.
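A minimal sketch of that build step (the image tag and paths are assumptions; pick the tag matching your Python version):
docker run --rm \
    -v "$PWD/build":/build \
    lambci/lambda:build-python3.7 \
    pip install pandas -t /build/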
You can do this for an arbitrary set of Python modules by making use of a requirements.txt file, and you can even do it for arbitrary versions of Python by first creating a virtual environment in the lambci container. I haven't needed to do this for a couple of years, so maybe there are better tools by now, but this approach should at least be functional.
If you want to install it directly through the AWS Console, I made a step-by-step YouTube tutorial; check out the video here: How to install Pandas on AWS Lambda