Shrinking AWS Lambda deployment package with CFLAGS and PIP to fit sklearn - python

I'm loading a pickled machine learning model in my Lambda handler, so I need sklearn (I get "ModuleNotFoundError: No module named 'sklearn'" if it isn't included).
So I created a new deployment package in Docker with sklearn.
But when I tried to upload the new lambda.zip file, I could not save the Lambda function. I get the error: Unzipped size must be smaller than 262144000 bytes.
I did some googling and found two suggestions: (1) using CFLAGS with pip and (2) using Lambda Layers.
I don't think Layers will work. Moving parts of my deployment package to layers won't reduce the total size (the AWS documentation states "The total unzipped size of the function and all layers can't exceed the unzipped deployment package size limit of 250 MB").
CFLAGS sound promising but I've never worked with CFLAGS before and I'm getting errors.
I'm trying to add the flags: -Os -g0 -Wl,--strip-all
Pre-CFLAGS, my docker pip command was: pip3 install requests pandas s3fs datetime bs4 sklearn -t ./
First I tried: pip3 install requests pandas s3fs datetime bs4 sklearn -t -Os -g0 -Wl,--strip-all ./
That produced errors of the variety "no such option: -g"
Then I tried CFLAGS = -Os -g0 -Wl,--strip-all pip3 install requests pandas s3fs datetime bs4 sklearn -t ./ and CFLAGS = -Os -g0 -Wl,--strip-all
But they produced the error "CFLAGS: command not found"
Can anyone help me understand how to use CFLAGS?
Also, I'm familiar with the saying "beggars can't be choosers", so any advice would be appreciated.
That said, I'm a bit of a noob, so if you could help me with CFLAGS in the context of my Docker deployment package workflow, it'd be most appreciated.
My docker workflow is:
docker run -it olivierhervieu/amazonlinux-python36-onbuild
mkdir deploy
cd deploy
pip3 install requests pandas s3fs datetime bs4 sklearn -t ./
zip -r lambda.zip *

This kinda is an answer (I was able to shrink my deployment package and get my Lambda deployed) and kinda not an answer (I still don't know how to use CFLAGS).
A lot of googling eventually led me to this article, which included a link to this list of modules that come pre-installed in the AWS Lambda Python environment.
My deployment package contained several of the modules that already exist in the AWS Lambda environment and thus do not need to be included in deployment packages.
The modules that saved the most space for me were boto3 and botocore. I didn't explicitly add these in my Docker environment, but they made their way into my deployment package anyway (I'm guessing s3fs depends on them, so installing s3fs pulls them in as well).
I was also able to remove a lot of smaller modules (datetime, dateutil, docutils, six, etc.).
With these modules removed, my package was under the 250 MB limit and I was able to deploy.
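For example, the cleanup before zipping looked something like this (a sketch; the exact directory names depend on what pip pulled in, so list your deploy folder first):

cd deploy
# Already provided by the Lambda Python runtime, so safe to drop from the package
rm -rf boto3 botocore docutils dateutil six.py six-*.dist-info
zip -r ../lambda.zip *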
Had I still been over the limit (I wasn't sure removing those modules would be enough), I was going to try another suggestion from the linked article: removing .py files from the deployment package (you don't need both the .pyc and .py files), as sketched below.
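Stripping the sources would have looked something like this, run inside the deploy directory before zipping (an untested sketch; compileall's -b flag writes each .pyc next to its source so imports keep working, and you may want to keep the .py file for your own handler):

python3 -m compileall -b .                           # compile every module to legacy .pyc files
find . -name '*.py' -delete                          # drop the now-redundant sources
find . -name '__pycache__' -prune -exec rm -rf {} +  # remove stale bytecode cache dirs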
Hope this helps with your Lambda deployment package size!

These days you would use a Docker container image for your Lambda, as it can be up to 10 GB, which is far greater than traditional Lambda functions deployed using deployment packages and layers. From AWS:
You can now package and deploy AWS Lambda functions as a container image of up to 10 GB.
Thus you could create a Lambda container with sklearn plus any other files and dependencies you require, up to a total size of 10 GB.
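As a sketch, building such an image might look like this (the base image tag, the app.py handler file, and the app.handler entry point are illustrative assumptions, not taken from the question):

cat > Dockerfile <<'EOF'
FROM public.ecr.aws/lambda/python:3.9
# Bake the heavy dependencies into the image instead of a zip package
RUN pip install scikit-learn pandas requests s3fs beautifulsoup4
# Copy the handler module into the task root
COPY app.py ${LAMBDA_TASK_ROOT}
CMD ["app.handler"]
EOF
docker build -t sklearn-lambda .

The image is then pushed to ECR and the function is created from it.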

We faced this exact problem ourselves but with Spacy rather than sklearn.
You're going about it the right way by not deploying packages already included in the AWS runtime, but note that sometimes this still won't get you under the limit (especially for ML purposes, where large models have to be included as part of the dependency).
In these instances, another option is to store any external static files (e.g. models) used by the library in a private S3 bucket and read them in at runtime, as described by this answer.
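A minimal sketch of that pattern (the bucket and key names here are made up; the idea is to download once per container and reuse the model across invocations):

import pickle

import boto3

s3 = boto3.client("s3")
MODEL_PATH = "/tmp/model.pkl"  # /tmp is the only writable filesystem in Lambda
_model = None

def get_model():
    global _model
    if _model is None:
        # Hypothetical bucket and key; replace with your own
        s3.download_file("my-model-bucket", "models/model.pkl", MODEL_PATH)
        with open(MODEL_PATH, "rb") as f:
            _model = pickle.load(f)
    return _model

def handler(event, context):
    model = get_model()
    return {"prediction": model.predict([event["features"]]).tolist()}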
Incidentally, if you're using the Serverless Framework to deploy your Lambdas, you should check out the serverless-python-requirements plugin, which implements the steps you've described: it lets you specify packages not to deploy with the function and builds 'slim' versions of the dependencies (stripping symbols from the .so files and removing __pycache__, dist-info directories, and .pyc/.pyo files).
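For instance, a serverless.yml fragment using the plugin might look like this (a sketch; the noDeploy list is an assumption based on what the Lambda runtime already ships):

plugins:
  - serverless-python-requirements
custom:
  pythonRequirements:
    slim: true     # strip .so symbols, drop __pycache__, dist-info, .pyc/.pyo files
    noDeploy:      # skip packages the runtime already provides
      - boto3
      - botocore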
Good luck :)

We had the same problem, and it was very difficult to make it work. We ended up buying this layer, which includes scikit-learn, pandas, numpy and scipy.
https://www.awslambdas.com/layers/3/aws-lambda-scikit-learn-numpy-scipy-python38-layer
There is another layer that includes xgboost as well.

I found this article that mentions the use of CFLAGS. In the comments, someone named Jesper explained how to use them, quoted below:
If anyone else doubts how to add the CFLAGS to pip here is how I did it:
Before running pip do this (I did this in Ubuntu):
export CFLAGS="$CFLAGS -Os -g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib"
export CXXFLAGS="$CXXFLAGS -Os -g0 -Wl,--strip-all -I/usr/include:/usr/local/include -L/usr/lib:/usr/local/lib"
Then run pip like (for numpy, but do whatever other package you want):
pip install numpy --no-cache-dir --compile --global-option=build_ext
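In other words, CFLAGS is an environment variable read by the compiler, not a pip option, which is why the attempts in the question failed; the "CFLAGS: command not found" error came from the spaces around the =, which shell variable assignments must not have. Applied to the question's Docker workflow, a consolidated sketch (untested; CFLAGS only affect packages that pip compiles from source, so --no-binary is used to force source builds, at the cost of a much slower install):

export CFLAGS="-Os -g0 -Wl,--strip-all"
export CXXFLAGS="$CFLAGS"
# datetime is omitted: it is part of the Python standard library
pip3 install --no-cache-dir --no-binary :all: requests pandas s3fs bs4 sklearn -t ./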

Related

Difference between installation of pip git+https and python setup.py

I am aware of this popular topic; however, I am running into a different outcome when installing a Python app using pip with git+https versus python setup.py.
I am building a Docker image that already contains several other Python apps, and I am trying to install this custom webhook into it.
Using git+https
RUN /venv/bin/pip install git+https://github.com/alerta/alerta-contrib.git#subdirectory=webhooks/sentry
This seems to install the webhook the right way, as the relevant endpoint is later discoverable.
What is more, when I exec into the running container and search for the relevant files, I see the following:
./venv/lib/python3.7/site-packages/sentry_sdk
./venv/lib/python3.7/site-packages/__pycache__/alerta_sentry.cpython-37.pyc
./venv/lib/python3.7/site-packages/sentry_sdk-0.15.1.dist-info
./venv/lib/python3.7/site-packages/alerta_sentry.py
./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg-info
In my second approach I just copy this directory locally and in my Dockerfile I do
COPY sentry /app/sentry
RUN /venv/bin/python /app/sentry/setup.py install
This does not install the webhook appropriately, and what is more, in the respective container I see a different file layout:
./venv/lib/python3.7/site-packages/sentry_sdk
./venv/lib/python3.7/site-packages/sentry_sdk-0.15.1.dist-info
./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg
./alerta_sentry.egg-info
./dist/alerta_sentry-5.0.0-py3.7.egg
(the sentry_sdk-related files must be irrelevant)
Why does the second approach fail to install the webhook appropriately?
Should these two options yield the same result?
What finally worked is the following
RUN /venv/bin/pip install /app/sentry/
I don't know the subtle differences between these two installation modes.
I did notice, however, that /venv/bin/python /app/sentry/setup.py install did not produce an alerta_sentry.py, only the .egg file, i.e. ./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg
On the other hand, /venv/bin/pip install /app/sentry/ unpacked (?) the .egg, creating ./venv/lib/python3.7/site-packages/alerta_sentry.py
I also don't know why the second installation option (i.e. the one creating the .egg file) was not working at runtime.
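One way to see the difference is to list what each method actually put on disk (a sketch, run inside the container; the package name is inferred from the layout above):

/venv/bin/pip show -f alerta-sentry                    # files pip recorded for the package
ls /venv/lib/python3.7/site-packages | grep -i sentry  # what is physically in site-packages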

Full installation of tensorflow (all modules)?

I have this repository: https://github.com/layog/Accurate-Binary-Convolution-Network. As requirements.txt says, it requires tensorflow==1.4.1. So I am using miniconda (on Ubuntu 18.04) and, for the love of God, I can't get it to run (it errors out at the line below):
from tensorflow.examples.tutorial.* import input_data
This gives me an ImportError saying it can't find tensorflow.examples. I have diagnosed the problem: a few modules are missing after I installed tensorflow (I have tried all of the ways below):
pip install tensorflow==1.4.1
conda install -c conda-forge tensorflow==1.4.1
#And various wheel packages available on the internet for 1.4.1
pip install tensorflow-1.4.0rc1-cp36-cp36m-manylinux1_x86_64.whl
Question is, if I want all the modules that are present in the git repo source in my installed copy, do I have to COMPLETELY build tensorflow from source? If yes, can you mention the flag I should use? Are there any wheel packages available that have all modules present in them?
A link would save me tonnes of effort!
NOTE: Even if I manually import the examples directory, it says tensorflow.contrib is missing, and if I locally import that too, another ImportError pops up. There has to be an easier way, I am sure of it.
Just for reference for others stuck in the same situation:
Use the latest tensorflow build and bazel 0.27.1 to install it. Even though the requirements state that an older version is needed, use the newer one instead. It's not worth the hassle and will get the job done.
Also, to answer the question above: building only specific directories is possible. Each module has a BUILD file, which is fed to bazel.
See the name attributes in that file to build targets specific to that folder. For reference, here is the command I used to generate the wheel package for examples/tutorials/mnist:
bazel build --config=opt --config=cuda --incompatible_load_argument_is_label=false //tensorflow/examples/tutorials/mnist:all_files
Here all_files is the name found in the examples/tutorials/mnist/BUILD file.
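For comparison, building the standard full wheel from source follows this sequence (per the TensorFlow build documentation; the configure step asks interactively about CUDA and other options):

./configure
bazel build --config=opt //tensorflow/tools/pip_package:build_pip_package
./bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install /tmp/tensorflow_pkg/tensorflow-*.whl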

Building extensions to AWS Lambda with Continuous Delivery

I have a GitHub repository containing an AWS Lambda function. I am currently using Travis CI to build, test, and then deploy this function to Lambda if all the tests succeed, using:
deploy:
  provider: lambda
  (other settings here)
My function has the following dependencies specified in its requirements.txt
Algorithmia
numpy
networkx
opencv-python
I have set the Travis CI build script to install into the working directory using the command below, so that the dependencies are properly copied into my AWS Lambda function package:
pip install --target=$TRAVIS_BUILD_DIR -r requirements.txt
The problem is that while the build in Travis CI succeeds and everything is deployed to the Lambda function successfully, testing my Lambda function results in the following error:
Unable to import module 'mymodule':
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
My best guess as to why this is happening is that numpy is being built on the Ubuntu distribution that Travis CI uses, but the Amazon Linux it runs on as a Lambda function can't load it properly. There are numerous forum and blog posts, such as this one, detailing that Python modules that need to build C/C++ extensions must be built on an EC2 instance.
My question is: it's a real hassle to add another complication to the CD pipeline and to mess around with EC2 instances. Has Amazon come up with some better way to do this (because there really should be a better way), or is there some way to have everything compiled properly in Travis CI or another CI solution?
Also, I suppose it's possible that I've mis-identified the problem and that there is some other reason why importing numpy is failing. If anyone has suggestions on how to resolve this that would be great!
EDIT: As suggested by @jordanm, it looks like it may be possible to load a docker container with the amazonlinux image when running Travis CI and then perform my build and test inside that container. Unfortunately, while that is certainly easier than using EC2, I don't think I can use the normal Lambda deploy tools in Travis CI; I'll have to write my own deploy script using the aws cli, which is a bit of a pain. Any other ideas, or ways to make this smoother? Ideally I would specify which docker image my builds run on in Travis CI, as their default build environment already uses docker... but they don't seem to support that functionality yet: https://github.com/travis-ci/travis-ci/issues/7726
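For example, such a container-based build step might boil down to this (a sketch; it assumes services: docker is enabled in .travis.yml, and lambci/lambda:build-python3.6 is a community image that replicates the Lambda build environment):

# Install dependencies inside a Lambda-like environment, into the repo itself
docker run --rm -v "$PWD":/var/task -w /var/task lambci/lambda:build-python3.6 \
  pip install -r requirements.txt -t .
# Package everything for a deploy step driven by the aws cli
zip -r lambda_deploy.zip . -x '*.git*'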
After quite a bit of tinkering I think I've found something that works. I thought I'd post it here in case others have the same problem.
I decided to use Wercker as they have quite a generous free tier and allow you to customize the docker image for your builds.
It turns out there is a docker image created to replicate the exact environment that Lambda functions execute in! See: https://github.com/lambci/docker-lambda. When your builds run in this container, extensions are built properly and will execute successfully on Lambda.
In case anyone does want to use Wercker, here's the wercker.yml I used; it may be helpful as a template:
box: lambci/lambda:build-python3.6
build:
  steps:
    - script:
        name: Install Dependencies
        code: |
          pip install --target=$WERCKER_SOURCE_DIR -r requirements.txt
          pip install pytest
    - script:
        name: Test code
        code: pytest
    - script:
        name: Cleaning up
        code: find $WERCKER_SOURCE_DIR \( -name \*.pyc -o -name \*.pyo -o -name __pycache__ \) -prune -exec rm -rf {} +
    - script:
        name: Create ZIP
        code: |
          cd $WERCKER_SOURCE_DIR
          zip -r $WERCKER_ROOT/lambda_deploy.zip . -x *.git*
deploy:
  box: golang:latest
  steps:
    - arjen/lambda:
        access_key: $AWS_ACCESS_KEY
        secret_key: $AWS_SECRET_KEY
        function_name: yourFunction
        region: us-west-1
        filepath: $WERCKER_ROOT/lambda_deploy.zip
Although I appreciate you may not want to add further complications to your project, you could potentially use a Python-focused Lambda management tool to set up your builds and deployments, something like Gordon. You could also just use such a tool to deploy from inside the Amazon Linux Docker container running within Travis.
If you wish to change CI providers, CodeShip allows you to build with any Docker container of your choice, and then deploy to Lambda.
Wercker also runs full Docker-based builds and has many user-submitted deploy "steps", some of which support deployment to Lambda.

Unable to install pandas on AWS Lambda

I'm trying to install and run pandas on AWS Lambda. I've used the recommended zip method of packaging my code file model_a.py and the related Python libraries (pip install pandas -t /path/to/dir/) and uploaded the zip to Lambda. When I run a test, this is the error message I get:
Unable to import module 'model_a': C extension:
/var/task/pandas/hashtable.so: undefined symbol: PyFPE_jbuf not built.
If you want to import pandas from the source directory, you may need
to run 'python setup.py build_ext --inplace' to build the C extensions
first.
Looks like an error about a symbol referenced by the hashtable.so that ships with pandas. Googling for this did not turn up any relevant articles. There were some references to a failure in the numpy installation, but nothing concrete. Would appreciate any help in troubleshooting this! Thanks.
I would advise you to use Lambda Layers for additional libraries. The size of a Lambda function package is limited, but with layers you can go up to 250 MB (more here).
AWS has open sourced a good package, including Pandas, for dealing with data in Lambdas. AWS has also packaged it making it convenient for Lambda layers. You can find instructions here.
I have successfully run pandas code on Lambda before. If your development environment is not binary-compatible with the Lambda environment, you will not be able to simply run pip install pandas -t /some/dir and package it up into a Lambda .zip file. Even if you are developing on Linux, you may still run into compatibility issues.
So, how do you get around this? The solution is actually pretty simple: run your pip install on a lambda container and use the pandas module that it downloads/builds instead. When I did this, I had a build script that would spin up an instance of the lambci/lambda container on my local system (a clone of the AWS Lambda container in docker), bind my local build folder to /build and run pip install pandas -t /build/. Once that's done, kill the container and you have the lambda-compatible pandas module in your local build folder, ready to zip up and send to AWS along with the rest of your code.
You can do this for an arbitrary set of python modules by making use of a requirements.txt file, and you can even do it for arbitrary versions of python by first creating a virtual environment on the lambci container. I haven't needed to do this for a couple of years, so maybe there are better tools by now, but this approach should at least be functional.
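That build step boils down to something like the following (a sketch; folder names are illustrative):

mkdir -p build
docker run --rm -v "$PWD/build":/build lambci/lambda:build-python3.6 \
  pip install pandas -t /build
# build/ now holds a Lambda-compatible pandas, ready to zip up with your own code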
If you want to install it directly through the AWS Console, I made a step-by-step YouTube tutorial; check out the video here: How to install Pandas on AWS Lambda

Issue using Flask-Assets to compile less files

I'm currently trying to set up a Flask web app and use Flask-Assets to compile my less files into minified CSS.
Here is my assets.py file that creates the bundle.
from flask_assets import Bundle

common_css = Bundle(
    'vendor/less/theme.less',
    filters='less',
    output='static/css/common.css',
)
The error that I am getting is:
OSError: [Errno 2] No such file or directory
In the webassets documentation for the less filter, it says that:
This depends on the NodeJS implementation of less, installable via npm. To use the old Ruby-based version (implemented in the 1.x Ruby gem), see Less.
...
LESS_BIN (binary)
Path to the less executable used to compile source files. By default, the filter will attempt to run lessc via the system path.
I installed less using $ npm install less, but for some reason it looks like webassets can't use it.
When I try different filters, webassets can successfully create the bundle.
Thanks!
npm install installs packages into the current directory by default (you should be able to find a node_modules directory there). You have two choices:
Install lessc globally:
$ npm install -g less
This way webassets will be able to find it on its own.
Provide the full path to the lessc executable:
assets = Environment(app)
assets.config['less_bin'] = '/path/to/lessc'
The path should be <some_directory>/node_modules/.bin/lessc.
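Putting it together, a minimal sketch of the second option (the less_bin path assumes npm install less was run in the project root; note that webassets resolves output relative to the app's static folder):

from flask import Flask
from flask_assets import Bundle, Environment

app = Flask(__name__)
assets = Environment(app)
# Point webassets at the locally installed less compiler
assets.config['less_bin'] = './node_modules/.bin/lessc'

common_css = Bundle(
    'vendor/less/theme.less',
    filters='less',
    output='css/common.css',  # resolved relative to the static folder
)
assets.register('common_css', common_css)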
