Problem
I'm trying to load and use a keras model in AWS Lambda, but importing keras from tensorflow is taking a long time in my lambda function. Curiously, though, it didn't take very long in SageMaker. Why is this, and how can I fix it?
Setup Description
I'm using the serverless framework to deploy my function. The handler and serverless.yml are included below. I have an EFS volume holding my dependencies, which were installed using an EC2 instance with the EFS volume mounted. I pip installed dependencies to the EFS with the -t flag. For example, I installed tensorflow like this:
sudo miniconda3/envs/devenv/bin/pip install tensorflow -t /mnt/efs/fs1/lib
where /mnt/efs/fs1/lib is the folder on the EFS which stores my dependencies. The models are stored on s3.
I prototyped loading my model in a SageMaker notebook with the following code:
import time
start = time.time()
from tensorflow import keras
print('keras: {}'.format(time.time()-start))
import boto3
import os
import zipfile
print('imports: {}'.format(time.time()-start))

modelPath = '***model.zip'
bucket = 'predictionresources'

def load_motion_model():
    s3 = boto3.client('s3')
    s3.download_file(bucket, modelPath, 'model.motionmodel.zip')
    with zipfile.ZipFile('model.motionmodel.zip', 'r') as zip_ref:
        zip_ref.extractall('model.motionmodel')
    return keras.models.load_model('model.motionmodel/' + os.listdir('model.motionmodel')[0])

model = load_motion_model()
print('total time: {}'.format(time.time()-start))
which has the following output:
keras: 2.0228586196899414
imports: 2.0231151580810547
total time: 3.0635251998901367
So, including all imports, this takes around 3 seconds to execute.
However, when I deploy to AWS Lambda with serverless, keras takes substantially longer to import. This Lambda function (the same code as above, just wrapped in a handler) and serverless.yml:
Handler
try:
    import sys
    import os
    sys.path.append(os.environ['MNT_DIR'] + '/lib0')  # nopep8 # noqa
except ImportError:
    pass

# returns the version of all dependencies
def test(event, context):
    print('TEST LOADING')
    import time
    start = time.time()
    from tensorflow import keras
    print('keras: {}'.format(time.time()-start))
    import json  # needed for json.dumps below
    import boto3
    import os
    import zipfile
    print('imports: {}'.format(time.time()-start))

    modelPath = '**********nmodel.zip'
    bucket = '***********'

    def load_motion_model():
        s3 = boto3.client('s3')
        s3.download_file(bucket, modelPath, 'model.motionmodel.zip')
        with zipfile.ZipFile('model.motionmodel.zip', 'r') as zip_ref:
            zip_ref.extractall('model.motionmodel')
        return keras.models.load_model('model.motionmodel/' + os.listdir('model.motionmodel')[0])

    model = load_motion_model()
    print('total time: {}'.format(time.time()-start))

    body = {
        'message': 'done!'
    }
    response = {
        "statusCode": 200,
        "body": json.dumps(body)
    }
    return response
(P.S. I know this would fail due to lack of write access; the model needs to be saved to /tmp/ instead.)
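For reference, a minimal sketch of the same loader redirected to /tmp/; the object key here is a placeholder, since the real one is redacted above:

import os
import zipfile

import boto3
from tensorflow import keras

bucket = 'predictionresources'
modelPath = 'path/to/model.zip'  # placeholder key

def load_motion_model():
    # /tmp/ is the only writable path in the Lambda execution environment
    zip_path = '/tmp/model.motionmodel.zip'
    extract_dir = '/tmp/model.motionmodel'
    s3 = boto3.client('s3')
    s3.download_file(bucket, modelPath, zip_path)
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
    return keras.models.load_model(os.path.join(extract_dir, os.listdir(extract_dir)[0]))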
Serverless.yml
service: test4KerasTest

plugins:
  - serverless-pseudo-parameters

custom:
  efsAccessPoint: fsap-*****
  LocalMountPath: /mnt/efs
  subnetsId: subnet-*****
  securityGroup: sg-*****

provider:
  name: aws
  runtime: python3.6
  region: us-east-2
  timeout: 30
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:GetObject
        - s3:PutObject
      Resource: 'arn:aws:s3:::predictionresources/*'

package:
  exclude:
    - node_modules/**
    - .vscode/**
    - .serverless/**
    - .pytest_cache/**
    - __pycache__/**

functions:
  test:
    handler: handler.test
    environment: # Service wide environment variables
      MNT_DIR: ${self:custom.LocalMountPath}
      BUCKET: predictionresources
      REGION: us-east-2
    vpc:
      securityGroupIds:
        - ${self:custom.securityGroup}
      subnetIds:
        - ${self:custom.subnetsId}
    iamManagedPolicies:
      - arn:aws:iam::aws:policy/AmazonElasticFileSystemClientReadWriteAccess
      - arn:aws:iam::aws:policy/AmazonS3FullAccess
    events:
      - http:
          path: test
          method: get
    fileSystemConfig:
      localMountPath: '${self:custom.LocalMountPath}'
      arn: 'arn:aws:elasticfilesystem:${self:provider.region}:#{AWS::AccountId}:access-point/${self:custom.efsAccessPoint}'
results in this output from CloudWatch:
As can be seen, keras takes substantially longer to import in my Lambda environment, but the other imports don't seem to be as negatively affected. I have tried importing different modules in different orders, and keras consistently takes an unreasonable amount of time to import. Due to restrictions imposed by API Gateway, this function can't take longer than 30 seconds, which means I have to find a way to shorten the time it takes to import keras in my Lambda function.
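For illustration, a common Lambda pattern I could try (a sketch only, not something I've verified fixes the timing): hoist the heavy imports to module scope so they run once during the cold start and are reused by warm invocations, rather than being re-imported on every call.

# handler.py -- sketch, assuming the same EFS setup as above
import sys
import os

sys.path.append(os.environ['MNT_DIR'] + '/lib0')  # make the EFS-installed packages importable

# Module-scope imports: paid once per container cold start, then reused
# by every warm invocation of the handler.
import json
import boto3
from tensorflow import keras


def test(event, context):
    # keras is already imported by the time the handler is invoked
    return {
        "statusCode": 200,
        "body": json.dumps({"message": "done!"})
    }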
Related
I want to deploy a model in Azure, but I'm struggling with the following problem.
I have my model registered in Azure. The file, with the .sav extension, is located locally. The registration looks like the following:
import urllib.request
from azureml.core.model import Model
# Register model
model = Model.register(ws, model_name="my_model_name.sav", model_path="model/")
I have my score.py file. The init() function in the file looks like this:
import json
import numpy as np
import pandas as pd
import os
import pickle
from azureml.core.model import Model

def init():
    global model
    model_path = Model.get_model_path(model_name = 'my_model_name.sav', _workspace='workspace_name')
    model = pickle(open(model_path, 'rb'))
But when I try to deploy, I see the following error:
"code": "AciDeploymentFailed",
"statusCode": 400,
"message": "Aci Deployment failed with exception: Your container application crashed. This may be caused by errors in your scoring file's init() function.
1. Please check the logs for your container instance: leak-tester-pm. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs.
And when I run print(service.logs()) I have the following output (I have only one model registered in Azure):
None
Am I doing something wrong with loading model in score.py file?
P.S. The .yml file for the deployment:
name: project_environment
dependencies:
  # The python interpreter version.
  # Currently Azure ML only supports 3.5.2 and later.
  - python=3.6.2
  - pip:
    - scikit-learn==0.24.2
    - azureml-defaults
    - numpy
    - pickle-mixin
    - pandas
    - xgboost
    - azure-ml-api-sdk
channels:
  - anaconda
  - conda-forge
The local inference server allows you to quickly debug your entry script (score.py). If the underlying score script has a bug, the server will fail to initialize or serve the model; instead, it will throw an exception and point to the location where the issue occurred.
There are two possible reasons for the error or exception:
HTTP server issues. These need troubleshooting.
Docker deployment issues.
You need to debug the procedure you followed. In some cases, HTTP server issues cause problems during initialization (init()).
Check the Azure Machine Learning inference HTTP server for better debugging from the server's perspective.
The Dockerfile mentioned looks good, but it's better to debug it once by following the steps in https://learn.microsoft.com/en-us/azure/machine-learning/how-to-troubleshoot-deployment-local#dockerlog
Try code along these lines inside the init() function:
import pickle

def init():
    global model
    # path to the registered model file inside the deployed container
    model_path = "modelfoldername/model.pkl"
    # load the model from disk
    model = pickle.load(open(model_path, 'rb'))
I would like to be able to run an ad-hoc Python script that would access and run analytics on the model calculated by a dbt run. Are there any best practices around this?
We recently built a tool that caters very much to this scenario. It leverages the ease of referencing tables from dbt in Python-land. It's called fal.
The idea is that you would define the python scripts you would like to run after your dbt models are run:
# schema.yml
models:
  - name: iris
    meta:
      owner: "#matteo"
      fal:
        scripts:
          - "notify.py"
And then the file notify.py is called if the iris model was run in the last dbt run:
# notify.py
import os
from slack_sdk import WebClient
from slack_sdk.errors import SlackApiError

CHANNEL_ID = os.getenv("SLACK_BOT_CHANNEL")
SLACK_TOKEN = os.getenv("SLACK_BOT_TOKEN")

client = WebClient(token=SLACK_TOKEN)
message_text = f"""Model: {context.current_model.name}
Status: {context.current_model.status}
Owner: {context.current_model.meta['owner']}"""

try:
    response = client.chat_postMessage(
        channel=CHANNEL_ID,
        text=message_text
    )
except SlackApiError as e:
    assert e.response["error"]
Each script is run with a reference to the current model for which it is running, available in a context variable.
To start using fal, just pip install fal and start writing your python scripts.
For production, I'd recommend an orchestration layer such as Apache Airflow.
See this blog post to get started, but essentially you'll have an orchestration DAG (note - not a dbt DAG) that does something like:
dbt run <with args> -> your python code
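A minimal Airflow DAG sketch of that shape (the operator wiring is illustrative; the DAG name, paths, schedule, and dbt command are assumptions, not taken from the blog post):

# sketch: orchestration DAG that runs dbt, then an ad-hoc Python step
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_analytics():
    # placeholder for the ad-hoc analysis against the tables dbt just built
    pass


with DAG(
    dag_id="dbt_then_python",            # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",  # adjust to your setup
    )
    analytics = PythonOperator(
        task_id="python_analytics",
        python_callable=run_analytics,
    )

    dbt_run >> analytics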
Fair warning, though, this can add a bit of complexity to your project.
I suppose you could get a similar effect with a CI/CD tool like GitHub Actions or CircleCI.
I am trying to execute an Apache Beam pipeline as a Dataflow job in Google Cloud Platform.
My project structure is as follows:
root_dir/
    __init__.py
    setup.py
    main.py
    utils/
        __init__.py
        log_util.py
        config_util.py
Here's my setup.py
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",
        "apache-beam[gcp]>=2.20.0",
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Here's my pipeline code:
import math
import apache_beam as beam
from datetime import datetime
from apache_beam.options.pipeline_options import PipelineOptions
from utils.log_util import LogUtil
from utils.config_util import ConfigUtil


class DataflowExample:
    config = {}

    def __init__(self):
        self.config = ConfigUtil.get_config(module_config=["config"])
        self.project = self.config['project']
        self.region = self.config['location']
        self.bucket = self.config['core_bucket']
        self.batch_size = 10

    def execute_pipeline(self):
        try:
            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow started")

            query = "SELECT id, name, company FROM `<bigquery_table>` LIMIT 10"

            beam_options = {
                "project": self.project,
                "region": self.region,
                "job_name": "dataflow_example",
                "runner": "DataflowRunner",
                "temp_location": f"gs://{self.bucket}/temp_location/"
            }

            options = PipelineOptions(**beam_options, save_main_session=True)

            with beam.Pipeline(options=options) as pipeline:
                data = (
                    pipeline
                    | 'Read from BQ ' >> beam.io.Read(beam.io.ReadFromBigQuery(query=query, use_standard_sql=True))
                    | 'Count records' >> beam.combiners.Count.Globally()
                    | 'Print ' >> beam.ParDo(PrintCount(), self.batch_size)
                )

            LogUtil.log_n_notify(log_type="info", msg=f"Dataflow completed")
        except Exception as e:
            LogUtil.log_n_notify(log_type="error", msg=f"Exception in execute_pipeline - {str(e)}")


class PrintCount(beam.DoFn):

    def __init__(self):
        self.logger = LogUtil()

    def process(self, row_count, batch_size):
        try:
            current_date = datetime.today().date()
            total = int(math.ceil(row_count / batch_size))

            self.logger.log_n_notify(log_type="info", msg=f"Records pulled from table on {current_date} is {row_count}")
            self.logger.log_n_notify(log_type="info", msg=f"Records per batch: {batch_size}. Total batches: {total}")
        except Exception as e:
            self.logger.log_n_notify(log_type="error", msg=f"Exception in PrintCount.process - {str(e)}")


if __name__ == "__main__":
    df_example = DataflowExample()
    df_example.execute_pipeline()
The functionality of the pipeline is:
Query against a BigQuery table.
Count the total records fetched by the query.
Print the count using the custom log module present in the utils folder.
I am running the job from Cloud Shell using the command python3 main.py.
Though the Dataflow job starts, the worker nodes throw an error after a few minutes saying "ModuleNotFoundError: No module named 'utils'".
The "utils" folder is available, and the same code works fine when executed with "DirectRunner".
log_util and config_util are custom utility files for logging and config fetching respectively.
Also, I tried running with the setup_file option as python3 main.py --setup_file </path/of/setup.py>, which makes the job just freeze; it does not proceed even after 15 minutes.
How do I resolve the ModuleNotFoundError with "DataflowRunner"?
Posting as community wiki. As confirmed by #GopinathS the error and fix are as follows:
The error encountered by the workers is Beam SDK base version 2.32.0 does not match Dataflow Python worker version 2.28.0. Please check Dataflow worker startup logs and make sure that correct version of Beam SDK is installed.
To fix this, "apache-beam[gcp]>=2.20.0" is removed from install_requires in setup.py, since '>=' pulls in the latest available version (2.32.0 as of this writing) while the workers' version is only 2.28.0.
Updated setup.py:
import setuptools

setuptools.setup(
    name='dataflow_example',
    version='1.0',
    install_requires=[
        "google-cloud-tasks==2.2.0",
        "google-cloud-pubsub>=0.1.0",
        "google-cloud-storage==1.39.0",
        "google-cloud-bigquery==2.6.2",
        "google-cloud-secret-manager==2.0.0",
        "google-api-python-client==2.3.0",
        "oauth2client==4.1.3",  # removed apache-beam[gcp]>=2.20.0
        "wheel>=0.36.2"
    ],
    packages=setuptools.find_packages()
)
Updated beam_options in the pipeline code:
beam_options = {
    "project": self.project,
    "region": self.region,
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": f"gs://{self.bucket}/temp_location/",
    "setup_file": "./setup.py"
}
Also make sure that you pass all the pipeline options at once and not partially.
If you pass --setup_file </path/of/setup.py> on the command line, then make sure to read the setup file path with an argument parser and append it to the already defined beam_options variable in your code.
To avoid parsing the argument and appending it into beam_options, I instead added it directly in beam_options as "setup_file": "./setup.py".
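For completeness, a sketch of the argparse route described above (assumed project, region, and bucket values; only --setup_file mirrors the actual Beam option):

# sketch: read --setup_file from the command line and merge it into beam_options
import argparse

from apache_beam.options.pipeline_options import PipelineOptions

parser = argparse.ArgumentParser()
parser.add_argument("--setup_file", default="./setup.py")
known_args, pipeline_args = parser.parse_known_args()

beam_options = {
    "project": "my-project",                        # placeholders; use your config values
    "region": "us-central1",
    "job_name": "dataflow_example",
    "runner": "DataflowRunner",
    "temp_location": "gs://my-bucket/temp_location/",
    "setup_file": known_args.setup_file,
}

options = PipelineOptions(pipeline_args, **beam_options, save_main_session=True)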
Dataflow might have problems installing platform-locked packages when it runs in an isolated network: it won't be able to compile them from source if no network is available (or perhaps it tries to install them but, unable to compile, falls back to downloading wheels; it's not clear).
Still, to be able to use packages like psycopg2 (which ships binaries), or google-cloud-secret-manager (no binaries, but its dependencies have binaries), you need to install everything that has no binaries (none-any) and no dependencies with binaries via requirements.txt, and the rest via the --extra_packages param with wheels. Example:
--extra_packages=package_1_needed_by_2-manylinux.whl \
--extra_packages=package_2_needed_by_3-manylinux.whl \
--extra_packages=what-you-need_needing_3-none-any.whl
I have created a simple HTTP-trigger-based Azure function in Python which calls another Python script to create a sample file in Azure Data Lake Gen 1. My solution structure is given below:
requirements.txt contains the following packages:
azure-functions
azure-mgmt-resource
azure-mgmt-datalake-store
azure-datalake-store
__init__.py
import logging, os, sys
import azure.functions as func
import json


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    name = req.params.get('name')
    if not name:
        try:
            req_body = req.get_json()
        except ValueError:
            pass
        else:
            name = req_body.get('name')
    if name:
        full_path_to_script = os.path.join(os.path.dirname( __file__ ) + '/Test.py')
        logging.info(f"Path: - {full_path_to_script}")
        os.system(f"python {full_path_to_script}")
        return func.HttpResponse(f"Hello {name}!")
    else:
        return func.HttpResponse(
            "Please pass a name on the query string or in the request body",
            status_code=400
        )
Test.py
import json
from azure.datalake.store import core, lib, multithread

directoryId = ''
applicationKey = ''
applicationId = ''

adlsCredentials = lib.auth(tenant_id = directoryId, client_secret = applicationKey, client_id = applicationId)
adlsClient = core.AzureDLFileSystem(adlsCredentials, store_name = '')

with adlsClient.open('stage1/largeFiles/TestFile.json', 'rb') as input_file:
    data = json.load(input_file)

with adlsClient.open('stage1/largeFiles/Result.json', 'wb') as responseFile:
    responseFile.write(data)
Test.py fails with an error saying no module named azure.datalake.store was found.
Why are the other required modules not available to Test.py, given that it is in the same directory?
pip freeze output:
adal==1.2.2
azure-common==1.1.23
azure-datalake-store==0.0.48
azure-functions==1.0.4
azure-mgmt-datalake-nspkg==3.0.1
azure-mgmt-datalake-store==0.5.0
azure-mgmt-nspkg==3.0.2
azure-mgmt-resource==6.0.0
azure-nspkg==3.0.2
certifi==2019.9.11
cffi==1.13.2
chardet==3.0.4
cryptography==2.8
idna==2.8
isodate==0.6.0
msrest==0.6.10
msrestazure==0.6.2
oauthlib==3.1.0
pycparser==2.19
PyJWT==1.7.1
python-dateutil==2.8.1
requests==2.22.0
requests-oauthlib==1.3.0
six==1.13.0
urllib3==1.25.6
Problem
os.system(f"python {full_path_to_script}") from your functions project is causing the issue.
The Azure Functions runtime sets up the environment, including modifying process-level state such as sys.path, so that your function can load any dependencies you may have. When you create a sub-process like that, not all of that information flows through. Additionally, you will face issues with logging: logs from Test.py would not show up properly unless explicitly handled.
Importing works locally because you have all of your requirements.txt modules installed and available to Test.py. This is not the case in Azure. After the remote build that happens as part of publishing, your modules are included as part of the published code package; they are not "installed" globally in the Azure environment per se.
Solution
You shouldn't have to run your script like that. In the example above, you could import your Test.py from your __init__.py file, and that should behave as if it were called with python test.py (at least in the case above). Is there a reason you'd want to run python test.py in a sub-process rather than importing it?
Here's the official guide on how you'd want to structure your app to import shared code -- https://learn.microsoft.com/en-us/azure/azure-functions/functions-reference-python#folder-structure
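For illustration, a sketch of what the import-based approach could look like, assuming Test.py sits next to __init__.py inside the function folder (module names follow the question's structure):

# __init__.py -- sketch: import the sibling module instead of spawning a sub-process
import logging

import azure.functions as func

from . import Test  # note: Test.py's module-level code runs at import time;
                    # wrapping it in a function (e.g. Test.run()) and calling it
                    # inside main() would give more control.


def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Python HTTP trigger function processed a request.')
    return func.HttpResponse("Test.py was imported and executed.")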
Side-Note
I think once you get through the import issue, you may also face problems with adlsClient.open('stage1/largeFiles/TestFile.json', 'rb'). We recommend following the developer guide above to structure your project and using __file__ to get the absolute path (reference).
For example --
import pathlib

with open(pathlib.Path(__file__).parent / 'stage1' / 'largeFiles' / 'TestFile.json'):
    ....
Now, if you really want to make os.system(f"python {full_path_to_script}") work, we have workarounds to the import issue. But, I'd rather not recommend such approach unless you have a really compelling need for it. :)
I have a python script named foo.py. It has a lambda handler function defined like this:
def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']
        download_path = '/tmp/{}.gz'.format(key)
        csv_path = '/tmp/{}.csv'.format(key)
        ... proceed to proprietary stuff
This is in a zip file like so:
foo.zip
  - foo.py
  - dependencies
I have uploaded this zip file to AWS Lambda and configured an AWS Lambda Function to run foo.handler. However, every time I test it, I get "errorMessage": "Unable to import module 'foo'".
Any ideas what might be going on here?
stat --format '%a' foo.py shows 664
So, I was importing psycopg2 in my lambda function, which requires libpq.so, which installs with Postgres. Postgres isn't installed in the lambda environment, so importing psycopg2 failed, which meant that, by extension, Amazon's import of my lambda function also failed. Not a very helpful error message, though.
Thankfully, somebody's built a version of psycopg2 that works with AWS lambda: https://github.com/jkehler/awslambda-psycopg2
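As a hedged diagnostic sketch (not part of the fix above): catching the import at module level and logging it makes the real cause show up in CloudWatch instead of only the generic "Unable to import module 'foo'" message.

# foo.py -- sketch: surface the underlying ImportError in the logs
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

try:
    import psycopg2  # fails if libpq.so is not bundled with the deployment package
except ImportError:
    logger.exception("psycopg2 could not be imported")
    psycopg2 = None


def handler(event, context):
    if psycopg2 is None:
        raise RuntimeError("psycopg2 unavailable in this environment")
    # ... proceed with the normal S3 record processing ...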