MWAA - Airflow - PythonVirtualenvOperator requires virtualenv - python

I am using AWS's MWAA service (2.2.2) to run a variety of DAGs, most of which are implemented with standard PythonOperator types. I bundle the DAGs into an S3 bucket alongside any shared requirements, then point MWAA to the relevant objects & versions. Everything runs smoothly so far.
I would now like to implement a DAG using the PythonVirtualenvOperator type, which AWS acknowledges is not supported out of the box. I am following their guide on how to patch the behaviour using a custom plugin, but I continue to receive an error from Airflow, shown at the top of the dashboard in big red writing:
DAG Import Errors (1)
... ...
AirflowException: PythonVirtualenvOperator requires virtualenv, please install it.
I've confirmed that the plugin is indeed being picked up by Airflow (I see it referenced in the admin screen), and for the avoidance of doubt I am using the exact code provided by AWS in their examples for the DAG. AWS's documentation on this is pretty light and I've yet to stumble across any community discussion of the same issue.
From AWS's docs, we'd expect the plugin to run at startup, prior to any DAGs being processed. The plugin itself appears to effectively rewrite the venv command to use the pip-installed version rather than the one installed on the machine; however, I've struggled to verify that things are happening in the order I expect. Any pointers on debugging the instance's behaviour would be very much appreciated.
Has anyone faced a similar issue? Is there a gap in the MWAA documentation that needs addressing? Am I missing something incredibly obvious?
Possibly related, but I do see this warning in the scheduler's logs, which may indicate why MWAA is struggling to resolve the dependency:
WARNING: The script virtualenv is installed in '/usr/local/airflow/.local/bin' which is not on PATH.

Airflow uses shutil.which to look for virtualenv. The virtualenv installed via requirements.txt isn't on the PATH, so adding its location to PATH solves this.
The doc here is wrong: https://docs.aws.amazon.com/mwaa/latest/userguide/samples-virtualenv.html
import os
from airflow.plugins_manager import AirflowPlugin
import airflow.utils.python_virtualenv
from typing import List

def _generate_virtualenv_cmd(tmp_dir: str, python_bin: str, system_site_packages: bool) -> List[str]:
    cmd = ['python3', '/usr/local/airflow/.local/lib/python3.7/site-packages/virtualenv', tmp_dir]
    if system_site_packages:
        cmd.append('--system-site-packages')
    if python_bin is not None:
        cmd.append(f'--python={python_bin}')
    return cmd

airflow.utils.python_virtualenv._generate_virtualenv_cmd = _generate_virtualenv_cmd

# This is the added path code
os.environ["PATH"] = f"/usr/local/airflow/.local/bin:{os.environ['PATH']}"

class VirtualPythonPlugin(AirflowPlugin):
    name = 'virtual_python_plugin'
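To check whether the patched PATH and the pip-installed virtualenv are actually visible to the workers, a throwaway debugging DAG along these lines can help (a minimal sketch; the DAG id and task id are made up for illustration):

import os
import shutil

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago

def check_virtualenv():
    # Log the same lookup Airflow performs when it decides whether
    # PythonVirtualenvOperator can be used.
    print("PATH =", os.environ.get("PATH"))
    print("shutil.which('virtualenv') =", shutil.which("virtualenv"))

with DAG(
    dag_id="debug_virtualenv_path",  # hypothetical DAG id
    schedule_interval=None,
    start_date=days_ago(1),
) as dag:
    PythonOperator(task_id="check_virtualenv", python_callable=check_virtualenv)

If shutil.which('virtualenv') prints a path once the plugin has run, the PythonVirtualenvOperator check should pass.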

Related

Airflow Browser UI not detecting imported Module

On a Linux server, I am running Airflow with Docker Compose. Other DAGs created with .py scripts work fine; other Python scripts that create DAGs and import different modules also run fine and show up in the DAG list.
However, importing the modules below within my Launch.py results in a Broken DAG: [/usr/local/airflow/dags/ScanLaunchDemo.py] No module named 'tenable_io'.
Ironically, Launch.py runs perfectly fine within the Linux instance and within a Python terminal (the 'no tenable_io' error does not show). It seems like only Airflow cannot 'detect' the module below.
from tenable_io.client import TenableIOClient
from tenable_io.api.scans import ScanCreateRequest
from tenable_io.api.models import ScanSettings
Running pip3 list will show that tenable-io is installed.
Thanks for the help, peeps.
If you are using Docker Compose, then in order to make the module available to Airflow you need to use a custom image in which you have installed your own additional dependencies. We just updated the documentation to make it clearer why you need it and how to do it, including examples:
https://airflow.apache.org/docs/docker-stack/build.html
Luckily for me, I was able to rewrite the API request so that it did not require a module that wasn't already built into Airflow. Jarek seems to be correct, however: for modules that don't come installed with Airflow, a custom image will need to be created.

import matplotlib failed while deploying my model in AWS sagemaker

I have deployed my AWS model successfully, but while testing I am getting a runtime error on "import matplotlib.pyplot as plt". I think it is due to the PyTorch framework version I used (framework_version=1.2.0). I am facing the same issue when I use higher versions as well.
PyTorchModel(model_data=model_artifact,
             role=role,
             framework_version=1.2.0,
             entry_point='predict.py',
             predictor_cls=ImagePredictor)
I have another issue when I use version=1.0.0, i.e. I am not able to import libraries from sub-directories and the deployment itself is failing.
E.g., I have some code files in a "Code" directory:
from Code.CTModel import NetWork ---> this line fails with "No module named Code" when I use version=1.0.0
Ultimately I want to know how to use/import libraries which are written under sub-directories.
It sounds like you want to inject some additional code libraries into the SageMaker PyTorch serving container. You might have to dig into the source code for how the PyTorch serving container is built to further customize it: https://github.com/aws/sagemaker-pytorch-inference-toolkit, or build your own image.
Digging into that source code a bit, I see that the container has enabled the importing of arbitrary code, but only when "multi-model mode" is enabled. Can you verify that the code exists under a directory "code" in your model directory and that "multi-model mode" is enabled?
def initialize(self, context):
    # Adding the 'code' directory path to sys.path to allow importing user modules when multi-model mode is enabled.
    if (not self._initialized) and ENABLE_MULTI_MODEL:
        code_dir = os.path.join(context.system_properties.get("model_dir"), 'code')
        sys.path.append(code_dir)
        self._initialized = True
Reference: https://github.com/aws/sagemaker-pytorch-inference-toolkit/blob/c4e7abc49aeebc2f9b6035337548a90e4330113d/src/sagemaker_pytorch_serving_container/handler_service.py#L47
If this all seems complicated to you (it is), you might want to look into some standardized formats for serializing your PyTorch model such as https://onnx.ai/. I'd love to learn more about what you're trying to do here sometime if you reach out to me at contact#modelzoo.dev. I'm beta-testing a platform that enables deployment in a single line of code and would love to test it out here.
Let me make my query a little more high-level: I have predict.py, a Jupyter notebook, Code (directory), Evaluation (directory) and other .py files in source_dir.
--Code
    --ResNet.py
    --Densenet.py
    --DataLoader.py
--Evaluation
    --Evaluation.py
--predict.py
--CT_Code.ipynb
When I execute the predict file from the Jupyter notebook on my local system, all the modules are imported properly and everything works fine. But when I deploy the same thing in a SageMaker notebook I face the issues mentioned in my question (not able to import libraries from the Code directory, nor some basic modules like imageio, PIL and Matplotlib).
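Not a verified fix for this exact setup, but a common way to make both the sub-directory imports and the extra libraries available is to pass the whole project folder via source_dir and list the missing packages (matplotlib, imageio, Pillow, ...) in a requirements.txt inside it; depending on the framework version, the SDK installs that file into the serving container. A sketch, with the directory name 'source' and the instance type assumed for illustration:

from sagemaker.pytorch import PyTorchModel

# Assumption: 'source' contains predict.py, Code/, Evaluation/ and a requirements.txt.
model = PyTorchModel(model_data=model_artifact,   # as in the original snippet
                     role=role,
                     framework_version='1.2.0',
                     entry_point='predict.py',
                     source_dir='source',          # packaged with the model and put on the import path
                     predictor_cls=ImagePredictor)
predictor = model.deploy(initial_instance_count=1, instance_type='ml.m5.large')  # instance type chosen for illustration

With source_dir packaged this way, an import like from Code.CTModel import NetWork should resolve inside the container.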

Python Serverless (SLS): Runtime.ImportModuleError: Unable to import module

I am working on a project that is using AWS CodeBuild to deploy a Serverless (SLS) function that is written in Python.
The deployment works fine within CodeBuild. It successfully creates the function and I can view the lambda within the Lambda AWS UI. Whenever the function is triggered, I get the error seen below:
Runtime.ImportModuleError: Unable to import module 'some/function': attempted relative import with no known parent package
It is extremely frustrating, as I know the function exists in the directory listed above. During the CodeBuild script, I can ls into the directory and confirm that it indeed exists. The function is defined in my serverless.yml file as follows:
functions:
  file-blaster:
    runtime: python3.7
    handler: some/function.function_name
    events:
      - existingS3:
          bucket: some_bucket
          events:
            - s3:ObjectCreated:*
          rules:
            - prefix: ${opt:stage}/some/prefix
Sadly, I haven't been able to crack this one. Has anyone had a similar experience while working with SLS and Python in the cloud?
It seems odd that SLS would build and deploy successfully, but the Lambda itself can't find the function.
This will be a short answer for what is a somewhat longer discussion on Python imports. You can do the research yourself on the hectic and confusing battle between relative and absolute imports as a design for a python project.
The Gist:
It is necessary to understand that the base of Python importing for SLS functions IS where the serverless.yml file exists (I imagine it is similar to having a main.py that calls the other files that are referenced as "functions" in the sls yml). In my case above, I had not structured the imports as absolute imports when I had my issues. I switched all of my imports to absolute paths, so that when I moved the package around, it would continue to work.
The error that I was given, Runtime.ImportModuleError: Unable to import module 'some/function': attempted relative import with no known parent package, did a really poor job of describing the actual issue. The error should have mentioned that the packages used by some/function were not found when attempting a relative import, because that was the actual problem that needed fixing.
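To illustrate the difference, here is a minimal sketch (the helpers module is hypothetical, added only for the example):

# Layout, with serverless.yml at the project root:
#   serverless.yml
#   some/
#     __init__.py
#     function.py
#     helpers.py

# some/function.py

# Relative import -- fails at runtime with "attempted relative import with no
# known parent package", because Lambda loads the handler module without a parent package:
#   from .helpers import do_work

# Absolute import -- works, because the deployment root (where serverless.yml lives)
# is on sys.path inside the Lambda runtime:
from some.helpers import do_work

def function_name(event, context):
    return do_work(event)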
Hopefully this helps someone else out someday. Let me know if I can provide more information where I haven't already.
I think you need to change your handler property from:
handler: some/function.function_name
to
handler: some/function.{lambda handler name}
For example, if my folder structure is:
- some
    - function1.py
then my template will be:
functions:
  file-blaster:
    runtime: python3.7
    handler: some/function1.lambda_handler
For more details check here: https://serverless.com/framework/docs/providers/aws/guide/functions/
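For completeness, a minimal some/function1.py matching that handler string could look like this (the body is just a placeholder):

# some/function1.py
import json

def lambda_handler(event, context):
    # Lambda resolves this function via the 'some/function1.lambda_handler' string above.
    return {
        'statusCode': 200,
        'body': json.dumps({'received_keys': list(event.keys())})
    }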

In Shiny, Python Virtual environment PERMISSION DENIED (Error 126)

We are building a user interface app (predicting a continuous variable through a machine learning model) with R Shiny.
Since we built the machine learning model with the Python 3 sklearn module, we hope that we can write Python code in R Shiny to call that model and its corresponding functions.
We used the R package "reticulate" to create a virtual Python environment in which to save the Python packages, and through which we can call Python 3 functions.
We created the virtual environment using the following line of code (a function from the R package "reticulate"):
use_virtualenv("env", required = TRUE)
We indeed have the directory "env/bin", in which there are python and python3 executables.
The Shiny app worked perfectly locally. HOWEVER, when we attempted to publish, it gave the following error (please see picture) (after the app was successfully deployed to shinyapps.io, it said the app was running).
The issue was "Error 126", which denied our app permission to access the virtual environment. This issue had no previous (similar) case on Stack Overflow, and we therefore spent a long time debugging (issue not resolved).
If anyone knows how to solve this problem, would it be possible for you to kindly share your solution tips below? (We hope the solution would not change our basic layout, i.e. calling a Python-made model in Shiny and publishing through Shiny.) We really appreciate your efforts to help us out!
Thank you so much!
Could you share the code where the actual call to the Python script is being made? Is it a Python module function that you are calling from R Shiny? What does the Python module/function do and return? I have used reticulate inside Shiny to call Python scripts and it works fine. I didn't need to set the environment; just provide the source to the Python script and call it like any other R function.
If you're trying to deploy to shinyapps.io, you may need to set the RETICULATE_PYTHON env variable so that reticulate uses the correct version of Python when running your app:
VIRTUALENV_NAME = 'env'
Sys.setenv(RETICULATE_PYTHON = paste0('/home/shiny/.virtualenvs/',
                                      VIRTUALENV_NAME,
                                      '/bin/python'))
Full example here demonstrates one method for configuring a Shiny + reticulate app so that it can easily run both locally and on shinyapps.io.

How do I add python libraries to an AWS lambda function for Alexa?

I was following the tutorial to create an Alexa app using Python:
Python Alexa Tutorial
I was able to successfully follow all the steps and get the app to work. I now want to modify the Python code and use external libraries such as import requests,
or any other libraries that I install using pip. How would I set up my lambda function to include any pip packages that I install locally on my machine?
As described in the official Amazon documentation linked here, it is as simple as creating a zip of all the folder contents after installing the required packages in the folder where you have your Python lambda code.
As Vineeth pointed out above in his comment, the very first step in moving from an inline code editor to a zip-file upload approach is to change your lambda function handler name under configuration settings to include the Python script file name that holds the lambda handler:
lambda_handler => {your-python-script-file-name}.lambda_handler.
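For example, if the zip contains a file called alexa_skill.py (a hypothetical name) holding the handler below, the handler setting becomes alexa_skill.lambda_handler:

# alexa_skill.py -- at the top level of the uploaded zip, next to the installed packages
import requests  # an external package bundled into the same zip

def lambda_handler(event, context):
    # Minimal placeholder showing that the bundled 'requests' import resolves.
    return {'statusCode': 200, 'body': requests.__version__}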
Other solutions like python-lambda and lambda-uploader help with simplifying the process of uploading and, most importantly, LOCAL TESTING. These will save a lot of time in development.
The official documentation is pretty good. In a nutshell, you need to create a zip file of a directory containing both the code of your lambda function and all external libraries you use at the top level.
You can simulate that by deactivating your virtualenv, copying all your required libraries into the working directory (which is always in sys.path if you invoke a script on the command line), and checking whether your script still works.
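A quick way to sanity-check that simulation (a sketch; it assumes you copied the requests package next to your handler script) is to confirm the import resolves to the local copy rather than to a virtualenv:

import os
import requests

here = os.path.dirname(os.path.abspath(__file__))
print(requests.__file__)
# If the bundled copy is being picked up, its path should live under this directory.
assert requests.__file__.startswith(here), "requests is not being loaded from the local bundle"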
You may want to look into using frameworks such as zappa which will handle packaging up and deploying the lambda function for you.
You can use that in conjunction with flask-ask to have an easier time making Alexa skills. There's even a video tutorial of this (from the zappa readme) here
To solve this particular problem we're using a library called juniper. In a nutshell, all you need to do is create a very simple manifest file that looks like:
functions:
  # Name the zip file you want juni to create
  router:
    # Where are your dependencies located?
    requirements: ./src/requirements.txt
    # Your source code.
    include:
      - ./src/lambda_function.py
From this manifest file, calling juni build will create the zip file artifact for you. The file will include all the dependencies you specify in the requirements.txt.
The command will create this file ./dist/router.zip. We're using that file in conjunction with a sam template. However, you can then use that zip and upload it to the console, or through the awscli.
Echoing #d3ming's answer, a framework is a good way to go at this point. Creating the deployment package manually isn't impossible, but you'll need to upload your packages' compiled code, and if you're compiling that code on a non-Linux system, the chances of running into issues with differences between your system and the Lambda function's deployed environment are high.
You can work around that by compiling your code on a Linux machine or Docker container, but given all that complexity you can get much more from adopting a framework.
Serverless is well adopted and has support for custom python packages. It even integrates with Docker to compile your python dependencies and build the deployment package for you.
If you're looking for a full tutorial on this, I wrote one for Python Lambda functions here.
Amazon created a repository that deals with your situation:
https://github.com/awsdocs/aws-lambda-developer-guide/tree/master/sample-apps/blank-python
The blank app is an example of how to push a lambda function that depends on requirements, with the bonus of being made by Amazon.
All you need to do is follow the step-by-step guide and update the repository based on your needs.
For some lambda POCs and fast lambda prototyping you can include and use the following function _install_packages. You can place a call to it before the lambda handler function (for init-time package installation, if your deps need less than 10 seconds to install) or at the beginning of the lambda handler (this will run the installation exactly once, on the first lambda event). Given the pip install options included, packages to be installed must provide binary installable versions for manylinux.
_installed = False

def _install_packages(*packages):
    global _installed
    if not _installed:
        import os
        import sys
        import time
        _started = time.time()
        os.system("mkdir -p /tmp/packages")
        _packages = " ".join(f"'{p}'" for p in packages)
        print("INSTALLED:")
        os.system(
            f"{sys.executable} -m pip freeze --no-cache-dir")
        print("INSTALLING:")
        os.system(
            f"{sys.executable} -m pip install "
            f"--no-cache-dir --target /tmp/packages "
            f"--only-binary :all: --no-color "
            f"--no-warn-script-location {_packages}")
        sys.path.insert(0, "/tmp/packages")
        _installed = True
        _ended = time.time()
        print(f"package installation took: {_ended - _started:.2f} sec")

# usage example before lambda handler
_install_packages("pymssql", "requests", "pillow")

def lambda_handler(event, context):
    pass  # lambda code

# usage example from within the lambda handler
def lambda_handler(event, context):
    _install_packages("pymssql", "requests", "pillow")
    pass  # lambda code
The given examples install the Python packages pymssql, requests and pillow.
Below is an example lambda that installs requests and then calls ifconfig.me to obtain its egress IP address.
import json

_installed = False

def _install_packages(*packages):
    global _installed
    if not _installed:
        import os
        import sys
        import time
        _started = time.time()
        os.system("mkdir -p /tmp/packages")
        _packages = " ".join(f"'{p}'" for p in packages)
        print("INSTALLED:")
        os.system(
            f"{sys.executable} -m pip freeze --no-cache-dir")
        print("INSTALLING:")
        os.system(
            f"{sys.executable} -m pip install "
            f"--no-cache-dir --target /tmp/packages "
            f"--only-binary :all: --no-color "
            f"--no-warn-script-location {_packages}")
        sys.path.insert(0, "/tmp/packages")
        _installed = True
        _ended = time.time()
        print(f"package installation took: {_ended - _started:.2f} sec")

# usage example before lambda handler
_install_packages("requests")

def lambda_handler(event, context):
    import requests
    return {
        'statusCode': 200,
        'body': json.dumps(requests.get('http://ifconfig.me').content.decode())
    }
Since single-quote escaping is taken into account when building pip's command line, you can specify a version in a package spec, e.g. pillow<9, which will install the most recent 8.X.X version of pillow.
I too struggled with this for a while. After deep-diving into AWS resources, I learned that the Lambda function on AWS runs on Linux, and it's very important to have the Python package version which matches that Linux environment.
You may find more information on this at:
https://aws.amazon.com/lambda/faqs/
Follow these steps to obtain the right version:
1. Find the .whl file of the package on PyPI and download it to your local machine.
2. Zip the packages and add them as layers in AWS Lambda.
3. Add the layer to the lambda function.
Note: Please make sure the version of the Python package you're trying to install matches the Linux OS on which AWS Lambda performs its compute tasks.
References:
https://pypi.org/project/Pandas3/#files
A lot of Python libraries can be imported via Layers here: https://github.com/keithrozario/Klayers, or you can use a framework like Serverless that has plugins to package dependencies directly into your artifact.
