Force Dataflow workers to use Python 3?

I have a simple batch Apache Beam pipeline. When run locally, the DirectRunner works fine, but with the DataflowRunner it fails to install one dependency from requirements.txt. The reason is that the specific package is Python 3 only, and the workers are (apparently) running the pipeline with Python 2.
The pipeline is done and working fine locally (DirectRunner) with Python 3.7.6. I'm using the latest Apache Beam SDK (apache-beam==2.16.0 in my requirements.txt).
One of the modules required by my pipeline is:
from lbcapi3 import api
So my requirements.txt sent to GCP has a line with:
lbcapi3==1.0.0
That module (lbcapi3) is on PyPI, but it only targets Python 3.x. When I run the pipeline in Dataflow I get:
ERROR: Could not find a version that satisfies the requirement lbcapi3==1.0.0 (from -r requirements.txt (line 27)) (from versions: none)
ERROR: No matching distribution found for lbcapi3==1.0.0 (from -r requirements.txt (line 27))
That makes me think the Dataflow workers are using Python 2.x to install the dependencies from requirements.txt.
Is there a way to specify the Python version used by a Google Dataflow pipeline (the workers)?
I tried adding this as the first line of my file api-etl.py, but it didn't work:
#!/usr/bin/env python3
Thanks!

Follow the instructions in the quickstart to get up and running with your pipeline. When installing the Apache Beam SDK, make sure to install version 2.16, since this is the first version that officially supports Python 3. Please check your version.
You can use the Apache Beam SDK with Python versions 3.5, 3.6, or 3.7 if you are keen to migrate from Python 2.x environments.
For more information, refer to this documentation. Also, take a look at the preinstalled dependencies.
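A quick local check (a minimal sketch; it assumes nothing beyond the Beam version already in your requirements.txt) is to confirm which interpreter and SDK version will be submitting the job, since that is what the Dataflow workers will mirror:

import sys
import apache_beam as beam

print(sys.version)       # expect 3.5.x, 3.6.x, or 3.7.x, not 2.7
print(beam.__version__)  # expect 2.16.0 or newer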
Edited after additional information was provided:
I have reproduced the problem on Dataflow. I see two solutions.
You can use the --extra_package option, which allows staging local packages in an accessible way. Instead of listing the local package in requirements.txt, create a tarball of it (e.g. my_package.tar.gz) and use the --extra_package option to stage it.
Clone the repository from Github:
$ git clone https://github.com/6ones/lbcapi3.git
$ cd lbcapi3/
Build the tarball with the following command:
$ python setup.py sdist
The last few lines will look like this:
Writing lbcapi3-1.0.0/setup.cfg
creating dist
Creating tar archive
removing 'lbcapi3-1.0.0' (and everything under it)
Then, run your pipeline with the following command-line option:
--extra_package /path/to/package/package-name
In my case:
--extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
Make sure that all of the required options are provided in the command (job_name, project, runner, staging_location, temp_location):
python prediction/run.py --runner DataflowRunner --project $PROJECT --staging_location $BUCKET/staging --temp_location $BUCKET/temp --job_name $PROJECT-prediction-cs --setup_file prediction/setup.py --model $BUCKET/model --source cs --input $BUCKET/input/images.txt --output $BUCKET/output/predict --extra_package /home/user/dataflow-prediction-example/lbcapi3/dist/lbcapi3-1.0.0.tar.gz
The error you faced should disappear.
The second solution is listing the additional libraries that your app uses in a setup.py file; refer to the documentation.
Create a setup.py file for your project:
import setuptools

setuptools.setup(
    name='PACKAGE-NAME',
    version='PACKAGE-VERSION',
    install_requires=[],
    packages=setuptools.find_packages(),
)
You can get rid of the requirements.txt file and instead add all of the packages it contained to the install_requires field of the setup call.
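For example, a minimal sketch of such a setup.py for this pipeline (only lbcapi3 is taken from the question; the name and version fields are placeholders):

import setuptools

setuptools.setup(
    name='my-dataflow-pipeline',   # placeholder
    version='0.0.1',               # placeholder
    install_requires=[
        'lbcapi3==1.0.0',          # formerly a line in requirements.txt
    ],
    packages=setuptools.find_packages(),
)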

The simple answer is that when deploying your pipeline, you need to make sure your local environment is on Python 3.5, 3.6, or 3.7. If it is, then the Dataflow workers will have the same version once your job is launched.
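As a rough sketch (the project, bucket, and region values below are placeholders, not taken from the question), running a file like this with python3 is enough to get Python 3 workers:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='your-gcp-project',              # placeholder
    temp_location='gs://your-bucket/temp',   # placeholder
    region='us-central1',                    # placeholder
    requirements_file='requirements.txt',
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | beam.Create(['smoke test'])
     | beam.Map(print))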

Related

Amazon EMR pip install in bootstrap actions runs OK but has no effect

In Amazon EMR, I am using the following script as a custom bootstrap action to install Python packages. The script runs OK (I checked the logs; the packages installed successfully), but when I open a notebook in JupyterLab, I cannot import any of them. If I open a terminal in JupyterLab and run pip list or pip3 list, none of my packages are there. Even if I go to / and run find . -name mleap, for instance, it does not exist.
Something I have noticed is that on the master node I keep getting an error saying bootstrap action 2 has failed (there is no second action, only one). According to this, it is a rare error which I get in all my clusters. However, my cluster eventually gets created and I can use it.
My script is called aws-emr-bootstrap-actions.sh
#!/bin/bash
sudo python3 -m pip install numpy scikit-learn pandas mleap sagemaker boto3
I suspect it might have something to do with a Docker image being deployed that invalidates my previous installs or something, but I think (from my Google searches) it is common to use bootstrap actions to install Python packages, and this should work...
The PySpark Python interpreter that Spark uses is different from the one into which the OP was installing the modules (as confirmed in the comments).
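A quick diagnostic (a sketch only; the package names are from the question, everything else is an assumption) is to check, from inside the notebook kernel, which interpreter is actually running and install into that exact one:

# Run this in a JupyterLab notebook cell on the cluster.
import sys
print(sys.executable)  # compare this with the python3 the bootstrap script used

# Installing via "<this interpreter> -m pip install ..." targets the right site-packages
# (may need --user or sudo depending on where that interpreter's site-packages lives).
import subprocess
subprocess.check_call([sys.executable, "-m", "pip", "install", "mleap", "sagemaker"])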

Difference between installation of pip git+https and python setup.py

I am aware of this popular topic; however, I am running into a different outcome when installing a Python app using pip with git+https versus python setup.py.
I am building a Docker image. I am trying to install this custom webhook in an image that contains several other Python apps.
Using git+https
RUN /venv/bin/pip install git+https://github.com/alerta/alerta-contrib.git#subdirectory=webhooks/sentry
This seems to install the webhook the right way, as the relevant endpoint is later discoverable.
What is more, when I exec into the running container and search for the relevant files, I see the following:
./venv/lib/python3.7/site-packages/sentry_sdk
./venv/lib/python3.7/site-packages/__pycache__/alerta_sentry.cpython-37.pyc
./venv/lib/python3.7/site-packages/sentry_sdk-0.15.1.dist-info
./venv/lib/python3.7/site-packages/alerta_sentry.py
./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg-info
In my second approach I just copy this directory locally and in my Dockerfile I do
COPY sentry /app/sentry
RUN /venv/bin/python /app/sentry/setup.py install
This does not install the webhook appropriately and what is more, in the respective container I see a different file layout
./venv/lib/python3.7/site-packages/sentry_sdk
./venv/lib/python3.7/site-packages/sentry_sdk-0.15.1.dist-info
./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg
./alerta_sentry.egg-info
./dist/alerta_sentry-5.0.0-py3.7.egg
(the sentry_sdk - related files must be irrelevant)
Why does the second approach fail to install the webhook appropriately?
Should these two options yield the same result?
What finally worked is the following
RUN /venv/bin/pip install /app/sentry/
I don't know the subtle differences between these two installation modes.
I did notice, however, that /venv/bin/python /app/sentry/setup.py install did not produce an alerta_sentry.py but only the .egg file, i.e. ./venv/lib/python3.7/site-packages/alerta_sentry-5.0.0-py3.7.egg
On the other hand, /venv/bin/pip install /app/sentry/ unpacked (?) the .egg, creating ./venv/lib/python3.7/site-packages/alerta_sentry.py
I also don't know why the second installation option (i.e. the one creating the .egg file) was not working at run time.
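One way to see the difference concretely (a sketch, using only names that appear in the listings above) is to ask the venv interpreter what it can import and what metadata it sees:

# Run with /venv/bin/python inside the container.
import importlib
import pkg_resources

mod = importlib.import_module("alerta_sentry")
print(mod.__file__)   # the pip install puts alerta_sentry.py directly in site-packages

dist = pkg_resources.get_distribution("alerta_sentry")
print(dist.version)   # expect 5.0.0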

Building extensions to AWS Lambda with Continuous Delivery

I have a GitHub repository containing an AWS Lambda function. I am currently using Travis CI to build, test, and then deploy this function to Lambda if all the tests succeed, using:
deploy:
  provider: lambda
  (other settings here)
My function has the following dependencies specified in its requirements.txt
Algorithmia
numpy
networkx
opencv-python
I have set the Travis CI build script to install into the working directory using the command below, so that the dependencies get properly copied over to my AWS Lambda function.
pip install --target=$TRAVIS_BUILD_DIR -r requirements.txt
The problem is that while the build in Travis CI succeeds and everything is deployed to the Lambda function successfully, testing my Lambda function results in the following error:
Unable to import module 'mymodule':
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
My best guess as to why this is happening is that numpy is being built on the Ubuntu distribution that Travis CI uses, but the Amazon Linux it runs on when executing as a Lambda function isn't able to run it properly. There are numerous forum posts and blog posts such as this one detailing that Python modules that need to build C/C++ extensions must be built on an EC2 instance.
My question is: it is a real hassle to add another complication to the CD pipeline and mess around with EC2 instances. Has Amazon come up with a better way to do this (because there really should be one), or is there some way to have everything compiled properly in Travis CI or another CI solution?
Also, I suppose it's possible that I've misidentified the problem and that there is some other reason why importing numpy is failing. If anyone has suggestions on how to resolve this, that would be great!
EDIT: As suggested by @jordanm, it looks like it may be possible to load a Docker container with the amazonlinux image when running Travis CI and then perform my build and test inside that container. Unfortunately, while that certainly is easier than using EC2, I don't think I can use the normal Lambda deploy tools in Travis CI; I'll have to write my own deploy script using the aws cli, which is a bit of a pain. Any other ideas, or ways to make this smoother? Ideally I would be able to specify what Docker image my builds run on in Travis CI, as their default build environment is already using Docker...but they don't seem to support that functionality yet: https://github.com/travis-ci/travis-ci/issues/7726
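For reference, the deploy script itself would not have to be long; a boto3 sketch (the function name, region, and zip path below are placeholders) would be something like:

# Assumes AWS credentials are available in the CI environment and that the build
# step already produced lambda_deploy.zip.
import boto3

client = boto3.client("lambda", region_name="us-west-1")  # placeholder region

with open("lambda_deploy.zip", "rb") as f:
    client.update_function_code(
        FunctionName="yourFunction",  # placeholder
        ZipFile=f.read(),
    )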
After quite a bit of tinkering I think I've found something that works. I thought I'd post it here in case others have the same problem.
I decided to use Wercker as they have quite a generous free tier and allow you to customize the docker image for your builds.
It turns out there is a Docker image that has been created to replicate the exact environment that Lambda functions are executed in! See: https://github.com/lambci/docker-lambda When running your builds in this Docker container, extensions will be built properly so they can execute successfully on Lambda.
In case anyone does want to use Wercker, here's the wercker.yml I used; it may be helpful as a template:
box: lambci/lambda:build-python3.6
build:
  steps:
    - script:
        name: Install Dependencies
        code: |
          pip install --target=$WERCKER_SOURCE_DIR -r requirements.txt
          pip install pytest
    - script:
        name: Test code
        code: pytest
    - script:
        name: Cleaning up
        code: find $WERCKER_SOURCE_DIR \( -name \*.pyc -o -name \*.pyo -o -name __pycache__ \) -prune -exec rm -rf {} +
    - script:
        name: Create ZIP
        code: |
          cd $WERCKER_SOURCE_DIR
          zip -r $WERCKER_ROOT/lambda_deploy.zip . -x *.git*
deploy:
  box: golang:latest
  steps:
    - arjen/lambda:
        access_key: $AWS_ACCESS_KEY
        secret_key: $AWS_SECRET_KEY
        function_name: yourFunction
        region: us-west-1
        filepath: $WERCKER_ROOT/lambda_deploy.zip
Although I appreciate you may not want to add further complications to your project, you could potentially use a Python-focused Lambda management tool for setting up your builds and deployments, say something like Gordon. You could also just use this tool to do your deployment from inside the Amazon Linux Docker container running within Travis.
If you wish to change CI providers, CodeShip allows you to build with any Docker container of your choice, and then deploy to Lambda.
Wercker also runs full Docker-based builds and has many user-submitted deploy "steps", some of which support deployment to Lambda.

Python: Cannot get imports to work when installed on remote server

(Before responding with a 'see this link' answer, know that I've been searching for hours and have probably read it all. I've done my due diligence, I just can't seem to find the solution)
That said, I'll start with my general setup and give details after.
Setup: On my desktop, I have a project that I am running in PyCharm with Python 3.4, using a virtual environment. In the cloud (AWS), I have an EC2 instance running Ubuntu. I'm not using a virtual environment in the cloud. The cloud machine has both Python 2.7 and Python 3.5 installed.
[Edit] I've switched to a virtual environment on my cloud machine, and I'm installing from a setup distribution (still broken).
Problem: On my desktop, both within PyCharm and from the command line (within the virtual environment, using workon project), I can run a particular file called "do_daily.py" without any issues. However, if I try to run the same file on the cloud server, I get the famous import error.
[Edit] Running directly from the command line on the remote server:
python3 src/do_daily.py
File "src/do_daily.py", line 3, in <module>
from src.db_backup import dev0_backup as dev0bk
ImportError: No module named 'src.db_backup'
Folder Structure: My folder structure for the specific import is (among other things):
+ project
    + src
        - __init__.py
        - do_daily.py
        + db_backup
            - __init__.py
            - dev0_backup.py
Python Path: (echo $PYTHONPATH)
/home/ubuntu/automation/Project/src/tg_servers:/home/ubuntu/automation/Project/src/db_backup:/home/ubuntu/automation/Project/src/aws:/home/ubuntu/automation/Project/src:/home/ubuntu/automation/Project
Other stuff:
print(sys.executable) = /usr/bin/python3
print(sys.path) = gives me all the above plus a bunch of default paths.
I have run out of ideas and would appreciate any help.
Thank you,
SteveJ
SOLUTION
Clearly the accepted answer was the most comprehensive and represents the best approach to the problem. However, for those seeing this later, I was able to solve my specific problem a little more directly.
(From within the virtual environment), both add2virtualenv and creating .pth files worked. What I was missing is that I had to add all of the packages: src, db_backup, pkgx, y, z, etc.
I have created a GitHub repository (https://github.com/thebjorn/pyimport.git) and tested the code on a freshly created AWS/Ubuntu instance.
First the installs and updates I did (installing and updating pip3):
ubuntu#:~$ sudo apt-get update
ubuntu#:~$ sudo apt install python3-pip
ubuntu#:~$ pip3 install -U pip
Then get the code:
ubuntu#:~$ git clone https://github.com/thebjorn/pyimport.git
My version of do_daily.py imports dev0_backup, contains a function that tells us it was called, and has a __main__ section (for calling it with -m or by filename):
ubuntu#ip-172-31-29-112:~$ cat pyimport/src/do_daily.py
from __future__ import print_function
from src.db_backup import dev0_backup as dev0bk

def do_daily_fn():
    print("do_daily_fn called")

if __name__ == "__main__":
    do_daily_fn()
The setup.py file points directly to the do_daily_fn:
ubuntu#ip-172-31-29-112:~$ cat pyimport/setup.py
from setuptools import setup

setup(
    name='pyimport',
    version='0.1',
    description='pyimport',
    url='https://github.com/thebjorn/pyimport.git',
    author='thebjorn',
    license='MIT',
    packages=['src'],
    entry_points={
        'console_scripts': """
            do_daily = src.do_daily:do_daily_fn
        """
    },
    zip_safe=False
)
Install the code in dev mode:
ubuntu#:~$ pip3 install -e pyimport
I can now call do_daily in a number of ways (notice that I haven't done anything with my PYTHONPATH).
The console_scripts entry in setup.py makes it possible to call do_daily by just typing its name:
ubuntu#:~$ do_daily
do_daily_fn called
Installing the package (in dev mode or otherwise) makes the -m flag work out of the box:
ubuntu#:~$ python3 -m src.do_daily
do_daily_fn called
You can even call the file directly (although this is by far the ugliest way and I would recommend against it):
ubuntu#:~$ python3 pyimport/src/do_daily.py
do_daily_fn called
Your PYTHONPATH should contain /home/ubuntu/automation/Project and likely nothing below it.
There is every reason to use a virtualenv in production and never install any packages into the system Python explicitly. The system Python is for running the OS-provided software written in Python. Don't mix it with your deployments.
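A quick way to verify this on the server (a minimal sketch) is to print what the failing interpreter actually searches; for from src.db_backup import dev0_backup to resolve, the directory that contains src (here /home/ubuntu/automation/Project) must be on the path, not its subdirectories:

# Run with the same interpreter that raises the ImportError, e.g. python3.
import sys
for p in sys.path:
    print(p)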
A few questions here.
From which directory are you running your program?
Did you try to import the db_backup module inside of src/__init__.py?

Upgrade pyramid, SQLAlchemy, zope and rebuild Python project

I have inherited a Python REST API that runs on Ubuntu.
My main goal is to update these Python components to the latest releases, e.g. zope is now at 2.0.
It uses Python 2.7, Pyramid 1.4.1, zope 0.6, transaction 1.3, SQLAlchemy 0.7.9, WebError 0.10.3, and uses nginx as the web server.
Oh, and it uses cx_Oracle to connect to the Oracle instance.
The project (and other items) are in a folder called rest_api, where I can see setup.py and some other custom setup files, setup_prod.py, etc.
I went to /usr/local/lib/python-2.7/sites-packages and tried running "pip install --upgrade [package_name]", and the command completed successfully for each package.
Is this all I need to do, or do I have to rebuild the project with setup*.py?
I found some notes that showed two commands that look like what I want:
rebuild_cmd = "cd %s/python/rest_api/; /usr/bin/env python setup_prod.py build" % current_dir
install_cmd = "cd %s/python/rest_api/; sudo /usr/bin/env python setup_prod.py install" % current_dir
...but when I try running "python setup_prod.py build" from the directory, with or without sudo, I get a traceback error.
To summarize -
How do I upgrade the python packages like zope, SQLAlchemy, Pyramids, etc. to the latest release?
Do I need to rebuild the project if I am only upgrading the python packages from above?
Without knowing the program details, is there a "basic" python build sequence that I can try, e.g. run setup.py build, then setup.py install, or something else?
It uses Python 2.7, Pyramid 1.4.1, zope 0.6, transaction 1.3, SQLAlchemy 0.7.9, WebError 0.10.3
How do you know this? Find the place where any of these versions are mentioned (I guess they should be mentioned somewhere in setup_prod.py), change them to what you want, build the project, and check whether the app works with the new dependencies.
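As an illustration only (a sketch; the real setup_prod.py will differ, and the pins below are just the versions quoted in the question), such declarations usually live in the install_requires list of the setup call:

# Hypothetical excerpt of a setup_prod.py; raise these pins, rebuild, and re-test.
from setuptools import setup, find_packages

setup(
    name='rest_api',                 # placeholder
    packages=find_packages(),
    install_requires=[
        'pyramid==1.4.1',
        'SQLAlchemy==0.7.9',
        'transaction==1.3',
        'WebError==0.10.3',
        # ...plus whichever zope distribution and cx_Oracle version the project uses
    ],
)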
...but when I try running "python setup_prod.py build" from the directory, with or without sudo, I get a traceback error.
Please show your traceback.
