Apache Beam Dataflow pipeline build and deploy with Bazel - python

Looking at the official Python Beam documentation page it seems like the only way to do deployments is to have a setup.py that defines dependencies that exist within your repo or externally.
But this doesn't quite fit the Bazel way of managing Python dependencies: I have neither a setup.py file nor a separate requirements.txt file for each pipeline in my repo.
How does one package and deploy jobs to runners using Bazel?

You can have a script that runs pip freeze > ... to generate the requirements files for your pipelines, then pass those files when you deploy.
If multiple pipelines share the same set of Python dependencies, and the superset of all their dependencies is not too big, you can instead build a custom container and submit it with all the jobs.
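A minimal sketch of how the generated requirements file might then be handed to a Beam pipeline at submission time (project, bucket and image names are placeholders; the custom-container flag is an assumption about newer SDK versions):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Options for submitting to Dataflow; all names below are hypothetical.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    # requirements file produced by the pip freeze step described above
    requirements_file="requirements.txt",
    # alternative route: a prebuilt custom worker container (assumption)
    # sdk_container_image="gcr.io/my-gcp-project/beam-worker:latest",
)

with beam.Pipeline(options=options) as p:
    _ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)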

Related

DevOps on python multi module projects

I am trying to implement DevOps (CI/CD pipelines) on a few Python projects. All of the Python projects are available in a git repository.
I have implemented DevOps on each Python project individually; shown below is the build pipeline for one of the projects, and it was working successfully.
Now I have brought all of the Python projects into a single repository as a multi-module project, as shown below.
Now I want to implement DevOps by creating CI/CD pipelines for each of the modules separately.
For that, I have created build (CI) pipelines similar to the one above, modifying the requirement.txt and setup.py paths to modulename/requirement.txt and modulename/setup.py.
setup.py is failing with an error that it is unable to find the package, and the test task is failing because it cannot find the packages that contain the test resource files.
Below is the folder structure of the Python module.
Any idea how to resolve these errors?
Is there any reference for implementing DevOps (CI/CD) on Python multi-module projects?
Any leads much appreciated!
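Without seeing the folder structure it is hard to be specific, but one common cause is that setup.py gets invoked from the repository root rather than from the module directory, so find_packages() looks in the wrong place. A hedged sketch of a typical per-module setup.py (all names hypothetical), intended to be run from inside the module directory (e.g. pip install ./modulename):

import setuptools

setuptools.setup(
    name="modulename",                     # hypothetical module name
    version="0.1.0",
    # find_packages() searches the current working directory, so invoke this
    # from inside modulename/ (or via pip install ./modulename)
    packages=setuptools.find_packages(exclude=["tests", "tests.*"]),
    include_package_data=True,             # pick up test/resource files declared in MANIFEST.in
)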

Google Cloud Dataflow Dependencies

I want to use dataflow to process in parallel a bunch of video clips I have stored in google storage. My processing algorithm has non-python dependencies and is expected to change over development iterations.
My preference would be to use a dockerized container with the logic to process the clips, but it appears that custom containers are not supported (in 2017):
use docker for google cloud data flow dependencies
Although they may be supported now - since it was being worked on:
Posthoc connect FFMPEG to opencv-python binary for Google Cloud Dataflow job
According to this issue a custom docker image may be pulled, but I couldn't find any documentation on how to do it with dataflow.
https://issues.apache.org/jira/browse/BEAM-6706?focusedCommentId=16773376&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16773376
Another option might be to use setup.py to install any dependencies as described in this dated example:
https://cloud.google.com/blog/products/gcp/how-to-do-distributed-processing-of-landsat-data-in-python
However, when running the example I get an error that there is no module named osgeo.gdal.
For pure Python dependencies I have also tried passing the --requirements_file argument; however, I still get an error: Pip install failed for package: -r
I could find documentation for adding dependencies to apache_beam, but not to Dataflow, and it appears the apache_beam instructions do not work, based on my tests of --requirements_file and --setup_file.
This was answered in the comments; rewriting here for clarity:
In Apache Beam you can modify the setup.py file, which will be run once per container on start-up. This file allows you to perform arbitrary commands before the SDK Harness starts to receive commands from the Runner Harness.
A complete example can be found in the Apache Beam repo.
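A minimal sketch of that pattern, loosely modeled on the juliaset example in the Beam repo (the system package installed here is only an illustration): a custom setuptools command runs arbitrary shell commands when the package is installed on each worker.

import subprocess
from distutils.command.build import build as _build

import setuptools

# Shell commands to run on each worker before the pipeline code executes;
# the apt package below is just an example.
CUSTOM_COMMANDS = [
    ["apt-get", "update"],
    ["apt-get", "--assume-yes", "install", "libgdal-dev"],
]

class build(_build):
    # chain the custom command into the normal build
    sub_commands = _build.sub_commands + [("CustomCommands", None)]

class CustomCommands(setuptools.Command):
    user_options = []

    def initialize_options(self):
        pass

    def finalize_options(self):
        pass

    def run(self):
        for command in CUSTOM_COMMANDS:
            subprocess.check_call(command)

setuptools.setup(
    name="my-pipeline-package",    # hypothetical
    version="0.0.1",
    packages=setuptools.find_packages(),
    cmdclass={"build": build, "CustomCommands": CustomCommands},
)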
As of 2020, you can use Dataflow Flex Templates, which allow you to specify a custom Docker container in which to execute your pipeline.

How to deploy AWS python Lambda project locally?

I have an AWS Python Lambda function which contains a few Python files and also several dependencies.
The app is built using Chalice, so the function is exposed like any REST endpoint.
Before deploying to the prod environment I want to test it locally, so I need to package this whole project (Python files and dependencies). I looked over the web for a solution but couldn't find one.
I managed to figure out how to deploy a single Python file, but not a whole project.
Take a look at Atlassian's Localstack: https://github.com/atlassian/localstack
It's a full local copy of the AWS cloud stack.
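If you go the Localstack route, a quick sanity check is pointing boto3 at the local endpoint instead of AWS. The port is an assumption (Localstack's default edge port); adjust it to whatever your setup exposes:

import boto3

# Point the client at the locally running Localstack instead of real AWS.
lambda_client = boto3.client(
    "lambda",
    endpoint_url="http://localhost:4566",   # assumed Localstack edge endpoint
    region_name="us-east-1",
    aws_access_key_id="test",                # dummy credentials are fine locally
    aws_secret_access_key="test",
)

response = lambda_client.invoke(
    FunctionName="my-chalice-function",      # hypothetical function name
    Payload=b"{}",
)
print(response["Payload"].read())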
I use Travis: I hooked it to my master branch in git, so that when I push to this branch, Travis tests my lambda with a script that uses pytest, after installing all its dependencies with pip install. If all the tests pass, it then deploys the lambda to AWS in my prod env.

Python Development in multiple repositories

We are trying to find the best way to approach that problem.
Say I work in a Python environment, with pip & setuptools.
I work in a normal git flow, or so I hope.
So:
Move to a feature branch in some app, make changes.
Move to a feature branch in a dependent lib, develop the thing.
Point the app, using "-e git+ssh", to the feature branch of the dependent lib.
Create a pull request.
When this is all done, I want to merge everything to master, but I can't without making yet another final change so that the app's requirements.txt (step 3 above) points back to the lib's main branch.
Is there any good workflow for "microservices" or multiple dependent codebases in Python that we are missing?
Python application workflow from development to deployment
It looks like you are searching for a way to develop Python applications using git.
The following description applies to any kind of Python-based application,
not only to Pyramid-based web ones.
Requirements
Situation:
developing a Python-based solution using the Pyramid web framework
there are multiple Python packages participating in the final solution; packages might depend on each other
some packages come from the public PyPI, others might be private ones
source code is controlled by git
Expectation:
the proposed working style shall allow:
pull requests
it shall work for situations where packages depend on each other
deployments shall be repeatable
Proposed solution
Concepts:
even the Pyramid application is released as a versioned package
for a private PyPI, use devpi-server, incl. volatile and release indexes
for package creation, use pbr
use tox for package unit testing
test before you release a new package version
test before you deploy
keep deployment configuration separate from the application package
Pyramid web app as a package
Pyramid allows creating applications in the form of a Python package; in fact, the whole initial tutorial (containing 21 stages) uses exactly this approach.
Although you can run the application in develop mode, you do not have to do so in production. Running from a released package is easy.
Pyramid uses nice .ini configuration files. Keep development.ini in the package repository, as it is an integral part of development.
On the other hand, make sure production .ini files are not present: they should not be mixed into the application, as they belong to the deployment.
To make deployment easier, add to your package a command which prints a typical deployment configuration to stdout. Name the script e.g. myapp_gen_ini.
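A minimal sketch of such a command (the package name, server and port in the template are illustrative); it would be exposed as a console_scripts entry point named myapp_gen_ini in the package metadata:

import sys

# Template for a production configuration; adjust to whatever your
# deployment actually needs (a waitress server section is shown as example).
TEMPLATE = """\
[app:main]
use = egg:myapp

[server:main]
use = egg:waitress#main
listen = *:6543
"""

def main():
    sys.stdout.write(TEMPLATE)

if __name__ == "__main__":
    main()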
Write unit tests and configure tox.ini to run them.
Keep deployment stuff separate from application
Mixing application code with deployment configuration will cause problems the moment you have to install a second instance (as you are likely to change at least one line of your configuration).
In the deployment repository:
keep requirements.txt here, listing the application package and the other packages needed for production. Be sure to pin an exact package version, at least for your application package.
keep the production.ini file here. If you have more deployments, use one branch per deployment.
put tox.ini here
tox.ini shall have the following content:
tox.ini shall have following content:
[tox]
envlist = py27
# use py34 or others, if you prefer
[testenv]
commands =
deps =
-rrequirements.txt
The expected use of the deployment repository is:
clone it to the server
run tox; this will create the virtualenv .tox/py27
activate the virtualenv with $ source .tox/py27/bin/activate
if production.ini does not exist in the repo yet, run $ myapp_gen_ini > production.ini to generate a template for the production configuration
edit production.ini as needed
test that it works
commit the production.ini changes to the repository
do the other stuff needed to deploy the app (configure web server, supervisord etc.)
For setup.py use pbr package
To make package creation simpler, and to keep package versioning tied to git repository tags, use pbr. You will end up with a setup.py that is only 3 lines long, and all relevant stuff will be specified in setup.cfg in the form of an ini file.
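That minimal setup.py typically looks like this (pbr's documented usage); everything else goes into setup.cfg:

import setuptools

setuptools.setup(
    setup_requires=["pbr"],
    pbr=True,
)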
Before you build for the first time, you have to have some files committed in the git repository, otherwise pbr will complain. As you use git, this should be no problem.
To assign new package version, set $ git tag -a 0.2.0 and build it. This will
create the package with version 0.2.0.
As a bonus, it will create AUTHORS and ChangeLog based on your commit
messages. Keep these files in .gitignore and use them to create AUTHORS.rst
and ChangeLog.rst manually (based on autogenerated content).
When you push your commits to another git repository, do not forget to push the tags too.
Use devpi-server as private pypi
devpi-server is an excellent private PyPI, which brings you the following advantages:
having a private PyPI at all
cached public PyPI packages
faster builds of virtual environments (as it installs from cached packages)
being able to use pip even without internet connectivity
pushing between various types of package indexes: one for development (a published version can still change here), one for deployment (a released version will not change here)
simple unit test runs for anyone with access to it; it will even collect the results and make them visible via a web page
For the described workflow it will serve as the repository of Python packages which can be deployed.
The commands to use are:
$ devpi upload to upload the developed package to the server
$ devpi test <package_name> to download, install, run the unit tests, publish the test results to devpi-server and clean up the temporary installation
$ devpi push ... to push a released package to the proper index on devpi-server, or even to the public PyPI
Note that it is easy at any time to have pip configured to consume packages from a selected index on the devpi server for $ pip install <package>.
devpi-server is also ready for use in continuous integration testing.
How git fits into this workflow
The described workflow is not bound to a particular style of using git.
On the other hand, git plays its role in the following situations:
commit: the commit message will become part of the autogenerated ChangeLog
tag: defines versions (recognized by setup.py based on pbr)
As git is distributed, with multiple repositories, branches etc., devpi-server allows a similar distribution, as each user can have their own working index to publish to. In the end, though, there will be one git repository with a master branch to use, and on devpi-server one agreed production index.
Summary
The described process is not simple, but its complexity matches the complexity of the task.
It is based on tools:
tox
devpi-server
pbr (Python package)
git
The proposed solution allows:
managing Python packages, incl. release management
unit testing and continuous integration testing
any style of using git
deployment and development with clearly defined scopes and interactions
Your question assumes multiple repositories. The proposed solution decouples multiple repositories by means of well-managed package versions published to devpi-server.
We ended up using git dependencies and not devpi.
I think when git is used, there is no need to add another package repository, as long as pip can use it.
The core issue, where the branch code differs from what gets merged to master (because of a second-level dependency), is not solved yet; instead, we work around it by working to remove that second-level dependency.

python tox, creating rpm virtualenv, as part of ci pipeline, unsure about where in workflow

I'm investigating how Python applications can also use a CI pipeline, but I'm not sure how to create the standard work-flow.
Jenkins is used to do the initial repository clone and then initiates tox. Basically this is where maven and/or msbuild would fetch dependency packages and build... which tox does via pip, so all good here.
But now for the confusing part: the last part of the pipeline is creating and uploading packages. Devs would likely upload created packages to a local pip repository, but then also possibly create a deployment package. In this case it would need to be an RPM containing a virtualenv of the application. I have made one manually using rpmvenv, but regardless of how it's made, how would such a step be added to a tox config? In the case of rpmvenv, it creates its own virtualenv, a self-contained command so to speak.
I like going with the Unix philosophy for this problem: have a tool that does one thing incredibly well, then compose other tools together. Tox is purpose-built to run your tests in a bunch of different Python environments, so using it to then build a deb/rpm/etc. feels like a bit of a misuse of that tool. It's probably easier to use tox just to run all your tests and then, depending on the results, have another step in your pipeline deal with building a package for what was just tested.
Jenkins 2.x, which is fairly recent at the time of this writing, seems to be much better about building pipelines. Buildbot is going through a decent amount of development and already makes it fairly easy to build a good pipeline for this as well.
What we've done at my work is
Buildbot in AWS which receives push notifications from Github on PR's
That kicks off a docker container that pulls in the current code and runs Tox (py.test, flake8, as well as protractor and jasmine tests)
If the tox step comes back clean, kick off a different docker container to build a deb package
Push that deb package up to S3 and let Salt deal with telling those machines to update
That deb package is also just available as a build artifact, similar to what Jenkins 1.x would do. Once we're ready to go to staging, we just take that package and promote it to the staging debian repo manually. Ditto for rolling it to prod.
Tools I've found useful for all this:
Buildbot, because it's in Python and thus easier for us to work on, but Jenkins would work just as well. Regardless, this is the controller for the entire pipeline
Docker because each build should be completely isolated from every other build
Tox the glorious test runner to handle all those details
fpm builds the package. RPM, DEB, tar.gz, whatever. Very configurable and easy to script.
Aptly makes it easy to manage debian repositories and in particular push them up to S3.
