Bundle virtual/conda environment for EMR bootstrap - python

I have been using Anaconda Python a lot but have also made many package upgrades (like pandas). I've written some tools that I want to turn into a MapReduce job, and I've researched how to go about Python EMR bootstrapping for package dependencies.
I thought about a possible workaround: just getting and installing the Anaconda distribution. But then I remembered that I'd have to do all the necessary upgrading.
My last effort at possibly making this easy is this question: is there a way to "rebundle" the upgraded Anaconda (or one of its environments) so that it can be stored on S3 and used in the EMR bootstrap action?
Thanks for any help!
ADDED: I suppose it would require a license to be able to wrap up an Anaconda distro like this and use it on various machines, be they in my office network or on AWS. Here's an open-source version of this question (I just learned the main package manager for the Anaconda distro is actually open source):
Suppose I have a virtual (or conda) environment with various modules and extensions installed. What is the proper way, if any, to encapsulate/bundle this virtual environment so that I can efficiently deploy it as needed? I have come across 'pip bundle', and there are 'conda clone' and 'conda create' as well. There is also the concept of conda channels. It's just not clear to me whether I can put these together for efficient deployment on EMR and, if so, how.

The license allows you to do this, if that's what you're asking.
You might also look at http://continuum.io/anaconda-cluster and http://continuumio.github.io/conda-cluster/.
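Since the question itself mentions 'conda clone' and 'conda create', one route worth sketching (the environment names below are placeholders, and the target machine still needs a working Anaconda/Miniconda install) is to clone the upgraded environment and export its package list:
conda create --name deploy_env --clone base
conda env export -n deploy_env > environment.yml
# on the target machine:
conda env create -f environment.yml
This doesn't give you a single relocatable bundle on S3, but it does let you reproduce the upgraded environment from a small text file inside a bootstrap action.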

Related

Python dependency management best practices

I have a little Python side project which is experiencing some growing pains, and I'm wondering how people on larger Python projects manage this issue.
The project is Python/Flask/Docker deployed to AWS. Listed dependencies (that we import directly in the project) are installed from a requirements.txt file with explicit version numbers. We added the version numbers after noticing our new deployments (which rebuild Docker/dependencies etc) would sometimes install newer versions of the packages, causing the project to break.
The issue we're facing now is that an onboarding developer is setting up her environment and facing the same issue - this time with sub-dependencies of the original dependencies. (For example, Flask might install Werkzeug, Jinja2, etc., and if some of these are the wrong version, the app breaks.) The obvious solution is to go through each sub-dependency and list out every package, with explicit versions, in requirements.txt. But this is a bit of a pain so I'm asking around to see what people do on Real Projects.
You guys can't be doing this all manually, right? In JS we have NPM and package-lock.json files and so on - they're built automatically. Is there some equivalent in Python? Have I missed something basic that we should be using here?
Thanks in advance
I wrote a tool that might be helpful for this, called realreq. You can install it from pip with pip install realreq. It will generate the requirements you have by reading through your source files and recursively specifying their requirements.
realreq --deep -s /path/to/source will fully specify your dependencies and their sub-dependencies. Note that if you are using a virtual environment you need to have it activated for realreq to be able to find the dependencies, and they must be installed (i.e. realreq needs to be run in an environment where the dependencies are installed). One of your engineers who has a working environment can run it and then pass the output as a requirements.txt file to your new engineers.
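Putting that together, a rough sketch of the workflow (the source path is a placeholder) would be:
# run inside the environment where the project's dependencies are installed
pip install realreq
realreq --deep -s /path/to/source > requirements.txt
# a new engineer then recreates the environment from the generated file
pip install -r requirements.txt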

Managing Multiple Python installations

Much modern software depends on Python and, as a consequence, installs its own version of Python with the libraries each particular application needs to work properly.
In my case, I have my own Python that I intentionally installed via the Anaconda distribution, but I also have the ones that came with ArcGIS, QGIS, and others.
I have difficulty telling which Python I am updating or adding libraries to when I reach them from the command line, and these are not environments but rather full Python installations.
What is the best way to tackle that?
Is there a way to force new software to create new environments within one central Python distribution instead of losing track of all the copies on my computer?
Note that I am aware that QGIS can be downloaded now through conda, which reduces the size of my problem, but doesn't completely solve it. Moreover, that version of QGIS comes with its own issues.
Thank you very much.
As Nukala suggested, that's exactly what virtual environments are for. A virtual environment contains a particular version of the Python interpreter and a set of libraries to be used by a single project (or sometimes several). If you use an IDE such as PyCharm, it handles the venvs for you automatically.
You can use pyenv to manage Python versions on your system. Using pyenv you can easily switch between multiple versions.
And as suggested, each project can have its own virtual environment. You have multiple options here: venv, virtualenv, virtualenvwrapper, pipenv, poetry, etc.
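A minimal sketch of that combination (the interpreter version is just an example) might look like:
pyenv install 3.11.9
pyenv local 3.11.9
# create and activate a per-project virtual environment with that interpreter
python -m venv .venv
source .venv/bin/activate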

setup.py + virtualenv = chicken and egg issue?

I'm a Java/Scala dev transitioning to Python for a work project. To dust off the cobwebs on the Python side of my brain, I wrote a webapp that acts as a front-end for Docker when doing local Docker work. I'm now working on packaging it up and, as such, am learning about setup.py and virtualenv. Coming from the JVM world, where dependencies aren't "installed" so much as downloaded to a repository and referenced when needed, the way pip handles things is a bit foreign. It seems like best practice for production Python work is to first create a virtual environment for your project, do your coding work, then package it up with setup.py.
My question is, what happens on the other end when someone needs to install what I've written? They too will have to create a virtual environment for the package, but won't know how to set it up without inspecting the setup.py file to figure out what version of Python to use, etc. Is there a way for me to create a setup.py file that also creates the appropriate virtual environment as part of the install process? If not, or if that's considered a "no" (as one respondent stated in a related SO post), what is considered "best practice" in this situation?
You can think of virtualenv as providing isolation for the packages you install using pip. It is a simple way to handle different versions of Python and packages. For instance, suppose you have two projects which use the same packages but in different versions. By using virtualenv you can isolate those two projects and install the different versions of the packages separately, rather than into your system Python.
Now, let's say you want to work on a project with your friend. In order to have the same packages installed, you have to somehow share which packages, and which versions, your project depends on. If you are delivering a reusable package (a library) then you need to distribute it, and this is where setup.py helps. You can learn more in the Quick Start.
However, if you work on a web site, all you need is to put the library versions into a separate file. Best practice is to create separate requirements files for tests, development and production. To see the format of the file, run pip freeze. You will be presented with a list of the packages currently installed on the system (or in the virtualenv). Put it into the file and you can install it later on another PC, into a completely clean virtualenv, using pip install -r development.txt
And one more thing: please do not pin strict versions of packages the way pip freeze shows them; most of the time you want >= some minimum version. The good news is that pip handles dependencies on its own, which means you do not have to list dependent packages there; pip will sort them out.
Talking about deployment, you may want to check out tox, a tool for managing virtualenvs. It helps a lot with deployment.
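As a minimal sketch of the requirements-file workflow described above (the file name follows the answer; pip doesn't care what it's called):
# on a machine where everything already works
pip freeze > development.txt
# on another PC, inside a clean virtualenv
pip install -r development.txt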
Python's default package path points to the system environment, which needs administrator access to install into. Virtualenv localises the installation to an isolated environment.
For deployment/distribution of a package, you can choose to:
Distribute by source code, where the user needs to run python setup.py install, or
Pack your Python package and upload it to PyPI or a custom devpi index, so the user can simply use pip install <yourpackage>
However, as you noticed above: without virtualenv, the user needs administrator access to install any Python package.
In addition, the PyPI package world contains a certain number of badly tested packages that don't work out of the box.
Note: virtualenv itself is essentially a hack to achieve isolation.
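As a hedged sketch of the second option (twine is not mentioned above, it is just one common way to upload; the package name is a placeholder):
# build a source distribution
python setup.py sdist
# upload it to PyPI or a private devpi index
twine upload dist/*
# end users then install it with
pip install <yourpackage>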

Is it possible to use a Conda environment as a "virtualenv" for a Hadoop Streaming Job (in Python)?

We are currently using Luigi, MRJob and other frameworks to run Hadoop streaming jobs using Python. We are already able to ship each job with its own virtualenv so no specific Python dependencies are installed on the nodes (see the article). I was wondering if someone has done something similar with the Anaconda/conda package manager.
P.S. I am also aware of Conda-Cluster; however, it looks like a more complex/sophisticated solution (and it is behind a paywall).
Update 2019:
The answer is yes, and the way to do it is with conda-pack:
https://conda.github.io/conda-pack/
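A typical conda-pack round trip (environment and file names are placeholders; see the docs above for details) looks roughly like this:
conda install -c conda-forge conda-pack
conda pack -n my_env -o my_env.tar.gz
# on each worker node:
mkdir -p my_env
tar -xzf my_env.tar.gz -C my_env
source my_env/bin/activate
# fix the hard-coded prefix paths inside the unpacked environment
conda-unpack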
I don't know a way of packaging a conda environment into a tar/zip so you can untar it on a different box and have it ready to use like in the example you mention; that might not be possible. At least not without Anaconda on all the worker nodes, and there might also be issues when moving between different OSes.
Anaconda Cluster was created to solve that problem (disclaimer: I am an Anaconda Cluster developer), but it uses a more complicated approach: basically, we use a configuration management system (Salt) to install Anaconda on all the nodes in the cluster and to control the conda environments.
We use a configuration management system because we also deploy the Hadoop stack (Spark and its friends) and we need to target big clusters, but in reality, if you only need to deploy Anaconda and don't have too many nodes, you should be able to do that just with Fabric (which Anaconda Cluster also uses in some parts) and run it from a regular laptop.
If you are interested Anaconda Cluster docs are here: http://continuumio.github.io/conda-cluster/

How to deploy Python applications inside a corporate network

First let me explain the current situation:
We have several Python applications which depend on custom packages (not publicly released) as well as generally known ones. These dependencies are all installed into the system Python installation. Distribution of the applications is done from source via git. All these computers are hidden inside a corporate network and don't have internet access.
This approach is a bit of a pain in the ass since it has the following downsides:
Libs have to be installed manually on each computer :(
How can I better deploy an application? I recently saw virtualenv, which seems to be the solution, but I don't quite see how it helps yet.
virtualenv creates a clean Python instance for my application. How exactly should I deploy this so that users of the software can easily start it?
Should there be a startup script inside the application which creates the virtualenv during start?
The next problem is that the computers don't have internet access. I know that I can specify a custom location for packages (network share?) but is that the right approach? Or should I deploy the zipped packages too?
Would another approach be to ship the whole Python instance, so the user doesn't have to set up the virtualenv? In this Python instance all necessary packages would be pre-installed.
Since our apps are fast growing we have a fast release cycle (2 weeks). Deploying via git was very easy. Users could pull from a stable branch via an update script to get the last release - would that be still possible or are there better approaches?
I know that these are a lot of questions. Hopefully someone can answer them or give me some advice.
You can use pip to install directly from git:
pip install -e git+http://192.168.1.1/git/packagename#egg=packagename
This applies whether you use virtualenv (which you should) or not.
You can also create a requirements.txt file containing all the stuff you want installed:
-e git+http://192.168.1.1/git/packagename#egg=packagename
-e git+http://192.168.1.1/git/packagename2#egg=packagename2
And then you just do this:
pip install -r requirements.txt
So the deployment procedure would consist of getting the requirements.txt file and then executing the above command. Adding virtualenv would make it cleaner, not easier; without virtualenv you would pollute the system-wide Python installation. virtualenv is meant to provide a solution for running many apps each in its own distinct virtual Python environment; it doesn't have much to do with how to actually install stuff in that environment.
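The answer above assumes the internal git server is reachable from every machine; for the fully offline case the question raises (a network share for packages), one common pattern (a sketch, assuming a share mounted at /mnt/share) is to pre-download everything once and install from that folder:
# on a machine that can reach the package index
pip download -r requirements.txt -d /mnt/share/packages
# on the offline machines
pip install --no-index --find-links=/mnt/share/packages -r requirements.txt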
