I am after advice on how to manage Python modules within the context of Docker.
Current options that I'm aware of include:
1. Installing them individually via pip in the build process
2. Installing them together via pip in the build process via requirements.txt
3. Installing them to a volume and adding the volume to the PYTHONPATH
Ideally I want a solution that is fully reproducible and that doesn't require every module to be re-installed if I decide to add another module or update the version of one of them.
From my perspective:
(2) is an issue because the Docker ADD command (needed to get access to the requirements.txt file) apparently invalidates the cache, which means that any change to the file forces everything to be re-built / re-installed every time you build the image.
(1) keeps the cache intact but means you'd need to specify the exact version for each package (and potentially their dependencies?), which seems like it could be pretty tedious and error-prone.
(3) is currently my personal favorite as it allows the packages to persist between images/builds and allows requirements.txt to be used. The only downside is that you are essentially storing the packages on your local machine rather than in the image, which makes the container dependent on the host OS and kind of defeats the point of a container.
So yeah, I'm not entirely sure what the best practices are here and would appreciate advice.
For reference there have been other questions on this topic but I don't feel any of them properly address my above question:
docker with modified python modules?
Docker compose installing requirements.txt
How can I install python modules in a docker image?
EDIT:
Just some additional notes to give some more context. My projects are typically data analysis focused (rather than software development or web development). I tend to use multiple images (1 for python, 1 for R, 1 for the database) using docker compose to manage them all together. So far I've been using a makefile on the host OS to re-build the project from scratch i.e. something like
some_output.pdf: some_input.py
	docker-compose run python_container python some_input.py
where the outputs are written to a volume on the host OS.
The requirements.txt file is the best option. (Even if changing it does a complete reinstall.)
A new developer starts on your project. They check out your source control repository and say, "oh, it's a Python project!", create a virtual environment, and run pip install -r requirements.txt, and they're set to go. A week later they come by and say "so how do we deploy this?", but since you've wrapped the normal Python setup in Docker they don't have to go out of their way to use a weird Docker-specific development process.
Disaster! Your primary server's hard disk has crashed! You have backups of all of your data, but the application code just gets rebuilt from source control. If you're keeping code in a Docker volume (or a bind-mounted host directory) you need to figure out how to rebuild it; but your first two options have that written down in the Dockerfile. This is also important for the new developer in the previous paragraph (who wants to test their image locally before deploying it) and any sort of cluster-based deployment system (Swarm, Kubernetes) where you'd like to just deploy an image and not also have to deploy the code alongside it, by hand, outside of the deployment system framework.
Another option is to use the multi-stage build feature. Create an intermediate build stage that installs the dependencies, and then just copy the resulting folder into the production image (the second build stage). This gives you the benefit of your option 3 as well.
It depends on which step in your build is more expensive and would benefit from caching. Compare the following:
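Dockerfile A
FROM ubuntu:16.04
# Install Python, pip etc.
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt /tmp/requirements.txt
RUN pip3 install -r /tmp/requirements.txt
# Run my build steps which are expensive.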
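Dockerfile B
FROM ubuntu:16.04 AS intermediate
# Install Python, pip etc.
RUN apt-get update && apt-get install -y python3 python3-pip
COPY requirements.txt /tmp/requirements.txt
# Assumption: install into /pip-packages so that the COPY below finds them there.
RUN pip3 install --target /pip-packages -r /tmp/requirements.txt
FROM ubuntu:16.04
# Run my build steps which are expensive.
COPY --from=intermediate /pip-packages/ /pip-packages/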
In the first case touching your requirements.txt will force a full build. In the second case, your expensive build steps are still cached. The intermediate build still runs but I assume that is not the expensive step here.
Related
My production server has no access to the internet, so it's a bit of a mess copying all the dependencies from my dev machine to the production/development server.
If I'd use virtualenv, I'd have all my dependencies in this environment. Doing this I'd also be able to deploy it on any machine, which has python & virtualenv installed.
But I've seen this rarely, and it seems kind of dirty.
Am I wrong and this could be a good practice, or are there other ways to solve that nicely?
Three options I would consider:
1. Run your own PyPI mirror with the dependencies you need. You really only need to build the file layout and pull from your local server using the index-url flag:
$ pip install --index-url http://pypi.beastcraft.net/ numpy
2. Build virtualenvs on the same architecture and copy those over as needed.
This works, but you're taking a risk on true portability.
3. Use terrarium to build virtual environments then bring those over (basically option 2 but with easier bookkeeping/automation).
I've done all of these and actually think that hosting your own PyPI mirror is the best option. It gives you the most flexibility when you're making a deployment or trying out new code.
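To make "build the file layout" a bit more concrete, here is a rough sketch rather than a definitive recipe. It assumes the index is a directory tree with one sub-directory per project (named after the lowercased project name, with runs of "-", "_" and "." replaced by "-"), and that you only have wheels, since sdist file names would need different parsing. The function name build_index, the directory names and the placeholder wheel are all made up for illustration.

```shell
# build_index WHEELHOUSE INDEX_ROOT
# Copies every wheel in WHEELHOUSE into INDEX_ROOT/<project>/, where <project>
# is the lowercased project name with runs of "-", "_" and "." turned into "-".
build_index() {
  wheelhouse=$1
  index_root=$2
  for wheel in "$wheelhouse"/*.whl; do
    [ -e "$wheel" ] || continue   # no wheels found: nothing to do
    file=$(basename "$wheel")
    # A wheel file name starts with the project name, up to the first "-".
    project=$(printf '%s\n' "${file%%-*}" | tr '[:upper:]' '[:lower:]' | sed 's/[-_.][-_.]*/-/g')
    mkdir -p "$index_root/$project"
    cp "$wheel" "$index_root/$project/"
  done
}

# Demo with an empty placeholder file instead of a real wheel (hypothetical name).
demo=$(mktemp -d)
mkdir -p "$demo/wheelhouse"
: > "$demo/wheelhouse/Some_Package-1.0-py3-none-any.whl"
build_index "$demo/wheelhouse" "$demo/simple"
ls "$demo/simple"   # prints: some-package
```

On a machine with internet access, something like `pip download -d wheelhouse -r requirements.txt` can fill the wheelhouse with real files. The resulting tree then has to be served by a web server that produces directory listings, with `--index-url` pointing at its root. Whether your pip version accepts a plain directory listing as a project page is an assumption you should test before relying on it.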
My colleagues don't want to create their own virtualenv and deploy my tools from git on their own as part of their automation. Instead they want the tools pre-installed on a shared server.
So I was thinking of making a /opt directory, putting the virtualenv in there and then pulling from git every hour to force an update of the python package. I am not version tagging the tools currently and instead would just ask pip to force the update every time.
The problem is the race condition. If the tool is called by their automation during the pip upgrade, the tool can fail due to the install not being atomic - I believe pip removes the entire package first.
I've thought of various workarounds (forcing the use of flock, using a symlink to atomically switch the virtualenv, wrapping the tool in a script that makes the virtualenv in a temp directory for every use...).
Is there a best practice here I'm not aware of?
edit: I also asked over at devops and got a pretty good answer.
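For readers wondering what the symlink workaround might look like, here is a minimal sketch, not a tested deployment script. It assumes GNU coreutils (for `mv -T`), that each complete virtualenv is built in its own directory under a hypothetical `releases/` folder, and that the automation only ever calls the tool through a hypothetical `current` symlink. The new link is created under a temporary name and then renamed over the old one, so callers see either the old environment or the new one, never a half-installed one.

```shell
# switch_release BASE RELEASE
# Points BASE/current at BASE/releases/RELEASE by creating the new symlink
# under a temporary name and renaming it over the old one in a single step.
switch_release() {
  base=$1
  release=$2
  ln -s "releases/$release" "$base/current.tmp.$$"
  mv -T "$base/current.tmp.$$" "$base/current"
}

# Demo with empty directories standing in for fully built virtualenvs.
base=$(mktemp -d)
mkdir -p "$base/releases/v1" "$base/releases/v2"
switch_release "$base" v1
switch_release "$base" v2
readlink "$base/current"   # prints: releases/v2
```

In a real setup the hourly job would build the new virtualenv under `releases/` first, call `switch_release` only once the install has finished, and clean up old releases later. A tool that is already running keeps using the files it has opened, so old releases should not be deleted immediately.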
I'm sure this has been answered elsewhere, but I must not know the right keywords to find the answer...
I'm working on a site that requires several different components deployed on different servers but relying on some shared functions. I've implemented this by putting the shared functions into a pip module in its own git repo that I can put into the requirements.txt file of each project.
This is pretty standard stuff - more or less detailed here:
https://devcenter.heroku.com/articles/python-pip
My question is now that I have this working to deploy code into production, how do I set up my dev environment in such a way that I can make edits to the code in the shared module without having to do all of the following?
1. Commit changes
2. Increment the version in setup.py in the shared library
3. Increment the version in requirements.txt
4. pip install -r requirements.txt
That's a lot of steps to do all over again if I make one typo.
On a similar note, I used Jenkins with a git hook and a simple bash script (maybe 4 or 5 lines) that would install/upgrade from requirements.txt, restart the web server and do a little more. When I commit changes, Jenkins runs my bash script, and voila: an almost instant upgrade.
But note that this is hack-ish. Jenkins is a continuous integration tool focused on building and testing, and there are probably better and simpler tools for this case (hint: Continuous Integration).
I'm building a site that relies on the output of a machine learning algorithm. All that is needed for the user-facing part of the site is the output of the algorithm (class labels for a set of items), which can be easily stored in and retrieved from the Django models. The algorithm could be run once a day, and does not rely on user input.
So this part of the site only depends on Django and related packages.
But developing, tuning, and evaluating the algorithm uses many other Python packages such as scikit-learn, pandas, numpy, matplotlib, etc. It also requires saving many different sets of class labels.
These dependencies cause some issues when deploying to Heroku, because numpy requires LAPACK/BLAS. It also seems like it would be good practice to have as few dependencies as possible in the deployed app.
How can I separate the machine-learning part from the user-facing part, but still have them integrated enough that the results of the algorithm are easily used?
I thought of creating two separate projects, and then writing to the user-facing database in some way, but that seems like it would lead to maintenance problems (managing the dependencies, changes in database schemas etc).
As far as I understand, this problem is a little bit different than using different settings or databases for production and development, because it is more about managing different sets of dependencies.
Moving what we discussed into an answer in case other people have the same question. My suggestion is:
Spend some time defining the dependencies for your site and for the algorithm code.
Dump the dependency list into requirements.txt for each project.
Deploy them on different environments so the conflicts don't happen.
Develop some API endpoints on the site side using Django REST Framework or Tastypie, and let your algorithm code update your models through that API. Use cron to run your algorithm code regularly and push the data.
Create a requirements file for each environment, and a base requirements file for those packages shared by all the environments.
$ mkdir requirements
$ pip freeze > requirements/base.txt
$ echo "-r base.txt" > requirements/development.txt
$ echo "-r base.txt" > requirements/production.txt
Then adjust your development and production dependencies and install each one in the proper environment:
# change to your development virtualenv
# $ source .virtualenvs/development/bin/activate
$ pip install -r requirements/development.txt
# change to your production virtualenv
# $ source .virtualenvs/production/bin/activate
$ pip install -r requirements/production.txt
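As an illustration of what the files could look like after the "adjust" step, here is a small sketch that writes them into a throwaway directory. The split is only an assumption based on the question: Django is shared by every environment, while the analysis packages (scikit-learn, pandas, matplotlib) are only needed for development. Unpinned names are used to keep the sketch short; a real `pip freeze` would produce pinned versions.

```shell
# Write an example set of layered requirements files into a temporary directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/requirements"

# Packages shared by every environment.
printf '%s\n' 'Django' > "$workdir/requirements/base.txt"

# Production only needs the shared packages.
printf '%s\n' '-r base.txt' > "$workdir/requirements/production.txt"

# Development pulls in the shared packages plus the analysis stack.
printf '%s\n' '-r base.txt' 'scikit-learn' 'pandas' 'matplotlib' \
  > "$workdir/requirements/development.txt"

cat "$workdir/requirements/development.txt"
```

The `-r base.txt` line is resolved relative to the file that contains it, which is why the three files can live together in one `requirements/` directory.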
I prefer using Poetry as my dependency manager. It lets you define the dev dependencies in the same place, rather than maintaining separate requirements.txt files, which is extra work.
First let me explain the current situation:
We have several Python applications which depend on custom (not publicly released) packages as well as generally known ones. These dependencies are all installed in the system Python installation. Distribution of the applications is done via git, by source. All these computers are hidden inside a corporate network and don't have internet access.
This approach is a bit of a pain in the ass since it has the following downsides:
Libs have to be installed manually on each computer :(
How to better deploy an application? I recently saw virtualenv which seems to be the solution but I don't see it yet.
virtualenv creates a clean Python instance for my application. How exactly should I deploy this so that users of the software can easily start it?
Should there be a startup script inside the application which creates the virtualenv during start?
The next problem is that the computers don't have internet access. I know that I can specify a custom location for packages (network share?) but is that the right approach? Or should I deploy the zipped packages too?
Would another approach be to ship the whole Python instance, so the user doesn't have to start up the virtualenv? In this Python instance all necessary packages would be pre-installed.
Since our apps are fast growing we have a fast release cycle (2 weeks). Deploying via git was very easy. Users could pull from a stable branch via an update script to get the last release - would that be still possible or are there better approaches?
I know that there are a lot of questions. Hopefully someone can answer me or give me some advice.
You can use pip to install directly from git:
pip install -e git+http://192.168.1.1/git/packagename#egg=packagename
This applies whether you use virtualenv (which you should) or not.
You can also create a requirements.txt file containing all the stuff you want installed:
-e git+http://192.168.1.1/git/packagename#egg=packagename
-e git+http://192.168.1.1/git/packagename2#egg=packagename2
And then you just do this:
pip install -r requirements.txt
So the deployment procedure would consist of getting the requirements.txt file and then executing the above command. Adding virtualenv would make it cleaner, not easier; without virtualenv you would pollute the systemwide Python installation. virtualenv is meant to provide a solution for running many apps, each in its own distinct virtual Python environment; it doesn't have much to do with how to actually install stuff in that environment.
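Put together, the procedure could be wrapped in a small helper along these lines. This is only a sketch under a few assumptions: it uses the standard library's `python3 -m venv` in place of the virtualenv tool, it expects requirements.txt to sit in the application directory, and the function name deploy_app and the `venv` sub-directory are made up. The git URLs in requirements.txt still have to be reachable from the target machine, and git has to be installed there.

```shell
# deploy_app APP_DIR
# Creates (or reuses) a virtual environment in APP_DIR/venv and installs
# everything listed in APP_DIR/requirements.txt into it.
deploy_app() {
  app_dir=$1
  python3 -m venv "$app_dir/venv" || return 1
  "$app_dir/venv/bin/pip" install -r "$app_dir/requirements.txt"
}

# Example call (not executed here, since it needs access to the package sources):
# deploy_app /opt/myapp
```

Users never have to activate anything if the application is started through the environment's own interpreter, for example `/opt/myapp/venv/bin/python`, where `/opt/myapp` is again just a placeholder path.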