Python virtual environments and space management (Pipenv in particular)

I have recently grown interest in learning how to use virtual environments with Python.
As you probably already know, they are useful when you need multiple versions of the same package. As far as I understand, with pip alone you can't keep different versions side by side, since packages are identified only by name.
I will take as an example Pipenv, which seems to be a new powerful tool also announced as the new standard by the PyPA. I fairly understand what, how, and why Pipenv does the (basic) things. What I don't understand (or better, what puzzles me) is how Pipenv (or any virtual environment tool in Python for what I know) manages space on the disk.
With pip you usually install packages in one place and then simply import them in your code; that's it.
With Pipenv (or similar tools) you create a virtual environment in which everything is installed and which is isolated from the outside world (which is kind of the point, I know).
Now let's suppose I am working on ProjectA, then on ProjectB. Both will have their own environment (somewhere within ~/.virtualenvs, for Pipenv).
Let's also suppose that even if the two projects have different high-level dependencies, they have one sub-dependency in common. I mean, same name, same version.
When I run "pipenv install thatpackage" in each of the two projects, it will be downloaded and stored separately in each environment. Am I correct?
If I'm right, isn't this a waste of space? I would have two copies of the same package on my disk, and if this is repeated for many packages you can guess how much space is wasted when working on many different projects.
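To make the question concrete, here is roughly what I mean (requests is just a stand-in for any shared sub-dependency, and pipenv --venv simply prints each project's environment path):
cd ProjectA
pipenv install requests        # creates ProjectA's own virtualenv and installs requests into it
pipenv --venv                  # prints the path of ProjectA's environment
cd ../ProjectB
pipenv install requests        # the download may come from pip's cache, but the files are installed again
pipenv --venv                  # a different path; requests now exists on disk twice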

Related

What is under the hood of a venv virtual environment?

Lately I have started to use venv virtual environments for my development (before that I was just using Docker images and conda environments).
However, I notice that a separate virtual environment is created for each codebase you work on.
My question is: isn't that wasteful?
I mean, if we have 20 repos of code and they all need opencv, does having 20 virtual environments mean installing opencv 20 times?
What is under the hood of the virtual environment practice?
There's a classic trade-off involved here. YES, the liberal use of virtualenvs requires more disk space... but these days, disk space is super cheap. The common consensus (which, if you're an old timer like me, you would have come to on your own by now) is that the benefits of having a separate virtualenv for each of your projects VASTLY outweigh the downside of using some extra disk space. Of course, you can get by with fewer or even no virtualenvs.
I used to try to get by with fewer virtualenvs than projects. But that sometimes led to problems when I went to package up and distribute a particular project. Only if you have one virtualenv per project can you be 100% sure that the runtime environment you use for testing your code will be identical to the one you describe in a requirements.txt file when you distribute your app.
PS: I just bought a refurbished 4TB hard drive on Amazon to use for backups. It cost me $38, shipped to me in one day! And 1TB SSDs can now be had for about $75. It's amazing how cheap "disk" space is these days.
My question is isn't that wasteful?
If you have N projects on the same machine and these projects use different versions of most of the libraries they import, venv is a good way to guarantee that every project keeps working as expected. Otherwise (if you have only a single Python installation) you have to test all N projects for compatibility with that one set of versions. Which is more wasteful: spending the time to check every library for compatibility with the current versions, or simply installing the multiple library versions that have been checked and recommended by each library's provider?
I mean if we have 20 repos of code, and they all need opencv, having 20 virtual environments make it install opencv 20 times?
Actually you don't need to create a venv for every single repo (project). It is just a folder and may be created outside the source code directory (for example, I use pyenv with the virtualenv plugin, which lets me switch between envs independently of the repo). So if you have 10 repos using opencv x.x and 10 repos using opencv y.y, you only need two venvs, one for x.x and one for y.y.
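A rough sketch of that workflow with the pyenv virtualenv plugin (the Python version, environment name, and repo names are only illustrative):
pyenv virtualenv 3.10.11 opencv-env    # create one named environment
pyenv activate opencv-env
pip install opencv-python              # install opencv once, into that environment
cd repo1 && pyenv local opencv-env     # .python-version makes this repo use opencv-env automatically
cd ../repo2 && pyenv local opencv-env  # the second repo reuses the same env; nothing is reinstalled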
What is under the hood of the virtual environment practice?
In these times of automating every process, a venv is a simple way to be sure that an automated process uses the specified versions of libraries and that there are no version conflicts on a single worker (server). A venv per project lets you create and destroy a Python environment at any time, without worrying whether it has been updated somewhere outside, or updated at all. You don't need to maintain the environment or keep it fresh after creating it: just create → use → remove. Of course, if a couple of processes use the same library versions and start / stop at the same time (so the venv could be deleted at the same time), it is better to use a single environment for those projects.
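The create → use → remove cycle might look like this (worker.py is just a placeholder for whatever the automated process runs):
python -m venv .venv                  # create a throwaway environment
source .venv/bin/activate
pip install -r requirements.txt       # pinned versions guarantee the expected libraries
python worker.py                      # run the job
deactivate
rm -rf .venv                          # destroy the environment; nothing to maintain afterwards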

Managing Multiple Python installations

Much modern software depends on Python and, as a consequence, installs its own version of Python with the libraries needed for that particular software to work properly.
In my case, I have my own Python that I installed intentionally via the Anaconda distribution, but I also have the ones that came with ArcGIS, QGIS, and others.
I have difficulty telling which Python I am updating or adding libraries to when I reach them from the command line, and these are not environments but rather full, separate Python installations.
What is the best way to tackle that?
Is there a way to force new software to create new environments within one central Python distribution, instead of losing track of all the copies on my computer?
Note that I am aware that QGIS can be downloaded now through conda, which reduces the size of my problem, but doesn't completely solve it. Moreover, that version of QGIS comes with its own issues.
Thank you very much.
As Nukala suggested, that's exactly what virtual environments are for. A virtual environment contains a particular version of a Python interpreter and a set of libraries to be used by a single (or sometimes multiple) project. If you use an IDE such as PyCharm, it handles the venvs for you automatically.
You can use pyenv to manage python versions in your system. Using pyenv you can easily switch between multiple versions.
And as suggested, each project can have its own virtual environment. You have multiple options here: venv, virtualenv, virtualenvwrapper, pipenv, poetry, and so on.
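As a rough sketch (the version number is only illustrative), combining pyenv with the built-in venv module might look like:
pyenv install 3.11.4        # install the interpreter version you want
pyenv local 3.11.4          # pin it for the current project directory
python -m venv .venv        # create the project's own environment on top of that interpreter
source .venv/bin/activate   # work inside the environment from here on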

Why and when to use a new anaconda environment? (One environment for everything? A new environment for every project?)

I have a little bit of experience using Anaconda and am about to transition to using it much more, for all of my Python data work. Before embarking on this I have what feels like a simple question: "when should I use a new environment?"
I cannot find any good, practical advice on this on StackOverflow or elsewhere on the web.
I understand what environments are and what their benefits are, and that if I am working on a project that depends on a specific version of a library (different from, say, the latest version), then virtual environments are the answer; but I am looking for advice on how to approach their use practically in my day-to-day work on different data projects.
Logically there appears to be (at least) two approaches:
Use one environment until you absolutely need a separate environment for a specific project
Use a new environment for every single project
I can see some pros and cons to each approach and am wondering if there is any best practice that can be shared.
If I should use just one environment until I need a second one, should I just use the default "root" environment and load all my required dependent libraries into that or is it best to start off with my own environment that is called something else?
An answer to this question "why create new environment for install" by #codeblooded gives me some hints as to how to use and approach conda environments and suggests a third way,
Create new environments on an as-needed basis; projects do not "live" inside environments but use them at runtime. You will end up with as many virtual environments as you need to run the projects you regularly use on that machine; that may be just one environment, or it may be more.
Anyway, you can see that I am struggling to get my head around this, any help would be greatly appreciated. Thank you!
As a developer who works with data scientists, I would strongly recommend creating an environment for each project. The benefit of Python environments is that they encapsulate the requirements of a project and isolate it from all other Python projects.
In the case above, if you were to use Python36 for 8 different projects it would be very easy to accidentally upgrade a package or install a conflicting package that breaks other projects without you realising it.
In the work you do it might not be a big deal, but given how easy it is to create a separate environment for each project, the benefits outweigh the small time cost.
I can tell you that if any of the developers I work with was found to be using a single python environment for multiple development projects they would be instructed to stop doing that immediately.
Ok, I think I worked this one out for myself. Seems kind of obvious now.
You do NOT need to create an environment for every project.
However if particular projects require particular versions of libraries, a particular version of Python etc then you can create a virtual environment to capture all of those dependencies.
To give an example,
let's say you are working on a project that requires a library that has a dependency on a particular version of Python e.g. Python 3.6,
and your base (root) environment is Python 3.7,
you would create a new anaconda environment configured to use Python 3.6 (maybe call it "Python36")
and you would install all the required libraries in that environment and you would use that environment when running that project.
When you have another project that requires similar libraries, you may re-use your existing Python 3.6 environment (named "Python36") to run this new project;
you would not have to create a new Python 3.6 environment, in the same way that you would not have to install multiple instances of Python 3.6 in order to run multiple projects that require Python 3.6.
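As a sketch with conda (numpy is just an example of a required library), creating and re-using that environment looks roughly like this:
conda create --name Python36 python=3.6   # create the environment once
conda activate Python36
conda install numpy                       # install whatever the projects need
# any later project that needs Python 3.6 simply runs "conda activate Python36" again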

Do multiple, related Python projects need their own virtual environment?

I have two, related Python projects:
../src/project_a
../src/project_a/requirements.txt
../src/project_a/project_a.py
../src/project_b
../src/project_b/requirements.txt
../src/project_b/project_b.py
Both projects use the same version of Python. The respective requirements.txt files are similar but not identical.
Do I create a separate virtual environment for each project or can I create a "global" virtual environment at the ../src level?
Note: I'm obviously new to using virtual environments.
Virtual environments are meant to keep things isolated from each other.
If one project is a dependency of the other one, then they have to be installed in the same environment.
If two projects have dependencies that conflict with each other, then they have to be installed in different environments.
If two projects are meant to be run on different versions of the Python interpreter, then they have to be installed in different environments.
Those are basically the only rules I can think of. To me the rest is just a mix of best practices, personal opinions, common sense, technical limitations, and so on.
One could think of the pets vs. cattle analogy (again), for example. Virtual environments can be seen as throwaway things that are created on demand (automatically, with tools such as tox for example), which is easy once the dependencies are clearly written down (in requirements.txt, for example).
In your case, I would probably start with a single Python virtual environment, and only start creating more when the need arises. Most likely this will happen once the projects grow in size. And eventually it could become an absolute necessity once a project requires specific versions of dependencies that conflict with the dependencies of the other.
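As a sketch, that single shared environment for the layout above could be set up like this (paths are taken from the question; it works as long as the two requirements.txt files don't conflict):
cd src
python -m venv .venv                        # one environment shared by both projects
source .venv/bin/activate
pip install -r project_a/requirements.txt
pip install -r project_b/requirements.txt   # fine as long as nothing here conflicts with project_a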

How to maintain long-lived python projects w.r.t. dependencies and python versions?

short version: how can I get rid of the multiple-versions-of-Python nightmare?
long version: over the years, I've used several versions of python, and what is worse, several extensions to python (e.g. pygame, pylab, wxPython...). Each time it was on a different setup, with different OSes, sometimes different architectures (like my old PowerPC mac).
Nowadays I'm using a Mac (OS X 10.6 on x86-64) and it's a dependency nightmare each time I want to revive a script older than a few months. Python itself already comes in three different flavours in /usr/bin (2.5, 2.6, 3.1), but I had to install 2.4 from MacPorts for pygame, and something else (I cannot remember what) forced me to install all three others from MacPorts as well, so at the end of the day I'm the happy owner of seven (!) instances of Python on my system.
But that's not the problem. The problem is that none of them has the right (i.e. the same set of) libraries installed, some of them are 32-bit, some 64-bit, and now I'm pretty much lost.
For example, right now I'm trying to run a three-year-old script (not written by me) which used matplotlib/numpy to draw a real-time plot within a rectangle of a wxWidgets window. But I'm failing miserably: py26-wxpython from MacPorts won't install, stock Python has wxWidgets included but also has some conflict between 32 bits and 64 bits, and it doesn't have numpy... what a mess!
Obviously, I'm doing things the wrong way. How do you usually cope with all that chaos?
I solve this using virtualenv. I sympathise with wanting to avoid further layers of nightmare abstraction, but virtualenv is actually amazingly clean and simple to use. You literally do this (command line, Linux):
virtualenv my_env
This creates a new python binary and library location, and symlinks to your existing system libraries by default. Then, to switch paths to use the new environment, you do this:
source my_env/bin/activate
That's it. Now if you install modules (e.g. with easy_install), they get installed to the lib directory of the my_env directory. They don't interfere with existing libraries, you don't get weird conflicts, stuff doesn't stop working in your old environment. They're completely isolated.
To exit the environment, just do
deactivate
If you decide you made a mistake with an installation, or you don't want that environment anymore, just delete the directory:
rm -rf my_env
And you're done. It's really that simple.
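A quick way to double-check that you are really inside the new environment:
which python                              # should now point at my_env/bin/python
python -c "import sys; print(sys.prefix)" # prints the my_env directory, not the system prefix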
virtualenv is great. ;)
Some tips:
on Mac OS X, use only the python installation in /Library/Frameworks/Python.framework.
whenever you use numpy/scipy/matplotlib, install the Enthought Python Distribution
use virtualenv and virtualenvwrapper to keep those "system" installations pristine; ideally use one virtual environment per project, so each project's dependencies are fulfilled. And, yes, that means potentially a lot of code will be replicated in the various virtual envs.
That seems like a bigger mess indeed, but at least things work that way. Basically, if one of the projects works in a virtualenv, it will keep working no matter what upgrades you perform, since you never change the "system" installs.
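With virtualenvwrapper, the one-virtualenv-per-project workflow mentioned in the tips boils down to a few commands (the environment name and packages are only examples):
mkvirtualenv myproject                 # create a managed environment
workon myproject                       # switch back to it later, from any directory
pip install numpy scipy matplotlib     # each env gets its own copies, as noted above
deactivate
rmvirtualenv myproject                 # remove it when the project is retired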
Take a look at virtualenv.
What I usually do is try to (progressively) keep up with the Python versions as they come along (once all of the external dependencies have compatible versions available).
Most of the time the Python code itself can be transferred as-is, with only minor modifications needed.
My biggest Python project at work (15,000+ LOC) has now been on Python 2.6 for a few months (upgrading everything from Python 2.5 took most of a day, due to installing / checking 10+ dependencies...).
In general I think this is the best strategy for most of the interdependent components in the free software stack (think of the dependencies in the Linux software repositories): keep your versions (semi-)current, or at least progressing at the same pace.
Install the Python versions you need, preferably from source.
When you write a script, put the full Python version into its shebang line (such as #!/usr/local/bin/python2.6).
I can't see what could go wrong.
If something does, it's probably MacPorts' fault anyway, not yours (one of the reasons I don't use MacPorts anymore).
I know I'm probably missing something and this will get downvoted, but please leave at least a little comment in that case, thanks :)
I use the MacPorts version for everything, but as you note a lot of the default versions are bizarrely old. For example, vim omnicomplete in Snow Leopard has python25 as a dependency. A lot of Python-related ports have old dependencies, but you can usually flag the newer version at build time, for example port install vim +python26 instead of port install vim +python. Do a dry run before installing anything to see if you are pulling in, for example, the whole of python24 when it isn't necessary. Check portfiles often, because the naming conventions from when DarwinPorts was getting off the ground leave something to be desired. In practice I just leave everything in the default /opt... folders of MacPorts, including a copy of the entire framework with duplicates of PyObjC, etc., and just stick with one version at a time, retaining the option to return to the system default if things break unexpectedly. Which is all perhaps a bit too much work to avoid using virtualenv, which I've been meaning to get around to using.
I've had good luck using Buildout. You set up a list of which eggs and which versions you want. Buildout then downloads and installs private versions of each for you. It makes a private "python" binary with all the eggs already installed. A local "nosetests" makes things easy to debug. You can extend the build with your own functions.
On the down side, Buildout can be quite mysterious. Do "buildout -vvvv" for a while to see exactly what it's doing and why.
http://www.buildout.org/docs/tutorial.html
At least under Linux, multiple pythons can co-exist fairly happily. I use Python 2.6 on a CentOS system that needs Python 2.4 to be the default for various system things. I simply compiled and installed python 2.6 into a separate directory tree (and added the appropriate bin directory to my path) which was fairly painless. It's then invoked by typing "python2.6".
Once you have separate pythons up and running, installing libraries for a specific version is straightforward. If you invoke the setup.py script with the python you want, it will be installed in directories appropriate to that python, and scripts will be installed in the same directory as the python executable itself and will automagically use the correct python when invoked.
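For example (mylib is a placeholder for whatever library you are installing):
python2.6 setup.py install      # installs into python2.6's own site-packages
python2.6 -c "import mylib"     # quick sanity check that this interpreter now sees the library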
I also try to avoid using too many libraries. When I only need one or two functions from a library (eg scipy), I'll often see if I can just copy them to my own project.
