How to configure Python in a GPU cluster?

I have a GPU cluster with one storage node and several computing nodes, each with 8 GPUs. I am configuring the cluster.
One of the tasks is to configure Python: we need several versions of Python and a number of Python packages, and for some packages we may require several versions, such as different versions of TensorFlow.
So the question is how to configure Python and the packages so that it is convenient to use whichever version of a package I want.
I have installed both Python 2.7 and Python 3.6 on each computing node and on the storage node. But I don't think this is a good approach when there is a huge number of computing nodes to configure. One solution is to install Python in a shared directory of the cluster instead of the default /usr/local path.
Does anyone have a better way to do this?
What I use now is OpenPBS (Torque), and I am new to HPC.
Thanks a lot.

You can install the Environment Modules software in a shared directory accessible from every node. Then it is easy to load a specific version of Python or TensorFlow:
module load lang/Python/3.6.0
module load lib/Tensorflow/1.1.0
Then, if you require several versions of some packages, have a look at Python virtualenv, which lets you install different versions of the same package in separate environments. To share them across all the nodes, consider creating your virtualenvs on a shared mount point.
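For illustration, a minimal sketch of what that could look like, assuming a hypothetical shared mount point /shared/envs and the module shown above (paths and version numbers are only placeholders; pick a Python/TensorFlow combination with available wheels):

module load lang/Python/3.6.0
# create the virtualenv on the shared filesystem so every node sees the same environment
python3 -m venv /shared/envs/tf-1.1
source /shared/envs/tf-1.1/bin/activate
# pin the TensorFlow version this environment provides
pip install tensorflow==1.1.0

Each combination of Python and package versions then gets its own directory under the shared mount, so switching versions is just a matter of activating a different environment.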

You could install each piece of software on the storage node under a certain directory and mount that directory on the compute nodes. Then you don't have to install each piece of software several times.
A common solution to this problem is Environment Modules. You install your software as a module. This means that the software is installed in a certain directory (e.g. /opt/modules/python/3.6/) together with a modulefile. When you do module load python/3.6, the modulefile sets environment variables so that Python 3.6 is on PATH, PYTHONPATH is set accordingly, etc.
This results in a nice separation of your software stack and also lets you install newer versions of TensorFlow without messing up the environment.
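As a rough illustration of the day-to-day usage on a compute node (the /opt/modules paths below are just the example locations from above, not a required layout):

module use /opt/modules/modulefiles   # make the shared modulefile tree visible
module avail python                   # list the Python versions that are installed
module load python/3.6                # put Python 3.6 on PATH for this shell
which python                          # should now resolve under /opt/modules/python/3.6/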

Related

Managing Multiple Python installations

Much modern software depends on Python and, as a consequence, installs its own version of Python with the libraries each particular program needs to work properly.
In my case, I have my own Python that I downloaded intentionally using the Anaconda distribution, but I also have the ones that came with ArcGIS, QGIS, and others.
I have difficulty distinguishing which Python I am updating or adding libraries to when reaching them from the command line, and these are not environments but rather full Python installations.
What is the best way to tackle that?
Is there a way to force new software to create new environments within one central Python distribution, instead of losing track of all the copies existing on my computer?
Note that I am aware that QGIS can now be downloaded through conda, which reduces the size of my problem but doesn't completely solve it. Moreover, that version of QGIS comes with its own issues.
Thank you very much.
As Nukala suggested, that's exactly what virtual environments are for. A virtual environment contains a particular version of a Python interpreter and a set of libraries to be used by a single (or sometimes multiple) project. If you use an IDE such as PyCharm, it handles the venvs for you automatically.
You can use pyenv to manage the Python versions on your system. Using pyenv you can easily switch between multiple versions.
And as suggested, each project can have its own virtual environment. You have multiple options here: venv, virtualenv, virtualenvwrapper, pipenv, poetry, etc.
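A minimal sketch of how pyenv and the built-in venv module can be combined, assuming pyenv is already installed (version numbers and package names are only placeholders):

pyenv install 3.6.8            # install a specific interpreter version
pyenv local 3.6.8              # pin that interpreter for the current project directory
python -m venv .venv           # create a per-project virtual environment with it
source .venv/bin/activate
pip install numpy pandas       # install whatever this particular project needs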

Trouble setting up a PBS Cluster with dask that finds my own modules

I am running into some errors when trying to set up my own client using a dask-jobqueue PBSCluster instead of the default local cluster (i.e., client = Client()).
With the default local cluster my own modules were recognized, but I realized the workers in the PBSCluster could not find them. This page and other research were helpful in understanding what I might be able to do.
I organized my modules into a package and used pip install -e . since I'll still be developing it. I confirmed my Python environment's site-packages directory has my package (via an .egg-link file).
I hoped installing the package would make my modules available, but I received the same error when I ran my code after setting up a basic PBSCluster:
from dask_jobqueue import PBSCluster
from dask.distributed import Client

cluster = PBSCluster(cores=x, memory=y)
cluster.scale(n)
client = Client(cluster)
Is my basic idea of installing the modules as a package not enough?
I looked into client.upload_file based on this answer as another means to make the reference to my module file explicit. Will I still need to do something like this to install the modules directly on the workers?
Apologies for the length; I am very new to both dask and operating on an HPC system.
Thanks for any help.
First, just a sanity check: When using an HPC cluster, there is typically a shared filesystem, which all workers can access (and so can your client machine). Is that the case for your cluster? If so, make sure your conda environment is in a shared location that all workers can access.
I organized my modules into a package and used pip install -e .
That should work, as long as your source code is also on the shared filesystem. The directory pointed to by the .egg-link file should be accessible from the worker machines. Is it?
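If a shared filesystem really is not an option, the client.upload_file route mentioned in the question can ship individual module files (or a .zip/.egg built from the package) to every worker. A hedged sketch, with arbitrary example resources in place of the question's placeholders and a hypothetical file name:

from dask_jobqueue import PBSCluster
from dask.distributed import Client

# arbitrary example resources; adjust to the real queue limits
cluster = PBSCluster(cores=4, memory="16GB")
cluster.scale(2)
client = Client(cluster)

# send a single module file to all current and future workers;
# a .zip or .egg built from the whole package can be uploaded the same way
client.upload_file("my_module.py")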

Is it possible to use a Conda environment as a "virtualenv" for a Hadoop Streaming job (in Python)?

We are currently using Luigi, MRJob and other frameworks to run Hadoop streaming jobs with Python. We are already able to ship each job with its own virtualenv so that no specific Python dependencies need to be installed on the nodes (see the article). I was wondering if someone has done something similar with the Anaconda/conda package manager.
P.S. I am also aware of Conda-Cluster; however, it looks like a more complex/sophisticated solution (and it is behind a paywall).
Update 2019:
The answer is yes, and the way to do it is with conda-pack:
https://conda.github.io/conda-pack/
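A minimal sketch of the conda-pack workflow described in its documentation (the environment name my_env is a placeholder):

# on a machine that already has the environment
conda pack -n my_env -o my_env.tar.gz

# on the target node: unpack and activate, no conda installation required there
mkdir -p my_env
tar -xzf my_env.tar.gz -C my_env
source my_env/bin/activate
conda-unpack        # rewrites the hard-coded path prefixes inside the environment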
I don't know of a way to package a conda environment into a tar/zip, untar it on a different box, and have it ready to use like in the example you mention; that might not be possible. At least not without Anaconda on all the worker nodes, and there may also be issues when moving between different operating systems.
Anaconda Cluster was created to solve that problem (disclaimer: I am an Anaconda Cluster developer), but it uses a more complicated approach: basically we use a configuration management system (Salt) to install Anaconda on all the nodes of the cluster and control the conda environments.
We use a configuration management system because we also deploy the Hadoop stack (Spark and its friends) and we need to target big clusters, but in reality, if you only need to deploy Anaconda and don't have too many nodes, you should be able to do that with just Fabric (which Anaconda Cluster also uses in some parts) and run it from a regular laptop.
If you are interested Anaconda Cluster docs are here: http://continuumio.github.io/conda-cluster/

Bundle virtual/conda environment for EMR bootstrap

I have been using Anaconda Python a lot but have also made many package upgrades (like pandas). I've written some tools that I want to turn into a MapReduce job, and I've researched how to go about Python EMR bootstrapping for package dependencies.
I thought about a possible workaround: just downloading and installing the Anaconda distribution. But then I remembered that I'd have to do all the necessary upgrading.
My last effort at possibly making this easy is this question: is there a way to "rebundle" the upgraded Anaconda (or one of its environments) so that it can be stored on S3 and used in the EMR bootstrap action?
Thanks for any help!
ADDED: I suppose it would require a license to be able to wrap up an Anaconda distro like this and use it on various machines, be they in my office network or on AWS. Here's an open-source version of this question (I just learned the main package manager of the Anaconda distro is actually open source):
Suppose I have a virtual (or conda) environment running with various modules and extensions installed. What is the proper way, if any, to encapsulate/bundle this virtual environment so that I can efficiently deploy it as needed? I have come across 'pip bundle', and there are 'conda clone' and 'conda create' as well. Also, there is the concept of conda channels. It's just not clear to me whether I can put these together for efficient deployment on EMR, and if so, how.
The license allows you to do this, if that's what you're asking.
You might also look at http://continuum.io/anaconda-cluster and http://continuumio.github.io/conda-cluster/.
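For reference, a hedged sketch of the 'conda clone' / 'conda create' route mentioned in the question, which rebuilds the upgraded environment from a spec rather than shipping the binaries themselves (environment names are placeholders):

# snapshot the upgraded environment as a portable spec file
conda env export -n my_upgraded_env > environment.yml

# recreate it elsewhere, e.g. from an EMR bootstrap script after installing Miniconda
conda env create -f environment.yml

# or clone it locally before experimenting with further upgrades
conda create --name my_env_copy --clone my_upgraded_env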

Does Python have a versioned dependency management solution?

Does Python have anything similar to apt or Maven where a single repository can house different versions of a library as opposed to just the current version?
For example: My site-packages folder does not group libraries by version. So instead of:
/Library/Python/2.7/site-packages/tox/1_2_3
We have:
/Library/Python/2.7/site-packages/tox
...which presumably contains the latest version of tox, which may or may not be compatible with every piece of software on my system that wants to use tox. Is there a versioned approach to this? If not, is it possible to create one?
No, there is no way to install multiple versions of a package in the same environment, nor to import multiple versions in the same process. The commonly accepted way to handle specific versions for separate projects is to set up a virtualenv for each project and install the specific requirements there.
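For example, a minimal per-project setup that pins a tox version without touching the global site-packages (the version number just mirrors the question's example path; substitute a real release):

python -m venv .venv                 # per-project environment
source .venv/bin/activate
pip install "tox==1.2.3"             # this version is isolated to the project
pip freeze > requirements.txt        # record the pinned versions for reproducibility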
