I come from a sort of HPC background and I am just starting to learn about machine learning in general and TensorFlow in particular. I was initially surprised to find out that distributed TensorFlow is designed to communicate over TCP/IP by default, though it makes sense in hindsight given what Google is and the kind of hardware it uses most commonly.
I am interested in experimenting with running TensorFlow in parallel with MPI on a cluster. From my perspective, this should be advantageous because latency should be much lower thanks to MPI's use of Remote Direct Memory Access (RDMA) across machines without shared memory.
So my question is: why doesn't this approach seem to be more common, given the increasing popularity of TensorFlow and machine learning? Isn't latency a bottleneck? Is there some typical problem being solved that makes this sort of solution impractical? Are there likely to be any meaningful differences between calling TensorFlow functions in a parallel way and implementing MPI calls inside the TensorFlow library?
Thanks
It seems tensorflow already supports MPI, as stated at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/mpi
MPI support for tensorflow was also discussed at https://arxiv.org/abs/1603.02339
Generally speaking, keep in mind that MPI is best at sending and receiving messages, but not so great at sending notifications and acting upon events.
Last but not least, MPI support for multi-threaded applications (e.g. MPI_THREAD_MULTIPLE) has not always been production-ready across MPI implementations.
These are two general statements, and I honestly do not know whether they are relevant to TensorFlow.
According to the docs in the TensorFlow git repo, TF actually uses the gRPC library by default, which is based on the HTTP/2 protocol (itself layered on top of TCP), and this paper should give you some insight. Hope this information is useful.
I'm using Pyomo library to solve an optimization problem with Gurobi optimizer. I have an academic license for Gurobi, but afaik this should not pose any kind of limitation. I'd like to optimize the model with different parameters in parallel. In other words, I have many instances of the same model but different parameters that could be solved independently.
I've followed the instructions that I found on the documentation page of Pyomo regarding Pyro and dispatching (here the reference). Basically, I started the Pyomo name server with pyomo_ns, then the dispatch server with dispatch_srvr, and finally I launched four instances of pyro_mip_server.
When I launch my Python script, it works; no errors occur. The problem is that it takes the same amount of time as the serial execution. I've also monitored the activity of all eight cores that my CPU has, and only one is constantly at 100% load. It's like no concurrent execution is happening at all.
I'm using SolverManager provided by Pyomo library to submit from the Python script different instances of my model. You can find here the guide of how to use it.
I'm on a Linux machine, the latest LTS Ubuntu distro.
Has somebody had such an experience, or does someone know what the problem is here? If you need any additional info, just let me know :). Thank you!
Horovod combines NCCL and MPI into a wrapper for distributed deep learning in frameworks such as TensorFlow.
I hadn't heard of NCCL previously and was looking into its functionality. The following is stated about NCCL on the NVIDIA website:
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.
From the introduction video about NCCL, I understood that NCCL works via PCIe, NVLink, native InfiniBand, and Ethernet, and that it can even detect whether GPUDirect RDMA makes sense in the current hardware topology and use it transparently.
So I am wondering why MPI is needed in Horovod. As far as I understand, MPI is also used for efficiently exchanging the gradients among distributed nodes via an allreduce paradigm. But as I understand it, NCCL already supports those functionalities.
So is MPI only used for easily scheduling the jobs on a cluster, and for distributed deep learning on CPUs, since we cannot use NCCL there?
I would highly appreciate it if someone could explain in which scenarios MPI and/or NCCL are used for distributed deep learning, and what their responsibilities are during the training job.
MPI (Message Passing Interface) is a message-passing standard used in parallel computing (Wikipedia). Most of the time, you'd use Open MPI when using Horovod, which is an open-source implementation of the MPI standard.
An MPI implementation allows one to easily run more than a single instance of a program in parallel: the program code stays the same but runs in several different processes. In addition, the MPI library exposes an API to easily share data and information among these processes.
Horovod uses this mechanism to run several processes of the Python script that runs the neural network. These processes need to know and share some information while the neural network is running. Some of this information is about the environment, for example:
The number of processes currently running, so that parameters and hyperparameters of the neural network, such as batch size and learning rate, can be adjusted correctly.
Knowing which process is the "master" one, so that logs are printed and files (checkpoints) are saved from only a single process.
The id (called "rank") of the current process, so it can use a specific slice of the input data.
Some of this information is about the training process of the neural network, for example:
The randomized initial values for the weights and biases of the model, so all processes will start from the same point.
The values of the weights and biases at the end of every training step, so all processes will start the next step with the same values.
There is more information that is shared; the bullets above are just some examples.
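The per-rank input split mentioned above (each process working on its own slice of the data, based on its rank) can be sketched without any MPI machinery at all. The sketch below is illustrative only; `shard_for_rank` is a hypothetical helper, not part of Horovod or MPI:

```python
def shard_for_rank(data, rank, size):
    # Every process takes each size-th element, starting at its own
    # rank, so the shards are disjoint and together cover all of data.
    return data[rank::size]

samples = list(range(8))
shards = [shard_for_rank(samples, r, size=4) for r in range(4)]
print(shards)  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

In a real Horovod/MPI job, `rank` and `size` would come from the communicator rather than being passed in by hand.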
At first, Horovod used MPI for all the requirements above. Then, Nvidia released NCCL, a library that provides many high-performance communication algorithms between GPUs. To improve overall performance, Horovod started using NCCL for things like the last two bullets (sharing the initial weight values and, mainly, the weight values after every training step), as NCCL allows sharing this data between GPUs much more efficiently.
In Nvidia docs we can see that NCCL can be used in conjunction with MPI, and in general:
MPI is used for CPU-CPU communication, and NCCL is used for GPU-GPU communication.
Horovod still uses MPI for launching the multiple instances of the Python script and managing the environment (rank, size, which process is the "master", etc.), allowing the user to easily manage the run.
Horovod used MPI exclusively in the beginning.
After NCCL was introduced to Horovod, MPI is still used, even in NCCL mode, for providing environment info (rank, size, and local_rank). The NCCL docs have an example that shows how NCCL leverages MPI in a one-device-per-process setting:
The following code is an example of a communicator creation in the context of MPI, using one device per MPI rank.
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/examples.html#example-2-one-device-per-process-or-thread
I am trying to use ArrayFire Python (https://github.com/arrayfire/arrayfire-python) for multi-GPU programming.
However, when I try to interface it with the concurrent futures (https://docs.python.org/3/library/concurrent.futures.html) library, I run into synchronization issues.
Does anyone have inputs on how to use arrayfire-python to parallel process on multiple GPUs?
ArrayFire allows multi-GPU programming but does not distribute the workload automatically. It is up to the user to decide which memory and functions run on which device.
ArrayFire as it stands now is NOT thread safe. Hence running anything on multiple threads can cause issues.
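Given that thread-safety limitation, one common workaround is to funnel all GPU calls through a single dedicated worker thread and feed it work via a queue. The stdlib-only sketch below shows the pattern under stated assumptions: `payload * 2` is a hypothetical stand-in for real ArrayFire work, and the device-selection call is shown only as a comment:

```python
import queue
import threading

def gpu_worker(tasks, results):
    # One thread owns every GPU call, sidestepping thread-safety issues.
    while True:
        item = tasks.get()
        if item is None:  # sentinel: shut the worker down
            break
        device, payload = item
        # af.set_device(device)  # would select the GPU here (hypothetical)
        results.put((device, payload * 2))  # stand-in for ArrayFire work

tasks, results = queue.Queue(), queue.Queue()
worker = threading.Thread(target=gpu_worker, args=(tasks, results))
worker.start()
for job in [(0, 10), (1, 20)]:  # (device id, work item)
    tasks.put(job)
tasks.put(None)
worker.join()
print(sorted(results.queue))  # [(0, 20), (1, 40)]
```

Other threads only ever touch the queues, never the GPU library itself.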
Disclosure: I am a developer for ArrayFire.
I have written a Python code to carry out genetic algorithm optimization, but it is too slow. I would like to know how to run it in parallel, making use of multiple CPUs.
For more clarity: my code calls another Python script, say, 100 times one after the other. I want to divide this between 4 CPUs, so that each CPU runs the outside script 25 times, thereby increasing the speed.
It's highly appreciated if someone can help me with this.
Thanks in advance!
There are several packages that provide parallel computing for Python. I am the author of a package called pathos, which provides parallel computing with several parallel backends and gives them a common API. pathos provides parallel pipes and maps for multi-process, multi-threaded, socket-based, and MPI-parallel computing, as well as interactions with schedulers and over ssh. pathos relies on several packages, which you can pick from if you don't want all the different options.
pathos uses pyina, which in turn uses mpi4py. mpi4py provides bindings to MPI, but you can't run the code from Python 'normally'… you need to launch it with whatever you use to run MPI. pyina enables you to run mpi4py code from normal Python and to interact with schedulers. Plus, pyina uses dill, which can serialize most Python objects, so you are much better able to send what you want across processes.
pathos provides a fork of multiprocessing that also plays well with dill and pyina. Using both can enable you to do hierarchical parallel computing -- like launching MPI-parallel jobs that then spawn multiprocess or multithreaded parallel work.
pathos also uses ppft, which is a fork of pp (Parallel Python), which provides parallel computing across sockets -- so that means you can connect a parallel map across several machines.
There are alternatives to pathos, such as IPython-parallel. However, its ability to use MPI is very new, and I don't know how capable it is yet. It may or may not leverage IPython-cluster-helper, which has been in development for a little while. Note that IPython doesn't use pp; it uses zmq instead, and IPython also provides connectivity to EC2 if you like cloud stuff.
Here are some relevant links:
pathos, pyina, dill, ppft: https://github.com/uqfoundation
IPython-cluster-helper: https://github.com/roryk/ipython-cluster-helper
IPython: http://ipython.org/ipython-doc/dev/parallel/parallel_mpi.html
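If the goal is simply to split 100 independent runs across 4 CPUs, even the standard-library multiprocessing module is enough. A minimal sketch follows; `solve_instance` is a hypothetical stand-in for invoking the external script (a real version might call subprocess.run there):

```python
from multiprocessing import Pool

def solve_instance(params):
    # Hypothetical stand-in for calling the external Python code;
    # here it just squares its input so the sketch is self-contained.
    return params * params

if __name__ == "__main__":
    # 4 worker processes share the 100 independent runs among them.
    with Pool(processes=4) as pool:
        results = pool.map(solve_instance, range(100))
    print(results[:3])  # [0, 1, 4]
```

pathos's `ProcessPool` offers the same map-style interface but with dill-based serialization, so it copes with more kinds of Python objects than the stdlib version.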
I have started using Python in a real-time application (serial communication with two GPS modules at once), but have recently found out about Lua. Which language would be more suited to the application?
My definition of real-time in this context is the fastest possible time to receive, process and output the data. (Feedback system)
Both are fine languages. Neither should take you years to learn. An easy way to make the decision is to look at what modules are out there already.
For example, you mentioned that your application is related to GPS. Take a look at what libraries are already written to hook Python and Lua into your particular GPS hardware. Maybe someone's already done most of the hard work for you. If not, then go down a step. If you're talking to your GPS over an I2C link, look at I2C libraries in both languages. See which ones are more popular and better maintained.
That said, garbage collected languages have historically had problems with meeting real time requirements. Depending on yours, you may need to go with a lower level language. You should also ensure that whatever system you're running on will support your programming environment. I've worked with systems where Python would have been great but it doesn't fit in 5K of code space.
Take a look at eLua and see if it meets your needs.