Training Neural Networks on Hadoop Cluster - python

I have been studying neural networks for a few weeks. Even though I have always used R, the Keras library in Python has been really helpful for someone with a small programming background like me.
Keras is a very nice interface that allows the customization I need without ever invoking the backend directly, except for some custom loss metrics I used.
The hardware specification is just as straightforward: for example, you can switch from the CPU of the machine where Python+Keras is installed to that machine's (compatible) GPU, exploiting the strong parallelism of neural networks when training them.
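To give an idea of the kind of ease I mean: with the TensorFlow backend, moving between CPU and GPU is little more than setting an environment variable before the backend is imported. A rough sketch (the tiny model and random data below are just placeholders):

```python
import os

# Hiding all GPUs ("-1") forces the TensorFlow backend onto the CPU;
# setting "0" instead would select the first GPU. This must happen
# before the backend is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Placeholder data and model: the training code itself does not change
# when moving between CPU and GPU.
x = np.random.rand(256, 10)
y = np.random.rand(256, 1)

model = Sequential([Dense(32, activation="relu", input_shape=(10,)), Dense(1)])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=2, batch_size=32)
```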
I was wondering whether there is anything that allows you to switch to training neural networks on a Hadoop cluster with the same kind of ease.
Moreover, is there any open-source Hadoop cluster software available to do so?
Thank you for your help

Related

Artificial neural network performance degradation in other PCs

I am experimenting with designing a semantic segmentation network using PyTorch. It performs well on my computer. For better performance, we moved the network to a computer with much more GPU capacity. However, even when I match only the PyTorch and torchvision versions in the new environment and re-run the experiment, performance degrades on the new PC and on Google Colab. I just copied and pasted the code and ran it as a test before doing other experiments.
The network structure is the same, but are there other external factors that could degrade performance (e.g. GPU, RAM, etc.)?...

What is the difference between Tensorflow and Keras?

I am currently working with neural networks in Keras, and I know that it works with TensorFlow as the backend. I have the GPU version installed, but I don't know whether Keras uses the GPU or whether it is something completely separate from TensorFlow.
TensorFlow is a mid-level framework that performs operations on tensors. Keras is a high-level API that simplifies the creation and training of neural networks. Keras doesn't do any of the tensor ops itself; it delegates those to its backend, which is a mid-level framework of your choosing: TensorFlow, CNTK, or Theano. Each of those frameworks can be configured to do the tensor ops in whatever ways they can (as far as I am aware, each of them can use either CPUs or GPUs). Keras, however, doesn't really care how the ops get done. It just tells the backend to do them, and they get done.
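If you want to check whether the TensorFlow backend actually sees your GPU (assuming Keras is configured with the TensorFlow backend), a TF 1.x-style check looks roughly like this:

```python
import tensorflow as tf
from tensorflow.python.client import device_lib

# List every device TensorFlow can see; GPUs appear with device_type == "GPU"
# when the GPU build of TensorFlow plus the CUDA/cuDNN drivers are installed.
print(device_lib.list_local_devices())

# TF 1.x convenience check (deprecated/removed in later 2.x releases).
print("GPU available:", tf.test.is_gpu_available())
```

If a GPU shows up in that list, Keras models will run on it automatically, because the TensorFlow backend places ops on the GPU by default.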

TensorFlow Horovod: NCCL and MPI

Horovod combines NCCL and MPI into a wrapper for distributed deep learning in, for example, TensorFlow.
I haven't heard of NCCL previously and was looking into its functionality. The following is stated about NCCL on the NVIDIA website:
The NVIDIA Collective Communications Library (NCCL) implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs.
From the introduction video about NCCL I understood that NCCL works via PCIe, NVLink, Native Infiniband, Ethernet and it can even detect if GPU Direct via RDMA makes sense in the current hardware topology and uses it transparently.
So I am wondering why MPI is needed in Horovod. As far as I understand, MPI is also used for efficiently exchanging the gradients among distributed nodes via an allreduce paradigm. But as I understand it, NCCL already supports those functionalities.
Is MPI only used for easily scheduling jobs on a cluster, and for distributed deep learning on CPUs, since we cannot use NCCL there?
I would highly appreciate it if someone could explain in which scenarios MPI and/or NCCL is used for distributed deep learning, and what their responsibilities are during the training job.
MPI (Message Passing Interface) is a message-passing standard used in parallel computing (Wikipedia). Most of the time, when using Horovod you'd use Open MPI, an open-source implementation of the MPI standard.
An MPI implementation allows one to easily run more than a single instance of a program in parallel: the program code stays the same, but it runs in several different processes. In addition, the MPI library exposes an API to easily share data and information among these processes.
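As a toy illustration of that mechanism (plain mpi4py rather than Horovod; mpi4py and an MPI runtime are assumed to be installed), launching a script with mpirun gives every copy its own rank plus collective operations such as allreduce:

```python
# toy_allreduce.py -- run with: mpirun -np 4 python toy_allreduce.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # id of this process
size = comm.Get_size()   # total number of processes

# Each process contributes a local value; allreduce sums them and returns the
# same result to every process. This is the same pattern used for averaging
# gradients in distributed training.
local_value = rank + 1
total = comm.allreduce(local_value, op=MPI.SUM)

if rank == 0:  # the "master" process prints once
    print("sum over", size, "processes:", total)
```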
Horovod uses this mechanism to launch several processes of the Python script that runs the neural network. These processes need to know and share some information while the neural network is running. Some of this information is about the environment, for example:
The number of processes currently running, so that parameters and hyperparameters of the neural network, such as the batch size and learning rate, can be adjusted correctly.
Which process is the "master" one, so that logs are printed and files (checkpoints) are saved from only a single process.
The ID (called "rank") of the current process, so it can work on a specific slice of the input data.
Some of this information is about the training process of the neural network, for example:
The randomized initial values for the weights and biases of the model, so all processes will start from the same point.
The values of the weights and biases at the end of every training step, so all processes will start the next step with the same values.
More information is shared than this; the bullets above are just some examples.
At first, Horovod used MPI for all the requirements above. Then, Nvidia released NCCL, a library consisting of many algorithms for high-performance communication between GPUs. To improve overall performance, Horovod started using NCCL for things like broadcasting the initial weights and, mainly, exchanging the weight/gradient values at the end of every training step, since NCCL allows sharing this data between GPUs much more efficiently.
In Nvidia docs we can see that NCCL can be used in conjunction with MPI, and in general:
MPI is used for CPU-CPU communication, and NCCL is used for GPU-GPU communication.
Horovod still uses MPI to launch the multiple instances of the Python script and to manage the environment (rank, size, which process is the "master", etc.), allowing the user to easily manage the run.
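To make the division of labour concrete, here is a minimal sketch of a Horovod + Keras training script (the model and data are placeholders; the hvd.* calls are Horovod's standard primitives for rank/size/local_rank, gradient averaging and weight broadcasting):

```python
import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one call per process; the processes are started by mpirun/horovodrun

# Pin one GPU per process using the local rank (the classic Horovod pattern).
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder model and data, just to show where the Horovod pieces plug in.
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(20,))])
x, y = np.random.rand(256, 20), np.random.rand(256, 10)

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged across workers (via NCCL or MPI) at every step.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(optimizer=opt, loss="mse")

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Only the "master" process (rank 0) prints logs; checkpoints would be
# handled the same way.
model.fit(x, y, batch_size=32, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

You would launch it with something like horovodrun -np 4 python train.py (or the equivalent mpirun command): MPI starts the processes and provides rank/size, while NCCL, where available, handles the GPU-to-GPU traffic of the DistributedOptimizer and the broadcast.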
At first, Horovod used only MPI.
After NCCL was introduced to Horovod, MPI is still used, even in NCCL mode, to provide environment information (rank, size and local_rank). The NCCL documentation has an example that shows how it leverages MPI in a one-device-per-process setting:
The following code is an example of a communicator creation in the context of MPI, using one device per MPI rank.
https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/examples.html#example-2-one-device-per-process-or-thread

Implications of using MPI with TensorFlow

I come from a sort of HPC background and I am just starting to learn about machine learning in general and TensorFlow in particular. I was initially surprised to find out that distributed TensorFlow is designed to communicate with TCP/IP by default though it makes sense in hindsight given what Google is and the kind of hardware it uses most commonly.
I am interested in experimenting with TensorFlow in a parallel way with MPI on a cluster. From my perspective, this should be advantageous because latency should be much lower due to MPI's use of Remote Direct Memory Access (RDMA) across machines without shared memory.
So my question is, why doesn't this approach seem to be more common, given the increasing popularity of TensorFlow and machine learning? Isn't latency a bottleneck? Is there some typical problem being solved that makes this sort of solution impractical? Are there likely to be any meaningful differences between calling TensorFlow functions in a parallel way versus implementing MPI calls inside of the TensorFlow library?
Thanks
It seems TensorFlow already supports MPI, as stated at https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/mpi
MPI support for TensorFlow was also discussed at https://arxiv.org/abs/1603.02339
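If I read the contrib README correctly, the MPI path is opt-in: TensorFlow has to be built with MPI support and the server has to be told to use it through the protocol argument. A rough TF 1.x sketch (the "grpc+mpi" string and the host names are taken on trust from that README, so treat them as assumptions):

```python
import tensorflow as tf  # TF 1.x API

# Hypothetical two-worker cluster; replace the hosts/ports with real ones.
cluster = tf.train.ClusterSpec({"worker": ["node0:2222", "node1:2222"]})

# protocol="grpc+mpi" selects the contrib MPI transport; it only works
# if this TensorFlow build was compiled with MPI support enabled.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+mpi")
server.join()
```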
Generally speaking, keep in mind MPI is best at sending/receiving messages, but not so great at sending notifications and acting upon events.
Last but not least, MPI support for multi-threaded applications (e.g. MPI_THREAD_MULTIPLE) has not always been production-ready among MPI implementations.
These were two general statements, and I honestly do not know whether they are relevant for TensorFlow.
According to the docs in the TensorFlow git repo, TF actually uses the gRPC library by default, which is based on the HTTP/2 protocol (itself running over TCP/IP), and this paper should give you some insight. Hope this information is useful.

Tensorflow Object Detection API with GPU on Windows and real-time detection

I am testing the new Tensorflow Object Detection API in Python, and I succeeded in installing it on Windows using docker. However, my trained model (Faster RCNN resnet101 COCO) takes up to 15 seconds to make a prediction (with very good accuracy though), probably because I only use Tensorflow CPU.
My three questions are:
Considering the latency, where is the problem? I heard Faster RCNN was a good model for low latency visual detection, is it because of the CPU-only execution?
With such latency, is it possible to make efficient realtime video processing by using tensorflow GPU, or should I use a more popular model like YOLO?
The popular means of using TensorFlow GPU in Docker is nvidia-docker, but it is not supported on Windows. Should I continue to look for a Docker (or conda) solution for local prediction, or should I deploy my model directly to a virtual instance with a GPU (I am comfortable with Google Cloud Platform)?
Any advice and/or good practice concerning real-time video processing with Tensorflow is very welcome!
Considering the latency, where is the problem? I heard Faster RCNN was a good model for low latency visual detection, is it because of the CPU-only execution?
Of course, it's because you are using the CPU.
With such latency, is it possible to make efficient realtime video processing by using tensorflow GPU, or should I use a more popular model like YOLO?
YOLO is fast, but when I once used it for face detection the accuracy was not that great. Still, it is a good alternative.
The popular means of using TensorFlow GPU in Docker is nvidia-docker, but it is not supported on Windows. Should I continue to look for a Docker (or conda) solution for local prediction, or should I deploy my model directly to a virtual instance with a GPU (I am comfortable with Google Cloud Platform)?
I think you can still use your local GPU on Windows, as TensorFlow supports the GPU in Python.
And here is an example of doing exactly that: it has a client that can read a webcam or IP-camera stream, and the server uses the TensorFlow Python GPU version with a ready-to-use pre-trained model for predictions.
Unfortunately, TensorFlow does not support tensorflow-serving on Windows. Also, as you said, nvidia-docker is not supported on Windows. Bash on Windows has no GPU support either. So I think this is the only easy way to go for now.
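For what it's worth, local prediction with a frozen graph exported by the Object Detection API looks roughly like this in TF 1.x (the file path and tensor names below are the conventional ones for exported detection models, so verify them against your own export):

```python
import numpy as np
import tensorflow as tf

PATH_TO_FROZEN_GRAPH = "frozen_inference_graph.pb"  # placeholder path

# Load the exported detection graph once.
graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    # Dummy frame standing in for a webcam image (height, width, RGB).
    frame = np.zeros((480, 640, 3), dtype=np.uint8)
    boxes, scores, classes, num = sess.run(
        ["detection_boxes:0", "detection_scores:0",
         "detection_classes:0", "num_detections:0"],
        feed_dict={"image_tensor:0": frame[None, ...]})
    print("detections:", int(num[0]))
```

Keeping a single Session open and feeding it frame after frame avoids reloading the graph, which is what matters most for real-time video.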
