I'm a newbie to TensorFlow. I have been learning how to use TensorFlow to train models in a distributed manner, and I have access to multiple servers, each with multiple CPUs.
Training mechanisms are clearly outlined in documentation and tutorials, but there are some ambiguities regarding data management while training multiple workers. In my understanding, data should be shared and stored on a single machine, and tf.distribute.DistributedDataset distributes data among workers.
Is my understanding that shared data is stored on one machine correct?
Think of a situation where we have multiple workers in our network and we want to train a model for 10 epochs on a large dataset. Is it true that tf.distribute.DistributedDataset sends data to workers 10 times? Are there any mechanisms to prevent the same batches of data from being sent to the same worker ten times?
This post, for instance, states that:
Spark and HDFS are designed to work well together. When Spark needs some data from HDFS, it grabs the closest copy which minimizes the time data spends traveling around the network.
I'm looking for something similar for TensorFlow's distributed training.
The question is simple, but I don't know how to implement it in practice. I would like to train a TensorFlow LSTM model on an incredibly large dataset (50 million records). I can load the data file on a local machine, but the machine crashes during the pre-processing stage because of limited memory. I have tried using del on unused objects and running garbage collection to free memory, but it does not help.
Is there any way to train a TensorFlow model in stages? For example, the model would be trained 5 times, each time using only 10 million records, and those records would be deleted after training to free RAM. The same procedure would be repeated 5 times to train one TensorFlow model.
Thanks
There are some ways to avoid these problems:
1- You can use Google Colab with the high-RAM runtime, or rent a VM in the cloud.
2- Apply the three basic software techniques for handling too much data: compression, chunking, and indexing.
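As a rough illustration of the chunking idea in point 2, here is a minimal sketch that streams a CSV file in fixed-size chunks so that only one chunk is in memory at a time. The file layout (a header row followed by records), the chunk size, and the train_on_chunk callback are all assumptions for illustration; with Keras, the callback might pre-process the chunk and call model.fit or model.train_on_batch on it.

```python
import csv
from itertools import islice
from typing import Callable, Iterator, List


def iter_chunks(path: str, chunk_size: int) -> Iterator[List[List[str]]]:
    """Yield lists of at most `chunk_size` CSV rows, reading lazily from disk."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def train_in_chunks(
    path: str,
    chunk_size: int,
    train_on_chunk: Callable[[List[List[str]]], None],
    epochs: int = 1,
) -> int:
    """Call `train_on_chunk` once per chunk per epoch; return chunks processed."""
    processed = 0
    for _ in range(epochs):
        for chunk in iter_chunks(path, chunk_size):
            train_on_chunk(chunk)  # hypothetical hook: pre-process + one training step
            processed += 1
    return processed
```

Because the model object lives outside the loop, its weights carry over from one chunk to the next, so there is no need to train five separate models.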
We currently have a system running on AWS Sagemaker whereby several units have their own trained machine learning model artifact (using an SKLearn training script with the Sagemaker SKLearn estimator).
Through the use of Sagemaker's multi-model endpoints, we are able to host all of these units on a single instance.
The problem we have is that we need to scale this system up so that we can train individual models for hundreds of thousands of units and then host the resulting model artifacts on a multi-model endpoint. However, SageMaker has a limit on the number of models you can train in parallel (our limit is 30).
Aside from training our models in batches, does anyone have any ideas how to go about implementing a system in AWS Sagemaker whereby for hundreds of thousands of units, we can have a separate trained model artifact for each unit?
Is there a way to output multiple model artifacts for 1 sagemaker training job with the use of an SKLearn estimator?
Furthermore, how does Sagemaker make use of multiple CPUs when a training script is submitted? Does this have to be specified in the training script/estimator object or is this handled automatically?
Here are some ideas:
1. does anyone have any ideas how to go about implementing a system in AWS Sagemaker whereby for hundreds of thousands of units, we can have a separate trained model artifact for each unit? Is there a way to output multiple model artifacts for 1 sagemaker training job with the use of an SKLearn estimator?
I don't know if the 30-training-job concurrency limit is a hard limit. If it is a blocker, you should open a support ticket to ask whether it is and try to get it raised. Otherwise, as you point out, you can train multiple models in one job and produce multiple artifacts that you can either (a) send to S3 manually, or (b) save to /opt/ml/model so that they all get packed into the model.tar.gz artifact in S3. Note that this could become impractical if the artifact gets too big.
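To make option (b) concrete, here is a minimal sketch of a training routine that fits one model per unit and writes each one to its own file in the model directory. The fit_unit_model function is a hypothetical stand-in for fitting a real estimator, and the one-file-per-unit naming scheme is an assumption, not something SageMaker prescribes.

```python
import os
import pickle
from statistics import mean


def fit_unit_model(targets):
    """Hypothetical stand-in for fitting an estimator: stores the mean only."""
    return {"mean": mean(targets), "n_samples": len(targets)}


def train_all_units(data_by_unit, model_dir):
    """Fit one model per unit and write each to its own file in `model_dir`."""
    os.makedirs(model_dir, exist_ok=True)
    paths = {}
    for unit_id, targets in data_by_unit.items():
        model = fit_unit_model(targets)
        path = os.path.join(model_dir, f"{unit_id}.pkl")  # assumed naming scheme
        with open(path, "wb") as f:
            pickle.dump(model, f)
        paths[unit_id] = path
    return paths
```

In a training script you would call train_all_units(data, os.environ.get("SM_MODEL_DIR", "/opt/ml/model")), since the framework containers expose the model directory through the SM_MODEL_DIR environment variable. With real sklearn estimators, joblib.dump is the usual replacement for pickle.dump.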
2. how does Sagemaker make use of multiple CPUs when a training script is submitted? Does this have to be specified in the training script/estimator object or is this handled automatically?
This depends on the type of training container you are using. SageMaker built-in containers are developed by Amazon teams and designed to efficiently use available resources. If you use your own code such as custom python in the Sklearn container, you are responsible for making sure that your code is efficiently written and uses available hardware. Hence framework choice is quite important :) for example, some sklearn models support explicitly using multiple CPUs (eg the n_jobs parameter in the random forest), but I don't think that Sklearn natively supports GPU, multi-GPU or multi-node training.
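For instance, a minimal sketch of opting in to multiple CPU cores with the n_jobs parameter could look like this. The synthetic dataset and the hyper-parameter values are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn to use all available CPU cores when
# building (and predicting with) the trees; the default uses a single core.
clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
clf.fit(X, y)
```

Nothing in the estimator object on the SageMaker side needs to change for this; the parallelism is requested entirely inside the training script.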
I have an implementation of a GRU based network in PyTorch, which I train using a 4 GB GPU present in my laptop, and obviously it takes a lot of time (4+ hrs for 1 epoch). I am looking for ideas/leads on how I can move this deep-learning model to train on a couple of spark clusters instead.
So far, I have only come across this GitHub library called SparkTorch, which unfortunately has limited documentation and the examples provided are way too trivial.
https://github.com/dmmiller612/sparktorch
To summarize, I am looking for answers to the following two questions:
Is it a good idea to train a deep learning model on Spark clusters? I have read in places that the communication overhead undermines the gains in training speed.
How to convert the PyTorch model (and the underlying dataset) in order to perform a distributed training across the worker nodes.
Any leads appreciated.
The problem
I am currently working on a project that I sadly can't share with you. The project is about hyper-parameter optimization for neural networks, and it requires that I train multiple neural network models (more than I can store on my GPU) in parallel. The network architectures stay the same, but the network parameters and hyper-parameters are subject to change between training intervals. I currently achieve this using PyTorch in a Linux environment, which allows my NVIDIA GTX 1660 (6GB RAM) to use the multiprocessing feature that PyTorch provides.
Code (simplified):
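def training_function(checkpoint):
    load(checkpoint)     # read model and optimizer state from file into GPU memory
    train(checkpoint)
    unload(checkpoint)   # save the trained state back to file and free GPU memory
    return checkpoint    # returned so that imap_unordered yields the trained checkpoint

for step in range(training_steps):
    trained_checkpoints = list()
    for trained_checkpoint in pool.imap_unordered(training_function, checkpoints):
        trained_checkpoints.append(trained_checkpoint)
    for optimized_checkpoint in optimize(trained_checkpoints):
        checkpoints.update(optimized_checkpoint)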
I currently test with a population of 30 neural networks (i.e. 30 checkpoints) on the MNIST and FashionMNIST datasets, each of which consists of 70 000 28x28 single-channel images (50k training, 10k validation, 10k testing). The network I train is a simple LeNet-5 implementation.
I use a torch.multiprocessing pool and allow 7 processes to be spawned. Each process uses some of the GPU memory available just to initialize the CUDA environment in each process. After training, the checkpoints are adapted with my hyper-parameter optimization technique.
The load function in the training_function loads the model and optimizer state (which holds the network parameter tensors) from a local file into GPU memory using torch.load. The unload function saves the newly trained states back to file using torch.save and deletes them from memory. I do this because PyTorch will only release GPU tensors when no variable is referencing them, and I have limited GPU memory.
The current setup works, but each CUDA initialization occupies over 700MB of GPU RAM, and so I am interested if there are other ways I could do this that may use less memory without a penalty to efficiency.
My attempts
I suspected I could use a thread pool in order to save some memory, and it did. By spawning 7 threads instead of 7 processes, CUDA was only initialized once, which saved almost half of my memory. However, this led to a new problem: GPU utilization dropped to approx. 30% according to nvidia-smi, which I monitor in a separate Linux terminal. With processes instead of threads, I get around 85-90% utilization.
I also messed around with torch.multiprocessing.set_sharing_strategy which is currently set to 'file_descriptor', but with no luck.
My questions
Is there a better way to work with multiple model- and optimizer states without saving and loading them to files while training? I have tried to move the model to CPU using model.cpu() before saving the state_dict, but this did not work in my implementation (memory leaks).
Is there an efficient way I can train multiple neural networks at the same time that uses less GPU memory? When searching the web, I only find references to nn.DataParallel which trains the same model over multiple GPUs by copying it to each GPU. This does not work for my problem.
I will soon have access to multiple, more powerful GPUs with more memory, and I suspect this problem will be less annoying then, but I wouldn't be surprised if there is a better solution I am not getting.
Update (09.03.2020)
For any future readers: if you set out to do something similar to the pseudocode displayed above, and you plan on using multiple GPUs, please make sure to create one multiprocessing pool for each GPU device. A pool gives you no control over which of its processes runs a given call, so with a single shared pool you will end up initializing CUDA multiple times in the same process, wasting memory.
Another important note is that even if you pass the device (e.g. 'cuda:1') to every torch.cuda function, you may discover that torch does something with the default CUDA device 'cuda:0' somewhere in the code. This initializes CUDA on that device in every process, which wastes memory on an unwanted and unneeded CUDA initialization. I fixed this issue by wrapping the entire training_function in a with torch.cuda.device(device_id) block.
I ended up not using multiprocessing pools and instead defined my own custom process class that holds the device and training function. This means I have to maintain an in-queue for each device process, but they all share the same out-queue, meaning I can retrieve the results the moment they are available. I figured writing a custom process class was simpler than writing a custom pool class. I desperately tried to keep using pools as they are easily maintained, but I had to use multiple imap calls, so the results were not obtainable one at a time, which led to a less efficient training loop.
I am now successfully training on multiple GPUs, but my questions posted above still remain unanswered.
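For readers who want a starting point, a minimal sketch of such a setup might look like the following. This is not my actual code: the DeviceWorker and train_population names, the round-robin assignment, and the train_checkpoint stand-in are all made up for illustration, and no real GPU work happens here.

```python
import multiprocessing as mp


def train_checkpoint(checkpoint, device):
    """Hypothetical stand-in for load/train/unload of one checkpoint on `device`."""
    return {"id": checkpoint["id"], "device": device, "steps": checkpoint["steps"] + 1}


class DeviceWorker(mp.Process):
    """One process per device: a private in-queue and a shared out-queue."""

    def __init__(self, device, out_queue):
        super().__init__(daemon=True)
        self.device = device
        self.in_queue = mp.Queue()
        self.out_queue = out_queue

    def run(self):
        # With PyTorch, this loop could be wrapped in
        # `with torch.cuda.device(self.device):` as described above.
        while True:
            checkpoint = self.in_queue.get()
            if checkpoint is None:  # sentinel: no more work for this worker
                break
            self.out_queue.put(train_checkpoint(checkpoint, self.device))


def train_population(checkpoints, devices):
    """Assign checkpoints round-robin to devices; collect results as they finish."""
    out_queue = mp.Queue()
    workers = [DeviceWorker(device, out_queue) for device in devices]
    for worker in workers:
        worker.start()
    for i, checkpoint in enumerate(checkpoints):
        workers[i % len(workers)].in_queue.put(checkpoint)
    results = [out_queue.get() for _ in checkpoints]  # arrival order, not input order
    for worker in workers:
        worker.in_queue.put(None)
        worker.join()
    return results
```

Calling train_population(population, devices=["cuda:0", "cuda:1"]) from under an if __name__ == "__main__": guard starts one process per device, so each process only ever touches its own device.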
Update (10.03.2020)
I have implemented another way to store model and optimizer state dicts outside of GPU RAM. I have written a function that replaces every tensor in the dicts with its .to('cpu') equivalent. This costs me some CPU memory, but it is more reliable than storing local files.
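A minimal sketch of such a function, assuming the state dicts are nested dicts, lists and tuples whose leaves are either tensors or plain Python values, could look like this. The to_cpu name is made up, and anything exposing a callable .to method is treated as a tensor.

```python
def to_cpu(obj):
    """Recursively copy a nested state dict, moving tensor-like leaves to the CPU."""
    if hasattr(obj, "to") and callable(obj.to):
        return obj.to("cpu")  # torch.Tensor.to returns a CPU copy of a GPU tensor
    if isinstance(obj, dict):
        return {key: to_cpu(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(value) for value in obj)
    return obj  # ints, floats, strings, None, ... are kept as-is
```

It would be used as cpu_state = to_cpu(model.state_dict()) and likewise for optimizer.state_dict(). Note that the result is a plain dict rather than the OrderedDict that PyTorch returns, so any extra attributes attached to the original dict are not carried over.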
Update (11.06.2020)
I have still not found a different approach that leads to fewer CUDA initializations while maintaining the same processing speed. From what I've read and come to understand, PyTorch does not interfere much with how CUDA operates, and leaves that up to NVIDIA.
I have ended up using a pool of custom, device-specific processes, called Workers, that are maintained by my custom pool class (more about this above). In addition, I let each of these Workers take in one or more checkpoints, as well as the function that processes them (training, testing, hp optimization), via a Queue. These checkpoints are then processed simultaneously via a Python multiprocessing ThreadPool in each Worker, and the results are returned one by one via the return Queue the moment they are ready.
This gives me the parallel procedure I was needing, but the memory issue is still there. Due to time constraints, I have come to terms with it for now.
I am working on a relatively large text-based web classification problem and I am planning on using the multinomial Naive Bayes classifier in sklearn in python and the scrapy framework for the crawling. However, I am a little concerned that sklearn/python might be too slow for a problem that could involve classifications of millions of websites. I have already trained the classifier on several thousand websites from DMOZ.
The research framework is as follows:
1) The crawler lands on a domain name and scrapes the text from 20 links on the site (of depth no larger than one). (The number of tokenized words here seems to vary from a few thousand up to 150K for a sample run of the crawler.)
2) Run the sklearn multinomial NB classifier with around 50,000 features and record the domain name depending on the result.
My question is whether a Python-based classifier would be up to the task for such a large scale application or should I try re-writing the classifier (and maybe the scraper and word tokenizer as well) in a faster environment? If yes what might that environment be?
Or perhaps Python is enough if accompanied with some parallelization of the code?
Thanks
Use the HashingVectorizer and one of the linear classification models that support the partial_fit API, for instance SGDClassifier, Perceptron or PassiveAggressiveClassifier, to learn the model incrementally without having to vectorize and load all the data in memory upfront. You should then not have any issue learning a classifier on hundreds of millions of documents with hundreds of thousands of (hashed) features.
You should however load a small subsample that fits in memory (e.g. 100k documents) and grid search good parameters for the vectorizer using a Pipeline object and the RandomizedSearchCV class of the master branch. You can also fine tune the value of the regularization parameter (e.g. C for PassiveAggressiveClassifier or alpha for SGDClassifier) using the same RandomizedSearchCV on a larger, pre-vectorized dataset that fits in memory (e.g. a couple of million documents).
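As a minimal sketch of that incremental loop, the following feeds mini-batches of text through a HashingVectorizer into SGDClassifier.partial_fit. The class labels, the toy documents, and the iter_minibatches generator are made up for illustration; in practice the generator would stream documents from the crawler or from disk.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# The vectorizer is stateless, so it never needs to see the whole corpus.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
classifier = SGDClassifier(random_state=0)
ALL_CLASSES = ["news", "shopping"]  # hypothetical labels; must be known up front


def iter_minibatches():
    """Hypothetical stand-in for a stream of (texts, labels) mini-batches."""
    yield ["breaking news election report", "buy shoes discount cart"], ["news", "shopping"]
    yield ["cheap laptop sale checkout", "government press statement"], ["shopping", "news"]


for texts, labels in iter_minibatches():
    X = vectorizer.transform(texts)  # sparse matrix for this mini-batch only
    classifier.partial_fit(X, labels, classes=ALL_CLASSES)

prediction = classifier.predict(vectorizer.transform(["discount sale on shoes"]))
print(prediction)
```

Only one mini-batch is vectorized and held in memory at a time, which is what keeps the memory footprint bounded regardless of corpus size.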
Also linear models can be averaged (average the coef_ and intercept_ of 2 linear models) so that you can partition the dataset, learn linear models independently and then average the models to get the final model.
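A minimal sketch of that averaging step is shown below. It assumes every model was trained on the same feature space with the same class ordering; the average_linear_models helper is made up for illustration and is not part of scikit-learn.

```python
import numpy as np


def average_linear_models(models):
    """Return the element-wise mean of coef_ and intercept_ over fitted linear models."""
    coef = np.mean([model.coef_ for model in models], axis=0)
    intercept = np.mean([model.intercept_ for model in models], axis=0)
    return coef, intercept
```

The averaged arrays can then be assigned to the coef_ and intercept_ attributes of one of the fitted models (or a copy of it) to obtain the final classifier.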
Fundamentally, if you rely on numpy, scipy, and sklearn, Python will not be a bottleneck as most critical portions of those libraries are implemented as C-extensions.
But, since you're scraping millions of sites, you're going to be bounded by your single machine's capabilities. I would consider using a service like PiCloud [1] or Amazon Web Services (EC2) to distribute your workload across many servers.
An example would be to funnel your scraping through Cloud Queues [2].
[1] http://www.picloud.com
[2] http://blog.picloud.com/2013/04/03/introducing-queues-creating-a-pipeline-in-the-cloud/