When I run gensim's LdaMulticore model on a machine with 12 cores, using:
lda = LdaMulticore(corpus, num_topics=64, workers=10)
I get a logging message that says
using serial LDA version on this node
A few lines later, I see another logging message that says
training LDA model using 10 processes
When I run top, I see that 11 Python processes have been spawned, but 9 are sleeping, i.e. only one worker is active. The machine has 24 cores and is not overwhelmed by any means. Why isn't LdaMulticore running in parallel mode?
First, make sure you have installed a fast BLAS library, because most of the time consuming stuff is done inside low-level routines for linear algebra.
On my machine, gensim.models.ldamulticore.LdaMulticore can use up all 20 CPU cores with workers=4 during training. Setting workers higher than that did not speed up the training. One reason might be that the corpus iterator is too slow for LdaMulticore to be used effectively.
You can try using ShardedCorpus to serialize and replace the corpus, which should be much faster to read and write. Also, simply compressing your large .mm file so it takes up less space (= less I/O) may help. E.g.,
import bz2, gensim  # reading a bz2-compressed corpus trades a little CPU for much less disk I/O
mm = gensim.corpora.MmCorpus(bz2.BZ2File('enwiki-latest-pages-articles_tfidf.mm.bz2'))
lda = gensim.models.ldamulticore.LdaMulticore(corpus=mm, id2word=id2word, num_topics=100, workers=4)
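If you want to try the ShardedCorpus route, here is a rough sketch of the serialize/load cycle; treat the exact signatures as assumptions to verify for your gensim version (ShardedCorpus was removed in gensim 4.x):

from gensim.corpora.sharded_corpus import ShardedCorpus

# write the corpus once into fast, numpy-backed shards, then iterate over those during training
ShardedCorpus.serialize('corpus_shards', mm, dim=len(id2word))
sharded = ShardedCorpus.load('corpus_shards')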
I have a very large dataset (7 million rows, 54 features) that I would like to fit a regression model to using XGBoost. To train the best possible model, I want to use BayesSearchCV from scikit-optimize to run the fit repeatedly for different hyperparameter combinations until the best performing set is found.
For a given set of hyperparameters, XGBoost takes a very long time to train a model, so in order to find the best hyperparameters without spending days on every permutation of training folds, hyperparameters, etc., I want to multithread both XGBoost and BayesSearchCV. The relevant part of my code looks like this:
xgb_pipe = Pipeline([('clf', XGBRegressor(random_state = 42, objective='reg:squarederror', n_jobs = 1))])
xgb_fit_params = {'clf__early_stopping_rounds': 5, 'clf__eval_metric': 'mae', 'clf__eval_set': [[X_val.values, y_val.values]]}
xgb_kfold = KFold(n_splits = 5, random_state = 42)
xgb_unsm_cv = BayesSearchCV(xgb_pipe, xgb_params, cv = xgb_kfold, n_jobs = 2, n_points = 1, n_iter = 15, random_state = 42, verbose = 4, scoring = 'neg_mean_absolute_error', fit_params = xgb_fit_params)
xgb_unsm_cv.fit(X_train.values, y_train.values)
However, I've found that when n_jobs > 1 in the BayesSearchCV call, the fit crashes and I get the following error:
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
This error persists whenever I use more than 1 thread in the BayesSearchCV call, and is independent of the memory I provide.
Is this some fundamental incompatibility between XGBoost and scikit-optimize, or can both packages be forced to work together somehow? Without some way of multithreading the optimization, I fear that fitting my model will take weeks to perform. What can I do to fix this?
I don't think the error has anything to do with incompatibility between the libraries. Rather, since you are asking for two different multi-threaded operations, you are running out of memory: your program is trying to put the complete dataset into RAM not once but several times, once per parallel instance (depending on the number of threads).
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
The SIGKILL(-9) exit code is the giveaway here: it usually means the operating system's out-of-memory killer terminated the worker because the system ran out of available memory.
Note that XGBoost is a RAM-hungry beast; coupling it with another multi-threaded operation is bound to take a toll (and personally, I would not recommend it on a daily-driver machine).
The most viable solution would probably be to use Google's TPUs or some other cloud service (beware of the costs), or to reduce the size of the dataset with statistical techniques like the ones mentioned in this Kaggle notebook and this Data Science StackExchange article.
The idea is: either you upscale the hardware (monetary cost), go head-on with single-threaded BayesSearchCV (time cost), or downsize the data using whatever technique best suits you.
Finally, the answer still is that the libraries are probably compatible, just the data is too large for the available RAM.
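For example, one common memory-reduction trick from references like those is downcasting wide numeric dtypes before training; a rough sketch (the helper name and the assumption that the features live in a pandas DataFrame are mine, not from the question):

import numpy as np
import pandas as pd

def downcast_numeric(df):
    # Shrink float64 -> float32 and int64 -> the smallest integer type that fits,
    # which roughly halves the memory footprint of most numeric feature tables.
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = df[col].astype(np.float32)
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    return df

X_train = downcast_numeric(X_train)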
I am currently running 8 Random Forest models (each with grid search) on 8 different datasets, 1000 times each. To save time, I opened 8 different terminals and am running each model in parallel. Now, my question is:
Is it faster to train each random forest model with n_jobs=-1, or is it better to assign a fixed number of cores such as n_jobs=3? I have 26 cores available, so I can assign at most 3 cores to each model.
Looking at the CPU usage with htop, I can see that running with n_jobs=-1 already makes full use of all CPUs. Could setting n_jobs=-1 actually create a bottleneck when distributing the data to each core?
The fastest option, in my opinion, would be to use n_jobs=-1 in all 8 terminals and let the operating system allocate CPU resources to each worker as needed.
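A minimal sketch of that setup for one terminal (the grid, the estimator type and the data names are assumptions, not from the question):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# hypothetical grid; n_jobs=-1 lets joblib spread the fits over every available core
param_grid = {'n_estimators': [200, 500], 'max_depth': [None, 10, 30]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)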
The problem
I am currently working on a project that I sadly can't share with you. The project is about hyper-parameter optimization for neural networks, and it requires that I train multiple neural network models (more than I can store on my GPU) in parallel. The network architectures stay the same, but the network parameters and hyper-parameters are subject to change between each training interval. I am currently achieving this with PyTorch in a Linux environment so that my NVIDIA GTX 1660 (6GB RAM) can be used with the multiprocessing feature that PyTorch provides.
Code (simplified):
def training_function(checkpoint):
    load(checkpoint)    # load model/optimizer state from file into GPU memory
    train(checkpoint)
    unload(checkpoint)  # save the trained state back to file and free GPU memory

for step in range(training_steps):
    trained_checkpoints = list()
    for trained_checkpoint in pool.imap_unordered(training_function, checkpoints):
        trained_checkpoints.append(trained_checkpoint)
    for optimized_checkpoint in optimize(trained_checkpoints):
        checkpoints.update(optimized_checkpoint)
I currently test with a population of 30 neural networks (i.e. 30 checkpoints) on the MNIST and FashionMNIST datasets, each of which consists of 70,000 28x28 single-channel images (50k training, 10k validation, 10k testing). The network I train is a simple LeNet-5 implementation.
I use a torch.multiprocessing pool and allow 7 processes to be spawned. Each process uses some of the GPU memory available just to initialize the CUDA environment in each process. After training, the checkpoints are adapted with my hyper-parameter optimization technique.
The load function in training_function loads the model and optimizer state (which holds the network parameter tensors) from a local file into GPU memory using torch.load. The unload function saves the newly trained states back to file using torch.save and deletes them from memory. I do this because PyTorch will only free GPU tensors when no variable references them, and I have limited GPU memory.
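For illustration only, a minimal sketch of what such load/unload helpers could look like; LeNet5, checkpoint.path, checkpoint.model, checkpoint.optimizer and checkpoint.lr are assumed names, not the author's actual code:

import torch

def load(checkpoint):
    # rebuild the fixed architecture, then pull the saved states from disk onto the GPU
    checkpoint.model = LeNet5().cuda()
    checkpoint.optimizer = torch.optim.SGD(checkpoint.model.parameters(), lr=checkpoint.lr)
    state = torch.load(checkpoint.path, map_location='cuda')
    checkpoint.model.load_state_dict(state['model'])
    checkpoint.optimizer.load_state_dict(state['optimizer'])

def unload(checkpoint):
    # persist the freshly trained states, then drop the GPU references so the memory can be freed
    torch.save({'model': checkpoint.model.state_dict(),
                'optimizer': checkpoint.optimizer.state_dict()}, checkpoint.path)
    checkpoint.model, checkpoint.optimizer = None, None
    torch.cuda.empty_cache()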
The current setup works, but each CUDA initialization occupies over 700MB of GPU RAM, and so I am interested if there are other ways I could do this that may use less memory without a penalty to efficiency.
My attempts
I suspected I could use a thread pool to save some memory, and it did help. By spawning 7 threads instead of 7 processes, CUDA was only initialized once, which saved almost half of my memory. However, this led to a new problem: GPU utilization dropped to approx. 30% according to nvidia-smi, which I monitor in a separate Linux terminal. Without threads, I get around 85-90% utilization.
I also messed around with torch.multiprocessing.set_sharing_strategy which is currently set to 'file_descriptor', but with no luck.
My questions
Is there a better way to work with multiple model- and optimizer states without saving and loading them to files while training? I have tried to move the model to CPU using model.cpu() before saving the state_dict, but this did not work in my implementation (memory leaks).
Is there an efficient way I can train multiple neural networks at the same time that uses less GPU memory? When searching the web, I only find references to nn.DataParallel which trains the same model over multiple GPUs by copying it to each GPU. This does not work for my problem.
I will soon have access to multiple, more powerful GPUs with more memory, and I suspect this problem will be less annoying then, but I wouldn't be surprised if there is a better solution I am not getting.
Update (09.03.2020)
For any future readers, if you set out to do something similar to the pseudo code displayed above and you plan on using multiple GPUs, please make sure to create one multiprocessing pool for each GPU device. Pools don't execute functions in order with the underlying processes they contain, and so you will end up initializing CUDA multiple times on the same process, wasting memory.
Another important note: even if you pass the device (e.g. 'cuda:1') to every torch.cuda function, you may discover that torch does something on the default CUDA device 'cuda:0' somewhere in the code, initializing CUDA on that device for every process and wasting memory on an unwanted, unneeded CUDA initialization. I fixed this issue by wrapping the entire training_function in with torch.cuda.device(device_id).
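A minimal sketch of that fix, reusing the placeholder helpers from the pseudo code above (the device_id argument is an assumption about how the device reaches the function):

import torch

def training_function(checkpoint, device_id):
    # pinning the whole function to one device avoids an implicit context on 'cuda:0'
    with torch.cuda.device(device_id):
        load(checkpoint)
        train(checkpoint)
        unload(checkpoint)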
I ended up not using multiprocessing pools and instead defined my own custom process class that holds the device and the training function. This means I have to maintain an in-queue for each device process, but they all share the same out-queue, so I can retrieve results the moment they are available. I figured writing a custom process class was simpler than writing a custom pool class. I desperately tried to keep using pools, as they are easy to maintain, but I had to use multiple imap functions, so the results were not obtainable one at a time, which led to a less efficient training loop.
I am now successfully training on multiple GPUs, but my questions posted above still remain unanswered.
Update (10.03.2020)
I have implemented another way to store model and optimizer state dicts outside of GPU RAM: I wrote a function that replaces every tensor in the dicts with its .to('cpu') equivalent. This costs me some CPU memory, but it is more reliable than storing local files.
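A rough sketch of such a function (the name and the recursion over nested dicts are my assumptions):

import torch

def state_dict_to_cpu(state_dict):
    # recursively replace every tensor with a CPU copy so the GPU memory can be freed
    cpu_dict = {}
    for key, value in state_dict.items():
        if torch.is_tensor(value):
            cpu_dict[key] = value.to('cpu')
        elif isinstance(value, dict):
            cpu_dict[key] = state_dict_to_cpu(value)
        else:
            cpu_dict[key] = value
    return cpu_dict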
Update (11.06.2020)
I have still not found a different approach that leads to fewer CUDA initializations while maintaining the same processing speed. From what I've read and come to understand, PyTorch does not interfere much with how CUDA operates and leaves that up to NVIDIA.
I have ended up using a pool of custom, device-specific processes, called Workers, that is maintained by my custom pool class (more about this above). In addition, I let each of these Workers take in one or more checkpoints, as well as the function that processes them (training, testing, hp optimization), via a queue. These checkpoints are then processed simultaneously via a Python multiprocessing ThreadPool inside each Worker, and the results are returned one by one via the shared return queue the moment they are ready.
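As a rough sketch of that Worker idea under the assumptions above (the per-Worker ThreadPool is omitted for brevity, and None is used as a hypothetical stop signal):

import torch
import torch.multiprocessing as mp

class Worker(mp.Process):
    # one in-queue per device-bound worker, one out-queue shared by all workers
    def __init__(self, device_id, in_queue, out_queue):
        super().__init__()
        self.device_id = device_id
        self.in_queue = in_queue
        self.out_queue = out_queue

    def run(self):
        with torch.cuda.device(self.device_id):
            for function, checkpoint in iter(self.in_queue.get, None):
                self.out_queue.put(function(checkpoint))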
This gives me the parallel procedure I needed, but the memory issue is still there. Due to time constraints, I have come to terms with it for now.
I am currently training a recurrent net in TensorFlow for a text classification problem and am running into performance and out-of-memory issues. I am on an AWS g2.8xlarge instance with Ubuntu 14.04 and a recent nightly build of TensorFlow (downloaded on Aug 25).
1) Performance issue:
On the surface, both the CPU and GPU are highly under-utilized. I've run multiple tests on this (and have used line_profiler and memory_profiler in the process). Train durations scale linearly with number of epochs, so I tested with 1 epoch. For RNN config = 1 layer, 20 nodes, training time = 146 seconds.
Incidentally, that number is about 20 seconds higher/slower than the same test run on a g2.2xlarge!
Here is a snapshot of System Monitor and nvidia-smi (updated every 2 seconds) about 20 seconds into the run:
[Screenshot: System Monitor and nvidia-smi, early in the run]
As you can see, GPU utilization is at 19%. When I use nvprof, I find that the total GPU process time is about 27 seconds or so. Also, except for one vCPU, all others are very under-utilized. The numbers stay around this level, till the end of the epoch where I measure error across the entire training set, sending GPU utilization up to 45%.
Unless I am missing something, on the surface, each device is sitting around waiting for something to happen.
2) Out of memory issue:
If I increase the number of nodes to 200, I get an Out of Memory error on the GPU side. As you can see from the above snapshots, only one of the four GPUs is used. I've found that which GPU TensorFlow uses depends on how you assign the model: if you don't specify anything, TensorFlow assigns it to a GPU; if you specify a GPU, only that one is used. TensorFlow does not like it when I assign the model to multiple devices with a "for d in ['/gpu:0',...]" loop; I run into an issue with re-using the embedding variable. I would like to use all 4 GPUs for this (without setting up distributed TensorFlow). Here is the snapshot of the Out of Memory error:
[Screenshot: Out of Memory error]
Any suggestions you may have for both these problems would be greatly appreciated!
Re (1): to improve GPU utilization, did you try increasing the batch size and/or shortening the sequences you use for training?
Re (2): to use multiple GPUs you do need to manually assign the ops to GPU devices, I believe. The right way is to place ops on specific GPUs with the device context manager:
with tf.device("/gpu:0"):
    # ops created here run on the first GPU
    ...
with tf.device("/gpu:1"):
    # ops created here run on the second GPU
    ...
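Since the embedding variable then has to be shared across the per-GPU copies, here is a rough sketch of one way to do that with TF 1.x variable scopes (tower_inputs, vocab_size and embed_dim are assumptions, not from the question):

import tensorflow as tf

with tf.variable_scope("model") as scope:
    for i, inputs in enumerate(tower_inputs):
        with tf.device("/gpu:%d" % i):
            if i > 0:
                scope.reuse_variables()  # later towers reuse the same embedding variable
            embedding = tf.get_variable("embedding", [vocab_size, embed_dim])
            embedded = tf.nn.embedding_lookup(embedding, inputs)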
I am executing the scikit-learn SVM classifier (SVC) from Python 2.7.10 and it has been running for over 2 hours now. I read the data using pandas.read_csv, preprocessed it, and then ran:
clf = SVC(C = 0.001, kernel = 'linear', cache_size = 7000, verbose = True)
clf.fit(X_train, y_train)
I have experience running classifiers (Random Forests and Deep Neural Networks) in H2O using R, and they never take this long! The machine I am running on has 16 GB RAM and an i7 at 3.6 GHz per core. The task monitor tells me that 8.6 GB of RAM is being used by Python, but only 13% of the CPU. I don't quite understand why it is so slow and not even using all resources.
The data I have has 12,000,000 rows and 22 columns, and the only verbose output sklearn gives me is a single line:
[LibSVM]
Is that normal behavior, or should I see a lot more? Could anyone post the verbose output of an SVC run that finished?
Also, can I do anything to speed things up besides lowering the C parameter? Using fewer rows is not really an option, since I want to benchmark algorithms and they wouldn't be comparable if different training data were used. Finally, can anyone explain why so little of my resources is being used?
You can try using accelerated implementations of the algorithms, such as scikit-learn-intelex: https://github.com/intel/scikit-learn-intelex
For SVM you should certainly get higher compute efficiency; however, with a dataset this large, the training time will still be noticeable.
First, install the package:
pip install scikit-learn-intelex
Then add the following to your Python script:
from sklearnex import patch_sklearn
patch_sklearn()
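As a small usage sketch, note that patch_sklearn() should run before the scikit-learn estimators you want accelerated are imported:

from sklearnex import patch_sklearn
patch_sklearn()  # patch first, so the SVC imported below resolves to the accelerated version

from sklearn.svm import SVC

clf = SVC(C=0.001, kernel='linear', cache_size=7000, verbose=True)
clf.fit(X_train, y_train)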