Queue manager for experiments in neural network training - python

I am conducting experiments with neural networks using Keras with the TensorFlow backend. I do this on the GPU of my PC, which runs Windows 7.
My workflow looks like the following.
I create a small Python script that defines a model and then runs model.fit_generator for ~50 epochs, with early stopping if validation accuracy does not improve for 10-15 epochs. Then I run it in a terminal with a command like python model_v3_4_5.py.
Usually one epoch takes about 1.5 hours. During this period some new ideas (training parameters or new architecture) come into my head.
Then I create a new python script...
During my experiments I've found that it is better not to train several models in parallel: I've seen epoch times double and validation accuracy drop strangely.
Therefore, I'd like to wait until the first training finishes and then run the second one. At the same time, I'd like to avoid my PC idling, so a new training should start immediately after the first one has finished.
But I don't know exactly when the first training will finish, so running something like timeout <50 hours> && python model_v3_4_6.py would be a dumb solution.
So I need some kind of queue manager.
One solution that has come to mind is installing a Jenkins slave on my PC and using the queues that Jenkins provides. As far as I remember, though, Jenkins has issues with GPU access.
Another variant is training the models in a Jupyter notebook in separate cells. However, I don't see a way to queue cell execution there, and this is a topic still being discussed.
Update. Another variant: add to each model script some code that checks the current GPU state (is it training a network right now?) and waits while it is busy. This will cause issues when several scripts (more than one bright new idea :) ) are waiting for the GPU to become idle.
Are there any other variants?

Finally, I've come up with a simple cmd script:
set PYTHONPATH=%CD%
:start
for %%m in (train_queue\model*.py) do (
    python %%m
    del %%m
)
timeout 5
goto start
One creates a subdirectory train_queue and puts scripts with models into it. All scripts log their output to files whose names contain timestamps. The script also calls the timeout program to wait a few seconds before scanning the queue again.
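For readers who prefer Python over cmd, here is a minimal sketch of the same polling loop (assuming the same train_queue subdirectory and model*.py naming convention):

import glob
import os
import subprocess
import time

QUEUE_DIR = "train_queue"  # same queue subdirectory as the cmd script

while True:
    for script in sorted(glob.glob(os.path.join(QUEUE_DIR, "model*.py"))):
        subprocess.run(["python", script])  # blocks until this training finishes
        os.remove(script)                   # remove the finished job from the queue
    time.sleep(5)                           # poll for new scripts every few seconds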

Related

Limit number of cores used on server for tensorflow 2 and keras

I try to run a Python script that trains several Neural Networks using TensorFlow and Keras. The problem is that I cannot restrict the number of cores used on the server, even though it works on my local desktop.
The basic structure is that I have defined a function run_net that runs the neural net. This function is called with different parameters in parallel using joblib (see below). Additionally, I have tried running the function iteratively with different parameters which didn't solve the problem.
Parallel(n_jobs=1, backend="multiprocessing")(
    delayed(run_net)(params) for params in param_list  # param_list stands in for my parameter sets
)
If I run that on my local Windows desktop, everything works fine. However, if I try to run the same script on our institute's server with 48 cores and check CPU usage with htop, all cores are used. I already tried setting n_jobs in joblib's Parallel to 1, and it looks like CPU usage goes to 100% once the tensorflow models start training.
I already searched for different solutions and the main one that I found is the one below. I define that before running the parallel jobs shown above. I also tried placing the code below before every fit or predict method of the model.
import tensorflow as tf

K = tf.compat.v1.keras.backend  # Keras backend, via the compat API that still has set_session

NUM_PARALLEL_EXEC_UNITS = 5
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
    inter_op_parallelism_threads=2,
    device_count={"CPU": NUM_PARALLEL_EXEC_UNITS},
)
session = tf.compat.v1.Session(config=config)
K.set_session(session)
At this point, I am quite lost and have no idea how to make Tensorflow and/or Keras use a limited number of cores as the server I am using is shared across the institute.
The server is running Linux; however, I don't know which exact distribution/version it is. I am very new to running code on a server.
These are the versions I am using:
python == 3.10.8
tensorflow == 2.10.0
keras == 2.10.0
If you need any other information, I am happy to provide that.
Edit 1
Neither the answer suggested in this thread nor using only these commands works:
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
After trying some things, I have found a solution to my problem. With the following code, I can restrict the number of CPUs used:
import os
os.environ["OMP_NUM_THREADS"] = "5"  # set before TensorFlow initializes so the limit takes effect

import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
Note that I have no idea how many CPUs will be used in the end; I noticed that more than five cores are in use. As I don't really care about the exact number of cores, only that I don't use all of them, I am fine with this solution for now. If anybody knows how to calculate the number of cores used from the information provided above, let me know.
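As a rough runtime check of what the process actually consumes, here is a small sketch (assuming the psutil package is available on the server):

import os

import psutil  # third-party package; assumed to be installed

# Inspect the current process after the thread limits have been applied.
proc = psutil.Process(os.getpid())
print("threads spawned:", proc.num_threads())
print("CPU usage over 1 s:", proc.cpu_percent(interval=1.0))  # can exceed 100% when several cores are busy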

Why Might Python Be Running Slow

I am trying to run a program I created that uses a neural network to predict stock prices. I am trying to run it for a number of various different stocks. I am running the same exact code on both my desktop and my laptop.
At first I was running the code only on my desktop, and it was running very slowly. I thought it was just because of the number of calculations required by the neural network. However, I then also started running the code on my laptop to be able to run it for two stocks at the same time.
The code runs much, much faster on my laptop (I would estimate about 20x faster), even though my desktop has a much better processor, GPU, etc. I am also using the same-size data set for each run.
I added the lines of code:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""
This should make Python use my processor and not my graphics processor; I am not sure if that makes a difference.
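To double-check the device selection, here is a quick sketch (assuming the model runs on TensorFlow 2.x; the environment variables must be set before TensorFlow is imported):

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide the GPU before importing TensorFlow

import tensorflow as tf
print(tf.config.list_physical_devices("GPU"))  # expected to be [] when the GPU is hidden
print(tf.config.list_physical_devices("CPU"))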
Any idea why this might be?

How to have multiple MLFlow runs in parallel?

I'm not very familiar with parallelization in Python and I'm getting an error when trying to train a model on multiple training folds in parallel. Here's a simplified version of my code:
import mlflow
from multiprocessing.pool import ThreadPool

def train_test_model(fold):
    # here I train the model etc...
    # now I want to save the parameters and metrics
    with mlflow.start_run():
        mlflow.log_param("run_name", run_name)
        mlflow.log_param("modeltype", modeltype)
        # and so on...

if __name__ == "__main__":
    pool = ThreadPool(processes=num_trials)
    # run folds in parallel
    pool.map(lambda fold: train_test_model(fold), folds)
I'm getting the following error:
Exception: Run with UUID 23e9bb6d22674a518e48af9c51252860 is already active. To start a new run, first end the current run with mlflow.end_run(). To start a nested run, call start_run with nested=True
The documentation says that mlflow.start_run() starts a new run and makes it active which is the root of my problem. Every thread starts a MLFlow run for its corresponding fold and makes it active while I need the runs to run in parallel i.e. all be active(?) and save parameters/metrics of the corresponding fold. How can I solve that issue?
I found a solution, maybe it will be useful for someone else. You can see details with code examples here: https://github.com/mlflow/mlflow/issues/3592
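One workaround along the lines discussed in that issue (a sketch only; see the issue for the exact details) is to use the lower-level MlflowClient API, which logs against an explicit run ID instead of the single global active run:

from mlflow.tracking import MlflowClient
from multiprocessing.pool import ThreadPool

client = MlflowClient()
experiment_id = "0"  # placeholder: use your own experiment's ID

def train_test_model(fold):
    run = client.create_run(experiment_id)           # no global "active run" involved
    client.log_param(run.info.run_id, "fold", fold)
    # ... train the model here, then log its metrics against this run ID ...
    client.log_metric(run.info.run_id, "accuracy", 0.0)  # placeholder value
    client.set_terminated(run.info.run_id)

if __name__ == "__main__":
    folds = [0, 1, 2, 3, 4]
    ThreadPool(processes=len(folds)).map(train_test_model, folds)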

Python killed on GCP

I have been working on comparison to run deep learning code on local machine and Google Cloud Platform.
The code is about recurrent neural network and it ran perfectly well on local machine.
But on the GCP Cloud Shell, when I try to run my Python file, it shows "Killed":
userID#projectID:~$ python rnn.py
Killed
Is it because that I am out of memory? (because I tried to run line by line, and on the second time I assigned large data to a variable, it stuck.)
My code looks something like this:
imdb = np.load('imdb_word_emb.npz')
X_train = imdb['X_train']
X_test = imdb['X_test']
On the third line, the machine got stuck and showed "Killed". I tried changing the order of the second and third lines; it still got stuck at the third line.
My training data is a (25000, 80, 128) array, and so is my testing data. The data set works perfectly well on my local machine, so I am sure there is no problem with it.
Or is it because of other reasons?
It would be awesome if people who know how to solve this, or even just a few keywords, could tell me how to deal with it. Thank you :D
The error you are getting is because Cloud Shell is not intended for computational or network intensive processes, see Cloud Shell limitations.
I understand you want to compare your local machine with Google Cloud Platform. As stated in the public docs:
"When you start Cloud Shell, it provisions a g1-small Google Compute
Engine"
A g1-small machine type has 1.70 GB of RAM and a shared physical core. Keeping this in mind, and also that Cloud Shell is limited as stated before, your local machine is likely more powerful than Cloud Shell, so you would not see any improvement.
I recommend creating a Compute Engine instance with a different machine type; you can use a custom machine type to set the number of cores and the amount of RAM you want. I assume you want the benefit of running your workload faster on Google Compute Engine, so you can choose a machine type with more resources than your local one and compare how much it improves.
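For a rough sense of why the process is killed, here is a back-of-the-envelope memory estimate (assuming the arrays are float32, which is typical for embeddings):

import numpy as np

shape = (25000, 80, 128)              # shape of each array in the question
bytes_per_array = np.prod(shape) * 4  # 4 bytes per float32 element
print(bytes_per_array / 1e9)          # about 1.02 GB per array
# Loading both the training and test arrays (~2 GB total) already exceeds the
# 1.70 GB of RAM on a g1-small, so the kernel's OOM killer ends the process.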

How to reconnect to the ongoing process on GoogleColab

I recently started using Google Colab to train my CNN model. It always needs 10+ hours to train once, but I cannot stay in the same place during those hours, so I always power off my notebook and let the process keep going.
My code saves models automatically. I figured out that after I disconnect from Colab, the process is still saving models.
Here are the questions:
When I try to reconnect to the Colab notebook, it always gets stuck at the "INITIALIZING" stage and can't connect. I'm sure that the process is running. How do I know if the process is OVER?
Is there any way to reconnect to the ongoing process? It would be nice to observe the training losses during training.
Sorry for my poor English, thanks a lot.
Output your loss results to a log file saved in your drive, and periodically check this file.
You can run your training process like:
log_file = "/content/drive/My Drive/path/log.log"   # plain Python assignment (no leading !)
!python train.py > "{log_file}"                      # {log_file} interpolates the Python variable into the shell command
First question: restart the runtime from the Runtime menu.
Second question: I think you can use TensorBoard to monitor your work.
It seems there's no normal way to do this. But you can save your model to Google Drive with the current training epoch number in its name, so when you see something like "my_model_epoch_1000" in your Google Drive, you will know that the process is over.
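For example, a Keras callback along these lines would do it (a sketch; the Drive path and filename pattern are placeholders):

from tensorflow import keras

# Save a checkpoint to Google Drive after every epoch, with the epoch number
# embedded in the filename so progress is visible from the Drive file list.
checkpoint = keras.callbacks.ModelCheckpoint(
    filepath="/content/drive/My Drive/models/my_model_epoch_{epoch:04d}.h5",
    save_freq="epoch",
)
# model.fit(x_train, y_train, epochs=1000, callbacks=[checkpoint])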
