I am using the joblib library to run multiple neural networks on my multiple CPUs at once. The idea is to make a final prediction as the average of all the different NN predictions. I use Keras with Theano as the backend.
My code works if I set n_jobs=1 but fails for anything greater than 1.
Here is the error message:
[Parallel(n_jobs=3)]: Using backend ThreadingBackend with 3 concurrent workers.
Using Theano backend.
WARNING (theano.gof.compilelock): Overriding existing lock by dead process '6088' (I am process '6032')
WARNING (theano.gof.compilelock): Overriding existing lock by dead process '6088' (I am process '6032')
The code I use is rather simple (it works for n_jobs=1):
from joblib import Parallel, delayed
result = Parallel(n_jobs=1, verbose=1, backend="threading")(
    delayed(myNNfunction)(arguments, i, X_train, Y_train, X_test, Y_test) for i in range(network)
)
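The final averaged prediction would then be computed from the returned list, for example (assuming each call to myNNfunction returns a prediction array of the same shape):
import numpy as np

final_prediction = np.mean(result, axis=0)  # element-wise average over the per-network predictions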
For information (I don't know if this is relevant), these are my parameters for Keras:
os.environ['KERAS_BACKEND'] = 'theano'
os.environ["MKL_THREADING_LAYER"] = "GNU"
os.environ['MKL_NUM_THREADS'] = '3'
os.environ['GOTO_NUM_THREADS'] = '3'
os.environ['OMP_NUM_THREADS'] = '3'
I have tried to use the technique proposed here, but it didn't change a thing. To be precise, I have created a file at C:\Users\myname\.theanorc with this in it:
[global]
base_compiledir=/tmp/%(user)s/theano.NOBACKUP
I've read somewhere (I can't find the link, sorry) that on Windows machines the file shouldn't be called .theanorc.txt but just .theanorc; in any case it doesn't work.
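For reference, the same base_compiledir setting can also be applied per process through the THEANO_FLAGS environment variable before Theano/Keras is imported, so that each worker process compiles into its own directory. A minimal sketch, with an illustrative directory name:
import os, tempfile

# must run before the first `import theano` (or `import keras`) in each worker process
compile_dir = os.path.join(tempfile.gettempdir(), 'theano_%d' % os.getpid())
os.environ['THEANO_FLAGS'] = 'base_compiledir=%s' % compile_dir
import theano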
Would you know what I am missing?
I am trying to run a Python script that trains several neural networks using TensorFlow and Keras. The problem is that I cannot restrict the number of cores used on the server, even though it works on my local desktop.
The basic structure is that I have defined a function run_net that runs the neural net. This function is called with different parameters in parallel using joblib (see below). Additionally, I have tried running the function iteratively with different parameters, which didn't solve the problem.
Parallel(n_jobs=1, backend="multiprocessing")(
    delayed(run_net)(params) for params in parameter_list  # parameter_list stands in for the actual parameter sets
)
If I run that on my local Windows desktop, everything works fine. However, if I run the same script on our institute's server with 48 cores and check CPU usage with the htop command, all cores are used. I already tried setting n_jobs in joblib's Parallel to 1, and it looks like CPU usage still goes to 100% once the TensorFlow models are trained.
I already searched for different solutions, and the main one I found is shown below. I define it before running the parallel jobs shown above. I also tried placing it before every fit or predict call of the model.
import tensorflow as tf
K = tf.compat.v1.keras.backend  # assumed import for K, which provides set_session

NUM_PARALLEL_EXEC_UNITS = 5
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
    inter_op_parallelism_threads=2,
    device_count={"CPU": NUM_PARALLEL_EXEC_UNITS},
)
session = tf.compat.v1.Session(config=config)
K.set_session(session)
At this point, I am quite lost and have no idea how to make TensorFlow and/or Keras use a limited number of cores, as the server I am using is shared across the institute.
The server is running Linux; I don't know which exact distribution/version. I am very new to running code on a server.
These are the versions I am using:
python == 3.10.8
tensorflow == 2.10.0
keras == 2.10.0
If you need any other information, I am happy to provide that.
Edit 1
Neither the answer suggested in this thread nor using only these commands works:
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
After trying some things, I have found a solution to my problem. With the following code, I can restrict the number of CPUs used:
os.environ["OMP_NUM_THREADS"] = "5"
tf.config.threading.set_intra_op_parallelism_threads(5)
tf.config.threading.set_inter_op_parallelism_threads(5)
Note that I have no idea how many CPUs will be used in the end; I noticed that it isn't five cores but more. As I don't really care about the exact number of cores, just that I don't use all of them, I am fine with that solution for now. If anybody knows how to calculate the number of cores used from the information provided above, let me know.
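For reference, a sketch of how these settings would sit relative to the joblib call from earlier, since they generally have to run before TensorFlow builds or trains anything (run_net and parameter_list are placeholders here):
import os
os.environ["OMP_NUM_THREADS"] = "5"  # set before TensorFlow starts its thread pools

import tensorflow as tf
tf.config.threading.set_intra_op_parallelism_threads(5)  # applied before any op runs
tf.config.threading.set_inter_op_parallelism_threads(5)

from joblib import Parallel, delayed

def run_net(params):
    ...  # placeholder for the training function from the question

parameter_list = [{"units": 32}, {"units": 64}]  # illustrative parameter sets
results = Parallel(n_jobs=1, backend="multiprocessing")(
    delayed(run_net)(params) for params in parameter_list
)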
I have access to a server with multiple GPUs that can be used simultaneously by many users.
I choose only one gpu_id from the terminal and have code like this:
device = "cuda:"+str(FLAGS.gpu_id) if torch.cuda.is_available() else "cpu"
where FLAGS holds the arguments parsed from the terminal.
Even though I select only one ID, I saw that I am using two different GPUs. That causes issues when the other GPU's memory is almost full, and my process terminates with a "CUDA out of memory" error.
I want to understand what the possible causes for such a thing to happen could be.
It is hard to tell what is wrong without knowing how you use the device parameter. In any case, you can try to achieve what you want with a different approach. Run your Python script in the following way:
CUDA_VISIBLE_DEVICES=0 python3 my_code.py
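With that, the process only sees the selected GPU, so inside the script the device selection reduces to index 0 (a brief sketch):
import torch

# with CUDA_VISIBLE_DEVICES=0 the only visible GPU is enumerated as cuda:0
device = "cuda:0" if torch.cuda.is_available() else "cpu"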
I am running a Python program using the excellent EasyOCR module. It relies on PyTorch for image detection, and every time I run it, it prints the warning "Using CPU. Note: This module is much faster with a GPU." for each iteration.
What can I add to my code to stop this output without stopping other output? I don't have a GPU so that is not an option.
After looking into the source code, I noticed that verbose is set to True by default in the constructor.
After setting verbose=False the message stops appearing.
reader = easyocr.Reader(['en'], gpu=False, verbose=False)
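For context, a complete minimal usage could look like this (the image path is a placeholder):
import easyocr

reader = easyocr.Reader(['en'], gpu=False, verbose=False)  # verbose=False suppresses the CPU warning
results = reader.readtext('page.png')  # 'page.png' is a placeholder image path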
I think there is a command line parameter --gpu=false. Have you tried that?
When I run cuda.select_device(0) and then cuda.close(), PyTorch cannot access the GPU again. I know there is a way for PyTorch to use the GPU again without having to restart the kernel, but I forgot how. Does anyone know?
from numba import cuda as cu
import torch

# random tensor
a = torch.rand(100, 100)

# the tensor can be moved onto the GPU
a.cuda()

device = cu.get_current_device()
device.reset()

# throws "RuntimeError: CUDA error: invalid argument"
a.cuda()

cu.close()

# throws "RuntimeError: CUDA error: invalid argument"
a.cuda()

torch.cuda.is_available()
# True
And then trying to run CUDA-based PyTorch code yields:
RuntimeError: CUDA error: invalid argument
Could you provide a more complete snippet? I am running
from numba import cuda
import torch
device = cuda.get_current_device()
device.reset()
cuda.close()
torch.cuda.is_available()
which prints True, so I'm not sure what your issue is.
I had the same issue, but with TensorFlow and Keras, when iterating through a for loop to tune hyperparameters: it did not free up the GPU memory used by older models. The cuda solution did not work for me. The following did:
import gc
gc.collect()
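A minimal sketch of how this fits into the tuning loop (build_and_train and param_grid are placeholders, not from the original code):
import gc

def build_and_train(params):
    ...  # placeholder: build and fit a Keras model with these hyperparameters

param_grid = [{"units": 32}, {"units": 64}]  # illustrative hyperparameter sets

for params in param_grid:
    model = build_and_train(params)
    del model      # drop the reference to the finished model
    gc.collect()   # force collection so its memory can be reclaimed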
This has two possible culprits:

1. A driver issue, be it the Numba driver or the kernel driver managing the GPU. The reason for suspecting this is that Roger did not see the issue, and no such issue has been reported on the Numba repo.

2. The call to cuda.select_device(0), which is not needed. Is there any strong reason to use it explicitly?

Analysis: keep in mind that cuda.get_current_device() and cuda.close() are designed around the context, not the GPU. As per the documentation of get_current_device:

Get current device associated with the current thread

Do check

gpus = cuda.list_devices()

before and after your code. If the GPUs listed are the same, then you need to create the context again. If creating the context again is a problem, please attach your complete code and a debug log if possible.
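A small sketch of that before/after check (illustrative only):
from numba import cuda

print(cuda.list_devices())  # devices visible before
# ... run the reset()/close() code from the question here ...
print(cuda.list_devices())  # devices visible after; if the listing is unchanged, recreate the context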
I tried to change the device used in a Theano-based program.
from theano import config
config.device = "gpu1"
However, I got the error:
Exception: Can't change the value of this config parameter after initialization!
I wonder what the best way is to change the device from gpu to gpu1 in code.
Thanks
Another possibility which worked for me was setting the environment variable in the process, before importing theano:
import os
os.environ['THEANO_FLAGS'] = "device=gpu1"
import theano
There is no way to change this value in code running in the same process. The best you could do is to have a "parent" process that alters, for example, the THEANO_FLAGS environment variable and spawns children. However, the method of spawning will determine which environment the children operate in.
Note also that there is no way to do this in a way that maintains a process's memory through the change. You can't start running on CPU, do some work with values stored in memory then change to running on GPU and continue running using the values still in memory from the earlier (CPU) stage of work. The process must be shutdown and restarted for a change of device to be applied.
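A minimal sketch of that parent/child pattern (worker.py and the device names are illustrative):
import os
import subprocess

# launch one child per device; each child imports theano under its own THEANO_FLAGS
for dev in ("gpu0", "gpu1"):
    env = dict(os.environ, THEANO_FLAGS="device=" + dev)
    subprocess.Popen(["python", "worker.py"], env=env)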
As soon as you import theano the device is fixed and cannot be changed within the process that did the import.
Remove the "device" config in .theanorc, then in your code:
import theano.sandbox.cuda
theano.sandbox.cuda.use("gpu0")
It works for me.
https://groups.google.com/forum/#!msg/theano-users/woPgxXCEMB4/l654PPpd5joJ