Execute function specifically on CPU in Jax - python

I have a function that will basically instantiate a huge array and do other things. I am running my code on TPUs so basically my memory is limited.
How can I execute my function specifically on the CPU?
If I do:
y = jax.device_put(my_function(), device=jax.devices("cpu")[0])
I guess that my_function() is first executed on TPU and the result is put on CPU, which gives me memory error.
and using jax.config.update('jax_platform_name', 'cpu') at the beginning of my code seems to have no effect.
Also please note that I can't modify my_function()
Thanks!

I'm going to make a guess here. I can't run it either so you may have to fiddle with it
with jax.default_device(jax.devices("cpu")[0]):
y = my_function()
See the docs here and here.

To directly specify the device on which a function should be executed, use the device argument of jax.jit. For example (using a GPU runtime because it's the accelerator I have access to at the moment):
import jax
gpu_device = jax.devices('gpu')[0]
cpu_device = jax.devices('cpu')[0]
def my_function(x):
return x.sum()
x = jax.numpy.arange(10)
x_gpu = jax.jit(my_function, device=gpu_device)(x)
print(x_tpu.device())
# gpu:0
x_cpu = jax.jit(my_function, device=cpu_device)(x)
print(x_cpu.device())
# TFRT_CPU_0
This can also be controlled with the jax.default_device decorator around the call-site:
with jax.default_device(cpu_device):
print(jax.jit(my_function)(x).device())
# TFRT_CPU_0
with jax.default_device(gpu_device):
print(jax.jit(my_function)(x).device())
# gpu:0

Related

Why does a call to torch.tensor inside of apply_async fail to complete (seems to block execution)?

I'm trying to understand why the following simple example doesn't successfully complete execution and seems to get stuck on the first line of really_simple_func (on Ubuntu machines, but not Windows). The code is:
import torch as t
import numpy as np
import multiprocessing as mp # I've tried both multiprocessing
# import torch.multiprocessing as mp # and torch.multiprocessing
def really_simple_func():
temp_val_2 = t.tensor(np.zeros(425447)[0:400000]) # this is the line that blocks.
return 4.3
if __name__ == "__main__":
print("Run brief starting")
some_zeros = np.zeros(425447)
temp_val = t.tensor(some_zeros[0:400000]) # DELETE THIS LINE TO MAKE IT WORK
pool = mp.Pool(processes=1)
job = pool.apply_async(really_simple_func)
print("just before job.get()")
result = job.get()
print("Run brief completed. Reward = {}".format(result))
I have torch 1.11.0 installed, numpy 1.22.3 and have tried both CPU and GPU versions of Torch. When I run this code on two different Ubuntu machines, I get the following output:
Run brief starting
just before job.get()
However, the code never successfully completes (doesn't print the "Run brief completed" line). (It does complete on a third Windows box).
On the Ubuntu machines, if I delete the line with the comment "#DELETE THIS LINE TO MAKE IT WORK" the execution DOES complete, printing the final line as expected. Similarly, if I leave the line defining temp_val in but delete the line with the comment "This is the line that blocks" it will also complete. Moreover, if I reduce the size of the temp_val tensor (say from 400000 to 4000) it will also complete successfully. Finally, it is worth noting that while I can reproduce this behaviour on two different Ubuntu machines, this code does actually complete on my Windows machine - though, as far as I can tell, the versions of key packages, such as torch, are the same.
I don't understand this behaviour. I suspect it is something to do with the way torch allocates memory or stores information. I've tried calling del temp_val to free up memory, but that doesn't seem to fix things. It seems to me that the async call to t.tensor within really_simple_func is stopped from completing if there has already been a call to t.tensor in the main code block, creating a sufficiently large tensor.
I don't understand why this is happening, or even if that is the correct explanation. In any case, what would be best practice if I do need to do some tensor processing within apply_async as well as in the main thread? More generally, what is Torch waiting on when I make a call to t.tensor?
(Obviously, this is just the simplest version of the real code I'm trying to get to work that reproduced this issue. I realise that calling mp.Pool with only one process doesn't really make sense...nor, indeed, does using apply_async to call a function that returns a constant!)
Unfortunately, I cannot provide any answers to your questions.
I can, however, share experiences with seemingly the same issue. I use a Linux machine with torch 1.8.1 and numpy 1.19.2.
When I run the following code on my machine:
with Pool(max_pool) as p:
pool_outputs = list(
tqdm(
p.imap(lambda f: get_model_results_per_query_file(get_preds, tokenizer, f), query_files),
total=len(query_files)
)
)
For which the function get_model_results_per_query_file contains operations similar to the following:
feats = features.unsqueeze(0).repeat(batch_size, 1, 1).to(device)
(features is a torch tensor)
The first round of jobs automatically fail, and new ones are immediately started (that do not fail for some reason). The whole process never completes though, since the main process still seems to be waiting for the first failed jobs.
If I remove the lines in my code involving the repeat function, no jobs fail.
I managed to solve my issue and preserve the same results by adapting a similar solution to yours:
feats = torch.as_tensor(np.tile(features, (batch_size, 1, 1))).to(device)
I believe as_tensor works in a similar fashion to from_numpy in this case.
I only managed to find this solution thanks to your post and your proposed workaround, so thank you!
After some further exploration, here is a brief answer to my own question.
While I still don't fully understand the blocking behaviour (and would welcome any further explanation), I have just seen that the way I'm generating torch tensors from a numpy array is not correct.
In particular, instead of using torch.tensor(temp_val) where temp_val is a numpy array, I should be using torch.from_numpy(temp_val). Doing this fixes the problem.
Alternatively, I can convert temp_val into a list and then create the tensor via torch.tensor(temp_val_as_list) - which also avoids the issue.

Can you call the numba.cuda.random device function in user-created numba CUDA device functions?

I have a cuda kernel and several device functions in numba for a project. When I try to call xoroshiro128p_uniform_float32 from the
numba.cuda.random module. Whenever I try to call this, I get:
import numba
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states
from numba.cuda.random import xoroshiro128p_uniform_float64
#cuda.jit('void(float32[:,:])', device=True)
def device(rng_states):
thread_id = cuda.grid(1)
probability = xoroshiro128p_uniform_float64(rng_states, thread_id)
#cuda.jit()
def kernel(rng_states):
device(rng_states)
BPG = 10
TPB = 10
rng_states = create_xoroshiro128p_states(BPG * TPB, seed=42069)
kernel[TPB, BPG](rng_states)
TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Untyped global name 'xoroshiro128p_uniform_float32': Cannot determine Numba type of <class 'function'>
Has anyone successfully called imported functions in a CUDA device function on numba before?
The error is here:
#cuda.jit('void(float32[:,:])', device=True)
def device(rng_states):
That signature is incorrect and also unnecessary. There is no reason that I know of to think that rng_states is or should be a 2D float32 array.
Do this:
#cuda.jit(device=True)
def device(rng_states):
A few other things I noticed which are not the proximal issue:
A kernel launch specifies blocks per grid first, then threads per block. So this might be incorrect if you modify things in the future:
kernel[TPB, BPG](rng_states)
To my eye it is better written as:
kernel[BPG, TPB](rng_states)
Finally, your question states:
When I try to call xoroshiro128p_uniform_float32
but your code reflects:
probability = xoroshiro128p_uniform_float64(
Regarding this question:
Has anyone successfully called imported functions in a CUDA device function on numba before?
In the general case, functions from a python import may or may not be usable in CUDA device code. (Most are not usable.) This random number generator, provided by numba for this purpose, might be called a "special case".

In pytorch what happens when I move a module to cuda device

In an example code I see this:
self.models["pose_encoder"] = \
networks.ResnetEncoder(18, self.opt.weights_init == "pretrained",
num_input_images=self.num_pose_frames)
self.models["pose_encoder"].to("cuda:4")
with ResnetEncoder defined by
class ResnetEncoder(nn.Module):
"""Pytorch module for a resnet encoder
"""
def __init__(self, num_layers, pretrained, num_input_images=1, **kwargs):
super(ResnetEncoder, self).__init__()
I am confused about what happens when the to(cuda:4) part happens to the module. Do the whole tensors defined in the module move to cuda:4?
What cause me error now is that I have a member function in the module, and in that function I define a tensor:
def A():
self.mytensor = []
# after some operation this is not empty any more
self.mytensor.cuda()
and this error occur:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:4 and cuda:0!
I know the cuda:0 is caused by the .cuda() operation in the last line of code. But I don't know any way to move 'self.mytensor' to cuda:4. I cam pass device as a parameter in the module's constructor, but I guess there is a better way to do this. and I want the device to change during runtime, so I don't want use os.environ also. Is there any way to do this?
Follow the document, to.device('name_device') is the special function of torch.Tensor which use to move your Tensor to a different device.
The name_device can be replaced by: cuda:num_gpu or cpu, to.device('cuda:4') will move the tensor to GPU number 4 (you can check the number of GPU by command nvidia-smi on cmd/terminal).
Any operation between two tensors will require they stand in the same device. The bug raises because your self.mycuda is on GPU 0, you can move it to GPU 4 by self.mycuda.to_device('cuda:4')
Note: by using .cuda() = .to_device('cuda:0'). You can specify the device by self.mycuda.cuda(4)

automatically choose a device keras tensorflow [duplicate]

I have access through ssh to a cluster of n GPUs. Tensorflow automatically gave them names gpu:0,...,gpu:(n-1).
Others have access too and sometimes they take random gpus.
I did not place any tf.device() explicitely because that is cumbersome and even if I selected gpu number j and that someone is already on gpu number j that would be problematic.
I would like to go throuh the gpus usage and find the first that is unused and use only this one.
I guess someone could parse the output of nvidia-smi with bash and get a variable i and feed that variable i to the tensorflow script as the number of the gpu to use.
I have never seen any example of this. I imagine it is a pretty common problem. What would be the simplest way to do that ? Is a pure tensorflow one available ?
I'm not aware of pure-TensorFlow solution. The problem is that existing place for TensorFlow configurations is a Session config. However, for GPU memory, a GPU memory pool is shared for all TensorFlow sessions within a process, so Session config would be the wrong place to add it, and there's no mechanism for process-global config (but there should be, to also be able to configure process-global Eigen threadpool). So you need to do on on a process level by using CUDA_VISIBLE_DEVICES environment variable.
Something like this:
import subprocess, re
# Nvidia-smi GPU memory parsing.
# Tested on nvidia-smi 370.23
def run_command(cmd):
"""Run command, return output as string."""
output = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True).communicate()[0]
return output.decode("ascii")
def list_available_gpus():
"""Returns list of available GPU ids."""
output = run_command("nvidia-smi -L")
# lines of the form GPU 0: TITAN X
gpu_regex = re.compile(r"GPU (?P<gpu_id>\d+):")
result = []
for line in output.strip().split("\n"):
m = gpu_regex.match(line)
assert m, "Couldnt parse "+line
result.append(int(m.group("gpu_id")))
return result
def gpu_memory_map():
"""Returns map of GPU id to memory allocated on that GPU."""
output = run_command("nvidia-smi")
gpu_output = output[output.find("GPU Memory"):]
# lines of the form
# | 0 8734 C python 11705MiB |
memory_regex = re.compile(r"[|]\s+?(?P<gpu_id>\d+)\D+?(?P<pid>\d+).+[ ](?P<gpu_memory>\d+)MiB")
rows = gpu_output.split("\n")
result = {gpu_id: 0 for gpu_id in list_available_gpus()}
for row in gpu_output.split("\n"):
m = memory_regex.search(row)
if not m:
continue
gpu_id = int(m.group("gpu_id"))
gpu_memory = int(m.group("gpu_memory"))
result[gpu_id] += gpu_memory
return result
def pick_gpu_lowest_memory():
"""Returns GPU with the least allocated memory"""
memory_gpu_map = [(memory, gpu_id) for (gpu_id, memory) in gpu_memory_map().items()]
best_memory, best_gpu = sorted(memory_gpu_map)[0]
return best_gpu
You can then put it in utils.py and set GPU in your TensorFlow script before first tensorflow import. IE
import utils
import os
os.environ["CUDA_VISIBLE_DEVICES"] = str(utils.pick_gpu_lowest_memory())
import tensorflow
An implementation along the lines of Yaroslav Bulatov's solution is available on https://github.com/bamos/setGPU.

How does one train multiple models in a single script in TensorFlow when there are GPUs present?

Say I have access to a number of GPUs in a single machine (for the sake of argument assume 8GPUs each with max memory of 8GB each in one single machine with some amount of RAM and disk). I wanted to run in one single script and in one single machine a program that evaluates multiple models (say 50 or 200) in TensorFlow, each with a different hyper parameter setting (say, step-size, decay rate, batch size, epochs/iterations, etc). At the end of training assume we just record its accuracy and get rid of the model (if you want assume the model is being check pointed every so often, so its fine to just throw away the model and start training from scratch. You may also assume some other data may be recorded like the specific hyper params, train, validation, train errors are recorded as we train etc).
Currently I have a (pseudo-)script that looks as follow:
def train_multiple_modles_in_one_script_with_gpu(arg):
'''
trains multiple NN models in one session using GPUs correctly.
arg = some obj/struct with the params for trianing each of the models.
'''
#### try mutliple models
for mdl_id in range(100):
#### define/create graph
graph = tf.Graph()
with graph.as_default():
### get mdl
x = tf.placeholder(float_type, get_x_shape(arg), name='x-input')
y_ = tf.placeholder(float_type, get_y_shape(arg))
y = get_mdl(arg,x)
### get loss and accuracy
loss, accuracy = get_accuracy_loss(arg,x,y,y_)
### get optimizer variables
opt = get_optimizer(arg)
train_step = opt.minimize(loss, global_step=global_step)
#### run session
with tf.Session(graph=graph) as sess:
# train
for i in range(nb_iterations):
batch_xs, batch_ys = get_batch_feed(X_train, Y_train, batch_size)
sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
# check_point mdl
if i % report_error_freq == 0:
sess.run(step.assign(i))
#
train_error = sess.run(fetches=loss, feed_dict={x: X_train, y_: Y_train})
test_error = sess.run(fetches=loss, feed_dict={x: X_test, y_: Y_test})
print( 'step %d, train error: %s test_error %s'%(i,train_error,test_error) )
essentially it tries lots of models in one single run but it builds each model in a separate graph and runs each one in a separate session.
I guess my main worry is that its unclear to me how tensorflow under the hood allocates resources for the GPUs to be used. For example, does it load the (part of the) data set only when a session is ran? When I create a graph and a model, is it brought in the GPU immediately or when is it inserted in the GPU? Do I need to clear/free the GPU each time it tries a new model? I don't actually care too much if the models are ran in parallel in multiple GPU (which can be a nice addition), but I want it to first run everything serially without crashing. Is there anything special I need to do for this to work?
Currently I am getting an error that starts as follow:
I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit: 340000768
InUse: 336114944
MaxInUse: 339954944
NumAllocs: 78
MaxAllocSize: 335665152
W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 160.22MiB. See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[60000,700]
and further down the line it says:
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[60000,700]
[[Node: standardNN/NNLayer1/Z1/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](standardNN/NNLayer1/Z1/MatMul, b1/read)]]
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)
however further down the output file (where it prints) it seems to print fine the errors/messages that should show as training proceeds. Does this mean that it didn't run out of resources? Or was it actually able to use the GPU? If it was able to use the CPU instead of the CPU, when why is this an error only happening when GPU are about to be used?
The weird thing is that the data set is really not that big (all 60K points are 24.5M) and when I run a single model locally in my own computer it seems that the process uses less than 5GB. The GPUs have at least 8GB and the computer with them has plenty of RAM and disk (at least 16GB). Thus, the errors that tensorflow is throwing at me are quite puzzling. What is it trying to do and why are they occurring? Any ideas?
After reading the answer that suggests to use the multiprocessing library I came up with the following script:
def train_mdl(args):
train(mdl,args)
if __name__ == '__main__':
for mdl_id in range(100):
# train one model with some specific hyperparms (assume they are chosen randomly inside the funciton bellow or read from a config file or they could just be passed or something)
p = Process(target=train_mdl, args=(args,))
p.start()
p.join()
print('Done training all models!')
honestly I am not sure why his answer suggests to use pool, or why there are weird tuple brackets but this is what would make sense for me. Would the resources for tensorflow be re-allocated every time a new process is created in the above loop?
I think that running all models in one single script can be bad practice in the long term (see my suggestion below for a better alternative). However, if you would like to do it, here is a solution: You can encapsulate your TF session into a process with the multiprocessing module, this will make sure TF releases the session memory once the process is done. Here is a code snippet:
from multiprocessing import Pool
import contextlib
def my_model((param1, param2, param3)): # Note the extra (), required by the pool syntax
< your code >
num_pool_worker=1 # can be bigger than 1, to enable parallel execution
with contextlib.closing(Pool(num_pool_workers)) as po: # This ensures that the processes get closed once they are done
pool_results = po.map_async(my_model,
((param1, param2, param3)
for param1, param2, param3 in params_list))
results_list = pool_results.get()
Note from OP: The random number generator seed does not reset automatically with the multi-processing library if you choose to use it. Details here: Using python multiprocessing with different random seed for each process
About TF resource allocation: Usually TF allocates much more resources than it needs. Many times you can restrict each process to use a fraction of the total GPU memory, and discover through trial and error the fraction your script requires.
You can do it with the following snippet
gpu_memory_fraction = 0.3 # Choose this number through trial and error
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction,)
session_config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=session_config, graph=graph)
Note that sometimes TF increases the memory usage in order to accelerate the execution. Therefore, reducing the memory usage might make your model run slower.
Answers to the new questions in your edit/comments:
Yes, Tensorflow will be re-allocated every time a new process is created, and cleared once a process ends.
The for-loop in your edit should also do the job. I suggest to use Pool instead, because it will enable you to run several models concurrently on a single GPU. See my notes about setting gpu_memory_fraction and "choosing the maximal number of processes". Also note that: (1) The Pool map runs the loop for you, so you don't need an outer for-loop once you use it. (2) In your example, you should have something like mdl=get_model(args) before calling train()
Weird tuple parenthesis: Pool only accepts a single argument, therefore we use a tuple to pass multiple arguments. See multiprocessing.pool.map and function with two arguments for more details. As suggested in one answer, you can make it more readable with
def train_mdl(params):
(x,y)=params
< your code >
As #Seven suggested, you can use CUDA_VISIBLE_DEVICES environment variable to choose which GPU to use for your process. You can do it from within your python script using the following on the beginning of the process function (train_mdl).
import os # the import can be on the top of the python script
os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
A better practice for executing your experiments would be to isolate your training/evaluation code from the hyper parameters/ model search code.
E.g. have a script named train.py, which accepts a specific combination of hyper parameters and references to your data as arguments, and executes training for a single model.
Then, to iterate through the all the possible combinations of parameters you can use a simple task (jobs) queue, and submit all the possible combinations of hyper-parametrs as separate jobs. The task queue will feed your jobs one at a time to your machine. Usually, you can also set the queue to execute number of processes concurrently (see details below).
Specifically, I use task spooler, which is super easy to install and handful (doesn't requires admin privileges, details below).
Basic usage is (see notes below about task spooler usage):
ts <your-command>
In practice, I have a separate python script that manages my experiments, set all the arguments per specific experiment and send the jobs to the ts queue.
Here are some relevant snippets of python code from my experiments manager:
run_bash executes a bash command
def run_bash(cmd):
p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, executable='/bin/bash')
out = p.stdout.read().strip()
return out # This is the stdout from the shell command
The next snippet sets the number of concurrent processes to be run (see note below about choosing the maximal number of processes):
max_job_num_per_gpu = 2
run_bash('ts -S %d'%max_job_num_per_gpu)
The next snippet iterates through a list of all combinations of hyper params / model params. Each element of the list is a dictionary, where the keys are the command line arguments for the train.py script
for combination_dict in combinations_list:
job_cmd = 'python train.py ' + ' '.join(
['--{}={}'.format(flag, value) for flag, value in combination_dict.iteritems()])
submit_cmd = "ts bash -c '%s'" % job_cmd
run_bash(submit_cmd)
A note about about choosing the maximal number of processes:
If you are short on GPUs, you can use gpu_memory_fraction you found, to set the number of processes as max_job_num_per_gpu=int(1/gpu_memory_fraction)
Notes about task spooler (ts):
You could set the number of concurrent processes to run ("slots") with:
ts -S <number-of-slots>
Installing ts doesn't requires admin privileges. You can download and compile it from source with a simple make, add it to your path and you're done.
You can set up multiple queues (I use it for multiple GPUs), with
TS_SOCKET=<path_to_queue_name> ts <your-command>
e.g.
TS_SOCKET=/tmp/socket-ts.gpu_queue_1 ts <your-command>
TS_SOCKET=/tmp/socket-ts.gpu_queue_2 ts <your-command>
See here for further usage example
A note about automatically setting the path names and file names:
Once you separate your main code from the experiment manager, you will need an efficient way to generate file names and directory names, given the hyper-params. I usually keep my important hyper params in a dictionary and use the following function to generate a single chained string from the dictionary key-value pairs.
Here are the functions I use for doing it:
def build_string_from_dict(d, sep='%'):
"""
Builds a string from a dictionary.
Mainly used for formatting hyper-params to file names.
Key-value pairs are sorted by the key name.
Args:
d: dictionary
Returns: string
:param d: input dictionary
:param sep: key-value separator
"""
return sep.join(['{}={}'.format(k, _value2str(d[k])) for k in sorted(d.keys())])
def _value2str(val):
if isinstance(val, float):
# %g means: "Floating point format.
# Uses lowercase exponential format if exponent is less than -4 or not less than precision,
# decimal format otherwise."
val = '%g' % val
else:
val = '{}'.format(val)
val = re.sub('\.', '_', val)
return val
As I understand, firstly tensorflow constructs a symbolic graph and infers the derivatives based on chain rule. Then allocates memory for all (necessary) tensors, including some inputs and outputs of layers for efficiency. When running a session, data will be loaded into the graph but in general, memory use will not change any more.
The error you met, I guess, may be caused by constructing several models in one GPU.
Isolating your training/evaluation code from the hyper parameters is a good choice, as #user2476373 proposed. But I am using bash script directly, not task spooler (may be it's more convenient), e.g.
CUDA_VISIBLE_DEVICES=0 python train.py --lrn_rate 0.01 --weight_decay_rate 0.001 --momentum 0.9 --batch_size 8 --max_iter 60000 --snapshot 5000
CUDA_VISIBLE_DEVICES=0 python eval.py
Or you can write a 'for' loop in the bash script, not necessarily in python script. Noting that I used CUDA_VISIBLE_DEVICES=0 at beginning of the script (the index could be 7 if you have 8 GPUs in one machine). Because based on my experience, I've found that tensorflow uses all GPUs in one machine if I didn't specify operations use which GPU with the code like this
with tf.device('/gpu:0'):
If you want to try multi-GPU implementation, there is some example.
Hope this could help you.
An easy solution: Give each model a unique session and graph.
It works for this platform: TensorFlow 1.12.0, Keras 2.1.6-tf, Python 3.6.7, Jupyter Notebook.
Key code:
with session.as_default():
with session.graph.as_default():
# do something about an ANN model
Full code:
import tensorflow as tf
from tensorflow import keras
import gc
def limit_memory():
""" Release unused memory resources. Force garbage collection """
keras.backend.clear_session()
keras.backend.get_session().close()
tf.reset_default_graph()
gc.collect()
#cfg = tf.ConfigProto()
#cfg.gpu_options.allow_growth = True
#keras.backend.set_session(tf.Session(config=cfg))
keras.backend.set_session(tf.Session())
gc.collect()
def create_and_train_ANN_model(hyper_parameter):
print('create and train my ANN model')
info = { 'result about this ANN model' }
return info
for i in range(10):
limit_memory()
session = tf.Session()
keras.backend.set_session(session)
with session.as_default():
with session.graph.as_default():
hyper_parameter = { 'A set of hyper-parameters' }
info = create_and_train_ANN_model(hyper_parameter)
limit_memory()
Inspired by this link: Keras (Tensorflow backend) Error - Tensor input_1:0, specified in either feed_devices or fetch_devices was not found in the Graph
I have the same issue. My solution is to run from another script doing the following as many times and in as many hyperparameter configurations as you want.
cmd = "python3 ./model_train.py hyperparameters"
os.system(cmd)
You probably don't want to do this.
If you run thousands and thousands of models on your data, and pick the one that evaluates best, you are not doing machine learning; instead you are memorizing your data set, and there is no guarantee that the model you pick will perform at all outside that data set.
In other words, that approach is similar to having a single model, which has thousands of degrees of liberty. Having a model with such high order of complexity is problematic, since it will be able to fit your data better than is actually warranted; such a model is annoyingly able to memorize any noise (outliers, measurement errors, and such) in your training data, which causes the model to perform poorly when the noise is even slightly different.
(Apologies for posting this as an answer, the site wouldn't let me add a comment.)

Categories