Ray: How to run many actors on one GPU?

I have only one GPU, and I want to run many actors on that GPU. Here's what I do using Ray, following https://ray.readthedocs.io/en/latest/actors.html
First, define the network on the GPU:
import os
import atexit

import ray
import tensorflow as tf

class Network():
    def __init__(self, ***some args here***):
        self._graph = tf.Graph()
        os.environ['CUDA_VISIBLE_DEVICES'] = ','.join([str(i) for i in ray.get_gpu_ids()])
        with self._graph.as_default():
            with tf.device('/gpu:0'):
                # network, loss, and optimizer are defined here
                sess_config = tf.ConfigProto(allow_soft_placement=True)
                sess_config.gpu_options.allow_growth = True
                self.sess = tf.Session(graph=self._graph, config=sess_config)
                self.sess.run(tf.global_variables_initializer())
                atexit.register(self.sess.close)
                self.variables = ray.experimental.TensorFlowVariables(self.loss, self.sess)
Then define the worker class:
@ray.remote(num_gpus=1)
class Worker(Network):
    # do something
Then define the learner class:
@ray.remote(num_gpus=1)
class Learner(Network):
    # do something
And the train function:
def train():
    ray.init(num_gpus=1)
    learner = Learner.remote(...)
    workers = [Worker.remote(...) for i in range(10)]
    # do something
This process works fine when I don't try to make it work on the GPU. That is, it works fine when I remove all the with tf.device('/gpu:0') statements and (num_gpus=1). The trouble arises when I keep them: it seems that only the learner is created, but none of the workers are constructed. What should I do to make it work?

When you define an actor class using the decorator @ray.remote(num_gpus=1), you are saying that any actor created from this class must have one GPU reserved for it for the duration of the actor's lifetime. Since you have only one GPU, you will only be able to create one such actor.
If you want multiple actors to share a single GPU, then you need to specify that each actor requires less than one GPU. For example, if you wish to share one GPU among four actors, you can have each actor require one quarter of a GPU. This can be done by declaring the actor class with
@ray.remote(num_gpus=0.25)
In addition, you need to make sure that each actor actually respects the limit you are placing on it. For example, if you declare an actor with @ray.remote(num_gpus=0.25), then you should also make sure that TensorFlow uses at most one quarter of the GPU memory. See the answers to How to prevent tensorflow from allocating the totality of a GPU memory? for ways to do this.
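To illustrate both points together, here is a minimal sketch (not from the original posts; the per-process memory fraction is one way, as an assumption, to make TensorFlow respect the quarter-GPU reservation):

import os

import ray
import tensorflow as tf

@ray.remote(num_gpus=0.25)
class FractionalWorker:
    def __init__(self):
        # Restrict this actor to the GPU(s) Ray assigned to it.
        os.environ['CUDA_VISIBLE_DEVICES'] = ','.join(str(i) for i in ray.get_gpu_ids())
        config = tf.ConfigProto(allow_soft_placement=True)
        # Cap TensorFlow at roughly a quarter of the GPU memory,
        # matching the num_gpus=0.25 reservation above.
        config.gpu_options.per_process_gpu_memory_fraction = 0.25
        self.sess = tf.Session(config=config)

ray.init(num_gpus=1)
workers = [FractionalWorker.remote() for _ in range(4)]  # four actors share the single GPU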

Related

multi-output keras model with a callback that monitors two metrics

I have a tf model that has two outputs, as indicated by this model.compile():
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=7e-4),
              loss={"BV": tf.keras.losses.MeanAbsoluteError(), "Rsp": tf.keras.losses.MeanAbsoluteError()},
              metrics={"BV": [tf.keras.metrics.RootMeanSquaredError(name="RMSE"), tfa.metrics.r_square.RSquare(name="R2")],
                       "Rsp": [tf.keras.metrics.RootMeanSquaredError(name="RMSE"), tfa.metrics.r_square.RSquare(name="R2")]})
I would like to use the ModelCheckpoint callback, which should monitor a sum of val_BV_R2 and val_Rsp_R2. I am able to run the callback like this:
save_best_model = tf.keras.callbacks.ModelCheckpoint("xyz.hdf5", monitor="val_Rsp_R2")
However, I don't know how to make it save the model with the highest sum of the two metrics.
According to the tf.keras.callbacks.ModelCheckpoint documentation, only one metric can be monitored at a time.
One way to achieve what you want could be to define an additional custom metric that performs the sum of the two metrics. Then you could monitor your custom metric and save the checkpoints as you are already doing. However, this is a bit complicated, due to having multiple outputs.
Alternatively, you could define a custom callback that does the same combining. Below is a simple example of this second option. It should work (sorry I can't test it right now):
class CombineCallback(tf.keras.callbacks.Callback):

    def __init__(self, **kargs):
        super(CombineCallback, self).__init__(**kargs)

    def on_epoch_end(self, epoch, logs={}):
        # Write the combined value into logs so later callbacks (e.g. ModelCheckpoint) can monitor it.
        logs['combine_metric'] = 0.5 * logs['val_BV_R2'] + 0.5 * logs['val_Rsp_R2']
Inside the callback you should be able to access your metrics directly with logs['name_of_my_metric'] or through the get function logs.get("name_of_my_metric").
Also I multiplied by 0.5 to leave the combined metric approximately in the same range, but see if this works for your case.
To use it, place the combining callback before a ModelCheckpoint that monitors the new key (the order matters, so that combine_metric is already in logs when the checkpoint callback runs):

combine_callback = CombineCallback()
save_best_model = tf.keras.callbacks.ModelCheckpoint("xyz.hdf5", monitor="combine_metric", mode="max")
model.fit(..., callbacks=[combine_callback, save_best_model])
More information can be found at the Examples of Keras callback applications.

How to make pytorch lightning module have injected, nested models?

I have some nets, such as the following (augmented) resnet18:
num_classes = 10
resnet = models.resnet18(pretrained=True)
for param in resnet.parameters():
    param.requires_grad = True
num_ftrs = resnet.fc.in_features
resnet.fc = nn.Linear(num_ftrs, num_classes)
And I want to use them inside a lightning module, and have it handle all optimizations, to_device, stages and so on. In other words, I want to register those modules for my lightning module.
I also want to be able to access their public members.
class MyLightning(LightningModule):

    def __init__(self, resnet):
        super().__init__()
        self._resnet = resnet
        self._criterion = lambda x: 1.0

    def forward(self, x):
        resnet_out = self._resnet(x)
        loss = self._criterion(resnet_out)
        return loss

my_lightning = MyLightning(resnet)
The above doesn't optimize any parameters.
Trying

def __init__(self, resnet):
    ...
    _layers = list(resnet.children())[:-1]
    self._resnet = nn.Sequential(*_layers)

doesn't take resnet.fc into account, and it also doesn't seem like the intended way of nesting models inside PyTorch Lightning.
How does one nest models in PyTorch Lightning and have them fully accessible and handled by the framework?
The training loop and the optimization process are handled by the Trainer class. You can do so by initializing a new instance:
>>> trainer = Trainer()
And wrapping your PyTorch Lightning module with it. This way you can perform fitting, tuning, validating, and testing on that instance provided a DataLoader or LightningDataModule:
>>> trainer.fit(my_lightning, train_dataloader, val_dataloader)
You will have to implement the following functions on your Lightning module (i.e., in your case, MyLightning):

- __init__: define computations here
- forward: use for inference only (separate from training_step)
- training_step: the complete training loop
- validation_step: the complete validation loop
- test_step: the complete test loop
- predict_step: the complete prediction loop
- configure_optimizers: define optimizers and LR schedulers

Source: the LightningModule documentation page.
Keep in mind a LightningModule is an nn.Module, so whenever you define an nn.Module as an attribute of a LightningModule in the __init__ function, this module will end up being registered as a sub-module of the parent PyTorch Lightning module.
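For illustration, here is a minimal sketch of that injected-model pattern (the cross-entropy loss, Adam optimizer, and learning rate are assumptions for the sketch, not from the question):

import torch
import torch.nn as nn
from pytorch_lightning import LightningModule

class MyLightning(LightningModule):
    def __init__(self, resnet: nn.Module):
        super().__init__()
        self.resnet = resnet                    # registered as a sub-module automatically
        self.criterion = nn.CrossEntropyLoss()  # assumed classification loss

    def forward(self, x):
        # Inference only; training logic lives in training_step.
        return self.resnet(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.criterion(self.resnet(x), y)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # self.parameters() includes the injected resnet's parameters,
        # so the Trainer optimizes, moves, and checkpoints them.
        return torch.optim.Adam(self.parameters(), lr=1e-3)

Because self.resnet is registered, trainer.fit(MyLightning(resnet), train_dataloader) will train it, and its public members remain reachable through my_lightning.resnet.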
The PyTorch model should inherit from nn.Module, so you should first find resnet18 in PyTorch; then you can use it as-is or revise it yourself.
The original ResNet code is in this path: ...\python\Lib\site-packages\torchvision\models\resnet.py. You import the ResNet network from there, so you can use it directly.
There you will find the original code:
class ResNet(nn.Module):...
https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py#L166
And import it like
from torchvision.models import ResNet
Finally, you can inherit from ResNet
class MyLightning(ResNet):

How many times are DoFn's constructed?

I'm using the apache beam python SDK and Dataflow to write an inference pipeline for making predictions with TensorFlow models. I have the prediction step in a DoFn, but I don't want to have to load the model every time I process a bundle because that's very expensive. From the docs here, "If required, a fresh instance of the argument DoFn is created on a worker, and the DoFn.Setup method is called on this instance. This may be through deserialization or other means. A PipelineRunner may reuse DoFn instances for multiple bundles. A DoFn that has terminated abnormally (by throwing an Exception) will never be reused." I've noticed that if I write my code like this
class StatefulGetEmbeddingsDoFn(beam.DoFn):

    def __init__(self, model_dir):
        self.model = None  # initialize
        self.model_dir = model_dir

    def process(self, element):
        if not self.model:  # load model if model hasn't been loaded yet
            global i
            i += 1
            logging.info('Getting model: {}'.format(i))
            self.model = Model(saved_model_dir=self.model_dir)
        ids, b64 = element
        embeddings = self.model.predict(b64)
        res = [
            {
                'image': _id,
                'embeddings': embedding.tolist()
            } for _id, embedding in zip(ids, embeddings)
        ]
        return res
It seems like the model is being loaded more than once on every worker (I've got a cluster of ~30-40 machines). Is there a way of preventing the model from being loaded more than once? I would've expected this DoFn to only be constructed once on every machine but from the logs, it seems like that's not the case...
I know this is an older question, but my initial thoughts are to use the setup and start_bundle methods.
https://beam.apache.org/releases/pydoc/2.22.0/apache_beam.transforms.core.html#apache_beam.transforms.core.DoFn.setup
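Building on that, here is a hedged sketch of the setup() approach applied to the DoFn from the question (Model and model_dir come from the question's code, not from the Beam API):

import apache_beam as beam

class GetEmbeddingsDoFn(beam.DoFn):
    def __init__(self, model_dir):
        self.model_dir = model_dir
        self.model = None

    def setup(self):
        # Called once per DoFn instance when the worker initializes it,
        # before any bundles are processed, so the expensive load happens
        # far less often than once per bundle.
        self.model = Model(saved_model_dir=self.model_dir)

    def process(self, element):
        ids, b64 = element
        embeddings = self.model.predict(b64)
        for _id, embedding in zip(ids, embeddings):
            yield {'image': _id, 'embeddings': embedding.tolist()}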

Store/Reload CNTK Trainer, Model, Inputs, Outputs

What is the best way to store a trainer and all necessary components?
1. Storing:
Store checkpoint of the trainer: Use its trainer.save_checkpoint(filename, external_state={}) function
Additionally store the model separately: Use the z.save(filename) method that every CNTK operation has. You can also get the model via z = trainer.model.
2. Reloading:
Restore the model: Use C.load_model(...). (Don't get confused by the deprecated persist namespace from CNTK 1.)
Get the inputs from the restored model.
Restore the trainer itself: Use trainer.restore_from_checkpoint, as e.g. shown here. The problem is that this function already needs a trainer object, which probably has to be initialized in the same way as the trainer used to create the checkpoint!?
How do I now restore the label inputs which go into the error function used by the trainer? In the following code I marked (with **) the variables which I think I have to restore after having stored them once.
z = C.layers.Dense(.... )
loss = error = C.squared_error(z, **l**)
**trainer** = C.Trainer(**z**, (loss, error), [mylearner], my_tensorboard_writer)
You can restore your trainer, but I actually prefer to just load my model m. The simple reason is that it is much easier to create a whole new trainer, because then you can change all the other parameters of the trainer more easily.
Then you can get the input variable from the loaded model (if your network has only one input):
input_var = m.arguments[0]
then you need the output of your model:
output = m(input_var)
and define the loss function using your target output target_output:
C.squared_error(output, target_output)
Using your model and the loss function, you can recreate your trainer from there, setting the learning rate etc. as you like.
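For illustration, here is a rough sketch of that reload-and-recreate flow (the file name, learner choice, learning rate, and the use of squared error as both loss and metric are illustrative assumptions, not from the original post):

import cntk as C

# Reload the stored model and grab its input variable.
m = C.load_model("my_model.dnn")
input_var = m.arguments[0]
output = m(input_var)

# Recreate the label input and the loss around the loaded model.
target_output = C.input_variable(output.shape)
loss = error = C.squared_error(output, target_output)

# Build a fresh trainer with whatever learner settings you like.
lr_schedule = C.learning_rate_schedule(0.01, C.UnitType.minibatch)
learner = C.sgd(m.parameters, lr_schedule)
trainer = C.Trainer(m, (loss, error), [learner])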

Load local (unserializable) objects on workers

I'm trying to use Dataflow in conjunction with TensorFlow for predictions. Those predictions happen on the workers, and I'm currently loading the model through start_bundle(), like here:
class PredictDoFn(beam.DoFn):

    def start_bundle(self):
        self.model = load_model_from_file()

    def process(self, element):
        ...
My current problem is that even if I process 1000 elements, the start_bundle() function is called multiple times (at least 10) and not once per worker as I'd hoped. This slows down the pipeline significantly because the model needs to be loaded many times, and each load takes 30 seconds.
Is there any way to load the model on the workers at initialisation, and not every time in start_bundle()?
Thanks in advance!
Dimitri
The easiest thing would be for you to add an if self.model is None: self.model = load_model_from_file() guard, but even this may not reduce the number of times your model is reloaded.
This is because DoFn instances are not currently reused across bundles. This means that your model will be forgotten after every work item is executed.
You could also create a global variable where you keep the model. This would reduce the number of reloads, but it would be really unorthodox (though it may solve your use case).
A global variable approach should work something like this:
my_model = None  # module-level cache shared by all DoFn instances in the process

class MyModelDoFn(beam.DoFn):

    def process(self, elem):
        global my_model
        if my_model is None:
            my_model = load_model_from_file()
        yield my_model.apply_to(elem)
An approach that relies on a thread-local variable would look like so. Consider that this will load the model once per thread, so the number of times your model is loaded depends on runner implementation (it will work in Dataflow):
class MyModelDoFn(beam.DoFn):

    _thread_local = threading.local()

    @property
    def model(self):
        model = getattr(MyModelDoFn._thread_local, 'model', None)
        if not model:
            MyModelDoFn._thread_local.model = load_model_from_file()
        return MyModelDoFn._thread_local.model

    def process(self, elem):
        yield self.model.apply_to(elem)
I guess you can load the model from the start_bundle call as well.
Note: This approach is very unorthodox, and is not guaranteed to work in newer versions, nor all runners.
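For completeness, a small usage sketch showing how one of the DoFns above might be wired into a pipeline (the input source is a placeholder, not from the original post):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    (pipeline
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.txt')
     | 'Predict' >> beam.ParDo(MyModelDoFn()))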
