torch training with Multi GPU enviroment - python

I'm trying to run a training on a multi gpu enviroment.
here's model code
net_1 = nn.Sequential(nn.Conv2d(2, 12, 5),
nn.MaxPool2d(2),
snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),
nn.Conv2d(12, 32, 5),
nn.MaxPool2d(2),
snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True),
nn.Flatten(),
nn.Linear(32*5*5, 10),
snn.Leaky(beta=beta, spike_grad=spike_grad, init_hidden=True, output=True)
)
net_1.cuda()
net = nn.DataParallel(net_1)
snn.Leaky is a module used to implement SNN structure combinig with torch.nn, Which makes network work as kind of RNN.
links here(https://snntorch.readthedocs.io/en/latest/readme.html)
The input shape looks like this (timestep, batchsize, 2, 32,32)
Training code
def forward_pass(net, data):
spk_rec = []
utils.reset(net) # resets hidden states for all LIF neurons in net
for step in range(data.size(1)): # data.size(0) = number of time steps
datas = data[:,step,:,:,:].cuda()
net = net.to(device)
spk_out, mem_out = net(datas)
spk_rec.append(spk_out)
return torch.stack(spk_rec)
optimizer = torch.optim.Adam(net.parameters(), lr=2e-2, betas=(0.9, 0.999))
loss_fn = SF.mse_count_loss(correct_rate=0.8, incorrect_rate=0.2)
num_epochs = 5
num_iters = 50
loss_hist = []
acc_hist = []
t_spk_rec_sum = []
start = time.time()
net.train()
# training loop
for epoch in range(num_epochs):
for i, (data, targets) in enumerate(iter(trainloader)):
data = data.to(device)
targets = targets.to(device)
spk_rec = forward_pass(net, data)
loss_val = loss_fn(spk_rec, targets)
# Gradient calculation + weight update
optimizer.zero_grad()
loss_val.backward()
optimizer.step()
# Store loss history for future plotting
loss_hist.append(loss_val.item())
print("time :", time.time() - start,"sec")
print(f"Epoch {epoch}, Iteration {i} \nTrain Loss: {loss_val.item():.2f}")
acc = SF.accuracy_rate(spk_rec, targets)
acc_hist.append(acc)
print(f"Train Accuracy: {acc * 100:.2f}%\n")
And I got this error
Traceback (most recent call last):
File "/home/hubo1024/PycharmProjects/snntorch/multi_gpu_train.py", line 87, in <module>
spk_rec = forward_pass(net, data)
File "/home/hubo1024/PycharmProjects/snntorch/multi_gpu_train.py", line 63, in forward_pass
spk_out, mem_out = net(datas)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 168, in forward
outputs = self.parallel_apply(replicas, inputs, kwargs)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py", line 178, in parallel_apply
return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 86, in parallel_apply
output.reraise()
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/_utils.py", line 461, in reraise
raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
output = module(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/container.py", line 139, in forward
input = module(input)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 162, in forward
self.mem = self.state_fn(input_)
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 201, in _build_state_function_hidden
self._base_state_function_hidden(input_) - self.reset * self.threshold
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/snntorch/_neurons/leaky.py", line 195, in _base_state_function_hidden
base_fn = self.beta.clamp(0, 1) * self.mem + input_
File "/home/hubo1024/anaconda3/envs/spyketorchproject/lib/python3.10/site-packages/torch/_tensor.py", line 1121, in __torch_function__
ret = func(*args, **kwargs)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Process finished with exit code 1
Line 87 is
spk_rec = forward_pass(net, data)
from traning loop
and line 63 is
spk_out, mem_out = net(datas)
of forward pass function
I checked and made sure that there's no part where the tensor is defined as cpu,
And the code works well when I run this code in single GPU.
I'm currently using
torch.utils.data import DataLoader
for making batch train loader. I'm thinking that this might be main source of the problem.
Should I use different dataloader for multi GPU training?
And if so where can I find some reference with this?, I serched a bit but those info where a bit old.

This was a bug in the Leaky neuron that kept resetting its device when using DataParallel. It has been fixed in the current version of snnTorch in GitHub, and addressed in this issue: https://github.com/jeshraghian/snntorch/issues/154
We're working on fixing up the other neurons now.

Related

Torch: Input type and weight type (torch.cuda.FloatTensor) should be the same

Note: I have already seen similar questions: the same error, tell torch not to use GPU, but the answers do not work for me.
I have installed PyTorch version 1.13.0+cu117 (the latest), and the code structure is as follows (an image classification task):
# os.environ["CUDA_VISIBLE_DEVICES"]="" # required?
device = torch.device("cpu") # use CPU
...
train_set = DataLoader(
torchvision.datasets.ImageFolder(path, transform), **kwargs
)
...
model = myCNN().to(device)
optimizer = SGD(args)
loss = CrossEntropyLoss()
train()
I want to train on CPU.
For dataloader, in accordance to this, I've set pin_memory=True and non_blocking=pin_memory. The error persists even on setting pin_memory=False.
The training loop has the following structure:
for epoch in n_epochs:
model.train()
inputs, labels = inputs.to(device, non_blocking=non_blocking), labels.to(device, non_blocking=non_blocking)
Compute loss, back-propagate
The error traceback (on calling train()):
Traceback (most recent call last):
File "code.py", line 233, in <module>
train()
File "code.py", line 122, in train
outputs = model(inputs)
File "...\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "code.py", line 87, in forward
output = self.network(input)
File "...\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "...\torch\nn\modules\container.py", line 204, in forward
input = module(input)
File "...\torch\nn\modules\module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "...\torch\nn\modules\conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "...\torch\nn\modules\conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.FloatTensor) and weight type (torch.cuda.FloatTensor) should be the same or input should be a MKLDNN tensor and weight is a dense tensor
Edit: There was a comment regarding possible issues due to the model. The model is roughly:
class myCNN(nn.Module):
def __init__(self, ...other args...):
super().__init__()
self.network = nn.Sequential(
nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
nn.ReLU(),
nn.MaxPool2d(kernel_size),
... similar convolutional layers ...
nn.Flatten(),
nn.Linear(in_features, out_features)
)
def forward(self, input):
output = self.network(input)
return output
Since I have transferred both model and data to the same device, what could be the reason of this error? How to correct it?
The issue was due to incorrect usage of summary from torchinfo. It does a forward pass (if input size is provided), and the device is (by default) selected on basis of torch.cuda.is_available().
If device (as specified in the question) argument is given to summary, the training happens just fine.

Loss key needs to be present Pytorch

I am following this repo:
https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/entity_linking
Here is a small tutorial:
https://colab.research.google.com/github/NVIDIA/NeMo/blob/v1.0.2/tutorials/nlp/Entity_Linking_Medical.ipynb
Before starting this tutorial change branch to r1.10.0
When I train this model on entire UMLS dataset given the commands it gives the following error:
In automatic_optimization, when `training_step` returns a dict, the 'loss' key needs to be present
I checked the training steps method and it is fine:
def training_step(self, batch, batch_idx):
"""
Lightning calls this inside the training loop with the data from the training dataloader
passed in as `batch`.
"""
input_ids, token_type_ids, attention_mask, concept_ids = batch
logits = self.forward(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
train_loss = self.loss(logits=logits, labels=concept_ids)
# No hard examples found in batch,
# shouldn't use this batch to update model weights
if train_loss == 0:
train_loss = None
lr = None
else:
lr = self._optimizer.param_groups[0]["lr"]
self.log("train_loss", train_loss)
self.log("lr", lr, prog_bar=True)
return {"loss": train_loss, "lr": lr}
Here is a full stacktrace:
[NeMo I 2022-07-29 18:29:27 multi_similarity_loss:91] Encountered zero loss in multisimloss, loss = 0.0. No hard examples found in the batch
Error executing job with overrides: ['project_dir=.']
Traceback (most recent call last):
File "self_alignment_pretraining.py", line 38, in main
trainer.fit(model)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 769, in fit
self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
results = self._run_stage()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
return self._run_train()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
self.fit_loop.run()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 268, in advance
self._outputs = self.epoch_loop.run(self._data_fetcher)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 208, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 207, in advance
self.optimizer_idx,
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 256, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
using_lbfgs=is_lbfgs,
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1593, in _call_lightning_module_hook
output = fn(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/core/lightning.py", line 1644, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/core/optimizer.py", line 168, in step
step_output = self._strategy.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 278, in optimizer_step
optimizer_output = super().optimizer_step(optimizer, opt_idx, closure, model, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/strategies/strategy.py", line 193, in optimizer_step
return self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 85, in optimizer_step
closure_result = closure()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 148, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 134, in closure
step_output = self._step_fn()
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 437, in _training_step
training_step_output, self.trainer.accumulate_grad_batches
File "/home/umair/miniconda3/envs/aemap/lib/python3.7/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 75, in from_training_step_output
"In automatic_optimization, when `training_step` returns a dict, the 'loss' key needs to be present"
pytorch_lightning.utilities.exceptions.MisconfigurationException: In automatic_optimization, when `training_step` returns a dict, the 'loss' key needs to be present
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
You get this error message about "loss key needs to be present" because in some training steps you return the dict {"loss": None}. This happens in your code here
if train_loss == 0:
train_loss = None
lr = None
where you set train_loss = None. Lightning does not like that, because it wants loss to be a tensor with a graph attached.
If you wish to skip the optimization step completely, just return None from the training_step method, like this:
if train_loss == 0:
return None

Input dataset of different shapes in tf.keras.Model with custom train_step override

I defined a custom tf.keras.Model and overrode train_step for implementing custom training logic. The dataset trainDataset is a tf.data object, each element containing (image, label) with different image sizes. I would like to perform data augmentation inside the train_step as in the code below. I believe my code including the part has no logical flaws, including the part where I use model.fit to train the model.
However, an error occurs telling me that it Cannot batch tensors with different shapes. I see that something is executed before train_step and that is blocking the training process. How could I solve this?
model=GeneralCNN(cfg, network, augmentation)
model.compile(optimizer, loss, cfg['training'])
...
trainLogs=model.fit(trainDataset.batch(cfg['batch_size']), epochs=1, validation_data=valDataset)
...(subclass of tf.keras.Model)
def train_step(self, data):
tf.print('check!')
images, labels = data
images = self.augmentation(images) # <--- includes resizing
# initialize important variables.
batch_size = tf.shape(images)[0]
# Train the network
with tf.GradientTape() as tape:
predictions = self.network(images)
loss = self.loss_fn(labels, predictions)
grads = tape.gradient(loss, self.network.trainable_weights)
self.optimizer.apply_gradients(zip(grads, self.network.trainable_weights))
# Update metrics
self.lossMetric.update_state(loss)
predictionsIndicies=tf.math.argmax(predictions, axis=1)
self.accuracyMetric.update_state((predictionsIndicies, labels))
return {"loss": self.lossMetric.result(), "accuracy": self.accuracyMetric.result()}
...
Error:
raceback (most recent call last):
File "train.py", line 116, in <module>
app.run(main)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "train.py", line 76, in main
trainLogs=model.fit(P, epochs=1, steps_per_epoch=1000, validation_data=valDataset)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/training.py", line 1183, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 889, in __call__
result = self._call(*args, **kwds)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/def_function.py", line 950, in _call
return self._stateless_fn(*args, **kwds)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 3024, in __call__
filtered_flat_args, captured_inputs=graph_function.captured_inputs) # pylint: disable=protected-access
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 1961, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/function.py", line 596, in call
ctx=ctx)
File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot batch tensors with different shapes in component 0. First element had shape [375,500,3] and element 1 had shape [333,500,3].
[[node IteratorGetNext (defined at train.py:76) ]] [Op:__inference_train_function_1852]
Function call stack:
train_function
Edit: augmentation code just in case
the code works when I resize the dataset before model.fit
def BuildAugmentation(cfg):
augmentationType = cfg['augmentation']
if augmentationType=='none':
return SimpleResize(cfg)
elif augmentationType=='simple':
return SimpleAugmentation(cfg)
def SimpleResize(cfg):
# resizing only w/o augmentations
model=tf.keras.models.Sequential([
tfPreprocessing.Resizing(cfg['image_size'], cfg['image_size'])
])
return model
def SimpleAugmentation(cfg):
# custom simple augmenation w/ humble augmentations
model=tf.keras.models.Sequential([
tfPreprocessing.RandomRotation(factor=0.02),
tfPreprocessing.RandomZoom(height_factor=0.2, width_factor=0.2),
tfPreprocessing.Resizing(cfg['image_size'], cfg['image_size']),
tfPreprocessing.RandomFlip("horizontal")
])
return model

PyTorch error in trying to backward through the graph a second time

I'm trying to run this code: https://github.com/aitorzip/PyTorch-CycleGAN
I modified only the dataloader and transforms to be compatible with my data.
When trying to run it I get this error:
Traceback (most recent call last):
File "models/CycleGANs/train",
line 150, in
loss_D_A.backward()
File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in
backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File
"/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py",
line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: Trying to backward through the graph a second time, but the saved intermediate
results have already been freed. Specify retain_graph=True when
calling backward the first time.
This is the train loop up to the point of error:
for epoch in range(opt.epoch, opt.n_epochs):
for i, batch in enumerate(dataloader):
# Set model input
real_A = Variable(input_A.copy_(batch['A']))
real_B = Variable(input_B.copy_(batch['B']))
##### Generators A2B and B2A #####
optimizer_G.zero_grad()
# Identity loss
# G_A2B(B) should equal B if real B is fed
same_B = netG_A2B(real_B)
loss_identity_B = criterion_identity(same_B, real_B)*5.0
# G_B2A(A) should equal A if real A is fed
same_A = netG_B2A(real_A)
loss_identity_A = criterion_identity(same_A, real_A)*5.0
# GAN loss
fake_B = netG_A2B(real_A)
pred_fake = netD_B(fake_B)
loss_GAN_A2B = criterion_GAN(pred_fake, target_real)
fake_A = netG_B2A(real_B)
pred_fake = netD_A(fake_A)
loss_GAN_B2A = criterion_GAN(pred_fake, target_real)
# Cycle loss
# TODO: cycle loss doesn't allow for multimodality. I leave it for now but needs to be thrown out later
recovered_A = netG_B2A(fake_B)
loss_cycle_ABA = criterion_cycle(recovered_A, real_A)*10.0
recovered_B = netG_A2B(fake_A)
loss_cycle_BAB = criterion_cycle(recovered_B, real_B)*10.0
# Total loss
loss_G = loss_identity_A + loss_identity_B + loss_GAN_A2B + loss_GAN_B2A + loss_cycle_ABA + loss_cycle_BAB
loss_G.backward()
optimizer_G.step()
##### Discriminator A #####
optimizer_D_A.zero_grad()
# Real loss
pred_real = netD_A(real_A)
loss_D_real = criterion_GAN(pred_real, target_real)
# Fake loss
fake_A = fake_A_buffer.push_and_pop(fake_A)
pred_fale = netD_A(fake_A.detach())
loss_D_fake = criterion_GAN(pred_fake, target_fake)
# Total loss
loss_D_A = (loss_D_real + loss_D_fake)*0.5
loss_D_A.backward()
I am not familiar at all what it means. My guess is it's something to do with fake_A_buffer. It's just a fake_A_buffer = ReplayBuffer()
class ReplayBuffer():
def __init__(self, max_size=50):
assert (max_size > 0), 'Empty buffer or trying to create a black hole. Be careful.'
self.max_size = max_size
self.data = []
def push_and_pop(self, data):
to_return = []
for element in data.data:
element = torch.unsqueeze(element, 0)
if len(self.data) < self.max_size:
self.data.append(element)
to_return.append(element)
else:
if random.uniform(0,1) > 0.5:
i = random.randint(0, self.max_size-1)
to_return.append(self.data[i].clone())
self.data[i] = element
else:
to_return.append(element)
return Variable(torch.cat(to_return))
Error after setting `loss_G.backward(retain_graph=True)
Traceback (most recent call last): File "models/CycleGANs/train",
line 150, in
loss_D_A.backward() File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in
backward
torch.autograd.backward(self, gradient, retain_graph, create_graph) File
"/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py",
line 130, in backward
Variable._execution_engine.run_backward( RuntimeError: one of the variables needed for gradient computation has been modified by an
inplace operation: [torch.FloatTensor [3, 64, 7, 7]] is at version 2;
expected version 1 instead. Hint: enable anomaly detection to find the
operation that failed to compute its gradient, with
torch.autograd.set_detect_anomaly(True).
And after setting torch.autograd.set_detect_anomaly(True)
/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py:130:
UserWarning: Error detected in MkldnnConvolutionBackward. Traceback of
forward call that caused the error:
File "models/CycleGANs/train",
line 115, in
fake_B = netG_A2B(real_A)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py",
line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/Histology-Style-Transfer-Research/models/CycleGANs/models.py",
line 67, in forward
return self.model(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py",
line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py",
line 117, in forward
input = module(input)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py",
line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/Histology-Style-Transfer-Research/models/CycleGANs/models.py",
line 19, in forward
return x + self.conv_block(x)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py",
line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py",
line 117, in forward
input = module(input)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py",
line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py",
line 423, in forward
return self._conv_forward(input, self.weight)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/conv.py",
line 419, in _conv_forward
return F.conv2d(input, weight, self.bias, self.stride, (Triggered internally at
/opt/conda/conda-bld/pytorch_1603729096996/work/torch/csrc/autograd/python_anomaly_mode.cpp:104.)
Variable._execution_engine.run_backward(
Traceback (most recent call
last): File "models/CycleGANs/train", line 133, in
loss_G.backward(retain_graph=True)
File "/opt/conda/lib/python3.8/site-packages/torch/tensor.py", line 221, in
backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File
"/opt/conda/lib/python3.8/site-packages/torch/autograd/init.py",
line 130, in backward
Variable._execution_engine.run_backward( RuntimeError: Function 'MkldnnConvolutionBackward' returned nan values in its 2th output.
loss_G.backward() should be loss_G.backward(retain_graph=True) this is because when you use backward normally it doesn't record the operations it performs in the backward pass, retain_graph=True is telling to do so.

How to ensure at least 2 of n classes are included in training data

I am currently training a CNN. One of the metrics I am using is AUC. One issue I have noticed is that sometimes my generator will only select examples from one class (I have 3 classes in this project). So if my batch size is 20 it will sometimes randomly select 20 examples from class one for 1 epoch. If this happens then I get an error stating that AUC cannot be calculated with only one class and then the training ends.
Is there a way to make a condition in the generator that more or less states you need at least 2 of the n classes? Without having to use tf.metrics.auc
Thank you
# load training data
def load_train_data_batch_generator(batch_size=32, rows_in=48, cols_in=48, zs_in=32,
channels_in=2, num_classes=3,
dir_dict=dir_dict):
# dir_in_train = main_dir + '/test_CT_PET_combo'
# required when using hyperopt
batch_size = int(batch_size)
# if not: TypeError: 'float' object cannot be interpreted as an integer
fnames = os.listdir(dir_dict['dir_in_train_combo'])
y_train = np.zeros((batch_size, num_classes))
x_train = np.zeros((batch_size, rows_in, cols_in, zs_in, channels_in))
while True:
count = 0
for fname in np.random.choice(fnames, batch_size, replace=False):
data_label = scipy.io.loadmat(os.path.join(dir_dict['dir_out_train'], fname))['output']
# changing one hot encoding to integer
integer_label = np.argmax(data_label[0], axis=0)
y_train[count,:] = data_label
# Loading train ct w/ c and pet/ct combo
train_combo = scipy.io.loadmat(os.path.join(dir_dict['dir_in_train_combo'], fname))[fname]
x_train[count,:,:,:,:] = train_combo
count += 1
yield(x_train, y_train)
Per request: code for metric and error
Metric code
def sk_auroc(y_true, y_pred):
import tensorflow as tf
from sklearn.metrics import roc_auc_score
return tf.py_func(roc_auc_score, (y_true, y_pred), tf.double)
Epoch 1/200
57/205 [=======>......................] - ETA: 11s - loss: 1.2858 - acc: 0.3632 - sk_auroc: 0.4581 - auc: 0.5380ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
Traceback (most recent call last):
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 158, in __call__
ret = func(*args)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 277, in roc_auc_score
sample_weight=sample_weight)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/sklearn/metrics/base.py", line 118, in _average_binary_score
sample_weight=score_weight)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 268, in _binary_roc_auc_score
raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
[[Node: metrics_1/sk_auroc/PyFunc = PyFunc[Tin=[DT_FLOAT, DT_FLOAT], Tout=[DT_DOUBLE], token="pyfunc_24", _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_predictions_target_1_0_1, predictions_1/Softmax/_857)]]
Traceback (most recent call last):
File "<ipython-input-48-34101247f335>", line 8, in optimize_cnn
model, results = train_model(space)
File "<ipython-input-47-254bd056a344>", line 40, in train_model
validation_steps=round(len(os.listdir(dir_out_val))/space['batch_size'])
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/keras/legacy/interfaces.py", line 91, in wrapper
return func(*args, **kwargs)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1418, in fit_generator
initial_epoch=initial_epoch)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/keras/engine/training_generator.py", line 217, in fit_generator
class_weight=class_weight)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/keras/engine/training.py", line 1217, in train_on_batch
outputs = self.train_function(ins)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1454, in __call__
self._session._session, self._handle, args, status, None)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
Traceback (most recent call last):
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/tensorflow/python/ops/script_ops.py", line 158, in __call__
ret = func(*args)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 277, in roc_auc_score
sample_weight=sample_weight)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/sklearn/metrics/base.py", line 118, in _average_binary_score
sample_weight=score_weight)
File "/home/mikedoho/anaconda3/lib/python3.6/site-packages/sklearn/metrics/ranking.py", line 268, in _binary_roc_auc_score
raise ValueError("Only one class present in y_true. ROC AUC score "
ValueError: Only one class present in y_true. ROC AUC score is not defined in that case.
[[Node: metrics_1/sk_auroc/PyFunc = PyFunc[Tin=[DT_FLOAT, DT_FLOAT], Tout=[DT_DOUBLE], token="pyfunc_24", _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_predictions_target_1_0_1, predictions_1/Softmax/_857)]]
tf.metrics.auc code and the picture showing the reason I dont really like it
# converting tf metric in keras metric
def as_keras_metric(method):
import functools
from keras import backend as K
import tensorflow as tf
#functools.wraps(method)
def wrapper(self, args, **kwargs):
""" Wrapper for turning tensorflow metrics into keras metrics """
value, update_op = method(self, args, **kwargs)
K.get_session().run(tf.local_variables_initializer())
with tf.control_dependencies([update_op]):
value = tf.identity(value)
return value
return wrapper
tf_auc_roc = as_keras_metric(tf.metrics.auc)
Seems like the tf.metrics.auc is too smooth and something might be off that I will have to look into later
You can use tf.metrics.auc in tensorflow instead of sklearn.metrics.roc_auc_score in sklearns. For example:
import tensorflow as tf
label = tf.Variable([1,0,0,0,1])
pred = tf.Variable([0.8,1,0.6,0.23,0.78])
auc,op = tf.metrics.auc(label,pred)
with tf.Session()as sess:
init = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
sess.run(init)
for i in range(3):
auc_value, op_value = sess.run([auc,op])
print(auc_value)
0.0
0.6666667
0.66666657
There will be no problem with you.

Categories