I am trying to build an autoencoder model where the input/output is RGB images of size 256 x 256. I tried to train the model on 1 GPU with 12 GB of memory, but I always get a CUDA OOM error (I tried different batch sizes, and even a batch size of 1 fails). So I read about model parallelism in PyTorch and tried this:
class Autoencoder(nn.Module):
    def __init__(self, input_output_size):
        super(Autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_output_size, 1024),
            nn.ReLU(True),
            nn.Linear(1024, 200),
            nn.ReLU(True)
        ).cuda(0)

        self.decoder = nn.Sequential(
            nn.Linear(200, 1024),
            nn.ReLU(True),
            nn.Linear(1024, input_output_size),
            nn.Sigmoid()
        ).cuda(1)

        print(self.encoder.get_device())
        print(self.decoder.get_device())

    def forward(self, x):
        x = x.cuda(0)
        x = self.encoder(x)
        x = x.cuda(1)
        x = self.decoder(x)
        return x
So I have moved my encoder and decoder onto different GPUs. But now I get this exception:
Expected tensor for 'out' to have the same device as tensor for argument #2 'mat1'; but device 0 does not equal 1 (while checking arguments for addmm)
It appears when I do x = x.cuda(1) in the forward method.
Moreover, here is my "train" code; maybe you can give me some advice about optimizations? Are images of 3 x 256 x 256 too large for training? (I cannot reduce them.) Thank you in advance.
Training:
input_output_size = 3 * 256 * 256

model = Autoencoder(input_output_size).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

for epoch in range(100):
    epoch_loss = 0
    for batch_idx, (images, _) in enumerate(dataloader):
        images = torch.flatten(images, start_dim=1).to(device)
        output_images = model(images).to(device)
        train_loss = criterion(output_images, images)
        train_loss.backward()
        optimizer.step()

        if batch_idx % 5 == 0:
            with torch.no_grad():
                model.eval()
                pred = model(test_set).to(device)
                model.train()
            test_loss = criterion(pred, test_set)
            wandb.log({"MSE train": train_loss})
            wandb.log({"MSE test": test_loss})
            del pred, test_loss

        if batch_idx % 200 == 0:
            # here I send testing images from output to W&B
            with torch.no_grad():
                model.eval()
                pred = model(test_set).to(device)
                model.train()
            wandb.log({"PRED": [wandb.Image((pred[i].cpu().reshape((3, 256, 256)).permute(1, 2, 0) * 255).numpy().astype(np.uint8), caption=str(i)) for i in range(20)]})
            del pred

        gc.collect()
        torch.cuda.empty_cache()
        epoch_loss += train_loss.item()
        del output_images, train_loss

    epoch_loss = epoch_loss / len(dataloader)
    wandb.log({"Epoch MSE train": epoch_loss})
    del epoch_loss
Three issues that I'm seeing:
model(test_set)
This is where you send the entirety of your test set (presumably huge) through your model as a single batch.
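A minimal sketch of evaluating the test set in mini-batches instead, assuming a single device for simplicity (the DataLoader and batch size below are illustrative, not the asker's code):

from torch.utils.data import DataLoader, TensorDataset

test_loader = DataLoader(TensorDataset(test_set), batch_size=32)

model.eval()
total_loss, n = 0.0, 0
with torch.no_grad():
    for (batch,) in test_loader:
        batch = batch.to(device)              # assumes one device for clarity
        pred = model(batch).to(device)
        total_loss += criterion(pred, batch).item() * batch.size(0)
        n += batch.size(0)
test_loss = total_loss / n
model.train()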
I don't know what wandb is, but another likely source of memory growth is these lines:
wandb.log({"MSE train": train_loss})
wandb.log({"MSE test": test_loss})
You seem to be saving train_loss and test_loss, but these contain not only the numbers themselves but also the computational graphs (living on the GPU) needed for backprop. Before saving them, you want to convert them into plain floats or numpy values.
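For example, a minimal sketch that logs plain Python floats via .item(), reusing the question's keys, so no computation graph is kept alive:

wandb.log({"MSE train": train_loss.item()})
wandb.log({"MSE test": test_loss.item()})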
Your model contains two 3*256*256 x 1024 weight blocks. When used with Adam, each of these will require 3*256*256 x 1024 * 3 * 4 bytes = 2.25 GB of VRAM (possibly more if it's inefficiently implemented). This looks like a poor architecture for other reasons as well.
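The arithmetic behind that figure, as a quick sanity check (float32 weights plus Adam's two state tensors per weight):

params = 3 * 256 * 256 * 1024        # 201,326,592 weights in one Linear layer
bytes_needed = params * 3 * 4        # weight + exp_avg + exp_avg_sq, 4 bytes each
print(bytes_needed / 2**30)          # ~2.25 GiB per block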
Related
I get this error in the training loop for this neural network:
class YourModel(torch.nn.Module):
    def __init__(self):
        super(YourModel, self).__init__()
        self.fc1 = nn.Linear(50, 128)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x1, x2):
        x = torch.cat((x1, x2), dim=1)
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)
        return out

model = YourModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.BCELoss()
My dataloader contains 3 datasets: one with 25 features for 8000 documents, another with 25 features for 8000 queries, and the last one with the relation between the two (0 or 1). That's why I'm using a neural network for binary classification. (However, if you know an alternative neural network, I'm open to options.)
My batch_size is 1 right now, and here is my training loop:
def train(dataloader, model, loss_fn, optimizer):
    model.train()
    train_loss = 0
    num_batches = len(dataloader)
    all_pred = []
    all_real = []
    for batch, i in enumerate(train_dataloader):  # access to each batch
        i_1 = i[0]
        i_2 = i[1]
        y = i[2].float().view(1, 1)  # find relevance
        #y = torch.clamp(y, min=0, max=1)
        #x = np.hstack((i_1, i_2))
        #x = torch.Tensor(x)
        #x = torch.clamp(x, min=0, max=1)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass
        y_pred = model(i_1, i_2).float()
        y_pred = torch.clamp(y_pred, min=0, max=1)
        loss = loss_fn(y_pred, y)

        # Backward pass
        loss.backward()

        # Update the parameters
        optimizer.step()

        train_loss += loss.item()  # sum the loss
        all_pred.append(y_pred)
        all_real.append(y)

        if batch > 0 and batch % 1000 == 0:
            print(f"Partial loss: {train_loss/batch}, F1: {f1_score(all_real, all_pred)}")

    train_loss /= num_batches
    print(f"Total loss: {train_loss}")  # print loss of every epoch
    return train_loss
I'm getting this error: "Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.", but as far as I know I'm not calling numpy on any tensors. And if I use the detach method, I get an error saying that the loss cannot be computed because the tensor does not require grad. So it is pretty much a loop.
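A hedged guess at a way out of that loop (an assumption on my part, not a confirmed fix): detach only the copies stored for the sklearn f1_score metric, and leave y_pred itself attached so the loss can still backpropagate. Rounding the prediction to 0/1 for the binary F1 is also an assumption:

loss = loss_fn(y_pred, y)   # y_pred stays attached to the graph, so backward() works
loss.backward()
optimizer.step()

# detached CPU copies are used only for metrics, never for the loss
all_pred.append(y_pred.detach().cpu().numpy().round())
all_real.append(y.detach().cpu().numpy())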
I have a dataset of laser welding images of size 300*300 which contains two classes: bad and good weld seams. I have followed the PyTorch fine-tuning tutorial for an Inception-v3 classifier.
On the other hand, I also built a custom CNN with 3 conv layers and 3 fc layers. What I observed is that fine-tuning shows a lot of variation in validation accuracy; basically, I see a different maximum accuracy every time I train my model. Plus, my accuracy with fine-tuning is much lower than with my custom CNN! For example, the accuracy for my synthetic images from a GAN is 86% with Inception-v3, while it is 94% with my custom CNN. The real data shows almost similar behaviour and accuracy for both networks, although accuracy with the custom CNN is about 2% higher.
I trained with different training-set sizes of 200, 500 and 1000 images (half of them for each class, e.g. for 200 images we have 100 good and 100 bad). I also include a resize transform to 224 in my train_loader; in the fine-tuning tutorial, this resize is automatically done to 299 for Inception-v3. For each trial, the validation set and its content are constant.
Do you know what causes this behavior? Is it because my dataset is so different from the pretrained model's classes? Am I not supposed to get better results with fine-tuning?
My custom CNN:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(16, 24, 5)
        self.fc1 = nn.Linear(13824, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 2)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = self.pool(F.relu(self.conv3(x)))
        #x = x.view(-1, 16 * 5 * 5)
        x = x.view(x.size(0), -1)
        #print(x.shape)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        #x = F.softmax(x, dim=1)
        return x

model = Net()
criterion = nn.CrossEntropyLoss()
#optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
model.to(device)
with a training loop of:
epochs = 15
steps = 0
running_loss = 0
print_every = 10
train_losses, test_losses = [], []
train_acc, test_acc = [], []

for epoch in range(epochs):
    for inputs, labels in trainloader:
        steps += 1
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        logps = model.forward(inputs)
        loss = criterion(logps, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()

        if steps % print_every == 0:
            test_loss = 0
            accuracy = 0
            model.eval()
            with torch.no_grad():
                for inputs, labels in testloader:
                    inputs, labels = inputs.to(device), labels.to(device)
                    logps = model.forward(inputs)
                    batch_loss = criterion(logps, labels)
                    test_loss += batch_loss.item()

                    ps = torch.exp(logps)
                    top_p, top_class = ps.topk(1, dim=1)
                    equals = top_class == labels.view(*top_class.shape)
                    accuracy += torch.mean(equals.type(torch.FloatTensor)).item()

            train_losses.append(running_loss/len(trainloader))
            test_losses.append(test_loss/len(testloader))
            #train_acc.append(running_loss/len(trainloader))
            test_acc.append(accuracy/len(testloader))

            print(f"Epoch {epoch+1}/{epochs}.. "
                  f"Train loss: {running_loss/print_every:.3f}.. "
                  f"Test loss: {test_loss/len(testloader):.3f}.. "
                  f"Test accuracy: {accuracy/len(testloader):.3f}")

            running_loss = 0
            model.train()
Here is my theory:
Pre-training is useful when you want to leverage already existing data to help the model train on similar data for which you have few instances. At least this was the reasoning behind the Unet architecture in medical image segmentation.
Now, to me the key is in the notion of "similar". If your network has been pre-trained on cats and dogs and you want to extrapolate to weld seams, there's a chance your pre-training is not helping, or is even getting in the way of the model training properly.
Why?
When training your CNN from scratch you get randomly initialized weights, whereas using a pre-trained network you get pre-trained weights. If the features you are extracting are similar across datasets, then you get a head start by having the network already attuned to these features.
For example, cats and dogs share similar spatial features visually (eye position, nose, ears...). So there's a chance that you converge to a local minimum faster during training, since you are already starting from a good base that just needs to adapt to the specifics of your new data.
Conclusions:
If the similarity assumption does not hold, it means your model would have to "unlearn" what it has already learned to adapt to the new specifics of your dataset, and I guess that would be the reason why training is more difficult and does not give as good results as a blank-slate CNN (especially if you don't have that much data).
PS: I'd be curious to see if your pre-trained model ends up catching up with your CNN if you give it more epochs to train.
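For reference, a rough sketch of a standard Inception-v3 fine-tuning setup for the 2-class weld-seam case, assuming torchvision's inception_v3 (the layer names follow torchvision's model; the freezing strategy is illustrative, not the asker's exact tutorial code):

import torch.nn as nn
import torchvision.models as models

model = models.inception_v3(pretrained=True)   # expects 299x299 inputs

# Replace both classification heads for 2 classes.
model.AuxLogits.fc = nn.Linear(model.AuxLogits.fc.in_features, 2)
model.fc = nn.Linear(model.fc.in_features, 2)

# Optionally freeze the pretrained backbone so only the new heads are updated
# at first; unfreeze later for full fine-tuning with more epochs.
for name, param in model.named_parameters():
    if "fc" not in name:
        param.requires_grad = False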
I am working on a multilabel text classification task with Bert.
The following is the code for generating an iterable Dataset.
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

train_set = TensorDataset(X_train_id, X_train_attention, y_train)
test_set = TensorDataset(X_test_id, X_test_attention, y_test)

train_dataloader = DataLoader(
    train_set,
    sampler=RandomSampler(train_set),
    drop_last=True,
    batch_size=13
)

test_dataloader = DataLoader(
    test_set,
    sampler=SequentialSampler(test_set),
    drop_last=True,
    batch_size=13
)
The following are the dimensions of the training set:
In[]
print(X_train_id.shape)
print(X_train_attention.shape)
print(y_train.shape)
Out[]
torch.Size([262754, 512])
torch.Size([262754, 512])
torch.Size([262754, 34])
There should be 262754 rows, each with 512 columns. The output should predict values from 34 possible labels. I am breaking them down into batches of 13.
Training code
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training
def train(model):
    model.train()
    train_loss = 0
    for batch in train_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        optimizer.zero_grad()
        loss, logits = model(b_input_ids,
                             token_type_ids=None,
                             attention_mask=b_input_mask,
                             labels=b_labels)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        train_loss += loss.item()
    return train_loss

# Testing
def test(model):
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for batch in test_dataloader:
            b_input_ids = batch[0].to(device)
            b_input_mask = batch[1].to(device)
            b_labels = batch[2].to(device)

            with torch.no_grad():
                (loss, logits) = model(b_input_ids,
                                       token_type_ids=None,
                                       attention_mask=b_input_mask,
                                       labels=b_labels)
            val_loss += loss.item()
    return val_loss

# Train task
max_epoch = 1
train_loss_ = []
test_loss_ = []

for epoch in range(max_epoch):
    train_ = train(model)
    test_ = test(model)
    train_loss_.append(train_)
    test_loss_.append(test_)
Out[]
Expected input batch_size (13) to match target batch_size (442).
This is the description of my model:
from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained(
    "cl-tohoku/bert-base-japanese-whole-word-masking",  # Japanese pre-trained model
    num_labels = 34,
    output_attentions = False,
    output_hidden_states = False,
)
I have clearly stated that I want the batch size to be 13. However, during the training process PyTorch throws the error shown above.
Where is the number 442 even coming from? I have clearly stated that I want each batch to have a size of 13 rows.
I have already confirmed that each batch has input_ids with dimensions [13, 512], an attention tensor with dimensions [13, 512], and labels with dimensions [13, 34].
I have tried caving in and using a batch size of 442 when initializing the DataLoader, but after a single batch iteration it throws another "Expected input batch size to match target batch size" error, this time showing:
ValueError: Expected input batch_size (442) to match target batch_size (15028).
Why does the batch size keep changing? Where is the number 15028 even coming from?
The following are some of the answers I have looked through, but had no luck on applying to my source code:
https://discuss.pytorch.org/t/valueerror-expected-input-batch-size-324-to-match-target-batch-size-4/24498
https://discuss.pytorch.org/t/valueerror-expected-input-batch-size-1-to-match-target-batch-size-64/43071
Pytorch CNN error: Expected input batch_size (4) to match target batch_size (64)
Thanks in advance. Your support is truly appreciated :)
It looks like this model does not handle the multi-target scenario, according to the documentation:
labels (torch.LongTensor of shape (batch_size,), optional) – Labels for computing the sequence classification/regression loss. Indices should be in [0, ..., config.num_labels - 1]. If config.num_labels == 1 a regression loss is computed (Mean-Square loss), If config.num_labels > 1 a classification loss is computed (Cross-Entropy).
So you need to prepare your labels to have the shape torch.Size([batch_size]), with a class index in the range [0, ..., config.num_labels - 1], just like for PyTorch's original CrossEntropyLoss (see its examples section). That is also where 442 comes from: the model flattens your [13, 34] label tensor into 13 * 34 = 442 targets and compares it against only 13 rows of logits (and 442 * 34 = 15028 for the larger batch).
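If the task really is multi-label (several of the 34 labels can be active for the same document), one hedged alternative, not covered by the documentation quoted above, is to ask the model for raw logits and apply BCEWithLogitsLoss to them yourself. The names reuse the question's variables; the wiring is an assumption, not a confirmed fix:

import torch.nn as nn

loss_fct = nn.BCEWithLogitsLoss()

outputs = model(b_input_ids,
                token_type_ids=None,
                attention_mask=b_input_mask)   # no labels passed, so the model returns logits only
logits = outputs[0]                            # shape [13, 34]
loss = loss_fct(logits, b_labels.float())      # targets also [13, 34]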
I'm new to PyTorch and I'm trying to implement a simple CNN to recognize MNIST images.
I'm training the network using MSE loss as the loss function and SGD as the optimizer. When I get to the training, it gives me the following warning:
UserWarning: Using a target size (torch.Size([64])) that is different to the input size (torch.Size([64, 10])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
And then I get the following error:
RuntimeError: The size of tensor a (10) must match the size of tensor b (64) at non-singleton dimension 1.
I've tried to solve it using some solutions I've found in other questions, but nothing seems to work. Here's the code for how I load the dataset:
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,),(0.5,))])
trainset = torchvision.datasets.MNIST(root='./data', train = True, transform = transform, download = True)
trainloader = torch.utils.data.DataLoader(trainset, batch_size = 64, shuffle = True)
testset = torchvision.datasets.MNIST(root='./data', train = False, transform = transform, download = True)
testloader = torch.utils.data.DataLoader(testset, batch_size = 64, shuffle = False)
The code to define my network:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # Convolutional layers
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 12, 5)
        # Fully connected layers
        self.fc1 = nn.Linear(12*4*4, 120)
        self.fc2 = nn.Linear(120, 60)
        self.out = nn.Linear(60, 10)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2))
        x = x.reshape(-1, 12*4*4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.out(x)
        return x
And this is the training:
net = Net()
print(net)

criterion = nn.MSELoss()
optimizer = optim.SGD(net.parameters(), lr=0.001)

epochs = 3
for epoch in range(epochs):
    running_loss = 0
    for images, labels in trainloader:
        optimizer.zero_grad()
        output = net(images)
        loss = criterion(output, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
    else:
        print(f"Training loss: {running_loss/len(trainloader)}")

print('Finished training')
Thank you!
The loss you're using (nn.MSELoss) is incorrect for this problem. You should use nn.CrossEntropyLoss.
Mean Squared Loss measures the mean squared error between input x and target y. Here the input and target naturally should be of the same shape.
Cross Entropy Loss computes the probability over the classes for each image. The output would be a matrix N x C and target would be a vector of size N. (N = batch size, C = number of classes)
Since your aim is to classify the image, this is what you'll want to use.
In your case, your network output will be a matrix of size 64 x 10 and the target is a vector of size 64. Each row of the output matrix (after applying the softmax function) indicates the probability of that class, after which the cross-entropy loss is computed. PyTorch's nn.CrossEntropyLoss combines the softmax operation with the loss computation.
You can refer to the documentation for more info on how PyTorch computes losses.
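As a minimal sketch of the suggested change, reusing the question's Net, trainloader and optimizer (only the criterion differs from the original code):

criterion = nn.CrossEntropyLoss()     # instead of nn.MSELoss()

for images, labels in trainloader:
    optimizer.zero_grad()
    output = net(images)              # shape [64, 10], raw logits
    loss = criterion(output, labels)  # labels: shape [64], class indices 0..9
    loss.backward()
    optimizer.step()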
I agree with #AshwinNair's advice, and I also changed the for loop in the train and eval sections as below; it worked for me.

for i, (img, label) in enumerate(dataloader):
    img = img.to(device)
    label = label.to(device)
I'm trying to make a back-propagation neural network with PyTorch. I can successfully execute it and test its accuracy, but it doesn't work very efficiently. Now, I'm supposed to increase its efficiency by setting different activation rules for neurons, so that neurons which don't contribute to the final output get excluded (pruned) from the computations, thereby improving runtime and accuracy.
My code looks like this (extracted snippets):
# Hyper Parameters
input_size = 20
hidden_size = 50
num_classes = 130
num_epochs = 500
batch_size = 5
learning_rate = 0.1

# normalise input data
for column in data:
    # the last column is target
    if column != data.shape[1] - 1:
        data[column] = data.loc[:, [column]].apply(lambda x: (x - x.mean()) / x.std())

# randomly split data into training set (80%) and testing set (20%)
msk = np.random.rand(len(data)) < 0.8
train_data = data[msk]
test_data = data[~msk]

# define train dataset and a data loader
train_dataset = DataFrameDataset(df=train_data)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)

# Neural Network
class Net(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.sigmoid = nn.Sigmoid()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.sigmoid(out)
        out = self.fc2(out)
        return out

net = Net(input_size, hidden_size, num_classes)

# train the model by batch
for epoch in range(num_epochs):
    for step, (batch_x, batch_y) in enumerate(train_loader):
        # convert torch tensor to Variable
        X = Variable(batch_x)
        Y = Variable(batch_y.long())

        # Forward + Backward + Optimize
        optimizer.zero_grad()  # zero the gradient buffer
        outputs = net(X)
        loss = criterion(outputs, Y)
        all_losses.append(loss.data[0])
        loss.backward()
        optimizer.step()

        if epoch % 50 == 0:
            _, predicted = torch.max(outputs, 1)
            # calculate and print accuracy
            total = predicted.size(0)
            correct = predicted.data.numpy() == Y.data.numpy()
            print('Epoch [%d/%d], Step [%d/%d], Loss: %.4f, Accuracy: %.2f %%' % (epoch + 1, num_epochs, step + 1, len(train_data) // batch_size + 1, loss.data[0], 100 * sum(correct) / total))
Can someone tell me how to do that in PyTorch, as I'm very new to it?
I'm not sure if this question belongs on Stack Overflow, but I will give you a hint anyway. You are currently working with a sigmoid activation function, whose gradient vanishes if the input value is too large or too small. A commonly used alternative is the ReLU activation function (short for rectified linear unit).
ReLU(x) is the identity on the positive domain and 0 on the negative domain; in Python it could be written as follows:
def ReLU(x):
    if x > 0:
        return x
    else:
        return 0
It should be readily available in PyTorch
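For instance, a small sketch of the question's Net with the built-in nn.ReLU swapped in for the sigmoid (everything else mirrors the original model; whether this alone addresses the pruning requirement is a separate question):

import torch.nn as nn

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()            # replaces nn.Sigmoid()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out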