Keras .fit giving better performance than manual Tensorflow - python

I'm new to TensorFlow and Keras. To get started, I followed the https://www.tensorflow.org/tutorials/quickstart/advanced tutorial. I'm now adapting it to train on CIFAR10 instead of the MNIST dataset. I recreated this model https://keras.io/examples/cifar10_cnn/ and I'm trying to run it in my own codebase.
Logically, if the model, batch size and optimizer are all the same, then the two should perform identically, but they don't. I thought it might be that I was making a mistake in preparing the data, so I copied the model.fit call from the Keras example into my script, and it still performs better. Using .fit gives me around 75% accuracy in 25 epochs, while with the manual method it takes around 60 epochs. With .fit I also achieve a slightly better maximum accuracy.
What I want to know is: Is .fit doing something behind the scenes that's optimizing training? What do I need to add to my code to get the same performance? Am I doing something obviously wrong?
Thanks for your time.
Main code:
import tensorflow as tf
from tensorflow import keras
import msvcrt
from Plotter import Plotter
#########################Configuration Settings#############################
BatchSize = 32
ModelName = "CifarModel"
############################################################################
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
print("x_train",x_train.shape)
print("y_train",y_train.shape)
print("x_test",x_test.shape)
print("y_test",y_test.shape)
x_train, x_test = x_train / 255.0, x_test / 255.0
# Convert class vectors to binary class matrices.
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
train_ds = tf.data.Dataset.from_tensor_slices(
    (x_train, y_train)).batch(BatchSize)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(BatchSize)
loss_object = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.0001,decay=1e-6)
# Create an instance of the model
model = ModelManager.loadModel(ModelName,10)
train_loss = tf.keras.metrics.Mean(name='train_loss')
train_accuracy = tf.keras.metrics.CategoricalAccuracy(name='train_accuracy')
test_loss = tf.keras.metrics.Mean(name='test_loss')
test_accuracy = tf.keras.metrics.CategoricalAccuracy(name='test_accuracy')
########### Using this function I achieve better results ##################
model.compile(loss='categorical_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])
model.fit(x_train, y_train,
          batch_size=BatchSize,
          epochs=100,
          validation_data=(x_test, y_test),
          shuffle=True,
          verbose=2)
############################################################################
########### Using the below code I achieve worse results ##################
@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        predictions = model(images, training=True)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(labels, predictions)

@tf.function
def test_step(images, labels):
    predictions = model(images, training=False)
    t_loss = loss_object(labels, predictions)
    test_loss(t_loss)
    test_accuracy(labels, predictions)

epoch = 0
InterruptLoop = False
while InterruptLoop == False:
    # Shuffle training data
    train_ds.shuffle(1000)
    epoch = epoch + 1
    # Reset the metrics at the start of the next epoch
    train_loss.reset_states()
    train_accuracy.reset_states()
    test_loss.reset_states()
    test_accuracy.reset_states()
    for images, labels in train_ds:
        train_step(images, labels)
    for test_images, test_labels in test_ds:
        test_step(test_images, test_labels)
    # Use local names so the metric objects are not overwritten
    test_acc = test_accuracy.result() * 100
    train_acc = train_accuracy.result() * 100
    # Print update to console
    template = 'Epoch {}, Loss: {}, Accuracy: {}, Test Loss: {}, Test Accuracy: {}'
    print(template.format(epoch,
                          train_loss.result(),
                          train_acc,
                          test_loss.result(),
                          test_acc))
    # Check if keyboard pressed
    while msvcrt.kbhit():
        char = str(msvcrt.getch())
        if char == "b'q'":
            InterruptLoop = True
            print("Stopping loop")
The model:
from tensorflow.keras.layers import Dense, Flatten, Conv2D, Dropout, MaxPool2D
from tensorflow.keras import Model
class ModelData(Model):
    def __init__(self, NumberOfOutputs):
        super(ModelData, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu', padding='same', input_shape=(32,32,3))
        self.conv2 = Conv2D(32, 3, activation='relu')
        self.maxpooling1 = MaxPool2D(pool_size=(2,2))
        self.dropout1 = Dropout(0.25)
        ############################
        self.conv3 = Conv2D(64, 3, activation='relu', padding='same')
        self.conv4 = Conv2D(64, 3, activation='relu')
        self.maxpooling2 = MaxPool2D(pool_size=(2,2))
        self.dropout2 = Dropout(0.25)
        ############################
        self.flatten = Flatten()
        self.d1 = Dense(512, activation='relu')
        self.dropout3 = Dropout(0.5)
        self.d2 = Dense(NumberOfOutputs, activation='softmax')

    def call(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.maxpooling1(x)
        x = self.dropout1(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.maxpooling2(x)
        x = self.dropout2(x)
        x = self.flatten(x)
        x = self.d1(x)
        x = self.dropout3(x)
        x = self.d2(x)
        return x

I know this question already has an answer, but I faced the same problem, and the solution turned out to be something different that isn't specified in the documentation.
I copy and paste here the answer (and the relevant link) I found on GitHub, which solved the issue in my case:
The problem is caused by broadcasting in your loss function in the custom loop. Make sure that the dimensions of predictions and labels are equal. At the moment (for MAE) they are [128,1] and [128]. Just make use of tf.squeeze or tf.expand_dims.
Link: https://github.com/tensorflow/tensorflow/issues/28394
Basic translation: when computing the loss, always be sure of the tensors' shapes.
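For illustration, here is a minimal sketch (not from the original post or the linked issue) of how a [128,1] vs [128] mismatch silently broadcasts inside a Keras loss, and how tf.squeeze fixes it:

import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()

labels = tf.constant([1.0, 2.0, 3.0, 4.0])               # shape [4]
predictions = tf.constant([[1.0], [2.0], [3.0], [4.0]])  # shape [4, 1]

# The subtraction broadcasts to a [4, 4] matrix, so even though every
# prediction matches its label, the reported loss is 2.5 instead of 0.
print(mse(labels, predictions).numpy())

# Squeezing the predictions (or expanding the labels) restores an
# element-wise [4] vs [4] comparison and gives the expected 0.0.
print(mse(labels, tf.squeeze(predictions, axis=-1)).numpy())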

Mentioning the solution here (in the answer section), even though it is present in the comments, for the benefit of the community.
On the same dataset, accuracy can differ between Keras Model.fit and a model trained with a custom TensorFlow loop mainly when the data is shuffled: shuffling changes how the data is split between training and testing (or validation), so the two runs (Keras and TensorFlow) end up training and evaluating on different data.
If we want to observe similar results on the same dataset with a similar architecture in Keras and in TensorFlow, we can turn off shuffling of the data.
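As a rough sketch based on the code in the question, turning shuffling off on both sides makes the comparison deterministic:

# Keras side: disable shuffling in fit (the question uses shuffle=True).
model.fit(x_train, y_train,
          batch_size=BatchSize,
          epochs=100,
          validation_data=(x_test, y_test),
          shuffle=False,
          verbose=2)

# Custom-loop side: build the dataset without any .shuffle() call,
# so batches arrive in the same fixed order every epoch.
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(BatchSize)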
Hope this helps. Happy Learning!

Related

How do I fix this size of tensor error for my NN classifier PyTorch

I'm having trouble understanding why this is throwing an error. This code is pulled directly from the PyTorch documentation for an NN classifier for the Fashion-MNIST dataset. However, when I try to switch it to the MNIST handwritten digits dataset, it comes up with the following error:
RuntimeError: The size of tensor a (10) must match the size of tensor b (64) at non-singleton dimension 1
This occurs when the loss function is called during the training loop. Can anyone help me understand why this is happening? Thanks!
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, Lambda
import torchvision.models as models
import matplotlib.pyplot as plt
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
#device = "cpu"
print(f"Using {device} device")
training_data = datasets.MNIST(
    root="data",
    train=True,
    download=True,
    transform=ToTensor()
)
test_data = datasets.MNIST(
    root="data",
    train=False,
    download=True,
    transform=ToTensor()
)
train_dataloader = DataLoader(training_data, batch_size=64)
test_dataloader = DataLoader(test_data, batch_size=64)
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        X, y = X.to(device), y.to(device)
        pred = model(X)
        loss = loss_fn(pred, y)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if batch % 100 == 0:
            loss, current = loss.item(), batch * len(X)
            print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")

def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            X, y = X.to(device), y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= num_batches
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")

def save_checkpoint(state, filename="checkpoint.pth.tar"):
    print("=> Saving checkpoint")
    torch.save(state, filename)

model = NeuralNetwork().to(device)
learning_rate = 1e-3
batch_size = 64
epochs = 10
# Initialize the loss function
loss_fn = nn.MSELoss()
optimiser = torch.optim.Adam(model.parameters(), lr=learning_rate)

for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimiser)
    test_loop(test_dataloader, model, loss_fn)
print("Done!")
torch.nn.MSELoss is an implementation of mean squared error. You can't measure the difference between two tensors if they're different sizes (MSELoss does not allow for broadcasting). So if you're using MSELoss, then the predictions and the targets must be the same shape. In your case, preds is a tensor of shape [64, 10], and y is a tensor of shape [64].
The reason y is of shape [64] rather than [64, 10] is that most classification dataset implementations represent targets as integer labels rather than one-hot encoded vectors. In theory, you could convert these integer label targets to one-hot encoded targets.
But in reality, since this is a classification problem, you should probably be using something like nn.CrossEntropyLoss rather than nn.MSELoss. The former is a conventional classification loss function, and it allows the targets to be integer labels rather than one-hot labels (so just swapping out MSELoss for CrossEntropyLoss should solve your problem). MSELoss is better suited for regression tasks and such.
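As a minimal sketch of the suggested fix (keeping the rest of the training script from the question unchanged), only the loss function needs to change:

import torch.nn as nn
import torch.nn.functional as F

# Swap the regression loss for a classification loss. CrossEntropyLoss
# expects raw logits of shape [batch, 10] and integer class labels of
# shape [batch], which is exactly what the model and the MNIST
# dataloader already produce.
loss_fn = nn.CrossEntropyLoss()

# If you wanted to keep MSELoss instead, the integer targets would first
# have to be one-hot encoded to shape [batch, 10], e.g.:
# loss = nn.MSELoss()(pred, F.one_hot(y, num_classes=10).float())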

Model Accuracy is High but Val_Accuracy is low

I'm trying to improve my validation accuracy, as it is very low. I have tried changing the batch size and the number of images used for validation and training, and added extra dense layers, but none of that has worked. The dataset I'm using has not been split into training and validation sets yet, which is what I do using partitioning. I have given the values for the samples as you can see below, and I have tried to increase VALIDATION_SAMPLES, but when I do, my cluster keeps crashing.
TRAINING_SAMPLES = 10000
VALIDATION_SAMPLES = 2000
TEST_SAMPLES = 2000
IMG_WIDTH = 178
IMG_HEIGHT = 218
BATCH_SIZE = 32
NUM_EPOCHS = 20
def generate_df(partition, attr, num_samples):
    df_ = df_par_attr[(df_par_attr['partition'] == partition)
                      & (df_par_attr[attr] == 0)].sample(int(num_samples/2))
    df_ = pd.concat([df_,
                     df_par_attr[(df_par_attr['partition'] == partition)
                                 & (df_par_attr[attr] == 1)].sample(int(num_samples/2))])
    # for Training and Validation
    if partition != 2:
        x_ = np.array([load_reshape_img(images_folder + fname) for fname in df_.index])
        x_ = x_.reshape(x_.shape[0], 218, 178, 3)
        y_ = np_utils.to_categorical(df_[attr], 2)
    # for Test
    else:
        x_ = []
        y_ = []
        for index, target in df_.iterrows():
            im = cv2.imread(images_folder + index)
            im = cv2.resize(cv2.cvtColor(im, cv2.COLOR_BGR2RGB), (IMG_WIDTH, IMG_HEIGHT)).astype(np.float32) / 255.0
            im = np.expand_dims(im, axis=0)
            x_.append(im)
            y_.append(target[attr])
    return x_, y_
My training model is built after the partitioning, as you can see below.
# Train data
x_train, y_train = generate_df(0, 'Male', TRAINING_SAMPLES)
# Train - Data Preparation - Data Augmentation with generators
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
train_datagen.fit(x_train)
train_generator = train_datagen.flow(
    x_train, y_train,
    batch_size=BATCH_SIZE,
)
The same also goes for the validation
# Validation Data
x_valid, y_valid = generate_df(1, 'Male', VALIDATION_SAMPLES)
# Validation - Data Preparation - Data Augmentation with generators
valid_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
)
valid_datagen.fit(x_valid)
validation_generator = valid_datagen.flow(
    x_valid, y_valid,
)
I tried playing around with the layers, but I was told that it wouldn't really affect the val_accuracy:
x = inc_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(256, activation="relu")(x)
predictions = Dense(2, activation="softmax")(x)
I tried using the 'adam' optimizer but it made no difference when compared to sgd
model_.compile(optimizer=SGD(lr=0.0001, momentum=0.9),
               loss='categorical_crossentropy',
               metrics=['accuracy'])
hist = model_.fit_generator(train_generator,
                            validation_data=(x_valid, y_valid),
                            steps_per_epoch=TRAINING_SAMPLES/BATCH_SIZE,
                            epochs=NUM_EPOCHS,
                            callbacks=[checkpointer],
                            verbose=1)
Whoever told you that modifying the model won't affect validation accuracy is, in most cases, dead wrong. The problem with your model is that it is not deep enough to extract the features of the images. Below is the code I have used on hundreds of models; it has proved very accurate with respect to achieving low training and validation loss and avoiding overfitting.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Dense, Activation, Dropout, Conv2D, MaxPooling2D, BatchNormalization, Flatten
from tensorflow.keras.optimizers import Adam, Adamax
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras import regularizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.models import Model, load_model

def make_model(img_size, class_count, lr=.001, trainable=True):
    img_shape = (img_size[0], img_size[1], 3)
    model_name = 'EfficientNetB3'
    base_model = tf.keras.applications.efficientnet.EfficientNetB3(include_top=False, weights="imagenet",
                                                                   input_shape=img_shape, pooling='max')
    base_model.trainable = trainable
    x = base_model.output
    x = keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
    x = Dense(256, kernel_regularizer=regularizers.l2(l=0.016), activity_regularizer=regularizers.l1(0.006),
              bias_regularizer=regularizers.l1(0.006), activation='relu')(x)
    x = Dropout(rate=.45, seed=123)(x)
    output = Dense(class_count, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=output)
    model.compile(Adamax(learning_rate=lr), loss='categorical_crossentropy', metrics=['accuracy'])
    return model, base_model  # return the base_model so the callback can control its training state
TRAINING_SAMPLES = 10000
VALIDATION_SAMPLES = 2000
TEST_SAMPLES = 2000
IMG_WIDTH = 178
IMG_HEIGHT = 218
BATCH_SIZE = 32
NUM_EPOCHS = 20
img_size=(IMG_HEIGHT,IMG_WIDTH)
class_count=2
model, base_model=make_model(img_size, class_count, lr=.001, trainable=True)
I also recommend that you use two keras callbacks. One is to control the learning rate. Documentation for that is here. The other controls early stopping and saves the model with the lowest validation loss. Documentation for that is here.
My recommended code for these callbacks is shown below
rlronp=tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2,verbose=1)
estop=tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=4, verbose=1,restore_best_weights=True)
callbacks=[rlronp, estop]
Put the above code prior to calling model.fit, and in model.fit set the parameter callbacks=callbacks.
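A rough sketch of how that call could look with the generators defined in the question (the epoch and step values are carried over from the question's constants):

history = model.fit(train_generator,
                    validation_data=validation_generator,
                    steps_per_epoch=TRAINING_SAMPLES // BATCH_SIZE,
                    epochs=NUM_EPOCHS,
                    callbacks=callbacks,  # [rlronp, estop] from above
                    verbose=1)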

tensorflow training: pass model as argument to training loop?

I have experimented with training a simple TensorFlow model under two scenarios: passing my model to my training loop (and to the subfunctions called from the training loop), versus not passing my model to the training loop. The two cases give different results. When I pass the model to the training functions, it trains properly; in the second scenario, something is wrong because the model is apparently not trained. I am baffled, and I wonder if it's a scoping issue.
To be more specific, my setup dynamically creates a new, larger model (adding some layers at each iteration of a for loop) and then trains the resulting model. As stated before, I train the model in two scenarios, passing the model to the training subfunctions or not, and I obtain different results depending on which one I do. I verify this by giving the model a test sample (class 0 MNIST images) and checking whether the correct classification is output. The models trained by passing the model as an argument are trained correctly, but if I do not do this, then only the first model created by the for loop is trained correctly; the later ones produce incorrect class predictions. Can this be explained?
The code below is for training by not passing model as an argument.
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
from tensorflow.keras.optimizers import Adam
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import time
epochs = 200
input_shape = (28,28,1)
num_classes=10
batch_size = 64
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_val = x_train[-10000:]
y_val = y_train[-10000:]
x_train = x_train[:-10000]
y_train = y_train[:-10000]
x_train = np.expand_dims(x_train, -1)
x_val = np.expand_dims(x_val, -1)
x_test = np.expand_dims(x_test, -1)
x_train = x_train.astype("float32")
x_test = x_test.astype("float32")
x_val = x_val.astype("float32")
y_test_sorted_args_0=np.where(y_test == 0)
x_test_0=x_test[y_test_sorted_args_0]
y_test_0=np.full( (x_test_0.shape)[0],0)
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_dataset = train_dataset.shuffle(buffer_size=1024).batch(batch_size)
val_dataset = tf.data.Dataset.from_tensor_slices((x_val, y_val))
val_dataset = val_dataset.batch(batch_size)
acc_metric = keras.metrics.SparseCategoricalAccuracy()
val_acc_metric = keras.metrics.SparseCategoricalAccuracy()
optimizer = keras.optimizers.SGD(learning_rate=1e-3)
loss_fn = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        mod_output = model(x, training=True)
        loss_value = loss_fn(y, mod_output)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    acc_metric.update_state(y, mod_output)
    return loss_value

@tf.function
def test_step(x, y):
    val = model(x, training=False)
    acc_metric.update_state(y, val)

def train(epochs):
    for epoch in range(epochs):
        print("\nStart of epoch %d" % (epoch,))
        start_time = time.time()
        # Iterate over the batches of the dataset.
        for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
            loss_value = train_step(x_batch_train, y_batch_train)
            # Log every 200 batches.
            if step % 200 == 0:
                print(
                    "Training loss (for one batch) at step %d: %.4f"
                    % (step, float(loss_value))
                )
                print("Seen so far: %d samples" % ((step + 1) * batch_size))
        # Display metrics at the end of each epoch.
        train_acc = acc_metric.result()
        print("Training acc over epoch: %.4f" % (float(train_acc),))
        # Reset training metrics at the end of each epoch
        acc_metric.reset_states()
        # Run a validation loop at the end of each epoch.
        for x_batch_val, y_batch_val in val_dataset:
            test_step(x_batch_val, y_batch_val)
        val_acc = acc_metric.result()
        val_acc_metric.reset_states()
        print("Validation acc: %.4f" % (float(val_acc),))
        print("Time taken: %.2fs" % (time.time() - start_time))
max_hidden = 7
for num_hidden_layers in range(1, max_hidden, 3):
    model1 = keras.Sequential(
        [
            keras.Input(shape=input_shape),
            layers.Flatten(),
        ]
    )
    for i in range(1, num_hidden_layers+1):
        model1.add(layers.Dense(150, activation="relu"))
    model1.add(layers.Dense(num_classes, activation="softmax"))
    model = model1
    train(epochs)
    # Verify that the model is properly trained by checking that it correctly predicts images from class 0.
    # The more class 0 predictions we have, the better.
    for sample_index in range(0, 10):
        x_sample = x_test_0[sample_index, :, :, :]
        x_sample = np.expand_dims(x_sample, 0)
        print(tf.math.argmax(model(x_sample), axis=1))
    time.sleep(1)
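For reference, the first scenario described in the question (passing the model explicitly) would look roughly like the sketch below. This is a reconstruction from the description, not the poster's actual code:

@tf.function
def train_step(model, x, y):
    # The model arrives as an argument instead of being captured from the
    # enclosing scope, so each newly built model gets its own trace.
    with tf.GradientTape() as tape:
        mod_output = model(x, training=True)
        loss_value = loss_fn(y, mod_output)
    grads = tape.gradient(loss_value, model.trainable_weights)
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    acc_metric.update_state(y, mod_output)
    return loss_value

def train(model, epochs):
    for epoch in range(epochs):
        for x_batch_train, y_batch_train in train_dataset:
            train_step(model, x_batch_train, y_batch_train)
        print("Training acc over epoch: %.4f" % float(acc_metric.result()))
        acc_metric.reset_states()

# and inside the for loop over num_hidden_layers:
# train(model1, epochs)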

Keras Own Training routine vs Keras Fit

I am trying to compare the results of my own training loop with those given by Keras fit. I provide a code snippet below (RUN_TYPE = 1 runs my own training loop, otherwise it runs Keras fit). As you can see:
I seed the rnds so my generated training set is exactly the same (checked).
I use the same function to create the DNN architecture.
I use the same hyperparameter values (learning rate, optimised parameters, batch_size, etc)
I specifically use shuffle=False in the Keras.fit function to disable any batch shuffling.
When I set batch_size to the size of the training sample, I get the same loss between the two versions. As soon as batch_size < training sample size, i.e. an epoch takes multiple steps, my results diverge. Having made the algo deterministic I don't understand where the discrepancy comes from.
Any tips?
RUN_TYPE = 2
import numpy as np
import time
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.callbacks import LambdaCallback
import tensorflow.keras.backend as K
np.random.seed(1)
tf.random.set_seed(2)
@tf.function
def training_step(x, y, model, opt):
    print(x)
    print(y)
    with tf.GradientTape() as tape:
        predictions = model(x, training=True)
        mseLoss = keras.losses.MeanSquaredError()
        loss = mseLoss(y, predictions)
    grads = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(grads, model.trainable_variables))
    return loss
# HyperParameters Initialisation
# learning_rate, beta_1, beta_2, blablabla.

if RUN_TYPE == 1:
    X_train, y_train = createDataSet(n)
    train_dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
    train_dataset = train_dataset.batch(batch_size)
    # An optimizer for updating the trainable variables
    optimizer = keras.optimizers.Adam(learning_rate=learning_rate, beta_1=beta_1, beta_2=beta_2)
    # Create an instance of the model
    model = createModel(X_train.shape[1:])
    # Train the model
    for epoch in range(epochs):
        print("\nStart of epoch %d" % (epoch,))
        for step, (x_batch_train, y_batch_train) in enumerate(train_dataset):
            loss = training_step(x_batch_train, y_batch_train, model, optimizer)
            print("Step " + str(step) + " loss = " + str(loss.numpy()))
else:
    X, Y = createDataSet(n)
    model = createModel(X.shape[1:])
    optimizer = keras.optimizers.Adam(learning_rate=learning_rate, beta_1=beta_1, beta_2=beta_2)
    model.compile(loss="mse", optimizer=optimizer)
    history = model.fit(X, Y, epochs=epochs, batch_size=batch_size, verbose=0, shuffle=False)

pytorch training loss invariant with varying forward pass implementations

The following code (an MNIST MLP in PyTorch) delivers approximately the same training loss regardless of whether the last layer in the forward pass returns:
F.log_softmax(x)
(x)
Option 1 is incorrect because I use criterion = nn.CrossEntropyLoss(), and yet the results are almost identical. Am I missing anything?
import torch
import numpy as np
import time
from torchvision import datasets
import torchvision.transforms as transforms
# number of subprocesses to use for data loading
num_workers = 0
# how many samples per batch to load
batch_size = 20
# convert data to torch.FloatTensor
transform = transforms.ToTensor()
# choose the training and test datasets
train_data = datasets.MNIST(root='data', train=True,
                            download=True, transform=transform)
test_data = datasets.MNIST(root='data', train=False,
                           download=True, transform=transform)
# prepare data loaders
train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size,
                                           num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size,
                                          num_workers=num_workers)
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # linear layer (784 -> 1 hidden node)
        self.fc1 = nn.Linear(28 * 28, 512)
        self.dropout1 = nn.Dropout(p=0.2, inplace=False)
        self.fc2 = nn.Linear(512, 256)
        self.dropout2 = nn.Dropout(p=0.2, inplace=False)
        self.dropout = nn.Dropout(p=0.2, inplace=False)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        # flatten image input
        x = x.view(-1, 28 * 28)
        # add hidden layer, with relu activation function
        x = F.relu(self.fc1(x))
        x = self.dropout1(x)
        x = F.relu(self.fc2(x))
        x = self.dropout2(x)
        x = self.fc3(x)
        # return F.log_softmax(x)
        return x
# initialize the NN
model = Net()
print(model)
model.to('cuda')
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
n_epochs = 10
model.train() # prep model for training
for epoch in range(n_epochs):
    # monitor training loss
    train_loss = 0.0
    start = time.time()
    for data, target in train_loader:
        data, target = data.to('cuda'), target.to('cuda')
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()*data.size(0)
    train_loss = train_loss/len(train_loader.dataset)
    print('Epoch: {} \tTraining Loss: {:.6f} \ttime: {:.6f}'.format(
        epoch+1,
        train_loss,
        time.time()-start
    ))
For numerical stability, nn.CrossEntropyLoss() has the softmax computation built into it, so you should NOT apply a softmax layer in your forward pass.
From the docs (https://pytorch.org/docs/stable/nn.html#crossentropyloss):
This criterion combines nn.LogSoftmax() and nn.NLLLoss() in one single class.
Using the softmax layer in the forward pass will lead to worse metrics because the gradient magnitudes are lowered (thus, the weight updates are also lowered). I learned it the hard way!
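To illustrate the quoted documentation point, here is a minimal sketch (not from the original answer) showing that passing raw logits to nn.CrossEntropyLoss gives the same value as applying log_softmax yourself and then using nn.NLLLoss:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)           # raw model outputs for a batch of 4
targets = torch.tensor([3, 0, 7, 1])  # integer class labels

# CrossEntropyLoss applied to raw logits...
ce = nn.CrossEntropyLoss()(logits, targets)

# ...matches LogSoftmax followed by NLLLoss.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

print(ce.item(), nll.item())  # identical values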
I guess your problem is that the loss is similar at the beginning of training, but at the end of training it should not be. It is usually a good sanity check to overfit your model on one batch of data: the model should reach 100% accuracy if the batch is small enough. If the model is taking too long to train, then you probably have a bug somewhere.
Hope that helps =)
