I am trying to optimize a given neural network (e.g. a multilayer perceptron with 2 hidden layers) by finding the number of epochs and the batch size that give the highest accuracy.
for epoch from 10 to 200 (in steps of 10):
    for batch from 40 to 200 (in steps of 20):
        modele.fit(X_train, Y_train, epochs=epoch, batch_size=batch)
        save batch, epoch, Accuracy
Afterwards, I kept the smallest epoch count with the smallest corresponding batch size that gave the highest accuracy,
e.g. best_params: epoch = 10, batch = 150 => Accuracy = 94%.
My problem is that when I re-run my model with best_params, it does not give me the same results (loss, accuracy); sometimes the accuracy is even very low (e.g. 10%).
I tried fixing the random seed, but that did not help.
Regards
Djam75
df = pd.DataFrame(columns=['Nb_Batch', 'Nb_Epoch', 'Accuracy'])
i = 0
lst_loss = []
lst_accuracy = []
lst_epoch = list(np.arange(10, 200, 10))
lst_batch = list(np.arange(100, 400, 20))

for epoch in lst_epoch:
    print('---------------- Epoch ' + str(epoch) + ' ------------------')
    for batch in lst_batch:
        # Note: fit() is called repeatedly on the same model instance,
        # so training accumulates across (epoch, batch) configurations.
        # 'epochs' is the current Keras argument name (formerly 'nb_epoch').
        modelSimple.fit(X_train, Y_train, epochs=epoch, batch_size=batch, verbose=0)
        score = modelSimple.evaluate(X_test, Y_test)
        df.loc[i, "Nb_Batch"] = batch
        df.loc[i, "Nb_Epoch"] = epoch
        df.loc[i, "Accuracy"] = score[1] * 100
        i = i + 1
This might be happening due to random parameter initialization: if you build an end-to-end model without transferring learned weights, the architecture gets fresh random values for its parameters every time you train it.
In this case, a good practice is to use batch normalization layers after some layers according to your architecture.
tensorflow-implementation
pytorch-implementation
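For illustration, here is a minimal Keras sketch of what "batch normalization after some layers" could look like for a two-hidden-layer perceptron; the layer sizes, input dimension and output layer are placeholders, not values from the question:

import tensorflow as tf

# Two-hidden-layer MLP with a BatchNormalization layer after each hidden layer.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(20,)),  # placeholder input dimension
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # adjust for your own output
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])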
Extra idea:
Do not use any 'for' or 'while' loops in the model implementation.
You can follow the templates in TensorFlow or PyTorch.
Or, if you build a complete model from scratch, vectorize operations by using NumPy or a similar matrix operation library.
Thanks for the update.
I resolved my problem by saving the model and loading it afterwards (see the short sketch after this reply).
Thanks for the idea (batch normalization) and the extra idea: not using any 'for' loops ;-)
Regards
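A minimal sketch of that save/reload approach in Keras, reusing the names from the question's code (the file name is illustrative):

import tensorflow as tf

# Save the trained model to disk, then reload it to get the exact same weights back.
modelSimple.save('best_model.h5')                        # after training with the best params
restored = tf.keras.models.load_model('best_model.h5')   # same weights => same evaluation results
score = restored.evaluate(X_test, Y_test)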
I think you might not be updating the weight matrix after completing the training for certain batch sizes and epochs.
Please include the code as well so we can see the problem.
I have followed this emnist tutorial to create an image classification experiment (7 classes) with the aim of training a classifier on 3 silos of data with the TFF framework.
Before training begins, I convert the model to a tf keras model using tff.learning.assign_weights_to_keras_model(model,state.model) to evaluate on my validation set. Regardless of the label, the model only predicts one class. This is to be expected as no training of the model has occurred yet. However, I repeat this step after each federated averaging round and the problem persists. All validation images are predicted to one class. I also save the tf keras model weights after each round and make predictions on the test set - no changes.
Some of the steps I have taken to check the source of the issue:
Checked if the tf keras model weights are updating when the FL model is converted after each round - they are updating (see the sketch after this list).
Ensured that the buffer size is greater than the training dataset size for each client.
Compared the predictions to the class distribution in the training datasets. There is a class imbalance but the one class that the model predicts is not necessarily the majority class. Also, it is not always the same class. For the most part, it predicts only class 0.
Increased the number of rounds to 5 and epochs per round to 10. This is computationally very intensive as it is quite a large model being trained with approx 1500 images per client.
Investigated the TensorBoard logs from each training attempt. The training loss is decreasing as the round progresses.
Tried a much simpler model - a basic CNN with 2 conv layers. This allowed me to greatly increase the number of epochs and rounds. When evaluating this model on the test set, it predicted 4 different classes but the performance remained very bad. This would indicate that I would just need to increase the number of rounds and epochs for my original model to increase the variation in predictions. This is difficult due to the large training time that would result.
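(A hedged sketch of one way to perform the weight-update check from the first point above, assuming the converted Keras model tfkeras_model from the training code further down is available before and after a round:)

import numpy as np

# Compare the Keras model's weights before and after one federated averaging round.
weights_before = [w.copy() for w in tfkeras_model.get_weights()]
# ... run one round of federated averaging and assign the new state back ...
weights_after = tfkeras_model.get_weights()
changed = any(not np.allclose(b, a) for b, a in zip(weights_before, weights_after))
print("weights updated:", changed)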
Model details:
The model uses the XceptionNet as the base model with the weights unfrozen. This performs well on the classification task when all the training images are pooled into a global dataset. Our aim is to hopefully achieve a comparable performance with FL.
# Assumed imports for this snippet (not shown in the original post)
from tensorflow.keras.applications import Xception
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

base_model = Xception(include_top=False,
                      weights=weights,
                      pooling='max',
                      input_shape=input_shape)
x = base_model.output  # the original snippet used x before defining it; base_model.output is the likely intent
x = GlobalAveragePooling2D()(x)
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(base_model.input, outputs=predictions)
Here is my training code:
def fit(self):
    """Train FL model"""
    # self.load_data()
    summary_writer = tf.summary.create_file_writer(self.logs_dir)
    federated_averaging = self._construct_iterative_process()
    state = federated_averaging.initialize()
    tfkeras_model = self._convert_to_tfkeras_model(state)
    print(np.argmax(tfkeras_model.predict(self.val_data), axis=-1))
    val_loss, val_acc = tfkeras_model.evaluate(self.val_data, steps=100)

    with summary_writer.as_default():
        for round_num in tqdm(range(1, self.num_rounds), ascii=True, desc="FedAvg Rounds"):
            print("Beginning fed avg round...")
            # Round of federated averaging
            state, metrics = federated_averaging.next(state, self.training_data)
            print("Fed avg round complete")

            # Saving logs
            for name, value in metrics._asdict().items():
                tf.summary.scalar(name, value, step=round_num)
            print("round {:2d}, metrics={}".format(round_num, metrics))

            tff.learning.assign_weights_to_keras_model(tfkeras_model, state.model)
            # tfkeras_model = self._convert_to_tfkeras_model(state)

            val_metrics = {}
            val_metrics["val_loss"], val_metrics["val_acc"] = tfkeras_model.evaluate(
                self.val_data, steps=100)
            for name, metric in val_metrics.items():
                tf.summary.scalar(name=name, data=metric, step=round_num)

            self._checkpoint_tfkeras_model(tfkeras_model, round_num, self.checkpoint_dir)
def _checkpoint_tfkeras_model(self, model, round_number, checkpoint_dir):
    # Obtaining model dir path
    model_dir = os.path.join(checkpoint_dir, f'round_{round_number}')
    # Creating directory
    pathlib.Path(model_dir).mkdir(parents=True)
    model_path = os.path.join(model_dir, f'model_file_round{round_number}.h5')
    # Saving model
    model.save(model_path)
def _convert_to_tfkeras_model(self, state):
    """Converts the global TFF model to a TF Keras model.

    Takes the weights of the global model and pushes them
    back into a standard Keras model.

    Args:
        state: The state of the FL server, containing the
            model and optimization state.

    Returns:
        (model): TF Keras model
    """
    model = self._load_tf_keras_model()
    model.compile(loss=self.loss, metrics=self.metrics)
    tff.learning.assign_weights_to_keras_model(model, state.model)
    return model
def _load_tf_keras_model(self):
    """Loads a TF Keras model.

    Raises:
        KeyError: A model name was not defined correctly.

    Returns:
        (model): TF Keras model object
    """
    model = create_models(
        model_type=self.model_type,
        input_shape=[self.img_h, self.img_w, 3],
        freeze_base_weights=self.freeze_weights,
        num_classes=self.num_classes,
        compile_model=False
    )
    return model
def _define_model(self):
    """Model creation function"""
    model = self._load_tf_keras_model()
    tff_model = tff.learning.from_keras_model(
        model,
        dummy_batch=self.sample_batch,
        loss=self.loss,
        # Using self.metrics throws an error
        metrics=[tf.keras.metrics.CategoricalAccuracy()]
    )
    return tff_model

def _construct_iterative_process(self):
    """Constructing federated averaging process"""
    iterative_process = tff.learning.build_federated_averaging_process(
        self._define_model,
        client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
        server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0)
    )
    return iterative_process
Increased the number of rounds to 5 ...
Running only a few rounds of federated learning sounds insufficient. One of the earliest Federated Averaging papers (McMahan 2016) required running for hundreds of rounds when the MNIST data had non-iid splits. More recently (Reddi 2020) required thousands of rounds for CIFAR-100. One thing to note is that each "round" is one "step" of the global model. That step may be larger with more client epochs, but these are averaged and diverging clients may reduce the magnitude of the global step.
I also save the tf keras model weights after each round and make predictions on the test set - no changes.
This can be concerning. It will be easier to debug if you could share the code used in the FL training loop.
Not sure this is an answer; it's more of a related observation.
I've been trying to characterize the learning process (accuracy and loss) on the Federated Learning for Image Classification notebook tutorial with TFF.
I'm seeing major improvements in the speed of convergence by modifying the epoch hyperparameter (changing epochs from 5 to 10, 20, etc.), but I'm also seeing a major increase in training accuracy. I suspect overfitting is occurring, though when I evaluate on the test set the accuracy is still high.
I'm wondering what is going on.
My understanding is that the epoch parameter controls the number of forward/backward passes on each client per round of training. Is this correct? So, e.g., 10 rounds of training on 10 clients with 10 epochs would be 10 epochs x 10 clients x 10 rounds. I realise a larger range of clients is needed, etc., but I was expecting to see poorer accuracy on the test set.
What can I do to see what's going on? Could I use the evaluation check with something like learning curves to see if overfitting is occurring?
test_metrics = evaluation(state.model, federated_test_data) only appears to give a single data point; how can I get the individual test accuracy for each test example evaluated?
Appreciate any thoughts you may have on the matter, Colin . . .
I am trying to modify the global_step in checkpoints so that I can move the training from one machine to another.
Let's say I was doing training for several days on machine A. Now I bought a new machine B with better graphic card and more GPU memory and would like to move the training from machine A to machine B.
To be able to restore the checkpoints on machine B, I had previously specified the global_step in Saver.save on machine A, but with a smaller batch_size and a larger number of sub_iterations.
batch_size = 10
sub_iterations = 500
for (...):
    for i in range(sub_iterations):
        inputs, labels = next_batch(batch_size)
        session.run(optimizer, feed_dict={inputs: inputs, labels: labels})
saver = tf.train.Saver()
saver.save(session, checkpoints_path, global_step)
Now I copied all the files including the checkpoints from machine A to machine B. Because machine B has more GPU memory, I can modify the batch_size to a larger value but use fewer sub_iterations.
batch_size = 100
sub_iterations = 50  # = 500 / (100/10)
for (...):
    for i in range(sub_iterations):
        inputs, labels = next_batch(batch_size)
        session.run(optimizer, feed_dict={inputs: inputs, labels: labels})
However, we cannot directly restore the copied checkpoints, because the saved global_step no longer matches machine B's loop settings. For example, tf.train.exponential_decay will produce an incorrect learning_rate since the number of sub_iterations is reduced on machine B.
learning_rate = tf.train.exponential_decay(..., global_step, sub_iterations, decay_rate, staircase=True)
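To make the mismatch concrete, here is a small sketch of the staircase decay arithmetic at global_step = 501; the initial learning rate and decay rate are illustrative values, not taken from the question:

learning_rate0, decay_rate = 0.1, 0.96   # illustrative values
global_step = 501

lr_machine_a = learning_rate0 * decay_rate ** (global_step // 500)  # decay_steps = 500 -> exponent 1
lr_machine_b = learning_rate0 * decay_rate ** (global_step // 50)   # decay_steps = 50  -> exponent 10

print(lr_machine_a, lr_machine_b)  # 0.096 vs. ~0.0665: the restored run decays much faster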
Is it possible to modify the global_step in the checkpoints? Or is there an alternative, more appropriate way to handle this situation?
Edit 1
In addition to calculating the learning_rate, I also used the global_step to calculate and reduce the number of iterations.
while i < iterations:
    j = 0
    while j < sub_iterations:
        inputs, labels = next_batch(batch_size)
        feed_dict_train = {inputs: inputs, labels: labels}
        _, gstep = session.run([optimizer, global_step], feed_dict=feed_dict_train)
        if (i == 0) and (j == 0):
            i, j = int(gstep / sub_iterations), numpy.mod(gstep, sub_iterations)
        j = j + 1
    i = i + 1
And we will start the iterations from the new i and j. Please feel free to comment on this, as it might not be a good approach for restoring checkpoints and continuing training from them.
Edit 2
In machine A, let's say iterations is 10,000, sub_iterations is 500 and batch_size is 10. So the total number of batches we are aiming to train is 10000x500x10 = 50,000,000. Assume we have trained for several days and global_step becomes 501. So the total number of batches trained is 501x10 = 5010. The remaining number of batches not trained is from 5011 to 50,000,000. If we apply i, j = int(gstep/ sub_iterations), numpy.mod(gstep, sub_iterations), the last trained value of i is 501/500=1 and j is 501%500=1.
Now you have copied all the files including the checkpoints to machine B. Since B has more GPU memory and can train on more samples per sub-iteration, we set batch_size to 100 and adjust sub_iterations to 50, but leave iterations at 10,000. The total number of batches to train is still 50,000,000. So the problem is: how can we start and train the batches from 5,011 to 50,000,000 without training again on the first 5,010 samples?
In order to start and train the batches from 5011 in machine B, we should set i to 1 and j to 0 because the total batches it has trained will be (1*50+0)*100 = 5,000 which is close to 5,010 (as the batch_size is 100 in machine B as opposed to 10 in machine A, we cannot start exactly from 5,010 and we can either choose 5,000 or 5,100).
If we do not adjust the global_step (as suggested by #coder3101), and reuse i, j = int(gstep/ sub_iterations), numpy.mod(gstep, sub_iterations) on machine B, i will become 501/50=10 and j will become 501%50=1. So we will start and train from batch 50,100 (=501*batch_size=501*100), which is incorrect (not close to 5,010).
This formula i, j = int(gstep/ sub_iterations), numpy.mod(gstep, sub_iterations) is introduced because if we stop the training in machine A at one point, we can restore the checkpoints and continue the training in machine A using this formula. However it seems this formula is not applicable when we move the training from machine A to machine B. Therefore I was hoping to modify the global_step in checkpoints to deal with this situation, and would like to know if this is possible.
Yes. It is possible.
To modify the global_step in machine B, you have to perform the following steps:
Calculate the corresponding global_step
In the above example, global_step in machine A is 501 and the total number of trained batches is 501x10=5010. So the corresponding global_step in machine B is 5010/100=50.
Modify the checkpoint filenames
Modify the suffix of the model_checkpoint_path and all_model_checkpoint_paths values to use the correct step number (e.g. 50 in the above example) in the checkpoint file.
Modify the filename of index, meta and data for the checkpoints to use correct step number.
Override global_step after restoring checkpoint
saver.restore(session, model_checkpoint_path)
initial_global_step = global_step.assign(50)
session.run(initial_global_step)
...
# do the training
Now if you restore the checkpoints (including global_step) and override global_step, the training will use the updated global_step and adjust i and j and learning_rate correctly.
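For reference, a minimal sketch of the arithmetic described in step 1, using the values from the example above:

# Recompute the step counter so that machine B resumes from roughly the same number
# of training samples that machine A had already processed.
old_global_step = 501   # from machine A's checkpoint
old_batch_size = 10     # batch size used on machine A
new_batch_size = 100    # batch size used on machine B

samples_seen = old_global_step * old_batch_size      # 5010 samples
new_global_step = samples_seen // new_batch_size     # 50

print(new_global_step)  # -> 50; this is the value passed to global_step.assign(...) above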
If you want to keep the decayed learning rate on machine B the same as it was on A, you can tweak the other parameters of the function tf.train.exponential_decay. For example, you can change the decay rate on the new machine.
In order to find that decay_rate you need to know how exponential_decay is computed.
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
where global_step / decay_steps is an integer division if staircase == True, decay_steps is your sub_iterations, and global_step comes from machine A.
You can change the decay_rate in such a way that the new learning rate on machine B is the same as, or close to, what you would expect from machine A at that global_step.
Also, you can change the initial learning rate for machine B so as to achieve a uniform exponential decay from machine A to B.
So, in short, when moving from A to B you have changed one variable (sub_iterations) and kept global_step the same. You can adjust the other two parameters of exponential_decay(...) in such a way that its output learning rate is the same as you would expect from machine A at that global_step.
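A minimal sketch of that adjustment for the staircase case; decay_rate_a = 0.96 is an illustrative value, and the match only holds at the chosen global_step, since the exponents grow at different rates afterwards:

# Choose a decay_rate for machine B so that the staircase exponential decay at the
# current global_step matches what machine A would have produced.
global_step = 501
decay_steps_a, decay_steps_b = 500, 50   # sub_iterations on A and B
decay_rate_a = 0.96                      # illustrative; use your own value

exp_a = global_step // decay_steps_a     # 1
exp_b = global_step // decay_steps_b     # 10

# Solve decay_rate_b ** exp_b == decay_rate_a ** exp_a for decay_rate_b.
decay_rate_b = decay_rate_a ** (exp_a / exp_b)
print(decay_rate_b)                      # ~0.9959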
I am using TensorFlow to do image recognition on the MNIST dataset. In each training epoch, I pick 10,000 random images and conduct online training with a batch size of 1. The recognition rate increases for the first few epochs; however, after several epochs it starts to drop greatly. (In the first 20 epochs, the recognition rate goes up to ~94%; afterwards, it went from 90% -> 50% -> 40% -> 30% -> 20%.) What is the reason for this?
Also, with a batch size of 1, the performance is worse than when using a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references, but there seem to be contradictory results on whether small or large batch sizes achieve better performance. Which would be the case in this situation?
Edit: I also added a figure of the recognition rate on the training dataset and the test dataset: Recognition rate vs. epoch
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)

# parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1

x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')

# model of neural network
def neural_network_model(data):
    hidden_1_layer = {'weights': tf.Variable(tf.random_normal([784, n_nodes_hl1]), name='l1_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]), name='l1_b')}
    hidden_2_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]), name='l2_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl2]), name='l2_b')}
    hidden_3_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]), name='l3_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl3]), name='l3_b')}
    output_layer = {'weights': tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]), name='lo_w'),
                    'biases': tf.Variable(tf.random_normal([n_classes]), name='lo_b')}

    l1 = tf.add(tf.matmul(data, hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1, hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2, hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3, output_layer['weights']) + output_layer['biases']
    return output

# train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epoches = 100
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epoches):
            epoch_loss = 0
            for batch in range(10000):
                epoch_x, epoch_y = mnist.train.next_batch(batch_size)
                _, c = sess.run([optimizer, cost], feed_dict={x: epoch_x, y: epoch_y})
                epoch_loss += c
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y, 1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x: mnist.test.images, y: mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x: mnist.train.images, y: mnist.train.labels}))

train_neural_network(x)
DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot": optimum performance followed by the shortest training time. In my experience, it's been best to increase the batch size until my time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size that best used the available RAM.
There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
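A minimal sketch of that held-out validation check; the split and stopping rule below are generic helpers, not part of the question's code:

import numpy as np

# Hold out ~15% of the training data as a validation set.
def split_train_val(images, labels, val_fraction=0.15, seed=0):
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(images))
    n_val = int(len(images) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])

# Simple early-stopping rule: stop once validation accuracy has not improved
# for `patience` consecutive evaluations.
def should_stop(val_accuracies, patience=3):
    if len(val_accuracies) <= patience:
        return False
    best_before = max(val_accuracies[:-patience])
    return max(val_accuracies[-patience:]) <= best_before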
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
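If you do want an explicit limit, here is a hedged sketch of gradient clipping in the TF1 style used by the question's code; it replaces the single minimize() call, reuses `cost` from that code, and the clip norm of 5.0 is an arbitrary illustrative value:

# Replace `optimizer = tf.train.AdamOptimizer().minimize(cost)` with an explicit
# compute -> clip -> apply sequence.
optimizer = tf.train.AdamOptimizer()
grads_and_vars = optimizer.compute_gradients(cost)
clipped = [(tf.clip_by_norm(g, 5.0), v) for g, v in grads_and_vars if g is not None]
train_op = optimizer.apply_gradients(clipped)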
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.
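For example, a hedged sketch of that swap in the question's model function (same tensors, just a different activation):

# Instead of l1 = tf.nn.relu(l1), etc., use softplus, which has a smooth, nonzero gradient everywhere.
l1 = tf.nn.softplus(l1)
l2 = tf.nn.softplus(l2)
l3 = tf.nn.softplus(l3)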
I am trying to classify the ETH Food-101 dataset using squeezenet in Caffe2. My model is imported from the Model Zoo and I made two types of modifications to the model:
1) Changing the dimensions of the last layer to have 101 outputs
2) The images from the database are in NHWC form and I just flipped the dimensions of the weights to match. (I plan on changing this)
The Food101 dataset has 75,000 images for training and I am currently using a batch size of 128 and a starting learning rate of -0.01 with a gamma of 0.999 and stepsize of 1. What I noticed is that for the first 2000 iterations of the network the accuracy hovered around 1/128 and this took an hour or so to complete.
I added all the weights to the model.params so they can get updated during gradient descent(except for data) and reinitialized all weights as Xavier and biases to constant. I would expect the accuracy to grow fairly quickly in the first hundred to thousand iterations and then tail off as the number of iterations grow. In my case, the learning is staying constant around 0.
When I look at the gradient file I find that the average is on the order of 10^-6 with a standard deviation of 10^-7. This explains the slow learning, but I haven't been able to get the gradients to start much higher.
These are the gradient statistics for the first convolution after a few iterations
Min Max Avg Sdev
-1.69821e-05 2.10922e-05 1.52149e-06 5.7707e-06
-1.60263e-05 2.01478e-05 1.49323e-06 5.41754e-06
-1.62501e-05 1.97764e-05 1.49046e-06 5.2904e-06
-1.64293e-05 1.90508e-05 1.45681e-06 5.22742e-06
Here are the core parts of my code:
# init_path is path to init_net protobuf
# pred_path is path to pred_net protobuf
def main(init_path, pred_path):
    ws.ResetWorkspace()
    data_folder = '/home/myhome/food101/'
    # some debug code here
    arg_scope = {"order": "NCHW"}
    train_model = model_helper.ModelHelper(name="food101_train", arg_scope=arg_scope)
    if not debug:
        data, label = AddInput(
            train_model, batch_size=128,
            db=os.path.join(data_folder, 'food101-train-nchw-leveldb'),
            db_type='leveldb')
    init_net_def, pred_net_def = update_squeeze_net(init_path, pred_path)
    # print str(init_net_def)
    train_model.param_init_net.AppendNet(core.Net(init_net_def))
    train_model.net.AppendNet(core.Net(pred_net_def))
    ws.RunNetOnce(train_model.param_init_net)
    add_params(train_model, init_net_def)
    AddTrainingOperators(train_model, 'softmaxout', 'label')
    AddBookkeepingOperators(train_model)
    ws.RunNetOnce(train_model.param_init_net)
    if debug:
        ws.FeedBlob('data', data)
        ws.FeedBlob('label', label)
    ws.CreateNet(train_model.net)

    total_iters = 10000
    accuracy = np.zeros(total_iters)
    loss = np.zeros(total_iters)
    # Now, we will manually run the network for total_iters iterations.
    for i in range(total_iters):
        # try:
        conv1_w = ws.FetchBlob('conv1_w')
        print conv1_w[0][0]
        ws.RunNet("food101_train")
        # except RuntimeError:
        #     print ws.FetchBlob('conv1').shape
        #     print ws.FetchBlob('pool1').shape
        #     print ws.FetchBlob('fire2/squeeze1x1_w').shape
        #     print ws.FetchBlob('fire2/squeeze1x1_b').shape
        # softmax = ws.FetchBlob('softmaxout')
        # print softmax[i]
        # print softmax[i][0][0]
        # print softmax[i][0][:5]
        # print softmax[64*i]
        accuracy[i] = ws.FetchBlob('accuracy')
        loss[i] = ws.FetchBlob('loss')
        print accuracy[i], loss[i]
My add_params function initializes the weights as follows
# ops allows me to only initialize the weights of specific ops, because I initially
# was going to do last-layer training
def add_params(model, init_net_def, ops=[]):
    def add_param(op):
        for output in op.output:
            if "_w" in output:
                weight_shape = []
                for arg in op.arg:
                    if arg.name == 'shape':
                        weight_shape = arg.ints
                weight_initializer = initializers.update_initializer(
                    None,
                    None,
                    ("XavierFill", {}))
                model.create_param(
                    param_name=output,
                    shape=weight_shape,
                    initializer=weight_initializer,
                    tags=ParameterTags.WEIGHT)
            elif "_b" in output:
                weight_shape = []
                for arg in op.arg:
                    if arg.name == 'shape':
                        weight_shape = arg.ints
                weight_initializer = initializers.update_initializer(
                    None,
                    None,
                    ("ConstantFill", {}))
                model.create_param(
                    param_name=output,
                    shape=weight_shape,
                    initializer=weight_initializer,
While SqueezeNet has 50x fewer parameters than AlexNet, it is still a very large network. The original paper does not mention a training time, but the SqueezeNet-based SQ required 22 hours to train using two Titan X graphics cards - and that was with some of the weights pre-trained! I haven't gone over your code in detail, but what you describe is expected behavior - your network is able to learn on the single batch, just not as quickly as you expected.
I suggest reusing as many of the weights as possible instead of reinitializing them, just as the creators of SQ did. This is known as transfer learning, and it works because many of the lower-level features (lines, curves, basic shapes) in an image are the same regardless of the image's content, and reusing the weights for these layers saves the network from having to re-learn them from scratch.
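This is not the asker's Caffe2 setup, but an illustrative Keras sketch of the transfer-learning idea described above: keep the pretrained lower-layer weights and reinitialize only the new classification head (the base model choice and input size here are placeholders; 101 is the Food-101 class count):

import tensorflow as tf

# Reuse pretrained low-level features; train only a freshly initialized head.
base = tf.keras.applications.MobileNetV2(include_top=False, weights='imagenet',
                                         input_shape=(224, 224, 3), pooling='avg')
base.trainable = False                                                   # keep pretrained weights
outputs = tf.keras.layers.Dense(101, activation='softmax')(base.output)  # 101 food classes
model = tf.keras.Model(base.input, outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])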