My study is set up to use the Hyperband pruner with 60 trials, a max resource of 10M and a reduction factor of 2.
def optimize_agent(trial):
# ...
model = PPO("MlpPolicy", env, **params)
study = optuna.create_study(
min_resource=1, max_resource=10000000, reduction_factor=2
study.optimize(optimize_agent, n_trials=60, n_jobs=2)
When I let the study run overnight, it ran the first 6 trials to completion (2M steps each). Isn't the HyperbandPruner supposed to stop at least some trials before they complete?
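For reference, here is a rough sketch of the arithmetic behind this configuration. It assumes the bracket-count formula floor(log_eta(max_resource / min_resource)) + 1 described for Optuna's HyperbandPruner; the helper names are illustrative and not part of Optuna. Separately, a pruner can only stop a trial whose objective calls trial.report() and trial.should_prune(), and the elided body above does not show whether it does.

```python
import math

def hyperband_bracket_count(min_resource, max_resource, reduction_factor):
    """Bracket count, assuming floor(log_eta(max / min)) + 1 (an assumption, see above)."""
    ratio = max_resource / min_resource
    return math.floor(math.log(ratio, reduction_factor)) + 1

def rung_resources(min_resource, max_resource, reduction_factor):
    """Resource levels min_resource * eta**k that do not exceed max_resource."""
    levels = []
    resource = min_resource
    while resource <= max_resource:
        levels.append(resource)
        resource *= reduction_factor
    return levels

# Values taken from the question: min_resource=1, max_resource=10M, factor 2, 60 trials.
n_brackets = hyperband_bracket_count(1, 10_000_000, 2)
levels = rung_resources(1, 10_000_000, 2)
trials_per_bracket = 60 / n_brackets
print(n_brackets, len(levels), levels[-1], trials_per_bracket)
```

Under that assumption the 60 trials are spread over many brackets, so each bracket has only a few trials to compare against one another.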
I am applying federated averaging to my federated learning model. After running the model for thousands of rounds, it still has not converged.
How can I increase the number of epochs in training, and how does that differ from the number of rounds?
And how can I reach convergence? I tried increasing the number of rounds, but training takes a long time (I am using Google Colab, where execution time cannot exceed 24 hours; I also tried subscribing to Google Colab Pro to use the GPU, but it did not work well).
The code and the training results are provided below.
train_data = [train.create_tf_dataset_for_client(c).repeat(2).map(reshape_data)
for c in train_client_ids]
iterative_process = tff.learning.build_federated_averaging_process(
client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.0001),
server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.9))
NUM_ROUNDS = 50000
state = iterative_process.initialize()
logdir = "/tmp/logs/scalars/training/"
summary_writer = tf.summary.create_file_writer(logdir)
with summary_writer.as_default():
    for round_num in range(0, NUM_ROUNDS):
        # One round of federated averaging over the client datasets
        state, metrics = iterative_process.next(state, train_data)
        if (round_num % 1000) == 0:
            print('round {:2d}, metrics={}'.format(round_num, metrics))
        for name, value in metrics['train'].items():
            tf.summary.scalar(name, value, step=round_num)
See this tutorial for how to increase epochs (basically, increase the number passed to .repeat()). The number of epochs is the number of times a client trains on each of its batches within a round. The number of rounds is the total number of federated computations.
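The relationship between the two can be sketched with a little arithmetic. The function names and the example numbers below (1000 examples per client, batch size 20, 5 clients, 10 rounds) are hypothetical and only illustrate the counting; they are not taken from the question.

```python
import math

def local_steps_per_round(num_examples, batch_size, num_epochs):
    """Gradient steps a single client takes within one round."""
    return math.ceil(num_examples / batch_size) * num_epochs

def total_local_steps(num_rounds, clients_per_round, num_examples, batch_size, num_epochs):
    """Local steps summed over all participating clients and all rounds."""
    per_client = local_steps_per_round(num_examples, batch_size, num_epochs)
    return num_rounds * clients_per_round * per_client

# Hypothetical setup: .repeat(2) corresponds to num_epochs=2.
steps = local_steps_per_round(num_examples=1000, batch_size=20, num_epochs=2)
total = total_local_steps(num_rounds=10, clients_per_round=5,
                          num_examples=1000, batch_size=20, num_epochs=2)
print(steps, total)
```

Raising the epoch count increases the local work per round, while the number of global model updates stays equal to the number of rounds.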
Cross-validation of a GBTClassifier in PySpark is taking too much time on 2 GB of data (80% train and 20% test). Is there a way to reduce the time?
The sample code is given below:
dt = GBTClassifier(maxIter = 250)
pipeline_dt = Pipeline(stages=[indexer, assembler, dt])
paramGrid = ParamGridBuilder().build()
crossval = CrossValidator(estimator=pipeline_dt, estimatorParamMaps=paramGrid,
cvModel =
By default, evaluation runs serially: the next round starts only after the previous one has finished. Starting with Spark 2.3, there is a parallelism parameter that specifies how many evaluations may run in parallel.
P.S. If you add a parameter search as well, I would recommend looking at the Hyperopt library, which improves hyperparameter search.
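To see where the time goes, a back-of-the-envelope count may help. This is a sketch in plain Python, not PySpark code: the function names are made up, and the fold count of 3 and the single parameter map for an empty grid are assumptions about the setup in the question.

```python
def cross_validation_fits(num_folds, num_param_maps):
    """Fits during the search (one per fold and parameter map) plus a final refit."""
    return num_folds * num_param_maps + 1

def total_boosting_iterations(num_folds, num_param_maps, max_iter):
    """Boosting iterations summed over every fit; each GBT fit runs max_iter of them."""
    return cross_validation_fits(num_folds, num_param_maps) * max_iter

# Assumed values: 3 folds, 1 parameter map (empty grid), maxIter=250 from the question.
fits = cross_validation_fits(num_folds=3, num_param_maps=1)
iterations = total_boosting_iterations(num_folds=3, num_param_maps=1, max_iter=250)
print(fits, iterations)
```

Under these assumptions, lowering maxIter or the number of folds reduces the total work directly.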
I have followed this emnist tutorial to create an image classification experiment (7 classes) with the aim of training a classifier on 3 silos of data with the TFF framework.
Before training begins, I convert the model to a tf keras model using tff.learning.assign_weights_to_keras_model(model,state.model) to evaluate on my validation set. Regardless of the label, the model only predicts one class. This is to be expected as no training of the model has occurred yet. However, I repeat this step after each federated averaging round and the problem persists. All validation images are predicted to one class. I also save the tf keras model weights after each round and make predictions on the test set - no changes.
Some of the steps I have taken to check the source of the issue:
Checked if the tf keras model weights are updating when the FL model is converted after each round - they are updating.
Ensured that the buffer size is greater than the training dataset size for each client.
Compared the predictions to the class distribution in the training datasets. There is a class imbalance but the one class that the model predicts is not necessarily the majority class. Also, it is not always the same class. For the most part, it predicts only class 0.
Increased the number of rounds to 5 and epochs per round to 10. This is computationally very intensive as it is quite a large model being trained with approx 1500 images per client.
Investigated the TensorBoard logs from each training attempt. The training loss is decreasing as the round progresses.
Tried a much simpler model: a basic CNN with 2 conv layers. This allowed me to greatly increase the number of epochs and rounds. When evaluating this model on the test set, it predicted 4 different classes, but the performance remains very bad. This would indicate that I just need to increase the number of rounds and epochs for my original model to increase the variation in predictions. This is difficult due to the long training time that would result.
Model details:
The model uses the XceptionNet as the base model with the weights unfrozen. This performs well on the classification task when all the training images are pooled into a global dataset. Our aim is to hopefully achieve a comparable performance with FL.
base_model = Xception(include_top=False,
x = GlobalAveragePooling2D()( x )
predictions = Dense( num_classes, activation='softmax' )( x )
model = Model( base_model.input, outputs=predictions )
Here is my training code:
def fit(self):
"""Train FL model"""
# self.load_data()
summary_writer = tf.summary.create_file_writer(
federated_averaging = self._construct_iterative_process()
state = federated_averaging.initialize()
tfkeras_model = self._convert_to_tfkeras_model( state )
print( np.argmax( tfkeras_model.predict( self.val_data ), axis=-1 ) )
val_loss, val_acc = tfkeras_model.evaluate( self.val_data, steps=100 )
with summary_writer.as_default():
for round_num in tqdm( range( 1, self.num_rounds ), ascii=True, desc="FedAvg Rounds" ):
print( "Beginning fed avg round..." )
# Round of federated averaging
state, metrics =
print( "Fed avg round complete" )
# Saving logs
for name, value in metrics._asdict().items():
print( "round {:2d}, metrics={}".format( round_num, metrics ) )
# tfkeras_model = self._convert_to_tfkeras_model(
# state
# )
val_metrics = {}
val_metrics["val_loss"], val_metrics["val_acc"] = tfkeras_model.evaluate(
for name, metric in val_metrics.items():
def _checkpoint_tfkeras_model(self,
# Obtaining model dir path
model_dir = os.path.join(
# Creating directory
model_path = os.path.join(
# Saving model
def _convert_to_tfkeras_model(self, state):
"""Converts the global TFF model to a TF Keras model
Takes the weights of the global model
and pushes them back into a standard
Keras model
state: The state of the FL server
containing the model and
optimization state
(model); TF Keras model
model = self._load_tf_keras_model()
return model
def _load_tf_keras_model(self):
"""Loads tf keras models
KeyError: A model name was not defined
(model): TF keras model object
model = create_models(
input_shape=[self.img_h, self.img_w, 3],
return model
def _define_model(self):
"""Model creation function"""
model = self._load_tf_keras_model()
tff_model = tff.learning.from_keras_model(
# Using self.metrics throws an error
metrics=[tf.keras.metrics.CategoricalAccuracy()] )
return tff_model
def _construct_iterative_process(self):
"""Constructing federated averaging process"""
iterative_process = tff.learning.build_federated_averaging_process(
client_optimizer_fn=lambda: tf.keras.optimizers.SGD( learning_rate=0.02 ),
server_optimizer_fn=lambda: tf.keras.optimizers.SGD( learning_rate=1.0 ) )
return iterative_process
Increased the number of rounds to 5 ...
Running only a few rounds of federated learning sounds insufficient. One of the earliest Federated Averaging papers (McMahan 2016) required running for hundreds of rounds when the MNIST data had non-iid splits. More recently (Reddi 2020) required thousands of rounds for CIFAR-100. One thing to note is that each "round" is one "step" of the global model. That step may be larger with more client epochs, but these are averaged and diverging clients may reduce the magnitude of the global step.
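The point about diverging clients can be illustrated with a toy sketch in plain Python. It is not TFF code; the function names and the two-dimensional updates are made up for illustration, and it assumes the global step is a weighted average of client updates.

```python
def federated_average(client_deltas, client_weights):
    """Weighted average of client updates (one vector per client)."""
    total = sum(client_weights)
    dim = len(client_deltas[0])
    return [
        sum(w * delta[i] for delta, w in zip(client_deltas, client_weights)) / total
        for i in range(dim)
    ]

def l2_norm(vector):
    """Euclidean length of an update vector."""
    return sum(x * x for x in vector) ** 0.5

# Two clients that agree versus two clients that pull in opposite directions.
aligned = federated_average([[1.0, 0.0], [1.0, 0.0]], [1, 1])
diverging = federated_average([[1.0, 0.0], [-1.0, 0.0]], [1, 1])
print(l2_norm(aligned), l2_norm(diverging))
```

When the clients agree, the averaged step keeps its full length; when they disagree, the average shrinks, which is one reason many rounds may be needed on non-iid data.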
I also save the tf keras model weights after each round and make predictions on the test set - no changes.
This can be concerning. It will be easier to debug if you could share the code used in the FL training loop.
Not sure this is an answer; it is more of a related observation.
I've been trying to characterize the learning process (accuracy and loss) on the Federated Learning for Image Classification notebook tutorial with TFF.
I'm seeing major improvements in speed of convergence by modifying the epoch hyperparameter (changing epochs from 5 to 10, 20, etc.), but I'm also seeing a major increase in training accuracy. I suspect overfitting is occurring, though when I evaluate on the test set, accuracy is still high.
Wondering what is going on?
My understanding is that the epoch param controls the number of forward/backprop passes on each client per round of training. Is this correct? So, for example, 10 rounds of training on 10 clients with 10 epochs would be 10 epochs x 10 clients x 10 rounds. I realise a larger range of clients is needed etc., but I was expecting to see poorer accuracy on the test set.
What can I do to see what's going on? Could I use the evaluation check with something like learning curves to see if overfitting is occurring?
test_metrics = evaluation(state.model, federated_test_data) only appears to give a single data point. How can I get the individual test accuracy for each test example validated?
Appreciate any thoughts you may have on the matter, Colin . . .
Previously I thought that a smaller batch_size would lead to faster training, but in practice, in Keras, I am getting the opposite result: a bigger batch_size makes training faster.
I am running some sample code, and as I increase batch_size, training becomes faster. That is the opposite of my previous belief (that a smaller batch_size results in faster training).
Here's the sample code:
# fit model
import time
start = time.time()
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=1000,
                    batch_size=500, verbose=0)
end = time.time()
elapsed = end - start
I tried 500, 250, 50 and 10 as batch_size, respectively. I expected a lower batch_size to train faster, but batch_size 500 takes 6.3 sec,
250 takes 6.7 sec, 50 takes 28.0 sec, and 10 takes 140.2 sec!
This makes sense. I do not know what model you are using, but Keras is highly optimized and uses vectorization for fast matrix operations. So if you split your data of 5000 samples into batches of 500, with 1000 epochs, there are essentially (5000/500) x 1000 iterations through the model. That's 10,000.
Now if you do that for a batch size of 10, there are (5000/10) x 1000 iterations. That's 500,000: many more iterations through the model, both forward and backward.
Hope that helps.
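The iteration counts in this answer can be reproduced with a short sketch. The 5000-sample dataset size is the answer's own assumption, the timings are the ones reported in the question, and the function name is illustrative.

```python
import math

def weight_updates(num_samples, batch_size, epochs):
    """Number of gradient updates: batches per epoch times epochs."""
    return math.ceil(num_samples / batch_size) * epochs

# Seconds reported in the question for each batch size.
reported_seconds = {500: 6.3, 250: 6.7, 50: 28.0, 10: 140.2}

# 5000 samples is assumed, as in the answer above; 1000 epochs is from the question.
updates = {b: weight_updates(5000, b, 1000) for b in reported_seconds}
for batch_size, seconds in reported_seconds.items():
    per_update_ms = 1000 * seconds / updates[batch_size]
    print(batch_size, updates[batch_size], round(per_update_ms, 3))
```

Under that assumption the time per update is of the same order for every batch size, so the total time is dominated by the number of updates.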
On the hardware side, GPUs are very good at parallelising calculations, specifically the matrix operations that happen in forward and backward propagation. The same is true on the software side: TensorFlow and other DL libraries optimise matrix operations.
Therefore, a larger batch size lets the GPU and DL libraries optimise more of the matrix computation at once, which leads to faster training.
I am currently trying to implement both direct and recursive multi-step forecasting strategies using the statsmodels ARIMA library and it has raised a few questions.
A recursive multi-step forecasting strategy would be training a one-step model, predicting the next value, appending the predicted value onto the end of my exogenous values fed into the forecast method and repeating. This is my recursive implementation:
def arima_forecast_recursive(history, horizon=1, config=None):
# make list so can add / remove elements
history = history.tolist()
model = ARIMA(history, order=config)
model_fit = model.fit(trend='nc', disp=0)
for i, x in enumerate(history):
yhat = model_fit.forecast(steps=1, exog=history[i:])
return np.array(yhat)
def walk_forward_validation(dataframe, config=None):
n_train = 52 # Give a minimum of 2 forecasting periods to capture any seasonality
n_test = 26 # Test set should be the size of one forecasting horizon
n_records = len(dataframe)
tuple_list = []
for index, i in enumerate(range(n_train, n_records)):
# create the train-test split
train, test = dataframe[0:i], dataframe[i:i + n_test]
# Test set is less than forecasting horizon so stop here.
if len(test) < n_test:
yhat = arima_forecast_recursive(train, n_test, config)
results = smape3(test, yhat)
return tuple_list
Similarly to perform a direct strategy I would just fit my model on the available training data and use this to predict the total multi-step forecast at once. I am not sure how to achieve this using the statsmodels library.
My attempt (which produces results) is below:
def walk_forward_validation(dataframe, config=None):
# This currently implements a direct forecasting strategy
n_train = 52 # Give a minimum of 2 forecasting periods to capture any seasonality
n_test = 26 # Test set should be the size of one forecasting horizon
n_records = len(dataframe)
tuple_list = []
for index, i in enumerate(range(n_train, n_records)):
# create the train-test split
train, test = dataframe[0:i], dataframe[i:i + n_test]
# Test set is less than forecasting horizon so stop here.
if len(test) < n_test:
yhat = arima_forecast_direct(train, n_test, config)
results = smape3(test, yhat)
return tuple_list
def arima_forecast_direct(history, horizon=1, config=None):
model = ARIMA(history, order=config)
model_fit = model.fit(trend='nc', disp=0)
return model_fit.forecast(steps=horizon)[0]
What confuses me specifically is if the model should just be fit once for all predictions or multiple times to make a single prediction in the multi-step forecast? Taken from Souhaib Ben Taieb's doctoral thesis (page 35 paragraph 3) it is presented that direct model will estimate H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one. As shown above my current implementation only fits one model.
What I do not understand is how, if I call the fit method multiple times on the same training data, I will get a fit that is any different, beyond the expected normal stochastic variation.
My final question is with regard to optimisation. Using a method such as walk forward validation gives me statistically very significant results, but for many time series it is very computationally expensive. Both of the above implementations are already called using the joblib parallel loop execution functionality, which significantly reduced the runtime on my laptop. However, I would like to know if there is anything that can be done to the above implementations to make them even more efficient. When running these methods for ~2000 separate time series (~500,000 data points in total across all series) there is a runtime of 10 hours. I have profiled the code and most of the execution time is spent in the statsmodels library, which is fine, but there is a discrepancy between the runtime of the walk_forward_validation() method and that of the fit method. This is expected, as the walk_forward_validation() method obviously does more than just call the fit method, but if anything in it can be changed to speed up execution, please let me know.
The idea of this code is to find an optimal arima order per time series as it isn't feasible to investigate 2000 time series individually and as such the walk_forward_validation() method is called 27 times per time series. So roughly 27,000 times overall. Therefore any performance saving that can be found within this method will have an impact no matter how small it is.
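For reference, the expanding-window splits used by walk_forward_validation() can be sketched as a small generator. This is only an illustration: the function name and the step parameter are hypothetical, and a step larger than 1 is an optional trade-off (fewer model fits, coarser evaluation) rather than part of the implementation above.

```python
def walk_forward_splits(n_records, n_train, n_test, step=1):
    """Yield (train_end, test_end) index pairs for an expanding training window.

    The range bound stops as soon as a full test window no longer fits,
    so no separate length check is needed inside the loop.
    """
    for train_end in range(n_train, n_records - n_test + 1, step):
        yield train_end, train_end + n_test

# Example with 80 records and the window sizes from the question.
splits = list(walk_forward_splits(n_records=80, n_train=52, n_test=26))
coarse = list(walk_forward_splits(n_records=80, n_train=52, n_test=26, step=2))
print(splits, coarse)
```

Each pair maps to train = dataframe[0:train_end] and test = dataframe[train_end:test_end].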
Normally, ARIMA can only perform recursive forecasting, not direct forecasting. There might be some research on variations of ARIMA for direct forecasting, but they wouldn't be implemented in statsmodels. In statsmodels (or in R's auto.arima()), when you set a value for h > 1, it simply performs a recursive forecast to get there.
As far as I know, none of the standard forecasting libraries have direct forecasting implemented yet, so you're going to have to code it yourself.
Taken from Souhaib Ben Taieb's doctoral thesis (page 35 paragraph 3) it is presented that direct model will estimate H models, where H is the length of the forecast horizon, so in my example with a forecast horizon of 26, 26 models should be estimated instead of just one.
I haven't read Ben Taieb's thesis, but from his paper "Machine Learning Strategies for Time Series Forecasting", for direct forecasting, there is only one model for one value of H. So for H=26, there will be only one model. There will be H models if you need to forecast for every value between 1 and H, but for one H, there is only one model.
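The difference between the two strategies can be sketched with a toy model in plain Python. This is not ARIMA and not statsmodels code: it swaps in a simple least-squares line on the previous value as a stand-in model, and all function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope

def recursive_forecast(series, horizon):
    """One one-step model; each prediction is fed back in as the next input."""
    a, b = fit_line(series[:-1], series[1:])
    forecasts, last = [], series[-1]
    for _ in range(horizon):
        last = a + b * last
        forecasts.append(last)
    return forecasts

def direct_forecast(series, horizon):
    """One model per step h, each mapping y[t] straight to y[t + h]."""
    forecasts = []
    for h in range(1, horizon + 1):
        a, b = fit_line(series[:-h], series[h:])
        forecasts.append(a + b * series[-1])
    return forecasts

# A perfectly linear toy series, so both strategies should agree.
series = [float(2 * t + 1) for t in range(20)]
recursive = recursive_forecast(series, 3)
direct = direct_forecast(series, 3)
print(recursive, direct)
```

The direct version differs between horizons because each model is trained on different targets (the value h steps ahead), not because the same fit is repeated on identical data.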