Running the Colab linked below, I get the following error:
"The wandb backend process has shutdown"
I see nothing suspicious in the way the colab uses wandb and I couldn't find anyone with the same problem. Any help is greatly appreciated. I am using the latest version of wandb in colab.
This is where I set up wandb:
if WANDB:
    wandb.login()
and this is the part where I get the error:
#setup wandb if we're using it
if WANDB:
    experiment_name = os.environ.get("EXPERIMENT_NAME")
    group = experiment_name if experiment_name != "none" else wandb.util.generate_id()
cv_scores = []
oof_data_frame = pd.DataFrame()
for fold in range(1, config.folds + 1):
    print(f"Fold {fold}/{config.folds}", end="\n"*2)
    fold_directory = os.path.join(config.output_directory, f"fold_{fold}")
    make_directory(fold_directory)
    model_path = os.path.join(fold_directory, "model.pth")
    model_config_path = os.path.join(fold_directory, "model_config.json")
    checkpoints_directory = os.path.join(fold_directory, "checkpoints/")
    make_directory(checkpoints_directory)
    #Data collators are objects that will form a batch by using a list of dataset elements as input.
    collator = Collator(tokenizer=tokenizer, max_length=config.max_length)
    train_fold = train[~train["fold"].isin([fold])]
    train_dataset = Dataset(texts=train_fold["anchor"].values,
                            pair_texts=train_fold["target"].values,
                            contexts=train_fold["title"].values,
                            targets=train_fold["score"].values,
                            max_length=config.max_length,
                            sep=tokenizer.sep_token,
                            tokenizer=tokenizer)
    train_loader = DataLoader(dataset=train_dataset,
                              batch_size=config.batch_size,
                              num_workers=config.num_workers,
                              pin_memory=config.pin_memory,
                              collate_fn=collator,
                              shuffle=True,
                              drop_last=False)
    print(f"Train samples: {len(train_dataset)}")
    validation_fold = train[train["fold"].isin([fold])]
    validation_dataset = Dataset(texts=validation_fold["anchor"].values,
                                 pair_texts=validation_fold["target"].values,
                                 contexts=validation_fold["title"].values,
                                 targets=validation_fold["score"].values,
                                 max_length=config.max_length,
                                 sep=tokenizer.sep_token,
                                 tokenizer=tokenizer)
    validation_loader = DataLoader(dataset=validation_dataset,
                                   batch_size=config.batch_size*2,
                                   num_workers=config.num_workers,
                                   pin_memory=config.pin_memory,
                                   collate_fn=collator,
                                   shuffle=True,
                                   drop_last=False)
    print(f"Validation samples: {len(validation_dataset)}")
    model = Model(**config.model)
    if not os.path.exists(model_config_path):
        model.config.to_json_file(model_config_path)
    model_parameters = model.parameters()
    optimizer = get_optimizer(**config.optimizer, model_parameters=model_parameters)
    training_steps = len(train_loader) * config.epochs
    if "scheduler" in config:
        config.scheduler.parameters.num_training_steps = training_steps
        config.scheduler.parameters.num_warmup_steps = training_steps * config.get("warmup", 0)
        scheduler = get_scheduler(**config.scheduler, optimizer=optimizer, from_transformers=True)
    else:
        scheduler = None
    model_checkpoint = ModelCheckpoint(mode="min",
                                       delta=config.delta,
                                       directory=checkpoints_directory,
                                       overwriting=True,
                                       filename_format="checkpoint.pth",
                                       num_candidates=1)
    if WANDB:
        wandb.init()
        #wandb.init(group=group, name=f"fold_{fold}", config=config)
    (train_loss, train_metrics), (validation_loss, validation_metrics, validation_outputs) = training_loop(
        model=model,
        optimizer=optimizer,
        scheduler=scheduler,
        scheduling_after=config.scheduling_after,
        train_loader=train_loader,
        validation_loader=validation_loader,
        epochs=config.epochs,
        gradient_accumulation_steps=config.gradient_accumulation_steps,
        gradient_scaling=config.gradient_scaling,
        gradient_norm=config.gradient_norm,
        validation_steps=config.validation_steps,
        amp=config.amp,
        debug=config.debug,
        verbose=config.verbose,
        device=config.device,
        recalculate_metrics_at_end=True,
        return_validation_outputs=True,
        logger="tqdm")
    if WANDB:
        wandb.finish()
    if config.save_model:
        model_state = model.state_dict()
        torch.save(model_state, model_path)
        print(f"Model's path: {model_path}")
    validation_fold["prediction"] = validation_outputs.to("cpu").numpy()
    oof_data_frame = pd.concat([oof_data_frame, validation_fold])
    cv_monitor_value = validation_loss if config.cv_monitor_value == "loss" else validation_metrics[config.cv_monitor_value]
    cv_scores.append(cv_monitor_value)
    del model, optimizer, validation_outputs, train_fold, validation_fold
    torch.cuda.empty_cache()
    gc.collect()
    print(end="\n"*6)
TL;DR: Check that the generated run id is unique within the wandb project you are using.
Explanation
You can find the exact reason this happened in the log files under the wandb folder for the specific run id. I had the same issue, with both "Error communicating with wandb process" and "The wandb backend process has shutdown".
My problem was that I was assigning a run id that already existed and then rerunning the whole search space, but the run id has to be unique. Using name in init is generally a safer bet if you don't intend to continue the previous run (resuming is possible if you indicate so in the init method).
You can also try running wandb in offline mode to see if that helps, and run wandb sync later on.
The solution that worked for me was to run !wandb login --relogin.
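To make the uniqueness point concrete, here is a minimal sketch that derives a fresh run id for every fold, assuming a cross-validation loop like the one in the question. The helper make_run_id and the group name are hypothetical, not part of wandb; WANDB_MODE is the environment variable wandb reads to switch to offline mode. The wandb.init call is left as a comment so the snippet runs without wandb installed.

```python
import os
import uuid

# Offline mode: runs are stored locally and can be uploaded later with `wandb sync`.
os.environ["WANDB_MODE"] = "offline"


def make_run_id(group, fold):
    """Build a run id that is very unlikely to collide with an earlier run.

    Hypothetical helper: the random suffix changes on every call, so rerunning
    the whole notebook never reuses an id that already exists in the project.
    """
    suffix = uuid.uuid4().hex[:8]
    return f"{group}-fold{fold}-{suffix}"


group = "my-experiment"  # assumption: stands in for the question's `group`
run_ids = [make_run_id(group, fold) for fold in range(1, 6)]

for fold, run_id in enumerate(run_ids, start=1):
    # With wandb installed, each fold would start its own run, for example:
    # wandb.init(id=run_id, group=group, name=f"fold_{fold}")
    print(fold, run_id)
```

If you do not need to control the id at all, omitting id and passing only name and group lets wandb generate the id itself.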
I am new to Stable Baselines and RL. What I am trying to do is:
load my previously trained model from disk and then continue training it from the point where its last training ended. To do that, I load the previously saved model inside policy_fn() and pass policy_fn as a parameter to the pposgd_simple.learn() method. This raises the error "ValueError: At least two variables have the same name: pi/obfilter/count".
I am also unsure whether training resumes from the previous end point or starts again from the very beginning (when it trains correctly in a different setting). Can anyone please point me to a way to verify this? One option may be printing the model parameters, but I am not sure about it.
I am also trying to use TensorBoard to monitor my training. But when I run the training, the program fails at "tensorboard_log=logger_path" with "TypeError: learn() got an unexpected keyword argument 'tensorboard_log'". My Stable Baselines version is 2.10.2. My entire training code is below.
def make_env(seed=None):
    reward_scale = 1.0
    rank = MPI.COMM_WORLD.Get_rank()
    myseed = seed + 1000 * rank if seed is not None else None
    set_global_seeds(myseed)
    env = Env()
    env = Monitor(env, logger_path, allow_early_resets=True)
    env.seed(seed)
    if reward_scale != 1.0:
        from baselines.common.retro_wrappers import RewardScaler
        env = RewardScaler(env, reward_scale)
    return env
def train(num_timesteps, path=None):
    from baselines.ppo1 import mlp_policy, pposgd_simple
    sess = U.make_session(num_cpu=1)
    sess.__enter__()
    def policy_fn(name, ob_space, ac_space):
        policy = mlp_policy.MlpPolicy(name=name, ob_space=ob_space, ac_space=ac_space,
                                      hid_size=64, num_hid_layers=3)
        saver = tf.train.Saver()
        if path is not None:
            print("Tried to restore from ", path)
            U.initialize()
            saver.restore(tf.get_default_session(), path)
            saver2 = tf.train.import_meta_graph('/srcs/src/models/model1.meta')
            model = saver.restore(sess,tf.train.latest_checkpoint('/srcs/src/models/'))
        #return policy
        return saver2
    env = make_env()
    pi = pposgd_simple.learn(env, policy_fn,
                             max_timesteps=num_timesteps,
                             timesteps_per_actorbatch=1024,
                             clip_param=0.2, entcoeff=0.0,
                             optim_epochs=10,
                             optim_stepsize=5e-5,
                             optim_batchsize=64,
                             gamma=0.99,
                             lam=0.95,
                             schedule='linear',
                             tensorboard_log=logger_path,
                             #tensorboard_log="./ppo1_tensorboard/",
                             )
    env.env.plotSave()
    saver = tf.train.Saver(tf.all_variables())
    saver.save(sess, '/models/model1')
    return pi
def main():
    logger.configure()
    path_ = "/models/model1"
    train(num_timesteps=409600, path=path_)
if __name__ == '__main__':
    rank = MPI.COMM_WORLD.Get_rank()
    logger_path = None if logger.get_dir() is None else os.path.join(logger.get_dir(), str(rank))
    main()
I get below warning when I try to run the code from this page.
/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use thePyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
I am super confused because the code doesn't seem to set the optimizer at all. The most likely place where the optimizer is set is below, but I don't know how to change it there.
# define the training arguments
training_args = TrainingArguments(
    output_dir = '/media/data_files/github/website_tutorials/results',
    num_train_epochs = 5,
    per_device_train_batch_size = 8,
    gradient_accumulation_steps = 8,
    per_device_eval_batch_size= 16,
    evaluation_strategy = "epoch",
    disable_tqdm = False,
    load_best_model_at_end=True,
    warmup_steps=200,
    weight_decay=0.01,
    logging_steps = 4,
    fp16 = True,
    logging_dir='/media/data_files/github/website_tutorials/logs',
    dataloader_num_workers = 0,
    run_name = 'longformer-classification-updated-rtx3090_paper_replication_2_warm'
)
# instantiate the trainer class and check for available devices
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_data,
    eval_dataset=test_data
)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device
I tried another transformer such as distilbert-base-uncased using the identical code but it seems to run without any warnings.
Is this warning more specific to longformer?
How should I change the optimizer?
import torch.optim as optim
optim.AdamW(params, opt.learning_rate, (opt.optim_alpha, opt.optim_beta), opt.optim_epsilon, weight_decay=opt.weight_decay)
The PyTorch implementation that the warning recommends can be used this way.
You need to add optim='adamw_torch' to your TrainingArguments; in the version that prints this warning, the default is optim='adamw_hf'.
Can you try the following:
# define the training arguments
training_args = TrainingArguments(
    optim='adamw_torch',
    # your training arguments
    ...
    ...
    ...
)
I have trained a BERT text classification model using Keras on a spam vs ham dataset. I have deployed the model and got a SageMaker endpoint. I want to use it for predictions.
I am using a ml.t2.medium Sagemaker instance and my tensorflow version is 2.6.2 in the Sagemaker notebook
I am getting an error while using the Sagemaker endpoint for prediction. The error is Session was not created with a graph before Run()
This is my code for training the classifier
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
# In[2]:
import pandas as pd
df = pd.read_csv("spam.csv")
df.head(5)
# In[3]:
df.groupby('Category').describe()
# In[4]:
df['Category'].value_counts()
# In[5]:
df_spam = df[df['Category']=='spam']
df_spam.shape
# In[6]:
df_ham = df[df['Category']=='ham']
df_ham.shape
# In[7]:
df_ham_downsampled = df_ham.sample(df_spam.shape[0])
df_ham_downsampled.shape
# In[8]:
df_balanced = pd.concat([df_ham_downsampled, df_spam])
df_balanced.shape
# In[9]:
df_balanced['Category'].value_counts()
# In[10]:
df_balanced['spam']=df_balanced['Category'].apply(lambda x: 1 if x=='spam' else 0)
df_balanced.sample(5)
# In[11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df_balanced['Message'],df_balanced['spam'], stratify=df_balanced['spam'])
# In[12]:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
# In[13]:
def get_sentence_embeding(sentences):
    preprocessed_text = bert_preprocess(sentences)
    return bert_encoder(preprocessed_text)['pooled_output']
get_sentence_embeding([
    "500$ discount. hurry up",
    "Bhavin, are you up for a volleybal game tomorrow?"]
)
# In[14]:
# Bert layers
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)
# Neural network layers
l = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
l = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(l)
# Use inputs and outputs to construct a final model
model = tf.keras.Model(inputs=[text_input], outputs = [l])
# In[15]:
model.summary()
# In[16]:
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
]
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=METRICS)
# In[17]:
model.fit(X_train, y_train, epochs=1)
AND THIS PART IS USED FOR DEPLOYING THE MODEL
# In[18]:
model.save('saved_model/28dec1')
# In[3]:
model = tf.keras.models.load_model('saved_model/28dec1')
model.predict(["who is the spammer on here"])
array([[0.08218178]], dtype=float32)
# Check its architecture
model.summary()
# In[18]:
tf.compat.v1.enable_eager_execution()
print("pass")
# In[5]:
def convert_h5_to_aws(loaded_model):
    """
    given a pre-trained keras model, this function converts it to a TF protobuf format
    and saves it in the file structure which aws expects
    """
    from tensorflow.python.saved_model import builder
    from tensorflow.python.saved_model.signature_def_utils import predict_signature_def
    from tensorflow.python.saved_model import tag_constants
    # This is the file structure which AWS expects. Cannot be changed.
    model_version = '1'
    export_dir = 'export/Servo/' + model_version
    # Build the Protocol Buffer SavedModel at 'export_dir'
    builder = builder.SavedModelBuilder(export_dir)
    # Create prediction signature to be used by TensorFlow Serving Predict API
    signature = predict_signature_def(
        inputs={"inputs": loaded_model.input}, outputs={"score": loaded_model.output})
    from keras import backend as K
    with K.get_session() as sess:
        # Save the meta graph and variables
        builder.add_meta_graph_and_variables(
            sess=sess, tags=[tag_constants.SERVING], signature_def_map={"serving_default": signature})
        builder.save()
    #create a tarball/tar file and zip it
    import tarfile
    with tarfile.open('model.tar.gz', mode='w:gz') as archive:
        archive.add('export', recursive=True)
convert_h5_to_aws(model)
# In[3]:
import sagemaker
sagemaker_session = sagemaker.Session()
inputs = sagemaker_session.upload_data(path='model.tar.gz', key_prefix='model')
# In[7]:
# where did it upload to?
print("Bucket name is:")
sagemaker_session.default_bucket()
# In[9]:
import boto3, re
from sagemaker import get_execution_role
# the (default) IAM role you created when creating this notebook
role = get_execution_role()
# Create a Sagemaker model (see AWS console>SageMaker>Models)
from sagemaker.tensorflow.model import TensorFlowModel
sagemaker_model = TensorFlowModel(model_data = 's3://' + sagemaker_session.default_bucket() + '/model/model.tar.gz',
                                  role = role,
                                  framework_version = '1.12',
                                  entry_point = 'train.py')
# In[10]:
# Deploy a SageMaker to an endpoint
predictor = sagemaker_model.deploy(initial_instance_count=1,
                                   instance_type='ml.m4.xlarge')
# In[5]:
import numpy as np
import sagemaker
from sagemaker.tensorflow.model import TensorFlowModel
endpoint = 'sagemaker-tensorflow-serving-2021-10-28-11-18-34-001' #get endpoint name from SageMaker > endpoints
predictor=sagemaker.tensorflow.model.TensorFlowPredictor(endpoint, sagemaker_session)
# .predict send the data to our endpoint
#data = np.asarray(["what the shit"]) #<-- update this to have inputs for your model
predictor.predict(["this is not a spam"])
And I am getting this error
ModelError: An error occurred (ModelError) when calling the InvokeEndpoint operation: Received client error (400) from primary with message "{ "error": "Session was not created with a graph before Run()!" }
Can someone please help me.
Instead of saving the model as h5, save it with the method below.
model.save("export/Servo/1/")
For some reason it expects this exact directory layout. Also, please remove any hidden files from this folder.
This saves the model directly in protobuf format.
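As a rough sketch of the "remove hidden files" advice, the archive can be built so that hidden entries (for example .ipynb_checkpoints or .DS_Store) never reach model.tar.gz. The function name make_model_archive and its defaults are assumptions for illustration; it only uses the standard tarfile module and assumes the export/Servo/1/ layout from the answer already exists on disk.

```python
import os
import tarfile


def _skip_hidden(tarinfo):
    """tarfile filter: drop any file or directory whose name starts with a dot."""
    name = os.path.basename(tarinfo.name)
    return None if name.startswith(".") else tarinfo


def make_model_archive(export_root="export", archive_path="model.tar.gz"):
    """Pack export_root into a gzipped tarball, leaving hidden entries out.

    Returns the sorted list of member names so the layout can be inspected
    before uploading. (Hypothetical helper, not part of the SageMaker SDK.)
    """
    top_level = os.path.basename(os.path.normpath(export_root))
    with tarfile.open(archive_path, mode="w:gz") as archive:
        # Excluding a hidden directory also excludes everything below it.
        archive.add(export_root, arcname=top_level, recursive=True, filter=_skip_hidden)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return sorted(archive.getnames())


if os.path.isdir("export"):
    # Print the archive contents to confirm only export/Servo/1/... is inside.
    print(make_model_archive("export", "model.tar.gz"))
```

Listing the member names before upload is a cheap way to confirm that the tarball contains only the SavedModel files under export/Servo/1/.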
Hi everyone, I need some help.
I am trying to code ResNet-101 ImageNet classification in TensorFlow without using Estimator. I am doing this to study deep learning and understand how to use TensorFlow.
My problem is that MonitoredTrainingSession does not initialize my iterator.
I have read some articles about this problem and tried to use a hook to handle it, but it fails and I have no idea why.
After I create the MonitoredTrainingSession, it first initializes train_iterator,
runs until it gets an OutOfRange exception,
and then the validation step is performed.
That seems fine so far, but after validation finishes and the training step runs again, I get an error related to iterator.get_next().
It says I did not initialize the iterator, but my hook function clearly calls
session.run(self._initializer, feed_dict={filenames: self._filenames})
I'm sure because I can see the message below, which I print to check whether it is initialized or not.
iter_val.initializer after_create_session is called 0 times
What am I doing wrong?
The running flow is as follows:
run train step fine (epoch =0)
run validation step fine (epoch =0)
run train step Error(epoch =1)
Please ignore Horovod (hvd) in the code because I am not using it right now.
Here is my code; please help me fix it and let me know what's wrong with it.
class _DatasetInitializerHook(tf.train.SessionRunHook):
    def __init__(self, initializer, filenames=[], name=""):
        self._initializer = initializer
        self._filenames = filenames
        self._name = name
        self._cnt = 0
        self._before_runCnt = 0
    def begin(self):
        pass
    def after_create_session(self, session, coord):
        del coord
        if len(self._filenames) == 0:
            session.run(self._initializer)
        else:
            session.run(self._initializer, feed_dict={filenames: self._filenames})
        print(self._name, "after_create_session is called {} times".format(self._cnt))
        self._cnt += 1
if __name__ == "__main__":
    if len(sys.argv) > 1:
        nlogs = sys.argv[1]
    else:
        nlogs = 0
    hvd.init()
    b_imagenet=False
    if b_imagenet:
        training_filenames = ['/data/tfrecords/imagenet2012_train_shard{}.tfrecord'.format(i) for i in range(129)]
    else:
        training_filenames = ['/data/cifar-10-tfrecords/train_shard{}.tfrecord'.format(i) for i in range(1, 2, 1)]
    filenames = tf.placeholder(tf.string, shape=[None])
    trainData = dataset_input_fn(is_training=True, filename=filenames, nworkers=hvd.size(), workeridx=hvd.rank(),
                                 batch_size=FLAGS.batchSize, prefetch_size=FLAGS.prefetch_buffer_size, repeat=1,
                                 shuffle_buffer_size=FLAGS.shuffle_buffer_size)
    valData = dataset_input_fn(is_training=False, filename=FLAGS.validationfile, nworkers=hvd.size(), workeridx=hvd.rank(),
                               batch_size=1,prefetch_size=FLAGS.prefetch_buffer_size, repeat=1, shuffle_buffer_size=1)
    # Pin GPU to be used to process local rank (one GPU per process)
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    config.gpu_options.visible_device_list = str(hvd.local_rank())
    for i in tqdm(range(FLAGS.nepoch)):
        shuffle(training_filenames)
        model = model_class(nCls=FLAGS.nClasses, img_width=FLAGS.width, img_height=FLAGS.height,
                            learning_rate=FLAGS.learning_rate, weight_decay=FLAGS.weight_decay)
        iter_train = trainData.make_initializable_iterator()
        train_op = model.build_model(iter_train.get_next(), is_trainig=True, hvd=None)
        train_hooks = [hvd.BroadcastGlobalVariablesHook(0),
                       _DatasetInitializerHook(iter_train.initializer, training_filenames, "iter_train.initializer")]
        with tf.train.MonitoredTrainingSession(checkpoint_dir="./tmp/train_logs", config=config, hooks=train_hooks,
                                               save_checkpoint_secs=30) as sess:
            try:
                while True:
                    opt = sess.run([train_op])
            except tf.errors.OutOfRangeError:
                pass
        iter_val = valData.make_initializable_iterator()
        prediction_result = model.build_model(iter_val.get_next(),is_trainig=False, hvd=None)
        validation_hooks = [hvd.BroadcastGlobalVariablesHook(0),
                            _DatasetInitializerHook(iter_val.initializer, [], "iter_val.initializer")]
        with tf.train.MonitoredTrainingSession( checkpoint_dir="./tmp/train_logs",config=config, hooks=validation_hooks) as sess:
            try:
                while True:
                    result = sess.run([prediction_result])
            except tf.errors.OutOfRangeError:
                pass
This is the error message I got.
tensorflow.python.framework.errors_impl.FailedPreconditionError: GetNext() failed because the iterator has not been initialized. Ensure that you have run the initializer operation for this iterator before getting the next element.
[[node IteratorGetNext (defined at workspace/multi_gpu/main.py:128) ]]
Errors may have originated from an input operation.
Input Source operations connected to node IteratorGetNext:
IteratorV2_2 (defined at workspace/multi_gpu/main.py:126)
Try putting your initializer into a scaffold:
scaffold = tf.train.Scaffold(local_init_op=train_init_operator)
and pass it to the MonitoredTrainingSession with:
with tf.train.MonitoredTrainingSession(scaffold=scaffold, ...
When writing checkpoint files using a tf.train.MonitoredTrainingSession it somehow writes multiple metagraphs. What am I doing wrong?
I stripped it down to the following code:
import tensorflow as tf
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [(tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                       save_steps = 10,
                                       saver = saver))]
with tf.train.MonitoredTrainingSession(master = '',
                                       is_chief = True,
                                       checkpoint_dir = None,
                                       hooks = hooks,
                                       save_checkpoint_secs = None,
                                       save_summaries_steps = None,
                                       save_summaries_secs = None) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError,tf.errors.CancelledError) as e:
            break
        finally:
            pass
Running this will give duplicate metagraphs, as evidenced by the tensorboard warning:
$ tensorboard --logdir ../train/test1/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there
was a metagraph containing a graph_def, as well as one or more graph
events. Overwriting the graph with the newest event. Starting
TensorBoard 54 at local:6006 (Press CTRL+C to quit)
This is in tensorflow 1.2.0 (I cannot upgrade).
Running the same thing without a monitored session gives the right checkpoint output:
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
init_op = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init_op)
    for i in range(30):
        gs, _ = sess.run([global_step, train])
        print(gs)
        if i%10==0:
            saver.save(sess, output_path+'/test2/my-model', global_step=gs)
            print("Saved ckpt")
Results in no tensorboard errors:
$ tensorboard --logdir ../train/test2/ --port=6006
Starting TensorBoard 54 at local:6006 (Press CTRL+C to quit)
I'd like to fix this as I suspect I'm missing something fundamental, and this error may have some connection to other issues I have in distributed mode. I have to restart tensorboard anytime I want to update the data. Moreover, TensorBoard seems to get really slow over time when it puts out many of these warnings.
There is a related question: tensorflow Found more than one graph event per run
In this case the errors were due to multiple runs (with different parameters) written to the same output directory. The case here is about a single run to a clean output directory.
Running the MonitoredTrainingSession version in distributed mode gives the same errors.
Update Oct-12
@Nikhil Kothari suggested using tf.train.MonitoredSession instead of the larger tf.train.MonitoredTrainingSession wrapper, as follows:
import tensorflow as tf
global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name="global_step")
train = tf.assign(global_step, global_step + 1)
saver = tf.train.Saver()
hooks = [(tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test3/ckpt/",
                                       save_steps=10,
                                       saver=saver))]
chiefsession = tf.train.ChiefSessionCreator(scaffold=None,
                                            master='',
                                            config=None,
                                            checkpoint_dir=None,
                                            checkpoint_filename_with_path=None)
with tf.train.MonitoredSession(session_creator=chiefsession,
                               hooks=hooks,
                               stop_grace_period_secs=120) as mon_sess:
    for i in range(30):
        if mon_sess.should_stop():
            break
        try:
            gs, _ = mon_sess.run([global_step, train])
            print(gs)
        except (tf.errors.OutOfRangeError,tf.errors.CancelledError) as e:
            break
        finally:
            pass
Unfortunately this still gives the same tensorboard errors:
$ tensorboard --logdir ../train/test3/ --port=6006
WARNING:tensorflow:Found more than one graph event per run, or there
was a metagraph containing a graph_def, as well as one or more graph
events. Overwriting the graph with the newest event. Starting
TensorBoard 54 at local:6006 (Press CTRL+C to quit)
By the way, each code block is stand-alone; copy-paste it into a Jupyter notebook and you will replicate the problem.
I wonder if this is because every node in your cluster is running the same code, declaring itself as a chief, and saving out graphs and checkpoints.
I don't know if the is_chief = True is just illustrative in the post here on Stack Overflow or exactly what you are using, so I am guessing a bit here.
I personally used MonitoredSession instead of MonitoredTrainingSession and created a list of hooks based on whether the code is running on the master/chief or not. Example: https://github.com/TensorLab/tensorfx/blob/master/src/training/_trainer.py#L94
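The idea of choosing hooks by role can be sketched in plain Python as follows. This is only an illustration: build_hooks and the two factory functions are hypothetical names, and the strings stand in for real hook objects such as tf.train.CheckpointSaverHook, which are not constructed here.

```python
def build_hooks(is_chief, common_hook_factories, chief_hook_factories):
    """Return the hooks for one process.

    Every process gets the common hooks; only the chief additionally gets the
    hooks that write to disk (checkpoints, summaries), so a single process
    writes graph and checkpoint files to the output directory.
    """
    hooks = [make() for make in common_hook_factories]
    if is_chief:
        hooks.extend(make() for make in chief_hook_factories)
    return hooks


# Stand-in factories (assumptions); real code would build TensorFlow hooks here.
def make_logging_hook():
    return "logging_hook"


def make_checkpoint_hook():
    return "checkpoint_saver_hook"


chief_hooks = build_hooks(True, [make_logging_hook], [make_checkpoint_hook])
worker_hooks = build_hooks(False, [make_logging_hook], [make_checkpoint_hook])
print(chief_hooks)
print(worker_hooks)
```

The resulting list would then be passed as hooks to the monitored session created by each process.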
You should set the chief_only_hooks parameter of MonitoredTrainingSession, as follows:
hooks = [(tf.train.CheckpointSaverHook(checkpoint_dir=output_path + "test1/ckpt/",
                                       save_steps = 10,
                                       saver = saver))]
with tf.train.MonitoredTrainingSession(master = '',
                                       is_chief = True,
                                       checkpoint_dir = None,
                                       chief_only_hooks = hooks,
                                       save_checkpoint_secs = None,
                                       save_summaries_steps = None,
                                       save_summaries_secs = None) as mon_sess: