I'm trying to train a GPT-2 model on a custom dataset, but it fails with the error below.
ValueError: Unexpected result of `train_function` (Empty logs). Please use `Model.compile(..., run_eagerly=True)`, or `tf.config.run_functions_eagerly(True)` for more information of where went wrong, or file a issue/bug to `tf.keras`.
I thought the model and dataset were correctly defined and processed, following this article.
But the error shows up when model.fit is executed.
Can someone tell me how to resolve the error, or the proper way to train the model?
from transformers import TFGPT2LMHeadModel, GPT2Tokenizer
import tensorflow as tf
# Define the model
model = TFGPT2LMHeadModel.from_pretrained('gpt2', from_pt=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric], run_eagerly=True)
model.summary()
# Obtain the tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})
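# Note (an assumption on my part, not from the article): after adding new
# special tokens such as <pad>, the embedding matrix usually has to be resized
# to the enlarged vocabulary, e.g.
#   model.resize_token_embeddings(len(tokenizer))
# otherwise the ids of the new tokens fall outside the model's embedding table.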
# Get single string
paths = ['data.txt'] # each file only contains some sentences.
single_string = ''
for filename in paths:
    with open(filename, "r", encoding='utf-8') as f:
        x = f.read()
    single_string += x + tokenizer.eos_token
string_tokenized = tokenizer.encode(single_string)
print(string_tokenized)
# creating the TensorFlow dataset
examples = []
block_size = 100
BATCH_SIZE = 12
BUFFER_SIZE = 1000
for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])
inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
print(dataset)
# train the model
num_epoch = 10
history = model.fit(dataset, epochs=num_epoch) # <- shows the error
I saved my Skorch neural net model using the below code:
net_b = NeuralNetClassifier(
    Classifier_b,
    max_epochs=50,
    optimizer__momentum=0.9,
    lr=0.1,
    device=device,
)
#Fit the model on the full data
net_b.fit(merged_X_train, merged_Y_train);
#Test saving
import pickle
with open('MLP.pkl', 'wb') as f:
    pickle.dump(net_b, f)
When I try to load this model again and run it against test data, I receive the following error:
TypeError: forward() got an unexpected keyword argument 'baseline value'
This is my code:
#Split the data
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(
    rescaled_data, target='fetal_health',
    train_size=0.8, valid_size=0.1, test_size=0.1)
input_dim = f_df_toscale.shape[1]
output_dim = len(np.unique(f_target))
hidden_dim_a = 20
hidden_dim_b = 12
device = 'cpu'
class Classifier_b(nn.Module):
    def __init__(self,
                 input_dim=input_dim,
                 hidden_dim_a=hidden_dim_b,
                 output_dim=output_dim):
        super(Classifier_b, self).__init__()
        # Take the inputs and pass these to a hidden layer
        self.hidden = nn.Linear(input_dim, hidden_dim_b)
        # Take the hidden layer and pass it through an additional hidden layer
        self.hidden_b = nn.Linear(hidden_dim_a, hidden_dim_b)
        # Take the hidden layer and pass to a multi-neuron output
        self.output = nn.Linear(hidden_dim_b, output_dim)

    def forward(self, x):
        hidden = F.relu(self.hidden(x))
        hidden = F.relu(self.hidden_b(hidden))
        output = F.softmax(self.output(hidden))
        return output
#load the model
with open('MLP.pkl', 'rb') as f:
    model_MLP = pickle.load(f)
#Test the model
y_pred = model_MLP.predict(X_test)
ML = accuracy_score(y_test, y_pred)
print('The accuracy score for the MLP is ', ML)
When I run this model in the original notebook everything runs fine. But when I try to load my model from a saved state I get the error. Any idea why? I have nothing called 'baseline value'.
Thanks
Pickling and unpickling the whole model can be problematic if the code changes, so it is better to use
save_params() and load_params()
In your case:
net_b.save_params(f_params='some-file.pkl')
To load the model, first initialize it (initializing is very important) and then load the parameters:
new_net.initialize()
new_net.load_params(f_params='some-file.pkl')
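For completeness, a minimal sketch of the whole round trip (assuming the Classifier_b definition is importable wherever you load, and that 'some-file.pkl' is just a placeholder name):
from skorch import NeuralNetClassifier

# save only the learned parameters, not pickled Python objects
net_b.save_params(f_params='some-file.pkl')

# later: rebuild an identical net, initialize it, then load the parameters
new_net = NeuralNetClassifier(
    Classifier_b,
    max_epochs=50,
    optimizer__momentum=0.9,
    lr=0.1,
    device=device,
)
new_net.initialize()  # must be called before load_params
new_net.load_params(f_params='some-file.pkl')
y_pred = new_net.predict(X_test)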
I'm trying to train TFBertForNextSentencePrediction on my own corpus, not from scratch, but rather by taking the existing bert model with only a next sentence prediction head and further training it on a specific corpus of text (pairs of sentences). Then I want to use the trained model to extract sentence embeddings from the last hidden state for other texts.
Currently the problem I encounter is that after I train the Keras model I am not able to extract the hidden states of the last layer before the next sentence prediction head.
Below is the code. Here I only train it on a few sentences just to make sure the code works.
Any help will be greatly appreciated.
Thanks,
Ayala
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from datetime import datetime
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import ModelCheckpoint
from transformers import BertTokenizer, PreTrainedTokenizer, BertConfig, TFBertForNextSentencePrediction
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
PRETRAINED_MODEL = 'bert-base-uncased'
# set paths and file names
time_stamp = str(datetime.now().year) + "_" + str(datetime.now().month) + "_" + str(datetime.now().day) + "_" + \
str(datetime.now().hour) + "_" + str(datetime.now().minute)
model_name = "pretrained_nsp_model"
model_dir_data = model_name + "_" + time_stamp
model_fn = model_dir_data + ".h5"
base_path = os.path.dirname(__file__)
input_path = os.path.join(base_path, "input_data")
output_path = os.path.join(base_path, "output_models")
model_path = os.path.join(output_path, model_dir_data)
if not os.path.exists(model_path):
    os.makedirs(model_path)
# set model checkpoint
checkpoint = ModelCheckpoint(os.path.join(model_path, model_fn), monitor="val_loss", verbose=1, save_best_only=True,
save_weights_only=True, mode="min")
# read data
max_length = 512
def get_tokenizer(pretrained_model_name):
    tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)
    return tokenizer

def tokenize_nsp_data(A, B, max_length):
    data_inputs = tokenizer(A, B, add_special_tokens=True, max_length=max_length, truncation=True,
                            pad_to_max_length=True, return_attention_mask=True,
                            return_tensors="tf")
    return data_inputs

def get_data_features(data_inputs, max_length):
    data_features = {}
    for key in data_inputs:
        data_features[key] = sequence.pad_sequences(data_inputs[key], maxlen=max_length, truncating="post",
                                                    padding="post", value=0)
    return data_features

def get_transformer_model(transformer_model_name):
    # get transformer model
    config = BertConfig(output_attentions=True)
    config.output_hidden_states = True
    config.return_dict = True
    transformer_model = TFBertForNextSentencePrediction.from_pretrained(transformer_model_name, config=config)
    return transformer_model

def get_keras_model(transformer_model):
    # get keras model
    input_ids = tf.keras.layers.Input(shape=(max_length,), name='input_ids', dtype='int32')
    input_masks_ids = tf.keras.layers.Input(shape=(max_length,), name='attention_mask', dtype='int32')
    token_type_ids = tf.keras.layers.Input(shape=(max_length,), name='token_type_ids', dtype='int32')
    X = transformer_model({'input_ids': input_ids, 'attention_mask': input_masks_ids, 'token_type_ids': token_type_ids})[0]
    model = tf.keras.Model(inputs=[input_ids, input_masks_ids, token_type_ids], outputs=X)
    model.summary()
    model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  optimizer=tf.optimizers.Adam(learning_rate=0.00005), metrics=['accuracy'])
    return model

def get_metrices(true_values, pred_values):
    cm = confusion_matrix(true_values, pred_values)
    acc_score = accuracy_score(true_values, pred_values)
    f1 = f1_score(true_values, pred_values, average="binary")
    precision = precision_score(true_values, pred_values, average="binary")
    recall = recall_score(true_values, pred_values, average="binary")
    metrices = {'confusion_matrix': cm,
                'acc_score': acc_score,
                'f1': f1,
                'precision': precision,
                'recall': recall
                }
    for k, v in metrices.items():
        print(k, ':\n', v)
    return metrices
# get tokenizer
tokenizer = get_tokenizer(PRETRAINED_MODEL)
# train
prompt = ["Hello", "Hello", "Hello", "Hello"]
next_sentence = ["How are you?", "Pizza", "How are you?", "Pizza"]
train_labels = [0, 1, 0, 1]
train_labels = to_categorical(train_labels)
train_inputs = tokenize_nsp_data(prompt, next_sentence, max_length)
train_data_features = get_data_features(train_inputs, max_length)
# val
prompt = ["Hello", "Hello", "Hello", "Hello"]
next_sentence = ["How are you?", "Pizza", "How are you?", "Pizza"]
val_labels = [0, 1, 0, 1]
val_labels = to_categorical(val_labels)
val_inputs = tokenize_nsp_data(prompt, next_sentence, max_length)
val_data_features = get_data_features(val_inputs, max_length)
# get transformer model
transformer_model = get_transformer_model(PRETRAINED_MODEL)
# get keras model
model = get_keras_model(transformer_model)
callback_list = []
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=4, min_delta=0.005, verbose=1)
callback_list.append(early_stop)
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2, epsilon=0.001)
callback_list.append(reduce_lr)
callback_list.append(checkpoint)
history = model.fit([train_data_features['input_ids'], train_data_features['attention_mask'],
train_data_features['token_type_ids']], np.array(train_labels), batch_size=2, epochs=3,
validation_data=([val_data_features['input_ids'], val_data_features['attention_mask'],
val_data_features['token_type_ids']], np.array(val_labels)), verbose=1,
callbacks=callback_list)
model.layers[3].save_pretrained(model_path) # need to save this and make sure i can get the hidden states
## predict
# load model
transformer_model = get_transformer_model(model_path)
model = get_keras_model(transformer_model)
model.summary()
model.load_weights(os.path.join(model_path, model_fn))
# test
prompt = ["Hello", "Hello"]
next_sentence = ["How are you?", "Pizza"]
test_labels = [0, 1]
test_df = pd.DataFrame({'A': prompt, 'B': next_sentence, 'label': test_labels})
test_labels = to_categorical(val_labels)
test_inputs = tokenize_nsp_data(prompt, next_sentence, max_length)
test_data_features = get_data_features(test_inputs, max_length)
# predict
pred_test = model.predict([test_data_features['input_ids'], test_data_features['attention_mask'], test_data_features['token_type_ids']])
preds = tf.keras.activations.softmax(tf.convert_to_tensor(pred_test)).numpy()
true_test = test_df['label'].to_list()
pred_test = [1 if p[1] > 0.5 else 0 for p in preds]
test_df['pred_val'] = pred_test
metrices = get_metrices(true_test, pred_test)
I am also attaching a picture from debugging mode in which I try (with no success) to view the hidden states. The problem is that I am not able to see and save the transformer model I trained and view the embeddings of the last hidden state. I tried converting the KerasTensor to a numpy array, but without success.
The issue resides in your get_keras_model() function. You defined there that you are only interested in the first element of the output (i.e. the logits) with:
X = transformer_model({'input_ids': input_ids, 'attention_mask': input_masks_ids, 'token_type_ids': token_type_ids})[0]
Make the index selection conditional, like this, to get the whole output of the model:
def get_keras_model(transformer_model, is_training=True):
    ### your other code
    X = transformer_model({'input_ids': input_ids, 'attention_mask': input_masks_ids, 'token_type_ids': token_type_ids})
    if is_training:
        X = X[0]
    ### your other code
    return model

# predict
### your other code
model = get_keras_model(transformer_model, is_training=False)
### your other code
print(pred_test.keys())
Output:
odict_keys(['logits', 'hidden_states', 'attentions'])
P.S.: The BertTokenizer can truncate and add padding by itself (documentation).
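Coming back to your original goal, here is a minimal sketch of extracting sentence embeddings from the full output (assuming output_hidden_states=True as set in get_transformer_model; the [CLS] and mean-pooling variants are just two common choices):
# pred_test['hidden_states'] holds one array per layer, each of shape
# (batch_size, sequence_length, hidden_size); the last entry is the last layer
last_hidden = pred_test['hidden_states'][-1]
cls_embeddings = last_hidden[:, 0, :]               # [CLS] token embedding
sentence_embeddings = np.mean(last_hidden, axis=1)  # mean over all tokens
And following the P.S., passing truncation=True together with padding='max_length' in the tokenizer call makes the manual padding in get_data_features unnecessary.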
I'm starting out with Keras, creating a model to classify text labels from a couple of text features, with a single output. I have one function to create the model and another one to test it on a different dataset.
I'm still trying to fine-tune the model's predictions, but I'd like to understand why my test function gets different results every time the model is recreated. Is that usual? Also, I'd appreciate any tips to improve the model's accuracy.
def create_model(model_name, data, test_data):
    # lets take 80% data as training and remaining 20% for test.
    train_size = int(len(data) * .9)
    test_size = int(len(data) * .4)
    train_headlines = data['Subject']
    train_category = data['Category']
    train_activities = data['Activity']
    test_headlines = data['Subject'][:test_size]
    test_category = data['Category'][:test_size]
    test_activities = data['Activity'][:test_size]
    # define Tokenizer with Vocab Sizes
    vocab_size1 = 10000
    vocab_size2 = 5000
    batch_size = 100
    tokenizer = Tokenizer(num_words=vocab_size1)
    tokenizer2 = Tokenizer(num_words=vocab_size2)
    test_headlines = test_headlines.astype(str)
    train_headlines = train_headlines.astype(str)
    test_category = test_category.astype(str)
    train_category = train_category.astype(str)
    tokenizer.fit_on_texts(test_headlines)
    tokenizer2.fit_on_texts(test_category)
    x_train = tokenizer.texts_to_matrix(train_headlines, mode='tfidf')
    x_test = tokenizer.texts_to_matrix(test_headlines, mode='tfidf')
    y_train = tokenizer2.texts_to_matrix(train_category, mode='tfidf')
    y_test = tokenizer2.texts_to_matrix(test_category, mode='tfidf')
    # load classes
    labels = []
    encoder = LabelBinarizer()
    encoder.fit(train_activities)
    text_labels = encoder.classes_
    with open('outputs/classes.txt', 'w') as f:
        for item in text_labels:
            f.write("%s\n" % item)
    z_train = encoder.transform(train_activities)
    z_test = encoder.transform(test_activities)
    num_classes = len(text_labels)
    print("num_classes: " + str(num_classes))
    input1 = Input(shape=(vocab_size1,), name='main_input')
    x1 = Dense(512, activation='relu')(input1)
    x1 = Dense(64, activation='relu')(x1)
    x1 = Dense(64, activation='relu')(x1)
    input2 = Input(shape=(vocab_size2,), name='cat_input')
    main_output = Dense(num_classes, activation='softmax', name='main_output')(x1)
    model = Model(inputs=[input1, input2], outputs=[main_output])
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    model.summary()
    history = model.fit([x_train, y_train], z_train,
                        batch_size=batch_size,
                        epochs=30,
                        verbose=1,
                        validation_split=0.1)
    score = model.evaluate([x_test, y_test], z_test,
                           batch_size=batch_size, verbose=1)
    print('Test accuracy:', score[1])
    # serialize model to JSON
    model_json = model.to_json()
    with open("./outputs/my_model_" + model_name + ".json", "w") as json_file:
        json_file.write(model_json)
    # creates a HDF5 file 'my_model.h5'
    model.save('./outputs/my_model_' + model_name + '.h5')
    # Save Tokenizer i.e. Vocabulary
    with open('./outputs/tokenizer' + model_name + '.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
def validate_model(model_name, test_data, labels):
    from keras.models import model_from_json
    test_data['Subject'] = test_data['Subject'] + " " + test_data['Description']
    headlines = test_data['Subject'].astype(str)
    categories = test_data['Category'].astype(str)
    # load json and create model
    json_file = open("./outputs/my_model_" + model_name + ".json", 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    model = model_from_json(loaded_model_json)
    # load weights into new model
    model.load_weights('./outputs/my_model_' + model_name + '.h5')
    print("Loaded model from disk")
    # loading
    import pickle
    with open('./outputs/tokenizer' + model_name + '.pickle', 'rb') as handle:
        tokenizer = pickle.load(handle)
    # Subjects
    x_pred = tokenizer.texts_to_matrix(headlines, mode='tfidf')
    # Categories
    y_pred = tokenizer.texts_to_matrix(categories, mode='tfidf')
    predictions = []
    scores = []
    predictions_vetor = model.predict({'main_input': x_pred, 'cat_input': y_pred})
I read your training code, quoted below.
model.fit([x_train,y_train], z_train,
batch_size=batch_size,
epochs=30,
verbose=1,
validation_split=0.1)
You are using [x_train, y_train] as features and z_train as labels for your model. y_train is the raw form of the label and z_train is the encoded form of the label.
This way you are leaking label information into the training features, hence resulting in an over-fitting situation. Your model is not generalised at all, and is therefore predicting irrelevant results.
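A minimal sketch of the corrected training under that reading, reusing the names from the question: drop the label-derived matrix from the features (the cat_input branch can go as well, since it never feeds the output):
# single-input model: only the Subject tf-idf matrix is used as a feature
input1 = Input(shape=(vocab_size1,), name='main_input')
x1 = Dense(512, activation='relu')(input1)
x1 = Dense(64, activation='relu')(x1)
x1 = Dense(64, activation='relu')(x1)
main_output = Dense(num_classes, activation='softmax', name='main_output')(x1)
model = Model(inputs=input1, outputs=main_output)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(x_train, z_train,
                    batch_size=batch_size,
                    epochs=30,
                    verbose=1,
                    validation_split=0.1)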
I am trying to do binary text classification on custom data (in csv format) using the different transformer architectures that the Hugging Face 'Transformers' library offers. I am using this Tensorflow blog post as reference.
I am loading the custom dataset into 'tf.data.Dataset' format using the following code:
def get_dataset(file_path, **kwargs):
    dataset = tf.data.experimental.make_csv_dataset(
        file_path,
        batch_size=5,  # Artificially small to make examples easier to show.
        na_value="",
        num_epochs=1,
        ignore_errors=True,
        **kwargs)
    return dataset
After this, I tried using the 'glue_convert_examples_to_features' method to tokenize, as below:
train_dataset = glue_convert_examples_to_features(
    examples=train_data,
    tokenizer=tokenizer,
    task=None,
    label_list=['0', '1'],
    max_length=128
)
This throws the error "UnboundLocalError: local variable 'processor' referenced before assignment" at:
if is_tf_dataset:
    example = processor.get_example_from_tensor_dict(example)
    example = processor.tfds_map(example)
In all the examples, I see that they use pre-defined tasks like 'mrpc', which have a glue processor to handle them. The error is raised at line 85 in the source code.
Can anyone help with solving this issue for 'custom data'?
I had the same starting problem.
This Kaggle submission helped me a lot. There you can see how you can tokenize the data according to the chosen pre-trained model:
import numpy as np
from tqdm import tqdm
from transformers import BertTokenizer
from keras.preprocessing.sequence import pad_sequences
bert_model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(bert_model_name, do_lower_case=True)
MAX_LEN = 128
def tokenize_sentences(sentences, tokenizer, max_seq_len=128):
    tokenized_sentences = []
    for sentence in tqdm(sentences):
        tokenized_sentence = tokenizer.encode(
            sentence,                 # Sentence to encode.
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
            max_length=max_seq_len,   # Truncate all sentences.
        )
        tokenized_sentences.append(tokenized_sentence)
    return tokenized_sentences
def create_attention_masks(tokenized_and_padded_sentences):
    attention_masks = []
    for sentence in tokenized_and_padded_sentences:
        att_mask = [int(token_id > 0) for token_id in sentence]
        attention_masks.append(att_mask)
    return np.asarray(attention_masks)
input_ids = tokenize_sentences(df_train['comment_text'], tokenizer, MAX_LEN)
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long", value=0, truncating="post", padding="post")
attention_masks = create_attention_masks(input_ids)
After that you should split ids and masks:
from sklearn.model_selection import train_test_split
labels = df_train[label_cols].values
train_ids, validation_ids, train_labels, validation_labels = train_test_split(input_ids, labels, random_state=0, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, labels, random_state=0, test_size=0.1)
train_size = len(train_ids)
validation_size = len(validation_ids)
Furthermore, I looked into the source of glue_convert_examples_to_features. There you can see how a tf.data.Dataset compatible with the BERT model can be created. I created a function for this:
def create_dataset(ids, masks, labels):
    def gen():
        for i in range(len(ids)):  # iterate over the passed-in ids, not a global
            yield (
                {
                    "input_ids": ids[i],
                    "attention_mask": masks[i]
                },
                labels[i],
            )
    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None])
            },
            tf.TensorShape([None]),
        ),
    )
train_dataset = create_dataset(train_ids, train_masks, train_labels)
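One caveat from my side (an addition, not from the Kaggle kernel): from_generator yields single examples, so the dataset usually still has to be shuffled and batched before fitting, for example:
train_dataset = train_dataset.shuffle(100).batch(16)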
I then use the dataset like this:
from transformers import TFBertForSequenceClassification, BertConfig
model = TFBertForSequenceClassification.from_pretrained(
    bert_model_name,
    config=BertConfig.from_pretrained(bert_model_name, num_labels=20)
)
# Prepare training: Compile tf.keras model with optimizer, loss and learning rate schedule
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.CategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=loss, metrics=[metric])
# Train and evaluate using tf.keras.Model.fit()
history = model.fit(train_dataset, epochs=1, steps_per_epoch=115, validation_data=val_dataset, validation_steps=7)
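Note that val_dataset is not shown above; under the same assumptions it can be built from the validation split with the same helper, e.g.:
val_dataset = create_dataset(validation_ids, validation_masks, validation_labels).batch(16)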
I am testing building a model in Keras and then converting it into a TensorFlow format so that I can run predictions via the TensorFlow C++ API. I'm adapting from this tutorial. I am testing on the MNIST dataset and have built my model in Keras:
inpt = keras.layers.Input(shape = (28,28,1), name = "input_node")
x = keras.layers.Convolution2D(16, 2, padding = 'same', activation = 'relu')(inpt)
x = keras.layers.MaxPool2D(pool_size = 2)(x)
x = keras.layers.Convolution2D(32, 2, padding = 'same', activation = 'relu')(x)
x = keras.layers.MaxPool2D(pool_size = 2)(x)
x = keras.layers.Flatten()(x)
x = keras.layers.Dense(128, activation = 'relu')(x)
output = keras.layers.Dense(10, activation = 'softmax', name = "output_node")(x)
model = keras.models.Model(inpt,output)
model.compile(optimizer = keras.optimizers.Adam(lr = 0.0001), loss = 'categorical_crossentropy', metrics = ['accuracy'])
and then used the model_to_estimator function:
estimator_model = tf.keras.estimator.model_to_estimator(keras_model = model, model_dir = './TF_MNIST')
This works well, and I can train using:
estimator_model.train(input_fn = input_function(X_train,y_train,True))
However, I'm trying to use freeze_graph as follows:
checkpoint_state_name = "model.ckpt-21001.index"
input_graph_name = "graph.pbtxt"
output_graph_name = "output_graph.pb"
input_graph_path = os.path.join('./TF_MNIST', input_graph_name)
input_saver_def_path = ""
input_binary = False
input_checkpoint_path = os.path.join('./TF_MNIST', checkpoint_state_name)
output_node_names = "output_node"
restore_op_name = "save/restore_all"
filename_tensor_name = "save/Const:0"
output_graph_path = os.path.join('./TF_MNIST', output_graph_name)
clear_devices = False
freeze_graph.freeze_graph(input_graph_path, input_saver_def_path,
input_binary, input_checkpoint_path,
output_node_names, restore_op_name,
filename_tensor_name, output_graph_path,
clear_devices, initializer_nodes = "input_node")
where I have chosen the name output_graph.pb as the destination for the generated frozen graph.
I'm getting the following error:
ValueError Traceback (most recent call last)
<ipython-input-69-215edbaaf017> in <module>()
3 output_node_names, restore_op_name,
4 filename_tensor_name, output_graph_path,
----> 5 clear_devices, initializer_nodes = "input_node")
ValueError: No variables to save
In the example in the tutorial, there is no input argument initializer_nodes so I assumed it was the name of the input node. Additionally, when I use checkpoint files that are not .index files, it gives a Data loss warning, saying that the data is not in the right format.
Questions:
1. How can I fix this error?
2. Why is the .index checkpoint file the right one to use (if, indeed, it is correct)?
3. The tutorial has an input_graph.pb graph, whereas mine is a .pbtxt. Why is that?
4. Can I introduce tf.Session() into my Keras model to store and print accuracy scores? Currently these are not printed during training, nor are they stored in the checkpoint file being read by TensorBoard.
Any help on any of these questions would be much appreciated.