I am trying to fine-tune a transformer model for text classification, but I am having trouble training it. I have tried many things, including solutions from other questions, but none of them worked. I am using the 'microsoft/deberta-v3-base' model. Here's my code:
train_dataset = Dataset.from_pandas(df_tr[['text', 'label']]).class_encode_column("label")
val_dataset = Dataset.from_pandas(df_tes[['text', 'label']]).class_encode_column("label")
train_tok_dataset = train_dataset.map(tokenizer_func, batched=True, remove_columns=('text'))
val_tok_dataset = val_dataset.map(tokenizer_func, batched=True, remove_columns=('text'))
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained(config.model_name, num_labels=3)
transformer_model = TFAutoModelForSequenceClassification.from_pretrained(config.model_name, output_hidden_states=True)
input_ids = tf.keras.Input(shape=(config.max_len, ),dtype='int32')
attention_mask = tf.keras.Input(shape=(config.max_len, ), dtype='int32')
transformer = transformer_model([input_ids, attention_mask])
hidden_states = transformer[1] # get output_hidden_states
#print(hidden_states)
hidden_states_size = 4 # count of the last states
hiddes_states_ind = list(range(-hidden_states_size, 0, 1))
selected_hiddes_states = tf.keras.layers.concatenate(tuple([hidden_states[i] for i in hiddes_states_ind]))
# Now we can use selected_hiddes_states as we want
output = tf.keras.layers.Dense(128, activation='relu')(selected_hiddes_states)
output=tf.keras.layers.Flatten()(output)
output = tf.keras.layers.Dense(3, activation='softmax')(output)
model = tf.keras.models.Model(inputs = [input_ids, attention_mask], outputs = output)
from transformers import create_optimizer
import tensorflow as tf
batch_size = 8
num_epochs = config.epochs
#batches_per_epoch = len(tokenized_tweets["train"]) // batch_size
total_train_steps = int(num_steps * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=num_steps/2)
model.compile(optimizer=optimizer)
with tf.device('GPU:0'):
    model.fit(x=[np.array(train_tok_dataset["input_ids"]), np.array(train_tok_dataset["attention_mask"])],
              y=tf.keras.utils.to_categorical(y_train, num_classes=3),
              validation_data=([np.array(val_tok_dataset["input_ids"]), np.array(val_tok_dataset["attention_mask"])],
                               tf.keras.utils.to_categorical(y_test, num_classes=3)),
              epochs=config.epochs, class_weight={0: 0.57, 1: 0.18, 2: 0.39})
It seems like a small issue, but I am new to TensorFlow and transformers, so I couldn't sort it out myself.
This is probably because you are not passing a loss when compiling the model, so there is nothing to compute gradients with respect to:
model.compile(optimizer=optimizer)
^^^^^^^^^^^^^^^^^^^^---- no "loss = tf.keras.losses...
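To make the point concrete, here is a minimal NumPy sketch of what a categorical cross-entropy loss computes for one-hot labels and softmax outputs, which matches the setup in the question (to_categorical targets, softmax output layer). The numbers are made up for illustration. In Keras, the corresponding change would presumably be to pass a loss such as tf.keras.losses.CategoricalCrossentropy() to model.compile; whether that alone fixes the training is an assumption.

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    # Clip to avoid log(0) when a predicted probability is exactly 0 or 1.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -np.sum(y_true * np.log(y_pred), axis=1)
    return float(np.mean(per_sample))

# Two samples, three classes (hypothetical values, not from the question's data)
y_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0]])
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])

loss_value = categorical_cross_entropy(y_true, y_pred)
print(loss_value)
```

Without such a scalar, the optimizer has no quantity to differentiate, which is why compiling with only an optimizer leaves a custom Keras model untrainable.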
Maybe you're just missing an = on the right side of validation_data.
model.fit(
    x=[np.array(...), np.array(...)],
    y=tf.keras.utils.to_categorical(...),
    validation_data=([np.array(...), np.array(...)], tf.keras.utils.to_categorical(...)),
    ...
)
Good Afternoon Everyone,
I am currently having some trouble with TensorFlow: for some reason I get a shape error after about three and a half hours of running. The files are loaded using the TensorFlow input pipeline, which creates two reinitializable datasets, one for training and one for test. I know the data has the correct shape because I do a hardcoded reshape to the expected shape and have never got an error there. The problem is that, at some point while running the network, there is a sample that does not have the correct number of elements for the flatten operation, and the program crashes. The only explanation given is that the number of elements in the tensor is not divisible by 10 (my batch size), which honestly makes no sense to me, since the data has gone through exactly the same pipeline as the other batches, which run with no problem.
I can provide code if needed, but I think it is more a failure on my part to understand some concept of the framework.
Thanks in advance for all the help.
EDIT: Please find the code below. A bit of nomenclature: t corresponds to a layer that has time data (X), f corresponds to a layer that has frequency data (FREQ), q corresponds to a layer that contains cepstral data (QUEF), and tf corresponds to layers that contain 2-D data, spectrograms of X (SPECG). Y is the label. All data are tf.float32 except for the labels, which are tf.int64.
EDIT 2: The operation that causes the problem is the flatten on qsubnet_out.
EDIT 3: Probably the most important: it seems that some of the layers converge to NaNs.
Training loop:
for i in range(FLAGS.max_steps):
    start = time.time()
    sess.run([train], feed_dict={handle: train_handle})
    # Every 10 steps, evaluate summaries on the training and test handles
    if i % 10 == 0:
        summary_op, entropy, acc, expected, output = sess.run([merged, loss, accuracy, Y, tf.argmax(logit, 1)], feed_dict={handle: train_handle})
        summary_op, _, _ = sess.run([merged, loss, accuracy], feed_dict={handle: test_handle})
Training operations:
W = { 'tc1': [64,3], 'tc2':[128,3], 'tc3':[256,5], 'tc4': [128, 2],
'fc1': [64,3], 'fc2':[128,3], 'fc3':[256,5], 'fc4': [128, 2],
'qc1': [64,3], 'qc2':[128,3], 'qc3':[256,5], 'qc4': [128, 2],
'tfc1': [64,(3,3)], 'tfc2':[128,(3,3)], 'tfc3':[256,(5,5)], 'tfc4': [128, (2,2)],
'dense1': 1000, 'dense2': 100, 'dense3': 200,'dense4': 300, 'dense5': 200,
'out' : NUM_CLASSES
}
iter = tf.data.Iterator.from_string_handle(handle, train_dataset.output_types, train_dataset.output_shapes)
X,FREQ,QUEF,SPECG,Y = iter.get_next()
X.set_shape([FLAGS.batch_size,768,14])
FREQ.set_shape([FLAGS.batch_size,384,14])
QUEF.set_shape([FLAGS.batch_size,384,14])
SPECG.set_shape([FLAGS.batch_size,65,18,14])
logit = net.run(X,FREQ,QUEF,SPECG,W)
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=Y,logits=logit))
And the file net.py:
def run(X,FREQ,QUEF,SPECG,W):
time = tf.layers.batch_normalization(X,axis=-1,training=True,trainable=True)
freq = tf.layers.batch_normalization(FREQ,axis=-1,training=True,trainable=True)
quef = tf.layers.batch_normalization(QUEF,axis=-1,training=True,trainable=True)
time_freq = tf.layers.batch_normalization(SPECG,axis=-1,training=True,trainable=True)
regularizer = tf.contrib.layers.l2_regularizer(0.1)
#########################################################################################################
#### TIME SUBNET
with tf.device('/GPU:1'):
tc1 = tf.layers.conv1d(inputs=time,filters=W['tc1'][0],kernel_size=W['tc1'][1],strides=1,padding='SAME',kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tc1')
trelu1 = tf.nn.relu(features=tc1,name='trelu1')
tpool1 = tf.layers.max_pooling1d(trelu1,pool_size=2,strides=1)
tc2 = tf.layers.conv1d(inputs=tpool1,filters=W['tc2'][0],kernel_size=W['tc2'][1],strides=1,padding='SAME',kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tc2')
tc3 = tf.layers.conv1d(inputs=tc2,filters=W['tc3'][0],kernel_size=W['tc3'][1],strides=1,padding='SAME',kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tc3')
trelu2 = tf.nn.relu(tc3,name='trelu2')
tpool2 = tf.layers.max_pooling1d(trelu2,pool_size=2,strides=1)
tc4 = tf.layers.conv1d(inputs=tpool2,filters=W['tc4'][0],kernel_size=W['tc4'][1],strides=1,padding='SAME',kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tc4')
tsubnet_out = tf.nn.relu6(tc4,'trelu61')
#########################################################################################################
#### CEPSTRUM SUBNET (QUEFRENCIAL)
qc1 = tf.layers.conv1d(inputs=quef,filters=W['qc1'][0],kernel_size=W['qc1'][1],strides=1,padding='SAME',kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='qc1')
qrelu1 = tf.nn.relu(features=qc1,name='qrelu1')
qpool1 = tf.layers.max_pooling1d(qrelu1,pool_size=2,strides=1)
qc2 = tf.layers.conv1d(inputs=qpool1,filters=W['qc2'][0],kernel_size=W['qc2'][1],padding='SAME',strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='qc2')
qc3 = tf.layers.conv1d(inputs=qc2,filters=W['qc3'][0],kernel_size=W['qc3'][1],padding='SAME',strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='qc3')
qrelu2 = tf.nn.relu(qc3,name='qrelu2')
qpool2 = tf.layers.max_pooling1d(qrelu2,pool_size=2,strides=1)
qc4 = tf.layers.conv1d(inputs=qpool2,filters=W['qc4'][0],kernel_size=W['qc4'][1],padding='SAME',strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='qc4')
qsubnet_out = tf.nn.relu6(qc4,'qrelu61')
#########################################################################################################
#FREQ SUBNET
with tf.device('/GPU:1'):
fc1 = tf.layers.conv1d(inputs=freq,filters=W['fc1'][0],kernel_size=W['fc1'][1],padding='SAME',strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='fc1')
frelu1 = tf.nn.relu(features=fc1,name='frelu1')
fpool1 = tf.layers.max_pooling1d(frelu1,pool_size=2,strides=1)
fc2 = tf.layers.conv1d(inputs=fpool1,filters=W['fc2'][0],kernel_size=W['fc2'][1],padding='SAME',strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='fc2')
fc3 = tf.layers.conv1d(inputs=fc2,filters=W['fc3'][0],kernel_size=W['fc3'][1],padding='SAME',strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='fc3')
frelu2 = tf.nn.relu(fc3,name='frelu2')
fpool2 = tf.layers.max_pooling1d(frelu2,pool_size=2,strides=1)
fc4 = tf.layers.conv1d(inputs=fpool2,filters=W['fc4'][0],kernel_size=W['fc4'][1],padding='SAME',strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='fc4')
fsubnet_out = tf.nn.relu6(fc4,'frelu61')
########################################################################################################
## TIME/FREQ SUBNET
with tf.device('/GPU:0'):
tfc1 = tf.layers.conv2d(inputs=time_freq,filters=W['tfc1'][0],kernel_size=W['tfc1'][1],padding='SAME', strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tfc1')
tfrelu1 = tf.nn.relu(tfc1)
tfpool1 = tf.layers.max_pooling2d(tfrelu1,pool_size=[2, 2],strides=[1, 1])
tfc2 = tf.layers.conv2d(inputs=tfpool1,filters=W['tfc2'][0],kernel_size=W['tfc2'][1],padding='SAME', strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tfc2')
tfc3 = tf.layers.conv2d(inputs=tfc2,filters=W['tfc3'][0],kernel_size=W['tfc3'][1],padding='SAME', strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tfc3')
tfrelu2 = tf.nn.relu(tfc3)
tfpool2 = tf.layers.max_pooling2d(tfrelu2,pool_size=[2, 2], strides=[1, 1])
tfc4 = tf.layers.conv2d(inputs=tfpool2,filters=W['tfc4'][0],kernel_size=W['tfc4'][1],padding='SAME', strides=1,kernel_initializer=tf.initializers.random_normal,kernel_regularizer=regularizer,name='tfc4')
tfsubnet_out = tf.nn.relu6(tfc4,'tfrelu61')
########################################################################################################
##Flatten subnet outputs
tsubnet_out = tf.layers.flatten(tsubnet_out)
fsubnet_out = tf.layers.flatten(fsubnet_out)
tfsubnet_out = tf.layers.flatten(tfsubnet_out)
qsubnet_out = tf.layers.flatten(qsubnet_out)
#Final subnet computation
input_final = tf.concat((tsubnet_out,fsubnet_out,qsubnet_out,tfsubnet_out),1)
dense1 = tf.layers.dense(input_final,W['dense1'],tf.nn.relu, kernel_initializer=tf.initializers.random_normal,name='dense1')
dense2 = tf.layers.dense(dense1,W['dense2'],tf.nn.relu, kernel_initializer=tf.initializers.random_normal,name='dense2')
dense3 = tf.layers.dense(dense2,W['dense3'],tf.nn.relu, kernel_initializer=tf.initializers.random_normal,name='dense3')
dense4 = tf.layers.dense(dense3,W['dense4'],tf.nn.relu, kernel_initializer=tf.initializers.random_normal,name='dense4')
dense5 = tf.layers.dense(dense4,W['dense5'],tf.nn.relu, kernel_initializer=tf.initializers.random_normal,name='dense5')
out = tf.layers.dense(dense5,W['out'],tf.nn.relu, name='out')
return out
After some days, I was finally able to track down the problem. In the end it was not related to the code I posted, but to the creation of the TensorFlow Dataset: when batching, the length of the Dataset was not divisible by the batch size, so the last batch was smaller than the others. Setting the flag drop_remainder to True solved it.
I will not delete the question, since I believe it is a problem that more people may have in the future and the source is not easily identifiable.
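For anyone hitting the same thing, the arithmetic can be sketched in plain Python. The helper name batch_sizes and the sample count of 25 are made up for illustration; in tf.data the equivalent switch is, in versions that accept the argument, Dataset.batch(batch_size, drop_remainder=True).

```python
def batch_sizes(num_samples, batch_size, drop_remainder=False):
    """Return the size of every batch produced from num_samples items."""
    full_batches, leftover = divmod(num_samples, batch_size)
    sizes = [batch_size] * full_batches
    if leftover and not drop_remainder:
        # The final, shorter batch is the one that breaks a graph
        # whose shapes were fixed to batch_size.
        sizes.append(leftover)
    return sizes

print(batch_sizes(25, 10))                       # [10, 10, 5]
print(batch_sizes(25, 10, drop_remainder=True))  # [10, 10]
```

Because the short batch only shows up at the very end of a pass over the data, the error can take hours to appear even though every earlier batch went through the same pipeline.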
I've been working on this neural network with the intent of predicting the TBA (time-based availability) of simulated windmill parks from certain attributes. The neural network runs just fine and gives me some predictions, but I'm not quite satisfied with the results: it fails to notice some very obvious correlations that I can clearly see myself. Here is my current code:
`# Import
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
maxi = 0.96
mini = 0.7
# Make data a np.array
data = pd.read_csv('datafile_ML_no_avg.csv')
data = data.values
# Shuffle the data
shuffle_indices = np.random.permutation(np.arange(len(data)))
data = data[shuffle_indices]
# Training and test data
data_train = data[0:int(len(data)*0.8),:]
data_test = data[int(len(data)*0.8):int(len(data)),:]
# Scale data
scaler = MinMaxScaler(feature_range=(mini, maxi))
scaler.fit(data_train)
data_train = scaler.transform(data_train)
data_test = scaler.transform(data_test)
# Build X and y
X_train = data_train[:, 0:5]
y_train = data_train[:, 6:7]
X_test = data_test[:, 0:5]
y_test = data_test[:, 6:7]
# Number of input features in training data
n_args = X_train.shape[1]
multi = int(8)
# Neurons
n_neurons_1 = 8*multi
n_neurons_2 = 4*multi
n_neurons_3 = 2*multi
n_neurons_4 = 1*multi
# Session
net = tf.InteractiveSession()
# Placeholder
X = tf.placeholder(dtype=tf.float32, shape=[None, n_args])
Y = tf.placeholder(dtype=tf.float32, shape=[None,1])
# Initializers
sigma = 1
weight_initializer = tf.variance_scaling_initializer(mode="fan_avg",
distribution="uniform", scale=sigma)
bias_initializer = tf.zeros_initializer()
# Hidden weights
W_hidden_1 = tf.Variable(weight_initializer([n_args, n_neurons_1]))
bias_hidden_1 = tf.Variable(bias_initializer([n_neurons_1]))
W_hidden_2 = tf.Variable(weight_initializer([n_neurons_1, n_neurons_2]))
bias_hidden_2 = tf.Variable(bias_initializer([n_neurons_2]))
W_hidden_3 = tf.Variable(weight_initializer([n_neurons_2, n_neurons_3]))
bias_hidden_3 = tf.Variable(bias_initializer([n_neurons_3]))
W_hidden_4 = tf.Variable(weight_initializer([n_neurons_3, n_neurons_4]))
bias_hidden_4 = tf.Variable(bias_initializer([n_neurons_4]))
# Output weights
W_out = tf.Variable(weight_initializer([n_neurons_4, 1]))
bias_out = tf.Variable(bias_initializer([1]))
# Hidden layer
hidden_1 = tf.nn.relu(tf.add(tf.matmul(X, W_hidden_1), bias_hidden_1))
hidden_2 = tf.nn.relu(tf.add(tf.matmul(hidden_1, W_hidden_2),
bias_hidden_2))
hidden_3 = tf.nn.relu(tf.add(tf.matmul(hidden_2, W_hidden_3),
bias_hidden_3))
hidden_4 = tf.nn.relu(tf.add(tf.matmul(hidden_3, W_hidden_4),
bias_hidden_4))
# Output layer (transpose!)
out = tf.transpose(tf.add(tf.matmul(hidden_4, W_out), bias_out))
# Cost function
mse = tf.reduce_mean(tf.squared_difference(out, Y))
# Optimizer
opt = tf.train.AdamOptimizer().minimize(mse)
# Init
net.run(tf.global_variables_initializer())
# Fit neural net
batch_size = 10
mse_train = []
mse_test = []
# Run
epochs = 10
for e in range(epochs):
    # Shuffle training data
    shuffle_indices = np.random.permutation(np.arange(len(y_train)))
    X_train = X_train[shuffle_indices]
    y_train = y_train[shuffle_indices]
    # Minibatch training
    for i in range(0, len(y_train) // batch_size):
        start = i * batch_size
        batch_x = X_train[start:start + batch_size]
        batch_y = y_train[start:start + batch_size]
        # Run optimizer with batch
        net.run(opt, feed_dict={X: batch_x, Y: batch_y})
        # Show progress
        if np.mod(i, 50) == 0:
            mse_train.append(net.run(mse, feed_dict={X: X_train, Y: y_train}))
            mse_test.append(net.run(mse, feed_dict={X: X_test, Y: y_test}))
pred = net.run(out, feed_dict={X: X_test})
print(pred)`
I have tried tweaking the number of hidden layers, the number of nodes per layer, and the number of epochs, as well as trying different activation functions and optimizers. However, I am quite new to neural networks, so there might be something very obvious that I'm missing.
Thanks in advance to anyone who managed to read through all of that.
It would be much easier to help if you shared a small dataset that illustrates the problem. However, I will state some of the issues with non-standard datasets and how to overcome them.
Possible solutions
Regularization and validation-based optimization - methods that are always worth trying when looking for some extra accuracy. See the dropout methods (original paper) and overviews of them.
Unbalanced data - Sometimes the categories/events in a time series behave like anomalies, or are simply unbalanced. If you read a book, words like "the" or "it" will appear far more often than "warehouse" or the like. This can become a problem if your main task is to detect the word "warehouse" and you train your network (even LSTMs) in traditional ways. A way to overcome this problem is to balance the samples (creating balanced datasets) or to give more weight to low-frequency categories.
Model structure - sometimes fully connected layers are not enough. See computer vision problems, for instance, where we train using convolution layers. The convolution and pooling layers enforce structure on the model, which is suitable for images. This is also a form of regularization, since those layers have fewer parameters. In time-series problems, convolutions are also possible and turn out to work just fine. See the example in Conditional Time Series Forecasting with Convolution Neural Networks.
The above suggestions are presented in the order I would suggest trying them.
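As a small illustration of giving more weight to low-frequency categories, here is a hedged NumPy sketch. The helper name inverse_frequency_weights and the label array are made up for this example; the formula n_samples / (n_classes * count) is one common heuristic, not the only choice.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * count) so rare classes weigh more."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = len(labels) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical, imbalanced labels: class 0 is three times as frequent as 1 or 2
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
class_weight = inverse_frequency_weights(labels)
print(class_weight)
```

The resulting dictionary can then be used to scale each sample's contribution to the loss, so that mistakes on rare categories cost more than mistakes on common ones.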
Good luck!
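One more thing that may be worth checking, offered as an observation about the posted code rather than a confirmed diagnosis: the placeholder Y has shape [None, 1], while out is transposed to shape [1, None]. Elementwise TensorFlow ops such as tf.squared_difference broadcast in the same way NumPy does, so the loss may be averaging over an N x N matrix of all prediction/label pairs instead of N matched pairs. A NumPy sketch with made-up values shows the effect:

```python
import numpy as np

n = 4
y = np.arange(n, dtype=np.float32).reshape(n, 1)  # labels, shape (4, 1)
out = y.T.copy()                                  # "perfect" predictions, but shape (1, 4)

# (1, 4) against (4, 1) broadcasts to (4, 4): every prediction vs. every label
diff = (out - y) ** 2
broadcast_mse = float(diff.mean())

# With matching shapes the error of a perfect prediction is zero
matched_mse = float(((out.T - y) ** 2).mean())

print(diff.shape)      # (4, 4)
print(broadcast_mse)   # 2.5
print(matched_mse)     # 0.0
```

If this applies, dropping the transpose (or giving Y and out the same shape in some other way) would make the loss compare each prediction only with its own target.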