I have an unbalanced dataset (only 0.06% of the data is labeled 1; the rest is labeled 0). From my research I concluded that I had to resample the data, so I used the imblearn package's RandomUnderSampler on my dataset. Then I built a neural network with a Keras Sequential model. During training, the F1 score rises to around 75% (the 1000th-epoch result is: loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472), but on the test set the result is disappointing (loss: 55.35181%, acc: 79.25248%, f1_m: 0.39789%, precision_m: 0.23259%, recall_m: 1.54982%).
My assumption is that because the numbers of 1s and 0s in the (undersampled) training set are equal, the class_weights are both set to 1, so the network is not penalized much for wrong predictions on the 1s.
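(As a sanity check, not part of the training code: scikit-learn's 'balanced' heuristic computes weight_c = n_samples / (n_classes * count_c), so on the undersampled, perfectly balanced training set it necessarily returns 1.0 for both classes.)

import numpy as np
from sklearn.utils import class_weight

# 985 samples of each class, as in the undersampled set of 1970 rows
y_balanced = np.array([0] * 985 + [1] * 985)
print(class_weight.compute_class_weight('balanced', classes=np.array([0, 1]), y=y_balanced))
# -> [1. 1.]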
I have tried techniques such as reducing the number of layers, reducing the number of neurons, and using regularization and dropout, but the test-set F1 score never goes above 0.5%. What should I do? Thanks.
my neural network:
def neural_network(X, y, epochs_count=3, handle_overfit=False):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=len(X_test.columns), activation='relu'))
    if handle_overfit:
        model.add(Dropout(rate=0.5))
    model.add(Dense(8, activation='relu', kernel_regularizer=regularizers.l1(0.1)))
    if handle_overfit:
        model.add(Dropout(rate=0.1))
    model.add(Dense(1, activation='sigmoid'))

    # compile the model
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc', f1_m, precision_m, recall_m])

    # compute weights of the classes '0' and '1' automatically
    class_weights = class_weight.compute_class_weight('balanced', [0, 1], y)
    print("---------------------- \n chosen class_weights are: ", class_weights, " \n ---------------------")

    # fit the model
    model.fit(X, y, epochs=epochs_count, batch_size=512, class_weight=class_weights)
    return model
defining the train and test sets:
train_set, test_set = train_test_split(data, test_size=0.35, random_state=0)
X_train = train_set[['..... some columns ....']]
y_train = train_set[['success']]
print('Initial dataset shape: ', X_train.shape)
rus = RandomUnderSampler(random_state=42)
X_undersampled, y_undersampled = rus.fit_sample(X_train, y_train)
print('undersampled dataset shape: ', X_undersampled.shape)
and the result is:
Initial dataset shape: (1625843, 11)
undersampled dataset shape: (1970, 11)
and finally, calling the neural network:
print (X_undersampled.shape, y_undersampled.shape)
print (X_test.shape, y_test.shape)
model = neural_network(X_undersampled, y_undersampled, 1000, handle_overfit=True)
# evaluate the model
print("\n---------------\nEvaluated on test set:")
scores = model.evaluate(X_test, y_test)
for i in range(len(model.metrics_names)):
    print("%s: %.5f%%" % (model.metrics_names[i], scores[i]*100))
and the result is:
(1970, 11) (1970,)
(875454, 11) (875454, 1)
----------------------
chosen class_weights are: [1. 1.]
---------------------
Epoch 1/1000
1970/1970 [==============================] - 4s 2ms/step - loss: 4.5034 - acc: 0.5147 - f1_m: 0.3703 - precision_m: 0.5291 - recall_m: 0.2859
.
.
.
.
Epoch 999/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5705 - acc: 0.7538 - f1_m: 0.7471 - precision_m: 0.7668 - recall_m: 0.7296
Epoch 1000/1000
1970/1970 [==============================] - 0s 6us/step - loss: 0.5691 - acc: 0.7543 - f1_m: 0.7525 - precision_m: 0.7582 - recall_m: 0.7472
---------------
Evaluated on test set:
875454/875454 [==============================] - 49s 56us/step
loss: 55.35181%
acc: 79.25248%
f1_m: 0.39789%
precision_m: 0.23259%
recall_m: 1.54982%
I have been given 10,000 images of shape (100, 100) representing detections of particles. I then created 10,000 empty images of shape (100, 100) and mixed them together, giving each type its own label (1 for the particle images, 0 for the empty ones), as seen in the code here:
Labels = np.append(np.ones(10000), np.zeros(empty_sheets.shape[0]))

s = images.shape                # (assumed) shape of the particle-image array, i.e. (10000, 100, 100)
images_scale1 = np.zeros(s)     # scaling each image so that it has a maximum value of 1
l = s[0]
for i in range(l):
    images_scale1[i] = images[i] / np.amax(images[i])

empty_sheets_noise1 = add_noise(empty_sheets, 0)

scale1noise1 = np.concatenate((images_scale1, empty_sheets_noise1), axis=0)
y11 = Labels

scale1noise1s, y11s = shuffle(scale1noise1, y11)

scale1noise1s_train, scale1noise1s_test, y11s_train, y11s_test = train_test_split(
    scale1noise1s, y11, test_size=0.25)

# reshaping image arrays so that they can be passed through the CNN
scale1noise1s_train = scale1noise1s_train.reshape(scale1noise1s_train.shape[0], 100, 100, 1)
scale1noise1s_test = scale1noise1s_test.reshape(scale1noise1s_test.shape[0], 100, 100, 1)
y11s_train = y11s_train.reshape(y11s_train.shape[0], 1)
y11s_test = y11s_test.reshape(y11s_test.shape[0], 1)
Then to set up my model I create a new function:
def create_model():
    # initiates a new model
    model = keras.models.Sequential()
    model.add(keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(100, 100, 1)))
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(32))
    model.add(keras.layers.Dense(64))
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    return model
estimators1m1 = create_model()
estimators1m1.compile(optimizer='adam',
                      metrics=['accuracy', tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
                      loss='binary_crossentropy')
history = estimators1m1.fit(scale1noise1s_train, y11s_train, epochs=3,
                            validation_data=(scale1noise1s_test, y11s_test))
which produces the following:
Epoch 1/3
469/469 [==============================] - 62s 131ms/step - loss: 0.6939 - accuracy: 0.4917 - precision_2: 0.4905 - recall_2: 0.4456 - val_loss: 0.6933 - val_accuracy: 0.5012 - val_precision_2: 0.5012 - val_recall_2: 1.0000
Epoch 2/3
469/469 [==============================] - 63s 134ms/step - loss: 0.6889 - accuracy: 0.5227 - precision_2: 0.5209 - recall_2: 0.5564 - val_loss: 0.6976 - val_accuracy: 0.4994 - val_precision_2: 0.5014 - val_recall_2: 0.2191
Epoch 3/3
469/469 [==============================] - 59s 127ms/step - loss: 0.6527 - accuracy: 0.5783 - precision_2: 0.5764 - recall_2: 0.5887 - val_loss: 0.7298 - val_accuracy: 0.5000 - val_precision_2: 0.5028 - val_recall_2: 0.2131
I have tried more epochs, and I still only manage to get about 50% accuracy, which is useless since the model is just predicting the same class constantly.
There can be many reasons why your model is not working. The most likely one is that the model is under-fitting: accuracy is low on both the training and validation sets, meaning the network is unable to capture the pattern in the data. You should therefore consider building a somewhat more complex model by adding more layers, while still guarding against over-fitting with techniques like dropout. You should also find good hyperparameters through hyperparameter tuning.
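Purely as an illustration (the exact layer sizes are arbitrary, not something the answer above prescribes), a slightly more complex version of create_model could add a second convolution block, ReLU activations on the hidden dense layers, and keep dropout for regularization:

from tensorflow import keras

def create_deeper_model():
    # Sketch of a higher-capacity network: two conv/pool blocks instead of one,
    # ReLU on the hidden dense layers, and dropout to limit over-fitting.
    model = keras.models.Sequential()
    model.add(keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(100, 100, 1)))
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(keras.layers.MaxPooling2D((2, 2)))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(64, activation='relu'))
    model.add(keras.layers.Dropout(0.2))
    model.add(keras.layers.Dense(1, activation='sigmoid'))
    return model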
I have a dataset of 50000 items: reviews & sentiment (positive or negative)
I distributed 90% to the training set and the rest to the testing set.
My question is, if I run 5 epochs on the training set that I have, shouldn't each epoch load 9000 instead of 1407?
# to divide train & test sets
test_sample_size = int(0.1*len(preprocessed_reviews)) # 10% of data as the validation set
# for sentiment
sentiment = [1 if x=='positive' else 0 for x in sentiment]
# separate data to train & test sets
X_test, X_train = (np.array(preprocessed_reviews[:test_sample_size]),
                   np.array(preprocessed_reviews[test_sample_size:]))
y_test, y_train = (np.array(sentiment[:test_sample_size]),
                   np.array(sentiment[test_sample_size:]))
tokenizer = Tokenizer(oov_token='<OOV>') # for the unknown words
tokenizer.fit_on_texts(X_train)
vocab_count = len(tokenizer.word_index) + 1 # +1 is for padding
training_sequences = tokenizer.texts_to_sequences(X_train) # tokenizer.word_index to see indexes
training_padded = pad_sequences(training_sequences, padding='post') # pad sequences with 0s
training_normal = preprocessing.normalize(training_padded) # normalize data
testing_sequences = tokenizer.texts_to_sequences(X_test)
testing_padded = pad_sequences(testing_sequences, padding='post')
testing_normal = preprocessing.normalize(testing_padded)
input_length = len(training_normal[0]) # length of all sequences
# build a model
model = keras.models.Sequential()
model.add(keras.layers.Embedding(input_dim=vocab_count, output_dim=2,input_length=input_length))
model.add(keras.layers.GlobalAveragePooling1D())
model.add(keras.layers.Dense(63, activation='relu')) # hidden layer
model.add(keras.layers.Dense(16, activation='relu')) # hidden layer
model.add(keras.layers.Dense(1, activation='sigmoid')) # output layer
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
model.fit(training_normal, y_train, epochs=5)
Output:
Epoch 1/5
1407/1407 [==============================] - 9s 7ms/step - loss: 0.6932 - accuracy: 0.4992
Epoch 2/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5030
Epoch 3/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.4987
Epoch 4/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5024
Epoch 5/5
1407/1407 [==============================] - 9s 6ms/step - loss: 0.6932 - accuracy: 0.5020
Sorry, I'm quite new to TensorFlow; I hope someone can help out!
So if you have around 50,000 data points split with a 90/10 ratio (train/test), that means ~45,000 will be training data and the remaining ~5,000 will be for testing.
When you call the fit method, Keras uses a default batch_size of 32 (you can always change that to 64, 128, ...).
So the number 1407 tells you that the model needs to perform 1407 feed-forward and backpropagation steps before one full epoch is completed (because 1407 * 32 ≈ 45,000).
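To make the arithmetic explicit, here is a tiny sketch (assuming the ~45,000 training samples and default batch size mentioned above):

import math

n_train = 45000        # ~90% of the 50,000 reviews
batch_size = 32        # Keras default when fit() is called without batch_size

# Number of weight-update steps (batches) per epoch, which is what the
# "1407/1407" in the progress bar counts.
print(math.ceil(n_train / batch_size))   # -> 1407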
I'm trying to train a simple model for the Yelp binary classification task.
Load BERT encoder:
gs_folder_bert = "gs://cloud-tpu-checkpoints/bert/keras_bert/uncased_L-12_H-768_A-12"
bert_config_file = os.path.join(gs_folder_bert, "bert_config.json")
config_dict = json.loads(tf.io.gfile.GFile(bert_config_file).read())
bert_config = bert.configs.BertConfig.from_dict(config_dict)
_, bert_encoder = bert.bert_models.classifier_model(
    bert_config, num_labels=2)

checkpoint = tf.train.Checkpoint(model=bert_encoder)
checkpoint.restore(
    os.path.join(gs_folder_bert, 'bert_model.ckpt')).assert_consumed()
Load data:
data, info = tfds.load('yelp_polarity_reviews', with_info=True, batch_size=-1, as_supervised=True)
train_x_orig, train_y_orig = tfds.as_numpy(data['train'])
train_x = encode_examples(train_x_orig)
train_y = train_y_orig
Use BERT to embed the data:
encoder_output = bert_encoder.predict(train_x)
Setup the model:
inputs = keras.Input(shape=(768,))
x = keras.layers.Dense(64, activation='relu')(inputs)
x = keras.layers.Dense(8, activation='relu')(x)
outputs = keras.layers.Dense(1, activation='sigmoid')(x)
model = keras.Model(inputs=inputs, outputs=outputs)
sgd = SGD(lr=0.0001)
model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy'])
Train:
model.fit(encoder_output[0], train_y, batch_size=64, epochs=3)
# encoder_output[0].shape === (10000, 1, 768)
# y_train.shape === (100000,)
Training results:
Epoch 1/5
157/157 [==============================] - 1s 5ms/step - loss: 0.6921 - accuracy: 0.5455
Epoch 2/5
157/157 [==============================] - 1s 5ms/step - loss: 0.6918 - accuracy: 0.5455
Epoch 3/5
157/157 [==============================] - 1s 5ms/step - loss: 0.6915 - accuracy: 0.5412
Epoch 4/5
157/157 [==============================] - 1s 5ms/step - loss: 0.6913 - accuracy: 0.5407
Epoch 5/5
157/157 [==============================] - 1s 5ms/step - loss: 0.6911 - accuracy: 0.5358
I have tried different learning rates, but the main issue seems to be that each epoch takes only about a second and the accuracy stays at ~0.5. Am I not setting up the inputs/model correctly?
Your BERT model is not training. It has to be placed before the dense layers and trained as part of the model: the input layer has to take not pre-computed BERT vectors, but the sequence of tokens, cropped to max_length and padded. Here is example code: https://keras.io/examples/nlp/text_extraction_with_bert/ (see the beginning of the create_model function).
Alternatively, you can use the Trainer class from transformers.
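For illustration only, here is a minimal sketch of that idea using the Hugging Face transformers package rather than the TF Model Garden bert module from the question; the encoder sits inside the Keras model, so its weights are fine-tuned together with the classification head (the model name, max_length, and learning rate are assumptions):

import tensorflow as tf
from tensorflow import keras
from transformers import TFBertModel

max_length = 128  # assumed maximum token-sequence length

# The encoder is a layer inside the model, so it is trained end-to-end.
bert_layer = TFBertModel.from_pretrained("bert-base-uncased")

input_ids = keras.Input(shape=(max_length,), dtype=tf.int32, name="input_ids")
attention_mask = keras.Input(shape=(max_length,), dtype=tf.int32, name="attention_mask")

# [0] selects the sequence output; the [CLS] token embedding serves as a pooled feature.
sequence_output = bert_layer(input_ids, attention_mask=attention_mask)[0]
cls_embedding = sequence_output[:, 0, :]

x = keras.layers.Dense(64, activation="relu")(cls_embedding)
output = keras.layers.Dense(1, activation="sigmoid")(x)

model = keras.Model(inputs=[input_ids, attention_mask], outputs=output)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy",
              metrics=["accuracy"])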
I have created a CNN model for classifying text data. Please help me interpret my results and tell me why my training accuracy is less than my validation accuracy.
I have a total of 2,619 samples, all of them text, belonging to two different classes. Here is a sample of my dataset.
The validation set has 34 samples; the rest of the 2,619 are training data.
I have done RepeatedKFold cross-validation. Here is my code.
from sklearn.model_selection import RepeatedKFold

kf = RepeatedKFold(n_splits=75, n_repeats=1, random_state=42)
for train_index, test_index in kf.split(X, Y):
    # print("Train:", train_index, "Validation:", test_index)
    x_train, x_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = Y.iloc[train_index], Y.iloc[test_index]
I have used a CNN. Here is my model.
model = Sequential()
model.add(Embedding(2900, 2, input_length=1))
model.add(Conv1D(filters=2, kernel_size=3, kernel_regularizer=l2(0.0005), bias_regularizer=l2(0.0005), padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.3))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1, kernel_regularizer=l2(0.0005), bias_regularizer=l2(0.0005), activation='sigmoid'))
model.add(Dropout(0.25))

adam = optimizers.Adam(lr=0.0005, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
print(model.summary())

history = model.fit(x_train, y_train, epochs=300, validation_data=(x_test, y_test), batch_size=128, shuffle=False)

# Final evaluation of the model
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1] * 100))
And here is the result.
Epoch 295/300
2585/2585 [==============================] - 0s 20us/step - loss: 1.6920 - acc: 0.7528 - val_loss: 0.5839 - val_acc: 0.8235
Epoch 296/300
2585/2585 [==============================] - 0s 20us/step - loss: 1.6532 - acc: 0.7617 - val_loss: 0.5836 - val_acc: 0.8235
Epoch 297/300
2585/2585 [==============================] - 0s 27us/step - loss: 1.5328 - acc: 0.7551 - val_loss: 0.5954 - val_acc: 0.8235
Epoch 298/300
2585/2585 [==============================] - 0s 20us/step - loss: 1.6289 - acc: 0.7524 - val_loss: 0.5897 - val_acc: 0.8235
Epoch 299/300
2585/2585 [==============================] - 0s 21us/step - loss: 1.7000 - acc: 0.7582 - val_loss: 0.5854 - val_acc: 0.8235
Epoch 300/300
2585/2585 [==============================] - 0s 25us/step - loss: 1.5475 - acc: 0.7451 - val_loss: 0.5934 - val_acc: 0.8235
Accuracy: 82.35%
Please help me with my problem. Thank you.
You may have too much regularization for your model, causing it to underfit your data.
A good way to start is to begin with no regularization at all (no Dropout, no weight decay, ...) and check whether it overfits:
If it doesn't, regularization is useless.
If it does overfit, add regularization little by little: start with a small dropout rate / weight decay, then increase it as long as the model continues to overfit.
Moreover, don't put Dropout as the final layer, and don't put two Dropout layers in succession.
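To make that concrete, here is a sketch of the model above with the Dropout after the sigmoid output removed, the back-to-back Dropout layers collapsed into one, and the weight decay dropped as a starting point; you would add regularization back gradually only if this version overfits:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, BatchNormalization, Dropout, Flatten, Dense

# Same architecture, minus the problematic pieces: no Dropout after the output,
# no successive Dropout layers, and no L2 penalties to begin with.
model = Sequential()
model.add(Embedding(2900, 2, input_length=1))
model.add(Conv1D(filters=2, kernel_size=3, padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])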
Your training accuracy is lower than your validation accuracy most likely because of dropout: it "turns off" some neurons during training to prevent overfitting. During validation dropout is off, so your network uses all of its neurons, thus making (in that particular case) more accurate predictions.
In general, I agree with the advice by Thibault Bacqueyrisses and want to add that it's also usually bad practice to put dropout before batch normalization (which is not the issue in this particular case anyway).
I'm trying to train a simple Keras neural network, but the model doesn't fit:
Train on 562 samples, validate on 188 samples
Epoch 1/20
562/562 [==============================] - 1s 1ms/step - loss: 8.1130 - acc: 0.4911 - val_loss: 7.6320 - val_acc: 0.5213
Epoch 2/20
562/562 [==============================] - 0s 298us/step - loss: 8.1130 - acc: 0.4911 - val_loss: 7.6320 - val_acc: 0.5213
Epoch 3/20
562/562 [==============================] - 0s 295us/step - loss: 8.1130 - acc: 0.4911 - val_loss: 7.6320 - val_acc: 0.5213
Epoch 4/20
562/562 [==============================] - 0s 282us/step - loss: 8.1130 - acc: 0.4911 - val_loss: 7.6320 - val_acc: 0.5213
Epoch 5/20
562/562 [==============================] - 0s 289us/step - loss: 8.1130 - acc: 0.4911 - val_loss: 7.6320 - val_acc: 0.5213
Epoch 6/20
562/562 [==============================] - 0s 265us/step - loss: 8.1130 - acc: 0.4911 - val_loss: 7.6320 - val_acc: 0.5213
The database is structured in a CSV file like this:
doc venda img1 img2 v1 v2 gt
RG venda1 img123 img12 [3399, 162675, ...] [3399, 162675, ...] 1
My intent is to use the difference between the v1 and v2 vectors to tell me whether img1 and img2 are from the same class.
The code:
from sklearn.model_selection import train_test_split

(X_train, X_test, Y_train, Y_test) = train_test_split(train, train_labels, test_size=0.25, random_state=42)

# create the model
model = Sequential()
model.add(Dense(10, activation="relu", input_dim=10, kernel_initializer="uniform"))
model.add(Dense(6, activation="relu", kernel_initializer="uniform"))
model.add(Dense(1, activation='sigmoid'))
print(model.summary())

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(
    np.array(X_train),
    np.array(Y_train),
    shuffle=True,
    epochs=20,
    verbose=1,
    batch_size=5,
    validation_data=(np.array(X_test), np.array(Y_test)),
)
What am I doing wrong?
Divide the difference vector by some constant so that the feature values are in the range 0 to 1 or -1 to 1. Right now the values are too large, which is why the loss is so high. A network learns faster if the data is properly normalized.
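As a minimal sketch of that suggestion (the choice of constant here, the largest absolute value in the training features, is my assumption rather than something the answer prescribes):

import numpy as np

X_train_arr = np.array(X_train, dtype=float)
X_test_arr = np.array(X_test, dtype=float)

# One constant computed from the training data only, so both splits land
# roughly in the [-1, 1] range.
scale = np.max(np.abs(X_train_arr))
X_train_scaled = X_train_arr / scale
X_test_scaled = X_test_arr / scale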
I have had success normalizing features with the function below. I forget exactly where I picked up the habit of reusing the same mu and sigma from the training set on the test and validation sets (I'm pretty sure it was the deeplearning.ai course on Coursera), but the point is that the scaling statistics must come from the training data only, so that no information from the test set leaks into preprocessing and all splits are scaled identically.
def normalize_features(dataset):
    mu = np.mean(dataset, axis=0)      # per-column mean
    sigma = np.std(dataset, axis=0)    # per-column standard deviation
    norm_parameters = {'mu': mu,
                       'sigma': sigma}
    return (dataset - mu) / (sigma + 1e-10), norm_parameters

# Normalize X data, reusing mu and sigma from the train set on val and test
x_train, norm_parameters = normalize_features(x_train)
x_val = (x_val - norm_parameters['mu']) / (norm_parameters['sigma'] + 1e-10)
x_test = (x_test - norm_parameters['mu']) / (norm_parameters['sigma'] + 1e-10)
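For what it's worth, scikit-learn's StandardScaler does essentially the same standardization; fitting it on the training split only and reusing it for the other splits is equivalent:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # learns mu and sigma from the train set
x_val = scaler.transform(x_val)          # reuses the train-set statistics
x_test = scaler.transform(x_test)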