I have to deal with highly unbalanced data. As I understand, I need to use weighted cross entropy loss.
I tried this:
import numpy as np
import tensorflow as tf
weights = np.array([<values>])
def loss(y_true, y_pred):
    # weights.shape = (63,)
    # y_true.shape = (64, 63)
    # y_pred.shape = (64, 63)
    return tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(y_true, y_pred, weights))
model.compile('adam', loss=loss, metrics=['acc'])
But there's an error:
ValueError: Creating variables on a non-first call to a function decorated with tf.function
How can I create this kind of loss?
As a first step, I suggest using class_weight from Keras.
class_weight is a dictionary of the form {label: weight}.
For example, if you have 20 times more examples in label 1 than in label 0, then you can write
# Assign 20 times more weight to label 0
model.fit(..., class_weight = {0: 20, 1: 1})
In this way you don't need to worry about implementing weighted CCE on your own.
Additional note: in your model.compile() do not forget to use weighted_metrics=['accuracy'] so that the reported accuracy reflects the class weights.
model.compile(..., weighted_metrics = ['accuracy'])
model.fit(..., class_weight = {0: 20, 1: 1})
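If you still want a hand-written weighted loss like the one in the question, one possible shape is sketched below. It assumes the last layer outputs raw logits (no sigmoid), and the name make_weighted_bce and the example weights are hypothetical. It builds the weight tensor once, outside the loss function, and is only a sketch, not a confirmed fix for the error above.

```python
import numpy as np
import tensorflow as tf

def make_weighted_bce(pos_weight):
    """Return a Keras-compatible loss that weights positive labels per output."""
    # Build the constant once, outside the function Keras will trace.
    w = tf.constant(pos_weight, dtype=tf.float32)

    def loss(y_true, y_pred):
        # y_pred is assumed to hold logits, not probabilities.
        y_true = tf.cast(y_true, tf.float32)
        per_element = tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=y_pred, pos_weight=w)
        return tf.reduce_mean(per_element)

    return loss

# Hypothetical example: two outputs, positives of the first one count double.
example_loss = make_weighted_bce([2.0, 1.0])
value = example_loss(np.array([[1.0, 0.0]]),
                     np.array([[0.0, 0.0]], dtype=np.float32))
print(float(value))
```

The returned function can be passed as loss=... to model.compile() in the same way as the loss in the question.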
class_weight is a dictionary that compensates for the imbalance in the data set. For example, if you had a data set of 1000 dog images and 100 cat images, your classifier would be biased toward the dog class. If it predicted dog every time, it would be correct about 91 percent of the time. To compensate for the imbalance, the class_weight dictionary lets you weight samples of cats 10 times higher than those of dogs when calculating the loss. One way to build it is to use the class_weight module from sklearn, as shown below
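from sklearn.utils import class_weight
import numpy as np
# Recent scikit-learn versions require classes and y as keyword arguments
class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
# Keras expects a dict mapping class index to weight
class_weights = dict(enumerate(class_weights))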
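For reference, the 'balanced' heuristic assigns each class the weight n_samples / (n_classes * count_of_that_class). The pure-Python sketch below reproduces that arithmetic for the dog/cat example; the function name balanced_class_weights is made up for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# 1000 dogs (class 0) and 100 cats (class 1), as in the example above
labels = [0] * 1000 + [1] * 100
weights = balanced_class_weights(labels)
print(weights)  # cats end up with 10 times the weight of dogs
```

The resulting dictionary has the shape that class_weight in model.fit() expects.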
If you are working with imbalanced classes, you should use class weights. For example, if you have two classes where class 0 has twice as much data as class 1:
class_weight = {0 :1, 1: 2}
When you compile, use weighted_metrics instead of just metrics; otherwise the model won't take the class weights into account when calculating the accuracy, and it will be unrealistically high.
model.compile(loss="binary_crossentropy", optimizer='adam', weighted_metrics=['accuracy'])
# validation_split is not supported for generator input, so validation data is passed
# explicitly (val is a hypothetical validation generator); fit_generator is deprecated in TF 2.x
hist = model.fit(train, validation_data=val, epochs=20, class_weight=class_weight)
I am looking at these two questions and documentation:
Whats the output for Keras categorical_accuracy metrics?
Categorical crossentropy need to use categorical_accuracy or accuracy as the metrics in keras?
https://keras.io/api/metrics/probabilistic_metrics/#categoricalcrossentropy-class
For classification of X-ray images (15 classes) I do:
# Compile a model
model1.compile(optimizer = 'adam', loss = 'categorical_crossentropy',
               metrics = ['accuracy'])
# Fit the model
history1 = model1.fit_generator(train_generator, epochs = 10,
                                steps_per_epoch = 10, verbose = 1, validation_data = valid_generator)
My model works and I have an output:
But I am not sure how to add validation accuracy here to compare results and avoid over/underfitting.
I hope the following can help you:
The use of "categorical_crossentropy" tells me that your labels are one-hot encoded over the different classes.
Let's say you have 15 classes: the correct prediction is a vector with 14 zeros and a one at the corresponding index. An accuracy computed element-wise over that vector (binary accuracy) will be very high, because the model correctly predicts zero almost everywhere, so it should easily be at least 13/15 ≈ 0.87. Note that recent Keras versions resolve the string "accuracy" from the loss and output shape, so with categorical_crossentropy it may already be computed as categorical accuracy.
A more suitable metric is "categorical_accuracy", which gives 1 if the model predicts the correct index and 0 otherwise.
If you have a validation "categorical_accuracy" better than 1/15 = 0.067 (assuming your classes are balanced), your model is better than random.
You can find a list of metrics at keras metrics.
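To make the difference between the two ways of counting concrete, here is a small pure-Python sketch for a single sample with 15 classes. The helper names and the example probabilities are made up for illustration; this is not how Keras implements its metrics internally.

```python
def elementwise_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of vector entries that match after thresholding the prediction."""
    hits = sum((p > threshold) == (t == 1) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

def categorical_accuracy(y_true, y_pred):
    """1.0 if the arg-max of the prediction is the true class, else 0.0."""
    true_idx = max(range(len(y_true)), key=lambda i: y_true[i])
    pred_idx = max(range(len(y_pred)), key=lambda i: y_pred[i])
    return float(true_idx == pred_idx)

num_classes = 15
y_true = [0] * num_classes
y_true[3] = 1                    # true class is index 3
y_pred = [0.01] * num_classes
y_pred[7] = 0.86                 # the model confidently picks the wrong class

elementwise = elementwise_accuracy(y_true, y_pred)   # 13 of 15 entries still match
categorical = categorical_accuracy(y_true, y_pred)   # the predicted class is wrong
print(elementwise, categorical)
```

Even though the prediction is wrong, 13 of the 15 entries agree, which is where the 13/15 lower bound comes from.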
I am trying to use Keras to fit a CNN model that classifies 2 classes of data. My dataset is imbalanced and I want to balance it. I don't know whether I can use class_weight in model.fit_generator, for example by passing class_weight="balanced".
The main code:
def generate_arrays_for_training(indexPat, paths, start=0, end=100):
    while True:
        from_=int(len(paths)/100*start)
        to_=int(len(paths)/100*end)
        for i in range(from_, int(to_)):
            f=paths[i]
            x = np.load(PathSpectogramFolder+f)
            x = np.expand_dims(x, axis=0)
            if('P' in f):
                y = np.repeat([[0,1]],x.shape[0], axis=0)
            else:
                y =np.repeat([[1,0]],x.shape[0], axis=0)
            yield(x,y)
history=model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                            validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                            steps_per_epoch=int((len(filesPath)-int(len(filesPath)/100*25))),
                            validation_steps=int((len(filesPath)-int(len(filesPath)/100*75))),
                            verbose=2,
                            epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])
If you don't want to change your data creation process, you can use class_weight in fit_generator. Pass a dictionary as class_weight and tune its values. For instance, suppose class_weight is not used and you have 50 examples of class 0 and 100 examples of class 1. The loss function then weights every example equally, so the majority class 1 dominates the loss. But when you set:
class_weight = {0:2 , 1:1}
the loss function gives class 0 twice the weight. Misclassifying the underrepresented class is therefore penalized twice as heavily as before, which helps the model handle the imbalanced data.
Note that Keras expects class_weight to be a dictionary; the 'balanced' option belongs to scikit-learn (for example compute_class_weight), whose result you can convert to a dictionary. My suggestion, though, is to create a dictionary like class_weight = {0:a1 , 1:a2} and try different values for a1 and a2, so you can see the difference.
You can also use undersampling methods for imbalanced data instead of class_weight. Check bootstrapping methods for that purpose.
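As a rough illustration of the undersampling idea, the sketch below trims every class down to the size of the smallest one before the file list is handed to a generator. The function name undersample, the file names, and the labelling rule ('P' in the file name means class 1, mirroring the generator in the question) are assumptions made for the example.

```python
import random
from collections import defaultdict

def undersample(paths, label_of, seed=0):
    """Keep the same number of files per class by random undersampling."""
    by_class = defaultdict(list)
    for p in paths:
        by_class[label_of(p)].append(p)
    n_min = min(len(items) for items in by_class.values())
    rng = random.Random(seed)          # seeded for reproducibility
    kept = []
    for items in by_class.values():
        kept.extend(rng.sample(items, n_min))
    rng.shuffle(kept)                  # avoid long runs of a single class
    return kept

# Hypothetical file list: 20 files of class 1 and 100 files of class 0
paths = [f"P_{i}.npy" for i in range(20)] + [f"N_{i}.npy" for i in range(100)]
balanced = undersample(paths, label_of=lambda f: 1 if 'P' in f else 0)
print(len(balanced))
```

The trade-off is that undersampling discards majority-class data, whereas class_weight keeps every example and only rescales the loss.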
For my problem, I want to predict customer review scores ranging from 1 to 5.
I thought it would be good to implement this as a regression problem because a predicted 1 from the model while 5 being the true value should be a "worse" prediction than 4.
The model should also perform roughly equally well for all review score classes.
Because my dataset is highly unbalanced, I want to create a metric/loss that is capable of capturing this (similar to what F1 does for classification).
Therefore I created the following metric (for now only MSE is relevant):
def custom_metric(y_true, y_pred):
    df = pd.DataFrame(np.column_stack([y_pred, y_true]), columns=["Predicted", "Truth"])
    class_mse = 0
    #class_mae = 0
    print("MSE for Classes:")
    for i in df.Truth.unique():
        temp = df[df["Truth"]==i]
        mse = mean_squared_error(temp.Truth, temp.Predicted)
        #mae = mean_absolute_error(temp.Truth, temp.Predicted)
        print("Class {}: {}".format(i, mse))
        class_mse += mse
        #class_mae += mae
    print()
    print("AVG MSE over Classes {}".format(class_mse/len(df.Truth.unique())))
    #print("AVG MAE over Classes {}".format(class_mae/len(df.Truth.unique())))
Now an example prediction:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error
# sample predictions: "model" messed up at class 2 and 3
y_true = np.array((1,1,1,2,2,2,3,3,3,4,4,4,5,5,5))
y_pred = np.array((1,1,1,2,2,3,5,4,3,4,4,4,5,5,5))
custom_metric(y_true, y_pred)
Now my question: Is it possible to create a custom TensorFlow loss function that behaves similarly? I also worked on this implementation, which is not yet ready for TensorFlow but may be closer:
def custom_metric(y_true, y_pred):
    mse_class = 0
    num_classes = len(np.unique(y_true))
    stacked = np.vstack((y_true, y_pred))
    for i in np.unique(stacked[0]):
        y_true_temp = stacked[0][np.where(stacked[0]==i)]
        y_pred_temp = stacked[1][np.where(stacked[0]==i)]
        mse = np.mean(np.square(y_pred_temp - y_true_temp))
        mse_class += mse
    return mse_class/num_classes
But still, I am not sure how to replace the for loop in a TensorFlow-style definition.
Thanks in advance for any help!
The for loop should be replaced by vectorized numpy/tensorflow operations on the tensor.
A custom metric example would be:
from keras import backend as K
def custom_mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)
where y_true is the ground truth label and y_pred are your predictions. You can see there are no explicit for loops.
The motivation for not using for loops is that vectorized operations (which are present in both numpy and tensorflow) take advantage of modern CPU architectures, turning many iterative operations into matrix ones. For example, a dot product in numpy can run roughly 30 times faster than an equivalent for loop in Python, although the exact factor depends on the array size and hardware.
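Applied to the per-class MSE from the question, the loop over classes can be replaced by a one-hot mask, as in the sketch below. It assumes the true scores are integers from 1 to num_classes and that the model outputs a single value per sample. The name class_balanced_mse is hypothetical, and classes missing from a batch are simply skipped.

```python
import tensorflow as tf

def class_balanced_mse(y_true, y_pred, num_classes=5):
    """Average of per-class MSEs, computed without a Python loop."""
    y_true = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    sq_err = tf.square(y_pred - y_true)

    # Scores 1..num_classes become column indices 0..num_classes-1.
    idx = tf.cast(tf.round(y_true), tf.int32) - 1
    mask = tf.one_hot(idx, num_classes)                    # shape (N, C)

    sums = tf.reduce_sum(mask * sq_err[:, None], axis=0)   # summed error per class
    counts = tf.reduce_sum(mask, axis=0)                   # samples per class
    per_class_mse = tf.math.divide_no_nan(sums, counts)    # 0 for absent classes

    n_present = tf.reduce_sum(tf.cast(counts > 0, tf.float32))
    return tf.reduce_sum(per_class_mse) / tf.maximum(n_present, 1.0)

# Same sample data as in the question
y_true = tf.constant([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5])
y_pred = tf.constant([1, 1, 1, 2, 2, 3, 5, 4, 3, 4, 4, 4, 5, 5, 5])
result = float(class_balanced_mse(y_true, y_pred))
print(result)
```

Because the function is evaluated per batch when used as a loss, the average covers only the classes present in that batch, so small batches give a noisier estimate than the full-dataset metric.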
I've developed a neural network for classification and I'm getting an accuracy of 0.93. The problem is that I'm predicting all zeros because of the distribution of the data.
How can I fix it? Should I change from neural network to another algorithm?
Thanks in advance
Edit: I've just checked and my model is predicting the same probability for each row.
The model is a NN with 5 layers, and tf.nn.relu6 as activation function. The cost function is tf.nn.sigmoid_cross_entropy_with_logits
To predict the values I use:
predicted = tf.nn.sigmoid(Z5)
correct_pred = tf.equal(tf.round(predicted), Y)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
EDIT 2
I have 'fixed' the imbalance class problem (undersampling and upsampling 0s and 1s) but the net is still predicting the same values for each row:
I have tried changing the activation function to tanh or sigmoid, but then it outputs NaNs.
There are multiple solutions for unbalanced data. But first, accuracy is not a good metric for unbalanced data: if you had only 5 positives and 95 negatives, you would get 95% accuracy just by predicting negative every time. You should check sensitivity and specificity, or other metrics that work well with unbalanced data, like the lift score.
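The 5-versus-95 example can be checked with a few lines of plain Python. The helper name sensitivity_specificity is made up for this sketch; it assumes binary labels where 1 is the positive class.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels with 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    return sensitivity, specificity

# 5 positives, 95 negatives, and a model that always predicts negative
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(accuracy, sensitivity, specificity)
```

Accuracy looks good here, while a sensitivity of zero shows that no positive case is ever found.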
To train the model with unbalanced data, there are multiple solutions. One of them is to up-sample the minority class.
Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.
You can upsample data with code like this:
import pandas as pd
from sklearn.utils import resample
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]
# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
# Display new class counts
df_upsampled.balance.value_counts()
# 1 576
# 0 576
# Name: balance, dtype: int64
You can find more information and other solutions that are well explained here.
Can anyone tell me what is the simplest way to apply class_weight in Keras when the dataset is unbalanced please?
I only have two classes in my target.
Thanks.
The class_weight parameter of the fit() function is a dictionary mapping classes to a weight value.
Let's say you have 500 samples of class 0 and 1500 samples of class 1; then you pass class_weight = {0:3 , 1:1}. That gives class 0 three times the weight of class 1.
train_generator.classes gives you the proper class names for your weighting.
If you want to calculate this programmatically you can use scikit-learn's sklearn.utils.class_weight.compute_class_weight().
The function looks at the distribution of labels and produces weights to equally penalize under or over-represented classes in the training set.
See also this useful thread here: https://github.com/fchollet/keras/issues/1875
And this thread might also be of help: Is it possible to automatically infer the class_weight from flow_from_directory in Keras?
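Use class_weight from scikit-learn.
I'm also using this method to deal with imbalanced data:
import numpy as np
from sklearn.utils import class_weight
# Recent scikit-learn versions require classes and y as keyword arguments
class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(Y_train),
                                                  y=Y_train)
# Keras expects a dict mapping class index to weight
class_weights = dict(enumerate(class_weights))
Then call model.fit:
Classifier.fit(train_X, train_Y, batch_size = 100, epochs = 10,
               validation_data = (test_X, test_Y), class_weight = class_weights)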
1- Define a dictionary with your labels and their associated weights
class_weight = {0: 0.1,
                1: 1.,
                2: 2.}
2- Feed the dictionary as a parameter:
model.fit(X_train, Y_train, batch_size = 100, epochs = 10, class_weight=class_weight)
Are you asking about the right weighting to apply or how to do that in the code? The code is simple:
class_weights = {}
for i in range(2):
    class_weights[i] = your_weight
and then you pass the argument class_weight=class_weights in model.fit.
The right weighting to use would be some sort of inverse frequency; you can also do a bit of trial and error.
class_weight takes a dictionary.
from collections import Counter
itemCt = Counter(trainGen.classes)
maxCt = float(max(itemCt.values()))
cw = {clsID : maxCt/numImg for clsID, numImg in itemCt.items()}