I am trying to use Keras to fit a CNN model that classifies two classes of data. My dataset is imbalanced and I want to balance it. Can I use class_weight in model.fit_generator? In particular, what happens if I pass class_weight="balanced" to model.fit_generator?
The main code:
def generate_arrays_for_training(indexPat, paths, start=0, end=100):
    while True:
        from_=int(len(paths)/100*start)
        to_=int(len(paths)/100*end)
        for i in range(from_, int(to_)):
            f=paths[i]
            x = np.load(PathSpectogramFolder+f)
            x = np.expand_dims(x, axis=0)
            if('P' in f):
                y = np.repeat([[0,1]],x.shape[0], axis=0)
            else:
                y =np.repeat([[1,0]],x.shape[0], axis=0)
            yield(x,y)
history=model.fit_generator(generate_arrays_for_training(indexPat, filesPath, end=75),
                            validation_data=generate_arrays_for_training(indexPat, filesPath, start=75),
                            steps_per_epoch=int((len(filesPath)-int(len(filesPath)/100*25))),
                            validation_steps=int((len(filesPath)-int(len(filesPath)/100*75))),
                            verbose=2,
                            epochs=15, max_queue_size=2, shuffle=True, callbacks=[callback])
If you don't want to change your data creation process, you can use class_weight in your fit generator. Pass a dictionary as class_weight and tune its values by experiment. For instance, suppose you have 50 examples of class 0 and 100 examples of class 1 and class_weight is not used. The loss function then weights every example equally, so the model is biased toward the larger class 1 and the underrepresented class 0 suffers. But when you set:
class_weight = {0:2 , 1:1}
The loss function now gives class 0 twice the weight of class 1, so misclassifying an example of the underrepresented class is penalized twice as heavily as before. This is how the model can cope with imbalanced data.
Note that Keras expects class_weight to be a dictionary; as far as I know, the 'balanced' shortcut belongs to scikit-learn (for example sklearn.utils.class_weight.compute_class_weight), which can compute such weights for you. My suggestion is to create a dictionary like class_weight = {0:a1 , 1:a2} and try different values for a1 and a2, so you can see the difference.
You can also use undersampling methods for imbalanced data instead of class_weight. Look into bootstrapping methods for that purpose.
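As a rough sketch of what a "balanced" setting amounts to, the weights can be derived from the label counts with the heuristic n_samples / (n_classes * count), which is the formula scikit-learn documents for its 'balanced' mode. The helper name balanced_class_weights and the file names below are made up for illustration; only the 'P' in f labelling rule comes from the generator in the question.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Hypothetical file list: 'P' in the name means class 1, as in the generator above.
paths = ["P_%d.npy" % i for i in range(100)] + ["N_%d.npy" % i for i in range(50)]
labels = [1 if "P" in f else 0 for f in paths]

class_weight = balanced_class_weights(labels)
print(class_weight)
```

With 50 examples of class 0 and 100 of class 1, the rarer class 0 gets weight 1.5 and class 1 gets 0.75. The resulting dictionary can then be passed as class_weight=class_weight.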
My model performs a multi-class (3) classification task.
I would like to change the way the model "fits". Instead of calculating a metric such as acc or logloss, I would like to run a simulation on the whole data set after each fit step to see how the model performs, in real time.
Please note that simulation != loss/error. The simulation takes into consideration the time component of the data, that is, the sequence in which events occur, whereas the loss function simply calculates the error based on true values.
Currently I do the simulation after the whole "fitting" process has been done:
import lightgbm as lgb
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
all_ds = lgb.Dataset(X, label=y)
train_ds = lgb.Dataset(X_train, label=y_train)
test_ds = lgb.Dataset(X_test, label=y_test)
params = {
    'device_type': "gpu",
    'objective': 'multiclass',
    'metric': 'multi_logloss',
    "boosting_type": "gbdt",
    "num_class": 3,
    'random_state': 123
}
# fit
model = lgb.train(
    params,
    train_ds,
    num_boost_round=20,
    valid_sets=[test_ds]
)
# make prediction on a whole data set
# (Booster.predict expects the raw features, not an lgb.Dataset)
y_pred = model.predict(X)
# simulate
simulation_result = simulate(X, y_pred) # float value
current process is:
fit step 1 - error x
fit step 2 - error y
..
fit step 20 - error z
simulate - see how the model performs
I would like to change the process to
fit step 1 - simulate - use result of simulation as an error
fit step 2 - simulate - use result of simulation as an error
..
fit step 20 - simulate - use result of simulation as an error
Is there a way to achieve it through a custom callback or a custom evaluation metric or some other way?
I tried creating a custom eval metric, but unfortunately I cannot invoke predict() from within the function. Moreover, the preds parameter is not something I can use without some sort of transformation. It contains a multidimensional array that I have no idea how to convert to actual predictions.
def customEvalMetric(preds, eval_data):
    # how to invoke predict() method on a whole dataset here?
    # OR how to convert preds to one-hot encoded values?
    # simulation_result = simulate(all_ds, ..?..)
    return 'simulation_result', simulation_result, True
and using as
model = lgb.train(
    params,
    train_ds,
    num_boost_round=20,
    valid_sets=[all_ds],
    feval=customEvalMetric,
)
P.S. Now that I think about it, I could in theory fit one step at a time in a loop, then use init_model to load the existing model weights. Is this the only way?
I suppose this question is applicable to other tree boosting libraries since the APIs are similar (xgboost, for example).
The custom eval function should work. As per the docs, preds is:
The predicted values. Predicted values are returned before any transformation, e.g. they are raw margin instead of probability of positive class for binary task.
So if this is a classification problem, you might need to apply the softmax transformation to each row. For a regression problem, you should be able to use this output as-is.
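A minimal sketch of that conversion, using NumPy only. It assumes that preds arrives either as a 2-D array of shape (n_samples, n_classes) or as a flat array grouped by class (all scores for class 0 first, then class 1, and so on); which layout you get depends on the LightGBM version, so check it against your installation. The function names are illustrative, not part of LightGBM.

```python
import numpy as np

def preds_to_labels(preds, num_class):
    """Turn the raw `preds` array passed to a custom metric into class labels."""
    preds = np.asarray(preds)
    if preds.ndim == 1:
        # Assumed flat layout: all scores for class 0, then class 1, ...
        preds = preds.reshape(num_class, -1).T
    # Softmax is monotonic within a row, so argmax of raw scores gives the same labels.
    return preds.argmax(axis=1)

def softmax(scores):
    """Row-wise softmax, in case probabilities are needed instead of labels."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Two rows, three classes, given once as a 2-D array and once flattened by class.
scores_2d = np.array([[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]])
scores_flat = scores_2d.T.ravel()
print(preds_to_labels(scores_2d, 3), preds_to_labels(scores_flat, 3))
```

Inside the custom metric, the resulting labels (or softmax probabilities) could then be handed to simulate() in place of a predict() call, provided the validation set passed to valid_sets is the data set you want to simulate on.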
I have to deal with highly unbalanced data. As I understand, I need to use weighted cross entropy loss.
I tried this:
import numpy as np
import tensorflow as tf
weights = np.array([<values>])
def loss(y_true, y_pred):
    # weights.shape = (63,)
    # y_true.shape = (64, 63)
    # y_pred.shape = (64, 63)
    return tf.reduce_mean(tf.nn.weighted_cross_entropy_with_logits(y_true, y_pred, weights))
model.compile('adam', loss=loss, metrics=['acc'])
But there's an error:
ValueError: Creating variables on a non-first call to a function decorated with tf.function
How can I create this kind of loss?
I suggest, in the first instance, using class_weight from Keras.
class_weight is a dictionary of the form {label: weight}.
For example, if you have 20 times more examples in label 1 than in label 0, then you can write
# Assign 20 times more weight to label 0
model.fit(..., class_weight = {0:20, 1:1})
In this way you don't need to worry about implementing weighted CCE on your own.
Additional note: in your model.compile(), do not forget to use weighted_metrics=['accuracy'] so that the reported accuracy reflects the class weights.
model.compile(..., weighted_metrics = ['accuracy'])
model.fit(..., class_weight = {0:20, 1:1})
Class weights are a dictionary that compensates for the imbalance in the data set. For example, if you had a data set of 1000 dog images and 100 cat images, your classifier would be biased toward the dog class. If it predicted dog every time, it would be correct about 91 percent of the time. To compensate for the imbalance, the class_weight dictionary lets you weight samples of cats 10 times higher than those of dogs when calculating the loss. One way to obtain the weights is to use compute_class_weight from sklearn's class_weight module, as shown below
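from sklearn.utils import class_weight
import numpy as np
# Recent scikit-learn versions require classes and y as keyword arguments
weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
# Keras expects a {class_index: weight} dictionary (class indices are 0..n-1 here)
class_weights = dict(enumerate(weights))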
If you are working with imbalanced classes, you should use class weights. For example, if you have two classes where class 0 has twice as much data as class 1:
class_weight = {0 :1, 1: 2}
When you compile, make sure to use weighted_metrics instead of just metrics, or else the class weights won't be taken into account when calculating the accuracy and it will be unrealistically high.
model.compile(loss="binary_crossentropy",optimizer='adam', weighted_metrics=['accuracy'])
# fit_generator has no validation_split argument; pass a validation generator via validation_data if needed
hist = model.fit_generator(train,epochs=20,class_weight=class_weight)
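To see why the weighted metric matters, here is a small pure-Python illustration (not Keras's own implementation) of accuracy where each sample counts with the weight of its true class. The function name and the toy labels are assumptions for the example; the 2:1 class ratio and the weights {0: 1, 1: 2} follow the text above.

```python
def accuracy(y_true, y_pred, class_weight=None):
    """Accuracy where each sample counts with the weight of its true class."""
    weights = [1.0 if class_weight is None else class_weight[t] for t in y_true]
    hits = sum(w for w, t, p in zip(weights, y_true, y_pred) if t == p)
    return hits / sum(weights)

# Class 0 has twice as much data as class 1; the "model" always predicts class 0.
y_true = [0] * 200 + [1] * 100
y_pred = [0] * 300
class_weight = {0: 1, 1: 2}

plain = accuracy(y_true, y_pred)
weighted = accuracy(y_true, y_pred, class_weight)
print(round(plain, 3), weighted)
```

The always-predict-majority model scores about 0.667 on plain accuracy but only 0.5 once the class weights are applied, which is the more honest number for an imbalanced problem.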
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
        activation='relu',
        input_dim=num_features,
        kernel_initializer='random_uniform',
        bias_initializer='random_uniform'
    ),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp,out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])
def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks = [PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network
def normalise_dataset(df, mini, maxi):
    return (df - mini)/(maxi-mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an Adam optimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That means that if you have two features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the feature with the smaller range is squashed into a tiny interval and carries almost no signal after scaling.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also scales to unit variance (which makes some assumptions about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your data:
Either scale the test data together with your training data and split afterwards,
or fit the scaler on the training data only and then use the same scaler to transform your test data.
Never fit_transform your test set separately from the training data!
Since the two sets potentially have different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though the test set might be a subset not exactly following the same properties (due to small sample size etc.).
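The second option (fit on the training data, reuse the statistics for the test data) can be sketched in plain NumPy as follows. This mirrors what StandardScaler's fit() and transform() do per column, but the helper names fit_scaler and transform and the toy arrays are assumptions made for this example.

```python
import numpy as np

def fit_scaler(train):
    """Per-feature mean and standard deviation, computed on training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return mean, std

def transform(data, mean, std):
    """Standardize `data` with statistics obtained from the training set."""
    return (data - mean) / std

train = np.array([[1.0, 40.0], [2.0, 80.0]])
test = np.array([[1.5, 100.0]])

mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # reuses the training mean/std
print(train_scaled)
print(test_scaled)
```

Here the test row maps to [0, 2]: it sits at the training mean for the first feature and two training standard deviations above it for the second, which is exactly the information that would be lost if the test set were scaled on its own.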
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
It turns out the error was a really stupid and easy-to-miss bug.
When importing my dataset I shuffle it; however, I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was assigned to a completely random feature set, and of course the model didn't know what to do with this.
Thanks to @dennlinger for suggesting I look in the place where I eventually found this bug.
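For reference, one way to avoid this kind of bug is to draw a single permutation and index both arrays with it. The sketch below uses NumPy with made-up toy data; it is not the original import code, which is not shown in the question.

```python
import numpy as np

rng = np.random.default_rng(0)
features = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
labels = np.arange(5) * 10               # label of row i is 10*i

perm = rng.permutation(len(features))    # one permutation, applied to both arrays
features_shuffled = features[perm]
labels_shuffled = labels[perm]

# Every row still carries its own label after shuffling.
assert np.array_equal(features_shuffled[:, 0] // 2 * 10, labels_shuffled)
```

Shuffling features and labels with two independent calls would break the final assertion, which makes it a cheap sanity check to keep next to the loading code.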
I have built a TensorFlow model that uses a DNNClassifier to classify input into two categories.
My problem is that Outcome 1 occurs upwards of 90-95% of the time. Therefore, TensorFlow is giving me the same probabilities for all of my predictions.
I am trying to predict the other outcome (e.g. having a false positive for Outcome 2 is preferable to missing a possible occurrence of Outcome 2). I know that in machine learning in general, in this case it would be worthwhile to try to upweight Outcome 2.
However, I don't know how to do this in TensorFlow. The documentation alludes to it being possible, but I can't find any examples of what it would actually look like. Has anyone successfully done this, or does anyone know where I could find some example code or a thorough explanation (I'm using Python)?
Note: I have seen exposed weights being manipulated when someone is using the more fundamental parts of TensorFlow and not an estimator. For maintenance reasons, I need to do this using an estimator.
The tf.estimator.DNNClassifier constructor has a weight_column argument:
weight_column: A string or a _NumericColumn created by
tf.feature_column.numeric_column defining feature column representing
weights. It is used to down weight or boost examples during training.
It will be multiplied by the loss of the example. If it is a string,
it is used as a key to fetch weight tensor from the features. If it is
a _NumericColumn, raw tensor is fetched by key weight_column.key, then
weight_column.normalizer_fn is applied on it to get weight tensor.
So just add a new column and fill it with some weight for the rare class:
weight = tf.feature_column.numeric_column('weight')
...
tf.estimator.DNNClassifier(..., weight_column=weight)
[Update] Here's a complete working example:
import numpy as np
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('mnist', one_hot=False)
train_x, train_y = mnist.train.next_batch(1024)
test_x, test_y = mnist.test.images, mnist.test.labels
x_column = tf.feature_column.numeric_column('x', shape=[784])
weight_column = tf.feature_column.numeric_column('weight')
classifier = tf.estimator.DNNClassifier(feature_columns=[x_column],
                                        hidden_units=[100, 100],
                                        weight_column=weight_column,
                                        n_classes=10)
# Training
train_input_fn = tf.estimator.inputs.numpy_input_fn(x={'x': train_x, 'weight': np.ones(train_x.shape[0])},
                                                    y=train_y.astype(np.int32),
                                                    num_epochs=None, shuffle=True)
classifier.train(input_fn=train_input_fn, steps=1000)
# Testing
test_input_fn = tf.estimator.inputs.numpy_input_fn(x={'x': test_x, 'weight': np.ones(test_x.shape[0])},
                                                   y=test_y.astype(np.int32),
                                                   num_epochs=1, shuffle=False)
acc = classifier.evaluate(input_fn=test_input_fn)
print('Test Accuracy: %.3f' % acc['accuracy'])
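The example above uses np.ones for the weight column, so every example counts equally. To actually upweight the rare outcome, the weight array can be built from the labels. The following is a small sketch; the helper name example_weights, the toy labels and the factor 10.0 are illustrative assumptions, not values from the question.

```python
import numpy as np

def example_weights(labels, rare_class, rare_weight):
    """One weight per example: `rare_weight` for the rare class, 1.0 otherwise."""
    labels = np.asarray(labels)
    return np.where(labels == rare_class, rare_weight, 1.0).astype(np.float32)

# Toy labels where class 1 is the rare outcome.
train_y = np.array([0, 0, 0, 1, 0, 1])
weights = example_weights(train_y, rare_class=1, rare_weight=10.0)
print(weights)
```

The resulting array would be passed as 'weight': weights in the x dictionary of the training input function, in place of np.ones(train_x.shape[0]).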
Can anyone tell me what is the simplest way to apply class_weight in Keras when the dataset is unbalanced please?
I only have two classes in my target.
Thanks.
The class_weight parameter of the fit() function is a dictionary mapping classes to a weight value.
Let's say you have 500 samples of class 0 and 1500 samples of class 1; then you pass class_weight = {0:3 , 1:1}. That gives class 0 three times the weight of class 1.
train_generator.classes gives you the proper class names for your weighting.
If you want to calculate this programmatically you can use scikit-learn's sklearn.utils.compute_class_weight().
The function looks at the distribution of labels and produces weights to equally penalize under or over-represented classes in the training set.
See also this useful thread here: https://github.com/fchollet/keras/issues/1875
And this thread might also be of help: Is it possible to automatically infer the class_weight from flow_from_directory in Keras?
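Use class_weight from scikit-learn.
I'm also using this method to deal with imbalanced data:
from sklearn.utils import class_weight
import numpy as np
classes = np.unique(Y_train)
# Recent scikit-learn versions require classes and y as keyword arguments
weights = class_weight.compute_class_weight('balanced', classes=classes, y=Y_train)
# Keras expects a dictionary mapping class label to weight
class_weights = dict(zip(classes, weights))
then model.fit
Classifier.fit(train_X,train_Y,batch_size = 100, epochs = 10
               ,validation_data= (test_X,test_Y),class_weight = class_weights )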
1- Define a dictionary with your labels and their associated weights
class_weight = {0: 0.1,
                1: 1.,
                2: 2.}
2- Feed the dictionary as a parameter:
model.fit(X_train, Y_train, batch_size = 100, epochs = 10, class_weight=class_weight)
Are you asking about the right weighting to apply or how to do that in the code? The code is simple:
class_weights = {}
for i in range(2):
    class_weights[i] = your_weight
and then you pass the argument class_weight=class_weights in model.fit.
The right weighting to use would be some sort of inverse frequency; you can also do a bit of trial and error.
class_weight takes a dictionary. For example, to weight each class relative to the largest one:
from collections import Counter
itemCt = Counter(trainGen.classes)
maxCt = float(max(itemCt.values()))
cw = {clsID : maxCt/numImg for clsID, numImg in itemCt.items()}