Predicting all zeros - python

I've developed a neural network for classification and I'm getting an accuracy of 0.93. The problem is that the model predicts all zeros because of the distribution of the data.
How can I fix it? Should I switch from a neural network to another algorithm?
Thanks in advance
Edit: I've just checked and my model is predicting the same probability for each row.
The model is a NN with 5 layers, with tf.nn.relu6 as the activation function. The cost function is tf.nn.sigmoid_cross_entropy_with_logits.
To predict the values I use:
predicted = tf.nn.sigmoid(Z5)
correct_pred = tf.equal(tf.round(predicted), Y)
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))
EDIT 2
I have 'fixed' the class imbalance problem (undersampling and upsampling the 0s and 1s), but the net is still predicting the same value for each row.
I have tried changing the activation function to tanh or sigmoid, but then the net outputs NaNs.

There are multiple solutions for unbalanced data. But first, accuracy is not a good metric for unbalanced data: if you had only 5 positives and 95 negatives, predicting all negatives would already give you 95% accuracy. You should check sensitivity and specificity, or other metrics that work well with unbalanced data, such as the LIFT score.
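For example, sensitivity and specificity can be read straight off the confusion matrix. A minimal sketch with scikit-learn, assuming y_true holds the binary labels and y_pred the rounded predictions (both assumed names):
from sklearn.metrics import confusion_matrix
# tn, fp, fn, tp for a binary problem
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # recall on the positive (minority) class
specificity = tn / (tn + fp)  # recall on the negative (majority) class
print(sensitivity, specificity)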
To train the model with unbalanced data, there are multiple solutions. One of them is to up-sample the minority class.
Up-sampling is the process of randomly duplicating observations from the minority class in order to reinforce its signal.
You can upsample data with a code like this:
import pandas as pd
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.balance.value_counts()
# 1    576
# 0    576
# Name: balance, dtype: int64
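The complementary approach, mentioned in your second edit, is to down-sample the majority class instead. A minimal sketch under the same assumed df/balance names:
# Downsample the majority class to the minority class size (sampling without replacement)
df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=123)
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
df_downsampled.balance.value_counts()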
You can find more information and other solutions that are well explained here.

Related

"All zero class" prediction by Neural Network

In a classification problem involving the identification of fraudulent transactions, I reduced the dimensionality of the data (28 columns) [a complete quasi-separation was detected by Logit in statsmodels] using a stacked autoencoder (28 -> 15 -> 5) and fed the compressed data (5 columns) into a neural network with two hidden layers, each having 10 nodes and a 'relu' activation function. I trained the model over 100 epochs (the AUC metric didn't go beyond 0.500 and the train loss became constant after a few epochs). The model predicted all the records of the test set as non-fraudulent (class 0) and yielded a confusion matrix like this:
Confusion matrix:
[[70999     0]
 [  115     0]]
Accuracy Score : 0.9983828781955733
Can someone please explain the problem behind this result and suggest a feasible solution?
Since your accuracy is over 99% with an all-zero prediction, the fraud cases make up less than 1% of your training set.
Typically, if the fraud cases are that rare, the model will not place enough importance on them to predict well.
To fix this, you can add costs or weights that penalize misclassifying the minority (fraud) class, or use class balancing methods such as SMOTE.
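For example, SMOTE from the imbalanced-learn package synthesizes new minority samples rather than merely duplicating them. A minimal sketch, assuming X_train and y_train are your training features and labels:
# pip install imbalanced-learn
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=42)
X_resampled, y_resampled = sm.fit_resample(X_train, y_train)
# Train the network on X_resampled / y_resampled instead of the raw, imbalanced data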

How to create weighted cross entropy loss?

I have to deal with highly unbalanced data. As I understand it, I need to use a weighted cross entropy loss.
I tried this:
import tensorflow as tf
import numpy as np

weights = np.array([<values>])

def loss(y_true, y_pred):
    # weights.shape = (63,)
    # y_true.shape = (64, 63)
    # y_pred.shape = (64, 63)
    return tf.reduce_mean(
        tf.nn.weighted_cross_entropy_with_logits(y_true, y_pred, weights))

model.compile('adam', loss=loss, metrics=['acc'])
But there's an error:
ValueError: Creating variables on a non-first call to a function decorated with tf.function
How can I create this kind of loss?
As a first step, I suggest using class_weight from Keras.
class_weight is a dictionary with {label: weight}.
For example, if you have 20 times more examples of label 1 than of label 0, then you can write
# Assign 20 times more weight to label 0
model.fit(..., class_weight={0: 20, 1: 1})
This way you don't need to worry about implementing weighted CCE on your own.
Additional note: weighted_metrics goes in model.compile(); do not forget to use weighted_metrics=['accuracy'] there in order to have a relevant reflection of your accuracy.
model.compile(..., weighted_metrics=['accuracy'])
model.fit(..., class_weight={0: 20, 1: 1})
class_weight is a dictionary that compensates for the imbalance in the data set. For example, if you had a data set of 1000 dog images and 100 cat images, your classifier would be biased toward the dog class: if it predicted dog every time it would be correct 90 percent of the time. To compensate for the imbalance, the class_weight dictionary enables you to weight samples of cats 10 times higher than samples of dogs when calculating the loss. One way is to use the compute_class_weight method from sklearn as shown below.
from sklearn.utils import class_weight
import numpy as np

class_weights = class_weight.compute_class_weight(
    class_weight='balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
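compute_class_weight returns a NumPy array, while Keras expects a {label: weight} dictionary, so a small conversion is needed before training. A sketch, assuming the class labels are 0..n-1 and reusing the train_generator name from above:
# Map each class index to its computed weight
class_weight_dict = dict(enumerate(class_weights))

# Hypothetical fit call passing the weights to Keras
model.fit(train_generator, epochs=20, class_weight=class_weight_dict)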
If you are working with imbalanced classes, you should use class weights. For example, if you have two classes where class 0 has twice as much data as class 1:
class_weight = {0: 1, 1: 2}
When you compile, use weighted_metrics instead of just metrics, or else the model won't take the class weights into account when calculating the accuracy and it will be unrealistically high.
model.compile(loss="binary_crossentropy", optimizer='adam', weighted_metrics=['accuracy'])
# Note: validation_split is not supported with generators; pass validation_data instead if needed
hist = model.fit(train, epochs=20, class_weight=class_weight)

Loss function for class imbalanced multi-class classifier in Keras

I am trying to apply deep learning to a multi-class classification problem with high class imbalance between target classes (10K, 500K, 90K, 30K). I want to write a custom loss function.
This is my current model:
model = Sequential()
model.add(LSTM(
    units=10,  # number of units returned by LSTM
    return_sequences=True,
    input_shape=(timestamps, nb_features),
    dropout=0.2,
    recurrent_dropout=0.2
    )
)
model.add(TimeDistributed(Dense(1)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes,
                activation='softmax'))
model.compile(loss="categorical_crossentropy",
              metrics=['accuracy'],
              optimizer='adadelta')
Unfortunately, all predictions belong to class 1!!! The model always predicts 1 for any input...
Appreciate any pointers on how I can solve this task.
Update:
Dimensions of input data:
94981 train sequences
29494 test sequences
X_train shape: (94981, 20, 18)
X_test shape: (29494, 20, 18)
y_train shape: (94981, 4)
y_test shape: (29494, 4)
Basically in the train data I have 94981 samples. Each sample contains a sequence of 20 timestamps. There are 18 features.
The imbalance between target classes (10K, 500K, 90K, 30K) is just an example. I have similar proportions in my real dataset.
First of all, you have ~100k samples. Start with something smaller, like 100 samples and multiple epochs, and see whether your model overfits to this smaller training dataset (if it can't, you either have an error in your code or the model is not capable of modeling the dependencies [I would go with the second case]). Seriously, start with this one. And remember to represent all of your classes in this small dataset.
Secondly, the hidden size of the LSTM may be too small: you have 18 features for each timestep and sequences of length 20, while your hidden size is only 10. And you apply dropout on top of that, regularizing the network even further.
Furthermore, you may want to add some dense output units instead of merely returning a linear layer of size 10 x 1 for each timestep.
Last but not least, you may want to upsample the underrepresented classes: class 0 would have to be repeated say 50 times (or maybe 25), class 2 around 4-5 times and the last one around 10-15 times, so the network actually trains on them; a minimal sketch follows below.
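A sketch of that per-class upsampling with sklearn's resample, assuming X_train and y_train are the (94981, 20, 18) and one-hot (94981, 4) arrays from the question:
import numpy as np
from sklearn.utils import resample

labels = y_train.argmax(axis=1)             # one-hot -> class indices
majority_count = np.bincount(labels).max()  # target count per class

parts_X, parts_y = [], []
for cls in np.unique(labels):
    X_cls, y_cls = X_train[labels == cls], y_train[labels == cls]
    # Repeat minority samples (with replacement) up to the majority count
    X_res, y_res = resample(X_cls, y_cls, replace=True,
                            n_samples=majority_count, random_state=123)
    parts_X.append(X_res)
    parts_y.append(y_res)

X_balanced = np.concatenate(parts_X)
y_balanced = np.concatenate(parts_y)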
Oh, and use cross-validation for your hyperparameters like the hidden size, number of dense units etc.
Plus, I don't know how many epochs you've been training this network for, or what your test dataset looks like (it is entirely possible that it consists only of the first class if you haven't done stratification).
I think this will get you started, hit me up with any doubts in the comments.
EDIT: When it comes to metrics, you may want to check something other than mere accuracy; maybe the F1 score, plus loss monitoring and accuracy, to see how it performs. There are other choices available; for inspiration you can check sklearn's documentation, as it provides quite a few options.
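For instance, a quick per-class check with sklearn, assuming y_test is one-hot and y_pred holds the softmax outputs (both assumed names):
from sklearn.metrics import classification_report

# Precision, recall and F1 per class, computed from class indices
print(classification_report(y_test.argmax(axis=1), y_pred.argmax(axis=1)))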

Regressor Neural Network built with Keras only ever predicts one value

I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days, I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all inputs, as this was the optimal solution. (This is illustrated in the scatter plot below.)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'
                       ),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are still all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an adamoptimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That means that if you have two different features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the large-range feature dominates the scale and the other one is squashed to nearly constant values.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also gives unit variance (which makes some assumption about your data, but can potentially help, too).
To transform your data, use something along these lines:
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
either scale the training and test data together and then split afterwards,
or fit the scaler only on the training data and then use the same scaler to transform your test data.
Never fit_transform your test set separately from the training data!
Otherwise you have potentially different mean/min/max values and you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset that doesn't follow exactly the same properties (due to small sample size, etc.).
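A minimal sketch of the second option, assuming X_train and X_test are your raw feature arrays:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training data only, then reuse the fitted scaler for the test data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)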
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
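In tf.keras that could look like this (the learning-rate values are illustrative, not tuned):
from tensorflow import keras

optimizer = keras.optimizers.Adam(learning_rate=0.001)
# or, plain SGD with momentum as a rule-of-thumb alternative:
# optimizer = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])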
Turns out the error was a really stupid and easy-to-miss bug.
When importing my dataset, I shuffle it; however, I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, so of course the model didn't know what to do with it.
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.
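For reference, a minimal way to shuffle features and labels together is to draw a single permutation and index both arrays with it (features and labels are assumed NumPy array names):
import numpy as np

# One shared permutation keeps each label paired with its feature row
perm = np.random.permutation(len(labels))
features, labels = features[perm], labels[perm]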

LSTM validation

I have a dataset with 100k rows, which are pairs of store-item numbers (e.g. (store 1, item 190)), and 300 columns, which are a series of dates (e.g. 2017-01-01, 2017-01-02, 2017-01-03, ...). The values are the sales.
I am trying to use a Keras LSTM to predict future sales; how should I construct my train and validation datasets?
I am thinking of splitting train and validation like data[:, :n_days] and data[:, n_days:], so I will have the same number of samples (100k) in both my train and validation datasets. Am I thinking about this wrong?
If this is the way, how should I define n_days? Should the train and validation datasets have exactly the same dimensions? (Something like n_days = 150, with 149 days used to predict 1 day.)
how can I construct my train and validation dataset?
Not sure if a rule of thumb, but a common approach is to split your dataset into a ~80% training set and ~20% validation set; in your case this would be approximately 80k and 20k. The actual percentages may vary, but that ratio is the one most sources recommend. Ideally you would also want to have a test dataset, one that you never used during training or validation, to evaluate the final performance of your models.
Now, regarding the shape of your data it is important to recall what the keras docs on Recurrent Layers says:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
Defining this shape would depend on the nature of your problem. You mention you want to predict sales, so this can be phrased as a Regression Problem. You also mention your data consists of 300 columns that make up your time series, and naturally you have the real sales value for each of those rows.
In this case, using a batch size of 1, your shape would be (1, 300, 1), which means you are training on batches of 1 element (the most thorough gradient update), where each has 300 time steps and 1 feature or dimension at each time step.
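A small sketch of getting the data into that 3D layout, assuming data is the 100k x 300 NumPy array of sales:
import numpy as np

# (n_series, n_days) -> (n_series, n_days, 1): one feature per time step
data_3d = data.reshape((data.shape[0], data.shape[1], 1))
print(data_3d.shape)  # e.g. (100000, 300, 1)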
For splitting your data, one option that has helped me before is the train_test_split method from sklearn, where you simply pass your data and labels and indicate the ratio you want:
from sklearn.model_selection import train_test_split
#Split your data to have 15% validation split
X, X_val, Y, Y_val = train_test_split(data, labels, test_size=0.15)
