Neural Network regression when the output is imbalanced - python

I am trying to perform regression using a neural network to predict a single output from 146 input features.
I applied Standard Scaling on all inputs and output.
I monitor the Mean Absolute Error after training and it is unreasonably high on the train, validation and test sets (I am not even overfitting).
I suspect this is due to the fact that the output variable is very imbalanced (see histogram).
From the histogram it is possible to see that most of the samples are grouped around 0 but there is also another small group of samples around -5.
[Figure: histogram of the imbalanced output]
This is model creation code:
from keras import optimizers
from keras.layers import Input, Dense, Dropout, BatchNormalization
from keras.models import Model

# X is the standardized feature matrix (146 columns)
inputs = Input(batch_shape=(None, X.shape[1]))  # renamed to avoid shadowing the built-in input()
layer1 = Dense(20, activation='relu')(inputs)
layer1 = Dropout(0.3)(layer1)
layer1 = BatchNormalization()(layer1)
layer2 = Dense(5, activation='relu', kernel_regularizer='l2')(layer1)
layer2 = Dropout(0.3)(layer2)
layer2 = BatchNormalization()(layer2)
out_layer = Dense(1, activation='linear')(layer2)
model = Model(inputs=inputs, outputs=out_layer)
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(),
              metrics=['mae'])
This is the model summary:
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) (None, 146) 0
_________________________________________________________________
dense_1 (Dense) (None, 20) 2940
_________________________________________________________________
dropout_1 (Dropout) (None, 20) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 20) 80
_________________________________________________________________
dense_2 (Dense) (None, 5) 105
_________________________________________________________________
dropout_2 (Dropout) (None, 5) 0
_________________________________________________________________
batch_normalization_2 (Batch (None, 5) 20
_________________________________________________________________
dense_3 (Dense) (None, 1) 6
=================================================================
Total params: 3,151
Trainable params: 3,101
Non-trainable params: 50
_________________________________________________________________
Looking at the actual model predictions, the large error mainly happens for samples with a true output value around -5 (the small group of samples).
I tried many configurations for the hyperparameters but still the error is very high.
I see many suggestions on performing neural network classification on imbalanced data but what could be done with regression?
It seems odd to me that a regression neural network is not learning this correctly. What am I doing wrong?

From your histogram, it looks as though it's rare for there to be a non-zero output. This is similar to a classification problem where we're trying to predict a rare class, in that a strong strategy in terms of the loss function is simply to guess the most common class - in this case your modal value of zero.
You should do some research around what people do to predict rare events or to classify inputs when some classes are rare. E.g. this discussion might be helpful: https://www.reddit.com/r/MachineLearning/comments/412wpp/predicting_rare_events_how_to_prevent_machine/
Some strategies you might try include:
- Removing most of the zero-output training examples so that your training data is more balanced
- Creating or acquiring more non-zero training examples
- Using a different machine learning algorithm (someone at the link I provided recommends boosting; I wonder if you'd get good results from using a residual neural network structure, which is in some ways similar to boosting)
- Re-structuring or rescaling your data to add more weight to the rare values
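One way to "add more weight to the rare values" is to weight each training sample by the inverse frequency of its target value. The sketch below is only an illustration: the function name `inverse_frequency_weights`, the bin count, and the synthetic targets (a large group near 0 and a small group near -5, mimicking the histogram described in the question) are all assumptions, not part of the original post.

```python
import numpy as np

def inverse_frequency_weights(y, n_bins=10):
    """Weight each sample by the inverse frequency of its target-value bin."""
    y = np.asarray(y, dtype=float).ravel()
    # Equal-width bins spanning the observed target range
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    # Digitizing against the interior edges yields bin indices 0..n_bins-1
    bin_idx = np.digitize(y, edges[1:-1])
    counts = np.bincount(bin_idx, minlength=n_bins)
    weights = 1.0 / counts[bin_idx]
    # Normalise so the average weight is 1 and the overall loss scale is unchanged
    return weights * len(y) / weights.sum()

# Synthetic stand-in for the imbalanced target: 950 samples near 0, 50 near -5
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 0.2, 950), rng.normal(-5.0, 0.2, 50)])
w = inverse_frequency_weights(y)

print("mean weight, common group:", w[y > -2.5].mean())
print("mean weight, rare group:  ", w[y <= -2.5].mean())
```

Weights like these could then be passed to Keras through the `sample_weight` argument of `model.fit`. Note that isolated outliers also end up in sparsely populated bins and receive large weights, so clipping the weights may be worth considering.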

It appears to me that you have a normal distribution with a very small standard deviation, in which case this should train just as well as on any other probability distribution.

Related

Keras Sequential model input: How significant are the dimensions?

I am trying to build a multioutput classifier on 3D data structured like [sampleID, timestamp, deviceID, sensorID] with one-hot labels like [sampleID, deviceID] to determine which device "wins".
In a nutshell, it is a massive collection of timeseries readings from five sensors taken at regular intervals from each of four different devices. The objective is to determine which of the devices is most likely to be in a particular state at the end of each sampleID. The labels are a one-hot representation of the devices.
In a case like this where a human would find meaning in the structure of the dataset, does the training process derive similar benefit? Can I simplify my dataset by reducing it to [dataset, deviceID, timestamp X sensor] or even [dataset, deviceID X timestamp X sensor] and still get similar accuracy?
In other words would simplifying the following dataset:
[10000, 1000, 4, 5]
down to
[10000, 4, 5000]
or
[10000, 1000, 20]
or even
[10000, 20000]
significantly diminish the model's ability to classify output?
Edited for detail and formatting.
IIUC, you are asking if using 1000 timesteps for 20 objects (device X sensor) is better than using 1000 timesteps for 4 devices for 5 sensors.
There is no way of actually determining which would better model your problem, but, we can quickly build some tests to see which models capture the complexity of the problem better.
Case 1: 1000 time steps, 20 objects -> Sequential LSTM based model
If you consider the 20 device-sensor combinations individually, you can simply use an LSTM-based model and let it handle the non-linear relationships between them. Since each sample is then a 2D input (timesteps x features), reshape your data and build a model with the following structure. Feel free to add more layers, activations, etc.
from tensorflow.keras import layers, Model, utils
#Temporal model
inp = layers.Input((1000,20))
x = layers.LSTM(30, return_sequences=True)(inp)
x = layers.LSTM(30)(x)
out = layers.Dense(4, activation='softmax')(x)
model = Model(inp, out)
model.summary()
Model: "model_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_6 (InputLayer) [(None, 1000, 20)] 0
_________________________________________________________________
lstm_4 (LSTM) (None, 1000, 30) 6120
_________________________________________________________________
lstm_5 (LSTM) (None, 30) 7320
_________________________________________________________________
dense_20 (Dense) (None, 4) 124
=================================================================
Total params: 13,564
Trainable params: 13,564
Non-trainable params: 0
_________________________________________________________________
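The reshaping step for Case 1 can be sketched with NumPy. This is a minimal illustration on a small stand-in array (8 samples instead of 10000); the variable names are made up for the example. It also shows that the per-device layout from the question, [samples, devices, timesteps x sensors], needs a transpose before the reshape, because a plain reshape would mix values from different devices.

```python
import numpy as np

# Small stand-in for the [samples, timesteps, devices, sensors] array
samples, timesteps, devices, sensors = 8, 1000, 4, 5
data = np.arange(samples * timesteps * devices * sensors, dtype=np.float32)
data = data.reshape(samples, timesteps, devices, sensors)

# Case 1 layout: merge device and sensor into 20 features per timestep.
# The two merged axes are adjacent, so a plain reshape is enough.
flat_features = data.reshape(samples, timesteps, devices * sensors)

# Per-device layout: move the device axis in front of time, then merge
# the (now adjacent) time and sensor axes.
per_device = data.transpose(0, 2, 1, 3).reshape(samples, devices, timesteps * sensors)

print(flat_features.shape)  # (8, 1000, 20)
print(per_device.shape)     # (8, 4, 5000)
```

In `flat_features`, feature index `device * 5 + sensor` holds the reading of that sensor on that device; in `per_device`, column `timestep * 5 + sensor` holds it.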
Case 2: 1000 time steps, 4x5 objects -> Conv-LSTM based model
Since you have a 3D input, you want to treat the 4x5 grid as your spatial axes and the 1000 timesteps as your channels/feature maps/temporal features. Since your data is laid out channels-first, specify data_format="channels_first" in the Conv2D as well as the MaxPooling2D layers.
Then, once you have convolved over the spatial axes, you can start working on the feature maps with an LSTM. Sample code below, feel free to modify and build on top of this.
from tensorflow.keras import layers, Model, utils
#Conv-LSTM model
inp = layers.Input((1000,4,5))
x = layers.Conv2D(30,2, data_format="channels_first")(inp)
x = layers.MaxPooling2D(2, data_format="channels_first")(x)
x = layers.Reshape((-1,2))(x)
x = layers.LSTM(20)(x)
out = layers.Dense(4, activation='softmax')(x)
model = Model(inp, out)
model.summary()
Model: "model_21"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_25 (InputLayer) [(None, 1000, 4, 5)] 0
_________________________________________________________________
conv2d_19 (Conv2D) (None, 30, 3, 4) 120030
_________________________________________________________________
max_pooling2d_14 (MaxPooling (None, 30, 1, 2) 0
_________________________________________________________________
reshape_10 (Reshape) (None, 30, 2) 0
_________________________________________________________________
lstm_19 (LSTM) (None, 20) 1840
_________________________________________________________________
dense_30 (Dense) (None, 4) 84
=================================================================
Total params: 121,954
Trainable params: 121,954
Non-trainable params: 0
_________________________________________________________________

Filter out classes before doing a prediction with a multi-class classifier

I built a multi-class classifier with Keras and want to improve my results by filtering out classes I know won't be useful for a certain datapoint, before doing a prediction. In other words, I want to narrow down the possibilities as much as possible before handing over to the model. In my case I am classifying handwritten characters and digits, and sometimes I know that some inputs, for instance, can't be [A, B, F, K, 0, 3, 8, 9]. Filtering these out will definitely help, removing for example the ambiguity that can arise when classifying a 0 or an O.
Here is the model's summary:
Model: "sequential_10"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv2d_20 (Conv2D) (None, 24, 24, 32) 832
_________________________________________________________________
max_pooling2d_20 (MaxPooling (None, 12, 12, 32) 0
_________________________________________________________________
conv2d_21 (Conv2D) (None, 8, 8, 64) 51264
_________________________________________________________________
max_pooling2d_21 (MaxPooling (None, 4, 4, 64) 0
_________________________________________________________________
flatten_10 (Flatten) (None, 1024) 0
_________________________________________________________________
dropout_10 (Dropout) (None, 1024) 0
_________________________________________________________________
dense_17 (Dense) (None, 128) 131200
_________________________________________________________________
dense_18 (Dense) (None, 30) 3870
=================================================================
Total params: 187,166
Trainable params: 187,166
Non-trainable params: 0
_________________________________________________________________
The last fully connected layer has the softmax activation function. However, the outcome of a prediction using model.predict(), model.predict_classes() or model.predict_proba() will only yield results such as [0, 0, 1, 0, ..., 0]. I know this should be the probability of each class, but I always get all 0 and a single 1. To verify this isn't due to overfitting, I trained the model for a single epoch, reaching only ~60% accuracy (while if I continued the training I'd reach ~94%).
If I had access to the probabilities of each class I could do the filtering after the prediction. But since it only really outputs a single class, I can't risk filtering it out and ending up with a bunch of 0s.
Is there a way I can get the classes' probabilities, or filter out classes before the prediction, to avoid training multiple models for each sub-group of classes? To avoid misunderstandings, here is a bit more information: I am training a model on the emnist dataset (characters and digits) to classify my own data made of similar images of characters and digits. However, my data is divided into multiple groups, and for some of them I know they can't contain certain characters or digits. The model is trained with all classes (all digits and chars), and when I say "training multiple models" I mean training models with different subsets of classes (for instance only [A, B, 0, 1, 2, 3]), which I'm trying to avoid.
I would add a Lambda layer that masks out the irrelevant logits before the softmax activation (i.e., a 30-long vector of 1's and 0's is added as an input to your model as well). If you have the data you are talking about a-priori, you could even train your model with it to yield better results.
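The masking idea can be illustrated outside Keras with plain NumPy. This is only a sketch of the arithmetic such a layer would perform, not the answerer's actual implementation: the function name `masked_softmax` and the example values are hypothetical, and it assumes at least one class is allowed per row.

```python
import numpy as np

def masked_softmax(logits, allowed):
    """Softmax over the allowed classes only; disallowed classes get probability 0."""
    logits = np.asarray(logits, dtype=float)
    allowed = np.asarray(allowed, dtype=bool)
    # Disallowed logits become -inf, so exp() maps them to exactly 0
    masked = np.where(allowed, logits, -np.inf)
    # Subtract the row maximum for numerical stability
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

# Four classes for brevity; the last one has the largest logit but is ruled out
logits = np.array([[2.0, 1.0, 0.5, 3.0]])
allowed = np.array([[1, 1, 1, 0]])
probs = masked_softmax(logits, allowed)
print(probs)
```

Because the mask is applied before normalisation, the remaining probabilities still sum to 1, and the prediction falls back to the best class among those that are allowed.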

scikit-learn regression with multiple continuous targets

I want to perform regression on a dataset where the input has multiple features and the output has multiple continuous targets.
I've been looking through the sklearn documentation, but the only multi-target examples I've found have either 1) a discrete set of target labels or 2) use a heuristic algorithm like KNN instead of an optimization-based algorithm like regression. Adding regularization would also be great, but I can't find a method even for simple least-squares. This is a really simple, smooth optimization problem so I'd be shocked if it wasn't already implemented somewhere. I'd appreciate it if someone could point me in the right direction!
You can find what you are looking for here.
https://machinelearningmastery.com/multi-output-regression-models-with-python/
But it would be better to try using Keras if you have enough data (use an output layer without any activation).
from keras.layers import Dense, Input
from keras.models import Model
from keras.regularizers import l2
num_inputs = 10
num_outputs = 4
inp = Input((num_inputs,))
out = Dense(num_outputs, kernel_regularizer=l2(0.01))(inp)
model = Model(inp, out)
# accuracy is not meaningful for continuous targets, so track MAE instead
model.compile(optimizer='sgd', loss='mse', metrics=['mae'])
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_9 (InputLayer) (None, 10) 0
_________________________________________________________________
dense_7 (Dense) (None, 4) 44
=================================================================
Total params: 44
Trainable params: 44
Non-trainable params: 0
_________________________________________________________________
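For the plain least-squares case the question asks about, regularized multi-target regression also has a closed-form solution. The NumPy sketch below is an illustration under stated assumptions (synthetic data, no intercept term, a hypothetical helper named `fit_ridge`). In scikit-learn itself, `LinearRegression` and `Ridge` accept a 2D target array of shape (n_samples, n_targets) directly.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression for one or more targets (no intercept)."""
    n_features = X.shape[1]
    # Solve (X^T X + alpha * I) W = X^T Y; W has shape (n_features, n_targets)
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

# Synthetic data: 10 input features, 4 continuous targets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_W = rng.normal(size=(10, 4))
Y = X @ true_W + 0.01 * rng.normal(size=(200, 4))

W = fit_ridge(X, Y, alpha=1e-3)
predictions = X @ W
print(W.shape, predictions.shape)
```

Each column of `W` is the coefficient vector for one target, so this is equivalent to fitting one ridge regression per target with a shared `alpha`.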

How to improve LSTM model predictions and accuracy?

After adding a pre-trained embedding layer built with gensim, my val_accuracy has dropped to 45% on 4600 records:
model = models.Sequential()
model.add(Embedding(input_dim=MAX_NB_WORDS, output_dim=EMBEDDING_DIM,
weights=[embedding_model],trainable=False,
input_length=seq_len,mask_zero=True))
#model.add(SpatialDropout1D(0.2))
#model.add(Embedding(vocabulary_size, 64))
model.add(GRU(units=150, return_sequences=True))
model.add(Dropout(0.4))
model.add(LSTM(units=200,dropout=0.4))
#model.add(Dropout(0.8))
#model.add(LSTM(100))
#model.add(Dropout(0.4))
#Bidirectional(tf.keras.layers.LSTM(embedding_dim))
#model.add(LSTM(400,input_shape=(1117, 100),return_sequences=True))
#model.add(Bidirectional(LSTM(128)))
model.add(Dense(100, activation='relu'))
#
#model.add(Dropout(0.4))
#model.add(Dense(200, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
metrics=['accuracy'])
Model: "sequential_4"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_4 (Embedding) (None, 50, 100) 2746300
_________________________________________________________________
gru_4 (GRU) (None, 50, 150) 112950
_________________________________________________________________
dropout_4 (Dropout) (None, 50, 150) 0
_________________________________________________________________
lstm_4 (LSTM) (None, 200) 280800
_________________________________________________________________
dense_7 (Dense) (None, 100) 20100
_________________________________________________________________
dense_8 (Dense) (None, 4) 404
=================================================================
Total params: 3,160,554
Trainable params: 414,254
Non-trainable params: 2,746,300
_________________________________________________________________
Full code is at
https://colab.research.google.com/drive/13N94kBKkHIX2TR5B_lETyuH1QTC5VuRf?usp=sharing
Any help would be greatly appreciated. I am new to deep learning, have tried almost everything I know, and am now out of ideas.
The problem is with your input. You've padded your input sequences with zeros but have not provided this information to your model. So your model doesn't ignore the zeros which is the reason it's not learning at all. To resolve this, change your embedding layer as follows:
model.add(layers.Embedding(input_dim=vocab_size+1,
output_dim=embedding_dim,
mask_zero=True))
This will enable your model to ignore the zero padding and learn. Training with this, I got a training accuracy of 100% in just 6 epochs, though validation accuracy wasn't that good (around 54%), which is expected as your training data contains only 32 examples. More about the embedding layer: https://keras.io/api/layers/core_layers/embedding/
Since your dataset is small, the model tends to overfit on training data quite easily which gives lower validation accuracy. To mitigate this to some extent, you can try using pre-trained word embeddings like word2vec or GloVe instead of training your own embedding layer. Also, try some text data augmentation methods like creating artificial data using templates or replacing words in training data with their synonyms. You can also experiment with different types of layers (like replacing GRU with another LSTM) but in my opinion that may not help much here and should be considered after trying out pre-trained embeddings and data augmentation.
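To make the padding point concrete, here is a small NumPy sketch of zero padding and the mask it implies. With mask_zero=True the Embedding layer treats index 0 as padding, which is why real tokens must start at 1 and input_dim needs to be vocab_size+1. The helper `pad_sequences_with_zeros` and the token ids are hypothetical and only mimic the idea; they are not the Keras utility.

```python
import numpy as np

def pad_sequences_with_zeros(sequences, max_len):
    """Right-pad (or truncate) integer sequences to max_len, using 0 as padding."""
    padded = np.zeros((len(sequences), max_len), dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = seq[:max_len]
        padded[row, :len(trimmed)] = trimmed
    return padded

# Hypothetical token ids; real words start at 1 so that 0 can mean "padding"
sequences = [[4, 9, 2], [7], [3, 3, 8, 1, 5]]
padded = pad_sequences_with_zeros(sequences, max_len=4)

# The positions a masking-aware layer would keep: everything that is not padding
mask = padded != 0
print(padded)
print(mask)
```

Downstream recurrent layers that support masking skip the timesteps where the mask is False, so the padded zeros no longer influence the learned representation.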

how to configure data labels in a numpy array for training a Keras model?

I'm trying to implement Keras for my first time (so sorry for the dumb question) as part of a wider project to make an AI that learns to play connect 4. As part of this, I pass a NN a 6*7 grid and it outputs an array of 7 values giving the probabilities to pick for each column in the game. Here is the output of the Model.summary() method for a bit more detail:
______________________________________________________________
Layer (type) Output Shape Param #
=================================================================
flatten (Flatten) (None, 42) 0
_________________________________________________________________
dense (Dense) (None, 20) 860
_________________________________________________________________
dense_1 (Dense) (None, 20) 420
_________________________________________________________________
dense_2 (Dense) (None, 7) 147
=================================================================
Total params: 1,427
Trainable params: 1,427
Non-trainable params: 0
_________________________________________________________________
The model gives (at the moment random) predictions when I pass it numpy arrays of shape (1, 6, 7). However, when I try to train the model with an array of shape (221, 6, 7) for the data and an array of shape (221, 7) for the labels, I get this error:
ValueError: Error when checking target: expected dense_2 to have shape (1,) but got array with shape (7,)
This is the code I use to train the model (which outputs (221, 6, 7) and (221, 7)):
board_tensor = np.array(full_board_list)
print(board_tensor.shape)
label_tensor = np.array(full_label_list)
print(label_tensor.shape)
self.model.fit(board_tensor, label_tensor)
this is the code I use to define the model:
self.model = keras.Sequential([
keras.layers.Flatten(input_shape=(6, 7)),
keras.layers.Dense(20, activation=tf.nn.relu),
keras.layers.Dense(20, activation=tf.nn.relu),
keras.layers.Dense(7, activation=tf.nn.softmax)])
self.model.compile(optimizer=tf.train.AdamOptimizer(),
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
(the model is part of an AI object so that it could be compared to other types of AI objects)
This is the code which successfully predicts a batch of size 1, generated from a two-dimensional list representing the board (it outputs (1, 6, 7) and (1, 7)):
input_tensor = np.array(board.board)
input_tensor = np.expand_dims(input_tensor, 0)
print(input_tensor.shape)
probability_distribution = self.model.predict(input_tensor)
print(probability_distribution.shape)
I realise that the error is probably due to a lack of understanding on my part as to what the methods in Keras expect to be given; so as a little side-note, does anyone have any good, thorough learning resources which really get you to understand what each method is doing (ie. not just telling you which code to type in to make an image recogniser) that would be understandable to people new to Keras and Tensorflow like me?
thanks a lot in advance!
You are using the sparse_categorical_crossentropy loss, which takes integer labels (not one-hot encoded ones), while your labels are one-hot encoded. This is why you get an error.
The easiest way to fix it is to change loss to categorical_crossentropy.
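The difference between the two losses can be checked numerically. The NumPy sketch below uses made-up probabilities and labels for two 7-column boards; the function names mirror the Keras loss names but are plain re-implementations written for illustration. It also shows the alternative fix of keeping the sparse loss and converting one-hot labels to integer class ids with argmax.

```python
import numpy as np

def categorical_crossentropy(one_hot, probs):
    """Expects one-hot labels of shape (batch, classes)."""
    return -np.sum(one_hot * np.log(probs), axis=-1)

def sparse_categorical_crossentropy(class_ids, probs):
    """Expects integer labels of shape (batch,)."""
    return -np.log(probs[np.arange(len(class_ids)), class_ids])

# Hypothetical softmax outputs for two boards (7 columns each)
probs = np.array([
    [0.05, 0.10, 0.50, 0.10, 0.10, 0.10, 0.05],
    [0.30, 0.05, 0.05, 0.40, 0.10, 0.05, 0.05],
])
# One-hot labels, shape (2, 7), like the (221, 7) array in the question
one_hot = np.array([
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
])

# Alternative fix: integer labels of shape (2,) for the sparse loss
int_labels = one_hot.argmax(axis=1)

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(int_labels, probs)
print(int_labels, dense_loss, sparse_loss)
```

Both versions give the same loss values; only the expected label shape differs. The argmax conversion is only appropriate if the labels are strictly one-hot rather than soft probability distributions.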
