I am trying to apply deep learning to a multi-class classification problem with high class imbalance between target classes (10K, 500K, 90K, 30K). I want to write a custom loss function.
This is my current model:
model = Sequential()
model.add(LSTM(
    units=10,                 # number of units returned by the LSTM
    return_sequences=True,
    input_shape=(timestamps, nb_features),
    dropout=0.2,
    recurrent_dropout=0.2
))
model.add(TimeDistributed(Dense(1)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'],
              optimizer='adadelta')
Unfortunately, all predictions belong to class 1; the model predicts class 1 for every input...
I'd appreciate any pointers on how I can solve this.
Update:
Dimensions of input data:
94981 train sequences
29494 test sequences
X_train shape: (94981, 20, 18)
X_test shape: (29494, 20, 18)
y_train shape: (94981, 4)
y_test shape: (29494, 4)
Basically, the training data has 94,981 samples. Each sample is a sequence of 20 timestamps, and there are 18 features per timestamp.
The imbalance between target classes (10K, 500K, 90K, 30K) is just an example. I have similar proportions in my real dataset.
First of all, you have ~100k samples. Start with something smaller, say 100 samples, train for multiple epochs and see whether your model can overfit this smaller training dataset (if it can't, you either have an error in your code or the model is not capable of modelling the dependencies; I would bet on the second case). Seriously, start with this one. And remember to represent all of your classes in this small dataset.
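For example, a minimal sanity check could look like the sketch below (it reuses the X_train, y_train, nb_classes and model from the question; the subset size and epoch count are arbitrary):
import numpy as np

# Sketch: pick a small, class-balanced subset and check the model can overfit it.
idx = np.concatenate([
    np.where(y_train.argmax(axis=1) == c)[0][:25]   # 25 samples per class
    for c in range(nb_classes)
])
X_small, y_small = X_train[idx], y_train[idx]

model.fit(X_small, y_small, epochs=200, batch_size=16, verbose=0)
print(model.evaluate(X_small, y_small))   # training loss should approach 0 if the model can fit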
Secondly, the hidden size of the LSTM may be too small: you have 18 features per timestep and sequences of length 20, while your hidden size is only 10. And on top of that you apply dropout, regularizing the network even further.
Furthermore, you may want to add some dense output units instead of merely returning a linear layer of size 10 x 1 for each timestep.
Last but not least, you may want to upsample the underrepresented classes: class 0 would have to be repeated, say, 50 times (or maybe 25), class 2 around 4-5 times, and class 3 around 10-15 times, so the network is actually trained on them.
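As an illustration, one way to upsample with plain NumPy is sketched below (not the only option; the class_weight argument of fit is an alternative; the one-hot y_train from the question is assumed):
import numpy as np

# Sketch: resample every class up to the size of the largest class.
labels = y_train.argmax(axis=1)
counts = np.bincount(labels)
target = counts.max()

resampled_idx = np.concatenate([
    np.random.choice(np.where(labels == c)[0], size=target, replace=True)
    for c in range(len(counts))
])
np.random.shuffle(resampled_idx)
X_balanced, y_balanced = X_train[resampled_idx], y_train[resampled_idx]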
Oh, and use cross-validation for your hyperparameters like the hidden size, number of dense units etc.
Plus, I don't know how many epochs you've been training this network for, or what your test dataset looks like (it is entirely possible that it consists mostly of the first class if you haven't stratified the split).
I think this will get you started; hit me up with any doubts in the comments.
EDIT: When it comes to metrics, you may want to check something other than mere accuracy, e.g. the F1 score, alongside monitoring your loss and accuracy, to see how the model really performs. There are other choices as well; for inspiration, you can check sklearn's metrics documentation, which provides quite a few options.
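For instance, a quick per-class evaluation with sklearn could look like this sketch (reusing the model and test arrays from the question):
from sklearn.metrics import classification_report, f1_score

# Sketch: per-class precision/recall/F1 instead of plain accuracy.
y_pred = model.predict(X_test).argmax(axis=1)
y_true = y_test.argmax(axis=1)
print(classification_report(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average='macro'))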
I'm trying to improve my model so it becomes a bit more accurate. Right now I'm training the model and getting the following training and validation accuracy:
For every epoch I get a training accuracy of 0.0003 and a validation accuracy of 0. I know this isn't good, but I don't know how I can fix it.
The data is normalized with MinMaxScaler. 4 of the 8 features are normalized (the other 4 are hour, day, day_of_week and month).
Update:
I've also tried normalizing the entire dataset, and it doesn't make a difference:
scaling = MinMaxScaler(feature_range=(0,1)).fit(df[cols])
df[cols] = scaling.transform(df[cols])
My model: the input data has shape (5351, 1, 8), so the input_shape is (1, 8).
model = keras.Sequential()
model.add(keras.layers.Bidirectional(
    keras.layers.LSTM(2,
                      input_shape=(X_train.shape[1], X_train.shape[2]),
                      return_sequences=True,
                      activation='linear')))
model.add(keras.layers.Dense(1))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['acc'])
history = model.fit(
    X_train, y_train,
    epochs=200,
    batch_size=24,
    validation_split=0.35,
    shuffle=False,
)
I tried using the answer to this question: Keras model accuracy not improving, but it didn't work.
A mean_squared_error loss is for regression tasks, while an acc metric is for classification problems, so it makes no sense to use them together.
If you are working on a classification problem, use binary_crossentropy or categorical_crossentropy as the loss and keep the metrics parameter as it is.
If it is a regression task, change the metric to ['mse'] (mean squared error) instead of ['acc'].
Your model "works" and you have applied the standard formula for backpropagation by using the mean squares error loss. But measuring the accuracy will make Keras check if your model's output is EXACTLY equals to the expected values. Since the loss function is for regression, it will hardly ever be equal.
Three last points, because that little change won't correct everything.
Firstly, your last Dense layer should have an activation function (it's safer).
Secondly, I'm pretty sure a Bidirectional+LSTM layer placed just before a Dense layer should have return_sequences=False. An LSTM layer (with or without Bidirectional) can return the full sequence of vectors (like a matrix), but a Dense layer takes vectors as input. In this case it happens to work anyway because of the third point.
The last point is about the shape of your data. You have 5351 examples of shape (1, 8), each of which is essentially a single vector of size 8. But an LSTM layer takes a sequence of vectors, and here the length of your sequence is one, so I don't know whether an RNN-type layer is relevant here at all.
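Putting the first two points together, a revised model could look like this sketch (layer sizes are kept from the question; the tanh/linear activation choices are assumptions):
# Sketch: return_sequences=False (the default) so the LSTM hands a single
# vector to Dense, plus an explicit activation on the output layer.
model = keras.Sequential()
model.add(keras.layers.Bidirectional(
    keras.layers.LSTM(2, activation='tanh'),
    input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(keras.layers.Dense(1, activation='linear'))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['mse'])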
I am learning how to use the LSTM model in Keras. I have looked at this answer and this answer, and would like to train a model in the many-to-many manner but at testing time make predictions using the one-to-many with stateful=True manner. I am unsure if I am on the right track.
I have a data set comprising 10,000 individuals, each with a sequence of 20 timesteps and 10 features. I want to train an LSTM model to predict 5 of the features at the next timestep. Using a 90-10 train/test split, my train_x is shaped (9000, 20, 10) and my train_y is shaped (9000, 20, 5), where the values in y are the values of the selected features at the next timestep. My test_x is shaped (1000, 20, 10).
At test time, I would like to use the trained model to make predictions using only the 10 features at the very start of the sequence (timestep 0): first predict the values of the selected 5 features at the next timestep. The values of the other 5 features at the next timestep are known, so I would like to combine them with the 5 predicted features and use that as input again to predict the 5 features at the following timestep, and so on for 20 steps.
Is it possible to do this using the Keras library?
My code for training looks like
t_model = Sequential()
t_model.add(LSTM(100, return_sequences=True,
                 input_shape=(train_x.shape[1],
                              train_x.shape[2])))
t_model.add(TimeDistributed(Dense(5)))
t_model.compile(loss='mean_squared_error',
                optimizer='adam')
checkpointer = ModelCheckpoint(filepath='weights.hdf5',
                               verbose=1,
                               save_best_only=True)
history = t_model.fit(train_x, train_y, epochs=50,
                      validation_split=0.1, callbacks=[checkpointer],
                      verbose=2, shuffle=False)
This seems to train ok. Please let me know if there is any misunderstanding in the way I am structuring my model.
My code for testing looks like
p_model = Sequential()
p_model.add(LSTM(100, stateful=True,
                 return_sequences=True,
                 batch_input_shape=(1, 1,
                                    test_x.shape[2])))
p_model.add(TimeDistributed(Dense(5)))
p_model.load_weights('weights.hdf5')

complete_yhat = np.empty([0, 5])
for i in range(len(test_x)):
    ind = test_x[i]
    x = ind[0]
    x = x.reshape(1, 1, x.shape[0])
    for j in range(20):
        yhat = p_model.predict(x)
        complete_yhat = np.append(complete_yhat, yhat[0], axis=0)
        if j < 19:
            x = ind[j+1]
            x = np.append([x[:-5]], yhat[0], axis=1)
            x = x.reshape(1, x.shape[0], x.shape[1])
    p_model.reset_states()
This runs ok, but I am struggling to get good forecast accuracy. Can someone let me know whether I am using Keras LSTM correctly?
Thank you for your help
I am not sure you can really train a model with a many-to-many architecture and then test it one-to-many. You might be able to hack something together and have a piece of code that runs, but from a technical point of view this does not make much sense. Can you explain why you want to do one-to-many at test time?
Generally, the rule of thumb for any supervised machine learning model is that your training phase should "resemble" your testing phase. For example, if you want to test a one-to-many architecture, then you should also train it as one-to-many.
Edit:
Reading the comments, it seems that you want to train with features from one time step and see how the model performs for future time steps. (I think this is at odds with the nature of time-series data, where every sample contributes to the future state; if one sample could predict the future very well, it would mean the subsequent samples are useless... but anyway.) Here is how you can do this. There are other ways, of course...
Split your data for training and testing in the same way you plan to use it at test time, so your input should be of shape (None, 10) and your output of shape (None, 20, 5). Then use Keras' RepeatVector on your input (like this: output = RepeatVector(20)(input)), which gives you something of shape (None, 20, 10) that you can pass through the rest of your model.
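A minimal sketch of that idea (layer sizes follow the question; train_x_t0, holding only the 10 features at timestep 0, is a hypothetical name):
from keras.models import Sequential
from keras.layers import RepeatVector, LSTM, TimeDistributed, Dense

# Sketch: train one-to-many, matching the intended one-to-many test setup.
o_model = Sequential()
o_model.add(RepeatVector(20, input_shape=(10,)))   # (None, 20, 10)
o_model.add(LSTM(100, return_sequences=True))      # (None, 20, 100)
o_model.add(TimeDistributed(Dense(5)))             # (None, 20, 5)
o_model.compile(loss='mean_squared_error', optimizer='adam')
# o_model.fit(train_x_t0, train_y, epochs=50, validation_split=0.1)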
The accuracy starts off at around 40% and drops down during one epoch to 25%
My model:
self._model = keras.Sequential()
self._model.add(keras.layers.Dense(12, activation=tf.nn.sigmoid)) # hidden layer
self._model.add(keras.layers.Dense(len(VCDNN.conventions), activation=tf.nn.softmax)) # output layer
optimizer = tf.train.AdamOptimizer(0.01)
self._model.compile(optimizer, loss=tf.losses.sparse_softmax_cross_entropy, metrics=["accuracy"])
I have 4 labels and 60k rows of data, split evenly across the labels (so 15k each), plus 20k rows of data for evaluation.
my data example:
name        label
abcTest     label1
mete_Test   label2
ROMOBO      label3
test        label4
The input is turned into integers for each character and then one-hot encoded, and the output is just turned into integers [0-3].
1 epoch evaluation (loss, acc):
[0.7436684370040894, 0.25]
UPDATE
More details about the data
The strings are of up to 20 characters
I first convert each character to an int based on an alphabet dictionary (a: 1, b: 2, c: 3, ...), and if a word is shorter than 20 chars I pad the rest with 0s. Those values are then one-hot encoded and reshaped, so (assuming a maximum of 5 characters):
1. ["abc","d"]
2. [[1,2,3,0,0],[4,0,0,0,0]]
3. [[[0,1,0,0,0],[0,0,1,0,0],[0,0,0,1,0],[1,0,0,0,0],[1,0,0,0,0]],[[0,0,0,0,1],[1,0,0,0,0],[1,0,0,0,0],[1,0,0,0,0],[1,0,0,0,0]]]
4. [[0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0],[0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0]]
The labels describe the way a word is spelled, basically its naming convention, e.g. all lowercase: unicase, testBest: camelCase, TestTest: PascalCase, test_test: snake_case.
With 2 extra layers added and the LR reduced to 0.001:
Pic of training
Update 2
self._model = keras.Sequential()
self._model.add(keras.layers.Embedding(VCDNN.alphabetLen, 12,
                                       input_length=VCDNN.maxFeatureLen * VCDNN.alphabetLen))
self._model.add(keras.layers.LSTM(12))
self._model.add(keras.layers.Dense(len(VCDNN.conventions), activation=tf.nn.softmax))  # output layer
self._model.compile(tf.train.AdamOptimizer(self._LR), loss="sparse_categorical_crossentropy",
                    metrics=self._metrics)
It seems to start and then immediately dies with no error message (exit code -1073740791).
The 0.25 accuracy means the model couldn't learn anything useful, as it is the same as random guessing. This suggests the network structure may not be a good fit for the problem.
Currently, recurrent neural networks such as LSTMs are more commonly used for sequence modeling. For instance:
model = Sequential()
model.add(Embedding(char_size, embedding_size))
model.add(LSTM(hidden_size))
model.add(Dense(len(VCDNN.conventions), activation='softmax'))
This will work better if the label is related to the character-sequence information of the input words.
This means your model isn't really learning anything useful. It might be stuck in a local minimum. This could be due to the following reasons:
a) You don't have enough training data to train a neural network. NNs usually require fairly large datasets to converge. Try a RandomForest classifier first to see what results you can get there (a minimal baseline sketch follows this list).
b) It's possible your target data has nothing to do with your training data, in which case it's impossible to train a model that maps between them efficiently without overfitting.
c) Your model could do with some improvements.
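A quick baseline of that sort might look like the sketch below (the flattened one-hot inputs and the integer labels 0-3 from the question are assumed, under hypothetical names X_train, y_train, X_test, y_test):
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Sketch: a non-neural baseline on the same flattened one-hot character features.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)   # X_train: (n_samples, 20 * alphabet_len), y_train: ints 0-3
print(accuracy_score(y_test, rf.predict(X_test)))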
If you want to have a go at improving your model, I would add a few extra dense layers with a few more units. So after line 2 of your model (the first Dense layer) I'd add:
self._model.add(keras.layers.Dense(36, activation=tf.nn.sigmoid))
self._model.add(keras.layers.Dense(36, activation=tf.nn.sigmoid))
Another thing you can try is a different learning rate. I'd go with the default for AdamOptimizer, which is 0.001, so just change 0.01 to 0.001 in the AdamOptimizer() call.
You may also want to train for more than just one epoch.
I have a network with 32 input nodes, 20 hidden nodes and 65 output nodes. The network's input is actually a hash code of length 32, and the output is the word.
The input is the ASCII value of each character of the hash code. The output of the network is a binary representation I have made up: say, for example, a is 00000 and b is 00001, and so on. It only includes the alphabet and the space, which is why it's only 5 bits per character. I have a maximum limit of 13 characters per word in my training data, so the number of output nodes is 13 * 5 = 65, and I'm expecting a binary output like 10101010101010101010101010101010101010101010101010101010101001011. The bit sequence can encode a word of at most 16 characters given a hash code of length 32 as input. Below is my current code:
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train_samples = scaler.fit_transform(train_samples.reshape(-1, 32))
train_labels = train_labels.reshape(-1, 65)

model = Sequential([
    Dense(32, input_shape=(32,), activation='sigmoid'),
    BatchNormalization(),
    Dense(25, activation='tanh'),
    BatchNormalization(),
    Dense(65, activation='sigmoid')
])

overfitCallback = EarlyStopping(monitor='loss', min_delta=0, patience=1000)

model.summary()
model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_samples, train_labels, batch_size=1000, epochs=1000000,
          callbacks=[overfitCallback], shuffle=True, verbose=2)
I plan to overfit the model so that it memorizes all the hash codes of the words in the dictionary. As a start, my training set is only around 5,000 samples; I just wanted to see whether it would learn from a small dataset. How can I make the network converge faster? It has been running for more than an hour, and the loss is still around 0.5004 with an accuracy of 0.7301. It goes up and down, but when I check every 10 minutes or so I can only see a little improvement. How should I fine-tune it?
UPDATE:
The training has already stopped, but it didn't converge. Its loss is 0.4614 and the accuracy is 0.7422.
There are some hyperparameters that I would suggest changing first.
Try 'relu' or LeakyReLU() as the activation function for the non-output layers. Relu is basically the standard activation function for baseline models.
The standard optimizer (for most cases) currently is Adam; try using it, and tweak its learning rate when needed. You could get better results with SGD, but it often takes a lot more epochs and a lot of hyperparameter tuning. Adam is, in general, the quickest optimizer to reach a 'low' loss.
To prevent overfitting you might also want to add Dropout(0.5), where 0.5 is just an example value.
Once you have reached the lowest loss you can, you might start changing these hyperparameters even further to try and get a lower loss.
Apart from this, the first thing I would actually suggest is adding multiple hidden layers with different sizes, as in the sketch below. This might have a much larger impact than trying to optimize all the hyperparameters.
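Putting these suggestions together, a revised model could look something like the sketch below (the layer sizes and the dropout rate are assumptions, not tuned values; the 65-unit sigmoid output with binary_crossentropy is kept from the question):
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization
from keras.optimizers import Adam

# Sketch: relu hidden layers, a couple of extra hidden layers, Dropout, and Adam.
model = Sequential([
    Dense(128, input_shape=(32,), activation='relu'),
    BatchNormalization(),
    Dropout(0.5),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(65, activation='sigmoid')   # 65 independent bits, so sigmoid + binary_crossentropy
])
model.compile(Adam(lr=0.001), loss='binary_crossentropy', metrics=['accuracy'])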
Edit: Maybe you could post a screenshot of your training loss vs. epochs for the train & validation data? That might make things clearer for others.
I read this blog here to understand the theoretical background of this, but after reading it I am a bit confused about what 1) timesteps, 2) unrolling, 3) number of hidden units and 4) batch size are. Maybe someone could explain this on a code basis as well, because when I look into the model config the code below does not unroll, so what is the timestep doing in this case? Let's say I have data of 2,000 points, split into 40 time steps with one feature each. The hidden units are 100 and the batch size is not defined. What is happening in the model?
model = Sequential()
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)
Is the code below still an encoder-decoder model without a RepeatVector?
model = Sequential()
model.add(LSTM(100, return_sequences=True, input_shape=(n_timesteps_in, n_features)))
model.add(LSTM(100, return_sequences=True))
model.add(TimeDistributed(Dense(n_features, activation='tanh')))
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
history=model.fit(train, train, epochs=epochs, verbose=2, shuffle=False)
"Unroll" is just a mechanism to process the LSTMs in a way that makes them faster by occupying more memory. (The details are unknown for me... but it certainly has no influence in steps, shapes, etc.)
When you say "2000 points split in 40 time steps", I have absolutely no idea of what is going on.
The data must be meaningfully structured and saying "2000" data points is really lacking a lot of information.
Data structured for LSTMs is:
I have a certain number of individual sequences (data evolving with time)
Each sequence has a number of time steps (measures in time)
In each step we measured a number of different vars with different meanings (features)
Example:
2000 users in a website
They used the site for 40 days
In each day I measured the number of times they clicked a button
I can plot how this data evolves with time daily (each day is a step)
So, if you have 2000 sequences (also called "samples" in Keras), each sequence with length of 40 steps, and one single feature per step, this will happen:
Dimensions
Batch size is defined as 32 by default in the fit method. The model will process batches containing 32 sequences/users until it reaches 2000 sequences/users.
input_shape will be required to be (40, 1) (the batch size is free to choose in fit).
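As a small sketch of these dimensions (the array name and the random data are hypothetical; the model is the one from the question, with input_shape=(40, 1)):
import numpy as np

# Sketch: 2000 sequences/users, 40 steps each, 1 feature per step.
train = np.random.rand(2000, 40, 1)            # (samples, steps, features)
history = model.fit(train, train, epochs=10)   # batch_size defaults to 32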
Steps
Your LSTMs will try to understand how clicks vary in time, step by step. That's why they're recurrent, they calculate things for a step and feed these things into the next step, until all 40 steps are processed. (You won't see this processing, though, it's internal)
With return_sequences=True, you will get the output for all steps.
Without it, you will get only the output for the last step.
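A tiny illustration of the difference in shapes (the layer size and input shape follow the question):
from keras.models import Sequential
from keras.layers import LSTM

# Sketch: with return_sequences=True the 40 steps are kept in the output,
# without it only the last step's output remains.
seq_model = Sequential([LSTM(100, return_sequences=True, input_shape=(40, 1))])
print(seq_model.output_shape)    # (None, 40, 100)

last_model = Sequential([LSTM(100, input_shape=(40, 1))])
print(last_model.output_shape)   # (None, 100)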
The model
The model will process 32 parallel (and independent) sequences/users together in each batch.
The first LSTM layer will process the entire sequence in recurrent steps and return a final result. (The sequence is killed, there are no steps left because you didn't use return_sequences=True)
Output shape = (batch, 100)
You create a new sequence with RepeatVector, but this sequence is constant in time.
Output shape = (batch, 40, 100)
The next LSTM layer processes this constant sequence and produces an output sequence, with all 40 steps
Output shape = (batch, 40, 100)
The TimeDistributed(Dense) will process each of these steps, but independently (in parallel), not recursively as the LSTMs would do.
Output shape = (batch, 40, n_features)
The output will be the total group of 2000 sequences (processed in groups of 32), each with 40 steps and n_features output features.
Cells, features, units
Everything is independent.
Input features is one thing, output features is another. There is no requirement for Dense to use the same number of features used in input_shape, unless that's what you want.
When you use 100 units in the LSTM layer, it will produce an output sequence of 100 features, shape (batch, 40, 100). If you use 200 units, it will produce an output sequence with 200 features, shape (batch, 40, 200). This is computing power. More neurons = more intelligence in the model.
Something strange in the model:
You should replace:
model.add(LSTM(100, input_shape=(n_timesteps_in, n_features)))
model.add(RepeatVector(n_timesteps_in))
With only:
model.add(LSTM(100, return_sequences=True,input_shape=(n_timesteps_in, n_features)))
Not returning sequences in the first layer and then creating a constant sequence with RepeatVector is sort of destroying the work of your first LSTM.