The accuracy starts at around 40% and drops to 25% within a single epoch.
My model:
self._model = keras.Sequential()
self._model.add(keras.layers.Dense(12, activation=tf.nn.sigmoid)) # hidden layer
self._model.add(keras.layers.Dense(len(VCDNN.conventions), activation=tf.nn.softmax)) # output layer
optimizer = tf.train.AdamOptimizer(0.01)
self._model.compile(optimizer, loss=tf.losses.sparse_softmax_cross_entropy, metrics=["accuracy"])
I have 4 labels and 60k rows of training data, split evenly across the labels (15k each), plus 20k rows of data for evaluation.
my data example:
name        label
abcTest     label1
mete_Test   label2
ROMOBO      label3
test        label4
The input is converted to an integer per character and then one-hot encoded; the output is just converted to an integer in [0-3].
1 epoch evaluation (loss, acc):
[0.7436684370040894, 0.25]
UPDATE
More details about the data
The strings are up to 20 characters long.
I first convert each character to an int based on an alphabet dictionary (a: 1, b: 2, c: 3, ...), and if a word is shorter than 20 characters I pad the rest with 0's. Those values are then one-hot encoded and reshaped, so:
assume max 5 characters
1. ["abc","d"]
2. [[1,2,3,0,0],[4,0,0,0,0]]
3. [[[0,1,0,0,0],[0,0,1,0,0],[0,0,0,1,0],[1,0,0,0,0],[1,0,0,0,0]],[[0,0,0,0,1],[1,0,0,0,0],[1,0,0,0,0],[1,0,0,0,0],[1,0,0,0,0]]]
4. [[0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0],[0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0,1,0,0,0,0]]
The labels describe the way a word is spelled, i.e. the naming convention, e.g. all lowercase - unicase, testBest - camelCase, TestTest - PascalCase, test_test - snake_case.
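For reference, a minimal sketch of this encoding under the 5-character example above (the alphabet dictionary and function name here are hypothetical, not the actual code):

import numpy as np

# Hypothetical alphabet for the example; 0 is reserved for padding.
alphabet = {"a": 1, "b": 2, "c": 3, "d": 4}
max_len = 5
depth = len(alphabet) + 1  # one-hot width: letters plus the padding index

def encode(word):
    # Step 2: integer-encode and zero-pad to max_len.
    ints = [alphabet[c] for c in word[:max_len]]
    ints += [0] * (max_len - len(ints))
    # Step 3: one-hot encode each position.
    one_hot = np.eye(depth, dtype=int)[ints]
    # Step 4: flatten to a single vector of length max_len * depth.
    return one_hot.reshape(-1)

print(encode("abc").tolist())  # matches the first row of step 4 above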
With 2 extra layers added and the learning rate reduced to 0.001:
Pic of training
Update 2
self._model = keras.Sequential()
self._model.add(
    keras.layers.Embedding(VCDNN.alphabetLen, 12, input_length=VCDNN.maxFeatureLen * VCDNN.alphabetLen))
self._model.add(keras.layers.LSTM(12))
self._model.add(keras.layers.Dense(len(VCDNN.conventions), activation=tf.nn.softmax))  # output layer
self._model.compile(tf.train.AdamOptimizer(self._LR), loss="sparse_categorical_crossentropy",
                    metrics=self._metrics)
It seems to start training and then immediately dies with no error message (exit code -1073740791).
A 0.25 accuracy means the model couldn't learn anything useful, since that is the same as a random guess. This suggests the network structure may not be a good fit for the problem.
Currently, recurrent neural networks such as the LSTM are more commonly used for sequence modeling. For instance:
model = Sequential()
model.add(Embedding(char_size, embedding_size))
model.add(LSTM(hidden_size))
model.add(Dense(len(VCDNN.conventions), activation='softmax'))
This should work better if the label is related to the character-sequence information of the input words.
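A minimal end-to-end sketch of this idea, assuming the integer-encoded, zero-padded sequences from step 2 of the question are fed directly to the Embedding layer (the sizes and the random data below are purely illustrative):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

max_len = 20      # maximum word length from the question
char_size = 27    # alphabet size plus the padding index 0
num_classes = 4

# X: integer-encoded, zero-padded character sequences, shape (n_samples, max_len)
# y: integer labels in [0-3], shape (n_samples,)
X = np.random.randint(0, char_size, size=(1000, max_len))
y = np.random.randint(0, num_classes, size=(1000,))

model = Sequential()
model.add(Embedding(char_size, 8, input_length=max_len))
model.add(LSTM(32))
model.add(Dense(num_classes, activation='softmax'))
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, epochs=5, batch_size=32)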
This means your model isn't really learning anything useful. It might be stuck in a local minimum. This could be due to the following reasons:
a) You don't have enough training data to train a neural network. NNs usually require fairly large datasets to converge. Try a RandomForest classifier first to see what results you can get there.
b) It's possible your target data has little to do with your training data, in which case it's impossible to train a model that maps between them efficiently without overfitting.
c) Your model could do with some improvements.
If you want to have a go at improving your model, I would add a few extra dense layers with a few more units. So after line 2 of your model I'd add:
self._model.add(keras.layers.Dense(36, activation=tf.nn.sigmoid))
self._model.add(keras.layers.Dense(36, activation=tf.nn.sigmoid))
Another thing you can try is a different learning rate. I'd go with the default for AdamOptimizer, which is 0.001, so just change 0.01 to 0.001 in the AdamOptimizer() call.
You may also want to train for more than just one epoch.
Related
I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles, with the aim of classifying them by political bias (labels: Left, Centre and Right). I got a model to run following the tutorial, but the loss and accuracy look very off, like this:
I tried playing around with different Dropout probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the maximum number of words and maximum sequence length.
I have managed to get the graphs to align a bit more; however, that has led to the model having lower accuracy on the training data (and the problem of overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the model accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100  # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64

tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
                      lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)

Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)
print(X_train.shape, Y_train.shape)
print(X_test.shape, Y_test.shape)

X_train.view()
When I look at the output of X_train.view(), I am also not sure why all the arrays start with zeros, like this:
I also did a third attempt, which was just the second attempt with the number of epochs increased; it looks like this:
Here is the code of the actual model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 25
batch_size = 64

history = model.fit(X_train, Y_train, epochs=epochs, batch_size=batch_size,
                    validation_split=0.2,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt clearly shows overfitting, with early divergence of your test and train loss. I would try a lower learning rate here (in addition to the steps you took towards regularisation with dropout layers). Using the default rate does not guarantee the best results.
Allow your model to find the global minimum / avoid getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, it could be that your early stopping is too strict, i.e. increase the patience threshold. Consider other optimisers too, including SGD with a learning rate scheduler.
A network that is too large leads to overfitting on the training set and difficulty in generalisation. Too many neurons may cause the network to 'memorize' your training set and overfit. I would try 8, 16 or 24 neurons in your LSTM layer, for example.
Data preprocessing & cleaning. Check your pad_sequences call: by default it pads the start of each text with zeros (padding='pre'), which is why your arrays start with zeros. I would pad after the text instead (see the sketch at the end of this answer).
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get to a sizable amount of training text (empirically >= 1M words). I would also try several techniques for improving data quality, such as feature engineering and spell checking. Are the classes imbalanced? You may need to balance them out by over/undersampling.
Consider using transfer learning and incorporating pre-trained language models as your embedding layer instead of training one from scratch, e.g. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow
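As a rough illustration of a few of these points (post-padding, a smaller LSTM, a lower learning rate), here is a sketch that reuses the variables from your question; the exact values are not tuned:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Pad after the text instead of before it.
X = pad_sequences(tokenizer.texts_to_sequences(df_raw['titletext'].values),
                  maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')

model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH))
model.add(LSTM(16, dropout=0.2))   # smaller recurrent layer
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))

# Learning rate below the Adam default of 0.001.
model.compile(loss='categorical_crossentropy',
              optimizer=Adam(learning_rate=1e-4),
              metrics=['accuracy'])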
Problem definition
Dear community, I need your help implementing an LSTM neural network for a classification problem on panel data using Keras. The panel data I am manipulating consists of ids (let's call it id), a timestep for each id (t), n time-varying covariates and a binary outcome y. Each id contains a number of timesteps, and for each timestep I have my covariates and a unique outcome (0 or 1). I have reason to believe that each covariate for each id can have a certain degree of autocorrelation and hence can be considered a small time series of t steps. For simplicity, I assume that each id has a fixed number of t observations, with t not a big number (about 10 or so).
Data
Below is a toy example of what the data might look like in my case. In this example, the parameters are 2 individuals, 4 timesteps each, 4 covariates and each observation has a unique binary outcome. Covariates may be considered as (short) timeseries since they might be autocorrelated.
print(df)
[out]:
A B C D y
id t
id1 1 1.054127 0.346052 1.091299 -0.058137 0.0
2 0.621390 -0.204682 -1.056786 0.864572 0.0
3 1.275124 2.473959 0.264029 -1.047810 0.0
4 -0.328441 -0.135891 0.148498 0.470876 1.0
id2 1 0.362969 0.777082 0.197423 -2.225296 0.0
2 0.227134 0.086731 0.550267 -0.361482 0.0
3 0.223526 0.556242 -0.160042 0.675871 1.0
4 0.070125 0.156659 -2.922709 -1.143887 1.0
I have reason to assume that, for id1, the target at timestep 4 is conditional on the three previous timesteps for that same individual (id1). In addition, the target variable y may contain more than one value of 1 for each individual (as in the case of id2 above). I do not have reason to believe that the data from one individual would affect the result of another (as in many behaviour analysis scenarios, since every individual is unique).
Prediction problem
What I would like to do is to predict a single outcome for a new individual for whom I have those 4 rows of observation. In other words, based on the historical data of an individual, I would like to know if said individual is likely to have an outcome 1 or 0. If I understand correctly, this can be achieved using an LSTM (alternatively, an RNN) with some data manipulation.
Things I have tried so far
To start simple, I tried aggregating every set of id rows into a single row with a single outcome and applied a typical statistical learning approach such as boosted trees, and got a model no better than random.
I looked into framing it as a survival analysis problem, in vain. I would not be interested in any estimation of a survival function, unlike tutorials on how to handle panel data in the medical field (nor would I have access to such data).
I have tried reshaping my data so that the input is a 3D array of the form [observations, timesteps, features], where observations are unique ids, for an LSTM, like so in Python:
# separate into features and target
df_feat = df.drop("y", axis = 1)
df_target = df[["y"]]
# get reshaped values for 3D tensor
n_samples = len(df_feat.index.get_level_values('id').unique().tolist())
n_timesteps = 4
n_features = df_feat.shape[1]
# reshape input array to be 3D
X_3D = df_feat.to_numpy().reshape(n_samples, n_timesteps, n_features)
print(X_3D.shape)
[out]:
(2, 4, 4)
However, at this point I get confused as to what my learning instances for the LSTM are and what the outcome y should be shaped like. I have tried having one outcome per training instance by taking only the last observation for each id (so y = [1, 1] and y.shape = (2,) in the toy example above), which technically makes an LSTM script run... but does not capture prior information. Below is the code for such an LSTM:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def train_lstm(X_train, y_train, X_valid, y_valid, save_name='best_lstm.h5'):
    # start a sequential model
    model = Sequential()
    # add first LSTM hidden layer with 64 units and default keras params
    model.add(LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))
    # add a second hidden LSTM layer with 128 units and default keras params
    model.add(LSTM(128, return_sequences=True))
    # add one last hidden LSTM layer
    model.add(LSTM(64))
    # add one dense layer with 2 units and a sigmoid activation function
    model.add(Dense(2, activation='sigmoid'))
    # define adam optimiser with learning rate
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    # compile model with binary cross entropy as loss function and accuracy as metric
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    # define early stopping and best-model checkpoint callbacks
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=0, patience=20)
    mc = ModelCheckpoint(save_name, monitor='val_accuracy', mode='max', verbose=0, save_best_only=True)
    # train the model using fit (target vector is one-hot encoded as required by keras)
    history = model.fit(X_train, tf.one_hot(y_train, depth=2),
                        validation_data=(X_valid, tf.one_hot(y_valid, depth=2)),
                        epochs=100, callbacks=[es, mc])
    return history
It runs and it makes predictions the way I want them to (for one id of previous history, we can predict one outcome) but results in poor performance since it fails to capture outcomes prior to the last.
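(For reference, the "only the last observation per id" target described above can be built along these lines, using the multi-indexed df_target from the toy example:)

# Take only the last row of y for each id: y_last = [1., 1.], shape (2,)
y_last = df_target.groupby(level='id').tail(1).to_numpy().ravel()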
I have carefully read and followed this nicely written Medium article by Alexander Laskorunsky, which remotely resembles what I am trying to do, and which slides a window of K-length frames to capture the prior outcomes (and not just the last, as I have done, which makes more sense). However, in Alexander's case he does not consider panel data, but rather a multivariate time-series classification that uses n_timesteps to predict the target using all predictors and all rows, even if the windows overlap (so not panel data).
Questions
Am I right to believe that I need a many to one LSTM architecture?
How may I divide and reshape training and testing samples such that a new, previously unseen individual which would not be related in any way to other ids can be classified?
Should each id be considered as one sample / training instance? Should each id be split into training and testing sets, and all the training and testing sets then concatenated to feed to an LSTM architecture?
Would you be so kind as to provide code snippets on how to correctly split and reshape my data as well as a simple LSTM architecture using keras (or maybe modify my own function above in case I coded it wrong)? No need for basic preprocessing and encoding variables.
Any help or advice / tutorials / articles regarding what architecture is most suitable for that kind of problem is greatly appreciated and thank you in advance for your help!
I am trying to apply deep learning to a multi-class classification problem with high class imbalance between target classes (10K, 500K, 90K, 30K). I want to write a custom loss function.
This is my current model:
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense, Dropout, Flatten

model = Sequential()
model.add(LSTM(
    units=10,                 # number of units returned by LSTM
    return_sequences=True,
    input_shape=(timestamps, nb_features),
    dropout=0.2,
    recurrent_dropout=0.2
))
model.add(TimeDistributed(Dense(1)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=nb_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              metrics=['accuracy'],
              optimizer='adadelta')
Unfortunately, all predictions belong to class 1!!! The model always predicts 1 for any input...
Appreciate any pointers on how I can solve this task.
Update:
Dimensions of input data:
94981 train sequences
29494 test sequences
X_train shape: (94981, 20, 18)
X_test shape: (29494, 20, 18)
y_train shape: (94981, 4)
y_test shape: (29494, 4)
Basically in the train data I have 94981 samples. Each sample contains a sequence of 20 timestamps. There are 18 features.
The imbalance between target classes (10K, 500K, 90K, 30K) is just an example. I have similar proportions in my real dataset.
First of all, you have ~100k samples. Start with something smaller, like 100 samples and multiple epochs, and see whether your model overfits this smaller training dataset (if it can't, you either have an error in your code or the model is not capable of modeling the dependencies [I would go with the second case]). Seriously, start with this one, and remember to represent all of your classes in this small dataset (see the sketch at the end of this answer).
Secondly, the hidden size of your LSTM may be too small: you have 18 features for each sequence and the sequences have length 20, while your hidden size is only 10. And you apply dropout on top of that, regularizing the network even further.
Furthermore, you may want to add some dense output units instead of merely returning a linear layer of size 10 x 1 for each timestep.
Last but not least, you may want to upsample the underrepresented data: class 0 would have to be repeated say 50 times (or maybe 25), class 2 around 4 times and class 3 around 10-15 times, so that the network actually trains on them (also sketched below).
Oh, and use cross-validation for your hyperparameters, like the hidden size, the number of dense units, etc.
Also, I don't know how many epochs you've been training this network for, or what your test dataset looks like (it is entirely possible it consists only of the first class if you haven't done stratification).
I think this will get you started; hit me up with any doubts in the comments.
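Here is a minimal sketch of the first point (overfitting a tiny, class-balanced subset) and of the upsampling idea; the repeat factors are only illustrative and should be derived from your real class counts:

import numpy as np

labels = y_train.argmax(axis=1)  # back from one-hot to class ids

# Sanity check: 25 samples per class (~100 total), many epochs; the model should overfit this.
idx = np.concatenate([np.where(labels == c)[0][:25] for c in range(4)])
model.fit(X_train[idx], y_train[idx], epochs=200, batch_size=16)

# Upsampling: repeat underrepresented classes so the counts become comparable.
repeats = {0: 50, 1: 1, 2: 4, 3: 15}  # illustrative factors for (10K, 500K, 90K, 30K)
up_idx = np.concatenate([np.repeat(np.where(labels == c)[0], r) for c, r in repeats.items()])
np.random.shuffle(up_idx)
model.fit(X_train[up_idx], y_train[up_idx], epochs=10, batch_size=64)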
EDIT: When it comes to metrics, you may want to check something other than mere accuracy, e.g. the F1 score, alongside monitoring your loss and accuracy, to see how the model really performs. There are other choices available; for inspiration you can check sklearn's documentation, as they provide quite a few options.
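For example, a quick per-class report with sklearn (a sketch; model, X_test and y_test come from your setup):

from sklearn.metrics import classification_report

y_pred = model.predict(X_test).argmax(axis=1)
y_true = y_test.argmax(axis=1)
print(classification_report(y_true, y_pred))  # per-class precision / recall / F1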
I have a network with 32 input nodes, 20 hidden nodes and 65 output nodes. My network input is actually a hash code of length 32, and the output is the word.
The input is the ASCII value of each character of the hash code. The output of the network is a binary representation I have made: say, for example, a is equal to 00000, b is equal to 00001, and so on. It only includes the alphabet and the space, which is why it's only 5 bits per character. I have a maximum limit of 13 characters in my training input, so my output has 13 * 5 = 65 nodes. I'm expecting a binary output like 10101010101010101010101010101010101010101010101010101010101001011. The bit sequence can encode a word of at most 13 characters, given a hash code of length 32 as input. Below is my current code:
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization
from keras.optimizers import SGD
from keras.callbacks import EarlyStopping

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train_samples = scaler.fit_transform(train_samples.reshape(-1, 32))
train_labels = train_labels.reshape(-1, 65)

model = Sequential([
    Dense(32, input_shape=(32,), activation='sigmoid'),
    BatchNormalization(),
    Dense(25, activation='tanh'),
    BatchNormalization(),
    Dense(65, input_shape=(65,), activation='sigmoid')
])

overfitCallback = EarlyStopping(monitor='loss', min_delta=0, patience=1000)

model.summary()
model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_samples, train_labels, batch_size=1000, epochs=1000000,
          callbacks=[overfitCallback], shuffle=True, verbose=2)
I plan to overfit the model so that it can memorize all the hash codes of the words in the dictionary. Initially, my training set is only about 5,000 samples; I just wanted to see if it would learn from a small dataset. How can I make the network converge faster? It has been running for more than an hour, and the loss is still around 0.5004 with an accuracy of 0.7301. It goes up and down, but when I check every 10 minutes or so I can see only a little improvement. How can I fine-tune it?
UPDATE :
The training has already stopped, but it didn't converge. Its loss is 0.4614 and its accuracy is 0.7422.
There are some hyperparameters I would suggest changing first.
Try 'relu' or LeakyReLU() as the activation function for the non-output layers. ReLU is basically the standard activation function for baseline models.
The standard optimizer for most cases is currently Adam; try using it, and tweak its learning rate if needed. You could get better results with SGD, but it often takes a lot of epochs and a lot of hyperparameter tuning. Adam is basically the quickest (in general) optimizer to reach a 'low' loss.
To prevent overfitting you might also want to add Dropout(0.5), where 0.5 is just an example rate.
Once you have reached the lowest loss, you might start changing these hyperparameters even more, to try to get an even lower loss.
Apart from this, the first thing I'd actually suggest is adding multiple hidden layers with different sizes (see the sketch below). This might have a much larger impact than trying to optimize all the hyperparameters.
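As a sketch of what that could look like (ReLU activations, Adam, extra hidden layers; the layer sizes are illustrative and not tuned):

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

model = Sequential([
    Dense(128, input_shape=(32,), activation='relu'),
    BatchNormalization(),
    Dense(256, activation='relu'),
    BatchNormalization(),
    Dense(128, activation='relu'),
    Dropout(0.5),                     # optional; drop it if you deliberately want to overfit
    Dense(65, activation='sigmoid')   # 65 independent output bits
])

model.compile(Adam(lr=0.001), loss='binary_crossentropy', metrics=['accuracy'])
overfitCallback = EarlyStopping(monitor='loss', min_delta=0, patience=1000)
model.fit(train_samples, train_labels, batch_size=1000, epochs=100000,
          callbacks=[overfitCallback], verbose=2)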
Edit: Maybe you could post a screenshot of your training loss vs. epochs for the training and validation data? That might make things clearer for others.
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days, I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all inputs, and that this was the optimal solution. (This is illustrated in the scatter plot below.)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
keras.layers.Dense(4,
activation='relu',
input_dim=num_features,
kernel_initializer='random_uniform',
bias_initializer='random_uniform'
),
keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, and that maybe it's linearly separable. Since that case would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset; finalpos is the value I'm trying to predict. The dataset contains ~40,000 records, split 80/20 training/testing.
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs. validation curves for the one-hidden-layer network with an Adam optimiser, learning rate 0.01.
The same graph, but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing per feature (as is the de facto industry standard), but across all data.
That means that if you have two features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the feature with the smaller range gets squashed and contributes almost nothing to the model.
Instead, you might want to normalize with something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), it also scales to unit variance (which makes some assumptions about your data, but can potentially help, too).
To transform your data, use something along these lines:
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities for normalizing/standardizing your data:
Either scale your training and test data together and then split afterwards,
or fit the scaler on the training data only, and then use that same scaler to transform your test data.
Never fit_transform your test set separately from the training data!
Since the two sets potentially have different mean/min/max values, you can end up with totally wrong predictions. In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset not exactly following the same properties (due to small sample size, etc.).
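A short sketch of the second option (fit the scaler on the training split only, then reuse it for the test split; X and y stand for your feature matrix and target):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # statistics computed on training data only
X_test = scaler.transform(X_test)        # same mean/variance reused for the test set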
Additionally, you might want to use a more advanced optimizer, like Adam, or specify a momentum value (0.9 is a good choice in practice, as a rule of thumb) for your SGD, for example as below.
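A sketch using the TF 1.x-style API from your question (either optimizer would replace the GradientDescentOptimizer in your compile call):

import tensorflow as tf

optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)  # SGD with momentum
# or: optimizer = tf.train.AdamOptimizer(0.001)
self.model.compile(loss='mae', optimizer=optimizer, metrics=['mae'])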
It turns out the error was a really stupid and easy-to-miss bug.
When I imported my dataset, I shuffled it; however, I accidentally applied the shuffling only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, so of course the model didn't know what to do with this.
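For anyone hitting the same issue, a sketch of shuffling features and labels with one shared permutation (hypothetical array names, not my original import code):

import numpy as np

perm = np.random.permutation(len(features))  # one permutation applied to both arrays
features, labels = features[perm], labels[perm]

# or, equivalently, let scikit-learn keep them aligned:
# from sklearn.utils import shuffle
# features, labels = shuffle(features, labels)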
Thanks to @dennlinger for suggesting that I look in the place where I eventually found this bug.