I have a binary classification problem. I use the following keras model to do my classification.
input1 = Input(shape=(25,6))
x1 = LSTM(200)(input1)
input2 = Input(shape=(24,6))
x2 = LSTM(200)(input2)
input3 = Input(shape=(21,6))
x3 = LSTM(200)(input3)
input4 = Input(shape=(20,6))
x4 = LSTM(200)(input4)
x = concatenate([x1,x2,x3,x4])
x = Dropout(0.2)(x)
x = Dense(200)(x)
x = Dropout(0.2)(x)
output = Dense(1, activation='sigmoid')(x)
However, the results I get is extremely bad. I thought the reason is that I have too many features, thus, needs have more improved layers after the concatenate.
I was also thinking if it would be helpful to used a flatten() layer after the concatenate.
anyway, since I am new to deep learning, I am not so sure how to make this a better model.
I am happy to provide more details if needed.
Here is what I can suggest
Remove every things that prevent overfitting, such as Dropout and regularizer. What can happen is that your model may not be able to capture the complexity of your data using given layer, so you need to make sure that your model is able to overfit first before adding regularizer.
Now try increase number of Dense layer and number of neuron in each layer until you can see some improvement. There is also a possibility that your data is too noisy or you have only few data to train the model so you can't even produce a useful predictions.
Now if you are LUCKY and you can see overfitting, you can add Dropout and regularizer.
Because every neural network is a gradient base algorithm, you may end up at local minimum. You may also need to run the algorithm multiple times with different initial weight before you can get a good result or You can change your loss function so that you have a convex problem where local minimum is global minimum.
If you can't achieve better result
You may need to try different topology because LSTM is just trying to model a system that assume to have Markov property. you can look at nested-LSTM or something like that, which model the system in the way that next time step is not just depend on current time step.
The Dropout right before the output layer could be problematic. I would suggest removing both Dropout layers and evaluating performance, then re-introduce regularization once the model is performing well on the the training set.
Related
I've written an LSTM model that predicts the sequential data.
def get_model(config, num_features, output_size):
opt = Adam(learning_rate=get_deep(config, 'hp.learning_rate'), beta_1=get_deep(config, 'hp.beta_1'))
inputs = Input(shape=[None, num_features], dtype=tf.float32, ragged=True)
layers = LSTM(get_deep(config, 'hp.lstm_neurons'), activation=get_deep(config, 'hp.lstm_activation'))(
inputs.to_tensor(), mask=tf.sequence_mask(inputs.row_lengths()))
layers = BatchNormalization()(layers)
if 'dropout_rate' in config['hp']:
layers = Dropout(get_deep(config, 'hp.dropout_rate'))(layers)
for layer in get_deep(config, 'hp.dense_layers'):
layers = Dense(layer['neurons'], activation=layer['activation'])(layers)
layers = BatchNormalization()(layers)
if 'dropout_rate' in layer:
layers = Dropout(layer['dropout_rate'])(layers)
layers = Dense(output_size, activation='sigmoid')(layers)
model = Model(inputs, layers)
model.compile(loss='mse', optimizer=opt, metrics=['mse'])
model.summary()
return model
I've tuned some of the layer's params using AWS SageMaker. While validating the model I've run a model with a specific configuration several times. Most of the time the results are similar, however, one run was much better than others, which led me to think that the initial state of the model is probably crucial in order to get the best performance.
As suggested in this video, weight initialization can provide some performance boost.
I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Update:
As suggested in some of the comments / answers I'm using a fixed seed to "lock" the model results:
# Set `python` built-in pseudo-random generator at a fixed value
random.seed(seed_value)
# Set `numpy` pseudo-random generator at a fixed value
np.random.seed(seed_value)
# Set `tensorflow` pseudo-random generator at a fixed value
tf.random.set_seed(seed_value)
The results replicate for each new train, however, different seeds can produce much better results than others. So how do I find/tune the best seed?
... which led me to think that the initial state of the model is probably crucial in order to get the best performance.
..... As suggested in this video, weight initialization can provide some performance boost. I've googled around and found layer weight initializers, but I'm not sure what ranges should I tune.
Firstly, in that video, apart from the state or weights initializer, all the other factors such as learning rate, schedule, optimizer, batch size, loss function, model depth, etc are something you should play with them to find the best set (we will talk about the role of seed later). Normally, we don't need to tune the default weight or state initializer as those are currently the best; and as usual, this state initialization is a research problem.
Secondly, in keras, the default weight initializer for Convolution, Dense and RNN-GRU/LSTM is glorot_uniform, also known as Xavier uniform initializer. And the default bias initializer is zeros. If you follow the source code of LSTM (in your case), you would find them. About it, according to the doc
Draws samples from a uniform distribution within [-limit, limit], where limit = sqrt(6 / (fan_in + fan_out)) (fan_in is the number of input units in the weight tensor and fan_out is the number of output units).
Now, you may already notice that this initializer is inherited from the VarianceScaling; and same as GlorotUniform, others like GlorotNormal, LecunNormal, LecunUniform, HeNormal, HeUniform are also inheriting it. Regarding the VarianceScaling, here is listed the supported parameter. For example, technically, the following two are the same.
# in case if you want to try various initializer -
# use VarianceScaling by passing proper parameter.
# ie. tf.keras.layers.LSTM(..., kernel_initializer=initializer)
# bur recommended to stick with glorot_uniform (default)
initializer = tf.keras.initializers.VarianceScaling(scale=1.,
mode='fan_avg', seed=101,
distribution='uniform')
print(initializer(shape=(2, 2)))
initializer = tf.keras.initializers.GlorotUniform(seed=101)
print(initializer(shape=(2, 2)))
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
tf.Tensor(
[[-1.0027379 1.0746485]
[-1.2234 -1.1489409]], shape=(2, 2), dtype=float32)
In short, you can play with tf.keras.initializers.VarianceScaling (at the bottom of the page). Additionally, you can make your own initializer by defining a callable function or by subclassing the Initializer class. For example:
def conv_kernel_initializer(shape, dtype=None):
kernel_height, kernel_width, _, out_filters = shape
fan_out = int(kernel_height * kernel_width * out_filters)
return tf.random.normal(
shape, mean=0.0, stddev=np.sqrt(2.0 / fan_out), dtype=dtype)
def dense_kernel_initializer(shape, dtype=None):
init_range = 1.0 / np.sqrt(shape[1])
return tf.random.uniform(shape, -init_range, init_range, dtype=dtype)
Here is one good article about initializing the weights, you may enjoy reading. Butt again, better to go with default ones.
Thirdly, for setting different seed values and different sets of hyper-parameter, etc, I better leave one of my old answers here, mostly the first diagram probably come in handy to your experiments. One of an approach that I follow is to keep my seed same (let's say for first 5 experiments) and change another factor and log the results. And after 5 iterations, hopefully, we would get some best set and approach further.
Update
Find/Tune Seed. Before searching the method to find the best seed, one must understand that seed is not a hyper-parameter that needs to be tuned with other hyperparameters such as learning rate, scheduler, optimizer, etc.
Here are one scenario, let's say you split the data randomly into two parts with seed 42: train set (70%) and test set (30%) and after training on the train set, you evaluate on the test set of your model and received score 80. Then you change your seed to 101, and again do the same but now you got score 50. Now, this doens't mean picking seed 42 is better; but it simply means your model is unstable and most like won't do well on the unseen data. This is actually a well-known issue if someone randomly split their data set for training and testing. Why it happens? Because, when you split the data randomly, it's possible that there would be a mismatch in class-distribution. Please, check the following two very related discussion on this:
Is random seed a hyper-parameter to tune in training deep neural network?
How to choose the random seed?
I dont think there is a "one shoe fits all" solution to this issue. The initial weights heavily depend on the kind of problem at hand and the data that we are using to solve that problem. All we can do is point you towards a good resource from where you can try to see which of the approaches mentioned fits your problem.
The following article is a good resource that not only provides you with a detailed understanding of how and why to initialize weights but also points towards peer reviewed research that can help build an academic understanding.
maybe you search for exponential decay learning rate.
let me explain
for example you first epoch has sometimes a loss of 3000, 4000 , sometimes just 500.
if you run a model often, you probably recognize a "real barrier", where you dont say "thats because of the initial state" anymore.
you want to go fast there, but dont keep the bad side effects of high learning rate (e.g. 1E-3) , you more want 1E-5.
there the exponential decay come in place.
call an instance of myLr=tf.train.exponential_decay(...) and pass it instead of the numerical learning rate parameter to your optimizer
for example Adam(myLr)
Indeed, the initial state of the model is crucial in order to get the best performance. Deep Learning works by optimizing a non-convex loss function in order to find the best local minima.
The initial weights will define the starting location of the optimization. As defined in the picture below. The starting point is defined by the initial weights and training the model will make it reach the local minima. As you see there is a starting weight configuration that allows reaching the global minima.
It is sometimes possible to have better weights initialization with Transfer Learning which is reusing the weights of a trained model on a downstream task. (For example VGG-16 in image classification, or NLP with BERT).
In your case, You should not try to finetune the weight initialization as this is meant to be random. Changing the architecture of your neural network, or its hyperparameter will certainly lead to better performance improvement.
Short answer: you can neither efficiently nor effectively tune the seed for a pseudo-random number generator. It is not only infeasible due to the extremely large search space, but also impractical for many other reasons, including the fact that pseudo-random number generator implementations change from time to time so you would need to start over every time that happened.
If, for some reason, you are hell-bent on discovering this for yourself, I recommend using NumPy's default_rng object to be the single source of all pseudo-randomness in your algorithm. Then, based on a single seed, you can produce other seeds deterministically for use with, say, tf.random.set_seed.
I have a convolutional neural network with some layers in keras. The last layer in this network is a custom layer that is responsible for sorting some numbers those this layer gets from previous layer, then, the output of custom layer is sent for calculate loss function.
for this purpose (sorting) I use some operator in this layer such as K.argmax and K.gather.
In the back-propagation phase I get error from keras that says:
An operation has None for gradient. Please make sure that all of your ops have a gradient defined (i.e. are differentiable). Common ops without gradient: K.argmax, K.round, K.eval
that is reasonable cause the involvement of this layer in the derivation process.
Given that my custom layer do not need to corporate in differential chain rule, how can I control differential chain in keras? can I disable this process in custom layer?
Reorder layer that I used in my code is simply following:
def Reorder(args):
z = args[0]
l = args[1]
index = K.tf.argmax(l, axis=1)
return K.tf.gather(z, index)
Reorder_Layer = Lambda(Reorder, name='out_x')
pred_x = Reorder_Layer([z, op])
A few things:
It's impossible to train without a derivative, so, there is no solution if you want to train this model
It's not necessary to "compile" if you are only going to predict, so you don't need custom derivation rules
If the problem is really in that layer, I suppose that l is computed by the model using trainable layers before it.
If you really want to try this, which doesn't seem a good idea, you can try a l = keras.backend.stop_gradient(args[1]). But this means that absolutely nothing will be trained from l until the beginning of the model. If this doesn't work, then you have to make all layers that produce l have trainable=False before compiling the model.
I'm training a neural net using Keras in Python for time-series climate data (predicting value X at time t=T), and tried adding a (20%) dropout layer on the inputs, which seemed to limit overfitting and cause a slight increase in performance. However, after I added a new and particularly useful feature (the value of the response variable at time of prediction t=0), I found massively increased performance by removing the dropout layer. This makes sense to me, since I can imagine how the neural net would "learn" the importance of that one feature and base the rest of its training around adjusting that value (i.e, "how do these other features affect how the response at t=0 changes by time t=T").
In addition, there are a few other features that I think should be present for all epochs. That said, I am still hopeful that a dropout layer could improve the model performance-- it just needs to not drop out certain features, like X at t_0: I need a dropout layer that will only drop out certain features.
I have searched for examples of doing this, and read the Keras documentation here, but can't seem to find a way to do it. I may be missing something obvious, as I'm still not familiar with how to manually edit layers. Any help would be appreciated. Thanks!
Edit: sorry for any lack of clarity. Here is the code where I define the model (p is the number of features):
def create_model(p):
model = Sequential()
model.add(Dropout(0.2, input_shape=(p,))) # % of features dropped
model.add(Dense(1000, input_dim=p, kernel_initializer='normal'
, activation='sigmoid'))
model.add(Dense(30, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal',activation='linear'))
model.compile(loss=cost_fn, optimizer='adam')
return model
The best way I can think of applying dropout only to specific features is to simply separate the features in different layers.
For that, I suggest you simply divide your inputs in essential features and droppable features:
from keras.layers import *
from keras.models import Model
def create_model(essentialP,droppableP):
essentialInput = Input((essentialP,))
droppableInput = Input((droppableP,))
dropped = Dropout(0.2)(droppableInput) # % of features dropped
completeInput = Concatenate()([essentialInput, dropped])
output = Dense(1000, kernel_initializer='normal', activation='sigmoid')(completeInput)
output = Dense(30, kernel_initializer='normal', activation='relu')(output)
output = Dense(1, kernel_initializer='normal',activation='linear')(output)
model = Model([essentialInput,droppableInput],output)
model.compile(loss=cost_fn, optimizer='adam')
return model
Train the model using two inputs. You have to manage your inputs before training:
model.fit([essential_train_data,droppable_train_data], predictions, ...)
I don't see any harm to using dropout in the input layer. The usage/effect would be a little different than normal of course. The effect would be similar to adding synthetic noise to an input signal; only the feature/pixel/whatever would be entirely unknown[zeroed out] instead of noisy. And inserting synthetic noise into the input is one of the oldest ways to improve robustness; certainly not bad practice as long as you think about whether it makes sense for your data set.
This question has already an accepted answer but it seems to me you are using dropout in a bad way.
Dropout is only for the hidden layers, not for the input layer !
Dropout act as a regularizer, and prevent the hidden layer complex coadaptation, quoting Hinton paper "Our work extends this idea by showing that dropout can be effectively applied in the hidden layers as well and that it can be interpreted as a form of model averaging" (http://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf)
Dropout can be seen as training several different models with your data and averaging the prediction at test time. If you prevent your models to have all the inputs during training, it will perform badly, especially if one input is crucial. What you want is actually avoid overfitting, meaning you prevent too complex models during the training phase (so each of your models will select the most important features first) before testing.
It is common practice to drop some of the features in ensemble learning but it is control and not stochastic like for dropout. It also works for neural networks as hidden layers have (often) way more neurons as inputs, and so dropout follows the law of big numbers, as for a small number of inputs, you can have in some bad case almost all your inputs dropped.
In conlusion: it is a bad practice to use dropout in the input layer of a neural network.
I'm trying to build a simple multilayer perceptron model on a large data set but I'm getting the loss value as nan. The weird thing is: after the first training step, the loss value is not nan and is about 46 (which is oddly low. when i run a logistic regression model, the first loss value is about ~3600). But then, right after that the loss value is constantly nan. I used tf.print to try and debug it as well.
The goal of the model is to predict ~4500 different classes - so it's a classification problem. When using tf.print, I see that after the first training step (or feed forward through MLP), the predictions coming out from the last fully connected layer seem right (all varying numbers between 1 and 4500). But then, after that the outputs from the last fully connected layer go to either all 0's or some other constant number (0 0 0 0 0).
For some information about my model:
3 layer model. all fully connected layers.
batch size of 1000
learning rate of .001 (i also tried .1 and .01 but nothing changed)
using CrossEntropyLoss (i did add an epsilon value to prevent log0)
using AdamOptimizer
learning rate decay is .95
The exact code for the model is below: (I'm using the TF-Slim library)
input_layer = slim.fully_connected(model_input, 5000, activation_fn=tf.nn.relu)
hidden_layer = slim.fully_connected(input_layer, 5000, activation_fn=tf.nn.relu)
output = slim.fully_connected(hidden_layer, vocab_size, activation_fn=tf.nn.relu)
output = tf.Print(output, [tf.argmax(output, 1)], 'out = ', summarize = 20, first_n = 10)
return {"predictions": output}
Any help would be greatly appreciated! Thank you so much!
Two (possibly more) reasons why it doesn't work:
You skipped or inappropriately applied feature scaling of your
inputs and outputs. Consequently, data may be difficult to handle
for Tensorflow.
Using ReLu, which is a discontinuous function, may raise issues. Try using other activation functions, such as tanh or sigmoid.
For some reasons, your training process has diverged, and you may have infinite values in your weights, wich gives NaN losses. The reasons can be many, try changing your training parameters (use smaller batchs for test).
Also, using a relu for the last output in a classifier is not the usual method, try using a sigmoid.
From my understanding Relu doesn't put a cap on the upper bound for Neural Networks so its more likely to deconverge depending upon its implementation.
Try switching all the activation functions to tanh or sigmoid. Relu is generally used for convolution in cnns.
Its also difficult to determine if your deconverging due to cross entropy as we don't know how you effected it with your epsilon value. Try just using the residual its much simpler but still effective.
Also a 5000-5000-4500 neural network is huge. Its unlikely you actually need a network that large.
Currently I'm using VGG16 + Keras + Theano thought the Transfer Learning methodology to recognize plants classes. It works just fine and gives me a good accuracy. But the next problem I'm trying to solve - is to find a way of identifying if an input image contains plant at all. I don't want to have another one classifier that will do it, because it's not really efficiently.
So I did some search and have found that we can get activations from the latest model layer (before activation layer) and analyze it.
from keras import backend as K
model = util.load_model() # VGG16 model
model.load_weights(path_to_weights)
def get_activations(m, layer, X_batch):
x = [m.layers[0].input, K.learning_phase()]
y = [m.get_layer(layer).output]
get_activations = K.function(x, y)
activations = get_activations([X_batch, 0])
# trying to get some features from activations
# to understand how can we identify if an image is relevant
for l in activations[0]:
not_nulls = [x for x in l if x > 0]
# shows percentage of activated neurons
c1 = float(len(not_nulls)) / len(l)
n_activated = len(not_nulls)
print 'c1:{}, n_activated:{}'.format(c1, n_activated)
return activations
get_activations(model, 'the_latest_layer_name', inputs)
From the above code I've noticed that when we have very irrelevant image, the number of activated neurons is bigger than for images that contain plants:
For images that was using for model training, number of activated neurons 19%-23%
For images that contain unknown plants species 20%-26%
For irrelevant images 24%-28%
It's not really a good feature to understand if an image relevant as percentage values are intersect.
So, is there a good way to resolve this issue?
Thanks to Feras's idea in the comment above. After some trials, I've come up with the ultimate solution that allows solving this problem with accuracy up to 99.99%.
Steps are:
Train your model on a dataset;
Store activations (see method above how to get them) by predicting relevant and non-relevant images using trained model from the previous step. You should get activations from the penultimate layer. For VGG16 it's the last of two Dense(4096), for InceptionV3 - an extra penultimate Dense(1024) layer, for resnet50 - an extra penultimate Dense(2048) layer.
Solve a binary problem using stored activations data. I've tried a simple flat NN and Logistic Regression. Both were good in accuracy (flat NN was a bit more accurate), but I've chosen the Logistic Regression as it's simpler, faster and consumes less memory and CPU/GPU.
This process should be repeated each time after your model retrained as each time the final weights for CNN are different and what was working previously, will be different next time.
So as result we have another small model for solving the problem.