Determine number of Nodes and Layers based on shape of the data - python

Is there a way to determine number of nodes and hidden layers based on shape of the data?
Also, is there a way to determine the best activation function based on the topic?
For example, I'm building a model for fake news prediction. My features are the number of words in the text, the number of words in the title, the number of questions, the number of capital letters, etc.
My dataset has 22 features and around 35000 rows. My output should be 0 or 1.
Based on that, how many layers and nodes should I use and what activation functions are the best for this?
This is my net:
model = Sequential()
model.add(Dense(100, input_dim = features.shape[1], activation = 'relu')) # input layer requires input_dim param
model.add(Dense(100, activation = 'relu'))
model.add(Dense(100, activation = 'relu'))
model.add(Dropout(0.1))
model.add(Dense(1, activation='sigmoid')) # sigmoid instead of relu for final probability between 0 and 1
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss="mean_squared_error", optimizer=sgd, metrics=['accuracy'])
# fit the network to the training data
model.fit(x_train, y_train, epochs = 10, shuffle = True, batch_size=32, validation_data=(x_test, y_test), verbose=1)
scores = model.evaluate(features, results)
print(model.metrics_names[1], scores[1]*100)

Selecting these requires prior experience; otherwise we wouldn't need so many ML engineers trying different architectures and writing papers.
As a starting point, I would recommend taking a look at autokeras. It should help here because text classification is a well-known problem: you only need to structure your data as inputs (X and Y) and feed it to the TextClassifier, which tries different models (you can specify which) and picks the best fit for your case.
You can find more examples in the docs here:
https://autokeras.com/tutorial/text_classification/
import autokeras as ak
# Initialize the text classifier.
clf = ak.TextClassifier(max_trials=10) # It tries 10 different models
# Feed the text classifier with training data.
clf.fit(x_train, y_train)
# Predict with the best model.
predicted_y = clf.predict(x_test)
# Evaluate the best model with testing data.
print(clf.evaluate(x_test, y_test))

The answer is no and no.
These are also hyperparameters. You can select a handful of candidate values and try them all to get a rough idea of which gives the best result. The same holds for the activation function.
You can use more layers than you need and then apply regularization to keep the model from overfitting. If the network is too small, the underfitting is easy to spot: the loss curve shows a high training error.
There is no formula for determining any of this. You have to try different things for the problem at hand, and you will see that some work better than others.
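As a rough illustration of trying a handful of candidates, the settings can be enumerated as a grid. This is only a sketch: the candidate values and the iter_configs helper are hypothetical, and the training step is deliberately left out.

```python
from itertools import product

# Hypothetical candidate values; pick ranges that suit your own data.
search_space = {
    "hidden_layers": [1, 2, 3],
    "units": [16, 32, 64, 100],
    "activation": ["relu", "tanh"],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 3 * 4 * 2 = 24 combinations
print(configs[0])
```

Each configuration would then be trained and scored on held-out validation data, keeping the one with the best score.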
For the output, a softmax layer would be good, as it gives you a probability for each prediction, which you can easily convert to a one-hot encoding.
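For a 0/1 target, a two-unit softmax and a single sigmoid unit are interchangeable: the softmax probability of class 1 equals the sigmoid of the difference between the two logits. The sketch below checks this numerically with made-up logit values (an illustration, not part of the original answer). It suggests that the single sigmoid output in the question's network is also a reasonable choice.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


z0, z1 = 0.3, 1.7  # arbitrary example logits for class 0 and class 1
p_softmax = softmax([z0, z1])[1]
p_sigmoid = sigmoid(z1 - z0)

# The two probabilities agree up to floating-point rounding.
print(abs(p_softmax - p_sigmoid) < 1e-12)
```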

Related

Overfitting on LSTM text classification using Keras

I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles with the aim of classifying them based on a political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy would look very off, like this:
I tried different Dropout probabilities (e.g. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the max number of words and max sequence length.
I have managed to get the graphs to align a bit more, however, that has led to the model having less accuracy with the training data (and the problem of overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the model accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
X_train.view()
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros like this:
I also made a third attempt, which was just the second attempt with the number of epochs increased. It looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs,
batch_size=batch_size,validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt shows heavy overfitting, with early divergence of your test and train loss. I would try a lower learning rate here (in addition to the regularisation steps you took with dropout layers). Using the default rate does not guarantee the best results.
Allow your model to find the global minimum rather than getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, it could be that your early stopping is too strict, i.e. increase the threshold. Consider other optimisers, including SGD with a learning rate scheduler.
Too large a network leads to overfitting on the training set and difficulty generalising. Too many neurons may cause the network to 'memorize' your whole training set and overfit. I would try 8, 16 or 24 neurons in your LSTM layer, for example.
Data preprocessing and cleaning. Check your pad_sequences call. It is probably padding the start of each text with zeros. I would pad after the text instead.
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to reach a sizable amount of training text (empirically >= 1M words). I would also try several techniques, including feature engineering and improving data quality, such as spell checks. Are the classes imbalanced? You may need to balance them by over- or undersampling.
Consider using transfer learning and incorporating trained language models as your embedding layer instead of training one from scratch, e.g. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow
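On the padding point: Keras pad_sequences pads (and truncates) at the start of each sequence by default, which is why the arrays in the question begin with zeros. The pure-Python sketch below mimics that behaviour for illustration; pad is a hypothetical helper, not the Keras function. In Keras itself the equivalent switch is the padding='post' argument.

```python
def pad(seq, maxlen, padding="pre", truncating="pre", value=0):
    """Pad or truncate one token-id list to exactly maxlen items."""
    if len(seq) > maxlen:
        # 'pre' truncation drops items from the front, 'post' from the back.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    fill = [value] * (maxlen - len(seq))
    return fill + list(seq) if padding == "pre" else list(seq) + fill


tokens = [12, 7, 431]  # made-up token ids
print(pad(tokens, 6))                  # [0, 0, 0, 12, 7, 431]
print(pad(tokens, 6, padding="post"))  # [12, 7, 431, 0, 0, 0]
```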

Which type of ANN should I use?

I am working on a project in which I have to predict methane production.
Input: pH, temperature, solution concentration
Output: methane production
I have used Keras with TensorFlow.
My questions are:
(As of now I have 60 experimental data points.) The accuracy is always 0.2-0.3. Why? Should I increase the amount of data?
I used the following code:
classifier.add(Dense(6, activation='relu', kernel_initializer='uniform',input_dim=9))
classifier.add(Dense(6, kernel_initializer = 'uniform', activation = 'relu'))
classifier.add(Dense(1, kernel_initializer= 'uniform', activation = 'sigmoid'))
classifier.compile(optimizer = 'adam', loss='mean_squared_error',metrics=['mean_squared_error'])
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)
3. It is possible to predict other than binary outputs, right? If not, which one will be suitable for predicting non-binary values?
If you only have 60 data points, yes, definitely try to get more data. In general it is good to have hundreds (if not thousands) of data points to train a neural network effectively. Your network looks fine, assuming the relationship between those inputs and the output is fairly linear. If that is not the case, you could try making your hidden layer wider (more neurons).
It is definitely possible to predict other than binary outputs, in fact it looks like your network should be doing so. It really just depends on the activation function you put on your output layer. For example, softmax is good for classifying data when there are several possible labels. For binary classification, a sigmoid activation function is good. If you're just trying to predict an output quantity, you can probably just not have an activation function on your output.
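To illustrate the last point: a sigmoid output is confined to the interval (0, 1), so a network ending in a sigmoid cannot reproduce a target quantity outside that interval, whatever its weights are. An output with no activation (linear) can. The target and pre-activation values below are made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


target = 42.0  # made-up regression target outside (0, 1)
pre_activations = [-10.0, -1.0, 0.0, 1.0, 10.0]

sigmoid_outputs = [sigmoid(z) for z in pre_activations]
# Every sigmoid output lies strictly between 0 and 1 ...
print(all(0.0 < s < 1.0 for s in sigmoid_outputs))
# ... so the squared error against the target can never drop below this:
error_floor = (target - 1.0) ** 2
print(error_floor)  # 1681.0

# With no output activation the pre-activation itself is the prediction,
# so a value equal to the target is reachable.
linear_prediction = 42.0
print((target - linear_prediction) ** 2)  # 0.0
```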
Yes, you have to provide more data for the network to learn the pattern in the data points. If the relationship is linear, linear regression may serve you better.

Keras neural network predicting the same output

I need to develop a neural network with Keras to predict a disease using genetic data. It is known, that predicting this disease is possible even with logistic regression (however the predictions, in this case, are of very poor quality). It's worth mentioning that my data is imbalanced, so I introduced class weights later.
I decided to start with the simplest way to predict it - with a network, analogous to a logistic regression - one hidden layer with one neuron and achieved a bad, yet at least some result - 0.12-0.14 F1 score. Then I tried with 2 hidden and 1 output layers with different amount of neurons in the first hidden layer - from 1 to 8.
It turns out that in some cases it learns something, and in some is predicting the same output for every sample. I displayed the accuracy and loss function over the epochs and this is what I get:
Network loss function by epoch. It's clear that the loss function has roughly the same value, for the training data.
Network accuracy by epoch. It's clear that the accuracy is not improving, but fluctuates from 0 to 1
I searched for similar questions and the suggestions were the following:
Make more neurons - I just have to make it work with 1, 2 or more neurons in the first layer, so I can't add neurons to this one. I increased the number of neurons in the second hidden layer up to 20, but it then stopped predicting anything with any number of neurons in the first layer configuration.
Make more layers - I tried adding one more layer, but still have the same problem
To introduce dropout and increase it - what dropout are we talking about if it can learn with just one layer and one neuron in it
Reduce learning rate - decreased it from the default 10^(-3) to 10^(-4)
Reduce batch size - varied it from 500 samples in a minibatch to 1 (stochastic gradient descent)
More epochs - isn't 20 to 50 epochs on a 500'000 sample dataset enough?
Here's the model:
def run_nn_class_weights(data, labels, model):
    n_iter = 20
    predicted = None
    true = None
    print('Splitting the data')
    x_train, x_valid, y_train, y_valid = train_test_split(data, labels, test_size = 0.05)
    #model = create_model()
    early_stopping_monitor=EarlyStopping(patience=240)
    class_weights = class_weight.compute_class_weight('balanced',
                                                      classes=np.unique(labels),
                                                      y=labels)
    class_weights = dict(enumerate(class_weights))
    hist = model.fit(x_train, y_train, validation_data=(x_valid, y_valid), class_weight=class_weights,
                     epochs=n_iter, batch_size=500, shuffle=True, callbacks=[early_stopping_monitor],verbose=1)
    proba = model.predict(data)
    predicted = proba.flatten()
    true = labels
    return(model, proba, hist)
def old_model_n_pred(n_neurons_1st = 1):
    model = Sequential()
    model.add(Dense(n_neurons_1st, activation='relu', input_shape=(7516,), kernel_initializer='glorot_normal'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    #model.add(Flatten())
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
This is a small network that should be able to converge to something that's not an attractor (getting stuck on a single value).
I suggest taking a look at the weights of all the neurons with ReLU activation.
ReLUs are great because they are quick to compute, but half of the ReLU has a derivative of zero, which doesn't help with gradient descent. This might be your case.
I guess in your case the enemy would be the first neuron.
To overcome this problem, I would try normalizing the inputs (so that all samples are centered around 0.5 and scaled by the standard deviation). If you do this to a ReLU, you'll make it ignore anything in the range [-inf, sd].
If that does not fix part of the problem, switch to a different activation function in the first layer. A sigmoid will work very well, and it's not too expensive for just one neuron.
Also, take a close look at your input distribution. What your network actually does is a sigmoid-like classification, followed by 4 to 8 neurons that "zoom in" on and correct the important parts of the function that the first transformation didn't account for.
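A minimal sketch of the "dead ReLU" situation described above, using a single neuron with made-up weights and inputs. When the pre-activation is negative for every sample, the ReLU derivative is zero everywhere, so gradient descent receives no signal and the output stays constant. Standardizing the inputs (here to zero mean and unit variance, a common variant of the scaling suggested above, chosen as an assumption) spreads the pre-activations across both sides of zero.

```python
import statistics


def relu_grad(z):
    """Derivative of ReLU with respect to its pre-activation."""
    return 1.0 if z > 0 else 0.0


w, b = -0.5, 0.0                      # made-up weight and bias
x_raw = [120.0, 135.0, 150.0, 180.0]  # made-up, unscaled positive inputs

# All pre-activations are negative, so every gradient is zero.
grads_raw = [relu_grad(w * x + b) for x in x_raw]
print(grads_raw)  # [0.0, 0.0, 0.0, 0.0]

# Standardize: subtract the mean, divide by the standard deviation.
mean = statistics.mean(x_raw)
std = statistics.pstdev(x_raw)
x_std = [(x - mean) / std for x in x_raw]

# Now some samples produce a positive pre-activation and a usable gradient.
grads_std = [relu_grad(w * x + b) for x in x_std]
print(grads_std)  # [1.0, 1.0, 0.0, 0.0]
```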

What Activation Function is appropriate for input range (0,1) and output range (-∞,∞) for Regression Network in Keras

Input images are normalized to (0, 1),
and the output is float32 values with a pseudo-Gaussian distribution over (-∞,∞).
when fitted, both train and validation accuracy says over 0.999
but when predict using train and validation set, it does not reproduce itself.
The predicted output shows only negative values (and a few identical positive values).
Is this problem caused by the wrong choice of activation function?
Instead of 'relu', I have also tried 'linear' and 'sigmoid'.
The results were the same.
model = Sequential()
model.add(Convolution1D(filters=64, kernel_size=2, input_shape=(img_width, img_height)))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size=(2)))
model.add(Convolution1D(filters=32, kernel_size=2))
model.add(Activation("relu"))
model.add(MaxPooling1D(pool_size=(2)))
model.add(Flatten())
model.add(Dense(256))
model.add(Activation("relu"))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer=optimizers.RMSprop(lr=0.0001), metrics=['accuracy'])
Prediction is done like this:
model.fit(x_train, y_train, epochs=2,
validation_data=(x_valid, y_valid),
batch_size=2048,
shuffle='batch',
use_multiprocessing=True)
# right after fitting
result = model.predict(x_train, use_multiprocessing=True)
First of all, it's extremely hard to design a model that outputs over such a big range; the error of the model will be extremely high.
I suggest you normalize your outputs to the range (0, 1) and use a sigmoid in the last layer.
You can always use an inverse transform to reconstruct the original outputs.
mn = np.min(y_train)
mx = np.max(y_train)
y_train = (y_train - mn)/(mx - mn)
# ... train
# inverse transform
y_train_original = y_train*(mx-mn) + mn
when fitted, both train and validation accuracy says over 0.999 but when
predict using train and validation set, it does not reproduce itself.
Reason: overfitting. Your data is impossible to learn with such a complex output distribution, so the model just blindly memorizes the training data without learning any patterns.
To avoid this:
use output normalization.
model.add(Dense(256)) - reduce the number of neurons here; try 32 -> 64 -> 128.
use dropout.
Convolution1D is not the standard choice for dealing with images; I suggest Convolution2D.
Secondly, 'accuracy' is not the correct metric for a regression task. Good choices are mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE).
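The regression metrics named above are straightforward to compute by hand. The sketch below uses made-up targets and predictions. It also includes a simplified exact-match notion of accuracy (an assumption for illustration, not necessarily what Keras computes) to show why such a number says little about continuous outputs.

```python
import math

y_true = [0.5, -1.2, 3.0, 0.0]  # made-up targets
y_pred = [0.4, -1.0, 2.5, 0.3]  # made-up predictions

n = len(y_true)
errors = [p - t for p, t in zip(y_pred, y_true)]

mse = sum(e * e for e in errors) / n   # mean squared error
mae = sum(abs(e) for e in errors) / n  # mean absolute error
rmse = math.sqrt(mse)                  # root mean squared error

# Exact-match rate: float predictions almost never equal the target exactly.
exact_match = sum(p == t for p, t in zip(y_pred, y_true)) / n

print(mse, mae, rmse, exact_match)
```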
when fitted, both train and validation accuracy says over 0.999 but
when predict using train and validation set, it does not reproduce
itself.
This suggests something is going wrong with your prediction code, which you have not included. Either something wrong with your testing data or the way you are predicting (not loading weights?)

Explosion in loss function, LSTM autoencoder

I am training a LSTM autoencoder, but the loss function randomly shoots up as in the picture below:
I tried multiple things to prevent this, such as adjusting the batch size and the number of neurons in my layers, but nothing seems to help. I checked my input data for null / infinity values, but it contains none, and it is also normalized. Here is my code for reference:
model = Sequential()
model.add(Masking(mask_value=0, input_shape=(430, 3)))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, activation='relu'))
model.add(RepeatVector(430))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2, activation='relu', return_sequences=True))
model.add(TimeDistributed(Dense(3)))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
context_paths = loadFile()
X_train, X_test = train_test_split(context_paths, test_size=0.20)
history = model.fit(X_train, X_train, epochs=1, batch_size=4, verbose=1, validation_data=(X_test, X_test))
The loss function explodes at random points in time, sometimes sooner, sometimes later. I read this thread about possible problems, but at this point after trying multiple things I am not sure what to do to prevent the loss function from skyrocketing at random. Any advice is appreciated. Other than this I can see that my accuracy is not increasing very much, so the problems may be interconnected.
Two main points:
1st point: As highlighted by Daniel Möller:
Don't use 'relu' for LSTM; leave the standard activation, which is 'tanh'.
2nd point: One way to fix exploding gradients is to use clipnorm or clipvalue for the optimizer.
Try something like this for the last two lines:
For clipnorm:
opt = tf.keras.optimizers.Adam(clipnorm=1.0)
For clipvalue:
opt = tf.keras.optimizers.Adam(clipvalue=0.5)
See this post for help (previous version of TF):
How to apply gradient clipping in TensorFlow?
And this post for general explanation:
https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/
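Conceptually, clipvalue limits each gradient component to [-v, v], while clipnorm rescales a gradient whose L2 norm exceeds the threshold so that its norm equals the threshold. The pure-Python sketch below illustrates both on a made-up gradient vector. The helper names are hypothetical and this is not the optimizer's internal code.

```python
import math


def clip_by_value(grad, v):
    """Clamp every component to the range [-v, v]."""
    return [max(-v, min(v, g)) for g in grad]


def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


grad = [3.0, -4.0]  # made-up gradient with L2 norm 5
print(clip_by_value(grad, 0.5))  # [0.5, -0.5]
print(clip_by_norm(grad, 1.0))   # same direction, norm scaled down to 1
```

Note the difference: value clipping can change the direction of the gradient, while norm clipping only shortens it.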
Two main issues:
Don't use 'relu' for LSTM; leave the standard activation, which is 'tanh'. Because LSTMs are "recurrent", it's very easy for them to accumulate growing or shrinking values to the point of making the numbers useless.
Check the range of your data X_train and X_test. Make sure the values are not too big. Something between -4 and +4 is reasonably good. You should consider normalizing your data if it's not normalized yet.
Notice that "accuracy" doesn't make any sense for problems that are not classification. (I notice your final activation is "linear", so you're not doing classification, right?)
Finally, if the two hints above don't work, check whether you have an example that is all zeros. This might be creating a "full mask" sequence, and this "might" (I don't know) cause a bug.
(X_train == 0).all(axis=(1,2)).any() #should be false
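The first point can be illustrated with a toy one-unit recurrence, h = act(w * h + x), run for 430 steps (the sequence length in the question). The weight and input are made-up values and this is a deliberate simplification, not an LSTM: with ReLU the state can grow without bound, whereas tanh keeps it inside (-1, 1).

```python
import math


def relu(z):
    return max(0.0, z)


def run(activation, steps=430, w=1.1, x=0.1):
    """Iterate h = activation(w * h + x) and return the final state."""
    h = 0.0
    for _ in range(steps):
        h = activation(w * h + x)
    return h


h_relu = run(relu)
h_tanh = run(math.tanh)

print(h_relu > 1e6)       # True: the ReLU state has blown up
print(abs(h_tanh) < 1.0)  # True: tanh stays bounded
```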
