How can I get the loss to decrease as the number of epochs increases?

My problem is that the loss does not decrease as the number of epochs increases.
Here is my code:
model = Sequential()
model.add(LSTM(50, input_shape=(None, 1)))
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.01, decay=0.001))
hist = model.fit(x_train, y_train, epochs=40, batch_size=32, verbose=2)
The model has 10451 parameters in total, and the training set has 2285 samples.
I am wondering whether that number of parameters is reasonable for a training set of this size.
Here is my loss graph:
I tried parameter and hyperparameter tuning, but that did not solve the problem.
The dataset has been preprocessed to lie between 0 and 1.
Using an ensemble made the result worse.
How can I get the loss to decrease as the number of epochs increases?

There really is no simple answer to your question. You have a loss graph that shows a fast initial learning that then tails off to a much slower reduction after a couple of epochs. This is quite a common phenomenon.
Your question amounts to "how do I make a better machine learning model for this dataset?", which is impossible to answer generically.
Directly you could increase layers, weights, etc. Eventually (at least in theory) you would have enough complexity in your model to memorise your entire training set and get the loss down close to 0. The resulting model would almost certainly be overfitted and perform very poorly when it came to data it had never seen before though.
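To put the model size in perspective, the parameter count can be reproduced by hand. The sketch below uses the usual Keras LSTM parameter formula. The Dense(1) output layer is an assumption (it is not shown in the question), chosen because it makes the total match the reported 10451.

```python
def lstm_param_count(units, input_dim):
    # Four gates, each with an input kernel, a recurrent kernel and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_param_count(units, input_dim):
    # One weight per input per unit, plus one bias per unit.
    return units * input_dim + units


lstm_params = lstm_param_count(50, 1)     # the LSTM(50) layer from the question
dense_params = dense_param_count(1, 50)   # ASSUMED Dense(1) output layer
total_params = lstm_params + dense_params
n_train = 2285                            # training samples reported in the question

print(lstm_params, dense_params, total_params)   # 10400 51 10451
print(round(total_params / n_train, 2))          # 4.57 parameters per training sample
```

Roughly 4.6 parameters per training sample is not unusual for a small recurrent model, but it does mean memorisation is a real possibility if the model is made much larger.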
Objectively, if your labels are spread uniformly between 0 and 1, you already seem to have a very low loss value (less than 0.0005 MSE, right?).
Have you tried this on your test set? It's not at all clear why you need to drive this further down.
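For a rough sense of scale, assuming the labels really are uniform on [0, 1] as supposed above, a model that always predicts the mean label would score an MSE of about 1/12 (roughly 0.083). The sketch below uses synthetic labels, not your data, and the 0.0005 figure is just the value read off the graph.

```python
import random

random.seed(0)
# Synthetic stand-in for labels spread uniformly between 0 and 1.
labels = [random.random() for _ in range(100_000)]

mean_label = sum(labels) / len(labels)
# MSE of the trivial model that always predicts the mean label.
baseline_mse = sum((y - mean_label) ** 2 for y in labels) / len(labels)

reported_mse = 0.0005                 # approximate value from the loss graph
reported_rmse = reported_mse ** 0.5   # typical error in label units

print(baseline_mse)    # close to 1/12
print(reported_rmse)   # about 0.022
```

Under that assumption the reported loss is more than a hundred times smaller than the trivial baseline, which is why pushing it further down may not be worth much.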


Overfitting on LSTM text classification using Keras

I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles with the aim of classifying them based on a political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy would look very off, like this:
I tried to play around with different DropOut probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the max number of words and max sequence length.
I have managed to get the graphs to align a bit more, however, that has led to the model having less accuracy with the training data (and the problem of overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the model accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used. (most frequent)
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~')
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros like this:
I also made a third attempt, which was just the second attempt with the number of epochs increased. It looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs,
                    batch_size=batch_size, validation_split=0.2,
                    callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt shows substantial overfitting, with early divergence of your test and train loss. I would try a lower learning rate here (in addition to the steps you took for regularisation with dropout layers). Using the default rate does not guarantee the best results.
Allowing your model to find the global minimum rather than getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, it could be that your early stopping is too strict, i.e. you may need to increase the threshold. Consider other optimisers, including SGD with a learning rate scheduler.
Too large a network leads to overfitting on the training set and difficulty generalising. Too many neurons may cause the network to 'memorize' your entire training set and overfit. I would try 8, 16 or 24 neurons in your LSTM layer, for example.
Data preprocessing & cleaning. Check your pad_sequences call. It is probably padding the start of each text with zeros. I would pad after the text instead.
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get to a sizable amount of training text (empirically >=1M words). I would also try several techniques, including feature engineering and improving data quality, such as spell checks. Are the classes imbalanced? You may need to balance them out by over/undersampling.
Consider using transfer learning and incorporate trained language models as your embeddings layer instead of training one from scratch. ie.
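On the padding point: Keras `pad_sequences` pads at the front by default (`padding='pre'`), which is why every row of X_train starts with zeros. The pure-Python helper below only illustrates the two modes. `pad` is a hypothetical function, not the Keras implementation, and for brevity it ties truncation to the padding side (Keras has a separate `truncating` argument).

```python
def pad(seq, maxlen, padding="pre", value=0):
    """Pad or truncate one list of token ids to exactly maxlen entries."""
    seq = list(seq)
    if padding == "pre":
        seq = seq[-maxlen:]                               # keep the last maxlen tokens
        return [value] * (maxlen - len(seq)) + seq        # zeros go in front
    seq = seq[:maxlen]                                    # keep the first maxlen tokens
    return seq + [value] * (maxlen - len(seq))            # zeros go at the end


tokens = [12, 7, 431]          # hypothetical token ids for a very short text
pre = pad(tokens, 6)
post = pad(tokens, 6, padding="post")

print(pre)    # [0, 0, 0, 12, 7, 431]
print(post)   # [12, 7, 431, 0, 0, 0]
```

In the question's code the equivalent change would be `pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH, padding='post')`. Whether post-padding actually helps here is worth checking empirically rather than assuming.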

Accuracy of TensorFlow model changes a lot each time I run it

My project is to find out whether I can predict the gender of people speaking near a phone from gyroscope and accelerometer data. I have 315 examples (60 s each), and each example has 2997 lines, where each line is the magnitude of the vector formed by the gyro/accelerometer x, y, z axes.
I shuffled the inputs and outputs with the same seed and normalized the input data. I split the data 60|20|20. In this test I use the accelerometer data to detect whether a male is speaking, so the output is binary.
When I train the current model, I sometimes get accuracy as high as 0.68 and as low as 0.36, while the loss is almost always around 0.69. Running it 10 times in a for loop gives an average of 0.5 accuracy and 0.69 loss.
First question: I tried multiple types of models, learning rates, optimization algorithms, etc., but on average I wasn't very successful. Should I try recurrent NNs, and where can I learn about them?
Second question: if I train a model that reaches 68% accuracy, is it okay to say the model has 68% accuracy even though I know the average is 50%?
model = tf.keras.Sequential()
model.add(layers.Dense(512, activation='relu',input_shape = (2997,), kernel_regularizer=regularizers.l2(0.001)))
for j in range(10):
    model.add(layers.Dense(1024, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dense(1, activation='sigmoid'))
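lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    ...)  # schedule arguments not shown
callbacks = [
    # assumed to be EarlyStopping, given the monitor/patience arguments
    tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
]
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=lr_schedule),
              loss='binary_crossentropy',  # assumed: sigmoid output, loss around 0.69
              metrics=['accuracy'])
history = model.fit(
    ...,  # training inputs and labels not shown
    validation_split=0.25,
    epochs=80,
    callbacks=callbacks)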
loss1, accuracy = model.evaluate(test_vector_examples, test_vector_labels)
This is from my own experience working with different types of data: to get good answers to your questions, you should probably study the characteristics of the data closely before coming up with any models/algorithms.
First question: generally speaking, RNNs are good for data that has time dependency, or in other words, for cases where the inputs' order matters (e.g. time series, text). So I think RNNs may not be the best choice for your type of data, as I suppose ordering does not matter in your dataset.
Second question: this really depends on the difficulty of the problem you are trying to solve; but in my opinion 68% is quite low as 50% is basically the same as random choice. You probably want to improve the accuracy further.
Also, from your explanations, I can see that each gyro/accelerometer input has shape of rank 3 (xyz), so maybe you can try some CNN architectures and see how it goes.
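One more observation that fits the "random choice" reading: a binary cross-entropy of about 0.69 is what you get when the model outputs 0.5 for every example, since ln 2 is roughly 0.693. The sketch below checks that, and also estimates how much accuracy can swing by chance alone on a test set of this size. It assumes the loss is binary cross-entropy and that the test set is 20% of the 315 examples.

```python
import math


def binary_crossentropy(y_true, p):
    # Cross-entropy for a single example with predicted probability p.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


chance_loss = binary_crossentropy(1, 0.5)   # identical for y_true = 0
n_test = round(315 * 0.20)                  # 63 test examples under a 60|20|20 split
# Standard deviation of accuracy if every prediction were a coin flip.
std_acc = math.sqrt(0.5 * 0.5 / n_test)

print(round(chance_loss, 4))          # 0.6931
print(n_test, round(std_acc, 3))      # 63 0.063
```

With a standard deviation of about 0.063, a coin-flip model would often land anywhere between roughly 0.37 and 0.63 on 63 examples, so a single run at 0.68 is weak evidence on its own. Reporting the mean and spread over repeated runs is the safer summary.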

Keras RNN accuracy doesn't improve

I'm trying to improve my model so it can become a bit more accurate. Right now I'm training the model and get this as my training and validation accuracy.
For every epoch I get a training accuracy of 0.0003 and a validation accuracy of 0. I know this isn't good, but I don't know how to fix it.
The data is normalized with MinMaxScaler. 4 of the 8 features are normalized (the other 4 are hour, day, day_of_week and month).
I've also tried normalizing the entire dataset, and it doesn't make a difference.
scaling = MinMaxScaler(feature_range=(0,1)).fit(df[cols])
df[cols] = scaling.transform(df[cols])
My model: The shape is (5351, 1, 8)
and the input_shape is (1, 8)
model = keras.Sequential()
model.add(keras.layers.Bidirectional(keras.layers.LSTM(2,input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True, activation='linear')))
model.compile(loss='mean_squared_error', optimizer='Adamax', metrics=['acc'])
history = model.fit(
    X_train, y_train,
    ...)  # remaining arguments not shown
I tried using the answer to this question:
Keras model accuracy not improving
but it didn't work.
A mean_squared_error loss is for regression tasks, while an acc metric is for classification problems. So it makes no sense to use them together.
If you are working on a classification problem, use binary_crossentropy or categorical_crossentropy as the loss and keep the metrics parameter as it is.
If it is a regression task, change the metric to ['mse'] (mean squared error) instead of ['acc'].
Your model "works", and you have applied the standard formula for backpropagation by using the mean squared error loss. But measuring the accuracy makes Keras check whether your model's output is EXACTLY equal to the expected values. Since the loss function is for regression, it will hardly ever be equal.
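A small illustration of that mismatch, using made-up numbers rather than your data. This is a sketch of the idea only, not Keras's exact metric implementation: predictions can be close to continuous targets (small MSE and MAE) while an equality-based accuracy stays at zero.

```python
y_true = [0.12, 0.47, 0.80, 0.33]   # hypothetical regression targets
y_pred = [0.10, 0.50, 0.78, 0.35]   # hypothetical predictions, all fairly close

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
# Fraction of predictions that match the target exactly.
exact_match = sum(t == p for t, p in zip(y_true, y_pred)) / n

print(mse)           # about 0.0005
print(mae)           # about 0.0225
print(exact_match)   # 0.0
```

For a regression model, metrics such as `'mse'` or `'mae'` describe the fit far better than an equality-based score.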
Three last points, because that small change won't fix everything.
Firstly, your last dense layer should have an activation function. (It's safer.)
Secondly, I'm pretty sure a Bidirectional+LSTM layer placed before a Dense layer should have return_sequences=False. An LSTM layer (with or without Bidirectional) can return the full sequence of vectors (like a matrix), but a dense layer takes vectors as input. In this case, though, it will work because of the third point.
The last point is about the shape of your data. You have 5351 examples of shape (1, 8), each of which is a vector of size 8. An LSTM layer takes a sequence of vectors, yet the length of your sequence is one. I don't know whether it is relevant to use an RNN-type layer here.

Very large loss values when training multiple regression model in Keras

I was trying to build a multiple regression model to predict housing prices using the following features:
[bedrooms bathrooms sqft_living view grade]
= [0.09375 0.266667 0.149582 0.0 0.6]
I have standardized and scaled the features using sklearn.preprocessing.MinMaxScaler.
I used Keras to build the model:
def build_model(X_train):
    model = Sequential()
    model.add(Dense(5, activation = 'relu', input_shape = X_train.shape[1:]))
    optimizer = Adam(lr = 0.001)
    model.compile(loss = 'mean_squared_error', optimizer = optimizer)
    return model
When I go to train the model, my loss values are insanely high, something like 4 or 40 trillion and it will only go down about a million per epoch making training infeasibly slow. At first I tried increasing the learning rate, but it didn't help much. Then I did some searching and found that others have used a log-MSE loss function so I tried it and my model seemed to work fine. (Started at 140 loss, went down to 0.2 after 400 epochs)
My question is: do I always just use log-MSE when I see very large MSE values for linear/multiple regression problems? Or are there other things I can do to try and fix this issue?
My guess as to why this issue occurred is that the scales of my predictor and response variables were vastly different. The X's are between 0 and 1, while the highest Y went up to 8 million. (Am I supposed to scale down my Y's, and then scale back up for predicting?)
A lot of people believe in scaling everything. If your y goes up to 8 million, I'd scale it, yes, and reverse the scaling when you get predictions out, later.
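A minimal sketch of that scale-then-reverse round trip, in plain Python with made-up prices. The `scale` and `unscale` helpers are hypothetical names. With scikit-learn, a second `MinMaxScaler` fitted on y and its `inverse_transform` method would do the same job.

```python
prices = [221900.0, 538000.0, 1225000.0, 8000000.0]   # hypothetical sale prices

# In practice these would be computed on the training targets only.
y_min, y_max = min(prices), max(prices)


def scale(y):
    # Map a price into the [0, 1] range used for training.
    return (y - y_min) / (y_max - y_min)


def unscale(y_scaled):
    # Map a network output back to the original price units.
    return y_scaled * (y_max - y_min) + y_min


scaled = [scale(y) for y in prices]
restored = [unscale(s) for s in scaled]

print(scaled)     # values between 0.0 and 1.0
print(restored)   # back in the original price units
```

The network is then trained on the scaled targets, and `unscale` is applied to its predictions before reporting them.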
Don't worry too much about the specific loss number you see. Sure, 40 trillion is ridiculously high, indicating that changes may need to be made to the network architecture or parameters. The main concern is whether the validation loss is actually decreasing, i.e. whether the network is actually learning. If, as you say, it 'went down to 0.2 after 400 epochs', then it sounds like you're on the right track.
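It can help to remember that MSE is in squared units of the target, so taking the square root gives an error in the same units as the prices. A quick check with the two loss values mentioned in the question:

```python
import math

# The two MSE values quoted in the question ("4 or 40 trillion").
rmse_by_mse = {mse: math.sqrt(mse) for mse in (4e12, 4e13)}

for mse, rmse in rmse_by_mse.items():
    print(f"MSE {mse:.0e} -> RMSE {rmse:,.0f}")
# MSE 4e+12 -> RMSE 2,000,000
# MSE 4e+13 -> RMSE 6,324,555
```

An RMSE of a few million on prices that go up to 8 million is large, but it is the kind of number an untrained network produces when its outputs start near zero, so the trillions are less alarming than they look.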
There are many other loss functions besides log-mse, mse, and mae, for regression problems. Have a look at these. Hope that helps!
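For comparison, here is a plain-Python sketch of MSE next to mean squared logarithmic error. This assumes that "log-MSE" in the question refers to the loss Keras calls `mean_squared_logarithmic_error`. The prices and predictions are made up, with each prediction about 10% off.

```python
import math


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    # Squared difference of log(1 + y), which depends on relative error.
    return sum((math.log1p(t) - math.log1p(p)) ** 2
               for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [300_000.0, 1_000_000.0, 8_000_000.0]   # hypothetical prices
y_pred = [330_000.0, 900_000.0, 7_200_000.0]     # each about 10% off

mse_value = mse(y_true, y_pred)
msle_value = msle(y_true, y_pred)

print(mse_value)    # on the order of 1e11, dominated by the most expensive house
print(msle_value)   # on the order of 0.01
```

The two losses answer different questions: MSE penalises absolute error, so expensive houses dominate, while MSLE penalises relative error. Which one is appropriate depends on what kind of mistake matters for the application.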

Neural network accuracy optimization

I have constructed an ANN in Keras which has one input layer (3 inputs), one output layer (1 output) and two hidden layers with 12 and 3 nodes respectively.
The way I construct and train my network is:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.cross_validation import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
dataset = numpy.loadtxt("sorted output.csv", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:3]
Y = dataset[:,3]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=3, init='uniform', activation='relu'))
model.add(Dense(3, init='uniform', activation='relu'))
model.add(Dense(1, init='uniform', activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model
model.fit(X_train, y_train, validation_data=(X_test,y_test), nb_epoch=150, batch_size=10)
Sorted output csv file looks like:
So after 150 epochs I get: loss: 0.6932 - acc: 0.5000 - val_loss: 0.6970 - val_acc: 0.1429
My question is: how could I modify my NN in order to achieve higher accuracy?
You could try the following things. I have written this roughly in the order of importance - i.e. the order I would try things to fix the accuracy problem you are seeing:
Normalise your input data. Usually you would take mean and standard deviation of training data, and use them to offset+scale all further inputs. There is a standard normalising function in sklearn for this. Remember to treat your test data in the same way (using the mean and std from the training data, not recalculating it)
Train for more epochs. For problems with small numbers of features and limited training set sizes, you often have to run for thousands of epochs before the network will converge. You should plot the training and validation loss values to see whether the network is still learning, or has converged as best as it can.
For your simple data, I would avoid relu activations. You may have heard they are somehow "best", but like most NN options, they have types of problems where they work well, and others where they are not best choice. I think you would be better off with tanh or sigmoid activations in hidden layers for your problem. Save relu for very deep networks and/or convolutional problems on images/audio.
Use more training data. Not clear how much you are feeding it, but NNs work best with large amounts of training data.
Provided you already have lots of training data - increase size of hidden layers. More complex relationships require more hidden neurons (and sometimes more layers) for the NN to be able to express the "shape" of the decision surface. Here is a handy browser-based network allowing you to play with that idea and get a feel for it.
Add one or more dropout layers after the hidden layers or add some other regularisation. The network could be over-fitting (although with a training accuracy of 0.5 I suspect it isn't). Unlike relu, using dropout is pretty close to a panacea for tougher NN problems - it improves generalisation in many cases. A small amount of dropout (~0.2) might help with your problem, but like most hyper-parameters, you will need to search for the best values.
Finally, it is always possible that the relationship you want to find that allows you to predict Y from X is not really there. In which case it would be a correct result from the NN to be no better than guessing at Y.
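The normalisation step from the first point can be sketched in plain Python as follows. The rows are made-up values with three features, and `standardize` is a hypothetical helper. The key detail is that the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import statistics

train = [[2.0, 100.0, 0.5], [4.0, 300.0, 1.5], [6.0, 200.0, 1.0]]   # hypothetical rows
test = [[3.0, 250.0, 0.75]]

# Per-feature statistics, computed on the training data only.
columns = list(zip(*train))
means = [statistics.mean(c) for c in columns]
stds = [statistics.pstdev(c) for c in columns]   # population std, as StandardScaler uses


def standardize(rows):
    # Offset and scale every feature with the TRAINING mean and std.
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


train_scaled = standardize(train)
test_scaled = standardize(test)

print(train_scaled)
print(test_scaled)
```

With scikit-learn this corresponds to calling `scaler.fit(X_train)` once and then `scaler.transform(...)` on both the training and the test inputs.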
Neil Slater already provided a long list of helpful general advice.
In your specific example, normalization is the important thing. If you add the following lines to your code
X = dataset[:,0:3]
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
you will get 100% accuracy on your toy data, even with much simpler network structures. Without normalization, the optimizer won't work.
