I was trying to build a multiple regression model to predict housing prices using the following features:
[bedrooms bathrooms sqft_living view grade]
= [0.09375 0.266667 0.149582 0.0 0.6]
I have standardized and scaled the features using sklearn.preprocessing.MinMaxScaler.
I used Keras to build the model:
def build_model(X_train):
model = Sequential()
model.add(Dense(5, activation = 'relu', input_shape = X_train.shape[1:]))
model.add(Dense(1))
optimizer = Adam(lr = 0.001)
model.compile(loss = 'mean_squared_error', optimizer = optimizer)
return model
When I go to train the model, my loss values are insanely high, something like 4 or 40 trillion and it will only go down about a million per epoch making training infeasibly slow. At first I tried increasing the learning rate, but it didn't help much. Then I did some searching and found that others have used a log-MSE loss function so I tried it and my model seemed to work fine. (Started at 140 loss, went down to 0.2 after 400 epochs)
My question is do I always just use log-MSE when I see very large MSE values for linear/multiple regression problems? Or are there other things i can do to try and fix this issue?
A guess as to why this issue occurred is the scale between my predictor and response variables were vastly different. X's are between 0-1 while the highest Y went up to 8 million. (Am I suppose to scale down my Y's? And then scale back up for predicting?)
A lot of people believe in scaling everything. If your y goes up to 8 million, I'd scale it, yes, and reverse the scaling when you get predictions out, later.
Don't worry too much about specifically what loss number you see. Sure, 40 trillion is a bit ridiculously high, indicating changes may need to be made to the network architecture or parameters. The main concern is whether the validation loss is actually decreasing, and the network actually learning therewith. If, as you say, it 'went down to 0.2 after 400 epochs', then it sounds like you're on the right track.
There are many other loss functions besides log-mse, mse, and mae, for regression problems. Have a look at these. Hope that helps!
Related
I have a simple model in tensorflow which is being trained on the first 1000 images in the MNIST datset. From my previous experience the learning rates which I used were of the order of around 0.001, however for my model to converge the learning rate needs to be far heigher, at least larger than 1. The model is shown below.
def gen_model():
return tf.keras.models.Sequential([
tf.keras.Input(shape=(28,28,)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(128, activation='sigmoid'),
tf.keras.layers.Dense(10, activation='softmax')
])
model = gen_model()
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=5), loss='mean_squared_error')
model.summary()
model.fit(x_train, y_train, batch_size=1000, epochs=10000)
Is it expected for models of this form to require an extremely high learning rate, or is there something I have missed? When I use a learning rate of around 0.001 the loss changes incredibly slowly.
The dataset was created with the following code:
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_train = x_train.reshape(60000,28,28)[:1000];
y_train = y_train[:1000];
y_train = tf.one_hot(y_train, 10)
Generally speaking, models that require learning rates larger than 1 raise a red flag for me. It seems like your model is a vanilla multilayer perceptron, so there's nothing overly complicated about that, but there are a couple things about your setup that stand out:
The output from your model uses a softmax, which is normally used to represent values from a categorical distribution (i.e., 1-of-k) -- this is typical for a classification model. But the loss you're using is typically used for optimizing Gaussian or regression outputs. You might want to try using a cross-entropy loss to see if that helps.
The output from your model is in probability space, so the values you get out from your model are in [0, 1]. The loss you're using is averaging the squared differences between the model output and the target 1-hot vector (whose values are in {0, 1}). The value you'll get for this loss is always smaller than 1, so with a learning rate less than 1, and multiplying by the existing model weights, the delta that you'll apply to your model weights is always going to be small. Sometimes that's a good thing, but my guess is that in this case -- and particularly at the start of training when the model weights aren't near their optimal values -- this is going to be quite slow.
Related to the above point, you might try initializing your model weights with a larger range of values than the default. This would help make the gradient values larger, but could also make the model more likely to diverge.
You could also try to replace your softmax output activation with a plain linear activation, in effect converting your model's output to (unnormalized) log-probability space. Then you'd need to change your dataset labels to also represent target log-probability values, which isn't possible exactly, but could get close with something like 1e8 * (1 - one_hot). But if you wanted to go this route, you'd effectively be implementing a cross-entropy loss yourself; see the first point.
My project is to try and find out if I can predict gender of people speaking near phone from data from gyroscope and accelerometer. I have 315 examples(60sec each) and each example has 2997 lines where each line represents magnitude of vector from gyro/accelerometer xyz axis.
I shuffled input and output by same seed and I normalized input data. I split data on 60|20|20. In this test I try from accelerometer to see if there is male speaking, so output is binary.
When I train data with current model, sometimes I get accuracy as high as 0.68 and as low as 0.36 while loss is almost always around 0.69. I run it in a for loop for 10 times and average is 0.5 accuracy and 0.69 loss.
First question is i tried multiple types of models, learning rates, optimization algorithms etc. but in average i wasnt too successful. Should I try Recurrent NNs and where can i learn it?
Second question is if i train model with accuracy of 68%, is it okay to say the model has 68% accuracy even though i know average is 50%?
model = tf.keras.Sequential()
model.add(layers.Dense(512, activation='relu',input_shape = (2997,), kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.5))
for j in range(10) :
model.add(layers.Dense(1024, activation='relu', kernel_regularizer=regularizers.l2(0.001)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))
lr_schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
0.001,
decay_steps=200,
decay_rate=1,
staircase=True)
callbacks = [
tf.keras.callbacks.EarlyStopping(
monitor='val_loss', patience = 20
)
]
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate = lr_schedule),
loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
metrics = ['accuracy'])
history = model.fit(
train_vector_examples,
train_vector_labels,
validation_split = 0.25,
epochs =80,
callbacks=callbacks,
verbose=0,
shuffle=False
)
loss1, accuracy = model.evaluate(test_vector_examples, test_vector_labels)
This is from my own experience working with different types of data; to have get good solutions to your questions you should probably study the characteristics of the data closely before coming up with any models/algorithms.
First question: generally speaking, RNNs are good for data that has time dependency, or in other words, for cases where the inputs' order matters (e.g. time series, text). So I think RNNs may not be the best choice for your type of data, as I suppose ordering does not matter in your dataset.
Second question: this really depends on the difficulty of the problem you are trying to solve; but in my opinion 68% is quite low as 50% is basically the same as random choice. You probably want to improve the accuracy further.
Also, from your explanations, I can see that each gyro/accelerometer input has shape of rank 3 (xyz), so maybe you can try some CNN architectures and see how it goes.
I have a problem that the loss does not decrease when the epoch increases.
here is my code
model = Sequential()
model.add(LSTM(50, input_shape=(None, 1)))
model.add(Dropout(0.05))
BatchNormalization()
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.01, decay=0.001))
hist = model.fit(x_train, y_train, epochs=40, batch_size=32, verbose=2)
Total parameters is number of 10451 and
train dataset is number of 2285
I am wondering if the total parameter is a reasonable ratio for train_data.
In other words, I wonder if the total parameter is appropriate to have a ratio of train_data.
and here is my loss graph
I tried parameter and hyperparameter tuning But this could not be solved.
The dataset has been preprocessed between 0 and 1.
The ensemble rather made the result worse.
how can I get to decrease loss when the epoch increases?
There really is no simple answer to your question. You have a loss graph that shows a fast initial learning that then tails off to a much slower reduction after a couple of epochs. This is quite a common phenomenon.
Your question amounts to "how do I make a better machine learning model for this dataset?". Which is an impossible question to answer generically.
Directly you could increase layers, weights, etc. Eventually (at least in theory) you would have enough complexity in your model to memorise your entire training set and get the loss down close to 0. The resulting model would almost certainly be overfitted and perform very poorly when it came to data it had never seen before though.
Objectively, if your labels are spread uniformly between 0 and 1, you already seem to have a very low loss value (less than 0.0005 MSE, right??)?
Have you tried this on your test set? It's not at all clear why you need to drive this further down.
I have a network with 32 input nodes, 20 hidden nodes and 65 output nodes. My network input actually is a hash code of length 32 and the output is the word.
The input is the ascii value of each character of the Hash code. The output of the network is a binary representation I have made. Say for example a is equal to 00000 and b is equal to 00001 and so on and so forth. It only includes the alphabet and the space that why it's only 5 bits per character. I have a maximum limit of only 13 characters in my training input, so my output nodes is 13 * 5 = 65. And Im expecting a binary output like 10101010101010101010101010101010101010101010101010101010101001011 . The bit sequence can predict at most 16 characters word given a hash code of 32 length as an input. Below is my current code:
scaler = MinMaxScaler(feature_range=(0,1))
scaled_train_samples = scaler.fit_transform((train_samples).reshape(-1, 32))
train_labels = train_labels.reshape(-1, 65)
model = Sequential([
Dense(32, input_shape=(32,), activation = 'sigmoid'),
BatchNormalization(),
Dense(25, activation='tanh'),
BatchNormalization(),
Dense(65, input_shape=(65,), activation='sigmoid')
])
overfitCallback = EarlyStopping(monitor='loss', min_delta=0, patience = 1000)
model.summary()
model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_samples, train_labels, batch_size=1000, epochs=1000000, callbacks=[overfitCallback], shuffle = True, verbose=2)
I plan to overfit the model, so that it can memorize all the hash codes of the words in the dictionary. As an initial, my training samples is only 5,000 something. I just wanted to see if it will learn from a small dataset. How will I make network converge faster? I think its running more than one hour, and its loss function is still .5004 something and the accuracy is .7301. It gets up and down but when I check every 10 minutes or so, I can see only alittle improvement. How will I fine tune it?
UPDATE :
The training had already stopped but it didn't converge. It's loss is .4614 and accuracy is .7422
There are some hyper parameters that i would suggest to change first.
Try 'relu' or LeakyReLU() as the activation function for the non-output layers. Basically relu is the standard activation function for baseline models.
The standard optimizer (for most cases) currently is Adam, try using this. Tweak its learning rate when needed. You could get better results with sgd, but it often takes a lot of epochs and a lot of hyper parameter tuning. Adam is basically the quickest (in general) optimizer to reach a 'low' loss.
To prevent overfitting you might also want to implement Dropout(0.5), where the 0.5 is as an example.
Once you have reached the lowest loss, you might start changing these hyper parameters even more, to try and egt a lower loss.
Apart from this, the first thing i actually suggest is trying and add multiple hidden layers with different sizes. This might have a way larger impact then trying to optimize all the hyper parameters.
Edit: Maybe you could post a screenshot of your training loss vs epochs for the train & val data? This might make things more clear for others.
I am creating a deep convolutional neural network for pixel-wise classification. I am using adam optimizer, softmax with cross entropy.
Github Repository
I asked a similar question found here but the answer I was given did not result in me solving the problem. I also have a more detailed graph of what it going wrong. Whenever I use softmax, the problem in the graph occurs. I have done many things such as adjusting training and epsilon rates, trying different optimizers, etc. The loss never decreases past 500. I do not shuffle my data at the moment. Using sigmoid in place of softmax results in this problem not occurring. However, my problem has multiple classes, so the accuracy of sigmoid is not very good. It should also be mentioned that when the loss is low, my accuracy is only about 80%, I need much better than this. Why would my loss suddenly spike like this?
x = tf.placeholder(tf.float32, shape=[None, 7168])
y_ = tf.placeholder(tf.float32, shape=[None, 7168, 3])
#Many Convolutions and Relus omitted
final = tf.reshape(final, [-1, 7168])
keep_prob = tf.placeholder(tf.float32)
W_final = weight_variable([7168,7168,3])
b_final = bias_variable([7168,3])
final_conv = tf.tensordot(final, W_final, axes=[[1], [1]]) + b_final
cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=final_conv))
train_step = tf.train.AdamOptimizer(1e-5).minimize(cross_entropy)
correct_prediction = tf.equal(tf.argmax(final_conv, 2), tf.argmax(y_, 2))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
You need label smoothing.
I just had the same problem. I was training with tf.nn.sparse_softmax_cross_entropy_with_logits which is the same as if you use tf.nn.softmax_cross_entropy_with_logits with one-hot labels. My dataset predicts the occurrence of rare events so the labels in the training set are 99% class 0 and 1% class 1. My loss would start to fall, then stagnate (but with reasonable predictions), then suddenly explode and then the predictions also went bad.
Using the tf.summary ops to log internal network state into Tensorboard, I observed that the logits were growing and growing in absolute value. Eventually at >1e8, tf.nn.softmax_cross_entropy_with_logits became numerically unstable and that's what generated those weird loss spikes.
In my opinion, the reason why this happens is with the softmax function itself, which is in line with Jai's comment that putting a sigmoid in there before the softmax will fix things. But that will quite surely also make it impossible for the softmax likelihoods to be accurate, as it limits the value range of the logits. But in doing so, it prevents the overflow.
Softmax is defined as likelihood[i] = tf.exp(logit[i]) / tf.reduce_sum(tf.exp(logit[!=i])). Cross-entropy is defined as tf.reduce_sum(-label_likelihood[i] * tf.log(likelihood[i]) so if your labels are one-hot, that reduces to just the negative logarithm of your target likelihood. In practice, that means you're pushing likelihood[true_class] as close to 1.0 as you can. And due to the softmax, the only way to do that is if tf.exp(logit[!=true_class]) becomes as close to 0.0 as possible.
So in effect, you have asked the optimizer to produce tf.exp(x) == 0.0 and the only way to do that is by making x == - infinity. And that's why you get numerical instability.
The solution is to "blur" the labels so instead of [0,0,1] you use [0.01,0.01,0.98]. Now the optimizer works to reach tf.exp(x) == 0.01 which results in x == -4.6 which is safely inside the numerical range where GPU calculations are accurate and reliably.
Not sure, what it causes it exactly. I had the same issue a few times. A few things generally help: You might reduce the learning rate, ie. the bound of the learning rate for Adam (eg. 1e-5 to 1e-7 or so) or try stochastic gradient descent. Adam tries to estimate learning rates which can lead to instable training: See Adam optimizer goes haywire after 200k batches, training loss grows
Once I also removed batchnorm and that actually helped, but this was for a "specially" designed network for stroke data (= point sequences), which was not very deep with Conv1d layers.