I'm using PyTorch to implement a simple linear regression model.
The code works well on randomly generated datasets, but on the dataset I actually want to train on it gives badly wrong results.
Here is the code:
x = torch.linspace(1,100,steps=100)
learn_rate = 0.000001
x_train = x[:100]
x_test = x[100:]
y_train = data[:100]
y_test = data[100:]
# y_train = -0.01*x_train + torch.randn(100)*10 #Code for generating random data.
w = torch.rand(1,requires_grad=True)
b= torch.rand(1,requires_grad=True)
for i in range(1000):
    loss = torch.mean((y_train-(w*x_train+b))**2)
    if(i%100==0):
        print(loss)
    loss.backward()
    w.data.add_(-w.grad.data*learn_rate)
    b.data.add_(-b.grad.data*learn_rate)
    w.grad.data.zero_()
    b.grad.data.zero_()
The result it gives makes no sense.
However, when I used a randomly generated dataset, it works perfectly:
The two datasets actually look similar, so I am not sure what causes the inaccuracy of this model.
Code for plotting data:
plt.plot(x_train.numpy(),y_train.numpy())
plt.plot(x_train.numpy(),(w*x_train+b).data.numpy())
plt.show()
--
Now the problem seems to be that the weight converges much faster than the bias. At the current learning rate, the bias will not converge to its optimum. However, if I increase the learning rate even a little, the weight simply diverges. It seems I have to set two learning rates.
However, I'm wondering whether setting different learning rates is the best solution for a simple model like this, because I've found that not many models actually use different learning rates for different parameters.
Your code seems to be correct, but your model converges slower when there is a large bias in your data (because it now has to update the bias parameter many times before it reaches the correct value).
You could try running it for more iterations or increasing the learning rate.
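A minimal sketch of why the bias lags, assuming synthetic data (the slope -0.01, the offset 50 and the noise level are made up, not the asker's dataset). With x running from 1 to 100, the gradient for w is scaled by the x values and is typically far larger than the gradient for b, so a step size small enough to keep w stable barely moves b. Standardizing x first puts both parameters on a similar scale, so a single ordinary learning rate works for both:

```python
import numpy as np

def fit_line(x, y, learn_rate, steps):
    """Plain gradient descent on mean squared error for y ~ w*x + b."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        residual = (w * x + b) - y
        w -= learn_rate * 2.0 * np.mean(residual * x)
        b -= learn_rate * 2.0 * np.mean(residual)
    return w, b

rng = np.random.default_rng(0)
x = np.linspace(1, 100, 100)
# Hypothetical data with a large offset (assumption, for illustration only).
y = -0.01 * x + 50.0 + rng.normal(0.0, 1.0, size=100)

# Raw inputs: the step size has to stay tiny to keep w stable, so b barely moves.
w_raw, b_raw = fit_line(x, y, learn_rate=1e-6, steps=1000)

# Standardized inputs: both gradients have a similar scale, so 0.1 is safe.
x_std = (x - x.mean()) / x.std()
w_std, b_std = fit_line(x_std, y, learn_rate=0.1, steps=1000)

# Map the standardized solution back to the original x scale.
w_back = w_std / x.std()
b_back = b_std - w_std * x.mean() / x.std()

print(f"raw inputs:          w={w_raw:.4f}, b={b_raw:.4f}")
print(f"standardized inputs: w={w_back:.4f}, b={b_back:.4f}")
```

With the same 1000 iterations, the raw run leaves b close to its starting value, while the standardized run recovers an intercept near 50. This avoids the need for two separate learning rates.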
In this simple toy example, the network learns the XOR operation:
import tensorflow as tf
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score
model = tf.keras.Sequential(layers=[
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01)
)
x_train = np.random.uniform(-1, 1, (10000, 2))
tmp = x_train > 0
y_train = (tmp[:, 0] ^ tmp[:, 1])
model.fit(x=x_train, y=y_train, epochs=10)
x_test = np.random.uniform(-1, 1, (1000, 2))
tmp = x_test > 0
y_test = (tmp[:, 0] ^ tmp[:, 1])
prediction = model.predict(x_test) > 0.5
print(f'Accuracy: {accuracy_score(y_pred=prediction, y_true=y_test)}')
print(f'recall: {recall_score(y_pred=prediction, y_true=y_test)}')
print(f'precision: {precision_score(y_pred=prediction, y_true=y_test)}')
This example can also be found in the TensorFlow Playground.
When the initial loss is <3, this will quickly converge (in 2-3 epochs). But sometimes the initial conditions lead it to have ~7 loss, in which case it never converges (not even after 1000 epochs).
It's easy to know right after the first epoch whether it's going to work, but it makes searching for hyperparameters very difficult, since you never know whether a run converged by chance due to the initial conditions or because of the hyperparameters.
Is there a way to make this network less dependent on initial conditions? A different optimizer? Some optimizer hyper parameter? weight regularization?
I've tried changing these, but didn't get consistent improvements.
In the playground example, it never gets stuck at this kind of high loss.
Edit: If you make the training long enough, it might jump to loss 7 even after settling on a good solution with loss < 0.03.
Theoretically, there's no way to be 100% sure if it's the hyper params or the initial config. You'll need to implement something for the case when there's divergence.
Practically though, you could:
Train multiple times, and incorporate how often the network converges into your strategy for selecting the best hyperparameters.
Find some ranges for which you feel the model is consistent.
Incorporate the initialization of your weights into your hyperparameter optimization. Right now they are randomly initialized, and they are the cause of the problem. There are a number of ways to do this. Try playing around with different initializers, but there's no "one best initializer for every ML problem".
Just fix the initial conditions. Fix the random seed that TensorFlow uses for initialization with tf.random.set_seed. Of course, the performance you observe will then depend heavily on that one seed, so I don't think that's really what you want. You could claim that you are now sure the network performs well because of the architecture, but that's only true for that specific random seed, not for all.
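The first suggestion (train several times and count how often it converges) could be sketched like this. The `toy_train_once` function is a hypothetical stand-in that only imitates the behaviour described in the question; it is not a real training run, and the threshold of 1.0 is an arbitrary assumption:

```python
import random

def convergence_rate(train_once, seeds, loss_threshold):
    """Fraction of seeds whose final loss lands below loss_threshold."""
    converged = sum(1 for seed in seeds if train_once(seed) < loss_threshold)
    return converged / len(seeds)

def toy_train_once(seed):
    """Hypothetical stand-in for 'seed, build, fit and return the final loss'.

    It mimics the question: most runs end near 0.03, but some unlucky
    initializations stay stuck near 7.
    """
    rng = random.Random(seed)
    return 7.0 if rng.random() < 0.2 else 0.03

rate = convergence_rate(toy_train_once, seeds=range(20), loss_threshold=1.0)
print(f"converged in {rate:.0%} of runs")
```

In a real search, `train_once` would call tf.random.set_seed(seed) before building the model and return the final training loss. Comparing hyperparameter settings by this rate, rather than by a single run, separates "bad setting" from "unlucky start".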
According to this blog, adding batchnorm should make the network less sensitive to the initialisation approach.
I'm new to ANNs, but I've managed to train a convolutional model successfully (using some legacy TensorFlow v1 code) up to ~90% accuracy or so on my data. But when I evaluate (test) it on any given batch, the result is somewhat random, even though it's 90% correct. I've tried re-evaluating the data N times and averaging (using N between 1 and 25), but each evaluation still differs from the others on 3% to 10% of the data points.
Is there any way to make the evaluation predictable, so that evaluating an input batch X always yields the exact same result Y every time I run it (once training is done)?
I'm not sure if it's relevant, but my layers are batch normalized like so:
inp = tf.identity(inp)
channels = inp.get_shape()[-1]
offset = tf.compat.v1.get_variable(
    'offset', [channels],
    dtype=tf.float32,
    initializer=tf.compat.v1.zeros_initializer())
scale = tf.compat.v1.get_variable(
    'scale', [channels],
    dtype=tf.float32,
    initializer=tf.compat.v1.random_normal_initializer(1.0, 0.02))
mean, variance = tf.nn.moments(x=inp, axes=[0, 1], keepdims=False)
variance_epsilon = 1e-5
normalized = tf.nn.batch_normalization(
    inp, mean, variance, offset, scale, variance_epsilon=variance_epsilon)
The scale part is initialized with random data, but I assume that gets loaded when I do tf.compat.v1.train.Saver().restore(session, checkpoint_fname)?
I am assuming you are testing the model on your training batches?
You can't equate the accuracy of a portion of your total training dataset to the accuracy of the whole.
Think of it like a regression problem. If you only take a part of the dataset, there is no guarantee that it would average out close to the full dataset.
If you want consistent accuracy, evaluate on the full dataset.
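Separately, one thing that may be worth checking in the question's snippet: it normalizes with tf.nn.moments of the current input, so the statistics come from whatever is being evaluated. If the evaluation batches are composed differently between runs, the same sample can come out differently. The sketch below shows this in plain NumPy with made-up numbers; the "moving" statistics are hypothetical stand-ins for values recorded during training, not something the original code stores:

```python
import numpy as np

def normalize(batch, mean, variance, eps=1e-5):
    """Same arithmetic as batch normalization with offset=0 and scale=1."""
    return (batch - mean) / np.sqrt(variance + eps)

sample = np.array([[1.0, 2.0]])
batch_a = np.vstack([sample, [[0.0, 0.0], [2.0, 4.0]]])
batch_b = np.vstack([sample, [[10.0, 20.0], [30.0, 40.0]]])

# Statistics taken from the batch itself: row 0 depends on its neighbours.
out_a = normalize(batch_a, batch_a.mean(axis=0), batch_a.var(axis=0))[0]
out_b = normalize(batch_b, batch_b.mean(axis=0), batch_b.var(axis=0))[0]
print("batch statistics:", out_a, out_b)

# Fixed statistics (hypothetical values saved from training): row 0 is stable.
moving_mean = np.array([1.0, 2.0])
moving_var = np.array([4.0, 9.0])
fixed_a = normalize(batch_a, moving_mean, moving_var)[0]
fixed_b = normalize(batch_b, moving_mean, moving_var)[0]
print("fixed statistics:", fixed_a, fixed_b)
```

Whether this is the cause here depends on how the evaluation batches are assembled, so treat it as something to rule out rather than a diagnosis.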
I've written an LSTM in Keras for univariate time series forecasting. I'm using an input window of size 48 and an output window of size 12, i.e. I'm predicting 12 steps at once. This is working generally well with an optimization metric such as RMSE.
For non-stationary time series I'm differencing the data before feeding the data to the LSTM. Then after predicting, I take the inverse difference of the predictions.
When differencing, RMSE is not suitable as an optimization metric as the earlier prediction steps are a lot more important than later steps. When we do the inverse difference after creating a 12-step forecast, then the earlier (differenced) prediction steps are going to affect the inverse difference of later steps.
So what I think I need is an optimization metric that gives the early prediction steps more weight, preferably exponentially.
Does such a metric exist already or should I write my own? Am I overlooking something?
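A tiny worked example of the effect described above, with made-up numbers and assuming first-order differencing. The inverse difference is a cumulative sum, so an error in the first differenced step is carried into every later step, while the same error in the last step affects only that step:

```python
import numpy as np

last_observed = 100.0
true_diffs = np.array([1.0, 1.0, 1.0, 1.0])

def inverse_difference(diffs, start):
    """Undo first-order differencing: cumulative sum on top of the last known value."""
    return start + np.cumsum(diffs)

truth = inverse_difference(true_diffs, last_observed)

# The same error of 0.5, placed either at the first or at the last step.
early_error = inverse_difference(true_diffs + np.array([0.5, 0.0, 0.0, 0.0]), last_observed)
late_error = inverse_difference(true_diffs + np.array([0.0, 0.0, 0.0, 0.5]), last_observed)

print("abs error, early mistake:", np.abs(early_error - truth))
print("abs error, late mistake: ", np.abs(late_error - truth))
```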
Just wrote my own optimization metric, it seems to work well, certainly better than RMSE.
Still curious what's the best practice here. I'm relatively new to forecasting.
from tensorflow.keras import backend as K
def weighted_rmse(y_true, y_pred):
    weights = K.arange(start=y_pred.get_shape()[1], stop=0, step=-1, dtype='float32')
    y_true_w = y_true * weights
    y_pred_w = y_pred * weights
    return K.sqrt(K.mean(K.square(y_true_w - y_pred_w), axis=-1))
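The metric above uses linearly decreasing weights (12, 11, ..., 1). Since the question asks for exponential weighting, here is a sketch of that variant in plain NumPy, so the arithmetic can be checked outside Keras. The `decay` parameter and its default of 0.8 are assumptions for illustration, not a recommended value:

```python
import numpy as np

def exp_weighted_rmse(y_true, y_pred, decay=0.8):
    """RMSE per sample where step k of the horizon gets weight decay**k.

    `decay` is a hypothetical tuning knob, not something from the post.
    Weights are normalized to sum to 1 so the result stays on the scale
    of the errors.
    """
    horizon = y_true.shape[-1]
    weights = decay ** np.arange(horizon)
    weights = weights / weights.sum()
    return np.sqrt(np.sum(weights * (y_true - y_pred) ** 2, axis=-1))

y_true = np.zeros((2, 12))
y_pred = np.zeros((2, 12))
y_pred[0, 0] = 1.0    # sample 0: error of 1 at the first step
y_pred[1, 11] = 1.0   # sample 1: the same error at the last step
losses = exp_weighted_rmse(y_true, y_pred)
print(losses)  # the first value is larger than the second
```

With decay=1.0 this reduces to the ordinary RMSE over the horizon, which makes it easy to compare against the unweighted baseline.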
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'
                       ),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp,out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks = [PrintDot(), early_stop])
    if plot:
        self.plot_history(history)
All code relevant to constructing and training the network
def normalise_dataset(df, mini, maxi):
    return (df - mini)/(maxi-mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an adamoptimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.
I am pretty sure that your normalization is the issue: you are not normalizing per feature (the de facto industry standard), but across all data.
That means that if two features have very different orders of magnitude (in your case, compare timeinchart with artistscore), the larger one dominates the shared min and max, and the smaller one is squashed into a tiny range near zero.
Instead, you might want to normalize with something like scikit-learn's StandardScaler. It works per column (so you can pass all features at once), and it also scales to unit variance (which assumes something about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
Either scale it together with your test data and split afterwards,
or fit the scaler on the training data only and then use that same scaler to transform your test data.
Never fit_transform your test set separate from training data!
Since you have potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though they might be a subset not exactly following the same properties (due to small sample size etc.)
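To make the "fit on training data, reuse on test data" option concrete, here is a sketch that uses per-column min-max scaling in plain NumPy. This swaps in min-max rather than StandardScaler, to stay close to the question's normalise_dataset. The rows are the seven sample rows from the question with the finalpos column dropped, and the 5/2 split is arbitrary and for illustration only:

```python
import numpy as np

# Columns: chartposition, tagcount, dow, artistscore, timeinchart
features = np.array([
    [121,  3925, 5,   35128,   7],
    [131,  4453, 3,   85545,  25],
    [ 69,  2583, 4,   17594,  24],
    [145,  1165, 3,  292874, 151],
    [ 96,  1679, 5,  102593, 111],
    [134,  3494, 5, 1252058,  37],
    [  6, 34895, 7, 6824048,  22],
], dtype=float)

train, test = features[:5], features[5:]  # arbitrary split (assumption)

# Fit: per-column statistics come from the training rows only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

def scale(rows):
    """Apply the training-set min/max to any rows, column by column."""
    return (rows - col_min) / (col_max - col_min)

train_scaled = scale(train)
test_scaled = scale(test)
print(train_scaled.min(axis=0), train_scaled.max(axis=0))  # each training column spans 0 to 1
print(test_scaled)  # test values may fall outside [0, 1]; that is expected
```

Because every column is scaled by its own range, timeinchart and artistscore both end up on a comparable scale, regardless of their raw magnitudes.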
Additionally, you might want to use a more advanced optimizer, like Adam, or specify a momentum value for your SGD (0.9 is a good choice in practice, as a rule of thumb).
Turns out the error was a really stupid and easy-to-miss bug.
When importing my dataset I shuffle it, but I was accidentally applying the shuffle only to the labels, not to the dataset as a whole.
As a result, each label was assigned to a completely random feature set, so of course the model didn't know what to do with it.
Thanks to @dennlinger for suggesting I look in the place where I eventually found this bug.
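For anyone hitting the same bug, one way to keep features and labels aligned is to draw a single permutation and index both arrays with it. A minimal sketch with made-up arrays (the names and shapes are hypothetical, not the original loading code):

```python
import numpy as np

rng = np.random.default_rng(42)

features = np.arange(10).reshape(5, 2)   # row i is [2*i, 2*i + 1]
labels = features[:, 0] * 10             # label i is tied to row i

# One permutation, applied to both arrays, keeps every row with its label.
order = rng.permutation(len(features))
shuffled_features = features[order]
shuffled_labels = labels[order]

# The pairing survives the shuffle.
print(np.array_equal(shuffled_labels, shuffled_features[:, 0] * 10))
```

A cheap sanity check like the last line, run once after loading, would have caught the mismatch before any training.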
I am using tensorflow to do image recognition on the MNIST dataset. In each training epoch, I picked 10,000 random images and conducted online training with batch size of 1. The recognition rate increased for the first few epochs, however, after several epochs the recognition rate started to drop greatly. (In the first 20 epochs, the recognition rate goes up to ~94%. Afterwards, the recognition rate went from 90->50->40->30->20). What is the reason for this?
Also, with a batch size of 1, the performance is worse than with a batch size of 100 (max recognition rate 94% vs. 96%). I looked through several references, but there seem to be contradictory results on whether small or large batch sizes achieve better performance. Which applies in this situation?
Edit: I also added a figure of the recognition rate on the training dataset and the test dataset. [Figure: Recognition rate vs. epoch]
I have attached a copy of the code below. Thanks for the help!
import tensorflow as tf
import numpy as np
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot = True)
#parameters
n_nodes_hl1 = 500
n_nodes_hl2 = 500
n_nodes_hl3 = 500
n_classes = 10
batch_size = 1
x = tf.placeholder('float', [None, 784])
y = tf.placeholder('float')
#model of neural network
def neural_network_model(data):
    hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, n_nodes_hl1]) , name='l1_w'),
                      'biases': tf.Variable(tf.random_normal([n_nodes_hl1]) , name='l1_b')}
    hidden_2_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl1, n_nodes_hl2]) , name='l2_w'),
                      'biases' :tf.Variable(tf.random_normal([n_nodes_hl2]) , name='l2_b')}
    hidden_3_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl2, n_nodes_hl3]) , name='l3_w'),
                      'biases' :tf.Variable(tf.random_normal([n_nodes_hl3]) , name='l3_b')}
    output_layer = {'weights':tf.Variable(tf.random_normal([n_nodes_hl3, n_classes]) , name='lo_w'),
                    'biases' :tf.Variable(tf.random_normal([n_classes]) , name='lo_b')}
    l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
    l1 = tf.nn.relu(l1)
    l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
    l2 = tf.nn.relu(l2)
    l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
    l3 = tf.nn.relu(l3)
    output = tf.matmul(l3,output_layer['weights']) + output_layer['biases']
    return output

#train neural network
def train_neural_network(x):
    prediction = neural_network_model(x)
    cost = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer().minimize(cost)
    hm_epoches = 100
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(hm_epoches):
            epoch_loss=0
            for batch in range (10000):
                epoch_x, epoch_y=mnist.train.next_batch(batch_size)
                _,c =sess.run([optimizer, cost], feed_dict = {x:epoch_x, y:epoch_y})
                epoch_loss += c
            correct = tf.equal(tf.argmax(prediction, 1), tf.argmax(y,1))
            accuracy = tf.reduce_mean(tf.cast(correct, 'float'))
            print(epoch_loss)
            print('Accuracy_test:', accuracy.eval({x:mnist.test.images, y:mnist.test.labels}))
            print('Accuracy_train:', accuracy.eval({x:mnist.train.images, y:mnist.train.labels}))

train_neural_network(x)
DROPPING ACCURACY
You're over-fitting. This is when the model learns false features that are specific to artifacts of the images in the training data, at the expense of important features. One of the main experimental results of any application is to determine the optimal number of training iterations.
For instance, perhaps 80% of the 7's in your training data happen to have a little extra slant to the right near the bottom of the stem, where 4's and 1's do not. After too much training, your model "decides" that the best way to tell a 7 from another digit is from that extra slant, despite any other features. As a result, some 1's and 4's now get classed as 7's.
BATCH SIZE
Again, the best batch size is one of the experimental results. Typically, a batch size of 1 is too small: this gives the first few input images too much influence on the early weights in kernel or perceptron training. This is a minor case of over-fitting: one item having undue influence on the model. However, it's significant enough to alter your best results by 2%.
You need to balance the batch size with the other hyper-parameters to find the model's "sweet spot": optimum performance first, then shortest training time. In my experience, it's been best to increase the batch size until my time per image degraded. The models I've used most (MNIST, CIFAR-10, AlexNet, GoogleNet, ResNet, VGG, etc.) had very little loss of accuracy once we reached a rather minimal batch size; from there, the training speed was usually a matter of choosing the batch size that best used available RAM.
There are a few possibilities, although you'll need to do some experimentation to find out which it is.
Overfitting
Prune did a good job of explaining this. I'll add that the simplest way to avoid overfitting is to just remove 10-15% of the training set and evaluate the recognition rate on this held out validation set after every few epochs. If you graph the change in recognition rate on both the training and validation sets, you'll eventually reach a point on the graph where the training error keeps going down but the validation error starts going up. Stop training at that point; that's where overfitting is starting in earnest. Note that it's important that there be no overlap between the training/validation/test sets.
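The stopping rule described above can be sketched as a small helper that scans the validation errors and reports the best epoch. The `patience` parameter and the error values are assumptions for illustration; real validation curves are noisier:

```python
def best_stopping_epoch(val_errors, patience=3):
    """Return the index of the epoch with the lowest validation error,
    scanning until the error has failed to improve `patience` times in a row."""
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, error in enumerate(val_errors):
        if error < best_error:
            best_epoch, best_error, waited = epoch, error, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Made-up validation errors: improving at first, then rising again.
val_errors = [0.30, 0.20, 0.12, 0.08, 0.07, 0.09, 0.11, 0.15, 0.22]
print(best_stopping_epoch(val_errors))  # prints 4
```

Allowing a few non-improving epochs before stopping keeps a single noisy evaluation from ending training too early.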
This was more likely before you mentioned that the training error wasn't also decreasing, but it's possible that it's overfitting on a fairly homogeneous part of your training set at the expense of the outliers, or something like this. Try randomizing the order of your training set after each epoch; if it's fitting one section of the set at the expense of the others, this might help.
Addendum: The massive instantaneous drop in quality around epoch 20 makes this even less likely; that is not what overfitting looks like.
Numerical Instability
If you get a particularly incorrect input at a point on the activation function with a large gradient, it's possible to end up with a gigantic weight update that screws up everything it's learned thus far. It's common to put a hard limit on the gradient magnitude for this reason. But you're using AdamOptimizer, which has an epsilon parameter for avoiding instability. I haven't read the paper it references, so I don't know exactly how it works, but the fact that it's there makes instability less likely.
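The "hard limit on the gradient magnitude" mentioned above is usually done by clipping the gradient norm. A minimal NumPy sketch of the idea, where the `max_norm` of 5.0 is an arbitrary assumption and this is not how any particular optimizer implements it:

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

huge = np.array([300.0, -400.0])   # norm 500
small = np.array([0.3, -0.4])      # norm 0.5
print(clip_by_norm(huge, max_norm=5.0))   # rescaled so the norm becomes 5
print(clip_by_norm(small, max_norm=5.0))  # already within the limit, unchanged
```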
Saturated Neurons
Some activation functions have regions with very small gradients, so if you end up with weights such that the function is almost always in that region, you have a tiny gradient and thus can't learn effectively. Sigmoids and Tanh are particularly prone to this since they have flat regions on both sides of the function. ReLUs don't have a flat region on the high end, but do on the low end. Try replacing your activation functions with Softplus; those are similar to ReLU, but with a continuous nonzero gradient.
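To illustrate the difference on the low end, here is a small NumPy comparison of the two derivatives, with input values chosen arbitrarily. ReLU's gradient is exactly zero for negative inputs, while Softplus's gradient (the logistic sigmoid) is small there but never zero:

```python
import numpy as np

def relu_grad(z):
    """Derivative of max(0, z): exactly 0 for negative inputs."""
    return (z > 0).astype(float)

def softplus_grad(z):
    """Derivative of log(1 + exp(z)), which is the logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.5, 5.0])
print("relu:    ", relu_grad(z))
print("softplus:", softplus_grad(z))
```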