Delay gap between reality and prediction - Python

Using machine learning (as libraries I've tried TensorFlow and TFLearn, which I know is just a wrapper around TensorFlow) I'm trying to predict the congestion in an area for the next week (see my previous questions if you want more backstory on it). My training set is composed of 400K tagged entries (a date and a congestion value for each minute).
My problem is that I now have a time gap between predictions and reality.
If I drew a chart of reality versus prediction, you would see that my prediction, while having the same shape as reality, is ahead of it: it increases/decreases before reality does. It started to make me think that maybe my training had a problem; it seems my predictions don't start where my training ended.
Both of my datasets (training/testing) are in 2 different files. First I train on my training set (for convenience's sake let's say it ends at the 100th minute and my testing set starts at the 101st minute); once my model is saved I make my predictions. It should then normally start predicting at minute 101, or am I wrong somewhere? Because it seems like it starts predicting well after my training stopped (keeping my example, it would start predicting at around minute 107, for example).
For now, one bad fix was to remove from the training set as many values as I had of delay (in this example, that would be 7) and it worked: no more delay. But I don't understand why I have this problem or how to fix it so it doesn't happen later.
Following some advice found on different websites, it seems that having gaps in my training dataset (missing timestamps in this case) could be a problem. Seeing that there were indeed some (in total around 7 to 9% of the whole dataset was missing), I've used pandas to add the missing timestamps (giving them the congestion value of the last known timestamp). While I do think it may have helped a little (the gap is smaller), it hasn't fixed the problem.
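For illustration, a minimal pandas sketch of that gap-filling step (column names, toy values and the one-minute frequency are illustrative, not the exact code used):

import pandas as pd

df = pd.DataFrame(
    {"timestamp": ["2018-01-01 00:00", "2018-01-01 00:01", "2018-01-01 00:04"],
     "values": [10.0, 12.0, 9.0]})            # minutes 00:02 and 00:03 are missing
df["timestamp"] = pd.to_datetime(df["timestamp"])
df = df.set_index("timestamp")

full_index = pd.date_range(df.index.min(), df.index.max(), freq="min")
df = df.reindex(full_index).ffill()           # missing minutes get the last known congestion value
print(df)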
I tried multistep forecasting, multivariate forecasting, LSTM, GRU, MLP, TensorFlow, TFLearn, but it changed nothing, making me think the issue could come from my training.
Here is my model training code:
from keras.models import Sequential        # imports assumed (Keras with the TensorFlow backend)
from keras.layers import Dense, LSTM

def fit_lstm(train, batch_size, nb_epoch, neurons):
    # split into input features and target, then reshape to (samples, timesteps, features)
    X, y = train[:, 0:-1], train[:, -1]
    X = X.reshape(X.shape[0], 1, X.shape[1])
    print(X.shape)
    print(y.shape)
    model = Sequential()
    model.add(LSTM(neurons, batch_input_shape=(None, X.shape[1], X.shape[2]), stateful=False))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    # train one epoch at a time without shuffling, resetting states between epochs
    for i in range(nb_epoch):
        model.fit(X, y, epochs=1, batch_size=batch_size, verbose=0, shuffle=False)
        model.reset_states()
    return model
The two shapes are:
(80485, 1, 1)
(80485,)
(In this example I'm using only 80K rows as training data, for speed purposes.)
As parameters I'm using 1 neuron, a batch_size of 64 and 5 epochs.
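For reference, with those parameters the call would look something like this (the random array is only a stand-in for the real two-column training matrix):

import numpy as np

train_scaled = np.random.rand(80485, 2)   # column 0: input feature, column 1: target
lstm_model = fit_lstm(train_scaled, batch_size=64, nb_epoch=5, neurons=1)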
My dataset is made of 2 files. The first is the training file with 2 columns:
timestamp | values
The second has the same shape but is the testing set (kept separate to avoid any influence on my predictions); it is only used once every prediction has been made, to compare reality and prediction. The testing set starts where the training set stops.
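For context, a minimal sketch of reading the two files and checking that the testing set really starts where the training set stops (the file names are hypothetical):

import pandas as pd

train = pd.read_csv("train.csv", parse_dates=["timestamp"])   # hypothetical file names
test = pd.read_csv("test.csv", parse_dates=["timestamp"])

# the testing set should start right after the training set ends
assert train["timestamp"].max() < test["timestamp"].min()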
Do you have an idea of what could be the reason for this problem?
Edit:
In my code I have this function:
# invert differencing
yhat = inverse_difference(raw_values, yhat, len(test_scaled)+1-i)
# invert differenced value
def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]
It's supposed to invert the differencing (to go from a differenced value back to the real one).
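For reference, a minimal toy round trip showing what this inverse is meant to do: difference a short series, then add each difference back onto the previous raw value.

# toy numbers, not the real congestion data
raw_values = [10, 12, 15, 14]
diffs = [raw_values[i] - raw_values[i - 1] for i in range(1, len(raw_values))]  # [2, 3, -1]

def inverse_difference(history, yhat, interval=1):
    return yhat + history[-interval]

# reconstruct the last raw value from the last difference and the value before it
restored = inverse_difference(raw_values[:-1], diffs[-1], interval=1)   # 14 == raw_values[-1]
print(diffs, restored)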
When using it as in the pasted example (with the testing set) I get near-perfect results: accuracy above 95% and no gap.
Since in reality we wouldn't know these values, I had to change it.
I first tried using the training set instead, but got the problem explained in this post:
Why is this happening? Is there an explanation for this problem?

Found it. It was a problem with the inverse_difference(history, yhat, interval=1) function. In fact it made my results look like the last lines of my training data. That is why I had a gap: since there is a pattern in my data (a peak at more or less the same moment), I thought the model was making predictions while it was just giving me back values from the training set.

Related

Why is my Keras model predicting trend but not scale?

I'm making a model to predict the irradiance value on a solar field. The thing is that my model, despite being very simple (code below), performs very well. The problem is that for some reason it predicts on a different scale, giving almost always lower values but following the same trend. I have appended the plots comparing the model output and the real data on the train and test sets, and linked the dataset.
Some details: the dataset has a total of 24 columns which correspond to 24 pyranometers, the sensors that give information about the sun. The model has only been trained with the first one for simplicity, so with more data we could achieve better performance. Also, I'm processing my data to use 15 steps back in time and a prediction window of 20 steps forward.
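For illustration, a minimal numpy sketch of that windowing (15 steps back, 20 steps forward); the toy series and the implementation are my reading of the description, not the exact preprocessing used:

import numpy as np

LAG, HORIZON = 15, 20                      # steps back / steps forward, as described
series = np.arange(100, dtype=float)       # toy stand-in for one pyranometer column

X, y = [], []
for t in range(LAG, len(series) - HORIZON + 1):
    X.append(series[t - LAG:t])            # the 15 past values
    y.append(series[t:t + HORIZON])        # the 20 future values
X = np.array(X)[..., np.newaxis]           # shape (samples, LAG, 1) for the LSTM
y = np.array(y)
print(X.shape, y.shape)                    # (66, 15, 1) (66, 20)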
from tensorflow.keras.layers import Input, LSTM, Dense   # imports assumed (tf.keras API)
from tensorflow.keras.models import Model

input = Input((LAG, 1))  # LAG is the number of steps I take backward
hidden = LSTM(32, return_sequences=True)(input)
output = Dense(1, activation='linear')(hidden)
model = Model(input, output)
Dataset
Model output vs real in train set
Model output vs real in test set

Can a trained ANN (tensorflow) model be made predictable?

I'm new to ANNs, but I've managed to train a convolutional model successfully (using some legacy TensorFlow v1 code) up to ~90% accuracy or so on my data. But when I evaluate (test) it on any given batch, the result is somewhat random, even though it's about 90% correct. I've tried re-evaluating the data N times and averaging (using N between 1 and 25), but each evaluation still differs from the others on 3% to 10% of the data points.
Is there any way to make the evaluation predictable, so that the evaluation of an input batch X always yields the exact same result Y every time I run it (once training is done)?
I'm not sure if it's relevant, but my layers are batch normalized like so:
inp = tf.identity(inp)
channels = inp.get_shape()[-1]
offset = tf.compat.v1.get_variable(
    'offset', [channels],
    dtype=tf.float32,
    initializer=tf.compat.v1.zeros_initializer())
scale = tf.compat.v1.get_variable(
    'scale', [channels],
    dtype=tf.float32,
    initializer=tf.compat.v1.random_normal_initializer(1.0, 0.02))
mean, variance = tf.nn.moments(x=inp, axes=[0, 1], keepdims=False)
variance_epsilon = 1e-5
normalized = tf.nn.batch_normalization(
    inp, mean, variance, offset, scale, variance_epsilon=variance_epsilon)
The scale part is initialized with random data, but I assume that gets loaded when I do tf.compat.v1.train.Saver().restore(session, checkpoint_fname)?
I am assuming you are testing the model on your training batches?
You can't equate the accuracy of a portion of your total training dataset to the accuracy of the whole.
Think of it like a regression problem. If you only take a part of the dataset, there is no guarantee that it would average out close to the full dataset.
If you want consistent accuracy, evaluate on the full dataset.
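As a toy illustration of this point (random numbers, not your model): accuracy measured on a small batch fluctuates from batch to batch, while accuracy measured once on the full set is a single fixed number.

import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=10_000)              # made-up ground truth
correct = rng.random(10_000) < 0.9                    # pretend the model is right ~90% of the time
predictions = np.where(correct, labels, 1 - labels)

full_accuracy = (predictions == labels).mean()        # one fixed number (~0.9)
batch = rng.choice(10_000, size=64, replace=False)
batch_accuracy = (predictions[batch] == labels[batch]).mean()   # varies with the batch drawn

print(full_accuracy, batch_accuracy)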

Regressor Neural Network built with Keras only ever predicts one value

I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I thought this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'
                       ),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing; my MAE and the predicted values are still all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset; finalpos is the value I'm trying to predict. The dataset contains ~40,000 records, split 80/20 training/testing.
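For reference, a minimal sketch of reading such a file and doing the 80/20 split (the file name and the use of scikit-learn here are assumptions, not the exact code used):

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("charts.csv")                 # hypothetical file name
X = df.drop(columns=["finalpos"])              # the 5 input features
y = df["finalpos"]                             # the value to predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)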
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation loss curves for the one-hidden-layer network with an Adam optimiser, learning rate 0.01.
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That means that if you have two features with very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore), the feature with the larger range dominates the normalization and the smaller one ends up compressed into a tiny interval.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also scales to unit variance (which implies some assumptions about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
Either fit and transform the training and test data together, and then split afterwards,
or fit on the training data only, and then use the same fitted scaler to transform your test data (a short sketch follows below).
Never fit_transform your test set separately from the training data!
Since you would then get potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though the test set might be a subset not exactly following the same properties (due to small sample size etc.).
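A minimal sketch of the second option, fitting the scaler on the training data only and reusing it on the test data (toy arrays):

from sklearn.preprocessing import StandardScaler
import numpy as np

X_train = np.array([[1, 40], [2, 80], [3, 60]], dtype=float)   # toy training features
X_test = np.array([[2, 50]], dtype=float)                      # toy test features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit() only ever sees the training data
X_test_scaled = scaler.transform(X_test)         # the test data reuses the training statistics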
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practice, as a rule of thumb) for your SGD.
Turns out the error was a really stupid and easy-to-miss bug.
When I was importing my dataset, I shuffled it; however, I was accidentally applying the shuffling only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, and of course the model didn't know what to do with this.
Thanks to #dennlinger for suggesting I look in the place where I eventually found this bug.
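For reference, a minimal sketch of the general fix: shuffle features and labels with one shared permutation so the rows stay paired (toy arrays, not the actual import code).

import numpy as np

rng = np.random.default_rng(42)
features = np.arange(20).reshape(10, 2)    # toy feature matrix
labels = np.arange(10)                     # toy labels, row-aligned with the features

perm = rng.permutation(len(labels))        # one shared permutation
features, labels = features[perm], labels[perm]   # each row keeps its own label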

Neural Network model validation accuracy and training accuracy never change

The Task:
A friend and I have been tasked with making a neural net for an old Kaggle competition, which is a regression problem: you have to predict a shop's sales for up to x months ahead (linked here).
The Dataset:
We've decided to use Keras and Pandas to preprocess the train dataset. The train dataset has the following fields
(the text inside parentheses is NOT in the dataset; it only shows you the range and data type):
Store(1-1115) DayOfWeek(1-7) Date(YYYY-MM-DD) Sales(int) Customers(int) Open(0 OR 1) Promo(0 OR 1) StateHoliday(0,a,b,c) SchoolHoliday(0 OR 1)
How We Process/Make The Datasets:
Train.csv
Y:
So we decided to delete 'Sales' from the training dataset and use it as Y, the target, which brings me to the first question: should we normalise the Y/Sales target data?
X:
Now we process the rest of the training data as follows (a short sketch of these steps follows the list).
We remove the date as we thought it would not provide much use, especially if we put the data in order.
We have changed StateHoliday to a binary (0 or 1) value to state whether it is/isn't a state holiday. We have also changed it to 0.25/0.5/0.75/1.0 to represent the different state holiday types; we did this in a different iteration of the dataset (the different-iteration part is important later).
We then normalised Customers to between 0 and 1, just like the sales.
We also removed SchoolHoliday as we thought it didn't hold a strong enough relation to the sales target.
We also removed Open, since a shop that is shut gets no sales and it just didn't make sense to keep it.
We also got rid of DayOfWeek in one iteration of the dataset and kept it in another.
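A minimal pandas sketch of the steps above (the column names come from the dataset; the implementation details here are only illustrative, not the code we actually ran):

import pandas as pd

df = pd.read_csv("train.csv")                              # the competition's train file

y = df.pop("Sales")                                        # target, kept aside as Y
df = df.drop(columns=["Date", "SchoolHoliday", "Open"])    # dropped as described above
df["StateHoliday"] = (df["StateHoliday"].astype(str) != "0").astype(int)   # binary is/isn't a holiday
df["Customers"] = (df["Customers"] - df["Customers"].min()) / (
    df["Customers"].max() - df["Customers"].min())         # normalise Customers to 0-1
# DayOfWeek was kept in one iteration of the dataset and dropped in another:
# df = df.drop(columns=["DayOfWeek"])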
Now my friend is much more of a mathematician than I am; I'm more of a computer scientist. So we also used another dataset from the competition, called Store.csv, which looks like the following:
Store(1-1115) StoreType(a-d) Assortment(a-c) CompetitionDistance(int Meters) CompetitionOpenSinceMonth(int M) CompetitionOpenSinceYear(int YYYY) Promo2(0 OR 1) Promo2SinceWeek(int) Promo2SinceYear(YYYY) PromoInterval(Str months seperated by csv)
Now my friend wrote a distance function to make N clusters from our training dataset (we settled on 5 clusters). In his words:
"I put the most similar stores together based on customer distance, store type, assortment and promo."
We then have 5 datasets, split by our distance function.
From each of the 5 datasets we remove the following:
Clusters
Store
Sales - saved as the Y/target
So now our final dataset looks like the following:
DayOfWeek(1-7) Customers(normalised) Promo(0 OR 1) StateHoliday(0 OR 1)
The Problem
So far, when we train our neural network, the accuracy and validation accuracy just converge to a number and do not budge, no matter what epochs, learning rates or momentum we set. We have tried different clusters and they still converge, just to different numbers. We made one big dataset with all the clusters and the numbers still converge.
We made a big dataset with no cluster separation, and the accuracy and validation accuracy still reach a fixed number after 2/3/4 epochs and never change.
The network we made looks like the following:
from keras.models import Sequential        # imports assumed
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(3, activation='relu'))
model.add(Dense(1, activation='relu'))
# model.add(Dense(1, activation='linear'))
model.compile(loss="MSE", optimizer="nadam", metrics=['accuracy'])
model.fit(x=xt.as_matrix(), y=yt.as_matrix(),
          validation_data=(xv.as_matrix(), yv.as_matrix()), epochs=5000)
print("stop")
We've even tried adding/removing a few features and the same problem still happens.
Questions
Why are the numbers converging no matter how we change the dataset?
Could this be a network shape/activation function/loss/optimizer issue?
Is the problem with the dataset?
Is the clustering idea a good way to pre-process the data?
How would you process this data to get good results?
What type of layers/neurons would be appropriate?
Should we normalize the data?

For Keras LSTM, what is the difference in passing in lag features vs timesteps of features?

I'm getting acquainted with LSTMs and I need clarity on something. I'm modeling a time series using t-300:t-1 to predict t:t+60. My first approach was to set up an LSTM like this:
# fake dataset to put words into code:
X = [[1,2...299,300],[2,3,...300,301],...]
y = [[301,302...359,360],[302,303...360,361],...]
# LSTM requires (num_samples, timesteps, num_features)
X = X.reshape(X.shape[0],1,X.shape[1])
model = Sequential()
model.add(LSTM(n_neurons[0], batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(y.shape[1]))
model.compile(loss='mse', optimizer='adam')
model.fit(X, y, epochs=1, batch_size=1, verbose=1, shuffle=False)
With my real dataset, the results have been suboptimal, and on CPU it was able to train 1 epoch of around 400,000 samples in 20 minutes. The network converged quickly after a single epoch, and for any set of points I fed it, the same results would come out.
My latest change has been to reshape X in the following way:
X = X.reshape(X.shape[0],X.shape[1],1)
Training is noticeably slower (I have not tried it on the full dataset yet): it takes about 5 minutes to train a single epoch of 2,800 samples. I toyed around with a smaller subset of my real data and a smaller number of epochs and it seems promising; I am not getting the same output for different inputs.
Can anyone help me understand what is happening here?
In Keras, the timesteps dimension in (num_samples, timesteps, num_features) determines how many steps BPTT (backpropagation through time) will propagate the error back.
This, in turn, takes more time to do, hence the slowdown you are observing.
X.reshape(X.shape[0], X.shape[1], 1) is the right thing to do in your case, since what you have is a single feature with 300 timesteps.
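A minimal numpy sketch of the two reshapes discussed above, using a toy array:

import numpy as np

X = np.arange(600).reshape(2, 300)                    # 2 samples, 300 lagged values each

as_features = X.reshape(X.shape[0], 1, X.shape[1])    # (2, 1, 300): 1 timestep, 300 features
as_timesteps = X.reshape(X.shape[0], X.shape[1], 1)   # (2, 300, 1): 300 timesteps, 1 feature

print(as_features.shape, as_timesteps.shape)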
