LSTM - Multivariate Time Series Predictions

LSTM - Multivariate Time Series Predictions - python

I was reading the tutorial on Multivariate Time Series Forecasting with LSTMs in Keras
https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/#comment-442845
I have followed through the entire tutorial and got stuck with a problem which is as follows-
In this tutorial, the train and test splits have 8 features viz., 'pollution', 'dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain' at step 't-1', while the output feature is 'pollution' at current step 't'.
This is because, the framing of the dataset as a supervised learning problem is about predicting the 'pollution' at current hour/time step 't', given the pollution and weather measurements at the prior hour/time step 't-1'
After fitting the model to the training and testing data splits, what if I want to make predictions for a new dataset having 7 features since it does not have 'pollution' feature in it and I explicitly just want to predict for this one feature using the other 7 features.
Thanks for your help!
How do I handle such a situation? (while the remaining 7 features remain the same)
Edit-
Assume that my dataset has the following 3 features while training/fitting the model-
shop_number, item_number, number_of_units_sold
AFTER, I have trained the LSTM model, I get a dataset having the features-
'shop_number' AND 'item_number'.
The dataset DOES NOT have 'number_of_units_sold'.
The 'input_shape' argument in 'LSTM' has 1 as time step and 3 as features while training.
But while predicting, I have 1 time step but ONLY 2 features (as 'number_of_units_sold' is what I have to predict).
So how should I proceed?

If pollution is the last feature:
X = original_data[:,:,:-1]
Y = original_data[:,:,-1:]
If pollution is the first feature
X = original_data[:,:,1:]
Y = original_data[:,:,:1]
Else
i = index_of_pollution_feature
X = np.concatenate([original_data[:,:,:i], original_data[:,:,i+1:],axis=-1)
Y = original_data[:,:,i:i+1]
Make a model with return_sequences=True, stative=False and that's it.
Don't use Flatten, Global poolings or anything that removes the steps dimension.
If you don't have any pollution data at all for training, then you can't.

Related

Binary classification of multivariate time series in the form of panel data using LSTM

Problem definition
Dear community, I need your help in implementing an LSTM neural network for a classification problem of panel data using Keras. The panel data I am manipulating consists of ids (let's call it id), a timestep for each id (t), n time varying covariates and a binary outcome y. Each id contains a number of timesteps and for each timestep I have my covariates and a unique outcome (0 or 1). I have reason to believe that each covariate for each id can have a certain degree of autocorrelation and henceforth can be considered a small timeseries of t steps. For simplicity, I consider that each id has a fixed number of t observations) with t not a big number (about 10 or so).
Data
Below is a toy example of what the data might look like in my case. In this example, the parameters are 2 individuals, 4 timesteps each, 4 covariates and each observation has a unique binary outcome. Covariates may be considered as (short) timeseries since they might be autocorrelated.
print(df)
[out]:
A B C D y
id t
id1 1 1.054127 0.346052 1.091299 -0.058137 0.0
2 0.621390 -0.204682 -1.056786 0.864572 0.0
3 1.275124 2.473959 0.264029 -1.047810 0.0
4 -0.328441 -0.135891 0.148498 0.470876 1.0
id2 1 0.362969 0.777082 0.197423 -2.225296 0.0
2 0.227134 0.086731 0.550267 -0.361482 0.0
3 0.223526 0.556242 -0.160042 0.675871 1.0
4 0.070125 0.156659 -2.922709 -1.143887 1.0
I have reason to assume that, for id1, the target at timestep 4 is conditional on the three previous timesteps for that same individual (id1). In addition, The target variable y may contain more than one value of 1 for each individual (as outlined in the case of id2 above). I do not have reason to believe that the data from an individual would affect the result of another (as with many behavior analysis scenarios since every individual is unique).
Prediction problem
What I would like to do is to predict a single outcome for a new individual for whom I have those 4 rows of observation. In other words, based on the historical data of an individual, I would like to know if said individual is likely to have an outcome 1 or 0. If I understand correctly, this can be achieved using an LSTM (alternatively, an RNN) with some data manipulation.
Things I have tried so far
To start simple, I have tried aggregating every set of id rows into a single row with a single outcome and applied a typical statistical learning approach such as boosted trees and got a model as good as random.
I looked into shaping it as a survival analysis problem, in vain. I would not be interested in any estimation of a survival function unlike tutorials on how to handle panel data in the medical field (nor would I have access to such data).
I have tried reshaping my data such that the input is a 3D array in the form of [observations, timesteps, features] where observations are unique ids for an LSTM like so in python :
# separate into features and target
df_feat = df.drop("y", axis = 1)
df_target = df[["y"]]
# get reshaped values for 3D tensor
n_samples = len(df_feat.index.get_level_values('id').unique().tolist())
n_timesteps = 4
n_features = df_feat.shape[1]
# reshape input array to be 3D
X_3D = df_feat.to_numpy().reshape(n_samples, n_timesteps, n_features)
print(X_3D.shape)
[out]:
(2, 4, 4)
However, at this point I get confused as to what my learning instances for the LSTM are and what the outcome y should be shaped like. I have tried having a shape like one outcome per training instance by taking only the last observation for each id (so y=[1,1] and y.shape = (2,) in the toy example above) which technically makes an LSTM script run... but does not capture prior information. Below is the code for such LSTM:
def train_lstm(X_train, y_train, X_valid, y_valid, save_name='best_lstm.h5'):
# starts a sequential model
model = Sequential()
# add first lstm hidden layer with 64 units and default keras params
model.add(LSTM(64, input_shape = (X_train.shape[1], X_train.shape[2]), return_sequences=True))
# add a second hidden lstm layer with 128 units and default keras params
model.add(LSTM(128, return_sequences = True))
# add one last hidden layer
model.add(LSTM(64))
# add one dense layer with 2 units and a sigmoid activation function
model.add(Dense(2, activation = 'sigmoid'))
# define adam optimiser with learning rate
opt = tf.keras.optimizers.Adam(learning_rate = 0.01)
# compile model with binary cross entropy as loss function and accuracy as metrics
model.compile(optimizer = opt, loss = 'binary_crossentropy', metrics = ['accuracy'])
# define early stopping and best model checkpoint parameters
es = EarlyStopping(monitor = 'val_loss', mode = 'min', verbose = 0, patience = 20)
mc = ModelCheckpoint(save_name, monitor = 'val_accuracy', mode = 'max', verbose = 0, save_best_only = True)
# train the model using fit method (target vector is one-hot encoded as required by keras)
history = model.fit(X_train, tf.one_hot(y_train, depth = 2),
validation_data = (X_valid, tf.one_hot(y_valid, depth = 2)),
epochs = 100, callbacks = [es, mc])
return history
It runs and it makes predictions the way I want them to (for one id of previous history, we can predict one outcome) but results in poor performance since it fails to capture outcomes prior to the last.
I have carefully read and followed this nicely written medium article by Alexander Laskorunsky which remotely resembles what I am trying to do, and slides the window of K-length frames to capture the prior outcomes (and not just the last as I have done which makes more sense). However, in Alexander's case, he does not consider panel data but rather a multivariate timeseries classification that uses n_timesteps to predict the target using all predictors and all rows even if it overlaps (so not using panel data).
Questions
Am I right to believe that I need a many to one LSTM architecture?
How may I divide and reshape training and testing samples such that a new, previously unseen individual which would not be related in any way to other ids can be classified?
Should each id be considered as one sample / training instance? Should each id be split into training and testing sets and concatenate all training and testing sets to feed to an LSTM architecture?
Would you be so kind as to provide code snippets on how to correctly split and reshape my data as well as a simple LSTM architecture using keras (or maybe modify my own function above in case I coded it wrong)? No need for basic preprocessing and encoding variables.
Any help or advice / tutorials / articles regarding what architecture is most suitable for that kind of problem is greatly appreciated and thank you in advance for your help!

Why is my Keras model predicting trend but not scale?

I'm making a model to predict a the irradiance value on a solar field. The thing is that my model, despite being very simple (added code below), performs very well. The problem is that for any reason, it predicts a different scale, giving almos always lower values but in the same trend. I have appended the plot which compares both outputs and real data, in train and test set. Also linked the dataset.
Some details: The model has a total of 24 columns which correspond to 24 pyranometers which are the ones that gives information about the sun. The model has just been trained with the first one for simplicity, therefore with more data we can achieve better performance. Also, I'm processing my data to have a 15 steps back in time and a predict window of 20 steps forward.
input = Input((LAG,1)) # LAG is the number of steps I take backward
hidden = LSTM(32, return_sequences=True)(input)
output = Dense(1, activation='linear')(hidden)
model = Model(input, output)
Dataset
Model output vs real in train set
Model output vs real in test set

Regressor Neural Network built with Keras only ever predicts one value

I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
keras.layers.Dense(4,
activation='relu',
input_dim=num_features,
kernel_initializer='random_uniform',
bias_initializer='random_uniform'
),
keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE, the predicted value are all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset, finalpos is the value I'm trying to predict. Dataset contains ~40,000 records, split 80/20 - training/testing
def __init__(self, validation_split, num_features, should_log):
self.should_log = should_log
self.validation_split = validation_split
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
optimizer = tf.train.GradientDescentOptimizer(0.01)
self.model.compile(loss='mae',
optimizer=optimizer,
metrics=['mae'])
def train(self, data, labels, plot=False):
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
history = self.model.fit(data,
labels,
epochs=self.epochs,
validation_split=self.validation_split,
verbose=0,
callbacks = [PrintDot(), early_stop])
if plot: self.plot_history(history)
All code relevant to constructing and training the networ
def normalise_dataset(df, mini, maxi):
return (df - mini)/(maxi-mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my loss vs validation curves with the one hidden layer network with an adamoptimiser, learning rate 0.01
Same graph but with linear regression and a gradient descent optimiser.

So I am pretty sure that your normalization is the issue: You are not normalizing by feature (as is the de-fact industry standard), but across all data.
That means, if you have two different features that have very different orders of magnitude/ranges (in your case, compare timeinchart with artistscore.
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also does unit variance (which is some assumption about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
Either scale them together with your training data, and then split afterwards,
or you instead only fit the training data, and then use the same scaler to transform your test data.
Never fit_transform your test set separate from training data!
Since you have potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though they might be a subset not exactly following the same properties (due to small sample size etc.)
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum property (0.9 is a good choice in practic, as a rule of thumb) for your SGD.

Turns out the error was a really stupid and easy to miss bug.
When I was importing my dataset, I shuffle it, however when I performed the shuffling, I was accidentally applying the shuffling only to the labels set, not the whole dataset as a whole.
As a result, each label was being assigned to a completely random feature set, of course the model didn't know what to do with this.
Thanks to #dennlinger for suggesting for me to look in the place where I eventually found this bug.

Trend "Predictor" in Python?

I'm currently working with data frames (in pandas) that have 2 columns: the first column is some numeric quantitative data, like weight, amount of money spent on some day, GPA, etc., and the second column are date values, i.e. the date on which the corresponding column 1 entry was added on.
I was wondering, is there a way to "predict" what the next value after time X is going to be in Python? E.g. if I have 100 weight entries spanning over 2-3 months (not all entries have the same time difference, so 1 entry could be during Day 3, the next Day 5, and the next Day 10), and wanted to "predict" what my next entry after 1 month, is there a way to do that?
I think this has something to do with Time Series Analysis, but my statistical background isn't very strong, so I don't know if that's the right approach. If it is, how could I apply it to my data frames (i.e. which packages)? Would there be any significance to the value it potentially returns, or would it be meaningless in the context of what I'm working with? Thank you.

For predicting time-series data, I feel the best choice would be a LSTM, which is a type of recurrent neural network, which are well suited for time-series regression.
If you don't want to dive deep into the backend of neural networks, I suggest using the Keras library, which is a wrapper for the Tensorflow framework.
Lets say you have a 1-D array of values and you want to predict the next value. Code in Keras could look like:
#start off by building the training data, let arr = the list of values
X = []
y = []
for i in range(len(arr)-100-1):
X.append(arr[i:i+100]) #get prev 100 values for the X
y.append(arr[i+100]) # predict next value for Y
Since an LSTM takes a 3-D input, we want to reshape our X data to have 3 dimensions:
import numpy as np
X = np.array(X)
X = X.reshape(len(X), len(X[0]), 1)
Now X is in the form (samples, timesteps, features)
Here we can build a neural network using keras:
from keras.models import Sequential
from keras.layers import Dense, LSTM
model = Sequential()
model.add(LSTM(input_shape = (len(X[0], 1)) #input 3-D timeseries data
model.add(Dense(1)) #output 1-D vector of predicted values
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X, y)
And viola, you can use your model to predict the next values in your data

Statsmodels is a python module that provides one of the "most famous" methods in time series forecasting (Arima).
An example can be seen in the following link : https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/
Other methods for time series forecasting are available in some libraries, like support vector regression, Holt-Winters and Simple Exponential Smoothing.
Spark-ts (https://github.com/sryza/spark-timeseries) is one time series library that supports Python , and provides methods like Arima, Holt-Winters and Exponential Weighted Moving Average.
Libsvm (https://github.com/cjlin1/libsvm) provides support vector regression methods.

LSTM validation

I have a dataset with 100k rows, which are the pairs of store-item numbers (eg. (store 1, item 190)), 300 columns, which are a series of dates (eg. 2017-01-01, 2017-01-02, 2017-01-03 ...). Values are the sales.
I tried to use LSTM keras to predict future sales, how can I construct my train and validation dataset?
I am thinking to split train and validation like data[:, :n_days] and data[:, n_days:]. So I will have same number of samples (100k) in both my train and validation dataset. Do I think it wrong?
If this is the way, how should I define n_days, should the train and validation dataset be exactly the same dimensions? (something like, n_days = 150, 149 days used to predict 1 day).

how can I construct my train and validation dataset?
Not sure if a rule of thumb, but a common approach is to split your dataset into a ~80% training set and ~20% validation set; in your case this would be approximately 80k and 20k. The actual percentages may vary, but that ratio is the one most sources recommend. Ideally you would also want to have a test dataset, one that you never used during training or validation, to evaluate the final performance of your models.
Now, regarding the shape of your data it is important to recall what the keras docs on Recurrent Layers says:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
Defining this shape would depend on the nature of your problem. You mention you want to predict sales, so this can be phrased as a Regression Problem. You also mention your data consists of 300 columns that make up your time series, and naturally you have the real sales value for each of those rows.
In this case, using a batch size of 1, your shape seems will be (1, 300, 1). Which means you are training on batches of 1 element (the most thorough Gradient update), where each has 300 time steps and 1 feature or dimension on each time step.
For splitting your data one option you can use that has helped me before is the train_test_split method from Sklearn, where you simply pass your data and labels and indicate the ratio you want:
from sklearn.cross_validation import train_test_split
#Split your data to have 15% validation split
X, X_val, Y, Y_val = train_test_split(data, labels, test_size=0.15)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.