How to label time series data for CNN - python

I want to classify time series data using a CNN (given model).
My input signal data is given as shape = (nr. channels x samples x nr. trials) where:
nr. channels = number of sensors recording simultaneously
samples = I split my signal into 3s windows so all training examples have the same length (3s x sample rate)
nr. trials = by splitting the signal into windows, I obtain X chunks
The time series/signal data is subject based: for each subject I have numerous recorded signals. The labels are also subject based rather than signal based (so it's not an event classification problem, but more of a global subject classification one), meaning that every recording of a given subject gets the same label. Example:
SUBJECT 1 -> 12 recorded signals of about 2 min each -> split into 3s windows -> all of which will have the same label
So I have an Excel table where each subject has an associated class. I have extracted these labels and tried shaping them as (1 x samples x nr. trials).
The model I'm using requires the NHWC (trials, channels, samples, kernels) format so I have reshaped my training data as:
X train shape = (2642, 14, 384, 1)
y train shape = (2642, 1, 384, 1)
But I get:
ValueError: Data cardinality is ambiguous:
x sizes: 14
y sizes: 1
Make sure all arrays contain the same number of samples.
How should I shape my labels? Do I reshape them to (2642, 14, 384, 1), the same shape as the training data, or something else?

Ended up solving it by trying different things.
The labels array, y_train, had to be shaped as
(trials, 1)
where trials in my case is the number of windows generated by windowing the signal.
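For anyone hitting the same problem, here is a minimal sketch of that fix (the shapes are the ones from the question; subject_label is an illustrative class id taken from the Excel table):
import numpy as np

n_trials = 2642  # number of 3s windows
X_train = np.random.random((n_trials, 14, 384, 1))  # placeholder for the real windowed signals

# every window cut from a subject's recordings gets that subject's class
subject_label = 3  # illustrative
y_train = np.full((n_trials, 1), subject_label)

print(X_train.shape, y_train.shape)  # (2642, 14, 384, 1) (2642, 1)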

Related

How to predict on new data using a trained and saved Feedforward NN

I am trying to make predictions on new data, using a trained and saved model. My new data does not have the same shape as the data used to build the saved model.
I have tried using model.save() as well as model.save_weights(), as I still want to keep the training configurations, but they both produce the same error.
Is there a way to use one's saved model on new data even if the shape is not the same?
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Dense

model = Sequential([
    Dense(units=11, activation='relu', input_shape=(42,),
          kernel_regularizer=keras.regularizers.l2(0.001)),
    Dense(units=1, activation='sigmoid')
])
model.load_weights('Fin_weights.h5')
y_pred = model.predict(X)
ValueError: Error when checking input: expected dense_6_input to have shape (44,) but got array with shape (42,)
No, you have to match the input shape exactly.
Both your model code (the model = Sequential([...]) lines) should correspond exactly to your saved model, and your input data (X in the y_pred = model.predict(X) line) should have the same shape as the data the saved model ('Fin_weights.h5') was trained on.
The only thing you can do is somehow pad your new data, e.g. with zeros. But this can help only if the rest of the values correspond to the same features or signals.
Let's imagine, for example, that you trained a NN to recognize grayscale images of shape (2, 3), like below:
1 2 3
4 5 6
Then you trained the model and saved it for later use. Afterwards you decide that you want to use your NN on images of a smaller or bigger size, like this
1 2
3 4
or this
1 2 3 4
5 6 7 8
9 10 11 12
And you're almost sure that your NN will still give good results on differently shaped input.
Then you just pad the first mismatched image with extra zeros on the right, like this:
1 2 0
3 4 0
or another way of padding, on the left side
0 1 2
0 3 4
and the second image you crop a bit:
1 2 3
5 6 7
(or crop it from other sides).
Only then can you apply your NN to these processed input images.
The same applies in your case: you have to add two zeros, but only if it is almost the same sequence of encoded input signals or features.
If your data for prediction is of the wrong size, do this:
import numpy as np
y_pred = model.predict(
    np.pad(X, ((0, 0), (0, 2)))
)
This pads your data with two zeros on the right side, although you might want to pad on the left ((2, 0) instead of (0, 2)) or on both sides ((1, 1) instead of (0, 2)).
If your saved weights have a different shape than your model code, change the model code to match (change 42 --> 44):
model = Sequential([
    Dense(units=11, activation='relu', input_shape=(44,),
          kernel_regularizer=keras.regularizers.l2(0.001)),
    Dense(units=1, activation='sigmoid')
])
You should probably do both things above to match your saved model/weights.
If a NN trained on inputs of 44 numbers gives totally wrong results for any padding of the 42-value data, then the only way out is to re-train your NN on 42 inputs and save the model again.
But you have to take into account that input_shape = (44,) in Keras actually means that the final array X fed into model.predict(X) should have a 2-dimensional shape like (10, 44) (where 10 is the number of different objects to be recognized by your NN). Keras hides the 0-th dimension, the so-called batch dimension. The batch (0-th) dimension can vary: you may feed 5 objects (an array of shape (5, 44)) or 7 (shape (7, 44)) or any other number. Batching only means that Keras processes several objects in one call, in parallel, to be fast and efficient. But each single object is a 1-dimensional sub-array of shape (44,).
You probably misunderstood how data is fed to and represented in the network. 44 is not the size of the dataset (the number of objects); it is the number of traits of a single object. For example, if the network recognizes/categorizes one human, then 44 can mean 44 characteristics of just one human: age, gender, height, weight, month of birth, race, skin color, calories per day, monthly income, monthly spending, salary, etc., totalling 44 fixed characteristics of one human object. They probably don't change. But if you get some other data with just 42 or 36 characteristics, you need to place zeros exactly at the positions of the characteristics that are missing out of the 44. It won't be correct to pad with zeros on the right or left; you must place the zeros exactly at the missing positions.
Alternatively, your 44, 42 and 36 might mean the number of different input objects, each having just 1 characteristic. Imagine a task where you have a dataset (table) of 50 humans with just two columns of data, salary and country, and you want to build a NN that guesses country from salary. Then you'd have input_shape = (1,) (corresponding to a 1-D array of 1 number, the salary), but definitely not input_shape = (50,) (the number of humans in the table). input_shape gives the shape of just one object, one human. 50 is the number of objects (humans), and it is the batch (0-th) dimension of the numpy array fed in for prediction; hence your X array for model.predict(X) has shape (50, 1), while input_shape = (1,) in the model. Basically, Keras omits (hides) the 0-th batch dimension. If 44 in your case actually meant the dataset size (number of objects), then you trained the NN wrongly and it should be retrained with input_shape = (1,); 44 goes into the batch dimension, and may vary depending on the size of the training or testing dataset.
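As a quick illustration of this batch dimension, here is a small sketch (the layer sizes are just the ones from the question; the data is random):
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# input_shape describes ONE object: 44 features each
model = Sequential([
    Dense(units=11, activation='relu', input_shape=(44,)),
    Dense(units=1, activation='sigmoid')
])

# the batch (0-th) dimension can vary freely: 5 objects here, 7 there
print(model.predict(np.random.random((5, 44))).shape)  # (5, 1)
print(model.predict(np.random.random((7, 44))).shape)  # (7, 1)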
If you're going to re-train your network, then the whole training/evaluation process, in simple form, is as follows:
Suppose you have a dataset in a CSV file, data.csv, with, for example, 126 rows and 17 columns in total.
1. Read in your data somehow, e.g. by np.loadtxt, pd.read_csv or standard Python csv.reader(). Convert the data to numbers (floats).
2. Split your data by rows, randomly, into two parts, training/evaluation, sized at approximately 90%/10% of the rows, e.g. 110 rows for training and 16 for evaluation (out of 126 in total).
3. Decide which columns of your data will be predicted. You can predict any number of columns; let's say we want to predict two columns, the 16th and 17th. Now your columns are split into two parts: X (15 columns, numbered 1-15) and Y (2 columns, numbered 16-17).
4. In the code of your network layers, set input_shape = (15,) (15 is the number of columns in X) in the first layer and Dense(2) in the last layer (2 is the number of columns in Y).
5. Train your network on the training dataset using the model.fit(X, Y, epochs = 1000, ...) method.
6. Save the trained network to a model file through model.save(...), e.g. to a file like net.h5.
7. Load your network back through tensorflow.keras.models.load_model(...).
8. Test the network quality through predicted_Y = model.predict(testing_X) and compare it to testing_Y. If the network model was chosen correctly, testing_Y should be close to predicted_Y, e.g. 80% correct (this ratio is called accuracy).
Why do we split the dataset into training/testing parts? Because the training stage only sees the training sub-part of the dataset. The task of network training is to remember the whole training data well, plus to generalize its predictions by finding hidden dependencies between X and Y. So calling model.predict(...) on the training data should give close to 100% accuracy, because the network has seen all this training data and remembers it. But it doesn't see the testing data at all, hence it has to be clever and really predict the testing Y from X, which is why the testing accuracy is lower, e.g. 80%.
If the quality of the testing results is not great, you have to improve your network architecture and re-run the whole training process from the start.
If you need to predict on partial data, e.g. when your X data has only 12 out of the 15 possible columns, fill in the missing column values with zeros; e.g. if you're missing columns 7 and 11, insert zeros at the 7th and 11th positions, so that the total number of columns is 15 again. Your network supports as input to model.predict() only exactly the number of columns it was trained with, i.e. 15, the number provided in input_shape = (15,).
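Putting those steps together, a minimal sketch (the file name, column counts and 90/10 split are the example numbers from the steps above; the layer sizes and loss are arbitrary choices, and data.csv is assumed to be fully numeric):
import numpy as np
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

data = np.loadtxt('data.csv', delimiter=',')   # 126 rows x 17 columns
np.random.shuffle(data)                        # randomize rows before splitting
train, test = data[:110], data[110:]           # ~90% / ~10%
X, Y = train[:, :15], train[:, 15:]            # 15 input columns, 2 target columns
testing_X, testing_Y = test[:, :15], test[:, 15:]

model = Sequential([
    Dense(units=32, activation='relu', input_shape=(15,)),
    Dense(units=2)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X, Y, epochs=1000, verbose=0)

model.save('net.h5')                           # save the trained network
model = load_model('net.h5')                   # load it back later
predicted_Y = model.predict(testing_X)         # compare against testing_Y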

How to shape input array appropriately for ML

I apologize in advance if this question seems silly, but I am trying to understand a machinelearningmastery blog post about LSTM-type ML algorithms, more specifically how the input data gets reshaped. And I don't have a ton of wisdom here on the subject, or a CS degree for that matter.
About halfway into the blog post, in the CNN-LSTM section, Jason talks about:
The first step is to split the input sequences into subsequences that
can be processed by the CNN model. For example, we can first split our
univariate time series data into input/output samples with four steps
as input and one as output. Each sample can then be split into two
sub-samples, each with two time steps. The CNN can interpret each
subsequence of two time steps and provide a time series of
interpretations of the subsequences to the LSTM model to process as
input.
We can parameterize this and define the number of subsequences as
n_seq and the number of time steps per subsequence as n_steps. The
input data can then be reshaped to have the required structure:
[samples, subsequences, timesteps, features]
My question is: is it a requirement for the data to be shaped into only 4 steps, or can it be larger? The code below will print the array; I am using my own data sample here from my git account.
import pandas as pd
import numpy as np
# univariate data preparation
from numpy import array
df = pd.read_csv("trainData.csv")
df = df[['kW']].shift(-1)
df = df.dropna()
raw_seq = df.values.tolist()
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence) - 1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# define input sequence
#raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# choose a number of time steps
n_steps = 4
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# reshape from [samples, timesteps] into [samples, subsequences, timesteps, features]
n_features = 1
n_seq = 2
n_steps = 2
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
The code above works, but when I change n_steps = 7 (from 4) I get this shape error:
File "convArray.py", line 39, in <module>
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
ValueError: cannot reshape array of size 2499 into shape (357,2,2,1)
The reason I want to try to use 7 time steps is that the data I am experimenting with is a building's electrical demand per day, and 7 days in a week would be an ideal experimental time step!
Any tips greatly appreciated
The problem is stated in the error message. You are trying to reshape the array to a shape it cannot take: X has 2499 elements, and 2499 cannot be reshaped to (357, 2, 2, 1), because the product of the numbers in a shape is the total number of elements, and (357, 2, 2, 1) would hold 357*2*2*1 = 1428 elements.
Here is why: with n_steps = 7, split_sequence() returns 357 windows of 7 values each (357*7 = 2499 elements), but the reshape still asks for n_seq = 2 sub-sequences of n_steps = 2, i.e. only 4 values per window. The constraint is that n_seq * n_steps in the reshape must equal the window length you passed to split_sequence(). A window of 4 splits as 2 x 2, but a window of 7 only splits as 7 x 1 or 1 x 7; for a 2 x something split you need an even window length, e.g. 8 = 2 x 4.
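To make that concrete, here is a small sketch reusing the split_sequence() function from the question (raw_seq is synthetic stand-in data):
import numpy as np

raw_seq = list(range(364))  # synthetic stand-in for the kW column

# weekly windows, 7 sub-sequences of 1 time step each: 7 = 7 * 1
X, y = split_sequence(raw_seq, n_steps=7)
X = X.reshape((X.shape[0], 7, 1, 1))  # [samples, subsequences, timesteps, features]

# or keep two sub-sequences by choosing an even window length: 8 = 2 * 4
X, y = split_sequence(raw_seq, n_steps=8)
X = X.reshape((X.shape[0], 2, 4, 1))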

What tensorflow distribution to represent a list of categorical data

I want to construct a variational autoencoder where one sample is an N*M matrix where for each row, there are M categories. Essentially one sample is a list of categorical data where only one category can be selected - a list of one-hot vectors.
Currently, I have a working autoencoder for this type of data - I use a softmax on the last dimension to create this constraint and it works (reconstruction cross entropy is low).
Now, I want to use tf.distributions to create a variational autoencoder. I was wondering what kind of distribution would be appropriate.
Does tf.contrib.distributions.Categorical satisfy your needs? Samples take values from 0 to n - 1, where n is the number of categories.
Example:
# logits has shape [N, M], where M is the number of classes
dist = tf.contrib.distributions.Categorical(logits=logits)
# Sample 20 times. Should give shape [20, N].
samples = dist.sample(20)
# depth is the number of categories.
one_hots = tf.one_hot(samples, depth=M)

How to reshape input for keras LSTM?

I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so its shape is (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time series prediction and I'm stuck on the input shape. I'm trying to predict a value 1 time step after the last data point. How do I reshape it into the 3D form for an LSTM model in Keras?
Also, it would be much more helpful if a small sample program were written. There doesn't seem to be any example/tutorial where the input has more than one feature (and is also not NLP).
The first question you should ask yourself is:
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset :
import numpy as np
recording_length = 5000
n_features = 4
prediction_context = 10 # Change here
# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((5000,1))
# Make lists of training examples
X_in = []
Y_out = []
# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context, :])
    Y_out.append(to_predict[i+prediction_context])
# Convert them to numpy array
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
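And since a small sample program was asked for, here is a minimal sketch of a model consuming this shape (the layer size, optimizer, epochs and batch size are arbitrary choices):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    LSTM(units=32, input_shape=(prediction_context, n_features)),  # (10, 4) here
    Dense(units=1)  # one value, one time step ahead
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, Y_train, epochs=10, batch_size=32)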

3darray training/testing TensorFlow RNN LSTM

(I am testing my abilities to write short but effective questions so let me know how I do here)
I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:
[[[time_step_trial_0, feature, feature, ...]
[time_step_trial_0, feature, feature, ...]]
[[time_step_trial_1, feature, feature, ...]
[time_step_trial_1, feature, feature, ...]]
[[time_step_trial_2, feature, feature, ...]
[time_step_trial_2, feature, feature, ...]]]
The 1d portion of this 3darray holds a time step and all feature values observed at that time step. The 2d block contains all 1d arrays (time steps) observed in one trial. The 3d block contains all 2d blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).
For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Recording time steps every 0.5 seconds over a certain time interval, I form 1d arrays of each time step and the features recorded at that time step. Then I form a 2d block of all time steps corresponding to one Formula 1 race car's run on the track. Finally, I create a 3d array holding all F1 cars and their time-series data. I want to train and test a model to detect anomalies in the common F1 trajectories on the course for new cars.
I am aware that TensorFlow models support 2d arrays for training and testing. I was wondering what procedures I would have to go through to be able to train and test the model on all of the independent trials (2d) contained in this 3darray. In addition, I will be adding more trials in the future. So what are the proper procedures for continually updating my model with new data/trials to strengthen my LSTM?
Here is the model I was initially trying to replicate, for a purpose other than human activity: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. Another, more feasible model, which I would much rather look at for anomaly detection in time-series data, is this: https://arxiv.org/abs/1607.00148. I want to build an anomaly detection model that, given a set of non-anomalous time-series training data, can detect anomalies in test data where parts of the data over time are defined as "out of family."
I think for most LSTMs you're going to want to think of your data in this way (as it will be easy to use as input for the networks).
You'll have three dimensions:
feature_size = the number of different features (torque, velocity, etc.)
number_of_time_steps = the number of time steps collected for a single car
number_of_cars = the number of cars
It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).
You can arrange these matrices so that each row is an observation and each column is a different parameter (or the opposite, you may have to transpose the matrices, look at how your network input is formatted).
So each matrix is of size:
number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
To convert your array to this format, you can use this block of code (note that you can already access a single sample in your array with A[n], but this makes the shape of the accessed elements what you expect):
import numpy as np
A = [[['car1', 'timefeatures1'], ['car1', 'timefeatures2']],
     [['car2', 'timefeatures1'], ['car2', 'timefeatures2']],
     [['car3', 'timefeatures1'], ['car3', 'timefeatures2']]]
easy_format = np.array(A)
Now you can get an individual sample with easy_format[n], where n is the sample you want.
easy_format[1] prints
array([['car2', 'timefeatures1'],
       ['car2', 'timefeatures2']],
      dtype='|S12')
easy_format[1].shape = (2,2)
Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence-to-sequence LSTM or RNN. Your original sequence is your time series for a given trial, and you're generating an intermediate set of weights (an embedding) that can recreate that original sequence with a small error. You do this for all the trials. You will train this LSTM on a series of reasonably normal trials and get it to perform well (reconstruct the sequences accurately). You can then use this same set of embeddings to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
Check this repo for a sample of what you'd want, along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq. The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.
Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py
I have modified the code in it very slightly to work on the newest version of tensorflow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1's and 0's. You would do this for a large subset of your data and then see if other data was reconstructed accurately or not (much higher error than average might imply an anomaly).
import numpy as np
import tensorflow as tf

learning_rate = 0.001
training_epochs = 30000
display_step = 100

hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5
test_data = np.random.choice([0, 1], size=(time_steps, samples, step_dims))

initializer = tf.random_uniform_initializer(-1, 1)

seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])

# encoder reads the sequence; decoder starts from a "GO" step and
# is asked to reproduce the encoder inputs
encoder_inputs = [tf.reshape(seq_input, [-1, step_dims])]
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")]
                  + encoder_inputs[:-1])
targets = encoder_inputs
weights = [tf.ones_like(targets_t, dtype=tf.float32) for targets_t in targets]

cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)

y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]

# squared reconstruction error between decoder output and original input
loss = 0
for i in range(len(y_true)):
    loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    x = test_data
    for epoch in range(training_epochs):
        # x = np.arange(time_steps * samples * step_dims)
        # x = x.reshape((time_steps, samples, step_dims))
        feed = {seq_input: x}
        _, cost_value = sess.run([optimizer, loss], feed_dict=feed)
        if epoch % display_step == 0:
            print("logits")
            a = sess.run(y_pred, feed_dict=feed)
            print(a)
            print("labels")
            b = sess.run(y_true, feed_dict=feed)
            print(b)
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(cost_value))
    print("Optimization Finished!")
Your input shape and the corresponding model depend on the type of anomaly you want to detect. You can consider:
1. Feature-only anomaly:
Here you consider individual features and decide whether any of them is anomalous, without considering when it was measured. In your example, the feature vector [torque, speed, acceleration, ...] is anomalous if one or more values is an outlier with respect to the other features. In this case your inputs should be of the form [batch, features].
2. Time-feature anomaly:
Here your inputs depend on when you measure the feature. Your current feature may depend on features measured previously over time. For example, there may be a feature whose value is an outlier if it appears at time 0 but not if it appears further along in time. In this case you divide each of your trials into overlapping time windows and form a feature set of the form [batch, time_window, features].
It should be very simple to start with (1) using an autoencoder: you train an autoencoder and, based on the error between input and output, you can choose a threshold, like two standard deviations from the mean, to determine whether a sample is an outlier, as sketched below.
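A sketch of that thresholding idea for (1) (a trained autoencoder model and the input array X are assumed to already exist; the two-standard-deviation cut is the one suggested above):
import numpy as np

# per-sample reconstruction error of a trained [batch, features] autoencoder
errors = np.mean(np.square(X - autoencoder.predict(X)), axis=1)

# flag samples more than two standard deviations above the mean error
threshold = errors.mean() + 2 * errors.std()
is_anomaly = errors > threshold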
For (2), you can follow the second paper you mentioned using a seq2seq model, where your decoder error will determine which features are outliers. You can check on this for the implementation of such a model.
