How to shape input array appropriately for ML

I apologize in advance if this question seems silly, but I am trying to understand a machinelearningmastery blog post about LSTM-type ML algorithms, specifically how the input data gets reshaped. I don't have much background in the subject, or a CS degree for that matter.
About halfway into the blog post, in the LSTM CNN section, Jason says:
The first step is to split the input sequences into subsequences that
can be processed by the CNN model. For example, we can first split our
univariate time series data into input/output samples with four steps
as input and one as output. Each sample can then be split into two
sub-samples, each with two time steps. The CNN can interpret each
subsequence of two time steps and provide a time series of
interpretations of the subsequences to the LSTM model to process as
input.
We can parameterize this and define the number of subsequences as
n_seq and the number of time steps per subsequence as n_steps. The
input data can then be reshaped to have the required structure:
[samples, subsequences, timesteps, features]
My question: is it a requirement that the data be shaped into only 4 time steps, or can it be larger? The code below prints the reshaped array; I am using my own data sample, which is on my git account.
import pandas as pd
import numpy as np
# univariate data preparation
from numpy import array
df = pd.read_csv("trainData.csv")
df = df[['kW']].shift(-1)
df = df.dropna()
raw_seq = df.values.tolist()
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# define input sequence
#raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# choose a number of time steps
n_steps = 4
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# reshape from [samples, timesteps] into [samples, subsequences, timesteps, features]
n_features = 1
n_seq = 2
n_steps = 2
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
# summarize the data
for i in range(len(X)):
    print(X[i], y[i])
The code above works, but when I change the first n_steps from 4 to 7 I get this shape error:
File "convArray.py", line 39, in <module>
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
ValueError: cannot reshape array of size 2499 into shape (357,2,2,1)
The reason I want to try and use 7 time steps is the data that I am experimenting with is electrical demand units for a building per day, and 7 days in a week would be an ideal experimental time step!
Any tips greatly appreciated

The problem is stated in the error message: you are trying to reshape the array into a shape that cannot hold it. The array X has 2499 elements, and 2499 elements cannot be reshaped to (357,2,2,1). The product of the numbers in a shape is the total number of elements, and a (357,2,2,1) array has 357*2*2*1 = 1428 elements.
With the first n_steps = 7, split_sequence() returns 357 samples of 7 values each (357*7 = 2499), while the reshape still asks for n_seq * n_steps = 2*2 = 4 values per sample.
So the window length passed to split_sequence() has to equal n_seq * n_steps in the reshape. It worked with 4 because 4 = 2*2. Since 7 is prime, a 7-step window can only be split as n_seq = 1 with n_steps = 7, or n_seq = 7 with n_steps = 1. The length of raw_seq decides how many times the for loop inside split_sequence() runs, and therefore the number of samples, but it is not what limits n_steps.
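A minimal sketch of the element-count arithmetic behind this error. It assumes a stand-in sequence of 364 values (the length implied by 357 samples of 7 steps) in place of the kW column, and a simplified split_sequence; the variable name window is introduced here for illustration only.

```python
import numpy as np

def split_sequence(sequence, window):
    """Split a 1-D sequence into inputs of length `window` and the next value."""
    X, y = [], []
    for i in range(len(sequence) - window):
        X.append(sequence[i:i + window])
        y.append(sequence[i + window])
    return np.array(X), np.array(y)

# Stand-in data (assumption): 364 daily readings instead of the real kW column.
raw_seq = np.arange(364, dtype=float)

window = 7                              # time steps per sample fed to split_sequence
X, y = split_sequence(raw_seq, window)
print(X.shape, X.size)                  # 357 samples of 7 steps, 2499 elements

# The reshape must keep the element count, so n_seq * n_steps has to equal window.
n_features = 1
n_seq, n_steps = 1, 7                   # 7 is prime: only 1 x 7 or 7 x 1 are possible
assert n_seq * n_steps * n_features == window
X4 = X.reshape((X.shape[0], n_seq, n_steps, n_features))
print(X4.shape)
```

If two sub-sequences per sample are wanted, one option is a 14-step window (two weeks) with n_seq = 2 and n_steps = 7.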

Related

How to label time series data for CNN

I want to classify time series data using a CNN (given model).
My input signal data is given as shape = (nr. channels x samples x nr. trials) where:
nr. channels = number of sensors recording simultaneously
samples = split my signal into 3s windows so the training data will have the same length 3s x sample rate
nr. trials = by splitting the signal into windows, I obtain X chunks
The time series/signal data is subject based: for each subject I have numerous recorded signals. The labels are also subject based rather than signal based (so it's not an event classification problem, but more of a global subject classification one), meaning that every recording from a given subject has the same label. Example:
SUBJECT 1 -> 12 recorded signals of about 2 min each -> split into 3s windows -> all of which will have the same label
So I have an excel table where for each subject I have an associated class. I have extracted these labels and I've tried shaping them as (1 x samples x nr. trials).
The model I'm using requires the NHWC (trials, channels, samples, kernels) format so I have reshaped my training data as:
X train shape = (2642, 14, 384, 1)
y train shape = (2642, 1, 384, 1)
But I get:
ValueError: Data cardinality is ambiguous:
x sizes: 14
y sizes: 1
Make sure all arrays contain the same number of samples.
How should I shape my labels? Should I reshape them to (2642, 14, 384, 1), the same shape as the training data, or something else?

I ended up solving it by trying different things.
The labels array, y_train, had to be shaped as
(trials, 1)
where trials, in my case, is the number of windows generated by windowing the signal.
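A minimal sketch of building such a label array with NumPy, so that each window inherits its subject's label. The two subjects, their labels, and the window counts per subject are made-up values for illustration; only the shapes (2642, 14, 384, 1) and (trials, 1) come from the post.

```python
import numpy as np

n_trials, n_channels, n_samples = 2642, 14, 384

# Placeholder training data in (trials, channels, samples, kernels) layout.
X_train = np.zeros((n_trials, n_channels, n_samples, 1), dtype=np.float32)

# Hypothetical subject-level labels and number of windows cut from each subject.
subject_labels = np.array([0, 1])
windows_per_subject = np.array([1321, 1321])

# Repeat each subject's label once per window, then make it a column: (trials, 1).
y_train = np.repeat(subject_labels, windows_per_subject).reshape(-1, 1)

# The first axis (number of samples) must match between data and labels.
assert X_train.shape[0] == y_train.shape[0]
print(X_train.shape, y_train.shape)
```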

Cannot reshape array of size x into shape y

I'm following a tutorial to create an LSTM neural network using keras.
I have an array of 1270 rows and 26 features.
I split the data like this:
train_ind = int(0.8 * X.shape[0])
X_train = X[:train_ind]
X_test = X[train_ind:]
y_train = y[:train_ind]
y_test = y[train_ind:]
And I'm trying to reshape it for the LSTM using this:
num_steps = 4
X_train_shaped = np.reshape(X_train_sc, newshape=(-1, num_steps, 26))
y_train_shaped = np.reshape(y_train_sc, newshape=(-1, num_steps, 26))
assert X_train_shaped.shape[0] == y_train_shaped.shape[0]
However, I'm getting this error:
ValueError: cannot reshape array of size 1016 into shape (4,26)

Well, 4 x 26 = 104, and 1016 (the array size reported in the error) isn't divisible by 104, so np.reshape() can't choose an integer number of rows (the -1) in order to fit that into an array. You need to change either num_steps or num_features (26) so that num_steps * num_features evenly divides the array size. Unfortunately, this is impossible with num_features = 26, since 13 does not divide 1016. Your other option is to choose a different number of total rows, say 1040 or 1144, which are both divisible by 104.
So, instead of setting train_ind = int(0.8 * X.shape[0]), try train_ind = 1040 or a smaller multiple of 104. Note, however, that your test data will also have to have a nice number of rows in order to reshape it in the same way.
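A minimal sketch of the divisibility check, using zero-filled stand-ins for the scaled arrays. It assumes the labels hold one value per row (an assumption, suggested by the reported size of 1016); under that assumption the feature matrix reshapes cleanly and only the label reshape fails, so a last axis of 1 is used for the labels instead of 26.

```python
import numpy as np

num_steps, num_features = 4, 26

# Stand-ins (assumption): 1016 training rows, 26 features, one target per row.
X_train = np.zeros((1016, num_features))
y_train = np.zeros(1016)

# 1016 * 26 elements divide evenly by 4 * 26, so this works.
X_shaped = X_train.reshape(-1, num_steps, num_features)
print(X_shaped.shape)

# 1016 elements do not divide evenly by 4 * 26 = 104, so this raises.
try:
    y_train.reshape(-1, num_steps, num_features)
except ValueError as err:
    message = str(err)
    print(message)

# Keep a row count that is a multiple of num_steps, with one value per step.
usable = (len(y_train) // num_steps) * num_steps
y_shaped = y_train[:usable].reshape(-1, num_steps, 1)
print(y_shaped.shape)
```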

First of all, you don't need to reshape an array. The shape attribute of a numpy array simply determines how the underlying data is displayed to you and how the data is accessed; changing the shape doesn't actually move any data around.
Likewise, one cannot change the shape to something that is impossible. For example, if an array has shape (100,5,6), you can't change it to (100,5,7). In general, the axes have to multiply to the same total: 100*5*6 is not equal to 100*5*7.
In your case, it sounds like you want to work with an LSTM, which would normally mean that you simply want to add an additional axis so that you have input vectors of size 1. A new axis can be added with a None entry in numpy. Something like:
X_train = X[:train_ind,:,None] #The axes are Batch, Time, and the Input Vector.
Shape should now be (1016,26,1).
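A minimal sketch of the new-axis approach on a zero-filled stand-in for X, with train_ind hard-coded to the 1016 rows from the question. It also checks that the result shares memory with the original array, which is what "doesn't move any data around" means in practice.

```python
import numpy as np

# Stand-in (assumption): 1270 rows, 26 features, as described in the question.
X = np.zeros((1270, 26))
train_ind = 1016                      # int(0.8 * 1270) in the question

# None (same as np.newaxis) adds a trailing axis of length 1.
X_train = X[:train_ind, :, None]      # axes: batch, time, input vector
print(X_train.shape)

# Slicing and adding an axis produce a view, not a copy.
print(np.shares_memory(X, X_train))
```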

Dealing with batch size and time step in 1D CNN

I have a batch generator which gives me data in the shape of (500, 1, 12) (i.e. corresponding to (batch size, time steps, features)).
def batch_generator(batch_size, gen_x, gen_y):
    batch_features = np.zeros((batch_size, 1, 12))
    batch_labels = np.zeros((batch_size, 9))
    while True:
        for i in range(batch_size):
            batch_features[i] = next(gen_x)
            batch_labels[i] = next(gen_y)
        yield batch_features, batch_labels
def generate_X():
    while True:
        with open("/my_path/my_data.csv") as f:
            for line in f:
                currentline = line.rstrip('\n').split(",")
                currentline = np.asarray(currentline)
                currentline = currentline.reshape(1, 1, 12)
                yield currentline
def generate_y():
    while True:
        for i in range(len(y_train)):
            y = y_train[i]
            yield y
I then try to feed this into a 1D-CNN:
model = Sequential()
model.add(Conv1D(filters=100, kernel_size=1, activation='relu', input_shape=(1,12), data_format="channels_last"))
But now I cannot use a kernel size greater than 1 (only kernel_size = 1 works). This is probably because my number of time steps is 1.
How can I use the whole batch as input to the 1D-CNN and increase the kernel_size?

Keep in mind that 1D convolution is used when each input sample is a sequence, i.e. data in which the order of the values matters, like stock market values over a week, temperature readings over a month, or a sequence of genomes or words. With that said, considering your data, there are three different scenarios:
If each line in your csv file is a sequence of length 12, then you are dealing with samples of shape (12,1), i.e. each sample has 12 timesteps and each timestep has only one feature. So you should reshape it accordingly (i.e. to (12,1) and not to (1,12)).
However, if each line is not a sequence by itself, but a group of consecutive lines forms a sequence, then you must generate your data accordingly: each sample would consist of multiple consecutive lines, e.g. if we take the number of timesteps to be 10, then lines #1 to #10 would be a sample, lines #2 to #11 would be another sample, and so on. In this case each sample would have a shape of (number_of_timesteps, 12) (in the example I mentioned it would be (10,12)). You can create these samples by writing a custom function, or alternatively you could load all of the data as a numpy array and then use TimeseriesGenerator to do it for you.
If neither of the two cases above applies, then it's very likely that your data is not sequential at all, and therefore using a 1D-CNN (or any other sequence processing model like RNNs) does not make sense for this data. Instead, you should use other suitable architectures.
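The first two scenarios can be sketched with plain NumPy. This assumes the CSV has already been loaded into a 2-D array; the 20-row array below is made-up stand-in data, and the custom windowing is only one possible way to write the "custom function" mentioned above.

```python
import numpy as np

# Stand-in (assumption): 20 CSV lines with 12 values each.
rows = np.arange(20 * 12, dtype=float).reshape(20, 12)

# Scenario 1: each line is itself a sequence of 12 timesteps with 1 feature.
per_line = rows[..., np.newaxis]
print(per_line.shape)        # (samples, 12, 1)

# Scenario 2: consecutive lines form a sequence; slide a window over the rows.
n_timesteps = 10
windows = np.stack([rows[i:i + n_timesteps]
                    for i in range(len(rows) - n_timesteps + 1)])
print(windows.shape)         # (samples, n_timesteps, 12)

# The second window starts one line later than the first.
assert np.array_equal(windows[1][0], rows[1])
```

With either layout the time axis is longer than 1, so a Conv1D kernel_size above 1 becomes possible (up to 12 in the first case and up to n_timesteps in the second).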

How to reshape input for keras LSTM?

I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so its shape is (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time series prediction and I'm stuck on the input shape. I'm trying to predict a value 1 timestep from the last data point. How do I reshape it into the 3D form for an LSTM model in keras?
Also, it would be very helpful if a small sample program were included. There doesn't seem to be any example/tutorial where the input has more than one feature (and is also not NLP).

The first question you should ask yourself is:
What is the timescale over which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset:
import numpy as np
recording_length = 5000
n_features = 4
prediction_context = 10 # Change here
# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((recording_length, 1))
# Make lists of training examples
X_in = []
Y_out = []
# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context,:])
    Y_out.append(to_predict[i+prediction_context])
# Convert them to numpy array
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
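To make the trade-off concrete, here is a small sketch that builds the windows for two context lengths and prints the resulting shapes. The helper name make_windows is hypothetical, and the random arrays are stand-ins for real recordings.

```python
import numpy as np

def make_windows(X_data, to_predict, prediction_context):
    """Return (inputs, targets): each input holds prediction_context rows,
    and its target is the value that follows the window."""
    n_examples = len(X_data) - prediction_context
    X = np.stack([X_data[i:i + prediction_context] for i in range(n_examples)])
    Y = to_predict[prediction_context:]
    return X, Y

recording_length, n_features = 5000, 4
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((recording_length, 1))

# A longer context gives each example more history but leaves fewer examples.
shapes = {}
for prediction_context in (10, 100):
    X_train, Y_train = make_windows(X_data, to_predict, prediction_context)
    shapes[prediction_context] = (X_train.shape, Y_train.shape)
    print(prediction_context, X_train.shape, Y_train.shape)
```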

Reshape Keras Input for LSTM

I have two ndarrays, inputs and results, both consisting of multiple arrays looking like this:
inputs = [
[[1,2],[2,2],[3,2]],
[[2,1],[1,2],[2,3]],
[[2,2],[1,1],[3,3]],
...
]
results = [
[3,4,5],
[3,3,5],
[4,2,6],
...
]
I managed to split them up into train and test arrays, where train contains 66% of the arrays and test the other 33%. Now I'd like to reshape them for further use in my LSTM but my script fails when inputting them into np.reshape() function.
split = int(round(0.66 * results.shape[0]))
train_results = results[:split, :]
train_inputs = inputs[:split, :]
test_results = results[split:, :]
test_inputs = inputs[split:, :]
X_train = np.reshape(train_inputs, (train_inputs.shape[0], train_inputs.shape[1], 1))
X_test = np.reshape(test_inputs, (test_inputs.shape[0], test_inputs.shape[1], 1))
Please tell me how to use np.reshape() correctly in this case.
Basically I am loosely following this tutorial: https://github.com/Vict0rSch/deep_learning/tree/master/keras/recurrent

You just pass a tuple to np.reshape.
For an LSTM layer, you need the shape like (NumberOfExamples, TimeSteps, FeaturesPerStep).
So, we need to know how many steps your sequence has. By the looks of your X array, I'll suppose you have 3 steps and 2 features.
If that's the case:
X_train = train_inputs.reshape((split,3,2))
X_test = test_inputs.reshape((test_inputs.shape[0], 3, 2))
If, instead, you want 6 steps of one feature, the shape is (split,6,1). You can use any shape, as long as the product of the three elements in the shape stays the same.
As for the results: do you want one result per step, in sequence, matching the input steps? Or are they just single outputs (three independent outputs for the entire sequence)?
Since you've got 3 results, and I have assumed you have 3 time steps, I'll assume these 3 results are in sequence as well, so I'll reshape them as:
Y_train = train_results.reshape((split,3,1)) #three steps, one result per step
#for this to work, your last LSTM layer should use `return_sequences=True`.
But if they are 3 independent results:
Y_train = train_results.reshape((split,3))
#for this to work, you must have 3 units in the last layer, be it a Dense or an LSTM. But this LSTM must have `return_sequences=False`.
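A minimal sketch of these reshapes using only the three sample rows shown in the question, and assuming, as above, 3 time steps with 2 features each. With three rows, the 66% split leaves 2 for training and 1 for testing.

```python
import numpy as np

inputs = np.array([
    [[1, 2], [2, 2], [3, 2]],
    [[2, 1], [1, 2], [2, 3]],
    [[2, 2], [1, 1], [3, 3]],
])
results = np.array([
    [3, 4, 5],
    [3, 3, 5],
    [4, 2, 6],
])

split = int(round(0.66 * results.shape[0]))
train_inputs, test_inputs = inputs[:split], inputs[split:]
train_results = results[:split]

# (NumberOfExamples, TimeSteps, FeaturesPerStep)
X_train = train_inputs.reshape((split, 3, 2))
X_test = test_inputs.reshape((test_inputs.shape[0], 3, 2))

# Alternative layout: 6 steps of one feature (same number of elements).
X_train_6 = train_inputs.reshape((split, 6, 1))

# One result per step (sequence output) versus 3 independent outputs.
Y_train_seq = train_results.reshape((split, 3, 1))
Y_train_flat = train_results.reshape((split, 3))

print(X_train.shape, X_test.shape, X_train_6.shape)
print(Y_train_seq.shape, Y_train_flat.shape)
```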
