Dealing with batch size and time step in 1D CNN - python

I have a batch generator which gives me data in the shape of (500, 1, 12) (i.e. corresponding to (batch size, time steps, features)).
import numpy as np

def batch_generator(batch_size, gen_x, gen_y):
    batch_features = np.zeros((batch_size, 1, 12))
    batch_labels = np.zeros((batch_size, 9))
    while True:
        for i in range(batch_size):
            batch_features[i] = next(gen_x)
            batch_labels[i] = next(gen_y)
        yield batch_features, batch_labels
def generate_X():
    while True:
        with open("/my_path/my_data.csv") as f:
            for line in f:
                currentline = line.rstrip('\n').split(",")
                currentline = np.asarray(currentline)
                currentline = currentline.reshape(1, 1, 12)
                yield currentline
def generate_y():
    while True:
        for i in range(len(y_train)):
            y = y_train[i]
            yield y
I then try to feed this into a 1D-CNN:
model = Sequential()
model.add(Conv1D(filters=100, kernel_size=1, activation='relu', input_shape=(1,12), data_format="channels_last"))
But now I am not able to use a kernel size of more than 1 (i.e. I am stuck with kernel_size = 1). This is probably because my time step is equal to 1.
How can I use the whole batch size as input to the 1D-CNN and increase the kernel_size?

Keep in mind that 1D-convolution is used when each of our input samples is a sequence, i.e. data in which the order of the values is important/specified, like stock market values over a week, weather temperature values over a month, or a sequence of genomes or words. With that said, considering your data, there are three different scenarios:
If each line in your csv file is a sequence of length 12, then you are dealing with samples of shape (12,1), i.e. each sample has 12 timesteps where each timestep has only one feature. So you should reshape it accordingly (i.e. to (12,1) and not to (1,12)); see the first sketch after these scenarios.
However, if each line is not a sequence by itself, but a group of consecutive lines forms a sequence, then you must generate your data accordingly: each sample would consist of multiple consecutive lines, e.g. if we consider the number of timesteps to be 10, then lines #1 to #10 would be one sample, lines #2 to #11 would be another sample, and so on. In this case each sample would have a shape of (number_of_timesteps, 12) (in the example I mentioned it would be (10,12)). Now you can create and generate these samples by writing a custom function, or alternatively you could load all of the data as a numpy array and then use TimeseriesGenerator to do it for you; see the second sketch after these scenarios.
If none of the two cases above applies, then it's very likely that your data is not sequential at all, and therefore using a 1D-CNN (or any other sequence processing model like RNNs) does not make sense for it. Instead, you should use other suitable architectures.
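A minimal sketch of the first scenario, assuming each csv line is one 12-step sequence; the random array is a hypothetical stand-in for the parsed csv rows:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D

# stand-in for 500 parsed csv lines, each reshaped to (12, 1)
x = np.random.random((500, 12, 1))

model = Sequential()
# with 12 timesteps per sample, kernel_size > 1 is now possible
model.add(Conv1D(filters=100, kernel_size=3, activation='relu', input_shape=(12, 1)))
print(model.predict(x).shape)  # (500, 10, 100)

And a sketch of the second scenario with TimeseriesGenerator, assuming the whole csv has been loaded into a numpy array (data and targets below are stand-ins):

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = np.random.random((500, 12))      # stand-in for the parsed csv rows
targets = np.random.randint(0, 9, 500)  # stand-in for the labels

# each sample covers 10 consecutive lines, so batches have shape (batch, 10, 12)
gen = TimeseriesGenerator(data, targets, length=10, batch_size=32)
x_batch, y_batch = gen[0]
print(x_batch.shape)  # (32, 10, 12)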

Related

How to convert a large string data array from file to np.array with float data type

My task is to feed a potentially large set of elements into a neural network for training. I am trying to use tf.data.experimental.CsvDataset and tf.data.experimental.make_csv_dataset but I keep getting stuck.
My dataset is a text file containing strings with numbers separated by ';'. This is how it looks:
14;14;14;55;55;20;20...33;34;34
20;20;20;15;15;15;26...10;10;10
....
10;10;10;30;30;35;35...23;23;23
Each line contains 2500 numbers separated by ';'. I tried to use this code:
dataset = tf.data.experimental.CsvDataset(pathAsk,
    record_defaults=[tf.float32],
    field_delim=";",
    na_value='NA'
)
for element in dataset.as_numpy_iterator():
    print(element)
But I get an error saying there are more elements in the row than I specified in record_defaults. I also tried this:
dataset = tf.data.experimental.make_csv_dataset(pathAsk, batch_size=2, field_delim=';')
iterator = dataset.as_numpy_iterator()
print(dict(next(iterator)))
But I get this error:
Cannot have duplicate column names.
My task is to use this dataset as the input of a neural network built along these lines:
inputs = keras.Input(shape=(2500,), name="ask")
x = keras.layers.Embedding(1000, 64)  # note: this layer is created but never connected to the graph
x = keras.layers.Dense(64, activation=keras.activations.relu)(inputs)
x = keras.layers.Dense(32, activation=keras.activations.relu)(x)
outputs = keras.layers.Dense(6, activation=keras.activations.relu)(x)
model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
It looks like I found the answer to my question: I specified one default per field of the row in the record_defaults property.
datasetAsk = tf.data.experimental.CsvDataset(
    pathAsk,
    record_defaults=[tf.constant(0, dtype=tf.float32)] * 2500,
    header=False,
    field_delim=";",
)
As a result, the number of entries in record_defaults corresponds to the number of fields per line, and iteration passes without errors.
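To then feed this into the model above, each yielded element (a tuple of 2500 scalar tensors, one per column) can be stacked into a single feature vector and batched. A hedged sketch, assuming the datasetAsk defined above:

import tensorflow as tf

# stack the 2500 per-column tensors into one (2500,) feature vector per line
features = datasetAsk.map(lambda *cols: tf.stack(cols))
features = features.batch(2)

for batch in features.take(1):
    print(batch.shape)  # (2, 2500), matching Input(shape=(2500,)) above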

Converting lists of uneven size into LSTM input tensor

So I have a nested list of 1366 samples with 2 features each and varying sequence lengths that is supposed to be the input data for an LSTM. The labels are supposed to be a pair of values for each sequence, i.e. [-0.76797587, 0.0713816]. In essence the data looks like the following:
X = [[[-0.11675862, -0.5416186], [-0.76797587, 0.0713816]], [[-0.5115555, 0.25823522], [0.6099151999999999, 0.21718016], [-0.0022403747, 0.6470206999999999]]]
What I would like to do is convert this list into an input tensor. As I understand, LSTMs accept sequences of different lengths, so in this case the first sample has length 2 and the second has length 3.
Currently I'm trying to convert the list in the following way:
train_data = TensorDataset(torch.tensor(X, dtype=torch.float32), torch.tensor(Y, dtype=torch.float32))
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
Though this produces the following error ValueError: expected sequence of length 5 at dim 1 (got 3)
I'm guessing this is because the first sequence has length five and the second length 3, which is not convertible?
How do I convert the given list into a tensor? Or am I thinking wrong about the way to train the LSTM?
Thanks for any help!
So as you said, the sequence length can be different. But because we work with batches, the sequence length has to be the same within each batch anyway; that's because all samples in a batch are processed simultaneously. So what you have to do is pad the samples to the same size: take the length of the longest sequence in the batch and fill all other samples with zeros so that they have the same size. For that you can use PyTorch's pad_sequence function, like this:
import torch
from torch.nn.utils.rnn import pad_sequence

# the batch must be a python list containing the tensor samples
sample_batch = [torch.randn(4, 2), torch.randn(2, 2), torch.randn(5, 2)]

# pad all samples in the batch to the length of the longest sample;
# batch_first=True gives shape (BATCH_SIZE, SEQUENCE/PAD_SIZE, INPUT_SIZE)
padded_batch = pad_sequence(sample_batch, batch_first=True)
print(padded_batch.shape)  # torch.Size([3, 5, 2])
Now all samples in the batch should have the shape (5,2) because the biggest sample had a sequence length of 5.
If you don't know how to implement this with the PyTorch DataLoader, you can create a custom collate_fn:
def custom_collate(batch):
    # batch is a list of (sample, target) pairs
    sample_batch, target_batch = [], []
    for sample, target in batch:
        sample_batch.append(sample)
        target_batch.append(target)
    # pad to the longest sequence in this batch: (BATCH_SIZE, PAD_SIZE, INPUT_SIZE)
    padded_batch = pad_sequence(sample_batch, batch_first=True)
    return padded_batch, torch.cat(target_batch, dim=0).reshape(len(sample_batch), -1)
Now you can tell the DataLoader to apply this function on your batch before returning it:
train_dataloader = DataLoader(
    train_data,
    batch_size=batch_size,
    num_workers=1,
    shuffle=True,
    collate_fn=custom_collate  # <-- NOTE THIS
)
Now the DataLoader returns padded batches!
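A hedged end-to-end sketch with toy variable-length data; a plain list of (sample, target) pairs works as the dataset here, and all shapes are illustrative:

import torch
from torch.utils.data import DataLoader

# three sequences of lengths 5, 3 and 4, each with 2 features, plus pair targets
X = [torch.randn(5, 2), torch.randn(3, 2), torch.randn(4, 2)]
Y = [torch.randn(2), torch.randn(2), torch.randn(2)]
train_data = list(zip(X, Y))

loader = DataLoader(train_data, batch_size=3, shuffle=True, collate_fn=custom_collate)
x_batch, y_batch = next(iter(loader))
print(x_batch.shape)  # torch.Size([3, 5, 2]): padded to the longest sequence
print(y_batch.shape)  # torch.Size([3, 2])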

How to shape input array appropriately for ML

I apologize in advance if this question seems silly, but I am trying to understand a machinelearningmastery blog post about LSTM-type ML algorithms, more specifically how the input data gets reshaped. I don't have a ton of wisdom here on the subject, or a CS degree for that matter.
About halfway into the blog, in the CNN-LSTM section, Jason writes:
The first step is to split the input sequences into subsequences that
can be processed by the CNN model. For example, we can first split our
univariate time series data into input/output samples with four steps
as input and one as output. Each sample can then be split into two
sub-samples, each with two time steps. The CNN can interpret each
subsequence of two time steps and provide a time series of
interpretations of the subsequences to the LSTM model to process as
input.
We can parameterize this and define the number of subsequences as
n_seq and the number of time steps per subsequence as n_steps. The
input data can then be reshaped to have the required structure:
[samples, subsequences, timesteps, features]
My question is: is it a requirement that the data be shaped into only 4 steps, or can it be larger? The code below will attempt to print the array; I am using my own data sample here from my git account.
import pandas as pd
import numpy as np
# univariate data preparation
from numpy import array
df = pd.read_csv("trainData.csv")
df = df[['kW']].shift(-1)
df = df.dropna()
raw_seq = df.values.tolist()
# split a univariate sequence into samples
def split_sequence(sequence, n_steps):
    X, y = list(), list()
    for i in range(len(sequence)):
        # find the end of this pattern
        end_ix = i + n_steps
        # check if we are beyond the sequence
        if end_ix > len(sequence)-1:
            break
        # gather input and output parts of the pattern
        seq_x, seq_y = sequence[i:end_ix], sequence[end_ix]
        X.append(seq_x)
        y.append(seq_y)
    return array(X), array(y)
# define input sequence
#raw_seq = [10, 20, 30, 40, 50, 60, 70, 80, 90]
# choose a number of time steps
n_steps = 4
# split into samples
X, y = split_sequence(raw_seq, n_steps)
# reshape from [samples, timesteps] into [samples, subsequences, timesteps, features]
n_features = 1
n_seq = 2
n_steps = 2
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
# summarize the data
for i in range(len(X)):
print(X[i], y[i])
The code above works but when I change n_steps = 7 (from 4) I get this shape error.
File "convArray.py", line 39, in <module>
X = X.reshape((X.shape[0], n_seq, n_steps, n_features))
ValueError: cannot reshape array of size 2499 into shape (357,2,2,1)
The reason I want to try and use 7 time steps is the data that I am experimenting with is electrical demand units for a building per day, and 7 days in a week would be an ideal experimental time step!
Any tips greatly appreciated
The problem is stated in the error itself: you are trying to reshape the array to a shape it cannot fit. The product of the numbers in a shape is the total number of elements, so a (357,2,2,1) array holds 357*2*2*1 = 1428 elements, while your X has 2499 elements.
The underlying constraint is that n_seq * n_steps in the reshape must equal the window length you passed to split_sequence: with a 4-step window, each sample has 4 values and splits cleanly into n_seq=2 subsequences of n_steps=2; with a 7-step window, each sample has 7 values and 2*2 = 4 does not match.
The number of samples itself depends on raw_seq, because its length decides how many times the loop inside split_sequence() runs. A sketch of 7-day-friendly choices follows.
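A hedged sketch of reshapes that do work for a weekly horizon, reusing split_sequence from above (the factorizations are illustrative):

# option 1: keep the 7-step window and treat it as 7 subsequences of 1 step
X, y = split_sequence(raw_seq, 7)
X = X.reshape((X.shape[0], 7, 1, 1))   # 7 * 1 == 7

# option 2: use an 8-step window so it factors into 2 subsequences of 4 steps
X, y = split_sequence(raw_seq, 8)
X = X.reshape((X.shape[0], 2, 4, 1))   # 2 * 4 == 8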

How to reshape input for keras LSTM?

I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so it is of shape (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time series prediction and I'm stuck on the input shape. I'm trying to predict a value 1 timestep after the last data point. How do I reshape it into the 3D form for an LSTM model in Keras?
Also, it would be much more helpful if a small sample program were written. There doesn't seem to be any example/tutorial where the input has more than one feature (and also isn't NLP).
The first question you should ask yourself is:
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset :
import numpy as np
recording_length = 5000
n_features = 4
prediction_context = 10 # Change here
# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((5000,1))
# Make lists of training examples
X_in = []
Y_out = []
# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context,:])
    Y_out.append(to_predict[i+prediction_context])
# Convert them to numpy array
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network.
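Since a small sample program was requested, here is a hedged sketch of a model consuming X_train, assuming TensorFlow/Keras; layer sizes are illustrative, not tuned:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

model = Sequential([
    # each sample has shape (prediction_context, n_features) = (10, 4)
    LSTM(32, input_shape=(prediction_context, n_features)),
    Dense(1),  # one value, one step after the window
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, Y_train, epochs=2, batch_size=32)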

Keras sequence prediction with multiple simultaneous sequences

My question is very similar to what it seems this post is asking, although that post doesn't pose a satisfactory solution. To elaborate, I am currently using keras with tensorflow backend and a sequential LSTM model. The end goal is I have n time-dependent sequences with equal time steps (the same number of points on each sequence and the points are all the same time apart) and I would like to feed all n sequences into the same network so it can use correlations between the sequences to better predict the next step for each sequence. My ideal output would be an n-element 1-D array with array[0] corresponding to the next-step prediction for sequence_1, array[1] for sequence_2, and so on.
My inputs are sequences of single values, so each of n inputs can be parsed into a 1-D array.
I was able to get a working model for each sequence independently using the code at the end of this guide by Jakob Aungiers, although my difficulty is adapting it to accept multiple sequences at once and correlate between them (i.e. be analyzed in parallel). I believe the issue is related to the shape of my input data, which is currently in the form of a 4-D numpy array because of how Jakob's Guide splits the inputs into sub-sequences of 30 elements each to analyze incrementally, although I could also be completely missing the target here. My code (which is mostly Jakob's, not trying to take credit for anything that isn't mine) presently looks like this:
As-is this complains with "ValueError: Error when checking target: expected activation_1 to have shape (None, 4) but got array with shape (4, 490)", I'm sure there are plenty of other issues but I'd love some direction on how to achieve what I'm describing. Anything stick out immediately to anyone? Any help you could give will be greatly appreciated.
Thanks!
-Eric
Keras is already prepared to work with batches containing many sequences, there is no secret at all.
There are two possible approaches, though:
You input your entire sequences (all steps at once) and predict n results
You input only one step of all sequences and predict the next step in a loop
Suppose:
nSequences = 30
timeSteps = 50
features = 1 #(as you said: single values per step)
outputFeatures = 1
First approach: stateful=False:
inputArray = arrayWithShape((nSequences,timeSteps,features))
outputArray = arrayWithShape((nSequences,outputFeatures))
input_shape = (timeSteps,features)
#use layers like this:
LSTM(units) #if the first layer in a Sequential model, add the input_shape
#if you want to return the same number of steps (like a new sequence parallel to the input), use return_sequences=True
Train like this:
model.fit(inputArray,outputArray,....)
Predict like this:
newStep = model.predict(inputArray)
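A hedged runnable version of this first approach with random stand-in data; units and epochs are illustrative:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

nSequences, timeSteps, features, outputFeatures = 30, 50, 1, 1
inputArray = np.random.random((nSequences, timeSteps, features))
outputArray = np.random.random((nSequences, outputFeatures))

model = Sequential([
    LSTM(16, input_shape=(timeSteps, features)),
    Dense(outputFeatures),
])
model.compile(optimizer='adam', loss='mse')
model.fit(inputArray, outputArray, epochs=2)

newStep = model.predict(inputArray)  # (nSequences, outputFeatures): one prediction per sequence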
Second approach: stateful=True:
inputArray = sameAsBefore
outputArray = inputArray[:,1:] #one step after input array
inputArray = inputArray[:,:-1] #eliminate the last step
batch_input_shape = (nSequences, 1, features) #stateful layers require the batch size
#use layers like this:
LSTM(units, stateful=True) #if the first layer in a Sequential model, add the batch_input_shape
Train like this:
model.reset_states() #you need this in stateful=True models
#if you don't reset states,
#the stateful model will think that your inputs are new steps of the same previous sequences
for step in range(inputArray.shape[1]): #for each time step
    model.fit(inputArray[:,step:step+1], outputArray[:,step:step+1], shuffle=False, ...)
Predict like this:
model.reset_states()
predictions = np.empty(inputArray.shape)
for step in range(inputArray.shape[1]): #for each time step
    predictions[:,step] = model.predict(inputArray[:,step:step+1])
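And a hedged runnable sketch of the stateful approach with stand-in data; train_on_batch replaces fit inside the loop here, and batch_input_shape pins the batch size that stateful layers require:

import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense

nSequences, timeSteps, features = 30, 50, 1
data = np.random.random((nSequences, timeSteps, features))
inputArray, outputArray = data[:, :-1], data[:, 1:]  # targets are one step ahead

model = Sequential([
    LSTM(16, stateful=True, batch_input_shape=(nSequences, 1, features)),
    Dense(features),
])
model.compile(optimizer='adam', loss='mse')

model.reset_states()  # start from a clean state before each pass over the sequences
for step in range(inputArray.shape[1]):  # one time step at a time
    model.train_on_batch(inputArray[:, step:step+1], outputArray[:, step])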
