I have a toy problem where I have some data (X, Y): the labels Y are frequencies, and the features X are cosine functions with frequency Y, X = cos(Y*t + phi) + N, where t is a time vector, phi is a random phase shift and N is additive noise. I am developing a CNN in Keras (TensorFlow backend) to learn Y from X. However, I don't know how long my time window needs to be, so I would like to use keras-tuner to help identify the best hyperparameters (winStart, winSpan) that determine which times to select, t[winStart:winStart+winSpan].
It is unclear if/how I can slice my learning features X using tuned hyperparameters.
First, I defined my data as:
import numpy as np

# given cosine waves X, estimate frequencies Y
t = np.linspace(0, 1, 1000)
Y = np.divide(2 * np.pi, (np.random.random((100)) + 1))
phase = np.ones((t.size, 1)) * np.random.normal(loc=0, scale=np.pi, size=(1, Y.size))
noise = np.random.normal(loc=0, scale=.1, size=(t.size, Y.size))
X = np.transpose(np.cos(np.expand_dims(t, axis=1) * np.expand_dims(Y, axis=0) + phase) + noise)
Y = np.expand_dims(Y, axis=1)
Following tutorials, I have written a function to construct my CNN model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPool1D, Dropout, Flatten, Dense

def build_model(inputSize):
    model = Sequential()
    model.add(Conv1D(10,
                     kernel_size=(15,),
                     padding='same',
                     activation='relu',
                     batch_input_shape=(None, inputSize, 1)))
    model.add(MaxPool1D(pool_size=(2,)))
    model.add(Dropout(.2))
    model.add(Conv1D(10,
                     kernel_size=(15,),
                     padding='same',
                     activation='relu'))
    model.add(MaxPool1D(pool_size=(2,)))
    model.add(Dropout(.2))
    model.add(Flatten())
    # add a dense layer
    model.add(Dense(10))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error',
                  optimizer='adam')
    return model
Additionally, I have written a hypermodel class:
import keras_tuner as kt
from keras_tuner import HyperModel

class myHypermodel(HyperModel):
    def __init__(self, inputSize):
        self.inputSize = inputSize

    def build_hp_model(self, hp):
        inputSize = 1000
        self.winStart = hp.Int('winStart', min_value=0, max_value=inputSize - 100, step=100)
        self.winSpan = hp.Int('winSpan', min_value=100, max_value=inputSize, step=100)
        return build_model(self.winSpan)

    def run_trial(self, trial, x, y, *args, **kwargs):
        hp = trial.hyperparameters
        # build the model with the current hyperparameters
        model = self.build_hp_model(hp)
        # window the feature vectors
        x = x[:, self.winStart:np.min([self.winStart + self.winSpan, self.inputSize])]
        print('here')  # debug: check whether run_trial() actually gets called
        return model.fit(x, y, *args, **kwargs)
Here, the build_hp_model() method is intended to link the hyperparameters to internal variables so that they can be used when the run_trial() method is called. My understanding is that run_trial() will be called by tuner.search() when performing hyperparameter optimization. I expect run_trial() to pick a new combination of the winStart and winSpan hyperparameters, rebuild the model, drop all values of x except those in the window defined by winStart and winSpan, and then run model.fit().
I call my hypermodel class and attempt to perform the hyperparameter search using:
tuner_model = myHypermodel(X.shape[1])

tuner = kt.Hyperband(tuner_model.build_hp_model,
                     overwrite=True,
                     objective='val_loss',
                     max_epochs=25,
                     factor=3,
                     hyperband_iterations=3)

tuner.search(x=np.expand_dims(X, axis=2),
             y=np.expand_dims(Y, axis=2),
             epochs=9,
             validation_split=0.25)
When I run the script, I get the error:
ValueError: Input 0 of layer "sequential" is incompatible with the layer: expected shape=(None, 100, 1), found shape=(None, 1000, 1)
So it seems like the build_model() function is being called for a hyperparameter winSpan=100, but then the model is being fit using the full feature vectors X instead of X[:,winStart:winStart+winSpan].
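One direction I have considered (an untested sketch, assuming the window selection can instead live inside the model) is to keep the model input at the full length and crop to the tuned window with a Cropping1D layer, so the expected input shape is always (None, 1000, 1):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Cropping1D, Conv1D, MaxPool1D,
                                     Dropout, Flatten, Dense)

def build_cropped_model(hp, inputSize=1000):
    # hypothetical variant of build_model(): the window is selected inside the
    # model itself, so every trial accepts the full-length input
    winStart = hp.Int('winStart', min_value=0, max_value=inputSize - 100, step=100)
    winSpan = hp.Int('winSpan', min_value=100, max_value=inputSize, step=100)
    winEnd = min(winStart + winSpan, inputSize)

    model = Sequential()
    # crop winStart samples from the front and inputSize - winEnd from the back
    model.add(Cropping1D(cropping=(winStart, inputSize - winEnd),
                         input_shape=(inputSize, 1)))
    model.add(Conv1D(10, kernel_size=15, padding='same', activation='relu'))
    model.add(MaxPool1D(pool_size=2))
    model.add(Dropout(.2))
    model.add(Conv1D(10, kernel_size=15, padding='same', activation='relu'))
    model.add(MaxPool1D(pool_size=2))
    model.add(Dropout(.2))
    model.add(Flatten())
    model.add(Dense(10))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

With this, build_cropped_model could be passed directly to kt.Hyperband and the default run_trial left untouched, since the data would no longer need to be sliced outside the model.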
Any suggestions on how I can properly implement this tuning?
Related
Say I have a function F that takes in a parameter vector P (say, a 5-element vector) and produces a (numerical) time series Y[t] of length T (e.g. T=100, so t=1,...,100). The function could be complicated (e.g. enzyme reaction models).
I want to make a neural network that predicts the output (Y[t]) that would result from feeding a new parameter set (P') into the function. How can this be done?
A simple feed-forward network can work, but it requires a very large number of output nodes, and doesn't take into account the temporal correlation / relationships between points. Is it possible/better to use a RNN or Transformer instead?
Using an RNN might work for you. Here is some example code in Keras to get you started:
import tensorflow as tf

param_length = 5
time_length = 100
hidden_size = 20

model = tf.keras.Sequential([
    # Encode input parameters.
    tf.keras.layers.Dense(hidden_size, input_shape=[param_length]),
    # Generate a sequence.
    tf.keras.layers.RepeatVector(time_length),
    tf.keras.layers.LSTM(32, return_sequences=True),
    tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1))
])

model.compile(loss="mse", optimizer="nadam")
model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=10)
The first Dense layer converts the input parameters to a hidden state. Then the LSTM units generate the time sequence. You will need to experiment with hyperparameters like the number of Dense and LSTM layers, the size of the hidden layers, etc.
One more thing you can try is a different loss function, e.g. the Huber loss together with early stopping:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(
    monitor="val_mae", patience=50, restore_best_weights=True)
model.compile(loss=tf.keras.losses.Huber(), optimizer="nadam", metrics=["mae"])
history = model.fit(train_x, train_y, validation_data=(val_x, val_y), epochs=500,
                    callbacks=[early_stopping_cb])
My data set is a 3D array of size (M, t, N), where M is the number of samples, t is the number of timesteps in a sequence and N is the number of possible events that can happen at time t. Selecting a specific M gives a 2D array of size (t, N), where each row is a timestep and each column is an event. A column is set to 1 if that event happened at time t, otherwise it is 0. Only one event can happen at any given timestep.
I want to try and build an auto-encoder for anomaly detection, and in the tutorials and blogs I have read, the last activation layer is 'relu' and the loss function is 'mse'. But since I am trying to basically reconstruct a classification with N classes, would 'softmax' as the last layer and 'categorical_crossentropy' be better?
from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(timesteps, n_features))

# Encoder
lstm_enc_1 = LSTM(32, activation='relu', return_sequences=True)(inputs)
lstm_enc_2 = LSTM(latent_dim, activation='relu', return_sequences=False)(lstm_enc_1)

repeater = RepeatVector(timesteps)

# Decoder
lstm_dec_1 = LSTM(latent_dim, activation='relu', return_sequences=True)
lstm_dec_2 = LSTM(32, activation='relu', return_sequences=True)
time_dis = TimeDistributed(Dense(n_features, activation='softmax'))  # <-- Does this make sense here?

z = repeater(lstm_enc_2)
h = lstm_dec_1(z)
decoded_h = lstm_dec_2(h)
decoded = time_dis(decoded_h)

ae = Model(inputs, decoded)
ae.compile(loss='categorical_crossentropy', optimizer='adam')  # <-- Does this make sense here?
Or should I, for some reason, still use 'relu' and 'mse' as the last activation function and loss function?
Any input is appreciated.
If I read it correctly, N is one-hot encoded and it sounds like you want to do classification, not regression.
Since y is one-hot encoded, using categorical_crossentropy is correct.
If you have more than 4 classes in y, you can instead use integer encodings with sparse_categorical_crossentropy, which converts your y values to one-hot matrices on the fly.
mse is better suited to regression.
As the last activation, since you have a classification task, you may want to use softmax, which outputs a probability for each of your y classes.
As far as I know, you normally do not use relu in the last layer; for a regression task, a sigmoid output is generally preferred.
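For concreteness, a minimal sketch of how the two target encodings map onto the loss choice (the shapes and dimensions here are illustrative, not your exact model):

from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

timesteps, n_features, latent_dim = 10, 6, 3

inputs = Input(shape=(timesteps, n_features))
encoded = LSTM(latent_dim)(inputs)
decoded = LSTM(32, return_sequences=True)(RepeatVector(timesteps)(encoded))
outputs = TimeDistributed(Dense(n_features, activation='softmax'))(decoded)
ae = Model(inputs, outputs)

# one-hot targets of shape (batch, timesteps, n_features):
ae.compile(loss='categorical_crossentropy', optimizer='adam')

# integer-encoded targets of shape (batch, timesteps) would instead use:
# ae.compile(loss='sparse_categorical_crossentropy', optimizer='adam')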
I am trying to train a simple LSTM to fit a line. My hypothesis is that I should be able to fit a linearly decreasing trend with zero input since the LSTM can decide how much it listens to its input vs. internal state, and can thus learn to just operate on the internal state. Basically a degenerate case for testing whether the LSTM can fit an expected result with zero input.
I create my input and target data:
import numpy as np

seq_len = 1000
x_train = np.zeros((1, seq_len, 1))  # [batch_size, seq_len, num_feat]
target = np.linspace(100, 0, num=seq_len).reshape(1, -1, 1)
I create a pretty simple network:
from keras.models import Model
from keras.layers import LSTM, Dense, Input, TimeDistributed
x_in = Input((seq_len, 1))
seq1 = LSTM(8, return_sequences=True)(x_in)
dense1 = TimeDistributed(Dense(8))(seq1)
seq2 = LSTM(8, return_sequences=True)(dense1)
dense2 = TimeDistributed(Dense(8))(seq2)
out = TimeDistributed(Dense(1))(dense2)
model = Model(inputs=x_in, outputs=out)
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(x_train, target, batch_size=1, epochs=1000,
validation_split=0.)
I also created a custom callback that calls model.predict(x_train) after every epoch and appends the results to an array so I can see how the model's output evolves over time (a sketch of that callback is below). Basically, the model just learns to predict a constant value which gradually (asymptotically) approaches the mean of my target line (the target line is in red; not sure why the legend didn't show).
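The per-epoch logging looks roughly like this (a minimal sketch of that kind of callback, not my exact code):

import tensorflow as tf

class PredictionLogger(tf.keras.callbacks.Callback):
    """Store model.predict(x) after every epoch so the fit can be visualised later."""
    def __init__(self, x):
        super().__init__()
        self.x = x
        self.predictions = []

    def on_epoch_end(self, epoch, logs=None):
        self.predictions.append(self.model.predict(self.x, verbose=0))

# usage:
# logger = PredictionLogger(x_train)
# history = model.fit(x_train, target, batch_size=1, epochs=1000,
#                     callbacks=[logger])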
So basically nothing is driving my response to fit the actual line, I'm just gradually approaching the mean of the line. I suspect I am not getting any gradient with respect to time (data index), just an average gradient over time. But I would have thought LSTM losses would automagically give you gradient through time.
I've tried:
different activation functions for the LSTM layers (None, 'relu' for both the regular activation and recurrent activation)
different optimizers ('nadam', 'adadelta', 'rmsprop')
the 'mean_absolute_error' loss function, which I didn't expect to improve the results, and it behaved about the same
passing sequences of random numbers drawn from a normal distribution as input
replacing LSTM with GRU
Nothing seems to do it.
Anybody have a suggestion as to how I can force this thing to train on the gradient as a function of my sequence index, i.e. g(t)? Or any other suggestions on how I can get this to work?
Note: with the trend as shown, if the LSTM results in a constant value at exactly the mean (50), the minimum mean absolute error will be 25 and the minimum mean squared error will be about 835.8. So if we don't see any better than that, we probably aren't fitting the line, just the mean.
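These baseline numbers can be checked directly, for example:

import numpy as np

target = np.linspace(100, 0, num=1000)
const = np.full_like(target, target.mean())   # constant prediction at the mean (50)

print(np.abs(target - const).mean())    # ~25, the best MAE a constant can achieve
print(((target - const) ** 2).mean())   # ~835, the best MSE a constant can achieve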
Just some references in case you run this yourself.
LSTM is supposed to be the right tool to capture path-dependency in time-series data.
I decided to run a simple experiment (simulation) to assess the extent to which LSTM is better able to understand path-dependency.
The setting is very simple. I just simulate a bunch (N=100) of paths coming from 4 different data-generating processes. Two of these processes represent a real increase and a real decrease, while the other two represent fake trends that eventually revert to zero.
The following plot shows the simulated paths for each category:
The candidate machine learning algorithm will be given the first 8 values of the path ( t in [1,8] ) and will be trained to predict the subsequent movement over the last 2 steps.
In other words:
the feature vector is X = (p1, p2, p3, p4, p5, p6, p7, p8)
the target is y = p10 - p8
I compared LSTM with a simple Random Forest model with 20 estimators. Here are the definitions and the training of the two models, using Keras and scikit-learn:
# LSTM
from keras.models import Sequential
from keras.layers import LSTM

H = 8  # number of time steps fed to the models

model = Sequential()
model.add(LSTM((1), batch_input_shape=(None, H, 1), return_sequences=True))
model.add(LSTM((1), return_sequences=False))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_X_LS, train_y_LS, epochs=100,
                    validation_data=(vali_X_LS, vali_y_LS), verbose=0)

# Random Forest
from sklearn.ensemble import RandomForestRegressor

RF = RandomForestRegressor(random_state=0, n_estimators=20)
RF.fit(train_X_RF, train_y_RF)
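For completeness, the training arrays above could be assembled roughly like this (the paths here are placeholders; the real data-generating processes are in the code linked below):

import numpy as np

N, T, H = 100, 10, 8
# placeholder paths: one row per simulated series of length 10
paths = np.cumsum(np.random.normal(size=(N, T)), axis=1) / 100

X = paths[:, :H]               # features: the first 8 values (p1, ..., p8)
y = paths[:, 9] - paths[:, 7]  # target: p10 - p8

train_X_RF, train_y_RF = X, y            # Random Forest takes 2-D input
train_X_LS = X[..., np.newaxis]          # LSTM expects (samples, H, 1)
train_y_LS = y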
The out-of-sample results are summarized by the following scatter plots:
As you can see, the Random Forest model is clearly outperforming the LSTM. The latter seems unable to distinguish between the real and the fake trends.
Do you have any idea to explain why this is happening?
How would you modify the LSTM model to make it better at this problem?
Some remarks:
The data points are divided by 100 to make sure gradients do not explode
I tried to increase the sample size, but I noticed no differences
I tried to increase the number of epochs over which the LSTM is trained, but I noticed no differences (the loss becomes stagnant after a bunch of epochs)
You can find the code I used to run the experiment here
Update:
Thanks to SaTa's reply, I changed the model and obtained much better results:
# Updated LSTM Model
model = Sequential()
model.add(LSTM((8), batch_input_shape=(None, H, 1), return_sequences=False))
model.add(Dense(4))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
Still, the Random Forest model does better. The point is that RF seems to understand that, conditional on the class, a higher p8 predicts a lower outcome p10 - p8 (and vice versa) because of the way the noise is added. The LSTM seems to miss this, so it predicts the class rather well, but we still see that within-class downward-sloping pattern in the final scatter plot.
Any suggestion to improve on that?
I don't expect LSTM to win all the battles against traditional methods, but I do expect it to perform well for the problem you have posed. Here are a couple of things you can try:
1) Increase the number of hidden units in the first layer.
model.add(LSTM((32), batch_input_shape=(None, H, 1), return_sequences=True))
2) The output activation of an LSTM layer is tanh by default, which limits the output to (-1, 1), as you can see in the right plot. I recommend either adding a Dense layer or using an LSTM with linear activation on the output. Like this:
model.add(LSTM((1), return_sequences=False, activation='linear'))
Or
model.add(LSTM((16), return_sequences=False))
model.add(Dense(1))
Try the above with the 10K samples that you have.
My intention is to extract the kernel weights used during the convolution and then perform the forward pass on an image to classify it. This is easy to do using the Keras API, but it is a requirement of my master's thesis: I want to implement the CNN on an FPGA for testing/classification only.
Instead of using the Keras API:
1/ I will write plain code where I give my preprocessed image as input,
2/ I will write the convolution algorithm and feed it the extracted kernel information to perform the convolution,
3/ I will write the algorithm for Flatten, and
4/ using the Dense algorithm I will predict the class.
My queries are:
1/ What information does layer.get_weights() actually give? Is it the kernel weights that will be used for the convolution?
2/ If I want to do the classification with the extracted weights, how should I approach it?
The following is my model (for simplicity I have written a model with a minimal number of layers):
from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense
from keras.optimizers import SGD

input_shape = (80, 80, 1)  # grayscale images, 80 x 80

def cnn_model():
    model = Sequential()
    model.add(Conv2D(1, (3, 3), padding='same',
                     input_shape=input_shape,
                     activation='relu'))
    model.add(Flatten())
    model.add(Dense(num_classes, activation='softmax'))
    return model

model = cnn_model()

lr = 0.01
sgd = SGD(lr=lr, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='categorical_crossentropy',
              optimizer=sgd,
              metrics=['accuracy'])
The input images are grayscale with width and height of 80 x 80.
I have trained my model using the following code:
from keras.callbacks import LearningRateScheduler, ModelCheckpoint

def lr_schedule(epoch):
    return lr * (0.1 ** int(epoch / 10))

batch_size = batch_size
epochs = nb_epoch

model.fit(X_train, Y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(X_test, Y_test),
          callbacks=[LearningRateScheduler(lr_schedule),
                     ModelCheckpoint('path_to_save_model/model.h5',
                                     save_best_only=True)])
I have extracted the layers' weights using:
import numpy as np

weight_list = []
for lay in model.layers:
    name = lay.name
    weight = lay.get_weights()
    print(name, " layer weight is:\n\n", weight, "\n\n")
    weight_list.append(weight)

weight_array = np.array(weight_list)
print("weight_array's first element is: \n\n", weight_array[0], "\n\n")
The output of weight_array[0] is:
[array([[[[ 0.3856341 ]],
[[-0.35276324]],
[[-0.51678646]]],
[[[-0.62636113]],
[[ 0.43428165]],
[[-0.26765126]]],
[[[ 0.461921 ]],
[[-0.14468761]],
[[-0.3061749 ]]]], dtype=float32), array([-0.1087065], dtype=float32)]
Any suggestions would be appreciated.
1) What information does layer.get_weights() actually give? Is it the kernel weights that will be used for the convolution?
For convolutional layers, layer.get_weights() returns [kernel, bias].
2) If I want to do the classification with the extracted weights, how should I approach it?
Replicate each stage of your network; the mathematics of each is well documented. Pay close attention to the exact operation being performed: for example, 'convolution' in deep learning is not quite the same as the standard mathematical convolution (the kernel is not flipped, so it is really a cross-correlation). I suggest you pass a known input through the network and check that you get the same answers at every stage.
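For illustration, here is a minimal NumPy sketch of such a forward pass for the single-filter model above (an untested sketch; the layer indices and shapes assume the weight_list extracted earlier):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def conv2d_same(image, kernel, bias):
    """'Same'-padded single-channel convolution as used in deep learning
    (cross-correlation: the kernel is not flipped)."""
    kh, kw = kernel.shape[:2]
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode='constant')
    out = np.zeros_like(image, dtype=np.float32)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel[:, :, 0, 0]) + bias[0]
    return out

# hypothetical usage with the weights extracted above:
# conv_kernel, conv_bias   = weight_list[0]   # Conv2D: shapes (3, 3, 1, 1) and (1,)
# dense_kernel, dense_bias = weight_list[2]   # Dense (Flatten has no weights)
# img: (80, 80) grayscale array, preprocessed the same way as during training
# feat = relu(conv2d_same(img, conv_kernel, conv_bias))
# logits = feat.reshape(-1) @ dense_kernel + dense_bias
# probs = softmax(logits)
# predicted_class = int(np.argmax(probs))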
I will try to explain the solution I arrived at and what I understood regarding my question. I am giving here the solution that I tried; most of the information I took from here.
In a model file, the convolution kernel weights, convolution bias, Dense layer weights and Dense layer bias are stored. If anyone wants to write a forward pass from scratch in NumPy/Python, or as a function in C++, these kernel weights are what is needed. Detailed information is given in the GitHub link.