I have a bunch of images that look like this of someone playing a videogame (a simple game I created in Tkinter):
The idea of the game is that the user controls the box at the bottom of the screen in order to dodge the falling balls (they can only dodge left and right).
My goal is to have the neural network output the position of the player along the bottom of the screen. If they're all the way to the left, the network should output 0; if they're in the middle, 0.5; all the way to the right, 1; and values in between for intermediate positions.
My images are 300x400 pixels. I stored my data very simply. I recorded each of the images and position of the player as a tuple for each frame in a 50-frame game. Thus my result was a list in the form [(image, player position), ...] with 50 elements. I then pickled that list.
So in my code I try to create an extremely basic feed-forward network that takes in the image and outputs a value between 0 and 1 representing where the box on the bottom of the image is. But my neural network is only outputting 1s.
What should I change in order to get it to train and output values close to what I want?
Of course, here is my code:
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0:1]

def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = [list(val) for val in zip(*data)]
    for index, image in enumerate(inputs):
        # convert the PIL images into numpy arrays so Keras can process them
        inputs[index] = pil_image_to_np_array(image)
    return (inputs, outputs)
if __name__ == "__main__":
# fix random seed for reproducibility
np.random.seed(7)
# load data
# data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
with open("position_data1.pkl", "rb") as pickled_data:
data = pickle.load(pickled_data)
X, Y = data_to_training_set(data)
# get the width of the images
width = X[0].shape[1] # == 400
# convert the player position (a value between 0 and the width of the image) to values between 0 and 1
for index, output in enumerate(Y):
Y[index] = output / width
# flatten the image inputs so they can be passed to a neural network
for index, inpt in enumerate(X):
X[index] = np.ndarray.flatten(inpt)
# keras expects an array (not a list) of image-arrays for input to the neural network
X = np.array(X)
Y = np.array(Y)
# create model
model = Sequential()
# my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
# keep it super simple by not having any deep learning
model.add(Dense(1, input_dim=120000, activation='sigmoid'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
# Fit the model
model.fit(X, Y, epochs=15, batch_size=10)
# see what the model is doing
predictions = model.predict(X, batch_size=10)
print(predictions) # this prints all 1s! # TODO fix
EDIT: print(Y) gives me:
so it's definitely not all zeroes.
Of course, a deeper model might give you better accuracy, but since your images are simple, a fairly shallow model with only one hidden layer should give medium to high accuracy. So here are the modifications you need to make this happen:
1. Make sure X and Y are of type float32 (currently, X is of type uint8):
X = np.array(X, dtype=np.float32)
Y = np.array(Y, dtype=np.float32)
2. When training a neural network it is much better to normalize the training data. Normalization helps the optimization process go smoothly and speeds up convergence to a solution. It also prevents large values from causing large gradient updates, which would be disruptive. Usually, the values of each feature in the input data should fall in a small range, with two common ranges being [-1, 1] and [0, 1]. Therefore, to make sure all values fall roughly in the range [-1, 1], we subtract each feature's mean and divide by its standard deviation:
X_mean = X.mean(axis=0)
X -= X_mean
X_std = X.std(axis=0)
X /= X_std + 1e-8 # add a very small constant to prevent division by zero
Note that we are normalizing each feature (i.e. each pixel in this case), not each image. When you want to predict on new data, i.e. in inference or testing mode, you need to subtract X_mean from the test data and divide it by X_std (you should NEVER subtract the test data's own mean or divide by its own standard deviation; rather, use the mean and std of the training data):
X_test -= X_mean
X_test /= X_std + 1e-8
3. If you apply the changes in points one and two, you might notice that the network no longer predicts only ones or only zeros. Rather, it shows some faint signs of learning and predicts a mix of zeros and ones. This is not bad, but it is far from good and we have high expectations! The predictions should be much better than a mix of only zeros and ones. Here you should take into account the (forgotten!) learning rate. Since the network has a relatively large number of parameters for a relatively simple problem (and there are only a few training samples), you should choose a smaller learning rate to smooth the gradient updates and the learning process:
from keras import optimizers
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.0001))
You would notice the difference: the loss value reaches around 0.01 after 10 epochs, and the network no longer predicts a mix of zeros and ones; rather, the predictions are much more accurate and close to what they should be (i.e. Y).
4. Don't forget! We have high (logical!) expectations. So, how can we do better without adding any new layers to the network (obviously, we assume that adding more layers might help!!)?
4.1. Gather more training data.
4.2. Add weight regularization. Common ones are L1 and L2 regularization (I highly recommend the Jupyter notebooks of the book Deep Learning with Python, written by François Chollet, the creator of Keras. Specifically, here is the one which discusses regularization.) See the sketch below for what this could look like in Keras.
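For illustration, here is a minimal sketch of point 4.2 applied to the single-layer model above; the regularization factor (0.01) is just a placeholder you would need to tune:

from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers, regularizers

# same single-layer model as above, but with L2 weight regularization on the kernel
model = Sequential()
model.add(Dense(1, input_dim=120000, activation='sigmoid',
                kernel_regularizer=regularizers.l2(0.01)))
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.0001))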
5. You should always evaluate your model in a proper and unbiased way. Evaluating it on the training data (that you used to train it) does not tell you anything about how well your model would perform on unseen (i.e. new or real-world) data points (e.g. consider a model which stores or memorizes all the training data; it would perform perfectly on the training data, but it would be a useless model and perform poorly on new data). So we should have separate training and test datasets: we train the model on the training data and evaluate it on the test (i.e. new) data. However, in the process of coming up with a good model you perform lots of experiments: for example, you first change the type and number of layers, train the model, and then evaluate it on the test data to make sure it is good. Then you change something else, say the learning rate, train again, and evaluate again on the test data... To make it short, these cycles of tuning and evaluation cause a kind of over-fitting on the test data. Therefore, we need a third dataset called validation data (read more: What is the difference between test set and validation set?):
# first shuffle the data to make sure it isn't in any particular order
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
Y = Y[indices]
# you have 200 images
# we select 100 images for training,
# 50 images for validation and 50 images for test data
X_train = X[:100]
X_val = X[100:150]
X_test = X[150:]
Y_train = Y[:100]
Y_val = Y[100:150]
Y_test = Y[150:]
# train and tune the model
# you can attempt train and tune the model multiple times,
# each time with different architecture, hyper-parameters, etc.
model.fit(X_train, Y_train, epochs=15, batch_size=10, validation_data=(X_val, Y_val))
# only and only after completing the tuning of your model
# you should evaluate it on the test data for just one time
model.evaluate(X_test, Y_test)
# after you are satisfied with the model performance
# and want to deploy your model for production use (i.e. real world)
# you can train your model once more on the whole data available
# with the best configurations you have found out in your tunings
model.fit(X, Y, epochs=15, batch_size=10)
(Actually, when we have little training data available, it would be wasteful to set aside separate validation and test data from the whole available data. In this case, and if the model is not computationally expensive, instead of setting aside a fixed validation set one can use cross-validation: K-fold cross-validation, or iterated K-fold cross-validation when you have very few data samples. A minimal sketch of K-fold cross-validation is below.)
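As a rough sketch, assuming a hypothetical helper build_model() that returns a freshly compiled copy of the model described above (so each fold starts from scratch):

from sklearn.model_selection import KFold
import numpy as np

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
val_losses = []
for train_idx, val_idx in kfold.split(X):
    model = build_model()  # hypothetical helper returning a new compiled model
    model.fit(X[train_idx], Y[train_idx], epochs=15, batch_size=10, verbose=0)
    val_losses.append(model.evaluate(X[val_idx], Y[val_idx], verbose=0))
print("mean validation loss:", np.mean(val_losses))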
It is around 4 AM at the time of writing this answer and I am feeling sleepy, but I would like to mention one more thing which is not directly related to your question: by using the NumPy library and its functionality you can write more concise and efficient code and also save yourself a lot of time. So make sure you practice using it more, as it is heavily used in the machine-learning community and libraries. To demonstrate this, here is the same code you have written, but making more use of NumPy (note that I have not applied all the changes mentioned above in this code):
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0]

def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = zip(*data)
    inputs = [pil_image_to_np_array(image) for image in inputs]
    inputs = np.array(inputs, dtype=np.float32)
    outputs = np.array(outputs, dtype=np.float32)
    return (inputs, outputs)
if __name__ == "__main__":
# fix random seed for reproducibility
np.random.seed(7)
# load data
# data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
with open("position_data1.pkl", "rb") as pickled_data:
data = pickle.load(pickled_data)
X, Y = data_to_training_set(data)
# get the width of the images
width = X.shape[2] # == 400
# convert the player position (a value between 0 and the width of the image) to values between 0 and 1
Y /= width
# flatten the image inputs so they can be passed to a neural network
X = np.reshape(X, (X.shape[0], -1))
# create model
model = Sequential()
# my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
# keep it super simple by not having any deep learning
model.add(Dense(1, input_dim=120000, activation='sigmoid'))
# Compile model
model.compile(loss='mean_squared_error', optimizer='adam')
# Fit the model
model.fit(X, Y, epochs=15, batch_size=10)
# see what the model is doing
predictions = model.predict(X, batch_size=10)
print(predictions) # this prints all 1s! # TODO fix
I am training a GCN (Graph Convolutional Network) on the Cora dataset.
The Cora dataset has the following attributes:
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Number of training nodes: 140
Training node label rate: 0.05
Is undirected: True
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Since my code is very long, I only put the relevant parts of my code here. Firstly, I split the Cora dataset as follows:
def to_mask(index, size):
    mask = torch.zeros(size, dtype=torch.bool)
    mask[index] = 1
    return mask

def cora_splits(data, num_classes):
    indices = []
    for i in range(num_classes):
        # returns all indices of the elements = i from data.y tensor
        index = (data.y == i).nonzero().view(-1)
        # returns a random permutation of integers from 0 to index.size(0).
        index = index[torch.randperm(index.size(0))]
        # indices is a list of tensors and it has a length of 7
        indices.append(index)
    # select 20 nodes from each class for training
    train_index = torch.cat([i[:20] for i in indices], dim=0)
    rest_index = torch.cat([i[20:] for i in indices], dim=0)
    rest_index = rest_index[torch.randperm(len(rest_index))]
    data.train_mask = to_mask(train_index, size=data.num_nodes)
    data.val_mask = to_mask(rest_index[:500], size=data.num_nodes)
    data.test_mask = to_mask(rest_index[500:], size=data.num_nodes)
    return data
The training function is as follows (taken from here with a few modifications):
def train(model, optimizer, data, epoch):
    t = time.time()
    model.train()
    optimizer.zero_grad()
    output = model(data)
    loss_train = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
    acc_train = accuracy(output[data.train_mask], data.y[data.train_mask])
    loss_train.backward()
    optimizer.step()
    loss_val = F.nll_loss(output[data.val_mask], data.y[data.val_mask])
    acc_val = accuracy(output[data.val_mask], data.y[data.val_mask])

def accuracy(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)
When I ran my code with 200 epochs in 10 runs I got:
tensor([0.7690, 0.8030, 0.8530, 0.8760, 0.8600, 0.8550, 0.8850, 0.8580, 0.8940, 0.8830])
Val Loss: 0.5974, Test Accuracy: 0.854 ± 0.039
where each value in the tensor is the model accuracy of one run, and the mean accuracy of those 10 runs is 0.854 with a standard deviation of ± 0.039.
As can be observed, the accuracy from the first run to the 10th one increases substantially. Therefore, I think the model is overfitting. One reason for overfitting is that in the code, the test data has been seen by the model during training, since in the train function there is a line output = model(data), so the model is trained over the whole data. What I intend to do is to train my model only on a part of the data (something similar to data[data.train_mask]), but the problem is I cannot pass data[data.train_mask], due to the forward function of the GCN model (from this repository):
def forward(self, data):
    x, edge_index = data.x, data.edge_index
    x = F.relu(self.conv1(x, edge_index))
    for conv in self.convs:
        x = F.relu(conv(x, edge_index))
    x = F.relu(self.lin1(x))
    x = F.dropout(x, p=0.5, training=self.training)
    x = self.lin2(x)
    return F.log_softmax(x, dim=-1)
If I pass data[data.train_mask] to the GCN model, then in the line x, edge_index = data.x, data.edge_index of the above forward function, x and edge_index cannot be retrieved from data[data.train_mask]. Therefore, I need to find a way to split the Cora dataset so that I can pass a specific part of it, with the nodes, edge index, and other attributes, to the model. My question is how to do it?
Also, any suggestion about k-fold cross validation is much appreciated.
I guess you are a little confused by the nature of transductive learning and the question you asked doesn't actually address the problem you are facing.
As can be observed, the accuracy from the first run to the 10th one increases substantially. Therefore, I think the model is overfitting.
Not necessarily; increasing test accuracy can be normal behavior while your model is learning from the training samples. Learning can last for several dozen epochs due to the complexity and non-convexity of the loss function. The clearest sign of overfitting is when your training accuracy increases while your test accuracy decreases significantly.
One reason for overfitting is that in the code, the test data has been seen by the model during training, since in the train function there is a line output = model(data), so the model is trained over the whole data.
The model indeed has seen the entire graph (adjacency matrix) during training, but it only sees the labels of nodes in the training set and knows nothing about the labels of nodes in the test set. This is exactly what transductive learning does.
In the end, if you are 100% sure you want to avoid the paradigm of transductive learning, then you might need to write your own split algorithm to achieve that (a rough sketch follows below). But I would like to point out that in real-world use cases, transduction is perfectly suitable. An example is predicting potential links between social-network users, where we have the whole network structure as input and simply want to run edge prediction: that is transduction. So it doesn't make a lot of sense to avoid it.
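If you really do want to go that way with PyTorch Geometric, one possible starting point (a rough, untested sketch, not your actual split algorithm) is to build the induced subgraph of the training nodes with torch_geometric.utils.subgraph. Note how every edge touching a non-training node is dropped, which is exactly the information transduction keeps:

import torch
from torch_geometric.utils import subgraph

# indices of the training nodes, taken from the existing boolean mask
train_idx = data.train_mask.nonzero().view(-1)
# induced subgraph: only edges between training nodes survive; relabel_nodes
# renumbers them to match data.x[data.train_mask]
train_edge_index, _ = subgraph(train_idx, data.edge_index,
                               relabel_nodes=True, num_nodes=data.num_nodes)
train_x = data.x[data.train_mask]
train_y = data.y[data.train_mask]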
Depending on your task, you could take a look at how StellarGraph's EdgeSplitter class (docs) and scikit-learn's train_test_split function (docs) achieve the split.
Node classification
If your task is node classification, this Node classification with Graph Convolutional Network (GCN) notebook is a good example of how to load data and do a train-test split. It uses the Cora dataset as an example. The most important steps are the following:
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()
train_subjects, test_subjects = model_selection.train_test_split(
    node_subjects, train_size=140, test_size=None, stratify=node_subjects
)
val_subjects, test_subjects = model_selection.train_test_split(
    test_subjects, train_size=500, test_size=None, stratify=test_subjects
)
train_gen = generator.flow(train_subjects.index, train_targets)
val_gen = generator.flow(val_subjects.index, val_targets)
test_gen = generator.flow(test_subjects.index, test_targets)
Basically, it's the same as a train-test split for a normal classification task, except that what we split here are nodes.
Edge classification
If your task is edge classification, you could have a look at this Link prediction example: GCN on the Cora citation dataset. The most relevant code for the train-test split is:
# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G, and obtain the
# reduced graph G_test with the sampled links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global", keep_connected=True
)
# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G_test, and obtain the
# reduced graph G_train with the sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
    p=0.1, method="global", keep_connected=True
)
# For training we create a generator on the G_train graph, and make an
# iterator over the training links using the generator’s flow() method:
train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)
test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = test_gen.flow(edge_ids_test, edge_labels_test)
Here the splitting algorithm behind the EdgeSplitter class (docs) is more complex, since it needs to maintain the graph structure while doing the split, for example keeping the graph connected. For more details, see the source code of EdgeSplitter.
I wanted to simulate classifying whether a student will pass or fail a course depending on training data with a single input, namely a student's exam score.
I start by creating a data set of test scores for 1000 students, normally distributed with a mean of 80.
I then created a classification "1" (passing) for the top 300 students, which based on the seed is a test score of 80.87808591534409.
(Obviously we don't really need machine learning for this, as this means anyone with a test score higher than 80.87808591534409 passes the class. But I want to build a model that accurately predicts this, so that I can start adding new input features and expand my classification beyond pass/fail.)
Next I created a test set in the same way, and classified these students using the classification threshold previously computed for the training set (80.87808591534409).
Then, as you can see below or in the linked Jupyter notebook, I created a model that takes one input feature and returns two results (a probability for the zero-index classification (fail) and a probability for the one-index classification (pass)).
Then I trained it on the training data set. But as you can see the loss never really improves per iteration. It just kind of hovers at 0.6.
Finally, I ran the trained model on the test data set and generated predictions.
I plotted the results as follows:
The green line represents the actual (not the predicted) classifications of the test set.
The blue line represents the probability of 0 index outcome (failing) and the orange line represents the probability of the 1 index outcome (passing).
As you can see they remain flat. If my model is working, I would have expected these lines to trade places at the threshold where the actual data switches from failing to passing.
I imagine I could be doing a lot of things wrong, but if anyone has time to look at the code below and give me some advice I would be grateful.
I've created a public working example of my attempt here.
And I've included the current code below.
The problem I'm having is that the model training seems to get stuck in computing the loss, and as a result it reports that every student in my testing set fails (all 1,000 students), no matter what their test result is, which is obviously wrong.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")
## Create data
# Set Seed
np.random.seed(0)
# Create 1000 test scores normally distributed with a standard deviation of 2 and a mean of 80
train_exam_scores = np.sort(np.random.normal(80,2,1000))
# Create classification; top 300 pass the class (classification of 1), bottom 700 do not pass (classification of 0)
train_labels = np.array([0. for i in range(700)])
train_labels = np.append(train_labels, [1. for i in range(300)])
print("Point at which test scores correlate with passing class: {}".format(train_exam_scores[701]))
print("computed point with seed of 0 should be: 80.87808591534409")
print("Plot point at which test scores correlate with passing class")
## Plot view
plt.plot(train_exam_scores)
plt.plot(train_labels)
plt.show()
#create another set of 1000 test scores with different seed (10)
np.random.seed(10)
test_exam_scores = np.sort(np.random.normal(80,2,1000))
# create classification labels for the new test set based on passing rate of 80.87808591534409 determined above
test_labels = np.array([])
for index, i in enumerate(test_exam_scores):
    if (i >= 80.87808591534409):
        test_labels = np.append(test_labels, 1)
    else:
        test_labels = np.append(test_labels, 0)
plt.plot(test_exam_scores)
plt.plot(test_labels)
plt.show()
print(tf.shape(train_exam_scores))
print(tf.shape(train_labels))
print(tf.shape(test_exam_scores))
print(tf.shape(test_labels))
train_dataset = tf.data.Dataset.from_tensor_slices((train_exam_scores, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_exam_scores, test_labels))
BATCH_SIZE = 5
SHUFFLE_BUFFER_SIZE = 1000
train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)
# view example of feature to label correlation, values above 80.87808591534409 are classified as 1, those below are classified as 0
features, labels = next(iter(train_dataset))
print(features)
print(labels)
# create model with first layer to take 1 input feature per student; and output layer of two values (percentage of 0 or 1 classification)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(1,)),  # input shape required
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(2)
])
# Test untrained model on training features; should produce nonsense results
predictions = model(features)
print(tf.nn.softmax(predictions[:5]))
print("Prediction: {}".format(tf.argmax(predictions, axis=1)))
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
model.compile(optimizer=optimizer,
              loss=loss_object,
              metrics=['categorical_accuracy'])
#train model
model.fit(train_dataset,
          epochs=20,
          validation_data=test_dataset,
          verbose=1)
#make predictions on test scores from test_dataset
predictions = model.predict(test_dataset)
tf.nn.softmax(predictions[:1000])
tf.argmax(predictions, axis=1)
# I anticipate that the predictions would show a higher probability for index position [0] (classification 0, "did not pass")
#until it reaches a value greater than 80.87808591534409
# which in the test data with a seed of 10 should be the value at the 683 index position
# but at this point I would expect there to be a higher probability for index position [1] (classification 1), "did pass"
# because it is obvious from the data that anyone who scores higher than 80.87808591534409 should pass.
# Thus in the chart below I would expect the lines charting the probability to switch precisely at the point where the test classifications shift.
# However this is not the case. All predictions are the same for all 1000 values.
plt.plot(tf.nn.softmax(predictions[:1000]))
plt.plot(test_labels)
plt.show()
The main issue here: use the softmax activation in the last layer, not separately outside the model. Change the final layer to:
tf.keras.layers.Dense(2, activation="softmax")
Secondly, for two hidden layers with relu, 0.1 may be too high a learning rate. Try a lower rate of maybe 0.01 or 0.001.
Another thing to try is dividing the input by 100 to get inputs in the range [0, 1]. This makes training easier, since the update step does not modify the weights as heavily. See the sketch below combining these changes.
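Putting the three suggestions together, a sketch could look like this (the exact numbers are illustrative, not tuned):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(1,)),
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(2, activation="softmax")   # softmax inside the model
])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),   # lower learning rate
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

def scale(x, y):
    # reshape each batch of scalar scores to (batch, 1) and scale them into roughly [0, 1]
    return tf.reshape(x, (-1, 1)) / 100.0, y

model.fit(train_dataset.map(scale),
          epochs=20,
          validation_data=test_dataset.map(scale))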
To preface this, I have plenty of experience with python and moderate experience building and using machine learning networks. That being said, this is the first LSTM I have made aside from some of the cookie-cutter examples available, so any help is appreciated. I feel like this is a problem with a simple solution and that I have just been looking at this code for far too long to see it.
This model is made in a python3.5 venv using Keras with a tensorflow backend.
In short, I am trying to make predictions of some temporal data using the data itself as well as a few mathematical permutations of this data, creating four input features. I am building a time-series input from the prior 60 data points and specifying the prediction target to be 60 data points in the future.
Shape of complete training data (input)(target): (2476224, 60, 4) (2476224)
Shape of single data "point" (input)(target): (1, 60, 4) (1)
What appears to be happening is that the trained model has fit the trailing value of my input time-series (the current value) instead of the target I have provided it (60 cycles in the future).
What is interesting is that the loss function seems to be calculating according to the correct prediction target, yet the model is not converging to the proper solution.
I have no idea why the model should be doing this. My first thought was that I was preprocessing my data incorrectly and feeding it the wrong target. I have tested my input formatting of the data extensively and am pretty confident that I am providing the model with the correct target and input information.
In one instance, I increased the learning rate a tad such that the model converged to a local minimum. The testing loss at this convergence was very similar to the loss at my preferred learning rate (still quite high), but the predictions were still of the "current value". Why is this so?
Here is how I created my model:
def create_model():
    lstm_model = Sequential()
    lstm_model.add(CuDNNLSTM(100, batch_input_shape=(batch_size, time_step, train_input.shape[2]),
                             stateful=True, return_sequences=True,
                             kernel_initializer='random_uniform'))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(CuDNNLSTM(60))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation='relu'))
    lstm_model.add(Dense(1, activation='linear'))
    optimizer = optimizers.Adagrad(lr=params["lr"])
    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)
    return lstm_model
This is how I am pre-processing the data. The first function, build_timeseries, constructs my input-output pairs. I believe this is working correctly (but please correct me if I am wrong). The second function trims the pairs to fit the batch size. I do the exact same for the test input/target.
train_input, train_target = build_timeseries(train_input, time_step, pred_horiz, 0)
train_input = trim_dataset(train_input, batch_size)
train_target = trim_dataset(train_target, batch_size)
def build_timeseries(mat, TIME_STEPS, PRED_HORIZON, y_col_index):
    # y_col_index is the index of column that would act as output column
    dim_0 = mat.shape[0]  # num datasets
    dim_1 = mat.shape[1]  # num features
    dim_2 = mat.shape[2]  # num datapoints
    # Reformatted matrix
    mat = mat.swapaxes(1, 2)

    x = np.zeros((dim_0*(dim_2-PRED_HORIZON), TIME_STEPS, dim_1))
    y = np.zeros((dim_0*(dim_2-PRED_HORIZON),))
    k = 0
    for i in range(dim_0):  # Iterate through datasets
        for j in range(TIME_STEPS, dim_2-PRED_HORIZON):
            x[k] = mat[i, j-TIME_STEPS:j]
            y[k] = mat[i, j+PRED_HORIZON, y_col_index]
            k += 1
    print("length of time-series i/o", x.shape, y.shape)
    return x, y

def trim_dataset(mat, batch_size):
    no_of_rows_drop = mat.shape[0] % batch_size
    if (no_of_rows_drop > 0):
        return mat[no_of_rows_drop:]
    else:
        return mat
Lastly, this is how I call the actual model.
history = model.fit(train_input, train_target, epochs=params["epochs"], verbose=2, batch_size=batch_size,
                    shuffle=True, validation_data=(test_input, test_target), callbacks=[es, mcp])
As the model converges, I expect it to predict values close to the specified targets I fed it. Instead, however, its predictions align much more closely with the trailing value of the time-series data (the current value). On the other hand, the model appears to be evaluating the loss according to the specified target... Why is it working this way, and how can I fix it? Any help is appreciated.
I'm trying to build a NN with Keras and Tensorflow to predict the final chart position of a song, given a set of 5 features.
After playing around with it for a few days I realised that although my MAE was getting lower, this was because the model had just learned to predict the mean value of my training set for all input, and this was the optimal solution. (This is illustrated in the scatter plot below)
This is a random sample of 50 data points from my testing set vs what the network thinks they should be
At first I realised this was probably because my network was too complicated. I had one input layer with shape (5,) and a single node in the output layer, but then 3 hidden layers with over 32 nodes each.
I then stripped back the excess layers and moved to just a single hidden layer with a couple nodes, as shown here:
self.model = keras.Sequential([
    keras.layers.Dense(4,
                       activation='relu',
                       input_dim=num_features,
                       kernel_initializer='random_uniform',
                       bias_initializer='random_uniform'),
    keras.layers.Dense(1)
])
Training this with a gradient descent optimiser still results in exactly the same prediction being made the whole time.
Then it occurred to me that perhaps the actual problem I'm trying to solve isn't hard enough for the network, that maybe it's linearly separable. Since this would respond better to not having a hidden layer at all, essentially just doing regular linear regression, I tried that. I changed my model to:
inp = keras.Input(shape=(num_features,))
out = keras.layers.Dense(1, activation='relu')(inp)
self.model = keras.Model(inp,out)
This also changed nothing. My MAE and the predicted values are still all the same.
I've tried so many different things, different permutations of optimisation functions, learning rates, network configurations, and nothing can help. I'm pretty sure the data is good, but I've included a sample of it just in case.
chartposition,tagcount,dow,artistscore,timeinchart,finalpos
121,3925,5,35128,7,227
131,4453,3,85545,25,130
69,2583,4,17594,24,523
145,1165,3,292874,151,187
96,1679,5,102593,111,540
134,3494,5,1252058,37,370
6,34895,7,6824048,22,5
A sample of my dataset; finalpos is the value I'm trying to predict. The dataset contains ~40,000 records, split 80/20 training/testing.
def __init__(self, validation_split, num_features, should_log):
    self.should_log = should_log
    self.validation_split = validation_split
    inp = keras.Input(shape=(num_features,))
    out = keras.layers.Dense(1, activation='relu')(inp)
    self.model = keras.Model(inp, out)
    optimizer = tf.train.GradientDescentOptimizer(0.01)
    self.model.compile(loss='mae',
                       optimizer=optimizer,
                       metrics=['mae'])

def train(self, data, labels, plot=False):
    early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', patience=20)
    history = self.model.fit(data,
                             labels,
                             epochs=self.epochs,
                             validation_split=self.validation_split,
                             verbose=0,
                             callbacks=[PrintDot(), early_stop])
    if plot: self.plot_history(history)
All code relevant to constructing and training the network.
def normalise_dataset(df, mini, maxi):
    return (df - mini) / (maxi - mini)
Normalisation of the input data. Both my testing and training data are normalised to the max and min of the testing set
Graph of my training vs validation loss curves with the one-hidden-layer network and an Adam optimiser, learning rate 0.01.
Same graph but with linear regression and a gradient descent optimiser.
So I am pretty sure that your normalization is the issue: you are not normalizing by feature (as is the de facto industry standard), but across all data.
That means that if you have two different features with very different orders of magnitude/ranges, one of them dominates the normalization (in your case, compare timeinchart with artistscore).
Instead, you might want to normalize using something like scikit-learn's StandardScaler. Not only does this normalize per column (so you can pass all features at once), but it also scales to unit variance (which is an assumption about your data, but can potentially help, too).
To transform your data, use something along these lines
from sklearn.preprocessing import StandardScaler
import numpy as np
raw_data = np.array([[1,40], [2, 80]])
scaler = StandardScaler()
processed_data = scaler.fit_transform(raw_data)
# fit() calculates mean etc, transform() puts it to the new range.
print(processed_data) # returns [[-1, -1], [1,1]]
Note that you have two possibilities to normalize/standardize your training data:
Either scale the whole dataset at once and then split it into training and test sets afterwards,
or fit the scaler on the training data only, and then use that same scaler to transform your test data (see the sketch after this list).
Never fit_transform your test set separately from your training data!
Since you would then use potentially different mean/min/max values, you can end up with totally wrong predictions! In a sense, the StandardScaler is your definition of your "data source distribution", which is inherently still the same for your test set, even though it might be a subset that does not exactly follow the same properties (due to small sample size, etc.).
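A minimal sketch of that second option (train_features and test_features are placeholder names for your actual arrays):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)  # learn mean/std from training data only
test_features = scaler.transform(test_features)        # reuse the training statistics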
Additionally, you might want to use a more advanced optimizer, like Adam, or specify some momentum (0.9 is a good choice in practice, as a rule of thumb) for your SGD. For example:
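With the same TF 1.x API your code already uses (the values here are only illustrative):

import tensorflow as tf

# SGD with momentum, or Adam, as drop-in replacements for GradientDescentOptimizer
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
# or
optimizer = tf.train.AdamOptimizer(learning_rate=0.001)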
Turns out the error was a really stupid and easy to miss bug.
When I imported my dataset I shuffled it; however, I was accidentally applying the shuffling only to the labels, not to the dataset as a whole.
As a result, each label was being assigned to a completely random feature set, and of course the model didn't know what to do with this.
Thanks to #dennlinger for suggesting for me to look in the place where I eventually found this bug.
(I am testing my abilities to write short but effective questions so let me know how I do here)
I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:
[[[time_step_trial_0, feature, feature, ...]
[time_step_trial_0, feature, feature, ...]]
[[time_step_trial_1, feature, feature, ...]
[time_step_trial_1, feature, feature, ...]]
[[time_step_trial_2, feature, feature, ...]
[time_step_trial_2, feature, feature, ...]]]
The 1d portion of this 3d array holds a time step and all feature values that were observed at that time step. The 2d block contains all 1d arrays (time steps) that were observed in one trial. The 3d block contains all 2d blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).
For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Over a certain time interval recording time steps every 0.5 seconds, I form 1d arrays with each time step versus the recorded features recorded at that time step. Then I form a 2D array around all time steps corresponding to one Formula 1 race car's run on the track. I create a final 3D array holding all F1 cars and their time-series data. I want to train and test a model to detect anomalies in the F1 common trajectories on the course for new cars.
I am currently aware that TensorFlow models support 2d arrays for training and testing. I was wondering what procedures I would have to go through to be able to train and test the model on all of the independent trials (2d) contained in this 3d array. In addition, I will be adding more trials in the future. So what are the proper procedures to go through in order to continually update my model with the new data/trials to strengthen my LSTM?
Here is the model I was initially trying to replicate, for a purpose other than human activity: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. Another, more feasible model would be this one, which I would much rather look at for anomaly detection in time-series data: https://arxiv.org/abs/1607.00148. I want to build an anomaly detection model such that, given a set of non-anomalous time-series training data, we can detect anomalies in the test data where parts of the data over time are defined as "out of family."
I think for most LSTMs you're going to want to think of your data in this way (as it will be easy to use as input for the networks).
You'll have 3 dimension measurements:
feature_size = the number of different features (torque, velocity, etc.)
number_of_time_steps = the number of time steps collected for a single car
number_of_cars = the number of cars
It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).
You can arrange these matrices so that each row is an observation and each column is a different parameter (or the opposite, you may have to transpose the matrices, look at how your network input is formatted).
So each matrix is of size:
number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
To convert your array to this format, you can use this block of code (note that you can already access a single sample in your array with A[n], but this makes it so the shape of the accessed elements is what you expect):
import numpy as np
A = [[['car1', 'timefeatures1'], ['car1', 'timefeatures2']],
     [['car2', 'timefeatures1'], ['car2', 'timefeatures2']],
     [['car3', 'timefeatures1'], ['car3', 'timefeatures2']]]
easy_format = np.array(A)
Now you can get an individual sample with easy_format[n], where n is the sample you want.
easy_format[1] prints
array([['car2', 'timefeatures1'],
       ['car2', 'timefeatures2']],
      dtype='|S12')
easy_format[1].shape = (2,2)
Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence-to-sequence LSTM or RNN. Your original sequence is your time series for a given trial, and you're generating an intermediate set of weights (an embedding) that can recreate that original sequence with a low amount of error. You're doing this for all the trials. You will train this LSTM on a series of reasonably normal trials and get it to perform well (reconstruct the sequences accurately). You can then use this same set of embeddings to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
Check this repo for a sample of what you'd want, along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq. The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.
Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py
I have modified the code in it very slightly to work on the newest version of TensorFlow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1s and 0s. You would do this for a large subset of your data and then see whether other data is reconstructed accurately or not (a much higher error than average might imply an anomaly).
import numpy as np
import tensorflow as tf
learning_rate = 0.001
training_epochs = 30000
display_step = 100
hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5
test_data = np.random.choice([ 0, 1], size=(time_steps, samples, step_dims))
initializer = tf.random_uniform_initializer(-1, 1)
seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])
encoder_inputs = [tf.reshape(seq_input, [-1, step_dims])]
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")]
+ encoder_inputs[:-1])
targets = encoder_inputs
weights = [tf.ones_like(targets_t, dtype=tf.float32) for targets_t in targets]
cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)
y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]
loss = 0
for i in range(len(y_true)):
    loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    x = test_data
    for epoch in range(training_epochs):
        #x = np.arange(time_steps * samples * step_dims)
        #x = x.reshape((time_steps, samples, step_dims))
        feed = {seq_input: x}
        _, cost_value = sess.run([optimizer, loss], feed_dict=feed)
        if epoch % display_step == 0:
            print("logits")
            a = sess.run(y_pred, feed_dict=feed)
            print(a)
            print("labels")
            b = sess.run(y_true, feed_dict=feed)
            print(b)
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(cost_value))
    print("Optimization Finished!")
Your input shape and the corresponding model depend on what type of anomaly you want to detect. You can consider:
1. Feature only Anomaly:
Here you consider individual features and decide whether any of them is anomalous, without considering when it is measured. In your example, the feature vector [torque, speed, acceleration, ...] is an anomaly if one or more values is an outlier with respect to the other features. In this case your inputs should be of the form [batch, features].
2. Time-feature Anomaly:
Here your inputs are dependent on when you measure the feature. Your current feature may depend on the previous features measured over time. For example, there may be a feature whose value is an outlier if it appears at time 0 but not if it appears later in time. In this case you divide each of your trials into overlapping time windows and form a feature set of the form [batch, time_window, features]. A minimal windowing sketch follows below.
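As a rough illustration of forming those overlapping windows with plain NumPy (the trial shape, window size, and stride here are made up):

import numpy as np

def make_windows(trial, window_size, stride=1):
    # trial has shape (num_time_steps, num_features); the result has shape
    # (num_windows, window_size, num_features)
    return np.stack([trial[i:i + window_size]
                     for i in range(0, len(trial) - window_size + 1, stride)])

trial = np.random.rand(500, 4)                 # one hypothetical trial: 500 steps, 4 features
windows = make_windows(trial, window_size=60)
print(windows.shape)                           # (441, 60, 4)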
It should be very simple to start with (1) using an autoencoder: you train an autoencoder and, based on the error between input and output, you can choose a threshold like 2 standard deviations from the mean to determine whether a sample is an outlier or not (a sketch of this thresholding is below).
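A sketch of that thresholding step (autoencoder, X_train, and X_test are placeholder names for your trained model and data):

import numpy as np

# per-sample reconstruction error on the training data
train_errors = np.mean((X_train - autoencoder.predict(X_train)) ** 2, axis=1)
threshold = train_errors.mean() + 2 * train_errors.std()

# anything beyond the threshold on new data is flagged as a suspected outlier
test_errors = np.mean((X_test - autoencoder.predict(X_test)) ** 2, axis=1)
anomalies = test_errors > threshold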
For (2), you can follow the second paper you mentioned using a seq2seq model, where the decoder error will determine which features are outliers. You can check this for the implementation of such a model.