I wanted to simulate classifying whether a student will pass or fail a course based on training data with a single input feature, namely the student's exam score.
I start by creating a data set of test scores for 1000 students, normally distributed with a mean of 80.
I then created a classification "1" (passing) for the top 300 students, which based on the seed corresponds to a test score of 80.87808591534409.
(Obviously we don't really need machine learning for this, as it means anyone with a test score higher than 80.87808591534409 passes the class. But I want to build a model that accurately predicts this, so that I can start adding new input features and expand my classification beyond pass/fail.)
Next I created a test set in the same way, and classified these students using the classification threshold previously computed for the training set (80.87808591534409).
Then, as you can see below or in the linked Jupyter notebook, I created a model that takes one input feature and returns two outputs (a probability for the zero-index classification (fail) and a probability for the one-index classification (pass)).
Then I trained it on the training data set. But as you can see, the loss never really improves from iteration to iteration; it just hovers around 0.6.
Finally, I ran the trained model on the test data set and generated predictions.
I plotted the results as follows:
The green line represents the actual (not the predicted) classifications of the test set.
The blue line represents the probability of 0 index outcome (failing) and the orange line represents the probability of the 1 index outcome (passing).
As you can see, they remain flat. If my model were working, I would expect these lines to trade places at the threshold where the actual data switches from failing to passing.
I imagine I could be doing a lot of things wrong, but if anyone has time to look at the code below and give me some advice I would be grateful.
I've created a public working example of my attempt here.
And I've included the current code below.
The problem I'm having is that the model training seems to get stuck (the loss barely changes), and as a result it predicts that every student in my testing set fails (all 1,000 students), no matter what their test score is, which is obviously wrong.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")
## Create data
# Set Seed
np.random.seed(0)
# Create 1000 test scores normally distributed with a standard deviation of 2 and a mean of 80
train_exam_scores = np.sort(np.random.normal(80,2,1000))
# Create classification; top 300 pass the class (classification of 1), bottom 700 do not pass (classification of 0)
train_labels = np.array([0. for i in range(700)])
train_labels = np.append(train_labels, [1. for i in range(300)])
print("Point at which test scores correlate with passing class: {}".format(train_exam_scores[701]))
print("computed point with seed of 0 should be: 80.87808591534409")
print("Plot point at which test scores correlate with passing class")
## Plot view
plt.plot(train_exam_scores)
plt.plot(train_labels)
plt.show()
#create another set of 1000 test scores with different seed (10)
np.random.seed(10)
test_exam_scores = np.sort(np.random.normal(80,2,1000))
# create classification labels for the new test set based on passing rate of 80.87808591534409 determined above
test_labels = np.array([])
for index, i in enumerate(test_exam_scores):
    if (i >= 80.87808591534409):
        test_labels = np.append(test_labels, 1)
    else:
        test_labels = np.append(test_labels, 0)
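# note: the loop above is equivalent to a single vectorized comparison, e.g.
# test_labels = (test_exam_scores >= 80.87808591534409).astype(np.float64)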
plt.plot(test_exam_scores)
plt.plot(test_labels)
plt.show()
print(tf.shape(train_exam_scores))
print(tf.shape(train_labels))
print(tf.shape(test_exam_scores))
print(tf.shape(test_labels))
train_dataset = tf.data.Dataset.from_tensor_slices((train_exam_scores, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_exam_scores, test_labels))
BATCH_SIZE = 5
SHUFFLE_BUFFER_SIZE = 1000
train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)
# view example of feature to label correlation, values above 80.87808591534409 are classified as 1, those below are classified as 0
features, labels = next(iter(train_dataset))
print(features)
print(labels)
# create model with first layer to take 1 input feature per student; and output layer of two values (percentage of 0 or 1 classification)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(1,)),  # input shape required
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(2)
])
# Test untrained model on training features; should produce nonsense results
predictions = model(features)
print(tf.nn.softmax(predictions[:5]))
print("Prediction: {}".format(tf.argmax(predictions, axis=1)))
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
model.compile(optimizer=optimizer,
              loss=loss_object,
              metrics=['categorical_accuracy'])
#train model
model.fit(train_dataset,
          epochs=20,
          validation_data=test_dataset,
          verbose=1)
#make predictions on test scores from test_dataset
predictions = model.predict(test_dataset)
tf.nn.softmax(predictions[:1000])
tf.argmax(predictions, axis=1)
# I anticipate that the predictions would show a higher probability for index position [0] (classification 0, "did not pass")
#until it reaches a value greater than 80.87808591534409
# which in the test data with a seed of 10 should be the value at the 683 index position
# but at this point I would expect there to be a higher probability for index position [1] (classification 1), "did pass"
# because it is obvious from the data that anyone who scores higher than 80.87808591534409 should pass.
# Thus in the chart below I would expect the lines charting the probability to switch precisely at the point where the test classifications shift.
# However this is not the case. All predictions are the same for all 1000 values.
plt.plot(tf.nn.softmax(predictions[:1000]))
plt.plot(test_labels)
plt.show()
The main issue here: use the softmax activation in the last layer, not separately outside the model. Change the final layer to:
tf.keras.layers.Dense(2, activation="softmax")
Secondly, for two hidden layers with relu, 0.1 may be too high a learning rate. Try a lower rate, maybe 0.01 or 0.001.
Another thing to try is to divide the input by 100 to get inputs in the range [0, 1]. This makes training easier, since the update steps do not modify the weights as heavily.
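Putting the three suggestions together, a minimal sketch of the revised setup could look like the following (note that with a softmax output the loss should be built with from_logits=False; variable names are reused from the question):
# sketch combining the three suggestions: softmax output, smaller learning rate, inputs scaled to roughly [0, 1]
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(1,)),
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(2, activation="softmax")  # probabilities come straight out of the model
])
# the model now outputs probabilities, so the loss no longer expects raw logits
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
model.compile(optimizer=optimizer, loss=loss_object, metrics=['accuracy'])
# scale the exam scores before building the datasets so the inputs fall roughly in [0, 1]
train_dataset = tf.data.Dataset.from_tensor_slices((train_exam_scores / 100.0, train_labels)).shuffle(1000).batch(5)
test_dataset = tf.data.Dataset.from_tensor_slices((test_exam_scores / 100.0, test_labels)).batch(5)
model.fit(train_dataset, epochs=20, validation_data=test_dataset, verbose=1)
predictions = model.predict(test_dataset)  # already probabilities; no extra tf.nn.softmax needed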
I am training a GCN (Graph Convolutional Network) on Cora dataset.
The Cora dataset has the following attributes:
Number of graphs: 1
Number of features: 1433
Number of classes: 7
Number of nodes: 2708
Number of edges: 10556
Number of training nodes: 140
Training node label rate: 0.05
Is undirected: True
Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])
Since my code is very long, I only put the relevant parts of my code here. Firstly, I split the Cora dataset as follows:
def to_mask(index, size):
    mask = torch.zeros(size, dtype=torch.bool)
    mask[index] = 1
    return mask

def cora_splits(data, num_classes):
    indices = []
    for i in range(num_classes):
        # returns all indices of the elements = i from data.y tensor
        index = (data.y == i).nonzero().view(-1)
        # returns a random permutation of integers from 0 to index.size(0).
        index = index[torch.randperm(index.size(0))]
        # indices is a list of tensors and it has a length of 7
        indices.append(index)
    # select 20 nodes from each class for training
    train_index = torch.cat([i[:20] for i in indices], dim=0)
    rest_index = torch.cat([i[20:] for i in indices], dim=0)
    rest_index = rest_index[torch.randperm(len(rest_index))]
    data.train_mask = to_mask(train_index, size=data.num_nodes)
    data.val_mask = to_mask(rest_index[:500], size=data.num_nodes)
    data.test_mask = to_mask(rest_index[500:], size=data.num_nodes)
    return data
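For context, the split above is applied to the loaded dataset roughly like this (a sketch; loading via PyTorch Geometric's Planetoid loader is an assumption, since the loading code is omitted from this post):
from torch_geometric.datasets import Planetoid

dataset = Planetoid(root='data/Cora', name='Cora')  # assumed loader; not shown in the original code
data = cora_splits(dataset[0], dataset.num_classes)
print(data.train_mask.sum().item(), data.val_mask.sum().item(), data.test_mask.sum().item())  # 140, 500, and the rest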
The training function is as follows (taken from here with a few modifications):
def train(model, optimizer, data, epoch):
    t = time.time()
    model.train()
    optimizer.zero_grad()
    output = model(data)
    loss_train = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
    acc_train = accuracy(output[data.train_mask], data.y[data.train_mask])
    loss_train.backward()
    optimizer.step()
    loss_val = F.nll_loss(output[data.val_mask], data.y[data.val_mask])
    acc_val = accuracy(output[data.val_mask], data.y[data.val_mask])

def accuracy(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)
When I ran my code for 200 epochs in 10 runs, I got:
tensor([0.7690, 0.8030, 0.8530, 0.8760, 0.8600, 0.8550, 0.8850, 0.8580, 0.8940, 0.8830])
Val Loss: 0.5974, Test Accuracy: 0.854 ± 0.039
where each value in the tensor is the model accuracy of one run, and the mean accuracy over those 10 runs is 0.854 with a std of ± 0.039.
As can be observed, the accuracy increases substantially from the first run to the 10th one. Therefore, I think the model is overfitting. One reason for the overfitting could be that the test data has been seen by the model at training time, since in the train function there is a line output = model(data), so the model is run over the whole data. What I intend to do is to train my model only on a part of the data (something similar to data[data.train_mask]), but the problem is that I cannot pass data[data.train_mask], because of the forward function of the GCN model (from this repository):
def forward(self, data):
    x, edge_index = data.x, data.edge_index
    x = F.relu(self.conv1(x, edge_index))
    for conv in self.convs:
        x = F.relu(conv(x, edge_index))
    x = F.relu(self.lin1(x))
    x = F.dropout(x, p=0.5, training=self.training)
    x = self.lin2(x)
    return F.log_softmax(x, dim=-1)
If I pass data[data.train_mask] to the GCN model, then in the line x, edge_index = data.x, data.edge_index of the forward function above, x and edge_index cannot be retrieved from data[data.train_mask]. Therefore, I need to find a way to split the Cora dataset such that I can pass a specific part of it, with its nodes, edge index and other attributes, to the model. My question is: how can I do that?
Also, any suggestion about k-fold cross validation is much appreciated.
I guess you are a little confused by the nature of transductive learning, and the question you asked doesn't actually address the problem you are facing.
As can be observed, the accuracy increases substantially from the first
run to the 10th one. Therefore, I think the model is overfitting.
Not necessarily: increasing test accuracy can be normal behavior while your model is still learning from the training samples. The learning can last for several dozen epochs due to the complexity and non-convexity of the loss function. The best signal of overfitting is when your training accuracy increases but your test accuracy decreases significantly.
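As a quick check, one can track training and validation accuracy per epoch and look for the point where they diverge; a minimal sketch reusing the train and accuracy functions from the question (model, optimizer and data are assumed to be set up already):
import torch
import matplotlib.pyplot as plt

train_accs, val_accs = [], []
for epoch in range(200):
    train(model, optimizer, data, epoch)  # training step from the question
    model.eval()
    with torch.no_grad():
        output = model(data)
    train_accs.append(accuracy(output[data.train_mask], data.y[data.train_mask]).item())
    val_accs.append(accuracy(output[data.val_mask], data.y[data.val_mask]).item())

plt.plot(train_accs, label="train accuracy")
plt.plot(val_accs, label="validation accuracy")
plt.xlabel("epoch")
plt.legend()
plt.show()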
One reason for the overfitting could be that the test data has been seen
by the model at training time, since in the train function there is a
line output = model(data), so the model is run over the whole data.
The model has indeed seen the entire graph (the adjacency matrix) during training, but it only sees the labels of the nodes in the training set and knows nothing about the labels of the nodes in the test set. This is exactly what transductive learning does.
In the end, if you are 100% sure you want to avoid the transductive learning paradigm, then you might need to write your own split algorithm to achieve that. But I would like to point out that in real-world use cases transduction is perfectly suitable. An example is predicting potential links between social network users, where we have the whole network structure as input and simply want to run edge prediction; that is transduction. So it doesn't make a lot of sense to avoid it.
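If you do want an inductive setup, one option is to build a training-only graph before training, for example with PyTorch Geometric's subgraph utility; a rough sketch (whether this fits your pipeline depends on the rest of your code):
# sketch: restrict the graph to training nodes only (inductive setup)
import torch
from torch_geometric.data import Data
from torch_geometric.utils import subgraph

train_idx = data.train_mask.nonzero().view(-1)
train_edge_index, _ = subgraph(train_idx, data.edge_index,
                               relabel_nodes=True, num_nodes=data.num_nodes)
train_data = Data(x=data.x[train_idx], y=data.y[train_idx], edge_index=train_edge_index)
output = model(train_data)  # the forward() in the question only needs x and edge_index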
Depending on your task, you could take a look at how StellarGraph's EdgeSplitter class (docs) and scikit-learn's train_test_split function (docs) achieve the split.
Node classification
If your task is a node classification task, this Node classification with Graph Convolutional Network (GCN) demo is a good example of how to load data and do a train-test split. It uses the Cora dataset as an example. The most important steps are the following:
dataset = sg.datasets.Cora()
display(HTML(dataset.description))
G, node_subjects = dataset.load()
train_subjects, test_subjects = model_selection.train_test_split(
    node_subjects, train_size=140, test_size=None, stratify=node_subjects
)
val_subjects, test_subjects = model_selection.train_test_split(
    test_subjects, train_size=500, test_size=None, stratify=test_subjects
)
train_gen = generator.flow(train_subjects.index, train_targets)
val_gen = generator.flow(val_subjects.index, val_targets)
test_gen = generator.flow(test_subjects.index, test_targets)
Basically, it's the same train-test split as in a normal classification task, except that what we split here is nodes.
Edge classification
If your task is edge classification, you could have a look at this Link prediction example: GCN on the Cora citation dataset. The most relevant code for the train-test split is:
# Define an edge splitter on the original graph G:
edge_splitter_test = EdgeSplitter(G)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G, and obtain the
# reduced graph G_test with the sampled links removed:
G_test, edge_ids_test, edge_labels_test = edge_splitter_test.train_test_split(
    p=0.1, method="global", keep_connected=True
)
# Define an edge splitter on the reduced graph G_test:
edge_splitter_train = EdgeSplitter(G_test)
# Randomly sample a fraction p=0.1 of all positive links, and same number of negative links, from G_test, and obtain the
# reduced graph G_train with the sampled links removed:
G_train, edge_ids_train, edge_labels_train = edge_splitter_train.train_test_split(
    p=0.1, method="global", keep_connected=True
)
# For training we create a generator on the G_train graph, and make an
# iterator over the training links using the generator’s flow() method:
train_gen = FullBatchLinkGenerator(G_train, method="gcn")
train_flow = train_gen.flow(edge_ids_train, edge_labels_train)
test_gen = FullBatchLinkGenerator(G_test, method="gcn")
test_flow = train_gen.flow(edge_ids_test, edge_labels_test)
Here the splitting algorithm behind the EdgeSplitter class (docs) is more complex: it needs to maintain the graph structure while doing the split, for example by keeping the graph connected. For more details, see the source code of EdgeSplitter.
TLDR
My aim is to categorize sentences in a foreign language (Hungarian) into 3 sentiment categories: negative, neutral & positive. I would like to improve the accuracy of the model used, which can be found below in the "Define, Compile, Fit the Model" section. The rest of the post is here for completeness and reproducibility.
I am new to asking questions on Machine Learning topics; suggestions are welcome here as well: How to ask a good question on Machine Learning?
Data preparation
For this I have 10000 sentences, each given to 5 human annotators and categorized as negative, neutral or positive, available from here. The first few lines look like this:
I categorize a sentence as positive (denoted by 2) if the sum of the annotators' scores is positive, neutral (denoted by 1) if the sum is 0, and negative (denoted by 0) if the sum is negative:
import pandas as pd
sentences_df = pd.read_excel('/content/OpinHuBank_20130106.xls')
sentences_df['annotsum'] = sentences_df['Annot1'] +\
                           sentences_df['Annot2'] +\
                           sentences_df['Annot3'] +\
                           sentences_df['Annot4'] +\
                           sentences_df['Annot5']

def categorize(integer):
    if 0 < integer: return 2
    if 0 == integer: return 1
    else: return 0
sentences_df['sentiment'] = sentences_df['annotsum'].apply(categorize)
Following this tutorial, I use SubwordTextEncoder to proceed. From here, I download web2.2-freq-sorted.top100k.nofreqs.txt, which contains the 100000 most frequently used words in the target language. (Both the sentiment data and this word list were recommended by this.)
Reading in list of most frequent words:
wordlist = pd.read_csv('/content/web2.2-freq-sorted.top100k.nofreqs.txt',sep='\n',header=None,encoding = 'ISO-8859-1')[0].dropna()
Encoding data, conversion to tensors
Initializing encoder using build_from_corpus method:
import tensorflow_datasets as tfds
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    corpus_generator=(word for word in wordlist), target_vocab_size=2**16)
Building on this, encoding the sentences:
import numpy as np
import tensorflow as tf
def applyencoding(string):
    return tf.convert_to_tensor(np.asarray(encoder.encode(string)))
sentences_df['encoded_sentences'] = sentences_df['Sentence'].apply(applyencoding)
Convert to a tensor each sentence's sentiment:
def tensorise(input):
    return tf.convert_to_tensor(input)
sentences_df['sentiment_as_tensor'] = sentences_df['sentiment'].apply(tensorise)
Defining how much data to be preserved for testing:
test_fraction = 0.2
train_fraction = 1-test_fraction
From the pandas dataframe, let's create a numpy array of the encoded sentence training tensors:
nparrayof_encoded_sentence_train_tensors = \
    np.asarray(sentences_df['encoded_sentences'][:int(train_fraction*len(sentences_df['encoded_sentences']))])
These tensors have different lengths, so let's use padding to make them the same length:
padded_nparrayof_encoded_sentence_train_tensors = tf.keras.preprocessing.sequence.pad_sequences(
    nparrayof_encoded_sentence_train_tensors, padding="post")
Let's stack these tensors together:
stacked_padded_nparrayof_encoded_sentence_train_tensors = tf.stack(padded_nparrayof_encoded_sentence_train_tensors)
Stacking the sentiment tensors together as well:
stacked_nparray_sentiment_train_tensors = \
    tf.stack(np.asarray(sentences_df['sentiment_as_tensor'][:int(train_fraction*len(sentences_df['encoded_sentences']))]))
Define, Compile, Fit the Model (ie the main point)
Define & compile the model as follows:
### THE QUESTION IS ABOUT THESE ROWS ###
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Conv1D(128, 5, activation='sigmoid'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='sigmoid'),
    tf.keras.layers.Dense(3, activation='sigmoid')
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits = True), optimizer='adam', metrics=['accuracy'])
Fit it:
NUM_EPOCHS = 40
history = model.fit(stacked_padded_nparrayof_encoded_sentence_train_tensors,
                    stacked_nparray_sentiment_train_tensors,
                    epochs=NUM_EPOCHS)
The first few lines of the output are:
Testing results
As in TensorFlow's RNN tutorial, let's plot the results we gained so far:
import matplotlib.pyplot as plt
def plot_graphs(history):
    plt.plot(history.history['accuracy'])
    plt.plot(history.history['loss'])
    plt.xlabel("Epochs")
    plt.ylabel('accuracy / loss')
    plt.legend(['accuracy','loss'])
    plt.show()
plot_graphs(history)
Which gives us:
Prepare the testing data as we prepared the training data:
nparrayof_encoded_sentence_test_tensors = \
    np.asarray(sentences_df['encoded_sentences'][int(train_fraction*len(sentences_df['encoded_sentences'])):])
padded_nparrayof_encoded_sentence_test_tensors = tf.keras.preprocessing.sequence.pad_sequences(
    nparrayof_encoded_sentence_test_tensors, padding="post")
stacked_padded_nparrayof_encoded_sentence_test_tensors = tf.stack(padded_nparrayof_encoded_sentence_test_tensors)
stacked_nparray_sentiment_test_tensors = \
    tf.stack(np.asarray(sentences_df['sentiment_as_tensor'][int(train_fraction*len(sentences_df['encoded_sentences'])):]))
Evaluate the model using only test data:
test_loss, test_acc = model.evaluate(stacked_padded_nparrayof_encoded_sentence_test_tensors,stacked_nparray_sentiment_test_tensors)
print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
Giving result:
Full notebook available here.
The question
How can I change the model definition and compilation rows above to have higher accuracy on the test set after no more than 1000 epochs?
You are using word-piece subwords; you could try BPE instead. Also, you can build your model on top of BERT and use transfer learning; that will dramatically boost your results.
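A minimal sketch of the BERT transfer-learning idea, assuming the Hugging Face transformers package and the multilingual BERT checkpoint are available (the model name, max_length and learning rate here are assumptions, not part of the original setup):
# sketch only: fine-tune multilingual BERT for the 3 sentiment classes
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = TFAutoModelForSequenceClassification.from_pretrained("bert-base-multilingual-cased", num_labels=3)

enc = tokenizer(list(sentences_df['Sentence']), padding=True, truncation=True, max_length=64, return_tensors="tf")
labels = sentences_df['sentiment'].values

bert.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
             loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
             metrics=['accuracy'])
bert.fit(dict(enc), labels, epochs=3, batch_size=16)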
Firstly, change the kernel size in your Conv1D layer and try various values for it; recommended values are 3, 5 and 7. Then consider adding layers. Also, increasing the number of units in the second-to-last Dense layer might help. Alternatively, you can try a network with just LSTM layers, or LSTM layers followed by a Conv1D layer.
Try these out; if one works, great, otherwise repeat. The training loss gives a hint here: if the loss is not going down smoothly, you may assume that your network lacks predictive power, i.e. it is underfitting, and increase the number of neurons in it.
Yes, more data does help. But if the fault is in your network, i.e. it is underfitting, then more data won't help. You should first explore the limits of the model you have before looking for faults in the data.
Yes, using the most common words is the usual norm, because the rarely used words occur infrequently and thus don't affect the predictions greatly.
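Along those lines, one possible revision of the model block in question might look like this (a sketch, not a tuned configuration; the relu/softmax activations and the bidirectional LSTM are choices made here for illustration):
# sketch: wider Dense layer, relu in the Conv1D, and a softmax output
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              optimizer='adam', metrics=['accuracy'])

# or an LSTM-based variant
model_lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(3, activation='softmax')
])
model_lstm.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                   optimizer='adam', metrics=['accuracy'])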
To preface this, I have plenty of experience with python and moderate experience building and using machine learning networks. That being said, this is the first LSTM I have made aside from some of the cookie-cutter examples available, so any help is appreciated. I feel like this is a problem with a simple solution and that I have just been looking at this code for far too long to see it.
This model is made in a python3.5 venv using Keras with a tensorflow backend.
In short, I am trying to make predictions of some temporal data using the data itself as well as a few mathematical permutations of this data, creating four input features. I am building a time-series input from the prior 60 data points and specifying the prediction target to be 60 data points in the future.
Shape of complete training data (input)(target): (2476224, 60, 4) (2476224)
Shape of single data "point" (input)(target): (1, 60, 4) (1)
What appears to be happening is that the trained model has fit the trailing value of my input time-series (the current value) instead of the target I have provided it (60 cycles in the future).
What is interesting is that the loss function seems to be calculating according to the correct prediction target, yet the model is not converging to the proper solution.
I have no idea why the model should be doing this. My first thought was that I was preprocessing my data incorrectly and feeding it the wrong target. I have tested my input formatting of the data extensively and am pretty confident that I am providing the model with the correct target and input information.
In one instance, I increased the learning rate a tad such that the model converged to a local minimum. The testing loss at this convergence was very similar to the loss with my preferred learning rate (still quite high), but the predictions were still of the "current value". Why is this so?
Here is how I created my model:
def create_model():
    lstm_model = Sequential()
    lstm_model.add(CuDNNLSTM(100, batch_input_shape=(batch_size, time_step, train_input.shape[2]),
                             stateful=True, return_sequences=True,
                             kernel_initializer='random_uniform'))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(CuDNNLSTM(60))
    lstm_model.add(Dropout(0.4))
    lstm_model.add(Dense(20, activation='relu'))
    lstm_model.add(Dense(1, activation='linear'))
    optimizer = optimizers.Adagrad(lr=params["lr"])
    lstm_model.compile(loss='mean_squared_error', optimizer=optimizer)
    return lstm_model
This is how I am pre-processing the data. The first function, build_timeseries, constructs my input-output pairs. I believe this is working correctly (but please correct me if I am wrong). The second function trims the pairs to fit the batch size. I do the exact same for the test input/target.
train_input, train_target = build_timeseries(train_input, time_step, pred_horiz, 0)
train_input = trim_dataset(train_input, batch_size)
train_target = trim_dataset(train_target, batch_size)
def build_timeseries(mat, TIME_STEPS, PRED_HORIZON, y_col_index):
    # y_col_index is the index of column that would act as output column
    dim_0 = mat.shape[0]  # num datasets
    dim_1 = mat.shape[1]  # num features
    dim_2 = mat.shape[2]  # num datapoints
    # Reformatted matrix
    mat = mat.swapaxes(1, 2)
    x = np.zeros((dim_0*(dim_2-PRED_HORIZON), TIME_STEPS, dim_1))
    y = np.zeros((dim_0*(dim_2-PRED_HORIZON),))
    k = 0
    for i in range(dim_0):  # Iterate through datasets
        for j in range(TIME_STEPS, dim_2-PRED_HORIZON):
            x[k] = mat[i, j-TIME_STEPS:j]
            y[k] = mat[i, j+PRED_HORIZON, y_col_index]
            k += 1
    print("length of time-series i/o", x.shape, y.shape)
    return x, y

def trim_dataset(mat, batch_size):
    no_of_rows_drop = mat.shape[0] % batch_size
    if(no_of_rows_drop > 0):
        return mat[no_of_rows_drop:]
    else:
        return mat
Lastly, this is how I call the actual model.
history = model.fit(train_input, train_target, epochs=params["epochs"], verbose=2, batch_size=batch_size,
                    shuffle=True, validation_data=(test_input, test_target), callbacks=[es, mcp])
As the model converges, I expect it to predict values close to the specified targets I fed it. Instead, its predictions align much more closely with the trailing value of the time-series data (the current value). On the other hand, the model appears to be evaluating the loss according to the specified target... Why is it working this way, and how can I fix it? Any help is appreciated.
I have a bunch of images that look like this of someone playing a videogame (a simple game I created in Tkinter):
The idea of the game is that the user controls the box at the bottom of the screen in order to dodge the falling balls (they can only dodge left and right).
My goal is to have the neural network output the position of the player on the bottom of the screen. If they're totally on the left, the neural network should output a 0, if they're in the middle, a .5, and all the way right, a 1, and all the values in-between.
My images are 300x400 pixels. I stored my data very simply. I recorded each of the images and position of the player as a tuple for each frame in a 50-frame game. Thus my result was a list in the form [(image, player position), ...] with 50 elements. I then pickled that list.
So in my code I try to create an extremely basic feed-forward network that takes in the image and outputs a value between 0 and 1 representing where the box on the bottom of the image is. But my neural network is only outputting 1s.
What should I change in order to get it to train and output values close to what I want?
Of course, here is my code:
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0:1]

def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = [list(val) for val in zip(*data)]
    for index, image in enumerate(inputs):
        # convert the PIL images into numpy arrays so Keras can process them
        inputs[index] = pil_image_to_np_array(image)
    return (inputs, outputs)
if __name__ == "__main__":
    # fix random seed for reproducibility
    np.random.seed(7)
    # load data
    # data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
    with open("position_data1.pkl", "rb") as pickled_data:
        data = pickle.load(pickled_data)
    X, Y = data_to_training_set(data)
    # get the width of the images
    width = X[0].shape[1]  # == 400
    # convert the player position (a value between 0 and the width of the image) to values between 0 and 1
    for index, output in enumerate(Y):
        Y[index] = output / width
    # flatten the image inputs so they can be passed to a neural network
    for index, inpt in enumerate(X):
        X[index] = np.ndarray.flatten(inpt)
    # keras expects an array (not a list) of image-arrays for input to the neural network
    X = np.array(X)
    Y = np.array(Y)
    # create model
    model = Sequential()
    # my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
    # keep it super simple by not having any deep learning
    model.add(Dense(1, input_dim=120000, activation='sigmoid'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    # Fit the model
    model.fit(X, Y, epochs=15, batch_size=10)
    # see what the model is doing
    predictions = model.predict(X, batch_size=10)
    print(predictions)  # this prints all 1s! # TODO fix
EDIT: print(Y) gives me:
so it's definitely not all zeroes.
Of course, a deeper model might give you a better accuracy, but considering the fact that your images are simple, a pretty simple (shallow) model with only one hidden layer should give a medium to high accuracy. So here are the modifications you need to make this happen:
1. Make sure X and Y are of type float32 (currently, X is of type uint8):
X = np.array(X, dtype=np.float32)
Y = np.array(Y, dtype=np.float32)
2. When training a neural network, it is much better to normalize the training data. Normalization helps the optimization process go smoothly and speeds up convergence to a solution. It further prevents large values from causing large gradient updates, which would be disruptive. Usually, the values of each feature in the input data should fall in a small range, where two common ranges are [-1, 1] and [0, 1]. Therefore, to make sure that all values fall in the range [-1, 1], we subtract from each feature its mean and divide by its standard deviation:
X_mean = X.mean(axis=0)
X -= X_mean
X_std = X.std(axis=0)
X /= X_std + 1e-8 # add a very small constant to prevent division by zero
Note that we are normalizing each feature (i.e. each pixel, in this case), not each image. When you want to predict on new data, i.e. in inference or testing mode, you need to subtract X_mean from the test data and divide it by X_std (you should NEVER EVER subtract from the test data its own mean or divide it by its own standard deviation; rather, use the mean and std of the training data):
X_test -= X_mean
X_test /= X_std + 1e-8
3. If you apply the changes in points one and two, you might notice that the network no longer predicts only ones or only zeros. Rather, it shows some faint signs of learning and predicts a mix of zeros and ones. This is not bad, but it is far from good and we have high expectations! The predictions should be much better than a mix of only zeros and ones. Here you should take into account the (forgotten!) learning rate. Since the network has a relatively large number of parameters for a relatively simple problem (and there are only a few samples of training data), you should choose a smaller learning rate to smooth the gradient updates and the learning process:
from keras import optimizers
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.0001))
You would notice the difference: the loss value reaches around 0.01 after 10 epochs, and the network no longer predicts a mix of zeros and ones; rather, the predictions are much more accurate and close to what they should be (i.e. Y).
4. Don't forget! We have high (logical!) expectations. So, how can we do better without adding any new layers to the network (obviously, we assume that adding more layers might help!!)?
4.1. Gather more training data.
4.2. Add weight regularization. Common ones are L1 and L2 regularization (I highly recommend the Jupyter notebooks of the book Deep Learning with Python, written by François Chollet, the creator of Keras. Specifically, here is the one which discusses regularization.) A small sketch of this is shown right after this list.
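For instance, L2 regularization could be added to the single Dense layer roughly like this (a sketch; the factor 0.01 is just a starting point to tune):
# sketch: L2 weight regularization on the existing Dense layer
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers, regularizers

model = Sequential()
model.add(Dense(1, input_dim=120000, activation='sigmoid',
                kernel_regularizer=regularizers.l2(0.01)))
model.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.0001))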
5. You should always evaluate your model in a proper and unbiased way. Evaluating it on the training data (that you have used to train it) does not tell you anything about how well your model would perform on unseen (i.e. new or real-world) data points (e.g. consider a model which stores or memorizes all the training data; it would perform perfectly on the training data, but it would be a useless model and perform poorly on new data). So we should have separate test and train datasets: we train the model on the training data and evaluate it on the test (i.e. new) data. However, during the process of coming up with a good model you perform lots of experiments: for example, you first change the type and number of layers, train the model and then evaluate it on the test data to make sure it is good. Then you change another thing, say the learning rate, train it again and then evaluate it again on the test data... To make it short, these cycles of tuning and evaluation cause a kind of overfitting on the test data. Therefore, we need a third dataset called validation data (read more: What is the difference between test set and validation set?):
# first shuffle the data to make sure it isn't in any particular order
indices = np.arange(X.shape[0])
np.random.shuffle(indices)
X = X[indices]
Y = Y[indices]
# you have 200 images
# we select 100 images for training,
# 50 images for validation and 50 images for test data
X_train = X[:100]
X_val = X[100:150]
X_test = X[150:]
Y_train = Y[:100]
Y_val = Y[100:150]
Y_test = Y[150:]
# train and tune the model
# you can attempt train and tune the model multiple times,
# each time with different architecture, hyper-parameters, etc.
model.fit(X_train, Y_train, epochs=15, batch_size=10, validation_data=(X_val, Y_val))
# only and only after completing the tuning of your model
# you should evaluate it on the test data for just one time
model.evaluate(X_test, Y_test)
# after you are satisfied with the model performance
# and want to deploy your model for production use (i.e. real world)
# you can train your model once more on the whole data available
# with the best configurations you have found out in your tunings
model.fit(X, Y, epochs=15, batch_size=10)
(Actually, when we have little training data available, it would be wasteful to carve validation and test data out of the whole available data. In this case, and if the model is not computationally expensive, instead of holding out a fixed validation set one can do cross-validation: K-fold cross-validation, or iterated K-fold cross-validation when there are very few data samples.)
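A minimal K-fold sketch with scikit-learn (assuming scikit-learn is available; the fold count and the build_model() helper are placeholders, not part of the original code):
# sketch: 5-fold cross-validation instead of a fixed validation split
import numpy as np
from sklearn.model_selection import KFold
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

def build_model():
    m = Sequential()
    m.add(Dense(1, input_dim=120000, activation='sigmoid'))
    m.compile(loss='mean_squared_error', optimizer=optimizers.Adam(lr=0.0001))
    return m

kf = KFold(n_splits=5, shuffle=True, random_state=7)
fold_losses = []
for train_idx, val_idx in kf.split(X):
    model = build_model()  # fresh model for every fold
    model.fit(X[train_idx], Y[train_idx], epochs=15, batch_size=10, verbose=0)
    fold_losses.append(model.evaluate(X[val_idx], Y[val_idx], verbose=0))
print("mean CV loss:", np.mean(fold_losses))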
It is around 4 AM at the time of writing this answer and I am feeling sleepy, but I would like to mention one more thing which is not directly related to your question: by using the Numpy library and its functionalities and methods you can write more concise and efficient code and also save yourself a lot of time. So make sure you practice using it more, as it is heavily used in the machine learning community and libraries. To demonstrate this, here is the same code you have written, but with more use of Numpy (note that I have not applied all the changes I mentioned above in this code):
# machine learning code mostly from https://machinelearningmastery.com/tutorial-first-neural-network-python-keras/
from keras.models import Sequential
from keras.layers import Dense
import numpy as np
import pickle
def pil_image_to_np_array(image):
    '''Takes an image and converts it to a numpy array'''
    # from https://stackoverflow.com/a/45208895
    # all my images are black and white, so I only need one channel
    return np.array(image)[:, :, 0]

def data_to_training_set(data):
    # split the list in the form [(frame 1 image, frame 1 player position), ...] into [[all images], [all player positions]]
    inputs, outputs = zip(*data)
    inputs = [pil_image_to_np_array(image) for image in inputs]
    inputs = np.array(inputs, dtype=np.float32)
    outputs = np.array(outputs, dtype=np.float32)
    return (inputs, outputs)
if __name__ == "__main__":
    # fix random seed for reproducibility
    np.random.seed(7)
    # load data
    # data will be in the form [(frame 1 image, frame 1 player position), (frame 2 image, frame 2 player position), ...]
    with open("position_data1.pkl", "rb") as pickled_data:
        data = pickle.load(pickled_data)
    X, Y = data_to_training_set(data)
    # get the width of the images
    width = X.shape[2]  # == 400
    # convert the player position (a value between 0 and the width of the image) to values between 0 and 1
    Y /= width
    # flatten the image inputs so they can be passed to a neural network
    X = np.reshape(X, (X.shape[0], -1))
    # create model
    model = Sequential()
    # my images are 300 x 400 pixels, so each input will be a flattened array of 120000 gray-scale pixel values
    # keep it super simple by not having any deep learning
    model.add(Dense(1, input_dim=120000, activation='sigmoid'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    # Fit the model
    model.fit(X, Y, epochs=15, batch_size=10)
    # see what the model is doing
    predictions = model.predict(X, batch_size=10)
    print(predictions)  # this prints all 1s! # TODO fix
I want to implement the PI model architecture proposed in the paper "Temporal Ensembling for Semi-Supervised Learning":
https://arxiv.org/abs/1610.02242
The authors provided the code in Theano, but Theano is no longer supported, so I am writing the algorithm in TensorFlow.
The paper is about a semi-supervised model where a batch of training data contains both labeled and unlabeled samples. The algorithm has two branches, one for labeled and the other for unlabeled samples. A sample serves as the input to both branches, but in each branch the input undergoes a different random augmentation. In the labeled branch, the cross-entropy loss between the label and the output of a CNN is used, while in the unlabeled branch, the mean squared loss between the CNN outputs for the two randomly augmented versions of the same input is used.
I am using the CIFAR-10 dataset, and I have randomly changed the labels of some samples to 10 to represent the unlabeled samples. For both labeled and unlabeled samples, the mean squared loss is always part of the training loss; however, if the sample is unlabeled (i.e. the label == 10), the cross entropy should be zero.
I have come up with the following snippet in my code
sentinel = tf.Variable(-1, name='sentinel')
if tf.argmax(y_, axis=1)[0] != tf.constant(10):
    sentinel = sentinel
else:
    sentinel = 1
cross_entropy = tf.cond(tf.greater(sentinel, tf.constant(0)), lambda: tf.constant(0.0), lambda: tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=train_prediction)), name='cross_entropy')
train_loss = cross_entropy
mean_sq_loss = tf.losses.mean_squared_error(train_prediction, train_prediction_b)
train_loss += unsup_weight * mean_sq_loss
train_op = tf.train.AdamOptimizer(config.adam_epsilon).minimize(train_loss)
but the condition
if tf.argmax(y_, axis=1)[0] != tf.constant(10):
always evaluates the same way: the else branch is never taken, even when
tf.argmax(y_, axis=1)[0] equals 10.
I used sentinel to track whether the sample is labeled or not, but when I checked the output for a batch size of 1, sentinel did not change even when the sample was unlabeled. I would appreciate any contribution.
Thanks
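For reference, the loss described above (the mean squared term always on, the cross entropy zeroed for unlabeled samples) can also be expressed with a per-sample mask instead of a Python-level if; a minimal TF1-style sketch, assuming y_ is one-hot with class index 10 marking unlabeled samples and train_prediction / train_prediction_b are the two augmented-branch outputs:
# sketch only: per-sample masking of the supervised term
labeled_mask = tf.cast(tf.not_equal(tf.argmax(y_, axis=1), 10), tf.float32)
per_sample_ce = tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=train_prediction)
cross_entropy = tf.reduce_sum(labeled_mask * per_sample_ce) / tf.maximum(tf.reduce_sum(labeled_mask), 1.0)
mean_sq_loss = tf.losses.mean_squared_error(train_prediction, train_prediction_b)
train_loss = cross_entropy + unsup_weight * mean_sq_loss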