I am trying to solve FizzBuzz using Keras, and it works quite well for numbers between 1 and 10.000 (90-100% win rate and close to 0 loss). However, for higher numbers, that is, numbers between 1 and 100.000, it performs quite poorly (~50% win rate, loss ~0.3), and I have no clue what I can do to solve this task. So far I am using a very simple neural net architecture with 3 hidden layers:
model = Sequential()
model.add(Dense(2000, input_dim=state_size, activation="relu"))
model.add(Dense(1000, activation="relu"))
model.add(Dense(500, activation="relu"))
model.add(Dense(num_actions, activation="softmax"))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
I found that the more neurons I have the better it performs, at least for numbers below 10.000.
I am training my neural net in a step-wise fashion, meaning that I am not computing the inputs and targets beforehand, but instead train the network step by step. Again, this works quite well and it shouldn't make a difference right? Here's the main loop:
for epoch in range(np_epochs):
    action = random_number()
    x_raw = to_binary(action)
    x = np.expand_dims(x_raw, 0)
    prediction = model.predict(x)
    y, victory, _, _ = check_prediction(action, prediction)
    memory.append((x_raw, y))
    curr_batch_size = min(batch_size, len(memory))
    batch = random.sample(memory, curr_batch_size)
    inputs = []
    targets = []
    for i, t in batch:
        inputs.append(i)
        targets.append(t)
    if victory:
        wins += 1
    loss, accuracy = model.train_on_batch(np.array(inputs), np.array(targets))
As you can see, I am training my network not on decimal numbers but convert them into binary first before feeding it into the net.
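For readers without access to the lab, here is a minimal sketch of what the binary encoding and the one-hot target might look like. The bodies of `to_binary` and `fizzbuzz_target`, the bit width of 17, and the class order are all assumptions made for illustration, not the code from the lab.

```python
import numpy as np

NUM_BITS = 17  # assumption: 2**17 = 131072 is enough for numbers up to 100000


def to_binary(n, num_bits=NUM_BITS):
    """Return n as a fixed-width vector of 0/1 floats, least significant bit first."""
    return np.array([(n >> i) & 1 for i in range(num_bits)], dtype=np.float32)


def fizzbuzz_target(n):
    """One-hot target; assumed class order: number, Fizz, Buzz, FizzBuzz."""
    if n % 15 == 0:
        idx = 3
    elif n % 5 == 0:
        idx = 2
    elif n % 3 == 0:
        idx = 1
    else:
        idx = 0
    target = np.zeros(4, dtype=np.float32)
    target[idx] = 1.0
    return target


x = to_binary(10)         # 10 is 1010 in binary, stored here as 0, 1, 0, 1, 0, ...
y = fizzbuzz_target(15)   # 15 is divisible by both 3 and 5
```

With an encoding like this, `state_size` would equal `NUM_BITS` and `num_actions` would be 4.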
Another thing to mention here is that I am using a memory, to make it more like a supervised problem. I thought it might perform better if I train on numbers that the neural net has already been trained on. It doesn't seem to make any difference at all.
Is there anything I can do to solve this particular problem with a neural net? I mean is it so hard for a function approximator to figure out the simple math behind FizzBuzz? Am I doing something wrong? Do you suggest a different architecture?
See my code on MachineLabs. You can simply fork my lab and fiddle with it if you want. To view the code, simply click on the 'Editor' tab at the top.
I am trying to develop an LSTM model using Keras, following this tutorial. However, I am implementing it with a different dataset of U.S. political news articles with the aim of classifying them based on a political bias (labels: Left, Centre and Right). I have gotten a model to run with the tutorial, but the loss and accuracy would look very off, like this:
I tried to play around with different DropOut probabilities (i.e. 0.5 instead of 0.2), adding/removing hidden layers (and making them less dense), and decreasing/increasing the max number of words and max sequence length.
I have managed to get the graphs to align a bit more; however, that has led to the model having lower accuracy on the training data (and the overfitting is still bad):
Additionally, I am not sure why the validation accuracy always seems to be higher than the model accuracy in the first epoch (shouldn't it usually be lower)?
Here is some code that is being used when tokenizing, padding, and initializing variables:
# The maximum number of words to be used. (most frequent)
MAX_NB_WORDS = 500
# Max number of words in each news article
MAX_SEQUENCE_LENGTH = 100 # I am aware this may be too small
# This is fixed.
EMBEDDING_DIM = 64
tokenizer = Tokenizer(num_words=MAX_NB_WORDS, filters='!"#$%&()*+,-./:;<=>?#[\]^_`{|}~',
lower=True)
tokenizer.fit_on_texts(df_raw['titletext'].values)
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
X = tokenizer.texts_to_sequences(df_raw['titletext'].values)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of data tensor:', X.shape)
Y = pd.get_dummies(df_raw['label']).values
print('Shape of label tensor:', Y.shape)
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.20)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
X_train.view()
When I look at what is shown when X_train.view() is executed, I am also not sure why all the arrays start with zeros like this:
I also did a third attempt that was just a second attempt with the number of epochs increased, it looks like this:
Here is the code of the actual model:
model = Sequential()
model.add(Embedding(MAX_NB_WORDS, EMBEDDING_DIM, input_length=X.shape[1]))
# model.add(SpatialDropout1D(0.2)) ---> commented out
# model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2)) ---> commented out
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dropout(0.5))
model.add(Dense(8))
model.add(Dropout(0.5))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
epochs = 25
batch_size = 64
history = model.fit(X_train, Y_train, epochs=epochs,
batch_size=batch_size,validation_split=0.2,callbacks=[EarlyStopping(monitor='val_loss', patience=3, min_delta=0.0001)])
Here is the link to the full code, including the dataset
Any help would be greatly appreciated!
Hyperparameter adjustments for reducing overfitting in neural networks
Identify and ascertain overfitting. The first attempt shows substantial overfitting, with early divergence of your test & train loss. I would try a lower learning rate here (in addition to the steps you took for regularisation with dropout layers). Using the default rate does not guarantee the best results.
Allowing your model to find the global minimum / not getting stuck in a local minimum. The second attempt looks better. However, if the x-axis shows the number of epochs, it could be that your early stopping is too strict, i.e. try increasing the threshold. Consider other optimisers, including SGD with a learning rate scheduler.
Too large a network leads to overfitting on the train set and difficulty in generalisation. Too many neurons may cause the network to 'memorize' all of your train set and overfit. I would try out 8, 16 or 24 neurons in your LSTM layer, for example.
Data preprocessing & cleaning. Check your pad_sequences call. By default it pads the start of each text with zeros (padding='pre'), which is why your arrays begin with zeros. I would pad after the text instead (padding='post').
Dataset. Depending on the size of your current dataset, I would suggest data augmentation to get to a sizable amount of training text (empirically >=1M words). I would also try several techniques including feature engineering / improving data quality, such as spell checks. Are the classes imbalanced? You may need to balance them out by over/undersampling.
Consider using transfer learning and incorporating trained language models as your embeddings layer instead of training one from scratch, e.g. https://www.gcptutorials.com/post/how-to-create-embedding-with-tensorflow
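The padding point above can be illustrated without Keras. The helper `pad` below is a hypothetical stand-in that mimics the two padding modes on plain Python lists; it is only a sketch of the behaviour, not the Keras implementation (truncation in particular is simplified).

```python
def pad(seq, maxlen, padding="pre", value=0):
    """Pad or truncate a list of token ids to exactly maxlen entries."""
    if padding not in ("pre", "post"):
        raise ValueError("padding must be 'pre' or 'post'")
    seq = list(seq)
    if len(seq) >= maxlen:
        # Simplification: always keep the last maxlen tokens
        return seq[-maxlen:]
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill


tokens = [12, 7, 33]                          # made-up token ids
pre_padded = pad(tokens, 6, padding="pre")    # zeros first, then the text
post_padded = pad(tokens, 6, padding="post")  # text first, then zeros
```

In the question's code, the equivalent change would be passing `padding='post'` to `pad_sequences`; whether that helps the LSTM on this dataset is something to test rather than assume.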
LSTM is supposed to be the right tool to capture path-dependency in time-series data.
I decided to run a simple experiment (simulation) to assess the extent to which LSTM is better able to understand path-dependency.
The setting is very simple. I just simulate a bunch (N=100) of paths coming from 4 different data generating processes. Two of these processes represent a real increase and a real decrease, while the other two fake trends that eventually revert to zero.
The following plot shows the simulated paths for each category:
The candidate machine learning algorithm will be given the first 8 values of the path ( t in [1,8] ) and will be trained to predict the subsequent movement over the last 2 steps.
In other words:
the feature vector is X = (p1, p2, p3, p4, p5, p6, p7, p8)
the target is y = p10 - p8
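To make the setup concrete, here is a minimal sketch of how such features and targets could be built with NumPy. The array `paths` is a hypothetical stand-in (a plain random walk), not the four data generating processes used in the experiment; only the slicing is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the simulated data: N paths of T steps each
N, T, H = 100, 10, 8
paths = rng.normal(size=(N, T)).cumsum(axis=1)

X = paths[:, :H]                        # p1 ... p8
y = paths[:, T - 1] - paths[:, H - 1]   # p10 - p8

# A Keras LSTM expects inputs shaped (samples, timesteps, features)
X_lstm = X.reshape(N, H, 1)
```

The Random Forest can consume `X` directly, while the LSTM needs the three-dimensional `X_lstm`.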
I compared LSTM with a simple Random Forest model with 20 estimators. Here are the definitions and the training of the two models, using Keras and scikit-learn:
# LSTM
model = Sequential()
model.add(LSTM((1), batch_input_shape=(None, H, 1), return_sequences=True))
model.add(LSTM((1), return_sequences=False))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_X_LS, train_y_LS, epochs=100, validation_data=(vali_X_LS, vali_y_LS), verbose=0)
# Random Forest
RF = RandomForestRegressor(random_state=0, n_estimators=20)
RF.fit(train_X_RF, train_y_RF);
The out-of-sample results are summarized by the following scatter plots:
As you can see, the Random Forest model is clearly outperforming the LSTM. The latter seems unable to distinguish between the real and the fake trends.
Do you have any idea to explain why this is happening?
How would you modify the LSTM model to make it better at this problem?
Some remarks:
The data points are divided by 100 to make sure gradients do not explode
I tried to increase the sample size, but I noticed no differences
I tried to increase the number of epochs over which the LSTM is trained, but I noticed no differences (the loss becomes stagnant after a bunch of epochs)
You can find the code I used to run the experiment here
Update:
Thanks to SaTa's reply, I changed the model and obtained much better results:
# Updated LSTM Model
model = Sequential()
model.add(LSTM((8), batch_input_shape=(None, H, 1), return_sequences=False))
model.add(Dense(4))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam', metrics=['accuracy'])
Still, the Random Forest model does better. The point is that RF seems to understand that, conditional on the class, a higher p8 predicts a lower outcome p10-p8 and vice versa, because of the way the noise is added. LSTM seems to fail on that: it predicts the class rather well, but we see that within-class downward-sloping pattern in the final scatter plot.
Any suggestion to improve on that?
I wouldn't expect LSTM to win every battle against traditional methods, but I do expect it to perform well for the problem you have posed. Here are a couple of things you can try:
1) Increase the number of hidden units in the first layer.
model.add(LSTM((32), batch_input_shape=(None, H, 1), return_sequences=True))
2) The output of an LSTM layer is tanh by default which limits the output to (-1, 1) as you can see in the right plot. I recommend either adding a Dense layer or using LSTM with linear activation on the output. Like this:
model.add(LSTM((1), return_sequences=False, activation='linear'))
Or
model.add(LSTM((16), return_sequences=False))
model.add(Dense(1))
Try the above with 10K samples that you have.
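Point 2 can be checked numerically. The sketch below uses plain NumPy rather than Keras, with made-up pre-activation values and an arbitrary weight and bias, to show that a tanh output never leaves (-1, 1), whereas a linear layer stacked on top of it can.

```python
import numpy as np

pre_activations = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])  # made-up values

# What a tanh output unit would emit for these inputs
tanh_out = np.tanh(pre_activations)

# A Dense(1) layer on top computes w * h + b; w and b are arbitrary here
w, b = 3.0, 0.5
dense_out = w * tanh_out + b

print(tanh_out.min(), tanh_out.max())    # stays strictly inside (-1, 1)
print(dense_out.min(), dense_out.max())  # can exceed that range
```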
I have a network with 32 input nodes, 20 hidden nodes and 65 output nodes. My network input actually is a hash code of length 32 and the output is the word.
The input is the ASCII value of each character of the hash code. The output of the network is a binary representation I have made up: for example, a is equal to 00000, b is equal to 00001, and so on. It only includes the alphabet and the space, which is why it needs only 5 bits per character. I have a maximum of 13 characters in my training input, so the number of output nodes is 13 * 5 = 65, and I'm expecting a binary output like 10101010101010101010101010101010101010101010101010101010101001011 . The bit sequence can predict a word of at most 16 characters given a hash code of length 32 as input. Below is my current code:
scaler = MinMaxScaler(feature_range=(0,1))
scaled_train_samples = scaler.fit_transform((train_samples).reshape(-1, 32))
train_labels = train_labels.reshape(-1, 65)
model = Sequential([
    Dense(32, input_shape=(32,), activation = 'sigmoid'),
    BatchNormalization(),
    Dense(25, activation='tanh'),
    BatchNormalization(),
    Dense(65, input_shape=(65,), activation='sigmoid')
])
overfitCallback = EarlyStopping(monitor='loss', min_delta=0, patience = 1000)
model.summary()
model.compile(SGD(lr=.01, decay=1e-6, momentum=0.9), loss='binary_crossentropy', metrics=['accuracy'])
model.fit(train_samples, train_labels, batch_size=1000, epochs=1000000, callbacks=[overfitCallback], shuffle = True, verbose=2)
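The output encoding described before the code can be sketched as follows. The exact scheme is not shown in the question, so two details are assumptions: the space is given code 26 (right after z), and words shorter than 13 characters are right-padded with spaces.

```python
import string

ALPHABET = string.ascii_lowercase + " "  # assumption: space takes code 26
MAX_CHARS = 13
BITS_PER_CHAR = 5


def encode_word(word):
    """Encode a lowercase word as 13 * 5 = 65 bits."""
    if len(word) > MAX_CHARS:
        raise ValueError("word is longer than 13 characters")
    padded = word.ljust(MAX_CHARS)  # assumption: pad short words with spaces
    bits = []
    for ch in padded:
        code = ALPHABET.index(ch)   # a -> 0, b -> 1, ...
        bits.extend(int(b) for b in format(code, "05b"))
    return bits


def decode_bits(bits):
    """Invert encode_word and strip the trailing padding."""
    chars = []
    for i in range(0, len(bits), BITS_PER_CHAR):
        chunk = bits[i:i + BITS_PER_CHAR]
        code = int("".join(str(b) for b in chunk), 2)
        chars.append(ALPHABET[code])
    return "".join(chars).rstrip()


bits = encode_word("ab")  # 65 bits, starting with 00000 00001
```

A round trip such as `decode_bits(encode_word("hello world"))` is a quick way to check that labels built this way are consistent before training on them.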
I plan to overfit the model so that it memorizes the hash codes of all the words in the dictionary. As a start, I have only about 5,000 training samples; I just wanted to see if it will learn from a small dataset. How can I make the network converge faster? I think it has been running for more than an hour, and its loss is still about .5004 and its accuracy .7301. They go up and down, but when I check every 10 minutes or so, I can see only a little improvement. How can I fine-tune it?
UPDATE :
The training has stopped, but it didn't converge. Its loss is .4614 and its accuracy is .7422.
There are some hyperparameters that I would suggest changing first.
Try 'relu' or LeakyReLU() as the activation function for the non-output layers. Basically relu is the standard activation function for baseline models.
The standard optimizer (for most cases) currently is Adam, try using this. Tweak its learning rate when needed. You could get better results with sgd, but it often takes a lot of epochs and a lot of hyper parameter tuning. Adam is basically the quickest (in general) optimizer to reach a 'low' loss.
To prevent overfitting you might also want to implement Dropout(0.5), where the 0.5 is as an example.
Once you have reached the lowest loss, you might start changing these hyperparameters even more, to try and get a lower loss.
Apart from this, the first thing I would actually suggest is trying to add multiple hidden layers with different sizes. This might have a far larger impact than trying to optimize all the hyperparameters.
Edit: Maybe you could post a screenshot of your training loss vs epochs for the train & val data? This might make things more clear for others.
I'm a beginner in neural networks and I am trying to predict temperature values (output) from 5 inputs in Python. I used the Keras package to build the neural networks.
I tried two algorithms to predict the values: a feedforward neural network (regression) and a recurrent neural network (LSTM). However, neither algorithm worked well for forecasting.
For the feedforward neural network (regression), I used 3 hidden layers (with 100, 200 and 300 neurons), as in the code below:
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(100, input_dim=5, kernel_initializer='normal', activation='sigmoid'))
    model.add(Dense(200, kernel_initializer = 'normal', activation='sigmoid'))
    model.add(Dense(300, kernel_initializer = 'normal', activation='sigmoid'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
df = DataFrame({'Time': TIME_list, 'input1': input1_list, 'input2': input2_list, 'input3': input3_list, 'input4': input4_list, 'input5': input5_list, 'output': output_list})
df.index = pd.to_datetime(df.Time)
df = df.values
#Setting training data and test data
train_size_x = int(len(df)*0.8) #The user can change the range of training data
print(train_size_x)
X_train = df[0:train_size_x, 0:5]
t_train = df[0:train_size_x, 6]
X_test = df[train_size_x:int(len(df)), 0:5]
t_test = df[train_size_x:int(len(df)), 6]
# fix random seed for reproducibility
seed = 7
np.random.seed(seed)
scale = StandardScaler()
X_train = scale.fit_transform(X_train)
X_test = scale.transform(X_test)
#Regression in Keras package
clf = KerasRegressor(build_fn=baseline_model, nb_epoch=50, batch_size=5, verbose=0)
clf.fit(X_train,t_train)
res = clf.predict(X_test)
However, the error was quite big: the maximum absolute error was 78.4834. I tried to reduce it by changing the number of hidden layers or the number of neurons per hidden layer, but the error stayed about the same.
After the feedforward NN, I tried a recurrent neural network (LSTM), which predicts using only one input. In my case, that input is the temperature. It gives a much smaller error than the feedforward NN, but I am unsure about it, because it does not use the 5 inputs that affect the output (temperature value) the way the feedforward regression above does.
Now I am lost as to what other kind of algorithm I should use.
Any suggestions or ideas for my case..?
Thanks in advance.
I have to agree with the commenter on your question: you are jumping a little ahead of yourself. Neural networks can seem like black magic at times, and it's worth taking the time to understand what's actually going on under the hood. A good place to start learning and experimenting is sklearn, because you can try different techniques easily, which will help you learn quickly how to structure your problems. There is also an abundance of info and tutorials.
From there, you will be better equipped to tackle your own NN from scratch. Additionally, sklearn has many useful functions to pre-process/normalize your training data, which is a whole art in itself.
There are tons of good networks already available for common situations. Most of the work is in choosing the right structure for your problem, getting good data to train on, and massaging that data so it can be utilized properly.
Check it out... http://scikit-learn.org/stable/
I'm trying to understand how to implement neural networks, so I made my own dataset. Xtrain is numpy.random floats. Ytrain is sign(sin(1/x^3)).
My attempt to implement a neural network gave very poor results: 30% accuracy. A Random Forest with 100 trees gives 97%. But I heard that a NN can approximate any function. What is wrong in my understanding?
import numpy as np
import keras
import math
from sklearn.ensemble import RandomForestClassifier as RF
train = np.random.rand(100000)
test = np.random.rand(100000)
def g(x):
    if math.sin(2*3.14*x) > 0:
        if math.cos(2*3.14*x) > 0:
            return 0
        else:
            return 1
    else:
        if math.cos(2*3.14*x) > 0:
            return 2
        else:
            return 3
def f(x):
    x = (1/x) ** 3
    res = [0, 0, 0, 0]
    res[g(x)] = 1
    return res
ytrain = np.array([f(x) for x in train])
ytest = np.array([f(x) for x in test])
train = np.array([[x] for x in train])
test = np.array([[x] for x in test])
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM
model = Sequential()
model.add(Dense(100, input_dim=1))
model.add(Activation('sigmoid'))
model.add(Dense(100))
model.add(Activation('sigmoid'))
model.add(Dense(100))
model.add(Activation('sigmoid'))
model.add(Dense(4))
model.add(Activation('softmax'))
model.compile(optimizer='sgd',
loss='categorical_crossentropy',
metrics=['accuracy'])
P.S. I tried out many layers, activation functions, loss functions, optimizers, but never got more than 30% accuracy :(
I suspect that the 30% accuracy is due to a combination of a small learning rate and too few training steps.
I ran your code snippet with model.fit(train, ytrain, nb_epoch=5, batch_size=32); after 5 epochs of training it yields about 28% accuracy. With the same settings but with the training steps increased to nb_epoch=50, the loss drops to about 1.157 and the accuracy rises to 40%. Increasing the training steps further should let the model converge further. Other than that, you can also try a larger learning rate, which could make convergence faster:
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.1, momentum=0.9, nesterov=True), metrics=['accuracy'])
Be careful, though, not to set the learning rate too large, otherwise your loss could blow up.
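The trade-off described above can be seen on a toy problem. The sketch below is not the network from the question: it runs plain gradient descent on f(w) = w^2, with step counts and learning rates picked purely for illustration.

```python
def gradient_descent(lr, steps, w0=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return the final loss."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w   # derivative of w**2
        w -= lr * grad
    return w * w


slow = gradient_descent(lr=0.01, steps=50)         # small rate, few steps
more_steps = gradient_descent(lr=0.01, steps=500)  # same rate, longer training
fast = gradient_descent(lr=0.1, steps=50)          # larger rate, few steps
diverged = gradient_descent(lr=1.1, steps=50)      # rate too large: loss grows

print(slow, more_steps, fast, diverged)
```

Both training longer and raising the rate reduce the final loss here, while the largest rate makes it grow instead of shrink. A real network's loss surface is far less regular, so treat this only as intuition.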
EDIT:
NNs are known for having the potential to model extremely complex functions; however, whether or not a model actually performs well depends on how it is designed and trained, and on many other matters related to the specific application.
Zhongyu Kuang's answer is correct in stating that you may need to train it longer or with a different learning rate.
I'll add that the deeper your network, the longer you'll need to train it before it converges. For a relatively simple function like sign(sin(1/x^3)), you may be able to get away with a smaller network than the one you're using.
Additionally, softmax probably isn't the best output layer. You just need to yield -1 or 1. A single tanh unit seems like it would do well. softmax is generally used when you want to learn a probability distribution over a finite set. (You'll probably want to switch your error function from cross entropy to mean square error for similar reasons.)
Try a network with one sigmoidal hidden layer and an output layer with just one tanh unit. Then play around with the layer size and learning rate. Maybe add a second hidden layer if you can't get results with just one, but I wouldn't be surprised if it's unnecessary.
Addendum: In this approach, you'll replace f(x) with a direct calculation of the target function instead of the one-hot vector you're using currently.
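A minimal sketch of such a direct target, assuming NumPy inputs. The function name `target`, the sample size, and the lower bound of 0.1 on x (chosen so that 1/x^3 stays moderate) are illustrative assumptions, not part of the original code.

```python
import numpy as np


def target(x):
    """Return sign(sin(1 / x**3)) as -1.0 or +1.0 (an exact zero maps to +1.0)."""
    s = np.sin(1.0 / x ** 3)
    return np.where(s >= 0, 1.0, -1.0)


rng = np.random.default_rng(0)
xtrain = rng.uniform(0.1, 1.0, size=(1000, 1))  # assumption: keep x away from 0
ytrain = target(xtrain)                         # shape (1000, 1), values in {-1, 1}
```

Targets in {-1, 1} match the range of a single tanh output unit, which is what makes mean squared error a natural fit in this variant.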