I wrote this little model using the Keras Functional API to measure the similarity of a dialogue between two individuals. I am using Gensim's Doc2Vec embeddings to transform the text data into vectors (vocab size: 4117). My data is split roughly evenly into 56 positive cases and 64 negative cases. (Yes, I know the dataset is small, but that's all I have for the time being.)
def euclidean_distance(vects):
    x, y = vects
    sum_square = K.sum(K.square(x - y), axis=1, keepdims=True)
    return K.sqrt(K.maximum(sum_square, K.epsilon()))
ch_inp = Input(shape=(38, 200))
csr_inp = Input(shape=(38, 200))
inp = Input(shape=(38, 200))
net = Embedding(int(vocab_size), 16)(inp)
net = Conv2D(16, 1, activation='relu')(net)
net = TimeDistributed(LSTM(8, return_sequences=True))(net)
out = Activation('relu')(net)
sia = Model(inp, out)
x = sia(csr_inp)
y = sia(ch_inp)
sub = Subtract()([x, y])
mul = Multiply()([sub, sub])
mul_x = Multiply()([x, x])
mul_y = Multiply()([y, y])
sub_xy = Subtract()([x, y])
euc = Lambda(euclidean_distance)([x, y])
z = Concatenate(axis=-1)([euc, sub_xy, mul])
z = TimeDistributed(Bidirectional(LSTM(4)))(z)
z = Activation('relu')(z)
z = GlobalMaxPooling1D()(z)
z = Dense(2, activation='relu')(z)
out = Dense(1, activation = 'sigmoid')(z)
model = Model([ch_inp, csr_inp], out)
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
The problem is: my accuracy won't improve from 60.87%. I ran 10 epochs and the accuracy remains constant. Is there something I've done here in my code that's causing that? Or perhaps it's an issue with my data?
I also did K-Fold Validation for some Sklearn models and got these results from the dataset:
Additionally, an overview of my dataset is attached below:
I'm definitely struggling with this one - so literally any help here would be appreciated. Thanks!
I increased my data size to 1875 training samples. The accuracy improved to 70.28%, but it's still constant over all iterations.

I see two things that may be important there.
You're using 'relu' after the LSTM. An LSTM in Keras already has 'tanh' as its default activation. So, although you're not locking up your model, you're making it harder for it to learn: one activation constrains the results to a small range, and the next one cuts off the negative values.
You're using 'relu' with very few units! Relu with few units, bad initialization, big learning rates and bad luck will get stuck in the zero region without any gradients.
If your loss completely freezes, it's most probably due to the second point above. And even if it doesn't freeze, it may be using just one unit from the 2 Dense units, for instance, making the layer very poor.
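To make the "stuck in the zero region" point concrete, here is a minimal pure-Python sketch of a single unit. The weight, bias and inputs are hypothetical values chosen only for illustration; they are not taken from the model in the question.

```python
import math

def relu_grad(v):
    # Derivative of relu with respect to its input
    return 1.0 if v > 0 else 0.0

def tanh_grad(v):
    # Derivative of tanh with respect to its input
    return 1.0 - math.tanh(v) ** 2

# Hypothetical single unit: output = activation(w * x + b)
w, b = -1.5, -0.2
inputs = [0.1, 0.5, 1.0, 2.0]  # non-negative, e.g. outputs of an earlier relu

pre_activations = [w * x + b for x in inputs]

# Per-sample gradient of the unit's output with respect to w (chain rule)
relu_grads_w = [relu_grad(p) * x for p, x in zip(pre_activations, inputs)]
tanh_grads_w = [tanh_grad(p) * x for p, x in zip(pre_activations, inputs)]

print(relu_grads_w)  # every entry is zero: the unit cannot update w
print(tanh_grads_w)  # small but non-zero: the unit can still learn
```

With only 2 or 4 such units in a layer, it does not take much bad luck for all of them to end up in this state at once.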
You should do one of the following:
Your model is small, so quit using 'relu' and use 'tanh' instead. This will give your model the expected power it should have.
Otherwise, you should definitely increase the number of units, both for the LSTM and for the Dense, so 'relu' doesn't get easily stuck.
You can add a BatchNormalization layer after the Dense layer and before 'relu'; this way you guarantee that a good number of units will always be above zero.
In any case, don't use 'relu' after the LSTM.
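The BatchNormalization suggestion can be sketched in plain Python as well. This is a simplified illustration that only applies the standardisation step (per unit, across the batch) and leaves out the learned scale and shift parameters; the pre-activation values are hypothetical.

```python
import math

def standardize(values, eps=1e-3):
    # Normalise one unit's pre-activations across the batch
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

# Hypothetical pre-activations of one unit for a batch of 4 samples:
# all negative, so relu would output zero for every sample
pre_activations = [-3.2, -1.7, -0.95, -0.35]
active_before = sum(1 for p in pre_activations if p > 0)

normalized = standardize(pre_activations)
active_after = sum(1 for p in normalized if p > 0)

print(active_before, active_after)
```

Because the normalised values are centred on zero, some samples land on the positive side of relu, so gradients can flow again.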
The other approach would be making the model more powerful.
For instance:
z = TimeDistributed(Bidirectional(LSTM(4)))(z)
z = Conv1D(10, 3, activation = 'tanh')(z) #or 'relu' maybe
z = MaxPooling1D()(z)
z = Conv1D(15, 3, activation = 'tanh')(z) #or 'relu' maybe
z = Flatten()(z) #unless the length is variable, then GlobalAveragePooling1D()(z)
z = Dense(10, activation='relu')(z)
out = Dense(1, activation = 'sigmoid')(z)


What activation function on the last layer and loss function should I use in an auto encoder for reconstructing a sequence of events? [Keras]

My data set is a 3D array of the size (M,t,N) where M is the number of samples, t is the number of timesteps in a sequence and N is the number of possible events that can happen at time t. By selecting a specific M we have a 2D array of size (t,N) where each row is a timestep and each column is an event. Each column is set to 1 if that event happened at time t, otherwise it's set to 0. Only 1 event can happen at any given timestep.
I want to try and build an auto-encoder for anomaly detection, and in the tutorials and blogs I have read, the last activation layer is 'relu' and the loss function is 'mse'. But since I am trying to basically reconstruct a classification with N classes, would 'softmax' as the last layer and 'categorical_crossentropy' be better?
inputs = Input(shape = (timesteps,n_features))
# Encoder
lstm_enc_1 = LSTM(32, activation='relu', input_shape=(timesteps, n_features), return_sequences=True)(inputs)
lstm_enc_2 = LSTM(latent_dim, activation='relu', return_sequences=False)(lstm_enc_1)
repeater = RepeatVector(timesteps)
# Decoder
lstm_dec_1 = LSTM(latent_dim, activation='relu', return_sequences=True)
lstm_dec_2 = LSTM(32, activation='relu', return_sequences=True)
time_dis = TimeDistributed(Dense(n_features,activation='softmax')) #<-- Does this make sense here?
z = repeater(lstm_enc_2)
h = lstm_dec_1(z)
decoded_h = lstm_dec_2(h)
decoded = time_dis(decoded_h)
ae = Model(inputs,decoded)
ae.compile(loss='categorical_crossentropy', optimizer='adam') #<-- Does this make sense here?
Or should I, for some reason, still use 'relu' and 'mse' as the last activation function and loss function?
Any input is appreciated.
If I read it correctly, N is one-hot encoded, and it sounds like you want to do classification, not regression.
Since y is one-hot encoded, using categorical_crossentropy is correct.
If you have many classes in y (more than 4, say), you may use integer encodings together with sparse_categorical_crossentropy, which converts your y values to one-hot matrices on the fly.
mse is better suited to regression.
As the last activation, since this is a classification, you may want to use softmax, which outputs a probability for each of your y classes.
As far as I know, you normally do not use relu in the last layer; for a regression task, a linear output (or sigmoid, when the targets are bounded in [0, 1]) is generally preferred.
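As a rough illustration of the points above, the following pure-Python sketch computes a softmax over one timestep and shows that the one-hot and the integer-label ("sparse") forms of cross-entropy give the same value. The logits are hypothetical decoder outputs with N = 4; the function names mirror the Keras loss names but are plain re-implementations, not the library code.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot, probs):
    # Target given as a one-hot vector
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))

def sparse_categorical_crossentropy(class_index, probs):
    # Target given as an integer class index
    return -math.log(probs[class_index])

logits = [2.0, 0.5, -1.0, 0.1]  # hypothetical outputs for one timestep
probs = softmax(logits)

loss_one_hot = categorical_crossentropy([1, 0, 0, 0], probs)
loss_sparse = sparse_categorical_crossentropy(0, probs)

print(probs, loss_one_hot, loss_sparse)
```

Since exactly one event happens per timestep, the softmax output (one probability per event, summing to 1) matches the structure of the data, which a relu output trained with mse does not enforce.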

How to implement Batch Normalization on tensorflow with Keras as a high-level API

BatchNormalization (BN) operates slightly differently when in training and in inference. In training, it uses the average and variance of the current mini-batch to scale its inputs; this means that the exact result of the application of batch normalization depends not only on the current input, but also on all other elements of the mini-batch. This is clearly not desirable when in inference mode, where we want a deterministic result. Therefore, in that case, a fixed statistic of the global average and variance over the entire training set is used.
In Tensorflow, this behavior is controlled by a boolean switch training that needs to be specified when calling the layer. How do I deal with this switch when using the Keras high-level API? Am I correct in assuming that it is dealt with automatically, depending on whether we are using model.fit(x, ...) or model.predict(x, ...)?
To test this, I have written this example. We start with a random distribution and we want to classify whether the input is positive or negative. However, we also have a test dataset coming from a different distribution where the inputs are displaced by 2 (and consequently the labels check whether x>2).
import numpy as np
from math import ceil
from tensorflow.data import Dataset
from tensorflow.python.keras import Input, Model
from tensorflow.python.keras.layers import Dense, BatchNormalization
xt = np.random.randn(10_000, 1)
yt = np.array([[int(x > 0)] for x in xt])
train_data = Dataset.from_tensor_slices((xt, yt)).shuffle(10_000).repeat().batch(32).prefetch(2)
xv = np.random.randn(100, 1)
yv = np.array([[int(x > 0)] for x in xv])
valid_data = Dataset.from_tensor_slices((xv, yv)).repeat().batch(32).prefetch(2)
xs = np.random.randn(100, 1) + 2
ys = np.array([[int(x > 2)] for x in xs])
test_data = Dataset.from_tensor_slices((xs, ys)).repeat().batch(32).prefetch(2)
x = Input(shape=(1,))
a = BatchNormalization()(x)
a = Dense(8, activation='sigmoid')(a)
a = BatchNormalization()(a)
y = Dense(1, activation='sigmoid')(a)
model = Model(inputs=x, outputs=y, )
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_data, epochs=10, steps_per_epoch=ceil(10_000 / 32), validation_data=valid_data,
          validation_steps=ceil(100 / 32))
zs = model.predict(test_data, steps=ceil(100 / 32))
print(sum(int(ys[i][0] == int(zs[i][0] > 0.5)) for i in range(100)) / 100)  # fraction of correct predictions
Running the code prints the value 0.5, meaning that half the examples are labeled properly. This is what I would expect if the system was using the global statistics on the training set to implement BN.
If we change the BN layers to read
x = Input(shape=(1,))
a = BatchNormalization()(x, training=True)
a = Dense(8, activation='sigmoid')(a)
a = BatchNormalization()(a, training=True)
y = Dense(1, activation='sigmoid')(a)
and run the code again, we find 0.87. With the training state always forced, the percentage of correct predictions has changed. This is consistent with the idea that model.predict(x, ...) is now using the statistics of the mini-batch to implement BN, and is therefore able to slightly "correct" the mismatch in the source distributions between training and test data.
Is that correct?
If I'm understanding your question correctly, then yes, keras does automatically manage training vs inference behavior based on fit vs predict/evaluate. The flag is called learning_phase, and it determines the behavior of batch norm, dropout, and potentially other things. The current learning phase can be seen with keras.backend.learning_phase(), and set with keras.backend.set_learning_phase().
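The difference between the two modes discussed in this thread can be sketched without any framework. This is a simplified illustration, not the Keras implementation: it leaves out the learned scale and shift, and it uses the statistics of the whole training set as a stand-in for the moving averages that a BN layer accumulates during training.

```python
import math
import random

def mean_and_variance(values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

def batch_norm(batch, mean, var, eps=1e-3):
    return [(v - mean) / math.sqrt(var + eps) for v in batch]

random.seed(0)
train = [random.gauss(0, 1) for _ in range(10_000)]   # training distribution
test_batch = [random.gauss(2, 1) for _ in range(32)]  # shifted by 2, as in the question

# Inference mode: fixed statistics gathered on the training data
train_mean, train_var = mean_and_variance(train)
inference_out = batch_norm(test_batch, train_mean, train_var)

# Training mode (training=True): statistics of the current mini-batch
batch_mean, batch_var = mean_and_variance(test_batch)
training_out = batch_norm(test_batch, batch_mean, batch_var)

mean_inference = sum(inference_out) / len(inference_out)
mean_training = sum(training_out) / len(training_out)
print(mean_inference, mean_training)
```

In inference mode the shifted batch stays shifted after normalisation, whereas in training mode it is re-centred around zero, which is why forcing training=True partly hides the mismatch between the two distributions.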

Creating a neural network in keras to multiply two input integers

I am playing around with Keras v2.0.8 in Python v2.7 (Tensorflow backend) to create small neural networks that calculate simple arithmetic functions (add, subtract, multiply, etc.), and am a bit confused. The below code is my network which generates a random training dataset of integers with the corresponding labels (the two inputs added together):
def create_data(low, high, examples):
    train_data = []
    label_data = []
    a = np.random.randint(low=low, high=high, size=examples, dtype='int')
    b = np.random.randint(low=low, high=high, size=examples, dtype='int')
    for i in range(0, examples):
        train_data.append([a[i], b[i]])
        label_data.append((a[i] + b[i]))
    train_data = np.array(train_data)
    label_data = np.array(label_data)
    return train_data, label_data
X, y = create_data(0, 500, 10000)
model = Sequential()
model.add(Dense(3, input_dim=2))
model.add(Dense(5, activation='relu'))
model.add(Dense(3, activation='relu'))
model.add(Dense(5, activation='relu'))
model.add(Dense(1, activation='relu'))
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=10)
test_data, _ = create_data(0, 500, 10)
results = model.predict(test_data, batch_size=2)
sq_error = []
for i in range(0, len(test_data)):
    print 'test value:', test_data[i], 'result:', results[i][0], 'error:',\
        '%.2f' %(results[i][0] - (test_data[i][0] + test_data[i][1]))
    sq_error.append((results[i][0] - (test_data[i][0] + test_data[i][1])))
print '\n total rmse error: ', sqrt(np.sum(np.array(sq_error)))
This trains perfectly well and produces no unexpected results. However, when I create the training data by multiplying the two inputs together the model's loss for each epoch stays around 7,000,000,000 and the model does not converge at all. The data creation function for this is as follows:
def create_data(low, high, examples):
    train_data = []
    label_data = []
    a = np.random.randint(low=low, high=high, size=examples, dtype='int')
    b = np.random.randint(low=low, high=high, size=examples, dtype='int')
    for i in range(0, examples):
        train_data.append([a[i], b[i]])
        label_data.append((a[i] * b[i]))
    train_data = np.array(train_data)
    label_data = np.array(label_data)
    return train_data, label_data
I also had the same problem when I had training data of a single input integer and created the label by squaring the input data. However, it worked fine when I only multiplied the single input by a constant value or added/subtracted by a constant.
I have two questions:
1) Why is this the case? I assume it has something to do with the fundamentals of neural networks, but I can't work it out.
2) How could I adapt this code to train a model that multiplies two input numbers together.
The network architecture (2 - 3 - 5 - 3 - 5 - 1) is fairly random right now. I've tried lots of different ones varying in layers and neurons, this one just happened to be on my screen as I write this and got an accuracy of 100% for adding two inputs.
It is due to large gradient updates caused by large numbers in training data. When using a neural network, you should first ensure that the training data falls in a small range (usually [-1,1] or [0,1]) to help the optimization process and prevent disruptive gradient updates. Therefore, you should first normalize data. In this case, one good candidate would be log-normalization.
Further, the 'accuracy' as a metric in Keras is used in case of a classification problem. In a regression problem, using it does not make sense, and instead it's better to use a relevant metric like "mean absolute error" or 'mae'.
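A minimal sketch of the two normalisation options, in plain Python. The helper names are hypothetical, the bound of 500 is taken from create_data(0, 500, ...) in the question, and the log variant assumes inputs of at least 1 (the question draws from 0, where the logarithm is undefined).

```python
import math

HIGH = 500  # upper bound passed to create_data in the question

def scale_inputs(a, b, high=HIGH):
    # Map both inputs into [0, 1]
    return a / high, b / high

def scale_product(y, high=HIGH):
    # The product of two values in [0, high] lies in [0, high ** 2]
    return y / (high * high)

def log_features(a, b):
    # Log-normalisation; assumes a >= 1 and b >= 1
    return math.log(a), math.log(b)

a, b = 123, 456
xa, xb = scale_inputs(a, b)
y_scaled = scale_product(a * b)

la, lb = log_features(a, b)
recovered = math.exp(la + lb)  # log(a * b) = log(a) + log(b)

print(xa, xb, y_scaled, recovered)
```

With the log transform, the multiplication target becomes a sum of the two inputs, which is the task the original network already handled; predictions are mapped back with exp.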

Keras binary_crossentropy cost function systematic error with large networks

The title may be too specific and this may also apply to other cost functions as well. Basically I have the dummy setup as below (in my case I have a vastly different architecture, but the problem is always replicated):
l2_sparsity = 5e-7
l1_sparsity = 1e-8
mod=Sequential([Dense(hidden_units, input_shape = (1000,), activation="relu", kernel_regularizer=l1_l2(l1=l1_sparsity, l2=l2_sparsity)),
                Dense(hidden_units, activation="relu", kernel_regularizer=l1_l2(l1=l1_sparsity, l2=l2_sparsity)),
                Dense(1000, activation="sigmoid")])
adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0)
mod.compile(loss='binary_crossentropy', optimizer=adam, metrics=['binary_accuracy'])
x = np.array([np.array([random.random() for i in range(1000)], dtype=np.float64) for i in range(10)])
classes = (x + 0.5).astype(np.uint32)
def replica_cross_entropy_loss(predictions, truth):
    # eps is not defined in this snippet; it is assumed to be a small constant set elsewhere
    predictions = np.clip(predictions, eps, 1.0-eps)
    x = np.log(1.0 *predictions / (1.0 - predictions) )
    return np.mean(np.clip(x, 0, None) - x * truth + np.log(1 + np.exp(-abs(x))))
mod.fit(x[:2],classes[:2], batch_size=128, epochs=1)
preds = mod.predict(x[:2])
print replica_cross_entropy_loss(preds,classes[:2])
print mod.test_on_batch(x[:2],classes[:2])[0]
print 100*(mod.test_on_batch(x[:2],classes[:2])[0] /
replica_cross_entropy_loss(preds,classes[:2]) - 1)
What I am doing here is comparing the built-in keras loss to a (I believe) faithful reconstruction using numpy. Now, experimenting with the hidden units in the mlp architecture outlined yields some interesting results. Specifically:
The Keras cost function is systematically higher than the numpy equivalent, with the difference related to the number of neurons in the hidden layer.
As an illustration, here is a simple graph that plots the number of neurons in the hidden layer of this mlp vs the pct overestimate of the keras cost function:
I should say that this seems to be backend agnostic, affecting both theano and tensorflow.
Judging from this behaviour, it seems likely to be a precision issue. Do you know if there is a way to mitigate this? For my real model the cost function is consistently 5-10% different from the numpy answer.
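There was a regularisation term difference between the two cost functions: the loss reported by Keras includes the kernel regularisation penalties, whereas the numpy version does not. That term is arguably not that useful in evaluating performance.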
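To see why such a penalty grows with the hidden-layer size, here is a rough pure-Python sketch. The l1/l2 values are the ones from the question; the constant weight magnitude of 0.05 and the data loss of 0.69 are assumptions made purely for illustration, so the printed percentages are not meant to reproduce the original graph.

```python
# Hyperparameters copied from the question
l1_sparsity = 1e-8
l2_sparsity = 5e-7

def l1_l2_penalty(weights, l1, l2):
    # L1 + L2 penalty summed over a flat list of kernel weights
    return l1 * sum(abs(w) for w in weights) + l2 * sum(w * w for w in weights)

def constant_kernel(n_in, n_out, value=0.05):
    # Hypothetical kernel in which every weight has the same magnitude (assumption)
    return [value] * (n_in * n_out)

data_loss = 0.69  # hypothetical unregularised cross-entropy (assumption)

penalties = {}
for hidden_units in (100, 1000):
    # Only the first two Dense layers carry a regulariser in the question
    kernels = [constant_kernel(1000, hidden_units),
               constant_kernel(hidden_units, hidden_units)]
    penalties[hidden_units] = sum(
        l1_l2_penalty(k, l1_sparsity, l2_sparsity) for k in kernels)
    reported = data_loss + penalties[hidden_units]
    overestimate_pct = 100 * (reported / data_loss - 1)
    print(hidden_units, penalties[hidden_units], overestimate_pct)
```

The number of regularised weights grows with the number of hidden units, so the gap between the reported loss and the plain cross-entropy widens as the layers get larger.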

Implementation of Neural Networks

I'm trying to understand how to implement neural networks, so I made my own dataset. Xtrain is numpy.random floats. Ytrain is sign(sin(1/x^3)).
My attempt at implementing a neural network gave very poor results: 30% accuracy. A Random Forest with 100 trees gives 97%. But I heard that a NN can approximate any function. What is wrong with my understanding?
import numpy as np
import keras
import math
from sklearn.ensemble import RandomForestClassifier as RF
train = np.random.rand(100000)
test = np.random.rand(100000)
def g(x):
    if math.sin(2*3.14*x) > 0:
        if math.cos(2*3.14*x) > 0:
            return 0
        return 1
    if math.cos(2*3.14*x) > 0:
        return 2
    return 3
def f(x):
    x = (1/x) ** 3
    res = [0, 0, 0, 0]
    res[g(x)] = 1
    return res
ytrain = np.array([f(x) for x in train])
ytest = np.array([f(x) for x in test])
train = np.array([[x] for x in train])
test = np.array([[x] for x in test])
from keras.models import Sequential
from keras.layers import Dense, Activation, Embedding, LSTM
model = Sequential()
model.add(Dense(100, input_dim=1))
P.S. I tried out many layers, activation functions, loss functions, optimizers, but never got more than 30% accuracy :(
I suspect that the 30% accuracy results from a combination of a small learning rate and a small number of training steps.
I ran your code snippet with model.fit(train, ytrain, nb_epoch=5, batch_size=32); after 5 epochs of training it yields about 28% accuracy. With the same settings but with the training steps increased to nb_epoch=50, the loss drops to about 1.157 and the accuracy rises to 40%. Increasing the training steps further should let the model converge further. Other than that, you can also try configuring the model with a larger learning rate, which could make it converge faster:
model.compile(loss='categorical_crossentropy', optimizer=SGD(lr=0.1, momentum=0.9, nesterov=True), metrics=['accuracy'])
Be careful, though, not to set the learning rate too large, otherwise your loss could blow up.
NNs are known for having the potential to model extremely complex functions; however, whether or not the model actually performs well depends on how the model is designed and trained, and on many other matters related to the specific application.
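The warning about overly large learning rates can be illustrated with a toy problem. This sketch runs plain gradient descent on f(w) = w^2, which is an assumption chosen for simplicity and has nothing to do with the network in the question; the two learning rates are likewise illustrative.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    # Minimise f(w) = w ** 2, whose gradient is 2 * w
    w = w0
    for _ in range(steps):
        grad = 2.0 * w
        w -= lr * grad
    return w

w_small = gradient_descent(lr=0.1)  # each step multiplies w by 0.8
w_large = gradient_descent(lr=1.1)  # each step multiplies w by -1.2

print(w_small, w_large)
```

With the small rate the parameter shrinks towards the minimum at 0; with the large rate every step overshoots by more than the previous one, so the loss grows without bound.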
Zhongyu Kuang's answer is correct in stating that you may need to train it longer or with a different learning rate.
I'll add that the deeper your network, the longer you'll need to train it before it converges. For a relatively simple function like sign(sin(1/x^3)), you may be able to get away with a smaller network than the one you're using.
Additionally, softmax probably isn't the best output layer. You just need to yield -1 or 1. A single tanh unit seems like it would do well. softmax is generally used when you want to learn a probability distribution over a finite set. (You'll probably want to switch your error function from cross entropy to mean square error for similar reasons.)
Try a network with one sigmoidal hidden layer and an output layer with just one tanh unit. Then play around with the layer size and learning rate. Maybe add a second hidden layer if you can't get results with just one, but I wouldn't be surprised if it's unnecessary.
Addendum: In this approach, you'll replace f(x) with a direct calculation of the target function instead of the one-hot vector you're using currently.
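A minimal sketch of what that replacement could look like, assuming the target is sign(sin(1/x^3)) as stated in the question text. The function name target, the sampling range (which keeps x away from 0 to avoid division by zero) and the choice to map an exact zero to -1 are assumptions made for illustration.

```python
import math
import random

def target(x):
    # Direct +1 / -1 label for a single tanh output unit
    return 1.0 if math.sin((1.0 / x) ** 3) > 0 else -1.0

random.seed(0)
xs = [random.uniform(0.05, 1.0) for _ in range(1000)]  # assumed input range
ys = [target(x) for x in xs]

print(ys[:10])
```

Each label is then a single number in {-1, 1}, which matches the range of a tanh output and can be trained with mean squared error as suggested above.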
