I am using Ubuntu 19.04 (Disco Dingo), Python 3.7.3, and TensorFlow 1.14.0.
I noticed that the number of outputs given by the tensorflow.keras.Sequential.predict function differs from the number of inputs. Furthermore, there appears to be no relation between the inputs and the outputs.
Example:
import tensorflow as tf
import math
import numpy as np
import json
# We will train the model to recognize an XOR
x = [ [0,0], [0,1], [1,0], [1,1] ]
y = [ 0, 1, 1, 0 ]
xt = tf.cast(x, tf.float64)
yt = tf.cast(y, tf.float64)
# This model should be more than enough to learn an XOR
L0 = tf.keras.layers.Dense(2)
L1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
L2 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
L3 = tf.keras.layers.Dense(2, activation=tf.nn.softmax)
model = tf.keras.Sequential([L0,L1,L2,L3])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)
model.fit(
    x=xt,
    y=yt,
    batch_size=32,
    epochs=1000,  # Try to overfit data
    shuffle=False,
    steps_per_epoch=math.ceil(len(x)/32)
)
# While it is training, the loss drops to near zero
# and the accuracy goes to 100%.
# The large number of epochs and the small number of training examples
# should mean that the network is overtrained.
print("testing")
for i in range(len(y)):
    m = tf.cast([x[i]], tf.float64)
    # m should be the ith training example
    values = model.predict(m, steps=1)
    best = np.argmax(values[0])
    print(x[i], y[i], best)
The output I always get is:
(input, correct answer, predicted answer)
[0, 0] 0 0
[0, 1] 1 0
[1, 0] 1 0
[1, 1] 0 0
or
[0, 0] 0 1
[0, 1] 1 1
[1, 0] 1 1
[1, 1] 0 1
So even though I expected the network to be overtrained, and even though the program reported 100% accuracy and near-zero loss, the output looks as though the network hadn't trained at all.
Stranger yet is when I replace the testing section with the following:
print("testing")
m = tf.cast([], tf.float64)
values = model.predict(m, steps=1)
print(values)
I would think that this would return an empty array or throw an exception. Instead it gives:
[[0.9979249 0.00207507]
[0.10981816 0.89018184]
[0.10981816 0.89018184]
[0.9932179 0.0067821 ]]
This corresponds to [0,1,1,0]
So even though it was given nothing to predict on, it still gives out predictions for something. And it appears as though the predictions match up with what we would expect from sending the entire training set into the predict method.
Replacing the testing section again:
print("testing")
m = tf.cast([[0,0]], tf.float64)
# [0,0] is the first training example
# the output should be something close to [[1.0,0.0]]
values = model.predict(m, steps=1)
for j in range(len(values)):
    print(values[j])
exit()
I get:
[0.9112452 0.08875483]
[0.00552484 0.9944752 ]
[0.00555605 0.99444395]
[0.9112452 0.08875483]
This corresponds to [0,1,1,0]
So asking it to predict on zero inputs gives out 4 predictions. Asking it to predict on one input gives out 4 predictions. Furthermore, the predictions it gives look like what we would expect if we put the entire training set into the predict function.
Any ideas as to what's going on? How do I get my network to give exactly one prediction for each input given?
Providing the solution here (Answer Section), even though it is present in the Comment Section, for the benefit of the community.
Upgrading TensorFlow from 1.14.0 to >= 2.0 resolved the issue.
After upgrading, the test section works as expected:
m = tf.cast([[0,0]], tf.float64)
# [0,0] is the first training example
# the output should be something close to [[1.0,0.0]]
values = model.predict(m, steps=1)
for j in range(len(values)):
    print(values[j])
exit()
Output:
[0.9921625 0.00783745]
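As a further sanity check after the upgrade, predict can also be called on all four training inputs at once; with TF >= 2.0 it should return exactly one row of class probabilities per input row. A minimal sketch, reusing the x, y, and model defined above:
batch = np.array(x, dtype=np.float64)       # shape (4, 2): all four XOR inputs
values = model.predict(batch)               # expected shape (4, 2): one row per input
for inp, label, row in zip(x, y, values):
    print(inp, label, int(np.argmax(row)))  # predicted class should match the label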
Related
For future use, I wanted to test a multivariate multilayer perceptron.
To test it, I made a simple Python program.
Here's the code:
import tensorflow as tf
import pandas as pd
import numpy as np
import random
input = []
result = []
for i in range(0,10000):
    x = random.random()*100
    y = random.random()*100
    input.append([x,y])
    result.append(x*y)
input = np.array(input,dtype=float)
result = np.array(result,dtype = float)
activation_func = "relu"
unit_count = 256
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, input_dim=2),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(1)])
model.compile(optimizer="adam",loss="mse")
model.fit(input,result,epochs=10)
predict_input = np.array([[7,3],[5,4],[8,8]]);
print(model.predict(predict_input))
I tried this code, but the results were not good: the loss stops decreasing at some point.
I also tried smaller x and y values, but then the model becomes inaccurate for bigger numbers.
I changed the activation function, added more dense layers, and increased the number of units, but it didn't get better.
Neural networks are not able to adapt themselves (without additional training) to a different domain, which means you should train on a domain and run inference on that same domain.
With images, we often just scale the input values from [0, 255] to [-1, 1] and let the network learn from values in this range (and during inference we always rescale the inputs into [-1, 1] as well).
To solve your task you should restrict the problem to a bounded domain.
In practice, if you're only interested in training a model to multiply positive numbers, you can squash them into the [0, 1] range, since the multiplication of two values in this range always gives an output value in the same range.
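For example, to compute 7 * 3 with a model trained on inputs in [0, 1], you would feed 0.7 and 0.3, expect a prediction near 0.21, and multiply it back by 10**2 = 100 to recover something close to 21.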
I slightly modified your code and added some comments in the source code.
import random
import numpy as np
import pandas as pd
import tensorflow as tf
input = []
result = []
# We want to train our network to work in a fixed domain
# the [0,1] range.
# Let's also increase the training set -> more data is always better
for i in range(0, 100000):
    x = random.random()
    y = random.random()
    input.append([x, y])
    result.append(x * y)
input = np.array(input, dtype=float)
result = np.array(result, dtype=float)
activation_func = "relu"
unit_count = 256
# no need for a tons of layers
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(unit_count, input_dim=2, activation=activation_func),
        tf.keras.layers.Dense(unit_count, activation=activation_func),
        tf.keras.layers.Dense(1, use_bias=False),
    ]
)
model.compile(optimizer="adam", loss="mse")
model.fit(input, result, epochs=10)
# Bring our input values in the [0,1] range
max_value = 10
predict_input = np.array([[7, 3], [5, 4], [8, 8]]) / max_value
print(predict_input)
# Bring the prediction back to the original domain.
# Multiplying by max_value**2 is required because each of the two
# inputs was divided by max_value before being multiplied together.
print(model.predict(predict_input) * max_value ** 2)
Example output:
[[0.7 0.3]
[0.5 0.4]
[0.8 0.8]]
[[21.04468 ]
[20.028284]
[64.05521 ]]
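For reference, the true products are 21, 20, and 64, so the rescaled predictions land reasonably close after only 10 epochs.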
I have been reading about Keras RNN models (LSTMs and GRUs), and authors seem to largely focus on language data or univariate time series that use training instances composed of previous time steps. The data I have is a bit different.
I have 20 variables measured every year for 10 years for 100,000 persons as input data, and the 20 variables measured for year 11 as output data. What I would like to do is predict the value of one of the variables (not the other 19) for the 11th year.
I have my data structured as X.shape = [persons, years, variables] = [100000, 10, 20] and Y.shape = [persons, variable] = [100000, 1]. Below is my Python code for a LSTM model.
## LSTM model.
# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(128, activation = 'tanh',
input_shape = (X.shape[1], X.shape[2])))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'adam', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X, Y, epochs = 25, batch_size = 128)
I have four (related) questions, please:
Have I coded the Keras model correctly for the data structure I have? The performance I get from a fully-connected network (using flattened data) and from LSTM, GRU, and 1D CNN models are nearly identical, and I don't know if I have made an error in Keras or if a recurrent model is simply not helpful in this case.
Should I have Y as a series with shape Y.shape = [persons, years] = [100000, 11], rather than including the variable in X, which would then have shape X.shape = [persons, years, variables] = [100000, 10, 19]? If so, how can I get the RNN to output the predicted sequence? When I use return_sequences = True, Keras returns an error.
Is this the best way to predict with the data I have? Are there better option choices available in the Keras RNN models, or even other models?
How could I simulate data resembling the data structure I have so that a RNN model would outperform a fully-connected network?
UPDATE:
I have tried a simulation, with what I hope is a very simple case where an RNN should be expected to outperform a FNN.
While the LSTM tends to outperform the FNN when both are smaller (4), the performance becomes identical when they are larger (8+). Can anyone think of a better simulation where an RNN would be expected to outperform an FNN with a similar data structure?
from keras import models
from keras import layers
from keras.layers import Dense, LSTM
import numpy as np
import matplotlib.pyplot as plt
The code below simulates data for 10,000 instances, 10 time steps, and 2 variables. If the second variable has a 0 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 3. If the second variable has a 1 in the very first time step, then Y is the value of the first variable for the very last time step multiplied by 9.
My hope was that the RNN would keep the value of the second variable at the very first time step in memory and use it to know which value (3 or 9) to multiply the first variable at the very last time step by.
## Simulate data.
instances = 10000
sequences = 10
X = np.zeros((instances, sequences * 2))
X[:int(instances / 2), 1] = 1
for i in range(instances):
    for j in range(0, sequences * 2, 2):
        X[i, j] = np.random.random()
Y = np.zeros((instances, 1))
for i in range(len(Y)):
    if X[i, 1] == 0:
        Y[i] = X[i, -2] * 3
    if X[i, 1] == 1:
        Y[i] = X[i, -2] * 9
Below is code for a FNN:
## Densely connected model.
# Define model.
network_dense = models.Sequential()
network_dense.add(layers.Dense(4, activation = 'relu',
input_shape = (X.shape[1],)))
network_dense.add(Dense(1, activation = None))
# Compile model.
network_dense.compile(optimizer = 'rmsprop', loss = 'mean_absolute_error')
# Fit model.
history_dense = network_dense.fit(X, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_dense.predict(X[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_dense.predict(X[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('FNN, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Below is code for a LSTM:
## Structure X data for LSTM.
X_lstm = X.reshape(X.shape[0], X.shape[1] // 2, 2)
X_lstm.shape
## LSTM model.
# Define model.
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(4, activation = 'relu',
input_shape = (X_lstm.shape[1], 2)))
network_lstm.add(layers.Dense(1, activation = None))
# Compile model.
network_lstm.compile(optimizer = 'rmsprop', loss = 'mean_squared_error')
# Fit model.
history_lstm = network_lstm.fit(X_lstm, Y, epochs = 100, batch_size = 256, verbose = False)
plt.scatter(Y[X[:, 1] == 0, :], network_lstm.predict(X_lstm[X[:, 1] == 0, :]), alpha = 0.1)
plt.plot([0, 3], [0, 3], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 0 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
plt.scatter(Y[X[:, 1] == 1, :], network_lstm.predict(X_lstm[X[:, 1] == 1, :]), alpha = 0.1)
plt.plot([0, 9], [0, 9], color = 'black', linewidth = 2)
plt.title('LSTM, Second Variable has a 1 in the Very First Time Step')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
Yes, the code you have is correct for what you are trying to do. Ten years is the time window used to predict the following year, so that should be the number of time steps fed into your model for each of the 20 variables. The sample size of 100,000 observations is not relevant to the input shape of your model.
The way you originally shaped the dependent variable Y is correct: you are predicting a window of 1 year for 1 variable, and you have 100,000 observations. The keyword argument return_sequences=True will cause an error because you only have a single LSTM layer. Set this parameter to True only when you stack multiple LSTM layers and the layer in question is followed by another LSTM layer.
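For illustration, a stacked variant could look like the sketch below (the layer sizes are arbitrary placeholders); only the LSTM layer that feeds into another LSTM layer gets return_sequences=True:
network_lstm = models.Sequential()
network_lstm.add(layers.LSTM(128, activation = 'tanh', return_sequences = True,
                             input_shape = (X.shape[1], X.shape[2])))  # passes the full sequence on
network_lstm.add(layers.LSTM(64, activation = 'tanh'))                 # last LSTM returns only its final output
network_lstm.add(layers.Dense(1, activation = None))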
I wish I could offer some guidance to 3 but without actually having your dataset I don't know if it's possible to answer this with any sort of certainty.
I will say that LSTMs were designed to address what is known as the long-term dependency problem present in regular RNNs. What this problem boils down to is that as the gap grows between when the relevant information was observed and the point where that information becomes useful, the standard RNN has a harder time learning the relationship between them. Think of predicting a stock price based on 3 days of activity versus an entire year.
This leads into number 4. If I use the term 'resembling' loosely and stretch your time window further out, to say 50 years instead of 10, the advantages gained from using an LSTM should become more apparent. Although I'm sure someone more experienced will be able to offer a better answer, and I look forward to seeing it.
I found this page helpful for understanding LSTM's:
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
I am learning AI with Python and have this situation: I created a deep learning model that has 10 neurons in its input layer and 3 neurons in its output layer. I split my data into 80% for training and 20% for testing.
The trained model is ready for testing.
Until now, I have always had only one neuron in the output layer, so I tested the accuracy this way:
classifier = Sequential()
# ...
classifier.add(Dense(units = 3, kernel_initializer = 'uniform', activation = 'sigmoid'))
# ...
y_pred = classifier.predict(np.array(X_test))
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
This works great when the output layer has only ONE value in each prediction.
In my case, however, I have 3 values in each prediction.
y_pred = array([[3.142904686503911194e-11, 1.000000000000000000e+00, 1.729809626091548085e-16],
                [7.398544450698540942e-12, 1.000000000000000000e+00, 1.776427415878292515e-22],
                [4.224535246066807304e-07, 1.000000000000000000e+00, 7.929732391553923065e-12]])
And I want to compare it to my expected values, which:
y_test = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
So, I could make this work manually:
Set the highest value in each prediction to 1; all other values get 0.
Compare the two vectors row by row.
It seems like there must be a better way to do it?
You want to measure how "close" the prediction vector is to the expected vector. A good formula that describes the "amount of difference" between two vectors is to check the magnitude (or square magnitude) of the delta vector (prediction - expected).
In this case, you can do something like this:
def square_magnitude(vector):
    return sum(x*x for x in vector)

def inaccuracy(pred, test):  # should only get equal-length items
    return square_magnitude([pred[i] - test[i] for i in range(len(pred))]) / len(pred)
Since you have three samples:
total_inaccuracy = sum(inaccuracy(y_pred[i], y_test[i]) for i in range(len(y_pred))) / len(y_pred)
This should be 0 when it's perfectly accurate and higher (positive) when it's less accurate.
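As a side note, the manual procedure described in the question (take the highest value per row, then compare) can also be written directly with argmax, which turns each probability vector into a class index before computing standard metrics. A short sketch, assuming y_test is one-hot encoded as shown above:
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

pred_classes = np.argmax(y_pred, axis=1)   # index of the largest probability in each row
true_classes = np.argmax(y_test, axis=1)   # index of the 1 in each one-hot row
print(confusion_matrix(true_classes, pred_classes))
print(accuracy_score(true_classes, pred_classes))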
I have written the following binary classification program in TensorFlow, and it is buggy: the cost is always zero, no matter what the input is. I am trying to debug a larger program that does not learn anything from the data, and I have narrowed down at least one bug to the cost function always returning zero. The program below uses some random inputs and has the same problem. self.X_train and self.y_train are originally supposed to be read from files, and the function self.predict() has more layers forming a feedforward neural network.
import numpy as np
import tensorflow as tf
class annClassifier():

    def __init__(self):
        with tf.variable_scope("Input"):
            self.X = tf.placeholder(tf.float32, shape=(100, 11))
        with tf.variable_scope("Output"):
            self.y = tf.placeholder(tf.float32, shape=(100, 1))
        self.X_train = np.random.rand(100, 11)
        self.y_train = np.random.randint(0, 2, size=(100, 1))

    def predict(self):
        with tf.variable_scope('OutputLayer'):
            weights = tf.get_variable(name='weights',
                                      shape=[11, 1],
                                      initializer=tf.contrib.layers.xavier_initializer())
            bases = tf.get_variable(name='bases',
                                    shape=[1],
                                    initializer=tf.zeros_initializer())
            final_output = tf.matmul(self.X, weights) + bases
            return final_output

    def train(self):
        prediction = self.predict()
        cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=self.y))
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            print(sess.run(cost, feed_dict={self.X: self.X_train, self.y: self.y_train}))


with tf.Graph().as_default():
    classifier = annClassifier()
    classifier.train()
If someone could please figure out what I am doing wrong in this, I can try making the same change in my original program. Thanks a lot!
The only problem is that an invalid cost function is used. softmax_cross_entropy_with_logits should be used when you have more than one output unit (one per class), because the softmax of a single output always returns 1, as it is defined as:
softmax(x)_i = exp(x_i) / SUM_j exp(x_j)
so for a single number (one dimensional output)
softmax(x) = exp(x) / exp(x) = 1
Furthermore, for softmax output TF expects one-hot encoded labels, so if you provide only 0 or 1, there are two possibilities:
True label is 0, so the cost is -0*log(1) = 0
True label is 1, so the cost is -1*log(1) = 0
TensorFlow has a separate function to handle binary classification, which applies a sigmoid instead (note that the same function applied to more than one output would apply the sigmoid independently on each dimension, which is what multi-label classification would expect):
tf.nn.sigmoid_cross_entropy_with_logits
Just switch to this cost and you are good to go; you no longer have to one-hot encode anything either, as this function is designed solely for your use case.
The only missing bit is that your code does not have an actual training routine: you need to define an optimiser, ask it to minimise the loss, and then run the train op in a loop. In your current setup you just try to predict over and over with a network that never changes.
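Putting both points together, the train method could look roughly like the following sketch (the optimiser, learning rate, and iteration count are illustrative choices, not part of the original code):
    def train(self):
        prediction = self.predict()
        # sigmoid cross entropy handles the single-logit binary case
        cost = tf.reduce_mean(
            tf.nn.sigmoid_cross_entropy_with_logits(logits=prediction, labels=self.y))
        train_op = tf.train.AdamOptimizer(1e-3).minimize(cost)
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for step in range(1000):
                _, c = sess.run([train_op, cost],
                                feed_dict={self.X: self.X_train, self.y: self.y_train})
            print(c)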
In particular, please refer to the Cross Entropy Jungle question on SO, which provides a more detailed description of all these different helper functions in TF (and other libraries) and their different requirements/use cases.
The softmax_cross_entropy_with_logits is basically a stable implementation of the 2 parts :
softmax = tf.nn.softmax(prediction)
cost = -tf.reduce_mean(labels * tf.log(softmax), 1)
Now in your example, prediction is a single value, so when you apply softmax to it, it is always going to be 1 irrespective of the value (exp(prediction)/exp(prediction) = 1), and so the tf.log(softmax) term becomes 0. That's why you always get a cost of zero.
Either apply a sigmoid to get your probabilities between 0 and 1, or, if you want to use softmax, encode the labels as [1, 0] for class 0 and [0, 1] for class 1.
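If you prefer the softmax route, a minimal sketch of the change would be to widen the output to two logits and one-hot encode the labels (weights2 and bases2 are hypothetical [11, 2]- and [2]-shaped variables, not ones from the original code):
logits = tf.matmul(self.X, weights2) + bases2                 # shape (100, 2): one logit per class
onehot_y = tf.one_hot(tf.cast(tf.squeeze(self.y), tf.int32), depth=2)
cost = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=onehot_y))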
So I tried implementing the neural network from:
http://iamtrask.github.io/2015/07/12/basic-python-network/
but using TensorFlow instead. I printed out the cost function twice during training, and the error appears to be getting smaller, yet all the values in the output layer are close to 1 when only two of them should be. I imagine it might be something wrong with my maths, but I'm not sure. There is no difference when I try with a hidden layer or use squared error as the cost function. Here is my code:
import tensorflow as tf
import numpy as np
input_layer_size = 3
output_layer_size = 1
x = tf.placeholder(tf.float32, [None, input_layer_size]) #holds input values
y = tf.placeholder(tf.float32, [None, output_layer_size]) # holds true y values
tf.set_random_seed(1)
input_weights = tf.Variable(tf.random_normal([input_layer_size, output_layer_size]))
input_bias = tf.Variable(tf.random_normal([1, output_layer_size]))
output_layer_vals = tf.nn.sigmoid(tf.matmul(x, input_weights) + input_bias)
cross_entropy = -tf.reduce_sum(y * tf.log(output_layer_vals))
training = tf.train.AdamOptimizer(0.1).minimize(cross_entropy)
x_data = np.array(
[[0,0,1],
[0,1,1],
[1,0,1],
[1,1,1]])
y_data = np.reshape(np.array([0,0,1,1]).T, (4, 1))
with tf.Session() as ses:
    init = tf.initialize_all_variables()
    ses.run(init)
    for _ in range(1000):
        ses.run(training, feed_dict={x: x_data, y: y_data})
        if _ % 500 == 0:
            print(ses.run(output_layer_vals, feed_dict={x: x_data}))
            print(ses.run(cross_entropy, feed_dict={x: x_data, y: y_data}))
            print('\n\n')
And this is what it outputs:
[[ 0.82036656]
[ 0.96750367]
[ 0.87607527]
[ 0.97876281]]
0.21947 #first cross_entropy error
[[ 0.99937409]
[ 0.99998224]
[ 0.99992537]
[ 0.99999785]]
0.00062825 #second cross_entropy error, as you can see, it's smaller
First of all: you have no hidden layer. As far as I remember, basic perceptrons could possibly model the XOR problem, but they needed some adjustments. However, artificial neural networks are only inspired by biology; they do not model real neural networks exactly. So you should at least build an MLP (multilayer perceptron), which consists of at least one input, one hidden, and one output layer. The XOR problem needs at least two neurons plus a bias in the hidden layer to be solved correctly (with high precision).
Additionally, your learning rate is too high: 0.1 is a very high learning rate. To put it simply, it means that you update/adapt your current state by 10% with every single learning step, which lets your network quickly forget already-learned invariants. Usually the learning rate is somewhere between 1e-2 and 1e-6, depending on your problem, network size, and general architecture.
Moreover, you implemented the "simplified/short" version of cross-entropy. See Wikipedia for the full version: cross-entropy. To avoid some edge cases, TensorFlow also ships its own cross-entropy helpers, for example tf.nn.softmax_cross_entropy_with_logits.
Finally, you should remember that the cross-entropy error is a logistic loss function that operates on the probabilities of your classes. Although your sigmoid function squashes the output layer into the interval [0, 1], this only works in your case because you have a single output neuron. As soon as you have more than one output neuron, you also need the outputs to sum to exactly 1.0 in order for them to really describe a probability for every class on the output layer.
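Since this network has a single output neuron, one way to use TensorFlow's built-in, numerically stable loss here is the sigmoid variant rather than the softmax one: keep the pre-sigmoid value as logits and feed it to tf.nn.sigmoid_cross_entropy_with_logits. A sketch against the variables defined in the question, also using a smaller (illustrative) learning rate as suggested above:
logits = tf.matmul(x, input_weights) + input_bias        # pre-sigmoid output
output_layer_vals = tf.nn.sigmoid(logits)                # keep for printing predictions
cross_entropy = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
training = tf.train.AdamOptimizer(1e-3).minimize(cross_entropy)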