I am trying to implement XOR in a neural network with a topology of 2 inputs, 1 element in the hidden layer, and 1 output. But the learning rate is really bad (0.5). I think it is because I am missing a connection between the inputs and the outputs, but I am not really sure how to do it. I have already added the bias connection so that the learning is better. I am only using NumPy.
import numpy as np

def sigmoid(x):
    # sigmoid was not shown in the original snippet; a standard definition is assumed
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    return output*(1-output)

a = 0.1
X = np.array([[0,0],
              [0,1],
              [1,0],
              [1,1]])
np.random.seed(1)
y = np.array([[0],
              [1],
              [1],
              [0]])
bias = np.ones(4)
X = np.c_[bias, X]
synapse_0 = 2*np.random.random((3,1)) - 1
synapse_1 = 2*np.random.random((1,1)) - 1

for j in range(600000):
    layer_0 = X
    layer_1 = sigmoid(np.dot(layer_0, synapse_0))
    layer_2 = sigmoid(np.dot(layer_1, synapse_1))
    layer_2_error = layer_2 - y
    if (j % 10000) == 0:
        print("Error after " + str(j) + " iterations:" + str(np.mean(np.abs(layer_2_error))))
    layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)
    layer_1_error = layer_2_delta.dot(synapse_1.T)
    layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)
    synapse_1 -= a * (layer_1.T.dot(layer_2_delta))
    synapse_0 -= a * (layer_0.T.dot(layer_1_delta))
You need to be careful with statements like
the learning rate is bad
Usually the learning rate is the step size that gradient descent takes in the direction of the negative gradient, so I'm not sure what you mean by a bad learning rate.
I'm also not sure if I understand your code correctly, but the forward step of a neural net is basically a matrix multiplication of the weight matrix for the hidden layer times the input vector. This will (if you set everything up correctly) result in a matrix whose size matches your hidden layer. Now, you can simply add the bias before applying your logistic function elementwise to this matrix.
h_i = f(h_i+bias_in)
Afterwards you can do the same thing for the hidden layer times the output weights and apply its activation to get the outputs.
o_j = f(o_j+bias_h)
The backwards step is to calculate the deltas at output and hidden layer including another elementwise operation with your function
sigmoid_output_to_derivative(output)
and update both weight matrices using the gradients (here the learning rate is needed to define the step size). The gradients are simply the value of a corresponding node times its delta.
Note: The deltas are differently calculated for output and hidden nodes.
I'd advise you to keep separate variables for the biases, because modern approaches usually update these by summing up the deltas of their connected nodes (times a possibly different learning rate) and subtracting that product from the specific bias.
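For illustration, here is a minimal NumPy sketch of such a network with two hidden neurons and separate bias variables (the layer sizes, learning rate and iteration count are my own choices, not taken from your code):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

np.random.seed(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

# 2 inputs -> 2 hidden neurons -> 1 output, with the biases kept separately
W0 = 2 * np.random.random((2, 2)) - 1
b0 = np.zeros((1, 2))
W1 = 2 * np.random.random((2, 1)) - 1
b1 = np.zeros((1, 1))
lr = 0.5

for i in range(60000):
    # forward pass: weights times inputs, add the bias, then the logistic function
    h = sigmoid(X.dot(W0) + b0)
    out = sigmoid(h.dot(W1) + b1)

    # backward pass: the deltas are calculated differently for output and hidden nodes
    out_delta = (out - y) * out * (1 - out)
    h_delta = out_delta.dot(W1.T) * h * (1 - h)

    # gradient = node value times delta; the biases use the summed deltas
    W1 -= lr * h.T.dot(out_delta)
    b1 -= lr * out_delta.sum(axis=0, keepdims=True)
    W0 -= lr * X.T.dot(h_delta)
    b0 -= lr * h_delta.sum(axis=0, keepdims=True)

# if training succeeded the outputs approach [[0], [1], [1], [0]];
# some seeds may need more iterations or a third hidden unit
print(out.round(3))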
Take a look at the following tutorial (it uses numpy):
http://peterroelants.github.io/posts/neural_network_implementation_part04/
Related
Can we activate the outputs of a NN to gain insight into how the neurons are connected to input features?
If I take a basic NN example from the PyTorch tutorials, here is an example of training a model f(x) to predict y:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4

for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
After I've finished training the network to predict y from x inputs, is it possible to reverse the trained NN so that it can now predict x from y inputs?
I don't expect y to match the original inputs that trained the y outputs, but I do expect to see which features the model activates on to match x and y.
If it is possible, then how do I rearrange the Sequential model without breaking all the weights and connections?
It is possible but only for very special cases. For a feed-forward network (Sequential) each of the layers needs to be reversible; that means the following arguments apply to each layer separately. The transformation associated with one layer is y = activation(W*x + b) where W is the weight matrix and b the bias vector. In order to solve for x we need to perform the following steps:
Reverse the activation; not all activation functions have an inverse though. For example, the ReLU function does not have an inverse on (-inf, 0). If we use tanh, on the other hand, we can use its inverse, which is 0.5 * log((1 + x) / (1 - x)).
Solve W*x = inverse_activation(y) - b for x; for a unique solution to exist, W must be square (same number of rows and columns) and det(W) must be non-zero. We can control the former by choosing a specific network architecture, while the latter depends on the training process.
So for a neural network to be reversible it must have a very specific architecture: all layers must have the same number of input and output neurons (i.e. square weight matrices) and the activation functions all need to be invertible.
Code: Using PyTorch we will have to do the inversion of the network manually, both in terms of solving the system of linear equations as well as finding the inverse activation function. Consider the following example of a 1-layer neural network (since the steps apply to each layer separately extending this to more than 1 layer is trivial):
import torch

N = 10  # number of samples
n = 3   # number of neurons per layer

x = torch.randn(N, n)

model = torch.nn.Sequential(
    torch.nn.Linear(n, n), torch.nn.Tanh()
)

y = model(x)

z = y  # use 'z' for the reverse result, start with the model's output 'y'.
for step in list(model.children())[::-1]:
    if isinstance(step, torch.nn.Linear):
        z = z - step.bias[None, ...]
        z = z[..., None]  # 'torch.solve' requires N column vectors (i.e. shape (N, n, 1)).
        z = torch.solve(z, step.weight)[0]
        z = torch.squeeze(z)  # remove the extra dimension that we've added for 'torch.solve'.
    elif isinstance(step, torch.nn.Tanh):
        z = 0.5 * torch.log((1 + z) / (1 - z))

print('Agreement between x and z: ', torch.dist(x, z))
If I've understood correctly, there are two questions here:
1. Is it possible to determine what features in the input have activated neurons?
2. If so, is it possible to use this information to generate samples from p(x|y)?
Regarding 1, a basic way to determine if a neuron is sensitive to an input feature x_i is to compute the gradient of this neuron's output w.r.t x_i. A high gradient will indicate sensitivity to a particular input element. There is a rich literature on the subject, for example, you can have a look at guided backpropagation or at GradCam (the latter is about classification with convnets, but it does contain useful ideas).
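For example, a minimal saliency-style check in PyTorch could look like this (it assumes the model and x from the training snippet above; the choice of output neuron 0 is arbitrary):

import torch

# one input sample, tracked for gradients
x_probe = x[:1].clone().requires_grad_(True)
output = model(x_probe)

# gradient of a single output neuron (index 0 here) w.r.t. every input feature
output[0, 0].backward()
sensitivity = x_probe.grad.abs().squeeze()

# the features with the largest absolute gradient are the ones this neuron reacts to most
print(sensitivity.topk(5))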
As for 2, I don't think that your approach to "reversing the problem" is correct. The problem is that your network is discriminative and what it outputs can be seen as argmax_y p(y|x). Note that this is a point-wise estimation, not a full modeling of the distribution. However, the inverse problem that you're interested in seems to be sampling from
p(x|y)=constant*p(y|x)p(x).
You don't know how to sample from p(y|x) and you don't know anything about p(x). Even if you use a method to discover correlations between the neurons and specific input features, you have only discovered which features were more important to the network's prediction, but depending on the nature of y this might be insufficient. Consider a toy example where your inputs x are 2d points distributed according to some distribution in R^2 and where the output y is binary, such that any (a,b) in R^2 is classified as 1 if a < 1 and as 0 if a > 1. Then a discriminative network could learn the vertical line x = 1 as its decision boundary. Inspecting correlations between neurons and input features will reveal that only the first coordinate was useful in this prediction, but this information is not sufficient for sampling from the full 2d distribution of inputs.
I think that Variational autoencoders could be what you're looking for.
I want to use the weights and biases from a Keras ML model to make a mathematical prediction function in another program that does not have Keras installed (and cannot).
I have a simple MLP model that I'm using to fit data. I'm running in Python with Keras and a TensorFlow backend; for now, I'm using an input layer, 1 hidden layer, and 1 output layer. All layers are ReLU, my optimizer is adam, and the loss function is mean_squared_error.
From what I understand, the weights I get for the layers should be used mathematically in the form:
(SUM (w*i)) + b
Where the sum is over all weights and inputs, and b is the bias on the neuron. For example, let's say I have an input layer of shape (33, 64): there are 33 inputs feeding 64 neurons. I'll have a vector input of dim 33 and a vector output of dim 64. This would make each SUM consist of 33 terms (33 inputs times their 33 weights), and the output would be all 64 SUMs plus the 64 biases (respectively).
The next layer, in my case it's 32 neurons, will do the same but with 64 inputs and 32 outputs. The output layer I have goes to a single value, so input 32 and output 1.
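For reference, the shape arithmetic described above can be checked with a few lines of NumPy (random placeholder arrays, 33 inputs -> 64 neurons):

import numpy as np

w = np.random.randn(33, 64)   # one column of 33 weights per neuron
b = np.random.randn(64)
data = np.random.randn(33)

# each of the 64 outputs is (SUM (w*i)) + b for its own column of weights
out = data.dot(w) + b
print(out.shape)  # (64,)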
I have written code to try to mimic the model. Here is a snippet for making a single prediction:
def modelR(weights, biases, data):
    # This is the input layer.
    y = []
    for i in range(len(weights[0][0])):
        x = np.zeros(len(weights[0][0]))
        for j in range(len(data)):
            x[i] += weights[0][j][i]*data[j]
        y.append(x[i]+biases[0][i])
    # This is the hidden layer.
    z = []
    for i in range(len(weights[1][0])):
        x = np.zeros(len(weights[1][0]))
        for j in range(len(y)):
            x[i] += weights[1][j][i]*y[j]
        z.append(x[i]+biases[1][i])
    # This is the output layer.
    p = 0.0
    for i in range(len(z)):
        p += weights[-1][i][0]*z[i]
    p = p+biases[-1][0]
    return p
To be clear, "weights" and "biases" are derived via:
weights = []
biases = []
for i in range(len(model.layers)):
    weights.append(model.layers[i].get_weights()[0])
    biases.append(model.layers[i].get_weights()[1])
weights = np.asarray(weights)
biases = np.asarray(biases)
So the first weight on the first neuron for the first input is weight[0][0][0], the first weight on the first input for the second neuron is weight[0][1][0], etc. I could be wrong on this, which may be where I'm getting stuck. But this makes sense as we're going from (1 x 33) vector to a (1 x 64) vector, so we ought to have a (33 x 64) matrix.
Any ideas of where I'm going wrong? Thanks!
EDIT: ANSWER FOUND
I'm marking jhso's answer as correct, even though it didn't work properly in my code as such (I'm probably missing an import statement somewhere). The key was the activation function. I was using ReLU, so I shouldn't have been passing along any negative values. Also, jhso shows a nice way to avoid loops and simply do the matrix multiplication (which I didn't know NumPy did). Now I just have to figure out how to do it in C++!
I think it's good to familiarise yourself with linear algebra when working with machine learning. When we have an equation of the form sum(matrix elem times another matrix elem) it's often a simple matrix multiplication of the form matrix1 * matrix2.T. This simplifies your code quite a bit:
def modelR(weights, biases, data):
    # This is the input layer.
    y = np.matmul(data, weights[0]) + biases[0][None, :]
    y_act = relu(y)  # also dropout or any other function you use here
    z = np.matmul(y_act, weights[1]) + biases[1][None, :]
    z_act = relu(z)  # also dropout and any other function you use here
    p = np.matmul(z_act, weights[2]) + biases[2][None, :]
    p_act = sigmoid(p)
    return p_act
I made a guess at which activation function you use. I'm also unsure of how your data is structured; just make sure that the features/weights are always the inner dimension of the multiplication, i.e. if your input is (Bx10) and your weights are (10x64), then input*weights is good enough and will produce an output of shape (Bx64).
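If Keras is available where you extract the weights, one way to sanity-check the hand-rolled function is to compare it against the model itself on a few samples. A rough sketch (relu/sigmoid are simple helpers that modelR above assumes; model and sample_data are placeholders for your trained model and a (batch, 33) input array):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# 'model' is the trained Keras model, 'sample_data' a (batch, 33) array; both are assumed to exist
weights = [layer.get_weights()[0] for layer in model.layers]
biases = [layer.get_weights()[1] for layer in model.layers]

manual = modelR(weights, biases, sample_data)
keras_out = model.predict(sample_data)
print(np.max(np.abs(manual - keras_out)))  # ~0 only if the activations in modelR match the model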
So I tried implementing the neural network from:
http://iamtrask.github.io/2015/07/12/basic-python-network/
but using TensorFlow instead. I printed out the cost function twice during training and the error appears to be getting smaller, yet all the values in the output layer are close to 1 when only two of them should be. I imagine it might be something wrong with my maths, but I'm not sure. There is no difference when I try with a hidden layer or use squared error as the cost function. Here is my code:
import tensorflow as tf
import numpy as np

input_layer_size = 3
output_layer_size = 1

x = tf.placeholder(tf.float32, [None, input_layer_size])   # holds input values
y = tf.placeholder(tf.float32, [None, output_layer_size])  # holds true y values

tf.set_random_seed(1)

input_weights = tf.Variable(tf.random_normal([input_layer_size, output_layer_size]))
input_bias = tf.Variable(tf.random_normal([1, output_layer_size]))

output_layer_vals = tf.nn.sigmoid(tf.matmul(x, input_weights) + input_bias)

cross_entropy = -tf.reduce_sum(y * tf.log(output_layer_vals))

training = tf.train.AdamOptimizer(0.1).minimize(cross_entropy)

x_data = np.array(
    [[0,0,1],
     [0,1,1],
     [1,0,1],
     [1,1,1]])

y_data = np.reshape(np.array([0,0,1,1]).T, (4, 1))

with tf.Session() as ses:
    init = tf.initialize_all_variables()
    ses.run(init)
    for _ in range(1000):
        ses.run(training, feed_dict={x: x_data, y: y_data})
        if _ % 500 == 0:
            print(ses.run(output_layer_vals, feed_dict={x: x_data}))
            print(ses.run(cross_entropy, feed_dict={x: x_data, y: y_data}))
            print('\n\n')
And this is what it outputs:
[[ 0.82036656]
[ 0.96750367]
[ 0.87607527]
[ 0.97876281]]
0.21947 #first cross_entropy error
[[ 0.99937409]
[ 0.99998224]
[ 0.99992537]
[ 0.99999785]]
0.00062825 #second cross_entropy error, as you can see, it's smaller
First of all: you have no hidden layer. As far as I remember, a basic perceptron cannot model the XOR problem without adjustments. Artificial neural networks are only inspired by biology; they do not model real neural networks exactly. Thus, you have to at least build an MLP (multilayer perceptron), which consists of at least one input, one hidden, and one output layer. The XOR problem needs at least two neurons plus a bias in the hidden layer to be solved correctly (with high precision).
Additionally, your learning rate is too high. 0.1 is a very high learning rate. To put it simply: it basically means that you update/adapt your current state by 10% of a single learning step. This lets your network quickly forget invariants it has already learned. Usually the learning rate is somewhere between 1e-2 and 1e-6, depending on your problem, network size, and general architecture.
Moreover you implemented the "simplified/short" version of cross-entropy. See wikipedia for the full version: cross-entropy. However, to avoid some edge cases TensorFlow already has its own version of cross-entropy: for example tf.nn.softmax_cross_entropy_with_logits.
Finally, you should remember that the cross-entropy error is a logistic loss function that operates on probabilities of your classes. Although your sigmoid squashes the output layer into the interval [0, 1], this only works in your case because you have a single output neuron. As soon as you have more than one output neuron, you also need the outputs to sum to exactly 1.0 in order to really describe a probability for every class at the output layer.
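As a rough sketch (not a drop-in fix), adding a small hidden layer and switching to TensorFlow's built-in sigmoid cross-entropy could look like this in the same style as your code; the hidden size of 4 and the learning rate of 0.01 are my own choices:

hidden_size = 4  # arbitrary choice for this sketch

w1 = tf.Variable(tf.random_normal([input_layer_size, hidden_size]))
b1 = tf.Variable(tf.zeros([1, hidden_size]))
hidden = tf.nn.sigmoid(tf.matmul(x, w1) + b1)

w2 = tf.Variable(tf.random_normal([hidden_size, output_layer_size]))
b2 = tf.Variable(tf.zeros([1, output_layer_size]))
logits = tf.matmul(hidden, w2) + b2  # note: no sigmoid here, the loss works on logits

# handles the log() edge cases internally instead of the hand-written cross-entropy
loss = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
training = tf.train.AdamOptimizer(0.01).minimize(loss)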
I've written the following backpropagation routine for a neural network, using the code here as an example. The issue I'm facing is confusing me, and has pushed my debugging skills to their limit.
The problem I am facing is rather simple: as the neural network trains, its weights are being trained to zero with no gain in accuracy.
I have attempted to fix it many times, verifying that:
the training sets are correct
the target vectors are correct
the forward step is recording information correctly
the backward step deltas are recording properly
the signs on the deltas are correct
the weights are indeed being adjusted
the deltas of the input layer are all zero
there are no other errors or overflow warnings
Some information:
The training inputs are an 8x8 grid of [0,16) values representing an intensity; this grid represents a numeral digit (converted to a column vector)
The target vector is an output that is 1 in the position corresponding to the correct number
The original weights and biases are being assigned by Gaussian distribution
The activations are a standard sigmoid
I'm not sure where to go from here. I've verified that all things I know to check are operating correctly, and it's still not working, so I'm asking here. The following is the code I'm using to backpropagate:
def backprop(train_set, wts, bias, eta):
    learning_coef = eta / len(train_set[0])
    for next_set in train_set:
        # These record the sum of the cost gradients in the batch
        sum_del_w = [np.zeros(w.shape) for w in wts]
        sum_del_b = [np.zeros(b.shape) for b in bias]
        for test, sol in next_set:
            del_w = [np.zeros(wt.shape) for wt in wts]
            del_b = [np.zeros(bt.shape) for bt in bias]
            # These two helper functions take training set data and make them useful
            next_input = conv_to_col(test)
            outp = create_tgt_vec(sol)
            # Feedforward step
            pre_sig = []; post_sig = []
            for w, b in zip(wts, bias):
                next_input = np.dot(w, next_input) + b
                pre_sig.append(next_input)
                post_sig.append(sigmoid(next_input))
                next_input = sigmoid(next_input)
            # Backpropagation gradient
            delta = cost_deriv(post_sig[-1], outp) * sigmoid_deriv(pre_sig[-1])
            del_b[-1] = delta
            del_w[-1] = np.dot(delta, post_sig[-2].transpose())
            for i in range(2, len(wts)):
                pre_sig_vec = pre_sig[-i]
                sig_deriv = sigmoid_deriv(pre_sig_vec)
                delta = np.dot(wts[-i+1].transpose(), delta) * sig_deriv
                del_b[-i] = delta
                del_w[-i] = np.dot(delta, post_sig[-i-1].transpose())
            sum_del_w = [dw + sdw for dw, sdw in zip(del_w, sum_del_w)]
            sum_del_b = [db + sdb for db, sdb in zip(del_b, sum_del_b)]
        # Modify weights based on current batch
        wts = [wt - learning_coef * dw for wt, dw in zip(wts, sum_del_w)]
        bias = [bt - learning_coef * db for bt, db in zip(bias, sum_del_b)]
    return wts, bias
Following Shep's suggestion, I checked what happens when training a network of shape [2, 1, 1] to always output 1, and indeed, the network trains properly in that case. My best guess at this point is that the gradient adjusts too strongly for the 0s and too weakly for the 1s, resulting in a net decrease despite an increase at each step - but I'm not sure.
I suppose your problem lies in the choice of initial weights and in the choice of weight-initialization algorithm. Jeff Heaton, the author of Encog, claims that this kind of initialization usually performs worse than other initialization methods. Here are some other results on the performance of weight-initialization algorithms. Also, from my own experience, I recommend that you initialize your weights with values of different signs. Even in cases where all the outputs were positive, weights with different signs performed better than weights with the same sign.
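As a loose sketch of what "weights with different signs" can look like in practice, a symmetric uniform range scaled by the layer sizes (a Glorot/Xavier-style heuristic, not necessarily what Encog uses) is one common choice:

import numpy as np

def init_weights(n_out, n_in):
    # symmetric range centred on zero, so roughly half the weights start out
    # negative and half positive; the scale shrinks as the layers get bigger
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_out, n_in))

# shapes follow the column-vector convention of backprop() above;
# the hidden size of 30 is an arbitrary choice for this sketch
wts = [init_weights(30, 64), init_weights(10, 30)]
bias = [np.zeros((30, 1)), np.zeros((10, 1))]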
Hi, I am having an issue with my gradient check when implementing a neural network in Python using NumPy.
I am using the MNIST dataset and mini-batch gradient descent.
I have checked the math and it looks good on paper, so maybe you can give me a hint about what's happening here:
EDIT: One answer made me realize that the cost function was indeed being calculated wrongly. However, that does not explain the problem with the gradient, as it is calculated using back_prop. I get a 7% error rate using 300 units in the hidden layer with mini-batch gradient descent and RMSprop, 30 epochs and 100 batches (learning_rate = 0.001, small due to RMSprop).
Each input has 768 features, so for 100 samples I have the following matrices. MNIST has 10 classes.
X = NoSamplesxFeatures = 100x768
Y = NoSamplesxClasses = 100x10
I am using a neural network with one hidden layer of size 300 when fully training. Another question I have is whether I should use a softmax output function for calculating the error... which I think not. But I am kind of a newbie to all of this, and the obvious might seem strange to me.
(NOTE: I know the code is ugly, but this is my first Python/Numpy code done under pressure, bear with me)
Here are back_prop and the activation functions:
def sigmoid(z):
    return np.true_divide(1, 1 + np.exp(-z))

# not calculated really - this is the fake version to make it faster.
def sigmoid_prime(a):
    return (a)*(1 - a)

def _back_prop(self, W, X, labels, f=sigmoid, fprime=sigmoid_prime, lam=0.001):
    """
    Calculate the partial derivatives of the cost function using backpropagation.
    """
    # Weights for the first layer and the hidden layer
    Wl1, bl1, Wl2, bl2 = self._extract_weights(W)

    # get the forward prop value
    layers_outputs = self._forward_prop(W, X, f)

    # from a number make a binary vector, for mnist 1x10 with all 0 but the number.
    y = self.make_1_of_c_encoding(labels)
    num_samples = X.shape[0]  # layers_outputs[-1].shape[0]

    # Dot product returns NumSamples (N) x Outputs (No Classes)
    # Y is N x No Classes
    big_delta = np.zeros(Wl2.size + bl2.size + Wl1.size + bl1.size)
    big_delta_wl1, big_delta_bl1, big_delta_wl2, big_delta_bl2 = self._extract_weights(big_delta)

    # calculate the gradient for each training sample in the batch and accumulate it
    for i, x in enumerate(X):
        # Error with respect to the output
        dE_dy = layers_outputs[-1][i, :] - y[i, :]
        # bias hidden layer
        big_delta_bl2 += dE_dy
        # get the error for the hidden layer
        dE_dz_out = dE_dy * fprime(layers_outputs[-1][i, :])
        # and for the input layer
        dE_dhl = dE_dy.dot(Wl2.T)
        # bias input layer
        big_delta_bl1 += dE_dhl
        small_delta_hl = dE_dhl*fprime(layers_outputs[-2][i, :])
        # here calculate the gradient for the weights in the hidden and first layer
        big_delta_wl2 += np.outer(layers_outputs[-2][i, :], dE_dz_out)
        big_delta_wl1 += np.outer(x, small_delta_hl)

    # divide by number of samples in the batch (should be done here?)
    big_delta_wl2 = np.true_divide(big_delta_wl2, num_samples) + lam*Wl2*2
    big_delta_bl2 = np.true_divide(big_delta_bl2, num_samples)
    big_delta_wl1 = np.true_divide(big_delta_wl1, num_samples) + lam*Wl1*2
    big_delta_bl1 = np.true_divide(big_delta_bl1, num_samples)

    # return
    return np.concatenate([big_delta_wl1.ravel(),
                           big_delta_bl1,
                           big_delta_wl2.ravel(),
                           big_delta_bl2.reshape(big_delta_bl2.size)])
Now the feed-forward pass:
def _forward_prop(self, W, X, transfer_func=sigmoid):
    """
    Return the outputs of the layers of the net; the last entry
    is a NumSamples (N) x Outputs (No Classes) array.
    """
    # Hidden layer DxHLS
    weights_L1, bias_L1, weights_L2, bias_L2 = self._extract_weights(W)

    # Output layer HLSxOUT
    # A_2 = N x HLS
    A_2 = transfer_func(np.dot(X, weights_L1) + bias_L1)

    # A_3 = N x Outputs
    A_3 = transfer_func(np.dot(A_2, weights_L2) + bias_L2)

    # output layer
    return [A_2, A_3]
And the cost function for the gradient checking:
def cost_function(self, W, X, labels, reg=0.001):
    """
    reg: regularization term
    No weight decay term - let's leave it for later
    """
    outputs = self._forward_prop(W, X, sigmoid)[-1]  # take the last layer out
    sample_size = X.shape[0]
    y = self.make_1_of_c_encoding(labels)

    e1 = np.sum((outputs - y)**2, axis=1)*0.5
    #error = e1.sum(axis=1)
    error = e1.sum()/sample_size + 0.5*reg*(np.square(W)).sum()

    return error
What kind of results are you getting when you run gradient checking? Often times you can tease out the nature of the implementation error by looking at the output of your gradient vs the output produced by gradient checking.
Furthermore, square error is usually a poor choice for a classification task such as MNIST and I would suggest using either a simple sigmoid top-layer or a softmax. With sigmoid the cross entropy function you want to use is:
L(h,Y) = -Y*log(h) - (1-Y)*log(1-h)
For a softmax
L(h,Y) = -sum(Y*log(h))
where Y is the target given as a 1x10 vector and h is your predicted value, but easily extends to arbitrary batch sizes.
In both cases the top-layer delta simply becomes:
delta = h - Y
And the top-layer gradient becomes:
grad = dot(delta, A_in)
Where A_in is the input into the top layer from the previous layer.
While I am having some trouble getting my head around your backprop routine, I suspect from your code that the error in gradient is due to the fact that you are not calculating the top-level dE/dw_l2 correctly when using square error, along with computing fprime on the incorrect input.
When using square error the top layer delta should be:
delta = (h - Y) * fprime(Z_l2)
Here Z_l2 is the input into your transfer function for layer 2. Similarly when computing fprime for the lower layers, you want to use the input to your transfer function (i.e. dot(X,weights_L1) + bias_L1)
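For concreteness, here is a small NumPy sketch of the two top-layer variants, reusing the names from your _forward_prop/_back_prop code (A_2, A_3, y, weights_L2, bias_L2 and num_samples are assumed to be defined as above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv_from_z(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# cross-entropy with a sigmoid (or softmax) top layer: the derivative term cancels
delta_ce = A_3 - y                                   # h - Y
grad_wl2_ce = np.dot(A_2.T, delta_ce) / num_samples  # dot(delta, A_in), batched

# square error: keep the derivative term, evaluated on the layer's *input* Z_l2
Z_l2 = np.dot(A_2, weights_L2) + bias_L2
delta_se = (A_3 - y) * sigmoid_deriv_from_z(Z_l2)
grad_wl2_se = np.dot(A_2.T, delta_se) / num_samples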
Hope that helps.
EDIT:
As some added justification for using cross entropy error over square error I would suggest looking up Geoffrey Hinton's lectures on linear classification methods:
www.cs.toronto.edu/~hinton/csc2515/notes/lec3.ppt
EDIT2:
I ran some tests locally with my implementation of neural nets on the MNIST dataset with different parameters and 1 hidden layer using RMSPROP. Here are the results:
Test1
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.001
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 6.1%
Test Error: 6.9%
Test2
Epochs: 30
Hidden Size: 300
Learn Rate: 0.001
Lambda: 0.000002
Train Method: RMSPROP with decrements=0.5 and increments=1.3
Train Error: 4.5%
Test Error: 5.7%
It already appears that if you decrease your lambda parameter by a couple orders of magnitude you should end up with better performance.