This is an example of using the Elman recurrent neural network from the Neurolab Python library:
import neurolab as nl
import numpy as np
# Create train samples
i1 = np.sin(np.arange(0, 20))
i2 = np.sin(np.arange(0, 20)) * 2
t1 = np.ones([1, 20])
t2 = np.ones([1, 20]) * 2
input = np.array([i1, i2, i1, i2]).reshape(20 * 4, 1)
target = np.array([t1, t2, t1, t2]).reshape(20 * 4, 1)
# Create network with 2 layers
net = nl.net.newelm([[-2, 2]], [10, 1], [nl.trans.TanSig(), nl.trans.PureLin()])
# Set initialized functions and init
net.layers[0].initf = nl.init.InitRand([-0.1, 0.1], 'wb')
net.layers[1].initf= nl.init.InitRand([-0.1, 0.1], 'wb')
net.init()
# Train network
error = net.train(input, target, epochs=500, show=100, goal=0.01)
# Simulate network
output = net.sim(input)
# Plot result
import pylab as pl
pl.subplot(211)
pl.plot(error)
pl.xlabel('Epoch number')
pl.ylabel('Train error (default MSE)')
pl.subplot(212)
pl.plot(target.reshape(80))
pl.plot(output.reshape(80))
pl.legend(['train target', 'net output'])
pl.show()
In this example, two length-20 input signals are concatenated into one array, and the corresponding targets are concatenated the same way. The network is then trained on these merged arrays.
First of all, it doesn't seem to match the schema that I got from here:
My main question is:
I have to train the network with arbitrary-length inputs and outputs, like these:
Arbitrary length inputs to fixed length outputs
Fixed length inputs to arbitrary length outputs
Arbitrary length inputs to arbitrary length outputs
At this point you may be thinking: "Your answer is long short-term memory networks."
I know that, but Neurolab is easy to use thanks to its good features; in particular, it is exceptionally Pythonic. So I'm insisting on using the Neurolab library for my problem. But if you can suggest another library like Neurolab with better LSTM functionality, I will accept that.
Finally: how can I rearrange this example for arbitrary-length inputs and outputs?
I don't have the best understanding of RNNs and LSTMs, so please be explanatory.
Looking at this question of mine today, after a long time, I can see it was asked by someone who lacked an understanding of neural networks.
Matrix multiplication is the basic math at the heart of neural networks. You cannot simply change the shape of the input matrix, because that changes the shape of the product and breaks consistency across the dataset.
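As a minimal illustration (my own sketch, not from the original post), here is what happens when a weight matrix fitted for 3-feature inputs is fed a 4-feature input:

import numpy as np

W = np.random.random((3, 1))       # weights shaped for 3 input features

x_ok = np.array([[0, 1, 1]])       # shape (1, 3)
print(np.dot(x_ok, W).shape)       # (1, 1) -- shapes align

x_bad = np.array([[0, 1, 1, 0]])   # shape (1, 4)
np.dot(x_bad, W)                   # ValueError: shapes (1,4) and (3,1) not aligned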
Neural networks are always trained with fixed-length inputs and outputs. Here is a very simple neural network implementation that uses nothing but NumPy's dot product to feed forward:
import numpy as np

# sigmoid function (returns its derivative when deriv=True)
def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# input dataset
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])

# output dataset
y = np.array([[0, 0, 1, 1]]).T

# seed random numbers to make the calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
syn0 = 2 * np.random.random((3, 1)) - 1

for iter in range(10000):
    # forward propagation
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))

    # how much did we miss?
    l1_error = y - l1

    # multiply how much we missed by the
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * nonlin(l1, True)

    # update weights
    syn0 += np.dot(l0.T, l1_delta)

print("Output After Training:")
print(l1)
credit: http://iamtrask.github.io/2015/07/12/basic-python-network/
Related
For my future use, I wanted to test a multivariate multilayer perceptron.
In order to test it, I made a simple python program.
Here's the code.
import tensorflow as tf
import pandas as pd
import numpy as np
import random

input = []
result = []

for i in range(0, 10000):
    x = random.random() * 100
    y = random.random() * 100
    input.append([x, y])
    result.append(x * y)

input = np.array(input, dtype=float)
result = np.array(result, dtype=float)

activation_func = "relu"
unit_count = 256

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(1, input_dim=2),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(unit_count, activation=activation_func),
    tf.keras.layers.Dense(1)])

model.compile(optimizer="adam", loss="mse")
model.fit(input, result, epochs=10)

predict_input = np.array([[7, 3], [5, 4], [8, 8]])
print(model.predict(predict_input))
I tried this code, and the result was not good: the loss value seems to stop decreasing at some point.
I also tried smaller x and y values, which made the model inaccurate for bigger numbers.
I changed the activation function, added more dense layers, and increased the number of units, but it didn't get better.
Neural networks are not able to adapt themselves (without additional training) to a different domain; this means you should train on a domain and run inference on that same domain.
With images, we often just rescale the inputs from [0, 255] to [-1, 1] and let the network learn from values in this range (and during inference we always rescale the input values into the [-1, 1] range).
To solve your task, you should bring the problem into a restricted domain.
In practice, if you're only interested in training a model to multiply positive numbers, you can squash them into the [0, 1] range, since the multiplication of values in this range always gives an output value in the same range.
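As a quick arithmetic check of why the squashing works (a sketch; max_value here matches the scale used in the modified code below):

# if a, b <= max_value, then (a / max_value) * (b / max_value) is in [0, 1]
a, b, max_value = 7.0, 3.0, 10.0
squashed = (a / max_value) * (b / max_value)   # 0.21, inside [0, 1]
# the true product is recovered by scaling back with max_value**2
print(squashed * max_value ** 2)               # 21.0, the original product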
I slightly modified your code and added some comments in the source code.
import random

import numpy as np
import pandas as pd
import tensorflow as tf

input = []
result = []

# We want to train our network to work in a fixed domain:
# the [0,1] range.
# Let's also increase the training set -> more data is always better
for i in range(0, 100000):
    x = random.random()
    y = random.random()
    input.append([x, y])
    result.append(x * y)

input = np.array(input, dtype=float)
result = np.array(result, dtype=float)

activation_func = "relu"
unit_count = 256

# no need for tons of layers
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.Dense(unit_count, input_dim=2, activation=activation_func),
        tf.keras.layers.Dense(unit_count, activation=activation_func),
        tf.keras.layers.Dense(1, use_bias=False),
    ]
)

model.compile(optimizer="adam", loss="mse")
model.fit(input, result, epochs=10)

# Bring our input values into the [0,1] range
max_value = 10
predict_input = np.array([[7, 3], [5, 4], [8, 8]]) / max_value
print(predict_input)

# Back to the original domain.
# Multiplying by max_value**2 is required because dividing each factor
# by max_value divides their product by max_value**2.
print(model.predict(predict_input) * max_value ** 2)
Example output:
[[0.7 0.3]
[0.5 0.4]
[0.8 0.8]]
[[21.04468 ]
[20.028284]
[64.05521 ]]
Can we activate the outputs of a NN to gain insight into how the neurons are connected to input features?
If I take a basic NN example from the PyTorch tutorials, here is an example that trains on (x, y) pairs:
import torch

N, D_in, H, D_out = 64, 1000, 100, 10

x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)

loss_fn = torch.nn.MSELoss(reduction='sum')
learning_rate = 1e-4

for t in range(500):
    y_pred = model(x)
    loss = loss_fn(y_pred, y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
After I've finished training the network to predict y from x inputs, is it possible to reverse the trained NN so that it can predict x from y inputs?
I don't expect the result to match the original inputs that produced the y outputs; rather, I expect to see which features the model activates on to relate x and y.
If it is possible, then how do I rearrange the Sequential model without breaking all the weights and connections?
It is possible but only for very special cases. For a feed-forward network (Sequential) each of the layers needs to be reversible; that means the following arguments apply to each layer separately. The transformation associated with one layer is y = activation(W*x + b) where W is the weight matrix and b the bias vector. In order to solve for x we need to perform the following steps:
Reverse the activation; not all activation functions have an inverse, though. For example, the ReLU function does not have an inverse on (-inf, 0). If we use tanh, on the other hand, we can use its inverse, which is 0.5 * log((1 + x) / (1 - x)).
Solve W*x = inverse_activation(y) - b for x; for a unique solution to exist, W must be square (equal numbers of rows and columns) and det(W) must be non-zero. We can control the former by choosing a specific network architecture, while the latter depends on the training process.
So for a neural network to be reversible it must have a very specific architecture: all layers must have the same number of input and output neurons (i.e. square weight matrices) and the activation functions all need to be invertible.
Code: Using PyTorch we will have to do the inversion of the network manually, both in terms of solving the system of linear equations as well as finding the inverse activation function. Consider the following example of a 1-layer neural network (since the steps apply to each layer separately extending this to more than 1 layer is trivial):
import torch

N = 10  # number of samples
n = 3   # number of neurons per layer

x = torch.randn(N, n)
model = torch.nn.Sequential(
    torch.nn.Linear(n, n), torch.nn.Tanh()
)
y = model(x)

z = y  # use 'z' for the reverse result, start with the model's output 'y'
for step in list(model.children())[::-1]:
    if isinstance(step, torch.nn.Linear):
        z = z - step.bias[None, ...]
        z = z[..., None]  # 'torch.solve' requires N column vectors (i.e. shape (N, n, 1))
        z = torch.solve(z, step.weight)[0]
        z = torch.squeeze(z)  # remove the extra dimension that we've added for 'torch.solve'
    elif isinstance(step, torch.nn.Tanh):
        z = 0.5 * torch.log((1 + z) / (1 - z))

print('Agreement between x and z: ', torch.dist(x, z))
If I've understood correctly, there are two questions here:
Is it possible to determine what features in the input have activated neurons?
If so, is it possible to use this information to generate samples from p(x|y)?
Regarding 1, a basic way to determine if a neuron is sensitive to an input feature x_i is to compute the gradient of this neuron's output w.r.t x_i. A high gradient will indicate sensitivity to a particular input element. There is a rich literature on the subject, for example, you can have a look at guided backpropagation or at GradCam (the latter is about classification with convnets, but it does contain useful ideas).
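As a minimal sketch of this gradient-based sensitivity check (reusing model and D_in from the question; everything else here is illustrative):

import torch

# assume 'model' and 'D_in' come from the training snippet in the question
x = torch.randn(1, D_in, requires_grad=True)
out = model(x)

# gradient of output neuron 0 with respect to every input feature
out[0, 0].backward()
sensitivity = x.grad.abs().squeeze()

# indices of the input features this neuron is most sensitive to
print(sensitivity.topk(10).indices)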
As for 2, I don't think that your approach to "reversing the problem" is correct. The problem is that your network is discriminative and what it outputs can be seen as argmax_y p(y|x). Note that this is a point-wise estimation, not a full modeling of the distribution. However, the inverse problem that you're interested in seems to be sampling from
p(x|y)=constant*p(y|x)p(x).
You don't know how to sample from p(y|x) and you don't know anything about p(x). Even if you use a method to discover correlations between the neurons and specific input features, you have only discovered which features were most important to the network's prediction, but depending on the nature of y this might be insufficient. Consider a toy example where your inputs x are 2D points distributed according to some distribution in R^2 and where the output y is binary, such that any (a, b) in R^2 is classified as 1 if a < 1 and as 0 if a > 1. Then a discriminative network could learn the vertical line x = 1 as its decision boundary. Inspecting correlations between neurons and input features will reveal that only the first coordinate was useful in this prediction, but this information is not sufficient for sampling from the full 2D distribution of inputs.
I think that Variational autoencoders could be what you're looking for.
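For reference, a skeletal VAE in PyTorch might look like this (a generic sketch with illustrative dimensions, not tied to the question's model):

import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, d_in=784, d_hidden=400, d_latent=20):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.mu = nn.Linear(d_hidden, d_latent)
        self.logvar = nn.Linear(d_hidden, d_latent)
        self.dec1 = nn.Linear(d_latent, d_hidden)
        self.dec2 = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        # reparameterization trick: sample z from N(mu, sigma^2)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        x_rec = torch.sigmoid(self.dec2(torch.relu(self.dec1(z))))
        return x_rec, mu, logvar

def vae_loss(x, x_rec, mu, logvar):
    # reconstruction term plus KL divergence to the standard normal prior
    rec = nn.functional.binary_cross_entropy(x_rec, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

# after training, sampling new x's is just decoding random latents:
# z = torch.randn(16, 20)
# samples = torch.sigmoid(vae.dec2(torch.relu(vae.dec1(z))))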
I am trying to understand why this sample neural network with NumPy does not learn non-linear data. Even a simple NN is supposed to be able to learn non-linear data, right?
I want my NN to learn that if the input is 1, the output is 0; if the input is greater than 1 and less than 4, the output is 1; and if the value is greater than 4, the output is 0.
I have tried many sample NN codes with NumPy found via Google, and I keep running into this problem.
The code below does not learn, but it learns well with input [2,2,0,0] and desired output [1,1,0,0].
import numpy as np

# sigmoid function (returns its derivative when deriv=True)
def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# input dataset
X = np.array([[1],
              [2],
              [3],
              [4]])

# output dataset
y = np.array([[0, 1, 1, 0]]).T

# seed random numbers to make the calculation
# deterministic (just a good practice)
np.random.seed(1)

# initialize weights randomly with mean 0
syn0 = 2 * np.random.random((1, 1)) - 1

for iter in range(10000):
    # forward propagation
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))

    # how much did we miss?
    l1_error = y - l1

    # multiply how much we missed by the
    # slope of the sigmoid at the values in l1
    l1_delta = l1_error * nonlin(l1, True)

    # update weights
    syn0 += np.dot(l0.T, l1_delta)

print("Output After Training:")
print(l1)
That's because your model is essentially a linear model. You need to add at least one hidden layer if you want to fit nonlinear data, as in the sketch below.
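As a sketch of that fix, building on the posted code (the 4-unit hidden layer, the input scaling, and the appended bias column are my own assumptions):

import numpy as np

def nonlin(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

# scale inputs into [0, 1] and append a constant column as a crude bias
X = np.array([[1], [2], [3], [4]]) / 4.0
X = np.c_[X, np.ones(4)]
y = np.array([[0, 1, 1, 0]]).T

np.random.seed(1)
syn0 = 2 * np.random.random((2, 4)) - 1   # input -> hidden (4 units)
syn1 = 2 * np.random.random((4, 1)) - 1   # hidden -> output

for _ in range(60000):
    l0 = X
    l1 = nonlin(np.dot(l0, syn0))          # hidden activations
    l2 = nonlin(np.dot(l1, syn1))          # output

    l2_delta = (y - l2) * nonlin(l2, True)
    l1_delta = l2_delta.dot(syn1.T) * nonlin(l1, True)  # backpropagated error

    syn1 += l1.T.dot(l2_delta)
    syn0 += l0.T.dot(l1_delta)

print(l2)  # should now approach [0, 1, 1, 0]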
As already said, you have built a simple linear logistic regression model.
The sigmoid in your NN is only used to get the prediction of your model and not to actually non-linearly train the NN.
A good start at learning neural networks is this: http://www.wildml.com/2015/09/implementing-a-neural-network-from-scratch/
I am trying to implement XOR in a neural network with a topology of 2 inputs, 1 element in the hidden layer, and 1 output. But the learning rate is really bad (0.5). I think it is because I am missing a connection between the inputs and the outputs, but I am not really sure how to do it. I have already added the bias connection so that the learning is better. I am only using NumPy.
import numpy as np

# the sigmoid itself (not defined in the original snippet)
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_output_to_derivative(output):
    return output * (1 - output)

a = 0.1

X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

np.random.seed(1)

y = np.array([[0],
              [1],
              [1],
              [0]])

bias = np.ones(4)
X = np.c_[bias, X]

synapse_0 = 2 * np.random.random((3, 1)) - 1
synapse_1 = 2 * np.random.random((1, 1)) - 1

for j in range(600000):
    layer_0 = X
    layer_1 = sigmoid(np.dot(layer_0, synapse_0))
    layer_2 = sigmoid(np.dot(layer_1, synapse_1))

    layer_2_error = layer_2 - y
    if (j % 10000) == 0:
        print("Error after " + str(j) + " iterations: " + str(np.mean(np.abs(layer_2_error))))

    layer_2_delta = layer_2_error * sigmoid_output_to_derivative(layer_2)
    layer_1_error = layer_2_delta.dot(synapse_1.T)
    layer_1_delta = layer_1_error * sigmoid_output_to_derivative(layer_1)

    synapse_1 -= a * (layer_1.T.dot(layer_2_delta))
    synapse_0 -= a * (layer_0.T.dot(layer_1_delta))
You need to be careful with statements like
the learning rate is bad
Usually the learning rate is the step size that gradient descent takes in the negative gradient direction. So I'm not sure what you mean by a bad learning rate.
I'm also not sure I understand your code correctly, but the forward step of a neural net is basically a matrix multiplication of the hidden layer's weight matrix with the input vector. This will (if you set everything up correctly) result in a matrix whose size matches your hidden layer. Now, you simply add the bias before applying your logistic function elementwise to this matrix.
h_i = f(h_i+bias_in)
Afterwards you can do the same thing for the hidden layer times the output weights and apply its activation to get the outputs.
o_j = f(o_j+bias_h)
The backwards step is to calculate the deltas at output and hidden layer including another elementwise operation with your function
sigmoid_output_to_derivative(output)
and update both weight matrices using the gradients (here the learning rate is needed to define the step size). The gradients are simply the value of a corresponding node times its delta.
Note: The deltas are differently calculated for output and hidden nodes.
I'd advise you to keep separate variables for the biases, because modern approaches usually update them by summing up the deltas of their connected nodes times a different learning rate, and subtracting this product from the specific bias.
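Here is a minimal sketch of that scheme (the 2-unit hidden layer, the two learning rates, and the iteration count are illustrative; XOR with only 2 hidden units can need a different seed or more iterations to converge):

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
W_h = rng.standard_normal((2, 2))   # input -> hidden weights (2 hidden units)
b_h = np.zeros(2)                   # hidden-layer bias, kept separate
W_o = rng.standard_normal((2, 1))   # hidden -> output weights
b_o = np.zeros(1)                   # output bias, kept separate

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

lr, lr_bias = 0.5, 0.1              # a different learning rate for the biases
for _ in range(20000):
    h = sigmoid(X @ W_h + b_h)                  # h_i = f(h_i + bias_in)
    o = sigmoid(h @ W_o + b_o)                  # o_j = f(o_j + bias_h)

    o_delta = (o - y) * o * (1 - o)             # output-layer delta
    h_delta = (o_delta @ W_o.T) * h * (1 - h)   # hidden-layer delta

    W_o -= lr * h.T @ o_delta
    W_h -= lr * X.T @ h_delta
    b_o -= lr_bias * o_delta.sum(axis=0)        # biases: summed deltas of connected nodes
    b_h -= lr_bias * h_delta.sum(axis=0)

print(np.round(o, 2))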
Take a look at the following tutorial (it uses numpy):
http://peterroelants.github.io/posts/neural_network_implementation_part04/
I've constructed an LSTM recurrent neural net using lasagne that is loosely based on the architecture in this blog post. My input is a text file that has around 1,000,000 sentences and a vocabulary of 2,000 word tokens. Normally, when I construct networks for image recognition, my input layer will look something like the following:
l_in = nn.layers.InputLayer((32, 3, 128, 128))
(where the dimensions are batch size, channel, height and width) which is convenient because all the images are the same size so I can process them in batches. Since each instance in my LSTM network has a varying sentence length, I have an input layer that looks like the following:
l_in = nn.layers.InputLayer((None, None, 2000))
As described in the above-referenced blog post:
Masks: Because not all sequences in each minibatch will always have the same length, all recurrent layers in lasagne accept a separate mask input which has shape (batch_size, n_time_steps), which is populated such that mask[i, j] = 1 when j <= (length of sequence i) and mask[i, j] = 0 when j > (length of sequence i). When no mask is provided, it is assumed that all sequences in the minibatch are of length n_time_steps.
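For concreteness, here is a minimal sketch (plain NumPy, illustrative names) of padding variable-length sequences and building such a mask:

import numpy as np

# three one-hot-encoded sentences of different lengths (vocab size 5 here)
seqs = [np.eye(5)[[0, 3]], np.eye(5)[[1, 2, 4]], np.eye(5)[[2]]]

batch_size = len(seqs)
n_time_steps = max(len(s) for s in seqs)

X = np.zeros((batch_size, n_time_steps, 5), dtype='float32')
mask = np.zeros((batch_size, n_time_steps), dtype='float32')
for i, s in enumerate(seqs):
    X[i, :len(s)] = s        # pad with zeros after the real time steps
    mask[i, :len(s)] = 1.0   # mask[i, j] = 1 for the real time steps of sequence i

print(mask)
# [[1. 1. 0.]
#  [1. 1. 1.]
#  [1. 0. 0.]]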
My question is: Is there a way to process this type of network in mini-batches without using a mask?
Here is a simplified version of my network.
# -*- coding: utf-8 -*-
import theano
import theano.tensor as T
import lasagne as nn

softmax = nn.nonlinearities.softmax

def build_model():
    l_in = nn.layers.InputLayer((None, None, 2000))
    lstm = nn.layers.LSTMLayer(l_in, 4096, grad_clipping=5)
    rs = nn.layers.SliceLayer(lstm, 0, 0)
    dense = nn.layers.DenseLayer(rs, num_units=2000, nonlinearity=softmax)
    return l_in, dense

model = build_model()
l_in, l_out = model

all_params = nn.layers.get_all_params(l_out)
target_var = T.ivector("target_output")
output = nn.layers.get_output(l_out)
loss = T.nnet.categorical_crossentropy(output, target_var).sum()
updates = nn.updates.adagrad(loss, all_params, 0.005)
train = theano.function([l_in.input_var, target_var], loss, updates=updates)
From there I have a generator that spits out (X, y) pairs, and I compute train(X, y), updating the gradient with each iteration. What I want to do is perform N training steps and then update the parameters with the average gradient.
To do this, I tried creating a compute_gradient function:
gradient = theano.grad(loss, all_params)
compute_gradient = theano.function(
    [l_in.input_var, target_var],
    outputs=gradient
)
and then looping over several training instances to create a "batch", collecting the gradient calculations in a list:
grads = []
for _ in range(1024):
    X, y = next(train_gen)  # generator producing training data
    grads.append(compute_gradient(X, y))
This produces a list of lists:
>>> grads
[[<CudaNdarray at 0x7f83b5ff6d70>,
<CudaNdarray at 0x7f83b5ff69f0>,
<CudaNdarray at 0x7f83b5ff6270>,
<CudaNdarray at 0x7f83b5fc05f0>],
[<CudaNdarray at 0x7f83b5ff66f0>,
<CudaNdarray at 0x7f83b5ff6730>,
<CudaNdarray at 0x7f83b5ff6b70>,
<CudaNdarray at 0x7f83b5ff64f0>] ...
From here I would need to take the mean of the gradient for each layer and then update the model parameters. It is possible to do this in pieces like this, but does the gradient calculation/parameter update need to happen all in one theano function?
Thanks.
NOTE: this is a solution, but by no means do I have enough experience to verify it's the best one, and the code is just a sloppy example.
You need two theano functions. The first is the gradient one, which you seem to have already, judging from the information provided in your question.
So after computing the batched gradients, you want to immediately feed them back as input arguments into another theano function dedicated to updating the shared variables. For this you need to specify the expected batch size at the compile time of your neural network. So you could do something like this (for simplicity I will assume you have a global list variable where all your params are stored):
params = [...]  # list of params you wish to update
BATCH_SIZE = 1024  # size of the expected training batch

# placeholders for the gradient results, flattened (batch-major) so they
# can be fed into a theano function
G = [T.matrix() for i in range(BATCH_SIZE) for param in params]

# start from the gradients of the first batch element
updates = [G[i] for i in range(len(params))]

# sum the gradients for each individual param over the rest of the batch
for i in range(len(params)):
    for j in range(1, BATCH_SIZE):
        updates[i] += G[j * len(params) + i]

# make a list of tuples for theano.function's updates argument;
# an SGD-style step over the mean gradient (learning rate of 0.005 assumed)
for i in range(len(params)):
    updates[i] = (params[i], params[i] - 0.005 * updates[i] / BATCH_SIZE)

update = theano.function(G, [], updates=updates)
This way theano will take the mean of the gradients and update the params as usual.
I don't know whether you need to flatten the inputs as I did, but you probably do.
EDIT: Gathering from how you edited your question, it seems important that the batch size can vary. In that case you could add two theano functions to your existing one:
The first theano function takes a batch of size 2 of your param gradients and returns their sum. You could apply this theano function using Python's reduce() to get the sum over the whole batch of gradients.
The second theano function takes those summed param gradients and a scalar (the batch size) as input, and is hence able to update the NN params using the mean of the summed gradients.
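A rough sketch of that two-function idea (the names add_grads and apply_mean are mine; the plain SGD-style step with a 0.005 learning rate is an assumption for illustration, and like the snippet above it assumes all params are matrices):

import theano
import theano.tensor as T
from functools import reduce

# first function: sums two gradient lists, one symbolic matrix per param
g_a = [T.matrix() for p in params]
g_b = [T.matrix() for p in params]
add_grads = theano.function(g_a + g_b,
                            [a + b for a, b in zip(g_a, g_b)])

# second function: takes the summed gradients plus the batch size
# and updates the params with the mean gradient
g_sum = [T.matrix() for p in params]
batch_size = T.scalar()
apply_mean = theano.function(
    g_sum + [batch_size], [],
    updates=[(p, p - 0.005 * g / batch_size) for p, g in zip(params, g_sum)])

# usage: sum the whole batch pairwise with reduce(), then update with the mean
total = reduce(lambda a, b: add_grads(*(list(a) + list(b))), grads)
apply_mean(*(list(total) + [float(len(grads))]))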