Implementing an RNN with numpy - Python

I'm trying to implement a recurrent neural network with numpy.
My current input and output designs are as follows:
x: (sequence length, batch size, input dimension)
h: (number of layers, number of directions, batch size, hidden size)
initial weight: (number of directions, 2 * hidden size, input size + hidden size)
weight: (number of layers - 1, number of directions, hidden size, directions * hidden size + hidden size)
bias: (number of layers, number of directions, hidden size)
I have looked up the PyTorch RNN API as a reference (https://pytorch.org/docs/stable/nn.html?highlight=rnn#torch.nn.RNN), but have changed it slightly to take the initial weights as input (the output shapes are supposedly the same as in PyTorch).
While the code runs, I cannot tell whether it is behaving correctly, since I am feeding it randomly generated numbers as input.
In particular, I am not certain whether my input shapes are designed correctly.
Could any expert give me some guidance?
def rnn(xs, h, w0, w=None, b=None, num_layers=2, nonlinearity='tanh',
        dropout=0.0, bidirectional=False, training=True):
    num_directions = 2 if bidirectional else 1
    batch_size = xs.shape[1]
    input_size = xs.shape[2]
    hidden_size = h.shape[3]
    hn = []
    y = [None]*len(xs)
    for l in range(num_layers):
        for d in range(num_directions):
            if l == 0 and d == 0:
                wi = w0[d, :hidden_size, :input_size].T
                wh = w0[d, hidden_size:, input_size:].T
                wi = np.reshape(wi, (1,)+wi.shape)
                wh = np.reshape(wh, (1,)+wh.shape)
            else:
                wi = w[max(l-1, 0), d, :, :hidden_size].T
                wh = w[max(l-1, 0), d, :, hidden_size:].T
            for i, x in enumerate(xs):
                if l == 0 and d == 0:
                    ht = np.tanh(np.dot(x, wi) + np.dot(h[l, d], wh) + b[l, d][np.newaxis])
                    ht = np.reshape(ht, (batch_size, hidden_size))  # otherwise, shape is (bs, 1, hs)
                else:
                    ht = np.tanh(np.dot(y[i], wi) + np.dot(h[l, d], wh) + b[l, d][np.newaxis])
                y[i] = ht
            hn.append(ht)
    y = np.asarray(y)
    y = np.reshape(y, y.shape+(1,))
    return np.asarray(y), np.asarray(hn)

Regarding the shapes, they probably make sense if that's how PyTorch does it, but the TensorFlow convention is a bit more intuitive - (batch_size, seq_length, input_size) - batch_size sequences of length seq_length where each element has size input_size. Both approaches can work, so I guess it's a matter of preference.
To see whether your rnn is behaving appropriately, I'd just print the hidden state at each time step, run it on some small random data (e.g. 5 vectors, 3 elements each) and compare the results with your manual calculations.
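For example, a minimal sketch of such a check, computing the single-layer recurrence h_t = tanh(x_t @ Wi + h_{t-1} @ Wh + b) by hand on tiny random data (all names and sizes here are made up for illustration):

import numpy as np

rng = np.random.RandomState(0)
seq_len, batch, in_dim, hid = 5, 1, 3, 4
xs = rng.randn(seq_len, batch, in_dim)
Wi = rng.randn(in_dim, hid)   # input-to-hidden weights
Wh = rng.randn(hid, hid)      # hidden-to-hidden weights
b  = rng.randn(hid)

h = np.zeros((batch, hid))
for t in range(seq_len):
    h = np.tanh(xs[t] @ Wi + h @ Wh + b)
    print(f"t={t}:", h)  # compare against the hidden states your rnn produces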
Looking at your code, I'm unsure if it does what it's supposed to, but instead of doing this on your own based on an existing API, I'd recommend you read and try to replicate this excellent tutorial from WildML (part 2 has a pure numpy implementation).

Related

Global deep neural network with Tensorflow.keras in Python

I am working on implementing an algorithm which approximates the solution of a partial differential equation. The main idea is that I start at time 0 with a guess of the solution u[0] and its gradient z[0], and then use a recursive formula to calculate the approximate solution forward in time up to the last time point. The formula looks like this:
u[i+1] = u[i] + f(t[i],x[i],u[i],z[i])*dt + z[i]*dW[i]
where the function f, the time discretization t, the time step dt, and the increments of a Brownian motion dW are given. The gradient z[i] at time point i is approximated by a deep neural network with input x[i], which I have already implemented with tf.keras using two hidden dense layers. These networks perform quite well. So far, I have N (the number of time points) independent neural networks, one approximating z[i] for each time point.
My task is to form a global neural network with input (x, W), where (u[0], z[0]) are given to this network as network parameters, so that the network can then optimize these parameters by minimizing the expected quadratic loss between the output/approximation of u[N] and the given terminal condition of the partial differential equation, g(x). u[0] will then be the solution of the PDE. So while my gradient-approximating networks have 2 hidden layers each, the global network should have 2*(N-1) layers in total.
My neural networks for the gradients look like this:
# Input dimension
d = 1
# Output dimension
d_1 = 1
# Number of neurons
m = d + 10
# Batch size
batch_size = 32
# Training data
x_tr = some_simulation()
z_tr = calculated_given(x_tr)
# Test data
x_te = some_simulation()
z_te = calculated_given(x_te)
model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(d,), dtype=tf.float32))
model.add(tf.keras.layers.Dense(m, activation=tf.nn.tanh))
model.add(tf.keras.layers.Dense(m, activation=tf.nn.tanh))
model.add(tf.keras.layers.Dense(d_1, activation=tf.keras.activations.linear))
model.compile(optimizer='adam',
              loss='MeanSquaredError',
              metrics=[])
model.fit(x_tr, z_tr, batch_size=batch_size, epochs=10)
val_loss = model.evaluate(x_te, z_te)
print(val_loss)
So I have trained N of them, and saved each as a file using
model.save(path_to_model)
So, given the approximations of the gradients, I now want to stack all the subnetworks together to form a global deep neural network based on the recursive formula above, which takes only the N-dimensional vectors x and W as input data, gives the final output u[N], and uses (u[0], z[0]) as parameters. But I have been trying to wrap my head around this for two days, so maybe someone can give me a push in the right direction as to how such a global neural network should be implemented in Python using tensorflow.keras?
I assume that you'll pass the tuple (x, dW, t) as input to your model, since t is also indexed. Furthermore, you can always create dW from W using np.diff. I also assume that u0 and z0 are scalars (common to all batches).
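For instance, a minimal sketch of that conversion (assuming W_tr holds the Brownian paths with shape (batch, N+1)):

import numpy as np
dW_tr = np.diff(W_tr, axis=1)  # dW[i] = W[i+1] - W[i], shape (batch, N)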
With all of that in mind, you can subclass the base Model and override its call() as follows
class GlobalModel(tf.keras.models.Model):
    def __init__(self, u0, z0, dt, subnet_list, **kwargs):
        super().__init__(**kwargs)
        self.u0 = tf.Variable(u0, trainable=True, dtype=tf.float32)
        self.z0 = tf.Variable(z0, trainable=True, dtype=tf.float32)
        self.dt = tf.constant(dt, dtype=tf.float32)
        self.subnet_list = subnet_list
        # Freeze the pre-trained subnets
        for subnet in subnet_list:
            subnet.trainable = False

    def f(self, t, x, u, z):
        # code of your function f() goes here
        ...

    def step_update(self, t, x, u, z, dW):
        # one step of the recursion: u[i+1] = u[i] + f(t[i], x[i], u[i], z[i])*dt + z[i]*dW[i]
        return u + self.f(t, x, u, z) * self.dt + z * dW

    def call(self, inputs, training=None):
        x, dW, t = inputs
        # First step, using the trainable parameters u0 and z0
        x_i = tf.gather(x, 0, axis=1)
        dW_i = tf.gather(dW, 0, axis=1)
        t_i = tf.gather(t, 0, axis=1)
        u_i = self.step_update(t_i, x_i, self.u0, self.z0, dW_i)
        # Subsequent steps, using the frozen subnets to predict z_i
        for i, subnet in enumerate(self.subnet_list):
            x_i = tf.gather(x, i+1, axis=1)
            dW_i = tf.gather(dW, i+1, axis=1)
            t_i = tf.gather(t, i+1, axis=1)
            z_i = subnet(x_i, training=False)
            u_i = self.step_update(t_i, x_i, u_i, z_i, dW_i)
        return u_i
You initialize this model by
global_model = GlobalModel(init_u0, init_z0, dt, subnet_list)
where subnet_list is a list of your pre-trained subnets, ordered by time index. That is, the subnet responsible for predicting z_i should be at index i-1 in this list.
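For example, a sketch of how that list could be assembled from the models you saved earlier (the file names and the value of N here are assumptions):

# Load the pre-trained subnets in time order; the subnet for z_i sits at index i-1
subnet_list = [tf.keras.models.load_model(f"subnet_{i}") for i in range(1, N)]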
After compiling, you call fit() on the model by
global_model.fit(x=(x_tr, dW_tr, t_tr), y=y_tr, batch_size=batch_size, epochs=epochs)
where y_tr is your target.
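For completeness, a minimal sketch of the compile step (optimizer and loss are assumptions; the mean squared error corresponds to the expected quadratic loss against the terminal condition g(x), so y_tr would be g evaluated at the terminal x values):

global_model.compile(optimizer='adam', loss='MeanSquaredError')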

Tricky forward pass pytorch

I have a precipitation map timeseries dataset with input shape (None, seq_length=7, c=75, w=112, h=112) and output shape (None, lead_times=60, c=51, w=28, h=28). The model (Conv downsampler + ConvGRU + Axial Attention) predicts precipitation in the central 28x28 region with 51 categorical precipitation intervals and is conditioned on 60 different lead times (5, 10, ..., 300 minutes).
Right now my forward pass looks like this:
def forward(self, imgs):
    """It takes a rank 5 tensor
    - imgs [bs, seq_len, channels, h, w]
    """
    # Compute all timesteps, probably can be parallelized
    res = []
    for i in range(self.forecast_steps):
        x_i = self.encode_timestep(imgs, i)
        out = self.head(x_i)
        res.append(out)
    res = torch.stack(res, dim=1)
    return res
Here imgs is the input tensor without the lead-time encoding, so only 15 channels. imgs is then one-hot encoded for each respective lead time, and the output is the entire predicted time series (5-300 min). However, this leads to severe memory issues even with batch_size = 1, so I want the forward loop to do only one random lead time at a time. I am training this with a pytorch-lightning module for easier parallelization, so I don't have much control over the training loop.
The issue is that the effective batch size with this training loop is 60*batch_size. The paper solves this by doing only one random lead time per sample, which now makes sense to me. That solves the memory issue by allowing the effective batch size to be as small as 1. How can I pass a random integer (the lead time) to the forward pass and couple it with the correct Y when pytorch-lightning computes the loss?
I want something like:
y_hat = forward(self, X[n], lead_time=random)
...
loss(y_hat - Y[n, lead_time, :, :])
My code is available at https://github.com/ValterFallenius/metnet.
I figured out how to fix it. Once I explained the problem to someone else, I realized how simple the solution was...
def forward(self, imgs, lead_time):
    """It takes a rank 5 tensor and a lead time
    - imgs [bs, seq_len, channels, h, w]
    - lead_time: a random int between 0 and self.forecast_steps
    """
    # Encode only the requested lead time and predict it directly;
    # no loop over forecast_steps, so no stacking is needed
    x_i = self.encode_timestep(imgs, lead_time)
    out = self.head(x_i)
    return out
The trick was to simply add the lead_time variable to the training_step method:
def training_step(self, batch, batch_idx):
    x, y = batch
    lead_time = np.random.randint(0, self.forecast_steps)
    y_hat = self(x.float(), lead_time)
    loss = F.mse_loss(y_hat, y[:, lead_time])
    pbar = {"training_loss": loss}
    return {"loss": loss, "progress_bar": pbar}

Understanding the dimensions of states returned by rnn.BasicLSTMCell

I am watching this tutorial where he writes TensorFlow code for MNIST classification.
Here is the RNN model:
batch_size = 128
chunk_size = 28
n_chunks = 28
rnn_size = 128

def recurrent_neural_network(x):
    layer = {'weights': tf.Variable(tf.random_normal([rnn_size, n_classes])),
             'biases': tf.Variable(tf.random_normal([n_classes]))}
    x = tf.transpose(x, [1, 0, 2])
    x = tf.reshape(x, [-1, chunk_size])
    x = tf.split(x, n_chunks, 0)
    lstm_cell = rnn.BasicLSTMCell(rnn_size, state_is_tuple=True)
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)
    output = tf.matmul(outputs[-1], layer['weights']) + layer['biases']
    return output, outputs, states
After this I print out the dimensions of outputs and states respectively, like this:
print("\n", len(outputs),"\n",len(outputs[0]),"\n",len(outputs[0][0]))
print("\n", len(states),"\n",len(states[0]),"\n",len(states[0][0]))
I get the output of print statements as:
28
128
128
2
128
128
I understand that the outputs shape is 28x128x128 (time_steps x batch_size x rnn_size),
but I don't understand the shape of "states"?
Check this very good blog post about how LSTMs work: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTMs have a hidden state but also a memory cell state; hence the size of the first dimension of your states variable (2). The sizes of the following dimensions are batch_size and then rnn_size.
The states tuple contains 2 matrices, the cell state and the hidden state, which are the c and h in the standard LSTM update equations.
Each LSTM cell has two states: the 0th is the long-term (cell) state, and the 1st is the short-term (hidden) state.
BasicRNNCell always has one state, i.e. the short-term state.
The rest you already explained:
128: the batch size, i.e. one state per input sample.
128: the number of neurons, i.e. rnn_size in your case.
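To make this concrete, a small sketch of unpacking the returned state (variable names from the code above):

# With state_is_tuple=True, states is an LSTMStateTuple(c, h)
c, h = states   # c: long-term (cell) state, h: short-term (hidden) state
# c.shape == h.shape == (batch_size, rnn_size), i.e. (128, 128) here,
# and h is the same tensor as the last output, outputs[-1]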

First Neural Network, (MLP), from Scratch, Python -- Questions

I understand how a neural network with backpropagation is supposed to work. I know how sklearn's MLPClassifier and its fit function work. I am creating my own because I'd like to know the details better. I will first show my code (with comments) and then discuss my problems.
import numpy as np
import scipy as sp
import sklearn as ML

# z: the linear combination of the previous layer
#
# returns the activation for the node
#
def sigmoid(z):
    a = 1 / (1 + np.exp(-z))
    return a

# z: the contribution of a layer
#
# returns the derivative of the sigmoid evaluated at z
#
def sig_grad(z):
    d = (1 - sigmoid(z))*sigmoid(z)
    return d

# input: the data we want to train the network with
# hidden_layers: the number of nodes in the hidden layers
# num_layers: how many hidden layers between the input layer and the output layer
# num_output: how many outputs there are... this becomes relevant when we input many features.
#
# returns the activations determined
# and the linear combinations of the previous layer's nodes for each layer
#
def feedforward(input, hidden_layers, num_layers, num_output, thresh, weights):
    # initialize the vector for inputs AND threshold values
    X = np.hstack([thresh[0], input])
    # initialize the activations list
    A = []
    # initialize the linear combos for each layer
    Z = []
    w = list(weights)
    # place ones in the first row of each layer of weights for the threshold
    w[0] = np.vstack([np.ones([1, hidden_layers]), w[0]])
    for i in range(1, num_layers):
        w[i] = np.vstack([np.ones([1, hidden_layers]), weights[i]])
    w[-1] = np.vstack([np.ones([1, num_output]), w[-1]])
    # the first layer of weights is initialized outside the function
    # cycle through the hidden layers
    for i in range(1, num_layers+1):
        Z.append(np.dot(X, w[i-1])); S = sigmoid(Z[i-1]); A.append(S); X = np.hstack([thresh[i], A[i-1]])
    # find the output/last layer activations
    Z.append(np.dot(X, w[-1])); S = sigmoid(Z[-1]); A.append(S)
    return A, Z

#
# truth: what we know the output should be
# activations: the activations determined at each node by the sigmoid
#              function in the previous feedforward pass
# combos: the linear combinations at each layer in the prev. ff pass
# num_layers: the number of hidden layers
#
# error: the errors determined at each layer; will be needed for gradient descent
#
def backprop(input, truth, activations, combos, num_layers, weights):
    # initialize an array of errors for each hidden layer and the output layer
    error = [0 for x in range(0, num_layers+1)]
    # initialize the lists containing the gradients w.r.t. weights and threshold
    derivW = []; derivb = []
    # set the output layer since its error is computed differently than the others
    error[num_layers] = (activations[num_layers] - truth)*sig_grad(combos[num_layers])
    # find the rate of change for weights and thresh for connections to output
    derivW.append(activations[num_layers-1]*error[num_layers]); derivb.append(np.sum(error[num_layers]))
    if(num_layers > 1):
        # find the errors for each of the hidden layers
        for i in range(num_layers - 1, 0, -1):
            error[i] = np.dot(weights[i+1], error[i+1])*sig_grad(combos[i])
            derivW.append(np.outer(activations[i-1], error[i])); derivb.append(np.sum(error[i]))
    #
    # finding the derivative for weights of input to next layer
    #
    error[0] = np.dot(weights[i], error[i])*sig_grad(combos[0])
    derivW.append(np.outer(input, error[0])); derivb.append(np.sum(error[0]))
    return derivW, derivb

#
# weights: our network's weights to update via gradient descent
# thresh: the threshold values to update for our system
# derivb: the derivative of our cost function with respect to b for each layer
# derivW: the derivative of our cost function with respect to W for each layer
# stepsize: the stepsize we want to take, determines how big of a step we take
#
# returns the updated weights and threshold values for our network
def gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers):
    # perform gradient descent
    for j in range(100):
        for i in range(0, num_layers + 1):
            weights[i] = weights[i] - stepsize*derivW[num_layers-i]
            thresh[i] = thresh[i] - stepsize*derivb[num_layers-i]
    return weights, thresh

# input: the data to send through the network
# hidden_layers: the number of nodes in the hidden layers
# num_layers: the number of hidden layers between the input layer and the output layer
# num_output: the number of nodes in the output layer
#
# returns the trained weights and threshold values
#
def nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter, stepsize):
    # assuming that input is an array where each element is an input/sample
    # we also need to know the size of each sample itself
    m = input.size
    thresh = np.random.randn(num_layers + 1, 1)
    thresh_weights = np.ones([num_layers + 1, 1])
    # initialize the weights as a list because each layer might have
    # a different number of weights
    weights = []; weights.append(np.random.randn(m, hidden_layers))
    if(num_layers > 1):
        for i in range(1, num_layers):
            weights.append(np.random.randn(hidden_layers, hidden_layers))
    weights.append(np.random.randn(hidden_layers, num_output))
    for i in range(maxiter):
        activations, combos = feedforward(input, hidden_layers, num_layers, num_output, thresh, weights)
        derivW, derivb = backprop(input, truth, activations, combos, num_layers, weights)
        weights, thresh = gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers)
    return weights, thresh

def main():
    # a very, very simple neural network
    input = np.array([1, 0, 0])
    truth = 0
    hidden_layers = 3
    num_layers = 2
    num_output = 1
    # train the network
    w, t = nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter=10, stepsize=0.001)
    # test the network on a new set of arguments
    # activations, combos = feedforward(new_input, hidden_layers=3, num_layers=2, thresh=t, weights=w)

main()
I've tested this code on simple examples where there are n inputs of one dimension and an output of n dimensions (I'm not yet able to work out the bugs when I type import NN.py into the console, but it works when I run it piece by piece in the console). I have a few questions to help me better understand what is going on when I have n inputs of m dimensions each. For example, the digits data in Python (there are 1797 samples and each sample is 64x1 -- an 8x8 image vectorized).
1) Is each of the 64 pixels considered an input? If so, is the neural net trained one image at a time? This would be an easy fix for me.
2) If the neural net is trained on all images at once, what are suggestions for modifying my code?
3) Obviously the output for an image comes in the form of 0, 1, 2, 3, ..., or 9. But does the target come in the form of a 10x1 vector, where there is a 1 at the digit the image represents and 0s elsewhere? So my prediction vector would have its highest value where the 1 should be, right? (See the sketch after this list.)
4) Then, I'm not quite sure how #3 would look if #2 is true...
I apologize for the long note. Thanks for taking a look and helping me understand better!
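For reference on question 3, a minimal sketch of the usual one-hot target encoding (assuming sklearn's digits dataset; network_output is a hypothetical 10x1 prediction vector):

import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()               # 1797 samples, each a flattened 8x8 image
X = digits.data                      # shape (1797, 64): each pixel is one input
T = np.eye(10)[digits.target]        # shape (1797, 10): 1 at the true digit, 0s elsewhere
# the predicted digit is then the index of the largest output:
# predicted = np.argmax(network_output)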

Pass in matrix of images of variables sizes into Theano

I'm trying to use Theano to do some recognition. All my images are different sizes, and I don't want to resize them because they're paintings, so they shouldn't be the same size. I was wondering how to pass a matrix of images of variable sizes into a Theano function.
I'm under the impression that this is not possible with numpy. Is there an alternative?
def floatX(X):
    return np.asarray(X, dtype=theano.config.floatX)

def init_weights(shape):
    return theano.shared(floatX(np.random.randn(*shape) * 0.01))

def model(X, w):
    return T.nnet.softmax(T.dot(X, w))

X = T.fmatrix()
Y = T.fmatrix()
w = init_weights((784, 10))
py_x = model(X, w)
y_pred = T.argmax(py_x, axis=1)
cost = T.mean(T.nnet.categorical_crossentropy(py_x, Y))
gradient = T.grad(cost=cost, wrt=w)
update = [[w, w - gradient * 0.05]]
train = theano.function(inputs=[X, Y], outputs=cost, updates=update, allow_input_downcast=True)
predict = theano.function(inputs=[X], outputs=y_pred, allow_input_downcast=True)
Unless I'm mistaken in my interpretation of your code, I don't think what you're trying to do makes sense.
If I understand correctly, in model() you are computing a weighted sum over your image pixels using dot(X, w), where I assume that X is an (nimages, npixels) array of image data, and w is a weight matrix with fixed dimensions (784, 10).
In order for that dot product to even be computable, X.shape[1] (the number of pixels in each of your input images) must be equal to w.shape[0].
If the sizes of your input images vary, how can you expect to learn a single weight matrix with fixed dimensions?
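To illustrate the constraint concretely (a sketch with made-up sizes):

import numpy as np

w = np.zeros((784, 10))
X_ok = np.zeros((5, 784))    # five 28x28 images, flattened
X_bad = np.zeros((5, 1024))  # five 32x32 images, flattened

np.dot(X_ok, w)    # fine: inner dimensions match (784 == 784)
np.dot(X_bad, w)   # ValueError: shapes (5,1024) and (784,10) not aligned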
