I am trying to understand the following code, written in Python and TensorFlow. I'm trying to implement handwriting text recognition, and I am referring to the following code here.
I don't understand why the RNN output is put through an "atrous_conv2d".
This is the architecture of my model: it takes the CNN output as input, passes it through this RNN, and then feeds the result to a CTC layer.
def build_RNN(self, rnnIn4d):
    rnnIn3d = tf.squeeze(rnnIn4d, axis=[2])  # squeeze removes size-1 dimensions; here it removes index 2: BxTx1xF -> BxTxF
    n_hidden = 256
    n_layers = 2
    cells = []
    for _ in range(n_layers):
        cells.append(tf.nn.rnn_cell.LSTMCell(num_units=n_hidden))
    stacked = tf.nn.rnn_cell.MultiRNNCell(cells)  # combine the 2 LSTMCells created

    # BxTxF -> two tensors of shape BxTxH (forward and backward passes)
    ((fw, bw), _) = tf.nn.bidirectional_dynamic_rnn(cell_fw=stacked, cell_bw=stacked, inputs=rnnIn3d,
                                                    dtype=rnnIn3d.dtype)

    # BxTxH + BxTxH -> BxTx2H -> BxTx1x2H
    concat = tf.expand_dims(tf.concat([fw, bw], 2), 2)

    # project output to chars (including blank): BxTx1x2H -> BxTx1xC -> BxTxC
    kernel = tf.Variable(tf.truncated_normal([1, 1, n_hidden * 2, len(self.char_list) + 1], stddev=0.1))
    rnn = tf.nn.atrous_conv2d(value=concat, filters=kernel, rate=1, padding='SAME')
    return tf.squeeze(rnn, axis=[2])
The input to the CTC loss layer will be of the form B x T x C:
B - batch size
T - max length of the output (twice the max word length, due to the blank char)
C - number of characters + 1 (the blank char)
The input to atrous_conv2d has shape (B x T x 1 x 2H) == (batch, height, width, channels)
The filter we are using is (1, 1, 2H, C) == (height, width, input channels, output channels)
After the atrous convolution we get (B, T, 1, C), which (after the squeeze) is the desired B x T x C input for CTC
Note: we take a transpose before we feed our image to the CNN, since tf is row major.
atrous_conv2d with rate 1 is the same as a normal convolution layer; with a 1x1 kernel it simply projects each timestep's 2H features onto the C character classes.
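To see the equivalence concretely, here is a minimal sketch (shapes are illustrative, not from the original model; runnable in TF 2.x eager mode):

import tensorflow as tf

# B=batch, T=time steps, H=hidden units, C=characters + blank
B, T, H, C = 2, 32, 256, 80
concat = tf.random.normal([B, T, 1, 2 * H])  # BxTx1x2H, as produced in build_RNN
kernel = tf.random.normal([1, 1, 2 * H, C])  # 1x1 kernel: per-timestep projection 2H -> C
a = tf.nn.atrous_conv2d(value=concat, filters=kernel, rate=1, padding='SAME')
c = tf.nn.conv2d(concat, kernel, strides=[1, 1, 1, 1], padding='SAME')
print(a.shape)                               # (2, 32, 1, 80) == BxTx1xC
print(tf.reduce_max(tf.abs(a - c)).numpy())  # ~0.0: identical to a plain convolution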
Related
I want to create an RNN model in Keras. At each time step the input has 9 elements and the output has 4 elements.
input_size = (304414, 9)
target_size = (304414, 4)
How can I create a dataset of sliding windows over the time series?
You can use this code, choosing a window size and stride:
windows = []
for idx in range(0, input_data.shape[0] - window_size - 1, stride):
    windows.append(input_data[idx + 1: idx + 1 + window_size, :])
windows = np.reshape(windows, (len(windows), windows[0].shape[0], windows[0].shape[1]))
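If you are on TF >= 2.3, Keras also ships a helper that does the same windowing; a sketch with placeholder window/stride values (50 and 1 are my choices, not from the question):

import numpy as np
import tensorflow as tf

input_data = np.random.rand(304414, 9).astype("float32")
target_data = np.random.rand(304414, 4).astype("float32")
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=input_data,
    targets=target_data,   # target_data[i] is paired with the window starting at i
    sequence_length=50,    # window size (placeholder)
    sequence_stride=1,     # stride (placeholder)
    batch_size=32)
for x, y in ds.take(1):
    print(x.shape, y.shape)  # (32, 50, 9) (32, 4)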
I am trying to compile and train an RNN model for regression using Keras/TensorFlow. I am using the Functional API to define my model.
I need to have 2 different inputs. The first one (input) is my training data which is an array with the shape: (TOTAL_TRAIN_DATA, SEQUENCE_LENGTH, NUM_OF_FEATURES) = (15000,1564,2). To make it more clear, I have 2 features for every frame of 15000 videos. The videos had initially a different number of frames, so all of them have been padded to have SEQUENCE_LENGTH=1564 frames (by repeating the last row). The second input (lengths) is a vector (15000,) that contains the initial length of each video. It's something like this: lengths = [317 215 576 ... 1245 213 654].
What I am trying to do is concatenate the features in the output of a GRU layer and then multiply them with the appropriate masks to keep only the features corresponding to the initial video lengths. To be more precise, the output of the GRU layer has a shape of (batch_size, SEQUENCE_LENGTH, GRU_UNITS) = (50,1564,256). I have defined a Flatten() layer that reshapes the output of the RNN to (50, 1564*256). So in this step, I want to create a mask array with a shape of (50,1564*256). Each row of the array is going to be the mask for the corresponding sample of the batch.
def mask_creator(lengths, number_of_GRU_features=256, max_pad_len=1564):
    masks = np.zeros((lengths.shape[0], number_of_GRU_features * max_pad_len))
    for i, length in enumerate(lengths):
        masks[i, :] = np.concatenate((np.ones([length * number_of_GRU_features, ]),
                                      np.zeros([(max_pad_len - length) * number_of_GRU_features, ])), axis=0)
    return masks
#tf.compat.v1.enable_eager_execution()
#tf.data.experimental.enable_debug_mode()
#tf.config.run_functions_eagerly(True)
GRU_UNITS = 256
SEQUENCE_LENGTH = 1564
NUM_OF_FEATURES = 2
input = tf.keras.layers.Input(shape=(SEQUENCE_LENGTH,NUM_OF_FEATURES))
lengths = tf.keras.layers.Input(shape=())
masks = tf.keras.layers.Lambda(mask_creator, name="mask_function")(lengths)
gru = tf.keras.layers.GRU(GRU_UNITS , return_sequences=True)(input)
flat = tf.keras.layers.Flatten()(gru)
multiplied = tf.keras.layers.Multiply()([flat, masks])
outputs = tf.keras.layers.Dense(7, name="pred")(multiplied)
# Compile
model = tf.keras.Model([input, lengths], outputs, name="RNN")
# optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
#Compile keras model
model.compile(optimizer='adam',
              loss='mean_squared_error',
              metrics=['MeanSquaredError', 'MeanAbsoluteError'])
              #run_eagerly=True)
model.summary()
To create the masks, I have to somehow access the lengths vector that I am passing as an input to my Keras model (lengths = tf.keras.layers.Input(shape=())). For that purpose, I thought about defining a Lambda layer (masks = tf.keras.layers.Lambda(mask_creator, name="mask_function")(lengths)) which calls the mask_creator function to create the masks. The lengths variable is supposed to be a Tensor with a shape of (batch_size,) = (50,), if I am not mistaken. However, I cannot, by any means, access the elements of lengths, as I get different types of errors, like the one below.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-8e31522694ee> in <module>()
9 input = tf.keras.layers.Input(shape=(SEQUENCE_LENGTH,FEATURES))
10 lengths = tf.keras.layers.Input(shape=())
---> 11 masks = tf.keras.layers.Lambda(mask_creator, name="mask_function")(lengths)
12 gru = tf.keras.layers.GRU(GRU_UNITS , return_sequences=True)(input)
13 flat = tf.keras.layers.Flatten()(gru)
1 frames
<ipython-input-19-9490084e8336> in mask_creator(lengths, number_of_GRU_features, max_pad_len)
1 def mask_creator(lengths,number_of_GRU_features=256,max_pad_len=1564):
2
----> 3 masks = np.zeros((lengths.shape[0],number_of_GRU_features*max_pad_len))
4
5 for i, length in enumerate(lengths):
TypeError: Exception encountered when calling layer "mask_function" (type Lambda).
'NoneType' object cannot be interpreted as an integer
Call arguments received:
• inputs=tf.Tensor(shape=(None,), dtype=float32)
• mask=None
• training=None
Why is that and how could I fix this?
Try using TF operations only:
import tensorflow as tf

@tf.function
def mask_creator(lengths, number_of_GRU_features=256, max_pad_len=1564):
    ones = tf.ragged.range(lengths * number_of_GRU_features) * 0 + 1
    zeros = tf.ragged.range((max_pad_len - lengths) * number_of_GRU_features) * 0
    masks = tf.concat([ones, zeros], axis=1)
    return masks.to_tensor()

lengths = tf.constant([5, 10])
tf.print(mask_creator(lengths).shape, summarize=-1)
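An alternative sketch (my suggestion, not part of the original answer) that builds the same flat mask with tf.sequence_mask, assuming Flatten keeps all GRU features of one timestep together:

import tensorflow as tf

def mask_creator_seq(lengths, number_of_GRU_features=256, max_pad_len=1564):
    # (batch, max_pad_len) boolean mask, True for valid timesteps
    mask = tf.sequence_mask(lengths, maxlen=max_pad_len)
    # repeat each timestep's flag once per GRU feature -> (batch, max_pad_len * features)
    mask = tf.repeat(mask, repeats=number_of_GRU_features, axis=1)
    return tf.cast(mask, tf.float32)

lengths = tf.constant([5, 10])
tf.print(tf.shape(mask_creator_seq(lengths)))  # [2 400384]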
I have a small, 3-layer neural network with two input neurons, two hidden neurons, and one output neuron. I am trying to stick to the format below of using only 2 hidden neurons.
I am trying to show how this can be used to behave as the XOR logic gate; however, with just two hidden neurons I get the following poor output after 1,000,000 iterations!
Input: 0 0 Output: [0.01039096]
Input: 1 0 Output: [0.93708829]
Input: 0 1 Output: [0.93599738]
Input: 1 1 Output: [0.51917667]
If I use three hidden neurons I get a much better output with 100,000 iterations:
Input: 0 0 Output: [0.01831612]
Input: 1 0 Output: [0.98558057]
Input: 0 1 Output: [0.98567602]
Input: 1 1 Output: [0.02007876]
I am getting a decent output with 3 neurons in the hidden layer but not with two neurons in the hidden layer. Why?
As per a comment below, this repo contains code that solves the XOR problem using two hidden neurons.
I can't figure out what I am doing wrong. Any suggestions are appreciated!
Attached is my code:
import numpy as np
import matplotlib
from matplotlib import pyplot as plt

# Sigmoid function
def sigmoid(x, deriv=False):
    if deriv:
        return x * (1 - x)
    return 1 / (1 + np.exp(-x))

alpha = [0.7]

# Input dataset
X = np.array([[0, 0],
              [0, 1],
              [1, 0],
              [1, 1]])

# Output dataset
y = np.array([[0, 1, 1, 0]]).T

# seed random numbers to make the calculation deterministic
np.random.seed(1)

# initialise weights randomly with mean 0
syn0 = 2 * np.random.random((2, 3)) - 1  # 1st layer of weights, synapse 0, connecting L0 to L1
syn1 = 2 * np.random.random((3, 1)) - 1  # 2nd layer of weights, synapse 1, connecting L1 to L2

# Randomize inputs for stochastic gradient descent
data = np.hstack((X, y))  # append input and output datasets
np.random.shuffle(data)   # shuffle
x, y = np.array_split(data, 2, 1)  # split along vertical (1) axis

for iter in range(100000):
    for i in range(4):
        # forward prop
        layer0 = x[i]  # input layer
        layer1 = sigmoid(np.dot(layer0, syn0))  # prediction step for layer 1
        layer2 = sigmoid(np.dot(layer1, syn1))  # prediction step for layer 2

        layer2_error = y[i] - layer2  # compare layer2's guess with the target
        layer2_delta = layer2_error * sigmoid(layer2, deriv=True)  # error-weighted derivative step

        if iter % 10000 == 0:
            print("Error: ", str(np.mean(np.abs(layer2_error))))
            plt.plot(iter, layer2_error, 'ro')

        # Uses "confidence weighted error" from l2 to establish an error for l1
        layer1_error = layer2_delta.dot(syn1.T)
        layer1_delta = layer1_error * sigmoid(layer1, deriv=True)  # error-weighted derivative step

        # Since this is SGD we need the outer product of two 1D arrays. This is how.
        syn1 += (alpha * np.dot(layer1[:, None], layer2_delta[None, :]))  # update weights
        syn0 += (alpha * np.dot(layer0[:, None], layer1_delta[None, :]))

# Training was done above; below we re-run to test the network
layer0 = X  # input layer
layer1 = sigmoid(np.dot(layer0, syn0))  # prediction step for layer 1
layer2 = sigmoid(np.dot(layer1, syn1))  # prediction step for layer 2

plt.show()
print("output after training: \n")
print("Input: 0 0 \t Output: ", layer2[0])
print("Input: 1 0 \t Output: ", layer2[1])
print("Input: 0 1 \t Output: ", layer2[2])
print("Input: 1 1 \t Output: ", layer2[3])
This is because you have not included any bias for the neurons.
You have only used weights to try to fit the XOR model.
With 2 neurons in the hidden layer, the network under-fits, as it can't compensate for the missing bias.
When you use 3 neurons in the hidden layer, the extra neuron counters the effect of the missing bias.
This is an example of a network for the XOR gate. You'll notice theta (bias) added to the hidden layers, which gives the network an additional parameter to tweak; a minimal sketch of adding bias to your network follows.
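A minimal runnable sketch of the same idea, a 2-2-1 network with bias terms added (it uses batch updates instead of your per-sample loop; the seed and learning rate are arbitrary and may need adjusting to converge):

import numpy as np

def sigmoid(x, deriv=False):
    if deriv:
        return x * (1 - x)  # same convention as above: x is already an activation
    return 1 / (1 + np.exp(-x))

np.random.seed(1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0, 1, 1, 0]]).T

W0 = 2 * np.random.random((2, 2)) - 1  # input -> 2 hidden neurons
b0 = np.zeros((1, 2))                  # hidden-layer bias (the missing parameter)
W1 = 2 * np.random.random((2, 1)) - 1  # hidden -> output
b1 = np.zeros((1, 1))                  # output bias
alpha = 0.7

for _ in range(100000):
    layer1 = sigmoid(X @ W0 + b0)
    layer2 = sigmoid(layer1 @ W1 + b1)
    layer2_delta = (y - layer2) * sigmoid(layer2, deriv=True)
    layer1_delta = (layer2_delta @ W1.T) * sigmoid(layer1, deriv=True)
    W1 += alpha * layer1.T @ layer2_delta
    b1 += alpha * layer2_delta.sum(axis=0, keepdims=True)  # bias gradient is just the delta
    W0 += alpha * X.T @ layer1_delta
    b0 += alpha * layer1_delta.sum(axis=0, keepdims=True)

print(sigmoid(sigmoid(X @ W0 + b0) @ W1 + b1))  # all four outputs should approach 0/1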
It is an unsolvable system of equations, which is why the NN cannot solve it either.
While it may be an oversimplification, if we say the transfer function is linear, the expression becomes something like:
z = (w1*x+w2*y)*w3 + (w4*x+w5*y)*w6
Then there are the 4 cases:
xy=00: z = 0 = 0
xy=10: z = 1 = w1*w3 + w4*w6
xy=01: z = 1 = w2*w3 + w5*w6
xy=11: z = 0 = (w1+w2)*w3 + (w4+w5)*w6
The problem is that
0 = (w1+w2)*w3 + (w4+w5)*w6 = (w1*w3 + w4*w6) + (w2*w3 + w5*w6)   <-- the xy=11 line
  = 1 + 1 = 2                                                     <-- by the xy=10 and xy=01 lines
So the seemingly 6 degrees of freedom are simply not enough here, which is why you run into the need to add something extra.
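A quick numerical check of this argument (my own illustration): writing a = w1*w3 + w4*w6 and b = w2*w3 + w5*w6, the cases demand a = 1, b = 1 and a + b = 0, which even least squares cannot satisfy exactly:

import numpy as np

A = np.array([[1.0, 0.0],   # xy=10: a = 1
              [0.0, 1.0],   # xy=01: b = 1
              [1.0, 1.0]])  # xy=11: a + b = 0
t = np.array([1.0, 1.0, 0.0])
coef, residuals, rank, _ = np.linalg.lstsq(A, t, rcond=None)
print(coef, residuals)  # nonzero residual: the system is inconsistent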
I am studying TensorBoard code from Dandelion Mane, specifically: https://github.com/dandelionmane/tf-dev-summit-tensorboard-tutorial/blob/master/mnist.py
His convolution layer is specifically defined as:
def conv_layer(input, size_in, size_out, name="conv"):
    with tf.name_scope(name):
        w = tf.Variable(tf.truncated_normal([5, 5, size_in, size_out], stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[size_out]), name="B")
        conv = tf.nn.conv2d(input, w, strides=[1, 1, 1, 1], padding="SAME")
        act = tf.nn.relu(conv + b)
        tf.summary.histogram("weights", w)
        tf.summary.histogram("biases", b)
        tf.summary.histogram("activations", act)
        return tf.nn.max_pool(act, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="SAME")
I am trying to work out what is the effect of the conv2d on the input tensor size. As far as I can tell it seems the first 3 dimensions are unchanged but the last dimension of the output follows the size of the last dimension of w.
For example, ?x47x36x64 input becomes ?x47x36x128 with w shape=5x5x64x128
And I also see that: ?x24x18x128 becomes ?x24x18x256 with w shape=5x5x128x256
So, for an input of size [a,b,c,d], is the output size [a,b,c,w.shape[3]]?
Would it be correct to think that the first dimension does not change?
This works in your case because of the stride used and the padding applied. The output width and height will not always be the same as the input.
Check out this excellent discussion of the topic. The basic takeaway (taken almost verbatim from that link) is that a convolution layer:
Accepts an input volume of size W1 x H1 x D1
Requires four hyperparameters:
Number of filters K
Spatial extent of filters F
The stride with which the filter moves S
The amount of zero padding P
Produces a volume of size W2 x H2 x D2 where:
W2 = (W1 - F + 2*P)/S + 1
H2 = (H1 - F + 2*P)/S + 1
D2 = K
And when you are processing batches of data in Tensorflow they typically have shape [batch_size, width, height, depth], so the first dimension which is just the number of samples in your batch should not change.
Note that the amount of padding P in the above is a little tricky with TF. When you give the padding='SAME' argument to tf.nn.conv2d, TensorFlow applies zero padding to both sides of the image to make sure that no pixels of the image are ignored by your filter, but it may not add the same amount of padding to both sides (they can differ by one, I think). This SO thread has some good discussion on the topic.
In general, with a stride S of 1 (which your network has), zero padding of P = (F - 1) / 2 will ensure that the output width/height equals the input, i.e. W2 = W1 and H2 = H1. In your case, F is 5, so tf.nn.conv2d must be adding two zeros to each side of the image for a P of 2, and your output width according to the above equation is W2 = (W1 - 5 + 2*2)/1 + 1 = W1 - 1 + 1 = W1.
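A quick way to confirm the shape rule is to push a dummy tensor through tf.nn.conv2d; a sketch using the shapes from the question (runnable in TF 2.x eager mode):

import tensorflow as tf

x = tf.zeros([1, 47, 36, 64])  # [batch, W1, H1, D1]
w = tf.zeros([5, 5, 64, 128])  # F=5, K=128
y = tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding="SAME")
print(y.shape)  # (1, 47, 36, 128): with S=1 and 'SAME', W2=W1 and H2=H1, and D2=K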
I understand how a neural network with backpropagation is supposed to work, and I know how sklearn's MLPClassifier and its fit function are used. I am creating my own because I'd like to understand the details better. I will first show my code (with comments) and then discuss my problems.
import numpy as np
import scipy as sp
import sklearn as ML
# z: the linear combination of the previous layer
#
# returns the activation for the node
#
def sigmoid(z):
    a = 1 / (1 + np.exp(-z))
    return a
# z: the contribution of a layer
#
# returns the derivative of the sigmoid evaluated at z
#
def sig_grad(z):
    d = (1 - sigmoid(z)) * sigmoid(z)
    return d
# input: the data we want to train the network with
# hidden_layers: the number of nodes in the hidden layers
# num_layers: how many hidden layers between the input layer and the output layer
# num_output: how many outputs there are... this becomes relevant when we input many features.
#
# returns the activations determined
# and the linear combinations of previous layer's nodes for each layer
#
def feedforward(input, hidden_layers, num_layers, num_output, thresh, weights):
    # initialize the vector for inputs AND threshold values
    X = np.hstack([thresh[0], input])
    # initialize the activations list
    A = []
    # initialize the linear combos for each layer
    Z = []
    w = list(weights)
    # place ones in the first row of each layer of weights for the threshold
    w[0] = np.vstack([np.ones([1, hidden_layers]), w[0]])
    for i in range(1, num_layers):
        w[i] = np.vstack([np.ones([1, hidden_layers]), weights[i]])
    w[-1] = np.vstack([np.ones([1, num_output]), w[-1]])
    # the first layer of weights is initialized outside the function
    # cycle through the hidden layers
    for i in range(1, num_layers + 1):
        Z.append(np.dot(X, w[i - 1]))
        S = sigmoid(Z[i - 1])
        A.append(S)
        X = np.hstack([thresh[i], A[i - 1]])
    # find the output/last layer activations
    Z.append(np.dot(X, w[-1]))
    S = sigmoid(Z[-1])
    A.append(S)
    return A, Z
#
# truth: what we know the output should be
# activations: the activations determined at each node by the sigmoid
# function in the previous feedforward pass
# combos: the linear combinations at each layer in the prev. ff pass
# num_layers: the number of hidden layers
#
# error: the errors determined at each layer; will be needed for gradient descent
#
def backprop(input, truth, activations, combos, num_layers, weights):
    # initialize an array of errors for each hidden layer and the output layer
    error = [0 for x in range(0, num_layers + 1)]
    # initialize the lists containing the gradients w.r.t. weights and threshold
    derivW = []
    derivb = []
    # set the output layer since its error is computed differently than the others
    error[num_layers] = (activations[num_layers] - truth) * sig_grad(combos[num_layers])
    # find the rate of change for weights and thresh for connections to output
    derivW.append(activations[num_layers - 1] * error[num_layers])
    derivb.append(np.sum(error[num_layers]))
    if num_layers > 1:
        # find the errors for each of the hidden layers
        for i in range(num_layers - 1, 0, -1):
            error[i] = np.dot(weights[i + 1], error[i + 1]) * sig_grad(combos[i])
            derivW.append(np.outer(activations[i - 1], error[i]))
            derivb.append(np.sum(error[i]))
    # finding the derivative for weights of input to next layer
    error[0] = np.dot(weights[i], error[i]) * sig_grad(combos[0])
    derivW.append(np.outer(input, error[0]))
    derivb.append(np.sum(error[0]))
    return derivW, derivb
#
# weights: our networks weights to update via gradient descent
# thresh: the threshold values to update for our system
# derivb: the derivative of our cost function with respect to b for each layer
# derivW: the derivative of our cost function with respect to W for each layer
# stepsize: the stepsize we want to take, determines how big of a step we take
#
# returns the updated weights and threshold values for our network
def gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers):
    # perform gradient descent
    for j in range(100):
        for i in range(0, num_layers + 1):
            weights[i] = weights[i] - stepsize * derivW[num_layers - i]
            thresh[i] = thresh[i] - stepsize * derivb[num_layers - i]
    return weights, thresh
#input: the data to send through the network
#hidden_layers: the number of hidden_layers between the input layer and the output layer
#num_layers: the number of nodes in the hidden layer
#num_output: the number of nodes in the output layer
#
#returns the output of the network
#
def nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter, stepsize):
    # assuming that input is an array where each element is an input/sample
    # we also need to know the size of each sample itself
    m = input.size
    thresh = np.random.randn(num_layers + 1, 1)
    thresh_weights = np.ones([num_layers + 1, 1])
    # initialize the weights as a list because each layer might have
    # a different number of weights
    weights = []
    weights.append(np.random.randn(m, hidden_layers))
    if num_layers > 1:
        for i in range(1, num_layers):
            weights.append(np.random.randn(hidden_layers, hidden_layers))
    weights.append(np.random.randn(hidden_layers, num_output))
    for i in range(maxiter):
        activations, combos = feedforward(input, hidden_layers, num_layers, num_output, thresh, weights)
        derivW, derivb = backprop(input, truth, activations, combos, num_layers, weights)
        weights, thresh = gradDesc(weights, thresh, derivb, derivW, stepsize, num_layers)
    return weights, thresh
def main():
    # a very, very simple neural network
    input = np.array([1, 0, 0])
    truth = 0
    hidden_layers = 3
    num_layers = 2
    num_output = 1
    # train the network
    w, t = nNetwork(input, truth, hidden_layers, num_layers, num_output, maxiter=10, stepsize=0.001)
    # test the network on a new set of arguments
    # activations, combos = feedforward(new_input, hidden_layers=3, num_layers=2, thresh=t, weights=w)

main()
I've tested this code on simple examples where there are n inputs of one dimension and an output of n dimensions (I'm not yet able to work out the bugs when I type import NN.py into the console, but it works when I run it piece by piece in the console). I have a few questions to help me better understand what is going on when I have n inputs of m dimensions each. For example, the digits data in Python (there are 1797 samples and each sample is 64x1 -- an 8x8 image vectorized).
1) Is each of the 64 pixels considered an input? If so, is the neural net trained one image at a time? This would be an easy fix for me.
2) If the neural net is trained all images at once, what are suggestions for modifying my code?
3) Obviously the output for an image is 0, 1, 2, ..., or 9. But does the output come in the form of a 10x1 vector, with a 1 at the digit the image represents and 0's elsewhere? So my prediction vector would have its highest value where the 1 should be, right?
4) Then, I'm not quite sure what #3 would look like if #2 is true.
I apologize for the long note. Thanks for taking a look and helping me understand better!