Initializing weights in a 3 layer neural network - python

So I'm learning the SIMPLEST way to code a neural network, one that can be modified in many ways depending on what you want, basically like a template. I found iamtrask's 11-line neural network code, and the weight initialization makes perfect sense:
syn0 = 2*np.random.random((3,1)) - 1
However, when I look at it for his extended 3 layer network, it looks like this:
syn0 = 2*np.random.random((3,4)) - 1
syn1 = 2*np.random.random((4,1)) - 1
I would understand if syn1 was a bit different, but BOTH are now different! He doesn't explain it, only gives a comment saying, "randomly initialize our weights with mean 0."
Can someone explain the mathematical reasoning behind this? Go full crazy if you want, I've been a math person since I was 5.

If by different you are referring to the arguments of np.random.random(), it is because you are creating weights with different shapes/dimensions. In this example (which ignores biases), you are trying to go from an input of dimension 3 to an output of dimension 1. With one layer, you require the shape (3,1). With two layers, you need shapes (3,n) and (n,1), where n is any positive integer; this is just what makes the matrix multiplication valid. Here n = 4 has been chosen as the hidden layer dimension. As for the "mean 0" comment: np.random.random() draws uniformly from [0, 1), so 2*np.random.random(...) - 1 maps those draws to [-1, 1), which has mean 0.
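As a quick sanity check of the shapes (a minimal numpy sketch; the batch of 5 samples is made up for illustration):

import numpy as np

X = np.random.random((5, 3))             # 5 samples, input dimension 3
syn0 = 2 * np.random.random((3, 4)) - 1  # uniform on [-1, 1), mean 0
syn1 = 2 * np.random.random((4, 1)) - 1

hidden = X.dot(syn0)       # (5, 3) @ (3, 4) -> (5, 4)
output = hidden.dot(syn1)  # (5, 4) @ (4, 1) -> (5, 1)
print(hidden.shape, output.shape)  # (5, 4) (5, 1)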

Related

No gradients provided for any variable error

I'm creating a model using the Keras functional API.
The layer architecture is as follows:
n = tf.keras.layers.Dense(1)(input)
for i in tf.range(n):
    output = tf.keras.layers.Dense(4)(input)
I then concatenate the outputs and return a tensor of shape [1, None, 4], where 1 is the batch dimension, None is n, and 4 is the output from the second dense layer.
My loss function compares the shape of the output with the shape of the expected output, as well as the outputs themselves.
loss = tf.convert_to_tensor(abs(tf.shape(logits)[1] - tf.shape(expected)[1])) * 100.
When running this in a custom training loop, I'm getting the error
ValueError: No gradients provided for any variable: (['while/dense/kernel:0',
'while/dense/bias:0', 'while/while/dense_1/kernel:0', 'while/while/dense_1/bias:0'],).
Provided `grads_and_vars` is ((None, <tf.Variable 'while/dense/kernel:0' shape=(786432, 1)
Shape is not differentiable; you cannot do things like this with gradient-based learning. Problems like this need to be tackled with more powerful tools, e.g. reinforcement learning, where one treats n as an action and gets a policy gradient for it.
A rule of thumb to remember is that you cannot really backprop through discrete objects. You need to produce floats, as gradients require smooth functions. In your case n should be an integer (what does a loop over a float mean?), so this should be your first warning sign. The other is the shape itself, which is also an integer. A target can be discrete, but not the prediction. Note that even in classification we do not output a class, we output probabilities, since probabilities are smooth.
You could build your model by assuming some maximum number N, treat it more like a classification problem where you supervise N directly, and use some form of masking to keep all the results around, as in the sketch below.
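A minimal sketch of that idea (all names and sizes here are assumptions for illustration, not your actual model): predict n as a classification over 0..N_MAX so it gets a proper gradient, and always emit N_MAX rows, masking the unused ones in the loss.

import tensorflow as tf

N_MAX = 8  # assumed upper bound on n

inputs = tf.keras.Input(shape=(16,))
# classify n in 0..N_MAX instead of comparing shapes in the loss
n_logits = tf.keras.layers.Dense(N_MAX + 1)(inputs)
# always produce N_MAX rows of 4 features; mask rows >= n downstream
rows = tf.keras.layers.Dense(N_MAX * 4)(inputs)
rows = tf.keras.layers.Reshape((N_MAX, 4))(rows)

model = tf.keras.Model(inputs, [n_logits, rows])
# n is supervised with an ordinary classification loss, which has
# gradients, unlike a loss built on tf.shape(...)
model.compile(optimizer='adam',
              loss=[tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                    'mse'])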

Converting PyTorch Boolean target to regression target

Question
I have code that is based on Part 2, Chapter 11 of Deep Learning with PyTorch, by Luca Pietro Giovanni Antiga, Thomas Viehmann, and Eli Stevens. It's working just fine. It predicts the value of a Boolean variable. I want to convert this so that it predicts the value of a real number variable that happens to be always between 0 and 34.
There are two parts that I don't know how to convert. First, this part:
pos_t = torch.tensor([
        not candidateInfo_tup.isNodule_bool,
        candidateInfo_tup.isNodule_bool
    ],
    dtype=torch.long,
)
(Why are two values passed in here when one is completely determined by the other?)
and then this part:
self.head_linear = nn.Linear(1152, 2)
self.head_softmax = nn.Softmax(dim=1)
How do I do this?
Guess
I don't want people to think I haven't thought about this at all, so here is my guess:
First part:
age_t = torch.tensor(candidateInfo_tup.age_int, dtype=torch.double)
Second part:
self.head_linear = nn.Linear(299520, 1)
self.head_relu = nn.ReLU()
I'm also guessing that I need to change this:
loss_func = nn.CrossEntropyLoss(reduction='none')
to something like this:
loss_func = nn.L1Loss()
My guesses are based on this article by Christian Versloot.
The example from the book works, but it has some redundant elements that may confuse you.
Normally an output size of 1 is enough for a binary classification problem. To bring it to 0 or 1, one may use a sigmoid followed by rounding, as in the example here: PyTorch Binary Classification - same network structure, 'simpler' data, but worse performance?
Or simply apply this after the single output neuron:
y_pred_binary = torch.round(torch.sigmoid(y_pred))
The book example uses an output size of 2 and then applies softmax to get 0 or 1. This works, but such a technique is typically used in multi-class classification.
For prediction of a 0-34 variable:
if these are discrete values - it is called "multiclass classification", as Ken indicated. Use an output of size 35 and softmax in this case. Search for "pytorch multiclass classification" for examples.
if this is a regression - then your changes seem to be in the right direction, except for the 'Second part': instead of ReLU, clip the output at both ends to [0, 34]. Also, 299520 is too much for the previous layer; use whatever input size there was before. Search for "pytorch regression" for examples.
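A minimal sketch of that regression variant (keeping the book's 1152 input size; the clamp is one way to clip, offered as an assumption rather than the book's code):

import torch
import torch.nn as nn

head_linear = nn.Linear(1152, 1)  # single output for regression
loss_func = nn.L1Loss()

def forward_head(features):
    out = head_linear(features)
    # clip the prediction to the known [0, 34] range instead of using ReLU
    return torch.clamp(out, 0.0, 34.0)

features = torch.randn(8, 1152)  # stand-in for the backbone's output
pred = forward_head(features)    # shape (8, 1)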

Stuck understanding ResNet's Identity block and Convolutional blocks

I'm learning about Residual Networks (ResNet50) from Andrew Ng's Coursera lectures. I understand that one of the main reasons why ResNets work is that they can learn the identity function, which is why adding more and more layers does not hurt the performance of the network.
Now, as described in the lectures, there are two types of blocks used in ResNets: 1) the identity block and 2) the convolutional block.
The identity block is used when there is no change between input and output dimensions. The convolutional block is almost the same as the identity block, but there is a convolutional layer in the short-cut path whose only job is to change the dimension so that the dimensions of input and output match.
Here is the identity block: [image omitted]
and here is the convolutional block: [image omitted]
Now, in the implementation of the convolutional block (second image), the first component of the main path (i.e. Conv2D --> BatchNorm --> ReLU) is implemented with a 1x1 convolution and stride > 1:
# First component of main path
X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', padding = 'valid', kernel_initializer = glorot_uniform(seed=0))(X)
X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
X = Activation('relu')(X)
I don't understand the reason behind keeping stride > 1 with a window size of 1. Isn't it just data loss? We are only considering alternate pixels in this case.
What could be the reason for such a hyperparameter selection? Any intuitive explanation will help! Thanks.
"I don't understand the reason behind keeping stride > 1 with window size 1. Isn't it just data loss?"
Please refer to the section on Deeper Bottleneck Architectures in the ResNet paper (https://arxiv.org/pdf/1512.03385.pdf), and also Figure 5.
1 x 1 convolutions are typically used to increase or decrease the dimensionality along the filter dimension. In the bottleneck architecture, the first 1 x 1 layer reduces the dimensions so that the 3 x 3 layer handles smaller input/output dimensions, and the final 1 x 1 layer increases the filter dimensions again.
This is done to save on computation/training time.
From the paper,
"Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design".
I believe you might have answered your own question. The convolutional block is used whenever you need to change the dimension in order for the output and input dimensions to match. That being said, how do you change the dimension of a certain volume using convolutions? Well, you change the stride.
For any given convolution operation, assuming a square input, the dimension of the output volume is given by floor((n + 2p - f)/s) + 1, where n is the input dimension, p is the zero-padding, f the filter dimension and s the stride. By increasing the stride you effectively reduce the dimension of your shortcut's output volume, and it can thus be used to make sure that the dimensions of the shortcut and lower paths match so the final sum can be performed.
Why is it > 1 then? Well, if you didn't need a stride larger than one, you wouldn't need a dimension alteration in the first place and would therefore be using the identity block instead.
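A quick numeric check of that formula in Python:

def conv_output_dim(n, p, f, s):
    # floor((n + 2p - f) / s) + 1
    return (n + 2 * p - f) // s + 1

# a 1x1 convolution with stride 2 and no padding halves a 56x56 input:
print(conv_output_dim(56, 0, 1, 2))  # 28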

Merging a tensor of 3x10 to 1x10, what methods to use?

I have made a model that in the end outputs a 3 x 10 tensor. It's 3 x 10 because the vocabulary size is 10 and there are 3 elements in a sequence (this is a sequence multilabel classification problem). This tensor somehow needs to be softmaxed into a 1 x 10 tensor. Can someone explain the methods that are available, maybe with an example in Keras?
I saw some merging methods in Keras like average or add. Those could be useful here, but they seem to need two or more tensors as input. So I would probably need to split the 3 x 10 tensor into three 1 x 10 tensors and average them. Maybe there are better ways to achieve this?
A simple way to achieve what you want is to use a final 1x1 convolution layer.
A layer with a 1x1 convolution kernel allows you to merge your 3x10 tensor into a 1x10 one, and it simultaneously learns the fusion weights during training.
Add this layer :
output = Conv2D(1, (1, 1), activation='your_activation')(your_3x10_tensor)
Hope this is the solution you were looking for!
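For concreteness, here is one way to wire this up end to end (a Conv1D variant of the same 1x1 idea, moving the 3 axis onto the channel dimension; the shapes and layer choices are assumptions, not the only option):

import tensorflow as tf
from tensorflow.keras import layers

inp = tf.keras.Input(shape=(3, 10))  # the 3 x 10 tensor
x = layers.Permute((2, 1))(inp)      # -> (10, 3): the 3 axis becomes channels
x = layers.Conv1D(1, 1)(x)           # 1x1 kernel learns the fusion weights -> (10, 1)
x = layers.Reshape((10,))(x)
out = layers.Softmax()(x)            # softmaxed 1 x 10 output

model = tf.keras.Model(inp, out)
model.summary()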

How to interpret clearly the meaning of the units parameter in Keras?

I am wondering how LSTMs work in Keras. In this tutorial, for example, as in many others, you can find something like this:
model.add(LSTM(4, input_shape=(1, look_back)))
What does the "4" mean? Is it the number of neurons in the layer? By neuron, I mean something that gives a single output for each instance.
Actually, I found this brilliant discussion but wasn't really convinced by the explanation mentioned in the reference given.
On the scheme, one can see num_units illustrated, and I think I am not wrong in saying that each of these units is a very atomic LSTM unit (i.e. the 4 gates). However, how are these units connected? If I am right (but I'm not sure), x_(t-1) is of size nb_features, so each feature would be the input of a unit, and num_units must equal nb_features, right?
Now, let's talk about Keras. I have read this post and the accepted answer and am having trouble. Indeed, the answer says:
Basically, the shape is like (batch_size, timespan, input_dim), where input_dim can be different from the unit
In which case? I am having trouble reconciling this with the previous reference...
Moreover, it says,
LSTM in Keras only define exactly one LSTM block, whose cells is of unit-length.
Okay, but how do I define a full LSTM layer then? Is it the input_shape that implicitly creates as many blocks as the number of time_steps (which, as I understand it, is the first parameter of the input_shape parameter in my piece of code)?
Thanks for enlightening me.
EDIT: would it also be possible to detail clearly how to reshape data of, say, size (n_samples, n_features) for a stateful LSTM model? How to deal with time_steps and batch_size?
First, units in an LSTM is NOT the number of time_steps.
Each LSTM cell (present at a given time_step) takes in input x and forms a hidden state vector a; the length of this hidden vector is what is called units in LSTM (Keras).
You should keep in mind that there is only one RNN cell created by the code
keras.layers.LSTM(units, activation='tanh', …… )
and the RNN operations are repeated Tx times by the class itself.
I've linked this to help you understand it better with a very simple code example.
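For instance, this tiny sketch (toy shapes assumed) shows that units fixes the hidden-state size regardless of how many timesteps you feed in:

import numpy as np
import tensorflow as tf

x = np.random.rand(2, 7, 3).astype('float32')  # (batch, time_steps, features)
layer = tf.keras.layers.LSTM(4)                # units = 4
print(layer(x).shape)                          # (2, 4): one size-4 vector per sample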
You can (sort of) think of it exactly as you think of fully connected layers. Units are neurons.
The dimension of the output is the number of neurons, as with most well-known layer types.
The difference is that in LSTMs, these neurons will not be completely independent of each other; they will intercommunicate due to the mathematical operations lying under the cover.
Before going further, it might be interesting to take a look at this very complete explanation about LSTMs, their inputs/outputs and the usage of stateful=True/False: Understanding Keras LSTMs. Notice that your input shape should be input_shape=(look_back, 1). The input shape is (time_steps, features).
While this is a series of fully connected layers:
hidden layer 1: 4 units
hidden layer 2: 4 units
output layer: 1 unit
This is a series of LSTM layers:
where the input data has shape (batch_size, arbitrary_steps, 3), i.e. input_shape=(arbitrary_steps, 3).
Each LSTM layer will keep reusing the same units/neurons over and over until all the arbitrary timesteps in the input are processed.
The output will have shape:
(batch, arbitrary_steps, units) if return_sequences=True.
(batch, units) if return_sequences=False.
The memory states will have a size of units.
The inputs processed from the last step will have a size of units.
To be really precise, there will be two groups of units: one working on the raw inputs, the other working on already-processed inputs coming from the last step. Due to the internal structure, each group will have a number of parameters 4 times bigger than the number of units (this 4 is not related to the image; it's fixed).
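You can verify that factor of 4 from the parameter count (a small check with assumed sizes; for u units and f input features an LSTM holds 4*(u*f + u*u + u) parameters):

import tensorflow as tf

layer = tf.keras.layers.LSTM(4)
layer.build((None, None, 3))    # u = 4 units, f = 3 features
u, f = 4, 3
print(layer.count_params())     # 128
print(4 * (u * f + u * u + u))  # 128 as well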
Flow:
Takes an input with n steps and 3 features
Layer 1:
    For each time step in the inputs:
        Uses 4 units on the inputs to get a size-4 result
        Uses 4 recurrent units on the outputs of the previous step
    Outputs the last (return_sequences=False) or all (return_sequences=True) steps
    Output features = 4
Layer 2:
    Same as layer 1
Layer 3:
    For each time step in the inputs:
        Uses 1 unit on the inputs to get a size-1 result
        Uses 1 unit on the outputs of the previous step
    Outputs the last (return_sequences=False) or all (return_sequences=True) steps
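A sketch of that stack in Keras (shapes assumed to match the flow above):

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 3)),        # arbitrary_steps, 3 features
    layers.LSTM(4, return_sequences=True),  # layer 1: 4 units, output all steps
    layers.LSTM(4, return_sequences=True),  # layer 2: same as layer 1
    layers.LSTM(1),                         # layer 3: 1 unit, last step only
])
model.summary()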
The number of units is the size (length) of the internal vector states, h and c, of the LSTM. That is, no matter the shape of the input, it is upscaled (by a dense transformation) by the various kernels for the i, f, and o gates. The details of how the resulting latent features are transformed into h and c are described in the linked post. In your example, the input shape of the data
(batch_size, timesteps, input_dim)
will be transformed to
(batch_size, timesteps, 4)
if return_sequences is true; otherwise only the last h will be emitted, making it (batch_size, 4). I would recommend using a much higher latent dimension, perhaps 128 or 256, for most problems.
I would put it this way - there are 4 LSTM "neurons" or "units", each with 1 cell state and 1 hidden state for each timestep they process. So for an input of 1 timestep, you will have 4 cell states, 4 hidden states and 4 outputs.
Actually, the correct way to say this is - for a one-timestep input you get 1 cell state (a vector of size 4), 1 hidden state (a vector of size 4) and 1 output (a vector of size 4).
So if you feed in a timeseries with 20 steps, you will have 20 (intermediate) cell states, each of size 4, because the inputs in an LSTM are processed sequentially, one after the other. Similarly, you will have 20 hidden states, each of size 4.
Usually, your output will be the output of the LAST step (a vector of size 4). However, in case you want the outputs of each intermediate step (remember, you have 20 timesteps to process), you can set return_sequences=True. In that case you will have 20 vectors of size 4, each telling you what the output was when each of those steps got processed, as those 20 inputs came one after the other.
If you also set return_state=True, you get the last hidden state of size 4 and the last cell state of size 4.
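A quick shape check of all of the above (toy input with assumed sizes):

import numpy as np
import tensorflow as tf

x = np.random.rand(1, 20, 5).astype('float32')  # 20 timesteps, 5 features
lstm = tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)
outputs, last_h, last_c = lstm(x)
print(outputs.shape)  # (1, 20, 4): one size-4 output per timestep
print(last_h.shape)   # (1, 4): last hidden state
print(last_c.shape)   # (1, 4): last cell state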
