Here is an example from my code:
loss += lambda_coord * ((tf.math.sqrt(bb[2]) - tf.math.sqrt(label[2]))**2
+ (tf.math.sqrt(bb[3]) - tf.math.sqrt(label[3]))**2)
Both bb and label are tensors.
Is it possible for me to rewrite this as
loss += lambda_coord * ((math.sqrt(bb[2]) - math.sqrt(label[2]))**2
+ (math.sqrt(bb[3]) - math.sqrt(label[3]))**2)
using only standard library functions?
Another example: if I have a tensor that only contains the value 2, can I do 2 * tensor, or do I have to use tf.math.multiply?
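For context, a minimal sketch of both points (my own illustration; it assumes TensorFlow 2.x in eager mode). TensorFlow overloads Python's arithmetic operators on tensors, but the standard library's math.sqrt only accepts a plain Python number:

import math

import tensorflow as tf

t = tf.constant([1.0, 4.0, 9.0])

# Python operators are overloaded on tensors, so this dispatches to
# tf.math.multiply under the hood:
print(2 * t)            # tf.Tensor([ 2.  8. 18.], shape=(3,), dtype=float32)

# Element-wise sqrt needs the TF op; math.sqrt from the standard library
# expects a single Python number and fails on a non-scalar tensor:
print(tf.math.sqrt(t))  # tf.Tensor([1. 2. 3.], shape=(3,), dtype=float32)
# math.sqrt(t)          # would raise a TypeError here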
I am implementing binary addition with a recurrent neural network (RNN) as an exercise. I ran into an issue implementing it in Python, so I am sharing my problem here to get ideas for fixing it.
As can be seen in my notebook code (the backpropagation through time (BPTT) section), there is a chain rule for updating the input weight matrix (the full derivation is in the notebook). My problem is one particular term of that chain rule: I've tried to implement it in my Python code and notebook code (class input_layer, backward method), but mismatched dimensions raise an error.
In my sample code, W_hidden is 16×16, whereas the delta of pre_hidden is 1×2; multiplying them is what raises the error, and you can reproduce it by running the code.
I spent a lot of time checking both my chain rule and my code. I believe the chain rule is right, so the only possible cause of the error is my code. As far as I know, multiplying matrices with mismatched dimensions is impossible. If my chain rule is correct, how can it be implemented in Python?
Any idea?
Thanks in advance.
You need to apply dimension balancing to the gradients. Taken from Stanford's cs231n course, it comes down to two simple rules:
Given z = x·W, where x has shape 1×n and W has shape n×m, and writing delta = dL/dz (shape 1×m) for the upstream gradient, we will have:

dL/dW = x.T · delta (shape n×m), and dL/dx = delta · W.T (shape 1×n).
Here is the code I used to ensure the gradient calculation is correct. You should be able to update your code accordingly.
import torch

torch.random.manual_seed(0)
# Two input steps, a target, and an initial hidden state (all 1 x d).
x_1, x_2 = torch.zeros(size=(1, 8)).normal_(0, 0.01), torch.zeros(size=(1, 8)).normal_(0, 0.01)
y = torch.zeros(size=(1, 8)).normal_(0, 0.01)
h_0 = torch.zeros(size=(1, 16)).normal_(0, 0.01)
weight_ih = torch.zeros(size=(8, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_hh = torch.zeros(size=(16, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_ho = torch.zeros(size=(16, 8)).normal_(mean=0, std=0.01).requires_grad_(True)
# Forward pass: two recurrent steps, then an output projection.
h_1 = x_1.mm(weight_ih) + h_0.mm(weight_hh)
h_2 = x_2.mm(weight_ih) + h_1.mm(weight_hh)
g_2 = h_2.sigmoid()
j_2 = g_2.mm(weight_ho)
y_predicted = j_2.sigmoid()
loss = 0.5 * (y - y_predicted).pow(2).sum()
loss.backward()  # autograd gradients, used below as the reference
# Manual backward pass: deltas flow backward through delta.mm(W.t()),
# and each weight gradient is the outer product input.t() * delta.
delta_1 = -1 * (y - y_predicted) * y_predicted * (1 - y_predicted)  # 1 x 8
delta_2 = delta_1.mm(weight_ho.t()) * (g_2 * (1 - g_2))  # 1 x 16
delta_3 = delta_2.mm(weight_hh.t())  # 1 x 16
# 16 x 8
weight_ho_grad = g_2.t() * delta_1
# 16 x 16
weight_hh_grad = h_1.t() * delta_2 + (h_0.t() * delta_3)
# 8 x 16
weight_ih_grad = x_2.t() * delta_2 + x_1.t() * delta_3
# Compare the manual gradients against autograd's.
atol = 1e-10
assert torch.allclose(weight_ho.grad, weight_ho_grad, atol=atol)
assert torch.allclose(weight_hh.grad, weight_hh_grad, atol=atol)
assert torch.allclose(weight_ih.grad, weight_ih_grad, atol=atol)
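Note the pattern in the manual gradients above: every upstream delta is produced as delta.mm(W.t()), and every weight gradient is an outer product input.t() * delta. Those are exactly the two dimension-balancing rules stated at the top, and applying them consistently is what keeps every matrix product's dimensions compatible.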
I have this function that rotates the MNIST images and returns a PyTorch tensor. I am more familiar with TensorFlow, and I want to convert the PyTorch tensor to a numpy ndarray that I can use. Is there a function that will allow me to do that? I tried to modify the function a little by adding .numpy() after tensor(img.rotate(rotation)).view(784) and saving it in an empty ndarray, but that didn't work. The parameter d is MNIST data saved in .pt (a PyTorch tensor file, I think). Thanks! (I would also love to know if there is a TensorFlow function that can rotate the data.)
import random

import torch
from PIL import Image
from torchvision import transforms

def rotate_dataset(d, rotation):
    result = torch.FloatTensor(d.size(0), 784)
    tensor = transforms.ToTensor()
    for i in range(d.size(0)):
        img = Image.fromarray(d[i].numpy(), mode='L')
        result[i] = tensor(img.rotate(rotation)).view(784)
    return result

t = 1
min_rot = 1.0 * t / 20 * (180 - 0) + 0
max_rot = 1.0 * (t + 1) / 20 * (180 - 0) + 0
rot = random.random() * (max_rot - min_rot) + min_rot
rotate_dataset(x_tr, rot)  # x_tr: the MNIST training images, defined elsewhere
How about not converting to a tensor in the first place?
result[i] = np.array(img.rotate(rotation)).flatten()
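For completeness, here is a minimal sketch of the whole function with the result kept in numpy throughout (rotate_dataset_np is my name for it; everything else reuses the question's own names, and I'm assuming d holds uint8 image tensors as in the question):

import numpy as np
from PIL import Image

def rotate_dataset_np(d, rotation):
    # Same loop as above, but accumulate into a numpy array instead of a
    # torch.FloatTensor, so no tensor -> ndarray conversion is needed.
    result = np.empty((d.size(0), 784), dtype=np.float32)
    for i in range(d.size(0)):
        img = Image.fromarray(d[i].numpy(), mode='L')
        result[i] = np.array(img.rotate(rotation)).flatten()
    return result

Note that, unlike transforms.ToTensor(), this keeps the raw 0-255 pixel range; divide by 255.0 if you need values in [0, 1].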
I'm having trouble using numpy to parallelize the for loop below (get_new_weights). With my first attempt at df_dm in update_weights, the weight is completely wrong. With my second attempt at df_dm, the weight overshoots the optimal weight.
Note: bias is a single number and weight is a single number (one-variable linear regression), X has shape (442, 1), and y has shape (442, 1). Also note that updating the bias term works perfectly in update_weights; it's just the weight update that I'm having trouble with.
# This is the for loop that I am trying to parallelize with numpy:
def get_new_weights(X, y, weight, bias, learning_rate=0.01):
    weight_deriv = 0
    bias_deriv = 0
    total = len(X)
    for i in range(total):
        # -2x(y - (mx + b))
        weight_deriv += -2*X[i] * (y[i] - (weight*X[i] + bias))
        # -2(y - (mx + b))
        bias_deriv += -2*(y[i] - (weight*X[i] + bias))
    weight -= (weight_deriv / total) * learning_rate
    bias -= (bias_deriv / total) * learning_rate
    return weight, bias
# This is my attempt at parallelization
def update_weights(X, y, weight, bias, lr=0.01):
    df_dm = np.average(-2*X * (y-(weight*X+bias)))  # this was my first guess
    # df_dm = np.average(np.dot((-X).T, ((weight*X+bias)-y)))  # this was my second guess
    df_db = np.average(-2*(y-(weight*X+bias)))
    weight = weight - (lr*df_dm)
    bias = bias - (lr*df_db)
    return weight, bias
This is the equation I am using for updating my weight and bias (in the notation of the code, with N = len(X)):

m := m - lr * (1/N) * sum_i( -2 * X[i] * (y[i] - (m * X[i] + b)) )
b := b - lr * (1/N) * sum_i( -2 * (y[i] - (m * X[i] + b)) )
Thanks to everyone who took a look at my question. I was using the term "parallelization" loosely, to refer to the runtime optimization I was looking for by removing the need for a for loop. The answer to this problem is:
df_dm = (1/len(X)) * np.dot((-2*X).T, (y-(weight*X+bias)))
The issue here was making sure that all of the arrays resulting from the intermediate steps had the correct shape. And, for those interested in the runtime difference between these two functions: the for loop took 10 times longer.
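A self-contained sketch of the corrected function (the .item() call is my addition, to keep weight a plain scalar rather than a 1×1 array; the random test data is made up purely for illustration):

import numpy as np

def update_weights(X, y, weight, bias, lr=0.01):
    # Vectorized form of the loop in get_new_weights: one dot product
    # replaces the per-sample accumulation of the weight derivative.
    df_dm = (1 / len(X)) * np.dot((-2 * X).T, (y - (weight * X + bias))).item()
    df_db = np.average(-2 * (y - (weight * X + bias)))
    return weight - lr * df_dm, bias - lr * df_db

rng = np.random.default_rng(0)
X = rng.normal(size=(442, 1))  # same shapes as in the question
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=(442, 1))
w, b = update_weights(X, y, weight=0.0, bias=0.0)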
I want to define the following function of two variables in Theano and compute its Jacobian:
f(x1,x2) = sum((2 + 2k - exp(k*x1) - exp(k*x2))^2, k = 1..10)
How do I make a Theano function for the above expression - and eventually minimize it using its Jacobian?
Since your function is scalar, the Jacobian reduces to the gradient. Assuming your two variables x1, x2 are scalars (it looks that way from the formula, and this is easily generalized to other objects), you can write
import theano
import theano.tensor as T

x1 = T.fscalar('x1')
x2 = T.fscalar('x2')
k = T.arange(1, 11)  # k = 1..10; note that arange excludes the stop value
expr = ((2 + 2 * k - T.exp(x1 * k) - T.exp(x2 * k)) ** 2).sum()
func = theano.function([x1, x2], expr)
You can call func on two scalars; with k = 1..10 the value at (0.25, 0.25) is approximately:

In [1]: func(0.25, 0.25)
Out[1]: array(132.1137, dtype=float32)
The gradient (Jacobian) is then
grad_expr = T.grad(cost=expr, wrt=[x1, x2])
And you can use updates in theano.function in the standard way (see the Theano tutorials) to implement gradient descent: set x1, x2 up as shared variables and update them via updates, drive the loop by hand at the Python level, or use scan as indicated by others; a minimal sketch follows.
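A minimal gradient-descent sketch along those lines, using shared variables (the starting point, learning rate, and iteration count are arbitrary choices of mine, not from the original answer):

import numpy as np
import theano
import theano.tensor as T

# Shared variables hold the current iterate; updates= rewrites them in place.
x1 = theano.shared(np.float32(0.25), name='x1')
x2 = theano.shared(np.float32(0.25), name='x2')
k = T.arange(1, 11).astype('float32')  # keep everything float32
expr = ((2 + 2 * k - T.exp(x1 * k) - T.exp(x2 * k)) ** 2).sum()

g1, g2 = T.grad(expr, [x1, x2])
lr = np.float32(1e-4)
step = theano.function([], expr,
                       updates=[(x1, x1 - lr * g1), (x2, x2 - lr * g2)])

for _ in range(100):
    loss = step()  # returns the loss computed *before* the update is applied
print(x1.get_value(), x2.get_value())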
I'm learning about neural networks, specifically looking at MLPs with a back-propagation implementation. I'm trying to implement my own network in Python, and I thought I'd look at some other libraries before I started. After some searching I found Neil Schemenauer's Python implementation bpnn.py (http://arctrix.com/nas/python/bpnn.py).
Having worked through the code and read the first part of Christopher M. Bishop's book 'Neural Networks for Pattern Recognition', I found an issue in the backPropagate function:
# calculate error terms for output
output_deltas = [0.0] * self.no
for k in range(self.no):
    error = targets[k]-self.ao[k]
    output_deltas[k] = dsigmoid(self.ao[k]) * error
The line of code that calculates the error is different in Bishop's book. On page 145, equation 4.41, he defines the output unit's error as:
d_k = y_k - t_k
where y_k are the outputs and t_k are the targets. (I'm using _ to represent a subscript.)
So my question is: should this line of code:
error = targets[k]-self.ao[k]
in fact be:
error = self.ao[k] - targets[k]
I'm most likely completely wrong, but could someone help clear up my confusion, please? Thanks.
It all depends on the error measure you use. To give just a few examples of error measures (for brevity, I'll use ys to mean a vector of n outputs and ts to mean a vector of n targets):
mean squared error (MSE):
sum((y - t) ** 2 for (y, t) in zip(ys, ts)) / n
mean absolute error (MAE):
sum(abs(y - t) for (y, t) in zip(ys, ts)) / n
mean logistic error (MLE):
sum(-log(y) * t - log(1 - y) * (1 - t) for (y, t) in zip(ys, ts)) / n
Which one you use depends entirely on the context. MSE and MAE can be used for when the target outputs can take any values, and MLE gives very good results when your target outputs are either 0 or 1 and when y is in the open range (0, 1).
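A small self-contained check (my own numbers, purely illustrative) computing all three measures:

from math import log

ys = [0.9, 0.2]  # outputs
ts = [1.0, 0.0]  # targets
n = len(ys)

mse = sum((y - t) ** 2 for (y, t) in zip(ys, ts)) / n
mae = sum(abs(y - t) for (y, t) in zip(ys, ts)) / n
mle = sum(-log(y) * t - log(1 - y) * (1 - t) for (y, t) in zip(ys, ts)) / n

print(mse, mae, mle)  # 0.025, 0.15, ~0.1643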
With that said, I haven't seen the errors y - t or t - y used before (I'm not very experienced in machine learning myself). As far as I can see, the source code you provided doesn't square the difference or use the absolute value; are you sure the book doesn't either? The way I see it, y - t or t - y can't be very good error measures, and here's why:
n = 2 # We only have two output neurons
ts = [ 0, 1 ] # Our target outputs
ys = [ 0.999, 0.001 ] # Our sigmoid outputs
# Notice that your outputs are the exact opposite of what you want them to be.
# Yet, if you use (y - t) or (t - y) to measure your error for each neuron and
# then sum up to get the total error of the network, you get 0.
t_minus_y = (0 - 0.999) + (1 - 0.001)
y_minus_t = (0.999 - 0) + (0.001 - 1)
Edit: Per alfa's comment, in the book, y - t is actually the derivative of MSE. In that case, t - y is incorrect. Note, however, that the actual derivative of MSE is 2 * (y - t) / n, not simply y - t.
If you don't divide by n (so you actually have a summed squared error (SSE), not a mean squared error), then the derivative would be 2 * (y - t). Furthermore, if you use SSE / 2 as your error measure, then the 1 / 2 and the 2 in the derivative cancel out and you are left with y - t.
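As a quick sanity check (my own addition, not from the original answer), a finite-difference verification that the gradient of SSE / 2 with respect to each output really is y - t:

def sse_half(ys, ts):
    # SSE / 2, the error measure for which the derivative is exactly y - t
    return 0.5 * sum((y - t) ** 2 for (y, t) in zip(ys, ts))

ys, ts = [0.999, 0.001], [0.0, 1.0]
eps = 1e-6
for i in range(len(ys)):
    ys_bumped = list(ys)
    ys_bumped[i] += eps
    numeric = (sse_half(ys_bumped, ts) - sse_half(ys, ts)) / eps
    analytic = ys[i] - ts[i]
    assert abs(numeric - analytic) < 1e-4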
You have to backpropagate the derivative of
0.5*(y-t)^2 or 0.5*(t-y)^2 with respect to y,
which in both cases is
y-t, since d/dy 0.5*(y-t)^2 = (y-t)*(+1) and d/dy 0.5*(t-y)^2 = (t-y)*(-1) = y-t
You can study this implementation of an MLP from the Padasip library; the documentation is here.
In actual code, we often calculate the NEGATIVE gradient (of the loss with regard to w) and use w += eta*grad to update the weights; strictly speaking, that is a gradient-ascent step on the negated gradient. In some textbooks, the POSITIVE gradient is calculated and w -= eta*grad updates the weights. The two conventions are equivalent.
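A trivial sketch (the numbers are made up) showing that the two conventions produce the same update:

eta, w = 0.1, 0.5
grad = 0.3                    # dLoss/dw, the "textbook" positive gradient

w_textbook = w - eta * grad   # textbook convention: w -= eta * grad
w_code = w + eta * (-grad)    # code convention: negate first, then w += eta * grad

assert abs(w_textbook - w_code) < 1e-12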