Adam Optimizer in Style Transfer - python

I am trying to work through the exercise questions in this style transfer tutorial. Does anyone know how to replace the basic gradient descent with the Adam optimizer?
I think this code may be the place to change. Thank you very much for your help.
# Reduce the dimensionality of the gradient.
grad = np.squeeze(grad)
# Scale the step-size according to the gradient-values.
step_size_scaled = step_size / (np.std(grad) + 1e-8)
# Update the image by following the gradient.
mixed_image -= grad * step_size_scaled

Referring to slides 36 and 37 of Stanford's CS231n slides,
first_moment = 0
second_moment = 0
must be declared above the for i in range(num_iterations): line present in that GitHub file. Also, initialize the beta1 and beta2 variables used below according to your requirements (the usual Adam defaults are beta1 = 0.9 and beta2 = 0.999). Then, you can replace your code block with the following:
# Reduce the dimensionality of the gradient.
grad = np.squeeze(grad)
# Calculate the moment estimates.
first_moment = beta1 * first_moment + (1 - beta1) * grad
second_moment = beta2 * second_moment + (1 - beta2) * grad * grad
# Bias correction steps (use i + 1 so the denominators are non-zero at i = 0).
first_unbias = first_moment / (1 - beta1 ** (i + 1))
second_unbias = second_moment / (1 - beta2 ** (i + 1))
# Update the image by following the gradient (the Adam update).
mixed_image -= step_size * first_unbias / (np.sqrt(second_unbias) + 1e-8)

I initialize beta1 and beta2 like this:
beta1 = tf.Variable(0, name='beta1')
beta2 = tf.Variable(0, name='beta2')
session.run([beta1.initializer, beta2.initializer])
However, something goes wrong: AttributeError: 'Tensor' object has no attribute 'sqrt'.
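For what it's worth, the AttributeError most likely comes from making beta1 and beta2 tf.Variables: that turns first_moment and second_moment into TensorFlow tensors, which np.sqrt cannot handle, and a beta value of 0 would disable the moving averages anyway. Since this update loop runs in NumPy, plain Python floats are enough. Below is a minimal, self-contained sketch of the same Adam update on a toy objective (all names and values here are illustrative, not from the tutorial):

import numpy as np

# Standard Adam hyperparameters as plain floats, not tf.Variables.
beta1, beta2, step_size, eps = 0.9, 0.999, 0.1, 1e-8

x = np.random.randn(4, 4)   # stand-in for mixed_image
m = np.zeros_like(x)        # first moment estimate
v = np.zeros_like(x)        # second moment estimate

for i in range(100):
    grad = 2 * x            # gradient of the toy objective sum(x**2)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** (i + 1))
    v_hat = v / (1 - beta2 ** (i + 1))
    x -= step_size * m_hat / (np.sqrt(v_hat) + eps)  # np.sqrt works on NumPy arrays

print(np.abs(x).max())  # should have shrunk toward 0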

Related

Implement a system of stochastic ODEs using python

I want to add noise to a system of ODEs from (Ramos et al. 2021), a kind of SIR model (the system itself was given as an image). I implemented the Milstein scheme on the relevant equations:
# Create Brownian motion increments
np.random.seed(1)
dS = np.sqrt(dt) * np.random.randn(tmax)
dE = np.sqrt(dt) * np.random.randn(tmax)
dI = np.sqrt(dt) * np.random.randn(tmax)
dIu = np.sqrt(dt) * np.random.randn(tmax)
dDu = np.sqrt(dt) * np.random.randn(tmax)
dHR = np.sqrt(dt) * np.random.randn(tmax)
dHD = np.sqrt(dt) * np.random.randn(tmax)
dB = [dS, dE, dI, dIu, dDu, dHR, dHD]
sigma = [0.5, 0, 0, 0, 0, 0, 0]
# Brief sketch of the systemf function, which evaluates the right-hand side
# at time t for each variant i
for i in range(nvariants):
    newe = S[0]*(mbetae[i]*E[i] + mbetai[i]*I[i] + mbetaiu[i]*Iu[i] + mbetahr[i]*HR[i] + mbetahd[i]*HD[i])/totalpop
    newi = gammae * E[i]
    newhid = gammai * I[i]
    newhiu = gammaiu * Iu[i]
    newr = gammahr * HR[i]
    newd = gammahd * HD[i]
    newq = gammaq * Q[i]
    neweS = neweS + newe
    fE[i] = newe - newi
    fI[i] = newi - newhid
    fIu[i] = (1 - theta[i] - omegau)*newhid - newhiu
    fHR[i] = p[i]*(theta[i] - fatrate[i])*newhid - newr
    fHD[i] = fatrate[i]*newhid - newd
    fQ[i] = (1 - p[i])*(theta[i] - fatrate[i])*newhid + newr - newq
fS[0] = -neweS - (vjRK[int(mt.floor(t))])  # fS.insert(0, -neweS - (vjRK[mt.floor(t)]))
return [fS, fE, fI, fIu, fHR, fHD, fQ]
for t in range(delayini, tmax - 1):
    fsyseval = systemf(t, states[t], beta[2*t], gamma[2*t], frate[2*t], theta[2*t], p[2*t], omegau[2*t], vjsum)
    # Running the scheme
    for s in range(numstates):
        for i in range(nvariants):
            states[t+1][s][i] = (states[t][s][i] + fsyseval[s][i]*dt
                                 + sigma[s]*dB[s][t]*states[t][s][i]
                                 + 0.5*sigma[s]**2 * states[t][s][i] * (dB[s][t]**2 - dt))
The problem is that when I plot the results for each variable (susceptible, infected, ...), the result is very strange and has nothing to do with the deterministic model (I see no fluctuations, and the shape is not even close to the deterministic one), which seems illogical. So I thought that maybe I did not implement the stochastic scheme correctly and missed something.
Now I want to know whether my implementation of the stochasticity is correct (and if so, why the results show no fluctuations despite the high level of noise).
If not, how can I add the stochastic part correctly?
Thanks in advance for your help.
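As a reference point for the scheme itself, here is a minimal, self-contained Milstein discretization of a scalar SDE with the same multiplicative-noise form, dX = mu*X dt + sigma*X dW (all parameter values here are illustrative). With sigma = 0.5 its paths fluctuate visibly, which can serve as a sanity check for the update formula:

import numpy as np

# Milstein scheme for geometric Brownian motion dX = mu*X dt + sigma*X dW.
# For b(X) = sigma*X the correction term is 0.5*sigma^2*X*(dW^2 - dt),
# matching the correction used in the question.
np.random.seed(1)
mu, sigma = 0.1, 0.5
dt, tmax = 0.01, 1000
dW = np.sqrt(dt) * np.random.randn(tmax)

X = np.empty(tmax)
X[0] = 1.0
for t in range(tmax - 1):
    X[t + 1] = (X[t]
                + mu * X[t] * dt                             # drift
                + sigma * X[t] * dW[t]                       # diffusion
                + 0.5 * sigma**2 * X[t] * (dW[t]**2 - dt))   # Milstein correction

print(X.min(), X.max())  # the path should show clear random fluctuations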

Backpropagation bug

I am trying to implement backpropagation from scratch. While my cost is decreasing, the gradient check yields a whopping 0.767399376130221. I've been trying to figure out what's wrong and managed to slim the code down to these few lines:
def forward(self, X, y):
    z2 = self.params_l1.dot(X.T)
    a2 = self.sigmoid(z2)
    z3 = self.params_l2.dot(a2)
    a3 = self.sigmoid(z3)
    loss = self.cross_entropy(a3, y)
    return a3, loss, z2, a2, z3

def backward(self, X, y):
    n_examples = len(X)
    yh, loss, Z2, A2, Z3 = self.forward(X, y)
    delta3 = np.multiply(-(yh - y), self.dsigmoid(Z3))
    delta2 = np.dot(self.params_l2.T, delta3) * self.dsigmoid(Z2)
    de3 = np.dot(delta3, A2.T)
    de2 = np.dot(delta2, X)
    self.params_l2 = self.params_l2 - self.lr * (de3 / n_examples)
    self.params_l1 = self.params_l1 - self.lr * (de2 / n_examples)
    return de3 / n_examples, de2 / n_examples
It is a simple (2, 2, 1) MLP. I'm using cross-entropy as the loss function and following the chain rule for the backprop.
I suspect the problem may lie in the order in which I take the products, but I have tried every which way and still had no luck.
I managed to get a difference of 1.7250119005319425e-10 by computing delta3 as just yh - y, with no further multiplications. Now I need to figure out why that is.
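That observation is actually the explanation: with a sigmoid output and a cross-entropy loss, the sigmoid derivative cancels out of the gradient, so delta3 needs no extra dsigmoid factor. A short derivation:

L = -\big(y \log a_3 + (1 - y)\log(1 - a_3)\big), \qquad a_3 = \sigma(z_3)

\frac{\partial L}{\partial a_3} = \frac{a_3 - y}{a_3(1 - a_3)}, \qquad \frac{\partial a_3}{\partial z_3} = a_3(1 - a_3)

\frac{\partial L}{\partial z_3} = \frac{\partial L}{\partial a_3}\cdot\frac{\partial a_3}{\partial z_3} = a_3 - y

So delta3 = yh - y is already the exact gradient with respect to z3; multiplying by dsigmoid(Z3) double-counts the sigmoid derivative, and the leading minus sign in -(yh - y) flips it, which is why the gradient check failed.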

Multiplying unmatched matrices in backpropagation through time

I am implementing binary addition with a recurrent neural network (RNN) as a sample exercise. I ran into an issue implementing it in Python, so I decided to share my problem here to gather ideas for fixing it.
As can be seen in my notebook code (the Backpropagation through time (BPTT) section), there is a chain rule for updating the input weight matrix (given as an image in the notebook), and my problem is with one part of it.
I've tried to implement that part in my Python code or notebook code (class input_layer, backward method), but the unmatched dimensions raise an error.
In my sample code, W_hidden is 16x16, whereas the result of delta pre_hidden is 1x2. This causes the error; if you run the code, you will see it.
I spent a lot of time checking my chain rule as well as my code. I believe the chain rule is right, so the only possible source of the error is my code.
As far as I know, multiplying matrices with unmatched dimensions is impossible. If my chain rule is correct, how can it be implemented in Python?
Any ideas?
Thanks in advance.
You need to apply dimension balancing to the gradients. Taken from Stanford's CS231n course, it comes down to two simple modifications:
Given $y = xW$ with $x \in \mathbb{R}^{1 \times n}$ and $W \in \mathbb{R}^{n \times m}$, and an upstream gradient $\delta = \partial L / \partial y \in \mathbb{R}^{1 \times m}$, we will have:

$$\frac{\partial L}{\partial W} = x^{\top} \delta, \qquad \frac{\partial L}{\partial x} = \delta\, W^{\top}$$
Here is the code I used to ensure the gradient calculation is correct. You should be able to update your code accordingly.
import torch
torch.random.manual_seed(0)
x_1, x_2 = torch.zeros(size=(1, 8)).normal_(0, 0.01), torch.zeros(size=(1, 8)).normal_(0, 0.01)
y = torch.zeros(size=(1, 8)).normal_(0, 0.01)
h_0 = torch.zeros(size=(1, 16)).normal_(0, 0.01)
weight_ih = torch.zeros(size=(8, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_hh = torch.zeros(size=(16, 16)).normal_(mean=0, std=0.01).requires_grad_(True)
weight_ho = torch.zeros(size=(16, 8)).normal_(mean=0, std=0.01).requires_grad_(True)
h_1 = x_1.mm(weight_ih) + h_0.mm(weight_hh)
h_2 = x_2.mm(weight_ih) + h_1.mm(weight_hh)
g_2 = h_2.sigmoid()
j_2 = g_2.mm(weight_ho)
y_predicted = j_2.sigmoid()
loss = 0.5 * (y - y_predicted).pow(2).sum()
loss.backward()
delta_1 = -1 * (y - y_predicted) * y_predicted * (1 - y_predicted)
delta_2 = delta_1.mm(weight_ho.t()) * (g_2 * (1 - g_2))
delta_3 = delta_2.mm(weight_hh.t())
# 16 x 8
weight_ho_grad = g_2.t() * delta_1
# 16 x 16
weight_hh_grad = h_1.t() * delta_2 + (h_0.t() * delta_3)
# 8 x 16
weight_ih_grad = x_2.t() * delta_2 + x_1.t() * delta_3
atol = 1e-10
assert torch.allclose(weight_ho.grad, weight_ho_grad, atol=atol)
assert torch.allclose(weight_hh.grad, weight_hh_grad, atol=atol)
assert torch.allclose(weight_ih.grad, weight_ih_grad, atol=atol)

Gradient descent basic algorithm overshoots and doesn't converge in python

So I'm new to learning ML, and gradient descent is the first algorithm I would like to learn well. I wrote my first version of the code and have looked online for my issue, but for lack of concrete knowledge I'm having a hard time diagnosing it. My gradient begins by approaching the correct answer, but once the error has been cut by a factor of 8, the algorithm loses its way: the b-value begins to go negative and the m-value goes past the target value. I'm sorry if I worded this oddly; hopefully the code will help.
I am learning this from multiple sources on YouTube and Google. I have been following Siraj Raval's Math of Intelligence playlist on YouTube. I understood how the underlying algorithm works, but I decided to take my own approach and it seems not to be working too well. I'm struggling to read online resources, as I'm inexperienced in what each algorithm means and how it is implemented in Python. I know this issue has something to do with training and testing, but I don't know where to apply this.
def gradient_updater(error, mcurr, bcurr):
    for i in x:
        # gets the predicted y-value
        ypred = (mcurr * i) + bcurr
        # uses partial derivative formula to get new m and b
        new_m = -(2/N) * sum(x*(y - ypred))
        new_b = -(2/N) * sum(y - ypred)
    # applies the new b and m value
    mcurr = mcurr - (learning_rate * new_m)
    bcurr = bcurr - (learning_rate * new_b)
    return mcurr, bcurr
def run(iterations, initial_m, initial_b):
    current_m = initial_m
    current_b = initial_b
    for i in range(iterations):
        error = get_error(current_m, current_b)
        current_m, current_b = gradient_updater(error, current_m, current_b)
        print(current_m, current_b, error)
I expected the m and b values to converge to specific values; this didn't occur, and the values kept increasing in opposite directions.
If I am understanding your code correctly, I think your problem is that you are taking the partial derivative at just one point to get your new slope and intercept. I'm not sure what exactly some of the variables within gradient_updater are, so I will try to provide an example that better explains the concept.
I'm not sure we are calculating the optimization in the same way. In my code, b1 is your 'm' in y = mx + b and b0 is your 'b' in that same equation. The following code accumulates totals b1_temp and b0_temp that will be divided by the batch size to produce a new b0 and b1 to fit your graph:
for i in range(len(X)):
    ERROR = ERROR + (b1*X[i] + b0 - Y[i])**2
    b1_temp = b1_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])*X[i]
    b0_temp = b0_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])
I run through this for every value within my dataset, where X[i] and Y[i] represent an individual datapoint.
Next, I adjust the slope that is currently fitting the graph:
b1_temp = b1_temp / batch_size
b0_temp = b0_temp / batch_size
b0 = b0 - learning_rate * b0_temp
b1 = b1 - learning_rate * b1_temp
b1_temp = 0
b0_temp = 0
Here batch_size can just be taken as len(X). I run through this for some number of epochs (a for loop over, say, 100 iterations should work), and the line of best fit adjusts accordingly over time. The overall concept is to decrease the distance between each point and the line until that distance is at a minimum.
Hope I was able to explain this better and give you a basic code base to build yours on!
Here's where I think the error in your code lies: the calculation of the gradient. I believe your cost function is similar to the one used in https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html. To compute the gradient, you need to aggregate the effects of all the partial derivatives. In your implementation, however, you iterate over x without accumulating the effects, so your new_m and new_b end up being calculated only for the final element of x (items marked 1 and 2 below).
Your implementation:
def gradient_updater(error, mcurr, bcurr):
    for i in x:
        # gets the predicted y-value
        ypred = (mcurr * i) + bcurr
        # uses partial derivative formula to get new m and b
        new_m = -(2/N) * sum(x*(y - ypred))  #-- 1 --
        new_b = -(2/N) * sum(y - ypred)  #-- 2 --
    # applies the new b and m value  <-- Indent this block to place it inside the for loop
    mcurr = mcurr - (learning_rate * new_m)
    bcurr = bcurr - (learning_rate * new_b)
    return mcurr, bcurr
That said, I think your implementation would come closer to the mathematical formula if you simply updated mcurr and bcurr in every iteration (see the inline comment). The other thing to do is to divide both sum(x*(y - ypred)) and sum(y - ypred) by N as well when computing new_m and new_b.
Note: since I do not know what your actual cost function is, I just want to point out that you are also using a constant y value in your code. It is more likely an array of different values that should be referenced as Y[i] and X[i] respectively.
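Putting the two answers together, here is a minimal corrected sketch. It assumes x and y are NumPy arrays (the data here is synthetic and purely illustrative) and computes the full-batch gradient once per call, so no inner loop over the points is needed:

import numpy as np

# Hypothetical data: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3 * x + 2 + rng.normal(0, 0.5, size=50)
N = len(x)
learning_rate = 0.01

def gradient_updater(mcurr, bcurr):
    # Full-batch gradient of the MSE cost for y = m*x + b.
    # The sums already aggregate every data point, so there is
    # no per-point loop and no overwriting of new_m / new_b.
    ypred = mcurr * x + bcurr
    new_m = -(2 / N) * np.sum(x * (y - ypred))
    new_b = -(2 / N) * np.sum(y - ypred)
    return mcurr - learning_rate * new_m, bcurr - learning_rate * new_b

m, b = 0.0, 0.0
for _ in range(5000):
    m, b = gradient_updater(m, b)
print(m, b)  # should end up close to 3 and 2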

How do you update the weights in function approximation with reinforcement learning?

My SARSA with gradient descent keeps escalating the weights exponentially. At episode 4, step 17, the value is already nan:
Exception: Qa is nan
For example:
6) Qa:
Qa = -2.00890180632e+303
7) NEXT Qa:
Next Qa with west = -2.28577776413e+303
8) THETA:
1.78032402991e+303 <= -0.1 + (0.1 * -2.28577776413e+303) - -2.00890180632e+303
9) WEIGHTS (sample)
5.18266630725e+302 <= -1.58305782482e+301 + (0.3 * 1.78032402991e+303 * 1)
I don't know where to look for the mistake I made.
Here's some code FWIW:
def getTheta(self, reward, Qa, QaNext):
    """ let t = r + y*Qw(s',a') - Qw(s,a) """
    theta = reward + (self.gamma * QaNext) - Qa
    return theta

def updateWeights(self, Fsa, theta):
    """ wi <- wi + alpha * theta * Fi(s,a) """
    for i, w in enumerate(self.weights):
        self.weights[i] += (self.alpha * theta * Fsa[i])
I have about 183 binary features.
You need normalization in each trial. This will keep the weights in a bounded range (e.g. [0, 1]). The way you are adding to the weights each time just grows them, and they become useless after the first trial.
I would do something like this:
self.weights[i] += (self.alpha * theta * Fsa[i])
normalize(self.weights[i], wmin, wmax)
or see similar examples in the RL literature. You need to write the normalization function yourself, though ;)
I do not have access to the full code of your application, so I might be wrong, but I think I know where you are going wrong.
First and foremost, normalization should not be necessary here. For the weights to blow up this soon suggests something is wrong with your implementation.
I think your update equation should be:
self.weights[:, action_i] = self.weights[:, action_i] + (self.alpha * theta * Fsa[i])
That is to say, you should be updating columns instead of rows, because rows are for states and columns are for actions in the weight matrix.
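For concreteness, here is a minimal sketch of that column-wise update for linear SARSA (the shapes, names, and values are illustrative assumptions, not the asker's actual code):

import numpy as np

# Hypothetical setup: 183 binary features, 4 actions.
n_features, n_actions = 183, 4
weights = np.zeros((n_features, n_actions))
alpha, gamma = 0.01, 0.9

def sarsa_update(weights, Fsa, action_i, reward, QaNext):
    # Q(s, a) is the dot product of the features with the taken action's column.
    Qa = Fsa @ weights[:, action_i]
    # TD error: r + gamma * Q(s', a') - Q(s, a)
    theta = reward + gamma * QaNext - Qa
    # Update only the column of the action actually taken.
    weights[:, action_i] += alpha * theta * Fsa
    return weights

# One example step: sparse binary features, action 2, reward -0.1.
Fsa = (np.random.rand(n_features) < 0.1).astype(float)
weights = sarsa_update(weights, Fsa, action_i=2, reward=-0.1, QaNext=0.0)

Note also that with this many active binary features the TD error feeds into many weights at once, so a small alpha (well below the 0.3 visible in the trace) helps keep the weights bounded without any ad hoc normalization.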
