I'm trying to make a start with neural networks from the very beginning, which means toying with perceptrons. At the moment I'm trying to implement batch gradient descent, following pseudocode from a guide.
I've tried implementing it as below with some dummy data and noticed it isn't particularly accurate. It converges, to what I presume is some local minimum.
My question is:
What ways are there for me to check that this is in fact a local minimum? I've been looking into how to plot this, but I'm unsure how to actually go about doing it. In addition, is there a way to achieve a more accurate result using gradient descent? Or would I have to use a more complex approach, or possibly run it numerous times starting from different random weights, to try and find the global minimum?
I had a look around the forum before posting this, but didn't find much information that made me feel confident that what I'm doing here, or what's happening, is in fact correct, so any help would be great.
import numpy as np


def main():
    learningRate = 0.1
    np.random.seed(1)

    trainingInput = np.asmatrix([
        [1, -1],
        [2, 1],
        [1.5, 0.5],
        [2, -1],
        [1, 2]
    ])

    # Prepend a column of ones so the first weight acts as a bias term
    biasAccount = np.ones((5, 1))
    trainingInput = np.append(biasAccount, trainingInput, axis=1)

    trainingOutput = np.asmatrix([
        [0],
        [1],
        [0],
        [0],
        [1]
    ])

    # Random initial weights in [-1, 0)
    weights = 1 * np.random.random((3, 1)) - 1

    for iteration in range(10000):
        prediction = np.dot(trainingInput, weights)
        print("Weights: \n" + str(weights))
        print("Prediction: \n" + str(prediction))

        error = trainingOutput - prediction
        print("Error: \n" + str(error))

        # Batch update: accumulate the gradient over all samples at once
        intermediateResult = np.dot(error.T, trainingInput)
        delta = learningRate * intermediateResult
        print("Delta: \n" + str(delta))

        weights += delta.T


main()
There is no guarantee that you'll find the global minimum. Often, people perform multiple runs and take the best one. More advanced approaches include decaying the learning rate, using an adaptive learning rate (e.g. with RMSProp or Adam), or using GD with momentum.
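As a minimal sketch of the multiple-runs idea (not part of the original post; X and t stand for the bias-augmented trainingInput and trainingOutput from the question, and train() is a hypothetical helper that repeats the update loop above without the prints):

import numpy as np

def train(X, t, lr=0.1, iters=10000):
    # Same batch update as in the question, just without the prints
    w = 2 * np.random.random((X.shape[1], 1)) - 1
    for _ in range(iters):
        w += lr * np.dot(X.T, t - np.dot(X, w))
    return w

def best_of_restarts(X, t, restarts=10):
    # Run from several random initialisations and keep the weights with the lowest squared error
    runs = [train(X, t) for _ in range(restarts)]
    return min(runs, key=lambda w: float(np.sum(np.square(t - np.dot(X, w)))))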
There are multiple ways to monitor convergence:
Use the loss (hint: the gradient of the squared error is -X^T (t - Xw), which is essentially what the update in the code already computes); check for small values or for small changes.
Early stopping: check that the error on a (held-out) validation set decreases; if it doesn't, stop training.
(Possibly, you could even check the distances between the weight vectors in consecutive steps to see if anything is still changing.)
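A sketch of those monitoring ideas (again assuming X and t are the data matrices from the question; the tolerance value is arbitrary):

import numpy as np

def train_with_monitoring(X, t, lr=0.1, max_iters=10000, tol=1e-10):
    w = 2 * np.random.random((X.shape[1], 1)) - 1
    prev_loss = np.inf
    for i in range(max_iters):
        error = t - np.dot(X, w)
        loss = float(np.mean(np.square(error)))   # track the loss each iteration
        step = lr * np.dot(X.T, error)
        if abs(prev_loss - loss) < tol:           # loss barely changes -> stop
            print("Converged at iteration", i, "loss", loss,
                  "last weight change", float(np.linalg.norm(step)))
            break
        w += step
        prev_loss = loss
    return w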
** Edit: to avoid ambiguity, my problem is about "too many" ops, i.e. a large number of ops, not about the numerical value of a certain op.
My old title was: How to avoid a large number of ops from a loop in tensorflow?
** The main text follows.
Hi, Guys. I'm using tensorflow 1.12 to implement a network.
The background is:
In my loss function, I want to compute the loss over a large number (about 5000) of random point pairs sampled from every input image, which costs me a lot of memory and time. The indices of the point pairs are chosen by a certain method with some randomness, so I think it's impossible to fold the loop into a single matrix operation. It looks like this (inside my_loss_func()):
# gt & pred are tensors of shape [w, h]
# x1, y1, x2, y2 are lists of the point pairs' coordinates, randomly sampled from the input image
# for example, x1 = [1, 2, 3], y1 = [4, 5, 6], x2 = [7, 8, 9], y2 = [10, 11, 12]
# then it will compute the loss between gt[1, 4] & pred[7, 10], gt[2, 5] & pred[8, 11], etc.
num_pairs = 5000
...
# compute loss for this batch
for i in range(num_pairs):
    diff = gt[x1[i], y1[i]] - pred[x2[i], y2[i]]
    loss = some_math_computation(diff)
return loss/num_pairs
The problem is:
If I build the graph like above, it will create the loss-computing op 5k times, which costs me a lot. In fact, every time I run the program, I must wait about 10 minutes before the first batch of data gets trained. In my DEBUG log, I found that the loop runs only about 25 times per second, so the large time cost definitely comes from here. I feel like such a noob.
Can you tell me how to avoid building the graph like this? Btw, I'm using tf.estimator, so calling sess.run() directly isn't an option.
My opinions:
Maybe I can build the whole graph once and save it to a file. Then every time I run the model, I can load the graph from that file, so I won't need to wait such a long time. Will that work?
Please tell me the correct way to implement this kind of loss. I think this kind of loss is quite common in some fields of computer vision.
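For reference, here is a hedged sketch (not the asker's code) of how this kind of pairwise loss is often vectorized: if the sampled coordinates can be provided as integer tensors, a single tf.gather_nd per image collects all sampled values at once, so the graph contains one small chain of ops instead of one chain per pair. some_math_computation is assumed to be elementwise and is replaced by tf.square as a stand-in.

import tensorflow as tf

def vectorized_pair_loss(gt, pred, x1, y1, x2, y2):
    # gt, pred: [w, h] tensors; x1, y1, x2, y2: [num_pairs] int32 tensors
    gt_vals = tf.gather_nd(gt, tf.stack([x1, y1], axis=1))      # [num_pairs]
    pred_vals = tf.gather_nd(pred, tf.stack([x2, y2], axis=1))  # [num_pairs]
    diff = gt_vals - pred_vals
    per_pair = tf.square(diff)   # stand-in for the asker's some_math_computation
    return tf.reduce_mean(per_pair)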
I'm learning to train a Linear Regression model via TensorFlow.
It's quite a simple formula:
y = W * x + b
I have generated some sample data:
After training the model, I can see in TensorBoard that "W" is correct, while "b" goes a completely wrong way. So the loss is quite high.
Here is my code.
QUESTION
Why is "b" being trained a wrong way?
Shall I do something with the optimizer?
On line 16, you are adding Gaussian noise with a standard deviation of 300!!
noise = np.random.normal(scale=n, size=(N, 1))
Try using:
noise = np.random.normal(size=(N, 1))
That's using mean=0 and std=1 (standard Gaussian noise).
Also, 20k iterations is more than enough (in this problem) for training.
For a more comprehensive explanation of what is happening, look at your plot. Given an x value, the possible values of y differ by thousands of units. That means there are a lot of lines that explain your data. Hence a lot of values for b are possible, but no matter which one you choose (even the true b value), all of them are going to have a big loss.
The optimization is working correctly, but the problem is with the b parameter, whose estimate is much more heavily influenced by the initial "roll of the dice" of the noise (which has a standard deviation of n) than by the actual value of b_true (which is much smaller than n).
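A rough, self-contained illustration of this point (the numbers are made up, not taken from the asker's script): fitting the same line by ordinary least squares once with std-1 noise and once with std-300 noise typically still recovers the slope, while the intercept estimate gets swamped by the large noise.

import numpy as np

rng = np.random.RandomState(0)
N, w_true, b_true = 1000, 2.0, 0.5
x = rng.uniform(-100, 100, size=(N, 1))

for noise_std in (1.0, 300.0):
    y = w_true * x + b_true + rng.normal(scale=noise_std, size=(N, 1))
    X = np.hstack([x, np.ones((N, 1))])          # design matrix [x, 1]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    w_hat, b_hat = coef.ravel()
    print("noise std", noise_std, "-> W ~", round(float(w_hat), 3), ", b ~", round(float(b_hat), 3))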
I'm trying to create a function which returns the values of θ0 and θ1 of the hypothesis function of linear regression. But I'm getting different results for different initial (random) values of θ0 and θ1.
What's wrong with the code?
training_data_set = [[1, 1], [2, 3], [4, 3], [3, 2], [5, 5]]
initial_theta = [1, 0]


def gradient_descent(data, theta0, theta1):
    def h(x, theta0, theta1):
        return theta0 + theta1 * x

    m = len(data)
    alpha = 0.01

    for n in range(m):
        cost = 0
        for i in range(m):
            cost += (h(data[i][0], theta0, theta1) - data[i][1])**2
        cost = cost/(2*m)

        error = 0
        for i in range(m):
            error += h(data[i][0], theta0, theta1) - data[i][1]

        theta0 -= alpha*error/m
        theta1 -= alpha*error*data[n][0]/m

    return theta0, theta1


for i in range(5):
    initial_theta = gradient_descent(training_data_set, initial_theta[0], initial_theta[1])

final_theta0 = initial_theta[0]
final_theta1 = initial_theta[1]
print(f'theta0 = {final_theta0}\ntheta1 = {final_theta1}')
Output:
When initial_theta = [0, 0]
theta0 = 0.27311526522692103
theta1 = 0.7771301328221445
When initial_theta = [1, 1]
theta0 = 0.8829506006170339
theta1 = 0.6669442287905096
Convergence
You've run five iterations of gradient descent over just 5 training samples with a (probably reasonable) learning rate of 0.01. That is not expected to bring you to a "final" answer of your problem - you'd need to do many iterations of gradient descent just like you implemented, repeating the process until your thetas converge to a stable value. Then it'd make sense to compare the resulting values.
Replace the 5 in for i in range(5) with 5000 and then look at what happens. It might be illustrative to plot the decrease of the error rate / cost function to see how fast the process converges to a solution.
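A minimal sketch of that experiment (it reuses gradient_descent and training_data_set from the question; the iteration count and the plotting are just illustrative):

import matplotlib.pyplot as plt

theta0, theta1 = 1, 0
costs = []
for i in range(5000):
    theta0, theta1 = gradient_descent(training_data_set, theta0, theta1)
    # same mean squared error the function computes internally
    mse = sum((theta0 + theta1 * x - y) ** 2 for x, y in training_data_set) / (2 * len(training_data_set))
    costs.append(mse)

print(f'theta0 = {theta0}\ntheta1 = {theta1}')
plt.plot(costs)
plt.xlabel('outer iteration')
plt.ylabel('cost')
plt.show()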
This is not a problem; rather, it is a very usual thing. To see why, you need to understand how gradient descent works.
Every time you randomly initialise your parameters, the hypothesis starts its journey from a random place. With every iteration it updates the parameters so that the cost function converges. In your case, since you have run your gradient descent for only 5 iterations, different initialisations end up with quite different results. Try more iterations and you will see significant similarity even with different initialisations. A visualisation would make this clearer.
Here is how I see gradient descent: imagine that you are high up on a rocky mountainside in the fog. Because of the fog, you cannot see the fastest path down the mountain. So, you look around your feet and go down based on what you see nearby. After taking a step, you look around your feet again, and take another step. Sometimes this will trap you in a small low spot where you cannot see any way down (a local minimum) and sometimes this will get you safely to the bottom of the mountain (global minimum). Starting from different random locations on the foggy mountainside might trap you in different local minima, though you might find your way down safely if the random starting location is good.
I've been trying to get into hidden Markov models and the Viterbi algorithm recently. I found a library called hmmlearn (http://hmmlearn.readthedocs.io/en/latest/tutorial.html) to help me generate a state sequence for two states (with Gaussian emissions). Then I wanted to re-determine the state sequence using Viterbi. My code works, but predicts approximately 5% of the states wrong (depending on the means and variances of the Gaussian emissions). The hmmlearn library has a .predict method which also uses Viterbi to determine the state sequence.
My problem now is that the Viterbi algorithm by hmmlearn is much better than my hand-written one (error rate is lower than 0.5% compared to my 5%). I couldn't find any major problem in my code, so I'm not sure why this is the case. Below is my code where I first generate the state and observation sequence Z and X, predict Z with hmmlearn and finally predict it with my own code:
# Import libraries
import numpy as np
import scipy.stats as st
from hmmlearn import hmm

# (pi, A, obs_means, obs_covars, mean_1, var_1, mean_2, var_2 and T are defined earlier and omitted from this post)

# Generate a sequence
model = hmm.GaussianHMM(n_components = 2, covariance_type = "spherical")
model.startprob_ = pi
model.transmat_ = A
model.means_ = obs_means
model.covars_ = obs_covars
X, Z = model.sample(T)

## Predict the states from generated observations with the hmmlearn library
Z_pred = model.predict(X)

# Predict the state sequence with Viterbi by hand
# NOTE: st.norm expects the standard deviation, but var_1/var_2 are variances (see the edit below)
B = np.concatenate((st.norm(mean_1, var_1).pdf(X), st.norm(mean_2, var_2).pdf(X)), axis = 1)

delta = np.zeros(shape = (T, 2))
psi = np.zeros(shape = (T, 2))

### Calculate starting values
for s in np.arange(2):
    delta[0, s] = np.log(pi[s]) + np.log(B[0, s])

psi = np.zeros((T, 2))

### Take everything in log space since values get very low as t -> T
for t in range(1, T):
    for s_post in range(0, 2):
        delta[t, s_post] = np.max([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1) + np.log(B[t, s_post])
        psi[t, s_post] = np.argmax([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1)

### Backtrack
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
    states[t] = psi[t+1, states[t+1]]
I'm not sure if I have a big error in my code or if hmmlearn just uses a more refined Viterbi algorithm. Looking into the falsely predicted states, I have noticed that the impact of the emission probability B seems to be too big, as it causes the states to change too frequently even when the transition probability to the other state is really low.
I'm rather new to python so please excuse my ugly coding. Thanks in advance for any tips you might have!
Edit: As you can see in the code, I stupidly used the variances instead of the standard deviations to determine the emission probabilities. After fixing this, I get the same result as the library's Viterbi implementation.
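Concretely, the fix described in the edit amounts to passing standard deviations (the square roots of the variances) to scipy.stats.norm when building the emission matrix:

# pass standard deviations, not variances, to st.norm
B = np.concatenate((st.norm(mean_1, np.sqrt(var_1)).pdf(X),
                    st.norm(mean_2, np.sqrt(var_2)).pdf(X)), axis = 1)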
I am trying to predict the quality of a metal coil. A coil is 10 meters wide and from 1 to 6 kilometers long. As training data I have ~600 parameters measured every 10 meters, plus the final quality-control mark - good/bad - for the whole coil. Bad means there is at least one place where the coil is bad, but there is no data saying where exactly. I have data for approximately 10000 coils.
Let's imagine we want to train a logistic regression on this data (with 2 factors).
X = [[0, 0],
     ...
     [0, 0],
     [1, 1],  # coil is actually broken here, but we don't know it yet
     [0, 0],
     ...
     [0, 0]]
Y = ?????
I can't just put all "bad" in Y and run the classifier, because that would be confusing for the classifier. And I can't put all "good" plus one "bad" in Y, because I don't know where the bad position is.
The solution I have in mind is the following: I could define the loss function as sum( (Y - min(F(x1, x2)))^2 ) (the min taken over all F values belonging to one coil) instead of sum( (Y - F(x1, x2))^2 ). That way F would probably get trained to point at the bad place. I need a gradient for that; it is impossible to calculate it at every point, since the min is not differentiable everywhere, but I could use a weak (sub)gradient instead, using the value of the function that is minimal within the coil at each place.
I more or less know how to implement this myself; the question is what the simplest way to do it in Python with scikit-learn is. Ideally it should be the same (or easily adaptable) for several learning methods (a lot of methods are based on a loss function and a gradient). Is it possible to make some wrapper for learning methods which works this way?
Update: looking at gradient_boosting.py - there is an internal abstract class LossFunction with the ability to calculate the loss and the gradient, which looks promising. It also looks like there is no general, ready-made solution.
What you are considering here is known in the machine learning community as superset learning: instead of the typical supervised setting where you have a training set of the form {(x_i, y_i)}, you have {({x_1, ..., x_N}, y_1)} such that you know that at least one element of the set has property y_1. This is not a very common setting, but it exists and there is some research available; google for papers in the domain.
In terms of your own loss functions - scikit-learn is a no-go. Scikit-learn is about simplicity: it provides you with a small set of ready-to-use tools with very little flexibility. It is not a research tool, and your problem is research-y. What can you use instead? I suggest you go for any automatic-differentiation solution, for example autograd, which gives you the ability to differentiate through Python code; simply apply scipy.optimize.minimize on top of it and you are done! Any custom loss function will work just fine.
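To make the autograd + scipy.optimize.minimize workflow concrete, here is a tiny sketch with a made-up loss (the quadratic is just a placeholder for whatever custom loss you define):

import autograd.numpy as np
from autograd import grad
from scipy.optimize import minimize

def my_loss(w):
    # any loss written with autograd.numpy can be differentiated automatically
    return np.sum((w - np.array([1.0, -2.0])) ** 2)

result = minimize(my_loss, x0=np.zeros(2), jac=grad(my_loss))
print(result.x)  # approximately [1, -2]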
As a side note - the minimum operator is not differentiable, thus the model might have a hard time figuring out what is going on. You could instead try sum((Y - prod_x F(x_1, x_2))^2), since multiplication is nicely differentiable, and you will still get a similar effect - if at least one element is predicted to be 0, it will remove any "1" answer from the remaining ones. You can even go one step further to make it more numerically stable and do:
if Y==0 then loss = sum_x log(F(x_1, x_2 ) )
if Y==1 then loss = sum_x log(1-F(x_1, x_2))
which translates to
Y * sum_x log(1-F(x_1, x_2)) + (1-Y) * sum_x log( F(x_1, x_2) )
You can notice the similarity with the cross-entropy cost, which makes perfect sense since your problem is indeed classification. And now you have a perfectly probabilistic loss - you are attaching probabilities of each segment being "bad" or "good", so the probability of the whole object being bad is either high (if Y==0) or low (if Y==1).
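For what it's worth, a minimal end-to-end sketch of this approach with autograd and scipy.optimize.minimize, using toy stand-in data and a per-coil probabilistic loss in the same spirit as above (the shapes, the sigmoid model and the exact form of the loss are illustrative assumptions, not the asker's setup):

import autograd.numpy as np
from autograd import grad
from scipy.optimize import minimize

rng = np.random.RandomState(0)

# Toy stand-in data: 20 "coils", each with 50 segments and 2 features.
# label == 1 means the coil is bad somewhere, 0 means it is good everywhere.
segments = [rng.randn(50, 2) for _ in range(20)]
labels = rng.randint(0, 2, size=20)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(params):
    w, b = params[:2], params[2]
    total = 0.0
    for X, y in zip(segments, labels):
        p_bad = sigmoid(np.dot(X, w) + b)                   # per-segment P(bad)
        log_all_good = np.sum(np.log(1.0 - p_bad + 1e-12))  # log P(every segment good)
        if y == 1:
            # bad coil: make P(at least one bad segment) high
            total -= np.log(1.0 - np.exp(log_all_good) + 1e-12)
        else:
            # good coil: make P(all segments good) high
            total -= log_all_good
    return total

result = minimize(loss, x0=np.zeros(3), jac=grad(loss), method="L-BFGS-B")
print("fitted parameters:", result.x)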