I'm implementing linear regression with gradient descent, using plain for loops rather than tensor operations.
I think my code is logically correct: when I plot the results, the theta values and the fitted line look good. But the value of the cost function is high. Can you help me?
The value of the cost function is 1,160,934, which seems abnormal.
def gradient_descent(alpha,x,y,ep=0.0001, max_repeat=10000000):
    m = x.shape[0]
    converged = False
    repeat = 0
    theta0 = 1.0
    theta3 = -1.0
    # J=sum([(theta0 +theta3*x[i]- y[i])**2 for i in range(m)]) / 2*m #######
    J=1
    while not converged :
        grad0= sum([(theta0 +theta3*x[i]-y[i]) for i in range (m)]) / m
        grad1= sum([(theta0 + theta3*x[i]-y[i])*x[i] for i in range (m)])/ m
        temp0 = theta0 - alpha*grad0
        temp1 = theta3 - alpha*grad1
        theta0 = temp0
        theta3 = temp1
        msqe = (sum([(theta0 + theta3*x[i] - y[i]) **2 for i in range(m)]))* (1 / 2*m)
        print(theta0,theta3,msqe)
        if abs(J-msqe) <= ep:
            print ('Converged, iterations: {0}', repeat, '!!!')
            converged = True
        J = msqe
        repeat += 1
        if repeat == max_repeat:
            converged = True
            print("Reached max iterations")
    return theta0, theta3, J
[theta0,theta3,J]=gradient_descent(0.001,X3,Y,ep=0.0000001,max_repeat=1000000)
print("************\n theta0 : {0}\ntheta3 : {1}\nJ : {2}\n"
      .format(theta0,theta3,J))
This is the data set.
I think the dataset itself is quite spread out, and that's why the best-fit line shows a large value for the cost function. If you scale your data, you should see it drop significantly.
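To illustrate the scaling point, here is a minimal sketch on synthetic data (the asker's X3 and Y are not available, so the data, the helper names `half_mse` and `standardize`, and the use of `np.polyfit` in place of gradient descent are all assumptions made for illustration). It fits the same line before and after standardizing and compares the two costs.

```python
import numpy as np

def half_mse(x, y):
    """Fit y = t0 + t1*x by least squares and return sum(residual^2) / (2*m)."""
    t1, t0 = np.polyfit(x, y, 1)      # polyfit returns the slope first, then the intercept
    residuals = t0 + t1 * x - y
    return np.sum(residuals ** 2) / (2 * len(x))

def standardize(v):
    """Shift to zero mean and scale to unit standard deviation."""
    return (v - v.mean()) / v.std()

# Hypothetical data with large values and large noise (not the asker's data).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1000.0, 200)
y = 3.0 * x + rng.normal(0.0, 100.0, size=x.shape)

raw_cost = half_mse(x, y)
scaled_cost = half_mse(standardize(x), standardize(y))
print(raw_cost, scaled_cost)
```

The fitted line is equally good in both cases. Standardizing only changes the units in which the residuals are measured, so the cost shrinks by a factor of the variance of y.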
It is quite normal for the cost to be high when dealing with a large dataset that has huge variance. Moreover, your data contains big numbers, so the cost is pretty high; normalizing the data will give you a cost on a more interpretable scale. To verify, try this: start with random weights and observe the cost at every iteration. If the cost fluctuates over a huge range, there might be a mistake somewhere; otherwise it's fine.
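Separately from scaling, one detail in the question's code may also be inflating the reported number: the msqe line multiplies by `(1 / 2*m)`. In Python, `/` and `*` have equal precedence and are evaluated left to right, so this is `(1/2)*m`, not `1/(2*m)`. The sketch below uses made-up values for `m` and the summed squared error to show the difference.

```python
# Made-up numbers, only to show the effect of operator precedence.
m = 100
sq_error_sum = 50.0

as_written = sq_error_sum * (1 / 2*m)   # evaluated as (1/2) * m
intended = sq_error_sum / (2 * m)       # the usual 1/(2m) scaling

print(as_written, intended, as_written / intended)
```

The ratio between the two is m squared, so with a few hundred data points the printed cost can be several orders of magnitude larger than the conventional one, even when the fitted thetas are fine.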
According to (steward, 1998), the inverse of an invertible matrix A can be approximated by the formula $A^{-1} = \sum_{n=0}^{\infty} (I - A)^n$.
I tried implementing an algorithm to approximate a simple matrix's inverse, but the loss function showed funny results. Please look at the code below. More info about the Neumann series can be found here and here.
Here is my code.
import numpy as np
import matplotlib.pyplot as plt
A = np.array([[1,0,2],[3,1,-2],[-5,-1,9]])
class Neumann_inversion():
    def __init__(self,A,rank):
        self.A = A
        self.rank = rank
        self.eye = np.eye(len(A))
        self.loss = []
        self.loss2 =[]
        self.A_hat = np.zeros((3,3),dtype = float)
        #self.loss.append(np.linalg.norm(np.linalg.inv(self.A)-self.A_hat))
    def approximate(self):
        # self.A_hat = None
        n = 0
        L = (self.eye-self.A)
        while n < self.rank:
            self.A_hat += np.linalg.matrix_power(L,n)
            loss = np.linalg.norm(np.linalg.inv(self.A) - self.A_hat)
            self.loss.append(loss)
            n+= 1
        plt.plot(self.loss)
        plt.ylabel('Loss')
        plt.xlabel('rank')
        # ax.axis('scaled')
        return
Matrix = Neumann_inversion(A,200)
Matrix.approximate()
The formula is valid only if $(I - A)^n$ tends to zero as $n$ increases. So your matrix must satisfy
np.all(np.abs(np.linalg.eigvals(np.eye(len(A)) - A)) < 1)
Try
Neumann_inversion(A/10, 200).approximate()
and you can take the loss seriously :)
The origin of the formula has to do with the identity
(1-x) * (1 + x + x^2 + ... + x^n) = (1 - x^(n+1))
Here x plays the role of I - A. If, and only if, all the eigenvalues of x have magnitude less than 1, the term x^(n+1) tends to zero as n grows, so the sum will be approximately the inverse of (1-x), which is A itself.
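The convergence condition can be checked numerically. The following is a minimal sketch, not the asker's class: `spectral_radius` and `neumann_inverse` are hypothetical helper names, and the choice of 2000 terms is an assumption made because the series converges slowly when the spectral radius of I - A is close to 1.

```python
import numpy as np

def spectral_radius(M):
    """Largest eigenvalue magnitude of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def neumann_inverse(A, terms):
    """Approximate inv(A) by the partial sum of (I - A)^n for n = 0 .. terms-1."""
    T = np.eye(len(A)) - A
    power = np.eye(len(A))        # holds T^n, starting from T^0 = I
    total = np.zeros_like(T)
    for _ in range(terms):
        total += power
        power = power @ T         # next power, without recomputing from scratch
    return total

A = np.array([[1, 0, 2], [3, 1, -2], [-5, -1, 9]], dtype=float)
B = A / 10

print(spectral_radius(np.eye(3) - A))   # above 1: the series for A diverges
print(spectral_radius(np.eye(3) - B))   # below 1: the series for A/10 converges

approx = neumann_inverse(B, 2000)
print(np.linalg.norm(np.linalg.inv(B) - approx))
```

Note that the result approximates the inverse of A/10, so it has to be divided by 10 to recover an approximation of the inverse of A.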
I'm new to ML, and gradient descent is the first algorithm I would like to learn well. I wrote my first code and have looked online for the issue I'm facing, but for lack of solid background I'm having a hard time working out how to diagnose it. My gradient descent begins by approaching the correct answer, but once the error has been cut by a factor of 8, the algorithm goes off track: the b-value starts to go negative and the m-value overshoots the target value. I'm sorry if I worded this oddly; hopefully the code will help.
I am learning this from multiple sources on YouTube and Google. I have been following Siraj Raval's math of intelligence playlist on YouTube. I understood how the underlying algorithm works, but I decided to take my own approach and it is not working well. I'm struggling to read online resources because I'm inexperienced in what each algorithm means and how it is implemented in Python. I know this issue has something to do with training and testing, but I don't know where to apply this.
def gradient_updater(error, mcurr, bcurr):
    for i in x:
        # gets the predicted y-value
        ypred = (mcurr * i) + bcurr
        # uses partial derivative formula to get new m and b
        new_m = -(2/N) * sum(x*(y - ypred))
        new_b = -(2/N) * sum(y - ypred)
    # applies the new b and m value
    mcurr = mcurr - (learning_rate * new_m)
    bcurr = bcurr - (learning_rate * new_b)
    return mcurr, bcurr
def run(iterations, initial_m, initial_b):
    current_m = initial_m
    current_b = initial_b
    for i in range(iterations):
        error = get_error(current_m, current_b)
        current_m, current_b = gradient_updater(error, current_m, current_b)
        print(current_m, current_b, error)
I expected the m and b values to converge to specific values. This didn't occur; the values kept growing in opposite directions.
If I am understanding your code correctly, I think your problem is that you're taking the partial derivative to get your new slope and intercept from just one point. I'm not sure what exactly some of the variables within gradient_updater are, so I will try to provide an example that better explains the concept:
I'm not sure we are calculating the optimization in the same way, so note that in my code b1 is your 'm' in y=mx+b and b0 is your 'b' in that same equation. The following code accumulates totals b0_temp and b1_temp, which are then divided by the batch size to produce a new b0 and b1 to fit your graph.
for i in range(len(X)):
    ERROR = ERROR + (b1*X[i] + b0 - Y[i])**2
    b1_temp = b1_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])*X[i]
    b0_temp = b0_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])
I run through this for every value within my dataset, where X[i] and Y[i] represent an individual datapoint.
Next, I adjust the slope that is currently fitting the graph:
b1_temp = b1_temp / batch_size
b0_temp = b0_temp / batch_size
b0 = b0 - learning_rate * b0_temp
b1 = b1 - learning_rate * b1_temp
b1_temp = 0
b0_temp = 0
Where batch_size can just be taken as len(X). I run through this for some number of epochs (i.e. a for loop of some number of iterations; 100 should work), and the line of best fit will adjust accordingly over time. The overall concept behind it is to decrease the distance between each point and the line until it is at a minimum.
Hope I was able to better explain this to you and provide you with a basic code base to build yours on!
Here's where I think the error in your code lies: the calculation of the gradient. I believe that your cost function is similar to the one used in https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html. To compute the gradient, you need to aggregate the contributions from all the partial derivatives. In your implementation, however, you iterate over x without accumulating these contributions. Therefore, your new_m and new_b are only calculated for the final element of x (items marked 1 and 2 below).
Your implementation:
def gradient_updater(error, mcurr, bcurr):
    for i in x:
        # gets the predicted y-value
        ypred = (mcurr * i) + bcurr
        # uses partial derivative formula to get new m and b
        new_m = -(2/N) * sum(x*(y - ypred)) #-- 1 --
        new_b = -(2/N) * sum(y - ypred) #-- 2 --
    # applies the new b and m value <-- Indent this block to place inside the for loop
    mcurr = mcurr - (learning_rate * new_m)
    bcurr = bcurr - (learning_rate * new_b)
    return mcurr, bcurr
That said, I think your implementation should come closer to the mathematical formula if you just update mcurr and bcurr in every iteration (See inline comment). The other thing to do is to divide both sum(x*(y - ypred)) and sum(y - ypred) by N as well, in computing new_m and new_b.
Note
Since I do not know what your actual cost function is, I just want to point out that you are also using a constant y value in your code. It is more likely to be an array of different values, indexed as Y[i] and X[i] respectively.
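As a point of comparison for the answers above, here is a minimal sketch of a batch update in which the prediction is computed for all points at once, so the sums aggregate the whole dataset. It assumes the cost is the mean squared error and that x and y are NumPy arrays; the names `gradient_step` and `fit_line`, the synthetic data, and the default learning rate and iteration count are all illustrative choices, not the asker's.

```python
import numpy as np

def gradient_step(x, y, m_curr, b_curr, learning_rate):
    """One batch gradient-descent step for the cost mean((y - (m*x + b))**2)."""
    n = len(x)
    y_pred = m_curr * x + b_curr                     # predictions for every point
    grad_m = -(2.0 / n) * np.sum(x * (y - y_pred))   # partial derivative w.r.t. m
    grad_b = -(2.0 / n) * np.sum(y - y_pred)         # partial derivative w.r.t. b
    return m_curr - learning_rate * grad_m, b_curr - learning_rate * grad_b

def fit_line(x, y, iterations=5000, learning_rate=0.05):
    """Repeat the batch step from m = b = 0."""
    m, b = 0.0, 0.0
    for _ in range(iterations):
        m, b = gradient_step(x, y, m, b, learning_rate)
    return m, b

# Synthetic, noise-free data on the line y = 2x + 1 (an assumption for the demo).
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0
m, b = fit_line(x, y)
print(m, b)
```

The difference from the question's code is that `y_pred` is an array with one prediction per point, instead of a scalar left over from the last loop iteration.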
I am trying to implement the Stochastic Gradient Descent algorithm.
The first solution works:
def gradientDescent(x,y,theta,alpha):
    xTrans = x.transpose()
    for i in range(0,99):
        hypothesis = np.dot(x,theta)
        loss = hypothesis - y
        gradient = np.dot(xTrans,loss)
        theta = theta - alpha * gradient
    return theta
This solution gives the right theta values, but the following algorithm doesn't work:
def gradientDescent2(x,y,theta,alpha):
    xTrans = x.transpose();
    for i in range(0,99):
        hypothesis = np.dot(x[i],theta)
        loss = hypothesis - y[i]
        gradientThetaZero= loss * x[i][0]
        gradientThetaOne = loss * x[i][1]
    theta[0] = theta[0] - alpha * gradientThetaZero
    theta[1] = theta[1] - alpha * gradientThetaOne
    return theta
I don't understand why solution 2 does not work; basically it does the same as the first algorithm.
I use the following code to produce data:
def genData():
    x = np.random.rand(100,2)
    y = np.zeros(shape=100)
    for i in range(0, 100):
        x[i][0] = 1
        # our target variable
        e = np.random.uniform(-0.1,0.1,size=1)
        y[i] = np.sin(2*np.pi*x[i][1]) + e[0]
    return x,y
And use it the following way:
x,y = genData()
theta = np.ones(2)
theta = gradientDescent2(x,y,theta,0.005)
print(theta)
I hope you can help me!
Best regards, Felix
Your second code example overwrites the gradient computation on each iteration over your observation data.
In the first code snippet, you properly adjust your parameters in each looping iteration based on the error (loss function).
In the second code snippet, you calculate the point-wise gradient computation in each iteration, but then don't do anything with it. That means that your final update effectively only trains on the very last data point.
If instead you accumulate the gradients within the loop by summing ( += ), it should be closer to what you're looking for (as an expression of the gradient of the loss function with respect to your parameters over the entire observation set).
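The accumulation described above could look like the following sketch. The function names `accumulated_gradient` and `gradient_descent_accumulated` are hypothetical, the data generation mirrors the question's genData but with a seeded generator for reproducibility, and summing over all 100 rows (rather than the question's range(0,99)) is an assumption about the intent.

```python
import numpy as np

def accumulated_gradient(x, y, theta):
    """Sum the point-wise gradients over every observation."""
    gradient = np.zeros_like(theta, dtype=float)
    for i in range(len(y)):
        loss = np.dot(x[i], theta) - y[i]
        gradient += loss * x[i]            # accumulate instead of overwriting
    return gradient

def gradient_descent_accumulated(x, y, theta, alpha, n_iters=99):
    """Batch gradient descent built on the accumulated gradient."""
    theta = np.array(theta, dtype=float)   # copy, so the caller's array is untouched
    for _ in range(n_iters):
        theta = theta - alpha * accumulated_gradient(x, y, theta)
    return theta

# Data in the same shape as the question's genData (seeded for reproducibility).
rng = np.random.default_rng(0)
x = np.column_stack([np.ones(100), rng.random(100)])
y = np.sin(2 * np.pi * x[:, 1]) + rng.uniform(-0.1, 0.1, size=100)

theta = gradient_descent_accumulated(x, y, np.ones(2), 0.005)
print(theta)
```

The accumulated loop computes the same quantity as np.dot(xTrans, loss) in the first, working solution, which is why the two then agree.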
Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.
def compute_mse(data, labels, weights):
    m = len(labels)
    hypothesis = np.dot(data,weights)
    sq_errors = (hypothesis - labels) ** 2
    mse = np.sum(sq_errors)/(2.0*m)
    return mse
cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
    prev_mse = cur_mse
    for i in range(m):
        d = np.array(data[i])
        hypothesis = np.dot(d, weights)
        gradient = np.dot((labels[i] - hypothesis), d)/m
        weights = weights + (alpha * gradient)
    cur_mse = compute_mse(data, labels, weights)
    if cur_mse > prev_mse:
        return
The weights are updated w.r.t. a single data point in the training set.
With an alpha of 0.001, the model is supposed to have converged within a few iterations; however, I get no convergence. Is this convergence criterion too strict?
I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:
input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)
while True:
    shuffle data # good practice, not completely needed
    for d in data:
        x -= alpha * grad(f(x)) # df/dx
    if <stopping criterion>:
        break
There can be other regularization parameters added to the function that you want to minimize, such as the l1 penalty to avoid overfitting.
Going back to your problem, looking at your data and definition of the gradient, looks like you want to solve a simple linear system of equations of the form:
Ax = b
which yields the objective function:
f(x) = ||Ax - b||^2
Stochastic gradient descent uses one row of data at a time:
||A_i x - b_i||^2
where || o || is the Euclidean norm and _i means the index of a row.
Here, A is your data, x is your weights and b is your labels.
The gradient of the function is then computed as:
grad(f(x)) = 2 * A.T (Ax - b)
Or in the case of stochastic gradient descent:
2 * A_i.T (A_i x - b_i)
where .T means transpose.
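The gradient formula can be sanity-checked numerically. The sketch below (helper names `f`, `grad_f` and `numerical_grad`, the random data and the seed are all illustrative assumptions) compares the closed-form gradient against central finite differences and confirms that the per-row gradients sum to the full one.

```python
import numpy as np

def f(A, x, b):
    """Objective ||Ax - b||^2."""
    r = A @ x - b
    return float(np.sum(r ** 2))

def grad_f(A, x, b):
    """Closed-form gradient 2 * A.T (Ax - b)."""
    return 2.0 * A.T @ (A @ x - b)

def numerical_grad(A, x, b, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(x)
    for k in range(x.shape[0]):
        step = np.zeros_like(x)
        step[k, 0] = eps
        g[k, 0] = (f(A, x + step, b) - f(A, x - step, b)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 2))   # 100x2 data
x = rng.normal(size=(2, 1))     # 2x1 weights
b = rng.normal(size=(100, 1))   # 100x1 labels

full = grad_f(A, x, b)
# Sum of the single-row gradients 2 * A_i.T (A_i x - b_i).
per_row = sum(2.0 * A[i:i+1].T @ (A[i:i+1] @ x - b[i:i+1]) for i in range(100))
print(full.ravel(), numerical_grad(A, x, b).ravel(), per_row.ravel())
```

Because the full gradient is the sum of the row gradients, each stochastic step moves along one term of that sum, which is why a small alpha and several passes over the data are needed.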
Putting everything back into your code... first I will set up some synthetic data:
A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}
Then, define the parameters:
alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
from random import shuffle  # needed for shuffle(idx) in the loop
idx = list(range(m))  # a list, so that shuffle can reorder it in place
And loop!
while cur_mse/prev_mse < 0.99999 and it < max_iter:
    prev_mse = cur_mse
    shuffle(idx)
    for i in idx:
        d = A[i:i+1]
        y = b[i:i+1]
        h = np.dot(d, x)
        dx = 2 * np.dot(d.T, (h - y))
        x -= (alpha * dx)
    cur_mse = np.mean((A.dot(x) - b)**2)
    if cur_mse > prev_mse:
        raise Exception("Not converging")
    it += 1
This code is pretty much the same as yours, with a couple of additions:
Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge or does so too slowly).
A redefinition of the gradient dx (still similar to yours). You have the sign inverted, which is why your weight update uses + whereas mine uses - (which makes sense, since you are going down the gradient).
Indexing of data and labels. While data[i] gives an array of shape (2,) (in this case for 100x2 data), slicing with data[i:i+1] returns a view of the data with shape (1, 2), without any reshaping, and therefore allows you to perform the proper matrix multiplications.
You can add a third stopping criterion based on an acceptable MSE, i.e.: if cur_mse < 1e-3: break.
This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).
So... assuming that this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdetermined (you have less training data than features, which means A is wider than it is tall).
Hope it helps!
So I have one vector of alpha, one vector of beta, and I am trying to find a theta for when the sum of all the estimates (for alpha's 1 to N and beta's 1 to N) equals 60:
def CalcTheta(grensscore, alpha, beta):
    theta = 0.0001
    estimate = [grensscore-1]
    while(sum(estimate) < grensscore):
        theta += 0.00001
        for x in range(len(beta)):
            if x == 0:
                estimate = []
            estimate.append(math.exp(alpha[x] * (theta - beta[x])) /
                            (1 + math.exp(alpha[x] * (theta - beta[x]))))
    return(theta)
Basically, what I did is start from theta = 0.0001 and iterate, calculating all these sums; while the sum is lower than 60, I continue by adding 0.0001 each time, and once it is above 60 we have found the theta.
I found the value of theta this way. The problem is that it took about 60 seconds in Python to find a theta of 0.456.
What is a quicker approach to find this theta (since I would like to apply this to other data)?
If you know a lower and an upper bound for θ, and the function is monotonic in the range between these, then you could employ a bisection algorithm to easily and quickly find the desired value.
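A bisection search along these lines might look like the sketch below. It assumes every alpha is positive, so that the sum of logistic terms increases monotonically with theta, and that the default bracket [-10, 10] contains the answer; the function names, the bracket and the tolerance are illustrative choices rather than anything from the question.

```python
import math

def total_estimate(theta, alpha, beta):
    """Sum of the logistic estimates, in the same form as the question's expression."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for a, b in zip(alpha, beta))

def calc_theta_bisect(grensscore, alpha, beta, lo=-10.0, hi=10.0, tol=1e-9):
    """Find theta with total_estimate(theta) == grensscore by bisection.

    Assumes the total is increasing in theta (all alpha > 0) and that
    [lo, hi] brackets the solution; both are assumptions of this sketch.
    """
    if not (total_estimate(lo, alpha, beta) <= grensscore <= total_estimate(hi, alpha, beta)):
        raise ValueError("grensscore is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if total_estimate(mid, alpha, beta) < grensscore:
            lo = mid      # the sum is still too small, so theta must be larger
        else:
            hi = mid      # the sum is large enough, so theta is at most mid
    return (lo + hi) / 2.0

# Toy check: 100 identical items with alpha = 1 and beta = 0, target sum 60,
# so each estimate must equal 0.6 and theta = log(0.6 / 0.4).
alpha = [1.0] * 100
beta = [0.0] * 100
theta = calc_theta_bisect(60, alpha, beta)
print(theta, math.log(1.5))
```

Each iteration halves the interval, so reaching a tolerance of 1e-9 from a bracket of width 20 takes about 35 evaluations of the sum, compared with tens of thousands of fixed-size steps.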