I'm having trouble using numpy to parallelize this for loop below (get_new_weights). With my first attempt for df_dm in update_weights, the weight is completely wrong. With my second attempt at df_dm, my weight overshoots the optimal weight.
Note - bias is a single number and weight is a single number (one variable linear regression) and X is shape (442,1) and y is shape (442,1). Also note that updating my bias term works perfectly in update_weights - its just updating the weight that I'm having trouble with.
# This is the for loop that I am trying to parallelize with numpy:
def get_new_weights(X, y, weight, bias, learning_rate=0.01):
weight_deriv = 0
bias_deriv = 0
total = len(X)
for i in range(total):
# -2x(y - (mx + b))
weight_deriv += -2*X[i] * (y[i] - (weight*X[i] + bias))
# -2(y - (mx + b))
bias_deriv += -2*(y[i] - (weight*X[i] + bias))
weight -= (weight_deriv / total) * learning_rate
bias -= (bias_deriv / total) * learning_rate
return weight, bias
# This is my attempt at parallelization
def update_weights(X, y, weight, bias, lr=0.01):
df_dm = np.average(-2*X * (y-(weight*X+bias))) # this was my first guess
# df_dm = np.average(np.dot((-X).T, ((weight*X+bias)-y))) # this was my second guess
df_db = np.average(-2*(y-(weight*X+bias)))
weight = weight - (lr*df_dm)
bias = bias - (lr*df_db)
return weight,bias
This is the equation I am using for updating my weight and bias:
thanks for everyone who took a look at my question. I am loosely using the term parallelization to refer to the optimization in terms of runtime that I'm looking for by removing the need for a for loop. The answer to this problem is:
df_dm = (1/len(X)) * np.dot((-2*X).T, (y-(weight*X+bias)))
The issue here was making sure that all of the arrays resulting from the intermediate steps had the correct shape. And - for those interested in the runtime difference between these two functions: the for loop took 10 times longer.
Related
I'm coding linear regression by using gradient descent. By using for loop not tensor.
I think my code is logically right, and when I plot the graph theta value and linear model seems to be coming out good. But the value of cost function is high. Can you help me?
The value of cost function is 1,160,934 which is abnormal.
def gradient_descent(alpha,x,y,ep=0.0001, max_repeat=10000000):
m = x.shape[0]
converged = False
repeat = 0
theta0 = 1.0
theta3 = -1.0
# J=sum([(theta0 +theta3*x[i]- y[i])**2 for i in range(m)]) / 2*m #######
J=1
while not converged :
grad0= sum([(theta0 +theta3*x[i]-y[i]) for i in range (m)]) / m
grad1= sum([(theta0 + theta3*x[i]-y[i])*x[i] for i in range (m)])/ m
temp0 = theta0 - alpha*grad0
temp1 = theta3 - alpha*grad1
theta0 = temp0
theta3 = temp1
msqe = (sum([(theta0 + theta3*x[i] - y[i]) **2 for i in range(m)]))* (1 / 2*m)
print(theta0,theta3,msqe)
if abs(J-msqe) <= ep:
print ('Converged, iterations: {0}', repeat, '!!!')
converged = True
J = msqe
repeat += 1
if repeat == max_repeat:
converged = True
print("max 까지 갔다")
return theta0, theta3, J
[theta0,theta3,J]=gradient_descent(0.001,X3,Y,ep=0.0000001,max_repeat=1000000)
print("************\n theta0 : {0}\ntheta3 : {1}\nJ : {2}\n"
.format(theta0,theta3,J))
This is the data set.
I think the dataset itself is quite widespread and that's why the best fit line shows a large amount for the cost function. If you scale your data - you would see it drop significantly.
It is quite normal for cost to be high while dealing with the large dataset which has huge variance. Moreover your data is dealing with big numbers so cost is pretty high, normalizing data will give you the correct estimate as normalized data don't need to be scaled. Try this for verifying start with random wrights, observe the cost every time if the cost fluctuates in huge range then there might be some mistake that else its fine.
I have a CSV file of various persons with 8 parameters to determine whether the person is diabetic or not.
You will get the CSV file from here
I am making a model that will train and predict if a person is diabetic or not without using of third-party applications like Tensorlfow Scikitlearn etc. I am making it from scratch.
here is my code:
from numpy import genfromtxt
import numpy as np
my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.zeros((X.shape[1],1))
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)
The output which is getting generated is:
It is in this link please take a look.
I have tried to -
initialize the value of W with random numbers.
spent a lot of time to debug but cannot find what I have done wrong
This short answer is that your learning rate is about 500x too big for this problem. Think about it like you're trying to pilot your W vector into a canyon in the cost function. At each step, the gradient tells you which way is down hill, but the steps you take in that direction are so big that you jump over the canyon and end up on the other side. Each time this happens, your cost goes up because you're getting farther and farther out of the canyon until after 2 iterations, it blows up.
If you replace the line
W,b = optimizer(W, b, X, Y, 100, 0.05)
with
W,b = optimizer(W, b, X, Y, 100, 0.0001)
It will converge, though still not at a reasonable speed. (Side note, there's no good way to know the learning rate you need for a given problem. You just try lower and lower values until your cost value doesn't diverge.)
The longer answer is that the problem is that your features are all on different scales.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
yields
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
This means that the variations in the numbers of the second feature are about 100x as large as the variations in the numbers of the second to last feature which in turn means that the number of the second value in your W vector will have to be tuned to about 100x the precision of the value of the second to last number in your W vector.
There are two ways to deal with this in practice. First, you could use a fancier optimizer. Instead of basic gradient descent, you could use gradient descent with momentum, but that would change all your code. The second, simpler, way is just to scale your features so they're all about the same size.
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
print('column means: ', col_means)
print('column stdevs: ', col_stds)
X -= col_means
X /= col_stds
W, b = optimizer(W, b, X, Y, 100, 1.0)
Here we subtract the mean value of each feature and divide each feature's value by its standard deviation. Sometimes newbies are thrown off by this -- "you can't change your data values, that changes the problem" -- but it makes sense if you realize that it's just another mathematical transformation, just like multiplying by W, adding b, taking the sigmoid, etc. The only catch is that you've got to make sure you do the same thing for any future data. Just like the values of your W vector are learned parameters of your model, the values of the col_means and col_stds are too, so you've got to save them like W and b and use them if you want to perform inference with this model on new data in the future.
That lets us use a much bigger learning rater of 1.0 because now all the features are approximately the same size.
Now if you try, you'll get the following output:
column means: [ 3.84505208 120.89453125 69.10546875 20.53645833 79.79947917
31.99257812 0.4718763 33.24088542]
column stdevs: [ 3.36738361 31.95179591 19.34320163 15.94182863 115.16894926
7.87902573 0.33111282 11.75257265]
0.6931471805599452
0.5902957589079032
0.5481784378158732
0.5254804089153315
...
0.4709931321295562
0.4709931263193595
0.47099312122176273
0.4709931167488006
0.470993112823447
This is what you want. Your cost function is going down at each step and at the end of your 100 iterations, the cost is stable to ~8 significant figures, so dropping it more probably won't do much.
Welcome to machine learning!
The problem is with your initialization of the weight and bias. It’s important that you don’t initialize at least the weights to zero and instead initialize them with some random small numbers. The value of A is coming out to be zero making your cost function undefined
Update:
Try something like this:
from numpy import genfromtxt
import numpy as np
# my_data = genfromtxt('E:/diabaties.csv', delimiter=',')
# X,Y = my_data[1: ,:-1], my_data[1: ,-1:] #striping data and output from my_data
# Using random data
n_points = 100
n_neurons = 5
X = np.random.rand(n_points, n_neurons) # 5 dimensional data from uniform distribution [0, 1)
Y = np.random.randint(low=0, high=2, size=(n_points, 1)) # Binary labels
def sigmoid(x):
return (1/(1+np.exp(-x)))
m = X.shape[0]
def propagate(W, b, X, Y):
#forward propagation
A = sigmoid(np.dot(X, W) + b)
cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))
print(cost)
#backward propagation
dw = (1 / m) * np.dot(X.T, (A - Y))
db = (1 / m) * np.sum(A - Y)
return(dw, db, cost)
def optimizer(W,b,X,Y,number_of_iterration,learning_rate):
for i in range(number_of_iterration):
dw, db, cost = propagate(W,b,X,Y)
W = W - learning_rate*dw
b = b - learning_rate*db
return(W, b)
W = np.random.normal(loc=0, scale=0.01, size=(n_neurons, 1)) # Drawing random initialization from gaussian
b = 0
W,b = optimizer(W, b, X, Y, 100, 0.05)
Your NaN problem is simply due to np.log encountering a zero value. You always want to scale your X values. Statistical (mean, std) normalization will work, but I find min-max scaling works best. Here is code for that:
def minmax_scaler(x):
min = np.nanmin(x, axis=0)
max = np.nanmax(x, axis=0)
return (x-min)/(max-min)
Also, your neural net has only one neuron. When you call np.dot(X, W) these should be matrices of shape (cases, features) and (features, neurons) respectively. So, now your initialization code looks like this:
X = minmax_scaler(X)
neurons = 10
learning_rate = 0.05
W = np.random.random((X.shape[1], neurons))
b = np.zeros((1, neurons)) # b width to match W
I got decent convergence without needing to change the learning rate. See chart:
This is such a small dataset that, even with 10-20 neurons, you are in danger of overfitting it. Ordinarily, you would code a predict() method and an accuracy check, and then set aside some of the data to test for overfitting.
So I'm new to learning ML and I am using gradient descent as my first algorithm I would like to get good at and learn well. I wrote my first code and have looked online for the issue I'm facing but due to lack of concrete knowledge I'm having a hard time understanding how I would go about diagnosing my issue. My gradient begins by approaching the correct answer and when the error has been cut by a factor of 8, the algorithm loses it's value and the b-value begins to go negative and the m-value goes past the target value. I'm sorry if I worded this odd, hopefully the code will help.
I am learning this from multiple sources on youtube and on google. I have been following Siraj Raval's math of intelligence playlist on youtube, I understood how the underlying algorithm worked but I decided to take my own approach and it seems to not be working too great. I'm struggling to read online resources as I'm inexperienced in what ever algorithm means and how it's implemented into python. I know this issue has something to do with training and testing but I don't know where to apply this.
def gradient_updater(error, mcurr, bcurr):
for i in x:
# gets the predicted y-value
ypred = (mcurr * i) + bcurr
# uses partial derivative formula to get new m and b
new_m = -(2/N) * sum(x*(y - ypred))
new_b = -(2/N) * sum(y - ypred)
# applies the new b and m value
mcurr = mcurr - (learning_rate * new_m)
bcurr = bcurr - (learning_rate * new_b)
return mcurr, bcurr
def run(iterations, initial_m, initial_b):
current_m = initial_m
current_b = initial_b
for i in range(iterations):
error = get_error(current_m, current_b)
current_m, current_b = gradient_updater(error, current_m, current_b)
print(current_m, current_b, error)
I expected the m and b values to converge to a specific value, this didn't occur and the values kept increasing in opposite direction.
If I am understanding your code correctly, I think your problem is that your taking the partial derivative to get your new slope and intercept on just one point. I'm not sure what exactly some of the variables within the gradient_updater are, so I will try to provide an example that better explains the concept:
I'm not sure we are calculating the optimization in the same way, so in my code, b0 is your 'x' in y=mx+b and b1 is your 'b' that same equation. The following code is for calculating a total b0_temp and b1_temp that will be divided by the batch size to present a new b0 and b1 to fit your graph.
for i in range(len(X)):
ERROR = ERROR + (b1*X[i] + b0 - Y[i])**2
b1_temp = b1_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])*X[i]
b0_temp = b0_temp + (1/2)*((1/len(X))*(b1*X[i] + b0 - Y[i])**2)**(-1/2) * (2/len(X))*(b1*X[i] + b0 - Y[i])
I run through this for every value within my dataset, where X[i] and Y[i] represent an individual datapoint.
Next, I adjust the slope that is currently fitting the graph:
b1_temp = b1_temp / batch_size
b0_temp = b0_temp / batch_size
b0 = b0 - learning_rate * b0_temp
b1 = b1 - learning_rate * b1_temp
b1_temp = 0
b0_temp = 0
Where batch_size can just be taken as len(X). I run through this for some number of epochs (i.e. a for loop of some number, 100 should work), and the line of best fit will adjust accordingly over time. The overall concept behind it is decrease the distance between each point and the line to where it is at a minimum.
Hope I was able to better explain this to you and provide you with a basic code base to adjust your's upon!
Here's where I think the error in your code lies - the calculation of the gradient. I believe that your cost function is similar to the one used in https://ml-cheatsheet.readthedocs.io/en/latest/gradient_descent.html. To solve the gradient, you need to aggregate the effects from all partial derivatives. In your implementation however, you iterate over the range x, without accumulating the effects. Therefore, your new_m and new_b are only calculated for the final term, x (Items marked 1 and 2 below).
Your implementation:
def gradient_updater(error, mcurr, bcurr):
for i in x:
# gets the predicted y-value
ypred = (mcurr * i) + bcurr
# uses partial derivative formula to get new m and b
new_m = -(2/N) * sum(x*(y - ypred)) #-- 1 --
new_b = -(2/N) * sum(y - ypred) #-- 2 --
# applies the new b and m value <-- Indent this block to place inside the for loop
mcurr = mcurr - (learning_rate * new_m)
bcurr = bcurr - (learning_rate * new_b)
return mcurr, bcurr
That said, I think your implementation should come closer to the mathematical formula if you just update mcurr and bcurr in every iteration (See inline comment). The other thing to do is to divide both sum(x*(y - ypred)) and sum(y - ypred) by N as well, in computing new_m and new_b.
Note
Since I do not know what your actual cost function is, I just want to point out that you are also using a constant y value in your code. It is more likely to be an array of different values and be called by Y[i] and X[i] respectively.
Currently my convergence criteria for SGD checks whether the MSE error ratio is within a specific boundary.
def compute_mse(data, labels, weights):
m = len(labels)
hypothesis = np.dot(data,weights)
sq_errors = (hypothesis - labels) ** 2
mse = np.sum(sq_errors)/(2.0*m)
return mse
cur_mse = 1.0
prev_mse = 100.0
m = len(labels)
while cur_mse/prev_mse < 0.99999:
prev_mse = cur_mse
for i in range(m):
d = np.array(data[i])
hypothesis = np.dot(d, weights)
gradient = np.dot((labels[i] - hypothesis), d)/m
weights = weights + (alpha * gradient)
cur_mse = compute_mse(data, labels, weights)
if cur_mse > prev_mse:
return
The weights are update w.r.t. to a single data point in the training set.
With an alpha of 0.001, the model is supposed to have converged within a few iterations however I get no convergence. Is this convergence criteria too strict?
I'll try to answer the question. First, the pseudocode of stochastic gradient descent looks something like this:
input: f(x), alpha, initial x (guess or random)
output: min_x f(x) # x that minimizes f(x)
while True:
shuffle data # good practice, not completely needed
for d in data:
x -= alpha * grad(f(x)) # df/dx
if <stopping criterion>:
break
There can be other regularization parameters added to the function that you want to minimize, such as the l1 penalty to avoid overfitting.
Going back to your problem, looking at your data and definition of the gradient, looks like you want to solve a simple linear system of equations of the form:
Ax = b
which yields the objevtive function:
f(x) = ||Ax - b||^2
stochastic gradient descent uses one row data at a time:
||A_i x - b||
where || o || is the euclidean norm and _i means index of a row.
Here, A is your data, x is your weights and b is your labels.
The gradient of the function is then computed as a:
grad(f(x)) = 2 * A.T (Ax - b)
Or in the case of the stochastic gradient descent:
2 * A_i.T (A_i x - b)
where .T means transpose.
Putting everything back into your code... first I will setup a synthetic data:
A = np.random.randn(100, 2) # 100x2 data
x = np.random.randn(2, 1) # 2x1 weights
b = np.random.randint(0, 2, 100).reshape(100, 1) # 100x1 labels
b[b == 0] = -1 # labels in {-1, 1}
Then, define the parameters:
alpha = 0.001
cur_mse = 100.
prev_mse = np.inf
it = 0
max_iter = 100
m = A.shape[0]
idx = range(m)
And loop!
while cur_mse/prev_mse < 0.99999 and it < max_iter:
prev_mse = cur_mse
shuffle(idx)
for i in idx:
d = A[i:i+1]
y = b[i:i+1]
h = np.dot(d, x)
dx = 2 * np.dot(d.T, (h - y))
x -= (alpha * dx)
cur_mse = np.mean((A.dot(x) - b)**2)
if cur_mse > prev_mse:
raise Exception("Not converging")
it += 1
This code is pretty much the same as yours, with a couple of additions:
Another stopping criterion based on the number of iterations (to avoid looping forever if the system doesn't converge or does too slowly)
Redefinition of the gradient dx (still similar to yours). You have the sign inverted and therefore the weight update is positive + since in my example is negative - (makes sense since you are going down in a gradient).
Indexing of data and labels. While data[i] gives a tuple of size (2,) (in this case for a 100x2 data), using fancy indexing data[i:i+1] will return a view of the data without reshaping it (e.g with shape (1, 2)) and therefore will allow you to perform the proper matrix multiplications.
You can add a 3rd stopping criterion based on acceptable mse error, i.e: if cur_mse < 1e-3: break.
This algorithm, with random data, converges in 20-40 iterations for me (depending on the generated random data).
So... assuming that this is the function you want to minimize, if this method doesn't work for you, it might mean that your system is underdeterminated (you have less training data than features, which means A is more wide than high).
Hope it helps!
My SARSA with gradient-descent keep escalating the weights exponentially. At Episode 4 step 17 the value is already nan
Exception: Qa is nan
e.g:
6) Qa:
Qa = -2.00890180632e+303
7) NEXT Qa:
Next Qa with west = -2.28577776413e+303
8) THETA:
1.78032402991e+303 <= -0.1 + (0.1 * -2.28577776413e+303) - -2.00890180632e+303
9) WEIGHTS (sample)
5.18266630725e+302 <= -1.58305782482e+301 + (0.3 * 1.78032402991e+303 * 1)
I don't know where to look for the mistake I made.
Here's some code FWIW:
def getTheta(self, reward, Qa, QaNext):
""" let t = r + yQw(s',a') - Qw(s,a) """
theta = reward + (self.gamma * QaNext) - Qa
def updateWeights(self, Fsa, theta):
""" wi <- wi + alpha * theta * Fi(s,a) """
for i, w in enumerate(self.weights):
self.weights[i] += (self.alpha * theta * Fsa[i])
I have about 183 binary features.
you need normalization in each trial. This will keep the weights in a bounded range. (e.g. [0,1]). They way you are adding the weights each time, just grows the weights and it would be useless after the first trial.
I would do something like this:
self.weights[i] += (self.alpha * theta * Fsa[i])
normalize(self.weights[i],wmin,wmax)
or see the following example (from literature of RL):
You need to write the normalization function by yourself though ;)
I do not have access to the full code in your application, so I might be wrong. But I think that I know where you are going wrong.
First and foremost, normalization should not be necessary here. For weights to get bloated so soon in this situation suggests something wrong with your implementation.
I think your update equation should be:-
self.weights[:, action_i] = self.weights[:, action_i] + (self.alpha * theta * Fsa[i])
That is to say that you should be updating columns instead of rows, because rows are for states and columns for for actions in the weight matrix.