I am trying to write a gradient descent function in python as part of a multivariate linear regression exercise. It runs, but does not compute the correct answer. My code is below. I've been trying for weeks to finish this problem but have made zero progress.
I believe that I understand the concept of gradient descent to optimize a multivariate linear regression function and also that the 'math' is correct. I believe that the error is in my code, but I am still learning python. Your help is very much appreciated.
def regression_gradient_descent(feature_matrix, output, initial_weights, step_size, tolerance):
    from math import sqrt
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        predictions = np.dot(feature_matrix, weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            derivative = -2 * np.dot(errors[i], feature_matrix[i])
            gradient_sum_squares = gradient_sum_squares + np.dot(derivative, derivative)
            weights[i] = weights[i] - step_size * derivative[i]
        gradient_magnitude = sqrt(gradient_sum_squares)
        print gradient_magnitude
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
Feature matrix is:
sales = gl.SFrame.read_csv('kc_house_data.csv',column_type_hints = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int,'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int,'view':int})
I'm calling the function as:
train_data,test_data = sales.random_split(.8,seed=0)
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights,step_size,tolerance)
Note: get_numpy_data is just a function that converts everything into numpy arrays, and it works as intended.
Update: I fixed the formula to:
derivative = 2 * np.dot(errors,feature_matrix)
and it seems to have worked. The derivation of this formula in my online course used
-2 * np.dot(errors,feature_matrix)
and I'm not sure why this formula did not provide the correct answer.
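For reference, here is a minimal sketch (on made-up toy data) of how the sign interacts with the update rule: the course's -2 formula is the negative of the RSS derivative, so it has to be added to the weights rather than subtracted; either sign works as long as the update matches it.

import numpy as np

# Hypothetical toy data, purely to illustrate the sign convention.
feature_matrix = np.array([[1., 1.], [1., 2.], [1., 3.]])
output = np.array([1., 2., 3.])
weights = np.array([0., 0.])
step_size = 1e-2

errors = np.dot(feature_matrix, weights) - output   # predictions - output, as above

# Convention A: the RSS derivative has a positive sign, so the update subtracts it.
derivative = 2 * np.dot(errors, feature_matrix)
weights_a = weights - step_size * derivative

# Convention B: the course's formula is the negative derivative, so the update adds it.
neg_derivative = -2 * np.dot(errors, feature_matrix)
weights_b = weights + step_size * neg_derivative

assert np.allclose(weights_a, weights_b)   # the step is identical either way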
The step size seems too small, and the tolerance unusually big. Perhaps you meant to use them the other way around?
In general, the step size is found by trial and error: the "natural" step size α = 1 might lead to divergence, so one lowers the value (e.g. α = 1/2, α = 1/4, and so on) until convergence is achieved; see the sketch below. Don't start with a very small step size.
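A minimal sketch of that trial-and-error loop, on made-up toy data (the data and the diverges helper are invented purely for illustration):

import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.]])
y = np.array([2., 4., 6., 8.])

def diverges(alpha, iterations=100):
    # Run a few plain gradient-descent steps and see whether the cost grew.
    w = np.zeros(X.shape[1])
    first_cost = np.mean((X.dot(w) - y) ** 2)
    for _ in range(iterations):
        errors = X.dot(w) - y
        w = w - alpha * (2.0 / len(y)) * X.T.dot(errors)
    return np.mean((X.dot(w) - y) ** 2) > first_cost

alpha = 1.0
while diverges(alpha):
    alpha = alpha / 2.0   # try 1/2, 1/4, 1/8, ... until the cost stops blowing up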
I'm trying to implement gradient descent in python and my loss/cost keeps increasing with every iteration.
I've seen a few people post about this, and saw an answer here: gradient descent using python and numpy
I believe my implementation is similar, but I can't see what I'm doing wrong to get an exploding cost value:
Iteration: 1 | Cost: 697361.660000
Iteration: 2 | Cost: 42325117406694536.000000
Iteration: 3 | Cost: 2582619233752172973298548736.000000
Iteration: 4 | Cost: 157587870187822131053636619678439702528.000000
Iteration: 5 | Cost: 9615794890267613993157742129590663647488278265856.000000
I'm testing this on a dataset I found online (LA Heart Data): http://www.umass.edu/statdata/statdata/stat-corr.html
Import code:
dataset = np.genfromtxt('heart.csv', delimiter=",")
x = dataset[:]
x = np.insert(x,0,1,axis=1) # Add 1's for bias
y = dataset[:,6]
y = np.reshape(y, (y.shape[0],1))
Gradient descent:
def gradientDescent(weights, X, Y, iterations = 1000, alpha = 0.01):
    theta = weights
    m = Y.shape[0]
    cost_history = []
    for i in xrange(iterations):
        residuals, cost = calculateCost(theta, X, Y)
        gradient = (float(1)/m) * np.dot(residuals.T, X).T
        theta = theta - (alpha * gradient)
        # Store the cost for this iteration
        cost_history.append(cost)
        print "Iteration: %d | Cost: %f" % (i+1, cost)
Calculate cost:
def calculateCost(weights, X, Y):
    m = Y.shape[0]
    residuals = h(weights, X) - Y
    squared_error = np.dot(residuals.T, residuals)
    return residuals, float(1)/(2*m) * squared_error
Calculate hypothesis:
def h(weights, X):
    return np.dot(X, weights)
To actually run it:
gradientDescent(np.ones((x.shape[1],1)), x, y, 5)
Assuming that your derivation of the gradient is correct: you were using =- where you should be using -=. Instead of updating theta, you were reassigning it to -(alpha * gradient).
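A tiny standalone illustration of the difference (not from the original post):

theta = 5
theta =- 2   # parsed as theta = -2: the old value is discarded, theta is now -2
theta = 5
theta -= 2   # in-place subtraction: theta is now 3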
EDIT (after the above issue was fixed in the code):
I ran the code on what I believe is the right dataset and was able to get the cost to behave by setting alpha = 1e-7. If you run it for 1e6 iterations you should see it converging. This approach, on this dataset, appears very sensitive to the learning rate.
In general, if your cost is increasing, the very first thing to check is whether your learning rate is too large. In such cases the rate makes the cost function jump over the optimal value and grow toward infinity. Try different small values of the learning rate. When I face the problem you describe, I usually keep trying 1/10 of the learning rate until I find a rate for which J(w) decreases.
Another problem might be a bug in your derivative implementation. A good way to debug is to do a gradient check: compare the analytic gradient against a numerical approximation, as in the sketch below.
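For example, a rough numerical gradient check on made-up toy data (the cost here is the same squared-error form as above, written independently for the sketch):

import numpy as np

X = np.array([[1., 2.], [1., 3.], [1., 5.]])
y = np.array([[1.], [2.], [4.]])
theta = np.array([[0.5], [0.5]])
m = y.shape[0]

def cost(t):
    r = X.dot(t) - y
    return float(r.T.dot(r)) / (2 * m)

# Analytic gradient of the squared-error cost.
analytic = (1.0 / m) * X.T.dot(X.dot(theta) - y)

# Central-difference approximation, one parameter at a time.
eps = 1e-6
numeric = np.zeros_like(theta)
for j in range(theta.shape[0]):
    step = np.zeros_like(theta)
    step[j] = eps
    numeric[j] = (cost(theta + step) - cost(theta - step)) / (2 * eps)

# The two should agree to several decimal places if the gradient code is right.
assert np.allclose(analytic, numeric, atol=1e-5)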
I was implementing logistic regression in python. To find theta, I was struggling to decide which algorithm always guarantees the global optimum without having to worry about the initial parameter theta.
import numpy as np
import scipy.optimize as op
def Sigmoid(z):
    return 1/(1 + np.exp(-z));

def Gradient(theta,x,y):
    m , n = x.shape
    theta = theta.reshape((n,1));
    y = y.reshape((m,1))
    sigmoid_x_theta = Sigmoid(x.dot(theta));
    grad = ((x.T).dot(sigmoid_x_theta-y))/m;
    return grad.flatten();

def CostFunc(theta,x,y):
    m,n = x.shape;
    theta = theta.reshape((n,1));
    y = y.reshape((m,1));
    term1 = np.log(Sigmoid(x.dot(theta)));
    term2 = np.log(1-Sigmoid(x.dot(theta)));
    term1 = term1.reshape((m,1))
    term2 = term2.reshape((m,1))
    term = y * term1 + (1 - y) * term2;
    J = -((np.sum(term))/m);
    return J;
data = np.loadtxt('ex2data1.txt',delimiter=',');
# m training samples and n attributes
m , n = data.shape
X = data[:,0:n-1]
y = data[:,n-1:]
X = np.concatenate((np.ones((m,1)), X),axis = 1)
initial_theta = np.zeros((n,1))
m , n = X.shape;
Result = op.minimize(fun = CostFunc,
                     x0 = initial_theta,
                     args = (X,y),
                     method = 'TNC',
                     jac = Gradient);
theta = Result.x;
where content of ex2data1.txt is:
34.62365962451697,78.0246928153624,0
30.28671076822607,43.89499752400101,0
35.84740876993872,72.90219802708364,0
60.18259938620976,86.30855209546826,1
79.0327360507101,75.3443764369103,1
45.08327747668339,56.3163717815305,0
61.10666453684766,96.51142588489624,1
75.02474556738889,46.55401354116538,1
76.09878670226257,87.42056971926803,1
84.43281996120035,43.53339331072109,1
95.86155507093572,38.22527805795094,0
75.01365838958247,30.60326323428011,0
82.30705337399482,76.48196330235604,1
69.36458875970939,97.71869196188608,1
39.53833914367223,76.03681085115882,0
53.9710521485623,89.20735013750205,1
69.07014406283025,52.74046973016765,1
67.94685547711617,46.67857410673128,0
70.66150955499435,92.92713789364831,1
76.97878372747498,47.57596364975532,1
67.37202754570876,42.83843832029179,0
89.67677575072079,65.79936592745237,1
50.534788289883,48.85581152764205,0
34.21206097786789,44.20952859866288,0
77.9240914545704,68.9723599933059,1
62.27101367004632,69.95445795447587,1
80.1901807509566,44.82162893218353,1
93.114388797442,38.80067033713209,0
61.83020602312595,50.25610789244621,0
38.78580379679423,64.99568095539578,0
61.379289447425,72.80788731317097,1
85.40451939411645,57.05198397627122,1
52.10797973193984,63.12762376881715,0
52.04540476831827,69.43286012045222,1
40.23689373545111,71.16774802184875,0
54.63510555424817,52.21388588061123,0
33.91550010906887,98.86943574220611,0
64.17698887494485,80.90806058670817,1
74.78925295941542,41.57341522824434,0
34.1836400264419,75.2377203360134,0
83.90239366249155,56.30804621605327,1
51.54772026906181,46.85629026349976,0
94.44336776917852,65.56892160559052,1
82.36875375713919,40.61825515970618,0
51.04775177128865,45.82270145776001,0
62.22267576120188,52.06099194836679,0
77.19303492601364,70.45820000180959,1
97.77159928000232,86.7278223300282,1
62.07306379667647,96.76882412413983,1
91.56497449807442,88.69629254546599,1
79.94481794066932,74.16311935043758,1
99.2725269292572,60.99903099844988,1
90.54671411399852,43.39060180650027,1
34.52451385320009,60.39634245837173,0
50.2864961189907,49.80453881323059,0
49.58667721632031,59.80895099453265,0
97.64563396007767,68.86157272420604,1
32.57720016809309,95.59854761387875,0
74.24869136721598,69.82457122657193,1
71.79646205863379,78.45356224515052,1
75.3956114656803,85.75993667331619,1
35.28611281526193,47.02051394723416,0
56.25381749711624,39.26147251058019,0
30.05882244669796,49.59297386723685,0
44.66826172480893,66.45008614558913,0
66.56089447242954,41.09209807936973,0
40.45755098375164,97.53518548909936,1
49.07256321908844,51.88321182073966,0
80.27957401466998,92.11606081344084,1
66.74671856944039,60.99139402740988,1
32.72283304060323,43.30717306430063,0
64.0393204150601,78.03168802018232,1
72.34649422579923,96.22759296761404,1
60.45788573918959,73.09499809758037,1
58.84095621726802,75.85844831279042,1
99.82785779692128,72.36925193383885,1
47.26426910848174,88.47586499559782,1
50.45815980285988,75.80985952982456,1
60.45555629271532,42.50840943572217,0
82.22666157785568,42.71987853716458,0
88.9138964166533,69.80378889835472,1
94.83450672430196,45.69430680250754,1
67.31925746917527,66.58935317747915,1
57.23870631569862,59.51428198012956,1
80.36675600171273,90.96014789746954,1
68.46852178591112,85.59430710452014,1
42.0754545384731,78.84478600148043,0
75.47770200533905,90.42453899753964,1
78.63542434898018,96.64742716885644,1
52.34800398794107,60.76950525602592,0
94.09433112516793,77.15910509073893,1
90.44855097096364,87.50879176484702,1
55.48216114069585,35.57070347228866,0
74.49269241843041,84.84513684930135,1
89.84580670720979,45.35828361091658,1
83.48916274498238,48.38028579728175,1
42.2617008099817,87.10385094025457,1
99.31500880510394,68.77540947206617,1
55.34001756003703,64.9319380069486,1
74.77589300092767,89.52981289513276,1
The code above gives theta = Result.x as [-25.87282405 0.21193078 0.20722013]. This is the global minimum when initial_theta = np.zeros((n,1)), but with initial_theta = np.ones((n,1)) it gives an error. So the result depends on the initial value of theta. Can this be automated in some way so that the issue is avoided?
I also tried the 'BFGS' method instead of 'TNC' in the minimize call, as shown below, and then I get a RuntimeWarning.
initial_theta = np.zeros((n,1))
result = op.minimize(fun = CostFunc,
                     x0 = initial_theta,
                     args = (X,y),
                     method = 'BFGS',
                     jac = Gradient);
optimal_theta = result.x
I have called the function above several times with different initial values of initial_theta, and I found that BFGS converges to a local minimum most of the time. When I called BFGS with
initial_theta = np.array([-25, 0.2, 0.2])
which is near the global minimum, it converged. So it seems that TNC is better than BFGS, because with the same initial_theta in both cases, TNC converges to the global minimum while BFGS converges to a local one. So:
Is this true in all cases, or does it depend on the particular problem?
Which is better, BFGS or TNC?
Is there any difference between scipy.optimize.fmin_bfgs and scipy.optimize.minimize with method = 'BFGS', or are they the same?
Any help or insight would be appreciated. Thank you.
There is no practical algorithm that is guaranteed to find a global optimum. However, there are heuristics such as DIRECT (see e.g. here) that work very well in practice within given bounds. These can be used to find a good initialization for an algorithm that then finds the local optimum in the vicinity of that initialization and works more efficiently; a rough sketch of that two-stage idea follows.
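For illustration only (scipy's differential_evolution stands in for DIRECT here, and the objective f is a made-up toy function, not your cost):

import numpy as np
import scipy.optimize as op

def f(w):
    # Toy multimodal objective, invented for the sketch.
    return np.sum((w - np.array([3.0, -2.0])) ** 2) + np.sin(5 * w).sum()

bounds = [(-10, 10), (-10, 10)]
coarse = op.differential_evolution(f, bounds, maxiter=50, seed=0)   # bounded global heuristic
fine = op.minimize(f, x0=coarse.x, method='BFGS')                   # polish locally from there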
However, logistic regression is a convex optimization problem. That means the objective (error) function has only one minimum, i.e. any local minimum is the global minimum. Hence you can use any local optimizer (gradient descent, L-BFGS, conjugate gradient, ...). The only catch is that the minimum cannot be computed in closed form because of the nonlinear logistic function. The related problem without that logistic function is linear regression; in that case the global minimum of the error function can be computed directly, without any iterative optimization.
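For example, a minimal sketch of that closed-form solution on made-up toy data:

import numpy as np

X = np.array([[1., 1.], [1., 2.], [1., 3.], [1., 4.]])
y = np.array([1.1, 1.9, 3.2, 3.9])

# Least squares has an exact solution; no gradient descent needed.
w, residuals, rank, sv = np.linalg.lstsq(X, y)          # minimizes ||Xw - y||^2
w_normal = np.linalg.solve(X.T.dot(X), X.T.dot(y))      # same thing via the normal equations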
A comparison of optimizers for logistic regression can be found in Fabian Pedregosa's blog. My first guess would be that you have an error in your gradient computation. Maybe you should compare it to a numerical approximation of the gradient with scipy.optimize.check_grad.
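A minimal sketch of such a check, reusing the CostFunc, Gradient, X and y already defined in your question (so this assumes that code has been run):

import numpy as np
import scipy.optimize as op

# check_grad returns the norm of the difference between the analytic gradient
# and a finite-difference approximation; a tiny value means they agree.
theta0 = np.zeros(X.shape[1])
err = op.check_grad(CostFunc, Gradient, theta0, X, y)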
scipy.optimize.minimize with method = 'BFGS' and scipy.optimize.fmin_bfgs call the same underlying BFGS implementation; minimize is just the newer, unified interface.
This isn't possible with an efficient, general algorithm. You'll never really know what the cost function looked like on the inputs you didn't try. Perhaps there was some miracle trench running through a high plateau you ignored. Perhaps the cost function starts with if arg1 == secret: return -1e100. Who can say? If you really, absolutely need a global minimum, you either need to take advantage of extra knowledge about the cost function, or you need to try each and every single possible input.
Hi,
Trying to learn pymc3 (I never learned pymc2, so I'm jumping straight into the new stuff), and I suspect there is a very simple example/pseudocode for what I'm trying to do. Wondering if someone can help me out, as I've not made much progress over the past few hours...
My problem is to sample from a posterior in a rather straightforward manner. Let "x" be a vector, "t(x)" be a function (R^n --> R^n map) of that vector, and "D" be some observed data. I want to sample vectors x from
P( x | D ) \propto P( D | x ) P(x)
Usual Bayesian stuff. An example of how to do this using NUTS would be spectacular! My main problem seems to be getting the function t(x) to work appropriately, and have the model return samples from the posterior (rather than the prior).
Any and all help/hints appreciated. In the mean time I'll continue to try stuff out.
Best,
TJ
Your notation is a little confusing to me, but if I understand correctly, you want to sample from the likelihood (some function of the parameters and data) times the prior. And I agree - that's typical Bayesian stuff.
I think Bayesian logistic regression is a good example since we can't solve it analytically. Let's say our model is the following:
B ~ Normal(0, sigma2 * I)
p(y_i | B) = p_i ^ {y_i} (1 - p_i) ^{1 - y_i}
Where y_i is observed and p_i = 1 / (1 + exp(-z_i)) and
z_i = B_0 + B_1 * x_i
We'll assume sigma2 is known. After we load data into numpy arrays x and y, we can sample from the posterior with the following:
import pymc3 as pm
import theano.tensor as t

with pm.Model() as model:
    # Priors
    b0 = pm.Normal("b0", mu=0, tau=1e-6)
    b1 = pm.Normal("b1", mu=0, tau=1e-6)
    # Likelihood
    yhat = pm.Bernoulli("yhat", 1 / (1 + t.exp(-(b0 + b1*x))), observed=y)
    # Sample from the posterior
    trace = pm.sample(10000, pm.NUTS(), progressbar=False)
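Once sampling finishes, the posterior draws can be inspected, for example (assuming the block above has run):

b0_samples = trace['b0']      # posterior samples for the intercept
b1_samples = trace['b1']      # posterior samples for the slope
pm.traceplot(trace)           # quick visual check of the chains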
To see a full example, check out this iPython notebook:
http://nbviewer.ipython.org/gist/jbencook/9295751c917941208349
pymc3 also has a nice glm syntax. You can see how that works here:
http://jbencook.github.io/portfolio/bayesian_logistic_regression.html