Maximum likelihood linear regression tensorflow - python

After I implemented a LS estimation with gradient descent for a simple linear regression problem, I'm now trying to do the same with Maximum Likelihood.
I used this equation from wikipedia. The maximum has to be found.
import numpy as np
import tensorflow as tf
train_X = np.random.rand(100, 1) # all values [0-1)
train_Y = train_X
X = tf.placeholder("float", None)
Y = tf.placeholder("float", None)
theta_0 = tf.Variable(np.random.randn())
theta_1 = tf.Variable(np.random.randn())
var = tf.Variable(0.5)
hypothesis = tf.add(theta_0, tf.mul(X, theta_1))
lhf = 1 * (50 * np.log(2*np.pi) + 50 * tf.log(var) + (1/(2*var)) * tf.reduce_sum(tf.pow(hypothesis - Y, 2)))
op = tf.train.GradientDescentOptimizer(0.01).minimize(lhf)
This code works, but I still have some questions about it:
If I change the lhf function from 1 * to -1 * and minimize -lhf (according to the equation), it does not work. But why?
The value for lhf goes up and down during optimization. Shouldn't it only change in one direction?
The value for lhf sometimes is a NaN during optimization. How can I avoid that?
In the equation, σ² is the variance of the error (right?). My values are perfectly on a line. Why do I get a value of var above 100?

The symptoms in your question point to a common problem: the learning rate (step size) is probably too high for this problem.
The zig-zag behaviour, where the objective goes up and down instead of moving steadily in one direction, is typical when the learning rate is too high, especially when you also get NaNs.
The simplest solution is to lower the learning rate: divide your current learning rate by 10 until the learning curve is smooth and there are no NaNs or up-and-down behaviour.
As you are using TensorFlow, you can also try AdamOptimizer, which adjusts the step size dynamically as you train.
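To illustrate the "divide by 10 until the curve is smooth" advice, here is a minimal NumPy sketch. It is not the TensorFlow code from the question: it assumes plain batch gradient descent on a mean squared error, and the helper names gd_loss_history and is_smooth are made up for this example.

```python
import numpy as np

def gd_loss_history(X, y, lr, steps=500):
    """Batch gradient descent on the mean squared error; returns the loss per step."""
    w = np.zeros(X.shape[1])
    losses = []
    with np.errstate(over="ignore", invalid="ignore"):
        for _ in range(steps):
            resid = X @ w - y
            loss = np.mean(resid ** 2)
            losses.append(loss)
            if not np.isfinite(loss):
                break  # diverged: stop before inf/NaN values propagate further
            w -= lr * 2.0 * (X.T @ resid) / len(y)
    return np.array(losses)

def is_smooth(losses):
    """True if every loss is finite and the curve never goes up."""
    return bool(np.all(np.isfinite(losses)) and np.all(np.diff(losses) <= 1e-12))

rng = np.random.default_rng(0)
x = rng.random(100)                          # values in [0, 1), as in the question
X = np.column_stack([np.ones_like(x), x])    # intercept column plus the feature
y = x.copy()                                 # targets lie exactly on a line

lr = 10.0                                    # deliberately too high to start with
while not is_smooth(gd_loss_history(X, y, lr)):
    lr /= 10.0                               # zig-zag or overflow: divide by 10, retry

print("first learning rate with a smooth loss curve:", lr)
```

With the too-large rates the loss grows until it overflows, which is the same mechanism that produces the NaNs in the question. Once the rate is small enough the loss decreases at every step.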

Related

Gaussian process regression - explain behaviour

I'm looking into GP regression, but I'm getting some behaviour that I do not understand.
Basically, I wanted to show convergence for GP on the oscillatory Genz function (basically a periodic wave), which led me to this picture of the GP convergence. Sorry for the missing labels (x axis: number of samples, y axis: relative error measured at 2000 points).
This is OK, but I was curious why it took so long before the error started to drop. Plotting the resulting GP fit, I got this (busy) plot: the GP fit is orange, the true function is blue. What I don't understand is what happens up until it starts to capture the true function. I assumed it had something to do with the kernel. The plot here uses an RBF kernel with length_scale = 1 (I also tried both higher and lower values, but got the same results).
I kind of expected it to have a more smooth behaviour even if it couldn't capture the true model.
So, to my question: why do I see this "spikey" behaviour? And can I do something to change it (kernel-wise or other)?
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
kernel = RBF(length_scale = 1, length_scale_bounds = (1e-2, 1e2))
gp = GaussianProcessRegressor(kernel=kernel)
gp.fit(X, y)
def genz(x, method = 'default'):
    d = x.shape[1]
    a = 10/d
    w = 1/2
    num_points = x.shape[0]
    funcval = np.empty([1,num_points])
    for i in range(num_points):
        funcval[0,i] = np.cos(2 * np.pi * w + np.sum(a * x[i,:]))
    return funcval
It seems that the optimized length scale is very small compared to the size of the domain. I also found this library awkward when I dug into it: changing some hyperparameters and the number of optimization runs did not help me either. Changing the kernel to Matern and varying its gamma value might help, but not by much. If you really want to customize the model, I would recommend GPyTorch, which is built on PyTorch, or the GPML MATLAB toolbox.
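To see why a very small length scale gives a "spikey" fit, here is a rough NumPy sketch. It does not use scikit-learn. It assumes a zero-mean GP with an RBF kernel, a 1-D version of the function from the question, and a small jitter term for numerical stability. The helper names rbf, gp_mean and target are made up for this example.

```python
import numpy as np

def rbf(a, b, length_scale):
    """RBF (squared-exponential) kernel matrix between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_mean(x_train, y_train, x_test, length_scale, jitter=1e-8):
    """Posterior mean of a zero-mean GP for (almost) noise-free observations."""
    K = rbf(x_train, x_train, length_scale) + jitter * np.eye(len(x_train))
    return rbf(x_test, x_train, length_scale) @ np.linalg.solve(K, y_train)

def target(x):
    # 1-D version of the oscillatory function in the question (w = 1/2, a = 10)
    return np.cos(2 * np.pi * 0.5 + 10.0 * x)

x_train = np.linspace(0.0, 1.0, 11)      # samples spaced 0.1 apart
x_mid = x_train[:-1] + 0.05              # test points halfway between samples

mean_narrow = gp_mean(x_train, target(x_train), x_mid, length_scale=0.01)
mean_wide = gp_mean(x_train, target(x_train), x_mid, length_scale=0.2)

err_narrow = np.max(np.abs(mean_narrow - target(x_mid)))
err_wide = np.max(np.abs(mean_wide - target(x_mid)))
print("max |mean| between samples, narrow kernel:", np.max(np.abs(mean_narrow)))
print("max error, narrow vs. wide kernel:", err_narrow, err_wide)
```

When the length scale is much smaller than the spacing between samples, the posterior mean matches each training point and falls back to the prior mean (zero here) in between. That produces spikes at the samples until there are enough points to cover the domain.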

Bad quality of Viterbi Algorithm (HMM)

I've been trying to get into hidden Markov models and the Viterbi algorithm recently. I found a library called hmmlearn (http://hmmlearn.readthedocs.io/en/latest/tutorial.html) to help me generate a state sequence for two states (with Gaussian emissions). Then I wanted to re-determine the state sequence using Viterbi. My code works, but predicts approximately 5% of the states wrong (depending on the means and variances of the Gaussian emissions). The hmmlearn library has a .predict method which also uses Viterbi to determine the state sequence.
My problem now is that the Viterbi algorithm by hmmlearn is much better than my hand-written one (error rate is lower than 0.5% compared to my 5%). I couldn't find any major problem in my code, so I'm not sure why this is the case. Below is my code where I first generate the state and observation sequence Z and X, predict Z with hmmlearn and finally predict it with my own code:
# Import libraries
import numpy as np
import scipy.stats as st
from hmmlearn import hmm
# Generate a sequence
model = hmm.GaussianHMM(n_components = 2, covariance_type = "spherical")
model.startprob_ = pi
model.transmat_ = A
model.means_ = obs_means
model.covars_ = obs_covars
X, Z = model.sample(T)
## Predict the states from generated observations with the hmmlearn library
Z_pred = model.predict(X)
# Predict the state sequence with Viterbi by hand
B = np.concatenate((st.norm(mean_1,var_1).pdf(X), st.norm(mean_2,var_2).pdf(X)), axis = 1)
delta = np.zeros(shape = (T, 2))
psi = np.zeros(shape= (T, 2))
### Calculate starting values
for s in np.arange(2):
    delta[0, s] = np.log(pi[s]) + np.log(B[0, s])
psi = np.zeros((T, 2))
### Take everything in log space since values get very low as t -> T
for t in range(1,T):
    for s_post in range(0, 2):
        delta[t, s_post] = np.max([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1) + np.log(B[t, s_post])
        psi[t, s_post] = np.argmax([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1)
### Backtrack
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
    states[t] = psi[t+1, states[t+1]]
I'm not sure if I have a big error in my code or if hmmlearn just uses a more refined Viterbi algorithm. I have noticed by looking into the falsely predicted states that the impact of the emission probability B seems to be too big as it causes the states to change too frequently even if the transition probability to go to the other state is really low.
I'm rather new to python so please excuse my ugly coding. Thanks in advance for any tips you might have!
Edit: As you can see in the code, I'm stupid and used variances instead of the standard deviation to determine the emission probabilities. After fixing this, I get the same result as the implemented Viterbi algorithm.
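For reference, the second positional argument of scipy.stats.norm is the scale, i.e. the standard deviation, so a variance has to go through a square root first. The sketch below avoids the issue by computing the Gaussian log-density directly from the variance in NumPy. It is a minimal log-space Viterbi, not hmmlearn's implementation. The function names and the toy parameters are invented for illustration.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of N(mean, var); takes the variance, not the standard deviation."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi(log_pi, log_A, log_B):
    """Most likely state path given log start, transition and emission scores."""
    T, S = log_B.shape
    delta = np.empty((T, S))
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: come from i, go to j
        psi[t] = np.argmax(scores, axis=0)       # best predecessor for each state j
        delta[t] = np.max(scores, axis=0) + log_B[t]
    states = np.empty(T, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):               # backtrack along stored predecessors
        states[t] = psi[t + 1, states[t + 1]]
    return states

# Toy two-state example (values chosen only for illustration)
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])
means = np.array([0.0, 5.0])
variances = np.array([0.25, 0.25])
x = np.array([0.1, -0.2, 5.1, 4.9, 0.0])

log_B = gaussian_logpdf(x[:, None], means[None, :], variances[None, :])
path = viterbi(np.log(pi), np.log(A), log_B)
print(path)
```

Passing a variance below 1 where a standard deviation is expected makes the emission densities too sharp. That would match the observation in the question that the emission term dominated the transition term.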

Gradient Descent Algorithm in Python

I am trying to write a gradient descent function in python as part of a multivariate linear regression exercise. It runs, but does not compute the correct answer. My code is below. I've been trying for weeks to finish this problem but have made zero progress.
I believe that I understand the concept of gradient descent to optimize a multivariate linear regression function and also that the 'math' is correct. I believe that the error is in my code, but I am still learning python. Your help is very much appreciated.
def regression_gradient_descent(feature_matrix,output,initial_weights,step_size,tolerance):
    from math import sqrt
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        predictions = np.dot(feature_matrix,weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            derivative = -2 * np.dot(errors[i],feature_matrix[i])
            gradient_sum_squares = gradient_sum_squares + np.dot(derivative, derivative)
            weights[i] = weights[i] - step_size * derivative[i]
        gradient_magnitude = sqrt(gradient_sum_squares)
        print gradient_magnitude
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
Feature matrix is:
sales = gl.SFrame.read_csv('kc_house_data.csv',column_type_hints = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int,'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int,'view':int})
I'm calling the function as:
train_data,test_data = sales.random_split(.8,seed=0)
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights,step_size,tolerance)
**get_numpy_data is just a function to convert everything into arrays and works as intended
Update: I fixed the formula to:
derivative = 2 * np.dot(errors,feature_matrix)
and it seems to have worked. The derivation of this formula in my online course used
-2 * np.dot(errors,feature_matrix)
and I'm not sure why this formula did not provide the correct answer.
The step size seems too small, and the tolerance unusually large. Perhaps you meant to use them the other way around?
In general, the step size is determined by trial and error: the "natural" step size α=1 might lead to divergence, so one could try lowering the value (e.g. taking α=1/2, α=1/4, etc.) until convergence is achieved. Don't start with a very small step size.
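On the sign question in the update: with errors defined as predictions - output, as in the posted code, the gradient of the residual sum of squares is 2 * X^T errors. A factor of -2 only appears when the errors are defined the other way around, as output - predictions. A quick way to check a sign like this is to compare against a finite-difference gradient. The sketch below does that on made-up random data, not the housing data. The function names are hypothetical.

```python
import numpy as np

def rss(w, X, y):
    errors = X @ w - y              # same convention as the question: predictions - output
    return errors @ errors

def rss_gradient(w, X, y):
    errors = X @ w - y
    return 2.0 * (X.T @ errors)     # +2 because errors = predictions - output

def numerical_gradient(f, w, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.random(20)])   # constant column plus one feature
y = rng.random(20)
w = np.array([0.5, -1.0])

analytic = rss_gradient(w, X, y)
numeric = numerical_gradient(lambda v: rss(v, X, y), w)
print("analytic:", analytic)
print("numeric: ", numeric)
```

If the two vectors agree, the sign and scale of the analytic gradient are right. If they differ by a factor of -1, the error convention and the formula are mismatched.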

Spark mllib predicting weird number or NaN

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)
    return LabeledPoint(rev, split)
def sanitize(value):
    return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size, MLlib assumes stepSize = 1. This does not seem to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)
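To get a feeling for why a step size of 1 blows up on this data, here is a small NumPy sketch using the seven rows from the question. It is plain batch gradient descent on a mean squared error, not Spark's SGD implementation, so treat it as an illustration of the magnitudes involved. Standardizing the features and the label is an extra assumption of this sketch, shown as one common way to make an ordinary step size workable.

```python
import numpy as np

# The seven rows from the question; the third column is the label (rev).
rows = np.array([
    [365, 4, 41401.387, 5330569],
    [364, 3, 51517.886, 5946290],
    [363, 2, 55059.838, 6097388],
    [362, 1, 43780.977, 5304694],
    [361, 7, 46447.196, 5471836],
    [360, 6, 50656.121, 5849862],
    [359, 5, 44494.476, 5460289],
], dtype=float)
y = rows[:, 2]
X = rows[:, [0, 1, 3]]

def gradient(w, X, y):
    """Gradient of the mean squared error for a linear model without intercept."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

# One step of size 1 from w = 0 on the raw features: w(1) = w(0) - 1 * g
w_raw = np.zeros(3) - 1.0 * gradient(np.zeros(3), X, y)

# Standardize features and label, then apply the same update with step size 0.1.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()
w = np.zeros(3)
losses = []
for _ in range(200):
    losses.append(np.mean((Xs @ w - ys) ** 2))
    w -= 0.1 * gradient(w, Xs, ys)

print("largest raw weight after one step:", np.abs(w_raw).max())
print("scaled loss, first and last:", losses[0], losses[-1])
```

One of the features has values in the millions, so a single raw step already produces enormous weights, and further steps overflow. After rescaling, the same update rule behaves with a moderate step size.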

Python implemented Gradient Descent Algorithm won't converge

I've implemented a gradient descent algorithm in Python, and it just does not converge when it runs. While debugging it, I found that I have to make alpha very small for it to seem to converge: alpha has to be as small as about 1e-12.
Here is my code
def batchGradDescent(dataMat, labelMat):
    dataMatrix = mat(dataMat)
    labelMatrix = mat(labelMat).transpose()
    m, n = shape(dataMatrix)
    cycle = 1000000
    alpha = 7e-11
    saved_weights = ones((n,1))
    weights = saved_weights
    for k in range(cycle):
        hypothesis = dataMatrix * weights
        saved_weights = weights
        error = labelMatrix - hypothesis
        weights = saved_weights + alpha * dataMatrix.transpose() * error
        print weights-saved_weights
    return weights
And my dataset looks like this (one row):
800 0 0.3048 71.3 0.00266337 126.201
The first five elements are features and the last one is the label.
Could anyone provide help? I'm really frustrated here. I think my algorithm is theoretically right. Is it about the normalization on the dataset?
Thank you.
