I've implemented a gradient descent algorithm in Python, and it does not converge when I run it. While debugging, I found that I have to make alpha extremely small, on the order of 1e-12, before it even seems to converge.
Here is my code
from numpy import mat, ones, shape

def batchGradDescent(dataMat, labelMat):
    dataMatrix = mat(dataMat)
    labelMatrix = mat(labelMat).transpose()
    m, n = shape(dataMatrix)
    cycle = 1000000
    alpha = 7e-11
    saved_weights = ones((n,1))
    weights = saved_weights
    for k in range(cycle):
        hypothesis = dataMatrix * weights
        saved_weights = weights
        error = labelMatrix - hypothesis
        # batch update: step along X^T (y - Xw), scaled by alpha
        weights = saved_weights + alpha * dataMatrix.transpose() * error
        # size of the last update, printed to watch convergence
        print weights-saved_weights
    return weights
My dataset looks like this (one row):
800 0 0.3048 71.3 0.00266337 126.201
First five elements are features and the last one is the label.
Could anyone help? I'm really frustrated here. I think my algorithm is theoretically right. Could the problem be that the dataset is not normalized?
Thank you.
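On the normalization question: features on very different scales (800 next to 0.00266337 in the sample row) typically force a tiny alpha. Below is a minimal sketch, not the original code. It assumes plain NumPy arrays instead of numpy.matrix, synthetic stand-in data rather than the real dataset, a gradient averaged over the m rows, and a hypothetical helper named standardize.

```python
import numpy as np

def standardize(X):
    """Scale each column to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid dividing by zero for constant columns
    return (X - mu) / sigma, mu, sigma

def batch_grad_descent(X, y, alpha=0.1, cycles=2000):
    """Batch gradient descent for least squares with an intercept column."""
    m = X.shape[0]
    Xb = np.hstack([np.ones((m, 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(cycles):
        error = y - Xb @ w
        # assumption: the gradient is averaged over m, unlike the code above
        w = w + alpha * (Xb.T @ error) / m
    return w

# Synthetic stand-in data with column scales similar to the sample row
rng = np.random.default_rng(0)
X_raw = np.column_stack([
    rng.uniform(200, 20000, 200),
    rng.uniform(0, 20, 200),
    rng.uniform(0.02, 0.3, 200),
    rng.uniform(30, 70, 200),
    rng.uniform(0.0004, 0.05, 200),
])
true_w = np.array([-0.001, -0.4, -30.0, 0.1, -150.0])
y = 130.0 + X_raw @ true_w

X_std, mu, sigma = standardize(X_raw)
w = batch_grad_descent(X_std, y)
residual = y - np.hstack([np.ones((200, 1)), X_std]) @ w
```

With standardized columns, a step size around 0.1 is usually workable; new inputs have to be transformed with the same mu and sigma before prediction.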
Related
I've been trying to get into hidden Markov models and the Viterbi algorithm recently. I found a library called hmmlearn (http://hmmlearn.readthedocs.io/en/latest/tutorial.html) to help me generate a state sequence for two states (with Gaussian emissions). Then I wanted to re-determine the state sequence using Viterbi. My code works, but predicts approximately 5% of the states wrong (depending on the means and variances of the Gaussian emissions). The hmmlearn library has a .predict method which also uses Viterbi to determine the state sequence.
My problem now is that the Viterbi algorithm by hmmlearn is much better than my hand-written one (error rate is lower than 0.5% compared to my 5%). I couldn't find any major problem in my code, so I'm not sure why this is the case. Below is my code where I first generate the state and observation sequence Z and X, predict Z with hmmlearn and finally predict it with my own code:
# Import libraries
import numpy as np
import scipy.stats as st
from hmmlearn import hmm
# Generate a sequence
model = hmm.GaussianHMM(n_components = 2, covariance_type = "spherical")
model.startprob_ = pi
model.transmat_ = A
model.means_ = obs_means
model.covars_ = obs_covars
X, Z = model.sample(T)
## Predict the states from generated observations with the hmmlearn library
Z_pred = model.predict(X)
# Predict the state sequence with Viterbi by hand
B = np.concatenate((st.norm(mean_1,var_1).pdf(X), st.norm(mean_2,var_2).pdf(X)), axis = 1)
delta = np.zeros(shape = (T, 2))
psi = np.zeros(shape= (T, 2))
### Calculate starting values
for s in np.arange(2):
    delta[0, s] = np.log(pi[s]) + np.log(B[0, s])
psi = np.zeros((T, 2))
### Take everything in log space since values get very low as t -> T
for t in range(1,T):
    for s_post in range(0, 2):
        delta[t, s_post] = np.max([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1) + np.log(B[t, s_post])
        psi[t, s_post] = np.argmax([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1)
### Backtrack
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
    states[t] = psi[t+1, states[t+1]]
I'm not sure if I have a big error in my code or if hmmlearn just uses a more refined Viterbi algorithm. I have noticed by looking into the falsely predicted states that the impact of the emission probability B seems to be too big as it causes the states to change too frequently even if the transition probability to go to the other state is really low.
I'm rather new to python so please excuse my ugly coding. Thanks in advance for any tips you might have!
Edit: As you can see in the code, I'm stupid and used variances instead of the standard deviation to determine the emission probabilities. After fixing this, I get the same result as the implemented Viterbi algorithm.
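For reference, the fix described in the edit can be sketched as a small log-space Viterbi function. This is an illustrative rewrite, not the original code: the function names are hypothetical, and the Gaussian log-density is computed directly with NumPy. With scipy.stats.norm the equivalent call would pass scale=np.sqrt(variance), because the scale argument is a standard deviation.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log-density of N(mean, var); note that var is a variance, not a std."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def viterbi_gaussian(X, pi, A, means, variances):
    """Most likely state sequence for a Gaussian-emission HMM, in log space."""
    X = np.asarray(X, dtype=float).reshape(-1)
    T, K = X.shape[0], len(pi)
    log_A = np.log(A)
    log_B = np.column_stack(
        [gaussian_logpdf(X, means[k], variances[k]) for k in range(K)]
    )
    delta = np.zeros((T, K))
    psi = np.zeros((T, K), dtype=int)
    delta[0] = np.log(pi) + log_B[0]
    for t in range(1, T):
        # scores[i, j]: best log-probability of being in i at t-1, then moving to j
        scores = delta[t - 1][:, None] + log_A
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = np.max(scores, axis=0) + log_B[t]
    # Backtrack from the best final state
    states = np.zeros(T, dtype=int)
    states[-1] = np.argmax(delta[-1])
    for t in range(T - 2, -1, -1):
        states[t] = psi[t + 1, states[t + 1]]
    return states

# Toy usage with two well-separated states
X_demo = np.array([0.0, 0.1, -0.1, 5.0, 5.2, 4.9]).reshape(-1, 1)
states_demo = viterbi_gaussian(
    X_demo, pi=[0.5, 0.5], A=[[0.9, 0.1], [0.1, 0.9]],
    means=[0.0, 5.0], variances=[1.0, 1.0],
)
```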
I am trying to write a gradient descent function in python as part of a multivariate linear regression exercise. It runs, but does not compute the correct answer. My code is below. I've been trying for weeks to finish this problem but have made zero progress.
I believe that I understand the concept of gradient descent to optimize a multivariate linear regression function and also that the 'math' is correct. I believe that the error is in my code, but I am still learning python. Your help is very much appreciated.
def regression_gradient_descent(feature_matrix,output,initial_weights,step_size,tolerance):
    from math import sqrt
    converged = False
    weights = np.array(initial_weights)
    while not converged:
        predictions = np.dot(feature_matrix,weights)
        errors = predictions - output
        gradient_sum_squares = 0
        for i in range(len(weights)):
            derivative = -2 * np.dot(errors[i],feature_matrix[i])
            gradient_sum_squares = gradient_sum_squares + np.dot(derivative, derivative)
            weights[i] = weights[i] - step_size * derivative[i]
        gradient_magnitude = sqrt(gradient_sum_squares)
        print gradient_magnitude
        if gradient_magnitude < tolerance:
            converged = True
    return(weights)
Feature matrix is:
sales = gl.SFrame.read_csv('kc_house_data.csv',column_type_hints = {'bathrooms':float, 'waterfront':int, 'sqft_above':int, 'sqft_living15':float,'grade':int, 'yr_renovated':int, 'price':float, 'bedrooms':float, 'zipcode':str,'long':float, 'sqft_lot15':float, 'sqft_living':float, 'floors':str, 'condition':int,'lat':float, 'date':str, 'sqft_basement':int, 'yr_built':int, 'id':str, 'sqft_lot':int,'view':int})
I'm calling the function as:
train_data,test_data = sales.random_split(.8,seed=0)
simple_features = ['sqft_living']
my_output= 'price'
(simple_feature_matrix, output) = get_numpy_data(train_data, simple_features, my_output)
initial_weights = np.array([-47000., 1.])
step_size = 7e-12
tolerance = 2.5e7
simple_weights = regression_gradient_descent(simple_feature_matrix, output,initial_weights,step_size,tolerance)
Note: get_numpy_data is just a function that converts everything into arrays, and it works as intended.
Update: I fixed the formula to:
derivative = 2 * np.dot(errors,feature_matrix)
and it seems to have worked. The derivation of this formula in my online course used
-2 * np.dot(errors,feature_matrix)
and I'm not sure why this formula did not provide the correct answer.
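One likely explanation, assuming the course defines errors as output - predictions while the code above uses predictions - output: the sign in front of the 2 flips with that definition, so both formulas describe the same gradient. The sketch below uses synthetic data and illustrative helper names (rss, numerical_gradient) to check both forms against a finite-difference gradient of the residual sum of squares.

```python
import numpy as np

def rss(weights, feature_matrix, output):
    """Residual sum of squares for a linear model."""
    errors = np.dot(feature_matrix, weights) - output
    return np.dot(errors, errors)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(1)
feature_matrix = np.column_stack([np.ones(50), rng.normal(size=50)])
output = rng.normal(size=50)
weights = np.array([0.5, -1.5])

predictions = np.dot(feature_matrix, weights)
# Convention used in the code above: errors = predictions - output
grad_a = 2 * np.dot(predictions - output, feature_matrix)
# Assumed course convention: errors = output - predictions
grad_b = -2 * np.dot(output - predictions, feature_matrix)
grad_num = numerical_gradient(lambda w: rss(w, feature_matrix, output), weights)
```

All three gradients agree, so combining -2 with predictions - output points uphill rather than downhill.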
The step size seems too small, and the tolerance unusually big. Perhaps you meant to use them the other way around?
In general, the step size is determined by trial and error: the "natural" step size α=1 might lead to divergence, so one could lower the value (e.g. taking α=1/2, α=1/4, etc.) until convergence is achieved. Don't start with a very small step size.
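The halving procedure can be sketched as follows. This is only an illustration on synthetic data: the function names are hypothetical, and the stability check (finite losses that end lower than they started) is deliberately crude.

```python
import numpy as np

def gradient_descent(X, y, alpha, n_iter=200):
    """Return the final weights and the mean squared error at each iteration."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(n_iter):
        residual = X @ w - y
        losses.append(float(np.mean(residual ** 2)))
        w = w - alpha * 2 * (X.T @ residual) / len(y)
    return w, losses

def halve_until_stable(X, y, alpha=1.0, max_halvings=60):
    """Start from alpha and halve it until the run no longer diverges."""
    for _ in range(max_halvings):
        with np.errstate(over="ignore", invalid="ignore"):
            w, losses = gradient_descent(X, y, alpha)
        if np.all(np.isfinite(losses)) and losses[-1] < losses[0]:
            return alpha, w
        alpha /= 2
    raise RuntimeError("no stable step size found")

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.uniform(0, 10, 100)])
y = 3.0 + 2.0 * X[:, 1]
alpha, w = halve_until_stable(X, y)
```

The step size found this way is only a starting point; rescaling the features usually allows a much larger one.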
I'm trying to replicate the data analysis from a paper by Richard McElreath, in which he fitted the data with a hierarchical zero-inflated Gamma model. The data are the hunting returns of around 15000 hunting trips by about 150 hunters over twenty years. Because a good many hunting trips have zero returns, the model assumes each trip has probability pi of zero returns and probability 1 - pi of positive returns, which follow a Gamma distribution with parameters alpha and beta.
The predictor variable is age; the model uses an age polynomial (up to order 3) to model pi and alpha. Since the 15000 trips belong to 150 individual hunters, each hunter has coefficients of his own, and all the coefficients follow a common multivariate normal distribution. For details of the model, please refer to the following code. The model specification seems all right, but NUTS has trouble sampling: it produced only about 10 samples in about 20 minutes, then stalled and reported that it would take hundreds of hours to finish. I want to know what is causing the problem.
The usual imports
import pymc3 as pm
import numpy as np
from pymc3.distributions import Continuous, Gamma
import theano.tensor as tt
The data can be obtained from github
n_trip = len(d)
n_hunter = len(d['hunter.id'].unique())
idx_hunter = d['hunter.id'].values
y = d['kg.meat'].values
age = d['age.s'].values
age2 = (d['age.s'].values)**2
age3 = (d['age.s'].values)**3
The log probability density function for Zero inflated Gamma.
class ZeroInflatedGamma(Continuous):
    def __init__(self, alpha, beta, pi, *args, **kwargs):
        super(ZeroInflatedGamma, self).__init__(*args, **kwargs)
        self.alpha = alpha
        self.beta = beta
        self.pi = pi = tt.as_tensor_variable(pi)
        self.gamma = Gamma.dist(alpha, beta)

    def logp(self, value):
        return tt.switch(value > 0,
                         tt.log(1 - self.pi) + self.gamma.logp(value),
                         tt.log(self.pi))
This index matrix maps the flat output of the correlation prior onto a 9x9 matrix; the LKJ prior in pymc3 is given as a one-dimensional vector.
dim = 9
n_elem = dim * (dim - 1) // 2  # integer division keeps this an int on Python 3 as well
tri_index = np.zeros([dim, dim], dtype=int)
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)
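To make the indexing trick concrete, here is the same construction for dim = 3 in plain NumPy. The vector corr_vector is a made-up stand-in for the flat LKJCorr output.

```python
import numpy as np

dim = 3
n_elem = dim * (dim - 1) // 2
tri_index = np.zeros([dim, dim], dtype=int)
# fill the upper triangle with 0..n_elem-1, then mirror it into the lower triangle
tri_index[np.triu_indices(dim, k=1)] = np.arange(n_elem)
tri_index[np.triu_indices(dim, k=1)[::-1]] = np.arange(n_elem)

corr_vector = np.array([0.1, 0.2, 0.3])  # hypothetical off-diagonal correlations
C = corr_vector[tri_index]               # fancy indexing builds a symmetric matrix
np.fill_diagonal(C, 1.0)                 # the diagonal of a correlation matrix is 1
```

Here tri_index is [[0, 0, 1], [0, 0, 2], [1, 2, 0]]; its diagonal entries point at element 0 but are overwritten by the fill, which is what tt.fill_diagonal does in the model.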
And here is the model
with pm.Model() as Vary9_model:
    # hyper-priors
    mu_a = pm.Normal('mu_a', mu=0, sd=100, shape=9)
    sigma_a = pm.HalfCauchy('sigma_a', 5, shape=9)
    # build the covariance matrix
    C_triu = pm.LKJCorr('C_triu', n=2, p=9)
    C = tt.fill_diagonal(C_triu[tri_index], 1)
    sigma_diag = tt.nlinalg.diag(sigma_a)
    cov = tt.nlinalg.matrix_dot(sigma_diag, C, sigma_diag)
    # priors for each hunter and all the linear components, 9 dimensional Gaussian
    a = pm.MvNormal('a', mu=mu_a, cov=cov, shape=(n_hunter, 9))
    # linear function
    mupi = a[:,0][idx_hunter] + a[:,1][idx_hunter] * age + a[:,2][idx_hunter] * age2 + a[:,3][idx_hunter] * age3
    mualpha = a[:,4][idx_hunter] + a[:,5][idx_hunter] * age + a[:,6][idx_hunter] * age2 + a[:,7][idx_hunter] * age3
    pi = pm.Deterministic('pi', pm.math.sigmoid(mupi))
    alpha = pm.Deterministic('alpha', pm.math.exp(mualpha))
    beta = pm.Deterministic('beta', pm.math.exp(a[:,8][idx_hunter]))
    y_obs = ZeroInflatedGamma('y_obs', alpha, beta, pi, observed=y)
    Vary9_trace = pm.sample(6000, njobs=2)
And this is the status of the model:
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -28,366: 100%|██████████| 200000/200000 [15:36<00:00, 213.57it/s]
Finished [100%]: Average ELBO = -28,365
0%| | 22/6000 [15:51<63:49:25, 38.44s/it]
I have some thoughts on the problem, but I'm not sure which of them, if any, is the cause:
Is the nine-dimensional Gaussian too difficult to sample from? Previously I modeled only the intercepts of mualpha and mupi as a bivariate Gaussian; that was slow but worked (fitting took about 20 minutes).
Is the probability density causing the problem? I wrote the density function myself and am not sure it behaves correctly. I think it is not differentiable at zero; will that cause trouble for the NUTS sampler?
Is it because the predictor variables are highly correlated? The linear components in this model are polynomials of age up to the third degree, so the predictors are naturally highly correlated.
Or is it something else?
As a side note, I also tried the Metropolis sampler: my computer ran out of memory and the chains still had not converged.
The ZeroInflatedGamma looks fine. The density function is differentiable with respect to pi, alpha and beta. That is all you need for an observed variable. You only need the derivatives with respect to the value if you are trying to estimate the values.
There was an issue in the implementation of LKJCorr:
https://github.com/pymc-devs/pymc3/pull/1863
You could try again on master. Sadly, pymc3 does not yet support MvNormal and LKJCorr in a Cholesky-decomposed parametrization, which might help too. There is a work-in-progress pull request for this on GitHub:
https://github.com/pymc-devs/pymc3/pull/1875
To improve convergence you could try a non-centered parameterization for a. Something along the lines of
a_raw = pm.Normal('a_raw', shape=(9, n_hunter))
# transpose so that a has shape (n_hunter, 9), matching a[:,k][idx_hunter] in the model
a = mu_a[None, :] + tt.dot(tt.slinalg.cholesky(cov), a_raw).T
Of course this would be faster if we had that cholesky LKJCorr...
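The idea behind the non-centered parameterization can be checked outside of pymc3. Below is a plain NumPy sketch with made-up values for mu_a and cov (both are assumptions for illustration only): standard normal draws are scaled by the Cholesky factor and shifted by the mean, which gives draws distributed as a multivariate normal with that mean and covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hunter, dim = 150, 9

# Hypothetical mean vector and a positive definite covariance matrix
mu_a = rng.normal(size=dim)
M = rng.normal(size=(dim, dim))
cov = M @ M.T + dim * np.eye(dim)

L = np.linalg.cholesky(cov)                   # lower triangular, cov == L @ L.T
a_raw = rng.standard_normal((dim, n_hunter))  # independent N(0, 1) draws
a = mu_a[None, :] + (L @ a_raw).T             # one row per hunter, shape (n_hunter, dim)
```

The sampler then only has to explore the independent standard normals in a_raw, which tends to be easier than sampling the correlated a directly.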
After I implemented a LS estimation with gradient descent for a simple linear regression problem, I'm now trying to do the same with Maximum Likelihood.
I used this equation from wikipedia. The maximum has to be found.
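The equation itself is not reproduced in this post. Assuming it is the standard Gaussian log-likelihood for simple linear regression with n observations, it would read:

```latex
\log L(\theta_0, \theta_1, \sigma^2)
  = -\frac{n}{2}\log(2\pi)
    - \frac{n}{2}\log\sigma^2
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \theta_0 - \theta_1 x_i\right)^2
```

Under that assumption, and with n = 100 data points, the expression assigned to lhf in the code is the negative of this log-likelihood, so minimizing lhf corresponds to maximizing the likelihood.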
import numpy as np
import tensorflow as tf

train_X = np.random.rand(100, 1) # all values [0-1)
train_Y = train_X
X = tf.placeholder("float", None)
Y = tf.placeholder("float", None)
theta_0 = tf.Variable(np.random.randn())
theta_1 = tf.Variable(np.random.randn())
var = tf.Variable(0.5)
hypothesis = tf.add(theta_0, tf.mul(X, theta_1))
lhf = 1 * (50 * np.log(2*np.pi) + 50 * tf.log(var) + (1/(2*var)) * tf.reduce_sum(tf.pow(hypothesis - Y, 2)))
op = tf.train.GradientDescentOptimizer(0.01).minimize(lhf)
This code works, but I still have some questions about it:
If I change the lhf function from 1 * to -1 * and minimize -lhf (according to the equation), it does not work. But why?
The value for lhf goes up and down during optimization. Shouldn't it only change in one direction?
The value for lhf sometimes is a NaN during optimization. How can I avoid that?
In the equation, σ² is the variance of the error (right?). My values are perfectly on a line. Why do I get a value of var above 100?
The symptoms in your question indicate a common problem: the learning rate or step size might be too high for the problem.
The zig-zag behaviour, where the function being optimized goes up and down, is typical of a learning rate that is too high, especially when you also get NaNs.
The simplest solution is to lower the learning rate: divide your current learning rate by 10 until the learning curve is smooth and there are no NaNs or up-and-down behaviour.
Since you are using TensorFlow, you can also try AdamOptimizer, which adapts the effective step size dynamically as you train.
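A toy illustration of the effect, using a one-dimensional quadratic rather than the likelihood from the question (the function name and constants are made up): with a learning rate that is too high, every step overshoots the minimum, so the iterate flips sign and grows; dividing the rate by 10 makes it shrink steadily.

```python
def run_gradient_descent(learning_rate, curvature=10.0, w0=1.0, steps=8):
    """Minimise f(w) = 0.5 * curvature * w**2 and return the visited values of w."""
    w = w0
    history = []
    for _ in range(steps):
        history.append(w)
        # the gradient of f is curvature * w, so w is multiplied by
        # (1 - learning_rate * curvature) at every step
        w = w - learning_rate * curvature * w
    return history

too_high = run_gradient_descent(0.25)   # factor -1.5: sign flips, magnitude grows
lower = run_gradient_descent(0.025)     # factor 0.75: smooth decay towards 0
```

In a real model the loss is not a single quadratic, so the overshoot shows up as a loss that jumps up and down and eventually overflows to NaN.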
I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
    split = map(sanitize, line.split(','))
    rev = split.pop(-2)
    return LabeledPoint(rev, split)

def sanitize(value):
    return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?
The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
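The update rule can be sketched in plain NumPy. This is not MLlib's implementation, only a minimal mini-batch SGD for least squares on synthetic data, with hypothetical names. It shows how the same step size s that works for features of order 1 blows up once the features are several orders of magnitude larger, as in the dataset above.

```python
import numpy as np

def sgd_linear_regression(X, y, s, n_iter=500, batch_size=10, seed=0):
    """Mini-batch SGD for least squares: w(i+1) = w(i) - s * g."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        g = 2 * X[idx].T @ residual / batch_size  # gradient on the sampled points
        w = w - s * g                             # step against the gradient
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(350, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w_small = sgd_linear_regression(X, y, s=0.05)
with np.errstate(over="ignore", invalid="ignore"):
    # same step size, but features scaled up by 1e4: the iterates overflow
    w_large = sgd_linear_regression(X * 1e4, y, s=0.05)
```

Rescaling the features or shrinking the step size are the two usual remedies.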
Since you're not providing an explicit step size, MLlib falls back to its default step size of 1, which does not seem to work for your use case. I'd recommend trying different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves (in the Python API the arguments are named iterations and step):
LinearRegressionWithSGD.train(parsedData, iterations=10, step=0.001)