What determines whether my Python gradient descent algorithm converges?

I've implemented a single-variable linear regression model in Python that uses gradient descent to find the intercept and slope of the best-fit line (I'm using gradient descent rather than computing the optimal values for intercept and slope directly because I'd eventually like to generalize to multiple regression).
The data I am using are below. sales is the dependent variable (in dollars) and temp is the independent variable (in degrees Celsius); think ice cream sales vs. temperature, or something similar.
sales temp
215 14.20
325 16.40
185 11.90
332 15.20
406 18.50
522 22.10
412 19.40
614 25.10
544 23.40
421 18.10
445 22.60
408 17.20
And this is my data after it has been normalized:
sales temp
0.06993007 0.174242424
0.326340326 0.340909091
0 0
0.342657343 0.25
0.515151515 0.5
0.785547786 0.772727273
0.529137529 0.568181818
1 1
0.836829837 0.871212121
0.55011655 0.46969697
0.606060606 0.810606061
0.51981352 0.401515152
My code for the algorithm:
import numpy as np
import pandas as pd
from scipy import stats


class SLRegression(object):
    def __init__(self, learnrate=.01, tolerance=.000000001, max_iter=10000):
        # Initialize learnrate, tolerance, and max_iter.
        self.learnrate = learnrate
        self.tolerance = tolerance
        self.max_iter = max_iter

    # Define the gradient descent algorithm.
    def fit(self, data):
        # data : array-like, shape = [m_observations, 2_columns]
        # Initialize local variables.
        converged = False
        m = data.shape[0]
        # Track number of iterations.
        self.iter_ = 0
        # Initialize theta0 and theta1.
        self.theta0_ = 0
        self.theta1_ = 0
        # Compute the cost function.
        J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
        print('J is: ', J)
        # Iterate over each point in data and update theta0 and theta1 on each pass.
        while not converged:
            diftemp0 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) for i in range(m)])
            diftemp1 = (1.0/m) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0]) * data[i][1] for i in range(m)])
            # Subtract the learnrate * partial derivative from theta0 and theta1.
            temp0 = self.theta0_ - (self.learnrate * diftemp0)
            temp1 = self.theta1_ - (self.learnrate * diftemp1)
            # Update theta0 and theta1.
            self.theta0_ = temp0
            self.theta1_ = temp1
            # Compute the updated cost function, given new theta0 and theta1.
            new_J = (1.0/(2.0*m)) * sum([(self.theta0_ + self.theta1_*data[i][1] - data[i][0])**2 for i in range(m)])
            print('New J is: %s' % (new_J))
            # Test for convergence.
            if abs(J - new_J) <= self.tolerance:
                converged = True
                print('Model converged after %s iterations!' % (self.iter_))
            # Set old cost equal to new cost and update iter.
            J = new_J
            self.iter_ += 1
            # Test whether we have hit max_iter.
            if self.iter_ == self.max_iter:
                converged = True
                print('Maximum iterations have been reached!')
        return self

    def point_forecast(self, x):
        # Given feature value x, returns the regression's predicted value for y.
        return self.theta0_ + self.theta1_ * x


# Run the algorithm on a data set.
if __name__ == '__main__':
    # Load in the .csv file.
    data = np.squeeze(np.array(pd.read_csv('sales_normalized.csv')))
    # Create a regression model with the default learning rate, tolerance, and maximum number of iterations.
    slregression = SLRegression()
    # Call the fit function and pass in the data.
    slregression.fit(data)
    # Print out the results.
    print('After %s iterations, the model converged on Theta0 = %s and Theta1 = %s.'
          % (slregression.iter_, slregression.theta0_, slregression.theta1_))
    # Compare our model to scipy linregress model.
    slope, intercept, r_value, p_value, slope_std_error = stats.linregress(data[:, 1], data[:, 0])
    print('Scipy linear regression gives intercept: %s and slope = %s.' % (intercept, slope))
    # Test the model with a point forecast.
    print('As an example, our algorithm gives y = %s given x = .87.' % (slregression.point_forecast(.87)))  # Should be about .83.
    print('The true y-value for x = .87 is about .8368.')
I'm having trouble understanding exactly what allows the algorithm to converge versus return values that are completely wrong. Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005. This brings changes in the cost function from iteration to iteration down to around 614, but I can't get it to go any lower.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why? Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values? For instance, if I were going to deliver this algorithm to a client so they could make predictions of their own (I'm not, but for the sake of argument..), wouldn't I want them to simply be able to plug in the un-normalized x-value?
All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time. Is this normal, or are there flaws in my algorithm that are contributing to this issue?

Given learnrate = .01, tolerance = .0000000001, and max_iter = 10000, in combination with normalized data, I can get the gradient descent algorithm to converge. However, when I use the un-normalized data, the smallest I can make the learning rate without the algorithm returning NaN is .005
That's to be expected, given the way your algorithm is set up.
The normalization of the data makes it so the y-intercept of the best fit is around 0.0. Otherwise, you could have a y-intercept thousands of units off of the starting guess, and you'd have to trek there before you ever really started the optimization part.
Is it definitely a requirement of this type of algorithm to have normalized data, and if so, why?
No, absolutely not, but if you don't normalize, you should pick a starting point more intelligently (you're starting at (m,b) = (0,0)). Your learnrate may also be too small if you don't normalize your data, and same with your tolerance.
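For example (this is my own sketch, not something from your code; the function name rough_start is hypothetical), you could seed the parameters from rough moment-based estimates instead of (0, 0):
import numpy as np

def rough_start(data):
    # data[:, 0] = sales (y), data[:, 1] = temp (x), matching the layout used in fit() above.
    x, y = data[:, 1], data[:, 0]
    slope0 = np.cov(x, y, bias=True)[0, 1] / np.var(x)  # moment-based slope estimate
    intercept0 = y.mean() - slope0 * x.mean()           # matching intercept estimate
    return intercept0, slope0
(For simple regression these moments actually give the least-squares answer directly, so as a gradient descent starting point you could use something much cruder; the point is only that the start should be in the right neighborhood of the data's scale.)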
Also, what would be the best way to take a novel x-value in non-normalized form and plug it into the point forecast, given that the algorithm needs normalized values?
Apply to your new x-value the same transformation that you applied to the original data to get the normalized data (the normalization code is outside of what you have shown). If this test point fell within the (minx, maxx) range of your original data, then once transformed it should fall within 0 <= x <= 1. Once you have this normalized test point, plug it into your theta equation of a line (remember, your thetas are the m, b of the slope-intercept form of the equation of a line).
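For instance, your normalized table looks like a simple min-max scaling (the smallest values map to 0 and the largest to 1), so a sketch could look like this. The function names are my own, the min/max constants come from your data table, and slregression is assumed to be the fitted model from your script:
def normalize_x(x_new, minx, maxx):
    # Apply the same min-max scaling that was used on the training inputs.
    return (x_new - minx) / (maxx - minx)

def denormalize_y(y_norm, miny, maxy):
    # Map a normalized prediction back to the original units (dollars).
    return y_norm * (maxy - miny) + miny

# Example: forecast sales for temp = 20.0 degrees Celsius, using the min/max from the table above.
x_scaled = normalize_x(20.0, minx=11.9, maxx=25.1)
y_scaled = slregression.point_forecast(x_scaled)
y_dollars = denormalize_y(y_scaled, miny=185, maxy=614)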
All in all, playing around with the tolerance, max_iter, and learnrate gives me non-convergent results the majority of the time.
For a well-formed problem, if you're in fact diverging it often means your step size is too large. Try lowering it.
If it's simply not converging before it hits the maximum number of iterations, that could be due to a few issues:
Your step size is too small,
Your tolerance is too small,
Your maximum number of iterations is too small,
Your starting point is poorly chosen.
In your case, using the non-normalized data results in your starting point of (0,0) being very far off (the (m,b) of the non-normalized data is around (-159, 30), while the (m,b) of your normalized data is (0.10, 0.79)), so most if not all of your iterations are spent just getting to the area of interest.
The problem is that increasing the step size so you reach the area of interest faster also makes it less likely that you converge once you get there.
To account for this, some gradient descent algorithms have a dynamic step size (or learnrate), taking large steps at the beginning and smaller ones as they near convergence.
It may also be helpful to keep a history of the theta pairs throughout the algorithm and then plot them; you will see the difference immediately between using normalized and non-normalized input data. A sketch of both ideas follows.
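As a rough sketch of both ideas (the attribute name history_ and the 1/t decay schedule are illustrative choices of mine, not part of your class):
import matplotlib.pyplot as plt

# Inside fit(): record the parameters and, optionally, shrink the step size over time.
#     self.history_ = []                                   # before the while loop
#     step = self.learnrate / (1.0 + 0.001 * self.iter_)   # optional 1/t decay, used in place of a fixed learnrate
#     self.history_.append((self.theta0_, self.theta1_))   # after each update

def plot_theta_path(history):
    # Plot the path taken by (theta0, theta1) during gradient descent.
    t0, t1 = zip(*history)
    plt.plot(t0, t1, '.-')
    plt.xlabel('theta0')
    plt.ylabel('theta1')
    plt.show()

# plot_theta_path(slregression.history_)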

Related

Getting a negative R-squared value with curve_fit()

I've read a related post on manually calculating R-squared values after using scipy.optimize.curve_fit(). However, they calculate an R-squared value when their function follows the power-law (f(x) = a*x^b). I'm trying to do the same but get negative R-squared values.
Here is my code:
import numpy as np
from scipy.optimize import curve_fit

def powerlaw(x, a, b):
    '''Generic power law function.'''
    return a * x**b

X = s_lt[4:]  # independent variable (Pandas series)
Y = s_lm[4:]  # dependent variable (Pandas series)

popt, pcov = curve_fit(powerlaw, X, Y)

residuals = Y - powerlaw(X, *popt)
ss_res = np.sum(residuals**2)           # residual sum of squares
ss_tot = np.sum((Y - np.mean(Y))**2)    # total sum of squares
r_squared = 1 - (ss_res / ss_tot)       # r-squared value
print("R-squared of power-law fit = ", str(r_squared))
I got an R-squared value of -0.057....
From my understanding, it's not good to use R-squared values for non-linear functions, but I expected to get a much higher R-squared value than a linear model due to overfitting. Did something else go wrong?
See "The R-squared and nonlinear regression: a difficult marriage?". Also see "When is R squared negative?".
Basically, we have two problems:
nonlinear models do not have an intercept term, at least, not in the usual sense;
the equality SStot=SSreg+SSres may not hold.
The first reference above calls your statistic a "pseudo-R-squared" (in the case of non-linear models), and notes that it may be lower than 0.
To further understand what's going on you probably want to plot your data Y as a function of X, the predicted values from the power law as a function of X, and the residuals as a function of X.
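For example, a minimal plotting sketch (assuming X and Y are the pandas Series from your code, popt comes from curve_fit, and matplotlib is available):
import numpy as np
import matplotlib.pyplot as plt

order = np.argsort(X.values)              # sort by X so the fitted curve plots cleanly
pred = powerlaw(X, *popt)

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
ax1.scatter(X, Y, label='data')
ax1.plot(X.values[order], pred.values[order], 'r-', label='power-law fit')
ax1.set_ylabel('Y')
ax1.legend()

ax2.scatter(X, Y - pred)                  # residuals vs X
ax2.axhline(0.0, color='k', linewidth=1)
ax2.set_xlabel('X')
ax2.set_ylabel('residuals')
plt.show()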
For non-linear models I have sometimes calculated the sum of squared deviation from zero, to examine how much of that is explained by the model. Something like this:
pred = powerlaw(X, *popt)
ss_total = np.sum(Y**2) # Not deviation from mean.
ss_resid = np.sum((Y - pred)**2)
pseudo_r_squared = 1 - ss_resid/ss_total
Calculated this way, pseudo_r_squared can potentially be negative (if the model is really bad, worse than just guessing the data are all 0), but if pseudo_r_squared is positive I interpret it as the amount of "variation from 0" explained by the model.

Numeric Gradient Descent in python

I wrote this code to perform gradient descent on a vector function.
Here f is the function, X0 is the starting point, and eta is the step size.
It is essentially made up of two parts: the first obtains the gradient of the function, and the second iterates over x, subtracting the gradient.
The problem is that it often fails to converge on some functions. For example, if we take f(X) = (X[0]-20)**4 + (X[1]-25)**4, the gradient descent does not converge to [20, 25].
Is there something I need to change or add?
def descenso_grad(f,X0,eta):
    def grad(f,X):
        import numpy as np
        def partial(g,k,X):
            h=1e-9
            Y=np.copy(X)
            X[k-1]=X[k-1]+h
            dp=(g(X)-g(Y))/h
            return dp
        grd=[]
        for i in np.arange(0,len(X)):
            ai=partial(f,i+1,X)
            grd.append(ai)
        return grd
    #iterations
    i=0
    while True:
        i=i+1
        X0=X0-eta*np.array(grad(f,X0))
        if np.linalg.norm(grad(f,X0))<10e-8 or i>400: break
    return X0
Your gradient descent implementation is a good basic implementation, but your gradient sometimes oscillates and explodes. First, we should note that your gradient descent does not always diverge: for some combinations of eta and X0, it actually converges.
But let me first suggest a few edits to the code:
The import numpy as np statement should be at the top of your file, not within a function. In general, any import statement should be at the beginning of the code so that it is executed only once.
It is better not to nest functions but to separate them: you can write the partial function outside of the grad function, and the grad function outside of the descenso_grad function. It is better for debugging.
I strongly recommend passing parameters such as the learning rate (eta), the number of steps (steps) and the tolerance (set to 10e-8, i.e. 1e-7, in your code) as arguments to the descenso_grad function. This way you will be able to compare their influence on the result.
Anyway, here is the implementation of your code I will use in this answer:
import numpy as np

def partial(g, k, X):
    h = 1e-9
    Y = np.copy(X)
    X[k - 1] = X[k - 1] + h
    dp = (g(X) - g(Y)) / h
    return dp

def grad(f, X):
    grd = []
    for i in np.arange(0, len(X)):
        ai = partial(f, i + 1, X)
        grd.append(ai)
    return grd

def descenso_grad(f, X0, eta, steps, tolerance=1e-7):
    #iterations
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta*np.array(grad(f, X0))
        if np.linalg.norm(grad(f, X0)) < tolerance or i > steps: break
    return X0

def f(X):
    return (X[0]-20)**4 + (X[1]-25)**4
Now, about the convergence. I said that your implementation didn't always diverge. Indeed:
X0 = [2, 30]
eta = 0.001
steps = 400
xmin = descenso_grad(f, X0, eta, steps)
print(xmin)
Will print [20.55359068 25.55258024]
But:
X0 = [2, 0]
eta = 0.001
steps = 400
xmin = descenso_grad(f, X0, eta, steps)
print(xmin)
Will actually diverge to [ 2.42462695e+01 -3.54879793e+10]
1) What happened
Your gradient is actually oscillating around the y axis. Let's compute the gradient of f at X0 = [2, 0]:
print(grad(f, X0))
We get grad(f, X0) = [-23328.00067961216, -62500.01024454831], which is quite high but in the right direction.
Now let's compute the next step of gradient descent:
eta = 0.001
X1=X0-eta*np.array(grad(f,X0))
print(X1)
We get X1 = [25.32800068 62.50001025]. We can see that on the x axis we actually get closer to the minimum, but on the y axis the gradient descent jumped to the other side of the minimum and moved even further away from it. Indeed, X0[1] was at a distance of 25 from the minimum (Xmin[1] - X0[1] = 25) on its left, while X1[1] is now at a distance of 62.5 - 25 = 37.5 on its right. Since the curve drawn by f has a simple U shape around the y axis, the value taken by f at X1 will be higher than before (to simplify, we ignore the influence of the x coordinate).
If we look at the next steps, we can clearly see the exploding oscillations around the minimal:
X0 = [2, 0]
eta = 0.001
steps = 10

# record values taken by X[1] during gradient descent
curve_y = [X0[1]]

i = 0
while True:
    i = i + 1
    X0 = X0 - eta * np.array(grad(f, X0))
    curve_y.append(X0[1])
    if np.linalg.norm(grad(f, X0)) < 10e-8 or i > steps: break

print(curve_y)
We get [0, 62.50001024554831, -148.43710232226067, 20719.6258707022, -35487979280.37413]. We can see that X[1] gets further and further from the minimum while oscillating around it.
In order to illustrate this, let's assume that the value along the x axis is fixed and look only at what happens on the y axis. The picture shows in black the oscillations of the function's values taken at each step of the gradient descent (synthetic data, for illustration purposes only). The gradient descent takes us further from the minimum at each step because the update value is too large:
Note that the gradient descent we gave as an example makes only 5 steps, while we programmed 10. This is because once the values taken by the function are too high, Python can no longer tell the difference between f(X[1]) and f(X[1]+h), so it computes a gradient equal to zero:
h = 1e-9  # same step as used in partial()
x = 24    # for the example
y = -35487979280.37413
z = f([x, y+h]) - f([x, y])
print(z)
We get 0.0. This is a floating-point precision issue, but we will come back to it later.
So, these oscillations are due to the combination of:
the very high value of the partial gradient with respect to the y axis,
a value of eta that is too large to compensate for the exploding gradient in the update.
If this is true, we might converge if we use a smaller learning rate. Let's check:
X0 = [2, 0]
# divide eta by 100
eta = 0.0001
steps = 400
xmin = descenso_grad(f, X0, eta, steps)
print(xmin)
We will get [18.25061287 23.24796497]. We might need more steps but we are converging this time!!
2) How to avoid that?
A) In your specific case
Since the function's shape is simple, with no local minima or saddle points, we can avoid this issue by simply clipping the gradient, i.e. defining a maximum magnitude for each of its components:
def grad_clipped(f, X, clip):
    grd = []
    for i in np.arange(0, len(X)):
        ai = partial(f, i + 1, X)
        if ai < 0:
            ai = max(ai, -1*clip)
        else:
            ai = min(ai, clip)
        grd.append(ai)
    return grd

def descenso_grad_clipped(f, X0, eta, steps, clip=100, tolerance=10e-8):
    #iterations
    i = 0
    while True:
        i = i + 1
        X0 = X0 - eta*np.array(grad_clipped(f, X0, clip))
        if np.linalg.norm(grad_clipped(f, X0, clip)) < tolerance or i > steps: break
    return X0
Let's test it using the diverging example:
X0 = [2, 0]
eta = 0.001
steps = 400
clip=100
xmin = descenso_grad_clipped(f, X0, eta, steps, clip)
print(xmin)
This time we are converging: [19.31583901 24.20307188]. Note that this can slow the process down, since the clipped gradient descent takes smaller steps; here we can get closer to the real minimum by increasing the number of steps.
Note that this technique also solves the numerical precision issue we faced when the function's value was too high.
B) In general
In general, there are a lot of pitfalls that gradient descent algorithms try to avoid (exploding or vanishing gradients, saddle points, local minima...). Optimizers such as Adam, RMSprop, Adagrad, etc. try to work around these pitfalls; a minimal momentum sketch follows after the two resources below.
I am not going to delve into the details because that would deserve a whole article; however, here are two resources you can use (I suggest reading them in the order given) to deepen your understanding of the topic:
A good article on towardsdatascience.com explaining the basics of gradient descent and its most common flaws
An overview of gradient descent algorithms
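To give a flavour of what these optimizers do, here is a minimal sketch of gradient descent with a momentum term, reusing the grad and f functions defined above (the function name and beta = 0.9 are my own choices; a suitably small eta is still needed):
def descenso_grad_momentum(f, X0, eta, steps, beta=0.9, tolerance=1e-7):
    # The update uses an exponentially weighted average of past gradients,
    # which damps the oscillations along the y axis.
    X = np.array(X0, dtype=float)
    v = np.zeros_like(X)
    for _ in range(steps):
        g = np.array(grad(f, X))
        v = beta * v + (1 - beta) * g
        X = X - eta * v
        if np.linalg.norm(g) < tolerance:
            break
    return X

# xmin = descenso_grad_momentum(f, [2, 0], eta=0.001, steps=2000)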

Vectorized SVM gradient

I was going through the code for the SVM loss and its derivative. I understand the loss, but I cannot understand how the gradient is being computed in a vectorized manner:
import numpy as np

def svm_loss_vectorized(W, X, y, reg):
    loss = 0.0
    dW = np.zeros(W.shape)  # initialize the gradient as zero
    num_train = X.shape[0]

    scores = X.dot(W)
    yi_scores = scores[np.arange(scores.shape[0]), y]
    margins = np.maximum(0, scores - np.matrix(yi_scores).T + 1)
    margins[np.arange(num_train), y] = 0
    loss = np.mean(np.sum(margins, axis=1))
    loss += 0.5 * reg * np.sum(W * W)
I understand it up to this point. Beyond here, I cannot understand why we sum the binary matrix row-wise and subtract that sum at the correct-class positions:
    binary = margins
    binary[margins > 0] = 1
    row_sum = np.sum(binary, axis=1)
    binary[np.arange(num_train), y] = -row_sum.T
    dW = np.dot(X.T, binary)

    # Average
    dW /= num_train

    # Regularize
    dW += reg*W

    return loss, dW
Let us recap the scenario and the loss function first, so we are on the same page:
Given are P sample points in N-dimensional space in the form of a PxN matrix X, so the points are the rows of this matrix. Each point in X is assigned to one out of M categories. These are given as a vector Y of length P that has integer values between 0 and M-1.
The goal is to predict the classes of all points by M linear classifiers (one for each category) given in the form of a weight matrix W of shape NxM, so the classifiers are the columns of W. To predict the categories of all samples X, the scalar products between all points and all weight vectors are formed. This is the same as matrix-multiplying X and W, yielding a score matrix Y0 that is arranged such that its rows are ordered like the elements of Y: each row corresponds to one sample. The predicted category for each sample is simply the one with the largest score.
There are no bias terms so I presume there is some kind of symmetry or zero mean assumption.
Now, to find a good set of weights, we want a loss function that is small for good predictions and large for bad predictions, and that lets us do gradient descent. One of the most straightforward ways is to punish, for each sample i, each score that is larger than the score of the correct category for that sample, and let the penalty grow linearly with the difference. So if we write A[i] for the set of categories j that score more than the correct category (Y0[i, j] > Y0[i, Y[i]]), the loss for sample i could be written as
sum_{j in A[i]} (Y0[i, j] - Y0[i, Y[i]])
or equivalently if we write #A[i] for the number of elements in A[i]
(sum_{j in A[i]} Y0[i, j]) - #A[i] Y0[i, Y[i]]
The partial derivatives with respect to the score are thus simply
dloss / dY0[i, j] = -#A[i]   if j == Y[i]
                  =  1       if j in A[i]
                  =  0       otherwise
which is precisely what the first four lines you say you don't understand compute.
The next line applies the chain rule dloss/dW = dloss/dY0 dY0/dW.
It remains to divide by the number of samples to get a per-sample loss, and to add the derivative of the regularization term, which is easy since the regularization is just a componentwise quadratic function.
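If it helps, here is my own rough, non-vectorized sketch of the same gradient (not part of the code you posted), which makes the role of the row sums explicit:
import numpy as np

def svm_grad_naive(W, X, y, reg):
    # Loop-based version of the gradient, for comparison with the vectorized code.
    num_train, num_classes = X.shape[0], W.shape[1]
    dW = np.zeros(W.shape)
    for i in range(num_train):
        scores = X[i].dot(W)
        correct = scores[y[i]]
        for j in range(num_classes):
            if j == y[i]:
                continue
            margin = scores[j] - correct + 1
            if margin > 0:
                dW[:, j] += X[i]        # the "+1" entries of the binary matrix
                dW[:, y[i]] -= X[i]     # accumulates into the -row_sum entry for the correct class
    dW /= num_train
    dW += reg * W
    return dW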
Personally, I found it much easier to understand the whole gradient calculation by looking at the analytic derivation of the loss function in more detail. To expand on the given answer, I would like to point to the derivatives of the loss function with respect to the weights:
1) Loss gradient wrt w_yi (correct class):
dL_i/dw_yi = -( sum_{j != y_i} 1(w_j . x_i - w_yi . x_i + 1 > 0) ) * x_i
Hence, we count the cases where w_j does not meet the margin requirement and sum those cases up. This negative sum is then used as the weight for the position of the correct class w_yi. (We later need to multiply this value by x_i; this is what you do in your code in line 5.)
2) Loss gradient wrt w_j (incorrect classes):
dL_i/dw_j = 1(w_j . x_i - w_yi . x_i + 1 > 0) * x_i
where 1 is the indicator function: 1 if true, else 0.
In other words, "programmatically" we need to apply equation (2) to all cases where the margin requirement is not met, and add the negative sum of all unmet requirements to the true class column (as in (1)).
So what you do in the first 3 lines of your code is determine the cases where the margin is not met, and add the negative sum of these cases to the correct class column (y_i). In the 5th line, you do the final step, where you multiply by the x_i's - and this completes the gradient calculation as in (1) and (2).
I hope this makes it easier to understand; let me know if anything remains unclear. (source)

Python gradient descent - cost keeps increasing

I'm trying to implement gradient descent in python and my loss/cost keeps increasing with every iteration.
I've seen a few people post about this, and saw an answer here: gradient descent using python and numpy
I believe my implementation is similar, but can't see what I'm doing wrong to get an exploding cost value:
Iteration: 1 | Cost: 697361.660000
Iteration: 2 | Cost: 42325117406694536.000000
Iteration: 3 | Cost: 2582619233752172973298548736.000000
Iteration: 4 | Cost: 157587870187822131053636619678439702528.000000
Iteration: 5 | Cost: 9615794890267613993157742129590663647488278265856.000000
I'm testing this on a dataset I found online (LA Heart Data): http://www.umass.edu/statdata/statdata/stat-corr.html
Import code:
import numpy as np

dataset = np.genfromtxt('heart.csv', delimiter=",")

x = dataset[:]
x = np.insert(x, 0, 1, axis=1)  # Add 1's for bias
y = dataset[:, 6]
y = np.reshape(y, (y.shape[0], 1))
Gradient descent:
def gradientDescent(weights, X, Y, iterations=1000, alpha=0.01):
    theta = weights
    m = Y.shape[0]
    cost_history = []

    for i in xrange(iterations):
        residuals, cost = calculateCost(theta, X, Y)
        gradient = (float(1)/m) * np.dot(residuals.T, X).T
        theta = theta - (alpha * gradient)

        # Store the cost for this iteration
        cost_history.append(cost)
        print "Iteration: %d | Cost: %f" % (i+1, cost)
Calculate cost:
def calculateCost(weights, X, Y):
    m = Y.shape[0]
    residuals = h(weights, X) - Y
    squared_error = np.dot(residuals.T, residuals)
    return residuals, float(1)/(2*m) * squared_error
Calculate hypothesis:
def h(weights, X):
    return np.dot(X, weights)
To actually run it:
gradientDescent(np.ones((x.shape[1],1)), x, y, 5)
Assuming that your derivation of the gradient is correct: you are using =- where you should be using -=. Instead of updating theta, you are reassigning it to -(alpha * gradient).
EDIT (after the above issue was fixed in the code):
I ran the code on what I believe is the right dataset and was able to get the cost to behave by setting alpha=1e-7. If you run it for 1e6 iterations you should see it converging. This approach on this dataset appears to be very sensitive to the learning rate.
In general, if your cost is increasing, the very first thing you should check is whether your learning rate is too large. In such cases, the rate causes the cost function to jump over the optimal value and increase towards infinity. Try different small values of your learning rate. When I face the problem you describe, I usually repeatedly try 1/10 of the learning rate until I find a rate where J(w) decreases.
Another problem might be a bug in your derivative implementation. A good way to debug is to do a gradient check to compare the analytic gradient versus the numeric gradient.
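For example, a minimal gradient check sketch, assuming the calculateCost function defined above (the function name gradient_check and the epsilon value are my own choices):
def gradient_check(weights, X, Y, analytic_gradient, epsilon=1e-5):
    # Centered finite-difference estimate of the gradient, one weight at a time.
    numeric = np.zeros(weights.shape)
    for k in range(weights.shape[0]):
        w_plus, w_minus = weights.copy(), weights.copy()
        w_plus[k] += epsilon
        w_minus[k] -= epsilon
        _, cost_plus = calculateCost(w_plus, X, Y)
        _, cost_minus = calculateCost(w_minus, X, Y)
        numeric[k] = np.asarray(cost_plus - cost_minus).item() / (2 * epsilon)
    # Relative difference; should be very small (say < 1e-6) if the analytic gradient is correct.
    diff = np.linalg.norm(numeric - analytic_gradient)
    scale = np.linalg.norm(numeric) + np.linalg.norm(analytic_gradient)
    return diff / scale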

PyMC: Hierarchical Hidden Markov Model

This is a follow up on PyMC: Parameter estimation in a Markov system
I have a system which is defined by its position and velocity at each timestep. The behavior of the system is defined as:
vel = vel + damping * dt
pos = pos + vel * dt
So, here is my PyMC model, to estimate vel, pos and, most importantly, damping.
# PRIORS
damping = pm.Normal("damping", mu=-4, tau=(1 / .5**2))

# we assume some system noise
tau_system_noise = (1 / 0.1**2)

# the state consists of (pos, vel); save in lists
# vel: we can't judge the initial velocity --> assume it's 0 with big std
vel_states = [pm.Normal("v0", mu=-4, tau=(1 / 2**2))]

# pos: the first pos is just the observation
pos_states = [pm.Normal("p0", mu=observations[0], tau=tau_system_noise)]

for i in range(1, len(observations)):
    new_vel = pm.Normal("v" + str(i),
                        mu=vel_states[-1] + damping * dt,
                        tau=tau_system_noise)
    vel_states.append(new_vel)

    pos_states.append(
        pm.Normal("s" + str(i),
                  mu=pos_states[-1] + new_vel * dt,
                  tau=tau_system_noise)
    )

# we assume some observation noise
tau_observation_noise = (1 / 0.5**2)

obs = pm.Normal("obs", mu=pos_states, tau=tau_observation_noise, value=observations, observed=True)
This is how I run the sampling:
mcmc = pm.MCMC([damping, obs, vel_states, pos_states])
mcmc.sample(50000, 25000)
pm.Matplot.plot(mcmc.get_node("damping"))
damping_samples = mcmc.trace("damping")[:]
print "damping -- mean:%f; std:%f" % (mean(damping_samples), std(damping_samples))
print "real damping -- %f" % true_damping
The value for damping is dominated by the prior. Even if I change the prior to Uniform or whatever, it is still the case.
What am I doing wrong? It's pretty much like the previous example, just with another layer.
The full IPython notebook of this problem is available here: http://nbviewer.ipython.org/github/sotte/random_stuff/blob/master/PyMC%20-%20HMM%20Dynamic%20System.ipynb
[EDIT: Some clarifications & code for sampling.]
[EDIT2: @Chris's answer didn't help. I could not use AdaptiveMetropolis since the *_states don't seem to be part of the model.]
There are a couple of issues with the model, looking at it again. First and foremost, you did not add all of your PyMC objects to the model. You have only added [damping, obs]. You should pass all of the PyMC nodes to the model.
Also, note that you don't need to call both Model and MCMC. This is fine:
model = pm.MCMC([damping, obs, vel_states, pos_states])
The best workflow for PyMC is to keep your model in a separate file from the running logic. That way, you can just import the model and pass it to MCMC:
import my_model
model = pm.MCMC(my_model)
Alternatively, you can write your model as a function, returning locals (or vars), then call the function as the argument to MCMC. For example:
def generate_model():
    # put your model definition here
    return locals()

model = pm.MCMC(generate_model())
Assuming you know the structure of your model -- you are doing parameter estimation, not system identification -- you can construct your PyMC model as a regression, with unknown damping, initial position and initial velocity as parameters, and the array of positions as your observations.
That is, with class PM representing the point-mass system:
pm = PM(true_damping)

positions, velocities = pm.integrate(true_pos, true_vel, N, dt)

# Assume little system noise
std_system_noise = 0.05
tau_system_noise = 1.0/std_system_noise**2

# Treat the real positions as observations
observations = positions + np.random.randn(N,)*std_system_noise

# Damping is modelled with a Uniform prior
damping = mc.Uniform("damping", lower=-4.0, upper=4.0, value=-0.5)

# Initial position & velocity unknown -> assume Uniform priors
init_pos = mc.Uniform("init_pos", lower=-1.0, upper=1.0, value=0.5)
init_vel = mc.Uniform("init_vel", lower=0.0, upper=2.0, value=1.5)

@mc.deterministic
def det_pos(d=damping, pi=init_pos, vi=init_vel):
    # Apply damping, init_pos and init_vel estimates and integrate
    pm.damping = d.item()
    pos, vel = pm.integrate(pi, vi, N, dt)
    return pos

# Standard deviation is modelled with a Uniform prior
std_pos = mc.Uniform("std", lower=0.0, upper=1.0, value=0.5)

@mc.deterministic
def det_prec_pos(s=std_pos):
    # Precision, based on standard deviation
    return 1.0/s**2

# The observations are based on the estimated positions and precision
obs_pos = mc.Normal("obs", mu=det_pos, tau=det_prec_pos, value=observations, observed=True)

# Create the model and sample
model = mc.Model([damping, init_pos, init_vel, det_prec_pos, obs_pos])
mcmc = mc.MCMC(model)
mcmc.sample(50000, 25000)
The full listing is here:
https://gist.github.com/stuckeyr/7762371
Increasing N and decreasing dt will improve your estimates markedly.
What do you mean by unreasonable? Are they shrunken toward the prior? Damping seems to have a pretty tight variance -- what if you give it a more diffuse prior?
Also, you might try using the AdaptiveMetropolis sampler on the *_states arrays:
my_model.use_step_method(AdaptiveMetropolis, my_model.vel_states)
It sometimes mixes better for correlated variables, as these likely are.
I think that your initial approach is fine and should work, except that the "obs" variable has not been included in the list of nodes supplied to MCMC (see In[10] in your notebook). After including this variable, the MCMC solver runs fine and does enforce the conditional dependencies specified by your model. I'd like to repeat the point made by Chris that it is best to define the model in a different file or under a function to avoid such mistakes.
The reason why you don't get the right results is that your priors have been chosen arbitrarily and, in some cases, the values are such that it is very difficult for the model to mix properly and converge. Your toy problem tries to estimate a damping value such that the positions converge to the vector of observed positions. For this, your model should have the flexibility to choose velocity and damping values in a wide range, so that stochastic errors in the position/velocity can be corrected when going from one time step to the next. Otherwise, as a result of your Euler integration scheme, the errors just keep getting propagated. I think Chris referred to the same thing when he suggested choosing a more diffuse prior.
I suggest playing around with the tau values for each of the Normal variables. For instance, I changed the following values:
damping = pm.Normal("damping", mu=0, tau=1/20.**2)  # was tau=1/2.**2

new_vel = pm.Normal("v" + str(i),
                    mu=vel_states[-1] + damping * dt,
                    tau=(1/2.**2))  # was tau=tau_system_noise=(1 / 0.5**2)

tau_observation_noise = (1 / 0.005**2)  # was 1 / 0.5**2
You can see the modified file here.
The plots at the bottom show that the positions are indeed converging. The velocities are all over the place. The estimated mean value of damping is 6.9, which is very different from -1.5. Perhaps you can achieve better estimates by choosing appropriate values for the priors.
