Spark mllib predicting weird number or NaN - python

I am new to Apache Spark and trying to use the machine learning library to predict some data. My dataset right now is only about 350 points. Here are 7 of those points:
"365","4",41401.387,5330569
"364","3",51517.886,5946290
"363","2",55059.838,6097388
"362","1",43780.977,5304694
"361","7",46447.196,5471836
"360","6",50656.121,5849862
"359","5",44494.476,5460289
Here's my code:
def parsePoint(line):
split = map(sanitize, line.split(','))
rev = split.pop(-2)
return LabeledPoint(rev, split)
def sanitize(value):
return float(value.strip('"'))
parsedData = textFile.map(parsePoint)
model = LinearRegressionWithSGD.train(parsedData, iterations=10)
print model.predict(parsedData.first().features)
The prediction is something totally crazy, like -6.92840330273e+136. If I don't set iterations in train(), then I get nan as a result. What am I doing wrong? Is it my data set (the size of it, maybe?) or my configuration?

The problem is that LinearRegressionWithSGD uses stochastic gradient descent (SGD) to optimize the weight vector of your linear model. SGD is really sensitive to the provided stepSize which is used to update the intermediate solution.
What SGD does is to calculate the gradient g of the cost function given a sample of the input points and the current weights w. In order to update the weights w you go for a certain distance in the opposite direction of g. The distance is your step size s.
w(i+1) = w(i) - s * g
Since you're not providing an explicit step size value, MLlib assumes stepSize = 1. This seems to not work for your use case. I'd recommend you to try different step sizes, usually lower values, to see how LinearRegressionWithSGD behaves:
LinearRegressionWithSGD.train(parsedData, numIterartions = 10, stepSize = 0.001)

Related

Incorporate known data moments into GPFlow fitting

I have recently been working with gpflow, in-particular Gaussian process regression, to model a process for which I have access to approximated moments for each input. I have a vector of input values X of size (N,1) and a vector of responses Y of size (N,1). However, I also know, for each (x,y) pair, an approximation of the associated variance, skewness, kurtosis and so on for the particular y value.
From this, I know properties that inform me of appropriate likelihoods to use for each data point.
In the simplest case, I just assume all likelihoods are Gaussian, and specify the variance at each point. I've created a minimal example of my code by adapting the tutorial on: https://nbviewer.jupyter.org/github/GPflow/GPflow/blob/develop/doc/source/notebooks/advanced/varying_noise.ipynb#Demo-2:-grouped-noise-variances.
import numpy as np
import gpflow
def generate_data(N=100):
X = np.random.rand(N)[:, None] * 10 - 5 # Inputs, shape N x 1
F = 2.5 * np.sin(6 * X) + np.cos(3 * X) # Mean function values
groups = np.arange( 0, N, 1 ).reshape(-1,1)
NoiseVar = np.array([i/100.0 for i in range(N)])[groups]
Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar) # Noisy data
return X, Y, groups, NoiseVar
# Get data
X, Y, groups, NoiseVar = generate_data()
Y_data = np.hstack([Y, groups])
# Generate one likelihood per data-point
likelihood = gpflow.likelihoods.SwitchedLikelihood( [gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])])
# model construction (notice that num_latent is 1)
kern = gpflow.kernels.Matern52(input_dim=1, lengthscales=0.5)
model = gpflow.models.VGP(X, Y_data, kern=kern, likelihood=likelihood, num_latent=1)
# Specify the likelihood as non-trainable
model.likelihood.set_trainable(False)
# build the natural gradients optimiser
natgrad_optimizer = gpflow.training.NatGradOptimizer(gamma=1.)
natgrad_tensor = natgrad_optimizer.make_optimize_tensor(model, var_list=[(model.q_mu, model.q_sqrt)])
session = model.enquire_session()
session.run(natgrad_tensor)
# update the cache of the variational parameters in the current session
model.anchor(session)
# Stop Adam from optimising the variational parameters
model.q_mu.trainable = False
model.q_sqrt.trainable = False
# Create Adam tensor
adam_tensor = gpflow.train.AdamOptimizer(learning_rate=0.1).make_optimize_tensor(model)
for i in range(200):
session.run(natgrad_tensor)
session.run(adam_tensor)
# update the cache of the parameters in the current session
model.anchor(session)
print(model)
The above code works for a gaussian likelihood, and known variances. Inspecting my real data, I see that it is skewed very often and as a result, I want to use non-gaussian likelihoods to model it, but am unsure how to specify these other likelihood parameters given what I know.
So my question is: Given this setup, how can I adapt my code so far to include non-Gaussian likelihoods at each step, in-particular specifying and fixing their parameters based on my known variances, skewness, kurtosis and so on associated with each individual y value?
Firstly, you will need to choose which non-Gaussian likelihood you use. GPflow includes various ones in likelihoods.py. You then need to adapt the line
likelihood = gpflow.likelihoods.SwitchedLikelihood(
[gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])]
)
to give a list of your non-Gaussian likelihoods.
Which likelihood can take advantage of your skewness and kurtosis information is a statistical question. Depending on what you come up with, you may need to implement your own likelihood class, which can be done by inheriting from Likelihood. You should be able to follow some other examples from likelihoods.py.

Tensorflow: How to train a Linear Regression?

I'm learning to train a Linear Regression model via TensorFlow.
It's quite a simple formula:
y = W * x + b
I have generated a sample data:
After the model training I can see in Tensorboard that "W" is correct when "b" goes a completely wrong way. So, Loss is quite high.
Here is my code.
QUESTION
Why is "b" being trained a wrong way?
Shall I do something with the optimizer?
On line 16, you are adding gaussian noise with a standard deviation of 300!!
noise = np.random.normal(scale=n, size=(N, 1))
Try using:
noise = np.random.normal(size=(N, 1))
That's using mean=0 and std=1 (standard Gaussian noise).
Also, 20k iterations is more than enough (in this problem) for training.
For a more comprehensive explanation of what is happening, look at your plot. Given an x value, the possible values for y have thousands of units of difference. That means that there are a lot of lines that explain your data. Hence a lot of values for B are possible, but no matter which one you choose (even the true b value) all of them are going to have a big loss.
The optimization is working correctly but the problem is with the b parameter whose estimation is much more heavily influenced by the initial "roll of dice" of noise (which has a standard deviation of N) than the actual value of b_true (which is much smaller than N).

Bad quality of Viterbi Algorithm (HMM)

I've been trying to get into hidden Markov models and the Viterbi algorithm recently. I found a library called hmmlearn (http://hmmlearn.readthedocs.io/en/latest/tutorial.html) to help me generate a state sequence for two states (with Gaussian emissions). Then I wanted to re-determine the state sequence using Viterbi. My code works, but predicts approximately 5% of the states wrong (depending on the means and variances of the Gaussian emissions). The hmmlearn library has a .predict method which also uses Viterbi to determine the state sequence.
My problem now is that the Viterbi algorithm by hmmlearn is much better than my hand-written one (error rate is lower than 0.5% compared to my 5%). I couldn't find any major problem in my code, so I'm not sure why this is the case. Below is my code where I first generate the state and observation sequence Z and X, predict Z with hmmlearn and finally predict it with my own code:
# Import libraries
import numpy as np
import scipy.stats as st
from hmmlearn import hmm
# Generate a sequence
model = hmm.GaussianHMM(n_components = 2, covariance_type = "spherical")
model.startprob_ = pi
model.transmat_ = A
model.means_ = obs_means
model.covars_ = obs_covars
X, Z = model.sample(T)
## Predict the states from generated observations with the hmmlearn library
Z_pred = model.predict(X)
# Predict the state sequence with Viterbi by hand
B = np.concatenate((st.norm(mean_1,var_1).pdf(X), st.norm(mean_2,var_2).pdf(X)), axis = 1)
delta = np.zeros(shape = (T, 2))
psi = np.zeros(shape= (T, 2))
### Calculate starting values
for s in np.arange(2):
delta[0, s] = np.log(pi[s]) + np.log(B[0, s])
psi = np.zeros((T, 2))
### Take everything in log space since values get very low as t -> T
for t in range(1,T):
for s_post in range(0, 2):
delta[t, s_post] = np.max([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1) + np.log(B[t, s_post])
psi[t, s_post] = np.argmax([delta[t - 1, :] + np.log(A[:, s_post])], axis = 1)
### Backtrack
states = np.zeros(T, dtype=np.int32)
states[T-1] = np.argmax(delta[T-1])
for t in range(T-2, -1, -1):
states[t] = psi[t+1, states[t+1]]
I'm not sure if I have a big error in my code or if hmmlearn just uses a more refined Viterbi algorithm. I have noticed by looking into the falsely predicted states that the impact of the emission probability B seems to be too big as it causes the states to change too frequently even if the transition probability to go to the other state is really low.
I'm rather new to python so please excuse my ugly coding. Thanks in advance for any tips you might have!
Edit: As you can see in the code, I'm stupid and used variances instead of the standard deviation to determine the emission probabilities. After fixing this, I get the same result as the implemented Viterbi algorithm.

Maximum likelihood linear regression tensorflow

After I implemented a LS estimation with gradient descent for a simple linear regression problem, I'm now trying to do the same with Maximum Likelihood.
I used this equation from wikipedia. The maximum has to be found.
train_X = np.random.rand(100, 1) # all values [0-1)
train_Y = train_X
X = tf.placeholder("float", None)
Y = tf.placeholder("float", None)
theta_0 = tf.Variable(np.random.randn())
theta_1 = tf.Variable(np.random.randn())
var = tf.Variable(0.5)
hypothesis = tf.add(theta_0, tf.mul(X, theta_1))
lhf = 1 * (50 * np.log(2*np.pi) + 50 * tf.log(var) + (1/(2*var)) * tf.reduce_sum(tf.pow(hypothesis - Y, 2)))
op = tf.train.GradientDescentOptimizer(0.01).minimize(lhf)
This code works, but I still have some questions about it:
If I change the lhf function from 1 * to -1 * and minimize -lhf (according to the equation), it does not work. But why?
The value for lhf goes up and down during optimization. Shouldn't it only change in one direction?
The value for lhf sometimes is a NaN during optimization. How can I avoid that?
In the equation, σ² is the variance of the error (right?). My values are perfectly on a line. Why do I get a value of var above 100?
The symptoms in your question indicate a common problem: the learning rate or step size might be too high for the problem.
The zig-zag behaviour, where the function to be maximized goes up and down, is usual when the learning rate is too high. Specially when you get NaNs.
The simplest solution is to lower the learning rate, by dividing your current learning rate by 10 until the learning curve is smooth and there are no NaNs or up-down behavior.
As you are using TensorFlow you can also try AdamOptimizer as this adjust the learning rate dynamically as you train.

PyMC: Hierachical Hidden Markov Model

This is a follow up on PyMC: Parameter estimation in a Markov system
I have a system which is defined by its position and velocity at each timestep. The behavior of the system is defined as:
vel = vel + damping * dt
pos = pos + vel * dt
So, here is my PyMC model. To estimate vel, pos and most importantly damping.
# PRIORS
damping = pm.Normal("damping", mu=-4, tau=(1 / .5**2))
# we assume some system noise
tau_system_noise = (1 / 0.1**2)
# the state consist of (pos, vel); save in lists
# vel: we can't judge the initial velocity --> assume it's 0 with big std
vel_states = [pm.Normal("v0", mu=-4, tau=(1 / 2**2))]
# pos: the first pos is just the observation
pos_states = [pm.Normal("p0", mu=observations[0], tau=tau_system_noise)]
for i in range(1, len(observations)):
new_vel = pm.Normal("v" + str(i),
mu=vel_states[-1] + damping * dt,
tau=tau_system_noise)
vel_states.append(new_vel)
pos_states.append(
pm.Normal("s" + str(i),
mu=pos_states[-1] + new_vel * dt,
tau=tau_system_noise)
)
# we assume some observation noise
tau_observation_noise = (1 / 0.5**2)
obs = pm.Normal("obs", mu=pos_states, tau=tau_observation_noise, value=observations, observed=True)
This is how I run the sampling:
mcmc = pm.MCMC([damping, obs, vel_states, pos_states])
mcmc.sample(50000, 25000)
pm.Matplot.plot(mcmc.get_node("damping"))
damping_samples = mcmc.trace("damping")[:]
print "damping -- mean:%f; std:%f" % (mean(damping_samples), std(damping_samples))
print "real damping -- %f" % true_damping
The value for damping is dominated by the prior. Even if I change the prior to Uniform or whatever, it is still the case.
What am I doing wrong? It's pretty much like the previous example, just with another layer.
The full IPython notebook of this problem is available here: http://nbviewer.ipython.org/github/sotte/random_stuff/blob/master/PyMC%20-%20HMM%20Dynamic%20System.ipynb
[EDIT: Some clarifications & code for sampling.]
[EDIT2: #Chris answer didn't help. I could not use AdaptiveMetropolis since the *_states don't seem to be part of the model.]
There are a couple of issues with the model, looking at it again. First and foremost, you did not add all of your PyMC objects to the model. You have only added [damping, obs]. You should pass all of the PyMC nodes to the model.
Also, note that you don't need to call both Model and MCMC. This is fine:
model = pm.MCMC([damping, obs, vel_states, pos_states])
The best workflow for PyMC is to keep your model in a separate file from the running logic. That way, you can just import the model and pass it to MCMC:
import my_model
model = pm.MCMC(my_model)
Alternately, you can write your model as a function, returning locals (or vars), then calling the function as the argument for MCMC. For example:
def generate_model():
# put your model definition here
return locals()
model = pm.MCMC(generate_model())
Assuming you know the structure of your model -- you are doing parameter estimation, not system identification -- you can construct your PyMC model as a regression, with unknown damping, initial position and initial velocity as parameters and the array of positions, your observations.
That is, with class PM representing the point-mass system:
pm = PM(true_damping)
positions, velocities = pm.integrate(true_pos, true_vel, N, dt)
# Assume little system noise
std_system_noise = 0.05
tau_system_noise = 1.0/std_system_noise**2
# Treat the real positions as observations
observations = positions + np.random.randn(N,)*std_system_noise
# Damping is modelled with a Uniform prior
damping = mc.Uniform("damping", lower=-4.0, upper=4.0, value=-0.5)
# Initial position & velocity unknown -> assume Uniform priors
init_pos = mc.Uniform("init_pos", lower=-1.0, upper=1.0, value=0.5)
init_vel = mc.Uniform("init_vel", lower=0.0, upper=2.0, value=1.5)
#mc.deterministic
def det_pos(d=damping, pi=init_pos, vi=init_vel):
# Apply damping, init_pos and init_vel estimates and integrate
pm.damping = d.item()
pos, vel = pm.integrate(pi, vi, N, dt)
return pos
# Standard deviation is modelled with a Uniform prior
std_pos = mc.Uniform("std", lower=0.0, upper=1.0, value=0.5)
#mc.deterministic
def det_prec_pos(s=std_pos):
# Precision, based on standard deviation
return 1.0/s**2
# The observations are based on the estimated positions and precision
obs_pos = mc.Normal("obs", mu=det_pos, tau=det_prec_pos, value=observations, observed=True)
# Create the model and sample
model = mc.Model([damping, init_pos, init_vel, det_prec_pos, obs_pos])
mcmc = mc.MCMC(model)
mcmc.sample(50000, 25000)
The full listing is here:
https://gist.github.com/stuckeyr/7762371
Increasing N and decreasing dt will improve your estimates markedly.
What do you mean by unreasonable? Are they shrunken toward the prior? Damping seems to have a pretty tight variance -- what if you give it a more diffuse prior?
Also, you might try using the AdaptiveMetropolis sampler on the *_states arrays:
my_model.use_step_method(AdaptiveMetropolis, my_model.vel_states)
It sometimes mixes better for correlated variables, as these likely are.
I think that your initial approach is fine and should work, except that the "obs" variable has not been included in the list of nodes supplied to MCMC (see In[10] in your notebook). After including this variable, the MCMC solver runs fine and does enforce the conditional dependencies specified by your model. I'd like to repeat the point made by Chris that it is best to define the model in a different file or under a function to avoid such mistakes.
The reason why you don't get the right results, is that your priors have been chosen arbitrarily and in some cases, the values are such that it is very difficult for the model to mix properly in order to converge. Your toy problem tries to estimate a damping value such that the positions converge to vector of observed positions. For this, your model should have the flexibility to choose velocity and damping values in a wide range so that stochastic errors in the position/velocity can be corrected when going from one time step to the next. Otherwise, as a result of your Euler integration scheme, the errors just keep getting propagated. I think Chris referred to the same thing when he suggested choosing a more diffuse prior.
I suggest playing around with the tau values for each of the Normal variables. For instance, I changed the following values:
damping = pm.Normal("damping", mu=0, tau=1/20.**2) # was tau=1/2.**2
new_vel = pm.Normal("v" + str(i),
mu=vel_states[-1] + damping * dt,
tau=(1/2.**2)) # was tau=tau_system_noise=(1 / 0.5**2)
tau_observation_noise = (1 / 0.005**2) # was 1 / 0.5**2
You can see the modified file here.
The plots at the bottom show that the positions are indeed converging. The velocities are all over the place. The estimated mean value of damping is 6.9, which is very different from -1.5. Perhaps you can achieve better estimates by choosing appropriate values for the priors.

Categories