GaussianProcessRegressor Fitting Kernel/Hyperparameters - python

Good day everyone. Here is my situation:
I am using a GaussianProcessRegressor object from the sklearn library.
After fitting the model, I want to sample points using predict to get a better idea of what the model looks like so far. The issue is that the prediction is essentially zero everywhere except at the training points.
I reset the regressor's alpha from my initial 1e-5 back to the default 1e-10, and n_restarts_optimizer from 9 back to the default of zero; my kernel is a Matern kernel with nearly standard settings. Now I do get some non-zero values, but I am not sure how to proceed:
I have the following:
a = df_reduced.values[0:4, :]
print("a[0,0]: ", a[0,0])
gp.predict(a)
Of course this gives me a nice result (since it's the fitting data):
a[0,0]: 150.0
Out[47]:
array([[10.4 ],
[ 9.3 ],
[78.39990234],
[78.39990234]])
Now I slightly alter the first feature of the first sample, staying in its initial vicinity:
a = df_reduced.values[1:4, :]
a[0, 0] = 151
gp.predict(a)
array([[4.85703698e-254],
[7.83999023e+001],
[7.83999023e+001]])
and for a[0, 0] = 152:
array([[ 0. ],
[78.39990234],
[78.39990234]])
So it seems that over most of the input space the predicted function is simply zero, which is a problem because I want to use this for GP-based hyperparameter optimization (global minimisation). Does anybody have a lead on how to fit this better?
By the way, I am using 16 features, fitting on 30 samples so far, and the output function takes values between 0 and 100.
Parameters are as follows (copy-paste):
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern

matern = C(1.0) * Matern(length_scale=1.0, nu=2.5)
gp = GaussianProcessRegressor(kernel=matern)
gp.fit(df_reduced.values, Y)  # df_reduced.values, because meanwhile we have overwritten X_reduced
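For completeness, here is a minimal, self-contained sketch that reproduces the kind of behaviour I mean. The data is entirely made up (30 samples of 16 features, targets between 0 and 100), and the kernel hyperparameters are kept fixed (optimizer=None) just so the effect is easy to see: away from the training points the posterior mean falls back to the GP's zero-mean prior.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel as C, Matern

rng = np.random.RandomState(42)
X_train = rng.uniform(0, 300, size=(30, 16))   # 30 samples, 16 features (made up)
y_train = rng.uniform(0, 100, size=30)         # outputs between 0 and 100 (made up)

kernel = C(1.0) * Matern(length_scale=1.0, nu=2.5)
# optimizer=None keeps length_scale fixed at 1.0 so the effect is easy to see
gp = GaussianProcessRegressor(kernel=kernel, optimizer=None)
gp.fit(X_train, y_train)

x_near = X_train[:1].copy()          # a training point: prediction matches y_train[0]
x_far = X_train[:1].copy()
x_far[0, 0] += 5                     # a few units away in a single feature

# Far from the data the kernel correlation dies off and the posterior mean
# reverts to the prior mean, which is 0 in this setup.
print(gp.predict(x_near), y_train[0])
print(gp.predict(x_far))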
Thanks already for any lead,
Best regards,
robTheBob86

Related

Lagged auto-correlations of numpy.random.normal not null

I am struggling with an unexpected/unwanted behavior of numpy's random.normal function.
When generating vectors of T elements with this function, I find that on average the lagged auto-correlation of those vectors is not 0 at lags other than 0. The auto-correlation value tends to -1/(T-1).
See for example this simple code:
import numpy as np

N = 10000000   # number of vectors (large; reduce if memory or runtime is an issue)
T = 100        # length of each vector
invT = -1. / (T - 1.)
sd = 1
av = 0
mxlag = 10

X = np.random.normal(av, sd, size=(N, T))
acf = X[:, 0:mxlag + 1].copy()   # copy, so we do not overwrite X through the view
for i in range(N):
    acf[i, :] = [1. if l == 0 else np.corrcoef(X[i, l:], X[i, :-l])[0][1]
                 for l in range(mxlag + 1)]
acf_mean = np.average(acf, axis=0)

print('mean auto-correlation of random_normal vector of length T=', T, ' : ', acf_mean)
print('to be compared with -1/(T-1) = ', invT)
I see this behavior with both Python v2.7.9 and v3.7.4. Also, coding this in NCL gives me the same result.
The problem I am referring to might seem tiny. However, it leads to larger biases when those vectors are used as seeds to generate auto-regressive time series. It is also problematic when this function is used to build bootstrap statistical tests.
Would someone have an explanation for this? Am I doing something obviously wrong?
Many thanks!
Actually, this is not a shortcoming of the function. My problem comes from the bias of the sample auto-correlation estimator, which for a white-noise series of length T is close to -1/(T-1) at non-zero lags rather than 0. This was documented back in 1954:
Reference: Marriott, F. H. C., and J. A. Pope. "Bias in the estimation of autocorrelations." Biometrika 41.3/4 (1954): 390-402 (https://www.jstor.org/stable/2332719)
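For reference, a quick vectorized sketch of the same check with a much smaller (arbitrarily chosen) number of replications; it uses the same Pearson-correlation estimator as np.corrcoef in the loop above and reproduces the roughly -1/(T-1) value rather than 0:

import numpy as np

rng = np.random.default_rng(0)
N, T = 20000, 100                 # far fewer replications than in the question
X = rng.normal(size=(N, T))

# Lag-1 Pearson correlation of each row with its shifted copy, vectorized.
a, b = X[:, 1:], X[:, :-1]
a = a - a.mean(axis=1, keepdims=True)
b = b - b.mean(axis=1, keepdims=True)
r1 = (a * b).sum(axis=1) / np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1))

print("mean lag-1 auto-correlation:", r1.mean())  # close to -1/(T-1), not 0
print("-1/(T-1) =", -1.0 / (T - 1))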

normalize_y in GPR

I'm trying to use GaussianProcessRegressor in sklearn to predict values at unknown points.
The target values are typically between 1000 and 10000.
Since they are not zero-mean, I set up the model with normalize_y=False, which is the default:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0, alpha=1e-10, normalize_y=False)
When I predict at unknown points with this gpr model, the returned std values are unrealistically small, on the scale of 0.1, i.e. about 0.001% of the predicted values.
When I change the setting to normalize_y=True, the returned std values are more realistic, around 500.
Can someone explain exactly what normalize_y does here, and whether I should set it to True or False in this case?
The closest answer I found is here: https://github.com/scikit-learn/scikit-learn/issues/15612
"OK I think I know what might be going on here. It's a bit tricky to see but I think that none of the kernels have a vertical length scale parameter, so kernel(x,x) is always equal to 1. All the diagonal elements of K are equal to 1 (before we add the ridge to it), for example.
We can then see that the variance of the predictions can only be between 0 and 1. For example, if we're predicting at a point far from the training data (so kernel(X, x_new) is a vector of zeros) then the variance is just
sigma^2 = kernel(x_new, x_new) = 1
I think the real problem here is that the prior is for data with unit variance, but the data doesn't have unit variance. The solution would be to normalise the data so that it has unit variance after it 'enters' the GP, conduct the GP analysis, and then 'unnormalise' it back again at the end. The code already removes the mean automatically, so I think we just need to divide by the standard deviation at the same point and it would work OK.
So could just need a few extra lines!"
For this reason, changing the length_scale_bounds parameter of your kernel should fix this issue!
I hope this helps those who land here as I faced the same issue!
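To make this concrete, here is a hedged sketch with synthetic data in the 1000-10000 range, as in the question. With normalize_y=False and a kernel whose output scale is kernel(x, x) = 1, the predictive std far from the data cannot exceed about 1; with normalize_y=True on newer scikit-learn versions, the targets are standardized internally and the std is reported back on the original scale:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(20, 1))
y_train = rng.uniform(1000, 10000, size=20)   # targets in the 1000-10000 range

X_new = np.array([[50.0]])                    # far from all training points

for normalize in (False, True):
    gpr = GaussianProcessRegressor(kernel=RBF(), random_state=0,
                                   alpha=1e-10, normalize_y=normalize)
    gpr.fit(X_train, y_train)
    mean, std = gpr.predict(X_new, return_std=True)
    # With normalize_y=False the prior variance is kernel(x, x) = 1, so std
    # can never exceed ~1; with normalize_y=True (recent scikit-learn) the
    # targets are standardized and the std is rescaled to the original units.
    print(normalize, mean, std)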

predict_proba is not working for my gaussian mixture model (sklearn, python)

Running Python 3.7.3
I have made a simple GMM and fitted it to some data. Using the predict_proba method, the returned values are 1's and 0's instead of probabilities of the input belonging to each Gaussian.
I initially tried this on a bigger data set and then tried to get a minimal example.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

feat_1 = [1, 1.8, 4, 4.1, 2.2]
feat_2 = [1.4, .9, 4, 3.9, 2.3]
test_df = pd.DataFrame({'feat_1': feat_1, 'feat_2': feat_2})

gmm_test = GaussianMixture(n_components=2).fit(test_df)
gmm_test.predict_proba(test_df)
gmm_test.predict_proba(np.array([[8, -1]]))
I'm getting arrays that are just 1's and 0's, or almost (10^-30 or whatever).
Unless I'm interpreting something incorrectly, the return should be the probability of belonging to each component, so for example,
gmm_test.predict_proba(np.array([[8,-1]]))
should certainly not be [1,0] or [0,1].
The example you gave gives weird results because you have only 5 data points and you are still using 2 mixture components, which basically causes overfitting.
If you do check the means and covariances of your components:
print(gmm_test.means_)
>>> [[4.05 3.95 ]
[1.66666667 1.53333333]]
print(gmm_test.covariances_)
>>> [[[ 0.002501 -0.0025 ]
[-0.0025 0.002501 ]]
[[ 0.24888989 0.13777778]
[ 0.13777778 0.33555656]]]
From this you can see that the first Gaussian is basically fitted with a very small covariance matrix, meaning that unless a point is very close to its center (4.05,3.95), the probability to belong to this Gaussian will always be negligible.
To convince you that despite this, your model is working as expected, try this:
epsilon = 0.005
print(gmm_test.predict_proba([gmm_test.means_[0]+epsilon]))
>>> array([[0.03142181, 0.96857819]])
As soon as you increase epsilon, it will return only array([[0., 1.]]), like you observed.
It might be useful to know that increasing reg_covar will decrease the confidence:
gmm_test = GaussianMixture(n_components =2,reg_covar=1).fit(test_df)
# output [[0.56079116 0.43920884]]
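As a hedged illustration of the overfitting point above, here is a small sketch with made-up data: once each component is supported by enough points and the two Gaussians genuinely overlap, predict_proba returns soft membership probabilities instead of hard 0/1 values.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two overlapping 2-D clusters with 100 points each.
cluster_a = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
cluster_b = rng.normal(loc=[3, 3], scale=1.0, size=(100, 2))
X = np.vstack([cluster_a, cluster_b])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# A point halfway between the two cluster centres gets an intermediate
# membership probability instead of a hard 0/1 assignment.
print(gmm.predict_proba([[1.5, 1.5]]))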

Simulating Time Series With Unobserved Components Model

After fitting a local level model using UnobservedComponents from statsmodels, we are trying to find a way to simulate new time series from the results. Something like:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.structural import UnobservedComponents

np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
y[70:] += 10

plt.plot(X, label='X')
plt.plot(y, label='y')
plt.axvline(69, linestyle='--', color='k')
plt.legend();
ss = {}
ss["endog"] = y[:70]
ss["level"] = "llevel"
ss["exog"] = X[:70]
model = UnobservedComponents(**ss)
trained_model = model.fit()
Is it possible to use trained_model to simulate new time series given the exogenous variable X[70:]? Just as we have the arma_process.generate_sample(nsample=100), we were wondering if we could do something like:
trained_model.generate_random_series(nsample=100, exog=X[70:])
The motivation is that we want to compute the probability of observing a time series as extreme as y[70:] (a p-value for whether the observed response is bigger than the predicted one).
[EDIT]
After reading Josef's and cfulton's comments, I tried implementing the following:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
But this resulted in simulations that don't seem to track the predicted_mean of the forecast that uses X_post as exog. Here's an example:
While y_post meanders around 100, the simulation sits at around -400. This approach always leads to a p-value of 50%.
So I then tried using initial_state=0 and random state shocks; here's the result:
It now seemed that the simulations were following the predicted mean and its 95% credible interval (although, as cfulton comments below, this is also a wrong approach, since it replaces the level variance of the trained model).
I tried using this approach just to see what p-values I'd observe. Here's how I compute the p-value:
samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=0,
                        state_shocks=np.random.normal(size=len(X_post)))
    r += sim.sum() >= y_post_sum
print(r / samples)
For context, this is the Causal Impact model developed by Google. As it's been implemented in R, we've been trying to replicate the implementation in Python using statsmodels as the core to process time series.
We already have a quite cool WIP implementation but we still need to have the p-value to know when in fact we had an impact that is not explained by mere randomness (the approach of simulating series and counting the ones whose summation surpasses y_post.sum() is also implemented in Google's model).
In my example I used y[70:] += 10. If I add just one instead of ten, Google's p-value computation returns 0.001 (there is an impact in y), whereas the Python approach returns 0.247 (no impact).
Only when I add +5 to y_post does the model return a p-value of 0.02, and since that is lower than 0.05 we conclude that there is an impact in y_post.
I'm using python3, statsmodels version 0.9.0
[EDIT2]
After reading cfulton's comments I decided to fully debug the code to see what was happening. Here's what I found:
When we create an object of type UnobservedComponents, the Kalman filter representation is eventually initialized. By default it receives the parameter initial_variance as 1e6, which sets the property of the same name on the object.
When we run the simulate method, the initial_state_cov value is created using this same value:
initial_state_cov = (
    np.eye(self.k_states, dtype=self.ssm.transition.dtype) *
    self.ssm.initial_variance
)
Finally, this same value is used to find initial_state:
initial_state = np.random.multivariate_normal(
    self._initial_state, self._initial_state_cov)
This results in a normal distribution whose variance is 1e6 (i.e. a standard deviation of 1000).
I tried running the following then:
mod1 = UnobservedComponents(np.zeros(len(X_post)), level='llevel', exog=X_post, initial_variance=1)
sim = mod1.simulate(f_model.params, len(X_post))
plt.plot(sim, label='simul')
plt.plot(y_post, label='y')
plt.legend();
print(sim.sum() > y_post.sum())
Which resulted in:
I then tested the p-value, and for a variation of +1 in y_post the model now correctly identifies the added signal.
Still, when I tested with the same data that we have for R's Google package, the p-value was still off. Maybe it's a matter of further tweaking the input to increase accuracy.
@Josef is correct, and you did the right thing with:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
mod1.simulate(f_model.params, len(X_post))
The simulate method simulates data according to the model in question, which is why you can't directly use trained_model to simulate when you have exogenous variables.
You write: "But for some reason the simulations always ended up being lower than y_post."
I think this should be expected - running your example and looking at the estimated coefficients, we get:
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
sigma2.irregular     0.9278      0.194      4.794      0.000       0.548       1.307
sigma2.level         0.0021      0.008      0.270      0.787      -0.013       0.018
beta.x1              1.1882      0.058     20.347      0.000       1.074       1.303
The variance of the level is very small, which means that it is extremely unlikely that the level would shift upwards by nearly 10 percent in a single period, based on the model you specified.
When you used:
mod1.simulate(f_model.params, len(X_post), initial_state=0, state_shocks=np.random.normal(size=len(X_post)))
what happened is that the level term is the only unobserved state here, and by providing your own shocks with a variance equal to 1, you essentially overrode the level variance actually estimated by the model. I don't think that setting the initial state to 0 has much of an effect here. (see edit).
You write:
the p-value computation was closer, but still is not correct.
I'm not sure what this means - why would you expect the model to think such a jump was a likely occurrence? What p-value are you expecting to achieve?
Edit:
Thanks for investigating further (in Edit 2). First, what I think you should do is:
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)
initial_state = np.random.multivariate_normal(
    f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
Now, the explanation:
In Statsmodels 0.9, we didn't yet have exact treatment of states with a diffuse initialization (it has been merged in since then, though, and this is one reason that I wasn't able to replicate your results until I tested your example with the 0.9 codebase). These "initially diffuse" states don't have a long-run mean that we can solve for (e.g. a random walk process), and the state in the local level case is such a state.
The "approximate" diffuse initialization involves setting the initial state mean to zero and the initial state variance to a large number (as you discovered).
For simulations, the initial state is, by default, sampled from the given initial state distribution. Since this model is initialized with approximate diffuse initialization, that explains why your process was initialized around some random number.
Your solution is a good patch, but it's not optimal because it doesn't base the initial state for the simulated period on the last state from the estimated model / data. These values are given by f_model.predicted_state[..., -1] and f_model.predicted_state_cov[..., -1].
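Putting the pieces together, here is a hedged sketch of the full recipe (the names y_post, X_post and f_model are taken from the question, and the attribute usage follows the discussion above): draw the initial state from the last predicted state of the trained model, simulate, and recompute the p-value with the same counting loop as in the question.

import numpy as np
from statsmodels.tsa.statespace.structural import UnobservedComponents

# Model for the post-period, with the same structure as the fitted one.
mod1 = UnobservedComponents(np.zeros(len(y_post)), 'llevel', exog=X_post)

samples = 1000
r = 0
y_post_sum = y_post.sum()
for _ in range(samples):
    # Start each simulation from the last predicted state of the trained model,
    # rather than from the approximate diffuse (variance 1e6) initialization.
    initial_state = np.random.multivariate_normal(
        f_model.predicted_state[..., -1], f_model.predicted_state_cov[..., -1])
    sim = mod1.simulate(f_model.params, len(X_post), initial_state=initial_state)
    r += sim.sum() >= y_post_sum
print('p-value:', r / samples)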

Hidden Markov Model converging to one state using hmmlearn

I have a machine learning problem that I'm trying to solve. I'm using a Gaussian HMM (from hmmlearn) with 5 states, modelling extreme negative, negative, neutral, positive and extreme positive in the sequence. I have set up the model in the gist below
https://gist.github.com/stevenwong/cb539efb3f5a84c8d721378940fa6c4c
import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM
x = pd.read_csv('data.csv')
x = np.atleast_2d(x.values)
h = GaussianHMM(n_components=5, n_iter=10, verbose=True, covariance_type="full")
h = h.fit(x)
y = h.predict(x)
The problem is that most of the estimated states converge to the middle one, even though I can visibly see that there are stretches of positive values and stretches of negative values; they all get lumped together. Any idea how I can get it to fit the data better?
EDIT 1:
Here is the transition matrix. I believe the way it's read in hmmlearn is across the rows (i.e., row 0 holds the probabilities of transitioning from state 0 to itself, state 1, 2, 3, ...).
In [3]: h.transmat_
Out[3]:
array([[ 0.19077231,  0.11117929,  0.24660208,  0.20051377,  0.25093255],
       [ 0.12289066,  0.17658589,  0.24874935,  0.24655888,  0.20521522],
       [ 0.15713787,  0.13912972,  0.25004413,  0.22287976,  0.23080852],
       [ 0.14199694,  0.15423031,  0.25024992,  0.2332739 ,  0.22024893],
       [ 0.17321093,  0.12500688,  0.24880728,  0.21205912,  0.2409158 ]])
If I set all the transition probabilities to 0.2, it looks like this (if I average by state, the separation is worse).
Apparently, your model learned a large variance for state 2. A Gaussian HMM is a generative model trained with a maximum-likelihood criterion, so in some sense you got the optimal fit to the data. I can see it provides meaningful predictions in the extreme cases, so if you want it to attribute more observations to classes other than 2, I would try the following (a rough sketch of both ideas follows below):
1. Data preprocessing. Try using log values for your input to make the differences between observations sharper.
2. Look at your transition matrix; maybe the transition probabilities from state 2 are too low. Try setting all probabilities equal and see what happens.
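A hedged sketch of both suggestions in one place. The data-loading lines are copied from the question; the log transform, the uniform transition matrix, and the init_params/params strings used to keep it fixed during fitting are my assumptions, not something verified on the original data:

import numpy as np
import pandas as pd
from hmmlearn.hmm import GaussianHMM

x = pd.read_csv('data.csv')
x = np.atleast_2d(x.values)

# Suggestion 1: sharpen differences between observations with a log transform
# (np.log1p of shifted values assumes a simple positive scale; adapt to your data).
x_log = np.log1p(x - x.min())

# Suggestion 2: fix the transition matrix at uniform probabilities instead of
# letting EM re-estimate it ('s' and 't' are left out of init_params/params).
h = GaussianHMM(n_components=5, n_iter=10, covariance_type="full",
                init_params="mc", params="mc")
h.startprob_ = np.full(5, 1.0 / 5)
h.transmat_ = np.full((5, 5), 1.0 / 5)
h = h.fit(x_log)

# Inspect which state has the inflated variance (state 2 in the question).
print(h.means_)
print(h.covars_)
print(h.predict(x_log))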
