Gibbs sampler fails to converge - python

I've been trying to understand Gibbs sampling for some time. Recently, I saw a video that made a good deal of sense.
https://www.youtube.com/watch?v=a_08GKWHFWo
The author used Gibbs sampling to converge on the mean values (theta_1 and theta_2) of a bivariate normal distribution, using the process as follows:
init: Initialize theta_2 to a random value.
Loop:
sample theta_1 conditioned on theta_2 from N(p * theta_2, 1 - p**2), where p is the correlation
sample theta_2 conditioned on theta_1 from N(p * theta_1, 1 - p**2)
(repeat until convergence.)
I tried this on my own and ran into an issue:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

rv = multivariate_normal(mean=[0.5, -0.2], cov=[[1, 0.9], [0.9, 1]])
rv.mean
>>> array([ 0.5, -0.2])
rv.cov
>>> array([[1. , 0.9],
           [0.9, 1. ]])

samples = []
curr_t2 = np.random.rand()

def gibbs(iterations=5000):
    theta_1 = np.random.normal(curr_t2, (1 - 0.9**2), None)
    theta_2 = np.random.normal(theta_1, (1 - 0.9**2), None)
    samples.append((theta_1, theta_2))
    for i in range(iterations - 1):
        theta_1 = np.random.normal(theta_2, (1 - 0.9**2), None)
        theta_2 = np.random.normal(theta_1, (1 - 0.9**2), None)
        samples.append((theta_1, theta_2))

gibbs()
sum([a for a, b in samples]) / len(samples)
>>> 4.745736136676516
sum([b for a, b in samples]) / len(samples)
>>> 4.746816908769834
Now, I see where I messed up. I found theta_1 conditioned on theta_2's actual value, not its probability. Likewise, I found theta_2 conditioned on theta_1's actual value, not its probability.
Where I'm stuck is, how do I evaluate the probability of either theta taking on any given observed value?
Two options I see: the probability density (based on the location on the normal curve) and the p-value (integrating from infinity and/or negative infinity to the observed value). Neither of these sounds right.
How should I proceed?

Perhaps my video wasn't clear enough. The algorithm does not converge "on the mean values"; rather, it converges to samples from the joint distribution. Averages of those samples will, in turn, converge to their respective marginal means.
The issue is with your conditional means. In the video I chose marginal means of zero to reduce notation. With non-zero marginal means, the conditional distribution for a bivariate normal with unit variances is theta_1 | theta_2 ~ N(mu1 + rho * (theta_2 - mu2), 1 - rho**2), and symmetrically for theta_2 | theta_1; it involves the marginal means, the correlation, and the standard deviations (which are 1 in your bivariate normal). The updated code is:
import numpy as np
from scipy.stats import multivariate_normal

mu1 = 0.5
mu2 = -0.2
rv = multivariate_normal(mean=[mu1, mu2], cov=[[1, 0.9], [0.9, 1]])

samples = []
curr_t2 = np.random.rand()

def gibbs(iterations=5000):
    # np.random.normal expects the standard deviation, so the conditional
    # scale is sqrt(1 - 0.9**2), not the variance 1 - 0.9**2
    sd = np.sqrt(1 - 0.9**2)
    theta_1 = np.random.normal(mu1 + 0.9 * (curr_t2 - mu2), sd)
    theta_2 = np.random.normal(mu2 + 0.9 * (theta_1 - mu1), sd)
    samples.append((theta_1, theta_2))
    for i in range(iterations - 1):
        theta_1 = np.random.normal(mu1 + 0.9 * (theta_2 - mu2), sd)
        theta_2 = np.random.normal(mu2 + 0.9 * (theta_1 - mu1), sd)
        samples.append((theta_1, theta_2))

gibbs()
sum([a for a, b in samples]) / len(samples)
sum([b for a, b in samples]) / len(samples)
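For a quick sanity check, a minimal sketch (assuming the samples list filled by gibbs() above): compare the empirical mean and covariance of the draws with rv.mean and rv.cov after discarding a short burn-in.

arr = np.array(samples)                  # shape (5000, 2)
burned = arr[500:]                       # drop a short burn-in
print(burned.mean(axis=0))               # should approach [0.5, -0.2]
print(np.cov(burned, rowvar=False))      # should approach [[1, 0.9], [0.9, 1]]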

Related

Discrepancy between log_prob and manual calculation

I want to define a multivariate normal distribution with mean [1, 1, 1] and a variance-covariance matrix with 0.3 on the diagonal, and then calculate the log likelihood at the data point [2, 3, 4].
By torch distributions
import torch
import torch.distributions as td
input_x = torch.tensor([2, 3, 4])
loc = torch.ones(3)
scale = torch.eye(3) * 0.3
mvn = td.MultivariateNormal(loc = loc, scale_tril=scale)
mvn.log_prob(input_x)
tensor(-76.9227)
From scratch
Using the formula for the multivariate normal log likelihood,
log p(x) = -(k/2) * log(2*pi) - (1/2) * log|Sigma| - (1/2) * (x - mu)^T Sigma^{-1} (x - mu),
we obtain the tensor:
first_term = (2 * np.pi * 0.3)**(3)
first_term = -np.log(np.sqrt(first_term))
x_center = input_x - loc
tmp = torch.matmul(x_center, scale.inverse())
tmp = -1/2 * torch.matmul(tmp, x_center)
first_term + tmp
tensor(-24.2842)
where I used the fact that the determinant of the diagonal covariance is the product of its diagonal entries, i.e. |0.3 * I_3| = 0.3**3.
My question is: what's the source of this discrepancy?
You are passing the covariance matrix to scale_tril instead of to covariance_matrix. From the docs of PyTorch's MultivariateNormal:
scale_tril (Tensor) – lower-triangular factor of covariance, with positive-valued diagonal
So, replacing scale_tril with covariance_matrix would yield the same results as your manual attempt.
In [1]: mvn = td.MultivariateNormal(loc = loc, covariance_matrix=scale)
In [2]: mvn.log_prob(input_x)
Out[2]: tensor(-24.2842)
However, according to the docs it is more efficient to use scale_tril:
...Using scale_tril will be more efficient:
You can compute the lower Cholesky factor with torch.linalg.cholesky:
In [3]: mvn = td.MultivariateNormal(loc = loc, scale_tril=torch.linalg.cholesky(scale))
In [4]: mvn.log_prob(input_x)
Out[4]: tensor(-24.2842)
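As a quick illustration of why the original call was so far off, a minimal sketch (not part of the original answer): the lower Cholesky factor L satisfies L @ L.T == Sigma, so passing the covariance itself as scale_tril implicitly describes a distribution with covariance Sigma @ Sigma.T = 0.09 * I, which is tighter than 0.3 * I and therefore assigns a much lower log-probability to [2, 3, 4].

import torch
import torch.distributions as td

loc = torch.ones(3)
cov = torch.eye(3) * 0.3
L = torch.linalg.cholesky(cov)                  # lower-triangular, L @ L.T == cov
print(torch.allclose(L @ L.T, cov))             # True

x = torch.tensor([2.0, 3.0, 4.0])
# Passing cov as scale_tril uses covariance cov @ cov.T = 0.09 * I instead of 0.3 * I.
print(td.MultivariateNormal(loc=loc, scale_tril=cov).log_prob(x))   # tensor(-76.9227)
print(td.MultivariateNormal(loc=loc, scale_tril=L).log_prob(x))     # tensor(-24.2842)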

Monte Carlo simulations in Python using quasi random standard normal numbers using sobol sequences gives erroneous values

I'm trying to perform Monte Carlo simulations using quasi-random standard normal numbers. I understand that we can use Sobol sequences to generate uniform numbers and then apply the probability integral transform to convert them to standard normal numbers. My code gives unrealistic values for the simulated asset path:
import sobol_seq
import numpy as np
from scipy.stats import norm

def i4_sobol_generate_std_normal(dim_num, n, skip=1):
    """
    Generates multivariate standard normal quasi-random variables.
    Parameters:
      Input, integer dim_num, the spatial dimension.
      Input, integer n, the number of points to generate.
      Input, integer skip, the number of initial points to skip.
      Output, real np array of shape (n, dim_num).
    """
    sobols = sobol_seq.i4_sobol_generate(dim_num, n, skip)
    normals = norm.ppf(sobols)
    return normals

def GBM(Ttm, TradingDaysInAYear, NoOfPaths, UnderlyingPrice, RiskFreeRate, Volatility):
    dt = float(Ttm) / TradingDaysInAYear
    paths = np.zeros((TradingDaysInAYear + 1, NoOfPaths), np.float64)
    paths[0] = UnderlyingPrice
    for t in range(1, TradingDaysInAYear + 1):
        rand = i4_sobol_generate_std_normal(1, NoOfPaths)
        lRand = []
        for i in range(len(rand)):
            a = rand[i][0]
            lRand.append(a)
        rand = np.array(lRand)
        paths[t] = paths[t - 1] * np.exp((RiskFreeRate - 0.5 * Volatility ** 2) * dt
                                         + Volatility * np.sqrt(dt) * rand)
    return paths

GBM(1, 252, 8, 100., 0.05, 0.5)
array([[1.00000000e+02, 1.00000000e+02, 1.00000000e+02, ...,
1.00000000e+02, 1.00000000e+02, 1.00000000e+02],
[9.99702425e+01, 1.02116774e+02, 9.78688323e+01, ...,
1.00978615e+02, 9.64128959e+01, 9.72154915e+01],
[9.99404939e+01, 1.04278354e+02, 9.57830834e+01, ...,
1.01966807e+02, 9.29544649e+01, 9.45085180e+01],
...,
[9.28295879e+01, 1.88049044e+04, 4.58249200e-01, ...,
1.14117599e+03, 1.08089096e-02, 8.58754653e-02],
[9.28019642e+01, 1.92029616e+04, 4.48483141e-01, ...,
1.15234371e+03, 1.04211828e-02, 8.34842557e-02],
[9.27743486e+01, 1.96094448e+04, 4.38925214e-01, ...,
1.16362072e+03, 1.00473641e-02, 8.11596295e-02]])
Values like 8.11596295e-02 should not be generated, so I think there is something wrong in the code. (8.11596295e-02 is a price along one path, and it's very unlikely that the price would fall from the initial 100 to that level.) If I instead use standard normal draws from numpy, rand = np.random.standard_normal(NoOfPaths), the resulting price matches the Black-Scholes price, so I suspect the quasi-random number generator.
References: 1, 2, 3.
It appears there is a bug in sobol_seq (Anaconda, Python 3.7, 64-bit, Windows 10 x64; sobol_seq installed via pip):
pip install sobol_seq
# Name Version Build Channel
sobol-seq 0.1.2 pypi_0 pypi
Simple code
print(sobol_seq.i4_sobol_generate(1, 1, 0))
print(sobol_seq.i4_sobol_generate(1, 1, 1))
print(sobol_seq.i4_sobol_generate(1, 1, 2))
print(sobol_seq.i4_sobol_generate(1, 1, 3))
produced output
[[0.5]]
[[0.5]]
[[0.5]]
[[0.5]]
Code from http://people.sc.fsu.edu/~jburkardt/py_src/sobol/sobol.html (sobol_lib.py) behaves reasonably (well, except for the first point).
The code below seems to work, keeping the seed together with the sampled array. It is slow, though:
import sobol_seq
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def i4_sobol_generate_std_normal(dim_num, seed, size=None):
    """
    Generates multivariate standard normal quasi-random variables.
    Parameters:
      Input, integer dim_num, the spatial dimension.
      Input, integer seed, the initial seed.
      Input, integer size, the number of points to generate (None means a single point).
      Output, tuple of (real np array of shape (dim_num, size), updated seed).
    """
    if size is None:
        q, seed = sobol_seq.i4_sobol(dim_num, seed)
        normals = norm.ppf(q)
        return (normals, seed)
    if isinstance(size, (int, np.int16, np.int32, np.int64)):
        rc = np.empty((dim_num, size))
        for k in range(size):
            q, seed = sobol_seq.i4_sobol(dim_num, seed)
            rc[:, k] = norm.ppf(q)
        return (rc, seed)
    else:
        raise ValueError("Size type is not recognized")

seed = 1
x, seed = i4_sobol_generate_std_normal(1, seed)
print(x)
x, seed = i4_sobol_generate_std_normal(1, seed)
print(x)

seed = 1
x, seed = i4_sobol_generate_std_normal(1, seed, size=10)
print(x)
x, seed = i4_sobol_generate_std_normal(1, seed, size=1000)
print(x)

hist, bins = np.histogram(x, bins=20, range=(-2.5, 2.5), density=True)
plt.bar(bins[:-1], hist, width=0.22, align='edge')
plt.show()
Here is the resulting histogram (figure not reproduced here).
For future reference, we added Sobol' sequence in SciPy 1.7: scipy.stats.qmc.Sobol
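A minimal sketch of the same uniform-to-normal transform with the SciPy engine (assuming SciPy >= 1.7), which avoids the skip/seed bookkeeping because the Sobol' generator is stateful:

import numpy as np
from scipy.stats import norm, qmc

sampler = qmc.Sobol(d=1, scramble=True, seed=42)   # one stateful engine
u = sampler.random_base2(m=8)                      # 2**8 = 256 points in [0, 1)
z = norm.ppf(u)                                    # probability integral transform
print(z.shape, z.mean(), z.std())                  # roughly mean 0, std 1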

Variance inflation factor in ridge regression in python

I'm running a ridge regression on somewhat collinear data. One of the methods used to identify a stable fit is a ridge trace, and thanks to the great example on scikit-learn, I'm able to do that. Another method is to calculate variance inflation factors (VIFs) for each variable as k increases. When the VIFs decrease to below 5, it is an indication that the fit is satisfactory. Statsmodels has code for VIFs, but it is for an OLS regression. I've attempted to alter it to handle a ridge regression.
I'm checking my results against Regression Analysis by Example, 5th edition, chapter 10. My code generates the correct results for k = 0.000, but not after that. Working SAS code is available, but I'm not a SAS user and I don't know the differences between that implementation and scikit-learn's (and/or statsmodels's).
I've been stuck on this for a few days so any help would be much appreciated.
# http://www.ats.ucla.edu/stat/sas/examples/chp/chp_ch10.htm
from __future__ import division
import numpy as np
import pandas as pd

example = pd.read_csv('by_example_import.csv')
example.dropna(inplace=True)

from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(example)
scaler.transform(example)

X = example.drop(['year', 'import'], axis=1)
#c_matrix = X.corr()
y = example['import']
#w, v = np.linalg.eig(c_matrix)

import pylab as pl
from sklearn import linear_model

###############################################################################
# Compute paths

alphas = [0.000, 0.001, 0.003, 0.005, 0.007, 0.009, 0.010, 0.012, 0.014, 0.016, 0.018,
          0.020, 0.022, 0.024, 0.026, 0.028, 0.030, 0.040, 0.050, 0.060, 0.070, 0.080,
          0.090, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2.0]

clf = linear_model.Ridge(fit_intercept=False)
clf2 = linear_model.Ridge(fit_intercept=False)

coefs = []
vif_list = [[] for x in range(X.shape[1])]
for a in alphas:
    clf.set_params(alpha=a)
    clf.fit(X, y)
    coefs.append(clf.coef_)

    for j, data in enumerate(X.columns):
        cols = [col for col in X.columns if col not in [data]]
        Z = X[cols]
        yy = X.iloc[:, j]
        clf2.set_params(alpha=a)
        clf2.fit(Z, yy)
        r_squared_j = clf2.score(Z, yy)
        vif = 1. / (1. - r_squared_j)
        print r_squared_j
        vif_list[j].append(vif)

pd.DataFrame(vif_list, columns=alphas).T
pd.DataFrame(coefs, index=alphas)

###############################################################################
# Display results

ax = pl.gca()
ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
ax.plot(alphas, coefs)
pl.vlines(ridge_cv.alpha_, np.min(coefs), np.max(coefs), linestyle='dashdot')
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()
###############################################################################
# Display results
ax = pl.gca()
ax.set_color_cycle(['b', 'r', 'g', 'c', 'k', 'y', 'm'])
ax.plot(alphas, coefs)
pl.vlines(ridge_cv.alpha_, np.min(coefs), np.max(coefs), linestyle='dashdot')
pl.xlabel('alpha')
pl.ylabel('weights')
pl.title('Ridge coefficients as a function of the regularization')
pl.axis('tight')
pl.show()
The variance inflation factors for Ridge regression take just three lines; I checked them against the example on the UCLA statistics page. With standardized predictors and correlation matrix R, they are the diagonal of (R + k*I)^{-1} R (R + k*I)^{-1} for each penalization factor k.
A variation of this will make it into the next statsmodels release. Here is my current function:
import numpy as np

def vif_ridge(corr_x, pen_factors, is_corr=True):
    """Variance inflation factors for Ridge regression.

    Assumes penalization is on standardized variables.
    Data should not include a constant.

    Parameters
    ----------
    corr_x : array_like
        Correlation matrix if is_corr=True, or original data if is_corr=False.
    pen_factors : iterable
        Iterable of Ridge penalization factors.
    is_corr : bool
        Boolean to indicate how corr_x is interpreted, see corr_x.

    Returns
    -------
    vif : ndarray
        Variance inflation factors for parameters in columns and ridge
        penalization factors in rows.

    Could be optimized for repeated calculations.
    """
    corr_x = np.asarray(corr_x)
    if not is_corr:
        corr = np.corrcoef(corr_x, rowvar=0, bias=True)
    else:
        corr = corr_x
    eye = np.eye(corr.shape[1])
    res = []
    for k in pen_factors:
        minv = np.linalg.inv(corr + k * eye)
        vif = minv.dot(corr).dot(minv)
        res.append(np.diag(vif))
    return np.asarray(res)
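A minimal usage sketch (with made-up collinear data rather than the book's import example, so the numbers are only illustrative):

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.05 * rng.normal(size=200)    # nearly collinear with x1
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

ks = [0.0, 0.01, 0.05, 0.1, 0.5]
vifs = vif_ridge(np.corrcoef(X, rowvar=False), ks)
for k, row in zip(ks, vifs):
    print(k, np.round(row, 2))           # VIFs for x1 and x2 shrink as k grows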

IDL's INT_TABULATE - SciPy equivalent?

I am working on moving some code from IDL into python. One IDL call is to INT_TABULATE which performs integration on a fixed range.
The INT_TABULATED function integrates a tabulated set of data { xi , fi } on the closed interval [MIN(x) , MAX(x)], using a five-point Newton-Cotes integration formula.
Result = INT_TABULATED( X, F [, /DOUBLE] [, /SORT] )
Where result is the area under the curve.
IDL DOCS
My question is: does NumPy/SciPy offer a similar form of integration? I see that scipy.integrate.newton_cotes exists, but it appears to return the "weights and error coefficient for Newton-Cotes integration" rather than the area itself.
SciPy does not provide such a high-order integrator for tabulated data out of the box. The closest you have available without coding it yourself is scipy.integrate.simps, which uses a 3-point Newton-Cotes method (Simpson's rule).
If you simply want comparable integration precision, you could split your x and f arrays into 5-point chunks and integrate them one at a time, using the weights returned by scipy.integrate.newton_cotes, doing something along these lines:
import numpy as np
import scipy.integrate

def idl_tabulate(x, f, p=5):
    def newton_cotes(x, f):
        if x.shape[0] < 2:
            return 0
        rn = (x.shape[0] - 1) * (x - x[0]) / (x[-1] - x[0])
        weights = scipy.integrate.newton_cotes(rn)[0]
        return (x[-1] - x[0]) / (x.shape[0] - 1) * np.dot(weights, f)
    ret = 0
    for idx in range(0, x.shape[0], p - 1):
        ret += newton_cotes(x[idx:idx + p], f[idx:idx + p])
    return ret
This does 5-point Newton-Cotes on all intervals, except perhaps the last, where it does Newton-Cotes on however many points remain. Unfortunately, this will not give you the same results as INT_TABULATED because the internal methods are different:
SciPy calculates the weights for points that are not equally spaced using what looks like a least-squares fit. I don't fully understand what is going on, but the code is pure Python; you can find it in your SciPy installation in the file scipy/integrate/quadrature.py.
INT_TABULATED always performs 5-point Newton-Cotes on equispaced data. If the data are not equispaced, it builds an equispaced grid, using a cubic spline to interpolate the values at those points. You can check the code here.
For the example in the INT_TABULATED docstring, which is supposed to return 1.6271 using the original code and has an exact solution of 1.6405, the above function returns:
>>> x = np.array([0.0, 0.12, 0.22, 0.32, 0.36, 0.40, 0.44, 0.54, 0.64,
... 0.70, 0.80])
>>> f = np.array([0.200000, 1.30973, 1.30524, 1.74339, 2.07490, 2.45600,
... 2.84299, 3.50730, 3.18194, 2.36302, 0.231964])
>>> idl_tabulate(x, f)
1.641998154242472
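If you want to get closer to INT_TABULATED itself, a rough sketch (my own, following the description above, and assuming IDL resamples onto an equispaced grid whose segment count is a multiple of four) is to interpolate x and f with a cubic spline and reuse the chunked 5-point rule:

from scipy.interpolate import interp1d

def idl_tabulate_equispaced(x, f):
    # smallest multiple of 4 segments that is >= the number of input intervals
    n_seg = 4 * int(np.ceil((x.shape[0] - 1) / 4.0))
    xs = np.linspace(x[0], x[-1], n_seg + 1)
    fs = interp1d(x, f, kind='cubic')(xs)   # cubic interpolation onto the grid
    return idl_tabulate(xs, fs)             # chunked 5-point Newton-Cotes from above

print(idl_tabulate_equispaced(x, f))        # compare against 1.6420 above and IDL's 1.6271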

generate random lognormal distributions using shape of observed data

I'm trying to fit some data to a lognormal distribution and, from this, generate random lognormal samples using the fitted parameters.
After some searching I found a few approaches, but none is convincing.
Solution 1, using the fit function:
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
shape, loc, scale = lognorm.fit(mydata)
rnd_log = lognorm.rvs (shape, loc=loc, scale=scale, size=100)
or Solution 2, using mu and sigma computed from the original data:
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
mu = np.mean([np.log(i) for i in mydata])
sigma = np.std([np.log(i) for i in mydata])
distr = lognorm(mu, sigma)
rnd_log = distr.rvs (size=100)
Neither of those solutions fits well:
import pylab
pylab.plot(sorted(mydata, reverse=True), 'ro')
pylab.plot(sorted(rnd_log, reverse=True), 'bx')
I am not sure whether I misunderstand how to use the distributions, or whether I am missing something else...
I thought I'd find the solution here: Does anyone have example code of using scipy.stats.distributions?
but I am not able to get the shape from my data... am I missing something in the use of the fit function?
Thanks.
EDIT:
Here is an example to illustrate my problem better:
print 'solution 1:'
means = []
stdes = []
distr = lognorm(mu, sigma)
for _ in xrange(1000):
    rnd_log = distr.rvs(size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)

print '\nsolution 2:'
means = []
stdes = []
shape, loc, scale = lognorm.fit(mydata)
for _ in xrange(1000):
    rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
    means.append(np.mean([np.log(i) for i in rnd_log]))
    stdes.append(np.std([np.log(i) for i in rnd_log]))
print 'observed mean:', mu, 'mean simulated mean:', np.mean(means)
print 'observed std :', sigma, 'mean simulated std :', np.mean(stdes)
the result is:
solution 1:
observed mean: 1.82562655734 mean simulated mean: 1.18929982267
observed std : 1.39003773799 mean simulated std : 0.88985924363
solution 2:
observed mean: 1.82562655734 mean simulated mean: 4.50608084668
observed std : 1.39003773799 mean simulated std : 5.44206119499
while, if I do the same in R:
mydata <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354)
meanlog <- mean(log(mydata))
sdlog <- sd(log(mydata))
means <- c()
stdes <- c()
for (i in 1:1000){
    rnd.log <- rlnorm(length(mydata), meanlog, sdlog)
    means <- c(means, mean(log(rnd.log)))
    stdes <- c(stdes, sd(log(rnd.log)))
}
print (paste('observed mean:',meanlog,'mean simulated mean:',mean(means),sep=' '))
print (paste('observed std :',sdlog ,'mean simulated std :',mean(stdes),sep=' '))
I get:
[1] "observed mean: 1.82562655733507 mean simulated mean: 1.82307191072317"
[1] "observed std : 1.39704049131865 mean simulated std : 1.39736545866904"
which is much closer, so I guess I am doing something wrong when using scipy...
The lognormal distribution in scipy is parametrized a little differently from the usual (meanlog, sdlog) convention. See the scipy.stats.lognorm docs, particularly the "Notes" section.
Here's how to get the results you're expecting (note that we hold the location fixed at 0 when fitting):
In [315]: from scipy import stats
In [316]: x = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354])
In [317]: mu, sigma = stats.norm.fit(np.log(x))
In [318]: mu, sigma
Out[318]: (1.8256265573350701, 1.3900377379913127)
In [319]: shape, loc, scale = stats.lognorm.fit(x, floc=0)
In [320]: np.log(scale), shape
Out[320]: (1.8256267737298788, 1.3900309739954713)
Now you can generate samples and confirm your expectations:
In [321]: dist = stats.lognorm(shape, loc, scale)
In [322]: means, sds = [], []
In [323]: for i in xrange(1000):
   .....:     sample = dist.rvs(size=100)
   .....:     logsample = np.log(sample)
   .....:     means.append(logsample.mean())
   .....:     sds.append(logsample.std())
   .....:
In [324]: np.mean(means), np.mean(sds)
Out[324]: (1.8231068508345041, 1.3816361818739145)
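In other words, a minimal sketch of the mapping: SciPy's lognorm with shape s, loc=0, and scale corresponds to the usual (meanlog, sdlog) parametrization via s = sdlog and scale = exp(meanlog), so the R call rlnorm(n, meanlog, sdlog) translates directly:

import numpy as np
from scipy.stats import lognorm

meanlog, sdlog = 1.8256, 1.3900                      # roughly the values fitted above
dist = lognorm(s=sdlog, loc=0, scale=np.exp(meanlog))
sample = dist.rvs(size=100)
print(np.log(sample).mean(), np.log(sample).std())   # close to meanlog and sdlog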
