I'm new to coding in Python and want to estimate parameters from a data set that I know from theory is most likely t-distributed. The first method I tried was t.fit(). To double-check the results I also used st.stats.describe(), and noticed I got different results. I also used t.stats() to get the moments "mvsk". I'm not sure what the different functions do, or which results to trust. The parameters will later be used in a Monte Carlo simulation. Can somebody explain the different methods and what I'm doing wrong?
import numpy as np
from scipy.stats import norm,t
import scipy.stats as st
import pandas as pd
import math
SP = pd.read_excel('S&P+sectors.xlsx',
                   parse_dates=['date'],
                   index_col='date')['.SPX']
rets = np.log(SP).diff()
rets = rets.dropna()
# Fit once and reuse the result: (df, loc, scale)
df, loc, scale = t.fit(rets)
print("Parameters from t.fit: ", (df, loc, scale), "\n")
d = st.describe(rets)  # public name for st.stats.describe
print(d, "\n")
print("Standard Deviation from st.stats.describe : ", np.sqrt(d.variance), "\n")
mean, var, skew, kurt = t.stats(df, loc=loc, scale=scale, moments='mvsk')
print("mean, std.dev, skew, kurt: ", mean, np.sqrt(var), skew, kurt)
Output:
Parameters from t.fit: (2.563005821560674, 0.0005384408493821172, 0.006945103287629065)
DescribeResult(nobs=4767, minmax=(-0.09469514468085727, 0.10957195934756658), mean=0.00011244654312862343, variance=0.00014599380983290917, skewness=-0.21364378793604263, kurtosis=8.494830112279583)
Standard Deviation from st.stats.describe : 0.012082789819942626
mean, std.dev, skew, kurt: 0.0005384408493821172 0.014818254946408262 nan nan
You can see that I get different means from t.fit() and st.stats.describe(). The standard deviation is different for all three, and the skewness and kurtosis are also different. Why is this?
There is no difference
SQRT(0.00014599380983290917) = 0.01208278982
One is variance, another is stddev
OK, let's make it more descriptive.
Parameters from t.fit are what the fitter thinks are the best values for laying a t-distribution curve over the set of sampled data.
DescribeResult reports the variance, not the stddev, so take the square root of the variance to get the stddev: SQRT(0.00014599380983290917) = 0.01208278982. Then you compute the stddev yourself, and the two are the same. Please remember that those values (stddev, variance, mean) are computed from the sampled data.
On the last line you compute the DISTRIBUTION mean and stddev, which SciPy derives from the fitted parameters (df, loc, scale) by applying formulas or doing numerical integration. They generally differ from the sample mean and sample stddev. The fit does not try to match the sample moments; t.fit picks the parameters that maximize the likelihood of the data. The skewness and kurtosis come out as nan because the fitted df is about 2.56, and a t distribution has a defined skewness only for df > 3 and a finite kurtosis only for df > 4. It works the other way around as well: if you start from distribution parameters, compute the distribution mean and stddev, and then draw a sample and compute the sample mean/stddev, they will differ from the distribution ones. Only in the limit of infinite sample size do the sample moments agree with the distribution moments.
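To make the distinction between distribution moments and sample moments concrete, here is a minimal sketch. It plugs the fitted parameters printed in the question into the textbook variance formula for a location-scale t distribution, scale^2 * df / (df - 2), and then draws samples of two sizes. The sample sizes and the random seed are arbitrary choices for illustration, not part of the original post.

```python
import numpy as np
from scipy.stats import t

# Fitted parameters copied from the question's output: (df, loc, scale)
df, loc, scale = 2.563005821560674, 0.0005384408493821172, 0.006945103287629065

# Distribution stddev from the closed-form variance (valid for df > 2)
dist_std = scale * np.sqrt(df / (df - 2.0))
# The same quantity as computed by SciPy
scipy_std = t.std(df, loc=loc, scale=scale)
print("distribution stddev:", dist_std, scipy_std)

# Sample moments fluctuate around the distribution moments.
# With df this low the tails are heavy, so the sample stddev is very noisy.
rng = np.random.default_rng(0)  # seed chosen arbitrarily
for n in (100, 10_000):
    sample = t.rvs(df, loc=loc, scale=scale, size=n, random_state=rng)
    print(n, sample.mean(), sample.std(ddof=1))
```

The first printed value should agree with the 0.0148 reported on the last line of the question's output, while the sample values change with every seed.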
Related
For the data yielding the below histogram, I used gamma.fit(data) function. It yields (0.2856629839547822, 0.001612367540316285, 1.3126526078419007) which must be the alpha, loc, and scale parameters of the distribution. However, the mean and standard deviations are m=0.04181341484525036 and s=0.02581912984507876 for the given dataset. The PDF is zero below the mean (m) value. I couldn't find any questions about this problem. What am I missing?
Histogram of the data
The PDF is most definitely not zero for all values below your calculated mean. In fact, over 40% of the PDF area lies at x<=m:
from scipy import stats
g = stats.gamma(0.2856629839547822, 0.001612367540316285, 1.3126526078419007)
m=0.04181341484525036
print(g.cdf(m))
0.4078
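A quick diagnostic that may help here is to compare the moments implied by the fitted parameters with the sample moments. The sketch below does that for the parameters quoted in the question, and then shows one commonly used option, fixing the location with floc=0, on synthetic stand-in data. The synthetic data and its parameters are assumptions made for illustration, since the original data set is not available.

```python
import numpy as np
from scipy import stats

# Parameters reported by gamma.fit in the question: (shape, loc, scale)
a, loc, scale = 0.2856629839547822, 0.001612367540316285, 1.3126526078419007
g = stats.gamma(a, loc=loc, scale=scale)

# Moments implied by the fitted parameters:
# mean = loc + a * scale, stddev = sqrt(a) * scale
fitted_mean = g.mean()
fitted_std = g.std()
print("fitted mean/std:", fitted_mean, fitted_std)
print("sample mean/std from the question:", 0.04181341484525036, 0.02581912984507876)

# Synthetic stand-in data (assumption): a gamma sample whose mean and stddev
# are roughly those quoted in the question.
rng = np.random.default_rng(0)
data = rng.gamma(shape=2.6, scale=0.016, size=5000)

# Fix the location at zero so only shape and scale are estimated
a0, loc0, scale0 = stats.gamma.fit(data, floc=0)
refit = stats.gamma(a0, loc=loc0, scale=scale0)
print("refit mean vs sample mean:", refit.mean(), data.mean())
```

If the fitted mean is several times the sample mean, as it is for the parameters in the question, that suggests the unconstrained three-parameter fit settled on a poor solution rather than that the PDF is zero below the mean.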
I have two lists. Both contain normalized percentages:
actual_population_distribution = [0.2,0.3,0.3,0.2]
sample_population_distribution = [0.1,0.4,0.2,0.3]
I wish to fit these two lists to a gamma distribution and then use the two resulting lists to get the KL value.
I am already able to get a KL value.
This is the function I used to calculate gamma:
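import random
import numpy as np

def gamma_random_sample(data_list):
    # Method-of-moments estimates of the gamma shape (alpha) and rate (beta)
    mean = np.mean(data_list)
    var = np.var(data_list)
    g_alpha = mean * mean / var
    g_beta = mean / var
    # Draw one gamma variate per input element; gammavariate takes shape and scale
    for _ in range(len(data_list)):
        yield random.gammavariate(g_alpha, 1 / g_beta)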
Fit two lists into gamma distribution:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
This is the code I used to calculate KL:
kl = np.sum(scipy.special.kl_div(actual_grs, sample_grs))
The code above does not produce any errors.
But I suspect the way I handled the gamma step is wrong, because I use np.mean/np.var to get the mean and variance.
Indeed, the numbers are different from:
mean, var, skew, kurt = gamma.stats(fit_alpha, loc = fit_loc, scale = fit_beta, moments = 'mvsk')
if I use this way.
By using "mean, var, skew, kurt = gamma.stats(fit_alpha, loc = fit_loc, scale = fit_beta, moments = 'mvsk')", I get a KL value far larger than 1, so it seems neither way gives a correct KL.
What am I missing?
See this Cross Validated post: https://stats.stackexchange.com/questions/280459/estimating-gamma-distribution-parameters-using-sample-mean-and-std
I don't understand what you are trying to do with:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
It doesn't look like you are fitting a gamma distribution. It looks like you are using the method-of-moments estimator to get the parameters of a gamma distribution, and then drawing a single random number for each element of your actual_population_distribution and sample_population_distribution lists from the gamma distribution with those parameters.
The gamma distribution is notoriously hard to fit. I hope your actual data has a longer list. Four data points are hardly sufficient for estimating a two-parameter distribution. The estimates are kind of garbage until you get hundreds of elements or more; take a look at this document on maximum likelihood estimation and the Fisher information for a gamma distribution: https://www.math.arizona.edu/~jwatkins/O3_mle.pdf .
I don't know what you are trying to do with the kl divergence either. Your actual population is already normalized to 1 and so is the sample distribution. You can plug in those elements directly into the KL divergence for a discrete score -- what you are doing with your code is a stretching and addition of gamma noise to your original list values with your defined gamma function. You are more likely to have a larger deviation with the KL divergence after the gamma corruption of your original population data.
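For reference, plugging the two normalized lists directly into the discrete KL divergence can be sketched as follows. It uses scipy.special.rel_entr, which computes the elementwise terms p * log(p / q); the variable names p and q are introduced here for illustration only.

```python
import numpy as np
from scipy.special import rel_entr

# The two normalized lists from the question
p = np.array([0.2, 0.3, 0.3, 0.2])  # actual_population_distribution
q = np.array([0.1, 0.4, 0.2, 0.3])  # sample_population_distribution

# Discrete KL divergence D(p || q) = sum_i p_i * log(p_i / q_i), in nats
kl_pq = float(np.sum(rel_entr(p, q)))

# Same quantity written out by hand, as a cross-check
kl_manual = float(np.sum(p * np.log(p / q)))

print(kl_pq, kl_manual)
```

Because both lists sum to 1, summing scipy.special.kl_div over the same inputs gives the same number; the extra "- p + q" terms in kl_div cancel.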
I'm sorry, I just don't see what you are trying to accomplish here. If I were to guess your original intent, I'd say your problem is that you need hundreds of data points to get stable estimates out of any gamma fitting program.
EDIT: One more point regarding the KL divergence. If you intend to score your fitted gamma distributions with the KL divergence, it's better to use an analytical solution whose inputs are the shape and scale parameters of your two gamma distributions. Randomly sampling noisy data points won't be helpful unless you take something like 100,000 random samples, histogram them into 1,000 bins or so, and then normalize the histogram. I'm just throwing those numbers out, but you want to approximate a continuous distribution as well as you can, and that is hard because gamma distributions have long tails. This document has the analytical solution for a generalized gamma distribution: https://arxiv.org/pdf/1401.6853.pdf . Just set that third parameter to 1, simplify, and then code up a function.
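As a sketch of what such a function could look like, the block below codes the standard closed-form KL divergence between two ordinary (two-parameter) gamma distributions in the shape/scale parameterization. This is the well-known textbook expression, not a transcription of the linked paper, and the function name kl_gamma and the example parameters are made up for illustration. A Monte Carlo estimate is included only as a cross-check.

```python
import numpy as np
from scipy import stats
from scipy.special import digamma, gammaln

def kl_gamma(shape_p, scale_p, shape_q, scale_q):
    """KL(P || Q) in nats for P = Gamma(shape_p, scale_p), Q = Gamma(shape_q, scale_q)."""
    return ((shape_p - shape_q) * digamma(shape_p)
            - gammaln(shape_p) + gammaln(shape_q)
            + shape_q * (np.log(scale_q) - np.log(scale_p))
            + shape_p * (scale_p / scale_q - 1.0))

# Example parameters (arbitrary, for illustration)
kl_exact = kl_gamma(2.0, 1.5, 3.0, 1.0)

# Monte Carlo cross-check: average log(p(x) / q(x)) over draws from P
rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=1.5, size=200_000)
kl_mc = np.mean(stats.gamma.logpdf(x, 2.0, scale=1.5)
                - stats.gamma.logpdf(x, 3.0, scale=1.0))

print(kl_exact, kl_mc)
```

The divergence is zero when the two parameter pairs are identical and is not symmetric in its arguments, so the order of the two distributions matters.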
I am new to using the PyMC3 package and am just trying to implement an example from a course on measurement uncertainty that I’m taking. (Note this is an optional employee education course through work, not a graded class where I shouldn’t find answers online). The course uses R but I find python to be preferable.
The (simple) problem is posed as following:
Say you have an end gauge with an actual (unknown) length at room temperature, length, and a measured length m. The relationship between the two is:
length = m / (1 + alpha*dT)
where alpha is an expansion coefficient and dT is the deviation from room temperature and m is the measured quantity. The goal is to find the posterior distribution on length in order to determine its expected value and standard deviation (i.e. the measurement uncertainty)
The problem specifies prior distributions on alpha and dT (Gaussians with small standard deviation) and a loose prior on length (Gaussian with large standard deviation). The problem specifies that m was measured 25 times with an average of 50.000215 and standard deviation of 5.8e-6. We assume that the measurements of m are normally distributed with a mean of the true value of m.
One issue I had is that the likelihood doesn’t seem like it can be specified just based on these statistics in PyMC3, so I generated some dummy measurement data (I ended up doing 1000 measurements instead of 25). Again, the question is to get a posterior distribution on length (and in the process, although of less interest, updated posteriors on alpha and dT).
Here’s my code, which is not working and having convergence issues:
from IPython.core.pylabtools import figsize
import numpy as np
from matplotlib import pyplot as plt
import scipy.stats as stats
import pymc3 as pm
import theano.tensor as tt
basic_model = pm.Model()
xdata = np.random.normal(50.000215,5.8e-6*np.sqrt(1000),1000)
with basic_model:
    #prior distributions
    theta = pm.Normal('theta',mu=-.1,sd=.04)
    alpha = pm.Normal('alpha',mu=.0000115,sd=.0000012)
    length = pm.Normal('length',mu=50,sd=1)
    mumeas = length*(1+alpha*theta)
with basic_model:
    obs = pm.Normal('obs',mu=mumeas,sd=5.8e-6,observed=xdata)
    #yobs = Normal('yobs',)
    start = pm.find_MAP()
    #trace = pm.sample(2000, step=pm.Metropolis, start=start)
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=200000,step=step,start=start,njobs=1)
length_samples = trace['length']
fig,ax=plt.subplots()
plt.hist(length_samples, histtype='stepfilled', bins=30, alpha=0.85,
         label=r"posterior of $\lambda_1$", color="#A60628", density=True)
I would really appreciate any help as to why this isn’t working. I've been trying for a while and it never converges to the expected solution given from the R code. I tried the default sampler (NUTS I think) as well as Metropolis but that completely failed with a zero gradient error. The (relevant) course slides are attached as an image. Finally, here is the comparable R code:
library(rjags)
#Data
jags_data <- list(xbar=50.000215)
jags_code <- jags.model(file = "calibration.txt",
                        data = jags_data,
                        n.chains = 1,
                        n.adapt = 30000)
post_samples <- coda.samples(model = jags_code,
                             variable.names =
                               c("l","mu","alpha","theta"),#,"ypred"),
                             n.iter = 30000)
summary(post_samples)
mean(post_samples[[1]][,"l"])
sd(post_samples[[1]][,"l"])
plot(post_samples)
and the calibration.txt model:
model{
  l~dnorm(50,1.0)
  alpha~dnorm(0.0000115,694444444444)
  theta~dnorm(-0.1,625)
  mu<-l*(1+alpha*theta)
  xbar~dnorm(mu,29726516052)
}
(note I think the dnorm distribution takes 1/sigma^2, hence the weird-looking variances)
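The precision values in the JAGS model can be reproduced from the standard deviations stated in the problem with a one-line conversion, precision = 1 / sd^2. The helper name sd_to_precision below is made up for this sketch.

```python
def sd_to_precision(sd):
    """Convert a standard deviation to a precision (1 / variance)."""
    return 1.0 / sd ** 2

# Standard deviations from the problem statement and the PyMC3 code above
tau_length = sd_to_precision(1.0)      # prior on length
tau_alpha = sd_to_precision(1.2e-6)    # prior on alpha
tau_theta = sd_to_precision(0.04)      # prior on theta (dT)
tau_xbar = sd_to_precision(5.8e-6)     # measurement noise on xbar

print(tau_length, tau_alpha, tau_theta, tau_xbar)
```

These match the second arguments passed to dnorm in calibration.txt, up to rounding.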
Any help or insight as to why the PyMC3 sampling isn't converging and what I should do differently would be extremely appreciated. Thanks!
I also had trouble getting anything useful from the generated data and model in the code. It seems to me that the level of noise in the fake data could equally be explained by the different sources of variance in the model. That can lead to a situation of highly correlated posterior parameters. Add to that the extreme scale imbalances, then it makes sense this would have sampling issues.
However, looking at the JAGS model, it seems they really are using just that one input observation. I've never seen this technique(?) before, that is, inputting summary statistics of data instead of the raw data itself. I suppose it worked for them in JAGS, so I decided to try running the exact same MCMC, including using the precision (tau) parameterization of the Gaussian.
Original Model with Metropolis
import pymc3 as pm

with pm.Model() as m0:
    # tau === precision parameterization
    dT = pm.Normal('dT', mu=-0.1, tau=625)
    alpha = pm.Normal('alpha', mu=0.0000115, tau=694444444444)
    length = pm.Normal('length', mu=50.0, tau=1.0)
    mu = pm.Deterministic('mu', length*(1+alpha*dT))
    # only one input observation; tau indicates the 5.8 nm sd
    obs = pm.Normal('obs', mu=mu, tau=29726516052, observed=[50.000215])
    trace = pm.sample(30000, tune=30000, chains=4, cores=4, step=pm.Metropolis())
While it's still not that great at sampling length and dT, it at least appears convergent overall:
I think noteworthy here is that despite the relatively weak prior on length (sd=1), the strong priors on all the other parameters appear to propagate a tight uncertainty bound on the length posterior. Ultimately, this is the posterior of interest, so this seems to be consistent with the intent of the exercise. Also, see that mu comes out in the posterior as exactly the distribution described, namely, N(50.000215, 5.8e-6).
Trace Plots
Forest Plot
Pair Plot
Here, however, you can see the core problem is still there. There's both strong correlation between length and dT, plus 4 or 5 orders of magnitude scale difference between the standard errors. I'd definitely do a long run before I really trusted the result.
Alternative Model with NUTS
In order to get this running with NUTS, you'd have to address the scaling issue. That is, somehow we need to reparameterize to get all the tau values closer to 1. Then, you'd run the sampler and transform back into the units you're interested in. Unfortunately, I don't have time to play around with this right now (I'd have to figure it out too), but maybe it's something you can start exploring on your own.
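One sampler-free way to sanity-check whatever posterior for length the MCMC produces is a plain Monte Carlo propagation through length = mu / (1 + alpha*dT). This is only a sketch under two assumptions that are not in the original post: that the prior on length (sd = 1) is so loose compared with the 5.8e-6 measurement sd that it can be treated as flat, and that the single observation carries essentially no information about alpha and dT, so their posteriors stay at their priors. The sample count and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n = 200_000                     # arbitrary number of Monte Carlo draws

# mu ~ N(xbar, 5.8e-6): the distribution reported for mu in the posterior
mu = rng.normal(50.000215, 5.8e-6, n)
# alpha and dT drawn from their (tight) priors, assumed unchanged by the data
alpha = rng.normal(1.15e-5, 1.2e-6, n)
dT = rng.normal(-0.1, 0.04, n)

# Propagate through the measurement equation
length = mu / (1.0 + alpha * dT)

length_mean = length.mean()
length_sd = length.std(ddof=1)
print(length_mean, length_sd)
```

If the MCMC summary for length is far from these two numbers, that points at a sampling problem rather than at the model.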
Let's say I have a column x with uniformly distributed values.
To these values, I applied a cdf function.
Now I want to calculate the Gaussian copula, but I can't find the function in Python. I have already read that the Gaussian copula is something like the "inverse of the cdf function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian Copula process to normalize
an observation by applying 𝑛 = Phi^-1(𝐹(𝑥)). Calculating 𝐹(𝑥) yields a value 𝑢 ∈ [0, 1]
representing the proportion of shaded area at the left. Then Phi^−1(𝑢) yields a value 𝑛
by matching the shaded area in a Gaussian distribution.
I need your help. Does anyone have an idea how to calculate that?
I have 2 ideas so far:
1) gauss = 1/(sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2)
--> so transform all the values with this to a new value
2) norm.ppf(array,loc,scale)
--> So give the ppf function the mean and the std and the array and it will calculate me the inverse of the CDF... But I doubt #2
The thing is
norm.cdf(norm.ppf(0.95))
is not what I want. The reason I'm doing this is to transform a non-normal (non-Gaussian) distribution into a normal distribution.
Like here:
Transform from a non gaussian distribution to a gaussian distribution with Gaussian Copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found 2 links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
In these posts it is said that:
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply the norm.cdf(data, mean, std) function, I get a uniformly distributed cdf.
Compare:
import numpy as np
from scipy.stats import norm
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
cdf = norm.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do the following step?
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function /inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use, for example, the norm.ppf function, the values are not reasonable.
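The two-step recipe quoted above, n = Phi^-1(F(x)), can be sketched as follows. One point that may explain unreasonable values: F has to be the CDF of the data's own distribution, not the normal CDF. Since that CDF is often unknown, this sketch uses the empirical CDF from ranks, which is an assumption on my part and not something stated in the quoted posts. The function name to_normal_scores and the exponential test data are made up for illustration.

```python
import numpy as np
from scipy.stats import norm, rankdata

def to_normal_scores(x):
    """Map data to approximately standard-normal values via Phi^-1(F(x)).

    F is approximated by the empirical CDF: rank / (n + 1), which keeps
    u strictly inside (0, 1) so that norm.ppf never returns +/- infinity.
    """
    x = np.asarray(x, dtype=float)
    u = rankdata(x) / (len(x) + 1)   # step 1: u = F(x), roughly uniform
    return norm.ppf(u)               # step 2: n = Phi^-1(u)

# Clearly non-Gaussian example data (assumption: exponential, for illustration)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=5000)
z = to_normal_scores(x)

print(z.mean(), z.std())
```

If the parametric form of F is known, its CDF can replace the rank step, for example scipy.stats.expon.cdf for exponential data. With heavily tied data, such as the short list in the question, rankdata assigns average ranks and the result is only coarsely normal.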
I'm pretty new to PyMC and I'm trying desperately to infer the parameters of an underlying Gaussian distribution that best fits a distribution of observed data that I have, not with a pre-built normal distribution, but with a more general method that uses histograms of the simulated data to build pdfs. But so far I can't get my code to converge, and I don't know why...
So here's a summary of what my code does.
I have a dataset of 5000 points distributed normally (mean=5,sigma=2). I want to retrieve these values (mean, sigma) with a bayesian inference (using MCMC).
I have a data simulator that generates for each iteration of the MCMC process a normal distribution of 5000 points with a random mean and sigma (uniform prior)
From the simulated distribution of points I compute a numpy histogram normed to 1 representing the pdf of the distribution (Nbins=int(sqrt(5000))). I then compute the mean and standard deviation of this distribution.
What I want is the set of parameters that will allow me to build a simulated distribution that best fits the observed data.
I use the most general definition of the log likelihood, that is:
ln L(θ|x)=∑ln(f(xi|θ)) (the likelihood function being defined as the probability distribution of the observed data given the parameters θ)
Then I interpolate linearly the histogram values for every bin center. Therefore I have a continuous pdf for the simulated distribution. So here f is the interpolated function I made from the histogram of the simulation.
I sum the log(f(xi)) contributions for every (real) data point and return the loglikelihood value at the end.
But some (real) data points are so far off the mean of the simulated distribution that f(xi)=0. For these points the code raises a math domain error (Reminder: log(0)=-inf). So I artificially set the pdf to a small epsilon for the points where it's usually set to 0.
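Independently of PyMC, the likelihood described in the last few paragraphs (histogram of simulated data, linear interpolation between bin centers, epsilon floor before taking the log) can be sketched with NumPy alone. The function name hist_loglike, the value of eps, and the use of np.interp in place of the spline class are choices made for this sketch, not the code from the question.

```python
import numpy as np

def hist_loglike(data, sim, eps=1e-12):
    """Log-likelihood of `data` under a pdf estimated from simulated draws `sim`."""
    # Normalized histogram of the simulated sample
    hist, edges = np.histogram(sim, bins=int(np.sqrt(len(sim))), density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Linear interpolation between bin centers; zero outside the simulated range
    pdf = np.interp(data, centers, hist, left=0.0, right=0.0)
    # Floor the pdf at eps so that log() never sees zero
    return float(np.sum(np.log(np.maximum(pdf, eps))))

rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, 5000)                          # "observed" data
ll_good = hist_loglike(data, rng.normal(5.0, 2.0, 5000))   # simulate at true values
ll_bad = hist_loglike(data, rng.normal(4.0, 1.0, 5000))    # simulate at wrong values

print(ll_good, ll_bad)
```

One design caveat: the result depends on the random draws in sim, so the same (mean, sigma) gives a slightly different log-likelihood on each call, which can by itself make an MCMC chain harder to converge.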
But here's the thing. The loglikelihood is not computed for every iteration. And actually it is not computed at all, in the present architecture of my code. So that's why the MCMC process is not converging. But... I don't know why.
It turns out that building custom likelihood functions does not seem to be a very common approach in the PyMC community, where people usually prefer to use pre-built distributions. I'm having trouble finding help on these matters, so ideas and suggestions will be deeply appreciated :)
import numpy as np
import matplotlib.pyplot as plt
import math
import pymc as pm
from scipy.interpolate import InterpolatedUnivariateSpline
# Generate the data
np.random.seed(0)
N=5000
true_mean=5.
true_sigma = 2.
data = np.random.normal(true_mean,true_sigma,N)
#prior
m=pm.Uniform('m', lower=4, upper=6)
s=pm.Uniform('s', lower=1, upper=3)
#pm.deterministic
def data_simulator(mean_input=m,sig_input=s):
    out=np.empty(4,dtype=object)
    datasim = np.random.normal(mean_input,sig_input,N)
    hist, bin_edges = np.histogram(datasim, bins=int(math.sqrt(len(datasim))), density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:])/2
    m_sim=np.mean(datasim)
    s_sim=np.std(datasim)
    out[0]=m_sim
    out[1]=s_sim
    out[2]=bin_centers
    out[3]=hist
    return out
#pm.stochastic(observed=True)
def logp(value=data,mean_output=data_simulator.value[0],sigma_output=data_simulator.value[1],bin_centers_sim=data_simulator.value[2],hist_sim=data_simulator.value[3]):
    interp_sim=InterpolatedUnivariateSpline(bin_centers_sim,hist_sim,k=1,ext=0) #returns the extrapolated values
    logp=np.sum(np.log(interp_sim(value)))
    print('logp=',logp)
    return logp
model = pm.Model({"mean": m,"sigma":s,"data_simulator":data_simulator,"loglikelihood":loglikelihood})
#Run the MCMC sampler
mcmc = pm.MCMC(model)
mcmc.sample(iter=10000, burn=5000)
#Plot the marginals
pm.Matplot.plot(mcmc)