I have the mean and the SD of a log-normal distribution. However, in order to sample from a log-normal distribution in Python, I need to convert these values into the mean and SD of the underlying normal distribution.
import numpy as np
import pandas as pd
import seaborn as sns

mu = 25.2
sigma = 10.5
#pd.reset_option('display.float_format')

r = np.random.lognormal(mu, sigma, 1000)
# resample any value outside the desired [4, 64] range
for i in range(1000):
    while r[i] > 64 or r[i] < 4:
        r[i] = np.random.lognormal(mu, sigma, 1)[0]

df = pd.DataFrame(r, columns=['Column_A'])
print(df)
sns.set_style("whitegrid", {'axes.grid': False})
sns.set(rc={"figure.figsize": (8, 4)})
sns.distplot(df['Column_A'], bins=70)
This is what I get
And this is what I want
However, I don't know how to convert these values.
If I understand your post correctly, you want to recover the underlying (mu, sigma^2) parametrization of the normal distribution that produced your log-normal observations?
TL;DR
Assuming your log-normal observations are stored in r:
mu = np.log(np.median(r))
var = 2*(np.log(np.mean(r)) - np.log(np.median(r)))
sd = np.sqrt(var)
Theoretical part
Start by reading some statistics about the log-normal distribution (see the reference). It turns out it is not straightforward to retrieve (mu, sigma^2) from the empirical mean and variance of a log-normal sample...
Let X be a log-normal random variable and let Y = ln(X). Then Y follows a normal distribution with parameters (mu, sigma^2). Let M and S be the mean and variance of X. It turns out that:
M = exp(mu + sigma^2/2)
S = (exp(sigma^2) - 1) * exp(2*mu + sigma^2)
Which hardly leads to a simple expression for (mu, sigma^2).
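Just to convince ourselves these two formulas are right, a quick numerical sketch (the parameter values below are arbitrary, not from the post):
import numpy as np

mu, sigma = 1.0, 0.5                                           # assumed parameters for the check
r = np.random.lognormal(mu, sigma, size=1_000_000)

M_theory = np.exp(mu + sigma**2 / 2)                           # mean formula above
S_theory = (np.exp(sigma**2) - 1) * np.exp(2*mu + sigma**2)    # variance formula above

print(M_theory, r.mean())   # should be close
print(S_theory, r.var())    # should be close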
However, according to the same reference, inverting your (M, S) system becomes easier if you replace the variance S by either the median Med or the mode Mode, since they have much simpler expressions in terms of (mu, sigma^2):
Med = exp(mu)
Mode = exp(mu - sigma^2)
The empirical median is easy to compute with NumPy, so let's use it in our computations. Inverting the system then leads to the following estimators for (mu, sigma^2):
mu = log(Med)
sigma2 = 2*(log(M) - log(Med))
Pythonic part
Supposing your log-normal observations are stored in your r array:
mu = np.log(np.median(r))
var = 2*(np.log(np.mean(r)) - np.log(np.median(r)))
sd = np.sqrt(var)
And a quick check suggests it is right:
# random log-normal sample with (mu, sigma)=(1, 2)
r = np.random.lognormal(1, 2, size=(1000000))
# estimators
mu = np.log(np.median(r))
var = 2*(np.log(np.mean(r)) - np.log(np.median(r)))
sd = np.sqrt(var)
$> mu = 1.001368782773
$> sigma = 2.0024723139
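Once mu and sd have been estimated this way, they can be plugged straight back into the sampling call from the question; a minimal sketch continuing from the snippet above (the [4, 64] resampling range is taken from the original code):
samples = np.random.lognormal(mu, sd, 1000)

# optionally resample anything outside the question's [4, 64] range
mask = (samples < 4) | (samples > 64)
while mask.any():
    samples[mask] = np.random.lognormal(mu, sd, mask.sum())
    mask = (samples < 4) | (samples > 64)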
I want to calculate the 1, 2, and 3 sigma errors of a distribution using Python, as described in the Wikipedia page on the 68-95-99.7 rule. So far I have written the following code. Is this the correct way to compute such KPIs? Thanks.
import numpy as np
# sensor and reference value
temperature_measured = np.random.rand(1000) # value from a sensor under test
temperature_reference = np.random.rand(1000) # value from a best sensor from market
# error computation
error = temperature_reference - temperature_measured
error_sigma = np.std(error)
error_mean = np.mean(error)
# KPI computation
expected_sigma = 1 # 1 degree deviation is allowed (Customer requirement)
kpi_1_sigma = (abs(error - error_mean) < 1*expected_sigma).mean()*100.0 >= 68.27
kpi_2_sigma = (abs(error - error_mean) < 2*expected_sigma).mean()*100.0 >= 95.45
kpi_3_sigma = (abs(error - error_mean) < 3*expected_sigma).mean()*100.0 >= 99.73
I would recommend using the definition you found on Wikipedia and just calculating the percentiles, i.e. computing the difference:
((mu+sigma) - (mu-sigma)) / 2
sigma1 = (np.percentile(error, 50+34.1, axis=0)- np.percentile(error, 50-34.1, axis=0))/2.
sigma2 = (np.percentile(error, 50+34.1+13.6, axis=0)- np.percentile(error, 50-34.1-13.6, axis=0))/2.
sigma3 = (np.percentile(error, 50+34.1+13.6+2.1, axis=0)- np.percentile(error, 50-34.1-13.6-2.1, axis=0))/2.
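To sanity-check this percentile-based approach, a short sketch with a synthetic error array (the real sensor data from the question is not used here):
import numpy as np

error = np.random.normal(0, 1.5, size=100_000)   # synthetic errors with true sigma = 1.5

sigma1 = (np.percentile(error, 50 + 34.1) - np.percentile(error, 50 - 34.1)) / 2.
print(sigma1, error.std())   # both should be close to 1.5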
An easier way could be like so (taken from here):
NumPy's std yields the standard deviation, which is usually denoted
with "sigma". To get the 2-sigma or 3-sigma ranges, you can simply
multiply sigma with 2 or 3:
print([x.mean() - 3 * x.std(), x.mean() + 3 * x.std()])
result:
[-27.545797458510656, 52.315028227741429]
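Since x is not defined in that quoted snippet, a self-contained version with an assumed sample could look like:
import numpy as np

x = np.random.normal(loc=12, scale=13, size=10_000)   # assumed sample

for k in (1, 2, 3):
    print(k, 'sigma range:', [x.mean() - k * x.std(), x.mean() + k * x.std()])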
I have been working with the following link,
Fitting empirical distribution to theoretical ones with Scipy (Python)?
I applied the code from that link to my data and found that the best-fitting distribution for my data is the Non-Central Student's T distribution. I couldn't find that distribution in the pymc3 package, so I decided to look at scipy to understand how the distribution is constructed. I created a custom distribution and I have a few questions:
Is my approach to creating the distribution right?
How can I implement the custom distribution in my models?
Regarding the priors, do I use the same approach as for a normal distribution (priors on mu and sigma), combined with HalfNormal priors for the degrees of freedom and the noncentrality parameter?
My custom distribution:
import numpy as np
import theano.tensor as tt
from scipy import stats
from scipy.special import hyp1f1, nctdtr
import warnings
from pymc3.theanof import floatX
from pymc3.distributions.dist_math import bound, gammaln
from pymc3.distributions.continuous import assert_negative_support, get_tau_sigma
from pymc3.distributions.distribution import Continuous, draw_values, generate_samples
class NonCentralStudentT(Continuous):
"""
Parameters
----------
nu: float
Degrees of freedom, also known as normality parameter (nu > 0).
mu: float
Location parameter.
sigma: float
Scale parameter (sigma > 0). Converges to the standard deviation as nu increases. (only required if lam is not specified)
lam: float
Scale parameter (lam > 0). Converges to the precision as nu increases. (only required if sigma is not specified)
"""
def __init__(self, nu, nc, mu=0, lam=None, sigma=None, sd=None, *args, **kwargs):
super().__init__(*args, **kwargs)
if sd is not None:
sigma = sd
warnings.warn("sd is deprecated, use sigma instead", DeprecationWarning)
self.nu = nu = tt.as_tensor_variable(floatX(nu))
self.nc = nc = tt.as_tensor_variable(floatX(nc))
lam, sigma = get_tau_sigma(tau=lam, sigma=sigma)
self.lam = lam = tt.as_tensor_variable(lam)
self.sigma = self.sd = sigma = tt.as_tensor_variable(sigma)
self.mean = self.median = self.mode = self.mu = mu = tt.as_tensor_variable(mu)
self.variance = tt.switch((nu > 2) * 1, (1 / self.lam) * (nu / (nu - 2)), np.inf)
assert_negative_support(lam, 'lam (sigma)', 'NonCentralStudentT')
assert_negative_support(nu, 'nu', 'NonCentralStudentT')
assert_negative_support(nc, 'nc', 'NonCentralStudentT')
def random(self, point=None, size=None):
"""
Draw random values from Non-Central Student's T distribution.
Parameters
----------
point: dict, optional
Dict of variable values on which random values are to be
conditioned (uses default point if not specified).
size: int, optional
Desired size of random sample (returns one sample if not
specified).
Returns
-------
array
"""
nu, nc, mu, lam = draw_values([self.nu, self.nc, self.mu, self.lam], point=point, size=size)
return generate_samples(stats.nct.rvs, nu, nc, loc=mu, scale=lam ** -0.5, dist_shape=self.shape, size=size)
def logp(self, value):
"""
Calculate log-probability of Non-Central Student's T distribution at specified value.
Parameters
----------
value: numeric
Value(s) for which log-probability is calculated. If the log probabilities for multiple
values are desired the values must be provided in a numpy array or theano tensor
Returns
-------
TensorVariable
"""
nu = self.nu
nc = self.nc
mu = self.mu
lam = self.lam
n = nu * 1.0
nc = nc * 1.0
x2 = value * value
ncx2 = nc * nc * x2
fac1 = n + x2
trm1 = n / 2. * tt.log(n) + gammaln(n + 1)
trm1 -= n * tt.log(2) + nc * nc / 2. + (n / 2.) * tt.log(fac1) + gammaln(n / 2.)
Px = tt.exp(trm1)
valF = ncx2 / (2 * fac1)
trm1 = tt.sqrt(2) * nc * value * hyp1f1(n / 2 + 1, 1.5, valF)
trm1 /= np.asarray(fac1 * tt.gamma((n + 1) / 2))
trm2 = hyp1f1((n + 1) / 2, 0.5, valF)
trm2 /= np.asarray(np.sqrt(fac1) * tt.gamma(n / 2 + 1))
Px *= trm1 + trm2
return bound(Px, lam > 0, nu > 0, nc > 0)
def logcdf(self, value):
"""
Compute the log of the cumulative distribution function for Non-Central Student's T distribution
at the specified value.
Parameters
----------
value: numeric
Value(s) for which log CDF is calculated. If the log CDF for multiple
values are desired the values must be provided in a numpy array or theano tensor.
Returns
-------
TensorVariable
"""
nu = self.nu
nc = self.nc
return nctdtr(nu, nc, value)
My Custom model:
with pm.Model() as model:
# Prior Distributions for unknown model parameters:
mu = pm.Normal('mu', 0, 10)
sigma = pm.Normal('sigma', 0, 10)
nc= pm.HalfNormal('nc', sigma=10)
nu= pm.HalfNormal('nu', sigma=1)
# Observed data is from a Likelihood distributions (Likelihood (sampling distribution) of observations):
# => (input custom distribution in place of this Beta likelihood)
observed_data = pm.Beta('observed_data', alpha=alpha, beta=beta, observed=data)
# draw 5000 posterior samples
trace = pm.sample(draws=5000, tune=2000, chains=3, cores=1)
# Obtaining Posterior Predictive Sampling:
post_pred = pm.sample_posterior_predictive(trace, samples=3000)
print(post_pred['observed_data'].shape)
print('\nSummary: ')
print(pm.stats.summary(data=trace))
print(pm.stats.summary(data=post_pred))
Edit 1:
I redesigned the custom model to include the custom distribution. However, I keep getting errors from the equations used to compute the likelihood, or sometimes Theano locks up and the code just freezes. My code is below:
with pm.Model() as model:
# Prior Distributions for unknown model parameters:
mu = pm.Normal('mu', mu=0, sigma=1)
sd = pm.HalfNormal('sd', sigma=1)
nc = pm.HalfNormal('nc', sigma=10)
nu = pm.HalfNormal('nu', sigma=1)
# Custom distribution:
# observed_data = pm.DensityDist('observed_data', NonCentralStudentT, observed=data_list)
# Observed data is from a Likelihood distributions (Likelihood (sampling distribution) of observations):
observed_data = NonCentralStudentT('observed_data', mu=mu, sd=sd, nc=nc, nu=nu, observed=data_list)
# draw 5000 posterior samples
trace_S = pm.sample(draws=5000, tune=2000, chains=3, cores=1)
# Obtaining Posterior Predictive Sampling:
post_pred_S = pm.sample_posterior_predictive(trace_S, samples=3000)
print(post_pred_S['observed_data'].shape)
print('\nSummary: ')
print(pm.stats.summary(data=trace_S))
print(pm.stats.summary(data=post_pred_S))
Edit 2:
I have been looking online for a way to convert the hyp1f1 function to Theano; the only definition I found is in the following GitHub link: hyp1f1 function GitHub.
Will this be enough to convert the function to Theano?
In addition, is it okay to use NumPy arrays with Theano?
Also, I thought of another way, but I am not sure whether it can be implemented. I looked at the nct distribution in scipy, and its documentation says the following:
If Y is a standard normal random variable and V is an independent chi-square random variable (chi2) with k degrees of freedom, then

X = (Y + c) / sqrt(V / k)

has a non-central Student's t distribution on the real line. The degrees of freedom parameter k (denoted df in the implementation) satisfies k > 0 and the noncentrality parameter c (denoted nc in the implementation) is a real number.

The probability density above is defined in the "standardized" form. To shift and/or scale the distribution use the loc and scale parameters. Specifically, nct.pdf(x, df, nc, loc, scale) is identically equivalent to nct.pdf(y, df, nc) / scale with y = (x - loc) / scale.
So I thought of using a normal and a chi2 random variable as priors and plugging them, together with the degrees-of-freedom variable mentioned earlier in the code, into the equation quoted from SciPy. Would that be enough to get the distribution?
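As a sanity check of that construction (just a NumPy/SciPy sketch with assumed parameter values, not a PyMC3 implementation), samples built from the quoted formula can be compared against scipy.stats.nct:
import numpy as np
from scipy import stats

k, c = 5.0, 2.0                      # assumed degrees of freedom and noncentrality
n = 1_000_000

# X = (Y + c) / sqrt(V / k), with Y standard normal and V chi-square(k)
Y = np.random.standard_normal(n)
V = np.random.chisquare(k, size=n)
X = (Y + c) / np.sqrt(V / k)

# Reference sample from scipy's non-central t
X_ref = stats.nct.rvs(k, c, size=n)
print(X.mean(), X_ref.mean())        # should be close
print(X.std(), X_ref.std())          # should be close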
Edit 3:
I managed to run the code from the link about fitting an empirical distribution and found that the second-best fit was the Student's t distribution, so I will use that. Thank you for your help. I just have a side question: I ran my model with the Student's t distribution, but I got these warnings:
There were 52 divergences after tuning. Increase target_accept or
reparameterize. The acceptance probability does not match the target.
It is 0.7037574708196309, but should be close to 0.8. Try to increase
the number of tuning steps. The number of effective samples is smaller
than 10% for some parameters.
I am just confused about these warnings. Do you have any idea what they mean? I know this won't break my code, but can I reduce the divergences? And regarding the effective samples, do I need to increase the number of samples drawn in the trace?
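For reference, the two knobs those warnings mention (tune and target_accept) are passed to pm.sample; a minimal sketch with a plain Student's t likelihood and placeholder data, assuming a recent PyMC3 version where target_accept is accepted directly by pm.sample:
import numpy as np
import pymc3 as pm

data = np.random.standard_t(5, size=200)   # placeholder data

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sigma=1)
    sd = pm.HalfNormal('sd', sigma=1)
    nu = pm.HalfNormal('nu', sigma=1)
    obs = pm.StudentT('obs', nu=nu, mu=mu, sigma=sd, observed=data)

    trace = pm.sample(
        draws=5000,
        tune=4000,          # more tuning steps, as the warning suggests
        chains=3,
        cores=1,
        target_accept=0.9,  # higher target acceptance to reduce divergences
    )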
I implemented the KS test to check which distributions fit my data best. At the moment I pass the CDFs as input, because the standard KS test computes the maximum difference between the CDFs. I just wanted to know whether this is the right way to do it, or should I use the PDFs as input? The statistic values and p-values look good to me. With the critical value of the KS test I can choose which hypotheses I should not reject.
Code example
gammafit = stats.gamma.fit(h4)
pdf_gamma = stats.gamma.pdf(lnspc, *gammafit)
cdf_gamma = stats.gamma.cdf(lnspc, *gammafit)
plt.plot(lnspc, pdf_gamma, label="Gamma")
gamma_kstest999 = stats.ks_2samp(np.cumsum(n4), cdf_gamma)
You should not pass the CDFs as input: ks_2samp takes the two samples themselves and builds the empirical CDFs inside the code. According to the function's source code:
data1 = np.sort(data1)
data2 = np.sort(data2)
n1 = data1.shape[0]
n2 = data2.shape[0]
data_all = np.concatenate([data1, data2])
cdf1 = np.searchsorted(data1, data_all, side='right') / (1.0*n1)
cdf2 = np.searchsorted(data2, data_all, side='right') / (1.0*n2)
d = np.max(np.absolute(cdf1 - cdf2))
# Note: d absolute not signed distance
en = np.sqrt(n1 * n2 / float(n1 + n2))
try:
prob = distributions.kstwobign.sf((en + 0.12 + 0.11 / en) * d)
except:
prob = 1.0
return Ks_2sampResult(d, prob)
The cdf1 and cdf2 variables represent the produced cumulative distributions.
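Following that, a sketch (using stand-in data, since h4 from the question is not available here) of two common ways to run the test on samples rather than precomputed CDFs:
import numpy as np
from scipy import stats

h4 = np.random.gamma(2.0, 2.0, size=1000)   # stand-in for the observed data
gammafit = stats.gamma.fit(h4)

# One-sample KS test of the data against the fitted gamma distribution
print(stats.kstest(h4, 'gamma', args=gammafit))

# Or a two-sample test against a sample drawn from the fitted gamma
gamma_sample = stats.gamma.rvs(*gammafit, size=len(h4))
print(stats.ks_2samp(h4, gamma_sample))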
I am very confused about how to sample measurement error using a normal distribution (Gaussian pdf) in Python.
What I want to do is simply generate Gaussian noise (error) and add it to the measured values. In short, the problem is as follows:
Inputs:
M(i) - measurement value; i = 1...n, n - number of measurements;
Output:
M_noisy(i) = M(i) + noise(i);
where, noise(i) - noise in measurement; M(i) - measurement value.
Important: the noise should be zero-mean Gaussian noise with variance equal to 10% of the measurement value.
I put the following code but I could not continue...
My code:
import numpy as np
# sigma - standard deviation of M
# mu - mean value of M
# n - number of measurements
# I don't know if this is correct or not:
noise = sigma * np.random.randn(n) + mu
## M_noisy(i) - ?
Thanks for any answers/suggestions in advance.
random_scales = np.random.randn(n)
# standard-normal draws (mean 0, std 1)
offset_from_mean = sigma * random_scales  # scaled to the desired std
noise = offset_from_mean + mu
clean_y_data = np.arange(n)
noisy_y_data = clean_y_data + noise
might be what you are after?
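To address the specific requirement from the question (zero-mean noise whose variance is 10% of each measurement value), a minimal per-measurement sketch, assuming M is a NumPy array of measured values (the example values are made up):
import numpy as np

M = np.array([20.0, 25.0, 30.0, 22.5])           # assumed measurement values

# zero-mean Gaussian noise, variance = 10% of each measurement value
noise = np.random.normal(loc=0.0, scale=np.sqrt(0.1 * M))
M_noisy = M + noise
print(M_noisy)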