I need to generate pseudo-random numbers from a lognormal distribution in Python. The problem is that I am starting from the mode and standard deviation of the lognormal distribution. I don't have the mean or median of the lognormal distribution, nor any of the parameters of the underlying normal distribution.
numpy.random.lognormal takes the mean and standard deviation of the underlying normal distribution. I tried to calculate these from the parameters I have, but wound up with a quartic function. It has a solution, but I hope that there is a more straightforward way to do this.
scipy.stats.lognorm takes parameters that I don't understand. I am not a native English speaker and the documentation doesn't make sense.
Can you help me, please?
You have the mode and the standard deviation of the log-normal distribution. To use the rvs() method of scipy's lognorm, you have to parameterize the distribution in terms of the shape parameter s, which is the standard deviation sigma of the underlying normal distribution, and the scale, which is exp(mu), where mu is the mean of the underlying distribution.
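To make that mapping concrete, here is a minimal sketch (the values of mu and s are hypothetical, chosen only for illustration): a sample drawn with lognorm.rvs(s, loc=0, scale=np.exp(mu)) follows the same distribution as one drawn with numpy.random.lognormal(mu, s).

import numpy as np
from scipy.stats import lognorm

mu, s = 1.5, 0.75   # hypothetical parameters of the underlying normal
scale = np.exp(mu)  # scipy's scale parameter is exp(mu)

# These two samples are drawn from the same log-normal distribution:
sample_scipy = lognorm.rvs(s, loc=0, scale=scale, size=100000)
sample_numpy = np.random.lognormal(mu, s, size=100000)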
You pointed out that making this reparameterization requires solving a quartic polynomial. For that, we can use the numpy.poly1d class. Instances of that class have a roots attribute.
A little algebra shows that exp(sigma**2) is the unique positive real root of the polynomial
x**4 - x**3 - (stddev/mode)**2 = 0
where stddev and mode are the given standard deviation and mode of the log-normal distribution, and for that solution, the scale (i.e. exp(mu)) is
scale = mode*x
Here's a function that converts the mode and standard deviation to the shape and scale:
import numpy as np

def lognorm_params(mode, stddev):
    """
    Given the mode and std. dev. of the log-normal distribution, this function
    returns the shape and scale parameters for scipy's parameterization of the
    distribution.
    """
    # Coefficients of x**4 - x**3 - (stddev/mode)**2
    p = np.poly1d([1, -1, 0, 0, -(stddev/mode)**2])
    r = p.roots
    # Keep the unique positive real root; this is x = exp(sigma**2)
    sol = r[(r.imag == 0) & (r.real > 0)].real
    shape = np.sqrt(np.log(sol))
    scale = mode * sol
    return shape, scale
For example,
In [155]: mode = 123
In [156]: stddev = 99
In [157]: sigma, scale = lognorm_params(mode, stddev)
Generate a sample using the computed parameters:
In [158]: from scipy.stats import lognorm
In [159]: sample = lognorm.rvs(sigma, 0, scale, size=1000000)
Here's the standard deviation of the sample:
In [160]: np.std(sample)
Out[160]: 99.12048952171304
And here's some matplotlib code to plot a histogram of the sample, with a vertical line drawn at the mode of the distribution from which the sample was drawn:
In [176]: tmp = plt.hist(sample, density=True, bins=1000, alpha=0.6, color='c', ec='c')
In [177]: plt.xlim(0, 600)
Out[177]: (0, 600)
In [178]: plt.axvline(mode)
Out[178]: <matplotlib.lines.Line2D at 0x12c5a12e8>
The histogram (figure omitted) peaks at the vertical line drawn at the mode.
If you want to generate the sample using numpy.random.lognormal() instead of scipy.stats.lognorm.rvs(), you can do this:
In [200]: sigma, scale = lognorm_params(mode, stddev)
In [201]: mu = np.log(scale)
In [202]: sample = np.random.lognormal(mu, sigma, size=1000000)
In [203]: np.std(sample)
Out[203]: 99.078297384090902
I haven't looked into how robust poly1d's roots algorithm is, so be sure to test for a wide range of possible input values. Alternatively, you can use a solver from scipy to solve the above polynomial for x. You can bound the solution using:
max(sqrt(stddev/mode), 1) <= x <= sqrt(stddev/mode) + 1
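For instance, here's a minimal sketch of that bracketed-solver alternative using scipy.optimize.brentq, which is guaranteed to converge when given a sign-changing bracket (the bounds above provide one):

import numpy as np
from scipy.optimize import brentq

def lognorm_params_brentq(mode, stddev):
    # Solve x**4 - x**3 - (stddev/mode)**2 = 0 on the bracket above.
    a = (stddev/mode)**2
    lo = max(np.sqrt(stddev/mode), 1.0)
    hi = np.sqrt(stddev/mode) + 1.0
    x = brentq(lambda t: t**4 - t**3 - a, lo, hi)
    shape = np.sqrt(np.log(x))
    scale = mode * x
    return shape, scale

For the example values above, lognorm_params_brentq(123, 99) agrees with lognorm_params(123, 99) to numerical precision, since both solve the same quartic.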
The log-normal distribution is (confusingly) the result of applying the exponential function to a normal distribution. Wikipedia gives the relationship between the parameters as

m = exp(mu + sigma**2/2)
v = (exp(sigma**2) - 1) * exp(2*mu + sigma**2)

where mu and sigma are the mean and standard deviation of what you call the "underlying normal distribution", and m and v are the mean and variance of the log-normal distribution.

Now, what you say you have is the mode and standard deviation of the log-normal distribution. The variance v is just the square of the standard deviation. Getting from the mode to m is trickier: again quoting that Wikipedia article, if the mean is exp(mu + sigma**2/2) then the mode is exp(mu - sigma**2). From this, and the above, we can deduce that

n = m * (1 + v/m**2)**(-3/2)

where n is the mode of the log-normal distribution and v, m are as above. This reduces to a quartic,

m**8 = n**2 * (v + m**2)**3

or

u**4 = n**2 * (v + u)**3

where u = m**2. I suspect this is the same quartic you mentioned in your question. It can be solved, but like most quartics, the radical form of the solutions is a giant hairball. The most practical approach for your purposes might be to plug numeric values for n and v into the above and then use a numeric solver to find the positive root(s).
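As a sketch of that numeric route (the values of n and v are hypothetical, matching the mode and standard deviation used elsewhere on this page):

import numpy as np

n, v = 123.0, 99.0**2  # hypothetical mode and variance of the log-normal
c = n**2
# Expand u**4 - c*(v + u)**3 into coefficient form:
# u**4 - c*u**3 - 3*c*v*u**2 - 3*c*v**2*u - c*v**3
coeffs = [1.0, -c, -3*c*v, -3*c*v**2, -c*v**3]
roots = np.roots(coeffs)
u = roots[np.isclose(roots.imag, 0) & (roots.real > 0)].real[0]
m = np.sqrt(u)  # the mean of the log-normal distribution

By Descartes' rule of signs the quartic has exactly one positive real root, so the filter above picks out a single value.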
Sorry I can't be more help. This is really a math question, not a programming question; you might get more helpful answers on https://math.stackexchange.com/.
Adding to @WarrenWeckesser's excellent answer, here's a function that provides the exact return values to reparametrize a log-normal distribution in terms of the mode and the SD:
import numpy as np

def lognorm_params(mode, stddev):
    # a = (stddev/mode)**2; x below is the closed-form positive root
    # x = exp(sigma**2) of the quartic x**4 - x**3 - a = 0.
    a = stddev**2 / mode**2
    x = 1/4*np.sqrt(-(16*(2/3)**(1/3)*a)/(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3) +
                    2*(2/3)**(2/3)*(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3)+1) + \
        1/2*np.sqrt((4*(2/3)**(1/3)*a)/(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3) -
                    (np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3)/(2**(1/3)*3**(2/3)) +
                    1/(2*np.sqrt(-(16*(2/3)**(1/3)*a)/(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3) +
                    2*(2/3)**(2/3)*(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3)+1))+1/2) + \
        1/4
    shape = np.sqrt(np.log(x))
    scale = mode * x
    return shape, scale
Essentially, I just computed the exact solution of the quartic. The advantages are that the solution is a) exact, b) faster and c) vectorizable. As in the case of the answer by @WarrenWeckesser, this function returns, for a given mode and SD, the shape and scale parameters as used by scipy's scipy.stats.lognorm.
I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
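One possible sketch along those lines, assuming the xs and ys arrays extracted above (the integration bounds are hypothetical): clip the curve to the interval, pin the endpoint densities with np.interp, and integrate with the trapezoid rule.

import numpy as np

lower, upper = -1.0, 1.0  # hypothetical bounds
mask = (xs >= lower) & (xs <= upper)
# Pin the exact endpoint densities with np.interp, then integrate.
x_band = np.concatenate(([lower], xs[mask], [upper]))
y_band = np.concatenate(([np.interp(lower, xs, ys)], ys[mask],
                         [np.interp(upper, xs, ys)]))
p = np.trapz(y_band, x_band)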
More edits:
Using CDFs per @7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season:
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_'
    + str(YEAR) + '.csv.gz?raw=True',
    compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to

- make a more "normal" distribution out of sampling means in order to incorporate CDFs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than of individual samples; is that discouraged?), or
- if the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?

A rough sketch of the bootstrap idea follows below.
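A minimal bootstrap sketch, assuming the df Series of epa values built above (the resample count and seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
vals = df.dropna().to_numpy()  # drop missing epa values
# Resample with replacement and collect the mean of each resample.
boot_means = [rng.choice(vals, size=len(vals), replace=True).mean()
              for _ in range(5000)]
# A ~68% percentile interval for the mean of epa:
lo, hi = np.percentile(boot_means, [16, 84])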
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools, since there are several statistical approaches to choose from.
Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the graph of the probability density function (PDF) f over the interval [lower, upper].
However, when the CDF/PDF is unknown, it becomes a statistical question. In a nutshell: estimate the PDF f and compute the area its graph encloses over the interval. There are several paradigms and estimation procedures for doing so.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or for convenience. Then one just needs to estimate its parameters mu (the location, loc) and sigma (the scale, i.e. the standard deviation). scipy.stats provides all we need in this setting; it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit parameters
loc_hat, scale_hat = stats.norm.fit(x)

# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)

# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s='p=' + str(round(p, 3)))
plt.show()
which yields the fitted normal PDF with the probability mass over [lower, upper] shaded in red (figure omitted).
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it through estimated parameters as seen above). Kernel density estimation is the most popular variant. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad

x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))

def f_pred(x):
    '''wrapper function to compute probability'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]

p = quad(func=f_pred, a=lower, b=upper)

# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s='p=' + str(round(p[0], 3)))
plt.show()
and yields the analogous plot for the kernel density estimate (figure omitted).
I do see a bug in the get_probability function, but it's one that causes the result to be too high: in np.sum(kd_vals * step), N sample values are multiplied by a step whose denominator is N - 1, so the output is a factor of N/(N-1) too high. (If they wanted a trapezoid-rule computation of the integral, they should have divided the left and right endpoint values by 2 first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
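A quick sanity check of that explanation (a sketch, using the convolution identity for Gaussian kernels): smoothing N(0, 1) data with a Gaussian kernel of bandwidth h gives a density close to N(0, 1 + h**2), so the model's mass inside [-1, 1] should be roughly:

import numpy as np
from scipy.stats import norm

h = 0.5                   # the bandwidth used in the question
s = np.sqrt(1.0 + h**2)   # effective std of the smoothed density
p = norm.cdf(1.0/s) - norm.cdf(-1.0/s)
print(p)  # ~0.63, matching the observed 0.6338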
I have been given the task of translating the simulations inside the Excel plug-in @RISK to Python. The functionalities closely line up with numpy's random number generation given a distribution type and mu, sigma, or high and low values. An example of what I am doing is here.
In the linked example, mu=2 and sigma=1. Using numpy I get the same distribution as @RISK.
dist = np.random.lognormal(2, 1, 1000)
However, when I use numpy with the following parameters, I can no longer replicate the @RISK distributions.
mu=0.4, sigma=0.16 in @RISK:
(histogram of 1000 samples omitted)
and in Python:
(histogram of 1000 samples omitted)
The result is two completely different distributions for the same mu and sigma. So I am very confused now about what numpy expects for the mu and sigma inputs. I've read through the docs linked here, but why would one set of parameters give me matching distributions while another set of values does not?
What am I missing here?
Take another look at the @RISK documentation that you linked to and the docstring for numpy.random.lognormal. The @RISK function whose parameters match those of numpy.random.lognormal is RiskLognorm2. The parameters for numpy.random.lognormal and RiskLognorm2 are the mean and standard deviation of the underlying normal distribution. In other words, they describe the distribution of the logarithm of the data.
The @RISK documentation explains that the parameters for RiskLognorm are the mean and standard deviation of the log-normal distribution itself. It gives the formulas for translating between the two methods of parametrizing the distribution.
If you are sure that the parameters in the @RISK code are correct, then you will have to translate those parameters to the form used by numpy.random.lognormal. Given the values mean and stddev as the parameters used by RiskLognorm, you can get the parameters mu and sigma of numpy.random.lognormal as follows:
sigma_squared = np.log((stddev/mean)**2 + 1)
mu = np.log(mean) - 0.5*sigma_squared
sigma = np.sqrt(sigma_squared)
For example, suppose the mean and std. dev. of the distribution are
In [31]: mean = 0.40
In [32]: stddev = 0.16
Compute mu and sigma:
In [33]: sigma_squared = np.log((stddev/mean)**2 + 1)
In [34]: mu = np.log(mean) - 0.5*sigma_squared
In [35]: sigma = np.sqrt(sigma_squared)
Generate a sample using numpy.random.lognormal, and check its statistics:
In [36]: sample = np.random.lognormal(mu, sigma, size=1000)
In [37]: np.mean(sample)
Out[37]: 0.3936244646485409
In [38]: np.std(sample)
Out[38]: 0.16280712706987954
In [39]: np.min(sample), np.max(sample)
Out[39]: (0.1311593293919604, 1.7218021130668375)
I was looking at the numpy documentation, and I can see that you can use the command np.random.standard_cauchy(), specifying an array shape, to sample from a standard Cauchy.
I need to sample from a Cauchy which might have x_0 != 0 and gamma != 1, i.e. might not be located at the origin, nor have scale equal to 1.
How can I do this?
If you have scipy, you can use scipy.stats.cauchy, which takes a location (x0) and a scale (gamma) parameter. It exposes the rvs method to draw random samples:
from scipy import stats

x = stats.cauchy.rvs(loc=100, scale=2.5, size=1000)  # draw 1000 samples
You can also avoid the dependency on SciPy, since the Cauchy distribution is part of the location-scale family: if you draw a sample x from Cauchy(0, 1), multiply it by gamma, and shift it by x_0, then x' = x_0 + gamma * x is distributed according to Cauchy(x_0, gamma).
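A minimal numpy-only sketch of that trick (the x_0 and gamma values are just the example numbers from above):

import numpy as np

x0, gamma = 100.0, 2.5  # example location and scale
# Transform standard Cauchy draws into Cauchy(x0, gamma) draws:
x = x0 + gamma * np.random.standard_cauchy(size=1000)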
From the docs
The probability mass function for zipf is:
zipf.pmf(k, a) = 1/(zeta(a) * k**a)
for k >= 1.
zipf takes a as shape parameter.
The probability mass function above is defined in the “standardized” form. To shift distribution use the loc parameter. Specifically, zipf.pmf(k, a, loc) is identically equivalent to zipf.pmf(k - loc, a).
But what do a and k refer to? What does "shape parameter" mean?
Additionally, in scipy.stats.zipf.interval, there's an alpha parameter.
The description of the .interval() method is simply:
Endpoints of the range that contains alpha percent of the distribution
What does the alpha parameter mean? Is that the "confidence interval"?
What does "shape parameter" mean?
As the name suggests, a shape parameter determines the shape of a distribution. This is probably easiest to explain when starting with what a shape parameter is not:
A location parameter shifts the distribution but leaves it otherwise unchanged. For example, the mean of a normal distribution is a location parameter. If X is normally distributed with mean mu, then X + a is normally distributed with mean mu + a.
A scale parameter makes the distribution wider or narrower. For example, the standard deviation of a normal distribution is a scale parameter. If X is normally distributed with standard deviation sigma, then X * a is normally distributed with standard deviation sigma * a.
Finally, a shape parameter changes the shape of the distribution. For example, the Gamma distribution has a shape parameter k that determines how skewed the distribution is (= how much it "leans" to one side).
But what do a and k refer to?
k is the variable parameterized by the distribution. With zipf.pmf you can compute the probability of any k, given the shape parameter a. Below is a plot (code at the end of this answer) that demonstrates how a changes the shape of the distribution, i.e. the individual probabilities of the different values of k.
A high a makes large values of k very unlikely, while a low a makes small k less likely and larger k possible.
What does the alpha parameter mean? Is that the "confidence interval"?
It is wrong to say that alpha is the confidence interval; it is the confidence level, which I guess is what you meant. For example, alpha=0.95 means that you have a 95% confidence interval: if you generate random ks from the particular distribution, 95% of them will be in the range returned by zipf.interval.
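A quick empirical check of that interpretation (a sketch; the value of a is arbitrary):

import numpy as np
from scipy.stats import zipf

a = 2.0
lo, hi = zipf.interval(0.95, a)
sample = zipf.rvs(a, size=100000)
# Coverage should be at least about 0.95 (discreteness can push it higher).
print(np.mean((sample >= lo) & (sample <= hi)))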
Code for the plot:
from scipy.stats import zipf
import matplotlib.pyplot as plt
import numpy as np

k = np.arange(1, 11)  # zipf.pmf is nonzero only for integer k >= 1
for a in [1.3, 2.6]:
    p = zipf.pmf(k, a=a)
    plt.plot(k, p, label='a={}'.format(a), linewidth=2)
plt.xlabel('k')
plt.ylabel('probability')
plt.legend()
plt.show()
I have a 1-dimensional array of data:
a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])
for which I want to obtain the 68% confidence interval (ie: the 1 sigma).
The first comment in this answer states that this can be achieved using scipy.stats.norm.interval, via:
from scipy import stats
import numpy as np
mean, sigma = np.mean(a), np.std(a)
conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma)
But a comment in this post states that the actual correct way of obtaining the confidence interval is:
conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
that is, sigma is divided by the square-root of the sample size: np.sqrt(len(a)).
The question is: which version is the correct one?
The 68% confidence interval for a single draw from a normal distribution with
mean mu and std deviation sigma is
stats.norm.interval(0.68, loc=mu, scale=sigma)
The 68% confidence interval for the mean of N draws from a normal distribution
with mean mu and std deviation sigma is
stats.norm.interval(0.68, loc=mu, scale=sigma/sqrt(N))
Intuitively, these formulas make sense: if you hold up a jar of jelly beans and ask a large number of people to guess the number of jelly beans, each individual may be off by a lot (the same std deviation sigma), but the average of the guesses will do a remarkably fine job of estimating the actual number. This is reflected by the standard deviation of the mean shrinking by a factor of 1/sqrt(N).
If a single draw has variance sigma**2, then by the Bienaymé formula, the sum of N uncorrelated draws has variance N*sigma**2.
The mean is equal to the sum divided by N. When you multiply a random variable (like the sum) by a constant, the variance is multiplied by the constant squared. That is
Var(cX) = c**2 * Var(X)
So the variance of the mean equals
(variance of the sum)/N**2 = N * sigma**2 / N**2 = sigma**2 / N
and so the standard deviation of the mean (which is the square root of the variance) equals
sigma/sqrt(N).
This is the origin of the sqrt(N) in the denominator.
Here is some example code, based on Tom's code, which demonstrates the claims made above:
import numpy as np
from scipy import stats
N = 10000
a = np.random.normal(0, 1, N)
mean, sigma = a.mean(), a.std(ddof=1)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)
print('{:0.2%} of the single draws are in conf_int_a'
      .format(((a >= conf_int_a[0]) & (a < conf_int_a[1])).sum() / float(N)))
M = 1000
b = np.random.normal(0, 1, (N, M)).mean(axis=1)
conf_int_b = stats.norm.interval(0.68, loc=0, scale=1 / np.sqrt(M))
print('{:0.2%} of the means are in conf_int_b'
      .format(((b >= conf_int_b[0]) & (b < conf_int_b[1])).sum() / float(N)))
prints
68.03% of the single draws are in conf_int_a
67.78% of the means are in conf_int_b
Beware that if you define conf_int_b with the estimates for mean and sigma
based on the sample a, the mean may not fall in conf_int_b with the desired
frequency.
If you take a sample from a distribution and compute the sample mean and std deviation,

mean, sigma = a.mean(), a.std()

be careful to note that there is no guarantee that these will equal the population mean and standard deviation, and that we are assuming the population is normally distributed; those are not automatic givens!
If you take a sample and want to estimate the population mean and standard
deviation, you should use
mean, sigma = a.mean(), a.std(ddof=1)
since ddof=1 applies Bessel's correction, making this the square root of the unbiased estimator of the population variance.
I just checked how R and GraphPad calculate confidence intervals, and they widen the interval for small sample sizes n: e.g., more than 6-fold for n=2 compared to a large n. This code (based on shasan's answer) matches their confidence intervals:
import numpy as np, scipy.stats as st
# returns confidence interval of mean
def confIntMean(a, conf=0.95):
    mean, sem, m = np.mean(a), st.sem(a), st.t.ppf((1+conf)/2., len(a)-1)
    return mean - m*sem, mean + m*sem
For R, I checked against t.test(a). GraphPad's confidence interval of a mean page has "user level" info on the sample size dependency.
Here's the output for Gabriel's example:
In [2]: a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])
In [3]: confIntMean(a, 0.68)
Out[3]: (3.9974214366806184, 4.877578563319382)
In [4]: st.norm.interval(0.68, loc=np.mean(a), scale=st.sem(a))
Out[4]: (4.0120010966037407, 4.8629989033962593)
Note that the difference between the confIntMean() and st.norm.interval() intervals is relatively small here; len(a) == 16 is not too small.
I tested out your methods using an array with a known confidence interval. numpy.random.normal(mu, std, size) returns an array centered on mu with a standard deviation of std (the docs define std as the "standard deviation (spread or 'width') of the distribution").
from scipy import stats
import numpy as np
from numpy import random
a = random.normal(0,1,10000)
mean, sigma = np.mean(a), np.std(a)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)
conf_int_b = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
conf_int_a
(-1.0011149125527312, 1.0059797764202412)
conf_int_b
(-0.0076030415111100983, 0.012467905378619625)
As the 1σ interval should run from about -1 to 1 here, the / np.sqrt(len(a)) method appears to be incorrect for this purpose.
Edit
Since I don't have the reputation to comment above I'll clarify how this answer ties into unutbu's thorough answer. If you populate a random array with a normal distribution, 68% of the total will fall within 1-σ of the mean. In the case above, if you check that you see
b = a[np.where((a > -1) & (a < 1))]
len(b)
> 6781
or 68% of the population falls within 1σ. Well, about 68%: as you use a larger and larger array, you will approach 68% (in a trial of 10, 9 were between -1 and 1). That's because the 1σ spread is inherent to the distribution of the data, and the more data you have, the better you can resolve it.
Basically, my interpretation of your question was "If I have a sample of data I want to use to describe the distribution they were drawn from, what is the method to find the standard deviation of that data?", whereas unutbu's interpretation appears to be more "What is the interval in which I can place the mean with 68% confidence?". Which would mean, for the jelly beans, I answered "How are they guessing?" and unutbu answered "What do their guesses tell us about the jelly beans?".