I have a 1-dimensional array of data:
a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])
for which I want to obtain the 68% confidence interval (i.e., the 1-sigma interval).
The first comment in this answer states that this can be achieved using scipy.stats.norm.interval, via:
from scipy import stats
import numpy as np
mean, sigma = np.mean(a), np.std(a)
conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma)
But a comment in this post states that the actual correct way of obtaining the confidence interval is:
conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
that is, sigma is divided by the square-root of the sample size: np.sqrt(len(a)).
The question is: which version is the correct one?
The 68% confidence interval for a single draw from a normal distribution with
mean mu and std deviation sigma is
stats.norm.interval(0.68, loc=mu, scale=sigma)
The 68% confidence interval for the mean of N draws from a normal distribution
with mean mu and std deviation sigma is
stats.norm.interval(0.68, loc=mu, scale=sigma/sqrt(N))
Intuitively, these formulas make sense, since if you hold up a jar of jelly beans and ask a large number of people to guess the number of jelly beans, each individual may be off by a lot -- the same std deviation sigma -- but the average of the guesses will do a remarkably fine job of estimating the actual number and this is reflected by the standard deviation of the mean shrinking by a factor of 1/sqrt(N).
If a single draw has variance sigma**2, then by the Bienaymé formula, the sum of N uncorrelated draws has variance N*sigma**2.
The mean is equal to the sum divided by N. When you multiply a random variable (like the sum) by a constant, the variance is multiplied by the constant squared. That is
Var(cX) = c**2 * Var(X)
So the variance of the mean equals
(variance of the sum)/N**2 = N * sigma**2 / N**2 = sigma**2 / N
and so the standard deviation of the mean (which is the square root of the variance) equals
sigma/sqrt(N).
This is the origin of the sqrt(N) in the denominator.
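As a quick numerical sanity check of this derivation (a sketch with simulated draws; the parameters are arbitrary):

import numpy as np

# the std deviation of many sample means should match sigma/sqrt(N)
rng = np.random.default_rng(0)
N, trials, sigma = 25, 100000, 2.0
means = rng.normal(0, sigma, size=(trials, N)).mean(axis=1)
print(means.std())           # close to 0.4
print(sigma / np.sqrt(N))    # 0.4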
Here is some example code, based on Tom's code, which demonstrates the claims made above:
import numpy as np
from scipy import stats
N = 10000
a = np.random.normal(0, 1, N)
mean, sigma = a.mean(), a.std(ddof=1)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)
print('{:0.2%} of the single draws are in conf_int_a'
      .format(((a >= conf_int_a[0]) & (a < conf_int_a[1])).sum() / float(N)))
M = 1000
b = np.random.normal(0, 1, (N, M)).mean(axis=1)
conf_int_b = stats.norm.interval(0.68, loc=0, scale=1 / np.sqrt(M))
print('{:0.2%} of the means are in conf_int_b'
      .format(((b >= conf_int_b[0]) & (b < conf_int_b[1])).sum() / float(N)))
prints
68.03% of the single draws are in conf_int_a
67.78% of the means are in conf_int_b
Beware that if you define conf_int_b with the estimates for mean and sigma
based on the sample a, the mean may not fall in conf_int_b with the desired
frequency.
If you take a sample from a distribution and compute the sample mean and std deviation,
mean, sigma = a.mean(), a.std()
be careful to note that there is no guarantee that these equal the population mean and standard deviation, and that we are assuming the population is normally distributed -- neither is an automatic given!
If you take a sample and want to estimate the population mean and standard
deviation, you should use
mean, sigma = a.mean(), a.std(ddof=1)
since ddof=1 makes the sample variance an unbiased estimator of the population variance (its square root is then the usual estimator of the population standard deviation).
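To see the pitfall numerically, here is a small sketch (the simulation parameters are my own): for a small sample, plugging the estimated mean and sigma into stats.norm.interval gives an interval that covers the true mean noticeably less than 68% of the time, which is why the t-based interval in the next answer widens for small n.

import numpy as np
from scipy import stats

# coverage check: intervals built from small-sample estimates of mean and sigma
rng = np.random.default_rng(1)
trials, M = 10000, 5
hits = 0
for _ in range(trials):
    s = rng.normal(0, 1, M)
    lo, hi = stats.norm.interval(0.68, loc=s.mean(),
                                 scale=s.std(ddof=1) / np.sqrt(M))
    hits += (lo <= 0 <= hi)
print(hits / trials)   # around 0.62 here, below the nominal 0.68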
I just checked how R and GraphPad calculate confidence intervals, and they increase the interval in case of small sample size (n). E.g., more than 6-fold for n=2 compared to a large n. This code (based on shasan's answer) matches their confidence intervals:
import numpy as np, scipy.stats as st
# returns confidence interval of mean
def confIntMean(a, conf=0.95):
    mean, sem, m = np.mean(a), st.sem(a), st.t.ppf((1+conf)/2., len(a)-1)
    return mean - m*sem, mean + m*sem
For R, I checked against t.test(a). GraphPad's confidence interval of a mean page has "user level" info on the sample size dependency.
Here is the output for Gabriel's example:
In [2]: a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])
In [3]: confIntMean(a, 0.68)
Out[3]: (3.9974214366806184, 4.877578563319382)
In [4]: st.norm.interval(0.68, loc=np.mean(a), scale=st.sem(a))
Out[4]: (4.0120010966037407, 4.8629989033962593)
Note that the difference between the confIntMean() and st.norm.interval() intervals is relatively small here; len(a) == 16 is not too small.
I tested out your methods using an array with a known confidence interval. numpy.random.normal(mu,std,size) returns an array centered on mu with a standard deviation of std (in the docs, this is defined as Standard deviation (spread or “width”) of the distribution.).
from scipy import stats
import numpy as np
from numpy import random
a = random.normal(0,1,10000)
mean, sigma = np.mean(a), np.std(a)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)
conf_int_b = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))
conf_int_a
(-1.0011149125527312, 1.0059797764202412)
conf_int_b
(-0.0076030415111100983, 0.012467905378619625)
Since the 1-sigma interval should run from about -1 to 1, the / np.sqrt(len(a)) method appears to be incorrect.
Edit
Since I don't have the reputation to comment above, I'll clarify how this answer ties into unutbu's thorough answer. If you populate a random array with a normal distribution, 68% of the total will fall within 1σ of the mean. In the case above, if you check, you see
b = a[(a > -1) & (a < 1)]
len(b)
> 6781
or 68% of the population falls within 1σ. Well, about 68%. As you use a larger and larger array, you will approach 68% (in a trial of 10, 9 were between -1 and 1). That's because 1σ describes the inherent spread of the data, and the more data you have, the better you can resolve it.
Basically, my interpretation of your question was "If I have a sample of data I want to use to describe the distribution they were drawn from, how do I find the standard deviation of that data?", whereas unutbu's interpretation appears to be more "What is the interval in which I can place the mean with 68% confidence?". For the jelly beans, I answered "How are they guessing?" and unutbu answered "What do their guesses tell us about the jelly beans?".
Related
Suppose I draw randomly from a normal distribution with mean zero and standard deviation represented by a vector of, say, dimension 3 with
scale_rng=np.array([1,2,3])
eps=np.random.normal(0,scale_rng)
I need to compute a weighted average based on some simulations for which I draw the above-mentioned eps. The weights of this average are "the probability of eps" (hence I will have a vector with 3 weights). By weighted average I simply mean an arithmetic sum where each component is multiplied by a weight, i.e. a number between 0 and 1, and where all the weights sum to one.
Such a weighted average shall be calculated as follows: I have a time series of observations for one variable, x. I calculate an expanding rolling standard deviation of x (say these are the values in scale). Then, for each time-observation in x, I extract a random variable eps from a normal distribution as explained above and add it, obtaining y=x+eps. Finally, I need to compute the weighted average of y where each value of y is weighted by "the probability of drawing each value of eps from a normal distribution with mean zero and standard deviation equal to scale".
Now, I know that I cannot think of this as the points on the pdf corresponding to the values randomly drawn, because a normal random variable is continuous and as such the pdf at a single point is zero. Hence, the only solution I found is to discretize the normal distribution with a certain number of bins and then find the probability that a value extracted with the code above is actually drawn. How could I do this in Python?
EDIT: the solution I found is to use
norm.cdf(eps_it+0.5, loc=0, scale=scale_rng)-norm.cdf(eps_it-0.5, loc=0, scale=scale_rng)
which is not really based on discretization, but at least it seems feasible to me "probability-wise".
Here's an example leaving everything continuous.
import numpy as np
from scipy import stats
# some function we want a Monte Carlo estimate of
def fn(eps):
    return np.sum(np.abs(eps), axis=1)
# define distribution of eps
sd = np.array([1,2,3])
d_eps = stats.norm(0, sd)
# draw uniform samples so we don't double apply the normal density
eps = np.random.uniform(-6*sd, 6*sd, size=(10000, 3))
# calculate weights (working with log-likelihood is better for numerical stability)
w = np.prod(d_eps.pdf(eps), axis=1)
# normalise so weights sum to 1
w /= np.sum(w)
# get estimate
np.sum(fn(eps) * w)
which gives me 4.71, 4.74, 4.70, 4.78 if I run it a few times. We can verify this is correct by just using the mean when eps is drawn from a normal directly:
np.mean(fn(d_eps.rvs(size=(10000, 3))))
which gives me essentially the same values, but with lower variance, as expected: e.g. 4.79, 4.76, 4.77, 4.82, 4.80.
I couldn't understand how to properly use this function, could someone please explain it to me?
Let's say I have:
a mean of 172.7815
a standard deviation of 4.1532
N = 50 (50 samples)
When I'm asked to calculate the (95%) margin of error using norm.ppf(), will the code look like below?
norm.ppf(0.95, loc=172.78, scale=4.15)
or will it look like this?
norm.ppf(0.95, loc=0, scale=1)
Because I know it's calculating the area of the curve to the right of the confidence interval (95%, 97.5%, etc.; see image below), but when I have a mean and a standard deviation, I get really confused about how to use the function.
The method norm.ppf() takes a percentage and returns a standard deviation multiplier for the value at which that percentage occurs.
It is equivalent to a 'one-tail test' on the density plot.
From scipy.stats.norm:
ppf(q, loc=0, scale=1) Percent point function (inverse of cdf — percentiles).
Standard Normal Distribution
The code:
norm.ppf(0.95, loc=0, scale=1)
Returns a 95% significance interval for a one-tail test on a standard normal distribution (i.e. a special case of the normal distribution where the mean is 0 and the standard deviation is 1).
Our Example
To calculate the value at which our 95% significance interval lies for the OP-provided example (for a one-tail test), we would use:
norm.ppf(0.95, loc=172.7815, scale=4.1532)
This will return a value (that functions as a 'standard-deviation multiplier') marking where 95% of data points would be contained if our data is a normal distribution.
To get the exact number, we take the norm.ppf() output and multiply it by our standard deviation for the distribution in question.
A Two-Tailed Test
If we need a 'two-tail test' (i.e. we're concerned with values both greater and less than our mean), then we need to split the significance level (our alpha) because we're still using a one-tail calculation. A 95% confidence level has a 5% alpha; splitting the 5% alpha across both tails leaves 2.5% in each. Subtracting 2.5% from 100% gives 97.5% as the input for the significance level.
Therefore, if we were concerned with values on both sides of our mean, our code would input .975 to represent a 95% significance level across two-tails:
norm.ppf(0.975, loc=172.7815, scale=4.1532)
Margin of Error
Margin of error quantifies the uncertainty when estimating a population parameter with a sample statistic. We want to generate our 95% confidence interval using the two-tailed input to norm.ppf() since we're concerned with values both greater and less than our mean:
ppf = norm.ppf(0.975, loc=172.7815, scale=4.1532)
Next, we'd take the ppf and multiply it by our standard deviation to return the interval value:
interval_value = std * ppf
Finally, we'd mark the confidence intervals by adding & subtracting the interval value from the mean:
lower_95 = mean - interval_value
upper_95 = mean + interval_value
Plot with vertical lines (assuming matplotlib.pyplot is imported as plt):
_ = plt.axvline(lower_95, color='r', linestyle=':')
_ = plt.axvline(upper_95, color='r', linestyle=':')
James' statement that norm.ppf returns a "standard deviation multiplier" is wrong. This feels pertinent, as his post is the top Google result when one searches for norm.ppf.
norm.ppf is the inverse of norm.cdf. In the example, it simply returns the value at the 95th percentile. There is no "standard deviation multiplier" involved.
A better answer exists here:
How to calculate the inverse of the normal cumulative distribution function in python?
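A quick check of that inverse relationship, using the numbers from the example above (a minimal sketch):

from scipy.stats import norm

x = norm.ppf(0.95, loc=172.7815, scale=4.1532)   # value at the 95th percentile
print(norm.cdf(x, loc=172.7815, scale=4.1532))   # 0.95, since cdf undoes ppf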
You can figure out the confidence interval with norm.ppf directly, without calculating the margin of error first:
upper_of_interval = norm.ppf(0.975, loc=172.7815, scale=4.1532/np.sqrt(50))
lower_of_interval = norm.ppf(0.025, loc=172.7815, scale=4.1532/np.sqrt(50))
4.1532 is the sample standard deviation, not the standard deviation of the sampling distribution of the sample mean. So scale in norm.ppf is specified as scale = 4.1532 / np.sqrt(50), which estimates the standard deviation of the sampling distribution.
(The standard deviation of the sampling distribution equals the population standard deviation / np.sqrt(sample size). Here we do not know the population standard deviation, but the sample size is above 30, so sample standard deviation / np.sqrt(sample size) is a good estimator.)
Margin of error can be calculated with (upper_of_interval - lower_of_interval) / 2.
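Putting those pieces together, a minimal sketch of the computation described above:

import numpy as np
from scipy.stats import norm

sem = 4.1532 / np.sqrt(50)   # estimated std deviation of the sampling distribution
upper_of_interval = norm.ppf(0.975, loc=172.7815, scale=sem)
lower_of_interval = norm.ppf(0.025, loc=172.7815, scale=sem)
margin_of_error = (upper_of_interval - lower_of_interval) / 2   # ~1.15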
Calculate the value at the 95th percentile and draw a vertical line with an annotation at that value:
import matplotlib.pyplot as plt
from scipy.stats import norm

mean = 172.7815
std = 4.1532
N = 50
results = norm.rvs(mean, std, size=N)
pct_5 = norm.ppf(.95, mean, std)
plt.hist(results, bins=10)
plt.axvline(pct_5)
plt.annotate(pct_5, xy=(pct_5, 6))
plt.show()
As other answers pointed out, norm.ppf(1-alpha) returns the value at the (1-alpha)*100-th percentile of the normal distribution specified by the parameters passed to it. For example, in the OP it returns the 95th percentile of a normal distribution with mean 172.78 and standard deviation 4.15.
If you're looking for a function that returns the same value as a function of alpha instead, there's the inverse survival function, norm.isf(alpha), which returns the value above which a fraction alpha of the distribution lies.
import numpy as np
from scipy.stats import norm

alpha = 0.05
v1 = norm.isf(alpha)
v2 = norm.ppf(1-alpha)
np.isclose(v1, v2)  # True
I need to generate pseudo-random numbers from a lognormal distribution in Python. The problem is that I am starting from the mode and standard deviation of the lognormal distribution. I don't have the mean or median of the lognormal distribution, nor any of the parameters of the underlying normal distribution.
numpy.random.lognormal takes the mean and standard deviation of the underlying normal distribution. I tried to calculate these from the parameters I have, but wound up with a quartic equation. It has a solution, but I hope there is a more straightforward way to do this.
scipy.stats.lognorm takes parameters that I don't understand. I am not a native English speaker and the documentation doesn't make sense.
Can you help me, please?
You have the mode and the standard deviation of the log-normal distribution. To use the rvs() method of scipy's lognorm, you have to parameterize the distribution in terms of the shape parameter s, which is the standard deviation sigma of the underlying normal distribution, and the scale, which is exp(mu), where mu is the mean of the underlying distribution.
You pointed out that making this reparameterization requires solving a quartic polynomial. For that, we can use the numpy.poly1d class. Instances of that class have a roots attribute.
A little algebra shows that x = exp(sigma**2) is the unique positive real root of the polynomial
x**4 - x**3 - (stddev/mode)**2 = 0
where stddev and mode are the given standard deviation and mode of the log-normal distribution. (To see this, note that the mode is scale/x and the variance is (x - 1)*scale**2*x; substituting scale = mode*x gives variance = mode**2*(x**4 - x**3).) For that solution, the scale (i.e. exp(mu)) is
scale = mode*x
Here's a function that converts the mode and standard deviation to the shape and scale:
import numpy as np

def lognorm_params(mode, stddev):
    """
    Given the mode and std. dev. of the log-normal distribution, this function
    returns the shape and scale parameters for scipy's parameterization of the
    distribution.
    """
    p = np.poly1d([1, -1, 0, 0, -(stddev/mode)**2])
    r = p.roots
    sol = r[(r.imag == 0) & (r.real > 0)].real
    shape = np.sqrt(np.log(sol))
    scale = mode * sol
    return shape, scale
For example,
In [155]: mode = 123
In [156]: stddev = 99
In [157]: sigma, scale = lognorm_params(mode, stddev)
Generate a sample using the computed parameters:
In [158]: from scipy.stats import lognorm
In [159]: sample = lognorm.rvs(sigma, 0, scale, size=1000000)
Here's the standard deviation of the sample:
In [160]: np.std(sample)
Out[160]: 99.12048952171304
And here's some matplotlib code to plot a histogram of the sample, with a vertical line drawn at the mode of the distribution from which the sample was drawn:
In [176]: tmp = plt.hist(sample, density=True, bins=1000, alpha=0.6, color='c', ec='c')
In [177]: plt.xlim(0, 600)
Out[177]: (0, 600)
In [178]: plt.axvline(mode)
Out[178]: <matplotlib.lines.Line2D at 0x12c5a12e8>
The histogram:
If you want to generate the sample using numpy.random.lognormal() instead of scipy.stats.lognorm.rvs(), you can do this:
In [200]: sigma, scale = lognorm_params(mode, stddev)
In [201]: mu = np.log(scale)
In [202]: sample = np.random.lognormal(mu, sigma, size=1000000)
In [203]: np.std(sample)
Out[203]: 99.078297384090902
I haven't looked into how robust poly1d's roots algorithm is, so be sure to test for a wide range of possible input values. Alternatively, you can use a solver from scipy to solve the above polynomial for x. You can bound the solution using:
max(sqrt(stddev/mode), 1) <= x <= sqrt(stddev/mode) + 1
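For instance, a sketch using scipy.optimize.brentq with those bounds (the function name is mine):

import numpy as np
from scipy.optimize import brentq

def lognorm_params_brentq(mode, stddev):
    # x = exp(sigma**2) is the positive root of x**4 - x**3 - a = 0
    a = (stddev / mode)**2
    lo = max(np.sqrt(stddev / mode), 1.0)
    hi = np.sqrt(stddev / mode) + 1.0
    x = brentq(lambda x: x**4 - x**3 - a, lo, hi)
    shape = np.sqrt(np.log(x))
    scale = mode * x
    return shape, scale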
The log-normal distribution is (confusingly) the result of applying the exponential function to a normal distribution. Wikipedia gives the relationship between the parameters as
mu = log(m**2 / sqrt(v + m**2))
sigma**2 = log(1 + v/m**2)
where mu and sigma are the mean and standard deviation of what you call the "underlying normal distribution", and m and v are the mean and variance of the log-normal distribution.
Now, what you say you have is the mode and standard deviation of the log-normal distribution. The variance v is just the square of the standard deviation. Getting from the mode to m is trickier: again quoting that Wikipedia article, if the mean is exp(mu + sigma**2/2) then the mode is exp(mu - sigma**2). From this, and the above, we can deduce that
n = m**4 / (v + m**2)**(3/2)
where n is the mode of the log-normal distribution and v, m are as above. This reduces to a quartic,
n**2 * (v + m**2)**3 = m**8
or
n**2 * (v + u)**3 = u**4
where u = m**2. I suspect this is the same quartic you mentioned in your question. It can be solved, but like most quartics, the radical form of the solutions is a giant hairball. The most practical approach for your purposes might be to plug numeric values for n and v into the above and then use a numeric solver to find the positive root(s).
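For instance, a sketch with numpy.roots, reusing the mode and standard deviation from the example elsewhere on this page (expanding n**2*(v + u)**3 gives the quartic's coefficients directly):

import numpy as np

n, v = 123.0, 99.0**2    # mode and variance of the log-normal (example values)
# u**4 - n**2*u**3 - 3*n**2*v*u**2 - 3*n**2*v**2*u - n**2*v**3 = 0
coeffs = [1.0, -n**2, -3*n**2*v, -3*n**2*v**2, -n**2*v**3]
roots = np.roots(coeffs)
u = roots[np.isclose(roots.imag, 0) & (roots.real > 0)].real.item()
m = np.sqrt(u)           # mean of the log-normal, ~182 for these inputs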
Sorry I can't be more help. This is really a math question, not a programming question; you might get more helpful answers on https://math.stackexchange.com/.
Adding to @WarrenWeckesser's excellent answer, here's a function that provides the exact return values to reparametrize a lognormal distribution in terms of the mode and the SD:
import numpy as np
def lognorm_params(mode, stddev):
    a = stddev**2 / mode**2
    x = 1/4*np.sqrt(-(16*(2/3)**(1/3)*a)/(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3) +
                    2*(2/3)**(2/3)*(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3)+1) + \
        1/2*np.sqrt((4*(2/3)**(1/3)*a)/(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3) -
                    (np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3)/(2**(1/3)*3**(2/3)) +
                    1/(2*np.sqrt(-(16*(2/3)**(1/3)*a)/(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3) +
                                 2*(2/3)**(2/3)*(np.sqrt(3)*np.sqrt(256*a**3+27*a**2)-9*a)**(1/3)+1))+1/2) + \
        1/4
    shape = np.sqrt(np.log(x))
    scale = mode * x
    return shape, scale
Essentially, I just computed the exact solution of the quartic. The advantages are that the solution is a) exact, b) faster and c) vectorizable. As with the answer by @WarrenWeckesser, this function returns, for a given mode and SD, the parameters shape and scale as used by scipy.stats.lognorm.
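As a quick sanity check of either version (my own sketch, reusing the mode/stddev example from above):

from scipy.stats import lognorm

shape, scale = lognorm_params(123, 99)
sample = lognorm.rvs(shape, 0, scale, size=1_000_000)
# np.std(sample) should come out close to 99, with the histogram peaking near 123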
How do I calculate the z score of a p-value and vice versa?
For example if I have a p-value of 0.95 I should get 1.96 in return.
I saw some functions in scipy but they only run a z-test on an array.
I have access to numpy, statsmodel, pandas, and scipy (I think).
>>> import scipy.stats as st
>>> st.norm.ppf(.95)
1.6448536269514722
>>> st.norm.cdf(1.64)
0.94949741652589625
As other users noted, Python calculates left/lower-tail probabilities by default. If you want to determine the two points between which 95% of the distribution lies, you have to take another approach:
>>> st.norm.ppf(.975)
1.959963984540054
>>> st.norm.ppf(.025)
-1.959963984540054
Starting in Python 3.8, the standard library provides the NormalDist object as part of the statistics module.
It can be used to get the z-score bounding the central x% of the area under the normal curve (ignoring both tails).
We can obtain one from the other and vice versa using the inv_cdf (inverse cumulative distribution function) and the cdf (cumulative distribution function) on the standard normal distribution:
from statistics import NormalDist
NormalDist().inv_cdf((1 + 0.95) / 2.)
# 1.9599639845400536
NormalDist().cdf(1.9599639845400536) * 2 - 1
# 0.95
An explanation for the '(1 + 0.95) / 2.' formula can be found in this wikipedia section.
If you are interested in the t-test, you can do something similar:
z-statistics (z-score) are used when the data follow a normal distribution, the population standard deviation sigma is known, and the sample size is above 30. The z-score tells you how many standard deviations from the mean your result is. It is calculated using the formula:
z_score = (xbar - mu) / (sigma / sqrt(n))
t-statistics (t-score), also known as Student's t-distribution, are used when the data follow a normal distribution, the population standard deviation (sigma) is NOT known but the sample standard deviation (s) is known or can be calculated, and the sample size is below 30. The t-score tells you how many standard deviations from the mean your result is. It is calculated using the formula:
t_score = (xbar - mu) / (s / sqrt(n))
Summary: if the sample size is larger than 30, the z-distribution and the t-distribution are pretty much the same and either one can be used. If the population standard deviation is available and the sample size is greater than 30, the t-distribution can be used with the population standard deviation instead of the sample standard deviation.
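A small sketch computing both scores (the sample numbers below are made up for illustration):

import numpy as np

xbar, mu, n = 4.5, 4.0, 16   # sample mean, hypothesised mean, sample size
sigma, s = 1.2, 1.1          # population and sample standard deviations

z_score = (xbar - mu) / (sigma / np.sqrt(n))   # 1.67
t_score = (xbar - mu) / (s / np.sqrt(n))       # 1.82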
                              z-statistics                 t-statistics
lookup table                  z-table                      t-table
lookup value                  z-score                      t-score
critical value                z-critical is the z-score    t-critical is the t-score
                              at a specific confidence     at a specific confidence
                              level                        level
normal distribution           yes                          yes
population std dev (sigma)    known                        not known
sample size                   > 30                         < 30
The Python percent point function is used to calculate the critical values at a specific confidence level:
z-critical = stats.norm.ppf(1 - alpha) (use alpha = alpha/2 for two-sided)
t-critical = stats.t.ppf(1 - alpha/numOfTails, df) (df = sample size - 1)
Code:
import numpy as np
from scipy import stats
# alpha to critical
alpha = 0.05
n_sided = 2 # 2-sided test
z_crit = stats.norm.ppf(1-alpha/n_sided)
print(z_crit) # 1.959963984540054
# critical to alpha
alpha = stats.norm.sf(z_crit) * n_sided
print(alpha) # 0.05
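The t counterpart follows the same pattern; here is a sketch assuming a sample of size 20 (the sample size is my own example):

from scipy import stats

n = 20
df = n - 1                               # degrees of freedom
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha/2, df)    # two-sided critical value
print(t_crit)                            # ~2.093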
Z-score to probability:
The code snippet below maps the negative of the absolute value of the z-score to the cdf of a standard normal distribution and multiplies by 2. This gives the probability of Area1 + Area2 shaded in the picture here:
import numpy as np
from scipy.stats import norm

zscore = 1.5  # example z-score
norm(0, 1).cdf(-np.absolute(zscore)) * 2
Ref: https://mathbitsnotebook.com/Algebra2/Statistics/STzScores.html
I am doing some work comparing the interpolated FFT of the concentrations of some gases over a period, which is unevenly sampled, with the Lomb-Scargle periodogram of the same data. I am using scipy's fft function to calculate the Fourier transform and then squaring the modulus of this to give what I believe to be the power spectral density, in units of parts per billion (ppb) squared.
I can get the Lomb-Scargle plot to match almost exactly the pattern of the FFT, but never the same scale of magnitude; the FFT power spectral density is always higher, even though I thought the Lomb-Scargle power was a power spectral density. The Lomb code I am using (http://www.astropython.org/snippet/2010/9/Fast-Lomb-Scargle-algorithm) normalises the dataset by subtracting the average and dividing by 2 times the variance, so I have normalised the FFT data in the same manner, but the magnitudes still do not match.
I then did some more research and found that the normalised Lomb-Scargle power can be unitless, in which case I cannot make the plots match. This leads me to two questions:
What units (if any) is the power spectral density of a normalised Lomb-Scargle periodogram in?
How would I proceed to match my FFT plot with my Lomb-Scargle plot, in terms of magnitude and pattern?
Thank you.
The squared modulus of the Fourier transform of a series is defined as the energy spectral density (ESD). You need to divide the ESD by the length of the series to convert to an estimate of power spectral density (PSD).
Units
The units of a PSD are [units]**2/[frequency] where [units] represents the units of your original series.
Normalization
To check for proper normalization, one can numerically integrate the PSD of a white noise (with known variance). If the integrated spectrum equals the variance of the series, the normalization is correct. A factor of 2 (too low) is not incorrect, though, and may indicate the PSD is normalized to be double-sided; in that case, just multiply by 2 and you have a properly normalized, single-sided PSD.
Using numpy, the randn function generates pseudo-random numbers that are Gaussian distributed. For example
10 * np.random.randn(1, 100)
produces a 1-by-100 array with mean=0 and variance=100. If the sampling frequency is, say, 1-Hz, the single-sided PSD will theoretically be flat at 200 units**2/Hz, from [0,0.5] Hz; the integrated spectrum would thus be 10, equaling the variance of the series.
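Here is a sketch of that white-noise check with a plain FFT (my own normalization example, assuming a 1-Hz sampling frequency):

import numpy as np

rng = np.random.default_rng(0)
N, fs = 2**16, 1.0
x = 10 * rng.standard_normal(N)             # white noise, variance ~100

psd = np.abs(np.fft.fft(x))**2 / (N * fs)   # double-sided PSD estimate
delta_f = fs / N
print(np.sum(psd) * delta_f)                # integrated spectrum, ~100
print(np.var(x))                            # matches the series variance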
Update
I modified the example included in the python code you linked to demonstrate the normalization for a normally distributed series of length 20, with variance 1, and sampling frequency 10:
import numpy
import lomb
numpy.random.seed(999)
nd = 20
fs = 10
x = numpy.arange(nd)
y = numpy.random.randn(nd)
fx, fy, nout, jmax, prob = lomb.fasper(x, y, 1., fs)
fNy = fx[-1]
fy = fy/fs
Si = numpy.mean(fy)*fNy
print(fNy, Si, Si*2)
This gives, for me:
5.26315789474 0.482185882163 0.964371764327
which shows you a few things:
The "Nyquist" frequency asked for is actually the sampling frequency.
The result needs to be divided by the sampling frequency.
The output is normalized for a double-sided PSD, so multiplying by 2 makes the integrated spectrum nearly 1.
In the time since this question was asked and answered, the AstroPy project has gained a Lomb-Scargle method, and this question is addressed in the documentation: http://docs.astropy.org/en/stable/stats/lombscargle.html#psd-normalization-unnormalized
In brief, you can compute a Fourier periodogram and compare it to the astropy Lomb-Scargle periodogram as follows:
import numpy as np
from astropy.stats import LombScargle
def fourier_periodogram(t, y):
    N = len(t)
    frequency = np.fft.fftfreq(N, t[1] - t[0])
    y_fft = np.fft.fft(y)
    positive = (frequency > 0)
    return frequency[positive], (1. / N) * abs(y_fft[positive]) ** 2
t = np.arange(100)
y = np.random.randn(100)
frequency, PSD_fourier = fourier_periodogram(t, y)
PSD_LS = LombScargle(t, y).power(frequency, normalization='psd')
np.allclose(PSD_fourier, PSD_LS)
# True
Since AstroPy is a common tool used in astronomy, I thought this might be more useful than an answer based on the code snippet mentioned above.