Is there a sci.stats.moment function for binned data? - python

I'm looking for a function which calculates the n-th central moment
(same as the one out of scipy.stats.moment)
for my binned data (Out of the numpy.histogram function).
# Generate normal distributed data
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(size=500,loc=1,scale=2)
H = np.histogram(data,bins=50)
plt.scatter(H[1][:-1],H[0])
plt.show()
for my above code example the results should be (0,4,0,48) for the first four moments as there sigma = 2 (for the central moment).

Working with binned data is essentially the same as working with weighted data. One uses the midpoint of each bin as a data point, and the count of that bin as its weight. If scipy.stats.moment supported weights, we could do this computation directly. As is, use the method numpy.average which supports weights.
midpoints = 0.5 * (H[1][1:] + H[1][:-1])
ev = np.average(midpoints, weights = H[0])
print(ev)
for k in range(2, 5):
print(np.average((midpoints - ev)**k, weights = H[0]))
Output (obviously random):
1.08242834443
4.21602099286
0.713129264647
51.6257736139
I didn't print the centered 1st moment (which is 0 by construction), printing the expected value instead. Theoretically*, these are 1, 4, 0, 48 but for any given sample, there is going to be some deviation from the parameters of the distribution.
(*) Not exactly. In the formula for variance I didn't include the correction factor n/(n-1) (where n is the total size of data set, i.e., the sum of weights). This factor adjusts the sample variance so it becomes an unbiased estimator of the population variance. You can include it if you like. Similar adjustments are probably needed for higher-order moments (if the goal is to have unbiased estimators), but I'd have to look this up, and in any case this is not a statistics site.

Related

Seaborn KDEPlot - not enough variation in data?

I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are the same, minimum value. When I include too many of the minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)
But including 451 of the minimum values gives a very different output:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)
Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.
The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.
The bandwidth is the width of the gaussians that are positioned at every sample point and summed up. Lower bandwidths are closer to the data, higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. In order to compare different kde's it is recommended to each time choose exactly the same bandwidth.
Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10,5), gridspec_kw={'hspace':0.3})
for i, bw in enumerate(['scott', 0.3]):
for j, num_same in enumerate([400, 450, 500]):
y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
sns.kdeplot(y, bw=bw, ax=axs[i, j])
axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()
The third plot gives a warning that the kde can not be drawn using Scott's suggested bandwidth.
PS: As mentioned by #mwascom in the comments, in this case scipy.statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott" - 1.059 * A * nobs ** (-1/5.), where A is min(std(X),IQR/1.34). The min() clarifies the abrupt change in behavior. IQR is the "interquartile range", the difference between the 75th and 25th percentiles.
Edit: Since Seaborn 0.11, the statsmodel backend has been dropped, so kde's are only calculated via scipy.stats.gaussian_kde.
If the sample has repeated values, this implies that the underlying distribution is not continuous. In the data that you show to illustrate the issue, we can see a Dirac distribution on the left. The kernel smoothing might be applied for such data, but with care. Indeed, to approximate such data, we might use a kernel smoothing where the bandwidth associated to the Dirac is zero. However, in most KDE methods, there is only one single bandwidth for all kernel atoms. Moreover, the various rules used to compute the bandwidth are based on some estimation of the rugosity of the second derivative of the PDF of the distribution. This cannot be applied to a discontinuous distribution.
We can, however, try to separate the sample into two sub-samples:
the sub-sample(s) with replications,
the sub-sample with unique realizations.
(This idea has already been mentionned by johanc).
Below is an attempt to perform this classification. The np.unique method is used to count the occurences of the replicated realizations. The replicated values are associated with Diracs and the weight in the mixture is estimated from the fraction of these replicated values in the sample. The remaining realizations, uniques, are then used to estimate the continuous distribution with KDE.
The following function will be useful in order to overcome a limitation with the current implementation of the draw method of Mixtures with OpenTURNS.
def DrawMixtureWithDiracs(distribution):
"""Draw a distributions which has Diracs.
https://github.com/openturns/openturns/issues/1489"""
graph = distribution.drawPDF()
graph.setLegends(["Mixture"])
for atom in distribution.getDistributionCollection():
if atom.getName() == "Dirac":
curve = atom.drawPDF()
curve.setLegends(["Dirac"])
graph.add(curve)
return graph
The following script creates a use-case with a Mixture containing a Dirac and a gaussian distributions.
import openturns as ot
import numpy as np
distribution = ot.Mixture([ot.Dirac(-3.0),
ot.Normal()], [0.5, 0.5])
DrawMixtureWithDiracs(distribution)
This is the result.
Then we create a sample.
sample = distribution.getSample(100)
This is where your problem begins. We count the number of occurences of each realizations.
array = np.array(sample)
unique, index, count = np.unique(array, axis=0, return_index=True,
return_counts=True)
For all realizations, replicated values are associated with Diracs and unique values are put in a separate list.
sampleSize = sample.getSize()
listOfDiracs = []
listOfWeights = []
uniqueValues = []
for i in range(len(unique)):
if count[i] == 1:
uniqueValues.append(unique[i][0])
else:
atom = ot.Dirac(unique[i])
listOfDiracs.append(atom)
w = count[i] / sampleSize
print("New Dirac =", unique[i], " with weight =", w)
listOfWeights.append(w)
The weight of the continuous atom is the complementary of the sum of the weights of the Diracs. This way, the sum of the weights will be equal to 1.
complementaryWeight = 1.0 - sum(listOfWeights)
weights = list(listOfWeights)
weights.append(complementaryWeight)
The easy part comes: the unique realizations can be used to fit a kernel smoothing. The KDE is then added to the list of atoms.
sampleUniques = ot.Sample(uniqueValues, 1)
factory = ot.KernelSmoothing()
kde = factory.build(sampleUniques)
atoms = list(listOfDiracs)
atoms.append(kde)
Et voilĂ : the Mixture is ready.
mixture_estimated = ot.Mixture(atoms, weights)
The following script compares the initial Mixture and the estimated one.
graph = DrawMixtureWithDiracs(distribution)
graph.setColors(["dodgerblue3", "dodgerblue3"])
curve = DrawMixtureWithDiracs(mixture_estimated)
curve.setColors(["darkorange1", "darkorange1"])
curve.setLegends(["Est. Mixture", "Est. Dirac"])
graph.add(curve)
graph
The figure seems satisfactory, since the continuous distribution is estimated from a sub-sample which size is only equal to 50, i.e. one half of the full sample.

How to compute the probability of a value given a list of samples from a distribution in Python?

Not sure if this belongs in statistics, but I am trying to use Python to achieve this. I essentially just have a list of integers:
data = [300,244,543,1011,300,125,300 ... ]
And I would like to know the probability of a value occurring given this data.
I graphed histograms of the data using matplotlib and obtained these:
In the first graph, the numbers represent the amount of characters in a sequence. In the second graph, it's a measured amount of time in milliseconds. The minimum is greater than zero, but there isn't necessarily a maximum. The graphs were created using millions of examples, but I'm not sure I can make any other assumptions about the distribution. I want to know the probability of a new value given that I have a few million examples of values. In the first graph, I have a few million sequences of different lengths. Would like to know probability of a 200 length, for example.
I know that for a continuous distribution the probability of any exact point is supposed to be zero, but given a stream of new values, I need be able to say how likely each value is. I've looked through some of the numpy/scipy probability density functions, but I'm not sure which to choose from or how to query for new values once I run something like scipy.stats.norm.pdf(data). It seems like different probability density functions will fit the data differently. Given the shape of the histograms I'm not sure how to decide which to use.
Since you don't seem to have a specific distribution in mind, but you might have a lot of data samples, I suggest using a non-parametric density estimation method. One of the data types you describe (time in ms) is clearly continuous, and one method for non-parametric estimation of a probability density function (PDF) for continuous random variables is the histogram that you already mentioned. However, as you will see below, Kernel Density Estimation (KDE) can be better. The second type of data you describe (number of characters in a sequence) is of the discrete kind. Here, kernel density estimation can also be useful and can be seen as a smoothing technique for the situations where you don't have a sufficient amount of samples for all values of the discrete variable.
Estimating Density
The example below shows how to first generate data samples from a mixture of 2 Gaussian distributions and then apply kernel density estimation to find the probability density function:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
from sklearn.neighbors import KernelDensity
# Generate random samples from a mixture of 2 Gaussians
# with modes at 5 and 10
data = np.concatenate((5 + np.random.randn(10, 1),
10 + np.random.randn(30, 1)))
# Plot the true distribution
x = np.linspace(0, 16, 1000)[:, np.newaxis]
norm_vals = mlab.normpdf(x, 5, 1) * 0.25 + mlab.normpdf(x, 10, 1) * 0.75
plt.plot(x, norm_vals)
# Plot the data using a normalized histogram
plt.hist(data, 50, normed=True)
# Do kernel density estimation
kd = KernelDensity(kernel='gaussian', bandwidth=0.75).fit(data)
# Plot the estimated densty
kd_vals = np.exp(kd.score_samples(x))
plt.plot(x, kd_vals)
# Show the plots
plt.show()
This will produce the following plot, where the true distribution is shown in blue, the histogram is shown in green, and the PDF estimated using KDE is shown in red:
As you can see, in this situation, the PDF approximated by the histogram is not very useful, while KDE provides a much better estimate. However, with a larger number of data samples and a proper choice of bin size, histogram might produce a good estimate as well.
The parameters you can tune in case of KDE are the kernel and the bandwidth. You can think about the kernel as the building block for the estimated PDF, and several kernel functions are available in Scikit Learn: gaussian, tophat, epanechnikov, exponential, linear, cosine. Changing the bandwidth allows you to adjust the bias-variance trade-off. Larger bandwidth will result in increased bias, which is good if you have less data samples. Smaller bandwidth will increase variance (fewer samples are included into the estimation), but will give a better estimate when more samples are available.
Calculating Probability
For a PDF, probability is obtained by calculating the integral over a range of values. As you noticed, that will lead to the probability 0 for a specific value.
Scikit Learn does not seem to have a builtin function for calculating probability. However, it is easy to estimate the integral of the PDF over a range. We can do it by evaluating the PDF multiple times within the range and summing the obtained values multiplied by the step size between each evaluation point. In the example below, N samples are obtained with step step.
# Get probability for range of values
start = 5 # Start of the range
end = 6 # End of the range
N = 100 # Number of evaluation points
step = (end - start) / (N - 1) # Step size
x = np.linspace(start, end, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
print(probability)
Please note that kd.score_samples generates log-likelihood of the data samples. Therefore, np.exp is needed to obtain likelihood.
The same computation can be performed using builtin SciPy integration methods, which will give a bit more accurate result:
from scipy.integrate import quad
probability = quad(lambda x: np.exp(kd.score_samples(x)), start, end)[0]
For instance, for one run, the first method calculated the probability as 0.0859024655305, while the second method produced 0.0850974209996139.
OK I offer this as a starting point, but estimating densities is a very broad topic. For your case involving the amount of characters in a sequence, we can model this from a straight-forward frequentist perspective using empirical probability. Here, probability is essentially a generalization of the concept of percentage. In our model, the sample space is discrete and is all positive integers. Well, then you simply count the occurrences and divide by the total number of events to get your estimate for the probabilities. Anywhere we have zero observations, our estimate for the probability is zero.
>>> samples = [1,1,2,3,2,2,7,8,3,4,1,1,2,6,5,4,8,9,4,3]
>>> from collections import Counter
>>> counts = Counter(samples)
>>> counts
Counter({1: 4, 2: 4, 3: 3, 4: 3, 8: 2, 5: 1, 6: 1, 7: 1, 9: 1})
>>> total = sum(counts.values())
>>> total
20
>>> probability_mass = {k:v/total for k,v in counts.items()}
>>> probability_mass
{1: 0.2, 2: 0.2, 3: 0.15, 4: 0.15, 5: 0.05, 6: 0.05, 7: 0.05, 8: 0.1, 9: 0.05}
>>> probability_mass.get(2,0)
0.2
>>> probability_mass.get(12,0)
0
Now, for your timing data, it is more natural to model this as a continuous distribution. Instead of using a parametric approach where you assume that your data has some distribution and then fit that distribution to your data, you should take a non-parametric approach. One straightforward way is to use a kernel density estimate. You can simply think of this as a way of smoothing a histogram to give you a continuous probability density function. There are several libraries available. Perhaps the most straightforward for univariate data is scipy's:
>>> import scipy.stats
>>> kde = scipy.stats.gaussian_kde(samples)
>>> kde.pdf(2)
array([ 0.15086911])
To get the probability of an observation in some interval:
>>> kde.integrate_box_1d(1,2)
0.13855869478828692
Here is one possible solution. You count the number of occurrences of each value in the original list. The future probability for a given value is its past rate of occurrence, which is simply the # of past occurrences divided by the length of the original list. In Python it's very simple:
x is the given list of values
from collections import Counter
c = Counter(x)
def probability(a):
# returns the probability of a given number a
return float(c[a]) / len(x)

How to handle shape of pymc3 Deterministic variables

I've been working on getting a hierarchical model of some psychophysical behavioral data up and running in pymc3. I'm incredibly impressed with things overall, but after trying to get up to speed with Theano and pymc3 I have a model that mostly works, however has a couple problems.
The code is built to fit a parameterized version of a Weibull to seven sets of data. Each trial is modeled as a binary Bernoulli outcome, while the thresholds (output of thact as the y values which are used to fit a Gaussian function for height, width, and elevation (a, c, and d on a typical Gaussian).
Using the parameterized Weibull seems to be working nicely, and is now hierarchical for the slope of the Weibull while the thresholds are fit separately for each chunk of data. However - the output I'm getting from k and y_est leads me to believe they may not be the correct size, and unlike the probability distributions, it doesn't look like I can specify shape (unless there's a theano way to do this that I haven't found - though from what I've read specifying shape in theano is tricky).
Ultimately, I'd like to use y_est to estimate the gaussian height or width, however the output right now results in an incredible mess that I think originates with size problems in y_est and k. Any help would be fantastic - the code below should simulate some data and is followed by the model. The model does a nice job fitting each individual threshold and getting the slopes, but falls apart when dealing with the rest.
Thanks for having a look - I'm super impressed with pymc3 so far!
EDIT: Okay, so the shape output by y_est.tag.test_value.shape looks like this
y_est.tag.test_value.shape
(101, 7)
k.tag.test_value.shape
(7,)
I think this is where I'm running into trouble, though it may just be poorly constructed on my part. k has the right shape (one k value per unique_xval). y_est is outputting an entire set of data (101x7) instead of a single estimate (one y_est per unique_xval) for each difficulty level. Is there some way to specify that y_est get specific subsets of df_y_vals to control this?
#Import necessary modules and define our weibull function
import numpy as np
import pylab as pl
from scipy.stats import bernoulli
#x stimulus intensity
#g chance (0.5 for 2AFC)
# m slope
# t threshold
# a performance level defining threshold
def weib(x,g,a,m,t):
k=-np.log(((1-a)/(1-g))**(1/t))
return 1- (1-g)*np.exp(- (k*x/t)**m);
#Output values from weibull function
xit=101
xvals=np.linspace(0.05,1,xit)
out_weib=weib(xvals, 0.5, 0.8, 3, 0.6)
#Okay, fitting the perfect output of a Weibull should be easy, contaminate with some noise
#Slope of 3, threshold of 0.6
#How about 5% noise!
noise=0.05*np.random.randn(np.size(out_weib))
out=out_weib+noise
#Let's make this more like a typical experiment -
#i.e. no percent correct, just one or zero
#Randomly pick based on the probability at each point whether they got the trial right or wrong
trial=np.zeros_like(out)
for i in np.arange(out.size):
p=out_weib[i]
trial[i] = bernoulli.rvs(p)
#Iterate for 6 sets of data, similar slope (from a normal dist), different thresh (output from gaussian)
#Gauss parameters=
true_gauss_height = 0.3
true_gauss_width = 0.01
true_gauss_elevation = 0.2
#What thresholds will we get then? 6 discrete points along that gaussian, from 0 to 180 degree mask
x_points=[0, 30, 60, 90, 120, 150, 180]
x_points=np.asarray(x_points)
gauss_points=true_gauss_height*np.exp(- ((x_points**2)/2*true_gauss_width**2))+true_gauss_elevation
import pymc as pm2
import pymc3 as pm
import pandas as pd
slopes=pm2.rnormal(3, 3, size=7)
out_weib=np.zeros([xvals.size,x_points.size])
for i in np.arange(x_points.size):
out_weib[:,i]=weib(xvals, 0.5, 0.8, slopes[i], gauss_points[i])
#Let's make this more like a typical experiment - i.e. no percent correct, just one or zero
#Randomly pick based on the probability at each point whether they got the trial right or wrong
trials=np.zeros_like(out_weib)
for i in np.arange(len(trials)):
for ii in np.arange(gauss_points.size):
p=out_weib[i,ii]
trials[i,ii] = bernoulli.rvs(p)
#Let's make that data into a DataFrame for pymc3
y_vals=np.tile(xvals, [7, 1])
df_correct = pd.DataFrame(trials, columns=x_points)
df_y_vals = pd.DataFrame(y_vals.T, columns=x_points)
unique_xvals=x_points
import theano as th
with pm.Model() as hierarchical_model:
# Hyperpriors for group node
mu_slope = pm.Normal('mu_slope', mu=3, sd=1)
sigma_slope = pm.Uniform('sigma_slope', lower=0.1, upper=2)
#Priors for the overall gaussian function - 3 params, the height of the gaussian
#Width, and elevation
gauss_width = pm.HalfNormal('gauss_width', sd=1)
gauss_elevation = pm.HalfNormal('gauss_elevation', sd=1)
slope = pm.Normal('slope', mu=mu_slope, sd=sigma_slope, shape=unique_xvals.size)
thresh=pm.Uniform('thresh', upper=1, lower=0.1, shape=unique_xvals.size)
k = -th.tensor.log(((1-0.8)/(1-0.5))**(1/thresh))
y_est=1-(1-0.5)*th.tensor.exp(-(k*df_y_vals/thresh)**slope)
#We want our model to predict either height or width...height would be easier.
#Our Gaussian function has y values estimated by y_est as the 82% thresholds
#and Xvals based on where each of those psychometrics were taken.
#height_est=pm.Deterministic('height_est', (y_est/(th.tensor.exp((-unique_xvals**2)/2*gauss_width)))+gauss_elevation)
height_est = pm.Deterministic('height_est', (y_est-gauss_elevation)*th.tensor.exp((unique_xvals**2)/2*gauss_width**2))
#Define likelihood as Bernoulli for each binary trial
likelihood = pm.Bernoulli('likelihood',p=y_est, shape=unique_xvals.size, observed=df_correct)
#Find start
start=pm.find_MAP()
step=pm.NUTS(state=start)
#Do MCMC
trace = pm.sample(5000, step, njobs=1, progressbar=True) # draw 5000 posterior samples using NUTS sampling
I'm not sure exactly what you want to do when you say "Is there some way to specify that y_est get specific subsets of df_y_vals to control this". Can you describe for each y_est value what values of df_y_vals are you supposed to use? What's the shape of df_y_vals? What's the shape of y_est supposed to be? (7,)?
I suspect what you want is to index into df_y_vals using numpy advanced indexing, which works the same in PyMC as in numpy. Its hard to say exactly without more information.

numpy.random.normal different distribution: selecting values from distribution

I have a power-law distribution of energies and I want to pick n random energies based on the distribution. I tried doing this manually using random numbers but it is too inefficient for what I want to do. I'm wondering is there a method in numpy (or other) that works like numpy.random.normal, except instead of a using normal distribution, the distribution may be specified. So in my mind an example might look like (similar to numpy.random.normal):
import numpy as np
# Energies from within which I want values drawn
eMin = 50.
eMax = 2500.
# Amount of energies to be drawn
n = 10000
photons = []
for i in range(n):
# Method that I just made up which would work like random.normal,
# i.e. return an energy on the distribution based on its probability,
# but take a distribution other than a normal distribution
photons.append(np.random.distro(eMin, eMax, lambda e: e**(-1.)))
print(photons)
Printing photons should give me a list of length 10000 populated by energies in this distribution. If I were to histogram this it would have much greater bin values at lower energies.
I am not sure if such a method exists but it seems like it should. I hope it is clear what I want to do.
EDIT:
I have seen numpy.random.power but my exponent is -1 so I don't think this will work.
Sampling from arbitrary PDFs well is actually quite hard. There are large and dense books just about how to efficiently and accurately sample from the standard families of distributions.
It looks like you could probably get by with a custom inversion method for the example that you gave.
If you want to sample from an arbitrary distribution you need the inverse of the cumulative density function (not the pdf).
You then sample a probability uniformly from range [0,1] and feed this into the inverse of the cdf to get the corresponding value.
It is often not possible to obtain the cdf from the pdf analytically.
However, if you're happy to approximate the distribution, you could do so by calculating f(x) at regular intervals over its domain, then doing a cumsum over this vector to get an approximation of the cdf and from this approximate the inverse.
Rough code snippet:
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate
def f(x):
"""
substitute this function with your arbitrary distribution
must be positive over domain
"""
return 1/float(x)
#you should vary inputVals to cover the domain of f (for better accurracy you can
#be clever about spacing of values as well). Here i space them logarithmically
#up to 1 then at regular intervals but you could definitely do better
inputVals = np.hstack([1.**np.arange(-1000000,0,100),range(1,10000)])
#everything else should just work
funcVals = np.array([f(x) for x in inputVals])
cdf = np.zeros(len(funcVals))
diff = np.diff(funcVals)
for i in xrange(1,len(funcVals)):
cdf[i] = cdf[i-1]+funcVals[i-1]*diff[i-1]
cdf /= cdf[-1]
#you could also improve the approximation by choosing appropriate interpolator
inverseCdf = scipy.interpolate.interp1d(cdf,inputVals)
#grab 10k samples from distribution
samples = [inverseCdf(x) for x in np.random.uniform(0,1,size = 100000)]
plt.hist(samples,bins=500)
plt.show()
Why don't you use eval and put the distribution in a string?
>>> cmd = "numpy.random.normal(500)"
>>> eval(cmd)
you can manipulate the string as you wish to set the distribution.

How do I match lomb-scargle and FFT plots of same dataset?

I am doing some work, comparing the interpolated fft of the concentrations of some gases over a period, of which is unevenly sampled, with the lomb-scargle periodogram of the same data. I am using scipy's fft function to calculate the fourier transform and then squaring the modulus of this to give what I believe to be the power spectral density, in units of parts per billion(ppb) squared.
I can get the lomb-scargle plot to match almost the exact pattern as the FFT but never the same scale of magnitude, the FFT power spectral density always is higher, even though I thought the lomb-scargle power was power spectral density. Now the lomb code I am using:http://www.astropython.org/snippet/2010/9/Fast-Lomb-Scargle-algorithm, normalises the dataset taking away the average and dividing by 2 times the variance from the data, therefore I have normalised the FFT data in the same manner, but still the magnitudes do not match.
Therefore I did some more research and found that the normalised lomb-scargle power could unitless and therefore I cannot the plots match. This leads me to the 2 questions:
What units (if any) are the power spectral density of a normalised lim-scargle perioogram in?
How would I proceed to match my fft plot with my lomb-scargle plot, in terms of magnitude and pattern?
Thank you.
The squared modulus of the Fourier transform of a series is defined as the energy spectral density (ESD). You need to divide the ESD by the length of the series to convert to an estimate of power spectral density (PSD).
Units
The units of a PSD are [units]**2/[frequency] where [units] represents the units of your original series.
Normalization
To check for proper normalization, one can numerically integrate the PSD of a white noise (with known variance). If the integrated spectrum equals the variance of the series, the normalization is correct. A factor of 2 (too low) is not incorrect, though, and may indicate the PSD is normalized to be double-sided; in that case, just multiply by 2 and you have a properly normalized, single-sided PSD.
Using numpy, the randn function generates pseudo-random numbers that are Gaussian distributed. For example
10 * np.random.randn(1, 100)
produces a 1-by-100 array with mean=0 and variance=100. If the sampling frequency is, say, 1-Hz, the single-sided PSD will theoretically be flat at 200 units**2/Hz, from [0,0.5] Hz; the integrated spectrum would thus be 10, equaling the variance of the series.
Update
I modified the example included in the python code you linked to demonstrate the normalization for a normally distributed series of length 20, with variance 1, and sampling frequency 10:
import numpy
import lomb
numpy.random.seed(999)
nd = 20
fs = 10
x = numpy.arange(nd)
y = numpy.random.randn(nd)
fx, fy, nout, jmax, prob = lomb.fasper(x, y, 1., fs)
fNy = fx[-1]
fy = fy/fs
Si = numpy.mean(fy)*fNy
print fNy, Si, Si*2
This gives, for me:
5.26315789474 0.482185882163 0.964371764327
which shows you a few things:
The "Nyquist" frequency asked for is actually the sampling frequency.
The result needs to be divided by the sampling frequency.
The output is normalized for a double-sided PSD, so multiplying by 2 makes the integrated spectrum nearly 1.
In the time since this question was asked and answered, the AstroPy project has gained a Lomb-Scargle method, and this question is addressed in the documentation: http://docs.astropy.org/en/stable/stats/lombscargle.html#psd-normalization-unnormalized
In brief, you can compute a Fourier periodogram and compare it to the astropy Lomb-Scargle periodogram as follows
import numpy as np
from astropy.stats import LombScargle
def fourier_periodogram(t, y):
N = len(t)
frequency = np.fft.fftfreq(N, t[1] - t[0])
y_fft = np.fft.fft(y)
positive = (frequency > 0)
return frequency[positive], (1. / N) * abs(y_fft[positive]) ** 2
t = np.arange(100)
y = np.random.randn(100)
frequency, PSD_fourier = fourier_periodogram(t, y)
PSD_LS = LombScargle(t, y).power(frequency, normalization='psd')
np.allclose(PSD_fourier, PSD_LS)
# True
Since AstroPy is a common tool used in astronomy, I thought this might be more useful than an answer based on the code snippet mentioned above.

Categories