Discretize normal distribution to get prob of a random variable - python

Suppose I draw randomly from a normal distribution with mean zero and standard deviation represented by a vector of, say, dimension 3 with
scale_rng=np.array([1,2,3])
eps=np.random.normal(0,scale_rng)
I need to compute a weighted average based on some simulations for which I draw the above-mentioned eps. The weights of this average are "the probability of eps" (hence I will have a vector with 3 weights). By weighted average I simply mean an arithmetic sum where each component is multiplied by a weight, i.e. a number between 0 and 1, and where all the weights sum up to one.
Such a weighted average shall be calculated as follows: I have a time series of observations for one variable, x. I calculate an expanding rolling standard deviation of x (say these are the values in scale). Then, for each time-observation in x, I draw a random variable eps from a normal distribution as explained above and add it to it, obtaining y = x + eps. Finally, I need to compute the weighted average of y where each value of y is weighted by "the probability of drawing each value of eps from a normal distribution with mean zero and standard deviation equal to scale".
Now, I know that I cannot think of these weights as the points on the pdf corresponding to the values randomly drawn, because a normal random variable is continuous and as such the pdf at any given point is zero. Hence, the only solution I found is to discretize the normal distribution with a certain number of bins and then find the probability that a value extracted with the code above falls in a given bin. How could I do this in Python?
EDIT: the solution I found is to use
norm.cdf(eps_it+0.5, loc=0, scale=scale_rng)-norm.cdf(eps_it-0.5, loc=0, scale=scale_rng)
which is not really based on the discretization but at least it seems feasible to me "probability-wise".
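For reference, here is a minimal sketch of that idea applied to the setup above, with the bin half-width made explicit as a parameter h (h = 0.5 reproduces the expression above; the value is an assumption to tune) and the resulting weights normalised to sum to one:
import numpy as np
from scipy.stats import norm
h = 0.5  # assumed bin half-width
scale_rng = np.array([1, 2, 3])
eps = np.random.normal(0, scale_rng)
# probability mass of a width-2h bin centred on each draw
p = norm.cdf(eps + h, loc=0, scale=scale_rng) - norm.cdf(eps - h, loc=0, scale=scale_rng)
w = p / p.sum()  # weights now sum to one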

Here's an example leaving everything continuous.
import numpy as np
from scipy import stats
# some function we want a monte carlo estimate of
def fn(eps):
    return np.sum(np.abs(eps), axis=1)
# define distribution of eps
sd = np.array([1,2,3])
d_eps = stats.norm(0, sd)
# draw uniform samples so we don't double apply the normal density
eps = np.random.uniform(-6*sd, 6*sd, size=(10000, 3))
# calculate weights (working with log-likelihood is better for numerical stability)
w = np.prod(d_eps.pdf(eps), axis=1)
# normalise so weights sum to 1
w /= np.sum(w)
# get estimate
np.sum(fn(eps) * w)
which gives me 4.71, 4.74, 4.70, 4.78 if I run it a few times. We can verify this is correct by just using a mean when eps is drawn from a normal directly:
np.mean(fn(d_eps.rvs(size=(10000, 3))))
which gives me essentially the same values, but, as expected, with lower variance: e.g. 4.79, 4.76, 4.77, 4.82, 4.80.

Related

Generating random numbers with upper and lower standard deviation of the scatter

I have to produce randomly generated, normally distributed numbers based on an astronomical table containing a most probable value and a standard deviation. The peculiar thing is that the standard deviation is given not by one number but by two: an upper standard deviation of the error and a lower one, something like this:
mass_object, error_up, error_down
7.33, 0.12, 0.07
9.40, 0.04, 0.02
6.01, 0.11, 0.09
...
For example, this means for the first object that if a random mass m gets generated with m>7.33, it will probably be further away from 7.33 than in the case that m<7.33, since the upward error is larger. So I am now looking for a way to randomly generate the numbers, and this way has to account for the two possible standard deviations. If I were dealing with just one standard deviation per object, I would create the random number (mass) of the first object like this:
mass_random = np.random.normal(loc=7.33, scale=0.12)
Do you have ideas on how to create these random numbers with an upper and a lower standard deviation of the scatter? Thanks!
As we discussed in the comments, a normal distribution has the same standard deviation in each direction (it's symmetric around the mean), so we know our distribution won't be normal. We can try a lognormal approach instead, since this allows us to introduce skewness. To do this in Python, you'll need SciPy. Here's a crude approach, assuming that 68% of the data sits at the mean, 16% at the high point, and 16% at the low point. We fit the distribution to that crude dataset, then we can draw new points from the fit:
import scipy.stats
# Choose one of the rows
mean, high, low = 7.33, 0.12, 0.07
# Create a dummy dataset to fit the distribution
values = [mean] * 68 + [mean + high] * 16 + [mean - low] * 16
# Fit the distribution and print its parameters
fit_dist = scipy.stats.lognorm.fit(values)
print(fit_dist)
# Calculate 10 new random values based on the fit
scipy.stats.lognorm.rvs(*fit_dist, size=10)
array([7.25541865, 7.34873107, 7.33831589, 7.36387121, 7.26912469,
7.33084677, 7.35626689, 7.33907124, 7.32522422, 7.31688687])
The immediate solution would be a two-step sampling: for a given row i, one first samples sigma_i from a uniform distribution over the interval [error_down, error_up], and then samples the final value from a normal distribution with mean m_i and standard deviation sigma_i.
In practice, one imports numpy, defines a custom sampling function and then applies it to the whole table:
import numpy as np
def sampling(row):
    m = row[0]
    sigma = np.random.uniform(row[2], row[1])  # uniform over [error_down, error_up]
    return np.random.normal(m, sigma)
# a list comprehension materialises the results (map is lazy in Python 3)
sampled_values = [sampling(row) for row in table]

Python: how to associate a probability to a given value?

I would like to associate a probability value to a number.
Let's say I consider a normal probability distribution with mean = 7 and std = 3.
I can generate a random number based on such distribution in this way
np.random.normal(7, 3, 1)
I would like to find a method to associate to a given number the probability of drawing it from such a distribution.
For instance, what is the value of the probability associated with 0.6 based on such distribution?
Let's assume I generate a histogram of n random values:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(7, 3, 100000)
plt.hist(x, 10, density=True)  # density=True normalises the histogram
Here I can see that a value of 5 has a probability of ~0.11 while a value of 20 has probability ~0.
For any normalized continuous distribution represented on a histogram as you have above, the only way to find the probability for a given histogram bin is to take the integral of that distribution over the range of the bin. So this depends on:
The distribution
The range of the bin you are considering
You can use the scipy package for example to do this integral numerically for you.
https://docs.scipy.org/doc/scipy/reference/tutorial/integrate.html
If you need something simpler, you can approximate this probability by taking the value of the PDF at the center of the bin and multiplying by the width of the bin.
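As a rough illustration of both routes (the bin edges here are my own assumption: a width-1 bin around the value 5, for the N(7, 3) distribution from the question):
from scipy.stats import norm
mu, sigma = 7, 3
left, right = 4.5, 5.5  # hypothetical bin around the value 5
# exact bin probability: integral of the pdf over the bin, via the CDF
p_exact = norm.cdf(right, mu, sigma) - norm.cdf(left, mu, sigma)
# approximation: pdf at the bin centre times the bin width
p_approx = norm.pdf(0.5 * (left + right), mu, sigma) * (right - left)
print(p_exact, p_approx)  # both ~0.106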

How to use norm.ppf()?

I couldn't understand how to properly use this function, could someone please explain it to me?
Let's say I have:
a mean of 172.7815
a standard deviation of 4.1532
N = 50 (50 samples)
When I'm asked to calculate the (95%) margin of error using norm.ppf(), will the code look like below?
norm.ppf(0.95, loc=172.78, scale=4.15)
or will it look like this?
norm.ppf(0.95, loc=0, scale=1)
I know it's calculating the area of the curve up to a given percentile (95%, 97.5%, etc.), but when I have a mean and a standard deviation, I get really confused as to how to use the function.
The method norm.ppf() takes a percentage and returns a standard deviation multiplier for what value that percentage occurs at.
It is equivalent to a 'one-tail test' on the density plot.
From scipy.stats.norm:
ppf(q, loc=0, scale=1) Percent point function (inverse of cdf — percentiles).
Standard Normal Distribution
The code:
norm.ppf(0.95, loc=0, scale=1)
Returns a 95% significance interval for a one-tail test on a standard normal distribution (i.e. a special case of the normal distribution where the mean is 0 and the standard deviation is 1).
Our Example
To calculate the value at which our 95% interval lies for the OP-provided example (for a one-tail test), we would use:
norm.ppf(0.95, loc=172.7815, scale=4.1532)
Because loc and scale are supplied here, this returns the actual data value below which 95% of points would fall if our data follow a normal distribution; no further multiplication by the standard deviation is needed. Only when querying the standard normal (loc=0, scale=1) does the output act as a 'standard-deviation multiplier' to be scaled by your own standard deviation.
A Two-Tailed Test
If we need to calculate a 'two-tail test' (i.e. we're concerned with values both greater and less than our mean), then we need to split the alpha, because the calculation method is still one-tailed. A 95% confidence level has a 5% alpha; splitting that 5% across both tails leaves 2.5% in each tail. Taking 2.5% from 100% gives 97.5% as the input probability.
Therefore, if we are concerned with values on both sides of our mean, our code inputs 0.975 to represent a 95% confidence level across two tails:
norm.ppf(0.975, loc=172.7815, scale=4.1532)
Margin of Error
Margin of error is used when estimating a population parameter with a sample statistic. We want to generate our 95% interval using the two-tailed input to norm.ppf(), since we're concerned with values both greater and less than our mean. Note that the multiplier must come from the standard normal, i.e. the default loc=0, scale=1 (passing our mean and standard deviation here would return a data value, not a multiplier):
z = norm.ppf(0.975)  # standard-normal multiplier, ~1.96
Next, we multiply z by our standard deviation to get the interval value:
interval_value = std * z
Finally, we'd mark the confidence intervals by adding & subtracting the interval value from the mean:
lower_95 = mean - interval_value
upper_95 = mean + interval_value
Plot the bounds with vertical lines:
_ = plt.axvline(lower_95, color='r', linestyle=':')
_ = plt.axvline(upper_95, color='r', linestyle=':')
James' statement that norm.ppf returns a "standard deviation multiplier" is wrong. This feels pertinent, as his post is the top Google result when one searches for norm.ppf.
norm.ppf is the inverse of norm.cdf. In the example, it simply returns the value at the 95th percentile. There is no "standard deviation multiplier" involved.
A better answer exists here:
How to calculate the inverse of the normal cumulative distribution function in python?
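A quick way to see the inverse relationship, using the OP's numbers:
from scipy.stats import norm
x = norm.ppf(0.95, loc=172.7815, scale=4.1532)  # value at the 95th percentile
print(norm.cdf(x, loc=172.7815, scale=4.1532))  # 0.95: cdf undoes ppf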
You can figure out the confidence interval with norm.ppf directly, without calculating the margin of error first:
upper_of_interval = norm.ppf(0.975, loc=172.7815, scale=4.1532/np.sqrt(50))
lower_of_interval = norm.ppf(0.025, loc=172.7815, scale=4.1532/np.sqrt(50))
4.1532 is the sample standard deviation, not the standard deviation of the sampling distribution of the sample mean. So scale in norm.ppf should be specified as scale = 4.1532 / np.sqrt(50), which estimates the standard deviation of the sampling distribution.
(The standard deviation of the sampling distribution equals the population standard deviation divided by np.sqrt(sample size). Here we do not know the population standard deviation, but the sample size is more than 30, so sample standard deviation / np.sqrt(sample size) is a good estimator.)
The margin of error can then be calculated as (upper_of_interval - lower_of_interval) / 2.
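Putting it together, a small sketch (the margin_of_error name is mine) showing that this agrees with the z-multiplier route:
import numpy as np
from scipy.stats import norm
mean, std, N = 172.7815, 4.1532, 50
se = std / np.sqrt(N)  # estimated standard deviation of the sampling distribution
upper_of_interval = norm.ppf(0.975, loc=mean, scale=se)
lower_of_interval = norm.ppf(0.025, loc=mean, scale=se)
margin_of_error = (upper_of_interval - lower_of_interval) / 2
z = norm.ppf(0.975)  # standard-normal quantile, ~1.96
print(np.isclose(margin_of_error, z * se))  # True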
Calculate the value at the 95th percentile, then draw a vertical line and an annotation with that value:
import matplotlib.pyplot as plt
from scipy.stats import norm
mean = 172.7815
std = 4.1532
N = 50
results = norm.rvs(mean, std, size=N)
pct_95 = norm.ppf(0.95, mean, std)
plt.hist(results, bins=10)
plt.axvline(pct_95)
plt.annotate(f"{pct_95:.2f}", xy=(pct_95, 6))  # annotate expects a string
plt.show()
As other answers pointed out, norm.ppf(1 - alpha) returns the value at the (1 - alpha) * 100-th percentile of the normal distribution specified by the parameters passed to it. For example, in the OP it returns the 95th percentile of a normal distribution with mean 172.78 and standard deviation 4.15.
If you're looking for a function that takes alpha directly instead, there's the inverse survival function, norm.isf(alpha), which returns the value above which a fraction alpha of the distribution lies.
import numpy as np
from scipy.stats import norm
alpha = 0.05
v1 = norm.isf(alpha)
v2 = norm.ppf(1-alpha)
np.isclose(v1, v2) # True

Generating random value for given cdf

Based on a sample of values of a random variable, I create a density estimate using kernel density estimation:
from scipy.stats import gaussian_kde
density = gaussian_kde(sample)
What I need is to generate sample values of a random variable whose density function equals the constructed estimate. I know about the approach of inverting the cumulative distribution function, but since I cannot do it analytically it requires pretty complicated preparations. Is there an integrated solution, or maybe another way to accomplish the task?
If you're using a kernel density estimator (KDE) with Gaussian kernels, your density estimate is a Gaussian mixture model. This means that the density function is a weighted sum of 'mixture components', where each mixture component is a Gaussian distribution. In a typical KDE, there's a mixture component centered over each data point, and each component is a copy of the kernel. This distribution is easy to sample from without using the inverse CDF method. The procedure looks like this:
Setup
Let mu be a vector where mu[i] is the mean of mixture component i. In a KDE, these will just be the locations of the original data points.
Let sigma be a vector where sigma[i] is the standard deviation of mixture component i. In typical KDEs, this will be the kernel bandwidth, which is shared for all points (but variable-bandwidth variants do exist).
Let w be a vector where w[i] contains the weight of mixture component i. The weights must be positive and sum to 1. In a typical, unweighted KDE, all weights will be 1/(number of data points) (but weighted variants do exist).
Choose the number of random points to sample, n_total
Determine how many points will be drawn from each mixture component.
Let n be a vector where n[i] contains the number of points to sample from mixture component i.
Draw n from a multinomial distribution with "number of trials" equal to n_total and "success probabilities" equal to w. This means the number of points to draw from each mixture component will be randomly chosen, proportional to the component weights.
Draw random values
For each mixture component i:
Draw n[i] values from a normal distribution with mean mu[i] and standard deviation sigma[i]
Shuffle the list of random values, so they have random order.
This procedure is relatively straightforward because random number generators (RNGs) for multinomial and normal distributions are widely available. If your kernels aren't Gaussian but some other probability distribution, you can replicate this strategy, replacing the normal RNG in step 4 with an RNG for that distribution (if it's available). You can also use this procedure to sample from mixture models in general, not just KDEs.
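Here is a minimal sketch of the procedure for a one-dimensional, unweighted Gaussian KDE (the function name and the way the bandwidth is extracted from gaussian_kde are my own choices, not a fixed API):
import numpy as np
from scipy.stats import gaussian_kde

def sample_from_kde(data, bandwidth, n_total, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(data)                # one mixture component per data point
    w = np.full(len(mu), 1.0 / len(mu))  # equal weights for an unweighted KDE
    n = rng.multinomial(n_total, w)      # how many draws come from each component
    samples = np.concatenate(
        [rng.normal(m, bandwidth, size=k) for m, k in zip(mu, n) if k > 0]
    )
    rng.shuffle(samples)                 # random order
    return samples

data = np.random.randn(200)
kde = gaussian_kde(data)
bw = np.sqrt(kde.covariance[0, 0])       # scalar bandwidth of the fitted 1-d KDE
draws = sample_from_kde(data, bw, 1000)
Note that scipy's gaussian_kde already implements essentially this via kde.resample(size), so in practice you may not need to hand-roll it.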

How do I match lomb-scargle and FFT plots of same dataset?

I am doing some work comparing the interpolated FFT of the concentrations of some gases over a period, which is unevenly sampled, with the Lomb-Scargle periodogram of the same data. I am using scipy's fft function to calculate the Fourier transform and then squaring its modulus to give what I believe to be the power spectral density, in units of parts per billion (ppb) squared.
I can get the Lomb-Scargle plot to match almost exactly the pattern of the FFT, but never the same scale of magnitude; the FFT power spectral density is always higher, even though I thought the Lomb-Scargle power was a power spectral density. Now, the Lomb code I am using (http://www.astropython.org/snippet/2010/9/Fast-Lomb-Scargle-algorithm) normalises the dataset by subtracting the average and dividing by 2 times the variance of the data, so I have normalised the FFT data in the same manner, but still the magnitudes do not match.
I then did some more research and found that the normalised Lomb-Scargle power can be unitless, which would explain why I cannot make the plots match. This leads me to two questions:
What units (if any) is the power spectral density of a normalised Lomb-Scargle periodogram in?
How would I proceed to match my FFT plot with my Lomb-Scargle plot, in terms of magnitude and pattern?
Thank you.
The squared modulus of the Fourier transform of a series is defined as the energy spectral density (ESD). You need to divide the ESD by the length of the series to convert to an estimate of power spectral density (PSD).
Units
The units of a PSD are [units]**2/[frequency] where [units] represents the units of your original series.
Normalization
To check for proper normalization, one can numerically integrate the PSD of white noise with known variance. If the integrated spectrum equals the variance of the series, the normalization is correct. Being a factor of 2 too low is not necessarily incorrect, though: it may indicate the PSD is normalized as double-sided; in that case, just multiply by 2 and you have a properly normalized, single-sided PSD.
Using numpy, the randn function generates pseudo-random numbers that are Gaussian distributed. For example
10 * np.random.randn(1, 100)
produces a 1-by-100 array with mean 0 and variance 100. If the sampling frequency is, say, 1 Hz, the single-sided PSD will theoretically be flat at 200 units**2/Hz over [0, 0.5] Hz; the integrated spectrum would thus be 100, equaling the variance of the series.
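A quick numerical version of this check with numpy's FFT (assuming fs = 1 Hz and an even-length series; the folding step doubles every bin except DC and Nyquist):
import numpy as np
fs = 1.0
x = 10 * np.random.randn(100000)      # white noise with variance 100
X = np.fft.rfft(x)
psd = np.abs(X) ** 2 / (len(x) * fs)  # double-sided PSD estimate
psd[1:-1] *= 2                        # fold to single-sided
df = fs / len(x)
print(np.sum(psd) * df)               # ~100, the variance of the series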
Update
I modified the example included in the Python code you linked to in order to demonstrate the normalization for a normally distributed series of length 20, with variance 1 and sampling frequency 10:
import numpy
import lomb
numpy.random.seed(999)
nd = 20
fs = 10
x = numpy.arange(nd)
y = numpy.random.randn(nd)
fx, fy, nout, jmax, prob = lomb.fasper(x, y, 1., fs)
fNy = fx[-1]
fy = fy/fs
Si = numpy.mean(fy)*fNy
print(fNy, Si, Si*2)
This gives, for me:
5.26315789474 0.482185882163 0.964371764327
which shows you a few things:
The "Nyquist" frequency asked for is actually the sampling frequency.
The result needs to be divided by the sampling frequency.
The output is normalized for a double-sided PSD, so multiplying by 2 makes the integrated spectrum nearly 1.
In the time since this question was asked and answered, the AstroPy project has gained a Lomb-Scargle method, and this question is addressed in the documentation: http://docs.astropy.org/en/stable/stats/lombscargle.html#psd-normalization-unnormalized
In brief, you can compute a Fourier periodogram and compare it to the astropy Lomb-Scargle periodogram as follows:
import numpy as np
from astropy.stats import LombScargle
def fourier_periodogram(t, y):
    N = len(t)
    frequency = np.fft.fftfreq(N, t[1] - t[0])
    y_fft = np.fft.fft(y)
    positive = (frequency > 0)
    return frequency[positive], (1. / N) * abs(y_fft[positive]) ** 2
t = np.arange(100)
y = np.random.randn(100)
frequency, PSD_fourier = fourier_periodogram(t, y)
PSD_LS = LombScargle(t, y).power(frequency, normalization='psd')
np.allclose(PSD_fourier, PSD_LS)
# True
Since AstroPy is a common tool used in astronomy, I thought this might be more useful than an answer based on the code snippet mentioned above.
