Generating a normal distribution in order (python, numpy)

I am able to generate random samples from a normal distribution in numpy like this:
>>> mu, sigma = 0, 0.1 # mean and standard deviation
>>> s = np.random.normal(mu, sigma, 1000)
But they come out in random order, obviously. How can I generate the numbers in order, so that the values rise and fall like a normal distribution?
In other words, I want to create a Gaussian curve with a given mu and sigma and n points, where n is something I can input.
How can I do this?

This will do the trick: (1) generate a random sample of n x-coordinates from the normal distribution, (2) evaluate the normal density at each x-value, and (3) sort the x-values by the magnitude of the density at their positions:
import numpy as np

mu, sigma, n = 0., 1., 1000

def normal(x, mu, sigma):
    # normal probability density function
    return (2.*np.pi*sigma**2.)**-.5 * np.exp(-.5*(x - mu)**2./sigma**2.)

x = np.random.normal(mu, sigma, n)    # random points drawn from the normal distribution
y = normal(x, mu, sigma)              # probability density at each point
x, y = x[np.argsort(y)], np.sort(y)   # sort both arrays by the density
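If the goal is just the smooth bell curve itself for a given mu, sigma and number of points, a simpler sketch (my addition; the 4-sigma plotting range and the matplotlib usage are arbitrary choices) is to evaluate the density on an already-sorted grid:
import numpy as np
import matplotlib.pyplot as plt

mu, sigma, n = 0., 1., 1000

# an equally spaced, already-sorted grid covering +/- 4 standard deviations
x = np.linspace(mu - 4*sigma, mu + 4*sigma, n)
y = (2.*np.pi*sigma**2.)**-.5 * np.exp(-.5*(x - mu)**2./sigma**2.)

plt.plot(x, y)
plt.show()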

Related

Creating Histogram from Poisson Distributions with weights (matplotlib)

I am working on a project looking at the Poisson filling of droplets by a contaminant, where the Poisson mean depends on the droplet's volume. There is a distribution of droplet volumes, and each volume has a likelihood given by a Gaussian.
I have a loop generating a Poisson distribution (an array of 2000 numbers) for a different mean in each step. Each distribution has a weight that I generate from a Gaussian. Currently, I am just concatenating all the Poisson arrays and creating one large normalised histogram. I wish to weight the frequency of the numbers in each array so that the histogram takes the weights into account. I am unsure how to do this, however, as it is the frequency of the numbers in each array that has to be weighted, not the numbers themselves.
import numpy as np
from matplotlib import pyplot as plt

def gaussian(mu, sig, x):  # Gaussian gives the weight of each droplet size
    P_r = 1./(np.sqrt(2.*np.pi)*sig)*np.exp(-np.power((x - mu)/sig, 2.)/2)
    return P_r

def poisson(mean):
    P = np.random.poisson(mean, 2000)
    return P

R = np.linspace(45, 75, 2000)  # min and max radius and steps taken between them to generate Poissons
Average_Droplet_Radius = 60
Variance = 15
Mean_Droplet_Average_Occupancy = float(input('Enter mean droplet occupation '))  # Poisson mean

for mu, sig in [(Average_Droplet_Radius, Variance)]:
    prob = gaussian(mu, sig, R)

C = Mean_Droplet_Average_Occupancy / (4/3 * np.pi * Average_Droplet_Radius**3)  # the constant parameter for all distributions

i = 0
a = np.array([])
for cell in R:
    Individual_Mean = C * (4/3 * np.pi * R[i]**3)
    Individual_Weight = prob[i]  # want to weight the frequencies in this Poisson array by this
    b = poisson(Individual_Mean)
    a = np.append(a, b)  # unweighted Poissons combined
    i = i + 1

bins_val = np.arange(0, a.max() + 1.5) - 0.5
count, bins, ignored = plt.hist(a, bins_val, density=True)  # creates an unweighted, normalised histogram
plt.show()
I was unsure how to use the weights argument of plt.hist, as what carries the weight here is a whole array of numbers.
Currently, I get a histogram where each droplet size is equally likely. How can I get the weights into the final distribution?
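One way to approach this (a sketch of my own, not an answer from the original thread): plt.hist accepts a weights array with one entry per data point, so every sample in a given Poisson array can carry the Gaussian weight of the droplet size that generated it. Reusing R, C and prob from the code above:
a = np.array([])
w = np.array([])
for i, r in enumerate(R):
    individual_mean = C * (4/3 * np.pi * r**3)
    b = np.random.poisson(individual_mean, 2000)
    a = np.append(a, b)
    # every sample drawn for this radius gets the same Gaussian weight
    w = np.append(w, np.full(b.shape, prob[i]))

bins_val = np.arange(0, a.max() + 1.5) - 0.5
plt.hist(a, bins_val, weights=w, density=True)  # weighted, normalised histogram
plt.show()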

Scipy Non-central Chi-Squared Random Variable

Consider a sum of n squared iid normal random variables, S = sum(Z_i^2) with Z_i ~ N(mu, sig^2). According to this question, S / sig^2 has a noncentral chi-squared distribution with degrees of freedom = n and non-centrality parameter = n*mu^2.
However, compare generating N of these variables S by summing squared normals with generating N noncentral chi-squared random variables directly using scipy.stats.ncx2:
import numpy as np
from scipy.stats import ncx2, chi2
import matplotlib.pyplot as plt
n = 1000 # number of normals in sum
N_MC = 100000 # number of trials
mu = 0.05
sig = 0.3
### Generate sums of squared normals ###
Z = np.random.normal(loc=mu, scale=sig, size=(N_MC, n))
S = np.sum(Z**2, axis=1)
### Generate non-central chi2 RVs directly ###
dof = n
non_centrality = n*mu**2
NCX2 = sig**2 * ncx2.rvs(dof, non_centrality, size=N_MC)
# NCX2 = sig**2 * chi2.rvs(dof, size=N_MC) # for mu = 0.0
### Plot histos ###
fig, ax = plt.subplots()
ax.hist(S, bins=50, label='S')
ax.hist(NCX2, bins=50, label='NCX2', alpha=0.7)
ax.legend()
plt.show()
This results in two clearly mismatched histograms.
I believe the mathematics is correct; could the discrepancy be a bug in the ncx2 implementation? Setting mu = 0 and using scipy.stats.chi2 instead makes the histograms line up much better.
The problem is in the second sentence of the question: "S / sig^2 has a noncentral chi-squared distribution with degrees of freedom = n and non-centrality parameter = n*mu^2." That non-centrality parameter is not correct. It should be n*(mu/sig)^2.
The standard definition of the noncentral chi-squared distribution is that it is the sum of the squares of normal variates that have mean mu and standard deviation 1. You are computing S using normal variates with standard deviation sig. Let's write that distribution as N(mu, sig**2). By using the location-scale properties of the normal distribution, we have
N(mu, sig**2) = mu + sig*N(0, 1) = sig*(mu/sig + N(0,1)) = sig*N(mu/sig, 1)
So summing the squares of variates from N(mu, sig**2) is equivalent to summing the squares of sig*N(mu/sig, 1). That gives sig**2 times a noncentral chi-squared variate whose noncentrality parameter is (mu/sig)**2 per term, hence n*(mu/sig)**2 for the sum of n terms.
If you change the line where non_centrality is computed to
non_centrality = n*(mu/sig)**2
the histograms line up as you expect.
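A quick moment check (my addition) confirms the corrected parameter, using the variables defined above: E[S] = n*(mu**2 + sig**2), which should match the mean of the scaled noncentral chi-squared variate, sig**2 * (dof + non_centrality).
non_centrality = n*(mu/sig)**2
print(S.mean())                                 # ~ 92.5 = n*(mu**2 + sig**2)
print(sig**2 * ncx2.mean(dof, non_centrality))  # 92.5 exactly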

Calculate the Cumulative Distribution Function (CDF) in Python

How can I calculate the cumulative distribution function (CDF) in Python?
I want to calculate it from an array of points I have (discrete distribution), not with the continuous distributions that, for example, scipy has.
(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF to a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced; if the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
If you have a discrete array of samples and you would like to know the CDF of the sample, then you can just sort the array. If you look at the sorted result, you'll realize that the smallest value represents 0% and the largest value represents 100%. If you want to know the value at 50% of the distribution, just look at the array element in the middle of the sorted array.
Let us have a closer look at this with a simple example:
import matplotlib.pyplot as plt
import numpy as np
# create some randomly distributed data:
data = np.random.randn(10000)
# sort the data:
data_sorted = np.sort(data)
# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)
# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')
ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')
This gives the following plots, where the right-hand plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it does not match it exactly as long as the number of points is finite.
This function is easy to invert, and it depends on your application which form you need.
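As an aside (my addition), the "value at 50%" lookup described above is just a quantile, and numpy can compute it directly:
median = np.quantile(data_sorted, 0.5)  # value at 50% of the distribution
p90 = np.quantile(data_sorted, 0.9)     # value at 90%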
Assuming you know how your data is distributed (i.e. you know its pdf), scipy supports evaluating the CDF at discrete data points:
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete
# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()
We can even print the first few values of the cdf to show they are discrete
print(norm_cdf[:10])
# [0.39216484 0.09554546 0.71268696 0.5007396  0.76484329
#  0.37920836 0.86010018 0.9191937  0.46374527 0.4576634 ]
The same method to calculate the cdf also works for multiple dimensions: we use 2d data below to illustrate
mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
# (1000, 2)
In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm() - there are multiple distributions scipy supports. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just use any distribution to calculate the cdf, you most likely will get incorrect results.
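If you know the family but not its parameters, a common pattern (a sketch, my addition) is to estimate them first with the distribution's fit method and then evaluate the CDF under the fitted parameters:
from scipy.stats import norm

x1d = np.random.randn(10000)   # a fresh 1-d sample for illustration
loc, scale = norm.fit(x1d)     # maximum-likelihood estimates of mean and std
fitted_cdf = norm.cdf(x1d, loc, scale)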
The empirical cumulative distribution function is a CDF that jumps exactly at the values in your data set. It is the CDF for a discrete distribution that places a mass at each of your values, where the mass is proportional to the frequency of the value. Since the sum of the masses must be 1, these constraints determine the location and height of each jump in the empirical CDF.
Given an array a of values, you compute the empirical CDF by first obtaining the frequencies of the values. The numpy function unique() is helpful here because it returns not only the frequencies, but also the values in sorted order. To calculate the cumulative distribution, use the cumsum() function, and divide by the total sum. The following function returns the values in sorted order and the corresponding cumulative distribution:
import numpy as np

def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]
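For example, using the same values as the plotting example below:
>>> x, y = ecdf([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
>>> x
array([1. , 2. , 4. , 5.5, 7. ])
>>> y
array([0.1, 0.3, 0.6, 0.7, 1. ])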
To plot the empirical CDF you can use matplotlib's plot() function. The option drawstyle='steps-post' ensures that jumps occur at the right place. However, you need to force a jump at the smallest data value, so it's necessary to insert an additional element in front of x and y.
import matplotlib.pyplot as plt

def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])   # force a jump at the smallest data value
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
Example usages:
import pandas as pd

xvec = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
plot_ecdf(xvec)

df = pd.DataFrame({'x': [7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7]})
plot_ecdf(df['x'])
Both calls produce the same step plot as output.
For calculating the CDF for an array of discrete numbers:
import numpy as np

data = np.random.randn(10000)  # example data; any 1-d array works

pdf, bin_edges = np.histogram(
    data,          # array of data
    bins=500,      # number of bins for the distribution function
    density=True   # True returns the probability density function (pdf) instead of counts
)
cdf = np.cumsum(pdf * np.diff(bin_edges))
Note that the returned array pdf has the length of bins (500 here) and bin_edges has the length of bins+1 (501 here).
So, to calculate the CDF, which is nothing but the area below the PDF curve, we simply take the cumulative sum of the bin widths (np.diff(bin_edges)) times the pdf, using numpy's cumsum function.
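As a quick sanity check (my addition), the final cumulative value should be close to 1, since with density=True the pdf integrates to 1 over the binned range:
print(cdf[-1])  # ~ 1.0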
Here's an alternative pandas solution for calculating the empirical CDF, using pd.cut to sort the data into evenly spaced bins first, and then cumsum to compute the distribution.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def empirical_cdf(s: pd.Series, n_bins: int = 100):
    # Sort the data into `n_bins` evenly spaced bins:
    discretized = pd.cut(s, n_bins)
    # Count the number of datapoints in each bin:
    bin_counts = discretized.value_counts().sort_index().reset_index()
    # Locate each bin at the midpoint of its interval:
    intervals = pd.IntervalIndex(bin_counts["index"])
    bin_counts["loc"] = (intervals.left + intervals.right) / 2
    # Compute the CDF with cumsum, normalized by the total count:
    return bin_counts.set_index("loc").iloc[:, -1].cumsum() / len(s)
Below is an example use of the function to discretize the distribution of 10000 datapoints into 100 evenly spaced bins:
s = pd.Series(np.random.randn(10000))
cdf = empirical_cdf(s, n_bins=100)
fig, ax = plt.subplots()
ax.scatter(cdf.index, cdf.values)
import numpy as np
import matplotlib.pyplot as plt

def get_discrete_cdf(values):
    # rescale the values to [0, 1]
    values = np.asarray(values, dtype=float)
    values = (values - np.min(values)) / (np.max(values) - np.min(values))
    values_sort = np.sort(values)
    values_sum = np.sum(values)

    # running sum of the sorted values
    values_sums = []
    cur_sum = 0
    for it in values_sort:
        cur_sum += it
        values_sums.append(cur_sum)

    # fraction of the total accumulated up to each original value
    cdf = [values_sums[np.searchsorted(values_sort, it)] / values_sum for it in values]
    return cdf

rand_values = [np.random.normal(loc=0.0) for _ in range(1000)]
_ = plt.hist(rand_values, bins=20)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("nums")

cdf = get_discrete_cdf(rand_values)

x_p = list(zip(rand_values, cdf))
x_p.sort(key=lambda it: it[0])
x = [it[0] for it in x_p]
y = [it[1] for it in x_p]

_ = plt.plot(x, y)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("prob")

Python: Uniform distribution of points on a 4-dimensional sphere

I need a uniform distribution of points on a 4 dimensional sphere. I know this is not as trivial as picking 3 angles and using polar coordinates.
In 3 dimensions I use
from math import acos, sin, cos, pi
from random import random

u = random()
costheta = 2*u - 1   # for a distribution between -1 and 1
theta = acos(costheta)
phi = 2*pi*random()

x = costheta
y = sin(theta)*cos(phi)
z = sin(theta)*sin(phi)
This gives a uniform distribution of points (x, y, z) on the unit sphere.
How can I obtain a similar distribution for 4 dimensions?
A standard way, though perhaps not the fastest, is to use Muller's method to generate uniformly distributed points on an N-sphere:
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as axes3d
N = 600
dim = 3
norm = np.random.normal
normal_deviates = norm(size=(dim, N))
radius = np.sqrt((normal_deviates**2).sum(axis=0))
points = normal_deviates/radius
fig, ax = plt.subplots(subplot_kw=dict(projection='3d'))
ax.scatter(*points)
ax.set_aspect('equal')
plt.show()
Simply change dim = 3 to dim = 4 to generate points on a 4-sphere.
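For instance, the 4-D case, with a check that every point lands on the unit sphere in R^4 (my addition):
deviates = np.random.normal(size=(4, N))
points_4d = deviates / np.sqrt((deviates**2).sum(axis=0))
print(np.allclose((points_4d**2).sum(axis=0), 1.0))  # True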
Take a point in 4D space whose coordinates are distributed normally, and calculate its unit vector. This will be on the unit 4-sphere.
import random
import math

x = random.normalvariate(0, 1)
y = random.normalvariate(0, 1)
z = random.normalvariate(0, 1)
w = random.normalvariate(0, 1)
r = math.sqrt(x*x + y*y + z*z + w*w)
x /= r
y /= r
z /= r
w /= r
print(x, y, z, w)
I like @unutbu's answer if the Gaussian sampling really produces an evenly distributed spherical sample (rather than one biased toward the corners of a cube), but to avoid sampling from a Gaussian distribution, and having to prove that it works, there is a simple alternative: rejection-sample from a uniform distribution, as sketched below.
Generate points from a uniform distribution on the cube [-1, 1]^n.
Compute the squared radius of each point (this avoids the square root).
Discard points:
Discard points whose squared radius is greater than 1 (and thus whose radius is greater than 1).
Discard points too close to a radius of zero, to avoid numerical instabilities in the division of the next step.
For each remaining point, divide it by its norm to project it onto the unit sphere.
Rinse and repeat for more points, because of the discarded samples.
This obviously works in an n-dimensional space, since the radius is always the L2 norm. It is fast in that it avoids the square root and Gaussian sampling, but as written it is not a vectorized algorithm, and the acceptance rate drops as the dimension grows.
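A minimal sketch of this rejection procedure (my addition; batched with numpy for speed, though the acceptance loop remains):
import numpy as np

def sample_sphere_rejection(n_points, dim=4, eps=1e-9):
    accepted = []
    count = 0
    while count < n_points:
        # 1. uniform points in the cube [-1, 1]^dim
        p = np.random.uniform(-1.0, 1.0, size=(n_points, dim))
        # 2. squared radius (no square root needed for the test)
        r2 = (p**2).sum(axis=1)
        # 3. keep points inside the unit ball and away from the origin
        keep = (r2 <= 1.0) & (r2 > eps)
        # 4. project the survivors onto the sphere
        accepted.append(p[keep] / np.sqrt(r2[keep])[:, None])
        count += accepted[-1].shape[0]
    return np.concatenate(accepted)[:n_points]

points = sample_sphere_rejection(1000)  # 1000 points on the unit sphere in R^4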
I found a good solution for sampling from the N-dimensional sphere. The main idea is:
If Y is drawn from the uncorrelated multivariate normal distribution, then S = Y / ||Y|| has the uniform distribution on the unit d-sphere. Multiplying S by U^(1/d), where U has the uniform distribution on the unit interval (0, 1), creates the uniform distribution in the unit d-dimensional ball.
Here is the python code to do this:
import numpy as np

n_dims, n_samples, radius = 4, 1000, 1.0

Y = np.random.multivariate_normal(mean=[0], cov=np.eye(1, 1), size=(n_dims, n_samples))
Y = np.squeeze(Y, -1)
Y /= np.sqrt(np.sum(Y**2, axis=0))                            # project onto the unit sphere
U = np.random.uniform(low=0, high=1, size=n_samples) ** (1./n_dims)
Y *= U * radius                                               # scale into the ball (radius is one in my case)
Dropping the last scaling step leaves the points on the sphere itself, which is what I wanted.

Probability density function from histogram in python to fit another histogram

I have a question concerning fitting and drawing random numbers.
The situation is as follows:
First, I have a histogram from data points.
import numpy as np

""" create random data points """
mu = 10
sigma = 5
n = 1000
datapoints = np.random.normal(mu, sigma, n)

""" create normalized histogram of the data """
bins = np.linspace(0, 20, 21)
H, bins = np.histogram(datapoints, bins, density=True)
I would like to interpret this histogram as a probability density function (with e.g. 2 free parameters) so that I can use it to produce random numbers, AND I would also like to use that function to fit another histogram.
Thanks for your help
You can use a cumulative density function to generate random numbers from an arbitrary distribution, as described here.
Using a histogram to produce a smooth cumulative density function is not entirely trivial; you can use interpolation, for example scipy.interpolate.interp1d(), for values in between the centers of your bins, and that will work fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), use any other form of tail appropriate to your problem, or simply truncate the distribution.
Example:
import numpy
import scipy.interpolate
import random
import matplotlib.pyplot as pyplot
# create some normally distributed values and make a histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
cum_counts = numpy.cumsum(counts)
bin_widths = (bins[1:] - bins[:-1])
# generate more values with same distribution
x = cum_counts*bin_widths
y = bins[1:]
inverse_density_function = scipy.interpolate.interp1d(x, y)
b = numpy.zeros(10000)
for i in range(len(b)):
    u = random.uniform(x[0], x[-1])
    b[i] = inverse_density_function(u)
# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
pyplot.show()
This doesn't handle tails, and it could handle bin edges better, but it should get you started on using a histogram to generate more values with the same distribution.
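The loop can also be replaced with a vectorized lookup (my addition, assuming the cumulative values in x are strictly increasing, i.e. no empty bins):
u = numpy.random.uniform(x[0], x[-1], 10000)
b = numpy.interp(u, x, y)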
P.S. You could also try to fit a specific known distribution described by a few values (which I think is what you mentioned in the question), but the above non-parametric approach is more general-purpose.
