I am currently having issue with the implementation of the Metropolis-Hastings algorithm.
I am trying to use the algorithm to calculate integrals of the form
In using this algorithm, we can obtain a long chain of configurations ( in this case, each configuration is just a single numbers) such that in the tail-end of the chain the probability of having a particular configuration follows (or rather tends to) a gaussian distribution.
My code seems to be messing up with obtaining the said gaussian distributions. There is a strange dependence on the transition probablity (the probablity of picking a new candidate configuration depending on the previous configuration in the chain). However, if this transition probability is symmetric, there should be no dependence on this function at all (it only affects speed at which phase space [space of potential configurations] is explored and how quickly the chain converges to the desired distribution)!
In my case I am using a normal distribution transition function (which satisfies the need to be symmetric), with width d.
For each d I use I do indeed get a gaussian distribution however the standard deviation, sigma, depends on my choice of d.
The resulting gaussian should have a sigma of roughly 0.701 but I find that the value I actually get depends on the parameter d, when it shouldn't.
I am not sure where the error in this code is, any help would be greatly appreciated!
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
'''
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-hasting alg will generate configuartions (in this case, single numbers) such that
the probablity of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} generated, the apporximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
'''
'''
Implementing metropolis-hasting algorithm
'''
#Setting up the initial config for our chain, generating first 2 to generate numpy array
x_0 = np.random.uniform(-20,-10,2)
#Defining function that generates the next N steps in the chain, given a starting config x
#Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
#Success and failures implemented to see roughly the success rate of each step
def next_steps(x,N):
i = 0
Success = 0
Failures = 0
Data = np.array(x)
d = 1.5 #Spread of (normal) transition function
while i < N:
r = np.random.uniform(0,1)
delta = np.random.normal(0,d)
x_old = Data[-1]
x_new = x_old + delta
hasting_ratio = np.exp(-(x_new**2-x_old**2) )
if hasting_ratio > r:
i = i+1
Data = np.append(Data,x_new)
Success = Success +1
else:
Failures = Failures + 1
print(Success)
print(Failures)
return Data
#Number of steps in the chain
N_iteration = 50000
#Generating the data
Data = next_steps(x_0,N_iteration)
#Plotting data to see convergence of chain to gaussian distribution
plt.plot(Data)
plt.show()
#Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
#Plotting a histogram to visually see if guassian
plt.hist(Data, bins = 300)
plt.show()
You need to save x even when it doesn't change. Otherwise the center values are under-counted, and more so as d increases, which increases the variance.
import numpy as np
from scipy.stats import norm
"""
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-hasting alg will generate configuartions (in this case, single numbers) such that
the probablity of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} generated, the apporximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
"""
"""
Implementing metropolis-hasting algorithm
"""
# Setting up the initial config for our chain
x_0 = np.random.uniform(-20, -10)
# Defining function that generates the next N steps in the chain, given a starting config x
# Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
# Success and failures implemented to see roughly the success rate of each step
def next_steps(x, N):
Success = 0
Failures = 0
Data = np.empty((N,))
d = 1.5 # Spread of (normal) transition function
for i in range(N):
r = np.random.uniform(0, 1)
delta = np.random.normal(0, d)
x_new = x + delta
hasting_ratio = np.exp(-(x_new ** 2 - x ** 2))
if hasting_ratio > r:
x = x_new
Success = Success + 1
else:
Failures = Failures + 1
Data[i] = x
print(Success)
print(Failures)
return Data
# Number of steps in the chain
N_iteration = 50000
# Generating the data
Data = next_steps(x_0, N_iteration)
# Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
Related
I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
# Number of evaluation points
N = eval_points
step = (end_value - start_value) / (N - 1) # Step size
x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how i'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
# Number of evaluation points
N = eval_points
step = (end_value - start_value) / (N - 1) # Step size
x = np.linspace(start_value, end_value, N)[:, np.newaxis] # Generate values in the range
kd_vals = np.exp(kd.score_samples(x)) # Get PDF values for each x
probability = np.sum(kd_vals * step) # Approximate the integral of the PDF
return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
+ str(YEAR) + '.csv.gz?raw=True',compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
y = np.random.choice(df, 500)
avg = np.mean(y)
sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
Make a more "normal" distribution with sampling means in order to incorporate cdfs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than individual samples. Is this not encouraged?)
or
If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it might be tricky to combine the right tools to do so. In particular, since there are several statistical approaches to do so.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the probability density function(PDF) f's graph on the interval [lower, upper].
However, when the CDF/PDF is unknown, it constitutes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph enclosed with the interval will do. But there are several paradigms and estimation procedures to obtain it.
1. Parametric estimation
One could assume that the data x is set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka scale) and sigma (aka standard deviation or scale). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
from scipy import stats
from matplotlib import pyplot as plt
lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit parameter
loc_hat, scale_hat = stats.norm.fit(x)
# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)
# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat) ,
facecolor='red',
alpha=0.35)
plt.text(x=0.1, y=0.1, s= 'p=' + str(round(p, 3)))
plt.show()
which yields
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it by estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.inegrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]
# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))
def f_pred(x):
'''wrapper function to compute probability'''
return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
xaxis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(xaxis.reshape(-1, 1))))
plt.fill_between(x = np.arange(lower, upper, 0.01),
y1 = np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
facecolor='red',
alpha=0.35)
plt.text(x=0.15, y=0.1, s= 'p=' + str(round(p[0], 3)))
plt.show()
and yields
I do see a bug in the get_probability function, but that bug causes it to compute a too high result - in np.sum(kd_vals * step), it's multiplying N sample values by a step with N-1 in the denominator, effectively resulting in an output a factor of N/(N-1) too high. (If they wanted to use a trapezoid rule computation for the integral, they should have divided the left and right endpoint values by 2 first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
I'm looking to generate random normally distributed numbers between 1 and 0, but as the mean moves closer to 1 or 0, the right or left side respectively becomes "squished".
After modifying the normal distribution and playing around with sliders in geogebra, I came up with the following:
Next I needed to create a method in python which would generate random samples that would be distributed according to this PDF.
Originally I thought the only way to do this was to try and derive a new equation for generating random numbers as seen in the Box-Muller proof (which I got by following along with this tutorial).
However, I thought there might be an easier way to do this by using the numpy library's np.random.choice() method.
After all, I should be able to integrate the PDF at a very small step size and get the various probabilities for said steps (approximately of course).
So with that I wrote the following script:
# Standard libs
import math
# Third party libs
import numpy as np
from alive_progress import alive_bar
from matplotlib import pyplot as plt
class RandomNumberGenerator:
def __init__(self):
pass
def clamped_normal_distribution(self, mu: float,
stddev: float, x: float):
""" Computes a value from the clamped normal distribution """
divideByZeroAvoider = 1e-5
if x < 0 or x > 1:
return 0
elif x >= 0 and x <= mu:
return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
* (1/(x**2 + divideByZeroAvoider)))
elif x <= 1 and x > mu:
return math.exp(-0.5*( (x - mu) / (stddev) )**2 \
* (1/((1-x)**2 + divideByZeroAvoider)))
else:
print("This shouldn't happen!: {}".format(x))
return 0
if __name__ == '__main__':
rng = RandomNumberGenerator()
mu = 0.7
stddev = 1
stepSize = 1e-3
x = np.linspace(stepSize,1, int(1/stepSize) - 1)
# Determine the total area under the curve
samples = []
print("Generating samples...")
with alive_bar(len(x.tolist())) as bar:
for i in x:
samples.append(rng.clamped_normal_distribution(
mu, stddev, i))
bar()
area = np.trapz(samples, dx=stepSize)
print("Area = {}".format(area))
# Determine the probability of x falling in a specific interval
probabilities = []
print("Generating probabilties...")
with alive_bar(len(x.tolist())) as bar:
for i in x:
lead = rng.clamped_normal_distribution(mu,
stddev, i)
lag = rng.clamped_normal_distribution(mu,
stddev, i - stepSize)
probability = np.trapz(
np.array([lag, lead]),
dx=stepSize)
# Divide by the area because this isn't a standard normal
probabilities.append(probability / area)
bar()
# Should be approximately 1
print("Probability: {}".format(sum(probabilities)))
plt.plot(x, probabilities)
plt.show()
y = []
print("Performing distribution test...")
testSize = int(10e3)
with alive_bar(testSize) as bar:
for _ in range(testSize):
randSamp = np.random.choice(samples, p=probabilities)
y.append(randSamp)
bar()
plt.hist(y,300)
plt.show()
The first plot of the probabilities against the linearly spaced samples looks promising, giving me the following graph:
However, if we use these samples as choices with given probabilities, we get the following histogram:
I have no idea why this isn't working correctly.
I've tried other (smaller) examples like the ones listed on the numpy website, and they produce histograms of the according to the given probabilities array.
I'd really appreciate some advice/intuition if at all possible :).
It looks like there is a problem with the first argument in the call np.random.choice(samples, p=probabilities). The first argument should be x, not samples.
ADDITION BY AUTHOR:
The reason for this is the samples are the values of the curve (i.e. the y-axis and NOT the x-axis).
Thus the values with the highest probabilities (i.e. the samples around the mean) all have a value of ~1, which is why we see such a massive spike around the value 1.
Changing this to x gives us the following graphs (for 10e3 samples):
Working as expected, very nice.
In scipy there is no support for fitting discrete distributions using data. I know there are a lot of subject about this.
For example if i have an array like below:
x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]
I couldn't apply for this array:
from scipy.stats import nbinom
param = nbinom.fit(x)
But i would like to ask you up to date, is there any way to fit for these three discrete distributions and then choose the best fit for the discrete dataset?
You can use Method of Moments to fit any particular distribution.
Basic idea: get empirical first, second, etc. moments, then derive distribution parameters from these moments.
So, in all these cases we only need two moments. Let's get them:
import pandas as pd
# for other distributions, you'll need to implement PMF
from scipy.stats import nbinom, poisson, geom
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {} # we'll use it later
Note: I used pandas instead of numpy. That is because numpy's var() and std() don't apply Bessel's correction, while pandas' do. If you have 100+ samples, there shouldn't be much difference, but on smaller samples it could be important.
Now, let's get parameters for these distributions. Negative binomial has two parameters: p, r. Let's estimate them and calculate likelihood of the dataset:
# From the wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# without wiki, you could use MGF to get moments; too long to explain here
# Solving for p and r, we get:
p = 1 - mean / var # TODO: check for zero variance and limit p by [0, 1]
r = (1-p) * mean / p
UPD: Wikipedia and scipy are using different definitions of p, one treating it as probability of success and another as probability of failure. So, to be consistent with scipy notion, use:
p = mean / var
r = p * mean / (1-p)
END OF UPD
UPD2:
I'd suggest using #thilak's code log likelihood instead. It allows to avoid loss of precision, which is especially important on large samples.
END OF UPD2
Calculate likelihood:
likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()
Same for Poisson, there is only one parameter:
# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()
Same for Geometric distribution:
# mean = 1 / p # this form fits the scipy definition
p = 1 / mean
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()
Finally, let's get the best fit:
best_fit = max(likelihoods, key=lambda x: likelihoods[x])
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])
Let me know if you have any questions
Great answer by Marat.
In addition to Marat's post I would most certainly recommend taking log of the probability mass function. Some information on why log likelihood is preferred over likelihood- https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution
I would rewrite the code for Negative Binomial to-
log_likelihoods = {}
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()
Note that I have used-
logpmf instead of pmf
sum instead of product
And to find out the best distribution-
best_fit = max(log_likelihoods, key=lambda x: log_likelihoods[x])
print("Best fit:", best_fit)
print("log_Likelihood:", log_likelihoods[best_fit])
The random module (http://docs.python.org/2/library/random.html) has several fixed functions to randomly sample from. For example random.gauss will sample random point from a normal distribution with a given mean and sigma values.
I'm looking for a way to extract a number N of random samples between a given interval using my own distribution as fast as possible in python. This is what I mean:
def my_dist(x):
# Some distribution, assume c1,c2,c3 and c4 are known.
f = c1*exp(-((x-c2)**c3)/c4)
return f
# Draw N random samples from my distribution between given limits a,b.
N = 1000
N_rand_samples = ran_func_sample(my_dist, a, b, N)
where ran_func_sample is what I'm after and a, b are the limits from which to draw the samples. Is there anything of that sort in python?
You need to use Inverse transform sampling method to get random values distributed according to a law you want. Using this method you can just apply inverted function
to random numbers having standard uniform distribution in the interval [0,1].
After you find the inverted function, you get 1000 numbers distributed according to the needed distribution this obvious way:
[inverted_function(random.random()) for x in range(1000)]
More on Inverse Transform Sampling:
http://en.wikipedia.org/wiki/Inverse_transform_sampling
Also, there is a good question on StackOverflow related to the topic:
Pythonic way to select list elements with different probability
This code implements the sampling of n-d discrete probability distributions. By setting a flag on the object, it can also be made to be used as a piecewise constant probability distribution, which can then be used to approximate arbitrary pdf's. Well, arbitrary pdfs with compact support; if you efficiently want to sample extremely long tails, a non-uniform description of the pdf would be required. But this is still efficient even for things like airy-point-spread functions (which I created it for, initially). The internal sorting of values is absolutely critical there to get accuracy; the many small values in the tails should contribute substantially, but they will get drowned out in fp accuracy without sorting.
class Distribution(object):
"""
draws samples from a one dimensional probability distribution,
by means of inversion of a discrete inverstion of a cumulative density function
the pdf can be sorted first to prevent numerical error in the cumulative sum
this is set as default; for big density functions with high contrast,
it is absolutely necessary, and for small density functions,
the overhead is minimal
a call to this distibution object returns indices into density array
"""
def __init__(self, pdf, sort = True, interpolation = True, transform = lambda x: x):
self.shape = pdf.shape
self.pdf = pdf.ravel()
self.sort = sort
self.interpolation = interpolation
self.transform = transform
#a pdf can not be negative
assert(np.all(pdf>=0))
#sort the pdf by magnitude
if self.sort:
self.sortindex = np.argsort(self.pdf, axis=None)
self.pdf = self.pdf[self.sortindex]
#construct the cumulative distribution function
self.cdf = np.cumsum(self.pdf)
#property
def ndim(self):
return len(self.shape)
#property
def sum(self):
"""cached sum of all pdf values; the pdf need not sum to one, and is imlpicitly normalized"""
return self.cdf[-1]
def __call__(self, N):
"""draw """
#pick numbers which are uniformly random over the cumulative distribution function
choice = np.random.uniform(high = self.sum, size = N)
#find the indices corresponding to this point on the CDF
index = np.searchsorted(self.cdf, choice)
#if necessary, map the indices back to their original ordering
if self.sort:
index = self.sortindex[index]
#map back to multi-dimensional indexing
index = np.unravel_index(index, self.shape)
index = np.vstack(index)
#is this a discrete or piecewise continuous distribution?
if self.interpolation:
index = index + np.random.uniform(size=index.shape)
return self.transform(index)
if __name__=='__main__':
shape = 3,3
pdf = np.ones(shape)
pdf[1]=0
dist = Distribution(pdf, transform=lambda i:i-1.5)
print dist(10)
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
And as a more real-world relevant example:
x = np.linspace(-100, 100, 512)
p = np.exp(-x**2)
pdf = p[:,None]*p[None,:] #2d gaussian
dist = Distribution(pdf, transform=lambda i:i-256)
print dist(1000000).mean(axis=1) #should be in the 1/sqrt(1e6) range
import matplotlib.pyplot as pp
pp.scatter(*dist(1000))
pp.show()
Here is a rather nice way of performing inverse transform sampling with a decorator.
import numpy as np
from scipy.interpolate import interp1d
def inverse_sample_decorator(dist):
def wrapper(pnts, x_min=-100, x_max=100, n=1e5, **kwargs):
x = np.linspace(x_min, x_max, int(n))
cumulative = np.cumsum(dist(x, **kwargs))
cumulative -= cumulative.min()
f = interp1d(cumulative/cumulative.max(), x)
return f(np.random.random(pnts))
return wrapper
Using this decorator on a Gaussian distribution, for example:
#inverse_sample_decorator
def gauss(x, amp=1.0, mean=0.0, std=0.2):
return amp*np.exp(-(x-mean)**2/std**2/2.0)
You can then generate sample points from the distribution by calling the function. The keyword arguments x_min and x_max are the limits of the original distribution and can be passed as arguments to gauss along with the other key word arguments that parameterise the distribution.
samples = gauss(5000, mean=20, std=0.8, x_min=19, x_max=21)
Alternatively, this can be done as a function that takes the distribution as an argument (as in your original question),
def inverse_sample_function(dist, pnts, x_min=-100, x_max=100, n=1e5,
**kwargs):
x = np.linspace(x_min, x_max, int(n))
cumulative = np.cumsum(dist(x, **kwargs))
cumulative -= cumulative.min()
f = interp1d(cumulative/cumulative.max(), x)
return f(np.random.random(pnts))
I was in a similar situation but I wanted to sample from a multivariate distribution, so, I implemented a rudimentary version of Metropolis-Hastings (which is an MCMC method).
def metropolis_hastings(target_density, size=500000):
burnin_size = 10000
size += burnin_size
x0 = np.array([[0, 0]])
xt = x0
samples = []
for i in range(size):
xt_candidate = np.array([np.random.multivariate_normal(xt[0], np.eye(2))])
accept_prob = (target_density(xt_candidate))/(target_density(xt))
if np.random.uniform(0, 1) < accept_prob:
xt = xt_candidate
samples.append(xt)
samples = np.array(samples[burnin_size:])
samples = np.reshape(samples, [samples.shape[0], 2])
return samples
This function requires a function target_density which takes in a data-point and computes its probability.
For details check-out this detailed answer of mine.
import numpy as np
import scipy.interpolate as interpolate
def inverse_transform_sampling(data, n_bins, n_samples):
hist, bin_edges = np.histogram(data, bins=n_bins, density=True)
cum_values = np.zeros(bin_edges.shape)
cum_values[1:] = np.cumsum(hist*np.diff(bin_edges))
inv_cdf = interpolate.interp1d(cum_values, bin_edges)
r = np.random.rand(n_samples)
return inv_cdf(r)
So if we give our data sample that has a specific distribution, the inverse_transform_sampling function will return a dataset with exactly the same distribution. Here the advantage is that we can get our own sample size by specifying it in the n_samples variable.
I can compute the autocorrelation using numpy's built in functionality:
numpy.correlate(x,x,mode='same')
However the resulting correlation is naturally noisy. I can partition my data, and compute the correlation on each resulting window, then average them all together to compute cleaner autocorrelation, similar to what signal.welch does. Is there a handy function in either numpy or scipy that does this, possibly faster than I would get if I were to compute partition and loop through the data myself?
UPDATE
This is motivated by #kazemakase answer. I have tried to show what I mean with some code used to generate the figure below.
One can see that #kazemakase is correct with the fact that the AC function naturally averages out the noise. However the averaging of the AC has the advantage that it is much faster! np.correlate seems to scale as the slow O(n^2) rather than O(nlogn) that I would expect if the correlation was calculated using circular convolution via the FFT...
from statsmodels.tsa.arima_model import ARIMA
import statsmodels as sm
import matplotlib.pyplot as plt
import numpy as np
np.random.seed(12345)
arparams = np.array([.75, -.25, 0.2, -0.15])
maparams = np.array([.65, .35])
ar = np.r_[1, -arparams] # add zero-lag and negate
ma = np.r_[1, maparams] # add zero-lag
x = sm.tsa.arima_process.arma_generate_sample(ar, ma, 10000)
def calc_rxx(x):
x = x-x.mean()
N = len(x)
Rxx = np.correlate(x,x,mode="same")[N/2::]/N
#Rxx = np.correlate(x,x,mode="same")[N/2::]/np.arange(N,N/2,-1)
return Rxx/x.var()
def avg_rxx(x,nperseg=1024):
rxx_windows = []
Nw = int(np.floor(len(x)/nperseg))
print Nw
first = True
for i in range(Nw-1):
xw = x[i*nperseg:nperseg*(i+1)]
y = calc_rxx(xw)
if i%1 == 0:
if first:
plt.semilogx(y,"k",alpha=0.2,label="Short AC")
first = False
else:
plt.semilogx(y,"k",alpha=0.2)
rxx_windows.append(y)
print np.shape(rxx_windows)
return np.mean(rxx_windows,axis=0)
plt.figure()
r_avg = avg_rxx(x,nperseg=300)
r = calc_rxx(x)
plt.semilogx(r_avg,label="Average AC")
plt.semilogx(r,label="Long AC")
plt.xlabel("Lag")
plt.ylabel("Auto-correlation")
plt.legend()
plt.xlim([0,150])
plt.show()
TL-DR: To decrease noise in the autocorrelation function increase the length of your signal x.
Partitioning the data and averaging like in spectral estimation is an interesting idea. I wish it would work...
The autocorrelation is defined as
Let's say we partition the data into two windows. Their autocorrelations become
Note how they are only different in the limits of the sumations. Basically, we split the summation of the autocorrelation into two parts. When we add these back together we are back to the original autocorrelation! So we did not gain anything.
The conclusion is, there is no such thing implemented in numpy/scipy because there is no point in doing so.
Remarks:
I hope it's easy to see that this extends to any number of partitions.
to keep it simple I left the normalization out. If you divide Rxx by n and the partial Rxx by n/2 you get Rxx / n == (Rxx1 * 2/n + Rxx2 * 2/n) / 2. I.e. The mean of the normalized partial autocorrelation is equal to the complete normalized autocorrelation.
to keep it even simpler I assumed the signal x could be indexed beyond the limits of 0 and n-1. In practice, if the signal is stored in an array this is often not possible. In this case there is a small difference between the full and the partialized autocorrelations that increases with the lag l. Unfortunately, this is merely a loss of precision and does not reduce noise.
Code heretic! I don't belive your evil math!
Of course we can try things out and see:
import matplotlib.pyplot as plt
import numpy as np
n = 2**16
n_segments = 8
x = np.random.randn(n) # data
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
segments = x.reshape(n_segments, -1)
m = segments.shape[1]
rs = []
for y in segments:
ry = np.correlate(y, y, mode='same') / m # partial ACF
rs.append(ry)
l2 = np.arange(-m//2, m//2) # lags of partial ACFs
plt.plot(l1, rx, label='full ACF')
plt.plot(l2, np.mean(rs, axis=0), label='partial ACF')
plt.xlim(-m, m)
plt.legend()
plt.show()
Although we used 8 segments to average the ACF, the noise level visually stays the same.
Okay, so that's why it does not work but what is the solution?
Here are the good news: Autocorrelation is already a noise reduction technique! Well, in some way at least: An application of the ACF is to find periodic signals hidden by noise.
Since noise (ideally) has zero mean, its influence diminishes the more elements we sum up. In other words, you can reduce noise in the autocorrelation by using longer signals. (I guess this is probably not true for every type of noise, but should hold for the usual Gaussian white noise and its relatives.)
Behold the noise getting lower with more data samples:
import matplotlib.pyplot as plt
import numpy as np
for n in [2**6, 2**8, 2**12]:
x = np.random.randn(n)
rx = np.correlate(x, x, mode='same') / n # ACF
l1 = np.arange(-n//2, n//2) # Lags
plt.plot(l1, rx, label='n={}'.format(n))
plt.legend()
plt.xlim(-20, 20)
plt.show()