I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np
x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges at 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x)  # this is how I'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))
def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)
get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
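Something like this sketch is what I have in mind (untested; lower and upper would be the bounds of the interval):
mask = (xs >= lower) & (xs <= upper)  # restrict to the interval of interest
p = np.trapz(ys[mask], xs[mask])  # trapezoidal rule over the plotted KDE curve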
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's EPA (expected points added) per pass from last season:
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_' \
    + str(YEAR) + '.csv.gz?raw=True', compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady','epa'].copy()
# tom brady's distribution
sns.kdeplot(df)
sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)
# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to either
1. make a more "normal" distribution of sampling means in order to incorporate CDFs, if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than individual samples; is this not encouraged?), or
2. if the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools, since there are several statistical approaches to choose from.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed in between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the probability density function (PDF) f's graph on the interval [lower, upper].
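For instance, for a standard normal distribution this is a one-liner, and it recovers the 68% from the 68-95-99.7 rule mentioned in the question:
from scipy import stats
# p = F(upper) - F(lower) with the standard normal CDF
p = stats.norm.cdf(1.0) - stats.norm.cdf(-1.0)  # ~0.6827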
However, when the CDF/PDF is unknown, it becomes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph encloses over the interval will do. But there are several paradigms and estimation procedures to obtain it.
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (aka location) and sigma (aka standard deviation or scale). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit parameters
loc_hat, scale_hat = stats.norm.fit(x)

# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)

# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s='p=' + str(round(p, 3)))
plt.show()
which yields a plot of the fitted normal PDF with the probability mass over [lower, upper] shaded in red.
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it via estimated parameters as seen above). Kernel density estimation is the most popular variant. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad

x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))

def f_pred(x):
    '''wrapper function evaluating the estimated density at a single point'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]

# quad returns a (integral_estimate, absolute_error) tuple
p = quad(func=f_pred, a=lower, b=upper)
# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s='p=' + str(round(p[0], 3)))
plt.show()
and yields the analogous plot for the kernel density estimate.
I do see a bug in the get_probability function, but that bug causes it to compute a result that is too high: in np.sum(kd_vals * step), it multiplies N sample values by a step whose denominator is N - 1, so the output comes out a factor of N/(N-1) too high. (If they wanted a trapezoid-rule computation of the integral, they should have halved the left and right endpoint values first.)
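For instance (a sketch reusing kd_vals and step from inside the question's function):
probability = np.trapz(kd_vals, dx=step)  # np.trapz halves the endpoint contributions automatically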
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
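A quick check (a sketch reusing x, KernelDensity and get_probability from the question's first snippet): a much narrower kernel tracks the underlying normal far more closely.
kd_narrow = KernelDensity(kernel='gaussian', bandwidth=0.05).fit(np.array(x).reshape(-1, 1))
print(get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd_narrow))
# ~0.69, and ~0.68 once the N/(N-1) overcount described above is also fixed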
I am currently having an issue with my implementation of the Metropolis-Hastings algorithm.
I am trying to use the algorithm to calculate integrals of the form I = ∫ x^2 exp(-x^2) dx over the real line, by sampling from p(x) ∝ exp(-x^2).
Using this algorithm, we can obtain a long chain of configurations (in this case, each configuration is just a single number) such that in the tail end of the chain the probability of having a particular configuration follows (or rather tends to) a Gaussian distribution.
My code does not seem to produce the expected Gaussian distributions. There is a strange dependence on the transition probability (the probability of picking a new candidate configuration given the previous configuration in the chain). However, if this transition probability is symmetric, there should be no dependence on it at all (it only affects the speed at which phase space [the space of potential configurations] is explored and how quickly the chain converges to the desired distribution)!
In my case I am using a normal distribution as the transition function (which satisfies the requirement of being symmetric), with width d.
For each d I use, I do indeed get a Gaussian distribution, but the standard deviation, sigma, depends on my choice of d.
The resulting Gaussian should have a sigma of roughly 0.707 (i.e. 1/sqrt(2), since p(x) ∝ exp(-x^2) has variance 1/2), but the value I actually get depends on the parameter d, when it shouldn't.
I am not sure where the error in this code is, any help would be greatly appreciated!
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
'''
We want to get an exponential decay integral approximation using importance sampling.
We will try to integrate x^2 exp(-x^2) over the real line.
The Metropolis-Hastings algorithm will generate configurations (in this case, single numbers) such that
the probability of a given configuration x^a ~ p(x^a) for p(x) proportional to exp(-x^2).
Once configs = {x^a} are generated, the approximation, Q_N, of the integral, I, is given by
Q_N = 1/N sum_(configs) x^2
lim (N -> inf) Q_N -> I
'''
'''
Implementing the Metropolis-Hastings algorithm
'''
#Setting up the initial config for our chain, generating the first 2 elements so Data starts as a numpy array
x_0 = np.random.uniform(-20,-10,2)
#Defining function that generates the next N steps in the chain, given a starting config x
#Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
#Success and Failures implemented to see roughly the success rate of each step
def next_steps(x,N):
    i = 0
    Success = 0
    Failures = 0
    Data = np.array(x)
    d = 1.5 #Spread of (normal) transition function
    while i < N:
        r = np.random.uniform(0,1)
        delta = np.random.normal(0,d)
        x_old = Data[-1]
        x_new = x_old + delta
        hasting_ratio = np.exp(-(x_new**2-x_old**2))
        if hasting_ratio > r:
            i = i+1
            Data = np.append(Data,x_new)
            Success = Success + 1
        else:
            Failures = Failures + 1
    print(Success)
    print(Failures)
    return Data
#Number of steps in the chain
N_iteration = 50000
#Generating the data
Data = next_steps(x_0,N_iteration)
#Plotting data to see convergence of chain to gaussian distribution
plt.plot(Data)
plt.show()
#Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
#Plotting a histogram to visually check if it is Gaussian
plt.hist(Data, bins = 300)
plt.show()
You need to save x even when it doesn't change. Otherwise the center values are under-counted, and more so as d increases, which increases the variance.
import numpy as np
from scipy.stats import norm
"""
We want to get an exponential decay integral approx using importance sampling.
We will try to integrate x^2exp(-x^2) over the real line.
Metropolis-hasting alg will generate configuartions (in this case, single numbers) such that
the probablity of a given configuration x^a ~ p(x^a) for p(x) propto exp(-x^2).
Once configs = {x^a} generated, the apporximation, Q_N, of the integral, I, will be given by
Q_N = 1/N sum_(configs) x^2
lim (N-> inf) Q_N -> I
"""
"""
Implementing metropolis-hasting algorithm
"""
# Setting up the initial config for our chain
x_0 = np.random.uniform(-20, -10)
# Defining function that generates the next N steps in the chain, given a starting config x
# Works by iteratively taking the last element in the chain, generating a new candidate configuration from it and accepting/rejecting according to the algorithm
# Success and failures implemented to see roughly the success rate of each step
def next_steps(x, N):
    Success = 0
    Failures = 0
    Data = np.empty((N,))
    d = 1.5  # Spread of (normal) transition function
    for i in range(N):
        r = np.random.uniform(0, 1)
        delta = np.random.normal(0, d)
        x_new = x + delta
        hasting_ratio = np.exp(-(x_new ** 2 - x ** 2))
        if hasting_ratio > r:
            x = x_new
            Success = Success + 1
        else:
            Failures = Failures + 1
        Data[i] = x  # record the current state even when the proposal is rejected
    print(Success)
    print(Failures)
    return Data
# Number of steps in the chain
N_iteration = 50000
# Generating the data
Data = next_steps(x_0, N_iteration)
# Obtaining tail end data and obtaining the standard deviation of resulting gaussian distribution
Data = Data[-40000:]
(mu, sigma) = norm.fit(Data)
print(sigma)
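With this change, sigma comes out at roughly 1/sqrt(2) ≈ 0.707 regardless of d, as expected for p(x) ∝ exp(-x^2).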
In scipy there is no support for fitting discrete distributions to data. I know there are a lot of questions about this.
For example if i have an array like below:
x = [2,3,4,5,6,7,0,1,1,0,1,8,10,9,1,1,1,0,0]
I can't apply the following to this array:
from scipy.stats import nbinom
param = nbinom.fit(x)
But I would like to ask, as of today: is there any way to fit these three discrete distributions (negative binomial, Poisson, geometric) and then choose the best fit for the discrete dataset?
You can use Method of Moments to fit any particular distribution.
Basic idea: get empirical first, second, etc. moments, then derive distribution parameters from these moments.
So, in all these cases we only need two moments. Let's get them:
import pandas as pd
# for other distributions, you'll need to implement PMF
from scipy.stats import nbinom, poisson, geom
x = pd.Series(x)
mean = x.mean()
var = x.var()
likelihoods = {} # we'll use it later
Note: I used pandas instead of numpy. That is because numpy's var() and std() don't apply Bessel's correction, while pandas' do. If you have 100+ samples, there shouldn't be much difference, but on smaller samples it could be important.
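A quick illustration of the difference (ddof is the relevant knob in numpy):
import numpy as np
s = [1, 2, 3]
np.var(s)  # 0.667: population variance (ddof=0)
pd.Series(s).var()  # 1.0: sample variance with Bessel's correction (ddof=1)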
Now, let's get the parameters for these distributions. The negative binomial has two parameters, p and r. Let's estimate them and calculate the likelihood of the dataset:
# From the wikipedia page, we have:
# mean = pr / (1-p)
# var = pr / (1-p)**2
# without wiki, you could use MGF to get moments; too long to explain here
# Solving for p and r, we get:
p = 1 - mean / var # TODO: check for zero variance and limit p by [0, 1]
r = (1-p) * mean / p
UPD: Wikipedia and scipy use different definitions of p, one treating it as the probability of success and the other as the probability of failure. To be consistent with the scipy notion, use:
p = mean / var
r = p * mean / (1-p)
END OF UPD
UPD2:
I'd suggest using #thilak's log-likelihood code instead. It avoids loss of precision, which is especially important on large samples.
END OF UPD2
Calculate likelihood:
likelihoods['nbinom'] = x.map(lambda val: nbinom.pmf(val, r, p)).prod()
Same for Poisson, there is only one parameter:
# from Wikipedia,
# mean = variance = lambda. Nothing to solve here
lambda_ = mean
likelihoods['poisson'] = x.map(lambda val: poisson.pmf(val, lambda_)).prod()
Same for Geometric distribution:
# mean = 1 / p # this form fits the scipy definition
p = 1 / mean
likelihoods['geometric'] = x.map(lambda val: geom.pmf(val, p)).prod()
Finally, let's get the best fit:
best_fit = max(likelihoods, key=lambda x: likelihoods[x])
print("Best fit:", best_fit)
print("Likelihood:", likelihoods[best_fit])
Let me know if you have any questions
Great answer by Marat.
In addition to Marat's post, I would most certainly recommend taking the log of the probability mass function. Some information on why log-likelihood is preferred over likelihood: https://math.stackexchange.com/questions/892832/why-we-consider-log-likelihood-instead-of-likelihood-in-gaussian-distribution
I would rewrite the code for the negative binomial to:
log_likelihoods = {}
log_likelihoods['nbinom'] = x.map(lambda val: nbinom.logpmf(val, r, p)).sum()
Note that I have used:
- logpmf instead of pmf
- sum instead of product
And to find the best distribution:
best_fit = max(log_likelihoods, key=lambda x: log_likelihoods[x])
print("Best fit:", best_fit)
print("log_Likelihood:", log_likelihoods[best_fit])
I have recently been working with gpflow, in particular Gaussian process regression, to model a process for which I have access to approximated moments for each input. I have a vector of inputs X of size (N,1) and a vector of responses Y of size (N,1). However, I also know, for each (x,y) pair, an approximation of the associated variance, skewness, kurtosis and so on for the particular y value.
From this, I know properties that inform me of appropriate likelihoods to use for each data point.
In the simplest case, I just assume all likelihoods are Gaussian, and specify the variance at each point. I've created a minimal example of my code by adapting the tutorial on: https://nbviewer.jupyter.org/github/GPflow/GPflow/blob/develop/doc/source/notebooks/advanced/varying_noise.ipynb#Demo-2:-grouped-noise-variances.
import numpy as np
import gpflow

def generate_data(N=100):
    X = np.random.rand(N)[:, None] * 10 - 5  # Inputs, shape N x 1
    F = 2.5 * np.sin(6 * X) + np.cos(3 * X)  # Mean function values
    groups = np.arange(0, N, 1).reshape(-1, 1)
    NoiseVar = np.array([i / 100.0 for i in range(N)])[groups]
    Y = F + np.random.randn(N, 1) * np.sqrt(NoiseVar)  # Noisy data
    return X, Y, groups, NoiseVar
# Get data
X, Y, groups, NoiseVar = generate_data()
Y_data = np.hstack([Y, groups])
# Generate one likelihood per data-point
likelihood = gpflow.likelihoods.SwitchedLikelihood(
    [gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])]
)
# model construction (notice that num_latent is 1)
kern = gpflow.kernels.Matern52(input_dim=1, lengthscales=0.5)
model = gpflow.models.VGP(X, Y_data, kern=kern, likelihood=likelihood, num_latent=1)
# Specify the likelihood as non-trainable
model.likelihood.set_trainable(False)
# build the natural gradients optimiser
natgrad_optimizer = gpflow.training.NatGradOptimizer(gamma=1.)
natgrad_tensor = natgrad_optimizer.make_optimize_tensor(model, var_list=[(model.q_mu, model.q_sqrt)])
session = model.enquire_session()
session.run(natgrad_tensor)
# update the cache of the variational parameters in the current session
model.anchor(session)
# Stop Adam from optimising the variational parameters
model.q_mu.trainable = False
model.q_sqrt.trainable = False
# Create Adam tensor
adam_tensor = gpflow.train.AdamOptimizer(learning_rate=0.1).make_optimize_tensor(model)
for i in range(200):
    session.run(natgrad_tensor)
    session.run(adam_tensor)

# update the cache of the parameters in the current session
model.anchor(session)
print(model)
The above code works for a Gaussian likelihood with known variances. Inspecting my real data, I see that it is very often skewed, and as a result I want to use non-Gaussian likelihoods to model it, but I am unsure how to specify these other likelihood parameters given what I know.
So my question is: given this setup, how can I adapt my code so far to include non-Gaussian likelihoods at each step, in particular specifying and fixing their parameters based on the known variance, skewness, kurtosis and so on associated with each individual y value?
Firstly, you will need to choose which non-Gaussian likelihood you use. GPflow includes various ones in likelihoods.py. You then need to adapt the line
likelihood = gpflow.likelihoods.SwitchedLikelihood(
[gpflow.likelihoods.Gaussian(variance=NoiseVar[i]) for i in range(Y.shape[0])]
)
to give a list of your non-Gaussian likelihoods.
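For example, if you only used the variance information, a minimal sketch with per-point Student-t likelihoods could look like the following (untested; likelihood names and parameter signatures vary between GPflow versions, so treat the StudentT arguments as assumptions). For a Student-t with nu > 2 degrees of freedom, var = scale**2 * nu / (nu - 2), so fixing nu lets you derive each scale from your known variances:
nu = 4.0  # hypothetical fixed degrees of freedom
scales = np.sqrt(NoiseVar.flatten() * (nu - 2.0) / nu)  # scale implied by each known variance
likelihood = gpflow.likelihoods.SwitchedLikelihood(
    [gpflow.likelihoods.StudentT(deg_free=nu, scale=scales[i]) for i in range(Y.shape[0])]
)
# then construct the model and set the likelihood non-trainable, as in the question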
Which likelihood can take advantage of your skewness and kurtosis information is a statistical question. Depending on what you come up with, you may need to implement your own likelihood class, which can be done by inheriting from Likelihood. You should be able to follow some other examples from likelihoods.py.
First question:
I'm trying to fit experimental data with a function of the following form:
f(x) = m_0*(1-exp(-t_0*x)) + ... + m_j*(1-exp(-t_j*x))
Currently, I haven't found a way to have an undetermined number of parameters m_j, t_j; I'm forced to do something like this:
def fitting_function(x, m_1, t_1, m_2, t_2):
    return m_1*(1.-numpy.exp(-t_1*x)) + m_2*(1.-numpy.exp(-t_2*x))

parameters, covariance = curve_fit(fitting_function, xExp, yExp, maxfev = 100000)
(xExp and yExp are my experimental points)
Is there a way to write my fitting function like this:
def fitting_function(x, li):
    res = 0.
    for idx in range(len(li) // 2):
        res += li[2*idx]*(1-numpy.exp(-li[2*idx+1]*x))
    return res
where li is the list of fitting parameters, and then do a curve fit? I don't know how to tell curve_fit what the number of fitting parameters is.
When I try this kind of form for fitting_function, I get errors like "ValueError: Unable to determine number of fit parameters."
Second question:
Is there any way to force my fitting parameters to be positive?
Any help appreciated :)
See my question and answer here. I've also made a minimal working example demonstrating how it could be done for your application. I make no claims that this is the best way - I am muddling through all this myself, so any critiques or simplifications are appreciated.
import numpy as np
from scipy.optimize import curve_fit
import matplotlib.pyplot as pl

def wrapper(x, *args):  # take a flat argument list and break it into two lists for the fit function to understand
    N = len(args) // 2
    amplitudes = list(args[0:N])
    timeconstants = list(args[N:2*N])
    return fit_func(x, amplitudes, timeconstants)

def fit_func(x, amplitudes, timeconstants):  # the actual fit function
    fit = np.zeros(len(x))
    for m, t in zip(amplitudes, timeconstants):
        fit += m*(1.0 - np.exp(-t*x))
    return fit

def gen_data(x, amplitudes, timeconstants, noise=0.1):  # generate some fake data
    y = np.zeros(len(x))
    for m, t in zip(amplitudes, timeconstants):
        y += m*(1.0 - np.exp(-t*x))
    if noise:
        y += np.random.normal(0, noise, size=len(x))
    return y

def main():
    x = np.arange(0, 100)
    amplitudes = [1, 2, 3]
    timeconstants = [0.5, 0.2, 0.1]
    y = gen_data(x, amplitudes, timeconstants, noise=0.01)
    p0 = [1, 2, 3, 0.5, 0.2, 0.1]
    popt, pcov = curve_fit(lambda x, *p0: wrapper(x, *p0), x, y, p0=p0)  # call with a lambda function
    yfit = gen_data(x, popt[0:3], popt[3:6], noise=0)
    pl.plot(x, y, x, yfit)
    pl.show()
    print(popt)
    print(pcov)

if __name__ == "__main__":
    main()
A word of warning, though. A linear sum of exponentials is going to make the fit EXTREMELY sensitive to any noise, particularly for a large number of parameters. You can test that by adding even a small amount of noise to the data generated in the script: even small deviations cause it to get the wrong answer entirely, while the fit still looks perfectly valid by eye (test with noise=0, 0.01, and 0.1). Be very careful interpreting your results even if the fit looks good. It's also a form that allows for variable swapping: the best-fit solution is the same even if you swap any pair (m_i, t_i) with (m_j, t_j), meaning your chi-square has multiple identical local minima, and your variables might get swapped around during fitting depending on your initial conditions. This is unlikely to be a numerically robust way to extract these parameters.
To your second question, yes, you can, by defining your exponentials like so:
m_0**2*(1.0-np.exp(-t_0**2*x)) + ...
Basically, square them all in your actual fit function, fit them, and then square the results (which could be negative or positive) to get your actual parameters. You can also define variables to be between a certain range by using different proxy forms.
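A sketch of that trick applied to the example above (hypothetical wrapper name, reusing fit_func and the x, y data from main()):
def positive_wrapper(x, *args):
    # the model only ever sees the squares, so the effective parameters stay positive
    n = len(args) // 2
    amplitudes = [q**2 for q in args[:n]]
    timeconstants = [u**2 for u in args[n:]]
    return fit_func(x, amplitudes, timeconstants)

p0 = [1, 2, 3, 0.5, 0.2, 0.1]
popt, pcov = curve_fit(lambda x, *p: positive_wrapper(x, *p), x, y, p0=np.sqrt(p0))
amplitudes_fit = popt[:3]**2  # square the fitted proxies to recover
timeconstants_fit = popt[3:]**2  # the actual (positive) parameters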