Apply kurtosis to a distribution in Python

I have a dataset which is in the format of
frequency, direction, normalised power spectral density, spread, skewness, kurtosis
I am able to visualise the distribution of a specific record using the code from the top answer to "skew normal distribution in scipy", but I am not sure how to apply a kurtosis value to the distribution:
from numpy import linspace, pi, sqrt, exp
from scipy.special import erf
from pylab import plot, show

def pdf(factor, x):
    return (100 * factor) / sqrt(2 * pi) * exp(-x**2 / 2)

def cdf(x):
    return (1 + erf(x / sqrt(2))) / 2

def skew(x, e=0, w=1, a=0, norm_psd=1):
    t = (x - e) / w
    return 2 / w * pdf(norm_psd, t) * cdf(a * t)

n = 540
e = 341.9    # direction
w = 59.3     # spread
a = 3.3      # skew
k = 4.27     # kurtosis
n_psd = 0.5  # normalised power spectral density

x = linspace(-90, 450, n)
p = skew(x, e, w, a, n_psd)
print(max(p))
plot(x, p)
show()
Edit: I removed "skew normal" from my title as I don't think it is actually possible to apply a kurtosis value to the above distribution; a different distribution is probably necessary, and since direction is involved a distribution from circular statistics may be more appropriate.
Thanks to the answer below, I can apply kurtosis using the pdf_mvsk function demonstrated in the code below. Unfortunately my skew values produce negative y values, but the answer satisfies my question.
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.sandbox.distributions.extras as extras
pdffunc = extras.pdf_mvsk([341.9, 59.3, 3.3, 4.27])
range = np.arange(0, 360, 0.1)
plt.plot(range, pdffunc(range))
plt.show()

If you have the mean, standard deviation, skew and kurtosis, then you can build a distribution with approximately those moments using a Gram-Charlier expansion (a normal density corrected by higher-order terms).
I looked into this some time ago; scipy.stats had a function for this that was wrong and was removed.
I don't remember what the current status is, since it was a long time ago that I put this into the statsmodels sandbox:
http://statsmodels.sourceforge.net/devel/generated/statsmodels.sandbox.distributions.extras.pdf_mvsk.html#statsmodels.sandbox.distributions.extras.pdf_mvsk
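For reference, here is a minimal sketch of a Gram-Charlier A-series density, the idea the answer above refers to; the function name gram_charlier_pdf and the use of excess kurtosis are assumptions of this sketch, not part of the statsmodels API:
import numpy as np
import matplotlib.pyplot as plt

def gram_charlier_pdf(x, mu, sigma, skew, excess_kurt):
    # normal density corrected by Hermite-polynomial terms that match
    # the requested skewness and excess kurtosis
    z = (x - mu) / sigma
    phi = np.exp(-z**2 / 2) / (sigma * np.sqrt(2 * np.pi))
    he3 = z**3 - 3 * z           # probabilists' Hermite polynomial He_3
    he4 = z**4 - 6 * z**2 + 3    # He_4
    return phi * (1 + skew / 6 * he3 + excess_kurt / 24 * he4)

x = np.linspace(0, 360, 720)
plt.plot(x, gram_charlier_pdf(x, 341.9, 59.3, 3.3, 4.27))
plt.show()
Note that the expansion is not guaranteed to stay non-negative for large skew and kurtosis values, which is consistent with the negative y values mentioned in the question's edit.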

Related

How to calculate the probability between two numbers from a probability distribution in python

I've always thought it would be useful to calculate the probability between two values on a probability distribution. While there isn't a built-in way to do this using seaborn or matplotlib, I reckon it just takes some basic calculus, right? Here is some code I found from an article on this topic:
from sklearn.neighbors import KernelDensity
import numpy as np

x = np.random.normal(loc=0.0, scale=1.0, size=1000000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))

def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)

get_probability(x.mean() - x.std(), x.mean() + x.std(), 100, kd)
0.6338
This returns a probability that converges to 0.6338. This confused me, as the 68-95-99.7 rule states that the probability of a value being within one standard deviation of the mean in either direction should be 68%.
I decided to run another test by calculating the probability between the median and max of a randomly generated sample, figuring it should converge close to 50%:
x = np.random.randint(100, size=(1000000))
# sns.kdeplot(x) # this is how I'd generate a kdeplot of this data
kd = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(np.array(x).reshape(-1, 1))

def get_probability(start_value, end_value, eval_points, kd):
    # Number of evaluation points
    N = eval_points
    step = (end_value - start_value) / (N - 1)  # Step size
    x = np.linspace(start_value, end_value, N)[:, np.newaxis]  # Generate values in the range
    kd_vals = np.exp(kd.score_samples(x))  # Get PDF values for each x
    probability = np.sum(kd_vals * step)  # Approximate the integral of the PDF
    return probability.round(4)

get_probability(np.median(x), x.max(), 100, kd)
0.4946
And it's pretty close. Am I missing something here? Why am I nearly 5 percentage points off from the 68-95-99.7 rule? Is this method of generating probabilities from a probability distribution wrong? Is there a better way to find the probability between two values from a probability distribution?
EDIT: Could you potentially calculate something by using the data generated from a kdeplot?
fig, ax = plt.subplots()
sns.kdeplot(x)
kdeline = ax.lines[0]
xs = kdeline.get_xdata()
ys = kdeline.get_ydata()
And implement np.interp() somehow?
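A sketch of that idea (assuming the xs and ys arrays from the snippet above, plus hypothetical bounds lower and upper; a direct trapezoid sum over the line data makes np.interp unnecessary unless the bounds fall between grid points):
import numpy as np

lower, upper = -1.0, 1.0                      # hypothetical bounds of interest
mask = (xs >= lower) & (xs <= upper)          # restrict the kdeplot line to [lower, upper]
xs_in, ys_in = xs[mask], ys[mask]
p = np.sum(0.5 * (ys_in[1:] + ys_in[:-1]) * np.diff(xs_in))  # trapezoid sum of the density
print(round(p, 4))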
More edits:
Using CDFs per #7shoe, I was able to get a way better (and correct) result for my normal distribution example:
from scipy.stats import norm
import numpy as np
np.random.seed(42)
x = np.random.normal(loc=0.0, scale=1.0, size=10000000)
norm.cdf(x.mean() + x.std()) - norm.cdf(x.mean() - x.std())
However, my curiosity is still piqued. Let's say we have a distribution that may or may not be normal. For example, let's look at Tom Brady's epa per pass from last season
import pandas as pd
import seaborn as sns
import random
import numpy as np
YEAR = 2021
data = pd.read_csv(
    'https://github.com/nflverse/nflfastR-data/blob/master/data/play_by_play_'
    + str(YEAR) + '.csv.gz?raw=True', compression='gzip', low_memory=False
)
df = data.loc[data.passer == 'T.Brady', 'epa'].copy()
# Tom Brady's distribution
sns.kdeplot(df)

sample_mean = []
for i in range(50):
    y = np.random.choice(df, 500)
    avg = np.mean(y)
    sample_mean.append(avg)

# distribution of sampling means - can we assume this is normal and proceed with cdfs?
sns.kdeplot(sample_mean)
Could we use sampling means or even just bootstrap resampling methods to
Make a more "normal" distribution with sampling means in order to incorporate cdfs if the initial distribution doesn't quite appear normal (this, though, would be a distribution of means rather than individual samples. Is this not encouraged?)
or
If the distribution already resembles a normal distribution, simply use such resampling methods to create better parametric estimates?
Computing the probability p for some interval is not overly complicated. However, it can be tricky to combine the right tools, in particular because there are several statistical approaches to choose from.
1. Probability theory
Given two numbers, let's call them lower and upper, what probability is enclosed between them? If the cumulative distribution function (CDF) F is known, it is merely p = F(upper) - F(lower). Similarly, p coincides with the area enclosed by the probability density function (PDF) f's graph over the interval [lower, upper].
However, when the CDF/PDF is unknown, it constitutes a statistical question. In a nutshell, estimating the PDF f and computing the area its graph encloses over the interval will do. But there are several paradigms and estimation procedures to obtain it.
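As a tiny illustration of the known-CDF case (using the standard normal purely for concreteness), the asker's 68% target falls straight out of two CDF evaluations:
from scipy import stats

lower, upper = -1.0, 1.0
p = stats.norm.cdf(upper) - stats.norm.cdf(lower)
print(round(p, 4))  # 0.6827, the "68" of the 68-95-99.7 rule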
1. Parametric estimation
One could assume that the data x is a set of IID realizations of some normal distribution, either because of prior knowledge or convenience. Then, one just needs to estimate its parameters mu (the location, loc in scipy) and sigma (the standard deviation, scale in scipy). scipy.stats provides all we need in this setting. Moreover, it offers estimation procedures as well as pdf/cdf functions for various parametric distributions.
import numpy as np
from scipy import stats
from matplotlib import pyplot as plt

lower, upper = 0.0, 2.0
x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit parameters
loc_hat, scale_hat = stats.norm.fit(x)

# probability
p = stats.norm.cdf(upper, loc=loc_hat, scale=scale_hat) - stats.norm.cdf(lower, loc=loc_hat, scale=scale_hat)

# plot
x_axis = np.linspace(-5, 7, 1000)
plt.title('1. Parametric Estimation', fontsize=18)
plt.plot(x_axis, stats.norm.pdf(x_axis, loc_hat, scale_hat))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=stats.norm.pdf(np.arange(lower, upper, 0.01), loc=loc_hat, scale=scale_hat),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.1, y=0.1, s='p=' + str(round(p, 3)))
plt.show()
which yields a plot of the fitted normal density, with the probability mass over [lower, upper] shaded in red and the value of p annotated.
2. Non-parametric estimation
In the absence of a parametric assumption, various techniques exist to estimate the density directly (rather than identifying it by estimated parameters as seen above). Kernel density estimation is the most popular variant to do so. In this case, as alluded to in the question, scikit-learn is an ideal tool. However, in the absence of an analytical CDF, we need to compute the area enclosed by the density's graph over the interval [lower, upper] directly.
In contrast to previous answers, I'd leave this to SciPy's numerical integration routines, e.g. scipy.integrate.quad(). The advantage is that it is lightning-fast and can be applied to any function (beyond kernel density estimates). The resulting code is as follows:
from sklearn.neighbors import KernelDensity
from scipy.integrate import quad

x = [-0.804, -2.267, 1.55, -1.004, 3.173, -0.522, -0.231, 3.95, -0.574, -0.213, 1.333, 2.42, 1.879, 3.814]

# fit density function
f_hat = KernelDensity(bandwidth=.9, kernel='gaussian').fit(np.array(x).reshape(-1, 1))

def f_pred(x):
    '''wrapper returning the estimated density at a scalar x'''
    return np.exp(f_hat.score_samples(np.array(x).reshape(-1, 1)))[0]

p = quad(func=f_pred, a=lower, b=upper)

# plot
plt.title('2. Non-Parametric Estimation', fontsize=18)
x_axis = np.linspace(-5, 7, 1000)
plt.plot(x_axis, np.exp(f_hat.score_samples(x_axis.reshape(-1, 1))))
plt.fill_between(x=np.arange(lower, upper, 0.01),
                 y1=np.exp(f_hat.score_samples(np.arange(lower, upper, 0.01).reshape(-1, 1))),
                 facecolor='red',
                 alpha=0.35)
plt.text(x=0.15, y=0.1, s='p=' + str(round(p[0], 3)))
plt.show()
and yields the analogous plot for the kernel density estimate, again with the area over [lower, upper] shaded and p annotated.
I do see a bug in the get_probability function, but it causes the result to be too high rather than too low: in np.sum(kd_vals * step), N sample values are multiplied by a step whose denominator is N - 1, so the output is a factor of N/(N-1) too large. (If they wanted a trapezoid-rule approximation of the integral, they should have halved the first and last values first.)
Other than that, the computation looks correct. The problem is that the model doesn't reflect the input distribution.
You're not modeling the distribution as a normal distribution. You're modeling it with a kernel density estimator with a Gaussian kernel, and the kernel bandwidth is very high relative to the scale of the distribution and the number of available samples. This results in the model being "flatter" than the actual distribution, with less of the probability concentrated in the center.
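Both points can be checked numerically. Below is a sketch that re-runs the first experiment with the endpoint values halved (turning the sum into a proper trapezoid rule) and a much smaller bandwidth; the seed, the sample size of 100,000 and the bandwidth of 0.1 are arbitrary choices of mine, not values from the question:
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=100_000)
kd = KernelDensity(kernel='gaussian', bandwidth=0.1).fit(x.reshape(-1, 1))

def get_probability(start_value, end_value, eval_points, kd):
    grid = np.linspace(start_value, end_value, eval_points)
    step = (end_value - start_value) / (eval_points - 1)
    dens = np.exp(kd.score_samples(grid.reshape(-1, 1)))
    dens[0] *= 0.5   # trapezoid rule: half weight on both endpoints
    dens[-1] *= 0.5
    return np.sum(dens * step)

print(round(get_probability(x.mean() - x.std(), x.mean() + x.std(), 200, kd), 4))
# with the narrower bandwidth this should land close to the expected 0.68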

Integration of a KDE with strange behavior of scipy.integrate.quad and the set bandwidth

I was looking for a way to obtain the mean value (expected value) of a distribution that I fitted with a kernel density estimate from scipy.stats.gaussian_kde. I remember from my statistics class that the expected value is just the integral of pdf(x) * x from -infinity to infinity: E[X] = ∫ x · pdf(x) dx.
I used the scipy.integrate.quad function to do this task in my code, but I ran into apparently strange behavior (that might have something to do with the bandwidth parameter of the KDE).
Problem
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import norm, gaussian_kde
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity

np.random.seed(42)

# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])

kde = gaussian_kde(test_array, bw_method=0.5)
X_range = np.arange(-16, 20, 0.1)

y_list = []
for X in X_range:
    pdf = lambda x: kde.evaluate([[x]])
    y_list.append(pdf(X))
y = np.array(y_list)
_ = plt.plot(X_range, y)

# Integrate over pdf * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]
# Calculate the cdf at the point of the mean
zero_int_low = quad(lambda x: pdf(x), a=-np.inf, b=mean_integration_low_bw)[0]

print("The mean after integration: {}\n".format(round(mean_integration_low_bw, 4)))
print("F({}): {}".format(round(mean_integration_low_bw, 4), round(zero_int_low, 4)))
plt.axvline(x=mean_integration_low_bw, color="r")
plt.show()
If I execute this code, I get strange behavior in the results for the integrated mean and the cumulative distribution function at the point of the calculated mean:
First Question:
In my opinion it should always hold that F(mean) = 0.5, or am I wrong here? (Does this only apply to symmetric distributions?)
Second Question:
The stranger thing is that the value of the integrated mean does not change with the bandwidth parameter. In my opinion the mean should change too if the shape of the underlying distribution differs. If I set the bandwidth to 5 I get the following graph:
Why is the mean value still the same if the curve now has a different shape (due to the wider bandwidth)?
I hope these questions don't arise only from my flawed understanding of statistics ;)
Your initial data is generated here:
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])
So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10. You can therefore predict the expected average as (500*4 + 100*(-10)) / (500 + 100) ≈ 1.667, which is pretty close to the result given by your code and also very consistent with the first plot.
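As for the bandwidth: the mean of a Gaussian KDE always equals the sample mean, because each kernel is symmetric about its data point, so widening the bandwidth changes the shape of the curve but not its mean. A quick numerical check (a sketch reusing the question's data; the loop over the two bandwidths is my addition):
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

np.random.seed(42)
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])

print('sample mean:', round(test_array.mean(), 4))
for bw in (0.5, 5.0):
    kde = gaussian_kde(test_array, bw_method=bw)
    mean = quad(lambda t: t * kde.evaluate([t])[0], -np.inf, np.inf)[0]
    # both bandwidths should reproduce (numerically) the same sample mean
    print('bw =', bw, '-> integrated mean =', round(mean, 4))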

Formula for partial expectation in scipy lognorm

I have been trying to match scipy's outputs for the lognormal distribution to the formulas on Wikipedia.
And I am stuck on the partial expectation with a lower bound.
If I use this simple lognormal distribution:
import numpy as np
import scipy.stats as scist

k = .25
sigma = .5
mu = .1  # from the logged variable
lnorm = scist.lognorm(s=sigma, scale=np.exp(mu))
where k is the lower bound, the partial expectation, as I understand it, is given by g(k) = E[X; X > k] = exp(mu + sigma^2/2) * Phi((mu + sigma^2 - ln k) / sigma), with Phi the standard normal CDF.
Fine. So we are simply talking about the mean of the lognormal distribution multiplied by a normal CDF evaluated at a z-score-like argument. scipy provides the partial expectation:
lnorm.expect(lambda x:x, lb=k)
>>> 1.25199...
Indeed, we can confirm this is the partial by checking it against the conditional expectation. Computing it directly or using the partial above yield the same result:
lnorm.expect(lambda x:x, lb=k) / (1 - lnorm.cdf(k))
>>> 1.25385...
lnorm.expect(lambda x:x, lb=k, conditional=True)
>>> 1.25385...
However, scipy's cdf function takes the x variable, not the z-score, and I am uncertain how to transform the argument of Phi, (mu + sigma^2 - ln k) / sigma, into an x value. I have tried many different flavors.
I would have thought a straightforward shift would do the trick, to account for the subtraction of mu that must occur when scipy's cdf (presumably) computes the z-score internally.
Any formulation I use ends up with a very small or 0 value.
Any help would be greatly appreciated.
IIUC, you can simply compute the CDF of a standard normal distribution N(0,1) at (mu + sigma^2 - ln(k)) / sigma and multiply it by the lognormal mean, i.e.
import numpy as np
import scipy.stats as sps

def partial_expectation(mu, sigma, k):
    """
    Returns the partial expectation given the
    mean, standard deviation and lower bound k.
    https://en.wikipedia.org/wiki/Log-normal_distribution
    """
    # cumulative distribution function of the
    # standard normal N(0,1) evaluated at x_phi
    x_phi = (mu + sigma**2 - np.log(k)) / sigma
    phi = sps.norm.cdf(x_phi, loc=0, scale=1)
    # mean of the lognormal
    lognorm_mu = np.exp(mu + .5 * (sigma**2))
    # result
    return lognorm_mu * phi

k = .25
sigma = .5
mu = .1  # from the logged variable
lnorm = sps.lognorm(s=sigma, scale=np.exp(mu))

print('from def:', partial_expectation(mu, sigma, k))
print('from sps:', lnorm.expect(lb=k))

from def: 1.251999952174895
from sps: 1.2519999521748952

How to calculate integral for very very small y values (SciPy quad)

Here is a probability density function of a lognormal distribution:
import numpy as np
from scipy.stats import lognorm

def f(x):
    return lognorm.pdf(x, s=0.2, loc=0, scale=np.exp(10))
This function has very small y values (max ≈ 1e-5) and is spread over x values on the order of 1e5. We know that the integral of a PDF should be 1, but when using the following code to calculate the integral directly, the answer is around 1e-66 because the numerical accuracy is not sufficient.
from scipy.integrate import quad
import pandas as pd
ans, err = quad(f, -np.inf, np.inf)
Could you kindly help me to correctly calculate an integral like this? Thank you.
The values that you are using correspond to the underlying normal distribution having mean mu = 10 and standard deviation sigma = 0.2. With those values, the mode of the distribution (i.e. the location of the maximum of the PDF) is at exp(mu - sigma**2) = 21162.795717500194. The function quad works pretty well, but it can be fooled. In this case, apparently quad only samples the function where the values are extremely small--it never "sees" the higher values way out around 20000.
You can fix this by computing the integral over two intervals, say [0, mode] and [mode, np.inf]. (There is no need to compute the integral over the negative axis, since the PDF is 0 there.)
For example, this script prints 1.0000000000000004
import numpy as np
from scipy.stats import lognorm
from scipy.integrate import quad

def f(x, mu=0, sigma=1):
    return lognorm.pdf(x, s=sigma, loc=0, scale=np.exp(mu))

mu = 10
sigma = 0.2
mode = np.exp(mu - sigma**2)

ans1, err1 = quad(f, 0, mode, args=(mu, sigma))
ans2, err2 = quad(f, mode, np.inf, args=(mu, sigma))
integral = ans1 + ans2
print(integral)
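An alternative, not taken from the answer above: keep a single quad call over a finite interval and pass the mode through the points argument, which tells quad where the integrand needs extra care; the upper bound of 1e6 is an arbitrary cutoff of mine, chosen far beyond any appreciable probability mass.
import numpy as np
from scipy.stats import lognorm
from scipy.integrate import quad

mu, sigma = 10, 0.2
mode = np.exp(mu - sigma**2)
# points= only works with finite limits, hence the explicit upper bound
ans, err = quad(lambda x: lognorm.pdf(x, s=sigma, scale=np.exp(mu)),
                0, 1e6, points=[mode])
print(ans)  # should come out very close to 1.0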

Best way to find the confidence interval of a kernel-density estimate in python / scipy?

I am currently estimating the probability density function of my data, which is not normally distributed in general. I do this with scipy.stats.gaussian_kde, and I want to find the confidence interval for the estimated distribution.
I have not found any method or function in scipy documentation to do that, so currently I am getting the confidence interval by integrating the estimated pdf and optimizing it numerically to obtain the desired confidence level:
from scipy.stats import gaussian_kde
from scipy.optimize import root_scalar
import numpy as np

# example data
data = np.random.randn(10)
kernel = gaussian_kde(data)

# function that returns the coverage of [-x, x] minus the target confidence level:
def f(x):
    return kernel.integrate_box_1d(-x, x) - 0.95

def fprime(x):
    return kernel(x) + kernel(-x)

sol = root_scalar(f, fprime=fprime, x0=0, method='newton')
print(sol.root)
(I know this is not the maximum-likelihood confidence interval, but I am interested in a symmetric one)
Is there any better way to do that?
