How can I generate a CDF using Kernel Density Estimation in Python?

There are a few methods I have come across that can do kernel density estimation which will provide a PDF for a sample of data:
KDEpy
sklearn.neighbors.KernelDensity
scipy.stats.gaussian_kde
Using any of the above I can generate a PDF, but I want to know how to get the CDF for the PDF I am generating. Mathematically I know you can integrate the PDF to get the CDF; the issue is that these methods only supply x and y points, not a function to integrate.
I'm wondering how I could transform the points being returned into a CDF plot, or alternatively find the PDF function for the data and then integrate it to get the CDF, or use an alternative method whose output is a CDF instead of a PDF.

MCVE
Let's create some dummy data to support the discussion:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
np.random.seed(123)
data = stats.norm(loc=0, scale=1).rvs(10**4)
Here is the baseline idea with the scipy.stats package.
Gaussian KDE
We can build a kernel density estimate using dedicated tools such as gaussian_kde:
kde = stats.gaussian_kde(data)
This exposes a PDF function that can be evaluated at any x, but no CDF.
Checking resampled values against the original data with the two-sample Kolmogorov-Smirnov test, we cannot reject the null hypothesis (that the two samples come from the same distribution) at the 10% threshold:
stats.ks_2samp(data, kde.resample(100).squeeze())
# KstestResult(statistic=0.0969, pvalue=0.29163373800871994)
Continuous Variable
The scipy.stats package also exposes a generic class, rv_continuous, to inherit from. As stated in the documentation:
New random variables can be defined by subclassing the rv_continuous
class and re-defining at least the _pdf or the _cdf method (normalized
to location 0 and scale 1).
So we can use this purpose-built mechanism to fill the gap. Without any performance considerations, it boils down to:
class KDEDist(stats.rv_continuous):

    def __init__(self, kde, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._kde = kde

    def _pdf(self, x):
        return self._kde.pdf(x)
Then we create the underlying object with our experimental KDE.
X = KDEDist(kde)
stats.ks_2samp(data, X.rvs(size=100)) # This call is kind of intensive
# KstestResult(statistic=0.0625, pvalue=0.8113077271721811)
Now we can naturally - at least in terms of the API - evaluate the PDF and CDF as well:
x = np.linspace(-4, 4, 200)  # evaluation grid (assumed range for the standard normal data)
fig, axe = plt.subplots()
axe.hist(data, density=1)
axe.plot(x, X.pdf(x))
axe.plot(x, X.cdf(x))
It returns the histogram of the data with the estimated PDF and CDF overlaid.
Performance considerations
Notice that this methodology answers your question but is not performant. KDE computations are expensive, mainly because the kernel spans the whole data space (the Gaussian kernel only reaches zero at infinity). Therefore, without a cut-off feature, every evaluation is based on all observations of the dataset.
Changing the window function can drastically improve the performance: e.g., a triangular window has a bounded support and reduces the computation with respect to the dataset extent and size. A sketch with a finite-support kernel is shown below.
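As an illustration (a minimal sketch only, assuming the KDEpy package mentioned in the question is installed; the kernel, bandwidth rule and grid size are arbitrary choices), a finite-support kernel evaluated on a grid can be integrated numerically to obtain a CDF:
import numpy as np
from scipy.integrate import cumulative_trapezoid  # named cumtrapz on older SciPy versions
from KDEpy import FFTKDE

# Evaluate a KDE with a triangular (finite-support) kernel on an automatic grid.
grid, pdf = FFTKDE(kernel='tri', bw='silverman').fit(data).evaluate(2**10)

# Numerically integrate the PDF values on the grid to get a CDF estimate.
cdf = cumulative_trapezoid(pdf, grid, initial=0)
The resulting cdf array can be plotted against grid or interpolated wherever a CDF value is needed.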
Implementation considerations
Reading the docs, it seems rv_continuous is primarily designed to implement new continuous variables with an analytical definition.
Anyway, the class provides automatic resolution/integration for the other statistics if the underlying methods are not implemented (overridden).
When choosing this methodology, it is up to you to implement the missing logic if you wish to make it more performant and robust (numerical stability); one possible direction is sketched below.
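For instance (a sketch only, not part of the original answer), gaussian_kde already exposes integrate_box_1d, so overriding _cdf sidesteps the generic numerical integration that rv_continuous otherwise falls back on:
import numpy as np
from scipy import stats

class KDEDistWithCDF(stats.rv_continuous):
    """KDE-backed distribution with an explicit _cdf override (sketch)."""

    def __init__(self, kde, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._kde = kde

    def _pdf(self, x):
        return self._kde.pdf(x)

    def _cdf(self, x):
        # integrate_box_1d integrates the KDE between two bounds; loop over x
        # because rv_continuous may pass arrays.
        return np.array([self._kde.integrate_box_1d(-np.inf, xi)
                         for xi in np.atleast_1d(x)])
Usage is unchanged: X = KDEDistWithCDF(kde), then X.cdf(x).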
Histogram instead of KDE
If you can relax the KDE requirement and are satisfied with a histogram distribution, then you can rely on rv_histogram, which essentially does the same based on the binned distribution:
hist = np.histogram(data, bins=100)
hist_dist = stats.rv_histogram(hist)
stats.ks_2samp(data, hist_dist.rvs(size=100))
# KstestResult(statistic=0.0577, pvalue=0.8778871545532821)
Which returns a similar figure for the binned distribution.
KDE Histogram
Provided it is acceptable theoretically, we can mix both strategies by creating the expected histogram from the KDE:
hist = np.histogram(data, bins=1000)
hist_kde = kde.pdf(hist[1][:-1] + np.diff(hist[1]) / 2)  # KDE evaluated at the bin centres
hist_dist_kde = stats.rv_histogram([hist_kde, hist[1]])
stats.ks_2samp(data, hist_dist_kde.rvs(size=100))
# KstestResult(statistic=0.1067, pvalue=0.19541766226890545)
Then the CDF is relatively smooth with respect to the KDE (although it is still a histogram), and the continuous-variable object is as performant as rv_histogram can be. A quick usage check follows.
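As that usage check (a sketch; the evaluation grid is arbitrary and not part of the original answer), both objects expose cdf in the same way, though the rv_histogram-based one is far cheaper to evaluate:
x = np.linspace(-4, 4, 200)  # arbitrary evaluation grid
fig, axe = plt.subplots()
axe.plot(x, X.cdf(x), label='rv_continuous KDE (slow)')
axe.plot(x, hist_dist_kde.cdf(x), label='rv_histogram from KDE (fast)')
axe.legend()
plt.show()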

Related

Getting random numbers from a truncated Maxwell-Boltzmann distribution

I would like to generate random numbers using a truncated Maxwell-Boltzmann distribution. I know that scipy has a built-in Maxwell random variable, but there is no truncated version of it (I am also aware of the truncated normal distribution, which is irrelevant here). I have tried to write my own random variable by subclassing rv_continuous:
import numpy as np
import scipy.stats as st
from scipy.special import erf

# v_0 and v_esc are assumed to be defined beforehand (see the parameters further down)
class maxwell_boltzmann_pdf(st.rv_continuous):
    def _pdf(self, x):
        n_0 = np.power(np.pi, 3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
        return (1/n_0)*(4*np.pi*np.square(x))*np.exp(-np.square(x/v_0))*np.heaviside(v_esc-x, 0)

maxwell_boltzmann_cv = maxwell_boltzmann_pdf(a=0, b=v_esc, name='maxwell_boltzmann_pdf')
This does exactly what I want, but it is way too slow for my purpose (I am doing Monte Carlo simulations), even if I draw all the random velocities outside of all the loops. I have also thought of using the inverse transform sampling method, but the inverse of the CDF does not have an analytic form and I would need to do a bisection for every number I draw. It would be great if there were a convenient way to generate random numbers from a truncated Maxwell-Boltzmann distribution with decent speed.
There are several things you can do here.
For fixed parameters v_esc and v_0, n_0 is a constant, so it doesn't need to be calculated in the pdf method.
If you define only a PDF for a SciPy rv_continuous subclass, then the class's rvs, mean, and so on will be very slow, presumably because the method needs to integrate the PDF every time it generates a random variate or calculates a statistic. If speed is at a premium, you will thus need to add to maxwell_boltzmann_pdf an _rvs method that uses its own sampler. (See also this question.) One possible method is rejection sampling: draw a point in a bounding box until the point falls under the PDF. It works for any bounded PDF with a finite domain, as long as you know what the domain and bound are (the bound is the maximum value of f in the domain). See this question for example Python code; a sketch for this particular PDF also follows this paragraph.
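For illustration (a minimal sketch, not the answerer's code; v_0 and v_esc take the values used later in this thread, and the PDF bound is simply estimated on a grid), rejection sampling for this truncated PDF could look like:
import numpy as np
from scipy.special import erf

v_0, v_esc = 220.0, 550.0  # values used in the question's follow-up code

def f_MB(v):
    n_0 = np.power(np.pi, 3/2) * np.square(v_0) * (
        v_0 * erf(v_esc / v_0)
        - (2 / np.sqrt(np.pi)) * v_esc * np.exp(-np.square(v_esc / v_0)))
    return (1 / n_0) * 4 * np.pi * np.square(v) * np.exp(-np.square(v / v_0)) * (v <= v_esc)

# Bound of the PDF on [0, v_esc], estimated on a fine grid.
grid = np.linspace(0.0, v_esc, 10_000)
f_max = f_MB(grid).max()

def rejection_sample(n, rng=np.random.default_rng()):
    out = np.empty(0)
    while out.size < n:
        x = rng.uniform(0.0, v_esc, size=2 * n)       # candidate abscissae
        y = rng.uniform(0.0, f_max, size=2 * n)       # heights in the bounding box
        out = np.concatenate([out, x[y < f_MB(x)]])   # keep points falling under the PDF
    return out[:n]

samples = rejection_sample(100_000)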
If you know the distribution's CDF, then there are some additional tricks. One of them is the relatively new k-vector sampling method for sampling a continuous distribution. There are two phases: a setup phase and a sampling phase. The setup phase involves approximating the CDF's inverse via root finding, and the sampling phase uses this approximation to generate random numbers that follow the distribution in a very fast way without having to further evaluate the CDF. For a fixed distribution like this one, if you show me the CDF, I can precalculate the necessary data and the code needed to sample the distribution using that data. Essentially, the only non-trivial part of k-vector sampling is the root-finding step.
More information on sampling from an arbitrary distribution is found on my sampling methods page.
It turns out that there is a way to generate a truncated Maxwell-Boltzmann distribution with the inverse transform sampling method using the ppf feature of scipy. I am posting the code here for future reference.
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import erf
from scipy.stats import maxwell
# parameters
v_0 = 220 #km/s
v_esc = 550 #km/s
N = 10000
# CDF(v_esc)
cdf_v_esc = maxwell.cdf(v_esc,scale=v_0/np.sqrt(2))
# pdf for the distribution
def f_MB(v_mag):
    n_0 = np.power(np.pi,3/2)*np.square(v_0)*(v_0*erf(v_esc/v_0)-(2/np.sqrt(np.pi))*v_esc*np.exp(-np.square(v_esc/v_0)))
    return (1/n_0)*(4*np.pi*np.square(v_mag))*np.exp(-np.square(v_mag/v_0))*np.heaviside(v_esc-v_mag,0)
# plot the pdf
x = np.arange(600)
y = [f_MB(i) for i in x]
plt.plot(x,y,label='pdf')
# use inverse transform sampling to get the truncated Maxwell-Boltzmann distribution
sample = maxwell.ppf(np.random.rand(N)*cdf_v_esc,scale=v_0/np.sqrt(2))
# plot the histogram of the samples
plt.hist(sample,bins=100,histtype='step',density=True,label='histogram')
plt.xlabel('v_mag')
plt.legend()
plt.show()
This code generates the required random variables and compares their histogram with the analytic form of the PDF; the two match each other pretty well.

Is there a fast alternative to scipy _norm_pdf for correlated distribution sampling?

I have fit a series of SciPy continuous distributions for a Monte Carlo simulation and am looking to take a large number of samples from these distributions. However, I would like to be able to take correlated samples, such that the i-th sample takes, e.g., the 90th percentile from each of the distributions.
In doing this, I've found a quirk in SciPy performance:
# very fast way to take many uncorrelated samples of length n
for shape, loc, scale in distro_props:
    sp.stats.norm.rvs(*shape, loc=loc, scale=scale, size=n)

# verrrrryyyyy slow way to take correlated samples of length n
correlate = np.random.uniform(size=n)
for shape, loc, scale in distro_props:
    sp.stats.norm.ppf(correlate, *shape, loc=loc, scale=scale)
Most of the results about this claim that the slowness of these SciPy distributions comes from the type-checking etc. wrappers. However, when I profiled the code, the vast bulk of the time was spent in the underlying math function _continuous_distns.py:179(_norm_pdf). Furthermore, it scales with n, implying that it's looping through every element internally.
The SciPy docs on rv_continuous almost seem to suggest that the subclass should override this for performance, but it seems bizarre that I would monkeypatch into SciPy to speed up their ppf. I would just compute this for the normal from the ppf formula, but I also use lognormal and skewed normal, which are more of a pain to implement.
So, what is the best way in Python to compute a fast ppf for normal, lognormal, and skewed normal distributions? Or more broadly, to take correlated samples from several such distributions?
If you need just the normal ppf, it is indeed puzzling that it is so slow, but you can use scipy.special.erfinv instead:
import numpy as np
from scipy import special, stats
from timeit import timeit

x = np.random.uniform(0, 1, 100)
np.allclose(special.erfinv(2*x - 1)*np.sqrt(2), stats.norm().ppf(x))
# True
timeit(lambda: stats.norm().ppf(x), number=1000)
# 0.7717257660115138
timeit(lambda: special.erfinv(2*x - 1)*np.sqrt(2), number=1000)
# 0.015020604943856597
EDIT:
the lognormal and triangular distributions are also straightforward:
c = np.random.uniform()
np.allclose(np.exp(c*special.erfinv(2*x-1)*np.sqrt(2)),stats.lognorm(c).ppf(x))
# True
np.allclose(((1-np.sqrt(1-(x-c)/((x>c)-c)))*((x>c)-c))+c,stats.triang(c).ppf(x))
# True
I'm not familiar enough with the skew normal, unfortunately.
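To connect this back to the original goal of correlated samples (a sketch with hypothetical marginal parameters, not part of the answer above): drawing a single vector of uniforms and pushing it through each closed-form ppf gives perfectly rank-correlated samples:
import numpy as np
from scipy import special

n = 10**6
u = np.random.uniform(size=n)                 # shared uniforms drive the correlation
z = special.erfinv(2*u - 1) * np.sqrt(2)      # standard normal ppf

# Hypothetical marginal parameters for illustration.
normal_samples = 10.0 + 2.0 * z               # Normal(loc=10, scale=2)
lognormal_samples = np.exp(0.5 * z)           # lognorm(s=0.5, scale=1), cf. the identity above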
Ultimately, this issue was caused by my use of the skew-normal distribution. The ppf of the skew normal does not have a closed-form analytic definition, so in order to compute the ppf, SciPy fell back to rv_continuous's numerical approximation, which involves iteratively computing the CDF and using that to zero in on the ppf value. The skew-normal PDF is the product of the normal PDF and the normal CDF, so this numerical approximation called the normal's pdf and cdf many, many times. This is why, when I profiled the code, it looked like the normal distribution was the problem rather than the skew normal. The other answer to this question achieved time savings by skipping type-checking, but didn't actually change the run-time growth, just the small-n runtimes.
To solve this problem, I have replaced the skew-normal distribution with the Johnson SU distribution. It has two more free parameters than a normal distribution, so it can fit different types of skew and kurtosis effectively. It is supported on all real numbers, and it has a closed-form ppf definition with a fast implementation in SciPy. As an example, I have been fitting Johnson SU distributions from the 10th, 50th, and 90th percentiles.
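A minimal sketch of such a percentile fit (the target percentile values and starting point below are hypothetical; with four free parameters you would normally pin down or regularize at least one of them, or supply more target percentiles):
import numpy as np
from scipy import optimize, stats

# Hypothetical target percentiles: {probability: value}.
targets = {0.10: 8.0, 0.50: 12.0, 0.90: 20.0}
probs = np.array(list(targets.keys()))
values = np.array(list(targets.values()))

def percentile_error(params):
    a, b, loc, scale = params
    if b <= 0 or scale <= 0:
        return np.inf                       # keep parameters in the valid domain
    return np.sum((stats.johnsonsu.ppf(probs, a, b, loc=loc, scale=scale) - values) ** 2)

res = optimize.minimize(percentile_error, x0=[0.0, 1.0, 12.0, 5.0], method='Nelder-Mead')
a, b, loc, scale = res.x
dist = stats.johnsonsu(a, b, loc=loc, scale=scale)

# The closed-form ppf makes correlated sampling cheap.
correlate = np.random.uniform(size=10**6)
samples = dist.ppf(correlate)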

Probability Distribution Function Python

I have a set of raw data and I have to identify the distribution of that data. What is the easiest way to plot a probability distribution function? I have tried fitting it to a normal distribution.
But I am more curious to know which distribution the data carries within itself.
I have no code to show my progress, as I have failed to find any functions in Python that will allow me to test the distribution of the dataset. I do not want to slice the data and force it to fit into, say, a normal or skew distribution.
Is there any way to determine the distribution of the dataset? Any suggestion appreciated.
Is this a correct approach? Example
This is something close to what I am looking for, but again it fits the data to a normal distribution. Example
EDIT:
The input has a million rows; a short sample is given below:
Hashtag,Frequency
#Car,45
#photo,4
#movie,6
#life,1
The frequency ranges from 1 to 20,000, and I am trying to identify the distribution of the frequency of the keywords. I tried plotting a simple histogram, but I get the output as a single bar.
Code:
import pandas
import matplotlib.pyplot as plt
df = pandas.read_csv('Paris_random_hash.csv', sep=',')
plt.hist(df['Frequency'])
plt.show()
This is a minimal working example for showing a histogram. It only solves part of your question, but it can be a step towards your goal. Note that the histogram function gives you the values at the two edges of the bin, and you have to interpolate to get the center value.
import numpy as np
import matplotlib.pyplot as pl
x = np.random.randn(10000)
nbins = 20
n, bins = np.histogram(x, nbins, density=1)
pdfx = np.zeros(n.size)
pdfy = np.zeros(n.size)
for k in range(n.size):
    pdfx[k] = 0.5*(bins[k]+bins[k+1])
    pdfy[k] = n[k]
pl.plot(pdfx, pdfy)
You can fit your data using the example shown in:
Fitting empirical distribution to theoretical ones with Scipy (Python)?
Definitely a stats question - sounds like you're trying to do a probability test of whether the distribution is significantly similar to the normal, lognormal, binomial, etc. distributions. The easiest is to test for normal or lognormal as explained below.
Set your p-value cutoff; usually, if your p-value is <= 0.05, then the data is NOT normally distributed.
In Python, use SciPy. You just need the p-value for the test, so there are two return values from this function (I'm ignoring optional, not needed, inputs here for clarity):
import scipy.stats
W, Pvalue = scipy.stats.shapiro(x)
Perform the Shapiro-Wilk test for normality. The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.
If you want to see if it is lognormally distributed (provided it doesn't pass the P test above), you can try:
import numpy
W, Pvalue = scipy.stats.shapiro(numpy.log(x))
Interpret it the same way - I just tested on a known lognormally distributed simulation and got a 0.17 p-value on the numpy.log(x) test, and a number close to 0 for the standard shapiro(x) test. That tells me the lognormal distribution is the better choice; the normal distribution fails miserably.
I kept it simple, which is what I gathered you are looking for. For other distributions, you may need to do more work along the lines of Q-Q plots https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot rather than simply following the few tests I proposed. That means plotting the distribution you are trying to fit against your actual data. Here's a quick example that can get you down that path if you so desire:
import numpy as np
import pylab
import scipy.stats as stats
mydata = ...  # the data you are looking to fit to a distribution
stats.probplot(mydata, dist='norm', plot=pylab)
pylab.show()
Above you can substitute anything for dist='norm' from the scipy library http://docs.scipy.org/doc/scipy/reference/tutorial/stats/continuous.html#continuous-distributions-in-scipy-stats
then find its SciPy name (you must add shape parameters according to the documentation, such as stats.probplot(mydata, dist='loggamma', sparams=(1,1), plot=pylab) or, for Student's t, stats.probplot(mydata, dist='t', sparams=(1,), plot=pylab)), then look at the plot and see how closely your data follow that distribution. If the data points are close, you've found your distribution. It also gives you an R^2 value on the graph; the closer to 1, the better the fit generally.
And if you want to continue trying to do what you're doing with the dataframe, try changing to: plt.hist(df['Frequency'].values)
Did you try using the seaborn library? They have a nice kernel density estimation function. Try:
import seaborn as sns
sns.kdeplot(df['Frequency'])
You can find installation instructions here.
The only distribution the data carry within itself is the empirical probability. If you have your data as a 1-D NumPy array data, you can compute the value of the empirical distribution function at x as the cumulative relative frequency of the values less than or equal to x:
data[data <= x].size / data.size
This is a step function so it does not have an associated probability density function but a probability mass function where the mass of each observed value is its relative frequency. To compute the relative frequencies:
values, freqs = np.unique(data, return_counts=True)
rfreqs = freqs / data.size
This does not mean that the data is a random sample from their empirical distribution. If you want to know what distribution your data are a sample from (if any) just by looking at the data, the answer is you can't. But that is more about statistics than about programming.
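For completeness (a small sketch; the array below is hypothetical stand-in data, not part of the original answer), the whole empirical CDF can be computed and plotted at once:
import numpy as np
import matplotlib.pyplot as plt

data = np.random.lognormal(size=1000)     # hypothetical 1-D data array
x = np.sort(data)
ecdf = np.arange(1, x.size + 1) / x.size  # cumulative relative frequencies
plt.step(x, ecdf, where='post')
plt.xlabel('x')
plt.ylabel('ECDF')
plt.show()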
The histogram does not do what you think it does; you are trying to show a bar graph. The histogram needs each data point separately in a list, not the frequency itself. You have [3,2,0,4,...] but should have [1,1,1,2,2,4,4,4,4], as sketched below. You cannot determine a probability distribution automatically.
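A sketch of that expansion (the counts here are the answer's own toy example):
import numpy as np
import matplotlib.pyplot as plt

counts = [3, 2, 0, 4]                        # frequency of each value 1..4
sample = np.repeat(np.arange(1, 5), counts)  # -> [1, 1, 1, 2, 2, 4, 4, 4, 4]
plt.hist(sample, bins=4)
plt.show()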
I think you are asking a slightly different question:
What is the correlation between my raw data and the curve to which I have mapped it?
This is a conceptual problem, and you are trying to understand the meanings of the values R and R-squared. Start by working through this MiniTab blog post. You may want to skim this non-Python KaleidaGraph guide to understand the classes of curves to fit and the usage of least mean squares in fitting the curves.
You were probably downvoted because it is a math question more than a programming question.
I may be missing something, but it seems that a major point is being overlooked across the board: the data set you are describing is a categorical data set. That is, the x-values are not numeric; they're just words (#Car, #photo, etc.). The concept of the shape of a probability distribution is meaningless for a categorical data set, since there is no logical ordering for the categories. What would a histogram even look like? Would #Car be the first bin? Or would it be all the way to the right of your graph? Unless you have some criterion for quantifying your categories, trying to make judgments based on the shape of the distribution is meaningless.
Here's a small text-based example to clarify what I'm saying. Suppose I survey a group of people and ask their favorite color. I plot the results:
Red | ##
Green | #####
Blue | #######
Yellow | #####
Orange | ##
Huh, looks like color preferences are normally distributed. Wait, what if I had randomly put the colors in a different order in my graph:
Blue | #######
Yellow | #####
Green | #####
Orange | ##
Red | ##
I guess the data is actually positively skewed? Not so, of course - for a categorical data set the shape of the distribution is meaningless. Only if you were to decide to somehow quantify each hashtag in your data set would the problem have meaning. Do you want to compare the length of a hashtag to its frequency? Or the alphabetical ordering of a hashtag to its frequency? Etc.

How to create dataset for fitting a function in scipy stats?

I want to fit some data to a Pareto distribution using the scipy.stats library. I am not sure if the issue might be numerical, so just to be safe: I have values measured for the dependent variable (let's call them 'pushes') against the independent variable ('minutes'), starting at a few thousand minutes and every ten minutes thereafter (with the exception of a few points that were removed during data cleaning).
e.g.
2780.0 362.0
2800.0 376.0
2810.0 393.0
...
The best info I can find says something like
from scipy.stats import pareto
result = pareto.fit(data)
and I have no idea how this data is to be formatted in this case. I've tried the following but all result in errors.
result = pareto.fit(zip(minutes, pushes))
result = pareto.fit(pushes)
The error is usually
Warning: invalid value encountered in double_scalars
I would greatly appreciate some guidance, thank you.
As I mentioned in the comments above, pareto.fit() is not what you're looking for.
The .fit() methods of the continuous distributions in scipy.stats obtain an estimate of the parameters of the distribution that maximise the probability of observing some particular set of sample values. Therefore, pareto.fit() wants only a single array argument containing the samples you want to fit the distribution to. The other keyword arguments control various aspects of the fitting process, for example by specifying initial values for the distribution parameters.
What you're actually trying to do is to fit the relationship between some independent variable x and some dependent variable y, i.e.
y_fit = f(x, params)
What you need to do is:
Choose some functional form for f. From your description, the plot of y vs x resembles the probability density function for a Pareto distribution, so perhaps either this or a decaying exponential might be appropriate.
Find the set of params that minimizes some measure of the difference between y and y_fit (usually the sum of squared differences). You could use scipy.optimize.curve_fit or scipy.optimize.minimize to do this; a sketch is given below.
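A minimal sketch of that second step (the three data rows are taken from the question as placeholders; in practice you would fit the full dataset, and the power-law form here is only one of the candidate shapes mentioned above):
import numpy as np
from scipy import optimize

# Sample rows from the question; replace with the full dataset.
minutes = np.array([2780.0, 2800.0, 2810.0])
pushes = np.array([362.0, 376.0, 393.0])

# Hypothetical model: power law y = A * x**k (Pareto-like tail when k < 0).
def f(x, A, k):
    return A * np.power(x, k)

# A rough initial guess from a linear fit in log-log space helps curve_fit converge.
k0, logA0 = np.polyfit(np.log(minutes), np.log(pushes), 1)
params, _ = optimize.curve_fit(f, minutes, pushes, p0=[np.exp(logA0), k0])
y_fit = f(minutes, *params)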

Compound model in PyMC

I'm trying to use PyMC 2.3 to obtain an estimate of the parameter of a compound model.
By "compound" I mean a random variable that follows a distribution whose whose parameter is another random variable. ("nested" or "hierarchical" are somtimes used to refer to this situation, but I think they are less specific and generate more confusion in this context).
Let make a specific example. The "real" data is generated from a compound distribution that is a Poisson with a parameter that is distributed as an Exponential. In plain scipy the data is generated as follows:
import numpy as np
from scipy.stats import distributions
np.random.seed(3) # for repeatability
nsamples = 1000
tau_true = 50
orig_lambda_sample = distributions.expon(scale=tau_true).rvs(nsamples)
data = distributions.poisson(orig_lambda_sample).rvs(nsamples)
I want to obtain an estimate of the model parameter tau_true.
My approach so far in modelling this problem in PyMC is the following:
tau = pm.Uniform('tau', 0, 100)
count_rates = pm.Exponential('count_rates', beta=1/tau, size=nsamples)
counts = pm.Poisson('counts', mu=count_rates, value=data, observed=True)
Note that I use size=nsamples to have a new stochastic variable for each sample.
Finally I run the MCMC as:
model = pm.Model([count_rates, counts, tau])
mcmc = pm.MCMC(model)
mcmc.sample(40000, 10000)
The model converges (although slowly, > 10^5 iterations) to a distribution centred around 50 (tau_true). However, it seems like overkill to define 1000 stochastic variables to fit a single distribution with a single parameter.
Is there a better way to describe a compound model in PyMC?
PS I also tried with a more informative prior (tau = pm.Normal('tau', mu=51, tau=1/2**2)) but the results are similar and the convergence is still slow.
It looks like what you are trying to model is data that is over-dispersed. In fact, a negative binomial distribution is just a Poisson whose mean is distributed according to a gamma distribution (of which the Exponential is a special case). So, one way to get around defining 1000 variables is to use the negative binomial directly; a quick check of this equivalence with scipy is sketched below. Keep in mind, though, that despite there being nominally 1000 variables, the effective number of variables is somewhere between 1 and 1000, depending on how constrained the distribution of means is. You are essentially defining a random effect here.
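As a sanity check of that equivalence (a sketch in plain scipy, not PyMC; an Exponential(scale=tau) prior is a Gamma with shape 1, which makes the mixture a negative binomial with n=1 and p=1/(1+tau)):
import numpy as np
from scipy import stats

np.random.seed(3)
nsamples = 10**5
tau_true = 50

# Compound sampling: Poisson rates drawn from an Exponential(scale=tau).
lam = stats.expon(scale=tau_true).rvs(nsamples)
compound = stats.poisson(lam).rvs(nsamples)

# Closed-form equivalent: a Gamma(shape=1, scale=tau)-Poisson mixture is nbinom(n=1, p=1/(1+tau)).
nb = stats.nbinom(n=1, p=1 / (1 + tau_true))

print(compound.mean(), nb.mean())  # both close to tau_true = 50
print(compound.var(), nb.var())    # both close to tau_true * (1 + tau_true) = 2550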
