Let's say I have a column x with uniformly distributed values.
To these values, I applied a CDF function.
Now I want to calculate the Gaussian copula, but I can't find the function in Python. I have already read that the Gaussian copula is something like the "inverse of the CDF function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian copula process to normalize an observation by applying n = Phi^-1(F(x)). Calculating F(x) yields a value u in [0, 1] representing the proportion of shaded area at the left. Then Phi^-1(u) yields a value n by matching the shaded area in a Gaussian distribution.
I need your help: does anyone have an idea how to calculate that?
I have two ideas so far:
1) gauss = 1/(sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2)
--> transform all the values to a new value with this (the Gaussian PDF)
2) norm.ppf(array, loc, scale)
--> give the ppf function the array, the mean, and the std, and it will calculate the inverse of the CDF for me. But I have doubts about #2.
The thing is,
n.cdf(n.ppf(0.95))
is not what I want. The reason I'm doing this is to transform a non-normal/non-Gaussian distribution into a normal distribution.
Like here:
Transform from a non-Gaussian distribution to a Gaussian distribution with a Gaussian copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found two links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
These posts say that you have to do the following:
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply norm.cdf(data, mean, std), I get a uniformly distributed CDF.
Compare:
import numpy as np
from scipy.stats import norm as n

data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
# Evaluate the normal CDF at each value, using the sample mean and std
cdf = n.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do this part:
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function / inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use, for example, the norm.ppf function, the values are not reasonable.
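For what it's worth, here is a minimal sketch of the two-step transform described above. It assumes the empirical CDF stands in for F(x); scipy.stats.rankdata and the (n+1) scaling are my own choices, not something from the linked posts. The second step is just norm.ppf:

import numpy as np
from scipy.stats import norm, rankdata

data = np.array([1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8)

# Step 1: F(x) -- estimated with the empirical CDF, scaled into (0, 1) so that
# norm.ppf never sees exactly 0 or 1 (which would map to -inf/+inf).
u = rankdata(data) / (len(data) + 1)

# Step 2: Phi^-1(u) -- map the uniform values onto a standard normal
gaussian_values = norm.ppf(u)
print(gaussian_values)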
Related
For the data yielding the histogram below, I used the gamma.fit(data) function. It yields (0.2856629839547822, 0.001612367540316285, 1.3126526078419007), which should be the alpha, loc, and scale parameters of the distribution. However, the mean and standard deviation of the dataset are m=0.04181341484525036 and s=0.02581912984507876. The fitted PDF appears to be zero below the mean (m) value. I couldn't find any questions about this problem. What am I missing?
Histogram of the data
The PDF is most definitely not zero for all values below your calculated mean. In fact, over 40% of the PDF's area resides at x <= m:
from scipy import stats

# Gamma distribution frozen with the fitted shape (alpha), loc, and scale
g = stats.gamma(0.2856629839547822, 0.001612367540316285, 1.3126526078419007)
m = 0.04181341484525036
print(g.cdf(m))
# 0.4078...
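To make this concrete, here is a small check (the probe points in xs are arbitrary, my own choice) showing that the fitted PDF is strictly positive below the mean; with a shape parameter of roughly 0.29 the density actually piles up towards the lower end of the support:

import numpy as np
from scipy import stats

g = stats.gamma(0.2856629839547822, 0.001612367540316285, 1.3126526078419007)
m = 0.04181341484525036

xs = np.linspace(0.005, m, 5)   # a few probe points below the mean
print(g.pdf(xs))                # all strictly positive
print(g.cdf(m))                 # ~0.41: over 40% of the mass lies at x <= m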
I am trying to understand the implementation used in
scipy.stats.wasserstein_distance
For p=1 and no weights, with u_values and v_values the two 1-D distributions, the code comes down to:
u_sorter = np.argsort(u_values)  # (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values))  # (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values)  # (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')  # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size  # (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))  # (6)
What is the reasoning behind this implementation? Is there some literature?
I did look at the paper cited, which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral
\int_{-\infty}^{+\infty} |U(x) - V(x)| dx,
with U and V the cumulative distribution functions of the distributions u_values and v_values,
but I don't understand how this integral is evaluated in the scipy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to evaluate the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of u_values and v_values is not preserved. Shouldn't that be the case under the general Wasserstein distance definition?
Thank you for your help!
The order of the PDF, histogram or KDE is preserved and is important in the Wasserstein distance. If you only pass u_values and v_values, then the function has to compute something like a PDF, KDE or histogram itself. Normally you would provide the PDF and the range of U and V as the four arguments to wasserstein_distance. So in the case where only samples are provided, you are not passing a real datapoint, simply a collection of repeated "experiments". Steps (1) and (4) in your list of code blocks basically bin your data by the distinct values that occur. A CDF counts the number of values up to a given point, i.e. P(X <= x). The CDF is basically the cumulative sum of a PDF, histogram or KDE. Step (5) normalizes the CDF to between 0.0 and 1.0, or said another way, it divides the counts by the sample size.
So the order of the distinct values is preserved, not the original order within the datapoint.
b) It may make more sense if you plot the CDFs of a datapoint, such as an image file, using the code above.
The transportation problem, however, may not need a PDF, but rather a datapoint of ordered features, or some way to measure distance between features, in which case you would calculate it differently.
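To tie this back to steps (1)-(6), here is a small standalone sketch (my own, not from either post; the two arrays are arbitrary examples) that rebuilds the two empirical CDFs on the merged grid of values and checks the result against scipy.stats.wasserstein_distance:

import numpy as np
from scipy.stats import wasserstein_distance

u_values = np.array([3.0, 1.0, 2.0])
v_values = np.array([4.0, 2.0, 2.5, 1.5])

# Merge and sort all observed values; the empirical CDFs only change at these points.
all_values = np.sort(np.concatenate((u_values, v_values)))
deltas = np.diff(all_values)   # widths of the segments between consecutive values

# Empirical CDF of each sample at the left edge of every segment:
# the fraction of that sample that is <= the grid point.
u_cdf = np.searchsorted(np.sort(u_values), all_values[:-1], side='right') / u_values.size
v_cdf = np.searchsorted(np.sort(v_values), all_values[:-1], side='right') / v_values.size

# Both CDFs are step functions, so the integral of |U - V| is a finite sum of
# rectangle areas: the constant |U - V| on each segment times the segment width.
manual = np.sum(np.abs(u_cdf - v_cdf) * deltas)
print(manual, wasserstein_distance(u_values, v_values))  # the two values agree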
I have two lists. Both contain normalized percentages:
actual_population_distribution = [0.2,0.3,0.3,0.2]
sample_population_distribution = [0.1,0.4,0.2,0.3]
I wish to fit these two lists to a gamma distribution and then use the two returned lists to compute the KL divergence.
I am already able to get a KL value.
This is the function I used to generate the gamma samples:
import random
import numpy as np

def gamma_random_sample(data_list):
    # Method-of-moments estimates of the gamma shape and rate from the list
    mean = np.mean(data_list)
    var = np.var(data_list)
    g_alpha = mean * mean / var
    g_beta = mean / var
    for i in range(len(data_list)):
        # gammavariate takes (shape, scale), so pass 1/rate as the scale
        yield random.gammavariate(g_alpha, 1 / g_beta)
Fit the two lists to a gamma distribution:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
This is the code I used to calculate KL:
kl = np.sum(scipy.special.kl_div(actual_grs, sample_grs))
The code above does not produce any errors.
But I suspect the way I handled the gamma distribution is wrong, because I used np.mean/np.var to get the mean and variance.
Indeed, the numbers are different from what I get with:
mean, var, skew, kurt = gamma.stats(fit_alpha, loc=fit_loc, scale=fit_beta, moments='mvsk')
if I use that approach.
When I use "mean, var, skew, kurt = gamma.stats(fit_alpha, loc=fit_loc, scale=fit_beta, moments='mvsk')", I get a KL value way larger than 1, so both ways seem invalid for getting a correct KL value.
What am I missing?
See this Cross Validated post: https://stats.stackexchange.com/questions/280459/estimating-gamma-distribution-parameters-using-sample-mean-and-std
I don't understand what you are trying to do with:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
It doesn't look like you are fitting a gamma distribution. It looks like you are using the method-of-moments estimator to get the parameters of a gamma distribution, and then drawing a single random number for each element of your actual_/sample_population_distribution lists, given the distribution statistics of that list.
The gamma distribution is notoriously hard to fit. I hope your actual data is a longer list -- 4 data points are hardly sufficient for estimating a two-parameter distribution. The estimates are kind of garbage until you get hundreds of elements or more; take a look at this document on the MLE estimator and the Fisher information of the gamma distribution: https://www.math.arizona.edu/~jwatkins/O3_mle.pdf
I don't know what you are trying to do with the KL divergence either. Your actual population is already normalized to 1, and so is the sample distribution. You can plug those elements directly into the KL divergence for a discrete score. What your code actually does is stretch your original list values and add gamma noise to them through your gamma function, and you are likely to see a larger KL divergence after this gamma corruption of your original population data.
I'm sorry, I just don't see what you are trying to accomplish here. If I had to guess your original intent, I'd say your problem is that you need hundreds of data points to guarantee convergence with any gamma-fitting program.
EDIT: I just wanted to add something regarding the KL divergence. If you intend to score your fitted gamma distributions with the KL divergence, it's better to use an analytical solution where the shape and scale parameters of your two gamma distributions are the two inputs. Randomly sampling noisy data points won't be helpful unless you take 100,000 random samples, histogram them into 1,000 bins or so, and then normalize the histogram -- I'm just throwing those numbers out, but you want to approximate a continuous distribution as well as you can, and that is hard because gamma distributions have long tails. This document has the analytical solution for a generalized gamma distribution: https://arxiv.org/pdf/1401.6853.pdf . Just set the third parameter to 1, simplify, and then code up a function.
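As a sketch of that analytical route, here is the standard closed-form KL divergence between two ordinary (not generalized) gamma distributions in the shape/rate parametrization; the parameter values in the example call are placeholders, not derived from the question's data:

import numpy as np
from scipy.special import gammaln, digamma

def gamma_kl(alpha_p, beta_p, alpha_q, beta_q):
    # KL( Gamma(shape=alpha_p, rate=beta_p) || Gamma(shape=alpha_q, rate=beta_q) )
    return ((alpha_p - alpha_q) * digamma(alpha_p)
            - gammaln(alpha_p) + gammaln(alpha_q)
            + alpha_q * (np.log(beta_p) - np.log(beta_q))
            + alpha_p * (beta_q - beta_p) / beta_p)

# Example with made-up parameters (e.g. from a method-of-moments estimate)
print(gamma_kl(2.0, 3.0, 2.5, 2.0))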
I have an array, called gaussian_array, which is made of a series of numbers that, once plotted, form a Gaussian to a good approximation.
I need to estimate the \sigma of this Gaussian, but I am not allowed to use a fit of any kind. What I have tried so far is to find the peak of the Gaussian, which is given by the first element of the array, gaussian_array[0] (the Gaussian is centred around the origin), and then I thought it could be useful to use the FWHM and the well-known relation between \sigma and the FWHM.
However, I do not know exactly how to implement this in Python. I thought of writing something like
for i in range(len(gaussian_array)):
    if gaussian_array[i] == FWHM:
        sigma = gaussian_array[i] / (2. * np.sqrt(2. * np.log(2)))
but I don't think that's a reliable procedure, because it will not always be true that some element of gaussian_array EXACTLY coincides with the calculated FWHM. I cannot even calculate the standard deviation as the sum of the squares of the differences between the values and the origin.
So, how could I estimate the sigma of this gaussian_array?
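For reference, here is a rough sketch of the FWHM idea from the question. It assumes the array holds the curve values sampled on a uniform grid starting at the peak, and that the grid spacing dx is known (with dx = 1 the result is in index units); the linear interpolation is my own choice to avoid requiring an element to land exactly on the half maximum:

import numpy as np

def sigma_from_fwhm(gaussian_array, dx=1.0):
    # Estimate sigma from the half width at half maximum, found by linear interpolation.
    y = np.asarray(gaussian_array, dtype=float)
    half_max = y[0] / 2.0                 # the peak is the first element
    i = np.argmax(y < half_max)           # first index that drops below half max
    # Interpolate between the last point above and the first point below half max
    frac = (y[i - 1] - half_max) / (y[i - 1] - y[i])
    half_width = (i - 1 + frac) * dx      # half of the FWHM (the curve is symmetric)
    fwhm = 2.0 * half_width
    return fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))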
I am confused why you would go to such great lengths to calculate a standard deviation. In your post it seems you are trying to get \sigma from that relation.
If you are simply trying to obtain the standard deviation, just use numpy:
import numpy as np
# method 1 - use np.std() on a python data structure
sigma = np.std(gaussian_array)
# method 2 - convert to numpy array and use .std() method
gaussian_array = np.asarray(gaussian_array)
sigma = gaussian_array.std()
I am trying to perform inverse sampling from a custom probability density function (PDF). I am wondering whether this is even possible, i.e. integrating the PDF, inverting the result and then solving it for a given uniform number. The PDF has the shape
f(x, alpha, mean(x)) = (1/(Gamma(alpha+1)*x)) * ((x*(alpha+1)/mean(x))^(alpha+1)) * exp(-(alpha+1)*x/mean(x)),
where x > 0. From the shape, only values below 150 are relevant, and for what I am trying to do values below 80 are good enough. Extending the range shouldn't be too hard, though.
I have tried the inversion method, but only found a numerical way to do the integral, which isn't necessarily helpful considering that I need to invert the function to solve
u = integral(f(x, alpha, mean(x)) dx) from 0 to y,
where y is unknown and u is a uniform random variable between 0 and 1.
The integral involves a gamma function and an incomplete gamma function, so trying to invert it by hand is kind of a mess. Any help is welcome.
Thanks a bunch in advance.
Cheers
Assuming you mean that you're trying to randomly choose values which will be distributed according to your PDF, then yes, it is possible. This is described on Wikipedia as inverse transform sampling. Basically, it's just what you said: integrate the PDF to produce the cumulative distribution (CDF), invert it (which can be done ahead of time), and then choose a random number and run it through the inverted CDF.
If your domain is 0 to positive infinity, your distribution appears to match the gamma distribution, which is built into NumPy and SciPy, with shape k = alpha + 1 and scale theta = mean(x) / (alpha + 1).
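As a rough sketch of both routes, assuming that gamma identification (alpha and mean_x below are placeholder values): either sample the gamma distribution directly, or do the explicit inverse-transform step by pushing uniform draws through the gamma percent point function (the inverse CDF):

import numpy as np
from scipy import stats

alpha, mean_x = 2.0, 30.0          # placeholder parameters
k = alpha + 1.0                    # gamma shape
theta = mean_x / (alpha + 1.0)     # gamma scale

rng = np.random.default_rng(0)

# Route 1: use the built-in gamma sampler directly
direct = rng.gamma(k, theta, size=5)

# Route 2: inverse transform sampling -- uniform draws through the inverse CDF (ppf)
u = rng.uniform(size=5)
inverse = stats.gamma.ppf(u, k, scale=theta)

print(direct)
print(inverse)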