I have two lists. Both contain normalized percentages:
actual_population_distribution = [0.2,0.3,0.3,0.2]
sample_population_distribution = [0.1,0.4,0.2,0.3]
I want to fit these two lists to a gamma distribution and then compute the KL divergence between the two resulting lists.
I am already able to get a KL value.
This is the function I use for the gamma part:
import random
import numpy as np

def gamma_random_sample(data_list):
    # Method-of-moments estimates of the gamma shape/rate parameters
    mean = np.mean(data_list)
    var = np.var(data_list)
    g_alpha = mean * mean / var
    g_beta = mean / var
    # Draw one gamma variate per input element (gammavariate takes shape and scale)
    for i in range(len(data_list)):
        yield random.gammavariate(g_alpha, 1 / g_beta)
Then I fit both lists to a gamma distribution:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
This is the code I used to calculate KL:
kl = np.sum(scipy.special.kl_div(actual_grs, sample_grs))
The code above does not produce any errors.
But I suspect the way I handle the gamma part is wrong, because I use np.mean/np.var to get the mean and variance.
Indeed, the numbers are different from what I get with
mean, var, skew, kurt = gamma.stats(fit_alpha, loc=fit_loc, scale=fit_beta, moments='mvsk')
if I use that approach instead.
Using "mean, var, skew, kurt = gamma.stats(fit_alpha, loc=fit_loc, scale=fit_beta, moments='mvsk')", I get a KL value far larger than 1, so both ways seem invalid for getting a correct KL.
What do I miss?
See this Cross Validated post: https://stats.stackexchange.com/questions/280459/estimating-gamma-distribution-parameters-using-sample-mean-and-std
I don't understand what you are trying to do with:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
It doesn't look like you are fitting a gamma distribution. It looks like you are using the method-of-moments estimator to get the parameters of a gamma distribution, and then drawing a single random number for each element of your actual/sample_population_distribution lists, given the summary statistics of that list.
The gamma distribution is notoriously hard to fit. I hope your actual data is a longer list -- 4 data points are hardly sufficient for estimating a two-parameter distribution. The estimates are essentially garbage until you have hundreds of elements or more; take a look at this document on the MLE estimator and the Fisher information of a gamma distribution: https://www.math.arizona.edu/~jwatkins/O3_mle.pdf .
I don't know what you are trying to do with the KL divergence either. Your actual population is already normalized to 1, and so is the sample distribution. You can plug those elements directly into the KL divergence for a discrete score. What your code actually does is stretch your original list values and add gamma noise via your defined gamma function, and you are likely to end up with a larger KL divergence after that gamma corruption of your original population data.
I'm sorry, I just don't see what you are trying to accomplish here. If I were to guess your original intent, I'd say your problem is that you need hundreds of data points to guarantee convergence with any gamma fitting program.
EDIT: One more note on the KL divergence. If you intend to score your fitted gamma distributions with the KL divergence, it is better to use an analytical solution whose inputs are the shape and scale parameters of your two gamma distributions. Randomly sampling noisy data points won't help unless you take something like 100,000 random samples, histogram them into 1,000 bins or so, and then normalize the histogram -- I'm just throwing those numbers out, but you want to approximate a continuous distribution as well as you can, and that is hard because gamma distributions have long tails. This document has the analytical solution for a generalized gamma distribution: https://arxiv.org/pdf/1401.6853.pdf . Just set the third parameter to 1, simplify, and code up a function.
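For reference, here is a minimal sketch of what that simplification could look like for two ordinary gamma distributions in the shape/rate parameterisation. The name kl_gamma and the example inputs are just illustrative; this is my own rendering of the standard closed form, not code from that paper or from scipy:

import numpy as np
from scipy.special import gammaln, digamma

def kl_gamma(alpha_p, beta_p, alpha_q, beta_q):
    # KL(P || Q) for P = Gamma(shape=alpha_p, rate=beta_p), Q = Gamma(shape=alpha_q, rate=beta_q)
    return ((alpha_p - alpha_q) * digamma(alpha_p)
            - gammaln(alpha_p) + gammaln(alpha_q)
            + alpha_q * (np.log(beta_p) - np.log(beta_q))
            + alpha_p * (beta_q - beta_p) / beta_p)

# Example with arbitrary parameter values; plug in the method-of-moments
# estimates (g_alpha, g_beta) from each of your two lists instead.
print(kl_gamma(2.0, 3.0, 2.5, 2.0))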
I am trying to understand the implementation that is used in
scipy.stats.wasserstein_distance
for p=1 and no weights. With u_values and v_values the two 1-D samples, the code comes down to:
u_sorter = np.argsort(u_values) (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values)) (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values) (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right') (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas)) (6)
What is the reasoning behind this implementation, is there some literature?
I did look at the cited paper, which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral
\int_{-\infty}^{+\infty} |U(x) - V(x)| \, dx,
with U and V the cumulative distribution functions of the distributions behind u_values and v_values,
but I don't understand how this integral is evaluated in the scipy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distribution u_values and v_values is not preserved. Shouldn't this be the case in the general Wasserstein distance definition?
Thank you for your help!
The order of the PDF, histogram or KDE is preserved, and it matters for the Wasserstein distance. If you only pass u_values and v_values, the function has to build something like a PDF, KDE or histogram itself. Normally you would provide the PDF values and the support of U and V as the 4 arguments to wasserstein_distance; when only samples are provided you are not passing a real PDF, simply a collection of repeated "experiments". Steps (1) and (4) in your list of code blocks essentially bin your data at the distinct sorted values. A CDF at a point x is the number of values up to x, i.e. P(X <= x) after normalization, and it is the cumulative sum of a PDF, histogram or KDE. Step (5) normalizes the CDF to lie between 0.0 and 1.0 by dividing the counts by the number of samples.
So the order of the distinct values is preserved, not the original order of the data points.
(b) It may make more sense if you plot the CDFs of a data set, such as an image file, using the code above.
The transportation problem, however, may not need a PDF but rather a set of ordered features, or some other way to measure distance between features, in which case you would calculate it differently.
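To make (a) and (b) concrete, here is a condensed sketch of the same computation; w1 is just an illustrative name, and the last line checks it against scipy's own function:

import numpy as np
from scipy.stats import wasserstein_distance

def w1(u_values, v_values):
    u_values = np.asarray(u_values, dtype=float)
    v_values = np.asarray(v_values, dtype=float)
    # The empirical CDFs are step functions that can only jump at the pooled sample
    # values, so |U - V| is piecewise constant between consecutive sorted values.
    all_values = np.sort(np.concatenate((u_values, v_values)))
    deltas = np.diff(all_values)  # widths of the constant pieces
    # Empirical CDF of each sample on each piece: fraction of points <= its left edge
    u_cdf = np.searchsorted(np.sort(u_values), all_values[:-1], side='right') / u_values.size
    v_cdf = np.searchsorted(np.sort(v_values), all_values[:-1], side='right') / v_values.size
    # Riemann sum of the integral of |U(x) - V(x)| dx
    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

u = np.random.randn(200)
v = np.random.randn(200) + 1.0
print(w1(u, v), wasserstein_distance(u, v))  # the two numbers should agree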
I'm new to coding in python, and want to get parameters from a data set that I know from theory is most likely t-distributed. The first method I tried was using t.fit(). To double check the results I also used st.stats.describe(), and noticed I got different results. I also used t.stats() to get the moments "mvsk". I'm not sure what the different functions do, and which results to trust. The parameters are later going to be used in a Monte Carlo Simulation. Can somebody explain the different methods, and what I'm doing wrong?
import numpy as np
from scipy.stats import norm, t
import scipy.stats as st
import pandas as pd
import math

SP = pd.read_excel('S&P+sectors.xlsx',
                   parse_dates=['date'],
                   index_col='date')['.SPX']
rets = np.log(SP).diff()
rets = rets.dropna()

t.fit(rets)
print("Parameters from t.fit: ", t.fit(rets), "\n")

d = st.stats.describe(rets)
print(d, "\n")
print("Standard Deviation from st.stats.describe : ", np.sqrt(d[3]), "\n")

mean, var, skew, kurt = t.stats(t.fit(rets)[0], moments='mvsk',
                                loc=t.fit(rets)[1], scale=t.fit(rets)[2])
print("mean, std.dev, skew, kurt: ", mean, np.sqrt(var), skew, kurt)
Output:
Parameters from t.fit: (2.563005821560674, 0.0005384408493821172, 0.006945103287629065)
DescribeResult(nobs=4767, minmax=(-0.09469514468085727, 0.10957195934756658), mean=0.00011244654312862343, variance=0.00014599380983290917, skewness=-0.21364378793604263, kurtosis=8.494830112279583)
Standard Deviation from st.stats.describe : 0.012082789819942626
mean, std.dev, skew, kurt: 0.0005384408493821172 0.014818254946408262 nan nan
You can see that I get different means from the t.fit() and st.stats.describe(). The standard deviation is different for all three, and the skewness and kurtosis is also different. Why is this?
There is no difference
SQRT(0.00014599380983290917) = 0.01208278982
One is the variance, the other is the standard deviation.
OK, let's make it more descriptive.
The parameters from t.fit are what the fitter thinks is the best t-distribution curve to put over the set of sampled data.
DescribeResult reports the variance, not the stddev, so here we take the square root of the variance to get the stddev: SQRT(0.00014599380983290917) = 0.01208278982. Then you compute the stddev yourself, and they are the same. Remember, these values (stddev, variance, mean) are computed from the sampled data.
On the last line you compute the DISTRIBUTION mean and stddev, most likely by applying formulas or doing numerical integration. These are ALWAYS different from the sampled mean and sampled stddev. Fitting tries to fit everything (all moments) at once, minimizing some error or other. It works the other way around as well: if you start from distribution parameters, compute the distribution mean and stddev, and then draw a sample and compute the sampled mean/stddev, they will differ from the distribution ones. Only with an infinite sample size would the distribution moments and the sampled moments agree.
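A small sketch of that distinction, assuming nothing beyond scipy.stats.t (the parameter values are arbitrary):

import numpy as np
from scipy.stats import t

# Draw a finite sample from a known t-distribution
sample = t.rvs(df=5, loc=0.0, scale=1.0, size=2000, random_state=0)

# Sampled moments: computed directly from the data
print("sample mean/std:", np.mean(sample), np.std(sample, ddof=1))

# Distribution moments: computed from the fitted parameters via formulas/integration
df_, loc_, scale_ = t.fit(sample)
mean_, var_ = t.stats(df_, loc=loc_, scale=scale_, moments='mv')
print("fitted distribution mean/std:", mean_, np.sqrt(var_))

The two pairs of numbers will be close for a sample of this size, but they only coincide in the infinite-sample limit.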
I have a plot for the CDF distribution of packet losses. I thus do not have the original data or the CDF model itself but samples from the CDF curve. (The data is extracted from plots published in literature.)
I want to find which distribution and with what parameters offers the closest fit to the CDF samples.
I've seen that scipy.stats distributions offer a fit(data) method, but all the examples apply it to raw data points, with the PDF/CDF subsequently drawn from the fitted parameters. Using fit with my CDF samples does not give sensible results.
Am I right in assuming that fit() cannot be directly applied to data samples from an empirical CDF?
What alternatives could I use to find a matching known distribution?
I'm not sure exactly what you're trying to do. When you say you have a CDF, what does that mean? Do you have some data points, or the function itself? It would be helpful if you could post more information or some sample data.
If you have some data points and know the distribution its not hard to do using scipy. If you don't know the distribution, you could just iterate over all distributions until you find one which works reasonably well.
We can define functions of the form required for scipy.optimize.curve_fit. I.e., the first argument should be x, and then the other arguments are parameters.
I use this function to generate some test data based on the CDF of a normal random variable with a bit of added noise.
import numpy as np
import scipy.stats
import scipy.optimize

n = 100
x = np.linspace(-4, 4, n)
f = lambda x, mu, sigma: scipy.stats.norm(mu, sigma).cdf(x)
data = f(x, 0.2, 1) + 0.05*np.random.randn(n)
Now, use curve_fit to find parameters.
mu,sigma = scipy.optimize.curve_fit(f,x,data)[0]
This gives output
>> mu,sigma
0.1828320963531838, 0.9452044983927278
We can plot the original CDF (orange), noisy data, and fit CDF (blue) and observe that it works pretty well.
Note that curve_fit can take some additional parameters, and that the output gives additional information about how good of a fit the function is.
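For completeness, a self-contained sketch of the plot described above; it repeats the setup so it runs on its own, and the colours are just the ones mentioned in the answer:

import numpy as np
import scipy.stats
import scipy.optimize
import matplotlib.pyplot as plt

n = 100
x = np.linspace(-4, 4, n)
f = lambda x, mu, sigma: scipy.stats.norm(mu, sigma).cdf(x)
data = f(x, 0.2, 1) + 0.05*np.random.randn(n)
mu, sigma = scipy.optimize.curve_fit(f, x, data)[0]

plt.plot(x, f(x, 0.2, 1), color='orange', label='original CDF')
plt.scatter(x, data, s=10, color='gray', label='noisy data')
plt.plot(x, f(x, mu, sigma), color='blue', label='fitted CDF')
plt.legend()
plt.show()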
@tch Thank you for the answer. I read up on the technique and applied it successfully. I wanted to apply the fit to all continuous distributions supported by scipy.stats, so I ended up doing the following:
from scipy.optimize import curve_fit
import scipy.stats as ss
import numpy as np

fitted = []
failed = []
for d in dist_list:
    dist_name = d[0]                      # fetch the distribution name
    dist_object = getattr(ss, dist_name)  # fetch the distribution object
    param_default = d[1]                  # fetch the default distribution parameters
    # For distributions with only location and scale, use the defaults loc=0 and scale=1
    if not param_default:
        param_default = (0, 1)
    # Compute the parameters of the fitted distribution
    try:
        param, cov = curve_fit(dist_object.cdf, data_in, data_out, p0=param_default, method='trf')
        # Only keep distributions whose covariance is not all zero; an all-zero covariance is not a valid fit
        if np.any(cov):
            fitted.append((dist_name, param),)
    # Capture which distributions cannot be fitted (for a variety of reasons)
    except (NotImplementedError, RuntimeError) as e:
        failed.append((dist_name, e),)
In the above, the empirical CDF is captured in data_out, which holds the sampled CDF values for the range of data points in data_in. The list dist_list holds, for each distribution in scipy.stats.rv_continuous, the name of the distribution as its first element and a list of the default parameters as its second element. The default parameters I extract from scipy.stats._distr_params.
Some distributions cannot be fitted and raise an error; I keep those in the failed list.
Finally, I generate a list fitted which holds for each successfully fitted distribution the estimated parameters.
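In case it helps anyone reproducing this, a sketch of how dist_list could be assembled; note that scipy.stats._distr_params is a private module, so its layout may change between scipy versions and this is only an assumption about how the list above was built:

import scipy.stats as ss
from scipy.stats._distr_params import distcont  # list of (name, default shape params) pairs

# Keep only names that actually exist as continuous distributions in scipy.stats
dist_list = [(name, params) for name, params in distcont if hasattr(ss, name)]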
Let's say I have a column x with uniform distributed values.
To these values, I applied a cdf-function.
Now I want to calculate the Gaussian copula, but I can't find a function for it in Python. I have already read that the Gaussian copula involves something like the "inverse of the CDF function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian Copula process to normalize an observation by applying n = Phi^-1(F(x)). Calculating F(x) yields a value u in [0, 1] representing the proportion of shaded area at the left. Then Phi^-1(u) yields a value n by matching the shaded area in a Gaussian distribution.
I need your help: does anyone have an idea how to calculate that?
I have 2 ideas so far:
1) gauss = 1/(sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2)
--> transform all the values to new values with this formula
2) norm.ppf(array, loc, scale)
--> give the ppf function the array, the mean and the std, and it will calculate the inverse of the CDF for me... But I doubt #2
The thing is,
n.cdf(n.ppf(0.95))
is not what I want. The reason I'm doing this is to transform a non-normal/Gaussian distribution into a normal distribution.
Like here:
Transform from a non gaussian distribution to a gaussian distribution with Gaussian Copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found 2 links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
In those posts it's said that you have to
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply norm.cdf(data, mean, std), I get uniformly distributed CDF values.
Compare:
import numpy as np
from scipy.stats import norm
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
cdf = norm.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do the second step:
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function / inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use e.g. the norm.ppf function, the values don't look reasonable.
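For what it's worth, a minimal sketch of that two-step transform using a rank-based empirical CDF instead of norm.cdf; rankdata is from scipy.stats, and dividing by (len(x) + 1) keeps u strictly inside (0, 1) so that ppf stays finite:

import numpy as np
from scipy.stats import norm, rankdata

x = np.random.exponential(size=1000)   # any non-normal sample
u = rankdata(x) / (len(x) + 1)         # step 1: empirical CDF values, u in (0, 1)
z = norm.ppf(u)                        # step 2: standard normal quantile transform
# z is now approximately standard normally distributed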
I have a question:
Given mean and variance I want to calculate the probability of a sample using a normal distribution as probability basis.
The numbers are:
mean = -0.546369
var = 0.006443
curr_sample = -0.466102
prob = 1/(np.sqrt(2*np.pi*var))*np.exp( -( ((curr_sample - mean)**2)/(2*var) ) )
I get a probability which is larger than 1! I get prob = 3.014558...
What is causing this? Is the variance being so small messing something up? It's perfectly legal input to the formula, and I expected it to give something small, not greater than 1! Any suggestions?
Ok, what you compute is not a probability, but a probability density (which may be larger than one). In order to get 1 you have to integrate over the normal distribution like so:
import numpy as np

mean = -0.546369
var = 0.006443
curr_sample = np.linspace(-10, 10, 10000)
prob = np.sum(1/(np.sqrt(2*np.pi*var)) * np.exp(-((curr_sample - mean)**2)/(2*var)) * (curr_sample[1] - curr_sample[0]))
print(prob)
which results in
0.99999999999961509
The formula you give is a probability density, not a probability. The density formula is such that when you integrate it between two values of x, you get the probability of being in that interval. However, this means that the probability of getting any particular sample is, in fact, 0 (it's the density times the infinitesimally small dx).
So what are you actually trying to calculate? You probably want something like the probability of getting your value or larger, the so-called tail probability, which is often used in statistics (it so happens that this is given by the error function when you're talking about a normal distribution, although you need to be careful of exactly how it's defined).
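As a concrete illustration of the difference, a short sketch using the numbers from the question; norm.sf is scipy's survival function, i.e. the upper tail probability:

import numpy as np
from scipy.stats import norm

mean, var, curr_sample = -0.546369, 0.006443, -0.466102
sd = np.sqrt(var)

print(norm.pdf(curr_sample, mean, sd))  # density, about 3.01 -- can exceed 1
print(norm.sf(curr_sample, mean, sd))   # P(X >= curr_sample), always between 0 and 1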
When considering the bell-shaped probability density function (PDF) of a given mean and variance, the peak value of the curve (the height at the mode) is 1/sqrt(2*pi*var). For the standard normal distribution (mean 0 and var 1) this is 1/sqrt(2*pi), roughly 0.40, and it grows without bound as the variance shrinks. Hence, when evaluating the PDF of a general normal distribution at a specific value, results larger than 1 are entirely possible.