scipy.stats.wasserstein_distance implementation

I am trying to understand the implementation that is used in scipy.stats.wasserstein_distance.
For p=1 and no weights, with u_values and v_values the two 1-D distributions, the code comes down to:
u_sorter = np.argsort(u_values)  # (1)
v_sorter = np.argsort(v_values)
all_values = np.concatenate((u_values, v_values))  # (2)
all_values.sort(kind='mergesort')
deltas = np.diff(all_values)  # (3)
u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')  # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')
v_cdf = v_cdf_indices / v_values.size  # (5)
u_cdf = u_cdf_indices / u_values.size
return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))  # (6)
What is the reasoning behind this implementation, is there some literature?
I did look at the paper cited which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral,
\int_{-\infty}^{+\infty} |U(x) - V(x)| \, dx,
with U and V the cumulative distribution functions for the distributions u_values and v_values,
but I don't understand how this integral is evaluated in scipy implementation.
In particular,
a) why are they multiplying by the deltas in (6) to solve the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the element order of the distribution u_values and v_values is not preserved. Shouldn't this be the case in the general Wasserstein distance definition?
Thank you for your help!

The order along the value axis (the order of the PDF, histogram or KDE) is preserved and is important in Wasserstein distance. If you only pass u_values and v_values, they are treated as samples, and the function has to build an empirical distribution from them, with every sample given equal weight. To describe a distribution by its support and probabilities instead, you would pass all four arguments of wasserstein_distance: the values (u_values, v_values) and their weights (u_weights, v_weights). So in the case where samples are provided you are not passing an ordered datapoint, simply a collection of repeated "experiments". Steps (1) and (4) in your listing sort the samples and count, for each value in the pooled array, how many samples are less than or equal to it. A CDF is the fraction of samples up to that point, P(X <= x), which is basically the cumulative sum of a PDF or histogram. Step (5) normalizes the counts to between 0.0 and 1.0 by dividing each count by the number of samples.
So the order of the discrete values is preserved, not the original order in the datapoint.
B) It may make more sense if you plot the CDF's of a datapoint such as an image file by using the code above.
The transportation problem however may not need a PDF, but rather a datapoint of ordered features or some way to measure distance between features in which case you would calculate it differently.
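On question a), the steps of the listing can be checked with a short sketch. The empirical CDFs are step functions that stay constant between two consecutive values of the pooled, sorted array, so the integral of |U - V| is exactly the sum of |U - V| on each interval times the interval width (the deltas). The helper name w1_empirical is hypothetical, and the sketch assumes equal-weight samples and p=1:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def w1_empirical(u_values, v_values):
    """1-D Wasserstein distance (p=1) between two equal-weight samples."""
    u_sorted = np.sort(u_values)
    v_sorted = np.sort(v_values)
    # Every point where either empirical CDF can jump.
    all_values = np.sort(np.concatenate((u_values, v_values)))
    deltas = np.diff(all_values)  # widths of the intervals between jumps
    # Fraction of samples <= each interval's left end: the empirical CDFs.
    u_cdf = np.searchsorted(u_sorted, all_values[:-1], side="right") / u_sorted.size
    v_cdf = np.searchsorted(v_sorted, all_values[:-1], side="right") / v_sorted.size
    # Piecewise-constant integrand, so the integral is a plain sum.
    return np.sum(np.abs(u_cdf - v_cdf) * deltas)

u = np.array([0.0, 1.0, 3.0])
v = np.array([5.0, 6.0, 8.0])
manual = w1_empirical(u, v)
reference = wasserstein_distance(u, v)
shuffled = w1_empirical(u[::-1], v)  # element order of the samples is irrelevant
print(manual, reference, shuffled)
```

Reversing u leaves the result unchanged, which illustrates why the original order of the samples does not matter: only the sorted values enter the CDFs.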

Related

Calculate KL Divergence between two gamma distribution list

I have two lists. Both contain normalized percentages:
actual_population_distribution = [0.2,0.3,0.3,0.2]
sample_population_distribution = [0.1,0.4,0.2,0.3]
I wish to fit these two lists to a gamma distribution and then use the two returned lists to calculate the KL value.
I am already able to get a KL value.
This is the function I used for the gamma part:
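import random
import numpy as np
def gamma_random_sample(data_list):
    mean = np.mean(data_list)
    var = np.var(data_list)
    g_alpha = mean * mean / var  # shape (method of moments)
    g_beta = mean / var  # rate (method of moments)
    for i in range(len(data_list)):
        yield random.gammavariate(g_alpha, 1/g_beta)  # gammavariate takes shape and scale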
Fit two lists into gamma distribution:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
This is the code I used to calculate KL:
kl = np.sum(scipy.special.kl_div(actual_grs, sample_grs))
The code above does not produce any errors.
But I suspect the way I estimate the gamma parameters is wrong, because I use np.mean/np.var to get the mean and variance.
Indeed, the numbers are different from what I get if I use:
mean, var, skew, kurt = gamma.stats(fit_alpha, loc = fit_loc, scale = fit_beta, moments = 'mvsk')
With gamma.stats I get a KL value way larger than 1, so I think both ways are invalid for getting a correct KL.
What am I missing?

See this Cross Validated post: https://stats.stackexchange.com/questions/280459/estimating-gamma-distribution-parameters-using-sample-mean-and-std
I don't understand what you are trying to do with:
actual_grs = [i for i in f.gamma_random_sample(actual_population_distribution)]
sample_grs = [i for i in f.gamma_random_sample(sample_population_distribution)]
It doesn't look like you are fitting to a gamma distribution. It looks like you are using the method-of-moments estimator to get the parameters of the gamma distribution, and then you are drawing a single random number for each element of your actual(sample)_population_distribution lists given the distribution statistics of the list.
The gamma distribution is notoriously hard to fit. I hope your actual data has a longer list -- 4 data points are hardly sufficient for estimating a two parameter distribution. The estimates are kind of garbage until you get hundreds of elements or more, take a look at this document on the MLE estimator for the fisher information of a gamma distribution: https://www.math.arizona.edu/~jwatkins/O3_mle.pdf .
I don't know what you are trying to do with the kl divergence either. Your actual population is already normalized to 1 and so is the sample distribution. You can plug in those elements directly into the KL divergence for a discrete score -- what you are doing with your code is a stretching and addition of gamma noise to your original list values with your defined gamma function. You are more likely to have a larger deviation with the KL divergence after the gamma corruption of your original population data.
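As a minimal sketch of the "plug in those elements directly" suggestion, the two lists from the question can be scored as discrete distributions. scipy.special.kl_div computes x*log(x/y) - x + y elementwise, and for two normalized lists the extra -x + y terms cancel in the sum, so the result agrees with scipy.stats.entropy(p, q):

```python
import numpy as np
from scipy.special import kl_div
from scipy.stats import entropy

# The two normalized lists from the question.
p = np.array([0.2, 0.3, 0.3, 0.2])  # actual_population_distribution
q = np.array([0.1, 0.4, 0.2, 0.3])  # sample_population_distribution

kl_pq = np.sum(kl_div(p, q))  # KL(p || q) in nats
kl_check = entropy(p, q)      # same quantity via scipy.stats
print(kl_pq, kl_check)        # both are about 0.093
```

No random sampling is involved, so the value is the same on every run.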
I'm sorry, I just don't see what you are trying to accomplish here. If I were to guess your original intent, I'd say your problem is that you need hundreds of data points to guarantee convergence with any gamma fitting program.
EDIT: I just wanted to add that with regards to the KL divergence. If you intend to score your fit gamma distributions with the KL divergence, it's better to use an analytical solution where the scale and shape parameters of your two gamma distributions are your two inputs. Randomly sampling noisy data points won't be helpful unless you take 100,000 random samples and histogram them into 1,000 bins or so and then normalize your histogram -- I'm just throwing those numbers out, but you are going to want to approximate a continuous distribution as best as you can and it will be hard because the gamma distributions have long tails. This document has the analytical solution for a generalized distribution: https://arxiv.org/pdf/1401.6853.pdf . Just set that third parameter to 1 and simplify and then code up a function.
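For the analytical route, here is a hedged sketch using the standard closed form for the KL divergence between two ordinary gamma distributions in shape/rate form. It is not transcribed from the linked paper, and the function name kl_gamma and the example parameters are hypothetical. The shape/rate convention matches g_alpha and g_beta in the question. A numerical integration is included as a sanity check:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import digamma, gammaln
from scipy.stats import gamma

def kl_gamma(shape_p, rate_p, shape_q, rate_q):
    """KL(p || q) in nats for p = Gamma(shape_p, rate_p), q = Gamma(shape_q, rate_q)."""
    return ((shape_p - shape_q) * digamma(shape_p)
            - gammaln(shape_p) + gammaln(shape_q)
            + shape_q * (np.log(rate_p) - np.log(rate_q))
            + shape_p * (rate_q - rate_p) / rate_p)

# Hypothetical example parameters (shape, rate).
a_p, b_p = 2.0, 3.0
a_q, b_q = 3.0, 1.5
analytic = kl_gamma(a_p, b_p, a_q, b_q)

# Sanity check: integrate p(x) * log(p(x) / q(x)) numerically.
p = gamma(a_p, scale=1.0 / b_p)  # scipy uses scale = 1 / rate
q = gamma(a_q, scale=1.0 / b_q)
numeric, _ = quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), 0, np.inf)
print(analytic, numeric)
```

With shape 1 on both sides this reduces to the familiar result for two exponential distributions, log(rate_p/rate_q) + rate_q/rate_p - 1.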

How can I change the number of basis functions when performing B-Spline fitting in scipy (python)?

I have a discrete set of points (x_n, y_n) that I would like to approximate/represent as a linear combination of B-spline basis functions. I need to be able to manually change the number of B-spline basis functions used by the method, and I am trying to implement this in python using scipy. To be specific, below is a bit of code that I am using:
import scipy.interpolate
spl = scipy.interpolate.splrep(x, y)
However, unless I have misunderstood or missed something in the documentation, it seems I cannot change the number of B-spline basis functions that scipy uses. That seems to be set by the size of x and y. So, my specific questions are:
1. Can I change the number of B-spline basis functions used by scipy in the "splrep" function that I used above?
2. Once I have performed the transformation shown in the code above, how can I access the coefficients of the linear combination? Am I correct in thinking that these coefficients are stored in the vector spl[1]?
3. Is there a better method/toolbox that I should be using?
Thanks in advance for any help/guidance you can provide.

Yes, spl[1] are the coefficients, and spl[0] contains the knot vector.
However, if you want better control, you can work with BSpline objects and construct them with make_interp_spline or make_lsq_spline, which accept a knot vector; the knot vector determines the B-spline basis functions that are used.
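A minimal sketch of the make_lsq_spline route, assuming a cubic spline and sample data chosen only for illustration. Unlike splrep, make_lsq_spline expects the full knot vector, including the repeated boundary knots, and the number of basis functions is len(t) - k - 1:

```python
import numpy as np
from scipy.interpolate import make_lsq_spline

# Illustrative data (an assumption, not from the question).
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)

k = 3        # cubic
n_inner = 3  # number of interior knots; change this to change the basis size
inner = np.linspace(x[0], x[-1], n_inner + 2)[1:-1]
# Full knot vector: (k+1) copies of each end point around the interior knots.
t = np.concatenate(([x[0]] * (k + 1), inner, [x[-1]] * (k + 1)))

spl = make_lsq_spline(x, y, t, k=k)  # least-squares fit on this basis
n_basis = spl.c.size                 # equals len(t) - k - 1
max_err = np.max(np.abs(spl(x) - y))
print(n_basis, max_err)
```

Here spl.c holds exactly one coefficient per basis function, with no padding to strip.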

You can change the number of B-spline basis functions by supplying a knot vector with the t parameter. Since number of knots = number of coefficients + degree + 1, the number of knots also determines the number of coefficients (== the number of basis functions).
The usage of the t parameter is not so intuitive, since the given knots should be only the inner knots. So, for example, if you want 7 coefficients for a cubic spline you need to give 3 inner knots. Inside the function, the knot vector is padded with (degree+1) copies of xb at the start and of xe at the end (clamped end conditions).
Furthermore, as the documentation says, the knots should satisfy the Schoenberg-Whitney conditions.
Here is an example code that does this:
import numpy as np
import scipy.interpolate
# Input:
x = np.linspace(0, 2*np.pi, 9)
y = np.sin(x)
# Your code:
spl = scipy.interpolate.splrep(x, y)
t, c, k = spl  # knots, coefficients, degree (==3 for cubic)
# Computing the inner knots and using them:
t3 = np.linspace(x[0], x[-1], 5)  # five equally spaced knots in the interval
t3 = t3[1:-1]  # take only the three inner values
spl3 = scipy.interpolate.splrep(x, y, t=t3)
Regarding your second question, you're right that the coefficients are indeed stored in spl[1]. However, note that (as the documentation says) the last (degree+1) values are zero-padded and should be ignored.
In order to evaluate the resulting B-spline you can use the function splev or the class BSpline.
Below is some example code that evaluates and draws the above splines:
import matplotlib.pyplot as plt
xx = np.linspace(x[0], x[-1], 101)  # sample points
yy = scipy.interpolate.splev(xx, spl)  # evaluate original spline
yy3 = scipy.interpolate.splev(xx, spl3)  # evaluate new spline
plt.plot(x, y, 'b.')  # plot original interpolation points
plt.plot(xx, yy, 'r-', label='spl')
plt.plot(xx, yy3, 'g-', label='spl3')
plt.legend()
plt.show()
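To make the coefficient bookkeeping concrete, the following sketch repeats the three-inner-knot fit, keeps only the first len(t) - k - 1 coefficients, and checks that a BSpline built from them agrees with splev. Variable names such as n_basis are illustrative:

```python
import numpy as np
from scipy.interpolate import BSpline, splev, splrep

x = np.linspace(0, 2 * np.pi, 9)
y = np.sin(x)
inner = np.linspace(x[0], x[-1], 5)[1:-1]  # three inner knots

t, c, k = splrep(x, y, t=inner)  # least-squares spline on the given knots
n_basis = len(t) - k - 1         # number of basis functions
coeffs = c[:n_basis]             # drop the padded tail of c

spline = BSpline(t, coeffs, k)
xx = np.linspace(x[0], x[-1], 101)
same = np.allclose(spline(xx), splev(xx, (t, c, k)))
print(n_basis, same)
```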

Python: Gaussian Copula or inverse of cdf

Let's say I have a column x with uniform distributed values.
To these values, I applied a cdf-function.
Now I want to calculate the Gaussian Copula, but I can't find the function in python. I read already, that Gaussian Copula is something like the "inverse of the cdf function".
The reason why I'm doing it comes from this paragraph:
A visual depiction of applying the Gaussian Copula process to normalize
an observation by applying 𝑛 = Phi^-1(𝐹(𝑥)). Calculating 𝐹(𝑥) yields a value 𝑢 ∈ [0, 1]
representing the proportion of shaded area at the left. Then Phi^−1(𝑢) yields a value 𝑛
by matching the shaded area in a Gaussian distribution.
I need your help, does anyone have an idea how to calculate that?
I have 2 ideas so far:
1) gauss = 1/(sqrt(2*pi)*s)*e**(-0.5*(float(x-m)/s)**2)
--> so transform all the values with this to a new value
2) norm.ppf(array,loc,scale)
--> So give the ppf function the mean and the std and the array and it will calculate me the inverse of the CDF... But I doubt #2
The thing is,
norm.cdf(norm.ppf(0.95))
is not what I want. The reason I'm doing this is to transform a non-normal (non-Gaussian) distribution into a normal distribution.
Like here:
Transform from a non gaussian distribution to a gaussian distribution with Gaussian Copula
Any other ideas or tips?
Thank you very much :)
EDIT:
I found 2 links which are quite useful:
1. https://stats.stackexchange.com/questions/197283/how-to-transform-an-arcsine-distribution-to-a-normal-distribution
2. https://stats.stackexchange.com/questions/125648/transformation-chi-squared-to-normal-distribution/125653#125653
In these posts it is said that:
All the details are in the answer already - you take your random variable, and transform it by its own cdf ..... yielding a uniform result.
That's true for me. If I take a random distribution and apply the norm.cdf(data, mean, std) function, I get a uniformly distributed cdf.
Compare:
import numpy as np
from scipy.stats import norm
data = [1.5]*7 + [2.5]*2 + [3.5]*8 + [4.5]*3 + [5.5]*1 + [6.5]*8
cdf = norm.cdf(data, np.mean(data), np.std(data))
print(cdf)
But how can I do the following?
You then transform again, applying the quantile function (inverse cdf) of the desired distribution (in this case by the standard normal quantile function /inverse of the normal cdf, producing a variable with a standard normal distribution).
Because when I use e.g. the norm.ppf function, the values are not reasonable.
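The quoted two-step recipe, n = Phi^-1(F(x)), can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes F is estimated by the empirical CDF (rank divided by n+1), which the question does not specify, and to_gaussian is a hypothetical helper name. Note that norm.ppf expects probabilities strictly between 0 and 1; it returns infinities at 0 and 1 and nan outside that range, which is why raw data values passed to it look unreasonable.

```python
import numpy as np
from scipy.stats import norm, rankdata

def to_gaussian(x):
    """Map a sample to normal scores via its own empirical CDF."""
    x = np.asarray(x, dtype=float)
    # Step 1: F(x), here the empirical CDF; dividing by n + 1 keeps u inside (0, 1).
    u = rankdata(x) / (x.size + 1)
    # Step 2: Phi^-1(u), the standard normal quantile function.
    return norm.ppf(u)

# Illustrative skewed data (an assumption, not from the question).
rng = np.random.default_rng(0)
skewed = rng.exponential(scale=2.0, size=1000)
z = to_gaussian(skewed)
print(z.mean(), z.std())
```

The transform is monotonic, so the ordering of the observations is preserved while their marginal distribution becomes approximately standard normal.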

Python circle fitting to data points less sensitive to random noise

I have a set of measured radii (t+epsilon+error) at equally spaced angles.
The model is a circle of radius (R) with center at (r, Alpha), with added small noise and some random error values which are much bigger than the noise.
The problem is to find the center of the circle model (r, Alpha) and the radius of the circle (R). But it should not be too sensitive to the random errors (in the data below, the points at 7 and 14).
Some radii could be missing, therefore the simple mean would not work here.
I tried least squares optimization, but it reacts strongly to the errors.
Is there a way to optimize least deltas but not the least squares of delta in Python?
Model:
n=36
R=100
r=10
Alpha=2*Pi/6
Data points:
[95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75, 90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43, 105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93, 111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97, 96.05, 95.29]

It seems like your main problem here is going to be removing outliers. There are a couple of ways to do this, but for your application your best bet is probably just to remove items based on their distance from the median (since the median is much less sensitive to outliers than the mean).
If you're using numpy, that would look like this:
import numpy as np
def remove_outliers(data_points, margin=1.5):
    data_points = np.asarray(data_points, dtype=float)  # allow plain lists as input
    nd = np.abs(data_points - np.median(data_points))  # distance from the median
    s = nd / np.median(nd)  # distance relative to the median distance
    return data_points[s < margin]  # keep only points close to the median
After which you should run least squares.
If you're not using numpy you can do something similar with native Python lists:
def median(points):
    return sorted(points)[len(points) // 2]  # integer division, works in Python 2 and 3
def remove_outliers(data_points, margin=1.5):
    m = median(data_points)
    centered_points = [abs(point - m) for point in data_points]
    centered_median = median(centered_points)
    ratios = [datum / centered_median for datum in centered_points]
    return [point for i, point in enumerate(data_points) if ratios[i] < margin]  # keep inliers
If you're looking to just not count outliers as highly rather than remove them, you can minimize the sum of absolute deviations instead of the sum of squares; for a constant model that gives the median of your dataset rather than the mean.
If you're looking for something a little better I might suggest throwing your data through some kind of low pass filter, but I don't think that's really needed here.
A low-pass filter would probably be the best, which you can do as follows: (Note, alpha is a number you will have to fiddle with to get your desired output.)
def low_pass(data, alpha):
    new_data = [data[0]]
    for i in range(1, len(data)):
        new_data.append(alpha * data[i] + (1 - alpha) * new_data[i-1])
    return new_data
At which point your least squares optimization should work fine.

Replying to your final question
Is there a way to optimize least deltas but not the least squares of delta in Python?
Yes, pick an optimization method (for example the downhill simplex method implemented in scipy.optimize.fmin) and use the sum of absolute deviations as the merit function. Your dataset is small, so I suppose that any general purpose optimization method will converge quickly. (For non-linear least squares fitting it is also possible to use a general purpose optimization algorithm, but it's more common to use the Levenberg-Marquardt algorithm, which minimizes sums of squares.)
If you are interested in when minimizing absolute deviations instead of squares is theoretically justified, see Numerical Recipes, chapter Robust Estimation.
On the practical side, the sum of absolute deviations may not have a unique minimum.
In the trivial case of two points, say, (0,5) and (1,9), and a constant function y=a, any value of a between 5 and 9 gives the same sum (4). There is no such problem when deviations are squared.
If minimizing absolute deviations does not work, you may consider a heuristic procedure to identify and remove outliers, such as RANSAC or ROUT.
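A hedged sketch of the absolute-deviation fit on the data from the question, using the downhill simplex (Nelder-Mead) method through scipy.optimize.minimize. Several things here are assumptions rather than facts from the question: sample i is taken to be measured at angle 2*pi*i/36 starting from 0, the center is parameterized in Cartesian form (cx, cy) and converted to (r, Alpha) afterwards, and the starting guess comes from the first Fourier harmonic of the data.

```python
import numpy as np
from scipy.optimize import minimize

radii = np.array([95.85, 92.66, 94.14, 90.56, 88.08, 87.63, 88.12, 152.92, 90.75,
                  90.73, 93.93, 92.66, 92.67, 97.24, 65.40, 97.67, 103.66, 104.43,
                  105.25, 106.17, 105.01, 108.52, 109.33, 108.17, 107.10, 106.93,
                  111.25, 109.99, 107.23, 107.18, 108.30, 101.81, 99.47, 97.97,
                  96.05, 95.29])
# Assumption: equally spaced angles over a full turn, first sample at angle 0.
theta = 2 * np.pi * np.arange(radii.size) / radii.size

def model(params, theta):
    """Distance from the origin to a circle (center cx, cy; radius R) along theta."""
    cx, cy, R = params
    along = cx * np.cos(theta) + cy * np.sin(theta)   # center projected on the ray
    across = cx * np.sin(theta) - cy * np.cos(theta)  # center offset from the ray
    return along + np.sqrt(np.maximum(R**2 - across**2, 0.0))

def sum_abs_dev(params):
    """Merit function: sum of absolute deviations instead of squares."""
    return np.sum(np.abs(radii - model(params, theta)))

# Rough start: first Fourier harmonic for the center, median for the radius.
x0 = [2 * np.mean(radii * np.cos(theta)),
      2 * np.mean(radii * np.sin(theta)),
      np.median(radii)]
res = minimize(sum_abs_dev, x0, method="Nelder-Mead")

cx, cy, R = res.x
r = np.hypot(cx, cy)
alpha = np.arctan2(cy, cx) % (2 * np.pi)
print(R, r, alpha)
```

Missing radii could be handled by dropping the corresponding entries from both radii and theta before fitting, since each measurement carries its own angle.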

Python scipy.fftpack.rfft frequency bin mapping

I'm trying to get the correct FFT bin index based on the given frequency. The audio is being sampled at 44.1 kHz and the FFT size is 1024. Given that the signal is real (captured from PyAudio, decoded through numpy.fromstring, windowed by scipy.signal.hann), I then perform the FFT through scipy.fftpack.rfft and compute the decibel value of the result; in whole, magnitude = 20 * scipy.log10(abs(rfft(audio_sample)))
Based on this, and this, I originally had my mapping from the FFT bin index, k, to any frequency, F, as:
F = k*Fs/N for k = 0 ... N/2-1 where Fs is the sampling rate, and N is the FFT bin size, in this case, 1024. And the reverse as:
k = F*N/Fs for F = 0Hz ... Fs/2-Fs/N
However, I realized that rfft's result is not symmetric like fft's, and that it provides the result in an array of size N. I now have some questions regarding the mapping and the function. The documentation unfortunately did not give me much information, as I'm a novice in this area.
My questions:
1. To me, the result of rfft on an audio sample can be used directly from the first bin to the last bin, as no symmetry occurs in the output, is that correct?
2. Given the lack of symmetry from the above, the frequency resolution appears to have increased, is this interpretation correct?
3. Because of using rfft, my mapping function from bin index k to frequency F is now F = k*Fs/(2N) for k = 0 ... N-1, is this correct?
4. Conversely, the reverse mapping function from frequency F to bin index k now becomes k = 2*F*N/Fs for F = 0Hz ... Fs/2-(Fs/2/N), what about the correctness of this?
My general confusion arises from how rfft is related to fft, and how the mapping can be done correctly while using rfft. I believe my mapping is offset by a small amount, and that is crucial in my application. Please point out the mistake or advise on the matter if possible, thank you very much.

First to clear up a few things for you:
A quick look at the fftpack documentation reveals that rfft only gives you the bins 0..512 (in your case). scipy.fftpack.rfft packs them into N real numbers, [y(0), Re(y(1)), Im(y(1)), ..., Re(y(N/2))], whereas numpy.fft.rfft returns them as N/2+1 complex values. The reason for this is exactly the symmetry present when calculating the discrete Fourier transform of a real-valued input:
y[k] = y*[N-k] (see Wikipedia page on DFTs). Therefore, the rfft function only calculates and stores N/2+1 values since you can calculate the other half by just taking the complex conjugates (should you really want it for plotting (say)). The fft function makes no assumption on the input values (they can have both a real and imaginary part) and therefore no symmetry can be assumed in the output and it gives you a full output vector with N values. Admittedly, most applications use a real input, so people tend to assume the symmetry is always there. Note that the Fast Fourier Transform (FFT) is an (efficient) algorithm to calculate the Discrete Fourier Transform (DFT) and the rfft function also uses the FFT to do the calculation.
In light of the above, your indices for accessing the output vector are out of bounds, i.e. > 512. The reasons why/how you can do this depends on your code. You should clearly distinguish between the 'logical N' (that you use to map the bin frequencies, define the DFT etc.) and the 'computational N' (the actual number of values in your output vector), then all your problems should disappear.
To concretely answer your questions:
1. No. There is symmetry and you need to use this to calculate the last bins (but they give you no extra information).
2. No. The only way to increase resolution of a DFT is to increase your sample length.
3. No, but almost. F = k*Fs/N for k = 0..N/2
4. For an output vector with N bins you get frequencies from 0 to (N-1)/N*Fs. Using the rfft you will have an output vector with N/2+1 bins. You do the maths, but I get 0..Fs/2
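The mapping can be checked with a short sketch. It uses a synthetic test tone instead of captured audio (an assumption), and freq_to_bin is a hypothetical helper. It also unpacks the real-valued output of scipy.fftpack.rfft, which is documented as [y(0), Re(y(1)), Im(y(1)), ..., Re(y(N/2))] for even N, into the N/2+1 complex bins that numpy.fft.rfft returns directly:

```python
import numpy as np
from scipy import fftpack

fs = 44100.0  # sampling rate in Hz
n = 1024      # FFT size

def freq_to_bin(f, fs, n):
    """Nearest bin index for frequency f, valid for 0 <= f <= fs/2."""
    return int(round(f * n / fs))

# Synthetic tone sitting exactly on bin 100 (illustrative input).
f_tone = 100 * fs / n
x = np.sin(2 * np.pi * f_tone * np.arange(n) / fs)

spec = np.fft.rfft(x)                  # n/2 + 1 complex bins
freqs = np.fft.rfftfreq(n, d=1 / fs)   # F = k * fs / n for k = 0 .. n/2

# scipy.fftpack.rfft returns n real numbers; rebuild the complex bins from them.
packed = fftpack.rfft(x)
unpacked = np.empty(n // 2 + 1, dtype=complex)
unpacked[0] = packed[0]                                 # DC bin, purely real
unpacked[-1] = packed[-1]                               # Nyquist bin, purely real
unpacked[1:-1] = packed[1:-1:2] + 1j * packed[2:-1:2]   # Re/Im pairs of bins 1 .. n/2-1

peak_bin = int(np.argmax(np.abs(spec)))
print(peak_bin, freqs[peak_bin], freq_to_bin(f_tone, fs, n))
```

Taking abs() of the packed fftpack output directly would treat the real and imaginary parts as separate bins, so the magnitudes should be computed after unpacking (or from numpy.fft.rfft).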
Hope things are clearer now.
