I am trying to understand the implementation that is used in
scipy.stats.wasserstein_distance
For p=1 and no weights, with u_values and v_values the two 1-D samples, the code comes down to:
u_sorter = np.argsort(u_values)                                             # (1)
v_sorter = np.argsort(v_values)

all_values = np.concatenate((u_values, v_values))                           # (2)
all_values.sort(kind='mergesort')

deltas = np.diff(all_values)                                                # (3)

u_cdf_indices = u_values[u_sorter].searchsorted(all_values[:-1], 'right')   # (4)
v_cdf_indices = v_values[v_sorter].searchsorted(all_values[:-1], 'right')

v_cdf = v_cdf_indices / v_values.size                                       # (5)
u_cdf = u_cdf_indices / u_values.size

return np.sum(np.multiply(np.abs(u_cdf - v_cdf), deltas))                   # (6)
What is the reasoning behind this implementation? Is there some literature?
I did look at the paper cited, which I believe explains why calculating the Wasserstein distance in its general definition in 1D is equivalent to evaluating the integral

\int_{-\infty}^{+\infty} |U(x) - V(x)| \, dx,

with U and V the cumulative distribution functions of the samples u_values and v_values, but I don't understand how this integral is evaluated in the scipy implementation.
In particular,
a) why do they multiply by the deltas in (6) to evaluate the integral?
b) how are v_cdf and u_cdf in (5) the cumulative distribution functions U and V?
Also, with this implementation the original element order of u_values and v_values is not preserved. Shouldn't that order matter in the general Wasserstein distance definition?
Thank you for your help!
The order of the PDF, histogram or KDE is preserved, and that order is what matters in the Wasserstein distance. If you only pass u_values and v_values, then the function has to calculate something like a PDF, KDE or histogram from them first. Normally you would provide the PDF and the range of U and V as the 4 arguments to wasserstein_distance. So in the case where samples are provided, you are not passing a real datapoint, simply a collection of repeated "experiments". Steps (1) and (4) in your list of code blocks essentially bin your data by the distinct sample values. The CDF at a point is the count of values up to that point, P(X <= x); it is essentially the cumulative sum of a PDF, histogram or KDE. Step (5) normalises the CDF to the range 0.0 to 1.0, i.e. it divides each count by the total number of samples.

So the order of the distinct sorted values is preserved, not the original order within the datapoint.
b) It may make more sense if you plot the CDFs of a datapoint, such as an image file, using the code above.

The transportation problem, however, may not need a PDF, but rather a datapoint of ordered features, or some way to measure distance between features, in which case you would calculate it differently.
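To make steps (5) and (6) concrete, here is a small sketch (the sample values u and v are made up for illustration). Because the empirical CDFs are step functions that only change at the pooled sample values, the integral of |U - V| is exactly the Riemann sum of rectangle areas |U - V| * delta over the intervals between consecutive pooled values:

import numpy as np
from scipy.stats import wasserstein_distance

u = np.array([0.0, 1.0, 3.0])   # made-up samples
v = np.array([5.0, 6.0, 8.0])

all_values = np.sort(np.concatenate((u, v)))
deltas = np.diff(all_values)    # widths of the intervals between pooled values

# fraction of samples <= each interval's left endpoint = empirical CDF there;
# searchsorted(..., 'right') counts how many sorted samples are <= each point
u_cdf = np.sort(u).searchsorted(all_values[:-1], 'right') / u.size
v_cdf = np.sort(v).searchsorted(all_values[:-1], 'right') / v.size

print(np.sum(np.abs(u_cdf - v_cdf) * deltas))   # 5.0
print(wasserstein_distance(u, v))               # 5.0, matching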
Estimate the following integral with Monte Carlo integration:
I am trying to do Monte Carlo integration on the integral below, where p(x) is a Gaussian distribution with a mean of 1 and a variance of 2:

\int_{-\infty}^{+\infty} x e^x \, p(x) \, dx

I was told that once we draw samples from the normal distribution, the pdf vanishes from the integral. Please explain this concept and how I can solve this in Python. Below is my attempt.
import math
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt

def func(x):
    return math.exp(x) * x

mu = 1
sigma = sqrt(2)
N = 1000

areas = []
for _ in range(N):                   # repeat the whole experiment N times
    xrand = np.zeros(N)
    for i in range(len(xrand)):      # draw N samples from N(mu, sigma^2)
        xrand[i] = np.random.normal(mu, sigma)
    integral = 0.0
    for i in range(N):               # average the integrand over the samples
        integral += func(xrand[i]) / N
    areas.append(integral)

plt.title("Distribution of areas calculated")
plt.hist(areas, 60, ec='black')
plt.xlabel("Areas")
plt.show()
Monte Carlo integration is a way of approximating complex integrals without computing their closed-form solution. To answer your question, the PDF vanishes because all you need to do is 1) sample random values from the specified normal distribution, 2) evaluate the function in the integrand at each sample, and 3) compute the average of these values. Since the samples x_i are drawn from p, the average (1/N) * sum_i f(x_i) converges to \int f(x) p(x) dx, so p never needs to be evaluated explicitly; it is only relevant insofar as it ensures that more likely values are sampled more frequently. You might understand this as taking a weighted average, if that makes things more intuitive.
Here is a Python implementation based on your original source code.
import math
import numpy as np
from statistics import mean

def func(x):
    return x * math.exp(x)

def monte_carlo(n_sample, mu, sigma):
    # sample n_sample points from N(mu, sigma^2) and average the integrand
    val_lst = []
    for _ in range(n_sample):
        x = np.random.normal(mu, sigma)
        val_lst.append(func(x))
    return mean(val_lst)
You can change func to any function of your choice to build a Monte Carlo approximation of its integral against the sampling distribution. You can also edit the parameters of the monte_carlo function if you are given a different probability distribution.
Here is a function you can use to visualize the gradual convergence of the Monte Carlo approximation. As you might expect, the values will converge with larger iterations, i.e. as you increase the value of n_sample.
from math import sqrt
import matplotlib.pyplot as plt

MAX_SAMPLE = 200  # Adjust this value as you need

x = np.arange(1, MAX_SAMPLE)  # start at 1: the mean of zero samples is undefined
y = [monte_carlo(i, 1, sqrt(2)) for i in x]
plt.plot(x, y)
plt.show()
The resulting plot shows the value the estimates converge to, which approximates the value given by the closed-form solution of the definite integral.
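As a sanity check (this closed-form value is my own derivation, not part of the original question): for X ~ N(1, 2), the normal moment generating function M(t) = exp(mu*t + sigma^2*t^2/2) gives E[X e^X] = M'(1) = (mu + sigma^2) * exp(mu + sigma^2/2) = 3e^2, roughly 22.17, so the plot should level off near that value:

import math

# closed-form value via the normal MGF (derivation above)
mu, var = 1, 2
exact = (mu + var) * math.exp(mu + var / 2)       # 3 * e**2
print(exact)                                      # ~22.167
print(monte_carlo(100_000, mu, math.sqrt(var)))   # should land close to exact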
I went through the docs but I'm not able to interpret them correctly.

In my code, I wanted to find the line that goes through two points (x1, y1), (x2, y2), so I used

np.polyfit((x1, x2), (y1, y2), 1)

since it's a degree-1 polynomial (a straight line). It returns

[ -1.04  727.2 ]

Though my code (which is part of a much larger file) runs properly and does what it is intended to do, I want to understand what this call is returning. I'm assuming polyfit returns a line (curved, straight, whatever) that satisfies (goes through) the points given to it, so how can a line be represented by the two numbers it returns?
From the numpy.polyfit documentation:
Returns:
p : ndarray, shape (deg + 1,) or (deg + 1, K)
Polynomial coefficients, highest power first. If y was 2-D, the coefficients for k-th data set are in p[:,k].
So these numbers are the coefficients of your polynomial. Thus, in your case:
y = -1.04*x + 727.2
By the way, numpy.polyfit will only return an equation that goes through all the points (say you have N) if the degree of the polynomial is at least N-1. Otherwise, it will return a best fit that minimises the squared error.
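For example (these two points are hypothetical, chosen to reproduce the coefficients above):

import numpy as np

x1, y1 = 100, 623.2   # hypothetical points lying on y = -1.04*x + 727.2
x2, y2 = 200, 519.2
coeffs = np.polyfit((x1, x2), (y1, y2), 1)
print(coeffs)         # [ -1.04  727.2 ]  -- slope first, then intercept

# evaluating the polynomial confirms the line passes through both points
print(np.polyval(coeffs, [x1, x2]))   # [ 623.2  519.2 ]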
In regression terms, these are essentially the beta and alpha values for the given data: beta is the slope (in finance it is often read as a measure of volatility) and alpha is the intercept.
I've seen several posts on this subject, but I need a pure Python (no Numpy or any other imports) solution that accepts a list of points (x, y, z coordinates) and calculates the normal of the plane that best fits those points.

I'm following one of the working Numpy examples from here: Fit points to a plane algorithms, how to iterpret results?
import numpy as np

def fitPLaneLTSQ(XYZ):
    # Fits a plane to a point cloud
    # where Z = aX + bY + c      ---- Eqn #1
    # Rearranging Eqn #1: aX + bY - Z + c = 0
    # gives normal (a, b, -1)
    rows, cols = XYZ.shape
    G = np.ones((rows, 3))
    G[:, 0] = XYZ[:, 0]  # X
    G[:, 1] = XYZ[:, 1]  # Y
    Z = XYZ[:, 2]
    (a, b, c), resid, rank, s = np.linalg.lstsq(G, Z, rcond=None)
    normal = np.array([a, b, -1.0])  # as an array, so it can be divided by its norm
    nn = np.linalg.norm(normal)
    normal = normal / nn
    return normal

XYZ = np.array([
    [0, 0, 1],
    [0, 1, 2],
    [0, 2, 3],
    [1, 0, 1],
    [1, 1, 2],
    [1, 2, 3],
    [2, 0, 1],
    [2, 1, 2],
    [2, 2, 3]
])
print(fitPLaneLTSQ(XYZ))
# [ -8.10792259e-17   7.07106781e-01  -7.07106781e-01]
I'm trying to adapt this code: Basic ordinary least squares calculation, to replace np.linalg.lstsq.

Here is what I have so far, without Numpy, using the same coords as above:
import math

xvals = [0, 0, 0, 1, 1, 1, 2, 2, 2]
yvals = [0, 1, 2, 0, 1, 2, 0, 1, 2]
zvals = [1, 2, 3, 1, 2, 3, 1, 2, 3]

""" Basic ordinary least squares calculation. """
sumx, sumy = map(sum, [xvals, yvals])
sumxy = sum(map(lambda x, y: x * y, xvals, yvals))
sumxsq = sum(map(lambda x: x ** 2, xvals))
Nsamp = len(xvals)

# y = a*x + b
# a (slope)
slope = (Nsamp * sumxy - sumx * sumy) / (Nsamp * sumxsq - sumx ** 2)
# b (intercept)
intercept = (sumy - slope * sumx) / Nsamp

a = slope
b = intercept
normal = (a, b, -1)
mag = lambda x: math.sqrt(sum(i ** 2 for i in x))
nn = mag(normal)
normal = [i / nn for i in normal]

print(normal)
# [0.0, 0.7071067811865475, -0.7071067811865475]
As you can see, the answers come out the same, but only because of this particular example; in other examples they don't match. If you look closely you'll see that in the Numpy example the 'z' values are fed into np.linalg.lstsq, but in the non-Numpy version the 'z' values are ignored. How do I work the 'z' values into the least-squares code?

Thanks
I do not think you can get away without implementing some basic matrix operations. As this is a multivariate linear regression problem, you will definitely need dot product, transpose and norm. These are easy. The difficult part is that you also need matrix inverse or QR decomposition or something similar. People usually use BLAS for these for good reasons, implementing them is not easy - but not impossible either.
With QR decomposition
I would start by creating a Matrix class that has the following methods
dot(m1, m2) (or __matmul__(m1, m2) if you have python 3.5): it is just the sum of products, should be straightforward
transpose(self): swapping matrix elements, should be easy
norm(self): square root of sum of squares (should be only used on vectors)
qr_decomp(self): this one is tricky. For an almost pure python implementation see this rosetta code solution (disclaimer: I have not thoroughly checked this code). It uses some numpy functions, but these are basic functions you can implement for your matrix class (shape, eye, dot, copysign, norm).
leastsqr_ut(R, A): solve the equation Rx = A if R is an upper triangular matrix. Not trivial, but easy enough, as you can solve it equation by equation from the bottom (see the sketch after the steps below).
With these, the solution is easy:
Generate the matrix G as detailed in your numpy example
Find the QR decomposition of G
Solve Rb = Q'z for b using that R is an upper triangular matrix
Then the normal vector you are looking for is (b[0], b[1], -1) (or the norm of it if you want a unit length normal vector).
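As an illustration of step 3, here is a minimal back-substitution sketch (my own; it assumes R and A are plain nested lists/lists, as a hand-rolled Matrix class would hold them):

def leastsqr_ut(R, A):
    # back substitution: solve R x = A when R is upper triangular,
    # starting from the last equation and moving up
    n = len(A)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(R[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (A[i] - tail) / R[i][i]
    return x

For the plane fit, R is 3x3 and A = Q'z has length 3, so this loop runs only three times.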
With matrix inverse
The inverse of a 3x3 matrix is relatively easy to calculate, but this method is much less numerically stable than doing QR decomposition. If it is not an important concern, then you can do the following: implement
dot(m1, m2) (or __matmul__(m1, m2) if you have python 3.5): it is just the sum of products, should be straightforward
transpose(self): swapping matrix elements, should be easy
norm(self): square root of sum of squares (should be only used on vectors)
det(self): determinant, but it is enough if it works on 2x2 and 3x3 matrices, and for those simple formulas are available
inv(self): matrix inverse. It is enough if it works on 3x3 matrices, there is a simple formula for example here
Then the formula for b is b = inv(G'G) * (G'z) and your normal vector is again (b[0], b[1], -1).
As you can see, none of these are simple, and most of it is replicating some numpy functionality while making it a lot slower. So make sure you have absolutely no other choice.
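If the numerical stability of the matrix-inverse route is acceptable for your data, here is a minimal pure-Python sketch (the function names are mine). It builds G'G and G'z directly as sums over the points and solves the 3x3 system with Cramer's rule instead of forming the inverse explicitly:

import math

def det3(m):
    # determinant of a 3x3 matrix given as nested lists
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane_normal(points):
    # least-squares fit of z = a*x + b*y + c; G has rows (x, y, 1),
    # so G'G and G'z reduce to sums over the points
    sx = sum(x for x, y, z in points)
    sy = sum(y for x, y, z in points)
    sz = sum(z for x, y, z in points)
    sxx = sum(x * x for x, y, z in points)
    syy = sum(y * y for x, y, z in points)
    sxy = sum(x * y for x, y, z in points)
    sxz = sum(x * z for x, y, z in points)
    syz = sum(y * z for x, y, z in points)
    n = len(points)

    A = [[sxx, sxy, sx],
         [sxy, syy, sy],
         [sx,  sy,  n]]
    rhs = [sxz, syz, sz]

    # Cramer's rule: replace one column of A with rhs at a time
    d = det3(A)
    coeffs = []
    for col in range(3):
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][col] = rhs[r]
        coeffs.append(det3(Ai) / d)
    a, b, c = coeffs

    normal = (a, b, -1.0)
    mag = math.sqrt(sum(i * i for i in normal))
    return [i / mag for i in normal]

points = list(zip([0, 0, 0, 1, 1, 1, 2, 2, 2],
                  [0, 1, 2, 0, 1, 2, 0, 1, 2],
                  [1, 2, 3, 1, 2, 3, 1, 2, 3]))
print(fit_plane_normal(points))   # [0.0, 0.7071..., -0.7071...], matching numpy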
I wrote code with a similar purpose (see the tangentplane_3D function in the linked code).

In my case I had a scattered cloud of points that define a 3D ellipsoid. For each point I wanted to determine the tangent plane to the ellipsoid containing that point --> Goal: determination of a 3D plane.

The problem can be seen this way: a plane is defined by its normal, and the normal can be found as the singular vector associated with the smallest singular value of a centred set of n points.

What I did, and you can check it in the code I posted, is to select the k points closest to the point of interest at which I wanted to calculate the tangent plane. Then I performed a 3D Singular Value Decomposition on these k points. Finally, from this SVD I selected the smallest singular value and its associated singular vector, which is, in fact, the normal of the plane best fitting my set of points, and thus, in my case, tangent to the ellipsoid. With the normal vector and the point you can then calculate the complete plane equation.
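For reference, here is a minimal numpy sketch of that core step (an illustration of the idea, not the linked tangentplane_3D code itself; it assumes the k points have already been selected):

import numpy as np

def svd_plane_normal(pts):
    # pts: (k, 3) array; the best-fit plane's normal is the right singular
    # vector associated with the smallest singular value of the centred cloud
    pts = np.asarray(pts, dtype=float)
    centred = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centred)
    return vt[-1]   # rows of vt are right singular vectors; already unit length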
I hope it helps!!
Best wishes.
I am trying to find higher order derivatives of a dataset (x,y). x and y are 1D arrays of length N.
Let's say I generate them as:
xder0=np.linspace(0,10,1000)
yder0=np.sin(xder0)
I define the derivative function, which takes in two arrays (x, y) and returns (x1, y1), where y1 is the derivative calculated at each index as (y[i+1] - y[i]) / (x[i+1] - x[i]), and x1 is just the mean of x[i+1] and x[i].
Here is the function that does it:
import numpy as np

def deriv(x, y):
    delx = np.zeros((len(x) - 1), dtype=np.longdouble)
    ydiff = np.zeros((len(x) - 1), dtype=np.longdouble)
    for i in range(len(x) - 1):
        delx[i] = (x[i + 1] + x[i]) / 2.0                 # midpoint of the interval
        ydiff[i] = (y[i + 1] - y[i]) / (x[i + 1] - x[i])  # forward difference
    return delx, ydiff
Now to calculate the first derivative, I call this function as:
xder1, yder1 = deriv(xder0, yder0)
Similarly for second derivative, I call this function giving first derivatives as input:
xder2, yder2 = deriv(xder1, yder1)
And it goes on:
xder3, yder3 = deriv(xder2, yder2)
xder4, yder4 = deriv(xder3, yder3)
xder5, yder5 = deriv(xder4, yder4)
xder6, yder6 = deriv(xder5, yder5)
xder7, yder7 = deriv(xder6, yder6)
xder8, yder8 = deriv(xder7, yder7)
xder9, yder9 = deriv(xder8, yder8)
Something peculiar happens after I reach order 7: the 7th-order derivative becomes very noisy! The earlier derivatives all look like sine or cosine functions, as expected, but the 7th order is a noisy sine, and hence all derivatives after that blow up.

Any idea what is going on?
This is a well-known stability issue with numerical interpolation using equally spaced points. Read the answers at http://math.stackexchange.com.

To overcome this problem you have to use non-equally-spaced points, such as the roots of a Legendre polynomial. The instability occurs due to the lack of information at the boundaries, so a higher concentration of points near the boundaries is required, as given by the roots of, say, Legendre polynomials or others with similar properties, such as Chebyshev polynomials.
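A rough back-of-the-envelope model (my own estimate, assuming the original np.sin samples carry float64 rounding error of order machine epsilon) also shows why the blow-up appears at around order 7: each differencing pass divides that error by the spacing dx, so after k passes the noise grows like eps / dx**k:

import numpy as np

eps = np.finfo(np.float64).eps   # ~2.2e-16, rounding error in the sine samples
dx = 10 / 999                    # spacing of np.linspace(0, 10, 1000)
for k in range(1, 10):
    # estimated noise amplitude in the k-th derivative: about 0.02 at k = 7
    # (visible against a unit-amplitude sine) and about 2 at k = 8
    print(k, eps / dx ** k)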