Python Earth Mover Distance of 2D arrays - python

I would like to compute the Earth Mover Distance between two 2D arrays (these are not images).
Right now I go through two libraries: scipy (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.wasserstein_distance.html) and pyemd (https://pypi.org/project/pyemd/).
import numpy as np
import pyemd
# define a sampling method
def sampeling2D(n, mu1, std1, mu2, std2):
    # sample from N(0, 1) in 2D
    x = np.random.randn(n, 2)
    # scale N(0, 1) -> N(mu, std)
    x[:, 0] = (x[:, 0] * std1) + mu1
    x[:, 1] = (x[:, 1] * std2) + mu2
    return x
# generate two sets
Y1 = sampeling2D(1000, 0, 1, 0, 1)
Y2 = sampeling2D(1000, -1, 1, -1, 1)
# compute the distance
distance = pyemd.emd_samples(Y1, Y2)
The scipy version doesn't accept 2D arrays and raises an error, whereas the pyemd method returns a value. However, the documentation says that it accepts only 1D arrays, so I think that the output is wrong. How can I calculate this distance in this case?

So if I understand you correctly, you're trying to transport the sampling distribution, i.e. calculate the distance for a setup where all clusters have weight 1. In general, you can treat the calculation of the EMD as an instance of minimum cost flow, and in your case, this boils down to the linear assignment problem: your two arrays are the partitions in a bipartite graph, and the weights between two vertices are your distance of choice. Assuming that you want to use the Euclidean norm as your metric, the weights of the edges, i.e. the ground distances, may be obtained using scipy.spatial.distance.cdist, and SciPy also provides a solver for the linear sum assignment problem in scipy.optimize.linear_sum_assignment. That solver saw huge performance improvements in SciPy 1.4, which could be of interest to you should you run into performance problems; the 1.3 implementation is a bit slow for 1000x1000 inputs.
In other words, what you want to do boils down to
from scipy.spatial.distance import cdist
from scipy.optimize import linear_sum_assignment
n = len(Y1)  # both sample sets have the same size
d = cdist(Y1, Y2)
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n)
It is also possible to use scipy.sparse.csgraph.min_weight_bipartite_full_matching as a drop-in replacement for linear_sum_assignment; while made for sparse inputs (which yours certainly isn't), it might provide performance improvements in some situations.
It might be instructive to verify that the result of this calculation matches what you would get from a minimum cost flow solver; one such solver is available in NetworkX, where we can construct the graph by hand:
import networkx as nx
G = nx.DiGraph()
# Represent elements in Y1 by 0, ..., 999, and elements in
# Y2 by 1000, ..., 1999.
for i in range(n):
    G.add_node(i, demand=-1)
    G.add_node(n + i, demand=1)
for i in range(n):
    for j in range(n):
        G.add_edge(i, n + j, capacity=1, weight=d[i, j])
At this point, we can verify that the approach above agrees with the minimum cost flow:
In [16]: d[assignment].sum() == nx.algorithms.min_cost_flow_cost(G)
Out[16]: True
Similarly, it's instructive to see that the result agrees with scipy.stats.wasserstein_distance for 1-dimensional inputs:
from scipy.stats import wasserstein_distance
np.random.seed(0)
n = 100
Y1 = np.random.randn(n)
Y2 = np.random.randn(n) - 2
d = np.abs(Y1 - Y2.reshape((n, 1)))
assignment = linear_sum_assignment(d)
print(d[assignment].sum() / n) # 1.9777950447866477
print(wasserstein_distance(Y1, Y2)) # 1.977795044786648

Related

How better perform Pearson R from 2 arrays of dimensions (m, n) and (n), returning an array of (m) size? [Python, NumPy, SciPy]

I'm trying to improve a simple algorithm for obtaining the Pearson correlation coefficient from two arrays, X(m, n) and Y(n), returning another array R of dimension (m).
In this case, I want to know how each row of X behaves with respect to the values of Y. A sample (working) code is presented below:
import numpy as np
from scipy.stats import pearsonr
np.random.seed(1)
m, n = 10, 5
x = 100*np.random.rand(m, n)
y = 2 + 2*x.mean(0)
r = np.empty(m)
for i in range(m):
    r[i] = pearsonr(x[i], y)[0]
For this particular case, I get: r = array([0.95272843, -0.69134753, 0.36419159, 0.27467137, 0.76887201, 0.08823868, -0.72608421, -0.01224453, 0.58375626, 0.87442889])
For small values of m (near 10k) this runs pretty fast, but I'm starting to work with m ~ 30k, and so this is taking much longer than I expected. I'm aware I could implement multiprocessing/multi-threading but I believe there's a (better) pythonic way of doing this.
I tried to use pearsonr(x, np.ones((m, n))*y), but it returns only (nan, nan).
pearsonr only supports 1D arrays internally. Moreover, it computes the p-value, which is not used here, so it would be more efficient to skip that computation if possible. Additionally, the loop recomputes the statistics of the y vector on every iteration and does not make efficient use of vectorized NumPy operations. This is why the computation is a bit slow. You can check this in the SciPy source code.
One way to compute this is by writing your own custom implementation based on the one of Scipy:
def multi_pearsonr(x, y):
    xmean = x.mean(axis=1)
    ymean = y.mean()
    xm = x - xmean[:,None]
    ym = y - ymean
    normxm = np.linalg.norm(xm, axis=1)
    normym = np.linalg.norm(ym)
    return np.clip(np.dot(xm/normxm[:,None], ym/normym), -1.0, 1.0)
It is 450 times faster on my machine for m = 10_000.
Note that I did not keep the checks of the SciPy code, but it may be a good idea to keep them if your input is not guaranteed to be statistically safe (i.e. well formed for the computation of the Pearson coefficient, for example with no constant rows).
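A quick way to convince yourself that the vectorized formula is right is to compare it against a row-by-row reference. The sketch below uses np.corrcoef as that reference (my choice for the check, not part of the answer above) and a compact restatement of the same function:

```python
import numpy as np

def multi_pearsonr(x, y):
    """Pearson r between each row of x, shape (m, n), and the vector y, shape (n,)."""
    xm = x - x.mean(axis=1, keepdims=True)
    ym = y - y.mean()
    r = xm @ ym / (np.linalg.norm(xm, axis=1) * np.linalg.norm(ym))
    return np.clip(r, -1.0, 1.0)

rng = np.random.default_rng(1)  # illustrative seed
m, n = 10, 5
x = 100 * rng.random((m, n))
y = 2 + 2 * x.mean(axis=0)

r_fast = multi_pearsonr(x, y)
# Reference: off-diagonal entry of the 2x2 correlation matrix, one row at a time
r_ref = np.array([np.corrcoef(row, y)[0, 1] for row in x])
print(np.allclose(r_fast, r_ref))
```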

Fitting a plane by Orthogonal Regression in Python

I want to fit a plane to a set of points (x, y, z) in Python. I found various answers how to perform the fitting if the error is measured with respect to the z-axis but I want to consider errors in orthogonal direction. I found the following question (Best fit plane by minimizing orthogonal distances) which addresses the same question - but it's not clear to me how to implement this in Python (likely with NumPy/SciPy). Further details regarding the mathematical derivation can also be found here: http://www.ncorr.com/download/publications/eberlyleastsquares.pdf (section 2).
The first link you gave does describe the algorithm for orthogonal distance fitting, but rather tersely. Here, in case it helps, is a more prolix description:
I suppose you have points (in your case 3d, but the dimension makes no odds to the algorithm) P[i], i=1..N
You want to find a (hyper-) plane that is of minimal orthogonal distance from your points.
A hyper-plane can be described by a unit vector n and a scalar d. The set of points on the plane is
{ P | n.P + d = 0 }
and the signed (orthogonal) distance of a point P from the plane is
n.P + d
So we want to find n and d to minimise
Q(n,d) = Sum{ i | (n.P[i]+d)*(n.P[i]+d) } /N
(The division by N isn't essential, and makes no difference to the values of n and d that are found, but to my mind makes the algebra neater)
The first thing to notice is that if we knew n, the d that minimises Q will be
d = -n.Pbar where
Pbar = Sum{ i | P[i]}/N, the mean of the P[]
We may as well use this value of d, so that, after a little algebra the problem reduces to minimising Q^:
Q^(n) = Sum{ i | (n.P[i]-n.Pbar)*(n.P[i]-n.Pbar) } /N
= n' * C * n
where
C = Sum{ i | (P[i]-Pbar)*(P[i]-Pbar)' } /N
The form of Q^ tells us that the value of n to minimise Q^ will be a unit eigenvector of C corresponding to a minimal eigenvalue.
So (sorry I can't give code but my python is contemptible):
a/ compute
Pbar = Sum{ i | P[i]}/N, the mean of the points
b/ compute
C = Sum{ i | (P[i]-Pbar)*(P[i]-Pbar)' } /N, the covariance matrix of the points (each term is an outer product)
c/ diagonalise C, and pick out a minimal eigenvalue and the corresponding eigenvector n
d/ compute
d = -Pbar.n
Then n, d define the hyperplane you want.
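Steps a/ to d/ translate fairly directly to NumPy. The following is only a sketch of that recipe: the function name fit_plane_orthogonal and the test plane are made up for illustration, and np.linalg.eigh is used because C is symmetric.

```python
import numpy as np

def fit_plane_orthogonal(points):
    """Return (n, d): unit normal n and offset d of the plane n.P + d = 0."""
    P = np.asarray(points, dtype=float)
    Pbar = P.mean(axis=0)                  # a/ mean of the points
    centered = P - Pbar
    C = centered.T @ centered / len(P)     # b/ covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # c/ eigenvalues come back in ascending order
    n = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue
    d = -Pbar @ n                          # d/ offset
    return n, d

# Noise-free points on the plane x + 2y - 2z + 3 = 0 (illustrative data)
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(50, 2))
z = (xy[:, 0] + 2 * xy[:, 1] + 3) / 2
pts = np.column_stack([xy, z])

n, d = fit_plane_orthogonal(pts)
residuals = pts @ n + d   # signed orthogonal distances, close to zero here
print(n, d)
```

The sign of n is arbitrary (both n and -n describe the same plane), so compare normals up to sign.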
I've also had to deal with this situation and at first the mathematical notation can be overwhelming, but in the end the solution is fairly simple.
Once you get the intuition that the vector (A,B,C) that defines the best fitting plane Ax+By+Cz+D=0 is the one that explains the minimum variance of your set of coordinates, then the solution is straightforward.
First thing to do is center your coordinates (this way D will be 0 in your plane equation)
coords -= coords.mean(axis=0)
Then you have 2 options to get the vector you are interested in: (1) use the PCA implementation from sklearn to get the vector that explains minimal variance
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(coords)
# The last component/vector is the one with minimal variance, see PCA documentation
normal_vector = pca.components_[-1]
(2) re-implement the procedure described in the Geometric Tools reference you've linked.
@njit  # optional; requires "from numba import njit"
def get_best_fitting_plane_vector(coords):
    # Calculate the covariance matrix of the coordinates
    covariance_matrix = np.cov(coords, rowvar=False) # Variables = columns
    # Calculate the eigenvalues & eigenvectors of the covariance matrix
    e_val, e_vect = np.linalg.eig(covariance_matrix)
    # The normal vector to the plane is the eigenvector associated to the minimum eigenvalue
    min_eval = np.argmin(e_val)
    normal_vector = e_vect[:, min_eval]
    return normal_vector
In terms of speed, the re-implemented procedure is faster than using PCA, and can be a lot faster if you use numba (just decorate the function with @njit).
Based on your second reference:
Say you have n samples (x,y,z)
I'll call the 3 terms M*A=V, and define the column arrays
X=[ x_0, x_1 .. x_(n-1) ]'
Y=[ y_0, y_1 .. y_(n-1) ]'
Z=[ z_0, z_1 .. z_(n-1) ]'
Define the (n by 3) matrix XY1=[X,Y,1n]:
XY1= [[x_0,     y_0,     1],
      [x_1,     y_1,     1],
      ...
      [x_(n-1), y_(n-1), 1]]
The matrix M can be obtained as
M = XY1' * XY1
Where apostrophe (') is the transposition operator and (*) the matrix product.
And the array V is
V = XY1'*Z
The least-squares solution can be obtained through the Moore-Penrose pseudoinverse: [(M'*M)^-1 * M']
~A = [(M'*M)^-1 * M'] * V
Sample code:
import numpy as np
from mpl_toolkits import mplot3d
import matplotlib.pyplot as plt
#Input your values
A=3
B=2
C=1
#Make random data, n ( x,y ) tuples.
n=30 #samples
#reserve memory (n must be defined before it is used here)
xy1=np.ones([n,3])
xy1[:,:2]=np.random.rand(n,2)
#plane: A*x+B*y+C = z , the z coord is calculated from random x,y
z=xy1.dot (np.array([[A,B,C],]).transpose() )
#addnoise
xy1[:,:2]+=np.random.normal(scale=0.05,size=[n,2])
z+=np.random.normal(scale=0.05,size=[n,1])
#calculate M and V
M=xy1.transpose().dot(xy1)
V=xy1.transpose().dot(z)
#pseudoinverse:
Mp=np.linalg.inv(M.transpose().dot(M)).dot(M.transpose())
#Least-squares Solution
ABC= Mp.dot(V)
Output
In [24]: ABC
Out[24]:
array([[3.11395111],
[2.02909874],
[1.01340411]])

Gradient descent from scratch in python not working

I am trying to implement a gradient descent algorithm from scratch in Python, which should be fairly easy. However, I have been scratching my head over my code for quite a while now, unable to make it work.
I generate data as follows:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
#Defining the x array.
x=np.array(range(1,100))
#Defining the y array.
y=10+2*x.ravel()
y=y+np.random.normal(loc=0, scale=70, size=99)
Then define the parameters:
alpha = 0.01 # Which will be the learning rate
NbrIter = 100 # Representing the number of iteration
m = len(y)
theta = np.random.randn(2,1)
and my GD is as follows:
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta) - y) )
What I get is a huge matrix, meaning that I have some problem with the linear algebra. However, I really fail to see where the issue is.
(Playing around with the matrices to try to get them to match, I reached a theta having the correct form (2x1) with:
theta = theta - (1/m) * alpha * ( X.T @ ((X @ theta).T - y).T )
But it does look wrong and the actual values are way off (array([[-8.92647663e+148],
[-5.92079000e+150]]))
)
I guess you were hit by broadcasting. The shape of y is (99,), while theta is a column vector, so X @ theta is a column vector of shape (99, 1). When y is subtracted from it, y is broadcast to a row vector of shape (1, 99), and the result of the subtraction has shape (99, 99). To fix this, reshape y into a column vector with y.reshape(-1, 1).
Now, a few optimizations:
X.T @ ((X @ theta) - y[:,None])
can be rewritten as:
(X.T@X) @ theta - (X.T@y[:,None])
The most costly computation can be taken out of the loop:
XtX = X.T@X
Xty = X.T@y[:,None]
for iter in range(NbrIter):
    theta = theta - (1/m) * alpha * (XtX @ theta - Xty)
Now you operate on a 2x2 matrix rather than a 99x2 one.
Let's take a look on convergence.
Assuming that X is constructed like X=np.column_stack((x, np.ones_like(x))), it is possible to check the matrix condition number:
np.linalg.cond(XtX)
Which produced:
13475.851490419038
It means that the ratio between the maximal and minimal eigenvalues is about 13k. Therefore using alpha larger than 1/13k will likely result in bad convergence.
If you use alpha=1e-5 the algorithm will converge.
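To make the step-size argument concrete, here is a small sketch under the same assumption about how X is built. It relies on the standard stability condition for plain gradient descent on a quadratic, alpha/m * lambda_max < 2, where lambda_max is the largest eigenvalue of XtX; the seed, the starting point and the iteration count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)  # illustrative seed
x = np.arange(1, 100)
y = 10 + 2 * x + rng.normal(loc=0, scale=70, size=x.size)

X = np.column_stack((x, np.ones_like(x))).astype(float)  # assumed design matrix
y_col = y.reshape(-1, 1)                                  # column vector, shape (99, 1)
m = len(y)

XtX = X.T @ X
Xty = X.T @ y_col

# Largest step size for which the iteration does not blow up
# (roughly 6e-4 for this X, so alpha = 0.01 is far too large)
lambda_max = np.linalg.eigvalsh(XtX)[-1]
alpha_max = 2 * m / lambda_max

def mse(theta):
    residual = X @ theta - y_col
    return float(np.mean(residual ** 2))

alpha = 1e-5
theta = np.zeros((2, 1))
losses = [mse(theta)]
for _ in range(1000):
    theta = theta - (alpha / m) * (XtX @ theta - Xty)
    losses.append(mse(theta))

print(alpha_max, theta.ravel(), losses[0], losses[-1])
```

Because of the large condition number, the slope settles quickly while the intercept moves very slowly; centering or scaling x is the usual remedy.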
Good luck!

numpy fit coefficients to linear combination of polynomials

I have data that I want to fit with polynomials. I have 200,000 data points, so I want an efficient algorithm. I want to use the numpy.polynomial package so that I can try different families and degrees of polynomials. Is there some way I can formulate this as a system of equations like Ax=b? Is there a better way to solve this than with scipy.minimize?
import numpy as np
from scipy.optimize import minimize as mini
x1 = np.random.random(2000)
x2 = np.random.random(2000)
y = 20 * np.sin(x1) + x2 - np.sin (30 * x1 - x2 / 10)
def fitness(x, degree=5):
    poly1 = np.polynomial.polynomial.polyval(x1, x[:degree])
    poly2 = np.polynomial.polynomial.polyval(x2, x[degree:])
    return np.sum((y - (poly1 + poly2)) ** 2 )
# It seems like I should be able to solve this as a system of equations
# x = np.linalg.solve(np.concatenate([x1, x2]), y)
# minimize the sum of the squared residuals to find the optimal polynomial coefficients
x = mini(fitness, np.ones(10))
print(fitness(x.x))
Your intuition is right. You can solve this as a system of equations of the form Ax = b.
However:
The system is overdetermined and you want to get the least-squares solution, so you need to use np.linalg.lstsq instead of np.linalg.solve.
You can't use polyval because you need to separate the coefficients and powers of the independent variable.
This is how to construct the system of equations and solve it:
A = np.stack([x1**0, x1**1, x1**2, x1**3, x1**4, x2**0, x2**1, x2**2, x2**3, x2**4]).T
xx = np.linalg.lstsq(A, y)[0]
print(fitness(xx)) # test the result with original fitness function
Of course you can generalize over the degree:
A = np.stack([x1**p for p in range(degree)] + [x2**p for p in range(degree)]).T
With the example data, the least squares solution runs much faster than the minimize solution (800µs vs 35ms on my laptop). However, A can become quite large, so if memory is an issue minimize might still be an option.
Update:
Without any knowledge about the internals of the polynomial function things become tricky, but it is possible to separate terms and coefficients. Here is a somewhat ugly way to construct the system matrix A from a function like polyval:
def construct_A(valfunc, degree):
    columns1 = []
    columns2 = []
    for p in range(degree):
        c = np.zeros(degree)
        c[p] = 1
        columns1.append(valfunc(x1, c))
        columns2.append(valfunc(x2, c))
    return np.stack(columns1 + columns2).T
A = construct_A(np.polynomial.polynomial.polyval, 5)
xx = np.linalg.lstsq(A, y)[0]
print(fitness(xx)) # test the result with original fitness function
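If the goal is to try different families from numpy.polynomial, another option is to use the Vandermonde helpers that each family ships (polyvander, chebvander and so on), which build the columns directly. The sketch below assumes the same kind of data as in the question; the helper name fit_two_variables is made up for illustration.

```python
import numpy as np
from numpy.polynomial import polynomial as P
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)  # illustrative seed
x1 = rng.random(2000)
x2 = rng.random(2000)
y = 20 * np.sin(x1) + x2 - np.sin(30 * x1 - x2 / 10)

def fit_two_variables(vander, n_coef):
    # vander(x, deg) returns one column per degree 0..deg, i.e. deg + 1 columns
    A = np.hstack([vander(x1, n_coef - 1), vander(x2, n_coef - 1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A, coef

A_pow, coef_pow = fit_two_variables(P.polyvander, 5)
A_cheb, coef_cheb = fit_two_variables(C.chebvander, 5)

sse_pow = np.sum((y - A_pow @ coef_pow) ** 2)
sse_cheb = np.sum((y - A_cheb @ coef_cheb) ** 2)
print(sse_pow, sse_cheb)
```

Both bases span the same space for a given degree, so the residuals should match up to round-off; the families differ mainly in numerical conditioning. Note that the two constant columns make A rank-deficient, which lstsq handles by returning a minimum-norm solution.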

Efficient distance calculation between N points and a reference in numpy/scipy

I just started using scipy/numpy. I have a 100000*3 array, where each row is a coordinate, and a 1*3 center point. I want to calculate the distance from each row in the array to the center and store the results in another array. What is the most efficient way to do it?
I would take a look at scipy.spatial.distance.cdist:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html
import numpy as np
from scipy.spatial.distance import cdist
a = np.random.normal(size=(10,3))
b = np.random.normal(size=(1,3))
dist = cdist(a,b) # pick the appropriate distance metric
dist for the default distance metric is equivalent (up to shape: cdist returns a (10, 1) array) to:
np.sqrt(np.sum((a-b)**2,axis=1))
although cdist is much more efficient for large arrays (on my machine for your size problem, cdist is faster by a factor of ~35x).
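For the single-center case in the question, a pure-NumPy alternative worth timing on your own machine is np.linalg.norm along the coordinate axis. This is only a sketch with made-up data of the size mentioned in the question:

```python
import numpy as np

rng = np.random.default_rng(0)            # illustrative seed
points = rng.normal(size=(100_000, 3))    # one coordinate per row
center = rng.normal(size=(1, 3))          # the reference point

# Broadcasting subtracts the center from every row; the norm is taken per row
dist = np.linalg.norm(points - center, axis=1)
print(dist.shape)
```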
I would use the sklearn implementation of the Euclidean distance. Its advantage is that it uses a more efficient expression based on matrix multiplication:
dist(x, y) = sqrt(np.dot(x, x) - 2 * np.dot(x, y) + np.dot(y, y))
A simple script would look like this:
import numpy as np
x = np.random.rand(1000, 3)
y = np.random.rand(1000, 3)
xx = np.einsum('ij,ij->i', x, x)[:, None]  # squared norms of the rows of x
yy = np.einsum('ij,ij->i', y, y)[None, :]  # squared norms of the rows of y
# clip tiny negative values caused by round-off before taking the root
dist = np.sqrt(np.maximum(xx - 2 * np.dot(x, y.T) + yy, 0))
The advantage of this approach has been nicely described in the sklearn documentation:
http://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html#sklearn.metrics.pairwise.euclidean_distances
I am using this approach to crunch large data matrices (10000, 10000) with some minor modifications like using the np.einsum function.
You can also use the expansion of the norm (similar to the remarkable identities). This is probably the most efficient way to compute the distances for a matrix of points.
Here is a code snippet that I originally used for a k-Nearest-Neighbors implementation, in Octave, but you can easily adapt it to numpy since it only uses matrix multiplications (the equivalent is numpy.dot()):
% Computing the euclidian distance between each known point (Xapp) and unknown points (Xtest)
% Note: we use the development of the norm just like a remarkable identity:
% ||x1 - x2||^2 = ||x1||^2 + ||x2||^2 - 2*<x1,x2>
[napp, d] = size(Xapp);
[ntest, d] = size(Xtest);
A = sum(Xapp.^2, 2);
A = repmat(A, 1, ntest);
B = sum(Xtest.^2, 2);
B = repmat(B', napp, 1);
C = Xapp*Xtest';
dist = A+B-2.*C;
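A possible NumPy translation of the Octave snippet is sketched below. The function name squared_distances is made up, broadcasting replaces repmat, and the clipping of small negative values is my addition to guard against round-off. Like the original, it returns squared distances.

```python
import numpy as np

def squared_distances(Xapp, Xtest):
    """Squared Euclidean distances, shape (napp, ntest)."""
    A = np.sum(Xapp ** 2, axis=1)[:, None]    # ||x1||^2 as a column, (napp, 1)
    B = np.sum(Xtest ** 2, axis=1)[None, :]   # ||x2||^2 as a row, (1, ntest)
    C = Xapp @ Xtest.T                        # <x1, x2>, (napp, ntest)
    return np.maximum(A + B - 2.0 * C, 0.0)   # clip round-off negatives

rng = np.random.default_rng(0)  # illustrative data
Xapp = rng.random((6, 3))
Xtest = rng.random((4, 3))
D2 = squared_distances(Xapp, Xtest)

# Brute-force reference via broadcasting
D2_ref = np.sum((Xapp[:, None, :] - Xtest[None, :, :]) ** 2, axis=2)
print(np.allclose(D2, D2_ref))
```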
This might not answer your question directly, but if you are after the distances between all pairs of particles, I've found the following solution to be faster than the pdist function in some cases.
import numpy as np
L = 100 # simulation box dimension
N = 100 # Number of particles
dim = 2 # Dimensions
# Generate random positions of particles
r = (np.random.random(size=(N,dim))-0.5)*L
# uti is a tuple of two (1-D) numpy arrays
# containing the indices of the upper triangular matrix
uti = np.triu_indices(N,k=1) # k=1 eliminates diagonal indices
# uti[0] holds the row indices i, and uti[1] the column indices j, of each pair
dr = r[uti[0]] - r[uti[1]] # computes differences between particle positions
D = np.sqrt(np.sum(dr*dr, axis=1)) # computes distances; D is a 1-D array of length N*(N-1)/2 = 4950
See my blog post for a more in-depth look at this matter.
You may need to specify in more detail the distance function you are interested in, but here is a very simple (and efficient) implementation of the squared Euclidean distance based on an inner product (which can be generalized in a straightforward manner to other kinds of distance measures):
In []: P, c= randn(5, 3), randn(1, 3)
In []: dot(((P- c)** 2), ones(3))
Out[]: array([ 8.80512, 4.61693, 2.6002, 3.3293, 12.41800])
Where P are your points and c is the center.
# Is this a correct way to find the largest distance between points in the plane?
from math import sqrt
n = int(input( "enter the range : "))
x = list(map(float,input("type x coordinates: ").split()))
y = list(map(float,input("type y coordinates: ").split()))
maxdis = 0
for i in range(n):
    for j in range(n):
        print(i, j, x[i], x[j], y[i], y[j])
        dist = sqrt((x[j]-x[i])**2+(y[j]-y[i])**2)
        if maxdis < dist:
            maxdis = dist
print(" maximum distance is : {:5g}".format(maxdis))
