Different results using NumPy (np.sin) and for loop - python

I am trying to calculate the radius of the lens created by two overlapping spheres. I tried both a trigonometric method and a purely algebraic one. Comparing the results of the two methods on various data sets, I found a small number of discrepancies on just some of them; the results are the same in most cases. The problem can be reproduced by the following example (on indices 3-5):
poss = np.array([[[-0.884, -3.45, -0.99 ], [-0.901, -3.43, -0.995]], [[-0.993, -3.44, -0.97 ], [-1.01, -3.46, -1. ]],
[[-0.993, -3.44, -0.97 ], [-0.998, -3.45, -1. ]], [[0.885 , 0.967, -1.02 ], [0.885, 0.964, -1.02] ],
[[-0.252, -3.3 , -0.777], [-0.197, -3.3 , -0.777]], [[0.26 , -1.68, -0.803], [0.288, -1.67, -0.799]],
[[0.599 , 2.04 , -0.857], [0.607 , 2.04 , -0.84 ]], [[0.615 , 2. , -0.833], [0.633, 2. , -0.855]],
[[0.698 , 2.06 , -0.921], [0.679 , 2.06 , -0.914]]])
rad = np.array([[0.0108, 0.0205], [0.0231, 0.0259], [0.0231 , 0.0304], [0.0154, 0.0124], [0.0137, 0.0413],
[0.027 , 0.003 ], [0.0102, 0.022 ], [0.00221, 0.0268], [0.0147, 0.0124]])
# The length of the overlaps; lenses' heights
gap = np.array([-4.57922157e-03, -9.13773714e-03, -2.14843788e-02, -2.48000000e-02, -1.38777878e-17, -2.42861287e-17,
-1.34117058e-02, -5.84659193e-04, -6.85154327e-03])
The functions are:
def trigonometric(r_active, gap):
    r_add = np.add.reduce(r_active, axis=1)
    paired_cent_dis = np.sum((r_add, gap), axis=0)
    intersect_angle_0 = np.arccos(np.clip((r_active[:, 0] ** 2 +
                                           paired_cent_dis ** 2 - r_active[:, 1] ** 2) /
                                          (2 * r_active[:, 0] * paired_cent_dis), -1, 1))
    intersect_plane_rad = r_active[:, 0] * np.sin(intersect_angle_0)
    return intersect_plane_rad
def algebraic(r, gap):
    items_ = np.empty((len(gap), 1), dtype=np.float64)
    for i in range(len(gap)):
        r0, r1 = r[i]
        cur_gap = gap[i]
        paired_cent_dis = r0 + r1 + cur_gap
        intersect_plane_rad = 0.5 * abs((-paired_cent_dis + r0 + r1) *
                                        ( paired_cent_dis + r0 + r1) * (-paired_cent_dis - r0 + r1) *
                                        (-paired_cent_dis + r0 - r1)) ** 0.5 / paired_cent_dis
        items_[i] = intersect_plane_rad
    return items_.ravel()
trigonometric(rad, gap)
algebraic(rad, gap)
The results:
# repr trigonometric:
array([7.59403901e-03, 1.42126146e-02, 2.08670250e-02, 0.00000000e+00,
4.56484128e-10, 0.00000000e+00, 1.01747354e-02, 1.45347671e-03,
8.94740633e-03])
# repr algebraic:
array([7.59403901e-03, 1.42126146e-02, 2.08670250e-02, 4.69938148e-10,
5.34354024e-10, 3.68549655e-10, 1.01747354e-02, 1.45347671e-03,
8.94740633e-03])
As can be seen from the results, the values differ at indices 3, 4, and 5. AFAIK, the two methods do the same job, and this is confirmed on various data volumes; such differences only show up on some indices in rare cases. In this example only index 3 is affected by np.clip (in this small example that index gets 0 from the trigonometric method, but it gets a nonzero value in my main code!? That nonzero value, too, differed from the algebraic result for the same index, i.e. 4.69938148e-10). As is apparent in the images, and judging by the gap values (which are very small or near the diameter of the smaller sphere), the problem (the differences between the results on some contacts) seems to be due to limited calculation precision or something like that.
The final algebraic result shows that the values at the suspected indices have a plausible order of magnitude (around 1e-10 here), so it seems the trigonometric method is being misled somewhere in the process.
I would be grateful to know:
where the problem comes from,
why index 4 of the trigonometric result gets a nonzero value, but of a different magnitude than the algebraic one, although the 4th and 5th gap values are nearly the same,
and how the trigonometric method could be cured, if it can be.
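For reference, here is a minimal numeric check of the precision hypothesis, using the values at index 4 from the arrays above (my own rough sketch, not part of either method):

import numpy as np

r0, r1 = 0.0137, 0.0413
gap = -1.38777878e-17
d = r0 + r1 + gap                      # paired centre distance, essentially r0 + r1

# Mathematically the arccos argument is exactly 1 here (the lens degenerates to a
# point), so the intersection radius should be 0. In floating point the argument
# lands a few ULPs above or below 1, depending on rounding.
arg = (r0**2 + d**2 - r1**2) / (2 * r0 * d)
print(arg - 1.0)                                       # tiny, but not exactly 0
print(r0 * np.sin(np.arccos(np.clip(arg, -1, 1))))     # 0 if arg >= 1, else ~1e-10

If the argument rounds to 1 or above, np.clip pins it to exactly 1 and the trigonometric result is exactly 0; if it rounds just below 1, arccos of a value that close to 1 has a large relative error, which is consistent with the spurious ~1e-10 values above.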

Related

How to make for loops faster in Python?

I am writing a script in Python 2.7 that will train a neural network. As part of the main script I need a program that solves the 2D heat conduction partial differential equation. I previously wrote this program in Fortran and then rewrote it in Python. The time required by Fortran is 0.1 s, while Python requires 13 s! That is absolutely unacceptable for me, since in that case the computational time would be determined by the part of the program that solves the PDE, not by the epochs of training of the neural network.
How to solve that problem?
It seems that I cannot vectorize the matrix update, since a new element t[i,j] is calculated using the value t[i-1,j], etc.
Here is the part of the code that is running slowly:
while (norm > eps):
    # old value
    t_old = np.copy(t)
    # new value
    for i in xrange(1,n-1):
        for j in xrange(1,m-1):
            d[i] = 0.0
            a[i+1,j] = (0.5*dx/k[i,j] + 0.5*dx/k[i+1,j])
            a[i-1,j] = (0.5*dx/k[i,j] + 0.5*dx/k[i-1,j])
            a[i,j+1] = (0.5*dy/k[i,j] + 0.5*dy/k[i,j+1])
            a[i,j-1] = (0.5*dy/k[i,j] + 0.5*dy/k[i,j-1])
            a[i,j] = a[i+1,j] + a[i-1,j] + a[i,j+1] + a[i,j-1]
            sum = a[i+1,j]*t[i+1,j] + a[i-1,j]*t[i-1,j] + a[i,j+1]*t[i,j+1] + a[i,j-1]*t[i,j-1] + d[i]
            t[i,j] = ( sum + d[i] ) / a[i,j]
            k[i,j] = k_func(t[i,j])
    # matrix 2nd norm
    norm = np.linalg.norm(t-t_old)
Pure Python optimizations
This won't bring all that much, but it is the easiest.
Eliminate dead code. In the inner loop, d[i] is set to zero and then added to something else in two places. Adding 0 doesn't change anything, so you can remove d[i] altogether.
Calculate things only once. k[i,j], 0.5*dx and 0.5*dy are each used four times, so compute them once and assign them to local variables.
Remove unnecessary array accesses. In the inner loop, only five elements of the a matrix are calculated and used, so replace those matrix elements with local variables a1 up to and including a5.
The code now looks like this:
while (norm > eps):
    # old value
    t_old = np.copy(t)
    # new value
    px = 0.5*dx
    py = 0.5*dy
    for i in xrange(1,n-1):
        for j in xrange(1,m-1):
            q = k[i,j]
            a1 = (px/q + px/k[i+1,j])
            a2 = (px/q + px/k[i-1,j])
            a3 = (py/q + py/k[i,j+1])
            a4 = (py/q + py/k[i,j-1])
            a5 = a1 + a2 + a3 + a4
            sum = a1*t[i+1,j] + a2*t[i-1,j] + a3*t[i,j+1] + a4*t[i,j-1]
            t[i,j] = sum / a5
            k[i,j] = k_func(t[i,j])
    # matrix 2nd norm
    norm = np.linalg.norm(t-t_old)
Since your example doesn't give complete working code, I cannot measure the effects.
However, looping in Python is relatively inefficient. For good performance in pure Python it is better to use list comprehensions instead of loops. That is because in comprehensions the looping is done in the Python runtime in C, instead of in Python bytecode. But since we're already dealing with numpy arrays here, I will not expand on this.
Recode your algorithm to use numpy instead of loops
The basic idea behind numpy is that it has optimized routines (written in C or Fortran) for array operations. So for operating on arrays you should use numpy functions instead of loops!
Your loop consists mostly of filling a matrix with values derived from another matrix shifted one column or row. For that you could do something like this.
In this example I'll be shifting k one row down:
In [1]: import numpy as np

In [2]: k = np.arange(1, 26).reshape([5, 5])

In [3]: k
Out[3]:
array([[ 1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10],
       [11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20],
       [21, 22, 23, 24, 25]])

In [4]: dx = 0.27

In [5]: 0.5*dx/k[1:, :]
Out[5]:
array([[0.0225    , 0.01928571, 0.016875  , 0.015     , 0.0135    ],
       [0.01227273, 0.01125   , 0.01038462, 0.00964286, 0.009     ],
       [0.0084375 , 0.00794118, 0.0075    , 0.00710526, 0.00675   ],
       [0.00642857, 0.00613636, 0.00586957, 0.005625  , 0.0054    ]])

In [6]: np.insert(0.5*dx/k[1:, :], 0, 0, axis=0)
Out[6]:
array([[0.        , 0.        , 0.        , 0.        , 0.        ],
       [0.0225    , 0.01928571, 0.016875  , 0.015     , 0.0135    ],
       [0.01227273, 0.01125   , 0.01038462, 0.00964286, 0.009     ],
       [0.0084375 , 0.00794118, 0.0075    , 0.00710526, 0.00675   ],
       [0.00642857, 0.00613636, 0.00586957, 0.005625  , 0.0054    ]])
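Putting it all together, the whole update can be expressed with array slices instead of the double loop. This is only a sketch under two assumptions that are not in your snippet: k_func must accept arrays, and updating all of t at once turns the original Gauss-Seidel-style sweep (which uses already-updated neighbours) into a Jacobi-style sweep, which may need more iterations to reach the same eps.

import numpy as np

def solve_vectorized(t, k, dx, dy, eps, k_func):
    norm = np.inf
    while norm > eps:
        t_old = t.copy()
        # coefficients coupling each interior cell to its four neighbours
        a_ip = 0.5*dx/k[1:-1, 1:-1] + 0.5*dx/k[2:, 1:-1]     # i+1 neighbour
        a_im = 0.5*dx/k[1:-1, 1:-1] + 0.5*dx/k[:-2, 1:-1]    # i-1 neighbour
        a_jp = 0.5*dy/k[1:-1, 1:-1] + 0.5*dy/k[1:-1, 2:]     # j+1 neighbour
        a_jm = 0.5*dy/k[1:-1, 1:-1] + 0.5*dy/k[1:-1, :-2]    # j-1 neighbour
        a_c = a_ip + a_im + a_jp + a_jm
        s = (a_ip*t[2:, 1:-1] + a_im*t[:-2, 1:-1]
             + a_jp*t[1:-1, 2:] + a_jm*t[1:-1, :-2])
        t[1:-1, 1:-1] = s / a_c
        k[1:-1, 1:-1] = k_func(t[1:-1, 1:-1])   # assumes k_func is vectorized
        # matrix 2nd norm
        norm = np.linalg.norm(t - t_old)
    return t

All the work now happens in compiled numpy routines, so this should land much closer to the Fortran timing than the explicit Python loops.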

The components of numpy.gradient of a symmetric function are different

The gradient of a symmetric function should have same derivatives in all dimensions.
numpy.gradient is providing different components.
Here is a MWE.
import numpy as np
x = (-1,0,1)
y = (-1,0,1)
X,Y = np.meshgrid(x,y)
f = 1/(X*X + Y*Y +1.0)
print(f)
>> [[0.33333333 0.5 0.33333333]
[0.5 1. 0.5 ]
[0.33333333 0.5 0.33333333]]
This has same values in both dimensions.
But np.gradient(f) gives
[array([[ 0.16666667, 0.5 , 0.16666667],
[ 0. , 0. , 0. ],
[-0.16666667, -0.5 , -0.16666667]]),
array([[ 0.16666667, 0. , -0.16666667],
[ 0.5 , 0. , -0.5 ],
[ 0.16666667, 0. , -0.16666667]])]
Both the components of the gradient are different.
Why so?
What I am missing in interpretation of the output?
Let's walk through this step by step. First, as correctly mentioned by meowgoesthedog:
numpy calculates derivatives in a direction.
Numpy's way of calculating gradients
It's important to note that np.gradient uses central differences, meaning (for simplicity we look at just one direction):
grad_f[i] = (f[i+1] - f[i])/2 + (f[i] - f[i-1])/2 = (f[i+1] - f[i-1])/2
At the boundary numpy calculates (take the min as example)
grad_f[min] = f[min+1] - f[min]
grad_f[max] = f[max] - f[max-1]
In your case the boundary is 0 and 2.
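A quick 1D illustration of these formulas (my own example, with unit spacing):

import numpy as np

f = np.array([1.0, 2.0, 4.0, 8.0])
np.gradient(f)
# array([1. , 1.5, 3. , 4. ])
# i.e. [f[1]-f[0], (f[2]-f[0])/2, (f[3]-f[1])/2, f[3]-f[2]]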
2D case
If you use more than one dimension, we need to take the direction of the derivative into account. np.gradient calculates the derivatives in all possible directions. Let's reproduce your results:
Let's move alongside the columns, so we calculate with row vectors
f[1,:] - f[0,:]
Output
array([0.16666667, 0.5 , 0.16666667])
which is exactly the first row of the first element of your gradient.
The middle row is calculated with centered differences, therefore:
(f[2,:]-f[1,:])/2 + (f[1,:]-f[0,:])/2
Output
array([0., 0., 0.])
The third row:
f[2,:] - f[1,:]
Output
array([-0.16666667, -0.5 , -0.16666667])
For the other direction just exchange the : and the numbers, and keep in mind that you are now calculating column vectors. This leads directly to the transposed derivative in the case of a symmetric function, like in your case.
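The whole first gradient component can be checked in one line (a small verification I added, using f from above):

np.allclose(np.gradient(f)[0],
            np.stack([f[1, :] - f[0, :],
                      (f[2, :] - f[0, :]) / 2,
                      f[2, :] - f[1, :]]))
# True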
3D case
x_ = (-1,0,4)
y_ = (-3,0,1)
z_ = (-1,0,12)
x, y, z = np.meshgrid(x_, y_, z_, indexing='ij')
f = 1/(x**2 + y**2 + z**2 + 1)
np.gradient(f)[1]
Output
array([[[ *2.50000000e-01, 4.09090909e-01, 3.97702165e-04*],
[ 8.33333333e-02, 1.21212121e-01, 1.75554093e-04],
[-8.33333333e-02, -1.66666667e-01, -4.65939801e-05]],
[[ **4.09090909e-01, 9.00000000e-01, 4.03045231e-04**],
[ 1.21212121e-01, 2.00000000e-01, 1.77904287e-04],
[-1.66666667e-01, -5.00000000e-01, -4.72366556e-05]],
[[ ***1.85185185e-02, 2.03619910e-02, 3.28827183e-04***],
[ 7.79727096e-03, 8.54700855e-03, 1.45243282e-04],
[-2.92397661e-03, -3.26797386e-03, -3.83406181e-05]]])
The gradient which is given here is calculated along rows (0 would be along matrices, 1 along rows, 2 along columns).
This can be calculated by
(f[:,1,:] - f[:,0,:])
Output
array([[*2.50000000e-01, 4.09090909e-01, 3.97702165e-04*],
[**4.09090909e-01, 9.00000000e-01, 4.03045231e-04**],
[***1.85185185e-02, 2.03619910e-02, 3.28827183e-04***]])
I added the asterisks so that it becomes clear where to find the corresponding row vectors. Since we calculated the gradient along direction 1 we have to look for row vectors.
If one wants to reproduce the whole gradient, this is done by
np.stack(((f[:,1,:] - f[:,0,:]), (f[:,2,:] - f[:,0,:])/2, (f[:,2,:] - f[:,1,:])), axis=1)
n-dim case
We can generalize what we have learned here to calculate gradients of arbitrary functions along directions.
def grad_along_axis(f, ax):
    f_grad_ind = []
    for i in range(f.shape[ax]):
        if i == 0:
            f_grad_ind.append(np.take(f, i+1, ax) - np.take(f, i, ax))
        elif i == f.shape[ax] - 1:
            f_grad_ind.append(np.take(f, i, ax) - np.take(f, i-1, ax))
        else:
            f_grad_ind.append((np.take(f, i+1, ax) - np.take(f, i-1, ax))/2)
    f_grad = np.stack(f_grad_ind, axis=ax)
    return f_grad
where
np.take(f, i, ax) = f[:,...,i,...,:]
and i is at index ax.
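A quick check of this helper against numpy itself (using f from the 3D example above, unit spacing assumed):

np.allclose(grad_along_axis(f, 1), np.gradient(f)[1])
# True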
Usually gradients and jacobians are operators on functions
If you need the gradient of f = 1/(X*X + Y*Y +1.0) then you have to compute it symbolically, or estimate it with numerical methods that use that function.
I do not know what a gradient of a constant 3d array is. numpy.gradient is a one dimensional concept.
Python has the sympy package that can automatically compute jacobians symbolically.
If by second order derivative of a scalar 3d field you mean a laplacian then you can estimate that with a standard 4 point stencil.
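For example, a minimal sympy sketch for the symbolic gradient (my own illustration, symbols chosen arbitrarily):

import sympy as sp

X, Y = sp.symbols('X Y')
f = 1 / (X*X + Y*Y + 1.0)
grad = [sp.diff(f, v) for v in (X, Y)]
# [-2.0*X/(X**2 + Y**2 + 1.0)**2, -2.0*Y/(X**2 + Y**2 + 1.0)**2]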

Filtering coordinates based on distance from a point

I have two arrays say:
A = np.array([[ 1. , 1. , 0.5 ],
[ 2. , 2. , 0.7 ],
[ 3. , 4. , 1.2 ],
[ 4. , 3. , 2.33],
[ 1. , 2. , 0.5 ],
[ 6. , 5. , 0.3 ],
[ 4. , 5. , 1.2 ],
[ 5. , 5. , 1.5 ]])
B = np.array([2,1])
I would want to find all values of A which are not within a radius of 2 from B.
My answer should be:
C = [[3,4,1.2],[4,3,2.33],[6,5,0.3],[4,5,1.2],[5,5,1.5]]
Is there a pythonic way to do this?
What I have tried is:
radius = 2
C.append(np.extract((cdist(A[:, :2], B[np.newaxis]) > radius), A))
But I realized that np.extract flattens A, and I don't get what I expected.
Let R be the radius here. We have a few methods to solve it, as discussed next.
Approach #1 : Using cdist -
from scipy.spatial.distance import cdist
A[(cdist(A[:,:2],B[None]) > R).ravel()]
Approach #2 : Using np.einsum -
d = A[:,:2] - B
out = A[np.einsum('ij,ij->i', d,d) > R**2]
Approach #3 : Using np.linalg.norm -
A[np.linalg.norm(A[:,:2] - B, axis=1) > R]
Approach #4 : Using matrix-multiplication with np.dot -
A[(A[:,:2]**2).sum(1) + (B**2).sum() - 2*A[:,:2].dot(B) > R**2]
Approach #5 : Using a combination of einsum and matrix-multiplication -
A[np.einsum('ij,ij->i',A[:,:2],A[:,:2]) + B.dot(B) - 2*A[:,:2].dot(B) > R**2]
Approach #6 : Using broadcasting -
A[((A[:,:2] - B)**2).sum(1) > R**2]
Hence, to get the points within radius R simply replace > with < in the above mentioned solutions.
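As a quick sanity check (not part of the approaches above), the masks agree and reproduce the expected C on the sample data:

import numpy as np
from scipy.spatial.distance import cdist

R = 2
m1 = (cdist(A[:, :2], B[None]) > R).ravel()
m2 = np.linalg.norm(A[:, :2] - B, axis=1) > R
m3 = ((A[:, :2] - B)**2).sum(1) > R**2
assert (m1 == m2).all() and (m2 == m3).all()
print(A[m1])   # the rows listed as C in the question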
Another useful approach not mentioned by @Divakar is to use a cKDTree:
from scipy.spatial import cKDTree
# Find indices of points within radius
radius = 2
indices = cKDTree(A[:, :2]).query_ball_point(B, radius)
# Construct a mask over these points
mask = np.zeros(len(A), dtype=bool)
mask[indices] = True
# Extract values not among the nearest neighbors
A[~mask]
The primary benefit is that it will be much faster than any direct approach as the size of the array increases, because the data structure avoids computing a distance for every point in A.

Vectorized arange using np.einsum for raycast

I have a D dimensional point and vector, p and v, respectively, a positive number n, and a resolution.
I want to get all points after successively adding vector v*resolution to point p n/resolution times.
Example
p = np.array([3, 5])
v = np.array([-1.5, 3])
n = 10
resolution = 1.5
result:
[[ 3. , 5. ],
[ 0.75, 9.5 ],
[ -1.5 , 14. ],
[ -3.75, 18.5 ],
[ -6. , 23. ],
[ -8.25, 27.5 ],
[-10.5 , 32. ]]
My current approach is to tile the range, given by n and the resolution, along the dimension D, multiply that by v, and add p.
def getPoints(p, v, n, resolution=1.):
    dRange = np.tile(np.arange(0, n, resolution), (v.shape[0],1))
    return np.multiply(v.reshape(-1,1), dRange).T + p
Is there a more direct way to calculate dRange using np.einsum or another method?
Approach #1
Here's one approach leveraging NumPy broadcasting -
np.arange(0, n, resolution)[:,None] * v + p
Basically, we extend the range array to 2D, keeping the second axis as a singleton, to let it broadcast for elementwise multiplication against the 1D v, giving us a 2D array. Then, we add p to it.
Approach #2
There isn't any sum-reduction here, so np.einsum or any dot-based function would still work, but it won't lend any help on performance. Let's put it out anyway, as it was mentioned in the question -
np.einsum('i,j->ij',np.arange(0, n, resolution), v) + p
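A quick check (my own addition) that both approaches reproduce the result from the question:

import numpy as np

p = np.array([3, 5])
v = np.array([-1.5, 3])
n, resolution = 10, 1.5

out1 = np.arange(0, n, resolution)[:, None] * v + p
out2 = np.einsum('i,j->ij', np.arange(0, n, resolution), v) + p
print(np.allclose(out1, out2))   # True
print(out1[:2])                  # [[3.    5.  ]
                                 #  [0.75  9.5 ]]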

Scipy - Nan when calculating Mahalanobis distance

When I try to calculate the Mahalanobis distance with the following python code I get some Nan entries in the result. Do you have any insight about why this happens?
My data.shape = (181, 1500)
from scipy.spatial.distance import pdist, squareform
data_log = log2(data + 1) # A log transform that I usually apply to my data
data_centered = data_log - data_log.mean(0) # zero centering
D = squareform( pdist(data_centered, 'mahalanobis' ) )
I also tried:
data_standard = data_centered / data_centered.std(0, ddof=1)
D = squareform( pdist(data_standard, 'mahalanobis' ) )
Also got nans.
The input is not corrupted and other distances, such as correlation distance, can be computed just fine.
For some reason, when I reduce the number of features I stop getting NaNs. E.g. the following examples do not produce any NaNs:
D = squareform( pdist(data_centered[:,:200], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:480], 'mahalanobis' ) )
while these others do get NaNs:
D = squareform( pdist(data_centered[:,:300], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:600], 'mahalanobis' ) )
Any clue? Is this an expected behaviour if some condition for the input is not satisfied?
You have fewer observations than features, so the covariance matrix V computed by the scipy code is singular. The code doesn't check this, and blindly computes the "inverse" of the covariance matrix. Because this numerically computed inverse is basically garbage, the product (x-y)*inv(V)*(x-y) (where x and y are observations) might turn out to be negative. Then the square root of that value results in nan.
For example, this array also results in a nan:
In [265]: x
Out[265]:
array([[-1. , 0.5, 1. , 2. , 2. ],
[ 2. , 1. , 2.5, -1.5, 1. ],
[ 1.5, -0.5, 1. , 2. , 2.5]])
In [266]: squareform(pdist(x, 'mahalanobis'))
Out[266]:
array([[ 0. , nan, 1.90394328],
[ nan, 0. , nan],
[ 1.90394328, nan, 0. ]])
Here's the Mahalanobis calculation done "by hand":
In [279]: V = np.cov(x.T)
In theory, V is singular; the following value is effectively 0:
In [280]: np.linalg.det(V)
Out[280]: -2.968550671342364e-47
But inv doesn't see the problem, and returns an inverse:
In [281]: VI = np.linalg.inv(V)
Let's compute the distance between x[0] and x[2] and verify that we get the same non-nan value (1.9039) returned by pdist when we use VI:
In [295]: delta = x[0] - x[2]
In [296]: np.dot(np.dot(delta, VI), delta)
Out[296]: 3.625
In [297]: np.sqrt(np.dot(np.dot(delta, VI), delta))
Out[297]: 1.9039432764659772
Here's what happens when we try to compute the distance between x[0] and x[1]:
In [300]: delta = x[0] - x[1]
In [301]: np.dot(np.dot(delta, VI), delta)
Out[301]: -1.75
Then the square root of that value gives nan.
In scipy 0.16 (to be released in June 2015), you will get an error instead of nan or garbage. The error message describes the problem:
In [4]: x = array([[-1. , 0.5, 1. , 2. , 2. ],
...: [ 2. , 1. , 2.5, -1.5, 1. ],
...: [ 1.5, -0.5, 1. , 2. , 2.5]])
In [5]: pdist(x, 'mahalanobis')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-a3453ff6fe48> in <module>()
----> 1 pdist(x, 'mahalanobis')
/Users/warren/local_scipy/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1298 "singular. For observations with %d "
1299 "dimensions, at least %d observations "
-> 1300 "are required." % (m, n, n + 1))
1301 V = np.atleast_2d(np.cov(X.T))
1302 VI = _convert_to_double(np.linalg.inv(V).T.copy())
ValueError: The number of observations (3) is too small; the covariance matrix is singular. For observations with 5 dimensions, at least 6 observations are required.
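If you still need a Mahalanobis-like distance despite the singular covariance, one possible workaround (my own sketch, not part of the explanation above) is to pass a pseudo-inverse explicitly via pdist's VI argument; since the pseudo-inverse of a covariance matrix is positive semidefinite, the quadratic forms stay nonnegative and no NaNs appear:

import numpy as np
from scipy.spatial.distance import pdist, squareform

VI = np.linalg.pinv(np.cov(x.T))                 # pseudo-inverse instead of inv
D = squareform(pdist(x, 'mahalanobis', VI=VI))   # no NaNs, but interpret with care

Whether distances computed this way are meaningful for your data is a separate, statistical question.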
