I need to detect which spheres are connected to each other. If we have:
radii = np.array([2, 1, 1, 2, 2, 0.5])
poss = np.array([[7, 7, 7], [7.5, 8.5, 6], [0, 0, 0], [-1, -2, -1], [1, 1, 1], [2, 1, 3]])
I want a Boolean array (shape = (number of groups, number of spheres)), or a list of arrays/lists of indices, that shows which of the spheres are connected. So, the expected result for this example should be something like:
Boolean_array = np.array([[1, 1, 0, 0, 0, 0], [0, 0, 1, 1, 1, 1]], dtype=bool)
object_array = np.array([[0, 1], [2, 3, 4, 5]])
I tried to find a solution with networkx (I'm not very familiar with it) and I don't know if this library can help when the spheres have different radii. I guess the ends_ind array returned by my previous code could be helpful in this regard, and I tried to use it as:
G = nx.Graph([*ends_ind])
L = [nx.node_connected_component(G, 0)]
for i in range(len(radii)):
    iter = 0
    for j in L:
        if i in j:
            iter += 1
    if iter == 0:
        L.append(nx.node_connected_component(G, i))
This does not work. The error:
Traceback (most recent call last):
File "C:/Users/Ali/Desktop/check_2.py", line 31, in <module>
L.append(nx.node_connected_component(G, i))
File "<class 'networkx.utils.decorators.argmap'> compilation 8", line 4, in argmap_node_connected_component_5
File "C:\Users\Ali\anaconda3\envs\PFC_FiPy\lib\site-packages\networkx\algorithms\components\connected.py", line 185, in node_connected_component
return _plain_bfs(G, n)
File "C:\Users\Ali\anaconda3\envs\PFC_FiPy\lib\site-packages\networkx\algorithms\components\connected.py", line 199, in _plain_bfs
nextlevel.update(G_adj[v])
File "C:\Users\Ali\anaconda3\envs\PFC_FiPy\lib\site-packages\networkx\classes\coreviews.py", line 82, in __getitem__
return AtlasView(self._atlas[name])
KeyError: 11
Since using my previous code with other libraries would be inefficient (even if it could solve the issue), I am looking for any library, e.g. networkx, or method that can do this more efficiently, if possible.
What is the best way to get my expected results, particularly for a large number of spheres (~100000)?
You're trying to use networkx too early here. First, you should calculate the geometrical distances for each pair of spheres. A useful trick for this is:
xyz_distances = poss.reshape(6, 1, 3) - poss.reshape(1, 6, 3)
distances = np.linalg.norm(xyz_distances, axis=2)
This gets you a symmetric 6x6 array of the Euclidean distances between the sphere centers. Now we need to compare them against the maximum distances at which two spheres can still touch. That is just the sum of radii for each pair of spheres, once again a 6x6 array, which we can calculate as
maximum_distances = radii.reshape(6, 1) + radii.reshape(1, 6)
And now we can compare the two:
>>> connections = distances < maximum_distances
>>> connections
array([[ True, True, False, False, False, False],
[ True, True, False, False, False, False],
[False, False, True, True, True, False],
[False, False, True, True, False, False],
[False, False, True, False, True, True],
[False, False, False, False, True, True]])
This translates to two groups, just like you wanted, and you can get your second expected array via
>>> G = nx.Graph(connections)
>>> list(nx.connected_components(G))
[{0, 1}, {2, 3, 4, 5}]
Note that this whole thing is going to scale as N^2 in the number of spheres, and you might need to optimize that somehow (say, via scipy.spatial.cKDTree).
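For illustration, a minimal sketch of that kd-tree optimization (my addition, not part of the original answer): query_pairs with the largest possible contact distance returns candidate pairs, which are then filtered by the actual radius sums before building the graph.

import numpy as np
import networkx as nx
from scipy.spatial import cKDTree

def connected_spheres_kdtree(radii, poss):
    # any touching pair is at most 2 * max(radii) apart, so use that as the search radius
    tree = cKDTree(poss)
    candidates = tree.query_pairs(r=2 * radii.max(), output_type='ndarray')
    # keep only the pairs whose center distance is below the sum of their radii
    d = np.linalg.norm(poss[candidates[:, 0]] - poss[candidates[:, 1]], axis=1)
    touching = candidates[d < radii[candidates[:, 0]] + radii[candidates[:, 1]]]
    G = nx.Graph()
    G.add_nodes_from(range(len(radii)))  # keep isolated spheres as their own groups
    G.add_edges_from(touching.tolist())
    return list(nx.connected_components(G))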
In one of my tests on 18000 spheres, NumPy's linalg approach ran out of memory, but SciPy's cdist was more memory efficient and worked [ref 1]. It seems the calculations can be limited to just the upper triangle (above the diagonal) of the arrays, which can be more efficient in terms of memory usage and run time. Thanks to Dominik's answer, we can do this with the Numba accelerator in parallelized no-python mode:
import numpy as np
import numba as nb
from scipy.spatial.distance import cdist
import networkx as nx


def distances_linalg(radii, poss):
    xyz_distances = poss.reshape(radii.shape[0], 1, 3) - poss.reshape(1, radii.shape[0], 3)
    return radii.reshape(radii.shape[0], 1) + radii.reshape(1, radii.shape[0]), np.linalg.norm(xyz_distances, axis=2)


def distances_cdist(radii, poss):
    return radii.reshape(radii.shape[0], 1) + radii.reshape(1, radii.shape[0]), cdist(poss, poss)


@nb.njit("(Tuple([float64[:, ::1], float64[:, ::1]]))(float64[::1], float64[:, ::1])", parallel=True)
def distances_numba(radii, poss):
    radii_arr = np.zeros((radii.shape[0], radii.shape[0]), dtype=np.float64)
    poss_arr = np.zeros((poss.shape[0], poss.shape[0]), dtype=np.float64)
    # fill only the upper triangle; the lower triangle stays zero
    for i in nb.prange(radii.shape[0] - 1):
        for j in range(i + 1, radii.shape[0]):
            radii_arr[i, j] = radii[i] + radii[j]
            poss_arr[i, j] = ((poss[i, 0] - poss[j, 0]) ** 2 +
                              (poss[i, 1] - poss[j, 1]) ** 2 +
                              (poss[i, 2] - poss[j, 2]) ** 2) ** 0.5
    return radii_arr, poss_arr


def connected_spheres(radii, poss, method=distances_numba):
    maximum_distances, distances = method(radii, poss)
    connections = distances < maximum_distances
    G = nx.Graph(connections)
    return list(nx.connected_components(G))
# numba radii cdist or linalg radii
# [[0. 3. 3. 4. 4. 2.5] [[4. 3. 3. 4. 4. 2.5]
# [0. 0. 2. 3. 3. 1.5] [3. 2. 2. 3. 3. 1.5]
# [0. 0. 0. 3. 3. 1.5] [3. 2. 2. 3. 3. 1.5]
# [0. 0. 0. 0. 4. 2.5] [4. 3. 3. 4. 4. 2.5]
# [0. 0. 0. 0. 0. 2.5] [4. 3. 3. 4. 4. 2.5]
# [0. 0. 0. 0. 0. 0. ]] [2.5 1.5 1.5 2.5 2.5 1. ]]
# numba poss
# [[ 0. 1.87082869 12.12435565 14.45683229 10.39230485 8.77496439]
# [ 0. 0. 12.82575534 15.21512405 11.11305539 9.77241014]
# [ 0. 0. 0. 2.44948974 1.73205081 3.74165739]
# [ 0. 0. 0. 0. 4.12310563 5.83095189]
# [ 0. 0. 0. 0. 0. 2.23606798]
# [ 0. 0. 0. 0. 0. 0. ]]
# cdist or linalg poss
# [[ 0. 1.87082869 12.12435565 14.45683229 10.39230485 8.77496439]
# [ 1.87082869 0. 12.82575534 15.21512405 11.11305539 9.77241014]
# [12.12435565 12.82575534 0. 2.44948974 1.73205081 3.74165739]
# [14.45683229 15.21512405 2.44948974 0. 4.12310563 5.83095189]
# [10.39230485 11.11305539 1.73205081 4.12310563 0. 2.23606798]
# [ 8.77496439 9.77241014 3.74165739 5.83095189 2.23606798 0. ]]
In my test on 18000 spheres this was at least 2 times faster than cdist. I think Numba will be very helpful for avoiding memory problems on large arrays compared to cdist.
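For reference, a quick usage sketch with the example data from the question (my addition; the explicit float64 dtypes are needed to match the njit signature above):

radii = np.array([2, 1, 1, 2, 2, 0.5], dtype=np.float64)
poss = np.array([[7, 7, 7], [7.5, 8.5, 6], [0, 0, 0],
                 [-1, -2, -1], [1, 1, 1], [2, 1, 3]], dtype=np.float64)

print(connected_spheres(radii, poss))                          # Numba version
print(connected_spheres(radii, poss, method=distances_cdist))  # cdist version
# [{0, 1}, {2, 3, 4, 5}] in both cases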
Solution 2:
We can write distances_numba based on an improved Numba implementation of cdist. In this solution I modified that code so that it works only on the upper triangle of the arrays:
#nb.njit("float64[:, ::1](float64[:, ::1])", parallel=True)
def dot_triu(poss):
assert poss.shape[1] == 3
poss_T = poss.T
dot = np.zeros((poss.shape[0], poss.shape[0]), dtype=poss.dtype)
for i in nb.prange(poss.shape[0] - 1):
for j in range(i + 1, poss.shape[0]):
dot[i, j] = poss[i, 0] * poss_T[0, j] + poss[i, 1] * poss_T[1, j] + poss[i, 2] * poss_T[2, j]
return dot
#nb.njit("float64[::1](float64[:, ::1])", parallel=True)
def poss_(poss):
TMP_A = np.zeros(poss.shape[0], dtype=np.float64)
for i in nb.prange(poss.shape[0]):
for j in range(poss.shape[1]):
TMP_A[i] += poss[i, j] ** 2
return TMP_A
#nb.njit("(Tuple([float64[:, ::1], float64[:, ::1]]))(float64[::1], float64[:, ::1])", parallel=True)
def distances_numba(radii, poss):
poss_arr = dot_triu(poss)
TMP_A = poss_(poss)
radii_arr = np.zeros((radii.shape[0], radii.shape[0]), dtype=np.float64)
for i in nb.prange(poss.shape[0] - 1):
for j in range(i + 1, poss.shape[0]):
radii_arr[i, j] = radii[i] + radii[j]
poss_arr[i, j] = (-2. * poss_arr[i, j] + TMP_A[i] + TMP_A[j]) ** 0.5
return radii_arr, poss_arr
Related
I'm implementing the Nearest Centroid Classification algorithm and I'm kind of blocked on how to use numpy.mean in my case.
So suppose I have some spherical datasets X:
[[ 0.39151059 3.48203037]
[-0.68677876 1.45377717]
[ 2.30803493 4.19341503]
[ 0.50395297 2.87076658]
[ 0.06677012 3.23265678]
[-0.24135103 3.78044279]
[-0.05660036 2.37695381]
[ 0.74210998 -3.2654815 ]
[ 0.05815341 -2.41905942]
[ 0.72126958 -1.71081388]
[ 1.03581142 -4.09666955]
[ 0.23209714 -1.86675298]
[-0.49136284 -1.55736028]
[ 0.00654881 -2.22505305]]
and the labeled vector Y:
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
An example with 100 2D data points gives the following result (scatter plot not shown).
The NCC algorithm consists of first calculating the class mean of each class (0 and 1: that's blue and red) and then assigning each new data point to the nearest class centroid.
This is my current function:
def mean_ncc(X, Y):
    # find unique classes
    m_cids = np.unique(Y)  # [0. 1.]
    # compute class means
    mu = np.zeros((len(m_cids), X.shape[1]))  # [[0. 0.] [0. 0.]] when Y has 2 unique labels (0 and 1)
    for class_idx, class_label in enumerate(m_cids):
        mu[class_idx, :] = # problem here
    return mu
So here I want an array containing the class means of the '0' (blue) points and the '1' (red) points.
How can I select just the elements of X over which I want to calculate each mean?
I would like to do something like this:
for class_idx, class_label in enumerate(m_cids):
    mu[class_idx, :] = np.mean(X[only the elements that have the same class_label], axis=0)
Is it possible or is there another way to implement this?
You could use something like this:
import numpy as np
tags = [0, 0, 1, 1, 0, 1]
values = [5, 4, 2, 5, 9, 8]
tags_np = np.array(tags)
values_np = np.array(values)
print(values_np[tags_np == 1].mean())
EDIT: You will surely need to look more into the axis parameter for the mean function:
import numpy as np
values = [[5, 4],
[5, 4],
[4, 3],
[4, 3]]
values_np = np.array(values)
tags_np = np.array([0, 0, 1, 1])
print(values_np[tags_np == 0].mean(axis=0))
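Applied to the mean_ncc function from the question, that boolean-mask idea would look roughly like this (a sketch under the assumption that X is an (n, d) array and Y holds one label per row):

import numpy as np

def mean_ncc(X, Y):
    # one row of class means per unique label in Y
    m_cids = np.unique(Y)
    mu = np.zeros((len(m_cids), X.shape[1]))
    for class_idx, class_label in enumerate(m_cids):
        # the boolean mask selects only the rows of X belonging to this class
        mu[class_idx, :] = X[Y == class_label].mean(axis=0)
    return mu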
I have this for loop that I need to vectorize. The code below works, but takes a lot of time (this is a simplified example, the full version will have about 1e6 rows in col_ids). Can someone give me an idea how to vectorize this code to get rid of the loop? If it matters, the col_ids are fixed (will be the same every time the code is run), while the values will change.
values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1]])
result = np.zeros((4,3))
for idx, col_idx in enumerate(col_ids):
    result[np.arange(4), col_idx] += values[idx]
Result:
[[5.8 0. 0. ]
[5.8 0. 0. ]
[3.5 2.3 0. ]
[1.5 4.3 0. ]]
Update:
I am adding a second example, as there was some ambiguity in the dimensions of my first example. Only values and col_ids are updated; everything else is as in the first example. (I keep the first one, since it is referred to in the answers.)
values = np.array([1.5, 2, 5, 20, 50])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1], [0,0,1,2], [0,1,2,2]])
Result:
[[78.5 0. 0. ]
[28.5 50. 0. ]
[ 3.5 25. 50. ]
[ 1.5 7. 70. ]]
So result is m x n, col_ids is k x m and values has length k. Both m and n are small (m=4, n=3), k is large (about 1e6 in full example)
You can vectorize the loop, but this creates a large intermediate array, which makes it much slower for larger data (starting from a result of shape (50, 50)).
import numpy as np
values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1]])
(np.equal.outer(col_ids, np.arange(len(values))) * values[:,None,None]).sum(0)
# for a fixed result shape (4,3)
# (np.equal.outer(col_ids, np.arange(3)) * values[:,None,None]).sum(0)
Output
array([[5.8, 0. , 0. ],
[5.8, 0. , 0. ],
[3.5, 2.3, 0. ],
[1.5, 4.3, 0. ]])
The only reliably faster solution I could find is numba (using version 0.55.1). I thought this implementation would benefit from parallel execution, but I couldn't get any speed up on a 2-core colab instance.
import numpy as np
import numba as nb

@nb.njit(parallel=False)  # Try parallel=True for multi-threaded execution, no speed up in my benchmarks
def fill(val, ids):
    res = np.zeros(ids.shape[::-1])
    for i in nb.prange(len(res)):
        for j in range(res.shape[1]):
            res[i, ids[j, i]] += val[j]
    return res

fill(values, col_ids)
Output
array([[5.8, 0. , 0. ],
[5.8, 0. , 0. ],
[3.5, 2.3, 0. ],
[1.5, 4.3, 0. ]])
For a fixed result shape (4,3) with suitable input.
@nb.njit(boundscheck=True)  # ~1.25x slower, but much safer
def fill(val, ids):
    res = np.zeros((4, 3))
    for i in nb.prange(ids.shape[0]):
        for j in range(ids.shape[1]):
            res[j, ids[i, j]] += val[i]
    return res

fill(values, col_ids)
Output for the updated example data
array([[78.5, 0. , 0. ],
[28.5, 50. , 0. ],
[ 3.5, 25. , 50. ],
[ 1.5, 7. , 70. ]])
You can solve this using np.add.at. However, AFAIK, this function does not support 2D arrays, so you need to flatten the arrays, compute the flattened 1D indices, and then call the function:
result = np.zeros((4, 3))
n, m = result.shape
indices = np.tile(np.arange(0, n*m, m), col_ids.shape[0]) + col_ids.ravel()
np.add.at(result.ravel(), indices, np.repeat(values, n))  # in-place
print(result)
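As an aside (my addition, not part of the original answer), the same flattened-index trick can also be expressed with np.bincount, which performs the whole scattered accumulation in one call; a minimal sketch under the same assumptions:

import numpy as np

values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1]])
n, m = 4, 3  # result shape, as in the question

flat = np.tile(np.arange(0, n * m, m), col_ids.shape[0]) + col_ids.ravel()
result = np.bincount(flat, weights=np.repeat(values, n), minlength=n * m).reshape(n, m)
print(result)
# [[5.8 0.  0. ]
#  [5.8 0.  0. ]
#  [3.5 2.3 0. ]
#  [1.5 4.3 0. ]]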
I want to convert array a to its natural log (log_e). If the number to be converted is non-positive, the result should be 0:
import numpy as np
a = np.array([-1, 0, 1, 2])
b = np.zeros(len(a))
for i in range(0, len(a)):
    if a[i] <= 0:
        b[i] = 0
    else:
        b[i] = np.log(a[i])
To improve the computing performance, I think the following is better, but then the warning RuntimeWarning: divide by zero encountered in log pops up. How can I get my expected result without this problem?
import numpy as np
a = np.array([0,0,1,2])
b = np.log(a)
Use np.where on a to mask non-positive numbers with 1, then apply np.log:
b = np.log(np.where(a>0, a, 1))
Output:
array([0. , 0. , 0. , 0.69314718])
As a "ufunc", numpy.log accepts the parameters where and out. So an efficient method for your computation is as follows.
In [6]: a = np.array([-1, 0, 1, 2])
Create the output array.
In [7]: b = np.zeros(len(a))
Tell numpy.log to only compute the result where a > 0, and put the output in b. This returns the array given as out, and modifies out (i.e. b) in-place.
In [8]: np.log(a, where=a > 0, out=b)
Out[8]: array([0. , 0. , 0. , 0.69314718])
In [9]: b
Out[9]: array([0. , 0. , 0. , 0.69314718])
Hi, I need to increase the number of points inside a vector in order to enlarge it to a fixed size. For example:
for this simple vector
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> len(a)
# 6
Now I want to get a vector of size 11, taking the vector a as the base; the result would be
# array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
EDIT 1
What I need is a function that takes the base vector and the number of values the resulting vector must have, and returns a new vector of that size. Something like
def enlargeVector(vector, size):
    .....
    return newVector
to use like:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> b = enlargeVector(a, 200)
>>> len(b)
# 200
and b contains the result of linear, cubic, or whatever interpolation method.
There are many methods to do this within scipy.interpolate. My favourite is UnivariateSpline, which produces a spline of degree k that is continuously differentiable up to order k-1.
To use it:
import numpy as np
from scipy.interpolate import UnivariateSpline

old_indices = np.arange(0, len(a))
new_length = 11
new_indices = np.linspace(0, len(a) - 1, new_length)
spl = UnivariateSpline(old_indices, a, k=3, s=0)
new_array = spl(new_indices)
The s is a smoothing factor that you should set to 0 in this case (since the data are exact).
Note that for the problem you have specified (since a just increases monotonically by 1), this is overkill: the np.linspace call above already gives the desired output.
EDIT: clarified that the length is arbitrary
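Wrapped into the function signature asked for in the question (a sketch; the name enlargeVector and the default degree are just assumptions):

import numpy as np
from scipy.interpolate import UnivariateSpline

def enlargeVector(vector, size, k=3):
    # evaluate the spline at `size` evenly spaced positions over the original index range
    old_indices = np.arange(len(vector))
    new_indices = np.linspace(0, len(vector) - 1, size)
    spl = UnivariateSpline(old_indices, vector, k=k, s=0)
    return spl(new_indices)

a = np.array([0, 1, 2, 3, 4, 5])
print(len(enlargeVector(a, 200)))  # 200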
As AGML pointed out there are tools to do this, but how about a pure numpy solution:
In [20]: a = np.arange(6)
In [21]: temp = np.dstack((a[:-1], a[:-1] + np.diff(a) / 2.0)).ravel()
In [22]: temp
Out[22]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
In [23]: np.hstack((temp, [a[-1]]))
Out[23]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
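For arbitrary target lengths with plain linear interpolation, a pure-numpy alternative (my addition, not part of the original answer) is np.interp:

import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])
new_length = 11
b = np.interp(np.linspace(0, len(a) - 1, new_length), np.arange(len(a)), a)
# array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])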
I am trying to write code to give numerical answers to a recurrence relation. The relation itself is simple and is defined as follows. The variable x is an integer
p(i) = p(i+2)/2 + p(i-1)/2 if i > 0 and i < x
p(0) = p(2)/2
p(i) = 1 if i >= x
This is also in this code.
from __future__ import division

def p(i):
    if (i == 0):
        return p(2)/2
    if (i >= x):
        return 1
    return p(i-1)/2 + p(i+2)/2

x = 4
# We would like to print p(0), for example.
This of course doesn't actually let you compute p(0). How can you do this in python?
Is it possible to set up a system of simultaneous equations which numpy.linalg.solve can then solve?
You're right, this can be solved using linear algebra. What I've done below is a simple hard-coded translation. Your equations for p(0) to p(3) are coded up by rearranging them so that the right-hand side is 0. For p(4) and p(5), which appear in the recurrence relations as base cases, there is a 1 on the right-hand side.
-p(0) + p(2)/2 = 0
p(i-1)/2 - p(i) + p(i+2)/2 = 0 for i > 0 and i < x
p(i) = 1 if i >= x
Here is the program hardcoded for n=4
import numpy
a = numpy.array([[-1,   0, 0.5,   0,   0,   0],  # 0
                 [0.5, -1,   0, 0.5,   0,   0],  # 1
                 [0,  0.5,  -1,   0, 0.5,   0],  # 2
                 [0,    0, 0.5,  -1,   0, 0.5],  # 3
                 [0,    0,   0,   0,   1,   0],  # 4
                 [0,    0,   0,   0,   0,   1],  # 5
                 ])
b = numpy.array([0, 0, 0, 0, 1, 1])
# solve ax=b
x = numpy.linalg.solve(a, b)
print(x)
Edit, here is the code which constructs the matrix programmatically, only tested for n=4!
n = 4
# construct a
diag = [-1]*n + [1]*2
lowdiag = [0.5]*(n-1) + [0]*2
updiag = [0.5]*n
a = numpy.diag(diag) + numpy.diag(lowdiag, -1) + numpy.diag(updiag, 2)
# solve ax=b
b = numpy.array([0]*n + [1]*2)
x = numpy.linalg.solve(a, b)
print(a)
print(x[:n])
This outputs
[[-1. 0. 0.5 0. 0. 0. ]
[ 0.5 -1. 0. 0.5 0. 0. ]
[ 0. 0.5 -1. 0. 0.5 0. ]
[ 0. 0. 0.5 -1. 0. 0.5]
[ 0. 0. 0. 0. 1. 0. ]
[ 0. 0. 0. 0. 0. 1. ]]
[ 0.41666667 0.66666667 0.83333333 0.91666667]
which matches the solution in your comment under your question.
This is not an answer to the posted question, but this page is the top Google hit for "solve recurrence relation in Python" so I will write an answer.
If you have a linear recurrence and you want to find the recursive formula, you can use Sympy's find_linear_recurrence function. For example, suppose you have the following sequence: 0, 1, 3, 10, 33, 109, 360, 1189, 3927, 12970. Then the following code produces the recurrence relation:
import sympy
from sympy.abc import n
L = [0, 1, 3, 10, 33, 109, 360, 1189, 3927, 12970]
print(sympy.sequence(L, (n, 1, len(L))).find_linear_recurrence(len(L)))
The output is:
[3, 1]
So you know A(n) = 3*A(n-1) + A(n-2).
The issue here is that you end up in an infinite recursion regardless of where you start, because the recursion isn't explicit, but rather ends up yielding systems of linear equations to solve. If this were a problem you had to solve using Python, I would use Python to calculate the coefficients of this system of equations and use Cramer's rule to solve it.
Edit: Specifically, your unknowns are p(0), ..., p(x-1). One coefficient row vector right off the bat is (1, 0, -1/2, 0, ..., 0) (from p(0)-p(2)/2=0), and all the others are of the form (..., -1/2, 1, 0, -1/2, ...). There are x-1 of these (one for each of p(1), ..., p(x-1)) so the system either has a unique solution or none at all. Intuitively, it seems like there should always be a unique solution.
The last two equations would be unique since they would feature p(x) and p(x+1), so those terms would be omitted; the column vector for the RHS of Cramer's rule would then be (0, 0, ..., 0, 1/2, 1/2), I believe.
Numpy has matrix support.
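A minimal sketch of that construction (my own illustration, assuming x = 4 as in the other answer; Cramer's rule is implemented naively with determinants):

import numpy as np

x = 4
A = np.zeros((x, x))
b = np.zeros(x)
A[0, 0], A[0, 2] = 1.0, -0.5          # p(0) - p(2)/2 = 0
for i in range(1, x):
    A[i, i - 1], A[i, i] = -0.5, 1.0  # -p(i-1)/2 + p(i) ...
    if i + 2 < x:
        A[i, i + 2] = -0.5            # ... - p(i+2)/2 = 0
    else:
        b[i] = 0.5                    # p(i+2) = 1, so 1/2 moves to the RHS

def cramer(A, b):
    # naive Cramer's rule: replace one column at a time with b
    detA = np.linalg.det(A)
    sol = np.empty(len(b))
    for i in range(len(b)):
        Ai = A.copy()
        Ai[:, i] = b
        sol[i] = np.linalg.det(Ai) / detA
    return sol

print(cramer(A, b))  # ~[0.4167 0.6667 0.8333 0.9167], matching the linear-algebra answer above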
I'm confused because your code seems like it should do just that.
def p(i):
    x = 4  # your constant should be defined in-function
    if (i == 0):
        return p(2)/2
    elif (i >= x):
        return 1
    return p(i-1)/2 + p(i+2)/2
The big problem here is your recursion. For p(1) with x = 4 it expands to:
p(0)/2 + p(3)/2
p(2)/4 + p(2)/4 + p(5)/4   # substitute p(0) = p(2)/2 and p(3) = p(2)/2 + p(5)/2
p(1)/4 + p(4)/4 + 1/4      # substitute p(2) = p(1)/2 + p(4)/2 and p(5) = 1
# p(1) shows up again inside its own expansion!
# that's a sure sign of infinite recursion.
What do you EXPECT to be the output?