I have an n x n numpy array that contains all pairwise distances and another 1 x n array that contains some scoring metric.
Example:
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
array([[ 0. , 3.2, 4.1, 8.8, 0.6],
[ 3.2, 0. , 1.5, 9. , 5. ],
[ 4.1, 1.5, 0. , 9.9, 10. ],
[ 8.8, 9. , 9.9, 0. , 1.1],
[ 0.6, 5. , 10. , 1.1, 0. ]])
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
array([ 19. , 1.3, 4.8, 6.2, 5.7])
So, note that the ith element of the score array corresponds to the ith row of the distance array.
What I need to do is vectorize this process:
1. For the ith value in the score array, find all other values that are larger than the ith value and note their indices.
2. Then, in the ith row of the distance array, get all of the distances at the indices noted in step 1 and return the smallest distance.
3. In the case where the ith value in the score array is the largest, the smallest distance is set to the largest distance found in the distance array.
Here is an un-vectorized version:
n = score.shape[0]
min_dist = np.full(n, np.max(dists))
for i in range(n):
    inx = np.where(score > score[i])
    if len(inx[0]) > 0:
        min_dist[i] = np.min(dists[i, inx])
min_dist
array([ 10. , 1.5, 4.1, 8.8, 0.6])
This works, but it is pretty inefficient in terms of speed, and my arrays are expected to be much, much larger. I am hoping to improve the efficiency by using faster vectorized operations to achieve the same result.
Update: Based on Oliver W.'s answer, I came up with my own version that doesn't require making a copy of the distance array:
def new_method(dists, score):
    mask = score > score.reshape(-1, 1)
    return np.ma.masked_array(dists, mask=~mask).min(axis=1).filled(dists.max())
One could in theory make it a one-liner but it's already a bit challenging to read to the untrained eye.
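As a quick sanity check (my addition, reusing dists, score, and the min_dist produced by the loop above), the masked-array version reproduces the loop's result:

np.allclose(new_method(dists, score), min_dist)
# True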
One possible vectorized solution is given below.
import numpy as np
import scipy.spatial.distance
dists = scipy.spatial.distance.squareform(np.array([3.2,4.1,8.8,.6,1.5,9.,5.0,9.9,10.,1.1]))
score = np.array([19., 1.3, 4.8, 6.2, 5.7])
def your_method(dists, score):
    dim = score.shape[0]
    min_dist = np.full(dim, np.max(dists))
    for i in range(dim):
        inx = np.where(score > score[i])
        if len(inx[0]) > 0:
            min_dist[i] = np.min(dists[i, inx])
    return min_dist
def vectorized_method_v1(dists, score):
    mask = score > score.reshape(-1, 1)
    dists2 = dists.copy()  # drop this copy if it is acceptable to modify dists in place
    dists2[np.logical_not(mask)] = dists.max()
    return dists2.min(axis=1)
The speed gain is not so impressive for these small arrays (~factor of 3 on my machine), so I'll demonstrate with a larger set:
dists = scipy.spatial.distance.squareform(np.random.random(50*99))
score = np.random.random(dists.shape[0])
print(dists.shape)
%timeit your_method(dists, score)
%timeit vectorized_method_v1(dists, score)
(100, 100)
100 loops, best of 3: 2.98 ms per loop
10000 loops, best of 3: 125 µs per loop
Which is close to a factor of 24.
I have this for loop that I need to vectorize. The code below works, but takes a lot of time (this is a simplified example, the full version will have about 1e6 rows in col_ids). Can someone give me an idea how to vectorize this code to get rid of the loop? If it matters, the col_ids are fixed (will be the same every time the code is run), while the values will change.
values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1]])
result = np.zeros((4,3))
for idx, col_idx in enumerate(col_ids):
    result[np.arange(4), col_idx] += values[idx]
Result:
[[5.8 0. 0. ]
[5.8 0. 0. ]
[3.5 2.3 0. ]
[1.5 4.3 0. ]]
Update:
I am adding a second example as there was some ambiguity in the dimensions of my first example. Only values and col_ids are updated; everything else is as in the first example. (I keep the first one, since it is referred to in the answers.)
values = np.array([1.5, 2, 5, 20, 50])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1], [0,0,1,2], [0,1,2,2]])
Result:
[[78.5 0. 0. ]
[28.5 50. 0. ]
[ 3.5 25. 50. ]
[ 1.5 7. 70. ]]
So result is m x n, col_ids is k x m and values has length k. Both m and n are small (m=4, n=3), k is large (about 1e6 in full example)
You can vectorize the loop, but the additional intermediate array it creates makes it much slower for larger data (starting from a result of shape (50, 50)).
import numpy as np
values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1]])
(np.equal.outer(col_ids, np.arange(len(values))) * values[:,None,None]).sum(0)
# for a fixed result shape (4,3)
# (np.equal.outer(col_ids, np.arange(3)) * values[:,None,None]).sum(0)
Output
array([[5.8, 0. , 0. ],
[5.8, 0. , 0. ],
[3.5, 2.3, 0. ],
[1.5, 4.3, 0. ]])
The only reliably faster solution I could find is numba (using version 0.55.1). I thought this implementation would benefit from parallel execution, but I couldn't get any speed-up on a 2-core Colab instance.
import numba as nb

@nb.njit(parallel=False)  # Try parallel=True for multi-threaded execution; no speed-up in my benchmarks
def fill(val, ids):
    # res has shape ids.shape[::-1]; for this example that equals the (4, 3) result shape
    res = np.zeros(ids.shape[::-1])
    for i in nb.prange(len(res)):
        for j in range(res.shape[1]):
            res[i, ids[j, i]] += val[j]
    return res

fill(values, col_ids)
Output
array([[5.8, 0. , 0. ],
[5.8, 0. , 0. ],
[3.5, 2.3, 0. ],
[1.5, 4.3, 0. ]])
A variant for a fixed result shape of (4, 3), given suitable input:
@nb.njit(boundscheck=True)  # ~1.25x slower, but much safer
def fill(val, ids):
    res = np.zeros((4, 3))
    for i in nb.prange(ids.shape[0]):
        for j in range(ids.shape[1]):
            res[j, ids[i, j]] += val[i]
    return res

fill(values, col_ids)
Output for the updated example data
array([[78.5, 0. , 0. ],
[28.5, 50. , 0. ],
[ 3.5, 25. , 50. ],
[ 1.5, 7. , 70. ]])
You can solve this using np.add.at. However, AFAIK, this function does not support 2D arrays, so you need to flatten the arrays, compute the 1D flattened indices, and then call the function:
result = np.zeros((4, 3))
n, m = result.shape
indices = np.tile(np.arange(0, n*m, m), col_ids.shape[0]) + col_ids.ravel()
np.add.at(result.ravel(), indices, np.repeat(values, n))  # in-place
print(result)
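For reference (not part of the answer above), here is a small self-contained check that the flattened np.add.at scatter matches the original loop, using values and col_ids from the first example:

import numpy as np

values = np.array([1.5, 2, 2.3])
col_ids = np.array([[0,0,0,0], [0,0,0,1], [0,0,1,1]])

result = np.zeros((4, 3))
n, m = result.shape
indices = np.tile(np.arange(0, n*m, m), col_ids.shape[0]) + col_ids.ravel()
np.add.at(result.ravel(), indices, np.repeat(values, n))

# Reference result from the original loop
expected = np.zeros((4, 3))
for idx, col_idx in enumerate(col_ids):
    expected[np.arange(4), col_idx] += values[idx]

print(np.allclose(result, expected))  # True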
What is the best way to remove the minimal number of elements from a sorted Numpy array so that the minimal distance among the remaining is always bigger than a certain threshold?
For example, if the threshold is 1, the following sequence [0.1, 0.5, 1.1, 2.5, 3.] will become [0.1, 1.1, 2.5]. The 0.5 is removed because it is too close to 0.1 but then 1.1 is preserved because it is far enough from 0.1.
My current code:
import numpy as np
MIN_DISTANCE = 1
a = np.array([0.1, 0.5, 1.1, 2.5, 3.])
for i in range(len(a)-1):
    if a[i+1] - a[i] < MIN_DISTANCE:
        a[i+1] = a[i]
a = np.unique(a)
a
array([0.1, 1.1, 2.5])
Is there a more efficient way to do so?
Note that my question is similar to Remove values from numpy array closer to each other but not exactly the same.
You could use numpy.ufunc.accumulate to iterate through adjacent pairs of the array instead of using the for loop.
The numpy.add.accumulate example or itertools.accumulate probably shows best what it is doing.
Together with numpy.frompyfunc, your condition can be applied as a ufunc (universal function).
Code (with an extended array to cross-check some additional cases, but it works with your array as well):
import numpy as np

MIN_DISTANCE = 1
a = np.array([0.1, 0.5, 0.6, 0.7, 1.1, 2.5, 3., 4., 6., 6.1])
print("original: \n" + str(a))

def my_py_function(arr1, arr2):
    if arr2 - arr1 < MIN_DISTANCE:
        arr2 = arr1
    return arr2

my_np_function = np.frompyfunc(my_py_function, 2, 1)

my_np_function.accumulate(a, dtype=object, out=a).astype(float)
print("complete: \n" + str(a))

a = np.unique(a)
print("unique: \n" + str(a))
Result:
original:
[0.1 0.5 0.6 0.7 1.1 2.5 3. 4. 6. 6.1]
complete:
[0.1 0.1 0.1 0.1 1.1 2.5 2.5 4. 6. 6. ]
unique:
[0.1 1.1 2.5 4. 6. ]
Concerning execution time, timeit shows a crossover at an array length of about 20: your code is much faster (relatively) for your array of length 5, whereas for lengths well above 20 the accumulate option speeds up considerably (about 35% less time at array length 300).
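For completeness, a comparison along those lines could be set up as below; this is a minimal sketch with an arbitrary length-300 test array and illustrative function names, not the exact benchmark behind the numbers above:

import numpy as np

MIN_DISTANCE = 1

def loop_version(a):
    a = a.copy()
    for i in range(len(a) - 1):
        if a[i+1] - a[i] < MIN_DISTANCE:
            a[i+1] = a[i]
    return np.unique(a)

def accumulate_version(a):
    # Keep the previous value whenever the next one is too close
    f = np.frompyfunc(lambda x, y: x if y - x < MIN_DISTANCE else y, 2, 1)
    return np.unique(f.accumulate(a, dtype=object).astype(float))

a = np.sort(np.random.rand(300) * 100)  # sorted test data of length 300
# In IPython:
# %timeit loop_version(a)
# %timeit accumulate_version(a)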
I have 3 NumPy arrays, each of length 107952899.
Let's say:
1. Time = [2.14579526e+08 2.14579626e+08 2.14579726e+08 ...1.10098692e+10 1.10098693e+10]
2. Speed = [0.66 0.66 0.66 .............0.06024864 0.06014756]
3. Brak_press = [0.3, 0.3, 0.3 .............. 0.3, 0.3]
What it means
Each index value in Time corresponds to the same index value in the Speed and Brake arrays.
Time            Speed   Brake
2.14579526e+08  0.66    0.3
...
Requirement
No. 1: I want to find the indices in the Speed array whose values are greater than 20.
No. 2: For those indices, get the corresponding values in the Brake array.
No. 3: Now I want to find the indices of the top N maximum values in the Brake array and store them in another list/array.
So finally, if I take one index from the top N maximum indices and use it in the Brake and Speed arrays, it must hold that
Brake[idx] is a valid value and, more importantly, Speed[idx] is a value greater than 20.
General Summary
Simply put, what I need is to find the indices of the N maximum Brake values whose corresponding Speed values are greater than 20.
What I tried
speed_20 = np.where(Speed > 20)    # I got the indices as a tuple
brake_values = Brake[speed_20]     # the Brake values corresponding to the speed_20 indices
After that I tried argsort/argpartition, but none of the results match my requirement.
Request
I believe there will be a better method to do this. Kindly shed some light.
(I converted the above NumPy arrays to a pandas DataFrame and it works fine, but due to memory concerns I prefer to do this using NumPy operations.)
You are almost there. This should do what you want:
speed_20 = np.where(Speed > 20)[0]   # indices where Speed exceeds 20
sort = np.argsort(-Brake[speed_20])  # order those indices by descending Brake value
result = speed_20[sort[:N]]          # keep the N indices with the largest Brake values
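Since the arrays are very large (~1e8 elements), an alternative worth mentioning (my addition, not part of the answer above) is np.argpartition, which finds the top N without fully sorting the filtered Brake values. A minimal sketch with small illustrative stand-ins for the real arrays:

import numpy as np

Speed = np.array([0.66, 25.0, 30.0, 10.0, 40.0])
Brake = np.array([0.3, 0.7, 0.2, 0.9, 0.5])
N = 2

speed_20 = np.where(Speed > 20)[0]                    # indices with Speed > 20
part = np.argpartition(-Brake[speed_20], N - 1)[:N]   # top N, in arbitrary order
order = np.argsort(-Brake[speed_20][part])            # optional: order the top N
result = speed_20[part[order]]

print(result)         # indices into the original arrays
print(Speed[result])  # all greater than 20
print(Brake[result])  # the N largest Brake values among those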
Maybe this is an option you can consider, using NumPy.
First create a multidimensional matrix (I changed the values so it's easier to follow):
Time = [ 2, 1, 5, 4, 3]
Speed = [ 10, 20, 40, 30, 50]
Brak_press = [0.1, 0.3, 0.5, 0.4, 0.2]
data = np.array([Time, Speed, Brak_press]).transpose()
So data are stored as:
print(data)
# [[ 2. 10. 0.1]
# [ 1. 20. 0.3]
# [ 5. 40. 0.5]
# [ 4. 30. 0.4]
# [ 3. 50. 0.2]]
To extract speed greater than 20:
data[data[:,1] > 20]
# [[ 5. 40. 0.5]
# [ 4. 30. 0.4]
# [ 3. 50. 0.2]]
To get the n greatest Brak_press:
n = 2
data[data[:,2].argsort()[::-1][:n]]
# [[ 5. 40. 0.5]
# [ 4. 30. 0.4]]
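To combine both requirements (the n largest Brak_press values restricted to rows where Speed > 20), the filter can be applied first; a small follow-up sketch, my addition, using data and n as defined above:

filtered = data[data[:,1] > 20]
filtered[filtered[:,2].argsort()[::-1][:n]]
# [[ 5. 40. 0.5]
#  [ 4. 30. 0.4]]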
Say I have two 2D NumPy arrays, mins and maxs, that will always have the same dimensions as one another. I'd like to create a third array, results, that is the result of applying linspace to each corresponding pair of min and max values. Is there some "numpy"/vectorized way to do this? Example non-vectorized code is below to show the result I would like.
import numpy as np
mins = np.random.rand(2,2)
maxs = np.random.rand(2,2)
# Number of elements in the linspace
x = 3
m, n = mins.shape
results = np.zeros((m, n, x))
for i in range(m):
    for j in range(n):
        min = mins[i][j]
        max = maxs[i][j]
        results[i][j] = np.linspace(min, max, num=x)
Here's one vectorized approach, based on this post, to cover generic n-dim cases:
def create_ranges_nd(start, stop, N, endpoint=True):
    if endpoint:
        divisor = N - 1
    else:
        divisor = N
    steps = (1.0 / divisor) * (stop - start)
    return start[..., None] + steps[..., None] * np.arange(N)
Sample run -
In [536]: mins = np.array([[3,5],[2,4]])
In [537]: maxs = np.array([[13,16],[11,12]])
In [538]: create_ranges_nd(mins, maxs, 6)
Out[538]:
array([[[ 3. , 5. , 7. , 9. , 11. , 13. ],
[ 5. , 7.2, 9.4, 11.6, 13.8, 16. ]],
[[ 2. , 3.8, 5.6, 7.4, 9.2, 11. ],
[ 4. , 5.6, 7.2, 8.8, 10.4, 12. ]]])
As of Numpy version 1.16.0, non-scalar start and stop are now supported.
So, now you can do this:
assert np.__version__ >= '1.16.0'  # non-scalar start/stop needs NumPy 1.16+
mins = np.random.rand(2,2)
maxs = np.random.rand(2,2)
# Number of elements in the linspace
x = 3
results = np.linspace(mins, maxs, num=x)
# And, if required
results = np.rollaxis(results, 0, 3)
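As a small aside (my addition, not part of the answer), np.moveaxis expresses the same axis shuffle a bit more readably than np.rollaxis:

results = np.moveaxis(results, 0, -1)  # equivalent to np.rollaxis(results, 0, 3) for this 3-D array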
Hi, I have to increase the number of points inside a vector in order to enlarge it to a fixed size. For example,
for this simple vector:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> len(a)
# 6
Now, I want to get a vector of size 11, taking the vector a as the base; the result will be:
# array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
EDIT 1
What I need is a function that takes the base vector and the number of values the resulting vector must have, and returns a new vector with size equal to that parameter. Something like:
def enlargeVector(vector, size):
    ...
    return newVector
To use it like:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> b = enlargeVector(a, 200)
>>> len(b)
# 200
and b contains the data resulting from linear, cubic, or whatever interpolation method.
There are many methods to do this within scipy.interpolate. My favourite is UnivariateSpline, which produces an order-k spline guaranteed to be differentiable k-1 times.
To use it:
import numpy as np
from scipy.interpolate import UnivariateSpline

old_indices = np.arange(0, len(a))
new_length = 11
new_indices = np.linspace(0, len(a) - 1, new_length)
spl = UnivariateSpline(old_indices, a, k=3, s=0)
new_array = spl(new_indices)
The s is a smoothing factor that you should set to 0 in this case (since the data are exact).
Note that for the problem you have specified (since a just increases monotonically by 1), this is overkill, since the second np.linspace already gives the desired output.
EDIT: clarified that the length is arbitrary
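For the purely linear case, np.interp is another compact option that works for arbitrary input, not just a vector that increases by 1; a small sketch, my addition rather than part of the answer above:

import numpy as np

a = np.array([0, 1, 2, 3, 4, 5])
new_length = 11
new_indices = np.linspace(0, len(a) - 1, new_length)
b = np.interp(new_indices, np.arange(len(a)), a)  # piecewise-linear interpolation
# array([0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])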
As AGML pointed out, there are tools to do this, but how about a pure NumPy solution:
In [20]: a = np.arange(6)
In [21]: temp = np.dstack((a[:-1], a[:-1] + np.diff(a) / 2.0)).ravel()
In [22]: temp
Out[22]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
In [23]: np.hstack((temp, [a[-1]]))
Out[23]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])