Replace values in subarray based upon dynamic condition in Numpy - python

I have a 2D NumPy array in which each row is a subarray of 3 integers. For example:
[ [2, 3, 4], [9, 8, 7], ... [15, 14, 16] ]
For each subarray I want to replace the lowest number with a 1 and all other numbers with a 0. So the desired output from the above example would be:
[ [1, 0, 0], [0, 0, 1], ... [0, 1, 0] ]
This is a large array, so I want to exploit Numpy performance. I know about using conditions to operate on array elements, but how do I do this when the condition is dynamic? In this instance the condition needs to be something like:
newarray = (a == min(a)).astype(int)
But how do I do this across each subarray?

You can pass the axis parameter to compute the minimum of each row, and keepdims=True to keep the result two-dimensional (one column of minimums). Comparing a against that column, a == a.min(1, keepdims=True), then gives True at the position of the minimum in each subarray:
(a == a.min(1, keepdims=True)).astype(int)
#array([[1, 0, 0],
# [0, 0, 1],
# [0, 1, 0]])
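A minimal sketch of why keepdims matters here, using the three rows shown in the question as a small square array. The variable names are mine. Without keepdims the minimums have shape (3,) and broadcast along the wrong axis, so for a square array the comparison runs without error but gives the wrong answer.

```python
import numpy as np

a = np.array([[2, 3, 4], [9, 8, 7], [15, 14, 16]])

mins_flat = a.min(axis=1)                 # shape (3,): broadcasts across columns
mins_col = a.min(axis=1, keepdims=True)   # shape (3, 1): broadcasts down each row

wrong = (a == mins_flat).astype(int)      # compares a[i, j] with the minimum of row j
right = (a == mins_col).astype(int)       # compares a[i, j] with the minimum of row i

print(mins_flat.shape, mins_col.shape)
print(right)
```

For a non-square array the version without keepdims would normally fail with a broadcasting error, which is at least easier to notice.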

How about this?
import numpy as np
a = np.random.random((4,3))
i = np.argmin(a, axis=-1)
out = np.zeros(a.shape, int)
out[np.arange(out.shape[0]), i] = 1
print(a)
print(out)
Sample output:
# [[ 0.58321885 0.18757452 0.92700724]
# [ 0.58082897 0.12929637 0.96686648]
# [ 0.26037634 0.55997658 0.29486454]
# [ 0.60398426 0.72253012 0.22812904]]
# [[0 1 0]
# [0 1 0]
# [1 0 0]
# [0 0 1]]
In the timings below it is roughly 1.3 to 1.8 times faster than the direct approach:
from timeit import timeit
def dense():
    return (a == a.min(1, keepdims=True)).astype(int)
def sparse():
    i = np.argmin(a, axis=-1)
    out = np.zeros(a.shape, int)
    out[np.arange(out.shape[0]), i] = 1
    return out
for shp in ((4,3), (10000,3), (100,10), (100000,1000)):
    a = np.random.random(shp)
    d = timeit(dense, number=40)/40
    s = timeit(sparse, number=40)/40
    print('shape, dense, sparse, ratio', '({:6d},{:6d}) {:9.6g} {:9.6g} {:9.6g}'.format(*shp, d, s, d/s))
Sample run:
# shape, dense, sparse, ratio ( 4, 3) 4.22172e-06 3.1274e-06 1.34992
# shape, dense, sparse, ratio ( 10000, 3) 0.000332396 0.000245348 1.35479
# shape, dense, sparse, ratio ( 100, 10) 9.8944e-06 5.63165e-06 1.75693
# shape, dense, sparse, ratio (100000, 1000) 0.344177 0.189913 1.81229
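One difference between the two approaches that the timings do not show is how they treat ties. A small sketch, with an example array of my own that contains duplicate minimums: the comparison marks every position equal to the row minimum, while argmin marks only the first one.

```python
import numpy as np

a = np.array([[2, 2, 4], [9, 7, 7], [15, 14, 16]])

# Comparison against the row minimum: every tied minimum gets a 1
dense = (a == a.min(axis=1, keepdims=True)).astype(int)

# argmin returns the first minimum only, so each row gets exactly one 1
sparse = np.zeros(a.shape, dtype=int)
sparse[np.arange(a.shape[0]), a.argmin(axis=1)] = 1

print(dense)
print(sparse)
```

Which behaviour is wanted depends on the data; with random floats as in the benchmark, ties are unlikely and both give the same result.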

Related

(Numpy or PyTorch) Sum array elements for given bins

I would like this problem to be solved using PyTorch tensors. If there is no efficient solution in torch, then feel free to suggest a numpy solution.
Let a be a 1-dimensional tensor (or numpy array), and bins_indices be a tensor (np array) of integers from 0 to n - 1. I want to compute the array bins that at position i contains the sum of the elements a[bins_indices == i].
n = 3
a = [1, 4, 3, -2, 5] # Values
bins_indices = [0, 0, 1, 2, 0] # Correspondent bin indices
bins = [10, 3, -2] # bins[0] = 1 + 4 + 5 etc. bins has 3 elements since n=3
If you can provide also a way of making this work for batches I would be immensely grateful to you!
Not sure if this is the best way but here is another solution:
>>> bins = torch.unique(bins_indices)
>>> vfunc = np.vectorize( lambda x: torch.sum( a[ bins_indices == x ] ) )
>>> vfunc( bins )
array([10, 3, -2])
Here's a one-line Numpy solution I could think of:
bins = [np.sum(a[np.argwhere(bins_indices == i).flatten()]) for i in range(n)]
PyTorch 1.12 added a function scatter_reduce_ that performs exactly this kind of operation:
import torch
n = 3
a = torch.tensor([1, 4, 3, -2, 5]) # Values
bins_indices = torch.tensor([0, 0, 1, 2, 0]) # Correspondent bin indices
target_bins = torch.tensor([10, 3, -2]) # bins[0] = 1 + 4 + 5 etc. bins has 3 elements since n=3
bins = torch.zeros(n, dtype=a.dtype)
bins.scatter_reduce_(dim=0, src=a, index=bins_indices, reduce="sum")
assert torch.allclose(target_bins, bins)
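For the NumPy side of the question, here is a sketch under my own assumptions rather than part of any answer above. np.bincount with weights handles the 1-D case, and np.add.at handles batches. The batched inputs a_batch and idx_batch (shape (B, L)) are made up for illustration.

```python
import numpy as np

n = 3
a = np.array([1, 4, 3, -2, 5])             # values
bins_indices = np.array([0, 0, 1, 2, 0])   # bin index of each value

# 1-D case: bincount adds a[k] into bin bins_indices[k]; the result is float64
bins = np.bincount(bins_indices, weights=a, minlength=n)

# Batched case: one row of values and one row of bin indices per batch element
a_batch = np.array([[1, 4, 3, -2, 5], [2, 2, 2, 2, 2]])
idx_batch = np.array([[0, 0, 1, 2, 0], [1, 1, 1, 0, 2]])
bins_batch = np.zeros((a_batch.shape[0], n), dtype=a_batch.dtype)
rows = np.arange(a_batch.shape[0])[:, None]        # (B, 1), broadcasts against (B, L)
np.add.at(bins_batch, (rows, idx_batch), a_batch)  # unbuffered add, so repeated indices accumulate

print(bins)
print(bins_batch)
```

np.add.at is used instead of bins_batch[rows, idx_batch] += a_batch because the plain in-place form does not accumulate when the same index occurs more than once.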

N-D indexing with defaults in NumPy

Can I index a NumPy N-D array with a fallback to a default value for out-of-bounds indexes? Example code below for an imaginary np.get_with_default(a, indexes, default):
import numpy as np
print(np.get_with_default(
np.array([[1,2,3],[4,5,6]]), # N-D array
[(np.array([0, 0, 1, 1, 2, 2]), np.array([1, 2, 2, 3, 3, 5]))], # N-tuple of indexes along each axis
13, # Default for out-of-bounds fallback
))
should print
[2 3 6 13 13 13]
I'm looking for a built-in function for this. If none exists, then at least a short and efficient implementation.
I arrived at this question because I was looking for exactly the same thing. I came up with the following function, which does what you ask for 2 dimensions. It could likely be generalised to N dimensions.
def get_with_defaults(a, xx, yy, nodata):
    # get values from a, clipping the index values to valid ranges
    res = a[np.clip(yy, 0, a.shape[0] - 1), np.clip(xx, 0, a.shape[1] - 1)]
    # compute a mask for both x and y, where all invalid index values are set to true
    myy = np.ma.masked_outside(yy, 0, a.shape[0] - 1).mask
    mxx = np.ma.masked_outside(xx, 0, a.shape[1] - 1).mask
    # replace all values in res with NODATA, where either the x or y index are invalid
    np.choose(myy + mxx, [res, nodata], out=res)
    return res
xx and yy are the index arrays; a is indexed by (y, x).
This gives:
>>> a=np.zeros((3,2),dtype=int)
>>> get_with_defaults(a, (-1, 1000, 0, 1, 2), (0, -1, 0, 1, 2), -1)
array([-1, -1, 0, 0, -1])
As an alternative, the following implementation achieves the same and is more concise:
def get_with_default(a, xx, yy, nodata):
    # convert the indices to arrays so the comparisons below also work for tuples or lists
    xx, yy = np.asarray(xx), np.asarray(yy)
    # get values from a, clipping the index values to valid ranges
    res = a[np.clip(yy, 0, a.shape[0] - 1), np.clip(xx, 0, a.shape[1] - 1)]
    # replace all values in res with NODATA (gets broadcasted to the result array), where
    # either the x or y index are invalid
    res[(yy < 0) | (yy >= a.shape[0]) | (xx < 0) | (xx >= a.shape[1])] = nodata
    return res
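As a sanity check on the clip-then-mask idea, here is a sketch that compares the concise version with a deliberately slow loop. The reference function and the test array are my own additions for illustration. The index arguments are converted with np.asarray first, since comparisons such as yy < 0 need arrays rather than tuples.

```python
import numpy as np

def get_with_default(a, xx, yy, nodata):
    xx, yy = np.asarray(xx), np.asarray(yy)
    # Clip indices into range, then overwrite entries whose original index was invalid
    res = a[np.clip(yy, 0, a.shape[0] - 1), np.clip(xx, 0, a.shape[1] - 1)]
    res[(yy < 0) | (yy >= a.shape[0]) | (xx < 0) | (xx >= a.shape[1])] = nodata
    return res

def reference(a, xx, yy, nodata):
    # Hypothetical slow reference: explicit bounds check per (x, y) pair
    return np.array([
        a[y, x] if 0 <= y < a.shape[0] and 0 <= x < a.shape[1] else nodata
        for x, y in zip(xx, yy)
    ])

a = np.arange(6).reshape(3, 2)   # [[0, 1], [2, 3], [4, 5]], indexed as a[y, x]
xx = [-1, 1000, 0, 1, 2]
yy = [0, -1, 0, 1, 2]

fast = get_with_default(a, xx, yy, -1)
slow = reference(a, xx, yy, -1)
print(fast)
```

The explicit bounds check also treats negative indices as out of bounds, whereas plain NumPy indexing would wrap them around to the end of the axis.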
I don't know if there is anything in NumPy to do that directly, but you can always implement it yourself. This is not particularly smart or efficient, as it requires multiple advanced indexing operations, but does what you need:
import numpy as np
def get_with_default(a, indices, default=0):
    # Ensure inputs are arrays
    a = np.asarray(a)
    indices = tuple(np.broadcast_arrays(*indices))
    if len(indices) <= 0 or len(indices) > a.ndim:
        raise ValueError('invalid number of indices.')
    # Make mask of indices out of bounds
    # (np.bool was removed in NumPy 1.24, so use the builtin bool)
    mask = np.zeros(indices[0].shape, bool)
    for ind, s in zip(indices, a.shape):
        mask |= (ind < 0) | (ind >= s)
    # Only do masking if necessary
    n_mask = np.count_nonzero(mask)
    # Shortcut for the case where all is masked: the result has the shape of
    # the index arrays plus any trailing axes of a that were not indexed
    if n_mask == mask.size:
        return np.full(mask.shape + a.shape[len(indices):], default, dtype=a.dtype)
    if n_mask > 0:
        # Copy the index arrays so the caller's arrays are not modified
        indices = tuple(np.array(ind) for ind in indices)
        for ind in indices:
            # Replace masked indices with zeros
            ind[mask] = 0
    # Get values
    res = a[indices]
    if n_mask > 0:
        # Replace values of masked indices with default value
        res[mask] = default
    return res
# Test
print(get_with_default(
np.array([[1,2,3],[4,5,6]]),
(np.array([0, 0, 1, 1, 2, 2]), np.array([1, 2, 2, 3, 3, 5])),
13
))
# [ 2 3 6 13 13 13]
I also needed a solution to this, but I wanted a solution that worked in N dimensions. I made Markus' solution work for N-dimensions, including selecting from an array with more dimensions than the coordinates point to.
import numpy as np
def get_with_defaults(arr, coords, nodata):
    coords, shp = np.array(coords), np.array(arr.shape)
    # Get values from arr, clipping to valid ranges
    res = arr[tuple(np.clip(c, 0, s-1) for c, s in zip(coords, shp))]
    # Set any output where one of the coords was out of range to nodata
    res[np.any(~((0 <= coords) & (coords < shp[:len(coords), None])), axis=0)] = nodata
    return res
if __name__ == '__main__':
    A = np.array([[1,2,3],[4,5,6]])
    B = np.array([[[1, -9],[2, -8],[3, -7]],[[4, -6],[5, -5],[6, -4]]])
    coords1 = [[0, 0, 1, 1, 2, 2], [1, 2, 2, 3, 3, 5]]
    coords2 = [[0, 0, 1, 1, 2, 2], [1, 2, 2, 3, 3, 5], [1, 1, 1, 1, 1, 1]]
    out1 = get_with_defaults(A, coords1, 13)
    out2 = get_with_defaults(B, coords1, 13)
    out3 = get_with_defaults(B, coords2, 13)
    print(out1)
    # [2, 3, 6, 13, 13, 13]
    print(out2)
    # [[ 2 -8]
    # [ 3 -7]
    # [ 6 -4]
    # [13 13]
    # [13 13]
    # [13 13]]
    print(out3)
    # [-8, -7, -4, 13, 13, 13]

Dynamic way to compute linear constraints with multiple operators

Imagine a matrix A whose first column holds inequality/equality operators (≥, =, ≤), and a vector b, where the number of rows in A equals the number of elements in b. In my setting, one row would then be evaluated as, e.g.,
dot(A[0, 1:], x) ≥ b[0]
where x is some vector and column A[, 0] holds the operators, so for row 0 we know we are supposed to use the ≥ operator (i.e. A[0, 0] == "≥" is true). Now, is there a way to evaluate all rows dynamically, in the following, so far imaginary, way?
dot(A[, 1:], x) A[, 0] b
My hope is for a dynamic evaluation in which the operator to use is looked up separately for each row.
Example, let
A = [
[">=", -2, 1, 1],
[">=", 0, 1, 0],
["==", 0, 1, 1]
]
b = [0, 1, 1]
and let x be some given vector, e.g. x = [1,1,0]. We wish to compute the following:
A[,1:] x A[,0] b
dot([-2, 1, 1], [1, 1, 0]) >= 0
dot([0, 1, 0], [1, 1, 0]) >= 1
dot([0, 1, 1], [1, 1, 0]) == 1
The output would be [False, True, True]
If I understand correctly, this is a way to do that operation:
import numpy as np
# Input data
a = [
[">=", -2, 1, 1],
[">=", 0, 1, 0],
["==", 0, 1, 1]
]
b = np.array([0, 1, 1])
x = np.array([1, 1, 0])
# Split in comparison and data
a0 = np.array([lst[0] for lst in a])
a1 = np.array([lst[1:] for lst in a])
# Compute dot product
c = a1 @ x
# Compute comparisons
leq = c <= b
eq = c == b
geq = c >= b
# Find comparison index for each row
cmps = np.array(["<=", "==", ">="]) # This array is lex sorted
cmp_idx = np.searchsorted(cmps, a0)
# Select the right result for each row
result = np.choose(cmp_idx, [leq, eq, geq])
# Convert to numeric type if preferred
result = result.astype(np.int32)
print(result)
# [0 1 1]
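An alternative sketch, not from the answer above: map each operator string to the matching NumPy comparison function and apply it only to the rows that use it. The OPS dictionary and the check_constraints name are my own. This avoids relying on the operator strings being lexicographically sorted, and it raises a KeyError-free, all-False result only for rows with a known operator, so unknown symbols should be validated separately if they can occur.

```python
import numpy as np

# Assumed set of supported operators; extend as needed
OPS = {"<=": np.less_equal, "==": np.equal, ">=": np.greater_equal}

def check_constraints(ops, coeffs, x, b):
    ops = np.asarray(ops)
    b = np.asarray(b)
    lhs = np.asarray(coeffs) @ np.asarray(x)   # one dot product per row
    result = np.zeros(lhs.shape, dtype=bool)
    for symbol, func in OPS.items():
        rows = ops == symbol                   # rows that use this operator
        result[rows] = func(lhs[rows], b[rows])
    return result

ops = [">=", ">=", "=="]
coeffs = [[-2, 1, 1], [0, 1, 0], [0, 1, 1]]
b = [0, 1, 1]
x = [1, 1, 0]
result = check_constraints(ops, coeffs, x, b)
print(result)
```

With floating-point coefficients the "==" rows would probably be better served by np.isclose than by exact equality.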

Why does calling np.array() on this list comprehension produce a 3d array instead of 2d?

I have a script that produces the first several iterations of a Markov matrix multiplying a given set of input values. With the matrix stored as A and the start values in the column u0, I use this list comprehension to store the output in an array:
out = np.array([ ( (A**n) * u0).T for n in range(10) ])
The output has shape (10,1,6), but I want the output in shape (10,6) instead. Obviously, I can fix this with .reshape(), but is there a way to avoid creating the extra dimension in the first place, perhaps by simplifying the list comprehension or the inputs?
Here's the full script and output:
import numpy as np
# Random 6x6 Markov matrix
n = 6
A = np.matrix([ (lambda x: x/x.sum())(np.random.rand(n)) for _ in range(n)]).T
print(A)
#[[0.27457312 0.20195133 0.14400801 0.00814027 0.06026188 0.23540134]
# [0.21526648 0.17900277 0.35145882 0.30817386 0.15703758 0.21069114]
# [0.02100412 0.05916883 0.18309142 0.02149681 0.22214047 0.15257011]
# [0.17032696 0.11144443 0.01364982 0.31337906 0.25752732 0.1037133 ]
# [0.03081507 0.2343255 0.2902935 0.02720764 0.00895182 0.21920371]
# [0.28801424 0.21410713 0.01749843 0.32160236 0.29408092 0.07842041]]
# Random start values
u0 = np.matrix(np.random.randint(51, size=n)).T
print(u0)
#[[31]
# [49]
# [44]
# [29]
# [10]
# [ 0]]
# Find the first 10 iterations of the Markov process
out = np.array([ ( (A**n) * u0).T for n in range(10) ])
print(out)
#[[[31. 49. 44. 29. 10.
# 0. ]]
#
# [[25.58242101 41.41600236 14.45123543 23.00477134 26.08867045
# 32.45689942]]
#
# [[26.86917065 36.02438292 16.87560159 26.46418685 22.66236879
# 34.10428921]]
#
# [[26.69224394 37.06346073 16.59208202 26.48817955 22.56696872
# 33.59706504]]
#
# [[26.68772374 36.99727159 16.49987315 26.5003184 22.61130862
# 33.7035045 ]]
#
# [[26.68766363 36.98517264 16.50532933 26.51717543 22.592951
# 33.71170797]]
#
# [[26.68695152 36.98895204 16.50314718 26.51729716 22.59379049
# 33.70986161]]
#
# [[26.68682195 36.98848867 16.50286371 26.51763013 22.59362679
# 33.71056876]]
#
# [[26.68681128 36.98850409 16.50286036 26.51768807 22.59359453
# 33.71054167]]
#
# [[26.68680313 36.98851046 16.50285038 26.51769497 22.59359219
# 33.71054886]]]
print(out.shape)
#(10, 1, 6)
out = out.reshape(10,n)
print(out)
#[[31. 49. 44. 29. 10. 0. ]
# [25.58242101 41.41600236 14.45123543 23.00477134 26.08867045 32.45689942]
# [26.86917065 36.02438292 16.87560159 26.46418685 22.66236879 34.10428921]
# [26.69224394 37.06346073 16.59208202 26.48817955 22.56696872 33.59706504]
# [26.68772374 36.99727159 16.49987315 26.5003184 22.61130862 33.7035045 ]
# [26.68766363 36.98517264 16.50532933 26.51717543 22.592951 33.71170797]
# [26.68695152 36.98895204 16.50314718 26.51729716 22.59379049 33.70986161]
# [26.68682195 36.98848867 16.50286371 26.51763013 22.59362679 33.71056876]
# [26.68681128 36.98850409 16.50286036 26.51768807 22.59359453 33.71054167]
# [26.68680313 36.98851046 16.50285038 26.51769497 22.59359219 33.71054886]]
I think your confusion lies with how arrays can be joined.
Start with a simple 1d array (in numpy 1d is a real thing, not just a 'row vector' or 'column vector'):
In [288]: arr = np.arange(6)
In [289]: arr
Out[289]: array([0, 1, 2, 3, 4, 5])
np.array joins element arrays along a new 1st dimension:
In [290]: np.array([arr,arr])
Out[290]:
array([[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5]])
np.stack with the default axis value does the same thing. Read its docs.
We can make a 2d array, a column vector:
In [291]: arr1 = arr[:,None]
In [292]: arr1
Out[292]:
array([[0],
[1],
[2],
[3],
[4],
[5]])
In [293]: arr1.shape
Out[293]: (6, 1)
Using np.array on a list of its transposes, which are (1,6) arrays:
In [294]: np.array([arr1.T, arr1.T])
Out[294]:
array([[[0, 1, 2, 3, 4, 5]],
[[0, 1, 2, 3, 4, 5]]])
In [295]: _.shape
Out[295]: (2, 1, 6)
Note the middle size-1 dimension, the one that bothered you.
np.vstack joins the arrays along the existing 1st dimension. It does not add one:
In [296]: np.vstack([arr1.T, arr1.T])
Out[296]:
array([[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5]])
Or we could join the arrays horizontally, on the 2nd dimension:
In [297]: np.hstack([arr1, arr1])
Out[297]:
array([[0, 0],
[1, 1],
[2, 2],
[3, 3],
[4, 4],
[5, 5]])
That is (6,2) which can be transposed to (2,6):
In [298]: np.hstack([arr1, arr1]).T
Out[298]:
array([[0, 1, 2, 3, 4, 5],
[0, 1, 2, 3, 4, 5]])
If you use np.array() for input and @ for matrix multiplication, it works as expected.
# Random 6x6 Markov matrix
n = 6
A = np.array([ (lambda x: x/x.sum())(np.random.rand(n)) for _ in range(n)]).T
# Random start values
u0 = np.random.randint(51, size=n).T
# Find the first 10 iterations of the Markov process
out = np.array([ ( np.linalg.matrix_power(A,n) @ u0).T for n in range(10) ])
print(out)
#[[29. 24. 5. 12. 10. 32. ]
# [15.82875119 13.53436868 20.61648725 19.22478172 20.34082205 22.45478912]
# [21.82434718 10.06037119 14.29281935 20.75271393 18.76134538 26.30840297]
# [20.77484848 10.1379821 15.47488423 19.4965479 20.05618311 26.05955418]
# [21.02944236 10.09401438 15.24263478 19.48662616 19.95767996 26.18960236]
# [20.96887722 10.11647819 15.30729334 19.44261102 20.00089222 26.16384802]
# [20.98086362 10.11522779 15.29529799 19.44899285 19.99137187 26.16824587]
# [20.97795615 10.11606978 15.29817734 19.44798612 19.99293494 26.16687566]
# [20.97858032 10.11591954 15.29752865 19.44839852 19.99245389 26.16711909]
# [20.97844343 10.11594666 15.29766432 19.4483417 19.99254284 26.16706104]]
I made a few changes to the code, although I'm not 100% certain that the result is still the same (I am not familiar with Markov chains).
import numpy as np
n = 6
num_proc_iters = 10
rand_nums_arr = np.random.random_sample((n, n))
rand_nums_arr = np.transpose(rand_nums_arr / rand_nums_arr.sum(axis=1, keepdims=True))  # keepdims so each row is divided by its own sum
u0 = np.random.randint(51, size=n)
res_arr = np.stack([np.linalg.matrix_power(rand_nums_arr, curr) @ u0 for curr in range(num_proc_iters)])  # stack gives shape (10, 6); concatenate would flatten to (60,)
I would love to hear if anyone can think of any further improvements.
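One possible further improvement, offered as a sketch rather than a definitive answer: since each iterate is just A applied to the previous one, the matrix powers can be skipped entirely by filling a preallocated (steps, n) array row by row. The seeded random generator and the variable names are my own choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded only to make the sketch reproducible
n = 6
steps = 10

# Column-stochastic matrix: normalise each row, then transpose
A = rng.random((n, n))
A = (A / A.sum(axis=1, keepdims=True)).T
u0 = rng.integers(0, 51, size=n).astype(float)

out = np.empty((steps, n))
out[0] = u0
for k in range(1, steps):
    out[k] = A @ out[k - 1]      # one matrix-vector product per step

print(out.shape)
```

Because everything is a plain ndarray and u0 is 1-D, each iterate has shape (6,) and no extra dimension appears.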

Given distances and values array, return sorted filtered values in numpy

I am not sure what the title of this question should be. But let's say we have 2 arrays, values and distances.
values = np.array([[-1,-1,-1],
[1, 2, 0],
[-1,-1,-1]])
distances = np.array([[1,2,3],
[6,5,4],
[7,8,9]])
I would like to get the values that are non-negative, ordered by their corresponding distances in the distances array.
So with the example above, the non-negative values are [1,2,0] and their distances are [6,5,4]. Thus, sorting by distance, I would like to have [0,2,1] as the answer.
My code is below. It works, but I would like a solution that uses only numpy. I'm sure that would be more efficient than this:
import numpy as np
import heapq
def get_sorted_values(seek_val, values, distances):
    r, c = np.where(values >= seek_val)
    di = distances[r, c]
    vals = values[r, c]
    print("di", di)
    print("vals", vals)
    if len(di) >= 1:
        heap = []
        for d, v in zip(di,vals):
            heapq.heappush(heap, (d,v))
        lists = []
        while heap:
            d, v = heapq.heappop(heap)
            lists.append(v)
        return lists
    else:
        ## NOTHING FOUND
        return None
Input:
seek_val = 0
values = np.array([[-1,-1,-1],
[1,2,0],
[-1,-1,-1]])
distances = np.array([[1,2,3],
[6,5,4],
[7,8,9]])
print("Ans:",get_sorted_values(seek_val, values, distances))
Output:
di [6 5 4]
vals [1 2 0]
Ans: [0, 2, 1]
"one liner":
values[np.where(values >= 0)][np.argsort(distances[np.where(values >= 0)])]
Out[981]: array([0, 2, 1])
Repeating np.where(values >= 0) is inefficient; if values is big, store the result in a variable:
v_indx = np.where(values >= 0)
values[v_indx][np.argsort(distances[v_indx])]
Try np.argsort
import numpy as np
values = np.array([[-1,-1,-1],
[ 1, 2, 0],
[-1,-1,-1]])
distances = np.array([[1, 2, 3],
[6, 5, 4],
[7, 8, 9]])
print(values[values >= 0])
# [1 2 0]
print(distances[values >= 0])
# [6 5 4]
print('Ans:', values[values >= 0][np.argsort(distances[values >= 0])])
# Ans: [0 2 1]
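Putting this together as a drop-in for the question's function, here is a sketch that computes the mask once. Two choices in it are my own assumptions: it returns an empty array instead of None when nothing matches, and it uses a stable sort so that values at equal distances keep their original order.

```python
import numpy as np

def get_sorted_values(seek_val, values, distances):
    mask = values >= seek_val                            # boolean mask, computed once
    order = np.argsort(distances[mask], kind="stable")   # positions sorted by distance
    return values[mask][order]

values = np.array([[-1, -1, -1],
                   [ 1,  2,  0],
                   [-1, -1, -1]])
distances = np.array([[1, 2, 3],
                      [6, 5, 4],
                      [7, 8, 9]])

ans = get_sorted_values(0, values, distances)
print("Ans:", ans)
```

Callers that relied on None for "nothing found" can check ans.size == 0 instead.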
