Index a 3D array with 2D array numpy - python

I'm trying to manipulate an index and source array such that:
result[i][j][k] = source[i][indices[i][j][k]]
I know how to do this with for loops, but I'm working with giant arrays and would like something more time-efficient. I've tried NumPy's advanced indexing, but I don't really understand it.
Example functionality:
source = [[0.0 0.1 0.2 0.3]
          [1.0 1.1 1.2 1.3]
          [2.0 2.1 2.2 2.3]]
indices = [[[3 1 0 1]
            [3 0 0 3]]
           [[0 1 0 2]
            [3 2 1 1]]
           [[1 1 0 1]
            [0 1 2 2]]]
# result[i][j][k] = source[i][indices[i][j][k]]
result = [[[0.3 0.1 0.0 0.1]
           [0.3 0.0 0.0 0.3]]
          [[1.0 1.1 1.0 1.2]
           [1.3 1.2 1.1 1.1]]
          [[2.1 2.1 2.0 2.1]
           [2.0 2.1 2.2 2.2]]]

Solution using Integer Advanced Indexing:
Given:
source = [[0.0, 0.1, 0.2, 0.3],
          [1.0, 1.1, 1.2, 1.3],
          [2.0, 2.1, 2.2, 2.3]]
indices = [[[3, 1, 0, 1],
            [3, 0, 0, 3]],
           [[0, 1, 0, 2],
            [3, 2, 1, 1]],
           [[1, 1, 0, 1],
            [0, 1, 2, 2]]]
Use this:
import numpy as np

nd_source = np.array(source)
source_rows = len(source)     # == 3 in the above example
source_cols = len(source[0])  # == 4 in the above example
row_indices = np.arange(source_rows).reshape(-1, 1, 1)
result = nd_source[row_indices, indices]
Result:
print(result)
[[[0.3 0.1 0.  0.1]
  [0.3 0.  0.  0.3]]

 [[1.  1.1 1.  1.2]
  [1.3 1.2 1.1 1.1]]

 [[2.1 2.1 2.  2.1]
  [2.  2.1 2.2 2.2]]]
Explanation:
To use Integer Advanced Indexing, the key rules are:
We must supply index arrays consisting of integer indices.
We must supply as many index arrays as there are dimensions in the source array.
These index arrays must all have the same shape, or at least all be broadcastable to a single final shape.
How the Integer Advanced Indexing works is:
Given that the source array has n dimensions, and that we have therefore supplied n integer index arrays:
All of these index arrays, if not in the same uniform shape, will be broadcasted to be in a single uniform shape.
To access any element in the source array, we obviously need an n-tuple of indices. Therefore to generate the result array from the source array, we need several n-tuples, one n-tuple for each element-position of the final result array. For each element-position of the result array, the n-tuple of indices will be constructed from the corresponding element-positions in the broadcasted index arrays. (Remember the result array has exactly the same shape as the broadcasted index arrays, as already mentioned above).
Thus, by traversing the index arrays in tandem, we get all the n-tuples we need to generate the result array, in the same shape as the broadcasted index arrays.
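Before applying this to the example, a tiny self-contained illustration of that pairing may help (hypothetical arrays, separate from the question's):

import numpy as np

src = np.array([[10, 11, 12],
                [20, 21, 22]])   # 2-d source, so two index arrays are needed
rows = np.array([[0, 0],
                 [1, 1]])        # first index array, shape (2, 2)
cols = np.array([[2, 0],
                 [1, 2]])        # second index array, shape (2, 2)
# element (i, j) of the result comes from the 2-tuple (rows[i, j], cols[i, j])
print(src[rows, cols])
# [[12 10]
#  [21 22]]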
Applying this explanation to the above example:
Our source array is nd_source = np.array(source), which is 2d.
Our final result shape is (3,2,4).
We therefore need to supply 2 index arrays, and these index arrays must either be in the final result shape of (3,2,4), or broadcastable to the (3,2,4) shape.
Our first index array is row_indices = np.arange(source_rows).reshape(-1,1,1). (source_rows is the number of rows in the source, which is 3 in this example) This index array has shape (3,1,1), and actually looks like [[[0]],[[1]],[[2]]]. This is broadcastable to the final result shape of (3,2,4), and the broadcasted array looks like [[[0,0,0,0],[0,0,0,0]],[[1,1,1,1],[1,1,1,1]],[[2,2,2,2],[2,2,2,2]]].
Our second index array is indices. Though this is not an ndarray and is only a list of lists, numpy is flexible enough to convert it into the corresponding ndarray automatically when we pass it as our second index array. Note that this array is already in the final desired result shape of (3,2,4), even without any broadcasting.
Traversing these two index arrays in tandem (one a broadcasted array and the other as is), numpy generates all the 2-tuples needed to access our source 2d array nd_source, and generate the final result in the shape (3,2,4).
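For completeness, NumPy 1.15+ also offers np.take_along_axis, which can express the same lookup; a minimal sketch, assuming the same source and indices as above:

import numpy as np

nd_source = np.array(source)    # shape (3, 4)
nd_indices = np.array(indices)  # shape (3, 2, 4)
# insert a length-1 axis so the source broadcasts against the (3, 2, 4) indices
result = np.take_along_axis(nd_source[:, None, :], nd_indices, axis=2)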

Related

Filling parts of a list without a loop

I have the following list or numpy array
ll=[7.2,0,0,0,0,0,6.5,0,0,-8.1,0,0,0,0]
and an additional list indicating the positions of non-zeros
i=[0,6,9]
I would like to make two new lists out of them, one filling the zeros and one counting in between, for this short example:
a=[7.2,7.2,7.2,7.2,7.2,7.2,6.5,6.5,6.5,-8.1,-8.1,-8.1,-8.1,-8.1]
b=[0,1,2,3,4,5,0,1,2,0,1,2,3,4]
Is there a way to do that without a for loop, to speed things up, as the list ll is quite long in my case?
Array a is the result of a forward fill and array b are indices associated with the range between each consecutive non-zero element.
pandas has a forward fill function, but it should be easy enough to compute with numpy and there are many sources on how to do this.
import numpy as np

ll = [7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0]
a = np.array(ll)
# mark zero elements; keep each non-zero element's own index, 0 elsewhere
mask = a == 0
idx = np.where(~mask, np.arange(mask.size), 0)
# do the fill: the running maximum propagates the last non-zero index forward
a[np.maximum.accumulate(idx)]
output:
array([ 7.2,  7.2,  7.2,  7.2,  7.2,  7.2,  6.5,  6.5,  6.5, -8.1, -8.1,
       -8.1, -8.1, -8.1])
More information about forward fill is found here:
Most efficient way to forward-fill NaN values in numpy array
Finding the consecutive zeros in a numpy array
Computing array b you could use the forward fill mask and combine it with a single np.arange:
fill_mask = np.maximum.accumulate(idx)
np.arange(len(fill_mask)) - fill_mask
output:
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 0, 1, 2, 3, 4])
So...
import numpy as np
ll = np.array([7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0])
i = np.array([0, 6, 9])
counts = np.append(
    np.diff(i),       # difference between consecutive elements of i
                      # (one element shorter than i)
    len(ll) - i[-1],  # plus the length of the last run
)
repeated = np.repeat(ll[i], counts)
repeated becomes
[ 7.2 7.2 7.2 7.2 7.2 7.2 6.5 6.5 6.5 -8.1 -8.1 -8.1 -8.1 -8.1]
b could be computed with
b = np.concatenate([np.arange(c) for c in counts])
print(b)
# [0 1 2 3 4 5 0 1 2 0 1 2 3 4]
but that involves a loop in the form of that list comprehension; perhaps someone Numpyier could implement it without a Python loop.
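One fully vectorized possibility, reusing the counts computed above: repeat each run's start index and subtract it from a global arange (a sketch, assuming the same ll, i and counts as above):

b = np.arange(len(ll)) - np.repeat(i, counts)  # position within each run
print(b)
# [0 1 2 3 4 5 0 1 2 0 1 2 3 4]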

MCP Geometrics for calculating marketsheds

I am trying to calculate marketsheds using the skimage.graph.MCP_Geometric find_costs function. It has been working wonderfully for calculating least-cost routes, but rather than finding the travel cost to the nearest source, I want to calculate the index of the nearest source.
Sample Code
import numpy as np
import skimage.graph as graph
import copy
img = np.array([[1,1,2,2],[2,1,1,3],[3,2,1,2],[2,2,2,1]])
mcp = graph.MCP_Geometric(img)
destinations = [[0,0],[3,3]]
costs, traceback = mcp.find_costs(destinations)
print(costs)
[[0.         1.         2.5        4.5       ]
 [1.5        1.41421356 2.41421356 4.        ]
 [4.         2.91421356 1.41421356 1.5       ]
 [5.5        3.5        1.5        0.        ]]
This works as expected, and creates a nice travel cost raster. However, I want (for each cell) to know which of the destinations is the closest. The best solution I have found is to run each of the destinations separately, then combine them through min calculations. It works, but is slow, and has not been working at scale.
all_c = []
for dest in destinations:
    costs, traceback = mcp.find_costs([dest])
    all_c.append(copy.deepcopy(costs))
res = np.dstack(all_c)
res_min = np.amin(res, axis=2)
output = np.zeros([res_min.shape[0], res_min.shape[1]])
for idx in range(0, res.shape[2]):
    cur_data = res[:, :, idx]
    cur_val = (cur_data == res_min).astype(np.byte) * idx
    output = output + cur_val
output = output.astype(np.byte)
print(output)
array([[0, 0, 0, 0],
       [0, 0, 1, 1],
       [0, 1, 1, 1],
       [1, 1, 1, 1]], dtype=int8)
I have been looking into overloading the functions of MCP_Geometric and MCP_Flexible, but I cannot find anything providing information on the index of the destination.
Hope that provides enough information to replicate and understand what I want to do, thanks!
Ok, this is a bit of a ride, but it was fun to figure out. I'm unclear just how fast it'll be but I think it should be pretty fast in the case of many destinations and comfortably-in-RAM images.
The key is the traceback return value, which kinda-sorta tells you the neighbor index to get to the nearest destination. So with a bit of pathfinding you should be able to find that destination. Can that be fast? It turns out it can, with a bit of NumPy index wrangling, scipy.sparse matrices, and connected_components from scipy.sparse.csgraph!
Let's start with your same costs array and both destinations:
import numpy as np
image = np.array(
    [[1, 1, 2, 2],
     [2, 1, 1, 3],
     [3, 2, 1, 2],
     [2, 2, 2, 1]]
)
destinations = [[0, 0], [3, 3]]
We then make the graph, and get the costs and the traceback:
from skimage import graph
mcp = graph.MCP_Geometric(image)
costs, traceback = mcp.find_costs(destinations)
print(traceback)
gives:
[[-1  4  4  4]
 [ 6  7  7  1]
 [ 6  6  0  1]
 [ 3  3  3 -1]]
Now, I had to look up the documentation for what traceback is:
Same shape as the costs array; this array contains the offset to
any given index from its predecessor index. The offset indices
index into the offsets attribute, which is an array of n-d
offsets. In the 2-d case, if offsets[traceback[x, y]] is (-1, -1),
that means that the predecessor of [x, y] in the minimum cost path
to some start position is [x+1, y+1]. Note that if the
offset_index is -1, then the given index was not considered.
For some reason, my mcp object didn't have an offsets attribute — possibly a Cython inheritance bug? Dunno, will investigate later — but searching the source code shows me that offsets is defined with the skimage.graph._mcp.make_offsets function. So I did a bad thing and imported from that private module, so I could claim what was rightfully mine — the offsets list, which translates from numbers in traceback to offsets in the image coordinates:
from skimage.graph import _mcp
offsets = _mcp.make_offsets(2, True)
print(offsets)
which gives:
[array([-1, -1]),
 array([-1,  0]),
 array([-1,  1]),
 array([ 0, -1]),
 array([ 0,  1]),
 array([ 1, -1]),
 array([ 1,  0]),
 array([ 1,  1])]
Now, there's one last thing to do with the offsets: you'll note that destinations are marked in the traceback with "-1", which doesn't correspond to the last element of the offsets array. So we append np.array([0, 0]), and then every value in traceback corresponds to a real offset. In the case of destinations, you get a self-edge, but that's fine.
offsets.append(np.array([0, 0]))
offsets_arr = np.array(offsets) # shape (9, 2)
Now, we can build a graph from offsets, pixel coordinates, and pixel ids. First, we use np.indices to get an index for every pixel in the image:
indices = np.indices(traceback.shape)
print(indices.shape)
gives:
(2, 4, 4)
To get an array that has, for each pixel, the offset to its neighbor, we use fancy array indexing:
offset_to_neighbor = offsets_arr[traceback]
print(offset_to_neighbor.shape)
which gives:
(4, 4, 2)
The axes are different between the traceback and the numpy indices, but nothing a little transposition won't fix:
neighbor_index = indices - offset_to_neighbor.transpose((2, 0, 1))
Finally, we want to deal with integer pixel ids in order to create a graph of all the pixels, rather than coordinates. For this, we use np.ravel_multi_index.
ids = np.arange(traceback.size).reshape(image.shape)
neighbor_ids = np.ravel_multi_index(
    tuple(neighbor_index), traceback.shape
)
This gives me a unique ID for each pixel, and then a unique "step towards the destination" for each pixel:
print(ids)
print(neighbor_ids)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]
[[ 0  0  1  2]
 [ 0  0  1 11]
 [ 4  5 15 15]
 [13 14 15 15]]
Then we can turn this into a graph using SciPy sparse matrices. We don't care about weights for this graph so we just use the value 1 for the edges.
from scipy import sparse
from scipy.sparse import csgraph

g = sparse.coo_matrix(
    (np.ones(traceback.size), (ids.flat, neighbor_ids.flat)),
    shape=(ids.size, ids.size),
).tocsr()
(This uses the (value, (row, column)) or (data, (i, j)) input format for sparse COOrdinate matrices.)
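A tiny standalone demo of that input format, with made-up data:

import numpy as np
from scipy import sparse

data = np.ones(3)           # edge weights (all 1 here)
rows = np.array([0, 1, 2])  # source node ids
cols = np.array([1, 2, 0])  # target node ids
m = sparse.coo_matrix((data, (rows, cols)), shape=(3, 3))
print(m.toarray())
# [[0. 1. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]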
Finally, we use connected components to get the graphs — the groups of pixels that are nearest to each destination. The function returns the number of components and the mapping of "pixel id" to component:
n, components = csgraph.connected_components(g)
basins = components.reshape(image.shape)
print(basins)
[[0 0 0 0]
 [0 0 0 1]
 [0 0 1 1]
 [1 1 1 1]]
(Note that this result is slightly different from yours because the cost is identical to destination 0 and 1 for the pixels in question, so it's arbitrary which to label.)
print(costs)
[[0.         1.         2.5        4.5       ]
 [1.5        1.41421356 2.41421356 4.        ]
 [4.         2.91421356 1.41421356 1.5       ]
 [5.5        3.5        1.5        0.        ]]
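Putting it all together, the whole recipe could be wrapped in a single function (a sketch following the steps above, not part of the skimage API; note the component labels are arbitrary and need not match the destination order):

import numpy as np
from scipy import sparse
from scipy.sparse import csgraph
from skimage import graph
from skimage.graph import _mcp

def nearest_destination_basins(image, destinations):
    # run MCP once for all destinations
    mcp = graph.MCP_Geometric(image)
    costs, traceback = mcp.find_costs(destinations)
    # neighbor offsets, with a self-edge appended for the -1 sentinel
    offsets = _mcp.make_offsets(2, True)
    offsets.append(np.array([0, 0]))
    offsets_arr = np.array(offsets)
    # for each pixel, the coordinates of its predecessor
    indices = np.indices(traceback.shape)
    neighbor_index = indices - offsets_arr[traceback].transpose((2, 0, 1))
    # linear pixel ids and predecessor ids
    ids = np.arange(traceback.size).reshape(image.shape)
    neighbor_ids = np.ravel_multi_index(tuple(neighbor_index), traceback.shape)
    # sparse graph of "step towards the destination" edges
    g = sparse.coo_matrix(
        (np.ones(traceback.size), (ids.flat, neighbor_ids.flat)),
        shape=(ids.size, ids.size),
    ).tocsr()
    _, components = csgraph.connected_components(g)
    return components.reshape(image.shape)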
Hope this helps!

Getting a column index in numpy

I'm pretty new to NumPy and I'm looking for a way to get the index of a current column I'm iterating over in a matrix.
import numpy as np

# sum of elements in each column
def p_b(mtrx):
    b = []
    for c in mtrx.T:
        summ = 0
        for i in c:
            summ += i
        b.append(summ)
    return b
# return a modified matrix where each element is equal to itself divided by
# the sum of the current column in the original matrix
def a_div_b(mtrx):
    for c in mtrx:
        for i in c:
            # change i to be i / p_b(mtrx)[index_of_a_current_column]
            pass
    return mtrx
For the input ([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) the result would be
([[1/12, 2/12, 3/12], [4/15, 5/15, 6/15], [7/18, 8/18, 9/18]]).
Any ideas about how I can achieve that?
You don't need those functions and loops to do that; they will not be efficient. When using numpy, go for vectorized operations whenever possible (in most cases it is possible). numpy's broadcasting rules let you perform mathematical operations between arrays of different dimensions, so you can use vectorization, which is much more efficient than Python loops.
In your case, say that your array arr is:
arr = np.arange(1, 10)
arr.shape = (3, 3)
# arr is:
>>> arr
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
you can achieve the desired result with:
res = (arr.T / arr.sum(axis=0)).T
>>> res
array([[0.08333333, 0.16666667, 0.25      ],
       [0.26666667, 0.33333333, 0.4       ],
       [0.38888889, 0.44444444, 0.5       ]])
numpy sum allows you to sum your array along a given axis when the axis parameter is given. axis=0 sums down each column, which is the sum you want here.
.T gives the transposed matrix. You need to transpose to perform the division on the correct axis and then transpose back.
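If the double transposition feels awkward, the same result can be written with explicit broadcasting by appending a length-1 axis to the sums (equivalent to the expression above):

res = arr / arr.sum(axis=0)[:, None]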

Vectorized form to get count of elements greater than a reference

...and that reference comes from a separate matrix.
This question is an extension of an earlier answered question where the reference element came directly from the same column it was being compared against. Some clever sorting and referencing the index of the sort seemed to solve that one.
Broadcasting has been suggested in both the original and this new question. I run out of memory at around n ~ 3000 and need another order of magnitude larger yet.
The target (production-grade) scaling, so as to keep proposed solutions fair and mutually comparable in both the space and time domains:
let's assume n = 50000; m = 20; k = 50; a = np.random.rand(n, m); ...
I'm now interested in a more general form where the reference value comes from another matrix of reference values.
Original question:
Vectorized pythonic way to get count of elements greater than current element
New question: Can we write a vectorized form to perform the following role.
Function receives as input 2 2-d arrays.
A = n x m
B = k x m
and returns
C = k x m.
C[i,j] is the proportion of observations in A[:,j] ( just the j-th column ) that are larger than B[i,j]
Here is my embarrassingly slow double for loop implementation.
import numpy as np

n = 100
m = 20
k = 50
a = np.random.rand(n, m)
b = np.random.rand(k, m)
c = np.zeros((k, m))
for j in range(0, m):      # cols
    for i in range(0, k):  # rows
        r = b[i, j]
        c[i, j] = (a[:, j] > r).sum() / n
Approach #1
We could again use the argsort trick as discussed in this solution but in a bit twisted manner. We would concatenate the second array into the first array and then perform argsort-ing. We need to use argsort for both the concatenated array and the second one and have our desired output. The implementation would look something like this -
ab = np.vstack((a, b))
len_a, len_b = len(a), len(b)
b_argoffset = b.argsort(0).argsort(0)
total_args = ab.argsort(0).argsort(0)[-len_b:]
out = len_a - total_args + b_argoffset
Explanation
Concatenate the second array, whose counts are to be computed, onto the first array.
Since we are appending, its elements' index positions come after the first array's length.
We use one argsort to get the relative positions of the second array w.r.t. the entire concatenated array, and one more argsort to trace those indices back w.r.t. the original order.
We need to repeat the double argsort-ing for the second array on itself, so as to compensate for the concatenation.
These indices are for each element in b with the comparison : a[:,j] > b[i,j]. Now, these indices orders are 0-based, i.e. an index closer to 0 represent greater number of elements in a[:,j] than the current element b[i,j], so a greater count and vice versa. So, we need to subtract those indices from the length of a[:,j] for the final output.
Approach #1 - Improvement
We would optimize it further by using array-assignment, again inspired by Approach #2 from the same solution. So, those arg outputs : b_argoffset and total_args could be alternatively computed, like so -
def unqargsort(a):
    n, m = a.shape
    idx = a.argsort(0)
    out = np.zeros((n, m), dtype=int)
    out[idx, np.arange(m)] = np.arange(n)[:, None]
    return out

b_argoffset = unqargsort(b)
total_args = unqargsort(ab)[-len_b:]
Approach #2
We could also leverage searchsorted for an altogether different approach -
k, m = b.shape
sidx = a.argsort(0)
out = np.empty((k, m), dtype=int)
for i in range(m):  # cols
    # side='right' keeps the comparison strict: ties do not count as greater
    out[:, i] = np.searchsorted(a[:, i], b[:, i], side='right',
                                sorter=sidx[:, i])
out = len(a) - out
Explanation
We get the sorted order indices for each column of a.
Then we use those indices to find where the values of b would be placed in the sorted a with searchsorted. This gives us the same as the output from steps 3 and 4 in Approach #1.
Note that these approaches give us the count. So, for the final output, divide the output thus obtained by n.
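Putting Approach #2 together with that final division, one possible wrapper (a sketch; the function name is mine):

import numpy as np

def frac_greater(a, b):
    # proportion of a[:, j] strictly greater than b[i, j]
    n = len(a)
    k, m = b.shape
    sidx = a.argsort(0)
    out = np.empty((k, m), dtype=int)
    for i in range(m):
        out[:, i] = np.searchsorted(a[:, i], b[:, i], side='right',
                                    sorter=sidx[:, i])
    return (n - out) / n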
I think you can use broadcasting:
c = (a[:,None,:] > b).mean(axis=0)
Demo:
In [207]: n = 5
In [208]: m = 3
In [209]: a = np.random.randint(10, size=(n,m))
In [210]: b = np.random.randint(10, size=(n,m))
In [211]: c = np.zeros((n,m))
In [212]: a
Out[212]:
array([[2, 2, 8],
       [5, 0, 8],
       [2, 5, 7],
       [4, 4, 4],
       [2, 6, 7]])

In [213]: b
Out[213]:
array([[3, 6, 8],
       [2, 7, 5],
       [8, 9, 2],
       [9, 8, 7],
       [2, 7, 2]])

In [214]: for j in range(0,m): #cols
     ...:     for i in range(0,n): # rows
     ...:         r = b[i,j]
     ...:         c[i,j] = ((a[:,j] > r).sum()) / (n)
     ...:

In [215]: c
Out[215]:
array([[0.4, 0. , 0. ],
       [0.4, 0. , 0.8],
       [0. , 0. , 1. ],
       [0. , 0. , 0.4],
       [0.4, 0. , 1. ]])

In [216]: (a[:,None,:] > b).mean(axis=0)
Out[216]:
array([[0.4, 0. , 0. ],
       [0.4, 0. , 0.8],
       [0. , 0. , 1. ],
       [0. , 0. , 0.4],
       [0.4, 0. , 1. ]])
check:
In [217]: ((a[:,None,:] > b).mean(axis=0) == c).all()
Out[217]: True
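If this one-liner runs out of memory at the target scale (the (n, k, m) boolean temporary is the culprit), a chunked variant over the rows of b bounds the temporary while keeping the inner work vectorized (a sketch; block is a tunable assumption):

def frac_greater_chunked(a, b, block=64):
    out = np.empty(b.shape)
    for s in range(0, len(b), block):
        # the temporary is only (n, block, m) booleans per iteration
        out[s:s + block] = (a[:, None, :] > b[s:s + block]).mean(axis=0)
    return out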

Array of indices of unique values

I start with an array a containing N unique values (product(a.shape) >= N).
I need to find the array b that, at the position of each element of a, holds that element's index (0 .. N-1) in the sorted list of unique values of a.
As an example
import numpy as np

np.random.seed(42)
a = np.random.choice([0.1, 1.3, 7, 9.4], size=(4, 3))
print(a)
prints a as
[[ 7.   9.4  0.1]
 [ 7.   7.   9.4]
 [ 0.1  0.1  7. ]
 [ 1.3  7.   7. ]]
The unique values are [0.1, 1.3, 7.0, 9.4], so the required outcome b would be
[[2 3 0]
 [2 2 3]
 [0 0 2]
 [1 2 2]]
(e.g. the value at a[0,0] is 7.; 7. has the index 2; thus b[0,0] == 2.)
Since numpy does not have an index function,
I could do this using a loop. Either looping over the input array, like this:
u = np.unique(a).tolist()
af = a.flatten()
b = np.empty(len(af), dtype=int)
for i in range(len(af)):
    b[i] = u.index(af[i])
b = b.reshape(a.shape)
print(b)
or looping over the unique values as follows:
u = np.unique(a)
b = np.empty(a.shape, dtype=int)
for i in range(len(u)):
    b[np.where(a == u[i])] = i
print(b)
I suppose that the second way, looping over the unique values, is already more efficient than the first in cases where not all values in a are distinct; but it still involves a loop and is rather inefficient compared to numpy's built-in operations.
So my question is: what is the most efficient way of obtaining the array b filled with the indices of the unique values of a?
You could use np.unique with its optional argument return_inverse -
np.unique(a, return_inverse=1)[1].reshape(a.shape)
Sample run -
In [308]: a
Out[308]:
array([[ 7. ,  9.4,  0.1],
       [ 7. ,  7. ,  9.4],
       [ 0.1,  0.1,  7. ],
       [ 1.3,  7. ,  7. ]])

In [309]: np.unique(a, return_inverse=1)[1].reshape(a.shape)
Out[309]:
array([[2, 3, 0],
       [2, 2, 3],
       [0, 0, 2],
       [1, 2, 2]])
Going through the source code of np.unique, which looks pretty efficient to me, and pruning out the unnecessary parts, we end up with another solution, like so -
def unique_return_inverse(a):
    ar = a.flatten()
    perm = ar.argsort()
    aux = ar[perm]
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    iflag = np.cumsum(flag) - 1
    inv_idx = np.empty(ar.shape, dtype=np.intp)
    inv_idx[perm] = iflag
    return inv_idx
Timings -
In [444]: a= np.random.randint(0,1000,(1000,400))
In [445]: np.allclose( np.unique(a, return_inverse=1)[1],unique_return_inverse(a))
Out[445]: True
In [446]: %timeit np.unique(a, return_inverse=1)[1]
10 loops, best of 3: 30.4 ms per loop
In [447]: %timeit unique_return_inverse(a)
10 loops, best of 3: 29.5 ms per loop
Not a great deal of improvement there over the built-in.
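Another equivalent formulation, since every element of a is guaranteed to appear in the sorted unique values: binary-search each element's position (a sketch):

u = np.unique(a)
b = np.searchsorted(u, a)  # valid because every value of a is present in u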
