I am trying to calculate marketsheds using the find_costs method of skimage.graph.MCP_Geometric. It has been working wonderfully to calculate least-cost routes, but rather than finding the travel cost to the nearest source, I want to calculate the index of the nearest source.
Sample Code
import numpy as np
import skimage.graph as graph
import copy
img = np.array([[1,1,2,2],[2,1,1,3],[3,2,1,2],[2,2,2,1]])
mcp = graph.MCP_Geometric(img)
destinations = [[0,0],[3,3]]
costs, traceback = mcp.find_costs(destinations)
print(costs)
[[0. 1. 2.5 4.5 ]
[1.5 1.41421356 2.41421356 4. ]
[4. 2.91421356 1.41421356 1.5 ]
[5.5 3.5 1.5 0. ]]
This works as expected, and creates a nice travel cost raster. However, I want (for each cell) to know which of the destinations is the closest. The best solution I have found is to run each of the destinations separately, then combine them through min calculations. It works, but is slow, and has not been working at scale.
all_c = []
for dest in destinations:
    costs, traceback = mcp.find_costs([dest])
    all_c.append(copy.deepcopy(costs))
res = np.dstack(all_c)
res_min = np.amin(res, axis=2)
output = np.zeros([res_min.shape[0], res_min.shape[1]])
for idx in range(0, res.shape[2]):
    cur_data = res[:,:,idx]
    cur_val = (cur_data == res_min).astype(np.byte) * idx
    output = output + cur_val
output = output.astype(np.byte)
print(output)
array([[0, 0, 0, 0],
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]], dtype=int8)
I have been looking into overloading the functions of MCP_Geometric and MCP_Flexible, but I cannot find anything providing information on the index of the destination.
Hope that provides enough information to replicate and understand what I want to do, thanks!
Ok, this is a bit of a ride, but it was fun to figure out. I'm unclear just how fast it'll be but I think it should be pretty fast in the case of many destinations and comfortably-in-RAM images.
The key is the traceback return value, which kinda-sorta tells you the neighbor index to get to the nearest destination. So with a bit of pathfinding you should be able to find that destination. Can that be fast? It turns out it can, with a bit of NumPy index wrangling, scipy.sparse matrices, and connected_components from scipy.sparse.csgraph!
Let's start with your same costs array and both destinations:
import numpy as np
image = np.array(
[[1, 1, 2, 2],
[2, 1, 1, 3],
[3, 2, 1, 2],
[2, 2, 2, 1]]
)
destinations = [[0, 0], [3, 3]]
We then make the graph, and get the costs and the traceback:
from skimage import graph
mcp = graph.MCP_Geometric(image)
costs, traceback = mcp.find_costs(destinations)
print(traceback)
gives:
[[-1 4 4 4]
[ 6 7 7 1]
[ 6 6 0 1]
[ 3 3 3 -1]]
Now, I had to look up the documentation for what traceback is:
Same shape as the costs array; this array contains the offset to
any given index from its predecessor index. The offset indices
index into the offsets attribute, which is a array of n-d
offsets. In the 2-d case, if offsets[traceback[x, y]] is (-1, -1),
that means that the predecessor of [x, y] in the minimum cost path
to some start position is [x+1, y+1]. Note that if the
offset_index is -1, then the given index was not considered.
For some reason, my mcp object didn't have an offsets attribute — possibly a Cython inheritance bug? Dunno, will investigate later — but searching the source code shows me that offsets is defined with the skimage.graph._mcp.make_offsets function. So I did a bad thing and imported from that private module, so I could claim what was rightfully mine — the offsets list, which translates from numbers in traceback to offsets in the image coordinates:
from skimage.graph import _mcp
offsets = _mcp.make_offsets(2, True)
print(offsets)
which gives:
[array([-1, -1]),
array([-1, 0]),
array([-1, 1]),
array([ 0, -1]),
array([0, 1]),
array([ 1, -1]),
array([1, 0]),
array([1, 1])]
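To make the traceback convention concrete, here is a small sketch that walks from one pixel to its destination by repeatedly subtracting the offset, as the docstring describes. The traceback and offsets values are copied from the printouts above; the helper name walk_to_destination is made up for this illustration and is not part of skimage.

```python
import numpy as np

# Values copied from the printouts above.
traceback = np.array([[-1, 4, 4, 4],
                      [6, 7, 7, 1],
                      [6, 6, 0, 1],
                      [3, 3, 3, -1]])
offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]


def walk_to_destination(traceback, offsets, start):
    """Follow predecessors from `start` until a destination (-1) is reached."""
    r, c = start
    path = [(r, c)]
    while traceback[r, c] != -1:
        dr, dc = offsets[traceback[r, c]]
        # The predecessor of [r, c] is [r - dr, c - dc].
        r, c = r - dr, c - dc
        path.append((r, c))
    return path


path_a = walk_to_destination(traceback, offsets, (1, 3))
path_b = walk_to_destination(traceback, offsets, (3, 0))
print(path_a)  # [(1, 3), (2, 3), (3, 3)]
print(path_b)  # [(3, 0), (3, 1), (3, 2), (3, 3)]
```

Doing this pixel by pixel in Python would be slow on a large image, which is why the rest of the answer vectorizes the same idea.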
Now, there's one last thing to do with the offsets: you'll note that destinations are marked in the traceback with "-1", which doesn't correspond to the last element of the offsets array. So we append np.array([0, 0]), and then every value in traceback corresponds to a real offset. In the case of destinations, you get a self-edge, but that's fine.
offsets.append(np.array([0, 0]))
offsets_arr = np.array(offsets) # shape (9, 2)
Now, we can build a graph from offsets, pixel coordinates, and pixel ids. First, we use np.indices to get an index for every pixel in the image:
indices = np.indices(traceback.shape)
print(indices.shape)
gives:
(2, 4, 4)
To get an array that has, for each pixel, the offset to its neighbor, we use fancy array indexing:
offset_to_neighbor = offsets_arr[traceback]
print(offset_to_neighbor.shape)
which gives:
(4, 4, 2)
The axes are different between the traceback and the numpy indices, but nothing a little transposition won't fix:
neighbor_index = indices - offset_to_neighbor.transpose((2, 0, 1))
Finally, we want to deal with integer pixel ids in order to create a graph of all the pixels, rather than coordinates. For this, we use np.ravel_multi_index.
ids = np.arange(traceback.size).reshape(image.shape)
neighbor_ids = np.ravel_multi_index(
tuple(neighbor_index), traceback.shape
)
This gives me a unique ID for each pixel, and then a unique "step towards the destination" for each pixel:
print(ids)
print(neighbor_ids)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 0 0 1 2]
[ 0 0 1 11]
[ 4 5 15 15]
[13 14 15 15]]
Then we can turn this into a graph using SciPy sparse matrices. We don't care about weights for this graph so we just use the value 1 for the edges.
from scipy import sparse
g = sparse.coo_matrix(
    (
        np.ones(traceback.size),
        (ids.ravel(), neighbor_ids.ravel()),
    ),
    shape=(ids.size, ids.size),  # shape is a keyword argument, not part of the data tuple
).tocsr()
(This uses the (value, (row, column)) or (data, (i, j)) input format for sparse COOrdinate matrices.)
Finally, we use connected components to get the graphs — the groups of pixels that are nearest to each destination. The function returns the number of components and the mapping of "pixel id" to component:
n, components = sparse.csgraph.connected_components(g)
basins = components.reshape(image.shape)
print(basins)
[[0 0 0 0]
[0 0 0 1]
[0 0 1 1]
[1 1 1 1]]
(Note that this result is slightly different from yours because the cost is identical to destination 0 and 1 for the pixels in question, so it's arbitrary which to label.)
print(costs)
[[0. 1. 2.5 4.5 ]
[1.5 1.41421356 2.41421356 4. ]
[4. 2.91421356 1.41421356 1.5 ]
[5.5 3.5 1.5 0. ]]
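Note that connected_components returns arbitrary component labels, not positions in the destinations list. If you need the actual destination index, one possible alternative (not what the answer above does) is pointer jumping in plain NumPy: keep replacing each pixel's pointer with its pointer's pointer until nothing changes, then look up which destination each root is. This sketch hardcodes the neighbor_ids array printed above; the function name nearest_destination_index is hypothetical.

```python
import numpy as np

# neighbor_ids as printed above: for each pixel id, the id of the next
# pixel on the way to its destination (destinations point to themselves).
neighbor_ids = np.array([[0, 0, 1, 2],
                         [0, 0, 1, 11],
                         [4, 5, 15, 15],
                         [13, 14, 15, 15]])
destinations = [[0, 0], [3, 3]]


def nearest_destination_index(neighbor_ids, destinations):
    shape = neighbor_ids.shape
    root = neighbor_ids.ravel().copy()
    # Pointer jumping: each pass doubles the distance covered by a pointer.
    while True:
        jumped = root[root]
        if np.array_equal(jumped, root):
            break
        root = jumped
    # Flat pixel id of each destination, in the order given.
    dest_ids = np.ravel_multi_index(tuple(np.array(destinations).T), shape)
    # Map a root pixel id to its position in `destinations` (-1 if unreached).
    lookup = np.full(root.size, -1)
    lookup[dest_ids] = np.arange(len(dest_ids))
    return lookup[root].reshape(shape)


labels = nearest_destination_index(neighbor_ids, destinations)
print(labels)
```

For this example the labels coincide with the basins array above, but here label k is guaranteed to mean destinations[k].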
Hope this helps!
I am trying to write a function that takes a matrix A, then offsets it by one, and does element wise matrix multiplication on the shared area. Perhaps an example will help. Suppose I have the matrix:
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
What I'd like returned is:
(1*2) + (4*5) + (7*8) = 78
The following code does it, but inefficiently:
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
Height = A.shape[0]
Width = A.shape[1]
Sum1 = 0
for y in range(0, Height):
    for x in range(0,Width-2):
        Sum1 = Sum1 + \
            A.item(y,x)*A.item(y,x+1)
        print("%d * %d"%( A.item(y,x),A.item(y,x+1)))
print(Sum1)
With output:
1 * 2
4 * 5
7 * 8
78
Here is my attempt to write the code more efficiently with numpy:
import numpy as np
A = np.array([[1,2,3],[4,5,6],[7,8,9]])
print(np.sum(np.multiply(A[:,0:-1], A[:,1:])))
Unfortunately, this time I get 186. I am at a loss as to where I went wrong. I'd love someone to either correct me or offer another way to implement this.
Thank you.
In this 3 column case, you are just multiplying the 1st 2 columns, and taking the sum:
A[:,:2].prod(1).sum()
Out[36]: 78
Same as (A[:,0]*A[:,1]).sum()
Now just how does that generalize to more columns?
In your original loop, you can cut out the row iteration by taking the sum of this list:
[A[:,x]*A[:,x+1] for x in range(0,A.shape[1]-2)]
Out[40]: [array([ 2, 20, 56])]
Your description talks about multiplying the shared area; what direction are you doing the offset? From the calculation it looks like the offset is negative.
A[:,:-1]
Out[47]:
array([[1, 2],
[4, 5],
[7, 8]])
If that is the offset logic, then I could rewrite my calculation as
A[:,:-1].prod(1).sum()
which should work for many more columns.
===================
Your 2nd try:
In [3]: [A[:,:-1],A[:,1:]]
Out[3]:
[array([[1, 2],
[4, 5],
[7, 8]]),
array([[2, 3],
[5, 6],
[8, 9]])]
In [6]: A[:,:-1]*A[:,1:]
Out[6]:
array([[ 2, 6],
[20, 30],
[56, 72]])
In [7]: _.sum()
Out[7]: 186
In other words, instead of 1*2, you are calculating [1,2]*[2,3]=[2,6]. Nothing wrong with that, if that's what you really intend. The key is being clear about 'offset' and 'overlap'.
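To see where the two numbers come from, here is a short sketch comparing the loop and the vectorized forms side by side. The variable names are mine. The point is that range(0, width - 2) in the original loop only visits x = 0 for a 3-column matrix, whereas the slicing version multiplies every adjacent pair of columns.

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
height, width = A.shape

# Original loop bounds: range(0, width - 2) yields only x = 0 here,
# so only column 0 * column 1 contributes.
loop_first_pair = sum(
    int(A[y, x] * A[y, x + 1])
    for y in range(height)
    for x in range(0, width - 2)
)

# Visiting every adjacent pair of columns needs range(width - 1).
loop_all_pairs = sum(
    int(A[y, x] * A[y, x + 1])
    for y in range(height)
    for x in range(width - 1)
)

# Vectorized equivalents of the two loops.
vec_first_pair = int((A[:, 0] * A[:, 1]).sum())
vec_all_pairs = int((A[:, :-1] * A[:, 1:]).sum())

print(loop_first_pair, vec_first_pair)  # 78 78
print(loop_all_pairs, vec_all_pairs)    # 186 186
```

So the NumPy attempt was a correct vectorization of "all adjacent pairs"; it was the loop that covered less than intended (or the other way round, depending on which result you actually want).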
I have three numpy arrays:
X: a 3073 x 49000 matrix
W: a 10 x 3073 matrix
y: a 49000 x 1 vector
y contains values between 0 and 9, each value represents a row in W.
I would like to add the first column of X to the row in W given by the first element in y. I.e. if the first element in y is 3, add the first column of X to the fourth row of W. And then add the second column of X to the row in W given by the second element in y and so on, until all columns of X have been added to the rows in W specified by y, which means a total of 49000 added rows.
W[y] += X.T does not work for me, because this will not add more than one vector to a row in W.
Please note: I'm only looking for vectorized solutions. I.e. no for-loops.
EDIT: To clarify I'll add an example with small matrix sizes adapted from Salvador Dali's example below.
In [1]: import numpy as np
In [2]: a, b, c = 3, 4, 5
In [3]: np.random.seed(0)
In [4]: X = np.random.randint(10, size=(b,c))
In [5]: W = np.random.randint(10, size=(a,b))
In [6]: y = np.random.randint(a, size=(c,1))
In [7]: X
Out[7]:
array([[5, 0, 3, 3, 7],
[9, 3, 5, 2, 4],
[7, 6, 8, 8, 1],
[6, 7, 7, 8, 1]])
In [8]: W
Out[8]:
array([[5, 9, 8, 9],
[4, 3, 0, 3],
[5, 0, 2, 3]])
In [9]: y
Out[9]:
array([[0],
[1],
[1],
[2],
[0]])
In [10]: W[y.ravel()] + X.T
Out[10]:
array([[10, 18, 15, 15],
[ 4, 6, 6, 10],
[ 7, 8, 8, 10],
[ 8, 2, 10, 11],
[12, 13, 9, 10]])
In [11]: W[y.ravel()] = W[y.ravel()] + X.T
In [12]: W
Out[12]:
array([[12, 13, 9, 10],
[ 7, 8, 8, 10],
[ 8, 2, 10, 11]])
The problem is to get BOTH column 0 and column 4 in X added to row 0 in W, as well as both column 1 and 2 in X added to row 1 in W.
The desired outcome is thus:
W = [[17, 22, 16, 16],
[ 7, 11, 14, 17],
[ 8, 2, 10, 11]]
First the straight forward loop solution as reference:
In [65]: for i,j in enumerate(y):
   ....:     W[j]+=X[:,i]
   ....:
In [66]: W
Out[66]:
array([[17, 22, 16, 16],
[ 7, 11, 14, 17],
[ 8, 2, 10, 11]])
An add.at solution (W1 is a copy of the original W, saved before the loop above modified it):
In [67]: W=W1.copy()
In [68]: np.add.at(W,(y.ravel()),X.T)
In [69]: W
Out[69]:
array([[17, 22, 16, 16],
[ 7, 11, 14, 17],
[ 8, 2, 10, 11]])
add.at does an unbuffered calculation, getting around the buffering that prevents W[y.ravel()] += X.T from working. It is still iterative, but the loop has been moved to compiled code. It isn't true vectorization because the order of application matters. The addition for one row of X.T depends on the results from the previous rows.
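Here is a self-contained sketch of that difference, using the small X, W and y printed in the question. The names W_buffered, W_at and W_loop are mine; the arrays are typed in by hand rather than regenerated from the random seed.

```python
import numpy as np

# Small example arrays copied from the question.
X = np.array([[5, 0, 3, 3, 7],
              [9, 3, 5, 2, 4],
              [7, 6, 8, 8, 1],
              [6, 7, 7, 8, 1]])
W0 = np.array([[5, 9, 8, 9],
               [4, 3, 0, 3],
               [5, 0, 2, 3]])
y = np.array([[0], [1], [1], [2], [0]])

# Buffered in-place add: repeated row indices are not accumulated.
W_buffered = W0.copy()
W_buffered[y.ravel()] += X.T

# Unbuffered add: every occurrence of a repeated index is applied.
W_at = W0.copy()
np.add.at(W_at, y.ravel(), X.T)

# Reference loop for comparison.
W_loop = W0.copy()
for col, row in enumerate(y.ravel()):
    W_loop[row] += X[:, col]

print(W_at)
print(np.array_equal(W_at, W_loop))        # True
print(np.array_equal(W_buffered, W_loop))  # False
```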
https://stackoverflow.com/a/20811014/901925 is the answer I gave a couple of years ago to a similar question (for 1d arrays).
But when dealing with your large arrays:
X: a 3073 x 49000 matrix
W: a 10 x 3073 matrix
y: a 49000 x 1 vector
this can run into speed issues. Note that W[y.ravel()] is the same size as X.T (why did you pick these sizes that require transpose?). And it's a copy, not a view. So there's already a time penalty.
bincount has been suggested in previous questions, and I think it is faster. See "Making for loop with index arrays faster", which has both bincount and add.at solutions.
Iterating over the small 3073 dimension could also have speed advantages. Or better yet on the size 10 dimension as Divakar demonstrates.
For the small test case, a,b,c=3,4,5, the add.at solution is fastest, with Divakar's bincount and einsum next. For a larger a,b,c=10,1000,20000, add.at gets very slow, with bincount being the fastest.
Related SO answers
https://stackoverflow.com/a/28205888/901925 (notes that bincount requires complete coverage for y).
https://stackoverflow.com/a/30041823/901925 (where Divakar again shows that bincount rules!)
Vectorized approaches
Approach #1
Based on this answer, here's a vectorized solution using np.bincount -
N = y.max()+1
id = y.ravel() + np.arange(X.shape[0])[:,None]*N
W[:N] += np.bincount(id.ravel(), weights=X.ravel()).reshape(-1,N).T
Approach #2
You can make good usage of boolean indexing and np.einsum to get the job done in a concise vectorized manner -
N = y.max()+1
W[:N] += np.einsum('ijk,lk->il',(np.arange(N)[:,None,None] == y.ravel()),X)
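One way to read Approach #2: the boolean mask is a one-hot encoding of y, and summing the selected columns is a matrix product. The sketch below spells that out with a plain matrix multiplication instead of einsum, on the small example from the question. The names one_hot and W_new are mine, and using W.shape[0] for N (rather than y.max()+1) is my assumption so that rows never selected by y are simply left unchanged.

```python
import numpy as np

# Small example arrays copied from the question.
X = np.array([[5, 0, 3, 3, 7],
              [9, 3, 5, 2, 4],
              [7, 6, 8, 8, 1],
              [6, 7, 7, 8, 1]])
W = np.array([[5, 9, 8, 9],
              [4, 3, 0, 3],
              [5, 0, 2, 3]])
y = np.array([[0], [1], [1], [2], [0]])

N = W.shape[0]
# one_hot[k, j] is 1 when column j of X belongs to row k of W; shape (N, c).
one_hot = (np.arange(N)[:, None] == y.ravel()).astype(X.dtype)

# (N, c) @ (c, b) -> (N, b): each row is the sum of the selected columns of X.
W_new = W + one_hot @ X.T
print(W_new)
```

For the full-size problem the one-hot matrix is only 10 x 49000, so the whole update is a single BLAS-friendly product when the arrays are floating point.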
Loopy approaches
Approach #3
Since you are selecting and adding up a huge number of columns from X per unique y, it might be better in terms of performance to run a loop with complexity equal to the number of such unique y's, which seems to be at max equal to the number of rows in W and that in your case is just 10. Thus, the loop has just 10 iterations, not bad! Here's the implementation to fulfill those aspirations -
for k in range(W.shape[0]):
W[k] += X[:,(y==k).ravel()].sum(1)
Approach #4
You can bring in np.einsum to do the columnwise summations and have the final output like so -
for k in range(W.shape[0]):
W[k] += np.einsum('ij->i',X[:,(y==k).ravel()])
This will achieve what you want: X + W[y.ravel()].T
To see that this really does the work, here is a reproducible example:
import numpy as np
np.random.seed(0)
a, b, c = 3, 5, 4 # you can use your 3073, 49000, 10 later
X = np.random.rand(a, b)
W = np.random.rand(c, a)
y = np.random.randint(c, size=(b, 1))
Now your matrices are:
[[ 0.0871293 0.0202184 0.83261985]
[ 0.77815675 0.87001215 0.97861834]
[ 0.79915856 0.46147936 0.78052918]
[ 0.11827443 0.63992102 0.14335329]]
[[3]
[0]
[3]
[2]
[0]]
[[ 0.5488135 0.71518937 0.60276338 0.54488318 0.4236548 ]
[ 0.64589411 0.43758721 0.891773 0.96366276 0.38344152]
[ 0.79172504 0.52889492 0.56804456 0.92559664 0.07103606]]
And W[y.ravel()] gives you " W given by the first element in y". By transposing it, you will get a matrix ready to be added to X:
[[ 0.11827443 0.0871293 0.11827443 0.79915856 0.0871293 ]
[ 0.63992102 0.0202184 0.63992102 0.46147936 0.0202184 ]
[ 0.14335329 0.83261985 0.14335329 0.78052918 0.83261985]]
While I can't say that this is very pythonic, it is a solution (I think):
for column in range(X.shape[1]):
    W[y[column]] += X[:, column]  # add (not assign) column `column` of X to the row chosen by y
I have a matrix X of dimensions (30x8100) and another one Y of dimensions (1x8100). I want to generate an array containing the difference between them (X[1]-Y, X[2]-Y,..., X[30]-Y)
Can anyone help?
All you need for that is
X - Y
Since several people have offered answers that seem to try to make the shapes match manually, I should explain:
Numpy will automatically expand Y's shape so that it matches with that of X. This is called broadcasting, and it usually does a very good job of guessing what should be done. In ambiguous cases, an axis keyword can be applied to tell it which direction to do things. Here, since Y has a dimension of length 1, that is the axis that is expanded to be length 30 to match with X's shape.
For example,
In [87]: import numpy as np
In [88]: n, m = 3, 5
In [89]: x = np.arange(n*m).reshape(n,m)
In [90]: y = np.arange(m)[None,...]
In [91]: x.shape
Out[91]: (3, 5)
In [92]: y.shape
Out[92]: (1, 5)
In [93]: (x-y).shape
Out[93]: (3, 5)
In [106]: x
Out[106]:
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14]])
In [107]: y
Out[107]: array([[0, 1, 2, 3, 4]])
In [108]: x-y
Out[108]:
array([[ 0, 0, 0, 0, 0],
[ 5, 5, 5, 5, 5],
[10, 10, 10, 10, 10]])
But this is not really a euclidean distance, as your title seems to suggest you want:
df = np.asarray(x - y) # the difference between the images
dst = np.sqrt(np.sum(df**2, axis=1)) # their euclidean distances
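As a sanity check at the sizes from the question, here is a sketch with random data standing in for the real X and Y (the random generator and seed are placeholders, not anything from the question). It computes the broadcast difference and the per-row Euclidean distance, and compares the latter against np.linalg.norm.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((30, 8100))  # stand-in for the 30 x 8100 matrix
Y = rng.random((1, 8100))   # stand-in for the 1 x 8100 matrix

# Broadcasting stretches Y's length-1 axis to 30 rows.
diff = X - Y

# One Euclidean distance per row of X.
dist = np.sqrt(np.sum(diff ** 2, axis=1))

print(diff.shape)  # (30, 8100)
print(dist.shape)  # (30,)
print(np.allclose(dist, np.linalg.norm(X - Y, axis=1)))  # True
```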
Use arrays and NumPy broadcasting to subtract Y from every row.
Initialize the matrix:
>>> from numpy import *
>>> a = array([[1,2,3],[4,5,6]])
Accessing the second row in a:
>>> a[1]
array([4, 5, 6])
Subtract Y from the array:
>>> Y = array([3,9,0])
>>> a - Y
array([[-2, -7, 3],
[ 1, -4, 6]])
Just iterate rows from your numpy array and you can actually just subtract them and numpy will make a new array with the differences!
import numpy as np
final_array = []
#X is a numpy array that is 30X8100 and Y is a numpy array that is 1X8100
for row in X:
    output = row - Y
    final_array.append(output)
On each pass, output is X[i] - Y, so final_array ends up holding 30 arrays, each with the X - Y values that you need. Simple as that. Just make sure you convert your matrices to numpy arrays first.
Edit: Since numpy broadcasting will do the iteration, all you need is one line once you have your two arrays:
final_array = X - Y
And then that is your array with the differences!
import numpy
a1 = numpy.array(X) # make sure you have a 2-d numpy array like [[1,2,3],[4,5,6],...]
a2 = numpy.array(Y).ravel() # make sure you have a 1-d numpy array like [1,2,3,...], one value per column of a1
a2 = numpy.array([a2] * len(a1)) # repeat a2 once per row of a1 (a2 is now the same shape as a1)
print(a1 - a2) # element-wise difference between a1 and a2 (or X and Y)
In Python, given an n x p matrix, e.g. 4 x 4, how can I return a matrix that's 4 x 2 that simply averages the first two columns and the last two columns for all 4 rows of the matrix?
e.g. given:
a = array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
return a matrix that has the average of a[:, 0] and a[:, 1] and the average of a[:, 2] and a[:, 3].
I want this to work for an arbitrary n x p matrix, assuming that p is evenly divisible by the number of columns I am averaging over.
Let me clarify: for each row, I want to take the average of the first two columns, then the average of the last two columns. So it would be:
(1 + 2) / 2, (3 + 4) / 2 <- row 1 of new matrix
(5 + 6) / 2, (7 + 8) / 2 <- row 2 of new matrix, etc.
which should yield a 4 by 2 matrix rather than 4 x 4.
thanks.
How about using some math? You can define a matrix M = [[0.5,0],[0.5,0],[0,0.5],[0,0.5]] so that A*M is what you want.
from numpy import array, matrix
A = array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12],
[13, 14, 15, 16]])
M = matrix([[0.5,0],
[0.5,0],
[0,0.5],
[0,0.5]])
print(A*M)
Generating M is pretty simple too, entries are 1/n or zero.
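For instance, M could be generated with a Kronecker product, as in the sketch below. This is just one possible construction; the variable name group (the number of columns averaged together) is mine, and it uses plain arrays with the @ operator rather than numpy.matrix.

```python
import numpy as np

A = np.arange(1, 17).reshape(4, 4)
group = 2          # number of adjacent columns averaged together
p = A.shape[1]     # assumed to be divisible by `group`

# Each block is a column of `group` entries equal to 1/group,
# placed along the diagonal of a (p // group)-sized identity.
M = np.kron(np.eye(p // group), np.full((group, 1), 1.0 / group))

result = A @ M
print(M)
print(result)
```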
reshape - get mean - reshape
>>> a.reshape(-1, a.shape[1]//2).mean(1).reshape(a.shape[0],-1)
array([[ 1.5, 3.5],
[ 5.5, 7.5],
[ 9.5, 11.5],
[ 13.5, 15.5]])
is supposed to work for any array size, and reshape doesn't make a copy.
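A variant of the same reshape idea that takes the group width explicitly is sketched below. The function name group_column_means is made up here; it assumes a C-contiguous 2-d array whose column count is divisible by the group width.

```python
import numpy as np


def group_column_means(a, group):
    """Average every `group` adjacent columns of a 2-d array."""
    n, p = a.shape
    if p % group:
        raise ValueError("number of columns must be divisible by group")
    # Split the column axis into (number of groups, group width)
    # and average over the width.
    return a.reshape(n, p // group, group).mean(axis=2)


a = np.arange(1, 17).reshape(4, 4)
means_2 = group_column_means(a, 2)
print(means_2)

b = np.arange(12).reshape(2, 6)
means_3 = group_column_means(b, 3)
print(means_3)
```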
It's a bit unclear what should happen for matrices with n > 4, but this code will do what you want:
import numpy as N
a = N.array([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]], dtype=float)
avg = N.vstack((N.average(a[:,0:2], axis=1), N.average(a[:,2:4], axis=1))).T
This yields avg =
array([[ 1.5, 3.5],
[ 5.5, 7.5],
[ 9.5, 11.5],
[ 13.5, 15.5]])
Here's a way to do it. You only need to change groupsize (which np.hsplit treats as the number of equal-width column groups) to make it work with other sizes like you said, though I'm not fully sure what you want.
groupsize = 2
out = np.hstack([np.mean(x, axis=1, keepdims=True) for x in np.hsplit(a, groupsize)])
yields
array([[ 1.5, 3.5],
[ 5.5, 7.5],
[ 9.5, 11.5],
[ 13.5, 15.5]])
for out. Hopefully it gives you some ideas on how to do exactly what it is that you want to do. You can make groupsize dependent on the dimensions of a for instance.