I am trying to calculate marketsheds using the find_costs method of skimage.graph.MCP_Geometric. It has been working wonderfully to calculate least-cost routes, but rather than finding the travel cost to the nearest source, I want to find the index of the nearest source.
Sample Code
import numpy as np
import skimage.graph as graph
import copy
img = np.array([[1,1,2,2],[2,1,1,3],[3,2,1,2],[2,2,2,1]])
mcp = graph.MCP_Geometric(img)
destinations = [[0,0],[3,3]]
costs, traceback = mcp.find_costs(destinations)
print(costs)
[[0. 1. 2.5 4.5 ]
[1.5 1.41421356 2.41421356 4. ]
[4. 2.91421356 1.41421356 1.5 ]
[5.5 3.5 1.5 0. ]]
This works as expected, and creates a nice travel cost raster. However, I want (for each cell) to know which of the destinations is the closest. The best solution I have found is to run each of the destinations separately, then combine them through min calculations. It works, but is slow, and has not been working at scale.
all_c = []
for dest in destinations:
    costs, traceback = mcp.find_costs([dest])
    all_c.append(copy.deepcopy(costs))
res = np.dstack(all_c)
res_min = np.amin(res, axis=2)
output = np.zeros([res_min.shape[0], res_min.shape[1]])
for idx in range(0, res.shape[2]):
    cur_data = res[:,:,idx]
    cur_val = (cur_data == res_min).astype(np.byte) * idx
    output = output + cur_val
output = output.astype(np.byte)
print(output)
array([[0, 0, 0, 0],
[0, 0, 1, 1],
[0, 1, 1, 1],
[1, 1, 1, 1]], dtype=int8)
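The combining step can probably be shortened with np.argmin over the stacked axis. The sketch below uses two small hypothetical cost rasters (c0 and c1, made up for illustration) in place of the per-destination find_costs outputs. Note that np.argmin resolves ties to the lowest index, whereas the loop above adds up the indices of all tied destinations, which only yields a valid index when there are two destinations.

```python
import numpy as np

# Hypothetical per-destination cost rasters (stand-ins for find_costs output).
c0 = np.array([[0.0, 1.0],
               [1.0, 2.0]])
c1 = np.array([[2.0, 1.0],
               [1.0, 0.0]])

# Stack along a new last axis, then take the index of the cheapest layer.
stack = np.stack([c0, c1], axis=-1)
nearest = np.argmin(stack, axis=-1)
print(nearest)
```

This still needs one find_costs call per destination, so it only simplifies the bookkeeping, not the overall cost.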
I have been looking into overloading the functions of MCP_Geometric and MCP_Flexible, but I cannot find anything providing information on the index of the destination.
Hope that provides enough information to replicate and understand what I want to do, thanks!
Ok, this is a bit of a ride, but it was fun to figure out. I'm unclear just how fast it'll be but I think it should be pretty fast in the case of many destinations and comfortably-in-RAM images.
The key is the traceback return value, which kinda-sorta tells you the neighbor index to get to the nearest destination. So with a bit of pathfinding you should be able to find that destination. Can that be fast? It turns out it can, with a bit of NumPy index wrangling, scipy.sparse matrices, and connected_components from scipy.sparse.csgraph!
Let's start with your same costs array and both destinations:
import numpy as np
image = np.array(
[[1, 1, 2, 2],
[2, 1, 1, 3],
[3, 2, 1, 2],
[2, 2, 2, 1]]
)
destinations = [[0, 0], [3, 3]]
We then make the graph, and get the costs and the traceback:
from skimage import graph
mcp = graph.MCP_Geometric(image)
costs, traceback = mcp.find_costs(destinations)
print(traceback)
gives:
[[-1 4 4 4]
[ 6 7 7 1]
[ 6 6 0 1]
[ 3 3 3 -1]]
Now, I had to look up the documentation for what traceback is:
Same shape as the costs array; this array contains the offset to
any given index from its predecessor index. The offset indices
index into the offsets attribute, which is a array of n-d
offsets. In the 2-d case, if offsets[traceback[x, y]] is (-1, -1),
that means that the predecessor of [x, y] in the minimum cost path
to some start position is [x+1, y+1]. Note that if the
offset_index is -1, then the given index was not considered.
For some reason, my mcp object didn't have an offsets attribute — possibly a Cython inheritance bug? Dunno, will investigate later — but searching the source code shows me that offsets is defined with the skimage.graph._mcp.make_offsets function. So I did a bad thing and imported from that private module, so I could claim what was rightfully mine — the offsets list, which translates from numbers in traceback to offsets in the image coordinates:
from skimage.graph import _mcp
offsets = _mcp.make_offsets(2, True)
print(offsets)
which gives:
[array([-1, -1]),
array([-1, 0]),
array([-1, 1]),
array([ 0, -1]),
array([0, 1]),
array([ 1, -1]),
array([1, 0]),
array([1, 1])]
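If importing from a private module is a concern, the same list can be rebuilt by hand for the 2-D, fully connected case. This is a sketch that reproduces the list printed above; whether make_offsets produces this order in other situations is an assumption that should be checked against the printed output.

```python
import itertools
import numpy as np

# All 2-D steps with components in {-1, 0, 1}, excluding the zero step.
offsets = [
    np.array(step)
    for step in itertools.product((-1, 0, 1), repeat=2)
    if any(step)
]
print(offsets)
```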
Now, there's one last thing to do with the offsets: you'll note that destinations are marked in the traceback with "-1", which doesn't correspond to the last element of the offsets array. So we append np.array([0, 0]), and then every value in traceback corresponds to a real offset. In the case of destinations, you get a self-edge, but that's fine.
offsets.append(np.array([0, 0]))
offsets_arr = np.array(offsets) # shape (9, 2)
Now, we can build a graph from offsets, pixel coordinates, and pixel ids. First, we use np.indices to get an index for every pixel in the image:
indices = np.indices(traceback.shape)
print(indices.shape)
gives:
(2, 4, 4)
To get an array that has, for each pixel, the offset to its neighbor, we use fancy array indexing:
offset_to_neighbor = offsets_arr[traceback]
print(offset_to_neighbor.shape)
which gives:
(4, 4, 2)
The axes are different between the traceback and the numpy indices, but nothing a little transposition won't fix:
neighbor_index = indices - offset_to_neighbor.transpose((2, 0, 1))
Finally, we want to deal with integer pixel ids in order to create a graph of all the pixels, rather than coordinates. For this, we use np.ravel_multi_index.
ids = np.arange(traceback.size).reshape(image.shape)
neighbor_ids = np.ravel_multi_index(
tuple(neighbor_index), traceback.shape
)
This gives me a unique ID for each pixel, and then a unique "step towards the destination" for each pixel:
print(ids)
print(neighbor_ids)
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 0 0 1 2]
[ 0 0 1 11]
[ 4 5 15 15]
[13 14 15 15]]
Then we can turn this into a graph using SciPy sparse matrices. We don't care about weights for this graph so we just use the value 1 for the edges.
from scipy import sparse
g = sparse.coo_matrix((
    np.ones(traceback.size),
    (ids.flat, neighbor_ids.flat),
), shape=(ids.size, ids.size)).tocsr()
(This uses the (value, (row, column)) or (data, (i, j)) input format for sparse COOrdinate matrices.)
Finally, we use connected components to get the graphs — the groups of pixels that are nearest to each destination. The function returns the number of components and the mapping of "pixel id" to component:
n, components = sparse.csgraph.connected_components(g)
basins = components.reshape(image.shape)
print(basins)
[[0 0 0 0]
[0 0 0 1]
[0 0 1 1]
[1 1 1 1]]
(Note that this result is slightly different from yours because the cost is identical to destination 0 and 1 for the pixels in question, so it's arbitrary which to label.)
print(costs)
[[0. 1. 2.5 4.5 ]
[1.5 1.41421356 2.41421356 4. ]
[4. 2.91421356 1.41421356 1.5 ]
[5.5 3.5 1.5 0. ]]
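The component labels are not guaranteed to match the positions in destinations. One way to get the destination index directly, sketched here with NumPy only, is to follow the neighbor_ids pointers until every pixel points at its destination pixel. The helper name follow_to_roots is made up for this example, and neighbor_ids is copied from the printout above.

```python
import numpy as np

# neighbor_ids as printed above: for each pixel id, the id of the next pixel
# on its least-cost path (destination pixels point to themselves).
neighbor_ids = np.array(
    [[0, 0, 1, 2],
     [0, 0, 1, 11],
     [4, 5, 15, 15],
     [13, 14, 15, 15]]
)
destinations = [[0, 0], [3, 3]]


def follow_to_roots(parent):
    """Pointer jumping: replace each pointer by its pointer's pointer until stable."""
    root = parent.ravel().copy()
    while True:
        nxt = root[root]
        if np.array_equal(nxt, root):
            return root.reshape(parent.shape)
        root = nxt


roots = follow_to_roots(neighbor_ids)

# Map each destination's pixel id to its position in the destinations list.
dest_ids = np.ravel_multi_index(tuple(np.array(destinations).T), neighbor_ids.shape)
lookup = np.full(neighbor_ids.size, -1)
lookup[dest_ids] = np.arange(len(destinations))
nearest = lookup[roots]
print(nearest)
```

Each pass doubles the number of steps followed, so the loop should finish in roughly log2 of the longest path length.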
Hope this helps!
Have the following task:
Normalize the matrix by columns. From each value in column subtract average (in column) and divide it by standard deviation (in the column). Your output should not contain nan (caused by division by zero). Replace Nans with 1. Don't use if and while/for.
I am working with numpy, so I wrote the following code:
def normalize(matrix: np.array) -> np.array:
    res = (matrix - np.mean(matrix, axis = 0)) / np.std(matrix, axis = 0, dtype=np.float64)
    return res
matrix = np.array([[1, 4, 4200], [0, 10, 5000], [1, 2, 1000]])
assert np.allclose(
normalize(matrix),
np.array([[ 0.7071, -0.39223, 0.46291],
[-1.4142, 1.37281, 0.92582],
[ 0.7071, -0.98058, -1.38873]])
)
The answer is right.
However, my question is: how do I avoid division by zero? If I have a column of identical numbers, the standard deviation is 0 and I get NaN values in the result. How do I solve it? Would be grateful!
Your task specifies to avoid nan in the output and replace nan that occur with 1. It does not specify that intermediate results may not contain nan. A valid solution can be to use numpy.nan_to_num on res before returning:
import numpy as np
def normalize(matrix: np.array) -> np.array:
    res = (matrix - np.mean(matrix, axis = 0)) / np.std(matrix, axis = 0, dtype=np.float64)
    return np.nan_to_num(res, False, 1.0)
matrix = np.array([[2, 4, 4200], [2, 10, 5000], [2, 2, 1000]])
print(normalize(matrix))
yields:
[[ 1. -0.39223227 0.46291005]
[ 1. 1.37281295 0.9258201 ]
[ 1. -0.98058068 -1.38873015]]
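If the runtime warning from 0/0 is unwanted, the division can be skipped for zero-variance columns instead of being repaired afterwards. The following is a sketch of that alternative, not part of the answer above; the name normalize_safe is made up. It uses the out and where arguments of np.divide, so it still avoids if, for and while.

```python
import numpy as np


def normalize_safe(matrix):
    matrix = np.asarray(matrix, dtype=np.float64)
    centered = matrix - matrix.mean(axis=0)
    std = matrix.std(axis=0)
    # Pre-fill with 1 so columns that are skipped keep the required value.
    out = np.ones_like(centered)
    # Divide only where the column's standard deviation is non-zero.
    np.divide(centered, std, out=out, where=std != 0)
    return out


matrix = np.array([[2, 4, 4200], [2, 10, 5000], [2, 2, 1000]])
result = normalize_safe(matrix)
print(result)
```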
I am currently implementing a Cadzow filter in Python.
To put it in some context: you begin with a 1-dimensional array (let's take range(10) as an example) and build a Hankel-like matrix out of it, like
H = [[0, 1, 2, 3, 4, 5],
     [1, 2, 3, 4, 5, 6],
     [2, 3, 4, 5, 6, 7],
     [3, 4, 5, 6, 7, 8],
     [4, 5, 6, 7, 8, 9]]
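For reference, a matrix like this can be built with broadcasting, since entry (i, j) is simply element i + j of the input. A minimal sketch, where the variable names rows and cols are chosen for this example:

```python
import numpy as np

x = np.arange(10)
rows, cols = 5, 6  # rows + cols - 1 must equal len(x)

# Entry (i, j) of the Hankel-like matrix is x[i + j].
H = x[np.arange(rows)[:, None] + np.arange(cols)]
print(H)
```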
Afterwards you do some linear algebra with this matrix, which is no problem. Then comes the most time-consuming step, which is an averaging problem.
In a new matrix B you average the elements of the resulting matrix. In the first row you average all elements along the paths given by the occurrences in H, so something like the off-diagonals, but going from top right to bottom left. In the second row you ignore the first row, and so on.
Matrix $H$ would be invariant under this analysis step but for example the Matrix
1 2 2 1
1 1 1 1
1 1 1 1
would become
1 1.5 1.33 1
1 1 1 1
1 1 1 1
Okay, I hope you understand the problem. My (working but inefficient) code is
def av_diag(A,i,j):
    dim = A.shape
    # get the "borders" of A
    lim = min((dim[0]-i,j+1))
    # calculate the mean
    return np.mean([A[i+it,j-it] for it in range(lim)])
def avHankel(A):
    # get the mean for all elements by nested list comprehension
    return np.array([[av_diag(A,i,j) for j in range(len(A[0]))] for i in range(len(A))])
This takes a while for my data, containing 2048 data points, resulting in a 1024x1023 matrix.
And I would be glad for possible tricks to speed this up.
Thanks
You can convolve your input matrix with a filter matrix to speed up your code. The filter matrix can be defined so that at each step of the convolution, it extracts only the antidiagonals at the given coordinates. Basically, your filter matrix is simply an anti-identity matrix. Finally, as the convolution will only sum the elements of the anti-diagonals, you have to divide the output by the correct number of samples to obtain the mean:
import numpy as np
from scipy.signal import fftconvolve
from time import time
def av_diag(A,i,j):
    dim = A.shape
    lim = min((dim[0]-i,j+1))
    return np.mean([A[i+it,j-it] for it in range(lim)])
def avHankel(A):
    return np.array([[av_diag(A,i,j) for j in range(len(A[0]))] for i in range(len(A))])
def fast_avHankel(A):
    m, n = A.shape
    filt = np.eye(m)[:,::-1]
    Apad = np.pad(A, ((0, m-1), (m-1, 0)), mode = "constant", constant_values = 0)
    Asum = fftconvolve(Apad, filt, mode = "valid")
    Adiv = np.array([ [ min(m-i, j+1) for j in range(n) ] for i in range(m) ])
    return Asum / Adiv
if __name__ == "__main__":
    A = np.random.rand(500, 500)
    starttime = time()
    Hold = avHankel(A)
    print(time() - starttime) # 10.6 seconds on a laptop
    starttime = time()
    Hnew = fast_avHankel(A)
    print(time() - starttime) # 0.26 seconds on a laptop
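As a cross-check that does not need SciPy, the same sums can be accumulated with shifted slices. This is a sketch of an alternative, not the convolution approach above; the name shift_avHankel is made up, and it loops min(m, n) times over whole-array slices, so it is likely slower than the FFT version for large matrices.

```python
import numpy as np


def shift_avHankel(A):
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    total = np.zeros((m, n))
    for it in range(min(m, n)):
        # Element (i, j) receives A[i + it, j - it] wherever that index exists.
        total[:m - it, it:] += A[it:, :n - it]
    i, j = np.indices((m, n))
    # Number of elements on the path starting at (i, j), as in av_diag.
    count = np.minimum(m - i, j + 1)
    return total / count


A = np.array([[1, 2, 2, 1],
              [1, 1, 1, 1],
              [1, 1, 1, 1]])
B = shift_avHankel(A)
print(B)
```

The small matrix here is the example from the question, so the printed result can be compared against the expected values given there.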
What is a pythonic way to calculate the rolling mean of a list, but only considering the positive values?
So if I have the values
[1,2,3,4,5,-1,4,2,3] and I want to calculate the rolling mean of three values, it is basically calculating the rolling average of [1,2,3,4,5,'nan',4,2,3].
And that becomes
[nan,2,3,4,4.5,4.5,3,3,nan] where the first and the last nan are due to the missing elements.
The 2 = mean ([1,2,3])
the 3 = mean ([2,3,4])
but the 4.5 = mean ([4,5,nan])=mean ([4,5])
and so on. So it is important that when there are negative values they are excluded, but the division is between the number of positive values.
I tried:
def RollingPositiveAverage(listA,nElements):
    listB=[element for element in listA if element>0]
    return pd.rolling_mean(listB,3)
but the list B has elements missing. I tried to substitute those elements with nan but then the mean becomes nan itself.
Is there any nice and elegant way to solve this?
Thanks
Since you are using Pandas:
import numpy as np
import pandas as pd
def RollingPositiveAverage(listA, window=3):
    s = pd.Series(listA, dtype=float)  # float dtype so that NaN can be stored
    s[s < 0] = np.nan
    result = s.rolling(window, center=True, min_periods=1).mean()
    result.iloc[:window // 2] = np.nan
    result.iloc[-(window // 2):] = np.nan
    return result # or result.values or list(result) if you prefer array or list
print(RollingPositiveAverage([1, 2, 3, 4, 5, -1, 4, 2, 3]))
Output:
0 NaN
1 2.0
2 3.0
3 4.0
4 4.5
5 4.5
6 3.0
7 3.0
8 NaN
dtype: float64
Plain Python version:
import math
def RollingPositiveAverage(listA, window=3):
    result = [math.nan] * (window // 2)
    for win in zip(*(listA[i:] for i in range(window))):
        win = tuple(v for v in win if v >= 0)
        # max() guards against dividing by zero when the window has no valid values
        result.append(float(sum(win)) / max(len(win), 1))
    result.extend([math.nan] * (window // 2))
    return result
print(RollingPositiveAverage([1, 2, 3, 4, 5, -1, 4, 2, 3]))
Output:
[nan, 2.0, 3.0, 4.0, 4.5, 4.5, 3.0, 3.0, nan]
Get the rolling sums of the values, get the count of valid elements in each window as the rolling sums of the mask of positive elements, and simply divide the two for the average values. For the rolling sums, we could use np.convolve.
Hence, the implementation -
import numpy as np
def rolling_mean(a, W=3):
    a = np.asarray(a) # convert to array
    k = np.ones(W) # kernel for convolution
    # Mask of positive numbers and get clipped array
    m = a>=0
    a_clipped = np.where(m,a,0)
    # Get rolling windowed summations and divide by the rolling valid counts
    return np.convolve(a_clipped,k,'same')/np.convolve(m,k,'same')
Extending to the specific case of NaN-padding at the boundaries -
def rolling_mean_pad(a, W=3):
    hW = (W-1)//2 # half window size for padding
    a = np.asarray(a) # convert to array
    k = np.ones(W) # kernel for convolution
    # Mask of positive numbers and get clipped array
    m = a>=0
    a_clipped = np.where(m,a,0)
    # Get rolling windowed summations and divide by the rolling valid counts
    out = np.convolve(a_clipped,k,'same')/np.convolve(m,k,'same')
    out[:hW] = np.nan
    out[-hW:] = np.nan
    return out
Sample run -
In [54]: a
Out[54]: array([ 1, 2, 3, 4, 5, -1, 4, 2, 3])
In [55]: rolling_mean_pad(a, W=3)
Out[55]: array([ nan, 2. , 3. , 4. , 4.5, 4.5, 3. , 3. , nan])
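Another option, assuming NumPy 1.20 or newer, is to take windowed views and let np.nanmean ignore the masked values. This is a sketch of an alternative rather than the convolution method above; the name rolling_nanmean is made up, and a window with no valid values would produce NaN together with a RuntimeWarning.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_nanmean(a, W=3):
    a = np.asarray(a, dtype=float)
    a = np.where(a >= 0, a, np.nan)          # mask negative values
    windows = sliding_window_view(a, W)       # shape (len(a) - W + 1, W)
    means = np.nanmean(windows, axis=1)       # mean over the valid entries only
    pad = W // 2
    # NaN-pad both ends so the output has the same length as the input.
    return np.concatenate(
        [np.full(pad, np.nan), means, np.full(W - 1 - pad, np.nan)]
    )


result = rolling_nanmean([1, 2, 3, 4, 5, -1, 4, 2, 3])
print(result)
```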
Consider two n-dimensional, possibly overlapping, numpy meshgrids, say
m1 = (x1, y1, z1, ...)
m2 = (x2, y2, z2, ...)
Within m1 and m2 there are no duplicate coordinate tuples. Each meshgrid has a result array, which may result from different functions:
r1 = f1(m1)
r2 = f2(m2)
such that f1(m) != f2(m). Now I would like to join those two meshgrids and their result arrays, e.g. m=m1&m2 and r=r1&r2 (where & would denote some kind of union), such that the coordinate tuples in m are still sorted and the values in r still correspond to the original coordinate tuples. Newly created coordinate tuples should be identifiable (for instance with a special value).
To elaborate on what I'm after, I have two examples that kind of do what I want with simple for and if statements. Here's a 1D example:
x1 = [1, 5, 7]
r1 = [i**2 for i in x1]
x2 = [2, 4, 6]
r2 = [i*3 for i in x2]
x,r = list(zip(*sorted([(i,j) for i,j in zip(x1+x2,r1+r2)],key=lambda x: x[0])))
which gives
x = (1, 2, 4, 5, 6, 7)
r = (1, 6, 12, 25, 18, 49)
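For comparison, the 1D case can be written with NumPy by concatenating both inputs and reordering them with a single argsort. A minimal sketch using the same sample values:

```python
import numpy as np

x1 = np.array([1, 5, 7])
r1 = x1 ** 2
x2 = np.array([2, 4, 6])
r2 = x2 * 3

# Concatenate coordinates and results, then apply one common sort order.
x = np.concatenate((x1, x2))
r = np.concatenate((r1, r2))
order = np.argsort(x, kind="stable")
x, r = x[order], r[order]
print(x, r)
```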
For 2D it starts getting quite complicated:
import numpy as np
a1 = [1, 5, 7]
b1 = [2, 5, 6]
x1,y1 = np.meshgrid(a1,b1)
r1 = x1*y1
a2 = [2, 4, 6]
b2 = [1, 3, 8]
x2, y2 = np.meshgrid(a2,b2)
r2 = 2*x2
a = [1, 2, 4, 5, 6, 7]
b = [1, 2, 3, 5, 6, 8]
x,y = np.meshgrid(a,b)
r = np.ones(x.shape)*-1
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        if x[i,j] in a1 and y[i,j] in b1:
            r[i,j] = r1[a1.index(x[i,j]),b1.index(y[i,j])]
        elif x[i,j] in a2 and y[i,j] in b2:
            r[i,j] = r2[a2.index(x[i,j]),b2.index(y[i,j])]
This gives the desired result, with new coordinate pairs having the value -1:
x=
[[1 2 4 5 6 7]
[1 2 4 5 6 7]
[1 2 4 5 6 7]
[1 2 4 5 6 7]
[1 2 4 5 6 7]
[1 2 4 5 6 7]]
y=
[[1 1 1 1 1 1]
[2 2 2 2 2 2]
[3 3 3 3 3 3]
[5 5 5 5 5 5]
[6 6 6 6 6 6]
[8 8 8 8 8 8]]
r=
[[ -1. 4. 4. -1. 4. -1.]
[ 2. -1. -1. 5. -1. 6.]
[ -1. 8. 8. -1. 8. -1.]
[ 10. -1. -1. 25. -1. 30.]
[ 14. -1. -1. 35. -1. 42.]
[ -1. 12. 12. -1. 12. -1.]]
but this will also become slow quickly with increasing dimensions and array sizes. So here, finally, is the question: how can this be done using only numpy functions? If it is not possible, what would be the fastest way to implement this in Python? If it is at all relevant, I prefer using Python 3. Note that the functions I use in the examples are not the actual functions I use.
We can make use of some masking to replace the A in B parts to give us 1D masks. Then, we can use those masks with np.ix_ to extend to desired number of dimensions.
Thus, for a 2D case, it would be something along these lines -
# Initialize o/p array
r_out = np.full([len(b), len(a)],-1)  # rows follow b, columns follow a, matching np.ix_ below
# Assign for the IF part
mask_a1 = np.in1d(a,a1)
mask_b1 = np.in1d(b,b1)
r_out[np.ix_(mask_b1, mask_a1)] = r1.T
# Assign for the ELIF part
mask_a2 = np.in1d(a,a2)
mask_b2 = np.in1d(b,b2)
r_out[np.ix_(mask_b2, mask_a2)] = r2.T
a could be created, like so -
a = np.concatenate((a1,a2))
a.sort()
Similarly, for b.
Also, we could make use of indices instead of masks for use with np.ix_. For the same, we could use np.searchsorted. Thus, instead of the mask np.in1d(a,a1), we could get corresponding indices with np.searchsorted(a,a1) and so on for the rest of the masks. This should be considerably faster.
For a 3D case, I would assume that we would have another array, say c. Thus, the initialization part would involve using len(c). There would be one more mask/index-array corresponding to c and hence one more term into np.ix_ and there would be transpose of r1 and r2.
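The n-dimensional extension might look like the sketch below. It assumes the grids were built with np.meshgrid(..., indexing="ij"), so that axis k of each result array follows the k-th coordinate list and no transpose is needed; this differs from the default "xy" layout used in the question. The function name join_grids and the sample functions are made up for illustration. Where both grids define the same point, the second one overwrites the first.

```python
import numpy as np


def join_grids(axes1, r1, axes2, r2, fill=-1):
    """Merge two result arrays defined on 'ij'-indexed meshgrids."""
    # Sorted union of the coordinates along every axis.
    merged = [np.union1d(u, v) for u, v in zip(axes1, axes2)]
    out = np.full([len(ax) for ax in merged], fill, dtype=float)
    for axes, r in ((axes1, r1), (axes2, r2)):
        # Position of each original coordinate inside the merged axis.
        idx = [np.searchsorted(m, ax) for m, ax in zip(merged, axes)]
        out[np.ix_(*idx)] = r
    return merged, out


a1, b1, c1 = [1, 5, 7], [2, 5, 6], [0, 1]
a2, b2, c2 = [2, 4, 6], [1, 3, 8], [2]
x1, y1, z1 = np.meshgrid(a1, b1, c1, indexing="ij")
x2, y2, z2 = np.meshgrid(a2, b2, c2, indexing="ij")
r1 = x1 * y1 + z1   # placeholder function on the first grid
r2 = 2 * x2 + z2    # placeholder function on the second grid

axes, r = join_grids((a1, b1, c1), r1, (a2, b2, c2), r2)
print(r.shape)
```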
Divakar's answer is exactly what I needed. I wanted, however, to still try out the second suggestion in that answer and on top I did some profiling. I thought the results may be interesting to others. Here is the code I used for profiling:
import numpy as np
import timeit
import random
def for_join_2d(x1,y1,r1, x2,y2,r2):
    """
    The algorithm from the question.
    """
    a = sorted(list(x1[0,:])+list(x2[0,:]))
    b = sorted(list(y1[:,0])+list(y2[:,0]))
    x,y = np.meshgrid(a,b)
    r = np.ones(x.shape)*-1
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            if x[i,j] in a1 and y[i,j] in b1:
                r[i,j] = r1[a1.index(x[i,j]),b1.index(y[i,j])]
            elif x[i,j] in a2 and y[i,j] in b2:
                r[i,j] = r2[a2.index(x[i,j]),b2.index(y[i,j])]
    return x,y,r
def mask_join_2d(x1,y1,r1,x2,y2,r2):
    """
    Divakar's original answer.
    """
    a = np.sort(np.concatenate((x1[0,:],x2[0,:])))
    b = np.sort(np.concatenate((y1[:,0],y2[:,0])))
    # Initialize o/p array (rows follow b, columns follow a, like x and y)
    x,y = np.meshgrid(a,b)
    r_out = np.full([len(b), len(a)],-1)
    # Assign for the IF part
    mask_a1 = np.in1d(a,a1)
    mask_b1 = np.in1d(b,b1)
    r_out[np.ix_(mask_b1, mask_a1)] = r1.T
    # Assign for the ELIF part
    mask_a2 = np.in1d(a,a2)
    mask_b2 = np.in1d(b,b2)
    r_out[np.ix_(mask_b2, mask_a2)] = r2.T
    return x,y,r_out
def searchsort_join_2d(x1,y1,r1,x2,y2,r2):
    """
    Divakar's second suggested solution using searchsorted.
    """
    a = np.sort(np.concatenate((x1[0,:],x2[0,:])))
    b = np.sort(np.concatenate((y1[:,0],y2[:,0])))
    # Initialize o/p array (rows follow b, columns follow a, like x and y)
    x,y = np.meshgrid(a,b)
    r_out = np.full([len(b), len(a)],-1)
    #the IF part
    ind_a1 = np.searchsorted(a,a1)
    ind_b1 = np.searchsorted(b,b1)
    r_out[np.ix_(ind_b1,ind_a1)] = r1.T
    #the ELIF part
    ind_a2 = np.searchsorted(a,a2)
    ind_b2 = np.searchsorted(b,b2)
    r_out[np.ix_(ind_b2,ind_a2)] = r2.T
    return x,y,r_out
##the profiling code:
if __name__ == '__main__':
    N1 = 100
    N2 = 100
    coords_a = [i for i in range(N1)]
    coords_b = [i*2 for i in range(N2)]
    a1 = random.sample(coords_a, N1//2)
    b1 = random.sample(coords_b, N2//2)
    a2 = [i for i in coords_a if i not in a1]
    b2 = [i for i in coords_b if i not in b1]
    x1,y1 = np.meshgrid(a1,b1)
    r1 = x1*y1
    x2,y2 = np.meshgrid(a2,b2)
    r2 = 2*x2
    print("original for loop")
    print(min(timeit.Timer(
        'for_join_2d(x1,y1,r1,x2,y2,r2)',
        setup = 'from __main__ import for_join_2d,x1,y1,r1,x2,y2,r2',
    ).repeat(7,1000)))
    print("with masks")
    print(min(timeit.Timer(
        'mask_join_2d(x1,y1,r1,x2,y2,r2)',
        setup = 'from __main__ import mask_join_2d,x1,y1,r1,x2,y2,r2',
    ).repeat(7,1000)))
    print("with searchsort")
    print(min(timeit.Timer(
        'searchsort_join_2d(x1,y1,r1,x2,y2,r2)',
        setup = 'from __main__ import searchsort_join_2d,x1,y1,r1,x2,y2,r2',
    ).repeat(7,1000)))
For each function I used 7 sets of 1000 iterations and picked the fastest set for evaluation. The results for two 10x10 arrays were:
original for loop
0.5114614190533757
with masks
0.21544912096578628
with searchsort
0.12026709201745689
and for two 100x100 arrays it was:
original for loop
247.88183582702186
with masks
0.5245905339252204
with searchsort
0.2439237720100209
For big matrices the use of numpy functionality unsurprisingly makes a huge difference, and using searchsorted indices instead of masks roughly halves the run time.