I want to compute the transitive closure of a sparse matrix in Python. Currently I am using scipy sparse matrices.
The matrix power (**12 in my case) works well on very sparse matrices, no matter how large they are, but for directed not-so-sparse cases I would like to use a smarter algorithm.
I have found the Floyd-Warshall algorithm (the German Wikipedia page has better pseudocode) in scipy.sparse.csgraph, which does a bit more than it should: there is no function for Warshall's algorithm alone - that is one issue.
The main problem is that I can pass a sparse matrix to the function, but this is utterly pointless, as the function will always return a dense matrix: what should be 0 in the transitive closure is now a path of length inf, and someone felt this needs to be stored explicitly.
So my question is: Is there any python module that allows computing the transitive closure of a sparse matrix and keeps it sparse?
I am not 100% sure he works with the same kind of matrices, but Gerald Penn shows impressive speed-ups in his comparison paper, which suggests the problem is solvable.
EDIT: As there were a number of confusions, I will point out the theoretical background:
I am looking for the transitive closure (not reflexive or symmetric).
I will make sure that my relation encoded in a boolean matrix has the properties that are required, i.e. symmetry or reflexivity.
I have two cases of the relation:
reflexive
reflexive and symmetric
I want to apply the transitive closure to those two relations. This works perfectly well with matrix power (except that in certain cases it is too expensive):
>>> reflexive
matrix([[ True,  True, False,  True],
        [False,  True,  True, False],
        [False, False,  True, False],
        [False, False, False,  True]])
>>> reflexive**4
matrix([[ True,  True,  True,  True],
        [False,  True,  True, False],
        [False, False,  True, False],
        [False, False, False,  True]])
>>> reflexive_symmetric
matrix([[ True,  True, False,  True],
        [ True,  True,  True, False],
        [False,  True,  True, False],
        [ True, False, False,  True]])
>>> reflexive_symmetric**4
matrix([[ True,  True,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True],
        [ True,  True,  True,  True]])
So in the first case, we get all the descendants of a node (including itself), and in the second case, we get, for each node, all the nodes in the same component.
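For concreteness, here is a minimal sketch of how the matrix-power approach can be kept sparse; the repeated-squaring loop and its stopping criterion are my own illustration, not production code. Since the relation is reflexive, the closure is reached as a fixed point in O(log n) sparse multiplications:

import numpy as np
import scipy.sparse as sparse

# For a reflexive relation, A is contained in A@A, so repeated squaring
# grows monotonically toward the transitive closure.
A = sparse.csr_matrix(np.array([[1, 1, 0, 1],
                                [0, 1, 1, 0],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.int64))
closure = A.copy()
while True:
    new = closure @ closure
    new.data[:] = 1                    # clamp path counts back to 0/1
    if (new != closure).nnz == 0:      # fixed point: nothing new is reachable
        break
    closure = new
print(closure.toarray())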
This was brought up on the SciPy issue tracker. The problem is not so much the output format; the Floyd-Warshall implementation begins with a matrix full of infinities and then inserts finite values as paths are found. Sparsity is lost immediately.
The networkx library offers an alternative with its all_pairs_shortest_path_length. Its output is an iterator which returns tuples of the form
(source, dictionary of reachable targets)
which takes a little work to convert to a SciPy sparse matrix (csr format is natural here). A complete example:
import numpy as np
import networkx as nx
import scipy.stats as stats
import scipy.sparse as sparse
A = sparse.random(6, 6, density=0.2, format='csr', data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
G = nx.DiGraph(A) # directed because A need not be symmetric
paths = nx.all_pairs_shortest_path_length(G)
indices = []
indptr = [0]
for row in paths:
    reachable = [v for v in row[1] if row[1][v] > 0]
    indices.extend(reachable)
    indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = A + sparse.csr_matrix((data, indices, indptr), shape=A.shape)
print(A, "\n\n", A_trans)
The reason for adding A back is as follows. The networkx output includes paths of length 0, which would immediately fill the diagonal. We don't want that to happen (you wanted the transitive closure, not the reflexive-and-transitive closure). Hence the line reachable = [v for v in row[1] if row[1][v] > 0]. But then we don't get any diagonal entries at all, even where A had them (the 0-length empty path beats the 1-length path formed by a self-loop). So I add A back to the result. It now has entries 1 or 2, but only the fact that they are nonzero is significant.
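If the mixture of 1s and 2s bothers you, one possible normalization is:

A_trans = A_trans.astype(bool)  # keep only the nonzero pattern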
An example of running the above (I pick 6 by 6 size for readability of the output). Original matrix:
(0, 3) 1
(3, 2) 1
(4, 3) 1
(5, 1) 1
(5, 3) 1
(5, 4) 1
(5, 5) 1
Transitive closure:
(0, 2) 1
(0, 3) 2
(3, 2) 2
(4, 2) 1
(4, 3) 2
(5, 1) 2
(5, 2) 1
(5, 3) 2
(5, 4) 2
(5, 5) 1
You can see that this worked correctly: the added entries are (0, 2), (4, 2), and (5, 2), all acquired via the path (3, 2).
By the way, networkx also has a floyd_warshall method, but its documentation says:
This algorithm is most appropriate for dense graphs. The running time is O(n^3), and running space is O(n^2) where n is the number of nodes in G.
Its output is dense again; I get the impression that this algorithm is simply dense by nature. By contrast, all_pairs_shortest_path_length appears to be a Dijkstra-style search (a plain BFS in the unweighted case) run from each source, which is what keeps it sparse-friendly.
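If you would rather stay entirely within SciPy, here is a hedged sketch (assuming a per-source BFS is acceptable for your graph's density) that uses scipy.sparse.csgraph.breadth_first_order instead of networkx, reusing A, np, and sparse from the example above:

from scipy.sparse import csgraph

# Per-source BFS with scipy only: breadth_first_order returns the nodes
# reachable from the given source, including the source itself via the
# 0-length path, which we drop for the same reason as above.
indices = []
indptr = [0]
for s in range(A.shape[0]):
    order = csgraph.breadth_first_order(A, s, return_predecessors=False)
    indices.extend(v for v in order if v != s)
    indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = A + sparse.csr_matrix((data, indices, indptr), shape=A.shape)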
Transitive and Reflexive
If instead of the transitive closure (the smallest transitive relation containing the given one) you wanted the transitive and reflexive closure (the smallest transitive and reflexive relation containing the given one), the code simplifies, as we no longer worry about 0-length paths.
for row in paths:
    indices.extend(row[1])
    indptr.append(len(indices))
data = np.ones((len(indices),), dtype=np.uint8)
A_trans = sparse.csr_matrix((data, indices, indptr), shape=A.shape)
Transitive, Reflexive, and Symmetric
This means finding the smallest equivalence relation containing the given one. Equivalently, it means dividing the vertices into connected components. For this you don't need networkx at all: SciPy has a connected_components method in scipy.sparse.csgraph. Set directed=False there. Example:
import numpy as np
import scipy.stats as stats
import scipy.sparse as sparse
import scipy.sparse.csgraph  # make sure the csgraph submodule is loaded
import itertools
A = sparse.random(20, 20, density=0.02, format='csr', data_rvs=stats.randint(1, 2).rvs).astype(np.uint8)
components = sparse.csgraph.connected_components(A, directed=False)
nonzeros = []
for k in range(components[0]):
    idx = np.where(components[1] == k)[0]
    nonzeros.extend(itertools.product(idx, idx))
row = tuple(r for r, c in nonzeros)
col = tuple(c for r, c in nonzeros)
data = np.ones_like(row)
B = sparse.coo_matrix((data, (row, col)), shape=A.shape)
This is what the output print(B.toarray()) looks like for a random example, 20 by 20:
[[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0]
[0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
[1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]]
Related
I'm struggling to write a function that would seamlessly apply to any numpy array, whatever its number of dimensions.
At one point in my code, I have boolean arrays that I treat as masks for other arrays (0 = not passing, 1 = passing).
I would like to "enlarge" those mask arrays by overwriting zeros adjacent to ones within a defined range.
Example:
input = [0,0,0,0,0,1,0,0,0,0,1,0,0,0]
enlarged_by_1 = [0,0,0,0,1,1,1,0,0,1,1,1,0,0]
enlarged_by_2 = [0,0,0,1,1,1,1,1,1,1,1,1,1,0]
input = [[0,0,0,1,0,0,1,0],
[0,1,0,0,0,0,0,0],
[0,0,0,0,0,0,1,0]]
enlarged_by_1 = [[0,0,1,1,1,1,1,1],
[1,1,1,0,0,0,0,0],
[0,0,0,0,0,1,1,1]]
This is pretty straightforward when the inputs are 1D.
However, I would like this function to seamlessly accept 1D arrays, matrices, 3D arrays, and so on.
For a matrix, the same logic would be applied to each line.
I read about Ellipsis, but it does not seem applicable in my case.
Flattening the input, applying the logic, and reshaping the array could lead to contamination between the individual arrays.
I do not want to resort to testing the shape of the input array or to a recursive function, as neither seems very clean to me.
Would you have any suggestions?
The operation that you describe seems very much like a convolution followed by clipping, to ensure that the values remain 0 or 1.
For your example input:
import numpy as np
input = np.array([0,0,0,0,0,1,0,0,0,0,1,0,0,0], dtype=int)
print(input)
def enlarge_ones(x, k):
    mask = np.ones(2*k+1, dtype=int)
    return np.clip(np.convolve(x, mask, mode='same'), 0, 1).astype(int)
print(enlarge_ones(input, k=1))
print(enlarge_ones(input, k=3))
which yields
[0 0 0 0 0 1 0 0 0 0 1 0 0 0]
[0 0 0 0 1 1 1 0 0 1 1 1 0 0]
[0 0 1 1 1 1 1 1 1 1 1 1 1 1]
numpy.convolve only works for 1-d arrays. However, one can loop over the number of array dimensions and, inside that, over each 1-d slice. For a 2-d matrix this means operating first on every row and then on every column; the same idea extends to nd-arrays with more dimensions. enlarge_ones would then become something like:
def enlarge_ones(x, k):
    n = len(x.shape)
    if n == 1:
        mask = np.ones(2*k+1, dtype=int)
        return np.clip(np.convolve(x, mask, mode='same')[:len(x)], 0, 1).astype(int)
    else:
        x = x.copy()
        for d in range(n):
            for i in np.ndindex(x.shape[:-1]):
                x[i] = enlarge_ones(x[i], k)  # x[i] is 1-d
            x = x.transpose(list(range(1, n)) + [0])
        return x
Note the use of np.transpose to rotate the dimensions so that np.convolve is applied to the 1-d slices along each dimension in turn. This happens exactly n times, which returns the array to its original shape at the end.
x = np.zeros((3, 5, 7), dtype=int)
x[1, 2, 2] = 1
print(x)
print(enlarge_ones(x, k=1))
[[[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 1 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]
[0 0 0 0 0 0 0]]]
[[[0 0 0 0 0 0 0]
[0 1 1 1 0 0 0]
[0 1 1 1 0 0 0]
[0 1 1 1 0 0 0]
[0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0]
[0 1 1 1 0 0 0]
[0 1 1 1 0 0 0]
[0 1 1 1 0 0 0]
[0 0 0 0 0 0 0]]
[[0 0 0 0 0 0 0]
[0 1 1 1 0 0 0]
[0 1 1 1 0 0 0]
[0 1 1 1 0 0 0]
[0 0 0 0 0 0 0]]]
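As a side note, the repeated per-axis convolution plus clipping is equivalent to a morphological box dilation, so a hedged alternative sketch (my own suggestion, using scipy.ndimage rather than plain numpy) is:

import numpy as np
from scipy import ndimage

def enlarge_ones_dilation(x, k):
    # A box of side 2k+1 reproduces the separable per-axis convolutions above.
    structure = np.ones((2 * k + 1,) * x.ndim, dtype=bool)
    return ndimage.binary_dilation(x.astype(bool), structure=structure).astype(int)

If you want only the per-line behaviour from the question's 2-d example, a structure of shape (1, ..., 1, 2*k+1) dilates along the last axis alone.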
I am looking for the coordinates of connected blobs in a binary image (2d numpy array of 0 or 1).
The skimage library provides a very fast way to label blobs within the array (which I found from similar SO posts). However, I want a list of the coordinates of each blob, not a labelled array. I have a solution which extracts the coordinates from the labelled image, but it is very slow - far slower than the initial labelling.
Minimal Reproducible example:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    # The goal is to obtain lists of the coordinates
    # of each distinct blob.
    blobs = []
    label = 1
    while True:
        indices_of_label = np.where(labelled_array == label)
        if not indices_of_label[0].size > 0:
            break
        else:
            blob = list(zip(*indices_of_label))
            label += 1
            blobs.append(blob)
    return blobs
if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Output:
2d array of type: <class 'numpy.ndarray'>:
[[0 1 0 0 1 1 0 1 1 0 0 1]
[0 1 0 1 1 1 0 1 1 1 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 0 1 0 0 0 0 0 0 0 0 0]
[0 1 0 0 1 1 0 1 1 0 0 1]
[0 0 0 0 0 0 0 1 1 1 0 0]
[0 1 1 1 1 0 0 0 0 1 0 0]]
2d array with connected blobs labelled of type <class 'numpy.ndarray'>:
[[ 0 1 0 0 2 2 0 3 3 0 0 4]
[ 0 1 0 2 2 2 0 3 3 3 0 4]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 5 5 5 5 0 0 0 0 3 0 0]
[ 0 0 0 0 0 0 0 3 3 3 0 0]
[ 0 0 6 0 0 0 0 0 0 0 0 0]
[ 0 6 0 0 7 7 0 8 8 0 0 9]
[ 0 0 0 0 0 0 0 8 8 8 0 0]
[ 0 10 10 10 10 0 0 0 0 8 0 0]]
Beginning extract_blobs_from_labelled_array timing
Time taken:
9.346099977847189e-05
9e-05 seconds is small, but so is this example image. In reality I am working with very high resolution images, for which the function takes approximately 10 minutes.
Is there a faster way to do this?
Side note: I'm only using list(zip()) to try to get the numpy coordinates into something I'm used to (I don't use numpy much, just Python). Should I skip this and just use the coordinates to index as-is? Will that speed it up?
The part of the code that is slow is here:
while True:
    indices_of_label = np.where(labelled_array == label)
    if not indices_of_label[0].size > 0:
        break
    else:
        blob = list(zip(*indices_of_label))
        label += 1
        blobs.append(blob)
First, a complete aside: you should avoid using while True when you know the number of elements you will be iterating over. It's a recipe for hard-to-find infinite-loop bugs.
Instead, you should use:
for label in range(1, np.max(labels) + 1):
and then you can ignore the if ...: break.
A second issue is indeed that you are using list(zip(*)), which is slow compared to NumPy functions. Here you could get approximately the same result with np.transpose(indices_of_label), which will get you a 2D array of shape (n_coords, n_dim), i.e. (n_coords, 2).
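Put together, the two small fixes might look like this sketch, keeping your one-pass-per-label structure:

blobs = []
for label in range(1, labelled_array.max() + 1):
    # np.nonzero + transpose yields an (n_coords, 2) array of coordinates
    coords = np.transpose(np.nonzero(labelled_array == label))
    blobs.append(coords)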
But the Big Issue is the expression labelled_array == label. This will examine every pixel of the image once for every label. (Twice, actually, because then you run np.where(), which takes another pass.) This is a lot of unnecessary work, as the coordinates can be found in one pass.
The scikit-image function skimage.measure.regionprops can do this for you. regionprops goes over the image once and returns a list containing one RegionProps object per label. The object has a .coords attribute containing the coordinates of each pixel in the blob. So, here's your code, modified to use that function:
import timeit
from skimage import measure
import numpy as np
binary_image = np.array([
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,1,0,1,1,1,0,1,1,1,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,0,1,0,0,0,0,0,0,0,0,0],
[0,1,0,0,1,1,0,1,1,0,0,1],
[0,0,0,0,0,0,0,1,1,1,0,0],
[0,1,1,1,1,0,0,0,0,1,0,0],
])
print(f"\n\n2d array of type: {type(binary_image)}:")
print(binary_image)
labels = measure.label(binary_image)
print(f"\n\n2d array with connected blobs labelled of type {type(labels)}:")
print(labels)
def extract_blobs_from_labelled_array(labelled_array):
    """Return a list containing coordinates of pixels in each blob."""
    props = measure.regionprops(labelled_array)
    blobs = [p.coords for p in props]
    return blobs

if __name__ == "__main__":
    print("\n\nBeginning extract_blobs_from_labelled_array timing\n")
    print("Time taken:")
    print(
        timeit.timeit(
            'extract_blobs_from_labelled_array(labels)',
            globals=globals(),
            number=1
        )
    )
    print("\n\n")
Problem:
I want to create a 5-dimensional numpy matrix, with each column's values restricted to a range. I can't find any solution online for this problem.
I'm trying to generate a list of rules in the form
Rule: (wordIndex, row, col, dh, dv)
with the columns taking values in the ranges (0,7), (0,11), (0,11), (-1,1), (-1,1) respectively. I want to generate all possible combinations.
I could easily make the matrix using five loops, one inside another
m, n = 12, 12
rules = []
for wordIndex in range(0, 15):
    for row in range(0, m):
        for col in range(0, n):
            for dh in range(-1, 2):
                for dv in range(-1, 2):
                    rules.append([wordIndex, row, col, dh, dv])
But this approach takes far too long (the number of combinations is the product of all the range sizes), and I wonder if there's a better, vectorized way to solve this problem using numpy.
I've tried the following but none seem to work:
rules = np.mgrid[words[0]:words[-1], 0:11, 0:11, -1:1, -1:1]
rules = np.rollaxis(words,0,4)
rules = rules.reshape((len(words)*11*11*3*3, 5))
Another approach that fails:
values = list(itertools.product(len(wordsGiven()), range(11), range(11), range(-1,1), range(-1,1)))
I also tried np.arange() but can't seem to figure out how to use it for a multidimensional array.
I think there should be a better way to do it, but in case you cannot find one, here is a hacky array-based way:
shape = (8-0, 12-0, 12-0, 2-(-1), 2-(-1))
a = np.zeros(shape)
# create an array of indices
a = np.argwhere(a == 0).reshape(*shape, len(shape))
# correct the ranges that do not start from 0; here the 4th and 5th
# columns (dh and dv) are shifted by the starting value -1.
# You can adjust this for any other ranges and elements easily.
a[:, :, :, :, :, 3:5] -= 1
First few elements of a:
[[[[[[ 0 0 0 -1 -1]
[ 0 0 0 -1 0]
[ 0 0 0 -1 1]]
[[ 0 0 0 0 -1]
[ 0 0 0 0 0]
[ 0 0 0 0 1]]
[[ 0 0 0 1 -1]
[ 0 0 0 1 0]
[ 0 0 0 1 1]]]
[[[ 0 0 1 -1 -1]
[ 0 0 1 -1 0]
[ 0 0 1 -1 1]]
[[ 0 0 1 0 -1]
[ 0 0 1 0 0]
[ 0 0 1 0 1]]
[[ 0 0 1 1 -1]
[ 0 0 1 1 0]
[ 0 0 1 1 1]]]
[[[ 0 0 2 -1 -1]
[ 0 0 2 -1 0]
[ 0 0 2 -1 1]]
[[ 0 0 2 0 -1]
[ 0 0 2 0 0]
[ 0 0 2 0 1]]
[[ 0 0 2 1 -1]
[ 0 0 2 1 0]
[ 0 0 2 1 1]]]
...
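For what it's worth, the np.mgrid attempt in the question fails only because mgrid slices exclude their stop value; with the bounds bumped by one, it produces the same combinations directly. A sketch under the same ranges as the hacky version above:

import numpy as np

# mgrid slices are half-open, so the stop values are one past the maxima.
g = np.mgrid[0:8, 0:12, 0:12, -1:2, -1:2]   # shape (5, 8, 12, 12, 3, 3)
rules = g.reshape(5, -1).T                  # shape (8*12*12*3*3, 5), one row per rule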
I have a polygon which I want to turn into a mask array, such that all points that fall inside/outside the polygon are True/False. I thought I found the perfect solution (SciPy Create 2D Polygon Mask), but for some reason this doesn't work!
What am I doing wrong?
#!/usr/bin/env python3
import numpy as np
import scipy as sp
from PIL import Image, ImageDraw
nx, ny = 10, 10
poly = np.array([(1, 1), (6, 2), (9, 9), (3, 7)])
img = Image.new("L", [nx, ny], 0)
ImageDraw.Draw(img).polygon(poly, outline=1, fill=1)
mask = np.array(img)
print(mask)
# [[1 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]
# [0 0 0 0 0 0 0 0 0 0]]
Broader context:
I'm working with features of arbitrary shape on a rectangular grid. I have the indices of all boundary points on the grid, and I want the indices of a convex hull around this feature. scipy.spatial.ConvexHull(boundary_points) gives me the edge points of the convex hull, and it is this hull polygon that I now want to turn into a mask.
poly has to be a list of tuples or a flattened list; for some reason PIL handles numpy arrays badly. You can convert a polygon stored in a numpy array with poly.ravel().tolist() or with list(map(tuple, poly)).
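A minimal sketch of the fix, reusing the setup from the question and converting poly before drawing:

import numpy as np
from PIL import Image, ImageDraw

nx, ny = 10, 10
poly = np.array([(1, 1), (6, 2), (9, 9), (3, 7)])
img = Image.new("L", [nx, ny], 0)
# Flatten the numpy array into a plain list of ints before handing it to PIL.
ImageDraw.Draw(img).polygon(poly.ravel().tolist(), outline=1, fill=1)
mask = np.array(img)
print(mask)  # the polygon interior is now filled with 1s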
I have a 2D labeled image (a numpy array) in which each label represents an object. I have to find each object's center and its area. My current solution:
centers = [np.mean(np.where(label_2d == i),1) for i in range(1,num_obj+1)]
surface_area = np.array([np.sum(label_2d == i) for i in range(1,num_obj+1)])
Note that the label_2d used for centers is not the same as the one for surface area, so I can't combine both operations. My current code is about 10-100 times too slow.
In C++ I would iterate through the image once (two for loops) and fill the table (an array), from which I would then calculate the centers and surface areas.
Since for loops are quite slow in Python, I have to find another solution. Any advice?
You could use the center_of_mass function from scipy.ndimage.measurements (in recent SciPy versions it is exposed directly as scipy.ndimage.center_of_mass) for the first problem, and np.bincount for the second. Because these live in mainstream libraries, they are heavily optimized, so you can expect decent speed gains.
Example:
>>> import numpy as np
>>> from scipy.ndimage.measurements import center_of_mass
>>>
>>> a = np.zeros((10,10), dtype=int)
>>> # add some labels:
... a[3:5, 1:3] = 1
>>> a[7:9, 0:3] = 2
>>> a[5:6, 4:9] = 3
>>> print(a)
[[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]
[0 1 1 0 0 0 0 0 0 0]
[0 1 1 0 0 0 0 0 0 0]
[0 0 0 0 3 3 3 3 3 0]
[0 0 0 0 0 0 0 0 0 0]
[2 2 2 0 0 0 0 0 0 0]
[2 2 2 0 0 0 0 0 0 0]
[0 0 0 0 0 0 0 0 0 0]]
>>>
>>> num_obj = 3
>>> surface_areas = np.bincount(a.flat)[1:]
>>> centers = center_of_mass(a, labels=a, index=range(1, num_obj+1))
>>> print(surface_areas)
[4 6 5]
>>> print(centers)
[(3.5, 1.5), (7.5, 1.0), (5.0, 6.0)]
Speed gains depend on the size of your input data, though, so I can't make any serious estimates. It would be nice if you could add that info (size of a, number of labels, timing results for your method and for these functions) in the comments.
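As a further hedged variation (my own addition, not something you asked for): even the centers can be obtained with np.bincount alone, by averaging the row and column coordinates per label.

# Weighted bincounts give the mean row/col coordinate per label.
rows, cols = np.nonzero(a)
pixel_labels = a[rows, cols]
counts = np.bincount(pixel_labels)[1:]
centers = np.stack([np.bincount(pixel_labels, weights=rows)[1:] / counts,
                    np.bincount(pixel_labels, weights=cols)[1:] / counts],
                   axis=1)
print(centers)  # matches center_of_mass: [[3.5 1.5], [7.5 1.0], [5.0 6.0]]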