Best way to generate all possible bidirectional graphs in Python

I'm trying to generate all possible bidirectional graphs given a set of nodes. I'm storing my graphs as numpy vectors, and need them that way because some downstream code consumes the graphs in this format.
Let's say I have two sets of nodes in which nodes in the same set do not connect. It is, however, possible to have graphs in which members of the two sets do not meet at all.
posArgs= [0,1] # self-connections (0->0) and (1->1) are not allowed; neither are same-set connections (0->1) or (1->0)
negArgs= [2] # (0->2) is possible, and so is (0 2) - meaning no connection.
To illustrate what I mean, the following flat vector translates into a 3x3 adjacency matrix:
import numpy as np
graph = np.array([0,0,1,0,0,1,0,0,0])
nargs = 3
singleGraph = np.vstack(np.array_split(graph, nargs))
>> [[0 0 1]
    [0 0 1]
    [0 0 0]]
# Where the first row represents node 0's relationships with nodes 0, 1, 2,
# the second row represents node 1's relationships with nodes 0, 1, 2, etc.
What I want is to generate all the possible ways in which these two sets can be connected (i.e. all the possible node combinations as vectors). At the moment I am using itertools.product to generate every possible edge vector, then I work out the positions corresponding to circular connections and same-set connections, and delete every vector that has a connection at one of those positions. Hence, using the sets above, I have the following code:
import numpy as np
import itertools
posArgs= [0,1] # 0->0 / 1->1 is not allowed..neither is 0->1 or 1->0
negArgs= [2]
nargs= len(posArgs+ negArgs)
allPermutations= np.array(list(itertools.product([0,1], repeat=nargs*nargs)))
# Create a list of attacks that we will never need: circular attacks, and attacks between arguments of the same polarity
circularAttacks = (np.arange(0, nargs*nargs, nargs+1)).tolist()
samePolarityAttacks = []
posList = list(itertools.permutations(posArgs, 2))
negList = list(itertools.permutations(negArgs, 2))
totList = posList + negList
for l in totList:
    ptn = ((l[0]+1)*nargs) - ((nargs+1) - l[1]) + 1  # All the odd +1 are to account for the shift in 0 index
    samePolarityAttacks.append(ptn)
graphsToDelete = np.unique([circularAttacks + samePolarityAttacks])
subGraphs = allPermutations[:,graphsToDelete]
cutDownGraphs = np.delete(allPermutations, (np.where(subGraphs>0)[0]).tolist(), axis = 0)
for graph in cutDownGraphs:
    singleGraph = np.vstack(np.array_split(np.array(graph), nargs))
    print(singleGraph)
My problem is that once I have more than 5 nodes across my two sets, itertools.product is trying to produce 2^25 or more vectors. This of course is really expensive and quickly exhausts memory.
Are you aware of a smart way in which I can reshape this code while ensuring my graphs stay in this numpy array format?
--Additional info:
For two sets with one node each, all the possible combinations look as follows (illustration omitted):
Thanks
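One possible direction, sketched here with the example sets above (this is not the original poster's code, just an illustration of the idea): since the invalid positions (self-loops and same-set edges) are known up front, you can enumerate values only for the allowed cross-set edge positions and scatter them into the flat n*n vector, so the output stays in the same numpy format. For the sets above this builds 2^4 = 16 vectors instead of 2^9 = 512.
import itertools
import numpy as np

posArgs = [0, 1]
negArgs = [2]
nargs = len(posArgs + negArgs)

# Flattened positions that may vary: edges between a positive and a negative
# node, in either direction (self-loops are excluded automatically).
allowed = [a * nargs + b
           for a, b in itertools.product(range(nargs), repeat=2)
           if (a in posArgs) != (b in posArgs)]

graphs = []
for values in itertools.product([0, 1], repeat=len(allowed)):
    graph = np.zeros(nargs * nargs, dtype=int)
    graph[allowed] = values       # scatter the chosen edges into the flat vector
    graphs.append(graph)

cutDownGraphs = np.array(graphs)  # same flat-vector format as before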

Related

Faiss : How to create an Index of 10M vectors of size 1024

I want to create an index of nearly 10M vectors of size 1024. Here is the code that I used.
import numpy as np
import faiss
import random
f = 1024
vectors = []
no_of_vectors=10000000
for k in range(no_of_vectors):
    v = [random.gauss(0, 1) for z in range(f)]
    vectors.append(v)
np_vectors = np.array(vectors).astype('float32')
index = faiss.IndexFlatL2(f)
index.add(np_vectors)
faiss.write_index(index, "faiss_index.index")
The code worked for a small number of vectors, but the memory limit is exceeded when the number of vectors reaches about 2M. I used index.add() instead of appending the vectors to a list (vectors=[]), but that didn't work either.
I want to know how to create an index for a large number of vectors.
If you want to continue using Faiss, there is a reference for choosing a different index type, maybe HNSW or IVFPQ.
ref: https://wangzwhu.github.io/home/file/acmmm-t-part3-ann.pdf - go to the last page.
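A rough sketch of what an IVFPQ index could look like here (the nlist, m and batch-size values are illustrative guesses, and the random data below is only a stand-in for the real vectors): the vectors are compressed with product quantization and added in batches instead of being accumulated in one giant Python list first.
import numpy as np
import faiss

f = 1024                  # vector dimension
nlist = 4096              # number of IVF cells (tune for your data)
m = 64                    # PQ sub-quantizers -> 64 bytes per stored vector

quantizer = faiss.IndexFlatL2(f)
index = faiss.IndexIVFPQ(quantizer, f, nlist, m, 8)   # 8 bits per sub-code

# Train on a sample that fits comfortably in memory.
train_sample = np.random.randn(100000, f).astype('float32')
index.train(train_sample)

# Add the 10M vectors in batches.
batch_size = 100000
for start in range(0, 10000000, batch_size):
    batch = np.random.randn(batch_size, f).astype('float32')   # replace with your data
    index.add(batch)

faiss.write_index(index, "faiss_index.index")
With 64-byte PQ codes the 10M vectors need roughly 640 MB for the codes themselves, compared with about 40 GB for the raw float32 vectors in a flat index.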
Another option is to try a distributed solution such as Milvus, which is built on top of ANN libraries like Faiss.

How to remove overlapping blocks from numpy array?

I'm using the cv2.goodFeaturesToTrack function to find feature points in an image. The end goal is to extract square blocks of a certain size, with the feature points being the centers of those blocks.
However, lots of the feature points are close to each other, so the blocks are overlapping, which is not what I want.
This is an example of all feature points (centers):
array([[3536., 1419.],
       [2976., 1024.],
       [3504., 1400.],
       [3574., 1505.],
       [3672., 1453.],
       [3671., 1442.],
       [3489., 1429.],
       [3108.,  737.]])
Let's say I want to find the first n blocks with a blockRadius = 400 which are not overlapping. Any ideas on how to achieve this?
You could get close with scipy.spatial.KDTree - though it doesn't directly support querying square blocks. It can be used in conjunction with another library, python-igraph, which can find connected components of close points quickly:
import numpy as np
from scipy.spatial import KDTree
import igraph as ig

data = np.array([[3536., 1419.],
                 [2976., 1024.],
                 [3504., 1400.],
                 [3574., 1505.],
                 [3672., 1453.],
                 [3671., 1442.],
                 [3489., 1429.],
                 [3108.,  737.]])

edges1 = KDTree(data[:, :1]).query_pairs(r=400)   # pairs close along x
edges2 = KDTree(data[:, 1:]).query_pairs(r=400)   # pairs close along y
g = ig.Graph(n=len(data), edges=edges1 & edges2)  # close along both axes
i = g.clusters()
So i corresponds to a clustering of the point indices into groups of close points (igraph uses its own internal clustering type). Here's a quick preview:
>>> print(i)
Clustering with 8 elements and 2 clusters
[0] 0, 2, 3, 4, 5, 6
[1] 1, 7
>>> pal = ig.drawing.colors.ClusterColoringPalette(len(i))  # number of colors used
color = pal.get_many(i.membership)  # list of color tags
ig.plot(g, bbox=(200, 100), layout=g.layout('circle'), vertex_label=g.vs.indices,
        vertex_color=color, vertex_size=12, vertex_label_size=8)
Example of usage:
>>> [data[n] for n in i]  # or list(i)
[array([[3536., 1419.],
        [3504., 1400.],
        [3574., 1505.],
        [3672., 1453.],
        [3671., 1442.],
        [3489., 1429.]]),
 array([[2976., 1024.],
        [3108.,  737.]])]
Remark: this method works with pairs of close points instead of an n*n distance matrix, which is more memory-efficient in some cases.
You'll need something iterative to do that, as recurrent dropouts like this aren't vectorizable. Something like this will work, I think
import numpy as np
from scipy.spatial.distance import pdist, squareform

c = np.array([[3536., 1419.],
              [2976., 1024.],
              [3504., 1400.],
              [3574., 1505.],
              [3672., 1453.],
              [3671., 1442.],
              [3489., 1429.],
              [3108.,  737.]])

dists = squareform(pdist(c, metric='chebyshev'))  # distance matrix, chebyshev here since you seem to want blocks
indices = np.arange(c.shape[0])  # indices that haven't been dropped (all to start)
out = [0]  # always want the first index
while True:
    try:
        indices = indices[dists[indices[0], indices] > 400]  # drop indices that are inside the threshold
        out.append(indices[0])  # add the next index that hasn't been dropped to the output
    except IndexError:
        break  # once you run out of indices, you'll get an IndexError and you're done

print(out)
[0, 1]
Let's try with a whole bunch of points:
np.random.seed(42)
c = np.random.rand(10000, 2) * 800
dists = squareform(pdist(c, metric='chebyshev'))  # distance matrix, chebyshev here since you seem to want squares
indices = np.arange(c.shape[0])  # indices that haven't been dropped (all to start)
out = [0]  # always want the first index
while True:
    try:
        indices = indices[dists[indices[0], indices] > 400]  # drop indices that are inside the threshold
        out.append(indices[0])  # add the next index that hasn't been dropped to the output
    except IndexError:
        break  # once you run out of indices, you'll get an IndexError and you're done

print(out, pdist(c[out], metric='chebyshev'))
[0, 2, 6, 17] [635.77582886 590.70015659 472.87353138 541.13920029 647.69071411
476.84658995]
So, 4 points (which makes sense, since four 400x400 blocks tile an 800x800 space), mostly low index values (17 << 10000), and the distance between kept points is always > 400.

How to vectorize a function with array lookups

I'm trying to vectorize my fitness function for a Minimum Vector Cover genetic algorithm, but I'm at a loss about how to do it.
As it stands now:
vert_cover_fitness = [1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges for edge in edges]
The dna is a one-dimensional binary array indexed [0..n], where each index corresponds to a vertex and its value indicates whether we have chosen that vertex or not. edges is a two-dimensional positive-integer array, where each value corresponds to a vertex (index) in dna. Both are ndarrays.
Simply explained - if one of the vertices connected by an edge is "selected", then we get a score of one. If not, the function is penalized by -num_edges.
I have tried np.vectorize as an attempt to get away cheap with a lambda function:
fit_func = np.vectorize(lambda edge: 1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges)
vert_cover_fitness = fit_func(edges)
This returns IndexError: invalid index to scalar variable, as the function is applied to each scalar value rather than to each row.
To fix this I tried np.apply_along_axis. This works but it's just a wrapper for a loop so I'm not getting any speedups.
If any Numpy wizards can see some obvious way to do this, I would much appreciate your help. I'm guessing a problem lies with the representation of the problem, and that changing either the dna or edges shapes could help. I'm just not skilled enough to see what I should do.
I came up with this bit of numpy code; it runs about 30x faster than your loop on my randomly generated data.
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges * 2).reshape(-1, 2)
vert_cover_fitness1 = [1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]
vert_cover_fitness2 = np.full([num_edges], -num_edges)
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)
vert_cover_fitness2[mask] = 1.0
print((vert_cover_fitness1 == vert_cover_fitness2).all()) # this shows it's correct
Here is the timeit code used to measure the speedup.
import timeit
setup = """
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges*2).reshape(-1, 2)
"""
python_loop = "[1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]"
print(timeit.timeit(python_loop, setup, number=1000))
vectorised="""
vert_cover_fitness2 = np.full([num_edges], -num_edges)
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)
vert_cover_fitness2[mask] = 1.0
"""
print(timeit.timeit(vectorised, setup, number=1000))
# prints:
# 0.375906624016352
# 0.012783741112798452

python - combining argsort with masking to get nearest values in moving window

I have some code for calculating missing values in an image, based on neighbouring values in a 2D circular window. It also uses the values from one or more temporally-adjacent images at the same locations (i.e. the same 2D window shifted in the 3rd dimension).
For each position that is missing, I need to calculate the value based not necessarily on all the values available in the whole window, but only on the spatially-nearest n cells that do have values (in both images / Z-axis positions), where n is some value less than the total number of cells in the 2D window.
At the moment it's actually quicker to calculate using everything in the window, because my means of sorting to get the nearest n cells with data is the slowest part of the function: it has to be repeated for every gap even though the distances in window coordinates do not change. I'm not sure this is necessary and feel I must be able to get the sorted distances once, and then mask them when selecting only the available cells.
Here's my code for selecting the data to use within a window of the gap cell location:
# radius will in reality be ~100
radius = 2
y,x = np.ogrid[-radius:radius+1, -radius:radius+1]
dist = np.sqrt(x**2 + y**2)
circle_template = dist > radius
# this will in reality be a very large 3 dimensional array
# representing daily images with some gaps, indicated by 0s
dataStack = np.zeros((2,5,5))
dataStack[1] = (np.random.random(25) * 100).reshape(dist.shape)
dataStack[0] = (np.random.random(25) * 100).reshape(dist.shape)
testdata = dataStack[1]
alternatedata = dataStack[0]
random_gap_locations = (np.random.random(25) * 30).reshape(dist.shape) > testdata
testdata[random_gap_locations] = 0
testdata[radius, radius] = 0
# in reality we will go through every gap (zero) location in the data
# for each image and for each gap use slicing to get a window of
# size (radius*2+1, radius*2+1) around it from each image, with the
# gap being at the centre i.e.
# testgaplocation = [radius, radius]
# and the variables testdata, alternatedata below will refer to these
# slices
locations_to_exclude = np.logical_or(circle_template,
                                     np.logical_or(testdata == 0, alternatedata == 0))
# the places that are inside the circular mask and where both images
# have data
locations_to_include = ~locations_to_exclude
number_available = np.count_nonzero(locations_to_include)
# we only want to do the interpolation calculations from the nearest n
# locations that have data available, n will be ~100 in reality
number_required = 3
available_distances = dist[locations_to_include]
available_data = testdata[locations_to_include]
available_alternates = alternatedata[locations_to_include]
if number_available > number_required:
    # In this case we need to find the closest number_required of elements, based
    # on distances recorded in dist, from available_data and available_alternates
    # Having to repeat this argsort for each gap cell calculation is slow and feels
    # like it should be avoidable
    sortedDistanceIndices = available_distances.argsort(kind='mergesort', axis=None)
    requiredIndices = sortedDistanceIndices[0:number_required]
    selected_data = np.take(available_data, requiredIndices)
    selected_alternates = np.take(available_alternates, requiredIndices)
else:
    # we just use available_data and available_alternates as they are...
    pass

# now do stuff with the selected data to calculate a value for the gap cell
This works, but over half of the total time of the function is taken by the argsort of the masked spatial-distance data (~900µs of a total ~1.4ms - and this function will be running tens of billions of times, so this is an important difference!).
I am sure that I must be able to just do this argsort once outside of the function, when the spatial distance window is originally set up, and then include those sort indices in the masking, to get the first howManyToCalculate indices without having to re-do the sort. The answer might involve putting the various bits that we are extracting from, into a record array - but I can't figure out how, if so. Can anyone see how I can make this part of the process more efficient?
So you want to do the sorting outside of the loop:
sorted_dist_idcs = dist.argsort(kind='mergesort', axis=None)
Then using some variables from the original code, this is what I could come up with, though it still feels like a major round-trip..
loc_to_incl_sorted = locations_to_include.take(sorted_dist_idcs)
sorted_dist_idcs_to_incl = sorted_dist_idcs[loc_to_incl_sorted]
required_idcs = sorted_dist_idcs_to_incl[:number_required]
selected_data = testdata.take(required_idcs)
selected_alternates = alternatedata.take(required_idcs)
Note that required_idcs refers to locations in testdata and not available_data as in the original code. In this snippet I used take to conveniently index the flattened array.
@moarningsun - thanks for the comment and answer. These got me on the right track, but they don't quite work for me when the gap is < radius from the edge of the data: in that case I use a window around the gap cell which is "trimmed" to the data bounds. In this situation the indices reflect the "full" window and thus can't be used to select cells from the bounded window.
Unfortunately I edited that part of my code out when I clarified the original question but it's turned out to be relevant.
I've realised now that if you use argsort again on the output of argsort then you get ranks; i.e. the position that each item would have when the overall array was sorted. We can safely mask these and then take the smallest number_required of them (and do this on a structured array to get the corresponding data at the same time).
This implies another sort within the loop, but in fact we can use partitioning rather than a full sort, because all we need is the smallest num_required items. If num_required is substantially less than the number of data items then this is much faster than doing the argsort.
For example with num_required = 80 and num_available = 15000 the full argsort takes ~900µs whereas argpartition followed by index and slice to get the first 80 takes ~110µs. We still need to do the argsort to get the ranks at the outset (rather than just partitioning based on distance) in order to get the stability of the mergesort, and thus get the "right one" when distance is not unique.
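As a toy illustration of those two points (not code from the thread): a stable argsort of an argsort gives the ranks, and argpartition then pulls out the k smallest without a full sort.
import numpy as np

a = np.array([3.0, 1.0, 2.0, 1.0])

# argsort of a (stable) argsort gives each element's rank in sorted order;
# mergesort keeps tied values in their original order.
ranks = a.argsort(kind='mergesort').argsort()
print(ranks)        # [3 0 2 1]

# argpartition only guarantees that the k smallest ranks land in the first k
# slots, which is enough here and much cheaper than a full sort when k << a.size.
k = 2
smallest_k = np.argpartition(ranks, k)[:k]
print(smallest_k)   # the indices of the two lowest-ranked elements (1 and 3)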
My code as shown below now runs in ~610µs on real data, including the actual calculations that aren't shown here. I'm happy with that now, but there seem to be several other apparently minor factors that can influence the runtime in ways that are hard to understand.
For example, putting circle_template in the structured array along with dist, ranks, and another field not shown here doubles the runtime of the overall function (even if we don't access circle_template in the loop!). Even worse, using np.partition on the structured array with order=['ranks'] increases the overall function runtime by almost two orders of magnitude vs using np.argpartition as shown below!
# radius will in reality be ~100
radius = 2
y,x = np.ogrid[-radius:radius+1, -radius:radius+1]
dist = np.sqrt(x**2 + y**2)
circle_template = dist > radius
ranks = dist.argsort(axis=None,kind='mergesort').argsort().reshape(dist.shape)
diam = radius * 2 + 1
# putting circle_template in this array too doubles overall function runtime!
fullWindowArray = np.zeros((diam, diam), dtype=[('ranks', ranks.dtype.str),
                                                ('thisdata', 'f8'),        # 'f8' stands in for the image stack's dtype
                                                ('alternatedata', 'f8'),   # 'f8' stands in for the image stack's dtype
                                                ('dist', dist.dtype.str)])
fullWindowArray['ranks'] = ranks
fullWindowArray['dist'] = dist
# this will in reality be a very large 3 dimensional array
# representing daily images with some gaps, indicated by 0s
dataStack = np.zeros((2,5,5))
dataStack[1] = (np.random.random(25) * 100).reshape(dist.shape)
dataStack[0] = (np.random.random(25) * 100).reshape(dist.shape)
testdata = dataStack[1]
alternatedata = dataStack[0]
random_gap_locations = (np.random.random(25) * 30).reshape(dist.shape) > testdata
testdata[random_gap_locations] = 0
testdata[radius, radius] = 0
# in reality we will loop here to go through every gap (zero) location in the data
# for each image
gapz, gapy, gapx = 1, radius, radius
desLeft, desRight = gapx - radius, gapx + radius+1
desTop, desBottom = gapy - radius, gapy + radius+1
extentB, extentR = dataStack.shape[1:]
# handle the case where the gap is < search radius from the edge of
# the data. If this is the case, we can't use the full
# diam * diam window
dataL = max(0, desLeft)
maskL = 0 if desLeft >= 0 else abs(dataL - desLeft)
dataT = max(0, desTop)
maskT = 0 if desTop >= 0 else abs(dataT - desTop)
dataR = min(desRight, extentR)
maskR = diam if desRight <= extentR else diam - (desRight - extentR)
dataB = min(desBottom,extentB)
maskB = diam if desBottom <= extentB else diam - (desBottom - extentB)
# get the slice that we will be working within
# ranks and dist are already populated
boundedWindowArray = fullWindowArray[maskT:maskB, maskL:maskR]
boundedWindowArray['alternatedata'] = alternatedata[dataT:dataB, dataL:dataR]
boundedWindowArray['thisdata'] = testdata[dataT:dataB, dataL:dataR]
# circle_template is deliberately not stored in the structured array (see above),
# so slice it to the same bounded window here
locations_to_exclude = np.logical_or(circle_template[maskT:maskB, maskL:maskR],
                                     np.logical_or(boundedWindowArray['thisdata'] == 0,
                                                   boundedWindowArray['alternatedata'] == 0))
# the places that are inside the circular mask and where both images
# have data
locations_to_include = ~locations_to_exclude
number_available = np.count_nonzero(locations_to_include)
# we only want to do the interpolation calculations from the nearest n
# locations that have data available, n will be ~100 in reality
number_required = 3
data_to_use = boundedWindowArray[locations_to_include]
if number_available > number_required:
    # argpartition seems to be v fast when number_required is
    # substantially < data_to_use.size
    # But partition on the structured array itself with order=['ranks']
    # is almost 2 orders of magnitude slower!
    reqIndices = np.argpartition(data_to_use['ranks'], number_required)[:number_required]
    data_to_use = np.take(data_to_use, reqIndices)
else:
    # we just use available_data and available_alternates as they are...
    pass
# now do stuff with the selected data to calculate a value for the gap cell

Python - Iter through identified component features

I am facing a huge problem. Using the Python libraries NumPy and SciPy, I identified several features in a large array. For this purpose, I created a 3x3 neighbour structure and used it for a connected component analysis --> see the docs.
from scipy import ndimage
struct = ndimage.generate_binary_structure(2, 2)
labeled_array, num_features = ndimage.label(array, struct)
My problem now is that I want to iterate through all identified features in a loop. Does anyone have an idea how to address individual features in the resulting NumPy array?
Here's an example of handling features identified by ndimage.label. Whether this helps you or not depends on what you want to do with the features.
import numpy as np
import scipy.ndimage as ndi
import matplotlib.pyplot as plt
# Make a small array for the demonstration.
# The ndimage.label() function treats 0 as the "background".
a = np.zeros((16, 16), dtype=int)
a[:6, :8] = 1
a[9:, :5] = 1
a[8:, 13:] = 2
a[5:13, 6:12] = 3
struct = ndi.generate_binary_structure(2, 2)
lbl, n = ndi.label(a, struct)
# Plot the original array.
plt.figure(figsize=(11, 4))
plt.subplot(1, n + 1, 1)
plt.imshow(a, interpolation='nearest')
plt.title("Original")
plt.axis('off')
# Plot the isolated features found by label().
for i in range(1, n + 1):
    # Make an array of zeros the same shape as `a`.
    feature = np.zeros_like(a, dtype=int)
    # Set the elements that are part of feature i to 1.
    # Feature i consists of elements in `lbl` where the value is i.
    # This statement uses numpy's "fancy indexing" to set the corresponding
    # elements of `feature` to 1.
    feature[lbl == i] = 1
    # Make an image plot of the feature.
    plt.subplot(1, n + 1, i + 1)
    plt.imshow(feature, interpolation='nearest', cmap=plt.cm.copper)
    plt.title("Feature {:d}".format(i))
    plt.axis('off')
plt.show()
Here's the image generated by the script:
Just a quick note on an alternative way to solve the above-mentioned problem. Instead of using NumPy "fancy indexing", one could also use the ndimage find_objects function.
example:
# Returns a list of slices for the labeled array. The slices represent
# the position of the features in the labeled array.
s = ndi.find_objects(lbl)
# Then you can simply output the patches
for i in range(n):
    print(a[s[i]])
I will leave the question open because I couldn't solve an additional problem that arose. I want to get the size of the features (already solved, quite easy via ndi.sum()) as well as the number of non-labeled cells in the direct vicinity of each feature (i.e. counting the number of zeros around the feature).
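A sketch of one way to get both numbers, continuing from the a, lbl, n and struct variables in the example above (the dilation-based neighbour count is my own suggestion, not something from the thread):
import numpy as np
from scipy import ndimage as ndi

# Size of each feature (number of cells per label), via ndi.sum as mentioned.
sizes = ndi.sum(np.ones_like(lbl), lbl, index=np.arange(1, n + 1))

# Count the background (zero) cells directly adjacent to each feature:
# dilate the feature mask with the same 3x3 structure, keep only the ring of
# cells the dilation added, and count those that are zero in the original array.
for i in range(1, n + 1):
    feature = lbl == i
    ring = ndi.binary_dilation(feature, structure=struct) & ~feature
    zeros_nearby = np.count_nonzero(ring & (a == 0))
    print("Feature {:d}: size {:g}, adjacent zeros {:d}".format(i, sizes[i - 1], zeros_nearby))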
