Bitwise operations between every pair of elements in array - python

Here is the situation:
I have a graph type structure, an adjacency list, and each element of this adjacency list is a 1 dimensional array (either numpy, or bcolz.. not sure if I will use bcolz or not).
Each 1-dimensional array represents graph elements that could possibly connect, in the form of binary sequences. For them to connect, they need to have a specific bitwise intersection value.
Therefore, for each 1 dimensional array in my adjacency list, I want to do the bitwise "and" between every combination of two elements in the given array.
This will possibly be used for huge graph breadth-first traversal, so we may be talking a very very large number of elements.
Is this something I can do with vectorized operations? Should I be using a different structure? What is a good way to do this? I am willing to completely restructure everything if there could be a significant performance boost.
Is it as simple as looping through the individual elements and then broadcasting(correct terminology?) & against the entire array? Thanks.
quick edit
As an extra note, I am using Python integers for my byte sequences. From my understanding, these don't play well with numpy (the integers get too big for type long long), so I have to create arrays of object type. Does this potentially cause a huge slowdown? Is it a reason to use a different structure?
An Example
# Create an n x n adjacency list, where n is the number of graph nodes.
# Map each graph node to a value 2^k:
nodevals = {}
for i in range(n):
    nodevals[i] = 2**(i+1)
# Each edge in our graph consists of two nodes, which are mapped to powers of two. Take their sum (bitwise OR) and place it in the adjacency list:
adjlist = [[[] for j in range(n)] for i in range(n)]
for i in range(n):
    for j in range(n):
        adjlist[i][j].append(nodevals[i] | nodevals[j])
# We now have our first adjacency list, which is just bare edges. These edges can be connected by row or column, by taking the intersection of the sum (nodevals[i]|nodevals[j]) with the other edge (nodevals[i2]|nodevals[j2]), and checking if it equals the connection point for each.
# This may not seem useful for individual edges, but in future iterations we can do this:
# After 3 iterations. (5,1) connected to (1,9), and then this connected to (7,5), for example:
adjlist[5][1] & adjlist[1][9] == 1
adjlist2[5][9] == adjlist[5][1]|adjlist[1][9]
adjlist[7][5] & adjlist2[5][9] == 5
adjlist3[7][9] == adjlist[7][5]|adjlist2[5][9]
# So, you may see how this could be useful for efficient traversal.
# However, it becomes more complicated, because as we increase the length of our subpaths, or "pseudo-edges", or whatever you want to call them,
# the arrays for the given (i,j) hold more and more subpath sums that can potentially be connected.
# Soon the arrays can become very large, which is when I would want to be able to calculate intersections efficiently.
# AND, for this problem in particular, I want to be able to connect edges against the SAME edge, so I want to do the bitwise intersection between all pairs of elements in the given array (i.e., the given indices [i][j] of that adjacency list).
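To make the "every pair of elements" operation concrete, here is a minimal sketch using numpy index pairs rather than a Python loop. It assumes each value fits in an unsigned 64-bit integer, which is not true of arbitrary Python integers. The function name pairwise_and is hypothetical.

```python
import numpy as np

def pairwise_and(values):
    """Return (i, j, values[i] & values[j]) for every pair with i < j."""
    # Assumption: every value fits in 64 bits.
    a = np.asarray(values, dtype=np.uint64)
    # Index pairs strictly above the diagonal, so each pair appears once.
    i, j = np.triu_indices(len(a), k=1)
    return i, j, a[i] & a[j]

i, j, inter = pairwise_and([0b0110, 0b1010, 0b1100])
```

This materialises n*(n-1)/2 results at once, so for very large arrays it would have to be done in chunks. For values wider than 64 bits an object-dtype array is the usual fallback, but numpy then calls back into Python for each element, which loses most of the speed benefit.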

Related

Given a set t of tuples containing elements from the set S, what is the most efficient way to build another set whose members are not contained in t?

For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
     [7,3],
     [3,1]]
d = some_algorithm(S,t)
# e.g.
# d = [[2,1],
#      [3,2],
#      [7,2]]
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function to reduce the problem to an integer set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set, we can write each element of t as an M-digit number in base L, using the indices of the elements of S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: take x mod L to get the index of the least significant digit, replace x with floor(x/L), and repeat until all M indices are extracted. then look up the values in S and construct the tuple.
since you can now represent t as an integer set (read: hash table), calculating the inverse set d becomes rather trivial:
loop from 0 to L^M - 1 and ask your hash table whether each number is in t; if it is not, it belongs to d
if S is too big, you can instead draw random numbers and check them against the hash table to get a subset of the inverse of t
does this help you?
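The base transform described above can be sketched as follows. This is only an illustration under the stated assumptions (S has unique elements, every tuple in t has M members drawn from S). The names encode, decode and complement are hypothetical.

```python
def encode(tup, index, L):
    """f(x): read the tuple as a base-L number, with tup[0] as the least significant digit."""
    return sum(index[x] * L**pos for pos, x in enumerate(tup))

def decode(code, S, L, M):
    """f^-1(x): peel off M base-L digits and look each one up in S."""
    out = []
    for _ in range(M):
        code, digit = divmod(code, L)
        out.append(S[digit])
    return tuple(out)

def complement(S, t, M=2):
    """All M-tuples over S that do not occur in t (enumerates L**M codes)."""
    L = len(S)
    index = {x: i for i, x in enumerate(S)}   # I(x): position of x in S
    seen = {encode(x, index, L) for x in t}   # t as an integer hash set
    return [decode(c, S, L, M) for c in range(L**M) if c not in seen]

S = [0, 1, 2, 3, 7]
t = [(0, 1), (7, 3), (3, 1)]
d = complement(S, t)
```

Enumerating all L**M codes is only practical when S is small; for a large S the random-draw variant mentioned in the answer is the realistic option.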
If |t| + |d| << |S|^2 then the probability of some random tuple to be chosen again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing until you get a new element, you need O((1/(1-C)) * |d|) draws in expectation to produce d, which is O(|d|) if C is indeed constant.
Checking whether an element is already "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup takes constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen. It will make some errors, reporting an element as already "seen" though it was not, but never the other way around, so all elements in d will still be unique.
Sorting t in place and using binary search. This adds O(|t| log|t|) pre-processing and O(log|t|) for each lookup, but requires no additional space (other than where you store d).
If in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to use Fisher Yates shuffle on the available choices, and choosing the first |d| elements that do not appear in t.
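As a rough illustration of the redraw-until-new idea with a hash set (the first of the lookup options above), here is a minimal sketch for 2-tuples. It assumes S has unique elements and that t only contains tuples over S. The function name sample_unseen and its parameters are hypothetical.

```python
import random

def sample_unseen(S, t, m, rng=None):
    """Draw m distinct 2-tuples over S that do not occur in t, by rejection sampling."""
    rng = rng or random.Random()
    seen = {tuple(x) for x in t}          # hash set: O(1) expected lookups
    if m > len(S) ** 2 - len(seen):
        raise ValueError("not enough unused tuples to draw from")
    d = []
    while len(d) < m:
        cand = (rng.choice(S), rng.choice(S))
        if cand not in seen:              # redraw on collision with t or d
            seen.add(cand)
            d.append(cand)
    return d

S = [0, 1, 2, 3, 7]
t = [[0, 1], [7, 3], [3, 1]]
d = sample_unseen(S, t, 5, rng=random.Random(0))
```

The loop is only fast while (|t| + |d|) / |S|^2 stays well below 1; close to 1 the number of redraws grows without bound, which is where the shuffle-based approach becomes preferable.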

Best way to represent a graph to be stored in a text file [closed]

My problem involves creating a directed graph, checking if it is unique by comparing it to a text file containing graphs and, if it is unique, appending it to the file. What would be the best representation of a graph to use in that case?
I'm using Python and I'll be using brute-force to check if graphs are isomorphic, since the graphs are small and have some restrictions.
There is a standard text-based format called DOT which allows you to work with directed and undirected graphs, and would give you the benefit of using a variety of different libraries to work with your graphs. Notably Graphviz, which lets you read and write DOT files and render them graphically; in Python, libraries such as NetworkX can load DOT files (through pydot or pygraphviz) and draw them using matplotlib.
Assuming that this is a simple case of how the graphs are represented, you might be OK with a simple CSV format where each line is a single edge and there's some separator between graphs, e.g.:
graph_4345345
A,B
B,C
C,E
E,B
graph_3234766
F,D
B,C
etc.
You could then make use of https://docs.python.org/3/library/csv.html
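A minimal sketch of reading that layout back with the csv module is below. It assumes that a row with a single field is a graph header and a row with two fields is an edge; that convention, and the name read_graphs, are assumptions made for illustration.

```python
import csv
import io

def read_graphs(text):
    """Parse 'header line, then one edge per line' blocks into {name: [(src, dst), ...]}."""
    graphs = {}
    current = None
    for row in csv.reader(io.StringIO(text)):
        if not row:                       # skip blank lines
            continue
        if len(row) == 1:                 # assumption: a single field is a graph header
            current = row[0]
            graphs[current] = []
        elif current is None:
            raise ValueError("edge found before any graph header")
        else:
            graphs[current].append((row[0], row[1]))
    return graphs

sample = "graph_4345345\nA,B\nB,C\nC,E\nE,B\ngraph_3234766\nF,D\nB,C\n"
graphs = read_graphs(sample)
```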
I guess it depends on how you are going to represent your graph as a data structure.
The two most known graph representations as data structures are:
Adjacency matrices
Adjacency lists
Adjacency matrices
For a graph with |V| vertices, an adjacency matrix is a |V|X|V| matrix of 0s and 1s, where the entry in row i and column j is 1 if and only if the edge (i,j) is in the graph. If you want to indicate an edge weight, put it in the row i column j entry, and reserve a special value (perhaps null) to indicate an absent edge.
With an adjacency matrix, we can find out whether an edge is present in constant time, by just looking up the corresponding entry in the matrix. For example, if the adjacency matrix is named graph, then we can query whether edge (i,j) is in the graph by looking at graph[i][j].
For an undirected graph, the adjacency matrix is symmetric: the row i, column j entry is 1 if and only if the row j, column i entry is 1. For a directed graph, the adjacency matrix need not be symmetric.
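For illustration, a small adjacency matrix can be built from an edge list like this. It is a sketch that assumes vertices are numbered from 0 and edges are unweighted; the function name adjacency_matrix is hypothetical.

```python
def adjacency_matrix(n, edges, directed=True):
    """Build an n x n matrix of 0s and 1s from (i, j) edge pairs."""
    graph = [[0] * n for _ in range(n)]
    for i, j in edges:
        graph[i][j] = 1
        if not directed:
            graph[j][i] = 1               # undirected graphs give a symmetric matrix
    return graph

graph = adjacency_matrix(3, [(0, 1), (1, 2)])
has_edge = graph[0][1] == 1               # constant-time edge lookup
```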
Adjacency lists
Representing a graph with adjacency lists combines adjacency matrices with edge lists. For each vertex i, store an array of the vertices adjacent to it. We typically have an array of |V| adjacency lists, one adjacency list per vertex.
Vertex numbers in an adjacency list are not required to appear in any particular order, though it is often convenient to list them in increasing order.
We can get to each vertex's adjacency list in constant time, because we just have to index into an array. To find out whether an edge (i,j) is present in the graph, we go to i's adjacency list in constant time and then look for j in i's adjacency list.
In an undirected graph, vertex j is in vertex i's adjacency list if and only if i is in j's adjacency list. If the graph is weighted, then each item in each adjacency list is either a two-item array or an object, giving the vertex number and the edge weight.
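The adjacency-list representation can be sketched in the same way. Again this assumes 0-based vertex numbers and unweighted edges, and the names adjacency_lists and has_edge are hypothetical.

```python
def adjacency_lists(n, edges, directed=True):
    """Build one list of neighbours per vertex from (i, j) edge pairs."""
    graph = [[] for _ in range(n)]
    for i, j in edges:
        graph[i].append(j)
        if not directed:
            graph[j].append(i)
    return graph

def has_edge(graph, i, j):
    """Index into i's list in constant time, then scan it for j."""
    return j in graph[i]

graph = adjacency_lists(4, [(0, 1), (0, 2), (2, 3)])
```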
Export to file
How to export the data structure to a text file? Well, that's up to you based on how you would read the text file and import it into the data structure you decided to work with.
If I were to do it, I'd probably dump it in the simplest way possible, so that it is easy to read and parse back into the data structure later.
Adjacency list
store graphs in this format:
First line contains two integers: N (number of nodes) and E (number of edges).
Then E lines follow, each containing two integers U and V. Each line represents an edge (going from U to V).
This is what a cycle graph of four nodes would look like:
4 4
1 2
2 3
3 4
4 1
To represent graphs in python you can use a list of lists.
N, E = map(int, input().split())  # first line: two space-separated integers
graph = [[] for x in range(N + 1)]  # nodes are 1-indexed; initially no edge is inserted
for x in range(E):  # read E edges
    u, v = map(int, input().split())
    # insert edge u -> v
    graph[u].append(v)
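Since the goal is to store graphs in a text file, a round trip for this "N E, then one edge per line" format might look like the sketch below. It assumes the 1-indexed list-of-lists layout used above, with index 0 unused. The names dump_graph and load_graph are hypothetical.

```python
def dump_graph(graph):
    """Serialise a 1-indexed list-of-lists graph as 'N E' plus one 'U V' line per edge."""
    edges = [(u, v) for u in range(1, len(graph)) for v in graph[u]]
    lines = [f"{len(graph) - 1} {len(edges)}"]
    lines += [f"{u} {v}" for u, v in edges]
    return "\n".join(lines) + "\n"

def load_graph(text):
    """Parse the text produced by dump_graph back into a list of lists."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n, e = map(int, rows[0])
    graph = [[] for _ in range(n + 1)]    # index 0 is unused
    for u, v in rows[1:1 + e]:
        graph[int(u)].append(int(v))
    return graph

cycle = [[], [2], [3], [4], [1]]          # the four-node cycle from the example
text = dump_graph(cycle)
```

The resulting string can be appended to a file as is; several graphs in one file would need some separator, which this sketch does not define.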

How to keep track of original row indices in Numpy array when comparing to only a slice?

I'm working with a 2D numpy array A, performing a comparison of a one dimensional array, X, against each row in A. As approximate matches are found, I'm keeping track of their indices in A in a dtype=bool array S. I'd like to use S to shrink the field of match candidates in A to improve efficiency. Here's the basic idea in code:
def compare(nxt):
    S[nxt] = 0  # sets boolean
    T = A[nxt, i:] == A[S, :-i]  # T has different dimensions than A
compare() is iterated over and S is progressively populated with False values.
The problem is that the boolean array T is of the same dimensions as the pared down version of A not the original version. I'm hoping to use T to get the indices (in the unsliced A) of the approximate matches for later use.
np.argwhere(T)
This returns a list of indices of the matches, but again in the slice of A.
It seems like there has to be a better way to, at the same time, crop A for more efficient searching and still be able to get the correct index of the matching row.
Any thoughts?
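One possible way to translate slice-relative row numbers back to rows of the unsliced A is to keep the positions of the True entries of S and index into them. The sketch below assumes S is a 1-D boolean mask over the rows of A and T is the 2-D comparison result on A[S]. The function name original_row_indices is hypothetical.

```python
import numpy as np

def original_row_indices(S, T):
    """Map np.argwhere(T) rows (relative to A[S]) back to row numbers in the full A."""
    kept = np.flatnonzero(S)              # original row number of each surviving row
    hits = np.argwhere(T)                 # (row_in_slice, column) pairs
    hits[:, 0] = kept[hits[:, 0]]         # replace slice rows with original rows
    return hits

S = np.array([True, False, True, True, False])
T = np.array([[False, True],
              [False, False],
              [True, False]])
hits = original_row_indices(S, T)
```

Only the row coordinate is translated; the column coordinate stays relative to the sliced columns.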

How can a list's lists be modified efficiently to have equal length to the list's longest list?

I have a 2-D list of shape (300,000, X), where each of the sublists has a different size. In order to convert the data to a Tensor, all of the sublists need to have equal length, but I don't want to lose any data from my sublists in the conversion.
That means that I need to fill all sublists smaller than the longest sublist with filler (-1) in order to create a rectangular array. For my current dataset, the longest sublist is of length 5037.
My conversion code is below:
for seq in new_format:
    for i in range(0, length-len(seq)):
        seq.append(-1)
However, when there are 300,000 sequences in new_format, and length-len(seq) is generally >4000, the process is extraordinarily slow. How can I speed this process up or get around the issue efficiently?
Individual append calls can be rather slow, so use list multiplication to create the whole filler value at once, then concatenate it all at once, e.g.:
for seq in new_format:
    seq += [-1] * (length-len(seq))
seq.extend([-1] * (length-len(seq))) would be equivalent (trivially slower due to generalized method call approach, but likely unnoticeable given size of real work).
In theory, seq.extend(itertools.repeat(-1, length-len(seq))) would avoid the potentially large temporaries, but IIRC, the actual CPython implementation of list.__iadd__/list.extend forces the creation of a temporary list anyway (to handle the case where the generator is defined in terms of the list being extended), so it wouldn't actually avoid the temporary.
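Wrapped up as a function, the list-multiplication approach could look like this minimal sketch. The function name pad_in_place is hypothetical, and the target length is taken to be that of the longest sublist.

```python
def pad_in_place(seqs, fill=-1):
    """Pad every sublist with `fill` up to the length of the longest one; return that length."""
    length = max((len(seq) for seq in seqs), default=0)
    for seq in seqs:
        # One concatenation per sublist instead of one append per missing element.
        seq += [fill] * (length - len(seq))
    return length

new_format = [[1, 2, 3], [4], []]
length = pad_in_place(new_format)
```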

Permutations over subarray in python

I have an array of identifiers that have been grouped into threes. For each group, I would like to randomly assign them to one of three sets and to have those assignments stored in another array. So, for a given array of grouped identifiers (I presort them):
groupings = array([1,1,1,2,2,2,3,3,3])
A possible output would be
assignments = array([0,1,2,1,0,2,2,0,1])
Ultimately, I would like to be able to generate many of these assignment lists and to do so efficiently. My current method is just to create a zeros array and set each consecutive subarray of length 3 to a random permutation of 3.
assignment = numpy.zeros((12,10),dtype=int)
for i in range(0,12,3):
    for j in range(10):
        assignment[i:i+3,j] = numpy.random.permutation(3)
Is there a better/faster way?
Two things I can think about:
instead of visiting the 2D array 3 row * 1 column in your inner loop, try to visit it 1*3. Accessing 2D array horizontally first is usually faster than vertically first, since it gives you better spatial locality, which is good for caching.
instead of running numpy.random.permutation(3) each time, if 3 is fixed and is a small number, try to generate the arrays of permutations beforehand and save them into a constant array of array like: (array([0,1,2]), array([0,2,1]), array([1,0,2])...). You just need to randomly pick one array from it each time.
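The precomputed-permutations idea can be taken one step further by picking all permutations in a single vectorized draw. This is a sketch under the assumption that the group size is fixed at 3; the names PERMS and random_assignments are hypothetical.

```python
import itertools
import numpy as np

# All 6 permutations of (0, 1, 2), computed once; shape (6, 3).
PERMS = np.array(list(itertools.permutations(range(3))))

def random_assignments(n_groups, n_samples, rng=None):
    """Return an array of shape (3 * n_groups, n_samples) of random per-group permutations."""
    rng = np.random.default_rng() if rng is None else rng
    # One random permutation index per (group, sample) cell.
    picks = rng.integers(0, len(PERMS), size=(n_groups, n_samples))
    # PERMS[picks] has shape (n_groups, n_samples, 3); move the permutation
    # axis next to the group axis, then merge the two into rows.
    return PERMS[picks].transpose(0, 2, 1).reshape(n_groups * 3, n_samples)

assignment = random_assignments(4, 10, rng=np.random.default_rng(0))
```

Each column of the result is one assignment list, matching the layout of the (12, 10) array in the question.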
