Pairwise Set Intersection in Python

If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.
For example, if I have the following sets:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
I want to be able to find:
intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}
Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:
intersections_C = {"a": {"A", "C"},
                   "c": {"A", "B", "C"},
                   "e": {"B", "C"}}
I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs, and then iterate through all of those values to create lists such as intersections_C above, but I'm not sure how that scales as n increases and the sets become very large.
Some additional background information:
Each of the sets is of roughly the same size, but also very large (large enough that storing them all in memory is a realistic concern; an algorithm that avoids that would be preferred, though it is not necessary)
The size of the intersection between any two sets is very small compared to the sizes of the sets themselves
If it helps, we can assume anything we need to about the ordering of the input sets.

This ought to do what you want:
import random as RND
import string
import itertools as IT

Mock some data:
fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for _ in range(5)]
Generate an index list for the sets in S, so the sets can be referenced more concisely below:
idx = range(len(S))
Get all possible unique pairs of the items in S; since set intersection is commutative, we want the combinations rather than the permutations:
pairs = IT.combinations(idx, 2)
Write a function to perform the set intersection:
nt = lambda a, b: S[a].intersection(S[b])
Fold this function over the pairs, keying the result of each call to its arguments:
res = dict((t, nt(*t)) for t in pairs)
The result, formatted per the first option recited in the OP, is a dictionary in which each value is the set intersection of two sequences, keyed to a tuple comprising the indices of those two sequences.
This solution is really just two lines of code: (i) calculate the combinations; (ii) apply some function over each combination, storing the returned values in a key-value container.
The memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, i.e.
res = ( (t, nt(*t)) for t in pairs )
Notice that with this approach, neither the sequence of pairs nor the corresponding intersections has been written out in memory; i.e., both pairs and res are iterators.
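If you want the second format recited in the OP instead (a map from each item to the sets that contain it), here is a minimal sketch along the same lines; the labels "ABCDE" are just hypothetical names for the five mock sets:
from collections import defaultdict

membership = defaultdict(set)
for label, s in zip("ABCDE", S):   # hypothetical labels for the mock sets
    for item in s:
        membership[item].add(label)

# keep only the items that appear in more than one set
shared = {item: names for item, names in membership.items() if len(names) > 1}
This is the dictionary-mapping idea from the question itself; it makes a single pass over all n sets, so it runs in O(n*m) time.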

If we can assume that the input sets are sorted, a pseudo-mergesort approach seems promising: treat each set as a sorted stream, advance the streams in parallel, and always advance only those whose current value is the lowest among all the iterators. Each time an iterator is advanced, compare its new value with the current minimum, and dump the matches into your same-item collections.
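A minimal sketch of that idea using heapq.merge, which keeps only one element per stream in memory at a time (the function name shared_items is mine):
import heapq
from itertools import groupby
from operator import itemgetter

def shared_items(sorted_lists):
    # Tag every value with the index of its source list, merge the sorted
    # streams lazily, then group equal values; a group that spans more than
    # one source is an item shared by several sets.
    streams = (((value, i) for value in lst) for i, lst in enumerate(sorted_lists))
    merged = heapq.merge(*streams)
    result = {}
    for value, group in groupby(merged, key=itemgetter(0)):
        owners = {i for _, i in group}
        if len(owners) > 1:
            result[value] = owners
    return result

A, B, C = {"a", "b", "c"}, {"c", "d", "e"}, {"a", "c", "e"}
shared_items([sorted(A), sorted(B), sorted(C)])
# {'a': {0, 2}, 'c': {0, 1, 2}, 'e': {1, 2}}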

How about using the intersection method of set? See below:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
intersect_AB = A.intersection(B)
intersect_BC = B.intersection(C)
intersect_AC = A.intersection(C)
print(intersect_AB, intersect_BC, intersect_AC)

Related

Is there a method in python for randomly selecting a subset of tuples with weighting and no replacement?

I have a list of tuples from which I want to randomly choose a subset, using weights to affect how likely each element is to be chosen, and without replacement.
I've tried random.choices(), which handles the subset size and the weights, but it samples with replacement, so I get the same element repeatedly in the subset. For example, if my list is [red (10%), orange (10%), blue (10%), yellow (10%), green (50%)] and I want a subset of 3 of them, random.choices often results in [green, green, blue].
I've also looked at random.sample(), which doesn't use replacement but doesn't allow for weighting, and at numpy.random.choice(), which requires a 1-D array (which a list of tuples is not).
Is there another method I should be looking at?
You can do this with numpy:
import numpy as np

list_len = len(tuple_list)
np_list = np.arange(list_len)
draw = np.random.choice(np_list, number_of_items_to_pick, p=list_prob, replace=False)
selected = [tuple_list[n] for n in draw]
This will choose the tuples without replacement by setting replace=False, with tuple_list being your list of tuples and list_prob being their probabilities of being chosen. Creating a list of indexes with np.arange(list_len) lets you get around the dimensionality issue by randomly selecting the indexes of the tuples you want rather than the tuples themselves.
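For instance, with the question's colors and weights (the one-element tuples below are stand-ins; the probabilities must sum to 1):
import numpy as np

tuple_list = [("red",), ("orange",), ("blue",), ("yellow",), ("green",)]
list_prob = [0.10, 0.10, 0.10, 0.10, 0.50]   # must sum to 1

draw = np.random.choice(np.arange(len(tuple_list)), 3, p=list_prob, replace=False)
selected = [tuple_list[n] for n in draw]     # e.g. [('green',), ('red',), ('blue',)]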

Given a set t of tuples containing elements from the set S, what is the most efficient way to build another set whose members are not contained in t?

For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
[7,3],
[3,1]]
d = some_algorithm(S,t)
/*
d =[[2,1],
[3,2],
[7,2]]
*/
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work with general solutions.
In my naive attempt, I just use
d = np.random.choice(S,(m,2))
non_dupes = [i not in t for i in d]
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention that it rarely results in an (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe I could make a large hash map of the values in t so that checking for membership in t is O(1), but that produces the same issue, just with memory. Is there a more efficient way?
An approximate solution is also okay.
My naive attempt would be a base-transformation function to reduce the problem to an integer-set problem.
Definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
Since S is a set, we can write each element of t as an M-digit number in base L, using the elements of S as digits.
For M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
Since you can now represent t as an integer set (read: hash table), calculating the inverse set d becomes rather trivial:
loop from 0 to L^M - 1 and ask your hash table whether each number is in t; if not, it belongs to d
If the size of S is too big, you can also just draw random numbers and check them against the hash table to get a subset of the inverse of t.
Does this help you?
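A minimal sketch of this scheme (the helper names encode and decode, and the brute-force loop at the end, are mine):
def encode(tup, index, L):
    # f(x) = sum over n of I(x[n]) * L**n
    n = 0
    for power, x in enumerate(tup):
        n += index[x] * L ** power
    return n

def decode(n, S, M):
    # f^-1(x): peel off base-L digits and look them up in S
    L = len(S)
    out = []
    for _ in range(M):
        out.append(S[n % L])
        n //= L
    return tuple(out)

S = [0, 1, 2, 3, 7]
index = {x: i for i, x in enumerate(S)}            # I(x)
t = [(0, 1), (7, 3), (3, 1)]
t_codes = {encode(tp, index, len(S)) for tp in t}  # t as an integer set

# the inverse set d: every code from 0 to L^M - 1 whose number is not in t
d = [decode(n, S, 2) for n in range(len(S) ** 2) if n not in t_codes]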
If |t| + |d| << |S|^2, then the probability that some random tuple has already been chosen (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that by redrawing elements until you get a new one, you spend O((1/(1-C)) * |d|) time to process the new elements (on average), which is O(|d|) if C is indeed constant.
Checking if an element has already been "seen" can be done in several ways (see the sketch after this list):
Keeping hash sets of t and d. This requires extra space, but each lookup takes constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; it will make some errors, saying an element has been "seen" though it was not, but never the other way around, so you will still get all elements of d as unique.
Sorting t in place and using binary search. This adds O(|t| log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2) time solution could be to run a Fisher-Yates shuffle on the available choices and take the first |d| elements that do not appear in t.
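A minimal sketch of the redraw-until-new approach with a hash set (the function name and signature are mine):
import random

def sample_complement(S, t, m, M=2):
    # Keep drawing random M-tuples over S until we have m that are not in t.
    # If (|t|+|d|)/|S|**M stays below a constant C < 1, the expected number
    # of redraws per new element is 1/(1-C).
    seen = set(map(tuple, t))
    d = []
    while len(d) < m:
        cand = tuple(random.choice(S) for _ in range(M))
        if cand not in seen:
            seen.add(cand)   # also rules out duplicates within d itself
            d.append(cand)
    return d

S = [0, 1, 2, 3, 7]
t = [[0, 1], [7, 3], [3, 1]]
d = sample_complement(S, t, 3)   # e.g. [(2, 1), (3, 2), (7, 7)]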

Generating unordered pairs of disjoint subsets of a set of integers

I am trying to generate the unordered pairs of disjoint subsets of a set S of integers.
As shown in [1], when S consists of n integers, we generate around 3^n/2 pairs.
Now, I know how to generate all 2^n subsets of S (i.e. the powerset of S), and for every subset (consisting of k integers) I could thus generate the kC2 (k choose 2) possible pairs.
But this is inefficient, because pairs will end up being generated more than once.
Therefore, I am wondering if there is some efficient (recursive) way to generate these pairs of subsets from S? I could not find any existing implementations, and my own attempts using, for example, Python's itertools were not successful so far.
[1] Total number of unordered pairs of disjoint subsets of S (MathOverflow)
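For what it's worth, a minimal sketch of one direct approach (my own, not from an existing implementation): give each element of S one of three labels, "in A", "in B", or "in neither", which is exactly where the ~3^n/2 count from [1] comes from, and keep only one canonical ordering of each pair so that nothing is generated twice:
from itertools import product

def disjoint_subset_pairs(S):
    # Each element independently goes to A (label 1), B (label 2), or neither (0).
    S = list(S)
    for labels in product((0, 1, 2), repeat=len(S)):
        A = tuple(x for x, lab in zip(S, labels) if lab == 1)
        B = tuple(x for x, lab in zip(S, labels) if lab == 2)
        if A <= B:   # canonical order: emit each unordered pair exactly once
            yield set(A), set(B)

pairs = list(disjoint_subset_pairs({1, 2, 3}))
len(pairs)   # 14 == (3**3 + 1) // 2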

Choosing python data structures to speed up algorithm implementation

So I'm given a large collection (roughly 200k) of lists. Each contains a subset of the numbers 0 through 27. I want to return two of the lists where the product of their lengths is greater than the product of the lengths of any other pair of lists. There's another condition, namely that the lists have no numbers in common.
There's an algorithm I found for this (can't remember the source, apologies for non-specificity of props) which exploits the fact that there are fewer total subsets of the numbers 0 through 27 than there are words in the dictionary.
The first thing I've done is loop through all the lists, find the unique subset of integers that comprises each one, and index it as a number between 0 and 1<<28, as follows:
def index_lists(lists):
    index_hash = {}
    for raw_list in lists:
        length = len(raw_list)
        index = find_index(raw_list)
        if length > index_hash.get(index, {}).get("length", 0):
            index_hash[index] = {"list": raw_list, "length": length}
    return index_hash
This gives me, for each subset that's actually contained in the collection of lists given, the longest such list and its length. Naturally, not all subsets from 0 to (1<<28)-1 are necessarily included, since there's no guarantee the supplied collection has a list containing each unique subset.
What I then want, for each subset 0 through 1<<28 (all of them this time), is the longest list that contains at most that subset. This is the part that is killing me. At a high level it should, for each subset, first check to see if that subset is contained in index_hash. It should then compare the length of that entry in the hash (if it exists there) to the lengths stored previously in the current hash for the current subset minus one number (this is an inner loop, 28 strong). The greatest of these is stored in this new hash for the current subset of the outer loop. The code right now looks like this:
def at_most_hash(index_hash):
    most_hash = {}
    for i in range(1 << 28):  # pretty sure this is a bad idea
        max_entry = index_hash.get(i)
        if max_entry:
            max_length = max_entry["length"]
            max_list = max_entry["list"]
        else:
            max_length = 0
            max_list = []
        for j in range(28):  # again, probably not great
            subset_index = i & ~(1 << j)  # gets us a pre-computed subset
            at_most_entry = most_hash.get(subset_index, {})
            at_most_length = at_most_entry.get("length", 0)
            if at_most_length > max_length:
                max_length = at_most_length
                max_list = at_most_entry["list"]
        most_hash[i] = {"length": max_length, "list": max_list}
    return most_hash
This loop obviously takes several forevers to complete. I feel that I'm new enough to python that my choice of how to iterate and what data structures to use may have been completely disastrous. Not to mention the prospective memory problems from attempting to fill the dictionary. Is there perhaps a better structure or package to use as data structures? Or a better way to set up the iteration? Or maybe I can do this more sparsely?
The next part of the algorithm just cycles through all the lists we were given and takes the product of each subset's max_length and its complementary subset's max_length by looking them up in at_most_hash, taking the max of those products.
Any suggestions here? I appreciate the patience for wading through my long-winded question and less than decent attempt at coding this up.
In theory, this is still a better approach than working with the collection of lists alone, since that approach is roughly O(200k^2) and this one is roughly O(28 * 2^28 + 200k), yet my implementation is holding me back.
Given that your indexes are just ints, you could save some time and space by using lists instead of dicts. I'd go further and bring in NumPy arrays. They offer compact storage representation and efficient operations that let you implicitly perform repetitive work in C, bypassing a ton of interpreter overhead.
Instead of index_hash, we start by building a NumPy array where index_array[i] is the length of the longest list whose set of elements is represented by i, or 0 if there is no such list:
import numpy
index_array = numpy.zeros(1 << 28, dtype=int)  # We could probably get away with dtype=int8.
for raw_list in lists:
    i = find_index(raw_list)
    index_array[i] = max(index_array[i], len(raw_list))
We then use NumPy operations to bubble up the lengths in C instead of interpreted Python. Things might get confusing from here:
for bit_index in range(28):
    index_array = index_array.reshape([1 << (28 - bit_index), 1 << bit_index])
    numpy.maximum(index_array[::2], index_array[1::2], out=index_array[1::2])
index_array = index_array.reshape([1 << 28])
Each reshape call takes a new view of the array where data in even-numbered rows corresponds to sets with the bit at bit_index clear, and data in odd-numbered rows corresponds to sets with that bit set. The numpy.maximum call then performs the bubble-up operation for that bit. At the end, each cell index_array[i] holds the length of the longest list whose elements are a subset of the set represented by i.
We then compute the products of lengths at complementary indices:
products = index_array * index_array[::-1]  # index (1<<28)-1-i is the complement of set i;
                                            # we'd probably have to adjust this part
                                            # if we picked dtype=int8 earlier.
find where the best product is:
best_product_index = products.argmax()
and the longest lists whose elements are subsets of the set represented by best_product_index and its complement are the lists we want.
This is a bit too long for a comment, so I will post it as an answer. A more direct way to index your subsets as integers is to use "bitsets", with each bit in the binary representation corresponding to one of the numbers.
For example, the set {0,2,3} would be represented by 2^0 + 2^2 + 2^3 = 13, and {4,5} would be represented by 2^4 + 2^5 = 48.
This would allow you to use simple lists instead of dictionaries and Python's generic hashing function.
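The question's find_index helper isn't shown; a minimal version of this bitset encoding might look like:
def find_index(nums):
    # Encode a collection of numbers from 0..27 as a 28-bit integer bitset.
    index = 0
    for num in nums:
        index |= 1 << num
    return index

find_index([0, 2, 3])   # 13
find_index([4, 5])      # 48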

faster way of finding combinations?

I'm trying to find all possible sub-intervals of np.linspace(0,n,n*10+1)
where the sub-intervals are wider than some width (say width=0.5).
So I tried this using itertools:
import itertools
import numpy as np

ranges = np.linspace(0, n, n*10 + 1)
# find all combinations
combinations = list(itertools.combinations(ranges, 2))
# use a for-loop to calculate the width of each interval
# and append it to a new list if the width is greater than 0.5
save = []
for i in range(len(combinations)):
    if combinations[i][1] - combinations[i][0] > 0.5:
        save.append(combinations[i])
but this takes too long and uses a huge amount of RAM, especially as n gets bigger.
So I'm wondering whether I can make this faster, or apply the constraint while collecting the combinations.
itertools.combinations(...) returns an iterator; that means the returned object produces its values as they are needed instead of calculating everything at once and storing the result in memory. You force immediate calculation and storage by converting it to a list, but this is unnecessary. Simply iterate over the combinations object instead of making a list of it and iterating over the indexes (which should not be done anyway):
import itertools
import numpy as np

ranges = np.linspace(0, n, n*10 + 1)  # alternatively 'range(100)' or so, to test
combinations = itertools.combinations(ranges, 2)
save = []
for c in combinations:
    if c[1] - c[0] > 0.5:
        save.append(c)
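As an aside, since np.linspace(0, n, n*10+1) is evenly spaced with step 0.1, a pair is wider than 0.5 exactly when its indices are at least 6 apart, so one could skip the filtering entirely (a sketch, assuming that fixed step):
import numpy as np

n = 10
ranges = np.linspace(0, n, n*10 + 1)
# points i and j are more than 0.5 apart exactly when j - i >= 6
save = [(ranges[i], ranges[j])
        for i in range(len(ranges))
        for j in range(i + 6, len(ranges))]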
