How to find pairwise disjoint sets of a given dataset? - python

Given a data set consisting of lists of varying lengths, similar to this example: pairwise comparisons within a dataset
More context: given a set of update requests for a table, how can I use this to split the requests?
I'm looking to output a set of pairwise disjoint sets such that, if you union all of them, every field is unique (I'm unsure whether the term "pairwise disjoint" is correct for this output).
For example, the input
[1,2,5,6,8]
[4,5,7,9]
[10,11]
[23,45]
can give an output of
[1,2,4,5,6,7,8,9] -- is combined as 5 is common.
[10,11]
[23,45]
Ideally, I'm splitting the given datasets into unique sets which are disjoint.
Each list can have ~150 elements.
The total dataset can be of size ~700.
I tried this: Find in python combinations of mutually exclusive sets from a list's elements, but it does not do what I need.
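One possible way to produce the output described above is to treat lists that share an element as connected and to merge each connected group. The following is only a minimal sketch using a union-find (disjoint-set) structure; the function name `merge_overlapping` is made up for illustration, and it assumes the elements are hashable and sortable.

```python
def merge_overlapping(lists):
    """Merge lists that share at least one element into disjoint groups."""
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root
        # while shortening the path (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for lst in lists:
        for item in lst:
            find(item)            # make sure every item is registered
        for item in lst[1:]:
            union(lst[0], item)   # all items of one list belong together

    groups = {}
    for item in list(parent):
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(group) for group in groups.values())


data = [[1, 2, 5, 6, 8], [4, 5, 7, 9], [10, 11], [23, 45]]
print(merge_overlapping(data))
```

With ~700 lists of up to ~150 elements each, this does a near-constant amount of work per element, so it should be fast enough at that scale.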

Related

calculate similarity between two lists of tags

How can I calculate the semantic similarity between two lists of tags?
For example:
Input
list1 = ['marketing', 'social medial', 'operations', 'management']
list2 = ['software development', 'system network', 'system design']
Output
5%
Are there any python packages/libraries I can use to do this?
You cannot calculate the "semantic similarity", only the degree of overlap of the two lists. You have two lists of arbitrary elements and want to see how similar the lists are with each other.
There are several metrics to do that, e.g. the Jaccard Index or the Sørensen–Dice coefficient. Either of these should work for your purposes.
This assumes that the elements in your lists are arbitrary, but for your example the similarity would be zero, as there is no overlap at all. If you want to look at the similarity of the terms, you need a different approach.
For that you'd need to work out the pairwise similarity of two terms, and you could then substitute those for equality in the corresponding metrics.
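As a rough illustration of the two overlap metrics mentioned above, here is a minimal sketch. The function names are made up for this example, and returning 0.0 when both lists are empty is an assumption rather than a fixed convention.

```python
def jaccard_index(list_a, list_b):
    """|A ∩ B| / |A ∪ B| on the sets of distinct elements."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 0.0  # assumption: two empty lists count as "no similarity"
    return len(a & b) / len(a | b)


def dice_coefficient(list_a, list_b):
    """2 * |A ∩ B| / (|A| + |B|) on the sets of distinct elements."""
    a, b = set(list_a), set(list_b)
    if not a and not b:
        return 0.0  # same assumption as above
    return 2 * len(a & b) / (len(a) + len(b))


tags_1 = ['marketing', 'operations', 'management']
tags_2 = ['operations', 'management', 'system design']
print(jaccard_index(tags_1, tags_2))     # 2 shared out of 4 distinct tags
print(dice_coefficient(tags_1, tags_2))
```

Both functions compare tags by exact equality, so two lists without a common tag score 0 regardless of how related the terms are.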

Generating unordered pairs of disjoint subsets of a set of integers

I am trying to generate the unordered pairs of disjoint subsets of a set S of integers.
As shown in [1], when S consists of n integers, we generate around 3^n/2 pairs.
Now, I know how to generate all 2^n subsets of S (i.e. the powerset of S), and for every subset (consisting of k integers) I could thus generate the C(k, 2) (k choose 2) possible pairs.
But this is inefficient, because pairs will end up being generated more than once.
Therefore, I am wondering if there is some efficient (recursive) way to generate these pairs of subsets from S? I could not find any existing implementations, and my own attempts using for example Python's itertools were not successful so far.
[1] Total number of unordered pairs of disjoint subsets of S (MathOverflow)
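One way to avoid producing the same pair twice is to label every element as "in the first subset", "in the second subset", or "in neither", and to keep only one canonical labelling per unordered pair. The sketch below is an illustration under that idea, not an existing library routine; `disjoint_subset_pairs` is a hypothetical name, and it still walks all 3^n labellings while skipping the mirrored half.

```python
from itertools import product


def disjoint_subset_pairs(items):
    """Yield each unordered pair (A, B) of disjoint subsets exactly once."""
    items = list(items)
    # Label meaning: 0 = in neither subset, 1 = in A, 2 = in B.
    for labels in product((0, 1, 2), repeat=len(items)):
        # Canonical form: the first element that is placed at all must go
        # into A. This drops the mirrored pair (B, A). The all-zero
        # labelling (both subsets empty) is kept once.
        first_placed = next((lab for lab in labels if lab != 0), 1)
        if first_placed != 1:
            continue
        a = frozenset(x for x, lab in zip(items, labels) if lab == 1)
        b = frozenset(x for x, lab in zip(items, labels) if lab == 2)
        yield a, b


pairs = list(disjoint_subset_pairs([1, 2, 3]))
print(len(pairs))  # (3**3 + 1) // 2 pairs, counting the pair of two empty sets
```

Because it is a generator, the pairs are produced one at a time and never have to be held in memory together.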

Find a collection of sets such that there is maximum intersection of elements among the selected sets

I have roughly 300,000 (300K) sets, each containing 0-100 elements.
s1={a,b,x,y}
s2={a}
s3={a,x,y}
s4={x,y}
My question is: how do I efficiently find a collection of sets (say I need a collection of 5000 sets from the 300K sets) where there is maximum intersection of elements among those selected sets?
i.e.
Among all possible combinations of 5000 sets that can be picked from the 300K sets, I need the one collection of 5000 sets whose intersection (number of common elements) is greater than or equal to (>=) that of any other combination of 5000 sets.
For example: from the sets shown above,
Say I need 2 sets where there is maximum intersection of elements among them. The resulting collection would be -> C = {s1,s3} with [common elements={a,x,y}, common elements count=3]
Say I need 3 sets where there is maximum intersection of elements among them. The resulting collection would be -> C = {s1,s3,s4} with [common elements={x,y}, common elements count=2]
Bruteforce method is not an option since the total number of possible combinations of 5000 sets from a collection of 300K sets is huge.
300K choose 5000 = O(10^11041)
Are there any smart data structures and algorithms that I can use to get the desired collection of sets?
Also, is there any available Python library that I can use for this?

Pairwise Set Intersection in Python

If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.
For example, if I have the following sets:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
I want to be able to find:
intersect_AB={"c"}
intersect_BC={"c", "e"}
intersect_AC={"a", "c"}
Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:
intersections_C={"a": {"A", "C"},
"c": {"A", "B", "C"}
"e": {"B", "C"}}
I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs and then iterate through all of those values to create lists such as intersections_C above, but I'm not sure how that scales as n increases and the sizes of the set become too large.
Some additional background information:
Each of the sets is of roughly the same length, but they are also very large (large enough that storing them all in memory is a realistic concern, and an algorithm which avoids that would be preferred though is not necessary)
The size of the intersections between any two sets is very small compared to the size of the sets themselves
If it helps, we can assume anything we need to about the ordering of the input sets.
this ought to do what you want
import random as RND
import string
import itertools as IT
mock some data
fnx = lambda: set(RND.sample(string.ascii_uppercase, 7))
S = [fnx() for c in range(5)]
generate an index list of the sets in S so the sets can be referenced more concisely below
idx = range(len(S))
get all possible unique pairs of the items in S; however, since set intersection is commutative, we want the combinations rather than permutations
pairs = IT.combinations(idx, 2)
write a function to perform the set intersection
nt = lambda a, b: S[a].intersection(S[b])
fold this function over the pairs & key the result from each function call to its arguments
res = dict([ (t, nt(*t)) for t in pairs ])
the result, formatted per the first option recited in the OP, is a dictionary in which the values are the set intersections of two sets; each value is keyed to a tuple comprised of the two indices of those sets
this solution is really just two lines of code: (i) calculate the combinations; (ii) then apply some function over each combination, storing the returned value in a structured (key-value) container
the memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, ie
res = ( (t, nt(*t)) for t in pairs )
notice that with this approach, neither the sequence of pairs nor the corresponding intersections have been written out in memory--ie, both pairs and res are iterators.
If we can assume that the input sets are ordered, a pseudo-mergesort approach seems promising. Treating each set as a sorted stream, advance the streams in parallel, always only advancing those where the value is the lowest among all current iterators. Compare each current value with the new minimum every time an iterator is advanced, and dump the matches into your same-item collections.
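The merge idea above could look roughly like the following sketch. It assumes every input is an iterator that yields its values in ascending order; the names `shared_items` and `_tag` are invented for this example, and `heapq.merge` does the work of always advancing the stream with the lowest current value.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def _tag(name, stream):
    # Pair every value with the name of the stream it came from.
    for value in stream:
        yield value, name


def shared_items(sorted_streams):
    """Map each value found in more than one stream to the stream names.

    sorted_streams: dict of name -> iterable yielding values in ascending order.
    """
    merged = heapq.merge(*(_tag(name, stream)
                           for name, stream in sorted_streams.items()))
    result = {}
    # Equal values are adjacent in the merged output, so groupby collects them.
    for value, group in groupby(merged, key=itemgetter(0)):
        names = {name for _, name in group}
        if len(names) > 1:
            result[value] = names
    return result


streams = {
    "A": iter(["a", "b", "c"]),
    "B": iter(["c", "d", "e"]),
    "C": iter(["a", "c", "e"]),
}
print(shared_items(streams))
```

Only one pending value per stream is held at a time, so the inputs can be read lazily (for example from sorted files) instead of being loaded as full sets.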
How about using the intersection method of set? See below:
A={"a","b","c"}
B={"c","d","e"}
C={"a","c","e"}
intersect_AB = A.intersection(B)
intersect_BC = B.intersection(C)
intersect_AC = A.intersection(C)
print(intersect_AB, intersect_BC, intersect_AC)

Optimal group selection from a dictionary Python

So I have a dictionary with key:value(tuple). Something like this. {"name":(4,5),....} where
(4,5) represents two categories (cat1, cat2). Given a maximum number for the second category, I would like to find the optimal combination of dictionary entries such that the 1st category is maximized or minimized.
For example, if maxCat2 = 15, I want to find some combination of entries from the dictionary such that, when I add the cat2 values from each entry together, I'm under 15. There may be many such conditions. Of these possibilities, I would like to pick the one that when I add up the values for cat1 for each entry it is larger than any of the other possibilities.
I thought about writing an algorithm to get all permutations of the entries in the dictionary and then see if each one meets the maxCat2 criteria and then see which one of those gives me the largest total cat1 value. If I have 20 entries, that means I would check 20! combinations, which is a very large number. Is there anything that I can do to avoid this? Thanks.
As Jochen Ritzel pointed out, this can be seen as an instance of the knapsack problem.
Typically, you have a set of objects that have both "weight" (the "second category", in your example) and "value" (or "cost", if it is a minimization problem).
The problem consists in picking a subset of the objects such that the sum of their "values" is maximized/minimized, subject to the constraint the sum of the weights cannot exceed a specified maximum.
Though the problem is intractable in general, if the weights are integers and the maximum for the sum of weights is fixed, there exists a pseudo-polynomial time solution (polynomial in the number of objects and in the weight limit) using dynamic programming or memoization.
Very broadly, the idea is to define a set of values where
C[i][j] denotes the maximum sum (of "values") attainable considering only the first i objects where the total weight (of the chosen subset) cannot exceed j.
There are two possible choices here to calculate C[i][j].
Either element i is included in the subset (possible only when weight[i] <= j) and then
C[i][j] = value[i] + C[i-1][j - weight[i]]
or element i is not in the subset of chosen objects, so
C[i][j] = C[i-1][j]
The maximum of the two needs to be picked.
If n is the number of elements and w is the maximum total weight, then the answer ends up in C[n][w].
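The recurrence above can be sketched as a small table-filling routine. This is only an illustration for the maximization case: it assumes the dictionary maps each name to a (cat1, cat2) tuple of non-negative integers, treats the limit as "at most maxCat2", and the function name `best_selection` and the sample entries are made up.

```python
def best_selection(entries, max_cat2):
    """Return (best cat1 total, chosen names) with cat2 total <= max_cat2."""
    names = list(entries)
    n = len(names)
    # C[i][j]: best cat1 total using only the first i entries with cat2 total <= j.
    C = [[0] * (max_cat2 + 1) for _ in range(n + 1)]
    for i, name in enumerate(names, start=1):
        value, weight = entries[name]
        for j in range(max_cat2 + 1):
            C[i][j] = C[i - 1][j]                      # entry i left out
            if weight <= j:                            # entry i fits
                C[i][j] = max(C[i][j], value + C[i - 1][j - weight])

    # Walk the table backwards to recover which entries were taken.
    chosen = []
    j = max_cat2
    for i in range(n, 0, -1):
        if C[i][j] != C[i - 1][j]:
            chosen.append(names[i - 1])
            j -= entries[names[i - 1]][1]
    return C[n][max_cat2], chosen[::-1]


sample = {"a": (4, 5), "b": (3, 4), "c": (6, 9), "d": (5, 6)}
print(best_selection(sample, 15))
```

The table has (n + 1) * (maxCat2 + 1) cells, so for 20 entries and a limit of 15 this is a few hundred steps rather than a search over every combination.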
