I would like to get the union of two sets of frozensets. I'm only interested in unions of frozensets that don't intersect; another way to look at it is that I'm only interested in unions whose length equals the combined length of the two frozensets. Ideally I would like to skip any pair of frozensets that do intersect, for a massive speedup. I expect many frozensets to have at least one element in common. Here is the code I have so far in Python. I would like it to be as fast as possible, as I'm working with a large dataset. Each frozenset has no more than 20 elements, but there will be somewhere around 1,000 in total in a set. All numbers will be between 0 and 100. I'm open to converting to other types if it would allow my program to run faster, but I don't want any repeated elements and order is not important.
sets1 = set([frozenset([1,2,3]), frozenset([4,5,6]), frozenset([8,10,11])])
sets2 = set([frozenset([8,9,10]), frozenset([6,7,3])])
newSets = set()
for fset in sets1:
    for fset2 in sets2:
        newSet = fset.union(fset2)
        if len(newSet) == len(fset) + len(fset2):
            newSets.add(frozenset(newSet))
The correct output is:
{frozenset({1, 2, 3, 8, 9, 10}), frozenset({4, 5, 6, 8, 9, 10}), frozenset({3, 6, 7, 8, 10, 11})}
sets1 = set([frozenset([1,2,3]),frozenset([4,5,6]),frozenset([8,10,11])])
sets2 = set([frozenset([8,9,10]),frozenset([6,7,3])])
union_ = set()
for s1 in sets1:
    for s2 in sets2:
        if s1.isdisjoint(s2):
            union_.add(s1 | s2)
print(union_)
{frozenset({3, 6, 7, 8, 10, 11}), frozenset({1, 2, 3, 8, 9, 10}), frozenset({4, 5, 6, 8, 9, 10})}
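Since all elements are between 0 and 100 and speed matters, one further idea is to pack each frozenset into an integer bitmask so the disjointness check becomes a single bitwise AND. This is only a sketch under the stated assumptions (small non-negative integer elements), not benchmarked code:

def to_mask(fset):
    # Set bit x for every element x in the frozenset.
    mask = 0
    for x in fset:
        mask |= 1 << x
    return mask

masks1 = [(to_mask(s), s) for s in sets1]
masks2 = [(to_mask(s), s) for s in sets2]

union_ = set()
for m1, s1 in masks1:
    for m2, s2 in masks2:
        if m1 & m2 == 0:  # no shared bits, so the frozensets are disjoint
            union_.add(s1 | s2)

The pair loop is still quadratic, but the per-pair work drops to one integer AND, which should be cheaper than building a union and comparing lengths.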
I have a set of related integers which I need to search for across a huge number of data, and am wondering what's considered to be the most Pythonic or efficient method for going about this.
For example, if I have a list of integers:
query = [1,5,7,8]
And need to find all objects that contain these values:
record_1 = [0,5,7,8,10,11,12]
record_2 = [1,3,5,8,10,13,14]
record_3 = [1,4,5,6,7,8,11]
record_4 = [1,5,6,7,8,10,14]
record_5 = [1,5,8,9,11,13,16]
I know it wouldn't be too difficult to load each record into a larger list and iteratively test each for whether or not they contain all the integers found in the query, but I am wondering if there's a more Pythonic way of doing it, or if there's a more efficient method than testing every value (which will become expensive when scaling).
Thanks in advance!
If the numbers in queries and records are unique, I would represent them as sets (or frozensets if they never need to change). The filter function is applied to the list of records; for each record, the lambda checks whether query is a subset of that record, so the filtered result contains exactly the matching records, which are then converted to a list. Let's assume you have a list of records and a query:
query = set([1,5,7,8])
records = [
    set([0,5,7,8,10,11,12]),
    set([1,3,5,8,10,13,14]),
    set([1,4,5,6,7,8,11]),
    set([1,5,6,7,8,10,14]),
    set([1,5,8,9,11,13,16]),
]
matches = list(filter(lambda r: query.issubset(r), records))
print(matches)
Output:
[{1, 4, 5, 6, 7, 8, 11}, {1, 5, 6, 7, 8, 10, 14}]
Using map with issubset:
for y, x in zip(records, map(lambda r: query.issubset(r), records)):
    if x:
        print(y)
{1, 4, 5, 6, 7, 8, 11}
{1, 5, 6, 7, 8, 10, 14}
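If the set of records is large and you will run many queries against it, one alternative worth sketching (not required by the question, just an idea) is to build an inverted index from each value to the indices of the records containing it; a query then only intersects the index entries for its own values instead of scanning every record:

from collections import defaultdict

# Build the index once: value -> indices of the records containing that value.
index = defaultdict(set)
for i, rec in enumerate(records):
    for value in rec:
        index[value].add(i)

# Answer a query by intersecting the index entries for its values.
query = {1, 5, 7, 8}
candidate_ids = set.intersection(*(index[v] for v in query))
matches = [records[i] for i in candidate_ids]
print(matches)

Building the index costs one pass over all records, after which each query touches only the index entries for its own values.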
I am trying to generate a list of N random items taken from another list b. Duplicates are allowed. I cannot use random.sample because N can exceed the number of items in b.
I have written some code below:
import random

def generate_random_sequence(n):
    population = []
    for i in xrange(n):
        item = random.choice(b)  # b is the source list described above
        population.append(item)
    return population
However, I am really concerned about its performance, since this will be run a lot of times. Is there a method in the random library that performs this task, or is there a more optimized way of doing it?
You can use random.choice from the numpy library:
In [3]: np.random.choice([1,5,6],10)
Out[3]: array([6, 5, 6, 6, 6, 6, 1, 6, 1, 6])
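If you would rather stay in the standard library, random.choices (available since Python 3.6) also samples with replacement. A minimal sketch, with b and n standing in for the source list and sample size from the question:

import random

b = [1, 5, 6]                        # stand-in for the question's source list
n = 10
population = random.choices(b, k=n)  # n items drawn with replacement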
I am looking to find out the likelihood of parameter combinations using Monte Carlo Simulation.
I've got 4 parameters and each can have about 250 values.
I have randomly generated 250,000 scenarios for each of those parameters using some probability distribution function.
I now want to find out which parameter combinations are the most likely to occur.
To achieve this I have started by filtering out any duplicates from my 250,000 randomly generated samples in order to reduce the length of the list.
I then iterated through this reduced list and checked how many times each scenario occurs in the original 250,000 long list.
I have a large list of 250,000 items which contains lists, as such :
a = [[1,2,5,8],[1,2,5,8],[3,4,5,6],[3,4,5,7],....,[3,4,5,7]]# len(a) is equal to 250,000
I want a fast and efficient way of keeping only one copy of each inner list.
The end goal is to count the occurrences of each list within list a.
So far I've got:
'''Removing duplicates from list a and storing this as a new list temp'''
b_set = set(tuple(x) for x in a)
temp = [list(x) for x in b_set]
temp.sort(key=lambda x: a.index(x))
'''I then iterate through each of my possible lists (i.e. temp) and count how many times they occur in a'''
most_likely_dict = {}
for scenario in temp:
    freq = a.count(scenario)
    most_likely_dict[str(scenario)] = freq
At the moment it takes a good 15 minutes to run ... Any suggestions on how to turn that into a few seconds would be greatly appreciated!
You can take out the sorting part, as the final result is a dictionary, which will be unordered in any case, and then use a dict comprehension:
>>> a = [[1,2],[1,2],[3,4,5],[3,4,5], [3,4,5]]
>>> a_tupled = [tuple(i) for i in a]
>>> b_set = set(a_tupled)
>>> {repr(i): a_tupled.count(i) for i in b_set}
{'(1, 2)': 2, '(3, 4, 5)': 3}
Calling list on your tuples will add more overhead, but you can if you want to:
>>> {repr(list(i)): a_tupled.count(i) for i in b_set}
{'[3, 4, 5]': 3, '[1, 2]': 2}
Or just use a Counter:
>>> from collections import Counter
>>> Counter(tuple(i) for i in a)
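A short usage sketch with the small example list a from above; most_common lists the most frequent scenarios first:

>>> from collections import Counter
>>> counts = Counter(tuple(i) for i in a)
>>> counts.most_common()
[((3, 4, 5), 3), ((1, 2), 2)]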
Alternatively, count directly with a dict comprehension keyed on the string form of each item:
{str(item): a.count(item) for item in a}
Input :
a = [[1,2,5,8],[1,2,5,8],[3,4,5,6],[3,4,5,7],[3,4,5,7]]
Output :
{'[3, 4, 5, 6]': 1, '[1, 2, 5, 8]': 2, '[3, 4, 5, 7]': 2}
The result is a fixed number of arrays, let's say lists (all of the same length), in Python.
One could see it as a matrix too, so in C I would use an array where every cell points to another array. How should I do it in Python?
A list where every item is a list, or something else?
I thought of a dictionary, but the keys would be trivial (1, 2, ..., M), so I am not sure that is the Pythonic way to go here.
I am not interested in the implementation; I am interested in which approach I should follow, which choice I should make!
Whatever container you choose, it should contain hash-itemID pairs, and should be indexed or sorted by the hash. Unsorted arrays will not be remotely efficient.
Assuming you're using a decent sized hash and your various hash algorithms are well-implemented, you should be able to just as effectively store all minhashes in a single container, since the chance of collision between a minhash from one algorithm and a minhash from another is negligible, and if any such collision occurs it won't substantially alter the similarity measure.
Using a single container as opposed to multiple reduces the memory overhead for indexing, though it also slightly increases the amount of processing required. As memory is usually the limiting factor for minhash, a single container may be preferable.
You can store anything you want in a Python list: ints, strings, more lists, dicts, objects, functions - you name it.
anything_goes_in_here = [1, 'one', lambda one: one / 1, {1: 'one'}, [1, 1]]
So storing a list of lists is pretty straightforward:
>>> list_1 = [1, 2, 3, 4]
>>> list_2 = [5, 6, 7, 8]
>>> list_3 = [9, 10, 11, 12]
>>> list_4 = [13, 14, 15, 16]
>>> main_list = [list_1, list_2, list_3, list_4]
>>> for sub_list in main_list:
...     for num in sub_list:
...         print(num)
...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
If you are looking to store a list of lists where the index is meaningful (meaning the index gives you some information about the data stored there), then you are basically reimplementing a hashmap (dictionary). Even though you say the keys are trivial, a dictionary sounds like it fits the problem well here.
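A minimal sketch of both options on hypothetical data, just to make the trade-off concrete:

# Option 1: a plain list of lists, indexed 0..M-1.
rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
first_row = rows[0]

# Option 2: a dict keyed by the "trivial" ids 1, 2, ..., M from the question.
rows_by_id = {1: [1, 2, 3], 2: [4, 5, 6], 3: [7, 8, 9]}
first_row = rows_by_id[1]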
In Python you can get the intersection of two sets doing:
>>> s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> s2 = {0, 3, 5, 6, 10}
>>> s1 & s2
set([3, 5, 6])
>>> s1.intersection(s2)
set([3, 5, 6])
Does anybody know the complexity of this intersection (&) algorithm?
EDIT: In addition, does anyone know what data structure is behind a Python set?
The data structure behind the set is a hash table where the typical performance is an amortized O(1) lookup and insertion.
The intersection algorithm loops exactly min(len(s1), len(s2)) times. It performs one lookup per loop and if there is a match performs an insertion. In pure Python, it looks like this:
def intersection(self, other):
    if len(self) <= len(other):
        little, big = self, other
    else:
        little, big = other, self
    result = set()
    for elem in little:
        if elem in big:
            result.add(elem)
    return result
The answer appears to be a search engine query away. You can also use this direct link to the Time Complexity page at python.org. Quick summary:
Average: O(min(len(s), len(t)))
Worst case: O(len(s) * len(t))
EDIT: As Raymond points out below, the "worst case" scenario isn't likely to occur. I included it originally to be thorough, and I'm leaving it to provide context for the discussion below, but I think Raymond's right.
Set intersection of two sets of sizes m, n can be achieved in O(max{m,n} * log(min{m,n})) in the following way:
Assume m << n.
1. Represent the two sets as lists/arrays (something sortable).
2. Sort the **smaller** list/array (cost: m*log m).
3. Do the following until all elements in the bigger list have been checked:
3.1 Sort the next **m** items of the bigger list (cost: m*log m).
3.2 With a single pass, compare the smaller list and the m items you just sorted, and take the ones that appear in both (cost: m).
4. Return the new set.
The loop in step 3 runs for n/m iterations, and each iteration takes O(m*log m), so you end up with a time complexity of O(n*log m) for m << n.
I think that's the best bound that exists.
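A rough Python sketch of the idea above (assuming unique elements and len(small) <= len(big)); it is only meant to illustrate the cost analysis, not to compete with the built-in hash-based intersection:

def sorted_block_intersection(small, big):
    m = len(small)
    small_sorted = sorted(small)              # cost: m*log m, done once
    result = set()
    for start in range(0, len(big), m):       # roughly n/m iterations
        block = sorted(big[start:start + m])  # cost: m*log m per block
        # Single merge-style pass over the two sorted sequences (cost: m).
        i = j = 0
        while i < len(small_sorted) and j < len(block):
            if small_sorted[i] == block[j]:
                result.add(small_sorted[i])
                i += 1
                j += 1
            elif small_sorted[i] < block[j]:
                i += 1
            else:
                j += 1
    return result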