Prune a list of combinations based on sets - python

While this question is formulated using the Python programming language, I believe it is more of a programming logic problem.
I have a list of all possible combinations, i.e.: n choose k
I can prepare such a list using
import itertools
bits_list = list(itertools.combinations(range(n), k))
If 'n' is 100 and 'k' is 5, then the length of 'bits_list' will be 75287520.
Now, I want to prune this list so that certain numbers either appear together as a group or not at all. Let's use the following sets as an example:
Set 1: [0, 1, 2]
Set 2: [57, 58]
Set 3: [10, 15, 20, 25]
Set 4: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
Here the elements of each set must appear together in a member of bits_list, or not at all.
So far, I have only been able to think of a brute-force if-else method of solving this problem, but the number of if-else conditions would be very large this way.
Here's what I have:
bits_list = [x for x in list(itertools.combinations(range(n), k))
             if all(y in x for y in [0, 1, 2]) or
                all(y not in x for y in [0, 1, 2])]
Now, this only covers Set 1. I would like to do this for many sets. If a set is longer than k, we can ignore it (for example, Set 4 when k = 5).
Note that the ultimate aim is to have 'k' iterate over a range, say [5, 25], and work on the combined list. The size of the list grows very quickly here and, computationally speaking, this is very expensive!
With 'k' as 10, the Python interpreter aborts the process before completion on an average laptop with 16 GB of RAM. I need to find a solution that fits in the memory of a relatively modern server (not a cluster or a server farm).
Any help is greatly appreciated!
P.S.: Intuitively, think of this problem as generating all the possible cases for people boarding a public bus or train system. Usually, you board an entire group or you don't board anyone.
UPDATE:
For the given sets above, if k = 5, then a valid member of bits_list would be [0, 1, 2, 57, 58], i.e. a combination of Set 1 and Set 2. If k = 10, then we could have Set 1 + Set 2 + Set 3 + one element that belongs to no set as a possible member. @DonkeyKong's solution made me realize I hadn't mentioned this explicitly in my question.
I have a lot of sets; I intend to use enough sets to prune the full list of combinations such that the bits_list eventually fits into memory.
@9000's suggestion is perfectly valid here: during each iteration, I can save the combinations as actual bits.

This still gets crushed by a memory error (which I don't see how you're getting away from if you insist on a list) at a certain point (around n=90, k=5), but it is much faster than your current implementation. For n=80 and k=5, my rudimentary benchmarking had my solution at 2.6 seconds and yours around 52 seconds.
The idea is to construct the disjoint and subset parts of your filter separately. The disjoint part is trivial, and the subset part is calculated by taking the itertools.product of all disjoint combinations of length k - set_len and the individual elements of your set.
from itertools import combinations, product, chain
n = 80
k = 5
set1 = {0,1,2}
nots = set(range(n)) - set1
disj_part = list(combinations(nots, k))
subs_part = [tuple(chain(x, els)) for x, *els in
             product(combinations(nots, k - len(set1)), *([e] for e in set1))]
full_l = disj_part + subs_part

If you actually represented your bits as bits, that is, 0/1 values in a binary representation of an integer n bits long with exactly k bits set, the amount of RAM you'd need to store the data would be drastically smaller.
Also, you'd be able to use bit operations to check if all bits in a mask are set (value & mask == mask), or all unset (value & mask == 0).
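A minimal sketch of that idea, assuming small toy values of n and k (the helper name combos_as_bits is mine):

from itertools import combinations

n, k = 20, 5                  # small toy sizes
set1_mask = 0b111             # bits 0, 1 and 2, i.e. Set 1

def combos_as_bits(n, k):
    # Yield each n-choose-k combination as an integer with exactly k bits set.
    for positions in combinations(range(n), k):
        value = 0
        for p in positions:
            value |= 1 << p
        yield value

# Keep a combination only if Set 1 is fully present or fully absent.
kept = [v for v in combos_as_bits(n, k)
        if v & set1_mask == set1_mask or v & set1_mask == 0]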
The brute force will probably take less time than you'd spend thinking up a more clever algorithm, so it's totally OK for a one-off filtering.
If you must execute this often and quickly, and your n is in the small hundreds or less, I'd rather use Cython to describe the brute-force algorithm efficiently than look for algorithmic improvements. Modern CPUs can efficiently operate on 64-bit numbers; you won't benefit much from not comparing a part of the number.
OTOH if your n is really large, and the number of sets to compare to is also large, you could partition your bits for efficient comparison.
Let's suppose you can efficiently compare a chunk of 64 bits, and your bit lists contain e.g. 100 chunks each. Then you can do the same thing you'd do with strings: compare chunk by chunk, and if one of the chunks fails to match, do not compare the rest.
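As a rough illustration of that chunk-by-chunk comparison, assuming each bit list is stored as a list of 64-bit integer chunks (the function name is hypothetical):

def mask_fully_set(value_chunks, mask_chunks):
    # Both arguments are equal-length lists of 64-bit integer chunks.
    # Stop at the first chunk where some mask bit is missing from the value.
    for v, m in zip(value_chunks, mask_chunks):
        if v & m != m:
            return False
    return True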

A faster implementation would be to replace the if and all() statements in:
bits_list = [x for x in list(itertools.combinations(range(n), k))
             if all(y in x for y in [0, 1, 2]) or
                all(y not in x for y in [0, 1, 2])]
with Python's set operations isdisjoint() and issubset().
bits_generator = (set(x) for x in itertools.combinations(range(n), k))
first_set = set([0,1,2])
filter_bits = (x for x in bits_generator
               if first_set.issubset(x) or
                  x.isdisjoint(first_set))
answer_for_first_set = list(filter_bits)
I can keep going with generators, and with generators you won't run out of memory, but you will be waiting (and hastening the heat death of the universe). That's not because of Python's runtime or other implementation details, but because some problems are just not feasible even in computer time if you pick large N and K values.
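For what it's worth, the same idea extends to several sets at once; a sketch, assuming the sets are given as a list of frozensets and skipping any set longer than k, as the question describes:

import itertools

n, k = 30, 5
groups = [frozenset({0, 1, 2}), frozenset({10, 15, 20, 25})]
usable = [g for g in groups if len(g) <= k]   # ignore sets longer than k

bits_generator = (set(x) for x in itertools.combinations(range(n), k))
filtered = (x for x in bits_generator
            if all(g.issubset(x) or x.isdisjoint(g) for g in usable))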

Based on the ideas from @Mitch's answer, I created a solution with a slightly different approach than the one originally presented in the question. Instead of creating the list (bits_list) of all combinations and then pruning the combinations that do not match the listed sets, I built bits_list from the sets themselves.
import itertools
all_sets = [[0, 1, 2], [3, 4, 5], [6, 7], [8], [9, 19, 29], [10, 20, 30],
            [11, 21, 31], [12, 22, 32], ... [57, 58], ... [95], [96], [97]]
bits_list = [list(itertools.chain.from_iterable(x)) for y in [1, 2, 3, 4, 5]
             for x in itertools.combinations(all_sets, y)]
Here, instead of finding n choose k, looping over all k, and then finding the combinations which match the sets, I started from the sets, and even included the individual members as singleton sets, thereby removing the need for the two components (the disjoint part and the subset part) discussed in @Mitch's answer.
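As a small runnable illustration of the same idea, with a toy partition of range(10) instead of the full list above, and keeping only the combinations whose total length equals k:

import itertools

k = 5
all_sets = [[0, 1, 2], [3, 4], [5], [6], [7, 8, 9]]

candidates = (list(itertools.chain.from_iterable(x))
              for y in range(1, len(all_sets) + 1)
              for x in itertools.combinations(all_sets, y))
bits_list = [c for c in candidates if len(c) == k]
# e.g. [0, 1, 2, 3, 4], [0, 1, 2, 5, 6], [3, 4, 7, 8, 9], [5, 6, 7, 8, 9]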

Related

How could I reduce the time complexity of this nested loop?

I'm in the process of learning to optimize code and implement more data structures and algorithms in my programs; however, I'm experiencing difficulties with this code block.
My primary goal is to reduce this from O(n**2) time complexity, mainly by not using a nested loop.
import random
import numpy as np

numArray = np.array([[20, 12, 10, 8, 6, 4], [10, 10, 10, 10, 10, 10]], dtype=int)
rolls = [[] for i in range(len(numArray[0]))]
for i in range(len(numArray[0])):
    for j in range(numArray[1, i]):
        rolls[i].append(random.randint(1, numArray[0, i]))
The code is supposed to generate x amount of random integers (where x is index i of the second numArray subarray, e.g. 10) between 1 and index i of the first numArray subarray (e.g. 20).
Then repeat this for each index in the first numArray subarray.
(In the whole dice program the numArray subarrays are user-generated integers, but I assigned fixed numbers to them for simplicity's sake while optimizing.)
You could use np.random.randint since you're already importing numpy. It accepts a size argument to produce multiple random values in one go (note that, unlike random.randint, its upper bound is exclusive, hence the + 1 below).
rolls = [list(np.random.randint(1, numArray[0][idx] + 1, val)) for idx, val in enumerate(numArray[1])]
This of course assumes that both lists in numArray are the same length, but should get you somewhere at least.

Cost function of a dynamic solution to the "longest increasing subsequence" problem

So I have a pretty simple dynamic programming solution to the "longest increasing subsequence" problem (find the longest subsequence of increasing elements in a given sequence, for instance for [1, 7, 2, 6, 4] it would be [1, 2, 4]), which can also find the actual subsequence (as opposed to just its length):
sequence = [1, 8, 6, 4, 9, 8, 3, 5, 2, 7, 1, 9, 5, 7]
listofincreasing = [[] for _ in range(len(sequence))]
listofincreasing[0].append(sequence[0])
for right in range(1, len(sequence)):
    for left in range(right):
        if (sequence[left] < sequence[right]) and (len(listofincreasing[right]) < len(listofincreasing[left])):
            listofincreasing[right] = [] + listofincreasing[left]
    listofincreasing[right].append(sequence[right])
print(max(listofincreasing, key=len))
These sorts of brainteasers are pretty manageable for me, but I don't really know the hard theory behind them. My question is this: how would I go about creating a cost function that would formally describe "how I am filling the list", so to speak? Could someone show me how to approach these problems using this example? Thanks in advance.
Edit - some people asked for a clarification. To put it as succinctly as possible: I would need to create a mathematical function in the exact same way as it is done here: https://medium.com/#pp7954296/change-making-problem-dynamic-method-4954a446a511 in the "formula to solve coin change using dynamic method:" section, but for my solution of the longest increasing subsequence problem rather than for the change-making problem.
You are looking for a recursive formulation of the overlapping subproblems in your dynamic programming solution.
Let LONGEST(S,x) be the longest increasing subsequence of the first x characters of the sequence S. The solution to the problem is then LONGEST(S,|S|).
Recursively (using 1-based indexing):
LONGEST(S,x) = S[1] if x = 1. Otherwise,
LONGEST(S,x) = the longest of:
S[x],
LONGEST(S,y), where 1 <= y < x, or
LONGEST(S,y) + S[x], where 1 <= y < x and LAST_ELEMENT(LONGEST(S,y)) < S[x]
Since LONGEST(S,x) depends only on the values for smaller prefixes, we can produce the values iteratively in order of increasing x, and that is what your program does.
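For what it's worth, the recurrence that the code in the question actually fills in can also be written for L(i), the length of the longest increasing subsequence ending at index i (this formulation is mine, not part of the original answer):
L(i) = 1 + max( {0} ∪ { L(j) : 1 <= j < i and S[j] < S[i] } )
and the final answer is max(L(i)) over 1 <= i <= |S|.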

Storing the result of Minhash

The result is a fixed number of arrays, let's say lists (all of the same length), in Python.
One could see it as a matrix too, so in C I would use an array where every cell points to another array. How do I do that in Python?
A list where every item is a list or something else?
I thought of a dictionary, but the keys are trivial (1, 2, ..., M), so I am not sure that is the Pythonic way to go here.
I am not interested in the implementation; I am interested in which approach I should follow, which choice I should make!
Whatever container you choose, it should contain hash-itemID pairs, and should be indexed or sorted by the hash. Unsorted arrays will not be remotely efficient.
Assuming you're using a decent sized hash and your various hash algorithms are well-implemented, you should be able to just as effectively store all minhashes in a single container, since the chance of collision between a minhash from one algorithm and a minhash from another is negligible, and if any such collision occurs it won't substantially alter the similarity measure.
Using a single container as opposed to multiple reduces the memory overhead for indexing, though it also slightly increases the amount of processing required. As memory is usually the limiting factor for minhash, a single container may be preferable.
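A minimal sketch of the single-container idea (the dict layout and function names here are just an illustration, not a prescribed API):

from collections import defaultdict

# One container for all minhash values, whichever hash function produced them.
index = defaultdict(set)          # minhash value -> set of item IDs

def add_signature(item_id, signature):
    # signature: iterable of minhash values, one per hash function
    for h in signature:
        index[h].add(item_id)

def candidates(item_id, signature):
    # Items sharing at least one minhash value with this signature.
    found = set()
    for h in signature:
        found |= index[h]
    found.discard(item_id)
    return found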
You can store anything you want in a Python list: ints, strings, more lists, dicts, objects, functions - you name it.
anything_goes_in_here = [1, 'one', lambda one: one / 1, {1: 'one'}, [1, 1]]
So storing a list of lists is pretty straightforward:
>>> list_1 = [1, 2, 3, 4]
>>> list_2 = [5, 6, 7, 8]
>>> list_3 = [9, 10, 11, 12]
>>> list_4 = [13, 14, 15, 16]
>>> main_list = [list_1, list_2, list_3, list_4]
>>> for list in main_list:
...     for num in list:
...         print num
...
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
If you are looking to store a list of lists where the index is meaningful (meaning the index gives you some information about the data stored there), then this is basically reimplementing a hashmap (dictionary), and while you say the keys are trivial, using a dictionary sounds like it fits the problem well here.
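For completeness, a dictionary version of the same toy data might look like this (the keys standing in for whatever the meaningful index would be):

main_dict = {1: [1, 2, 3, 4], 2: [5, 6, 7, 8],
             3: [9, 10, 11, 12], 4: [13, 14, 15, 16]}
for key in main_dict:
    for num in main_dict[key]:
        print(num)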

Remove duplicates from one Python list, prune other lists based on it

I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.
Say I have three lists, A, B and C.
A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.
I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:
A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]
This is easy enough to do with longer code by moving everything to new lists:
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A:
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.
Zip the three lists together, uniquify based on the first element, then unzip:
from operator import itemgetter
from more_itertools import unique_everseen
abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)
This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:
abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = zip(*abc_unique)
Once you get the hang of zip, the "uniquify" is the only hard part, so let me explain it.
Your existing code checks whether each element has been seen by searching for each one in new_A. Since new_A is a list, this means that if you have N elements, M of them unique, on average you're going to be doing M/2 comparisons for each of those N elements. Plug in some big numbers, and NM/2 gets pretty big—e.g., 1 million values, a half of them unique, and you're doing 250 billion comparisons.
To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.
If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:
new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.
The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
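If it helps to see it spelled out, a simplified version of that recipe (without the micro-optimizations of the full itertools recipe) might look like this:

def unique_everseen(iterable, key=None):
    # Yield items from iterable, skipping any whose key has already been seen.
    seen = set()
    for item in iterable:
        k = item if key is None else key(item)
        if k not in seen:
            seen.add(k)
            yield item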
PadraicCunningham asks:
how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?
If there are N elements, M unique, it's O(N) time and O(M) space.
In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work that's not obviously trivial inside the loop is key in seen and seen.add(key), and since both operations are amortized constant time for set, that means the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions are about 278ms and 297ms (I forget which is which) compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so—but it's hard to imagine a case where you'd need that, but wouldn't benefit from running it in PyPy instead of CPython, or writing it in Cython or C, or numpy-izing it, or getting a faster computer, or parallelizing it.
As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:
indices = set(zip(*unique_everseen(enumerate(a), key=itemgetter(1)))[0])
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]
But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.
If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).
Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:
indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]
* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
How about this - basically get a set of all unique elements of A, and then get their indices, and create a new list based on these indices.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]
You can write a function for the second statement, for reuse:
def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]

Intersection complexity

In Python you can get the intersection of two sets doing:
>>> s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> s2 = {0, 3, 5, 6, 10}
>>> s1 & s2
set([3, 5, 6])
>>> s1.intersection(s2)
set([3, 5, 6])
Does anybody know the complexity of this intersection (&) algorithm?
EDIT: In addition, does anyone know what is the data structure behind a Python set?
The data structure behind the set is a hash table where the typical performance is an amortized O(1) lookup and insertion.
The intersection algorithm loops exactly min(len(s1), len(s2)) times. It performs one lookup per loop and if there is a match performs an insertion. In pure Python, it looks like this:
def intersection(self, other):
    if len(self) <= len(other):
        little, big = self, other
    else:
        little, big = other, self
    result = set()
    for elem in little:
        if elem in big:
            result.add(elem)
    return result
The answer appears to be a search engine query away. You can also use this direct link to the Time Complexity page at python.org. Quick summary:
Average: O(min(len(s), len(t)))
Worst case: O(len(s) * len(t))
EDIT: As Raymond points out below, the "worst case" scenario isn't likely to occur. I included it originally to be thorough, and I'm leaving it to provide context for the discussion below, but I think Raymond's right.
Set intersection of two sets of sizes m,n can be achieved with O(max{m,n} * log(min{m,n})) in the following way:
Assume m << n
1. Represent the two sets as lists/arrays (something sortable)
2. Sort the **smaller** list/array (cost: m*logm)
3. Do until all elements in the bigger list have been checked:
3.1 Sort the next **m** items of the bigger list (cost: m*logm)
3.2 With a single pass, compare the smaller list with the m items you just sorted and take the ones that appear in both (cost: m)
4. Return the new set
The loop in step 3 will run for n/m iterations and each iteration will take O(m*logm), so you will have time complexity of O(nlogm) for m << n.
I think that's the best lower bound that exists
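A rough Python sketch of the procedure above, just to make the steps concrete (my own illustration, not tuned for performance):

def chunked_intersection(set_a, set_b):
    # Step 1: represent the two sets as sortable lists, smaller one first.
    small, big = sorted((list(set_a), list(set_b)), key=len)
    if not small:
        return set()
    m = len(small)
    small.sort()                              # step 2: sort the smaller list once

    result = set()
    for start in range(0, len(big), m):       # step 3: m items of the bigger list at a time
        chunk = sorted(big[start:start + m])  # step 3.1: sort the next m items
        i = j = 0                             # step 3.2: one merge-style pass
        while i < m and j < len(chunk):
            if small[i] == chunk[j]:
                result.add(small[i])
                i += 1
                j += 1
            elif small[i] < chunk[j]:
                i += 1
            else:
                j += 1
    return result                             # step 4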
