In Python you can get the intersection of two sets like this:
>>> s1 = {1, 2, 3, 4, 5, 6, 7, 8, 9}
>>> s2 = {0, 3, 5, 6, 10}
>>> s1 & s2
set([3, 5, 6])
>>> s1.intersection(s2)
set([3, 5, 6])
Does anybody know the complexity of this intersection (&) algorithm?
EDIT: In addition, does anyone know what data structure is behind a Python set?
The data structure behind the set is a hash table where the typical performance is an amortized O(1) lookup and insertion.
The intersection algorithm loops exactly min(len(s1), len(s2)) times. It performs one lookup per loop and if there is a match performs an insertion. In pure Python, it looks like this:
def intersection(self, other):
    if len(self) <= len(other):
        little, big = self, other
    else:
        little, big = other, self
    result = set()
    for elem in little:
        if elem in big:
            result.add(elem)
    return result
The answer appears to be a search engine query away. You can also use this direct link to the Time Complexity page at python.org. Quick summary:
Average: O(min(len(s), len(t)))
Worst case: O(len(s) * len(t))
EDIT: As Raymond points out below, the "worst case" scenario isn't likely to occur. I included it originally to be thorough, and I'm leaving it to provide context for the discussion below, but I think Raymond's right.
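If you want to see the O(min(len(s), len(t))) average behaviour in practice, a rough timing sketch is below (the set sizes and timeit usage are just for illustration; exact numbers depend on your machine):

import timeit

small = set(range(100))
big = set(range(10 ** 6))

# both orderings iterate over the smaller operand, so they take roughly the
# same (small) amount of time, independent of len(big)
print(timeit.timeit(lambda: small & big, number=1000))
print(timeit.timeit(lambda: big & small, number=1000))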
Set intersection of two sets of sizes m and n can be achieved in O(max{m,n} * log(min{m,n})) time in the following way:
Assume m << n
1. Represent the two sets as lists/arrays (something sortable)
2. Sort the **smaller** list/array (cost: m*log m)
3. Do until all elements in the bigger list have been checked:
3.1 Sort the next **m** items of the bigger list (cost: m*log m)
3.2 With a single pass, compare the smaller list and the m items you just sorted, and take the ones that appear in both of them (cost: m)
4. Return the new set
The loop in step 3 will run for n/m iterations and each iteration will take O(m*log m), so you will have a time complexity of O(n log m) for m << n.
I think that's the best lower bound that exists
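A rough Python sketch of this block-by-block scheme (my own function name; it assumes the inputs are unsorted lists with no duplicates, i.e. they represent sets, and that the first argument is the smaller one):

def block_intersection(small, big):
    # step 2: sort the smaller list once, cost O(m log m)
    small = sorted(small)
    m = len(small)
    result = set()
    if m == 0:
        return result
    # step 3: process the bigger list in blocks of m elements
    for start in range(0, len(big), m):
        block = sorted(big[start:start + m])   # step 3.1, O(m log m)
        i = j = 0
        while i < m and j < len(block):        # step 3.2, a single merge-style pass
            if small[i] == block[j]:
                result.add(small[i])
                i += 1
                j += 1
            elif small[i] < block[j]:
                i += 1
            else:
                j += 1
    return result                              # step 4

For the sets at the top of the page, block_intersection(list(s2), list(s1)) returns {3, 5, 6}, matching s1 & s2.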
So I have a pretty simple dynamic programming solution to the "longest increasing subsequence" problem (find the longest subsequence of increasing elements in a given sequence; for instance, for [1, 7, 2, 6, 4] it would be [1, 2, 4]), which can also recover the actual subsequence (as opposed to just its length):
sequence = [1, 8, 6, 4, 9, 8, 3, 5, 2, 7, 1, 9, 5, 7]
listofincreasing = [[] for _ in range(len(sequence))]
listofincreasing[0].append(sequence[0])
for right in range(1, len(sequence)):
    for left in range(right):
        if (sequence[left] < sequence[right]) and (len(listofincreasing[right]) < len(listofincreasing[left])):
            listofincreasing[right] = [] + listofincreasing[left]
    listofincreasing[right].append(sequence[right])
print(max(listofincreasing, key=len))
This sort of brainteaser is pretty manageable for me, but I don't really know the hard theory behind it. My question is this: how would I go about creating a cost function that formally describes "how I am filling the list", so to speak? Could someone show me how to approach these problems using this example? Thanks in advance.
Edit - some people asked for a clarification. To put it as succinctly as possible, I would need to create a mathematical function in the exact same way as it is done here: https://medium.com/#pp7954296/change-making-problem-dynamic-method-4954a446a511 in the "formula to solve coin change using dynamic method" section, but for my solution of the longest increasing subsequence problem rather than for the change-making problem.
You are looking for a recursive formulation of the overlapping subproblems in your dynamic programming solution.
Let LONGEST(S,x) be the longest increasing subsequence of S that ends at position x. The solution to the problem is then the longest of LONGEST(S,x) over all 1 <= x <= |S|.
Recursively (using 1-based indexing):
LONGEST(S,x) = S[1] if x = 1. Otherwise,
LONGEST(S,x) = the longest of:
S[x] on its own, or
LONGEST(S,y) + S[x], where 1 <= y < x and LAST_ELEMENT(LONGEST(S,y)) < S[x]
Since LONGEST(S,x) depends only on the values at smaller indices, we can produce the values iteratively in order of increasing x, and that is what your program does: listofincreasing[right] is LONGEST(S, right), and the final max(..., key=len) picks the longest of them.
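If it helps to see the recurrence directly as code, here is a memoised recursive sketch of it (my own names, not from your program; lru_cache plays the role of the listofincreasing table, and 0-based indexing is used to match Python):

from functools import lru_cache

def longest_increasing_subsequence(sequence):
    @lru_cache(maxsize=None)
    def longest_ending_at(x):
        # LONGEST(S, x): best increasing subsequence ending with sequence[x]
        best = [sequence[x]]                      # S[x] on its own
        for y in range(x):                        # every y < x
            candidate = longest_ending_at(y)
            if candidate[-1] < sequence[x] and len(candidate) + 1 > len(best):
                best = candidate + [sequence[x]]  # LONGEST(S, y) + S[x]
        return best
    # the answer is the longest LONGEST(S, x) over all x
    return max((longest_ending_at(x) for x in range(len(sequence))), key=len)

print(longest_increasing_subsequence([1, 8, 6, 4, 9, 8, 3, 5, 2, 7, 1, 9, 5, 7]))
# [1, 4, 5, 7, 9], one of several length-5 answers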
I have a problem that's easy enough to do in an ugly way, but I'm wondering if there's a more Pythonic way of doing it.
Say I have three lists, A, B and C.
A = [1, 1, 2, 3, 4, 4, 5, 5, 3]
B = [1, 2, 3, 4, 5, 6, 7, 8, 9]
C = [1, 2, 3, 4, 5, 6, 7, 8, 9]
# The actual data isn't important.
I need to remove all duplicates from list A, but when a duplicate entry is deleted, I would like the corresponding indexes removed from B and C:
A = [1, 2, 3, 4, 5]
B = [1, 3, 4, 5, 7]
C = [1, 3, 4, 5, 7]
This is easy enough to do with longer code by moving everything to new lists:
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A:
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But is there a more elegant and efficient (and less repetitive) way of doing this? This could get cumbersome if the number of lists grows, which it might.
Zip the three lists together, uniquify based on the first element, then unzip:
from operator import itemgetter
from more_itertools import unique_everseen
abc = zip(a, b, c)
abc_unique = unique_everseen(abc, key=itemgetter(0))
a, b, c = zip(*abc_unique)
This is a very common pattern. Whenever you want to do anything in lock step over a bunch of lists (or other iterables), you zip them together and loop over the result.
Also, if you go from 3 lists to 42 of them ("This could get cumbersome if the number of lists grows, which it might."), this is trivial to extend:
abc = zip(*list_of_lists)
abc_unique = unique_everseen(abc, key=itemgetter(0))
list_of_lists = zip(*abc_unique)
Once you get the hang of zip, the "uniquify" is the only hard part, so let me explain it.
Your existing code checks whether each element has been seen by searching for it in new_A. Since new_A is a list, that means if you have N elements, M of them unique, you're doing on average M/2 comparisons for each of those N elements. Plug in some big numbers and NM/2 gets pretty big: with 1 million values, half of them unique, you're doing 250 billion comparisons.
To avoid that quadratic time, you use a set. A set can test an element for membership in constant, rather than linear, time. So, instead of 250 billion comparisons, that's 1 million hash lookups.
If you don't need to maintain order or decorate-process-undecorate the values, just copy the list to a set and you're done. If you need to decorate, you can use a dict instead of a set (with the key as the dict keys, and everything else hidden in the values). To preserve order, you could use an OrderedDict, but at that point it's easier to just use a list and a set side by side. For example, the smallest change to your code that works is:
new_A_set = set()
new_A = []
new_B = []
new_C = []
for i in range(len(A)):
    if A[i] not in new_A_set:
        new_A_set.add(A[i])
        new_A.append(A[i])
        new_B.append(B[i])
        new_C.append(C[i])
But this can be generalized—and should be, especially if you're planning to expand from 3 lists to a whole lot of them.
The recipes in the itertools documentation include a function called unique_everseen that generalizes exactly what we want. You can copy and paste it into your code, write a simplified version yourself, or pip install more-itertools and use someone else's implementation (as I did above).
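If you'd rather not copy the full recipe or add a dependency, a simplified version (assuming the keys are hashable, which integers are) might look like this:

def unique_everseen(iterable, key=None):
    # yield elements in order, skipping any whose key has already been seen
    seen = set()
    for element in iterable:
        k = element if key is None else key(element)
        if k not in seen:
            seen.add(k)
            yield element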
PadraicCunningham asks:
how efficient is zip(*unique_everseen(zip(a, b, c), key=itemgetter(0)))?
If there are N elements, M unique, it's O(N) time and O(M) space.
In fact, it's effectively doing the same work as the 10-line version above. In both cases, the only work that's not obviously trivial inside the loop is key in seen and seen.add(key), and since both operations are amortized constant time for set, that means the whole thing is O(N) time. In practice, for N=1000000, M=100000 the two versions are about 278ms and 297ms (I forget which is which) compared to minutes for the quadratic version. You could probably micro-optimize that down to 250ms or so—but it's hard to imagine a case where you'd need that, but wouldn't benefit from running it in PyPy instead of CPython, or writing it in Cython or C, or numpy-izing it, or getting a faster computer, or parallelizing it.
As for space, the explicit version makes it pretty obvious. Like any conceivable non-mutating algorithm, we've got the three new_Foo lists around at the same time as the original lists, and we've also added new_A_set of the same size. Since all of those are length M, that's 4M space. We could cut that in half by doing one pass to get indices, then doing the same thing mu 無's answer does:
indices = [index for index, value in unique_everseen(enumerate(a), key=itemgetter(1))]
a = [a[index] for index in indices]
b = [b[index] for index in indices]
c = [c[index] for index in indices]
But there's no way to go lower than that; you have to have at least a set and a list of length M alive to uniquify a list of length N in linear time.
If you really need to save space, you can mutate all three lists in-place. But this is a lot more complicated, and a bit slower (although still linear*).
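To give a flavour of what that looks like, here is a rough sketch (my own helper, not part of the original code) of the sentinel approach described in the footnote below: the first pass marks duplicates, the second compacts each list so every kept element is shifted at most once.

def dedupe_inplace(a, *rest):
    # first pass: replace every duplicate (judged by its value in a) with a
    # sentinel, in a and in all the parallel lists
    sentinel = object()
    seen = set()
    for i, value in enumerate(a):
        if value in seen:
            a[i] = sentinel
            for lst in rest:
                lst[i] = sentinel
        else:
            seen.add(value)
    # second pass: compact each list in place, shifting each kept element once
    for lst in (a,) + rest:
        write = 0
        for value in lst:
            if value is not sentinel:
                lst[write] = value
                write += 1
        del lst[write:]

Calling dedupe_inplace(A, B, C) on the question's lists leaves them as [1, 2, 3, 4, 5], [1, 3, 4, 5, 7], [1, 3, 4, 5, 7].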
Also, it's worth noting another advantage of the zip version: it works on any iterables. You can feed it three lazy iterators, and it won't have to instantiate them eagerly. I don't think it's doable in 2M space, but it's not too hard in 3M:
indices, a = zip(*unique_everseen(enumerate(a), key=itemgetter(1)))
indices = set(indices)
b = [value for index, value in enumerate(b) if index in indices]
c = [value for index, value in enumerate(c) if index in indices]
* Note that just del c[i] will make it quadratic, because deleting from the middle of a list takes linear time. Fortunately, that linear time is a giant memmove that's orders of magnitude faster than the equivalent number of Python assignments, so if N isn't too big you can get away with it—in fact, at N=100000, M=10000 it's twice as fast as the immutable version… But if N might be too big, you have to instead replace each duplicate element with a sentinel, then loop over the list in a second pass so you can shift each element only once, which is instead 50% slower than the immutable version.
How about this: get a set of all unique elements of A, find the index of each one's first occurrence, and build the new lists from those indices.
new_A = list(set(A))
indices_to_copy = [A.index(element) for element in new_A]
new_B = [B[index] for index in indices_to_copy]
new_C = [C[index] for index in indices_to_copy]
You can write a function for the second statement, for reuse:
def get_new_list(original_list, indices):
    return [original_list[idx] for idx in indices]
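Then, for example, the B and C lines above become:

new_B = get_new_list(B, indices_to_copy)
new_C = get_new_list(C, indices_to_copy)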
I need to compare two lists of numbers and count how many elements of first list are there in second list. For example,
a = [2, 3, 3, 4, 4, 5]
b1 = [0, 2, 2, 3, 3, 4, 6, 8]
here I should get a result of 4: '2' counts once (as it occurs only once in the first list), '3' counts twice, and '4' counts once (as it occurs only once in the second list). I was using the following code:
def scoreIn(list1, list2):
    score = 0
    list2c = list(list2)
    for i in list1:
        if i in list2c:
            score += 1
            list2c.remove(i)
    return score
It works correctly, but it is too slow for my case (I call it 15,000 times). I read a hint about 'walking' through sorted lists, which was supposed to be faster, so I tried to do it like this:
def scoreWalk(list1, list2):
    score = 0
    i = 0
    j = 0
    len1 = len(list1)  # we assume that list2 is never shorter than list1
    while i < len1:
        if list1[i] == list2[j]:
            score += 1
            i += 1
            j += 1
        elif list1[i] > list2[j]:
            j += 1
        else:
            i += 1
    return score
Unfortunately this code is even slower. Is there any way to make it more efficient? In my case, both lists are sorted, contain only integers, and list1 is never longer than list2.
You can use the intersection feature of collections.Counter to solve the problem in an easy and readable way:
>>> from collections import Counter
>>> intersection = Counter( [2,3,3,4,4,5] ) & Counter( [0, 2, 2, 3, 3, 4, 6, 8] )
>>> intersection
Counter({3: 2, 2: 1, 4: 1})
As @Bakuriu says in the comments, to obtain the number of elements in the intersection (including duplicates), like your scoreIn function, you can then use sum( intersection.values() ).
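Continuing the interpreter session above, that gives the expected score of 4:

>>> sum( intersection.values() )
4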
However, doing it this way you're not actually taking advantage of the fact that your data is pre-sorted, nor of the fact (mentioned in the comments) that you're doing this over and over again with the same list.
Here is a more elaborate solution more specifically tailored for your problem. It uses a Counter for the static list and directly uses the sorted dynamic list. On my machine it runs in 43% of the run-time of the naïve Counter approach on randomly generated test data.
def common_elements(static_counter, dynamic_sorted_list):
    last = None      # previous element in the dynamic list
    count = 0        # count seen so far for this element in the dynamic list
    total_count = 0  # total common elements seen, eventually the return value
    for x in dynamic_sorted_list:
        # since the list is sorted, if there's more than one element they
        # will be consecutive.
        if x == last:
            # one more of the same as the previous element
            # all we need to do is increase the count
            count += 1
        else:
            # this is a new element that we haven't seen before.
            # first "flush out" the current count we've been keeping.
            # - count is the number of times it occurred in the dynamic list
            # - static_counter[last] is the number of times it occurred in
            #   the static list (the Counter class counted this for us)
            # thus the number of occurrences the two have in common is the
            # smaller of these numbers. (Note that unlike a normal dictionary,
            # which would raise KeyError, a Counter will return zero if we try
            # to look up a key that isn't there at all.)
            total_count += min(static_counter[last], count)
            # now set count and last to the new element, starting a new run
            count = 1
            last = x
    if count > 0:
        # since we only "flushed" above once we'd iterated _past_ an element,
        # the last unique value hasn't been counted. count it now.
        total_count += min(static_counter[last], count)
    return total_count
The idea of this is that you do some of the work up front when you create the Counter object. Once you've done that work, you can use the Counter object to quickly look up counts, just like you look up values in a dictionary: static_counter[ x ] returns the number of times x occurred in the static list.
Since the static list is the same every time, you can do this once and use the resulting quick-lookup structure 15 000 times.
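For example, with the lists from the question, and assuming the longer list b1 is the one that stays the same across your 15,000 calls, usage might look like this:

from collections import Counter

static_counter = Counter([0, 2, 2, 3, 3, 4, 6, 8])          # built once, up front
# ... then, for each of the sorted dynamic lists:
print(common_elements(static_counter, [2, 3, 3, 4, 4, 5]))  # -> 4, as expected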
On the other hand, setting up a Counter object for the dynamic list may not pay off performance-wise. There is a little bit of overhead involved in creating a Counter object, and we'd only use each dynamic list Counter one time. If we can avoid constructing the object at all, it makes sense to do so. And as we saw above, you can in fact implement what you need by just iterating through the dynamic list and looking up counts in the other counter.
The scoreWalk function in your post does not handle the case where the biggest item is only in the static list, e.g. scoreWalk( [1,1,3], [1,1,2] ). Correcting that, however, it actually performs better than any of the Counter approaches for me, contrary to the results you report. There may be a significant difference between the distribution of your data and my uniformly distributed test data, but double-check your benchmarking of scoreWalk just to be sure.
Lastly, consider that you may be using the wrong tool for the job. You're not after short, elegant and readable -- you're trying to squeeze every last bit of performance out of a rather simple task. CPython allows you to write modules in C. One of the primary use cases for this is to implement highly optimized code. It may be a good fit for your task.
You can do this with a dict comprehension:
>>> a = [2, 3, 3, 4, 4, 5]
>>> b1 = [0, 2, 2, 3, 3, 4, 6, 8]
>>> {k: min(b1.count(k), a.count(k)) for k in set(a)}
{2: 1, 3: 2, 4: 1, 5: 0}
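To reduce that to the single score the question asks for, sum the values (continuing the session above):

>>> sum({k: min(b1.count(k), a.count(k)) for k in set(a)}.values())
4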
This is much faster if set(a) is small. If set(a) is more than 40 items, the Counter based solution is faster.
I have a sorted list of integers, L, and I have a value X that I wish to insert into the list such that L's order is maintained. Similarly, I wish to quickly find and remove the first instance of X.
Questions:
How do I use the bisect module to do the first part, if possible?
Is L.remove(X) going to be the most efficient way to do the second part? Does Python detect that the list has been sorted and automatically use a logarithmic removal process?
Example code attempts:
i = bisect_left(L, y)
L.pop(i) #works
del L[bisect_left(L, i)] #doesn't work if I use this instead of pop
You use the bisect.insort() function:
bisect.insort(L, X)
L.remove(X) will scan the whole list until it finds X. Use del L[bisect.bisect_left(L, X)] instead (provided that X is indeed in L).
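A small sketch of both operations together (the list and values here are just for illustration):

import bisect

L = [1, 3, 4, 4, 6, 8]
bisect.insort(L, 5)              # L is now [1, 3, 4, 4, 5, 6, 8]
del L[bisect.bisect_left(L, 4)]  # removes the first 4: [1, 3, 4, 5, 6, 8]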
Note that removing from the middle of a list is still going to incur a cost as the elements from that position onwards all have to be shifted left one step. A binary tree might be a better solution if that is going to be a performance bottleneck.
You could use Raymond Hettinger's IndexableSkiplist. It performs 3 operations in O(ln n) time:
insert value
remove value
lookup value by rank
import skiplist
import random

random.seed(2013)
N = 10
skip = skiplist.IndexableSkiplist(N)
data = list(range(N))
random.shuffle(data)
for num in data:
    skip.insert(num)
print(list(skip))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

for num in data[:N//2]:
    skip.remove(num)
print(list(skip))
# [0, 3, 4, 6, 9]
This question might be closer to pattern matching in image processing.
Is there any way to get a cost function value, applied on different lists, which will return the inter-list proximity? For example,
a = [4, 7, 9]
b = [5, 8, 10]
c = [2, 3]
Now the cost function value for a 2-tuple of lists, say (a, b), should be greater than for (a, c) and (b, c). This could be a huge computational task, since there can be many more lists and considering all permutations would blow up the complexity of the problem, so working only with the set of 2-tuples would be fine as well.
EDIT:
The list names indicate the types of actions, and the elements in them are the times at which the corresponding actions occur. What I'm trying to do is come up with a set (or sets) of actions that have a similar occurrence pattern. Since two actions cannot occur at the same time, it's a combination of intra- and inter-list distance.
Thanks in advance!
You're asking a very difficult question. Even without allowing the sizes to change, there are already several distance measures you could use (Euclidean, Manhattan, etc.; check the See Also section for more). The one you need depends on what you think a good measure of proximity is for whatever these lists represent.
Without knowing what you're trying to do with these lists no-one can define what a good answer would be, let alone how to compute it efficiently.
For comparing two strings or lists you can use the Levenshtein distance (Python implementation from here):
def levenshtein(s1, s2):
    l1 = len(s1)
    l2 = len(s2)
    # (l2 + 1) x (l1 + 1) matrix; the first row and column start out as the
    # base-case edit distances, the interior values get overwritten below
    matrix = [list(range(zz, zz + l1 + 1)) for zz in range(l2 + 1)]
    for zz in range(l2):
        for sz in range(l1):
            if s1[sz] == s2[zz]:
                matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                             matrix[zz][sz + 1] + 1,
                                             matrix[zz][sz])
            else:
                matrix[zz + 1][sz + 1] = min(matrix[zz + 1][sz] + 1,
                                             matrix[zz][sz + 1] + 1,
                                             matrix[zz][sz] + 1)
    return matrix[l2][l1]
Using that on your lists:
>>> a = [4, 7, 9]
>>> b = [5, 8, 10]
>>> c = [2, 3]
>>> levenshtein(a,b)
3
>>> levenshtein(b,c)
3
>>> levenshtein(a,c)
3
EDIT: with the added explanation in the comments, you could use sets instead of lists. Since every element of a set is unique, adding an existing element again is a no-op. You can use the set's isdisjoint method to check that two sets have no elements in common, or the intersection method to see which elements they have in common:
In [1]: a = {1,3,5}
In [2]: a.add(3)
In [3]: a
Out[3]: set([1, 3, 5])
In [4]: a.add(4)
In [5]: a
Out[5]: set([1, 3, 4, 5])
In [6]: b = {2,3,7}
In [7]: a.isdisjoint(b)
Out[7]: False
In [8]: a.intersection(b)
Out[8]: set([3])
N.B.: this syntax of creating sets requires at least Python 2.7.
Given the answer you gave to Michael's clarification, you should probably look up "Dynamic Time Warping".
I haven't used http://mlpy.sourceforge.net/ but its blurb says it provides DTW. (Might be a hammer to crack a nut; depends on your use case.)