Comparing algorithms for List intersection - python

I am attempting to design an algorithm to find the common elements between two sorted arrays of distinct elements. I am using one of the following two methods. Is either one better in terms of runtime and time complexity?
Method 1:
# O(n^2) ?
common = []
def intersect(array1, array2):
    dict1 = {}
    for item in array1:
        dict1.update({item: 0})
    for k, v in dict1.iteritems():
        if k in array2:
            common.append(k)
    return common
print intersect(array1=[1, 2, 3, 5], array2=[5, 6, 7, 8, 9])
Method 2:
# probably O(n^2)
common = []
def intersect(array1, array2):
    for item1 in array1:
        for item2 in array2:
            if item1 == item2:
                common.append(item1)
    return common
print intersect(array1=[1, 2, 3, 5], array2=[5, 6, 7, 8, 9])

Let array1 have M elements and array2 have N elements. As written, the first approach is not actually faster: the membership test "k in array2" is a linear scan of a list, so it is O(M*N), the same as the second approach, and it additionally uses O(M) space for the dict. It becomes the better option once the lookup is made cheap: binary-searching the sorted array2 (e.g. with bisect) gives O(M log N), and checking against set(array2) gives O(M + N) on average.
BTW, since both arrays are sorted, there is an O(M + N) algorithm: walk both arrays in parallel with two pointers, as in the merge step of merge sort.
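A minimal sketch of that two-pointer walk (the function name is illustrative; it assumes both inputs are sorted and contain distinct elements, as in the question):
import heapq  # not needed here; standard library only

def intersect_sorted(array1, array2):
    # Merge-style walk over both sorted arrays: O(M + N) time,
    # O(1) extra space apart from the output list.
    common = []
    i = j = 0
    while i < len(array1) and j < len(array2):
        if array1[i] == array2[j]:
            common.append(array1[i])
            i += 1
            j += 1
        elif array1[i] < array2[j]:
            i += 1
        else:
            j += 1
    return common

print(intersect_sorted([1, 2, 3, 5], [5, 6, 7, 8, 9]))  # [5]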

set(array1).intersection(array2) will likely be the fastest solution in practice (the argument of intersection can be any iterable, so the second set() call is unnecessary). Building the set is O(M), and each membership check during the intersection is O(1) on average, so the whole operation is O(M + N) on average, with the heavy lifting done in C. Note that the result is a set, so the original order is not preserved.
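For reference, with the arrays from the question (the result is a set, so element order is not guaranteed):

array1 = [1, 2, 3, 5]
array2 = [5, 6, 7, 8, 9]
common = set(array1).intersection(array2)
print(common)  # {5}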

Related

Find the locations (indices) of N elements in a huge numpy array

I have a small set of, say, 5 elements,
[21,103,3,10,243]
and a huge NumPy array
[4,5,1,3,5,100,876,89,78......456,64,3,21,245]
with the 5 elements appearing repeatedly in the bigger array.
I want to find all the indices where the elements of the small list appear in the larger array.
The small list will be less than 100 elements long and the large list will be about 10^7 elements long, so speed is a concern here. What is the most elegant and fastest way to do it in Python 3.x?
I have tried using np.where(), but it is very slow. I am looking for a faster way.
You can put the 100 elements to be found into a set (a hash table), then loop through the elements of the huge array, checking whether each element is in the set.
S = set([21,103,3,10,243])
A = [4,5,1,3,5,100,876,89,78......456,64,3,21,245]
result = []
for i, x in enumerate(A):
    if x in S:
        result.append(i)
To speed things up, you can do the following:
Sort the larger array.
Perform a binary search (on the sorted larger array) for each number in the smaller array.
Time Complexity
Sorting with numpy.sort(kind='heapsort') takes O(n log n) time.
Each binary search takes O(log n), so with m elements in the smaller array the total search cost is O(m log n).
Overall, this gives a good speedup. Note that sorting discards the original positions, so to report indices you need to sort indirectly (e.g. with argsort), as in the sketch below.
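A minimal sketch of this sort-plus-binary-search idea (variable names are illustrative; np.searchsorted performs the binary search):

import numpy as np

small = np.array([21, 103, 3, 10, 243])
big = np.array([4, 5, 1, 3, 5, 100, 876, 89, 78, 456, 64, 3, 21, 243, 243])

order = np.argsort(big, kind='stable')       # O(n log n); remembers original positions
big_sorted = big[order]

hits = []
for val in small:                            # m binary searches: O(m log n) total
    lo = np.searchsorted(big_sorted, val, side='left')
    hi = np.searchsorted(big_sorted, val, side='right')
    hits.append(order[lo:hi])                # original indices of every occurrence

indices = np.concatenate(hits)
print(indices)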
smaller_array = [21, 103, 3, 10, 243]
bigger_array = [4, 5, 1, 3, 5, 100, 876, 89, 78, 456, 64, 3, 21, 243, 243]
print(bigger_array)
print(smaller_array)
for val in smaller_array:
    if val in bigger_array:
        c = bigger_array.index(val)
        while True:
            print(f'{val} is found in bigger_array at index {bigger_array.index(val, c)}')
            c = bigger_array.index(val, c) + 1
            if val not in bigger_array[c:]:
                break
Or, a cleaner version of the same idea, relying on the ValueError that index() raises when no further occurrence exists:
smaller_array = [21, 103, 3, 10, 243]
bigger_array = [4, 5, 1, 3, 5, 100, 876, 89, 78, 456, 64, 3, 21, 243, 243]
print(bigger_array)
print(smaller_array)
for val in smaller_array:
    if val in bigger_array:
        c = 0
        try:
            while True:
                c = bigger_array.index(val, c)   # raises ValueError when exhausted
                print(f'{val} is found in bigger_array at index {c}')
                c += 1
        except ValueError:
            pass

Calculating complexity of an algorithm (Big-O)

I'm currently doing some work around Big-O complexity and calculating the complexity of algorithms.
I seem to be struggling to work out the steps to calculate the complexity and was looking for some help to tackle this.
The function:
index = 0
while index < len(self.items):
    if self.items[index] == item:
        self.items.pop(index)
    else:
        index += 1
The actual challenge is to rewrite this function so that it has O(n) worst-case complexity.
My problem with this is, as far as I thought, assignment statements and if statements have a complexity of O(1), whereas the while loop has a complexity of O(n), and in the worst case any statements within the while loop could execute n times. So I work this out as 1 + n + 1 = 2 + n = O(n).
I figure I must be working this out incorrectly as there'd be no point in rewriting the function otherwise.
Any help with this is greatly appreciated.
If self.items is a list, pop(index) is not constant time: removing an element from the middle shifts every later element down one slot, so it costs O(n - index), i.e. O(n) in the worst case. That pop is the only thing keeping this from being O(n) overall.
Probably the exercise is meant to make you use some other way of iterating over and removing from the list.
To make it O(n) you can do:
self.items = [x for x in self.items if x != item]
This keeps every element that is not equal to item, which is exactly what the original loop was doing with pop.
If you are using Python's built-in list data structure, the pop(index) operation is not constant in the worst case and is O(N). So your overall complexity is O(N^2). You will need to use some other data structure, like a linked list, if you cannot use auxiliary space.
With no argument, pop() removes the last element and is O(1).
With an index argument, list.pop(i) has to shift every element after position i one slot to the left, so it costs O(n - i): cheap near the end of the list, O(n) in the worst case (popping from the front).
Time Complexity - Python Wiki
So to keep your code efficient, pop from the end of the list, for example:
def pop(items):
    return items.pop(-1)  # removing the last element is O(1)
Since you are passing an index to self.items.pop(index), it is NOT O(1).
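A sketch of the O(n) rewrite the exercise is asking for, based on the answers above (the class and method names are illustrative; the in-place variant is one possible alternative for when building a new list is not allowed):

class Bag:
    def __init__(self, items):
        self.items = list(items)

    def remove_all(self, item):
        # Single pass, builds a new list: O(n) time, O(n) extra space.
        self.items = [x for x in self.items if x != item]

    def remove_all_inplace(self, item):
        # Single pass with a write index: O(n) time, O(1) extra space.
        write = 0
        for x in self.items:
            if x != item:
                self.items[write] = x
                write += 1
        del self.items[write:]

b = Bag([1, 2, 3, 2, 2, 4])
b.remove_all(2)
print(b.items)  # [1, 3, 4]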

Time Complexity of list flattening

I have two functions, both of which flatten an arbitrarily nested list of lists in Python.
I am trying to figure out the time complexity of both, to see which is more efficient, but I haven't found anything definitive on SO so far. There are lots of questions about lists of lists, but not about nesting to the nth degree.
function 1 (iterative)
def flattenIterative(arr):
    i = 0
    while i < len(arr):
        while isinstance(arr[i], list):
            if not arr[i]:
                arr.pop(i)
                i -= 1
                break
            else:
                arr[i: i + 1] = arr[i]
        i += 1
    return arr
function 2 (recursive)
def flattenRecursive(arr):
    if not arr:
        return arr
    if isinstance(arr[0], list):
        return flattenRecursive(arr[0]) + flattenRecursive(arr[1:])
    return arr[:1] + flattenRecursive(arr[1:])
My thoughts are below:
function 1 complexity
I think that the time complexity of the iterative version is O(n * m), where n is the length of the initial array and m is the depth of nesting. I think the space complexity is O(n), where n is the length of the initial array.
function 2 complexity
I think that the time complexity of the recursive version will be O(n), where n is the length of the input array. I think the space complexity is O(n * m), where n is the length of the initial array and m is the depth of nesting.
summary
So, to me it seems that the iterative function is slower, but more efficient with space. Conversely, the recursive function is faster, but less efficient with space. Is this correct?
I don't think so. There are N elements, so you will need to visit each element at least once. Overall, your algorithm will run for O(N) iterations. The deciding factor is what happens per iteration.
Your first algorithm has two loops, but if you look carefully, it still visits each element O(1) times per outer iteration. However, as @abarnert pointed out, the slice assignment arr[i: i + 1] = arr[i] moves every element of arr[i+1:] up, which is O(N) again.
Your second algorithm is similar, but you are concatenating lists in this case (in the previous case it was a slice assignment), and unfortunately list concatenation is also linear in the length of the result.
In summary, both your algorithms are quadratic.
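For comparison (this is not one of the two functions from the question), a recursive generator sidesteps both of those linear per-step costs, since it never splices into or concatenates intermediate lists:

def flatten_gen(arr):
    # Yield the leaves of an arbitrarily nested list, left to right.
    for item in arr:
        if isinstance(item, list):
            yield from flatten_gen(item)
        else:
            yield item

print(list(flatten_gen([1, [2, [], [3, [4]]], 5])))  # [1, 2, 3, 4, 5]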

fastest method of getting k smallest numbers in unsorted list of size N in python?

What is the fastest method to get the k smallest numbers in an unsorted list of size N using Python?
Is it faster to sort the big list of numbers and then take the k smallest, or to get the k smallest numbers by finding the minimum of the list k times, removing each found minimum before the next search?
You could use a heap queue; it can give you the K largest or smallest numbers out of a list of size N in O(N log K) time.
The Python standard library includes the heapq module, complete with a heapq.nsmallest() function ready implemented:
import heapq
k_smallest = heapq.nsmallest(k, input_list)
Internally, this builds a heap from the first K elements of the input list, then iterates over the remaining N-K elements, pushing each onto the heap and popping off the largest, so the heap always holds the K smallest items seen so far. Each push/pop pair takes O(log K) time, making the overall operation O(N log K).
The function also optimises the following edge cases:
If K is 1, the min() function is used instead, giving an O(N) result.
If K >= N, the function sorts instead, since O(N log N) is no worse than O(N log K) in that case.
A better option is to use the introselect algorithm, which runs in O(N). The only implementation I am aware of is the numpy.partition() function:
import numpy
# assuming you have a python list, you need to convert to a numpy array first
array = numpy.array(input_list)
# partition, slice back to the k smallest elements, convert back to a Python list
k_smallest = numpy.partition(array, k)[:k].tolist()
Apart from requiring an installation of numpy, this also takes O(N) memory (versus O(K) for heapq), as a copy of the list is created for the partition.
If you only wanted indices, you can use, for either variant:
heapq.nsmallest(k, range(len(input_list)), key=input_list.__getitem__)  # O(N log K)
numpy.argpartition(numpy.array(input_list), k)[:k].tolist()  # O(N)
If the list of the k smallest numbers doesn't need to be sorted, this can be done in O(n) time with a selection algorithm like introselect. The standard library doesn't come with one, but NumPy has numpy.partition for the job:
partitioned = numpy.partition(l, k)
# The subarray partitioned[:k] now contains the k smallest elements.
You might want to take a look at heapq:
In [109]: L = [random.randint(1,1000) for _ in range(100)]
In [110]: heapq.nsmallest(10, L)
Out[110]: [1, 17, 17, 19, 24, 37, 37, 45, 63, 73]
EDIT: this assumes that the list is immutable. If the list is an array and can be modified there are linear methods available.
You can get the complexity down to O(n log k) by keeping a heap of size k:
Put the first k elements into a heap ordered so that its largest element is on top (with Python's heapq, which is a min-heap, store negated values).
For every subsequent element, if it is smaller than the current largest element in the heap, replace that largest element with it and re-heapify.
Each replacement is a heap operation taking O(log k) time, hence the overall O(n log k) complexity.
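A minimal sketch of this bounded-heap idea (the function name is illustrative; values are negated so heapq's min-heap behaves like a max-heap of the k smallest seen so far):

import heapq

def k_smallest(nums, k):
    heap = [-x for x in nums[:k]]        # heap[0] is -(current maximum)
    heapq.heapify(heap)                  # O(k)
    for x in nums[k:]:
        if -x > heap[0]:                 # x is smaller than the current maximum
            heapq.heapreplace(heap, -x)  # pop the maximum, push x: O(log k)
    return sorted(-v for v in heap)

print(k_smallest([7, 1, 9, 4, 3, 8, 2], 3))  # [1, 2, 3]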
You can do it in O(kn) with repeated selection of the minimum. Once kn >= n log n, switch to sorting. That said, the constant on the selection approach tends to be a lot higher than the one on quicksort, so you really need to compare i*kn against j*n*log n, where i and j are the respective constant factors. In practice, it's usually more desirable to just sort unless you're dealing with a large n or a very small k.
Edit: see comments. It's actually a lot better.
Using heapq.nsmallest is less code, but if you are looking to implement it yourself, this is a simple way to do it. This solution loops through the data only once, and since heappush and heappop run in O(log k) on a heap of size k, the algorithm performs best for small values of k.
import heapq

def getsmallest(arr, k):
    # Max-heap of the k smallest values seen so far, stored negated
    # because heapq only provides a min-heap.
    m = [-x for x in arr[:k]]
    heapq.heapify(m)
    for num in arr[k:]:
        # Keep whichever is smaller: the new number or the current maximum.
        heapq.heappush(m, max(-num, heapq.heappop(m)))
    return sorted(-x for x in m)

if __name__ == '__main__':
    l = [1, 2, 3, 52, 2, 3, 1]
    print(getsmallest(l, 5))

What's a fast and pythonic/clean way of removing a sorted list from another sorted list in python?

I am creating a fast method of generating a list of primes in the range(0, limit+1). In the function I end up removing all integers in the list named removable from the list named primes. I am looking for a fast and pythonic way of removing the integers, knowing that both lists are always sorted.
I might be wrong, but I believe list.remove(n) iterates over the list comparing each element with n, meaning that the following code runs in O(n^2) time.
# removable and primes are both sorted lists of integers
for composite in removable:
    primes.remove(composite)
Based on my assumption (which could be wrong, so please confirm whether or not it is correct) and the fact that both lists are always sorted, I would think that the following code runs faster, since it only loops over each list once, for O(n) time. However, it is not at all pythonic or clean.
i = 0
j = 0
while i < len(primes) and j < len(removable):
    if primes[i] == removable[j]:
        primes = primes[:i] + primes[i+1:]
        j += 1
    else:
        i += 1
Is there perhaps a built in function or simpler way of doing this? And what is the fastest way?
Side notes: I have not actually timed the functions or code above. Also, it doesn't matter if the list removable is changed/destroyed in the process.
For anyone interested the full functions is below:
import math

# returns a list of primes in range(0, limit+1)
def fastPrimeList(limit):
    if limit < 2:
        return list()
    sqrtLimit = int(math.ceil(math.sqrt(limit)))
    primes = [2] + range(3, limit+1, 2)
    index = 1
    while primes[index] <= sqrtLimit:
        removable = list()
        index2 = index
        while primes[index] * primes[index2] <= limit:
            composite = primes[index] * primes[index2]
            removable.append(composite)
            index2 += 1
        for composite in removable:
            primes.remove(composite)
        index += 1
    return primes
This is quite fast and clean: it does O(n) set membership checks and runs in O(n) amortized time (the first line is O(n) amortized, and the second line does n membership checks at O(1) amortized each):
removable_set = set(removable)
primes = [p for p in primes if p not in removable_set]
Here is the modification of your 2nd solution. It does O(n) basic operations (worst case):
tmp = []
i = j = 0
while i < len(primes) and j < len(removable):
    if primes[i] < removable[j]:
        tmp.append(primes[i])
        i += 1
    elif primes[i] == removable[j]:
        i += 1
    else:
        j += 1
primes[:i] = tmp
del tmp
del tmp
Please note that constants also matter. The Python interpreter is quite slow (i.e. has a large constant) at executing Python code. The second solution has lots of Python code, and it can indeed be slower for small practical values of n than the solution with sets, because set operations are implemented in C and are therefore fast (i.e. have a small constant).
If you have multiple working solutions, run them on typical input sizes, and measure the time. You may get surprised about their relative speed, often it is not what you would predict.
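For example, one way to measure them with timeit (the two wrapper functions below simply re-implement the set-based and merge-based approaches above, purely for the benchmark):

import random
import timeit

def with_set(primes, removable):
    removable_set = set(removable)
    return [p for p in primes if p not in removable_set]

def with_merge(primes, removable):
    # Merge-style walk over the two sorted lists.
    out, j = [], 0
    for p in primes:
        while j < len(removable) and removable[j] < p:
            j += 1
        if j == len(removable) or removable[j] != p:
            out.append(p)
    return out

primes = sorted(random.sample(range(10**6), 50000))
removable = sorted(random.sample(primes, 10000))

print(timeit.timeit(lambda: with_set(primes, removable), number=20))
print(timeit.timeit(lambda: with_merge(primes, removable), number=20))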
The most important thing here is to remove the quadratic behavior. You have this for two reasons.
First, calling remove searches the entire list for values to remove. Doing this takes linear time, and you're doing it once for each element in removable, so your total time is O(NM) (where N is the length of primes and M is the length of removable).
Second, removing elements from the middle of a list forces you to shift the whole rest of the list up one slot. So, each one takes linear time, and again you're doing it M times, so again it's O(NM).
How can you avoid these?
For the first, you either need to take advantage of the sorting, or just use something that allows you to do constant-time lookups instead of linear-time, like a set.
For the second, you either need to create a list of indices to delete and then do a second pass to move each element up the appropriate number of indices all at once, or just build a new list instead of trying to mutate the original in-place.
So, there are a variety of options here. Which one is best? It almost certainly doesn't matter; changing your O(NM) time to just O(N+M) will probably be more than enough of an optimization that you're happy with the results. But if you need to squeeze out more performance, then you'll have to implement all of them and test them on realistic data.
The only one of these that I think isn't obvious is how to "use the sorting". The idea is to use the same kind of staggered-zip iteration that you'd use in a merge sort, like this:
def sorted_subtract(seq1, seq2):
    # Yield the elements of sorted seq1 that do not appear in sorted seq2,
    # walking both sequences once.
    i1, i2 = 0, 0
    while i1 < len(seq1):
        if i2 == len(seq2):
            # Nothing left to remove: the rest of seq1 survives.
            yield from seq1[i1:]
            return
        if seq1[i1] < seq2[i2]:
            yield seq1[i1]
            i1 += 1
        elif seq1[i1] == seq2[i2]:
            i1 += 1          # skip this element; it is being removed
        else:
            i2 += 1          # advance past removable values already passed
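The generator can then replace the remove loop in the original function, for example:

primes = list(sorted_subtract(primes, removable))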
