Imagine you want to find all the duplicates in an array, and you must do this in O(1) space and O(N) time.
An algorithm like this would have O(N) space:
def find_duplicates(arr):
seen = set()
res = []
for i in arr:
if i in seen: res.append(i)
seen.add(i)
return res
My question is would the following algorithm use O(1) space or O(N) space:
def find_duplicates(arr):
seen = set()
res = []
while arr:
i = arr.pop()
if i in seen: res.append(i)
seen.add(i)
return res
Technically arr gets smaller and the sum of |seen| and |arr| will always be less than the original |arr|, but at the end of the day I think it's still allocating |arr| space for seen.
In order to determine the space complexity, you have to know something about how pop is implemented, as well as how Python manages memory. In order for your algorithm to use constant space, arr would have to release the memory used by popped items, and seen would have to be able to reuse that memory. However, most implementations of Python probably do not support that level of sharing. In particular, pop isn't going to release any memory; it will keep it against the possibility of needing it in the future, rather than having to ask to get the memory back.
Whenever you try to do time and space complexity analysis, think of a test case which could blow up your program the most.
Your space complexity is O(N). In the case of your second program, if you have a list of numbers with only 1s. Eg: x = [1,1,1,1,1,1,1]. Then you'll see that res grows almost to the size of N. Consider what happens when you have all different numbers. x = [1,2,3,4,5,6,7,8]. Now seen grows to the size of N.
Also thinking about time complexity, the pop() function of python lists could sometime be a problem. Check out this post for more details.
Related
I have a very large list with over a 100M strings. An example of that list look as follows:
l = ['1,1,5.8067',
'1,2,4.9700',
'2,2,3.9623',
'2,3,1.9438',
'2,7,1.0645',
'3,3,8.9331',
'3,5,2.6772',
'3,7,3.8107',
'3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used a lambda iterator followed by next() to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes relatively much time for a process I have to do over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open for alternative strategies.
I have no way of testing it myself but it is possible that if you will join all the strings with a char that is not in any of the string:
concat_list = '$'.join(l)
And now use a simple .find('$3,'), it would be faster. It might happen if all the strings are relatively short. Since now all the string is in one place in memory.
If the amount of unique letters in the text is small you can use Abrahamson-Kosaraju method and het time complexity of practically O(n)
Another approach is to use joblib, create n threads when the i'th thread is checking the i + k * n, when one is finding the pattern it stops the others. So the time complexity is O(naive algorithm / n).
Since your actual strings consist of relatively short tokens (such as 301) after splitting the the strings by tabs, you can build a dict with each possible length of the first token as the keys so that subsequent lookups take only O(1) in average time complexity.
Build the dict with values of the list in reverse order so that the first value in the list that start with each distinct character will be retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
'301\t303\t39.962393\n', '301\t304\t18.943836\n',
'301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763'.
My textbook says that the following algorithm has an efficiency of O(n):
list = [5,8,4,5]
def check_for_duplicates(list):
dups = []
for i in range(len(list)):
if list[i] not in dups:
dups.append(list[i])
else:
return True
return False
But why? I ask because the in operation has an efficiency of O(n) as well (according to this resource). If we take list as an example the program needs to iterate 4 times over the list. But with each iteration, dups keeps growing faster. So for the first iteration over list, dups does not have any elements, but for the second iteration it has one element, for the third two elements and for the fourth three elements. Wouldn't that make 1 + 2 + 3 = 6 extra iterations for the in operation on top of the list iterations? But if this is true then wouldn't this alter the efficiency significantly, as the sum of the extra iterations grows faster with every iteration?
You are correct that the runtime of the code that you've posted here is O(n2), not O(n), for precisely the reason that you've indicated.
Conceptually, the algorithm you're implementing goes like this:
Maintain a collection of all the items seen so far.
For each item in the list:
If that item is in the collection, report a duplicate exists.
Otherwise, add it to the collection.
Report that there are no duplicates.
The reason the code you've posted here is slow is because the cost of checking whether a duplicate exists is O(n) when using a list to track the items seen so far. In fact, if you're using a list of the existing elements, what you're doing is essentially equivalent to just checking the previous elements of the array to see if any of them are equal!
You can speed this up by switching your implementation so that you use a set to track prior elements rather than a list. Sets have (expected) O(1) lookups and insertions, so this will make your code run in (expected) O(1) time.
I have designed an algorithm but confused whether the time complexity is theta(n) or theta (n^2).
def prefix_soms(number):
Array_A=[1,2,3,4,5,6]
Array_B=[]
soms=0
for i in range(0,number):
soms=soms+Array_A[i]
Array_B.insert(i,soms)
return Array_B[number-1]
I know the for loop is running n times so that's O(n).
Is the inside operations O(1)?
For arbitrary large numbers, it is not, since adding two huge numbers takes logarithmic time in the value of these numbers. If we assume that the sum will not run out of control, then we can say that it runs in O(n). The .insert(…) is basically just an .append(…). The amortized cost of appending n items is O(n).
We can however improve the readablility, and memory usage, by writing this as:
def prefix_soms(number):
Array_A=[1,2,3,4,5,6]
soms=0
for i in range(0,number):
soms += Array_A[i]
return soms
or:
def prefix_soms(number):
Array_A=[1,2,3,4,5,6]
return sum(Array_A[:number])
or we can omit creating a copy of the list, by using islice(..):
from itertools import islice
def prefix_soms(number):
Array_A=[1,2,3,4,5,6]
return sum(islice(Array_A, number))
We thus do not need to use another list, since we are only interested in the last item.
Given that the insert method doesn't shift your array - that is as to your algorithm it solely appends one element to end of the list - its time
complexity is O(1). Moreover, accessing an element with index takes O(1) time as well.
You run number of number time a loop with some O(1)s. O(number)*someO(1)s = O(number)
The complexity of list.insert is O(n), as shown on this wiki page. You can check the blist library which provides an optimized list type that has an O(log n) insert, but in your algorithm I think that the item is always placed at the end of the Array_B, so it is just an append which takes constant amortized time (You can replace the insert with append to make the code more elegant).
Why is this Significantly faster with comments? Shouldn't a pop, a comparison, and a length check be O(1)? Would that significantly affect the speed?
#! /usr/bin/python
import math
pmarbs = []
pows = 49
pmarbs.append("W")
inc = 1
for i in range(pows):
count = 0
j = 0
ran = int(pow(2, i))
marker = len(pmarbs) - inc
while (j < ran):
#potential marble choice
pot = pmarbs[marker - j]
pot1 = pot + "W"
pot2 = pot + "B"
if (pot2.count('W') < pot2.count('B')) and (len(pot2) > (i+1)):
count += 1
else:
pmarbs.append(pot2)
pmarbs.append(pot1)
# if(len(pmarbs[0]) < i):
# pmarbs.pop(0)
# marker -= 1
j += 1
if (count != 0):
print(count)
print("length of pmarbs = %s" % len(pmarbs))
UPDATE:
I'm making the question shorter, because the code being significantly slower was my question. I cared less about the code getting killed at runtime.
Just to answer part of the question: popping from the end (the right end) of a list takes constant time in CPython, but popping from the left end (.pop(0)) takes time proportional to the length of the list: all the elements in the_list[1:] are physically moved one position to the left.
If you need to delete index position 0 frequently, much better to use an instance of collections.deque. Deques support efficient pushing and popping from both ends.
BTW, when I run the program, I get a clean exception:
...
length of pmarbs = 8306108
Traceback (most recent call last):
File "xxx.py", line 22, in <module>
pmarbs.append(pot2)
MemoryError
That happened to be on a 32-bit Windows box. And it doesn't surprise me ;-)
list.pop(index) is an O(n) operation, because after you remove the value from the list, you have to shift the memory location of every other value in the list over one. Calling pop repeatedly on large lists is great way to waste computing cycles. If you absolutely must remove from the front of a large list over and over use collections.deque, which will give you much faster insertions and deletions to thr front.
len() is O(1) because deletions are O(n), since if you make sure all the values in a list are allocated in memory right next to each other, the total length of a list is just the tail's memory location - the head's memory location. If you don't care about the performance of len() and similar operations, then you can use a linked list to do constant time insertions and deletions - that just makes len() be O(n) and pop() be O(1) (and you get some other funky stuff like O(n) lookups).
Everything I said about pop() goes for insert() also - except for append(), which usually takes O(1).
I recently worked on a problem that required deleting lots of elements from a very large list (around 10,000,000 integers) and my initial dumb implementation just used pop() every time I needed to delete something - that turned out to not work at all, because it took O(n) to do even one cycle of the algorithm, which itself needed to n times.
My solution was to create a set() called ignore in which I kept the indices of all "deleted" elements. I had little helper functions to help me not have to think about skipping these, so my algorithm didn't get too ugly. What eventually did it was doing a single O(n) pass every 10,000 iterations to delete all the elements in ignore and make ignore empty again, that way I got the increased performance from a shrinking list while only having to do one 10,000th of the work for my deletions.
Also, ya, you should get a memory error because you are trying to allocate a list that is definitely much larger than your hard drive - much less your memory.
For example, files, in Python, are iterable - they iterate over the lines in the file. I want to count the number of lines.
One quick way is to do this:
lines = len(list(open(fname)))
However, this loads the whole file into memory (at once). This rather defeats the purpose of an iterator (which only needs to keep the current line in memory).
This doesn't work:
lines = len(line for line in open(fname))
as generators don't have a length.
Is there any way to do this short of defining a count function?
def count(i):
c = 0
for el in i: c += 1
return c
To clarify, I understand that the whole file will have to be read! I just don't want it in memory all at once
Short of iterating through the iterable and counting the number of iterations, no. That's what makes it an iterable and not a list. This isn't really even a python-specific problem. Look at the classic linked-list data structure. Finding the length is an O(n) operation that involves iterating the whole list to find the number of elements.
As mcrute mentioned above, you can probably reduce your function to:
def count_iterable(i):
return sum(1 for e in i)
Of course, if you're defining your own iterable object you can always implement __len__ yourself and keep an element count somewhere.
If you need a count of lines you can do this, I don't know of any better way to do it:
line_count = sum(1 for line in open("yourfile.txt"))
The cardinality package provides an efficient count() function and some related functions to count and check the size of any iterable: http://cardinality.readthedocs.org/
import cardinality
it = some_iterable(...)
print(cardinality.count(it))
Internally it uses enumerate() and collections.deque() to move all the actual looping and counting logic to the C level, resulting in a considerable speedup over for loops in Python.
I've used this redefinition for some time now:
def len(thingy):
try:
return thingy.__len__()
except AttributeError:
return sum(1 for item in iter(thingy))
It turns out there is an implemented solution for this common problem. Consider using the ilen() function from more_itertools.
more_itertools.ilen(iterable)
An example of printing a number of lines in a file (we use the with statement to safely handle closing files):
# Example
import more_itertools
with open("foo.py", "r+") as f:
print(more_itertools.ilen(f))
# Output: 433
This example returns the same result as solutions presented earlier for totaling lines in a file:
# Equivalent code
with open("foo.py", "r+") as f:
print(sum(1 for line in f))
# Output: 433
Absolutely not, for the simple reason that iterables are not guaranteed to be finite.
Consider this perfectly legal generator function:
def forever():
while True:
yield "I will run forever"
Attempting to calculate the length of this function with len([x for x in forever()]) will clearly not work.
As you noted, much of the purpose of iterators/generators is to be able to work on a large dataset without loading it all into memory. The fact that you can't get an immediate length should be considered a tradeoff.
Because apparently the duplication wasn't noticed at the time, I'll post an extract from my answer to the duplicate here as well:
There is a way to perform meaningfully faster than sum(1 for i in it) when the iterable may be long (and not meaningfully slower when the iterable is short), while maintaining fixed memory overhead behavior (unlike len(list(it))) to avoid swap thrashing and reallocation overhead for larger inputs.
# On Python 2 only, get zip that lazily generates results instead of returning list
from future_builtins import zip
from collections import deque
from itertools import count
def ilen(it):
# Make a stateful counting iterator
cnt = count()
# zip it with the input iterator, then drain until input exhausted at C level
deque(zip(it, cnt), 0) # cnt must be second zip arg to avoid advancing too far
# Since count 0 based, the next value is the count
return next(cnt)
Like len(list(it)), ilen(it) performs the loop in C code on CPython (deque, count and zip are all implemented in C); avoiding byte code execution per loop is usually the key to performance in CPython.
Rather than repeat all the performance numbers here, I'll just point you to my answer with the full perf details.
For filtering, this variation can be used:
sum(is_good(item) for item in iterable)
which can be naturally read as "count good items" and is shorter and simpler (although perhaps less idiomatic) than:
sum(1 for item in iterable if is_good(item)))
Note: The fact that True evaluates to 1 in numeric contexts is specified in the docs
(https://docs.python.org/3.6/library/stdtypes.html#boolean-values), so this coercion is not a hack (as opposed to some other languages like C/C++).
We'll, if you think about it, how do you propose you find the number of lines in a file without reading the whole file for newlines? Sure, you can find the size of the file, and if you can gurantee that the length of a line is x, you can get the number of lines in a file. But unless you have some kind of constraint, I fail to see how this can work at all. Also, since iterables can be infinitely long...
I did a test between the two common procedures in some code of mine, which finds how many graphs on n vertices there are, to see which method of counting elements of a generated list goes faster. Sage has a generator graphs(n) which generates all graphs on n vertices. I created two functions which obtain the length of a list obtained by an iterator in two different ways and timed each of them (averaging over 100 test runs) using the time.time() function. The functions were as follows:
def test_code_list(n):
l = graphs(n)
return len(list(l))
and
def test_code_sum(n):
S = sum(1 for _ in graphs(n))
return S
Now I time each method
import time
t0 = time.time()
for i in range(100):
test_code_list(5)
t1 = time.time()
avg_time = (t1-t0)/10
print 'average list method time = %s' % avg_time
t0 = time.time()
for i in range(100):
test_code_sum(5)
t1 = time.time()
avg_time = (t1-t0)/100
print "average sum method time = %s" % avg_time
average list method time = 0.0391882109642
average sum method time = 0.0418473792076
So computing the number of graphs on n=5 vertices this way, the list method is slightly faster (although 100 test runs isn't a great sample size). But when I increased the length of the list being computed by trying graphs on n=7 vertices (i.e. changing graphs(5) to graphs(7)), the result was this:
average list method time = 4.14753051996
average sum method time = 3.96504004002
In this case the sum method was slightly faster. All in all, the two methods are approximately the same speed but the difference MIGHT depend on the length of your list (it might also just be that I only averaged over 100 test runs, which isn't very high -- would have taken forever otherwise).