I wrote the following simple function that checks whether str1 is a permutation of str2:
def is_perm(str1, str2):
    return True if sorted(str1) == sorted(str2) else False
Assuming that sorted(str) has a time complexity of O(n*logn), we can expect a time complexity of O(2*n*logn)=O(n*logn). The following function is an attempt to achieve a better time complexity:
def is_perm2(str1, str2):
    dict1 = {}
    dict2 = {}
    for char in str1:
        if char in dict1:
            dict1[char] += 1
        else:
            dict1[char] = 1
    for char in str2:
        if char in dict2:
            dict2[char] += 1
        else:
            dict2[char] = 1
    if dict1 == dict2:
        return True
    else:
        return False
Each for-loop iterates n times. Assuming that dictionary lookups and both dictionary updates have constant time complexity, I expect an overall complexity of O(2n) = O(n). However, timeit measurements show the following, contradictory results. Why is is_perm2 slower than is_perm over 10000000 executions even though its time complexity looks better? Are my assumptions wrong?
import timeit
print(timeit.timeit('is_perm("helloworld","worldhello")', 'from __main__ import is_perm', number=10000000))
print(timeit.timeit('is_perm2("helloworld","worldhello")', 'from __main__ import is_perm2', number=10000000))
# output of first print-call: 12.4199592999993934 seconds
# output of second print-call: 37.13826630001131 seconds
There is no guarantee that an algorithm with a time complexity of O(nlogn) will be slower than one with a time complexity of O(n) for a given input. The second one could, for instance, have a large constant overhead that makes it slower for input sizes below some threshold (say, 100000).
In your test the input size is 10 ("helloworld"), which doesn't tell us much. Repeating that test doesn't make a difference, even if repeated 10000000 times. The repetition only gives a more precise estimate of the average time spent on that particular input.
You would need to feed the algorithm with increasingly large inputs. If memory allows, that would eventually bring us to an input size for which the O(nlogn) algorithm takes more time than the O(n) algorithm.
In this case, I found that the input size had to be really large in comparison with available memory, and I only barely managed to find a case where the difference showed:
import random
import string
import timeit

def shuffled_string(s):
    lst = list(s)
    random.shuffle(lst)
    return "".join(lst)

def random_string(size):
    return "".join(random.choices(string.printable, k=size))

str1 = random_string(10000000)
str2 = shuffled_string(str1)

print("start")
print(timeit.timeit(lambda: is_perm(str1, str2), number=5))
print(timeit.timeit(lambda: is_perm2(str1, str2), number=5))
After the initial setup of the strings (which each have a size of 10 million characters), the output on repl.it was:
54.72847577700304
51.07616817899543
The reason the input has to be so large to see this happen is that sorted does all the hard work in lower-level, compiled code (typically C), while the second solution does all the looping and character reading in interpreted Python code. The constant-factor overhead of the second solution is huge in comparison with the first.
Improving the second solution
Although not your question, we could improve the implementation of the second algorithm by relying on collections.Counter:
from collections import Counter

def is_perm3(str1, str2):
    return Counter(str1) == Counter(str2)
With the same test setup as above, the timing for this implementation on repl.it is:
24.917681352002546
Assuming that dictionary lookup and both dictionary updates have constant time complexity,
A Python dictionary is a hash map, so dictionary lookups and updates cost O(n) in the worst case (when many keys hash to the same bucket).
The average time complexity of is_perm2 is therefore O(n), but its worst-case time complexity is O(n^2).
If you want a guaranteed O(n) time complexity, use a list (not a dictionary) to store the frequency of the characters.
You can convert each character to its ASCII code and use that as an index into the list.
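A minimal sketch of that list-based counting approach (my own illustration, not code from the answer; it assumes ASCII input, and the function name is made up):

def is_perm_list(str1, str2):
    # Assumes ASCII input: one counter slot per possible character code.
    if len(str1) != len(str2):
        return False
    counts = [0] * 128
    for char in str1:
        counts[ord(char)] += 1
    for char in str2:
        counts[ord(char)] -= 1
    return all(c == 0 for c in counts)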
Related
I have written this code and I think its time complexity is O(n+m), since the time depends on both of the inputs. Am I right? Is there a better algorithm you can suggest?
The function returns the length of the union of both inputs.
class Solution:
    def getUnion(self, a, b):
        p = 0
        lower, greater = a, b
        if len(a) > len(b):
            lower, greater = b, a
        while p < len(lower):  # O(n+m)
            if lower[p] in greater:
                greater.remove(lower[p])
            p += 1
        return len(lower + greater)

print(Solution().getUnion([1,2,3,4,5], [2,3,4,54,67]))
Assuming m is the shorter length and n the longer (or both are equal), the while loop will iterate m times.
Inside a single iteration of that loop an in greater operation is executed, which has a time complexity of O(n) for each individual execution.
So the total time complexity is O(mn).
The correctness of this algorithm depends on whether we can assume that each input list only contains unique values.
You can do better using a set:
return len(set(a + b))
Building a set is O(m + n), and getting its length is a constant-time operation, so this is O(m + n) overall.
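For completeness, here is how the method from the question could look with that change (just the one-liner above wrapped in the same class layout; nothing here is new beyond that):

class Solution:
    def getUnion(self, a, b):
        # set() removes duplicates across (and within) both lists in O(m + n)
        return len(set(a + b))

print(Solution().getUnion([1, 2, 3, 4, 5], [2, 3, 4, 54, 67]))  # prints 7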
Given a minimum length N and a string S of 1's and 0's (e.g. "01000100"), I am trying to return the number of non-overlapping occurrences of a substring of length N containing all '0's. For example, given N=2 and the string "01000100", the number of non-overlapping "00"s is 2.
This is what I have done:
def myfunc(S, N):
    return S.count('0'*N)
My question: is there a faster way of doing this for very long strings? This is from an online coding practice site and my code passes all but one of the test cases, which fails because it cannot finish within the time limit. From the research I've done, it seems that count() is the fastest method for this.
This might be faster:
>>> s = "01000100"
>>> def my_count(a, n):
...     parts = a.split('1')
...     return sum(len(p) // n for p in parts)
...
>>> my_count(s, 2)
2
>>>
The worst-case scenario for count() is O(N^2); the function above is strictly linear, O(N). Here's the discussion where the O(N^2) figure came from: What's the computational cost of count operation on strings Python?
Also, you can always do this manually, without using split(): just loop over the string, increase a counter on '0', and on '1' add counter // n to the total and reset the counter. This should beat the other approaches because it is strictly O(N) and builds no intermediate lists.
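A minimal sketch of that manual loop (my own reading of the idea, not code from the answer; assumes n >= 1):

def count_manual(a, n):
    total = 0
    run = 0                      # length of the current run of '0's
    for ch in a:
        if ch == '0':
            run += 1
        else:
            total += run // n    # each full block of n zeros is one occurrence
            run = 0
    return total + run // n      # account for a trailing run of zeros

print(count_manual("01000100", 2))  # 2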
Finally, for relatively large values of n (n > 10, say?), there might be a sub-linear (or still linear, but with a smaller constant) algorithm, which starts by comparing a[n-1] to '0' and works back towards the beginning. Chances are there is going to be a '1' somewhere, so we don't have to analyse the beginning of the string at all if a[n-1] is '1', simply because there's no way to fit enough zeros in there. Assuming we have found a '1' at position k, the next position to compare would be a[k+n-1], again working back towards the beginning of the string.
This way we can effectively skip most of the string during the search.
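A rough sketch of that skipping idea (my own interpretation of the description above, not code from the answer; it inspects the last position of each candidate window and jumps past any '1' it finds):

def count_with_skips(a, n):
    total = 0
    i = n - 1                             # index of the last character of the current window
    while i < len(a):
        j = i
        while j > i - n and a[j] == '0':
            j -= 1                        # scan the window backwards
        if j == i - n:                    # the whole window was zeros: one occurrence
            total += 1
            i += n                        # non-overlapping: jump past this window
        else:                             # a[j] == '1': no window containing j can qualify
            i = j + n                     # next candidate window ends just after position j
    return total

print(count_with_skips("01000100", 2))  # 2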
lenik posted a very good response that worked well. I also found another method that is faster than count(), which I will post here as well. It uses the findall() function from the re module:
import re

def my_count(a, n):
    return len(re.findall('0'*n, a))
I read in the Python 3 documentation that Python uses a hash table for dict(), so the search time complexity should be O(1), with O(N) as the worst case. However, in a course I recently took, the teacher said that happens only when you use int as the key. If you use a string of length L as keys the search time complexity is O(L).
I wrote a code snippet to test his claim:
import random
import string
from time import time
import matplotlib.pyplot as plt

def randomString(stringLength=10):
    """Generate a random string of fixed length"""
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))

def test(L):
    # L: int, length of keys
    N = 1000  # number of keys
    d = dict()
    for i in range(N):
        d[randomString(L)] = None

    tic = time()
    for key in d.keys():
        d[key]
    toc = time() - tic

    tic = time()
    for key in d.keys():
        pass
    t_idle = time() - tic

    t_total = toc - t_idle
    return t_total

L = [i * 10000 for i in range(5, 15)]
ans = [test(l) for l in L]

plt.figure()
plt.plot(L, ans)
plt.show()
The result is very interesting. In the resulting plot, the x-axis is the length of the strings used as keys and the y-axis is the total time to query all 1000 keys in the dictionary.
Can anyone explain this result?
Please be gentle with me. If I am asking such a basic question, it means I don't have the ability to read the Python source code or equally complex internal documentation.
Since a dictionary is a hash table, and looking up a key in a hash table requires computing the key's hash, the time complexity of looking up a key in the dictionary cannot be less than the time complexity of the hash function.
In current versions of CPython, computing the hash of a string of length L takes O(L) time the first time you hash that particular string object, and O(1) time if the hash for that string object has already been computed (since the hash is cached on the object):
>>> from timeit import timeit
>>> s = 'b' * (10**9) # string of length 1 billion
>>> timeit(lambda: hash(s), number=1)
0.48574538500002973 # half a second
>>> timeit(lambda: hash(s), number=1)
5.301000044255488e-06 # 5 microseconds
So that's also how long it takes when you look up the key in a dictionary:
>>> s = 'c' * (10**9) # string of length 1 billion
>>> d = dict()
>>> timeit(lambda: s in d, number=1)
0.48521506899999167 # half a second
>>> timeit(lambda: s in d, number=1)
4.491000026973779e-06 # 5 microseconds
You also need to be aware that a key in a dictionary is not looked up only by its hash: when the hashes match, it still needs to test that the key you looked up is equal to the key used in the dictionary, in case the hash matching is a false positive. Testing equality of strings takes O(L) time in the worst case:
>>> s1 = 'a'*(10**9)
>>> s2 = 'a'*(10**9)
>>> timeit(lambda: s1 == s2, number=1)
0.2006020820001595
So for a key of length L and a dictionary of length n:
If the key is not present in the dictionary, and its hash has already been cached, then it takes O(1) average time to confirm it is absent.
If the key is not present and its hash has not been cached, then it takes O(L) average time because of computing the hash.
If the key is present, it takes O(L) average time to confirm it is present whether or not the hash needs to be computed, because of the equality test.
The worst case is always O(nL) because if every hash collides and the strings are all equal except in the last places, then a slow equality test has to be done n times.
only when you use int as the key. If you use a string of length L as keys the search time complexity is O(L)
Just to address a point not covered by kaya3's answer....
Why people often say a hash table insertion, lookup or erase is an O(1) operation:
For many real-world applications of hash tables, the typical length of keys doesn't tend to grow regardless of how many keys you're storing. For example, if you made a hash set to store the names in a telephone book, the average name length for the first 100 people is probably very close to the average length for absolutely everyone. For that reason, the time spent to look for a name is no worse when you have a set of ten million names, versus that initial 100 (this kind of analysis normally ignores the performance impact of CPU cache sizes, and RAM vs disk speeds if your program starts swapping). You can reason about the program without thinking about the length of the names: e.g. inserting a million names is likely to take roughly a thousand times longer than inserting a thousand.
Other times, an application has a hash table where the keys may vary significantly in size. Imagine, say, a hash set where the keys are binary data encoding videos: one data set is old Standard Definition 24fps video clips, while another is 8k UHD 60fps movies. The time taken to insert these sets of keys won't simply be in the ratio of the numbers of such keys, because there are vastly different amounts of work involved in key hashing and comparison. In this case, if you want to reason about insertion time for keys of different sizes, a big-O performance analysis would be useless without a factor for key length. You could still describe the relative performance for data sets with similarly sized keys, considering only the normal hash table performance characteristics. When key hashing times could become a problem, you may well want to consider whether your application design is still a good idea, or whether e.g. you could have used a set of filenames instead of the raw video data.
I am learning about time complexity, and I've read that to check whether two strings of the same length are anagrams there are two approaches:
Sorting the strings and comparing them: O(nlogn)
Counting the characters: O(n)
I wanted to go ahead and verify this with code as well.
So I've written the two versions of the code and measured their time using Python's timeit module, but I'm getting unexpected results.
import timeit

def method_one(input1, input2):
    """
    Check if two strings are anagrams by sorting
    """
    if len(input1) == len(input2):
        if sorted(input1) == sorted(input2):
            return True
    return False

def method_two(input1, input2):
    """
    Check if two strings are anagrams using the count-the-characters method
    """
    count_char = [0] * 26
    if len(input1) == len(input2):
        for i in range(0, len(input1)):
            count_char[ord(input1[i]) - ord("a")] += 1
            count_char[ord(input2[i]) - ord("a")] -= 1
        for i in count_char:
            if i:
                return False
        return True
    return False

timer1 = timeit.Timer("method_one('apple','pleap')", "from __main__ import method_one")
timer2 = timeit.Timer("method_two('apple','pleap')", "from __main__ import method_two")

print(timer1.timeit(number=10000))
print(timer2.timeit(number=10000))
method_one: 0.0203204
method_two: 0.1090699
Ideally, counting characters should win here, but the results are the opposite of what I expected.
Time complexity describes how the execution time of the algorithm scales as the input to the algorithm grows. Since it ignores constant factors, an algorithm with a better time complexity is not guaranteed to run faster than one with a higher bound.
What time complexity tells you is that, as the input size grows towards infinity, the asymptotically more efficient algorithm will eventually run faster.
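As a rough way to see this in practice, one could rerun the timing with a much larger input, similar to the large-input test earlier on this page. A minimal sketch (the helper and the size are illustrative, not from the original answer; method_two only handles lowercase letters, so the random strings are drawn from 'a'..'z'):

import random
import string
import timeit

def anagram_pair(size):
    # Build a large lowercase string and a shuffled copy of it (an anagram).
    s1 = ''.join(random.choices(string.ascii_lowercase, k=size))
    lst = list(s1)
    random.shuffle(lst)
    return s1, ''.join(lst)

big1, big2 = anagram_pair(1000000)
print(timeit.timeit(lambda: method_one(big1, big2), number=5))
print(timeit.timeit(lambda: method_two(big1, big2), number=5))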
I am trying to write a count sort in Python to beat the built-in timsort in certain situations. Right now it beats the built-in sorted function, but only for very large arrays (1 million integers in length and longer; I haven't tried over 10 million) and only for a range no larger than 10,000. Additionally, the victory is narrow, with count sort only winning by a significant margin on random lists specifically tailored to it.
I have read about astounding performance gains that can be gained from vectorizing python code, but I don't particularly understand how to do it or how it could be used here. I would like to know how I can vectorize this code to speed it up, and any other performance suggestions are welcome.
Current fastest version using just Python and the standard library:
from itertools import chain, repeat

def untimed_countsort(unsorted_list):
    counts = {}
    for num in unsorted_list:
        try:
            counts[num] += 1
        except KeyError:
            counts[num] = 1
    # counts.get(num, 0) avoids a KeyError for values that never occur in the input
    sorted_list = list(
        chain.from_iterable(
            repeat(num, counts.get(num, 0))
            for num in xrange(min(counts), max(counts) + 1)))
    return sorted_list
All that counts is raw speed here, so sacrificing even more space for speed gains is completely fair game.
I realize the code is fairly short and clear already, so I don't know how much room there is for improvement in speed.
If anyone has a change to the code to make it shorter, as long as it doesn't make it slower, that would be awesome as well.
Execution time is down almost 80%! Now three times as fast as Timsort on my current tests!
The absolute fastest way to do this by a LONG shot is using this one-liner with numpy:
import numpy

def np_sort(unsorted_np_array):
    return numpy.repeat(numpy.arange(1 + unsorted_np_array.max()),
                        numpy.bincount(unsorted_np_array))
This runs about 10-15 times faster than the pure python version, and about 40 times faster than Timsort. It takes a numpy array in and outputs a numpy array.
With numpy, this function reduces to the following:
import numpy

def countsort(unsorted):
    unsorted = numpy.asarray(unsorted)
    return numpy.repeat(numpy.arange(1 + unsorted.max()), numpy.bincount(unsorted))
This ran about 40 times faster when I tried it on 100000 random ints from the interval [0, 10000). bincount does the counting, and repeat converts from counts to a sorted array.
Without changing your algorithm, this will help get rid of most of your pure Python loops (which are quite slow) by turning them into comprehensions or generators (usually faster than regular for blocks). Also, if you have to make a list consisting of repetitions of the same element, the [x]*n syntax is probably the fastest way to go. The sum is used to flatten the list of lists.
from collections import defaultdict

def countsort(unsorted_list):
    lmin, lmax = min(unsorted_list), max(unsorted_list) + 1
    counts = defaultdict(int)
    for j in unsorted_list:
        counts[j] += 1
    # sum(..., []) concatenates the per-value lists into one flat sorted list
    return sum([[num] * counts[num] for num in xrange(lmin, lmax) if num in counts], [])
Note that this is not vectorized, nor does it use numpy.