Anagram sorting version vs Counting characters version - python

I am learning about time complexity, and I've read that there are two versions of checking whether two strings of the same length are anagrams:
Sorting the strings and comparing them: O(n log n)
Counting the characters: O(n)
I wanted to go ahead and see the same thing in code, so I wrote both versions and timed them with Python's timeit module, but I'm getting unexpected results.
import timeit

def method_one(input1, input2):
    """
    Check if two strings are anagrams by sorting them.
    """
    if len(input1) == len(input2):
        if sorted(input1) == sorted(input2):
            return True
    return False

def method_two(input1, input2):
    """
    Check if two strings are anagrams by counting characters.
    """
    count_char = [0] * 26
    if len(input1) == len(input2):
        for i in range(0, len(input1)):
            count_char[ord(input1[i]) - ord("a")] += 1
            count_char[ord(input2[i]) - ord("a")] -= 1
        for i in count_char:
            if bool(i):
                return False
        return True
    return False

timer1 = timeit.Timer("method_one('apple','pleap')", "from __main__ import method_one")
timer2 = timeit.Timer("method_two('apple','pleap')", "from __main__ import method_two")
print(timer1.timeit(number=10000))
print(timer2.timeit(number=10000))
method_one: 0.0203204
method_two: 0.1090699
Ideally the character-counting version should win here, but the results are the opposite of what I expected.

Time complexity describes how the execution time of an algorithm scales as its input grows. Since it ignores constant factors, an algorithm with a better time complexity is not guaranteed to run faster than one with a worse bound.
What time complexity tells you is that, as the input size grows towards infinity, the asymptotically more efficient algorithm will eventually run faster.
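To see the O(n) method pull ahead, you would have to time both functions on much larger inputs, along the lines of the rough sketch below (the input size of 2,000,000 characters is an arbitrary choice, and method_one/method_two are assumed to be the functions defined in the question; since method_two only handles lowercase ASCII, the test string is built from ascii_lowercase):
import random
import string
import timeit

# Build a large lowercase string and a shuffled copy of it, so both
# functions have to do real work instead of just paying call overhead.
letters = random.choices(string.ascii_lowercase, k=2_000_000)
big1 = "".join(letters)
big2 = "".join(random.sample(letters, len(letters)))

# Time a few runs of each method on the large input.
print(timeit.timeit(lambda: method_one(big1, big2), number=5))
print(timeit.timeit(lambda: method_two(big1, big2), number=5))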

Related

Why is this (presumably more efficient) dynamic algorithm being outperformed by the naive recursive version?

I have the following problem as homework:
Write an O(N^2) algorithm to determine whether the string can be broken into a list of words. You can start by writing an exponential algorithm and then use dynamic programming to improve the runtime complexity.
The naive exponential algorithm which I started out with is this:
def naiveWordBreak(S):
    if len(S) == 0:
        return True
    else:
        return any([(S[:i] in wordSet) and naiveWordBreak(S[i:]) for i in range(1, len(S) + 1)])
I then adapted this into the following dynamic algorithm:
def wordBreak(S):
    prefixTable = [False] * (len(S) + 1)
    prefixTable[0] = True
    return _helper(S, prefixTable)

def _helper(S, prefixTable):
    if prefixTable[len(S)]:
        return prefixTable[len(S)]
    else:
        for i in range(1, len(S) + 1):
            if S[:i] in wordSet and _helper(S[i:], prefixTable):
                prefixTable[i] = True
                return True
I am fairly confident from my proof and some testing that both algorithms are correct; however, the recursive method should run in exponential time while the dynamic method should be O(n^2). Out of curiosity I used the timeit library to measure how long both algorithms take on a batch of tests, and the results were surprising. The dynamic method only beats the recursive method by a fraction of a second, and, more confusingly, after running the same test a couple of times the recursive method actually gave a better runtime than the dynamic method. Here is the code I'm using for testing runtime:
import timeit

def testRecursive():
    naiveWordBreak("alistofwords")
    naiveWordBreak("anotherlistofwords")
    naiveWordBreak("stableapathydropon")
    naiveWordBreak("retouchesreissueshockshopbedrollsunspotassailsinstilledintersectionpipitreappointx")
    naiveWordBreak("xretouchesreissueshockshopbedrollsunspotassailsinstilledintersectionpipitreappoint")
    naiveWordBreak("realignitingrains")

def testDynamic():
    wordBreak("alistofwords")
    wordBreak("anotherlistofwords")
    wordBreak("stableapathydropon")
    wordBreak("retouchesreissueshockshopbedrollsunspotassailsinstilledintersectionpipitreappointx")
    wordBreak("xretouchesreissueshockshopbedrollsunspotassailsinstilledintersectionpipitreappoint")
    wordBreak("realignitingrains")

def main():
    recTime = timeit.timeit(testRecursive, number=1)
    dynTime = timeit.timeit(testDynamic, number=1)
    print("Performance Results:\n")
    print("Time for recursive method = {}\n".format(recTime))
    print("Time for dynamic method = {}\n".format(dynTime))
    if dynTime < recTime:
        print("Dynamic method produces better performance")
    else:
        print("Recursive method produces better performance")
The way I see it, there are only a few explanations for why the runtimes are inconsistent/not what I expected:
There is something wrong with my dynamic algorithm (or my analysis of it)
There is something wrong with my recursive algorithm
My test cases are insufficient
timeit isn't actually an appropriate library for what I'm trying to do
Does anyone have any insights or explanations?
The naive recursive approach is only slow when there are many, many ways to break up the same string into words. If there is only one way, then it will be linear.
Assuming that can, not and cannot are all words in your list, try a string like "cannot" * n. By the time you get to n=40, you should see the win pretty clearly.
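As a rough way to see that, one could time both functions on growing inputs built from repeated words, as in the sketch below (it assumes wordSet is a module-level set containing "can", "not" and "cannot", and the range of n is kept small so the naive version still finishes quickly):
import timeit

# Assumed word list; both functions in the question read wordSet as a global.
wordSet = {"can", "not", "cannot"}

for n in range(4, 13, 2):
    s = "cannot" * n
    naive_t = timeit.timeit(lambda: naiveWordBreak(s), number=1)
    dyn_t = timeit.timeit(lambda: wordBreak(s), number=1)
    print("n = {:2d}: naive = {:.4f}s, dynamic = {:.4f}s".format(n, naive_t, dyn_t))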

Time complexity analysis of two algorithms contradicts empirical results

I wrote the following simple function that checks whether str1 is a permutation of str2:
def is_perm(str1, str2):
    return True if sorted(str1) == sorted(str2) else False
Assuming that sorted(str) has a time complexity of O(n*logn), we can expect a time complexity of O(2*n*logn)=O(n*logn). The following function is an attempt to achieve a better time complexity:
def is_perm2(str1, str2):
    dict1 = {}
    dict2 = {}
    for char in str1:
        if char in dict1:
            dict1[char] += 1
        else:
            dict1[char] = 1
    for char in str2:
        if char in dict2:
            dict2[char] += 1
        else:
            dict2[char] = 1
    if dict1 == dict2:
        return True
    else:
        return False
Each for-loop iterates n times. Assuming that dictionary lookups and both dictionary updates have constant time complexity, I expect an overall complexity of O(2n) = O(n). However, timeit measurements show the following, contradictory results. Why is is_perm2 slower than is_perm after 10,000,000 executions even though its time complexity looks better? Are my assumptions wrong?
import timeit
print(timeit.timeit('is_perm("helloworld","worldhello")', 'from __main__ import is_perm', number=10000000))
print(timeit.timeit('is_perm2("helloworld","worldhello")', 'from __main__ import is_perm2', number=10000000))
# output of first print-call: 12.4199592999993934 seconds
# output of second print-call: 37.13826630001131 seconds
There is no guarantee that an algorithm with a time complexity of O(nlogn) will be slower than one with a time complexity of O(n) for a given input. The O(n) one could, for instance, have a large constant overhead, making it slower for input sizes below (say) 100000.
In your test the input size is 10 ("helloworld"), which doesn't tell us much. Repeating that test doesn't make a difference, even when repeated 10000000 times: the repetition only gives a more precise estimate of the average time spent on that particular input.
You would need to feed the algorithm with increasingly large inputs. If memory allows, that would eventually bring us to an input size for which the O(nlogn) algorithm takes more time than the O(n) algorithm.
In this case, I found that the input size had to be really large in comparison with available memory, and I only barely managed to find a case where the difference showed:
import random
import string
import timeit

def shuffled_string(s):
    lst = list(s)
    random.shuffle(lst)
    return "".join(lst)

def random_string(size):
    return "".join(random.choices(string.printable, k=size))

str1 = random_string(10000000)
str2 = shuffled_string(str1)

print("start")
print(timeit.timeit(lambda: is_perm(str1, str2), number=5))
print(timeit.timeit(lambda: is_perm2(str1, str2), number=5))
After the initial set up of the strings (which each have a size of 10 million characters), the output on repl.it was:
54.72847577700304
51.07616817899543
The reason why the input has to be so large to see this happen, is that sorted is doing all the hard work in lower-level, compiled code (often C), while the second solution does all the looping and character reading in Python code (often interpreted). It is clear that the overhead of the second solution is huge in comparison with the first.
Improving the second solution
Although not your question, we could improve the implementation of the second algorithm by relying on collections.Counter:
from collections import Counter

def is_perm3(str1, str2):
    return Counter(str1) == Counter(str2)
With the same test set up as above, the timing for this implementation on repl.it is:
24.917681352002546
Assuming that dictionary lookup and both dictionary updates have constant time complexity,
A Python dictionary is a hash map, so strictly speaking, dictionary lookups and updates cost O(n) in the worst case.
The average-case time complexity of is_perm2 is O(n), but its worst-case time complexity is O(n^2).
If you want a guaranteed O(n) time complexity, use a list (not a dictionary) to store the frequency of characters: convert each character to its ASCII code and use that as an index into the list.
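A minimal sketch of that idea (the 128-slot table assumes plain ASCII input, and is_perm_list is just an illustrative name):
def is_perm_list(str1, str2):
    if len(str1) != len(str2):
        return False
    counts = [0] * 128               # one slot per ASCII code
    for c1, c2 in zip(str1, str2):
        counts[ord(c1)] += 1         # count characters of str1
        counts[ord(c2)] -= 1         # cancel them out with characters of str2
    return all(c == 0 for c in counts)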

Can you help me with the time complexity of this Python code?

I have written this code and I think its time complexity is O(n+m), since the time depends on both inputs. Am I right? Is there a better algorithm you can suggest?
The function returns the length of the union of both inputs.
class Solution:
    def getUnion(self, a, b):
        p = 0
        lower, greater = a, b
        if len(a) > len(b):
            lower, greater = b, a
        while p < len(lower):  # O(n+m)
            if lower[p] in greater:
                greater.remove(lower[p])
            p += 1
        return len(lower + greater)

print(Solution().getUnion([1,2,3,4,5],[2,3,4,54,67]))
Assuming 𝑚 is the shorter length, and 𝑛 the longer (or both are equal), then the while loop will iterate 𝑚 times.
Inside a single iteration of that loop, a lower[p] in greater membership test is executed, which has a time complexity of O(𝑛) for each individual execution.
So the total time complexity is O(𝑚𝑛).
The correctness of this algorithm depends on whether we can assume that the input lists only contain unique values (each).
You can do better using a set:
return len(set(a + b))
Building a set is O(𝑚 + 𝑛), and getting its length is a constant-time operation, so this is O(𝑚 + 𝑛).
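As a drop-in replacement for the method in the question (a sketch; the class and method names simply mirror the original):
class Solution:
    def getUnion(self, a, b):
        # Building a set from the concatenated lists is O(m + n);
        # duplicates within and across the lists are removed automatically.
        return len(set(a + b))

print(Solution().getUnion([1, 2, 3, 4, 5], [2, 3, 4, 54, 67]))  # 7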

Is there a faster way to count non-overlapping occurrences in a string than count()?

Given a minimum length N and a string S of 1's and 0's (e.g. "01000100"), I am trying to return the number of non-overlapping occurrences of a sub-string of length N containing all '0's. For example, given N=2 and the string "01000100", the number of non-overlapping "00"s is 2.
This is what I have done:
def myfunc(S, N):
    return S.count('0' * N)
My question: is there a faster way of performing this for very long strings? This is from an online coding practice site and my code passes all but one of the test cases, which fails due to not being able to finish within a time limit. Doing some research it seems I can only find that count() is the fastest method for this.
This might be faster:
>>> s = "01000100"
>>> def my_count(a, n):
...     parts = a.split('1')
...     return sum(len(p) // n for p in parts)
...
>>> my_count(s, 2)
2
>>>
The worst-case scenario for count() is O(N^2); the function above is strictly linear, O(N). Here's the discussion where the O(N^2) figure comes from: What's the computational cost of count operation on strings Python?
Also, you can always do this manually, without using split(): just loop over the string, incrementing a counter on every '0' and, on every '1', adding counter // n to a running total before resetting the counter. This would beat any other approach because it is strictly O(N); see the sketch below.
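A minimal sketch of that single-pass approach (my_count_loop is just an illustrative name):
def my_count_loop(s, n):
    total = 0
    run = 0                       # length of the current run of '0's
    for ch in s:
        if ch == '0':
            run += 1              # extend the current run of zeros
        else:
            total += run // n     # bank the non-overlapping matches in this run
            run = 0               # a '1' breaks the run
    return total + run // n       # account for a trailing run of zeros

print(my_count_loop("01000100", 2))  # 2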
Finally, for relatively large values of n (n > 10, say), there might be a sub-linear algorithm (or still linear, but with a smaller constant), which starts by comparing a[n-1] to '0' and works backwards towards the beginning of the string. Chances are there is going to be a '1' somewhere, and if a[n-1] is '1' we don't have to analyse the beginning of the string at all, simply because there's no way to fit enough zeros in there. Assuming we find a '1' at position k, the next position to compare would be a[k+n-1], again working backwards from there.
This way we can effectively skip most of the string during the search.
lenik posted a very good response that worked well. I also found another method faster than count() that I will post here as well. It uses findall() from the re module:
import re

def my_count(a, n):
    return len(re.findall('0' * n, a))

Implementing Levenshtein distance in python

I have implemented the algorithm, but now I want to find the edit distance for the string which has the shortest edit distance to the other strings.
Here is the algorithm:
def lev(s1, s2):
    return min(lev(a[1:], b[1:]) + (a[0] != b[0]), lev(a[1:], b) + 1, lev(a, b[1:]) + 1)
Your "implementation" has several flaws:
(1) It should start with def lev(a, b):, not def lev(s1, s2):. Please get into the good habits of (a) running your code before asking questions about it (b) quoting the code that you've actually run (by copy/paste, not by (error-prone) re-typing).
(2) It has no termination conditions; for any arguments it will eventually end up trying to evaluate lev("", "") which would loop forever were it not for Python implementation limits: RuntimeError: maximum recursion depth exceeded.
You need to insert two lines:
if not a: return len(b)
if not b: return len(a)
to make it work.
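Putting those two fixes into the code from the question, the corrected (but still exponential-time) recursive version would look like this:
def lev(a, b):
    # Base cases: the distance to an empty string is the length of the other string.
    if not a:
        return len(b)
    if not b:
        return len(a)
    # Substitution (or match), deletion, and insertion.
    return min(lev(a[1:], b[1:]) + (a[0] != b[0]),
               lev(a[1:], b) + 1,
               lev(a, b[1:]) + 1)

print(lev("abc", "abd"))  # 1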
(3) The Levenshtein distance is defined recursively. There is no such thing as "the" (one and only) algorithm. Recursive code is rarely seen outside a classroom and then only in a "strawman" capacity.
(4) Naive implementations take time and memory proportional to len(a) * len(b) ... aren't those strings normally a little bit longer than 4 to 8?
(5) Your extremely naive implementation is worse, because it copies slices of its inputs.
You can find working not-very-naive implementations on the web ... google("levenshtein python") ... look for ones which use O(max(len(a), len(b))) additional memory.
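For reference, a common iterative formulation along those lines, keeping only one previous row of the table (this is the usual dynamic-programming approach, not the poster's code):
def lev_dp(a, b):
    # Classic dynamic-programming Levenshtein, keeping only the previous row,
    # so the additional memory is O(len(b)) instead of O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]

print(lev_dp("kitten", "sitting"))  # 3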
What you asked for ("the edit distance for the string which has the shortest edit distance to the other strings") doesn't make sense ... "THE string"??? "It takes two to tango" :-)
What you probably want (finding all pairs of strings in a collection which have the minimal distance), or maybe just that minimal distance, is a simple programming exercise. What have you tried?
By the way, finding those pairs by a simplistic algorithm will take O(N ** 2) executions of lev() where N is the number of strings in the collection ... if this is a real-world application, you should look to use proven code rather than try to write it yourself. If this is homework, you should say so.
Is this what you're looking for?
import itertools
import collections

# My simple implementation of Levenshtein distance
def levenshtein_distance(string1, string2):
    """
    >>> levenshtein_distance('AATZ', 'AAAZ')
    1
    >>> levenshtein_distance('AATZZZ', 'AAAZ')
    3
    """
    distance = 0
    if len(string1) < len(string2):
        string1, string2 = string2, string1
    for i, v in itertools.zip_longest(string1, string2, fillvalue='-'):
        if i != v:
            distance += 1
    return distance

# Find the string with the shortest total edit distance to the others.
list_of_string = ['AATC', 'TAGCGATC', 'ATCGAT']
strings_distances = collections.defaultdict(int)
for strings in itertools.combinations(list_of_string, 2):
    strings_distances[strings[0]] += levenshtein_distance(*strings)
    strings_distances[strings[1]] += levenshtein_distance(*strings)

shortest = min(strings_distances.items(), key=lambda x: x[1])
