String searching algorithm - Complexity of string matching - python

I am trying to solve the string search problem, but the answers on many sites seem complex ('naive string search' with O(m(n-m+1)) time). What's the issue with my algorithm below? It appears to have a worst-case complexity of O(n), and KMP is also O(n), so I must be wrong somewhere, but where?
def find(s1, s2):
    size = len(s1)
    index = 0
    while index != len(s2):
        if s2[index : index + size] == s1:
            print('Pattern found at index %s' % index)
            index += size
        else:
            index += 1
OK, so I was assuming s2[index : index + size] == s1 to be O(1), which is what made the whole thing look like O(n). So now my original question becomes:
Why aren't the hashes of the two strings calculated and compared? If both hashes are equal, the strings should be equal.
I don't get how they can collide. Isn't that dependent on the hashing algorithm? MD5, for example, has known breaks.

Original question
I don't think your code has complexity O(n), but rather O(mn), because of this check: s2[index : index + size] == s1. In the worst case it needs to do len(s1) character comparisons at every position.
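To see the quadratic behavior directly, here is a hedged sketch (the sizes and strings are arbitrary choices of mine) that times the loop above on an adversarial input, where nearly every alignment compares almost the whole pattern before failing:

import timeit

def find(s1, s2):
    # Same loop as above, with the print removed so only the search cost is timed
    size = len(s1)
    index = 0
    while index != len(s2):
        if s2[index : index + size] == s1:
            index += size
        else:
            index += 1

# Worst case: the pattern never occurs, but each alignment matches on
# roughly m characters before failing, so total work is on the order of m * n.
pattern = "a" * 999 + "b"   # m = 1000
text = "a" * 50_000         # n = 50,000
print(timeit.timeit(lambda: find(pattern, text), number=1))

Doubling the pattern length roughly doubles the time, which is the dependence on m that a genuinely O(n) algorithm would not show.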
Hashing
Here's Wikipedia's definition of a hash function:
A hash function is any function that can be used to map data of
arbitrary size to data of fixed size. The values returned by a hash
function are called hash values, hash codes, digests, or simply
hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup.
Here we run into the first problem with this approach. A hash function takes in a value of arbitrary size and returns a value of a fixed size. By the pigeonhole principle, there is at least one output value that corresponds to multiple inputs, and probably many more. As a quick example, imagine your hash function always produces an output that is one byte long. That means there are 256 possible outputs. After you've hashed 257 items, you're guaranteed that at least 2 of them share the same hash. To delay this for as long as possible, a good hash function will map inputs over all possible outputs as uniformly as possible.
So if the hashes aren't equal, you can be sure the strings aren't equal, but not vice versa. Two different strings can have the same hash.
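That's also why hash-based string search (Rabin-Karp, for example) still verifies candidate matches character by character. Here is a minimal sketch of the idea, using Python's built-in hash() purely for illustration:

def hash_search(pattern, text):
    """Report every index where pattern occurs in text.

    Hashes are compared first as a cheap filter; a real match is then
    confirmed with ==, because equal hashes do not guarantee equal strings.
    """
    m = len(pattern)
    target = hash(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        if hash(window) == target and window == pattern:
            hits.append(i)
    return hits

print(hash_search("aba", "ababa"))  # [0, 2]

Note that computing hash(window) from scratch is itself O(m); the real Rabin-Karp algorithm uses a rolling hash so that each window's hash is derived from the previous one in O(1).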

Related

Run time difference for "in" searching through "list" and "set" using Python

My understanding of list and set in Python is mainly that a list allows duplicates, is ordered, and has positional information. I found that while searching whether an element is within a list, the runtime is much faster if I convert the list to a set first. For example, I wrote some code to find the longest consecutive sequence in a list. Using the list 0 to 9999 as an example, the longest consecutive run is 10000. While using a list:
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        # Search whether number + length is also in the list
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The runtime for the above code is "Duration: 0:00:01.481939".
By adding only one line (the third line below) to convert the list to a set:
from datetime import datetime

start_time = datetime.now()
nums = list(range(10000))
nums = set(nums)
longest = 0
for number in nums:
    if number - 1 not in nums:
        length = 0
        # Search whether number + length is also in the set (was a list)
        while number + length in nums:
            length += 1
        longest = max(length, longest)
end_time = datetime.now()
timecost = 'Duration: {}'.format(end_time - start_time)
print(timecost)
The runtime for the above code using a set is now "Duration: 0:00:00.005138", many times shorter than searching through a list. Could anyone help me understand the reason for that? Thank you!
This is a great question.
The issue with arrays is that there is no smarter way to search in some array a besides just comparing every element one by one.
Sometimes you'll get lucky and get a match on the first element of a.
Sometimes you'll get unlucky and not find a match until the last element of a, or perhaps none at all.
On average, you'll have to search half the elements of the array each time.
This is said to have a "time complexity" of O(len(a)), or colloquially, O(n). This means the time taken by the algorithm (searching for a value in the array) is linearly proportional to the size of the input (the number of elements in the array to be searched). This is why it's called "linear search". Oh, your array got 2x bigger? Well, your searches just got 2x slower. 1000x bigger? 1000x slower.
Arrays are great, but they're 💩 for searching if the element count gets too high.
Sets are clever. In Python, they're implemented as if they were a dictionary with keys and no values. Like dictionaries, they're backed by a data structure called a hash table.
A hash table uses the hash of a value as a quick way to get a "summary" of an object. This "summary" is then used to narrow down the search, so it only needs to linearly scan a very small subset of all the elements. Searching in a hash table has a time complexity of O(1). Notice that there's no "n" or len(the_set) in there. That's because the time taken to search in a hash table does not grow as the size of the hash table grows. So it's said to have constant time complexity.
By analogy, you only search the dairy aisle when you're looking for milk. You know the hash value of milk (say, its aisle) is "dairy" and not "deli", so you don't have to waste any time searching for milk anywhere else.
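To make that concrete, here is a deliberately toy sketch of the bucket idea (my own simplification; CPython's real set uses open addressing, not per-bucket lists):

class ToyHashSet:
    """A simplified hash set: the hash picks a bucket, and only that
    short bucket is searched linearly."""

    def __init__(self, values, num_buckets=64):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        for v in values:
            self.buckets[hash(v) % num_buckets].append(v)

    def __contains__(self, value):
        bucket = self.buckets[hash(value) % self.num_buckets]
        return value in bucket  # linear scan, but over a tiny list

s = ToyHashSet(range(10000))
print(5000 in s, 12345 in s)  # True False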
A natural follow-up question is "then why don't we always use sets?". Well, there's a trade-off.
* As you mentioned, sets can't contain duplicates, so if you want to store two of something, they're a non-starter.
* Sets are unordered (it's dicts that preserve insertion order since Python 3.7, not sets), so if you care about the order of elements, they won't do, either.
* Sets generally have a larger CPU/memory overhead, which adds up when using many sets containing small numbers of elements.
* Because of fancy CPU features (like CPU caches and branch prediction), linear searching through small arrays can actually be faster than the hash-based look-up in sets (see the timing sketch below).
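A quick, hedged way to see the trade-off on your own machine (the exact numbers will vary, and for very small collections the gap can shrink or even reverse):

import timeit

small_list = list(range(5))
small_set = set(small_list)
big_list = list(range(100_000))
big_set = set(big_list)

# Membership tests, repeated many times
print("tiny list:", timeit.timeit(lambda: 3 in small_list, number=1_000_000))
print("tiny set: ", timeit.timeit(lambda: 3 in small_set, number=1_000_000))
print("big list: ", timeit.timeit(lambda: 99_999 in big_list, number=1_000))
print("big set:  ", timeit.timeit(lambda: 99_999 in big_set, number=1_000))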
I'd recommend you do some further reading on data structures and algorithms. This stuff is quite language-independent. Now that you know that set and dict use a hash table behind the scenes, you can look up resources that cover hash tables in any language, and that should help. There are also some Python-centric resources, like https://www.interviewcake.com/concept/python/hash-map

Time Complexity of getting value when key is very long

Let's assume the keys of a dictionary are very long, with length around M, where M is a very large number.
Then does that mean that, in terms of M, the time complexity of operations like
x=dic[key]
print(dic[key])
is O(M)? not O(1)?
How does it work?
If you're talking about string keys with M characters, yes, it can be O(M), and on two counts:
Computing the hash code can take O(M) time.
If the hash code of the key passed in matches the hash code of a key in the table, then the implementation has to go on to compute whether they're equal (what __eq__() returns). If they are equal, that requires exactly M+1 comparisons to determine (M for each character pair, and another compare at the start to verify that the (integer) string lengths are the same).
In rare cases it can be constant-time (independent of string length): those where the passed-in key is the same object as a key in the table. For example, in
k = "a" * 10000000
d = {k : 1}
print(k in d)
Time it, and compare with the timing when, e.g., you add this line just before the print:
k = k[:-1] + "a"
The change builds another key that is equal to the original k but is not the same object. So the internal pointer-equality test doesn't succeed, and a full-blown character-by-character comparison is needed.
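A hedged way to observe the gap (the absolute numbers depend on your machine; only the ratio matters):

import timeit

k = "a" * 10_000_000
d = {k: 1}

# Same object as the stored key: the identity shortcut fires,
# so no character-by-character comparison is needed.
same_object = k

# Equal value but a distinct object: the lookup falls through to a
# full-length __eq__ comparison every time.
equal_copy = k[:-1] + "a"

print("same object:", timeit.timeit(lambda: same_object in d, number=100))
print("equal copy: ", timeit.timeit(lambda: equal_copy in d, number=100))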

Most efficient way to get first value that startswith of large list

I have a very large list with over 100M strings. An example of that list looks as follows:
l = ['1,1,5.8067',
     '1,2,4.9700',
     '2,2,3.9623',
     '2,3,1.9438',
     '2,7,1.0645',
     '3,3,8.9331',
     '3,5,2.6772',
     '3,7,3.8107',
     '3,9,7.1008']
I would like to get the first string that starts with e.g. '3'.
To do so, I have used filter() with a lambda, followed by next(), to get the first item:
next(filter(lambda i: i.startswith('3,'), l))
Out[1]: '3,3,8.9331'
Considering the size of the list, this strategy unfortunately still takes quite a lot of time for a process I have to do over and over again. I was wondering if someone could come up with an even faster, more efficient approach. I am open to alternative strategies.
I have no way of testing it myself, but it is possible that if you join all the strings with a character that does not appear in any of them:
concat_list = '$'.join(l)
and then use a simple .find('$3,'), it would be faster. This is most likely to help if all the strings are relatively short, since now the whole text sits in one contiguous region of memory.
If the number of distinct letters in the text is small, you can use the Abrahamson-Kosaraju method and get a time complexity of practically O(n).
Another approach is to use joblib and create n parallel workers, where the i-th worker checks elements i + k * n; when one finds the pattern, it stops the others. The time complexity is then roughly that of the naive algorithm divided by n.
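For what it's worth, here is a sketch of how the join-and-find suggestion could recover the matching element; the '$' sentinel and the helper name are my own assumptions, not something from the question:

l = ['1,1,5.8067', '1,2,4.9700', '2,2,3.9623',
     '2,3,1.9438', '2,7,1.0645', '3,3,8.9331',
     '3,5,2.6772', '3,7,3.8107', '3,9,7.1008']

# One big string; '$' is assumed never to occur in the data.
# The leading '$' lets the very first element match as well.
concat = '$' + '$'.join(l)

def first_startswith(prefix):
    """Return the first element starting with prefix, or None."""
    i = concat.find('$' + prefix)
    if i == -1:
        return None
    end = concat.find('$', i + 1)
    return concat[i + 1:end if end != -1 else None]

print(first_startswith('3,'))  # '3,3,8.9331'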
Since your actual strings consist of relatively short tokens (such as 301) after splitting the strings by tabs, you can build a dict with each possible prefix of the first token as the keys, so that subsequent lookups take only O(1) average time.
Build the dict from the values of the list in reverse order, so that the first value in the list that starts with each distinct prefix is the one retained in the final dict:
d = {s[:i + 1]: s for s in reversed(l) for i in range(len(s.split('\t')[0]))}
so that given:
l = ['301\t301\t51.806763\n', '301\t302\t46.970094\n',
'301\t303\t39.962393\n', '301\t304\t18.943836\n',
'301\t305\t11.064584\n', '301\t306\t4.751911\n']
d['3'] will return '301\t301\t51.806763'.
If you only need to test each of the first tokens as a whole, rather than prefixes, you can simply make the first tokens as the keys instead:
d = {s.split('\t')[0]: s for s in reversed(l)}
so that d['301'] will return '301\t301\t51.806763'.

Why can’t you use Hash Tables/Dictionaries in Counting Sort algorithm?

When you use the counting sort algorithm you create a list, and use its indices as keys while adding the number of integer occurrences as the values within the list. Why is this not the same as simply creating a dictionary with the keys as the index and the counts as the values? Such as:
hash_table = collections.Counter(numList)
or
hash_table = {x:numList.count(x) for x in numList}
Once you have your hash table created, you essentially just copy the number of integer occurrences over to another list. Hash tables/dictionaries have O(1) lookup times, so why would this not be preferable if you're simply referencing the key/value pairs?
I've included the algorithm for Counting Sort below for reference:
def counting_sort(the_list, max_value):
    # List of 0's at indices 0...max_value
    num_counts = [0] * (max_value + 1)
    # Populate num_counts
    for item in the_list:
        num_counts[item] += 1
    # Populate the final sorted list
    sorted_list = []
    # For each item in num_counts
    for item, count in enumerate(num_counts):
        # For the number of times the item occurs
        for _ in range(count):
            # Add it to the sorted list
            sorted_list.append(item)
    return sorted_list
You certainly can do something like this. The question is whether it’s worthwhile to do so.
Counting sort has a runtime of O(n + U), where n is the number of elements in the array and U is the maximum value. Notice that as U gets larger and larger the runtime of this algorithm starts to degrade noticeably. For example, if U > n and I add one more digit to U (for example, changing it from 1,000,000 to 10,000,000), the runtime can increase by a factor of ten. This means that counting sort starts to become impractical as U gets bigger and bigger, and so you typically run counting sort when U is fairly small. If you’re going to run counting sort with a small value of U, then using a hash table isn’t necessarily worth the overhead. Hashing items costs more CPU cycles than just doing standard array lookups, and for small arrays the potential savings in memory might not be worth the extra time. And if you’re using a very large value of U, you’re better off switching to radix sort, which essentially is lots of smaller passes of counting sort with a very small value of U.
The other issue is that the reassembly step of counting sort has amazing locality of reference - you simply scan over the counts array and the output array in parallel, filling in values. If you use a hash table, you lose some of that locality because the elements in the hash table aren't necessarily stored consecutively.
But these are more implementation arguments than anything else. Fundamentally, counting sort is less about “use an array” and more about “build a frequency histogram.” It just happens to be the case that a regular old array is usually preferable to a hash table when building that histogram.
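For concreteness, here is a hedged sketch of the dictionary-based variant being discussed. It produces the same output as the array version, but note that the distinct keys still have to be sorted, which is where the hash table stops being a pure win:

from collections import Counter

def counting_sort_dict(the_list):
    """Counting sort with a hash-based histogram instead of an array."""
    counts = Counter(the_list)           # O(n) to build
    sorted_list = []
    # Unlike the array version, the keys don't come out in order for free:
    # sorting the k distinct values adds an O(k log k) step.
    for item in sorted(counts):
        sorted_list.extend([item] * counts[item])
    return sorted_list

print(counting_sort_dict([4, 1, 3, 4, 3]))  # [1, 3, 3, 4, 4]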

Python: time/platform independent fast hash for big sets of pairs of ints

I would like to get a time- and platform-independent hash function for big sets of pairs of integers in Python, which is also fast and has (almost certainly) no collisions. (Hm, what else would you want a hash to be - but anyway...)
What I have so far is to use hashlib.md5 on the string representation of the sorted list:
import hashlib

my_set = set([(1, 2), (0, 3), (1, 3)])  # the input set, size 1...10^6

def MyHash(my_set):
    my_lst = sorted(my_set)
    my_str = str(my_lst)
    # md5 needs bytes, so encode the string representation
    return hashlib.md5(my_str.encode('utf-8')).hexdigest()
my_set contains between 1 and 10^5 pairs, and each int is between 0 and 10^6. In total, I have about 10^8 such sets on which the hash should be almost certainly unique.
Does this sound reasonable, or is there a better way of doing it?
On my example set with 10^6 pairs in the list, this takes about 2.5 seconds, so improvements in time would be welcome, if possible. Almost all of the time is spent computing the string of the sorted list, so a big part of the question is:
Is the string representation of a sorted list of tuples of integers in Python stable across versions and platforms? Is there a better/faster way of obtaining a stable string representation?
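As a hedged aside (my own suggestion, not from the question): if each integer fits in an unsigned 32-bit value, packing the sorted pairs with struct under an explicit little-endian format sidesteps str() entirely and is stable across platforms and Python versions:

import hashlib
import struct

def my_hash_packed(my_set):
    """md5 of a platform-independent byte encoding of the sorted pairs.

    '<2I' is an explicit little-endian pair of unsigned 32-bit ints,
    which covers values in 0..10**6 and doesn't depend on str() formatting.
    """
    digest = hashlib.md5()
    for a, b in sorted(my_set):
        digest.update(struct.pack('<2I', a, b))
    return digest.hexdigest()

print(my_hash_packed({(1, 2), (0, 3), (1, 3)}))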
