Let's assume that the keys of a dictionary are very long, with length around M, where M is a very large number.
Does that mean that, in terms of M, the time complexity of operations like
x=dic[key]
print(dic[key])
is O(M) rather than O(1)?
How does it work?
If you're talking about string keys with M characters, yes, it can be O(M), and on two counts:
Computing the hash code can take O(M) time.
If the hash code of the key passed in matches the hash code of a key in the table, then the implementation has to go on to compute whether they're equal (what __eq__() returns). If they are equal, that requires exactly M+1 comparisons to determine (M for each character pair, and another compare at the start to verify that the (integer) string lengths are the same).
In rare cases it can be constant-time (independent of string length): those where the passed-in key is the same object as a key in the table. For example, in
k = "a" * 10000000
d = {k : 1}
print(k in d)
Time it, and compare to the time taken when, e.g., this line is added just before the last line:
k = k[:-1] + "a"
The change builds another key that is equal to the original k but is not the same object, so the internal pointer-equality test doesn't succeed and a full character-by-character comparison is needed.
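A rough way to see the difference is to time both lookups (a sketch using timeit; exact numbers will vary by machine and Python build):

import timeit

k = "a" * 10000000
d = {k: 1}
k2 = k[:-1] + "a"   # equal to k, but a distinct object

# Same object as the stored key: the identity shortcut skips the character comparison.
print(timeit.timeit(lambda: k in d, number=100))

# Equal but distinct object: each lookup falls back to a full O(M) character comparison.
print(timeit.timeit(lambda: k2 in d, number=100))

The first lookup benefits from the identity shortcut (and CPython's cached string hashes), so it stays fast regardless of key length; the second pays for a full comparison on every call and grows with M.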
When you use the counting sort algorithm you create a list and use its indices as keys, storing the number of occurrences of each integer as the values in the list. Why is this not the same as simply creating a dictionary with the indices as keys and the counts as values? Such as:
hash_table = collections.Counter(numList)
or
hash_table = {x:numList.count(x) for x in numList}
Once you have your hash table created you essentially just copy the number of integer occurrences over to another list. Hash tables/dictionaries have O(1) lookup times, so why would this not be preferable if you're simply referencing the key/value pairs?
I've included the algorithm for Counting Sort below for reference:
def counting_sort(the_list, max_value):
    # List of 0's at indices 0...max_value
    num_counts = [0] * (max_value + 1)
    # Populate num_counts
    for item in the_list:
        num_counts[item] += 1
    # Populate the final sorted list
    sorted_list = []
    # For each item in num_counts
    for item, count in enumerate(num_counts):
        # For the number of times the item occurs
        for _ in xrange(count):
            # Add it to the sorted list
            sorted_list.append(item)
    return sorted_list
You certainly can do something like this. The question is whether it’s worthwhile to do so.
Counting sort has a runtime of O(n + U), where n is the number of elements in the array and U is the maximum value. Notice that as U gets larger and larger the runtime of this algorithm starts to degrade noticeably. For example, if U > n and I add one more digit to U (for example, changing it from 1,000,000 to 10,000,000), the runtime can increase by a factor of ten. This means that counting sort starts to become impractical as U gets bigger and bigger, and so you typically run counting sort when U is fairly small. If you’re going to run counting sort with a small value of U, then using a hash table isn’t necessarily worth the overhead. Hashing items costs more CPU cycles than just doing standard array lookups, and for small arrays the potential savings in memory might not be worth the extra time. And if you’re using a very large value of U, you’re better off switching to radix sort, which essentially is lots of smaller passes of counting sort with a very small value of U.
The other issue is that the reassembly step of counting sort has amazing locality of reference - you simply scan over the counts array and the output array in parallel filling in values. If you use a hash table, you lose some of that locality because the elements in the hash table aren't necessarily stored consecutively.
But these are more implementation arguments than anything else. Fundamentally, counting sort is less about “use an array” and more about “build a frequency histogram.” It just happens to be the case that a regular old array is usually preferable to a hash table when building that histogram.
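For comparison, here is a sketch of the hash-table variant the question describes (using collections.Counter as the frequency histogram). It still sorts correctly in O(n + U), but it pays hashing overhead per element and gives up some of the locality described above:

import collections

def counting_sort_with_dict(the_list, max_value):
    # Same algorithm, but the histogram is a Counter (hash table) instead of
    # an array indexed by value.
    num_counts = collections.Counter(the_list)
    sorted_list = []
    for item in range(max_value + 1):
        sorted_list.extend([item] * num_counts[item])  # missing keys count as 0
    return sorted_list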
I am trying to solve the string-search problem, but the answers on many sites seem complex ('naive string search' with O(m(n-m+1))). What's the issue with my algorithm below? It has a worst-case complexity of O(n), while KMP is also O(n), so I must definitely be wrong somewhere, but where?
def find(s1, s2):
    size = len(s1)
    index = 0
    while index != len(s2):
        if s2[index : index + size] == s1:
            print 'Pattern found at index %s' % index
            index += size
        else:
            index += 1
OK, so I was assuming that s2[index : index + size] == s1 is O(1), when it is actually O(m), where m is the length of s1. So my original question becomes:
Why aren't the hashes of the two strings computed and compared? If both hashes are equal, the strings should be equal.
I don't see how they can collide. Isn't that dependent on the hashing algorithm? MD5, for example, has known breaks.
Original question
I don't think your code has complexity O(n), but rather O(mn). The check s2[index : index + size] == s1 is not constant time: in the worst case it has to compare len(s1) characters.
Hashing
Here's Wikipedia's definition of a hash function:
A hash function is any function that can be used to map data of arbitrary size to data of fixed size. The values returned by a hash function are called hash values, hash codes, digests, or simply hashes. One use is a data structure called a hash table, widely used in computer software for rapid data lookup.
Here we run into the first problem with this approach. A hash function takes in a value of arbitrary size, and returns a value of a fixed size. Following the pigeonhole principle, there is at least one hash with multiple values, probably more. As a quick example, imagine your hash function always produces an output that is one byte long. That means there are 256 possible outputs. After you've hashed 257 items, you'll always be certain there are at least 2 items with the same hash. To avoid this for as long as possible, a good hash function will map inputs over all possible outputs as uniformly as possible.
So if the hashes aren't equal, you can be sure the strings aren't equal, but not vice versa. Two different strings can have the same hash.
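To make this concrete, here is a minimal sketch (hypothetical code, not from the question) of hash-then-verify matching: unequal hashes rule a window out immediately, but an apparent match still has to be confirmed character by character. Note also that computing hash(window) from scratch is itself O(m) per position, so this does not beat the naive search; Rabin-Karp only gains by updating the hash incrementally (a rolling hash).

def find_with_hash(pattern, text):
    m = len(pattern)
    target = hash(pattern)
    for index in range(len(text) - m + 1):
        window = text[index:index + m]
        # Unequal hashes prove the strings differ; equal hashes only suggest a
        # match, so a full comparison is still required.
        if hash(window) == target and window == pattern:
            print('Pattern found at index %s' % index)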
How are __hash__ and __eq__ used for identification in sets?
For example, here is some code that should help solve a domino puzzle:
class foo(object):
    def __init__(self, one, two):
        self.one = one
        self.two = two

    def __eq__(self, other):
        if (self.one == other.one) and (self.two == other.two): return True
        if (self.two == other.one) and (self.one == other.two): return True
        return False

    def __hash__(self):
        return hash(self.one + self.two)

s = set()
for i in range(7):
    for j in range(7):
        s.add(foo(i, j))

len(s)  # returns 28. Why?
If I use only __eq__(), len(s) equals 49. That makes sense, because as I understand it the objects (1-2 and 2-1, for example) are not the same, even though they represent the same domino. So I added the hash function.
Now it works the way I want, but there is one thing I don't understand: the hashes of 1-3 and 2-2 should be the same, so they should be counted as the same object and not both be added to the set. But they are! I'm stuck.
Equality for dict/set purposes depends on equality as defined by __eq__. However, it is required that objects that compare equal have the same hash value, and that is why you need __hash__. See this question for some similar discussion.
The hash itself does not determine whether two objects count as the same in dictionaries. The hash is like a "shortcut" that only works one way: if two objects have different hashes, they are definitely not equal; but if they have the same hash, they still might not be equal.
In your example, you defined __hash__ and __eq__ to do different things. The hash depends only on the sum of the numbers on the domino, but the equality depends on both individual numbers (in order). This is legal, since it is still the case that equal dominoes have equal hashes. However, like I said above, it doesn't mean that equal-sum dominoes will be considered equal. Some unequal dominoes will still have equal hashes. But equality is still determined by __eq__, and __eq__ still looks at both numbers, in order, so that's what determines whether they are equal.
It seems to me that the appropriate thing to do in your case is to define both __hash__ and __eq__ to depend on a consistently ordered pair: that is, use the greater of the two numbers first, then the lesser. This will mean that 2-1 and 1-2 are considered the same.
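A sketch of that suggestion (hypothetical code, reusing the names from the question): normalize the pair once, and base both __eq__ and __hash__ on the normalized form.

class foo(object):
    def __init__(self, one, two):
        self.one = one
        self.two = two

    def _normalized(self):
        # Order-independent representation: (greater, lesser)
        return (max(self.one, self.two), min(self.one, self.two))

    def __eq__(self, other):
        return self._normalized() == other._normalized()

    def __hash__(self):
        return hash(self._normalized())

With this, 1-2 and 2-1 compare equal and hash equal (so the set keeps only one of them), while 1-3 and 2-2 neither compare equal nor, in general, share a hash.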
The hash is only a hint to help Python arrange the objects. When looking for some object foo in a set, it still has to check each object in the set with the same hash as foo.
It's like having a bookshelf for every letter of the alphabet. Say you want to add a new book to your collection, only if you don't have a copy of it already; you'd first go to the shelf for the appropriate letter. But then you have to look at each book on the shelf and compare it to the one in your hand, to see if it's the same. You wouldn't discard the new book just because there's something already on the shelf.
If you want to use some other value to filter out "duplicates", then use a dict that maps the domino's total value to the first domino you saw. Don't subvert builtin Python behavior to mean something entirely different. (As you've discovered, it doesn't work in this case, anyway.)
The requirement for hash functions is that if x == y for two values, then hash(x) == hash(y). The reverse need not be true.
You can easily see why this is the case by considering hashing of strings. Lets say that hash(str) returns a 32-bit number, and we are hashing strings longer than 4 characters long (i.e. contain more than 32 bits). There are more possible strings than there are possible hash values, so some non-equal strings must share the same hash (this is an application of the pigeonhole principle).
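For a concrete (CPython-specific) example of unequal values sharing a hash: CPython uses -1 internally as an error code, so hash(-1) is remapped to -2 and collides with hash(-2).

print(hash(-1) == hash(-2))  # True in CPython
print(-1 == -2)              # False: unequal objects, equal hashes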
Python sets are implemented as hash tables. When checking whether an object is a member of the set, it will call its hash function and use the result to pick a bucket, and then use the equality operator to see if it matches any of the items in the bucket.
With your implementation, the 2-2 and 1-3 dominoes will end up in the same hash bucket, but they don't compare equal. Therefore, they can both be added to the set.
You can read about this in the Python data model documentation, but the short answer is that you can rewrite your hash function as:
def __hash__(self):
    return hash(tuple(sorted((self.one, self.two))))
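As a quick check (a sketch, assuming the foo class from the question with this revised __hash__):

a, b = foo(1, 2), foo(2, 1)
c, d = foo(1, 3), foo(2, 2)

print(a == b)              # True: the same domino either way round
print(hash(a) == hash(b))  # True: required, because they compare equal
print(c == d)              # False: different dominoes
print(hash(c) == hash(d))  # False: with the sorted-tuple hash they no longer collide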
I like the sound of the answer provided by Eevee, but I had difficulty imagining an implementation. Here's my interpretation, explanation, and implementation of that answer.
Use the sum of the two domino values as the dictionary key.
Store either of the domino values as the dictionary value.
For example, given the domino '12', the sum is 3, and therefore the dictionary key will be 3. We can then pick either value (1 or 2) to store in that position (we'll pick the first value, 1).
domino_pairs = {}
pair = '12'
pair_key = sum(map(int, pair))
domino_pairs[pair_key] = int(pair[0]) # Store the first pair's first value.
print domino_pairs
Outputs:
{3: 1}
Although we're only storing a single value from the domino pair, the other value can easily be calculated from the dictionary key and value:
pair = '12'
pair_key = sum(map(int, pair))
domino_pairs[pair_key] = int(pair[0]) # Store the first pair's first value.
# Retrieve pair from dictionary.
print pair_key - domino_pairs[pair_key] # 3-1 = 2
Outputs:
2
But, since two different pairs may have the same total, we need to store multiple values against a single key. So, we store a list of values against each key (i.e. the sum of the pair's two values). Putting this into a function:
def add_pair(dct, pair):
    pair_key = sum(map(int, pair))
    if pair_key not in dct:
        dct[pair_key] = []
    dct[pair_key].append(int(pair[0]))
domino_pairs = {}
add_pair(domino_pairs, '22')
add_pair(domino_pairs, '04')
print domino_pairs
Outputs:
{4: [2, 0]}
This makes sense. Both pairs sum to 4, yet the first value in each pair differs, so we store both. The implementation so far will allow duplicates:
domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
print domino_pairs
Outputs
{4: [4, 0]}
'40' and '04' are the same in Dominos, so we don't need to store both. We need a way of checking for duplicates. To do this we'll define a new function, has_pair:
def has_pair(dct, pair):
    pair_key = sum(map(int, pair))
    if pair_key not in dct:
        return False
    return (int(pair[0]) in dct[pair_key] or
            int(pair[1]) in dct[pair_key])
As before, we get the sum (our dictionary key). If it's not in the dictionary, then the pair cannot exist. If it is in the dictionary, we must check whether either value in our pair exists in the dictionary 'bucket'. Let's insert this check into add_pair so that we don't add duplicate domino pairs:
def add_pair(dct, pair):
    pair_key = sum(map(int, pair))
    if has_pair(dct, pair):
        return
    if pair_key not in dct:
        dct[pair_key] = []
    dct[pair_key].append(int(pair[0]))
Now duplicate domino pairs are handled correctly:
domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
print domino_pairs
Outputs:
{4: [4]}
Lastly, a print function shows how storing only the sum of a domino pair plus a single value from that pair is equivalent to storing the pair itself:
def print_pairs(dct):
    for total in dct:
        for a in dct[total]:
            a = int(a)
            b = int(total) - int(a)
            print '(%d, %d)' % (a, b)
Testing:
domino_pairs = {}
add_pair(domino_pairs, '40')
add_pair(domino_pairs, '04')
add_pair(domino_pairs, '23')
add_pair(domino_pairs, '50')
print_pairs(domino_pairs)
Outputs:
(4, 0)
(2, 3)
(5, 0)
Delete keys from a Python dict if the key pattern-matches a key in another dict.
e.g.
a={'a.b.c.test':1, 'b.x.d.pqr':2, 'c.e.f.dummy':3, 'd.x.y.temp':4}
b={'a.b.c':1, 'b.p.q':20}
result
a={'b.x.d.pqr':2, 'c.e.f.dummy':3, 'd.x.y.temp':4}
If "pattern match with other dict key" means "starts with any key in the other dict", the most direct way to write that would be like this:
a = {k:v for (k, v) in a.items() if any(k.startswith(k2) for k2 in b)}
If that's hard to follow at first glance, it's basically the equivalent of this:
def matches(key1, d2):
    for key2 in d2:
        if key1.startswith(key2):
            return True
    return False

c = {}
for key in a:
    if not matches(key, b):
        c[key] = a[key]
a = c
This is going to be slower than necessary. If a has N keys, and b has M keys, the time taken is O(NM). While you can check "does key k exist in dict b" in constant time, there's no way to check "is any key of b a prefix of k" without iterating over the whole dict. So, if b is potentially large, you probably want to search sorted(b.keys()) with a binary search, which will get the time down to O(N log M). But if this isn't a bottleneck, you may be better off sticking with the simple version, just because it's simple.
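A sketch of that binary-search idea (hypothetical helper names, assuming the a and b dicts above). One subtlety: if one key of b can be a prefix of another, first reduce b's keys to a prefix-free set, so that checking the single closest candidate is enough.

import bisect

def minimal_prefixes(keys):
    # Keep only keys that don't start with a shorter kept key; the "starts with
    # any key of b" predicate is unchanged, and the result is prefix-free.
    result = []
    for k in sorted(keys):
        if not result or not k.startswith(result[-1]):
            result.append(k)
    return result

def starts_with_any(key, sorted_prefixes):
    # With a prefix-free, sorted list, the only possible prefix of `key` is the
    # entry that sorts immediately at or before it.
    i = bisect.bisect_right(sorted_prefixes, key)
    return i > 0 and key.startswith(sorted_prefixes[i - 1])

prefixes = minimal_prefixes(b)  # O(M log M), done once
a = {k: v for (k, v) in a.items() if not starts_with_any(k, prefixes)}  # O(N log M)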
Note that I'm generating a new a with the matches filtered out, rather than deleting the matches. This is almost always a better solution than deleting in-place, for multiple reasons:
* It's much easier to reason about. Treating objects as immutable and doing pure operations on them means you don't need to think about how states change over time. For example, the naive way to delete in place would run into the problem that you're changing the dictionary while iterating over it, which will raise an exception. Issues like that never come up without mutable operations.
* It's easier to read, and (once you get the hang of it) even to write.
* It's almost always faster. (One reason is that it takes a lot more memory allocations and deallocations to repeatedly modify a dictionary than to build one with a comprehension.)
The one tradeoff is memory usage. The delete-in-place implementation has to make a copy of all of the keys; the built-a-new-dict implementation has to have both the filtered dict and the original dict in memory. If you're keeping 99% of the values, and the values are much larger than the keys, this could hurt you. (On the other hand, if you're keeping 10% of the values, and the values are about the same size as the keys, you'll actually save space.) That's why it's "almost always" a better solution, rather than "always".
for key in list(a.keys()):
    if any(key.startswith(k) for k in b):
        del a[key]
Replace key.startswith(k) with an appropriate condition for "matching".
c = {}  # result in dict c
for key in b.keys():
    if all(z.count(key) == 0 for z in a.keys()):  # the key from b should not be a substring of any key in a
        c[key] = b[key]
I'm iterating through 3 words, each about 5 million characters long, and I want to find the sequences of 20 characters that identify each word. That is, I want to find all sequences of length 20 that occur in one word and are unique to that word. My problem is that the code I've written takes an extremely long time to run; I've never even completed one word running my program overnight.
The function below takes a list of dictionaries, where each dictionary contains every possible sequence of length 20 and its location, taken from one of the 5-million-character words.
If anybody has an idea of how to optimize this I would be really thankful; I don't have a clue how to continue...
Here's a sample of my code:
def findUnique(list):
    # Takes a list of dictionaries and compares each element in the dictionaries
    # with the others, puts all unique elements in new dictionaries, and finally
    # puts the new dictionaries in a list.
    # The result is a list with (in this case) 3 dictionaries containing all unique
    # sequences and their locations from each string.
    dicList = []
    listlength = len(list)
    s = 0
    valuelist = []
    for i in list:
        j = i.values()
        valuelist.append(j)
    while s < listlength:
        currdic = list[s]
        dic = {}
        for key in currdic:
            currval = currdic[key]
            test = True
            n = 0
            while n < listlength:
                if n != s:
                    if currval in valuelist[n]:  # this is where it takes too much time
                        n = listlength
                        test = False
                    else:
                        n += 1
                else:
                    n += 1
            if test:
                dic[key] = currval
        dicList.append(dic)
        s += 1
    return dicList
def slices(seq, length, prefer_last=False):
    unique = {}
    if prefer_last:  # this doesn't have to be a parameter, just choose one
        for start in xrange(len(seq) - length + 1):
            unique[seq[start:start+length]] = start
    else:  # prefer first
        for start in xrange(len(seq) - length, -1, -1):
            unique[seq[start:start+length]] = start
    return unique

# or find all locations for each slice:
import collections

def slices(seq, length):
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[seq[start:start+length]].append(start)
    return unique
This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).
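For illustration, a minimal sketch of that "lazy slice" idea (a hypothetical helper, not part of this answer's code): store the base string plus start and length instead of keeping a copied substring for every window. Hashing and comparing still have to look at the characters; only the per-window string copies are avoided, which is why the saving is small when the length is only 20.

class LazySlice(object):
    __slots__ = ("seq", "start", "length")

    def __init__(self, seq, start, length):
        self.seq = seq
        self.start = start
        self.length = length

    def _materialize(self):
        return self.seq[self.start:self.start + self.length]

    def __eq__(self, other):
        return self._materialize() == other._materialize()

    def __hash__(self):
        return hash(self._materialize())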
Printing unique slices:
a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}
Including locations:
a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
# (depending on which slices function used)
In a test script closer to your conditions, using randomly generated words of 5m characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or the word length (since I used completely random words, which reduces duplicates and increases memory use) so that it fit within main memory, it ran in under a minute. This situation plus the O(n**2) behaviour in your original code will take forever, and is why algorithmic time and space complexity are both important.
import operator
import random
import string

def slices(seq, length):
    unique = {}
    for start in xrange(len(seq) - length, -1, -1):
        unique[seq[start:start+length]] = start
    return unique

def sample_with_repeat(population, length, choice=random.choice):
    return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]
You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.
If you can provide more information about your input data, a specific solution might be available.
For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.
edit
I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.
adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.
I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.
If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.
If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.
I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.
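For reference, a minimal sketch of the 4-children-per-node trie described above (hypothetical code, assuming input restricted to the characters "acgt"); it reports the node count and the number of distinct 20-base sequences, the two figures quoted for the genome data.

CODE = {"a": 0, "c": 1, "g": 2, "t": 3}

def trie_stats(seq, length=20):
    root = [None] * 4
    nodes, distinct = 1, 0
    for start in range(len(seq) - length + 1):
        node, created = root, False
        for ch in seq[start:start + length]:
            i = CODE[ch]
            if node[i] is None:
                node[i] = [None] * 4
                nodes += 1
                created = True
            node = node[i]
        if created:  # the full path didn't exist before -> new distinct sequence
            distinct += 1
    return nodes, distinct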
Let me build off Roger Pate's answer. If memory is an issue, I'd suggest that instead of using the strings as the keys to the dictionary, you use a hashed value of the string. This would save the cost of storing an extra copy of the strings as keys (at worst, 20 times the storage of an individual "word").
import collections

def hashed_slices(seq, length, hasher=hash):  # default to the built-in hash
    unique = collections.defaultdict(list)
    for start in xrange(len(seq) - length + 1):
        unique[hasher(seq[start:start+length])].append(start)
    return unique
(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)
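For illustration, here is a sketch of a polynomial rolling hash that could replace the per-slice hashing (hypothetical code, not part of this answer; any large prime modulus works, and collisions remain possible, so candidates still have to be verified):

def rolling_hashes(seq, length, base=256, mod=(1 << 61) - 1):
    # Yields (start, hash) for every window of the given length, updating the
    # hash in O(1) per step instead of rehashing the whole window.
    if len(seq) < length:
        return
    h = 0
    for ch in seq[:length]:
        h = (h * base + ord(ch)) % mod
    yield 0, h
    top = pow(base, length - 1, mod)
    for start in range(1, len(seq) - length + 1):
        h = (h - ord(seq[start - 1]) * top) % mod
        h = (h * base + ord(seq[start + length - 1])) % mod
        yield start, h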
Now we can combine all the hashes:
unique = []  # Unique slices in first string
# create a dictionary of hash values -> word index -> start positions
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]

all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts):
    for h, starts in hashed.iteritems():
        # We only care about hashes that occur in the first word
        if h in hashed_starts[0]:
            all_hashed[h][i] = starts

# Now check all hashes
for starts_by_word in all_hashed.itervalues():
    if len(starts_by_word) == 1:
        # only the first word has this hash (it's the only word whose hashes we
        # kept), so every slice with it is obviously valid
        unique.extend(words[0][i:i+20] for i in starts_by_word[0])
    else:
        # we might have a hash collision
        candidates = {}
        for word_idx, starts in starts_by_word.iteritems():
            candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
        # Now that we have the candidate slices, find the unique ones
        valid = candidates[0]
        for word_idx, candidate_set in candidates.iteritems():
            if word_idx != 0:
                valid -= candidate_set
        unique.extend(valid)
(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)
Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory relative to Roger Pate's answer.
The code is still O(N^2), but the constant should be much lower.
Let's attempt to improve on Roger Pate's excellent answer.
Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.
Secondly, since we are likely to run out of memory before we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20-character sequences starting with one particular letter. For DNA, this cuts the requirements down by 75%.
import collections

seqlen = 20
letters = "acgt"                         # assuming DNA input
seqtrie = collections.defaultdict(list)  # sequence -> ids of the words containing it
maxlength = max(len(word) for word in words)
for startletter in letters:
    for letterid in range(maxlength - seqlen + 1):
        for wordid, word in enumerate(words):
            if letterid <= len(word) - seqlen:
                letter = word[letterid]
                if letter == startletter:
                    seq = word[letterid:letterid + seqlen]
                    if wordid not in seqtrie[seq]:
                        seqtrie[seq].append(wordid)
Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.