Search for keyword instead of whole word - Python

My hash code's get function only returns a match when the whole title of a word is typed.
I want it to show results when only a keyword (at least 2 characters onwards) is typed.
My hash code:
class hashin:
    def __init__(self):
        self.size = 217  # size of hash table
        self.map = [None] * self.size

    def _get_hash(self, key):
        hash = 0
        for char in str(key):
            hash += ord(char)  # sum the ASCII values of each char in str(key)
        return hash % self.size

    def add(self, key, value):  # add item to list
        key_hash = self._get_hash(key)
        key_value = [key, value]
        if self.map[key_hash] is None:
            self.map[key_hash] = list([key_value])
            return True
        else:
            for pair in self.map[key_hash]:
                if pair[0] == key:
                    pair[1] = value
                    return True
            self.map[key_hash].append(key_value)
            return True

    def get(self, key):  # search for item
        key_hash = self._get_hash(key)
        if self.map[key_hash] is not None:
            for pair in self.map[key_hash]:  # find pair of words
                if pair[0] == key:  # only matches the whole title of the word
                    return pair[0] + " - " + pair[1]
        return "Error no results for %s \nEnter the correct word." % (key)
Sample outputs: when the whole title is typed, the entry is found; when only a keyword is typed, nothing is returned (I need it to show the results even when just a keyword is typed).
What I need is this output when searching for "chea":
Cheater - Kygos
...and the other words with "chea" in their name.

A hash table isn't the right data structure for this task. The purpose of a hash value is to narrow the search to a small subset of the possibilities, and since the hash value depends on the entire string, using just a portion of the string will give the wrong subset.
A better data structure for this task is a trie (sometimes called a "prefix tree"). While it is not difficult to write this data structure on your own, there are many tested, ready-to-use modules available on PyPI.
See:
https://pypi.python.org/pypi?%3Aaction=search&term=trie&submit=search
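To illustrate the idea, here is a minimal hand-rolled sketch (the class and method names are my own, not from any particular PyPI package): add entries once, then walk down to the node matching the typed keyword and collect every complete title underneath it. Note that a trie matches prefixes specifically, which is exactly the "chea" → "Cheater" case above.

class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.entry = None   # (title, value) on nodes that end a full title

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, title, value):
        node = self.root
        for char in title.lower():
            node = node.children.setdefault(char, TrieNode())
        node.entry = (title, value)

    def get(self, keyword):
        # Walk down to the node matching the keyword...
        node = self.root
        for char in keyword.lower():
            if char not in node.children:
                return []
            node = node.children[char]
        # ...then collect every complete entry in its subtree.
        results, stack = [], [node]
        while stack:
            current = stack.pop()
            if current.entry is not None:
                results.append("%s - %s" % current.entry)
            stack.extend(current.children.values())
        return results

t = Trie()
t.add("Cheater", "Kygos")
print(t.get("chea"))  # ['Cheater - Kygos']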

Return smaller values in a BST inorder

I am trying to implement a method, "smaller", for a BST that returns the values in the tree which are smaller than a given item, in order.
from typing import Any, List, Optional

class BinarySearchTree:
    def __init__(self, root: Optional[Any]) -> None:
        if root is None:
            self._root = None
            self._left = None
            self._right = None
        else:
            self._root = root
            self._left = BinarySearchTree(None)
            self._right = BinarySearchTree(None)

    def is_empty(self) -> bool:
        return self._root is None

    def smaller(self, item: Any) -> List:
        if self.is_empty():
            return []
        else:
            return self._left.items() + [self._root] + self._right.items()
So far, the "smaller" method will return all of the values in the tree in order, but I'm not sure how to check if those values are smaller and than a given item, and to only return those in a list.
Let's write pseudocode for an in-order-tree-walk method, which prints the keys of a BST in sorted (in-order) order.
in-order-tree-walk(T, x)
    if (T != NULL)
        in-order-tree-walk(T.left, x)
        print T's key
        in-order-tree-walk(T.right, x)
The smaller method has exactly the same structure as in-order-tree-walk, except for an additional condition that makes it print only the keys that are smaller than x. Its pseudocode will look like:
smaller(T, x)
    if (T != NULL)
        smaller(T.left, x)
        if (T's key is less than x)
            print T's key
        smaller(T.right, x)
We're done; the smaller method is now complete. Now let's look at your actual implementation.
Your code returns all keys of the BST in sorted order because of the way you implemented it. The problem is in the following part:
def smaller(self, item: Any) -> List:
    if self.is_empty():
        return []
    else:
        return self._left.items() + [self._root] + self._right.items()
In return self._left.items() + [self._root] + self._right.items(), you never check whether self._root is less than item. You have to check that: the pseudocode puts a restraint on which keys get emitted, but your implementation never applies it. Since I'm not fluent in Python I can't complete this part, but based on the explanation above I think you can see what the problem with your code is.
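For completeness, here is one way the pseudocode might translate into the class from the question (a sketch, assuming the stored values support < and that _left/_right are real subtrees whenever the tree is non-empty, as the constructor guarantees):

def smaller(self, item: Any) -> List:
    if self.is_empty():
        return []
    result = self._left.smaller(item)
    if self._root < item:
        result.append(self._root)
        # By the BST property the right subtree only holds values greater
        # than the root, so it is worth visiting only when the root itself
        # is smaller than item.
        result.extend(self._right.smaller(item))
    return result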

Implementing Trie (or similar data structure) on spark

I'm working as an intern, and I've been tasked with implementing a fast search algorithm for phone numbers on our Spark cluster, using tries (prefix trees), and with performing operations such as inner joins on several such tries.
I managed to get it working for about 5 million numbers (2 tries, with 2.5 million numbers in each).
I've been tasked with scaling it up to 10-20 million, but if I try to go above that I get a java.lang.OutOfMemoryError.
Right now my approach is this (a sketch of the loop follows the list):
- create a DataFrame of phone numbers from the Spark database
- load 2.5 million numbers into memory (the JVM's memory) as a Python list, using collect()
- convert that list into a trie
- clear the list
- search for number_to_be_searched in the trie
- if found, return true
- else load the next 2.5 million numbers and repeat from step 3, and so on
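A minimal sketch of that loop as I understand it (my illustration, not the original code: the table and column names are hypothetical, and it uses DataFrame.toLocalIterator() to stream batches to the driver instead of repeated collect() calls):

from itertools import islice

def search_number(spark, target, batch_size=2_500_000):
    # Hypothetical table/column names; adjust to the real schema.
    df = spark.table("phone_numbers").select("phone_number")
    numbers = (row.phone_number for row in df.toLocalIterator())
    while True:
        batch = list(islice(numbers, batch_size))  # one batch in driver memory
        if not batch:
            return False        # every batch searched, number not found
        trie = Trie()           # convert the batch into a trie (class below)
        for number in batch:
            trie.insert(number)
        del batch               # clear the list
        if trie.search(target):
            return True

My code: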
from collections import defaultdict

class Trie:
    # Implement a trie with insert, search.
    def __init__(self):
        self.root = defaultdict()

    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")

    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False
# these are the inner join and merge functions
def ijoin_util(root1, root2, str):
    for k in root1:
        if k == '_end':
            ijoin_util.join.append(str)
            return
        found = root2.get(k)
        if found != None:
            ijoin_util(root1[k], found, str + k)

def inner_join(root1, root2):
    str = ""
    ijoin_util.join = []
    ijoin_util(root1.root, root2.root, str)
    return ijoin_util.join

def merge_util(root1, root2):
    for k in root1:
        found = root2.get(k)
        if found != None:
            merge_util(root1[k], found)
        else:
            root2.update({k: root1[k]})
    return root2

def merge(root1, root2):
    merge_util(root1.root, root2.root)
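For reference, this is roughly how the join utilities get used (a small illustration I added, not from the original post):

t1, t2 = Trie(), Trie()
for n in ["5551234", "5559876"]:
    t1.insert(n)
for n in ["5551234", "5550000"]:
    t2.insert(n)
print(inner_join(t1, t2))  # ['5551234'] -- the only number in both tries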
I know this is a really bad implementation for the problem, and I want to know whether I can implement it in a way where I don't have to store the trie in driver memory (for example, storing it as an RDD of nested maps), or any other approach that might help me scale it further.

How to find two items of a list with the same return value of a function on their attribute?

Given a basic class Item:
class Item(object):
    def __init__(self, val):
        self.val = val
a list of objects of this class (the number of items can be much larger):
items = [ Item(0), Item(11), Item(25), Item(16), Item(31) ]
and a function compute that processes a value and returns a result.
How do I find two items of this list for which the function compute returns the same value when applied to the attribute val? If nothing is found, an exception should be raised. If more than two items match, simply return any two of them.
For example, let's define compute:
def compute(x):
    return x % 10
The expected pair would be: (Item(11), Item(31)).
You can check the length of the set of resulting values:
class Item(object):
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return f'Item({self.val})'

def compute(x):
    return x % 10

items = [Item(0), Item(11), Item(25), Item(16), Item(31)]
c = list(map(lambda x: compute(x.val), items))
if len(set(c)) == len(c):  # no two or more equal values exist in the list
    raise Exception("All elements have unique computational results")
To find values with similar computational results, a dictionary can be used:
from collections import Counter

new_d = {i: compute(i.val) for i in items}
d = Counter(new_d.values())
multiple = [a for a, b in new_d.items() if d[b] > 1]
Output:
[Item(11), Item(31)]
A slightly more efficient way to find whether multiple objects share a computational value is to short-circuit over the Counter's values with all, requiring a single pass, whereas using a set with len requires several iterations:
if all(b == 1 for b in d.values()):
    raise Exception("All elements have unique computational results")
Assuming the values returned by compute are hashable (e.g., float values), you can use a dict to store results.
And you don't need to do anything fancy, like a multidict storing all items that produce a result. As soon as you see a duplicate, you're done. Besides being simpler, this also means we short-circuit the search as soon as we find a match, without even calling compute on the rest of the elements.
def find_pair(items, compute):
    results = {}
    for item in items:
        result = compute(item.val)
        if result in results:
            return results[result], item
        results[result] = item
    raise ValueError('No pair of items')
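For example, with the items and compute from the question (and the __repr__ from the earlier answer to make the output readable):

pair = find_pair(items, compute)
print(pair)  # (Item(11), Item(31)) -- both compute to 1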
A dictionary val_to_it that contains Items keyed by computed val can be used:
val_to_it = {}
for it in items:
    computed_val = compute(it.val)
    # Check if an Item in val_to_it has the same computed val
    dict_it = val_to_it.get(computed_val)
    if dict_it is None:
        # If not, add it to val_to_it so it can be referred to later
        val_to_it[computed_val] = it
    else:
        # We found the two elements!
        res = [dict_it, it]
        break
else:
    raise Exception("Can't find two items")
The for block can be rewritten to handle n matching elements:
for it in items:
    computed_val = compute(it.val)
    dict_lit = val_to_it.get(computed_val)
    if dict_lit is None:
        val_to_it[computed_val] = [it]
    else:
        dict_lit.append(it)
        # Check if we have the expected number of elements
        if len(dict_lit) == n:
            # Found n elements!
            res = dict_lit
            break

My Lempel-Ziv implementation makes the encoding longer

I can't work out why my implementation is creating a longer string than the input.
It is implemented according to the description in this document and only this description.
It is designed to act on binary strings only. If anyone can shed some light on why this creates a longer string than it started with, I'd be very grateful!
Main Encoding
def LZ_encode(uncompressed):
    m = uncompressed
    dictionary = dict_gen(m)
    list = [int(bin(i)[2:]) for i in range(1, len(dictionary))]
    pointer_bit = []
    for k in list:
        pointer_bit = pointer_bit + [(str(chopped_lookup(k, dictionary)), dictionary[k][-1])]
    new_pointer_bit = pointer_length_correct(pointer_bit)
    list_output = [i for sub in new_pointer_bit for i in sub]
    if list_output[-1] == '$':
        output = ''.join(list_output[:-1])
    else:
        output = ''.join(list_output)
    return output
Component Functions
def dict_gen(m):  # Generates dictionary
    dictionary = {0: ""}
    j = 1
    w = ""
    iterator = 0
    l = len(m)
    for c in m:
        iterator += 1
        wc = str(str(w) + str(c))
        if wc in dictionary.values():
            w = wc
            if iterator == l:
                dictionary.update({int(bin(j)[2:]): wc + '$'})
        else:
            dictionary.update({int(bin(j)[2:]): wc})
            w = ""
            j += 1
    return dictionary
def chopped_lookup(k, dictionary):  # Returns entry number of shortened source string
    cut_source_string = dictionary[k][:-1]
    for key, value in dictionary.items():  # .iteritems() in the original is Python 2 only
        if value == cut_source_string:
            return key
from math import ceil, log

def pointer_length_correct(lst):  # Takes the (pointer, bit) list and corrects the length of the pointer
    new_pointer_bit = []
    for pair in lst:
        n = lst.index(pair)
        if len(str(pair[0])) > ceil(log(n + 1, 2)):
            while len(str(pair[0])) != ceil(log(n + 1, 2)):
                pair = (str(pair[0])[1:], pair[1])
        if len(str(pair[0])) < ceil(log(n + 1, 2)):
            while len(str(pair[0])) != ceil(log(n + 1, 2)):
                pair = (str('0' + str(pair[0])), pair[1])
        new_pointer_bit = new_pointer_bit + [pair]
    return new_pointer_bit
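For what it's worth, hand-tracing the code on a tiny input already shows the growth (I traced this by hand from the code above, so treat the exact output as illustrative):

print(LZ_encode("01"))  # '001' -- three output bits for a two-bit input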

Alter the hash function of a dictionary

Following this question, we know that two different dictionaries, dict_1 and dict_2 for example, use the exact same hash function.
Is there any way to alter the hash function used by a dictionary? Negative answers are also accepted!
You can't change the hash function: the dict will call hash() on the keys it's supposed to insert, and that's that.
However, you can wrap the keys to provide different __hash__ and __eq__ methods.
class MyHash(object):
    def __init__(self, v):
        self._v = v

    def __hash__(self):
        return hash(self._v) * -1

    def __eq__(self, other):
        return self._v == other._v
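For instance, a dict keyed with the wrapped values still behaves like a normal dict; lookups just go through the custom __hash__ and __eq__ (a minimal sketch):

d = {MyHash("spam"): 1, MyHash("eggs"): 2}
print(d[MyHash("spam")])  # 1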
I doubt whether this actually helps with your original problem/question, though; it seems more likely that a custom array/list-based data structure is the answer. Or not.
Here is a "hash table" on top of a list of lists, where each hash table object is associated with a particular hashing function.
class HashTable(object):
    def __init__(self, hash_function, size=256):
        self.hash_function = hash_function
        self.buckets = [list() for i in range(size)]
        self.size = size

    def __getitem__(self, key):
        hash_value = self.hash_function(key) % self.size
        bucket = self.buckets[hash_value]
        for stored_key, stored_value in bucket:
            if stored_key == key:
                return stored_value
        raise KeyError(key)

    def __setitem__(self, key, value):
        hash_value = self.hash_function(key) % self.size
        bucket = self.buckets[hash_value]
        i = 0
        found = False
        for stored_key, stored_value in bucket:
            if stored_key == key:
                found = True
                break
            i += 1
        if found:
            bucket[i] = (key, value)
        else:
            bucket.append((key, value))
The rest of your application can still see the underlying list of buckets. Your application might require additional metadata to be associated with each bucket, but that would be as simple as defining a new class for the elements of the bucket list instead of a plain list.
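For example, plugging in a deliberately crude hash function (the function below is my own toy example, not part of the answer):

def first_char_hash(key):
    # Toy hash: bucket by the first character only, so "apple"
    # and "avocado" are guaranteed to collide.
    return ord(str(key)[0])

table = HashTable(first_char_hash)
table["apple"] = 1
table["avocado"] = 2  # lands in the same bucket as "apple"
print(table["apple"], table["avocado"])  # 1 2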
I think what you want is a way to create buckets. Based on this I recommend collections.defaultdict with a set initializer as the "bucket" (depends on what you're using it for though).
Here is a sample:
#!/usr/bin/env python
from collections import defaultdict
from itertools import combinations

d = defaultdict(set)
strs = ["str", "abc", "rts"]
for s in strs:
    d[hash(s)].add(s)
    d[hash(''.join(reversed(s)))].add(s)
for combination in combinations(d.values(), r=2):
    matches = combination[0] & combination[1]
    if len(matches) > 1:
        print(matches)
# output: {'str', 'rts'}
Two strings ending up in the same buckets here are very likely the same. I've created a hash collision by using the reverse function, storing each string under both its own hash and the hash of its reverse.
Note that the set will use full comparison but should do it very fast.
Don't hash too many values without draining the sets.
