Implementing Trie (or similar data structure) on spark - python

I'm working as an intern, and I've been tasked with implementing a fast search algorithm for phone numbers on the company's Spark cluster, using tries (prefix trees), and with performing operations such as inner joins on several such tries.
I managed to get it working for about 5 million numbers (2 tries, with 2.5 million numbers in each).
I've been asked to scale it up to 10-20 million, though if I try to go above that I get a java.lang.OutOfMemoryError.
Right now my approach is this. My code:
- creates a DataFrame of phone numbers from the Spark database,
- loads 2.5 million numbers into memory (the driver JVM's memory) as a Python list, using collect(),
- converts that list into a trie,
- clears the list,
- searches for number_to_be_searched in the trie,
- returns true if found,
- otherwise loads the next 2.5 million numbers and repeats from step 3, and so on
from collections import defaultdict

class Trie:
    # Implement a trie with insert, search.
    def __init__(self):
        # defaultdict() with no factory behaves like a plain dict here
        self.root = defaultdict()

    def insert(self, word):
        current = self.root
        for letter in word:
            current = current.setdefault(letter, {})
        current.setdefault("_end")  # marks the end of a complete number

    def search(self, word):
        current = self.root
        for letter in word:
            if letter not in current:
                return False
            current = current[letter]
        if "_end" in current:
            return True
        return False
# These are the inner join and merge functions.
def ijoin_util(root1, root2, prefix):
    # Walk both tries in lockstep, collecting numbers present in both.
    for k in root1:
        if k == '_end':
            ijoin_util.join.append(prefix)
            continue  # keep scanning any sibling branches
        found = root2.get(k)
        if found is not None:
            ijoin_util(root1[k], found, prefix + k)

def inner_join(root1, root2):
    ijoin_util.join = []  # results accumulate on a function attribute
    ijoin_util(root1.root, root2.root, "")
    return ijoin_util.join

def merge_util(root1, root2):
    # Graft every branch of root1 that is missing from root2 onto root2.
    for k in root1:
        found = root2.get(k)
        if found is not None:
            merge_util(root1[k], found)
        else:
            root2.update({k: root1[k]})
    return root2

def merge(root1, root2):
    merge_util(root1.root, root2.root)
I know this is a really bad implementation of the problem, and I want to know whether I can implement this in a way where I don't have to hold the whole trie in memory (for instance, storing it as an RDD of nested maps), or any other approach that might help me scale it further.
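For example, would it make more sense to skip the driver-side trie entirely and let Spark do the distributed inner join itself? A minimal sketch of what I mean (the table names and the number column are placeholders, not my real schema):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phone-join").getOrCreate()

# Placeholder tables; each holds one phone number per row.
df1 = spark.table("phone_numbers_a").select("number")
df2 = spark.table("phone_numbers_b").select("number")

# The join executes across the cluster; nothing is collect()ed
# to the driver, so driver memory no longer limits the scale.
common = df1.join(df2, on="number", how="inner")
common.write.parquet("/tmp/common_numbers")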

Related

Can I treat a file as a list in python?

This is kind of a question, but it's also kind of me just hoping I don't have to write a bunch of code to get behavior I want. (Plus, if it already exists, it probably runs faster than what I would write anyway.) I have a number of large lists of numbers that cannot fit into memory -- at least not all at the same time. Which is fine, because I only need a small portion of each list at a time, and I know how to save the lists into files and read out the part of the list I need. The problem is that my method of doing this is somewhat inefficient, as it involves iterating through the file for the part I want. So, I was wondering if there happened to be some library or something out there that I'm not finding that allows me to index a file as though it were a list, using the [] notation I'm familiar with. Since I'm writing the files myself, I can make their formatting whatever I need, but currently my files contain nothing but the elements of the list, with \n as a delimiter between values.
Just to recap what I'm looking for, and to make it more specific:
- I want to use list indexing notation (including slicing into sub-lists and negative indexing) to access the contents of a list written in a file
- An accessed sub-list (e.g. f[1:3]) should be returned as a Python list object in memory
- I would like to be able to assign to indices of the file (e.g. f[i] = x should write the value x to the file f in the location corresponding to index i)
To be honest, I don't expect this to exist, but you never know when you've missed something in your research. So, I figured I'd ask. On a side note, if this doesn't exist, is it possible to overload the [] operator in Python?
If your data is purely numeric you could consider using numpy arrays, and storing the data in npy format. Once stored in this format, you could load the memory-mapped file as:
>>> X = np.load("some-file.npy", mmap_mode="r")
>>> X[1000:1003]
memmap([4, 5, 6])
This access reads directly from disk, without loading the data that precedes the requested slice.
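For your third requirement (f[i] = x), a sketch under the same approach: opening the same file with mmap_mode="r+" makes the memmap writable, assuming in-place modification of the file is acceptable.

import numpy as np

# One-time conversion of a plain list of numbers to npy format.
np.save("some-file.npy", np.array([4, 5, 6, 7, 8]))

X = np.load("some-file.npy", mmap_mode="r+")  # writable memory map
X[2] = 60   # f[i] = x semantics, backed by the file
X.flush()   # push the change out to disk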
You can actually do this by writing a simple class, I think:
class FileWrapper:
    def __init__(self, path, **kwargs):
        self._file = open(path, 'r+', **kwargs)

    def _do_single(self, where, s=None):
        # Read or write a single character at an absolute position.
        if where >= 0:
            self._seek(where)
        else:
            # Negative index: seek relative to the end of the file.
            # (Note: nonzero end-relative seeks need binary mode in Python 3.)
            self._seek(where, 2)
        if s is None:
            return self._read(1)
        else:
            return self._write(s)

    def _do_slice_contiguous(self, start, end, s=None):
        # Read or write a step-1 slice with a single seek.
        if start is None:
            start = 0
        if end is None:
            end = -1
        self._seek(start)
        if s is None:
            return self._read(end - start)
        else:
            return self._write(s)

    def _do_slice(self, where, s=None):
        # Read or write a stepped slice one character at a time.
        if s is None:
            result = []
            for index in where:
                self._seek(index)
                result.append(self._read(1))
            return result
        else:
            for index, char in zip(where, s):
                self._seek(index)
                self._write(char)
            return len(s)

    def __getitem__(self, key):
        if isinstance(key, int):
            return self._do_single(key)
        elif isinstance(key, slice):
            if self._is_contiguous(key):
                return self._do_slice_contiguous(key.start, key.stop)
            else:
                return self._do_slice(self._process_slice(key))
        else:
            raise ValueError('File indices must be ints or slices.')

    def __setitem__(self, key, value):
        if isinstance(key, int):
            return self._do_single(key, value)
        elif isinstance(key, slice):
            if self._is_contiguous(key):
                return self._do_slice_contiguous(key.start, key.stop, value)
            else:
                where = self._process_slice(key)
                if len(where) == len(value):
                    return self._do_slice(where, value)
                else:
                    raise ValueError('Length of slice not equal to length of string to be written.')

    def __del__(self):
        self._file.close()

    def _is_contiguous(self, key):
        return key.step is None or key.step == 1

    def _process_slice(self, key):
        return range(key.start, key.stop, key.step)

    def _read(self, size):
        return self._file.read(size)

    def _seek(self, offset, whence=0):
        return self._file.seek(offset, whence)

    def _write(self, s):
        return self._file.write(s)
I'm sure many optimisations could be made, since I rushed through this, but it was fun to write.
This does not answer the question in full, because it supports random access to characters, as opposed to lines, which sit at a higher level of abstraction and are more complicated to handle (since they can be variable length).
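If lines are what you need, a sketch of one way to extend the idea (the LineIndex class is made up here for illustration): scan the file once to record each line's starting byte offset, then seek straight to line i.

class LineIndex:
    # Sketch: random access to newline-delimited values by line number.
    def __init__(self, path):
        self._path = path
        self._offsets = [0]
        with open(path, 'rb') as f:
            for line in f:
                self._offsets.append(self._offsets[-1] + len(line))
        self._offsets.pop()  # drop the offset past the final line

    def __getitem__(self, i):
        # Negative indices work because _offsets is a plain list.
        with open(self._path, 'rb') as f:
            f.seek(self._offsets[i])
            return f.readline().rstrip(b'\n').decode()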

Python binary search recursive if possible

class SortedList:
    theList = []

    def add(self, number):
        self.theList.append(number)
        return self.theList

    def remove(self, number):
        self.theList.remove(number)
        return self.theList

    def printList(self):
        return print(self.theList)

    def binarSearch(self, number):
        middle = (len(self.theList)//2)
        end = len(self.theList)
        if end != 0:
            if int(self.theList[middle]) == int(number):
                return print("The number is found in the list at place", middle+1)
            elif int(self.theList[middle]) < int(number):
                self.theList = self.theList[middle:]
                return self.binarSearch(number)
            elif int(self.theList[middle]) > int(number):
                self.theList = self.theList[:middle]
                return self.binarSearch(number)
        else:
            return print("The list is empty")

sorted = SortedList()  # create a SortedList object
sorted.add("1")
sorted.add("2")
sorted.add("3")
sorted.add("4")
sorted.add("5")
sorted.add("6")
sorted.printList()
sorted.binarSearch(3)
I cannot use additional parameters; I must use only self and number. I want to make it recursive, but if that is too hard you can answer with a normal version.
This code works fine until the number 4. When I search for 4 it says it is in place 2, and it keeps saying two for the numbers after 4 as well. I have tried adding other numbers, but it is the same.
Python already has a great module, bisect, which performs a binary search on sorted lists:
import bisect
l = [2,3,1,5,6,7,9,8,4]
l.sort()  # bisect assumes the list is sorted
print(bisect.bisect(l, 4))  # Output: 4
Familiarize yourself with this library:
https://docs.python.org/3.5/library/bisect.html
Just a hint: you can use additional parameters if you give them default values. One caveat: defaults are evaluated once, at definition time, where self is not available, so use a sentinel. Your method signature would look like this:
def binarSearch(self, number, start=0, end=None):
with end = len(self.theList) if end is None else end as the first line of the body. It could then still be called like sorted.binarSearch(5) but would internally be able to pass the state correctly.
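To make the hint concrete, here is a minimal sketch of the recursive search using that sentinel-default pattern, as a drop-in replacement for binarSearch in the SortedList class above. It tracks start and end indices instead of slicing self.theList, which is what corrupts the reported position in the original:

    def binarSearch(self, number, start=0, end=None):
        if end is None:
            end = len(self.theList)
        if start >= end:
            return print("The number is not in the list")
        middle = (start + end) // 2
        if int(self.theList[middle]) == int(number):
            return print("The number is found in the list at place", middle + 1)
        elif int(self.theList[middle]) < int(number):
            return self.binarSearch(number, middle + 1, end)
        else:
            return self.binarSearch(number, start, middle)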

Find Compound Words in List of Words using Trie

Given a list of words, I am trying to figure out how to find words in that list that are made up of other words in the list. For example, if the list were ["race", "racecar", "car"], I would want to return ["racecar"].
Here is my general thought process. I understand that using a trie would be good for this sort of problem. For each word, I can find all of its prefixes (that are also words in the list) using the trie. Then for each prefix, I can check to see if the word's suffix is made up of one or more words in the trie. However, I am having a hard time implementing this. I have been able to implement the trie and the function to get all prefixes of a word. I am just stuck on implementing the compound-word detection.
You could represent Trie nodes as defaultdict objects extended to contain a boolean flag marking whether the prefix is a word. Then you could do two-pass processing: on the first pass you add all the words to the Trie, and on the second pass you check for each word whether it is a combination or not:
from collections import defaultdict

class Node(defaultdict):
    def __init__(self):
        super().__init__(Node)
        self.terminal = False

class Trie():
    def __init__(self, it):
        self.root = Node()
        for word in it:
            self.add_word(word)

    def __contains__(self, word):
        node = self.root
        for c in word:
            node = node.get(c)
            if node is None:
                return False
        return node.terminal

    def add_word(self, word):
        node = self.root
        for c in word:
            node = node[c]
        node.terminal = True

    def is_combination(self, word):
        node = self.root
        for i, c in enumerate(word):
            node = node.get(c)
            if not node:
                break
            # If prefix is a word check if suffix can be found
            if node.terminal and word[i+1:] in self:
                return True
        return False

lst = ["race", "racecar", "car"]
t = Trie(lst)
print([w for w in lst if t.is_combination(w)])
Output:
['racecar']
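Note that is_combination only detects words that split into exactly two parts. A sketch of a recursive variant that also allows the suffix to itself be a compound (is_compound is a name made up here; t and lst are the objects from above):

def is_compound(trie, word):
    # Sketch: True if word splits into two or more trie words.
    node = trie.root
    for i, c in enumerate(word):
        node = node.get(c)
        if node is None:
            return False
        rest = word[i+1:]
        if node.terminal and rest and (rest in trie or is_compound(trie, rest)):
            return True
    return False

print([w for w in lst if is_compound(t, w)])  # ['racecar']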

Alter the hash function of a dictionary

Following this question, we know that two different dictionaries, dict_1 and dict_2 for example, use the exact same hash function.
Is there any way to alter the hash function used by the dictionary? Negative answers also accepted!
You can't change the hash function - the dict will call hash on the keys it's supposed to insert, and that's that.
However, you can wrap the keys to provide different __hash__ and __eq__ methods.
class MyHash(object):
    def __init__(self, v):
        self._v = v

    def __hash__(self):
        return hash(self._v) * -1

    def __eq__(self, other):
        return self._v == other._v
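A quick usage sketch: the dict then stores the wrappers, so lookups have to go through MyHash too.

d = {MyHash("spam"): 1, MyHash("eggs"): 2}
print(d[MyHash("spam")])  # 1 -- bucketed by the negated hash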
Whether this actually helps with your original problem/question, I doubt, though; it rather seems a custom array/list-based data structure might be the answer. Or not.
Here is a "hash table" on top of a list of lists, where each hash table object is associated with a particular hashing function.
class HashTable(object):
    def __init__(self, hash_function, size=256):
        self.hash_function = hash_function
        self.buckets = [list() for i in range(size)]
        self.size = size

    def __getitem__(self, key):
        hash_value = self.hash_function(key) % self.size
        bucket = self.buckets[hash_value]
        for stored_key, stored_value in bucket:
            if stored_key == key:
                return stored_value
        raise KeyError(key)

    def __setitem__(self, key, value):
        hash_value = self.hash_function(key) % self.size
        bucket = self.buckets[hash_value]
        i = 0
        found = False
        for stored_key, stored_value in bucket:
            if stored_key == key:
                found = True
                break
            i += 1
        if found:
            bucket[i] = (key, value)
        else:
            bucket.append((key, value))
The rest of your application can still see the underlying list of buckets. Your application might require additional metadata to be associated with each bucket, but that would be as simple as defining a new class for the elements of the bucket list instead of a plain list.
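For instance, a short usage sketch with an arbitrary custom hash function (len is just an example choice, not anything the question prescribes):

table = HashTable(hash_function=len)  # bucket strings by their length
table["spam"] = 1
table["ham"] = 2
print(table["spam"])  # 1
print(table["ham"])   # 2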
I think what you want is a way to create buckets. Based on this I recommend collections.defaultdict with a set initializer as the "bucket" (depends on what you're using it for though).
Here is a sample:
#!/usr/bin/env python
from collections import defaultdict
from itertools import combinations

d = defaultdict(set)
strs = ["str", "abc", "rts"]

# File each string under two hashes: its own and its reverse's.
for s in strs:
    d[hash(s)].add(s)
    d[hash(''.join(reversed(s)))].add(s)

# Any pair of buckets sharing more than one string signals a collision.
for combination in combinations(d.values(), r=2):
    matches = combination[0] & combination[1]
    if len(matches) > 1:
        print(matches)

# output: {'str', 'rts'}
Two strings ending up in the same buckets here are very likely the same. I've created a hash collision by using the reverse function and using a string and its reverse as values.
Note that the set will use full comparison but should do it very fast.
Don't hash too many values without draining the sets.

Infix to prefix conversion in Python

I wanted to make an infix-to-prefix converter. When I run the code, all of the operators in the string end up at the beginning of the returned string.
How can I fix the code below?
class Stack:
    def __init__(self):
        self.a = []

    def isEmpty(self):
        return self.a == []

    def push(self, i):
        self.a.append(i)

    def pop(self):
        return self.a.pop()

    def peek(self):
        return self.a[len(self.a)-1]

def infixToPrefix(s):
    prec = {'/':3,'*':3,'+':2,'-':2,'^':4,'(':1}
    opStack = Stack()
    prefixList = []
    temp = []
    for token in s:
        if token in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" or token in "0123456789":
            prefixList.append(token)
        elif token == '(':
            opStack.push(token)
        elif token == ')':
            topToken = opStack.pop()
            while topToken != '(':
                temp.append(topToken)
                topToken = opStack.pop()
            prefixList = temp + prefixList
            temp = []
        else:
            while (not opStack.isEmpty()) and \
                  (prec[opStack.peek()] >= prec[token]):
                temp.append(opStack.pop())
            prefixList = temp + prefixList
            temp = []
            opStack.push(token)
    while not opStack.isEmpty():
        temp.append(opStack.pop())
    prefixList = temp + prefixList
    return ''.join(prefixList)

print(infixToPrefix("(A+B)*C-(D-E)*(F+G)"))
Don't reinvent the wheel. Use a parser generator instead. For example, PLY (Python lex-yacc) is a good option. You can start by looking at a basic example and either do the conversion within the production rules themselves, or produce an abstract syntax tree equipped with flattening methods that return prefix, infix, or postfix notation. Note that the difference between these three is whether the operator is inserted in pre-, in between, or post-order during a depth-first traversal of the syntax tree (implemented either as a single function or recursively -- the latter leads to simpler and more modular code).
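To illustrate the traversal point with a sketch (plain classes, not PLY output): a tiny binary-operator AST node whose three flattenings differ only in where the operator is emitted.

class Leaf:
    def __init__(self, name):
        self.name = name
    def prefix(self): return self.name
    def infix(self): return self.name
    def postfix(self): return self.name

class BinOp:
    # Sketch of an AST node; a real parser would build these.
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def prefix(self):   # operator first (pre-order)
        return self.op + self.left.prefix() + self.right.prefix()

    def infix(self):    # operator in between (in-order)
        return '(' + self.left.infix() + self.op + self.right.infix() + ')'

    def postfix(self):  # operator last (post-order)
        return self.left.postfix() + self.right.postfix() + self.op

# (A+B)*C
tree = BinOp('*', BinOp('+', Leaf('A'), Leaf('B')), Leaf('C'))
print(tree.prefix())   # *+ABC
print(tree.postfix())  # AB+C*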
It might be late to post this answer, but I'm leaving it here as a reference for anyone else. It seems, OP, that you had already solved the conversion from infix to postfix. If that's the case, you can reuse that same algorithm and code to convert your text to prefix notation.
All you need to do is reverse your text first, and then pass it through your algorithm. Once you reverse your text, it will also be stored in your Stack already reversed. After you've processed it, re-reverse the result to its original orientation and you'll have your prefix notation.
Be sure to keep track of what you compare in your dictionary, though; you'll no longer be comparing your operators with the "(".
Hope this helps.
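A minimal sketch of that reversing trick, assuming infixToPostfix is the OP's working converter mentioned above (a hypothetical name here), and that parentheses are swapped during the reversal:

def infixToPrefixViaReverse(s):
    # Reverse the input, swapping the parentheses as we go.
    swap = {'(': ')', ')': '('}
    reversed_input = ''.join(swap.get(c, c) for c in reversed(s))
    # Run the existing infix-to-postfix conversion, then reverse back.
    return infixToPostfix(reversed_input)[::-1]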
