Python Parser Recursion Infinite Reference

I'm new to parsers and I have a problem with mine, specifically when it calls itself recursively to analyze the body of a function.
When it finds a nested function, the result becomes a mess.
Basically, when analyzing this code:
fn test (a, b, c):
    fn test2 (c, b, a):
        print("Hello world")
    end
end
the object starts pointing to itself instead of to the subfunction:
>>> print(ast[0].value.body[9])
<ast.VariableAST object at 0x7f6285540710>
>>> print(ast[0].value.body[9].value.body[9])
<ast.VariableAST object at 0x7f6285540710>
This is the main parser code:
# Parser Loop
def Parser(tokenList):
    global tokens
    global size
    tokens = tokenList
    size = len(tokens)
    ast = []
    while i < size:
        ast.append(MainParser())
    return ast

# The Main Parser
def MainParser():
    global i
    if tokens[i] == 'fn':
        i += 1
        node = FunctionParser()
    else:
        node = tokens[i]
        i += 1
    return node

# Parse a function
def FunctionParser():
    global i
    checkEnd("function")
    if tokens[i][0].isalpha():
        node = VariableAST()
        node.name = tokens[i]
        i += 1
        node.value = FunctionBodyParser()
    elif tokens[i] == '(':
        node = FunctionBodyParser()
    else:
        syntaxError("Expecting '(' or function name")
    return node

# Parse a function body
def FunctionBodyParser():
    global i
    i += 1
    node = FunctionAST()
    while True:
        checkEnd("function")
        if tokens[i][0].isalpha():
            node.args.append(tokens[i])
            i += 1
        elif tokens[i] == ',':
            i += 1
        elif tokens[i] == ')':
            break
        else:
            syntaxError("Expecting ')' or ',' in function declaration")
    i += 1
    checkEnd("function")
    if tokens[i] != ':' and tokens[i] != '{':
        syntaxError("Expecting '{' or ':'")
    begin = tokens[i]
    while True:
        checkEnd("function")
        if begin == ':' and tokens[i] == 'end':
            break
        elif begin == '{' and tokens[i] == '}':
            break
        else:
            node.body.append(MainParser())
    i += 1
    return node
Edit: I forgot to mention that this is a prototype for a C version. I'm avoiding object-oriented features and some good Python practices to make it easier to port the code to C later.

There are a lot of parsers implemented in Python (https://wiki.python.org/moin/LanguageParsing), such as pyPEG, that let you describe the language you're parsing instead of writing the parsing code yourself, which is more readable and less error-prone.
Also, using global is typically a source of problems: you can't control the scope of the variable (there is only one), which reduces the reusability of your functions.
It's probably better to use a class, which is almost the same thing, but you can have multiple instances of the same class running at the same time without variable collisions:
class Parser:
    def __init__(self, tokenList):
        self.tokens = tokenList
        self.size = len(tokenList)
        self.ast = []
        self.position = 0

    def parse(self):
        while self.position < self.size:
            self.ast.append(self.main())
        return self.ast

    def main(self):
        if self.tokens[self.position] == 'fn':
            self.position += 1
            node = self.function()
        else:
            node = self.tokens[self.position]
            self.position += 1
        return node

    # And so on...
From this point you can deduplicate self.position += 1 in main by reading the current token before advancing:
    def main(self):
        token = self.tokens[self.position]
        self.position += 1
        if token == 'fn':
            node = self.function()
        else:
            node = token
        return node
Then remove the useless "node" variable:
    def main(self):
        token = self.tokens[self.position]
        self.position += 1
        if token == 'fn':
            return self.function()
        else:
            return token
But the real way to do this is to use a parser library. Take a look at pyPEG; it's a nice one, but others are nice too.
One last point: avoid useless comments like:
# Parse a function
def FunctionParser():
We already know that "FunctionParser" parses a function, so the comment adds no information. What matters most is choosing your function names wisely (PEP 8 recommends lowercase function names, with words separated by underscores). If you want to add meta-information about the function, put it in a string as the first statement of the function (a docstring), like:
def function_parser():
    "Delegates the body parsing to function_body_parser"

I found the solution.
Actually it was a silly Python programming error, and the parser above works fine.
In the AST module I was creating struct-like classes to make it easier to port the language to C, but I was doing it wrong. For example:
class VariableAST(NodeAST):
    name = None
    value = None
What I didn't know about Python is that names defined inside the class body are not per-object attributes but class attributes (like static variables), shared by every instance. That made my program behave unpredictably: a change made to such a shared attribute through one object was visible through all the others, and that is where my infinite recursion of objects came from.
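The pitfall described above can be reproduced in a few lines. This is only a sketch: `SharedNode` and `OwnNode` are hypothetical classes, not the asker's actual AST classes, and the list-valued `body` attribute is an assumption based on the `node.body.append(...)` calls in the parser.

```python
class SharedNode:
    # Class attribute: a single list object shared by every instance.
    body = []


class OwnNode:
    def __init__(self):
        # Instance attribute: each object gets its own fresh list.
        self.body = []


outer, inner = SharedNode(), SharedNode()
outer.body.append(inner)          # mutates the one shared list
print(inner.body[0] is inner)     # True: inner now appears to contain itself

outer2, inner2 = OwnNode(), OwnNode()
outer2.body.append(inner2)
print(inner2.body)                # []
```

Moving the attribute initialisation into `__init__` is the usual way to give each node its own state.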

Related

Efficient partial search of a trie in python

This is a hackerrank exercise, and although the problem itself is solved, my solution is apparently not efficient enough, so on most test cases I'm getting timeouts. Here's the problem:
We're going to make our own Contacts application! The application must perform two types of operations:
add name, where name is a string denoting a contact name. This must store as a new contact in the application.
find partial, where partial is a string denoting a partial name to search the application for. It must count the number of contacts starting with partial and print the count on a new line.
Given n sequential add and find operations, perform each operation in order.
I'm using Tries to make it work, here's the code:
import re

def add_contact(dictionary, contact):
    _end = '_end_'
    current_dict = dictionary
    for letter in contact:
        current_dict = current_dict.setdefault(letter, {})
    current_dict[_end] = _end
    return(dictionary)

def find_contact(dictionary, contact):
    p = re.compile('_end_')
    current_dict = dictionary
    for letter in contact:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return(0)
    count = int(len(p.findall(str(current_dict))) / 2)
    re.purge()
    return(count)

n = int(input().strip())
contacts = {}
for a0 in range(n):
    op, contact = input().strip().split(' ')
    if op == "add":
        contacts = add_contact(contacts, contact)
    if op == "find":
        print(find_contact(contacts, contact))
Because the problem requires not returning whether partial is a match or not, but instead counting all of the entries that match it, I couldn't find any other way but cast the nested dictionaries to a string and then count all of the _end_s, which I'm using to denote stored strings. This, it would seem, is the culprit, but I cannot find any better way to do the searching. How do I make this work faster? Thanks in advance.
UPD:
I have added a results counter that actually parses the tree, but the code is still too slow for the online checker. Any thoughts?
def find_contact(dictionary, contact):
    current_dict = dictionary
    count = 0
    for letter in contact:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return(0)
    else:
        return(words_counter(count, current_dict))

def words_counter(count, node):
    live_count = count
    live_node = node
    for value in live_node.values():
        if value == '_end_':
            live_count += 1
        if type(value) == type(dict()):
            live_count = words_counter(live_count, value)
    return(live_count)
Ok, so, as it turns out, using nested dicts is not a good idea in general, because hackerrank will shove 100k strings into your program and then everything will slow to a crawl. So the problem wasn't in the parsing, it was in the storing before the parsing. Eventually I found this blogpost, their solution passes the challenge 100%. Here's the code in full:
class Node:
    def __init__(self):
        self.count = 1
        self.children = {}

trie = Node()

def add(node, name):
    for letter in name:
        sub = node.children.get(letter)
        if sub:
            sub.count += 1
        else:
            sub = node.children[letter] = Node()
        node = sub

def find(node, data):
    for letter in data:
        sub = node.children.get(letter)
        if not sub:
            return 0
        node = sub
    return node.count

if __name__ == '__main__':
    n = int(input().strip())
    for _ in range(n):
        op, param = input().split()
        if op == 'add':
            add(trie, param)
        else:
            print(find(trie, param))
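The key idea in the code above is that each node stores how many names pass through it, so a lookup never has to walk the subtree. Since only prefix counts are needed, the same idea can be sketched without a trie at all, using a flat `collections.Counter` keyed by prefix. This is a different technique from the blog post's, the names `add_name` and `count_matches` are made up for illustration, and it has not been checked against the challenge's time or memory limits (it stores every prefix as a separate string).

```python
from collections import Counter

prefix_counts = Counter()


def add_name(name):
    # Record every non-empty prefix of the new contact.
    for end in range(1, len(name) + 1):
        prefix_counts[name[:end]] += 1


def count_matches(partial):
    # A Counter returns 0 for keys it has never seen.
    return prefix_counts[partial]


add_name("hack")
add_name("hackerrank")
print(count_matches("hac"))  # 2
print(count_matches("hak"))  # 0
```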

How to execute main method in terminal?

I have a Python program with class definitions, method definitions, and finally a main method which calls those methods and classes. When I run the program in the terminal, it outputs nothing, without an error message. Should I change the way my program is written?
The main method is at the bottom.
import re
import random

class node:
    def __init__(self, parent, daughters, edge):
        self.parent = parent
        self.daughters = daughters
        self.edge = edge
        trie.append(self)
        self.index = len(trie) - 1

def BruteForcePatternMatching(text, patterns):
    indices = []
    for pattern in patterns:
        pattern = re.compile(pattern)
        indices += pattern.finditer(text)
    return indices

def TrieConstruction(patterns, trie):
    trie.append(node(0, [], 0))
    for pattern in patterns:
        currentNode = trie[0]
        for base in pattern:
            for daughter in currentNode.daughters:
                if base == daughter.edge:
                    currentNode = daughter
                    break
            else:
                trie.append(node(currentNode, [], base))
                currentNode = trie[-1]

def PrefixTrieMatching(text, trie):
    v = trie[0]
    for index, base in enumerate(text):
        if v.daughters == []:
            pattern_out = []
            climb(v.index)
            return ''.join(pattern_out)
        else:
            for daughter in v.daughters:
                if base == daughter.edge:
                    v = daughter
                    break
            else:
                print('No matches found.')
                return

def climb(index):
    if index == 0:
        return
    else:
        pattern_out.append(node.edge)
        climb(trie[index].parent)

def TrieMatching(text, trie):
    while not text == []:
        PrefixTrieMatching(text, trie)
        text = text[:-1]

def SuffixTrieConstruction(text):
    trie = [node(0, [1], 0), node(0, [], len(text) + 1)] #we use normal nodes, but if they are leaves, we give them integers for indices
    for index in range(len(text)):
        start = len(text) - index
        slide = text[start:-1]
        currentNode = trie[0]
        for symbol in slide:
            for daughter in currentNode.daughters:
                if symbol == daughter.edge:
                    currentNode = daughter
                    break
            else:
                trie.append(node(currentNode.index, [], symbol))
                currentNode = trie[-1]
            if symbol == slide[-1]:
                trie.append(node(currentNode.index, [], start))
    return trie

def SuffixTreeConstruction(trie):
    for node in trie:
        if len(node.daughters) == 1:
            node.edge = node.edge + trie[node.daughter[0]].edge
            trie[node.daughters[0]].parent = node.index
            node.daughters = trie[currentNode.daughters[0]].daughters
            del trie[node.daughter[0]]
    for node in trie:
        for daughter in node.daughters:
            print('(%d, %d, %s') % (node.index, daughter, node.edge)

def main():
    print('First, we open a file of DNA sequences and generate a random DNA string of length 3000, representative of the reference genome.')
    patterns = list(open('patterns'))
    text = ''
    for count in range(3000):
        text += choice('ACTG')
    print('We will first check for matches using the Brute Force matching method, which scans the text for matches of each pattern.')
    BruteForcePatternMatching(text, patterns)
    print('It returns the indices in the text where matches occur.')
    print('Next, we generate a trie with the patterns, and then run the text over the trie to search for matches.')
    trie = []
    TrieConstruction(patterns, trie)
    TrieMatching(text, trie)
    print('It returns all matched patterns.')
    print('Finally, we can construct a suffix tree from the text. This is the concatenation of all non-branching nodes into a single node.')
    SuffixTreeConstruction(SuffixTrieConstruction(text))
    print('It returns the adjacency list - (parent, daughter, edge).')
Assuming that your code is saved in myfile.py:
You can, as others have stated, add:
if __name__ == "__main__":
    main()
at the end of your file and run it with python myfile.py.
If, however, for some obscure reason you cannot modify the content of your file, you can still achieve this with:
python -c "import myfile; myfile.main()"
from the command line (as long as you run this from the directory containing myfile.py).
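Simply call the main function at the end of the file:
main()
In this case the function will be run any time the program runs, including when the file is imported as a module.
Alternatively, you can use
if __name__ == "__main__": main()
This only runs main() if the file is run directly (that is, executed as the main module), not when it is imported.
As stated by NightShadeQueen and Chief.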
Use
if __name__ == "__main__":
    main()
at the end of your program. This way, the program can be run from the command line, and all the functions defined within it can still be imported without executing main().
See this post for a more detailed explanation of __name__ and its uses.
Here is a StackOverflow post which goes into detail about a few of the things you can do with this and restates the original explanation in many different words for your leisure.
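To make the difference between the two approaches concrete, here is a minimal sketch of a file using the guard. The body of `main` is a placeholder, not the asker's program.

```python
def main():
    # Stand-in for the real program logic.
    return "main ran"


if __name__ == "__main__":
    # True only when this file is executed directly, e.g. `python myfile.py`.
    # When the file is imported, __name__ is the module name ("myfile")
    # instead, so main() is not called automatically.
    print(main())
```

With this layout, `python myfile.py` prints the result, while `import myfile` only defines `main` and leaves it to the caller to invoke it.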

Turning an object into an iterator in Python 3?

I'm trying to port a library over to Python 3. It has a tokenizer for PDF streams. The reader class calls next() on these tokens. This worked in Python 2, but when I run it in Python 3 I get TypeError: 'PdfTokens' object is not an iterator.
Selections from tokens.py concerning iterators:
class PdfTokens(object):
    def __init__(self, fdata, startloc=0, strip_comments=True):
        self.fdata = fdata
        self.iterator = iterator = self._gettoks(startloc)
        self.next = next(iterator)

    def __iter__(self):
        return self.iterator

    def _gettoks(self, startloc, cacheobj=_cacheobj,
                 delimiters=delimiters, findtok=findtok, findparen=findparen,
                 PdfString=PdfString, PdfObject=PdfObject):
        fdata = self.fdata
        current = self.current = [(startloc, startloc)]
        namehandler = (cacheobj, self.fixname)
        cache = {}
        while 1:
            for match in findtok(fdata, current[0][1]):
                current[0] = tokspan = match.span()
                token = match.group(1)
                firstch = token[0]
                if firstch not in delimiters:
                    token = cacheobj(cache, token, PdfObject)
                elif firstch in '/<(%':
                    if firstch == '/':
                        # PDF Name
                        token = namehandler['#' in token](cache, token, PdfObject)
                    elif firstch == '<':
                        # << dict delim, or < hex string >
                        if token[1:2] != '<':
                            token = cacheobj(cache, token, PdfString)
                    elif firstch == '(':
                        ends = None # For broken strings
                        if fdata[match.end(1)-1] != ')':
                            nest = 2
                            m_start, loc = tokspan
                            for match in findparen(fdata, loc):
                                loc = match.end(1)
                                ending = fdata[loc-1] == ')'
                                nest += 1 - ending * 2
                                if not nest:
                                    break
                                if ending and ends is None:
                                    ends = loc, match.end(), nest
                            token = fdata[m_start:loc]
                            current[0] = m_start, match.end()
                            if nest:
                                (self.error, self.exception)[not ends]('Unterminated literal string')
                                loc, ends, nest = ends
                                token = fdata[m_start:loc] + ')' * nest
                                current[0] = m_start, ends
                        token = cacheobj(cache, token, PdfString)
                    elif firstch == '%':
                        # Comment
                        if self.strip_comments:
                            continue
                    else:
                        self.exception('Tokenizer logic incorrect -- should never get here')
                yield token
                if current[0] is not tokspan:
                    break
            else:
                if self.strip_comments:
                    break
        raise StopIteration
The beginning of the offending method in the pdfreader file that raises the error:
def findxref(fdata):
''' Find the cross reference section at the end of a file
'''
startloc = fdata.rfind('startxref')
if startloc < 0:
raise PdfParseError('Did not find "startxref" at end of file')
source = PdfTokens(fdata, startloc, False)
tok = next(source)
I was under the impression that all you needed to define a custom iterator object was an .__iter__() method, a .next() method, and to raise a StopIteration error. This class has all these things and yet it still raises the TypeError.
Furthermore, this library and its methods worked in Python 2.7 and have ceased to work in a Python 3 environment. What about Python 3 has made this different? What can I do to make the PdfTokens object iterable?
You cannot call next on a PdfTokens instance directly; you need to get its iterator first by calling iter() on it. That's exactly what a for-loop does as well*: it calls iter() on the object first to get an iterator, and then within the loop __next__ is invoked on that iterator until it is exhausted:
instance = PdfTokens(fdata, startloc, False)
source = iter(instance)
tok = next(source)
* Well, not always: if there's no __iter__ defined on the class, the iterator protocol falls back to __getitem__, if defined.
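As for what changed: in Python 3 the iterator method is spelled `__next__` rather than `next`, and the built-in `next()` looks it up on the class, so an instance attribute such as `self.next` is not consulted. The sketch below shows one way to make an object its own iterator. `Tokens` is a simplified, hypothetical stand-in for `PdfTokens` that splits on whitespace, not the library's actual code.

```python
class Tokens:
    """Hypothetical stand-in for PdfTokens: wraps a generator of tokens."""

    def __init__(self, data):
        self._iterator = self._gettoks(data)

    def _gettoks(self, data):
        for token in data.split():
            yield token
        # Falling off the end finishes the generator. An explicit
        # `raise StopIteration` is unnecessary, and under PEP 479
        # (default from Python 3.7) it would surface as RuntimeError.

    def __iter__(self):
        return self

    def __next__(self):          # Python 3 spelling of Python 2's next()
        return next(self._iterator)

    next = __next__              # optional alias for Python 2 callers


source = Tokens("startxref 1234 %%EOF")
print(next(source))   # startxref
print(list(source))   # ['1234', '%%EOF']
```

With `__next__` defined on the class, `next(source)` works without a prior `iter()` call, and for-loops keep working because `__iter__` returns the object itself.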

Recursive parsing for lisp like syntax

I'm trying to parse lines of the form:
(OP something something (OP something something ) ) ( OP something something )
where OP is a symbol for a logical gate (AND, OR, NOT) and something is the thing I want to evaluate.
The output I'm looking for is something like:
{ 'OPERATOR': [condition1, condition2, .. , conditionN] }
where a condition can itself be a dict/list pair (nested conditions). So far I have tried something like:
tree = dict()
cond = list()
tree[OP] = cond
for string in conditions:
    self.counter += 1
    if string.startswith('('):
        try:
            OP = string[1]
        except IndexError:
            OP = 'AND'
        finally:
            if OP == '?':
                OP = 'OR'
            elif OP == '!':
                OP = 'N'
        # Recurse
        cond.append(self.parse_conditions(conditions[self.counter:], OP))
        break
    elif not string.endswith(")"):
        cond.append(string)
    else:
        return tree
return tree
I tried other ways as well, but I just can't wrap my head around this whole recursion thing, so I'm wondering if I could get some pointers here. I looked around the web and found some material about recursive descent parsing, but the tutorials were all trying to do something more complicated than I needed.
PS: I realize I could do this with existing Python libraries, but what would I learn by doing that, eh?
I'm posting this without further comments, for learning purposes (in real life, please do use a library). Note that there's no error checking (homework for you!).
Feel free to ask if there's something you don't understand.
# PART 1. The Lexer
symbols = None

def read(input):
    global symbols
    import re
    symbols = re.findall(r'\w+|[()]', input)

def getsym():
    global symbols
    return symbols[0] if symbols else None

def popsym():
    global symbols
    return symbols.pop(0)

# PART 2. The Parser
# Built upon the following grammar:
#
# program = expr*
# expr = '(' func args ')'
# func = AND|OR|NOT
# args = arg*
# arg = string|expr
# string = [a..z]
def program():
    r = []
    while getsym():
        r.append(expr())
    return r

def expr():
    popsym() # (
    f = func()
    a = args()
    popsym() # )
    return {f: a}

def func():
    return popsym()

def args():
    r = []
    while getsym() != ')':
        r.append(arg())
    return r

def arg():
    if getsym() == '(':
        return expr()
    return string()

def string():
    return popsym()

# TEST = Lexer + Parser
def parse(input):
    read(input)
    return program()

print(parse('(AND a b (OR c d)) (NOT foo) (AND (OR x y))'))
# [{'AND': ['a', 'b', {'OR': ['c', 'd']}]}, {'NOT': ['foo']}, {'AND': [{'OR': ['x', 'y']}]}]
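For anyone attempting the error-checking homework, here is one possible sketch of the same grammar without global state. The function names (`tokenize`, `parse_expr`, `parse_program`) and the choice of `SyntaxError` are assumptions made for illustration. The position is threaded through as an explicit index, and the operator is not validated against AND/OR/NOT.

```python
import re


def tokenize(text):
    # Same token pattern as the lexer above: words or single parentheses.
    return re.findall(r'\w+|[()]', text)


def parse_expr(tokens, pos):
    """Parse one '(' func args ')' starting at tokens[pos].

    Returns (node, next_pos); raises SyntaxError on malformed input.
    """
    if pos >= len(tokens) or tokens[pos] != '(':
        raise SyntaxError("expected '(' at token %d" % pos)
    pos += 1
    if pos >= len(tokens) or tokens[pos] in ('(', ')'):
        raise SyntaxError("expected an operator at token %d" % pos)
    op = tokens[pos]
    pos += 1
    args = []
    while pos < len(tokens) and tokens[pos] != ')':
        if tokens[pos] == '(':
            node, pos = parse_expr(tokens, pos)   # nested expression
        else:
            node, pos = tokens[pos], pos + 1      # plain string argument
        args.append(node)
    if pos >= len(tokens):
        raise SyntaxError("missing ')' for operator %r" % op)
    return {op: args}, pos + 1                    # skip the closing ')'


def parse_program(text):
    tokens = tokenize(text)
    pos, result = 0, []
    while pos < len(tokens):
        node, pos = parse_expr(tokens, pos)
        result.append(node)
    return result


print(parse_program('(AND a b (OR c d)) (NOT foo)'))
# [{'AND': ['a', 'b', {'OR': ['c', 'd']}]}, {'NOT': ['foo']}]
```

Returning the next position from each call is what lets the recursion resume in the right place after a nested expression, which is the part the question's `self.counter` approach was struggling with.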

Python string assignment issue!

So I'm fairly new to Python, but I have absolutely no idea why the string oldUser changes to the current user after I make the parse call. Any help would be greatly appreciated.
while a < 20:
    f = urllib.urlopen("SITE")
    a = a+1
    for i, line in enumerate(f):
        if i == 187:
            print line
            myparser.parse(line)
            if fCheck == 1:
                result = oldUser[0] is oldUser[1]
                print oldUser[0]
                print oldUser[1]
            else:
                result = user is oldUser
            fCheck = 1
            print result
            user = myparser.get_descriptions(firstCheck)
            firstCheck = 1
            print user
            if result:
                print "SAME"
                array[index+1] = array[index+1] +0
            else:
                oldUser = user
        elif i > 200:
            break
    myparser.reset()
I don't understand why result doesn't work either. I print out both values, and when they're the same it tells me they're not equal. Also, why does myparser.parse(line) turn oldUser into a size 2 array? Thanks!
Here's the definition for myparser:
class MyParser(sgmllib.SGMLParser):
    "A simple parser class."
    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        sgmllib.SGMLParser.__init__(self, verbose)
        self.divs = []
        self.descriptions = []
        self.inside_div_element = 0

    def start_div(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "id":
                self.divs.append(value)
                self.inside_div_element = 1

    def end_div(self):
        "Record the end of a hyperlink."
        self.inside_div_element = 0

    def handle_data(self, data):
        "Handle the textual 'data'."
        if self.inside_div_element:
            self.descriptions.append(data)

    def get_div(self):
        "Return the list of hyperlinks."
        return self.divs

    def get_descriptions(self, check):
        "Return a list of descriptions."
        if check == 1:
            self.descriptions.pop(0)
        return self.descriptions
Don’t compare strings with is. That checks if they’re the same object, not two copies of the same string. See:
>>> string = raw_input()
hello
>>> string is 'hello'
False
>>> string == 'hello'
True
Also, the definition of myparser would be useful.
I'm not quite sure what your code is doing, but I suspect you want to use == instead of is. Using is compares object identity, which is not the same as string equality. Two different string objects may contain the same sequence of characters.
result = oldUser[0] == oldUser[1]
If you're curious, for more information on the behaviour of the is operator see Python “is” operator behaves unexpectedly with integers.
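The other half of the question (why oldUser changes after the parse call) looks like list aliasing: get_descriptions returns the parser's own self.descriptions list, so oldUser = user makes oldUser another name for that list, and later handle_data calls append to it. The sketch below illustrates this in Python 3 with a hypothetical `Collector` class standing in for MyParser; it is an interpretation of the posted code, not something the asker confirmed.

```python
class Collector:
    """Hypothetical stand-in for MyParser: keeps appending to one list."""

    def __init__(self):
        self.descriptions = []

    def feed(self, text):
        self.descriptions.append(text)

    def get_descriptions(self):
        return self.descriptions      # returns the internal list itself


collector = Collector()
collector.feed("alice")
old_user = collector.get_descriptions()        # alias of the internal list
snapshot = list(collector.get_descriptions())  # independent copy

collector.feed("bob")
print(old_user)               # ['alice', 'bob']  (changed "by itself")
print(snapshot)               # ['alice']
print(snapshot == ["alice"])  # True: same contents
print(snapshot is old_user)   # False: different objects
```

Copying the list when storing it (for example with `list(...)`) and comparing with `==` rather than `is` avoids both surprises.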
