Efficient partial search of a trie in Python

This is a hackerrank exercise, and although the problem itself is solved, my solution is apparently not efficient enough, so on most test cases I'm getting timeouts. Here's the problem:
We're going to make our own Contacts application! The application must perform two types of operations:
add name, where name is a string denoting a contact name. This must be stored as a new contact in the application.
find partial, where partial is a string denoting a partial name to search the application for. It must count the number of contacts starting with partial and print the count on a new line.
Given n sequential add and find operations, perform each operation in order.
I'm using Tries to make it work, here's the code:
import re

def add_contact(dictionary, contact):
    _end = '_end_'
    current_dict = dictionary
    for letter in contact:
        current_dict = current_dict.setdefault(letter, {})
    current_dict[_end] = _end
    return dictionary
def find_contact(dictionary, contact):
p = re.compile('_end_')
current_dict = dictionary
for letter in contact:
if letter in current_dict:
current_dict = current_dict[letter]
else:
return(0)
count = int(len(p.findall(str(current_dict))) / 2)
re.purge()
return(count)
n = int(input().strip())
contacts = {}
for a0 in range(n):
op, contact = input().strip().split(' ')
if op == "add":
contacts = add_contact(contacts, contact)
if op == "find":
print(find_contact(contacts, contact))
Because the problem requires not just reporting whether partial is a match, but counting all of the entries that start with it, I couldn't find any other way but to cast the nested dictionaries to a string and then count all of the _end_ markers, which I use to denote stored strings. This, it would seem, is the culprit, but I cannot find a better way to do the searching. How do I make this work faster? Thanks in advance.
UPD:
I have added a results counter that actually parses the tree, but the code is still too slow for the online checker. Any thoughts?
def find_contact(dictionary, contact):
    current_dict = dictionary
    count = 0
    for letter in contact:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return 0
    else:
        return words_counter(count, current_dict)

def words_counter(count, node):
    live_count = count
    live_node = node
    for value in live_node.values():
        if value == '_end_':
            live_count += 1
        if type(value) == type(dict()):
            live_count = words_counter(live_count, value)
    return live_count

Ok, so, as it turns out, using nested dicts is not a good idea in general, because hackerrank will shove 100k strings into your program and then everything will slow to a crawl. So the problem wasn't in the parsing, it was in the storing before the parsing. Eventually I found this blog post, whose solution passes the challenge 100%. Here's the code in full:
class Node:
    def __init__(self):
        self.count = 1
        self.children = {}

trie = Node()

def add(node, name):
    for letter in name:
        sub = node.children.get(letter)
        if sub:
            sub.count += 1
        else:
            sub = node.children[letter] = Node()
        node = sub

def find(node, data):
    for letter in data:
        sub = node.children.get(letter)
        if not sub:
            return 0
        node = sub
    return node.count

if __name__ == '__main__':
    n = int(input().strip())
    for _ in range(n):
        op, param = input().split()
        if op == 'add':
            add(trie, param)
        else:
            print(find(trie, param))
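For a quick sanity check outside the HackerRank input loop, here is a minimal usage sketch of my own (the contact names hack and hackerrank are just made-up test data):
# hypothetical smoke test for the Node-based trie above
demo_trie = Node()
for name in ('hack', 'hackerrank'):
    add(demo_trie, name)

print(find(demo_trie, 'hac'))   # 2 -- both stored names start with 'hac'
print(find(demo_trie, 'hak'))   # 0 -- no stored name starts with 'hak'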

Related

Processing Strings from data file containing punctuation coding in python

I am trying to make a simple programme that can help make army lists for a popular tabletop wargame, more as an exercise for my own experience, since there are plenty of pre-made software packages that do this, but the idea behind it seems fairly straightforward.
The programme reads the data for all the units available in an army from a spreadsheet and creates various classes for each unit. The main bit I am looking at now is the options/upgrades.
In the file I want a straightforward syntax for the option field for each unit, i.e. the options string itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ would mean:
1. you may take itemA (X pts per model)
2. for every 3 models, you may exchange itemB with
   a) itemC (net X pts per model)
3. each model may take 2 of itemD (X pts per model)
4. each model may take one of either
   a) itemE (X pts per model)
   b) itemF (X pts per model)
   c) itemG (X pts per model)
5. each model may take either
   a) itemH (X points per model)
   b) itemI and itemJ (X points per model)
At the moment I am processing the string using lots of splits and if statements, which makes it very hard to keep track of everything and assign the choices correctly once the user inputs their choice.
for index, option in enumerate(self.options):
    output = "{}.".format(index + 1)
    if '-' in option:
        sub_option, no_models = option.split('-')
        no_models = int(no_models)
        print(sub_option)
        print(no_models)
        output += "For every {} models ".format(no_models)
        if '/' in sub_option:
            temp_str, temp_options, points_list = exchange_option(sub_option)
        else:
            temp_str, temp_options, points_list = standard_option(sub_option)
        index_points.append(points_list)
        temp_options.append(no_models)
        index_options.append(temp_options)
    else:
        if '/' in option:
            temp_str, temp_options, points_list = exchange_option(option)
        else:
            temp_str, temp_options, points_list = standard_option(option)
        index_points.append(points_list)
        index_options.append(temp_options)
    output += temp_str
The *_option() functions are additional helper functions I have defined above; they have a similar structure, with further if statements inside them.
The main question I am asking is: is there an easier way to process a code-like string such as this? While it works to produce the output in the example above, it seems awfully cumbersome to then deal with the user input.
What I am aiming to do is first output the string as given in my example at the top of the question, and then, taking the user's input index of the given option, modify the associated unit class to have the correct wargear and points value.
I thought about trying to make some kind of options class, but again, labelling and defining each option so that they can interact with one another properly seems equally complex, and I feel there must be something more pythonic, or just generally better coding practice, for processing encoded strings such as this.
So, here's a full-blown parser to do that! Now, this only outputs the list as in the previous version of your question, but it shouldn't be too hard to add more features as you want. Also please note that at the moment the lexer does not error out when a string contains invalid tokens, but it's just a proof of concept, so it should be fine.
Part I: the lexer
This tokenises the input string - looks through it from left to right and attempts to classify non-overlapping substrings as instances of tokens. It's to be used before parsing. When given a string, Lexer.tokenize yields a stream of Tokens.
# FILE: lex.py

import re
import enum

class Token:
    def __init__(self, type, value: str, lineno: int, pos: int):
        self.type, self.value, self.lineno, self.pos = type, value, lineno, pos

    def __str__(self):
        v = f'({self.value!r})' if self.value else ''
        return f'{self.type.name}{v} at {self.lineno}:{self.pos}'

    __repr__ = __str__

class Lexer:
    def __init__(self, token_types: enum.Enum, tokens_regexes: dict):
        self.token_types = token_types
        regex = '|'.join(map('(?P<{}>{})'.format, *zip(*((tok.name, regex) for tok, regex in tokens_regexes.items()))))
        self.regex = re.compile(regex)

    def tokenize(self, string, skip=['space']):
        # TODO: detect invalid input
        lineno, pos = 0, 0
        skip = set(map(self.token_types.__getitem__, skip))
        for matchobj in self.regex.finditer(string):
            type_name = matchobj.lastgroup
            value = matchobj.groupdict()[type_name]
            Type = self.token_types[type_name]
            if Type == self.token_types.newline:  # possibly buggy, but not catastrophic
                self.lineno += 1
                self.pos = 0
                continue
            pos = matchobj.end()
            if Type not in skip:
                yield Token(Type, value, lineno, pos)
        yield Token(self.token_types.EOF, '', lineno, pos)
Part II: the parser (with syntax-driven evaluation):
This parses the given stream of tokens provided by lex.Lexer.tokenize and translates individual symbols to English according to the following grammar:
Opt_list -> Option Opt_list_
Opt_list_ -> comma Option Opt_list_ | empty
Option -> Choice | Mult
Choice -> Compound More_choices Exchange
Compound -> item Add_item
Add_item -> plus item Add_item | empty
More_choices -> slash Compound More_choices | empty
Exchange -> minus num | empty
Mult -> num star Compound
The uppercase symbols are nonterminals, the lowercase ones are terminals. There's also a special symbol EOF that's not present here.
Also, take a look at the vital statistics of this grammar. This grammar is LL(1), so we can use an LL(1) recursive descent predictive parser, as shown below.
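As a small worked example of my own (not from the original answer), deriving the fragment itemB/itemC-3 with this grammar shows why one token of lookahead is enough:
Option -> Choice -> Compound More_choices Exchange
Compound -> item(itemB) Add_item -> item(itemB), since Add_item -> empty
More_choices -> slash Compound More_choices -> slash item(itemC), since the second More_choices -> empty
Exchange -> minus num(3)
At every step the next input token (item, slash, minus, ...) selects exactly one production, which is what the recursive descent parser below relies on.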
If you modify the grammar, you should modify the parser accordingly! The methods that do the actual parsing are called parse_<something>, and to change the output of the parser (the Parser.parse function, actually) you should change the return values of these parse_<something> functions.
# FILE: parse.py

import lex

class Parser:
    def __init__(self, lexer):
        self.string, self.tokens = None, None
        self.lexer = lexer
        self.t = self.lexer.token_types
        self.__lookahead = None

    @property
    def lookahead(self):
        if not self.__lookahead:
            try:
                self.__lookahead = next(self.tokens)
            except StopIteration:
                self.__lookahead = lex.Token(self.t.EOF, '', 0, -1)
        return self.__lookahead

    def next(self):
        if self.__lookahead and self.__lookahead.type == self.t.EOF:
            return self.__lookahead
        self.__lookahead = None
        return self.lookahead

    def match(self, token_type):
        if self.lookahead.type == token_type:
            return self.next()
        raise SyntaxError(f'Expected {token_type}, got {self.lookahead.type}',
                          ('<string>', self.lookahead.lineno, self.lookahead.pos, self.string))

    # THE PARSING STARTS HERE
    def parse(self, string):
        # setup
        self.string = string
        self.tokens = self.lexer.tokenize(string)
        self.__lookahead = None
        self.next()
        # do parsing
        ret = [''] + self.parse_opt_list()
        return ' '.join(ret)

    def parse_opt_list(self) -> list:
        ret = self.parse_option(1)
        ret.extend(self.parse_opt_list_(1))
        return ret

    def parse_opt_list_(self, curr_opt_number) -> list:
        if self.lookahead.type in {self.t.EOF}:
            return []
        self.match(self.t.comma)
        ret = self.parse_option(curr_opt_number + 1)
        ret.extend(self.parse_opt_list_(curr_opt_number + 1))
        return ret

    def parse_option(self, opt_number) -> list:
        ret = [f'{opt_number}.']
        if self.lookahead.type == self.t.item:
            ret.extend(self.parse_choice())
        elif self.lookahead.type == self.t.num:
            ret.extend(self.parse_mult())
        else:
            raise SyntaxError(f'Expected {self.t.item} or {self.t.num}, got {self.lookahead.type}',
                              ('<string>', self.lookahead.lineno, self.lookahead.pos, self.string))
        ret[-1] += '\n'
        return ret

    def parse_choice(self) -> list:
        c = self.parse_compound()
        m = self.parse_more_choices()
        e = self.parse_exchange()
        if not m:
            if not e:
                ret = f'You may take {" ".join(c)}'
            else:
                ret = f'for every {e} models you may take item {" ".join(c)}'
        elif m:
            c.extend(m)
            if not e:
                ret = f'each model may take one of: {", ".join(c)}'
            else:
                ret = f'for every {e} models you may exchange the following items with each other: {", ".join(c)}'
        else:
            ret = 'Semantic error!'
        return [ret]

    def parse_compound(self) -> list:
        ret = [self.lookahead.value]
        self.match(self.t.item)
        _ret = self.parse_add_item()
        return [' '.join(ret + _ret)]

    def parse_add_item(self) -> list:
        if self.lookahead.type in {self.t.comma, self.t.minus, self.t.slash, self.t.EOF}:
            return []
        ret = ['with']
        self.match(self.t.plus)
        ret.append(self.lookahead.value)
        self.match(self.t.item)
        return ret + self.parse_add_item()

    def parse_more_choices(self) -> list:
        if self.lookahead.type in {self.t.comma, self.t.minus, self.t.EOF}:
            return []
        self.match(self.t.slash)
        ret = self.parse_compound()
        return ret + self.parse_more_choices()

    def parse_exchange(self) -> str:
        if self.lookahead.type in {self.t.comma, self.t.EOF}:
            return ''
        self.match(self.t.minus)
        ret = self.lookahead.value
        self.match(self.t.num)
        return ret

    def parse_mult(self) -> list:
        ret = [f'each model may take {self.lookahead.value} of:']
        self.match(self.t.num)
        self.match(self.t.star)
        return ret + self.parse_compound()
Part III: usage
Here's how to use all of that code:
# FILE: evaluate.py

import enum
from lex import Lexer
from parse import Parser

# these are all the types of tokens present in our grammar
token_types = enum.Enum('Types', 'item num plus minus star slash comma space newline empty EOF')
t = token_types

# these are the regexes that the lexer uses to recognise the tokens
terminals_regexes = {
    t.item: r'[a-zA-Z_]\w*',
    t.num: '0|[1-9][0-9]*',
    t.plus: r'\+',
    t.minus: '-',
    t.star: r'\*',
    t.slash: '/',
    t.comma: ',',
    t.space: r'[ \t]',
    t.newline: r'\n'
}

lexer = Lexer(token_types, terminals_regexes)
parser = Parser(lexer)

string = 'itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ'
print(f'STRING FROM THE QUESTION: {string!r}\nRESULT:')
print(parser.parse(string), '\n\n')

string = input('Enter a command: ')
while string and string.lower() not in {'q', 'quit', 'e', 'exit'}:
    try:
        print(parser.parse(string))
    except SyntaxError as e:
        print(f' Syntax error: {e}\n {e.text}\n' + ' ' * (4 + e.offset - 1) + '^\n')
    string = input('Enter a command: ')
Example session:
# python3 evaluate.py
STRING FROM THE QUESTION: 'itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ'
RESULT:
1. You may take itemA
2. for every 3 models you may exchange the following items with each other: itemB, itemC
3. each model may take 2 of: itemD
4. each model may take one of: itemE, itemF, itemG
5. each model may take one of: itemH, itemI with itemJ
Enter a command: itemA/b/c/stuff
1. each model may take one of: itemA, b, c, stuff
Enter a command: 4 * anything
1. each model may take 4 of: anything
Enter a command: 5 * anything + more
1. each model may take 5 of: anything with more
Enter a command: a + b + c+ d
1. You may take a with b with c with d
Enter a command: a+b/c
1. each model may take one of: a with b, c
Enter a command: itemA/itemB-2
1. for every 2 models you may exchange the following items with each other: itemA, itemB
Enter a command: itemA+itemB/itemC - 5
1. for every 5 models you may exchange the following items with each other: itemA with itemB, itemC
Enter a command: q

Formatting and Understanding the Process for an "Unjumbler"

Basically, the code is supposed to print out what it believes to be the unjumbled letters, based on how often each letter appears at a given index. When I run it, it keeps saying stringlist is not defined. Any idea why? I could use some help with formatting too.
def getMessages():
    stringlist=[]
    stringinput=""
    while stringinput!="DONE":
        stringinput=input("Type each string. When you are finished, type DONE. ")
        if stringinput=="DONE":
            return stringlist
        else:
            stringlist.append(stringinput)

def countFrequencies(stringlist, indexval):
    letterdict={"a":0, "b":0, "c":0, "d":0, "e":0, "f":0, "g":0, "h":0, "i":0, "j":0, "k":0, "l":0,
                "m":0, "n":0, "o":0, "p":0, "q":0, "r":0, "s":0, "t":0, "u":0, "v":0, "w":0, "x":0,
                "y":0, "z":0}
    for i in stringlist:
        counter=i[indexval]
        letterdict[counter]+=1
    return letterdict

def mostCommonLetter(letterdict):
    ungarble=""
    highest=-1
    for i in letterdict.keys():
        if letterdict[i]>highest:
            ungarble=i
            highest=letterdict[i]
    return ungarble

getMessages()
countFrequencies(stringlist, indexval)
print("Recovered message: ", mostCommonLetter(letterdict))
Your indentation is incorrect. The NameError happens because stringlist only exists inside getMessages(), so you have to capture the return value (stringlist = getMessages()) before passing it on.
You could use Counter to aggregate the frequencies of letters in each line.
from collections import Counter

def getMessages():
    stringlist=[]
    stringinput=""
    while stringinput!="DONE":
        stringinput=input("Type each string. When you are finished, type DONE. ")
        if stringinput=="DONE":
            return stringlist
        else:
            stringlist.append(stringinput)

def countFrequencies(stringlist):
    frequencies = Counter()
    for line in stringlist:
        frequencies.update(line)
    return frequencies

def mostCommonLetter(frequencies):
    # most_common(1) gives the (letter, count) pair with the highest count
    return frequencies.most_common(1)[0][0]

stringlist = getMessages()
frequencies = countFrequencies(stringlist)
print("Recovered message: ", mostCommonLetter(frequencies))

Python Parser Recursion Infinite Reference

I'm new to parsers and I have a problem with mine, specifically when it calls itself to analyse the body of a function.
When it finds another function, things just become messy.
Basically, when analysing this code:
fn test (a, b, c):
    fn test2 (c, b, a):
        print("Hello world")
    end
end
It starts to point the object to itself, not the subfunction:
>>> print(ast[0].value.body[9])
<ast.VariableAST object at 0x7f6285540710>
>>> print(ast[0].value.body[9].value.body[9])
<ast.VariableAST object at 0x7f6285540710>
This is the main parser code:
# Parser Loop
def Parser(tokenList):
    global tokens
    global size
    tokens = tokenList
    size = len(tokens)
    ast = []
    while i < size:
        ast.append(MainParser())
    return ast

# The Main Parser
def MainParser():
    global i
    if tokens[i] == 'fn':
        i += 1
        node = FunctionParser()
    else:
        node = tokens[i]
        i += 1
    return node

# Parse a function
def FunctionParser():
    global i
    checkEnd("function")
    if tokens[i][0].isalpha():
        node = VariableAST()
        node.name = tokens[i]
        i += 1
        node.value = FunctionBodyParser()
    elif tokens[i] == '(':
        node = FunctionBodyParser()
    else:
        syntaxError("Expecting '(' or function name")
    return node

# Parse a function body
def FunctionBodyParser():
    global i
    i += 1
    node = FunctionAST()
    while True:
        checkEnd("function")
        if tokens[i][0].isalpha():
            node.args.append(tokens[i])
            i += 1
        elif tokens[i] == ',':
            i += 1
        elif tokens[i] == ')':
            break
        else:
            syntaxError("Expecting ')' or ',' in function declaration")
    i += 1
    checkEnd("function")
    if tokens[i] != ':' and tokens[i] != '{':
        syntaxError("Expecting '{' or ':'")
    begin = tokens[i]
    while True:
        checkEnd("function")
        if begin == ':' and tokens[i] == 'end':
            break
        elif begin == '{' and tokens[i] == '}':
            break
        else:
            node.body.append(MainParser())
    i += 1
    return node
Edit: I forgot to mention that this is a prototype for a C version. I'm avoiding object orientation and some Python good practices to make it easier to port the code to C later.
There are a lot of parsers implemented in Python (https://wiki.python.org/moin/LanguageParsing), like pyPEG, that let you describe the language you're parsing instead of parsing it by hand, which is more readable and less error-prone.
Also, using global is typically a source of problems, as you can't control the scope of a variable (there's only one), which reduces the reusability of your functions.
It's probably better to use a class, which is almost the same thing, except that you can have multiple instances of the same class running at the same time without variable collisions:
class Parser:
    def __init__(self, tokenList):
        self.tokens = tokenList
        self.size = len(tokenList)
        self.ast = []
        self.position = 0

    def parse(self):
        while self.position < self.size:
            self.ast.append(self.main())
        return self.ast

    def main(self):
        if self.tokens[self.position] == 'fn':
            self.position += 1
            node = self.function()
        else:
            node = self.tokens[self.position]
            self.position += 1
        return node

    # And so on...
From this point you can deduplicate self.position += 1 in main:
def main(self):
    token = self.tokens[self.position]
    self.position += 1
    if token == 'fn':
        node = self.function()
    else:
        node = token
    return node
Then remove the useless "node" variable:
def main(self):
    token = self.tokens[self.position]
    self.position += 1
    if token == 'fn':
        return self.function()
    return token
But the real way to do this is to use a parser library; take a look at pyPEG, it's a nice one, but others are nice too.
Oh, and one last point: avoid useless comments like:
# Parse a function
def FunctionParser():
We already know that FunctionParser parses a function; that comment adds no information. The most important thing is to choose your function names wisely (and PEP 8 tells us not to start method names with capital letters), and if you want to add meta-information about a function, put it in a docstring as the first statement of the function, like:
def functionParser():
    "Delegates the body parsing to functionBodyParser"
I found the solution.
Actually it was a silly Python programming error, and the parser above is working fine.
In the AST I was creating class-like structures to make it easier to port the language to C. But I was doing it wrong; for example:
class VariableAST(NodeAST):
    name = None
    value = None
What I didn't know about Python is that those names defined in the class body aren't attributes of each object but class attributes (effectively static variables), which made my program behave unpredictably: when I assigned a value to one object I was assigning to the shared class-level variable, so all the other objects saw it too, and there is my infinite recursion of objects.
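A minimal sketch of the corresponding fix, assuming the other AST classes follow the same pattern (FunctionAST and its args/body lists are used by the parser above; NodeAST is only a placeholder here): define the attributes per instance in __init__, so each node owns its own values instead of sharing class-level ones.
class NodeAST:
    pass  # placeholder base class; the real one is not shown in the question

class VariableAST(NodeAST):
    def __init__(self):
        self.name = None
        self.value = None

class FunctionAST(NodeAST):
    def __init__(self):
        # each function node now owns its argument and body lists,
        # instead of sharing one mutable class attribute
        self.args = []
        self.body = []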

How to execute main method in terminal?

I have a Python program with class definitions, method definitions, and finally a main method which calls those methods and classes. When I run the program in the terminal, it outputs nothing, without an error message. Should I change the way my program is written?
The main method is at the bottom.
import re
import random

class node:
    def __init__(self, parent, daughters, edge):
        self.parent = parent
        self.daughters = daughters
        self.edge = edge
        trie.append(self)
        self.index = len(trie) - 1

def BruteForcePatternMatching(text, patterns):
    indices = []
    for pattern in patterns:
        pattern = re.compile(pattern)
        indices += pattern.finditer(text)
    return indices

def TrieConstruction(patterns, trie):
    trie.append(node(0, [], 0))
    for pattern in patterns:
        currentNode = trie[0]
        for base in pattern:
            for daughter in currentNode.daughters:
                if base == daughter.edge:
                    currentNode = daughter
                    break
            else:
                trie.append(node(currentNode, [], base))
                currentNode = trie[-1]

def PrefixTrieMatching(text, trie):
    v = trie[0]
    for index, base in enumerate(text):
        if v.daughters == []:
            pattern_out = []
            climb(v.index)
            return ''.join(pattern_out)
        else:
            for daughter in v.daughters:
                if base == daughter.edge:
                    v = daughter
                    break
            else:
                print('No matches found.')
                return

def climb(index):
    if index == 0:
        return
    else:
        pattern_out.append(node.edge)
        climb(trie[index].parent)

def TrieMatching(text, trie):
    while not text == []:
        PrefixTrieMatching(text, trie)
        text = text[:-1]

def SuffixTrieConstruction(text):
    trie = [node(0, [1], 0), node(0, [], len(text) + 1)]  # we use normal nodes, but if they are leaves, we give them integers for indices
    for index in range(len(text)):
        start = len(text) - index
        slide = text[start:-1]
        currentNode = trie[0]
        for symbol in slide:
            for daughter in currentNode.daughters:
                if symbol == daughter.edge:
                    currentNode = daughter
                    break
            else:
                trie.append(node(currentNode.index, [], symbol))
                currentNode = trie[-1]
            if symbol == slide[-1]:
                trie.append(node(currentNode.index, [], start))
    return trie

def SuffixTreeConstruction(trie):
    for node in trie:
        if len(node.daughters) == 1:
            node.edge = node.edge + trie[node.daughter[0]].edge
            trie[node.daughters[0]].parent = node.index
            node.daughters = trie[currentNode.daughters[0]].daughters
            del trie[node.daughter[0]]
    for node in trie:
        for daughter in node.daughters:
            print('(%d, %d, %s') % (node.index, daughter, node.edge)

def main():
    print('First, we open a file of DNA sequences and generate a random DNA string of length 3000, representative of the reference genome.')
    patterns = list(open('patterns'))
    text = ''
    for count in range(3000):
        text += choice('ACTG')
    print('We will first check for matches using the Brute Force matching method, which scans the text for matches of each pattern.')
    BruteForcePatternMatching(text, patterns)
    print('It returns the indices in the text where matches occur.')
    print('Next, we generate a trie with the patterns, and then run the text over the trie to search for matches.')
    trie = []
    TrieConstruction(patterns, trie)
    TrieMatching(text, trie)
    print('It returns all matched patterns.')
    print('Finally, we can construct a suffix tree from the text. This is the concatenation of all non-branching nodes into a single node.')
    SuffixTreeConstruction(SuffixTrieConstruction(text))
    print('It returns the adjacency list - (parent, daughter, edge).')
Assuming that your code is saved in myfile.py.
You can, as others stated, add:
if __name__ == "__main__":
    main()
at the end of your file and run it with python myfile.py.
If, however, for some obscure reason, you cannot modify the content of your file, you can still achieve this with:
python -c "import myfile; myfile.main()"
from the command line (as long as you run it from the directory containing myfile.py).
Simply call the main function at the end:
main()
In this case the function will run every time the file is executed (and also whenever it is imported).
You can use
if __name__ == "__main__":
    main()
This only runs main() when the file itself is executed as a script (not when the module is imported), as stated by NightShadeQueen and Chief.
Use
if __name__ == "__main__":
    main()
at the end of your program. This way, the program can be run from the command line, and all the functions defined within it remain importable without executing main().
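As a small illustration of that last point (the module name myfile is an assumption carried over from the first answer), another script can import the definitions without triggering main():
# another_script.py -- hypothetical; reuses myfile.py without running main()
import myfile   # nothing runs at import time once the guard is in place

# the module's functions are still available on their own
matches = myfile.BruteForcePatternMatching("ACTGACT", ["ACT"])
print(len(matches))   # 2 matches of "ACT" in "ACTGACT"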
See this post for a more detailed explanation of __name__ and its uses.
Here is a Stack Overflow post which goes into detail about a few of the things you can do with this and restates the original explanation in many different words, for your leisure.

Python - Formatting strings

I have the input file:
sun vehicle
one number
two number
reduce command
one speed
five speed
zero speed
speed command
kmh command
I used the following code:
from collections import OrderedDict

output = OrderedDict()
with open('final') as in_file:
    for line in in_file:
        columns = line.split(' ')
        if len(columns) >= 2:
            word, tag = line.strip().split()
            if output.has_key(tag) == False:
                output[tag] = []
            output[tag].append(word)
        else:
            print ""
            for k, v in output.items():
                print '<{}> {} </{}>'.format(k, ' '.join(v), k)
            output = OrderedDict()
I am getting the output as:
<vehicle> sun </vehicle>
<number> one two </number>
<command> reduce speed kmh </command>
<speed> one five zero </speed>
But my expected output should be:
<vehicle> sun </vehicle>
<number> one two </number>
<command> reduce
<speed> one five zero </speed>
speed kmh </command>
Can someone help me in solving this?
It looks like the output you want to achieve is underspecified!
You presumably want the code to "know in advance" that speed is a part of command, before you get to the line speed command.
To do what you want, you will need a recursive function.
How about
for k, v in output.items():
    print expandElements(k, v, output)
and somewhere you define
def expandElements(k, v, dic):
    out = '<' + k + '>'
    for i in v:
        # check each item of v for matches in dic.
        # if no match, then out = out + i
        # otherwise expand using a recursive call of expandElements()
        # and out = out + expandElements(...)
        pass
    out = out + '</' + k + '>'
    return out
It looks like you want some kind of tree structure for your output?
You are printing out with print '<{}> {} </{}>'.format(k, ' '.join(v), k) so all of your output is going to have the form of '<{}> {} </{}>'.
If you want to nest things you are going to need a nested structure to represent them.
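To make that concrete, here is a small sketch of my own (not part of the original answer) of one possible nested structure: each tag maps to a list whose items are either plain words or (tag, children) pairs, and a recursive function renders it.
# hypothetical nested representation of the desired output
nested = {
    'vehicle': ['sun'],
    'number': ['one', 'two'],
    'command': ['reduce', ('speed', ['one', 'five', 'zero']), 'kmh'],
}

def render(tag, items):
    # recursively turn the structure into the <tag> ... </tag> form
    parts = []
    for item in items:
        if isinstance(item, tuple):
            parts.append(render(*item))   # nested tag
        else:
            parts.append(item)            # plain word
    return '<{0}> {1} </{0}>'.format(tag, ' '.join(parts))

for tag, items in nested.items():
    print(render(tag, items))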
For recursively parsing the input file I would make a class representing the tag. Each tag can have children. Each child starts out as a string, added manually with tag.children.append("value") or by calling tag.add_tag(tag.name, "value").
class Tag:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.has_root = True
        self.parent = parent

    def __str__(self):
        """ compose string for this tag (recursively) """
        if not self.children:
            return self.name
        children_str = ' '.join([str(child) for child in self.children])
        if not self.parent:
            return children_str
        return '<%s>%s</%s>' % (self.name, children_str, self.name)

    @classmethod
    def from_file(cls, file):
        """ create root tag from file """
        obj = cls('root')
        columns = []
        with open(file) as in_file:
            for line in in_file:
                value, tag = line.strip().split(' ')
                obj.add_tag(tag, value)
        return obj

    def search_tag(self, tag):
        """ search for a tag in the children """
        if self.name == tag:
            return self
        for i, c in enumerate(self.children):
            if isinstance(c, Tag) and c.name == tag:
                return c
            elif isinstance(c, str):
                if c.strip() == tag.strip():
                    self.children[i] = Tag(tag, self)
                    return self.children[i]
            else:
                result = c.search_tag(tag)
                if result:
                    return result

    def add_tag(self, tag, value):
        """
        add a value, tag pair to the children

        Firstly this searches if the value is a child. If this is the
        case, it moves that child to the new location.

        Afterwards it searches for the tag in the children. When found,
        the value is added to this tag. If not, a new tag object
        is created and added to this Tag. The flag has_root
        is set to False so the element can be moved later.
        """
        value_tag = self.search_tag(value)
        if value_tag and not value_tag.has_root:
            print("Found value: %s" % value)
            if value_tag.parent:
                i = value_tag.parent.children.index(value_tag)
                value = value_tag.parent.children.pop(i)
                value.has_root = True
        else:
            print("not %s" % value)
        found = self.search_tag(tag)
        if found:
            found.children.append(value)
        else:
            # no root
            tag_obj = Tag(tag, self)
            self.children.append(tag_obj)
            tag_obj.add_tag(tag, value)
            tag_obj.has_root = False

tags = Tag.from_file('final')
print(tags)
I know in this example the speed-Tag is not added twice. I hope that's ok.
Sorry for the long code.
