Processing strings from a data file containing punctuation coding in Python

I am trying to make a simple programme that can help make army lists for a popular tabletop wargame. It's more of an exercise for my own experience, as there are plenty of pre-made software packages that do this, but the idea behind it seems fairly straightforward.
The programme reads the data for all the units available in an army from a spreadsheet and creates various classes for each unit. The main bit I am looking at now is the options/upgrades.
In the file I want a straightforward syntax for the options field of each unit. For example, the options string itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ would mean:
1. you may take itemA (X pts per model)
2. for every 3 models, you may exchange itemB with
   a) itemC (net X pts per model)
3. each model may take 2 of itemD (X pts per model)
4. each model may take one of either
   a) itemE (X pts per model)
   b) itemF (X pts per model)
   c) itemG (X pts per model)
5. each model may take either
   a) itemH (X points per model)
   b) itemI and itemJ (X points per model)
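So the top-level options are just the comma-separated pieces of the string:
>>> "itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ".split(', ')
['itemA', 'itemB/itemC-3', '2*itemD', 'itemE/itemF/itemG', 'itemH/itemI+itemJ']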
At the moment I am processing the string using lots of splits and if statements, which makes it very hard to keep track of everything and to assign things correctly once the user inputs their choice.
for index, option in enumerate(self.options):
    output = "{}.".format(index+1)
    if '-' in option:
        sub_option, no_models = option.split('-')
        no_models = int(no_models)
        print(sub_option)
        print(no_models)
        output += "For every {} models ".format(no_models)
        if '/' in sub_option:
            temp_str, temp_options, points_list = exchange_option(sub_option)
        else:
            temp_str, temp_options, points_list = standard_option(sub_option)
        index_points.append(points_list)
        temp_options.append(no_models)
        index_options.append(temp_options)
    else:
        if '/' in option:
            temp_str, temp_options, points_list = exchange_option(option)
        else:
            temp_str, temp_options, points_list = standard_option(option)
        index_points.append(points_list)
        index_options.append(temp_options)
    output += temp_str
The *_option() functions are additional helper functions I have defined above; they have a similar structure, with further if statements inside them.
The main question I am asking is: is there an easier way to process a code-like string such as this? While it works to produce the output in the example above, it seems awfully cumbersome to then deal with the user input.
What I am aiming to do is first output the string as given in my example at the top of the question, and then, taking the user's input index for the given option, modify the associated unit class to have the correct wargear and points value.
I thought about trying to make some kind of options class, but again, labelling and defining each option so that they can interact with one another properly seems equally complex, and I feel there must be something more Pythonic, or just generally better coding practice, for processing encoded strings such as this.

So, here's a full-blown parser to do that! For now, it only outputs the list as in the previous version of your question, but it shouldn't be too hard to add more features as you want. Also note that at the moment the lexer does not error out when a string contains invalid tokens, but this is just a proof of concept, so that should be fine.
Part I: the lexer
This tokenises the input string - looks through it from left to right and attempts to classify non-overlapping substrings as instances of tokens. It's to be used before parsing. When given a string, Lexer.tokenize yields a stream of Tokens.
# FILE: lex.py

import re
import enum


class Token:
    def __init__(self, type, value: str, lineno: int, pos: int):
        self.type, self.value, self.lineno, self.pos = type, value, lineno, pos

    def __str__(self):
        v = f'({self.value!r})' if self.value else ''
        return f'{self.type.name}{v} at {self.lineno}:{self.pos}'

    __repr__ = __str__


class Lexer:
    def __init__(self, token_types: enum.Enum, tokens_regexes: dict):
        self.token_types = token_types
        regex = '|'.join(map('(?P<{}>{})'.format, *zip(*((tok.name, regex) for tok, regex in tokens_regexes.items()))))
        self.regex = re.compile(regex)

    def tokenize(self, string, skip=['space']):
        # TODO: detect invalid input
        lineno, pos = 0, 0
        skip = set(map(self.token_types.__getitem__, skip))
        for matchobj in self.regex.finditer(string):
            type_name = matchobj.lastgroup
            value = matchobj.groupdict()[type_name]
            Type = self.token_types[type_name]
            if Type == self.token_types.newline:  # possibly buggy, but not catastrophic
                lineno += 1
                pos = 0
                continue
            pos = matchobj.end()
            if Type not in skip:
                yield Token(Type, value, lineno, pos)
        yield Token(self.token_types.EOF, '', lineno, pos)
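As a quick sanity check (not one of the files above, and using the token_types and terminals_regexes defined in Part III below), tokenizing a single option looks like this:
lexer = Lexer(token_types, terminals_regexes)
for token in lexer.tokenize('itemB/itemC-3'):
    print(token)
# item('itemB') at 0:5
# slash('/') at 0:6
# item('itemC') at 0:11
# minus('-') at 0:12
# num('3') at 0:13
# EOF at 0:13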
Part II: the parser (with syntax-driven evaluation):
This parses the given stream of tokens provided by lex.Lexer.tokenize and translates individual symbols to English according to the following grammar:
Opt_list -> Option Opt_list_
Opt_list_ -> comma Option Opt_list_ | empty
Option -> Choice | Mult
Choice -> Compound More_choices Exchange
Compound -> item Add_item
Add_item -> plus item Add_item | empty
More_choices -> slash Compound More_choices | empty
Exchange -> minus num | empty
Mult -> num star Compound
The uppercase symbols are nonterminals, the lowercase ones are terminals. There's also a special symbol EOF that's not present here.
If you work out this grammar's properties (its FIRST and FOLLOW sets), you'll find it is LL(1), so we can use an LL(1) recursive descent predictive parser, as shown below.
If you modify the grammar, you should modify the parser accordingly! The methods that do the actual parsing are called parse_<something>, and to change the output of the parser (the Parser.parse function, actually) you should change the return values of these parse_<something> functions.
# FILE: parse.py

import lex


class Parser:
    def __init__(self, lexer):
        self.string, self.tokens = None, None
        self.lexer = lexer
        self.t = self.lexer.token_types
        self.__lookahead = None

    @property
    def lookahead(self):
        if not self.__lookahead:
            try:
                self.__lookahead = next(self.tokens)
            except StopIteration:
                self.__lookahead = lex.Token(self.t.EOF, '', 0, -1)
        return self.__lookahead

    def next(self):
        if self.__lookahead and self.__lookahead.type == self.t.EOF:
            return self.__lookahead
        self.__lookahead = None
        return self.lookahead

    def match(self, token_type):
        if self.lookahead.type == token_type:
            return self.next()
        raise SyntaxError(f'Expected {token_type}, got {self.lookahead.type}',
                          ('<string>', self.lookahead.lineno, self.lookahead.pos, self.string))

    # THE PARSING STARTS HERE

    def parse(self, string):
        # setup
        self.string = string
        self.tokens = self.lexer.tokenize(string)
        self.__lookahead = None
        self.next()
        # do parsing
        ret = [''] + self.parse_opt_list()
        return ' '.join(ret)

    def parse_opt_list(self) -> list:
        ret = self.parse_option(1)
        ret.extend(self.parse_opt_list_(1))
        return ret

    def parse_opt_list_(self, curr_opt_number) -> list:
        if self.lookahead.type in {self.t.EOF}:
            return []
        self.match(self.t.comma)
        ret = self.parse_option(curr_opt_number + 1)
        ret.extend(self.parse_opt_list_(curr_opt_number + 1))
        return ret

    def parse_option(self, opt_number) -> list:
        ret = [f'{opt_number}.']
        if self.lookahead.type == self.t.item:
            ret.extend(self.parse_choice())
        elif self.lookahead.type == self.t.num:
            ret.extend(self.parse_mult())
        else:
            raise SyntaxError(f'Expected item or num, got {self.lookahead.type}',
                              ('<string>', self.lookahead.lineno, self.lookahead.pos, self.string))
        ret[-1] += '\n'
        return ret

    def parse_choice(self) -> list:
        c = self.parse_compound()
        m = self.parse_more_choices()
        e = self.parse_exchange()
        if not m:
            if not e:
                ret = f'You may take {" ".join(c)}'
            else:
                ret = f'for every {e} models you may take item {" ".join(c)}'
        elif m:
            c.extend(m)
            if not e:
                ret = f'each model may take one of: {", ".join(c)}'
            else:
                ret = f'for every {e} models you may exchange the following items with each other: {", ".join(c)}'
        else:
            ret = 'Semantic error!'
        return [ret]

    def parse_compound(self) -> list:
        ret = [self.lookahead.value]
        self.match(self.t.item)
        _ret = self.parse_add_item()
        return [' '.join(ret + _ret)]

    def parse_add_item(self) -> list:
        if self.lookahead.type in {self.t.comma, self.t.minus, self.t.slash, self.t.EOF}:
            return []
        ret = ['with']
        self.match(self.t.plus)
        ret.append(self.lookahead.value)
        self.match(self.t.item)
        return ret + self.parse_add_item()

    def parse_more_choices(self) -> list:
        if self.lookahead.type in {self.t.comma, self.t.minus, self.t.EOF}:
            return []
        self.match(self.t.slash)
        ret = self.parse_compound()
        return ret + self.parse_more_choices()

    def parse_exchange(self) -> str:
        if self.lookahead.type in {self.t.comma, self.t.EOF}:
            return ''
        self.match(self.t.minus)
        ret = self.lookahead.value
        self.match(self.t.num)
        return ret

    def parse_mult(self) -> list:
        ret = [f'each model may take {self.lookahead.value} of:']
        self.match(self.t.num)
        self.match(self.t.star)
        return ret + self.parse_compound()
Part III: usage
Here's how to use all of that code:
# FILE: evaluate.py

import enum
from lex import Lexer
from parse import Parser

# these are all the types of tokens present in our grammar
token_types = enum.Enum('Types', 'item num plus minus star slash comma space newline empty EOF')
t = token_types

# these are the regexes that the lexer uses to recognise the tokens
terminals_regexes = {
    t.item: r'[a-zA-Z_]\w*',
    t.num: '0|[1-9][0-9]*',
    t.plus: r'\+',
    t.minus: '-',
    t.star: r'\*',
    t.slash: '/',
    t.comma: ',',
    t.space: r'[ \t]',
    t.newline: r'\n'
}

lexer = Lexer(token_types, terminals_regexes)
parser = Parser(lexer)

string = 'itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ'
print(f'STRING FROM THE QUESTION: {string!r}\nRESULT:')
print(parser.parse(string), '\n\n')

string = input('Enter a command: ')
while string and string.lower() not in {'q', 'quit', 'e', 'exit'}:
    try:
        print(parser.parse(string))
    except SyntaxError as e:
        print(f' Syntax error: {e}\n {e.text}\n' + ' ' * (4 + e.offset - 1) + '^\n')
    string = input('Enter a command: ')
Example session:
# python3 evaluate.py
STRING FROM THE QUESTION: 'itemA, itemB/itemC-3, 2*itemD, itemE/itemF/itemG, itemH/itemI+itemJ'
RESULT:
1. You may take itemA
2. for every 3 models you may exchange the following items with each other: itemB, itemC
3. each model may take 2 of: itemD
4. each model may take one of: itemE, itemF, itemG
5. each model may take one of: itemH, itemI with itemJ
Enter a command: itemA/b/c/stuff
1. each model may take one of: itemA, b, c, stuff
Enter a command: 4 * anything
1. each model may take 4 of: anything
Enter a command: 5 * anything + more
1. each model may take 5 of: anything with more
Enter a command: a + b + c+ d
1. You may take a with b with c with d
Enter a command: a+b/c
1. each model may take one of: a with b, c
Enter a command: itemA/itemB-2
1. for every 2 models you may exchange the following items with each other: itemA, itemB
Enter a command: itemA+itemB/itemC - 5
1. for every 5 models you may exchange the following items with each other: itemA with itemB, itemC
Enter a command: q

Related

How does the python3 <python-main-file.py> --export my_image.eps command work?

I'm working on a school project around l-systems. I have a main.py that can convert an l-system to a drawing using the Python turtle module.
# Import libraries
import json
import turtle as tur
from datetime import datetime


# Getting input on what json file to use
def getFile():
    """
    Input: None
    ----------
    Output: Return file that is given by the user of the program
    """
    file = input("What is the name of the json file with the inputs?: ")
    f = open(file)  # Load in input file
    return f


# Get information of json file in dictionary
def getData(f):
    """
    Input: File
    ----------
    Output: Return a dictionary of the content of the given file
    """
    data = json.load(f)
    f.close()  # Close file after getting information in data
    return data
# Making variables in python with information from file
def getVariables(data):
    """
    Input: Dictionary with needed variables for an l-system
    ----------
    Output: Return the needed variables for an l-system
    """
    axiom = data['axiom']
    rules = data['rule']
    variables = data['variables']
    constants = data['constants']
    trans = data['trans']
    return axiom, rules, variables, constants, trans


# Apply the logic of an l-system
def lSystem(axiom, rules, iter):
    """
    Input: Axiom = string, rules = dictionary, iterations (iter) = int
    ----------
    Output: The resulting string after performing the l-system with the inputs
    """
    new_string = ''
    old_string = axiom  # Start with the axiom
    for _ in range(iter):
        for string in old_string:
            if string in rules:
                new_string += rules[string]  # Add the rule in the new_string
            else:
                new_string += string  # When the el of the string is not in rules, just add the el in the new_string
        old_string = new_string
        new_string = ''
    return old_string
# Change the color of the drawing
def setColor(color, t):
    """
    Input: Color = string (with name of a color), t = a python turtle
    ----------
    Output: Python turtle with the given color
    """
    t.pencolor(color)
    return t


# Draw the given string by using the translations
def draw(string, trans):
    """
    Input: String (the end result of an l-system), translations (trans) = dictionary (the trans of the used l-system)
    ----------
    Output: No return, will draw the given string by the given translations
    """
    screen = tur.getscreen()  # Make new screen where the drawing will come
    t = tur.Turtle()  # Initialize the turtle and give it the name "t"
    t.hideturtle()
    t.setheading(90)  # Set starting position of turtle heading up
    stack = []  # Stack will be used to push and pop between positions
    for symbol in string:
        if symbol in trans:  # Check if the el can get translated
            para = trans[symbol][1]  # Para is the parameter that will be put in the used function
            if "draw" == trans[symbol][0]:
                t.forward(para)  # Draw line with length para
            elif "angle" == trans[symbol][0]:
                t.left(para)  # Rotate to the left with "para" degrees
            elif "forward" == trans[symbol][0]:
                # Move forward without drawing a line
                t.penup()  # Raising pen
                t.forward(para)  # Moving
                t.pendown()  # Dropping pen, draw again
            elif "nop" == trans[symbol][0]:
                pass
            elif "push" == trans[symbol][0]:
                stack.append((t.pos(), t.heading()))  # Add the current position and heading to stack
            elif "pop" == trans[symbol][0]:
                t.penup()
                t.setpos(stack[len(stack)-1][0])  # Set position and heading to last item in stack
                t.setheading(stack[len(stack)-1][1])
                t.pendown()
                stack.pop(len(stack)-1)  # Remove last item from stack
            elif "color" == trans[symbol][0]:
                setColor(trans[symbol][1], t)
# Check if the input file has correct input data
def checkData(axiom, rules, alph, trans, pos_translations):
    """
    Input: All the variables of an l-system and the translations of the string that our program supports
           axiom = string, rules = dictionary, alph = list, trans = list, pos_translations = list (supported translations)
    ----------
    Output: Return False if the data is not the correct data to perform an l-system
    """
    if axiom == '':  # We need an axiom to start the program
        return False
    if not isinstance(axiom, str):  # Check if axiom is a string
        return False
    for symbol in axiom:  # Check if every symbol in axiom is in the alphabet
        if not symbol in alph:
            return False
    if not isinstance(rules, dict):  # Check if rules is a dictionary
        return False
    for rule in rules:
        if not rule in alph:  # Rules is a dictionary, rule is a key
            return False
        for symbol in rules[rule]:  # Symbol is every symbol in the value of the given key
            if symbol not in alph:
                return False
        if not isinstance(rule, str):
            return False
        if not isinstance(rules[rule], str):
            return False
    if not isinstance(alph, list):  # Check if the alphabet is a list
        return False
    for symbol in alph:  # Check if every symbol in the alphabet is a string
        if not isinstance(symbol, str):
            return False
    if not isinstance(trans, dict):
        return False
    for tran in trans:  # Trans is a dictionary, tran is a key in trans
        if not tran in alph:
            return False
        if not isinstance(trans[tran], list):  # Trans[tran] is the value of the key
            return False
        if not isinstance(trans[tran][0], str):  # Trans[tran][0] is the first item in the value of the key, the value is a list
            return False
        if trans[tran][0] == "color":  # When the function is color, we need a string not a value
            if not isinstance(trans[tran][1], str):
                return False
        else:
            if not isinstance(trans[tran][1], int):
                return False
        if not trans[tran][0] in pos_translations:  # Check if the translation is supported by the program
            return False


# Check if input file contains needed variables
def checkFile(data):
    """
    Input: List (data of the input file)
    ----------
    Output: Return False if the needed variables are not in the input file
    """
    if not "axiom" in data:
        return False
    if not "rules" in data:
        return False
    if not "variables" in data:
        return False
    if not "constants" in data:
        return False
    if not "trans" in data:
        return False
    return True
# Write system history to file
def addHistory(axiom, rules, alph, trans, iterations, lstring, variables, constants):
    """
    Input: All variables that need to be stored in the history file
    ----------
    Output: Add a line with the variables to history.txt
    """
    f = open("history.txt", "a")
    timestamp = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    f.write(timestamp + "\t" + str(variables) + "\t" + str(constants) + "\t" + axiom + "\t" + str(rules) + "\t" + str(trans) + "\t" + str(iterations) + "\t" + str(lstring) + "\n")


# Make alphabet variable
def makeAlph(variables, constants):
    """
    Input: Variables = list, constants = list
    ----------
    Output: List (The alphabet of the l-system = variables + constants)
    """
    alph = variables + constants
    return alph


# Main function
def main():
    file = getFile()
    # Get input on how many iterations the user wants to perform
    iter = int(input("How many iterations of the given l-system do you want to perform?: "))
    # Used_functions is a list of the possible translations that refer to a function
    # When you add more possible functions, add the translations of the function in the used_functions below
    used_functions = ("draw", "angle", "forward", "nop", "push", "pop", "color")
    data = getData(file)
    if checkFile(data) == False:
        print("The given input file does not contain the needed variables.")
        return
    axiom, rules, variables, constants, trans = getVariables(data)
    alph = makeAlph(variables, constants)
    if checkData(axiom, rules, alph, trans, used_functions) == False:
        print("The given variables in the input file are not correct.")
        return
    lstring = lSystem(axiom, rules, iter)
    addHistory(axiom, rules, alph, trans, iter, lstring, variables, constants)
    print(lstring)
    draw(lstring, trans)
    tur.Screen().exitonclick()  # Keep the drawing open unless you click on exit button


main()
This code can also be found on https://github.com/crearth/l-system
To clear things up, we are working on Linux. We need to make it possible for the user to use the command
python3 <python-main-file.py> --export my_image.eps
in the terminal. This command exports the drawing that was made to a file called my_image.eps.
I have been looking for hours and cannot find a way to make this command possible. Can someone help me out with this? For example input files, you can look at https://github.com/crearth/l-system/tree/main/tests
For more questions feel free to contact me.
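Not part of the original post, just a rough sketch of one common approach: read the flag with argparse, and after draw() has run (but before exitonclick()) dump the Tk canvas behind the turtle screen to PostScript. The flag handling and helper names here are assumptions:
import argparse
import turtle as tur

def get_export_path():
    # hypothetical helper: pick up an optional --export argument from the command line
    arg_parser = argparse.ArgumentParser()
    arg_parser.add_argument("--export", help="write the finished drawing to this .eps file")
    return arg_parser.parse_args().export

def export_drawing(path):
    canvas = tur.getscreen().getcanvas()  # the Tk canvas the turtle draws on
    canvas.postscript(file=path)          # Tk canvases can write PostScript/EPS

# in main(), after draw(lstring, trans) and before exitonclick():
#     path = get_export_path()
#     if path:
#         export_drawing(path)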

How to parse a BinaryTree from a string?

I want to code a BinaryTree parser. I don't know how to solve this problem. I've tried using regular expressions recursively but I can't find good resources. My goal is:
BinaryTree.from_string("('a') 'b' ('c')") --> BinaryTree("a", "b", "c")
BinaryTree.from_string("") --> None
BinaryTree.from_string("() ()") --> BinaryTree(None, None, None)
BinaryTree.from_string("((1) 2 (3)) 4 (5)") --> BinaryTree(BinaryTree(1, 2, 3), 4, 5)
Here is some source code:
class BinaryTree:
    def __init__(self, left=None, name=None, right=None):
        self.left = left
        self.name = name
        self.right = right

    def __str__(self):
        return f"({self.left}) {self.name} ({self.right})"

    def __repr__(self):
        return f"BinaryTree({repr(self.left)}, {repr(self.name)}, {repr(self.right)})"

    def __len__(self):
        if self.name is not None:
            output = 1
        else:
            output = 0
        if self.left is not None:
            output += len(self.left)
        if self.right is not None:
            output += len(self.right)
        return output

    @staticmethod
    def from_string(string):
        # "(x) y (z)" --> BinaryTree("x", "y", "z")
        # "((a) b (c)) y (z)" --> BinaryTree(BinaryTree("a", "b", "c"), "y", "z")
        # "" --> None
        # "() ()" --> BinaryTree("", "", "")
        pass
First, I believe that you need to drop the idea of regular expressions and concentrate on simply matching the parentheses. You have a very simple expression grammar here. Rather than reproducing such a well-traveled exercise, I will simply direct you to research how to parse a binary tree expression with parentheses.
The basic expression is
left root right
where each of left and right is either
a sub-tree (first char is a left parenthesis)
a leaf-node label (first char is something else)
null (white space)
Note that you have some ambiguities. For instance, given a b, is the resulting tree (a, b, None), (None, a, b), or an error?
In any case, if you focus on simple string processing, you should be able to do this without external packages:
Find the first left parenthesis and its matching right.
In the string after that, look again for a left-right pair.
If there's anything before that first left-paren, then it must be a leaf node for the left and the node label for the root.
Either way, there must be a root node in the middle (unless this is a degenerate tree).
Recur on each of the paren matches you made.
Can you take it from there?
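To make those steps concrete, here is a rough sketch (not a full solution; it assumes, as in the question's examples, that the left and right parts are always parenthesised, and uses ast.literal_eval to turn leaf labels such as "'a'" or "1" into values):
import ast

def parse_tree(s):
    s = s.strip()
    if not s:
        return None                     # empty -> no tree
    if s[0] != "(":
        try:
            return ast.literal_eval(s)  # a bare leaf label such as "'a'" or "1"
        except (ValueError, SyntaxError):
            return s
    # find the right parenthesis matching the first left one
    depth = 0
    for i, ch in enumerate(s):
        depth += (ch == "(") - (ch == ")")
        if depth == 0:
            break
    left_src, rest = s[1:i], s[i + 1:]
    # the root label sits between the two parenthesised parts
    j = rest.index("(")
    name = parse_tree(rest[:j])
    right_src = rest[j + 1:rest.rindex(")")]
    # recurse on each side
    return BinaryTree(parse_tree(left_src), name, parse_tree(right_src))

# parse_tree("((1) 2 (3)) 4 (5)")  ->  BinaryTree(BinaryTree(1, 2, 3), 4, 5)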
You can use regular expressions, but only to tokenize the input, i.e. they should match all parentheses as individual matches, and match quoted or non-quoted literals. Some care has to be taken to support backslash escaping inside quoted substrings.
For converting quoted strings and numbers to the corresponding Python data-typed values, you can use ast.literal_eval. Of course, if the input format had been a valid Python expression (using comma separators, etc.), you could leave the parsing completely to ast.literal_eval. But as this is not the case, you'll have to tokenize the input and iterate over the tokens.
So import these:
import re
import ast
And then:
@staticmethod
def from_string(string):
    tokens = re.finditer(r"[()]|'(?:\\.|[^'])*'|[^\s()]+|$", string)

    def format(token):
        return f"'{token}'" if token else "end of input"

    def take(end, expect=None, forbid=None):
        token = next(tokens).group(0)
        if expect is not None and token != expect:
            raise ValueError("Expected {}, got {}".format(format(expect), format(token)))
        if end != "" and token == "" or token == forbid:
            raise ValueError("Unexpected {}".format(format(token)))
        if token not in ("(", ")", ""):
            token = ast.literal_eval(token)
        return token

    def recur(end=")"):
        token = take(end)
        if token == end:  # it is an empty leaf
            return None
        if token != "(":  # it is a leaf
            take(end, end)
            return token
        # It is a (left)-name-(right) sequence:
        left = recur()
        name = None
        token = take(end, None, end)
        if token != "(":
            name = token
            take(end, "(")
        right = recur()
        take(end, end)
        return BinaryTree(left, name, right)

    return recur("")

Efficient partial search of a trie in python

This is a hackerrank exercise, and although the problem itself is solved, my solution is apparently not efficient enough, so on most test cases I'm getting timeouts. Here's the problem:
We're going to make our own Contacts application! The application must perform two types of operations:
add name, where name is a string denoting a contact name. This must store as a new contact in the application.
find partial, where partial is a string denoting a partial name to search the application for. It must count the number of contacts starting with partial and print the count on a new line.
Given n sequential add and find operations, perform each operation in order.
I'm using Tries to make it work, here's the code:
import re

def add_contact(dictionary, contact):
    _end = '_end_'
    current_dict = dictionary
    for letter in contact:
        current_dict = current_dict.setdefault(letter, {})
    current_dict[_end] = _end
    return(dictionary)

def find_contact(dictionary, contact):
    p = re.compile('_end_')
    current_dict = dictionary
    for letter in contact:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return(0)
    count = int(len(p.findall(str(current_dict))) / 2)
    re.purge()
    return(count)

n = int(input().strip())
contacts = {}
for a0 in range(n):
    op, contact = input().strip().split(' ')
    if op == "add":
        contacts = add_contact(contacts, contact)
    if op == "find":
        print(find_contact(contacts, contact))
Because the problem requires not just returning whether partial is a match, but counting all of the entries that match it, I couldn't find any other way but to cast the nested dictionaries to a string and then count all of the _end_s, which I'm using to denote stored strings. This, it would seem, is the culprit, but I cannot find any better way to do the searching. How do I make this work faster? Thanks in advance.
UPD:
I have added a results counter that actually parses the tree, but the code is still too slow for the online checker. Any thoughts?
def find_contact(dictionary, contact):
    current_dict = dictionary
    count = 0
    for letter in contact:
        if letter in current_dict:
            current_dict = current_dict[letter]
        else:
            return(0)
    else:
        return(words_counter(count, current_dict))

def words_counter(count, node):
    live_count = count
    live_node = node
    for value in live_node.values():
        if value == '_end_':
            live_count += 1
        if type(value) == type(dict()):
            live_count = words_counter(live_count, value)
    return(live_count)
Ok, so, as it turns out, using nested dicts is not a good idea in general, because hackerrank will shove 100k strings into your program and then everything will slow to a crawl. So the problem wasn't in the parsing, it was in the storing before the parsing. Eventually I found a blog post whose solution passes the challenge 100%. Here's the code in full:
class Node:
    def __init__(self):
        self.count = 1
        self.children = {}

trie = Node()

def add(node, name):
    for letter in name:
        sub = node.children.get(letter)
        if sub:
            sub.count += 1
        else:
            sub = node.children[letter] = Node()
        node = sub

def find(node, data):
    for letter in data:
        sub = node.children.get(letter)
        if not sub:
            return 0
        node = sub
    return node.count

if __name__ == '__main__':
    n = int(input().strip())
    for _ in range(n):
        op, param = input().split()
        if op == 'add':
            add(trie, param)
        else:
            print(find(trie, param))
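For example, each node's count records how many added names pass through it, so a find is just one walk down the prefix:
trie = Node()
add(trie, "hack")
add(trie, "hackerrank")
print(find(trie, "hac"))   # 2 - both names share the 'h' -> 'a' -> 'c' path
print(find(trie, "hak"))   # 0 - there is no 'k' child under 'a'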

Splitting the elements of a list into a list and then splitting them again

This is a sample of the raw text I'm reading:
ID: 00000001
SENT: to do something
to 01573831
do 02017283
something 03517283
ID: 00000002
SENT: just an example
just 06482823
an 01298744
example 01724894
Right now I'm trying to split it into a list of lists of lists.
Topmost level: by the ID, so 2 elements here (done)
Next level: within each ID, split by newlines
Last level: within each line, split the word and its ID; for the lines beginning with ID or SENT, it doesn't matter if they are split or not. Between the word and its ID is a tab (\t)
Current code:
f=open("text.txt","r")
raw=list(f)
text=" ".join(raw)
wordlist=text.split("\n \n ") #split by ID
toplist=wordlist[:2] #just take 2 IDs
Edit:
I was going to cross-reference the words to another text file to add their word classes, which is why I asked for a list of lists of lists.
Steps (a rough sketch follows the expected output below):
1) Use .append() to add on word classes for each word
2) Use "\t".join() to connect a line together
3) Use "\n".join() to connect different lines in an ID
4) "\n\n".join() to connect all the IDs together into a string
Output:
ID: 00000001
SENT: to do something
to 01573831 prep
do 02017283 verb
something 03517283 noun
ID: 00000002
SENT: just an example
just 06482823 adverb
an 01298744 ind-art
example 01724894 noun
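Something like this is what I have in mind for steps 1-4, where parsed is the list of lists of lists described above (e.g. [['ID:', '00000001'], ['SENT:', 'to do something'], ['to', '01573831'], ...] per ID) and word_classes stands in for the lookup from the other text file:
word_classes = {"01573831": "prep", "02017283": "verb", "03517283": "noun",
                "06482823": "adverb", "01298744": "ind-art", "01724894": "noun"}  # stand-in lookup
for block in parsed:
    for line in block[2:]:                  # step 1: append the class to each word line
        line.append(word_classes[line[1]])
result = "\n\n".join(                       # step 4: join the IDs
    "\n".join(                              # step 3: join the lines within one ID
        " ".join(line) if line[0].endswith(":") else "\t".join(line)  # step 2
        for line in block
    )
    for block in parsed
)
print(result)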
A more pythonic version of Thorsten's answer:
from collections import namedtuple

class Element(namedtuple("ElementBase", "id sent words")):
    @classmethod
    def parse(cls, source):
        lines = source.split("\n")
        return cls(
            id=lines[0][4:],
            sent=lines[1][6:],
            words=dict(
                line.split("\t") for line in lines[2:]
            )
        )

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element.parse(part) for part in text.split("\n\n")]
for el in elements:
    print el
    print el.id
    print el.sent
    print el.words
    print
I'd regard every part of the topmost split as an "object". Thus, I'd create a class with properties corresponding to each part.
class Element(object):
    def __init__(self, source):
        lines = source.split("\n")
        self._id = lines[0][4:]
        self._sent = lines[1][6:]
        self._words = {}
        for line in lines[2:]:
            word, id_ = line.split("\t")
            self._words[word] = id_

    @property
    def ID(self):
        return self._id

    @property
    def sent(self):
        return self._sent

    @property
    def words(self):
        return self._words

    def __str__(self):
        return "Element %s, containing %i words" % (self._id, len(self._words))

text = """ID: 00000001
SENT: to do something
to\t01573831
do\t02017283
something\t03517283

ID: 00000002
SENT: just an example
just\t06482823
an\t01298744
example\t01724894"""

elements = [Element(part) for part in text.split("\n\n")]
for el in elements:
    print el
    print el.ID
    print el.sent
    print el.words
    print
In the main code (one line, the list comprehension) the text is only split at each double new-line. Then, all logic is deferred into the __init__ method, making it very local.
Using a class also gives you the benefit of __str__, allowing you control over how your objects are printed.
You could also consider rewriting the last three lines of __init__ to:
self._words = dict([line.split("\t") for line in lines[2:]])
but I wrote a plain loop as it seemed to be easier to understand.
I'm not sure exactly what output you need but you can adjust this to fit your needs (This uses the itertools grouper recipe):
>>> from itertools import izip_longest
>>> def grouper(n, iterable, fillvalue=None):
        "Collect data into fixed-length chunks or blocks"
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        args = [iter(iterable)] * n
        return izip_longest(fillvalue=fillvalue, *args)

>>> with open('text.txt') as f:
        print [[x.rstrip().split(None, 1) for x in g if x.rstrip()]
               for g in grouper(6, f, fillvalue='')]
[[['ID:', '00000001'], ['SENT:', 'to do something'], ['to', '01573831'], ['do', '02017283'], ['something', '03517283']],
[['ID:', '00000002'], ['SENT:', 'just an example'], ['just', '06482823'], ['an', '01298744'], ['example', '01724894']]]
Would this work for you?
Top level (which you have done):
def get_parent(text, parent):
    """recursively walk through text, looking for 'ID' tag"""
    # find open_ID and close_ID
    open_ID = text.find('ID')
    close_ID = text.find('ID', open_ID + 1)
    # if there is another instance of 'ID', recursively walk again
    if close_ID != -1:
        parent.append(text[open_ID : close_ID])
        return get_parent(text[close_ID:], parent)
    # base-case
    else:
        parent.append(text[open_ID:])
        return
Second level: split by newlines:
def child_split(parent):
    index = 0
    while index < len(parent):
        parent[index] = parent[index].split('\n')
        index += 1
Third level: split the 'ID' and 'SENT' fields:
def split_field(parent, index):
    if index < len(parent):
        child = 0
        while child < len(parent[index]):
            if ':' in parent[index][child]:
                parent[index][child] = parent[index][child].split(':')
            else:
                parent[index][child] = parent[index][child].split()
            child += 1
        return split_field(parent, index + 1)
    else:
        return
Running it all together:
def main(text):
    parent = []
    get_parent(text, parent)
    child_split(parent)
    split_field(parent, 0)
The result is quite nested, perhaps it can be cleaned up somewhat? Or perhaps the split_field() function could return a dictionary?

Python string assignment issue!

So I'm fairly new to Python, but I have absolutely no idea why this string oldUser is changing to the current user after I make the parse call. Any help would be greatly appreciated.
while a < 20:
    f = urllib.urlopen("SITE")
    a = a+1
    for i, line in enumerate(f):
        if i == 187:
            print line
            myparser.parse(line)
            if fCheck == 1:
                result = oldUser[0] is oldUser[1]
                print oldUser[0]
                print oldUser[1]
            else:
                result = user is oldUser
                fCheck = 1
            print result
            user = myparser.get_descriptions(firstCheck)
            firstCheck = 1
            print user
            if result:
                print "SAME"
                array[index+1] = array[index+1] + 0
            else:
                oldUser = user
        elif i > 200:
            break
    myparser.reset()
I don't understand why result doesn't work either... I print out both values and when they're the same it's telling me they're not equal... Also, why does myparser.parse(line) turn oldUser into a size 2 array? Thanks!
Here's the definition for myparser:
import sgmllib

class MyParser(sgmllib.SGMLParser):
    "A simple parser class."

    def parse(self, s):
        "Parse the given string 's'."
        self.feed(s)
        self.close()

    def __init__(self, verbose=0):
        "Initialise an object, passing 'verbose' to the superclass."
        sgmllib.SGMLParser.__init__(self, verbose)
        self.divs = []
        self.descriptions = []
        self.inside_div_element = 0

    def start_div(self, attributes):
        "Process a hyperlink and its 'attributes'."
        for name, value in attributes:
            if name == "id":
                self.divs.append(value)
                self.inside_div_element = 1

    def end_div(self):
        "Record the end of a hyperlink."
        self.inside_div_element = 0

    def handle_data(self, data):
        "Handle the textual 'data'."
        if self.inside_div_element:
            self.descriptions.append(data)

    def get_div(self):
        "Return the list of hyperlinks."
        return self.divs

    def get_descriptions(self, check):
        "Return a list of descriptions."
        if check == 1:
            self.descriptions.pop(0)
        return self.descriptions
Don’t compare strings with is. That checks if they’re the same object, not two copies of the same string. See:
>>> string = raw_input()
hello
>>> string is 'hello'
False
>>> string == 'hello'
True
Also, the definition of myparser would be useful.
I'm not quite sure what your code is doing, but I suspect you want to use == instead of is. Using is compares object identity, which is not the same as string equality. Two different string objects may contain the same sequence of characters.
result = oldUser[0] == oldUser[1]
If you're curious, for more information on the behaviour of the is operator see Python “is” operator behaves unexpectedly with integers.
