Parse a colon separated string with pyparsing - python

This is the data:
C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK
And I would like to get this result
[C:/data/my_file.txt.c, 10, 0x21, name1, name2, 0x10, 1, OK]
[C:/data/my_file2.txt.c, 110, 0x1, name2, name5, 0x12, 1, NOT_OK]
[./data/my_file3.txt.c, 110, 0x1, name2, name5, 0x12, 10, OK]
I know how to do that with some code or string split and stuff like that, but I am searching for a nice solution using pyparsing. My problem is the :/ for the file path.
Additional Question I use some code to strip comments and other stuff from the records so the raw data looks like this:
text = """C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
// comment
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK
----
ok
"""
And i strip the "//", "ok", and "---" before parsing right now
So now I have a next question too the first:
Some addition to the first question. Till now I extracted the lines above from a data file - that works great. So I read the file line by line and parse it. But now I found out it is possible to use parseFile to parse a whole file. So I think I could strip some of my code and use parseFile instead. So the files I would like to parse have an additional footer.
C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK: info message
-----------------------
3 Files 2 OK 1 NOT_OK
NOT_OK
Is it possible to change the parser to get 2 parse results?
Result1:
[['C:/data/my_file.txt.c', '10', '0x21', 'name1', 'name2', '0x10', '1', 'OK'],
['C:/data/my_file2.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '1', 'NOT_OK'],
['./data/my_file3.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '10', 'OK']]
Ignore the blank line
Ignore this line => -----------------------
Result 2:
[['3', 'Files', 2', 'OK’, '1', 'NOT_OK'],
['NOT_OK’],
So I changed the thes Code for that:
# define an expression for your file reference
one_thing = Combine(
oneOf(list(alphas)) + ':/' +
Word(alphanums + '_-./'))
# define a catchall expression for everything else (words of non-whitespace characters,
# excluding ':')
another_thing = Word(printables + " ", excludeChars=':')
# define an expression of the two; be sure to list the file reference first
thing = one_thing | another_thing
# now use plain old pyparsing delimitedList, with ':' delimiter
list_of_things = delimitedList(thing, delim=':')
list_of_other_things = Word(printables).setName('a')
# run it and see...
parse_ret = OneOrMore(Group(list_of_things | list_of_other_things)).parseFile("data.file")
parse_ret.pprint()
And I get this result:
[['C:/data/my_file.txt.c', '10', '0x21', 'name1', 'name2', '0x10', '1', 'OK'],
['C:/data/my_file2.txt.c','110', '0x1', 'name2', 'name5', '0x12', '1', 'NOT_OK'],
['./data/my_file3.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '10', 'OK', 'info message'],
['-----------------------'],
['3 Files 2 OK 1 NOT_OK'],
['NOT_OK']]
So I can go with this but is it possible to split the result into two named results? I searched the docs but I didn´t find anything that works.

See embedded comments for pyparsing description:
from pyparsing import *
text = """C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
// blah-de blah blah blah
./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK"""
# define an expression for your file reference
one_thing = Combine(
oneOf(list(alphas.upper())) + ':/' +
Word(alphanums + '_-./'))
# define a catchall expression for everything else (words of non-whitespace characters,
# excluding ':')
another_thing = Word(printables, excludeChars=':')
# define an expression of the two; be sure to list the file reference first
thing = one_thing | another_thing
# now use plain old pyparsing delimitedList, with ':' delimiter
list_of_things = delimitedList(thing, delim=':')
parser = OneOrMore(Group(list_of_things))
# ignore comments starting with double slash
parser.ignore(dblSlashComment)
# run it and see...
parser.parseString(text).pprint()
prints:
[['C:/data/my_file.txt.c', '10', '0x21', 'name1', 'name2', '0x10', '1', 'OK'],
['C:/data/my_file2.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '1', 'NOT_OK'],
['./data/my_file3.txt.c', '110', '0x1', 'name2', 'name5', '0x12', '10', 'OK']]

So I didn´t found a solution with delimitedList and parseFile but I found a Solution which is okay for me.
from pyparsing import *
data = """
C: / data / my_file.txt.c:10:0x21:name1:name2:0x10:1:OK
C: / data / my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK
./ data / my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK: info message
-----------------------
3 Files 2 OK 1 NOT_OK
NOT_OK
"""
if __name__ == '__main__':
# define an expression for your file reference
entry_one = Combine(
oneOf(list(alphas)) + ':/' +
Word(alphanums + '_-./'))
entry_two = Word(printables + ' ', excludeChars=':')
entry = entry_one | entry_two
delimiter = Literal(':').suppress()
tc_result_line = Group(entry.setResultsName('file_name') + delimiter + entry.setResultsName(
'line_nr') + delimiter + entry.setResultsName('num_one') + delimiter + entry.setResultsName('name_one') + delimiter + entry.setResultsName(
'name_two') + delimiter + entry.setResultsName('num_two') + delimiter + entry.setResultsName('status') + Optional(
delimiter + entry.setResultsName('msg'))).setResultsName("info_line")
EOL = LineEnd().suppress()
SOL = LineStart().suppress()
blank_line = SOL + EOL
tc_summary_line = Group(Word(nums).setResultsName("num_of_lines") + "Files" + Word(nums).setResultsName(
"num_of_ok") + "OK" + Word(nums).setResultsName("num_of_not_ok") + "NOT_OK").setResultsName(
"info_summary")
tc_end_line = Or(Literal("NOT_OK"), Literal('Ok')).setResultsName("info_result")
# run it and see...
pp1 = tc_result_line | Optional(tc_summary_line | tc_end_line)
pp1.ignore(blank_line | OneOrMore("-"))
result = list()
for l in data.split('\n'):
result.append((pp1.parseString(l)).asDict())
# delete empty results
result = filter(None, result)
for r in result:
print(r)
pass
Result:
{'info_line': {'file_name': 'C', 'num_one': '10', 'msg': '1', 'name_one': '0x21', 'line_nr': '/ data / my_file.txt.c', 'status': '0x10', 'num_two': 'name2', 'name_two': 'name1'}}
{'info_line': {'file_name': 'C', 'num_one': '110', 'msg': '1', 'name_one': '0x1', 'line_nr': '/ data / my_file2.txt.c', 'status': '0x12', 'num_two': 'name5', 'name_two': 'name2'}}
{'info_line': {'file_name': './ data / my_file3.txt.c', 'num_one': '0x1', 'msg': 'OK', 'name_one': 'name2', 'line_nr': '110', 'status': '10', 'num_two': '0x12', 'name_two': 'name5'}}
{'info_summary': {'num_of_lines': '3', 'num_of_ok': '2', 'num_of_not_ok': '1'}}
{'info_result': ['NOT_OK']}

Using re:
myList = ["C:/data/my_file.txt.c:10:0x21:name1:name2:0x10:1:OK", "C:/data/my_file2.txt.c:110:0x1:name2:name5:0x12:1:NOT_OK", "./data/my_file3.txt.c:110:0x1:name2:name5:0x12:10:OK"]
for i in myList:
newTxt = re.sub(r':', ",", i)
newTxt = re.sub(r',/', ":/", newTxt)
print newTxt

Related

Python regarding .split in file

Hello can anyone please help out with this?
this is the content of my txt file
DICT1 Assignment 1 25 100 nothing anyway at all
DICT2 Assignment 2 25 100 nothing at all
DICT3 Assignment 3 50 100 not at all
this is my code
from pathlib import Path
home = str(Path.home())
with open(home + "\\Desktop\\PADS Assignment\\DICT1 Assessment Task.txt", "r") as r:
for line in r:
print(line.strip().split())
my output of the code is
['DICT1', 'Assignment', '1', '25', '100', 'nothing']
['DICT2', 'Assignment', '2', '25', '100', 'nothing', 'at', 'all']
['DICT3', 'Assignment', '3', '50', '100', 'not', 'at', 'all']
Now my question is , how do i make the output to be
['DICT1', 'Assignment 1', '25', '100', 'nothing']
['DICT2', 'Assignment 2', '25', '100', 'nothing at all']
['DICT3', 'Assignment 3', '50', '100', 'not at all']
You could use the maxsplit parameter of the split method
line.split(maxsplit=5)
Of course if the format of the lines in your file is similar and you are using python 3.
For Python 2.x you should use
line.split(' ', 5)
Your main problem here is your input file, the separator in this file is a space but you also have some values with spaces to retrieve.
So you have two choices here:
You either change the input file to be comma separated values, i.e.:
DICT1, Assignment, 1, 25, 100, nothing anyway at all
DICT2, Assignment, 2, 25, 100, nothing at all
DICT3, Assignment, 3, 50, 100, not at all
You change your script to unpack manually the end of lines once you got every other items:
from pathlib import Path
home = str(Path.home())
with open(home + "\\Desktop\\PADS Assignment\\DICT1 Assessment Task.txt", "r") as r:
for line in r:
splittedLine = line.strip().split(" ")
taskId = splittedLine[0]
taskTitle = splittedLine[1]
weight = splittedLine[2]
fullMark = splittedLine[3]
description = " ".join(splittedLine[4:])
print("taskId: " + taskId + " - taskTitle: " + taskTitle + " - weight: " + weight + " -fullMark: " + fullMark + " - description: " + description)

How can I split a string of a mathematical expressions in python?

I made a program which convert infix to postfix in python. The problem is when I introduce the arguments.
If i introduce something like this: (this will be a string)
( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) + ( 56 + ( 95 - 28 ) ) )
it will split it with .split() and the program will work correctly.
But I want the user to be able to introduce something like this:
((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )
As you can see I want that the blank spaces can be trivial but the program continue splitting the string by parentheses, integers (not digits) and operands.
I try to solve it with a for but I don't know how to catch the whole number (73 , 34 ,72) instead one digit by digit (7, 3 , 3 , 4 , 7 , 2)
To sum up, what I want is split a string like ((81 * 6) /42+ (3-1)) into:
[(, (, 81, *, 6, ), /, 42, +, (, 3, -, 1, ), )]
Tree with ast
You could use ast to get a tree of the expression :
import ast
source = '((81 * 6) /42+ (3-1))'
node = ast.parse(source)
def show_children(node, level=0):
if isinstance(node, ast.Num):
print(' ' * level + str(node.n))
else:
print(' ' * level + str(node))
for child in ast.iter_child_nodes(node):
show_children(child, level+1)
show_children(node)
It outputs :
<_ast.Module object at 0x7f56abbc5490>
<_ast.Expr object at 0x7f56abbc5350>
<_ast.BinOp object at 0x7f56abbc5450>
<_ast.BinOp object at 0x7f56abbc5390>
<_ast.BinOp object at 0x7f56abb57cd0>
81
<_ast.Mult object at 0x7f56abbd0dd0>
6
<_ast.Div object at 0x7f56abbd0e50>
42
<_ast.Add object at 0x7f56abbd0cd0>
<_ast.BinOp object at 0x7f56abb57dd0>
3
<_ast.Sub object at 0x7f56abbd0d50>
1
As #user2357112 wrote in the comments : ast.parse interprets Python syntax, not mathematical expressions. (1+2)(3+4) would be parsed as a function call and list comprehensions would be accepted even though they probably shouldn't be considered a valid mathematical expression.
List with a regex
If you want a flat structure, a regex could work :
import re
number_or_symbol = re.compile('(\d+|[^ 0-9])')
print(re.findall(number_or_symbol, source))
# ['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']
It looks for either :
multiple digits
or any character which isn't a digit or a space
Once you have a list of elements, you could check if the syntax is correct, for example with a stack to check if parentheses are matching, or if every element is a known one.
You need to implement a very simple tokenizer for your input. You have the following types of tokens:
(
)
+
-
*
/
\d+
You can find them in your input string separated by all sorts of white space.
So a first step is to process the string from start to finish, and extract these tokens, and then do your parsing on the tokens, rather than on the string itself.
A nifty way to do this is to use the following regular expression: '\s*([()+*/-]|\d+)'. You can then:
import re
the_input='(3+(2*5))'
tokens = []
tokenizer = re.compile(r'\s*([()+*/-]|\d+)')
current_pos = 0
while current_pos < len(the_input):
match = tokenizer.match(the_input, current_pos)
if match is None:
raise Error('Syntax error')
tokens.append(match.group(1))
current_pos = match.end()
print(tokens)
This will print ['(', '3', '+', '(', '2', '*', '5', ')', ')']
You could also use re.findall or re.finditer, but then you'd be skipping non-matches, which are syntax errors in this case.
If you don't want to use re module, you can try this:
s="((81 * 6) /42+ (3-1))"
r=[""]
for i in s.replace(" ",""):
if i.isdigit() and r[-1].isdigit():
r[-1]=r[-1]+i
else:
r.append(i)
print(r[1:])
Output:
['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1', ')', ')']
It actual would be pretty trivial to hand-roll a simple expression tokenizer. And I'd think you'd learn more that way as well.
So for the sake of education and learning, Here is a trivial expression tokenizer implementation which can be extended. It works based upon the "maximal-much" rule. This means it acts "greedy", trying to consume as many characters as it can to construct each token.
Without further ado, here is the tokenizer:
class ExpressionTokenizer:
def __init__(self, expression, operators):
self.buffer = expression
self.pos = 0
self.operators = operators
def _next_token(self):
atom = self._get_atom()
while atom and atom.isspace():
self._skip_whitespace()
atom = self._get_atom()
if atom is None:
return None
elif atom.isdigit():
return self._tokenize_number()
elif atom in self.operators:
return self._tokenize_operator()
else:
raise SyntaxError()
def _skip_whitespace(self):
while self._get_atom():
if self._get_atom().isspace():
self.pos += 1
else:
break
def _tokenize_number(self):
endpos = self.pos + 1
while self._get_atom(endpos) and self._get_atom(endpos).isdigit():
endpos += 1
number = self.buffer[self.pos:endpos]
self.pos = endpos
return number
def _tokenize_operator(self):
operator = self.buffer[self.pos]
self.pos += 1
return operator
def _get_atom(self, pos=None):
pos = pos or self.pos
try:
return self.buffer[pos]
except IndexError:
return None
def tokenize(self):
while True:
token = self._next_token()
if token is None:
break
else:
yield token
Here is a demo the usage:
tokenizer = ExpressionTokenizer('((81 * 6) /42+ (3-1))', {'+', '-', '*', '/', '(', ')'})
for token in tokenizer.tokenize():
print(token)
Which produces the output:
(
(
81
*
6
)
/
42
+
(
3
-
1
)
)
Quick regex answer:
re.findall(r"\d+|[()+\-*\/]", str_in)
Demonstration:
>>> import re
>>> str_in = "((81 * 6) /42+ (3-1))"
>>> re.findall(r"\d+|[()+\-*\/]", str_in)
['(', '(', '81', '*', '6', ')', '/', '42', '+', '(', '3', '-', '1',
')', ')']
For the nested parentheses part, you can use a stack to keep track of the level.
This does not provide quite the result you want but might be of interest to others who view this question. It makes use of the pyparsing library.
# Stolen from http://pyparsing.wikispaces.com/file/view/simpleArith.py/30268305/simpleArith.py
# Copyright 2006, by Paul McGuire
# ... and slightly altered
from pyparsing import *
integer = Word(nums).setParseAction(lambda t:int(t[0]))
variable = Word(alphas,exact=1)
operand = integer | variable
expop = Literal('^')
signop = oneOf('+ -')
multop = oneOf('* /')
plusop = oneOf('+ -')
factop = Literal('!')
expr = operatorPrecedence( operand,
[("!", 1, opAssoc.LEFT),
("^", 2, opAssoc.RIGHT),
(signop, 1, opAssoc.RIGHT),
(multop, 2, opAssoc.LEFT),
(plusop, 2, opAssoc.LEFT),]
)
print (expr.parseString('((81 * 6) /42+ (3-1))'))
Output:
[[[[81, '*', 6], '/', 42], '+', [3, '-', 1]]]
Using grako:
start = expr $;
expr = calc | value;
calc = value operator value;
value = integer | "(" #:expr ")" ;
operator = "+" | "-" | "*" | "/";
integer = /\d+/;
grako transpiles to python.
For this example, the return value looks like this:
['73', '+', ['34', '-', '72', '/', ['33', '-', '3']], '+', ['56', '+', ['95', '-', '28']]]
Normally you'd use the generated semantics class as a template for further processing.
To provide a more verbose regex approach that you could easily extend:
import re
solution = []
pattern = re.compile('([\d\.]+)')
s = '((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )'
for token in re.split(pattern, s):
token = token.strip()
if re.match(pattern, token):
solution.append(float(token))
continue
for character in re.sub(' ', '', token):
solution.append(character)
Which will give you the result:
solution = ['(', '(', 73, '+', '(', '(', 34, '-', 72, ')', '/', '(', 33, '-', 3, ')', ')', ')', '+', '(', 56, '+', '(', 95, '-', 28, ')', ')', ')']
Similar to #McGrady's answer, you can do this with a basic queue implementation.
As a very basic implementation, here's what your Queue class can look like:
class Queue:
EMPTY_QUEUE_ERR_MSG = "Cannot do this operation on an empty queue."
def __init__(self):
self._items = []
def __len__(self) -> int:
return len(self._items)
def is_empty(self) -> bool:
return len(self) == 0
def enqueue(self, item):
self._items.append(item)
def dequeue(self):
try:
return self._items.pop(0)
except IndexError:
raise RuntimeError(Queue.EMPTY_QUEUE_ERR_MSG)
def peek(self):
try:
return self._items[0]
except IndexError:
raise RuntimeError(Queue.EMPTY_QUEUE_ERR_MSG)
Using this simple class, you can implement your parse function as:
def tokenize_with_queue(exp: str) -> List:
queue = Queue()
cum_digit = ""
for c in exp.replace(" ", ""):
if c in ["(", ")", "+", "-", "/", "*"]:
if cum_digit != "":
queue.enqueue(cum_digit)
cum_digit = ""
queue.enqueue(c)
elif c.isdigit():
cum_digit += c
else:
raise ValueError
if cum_digit != "": #one last sweep in case there are any digits waiting
queue.enqueue(cum_digit)
return [queue.dequeue() for i in range(len(queue))]
Testing it like below:
exp = "((73 + ( (34- 72 ) / ( 33 -3) )) + (56 +(95 - 28) ) )"
print(tokenize_with_queue(exp)")
would give you the token list as:
['(', '(', '73', '+', '(', '(', '34', '-', '72', ')', '/', '(', '33', '-', '3', ')', ')', ')', '+', '(', '56', '+', '(', '95', '-', '28', ')', ')', ')']

Failing to append to dictionary. Python

I am experiencing a strange faulty behaviour, where a dictionary is only appended once and I can not add more key value pairs to it.
My code reads in a multi-line string and extracts substrings via split(), to be added to a dictionary. I make use of conditional statements. Strangely only the key:value pairs under the first conditional statement are added.
Therefore I can not complete the dictionary.
How can I solve this issue?
Minimal code:
#I hope the '\n' is sufficient or use '\r\n'
example = "Name: Bugs Bunny\nDOB: 01/04/1900\nAddress: 111 Jokes Drive, Hollywood Hills, CA 11111, United States"
def format(data):
dic = {}
for line in data.splitlines():
#print('Line:', line)
if ':' in line:
info = line.split(': ', 1)[1].rstrip() #does not work with files
#print('Info: ', info)
if ' Name:' in info: #middle name problems! /maiden name
dic['F_NAME'] = info.split(' ', 1)[0].rstrip()
dic['L_NAME'] = info.split(' ', 1)[1].rstrip()
elif 'DOB' in info: #overhang
dic['DD'] = info.split('/', 2)[0].rstrip()
dic['MM'] = info.split('/', 2)[1].rstrip()
dic['YY'] = info.split('/', 2)[2].rstrip()
elif 'Address' in info:
dic['STREET'] = info.split(', ', 2)[0].rstrip()
dic['CITY'] = info.split(', ', 2)[1].rstrip()
dic['ZIP'] = info.split(', ', 2)[2].rstrip()
return dic
if __name__ == '__main__':
x = format(example)
for v, k in x.iteritems():
print v, k
Your code doesn't work, at all. You split off the name before the colon and discard it, looking only at the value after the colon, stored in info. That value never contains the names you are looking for; Name, DOB and Address all are part of the line before the :.
Python lets you assign to multiple names at once; make use of this when splitting:
def format(data):
dic = {}
for line in data.splitlines():
if ':' not in line:
continue
name, _, value = line.partition(':')
name = name.strip()
if name == 'Name':
dic['F_NAME'], dic['L_NAME'] = value.split(None, 1) # strips whitespace for us
elif name == 'DOB':
dic['DD'], dic['MM'], dic['YY'] = (v.strip() for v in value.split('/', 2))
elif name == 'Address':
dic['STREET'], dic['CITY'], dic['ZIP'] = (v.strip() for v in value.split(', ', 2))
return dic
I used str.partition() here rather than limit str.split() to just one split; it is slightly faster that way.
For your sample input this produces:
>>> format(example)
{'CITY': 'Hollywood Hills', 'ZIP': 'CA 11111, United States', 'L_NAME': 'Bunny', 'F_NAME': 'Bugs', 'YY': '1900', 'MM': '04', 'STREET': '111 Jokes Drive', 'DD': '01'}
>>> from pprint import pprint
>>> pprint(format(example))
{'CITY': 'Hollywood Hills',
'DD': '01',
'F_NAME': 'Bugs',
'L_NAME': 'Bunny',
'MM': '04',
'STREET': '111 Jokes Drive',
'YY': '1900',
'ZIP': 'CA 11111, United States'}

looping once over html as a string

I try to read data from a table in html. I read periodically and the table length always change and I don't know its length. However the table is always on the same format so I try to recognize some pattern and read data based on it's position.
The html is of the form:
<head>
<title>Some webside</title>
</head>
<body
<tr><td> There are some information coming here</td></tr>
<tbody><table>
<tr><td>First</td><td>London</td><td>24</td><td>3</td><td>19:00</td><td align="center"></td></tr>
<tr bgcolor="#cccccc"><td>Second</td><td>NewYork</td><td>24</td><td>4</td><td>20:13</td><td align="center"></td></tr>
<tr><td>Some surprise</td><td>Swindon</td><td>25</td><td>5</td><td>20:29</td><td align="center"></td></tr>
<tr bgcolor="#cccccc"><td>Third</td><td>Swindon</td><td>24</td><td>6</td><td>20:45</td><td align="center"></td></tr>
</tbody></table>
<tr><td> There are some information coming here</td></tr>
</body>
I convert html to a string and go over it to read the data but I want to read it only once. My code is:
def ReadTable(m):
refList = []
firstId = 1
nextId = 2
k = 1
helper = 1
while firstId != nextId:
row = []
helper = m.find('<td><a href="d?k=', helper) + 17
end_helper = m.find('">', helper)
rowId = m[helper : end_helper]
if k == 1: # to check if looped again
firstId = rowId
else:
nextId = rowId
row.append(rowId)
helper = end_helper + 2
end_helper = m.find('</a></td><td>', helper)
rowPlace = m[helper : end_helper]
row.append(rowPlace)
helper = m.find('</a></td><td>', end_helper) + 13
end_helper = m.find('</td><td>', helper)
rowCity = m[helper : end_helper]
row.append(rowCity)
helper = end_helper + 9
end_helper = m.find('</td><td>', helper)
rowDay = m[helper : end_helper]
row.append(rowDay)
helper = end_helper + 9
end_helper = m.find('</td><td>', helper)
rowNumber = m[helper : end_helper]
row.append(rowNumber)
helper = end_helper + 9
end_helper = m.find('</td>', helper)
rowTime = m[helper : end_helper]
row.append(rowTime)
refList.append(row)
k +=1
return refList
if __name__ == '__main__':
filePath = '/home/m/workspace/Tests/mainP.html'
fileRead = open(filePath)
myString = fileRead.read()
print myString
refList = ReadTable(myString)
print 'Final List = %s' % refList
I expect the outcome as a list with 4 lists inside like that:
Final List = [['101', 'First', 'London', '24', '3', '19:00'], ['102', 'Second', 'NewYork', '24', '4', '20:13'], ['201', 'Some surprise', 'Swindon', '25', '5', '20:29'], ['202', 'Third', 'Swindon', '24', '6', '20:45']]
I expect that after first loop the string is read again and the firstId is found again and my while-loop will terminate. Instead I have infinite loop and my list start to look like this:
Final List = [['101', 'First', 'London', '24', '3', '19:00'], ['102', 'Second', 'NewYork', '24', '4', '20:13'], ['201', 'Some surprise', 'Swindon', '25', '5', '20:29'], ['202', 'Third', 'Swindon', '24', '6', '20:45'], ['me webside</title>\n</head>\n<body \n<tr><td> There are some information coming here</td></tr>\n<tbody><table>\n<tr><td><a href="d?k=101', 'First', 'London', '24', '3', '19:00'], ['102', 'Second', 'NewYork', '24', '4', '20:13']...
I don't understand why my helper start to behave this way and I can't figure out how a program like that should be written. Can you suggest a good/effective way to write it or to fix my loop?
I would suggest you invest some time in looking at LXML. It allows you to look at all of the tables in an html file and work with the sub-elements of the things that make up the table (like rows and cells)
LXML is not hard to work with and it allows you to feed in a string with the
html.fromstring(somestring)
Further, there arte a lot of lxml questions that have been asked and answered here on SO so it is not to hard to find good examples to work from
You aren't checking the return from your find and it is returning -1 when it doesn't find a match.
http://docs.python.org/2/library/string.html#string.find
Return -1 on failure
I updated this section of the code and it returns as you expect now. First and last row below match what you have above so you can find the replacement.
row = []
helper = m.find('<td><a href="d?k=', helper)
if helper == -1:
break
helper += 17
end_helper = m.find('">', helper)

How to parse lines in file in sequence/match adequate results using python, pyparsing?

Here's my code:
from pyparsing import *
survey ='''
BREAK_L,PN1000,LA55.16469813,LN18.15054629
PN1,LA54.16469813,LN17.15054629,EL22.222
BREAK_L,PN2000,LA55.16507249,LN18.15125566
PN6,LA54.16506873,LN17.15115798,EL33.333
PN7,LA54.16507249,LN17.15125566,EL44.444
BREAK_L,PN3000,LA55.16507249,LN18.15125566
PN10,LA54.16507522,LN17.15198405,EL55.555
PN11,LA54.16506566,LN17.15139220,EL44.44
PN12,LA54.16517275,LN17.15100652,EL11.111
'''
digits = "0123456789"
number = Word(nums+'.').setParseAction(lambda t: float(t[0]))
num = Word(digits)
text = Word(alphas)
pt_id = Suppress('PN') + Combine(Optional(text) + num + Optional(text) + Optional(num))
separator = Suppress(',')
latitude = Suppress('LA') + number
longitude = Suppress('LN') + number
gps_line = pt_id + separator + latitude + separator + longitude
break_line = (Suppress('BREAK_L,')
+ pt_id
+ separator
+ latitude
+ separator
+ longitude)
result1 = gps_line.scanString(survey)
result2 = break_line.scanString(survey)
for item in result1:
print item
With example above I would like to find solution how to get output like:
gps_line + it's break_line, what means something like in pseudo code:
for every gps_line in result1:
print gps_line + precedent break_line
If matter of my question is not clear or not fit to description, feel free to change it.
EDIT #2
What I try to achieve is output:
['1', 54.16469813, 17.15054629, 22.222, 'BP1000', 55.16469813, 18.15054629]
['6', 54.16506873, 17.15115798, 33.333, 'BP2000', 55.16507249, 18.15125566]
['7', 54.16507249, 17.15125566, 44.444, 'BP2000', 55.16507249, 18.15125566]
['10', 54.16507522, 17.15198405, 55.555, 'BP3000', 55.16507249, 18.15125566]
['11', 54.16506566, 17.1513922, 44.44, 'BP3000', 55.16507249, 18.15125566]
['12', 54.16517275, 17.15100652, 11.111, 'BP3000', 55.16507249, 18.15125566]
Second attempt:
from decimal import Decimal
from operator import itemgetter
survey ='''
BREAK_L,PN1000,LA55.16469813,LN18.15054629
PN1,LA54.16469813,LN17.15054629,EL22.222
BREAK_L,PN2000,LA55.16507249,LN18.15125566
PN6,LA54.16506873,LN17.15115798,EL33.333
PN7,LA54.16507249,LN17.15125566,EL44.444
BREAK_L,PN3000,LA55.16507249,LN18.15125566
PN10,LA54.16507522,LN17.15198405,EL55.555
PN11,LA54.16506566,LN17.15139220,EL44.44
PN12,LA54.16517275,LN17.15100652,EL11.111
'''
def parse_line(line):
brk = False
kv = {}
for part in line.split(','):
if part == 'BREAK_L':
brk = True
else:
k = part[:2]
v = part[2:]
kv[k] = v
return (brk,kv)
def parse_survey(survey):
ig1 = itemgetter('PN','LA','LN','EL')
ig2 = itemgetter('PN','LA','LN')
brk_data = None
for line in survey.strip().splitlines():
brk, data = parse_line(line)
if brk:
brk_data = data
continue
else:
yield ig1(data) + ig2(brk_data)
for r in parse_survey(survey):
print r
Yields:
('1', '54.16469813', '17.15054629', '22.222', '1000', '55.16469813', '18.15054629')
('6', '54.16506873', '17.15115798', '33.333', '2000', '55.16507249', '18.15125566')
('7', '54.16507249', '17.15125566', '44.444', '2000', '55.16507249', '18.15125566')
('10', '54.16507522', '17.15198405', '55.555', '3000', '55.16507249', '18.15125566')
('11', '54.16506566', '17.15139220', '44.44', '3000', '55.16507249', '18.15125566')
('12', '54.16517275', '17.15100652', '11.111', '3000', '55.16507249', '18.15125566')
This is really not much different to my previous attempt. I'd already paired the data for you. I assume you'll be able to change 1000 into BP1000 yourself.

Categories