Does ply.lex parse the same token once? - python

I was reading a lexical parsing document so that I can parse some arguments and I exactly followed the document to create a parser. This is the whole code:
#!/usr/bin/env python
#-*- coding: utf-8 -*-
import ply.lex as lex
args = ['[watashi]', '[anata]>500', '[kare]>400&&[kare]<800']
tokens = ('NUMBER', 'EXPRESSION', 'AND', 'LESS', 'MORE')
t_EXPRESSION = r'\[.*\]'
t_AND = r'&&'
t_LESS = r'<'
t_MORE = r'>'
t_ignore = '\t'
def t_NUMBER(t):
r'\d+'
t.value = int(t.value)
return t
def t_newline(t):
r'\n+'
t.lexer.lineno += len(t.value)
def t_error(t):
print 'Illegal character "%s"' % t.value[0]
t.lexer.skip(1)
lexer = lex.lex()
for i in args:
lexer.input(i)
while True:
tok = lexer.token()
if not tok: break
print tok
print '#############'
I simply created a list of sample arguments and I got this output:
LexToken(EXPRESSION,'[watashi]',1,0)
#############
LexToken(EXPRESSION,'[anata]',1,0)
LexToken(MORE,'>',1,7)
LexToken(NUMBER,500,1,8)
#############
LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
#############
The first and second sample arguments are parsed correctly, but the third one is not. The third sample argument is EXPRESSION+LESS+NUMBER whereas it must be EXPRESSION+MORE+NUMBER+AND+EXPRESSION+LESS+NUMBER. So I thought there could be one of those problems:
ply.lex is only parsing one token: In the codes above, ply.lex cannot parse two seperate expressions and it returns the latest token as its type. "[kare]>400&&[kare]" is EXPRESSION because it ends with the latest EXPRESSION token which is second [kare] and 800 is NUMBER because it is the latest NUMBER token.
!!! OR !!!
There is a mistake in t_EXPRESSION variable: I defined this variable as "[.*]" to get all characters in those two brackets ([]). The first token of third sample argument is "[kare]>400&&[kare]" since it simply starts and ends with those brackets and contains .* (every single character) in them, but I thought the interpreter would stop in the first (]) character due to being first.
So I could not find a way to solve but asked here.
in general this is what I am struggling with
lexer.input("[kare]>400&&[kare]<800")
while True:
tok = lexer.token()
if not tok: break
print tok
I get
LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
but I expected something more like
LexToken(EXPRESSION,'[kare]',1.0)
LexToken(LESS,'>',?)
LexToken(NUMBER,400,?)
LexToken(AND,'&&',?)
LexToken(EXPRESSION,'[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)

I think I see your problem
t_EXPRESSION = r'\[.*\]'
is greedy and will match the biggest match it can ie '[kare]>400&&[kare]'
instead try
t_EXPRESSION = r'\[[^\]]*\]'
this will match only one set since it looks for not open bracket([^\]]) instead of anything(.)
you can also use not greedy matching
t_EXPRESSION = r'\[.*?\]'
the ? makes it match as few characters as possible rather than the maximum

Related

Regular expression matching with re but not lex

I am trying to parse a file in order to reformat it. For this, I need to be able to distinguish between full line comments and end of line comments. I have been able to get lex to recognize full line comments properly, but am having issues with end of line comments.
For example: "a = 0; //This; works; fine" but "a = 0; //This, does; not;".
What confuses me the most is that re is able to recognise both comments without issue and yet lex can not.
Here is the relevant code (FL=full line, EL=end of line):
tokens = (
'EQUAL',
'SEMICOLON',
'FL_COMMENT',
'EL_COMMENT',
'STRING'
)
t_EQUAL = r'='
t_SEMICOLON = r';'
def t_FL_COMMENT(t):
r"""(^|\n)\s*(//|\#).*"""
return t
def t_EL_COMMENT(t):
r"""(?<=;)\s*(//|\#).*"""
return t
def t_STRING(t):
r"""(".*")|([a-zA-Z0-9\</][\w.\-\+/]*)"""
return t
def t_newline(t):
r"""\n"""
t.lexer.lineno += len(t.value)
t_ignore = ' \t'
def t_error(t):
print("Illegal character '%s' on line %d" % (t.value[0], t.lineno))
t.lexer.skip(1)
def t_eof(t):
return None
lexer = lex.lex()
lexer.input(file_contents)
for token in lexer:
print(token)
Lex (including the Ply variety) builds lexical analysers, not regular expression searchers. Unlike a regular expression library, which generally attempts to scan the entire input to find a pattern, lex tries to decide what pattern matches at the current input point. It then advances the input to the point immediately following, and tries to find the matching pattern at that point. And so on. Every character in the text is contained in some matched token. (Although some tokens might be discarded.)
You can actually take advantage of this fact to simplify your regular expressions. In this case, for example, since you can count on t_FL_COMMENT to match a comment which does occur at the beginning of a line, any other comment must be not at the start of a line. So no lookbehind is needed:
def t_FL_COMMENT(t):
r"""(^|\n)\s*(//|\#).*"""
return t
def t_EL_COMMENT(t):
r"""(//|\#).*"""
return t
An alternative to (\n|^) is (?m)^ (which enables multiline mode so that the ^ can match right after a newline, as well as matching at the beginning of the string).

PLY Parse C Files for curly brace construct

I want to parse some C Code with PLY.
What I want to extract is the following:
{ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}
This structure can be hidden in some more curly braces.
{SOME, RANDOM, STUFF {ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}, SOME, MORE, RANDOM, STUFF }
Currently I am able to lex for the structure I want to extract ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4 but only if its the only match.
{SOME, RANDOM, STUFF {ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}, SOME, MORE, RANDOM, STUFF }{Argument1, Argument2, Argument3, Argument4}
This is where my current approach fails as the lexing output for above example would be:
ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4}, SOME, MORE, RANDOM, STUFF }{Argument1, Argument2, Argument3, Argument4
How can I only receive following:
ARGUMENT1, ARGUMENT2, ARGUMENT3, ARGUMENT4
Argument1, Argument2, Argument3, Argument4
Short explanation:
I do have a conditional lexer which searches for left curly braces to save its position.
For each new left brace I increment a counter.
For each right brace i decrement the counter.
If the counter is zero, I start to set t.value to all the elements from the latest left brace to the following right brace.
I guess that should work for more than one hit in an example string.
In my opinion, I fail to switch back from ccode state to initial state.
Now to my actual code (in this example i left out the commas in curly braces to make it a bit simpler for me to program):
import ply.lex as lex
import ply.yacc as yacc
# Declare the state
states = (
('ccode', 'exclusive'),
)
tokens = [
'TEXT',
'CCODE'
]
# this saves all rbrace positions
# to get the inner curly brace construct you want to use first element
# text lib call should always be the inner curly brace construct
rbrace_positions = []
def t_ANY_TEXT(t):
r'\w+'
t.value = str(t.value)
return t
# Match the first {. Enter ccode state.
def t_ccode(t):
r'\{'
t.lexer.code_start = t.lexer.lexpos # Record the starting position
print(t.lexer.code_start)
t.lexer.level = 1 # Initial brace level
t.lexer.begin('ccode') # Enter 'ccode' state
def t_lbrace(t):
r'\{'
t.lexer.level += 1
def t_rbrace(t):
r'\}'
t.lexer.level -= 1
# Rules for the ccode state
def t_ccode_lbrace(t):
r'\{'
t.lexer.current_lbrace = t.lexer.lexpos
t.lexer.level += 1
def t_ccode_rbrace(t):
r'\}'
rbrace_positions.append(t.lexer.lexpos)
t.lexer.level -= 1
# If closing brace, return the code fragment
if t.lexer.level == 0:
t.value = t.lexer.lexdata[t.lexer.current_lbrace:rbrace_positions[0]-1]
t.type = "CCODE"
t.lexer.lineno += t.value.count('\n')
t.lexer.begin('INITIAL')
for _ in reversed(rbrace_positions):
rbrace_positions.pop()
return t
# C or C++ comment (ignore)
def t_ccode_comment(t):
r'(/\*(.|\n)*?\*/)|(//.*)'
pass
# C string
def t_ccode_string(t):
r'\"([^\\\n]|(\\.))*?\"'
# C character literal
def t_ccode_char(t):
r'\'([^\\\n]|(\\.))*?\''
# Any sequence of non-whitespace characters (not braces, strings)
def t_ccode_nonspace(t):
r'[^\s\{\}\'\"]+'
# Ignored characters (whitespace)
t_ccode_ignore = " \t\n"
# For bad characters, we just skip over it
def t_ccode_error(t):
t.lexer.skip(1)
def t_error(t):
t.lexer.skip(1)
lexer = lex.lex()
data = '''{ I DONT WANT TO RECEIVE THIS
{THIS IS WHAT I WANT TO SEE}
AS WELL AS I DONT WANT TO RECEIVE THIS}
OUTSIDE OF CURLY BRACES
{I WANT TO SEE THIS AGAIN}
'''
lexer.input(data)
for tok in lexer:
print(tok)
Data is just a test string to have an easy example.
But in my C source files there are some constructs where I want to extract Argument1, Argument2, Argument3, Argument4.
Apparently those C files will not compile but there is no need to since they are included in some other files.
Thank you for all of your input!
Your description is not really clear. Your example seems to indicate that you want to find a braced list which doesn't contain any sublists. So that's the question I'm addressing.
Note that trying to do all this work in the lexer is not generally recommended. Lexers should normally return simple atomic tokens, leaving it to the parser's grammar to do the work of putting the tokens together into a useful structure. But if I've got your use case right, it is possible to do this with the lexer.
You code decides whether or not to return a CCODE token based on whether the depth counter is 0 when it hits a close brace. But that's apparently not what you want: you don't care how deeply nested the braces are; rather, when a closing brace is encountered, you want to know whether it's the innermost brace or not. You don't need a stack for that, since you only ever need the position of the last open brace read, and you only need that while it is unclosed. So every time you see an open brace, you set the last open brace position, and when you see a closing brace, you check whether the last open brace position is set. If it is, you can return the string since that position and set the last open brace position to None. If it is not set, then just continue the scan.
Here's a simplified example based on your code:
import ply.lex as lex
# Declare the state
states = (
('ccode', 'exclusive'),
)
tokens = [
'TEXT',
'CCODE'
]
# Changed from t_ANY_TEXT because otherwise you get all the text inside
# braces as well. Perhaps that's what you wanted but it makes the output less
# clear.
def t_TEXT(t):
r'\w+'
t.value = str(t.value)
return t
# Match the first {. Enter ccode state.
def t_ccode(t):
r'\{'
t.lexer.current_open = t.lexer.lexpos # Record the starting position
t.lexer.level = 1 # Initial brace level
t.lexer.begin('ccode') # Enter 'ccode' state
# t_lbrace and t_rbrace deleted because they never match
# Rules for the ccode state
def t_ccode_lbrace(t):
r'\{'
t.lexer.current_open = t.lexer.lexpos
t.lexer.level += 1
def t_ccode_rbrace(t):
r'\}'
t.lexer.level -= 1
if t.lexer.level == 0:
t.lexer.begin('INITIAL')
if t.lexer.current_open is not None:
t.value = t.lexer.lexdata[t.lexer.current_open:t.lexer.lexpos - 1]
t.type = "CCODE"
t.lexer.current_open = None
return t
# C or C++ comment (ignore)
def t_ccode_comment(t):
r'(/\*(.|\n)*?\*/)|(//.*)'
# C string
def t_ccode_string(t):
r'\"([^\\\n]|(\\.))*?\"'
# C character literal
def t_ccode_char(t):
r'\'([^\\\n]|(\\.))*?\''
# Any sequence of non-whitespace characters (not braces, strings)
def t_ccode_nonspace(t):
r'''[^\s{}'"]+''' # No need to escape inside a character class
# Ignored characters (whitespace)
t_ccode_ignore = " \t\n"
# For bad characters, we just skip over it
def t_ccode_error(t):
t.lexer.skip(1)
def t_error(t):
t.lexer.skip(1)
lexer = lex.lex()
data = '''{ I DONT WANT TO RECEIVE THIS
{THIS IS WHAT I WANT TO SEE}
AS WELL AS I DONT WANT TO RECEIVE THIS}
OUTSIDE OF CURLY BRACES
{I WANT TO SEE THIS AGAIN}
'''
lexer.input(data)
for tok in lexer:
print(tok)
Sample run:
$ python3 nested_brace.py
LexToken(CCODE,'THIS IS WHAT I WANT TO SEE',1,58)
LexToken(TEXT,'OUTSIDE',1,102)
LexToken(TEXT,'OF',1,110)
LexToken(TEXT,'CURLY',1,113)
LexToken(TEXT,'BRACES',1,119)
LexToken(CCODE,'I WANT TO SEE THIS AGAIN',1,152)

Complex string filtering with python

I have a long string that is a phylogenetic tree and I want to do a very specific filtering.
(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;
Basically every x#y is a species#gene_id information. What I am trying to do is trimming this down so that I will only have x instead of x#y.
(Esy, Aar,(Spa,Cpl))...
I tried splitting the string first but the problem is string has different 'split points' for what I want to achieve i.e. some parts x#y is ending with a , and others with a ). I searched for a solution and saw regular expression operations, but I am new to Python and I couldn't be sure if that is what I should be focusing on. I also thought about strip() but it seems like I need to specify the characters to be stripped for this.
Main problem is there is no 'pattern' for me to tell Python to follow. Only thing is that all species ids are 3 letters and they are before an # character.
Is there a method that can do what I want? I will be really glad if you can help me out with my problem. Thanks in advance.
Give this a try:
import re:
pat = re.compile(r'(\w{3})#')
txt = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
pat.findall(t)
Result:
['Esy', 'Aar', 'Spa', 'Cpl', 'Bst', 'Aly', 'Ath', 'Chi', 'Cru', 'Hco', 'Hlo', 'Hla', 'Hse', 'Esa', 'Aal']
If you need the structure intact, we can try to remove the unnecessary parts instead:
pat = re.compile(r'(#|:)[^/),]*')
pat.sub('',t).replace(',', ', ')
Result:
'(Esy, Aar, ((Spa, Cpl), (((Bst, ((Aly, Ath), (Chi, Cru))), (((Hco, Hlo), Hla), Hse)), (Esa, Aal))))'
Regex demo
How about this kind of function:
def parse_string(string):
new_string = ''
skip = False
for char in string:
if char == '#':
skip = True
if char == ',':
skip = False
if not skip or char in ['(', ')']:
new_string += char
return new_string
Calling it on your string:
string = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
parse_string(string)
> '(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hse)),(Esa,Aal))))'
you can use regex:
import re
s = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
p = "...?(?=#)|\(|\)"
result = re.findall(p, s)
and you have your result as a list, so you can make it string or do anything with it
for explaining what is happening :
p is regular expression pattern
so in this pattern:
. means matching any word
...?(?=#) means match any word until I get to a word ? wich ? is #, so this whole pattern means that you get any three words before #
| is or statement, I used it here to find another pattern
and the rest of them is to find ) and (
Try this regex if you need the brackets in the output:
import re
regex = r"#[A-Za-z0-9_\.:]+|[0-9:\.;e-]+"
phylogenetic_tree = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
print(re.sub(regex,"",phylogenetic_tree))
Output:
(Esy,Aar,((Spa,Cpl),(((Bst,((Aly,Ath),(Chi,Cru))),(((Hco,Hlo),Hla),Hs)),(Esa,Aal))))
Because you are trying to parse a phylogenetic tree, I highly suggest to let BioPython do the heavy lifting for you.
You can easily parse and display a phylogenetic with Bio.Phylo. Then it is just iterating over all tree elements and splitting the names at the 'at'-sign.
Because Phylo expects the input to be in a file, we create an in-memory file-like object with io.StringIO. Getting the complete tree is then as easy as
Phylo.read(io.StringIO(s), 'newick')
In order to check if the parsed tree looks sane, I print it once with print(tree).
Now we want to change all node names that contain a '#'. With tree.find_elements we get access to all nodes. Some nodes don't have a name and some might not contain a '#'. So to be extra careful, we first check if n.name and '#' in n.name. Only then do we split each node's name at the '#' and take just the first part (index 0) of it:
n.name = n.name.split('#')[0]
In order to recreate the initial string representation, we use Phylo.write:
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Again, write wants to get a file argument - if we just want to get a string, we can use a StringIO object again.
Full code:
import io
from Bio import Phylo
if __name__ == '__main__':
s = '(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;'
tree = Phylo.read(io.StringIO(s), 'newick')
print(' before '.center(20, '='))
print(tree)
for n in tree.find_elements():
if n.name and '#' in n.name:
n.name = n.name.split('#')[0]
print(' result '.center(20, '='))
out = io.StringIO()
Phylo.write(tree, out, "newick")
print(out.getvalue())
Output:
====== before ======
Tree(rooted=False, weight=1.0)
Clade(branch_length=0.0129090235079)
Clade(branch_length=0.0726396855636, name='Esy#ESY15_g64743_DN3_SP7_c0')
Clade(branch_length=0.137507902808, name='Aar#AA_maker7399_1')
Clade(branch_length=0.0129090235079)
Clade(branch_length=9.05326020871e-05)
Clade(branch_length=0.0318934795022, name='Spa#Tp2g18720')
Clade(branch_length=0.0273465005242, name='Cpl#CP2_g48793_DN3_SP8_c')
Clade(branch_length=0.00328120860999)
Clade(branch_length=0.00859075940423)
Clade(branch_length=0.0340484449097)
Clade(branch_length=0.0332592496158, name='Bst#Bostr_13083s0053_1')
Clade(branch_length=0.0150356382287)
Clade(branch_length=0.0205924636564)
Clade(branch_length=0.0328569260951, name='Aly#AL8G21130_t1')
Clade(branch_length=0.0391706378372, name='Ath#AT5G48370_1')
Clade(branch_length=0.00998579652059)
Clade(branch_length=0.0954469923893, name='Chi#CARHR183840_1')
Clade(branch_length=0.0570981548016, name='Cru#Carubv10026342m')
Clade(branch_length=0.0372829371381)
Clade(branch_length=0.0206478928557)
Clade(branch_length=0.0144626717872)
Clade(branch_length=0.00823215335663, name='Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100')
Clade(branch_length=0.0085462978729, name='Hlo#DN13684_c0_g1_i1_p1')
Clade(branch_length=0.0225079453622, name='Hla#DN22821_c0_g1_i1_p1')
Clade(branch_length=0.048590776459, name='Hse#DN23412_c0_g1_i3_p1')
Clade(branch_length=1.00000050003e-06)
Clade(branch_length=0.0378509854703, name='Esa#Thhalv10004228m')
Clade(branch_length=0.0712272454125, name='Aal#Aa_G102140_t1')
==== result =====
(Esy:0.07264,Aar:0.13751,((Spa:0.03189,Cpl:0.02735):0.00009,(((Bst:0.03326,((Aly:0.03286,Ath:0.03917):0.02059,(Chi:0.09545,Cru:0.05710):0.00999):0.01504):0.03405,(((Hco:0.00823,Hlo:0.00855):0.01446,Hla:0.02251):0.02065,Hse:0.04859):0.03728):0.00859,(Esa:0.03785,Aal:0.07123):0.00000):0.00328):0.01291):0.01291;
The default format of Phylo uses less digits than in your original tree. In order to keep the numbers unchanged, just override the branch length format string with a '%s':
Phylo.write(tree, out, "newick", format_branch_length="%s")
Parsing code can be hard to follow. Tatsu lets you write readable parsing code by combining grammars and python:
text = "(Esy#ESY15_g64743_DN3_SP7_c0:0.0726396855636,Aar#AA_maker7399_1:0.137507902808,((Spa#Tp2g18720:0.0318934795022,Cpl#CP2_g48793_DN3_SP8_c:0.0273465005242):9.05326020871e-05,(((Bst#Bostr_13083s0053_1:0.0332592496158,((Aly#AL8G21130_t1:0.0328569260951,Ath#AT5G48370_1:0.0391706378372):0.0205924636564,(Chi#CARHR183840_1:0.0954469923893,Cru#Carubv10026342m:0.0570981548016):0.00998579652059):0.0150356382287):0.0340484449097,(((Hco#scaff1034_g23864_DN3_SP8_c_TE35_CDS100:0.00823215335663,Hlo#DN13684_c0_g1_i1_p1:0.0085462978729):0.0144626717872,Hla#DN22821_c0_g1_i1_p1:0.0225079453622):0.0206478928557,Hse#DN23412_c0_g1_i3_p1:0.048590776459):0.0372829371381):0.00859075940423,(Esa#Thhalv10004228m:0.0378509854703,Aal#Aa_G102140_t1:0.0712272454125):1.00000050003e-06):0.00328120860999):0.0129090235079):0.0129090235079;"
import sys
import tatsu
grammar = """
start = things ';'
;
things = thing [ ',' things ]
;
thing = x '#' y ':' number
| '(' things ')' ':' number
;
x = /\w+/
;
y = /\w+/
;
number = /[+-]?\d+\.?\d*(e?[+-]?\d*)/
;
"""
class Semantics:
def x(self, ast):
# the method name matches the rule name
print('X =', ast)
parser = tatsu.compile(grammar, semantics=Semantics())
parser.parse(text)

how to define two tokens as one token?

I am trying to define two words separated by space as one token in my lexical analyzer
but when I pass an input like in out it says LexToken(KEYIN,'in',1,0)
and LexToken(KEYOUT,'out',1,3)
I need it to be like this LexToken(KEYINOUT,'in out',1,0)
PS: KEYIN and KEYOUT are two different tokens as the grammar's definition
Following is the test which causes the problem:
import lex
reserved = {'in': 'KEYIN', 'out': 'KEYOUT', 'in\sout': 'KEYINOUT'} # the problem is in here
tokens = ['PLUS', 'MINUS', 'IDENTIFIER'] + list(reserved.values())
t_MINUS = r'-'
t_PLUS = r'\+'
t_ignore = ' \t'
def t_IDENTIFIER(t):
r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
return t
def t_error(t):
print("Illegal character '%s'" % t.value[0], "at line", t.lexer.lineno, "at position", t.lexer.lexpos)
t.lexer.skip(1)
lex.lex()
lex.input("in out inout + - ")
while True:
tok = lex.token()
print(tok)
if not tok:
break
Output:
LexToken(KEYIN,'in',1,0)
LexToken(KEYOUT,'out',1,3)
LexToken(IDENTIFIER,'inout',1,7)
LexToken(PLUS,'+',1,13)
LexToken(MINUS,'-',1,15)
None
This is your function which recognizes IDENTIFIERs and keywords:
def t_IDENTIFIER(t):
r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
return t
First, it is clear that the keywords it can recognize are precisely the keys of the dictionary reserved, which are:
in
out
in\sout
Since in out is not a key in that dictionary (in\sout is not the same string), it cannot be recognised as a keyword no matter what t.value happens to be.
But t.value cannot be in out either, because t.value will always match the regular expression which controls t_IDENTIFIER:
[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*
and that regular expression never matches anything with a space character. (That regular expression has various problems; the characters *, (, ), | and + inside the second character class are treated as ordinary characters. See below for a correct regex.)
You could certainly match in out as a token in a manner similar to that suggested in your original question, prior to the edit. However,
t_KEYINOUT = r'in\sout'
will not work, because Ply does not use the common "maximum munch" algorithm for deciding which regular expression pattern to accept. Instead, it simply orders all of the patterns and picks the first one which matches, where the order consists of all of the tokenizing functions (in the order in which they are defined), followed by the token variables sorted in reverse order of regex length. Since t_IDENTIFIER is a function, it will be tried before the variable t_KEYINOUT. To ensure that t_KEYINOUT is tried first, it must be made into a function and placed before t_IDENTIFIER.
However, that is still not exactly what you want, since it will tokenize
in outwards
as
LexToken(KEYINOUT,'in out',1,0)
LexToken(IDENTIFIER,'wards',1,6)
rather than
LexToken(KEYIN,'in',1,0)
LexToken(IDENTIFIER,'outwards',1,3)
To get the correct analysis, you need to ensure that in out only matches if out is a complete word; in other words, if there is a word boundary at the end of the match. So one solution is:
reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}
def t_KEYINOUT(t):
r'in\sout\b'
return t
def t_IDENTIFIER(t):
r'[a-zA-Z][a-zA-Z0-9_]*'
t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
return t
However, it is almost certainly not necessary for the lexer recognize in out as a single token. Since both in and out are keywords, it is easy to leave it to the parser to notice when they are used together as an in out designator:
parameter: KEYIN IDENTIFIER
| KEYOUT IDENTIFIER
| KEYIN KEYOUT IDENTIFIER

python regex for repeating string

I am wanting to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
//Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ; and then using a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[6:-1].split(', ') # will give you a list of tokens separated by the string ", "
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, Optional, Word
import string
code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")
# Read lines from file:
with open('lines.txt', 'r') as f:
for line in f:
try:
result = parser.parseString(line)
codes = [c[1] for c in result[1:-1]]
# Do something with teh codez...
except ParseException exc:
# Oh noes: string doesn't match!
continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re
sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')
mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']

Categories