Regular expression matching with re but not lex - python

I am trying to parse a file in order to reformat it. For this, I need to be able to distinguish between full-line comments and end-of-line comments. I have been able to get lex to recognize full-line comments properly, but am having issues with end-of-line comments.
For example, the end-of-line comment in "a = 0; //This; works; fine" is recognized, but the one in "a = 0; //This, does; not;" is not.
What confuses me the most is that re is able to recognize both comments without issue, yet lex cannot.
Here is the relevant code (FL=full line, EL=end of line):
tokens = (
    'EQUAL',
    'SEMICOLON',
    'FL_COMMENT',
    'EL_COMMENT',
    'STRING'
)

t_EQUAL = r'='
t_SEMICOLON = r';'

def t_FL_COMMENT(t):
    r"""(^|\n)\s*(//|\#).*"""
    return t

def t_EL_COMMENT(t):
    r"""(?<=;)\s*(//|\#).*"""
    return t

def t_STRING(t):
    r"""(".*")|([a-zA-Z0-9\</][\w.\-\+/]*)"""
    return t

def t_newline(t):
    r"""\n"""
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

def t_error(t):
    print("Illegal character '%s' on line %d" % (t.value[0], t.lineno))
    t.lexer.skip(1)

def t_eof(t):
    return None

lexer = lex.lex()
lexer.input(file_contents)
for token in lexer:
    print(token)

Lex (including the Ply variety) builds lexical analysers, not regular expression searchers. Unlike a regular expression library, which generally attempts to scan the entire input to find a pattern, lex tries to decide what pattern matches at the current input point. It then advances the input to the point immediately following, and tries to find the matching pattern at that point. And so on. Every character in the text is contained in some matched token. (Although some tokens might be discarded.)
You can actually take advantage of this fact to simplify your regular expressions. In this case, for example, since you can count on t_FL_COMMENT to match any comment which occurs at the beginning of a line, any other comment cannot be at the start of a line. So no lookbehind is needed:
def t_FL_COMMENT(t):
    r"""(^|\n)\s*(//|\#).*"""
    return t

def t_EL_COMMENT(t):
    r"""(//|\#).*"""
    return t
An alternative to (^|\n) is (?m)^ (the (?m) enables multiline mode, so that ^ can match right after a newline as well as at the beginning of the string).
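A minimal end-to-end sketch of this approach (assuming PLY 3.x; the (^|\n) form is kept so no extra re flags are needed), run on the problematic line from the question:
import ply.lex as lex

tokens = ('EQUAL', 'SEMICOLON', 'FL_COMMENT', 'EL_COMMENT', 'STRING')

t_EQUAL = r'='
t_SEMICOLON = r';'

def t_FL_COMMENT(t):
    r"""(^|\n)\s*(//|\#).*"""
    return t

def t_EL_COMMENT(t):
    r"""(//|\#).*"""
    return t

def t_STRING(t):
    r"""(".*")|([a-zA-Z0-9\</][\w.\-\+/]*)"""
    return t

def t_newline(t):
    r"""\n"""
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

def t_error(t):
    print("Illegal character '%s' on line %d" % (t.value[0], t.lineno))
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("a = 0; //This, does; not;")
for token in lexer:
    print(token)
# LexToken(STRING,'a',1,0)
# LexToken(EQUAL,'=',1,2)
# LexToken(STRING,'0',1,4)
# LexToken(SEMICOLON,';',1,5)
# LexToken(EL_COMMENT,'//This, does; not;',1,7)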

Related

Tokenizer method in python without using NLTK

New to python - I need some help figuring out how to write a tokenizer method in python without using any libraries like Nltk. How would I start? Thank you!
Depending on the complexity, you can simply use the string split function.
# Words independent of sentences
words = raw_text.split(' ')
# Sentences and words
sentences = raw_text.split('. ')
words_in_sentences = [sentence.split(' ') for sentence in sentences]
If you want to do something more sophisticated, you can use the re package, which provides support for regular expressions.
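For example, a single re.findall call can pull out words and punctuation in one pass (a small illustration, not part of the original answer):
import re

raw_text = "Hello, world. This is a test."
# \w+ grabs runs of word characters; [^\w\s] grabs single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", raw_text)
print(tokens)  # ['Hello', ',', 'world', '.', 'This', 'is', 'a', 'test', '.']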
I assume you are talking about a tokenizer for a compiler. Such tokens are usually definable by a regular language, for which regular expressions/finite state automata are the natural solution. An example:
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

def lexer(text):
    IDENTIFIER = r'(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z_0-9]*)'
    ASSIGNMENT = r'(?P<ASSIGNMENT>=)'
    NUMBER = r'(?P<NUMBER>\d+)'
    MULTIPLIER_OPERATOR = r'(?P<MULTIPLIER_OPERATOR>[*/])'
    ADDING_OPERATOR = r'(?P<ADDING_OPERATOR>[+-])'
    WHITESPACE = r'(?P<WHITESPACE>\s+)'
    EOF = r'(?P<EOF>\Z)'
    ERROR = r'(?P<ERROR>.)'  # catch everything else, which is an error
    tokenizer = re.compile('|'.join([IDENTIFIER, ASSIGNMENT, NUMBER, MULTIPLIER_OPERATOR, ADDING_OPERATOR, WHITESPACE, EOF, ERROR]))
    seen_error = False
    for m in tokenizer.finditer(text):
        if m.lastgroup != 'WHITESPACE':  # ignore whitespace
            if m.lastgroup == 'ERROR':
                if not seen_error:
                    yield Token(m.lastgroup, m.group())
                    seen_error = True  # scan until we find a non-error input
            else:
                yield Token(m.lastgroup, m.group())
                seen_error = False
        else:
            seen_error = False

for token in lexer('foo = x12 * y / z - 3'):
    print(token)
Prints:
Token(type='IDENTIFIER', value='foo')
Token(type='ASSIGNMENT', value='=')
Token(type='IDENTIFIER', value='x12')
Token(type='MULTIPLIER_OPERATOR', value='*')
Token(type='IDENTIFIER', value='y')
Token(type='MULTIPLIER_OPERATOR', value='/')
Token(type='IDENTIFIER', value='z')
Token(type='ADDING_OPERATOR', value='-')
Token(type='NUMBER', value='3')
Token(type='EOF', value='')
The above code defines each token such as IDENTIFIER, ASSIGNMENT, etc. as simple regular expressions and then combines them into a single regular expression pattern using the | operator and compiles the expression as variable tokenizer. It then uses the regular expression finditer method with the input text as its argument to create a "scanner" that tries to match successive input tokens against the tokenizer regular expression. As long as there are matches, Token instances consisting of type and value are yielded by the lexer generator function. In this example, WHITESPACE tokens are not yielded on the assumption that whitespace is to be ignored by the parser and only serves to separate other tokens.
There is a catch-all ERROR token, defined last, that will match a single character if none of the other token regular expressions match (a . is used for this, which will not match a newline character unless the re.S flag is used; but there is no need to match a newline here, since newlines are already matched by the WHITESPACE token regular expression and are therefore a "legal" match). Special code is added to prevent successive ERROR tokens from being generated. In effect, the lexer generates one ERROR token and then throws away input until it can once again match a legal token.
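As a quick check of that error handling, feeding the lexer a hypothetical input containing two illegal '@' characters yields a single ERROR token:
for token in lexer('foo = @@ 42'):
    print(token)
# Token(type='IDENTIFIER', value='foo')
# Token(type='ASSIGNMENT', value='=')
# Token(type='ERROR', value='@')
# Token(type='NUMBER', value='42')
# Token(type='EOF', value='')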
Use gensim instead.
tokenized_word = gensim.utils.simple_preprocess(str(sentences), deacc=True)  # deacc=True removes punctuation

how to define two tokens as one token?

I am trying to define two words separated by space as one token in my lexical analyzer
but when I pass an input like in out it says LexToken(KEYIN,'in',1,0)
and LexToken(KEYOUT,'out',1,3)
I need it to be like this LexToken(KEYINOUT,'in out',1,0)
PS: KEYIN and KEYOUT are two different tokens per the grammar's definition
Following is the test which causes the problem:
import lex

reserved = {'in': 'KEYIN', 'out': 'KEYOUT', 'in\sout': 'KEYINOUT'}  # the problem is in here
tokens = ['PLUS', 'MINUS', 'IDENTIFIER'] + list(reserved.values())

t_MINUS = r'-'
t_PLUS = r'\+'
t_ignore = ' \t'

def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0], "at line", t.lexer.lineno, "at position", t.lexer.lexpos)
    t.lexer.skip(1)

lex.lex()
lex.input("in out inout + - ")
while True:
    tok = lex.token()
    print(tok)
    if not tok:
        break
Output:
LexToken(KEYIN,'in',1,0)
LexToken(KEYOUT,'out',1,3)
LexToken(IDENTIFIER,'inout',1,7)
LexToken(PLUS,'+',1,13)
LexToken(MINUS,'-',1,15)
None
This is your function which recognizes IDENTIFIERs and keywords:
def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
    return t
First, it is clear that the keywords it can recognize are precisely the keys of the dictionary reserved, which are:
in
out
in\sout
Since in out is not a key in that dictionary (in\sout is not the same string), it cannot be recognised as a keyword no matter what t.value happens to be.
But t.value cannot be in out either, because t.value will always match the regular expression which controls t_IDENTIFIER:
[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*
and that regular expression never matches anything with a space character. (That regular expression has various problems; the characters *, (, ), | and + inside the second character class are treated as ordinary characters. See below for a correct regex.)
You could certainly match in out as a token in a manner similar to that suggested in your original question, prior to the edit. However,
t_KEYINOUT = r'in\sout'
will not work, because Ply does not use the common "maximum munch" algorithm for deciding which regular expression pattern to accept. Instead, it simply orders all of the patterns and picks the first one which matches, where the order consists of all of the tokenizing functions (in the order in which they are defined), followed by the token variables sorted in reverse order of regex length. Since t_IDENTIFIER is a function, it will be tried before the variable t_KEYINOUT. To ensure that t_KEYINOUT is tried first, it must be made into a function and placed before t_IDENTIFIER.
However, that is still not exactly what you want, since it will tokenize
in outwards
as
LexToken(KEYINOUT,'in out',1,0)
LexToken(IDENTIFIER,'wards',1,6)
rather than
LexToken(KEYIN,'in',1,0)
LexToken(IDENTIFIER,'outwards',1,3)
To get the correct analysis, you need to ensure that in out only matches if out is a complete word; in other words, if there is a word boundary at the end of the match. So one solution is:
reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}

def t_KEYINOUT(t):
    r'in\sout\b'
    return t

def t_IDENTIFIER(t):
    r'[a-zA-Z][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
    return t
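With the revised reserved table and rules above in place of the originals (and with the rest of the question's definitions unchanged), the original test input should tokenize as intended; a quick check, assuming PLY is installed:
# KEYINOUT must now be listed in tokens explicitly, since it is no longer
# one of the values of `reserved`
tokens = ['PLUS', 'MINUS', 'IDENTIFIER', 'KEYINOUT'] + list(reserved.values())

lexer = lex.lex()
lexer.input("in out inout + - ")
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)
# LexToken(KEYINOUT,'in out',1,0)
# LexToken(IDENTIFIER,'inout',1,7)
# LexToken(PLUS,'+',1,13)
# LexToken(MINUS,'-',1,15)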
However, it is almost certainly not necessary for the lexer to recognize in out as a single token. Since both in and out are keywords, it is easy to leave it to the parser to notice when they are used together as an in out designator:
parameter: KEYIN IDENTIFIER
| KEYOUT IDENTIFIER
| KEYIN KEYOUT IDENTIFIER

Does ply.lex parse the same token once?

I was reading a document on lexical parsing so that I could parse some arguments, and I followed the document exactly to create a lexer. This is the whole code:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import ply.lex as lex

args = ['[watashi]', '[anata]>500', '[kare]>400&&[kare]<800']

tokens = ('NUMBER', 'EXPRESSION', 'AND', 'LESS', 'MORE')

t_EXPRESSION = r'\[.*\]'
t_AND = r'&&'
t_LESS = r'<'
t_MORE = r'>'
t_ignore = '\t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print 'Illegal character "%s"' % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()

for i in args:
    lexer.input(i)
    while True:
        tok = lexer.token()
        if not tok: break
        print tok
    print '#############'
I simply created a list of sample arguments and I got this output:
LexToken(EXPRESSION,'[watashi]',1,0)
#############
LexToken(EXPRESSION,'[anata]',1,0)
LexToken(MORE,'>',1,7)
LexToken(NUMBER,500,1,8)
#############
LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
#############
The first and second sample arguments are tokenized correctly, but the third one is not. The third sample argument comes out as EXPRESSION+LESS+NUMBER whereas it should be EXPRESSION+MORE+NUMBER+AND+EXPRESSION+LESS+NUMBER. So I thought the problem could be one of these:
ply.lex is only parsing one token of each type: in the code above, ply.lex cannot parse two separate expressions, and it returns the latest match for each type. "[kare]>400&&[kare]" is EXPRESSION because it ends with the latest EXPRESSION match (the second [kare]), and 800 is NUMBER because it is the latest NUMBER match.
!!! OR !!!
There is a mistake in the t_EXPRESSION variable: I defined this variable as "[.*]" to get all characters between the two brackets ([]). The first token of the third sample argument is "[kare]>400&&[kare]", since it starts and ends with brackets and .* matches every character in between, but I thought the regex engine would stop at the first ] character it reached.
So I could not find a way to solve this, and am asking here.
In general, this is what I am struggling with:
lexer.input("[kare]>400&&[kare]<800")
while True:
tok = lexer.token()
if not tok: break
print tok
I get
LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
but I expected something more like
LexToken(EXPRESSION,'[kare]',1,0)
LexToken(MORE,'>',?)
LexToken(NUMBER,400,?)
LexToken(AND,'&&',?)
LexToken(EXPRESSION,'[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
I think I see your problem:
t_EXPRESSION = r'\[.*\]'
is greedy and will match the longest string it can, i.e. '[kare]>400&&[kare]'.
Instead try
t_EXPRESSION = r'\[[^\]]*\]'
which matches only one bracketed expression, since it looks for anything that is not a closing bracket ([^\]]) instead of anything at all (.).
You can also use non-greedy matching:
t_EXPRESSION = r'\[.*?\]'
The ? makes the pattern match as few characters as possible rather than as many as possible.
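The difference is easy to see with re alone, outside of the lexer (a quick illustrative check):
import re

s = '[kare]>400&&[kare]<800'
print(re.match(r'\[.*\]', s).group())      # greedy:        [kare]>400&&[kare]
print(re.match(r'\[.*?\]', s).group())     # non-greedy:    [kare]
print(re.match(r'\[[^\]]*\]', s).group())  # negated class: [kare]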

quote string for TeX input

I am writing a Python script that takes plain text as input and produces
LaTeX code as output. At some point the script has to quote all the
characters that have a special meaning in TeX, such as %, &, \, and so
on.
This is more difficult than I expected. Currently I have this:
def ltx_quote(s):
    s = re.sub(r'[\\]', r'\\textbackslash{}', s)
    # s = re.sub(r'[{]', r'\\{{}', s)
    # s = re.sub(r'[}]', r'\\}{}', s)
    s = re.sub(r'[&]', r'\\&{}', s)
    s = re.sub(r'[$]', r'\\${}', s)
    s = re.sub(r'[%]', r'\\%{}', s)
    s = re.sub(r'[_]', r'\\_{}', s)
    s = re.sub(r'[\^]', r'\\^{}', s)
    s = re.sub(r'[~]', r'\\~{}', s)
    s = re.sub(r'[|]', r'\\textbar{}', s)
    s = re.sub(r'[#]', r'\\#{}', s)
    s = re.sub(r'[<]', r'\\textless{}', s)
    s = re.sub(r'[>]', r'\\textgreater{}', s)
    return s
The problem is the { and } characters, because they are potentially produced by an earlier substitution (\ -> \textbackslash{}), in which case they shouldn't be substituted again. I think the solution would be making all the substitutions in one step, but I don't know how to do it.
Perhaps try using the undocumented re.Scanner:
import re

scanner = re.Scanner([
    (r"[\\]", r'\\textbackslash{}'),
    (r"[{]", r'\\{{}'),
    (r"[}]", r'\\}{}'),
    (r".", lambda s, t: t)
])

tokens, remainder = scanner.scan("\\foo\\{bar}")
print(''.join(tokens))
yields
\\textbackslash{}foo\\textbackslash{}\\{{}bar\\}{}
Unlike the code you posted, re.Scanner.scan (if you look at its source) makes only one pass through the string. Once a match is made, the next match begins from where the last match ended.
The first argument to re.Scanner is a lexicon -- a list of 2-tuples. Each 2-tuple is a regex pattern and an action. The action may be a string, a callable (function), or None (no action).
The patterns are all compiled into one compound pattern. So the order in which the patterns are listed in the lexicon is important. The first pattern to match wins.
If a match is made, the action is called if it is callable, or simply returned if a string. The return values are collected in the list tokens.
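Building on that, a one-pass version of ltx_quote could look roughly like this (a sketch, not a drop-in replacement; note that string actions are emitted verbatim by Scanner, so the backslashes are written singly here, unlike re.sub replacement templates):
import re

# Sketch: specific characters first, catch-all last; every input character
# is consumed by exactly one rule in a single pass.
TEX_SCANNER = re.Scanner([
    (r'\\', r'\textbackslash{}'),
    (r'\{', r'\{'),
    (r'\}', r'\}'),
    (r'[&$%_#]', lambda scanner, tok: '\\' + tok),  # \&, \$, \%, \_, \#
    (r'\^', r'\^{}'),
    (r'~', r'\~{}'),
    (r'\|', r'\textbar{}'),
    (r'<', r'\textless{}'),
    (r'>', r'\textgreater{}'),
    (r'[^\\{}&$%_#^~|<>]+', lambda scanner, tok: tok),  # everything else passes through
])

def ltx_quote(s):
    tokens, _remainder = TEX_SCANNER.scan(s)
    return ''.join(tokens)

print(ltx_quote(r'50% of $x_1 \cup {x_2}'))
# 50\% of \$x\_1 \textbackslash{}cup \{x\_2\}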

python regex for repeating string

I want to verify and then parse this string (in quotes):
string = "start: c12354, c3456, 34526; other stuff that I don't care about"
# Note that some codes begin with 'c'
I would like to verify that the string starts with 'start:' and ends with ';'
Afterward, I would like to have a regex parse out the strings. I tried the following python re code:
regx = r"start: (c?[0-9]+,?)+;"
reg = re.compile(regx)
matched = reg.search(string)
print ' matched.groups()', matched.groups()
I have tried different variations but I can either get the first or the last code but not a list of all three.
Or should I abandon using a regex?
EDIT: updated to reflect part of the problem space I neglected and fixed string difference.
Thanks for all the suggestions - in such a short time.
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
Your easiest solution is to first extract the part between start: and ;, and then use a regular expression to return all matches, not just a single match, using re.findall('c?[0-9]+', text).
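A minimal sketch of that two-step approach, using the string from the question:
import re

s = "start: c12354, c3456, 34526; other stuff that I don't care about"
m = re.match(r'start:\s*(.*?);', s)  # verify the 'start: ... ;' framing first
if m:
    print(re.findall(r'c?[0-9]+', m.group(1)))  # ['c12354', 'c3456', '34526']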
You could use the standard string tools, which are pretty much always more readable.
s = "start: c12354, c3456, 34526;"
s.startswith("start:") # returns a boolean if it starts with this string
s.endswith(";") # returns a boolean if it ends with this string
s[7:-1].split(', ') # gives the list of codes (everything between "start: " and the final ";")
This can be done (pretty elegantly) with a tool like Pyparsing:
from pyparsing import Group, Literal, OneOrMore, Optional, ParseException, Word
import string

code = Group(Optional(Literal("c"), default='') + Word(string.digits) + Optional(Literal(","), default=''))
parser = Literal("start:") + OneOrMore(code) + Literal(";")

# Read lines from file:
with open('lines.txt', 'r') as f:
    for line in f:
        try:
            result = parser.parseString(line)
            codes = [c[1] for c in result[1:-1]]
            # Do something with teh codez...
        except ParseException:
            # Oh noes: string doesn't match!
            continue
Cleaner than a regular expression, returns a list of codes (no need to string.split), and ignores any extra characters in the line, just like your example.
import re

sstr = re.compile(r'start:([^;]*);')
slst = re.compile(r'(?:c?)(\d+)')

mystr = "start: c12354, c3456, 34526; other stuff that I don't care about"
match = re.match(sstr, mystr)
if match:
    res = re.findall(slst, match.group(0))
results in
['12354', '3456', '34526']
