how to define two tokens as one token? - python

I am trying to define two words separated by a space as one token in my lexical analyzer,
but when I pass an input like in out it produces LexToken(KEYIN,'in',1,0)
and LexToken(KEYOUT,'out',1,3).
I need it to be LexToken(KEYINOUT,'in out',1,0).
PS: KEYIN and KEYOUT are two different tokens per the grammar's definition.
Following is the test which causes the problem:
import lex

reserved = {'in': 'KEYIN', 'out': 'KEYOUT', 'in\sout': 'KEYINOUT'}  # the problem is in here
tokens = ['PLUS', 'MINUS', 'IDENTIFIER'] + list(reserved.values())

t_MINUS = r'-'
t_PLUS = r'\+'
t_ignore = ' \t'

def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0], "at line", t.lexer.lineno, "at position", t.lexer.lexpos)
    t.lexer.skip(1)

lex.lex()
lex.input("in out inout + - ")

while True:
    tok = lex.token()
    print(tok)
    if not tok:
        break
Output:
LexToken(KEYIN,'in',1,0)
LexToken(KEYOUT,'out',1,3)
LexToken(IDENTIFIER,'inout',1,7)
LexToken(PLUS,'+',1,13)
LexToken(MINUS,'-',1,15)
None

This is your function which recognizes IDENTIFIERs and keywords:
def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
    return t
First, it is clear that the keywords it can recognize are precisely the keys of the dictionary reserved, which are:
in
out
in\sout
Since in out is not a key in that dictionary (in\sout is not the same string), it cannot be recognised as a keyword no matter what t.value happens to be.
But t.value cannot be in out either, because t.value will always match the regular expression which controls t_IDENTIFIER:
[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*
and that regular expression never matches anything with a space character. (That regular expression has various problems; the characters *, (, ), | and + inside the second character class are treated as ordinary characters. See below for a correct regex.)
You could certainly match in out as a token in a manner similar to that suggested in your original question, prior to the edit. However,
t_KEYINOUT = r'in\sout'
will not work, because Ply does not use the common "maximum munch" algorithm for deciding which regular expression pattern to accept. Instead, it simply orders all of the patterns and picks the first one which matches, where the order consists of all of the tokenizing functions (in the order in which they are defined), followed by the token variables sorted in reverse order of regex length. Since t_IDENTIFIER is a function, it will be tried before the variable t_KEYINOUT. To ensure that t_KEYINOUT is tried first, it must be made into a function and placed before t_IDENTIFIER.
However, that is still not exactly what you want, since it will tokenize
in outwards
as
LexToken(KEYINOUT,'in out',1,0)
LexToken(IDENTIFIER,'wards',1,6)
rather than
LexToken(KEYIN,'in',1,0)
LexToken(IDENTIFIER,'outwards',1,3)
To get the correct analysis, you need to ensure that in out only matches if out is a complete word; in other words, if there is a word boundary at the end of the match. So one solution is:
reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}

def t_KEYINOUT(t):
    r'in\sout\b'
    return t

def t_IDENTIFIER(t):
    r'[a-zA-Z][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER')  # Check for reserved words
    return t
However, it is almost certainly not necessary for the lexer to recognize in out as a single token. Since both in and out are keywords, it is easy to leave it to the parser to notice when they are used together as an in out designator:
parameter: KEYIN IDENTIFIER
| KEYOUT IDENTIFIER
| KEYIN KEYOUT IDENTIFIER
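As a sketch of that approach with ply.yacc (the p_parameter rule names and the tuple results are hypothetical, chosen only to show the shape; the lexer reuses the corrected t_IDENTIFIER from above):

import ply.lex as lex
import ply.yacc as yacc

# Minimal lexer: 'in' and 'out' come back as KEYIN/KEYOUT, everything else as IDENTIFIER.
reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}
tokens = ['IDENTIFIER'] + list(reserved.values())
t_ignore = ' \t'

def t_IDENTIFIER(t):
    r'[a-zA-Z][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER')
    return t

def t_error(t):
    t.lexer.skip(1)

# The grammar, not the lexer, combines KEYIN and KEYOUT into an "in out" designator.
def p_parameter_in(p):
    'parameter : KEYIN IDENTIFIER'
    p[0] = ('param', p[2], 'in')

def p_parameter_out(p):
    'parameter : KEYOUT IDENTIFIER'
    p[0] = ('param', p[2], 'out')

def p_parameter_inout(p):
    'parameter : KEYIN KEYOUT IDENTIFIER'
    p[0] = ('param', p[3], 'in out')

def p_error(p):
    print("Syntax error at", p)

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('in out count', lexer=lexer))  # ('param', 'count', 'in out')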

Related

Regular expression matching with re but not lex

I am trying to parse a file in order to reformat it. For this, I need to be able to distinguish between full line comments and end of line comments. I have been able to get lex to recognize full line comments properly, but am having issues with end of line comments.
For example: "a = 0; //This; works; fine" but "a = 0; //This, does; not;".
What confuses me the most is that re is able to recognise both comments without issue, yet lex cannot.
Here is the relevant code (FL=full line, EL=end of line):
tokens = (
    'EQUAL',
    'SEMICOLON',
    'FL_COMMENT',
    'EL_COMMENT',
    'STRING'
)

t_EQUAL = r'='
t_SEMICOLON = r';'

def t_FL_COMMENT(t):
    r"""(^|\n)\s*(//|\#).*"""
    return t

def t_EL_COMMENT(t):
    r"""(?<=;)\s*(//|\#).*"""
    return t

def t_STRING(t):
    r"""(".*")|([a-zA-Z0-9\</][\w.\-\+/]*)"""
    return t

def t_newline(t):
    r"""\n"""
    t.lexer.lineno += len(t.value)

t_ignore = ' \t'

def t_error(t):
    print("Illegal character '%s' on line %d" % (t.value[0], t.lineno))
    t.lexer.skip(1)

def t_eof(t):
    return None

lexer = lex.lex()
lexer.input(file_contents)
for token in lexer:
    print(token)
Lex (including the Ply variety) builds lexical analysers, not regular expression searchers. Unlike a regular expression library, which generally attempts to scan the entire input to find a pattern, lex tries to decide what pattern matches at the current input point. It then advances the input to the point immediately following, and tries to find the matching pattern at that point. And so on. Every character in the text is contained in some matched token. (Although some tokens might be discarded.)
You can actually take advantage of this fact to simplify your regular expressions. In this case, for example, since you can count on t_FL_COMMENT to match a comment which does occur at the beginning of a line, any other comment must be not at the start of a line. So no lookbehind is needed:
def t_FL_COMMENT(t):
    r"""(^|\n)\s*(//|\#).*"""
    return t

def t_EL_COMMENT(t):
    r"""(//|\#).*"""
    return t
An alternative to (\n|^) is (?m)^ (which enables multiline mode so that the ^ can match right after a newline, as well as matching at the beginning of the string).
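As a quick check, here is a minimal self-contained sketch of that idea (the NAME and NUMBER rules below are simplified stand-ins for the STRING rule, not the original grammar), showing both kinds of comment being classified correctly:

import ply.lex as lex

tokens = ('EQUAL', 'SEMICOLON', 'NUMBER', 'NAME', 'FL_COMMENT', 'EL_COMMENT')

t_EQUAL = r'='
t_SEMICOLON = r';'
t_NUMBER = r'\d+'
t_NAME = r'[a-zA-Z_]\w*'
t_ignore = ' \t'

def t_FL_COMMENT(t):
    r'(^|\n)\s*(//|\#).*'
    return t

def t_EL_COMMENT(t):
    r'(//|\#).*'
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input("a = 0; //This, does; not;\n// full line comment\n")
for tok in lexer:
    print(tok.type, repr(tok.value))

# NAME 'a'
# EQUAL '='
# NUMBER '0'
# SEMICOLON ';'
# EL_COMMENT '//This, does; not;'
# FL_COMMENT '\n// full line comment'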

wildcard match & replace and/or multiple string wildcard matching

I have two very related questions:
I want to match a string pattern with a wildcard (i.e. containing one or more '*' or '?')
and then form a replacement string with a second wildcard pattern. There the placeholders should refer to the same matched substring
(As for instance in the DOS copy command)
Example: pattern='*.txt' and replacement-pattern='*.doc':
I want aaa.txt --> aaa.doc and xx.txt.txt --> xx.txt.doc
Ideally it would work with multiple, arbitrarily placed wildcards: e.g., pattern='*.*' and replacement-pattern='XX*.*'.
Of course one needs to apply some constraints (e.g. greedy strategy). Otherwise patterns such as X*X*X are not unique for string XXXXXX.
or, alternatively, form a multi-match. That is, I have one or more wildcard patterns, each with the same number of wildcard characters. Each pattern is matched against one string, but the wildcard characters should refer to the same matching text.
Example: pattern1='*.txt' and pattern2='*-suffix.txt'
Should match the pair string1='XX.txt' and string2='XX-suffix.txt' but not
string1='XX.txt' and string2='YY-suffix.txt'
In contrast to the first, this is a better-defined problem since it avoids the ambiguity issue, but it is otherwise quite similar.
I am sure there are algorithms for these tasks, however, I am unable to find anything useful.
The Python standard library has fnmatch, but it does not support what I want to do.
There are many ways to do this, but I came up with the following, which should work for your first question. Based on your examples I’m assuming you don’t want to match whitespace.
This function turns the first passed pattern into a regex and the passed replacement pattern into a string suitable for the re.sub function.
import re
def replaceWildcards(string, pattern, replacementPattern):
    splitPattern = re.split(r'([*?])', pattern)
    splitReplacement = re.split(r'([*?])', replacementPattern)
    if (len(splitPattern) != len(splitReplacement)):
        raise ValueError("Provided pattern wildcards do not match")
    reg = ""
    sub = ""
    for idx, (regexPiece, replacementPiece) in enumerate(zip(splitPattern, splitReplacement)):
        if regexPiece in ["*", "?"]:
            if replacementPiece != regexPiece:
                raise ValueError("Provided pattern wildcards do not match")
            reg += f"(\\S{regexPiece if regexPiece == '*' else ''})"  # Match anything but whitespace
            sub += f"\\{idx + 1}"  # Regex matches start at 1, not 0
        else:
            reg += f"({re.escape(regexPiece)})"
            sub += f"{replacementPiece}"
    return re.sub(reg, sub, string)
Sample output:
replaceWildcards("aaa.txt xx.txt.txt aaa.bat", "*.txt", "*.doc")
# 'aaa.doc xx.txt.doc aaa.bat'
replaceWildcards("aaa10.txt a1.txt aaa23.bat", "a??.txt", "b??.doc")
# 'aab10.doc a1.txt aaa23.bat'
replaceWildcards("aaa10.txt a1-suffix.txt aaa23.bat", "a*-suffix.txt", "b*-suffix.doc")
# 'aaa10.txt b1-suffix.doc aaa23.bat'
replaceWildcards("prefix-2aaa10-suffix.txt a1-suffix.txt", "prefix-*a*-suffix.txt", "prefix-*b*-suffix.doc")
# 'prefix-2aab10-suffix.doc a1-suffix.txt'
Note f-strings require Python >=3.6.
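For the second (multi-match) variant of the question, here is a sketch along the same lines; the multiMatch helper, the newline join, and the named-group backreferences are my own additions, not part of the answer above:

import re

def multiMatch(string1, string2, pattern1, pattern2):
    # Split both wildcard patterns, keeping the wildcard characters.
    pieces1 = re.split(r'([*?])', pattern1)
    pieces2 = re.split(r'([*?])', pattern2)
    if len(pieces1) != len(pieces2):
        raise ValueError("Provided pattern wildcards do not match")
    reg1, reg2, n = "", "", 0
    for p1, p2 in zip(pieces1, pieces2):
        if p1 in ["*", "?"]:
            if p1 != p2:
                raise ValueError("Provided pattern wildcards do not match")
            n += 1
            quant = "*" if p1 == "*" else ""
            reg1 += f"(?P<w{n}>\\S{quant})"  # capture the wildcard text from string1
            reg2 += f"(?P=w{n})"             # require the same text in string2
        else:
            reg1 += re.escape(p1)
            reg2 += re.escape(p2)
    # Join both strings with a newline so a single regex (and its backreferences)
    # constrains them together; \S never crosses the newline separator.
    return re.fullmatch(reg1 + "\n" + reg2, string1 + "\n" + string2) is not None

multiMatch("XX.txt", "XX-suffix.txt", "*.txt", "*-suffix.txt")  # True
multiMatch("XX.txt", "YY-suffix.txt", "*.txt", "*-suffix.txt")  # False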

Tokenizer method in python without using NLTK

New to Python - I need some help figuring out how to write a tokenizer method in Python without using any libraries like NLTK. How would I start? Thank you!
Depending on the complexity you can simply use the string split function.
# Words independent of sentences
words = raw_text.split(' ')
# Sentences and words
sentences = raw_text.split('. ')
words_in_sentences = [sentence.split(' ') for sentence in sentences]
If you want to do something more sophisticated you can use packages like re, which provides support for regular expressions.
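For instance, a minimal regex-based word tokenizer might look like this (the exact pattern is an assumption; adjust it to whatever should count as a token):

import re

# Split into words and standalone punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", "Hello, world! It's 2024.")
# ['Hello', ',', 'world', '!', 'It', "'", 's', '2024', '.']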
I assume you are talking about a tokenizer for a compiler. Such tokens are usually definable by a regular language, for which regular expressions/finite state automata are the natural solutions. An example:
import re
from collections import namedtuple
Token = namedtuple('Token', ['type','value'])
def lexer(text):
    IDENTIFIER = r'(?P<IDENTIFIER>[a-zA-Z_][a-zA-Z_0-9]*)'
    ASSIGNMENT = r'(?P<ASSIGNMENT>=)'
    NUMBER = r'(?P<NUMBER>\d+)'
    MULTIPLIER_OPERATOR = r'(?P<MULTIPLIER_OPERATOR>[*/])'
    ADDING_OPERATOR = r'(?P<ADDING_OPERATOR>[+-])'
    WHITESPACE = r'(?P<WHITESPACE>\s+)'
    EOF = r'(?P<EOF>\Z)'
    ERROR = r'(?P<ERROR>.)'  # catch everything else, which is an error
    tokenizer = re.compile('|'.join([IDENTIFIER, ASSIGNMENT, NUMBER, MULTIPLIER_OPERATOR, ADDING_OPERATOR, WHITESPACE, EOF, ERROR]))
    seen_error = False
    for m in tokenizer.finditer(text):
        if m.lastgroup != 'WHITESPACE':  # ignore whitespace
            if m.lastgroup == 'ERROR':
                if not seen_error:
                    yield Token(m.lastgroup, m.group())
                    seen_error = True  # scan until we find a non-error input
            else:
                yield Token(m.lastgroup, m.group())
                seen_error = False
        else:
            seen_error = False

for token in lexer('foo = x12 * y / z - 3'):
    print(token)
Prints:
Token(type='IDENTIFIER', value='foo')
Token(type='ASSIGNMENT', value='=')
Token(type='IDENTIFIER', value='x12')
Token(type='MULTIPLIER_OPERATOR', value='*')
Token(type='IDENTIFIER', value='y')
Token(type='MULTIPLIER_OPERATOR', value='/')
Token(type='IDENTIFIER', value='z')
Token(type='ADDING_OPERATOR', value='-')
Token(type='NUMBER', value='3')
Token(type='EOF', value='')
The above code defines each token, such as IDENTIFIER, ASSIGNMENT, etc., as a simple regular expression, combines them into a single regular expression pattern using the | operator, and compiles the result as the variable tokenizer. It then uses the regular expression finditer method with the input text as its argument to create a "scanner" that tries to match successive input tokens against the tokenizer regular expression. As long as there are matches, Token instances consisting of type and value are yielded by the lexer generator function. In this example, WHITESPACE tokens are not yielded, on the assumption that whitespace is to be ignored by the parser and only serves to separate other tokens.
There is a catchall ERROR token, defined last, that will match a single character if none of the other token regular expressions match (a . is used for this, which will not match a newline character unless the re.S flag is used; but there is no need to match a newline here, since newlines are already matched by the WHITESPACE token regular expression and are therefore "legal" matches). Special code is added to prevent successive ERROR tokens from being generated. In effect, the lexer generates an ERROR token and then throws away input until it can once again match a legal token.
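For example, with the lexer above (the input string here is mine), a run of illegal characters yields only a single ERROR token:

for token in lexer('foo = $$$ 3'):
    print(token)

# Token(type='IDENTIFIER', value='foo')
# Token(type='ASSIGNMENT', value='=')
# Token(type='ERROR', value='$')
# Token(type='NUMBER', value='3')
# Token(type='EOF', value='')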
use gensim instead.
tokenized_word = gensim.utils.simple_preprocess(str(sentences ), deacc=True) # deacc=True removes punctuations

Does ply.lex parse the same token once?

I was reading a lexical parsing document so that I could parse some arguments, and I followed the document exactly to create a parser. This is the whole code:
#!/usr/bin/env python
#-*- coding: utf-8 -*-
import ply.lex as lex

args = ['[watashi]', '[anata]>500', '[kare]>400&&[kare]<800']

tokens = ('NUMBER', 'EXPRESSION', 'AND', 'LESS', 'MORE')

t_EXPRESSION = r'\[.*\]'
t_AND = r'&&'
t_LESS = r'<'
t_MORE = r'>'
t_ignore = '\t'

def t_NUMBER(t):
    r'\d+'
    t.value = int(t.value)
    return t

def t_newline(t):
    r'\n+'
    t.lexer.lineno += len(t.value)

def t_error(t):
    print 'Illegal character "%s"' % t.value[0]
    t.lexer.skip(1)

lexer = lex.lex()

for i in args:
    lexer.input(i)
    while True:
        tok = lexer.token()
        if not tok: break
        print tok
    print '#############'
I simply created a list of sample arguments and I got this output:
LexToken(EXPRESSION,'[watashi]',1,0)
#############
LexToken(EXPRESSION,'[anata]',1,0)
LexToken(MORE,'>',1,7)
LexToken(NUMBER,500,1,8)
#############
LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
#############
The first and second sample arguments are parsed correctly, but the third one is not. The third sample argument is tokenized as EXPRESSION+LESS+NUMBER, whereas it should be EXPRESSION+MORE+NUMBER+AND+EXPRESSION+LESS+NUMBER. So I thought it could be one of these problems:
ply.lex is only parsing one token: in the code above, ply.lex cannot parse two separate expressions and it returns the latest token as its type. "[kare]>400&&[kare]" is EXPRESSION because it ends with the latest EXPRESSION token, which is the second [kare], and 800 is NUMBER because it is the latest NUMBER token.
!!! OR !!!
There is a mistake in the t_EXPRESSION variable: I defined this variable as "[.*]" to get all characters between the two brackets ([]). The first token of the third sample argument is "[kare]>400&&[kare]" since it starts and ends with those brackets and contains .* (every single character) between them, but I thought the interpreter would stop at the first ] character.
So I could not find a way to solve it, and asked here.
In general, this is what I am struggling with:
lexer.input("[kare]>400&&[kare]<800")
while True:
    tok = lexer.token()
    if not tok: break
    print tok
I get
LexToken(EXPRESSION,'[kare]>400&&[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
but I expected something more like
LexToken(EXPRESSION,'[kare]',1,0)
LexToken(MORE,'>',?)
LexToken(NUMBER,400,?)
LexToken(AND,'&&',?)
LexToken(EXPRESSION,'[kare]',1,0)
LexToken(LESS,'<',1,18)
LexToken(NUMBER,800,1,19)
I think I see your problem
t_EXPRESSION = r'\[.*\]'
is greedy and will match the longest match it can, i.e. '[kare]>400&&[kare]'.
Instead, try
t_EXPRESSION = r'\[[^\]]*\]'
This will match only one bracketed expression, since it looks for anything that is not a closing bracket ([^\]]) instead of anything (.).
You can also use non-greedy matching:
t_EXPRESSION = r'\[.*?\]'
The ? makes it match as few characters as possible rather than as many as possible.
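For example (a sketch reusing the lexer above, with only the pattern changed), the third argument now tokenizes as expected:

t_EXPRESSION = r'\[[^\]]*\]'  # corrected: stop at the first closing bracket

lexer = lex.lex()
lexer.input("[kare]>400&&[kare]<800")
while True:
    tok = lexer.token()
    if not tok:
        break
    print(tok)

# LexToken(EXPRESSION,'[kare]',1,0)
# LexToken(MORE,'>',1,6)
# LexToken(NUMBER,400,1,7)
# LexToken(AND,'&&',1,10)
# LexToken(EXPRESSION,'[kare]',1,12)
# LexToken(LESS,'<',1,18)
# LexToken(NUMBER,800,1,19)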

How fill a regex string with parameters

I would like to fill the variables of a regex pattern with strings.
import re
hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
hMatch = hReg.match("/robert/delete/")
args = hMatch.groupdict()
The args variable is now a dict: {"action": "delete"}.
How can I reverse this process? Given the args dict and the regex pattern, how can I obtain the string "/robert/delete/"?
Is it possible to have a function like this?
def reverse(pattern, dictArgs):
Thank you
This function should do it
def reverse(regex, dict):
    replacer_regex = re.compile(r'''
        \(\?P\<      # Match the opening
        (.+?)        # Match the group name into group 1
        \>\(.*?\)\)  # Match the rest
        ''', re.VERBOSE)
    return replacer_regex.sub(lambda m: dict[m.group(1)], regex)
You basically match the (\?P...) block and replace it with a value from the dict.
EDIT: regex is the regex string in my example. You can get it from a compiled pattern with
regex_compiled.pattern
EDIT2: verbose regex added
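A quick usage sketch with the pattern from the question; note that pieces of the pattern outside the named group, such as the trailing $, are left untouched, so anchors may need to be stripped separately:

import re

hReg = re.compile("/robert/(?P<action>([a-zA-Z0-9]*))/$")
print(reverse(hReg.pattern, {'action': 'delete'}))
# /robert/delete/$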
Actually, I think it's doable for some narrow cases, but it is a pretty complex thing in the general case.
You'll need to write some sort of finite state machine that parses your regex string, splits it into its different parts, and then takes the appropriate action for each part:
For regular symbols, simply put the symbols "as is" into the result string.
For named groups, put the values from dictArgs in their place.
For optional blocks, put in one of their possible values.
And so on.
One regular expression can often match a big (or even infinite) set of strings, so this "reverse" function wouldn't be very useful.
Building upon #Dimitri's answer, more sanitisation is possible.
retype = type(re.compile('hello, world'))

def reverse(ptn, dict):
    if isinstance(ptn, retype):
        ptn = ptn.pattern
    ptn = ptn.replace(r'\.', '.')
    replacer_regex = re.compile(r'''
        \(\?P        # Match the opening
        \<(.+?)\>
        (.*?)
        \)           # Match the rest
        ''', re.VERBOSE)
    # return replacer_regex.findall(ptn)
    res = replacer_regex.sub(lambda m: dict[m.group(1)], ptn)
    return res
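A quick usage sketch; the pattern here is a simplified variant of the question's (the nested inner group is dropped, since the lazy (.*?) in the replacer regex above would otherwise stop at the inner group's closing parenthesis):

hReg = re.compile(r"/robert/(?P<action>\w+)/$")
print(reverse(hReg, {'action': 'delete'}))
# /robert/delete/$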
