I'm trying to create a grammar to parse some Excel-like formulas I have devised, where a special character in the beginning of a string signifies a different source. For example, $ can signify a string, so "$This is text" would be treated as a string input in the program and & can signify a function, so &foo() can be treated as a call to the internal function foo.
The problem I'm facing is how to construct the grammar properly. This is a simplified version as an MWE:
import lark
grammar = r'''start: instruction
?instruction: simple
| func
STARTSYMBOL: "!"|"#"|"$"|"&"|"~"
SINGLESTR: (LETTER+|DIGIT+|"_"|" ")*
simple: STARTSYMBOL [SINGLESTR] (WORDSEP SINGLESTR)*
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: STARTSYMBOL SINGLESTR "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
So, with this grammar, things like: $This is a string, &foo(), &foo(#arg1), &foo($arg1,,#arg2) and &foo(!w1,w2,w3,,!w4,w5,w6) are all parsed as expected. But if I'd like to add more flexibility to my simple terminal, then I need to start fiddling around with the SINGLESTR token definition, which is not convenient.
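For reference, those examples can be exercised like this (a quick sanity check I added; the strings are exactly the ones listed above):
for s in ["$This is a string", "&foo()", "&foo(#arg1)",
          "&foo($arg1,,#arg2)", "&foo(!w1,w2,w3,,!w4,w5,w6)"]:
    parser.parse(s)  # each parses without raising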
What have I tried
The part that I cannot get past is that if I want to have a string including parentheses (which are literals of func), then I cannot handle them in my current situation.
If I add the parentheses in SINGLESTR, then I get Expected STARTSYMBOL, because it's getting mixed up with the func definition and it thinks that a function argument should be passed, which makes sense.
If I redefine the grammar to reserve the ampersand symbol for functions only and add the parentheses in SINGLESTR, then I can parse a string with parentheses, but every function I'm trying to parse gives Expected LPAR.
My intent is that anything starting with a $ would be parsed as a SINGLESTR token and then I could parse things like &foo($first arg (has) parentheses,,$second arg).
My solution, for now, is that I'm using 'escape' words like LEFTPAR and RIGHTPAR in my strings and I've written helper functions to change those into parentheses when I process the tree. So, $This is a LEFTPARtestRIGHTPAR produces the correct tree and when I process it, then this gets translated to This is a (test).
To formulate a general question: Can I define my grammar in such a way that some characters that are special to the grammar are treated as normal characters in some situations and as special in any other case?
EDIT 1
Based on a comment from jbndlr I revised my grammar to create individual modes based on the start symbol:
grammar = r'''start: instruction
?instruction: simple
| func
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|")")*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
This falls (somewhat) under my second test case. I can parse all the simple types of strings (TEXT, MD or DB tokens that can contain parentheses) and functions that are empty; for example, &foo() or &foo(&bar()) parse correctly. The moment I put an argument within a function (no matter which type), I get an UnexpectedEOF Error: Expected ampersand, RPAR or ARGSEP. As a proof of concept, if I remove the parentheses from the definition of SINGLESTR in the new grammar above, then everything works as it should, but I'm back to square one.
import lark
grammar = r'''start: instruction
?instruction: simple
| func
MIDTEXTRPAR: /\)+(?!(\)|,,|$))/
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"("|MIDTEXTRPAR)*
FUNCNAME: (LETTER+) (LETTER+|DIGIT+|"_")* // no parentheses allowed in the func name
DB: "!" SINGLESTR (WORDSEP SINGLESTR)*
TEXT: "$" SINGLESTR
MD: "#" SINGLESTR
simple: TEXT|DB|MD
ARGSEP: ",," // argument separator
WORDSEP: "," // word separator
CONDSEP: ";;" // condition separator
STAR: "*"
func: "&" FUNCNAME "(" [simple|func] (ARGSEP simple|func)* ")"
%import common.LETTER
%import common.WORD
%import common.DIGIT
%ignore ARGSEP
%ignore WORDSEP
'''
parser = lark.Lark(grammar, parser='earley')
parser.parse("&foo($first arg (has) parentheses,,$second arg)")
Output:
Tree(start, [Tree(func, [Token(FUNCNAME, 'foo'), Tree(simple, [Token(TEXT, '$first arg (has) parentheses')]), Token(ARGSEP, ',,'), Tree(simple, [Token(TEXT, '$second arg')])])])
I hope it's what you were looking for.
Those have been a crazy few days. I tried lark and failed. I also tried parsimonious and pyparsing. All of these parsers had the same problem with the 'argument' token consuming the right parenthesis that was part of the function, eventually failing because the function's parentheses weren't closed.
The trick was to figure out how to define a right parenthesis that's "not special". See the regular expression for MIDTEXTRPAR in the code above. I defined it as a right parenthesis that is not followed by argument separation or by the end of the string. I did that by using the regular expression extension (?!...), which matches only if it's not followed by ... but doesn't consume characters. Luckily, it even allows matching the end of the string inside this special regular expression extension.
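To see that lookahead in isolation (a standalone re demo, not part of the original solution):
import re
MIDTEXTRPAR = re.compile(r'\)+(?!(\)|,,|$))')
print(MIDTEXTRPAR.search('(has) parentheses'))  # matches: the ')' is mid-text
print(MIDTEXTRPAR.search('second arg)'))        # None: the ')' ends the string
print(MIDTEXTRPAR.search('arg),,next'))         # None: the ')' precedes ',,'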
EDIT:
The above-mentioned method only works if you don't have an argument ending with a ), because then the MIDTEXTRPAR regular expression won't catch that ) and will think it's the end of the function, even though there are more arguments to process. Also, there may be ambiguities such as ...asdf),,...: it may be the end of a function declaration inside an argument, or a 'text-like' ) inside an argument while the function declaration goes on.
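For example (a hypothetical input illustrating this limitation):
try:
    parser.parse("&foo($ends with a paren),,$second arg)")
except lark.exceptions.UnexpectedInput as e:
    # the ')' before ',,' is taken as the function's closing parenthesis,
    # so the rest of the input is left over and the parse fails
    print(e)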
This problem is related to the fact that what you describe in your question is not a context-free grammar (https://en.wikipedia.org/wiki/Context-free_grammar) for which parsers such as lark exist. Instead it is a context-sensitive grammar (https://en.wikipedia.org/wiki/Context-sensitive_grammar).
The reason it is a context-sensitive grammar is that you need the parser to 'remember' that it is nested inside a function, and how many levels of nesting there are, and to have this memory available inside the grammar's syntax in some way.
EDIT2:
Also take a look at the following parser, which is context-sensitive and seems to solve the problem, but has an exponential time complexity in the number of nested functions, as it tries to parse all possible function barriers until it finds one that works. I believe it has to have exponential complexity since it's not context-free.
_funcPrefix = '&'
_debug = False

class ParseException(Exception):
    pass

def GetRecursive(c):
    if isinstance(c, ParserBase):
        return c.GetRecursive()
    else:
        return c

class ParserBase:
    def __str__(self):
        return type(self).__name__ + ": [" + ','.join(str(x) for x in self.contents) + "]"

    def GetRecursive(self):
        return (type(self).__name__, [GetRecursive(c) for c in self.contents])

class Simple(ParserBase):
    def __init__(self, s):
        self.contents = [s]

class MD(Simple):
    pass

class DB(ParserBase):
    def __init__(self, s):
        self.contents = s.split(',')

class Func(ParserBase):
    def __init__(self, s):
        if s[-1] != ')':
            raise ParseException("Can't find right parenthesis: '%s'" % s)
        lparInd = s.find('(')
        if lparInd < 0:
            raise ParseException("Can't find left parenthesis: '%s'" % s)
        self.contents = [s[:lparInd]]
        argsStr = s[(lparInd+1):-1]
        args = list(argsStr.split(',,'))
        i = 0
        while i < len(args):
            a = args[i]
            if a[0] != _funcPrefix:
                self.contents.append(Parse(a))
                i += 1
            else:
                j = i + 1
                while j <= len(args):
                    nestedFunc = ',,'.join(args[i:j])
                    if _debug:
                        print(nestedFunc)
                    try:
                        self.contents.append(Parse(nestedFunc))
                        break
                    except ParseException as PE:
                        if _debug:
                            print(PE)
                        j += 1
                if j > len(args):
                    raise ParseException("Can't parse nested function: '%s'" % (',,'.join(args[i:])))
                i = j

def Parse(arg):
    if arg[0] not in _starterSymbols:
        raise ParseException("Bad prefix: " + arg[0])
    return _starterSymbols[arg[0]](arg[1:])

_starterSymbols = {_funcPrefix: Func, '$': Simple, '!': DB, '#': MD}

P = Parse("&foo($first arg (has)) parentheses,,&f($asdf,,&nested2($23423))),,&second(!arg,wer))")
print(P)
import pprint
pprint.pprint(P.GetRecursive())
The problem is that a function's arguments are enclosed in parentheses, and one of the arguments may itself contain parentheses.
One possible solution is to use a backslash \ before ( or ) when it is part of a string:
SINGLESTR: (LETTER+|DIGIT+|"_"|" ") (LETTER+|DIGIT+|"_"|" "|"\("|"\)")*
A similar solution is used by C to include a double quote (") as part of a string constant, where the string constant itself is enclosed in double quotes.
example_string1='&f(!g\()'
example_string2='&f(#g)'
print(parser.parse(example_string1).pretty())
print(parser.parse(example_string2).pretty())
Output is
start
  func
    f
    simple    !g\(
start
  func
    f
    simple    #g
Related
I created a simple parser in PLY that has two rules:
when a : comes first a name appears
when a = comes first a number appears
Corresponding code:
from ply import lex, yacc

tokens = ['Name', 'Number']

def t_Number(t):
    r'[0-9]'
    return t

def t_Name(t):
    r'[a-zA-Z0-9]'
    return t

literals = [':', '=']

def t_error(t):
    print("lex error: " + str(t.value[0]))
    t.lexer.skip(1)

lex.lex()

def p_name(p):
    '''
    expression : ':' Name
    '''
    print("name: " + str(list(p)))

def p_number(p):
    '''
    expression : '=' Number
    '''
    print("number: " + str(list(p)))

def p_error(p):
    print("yacc error: " + str(p.value))

yacc.yacc()
yacc.parse("=3")
yacc.parse(":a")
yacc.parse(":3")
My expectation is that if it sees a : or = it enters the corresponding rule and tries to match the corresponding terminal. Yet in the third example it matches a Number, which should be a Name, and then fails.
AFAIK the grammar should be context-free (which is needed for it to be parseable); is this the case? Also, how would I handle the case where one token is a superset of another token?
Ply tokenises before the grammar is consulted, so the context does not influence the tokenisation. (To be more precise, the parser receives a stream of tokens produced by the lexer. The two processes are interleaved in practice, but they are kept independent.)
You can build context into your lexer, but that gets ugly really fast. (Nonetheless, it is a common strategy.)
Your best bet is to write your lexical rules to produce the most granular result possible, and then write your grammar to accept all alternatives:
def p_name(p):
    '''
    expression : ':' Name
    expression : ':' Number
    '''
    print("name: " + str(list(p)))

def p_number(p):
    '''
    expression : '=' Number
    '''
    print("number: " + str(list(p)))
That assumes you change your lexical rules to put the most specific pattern first.
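For instance (a sketch of that advice, not code from the original answer), keep t_Number defined first and tighten t_Name so it no longer matches a bare digit:
def t_Number(t):
    r'[0-9]'
    return t

def t_Name(t):
    r'[a-zA-Z][a-zA-Z0-9]*'
    return t
With this ordering, ":3" lexes as ':' followed by a Number token, which the added expression : ':' Number alternative accepts.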
I am trying to define two words separated by a space as one token in my lexical analyzer,
but when I pass an input like in out it says LexToken(KEYIN,'in',1,0)
and LexToken(KEYOUT,'out',1,3).
I need it to be like this: LexToken(KEYINOUT,'in out',1,0)
PS: KEYIN and KEYOUT are two different tokens per the grammar's definition.
Following is the test which causes the problem:
from ply import lex

reserved = {'in': 'KEYIN', 'out': 'KEYOUT', 'in\sout': 'KEYINOUT'} # the problem is in here
tokens = ['PLUS', 'MINUS', 'IDENTIFIER'] + list(reserved.values())

t_MINUS = r'-'
t_PLUS = r'\+'
t_ignore = ' \t'

def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
    return t

def t_error(t):
    print("Illegal character '%s'" % t.value[0], "at line", t.lexer.lineno, "at position", t.lexer.lexpos)
    t.lexer.skip(1)

lex.lex()
lex.input("in out inout + - ")
while True:
    tok = lex.token()
    print(tok)
    if not tok:
        break
Output:
LexToken(KEYIN,'in',1,0)
LexToken(KEYOUT,'out',1,3)
LexToken(IDENTIFIER,'inout',1,7)
LexToken(PLUS,'+',1,13)
LexToken(MINUS,'-',1,15)
None
This is your function which recognizes IDENTIFIERs and keywords:
def t_IDENTIFIER(t):
    r'[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*'
    t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
    return t
First, it is clear that the keywords it can recognize are precisely the keys of the dictionary reserved, which are:
in
out
in\sout
Since in out is not a key in that dictionary (in\sout is not the same string), it cannot be recognised as a keyword no matter what t.value happens to be.
But t.value cannot be in out either, because t.value will always match the regular expression which controls t_IDENTIFIER:
[a-zA-Z]+([(a-zA-Z)*|(\d+)*|(_*)])*
and that regular expression never matches anything with a space character. (That regular expression has various problems; the characters *, (, ), | and + inside the second character class are treated as ordinary characters. See below for a correct regex.)
You could certainly match in out as a token in a manner similar to that suggested in your original question, prior to the edit. However,
t_KEYINOUT = r'in\sout'
will not work, because Ply does not use the common "maximum munch" algorithm for deciding which regular expression pattern to accept. Instead, it simply orders all of the patterns and picks the first one which matches, where the order consists of all of the tokenizing functions (in the order in which they are defined), followed by the token variables sorted in reverse order of regex length. Since t_IDENTIFIER is a function, it will be tried before the variable t_KEYINOUT. To ensure that t_KEYINOUT is tried first, it must be made into a function and placed before t_IDENTIFIER.
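In code, the paragraph's suggestion would be (an intermediate sketch; the caveat below still applies):
def t_KEYINOUT(t):
    r'in\sout'
    return t
with this function placed above t_IDENTIFIER.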
However, that is still not exactly what you want, since it will tokenize
in outwards
as
LexToken(KEYINOUT,'in out',1,0)
LexToken(IDENTIFIER,'wards',1,6)
rather than
LexToken(KEYIN,'in',1,0)
LexToken(IDENTIFIER,'outwards',1,3)
To get the correct analysis, you need to ensure that in out only matches if out is a complete word; in other words, if there is a word boundary at the end of the match. So one solution is:
reserved = {'in': 'KEYIN', 'out': 'KEYOUT'}

def t_KEYINOUT(t):
    r'in\sout\b'
    return t

def t_IDENTIFIER(t):
    r'[a-zA-Z][a-zA-Z0-9_]*'
    t.type = reserved.get(t.value, 'IDENTIFIER') # Check for reserved words
    return t
However, it is almost certainly not necessary for the lexer to recognize in out as a single token. Since both in and out are keywords, it is easy to leave it to the parser to notice when they are used together as an in out designator:
parameter: KEYIN IDENTIFIER
| KEYOUT IDENTIFIER
| KEYIN KEYOUT IDENTIFIER
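In PLY, those productions might look like this (a hedged sketch; the rule names and tuple actions are illustrative, not from the original answer):
def p_parameter_in(p):
    "parameter : KEYIN IDENTIFIER"
    p[0] = ('in', p[2])

def p_parameter_out(p):
    "parameter : KEYOUT IDENTIFIER"
    p[0] = ('out', p[2])

def p_parameter_inout(p):
    "parameter : KEYIN KEYOUT IDENTIFIER"
    p[0] = ('inout', p[3])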
I'm writing a profile manager for the Stellaris game and I've hit a wall with their format, in which they keep the info about mods and settings.
Mod file:
name="! (Ship Designer UI Fix) !"
path="mod/ship_designer_ui_fix"
tags={
    "Fixes"
}
remote_file_id="879973318"
supported_version="1.6"
Settings:
language="l_english"
graphics={
    size={
        x=1920
        y=1200
    }
    min_gui={
        x=1920
        y=1200
    }
    gui_scale=1.000000
    gui_safe_ratio=1.000000
    refreshRate=59
    fullScreen=no
    borderless=no
    display_index=0
    shadowSize=2048
    multi_sampling=8
    maxanisotropy=16
    gamma=50.000000
    vsync=yes
}
last_mods={
    "mod/ship_designer_ui_fix.mod"
    "mod/ugc_720237457.mod"
    "mod/ugc_775944333.mod"
}
I thought pyparsing would be of help there (and it probably will be), but it has been a long time since I actually did something like this, so I'm clueless atm.
I've managed to extract the simple key=value pairs, but I'm struggling to move from there to extracting the arrays, not to mention the multilevel arrays.
from pyparsing import Dict, Group, Literal, Word, ZeroOrMore, printables, restOfLine

lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
equals = Literal("=").suppress()
nonequals = "".join([c for c in printables if c != "="]) + " \t"
keydef = ~lbrack + Word(nonequals) + equals + restOfLine
conf = Dict( ZeroOrMore( Group(keydef) ) )
tokens = conf.parseString(data)
I haven't got very far, as you can see. Can anyone point me towards the next step? I'm not asking for a finished and working solution for the whole thing - it would move me forward a lot, but where's the fun in that :)
Well, it is awfully tempting to just dive in and write this parser, but you want some of that fun for yourself, and that's great.
Before writing any code, write a BNF. That way you'll write a decent and robust parser, instead of just "everything that's not an equals sign must be an identifier".
There are a lot of "something = something" bits here, look at the kinds of things on the right- and left-hand sides of the '='. The left-hand sides all look like pretty well-mannered identifiers: alphas, underscores. I could envision numeric digits too, as long as they aren't the leading character. So let's say the left-hand sides will be identifiers:
identifier_leading = 'A'..'Z' 'a'..'z' '_'
identifier_body = identifier_leading '0'..'9'
identifier ::= identifier_leading + identifier_body*
The right-hand sides are a mix of things:
integers
floats
'yes' or 'no' booleans
quoted strings
something in braces
The "something in braces" are either a list of quoted strings, or a list of 'identifer = value' pairs. I'll skip the awful details of defining floats and integers and quoted strings, let's just assume we have those defined:
boolean_value ::= 'yes' | 'no'
value ::= float | integer | boolean_value | quoted_string | string_list_in_braces | key_value_list_in_braces
string_list_in_braces ::= '{' quoted_string * '}'
key_value ::= identifier '=' value
key_value_list_in_braces ::= '{' key_value* '}'
You will have to use a pyparsing Forward to declare value before it is fully defined, since it is used in key_value, but key_value is used in key_value_list_in_braces, which is used to define value - a recursive grammar. You are already familiar with the Dict(OneOrMore(Group(named_item))) pattern, and this should be good to give you a structure of fields that are accessible by name. For identifier, a Word would work, or you could just use the pre-defined pyparsing_common.identifier which was introduced as part of the pyparsing_common namespace class last year.
The translation from BNF to pyparsing should be pretty much 1-to-1 from here. For that matter, from the BNF, you could use PLY, ANTLR, or another parsing lib too. The BNF is really worth taking the 1/2 hour or 1/2 day to get sorted out.
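To make the translation concrete, here is a minimal pyparsing sketch of the BNF above (my own rough cut, untested against the full format; the names mirror the BNF, and data is the string from the question):
from pyparsing import (Dict, Forward, Group, Keyword, QuotedString, Suppress,
                       Word, ZeroOrMore, alphanums, alphas, pyparsing_common)

LBRACE, RBRACE, EQ = map(Suppress, "{}=")
identifier = Word(alphas + "_", alphanums + "_")
quoted_string = QuotedString('"')
boolean_value = Keyword("yes") | Keyword("no")

value = Forward()  # declared before it is fully defined (recursive grammar)
key_value = Group(identifier + EQ + value)
string_list_in_braces = Group(LBRACE + ZeroOrMore(quoted_string) + RBRACE)
key_value_list_in_braces = Dict(LBRACE + ZeroOrMore(key_value) + RBRACE)
value <<= (pyparsing_common.fnumber | boolean_value | quoted_string
           | string_list_in_braces | key_value_list_in_braces)

conf = Dict(ZeroOrMore(key_value))
tokens = conf.parseString(data)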
Using pyparsing, how can I match a keyword immediately before or after a special character (like "{" or "}")? The code below shows that my keyword "msg" is not matched unless it is preceded by whitespace (or at start):
import pyparsing as pp
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
messageKw = pp.Keyword("msg")
messageExpr = pp.Forward()
messageExpr << messageKw + openBrace +\
pp.ZeroOrMore(messageExpr) + closeBrace
try:
    result = messageExpr.parseString("msg { msg { } }")
    print result.dump(), "\n"
    result = messageExpr.parseString("msg {msg { } }")
    print result.dump()
except pp.ParseException as pe:
    print pe, "\n", "Text: ", pe.line
I'm sure there's a way to do this, but I have been unable to find it.
Thanks in advance
openBrace = pp.Suppress(pp.Keyword("{"))
closeBrace = pp.Suppress(pp.Keyword("}"))
should be:
openBrace = pp.Suppress(pp.Literal("{"))
closeBrace = pp.Suppress(pp.Literal("}"))
or even just:
openBrace = pp.Suppress("{")
closeBrace = pp.Suppress("}")
(Most pyparsing classes will auto-promote a string argument "arg" to Literal("arg").)
When I have parsers with many punctuation marks, rather than have a big ugly chunk of statements like this, I'll collapse them down to something like:
OBRACE, CBRACE, OPAR, CPAR, SEMI, COMMA = map(pp.Suppress, "{}();,")
The problem you are seeing is that Keyword looks at the immediately-surrounding characters, to make sure that the current string is not being accidentally matched when it is really embedded in a larger identifier-like string. In Keyword('{'), this will only work if there is no adjoining character that could be confused as part of a larger word. Since '{' itself is not really a typical keyword character, using Keyword('{') is not a good use of that class.
Only use Keyword for strings that could be misinterpreted as identifiers. For matching characters that are not in the set of typical keyword characters (by "keyword characters" I mean alphanumerics + '_'), use Literal.
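A quick way to see the difference (an illustrative snippet, not from the original answer):
import pyparsing as pp
print(pp.Literal("{").parseString("{msg"))  # ['{'] - matches
try:
    pp.Keyword("{").parseString("{msg")     # fails: 'm' adjoins the brace
except pp.ParseException as pe:
    print(pe)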
Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
where ABC_123 is an identifier, SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') | Keyword('SPEED_Y') | Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser a bit lazier and wait until it matches the Keyword (as opposed to slurping the entire string as an Identifier and waiting for the _, which does not exist)?
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]
I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately into the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instinctively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier." We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression in a ZeroOrMore construct. (If you always expect at least one underscore + WordButNotParameter following the initial Word, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
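As a quick check, rebuilding Entry with this refined Identifier (expected results per the explanation above; I have not re-run the original poster's environment):
Entry = Identifier + UndParam + Value
print Entry.parseString('ABC_123_SPEED_X 123')  # ['ABC_123', 'SPEED_X', '123']
# 'ABC _123_SPEED_X 123' (note the space) raises a ParseException instead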
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.
If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print tokens # prints: ['ABC_123', 'SPEED_X', '123']
If that's not the case, but the identifier length is fixed, you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)
You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
    mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
    return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print entry.parseString("APC_123_SPEED_X 123")
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.
This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?"
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.