Using pyparsing to parse a word escape-split over multiple lines

Using pyparsing to parse a word escape-split over multiple lines - python

I'm trying to parse words which can be broken up over multiple lines with a backslash-newline combination ("\\n") using pyparsing. Here's what I have done:
from pyparsing import *
continued_ending = Literal('\\') + lineEnd
word = Word(alphas)
split_word = word + Suppress(continued_ending)
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
print multi_line_word.parseString(
'''super\\
cali\\
fragi\\
listic''')
The output I get is ['super'], while the expected output is ['super', 'cali', fragi', 'listic']. Better still would be all of them joined as one word (which I think I can just do with multi_line_word.parseAction(lambda t: ''.join(t)).
I tried looking at this code in pyparsing helper, but it gives me an error, maximum recursion depth exceeded.
EDIT 2009-11-15: I realized later that pyparsing gets a little generous with regards to white space, and that leads to some poor assumptions that what I thought I was parsing for was a lot looser. That is to say, we want to see no white space between any of the portions of the word, the escape, and the EOL character.
I realized that the little example string above is insufficient as a test case, so I wrote the following unit tests. Code that passes these tests should be able to match what I intuitively think of as a escape-split word—and only an escape-split word. They will not match a basic word that is not escape-split. We can—and I believe should—use a different grammatical construct for that. This keeps it all tidy having the two separate.
import unittest
import pyparsing
# Assumes you named your module 'multiline.py'
import multiline
class MultiLineTests(unittest.TestCase):
def test_continued_ending(self):
case = '\\\n'
expected = ['\\', '\n']
result = multiline.continued_ending.parseString(case).asList()
self.assertEqual(result, expected)
def test_continued_ending_space_between_parse_error(self):
case = '\\ \n'
self.assertRaises(
pyparsing.ParseException,
multiline.continued_ending.parseString,
case
)
def test_split_word(self):
cases = ('shiny\\', 'shiny\\\n', ' shiny\\')
expected = ['shiny']
for case in cases:
result = multiline.split_word.parseString(case).asList()
self.assertEqual(result, expected)
def test_split_word_no_escape_parse_error(self):
case = 'shiny'
self.assertRaises(
pyparsing.ParseException,
multiline.split_word.parseString,
case
)
def test_split_word_space_parse_error(self):
cases = ('shiny \\', 'shiny\r\\', 'shiny\t\\', 'shiny\\ ')
for case in cases:
self.assertRaises(
pyparsing.ParseException,
multiline.split_word.parseString,
case
)
def test_multi_line_word(self):
cases = (
'shiny\\',
'shi\\\nny',
'sh\\\ni\\\nny\\\n',
' shi\\\nny\\',
'shi\\\nny '
'shi\\\nny captain'
)
expected = ['shiny']
for case in cases:
result = multiline.multi_line_word.parseString(case).asList()
self.assertEqual(result, expected)
def test_multi_line_word_spaces_parse_error(self):
cases = (
'shi \\\nny',
'shi\\ \nny',
'sh\\\n iny',
'shi\\\n\tny',
)
for case in cases:
self.assertRaises(
pyparsing.ParseException,
multiline.multi_line_word.parseString,
case
)
if __name__ == '__main__':
unittest.main()

After poking around for a bit more, I came upon this help thread where there was this notable bit
I often see inefficient grammars when
someone implements a pyparsing grammar
directly from a BNF definition. BNF
does not have a concept of "one or
more" or "zero or more" or
"optional"...
With that, I got the idea to change these two lines
multi_line_word = Forward()
multi_line_word << (word | (split_word + multi_line_word))
To
multi_line_word = ZeroOrMore(split_word) + word
This got it to output what I was looking for: ['super', 'cali', fragi', 'listic'].
Next, I added a parse action that would join these tokens together:
multi_line_word.setParseAction(lambda t: ''.join(t))
This gives a final output of ['supercalifragilistic'].
The take home message I learned is that one doesn't simply walk into Mordor.
Just kidding.
The take home message is that one can't simply implement a one-to-one translation of BNF with pyparsing. Some tricks with using the iterative types should be called into use.
EDIT 2009-11-25: To compensate for the more strenuous test cases, I modified the code to the following:
no_space = NotAny(White(' \t\r'))
# make sure that the EOL immediately follows the escape backslash
continued_ending = Literal('\\') + no_space + lineEnd
word = Word(alphas)
# make sure that the escape backslash immediately follows the word
split_word = word + NotAny(White()) + Suppress(continued_ending)
multi_line_word = OneOrMore(split_word + NotAny(White())) + Optional(word)
multi_line_word.setParseAction(lambda t: ''.join(t))
This has the benefit of making sure that no space comes between any of the elements (with the exception of newlines after the escaping backslashes).

You are pretty close with your code. Any of these mods would work:
# '|' means MatchFirst, so you had a left-recursive expression
# reversing the order of the alternatives makes this work
multi_line_word << ((split_word + multi_line_word) | word)
# '^' means Or/MatchLongest, but beware using this inside a Forward
multi_line_word << (word ^ (split_word + multi_line_word))
# an unusual use of delimitedList, but it works
multi_line_word = delimitedList(word, continued_ending)
# in place of your parse action, you can wrap in a Combine
multi_line_word = Combine(delimitedList(word, continued_ending))
As you found in your pyparsing googling, BNF->pyparsing translations should be done with a special view to using pyparsing features in place of BNF, um, shortcomings. I was actually in the middle of composing a longer answer, going into more of the BNF translation issues, but you have already found this material (on the wiki, I assume).

Related

Extend Formatter features to f-string syntax

In a project of mine, I'm passing strings to a Formatter subclass whic formats it using the format specifier mini-language. In my case it is customized (using the features of the Formatter class) by adding additional bang converters : !u converts the resulting string to lowercase, !c to titlecase, !q doubles any square bracket (because reasons), and some others.
For example, using a = "toFu", "{a!c}" becomes "Tofu"
How could I make my system use f-string syntax, so I can have "{a+a!c}" be turned into "Tofutofu" ?
NB: I'm not asking for a way of making f"{a+a!c}" (note the presence of an f) resolve itself as "Tofutofu", which is what hook into the builtin python f-string format machinery covers, I'm asking if there is a way for a function or any form of python code to turn "{a+a!c}" (note the absence of an f) into "Tofutofu".

Not sure I still fully understand what you need, but from the details given in the question and some comments, here is a function that parses strings with the format you specified and gives the desired results:
import re
def formatter(s):
def replacement(match):
expr, frmt = match[1].split('!')
if frmt == 'c':
return eval(expr).title()
return re.sub(r"{([^{]+)}", replacement, s)
a = "toFu"
print(formatter("blah {a!c}"))
print(formatter("{a+a!c}blah"))
Outputs:
blah Tofu
Tofutofublah
This uses the function variation of the repl argument of the re.sub function. This function (replacement) can be further extended to support all other !xs.
Main disadvantages:
Using eval is evil.
This doesn't take in count regular format specifiers, i.e. :0.3
Maybe someone can take it from here and improve.

Evolved from #Tomerikoo 's life-saving answer, here's the code:
import re
def formatter(s):
def replacement(match):
pre, bangs, suf = match.group(1, 2, 3)
# pre : the part before the first bang
# bangs : the bang (if any) and the characters going with it
# suf : the colon (if any) and the characters going with it
if not bangs:
return eval("f\"{" + pre + suf + "}\"")
conversion = set(bangs[1:]) # the first character is always a bang
sra = conversion - set("tiqulc")
conversion = conversion - sra
if sra:
sra = "!" + "".join(sra)
value = eval("f\"{" + pre + (sra or "") + suf + "}\"")
if "q" in conversion:
value = value.replace("{", "{{")
if "u" in conversion:
value = value.upper()
if "l" in conversion:
value = value.lower()
if "c" in conversion and value:
value = value.capitalize()
return value
return re.sub(r"{([^!:\n]+)((?:![^!:\n]+)?)((?::[^!:\n]+)?)}", replacement, s)
The massive regex results in the three groups I commented about at the top.
Caveat: it still uses eval (no acceptable way around it anyway), it doesn't allow for multiline replacement fields, and it may cause issues and/or discrepancies to put spaces between the ! and the :.
But these are acceptable for the use I have.

Please check specifcation
only those characters are allowed : 's', 'r', or 'a'
https://peps.python.org/pep-0498/

Parsing a custom configuration format in Python

I'm writing a profile manager for Stellaris game and I've hit a wall with their format in which they keep the info about mods and settings.
Mod file:
name="! (Ship Designer UI Fix) !"
path="mod/ship_designer_ui_fix"
tags={
"Fixes"
}
remote_file_id="879973318"
supported_version="1.6"
Settings:
language="l_english"
graphics={
size={
x=1920
y=1200
}
min_gui={
x=1920
y=1200
}
gui_scale=1.000000
gui_safe_ratio=1.000000
refreshRate=59
fullScreen=no
borderless=no
display_index=0
shadowSize=2048
multi_sampling=8
maxanisotropy=16
gamma=50.000000
vsync=yes
}
last_mods={
"mod/ship_designer_ui_fix.mod"
"mod/ugc_720237457.mod"
"mod/ugc_775944333.mod"
}
I've thought pyparsing will be of help there (and it probably will be) but it has been a long time since I've actually did something like this and this I'm clueless atm.
I've got to extract the simple key=value but I'm struggling to actually move from there to be able to extract the arrays, not to mention the multilevel arrays.
lbrack = Literal("{").suppress()
rbrack = Literal("}").suppress()
equals = Literal("=").suppress()
nonequals = "".join([c for c in printables if c != "="]) + " \t"
keydef = ~lbrack + Word(nonequals) + equals + restOfLine
conf = Dict( ZeroOrMore( Group(keydef) ) )
tokens = conf.parseString(data)
I haven't got very far as you can see. Can anyone point me towards next step? I'm not asking a finished and working solution for the whole thing - it would move me forward a lot but where's the fun in that :)

Well, it is awfully tempting to just dive in and write this parser, but you want some of that fun for yourself, that's great.
Before writing any code, write a BNF. That way you'll write a decent and robust parser, instead of just "everything that's not an equals sign must be an identifier".
There are a lot of "something = something" bits here, look at the kinds of things on the right- and left-hand sides of the '='. The left-hand sides all look like pretty well-mannered identifiers: alphas, underscores. I could envision numeric digits too, as long as they aren't the leading character. So let's say the left-hand sides will be identifiers:
identifier_leading = 'A'..'Z' 'a'..'z' '_'
identifier_body = identifier_leading '0'..'9'
identifier ::= identifier_leading + identifier_body*
The right-hand sides are a mix of things:
integers
floats
'yes' or 'no' booleans
quoted strings
something in braces
The "something in braces" are either a list of quoted strings, or a list of 'identifer = value' pairs. I'll skip the awful details of defining floats and integers and quoted strings, let's just assume we have those defined:
boolean_value ::= 'yes' | 'no'
value ::= float | integer | boolean_value | quoted_string | string_list_in_braces | key_value_list_in_braces
string_list_in_braces ::= '{' quoted_string * '}'
key_value ::= identifier '=' value
key_value_list_in_braces ::= '{' key_value* '}'
You will have to use a pyparsing Forward to declare value before it is fully defined, since it is used in key_value, but key_value is used in key_value_list_in_braces, which is used to define value - a recursive grammar. You are already familiar with the Dict(OneOrMore(Group(named_item))) pattern, and this should be good to give you a structure of fields that are accessible by name. For identifier, a Word would work, or you could just use the pre-defined pyparsing_common.identifier which was introduced as part of the pyparsing_common namespace class last year.
The translation from BNF to pyparsing should be pretty much 1-to-1 from here. For that matter, from the BNF, you could use PLY, ANTLR, or another parsing lib too. The BNF is really worth taking the 1/2 hour or 1/2 day to get sorted out.

Pyparsing: Detect tokens with a specific ending

I wonder what I am doing wrong here. Maybe someone can give me a hint on this problem.
I want to detect certain tokens using pyparsing that terminate with the string _Init.
As an example, I have the following lines stored in text
one
two_Init
threeInit
four_foo_Init
five_foo_bar_Init
I want to extract the following lines:
two_Init
four_foo_Init
five_foo_bar_Init
Currently, I have reduced my problem to the following lines:
import pyparsing as pp
ident = pp.Word(pp.alphas, pp.alphanums + "_")
ident_init = pp.Combine(ident + pp.Literal("_Init"))
for detected, s, e in ident_init.scanString(text):
print detected
Using this code there are no results. If I remove the "_" in the Word statement then I can detect at least the lines having a _Init at their ends. But the result isnt complete:
['two_Init']
['foo_Init']
['bar_Init']
Has someone any ideas what I am doing completely wrong here?

The problem is that you want to accept '_' as long as it is not the '_' in the terminating '_Init'. Here are two pyparsing solutions, one is more "pure" pyparsing, the other just says the heck with it and uses an embedded regex.
samples = """\
one
two_Init
threeInit
four_foo_Init
six_seven_Init_eight_Init
five_foo_bar_Init"""
from pyparsing import Combine, OneOrMore, Word, alphas, alphanums, Literal, WordEnd, Regex
# implement explicit lookahead: allow '_' as part of your Combined OneOrMore,
# as long as it is not followed by "Init" and the end of the word
option1 = Combine(OneOrMore(Word(alphas,alphanums) |
'_' + ~(Literal("Init")+WordEnd()))
+ "_Init")
# sometimes regular expressions and their implicit lookahead/backtracking do
# make things easier
option2 = Regex(r'\b[a-zA-Z_][a-zA-Z0-9_]*_Init\b')
for expr in (option1, option2):
print '\n'.join(t[0] for t in expr.searchString(samples))
print
Both options print:
two_Init
four_foo_Init
six_seven_Init_eight_Init
five_foo_bar_Init

PyParsing lookaheads and greedy expressions

I'm writing a parser for a query language using PyParsing, and I've gotten stuck on (what I believe to be) an issue with lookaheads. One clause type in the query is intended to split strings into 3 parts (fieldname,operator, value) such that fieldname is one word, operator is one or more words, and value is a word, a quoted string, or a parenthesized list of these.
My data look like
author is william
author is 'william shakespeare'
author is not shakespeare
author is in (william,'the bard',shakespeare)
And my current parser for this clause is written as:
fieldname = Word(alphas)
operator = OneOrMore(Word(alphas))
single_value = Word(alphas) ^ QuotedString(quoteChar="'")
list_value = Literal("(") + Group(delimitedList(single_value)) + Literal(")")
value = single_value ^ list_value
clause = fieldname + originalTextFor(operator) + value
Obviously this fails due to the the fact that the operator element is greedy and will gobble up the value if it can. From reading other similar questions and the docs, I've gathered that I need to manage that lookahead with a NotAny or FollowedBy, but I haven't been able to figure out how to make that work.

This is a good place to Be The Parser. Or more accurately, Make The Parser Think Like You Do. Ask yourself, "In 'author is shakespeare', how do I know that 'shakespeare' is not part of the operator?" You know that 'shakespeare' is the value because it is at the end of the query, there is nothing more after it. So operator words aren't just words of alphas, they are words of alphas that are not followed by the end of the string. Now build that lookahead logic into your definition of operator:
operator = OneOrMore(Word(alphas) + ~FollowedBy(StringEnd()))
And I think this will start parsing better for you.
Some other tips:
I only use '^' operator if there will be some possible ambiguity, like if I was going to parse a string with numbers that could be integers or hex. If I used Word(nums) | Word(hexnums), then I might misprocess "123ABC" as just the leading "123". By changing '|' to '^', all of the alternatives will be tested, and the longest match chosen. In my example of parsing decimal or hex integers, I could have gotten the same result by reversing the alternatives, and test for Word(hexnums) first. In you query language, there is no way to confuse a quoted string with a non-quoted single word value (one leads with ' or ", the other doesn't), so there is no reason to use '^', '|' will suffice. Similar for value = singleValue ^ listValue.
Adding results names to the key components of your query string will make it easier to work with later:
clause = fieldname("fieldname") + originalTextFor(operator)("operator") + value("value")
Now you can access the parsed values by name instead of by parse position (which will get tricky and error-prone once you start getting more complicated with optional fields and such):
queryParts = clause.parseString('author is william')
print queryParts.fieldname
print queryParts.operator

Keyword Matching in Pyparsing: non-greedy slurping of tokens

Pythonistas:
Suppose you want to parse the following string using Pyparsing:
'ABC_123_SPEED_X 123'
were ABC_123 is an identifier; SPEED_X is a parameter, and 123 is a value. I thought of the following BNF using Pyparsing:
Identifier = Word( alphanums + '_' )
Parameter = Keyword('SPEED_X') or Keyword('SPEED_Y') or Keyword('SPEED_Z')
Value = # assume I already have an expression valid for any value
Entry = Identifier + Literal('_') + Parameter + Value
tokens = Entry.parseString('ABC_123_SPEED_X 123')
#Error: pyparsing.ParseException: Expected "_" (at char 16), (line:1, col:17)
If I remove the underscore from the middle (and adjust the Entry definition accordingly) it parses correctly.
How can I make this parser be a bit lazier and wait until it matches the Keyword (as opposed to slurping the entire string as an Identifier and waiting for the _, which does not exist.
Thank you.
[Note: This is a complete rewrite of my question; I had not realized what the real problem was]

I based my answer off of this one, since what you're trying to do is get a non-greedy match. It seems like this is difficult to make happen in pyparsing, but not impossible with some cleverness and compromise. The following seems to work:
from pyparsing import *
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
UndParam = Suppress('_') + Parameter
Identifier = SkipTo(UndParam)
Value = Word(nums)
Entry = Identifier + UndParam + Value
When we run this from the interactive interpreter, we can see the following:
>>> Entry.parseString('ABC_123_SPEED_X 123')
(['ABC_123', 'SPEED_X', '123'], {})
Note that this is a compromise; because I use SkipTo, the Identifier can be full of evil, disgusting characters, not just beautiful alphanums with the occasional underscore.
EDIT: Thanks to Paul McGuire, we can concoct a truly elegant solution by setting Identifier to the following:
Identifier = Combine(Word(alphanums) +
ZeroOrMore('_' + ~Parameter + Word(alphanums)))
Let's inspect how this works. First, ignore the outer Combine; we'll get to this later. Starting with Word(alphanums) we know we'll get the 'ABC' part of the reference string, 'ABC_123_SPEED_X 123'. It's important to note that we didn't allow the "word" to contain underscores in this case. We build that separately in to the logic.
Next, we need to capture the '_123' part without also sucking in '_SPEED_X'. Let's also skip over ZeroOrMore at this point and return to it later. We start with the underscore as a Literal, but we can shortcut with just '_', which will get us the leading underscore, but not all of '_123'. Instictively, we would place another Word(alphanums) to capture the rest, but that's exactly what will get us in trouble by consuming all of the remaining '_123_SPEED_X'. Instead, we say, "So long as what follows the underscore is not the Parameter, parse that as part of my Identifier. We state that in pyparsing terms as '_' + ~Parameter + Word(alphanums). Since we assume we can have an arbitrary number of underscore + WordButNotParameter repeats, we wrap that expression a ZeroOrMore construct. (If you always expect at least underscore + WordButNotParameter following the initial, you can use OneOrMore.)
Finally, we need to wrap the initial Word and the special underscore + Word repeats together so that it's understood they are contiguous, not separated by whitespace, so we wrap the whole expression up in a Combine construct. This way 'ABC _123_SPEED_X' will raise a parse error, but 'ABC_123_SPEED_X' will parse correctly.
Note also that I had to change Keyword to Literal because the ways of the former are far too subtle and quick to anger. I do not trust Keywords, nor could I get matching with them.

If you are sure that the identifier never ends with an underscore, you can enforce it in the definition:
from pyparsing import *
my_string = 'ABC_123_SPEED_X 123'
Identifier = Combine(Word(alphanums) + Literal('_') + Word(alphanums))
Parameter = Literal('SPEED_X') | Literal('SPEED_Y') | Literal('SPEED_Z')
Value = Word(nums)
Entry = Identifier + Literal('_').suppress() + Parameter + Value
tokens = Entry.parseString(my_string)
print tokens # prints: ['ABC_123', 'SPEED_X', '123']
If it's not the case but if the identifier length is fixed you can define Identifier like this:
Identifier = Word( alphanums + '_' , exact=7)

You can also parse the identifier and parameter as one token, and split them in a parse action:
from pyparsing import *
import re
def split_ident_and_param(tokens):
mo = re.match(r"^(.*?_.*?)_(.*?_.*?)$", tokens[0])
return [mo.group(1), mo.group(2)]
ident_and_param = Word(alphanums + "_").setParseAction(split_ident_and_param)
value = Word(nums)
entry = ident_and_param + value
print entry.parseString("APC_123_SPEED_X 123")
The example above assumes that the identifiers and parameters always have the format XXX_YYY (containing one single underscore).
If this is not the case, you need to adjust the split_ident_and_param() method.

This answers a question that you probably have also asked yourself: "What's a real-world application for reduce?):
>>> keys = ['CAT', 'DOG', 'HORSE', 'DEER', 'RHINOCEROS']
>>> p = reduce(lambda x, y: x | y, [Keyword(x) for x in keys])
>>> p
{{{{"CAT" | "DOG"} | "HORSE"} | "DEER"} | "RHINOCEROS"}
Edit:
This was a pretty good answer to the original question. I'll have to work on the new one.
Further edit:
I'm pretty sure you can't do what you're trying to do. The parser that pyparsing creates doesn't do lookahead. So if you tell it to match Word(alphanums + '_'), it's going to keep matching characters until it finds one that's not a letter, number, or underscore.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.