Writing a compiler for a DSL in Python

I am writing a game in python and have decided to create a DSL for the map data files. I know I could write my own parser with regex, but I am wondering if there are existing python tools which can do this more easily, like re2c which is used in the PHP engine.
Some extra info:
Yes, I do need a DSL, and even if I didn't, I'd still want the experience of building and using one in a project.
The DSL contains only data (declarative?); it doesn't get "executed". Most lines look like:
SOMETHING: !abc #123 #xyz/123
I just need to read the tree of data.

I've always been impressed by pyparsing. The author, Paul McGuire, is active on the python list/comp.lang.python and has always been very helpful with any queries concerning it.

Here's an approach that works really well.
abc= ONETHING( ... )
xyz= ANOTHERTHING( ... )
pqr= SOMETHING( this=abc, that=123, more=(xyz,123) )
Declarative. Easy-to-parse.
And...
It's actually Python. A few class declarations and the work is done. The DSL is actually class declarations.
What's important is that a DSL merely creates objects. When you define a DSL, first you have to start with an object model. Later, you put some syntax around that object model. You don't start with syntax, you start with the model.
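For example, a minimal sketch of what those class declarations might look like (the names and attributes here are invented for illustration):
class Thing:
    def __init__(self, **attrs):
        self.attrs = attrs

class ONETHING(Thing): pass
class ANOTHERTHING(Thing): pass
class SOMETHING(Thing): pass

# A "map file" is then just a Python module using these classes:
abc = ONETHING(size=3)
xyz = ANOTHERTHING(color="red")
pqr = SOMETHING(this=abc, that=123, more=(xyz, 123))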

Yes, there are many -- too many -- parsing tools, but none in the standard library.
From what I saw, PLY and SPARK are popular. PLY is like yacc, but you do everything in Python because you write your grammar in docstrings.
Personally, I like the concept of parser combinators (taken from functional programming), and I quite like pyparsing: you write your grammar and actions directly in python and it is easy to start with. I ended up producing my own tree node types with actions though, instead of using their default ParserElement type.
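As a taste, a grammar for the sample line from the question might look like this in pyparsing (a sketch; the token names key/name/refs are my own guesses at the format):
from pyparsing import Word, alphas, alphanums, Suppress, OneOrMore

key = Word(alphas)("key") + Suppress(":")
name = Suppress("!") + Word(alphanums)("name")
ref = Suppress("#") + Word(alphanums + "/")
line = key + name + OneOrMore(ref)("refs")

print(line.parseString("SOMETHING: !abc #123 #xyz/123"))
# ['SOMETHING', 'abc', '123', 'xyz/123']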
Otherwise, you can also use an existing declarative language like YAML.

I have written something like this at work to read in SNMP notification definitions and automatically generate Java classes and SNMP MIB files from them. Using this little DSL, I could write 20 lines of my specification and it would generate roughly 80 lines of Java code and a 100-line MIB file.
To implement this, I actually just used straight Python string handling (split(), slicing etc.) to parse the file. I find Python's string capabilities to be adequate for most of my (simple) parsing needs.
Besides the libraries mentioned by others, if I were writing something more complex and needed proper parsing capabilities, I would probably use ANTLR, which supports Python (and other languages).

For "small languages" as the one you are describing, I use a simple split, shlex (mind that the # defines a comment) or regular expressions.
>>> line = 'SOMETHING: !abc #123 #xyz/123'
>>> line.split()
['SOMETHING:', '!abc', '#123', '#xyz/123']
>>> import shlex
>>> list(shlex.shlex(line))
['SOMETHING', ':', '!', 'abc']
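If the comment behavior gets in the way, it can be switched off; commenters is a documented attribute of shlex.shlex instances:
>>> s = shlex.shlex(line)
>>> s.commenters = ''
>>> list(s)
['SOMETHING', ':', '!', 'abc', '#', '123', '#', 'xyz', '/', '123']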
The following is an example, as I do not know exactly what you are looking for.
>>> import re
>>> result = re.match(r'([A-Z]*): !([a-z]*) #([0-9]*) #([a-z0-9/]*)', line)
>>> result.groups()
('SOMETHING', 'abc', '123', 'xyz/123')

DSLs are a good thing, so you don't need to defend yourself :-)
However, have you considered an internal DSL? These have so many pros versus external (parsed) DSLs that they're at least worth consideration. Mixing a DSL with the power of the native language really solves lots of the problems for you, and Python is not really bad at internal DSLs, with the with statement coming in handy.
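For instance, a minimal sketch of an internal DSL for map data built on the with statement (all names here are illustrative):
from contextlib import contextmanager

tree = {}
stack = [tree]

@contextmanager
def node(name, **attrs):
    # Attach a child dict to whatever node is currently open.
    child = dict(attrs)
    stack[-1][name] = child
    stack.append(child)
    yield child
    stack.pop()

with node("SOMETHING", ref="abc"):
    with node("child", ids=(123, "xyz/123")):
        pass

print(tree)  # {'SOMETHING': {'ref': 'abc', 'child': {'ids': (123, 'xyz/123')}}}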

Along the lines of declarative Python, I wrote a helper module called 'bpyml' which lets you declare data in Python in a more XML-structured way without the verbose tags. It can be converted to/from XML too, but is valid Python.
https://svn.blender.org/svnroot/bf-blender/trunk/blender/release/scripts/modules/bpyml.py
Example Use
http://wiki.blender.org/index.php/User:Ideasman42#Declarative_UI_In_Blender

Here is a simpler approach to the problem.
What if I could extend Python's syntax with new operators that introduce new functionality to the language? For example, a new operator <=> for swapping the values of two variables.
How can I implement such behavior? Here comes the ast module.
The ast module is a handy tool for handling abstract syntax trees. What's cool about it is that it lets me write Python code that generates a tree and then compile that tree to Python code.
Let's say we want to compile a superset of Python (or a Python-like language) to Python:
from:
a <=> b
to:
a , b = b , a
First, I need to convert my 'Python-like' source code into a list of tokens, so I need a tokenizer: a lexical scanner for Python source code. The tokenize module does exactly that.
I can use the same meta-language to define both the grammar of the new 'Python-like' language and the structure of the abstract syntax tree (AST).
Why use the AST?
It is a much safer choice when evaluating untrusted code, and it lets you manipulate the tree before executing it. Working on the tree:
from tokenize import untokenize, tokenize, NUMBER, STRING, NAME, OP, COMMA
import io
import ast

s = b"a <=> b\n"  # i may read it from file
b = io.BytesIO(s)
g = tokenize(b.readline)
result = []
for token_num, token_val, _, _, _ in g:
    # naive simple approach to compile a<=>b to a,b = b,a
    if token_num == OP and token_val == '<=' and next(g).string == '>':
        first = result.pop()
        next_token = next(g)
        second = (NAME, next_token.string)
        result.extend([
            first,
            (COMMA, ','),
            second,
            (OP, '='),
            second,
            (COMMA, ','),
            first,
        ])
    else:
        result.append((token_num, token_val))

src = untokenize(result).decode('utf-8')
exp = ast.parse(src)
code = compile(exp, filename='', mode='exec')

def my_swap(a, b):
    global code
    env = {
        "a": a,
        "b": b
    }
    exec(code, env)
    return env['a'], env['b']

print(my_swap(1, 10))
Other modules that use the ast module, whose source code may be a useful reference:
textX-LS: a DSL used to describe a collection of shapes and draw them for us.
Pony ORM: lets you write database queries using Python generators and lambdas, which it translates to SQL query strings; Pony ORM uses the AST under the hood.
osso: a Role Based Access Control framework that handles permissions.

Related

Can Python-Markdown support imageboard-style links?

I would like to add an additional syntax to Python-Markdown: if n is a positive integer, >>n should expand into a link to post n, e.g. <a href="#post-n">>>n</a>. (Double angled brackets (>>) is a conventional syntax for creating links in imageboard forums.)
By default, Python-Markdown expands >>n into nested blockquotes: <blockquote><blockquote>n</blockquote></blockquote>. Is there a way to create links out of >>n, while preserving the rest of blockquote's default behavior? In other words, if x is a positive integer, >>x should expand into a link, but if x is not a positive integer, >>x should still expand into nested blockquotes.
I have read the relevant wiki article: Tutorial 1 Writing Extensions for Python Markdown. Based on what I learned in the wiki, I wrote a custom extension:
import markdown
import xml.etree.ElementTree as ET
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        # Create link.
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        IMAGEBOARD_LINK_RE = '>>(?P<number>[1-9][0-9]*)'
        imageboard_link = ImageboardLinkPattern(IMAGEBOARD_LINK_RE)
        md.inlinePatterns['imageboard_link'] = imageboard_link

html = markdown.markdown('>>123',
                         extensions=[ImageboardLinkExtension()])
print(html)
However, >>123 still produces <blockquote><blockquote>123</blockquote></blockquote>. What is wrong with the implementation above?
The problem is that your new syntax conflicts with the preexisting blockquote syntax. Your extension would presumably work if it were ever called. However, due to the conflict, that never happens. Note that there are five types of processors. As documented:
Preprocessors alter the source before it is passed to the parser.
Block Processors work with blocks of text separated by blank lines.
Tree Processors modify the constructed ElementTree.
Inline Processors are common tree processors for inline elements, such as *strong*.
Postprocessors munge the output of the parser just before it is returned.
Of importance here is that the processors are run in that order. In other words, all block processors are run before any inline processors are run. Therefore, the blockquote block processor runs first on your input and removes the double angle bracket, wrapping the rest of the line in double blockquote tags. By the time your inline processor sees the document, your regex will no longer match and will therefore never be called.
That being said, an inline processor is the correct way to implement a link syntax. However, you would need to do one of two things to make it work.
Alter the syntax so that it does not clash with any preexisting syntax; or
Alter the blockquote behavior to avoid the conflict.
Personally, I would recommend option 1, but I understand you are trying to implement a preexisting syntax from another environment. So, if you want to explore option 2, then I would suggest perhaps making the blockquote syntax a little more strict. For example, while it is not required, the recommended syntax is to always insert a space after the angle bracket in a blockquote. It should be relatively simple to alter the BlockquoteProcessor to require the space, which would cause your syntax to no longer clash.
This is actually pretty simple. As you may note, the entire syntax is defined via a rather simple regex:
RE = re.compile(r'(^|\n)[ ]{0,3}>[ ]?(.*)')
You simply need to rewrite that so that the space is required (> followed by a space rather than >[ ]?). First import and subclass the existing processor and then override the regex:
import re
from markdown.blockprocessors import BlockquoteProcessor

class CustomBlockquoteProcessor(BlockquoteProcessor):
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')
Finally, you just need to tell Markdown to use your custom class rather than the default. Add the following to the extendMarkdown method of your ImageboardLinkExtension class:
md.parser.blockprocessors.register(CustomBlockquoteProcessor(md.parser), 'quote', 20)
Now the blockquote syntax will no longer clash with your link syntax and you will get an opportunity to have your code run on the text. Just be careful to remember to always include the now required space for any actual blockquotes.
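Putting the pieces together, the whole extension might look like this (a sketch assembled from the fragments above, keeping the question's registration style):
import re
import markdown
import xml.etree.ElementTree as ET
from markdown.blockprocessors import BlockquoteProcessor
from markdown.extensions import Extension
from markdown.inlinepatterns import Pattern

class ImageboardLinkPattern(Pattern):
    def handleMatch(self, match):
        number = match.group('number')
        element = ET.Element('a', attrib={'href': f'#post-{number}'})
        element.text = f'>>{number}'
        return element

class CustomBlockquoteProcessor(BlockquoteProcessor):
    # Require "> " so that >>123 is no longer eaten by the blockquote parser.
    RE = re.compile(r'(^|\n)[ ]{0,3}> (.*)')

class ImageboardLinkExtension(Extension):
    def extendMarkdown(self, md):
        imageboard_link = ImageboardLinkPattern('>>(?P<number>[1-9][0-9]*)')
        md.inlinePatterns['imageboard_link'] = imageboard_link
        md.parser.blockprocessors.register(
            CustomBlockquoteProcessor(md.parser), 'quote', 20)

print(markdown.markdown('>>123', extensions=[ImageboardLinkExtension()]))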

Printing z3 expressions using the python api

I am trying to use z3 to simplify a few expressions generated by S2E/KLEE
from z3 import *
f = open("query.smt2").read()
expr = parse_smt2_string(f)
print(expr)
print(simplify(expr))
But it seems to log only 200 lines. I have also tried writing it to a file, but that has the same result.
g = open("simplified_query.smt2", 'w')
g.write(str(simplify(expr)))
g.close()
How should I log the entire expression?
Example input/output: https://paste.ee/p/tRwxQ
You can print the expressions using the Python pretty printer, as you do. It cuts expressions off when they get very big, because the pretty printer is not efficient. There are settings you can give the pretty printer to force it to print full expressions. The function is called set_pp_option and it is defined in z3printer.py. The main option is called max_depth. Other options are defined as fields in the Formatter class.
You can also print expressions in SMT2 format using the method "sexpr()".
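For example (a sketch; set_option in the Python bindings forwards these keys to the pretty printer, and the exact field names live in z3printer.py):
from z3 import *
set_option(max_args=10000000, max_lines=1000000,
           max_depth=10000000, max_visited=1000000)
print(simplify(expr))          # now prints the full term
print(simplify(expr).sexpr())  # or dump the SMT2 s-expression instead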
BTW, the file you uploaded doesn't parse because it is in UTF-8 format, but this is orthogonal to your question and probably an artifact of how you uploaded the repro.

Python - any property file or data format that is mostly free-form?

I'm about to roll my own property file parser. I've got a somewhat odd requirement where I need to be able to store metadata in an existing field of a GUI. The data needs to be easily parse-able and human readable, preferably with some flexibility in defining the data (no yaml for example).
I was thinking I could do something like this:
this is random text that is truly a description
.metadata.
owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla
I was thinking I could use something like '.metadata.' to denote that anything below that line is metadata to be parsed. Then, I would treat the properties almost like java properties where I would read each line in and build a map (or object) to hold the metadata, which would then be outputted and searchable via a simple web app.
My real question before I roll this on my own, is can anyone suggest a better method for solving this problem? A specific data format or library that would fit this use case? I would normally use something like yaml or the like, but there's no good way for me to validate that the data is indeed in yaml format when it is saved.
You have 3 problems:
How to fit two different things into one box.
If you are mixing free form text and something that is more tightly defined, you are always going to end up with stuff that you can't parse. Then you will have a never ending battle of trying to deal with the rubbish that gets put in. Is there really no other way?
How to define a simple format for metadata that is robust enough for simple use.
This is a hard problem - all attempts to do so seem to expand until they become quite complicated (e.g. YAML). You will probably have custom requirements for your domain, so what you've proposed may be best.
How to parse that format.
For this I would recommend parsy.
It would be quite simple to split the text on .metadata. and then parse what remains.
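For instance (a one-line sketch, assuming the whole GUI field is in text):
description, _, meta = text.partition('.metadata.')
# description holds the free text; meta is what the parser below consumes.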
Here is an example using parsy:
from parsy import *
attribute = letter.at_least(1).concat()
name = attribute.sep_by(string("."))
value = regex(r"[^\n]+")
definition = seq(name << string(":") << string(" ").many(), value)
metadata = definition.sep_by(string("\n"))
Example usage:
>>> metadata.parse_partial("""owner.first: rick
owner.second: bob
property: blue
pets.mammals.dog: rufus
pets.mammals.cat: ludmilla""")
([[['owner', 'first'], 'rick'],
  [['owner', 'second'], 'bob'],
  [['property'], 'blue'],
  [['pets', 'mammals', 'dog'], 'rufus'],
  [['pets', 'mammals', 'cat'], 'ludmilla']],
 '')
YAML is a simple and nice solution. There is a YAML library in Python:
import yaml
output = {'a': 1, 'b': {'c': [2, 3, 4]}}
print(yaml.dump(output, default_flow_style=False))
Giving as a result:
a: 1
b:
  c:
  - 2
  - 3
  - 4
You can also parse from a string and so on. Just explore it and check if it fits your requirements.
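For instance, reading it back is a single call (safe_load is the entry point recommended for untrusted input):
import yaml
data = yaml.safe_load("a: 1\nb:\n  c: [2, 3, 4]")
print(data)  # {'a': 1, 'b': {'c': [2, 3, 4]}}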
Good luck!

Best Practices for Text Generation in Python

I'm writing a Python script that generates another Python script based on an external file. A small section of my code can be seen below. I haven't been exposed to many examples of these kinds of scripts, so I was wondering what the best practices were.
As seen in the last two lines of the code example, the techniques that I'm using can be unwieldy at times.
SIG_DICT_NAME = "sig_dict"
SIG_LEN_KEYWORD = "len"
SIG_BUS_IND_KEYWORD = "ind"
SIG_EP_ADDR_KEYWORD = "ep_addr"
KEYWORD_DEC = "{} = \"{}\""
SIG_LEN_KEYWORD_DEC = KEYWORD_DEC.format(SIG_LEN_KEYWORD, SIG_LEN_KEYWORD)
SIG_BUS_IND_KEYWORD_DEC = KEYWORD_DEC.format(SIG_BUS_IND_KEYWORD,
                                             SIG_BUS_IND_KEYWORD)
SIG_EP_ADDR_KEYWORD_DEC = KEYWORD_DEC.format(SIG_EP_ADDR_KEYWORD,
                                             SIG_EP_ADDR_KEYWORD)
SIG_DICT_DEC = "{} = dict()"
SIG_DICT_BODY_LINE = "{}[{}.{}] = {{{}:{}, {}:{}, {}:{}}}"
#line1 = SIG_DICT_DEC.format(SIG_DICT_NAME)
#line2 = SIG_DICT_BODY_LINE.format(SIG_DICT_NAME, x, y, z...)
You don't really see examples of this kind of thing because your solution might be a wee bit over-engineered ;)
I'm guessing that you're trying to collect some "state of things", and then you want to run a script to process that "state of things". Rather than writing a meta-script, what is typically far more convenient is to write a script that will do the processing (say, process.py), and another script that will do the collecting of the "state of things" (say, collect.py).
Then you can take the results from collect.py and throw them at process.py and write out todays_results.txt or some such:
collect.py -> process.py -> 20150207_results.txt
If needed, you can write intermediate files to disk with something like:
with open('todays_progress.txt', 'w') as f_out:
    for thing, state in states_of_things.items():
        f_out.write('{}<^_^>{}\n'.format(state, thing))
Then you can parse it back in later with something like:
with open('todays_progress.txt') as f_in:
    lines = f_in.read().splitlines()
states, things = zip(*(line.split('<^_^>') for line in lines))
states_of_things = dict(zip(things, states))
More complicated data structures than a flat dict? Well, this is Python. There's probably more than one module for that! Off the top of my head I would suggest json if plaintext will do, or pickle if you need some more detailed structures. Two warnings with pickle: custom objects don't always get reinstantiated well, and it's vulnerable to code injection attacks, so only use it if your entire workflow is trusted.
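For example, the same round trip with json (a sketch reusing the states_of_things dict from above):
import json

# Write the collected state out as plaintext JSON...
with open('todays_progress.json', 'w') as f_out:
    json.dump(states_of_things, f_out)

# ...and read it back in later.
with open('todays_progress.json') as f_in:
    states_of_things = json.load(f_in)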
Hope this helps!
You seem to be translating keyword-by-keyword.
It would almost certainly be better to read each "sentence" into a representative Python class; you could then run the simulation directly, or have each class write itself to an "output sentence".
Done correctly, this should be much easier to write and debug and produce more idiomatic output.
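A minimal sketch of that idea, reusing the keywords from the question (the class name and fields are hypothetical):
class Signal:
    """One parsed "sentence" from the external file."""
    def __init__(self, name, length, bus_index, ep_addr):
        self.name = name
        self.length = length
        self.bus_index = bus_index
        self.ep_addr = ep_addr

    def to_source(self):
        # Each object knows how to emit its own output line.
        return ('sig_dict[{0.name!r}] = {{"len": {0.length}, '
                '"ind": {0.bus_index}, "ep_addr": {0.ep_addr}}}'.format(self))

print(Signal('spi_clk', 8, 2, 0x81).to_source())
# sig_dict['spi_clk'] = {"len": 8, "ind": 2, "ep_addr": 129}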

embed custom form language in Python

I have a Python script that converts files from a custom form language into compilable C++ files. An example of what such a file looks like could be:
data = open_special_file_format('data.nc')
f = div(grad(data.u)) + data.g
write_special_file(f, 'out.nc')
Note that this is Python syntax and in fact it is parsed with Python's ast module. The magic that happens here is mostly in the custom keywords div, grad, and a few others.
Since this so closely resembles Python, I was asking myself if it is possible to embed this language into Python. I'm imagining something like
import mylang
data = mylang.open_special_file_format('data.nc')
f = mylang.div(mylang.grad(data.u)) + data.g
mylang.write_special_file(f, 'out.nc')
I'm not really sure though if it's possible to tell the module mylang to create and compile C++ code on the fly and insert it in the right place.
Any hints?
