Generating a parser in Python from JavaCC source?

I do mean the question mark in the title, because I'm not exactly sure how to ask this. Let me explain the situation.
I'm not a computer science student and I never took a compilers course. Until now I assumed that compiler writers, and students who took a compilers course, were outstanding because they had to write the parser component of the compiler in whatever language the compiler itself was written in. That's not an easy job, right?
I'm dealing with an Information Retrieval problem, and my language of choice is Python.
Parser Nature:
http://ir.iit.edu/~dagr/frDocs/fr940104.0.txt is the sample corpus. This file contains around 50 documents with some XML-style markup (you can see it at the link above). I need to record some other values, like <DOCNO> FR940104-2-00001 </DOCNO> and <PARENT> FR940104-2-00001 </PARENT>, and I only need to index the <TEXT> </TEXT> portion of each document, which contains some varying tags that I need to strip out, a lot of <!-- --> comments to be ignored, and some &hyph; and &space; character entities. I don't know why the corpus contains things like this when it's known that it's neither meant to be rendered by a browser nor a proper XML document.
I thought of using a Python XML parser to extract the desired text. But after a little searching I found JavaCC parser source code (Parser.jj) for the very corpus I'm using here. A quick lookup on JavaCC, followed by compiler-compilers, revealed that compiler writers aren't as superhuman as I thought after all: they use a compiler-compiler to generate parser code in the desired language. Wikipedia says the input to a compiler-compiler is a grammar (usually in BNF). This is where I'm lost.
Is Parser.jj the grammar (the input to the compiler-compiler called JavaCC)? It's definitely not BNF. What is this grammar called? Why does this grammar contain Java code? Isn't there any universal grammar language?
I want a Python parser for this corpus. Is there any way I can translate Parser.jj to get a Python equivalent? If yes, how? If no, what are my other options?
By any chance, does anyone know what this corpus is and where its original source is? I would like to see a description of it. It is distributed on the internet under the name frDocs.tar.gz.

Why do you call this "XML-style" markup? This looks like pretty standard/basic XML to me.
Try ElementTree or lxml. Instead of writing a parser, use one of the stable, well-hardened libraries that are already out there.
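Since the corpus isn't well-formed XML (stray entities, comments, SGML-ish tags), a plain regular-expression pass may actually be the simplest option. Here is a minimal sketch under the assumptions stated in the question; the <DOC> wrapper element is an assumption about the TREC-style file layout, not something confirmed above:
import re

def parse_documents(raw):
    # Each document is assumed to be wrapped in <DOC> ... </DOC>.
    for doc in re.findall(r"<DOC>(.*?)</DOC>", raw, re.DOTALL):
        docno = re.search(r"<DOCNO>\s*(.*?)\s*</DOCNO>", doc, re.DOTALL)
        parent = re.search(r"<PARENT>\s*(.*?)\s*</PARENT>", doc, re.DOTALL)
        text = re.search(r"<TEXT>(.*?)</TEXT>", doc, re.DOTALL)
        body = text.group(1) if text else ""
        body = re.sub(r"<!--.*?-->", " ", body, flags=re.DOTALL)  # drop comments
        body = re.sub(r"</?[A-Za-z][^>]*>", " ", body)            # strip inner tags
        body = re.sub(r"&(hyph|space);", " ", body)               # entities seen in the corpus
        yield {"docno": docno.group(1) if docno else None,
               "parent": parent.group(1) if parent else None,
               "text": " ".join(body.split())}

with open("fr940104.0.txt") as f:
    for record in parse_documents(f.read()):
        print(record["docno"], record["parent"], record["text"][:60])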

You can't build a parser - let alone a whole compiler - from an (E)BNF grammar alone: the grammar only describes the syntax (and some syntax, like Python's indentation-based block rules, can't be modeled in it at all), not the semantics. Either you use separate tools for these aspects, or you use a more advanced framework (like Boost::Spirit in C++ or Parsec in Haskell) that unifies both.
JavaCC (like yacc) is responsible for generating a parser, i.e. the subprogram that makes sense of the tokens read from the source code. For this, such tools mix an (E)BNF-like notation with code written in the language the resulting parser will be in (e.g. for building a parse tree) - in this case, Java. Of course it would be possible to make up yet another language for that code, but since existing languages handle those tasks well, it would be rather pointless. And since other parts of the compiler may be written by hand in the same language, it makes sense to leave the "I got ze tokens, what do I do wit them?" part to the person who will write those other parts ;)
I've never heard of "PythonCC", and Google hasn't either (well, there's a "pythoncc" project on Google Code, but its description just says "pythoncc is a program that tries to generate optimized machine code for Python scripts", and there have been no commits since March). Do you mean one of these Python parsing libraries/tools? I don't think there's a way to automatically convert the JavaCC code to a Python equivalent - but the whole thing looks rather simple, so if you dive in and learn a bit about parsing with JavaCC and the Python library/tool of your choice, you should be able to translate it.

Related

How to select a subset from Python for parsing purpose

I am working on an assignment and need to develop a Python-to-OpenModelica translator, for which I am using flex and bison in the initial stages. First I need to define a subset of the Python language on which I can run a whole demo. I am new to the Python language - can anybody suggest how I can define a subset of it? Thanks.
Well, as you are probably not interested in writing it in Python itself, I guess the language reference is the best starting point. It defines the whole grammar of the language, so it is a good place to find the features you want to implement on your own; then you need to write your own grammar and a parser for it in your language of choice.
Otherwise, you could use Python's built-in language services to parse real Python code and extract, for example, abstract syntax trees.
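For instance, the standard library's ast module does exactly this - a small illustration, using nothing beyond the stdlib:
import ast

source = "x = 1 + 2\nprint(x)"
tree = ast.parse(source)              # parse real Python into an AST
print(ast.dump(tree))                 # inspect the tree structure

# Walk the tree, e.g. to collect every name bound by a simple assignment:
names = [node.targets[0].id
         for node in ast.walk(tree)
         if isinstance(node, ast.Assign)]
print(names)                          # ['x']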
But if you are meant to handle only a subset, I don't think having the full language capabilities will do you any good. So you had better start off with a real subset of the grammar. A good way to decide which features to take over is to use the language yourself for a bit: do some tutorials and see how the basic syntax works.

How is the Python grammar generated, and how does the interpreter understand it?

I wonder how the grammar of the Python language is generated and how it is understood by the interpreter.
In CPython, the file graminit.c seems to implement the grammar, but I don't clearly understand it.
More broadly, what are the different ways to generate a grammar, and are there differences in how the grammar is implemented in languages such as Perl, Python, or Lua?
Grammars are generally of the same form: Backus-Naur Form (BNF) is typical.
Lexer/parsers can take very different forms.
The lexer breaks up the input file into tokens. The parser uses the grammar to see if the stream of tokens is "valid" according to its rules.
Usually the outcome is an abstract syntax tree (AST) that can then be used to generate whatever you want, such as byte code or assembly.
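You can watch this pipeline on Python's own source: the standard tokenize module exposes the lexer's token stream. A tiny illustration, assuming nothing beyond the stdlib:
import io
import tokenize

code = "total = price * 2"
for tok in tokenize.generate_tokens(io.StringIO(code).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'total', OP '=', NAME 'price', OP '*', NUMBER '2', ...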
There are many ways to implement lexing/parsing; it really comes down to identifying the patterns and how they fit together. There are a few very nice Python packages for this, ranging from pure Python to wrapped C code. Pyparsing in particular has many excellent examples. One thing worth noting: finding a parser that consumes straight EBNF/BNF is kind of hard - writing the grammar as Python code isn't awful, but it is one step removed from the raw grammar, which might be important to you.
Code Talker
SimpleParse
Python Lex Yacc
pyparsing

Parsing XML - right scripting languages / packages for the job?

I know that any language is capable of parsing XML; I'm really just looking for advantages or drawbacks that you may have come across in your own experiences. Perl would be my standard go to here, but I'm open to suggestions.
Thanks!
UPDATE: I ended up going with XML::Simple which did a nice job, but I have one piece of advice if you plan to use it--research the forcearray option first. I had to rewrite a bunch of statements after learning that it is usually best practice to set forcearray. This page had the clearest explanation that I could find. Frankly, I'm surprised this isn't the default behavior.
If you are using Perl then I would recommend XML::Simple:
As more and more Web sites begin using XML for their content, it's increasingly important for Web developers to know how to parse XML data and convert it into different formats. That's where the Perl module called XML::Simple comes in. It takes away the drudgery of parsing XML data, making the process easier than you ever thought possible.
XML::Twig is very nice, especially because it’s not as awfully verbose as some of the other options.
For pure XML parsing, I wouldn't use Java, C#, C++, C, etc. They tend to overcomplicate things, as in you want a banana and get the gorilla with it as well.
Higher-level and interpreted languages such as Perl, PHP, Python, Groovy are more suitable. Perl is included in virtually every Linux distro, as is PHP for the most part.
I've used Groovy recently for exactly this and found it very easy. Mind you, though, that a C parser will be orders of magnitude faster than Groovy, for instance.
It's all going to be in the libraries.
Python has great libraries for XML. My preference is lxml. It uses libxml2/libxslt, so it's fast, but the Python bindings make it really easy to use. Perl may very well have equally awesome OO libraries.
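A minimal lxml sketch (the document here is made up) to show how easy the binding is:
from lxml import etree

xml = b"<catalog><book id='1'><title>Dune</title></book></catalog>"
root = etree.fromstring(xml)
for book in root.iter("book"):
    print(book.get("id"), book.findtext("title"))    # 1 Dune
print(root.xpath("//title/text()"))                  # ['Dune'] - XPath via libxml2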
I saw that people recommend XML::Simple if you decide on Perl.
While XML::Simple is indeed very simple to use and great, it is a DOM parser. As such it is, sadly, completely unsuitable for processing large XML files, as your process will run out of memory (this is a common problem for any DOM parser, not limited to XML::Simple or Perl).
So, for large files you must pick a SAX parser in whichever language you choose (there are many XML SAX parsers in Perl, or you can use another stream parser like XML::Twig, which is even better than a standard SAX parser; I can't speak for other languages).
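The same streaming idea is available in Python: xml.etree's iterparse visits elements as they are parsed and lets you discard them immediately, keeping memory flat. A sketch (big.xml and the <record>/<id> tags are hypothetical):
import xml.etree.ElementTree as ET

for event, elem in ET.iterparse("big.xml", events=("end",)):
    if elem.tag == "record":
        print(elem.findtext("id"))   # handle one record at a time
        elem.clear()                 # free the subtree we just processed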
Not exactly a scripting language, but you could also consider Scala. You can start from here.
Scala's XML support is rather good, especially as XML can just be typed directly into Scala programs.
Microsoft also did some cool integrated stuff with their LINQ for XML.
But I really like ElementTree, and that package alone is a good reason to use Python instead of Perl ;)
Here's an example:
import xml.etree.ElementTree as ET  # in the stdlib since Python 2.5; older code used the standalone 'elementtree' package
# build a tree structure
root = ET.Element("html")
head = ET.SubElement(root, "head")
title = ET.SubElement(head, "title")
title.text = "Page Title"
body = ET.SubElement(root, "body")
body.set("bgcolor", "#ffffff")
body.text = "Hello, World!"
# wrap it in an ElementTree instance, and save as XML
tree = ET.ElementTree(root)
tree.write("page.xhtml")
It's not a scripting language, but Scala is great for working with XML natively. Also, see this book (draft) by Burak.
Python has some pretty good support for XML, from the standard library DOM packages to much more 'pythonic' libraries that parse XML directly into more usable object structures.
There isn't really a 'right' language though... there are good XML packages for most languages nowadays.
If you're going to use Ruby to do it then you're going to want to take a look at Nokogiri or Hpricot. Both have their strengths and weaknesses. The language and package selection really comes down to what you want to do with the data after you've parsed it.
Reading data out of XML files is dead easy with C# and LINQ to XML!
Somehow, although I really love Python, I found it hard to parse XML with the standard libraries.
I would say it depends like everything else. VB.NET 2008 uses XML literals, has IntelliSense for LINQ to XML, and a few power toys that help turn XML into XSD. So personally, if you are working in a .NET environment I think this is the best choice.

Python implementation of Parsec?

I recently wrote a parser in Python using PLY (a Python reimplementation of yacc). When I was almost done with the parser, I discovered that the grammar I need to parse requires me to do some lookup during parsing to inform the lexer. Without that lookup to inform the lexer, I cannot correctly parse the strings in the language.
Given that I can control the state of the lexer from the grammar rules, I think I'll solve my use case with a lookup table in the parser module, but it may become too difficult to maintain/test. So I want to know about some of the other options.
In Haskell I would use Parsec, a library of parsing functions (known as combinators). Is there a Python implementation of Parsec? Or perhaps some other production quality library full of parsing functionality so I can build a context sensitive parser in Python?
EDIT: All my attempts at context-free parsing have failed. For this reason, I don't expect ANTLR to be useful here.
I believe that pyparsing is based on the same principles as Parsec.
PySec is another monadic parser; I don't know much about it, but it's worth looking at.
An option you may consider, if an LL parser is OK for you, is to give ANTLR a try; it can generate Python too (actually it is LL(*), as they call it - the * stands for the amount of lookahead it can cope with).
Nothing prevents you from diverting your parser from the "context-free" path using PLY. You can pass information to the lexer during parsing, and in this way achieve full flexibility. I'm pretty sure you can parse anything you want with PLY this way.
For a hands-on example, consider pycparser - a parser for ANSI C written in Python with PLY. It solves the classic C typedef-identifier problem (the one that makes C's grammar not context-free) by populating a symbol table in the parser that is used in the lexer to resolve symbol names as either types or identifiers.
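Here is a minimal sketch of that feedback loop in PLY, using a made-up typedef-like toy grammar (this is not pycparser's actual code): the parser adds names to a shared set, and the lexer consults it to decide between ID and TYPEID.
import ply.lex as lex
import ply.yacc as yacc

tokens = ('TYPEDEF', 'TYPEID', 'ID', 'SEMI')
t_SEMI = r';'
t_ignore = ' \t\n'

type_names = {'int'}            # shared state between parser and lexer

def t_ID(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    if t.value == 'typedef':
        t.type = 'TYPEDEF'
    elif t.value in type_names:
        t.type = 'TYPEID'       # the lookup that makes lexing context-sensitive
    return t

def t_error(t):
    t.lexer.skip(1)

def p_stmts(p):
    '''stmts : stmts stmt
             | stmt'''

def p_typedef_head(p):
    'typedef_head : TYPEDEF TYPEID ID'
    # Register the new name *before* the lexer reaches its next use.
    type_names.add(p[3])

def p_stmt_typedef(p):
    'stmt : typedef_head SEMI'

def p_stmt_decl(p):
    'stmt : TYPEID ID SEMI'

def p_error(p):
    print('Syntax error at', p)

lex.lex()
parser = yacc.yacc()
parser.parse('typedef int myint; myint x;')   # 'myint' lexes as TYPEID the second time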
There's ANTLR, which is LL(*); there's PyParsing, which is more object-friendly and reads sort of like a DSL; and then there's Parsing, which is like OCaml's Menhir.
ANTLR is great and has the added benefit of working across multiple languages.

Resources for lexing, tokenising and parsing in python

Can people point me to resources on lexing, parsing and tokenising with Python?
I'm doing a little hacking on an open source project (hotwire) and wanted to make a few changes to the code that lexes, parses and tokenises the commands entered into it. As it is real working code, it is fairly complex and a bit hard to work out.
I haven't worked on code to lex/parse/tokenise before, so I was thinking one approach would be to work through a tutorial or two on this aspect. I would hope to learn enough to navigate around the code I actually want to alter. Is there anything suitable out there? (Ideally it could be done in an afternoon without having to buy and read the dragon book first ...)
Edit (7 Oct 2008): None of the answers below quite gives what I want. With them I could generate parsers, but I want to learn how to write my own basic parser from scratch, not using lex and yacc or similar tools. Having done that, I can then understand the existing code better.
So could someone point me to a tutorial where I can build a basic parser from scratch, using just python?
I'm a happy user of PLY. It is a pure-Python implementation of lex & yacc, with lots of small niceties that make it quite Pythonic and easy to use. Since lex & yacc are the most popular lexing and parsing tools and are used in the most projects, PLY has the advantage of standing on the shoulders of giants. A lot of knowledge about lex & yacc exists online, and you can freely apply it to PLY.
PLY also has a good documentation page with some simple examples to get you started.
For a listing of lots of Python parsing tools, see this.
This question is pretty old, but maybe my answer will help someone who wants to learn the basics. I find this resource very good: it is a simple interpreter written in Python without the use of any external libraries, so it will help anyone who wants to understand the internal workings of parsing, lexing, and tokenising:
"A Simple Interpreter from Scratch in Python": Part 1, Part 2, Part 3, and Part 4.
For medium-complex grammars, PyParsing is brilliant. You can define grammars directly within Python code, no need for code generation:
>>> from pyparsing import Word, alphas
>>> greet = Word( alphas ) + "," + Word( alphas ) + "!" # <-- grammar defined here
>>> hello = "Hello, World!"
>>> print(hello, "->", greet.parseString( hello ))
Hello, World! -> ['Hello', ',', 'World', '!']
(Example taken from the PyParsing home page).
With parse actions (functions that are invoked when a certain grammar rule is triggered), you can convert parses directly into abstract syntax trees, or any other representation.
There are many helper functions that encapsulate recurring patterns, like operator hierarchies, quoted strings, nesting or C-style comments.
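A small sketch of such a parse action: attach a function to a rule, and the matched tokens are converted on the fly.
from pyparsing import Word, nums, delimitedList

integer = Word(nums).setParseAction(lambda t: int(t[0]))   # text -> int
numbers = delimitedList(integer)

print(numbers.parseString("1, 2, 3").asList())   # [1, 2, 3] - real ints, not strings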
Pygments is a source code syntax highlighter written in Python. It has lexers and formatters, and it may be interesting to peek at its source.
Here are a few things to get you started (roughly from simplest to most complex, least to most powerful):
http://en.wikipedia.org/wiki/Recursive_descent_parser
http://en.wikipedia.org/wiki/Top-down_parsing
http://en.wikipedia.org/wiki/LL_parser
http://effbot.org/zone/simple-top-down-parsing.htm
http://en.wikipedia.org/wiki/Bottom-up_parsing
http://en.wikipedia.org/wiki/LR_parser
http://en.wikipedia.org/wiki/GLR_parser
When I learned this stuff, it was in a semester-long 400-level university course. We did a number of assignments where we did parsing by hand; if you want to really understand what's going on under the hood, I'd recommend the same approach.
This isn't the book I used, but it's pretty good: Principles of Compiler Design.
Hopefully that's enough to get you started :)
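In that spirit, here is a sketch of a tiny hand-written recursive descent parser for arithmetic - one function per grammar rule, no tools involved (a sketch, not a robust implementation):
import re

def tokenize(s):
    # numbers, operators and parentheses; whitespace is skipped
    return re.findall(r"\d+|[-+*/()]", s)

class Parser:
    # expr   -> term (('+' | '-') term)*
    # term   -> factor (('*' | '/') factor)*
    # factor -> NUMBER | '(' expr ')'
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.tokens[self.pos]
        if expected is not None and tok != expected:
            raise SyntaxError("expected %r, got %r" % (expected, tok))
        self.pos += 1
        return tok

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.eat()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.eat()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        if self.peek() == "(":
            self.eat("(")
            value = self.expr()
            self.eat(")")
            return value
        return int(self.eat())

print(Parser(tokenize("2 + 3 * (4 - 1)")).expr())   # 11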
Have a look at the standard module shlex and modify a copy of it to match the syntax you use for your shell; it is a good starting point.
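For instance, shlex already splits a command line the way a POSIX shell would:
import shlex

print(shlex.split('cp -r "my docs" /tmp'))   # ['cp', '-r', 'my docs', '/tmp']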
If you want all the power of a complete lexing/parsing solution, ANTLR can generate Python too.
Federico Tomassetti has a good, concise write-up on everything from BNF to binary deciphering, covering:
lexical analysis,
parsing,
abstract syntax trees (ASTs), and
constructs/code generators.
He even mentions the newer parsing expression grammars (PEGs).
https://tomassetti.me/parsing-in-python/
I suggest http://www.canonware.com/Parsing/, since it is pure Python and you don't need to learn a grammar, but it isn't widely used and has comparatively little documentation. The heavyweights are ANTLR and PyParsing. ANTLR can generate Java and C++ parsers too, as well as AST walkers, but you will have to learn what amounts to a new language.
