Given XML objects of many classes (say, types of document images), I need to generate outputs that depend on the class of the object and on a complex set of mathematical rules relating the contents of the XML file.
What is the generic name of this task (parsing?), and what is the easiest way to encode separate rules for each class, bearing in mind that the rules may involve mathematical relationships? I think I should create a file for each class, using a DSL, to keep it manageable, but I am not sure. Someone suggested incorporating a full-blown Lua or JavaScript interpreter. Is this a good idea? I want to keep it lean and simple.
Parsing refers to reading a series of tokens and matching them against the rules of a grammar. If you can specify your problem this way, you can write the grammar using pyparsing.
If what you are interested in is extracting the structure of an XML document, then you can use the standard Python module xml.etree.ElementTree. Also look at BeautifulSoup.
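For the per-class rules, a minimal sketch (every file, attribute, tag, and rule name below is invented for illustration) might look like this:

    import xml.etree.ElementTree as ET

    # Hypothetical rule table: one callable per document class, each
    # computing an output from values extracted from the XML.
    RULES = {
        "invoice": lambda v: v["net"] * (1 + v["tax_rate"]),
        "receipt": lambda v: sum(v.values()),
    }

    tree = ET.parse("document.xml")       # assumed file name
    root = tree.getroot()
    doc_class = root.get("class")         # assumed class-bearing attribute

    # collect numeric child elements into a dict for the rule to use
    values = {e.tag: float(e.text) for e in root.find("values")}
    print(RULES[doc_class](values))

Plain Python callables (one module per class, if that keeps things manageable) are often enough before reaching for a DSL or an embedded interpreter.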
I have scripts which need different configuration data, most of the time a table or a list of parameters. For the moment I read Excel tables for this.
However, not only is reading Excel slightly buggy (Excel is just not made to be a stable data provider), but I'd also like to include some data validation and a little help for the configurators, so that input is at least partially checked. It doesn't have to be pretty, just functional. Pure text files would be too hard to read and check.
Can you suggest an easy-to-implement way to achieve this? Of course one could program complicated web interfaces and forms, but maybe that's too much effort.
What is an easy-to-edit way to provide data tables and other configuration parameters?
The configuration info is just small tables with a list of parameters, or a matrix of mathematical coefficients.
I like to use YAML. It's pretty flexible, and Python can read it as a dict using PyYAML.
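For example (the file name and keys here are made up):

    import yaml  # PyYAML: pip install pyyaml

    # config.yaml might look like:
    #   coefficients: [1.0, 0.5, 0.25]
    #   parameters:
    #     threshold: 3
    #     mode: fast
    with open("config.yaml") as f:
        config = yaml.safe_load(f)  # returns plain dicts/lists

    print(config["parameters"]["threshold"])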
It's a bit difficult to answer without knowing what the data you're working with looks like, but there are several ways you could do it. You could, for example, use something like csv or sqlite, provided the data can be easily expressed in a tabular format, but I think you might find XML best for your use case. It is very versatile and can be easy to work with if you find a good editor (e.g. Serna or oXygen); however, it might still be in your interest to write your own editor for it (which will probably not be as complicated as you think!). XML is easy to work with in Python through the standard xml.etree module, and XML schemas can be used for validation.
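If you go the XML route, here is a sketch of schema-backed validation. Note that the stdlib xml.etree module does not validate against XSDs, so this uses lxml instead; the file names and element layout are assumptions:

    from lxml import etree

    # validate the configuration file against a schema before using it
    schema = etree.XMLSchema(etree.parse("config.xsd"))
    doc = etree.parse("config.xml")
    if not schema.validate(doc):
        raise ValueError(schema.error_log.last_error)

    # read a coefficient matrix, assuming rows like <row>1.0 2.0</row>
    matrix = [[float(x) for x in row.text.split()]
              for row in doc.xpath("//coefficients/row")]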
I'm looking for a tool that will play nicely with Python.
Except for my Python requirement, my question is the same as this one:
"I am looking for a tool which will take an XML instance document and output a corresponding XSD schema."
According to the PyCharm docs, PyCharm has a facility for this, but it is not exposed as an API your program can call. You are probably better off using XML Schema Learner as a separate program, since it is a command-line tool (subprocess friendly!).
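Driving it from Python is then a plain subprocess call. The executable name and options below are assumptions; check the XML Schema Learner documentation for its actual command line:

    import subprocess

    # hypothetical invocation; adjust to the tool's real CLI
    result = subprocess.run(
        ["xml-schema-learner", "instance.xml"],
        capture_output=True, text=True, check=True,
    )
    with open("inferred.xsd", "w") as f:
        f.write(result.stdout)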
Are you looking for something like pyxsd (primarily used for validation against a schema)? Or maybe PyXB (which can generate classes based on XML)? Otherwise, I don't think there's a tool [yet] that will generate the schema from within Python. Could you do it on demand using something like xsd.exe? Does it have to be programmatic/repeatable?
Currently, there is no module that will run inside your Python program and do this conversion. But I see creating an XSD schema from XML as a tooling problem: it's the kind of functionality I'd use once, to get a schema started, and after that I'd maintain the schema myself. From a single XML file, an XSD generator can only create a starting point for a real schema; it cannot infer all the functionality and options XSD offers.
Basically, I don't see the need for this conversion to run as a module inside my code, generating new XSDs every time the XML changes. After all, it's the schema that defines the XML, not the other way around.
As end-user pointed out, you could use xsd.exe, but you might also want to look at other tools such as trang (a bit old) for Java, and Stylus Studio (an XML tool).
Maybe a silly question, but I usually learn a lot with those. :)
I'm working on software that deals a lot with XML, both as input and as output, with a lot of processing in between.
My first thought was to use a dict as the internal data structure, and from there work my way through reading and writing it.
What do you guys think? Any better approach, Python-wise?
An XML document in general is a tree with lots of bells and whistles (attributes vs. child nodes, mixing of text with child nodes, entities, XML declarations, comments, and many more). Handling that should be left to existing, mature libraries; for Python, it's commonly agreed that lxml is the most convenient choice, followed by the stdlib ElementTree modules (by which one lxml module, lxml.etree, is so heavily inspired that incompatibilities are the exception).
These handle all that complexity and expose it in manageable ways with many convenience methods (lxml's XPath support has saved me a lot of code). After parsing, programs can of course go on to convert the trees into simpler data structures that fit the data actually being modeled much better. Which data structures are possible and sensible depends on what you want to represent (if you misuse XML as a flat key-value store, for instance, you could indeed convert the tree into a dictionary).
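A sketch of that last case, assuming the document really is flat, e.g. <config><timeout>30</timeout>...</config> (file and tag names invented):

    from lxml import etree

    root = etree.parse("config.xml").getroot()

    # collapse a flat <config> document into a plain dict
    config = {child.tag: child.text for child in root}

    # and the XPath convenience mentioned above, for a quick query
    timeouts = root.xpath("//timeout/text()")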
It depends completely on what type of data you have in XML, what kinds of processing you'll need to do with it, what sort of output you'll need to produce from it, etc.
I do mean the ??? in the title, because I'm not exactly sure. Let me explain the situation.
I'm not a computer science student and I never took a compilers course. Until now I thought that compiler writers, or students who took a compilers course, must be outstanding, because they had to write the parser component of the compiler in whatever language they were writing the compiler in. That's not an easy job, right?
I'm dealing with Information Retrieval problem. My desired programming language is Python.
Parser Nature:
http://ir.iit.edu/~dagr/frDocs/fr940104.0.txt is the sample corpus. This file contains around 50 documents with some XML-style markup (you can see it at the link above). I need to note down some other values like <DOCNO> FR940104-2-00001 </DOCNO> and <PARENT> FR940104-2-00001 </PARENT>, and I only need to index the <TEXT> </TEXT> portion of each document, which contains some varying tags that I need to strip out, a lot of <!-- --> comments that are to be ignored, and some &hyph; and &space; character entities. I don't know why the corpus has things like this when it's known that it is neither meant to be rendered by a browser nor to be a proper XML document.
I thought of using a Python XML parser to extract the desired text. But after a little searching I found JavaCC parser source code (Parser.jj) for the same corpus I'm using. A quick lookup of JavaCC, and then of "compiler-compiler", revealed that compiler writers aren't as superhuman as I thought: they use a compiler-compiler to generate parser code in the desired language. Wikipedia says the input to a compiler-compiler is a grammar (usually in BNF). This is where I'm lost.
Is Parser.jj the grammar (the input to the compiler-compiler called JavaCC)? It's definitely not BNF. What is this grammar called? Why does this grammar contain Java code? Isn't there any universal grammar language?
I want a Python parser for parsing the corpus. Is there any way I can translate Parser.jj to get a Python equivalent? If yes, what is it? If no, what are my other options?
By any chance, does anyone know what this corpus is? Where is its original source? I would like to see some description of it. It is distributed on the internet under the name frDocs.tar.gz.
Why do you call this "XML-style" markup? It looks like pretty standard/basic XML to me.
Try ElementTree or lxml. Instead of writing a parser, use one of the stable, well-hardened libraries that are already out there.
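Since the corpus is not quite well-formed (the &hyph;/&space; entities and the comments), one approach is to clean it up first and let lxml's recovering parser handle the rest. This is only a sketch: the <DOC> wrapper tag is an assumption about the corpus layout, and the cleanup regexes cover only the problems mentioned in the question:

    import re
    from lxml import etree

    with open("fr940104.0.txt") as f:
        raw = f.read()

    # strip the comments and the non-XML entities the question mentions
    raw = re.sub(r"<!--.*?-->", "", raw, flags=re.DOTALL)
    raw = re.sub(r"&(hyph|space);", " ", raw)

    # parse leniently; recover=True lets lxml skip remaining malformedness
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring("<corpus>%s</corpus>" % raw, parser=parser)

    for doc in root.iter("DOC"):                  # assumed record tag
        docno = (doc.findtext("DOCNO") or "").strip()
        text_el = doc.find("TEXT")
        body = " ".join(text_el.itertext()) if text_el is not None else ""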
You can't build a parser, let alone a whole compiler, from a(n E)BNF grammar alone: it's just the grammar, i.e. the syntax (and some syntax, like Python's indentation-based block rules, can't be modeled in it at all), not the semantics. Either you use separate tools for these aspects, or you use a more advanced framework (like Boost::Spirit in C++ or Parsec in Haskell) that unifies both.
JavaCC (like yacc) is responsible for generating a parser, i.e. the subprogram that makes sense of the tokens read from the source code. For this, such tools mix an (E)BNF-like notation with code written in the language the resulting parser will be in (e.g. for building a parse tree), in this case Java. Of course it would be possible to make up another language, but since the existing languages handle those tasks relatively well, it would be rather pointless. And since other parts of the compiler might be written by hand in the same language, it makes sense to leave the "I got ze tokens, what do I do wit them?" part to the person who will write those other parts ;)
I never heard of "PythonCC", and Google didn't either (well, there's a "pythoncc" project on Google Code, but its description just says "pythoncc is a program that tries to generate optimized machine code for Python scripts." and there has been no commit since March). Do you mean one of these Python parsing libraries/tools? I don't think there's a way to automatically convert the JavaCC code to a Python equivalent, but the whole thing looks rather simple, so if you dive in and learn a bit about parsing with JavaCC and the [Python library/tool of your choice], you might be able to translate it...
I recently wrote a parser in Python using Ply (a Python reimplementation of yacc). When I was almost done with the parser, I discovered that the grammar I need to parse requires me to do some lookup during parsing to inform the lexer. Without a lookup to inform the lexer, I cannot correctly parse the strings in the language.
Given that I can control the state of the lexer from the grammar rules, I think I'll solve my use case using a lookup table in the parser module, but it may become too difficult to maintain and test. So I want to know about some of the other options.
In Haskell I would use Parsec, a library of parsing functions (known as combinators). Is there a Python implementation of Parsec? Or perhaps some other production quality library full of parsing functionality so I can build a context sensitive parser in Python?
EDIT: All my attempts at context free parsing have failed. For this reason, I don't expect ANTLR to be useful here.
I believe that pyparsing is based on the same principles as parsec.
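A small taste of the combinator style in pyparsing (the grammar here is a made-up example):

    from pyparsing import Word, alphas, nums, Group, Suppress, delimitedList

    # parse calls like "point(1, 2)" with parser combinators
    name = Word(alphas)
    number = Word(nums).setParseAction(lambda t: int(t[0]))
    call = (name("func") + Suppress("(")
            + Group(delimitedList(number))("args") + Suppress(")"))

    result = call.parseString("point(1, 2)")
    print(result.func, list(result.args))   # -> point [1, 2]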
PySec is another monadic parser; I don't know much about it, but it's worth looking at here.
An option you may consider, if an LL parser is OK for you, is to give ANTLR a try; it can generate Python too (actually it is LL(*), as they call it; the * stands for the amount of lookahead it can cope with).
Nothing prevents you from diverting your parser from the "context-free" path using PLY. You can pass information to the lexer during parsing, and in this way achieve full flexibility. I'm pretty sure you can parse anything you want with PLY this way.
For a hands-on example, consider this project: a parser for ANSI C written in Python with PLY. It solves the classic C typedef-identifier problem (which makes C's grammar non-context-free) by populating a symbol table in the parser that the lexer then uses to resolve names as either types or plain identifiers.
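Here is a minimal sketch of that feedback loop in PLY (the grammar and all names are invented for illustration, not taken from that project). Note the lookahead wrinkle: the name has to be registered one rule early, before the parser fetches the token that follows the semicolon:

    import ply.lex as lex
    import ply.yacc as yacc

    typedef_names = set()        # symbol table shared by parser and lexer

    tokens = ("TYPEDEF", "TYPE", "ID", "SEMI")
    t_SEMI = r";"
    t_ignore = " \t\n"

    def t_ID(t):
        r"[A-Za-z_]\w*"
        if t.value == "typedef":
            t.type = "TYPEDEF"
        elif t.value in typedef_names:
            t.type = "TYPE"      # the lexer's decision depends on parser state
        return t

    def t_error(t):
        t.lexer.skip(1)

    def p_program(p):
        """program : program decl
                   | decl"""

    def p_typedef_name(p):
        "typedef_name : TYPEDEF ID"
        typedef_names.add(p[2])  # register before the next token is lexed

    def p_decl_typedef(p):
        "decl : typedef_name SEMI"

    def p_decl_var(p):
        "decl : TYPE ID SEMI"
        print("declared %s of type %s" % (p[2], p[1]))

    def p_error(p):
        print("syntax error at", p)

    lex.lex()
    yacc.yacc()
    yacc.parse("typedef myint; myint x;")  # -> declared x of type myint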
There's ANTLR, which is LL(*); there's PyParsing, which is more object-friendly and is sort of like a DSL; and then there's Parsing, which is like OCaml's Menhir.
ANTLR is great and has the added benefit of working across multiple languages.