How to parse XPath expressions in Python?

I need to parse (not evaluate) XPath expressions in Python in order to change them, e.g. I have expressions like
//div[...whatever...]//some-other-node...
and I need to change them to (for example):
/changed-node[@attr='value' and ...whatever...]/another-changed-node[@attr='value' ...
It seems to me that I need to split the original expression into steps, and the steps into axes, node tests, and predicates. Is there some tool I can do this with, or is there a nice and easy way to do it without one?
The catch is that I can't be sure the predicates of the original expressions won't contain something like [@id='some/value/with/slashes'], so I can't parse them with naive regexes.

You might be able to use the REx parser generator from Gunther Rademacher (see http://www.bottlecaps.de/rex/). It will generate a parser for any grammar from a suitable BNF, and suitable BNF for the various XPath versions is available. REx is a superb piece of technology spoilt only by extremely poor documentation. It can generate parsers in a number of languages including JavaScript, XQuery, and XSLT. It is used in the Saxon-JS product to parse dynamic XPath expressions within the browser.
Another approach is to use the XQuery-to-XQueryX converters available from W3C (XPath is a subset of XQuery, so these handle XPath as well). They produce a representation of the syntax tree in XML.
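If you end up doing it in Python without a full parser, the key is to tokenize quoted string literals before splitting on slashes. A rough sketch (not a full XPath parser, just enough to split an expression into location steps without being fooled by slashes inside quoted predicate values):

import re

# Tokenize so quoted strings are opaque, then split into steps at
# top-level "/" and "//" only (slashes inside predicates are kept).
TOKEN = re.compile(r"""
      '[^']*'            # single-quoted string literal
    | "[^"]*"            # double-quoted string literal
    | //                 # descendant-or-self separator
    | /                  # child separator
    | \[ | \]            # predicate brackets
    | [^'"/\[\]]+        # anything else (names, axes, operators)
""", re.VERBOSE)

def split_steps(xpath):
    steps, current, depth = [], "", 0
    for tok in TOKEN.findall(xpath):
        if tok == "[":
            depth += 1
        elif tok == "]":
            depth -= 1
        if tok in ("/", "//") and depth == 0:
            if current:
                steps.append(current)
            current = tok          # keep the separator with the step that follows
        else:
            current += tok
    if current:
        steps.append(current)
    return steps

print(split_steps("//div[@id='some/value/with/slashes']//some-other-node"))
# ["//div[@id='some/value/with/slashes']", '//some-other-node']

Once you have individual steps, splitting each one into axis, node test, and predicates is comparatively easy; for anything heavier you would want one of the generated parsers mentioned above.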

Related

Obtain parse tree for Python code

I would like to be able to generate a parse tree for Python source code. This code does not have to be compilable, e.g.
if x == 5:
should be turned into some sort of tree representation. I can use the Python compiler package to create a tree, but that only works for code that is compilable, e.g.
if x == 5: print True
The paper you linked to says that they used the ast module in the Python standard library. It also says they used a dummy body for the body of the if statement. Use a statement that will be easy to recognize as a dummy body, like pass or a function call like dummy().
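A minimal sketch of that trick using the standard ast module (dummy() is just an arbitrary, easy-to-spot placeholder):

import ast

# "if x == 5:" alone is a syntax error, so append a recognizable dummy body
# before parsing, then ignore it when walking the resulting tree.
fragment = "if x == 5:"
tree = ast.parse(fragment + "\n    dummy()")

if_node = tree.body[0]               # the ast.If node
print(ast.dump(if_node.test))        # the parsed condition, x == 5, as a Compare node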
Our DMS Software Reengineering Toolkit with its Python front end can do this.
DMS provides infrastructure for parsing code, parameterized by a language definition (e.g., a Python grammar) and automatically building ASTs, as well as the ability to inspect/navigate/change those ASTs and prettyprint the resulting modified trees.
Its AST parsing machinery can handle a variety of special cases:
Parsing files or strings ("streams") as a full (Python) program.
Syntax errors in a stream are reported and, if repairable by single-token insertion or deletion, are repaired.
Parsing a stream according to an arbitrary language nonterminal.
Parsing a pattern, corresponding to a named grammar nonterminal with named placeholders for the missing subtrees. A parsed pattern can be matched against concrete ASTs to decide whether they match and, if so, to provide bindings for the pattern variables.
Parsing a valid arbitrary substring. This returns a tree with possible missing left or right children, which define the left and right ends of the substring.
For instance, OP could write the following pattern to handle his example:
pattern if_x_is_5(s: statement):statement
= " if x==5: \s ";
DMS will read that pattern and build the corresponding pattern tree.
The paper that OP references really wants operators and keywords to remain as explicit artifacts in the AST. One way to interpret that is that they really want a concrete syntax tree. DMS actually produces "AST"s which are concrete syntax trees with the constant terminals removed; this has the effect of being very close to what a perfect AST should be, but one can easily determine for any leaf node where constant terminals should be inserted (or one can configure DMS to simply produce the uncompressed CSTs).
Personally, I don't see how the goal of the paper OP is interested in can really succeed in providing useful pseudo-code (in spite of its claims). Understanding an algorithm requires understanding the corresponding data structures and the abstract and concrete algorithms being applied to those data structures. The paper focuses only on raw language syntax; there is no hint of understanding the more abstract ideas.

What is the advantage of using AST & Tree walker for translating a DSL instead of directly parsing & translating objects from parser grammar?

I am creating a Domain Specific Language using ANTLR3. So far, I have directly translated the parsed objects from inside the parser grammar. Going through the examples of ASTs and tree walkers, I came to know that they are normally used to divide the grammar into a hierarchical tree and translate objects from the nodes. Currently I am doing the same sort of thing using the parser grammar, where I translate objects from each sub-rule. I would be more than happy to know the advantages of using an AST and tree walker over just using parser grammars. Thank you in advance.
One advantage of using tree parsers is that you can organize them into multiple passes. For some translation work I did, I was able to use seven passes and separate logical steps into their own passes. One pass did expression analysis, one did control-flow analysis, and others used that analysis to eliminate dead code or to simplify the translation for special cases.
I personally like using tree grammars for the same reason I like using parsers for text grammars. It allows me to use rules to organize the parsing context. It's easy to do things like structure rules to recognize a top-level expression versus a subexpression if you need to distinguish between them for recognition purposes. All of the attribute and context management you use in regular parsers can apply to tree parsers.
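The same separation can be sketched in plain Python with the ast module; this is only an illustration of the multi-pass idea (an analysis pass followed by a transformation pass), not ANTLR-specific code:

import ast

# Pass 2 uses facts gathered in pass 1 to remove dead stores.
class DropUnusedAssignments(ast.NodeTransformer):
    def __init__(self, used_names):
        self.used_names = used_names
    def visit_Assign(self, node):
        targets = [t.id for t in node.targets if isinstance(t, ast.Name)]
        if targets and not any(name in self.used_names for name in targets):
            return None              # target is never read: drop the statement
        return node

tree = ast.parse("x = 1\ny = 2\nprint(x)")

# Pass 1 (analysis): which names are ever read?
used = {n.id for n in ast.walk(tree)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}

# Pass 2 (transformation): remove assignments to names that are never read.
tree = ast.fix_missing_locations(DropUnusedAssignments(used).visit(tree))
print(ast.unparse(tree))             # Python 3.9+; prints "x = 1" and "print(x)"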

Creating custom extensions to a regular expression engine

Is there any easy way to go about adding custom extensions to a regular expression engine? (For Python in particular, but I would take a general solution as well.)
It might be easier to explain what I'm trying to build with an example. Here is the use case that I have in mind:
I want users to be able to match strings that may contain arbitrary ASCII characters. Regular expressions are a good start, but aren't quite enough for the type of data I have in mind. For instance, say I have data that contains strings like this:
<STX>12.3,45.6<ETX>
where <STX> and <ETX> are the Start of Text/End of Text characters 0x02 and 0x03. To capture the two numbers, it would be very convenient for the user to be able to specify any ASCII character in their expression. Something like so:
\x02(\d\d\.\d),(\d\d\.\d)\x03
Here "\x02" and "\x03" match the control characters, and the first and second match groups are the numbers. So, something like regular expressions with just a few domain-specific add-ons.
How should I go about doing this? Is this even the right way to go? I have to believe this sort of problem has been solved, but my initial searches didn't turn up anything promising. Regular expressions have the advantage of being well known, keeping the learning curve down.
A few notes:
I am not looking for a fixed parser for a particular protocol - it needs to be general and user configurable
I really don't want to write my own regex engine
Although it would be nice, I am not looking for "regex macros" where I create shortcuts for a handful of common expressions. (perhaps a follow-up question...)
Bonus: Have you heard of any academic work on this, e.g. "creating domain-specific search languages"?
EDIT: Thanks for the replies so far; I hadn't realized Python's re supported arbitrary ASCII characters. However, this is still not quite what I'm looking for. Here is another example that hopefully gives a sense of the breadth of what I want in the end:
Suppose I have data that contains strings like this:
$\x01\x02\x03\r\n
Where the bytes \x01\x02\x03 pack two 12-bit integers (0x010 and 0x023). So how could I add syntax so the user could match it with a regex like this:
\$(\int12)(\int12)\x0d\x0a
Where each \int12 pulls out 12 bits. This would be handy when searching for packed data.
\x escapes are already supported by the Python regular expression parser:
>>> import re
>>> regex = re.compile(r'\x02(\d\d\.\d),(\d\d\.\d)\x03')
>>> regex.match('\x0212.3,45.6\x03')
<_sre.SRE_Match object at 0x7f551b0c9a48>
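The \int12 part of the edited question is beyond what re's syntax can express, since re matches whole bytes or characters. One workable approach, sketched below with purely illustrative names, is to rewrite the custom token into plain re syntax before compiling and decode the bit fields after the match:

import re

def compile_extended(pattern):
    # A hypothetical \int12\int12 pair (two packed 12-bit integers) occupies
    # exactly three bytes, so rewrite it into a single three-byte group.
    rewritten = pattern.replace(r"(\int12)(\int12)", r"([\s\S]{3})")
    return re.compile(rewritten.encode("latin-1"))

def unpack_int12_pair(raw):
    value = int.from_bytes(raw, "big")       # 24 bits
    return value >> 12, value & 0xFFF        # high 12 bits, low 12 bits

regex = compile_extended(r"\$(\int12)(\int12)\x0d\x0a")
m = regex.match(b"$\x01\x02\x03\r\n")
print(unpack_int12_pair(m.group(1)))         # (16, 515), i.e. 0x010 and 0x203 with a big-endian bit layout

This only handles the one token it knows about; a real implementation would need a small translation table (or parser) mapping each custom token to a standard-regex equivalent plus a post-processing step.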

Python: Pattern detection and rule generation

I have a need for a pattern-interpretation and rule-generating system. Basically, it should parse through text and interpret patterns from it, and based on those interpretations I need to output a set of rules. Here is an example. Let's say I have an HTTP header which looks like
GET https://website.com/api/1.0/download/8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1
Host: website.com
User-Agent: net.me.me/2.7.1;OS/iOS-5.0.1;Apple/iPad 2 (GSM)
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
The parser would run through this and output
req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"
The above rule contains a modified version of a regex. Each variable, e.g. STRING:auth_token or STRING:id, is to be extracted.
For parsing through the text (the header in this case), I will have to tell the parser that it needs to extract whatever comes after "download". So basically there is a definition of a set of rules which this parser will use to parse through the text and eventually output the final rule.
Now the question is: is there any module available in Python for pattern matching, detection, and generation that can help me with this? This is somewhat like the parser part of a compiler. I wanted to ask before going deep into trying to make one myself. Any help?
I think this has already been answered in:
Parser generation
Python parser Module tutorial
I can assure you that what you want is easy with the pyparsing module.
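For example, a minimal pyparsing sketch for the request line above (the field names auth_token and id are illustrative):

from pyparsing import Literal, Word, alphanums

token = Word(alphanums)
grammar = (Literal("/api/1.0/download/").suppress()
           + token("auth_token") + Literal("/").suppress() + token("id"))

result = grammar.searchString(
    "GET https://website.com/api/1.0/download/"
    "8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1")[0]
print(result.auth_token, result.id)   # 8hqcdzt9oaq8llapjai1bpp2q27p14ah 2139379149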
Sorry if this is not quite what you're looking for, but I'm a little rushed for time.
The Python re module documentation contains a section on writing a tokenizer.
It's under-documented, but might help you in making something workable.
It's certainly easier than tokenizing things yourself, though it may not provide the flexibility you seem to be after.
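The idea from that section, roughly: build one pattern out of named alternatives and scan with finditer, using lastgroup to see which rule matched. A small sketch:

import re

token_spec = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("WORD",   r"[A-Za-z_][\w.-]*"),
    ("PUNCT",  r"[^\s\w]"),
    ("SKIP",   r"\s+"),
]
tokenizer = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in token_spec))

for m in tokenizer.finditer("Accept-Encoding: gzip, deflate"):
    if m.lastgroup != "SKIP":
        print(m.lastgroup, m.group())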
You'd best do this yourself. It is not much work.
As you say, you'd have to define regular expressions as rules. Your program would then find the matching regular expression and transform the match into an output rule.
EDIT:
I do not think there is a library to do this. If I understand you correctly, you want to specify a set of rules like this one:
EXTRACT AFTER download
And this will output a text like this:
req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"
For this you'd have to create a parser that would parse your rules. Depending on the complexity of the rule syntax, you could use pyparsing, use regular expressions, or do it by hand. My rule of thumb is: if your syntax is recursive (i.e. like HTML), then it makes sense to use pyparsing; otherwise it is not worth it.
From these parsed rules your program would have to create new regular expressions to match the input text. Basically, your program would translate rules into regular expressions.
Using these regular expressions, you'd match and extract the data from your input text.
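Concretely, translating a rule like "EXTRACT AFTER download" into a regular expression and then applying it might look roughly like this (the rule syntax and group names are illustrative):

import re

def rule_to_regex(rule):
    # "EXTRACT AFTER <word>": capture the two path components after <word>.
    _, _, anchor = rule.split()
    return re.compile(
        re.escape(anchor) + r"/(?P<auth_token>[^/\s]+)/(?P<id>[^/\s]+)")

request_line = ("GET https://website.com/api/1.0/download/"
                "8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1")

m = rule_to_regex("EXTRACT AFTER download").search(request_line)
print(m.group("auth_token"), m.group("id"))
# 8hqcdzt9oaq8llapjai1bpp2q27p14ah 2139379149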

Parsing a DTD to reveal hierarchy of elements

My goal is to parse several relatively complex DTDs to reveal the hierarchy of elements. The only distinction between DTDs is the version, but each version made no attempt to remain backwards compatible--that would be too easy! As such, I intend to visualize the structure of elements defined by each DTD so that I can design a database model suitable for uniformly storing the data.
Because most solutions I've investigated in Python will only validate against external DTDs, I've decided to start my efforts from the beginning. Python's xml.parsers.expat only parses XML files and implements very basic DTD callbacks, so I've decided to check out the original expat, which is written in C and claims to fully comport with the XML 1.0 specification. However, I have the following questions about this approach:
Will expat (in C) parse external entity references in a DTD file and follow those references, parse their elements, and add those elements to the hierarchy?
Can expat generalize and handle SGML, or will it fail when it encounters a file that is invalid as a DTD yet valid as SGML?
My requirements may lead to the conclusion that expat is inappropriate. If that's the case, I'm considering writing a lexer/parser for XML 1.0 DTDs. Are there any other options I should consider?
The following illustrates more succinctly my intent:
Input DTD Excerpt
<!--A concise summary of the disclosure.-->
<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>
Object Created from DTD Excerpt (pseudocode)
class abstract:
    member doc_page_array[]
    member abst_problem
    member abst_solution
    member paragraph_array[]
    member description = "A concise summary of the disclosure."
One challenging aspect is to attribute to the <!ELEMENT> tag the comment appearing above it. Hence, a homegrown parser might be necessary if I cannot use expat to accomplish this.
Another problem is that some parsers have trouble processing DTDs that use Unicode characters greater than #xFFFF, so that might be another factor in favor of creating my own.
If it turns out that the lexer/parser route is better suited for my task, does anyone happen to know of a good way to convert these EBNF expressions to something capable of being parsed? I suppose the "best" approach might be to use regular expressions.
Anyway, these are just the thoughts I've had regarding my issue. Any answers to the above questions or suggestions on alternative approaches would be appreciated.
There are several existing tools which may fit your needs, including DTDParse, OpenSP, Matra, and DTD Parser. There are also articles on creating a custom parser.
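If a third-party library is acceptable despite the expat-centric plan above, lxml can, as far as I know, expose parsed DTD element declarations and their content models, which may be enough to recover the element hierarchy (it does not attach the preceding comment to the declaration, so that part would still need custom handling). A hedged sketch:

from io import StringIO
from lxml import etree

dtd = etree.DTD(StringIO(
    "<!ELEMENT abstract (doc-page+ | (abst-problem , abst-solution) | p+)>"
    "<!ELEMENT doc-page EMPTY>"
    "<!ELEMENT abst-problem EMPTY>"
    "<!ELEMENT abst-solution EMPTY>"
    "<!ELEMENT p EMPTY>"))

def walk(content, depth=0):
    # content-model nodes form a binary tree of 'seq'/'or' groups and 'element' leaves
    if content is None:
        return
    print("  " * depth, content.type, content.name, content.occur)
    walk(content.left, depth + 1)
    walk(content.right, depth + 1)

for element in dtd.elements():
    print(element.name, element.type)
    walk(element.content)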
