Python:Pattern detection and rule generation


I need a pattern-interpretation and rule-generation system. It should parse text, infer patterns from it, and output a set of rules based on that interpretation. Here is an example. Let's say I have an HTTP header that looks like this:
GET https://website.com/api/1.0/download/8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1
Host: website.com
User-Agent: net.me.me/2.7.1;OS/iOS-5.0.1;Apple/iPad 2 (GSM)
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
The parser would run through this and output
req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"
The above rule uses a modified regex syntax. Each variable, e.g. STRING:auth_token or STRING:id, is a value to be extracted.
To parse the text (the header in this case), I will have to tell the parser that it needs to extract whatever comes after "download". So there is a set of rule definitions that the parser uses to parse the text and eventually output the final rule.
Now the question: is there a Python module for pattern matching, detection, and generation that can help me with this? It is somewhat like the parser part of a compiler. I wanted to ask before going deep into building one myself. Any help?

I think this has already been answered in:
Parser generation
Python parser Module tutorial
I can assure you that what you want is easy with the pyparsing module.

Sorry if this is not quite what you're looking for, but I'm a little rushed for time.
The re module documentation for Python contains a section on writing a tokenizer.
It's under-documented, but might help you in making something workable.
Certainly easier than tokenizing things yourself, though may not provide the flexibility you seem to be after.
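For illustration, here is a minimal sketch of the named-group tokenizer technique that section describes, applied to the request line from the question. The token names (METHOD, URL, VERSION) and their patterns are assumptions made up for this example, not part of the re documentation.

```python
import re

# Hypothetical token types for an HTTP request line; order matters,
# because the first alternative that matches wins.
TOKEN_SPEC = [
    ("METHOD", r"GET|POST|PUT|DELETE"),
    ("URL", r"https?://\S+"),
    ("VERSION", r"HTTP/\d\.\d"),
    ("SKIP", r"\s+"),        # whitespace between tokens
    ("MISMATCH", r"\S+"),    # anything we did not anticipate
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))

def tokenize(text):
    """Yield (kind, value) pairs; m.lastgroup names the alternative that matched."""
    for m in TOKEN_RE.finditer(text):
        kind = m.lastgroup
        if kind == "SKIP":
            continue
        yield kind, m.group()

tokens = list(tokenize("GET https://website.com/api/1.0/download/abc/123 HTTP/1.1"))
for kind, value in tokens:
    print(kind, value)
```

The MISMATCH entry is a catch-all so that unexpected input surfaces as a token instead of being silently skipped.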

You'd best do this yourself. It is not much work.
As you say, you'd have to define regular expressions as rules. Your program would then find the matching regular expression and transform the match into an output rule.
** EDIT **
I do not think there is a library to do this. If I understand you correctly, you want to specify a set of rules like this one:
EXTRACT AFTER download
And this will output a text like this:
req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"
For this you'd have to create a parser that would parse your rules. Depending on the complexity of the rule syntax, you could use pyparsing, use regular expressions, or do it by hand. My rule of thumb is: if your syntax is recursive (i.e. like HTML), then it makes sense to use pyparsing; otherwise it is not worth it.
From these parsed rules your program would have to create new regular expressions to match the input text. Basically, your program would translate rules into regular expressions.
Using these regular expressions, you'd match and extract the data from your input text.
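A rough sketch of that translation step, assuming the {STRING:name} placeholder syntax from the question and assuming each placeholder stands for one path segment (no slashes or whitespace). The function name template_to_regex is hypothetical.

```python
import re

# Placeholder syntax borrowed from the question: {STRING:name}
PLACEHOLDER = re.compile(r"\{STRING:(\w+)\}")

def template_to_regex(template):
    """Translate a template with {STRING:name} placeholders into a compiled regex."""
    parts = []
    pos = 0
    for m in PLACEHOLDER.finditer(template):
        # Literal text between placeholders is escaped so '.' etc. match literally.
        parts.append(re.escape(template[pos:m.start()]))
        # Assumption: a placeholder matches one path segment.
        parts.append("(?P<%s>[^/\\s]+)" % m.group(1))
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))

request_line = ("GET https://website.com/api/1.0/download/"
                "8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1")

pattern = template_to_regex("/api/1.0/download/{STRING:auth_token}/{STRING:id}")
match = pattern.search(request_line)
fields = match.groupdict() if match else {}
print(fields)
```

Named groups keep the extracted values addressable by the same names used in the rule, so the output rule and the extracted data stay in sync.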

Related

How to parse Xpath expressions in Python?

I need to parse (not to evaluate) Xpath expressions in Python to change them, e.g. I have expressions like
//div[...whatever...]//some-other-node...
and I need to change them to (for example):
/changed-node[@attr='value' and ...whatever...]/another-changed-node[@attr='value' ...
As it seems to me, I need to split the original expression into steps, and the steps into axes+nodes and predicates. Is there some tool I can do it with, or is there a nice and easy way to do it without one?
The catch is that I can't be sure the predicates of the original expressions won't contain something like [@id='some/value/with/slashes'], so I can't parse them with naive regexes.

You might be able to use the REx parser generator from Gunther Rademacher. See http://www.bottlecaps.de/rex/ This will generate a parser for any grammar from a suitable BNF, and suitable BNF for various XPath versions is available. REx is a superb piece of technology spoilt only by extremely poor documentation. It can generate parsers in a number of languages including JavaScript, XQuery, and XSLT. It's used in the Saxon-JS product to parse dynamic XPath expressions within the browser.
Another approach is to use the XQuery to XQueryX converters available from W3C. XPath is a subset of XQuery, so these will handle XPath as well, and they produce a representation of the syntax tree in XML.
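If a full grammar-based parser is more than you need, a hand-rolled scanner can at least do the first split into steps safely. The sketch below is not REx and not a real XPath parser; it only tracks quotes and bracket/parenthesis nesting so that slashes inside predicates or string literals are not treated as step separators. The function name split_xpath_steps is made up for this example.

```python
def split_xpath_steps(expr):
    """Split a location path on '/' that is outside predicates, calls and string literals."""
    steps, current = [], []
    depth = 0       # nesting level of [...] predicates and (...) calls
    quote = None    # the quote character we are currently inside, if any
    for ch in expr:
        if quote:
            if ch == quote:
                quote = None          # closing quote ends the literal
        elif ch in "'\"":
            quote = ch                # opening quote starts a literal
        elif ch in "[(":
            depth += 1
        elif ch in "])":
            depth -= 1
        elif ch == "/" and depth == 0:
            steps.append("".join(current))   # top-level separator: close the step
            current = []
            continue
        current.append(ch)
    steps.append("".join(current))
    return steps

steps = split_xpath_steps("//div[@id='some/value/with/slashes']//span[a/b]/text()")
print(steps)
```

Empty strings in the result mark a leading "/" or the "//" abbreviation, so the original expression can be rebuilt with "/".join(steps) after individual steps are rewritten.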

Obtain parse tree for python code

I would like to be able to generate a parse tree for python source code. This code does not have to be compilable, e.g.
if x == 5:
should be turned into some sort of tree representation. I can use the Python compiler package to create a tree, but this only works for code that is compilable, e.g.
if x == 5: print True

The paper you linked to says that they used the ast module in the Python standard library. It also says they used a dummy body for the body of the if statement. Use a statement that will be easy to recognize as being a dummy body, like pass or a function call like dummy().
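A minimal sketch of the dummy-body idea with the ast module, assuming the fragment is a single compound-statement header such as the "if x == 5:" from the question. The helper name parse_fragment is hypothetical.

```python
import ast

def parse_fragment(header):
    """Parse a compound-statement header by appending a dummy 'pass' body."""
    source = header.rstrip() + "\n    pass\n"
    return ast.parse(source)

tree = parse_fragment("if x == 5:")
if_node = tree.body[0]

# The condition is a normal expression subtree; the body is just the dummy.
print(type(if_node).__name__)
print(ast.dump(if_node.test))
print(type(if_node.body[0]).__name__)
```

Because the dummy is always a lone pass statement, it is easy to strip from the tree afterwards.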

Our DMS Software Reengineering Toolkit with its Python front end can do this.
DMS provides infrastructure for parsing code, parameterized by a language definition (e.g, a Python grammar, etc.) and automatically building ASTs, as well as the ability to inspect/navigate/change those ASTs, and prettyprint the resulting modified trees.
Its AST parsing machinery can handle a variety of special cases:
Parsing files or strings ("streams") as a (Python) full program.
Syntax errors in a stream are reported, and if repairable by single token insertion or deletion, so repaired.
Parsing a stream according to an arbitrary language nonterminal.
Parsing a pattern, corresponding to a named grammar nonterminal with named placeholders for the missing subtrees. A pattern match result can be used to match against concrete ASTs to decide match or not, and if match, to provide bindings for the pattern variables.
Parsing a valid arbitrary substring. This returns a tree with possible missing left or right children, which define the left and right ends of the substring.
For instance, OP could write the following pattern to handle his example:
pattern if_x_is_5(s: statement):statement
= " if x==5: \s ";
DMS will read that pattern and build the corresponding pattern tree.
The paper that OP references really wants operators and keywords to remain as explicit artifacts in the AST. One way to interpret that is that they really want a concrete syntax tree. DMS actually produces "AST"s which are concrete syntax trees with the constant terminals removed; this has the effect of being very close to what a perfect AST should be, but one can easily determine for any leaf node where constant terminals should be inserted (or one can configure DMS to simply produce the uncompressed CSTs).
Personally, I don't see how the goal of the paper of OP's interest can really succeed in providing useful pseudo-code (in spite of its claims). Understanding an algorithm requires understanding of the corresponding data structures and the abstract and concrete algorithms being applied to those data structures. The paper focuses only on raw language syntax; there is no hint of understanding the more abstract ideas.

Python Regex match multiline Java annotation

I am trying to take advantage of JAXB code generation from an XML Schema for use in an Android project through the SimpleXML library, which uses a different kind of annotation than JAXB (I do not want to include a 9MB lib to support JAXB in my Android project). See question previously asked
Basically, I am writing a small Python script to perform the required changes on each Java file generated through the xjc tool, and so far it is working for import deletion/modification, single-line annotations, and also the annotation for which a List @XmlElement needs to be converted to an @ElementList one.
The only issue I am facing right now is removing annotations that span several lines, such as @XmlSeeAlso or @XmlType, like the following:
@XmlType(name = "AnimatedPictureType", propOrder = {
"resources",
"animation",
"caption"
})
or
@XmlSeeAlso({
BackgroundRGBColorType.class,
ForegroundRGBColorType.class
})
I tried different strategies using either MULTILINE, DOTALL, or both, but without any success. I am new to "advanced" regex usage as well as Python, so I am probably missing something silly.
For my simple XSD processing, that is the only step I cannot get running to achieve a fully automated script that runs xjc and then automatically converts JAXB annotations into Simple XML ones.
Thank you in advance for your help.

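@Xml.*\}\) with DOTALL enabled should, as far as I know, match any annotation starting with @Xml and ending with "})", even when it is multiline. Note that .* is greedy, so with several annotations in one file it runs on to the last "})"; the non-greedy form @Xml.*?\}\) stops at the first one.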
For a good view of what your regex actually matches you could always test your regular expressions at websites like https://pythex.org/
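As a sketch of how the removal could look in the script, here is a variant that avoids DOTALL altogether: a negated character class such as [^)] already matches newlines. It assumes the annotation arguments contain no nested parentheses, which holds for the two examples in the question but not for annotations in general. The sample Java source and the name ANNOTATION_RE are made up for illustration.

```python
import re

java_source = '''@XmlType(name = "AnimatedPictureType", propOrder = {
    "resources",
    "animation",
    "caption"
})
@XmlSeeAlso({
    BackgroundRGBColorType.class,
    ForegroundRGBColorType.class
})
public class AnimatedPictureType {
}
'''

# [^)]* spans line breaks on its own, so no flag is needed.
# Assumption: no ')' appears inside the annotation arguments.
ANNOTATION_RE = re.compile(r"@Xml(?:Type|SeeAlso)\([^)]*\)[ \t]*\n?")

cleaned = ANNOTATION_RE.sub("", java_source)
print(cleaned)
```

Listing the annotation names explicitly keeps the pattern from swallowing single-line annotations such as @XmlElement that the script handles elsewhere.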

Creating custom extensions to a regular expression engine

Is there any easy way to go about adding custom extensions to a
Regular Expression engine? (For Python in particular, but I would take
a general solution as well).
It might be easier to explain what I'm trying to build with an
example. Here is the use case that I have in mind:
I want users to be able to match strings that may contain arbitrary
ASCII characters. Regular Expressions are a good start, but aren't
quite enough for the type of data I have in mind. For instance, say I
have data that contains strings like this:
<STX>12.3,45.6<ETX>
where <STX> and <ETX> are the Start of Text/End of Text characters
0x02 and 0x03. To capture the two numbers, it would be very
convenient for the user to be able to specify any ASCII
character in their expression. Something like so:
\x02(\d\d\.\d),(\d\d\.\d)\x03
Where the "\x02" and "\x03" are matching the control characters and
the first and second match groups are the numbers. So, something like
regular expressions with just a few domain-specific add-ons.
How should I go about doing this? Is this even the right way to go?
I have to believe this sort of problem has been solved, but my initial
searches didn't turn up anything promising. Regular expressions have
the advantage of being well known, keeping the learning curve down.
A few notes:
I am not looking for a fixed parser for a particular protocol - it needs to be general and user configurable
I really don't want to write my own regex engine
Although it would be nice, I am not looking for "regex macros" where I create shortcuts for a handful of common expressions. (perhaps a follow-up question...)
Bonus: Have you heard of any academic work, i.e "Creating Domain Specific search languages"
EDIT: Thanks for the replies so far, I hadn't realized Python re supported arbitrary ASCII chars. However, this is still not quite what I'm looking for. Here is another example that hopefully gives the breadth of what I want in the end:
Suppose I have data that contains strings like this:
$\x01\x02\x03\r\n
Where the bytes 0x01, 0x02, and 0x03 form two 12-bit integers (0x010 and 0x203, reading the 24 bits in order). So how could I add syntax so the user could match it with a regex like this:
\$(\int12)(\int12)\x0d\x0a
Where the \int12's each pull out 12 bits. This would be handy if trying to search for packed data.

\x escapes are already supported by the Python regular expression parser:
>>> import re
>>> regex = re.compile(r'\x02(\d\d\.\d),(\d\d\.\d)\x03')
>>> regex.match('\x0212.3,45.6\x03')
<_sre.SRE_Match object at 0x7f551b0c9a48>
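For the packed-data case in the edit, the re module has no \int12 escape, and as far as I know it offers no hook for adding one. One workaround, sketched here for Python 3 with a bytes pattern, is to let the regex capture the raw bytes and decode the bit fields afterwards. The helper name unpack_two_int12 and the big-endian bit order are assumptions.

```python
import re

# Capture exactly three raw bytes between '$' and CR LF.
# DOTALL lets '.' match a newline byte inside the payload as well.
PACKET_RE = re.compile(rb"\$(.{3})\r\n", re.DOTALL)

def unpack_two_int12(payload):
    """Split 3 bytes (24 bits, big-endian) into two 12-bit integers."""
    value = int.from_bytes(payload, "big")
    return value >> 12, value & 0xFFF

match = PACKET_RE.search(b"noise$\x01\x02\x03\r\nmore")
first, second = unpack_two_int12(match.group(1))
print(hex(first), hex(second))
```

A thin wrapper could translate a user-facing token like \int12 into a fixed-width byte capture plus a decode step of this kind, which keeps the standard regex engine untouched.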

RegEx anomalous behavior in a program

I have written the following regex to match a set of e-mails from HTML files. The e-mails can take various formats such as
alice @ so.edu
alice at sm.so.edu
alice @ sm.com
<a href="mailto:alice at bob dot com">
I generally use RegexPal to test my regular expressions before implementing them in a programming language. I observe a strange behavior on the last e-mail example posted. RegexPal shows me a match for my regex, but when I use the same regex in a Python program it doesn't give me a hit. What could be the reason?
mail_regex = (?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:@|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*
(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))
The regex is a little complex in order to accommodate a variety of other examples (email patterns found in the dataset). You can also run and inspect the Python program on CodePad - http://codepad.org/W2p6waBb
Edit
Just to give some perspective, the same regex works on - http://pythonregex.com/

It looks like the specific issue here is that you need to use a raw string:
mail_re = r"(?:[a-zA-Z]+[\w+\.]+[a-zA-Z]+)\s*(?:@|\bat\b)\s*(?:(?:(?:(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*(?:[a-zA-Z]+)\s*(?:\.|dot|dom)\s*)(?:edu|com))|(?:(?:[a-zA-Z]+\s*(?:\.|dot|dom)\s*(?:edu|com))))"
Otherwise, for instance \b will be backspace instead of word boundary.
Also, you're using a JavaScript tester. Python has different syntax and behavior. To avoid surprises, it would be better to test with the Python-specific syntax.
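To see the difference in isolation, here is a small demonstration using only the \bat\b fragment of the pattern and the last sample from the question. The variable names are arbitrary.

```python
import re

text = '<a href="mailto:alice at bob dot com">'

plain = "\bat\b"    # in a normal string, "\b" is the backspace character (0x08)
raw = r"\bat\b"     # in a raw string, the regex engine receives \b: a word boundary

plain_match = re.search(plain, text)
raw_match = re.search(raw, text)

print(plain_match)         # no backspace characters in the text, so no match
print(raw_match.group())   # the standalone word between "alice" and "bob"
```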
