Does the Python regular expression module use BRE or ERE? - python

It appears that POSIX splits regular expression implementations into two kinds: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).
The Python re module reference does not seem to specify.

Except for some similarity in syntax, the re module doesn't follow the POSIX standard for regular expressions.
Different matching semantics
A POSIX regular expression engine (which can be implemented with a DFA/NFA or even a backtracking engine) always finds the leftmost longest match, while the re module is a backtracking engine which finds the leftmost "earliest" match ("earliest" according to the search order defined by the regular expression).
The difference in the matching semantics can be observed in the case of matching (Prefix|PrefixSuffix) against PrefixSuffix.
In a POSIX-compliant implementation of POSIX regex (not one which only borrows the syntax), the regex will match PrefixSuffix.
In contrast, re engine (and many other backtracking regex engines) will match Prefix only, since Prefix is specified first in the alternation.
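The re side of this can be reproduced directly (a minimal demonstration of the alternation-order behavior described above):

```python
import re

# Python's backtracking engine commits to the first alternative that
# succeeds, so the shorter branch wins even though a longer match exists.
match = re.search(r"Prefix|PrefixSuffix", "PrefixSuffix")
print(match.group(0))  # 'Prefix'
```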
The difference can also be seen in the case of matching (xxx|xxxxx)* against xxxxxxxxxx (a string of 10 x's):
On Cygwin:
$ [[ "xxxxxxxxxx" =~ (xxx|xxxxx)* ]] && echo "${BASH_REMATCH[0]}"
xxxxxxxxxx
All 10 x's are matched.
In Python:
>>> re.search(r'(?:xxx|xxxxx)*', 'xxxxxxxxxx').group(0)
'xxxxxxxxx'
Only 9 x's are matched: the engine picks the first alternative xxx in all 3 repetitions, and nothing forces it to backtrack and try the second alternative.
POSIX-exclusive regular expression features
Apart from the difference in matching semantics, POSIX regular expressions also define syntax for collating symbols, equivalence class expressions, and collation-based character ranges. These features greatly increase the expressive power of the regex.
Taking equivalence class expression as example, from the documentation:
An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. [...]. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". [...]
Since these features depend heavily on locale settings, the same regex may behave differently in different locales. The collation order also depends on the locale data installed on the system.
re regular expression features
re borrows its syntax from Perl, but not all features of Perl regex are implemented in re. Below are some regex features available in re which are unavailable in POSIX regular expressions:
Greedy/lazy quantifier, which specifies the order to expand a quantifier.
While people usually call the POSIX * "greedy", it actually only specifies the lower and upper bounds of the repetition in POSIX. The so-called "greedy" behavior is due to the leftmost-longest match rule.
Look-around assertion (look-ahead and look-behind)
Conditional pattern (?(id/name)yes-pattern|no-pattern)
Short-hand constructs: \b, \s, \d, \w (some POSIX regular expression engines may implement these, since the standard leaves the behavior undefined for these cases)
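A quick sketch of two of these features in Python's re, neither of which is expressible in POSIX BRE/ERE syntax:

```python
import re

# Lazy quantifier: expand as little as possible, so the match stops at
# the first '>' instead of the last one.
lazy = re.search(r"<.+?>", "<a><b>").group(0)       # '<a>'

# Look-ahead assertion: match a word only if a comma follows it.
words = re.findall(r"\w+(?=,)", "one, two, three")  # ['one', 'two']
```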

Neither. It's basically the PCRE dialect, but a distinct implementation.
The very first sentence in the re documentation says:
This module provides regular expression matching operations similar to those found in Perl.
While this does not immediately reveal to a newcomer how they are related to e.g. POSIX regular expressions, it should be common knowledge that Perl 4 and later Perl 5 provided a substantially expanded feature set over the regex features of earlier tools, including what POSIX mandated for grep -E aka ERE.
The perlre manual page describes the regular expression features in more detail, though you'll find much the same details in a different form in the Python documentation. The Perl manual page contains this bit of history:
The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.)
(Here, V8 refers to Version 8 Unix. Spencer's library basically (re)implemented POSIX regular expressions.)
Perl 4 had a large number of convenience constructs like \d, \s, \w as well as symbolic shorthands like \t, \f, \n. Perl 5 added a significant set of extensions (which is still growing slowly) including, but not limited to,
Non-greedy quantifiers
Non-backtracking quantifiers
Unicode symbol and property support
Non-grouping parentheses
Lookaheads and lookbehinds
... Basically anything that starts with (?
As a result, the "regular" expressions are by no means strictly "regular" any longer.
This was reimplemented in a portable library by Philip Hazel, originally for the Exim mail server; his PCRE library has found its way into myriad different applications, including a number of programming languages (Ruby, PHP, Python, etc). Incidentally, in spite of the name, the library is not strictly "Perl compatible" (any longer); there are differences in features as well as in behavior. (For example, Perl internally changes * to something like {0,32767} while PCRE does something else.)
An earlier version of Python actually had a different regex implementation, and there are plans to change it again (though it will remain basically PCRE). This is the situation as of Python 2.7 / 3.5.

Related

How to use regular expression to remove all math expression in latex file

Suppose I have a string which contains part of a LaTeX file. How can I use the Python re module to remove any math expressions in it?
e.g:
text="This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."
I would like my function to return ans="This is an example . How to remove it? Another random math expression ...".
Thank you!
Try this Regex:
(\$+)(?:(?!\1)[\s\S])*\1
Explanation:
(\$+) - matches 1+ occurrences of $ and captures it in Group 1
(?:(?!\1)[\s\S])* - matches 0+ occurrences of any character, as long as the position does not begin with the delimiter captured in Group 1
\1 - matches the contents of Group 1 again
Replace each match with a blank string.
As suggested by @torek, we should not match 3 or more consecutive $, hence we change the expression to (\${1,2})(?:(?!\1)[\s\S])*\1
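Putting the final expression to work on the sample string from the question, via re.sub with an empty replacement:

```python
import re

text = r"This is an example $$a \text{$a$}$$. How to remove it? Another random math expression $\mathbb{R}$..."

# (\${1,2}) captures the opening delimiter ($ or $$); the tempered token
# (?:(?!\1)[\s\S])* consumes anything that does not start a repeat of
# that same delimiter, and \1 requires the matching closer.
ans = re.sub(r"(\${1,2})(?:(?!\1)[\s\S])*\1", "", text)
print(ans)
# This is an example . How to remove it? Another random math expression ...
```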
It's commonly said that regular expressions cannot count, which is kind of a loose way of describing a problem more formally discussed in Count parentheses with regular expression. See that for what this means.
Now, with that in mind, note that LaTeX math expressions can include nested sub-equations, which can include further nested sub-equations, and so on. This is analogous to the problem of detecting whether a closing parenthesis closes an inner parenthesized expression (as in (for instance) this example, where the first one does not) or an outer parenthesis. Therefore, regular expressions are not going to be powerful enough to handle the full general case.
If you're willing to do a less-than-complete job, you can construct a regular expression that finds $...$ and $$...$$. You will need to pay attention to the particular regular expression language available. Python's is essentially the same as Perl's here.
Importantly, these $-matchers will completely miss \begin{equation} ... \end{equation}, \begin{eqnarray} ... \end{eqnarray}, and so on. We've already noted that handling LaTeX expression parsing with a mere regular expression recognizer is inadequate, so if you want to do a good job (while ignoring the complexity of lower-level TeX manipulation of token types, where one can change any individual character's category code), you will want a more general parser. You can then tokenize \begin, {, }, and words, and match up the begin/end pairs. You can also tokenize $ and $$ and match those up. Since parsers can count, in exactly the way that regular expressions can't, you can do a much better job this way.
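To illustrate the limitation concretely, the $-matcher from the accepted answer leaves environment-based math untouched (a sketch on a made-up input string; handling \begin{equation} would need a real parser):

```python
import re

doc = r"inline $x$ stays, but \begin{equation} y \end{equation} is missed"

# The dollar-delimited math is removed, the environment form is not.
stripped = re.sub(r"(\${1,2})(?:(?!\1)[\s\S])*\1", "", doc)
print(stripped)
# inline  stays, but \begin{equation} y \end{equation} is missed
```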

Converting C++ Boost Regexes to Python re regexes [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
The aim is to convert these regexes in C++ boost to Python re regexes:
typedef boost::u32regex tRegex;
tRegex emptyre = boost::make_u32regex("^$");
tRegex commentre = boost::make_u32regex("^;.*$");
tRegex versionre = boost::make_u32regex("^#\\$Date: (.*) \\$$");
tRegex includere = boost::make_u32regex("^<(\\S+)$");
tRegex rungroupre = boost::make_u32regex("^>(\\d+)$");
tRegex readreppre = boost::make_u32regex("^>(\\S+)$");
tRegex tokre = boost::make_u32regex("^:(.*)$");
tRegex groupstartre = boost::make_u32regex("^#(\\d+)$");
tRegex groupendre = boost::make_u32regex("^#$");
tRegex rulere = boost::make_u32regex("^([!-+^])([^\\t]+)\\t+([^\\t]*)$");
I could rewrite these regexes one by one, but there are quite a lot more than the examples above, so my question is with regard to
how to convert C++ boost regexest to Python and
what is the difference between boost regexes and python re regexes?
Is the C++ boost::u32regex the same as re regexes in python? If not, what is the difference? (Links to the docs would be much appreciated =) ) For instance:
in boost, there's boost::u32regex_match, is that the same as
re.match?
in boost, there's boost::u32regex_search, how is it different to re.search
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?
How to convert C++ boost regexest to Python
In the case of a simple regex like \w+\s+\d+ or >.*$, you won't have to change the pattern. In the case of more complex patterns with the constructs mentioned below, you will most probably have to re-write the regex. As with any conversion from one flavor/language to another, the general answer is DON'T. However, Python and Boost do have some similarities, especially when it comes to simple patterns (if Boost is using a PCRE-like pattern) containing a dot (a.*b), regular ([\w-]*) and negated ([^>]*) character classes, regular quantifiers like +/*/?, and suchlike.
what is the difference between boost regexes and python re regexes?
The Python re module is not as rich as Boost regex (suffice it to mention such constructs as \h, \G, \K, \R, \X, \Q...\E, branch reset, recursion, possessive quantifiers, POSIX character classes and character properties, and the extended replacement pattern), among other features that Boost has. The (?imsx-imsx:pattern) construct is limited to the whole expression in Python, not to a part of it, thus you should be aware that a (?i) placed mid-pattern will be treated as if it were at the beginning of the pattern.
Also, same as in Boost, you do not have to escape [ inside a character class, and { outside the character class.
The backreferences like \1 are the same as in Python.
Since you are not using capturing groups in alternations in your patterns (e.g. re.sub(r'\d(\w)|(go\w*)', r'\2', 'goon')), there should be no problem (in such cases, Python does not fill the non-participating group with any value, and returns an empty result).
Note the difference in named group definition: (?<NAME>expression)/(?'NAME'expression) in Boost, and (?P<NAME>expression) in Python.
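For example, where a Boost pattern would use (?<year>\d{4}), the Python form is (?P<year>...) (the group names here are hypothetical, just to show the syntax):

```python
import re

# Python uses (?P<NAME>...) to define a named group and (?P=NAME) to
# backreference it; Boost's (?<NAME>...) form is a syntax error in re.
m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2014-05")
print(m.group("year"), m.group("month"))  # 2014 05
```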
I see your regexps mainly fall under "simple" category. The most complex pattern is a tempered greedy token (e.g. ⌊-((?:(?!-⌋).)*)-⌋). To optimize them, you could use an unroll the loop technique, but it may not be necessary depending on the size of texts you handle with the expressions.
The most troublesome part, as I see it, is that you are using Unicode literals heavily. In Python 2.x, plain strings are byte arrays, and you will always have to make sure you pass a unicode object to the Unicode regexps (see Python 2.x's Unicode Support). In Python 3, all strings are Unicode by default, and you can even use Unicode literal characters in source code without any additional actions (see Python's Unicode Support). So, Python 3.3+ (which restored the u'' literal prefix) is a good candidate.
Now, as for the remaining questions:
in boost, there's boost::u32regex_match, is that the same as re.match?
re.match is not the same as regex_match, since re.match looks for a match at the beginning of the string while regex_match requires a match of the full string. However, in Python 3.4+ you can use re.fullmatch(pattern, string, flags=0), which is equivalent to Boost regex_match.
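A minimal sketch of the distinction:

```python
import re

# re.match anchors only at the start: a partial match succeeds.
prefix = re.match(r"\d+", "123abc").group(0)   # '123'

# re.fullmatch (Python 3.4+) requires the whole string to match,
# like Boost's regex_match.
partial = re.fullmatch(r"\d+", "123abc")       # None
full = re.fullmatch(r"\d+", "123").group(0)    # '123'
```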
in boost, there's boost::u32regex_search, how is it different to re.search
Whenever you need to find a match anywhere inside a string, use re.search (see match() versus search()). Thus, this method provides analogous functionality to regex_search in Boost.
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?
Python does not support Perl-like expressions to the extent Boost can; the Python re module is just a "trimmed" Perl regex engine that lacks many of the nice features mentioned earlier. Thus, no flags like perl or default can be found there. As for smatch, you can use re.finditer to get all the match objects. re.findall returns all matches (or only the submatches, if capturing groups are specified) as a list of strings/list of tuples. See the re.findall/re.finditer difference.
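The findall/finditer difference in a nutshell:

```python
import re

s = "a1 b2"

# findall returns only the captured groups, as tuples...
pairs = re.findall(r"(\w)(\d)", s)                         # [('a', '1'), ('b', '2')]

# ...while finditer yields full match objects, which is the closest
# analogue to iterating Boost smatch results.
wholes = [m.group(0) for m in re.finditer(r"(\w)(\d)", s)]  # ['a1', 'b2']
```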
And in the conclusion, a must-read article Python’s re Module.

Python regex compile

The programmer who wrote the following line probably uses a Python package called regex.
UNIT = regex.compile("(?:{A}(?:'{A})?)++|-+|\S".format(A='\p{Word_Break=ALetter}'))
Can some one help explain what A='\p{Word_Break=ALetter}' and -+ means?
The \p{property=value} operator matches on unicode codepoint properties, and is documented on the package index page you linked to:
Unicode codepoint properties, including scripts and blocks
\p{property=value}; \P{property=value}; \p{value} ; \P{value}
The entry matches any Unicode character whose codepoint has a Word_Break property with the value ALetter (there are currently 24941 matches in the Unicode codepoint database; see the Word Boundaries chapter of the Unicode Text Segmentation specification for details).
The example you gave also uses standard python string formatting to interpolate a partial expression into the regular expression being compiled. The "{A}" part is just a placeholder for the .format(A='...') part to fill. The end result is:
"(?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S"
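The interpolation step itself is plain str.format and can be checked with the standard library alone (the resulting pattern still needs the third-party regex module to compile, because of \p{...} and ++):

```python
# str.format replaces each {A} in the template; the braces inside the
# substituted value are not re-processed by format.
template = r"(?:{A}(?:'{A})?)++|-+|\S"
pattern = template.format(A=r"\p{Word_Break=ALetter}")
print(pattern)
# (?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S
```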
The -+ sequence just matches 1 or more - dashes, just as it would in a Python re module expression; it is not anything special, really.
Now, the ++ before that is more interesting. It's a possessive quantifier, and using it prevents the regex matcher from trying out all possible permutations of the pattern. It's a performance optimization, one that prevents catastrophic backtracking issues.

Regex in Python

Goal: Given a number (it may be very long and it is greater than 0), I'd like to get the five least significant digits after dropping any 0s at the end of that number.
I tried to solve this with regex, Helped by RegexBuddy I came to this one:
[\d]+([\d]{0,4}+[1-9])0*
But python can't compile that.
>>> import re
>>> re.compile(r"[\d]+([\d]{0,4}+[1-9])0*")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/re.py", line 188, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.5/re.py", line 241, in _compile
raise error, v # invalid expression
sre_constants.error: multiple repeat
The problem is the "+" after "{0,4}"; it seems it doesn't work in Python (even in 2.6).
How can I write a working regex?
PS:
I know you can start dividing by 10 and then using the remainder n%100000... but this is a problem about regex.
That regular expression is overly complicated. Try this:
>>> import re
>>> re.compile(r"(\d{0,4}[1-9])0*$")
The above regular expression assumes that the number is valid (it will also match "abc0123450", for example.) If you really need the validation that there are no non-number characters, you may use this:
>>> import re
>>> re.compile(r"^\d*?(\d{0,4}[1-9])0*$")
Anyway, the \d does not need to be in a character class, and the quantifier {0,4} does not need the extra + (which would make it possessive; Python does not recognize that syntax anyway).
Also, in the second regular expression, the leading \d is non-greedy, as I believe this will improve the performance and accuracy. I also made it "zero or more", as I assume that is what you want.
I also added anchors as this ensures that your regular expression won't match anything in the middle of a string. If this is what you desired though (maybe you're scanning a long text?), remove the anchors.
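Checking the second expression against a sample number (the expected group is the last five digits once trailing zeros are dropped, e.g. 1230045900 -> 12300459 -> 00459):

```python
import re

# Lazy \d*? keeps the prefix minimal, so the capturing group grabs the
# last up-to-five digits ending in a nonzero digit, before trailing 0s.
m = re.search(r"^\d*?(\d{0,4}[1-9])0*$", "1230045900")
print(m.group(1))  # '00459'
```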
\d{0,4}+ is a possessive quantifier supported by certain regular expression flavors such as .NET and Java. Python does not support possessive quantifiers.
In RegexBuddy, select Python in the toolbar at the top, and RegexBuddy will tell you that Python doesn't support possessive quantifiers. The + will be highlighted in red in the regular expression, and the Create tab will indicate the error.
If you select Python on the Use tab in RegexBuddy, RegexBuddy will generate a Python source code snippet with a regular expression without the possessive quantifier, and a comment indicating that the removal of the possessive quantifier may yield different results. Here's the Python code that RegexBuddy generates using the regex from the question:
# Your regular expression could not be converted to the flavor required by this language:
# Python does not support possessive quantifiers
# Because of this, the code snippet below will not work as you intended, if at all.
reobj = re.compile(r"[\d]+([\d]{0,4}[1-9])0*")
What you probably did is select a flavor such as Java in the main toolbar, and then click Copy Regex as Python String. That gives you a Java regular expression formatted as a Python string. The items in the Copy menu do not convert your regular expression; they merely format it as a string. This allows you to do things like format a JavaScript regular expression as a Python string so your server-side Python script can feed a regex into client-side JavaScript code.
Small tip: I recommend you test with reTest instead of RegexBuddy. There are different regular expression engines for different programming languages. reTest is valuable in that it allows you to quickly test regular expression strings within Python itself. That way you can ensure that you have tested your syntax with Python's own regular expression engine.
The error is that you have two quantifiers in a row, {0,4} and +. Unless + is meant to be a literal here (which I doubt, since you're talking about numbers), I don't think you need it at all. Unless it means something different in this situation (it marks the {} quantifier as possessive)? I would try
[\d]+([\d]{0,4}[1-9])0*
If you actually intended to have both quantifiers to be applied, then this might work
[\d]+(([\d]{0,4})+[1-9])0*
But given your specification of the problem, I doubt that's what you want.
This is my solution.
re.search(r'[1-9]\d{0,3}[1-9](?=0*(?:\b|\s|[A-Za-z]))', '02324560001230045980a').group(0)
'4598'
[1-9] - the number must start with a digit from 1 to 9
\d{0,3} - 0 to 3 digits
[1-9] - the number must finish with a digit from 1 to 9
(?=0*(?:\b|\s|[A-Za-z])) - the part of the string after the match must consist of zero or more 0s followed by a word boundary, whitespace, or a letter

How to store regular expressions in the Google App Engine datastore?

Regular expressions are usually expressed as strings, but they also have properties (i.e. single-line, multi-line, ignore-case). How would you store them? And how would you store a compiled regular expression?
Please note that we can write custom property classes: http://googleappengine.blogspot.com/2009/07/writing-custom-property-classes.html
As I don't understand Python enough, my first try to write a custom property which stores a compiled regular expression failed.
I'm not sure if Python supports it, but in .NET regex you can specify these options within the regex itself:
(?si)^a.*z$
would specify single-line, ignore case.
Indeed, the Python docs describe such a mechanism here: http://docs.python.org/library/re.html
To recap: (cut'n'paste from link above)
(?iLmsux)
(One or more letters from the set 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), re.U (Unicode dependent), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the compile() function.
Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.
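A quick check that the inline-flag form from the .NET example works the same way in Python's re:

```python
import re

# (?si) turns on DOTALL (s) and IGNORECASE (i) for the whole pattern,
# so '.' crosses the newline and 'a'/'z' match either case.
m = re.search(r"(?si)^a.*z$", "Abc\nxyZ")
print(m.group(0))  # 'Abc\nxyZ'
```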
I wouldn't try to store the compiled regex. The data in a compiled regex is not designed to be stored, and is not guaranteed to be picklable or serializable. Just store the string and re-compile (the re module will do this for you behind the scenes anyway).
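A minimal sketch of that approach: persist the pattern text (and, if needed, the flags integer) as plain data, and recompile on the way out:

```python
import re

compiled = re.compile(r"^a.*z$", re.IGNORECASE)

# Store these two plain values (a str and an int) in the datastore...
stored_pattern, stored_flags = compiled.pattern, compiled.flags

# ...and rebuild the regex when loading; re caches compiled patterns
# internally, so repeated re.compile calls are cheap.
restored = re.compile(stored_pattern, stored_flags)
found = restored.search("Abc xyz")
```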
You can either store the text, as suggested above, or you can pickle and unpickle the compiled RE. For example, see PickledProperty on the cookbook.
Due to the (lack of) speed of pickle, particularly on App Engine where cPickle is unavailable, you'll probably find that storing the text of the regex is the faster option. In fact, it appears that when pickled, a compiled regex simply stores the original pattern text anyway.
