Converting C++ Boost Regexes to Python re regexes [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
The aim is to convert these regexes in C++ boost to Python re regexes:
typedef boost::u32regex tRegex;
tRegex emptyre = boost::make_u32regex("^$");
tRegex commentre = boost::make_u32regex("^;.*$");
tRegex versionre = boost::make_u32regex("^#\\$Date: (.*) \\$$");
tRegex includere = boost::make_u32regex("^<(\\S+)$");
tRegex rungroupre = boost::make_u32regex("^>(\\d+)$");
tRegex readreppre = boost::make_u32regex("^>(\\S+)$");
tRegex tokre = boost::make_u32regex("^:(.*)$");
tRegex groupstartre = boost::make_u32regex("^#(\\d+)$");
tRegex groupendre = boost::make_u32regex("^#$");
tRegex rulere = boost::make_u32regex("^([!-+^])([^\\t]+)\\t+([^\\t]*)$");
I could rewrite these regexes one by one, but there are quite a lot more than the examples above, so my question is with regard to
how to convert C++ boost regexes to Python and
what is the difference between boost regexes and python re regexes?
Is the C++ boost::u32regex the same as re regexes in python? If not, what is the difference? (Links to the docs would be much appreciated =) ) For instance:
in boost, there's boost::u32regex_match, is that the same as
re.match?
in boost, there's boost::u32regex_search, how is it different to re.search
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?

How to convert C++ boost regexes to Python
For a simple regex, like \w+\s+\d+ or >.*$, you won't have to change the pattern. For more complex patterns using the constructs mentioned below, you will most probably have to rewrite the regex. As with any conversion from one flavor/language to another, the general answer is DON'T. However, Python and Boost do have some similarities, especially when it comes to simple patterns (if Boost is using a PCRE-like syntax) containing a dot (a.*b), regular ([\w-]*) and negated ([^>]*) character classes, regular quantifiers like +/*/?, and suchlike.
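As a rough illustration of the "simple patterns carry over" point, here is a minimal sketch that reuses a few of the patterns from the question. Only the C++ string escaping is removed (for example "\\S" becomes \S in a raw string). The PATTERNS table and the classify helper are hypothetical names invented for this example, not part of any library.

```python
import re

# Boost patterns from the question, unchanged apart from the C++ escaping.
PATTERNS = {
    "empty": re.compile(r"^$"),
    "comment": re.compile(r"^;.*$"),
    "version": re.compile(r"^#\$Date: (.*) \$$"),
    "include": re.compile(r"^<(\S+)$"),
    "rule": re.compile(r"^([!-+^])([^\t]+)\t+([^\t]*)$"),
}


def classify(line):
    """Return (name, groups) for the first pattern that matches, else None."""
    for name, pattern in PATTERNS.items():
        m = pattern.search(line)
        if m:
            return name, m.groups()
    return None


print(classify("; a comment"))
print(classify("#$Date: 2015-07-01 $"))
print(classify("!lhs\trhs"))
```

Because every pattern here is anchored with ^ and $, re.search behaves like a whole-line match for single-line input.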
what is the difference between boost regexes and python re regexes?
Python's re module is not as rich as Boost regex: it lacks such constructs as \h, \G, \K, \R, \X, \Q...\E, branch reset, recursion, POSIX character classes, character properties, and the extended replacement pattern syntax. Possessive quantifiers and atomic groups are only available from Python 3.11 onwards. A bare inline modifier such as (?i) always applies to the whole expression in Python, not to a part of it (since Python 3.11 it must appear at the very start of the pattern, otherwise re raises an error; the scoped form (?i:pattern) has been available since Python 3.6). Thus you should be aware that the (?i) in your &|&#((?i)x26);|& would be treated as if it were at the beginning of the pattern (however, it does not have any impact on this expression).
Also, same as in Boost, you do not have to escape [ inside a character class, and { outside the character class.
The backreferences like \1 are the same as in Python.
Since you are not using capturing groups in alternation in your patterns (e.g. re.sub(r'\d(\w)|(go\w*)', r'\2', 'goon')), there should be no problem (in such cases, Python does not fill in the non-participating group with any value: since Python 3.5 it is replaced with an empty string, and earlier versions raise an error).
Note the difference in named group definition: (?<NAME>expression)/(?'NAME'expression) in Boost, and (?P<NAME>expression) in Python.
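The named-group difference can be handled mechanically in simple cases. The following is a naive sketch, assuming the patterns contain no escaped parentheses or character classes that happen to look like a group opener. The function name boost_named_groups_to_python is made up for this example.

```python
import re


def boost_named_groups_to_python(pattern):
    """Rewrite (?<NAME>...) and (?'NAME'...) as (?P<NAME>...).

    Naive: it does not skip escaped parentheses or character classes.
    """
    # (?<NAME> but not the lookbehind openers (?<= and (?<!
    pattern = re.sub(r"\(\?<(?![=!])(\w+)>", r"(?P<\1>", pattern)
    # (?'NAME'
    pattern = re.sub(r"\(\?'(\w+)'", r"(?P<\1>", pattern)
    return pattern


converted = boost_named_groups_to_python(r"(?<year>\d{4})-(?'month'\d{2})(?<=\d)")
print(converted)
m = re.fullmatch(converted, "2015-07")
print(m.group("year"), m.group("month"))
```

The lookbehind at the end of the sample pattern is left untouched, which is the reason for the (?![=!]) check.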
I see your regexps mainly fall under "simple" category. The most complex pattern is a tempered greedy token (e.g. ⌊-((?:(?!-⌋).)*)-⌋). To optimize them, you could use an unroll the loop technique, but it may not be necessary depending on the size of texts you handle with the expressions.
The most troublesome part, as I see it, is that you are using Unicode literals heavily. In Python 2.x, plain string literals are byte strings, and you will always have to make sure you pass a unicode object to the Unicode regexps (see Python 2.x’s Unicode Support). In Python 3, all strings (str) are Unicode by default, source files are assumed to be UTF-8, and you can use literal non-ASCII characters in source code without any additional actions (see Python’s Unicode Support). So, Python 3.3+ (with raw string literals for the patterns) is a good candidate.
Now, as for the remaining questions:
in boost, there's boost::u32regex_match, is that the same as re.match?
The re.match is not the same as regex_match: re.match only looks for a match at the beginning of the string, while regex_match requires the whole string to match. However, in Python 3.4+, you can use re.fullmatch(pattern, string, flags=0), which is equivalent to Boost regex_match.
in boost, there's boost::u32regex_search, how is it different to re.search
Whenever you need to find a match anywhere inside a string, you need to use re.search (see match() versus search()). Thus, this method provides functionality analogous to regex_search in Boost.
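The difference between the three functions is easiest to see side by side. This is a small sketch using an arbitrary \d+ pattern. The full_match_compat helper is a hypothetical fallback for Python versions older than 3.4, not a standard function.

```python
import re

pattern = r"\d+"

at_start = re.match(pattern, "123abc")       # matches '123' at the start
whole = re.fullmatch(pattern, "123abc")      # None: 'abc' is left over
anywhere = re.search(pattern, "abc123")      # finds '123' in the middle
not_at_start = re.match(pattern, "abc123")   # None: no digits at the start


def full_match_compat(pattern, string):
    """Emulate re.fullmatch by anchoring the pattern at the very end."""
    return re.match(r"(?:" + pattern + r")\Z", string)


print(at_start.group(0), whole, anywhere.group(0), not_at_start)
print(full_match_compat(pattern, "123"), full_match_compat(pattern, "123abc"))
```

So regex_match maps to re.fullmatch, and regex_search maps to re.search. re.match has no direct Boost counterpart among the functions mentioned in the question.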
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?
Python does not support Perl-like expressions to the extent Boost can; the Python re module is just a "trimmed" Perl-style regex engine that lacks many of the nice features I mentioned earlier. Thus, no flags like default or perl can be found there. As for smatch, its closest counterpart is the match object returned by re.search, re.match and re.fullmatch; you can use re.finditer to get all the match objects. re.findall returns all matches (or only the submatches if capturing groups are specified) as a list of strings or a list of tuples. See the re.findall/re.finditer difference.
In conclusion, a must-read article: Python’s re Module.

Related

Python regex to split dutch address

I have a bit of an issue with regex in python, I am familiar with this regex script in PHP: https://gist.github.com/benvds/350404, but in Python, using the re module, I keep getting no results:
re.findall(r"#^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$#", "Wilhelminakade 173")
Output is []
Any ideas?

PHP supports alternative characters as regex delimiters. Your sample Gist uses # for that purpose. They are not part of the regex in PHP, and they are not needed in Python at all. They prevent a match. Remove them.
re.findall(r"^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$", "Wilhelminakade 173")
This still gives no result because Python regex does not know what [:punct:] is supposed to mean. There is no support for POSIX character classes in Python's re. Replace them with something else (i.e. the punctuation you expect, probably something like "dots, apostrophes, dashes"). This results in
re.findall(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$", "Wilhelminakade 173")
which gives [('Wilhelminakade', '173', '')].
Long story short, there are different regex engines in different programming languages. You cannot just copy regex from PHP to Python without looking at it closely, and expect it to work.
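If you would rather not hand-pick the punctuation, one possible stand-in for [:punct:] is string.punctuation from the standard library. This is only a sketch: string.punctuation covers ASCII punctuation only, so it is an approximation of the POSIX class, and the names PUNCT and ADDRESS are made up for the example.

```python
import re
import string

# ASCII-only stand-in for [:punct:]; re.escape makes every character literal,
# so the result is safe to drop inside a character class.
PUNCT = re.escape(string.punctuation)
ADDRESS = re.compile(
    r"^([\w" + PUNCT + r" ]+) ([0-9]{1,5})([\w" + PUNCT + r"]*)$"
)

print(ADDRESS.findall("Wilhelminakade 173"))
print(ADDRESS.findall("Kerkstraat 12-a"))
```

The slash from the original pattern does not need to be listed separately, because string.punctuation already contains it.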

How to replace 'x!' with 'math.factorial(x)' intelligently [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I am currently implementing a graphical calculator in Python, in a manner where you can type a natural expression and evaluate it. This is not a problem with most functions or operators, but as the factorial function is denoted by a ! after the operand, this is more difficult.
What I have is a string which contains a function, for example: '(2x + 1)!' which needs to be replaced with: 'math.factorial((2x + 1))'
However, the string may also include other terms such as: '2*x + (2x + 1)! - math.sin(x)'
and the factorial term may not necessarily contain brackets: '2!'
I've been trying to find a solution to this problem, but to no avail, I don't think the string.replace() method can do this directly. Is what I seek too ambitious, or is there some method by which I could achieve a result as desired?

There's two levels of answer to your question: (1) solve your current problem; (2) solve your general problem.
(1) is pretty easy - the most common, versatile tool in general for doing string pattern matching and replacement is Regular Expressions (RE). You simply define the pattern you're looking for, the pattern you want out, and pass the RE engine your string. Python has a RE module built-in called re. Most languages have something similar. Some languages (eg. Perl) even have it as a core part of the language syntax.
A pattern is a series of either specific characters, or non-specific ("wildcard") characters. So in your case you want the non-specific characters before a specific '!' character. You seem to suggest that "before" in your case means either all the non-whitespace characters, or if the preceding character is a ')', then all the characters between that and the preceding '('. So let's build that pattern, starting with the version without the parentheses:
[\w] - the set of characters which are letters, numbers or underscores (we need a set of characters that doesn't include whitespace or ')' so I'm taking some liberty to keep the example simple - you could always build your own more complex set with the '[]' pattern)
+ - at least one of them
! - the '!' character
And then the version with the parentheses:
\( - the literal '(' character, as opposed to the special function that ( has
. - any character
+ - at least one of them
? - but don't be "greedy", i.e. only take the smallest set of characters that match the pattern (in simple cases this works out to be the nearest closing parenthesis that is followed by '!')
\) - the closing ')' character
! - the '!' character
Then we just need to put it all together. We use | to match the first pattern OR the second pattern. And we use ( and ) to denote the part of the pattern we want to "capture" - it's the bit before the '!' and inside the parentheses that we want to use later. So your pattern becomes:
([\w]+)!|\((.+?)\)!
Don't worry, regular expressions always come out looking like someone has just mashed the keyboard. There are some great tools out there like RegExr which help break down complex regular expressions.
Finally, you just need to take your captures and stick them in "math.factorial". \x means the xth capture group. If the first pattern matches, \2 will be blank and vice-versa (on Python 3.5 and later; earlier versions raise an error for a group that did not take part in the match), so we can just use both of them at once.
math.factorial(\1\2)
That's it! Here is how you run your RE in Python (note the r before the strings prevents Python from trying to process the \ as an escape sequence):
import re
re.sub(r'([\w]+)!|\((.+?)\)!', r'math.factorial(\1\2)', '2*x + (2x + 1)! - math.sin(x) + 2!')
re.sub takes three parameters (plus some optional ones not used here): the RE pattern, the replacement string, and the input string. This produces:
'2*x + math.factorial(2x + 1) - math.sin(x) + math.factorial(2)'
which is I believe what you're after.
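To make the limits of this pattern concrete, here is a small sketch that wraps the substitution in a function and also tries an input where the lazy .+? picks up the wrong opening parenthesis. The function names and the stricter [^()]+ variant are illustrative additions, not part of the answer above, and the variant still does not handle nested parentheses. It assumes Python 3.5+, where a group that did not participate is replaced by an empty string.

```python
import re

FACTORIAL = re.compile(r'(\w+)!|\((.+?)\)!')
# Variant: forbid parentheses inside the bracketed operand.
FACTORIAL_STRICT = re.compile(r'(\w+)!|\(([^()]+)\)!')


def add_factorials(expr, pattern=FACTORIAL):
    """Replace 'operand!' with 'math.factorial(operand)' using a regex."""
    return pattern.sub(r'math.factorial(\1\2)', expr)


ok = add_factorials('2*x + (2x + 1)! - math.sin(x) + 2!')
# Here the match starts at the '(' of math.sin(x), which is not what we want.
wrong = add_factorials('math.sin(x) + (2*x + 1)!')
fixed = add_factorials('math.sin(x) + (2*x + 1)!', FACTORIAL_STRICT)

print(ok)
print(wrong)
print(fixed)
```

The second call shows why "lazy" does not mean "nearest opening parenthesis": the engine still starts at the leftmost '(' from which a match is possible.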
Now, (2) is harder. If your intention really is to implement a calculator that takes strings as input, you're quickly going to drown in regular expressions. There will be so many exceptions and variations between what can be entered and what Python can interpret, that you'll end up with something quite fragile, that will fail on its first contact with a user. If you're not intending on having users you're pretty safe - you can just stick to using the patterns that work. If not, then you'll find the pattern matching method a bit limiting.
In general, the problem you're tackling is known as parsing (or more fully, as the three step process of lexical analysis, syntactic analysis and semantic analysis). A standard way to tackle this is with a technique called recursive descent parsing.
Intriguingly, the Python interpreter performs exactly this process in interpreting the re statement above - compilers and interpreters all undertake the same process to turn a string into tokens that can be processed by a computer.
You'll find lots of tutorials on the web. It's a bit more complex than using RE, but allows significantly more generalisation. You might want to start with the very brief intro here.

You could remove the brackets and calculate everything in a linear form, such that when parsing the brackets it would evaluate the operand, and then apply the factorial function - in the order written.
Or, you could get the index of the factorial ! in the string, then if the character at the index before is a close bracket ) you know there is a bracketed operand that needs to be calculated prior to applying math.factorial().
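The second idea (find each '!' and look at what precedes it) could be sketched roughly as follows. This is one possible reading of the suggestion, not a complete parser: replace_factorials is a made-up name, an unbracketed operand is assumed to be a run of letters, digits, '_' or '.', and inputs such as '!=' or 'f(x)!' are not handled sensibly.

```python
def replace_factorials(expr):
    """Rewrite every 'operand!' as 'math.factorial(operand)'."""
    pos = expr.find("!")
    while pos != -1:
        if pos > 0 and expr[pos - 1] == ")":
            # Walk left to the '(' that balances the ')' just before '!'.
            depth = 0
            start = pos - 1
            while start >= 0:
                if expr[start] == ")":
                    depth += 1
                elif expr[start] == "(":
                    depth -= 1
                    if depth == 0:
                        break
                start -= 1
            if start < 0:
                raise ValueError("unbalanced parentheses before '!'")
            operand = expr[start + 1:pos - 1]
        else:
            # Unbracketed operand: a run of letters, digits, '_' or '.'.
            start = pos
            while start > 0 and (expr[start - 1].isalnum()
                                 or expr[start - 1] in "_."):
                start -= 1
            operand = expr[start:pos]
        replacement = "math.factorial(" + operand + ")"
        expr = expr[:start] + replacement + expr[pos + 1:]
        # Continue searching after the text that was just inserted.
        pos = expr.find("!", start + len(replacement))
    return expr


print(replace_factorials("2*x + (2x + 1)! - math.sin(x) + 2!"))
print(replace_factorials("math.sin(x) + ((1 + 2)*3)!"))
```

Because the brackets are counted rather than pattern-matched, nested parentheses in the operand are handled, which a single regular expression from the re module cannot do in general.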

Does the Python regular expression module use BRE or ERE?

It appears that POSIX splits regular expression implementations into two kinds: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).
Python re module reference does not seem to specify.

Except for some similarity in the syntax, re module doesn't follow POSIX standard for regular expressions.
Different matching semantics
POSIX regular expression (which can be implemented with a DFA/NFA or even a backtracking engine) always finds the leftmost longest match, while re module is a backtracking engine which finds the leftmost "earliest" match ("earliest" according to the search order defined by the regular expression).
The difference in the matching semantics can be observed in the case of matching (Prefix|PrefixSuffix) against PrefixSuffix.
In a POSIX-compliant implementation of POSIX regex (not one which only borrows the syntax), the regex will match PrefixSuffix.
In contrast, re engine (and many other backtracking regex engines) will match Prefix only, since Prefix is specified first in the alternation.
The difference can also be seen in the case of matching (xxx|xxxxx)* against xxxxxxxxxx (a string of 10 x's):
On Cygwin:
$ [[ "xxxxxxxxxx" =~ (xxx|xxxxx)* ]] && echo "${BASH_REMATCH[0]}"
xxxxxxxxxx
All 10 x's are matched.
In Python:
>>> re.search(r'(?:xxx|xxxxx)*', 'xxxxxxxxxx').group(0)
'xxxxxxxxx'
Only 9 x's are matched, since it picks the first item in the alternation, xxx, in all 3 repetitions, and nothing forces it to backtrack and try the second item in the alternation.
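The same leftmost-"earliest" behaviour can be checked for the Prefix/PrefixSuffix case with a few lines of Python. The sketch below also shows that requiring a full match gives the engine a reason to backtrack; the variable names are arbitrary.

```python
import re

# The first alternative that leads to a match wins, even if it is shorter.
first = re.search(r'Prefix|PrefixSuffix', 'PrefixSuffix').group(0)
# Reordering the alternatives changes the result.
reordered = re.search(r'PrefixSuffix|Prefix', 'PrefixSuffix').group(0)

ten_x = 'x' * 10
unanchored = re.search(r'(?:xxx|xxxxx)*', ten_x).group(0)
# fullmatch forces the engine to backtrack until all 10 x's are consumed.
anchored = re.fullmatch(r'(?:xxx|xxxxx)*', ten_x).group(0)

print(first, reordered, len(unanchored), len(anchored))
```

A POSIX leftmost-longest engine would return the longer result in both unanchored cases without any reordering or anchoring.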
POSIX-exclusive regular expression features
Apart from the difference in matching semantics, POSIX regular expressions also define syntax for collating symbols, equivalence class expressions, and collation-based character ranges. These features greatly increase the expressive power of the regex.
Taking equivalence class expression as example, from the documentation:
An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. [...]. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". [...]
Since these features heavily depend on the locale settings, the same regex may behave differently in different locales. It also depends on the locale data on the system for the collation order.
re regular expression features
re borrows its syntax from Perl, but not all features of Perl regex are implemented in re. Below are some regex features available in re which are unavailable in POSIX regular expressions:
Greedy/lazy quantifier, which specifies the order to expand a quantifier.
While people usually call the * in POSIX greedy, it actually only specifies the lower bound and upper bound of the repetition in POSIX. The so-called "greedy" behavior is due to the leftmost longest match rule.
Look-around assertion (look-ahead and look-behind)
Conditional pattern (?(id/name)yes-pattern|no-pattern)
Short-hand constructs: \b, \s, \d, \w (some POSIX regular expression engine may implement these, since the standard leaves the behavior undefined for these cases)
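A few short examples of the re-only features listed above, as a sketch with arbitrary sample strings (the variable names are only for illustration):

```python
import re

# Greedy vs lazy quantifier
greedy = re.search(r'<.+>', '<a><b>').group(0)    # takes as much as possible
lazy = re.search(r'<.+?>', '<a><b>').group(0)     # takes as little as possible

# Look-ahead: words that are followed by '!', without consuming the '!'
followed = re.findall(r'\w+(?=!)', 'x! y z!')

# Look-behind: digits that are preceded by '$'
prices = re.findall(r'(?<=\$)\d+', 'cost $15, qty 3')

# Conditional pattern: require '>' only if the optional '<' was matched
tag = re.compile(r'(<)?\w+(?(1)>)')
with_brackets = tag.fullmatch('<tag>')
without_brackets = tag.fullmatch('tag')
unbalanced = tag.fullmatch('<tag')

print(greedy, lazy, followed, prices)
print(bool(with_brackets), bool(without_brackets), bool(unbalanced))
```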

Neither. It's basically the PCRE dialect, but a distinct implementation.
The very first sentence in the re documentation says:
This module provides regular expression matching operations similar to those found in Perl.
While this does not immediately reveal to a newcomer how they are related to e.g. POSIX regular expressions, it should be common knowledge that Perl 4 and later Perl 5 provided a substantially expanded feature set over the regex features of earlier tools, including what POSIX mandated for grep -E aka ERE.
The perlre manual page describes the regular expression features in more detail, though you'll find much the same details in a different form in the Python documentation. The Perl manual page contains this bit of history:
The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.)
(Here, V8 refers to Version 8 Unix. Spencer's library basically (re)implemented POSIX regular expressions.)
Perl 4 had a large number of convenience constructs like \d, \s, \w as well as symbolic shorthands like \t, \f, \n. Perl 5 added a significant set of extensions (which is still growing slowly) including, but not limited to,
Non-greedy quantifiers
Non-backtracking quantifiers
Unicode symbol and property support
Non-grouping parentheses
Lookaheads and lookbehinds
... Basically anything that starts with (?
As a result, the "regular" expressions are by no means strictly "regular" any longer.
This was reimplemented in a portable library by Philip Hazel, originally for the Exim mail server; his PCRE library has found its way into myriad different applications, including a number of programming languages (PHP, for example; Python's re, as noted above, is a separate implementation of a similar dialect). Incidentally, in spite of the name, the library is not strictly "Perl compatible" (any longer); there are differences in features as well as in behavior. (For example, Perl internally changes * to something like {0,32767} while PCRE does something else.)
An earlier version of Python actually had a different regex implementation, and there are plans to change it again (though it will remain basically PCRE). This is the situation as of Python 2.7 / 3.5.

Python regex compile

The programmer who wrote the following line probably uses a python package called regex.
UNIT = regex.compile("(?:{A}(?:'{A})?)++|-+|\S".format(A='\p{Word_Break=ALetter}'))
Can some one help explain what A='\p{Word_Break=ALetter}' and -+ means?

The \p{property=value} operator matches on unicode codepoint properties, and is documented on the package index page you linked to:
Unicode codepoint properties, including scripts and blocks
\p{property=value}; \P{property=value}; \p{value} ; \P{value}
The entry matches any unicode character whose codepoint has a Word_Break property with the value ALetter (there are currently 24941 matches in the Unicode codepoint database; see the Word Boundaries chapter of the Unicode Text Segmentation specification for details).
The example you gave also uses standard python string formatting to interpolate a partial expression into the regular expression being compiled. The "{A}" part is just a placeholder for the .format(A='...') part to fill. The end result is:
"(?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S"
The -+ sequence just matches 1 or more - dashes, just like in the python re module expressions, it is not anything special, really.
Now, the ++ before that is more interesting. It's a possessive quantifier, and using it prevents the regex matcher from trying out all possible permutations of the pattern. It's a performance optimization, one that prevents catastrophic backtracking issues.
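The interpolation step can be reproduced with the standard library alone, since it is plain str.format and does not involve the regex package. The sketch below only builds the pattern string and does not compile it, because \p{...} needs the third-party regex module. The second example is an extra illustration of a related pitfall: literal quantifier braces in a format template must be doubled.

```python
# Raw strings keep the backslashes intact.
template = r"(?:{A}(?:'{A})?)++|-+|\S"
pattern = template.format(A=r"\p{Word_Break=ALetter}")
print(pattern)

# Braces that belong to the regex itself must be written as {{ and }}.
quantified = r"{A}{{2,3}}".format(A=r"\d")
print(quantified)
```

The braces inside the substituted value are not re-parsed by str.format, which is why \p{Word_Break=ALetter} passes through unchanged.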

Regex in Python

Goal: Given a number (it may be very long and it is greater than 0), I'd like to get the five least significant digits, dropping any 0 at the end of that number.
I tried to solve this with regex; helped by RegexBuddy, I came to this one:
[\d]+([\d]{0,4}+[1-9])0*
But python can't compile that.
>>> import re
>>> re.compile(r"[\d]+([\d]{0,4}+[1-9])0*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.5/re.py", line 241, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat
The problem is the "+" after "{0,4}", it seems it doesn't work in python (even in 2.6)
How can I write a working regex?
PS:
I know you can start dividing by 10 and then using the remainder n%100000... but this is a problem about regex.

That regular expression is more complicated than it needs to be. Try this:
>>> import re
>>> re.compile(r"(\d{0,4}[1-9])0*$")
The above regular expression assumes that the number is valid (it will also match "abc0123450", for example.) If you really need the validation that there are no non-number characters, you may use this:
>>> import re
>>> re.compile(r"^\d*?(\d{0,4}[1-9])0*$")
Anyway, the \d does not need to be in a character class, and the quantifier {0,4} does not need the additional + (which makes it possessive in flavors that support that syntax; apparently Python does not recognize it).
Also, in the second regular expression, the leading \d*? is non-greedy, as I believe this will improve the performance and accuracy. I also made it "zero or more" as I assume that is what you want.
I also added anchors as this ensures that your regular expression won't match anything in the middle of a string. If this is what you desired though (maybe you're scanning a long text?), remove the anchors.
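Put together, the first suggestion could be used like this. It is a minimal sketch: last_digits and TRAILING are names chosen for the example, and the function returns the captured digits as a string (so a leading 0 inside the five digits is preserved).

```python
import re

TRAILING = re.compile(r"(\d{0,4}[1-9])0*$")


def last_digits(number):
    """Return up to five trailing digits, ignoring zeros at the very end."""
    m = TRAILING.search(str(number))
    return m.group(1) if m else None


print(last_digits(1234500))
print(last_digits(9876543210))
print(last_digits("02324560001230045980"))
```

A number consisting only of zeros gives None, because the pattern requires at least one non-zero digit.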

\d{0,4}+ uses a possessive quantifier, which is supported by certain regular expression flavors such as Java and PCRE. Python's re module did not support possessive quantifiers until Python 3.11.
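On Python versions without possessive quantifiers, a commonly used workaround is to capture inside a lookahead and then consume the capture with a backreference, since a lookahead is not re-entered once it has succeeded. The sketch below contrasts that with the plain quantifier; it illustrates the general technique and is not a recommendation for the problem in the question.

```python
import re

plain = re.compile(r"\d{0,4}[1-9]")
# (?=(\d{0,4})) captures greedily without consuming; \1 then consumes
# exactly what was captured, with no way to give characters back.
possessive_like = re.compile(r"(?=(\d{0,4}))\1[1-9]")

backtracked = plain.search("1234")            # '123' + '4' after backtracking
no_match = possessive_like.search("1234")     # nothing left for [1-9]
five = possessive_like.search("12345")        # '1234' + '5'

print(backtracked.group(0))
print(no_match)
print(five.group(0))
```

This is the difference the RegexBuddy comment below warns about: dropping the possessive + can turn a non-match into a match.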
In RegexBuddy, select Python in the toolbar at the top, and RegexBuddy will tell you that Python doesn't support possessive quantifiers. The + will be highlighted in red in the regular expression, and the Create tab will indicate the error.
If you select Python on the Use tab in RegexBuddy, RegexBuddy will generate a Python source code snippet with a regular expression without the possessive quantifier, and a comment indicating that the removal of the possessive quantifier may yield different results. Here's the Python code that RegexBuddy generates using the regex from the question:
# Your regular expression could not be converted to the flavor required by this language:
# Python does not support possessive quantifiers
# Because of this, the code snippet below will not work as you intended, if at all.
reobj = re.compile(r"[\d]+([\d]{0,4}[1-9])0*")
What you probably did is select a flavor such as Java in the main toolbar, and then click Copy Regex as Python String. That will give you a Java regular expression formatted as a Python string. The items in the Copy menu do not convert your regular expression. They merely format it as a string. This allows you to do things like format a JavaScript regular expression as a Python string so your server-side Python script can feed a regex into client-side JavaScript code.

Small tip. I recommend you test with reTest instead of RegexBuddy. There are different regular expression engines for different programming languages. ReTest is valuable in that it allows you to quickly test regular expression strings within Python itself. That way you can ensure that you tested your syntax with Python's regular expression engine.

The error seems to be that you have two quantifiers in a row, {0,4} and +. Unless + is meant to be a literal here (which I doubt, since you're talking about numbers), then I don't think you need it at all. Unless it means something different in this situation (possibly the greediness of the {} quantifier)? I would try
[\d]+([\d]{0,4}[1-9])0*
If you actually intended to have both quantifiers to be applied, then this might work
[\d]+(([\d]{0,4})+[1-9])0*
But given your specification of the problem, I doubt that's what you want.

This is my solution.
re.search(r'[1-9]\d{0,3}[1-9](?=0*(?:\b|\s|[A-Za-z]))', '02324560001230045980a').group(0)
'4598'
[1-9] - the number must start with a digit from 1 to 9
\d{0,3} - 0 to 3 digits
[1-9] - the number must finish with a digit from 1 to 9
(?=0*(?:\b|\s|[A-Za-z])) - the rest of the string must consist of zeros followed by \b, \s or [A-Za-z]
