I am a Python user looking to learn regular expressions, and I have just found a course on Udemy that seems to be OK. However, it is neither a Python course nor a Python regular expression course.
Are regular expressions the same in every programming language?
I mean, would they be the same and use the exact same syntax I would be using with the re module in Python?
There are variations on them...
This site will let you test your expression against some common languages (including Python)...
https://regex101.com/
There are significant differences both large and subtle between implementations.
According to the (2.7) regex howto, Python's re module was based on Perl regular expressions. The regular expression syntax is almost the same. The usage in Perl is quite different; more compact (or more unreadable, depending on your views :-).
Also keep in mind that there are differences in regular expressions between Python 2 and 3, depending on which flags are used. Simplifying somewhat, you could say that out of the box, Python 2 regexes handle ASCII strings while Python 3 regexes handle Unicode strings.
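The difference is easy to see from Python 3, where the re.ASCII flag restores the Python 2-style behavior (this sketch assumes a Python 3 interpreter):

```python
import re

s = 'naïve café'
# Python 3 default: \w is Unicode-aware
print(re.findall(r'\w+', s))            # ['naïve', 'café']
# re.ASCII makes \w behave like a Python 2 byte-string regex
print(re.findall(r'\w+', s, re.ASCII))  # ['na', 've', 'caf']
```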
In Python regular expressions, the * and + qualifiers are greedy; that is, they match as much text as possible. That can produce unintuitive results. For example, suppose you want to match text between angle brackets. You might think that <.*> would do that. But observe:
In [1]: import re
In [2]: re.findall('<.*>', '<a> <b> <c>')
Out[2]: ['<a> <b> <c>']
You have to add a ? to make them non-greedy.
In [3]: re.findall('<.*?>', '<a> <b> <c>')
Out[3]: ['<a>', '<b>', '<c>']
To be explicit, you could instead match anything but the closing character.
In [4]: re.findall('<[^>]*>', '<a> <b> <c>')
Out[4]: ['<a>', '<b>', '<c>']
UNIX-like systems such as Linux and *BSD generally support POSIX regular expressions in many utilities. Those come in two flavors, basic and extended. Basic POSIX regular expressions do not support the branching metacharacter |.
Related
I have a bit of an issue with regex in Python. I am familiar with this regex script in PHP: https://gist.github.com/benvds/350404, but in Python, using the re module, I keep getting no results:
re.findall(r"#^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$#", "Wilhelminakade 173")
Output is []
Any ideas?
PHP supports alternative characters as regex delimiters. Your sample Gist uses # for that purpose. They are not part of the regex in PHP, and they are not needed in Python at all. They prevent a match. Remove them.
re.findall(r"^([\w[:punct:] ]+) ([0-9]{1,5})([\w[:punct:]\-/]*)$", "Wilhelminakade 173")
This still gives no result because Python's regex engine does not know what [:punct:] is supposed to mean; there is no support for POSIX character classes in Python's re. Replace them with something else (e.g. the punctuation you expect, probably something like dots, apostrophes, and dashes). This results in
re.findall(r"^([\w.'\- ]+) ([0-9]{1,5})([\w.'\-/]*)$", "Wilhelminakade 173")
which gives [('Wilhelminakade', '173', '')].
Long story short, there are different regex engines in different programming languages. You cannot just copy regex from PHP to Python without looking at it closely, and expect it to work.
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 7 years ago.
The aim is to convert these regexes in C++ boost to Python re regexes:
typedef boost::u32regex tRegex;
tRegex emptyre = boost::make_u32regex("^$");
tRegex commentre = boost::make_u32regex("^;.*$");
tRegex versionre = boost::make_u32regex("^#\\$Date: (.*) \\$$");
tRegex includere = boost::make_u32regex("^<(\\S+)$");
tRegex rungroupre = boost::make_u32regex("^>(\\d+)$");
tRegex readreppre = boost::make_u32regex("^>(\\S+)$");
tRegex tokre = boost::make_u32regex("^:(.*)$");
tRegex groupstartre = boost::make_u32regex("^#(\\d+)$");
tRegex groupendre = boost::make_u32regex("^#$");
tRegex rulere = boost::make_u32regex("^([!-+^])([^\\t]+)\\t+([^\\t]*)$");
I could rewrite these regexes one by one, but there are quite a lot more than the examples above, so my question concerns
how to convert C++ boost regexest to Python and
what is the difference between boost regexes and python re regexes?
Is the C++ boost::u32regex the same as re regexes in python? If not, what is the difference? (Links to the docs would be much appreciated =) ) For instance:
in boost, there's boost::u32regex_match, is that the same as
re.match?
in boost, there's boost::u32regex_search, how is it different from re.search?
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalents in python re?
How to convert C++ boost regexest to Python
For a simple regex like \w+\s+\d+ or >.*$, you won't have to change the pattern. For more complex patterns containing the constructs mentioned below, you will most probably have to rewrite the regex. As with any conversion from one flavor/language to another, the general answer is DON'T. However, Python and Boost do have some similarities, especially when it comes to simple patterns (if Boost is using a PCRE-like pattern) containing a dot (a.*b), regular ([\w-]*) and negated ([^>]*) character classes, regular quantifiers like +/*/?, and suchlike.
what is the difference between boost regexes and python re regexes?
Python's re module is not as rich as Boost regex (suffice it to mention constructs such as \h, \G, \K, \R, \X, \Q...\E, branch reset, recursion, possessive quantifiers, POSIX character classes and character properties, and the extended replacement pattern), among other features that Boost has. The (?imsx-imsx:pattern) syntax is limited to the whole expression in Python, not to a part of it, so you should be aware that (?i) in your &|&#((?i)x26);|& will be treated as if it were at the beginning of the pattern (however, it does not have any impact on this particular expression).
Also, same as in Boost, you do not have to escape [ inside a character class, and { outside the character class.
Backreferences like \1 work the same in Boost and Python.
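Both the escaping rules and the backreference behavior are quick to check in Python:

```python
import re

# '{' that does not start a quantifier is treated literally
assert re.fullmatch(r'x{y', 'x{y')
# '[' needs no escape inside a character class (']' does)
assert re.fullmatch(r'[[\]]+', '[]')
# backreferences use \1, as in Boost
assert re.fullmatch(r'(\w)\1', 'aa')
assert not re.fullmatch(r'(\w)\1', 'ab')
```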
Since you are not using capturing groups in alternation in your patterns (e.g. re.sub(r'\d(\w)|(go\w*)', '\2', 'goon')), there should be no problem (in such cases, Python does not fill in the non-participating group with any value, and returns an empty result).
Note the difference in named group definition: (?<NAME>expression)/(?'NAME'expression) in Boost, and (?P<NAME>expression) in Python.
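A minimal illustration of the Python spelling (the Boost-style (?<NAME>...) is a pattern error in re):

```python
import re

# Python requires the P: (?P<name>...) to define, (?P=name) to backreference
m = re.match(r'(?P<word>\w+) (?P=word)', 'hello hello')
print(m.group('word'))  # hello

# the Boost/.NET-style spelling raises an error in Python's re
try:
    re.compile(r'(?<word>\w+)')
except re.error as e:
    print('unsupported:', e)
```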
I see your regexps mainly fall under "simple" category. The most complex pattern is a tempered greedy token (e.g. ⌊-((?:(?!-⌋).)*)-⌋). To optimize them, you could use an unroll the loop technique, but it may not be necessary depending on the size of texts you handle with the expressions.
The most troublesome part, as I see it, is that you are using Unicode literals heavily. In Python 2.x, all strings are byte arrays, and you will always have to make sure you pass a unicode object to the Unicode regexps (see Python 2.x's Unicode support). In Python 3, all strings are Unicode by default, and you can even use Unicode literal characters in source code without any additional steps (see Python 3's Unicode support). So Python 3.3+ (which reintroduced support for the u'...' string prefix) is a good candidate.
Now, as for the remaining questions:
in boost, there's boost::u32regex_match, is that the same as re.match?
re.match is not the same as regex_match: re.match looks for a match at the beginning of the string, while regex_match requires a full-string match. However, in Python 3.4+, you can use re.fullmatch(pattern, string, flags=0), which is equivalent to Boost's regex_match.
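For instance:

```python
import re

# re.match anchors only at the start of the string...
print(re.match(r'\d+', '123abc').group(0))   # '123'
# ...while re.fullmatch, like Boost's regex_match, needs the whole string
print(re.fullmatch(r'\d+', '123abc'))        # None
print(re.fullmatch(r'\d+', '123').group(0))  # '123'
```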
in boost, there's boost::u32regex_search, how is it different to re.search
Whenever you need to find a match anywhere inside a string, use re.search (see match() versus search()). Thus, this method provides functionality analogous to Boost's regex_search.
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?
Python does not support Perl-like expressions to the extent Boost can; Python's re module is just a "trimmed" Perl regex engine that lacks many of the nice features mentioned earlier. Thus, no flags like default or perl can be found there. As for smatch, you can use re.finditer to get all the match objects. re.findall returns all matches (or only the submatches if capturing groups are specified) as a list of strings/list of tuples. See the re.findall/re.finditer difference.
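A small sketch of that difference (the sample text is made up):

```python
import re

text = 'Wilhelminakade 173, Coolsingel 42'
pattern = r'(\w+) (\d+)'

# finditer yields match objects, the closest analogue to iterating smatch results
for m in re.finditer(pattern, text):
    print(m.group(0), '->', m.group(1), m.group(2))

# findall returns only the captured groups once groups are present
print(re.findall(pattern, text))  # [('Wilhelminakade', '173'), ('Coolsingel', '42')]
```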
In conclusion, a must-read article: Python's re Module.
It appears that POSIX splits regular expression implementations into two kinds: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).
Python re module reference does not seem to specify.
Except for some similarity in syntax, the re module doesn't follow the POSIX standard for regular expressions.
Different matching semantics
A POSIX regular expression engine (which can be implemented with a DFA/NFA or even a backtracking engine) always finds the leftmost longest match, while the re module is a backtracking engine that finds the leftmost "earliest" match ("earliest" according to the search order defined by the regular expression).
The difference in the matching semantics can be observed in the case of matching (Prefix|PrefixSuffix) against PrefixSuffix.
In a POSIX-compliant implementation of POSIX regex (not those that only borrow the syntax), the regex will match PrefixSuffix.
In contrast, re engine (and many other backtracking regex engines) will match Prefix only, since Prefix is specified first in the alternation.
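This is easy to verify:

```python
import re

# a POSIX leftmost-longest engine would return 'PrefixSuffix' here
print(re.search(r'Prefix|PrefixSuffix', 'PrefixSuffix').group(0))  # 'Prefix'
```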
The difference can also be seen in the case of matching (xxx|xxxxx)* against xxxxxxxxxx (a string of 10 x's):
On Cygwin:
$ [[ "xxxxxxxxxx" =~ (xxx|xxxxx)* ]] && echo "${BASH_REMATCH[0]}"
xxxxxxxxxx
All 10 x's are matched.
In Python:
>>> re.search(r'(?:xxx|xxxxx)*', 'xxxxxxxxxx').group(0)
'xxxxxxxxx'
Only 9 x's are matched, since the engine picks the first item in the alternation, xxx, in all 3 repetitions, and nothing forces it to backtrack and try the second item.
POSIX-exclusive regular expression features
Apart from the difference in matching semantics, POSIX regular expressions also define syntax for collating symbols, equivalence class expressions, and collation-based character ranges. These features greatly increase the expressive power of the regex.
Taking equivalence class expression as example, from the documentation:
An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. [...]. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". [...]
Since these features depend heavily on locale settings, the same regex may behave differently under different locales. The collation order also depends on the locale data present on the system.
re regular expression features
re borrows its syntax from Perl, but not all features of Perl regex are implemented in re. Below are some regex features available in re that are unavailable in POSIX regular expressions:
Greedy/lazy quantifier, which specifies the order to expand a quantifier.
While people usually call POSIX's * greedy, it actually only specifies the lower and upper bound of the repetition in POSIX. The so-called "greedy" behavior is due to the leftmost-longest-match rule.
Look-around assertion (look-ahead and look-behind)
Conditional pattern (?(id/name)yes-pattern|no-pattern)
Short-hand constructs: \b, \s, \d, \w (some POSIX regular expression engine may implement these, since the standard leaves the behavior undefined for these cases)
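Each of these re features can be demonstrated in a few lines:

```python
import re

# lazy quantifier: stops at the earliest possible end
assert re.search(r'<.+?>', '<a><b>').group(0) == '<a>'

# look-behind and look-ahead assertions
assert re.findall(r'(?<=\$)\d+', 'cost: $42') == ['42']
assert re.findall(r'\d+(?= items)', '7 items') == ['7']

# conditional pattern: the closing quote is required only if group 1 matched
pat = re.compile(r'(")?\w+(?(1)")')
assert pat.fullmatch('"word"')
assert pat.fullmatch('word')
assert pat.fullmatch('"word') is None
```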
Neither. It's basically the PCRE dialect, but a distinct implementation.
The very first sentence in the re documentation says:
This module provides regular expression matching operations similar to those found in Perl.
While this does not immediately reveal to a newcomer how they are related to e.g. POSIX regular expressions, it should be common knowledge that Perl 4 and later Perl 5 provided a substantially expanded feature set over the regex features of earlier tools, including what POSIX mandated for grep -E aka ERE.
The perlre manual page describes the regular expression features in more detail, though you'll find much the same details in a different form in the Python documentation. The Perl manual page contains this bit of history:
The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.)
(Here, V8 refers to Version 8 Unix. Spencer's library basically (re)implemented POSIX regular expressions.)
Perl 4 had a large number of convenience constructs like \d, \s, \w as well as symbolic shorthands like \t, \f, \n. Perl 5 added a significant set of extensions (which is still growing slowly) including, but not limited to,
Non-greedy quantifiers
Non-backtracking quantifiers
Unicode symbol and property support
Non-grouping parentheses
Lookaheads and lookbehinds
... Basically anything that starts with (?
As a result, the "regular" expressions are by no means strictly "regular" any longer.
This was reimplemented in a portable library by Philip Hazel, originally for the Exim mail server; his PCRE library has found its way into myriad different applications, including a number of programming languages (Ruby, PHP, Python, etc). Incidentally, in spite of the name, the library is not strictly "Perl compatible" (any longer); there are differences in features as well as in behavior. (For example, Perl internally changes * to something like {0,32767} while PCRE does something else.)
An earlier version of Python actually had a different regex implementation, and there are plans to change it again (though it will remain basically PCRE). This is the situation as of Python 2.7 / 3.5.
I have multiple filters for files (I'm using python). Some of them are glob filters some of them are regular expressions. I have both case sensitive and case insensitive globs and regexes. I can transform the glob into a regular expression with translate.
I can combine the case sensitive regular expressions into one big regular expression. Let's call it R_sensitive.
I can combine the case insensitive regular expressions into one big regular expression (case insensitive). Let's call it R_insensitive.
Is there a way to combine R_insensitive and R_sensitive into one regular expression, with each part keeping its own case sensitivity?
Thanks,
Iulian
NOTE: The way I combine expressions is the following:
Having R1,R2,R3 regexes I make R = (R1)|(R2)|(R3).
EXAMPLE:
I'm searching for "*.txt" (case-insensitive glob). But I have another glob like "*abc*" (case sensitive). How do I combine (programmatically) the two regexes produced by fnmatch.translate when one is case insensitive while the other is case sensitive?
Unfortunately, the regex ability you describe requires either mid-pattern inline modifiers or a modifier span. Python's re supports neither (modifier spans such as (?i:...) were only added later, in Python 3.6), though here is what they would look like:
Ordinal Modifiers: (?i)case_insensitive_match(?-i)case_sensitive_match
Modifier Spans: (?i:case_insensitive_match)(?-i:case_sensitive_match)
In Python, both fail to parse in re. The closest thing you could do (for simple or small matches) would be letter groups:
[Cc][Aa][Ss][Ee]_[Ii][Nn][Ss][Ee][Nn][Ss][Ii][Tt][Ii][Vv][Ee]_[Mm][Aa][Tt][Cc][Hh]case_sensitive_match
Obviously, this approach would be best for something where the insensitive portion is very brief, so I'm afraid it wouldn't be the best choice for you.
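If you do go the letter-group route, the classes can at least be generated rather than written by hand. The translate_literal helper below is hypothetical, not a stdlib function:

```python
import re

def translate_literal(s):
    """Rewrite each cased letter of a literal string as a [Xx] class."""
    return ''.join(
        '[%s%s]' % (c.upper(), c.lower()) if c.isalpha() else re.escape(c)
        for c in s
    )

print(translate_literal('case'))  # [Cc][Aa][Ss][Ee]

# insensitive prefix, sensitive suffix, combined in one pattern
pattern = translate_literal('case_insensitive') + '_case_sensitive'
assert re.fullmatch(pattern, 'CaSe_InSeNsItIvE_case_sensitive')
assert not re.fullmatch(pattern, 'case_insensitive_CASE_SENSITIVE')
```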
What you need is a way to convert a case-insensitive-flagged regexp into a regexp that works equivalent without the flag.
To do this fully generally is going to be a nightmare.
To do this just for fnmatch results is a whole lot easier.
If you need to handle full Unicode case rules, it will still be very hard.
If you only need to handle making sure each character c also matches c.upper() and c.lower(), it's very easy.
I'm only going to explain the easy case, because it's probably what you want, given your examples, and it's easy. :)
Some modules in the Python standard library are meant to serve as sample code as well as working implementations; these modules' docs start with a link directly to their source code. And fnmatch has such a link.
If you understand regexp syntax, and glob syntax, and look at the source to the translate function, it should be pretty easy to write your own translatenocase function.
Basically: In the inner else clause for building character classes, iterate over the characters, and for each character, if c.upper() != c.lower(), append both instead of c. Then, in the outer else clause for non-special characters, if c.upper() != c.lower(), append a two-character character class consisting of those two characters.
So, translatenocase('*.txt') will return something like r'.*\.[tT][xX][tT]' instead of something like r'.*\.txt'. But normal translate('*abc*') will of course return the usual r'.*abc.*'. And you can combine these just by using an alternation, as you apparently already know how to do.
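A minimal sketch of that idea, handling only *, ? and literal characters (the real fnmatch.translate also handles [...] sets, so treat this as illustrative):

```python
import re

def translatenocase(pat):
    """Translate a shell glob to a case-insensitive regex string (sketch)."""
    out = []
    for c in pat:
        if c == '*':
            out.append('.*')
        elif c == '?':
            out.append('.')
        elif c.upper() != c.lower():
            # cased letter: emit a two-character class instead of c
            out.append('[%s%s]' % (c.upper(), c.lower()))
        else:
            out.append(re.escape(c))
    return ''.join(out) + r'\Z'

print(translatenocase('*.txt'))  # .*\.[Tt][Xx][Tt]\Z
assert re.match(translatenocase('*.txt'), 'NOTES.TXT')
assert not re.match(translatenocase('*.txt'), 'notes.txt.bak')
```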
Goal: given a number (it may be very long, and it is greater than 0), I'd like to get the five least significant digits, dropping any trailing zeros.
I tried to solve this with regex, Helped by RegexBuddy I came to this one:
[\d]+([\d]{0,4}+[1-9])0*
But python can't compile that.
>>> import re
>>> re.compile(r"[\d]+([\d]{0,4}+[1-9])0*")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/re.py", line 188, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.5/re.py", line 241, in _compile
raise error, v # invalid expression
sre_constants.error: multiple repeat
The problem is the "+" after "{0,4}", it seems it doesn't work in python (even in 2.6)
How can I write a working regex?
PS:
I know you can start dividing by 10 and then using the remainder n%100000... but this is a problem about regex.
That regular expression is needlessly complicated. Try this:
>>> import re
>>> re.compile(r"(\d{0,4}[1-9])0*$")
The above regular expression assumes that the input is a valid number (it will also match "abc0123450", for example). If you really need to validate that there are no non-digit characters, you may use this:
>>> import re
>>> re.compile(r"^\d*?(\d{0,4}[1-9])0*$")
Anyway, the \d does not need to be in a character class, and the quantifier {0,4} does not need to be made possessive (which is what the additional + specifies, and which Python does not support).
Also, in the second regular expression, the leading \d* is non-greedy, which I believe improves both performance and accuracy. I also made it "zero or more", as I assume that is what you want.
I also added anchors as this ensures that your regular expression won't match anything in the middle of a string. If this is what you desired though (maybe you're scanning a long text?), remove the anchors.
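A quick check of the anchored version:

```python
import re

pat = re.compile(r"^\d*?(\d{0,4}[1-9])0*$")
# five least significant digits, with trailing zeros stripped
print(pat.match('1234567890000').group(1))  # 56789
print(pat.match('120').group(1))            # 12
```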
\d{0,4}+ is a possessive quantifier supported by certain regular expression flavors such as .NET and Java. Python does not support possessive quantifiers.
In RegexBuddy, select Python in the toolbar at the top, and RegexBuddy will tell you that Python doesn't support possessive quantifiers. The + will be highlighted in red in the regular expression, and the Create tab will indicate the error.
If you select Python on the Use tab in RegexBuddy, RegexBuddy will generate a Python source code snippet with a regular expression without the possessive quantifier, and a comment indicating that the removal of the possessive quantifier may yield different results. Here's the Python code that RegexBuddy generates using the regex from the question:
# Your regular expression could not be converted to the flavor required by this language:
# Python does not support possessive quantifiers
# Because of this, the code snippet below will not work as you intended, if at all.
reobj = re.compile(r"[\d]+([\d]{0,4}[1-9])0*")
What you probably did is select a flavor such as Java in the main toolbar, and then click Copy Regex as Python String. That gives you a Java regular expression formatted as a Python string. The items in the Copy menu do not convert your regular expression; they merely format it as a string. This allows you to do things like format a JavaScript regular expression as a Python string so your server-side Python script can feed a regex into client-side JavaScript code.
Small tip: I recommend you test with reTest instead of RegexBuddy. There are different regular expression engines for different programming languages. ReTest is valuable in that it allows you to quickly test regular expression strings within Python itself. That way you can ensure that you tested your syntax against Python's own regular expression engine.
The error seems to be that you have two quantifiers in a row, {0,4} and +. Unless + is meant to be a literal here (which I doubt, since you're talking about numbers), then I don't think you need it at all. Unless it means something different in this situation (possibly the greediness of the {} quantifier)? I would try
[\d]+([\d]{0,4}[1-9])0*
If you actually intended to have both quantifiers to be applied, then this might work
[\d]+(([\d]{0,4})+[1-9])0*
But given your specification of the problem, I doubt that's what you want.
This is my solution.
re.search(r'[1-9]\d{0,3}[1-9](?=0*(?:\b|\s|[A-Za-z]))', '02324560001230045980a').group(0)
'4598'
[1-9] - the number must start with a digit from 1 to 9
\d{0,3} - zero to three digits
[1-9] - the number must finish with a digit from 1 to 9
(?=0*(?:\b|\s|[A-Za-z])) - the match must be followed only by zeros and then a word boundary, whitespace, or a letter