The programmer who wrote the following line probably uses a python package called regex.
UNIT = regex.compile("(?:{A}(?:'{A})?)++|-+|\S".format(A='\p{Word_Break=ALetter}'))
Can someone help explain what A='\p{Word_Break=ALetter}' and -+ mean?
The \p{property=value} operator matches on unicode codepoint properties, and is documented on the package index page you linked to:
Unicode codepoint properties, including scripts and blocks
\p{property=value}; \P{property=value}; \p{value} ; \P{value}
The entry matches any Unicode character whose codepoint has a Word_Break property with the value ALetter (there are currently 24941 matches in the Unicode codepoint database; see the Word Boundaries chapter of the Unicode Text Segmentation specification for details).
The example you gave also uses standard python string formatting to interpolate a partial expression into the regular expression being compiled. The "{A}" part is just a placeholder for the .format(A='...') part to fill. The end result is:
"(?:\p{Word_Break=ALetter}(?:'\p{Word_Break=ALetter})?)++|-+|\S"
The -+ sequence just matches 1 or more - dashes, just like in the python re module expressions, it is not anything special, really.
Now, the ++ before that is more interesting. It's a possessive quantifier, and using it prevents the regex matcher from trying out all possible permutations of the pattern. It's a performance optimization, one that prevents catastrophic backtracking issues.
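As a rough sketch of the interpolation step, using only the standard library (the variable names are illustrative and not from the original code), the placeholder substitution and the plain -+ part can be checked like this:

```python
import re

# The placeholder {A} is filled in by str.format before the pattern is compiled.
template = r"(?:{A}(?:'{A})?)++|-+|\S"
pattern_text = template.format(A=r"\p{Word_Break=ALetter}")
print(pattern_text)

# -+ on its own behaves the same in the standard re module:
# it matches runs of one or more dashes.
dash_runs = re.findall(r"-+", "a--b-c---d")
print(dash_runs)
```

The full pattern is only built as a string here, not compiled, because \p{...} needs the third-party regex package.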
The aim is to convert these regexes in C++ boost to Python re regexes:
typedef boost::u32regex tRegex;
tRegex emptyre = boost::make_u32regex("^$");
tRegex commentre = boost::make_u32regex("^;.*$");
tRegex versionre = boost::make_u32regex("^#\\$Date: (.*) \\$$");
tRegex includere = boost::make_u32regex("^<(\\S+)$");
tRegex rungroupre = boost::make_u32regex("^>(\\d+)$");
tRegex readreppre = boost::make_u32regex("^>(\\S+)$");
tRegex tokre = boost::make_u32regex("^:(.*)$");
tRegex groupstartre = boost::make_u32regex("^#(\\d+)$");
tRegex groupendre = boost::make_u32regex("^#$");
tRegex rulere = boost::make_u32regex("^([!-+^])([^\\t]+)\\t+([^\\t]*)$");
I could rewrite these regexes one by one, but there are quite a lot more than the examples above, so my question is with regard to
how to convert C++ boost regexes to Python and
what is the difference between boost regexes and python re regexes?
Is the C++ boost::u32regex the same as re regexes in python? If not, what is the difference? (Links to the docs would be much appreciated =) ) For instance:
in boost, there's boost::u32regex_match, is that the same as
re.match?
in boost, there's boost::u32regex_search, how is it different to re.search
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?
How to convert C++ boost regexes to Python
In case of a simple regex, like \w+\s+\d+, or >.*$ you won't have to change the pattern. In case of more complex patterns with constructs mentioned below, you will most probably have to re-write a regex. As with any conversion from one flavor/language to another, the general answer is DON'T. However, Python and Boost do have some similarities, especially when it comes to simple patterns (if Boost is using PCRE-like pattern) containing a dot (a.*b), regular ([\w-]*) and negated ([^>]*) character classes, regular quantifiers like +/*/?, and suchlike.
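For the simple patterns from the question, a minimal sketch of the conversion could look like the following. The variable names and the sample input lines are made up for illustration; only the patterns come from the question.

```python
import re

# In C++ source the backslashes are doubled ("\\d"); a Python raw string
# needs only one. The patterns themselves are unchanged.
version_re = re.compile(r"^#\$Date: (.*) \$$")
group_start_re = re.compile(r"^#(\d+)$")
rule_re = re.compile(r"^([!-+^])([^\t]+)\t+([^\t]*)$")

version = version_re.match("#$Date: 2015-06-01 $").group(1)
group_id = group_start_re.match("#12").group(1)
rule_parts = rule_re.match("!foo\tbar").groups()
print(version, group_id, rule_parts)
```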
what is the difference between boost regexes and python re regexes?
The Python re module is not as rich as Boost regexps: suffice it to mention constructs such as \h, \G, \K, \R, \X, \Q...\E, branch reset, recursion, possessive quantifiers, POSIX character classes and character properties, and the extended replacement pattern, all of which Boost has. In Python versions before 3.6, an inline modifier applies to the whole expression, not to a part of it, so you should be aware that (?i) in your &|&#((?i)x26);|& will be treated as if it were at the beginning of the pattern (however, it does not have any impact on this expression).
Also, same as in Boost, you do not have to escape [ inside a character class, and { outside the character class.
The backreferences like \1 are the same as in Python.
Since you are not using capturing groups in alternation in your patterns (e.g. re.sub(r'\d(\w)|(go\w*)', r'\2', 'goon')), there should be no problem (in such cases, Python does not fill in the non-participating group with any value, and returns an empty result).
Note the difference in named group definition: (?<NAME>expression)/(?'NAME'expression) in Boost, and (?P<NAME>expression) in Python.
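To illustrate the named group syntax on the Python side, here is a small sketch. The pattern and the input string are invented for the example.

```python
import re

# Boost would write (?<key>\w+)=(?<value>\d+); Python's re uses the (?P<name>...) form.
pair_re = re.compile(r"(?P<key>\w+)=(?P<value>\d+)")
m = pair_re.match("width=80")
print(m.group("key"), m.group("value"))
print(m.groupdict())
```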
I see your regexps mainly fall under "simple" category. The most complex pattern is a tempered greedy token (e.g. ⌊-((?:(?!-⌋).)*)-⌋). To optimize them, you could use an unroll the loop technique, but it may not be necessary depending on the size of texts you handle with the expressions.
The most troublesome part, as I see it, is that you are using Unicode literals heavily. In Python 2.x, plain string literals are byte strings, and you always have to make sure you pass a unicode object to the Unicode regexps (see Python 2.x’s Unicode Support). In Python 3, all strings are Unicode by default, source files are read as UTF-8 by default, and you can use non-ASCII literal characters in source code without any additional actions (see Python’s Unicode Support). So, Python 3.3+, together with raw string literals for the patterns, is a good candidate.
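A short sketch of what this means in practice in Python 3 (the sample text is made up): str patterns are Unicode-aware by default, and re.ASCII restricts the shorthand classes again.

```python
import re

text = "naïve café"
unicode_words = re.findall(r"\w+", text)          # str patterns are Unicode-aware
ascii_words = re.findall(r"\w+", text, re.ASCII)  # restrict \w to [a-zA-Z0-9_]
print(unicode_words, ascii_words)
```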
Now, as for the remaining questions:
in boost, there's boost::u32regex_match, is that the same as re.match?
The re.match function is not the same as regex_match: re.match looks for a match only at the beginning of the string, whereas regex_match requires the full string to match. However, in Python 3.4+, you can use re.fullmatch(pattern, string, flags=0), which is equivalent to Boost regex_match.
in boost, there's boost::u32regex_search, how is it different to re.search
Whenever you need to find a match anywhere inside a string, you need to use re.search (see match() versus search()). Thus, this method provides functionality analogous to regex_search in Boost.
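The three functions can be compared side by side. This is only a sketch with an invented input line:

```python
import re

line = ">42 trailing"
start_only = re.match(r">(\d+)", line)     # anchored at the start only
whole = re.fullmatch(r">(\d+)", line)      # None: the whole string must match
anywhere = re.search(r"(\d+)", line)       # first match anywhere in the string
exact = re.fullmatch(r">(\d+)", ">42")     # succeeds, nothing is left over
print(start_only, whole, anywhere, exact)
```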
there's also boost::format_perl and boost::match_default and boost::smatch, what are their equivalence in python re?
Python does not support Perl-like expressions to the extent Boost can; the Python re module is just a "trimmed" Perl regex engine that lacks many of the nice features I mentioned earlier. Thus, no flags like default or perl can be found there. As for smatch, you can use re.finditer to get all the match objects. re.findall returns all matches (or only the submatches, if capturing groups are specified) as a list of strings or a list of tuples. See the re.findall/re.finditer difference.
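A small sketch of the findall/finditer difference, on an invented input string:

```python
import re

text = "a1 b2 c3"
pairs = re.findall(r"([a-z])(\d)", text)  # groups present: list of tuples
whole = re.findall(r"[a-z]\d", text)      # no groups: list of strings
# finditer yields match objects, so positions are available as well.
spans = [(m.group(0), m.span()) for m in re.finditer(r"[a-z]\d", text)]
print(pairs, whole, spans)
```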
In conclusion, a must-read article: Python’s re Module.
It appears that POSIX splits regular expression implementations into two kinds: Basic Regular Expressions (BRE) and Extended Regular Expressions (ERE).
Python re module reference does not seem to specify.
Except for some similarity in the syntax, the re module doesn't follow the POSIX standard for regular expressions.
Different matching semantics
POSIX regular expression (which can be implemented with a DFA/NFA or even a backtracking engine) always finds the leftmost longest match, while re module is a backtracking engine which finds the leftmost "earliest" match ("earliest" according to the search order defined by the regular expression).
The difference in the matching semantics can be observed in the case of matching (Prefix|PrefixSuffix) against PrefixSuffix.
In a POSIX-compliant implementation of POSIX regex (not one that only borrows the syntax), the regex will match PrefixSuffix.
In contrast, re engine (and many other backtracking regex engines) will match Prefix only, since Prefix is specified first in the alternation.
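The Prefix case is easy to reproduce in Python. The second line shows the usual workaround of listing the longer alternative first:

```python
import re

first = re.match(r"(?:Prefix|PrefixSuffix)", "PrefixSuffix").group(0)
longest_first = re.match(r"(?:PrefixSuffix|Prefix)", "PrefixSuffix").group(0)
print(first, longest_first)
```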
The difference can also be seen in the case of matching (xxx|xxxxx)* against xxxxxxxxxx (a string of 10 x's):
On Cygwin:
$ [[ "xxxxxxxxxx" =~ (xxx|xxxxx)* ]] && echo "${BASH_REMATCH[0]}"
xxxxxxxxxx
All 10 x's are matched.
In Python:
>>> re.search(r'(?:xxx|xxxxx)*', 'xxxxxxxxxx').group(0)
'xxxxxxxxx'
Only 9 x's are matched, since it picks the first item in the alternation, xxx, in all 3 repetitions, and nothing forces it to backtrack and try the second item in the alternation.
POSIX-exclusive regular expression features
Apart from the difference in matching semantics, POSIX regular expressions also define syntax for collating symbols, equivalence class expressions, and collation-based character ranges. These features greatly increase the expressive power of the regex.
Taking equivalence class expression as example, from the documentation:
An equivalence class expression shall represent the set of collating elements belonging to an equivalence class, as described in Collation Order. [...]. The class shall be expressed by enclosing any one of the collating elements in the equivalence class within bracket-equal ( "[=" and "=]" ) delimiters. For example, if 'a', 'à', and 'â' belong to the same equivalence class, then "[[=a=]b]", "[[=à=]b]", and "[[=â=]b]" are each equivalent to "[aàâb]". [...]
Since these features depend heavily on the locale settings, the same regex may behave differently in different locales. The behavior also depends on the system's locale data for the collation order.
re regular expression features
re borrows its syntax from Perl, but not all features of Perl regex are implemented in re. Below are some regex features available in re that are unavailable in POSIX regular expressions:
Greedy/lazy quantifier, which specifies the order to expand a quantifier.
While people usually call the * in POSIX greedy, it actually only specifies the lower bound and upper bound of the repetition in POSIX. The so-called "greedy" behavior is due to the leftmost longest match rule.
Look-around assertion (look-ahead and look-behind)
Conditional pattern (?(id/name)yes-pattern|no-pattern)
Short-hand constructs: \b, \s, \d, \w (some POSIX regular expression engine may implement these, since the standard leaves the behavior undefined for these cases)
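The features listed above can be tried out with a few short patterns. This is only a sketch; the patterns and the sample strings are invented for the illustration.

```python
import re

html = "<a><b>"
greedy = re.search(r"<.+>", html).group(0)   # expands as far as possible
lazy = re.search(r"<.+?>", html).group(0)    # expands as little as possible

# Look-ahead: words followed by a colon; look-behind: digits preceded by "$".
labels = re.findall(r"\w+(?=:)", "name: x size: y")
prices = re.findall(r"(?<=\$)\d+", "cost $42, qty 7")

# Conditional: require ">" at the end only if "<" matched at the start.
bracketed = re.compile(r"^(<)?\w+(?(1)>)$")
results = [bool(bracketed.match(s)) for s in ("<tag>", "tag", "<tag", "tag>")]
print(greedy, lazy, labels, prices, results)
```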
Neither. It's basically the PCRE dialect, but a distinct implementation.
The very first sentence in the re documentation says:
This module provides regular expression matching operations similar to those found in Perl.
While this does not immediately reveal to a newcomer how they are related to e.g. POSIX regular expressions, it should be common knowledge that Perl 4 and later Perl 5 provided a substantially expanded feature set over the regex features of earlier tools, including what POSIX mandated for grep -E aka ERE.
The perlre manual page describes the regular expression features in more detail, though you'll find much the same details in a different form in the Python documentation. The Perl manual page contains this bit of history:
The patterns used in Perl pattern matching evolved from those supplied in the Version 8 regex routines. (The routines are derived (distantly) from Henry Spencer's freely redistributable reimplementation of the V8 routines.)
(Here, V8 refers to Version 8 Unix. Spencer's library basically (re)implemented POSIX regular expressions.)
Perl 4 had a large number of convenience constructs like \d, \s, \w as well as symbolic shorthands like \t, \f, \n. Perl 5 added a significant set of extensions (which is still growing slowly) including, but not limited to,
Non-greedy quantifiers
Non-backtracking quantifiers
Unicode symbol and property support
Non-grouping parentheses
Lookaheads and lookbehinds
... Basically anything that starts with (?
As a result, the "regular" expressions are by no means strictly "regular" any longer.
This was reimplemented in a portable library by Philip Hazel, originally for the Exim mail server; his PCRE library has found its way into myriad different applications, including a number of programming languages (Ruby, PHP, Python, etc). Incidentally, in spite of the name, the library is not strictly "Perl compatible" (any longer); there are differences in features as well as in behavior. (For example, Perl internally changes * to something like {0,32767} while PCRE does something else.)
An earlier version of Python actually had a different regex implementation, and there are plans to change it again (though it will remain basically PCRE). This is the situation as of Python 2.7 / 3.5.
I am learning about scrapy. I am using scrapy 0.20 that is why I am following this tutorial. http://doc.scrapy.org/en/0.20/intro/tutorial.html
I understood the concepts. However, there is one thing I still don't get.
In this statement
sel.xpath('//title/text()').re('(\w+):')
the output is
[u'Computers', u'Programming', u'Languages', u'Python']
what is re('(\w+):') used for, please?
to help answering:
this statement
sel.xpath('//title/text()').extract()
has this output:
[u'Open Directory - Computers: Programming: Languages: Python: Books']
why is the comma , added between the elements?
Also, all the ':' are removed.
Moreover: is this pure Python syntax, please?
This is a regular expression (regex), and is a whole world unto itself.
(\w+): Will return any run of word characters that is immediately followed by a colon (but does not return the colon)
Here is an example of how it works with the ":" getting removed
(\w+:) Will return any run of word characters that is immediately followed by a colon (and will also return the colon)
Here is an example of how it works with the ":" staying in
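The two variants can be compared directly with the standard library. This sketch assumes that Scrapy's .re() returns the captured groups in the same way re.findall does; the title string is the one from the question.

```python
import re

title = "Open Directory - Computers: Programming: Languages: Python: Books"

without_colon = re.findall(r"(\w+):", title)  # colon outside the group
with_colon = re.findall(r"(\w+:)", title)     # colon inside the group
print(without_colon)
print(with_colon)
```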
Also, if you want to learn about regex, Codecademy has a good python course
(\w+):
is a regular expression that matches any word followed by : and captures the word characters ([a-zA-Z0-9_]) in a group.
The output does not have :, because this method returns all the captured groups.
The results are returned as a Python list. When a list is represented as a string, the elements are separated by ,.
\w is shorthand for [a-zA-Z0-9_]
Quoting from Python Regular Expressions Page,
\w
When the LOCALE and UNICODE flags are not specified, matches any
alphanumeric character and the underscore; this is equivalent to the
set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
whatever characters are defined as alphanumeric for the current
locale. If UNICODE is set, this will match the characters [0-9_] plus
whatever is classified as alphanumeric in the Unicode character
properties database.
Suppose I want to match a lowercase letter followed by an uppercase letter, I could do something like
re.compile(r"[a-z][A-Z]")
Now I want to do the same thing for unicode strings, i.e. match something like 'aÅ' or 'yÜ'.
Tried
re.compile(r"[a-z][A-Z]", re.UNICODE)
but that does not work.
Any clues?
This is hard to do with Python regex because the current implementation doesn't support Unicode property shortcuts like \p{Lu} and \p{Ll}.
[A-Za-z] will of course only match ASCII letters, regardless of whether the Unicode option is set or not.
So until the re module is updated (or you install the regex package currently in development), you either need to do it programmatically (iterate through the string and do char.islower()/char.isupper() on the characters), or specify all the unicode code points manually which probably isn't worth the effort...
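A minimal sketch of the programmatic approach, relying on str.islower() and str.isupper() in Python 3. The function name is made up for the example.

```python
def lower_upper_pairs(text):
    """Return every lowercase letter immediately followed by an uppercase one."""
    return [a + b for a, b in zip(text, text[1:]) if a.islower() and b.isupper()]

pairs = lower_upper_pairs("aÅ yÜ ab AB")
print(pairs)
```

With the third-party regex package installed, a pattern along the lines of \p{Ll}\p{Lu} should do the same job in a single expression.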
Goal: Given a number (it may be very long and it is greater than 0), I'd like to get the five least significant digits, dropping any 0 at the end of that number.
I tried to solve this with regex. Helped by RegexBuddy, I came up with this one:
[\d]+([\d]{0,4}+[1-9])0*
But python can't compile that.
>>> import re
>>> re.compile(r"[\d]+([\d]{0,4}+[1-9])0*")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.5/re.py", line 188, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.5/re.py", line 241, in _compile
    raise error, v # invalid expression
sre_constants.error: multiple repeat
The problem is the "+" after "{0,4}", it seems it doesn't work in python (even in 2.6)
How can I write a working regex?
PS:
I know you can start dividing by 10 and then using the remainder n%100000... but this is a problem about regex.
That regular expression is needlessly complicated. Try this:
>>> import re
>>> re.compile(r"(\d{0,4}[1-9])0*$")
The above regular expression assumes that the number is valid (it will also match "abc0123450", for example.) If you really need the validation that there are no non-number characters, you may use this:
>>> import re
>>> re.compile(r"^\d*?(\d{0,4}[1-9])0*$")
Anyway, the \d does not need to be in a character class, and the quantifier {0,4} does not need the additional + (in flavors that support it, that + makes the quantifier possessive, but Python apparently does not recognize it).
Also, in the second regular expression, the leading \d*? is non-greedy, as I believe this will improve the performance and accuracy. I also made it "zero or more", as I assume that is what you want.
I also added anchors as this ensures that your regular expression won't match anything in the middle of a string. If this is what you desired though (maybe you're scanning a long text?), remove the anchors.
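Here is a quick sketch of both expressions in use. The helper name last_five and the sample inputs are made up for the example.

```python
import re

tail_re = re.compile(r"(\d{0,4}[1-9])0*$")
strict_re = re.compile(r"^\d*?(\d{0,4}[1-9])0*$")

def last_five(number_text, pattern=tail_re):
    """Return up to five trailing digits, ignoring trailing zeros, or None."""
    m = pattern.search(number_text)
    return m.group(1) if m else None

print(last_five("123456000"))              # digits before the trailing zeros
print(last_five("abc0123450"))             # unanchored: still finds digits
print(last_five("abc0123450", strict_re))  # anchored: rejects the letters
```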
\d{0,4}+ is a possessive quantifier supported by certain regular expression flavors such as .NET and Java. Python's re module does not support possessive quantifiers (support was only added in Python 3.11).
In RegexBuddy, select Python in the toolbar at the top, and RegexBuddy will tell you that Python doesn't support possessive quantifiers. The + will be highlighted in red in the regular expression, and the Create tab will indicate the error.
If you select Python on the Use tab in RegexBuddy, RegexBuddy will generate a Python source code snippet with a regular expression without the possessive quantifier, and a comment indicating that the removal of the possessive quantifier may yield different results. Here's the Python code that RegexBuddy generates using the regex from the question:
# Your regular expression could not be converted to the flavor required by this language:
# Python does not support possessive quantifiers
# Because of this, the code snippet below will not work as you intended, if at all.
reobj = re.compile(r"[\d]+([\d]{0,4}[1-9])0*")
What you probably did is select a flavor such as Java in the main toolbar, and then click Copy Regex as Python String. That will give you a Java regular expression formatted as a Python string. The items in the Copy menu do not convert your regular expression. They merely format it as a string. This allows you to do things like format a JavaScript regular expression as a Python string so your server-side Python script can feed a regex into client-side JavaScript code.
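If the possessive behavior itself is wanted on a Python version whose re module rejects {0,4}+, one commonly used workaround is to capture inside a look-ahead and then consume the capture with a backreference. This is a sketch of that swapped-in technique, not something RegexBuddy generates; the variable names are made up.

```python
import re

# Greedy: \d{0,4} can give back a digit so that [1-9] still matches.
greedy = re.compile(r"\d{0,4}[1-9]")
# Emulated possessive: the look-ahead captures up to four digits, \1 consumes
# them, and the engine does not backtrack into the look-ahead afterwards.
possessive_like = re.compile(r"(?=(\d{0,4}))\1[1-9]")

greedy_short = greedy.fullmatch("1234")               # matches via backtracking
possessive_short = possessive_like.fullmatch("1234")  # fails, nothing is given back
possessive_long = possessive_like.fullmatch("12345")  # fifth digit is available
print(greedy_short, possessive_short, possessive_long)
```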
Small tip. I recommend you test with reTest instead of RegexBuddy. There are different regular expression engines for different programming languages. ReTest is valuable in that it allows you to quickly test regular expression strings within Python itself. That way you can ensure that you tested your syntax with Python's regular expression engine.
The error seems to be that you have two quantifiers in a row, {0,4} and +. Unless + is meant to be a literal here (which I doubt, since you're talking about numbers), then I don't think you need it at all. Unless it means something different in this situation (possibly the greediness of the {} quantifier)? I would try
[\d]+([\d]{0,4}[1-9])0*
If you actually intended to have both quantifiers to be applied, then this might work
[\d]+(([\d]{0,4})+[1-9])0*
But given your specification of the problem, I doubt that's what you want.
This is my solution.
re.search(r'[1-9]\d{0,3}[1-9](?=0*(?:\b|\s|[A-Za-z]))', '02324560001230045980a').group(0)
'4598'
[1-9] - the number must start with a digit from 1 to 9
\d{0,3} - 0 to 3 digits
[1-9] - the number must end with a digit from 1 to 9
(?=0*(?:\b|\s|[A-Za-z])) - the rest of the string must consist of zeros followed by \b, \s, or [A-Za-z]