How can I convert a Perl regex with named groups to Python?

How can I convert a Perl regex with named groups to Python? - python

I am trying to convert the following Perl regex I found in the Video::Filename Perl module to a Python 2.5.4 regex to parse a filename
# Perl > v5.10
re => '^(?:(?<name>.*?)[\/\s._-]*)?(?<openb>\[)?(?<season>\d{1,2})[x\/](?<episode>\d{1,2})(?:-(?:\k<season>x)?(?<endep>\d{1,2}))?(?(<openb>)\])(?:[\s._-]*(?<epname>[^\/]+?))?$',
I would like to use named groups too, and I know in Python the regex extension for named groups is different, but I am not 100% sure on the syntax.
This is what I tried:
# Python (not working)
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:\kP<season>x)?(?P<endep>\d{1,2}))?(?(P<openb>)\])(?:[\s._-]*(?P<epname>[^\/]+?))?$')
The error I get:
raise error, v # invalid expression
sre_constants.error: bad character in group name
For example, this one I managed to convert and it works. But the one above I can't seem to get right. I get a compilation error in Python.
# Perl:
re => '^(?:(?<name>.*?)[\/\s._-]+)?(?:s|se|season|series)[\s._-]?(?<season>\d{1,2})[x\/\s._-]*(?:e|ep|episode|[\/\s._-]+)[\s._-]?(?<episode>\d{1,2})(?:-?(?:(?:e|ep)[\s._]*)?(?<endep>\d{1,2}))?(?:[\s._]?(?:p|part)[\s._]?(?<part>\d+))?(?<subep>[a-z])?(?:[\/\s._-]*(?<epname>[^\/]+?))?$',
# Python (working):
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]+)?(?:s|se|season|series)[\s._-]?(?P<season>\d{1,2})[x\/\s._-]*(?:e|ep|episode|[\/\s._-]+)[\s._-]?(?P<episode>\d{1,2})(?:-?(?:(?:e|ep)[\s._]*)?(?P<endep>\d{1,2}))?(?:[\s._]?(?:p|part)[\s._]?(?P<part>\d+))?(?P<subep>[a-z])?(?:[\/\s._-]*(?P<epname>[^\/]+?))?$')
I am not sure where to start looking.

There are 2 problems with your translation. First of all, the second mention of openb has extra parenthesis around it making it a conditional expression, not a named expression.
Next is that you didn't translate the \k<season> backreference, Python uses (P=season) to match the same. The following compiles for me:
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:(?P=season)x)?(?P<endep>\d{1,2}))?(?(openb)\])(?:[\s._-]*(?P<epname>[^\/]+?))?$')
If I were you, I'd use re.VERBOSE to split this expression over multiple lines and add copious documentation so you can keep understanding the expression in the future if this is something that needs to remain maintainable though.
(edited after realising the second openb reference was a conditional expression, and to properly translate the backreference).

I found the offending part but can't figure out what exactly is wrong without wrapping my mind around the whole thing.
r = re.compile(r'^(?:(?P<name>.*?)[\/\s._-]*)?(?P<openb>\[)?(?P<season>\d{1,2})[x\/](?P<episode>\d{1,2})(?:-(?:\kP<season>x)?(?P<endep>\d{1,2}))?
(?(P<openb>)\]) // this part here causes the error message
(?:[\s._-]*(?P<epname>[^\/]+?))?$')
The problem seems to be with the fact that group names in python must be valid python identifiers (check documentation). The parentheses seem to be the problem. Removing them gives
(?(P<openb>)\]) //with parentheses
(?P<openb>\]) //without parentheses
redefinition of group name 'openb' as group 6; was group 2

Those regexps are the product of a sick an twisted mind... :-)
Anyway, (?()) are conditions in both Python and Perl, and the perl syntax above looks like it should be the same as the Python syntax, i.e., it evaluates as true of the group named exists.
Where to start looking? The documentation for the modules are here:
http://docs.python.org/library/re.html
http://www.perl.com/doc/manual/html/pod/perlre.html

I may be wrong but you tried to get the backreference using :
(?:\k<season>x)
Isn't the syntax \g<name> in Python ?

Related

Python use regex match object in if statement and then access capture groups like Perl

Is it possible to do something like the following Perl code in Python? From what I can tell the answer is no, but I figured I'd double check.
The Perl code I want to replicate in Python:
#!/usr/bin/perl
my $line = "hello1234world";
if($line=~/hello(.*)world/) {
print($1);
}
#prints 1234
The following is the closest stylistically I can think off, but when running I (obviously) get the following error:
import re
line = "hello1234world"
if matchObj = re.match(r'hello(.*)world',line):
print(matchObj.group(1))
#error: if matchObj = re.match(r'hello(.*)world',line):
#error: ^
#error: SyntaxError: invalid syntax
The following is the best working code I can come up with:
import re
line = "hello1234world"
matchObj = re.match(r'hello(.*)world',line)
if matchObj:
print(matchObj.group(1))
#prints 1234
I'd really like to avoid a separate line for the variable declaration and if statement if possible.

Can just print the (assumed) capture and use exceptions to handle the case when group method is called on None, returned when there isn't a match. If there is indeed nothing to do when the match fails it's one line via With Statement Context Manager (3.4+)
from contextlib import suppress
with suppress(Exception):
print( re.match(r'hello(.*)world', line).group(1) )
To avoid ignoring exceptions that almost surely shouldn't be ignored here, like SystemExit and KeyboardInterrupt, use
with suppress(BaseException):
...
This is now rather compact, as asked for, and it behaves as desired. Using exceptions merely to shorten code could be considered misguided though, but perhaps there'd be further uses of this.
As mentioned in comments, since 3.8 there is the Assignment Expression
if match := re.match(r'hello(.*)world', line):
print( match.group(1) )
which matches almost directly the motivating semantics. However, this newer feature has drawn some sensitive discussions, while using it to merely shorten code may confuse and mislead as it differs from an expected pythonic approach.
I'd like to add, that I'd suggest to not worry about a few extra lines of code, and specially to avoid emulating styles and program flow from other languages. There is a huge value in using styles and idioms native to the language at hand.

How to make regular expression with negative value or non negative value?

I am focusing on data with a regular expression. My data have this template:
Timestamp 1549033386 ID=02141592cc0000000700000000000000 Dest_ID=02141592cc00000007ffffffb0ba2c53 Nbr_packet_not_acK_ti9-ti5 -91
I am using python and I implement this regular expression:
'Nbr_packet_not_acK_ti9-ti5': r'\bTimestamp\s+([0-9]+)\s+ID=(\w{32})0*\s+Dest_ID=(\w{32})0*\sNbr_packet_not_acK_ti9-ti5\s+([0-9]+)',
But It does not work correctly, the problem is with the negative values that I have.
I have another example that works correctly:
Timestamp 1549033599 ID=02141592cc0000000600000000000000 Dest_ID=00000000000000000000000000000000Delay_T2R2 -1
\bTimestamp\s+([0-9]+)\s+ID=(\w{32})0*\s+Dest_ID=(\w{32})0*Delay_T2R2\s+(-?[0-9]+)

If I try this, it matches 3 groups:
1549033386
02141592cc0000000700000000000000
02141592cc00000007ffffffb0ba2c53
but the whole regular expression doesn't match because of the trailing ([0-9]+) which, as you correctly note, doesn't match the negative number. Fixing the regular expression either this way:
\bTimestamp\s+([0-9]+)\s+ID=(\w{32})0*\s+Dest_ID=(\w{32})0*\sNbr_packet_not_acK_ti9-ti5\s+([-0-9]+)
or this way, as suggested by Engineero:
\bTimestamp\s+([0-9]+)\s+ID=(\w{32})0*\s+Dest_ID=(\w{32})0*\sNbr_packet_not_acK_ti9-ti5\s+(-?[0-9]+)
gives me a full match on all 4 capturing groups.
1549033386
02141592cc0000000700000000000000
02141592cc00000007ffffffb0ba2c53
-91
So I conclude that either fix does in fact work, and the failure to match that you report is caused by a confounding error.
To demonstrate that it must be a confounding error, try this at the interpreter prompt, to eliminate such errors:
>>> exp = r"\bTimestamp\s+([0-9]+)\s+ID=(\w{32})0*\s+Dest_ID=(\w{32})0*\sNbr_packet_not_acK_ti9-ti5\s+(-?[0-9]+)"
>>> rx = re.compile(exp)
>>> m=rx.match("Timestamp 1549033386 ID=02141592cc0000000700000000000000 Dest_ID=02141592cc00000007ffffffb0ba2c53 Nbr_packet_not_acK_ti9-ti5 -91")
>>> m.groups()
('1549033386', '02141592cc0000000700000000000000', '02141592cc00000007ffffffb0ba2c53', '-91')
I tried this in Python 2.5, 2.7, 3.6 and 3.7. I don't have 3.5 anymore, but if there had been a bug of this seriousness in 3.5 I'm pretty sure I would have heard about it.
So it's not the version, and it's not the regular expression itself. That leaves the data, which might not look quite the way it looks in your question, or the code that surrounds the check.

Regex named conditional lookahead (in Python)

I'm hoping to match the beginning of a string differently based on whether a certain block of characters is present later in the string. A very simplified version of this is:
re.search("""^(?(pie)a|b)c.*(?P<pie>asda)$""", 'acaaasda')
Where, if <pie> is matched, I want to see a at the beginning of the string, and if it isn't then I'd rather see b.
I'd use normal numerical lookahead but there's no guarantee how many groups will or won't be matched between these two.
I'm currently getting error: unknown group name. The sinking feeling in my gut tells me that this is because what I want is impossible (look-ahead to named groups isn't exactly a feature of a regular language parser), but I really really really want this to work -- the alternative is scrapping 4 or 5 hours' worth of regex writing and redoing it all tomorrow as a recursive descent parser or something.
Thanks in advance for any help.

Unfortunately, I don't think there is a way to do what you want to do with named groups. If you don't mind duplication too much, you could duplicate the shared conditions and OR the expressions together:
^(ac.*asda|bc.*)$
If it is a complicated expression you could always use string formatting to share it (rather than copy-pasting the shared part):
common_regex = "c.*"
final_regex = "^(a{common}asda|b{common})$".format(common=common_regex)

You can use something like that:
^(?:a(?=c.*(?P<pie>asda)$)|b)c.*$
or without .*$ if you don't need it.

Extracting unique syscall names from strace output (via regex?)

I have a file produced by strace which contains all the system calls. Now I want to get the name of all system calls. Therefore, say if I have mprotect listed 4 times, I only need to list it 1 time, that is I only need to list unique system calls.
One method that comes to mind is to use regular expressions using python or any other language that supports parsing regular expression to first see all system calls and then eliminate the duplicates. For that purpose, I was first trying to test my regular expression using the search feature of notepad++. I want to match anything like this, blah(. For that purpose I devised the following regular expression
[a-zA-Z_](
but notepad found nothing. What do you think is the correct regular expression for this?

Why do you think you need regular expressions for this? The output of strace is a sequence of lines, each starting with
<c_identifier>(
and C identifiers can't contain (, so you can just take the part up to the ( to get the name of the system calls. In Python, this one-liner computes the set of distinct system calls:
syscalls = set(ln.split('(', 1)[0] for ln in strace_output)
(You can do this in one line of Awk as well, if you rather work in the shell than in Python.)

Notepad++ should have told you invalid regular expression. The latest version does.
In regular expressions, parentheses have special meaning, so you have to escape them:
[a-zA-Z_]\(
will find h( in blah(, since the part in the brackets isn't quantified (as #CharlesDuffy pointed out).
To match the entire blah(, use
[a-zA-Z_]+\(

It should be [a-zA-Z_]+\( instead. This is because round brackets are used as meta characters.

regular expression help with converting exp1^exp2 to pow(exp1, exp2)

I am converting some matlab code to C, currently I have some lines that have powers using the ^, which is rather easy to do with something along the lines \(?(\w*)\)?\^\(?(\w*)\)?
works fine for converting (glambda)^(galpha),using the sub routine in python pattern.sub(pow(\g<1>,\g<2>),'(glambda)^(galpha)')
My problem comes with nested parenthesis
So I have a string like:
glambdastar^(1-(1-gphi)*galpha)*(glambdaq)^(-(1-gphi)*galpha);
And I can not figure out how to convert that line to:
pow(glambdastar,(1-(1-gphi)*galpha))*pow(glambdaq,-(1-gphi)*galpha));

Unfortunately, regular expressions aren't the right tool for handling nested structures. There are some regular expressions engines (such as .NET) which have some support for recursion, but most — including the Python engine — do not, and can only handle as many levels of nesting as you build into the expression (which gets ugly fast).
What you really need for this is a simple parser. For example, iterate over the string counting parentheses and storing their locations in a list. When you find a ^ character, put the most recently closed parenthesis group into a "left" variable, then watch the group formed by the next opening parenthesis. When it closes, use it as the "right" value and print the pow(left, right) expression.

I think you can use recursion here.
Once you figure out the Left and Right parts, pass each of those to your function again.
The base case would be that no ^ operator is found, so you will not need to add the pow() function to your result string.
The function will return a string with all the correct pow()'s in place.
I'll come up with an example of this if you want.

Nested parenthesis cannot be described by a regexp and require a full parser (able to understand a grammar, which is something more powerful than a regexp). I do not think there is a solution.

See recent discussion function-parser-with-regex-in-python (one of many similar discussions). Then follow the suggestion to pyparsing.

An alternative would be to iterate until all ^ have been exhausted. no?.
Ruby code:
# assuming str contains the string of data with the expressions you wish to convert
while str.include?('^')
str!.gsub!(/(\w+)\^(\w+)/, 'pow(\1,\2)')
end

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I convert a Perl regex with named groups to Python? - python

I may be wrong but you tried to get the backreference using : (?:\k<season>x) Isn't the syntax \g<name> in Python ?

Related

Python use regex match object in if statement and then access capture groups like Perl

How to make regular expression with negative value or non negative value?

Regex named conditional lookahead (in Python)

Extracting unique syscall names from strace output (via regex?)

regular expression help with converting exp1^exp2 to pow(exp1, exp2)

Categories

Resources