Extract named group regex pattern from a compiled regex in Python

I have a regex in Python that contains several named groups. However, text that would match one group can be missed if a previous group has already matched it, because overlaps don't seem to be allowed. As an example:
import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegex = re.compile('(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))')
x = re.findall(myRegex,myText)
print(x)
Produces the output:
[('AAA', '')]
The 'long' group does not find a match because 'AAA' was used up in finding a match for the preceding 'short' group.
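For illustration, swapping the two alternatives shows that the order of the alternation decides which group wins (a quick sketch, not a solution I can use):
import re
myText = 'sgasgAAAaoasgosaegnsBBBausgisego'
swappedRegex = re.compile('(?P<long>(?:AAA.*BBB))|(?P<short>(?:AAA))')
print(re.findall(swappedRegex, myText))
# [('AAAaoasgosaegnsBBB', '')]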
I've tried to find a method to allow overlapping but failed. As an alternative, I've been looking for a way to run each named group separately. Something like the following:
for g in myRegex.groupindex.keys():
    match = re.findall(***regex_for_named_group_g***, myText)
Is it possible to extract the regex for each named group?
Ultimately, I'd like to produce a dictionary output (or similar) like:
{'short': 'AAA',
 'long': 'AAAaoasgosaegnsBBB'}
Any and all suggestions would be gratefully received.

There really doesn't appear to be a nicer way to do this, but here's another approach, along the lines of this other answer but somewhat simpler. It will work provided that a) your patterns are always formed as a series of named groups separated by pipes, and b) the named group patterns never contain named groups themselves.
The following would be my approach if you're interested in all matches of each pattern. The argument to re.split looks for a literal pipe followed (via a lookahead) by (?P<, the beginning of a named group. It compiles each subpattern and uses the groupindex attribute to extract the name.
import re

def nameToMatches(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        rx = re.compile(subpattern)
        name = list(rx.groupindex)[0]
        result[name] = rx.findall(string)
    return result
With your given text and pattern, returns {'long': ['AAAaoasgosaegnsBBB'], 'short': ['AAA']}. Patterns that don't match at all will have an empty list for their value.
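For reference, calling it with the question's pattern and text (a quick sketch reusing the names from above):
pattern = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
text = 'sgasgAAAaoasgosaegnsBBBausgisego'
print(nameToMatches(pattern, text))
# {'short': ['AAA'], 'long': ['AAAaoasgosaegnsBBB']}  (key order may vary by Python version)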
If you only want one match per pattern, you can make it a bit simpler still:
def nameToMatch(pattern, string):
    result = dict()
    for subpattern in re.split(r'\|(?=\(\?P<)', pattern):
        match = re.search(subpattern, string)
        if match:
            result.update(match.groupdict())
    return result
This gives {'long': 'AAAaoasgosaegnsBBB', 'short': 'AAA'} for your givens. If one of the named groups doesn't match at all, it will be absent from the dict.

There didn't seem to be an obvious answer, so here's a hack. It needs a bit of finessing but basically it splits the original regex into its component parts and runs each group regex separately on the original text.
import re
myTextStr = 'sgasgAAAaoasgosaegnsBBBausgisego'
myRegexStr = '(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))'
myRegex = re.compile(myRegexStr) # This is actually no longer needed
print("Full regex with multiple groups")
print(myRegexStr)
# Use a regex to split the original regex into separate regexes
# based on group names
mySplitGroupsRegexStr = r'\(\?P<(\w+)>(\([\w\W]+?\))\)(?:\||\Z)'
mySplitGroupsRegex = re.compile(mySplitGroupsRegexStr)
mySepRegexesList = re.findall(mySplitGroupsRegex,myRegexStr)
print("\nList of separate regexes")
print(mySepRegexesList)
# Convert separate regexes to a dict with group name as key
# and regex as value
mySepRegexDict = {reg[0]:reg[1] for reg in mySepRegexesList}
print("\nDictionary of separate regexes with group names as keys")
print(mySepRegexDict)
# Step through each key and run the group regex on the original text.
# Results are stored in a dictionary with group name as key and
# extracted text as value.
myGroupRegexOutput = {}
for g, r in mySepRegexDict.items():
    m = re.findall(re.compile(r), myTextStr)
    myGroupRegexOutput[g] = m[0]
print("\nOutput of overlapping named group regexes")
print(myGroupRegexOutput)
The resulting output is:
Full regex with multiple groups
(?P<short>(?:AAA))|(?P<long>(?:AAA.*BBB))
List of separate regexes
[('short', '(?:AAA)'), ('long', '(?:AAA.*BBB)')]
Dictionary of separate regexes with group names as keys
{'short': '(?:AAA)', 'long': '(?:AAA.*BBB)'}
Output of overlapping named group regexes
{'short': 'AAA', 'long': 'AAAaoasgosaegnsBBB'}
This might be useful to someone somewhere.

Related

Python split before a certain character

I have the following string:
BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6
I am trying to split it in a way I would get back the following dict / other data structure:
BUCKET1 -> /dir1/dir2/, BUCKET1 -> /dir3/dir4/, BUCKET2 -> /dir5/dir6/
I can somehow split it if I only have one BUCKET, not multiple, like this:
res.split(res.split(':', 1)[0].replace('.', '').upper()) -> it's not perfect
Input: ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/
Output: [(ADRIAN, /dir1/dir11), (DANIEL, /dir2/), (CULEA, /dir3/), (ADRIAN, /dir5/), (ADRIAN, /dir6/)]
As per Wiktor Stribiżew's comments, the following regex does the job:
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
I'd recommend learning regex, as the others have suggested. However, if you're looking for an alternative, here's a way of doing this without regex. It also produces the output you're looking for.
string = input("Enter:") #Put your own input here.
tempList = string.replace("BUCKET",':').split(":")
outputList = []
for i in range(1, len(tempList)-1, 2):
    someTuple = ("BUCKET"+tempList[i], tempList[i+1])
    outputList.append(someTuple)
print(outputList) #Put your own output here.
This will produce:
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
This code is hopefully easier to understand and manipulate if you're unfamiliar with Regex, although I'd still personally recommend Regex to solve this if you're familiar with how to use it.
Use the re.findall() function:
import re
s = "ADRIAN:/dir1/dir11/DANIEL:/dir2/ADI_BUCKET:/dir3/CULEA:/dir4/ADRIAN:/dir5/ADRIAN:/dir6/"
result = re.findall(r'(\w+):([^:]+\/)', s)
print(result)
The output:
[('ADRIAN', '/dir1/dir11/'), ('DANIEL', '/dir2/'), ('ADI_BUCKET', '/dir3/'), ('CULEA', '/dir4/'), ('ADRIAN', '/dir5/'), ('ADRIAN', '/dir6/')]
Use regex instead?
import re
test = 'BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6'
output = re.findall(r'(?P<bucket>[A-Z0-9]+):(?P<path>[/a-z0-9]+)', test)
print(output)
Which gives
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
It appears you have a list of predefined "buckets" that you want to use as boundaries for the records inside the string.
That means, the easiest way to match these key-value pairs is by matching one of the buckets, then a colon and then any chars not starting a sequence of chars equal to those bucket names.
You may use
r"(BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)"
Compile with re.S / re.DOTALL if your values span across multiple lines. See the regex demo.
Details:
(BUCKET1|BUCKET2) - capture group one that matches and stores in .group(1) any of the bucket names
: - a colon
(.*?) - any 0+ chars, as few as possible (as *? is a lazy quantifier), up to the first occurrence of (but not including)...
(?=(?:BUCKET1|BUCKET2)|$) - any of the bucket names or end of string.
Build it dynamically while escaping bucket names (just to play it safe in case those names contain * or + or other special chars):
import re
buckets = ['BUCKET1','BUCKET2']
rx = r"({0}):(.*?)(?=(?:{0})|$)".format("|".join([re.escape(bucket) for bucket in buckets]))
print(rx)
s = "BUCKET1:/dir1/dir2/BUCKET1:/dir3/dir4/BUCKET2:/dir5/dir6"
print(re.findall(rx, s))
# => (BUCKET1|BUCKET2):(.*?)(?=(?:BUCKET1|BUCKET2)|$)
[('BUCKET1', '/dir1/dir2/'), ('BUCKET1', '/dir3/dir4/'), ('BUCKET2', '/dir5/dir6')]
See the online Python demo.

python: regex - catch variable number of groups

I have a string that looks like:
TABLE_ENTRY.0[hex_number]= <FIELD_1=hex_number, FIELD_2=hex_number..FIELD_X=hex>
TABLE_ENTRY.1[hex_number]= <FIELD_1=hex_number, FIELD_2=hex_number..FIELD_Y=hex>
The number of fields is unknown and varies from entry to entry; I want to capture each entry separately with all of its fields and their values.
I came up with:
([A-Z_0-9\.]+\[0x[0-9]+\]=)(0x[0-9]+|0):\s+<(([A-Z_0-9]+)=(0x[0-9]+|0))
which matches the table entry and the first field, but I don't know how to account for a variable number of fields.
for input:
ENTRY_0[0x130]=0: <FIELD_0=0, FIELD_1=0x140... FIELD_2=0xff3>
output should be:
ENTRY 0:
FIELD_0=0
FIELD_1=0x140
FIELD_2=ff3
ENTRY 1:
...
In short, you can't do all of this in the re engine alone. You cannot generate more groups dynamically; repeated matches all end up in one group. You should re-parse the results like so:
import re
input_str = ("TABLE_ENTRY.0[0x1234]= <FIELD_1=0x1234, FIELD_2=0x1234, FIELD_3=0x1234>\n"
"TABLE_ENTRY.1[0x1235]= <FIELD_1=0x1235, FIELD_2=0x1235, FIELD_3=0x1235>")
results = {}
for match in re.finditer(r"([A-Z_0-9\.]+\[0x[0-9A-F]+\])=\s+<([^>]*)>", input_str):
    fields = match.group(2).split(", ")
    results[match.group(1)] = dict(f.split("=") for f in fields)
>>> results
{'TABLE_ENTRY.0[0x1234]': {'FIELD_2': '0x1234', 'FIELD_1': '0x1234', 'FIELD_3': '0x1234'}, 'TABLE_ENTRY.1[0x1235]': {'FIELD_2': '0x1235', 'FIELD_1': '0x1235', 'FIELD_3': '0x1235'}}
The output will just be a large dict mapping each table entry to a dict of its fields.
It's also rather convenient, as you may do this:
>>> results["TABLE_ENTRY.0[0x1234]"]["FIELD_2"]
'0x1234'
I personally suggest stripping off "TABLE_ENTRY" as it's repetitive, but as you wish.
Use a repeated capture group to match a variable number of fields:
([A-Z_0-9\.]+\[0x[0-9]+\]=)\s+<(([A-Z_0-9]+)=(0x[0-9]+|0),\s?)*([A-Z_0-9]+)=(0x[0-9]+|0)
The following part matches any number of fields, each with a trailing comma and optional whitespace:
(([A-Z_0-9]+)=(0x[0-9]+|0),\s?)*
And ([A-Z_0-9]+)=(0x[0-9]+|0) will match the last field.
Demo: https://regex101.com/r/gP3oO6/1
Note: if you don't need some of the groups, it's better to make them non-capturing by adding ?: right after the opening parenthesis, i.e. (?: ...). Also note that (0x[0-9]+|0):\s+ appears as an extra piece in your regex (based on your input pattern).
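As a quick sanity check (a sketch of my own, reusing the first sample entry from the answer above), the repeated group indeed keeps only its last repetition, which is why the other answer re-parses the field list:
import re
entry = "TABLE_ENTRY.0[0x1234]= <FIELD_1=0x1234, FIELD_2=0x1234, FIELD_3=0x1234>"
rx = r"([A-Z_0-9\.]+\[0x[0-9]+\]=)\s+<(([A-Z_0-9]+)=(0x[0-9]+|0),\s?)*([A-Z_0-9]+)=(0x[0-9]+|0)"
print(re.search(rx, entry).groups())
# ('TABLE_ENTRY.0[0x1234]=', 'FIELD_2=0x1234, ', 'FIELD_2', '0x1234', 'FIELD_3', '0x1234')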

Python regex similar expressions

I have a file with two different types of data I'd like to parse with a regex; however, the data is similar enough that I can't find the correct way to distinguish it.
Some lines in my file are of form:
AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Other lines are of form
AED=20180823
AMD=20150914
AMD=20150921
The remaining lines are headers and I'd like to discard them. For example
[HEADER: BUSINESS DATE=20160831]
My solution attempt so far is to match the first three capital letters and an equal sign,
r'\b[A-Z]{3}=\b'
but after that I'm not sure how to distinguish between dates (eg 20180823) and days (eg FRI:SAT:SUN).
The results I'd expect from these parsing functions:
Regex weekday_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=FRI>);
Regex date_rx = new Regex(<EXPRESSION FOR TYPES LIKE AED=20160816>);
weekdays = [weekday_rx.Match(line) for line in infile.read()]
dates = [date_rx.Match(line) for line in infile.read()]
r'\S*\d$'
Will match all non-whitespace characters that end in a digit
Will match AED=20180823
r'\S*[a-zA-Z]$'
Matches all non-whitespace characters that end in a letter.
Will match AED=FRI
AFN=FRI:SAT
AMD=SUN:SAT
Neither will match
[HEADER: BUSINESS DATE=20160831]
This will match both
r'(\S*[a-zA-Z]$|\S*\d$)'
Replacing the * with the number of occurrences you expect will be safer; the (a|b) construct means match a or match b.
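Putting the two patterns together, a small sketch (the test lines are my own, modeled on the question) that sorts lines into the two kinds:
import re
lines = ["AED=FRI", "AFN=FRI:SAT", "AED=20180823", "[HEADER: BUSINESS DATE=20160831]"]
weekday_rx = re.compile(r'\S*[a-zA-Z]$')
date_rx = re.compile(r'\S*\d$')
weekdays = [line for line in lines if weekday_rx.match(line)]
dates = [line for line in lines if date_rx.match(line)]
print(weekdays)  # ['AED=FRI', 'AFN=FRI:SAT']
print(dates)     # ['AED=20180823']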
The following is a solution in Python :)
import re
p = re.compile(r'\b([A-Z]{3})=((\d)+|([A-Z])+)')
str_test_01 = "AMD=SUN:SAT"
m = p.search(str_test_01)
print (m.group(1))
print (m.group(2))
str_test_02 = "AMD=20150921"
m = p.search(str_test_02)
print (m.group(1))
print (m.group(2))
"""
<Output>
AMD
SUN
AMD
20150921
"""
Use pipes to express alternatives in regex. The pattern '[A-Z]{3}:[A-Z]{3}|[A-Z]{3}' will match both ABC and ABC:ABC. Then use parentheses to group results:
import re
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC:ABC')
assert match.groups() == ('ABC:ABC', None)
match = re.match(r'([A-Z]{3}:[A-Z]{3})|([A-Z]{3})', 'ABC')
assert match.groups() == (None, 'ABC')
You can research the concept of named groups to make this even more readable. Also, take a look at the docs for the match object for useful info and methods.
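For example, a named-group variant of the same pattern (the group names here are my own, purely illustrative):
import re
match = re.match(r'(?P<pair>[A-Z]{3}:[A-Z]{3})|(?P<single>[A-Z]{3})', 'ABC:ABC')
print(match.groupdict())  # {'pair': 'ABC:ABC', 'single': None}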

Matching alternative regexps in Python

I'm using Python to parse a file in search for e-mail addresses, but I can't figure out what the syntax for alternative regexps should be. Here's the code:
addresses = []
pattern = '(\w+)#(\w+\.com)|(\w+)#(it.\w+\.com)'
for line in file:
    matches = re.findall(pattern, line)
    for m in matches:
        address = '%s#%s' % m
        addresses.append(address)
So I want to find addresses that look like john#company.com or john#it.company.com, but the above code doesn't work because either the first two groups are empty or the last two groups are empty. What is the correct solution? I need to use groups to store the user name (before #) and server name (after #) separately.
EDIT: Matching email adresses is only an example. What I'm trying to find out is how to match different regexps that have only one thing in common - they match two groups.
(\w+)#((?:it\.)?\w+\.com)
You want to capture the part after the # whether it's example.com or it.example.com, so you put both options inside the same capture group. But since they share a similar format, you can condense (it\.\w+\.com|\w+\.com) to just ((it\.)?\w+\.com)
The (?: ) makes that set of parens a non-capturing group, so it won't take part in your matched groups. There will be one capture for the first (\w+) and one for the whole ((?:it\.)?\w+\.com) after the #. That's two groups total, plus the default group 0 for the full match.
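For instance, a quick check of the combined pattern against both address forms from the question (a sketch of my own):
import re
pattern = re.compile(r'(\w+)#((?:it\.)?\w+\.com)')
print(pattern.match('john#company.com').groups())     # ('john', 'company.com')
print(pattern.match('john#it.company.com').groups())  # ('john', 'it.company.com')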
EDIT: To answer your new question, all you have to do is follow the grouping I used, but stop before you condense it.
If your test cases are:
1) example#abcdef
2) example#123456
You could write your regex as such: (\w+)#([a-zA-Z]+|\d+), which would always have the part before the # in group 1, and the part after in group 2. Notice that there are only two pairs of parens, and the |("or") operator appears inside of the second parens group.
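A quick check of that pattern against both test cases (again a sketch of my own):
import re
for s in ['example#abcdef', 'example#123456']:
    print(re.match(r'(\w+)#([a-zA-Z]+|\d+)', s).groups())
# ('example', 'abcdef')
# ('example', '123456')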
I once found here a well-written email regex; it was built for extracting a wide range of valid email addresses from a generic string, so it should also be able to do what you're looking for.
Example:
>>> email_regex = re.compile("""((([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*")\.)*([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~]+|"([a-zA-Z0-9!\#\$%&'*+\-\/=?^_`{|}~(),:;<>#\[\]\.]|\\[ \\"])*"))#((([a-zA-Z0-9]([a-zA-Z0-9]*(\-[a-zA-Z0-9]*)*)?\.)*[a-zA-Z]+|\[((0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\.){3}(0?\d{1,2}|1\d{2}|2[0-4]\d|25[0-5])\]|\[[Ii][Pp][vV]6(:[0-9a-fA-F]{0,4}){6}\]))""")
>>>
>>> m = email_regex.search('john#it.company.com')
>>> m.group(0)
'john#it.company.com'
>>> m.group(1)
'john'
>>> m.group(7)
'it.company.com'
>>>
>>> n = email_regex.search('john#company.com')
>>> n.group(0)
'john#company.com'
>>> n.group(1)
'john'
>>> n.group(7)
'company.com'

How to extract longest of overlapping groups?

How can I extract the longest of groups which start the same way
For example, from a given string, I want to extract the longest match to either CS or CSI.
I tried "(CS|CSI).*" and it will return CS rather than CSI even if CSI is available.
If I do "(CSI|CS).*" then I do get CSI when it's a match, so I guess the solution is to always place the shorter of the overlapping groups after the longer one.
Is there a clearer way to express this with re's? Somehow it feels confusing that the result depends on the order in which you list the groups.
No, that's just how it works, at least in Perl-derived regex flavors like Python, JavaScript, .NET, etc.
http://www.regular-expressions.info/alternation.html
As Alan says, the patterns will be matched in the order you specified them.
If you want to match on the longest of overlapping literal strings, you need the longest one to appear first. But you can organize your strings longest-to-shortest automatically, if you like:
>>> '|'.join(sorted('cs csi miami vice'.split(), key=len, reverse=True))
'miami|vice|csi|cs'
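Building on that, a small sketch (the sample words and target string are my own) that feeds the sorted alternation straight into a search:
>>> import re
>>> words = 'cs csi miami vice'.split()
>>> pattern = '|'.join(sorted(map(re.escape, words), key=len, reverse=True))
>>> re.findall(pattern, 'csi cs vice')
['csi', 'cs', 'vice']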
I'm intrigued to know the right way of doing this too, but if it helps any, you can always build up your regex like this:
import re
string_to_look_in = "AUHDASOHDCSIAAOSLINDASOI"
string_to_match = "CSIABC"
re_to_use = "(" + "|".join([string_to_match[0:i] for i in range(len(string_to_match),0,-1)]) + ")"
re_result = re.search(re_to_use,string_to_look_in)
print(string_to_look_in[re_result.start():re_result.end()])
Similar functionality is present in the vim editor ("sequence of optionally matched atoms"), where e.g. col\%[umn] matches col in color, colum in columbus and the full column.
I am not aware of similar functionality in Python's re, but
you can use nested anonymous groups, each followed by a ? quantifier, for that:
>>> import re
>>> words = ['color', 'columbus', 'column']
>>> rex = re.compile(r'col(?:u(?:m(?:n)?)?)?')
>>> for w in words: print(rex.findall(w))
['col']
['colum']
['column']
