Python Regex: Capture overlapping parts - python

Given a string s = "<foo>abcaaa<bar>a<foo>cbacba<foo>c" I'm trying to write a regular expression which will extract portions of: angle brackets with the text inside and the surrounding text. Like this:
<foo>abcaaa
abcaaa<bar>a
a<foo>cbacba
cbacba<foo>c
So expected output should look like this:
["<foo>abcaaa", "abcaaa<bar>a", "a<foo>cbacba", "cbacba<foo>c"]
I found this question How to find overlapping matches with a regexp? which brought me little bit closer to the desired result but still my regex doesn't work.
regex = r"(?=([a-c]*)\<(\w+)\>([a-c]*))"
Any ideas how to solve this problem?

You can match overlapping content with standard regex syntax by using capturing groups inside lookaround assertions, since those may match parts of the string without consuming the matched substring and hence precluding it from further matches. In this specific example, we match either the beginning of the string or a > as anchor for the lookahead assertion which captures our actual targets:
(?:\A|>)(?=([a-c]*<\w+>[a-c]*))
See regex demo.
In python we then use the property of re.findall() to only return matches captured in groups when capturing groups are present in the expression:
text = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
expr = r'(?:\A|>)(?=([a-c]*<\w+>[a-c]*))'
captures = re.findall(expr, text)
print(captures)
Output:
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']

You need to set the left- and right-hand boundaries to < or > chars or start/end of string.
Use
import re
text = "<foo>abcaaa<bar>a<foo>cbacba<foo>c"
print( re.findall(r'(?=(?<![^<>])([a-c]*<\w+>[a-c]*)(?![^<>]))', text) )
# => ['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
See the Python demo online and the regex demo.
Pattern details
(?= - start of a positive lookahead to enable overlapping matches
(?<![^<>]) - start of string, < or >
([a-c]*<\w+>[a-c]*) - Group 1 (the value extracted): 0+ a, b or c chars, then <, 1+ word chars, > and again 0+ a, b or c chars
(?![^<>]) - end of string, < or > must follow immediately
) - end of the lookahead.

You may use this regex code in python:
>>> s = '<foo>abcaaa<bar>a<foo>cbacba<foo>c'
>>> reg = r'([^<>]*<[^>]*>)(?=([^<>]*))'
>>> print ( [''.join(i) for i in re.findall(reg, s)] )
['<foo>abcaaa', 'abcaaa<bar>a', 'a<foo>cbacba', 'cbacba<foo>c']
RegEx Demo
RegEx Details:
([^<>]*<[^>]*>): Capture group #1 to match 0 or more characters that are not < and > followed by <...> string.
(?=([^<>]*)): Lookahead to assert that we have 0 or more non-<> characters ahead of current position. We have capture group #2 inside this lookahead.

Related

Cannot seem to figure out this regex involving forward slash

I am trying to capture instances in my dataframe where a string has the following format:
/random a/random b/random c/capture this/random again/random/random
Where a string is preceded by four instances of /, and more than two / appear after it, I would like the string captured and returned in a different column. If it is not applicable to that row, return None.
In this instance capture this should be captured and placed into a new column.
This is what I tried:
def extract_special_string(df, column):
df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)
extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.
You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
See the regex demo
Details:
^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1:one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} - match the part before capture this, matching any but as few as possible characters between each pair of /s (use noncapturing group to ignore the contents)
(.*?) - capture capture this (since this is a capturing group, we can fetch the contents from <match_object>.group(1)
(?:/.*?){2,} - same as the first part, match as few characters as possible in between each pair of /s

Pandas regex to remove digits before consecutive dots

I have a string Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23.
Removing all the numbers that are before the dot and after the word.
Ignoring the first part of the string i.e. "Node57Name123".
Should not remove the digits if they are inside words.
Tried re.sub(r"\d+","",string) but it removed every other digit.
The output should look like this "Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape"
Can you please point me to the right direction.
You can use
re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text)
See the regex demo.
Details:
^([^.]*\.) - zero or more chars other than a dot and then a . char at the start of the string captured into Group 1 (referred to with \1 from the replacement pattern)
| - or
\d+(?![^.]) - one or more digits followed with a dot or end of string (=(?=\.|$)).
See the Python demo:
import re
text = r'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
print( re.sub(r'^([^.]*\.)|\d+(?![^.])', r'\1', text) )
## => Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
Just to give you a non-regex alternative' using rstrip(). We can feed this function a bunch of characters to remove from the right of the string e.g.: rstrip('0123456789'). Alternatively we can also use the digits constant from the string module:
from string import digits
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = '.'.join([s.split('.')[0]] + [i.rstrip(digits) for i in s.split('.')[1:]])
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape
EDIT:
If you must use a regular pattern, it seems that the following covers your sample:
(\.[^.]*?)\d+\b
Replace with the 1st capture group, see the online demo
( - Open capture group:
\.[^.]*? - A literal dot followed by 0+ non-dot characters (lazy).
) - Close capture group.
\d+\b - Match 1+ digits up to a word-boundary.
A sample:
import re
s = 'Node57Name123.grpObject12.grp23Symbol43.shape52.anotherobject25.shape23'
x = re.sub(r'(\.[^.]*?)\d+\b', r'\1', s)
print(x)
Prints:
Node57Name123.grpObject.grp23Symbol.shape.anotherobject.shape

Find something between parentheses

I got a string like that:
LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)
I want to look only for OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0), but the OR could be LD as well. _080T_SAF_OUT could be different being always alphanumeric with bottom slash sometimes. COIL(xxSF[4].Flt[120].0), must be always in the format COIL(xxSF["digits"].Flt["digits"]."digits")
I am trying to use the re library of Python 2.7.
m = re.search('\OR|\LD'+'\('+'.+'+'\)'+'+'\COIL+'\('+'\xxSF+'\['+'\d+'+'\].'+ Flt\['+'\d+'+'\]'+'\.'+'\d+', Text)
My Output:
OR(abc_TEST_X)LD(xxSF[16].Flt[0].22
OR
LD(TEST_X_dsfa)OR(WASS_READY)COIL(xxSF[16].Flt[11].10
The first one is the right one which I am getting I want to discard the second one and the third one.
I think that the problem is here:
'\('+'.+'+'\)'
Because of I just want to find something alphanumeric and possibly with symbols between the first pair of paréntesis, and I am not filtering this situation.
You should group alternations like (?:LD|OR), and to match any chars other than ( and ) you may use [^()]* rather than .+ (.+ matches any chars, as many as possible, hence it matches across parentheses).
Here is a Python demo:
import re
Text = 'LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'
m = re.search(r'(?:OR|LD)\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+', Text)
if m:
print(m.group()) # => OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0
Pattern details
(?:OR|LD) - a non-capturing group matching OR or LD
\( - a ( char
[^()]* - a negated character class matching 0+ chars other than ( and )
\)COIL\(xxSF\[ - )COIL(xxSF[ substring
\d+ - 1+ digits
]\.Flt\[ - ].Flt[ substring
\d+]\.\d+ - 1+ digits, ]. substring and 1+ digits
See the regex demo.
TIP Add a \b before (?:OR|LD) to match them as whole words (not as part of NOR and NLD).
Thanks, I am capturing everything which I want. Just something else to filter. Take a look to some Outputs:
OR(_1B21_A53021_2_En)OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1B21_A53021_2_En)LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
I only want to capture the last one "LD" or "OR" as follow:
OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);

Return the next nth result \w+ after a hyphen globally

Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
\(?<=-)\w+\g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex \(?!.*-)(?<=-)\w+\g
Result:
YIFI (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern search for a hyphen, followed with 0+ whitespaces, and then captures 1+ digits, letters or underscores. re.findall only returns the texts captured with capturing groups, so you only get those Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).

Anchor to End of Last Match

In the process of working on this answer I stumbled on an anomaly with Python's repeating regexes.
Say I'm given a CSV string with an arbitrary number of quoted and unquoted elements:
21, 2, '23.5R25 ETADT', 'description, with a comma'
I want to replace all the ','s outside quotes with '\t'. So I'd like an output of:
21\t2\t'23.5R25 ETADT'\t'description, with a comma'
Since there will be multiple matches in the string naturally I'll use the g regex modifier. The regex I'll use will match characters outside quotes or a quoted string followed by a ',':
('[^']*'|[^',]*),\s*
And I'll replace with:
\1\t
Now the problem is the regex is searching not matching so it can choose to skip characters until it can match. So rather than my desired output I get:
21\t2\t'23.5R25 ETADT'\t'description\twith a comma'
You can see a live example of this behavior here: https://regex101.com/r/sG9hT3/2
Q. Is there a way to anchor a g modified regex to begin matching at the character after the previous match?
For those familiar with Perl's mighty regexs, Perl provides the \G. Which allows us to retrieve the end of the last position matched. So in Perl I could accomplish what I'm asking for with the regex:
\G('[^']*'|[^',]*),\s*
This would force a mismatch within the final quoted element. Because rather than allowing the regex implementation to find a point where the regex matched the \G would force it to begin matching at the first character of:
'description, with a comma'
You can use the following regex with re.search:
,?\s*([^',]*(?:'[^']*'[^',]*)*)
See regex demo (I change it to ,?[ ]*([^',\n]*(?:'[^'\n]*'[^',\n]*)*) since it is a multiline demo)
Here, the regex matches (in a regex meaning of the word)...
,? - 1 or 0 comma
\s* - 0 or more whitespace
([^',]*(?:'[^']*'[^',]*)*) - Group 1 storing a captured text that consists of...
[^',]* - 0 or more characters other than , and '
(?:'[^']*'[^',]*)* - 0 or more sequences of ...
'[^']*' - a 'string'-like substring containing no apostrophes
[^',]* - 0 or more characters other than , and '.
If you want to use a re.match and store the captured texts inside capturing groups, it is not possible since Python regex engine does not store all the captures in a stack as .NET regex engine does with CaptureCollection.
Also, Python regex does not support \G operator, so you cannot anchor any subpattern at the end of a successful match here.
As an alternative/workaround, you can use the following Python code to return successive matches and then the rest of the string:
import re
def successive_matches(pattern,text,pos=0):
ptrn = re.compile(pattern)
match = ptrn.match(text,pos)
while match:
yield match.group()
if match.end() == pos:
break
pos = match.end()
match = ptrn.match(text,pos)
if pos < len(text) - 1:
yield text[pos:]
for matched_text in successive_matches(r"('[^']*'|[^',]*),\s*","21, 2, '23.5R25 ETADT', 'description, with a comma'"):
print matched_text
See IDEONE demo, the output is
21,
2,
'23.5R25 ETADT',
'description, with a comma'

Categories