I have examined a previous question relating to optional capture groups in Python, but this has not been helpful. Attempting to follow, the code I have is below:
import re
c = re.compile(r'(?P<Prelude>.*?)'
r'(?:Discussion:(?P<Discussion>.+?))?'
r'(?:References:(?P<References>.*?))?',
re.M|re.S)
test_text = r"""Prelude strings
Discussion: this is some
text.
References: My bad, I have none.
"""
test_text2 = r"""Prelude strings
Discussion: this is some
text.
"""
print(c.match(test_text).groups())
print(c.match(test_text2).groups())
Both print ('Prelude strings', None, None) instead of capturing the two groups. I am unable to determine why.
The expected result is ('Prelude strings', ' this is some\ntext.', ' My bad, I have none.') for the first, and the second the same but with None as the third capture group. It should also be possible to delete the Discussion lines and still capture References.
You can use
c = re.compile(r'^(?P<Prelude>.*?)'
r'(?:Discussion:\s*(?P<Discussion>.*?)\s*)?'
r'(?:References:\s*(?P<References>.*?))?$',
re.S)
One-line regex pattern as a string:
(?s)^(?P<Prelude>.*?)(?:Discussion:\s*(?P<Discussion>.*?)\s*)?(?:References:\s*(?P<References>.*?))?$
See the regex demo.
Details:
(?s) - same as re.S, makes . match line break chars
^ - start of the whole string (note that it no longer matches start of any line, since I removed the re.M flag)
(?P<Prelude>.*?) - Group "Prelude": any zero or more chars as few as possible
(?:Discussion:\s*(?P<Discussion>.*?)\s*)? - an optional non-capturing group matching one or zero occurrences of the following sequence:
Discussion: - a fixed string
\s* - zero or more whitespaces
(?P<Discussion>.*?) - Group "Discussion": zero or more chars as few as possible
\s* - zero or more whitespaces
(?:References:\s*(?P<References>.*?))? - an optional non-capturing group matching one or zero occurrences of the following sequence:
References: - a fixed string
\s* - zero or more whitespaces
(?P<References>.*?) - Group "References": any zero or more chars as few as possible
$ - end of the string.
Related
I am trying to capture instances in my dataframe where a string has the following format:
/random a/random b/random c/capture this/random again/random/random
Where a string is preceded by four instances of /, and more than two / appear after it, I would like the string captured and returned in a different column. If it is not applicable to that row, return None.
In this instance capture this should be captured and placed into a new column.
This is what I tried:
def extract_special_string(df, column):
df['special_string_a'] = df[column].apply(lambda x: re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x).group(0) if re.search(r'(?<=\/{4})[^\/]+(?=\/[^\/]{2,})', x) else None)
extract_special_string(df, 'column')
However nothing is being captured. Can anybody help with this regex? Thanks.
You can use
df['special_string_a'] = df[column].str.extract(r'^(?:[^/]*/){4}([^/]+)(?:/[^/]*){2}', expand=False)
See the regex demo
Details:
^ - start of string
(?:[^/]*/){4} - four occurrences of any zero or more chars other than / and then a / char
([^/]+) - Capturing group 1:one or more chars other than a / char
(?:/[^/]*){2} - two occurrences of a / char and then any zero or more chars other than /.
An alternative regex approach would be to use non-greedy quantifiers.
import re
s = '/random a/random b/random c/capture this/random again/random/random'
pattern = r'/(?:.*?/){3}(.*?)(?:/.*?){2,}'
m = re.match(pattern, s)
print(m.group(1)) # 'capture this'
/(?:.*?/){3} - match the part before capture this, matching any but as few as possible characters between each pair of /s (use noncapturing group to ignore the contents)
(.*?) - capture capture this (since this is a capturing group, we can fetch the contents from <match_object>.group(1)
(?:/.*?){2,} - same as the first part, match as few characters as possible in between each pair of /s
This question is related to my previous question, for which I got an answer.
Now I need to add an exception condition into the recommended regex expression. The regex expression (?<!\s)-\s+ should be applied only if the word after - is not equal to to. If it is equal to - to, then the - should be replaced with a single white space .
I tried to use a negative lookbehind (?<!to) to add the condition on to.
import re
s = "refer- ences har- ness Stand- ard Re- quired www.mypo- rtal.test.com A - it is a document, move- to store"
re.sub(r"(?<!\s)-\s+(?<!to)", "", s)
But it still returns moveto store instead of move to store.
The expected output:
references harness Standard Required www.myportal.test.com A - it is a document, move to store
You can use
import re
s = "refer- ences har- ness Stand- ard Re- quired www.mypo- rtal.test.com A - it is a document, move- to store"
print(re.sub(r"(?<!\s)-(?:(\s)+(to)\b|\s+)", r"\1\2", s))
# => references harness Standard Required www.myportal.test.com A - it is a document, move to store
See the Python demo and the regex demo.
Details
(?<!\s) - a location with no whitespace immediately on the left
- - a hyphen
(?:(\s)+(to)\b|\s+) - a non-capturing group matching either of the two patterns:
(\s)+(to)\b - a whitespace captured into Group 1 (the group value is referred to with the \1 placeholder, called a replacement backreference, from the replacement pattern), repeated one or more times (so that only the last one lands in the Group 1 memory buffer) and then a whole word to (since \b is a word boundary) that is captured into Group 2 (\2 in the replacement pattern)
| - or
\s+ - 1+ whitespaces.
The replacement is a concatenation of Group 1 and Group 2. When the first alternative in the non-capturing group does not match, the \1 and \2 are empty strings, so the result is as expected in both cases.
I got a string like that:
LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)
I want to look only for OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0), but the OR could be LD as well. _080T_SAF_OUT could be different being always alphanumeric with bottom slash sometimes. COIL(xxSF[4].Flt[120].0), must be always in the format COIL(xxSF["digits"].Flt["digits"]."digits")
I am trying to use the re library of Python 2.7.
m = re.search('\OR|\LD'+'\('+'.+'+'\)'+'+'\COIL+'\('+'\xxSF+'\['+'\d+'+'\].'+ Flt\['+'\d+'+'\]'+'\.'+'\d+', Text)
My Output:
OR(abc_TEST_X)LD(xxSF[16].Flt[0].22
OR
LD(TEST_X_dsfa)OR(WASS_READY)COIL(xxSF[16].Flt[11].10
The first one is the right one which I am getting I want to discard the second one and the third one.
I think that the problem is here:
'\('+'.+'+'\)'
Because of I just want to find something alphanumeric and possibly with symbols between the first pair of paréntesis, and I am not filtering this situation.
You should group alternations like (?:LD|OR), and to match any chars other than ( and ) you may use [^()]* rather than .+ (.+ matches any chars, as many as possible, hence it matches across parentheses).
Here is a Python demo:
import re
Text = 'LD(_030S.F.IN)OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0)'
m = re.search(r'(?:OR|LD)\([^()]*\)COIL\(xxSF\[\d+]\.Flt\[\d+]\.\d+', Text)
if m:
print(m.group()) # => OR(_080T_SAF_OUT)COIL(xxSF[4].Flt[120].0
Pattern details
(?:OR|LD) - a non-capturing group matching OR or LD
\( - a ( char
[^()]* - a negated character class matching 0+ chars other than ( and )
\)COIL\(xxSF\[ - )COIL(xxSF[ substring
\d+ - 1+ digits
]\.Flt\[ - ].Flt[ substring
\d+]\.\d+ - 1+ digits, ]. substring and 1+ digits
See the regex demo.
TIP Add a \b before (?:OR|LD) to match them as whole words (not as part of NOR and NLD).
Thanks, I am capturing everything which I want. Just something else to filter. Take a look to some Outputs:
OR(_1B21_A53021_2_En)OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1B21_A53021_2_En)LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
I only want to capture the last one "LD" or "OR" as follow:
OR(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
LD(_1_A21_Z53021_2)COIL(xxSF[9].Flt[15].3);
I'm working on my regex skills and i find one of my strings having duplicate words at the starting. I would like to remove the duplicate and just have one word of it -
server_server_dev1_check_1233.zzz
server_server_qa1_run_1233.xyz
server_server_dev2_1233.qqa
server_dev1_1233.zzz
data_data_dev9_check_660.log
I used the below regex but i get both server_server in my output,
((.*?))_(?!\D)
How can i have my output just to one server_ if there are two or more and if its only one server_, then take as is?
The output doesn't have to contain the digits and also the part after . i.e. .zzz, .xyz etc
Expected output -
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
you could back reference the word in your search expression:
>>> s = "server_server_dev1_check_1233.zzz"
>>> re.sub(r"(.*_)\1",r"\1",s)
'server_dev1_check_1233.zzz'
and use the "many times" suffix so if there are more than 2 occurrences it still works:
'server_server_server_dev1_check_1233.zzz'
>>> re.sub(r"(.*_)\1{1,}",r"\1",s)
'server_dev1_check_1233.zzz'
getting rid of the suffix is not the hardest part, just capture the rest and discard the end:
>>> re.sub(r"(.*_)\1{1,}(.*)(_\d+\..*)",r"\1\2",s)
'server_dev1_check'
You may use a single re.sub call to match and remove what you do not need and match and capture what you need:
re.sub(r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$', r'\1\2', s)
See the regex demo
Details
^ - start of string
([^_]+) - Capturing group 1: any 1+ chars other than _
(?:_\1)* - zero or more repetitions of _ followed with the same substring as in Group 1 (thanks to the inline backreference \1 that retrieves the text from Group 1)
(.*) - Group 2: any 0+ chars, as many as possible
_ - an underscore
\d+ - 1+ digits
\. - a dot
\w+ - 1+ word chars ([^.]+ will also do, 1 or more chars other than .)
$ - end of string.
The replacement pattern is \1\2, i.e. the contents of Group 1 and 2 are concatenated and make up the resulting value.
Python demo:
import re
rx = r'^([^_]+)(?:_\1)*(.*)_\d+\.\w+$'
strs = ["server_server_dev1_check_1233.zzz", "server_server_qa1_run_1233.xyz", "server_server_dev2_1233.qqa", "server_dev1_1233.zzz", "data_data_dev9_check_660.log"]
for s in strs:
print(re.sub(rx, r'\1\2', s))
Output:
server_dev1_check
server_qa1_run
server_dev2
server_dev1
data_dev9_check
Just getting to the next stage of understanding regex, hoping the community can help...
string = These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4
There are multiple test names preceded by a '-' hyphen which I derive from regex
\(?<=-)\w+\g
Result:
AUSVERSION
TEST
TESTAGAIN
YIFY
I can parse the very last result using greediness with regex \(?!.*-)(?<=-)\w+\g
Result:
YIFI (4th & last result)
Can you please help me parse either the 1st, 2nd, or 3rd result Globally using the same string?
In Python, you can get these matches with a simple -\s*(\w+) regex and re.findall and then access any match with the appropriate index:
See IDEONE demo:
import re
s = 'These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN-YIFY.cp(tt123456).MiLLENiUM.mp4'
r = re.findall(r'-\s*(\w+)', s)
print(r[0]) # => AUSVERSION
print(r[1]) # => TEST
print(r[2]) # => TESTAGAIN
print(r[3]) # => YIFY
The -\s*(\w+) pattern search for a hyphen, followed with 0+ whitespaces, and then captures 1+ digits, letters or underscores. re.findall only returns the texts captured with capturing groups, so you only get those Group 1 values captured with (\w+).
To get these matches one by one, with re.search, you can use ^(?:.*?-\s*(\w+)){n}, where n is the match index you want. Here is a regex demo.
A quick Python demo (in real code, assign the result of re.search and only access Group 1 value after checking if there was a match):
s = "These.Final.Hours-AUSVERSION.2013-TEST-TESTAGAIN- YIFY.cp(tt123456).MiLLENiUM.mp4"
print(re.search(r'^(?:.*?-\s*(\w+))', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){2}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){3}', s).group(1))
print(re.search(r'^(?:.*?-\s*(\w+)){4}', s).group(1))
Explanation of the pattern:
^ - start of string
(?:.*?-\s*(\w+)){2} - a non-capturing group that matches (here) 2 sequences of:
.*? - 0+ any characters other than a newline (since no re.DOTALL modifier is used) up to the first...
- - hyphen
\s* - 0 or more whitespaces
(\w+) - Group 1 capturing 1+ word characters (letters, digits or underscores).