Start matching after character in regex - python

What I want
Suppose I have the following string:
"Abc def. 2. Ghi jkl. → 1. Mno 2. Pqrs 3. Tu 4 vx 5. yz..."
Now I want to write a regular expression in Python that matches and groups each list item after the character → such that each group would contain the list item number and the content for that list item, like this:
('1', 'Mno')
('2', 'Pqrs')
('3', 'Tu 4 vx')
('5', 'yz..')
In other words, after I encounter → I want to match patterns that look something like:
'([0-9]+)\.[" "]*(.*)'
I know that the obvious practical solution is to split the string and only search the section that comes after →, but I'm more interested in a theoretical, maybe-not-so-practical solution using only regular expression, in order to get a better understanding of regular expressions.
What I've tried
I have tried using look-behind like this:
'(?<=→)[" "]*([0-9]+)\.[" "]*(.*?)(?=[0-9]+\.|$)'
which finds the first match, but then things seem to get vastly more complex since it SEEMS as if I need to use another look-behind to match everything that's not the first occurrence. But since I don't know the length of the first list item, and Python only supports fixed-width look-behinds, I'm not sure how to proceed.

You could use the PyPI regex module and its \G anchor to get contiguous matches. The \G anchor matches at the start of the string or at the end of the previous match.
Use 2 capturing groups to get the data and use regex.findall to return the values from the groups.
Pattern
(?:^[^→\r\n]*→|\G(?!^))[^\S\r\n]*(\d+)\.[^\S\r\n]*(.*?)[^\S\r\n]*(?=$|\d\.)
Explanation
(?: Non capture group
^[^→\r\n]*→ From the start of the string, match 0+ occurrences of any char except a newline or →, then match →
| Or
\G(?!^) Assert the position at the end of the previous match, not at the start
) Close group
[^\S\r\n]* Match 0+ whitespaces except a newline
(\d+) Capture group 1, match 1+ digits
\.[^\S\r\n]* Match a dot followed by 0+ whitespaces except a newline
(.*?) Capture group 2, match any char 0+ times non greedy
[^\S\r\n]* Match 0+ trailing whitespaces
(?= Positive lookahead, assert what is on the right is
$|\d\. Assert end of string or match a digit and dot
) Close lookahead
Code example
import regex
pattern = r"(?:^[^→\r\n]*→|\G(?!^))[^\S\r\n]*(\d+)\.[^\S\r\n]*(.*?)[^\S\r\n]*(?=$|\d\.)"
test_str = "Abc def. 2. Ghi jkl. → 1. Mno 2. Pqrs 3. Tu 4 vx 5. yz..."
print(regex.findall(pattern, test_str))
Output
[('1', 'Mno'), ('2', 'Pqrs'), ('3', 'Tu 4 vx'), ('5', 'yz...')]
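For comparison, the practical approach the question mentions (cut the string at → first, then search only the tail) can be sketched with the standard re module alone. This is a minimal sketch and not the \G technique above; the names tail and ITEM_RE are illustrative only.

```python
import re

text = "Abc def. 2. Ghi jkl. → 1. Mno 2. Pqrs 3. Tu 4 vx 5. yz..."

# Keep only what follows the first arrow (empty string if there is no arrow).
_, _, tail = text.partition("→")

# Same item shape as the answer's pattern, without the \G / arrow prefix.
ITEM_RE = re.compile(r"(\d+)\.[^\S\r\n]*(.*?)[^\S\r\n]*(?=\d+\.|$)")

items = ITEM_RE.findall(tail)
print(items)
# [('1', 'Mno'), ('2', 'Pqrs'), ('3', 'Tu 4 vx'), ('5', 'yz...')]
```

Because the "2. Ghi jkl." before the arrow is never part of tail, it cannot be matched by accident.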

Related

Regex pattern matching comma delimited values with spaces allowed around comma

I am trying to write a Regex validator (Python 3.8) to accept strings like these:
foo
foo,bar
foo, bar
foo , bar
foo , bar
foo, bar,foobar
This is what I have so far (but it matches only the first two cases):
^[a-zA-Z][0-9a-zA-Z]+(,[a-zA-Z][0-9a-zA-Z]+)*$|^[a-zA-Z][0-9a-zA-Z]+
However, when I add the whitespace match \w, it stops matching altogether:
^[a-zA-Z][0-9a-zA-Z]+(\w+,\w+[a-zA-Z][0-9a-zA-Z]+)*$|^[a-zA-Z][0-9a-zA-Z]+
What is the pattern to use (with explanation as to why my second pattern above is not matching).
\w matches [0-9a-zA-Z_]; it does not match whitespace. Use \s to match whitespace.
What you need is this regex:
^[a-zA-Z][0-9a-zA-Z]*(?:\s*,\s*[a-zA-Z][0-9a-zA-Z]*)*$
RegEx Details:
^: Start
[a-zA-Z][0-9a-zA-Z]*: Match a text starting with a letter followed by 0 or more alphanumeric characters
(?:: Start non-capture group
\s*,\s*: Match a comma optionally surrounded with 0 or more whitespaces on both sides
[a-zA-Z][0-9a-zA-Z]*: Match a text starting with a letter followed by 0 or more alphanumeric characters
)*: End non-capture group. Repeat this group 0 or more times
$: End
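A minimal sketch of how the pattern above could be used as a validator. The helper name is_valid and the rejected samples are assumptions added for illustration; the accepted samples come from the question.

```python
import re

LIST_RE = re.compile(r"^[a-zA-Z][0-9a-zA-Z]*(?:\s*,\s*[a-zA-Z][0-9a-zA-Z]*)*$")

def is_valid(value):
    """Return True if value is a comma-separated list of identifiers."""
    return LIST_RE.match(value) is not None

accepted = ["foo", "foo,bar", "foo, bar", "foo , bar", "foo, bar,foobar"]
rejected = ["", "1foo", "foo,,bar", "foo bar"]

for value in accepted + rejected:
    print(repr(value), is_valid(value))
```

The accepted samples all print True and the rejected ones print False.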

Python: find a string between 2 strings in text

I have a text like this
s = """
...
(1) Literature
1. a.
2. b.
3. c.
...
"""
I want to cut out the Literature section, but I have a problem detecting it.
This is what I use:
re.search("(1) Literature\n\n(.*).\n\n", s).group(1)
but search returns None.
The desired output is
(1) Literature
1. a.
2. b.
3. c.
What did I do wrong?
You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.
\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)
The pattern matches:
\(1\) Literature\n\n Match (1) Literature and 2 newlines
( Capture group 1
(?: Non capture group
\d+\..*(?:\n|$) Match 1+ digits, a dot and the rest of the line, followed by either a newline or the end of the string
)+ Close non capture group and repeat it 1 or more times to match all the lines
) Close group 1
Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespaces.
\(1\) Literature((?:\n(?!\(\d+\)).*)*)
Parentheses have a special meaning in regex: they group and capture.
(1) captures 1 as the first capturing group, so it matches the character 1 alone, not the literal text (1).
Since the string contains literal parentheses, the match fails. In addition, .* stops at the end of a line because . does not match a newline by default.
Based on your regex, I assumed you wanted to capture the line with the word Literature, 5 lines below it. Here is a regex to do so.
\(1\) Literature(.*\n){5}
Note the escape characters used on the parentheses around 1.
EDIT
Based on zr0gravity7's comment, I came up with this regex to capture the middle section on the string.
\(1\)\sLiterature\n+((.*\n){3})
This regex will capture the below string in capturing group 1.
1. a.
2. b.
3. c.
You may use this regex with a capture group:
r'\(1\)\s+Literature\s+((?:.+\n)+)'
Explanation:
\(1\): Match (1) text
\s+: Match 1+ whitespaces
Literature: Match the literal text Literature
\s+: Match 1+ whitespaces (including the newlines after the heading)
(: Start capture group #1
(?:.+\n)+: Match a line with 1+ character followed by newline. Repeat this 1 or more times to allow it to match multiple such lines
): End capture group #1
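A short sketch of this pattern in use. The sample string is an assumption: it places a blank line after the heading and after the list, which the question's text leaves unclear.

```python
import re

# Assumed layout: heading, blank line, numbered lines, blank line, next section.
s = "intro\n\n(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Other\n"

m = re.search(r"\(1\)\s+Literature\s+((?:.+\n)+)", s)

# Group 1 holds the consecutive non-empty lines after the heading.
body = m.group(1) if m else ""
lines = body.splitlines()
print(lines)  # ['1. a.', '2. b.', '3. c.']
```

The repetition stops at the first empty line, because .+ requires at least one character per line.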
Regex for capturing the generic question with that structure:
\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)
It will capture the title "Literature", then the choices in another group (for a total of 2 groups).
It is not possible to capture repeating groups, so in order to get each of your "1. a." in a separate group you would have to match the second group from above again, with this pattern:
(\d+\.\s+.+)\n and then match globally (for example with re.findall) to get each item.
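The two-step idea can be sketched as follows: one pattern finds each section, a second pattern is applied to the captured block to pull out the individual items. The sample text, the second section "Music" and the names section_re and item_re are assumptions for illustration.

```python
import re

s = "(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Music\n\n1. x.\n2. y.\n"

# Step 1: capture the section title and its whole block of numbered lines.
section_re = re.compile(r"\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)")
# Step 2: split a captured block into (number, text) pairs.
item_re = re.compile(r"(\d+)\.\s+(.+)")

sections = {
    title: item_re.findall(block)
    for title, block in section_re.findall(s)
}
print(sections)
```

A repeated group only keeps its last iteration, which is why the second pass over the block is needed.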

Match capture group fixed number of times

I have a bunch of 5-letter strings. For each string, I would like to match only if the string contains 3 instances of the same letter, i.e.:
Case 1: 'aabbc' -> no match
Case 2: 'bbbcc' -> match 'bbb'
Case 3: 'ddcdc' -> match 'ddd'
My best regex attempt is:
(.){1}(?!\1)*\1{1}(?!\1)*\1{1}
This works for case 1 (where there is no match) and case 2 (where the 3 instances are adjacent), but not for case 3 (where the 3 instances are separated by at least one other letter).
Is there a regex that will work for case 3? Ideally I would like to also extract the locations of the 3 matching instances from the string.
The below pattern catches what you need and should capture the edge cases. The first capturing group can be modified to just be the subset of characters you need to search for if there is a limited list of expected values. Putting the \1s in capturing groups means that you should be able to extract the index of the capturing groups from the match via .start() (getting the starting index of the capturing group), meeting your bonus goal.
>>> pattern = r"(.).*(\1).*(\1)"
>>> x = re.search(pattern, "ababb")
>>> x.groups()
('b', 'b', 'b')
>>> x.start(1)
1
>>> x.start(2)
3
>>> x.start(3)
4
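Applied to the three cases from the question, the same pattern can be wrapped in a small helper. This is a sketch; the function name triple_positions is made up for this example.

```python
import re

TRIPLE_RE = re.compile(r"(.).*(\1).*(\1)")

def triple_positions(s):
    """Return (char, [i, j, k]) for a char found three times, else None."""
    m = TRIPLE_RE.search(s)
    if m is None:
        return None
    return m.group(1), [m.start(i) for i in (1, 2, 3)]

for case in ("aabbc", "bbbcc", "ddcdc"):
    print(case, triple_positions(case))
# aabbc None
# bbbcc ('b', [0, 1, 2])
# ddcdc ('d', [0, 1, 3])
```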
I think the pattern ([a-z]).*?\1.*?\1 does what you want, although there are likely to be edge cases that would complicate it.
The pattern looks for a lowercase letter three times, with 0 or more characters between them.
You could then extract just the capturing groups to get your match locations.
At the moment, the pattern only looks for any lowercase letter repeated three times, but you could change the initial capturing group - ([a-z]) - if you wanted to capture something else.
You can use the following regex to determine if one character appears at least three times.
^.*(.).*\1.*\1
This does not check that the characters are letters but it does work with any characters. To restrict to letters change each . to [a-z] or [a-zA-Z], as appropriate.
To see if one character appears exactly 3 times, change the regex to:
^(?!.*(.)(?:.*\1){3,}).*(.).*\2.*\2
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
(.) # match a char in cap grp 1
(?:.*\1) # match 0+ chars followed by content of cap grp 1
# in a non-cap grp
{3,} # execute non-cap grp 3+ times
) # end negative lookahead
.* # match 0+ chars
(.) # match char in cap grp 2
.* # match 0+ chars
\2 # match content of cap grp 2
.* # match 0+ chars
\2 # match content of cap grp 2
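A quick way to try this pattern in Python. Strictly, the lookahead rejects any string in which some character occurs four or more times; for the 5-letter strings in the question that is equivalent to "exactly three". The helper name char_seen_three_times is an assumption for this sketch.

```python
import re

EXACTLY_THREE_RE = re.compile(r"^(?!.*(.)(?:.*\1){3,}).*(.).*\2.*\2")

def char_seen_three_times(s):
    """Return the char that occurs three times, or None if the pattern fails."""
    m = EXACTLY_THREE_RE.search(s)
    # Group 1 lives inside the negative lookahead, so the result is group 2.
    return m.group(2) if m else None

for case in ("aabbc", "bbbcc", "ddcdc", "bbbbc"):
    print(case, char_seen_three_times(case))
```

Here "bbbcc" and "ddcdc" yield 'b' and 'd', while "aabbc" (no triple) and "bbbbc" (four occurrences) yield None.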

Regex - How do i find this specific slice of string inside a bigger whole string

Following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I have a follow-up, since the rule has changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out in https://regexr.com/, I found the result instead :
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
You might ask: "Why not just return the first matched occurrence?"
Consider the case where the value in the first "bar section" is empty: the regex would then return the value of the next bar section.
Example :
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. It should return nothing instead (no match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
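Building on the split idea, a sketch that turns the records into a dictionary so a single field can be looked up by key. It assumes every record has the shape text|key|value, as in the question's sample; the shortened sample string and the name parse_fields are illustrative.

```python
def parse_fields(raw):
    """Map each record's key to its value; records are separated by |#|."""
    fields = {}
    for record in raw.split("|#|"):
        parts = record.split("|")
        if len(parts) == 3:  # expected shape: text|key|value
            _, key, value = parts
            fields[key] = value
    return fields

sample = (
    "text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|"
    "text|p1_2_1120170AS074192161A0Z20|Huawei|#|"
    "text|p1_8_1120170AS074192161A0Z20||#|"
    "text|req_qty|2"
)

fields = parse_fields(sample)
print(fields["p1_1_1120170AS074192161A0Z20"])        # C M E - Rectifier
print(repr(fields["p1_8_1120170AS074192161A0Z20"]))  # ''
```

An empty value stays an empty string for its own key, so it never picks up the value of the next record.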
import re
doc = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall(r'[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (match the text but do not include in result)
About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts what is on the left is p1_1_ followed by any char except a newline using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic, because the p1_1_ assertion is true for every value that comes after it, and p1_1_ precedes all the |#| parts.
Note that using the quantifier in the lookbehind will require the pypi regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match ||
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string
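As a simpler variation on the capture-group idea (not the pattern above), here is a sketch that works with the standard re module. It assumes the key after p1_1_ never contains a | character; the pattern FIRST_VALUE_RE and the shortened sample strings are assumptions for illustration.

```python
import re

# p1_1_<rest of key>|<value>, where the value must be non-empty.
FIRST_VALUE_RE = re.compile(r"p1_1_[^|]*\|([^|]+)(?=\|#\||$)")

filled = ("text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|"
          "text|p1_2_1120170AS074192161A0Z20|Huawei")
empty = ("text|p1_1_1120170AS074192161A0Z20||#|"
         "text|p1_2_1120170AS074192161A0Z20|Huawei")

m_filled = FIRST_VALUE_RE.search(filled)
m_empty = FIRST_VALUE_RE.search(empty)

print(m_filled.group(1))  # C M E - Rectifier
print(m_empty)            # None
```

Because [^|]+ needs at least one non-pipe character right after the key's closing |, an empty first value produces no match instead of falling through to the next record.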

What does "?:" mean in a Python regular expression?

Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
(?:...) is a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or one of the hexadecimal letters [A-Fa-f], in a pair {2}, followed by a non-capturing grouping where the matched material is a colon or dash (: or -), followed by another pair of hex digits, with the whole group repeated exactly 5 times.
import re
p = re.compile(r"([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})")
m = p.match("00:07:32:12:ac:de:ef")
if m:
    print(m.group(1))
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant and if omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, then you change the 5 into a 6. If you need to reject strings that have extra characters before or after the pairs, then you'd put anchors into the regex:
p = re.compile(r"^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$")
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
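The effect of the anchors and of the repeat count can be checked with a short sketch. The variable names are illustrative; the sample string is the seven-pair address from the question.

```python
import re

sample = "00:07:32:12:ac:de:ef"  # seven pairs of hex digits

unanchored = re.compile(r"[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}")
six_pairs = re.compile(r"^[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}$")
seven_pairs = re.compile(r"^[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){6}$")

# Without anchors, the first six pairs match and the rest is ignored.
print(unanchored.match(sample).group(0))      # 00:07:32:12:ac:de
# With anchors, the whole string must consist of exactly that many pairs.
print(six_pairs.match(sample))                # None
print(seven_pairs.match(sample) is not None)  # True
```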
Using ?: as in (?:...) makes the group non-capturing, which matters when you refer to groups, for example during a replace. During a plain find it makes no difference to what is matched.
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
    print(i)
    print(i.group(1)) # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
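To see the difference side by side, here is a small sketch comparing a capturing and a non-capturing version of the same pattern on the 'John Wick' text used above. The variable names are illustrative.

```python
import re

text = "John Wick"

capturing = re.search(r"John(\sWick)", text)
non_capturing = re.search(r"John(?:\sWick)", text)

# Both patterns match the same span of text.
print(capturing.group(0), "|", non_capturing.group(0))

# Only the capturing version exposes the sub-match as a group.
print(capturing.groups())      # (' Wick',)
print(non_capturing.groups())  # ()

# Non-capturing groups are skipped when groups are numbered.
m = re.search(r"(?:John)\s(Wick)", text)
print(m.group(1))  # Wick
```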
(?:...) means a non-capturing group. The group will not be captured.
