How to capture a group only if it occurs twice in a line - python

import re
text = """
Tumble Trouble Twwixt Two Towns!
Was the Moon soon in the Sea
Or soon in the sky?
Nobody really knows YET.
"""
How should I make the match happen only when the occurrence is found twice in a line?
I need a regular expression that highlights two adjacent 'o's only if another pair of adjacent 'o's appears later in the same line.

You can match one or more word characters followed by a backreference to them, and wrap that in an outer group.
Because the groups are nested, the outer group is group 1 and the inner word characters are group 2.
Then you can assert, using a positive lookahead, that the value of group 1 occurs again later in the line.
((\w+)\2)(?=.*?\1)
The pattern matches:
( Capture group 1
(\w+)\2 Match 1+ word chars in capture group 2 followed by a backreference to group 2 to match the same again
) Close group 1
(?=.*?\1) Positive lookahead to assert the captured value of group 1 in the line
See a regex demo and a Python demo.
Example
print(re.compile(r"((\w+)\2)(?=.*?\1)").sub(r'{\g<1>}', text.rstrip()))
Output
Tumble Trouble Twwixt Two Towns!
Was the M{oo}n soon in the Sea
Or soon in the sky?
Nobody really knows YET.
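To list the matches instead of substituting them, a minimal sketch could walk the text line by line. The helper name doubled_twice is hypothetical and not part of the original answer; the pattern and the sample text are the ones used above.

```python
import re

text = """
Tumble Trouble Twwixt Two Towns!
Was the Moon soon in the Sea
Or soon in the sky?
Nobody really knows YET.
"""

pattern = re.compile(r"((\w+)\2)(?=.*?\1)")

def doubled_twice(text):
    """Return (line, doubled chunk) pairs; hypothetical helper for illustration."""
    hits = []
    for line in text.splitlines():
        for m in pattern.finditer(line):
            # group 1 is the doubled chunk that occurs again later in the line
            hits.append((line, m.group(1)))
    return hits

print(doubled_twice(text))
```

Only the first "oo" of a line with two of them is reported, because the lookahead requires a later occurrence.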

Related

Regex of sequences surrounded by specific pattern with overlapping problem

I am very new to using Python's re.finditer to find regex patterns, but I am trying to find a complex pattern: the g-quadruplex motif, described below.
The sequence starts with at least 3 g's, followed by a/t/g/c multiple times until the next group of ggg+ shows up. This repeats 3 times, resulting in a pattern like ggg+...ggg+...ggg+...ggg+
Case should be ignored, and matches can overlap in the original sequence, so ggg+...ggg+...ggg+...ggg+...ggg+...ggg+ should return 3 such patterns.
I have struggled with this for some time and can only find a way like:
re.finditer(r"(?=(([Gg]{3,})([AaTtCcGg]+?(?=[Gg]{3,})[Gg]{3,}){3}))", seq)
and then filter out the ones that do not share a start position with a match of
re.finditer(r"([Gg]{3,})", seq)
Is there any better way to extract this type of sequence? And no for loops please since I have millions of rows like this.
Thank you very much!
PS: an example can be like this
ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC
1.ggggggcgggggggACGCTCggctcAAGGGCTCCGGG
2.gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg
3.GGGCTCCGGGCCCCgggggggACgcgcgAAGGG

You could start the match, asserting that what is directly to the left is not a g char to prevent matching on too many positions.
To match both upper and lowercase chars, you can make the pattern case insensitive using re.I
The value is in capture group 1, which will be returned by re.findall.
(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))
(?<!g) Negative lookbehind, assert not g directly to the left
(?= Positive lookahead
( Capture group 1
g{3,} Match 3 or more g chars to start with
(?: Non capture group
[atc](?:g{0,2}[atc])* Match one of a, t, c, then optionally repeat 0, 1 or 2 g chars followed by one of a, t, c, so that a run of ggg is never crossed
g{3,} Match 3 or more g chars to end with
){3} Close non capture group and repeat 3 times
) Close group 1
) Close lookahead
Regex demo | Python demo
import re
pattern = r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))"
s = ("ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC \n")
print(re.findall(pattern, s, re.I))
Output
[
'ggggggcgggggggACGCTCggctcAAGGGCTCCGGG',
'gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg',
'GGGCTCCGGGCCCCgggggggACgcgcgAAGGG'
]
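If the positions of the motifs are needed as well, a sketch along these lines could read the span of capture group 1 from each zero-width match. The variable names motif and hits are illustrative assumptions; the pattern and sequence are the ones from the answer.

```python
import re

# Same pattern as above, compiled case-insensitively.
motif = re.compile(r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))", re.I)

seq = "ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC"

# The overall match is empty (it is only a lookahead), so the useful
# information lives in group 1: its start, its end and its text.
hits = [(m.start(1), m.end(1), m.group(1)) for m in motif.finditer(seq)]

for start, end, text in hits:
    print(start, end, text)
```

Because each match is zero-width, the scan moves on one character at a time, which is what allows the overlapping motifs to be reported.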

import regex as re
seq = 'ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC'
for match in re.finditer(r'(?<=(^|[^g]))g{3,}([atc](g{,2}[atc]+)*g{3,}){3}', seq, overlapped=True, flags=re.I):
    print(match[0])
overlapped=True works by restarting the search from the character after the start of the current match rather than from the end of it. On its own this would give you a bunch of essentially duplicate results, each just dropping one more leading G. To stop that, check the character before the match with a lookbehind:
(?<=(^|[^g]))
The middle section needs to be a bit more complicated: it must require an A, T or C, which prevents a run of seven G's from being split up and counted as more than one G group. So require one, then allow any number of runs of fewer than three G's followed by more ATCs, repeating as needed:
[atc](g{,2}[atc]+)*
The rest is just the Gs and the repeating.

Python: find a string between 2 strings in text

I have a text like this
s = """
...
(1) Literature
1. a.
2. b.
3. c.
...
"""
I want to cut out the Literature section, but I have a problem detecting it.
I use this:
re.search("(1) Literature\n\n(.*).\n\n", s).group(1)
but search returns None.
The desired output is
(1) Literature
1. a.
2. b.
3. c.
What did I do wrong?

You could match (1) Literature and 2 newlines, and then capture all lines that start with digits followed by a dot.
\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)
The pattern matches:
\(1\) Literature\n\n Match (1) Literature and 2 newlines
( Capture group 1
(?: Non capture group
\d+\..*(?:\n|$) Match 1+ digits and a dot, then the rest of the line, followed by either a newline or the end of the string
)+ Close non capture group and repeat it 1 or more times to match all the lines
) Close group 1
Regex demo
Another option is to capture all following lines that do not start with ( digits ) using a negative lookahead, and then trim the leading and trailing whitespaces.
\(1\) Literature((?:\n(?!\(\d+\)).*)*)
Regex demo
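As a rough sketch, both patterns can be tried on a small sample. The sample string below is an assumption: it supposes a blank line after the heading and a following "(2) Music" section, neither of which is shown in the question.

```python
import re

# Hypothetical sample text; blank lines and the "(2) Music" section are assumed.
s = "(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Music\n\n1. d.\n"

# Option 1: capture only the numbered lines after the heading.
numbered = re.search(r"\(1\) Literature\n\n((?:\d+\..*(?:\n|$))+)", s)

# Option 2: capture every following line that does not start with "(digits)".
until_next = re.search(r"\(1\) Literature((?:\n(?!\(\d+\)).*)*)", s)

first = numbered.group(1).strip()
second = until_next.group(1).strip()

print(first)
print(first == second)
```

On this sample both options give the same three lines once the surrounding whitespace is stripped.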

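Parentheses have a special meaning in regex. They are used to group matches.
(1) - Capture 1 as the first capturing group.
Since the parentheses in the string are literal characters, they must be escaped; unescaped, the pattern looks for 1 Literature and the match fails. Also, .* stops at the end of a line, because . does not match a newline by default.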
Check Demo
Based on your regex, I assumed you wanted to capture the line with the word Literature, 5 lines below it. Here is a regex to do so.
\(1\) Literature(.*\n){5}
Regex Demo
Note the escape characters used on the parentheses around 1.
EDIT
Based on zr0gravity7's comment, I came up with this regex to capture the middle section on the string.
\(1\)\sLiterature\n+((.*\n){3})
This regex will capture the below string in capturing group 1.
1. a.
2. b.
3. c.
Regex Demo

You may use this regex with a capture group:
r'\(1\)\s+Literature\s+((?:.+\n)+)'
RegEx Demo
Explanation:
\(1\): Match (1) text
\s+: Match 1+ whitespaces
Literature: Match Literature text
\s+: Match 1+ whitespaces
(: Start capture group #1
(?:.+\n)+: Match a line with 1+ character followed by newline. Repeat this 1 or more times to allow it to match multiple such lines
): End capture group #1

Regex for capturing the generic question with that structure:
\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)
It will capture the title "Literature", then the choices in another group (for a total of 2 groups).
A repeated capture group only keeps its last iteration, so in order to get each of your "1. a." lines in a separate group you would have to match the second group from above again, with this pattern:
(\d+\.\s+.+)\n and then match globally to get all groups.
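The two-step idea can be sketched as below. The sample text (including its blank lines and the "(2) Music" section) and the helper name parse_sections are assumptions made for illustration; the second step matches (\d+\.\s+.+)\n globally with re.findall.

```python
import re

# Hypothetical sample text; not taken from the question.
s = "(1) Literature\n\n1. a.\n2. b.\n3. c.\n\n(2) Music\n\n1. d.\n"

# Step 1: one match per section, with the title and the block of choices.
section_rx = re.compile(r"\(\d+\)\s+(\w+)\s+((?:\d+\.\s.+\n)+)")
# Step 2: one match per numbered line inside a block.
item_rx = re.compile(r"(\d+\.\s+.+)\n")

def parse_sections(text):
    """Map each section title to its list of numbered lines."""
    sections = {}
    for m in section_rx.finditer(text):
        title, body = m.group(1), m.group(2)
        sections[title] = item_rx.findall(body)
    return sections

print(parse_sections(s))
```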

regex to find a pair of adjacent digits with different digits around them

I'm a beginner to regex and I am trying to make an expression to find if there are two of the same digits next to each other, and the digit behind and in front of the pair is different.
For example,
123456678 should match as there is a double 6,
1234566678 should not match as there is no double with different surrounding numbers.
12334566 should match because there are two 3s.
So far I have this, which works only with 1, and only as long as the double is not at the start or end of the string; however, I can deal with that by adding a letter at the start and end.
^.*([^1]11[^1]).*$
I know I can use [0-9] instead of the 1s, but the problem is making them all be the same digit.
Thank you!

I have divided my answer into four sections.
The first section contains my solution to the problem. Readers interested in nothing else may skip the other sections.
The remaining three sections are concerned with identifying the pairs of equal digits that are preceded by a different digit and are followed by a different digit. The first of the three sections matches them; the other two capture them in a group.
I've included the last section because I wanted to share The Greatest Regex Trick Ever with those unfamiliar with it, because I find it so very cool and clever, yet simple. It is documented here. Be forewarned that, to build suspense, the author at that link has included a lengthy preamble before the drum-roll reveal.
Determine if a string contains two consecutive equal digits that are preceded by a different digit and are followed by a different digit
You can test the string as follows:
import re
r = r'(\d)(?!\1)(\d)\2(?!\2)\d'
arr = ["123456678", "1123455a666788"]
for s in arr:
    print(s, bool(re.search(r, s)))
displays
123456678 True
1123455a666788 False
The regex engine performs the following operations.
(\d) : match a digit and save to capture group 1 (preceding digit)
(?!\1) : next character cannot equal content of capture group 1
(\d) : match a digit in capture group 2 (first digit of pair)
\2 : match content of capture group 2 (second digit of pair)
(?!\2) : next character cannot equal content of capture group 2
\d : match a digit
(?!\1) and (?!\2) are negative lookaheads.
Use Python's regex module to match pairs of consecutive digits that have the desired property
You can use the following regular expression with Python’s regex module to obtain the matching pairs of digits.
r'(\d)(?!\1)\K(\d)\2(?=\d)(?!\2)'
The regex engine performs the following operations.
(\d) : match a digit and save to capture group 1 (preceding digit)
(?!\1) : next character cannot equal content of capture group 1
\K : forget everything matched so far and reset start of match
(\d) : match a digit in capture group 2 (first digit of pair)
\2 : match content of capture group 2 (second digit of pair)
(?=\d) : next character must be a digit
(?!\2) : next character cannot equal content of capture group 2
(?=\d) is a positive lookahead. (?=\d)(?!\2) could be replaced with (?!\2|$|\D).
Save pairs of consecutive digits that have the desired property to a capture group
Another way to obtain the matching pairs of digits, which does not require the regex module, is to extract the contents of capture group 2 from matches of the following regular expression.
r'(\d)(?!\1)((\d)\3)(?!\3)(?=\d)'
The following operations are performed.
(\d) : match a digit in capture group 1
(?!\1) : next character does not equal last character
( : begin capture group 2
(\d) : match a digit in capture group 3
\3 : match the content of capture group 3
) : end capture group 2
(?!\3) : next character does not equal last character
(?=\d) : next character is a digit
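A short sketch of extracting group 2 with the standard re module follows. The helper name isolated_pairs is an assumption, and the test strings are the ones from the question.

```python
import re

pair_rx = re.compile(r"(\d)(?!\1)((\d)\3)(?!\3)(?=\d)")

def isolated_pairs(s):
    """Return the content of capture group 2 for every match in s."""
    return [m.group(2) for m in pair_rx.finditer(s)]

for s in ["123456678", "12334566", "1234566678"]:
    print(s, isolated_pairs(s))
```

Note that the digit preceding the pair is consumed by the match, so this sketch is only meant to show how the group is read, not to be an exhaustive pair finder.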
Use The Greatest Regex Trick Ever to identify pairs of consecutive digits that have the desired property
We use the following regular expression to match the string.
r'(\d)(?=\1)|\d(?=(\d)(?!\2))|\d(?=\d(\d)\3)|\d(?=(\d{2})\d)'
When there is a match, we pay no attention to which character was matched, but examine the content of capture group 4 ((\d{2})), as I will explain below.
The Trick in action
The first three components of the alternation correspond to the ways that a string of four digits can fail to have the property that the first and second digits are unequal, the second and third are equal, and the third and fourth are unequal. They are:
(\d)(?=\1) : assert first and second digits are equal
\d(?=(\d)(?!\2)) : assert second and third digits are not equal
\d(?=\d(\d)\3) : assert third and fourth digits are equal
It follows that if there is a match of a digit and the first three parts of the alternation fail the last part (\d(?=(\d{2})\d)) must succeed, and the capture group it contains (#4) must contain the two equal digits that have the required properties. (The final \d is needed to assert that the pair of digits of interest is followed by a digit.)
If there is a match how do we determine if the last part of the alternation is the one that is matched?
When this regex matches a digit we have no interest in what digit that was. Instead, we look to capture group 4 ((\d{2})). If that group is empty we conclude that one of the first three components of the alternation matched the digit, meaning that the two digits following the matched digit do not have the properties that they are equal and are unequal to the digits that precede and follow them.
If, however, capture group 4 is not empty, it means that none of the first three parts of the alternation matched the digit, so the last part of the alternation must have matched and the two digits following the matched digit, which are held in capture group 4, have the desired properties.
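In Python the check on capture group 4 could look like the sketch below. The helper name trick_pairs is hypothetical; with re, a group that did not take part in the match is reported as None rather than as an empty string.

```python
import re

trick_rx = re.compile(
    r"(\d)(?=\1)|\d(?=(\d)(?!\2))|\d(?=\d(\d)\3)|\d(?=(\d{2})\d)"
)

def trick_pairs(s):
    """Collect group 4 from every match in which the last alternative won."""
    return [m.group(4) for m in trick_rx.finditer(s) if m.group(4) is not None]

for s in ["123456678", "12334566", "1234566678"]:
    print(s, trick_pairs(s))
```

Each match consumes a single digit, so every position in the string gets examined; the matched digit itself is ignored.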

With regex, it is much more convenient to use a PyPi regex module with the (*SKIP)(*FAIL) based pattern:
import regex
rx = r'(\d)\1{2,}(*SKIP)(*F)|(\d)\2'
l = ["123456678", "1234566678"]
for s in l:
    print(s, bool(regex.search(rx, s)))
See the Python demo. Output:
123456678 True
1234566678 False
Regex details
(\d)\1{2,}(*SKIP)(*F) - a digit and then two or more occurrences of the same digit
| - or
(\d)\2 - a digit and then the same digit.
The point is to match all chunks of identical 3 or more digits and skip them, and then match a chunk of two identical digits.
See the regex demo.

Inspired by the answer of Wiktor Stribiżew, another variation, using an alternation with re, is to check for the existence of the capture group that holds a positive match for 2 of the same digits not surrounded by the same digit.
In this case, check for group 3.
((\d)\2{2,})|\d(\d)\3(?!\3)\d
Regex demo | Python demo
( Capture group 1
(\d)\2{2,} Capture group 2, match 1 digit and repeat that same digit 2+ times
) Close group
| Or
\d(\d) Match a digit, capture a digit in group 3
\3(?!\3)\d Match the same digit as in group 3. Match the 4th digit, but it should not be the same as the group 3 digit
For example
import re
pattern = r"((\d)\2{2,})|\d(\d)\3(?!\3)\d"
strings = ["123456678", "12334566", "12345654554888", "1221", "1234566678", "1222", "2221", "66", "122", "221", "111"]
for s in strings:
    match = re.search(pattern, s)
    if match and match.group(3):
        print("Match: " + match.string)
    else:
        print("No match: " + s)
Output
Match: 123456678
Match: 12334566
Match: 12345654554888
Match: 1221
No match: 1234566678
No match: 1222
No match: 2221
No match: 66
No match: 122
No match: 221
No match: 111
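Note that re.search only inspects the leftmost match. If a run of three or more equal digits can come before a valid pair, a variant that looks at every match may be needed. The sketch below assumes that is what you want; has_isolated_pair is a hypothetical helper name and "66612234" is an extra test string not taken from the question.

```python
import re

pattern = re.compile(r"((\d)\2{2,})|\d(\d)\3(?!\3)\d")

def has_isolated_pair(s):
    """True if any match, not only the leftmost one, filled group 3."""
    return any(m.group(3) for m in pattern.finditer(s))

for s in ["123456678", "1234566678", "66612234"]:
    print(s, has_isolated_pair(s))
```

For "66612234" the leftmost match is the run 666, which leaves group 3 unset, while a later match captures the pair of 2s.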
If strings of only 2 or 3 digits (for example 66 or 122) should also match, you could check for group 2 instead
(\d)\1{2,}|(\d)\2
Python demo

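You can also use a simpler approach.
import re
l = ["123456678",
     "1234566678",
     "12334566 "]
for i in l:
    # each match is a tuple: (whole run of a repeated character, the character)
    matches = re.findall(r"((.)\2+)", i)
    # the string fails if any run is longer than two characters
    if any(len(x[0]) != 2 for x in matches):
        print("{}-->{}".format(i, False))
    else:
        print("{}-->{}".format(i, True))
You can customize this based on your rules.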
Output:
123456678-->True
1234566678-->False
12334566 -->True

Capturing repeated pattern in Python

I'm trying to implement some kind of markdown like behavior for a Python log formatter.
Let's take this string as example:
**This is a warning**: Virus manager __failed__
A few regexes later the string has lost the markdown like syntax and been turned into bash code:
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
But that should be compressed to
\033[33;1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m
I tried these, beside many other non working solutions:
(\\033\[([\d]+)m){2,} => Capture: \033[33m\033[1m with g1 '\033[1m' and g2 '1' and \033[0m\033[0m with g1 '\033[0m' and g2 '0'
(\\033\[([\d]+)m)+ many results, not ok
(?:(\\033\[([\d]+)m)+) many results, although this is the recommended way for repeated patterns if I understood correctly, not ok
and others..
My goal is to have as results:
Input
\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
Output
Match 1
\033[33m\033[1m
Group1: 33
Group2: 1
Match 2
\033[0m\033[0m
Group1: 0
Group2: 0
In other words, capture the ones that are "duplicated" and not the ones alone, so I can fuse them with a regex sub.

You want to match consecutively repeating \033[\d+m chunks of text and join the numbers after [ with a semicolon.
You may use
re.sub(r'(?:\\033\[\d+m){2,}', lambda m: r'\033['+";".join(dict.fromkeys(re.findall(r"\[(\d+)", m.group())))+'m', text)
See the Python demo online
The (?:\\033\[\d+m){2,} pattern matches two or more consecutive chunks made of \033[ + one or more digits + m. Each match is passed to the lambda expression, which builds the output from: 1) \033[, 2) all the numbers after [, extracted with re.findall(r"\[(\d+)", m.group()), deduplicated with dict.fromkeys (which, unlike set, keeps them in their original order) and joined with ;, and then 3) m.
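A self-contained sketch of this substitution is shown below. It assumes the text holds the escape sequences as literal backslash characters (hence the raw string), and it deduplicates with dict.fromkeys so that the codes keep their original order, which a plain set does not guarantee. The helper name merge_codes is hypothetical.

```python
import re

# Raw string: the backslashes are literal characters, as in the question.
text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"

def merge_codes(m):
    """Fuse a run of \\033[<n>m chunks into one chunk with ';'-joined codes."""
    codes = re.findall(r"\[(\d+)", m.group())
    unique = dict.fromkeys(codes)  # drops duplicates, keeps first-seen order
    return r"\033[" + ";".join(unique) + "m"

merged = re.sub(r"(?:\\033\[\d+m){2,}", merge_codes, text)
print(merged)
```

Single chunks such as \033[4m are left alone because the quantifier requires at least two in a row.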

The patterns in the string to be modified have not been made clear from the question. For example, is 033 fixed or might it be 025 or even 25? I've made certain assumptions in using the regex
r'^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)'
to obtain two capture groups that are to be combined, separated by a semi-colon. I've attempted to make clear my assumptions below, in part to help the OP modify this regex to satisfy alternative requirements.
Demo
The regex performs the following operations:
^ # match beginning of line
( # begin cap grp 1
\\0 # match '\0'
(\d+) # match 1+ digits in cap grp 2
\[ # match '['
\2 # match contents of cap grp 2
) # end cap grp 1
[a-z] # match a lc letter
\\0 # match '\0'
\2 # match contents of cap grp 2
\[ # match '['
(\d[a-z].+) # match a digit, then lc letter then 1+ chars to the
# end of the line in cap grp 3
As you see, the portion of the string captured in group 1 is
\033[33
I've assumed that the part of this string that is now 033 must be two or more digits beginning with a zero, and the second appearance of a string of digits consists of the same digits after the zero. This is done by capturing the digits following '0' (33) in capture group 2 and then using a back-reference \2.
The next part of the string is to be replaced and therefore is not captured:
m\033[
I've assumed that m must be one lower case letter (or should it be a literal m?), that the backslash and zero are required, and that the following digits must again match the content of capture group 2.
The remainder of the string,
1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m
is captured in capture group 3. Here I've assumed it begins with one digit (perhaps it should be \d+) followed by one lower case letter that needn't be the same as the lower case letter matched earlier (though that could be enforced with another capture group). At that point I match the remainder of the line with .+, having given up matching patterns in that part of the string.
One may alternatively have just two capture groups, the capture group that is now #2, becoming #1, and #2 being the part of the string that is to be replaced with a semicolon.
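Under the assumptions stated above, combining the two capture groups might look like this sketch. The raw string and the variable names are illustrative; only the regex comes from this answer.

```python
import re

# Raw string: the backslashes are literal characters, as in the question.
text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"

rx = re.compile(r"^(\\0(\d+)\[\2)[a-z]\\0\2\[(\d[a-z].+)")

# Join group 1 and group 3 with a semicolon, dropping the part in between.
merged = rx.sub(lambda m: m.group(1) + ";" + m.group(3), text)
print(merged)
```

Since the regex is anchored at the start of the line, only the leading pair is fused; the trailing \033[0m\033[0m is left as it is.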

This is pretty straightforward for the cases you describe here; simply write out from left to right what you want to match and capture. Repeating capturing blocks won't help you here, because only the most recently captured values would be returned as a result.
\\033\[(\d+)m\\033\[(\d+)m
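A possible way to use the two captured values with re.sub is sketched below. Collapsing two identical codes into one (so that 0 and 0 become a single 0) is an added assumption, made to reproduce the output requested in the question; the helper name fuse is hypothetical.

```python
import re

# Raw string: the backslashes are literal characters, as in the question.
text = r"\033[33m\033[1mThis is a warning\033[0m: Virus manager \033[4mfailed\033[0m\033[0m"

pair_rx = re.compile(r"\\033\[(\d+)m\\033\[(\d+)m")

def fuse(m):
    first, second = m.group(1), m.group(2)
    # Assumption: identical codes are emitted once instead of as "0;0".
    codes = [first] if first == second else [first, second]
    return r"\033[" + ";".join(codes) + "m"

fused = pair_rx.sub(fuse, text)
print(fused)
```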

Regex (Python) - Match words with two or more distinct vowels

I'm attempting to match words in a string that contain two or more distinct vowels. The question can be restricted to lowercase.
string = 'pool pound polio papa pick pair'
Expected result:
pound, polio, pair
pool and papa would fail because they contain only one distinct vowel. However, polio is fine, because even though it contains two os, it contains two distinct vowels (i and o). (mississippi would fail, but albuquerque would pass.)
Thought process: Using a lookaround, perhaps five times (ignore uppercase), wrapped in a parenthesis, with a {2} afterward. Something like:
re.findall(r'\w*((?=a{1})|(?=e{1})|(?=i{1})|(?=o{1})|(?=u{1})){2}\w*', string)
However, this matches on all six words.
I killed the {1}s, which makes it prettier (the {1}s seem to be unnecessary), but it still returns all six:
re.findall(r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*', string)
Thanks in advance for any assistance. I checked other queries, including "How to find words with two vowels", but none seemed close enough. Also, I'm looking for pure RegEx.

You don't need 5 separate lookaheads, that's complete overkill. Just capture the first vowel in a capture group, and then use a negative lookahead to assert that it's different from the second vowel:
[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*
See the online demo.
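One thing to watch in Python: because the pattern contains a capture group, re.findall would return only the captured vowel of each match. A sketch that returns the whole words instead, assuming lowercase input as in the question (the helper name distinct_vowel_words is hypothetical):

```python
import re

vowel_rx = re.compile(r"[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*")

def distinct_vowel_words(s):
    """Return the full text of every match rather than capture group 1."""
    return [m.group(0) for m in vowel_rx.finditer(s)]

string = "pool pound polio papa pick pair"
print(distinct_vowel_words(string))
```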

Your \w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w* regex matches all words that contain at least one vowel. \w* matches 0+ word chars, so the first \w* grabs the whole chunk of letters, digits and underscores. Then backtracking begins: the regex engine tries to find a location that is followed by a, e, i, o, or u. Once it finds that location, the word chars that were given back are grabbed and consumed again by the trailing \w*.
To match whole words with at least 2 different vowels, you may use
\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+
See the regex demo.
Details
\b - word boundary
(?=\w*([aeiou])\w*(?!\1)[aeiou]) - a positive lookahead that, immediately to the right of the current location, requires
\w* - 0+ word chars
([aeiou]) - Capturing group 1 (its value is referenced to with \1 backreference later in the pattern): any vowel
\w* - 0+ word chars
(?!\1)[aeiou] - any vowel from the [aeiou] set that is not equal to the vowel stored in Group 1 (due to the negative lookahead (?!\1) that fails the match if, immediately to the right of the current location, the lookahead pattern match is found)
\w+ - 1 or more word chars.
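A brief usage sketch with re.finditer, which returns the whole word rather than the captured vowel (the helper name words_with_two_vowels is an assumption):

```python
import re

word_rx = re.compile(r"\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+")

def words_with_two_vowels(s):
    """Return every whole word that passes the lookahead check."""
    return [m.group(0) for m in word_rx.finditer(s)]

print(words_with_two_vowels("pool pound polio papa pick pair"))
print(words_with_two_vowels("mississippi albuquerque"))
```

The lookahead does the checking without consuming anything, and \w+ then consumes the word.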

Match words in a string that contain at least two distinct vowels in the least amount of characters (to my knowledge): \w*([aeiou])\w*(?!\1)[aeiou]\w*
Demo: https://regex101.com/r/uRgVVa/1
Explanation:
\w*: matches 0 or more word characters. You don't need to start with a word boundary (\b) because \w does not include spaces, so using \b would be redundant.
([aeiou]): [aeiou] matches any one vowel. It is in parenthesis so we can reference what vowel was matched later. Whatever is inside these first parenthesis is group 1.
\w*: matches 0 or more word characters.
(?!\1): says the following regex cannot be the same as the character selected in group 1. For example, if the vowel matched in group 1 was a, the following regex cannot be a. This is called by \1, which references what character was chosen in group 1 (e.g. if a matched group 1, \1 references a). ?! is a negative lookahead that says the following regex outside the parenthesis cannot match what follows ?!.
\w*: matches 0 or more word characters.
