Regex to match all repeating alphanumerical subpatterns [duplicate] - python

After searching for a while, I could only find how to match specific subpattern repetitions. Is there a way I can find (3 or more) repetitions for any subpattern ?
For example:
re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
→ ['a', 'b', 'x', 'aaabbbxxx_']
re.findall(<the_regex>, 'lalala luuluuluul')
→ ['la', 'luu', 'uul']
I apologize in advance if this is a duplicate and would be grateful to be redirected to the original question.

This lookahead-based regex may not give exactly the output shown in the question, but it gets very close.
r'(?=(.+)\1\1)'
Code:
>>> import re
>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']
RegEx Details:
Since the whole regex is a lookahead, it doesn't consume any characters: a lookahead is a zero-width match. This lets us return overlapping matches from the input.
When the pattern has a single capture group, findall returns only the contents of that group.
(?=: Start lookahead
(.+): Match 1 or more of any character except a newline (greedy) and capture in group #1
\1\1: Match two more occurrences of group #1 using the back-reference \1 twice
): End lookahead
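To get closer to the list in the question, the repeated entries can be dropped while keeping first-seen order. This is only a minimal sketch, assuming Python 3.7+ (ordered dicts); the helper name unique_repeats is made up for illustration and is not part of the answer above.

```python
import re

# Lookahead-only pattern from the answer: a unit followed by two more copies.
TRIPLE = re.compile(r'(?=(.+)\1\1)')

def unique_repeats(text):
    """Return each captured unit once, in order of first appearance."""
    # dict.fromkeys keeps insertion order, so duplicates vanish without reordering.
    return list(dict.fromkeys(TRIPLE.findall(text)))

print(unique_repeats('aaabbbxxx_aaabbbxxx_aaabbbxxx_'))
print(unique_repeats('lalala luuluuluul'))
```

Because (.+) is greedy, only the longest unit is reported at each position, so 'a' first shows up at a later position than 'aaabbbxxx_'.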

re.findall() won't find overlapping matches. But you can find the non-overlapping matches using a capture group followed by a positive lookahead that matches a back-reference to that group.
>>> import re
>>> regex = r'(.+)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'a', 'b', 'x', 'a', 'b', 'x']
>>> re.findall(regex, 'lalala luuluuluul')
['la', 'luu']
This will find the longest matches; if you change (.+) to (.+?) you'll get the shortest matches at each point.
>>> regex = r'(.+?)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['a', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']

It is not possible without defining the subpattern first.
Anyway, if the subpattern is just <any_alphanumeric>, then re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_') would produce something like this:
['a', 'b', 'x', 'aa', 'ab', 'bb', 'bx', 'xx', 'x_', 'aaa', 'aaab', 'aaabb', ....]
i.e., every alphanumeric combination that is repeated three times, so a lot of combinations, not just ['a', 'b', 'x', 'aaabbbxxx_']
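If "repeated" is read as "at least three copies back to back", the candidates can also be enumerated without a regex. The following is a brute-force sketch under that assumption; the function name repeated_subpatterns and the ordering (by length, then by first position) are choices made here for illustration, and the nested loops will be slow on long inputs.

```python
def repeated_subpatterns(text, times=3):
    """Find every substring that occurs `times` times in a row somewhere in text."""
    found = {}  # unit -> index of its first qualifying position
    n = len(text)
    for start in range(n):
        # A unit starting here can be at most (n - start) // times characters long.
        for length in range(1, (n - start) // times + 1):
            unit = text[start:start + length]
            if unit not in found and text.startswith(unit * times, start):
                found[unit] = start
    # Shortest units first, ties broken by where they first repeat.
    return sorted(found, key=lambda u: (len(u), found[u]))

print(repeated_subpatterns('aaabbbxxx_aaabbbxxx_aaabbbxxx_'))
print(repeated_subpatterns('lalala luuluuluul'))
```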

Related

How to extract the value between the key using RegEx?

I have text like:
"abababba"
I want to extract the characters as a list between a.
For the above text, I am expecting output like:
['b', 'b', 'bb']
I have used:
re.split(r'^a(.*?)a$', data)
But it doesn't work.
You could use re.findall to return the capture group values with the pattern:
a([^\sa]+)(?=a)
a Match an a char
([^\sa]+) Capture group 1, match one or more chars other than a or whitespace (drop the \s if spaces should be matched too)
(?=a) Positive lookahead, assert a to the right
import re
pattern = r"a([^\sa]+)(?=a)"
s = "abababba"
print(re.findall(pattern, s))
Output
['b', 'b', 'bb']
You could use a list comprehension to achieve this:
s = "abababba"
l = [x for x in s.split("a") if not x == ""]
print(l)
Output:
['b', 'b', 'bb']
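The ^ and $ anchors match only at the beginning and end of the string (or of each line with re.MULTILINE), so the original pattern can match at most once, and only if it spans the whole string.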
In this case, you will get the desired list by using the line:
re.split(r'a(.*?)a', data)[1:-1]
Why not use a normal split:
"abababba".split("a") --> ['', 'b', 'b', 'bb', '']
And remove the empty parts as needed:
# remove all empties:
[*filter(None,"abababba".split("a"))] --> ['b', 'b', 'bb']
or
# only leading/trailing empties (if any)
"abababba".strip("a").split("a") --> ['b', 'b', 'bb']
or
# only leading/trailing empties (assuming always enclosed in 'a')
"abababba".split("a")[1:-1] --> ['b', 'b', 'bb']
If you must use a regular expression, perhaps findall() will let you use a simpler pattern while covering all edge cases (ignoring all empties):
re.findall(r"[^a]+","abababba") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","abababb") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","bababb") --> ['b', 'b', 'bb']
re.findall(r"[^a]+","babaabb") --> ['b', 'b', 'bb']
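The three ways of dropping empties above agree on "abababba" but diverge once the input is not enclosed in the separator or contains doubled separators. A small sketch to compare them side by side; the function name split_variants and its dictionary keys are made up here.

```python
def split_variants(text, sep="a"):
    """Apply the three clean-up strategies to the same input."""
    parts = text.split(sep)
    return {
        "drop_all_empties": [p for p in parts if p],
        "strip_then_split": text.strip(sep).split(sep),
        "slice_ends": parts[1:-1],
    }

for sample in ("abababba", "babaabb"):
    print(sample, split_variants(sample))
```

On "babaabb", only the first strategy still returns ['b', 'b', 'bb']: stripping keeps the empty string between the doubled "a", and slicing throws away real content at both ends.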

Match all [A-Z] but not duplicates [duplicate]

I need to match all upper-case letters in a string, but not duplicates of the same letter. In Python I've been using:
from re import compile
regex = compile('[A-Z]')
variables = regex.findall('(B or P) and (P or not Q)')
but that will match ['B', 'P', 'P', 'Q'] but I need ['B', 'P', 'Q'].
Thanks in advance!
You can use negative lookahead with a backreference to avoid matching duplicates:
re.findall(r'([A-Z])(?!.*\1.*$)', '(B or P) and (P or not Q)')
This returns:
['B', 'P', 'Q']
And if order matters do:
print(sorted(set(variables),key=variables.index))
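Or if you have the more_itertools package (unique_everseen returns an iterator, so wrap it in list before printing):
from more_itertools import unique_everseen as u
print(list(u(variables)))
Or on Python 3.7+, where dicts preserve insertion order:
print(list(dict.fromkeys(variables)))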
Or OrderedDict:
from collections import OrderedDict
print(list(OrderedDict.fromkeys(variables)))
All reproduce:
['B', 'P', 'Q']
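The regex and the set/dict approaches agree on the sample input, but they do not always return the same order: the negative lookahead keeps the last occurrence of each letter, while dict.fromkeys keeps the first. A short sketch showing the difference; both function names are made up here, and the second input is an invented example.

```python
import re

def unique_by_lookahead(text):
    # A letter is kept only where no later copy exists, i.e. at its LAST occurrence.
    return re.findall(r'([A-Z])(?!.*\1.*$)', text)

def unique_first_seen(text):
    # dict.fromkeys keeps the FIRST occurrence of each letter (Python 3.7+ ordering).
    return list(dict.fromkeys(re.findall(r'[A-Z]', text)))

for expr in ('(B or P) and (P or not Q)', '(P or B) and P'):
    print(expr, unique_by_lookahead(expr), unique_first_seen(expr))
```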

What regex will emulate the default behavior of split() in python?

Using split() I can easily create from a string the list of tokens that are divided by space:
>>> 'this is a test 200/2002'.split()
['this', 'is', 'a', 'test', '200/2002']
How do I do the same using re.compile and re.findall? I need something similar to the following example but without splitting the "200/2002".
>>> test = re.compile(r'\w+')
>>> test.findall('this is a test 200/2002')
['this', 'is', 'a', 'test', '200', '2002']
This should output the desired list:
>>> test = re.compile(r'\S+')
>>> test.findall('this is a test 200/2002')
['this', 'is', 'a', 'test', '200/2002']
\S is anything but a whitespace (space, tab, newline, ...).
From str.split() documentation :
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace. Consequently,
splitting an empty string or a string consisting of just whitespace
with a None separator returns [].
findall() with the above regex should have the same behaviour :
>>> test.findall(" a\nb\tc d ")
['a', 'b', 'c', 'd']
>>> " a\nb\tc d ".split()
['a', 'b', 'c', 'd']
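The equivalence can be spot-checked on the edge cases the documentation mentions (empty string, whitespace-only string). This is only a quick check on a handful of hand-picked inputs, not a proof for every possible character; the names TOKEN and samples are made up here.

```python
import re

TOKEN = re.compile(r'\S+')  # one or more non-whitespace characters

samples = ['this is a test 200/2002', ' a\nb\tc d ', '', '   \t\n']
for s in samples:
    # Print whether findall and the default split() agree on this input.
    print(repr(s), TOKEN.findall(s) == s.split())
```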

Python-Getting contents between current and next occurrence of pattern in a string

I want to implement the following in python
(1) Search for a pattern in a string
(2) Get the content up to the next occurrence of the same pattern in the same string
Repeat (1) and (2) until the end of the string
Searched all available answers but of no use.
Thanks in advance.
As mentioned by Blckknght in the comment, you can achieve this with re.split. re.split retains all empty strings between a) the beginning of the string and the first match, b) the last match and the end of the string and c) between different matches:
>>> re.split('abc', 'abcabcabcabc')
['', '', '', '', '']
>>> re.split('bca', 'abcabcabcabc')
['a', '', '', 'bc']
>>> re.split('c', 'abcabcabcabc')
['ab', 'ab', 'ab', 'ab', '']
>>> re.split('a', 'abcabcabcabc')
['', 'bc', 'bc', 'bc', 'bc']
If you want to retain only c) the strings between 2 matches of the pattern, just slice the resulting array with [1:-1].
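Do note that there are two caveats with this method:
Before Python 3.7, re.split didn't split on an empty-string match; since Python 3.7 it does, so the result depends on the version.
>>> re.split('', 'abcabc')  # older Python, as in the original answer
['abcabc']
>>> re.split('', 'abcabc')  # Python 3.7 and later
['', 'a', 'b', 'c', 'a', 'b', 'c', '']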
Content in capturing groups will be included in the resulting array.
>>> re.split(r'(.)(?!\1)', 'aaaaaakkkkkkbbbbbsssss')
['aaaaa', 'a', 'kkkkk', 'k', 'bbbb', 'b', 'ssss', 's', '']
You have to write your own function with finditer if you need to handle those use cases.
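For the capturing-group caveat alone, a non-capturing group (?:...) is often enough to keep the separators out of the result. A small illustration with an invented input string:

```python
import re

text = 'x1y22z'

# Capturing group: the matched separators are included in the result.
with_group = re.split(r'(\d+)', text)

# Non-capturing group: only the pieces between separators are returned.
without_group = re.split(r'(?:\d+)', text)

print(with_group)
print(without_group)
```

This does not help when the pattern needs a back-reference, as in the (.)(?!\1) example, because a back-reference requires a capturing group.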
This is the variant where only case c) is matched.
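import re

def findbetween(pattern, text):
    out = []
    start = 0
    for m in re.finditer(pattern, text):
        # text between the end of the previous match and the start of this one
        out.append(text[start:m.start()])
        start = m.end()
    # drop the first item: it is the text before the first match (case a)
    return out[1:]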
Sample run:
>>> findbetween('abc', 'abcabcabcabc')
['', '', '']
>>> findbetween(r'', 'abcdef')
['a', 'b', 'c', 'd', 'e', 'f']
>>> findbetween(r'ab', 'abcabcabc')
['c', 'c']
>>> findbetween(r'b', 'abcabcabc')
['ca', 'ca']
>>> findbetween(r'(?<=(.))(?!\1)', 'aaaaaaaaaaaabbbbbbbbbbbbkkkkkkk')
['bbbbbbbbbbbb', 'kkkkkkk']
(In the last example, (?<=(.))(?!\1) matches the empty string at the end of the string, so 'kkkkkkk' is included in the list of results)
You can use something like this
re.findall(r"pattern.*?(?=pattern|$)",test_Str)
Here we search pattern and with lookahead make sure it captures till next pattern or end of string.
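A concrete version of that one-liner, as a sketch: it assumes the pattern is a literal string (hence re.escape) and adds re.DOTALL so that a chunk may span line breaks. The function name chunks_from_each_match and the sample strings are invented for illustration.

```python
import re

def chunks_from_each_match(literal, text):
    """Return each occurrence of `literal` together with everything up to the next one."""
    p = re.escape(literal)  # treat the search string as plain text, not as regex syntax
    # Lazy .*? stops as soon as the lookahead sees the next occurrence or the end.
    return re.findall(rf"{p}.*?(?={p}|$)", text, flags=re.DOTALL)

print(chunks_from_each_match('ab', 'abXYabZab'))
print(chunks_from_each_match('ab', 'zzabXab'))
```

Unlike the re.split approach, each chunk here includes the matched pattern itself, and any text before the first occurrence is skipped.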

How to find double occurrence of a letter in a word [duplicate]

I have a string:
s = 'bubble'
How can I use a regular expression to get a list like:
['b', 'u', 'bb', 'l', 'e']
I want to filter single as well as double occurrence of a letter.
This should do it:
import re
[m.group(0) for m in re.finditer('(.)\\1*',s)]
For 'bubbles' this returns:
['b', 'u', 'bb', 'l', 'e', 's']
For 'bubblesssss' this returns:
['b', 'u', 'bb', 'l', 'e', 'sssss']
You really have two questions. The first question is how to split the list, the second is how to filter.
The splitting takes advantage of back-references in a pattern. In this case we'll construct a pattern that will find one or two occurrences of a letter, then construct a list from the search results. The \1 in the code block refers to the first parenthesized expression.
import re
pattern = re.compile(r'(.)\1?')
s = "bubble"
result = [x.group() for x in pattern.finditer(s)]
print(result)
To filter the list stored in result you could use a list comprehension that filters on length.
filtered_result = [x for x in result if len(x) == 2]
print(filtered_result)
You could just get the set of duplications directly by tweaking the regular expression.
pattern2 = re.compile(r'(.)\1')
result2 = [x.group() for x in pattern2.finditer(s)]
print(result2)
The output from running the above is:
['b', 'u', 'bb', 'l', 'e']
['bb']
['bb']
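The three patterns used in these answers behave differently once a letter repeats more than twice. A short comparison on 'bubblesssss', the input from the first answer; the helper name runs is made up here.

```python
import re

def runs(pattern, text):
    """Collect the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]

s = 'bubblesssss'
any_length = runs(r'(.)\1*', s)  # whole run of a letter, however long
one_or_two = runs(r'(.)\1?', s)  # at most two letters per match
pairs_only = runs(r'(.)\1', s)   # exactly two letters, singles skipped

print(any_length)
print(one_or_two)
print(pairs_only)
```

With (.)\1? the run of five s characters is cut into 'ss', 'ss', 's', so that pattern fits the question only if no letter occurs more than twice in a row.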
