Finding all occurrences of alternating digits using regular expressions - python

I would like to find all alternating digits in a string using regular expressions. An alternating digit is defined as two equal digits having a digit in between; for example, 1212 contains 2 alternations (121 and 212) and 1111 contains 2 alternations as well (111 and 111). I have the following regular expression code:
import re
s = "1212"
re.findall(r'(\d)(?:\d)(\1)+', s)
This works for strings like "121656", but not "1212". I think the problem has to do with overlapping matches. How can I deal with that?

(?=((\d)\d\2))
Use a lookahead to get all overlapping matches. Use re.findall and take the first element of each tuple. See the demo:
https://regex101.com/r/fM9lY3/54
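As a minimal sketch of how this pattern might be used (the helper name alternations is made up for illustration): re.findall returns one tuple per match because the pattern contains two groups, and the first element of each tuple holds the full three-digit alternation.

```python
import re

def alternations(s):
    """Return all (possibly overlapping) digit triples of the form aba."""
    # Group 1 captures the whole triple, group 2 the repeated outer digit.
    pairs = re.findall(r'(?=((\d)\d\2))', s)
    return [triple for triple, _ in pairs]

print(alternations("1212"))  # ['121', '212']
print(alternations("1111"))  # ['111', '111']
```

Because the lookahead itself consumes nothing, the engine retries at every position, which is what lets the second triple start inside the first.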

You can use a lookahead to allow for overlapping matches:
r'(\d)(?=(\d)\1)'
To reconstruct full matches from this:
matches = re.findall(r'(\d)(?=(\d)\1)', s)
[a + b + a for a, b in matches]
Also, to avoid other Unicode digits like ١ from being matched (assuming you don’t want them), you should use [0-9] instead of \d.
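To illustrate both points, here is a small sketch (the function name alternations_ascii is hypothetical). In Python 3, \d in a str pattern also matches digits such as the Arabic-Indic one (U+0661), while [0-9] is limited to ASCII digits.

```python
import re

def alternations_ascii(s):
    """Rebuild overlapping aba triples, accepting ASCII digits only."""
    # Only the first digit is consumed; the rest is checked by the lookahead.
    matches = re.findall(r'([0-9])(?=([0-9])\1)', s)
    return [a + b + a for a, b in matches]

print(alternations_ascii("1212"))  # ['121', '212']

mixed = "\u0661" + "2"  # Arabic-Indic digit one followed by ASCII 2
print(re.findall(r'\d', mixed) == ["\u0661", "2"])  # True
print(re.findall(r'[0-9]', mixed))                  # ['2']
```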

With the regex module you don't have to use a trick to get overlapped matches since there's a flag to obtain them:
import regex
res = [x.group(0) for x in regex.finditer(r'(\d)\d\1', s, overlapped=True)]
If s contains only digits, you can also do this:
res = [s[i-2:i+1] for i in range(2, len(s)) if s[i]==s[i-2]]

A non-regex approach if your string is made up of just digits:
from itertools import islice as isl
s = "121231132124123"
out = [a + b + c for a, b, c in zip(isl(s, 0, None), isl(s, 1, None), isl(s, 2, None)) if a == c]
Output:
['121', '212', '212']
It is actually noticeably faster than a regex approach.
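If you want to check the speed claim on your own machine, a rough timing sketch could look like the following. The function names, the repeated test string, and the repetition count are arbitrary choices for illustration, and the numbers will vary between systems and Python versions.

```python
import re
import timeit
from itertools import islice

def triples_regex(s):
    # Lookahead-based regex: consume one digit, check the next two.
    return [a + b + a for a, b in re.findall(r'(\d)(?=(\d)\1)', s)]

def triples_islice(s):
    # Slide three staggered iterators over the string.
    windows = zip(s, islice(s, 1, None), islice(s, 2, None))
    return [a + b + c for a, b, c in windows if a == c]

s = "121231132124123" * 100

# Both approaches should agree on a digits-only string.
assert triples_regex(s) == triples_islice(s)

for fn in (triples_regex, triples_islice):
    seconds = timeit.timeit(lambda: fn(s), number=200)
    print(fn.__name__, round(seconds, 4))
```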

Related

repeated pattern in regex

I am trying to catch a repeated pattern in my string. The subpattern starts at the beginning of a word or with ":" and ends with ":" or the end of a word. I tried findall and search in combination with a repeated group, ((subpattern)__(subpattern))+, but was not able to figure out what is wrong:
cc = "GT__abc23_1231:TF__XYZ451"
import regex
ma = regex.match("(\b|\:)([a-zA-Z]*)__(.*)(:|\b)", cc)
Expected output:
GT, abc23_1231, TF, XYZ451
I saw a bunch of questions like this, but it did not help.
It seems you can use
(?:[^_:]|(?<!_)_(?!_))+
Pattern details:
(?:[^_:]|(?<!_)_(?!_))+ - 1 or more sequences of:
[^_:] - any character but _ and :
(?<!_)_(?!_) - a single _ not enclosed with other _s
Python demo with re based solution:
import re
p = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
s = "GT__abc23_1231:TF__XYZ451"
print(p.findall(s))
# => ['GT', 'abc23_1231', 'TF', 'XYZ451']
If the first character is never a : or _, you may use an unrolled regex like:
r'[^_:]+(?:_(?!_)[^_:]*)*'
It won't match values that start with a single _ though (so the first regex is safer).
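The difference between the two patterns can be seen in a short sketch. The second input string is made up for illustration and starts with a single underscore.

```python
import re

alternation = re.compile(r'(?:[^_:]|(?<!_)_(?!_))+')
unrolled = re.compile(r'[^_:]+(?:_(?!_)[^_:]*)*')

s = "GT__abc23_1231:TF__XYZ451"
print(alternation.findall(s))  # ['GT', 'abc23_1231', 'TF', 'XYZ451']
print(unrolled.findall(s))     # ['GT', 'abc23_1231', 'TF', 'XYZ451']

# A value that starts with a single underscore (hypothetical input):
t = "_ab__cd"
print(alternation.findall(t))  # ['_ab', 'cd']
print(unrolled.findall(t))     # ['ab', 'cd']
```

On the original input both patterns agree; only the leading underscore case separates them.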
Use the smallest common denominator in "starts and ends with a : or a word-boundary", that is, the word-boundary (your substrings are composed of word characters):
>>> import re
>>> cc = "GT__abc23_1231:TF__XYZ451"
>>> re.findall(r'\b([A-Za-z]+)__(\w+)', cc)
[('GT', 'abc23_1231'), ('TF', 'XYZ451')]
Testing whether there is a : on either side is unnecessary.
(Note: no need to add a \b after \w+, since the quantifier is greedy, the word-boundary becomes implicit.)
[EDIT]
According to your comment: "I want to first split on ":", then split on double underscore.", perhaps you don't need regex at all:
>>> [x.split('__') for x in cc.split(':')]
[['GT', 'abc23_1231'], ['TF', 'XYZ451']]
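If a flat list like the expected output in the question is preferred over nested pairs, the two splits can be combined in one comprehension. This is a small sketch, with the variable name flat chosen here for illustration.

```python
cc = "GT__abc23_1231:TF__XYZ451"

# Split on ':' first, then on the double underscore, flattening as we go.
flat = [part for chunk in cc.split(':') for part in chunk.split('__')]
print(flat)  # ['GT', 'abc23_1231', 'TF', 'XYZ451']
```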

Regular expression to limit multiple occurrence of any character to two

I am looking for a regular expression to limit multiple occurrence of any character in a string to two.
E.g., Reallllly like and Sooooooo good should be converted to Really like and Soo good.
This replaces runs of three or more of the same character with just two:
re.sub(r'(.)\1{2,}', r'\1\1', "Realllllly goooood")
Edit: fixed typo.
I don't know how to do it with a regex, but itertools.groupby works well:
>>> from itertools import groupby
>>> g = groupby('reallllly goood')
>>> ''.join(''.join(list(x)[:2]) for _,x in g)
'really good'
The answer from @pacholik is almost right.
Proper expression:
re.sub(r'(.)\1{2,}', r'\1\1', "Realllllly goood")
We replace runs of 3 or more occurrences, not 4 or more: the first character (.) plus 2 or more repeats \1{2,} is replaced with 2 copies of that character, \1\1.
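The same idea can be generalized to any maximum run length. The following is a sketch only; the function name limit_repeats and its max_run parameter are made up for illustration.

```python
import re

def limit_repeats(text, max_run=2):
    """Shorten any run of one character longer than max_run to max_run."""
    # (.) is the first character of the run; \1{max_run,} matches at least
    # max_run further repeats, so the run is longer than max_run in total.
    pattern = re.compile(r'(.)\1{%d,}' % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)

print(limit_repeats("Realllllly goooood"))  # Really good
print(limit_repeats("Sooooooo good"))       # Soo good
```

Note that . does not match newlines by default, so runs of blank lines are left alone.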

Matching both possible solutions in Regex

I have a string aaab. I want a Python expression to match aa, so I expect the regular expression to return aa and aa since there are two ways to find substrings of aa.
However, this is not what's happening.
This is what I've done:
import re
a = "aaab"
b = re.match('aa', a)
You can achieve it with a look-ahead and a capturing group inside it:
(?=(a{2}))
Since a look-ahead does not consume any characters, the regex engine can examine the same text again from the next position, which enables overlapping matches.
Python code:
import re
p = re.compile(r'(?=(a{2}))')
test_str = "aaab"
print(re.findall(p, test_str))
To generalize @stribizhev's solution to match one or more of the character a: (?=(a{1,}))
For three or more: (?=(a{3,})) etc.
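The lookahead trick can be wrapped in a small helper so it works for other patterns too. This is a sketch under a stated assumption: the helper name find_overlapping is hypothetical, and the pattern passed in should not rely on numbered backreferences, because the extra wrapping group shifts their numbers by one.

```python
import re

def find_overlapping(pattern, text):
    """Return every match of pattern, including overlapping ones."""
    # Wrap the pattern in a capturing group inside a zero-width lookahead.
    wrapped = re.compile('(?=(' + pattern + '))')
    return [m.group(1) for m in wrapped.finditer(text)]

print(find_overlapping(r'a{2}', "aaab"))   # ['aa', 'aa']
print(find_overlapping(r'a{1,}', "aaab"))  # ['aaa', 'aa', 'a']
```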

Python regular expression for substring

All I want is to grab the first 3 numeric characters of string:
st = '123_456'
import re
r = re.match('([0-9]{3})', st)
print(r.groups()[0])
Am I doing the right thing for grabbing first 3 characters?
This returns 123 but what if I want to get the first 3 characters regardless of numbers and alphabets or special characters?
When given 12_345, I want to grab only 12_
Thanks,
If you always need the first three characters of a string, then you can use the following:
first_3_characters = st[:3]
There is no need for a regular expression in your case.
You are really close: just drop the extra set of parentheses and use index zero rather than one, since Python indexing starts at zero. See below.
This works:
import re
mystring = '123_456'
check = re.search('^[0-9]{3}', mystring)
if check:
    print(check.group(0))
The ^ anchors the match to the beginning of the string, which ensures that only the first three numeric digits are matched. If you do not use the caret, the regexp will match the first run of three digits anywhere in the string.
Some may suggest \d, but in Python 3 this matches more than 0-9 (any Unicode decimal digit) unless the re.ASCII flag is used.
As others will surely point out, a simple substring operation will do the trick if all the fields start with the three numeric digits that you want to extract.
Good luck!
If the groups of digits are always separated by _, you can simply use this regular expression, which greedily matches all numeric characters before the first _:
r = re.match('([0-9]*)_', st)
Actually, the _ in this RE is not necessary, so you can simplify it (so that any separator is accepted):
r = re.match(r'(\d*)', st)
But this solution will give you 1234 if st = '1234_56'. I'm not sure whether that is your intention.
So, if you want at most 3 numeric characters, you can modify the regular expression to:
r = re.match(r'(\d{,3})', st)
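To compare the variants discussed in these answers side by side, here is a small sketch. The helper leading_digits and its exact parameter are made up for illustration; it uses [0-9] so that only ASCII digits count.

```python
import re

def leading_digits(st, exact=True):
    """Return the leading ASCII digits of st: exactly 3, or up to 3."""
    pattern = r'[0-9]{3}' if exact else r'[0-9]{0,3}'
    m = re.match(pattern, st)  # re.match only matches at the start
    return m.group(0) if m else None

print(leading_digits('123_456'))              # 123
print(leading_digits('12_345'))               # None (fewer than three digits)
print(leading_digits('12_345', exact=False))  # 12
print('12_345'[:3])                           # 12_ (plain slicing, any characters)
```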

Extracting a number from an unspaced string in Python

I need to extract a number from an unspaced string that has the number in brackets, for example:
"auxiliary[0]"
The only way I can think of is:
def extract_num(s):
    s1 = s.split("[")
    s2 = s1[1].split("]")
    return int(s2[0])
Which seems very clumsy. Does anyone know of a better way to do it? (The number is always in "[ ]" brackets.)
You could use a regular expression (with the built-in re module):
import re
bracketed_number = re.compile(r'\[(\d+)\]')
def extract_num(s):
    return int(bracketed_number.search(s).group(1))
The pattern matches a literal [ character, followed by 1 or more digits (the \d escape signifies the digit character class, + means 1 or more), followed by a literal ]. By putting parentheses around the \d+ part, we create a capturing group, which we can extract by calling .group(1) ("get the first capturing group result").
Result:
>>> extract_num("auxiliary[0]")
0
>>> extract_num("foobar[42]")
42
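The version above raises an AttributeError when the string contains no bracketed number, because search returns None. If that case can occur, a defensive variant might look like this sketch (the name extract_num_or_none is hypothetical):

```python
import re

BRACKETED_NUMBER = re.compile(r'\[(\d+)\]')

def extract_num_or_none(s):
    """Return the first bracketed integer in s, or None if there is none."""
    m = BRACKETED_NUMBER.search(s)
    return int(m.group(1)) if m else None

print(extract_num_or_none("auxiliary[0]"))  # 0
print(extract_num_or_none("no brackets"))   # None
```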
I would use a regular expression to get the number. See docs: http://docs.python.org/2/library/re.html
Something like:
import re
def extract_num(s):
    m = re.search(r'\[(\d+)\]', s)
    return int(m.group(1))
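a = "auxiliary[0]"  # these index tricks assume a single-digit number
print(a[-2])
print(a[a.index(']') - 1])
print(a[a.index('[') + 1])
for number in re.findall(r'\[(\d+)\]', "auxiliary[0]"):
    do_sth(number)  # do_sth is a placeholder for your own processing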
