Regular expression to limit multiple occurrence of any character to two - python

I am looking for a regular expression to limit multiple occurrence of any character in a string to two.
eg: Reallllly like and Sooooooo good should be converted to Really like and So good.

Replaces sequences of three or more same characters by only two.
re.sub(r'(.)\1{2,}', r'\1\1', "Realllllly goooood")

I don't know how to do it with a regex, but itertools.groupby works well:
>>> from itertools import groupby
>>> g = groupby('reallllly goood')
>>> ''.join(''.join(list(x)[:2]) for _,x in g)
'really good'
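The same groupby idea can be wrapped in a small helper. This is only a sketch: the name cap_runs and the max_run parameter are made up for illustration, and islice is used simply to avoid building a list for each group.

```python
from itertools import groupby, islice

def cap_runs(text, max_run=2):
    # groupby yields (char, run) pairs for consecutive equal characters;
    # islice keeps at most max_run characters from each run.
    return ''.join(''.join(islice(run, max_run)) for _, run in groupby(text))

print(cap_runs('reallllly goood'))  # really good
```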

Answer from #pacholik is almost right.
Proper expression:
re.sub(r'(.)\1{2,}', r'\1\1', "Realllllly goood")
We replace runs of 3 or more occurrences, not 4 or more: the first (.) plus two or more repeats \1{2,} are replaced with two copies of the character, \1\1.
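As a rough sketch, the expression can be generalized to any run limit. The function name limit_repeats and the max_run parameter are assumptions introduced here, not part of the answers above. Note that capping runs at two turns "Sooooooo" into "Soo", not the "So" shown in the question.

```python
import re

def limit_repeats(text, max_run=2):
    # (.) captures one character; \1{max_run,} requires at least max_run
    # further copies, so only runs longer than max_run are rewritten.
    pattern = r'(.)\1{%d,}' % max_run
    return re.sub(pattern, lambda m: m.group(1) * max_run, text)

print(limit_repeats("Reallllly like and Sooooooo good"))
# Really like and Soo good
```

Because . does not match a newline by default, runs of blank lines are left untouched by this sketch.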

Related

Regex to remove duplicated characters and combinations

I have a string that consists of words that have duplicated characters at the end of it.
These characters may be in such combinations:
wordxxxx
wordxyxyxy
wordxyzxyzxyz
For example:
string = "Thisssssssss isisisis echooooooo stringggg. Replaceaceaceace repeatedededed groupssss of symbolssss"
I've found a way to replace some of the repeated combinations, this way:
re.sub(r'([a-z]{1,3})\1+', r'\1', string)
I'm getting these results:
Thisss is echoooo stringg. Replace repeated groupss of symbolss
How should I change the regex to remove ALL the repeated characters and their combinations?
Your regex is almost correct.
You need to add ? after the quantifier inside the capturing group, so it matches as little as it can ("lazy matching" rather than the default "greedy" behavior that matches as much as possible).
I also used + instead of {1,3} because limiting the repetition to 3 seemed arbitrary.
You can observe the difference between the two behaviors: greedy vs lazy.
Note that:
The greedy behavior sees aaaa as aa * 2 rather than a * 4
The greedy behavior only works for even-length repetitions. aaaaa is seen as aa * 2 + a, so the replacement result would be aaa instead of a.
import re
for word in "Thisssssssss isisisis echooooooo stringggg. Replaceaceaceace repeatedededed groupssss of symbolssss".split():
    print(re.sub(r'([a-z]+?)\1+', r'\1', word))
outputs
This
is
echo
string.
Replace
repeated
groups
of
symbols
One Liner Solution
string = "Thisssssssss isisisis echooooooo stringggg. Replaceaceaceace repeatedededed groupssss of symbolssss"
print(re.sub(r'([a-z]+?)\1+', r'\1', string))
#This is echo string. Replace repeated groups of symbols
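To see the greedy vs lazy difference on the small aaaa and aaaaa cases mentioned above, here is a minimal comparison. The variable names greedy and lazy are just labels chosen for this sketch.

```python
import re

greedy = re.compile(r'([a-z]+)\1+')   # group grabs as much as it can
lazy = re.compile(r'([a-z]+?)\1+')    # group grabs as little as it can

for sample in ('aaaa', 'aaaaa'):
    print(sample, greedy.sub(r'\1', sample), lazy.sub(r'\1', sample))
# aaaa aa a
# aaaaa aaa a
```

One caveat: the lazy pattern also collapses legitimate doubled letters, so "letter" becomes "leter".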

regex. Find multiple occurrence of pattern

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?
You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
in case it is separated by at least one spacing character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
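The two patterns only differ when codes run together without whitespace. A small sketch with a hypothetical input (the string "F56F23 G87" is not from the question) shows the difference:

```python
import re

plus = r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b'
star = r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b'
sample = "F56F23 G87"  # hypothetical input: first two codes are not separated

# With \s+ the trailing \b fails between "6" and "F", so F56 and F23 are skipped.
print(re.findall(plus, sample))  # ['G87']
# With \s* the codes are joined whether or not whitespace separates them.
print(re.findall(star, sample))  # ['F56F23 G87']
```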
print(list(map(lambda x: x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))))
Change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)': parentheses around the whole thing, \s* to pick up matches that are connected by whitespace, and a + after the last parenthesis to match more than one occurrence (greedy).
Now findall returns a list of tuples, and you need the element at index 0 of each. You can use map and lambda for this. To remove trailing blanks I used strip().

Finding all occurrences of alternating digits using regular expressions

I would like to find all alternating digits in a string using regular expressions. An alternating digit is defined as two equal digits having a digit in between; for example, 1212 contains 2 alternations (121 and 212) and 1111 contains 2 alternations as well (111 and 111). I have the following regular expression code:
s = "1212"
re.findall(r'(\d)(?:\d)(\1)+', s)
This works for strings like "121656", but not "1212". This is a problem to do with overlapping matches I think. How can I deal with that?
(?=((\d)\d\2))
Use lookahead to get all overlapping matches. Use re.findall and get the first element from the tuple. See the demo:
https://regex101.com/r/fM9lY3/54
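A short sketch of that approach in Python follows; the helper name alternations is made up for this example.

```python
import re

def alternations(s):
    # The lookahead consumes nothing, so the scan advances one character at a
    # time and overlapping matches are all reported. Group 1 is the full
    # three-digit alternation, group 2 the repeated digit.
    return [full for full, _ in re.findall(r'(?=((\d)\d\2))', s)]

print(alternations("1212"))  # ['121', '212']
print(alternations("1111"))  # ['111', '111']
```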
You can use a lookahead to allow for overlapping matches:
r'(\d)(?=(\d)\1)'
To reconstruct full matches from this:
matches = re.findall(r'(\d)(?=(\d)\1)', s)
[a + b + a for a, b in matches]
Also, to avoid other Unicode digits like ١ from being matched (assuming you don’t want them), you should use [0-9] instead of \d.
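The difference between \d and [0-9] can be checked directly. In this sketch the sample string is an assumption chosen for illustration, and the re.ASCII flag is an alternative not mentioned above that restricts \d to ASCII digits.

```python
import re

sample = "7\u0661"  # ASCII 7 followed by ARABIC-INDIC DIGIT ONE

print(len(re.findall(r'\d', sample)))            # 2
print(len(re.findall(r'[0-9]', sample)))         # 1
print(len(re.findall(r'\d', sample, re.ASCII)))  # 1
```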
With the regex module you don't have to use a trick to get overlapped matches since there's a flag to obtain them:
import regex
res = [x.group(0) for x in regex.finditer(r'(\d)\d\1', s, overlapped=True)]
If s contains only digits, you can do this too:
res = [s[i-2:i+1] for i in range(2, len(s)) if s[i]==s[i-2]]
A non-regex approach if your string is made up of just digits:
from itertools import islice as isl
s = "121231132124123"
out = [a + b + c for a, b, c in zip(isl(s, 0, None), isl(s, 1, None), isl(s, 2, None)) if a == c]
Output:
['121', '212', '212']
It is actually a nice bit faster than a regex approach.

How to return regular expression match as one entire string?

I want to match phone numbers, and return the entire phone number but only the digits. Here's an example:
(555)-555-5555
555.555.5555
But I want to use regular expressions to return only:
5555555555
But, for some reason I can't get the digits to be returned:
import re
phone_number='(555)-555-5555'
regex = re.compile('[0-9]')
r = regex.search(phone_number)
regex.match(phone_number)
print(r.groups())
But for some reason it just prints an empty tuple? What is the obvious thing I am missing here? Thanks.
You're getting empty result because you don't have any capturing groups, refer to the documentation for details.
If you change it to group() instead, you'll get the first digit as a match. But this is not what you want, because search returns only the first match and the pattern [0-9] matches a single digit.
You can simply remove all non-numeric characters:
re.sub('[^0-9]', '', '(555)-555-5555')
The range 0-9 is negated, so the regex matches anything that's not a digit, then it replaces it with the empty string.
You can do it without a regular expression using str.join and str.isdigit:
s = "(555)-555-5555"
print("".join([ch for ch in s if ch.isdigit()]))
5555555555
If you printed r.group() you would get some output, but search is not the right way to find all the matches. search returns only the first match, and since you are looking for a single digit it would return 5. Even with '[0-9]+' to match one or more digits, you would still only get the first group of consecutive digits, i.e. 555 in the string above. Using "".join(regex.findall(s)) would get the digits, but that can obviously be done with str.isdigit.
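The difference between search and findall can be seen in a few lines. This is a minimal sketch; the compiled pattern is named digits here purely for illustration.

```python
import re

s = "(555)-555-5555"
digits = re.compile(r'[0-9]+')

print(digits.search(s).group())    # 555 (first run of digits only)
print(digits.findall(s))           # ['555', '555', '5555']
print("".join(digits.findall(s)))  # 5555555555
```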
If you knew the potential non-digit chars then str.translate would be the best approach:
s = "(555)-555-5555"
print(s.translate(str.maketrans("", "", "()-.")))
5555555555
The simplest way is here:
>>> import re
>>> s = "(555)-555-5555"
>>> x = re.sub(r"\D+", r"", s)
>>> x
'5555555555'

Python regular expression for substring

All I want is to grab the first 3 numeric characters of string:
st = '123_456'
import re
r = re.match('([0-9]{3})', st)
print(r.groups()[0])
Am I doing the right thing for grabbing first 3 characters?
This returns 123 but what if I want to get the first 3 characters regardless of numbers and alphabets or special characters?
When given 12_345, I want to grab only 12_
Thanks,
If you always need first three characters in a string, then you can use the below:
first_3_characters = st[:3]
There is no need for a regular expression in your case.
You are really close: just drop the extra set of parentheses and use group(0), which returns the whole match. See below.
This works:
import re
mystring = '123_456'
check = re.search('^[0-9]{3}', mystring)
if check:
    print(check.group(0))
The ^ anchors the pattern to the beginning of the string, which ensures it matches only the first three numeric digits. If you do not use the caret, the regexp will match any three digits in a row in the string.
Some may suggest \d but this includes more than 0-9.
As others will surely point out a simple substring operation will do the trick if all the fields start with three numeric digits that you want to extract.
Good luck!
If all digits are separated by _, then you can simply use this regular expression which greedily matches all numeric characters before the first _ .
r = re.match('([0-9]*)_', st)
Actually, the _ in this RE is not necessary, so you can simplify it so that any separator is accepted:
r = re.match(r'(\d*)', st)
But this solution will give you 1234 if st = '1234_56'. I'm not sure whether it is your intention.
So, if you want at most 3 numeric characters, you can just modify the regular expression to:
r = re.match(r'(\d{,3})', st)
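To compare the capped pattern with plain slicing on the inputs discussed in this thread, here is a small sketch. The helper name leading_digits is hypothetical.

```python
import re

def leading_digits(st):
    # {,3} means "zero to three", so the match always succeeds,
    # possibly with an empty string when st does not start with a digit.
    return re.match(r'(\d{,3})', st).group(1)

for st in ('123_456', '1234_56', '12_345'):
    print(st, repr(leading_digits(st)), repr(st[:3]))
# 123_456 '123' '123'
# 1234_56 '123' '123'
# 12_345 '12' '12_'
```

The last row shows why slicing is the better fit when the first three characters are wanted regardless of type.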
