Python regex match number + unit [duplicate] - python

I have the code:
import re
sequence="aabbaa"
rexp=re.compile("(aa|bb)+")
rexp.findall(sequence)
This returns ['aa']
If we have
import re
sequence="aabbaa"
rexp=re.compile("(aa|cc)+")
rexp.findall(sequence)
we get ['aa','aa']
Why is there a difference and why (for the first) do we not get ['aa','bb','aa']?
Thanks!

The unwanted behaviour comes down to the way you formulated the regular expression:
rexp=re.compile("(aa|bb)+")
The parentheses in (aa|bb) form a capturing group.
And if we look at the docs of findall we will see this:
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result
unless they touch the beginning of another match.
Because you formed a group, it matched first aa, then bb, then aa again (because of the + quantifier). So in the end the group holds the last aa. findall returns this value in the list ['aa']: there is only one match of the whole expression (aabbaa), so the list contains only one element, the aa saved in the group.
From the code you gave, you seemed to want to do this:
>>> rexp=re.compile("(?:aa|bb)+")
>>> rexp.findall(sequence)
['aabbaa']
(?: ...) doesn't create a capturing group, so findall returns the match of the whole expression.
At the end of your question you show the desired output. That is achieved by just looking for aa or bb. No quantifiers (+ or *) are needed. Just do it the way it is done in Inbar Rose's answer:
>>> rexp=re.compile("aa|bb")
>>> rexp.findall(sequence)
['aa', 'bb', 'aa']
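To put the variants side by side, here is a small sketch that runs the patterns from the question and from this answer on the same string. The variable names are only for illustration. It also covers the (aa|cc)+ case from the question: bb matches neither alternative, so the first match ends after the leading aa and the trailing aa becomes a second match.

```python
import re

sequence = "aabbaa"

# One match ("aabbaa"); findall reports only the last repetition of the group.
repeated_bb = re.findall(r"(aa|bb)+", sequence)
# "bb" matches neither alternative, so the first match stops after "aa"
# and the trailing "aa" is found later as a second, separate match.
repeated_cc = re.findall(r"(aa|cc)+", sequence)
# No capturing group: findall reports the whole match.
non_capturing = re.findall(r"(?:aa|bb)+", sequence)
# No quantifier: every "aa" or "bb" is its own match.
no_quantifier = re.findall(r"aa|bb", sequence)

print(repeated_bb)    # ['aa']
print(repeated_cc)    # ['aa', 'aa']
print(non_capturing)  # ['aabbaa']
print(no_quantifier)  # ['aa', 'bb', 'aa']
```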

Let me explain what you are doing:
regex = re.compile("(aa|bb)+")
You are creating a regex which looks for aa or bb, then checks whether more aa or bb follow directly after it, and keeps going until it finds neither. Since the capturing group holds only a single aa or bb, you only get the last one that was captured.
However, if you have a string like aaxaabbxaa you will get aa, bb, aa. The regex first finds aa, looks for more, and finds only an x, so that is the first match. Then it finds another aa followed by a bb and then an x, so it stops, and the second match reports bb (the last repetition). Then it finds another aa. So the final result is aa, bb, aa.
I hope this explains what you are doing, and that the result is as expected. To get every aa or bb on its own, remove the +, which tells the regex to consume repeated groups before returning a match, and just have the regex return each match of aa or bb.
so your regex should be:
regex = re.compile("(aa|bb)")
cheers.
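The walk-through over aaxaabbxaa can be checked with re.finditer, which exposes both the full match and the group for each match. This is just a sketch for verification; the name steps is made up here.

```python
import re

pattern = re.compile(r"(aa|bb)+")

# For each match, record the whole matched text and what group 1 ended up holding.
steps = [(m.group(0), m.group(1)) for m in pattern.finditer("aaxaabbxaa")]

print(steps)  # [('aa', 'aa'), ('aabb', 'bb'), ('aa', 'aa')]
```

The second tuple shows the point of the explanation: the match covers aabb, but the group keeps only bb.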

Your pattern
rexp=re.compile("(aa|bb)+")
matches the whole string aabbaa. To see this, look at group 0:
>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(0)
'aabbaa'
Since that single match consumes the whole string, nothing is left for a second match, and group 1 holds only its last repetition:
>>> re.match(re.compile("(aa|bb)+"),"aabbaa").group(1)
'aa'
So findall returns only that one substring:
>>> re.findall(re.compile("(aa|bb)+"),"aabbaa")
['aa']

I do not understand why you use +. It means one or more repetitions of the group (zero or one would be ?), and when a capturing group repeats, only its last repetition is kept. Without the +, every aa or bb is its own match:
>>> re.findall(r'(aa|bb)', 'aabbaa')
['aa', 'bb', 'aa']
This works as expected.
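For reference, a minimal sketch of what the three common quantifiers accept: ? allows zero or one repetition, + requires one or more, and * allows zero or more. The sample strings and the results dictionary are made up for illustration.

```python
import re

samples = ("", "aa", "aaaa")
results = {}

for quantifier in "?+*":
    pattern = "(?:aa)" + quantifier
    # True where the pattern matches the entire sample string.
    results[quantifier] = [re.fullmatch(pattern, s) is not None for s in samples]
    print(quantifier, results[quantifier])

# ? [True, True, False]
# + [False, True, True]
# * [True, True, True]
```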

Related

why isn't the re.group function giving me the expected output

import re
v = "aeiou"
c = "qwrtypsdfghjklzxcvbnm"
m = re.finditer(r"(?<=[%s])([%s]{2,})[%s]" % (c, v, c), input(), flags=re.I)
for i in m:
    print(i.group())
The above code is an attempt to solve the hackerrank question using re.finditer but for the input
rabcdeefgyYhFjkIoomnpOeorteeeeetmy
my output is
eef
Ioom
Oeor
eeeeet
instead of
ee
Ioo
Oeo
eeeee
I would like to know the reason why
It is because findall() and finditer() are returning different things.
In the re doc, for findall():
If one or more groups are present in the pattern, return a list of groups
for finditer():
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
In your case, when you use findall() with a group, the whole match is ignored and it just returns a list of the vowels captured by the group. But finditer() yields the whole match object, and group() on it returns the full match including the ending consonant.
You have two ways to get the result:
1. Keep the current pattern and use i.group(1) to get the match in group 1 instead of the whole match.
2. Use a lookahead assertion for the ending consonant, like (?=[%s]), so that the matched string contains only the vowels.
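Both options above can be sketched as follows. To keep the example self-contained, the sample input from the question is hard-coded instead of being read with input(); the variable names are only illustrative.

```python
import re

v = "aeiou"
c = "qwrtypsdfghjklzxcvbnm"
s = "rabcdeefgyYhFjkIoomnpOeorteeeeetmy"

# Option 1: keep the original pattern and read capture group 1.
consuming = r"(?<=[%s])([%s]{2,})[%s]" % (c, v, c)
via_group = [m.group(1) for m in re.finditer(consuming, s, flags=re.I)]

# Option 2: turn the trailing consonant into a lookahead so it is not part of the match.
lookahead = r"(?<=[%s])[%s]{2,}(?=[%s])" % (c, v, c)
via_lookahead = [m.group() for m in re.finditer(lookahead, s, flags=re.I)]

print(via_group)      # ['ee', 'Ioo', 'Oeo', 'eeeee']
print(via_lookahead)  # ['ee', 'Ioo', 'Oeo', 'eeeee']
```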

regex. Find multiple occurrence of pattern

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?
You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
in case it is separated by at least one spacing character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
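On the string from the question the two variants agree, so here is a quick sketch of where they differ. The input with no space between the codes (F56F23) is a made-up example, not taken from the question.

```python
import re

with_plus = r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b'
with_star = r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b'

glued = "F56F23 and G87"  # hypothetical input: no whitespace between the first two codes

# \s+ needs whitespace between codes, and \b cannot match between "6" and "F",
# so the glued pair is skipped entirely.
plus_result = re.findall(with_plus, glued)
# \s* also accepts directly adjacent codes.
star_result = re.findall(with_star, glued)

print(plus_result)  # ['G87']
print(star_result)  # ['F56F23', 'G87']
```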
print(list(map(lambda x: x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))))
Change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)': parentheses around the whole thing, \s* to include the items that are connected by whitespace, and a + after the last parenthesis to find more than one occurrence (greedy).
Now findall returns a list of tuples, of which you need the element at index 0. You can use map and a lambda for this. To remove the trailing blanks I used strip().
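To see why the element at index 0 is the one to keep, this sketch prints what findall returns for the nested-group pattern before any post-processing. Each tuple holds groups 1, 2 and 3 in order, and the repeated inner groups keep only their last repetition. The names raw and cleaned are illustrative.

```python
import re

my_string = "this data is F56 F23 and G87"

raw = re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string)
print(raw)  # [('F56 F23 ', 'F23 ', 'F23'), ('G87', 'G87', 'G87')]

# Keep only the outermost group and drop the trailing whitespace.
cleaned = [groups[0].strip() for groups in raw]
print(cleaned)  # ['F56 F23', 'G87']
```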

Why is re.search not getting the right group while re.findall is getting it?

Given the string 'abc.', the aim is to break it into two groups, 'abc' and '.'. Actually, I'm only interested in the group before the '.'.
>>> import re
>>> text = 'abc.'
>>> re.search(r'^(\S+)\.$', text).group(0)
'abc.'
>>> re.findall(r'^(\S+)\.$', text)
['abc']
Why is re.search not getting the right group while re.findall is getting it?
Another example: when the input is 'abc.def.', the expected output isolates the final full stop, giving 'abc.def' and '.'. re.findall gets it as desired:
>>> text = 'abc.def.'
>>> re.findall(r'^(\S+)\.$', text)
['abc.def']
But re.search lumps the final full stop into the first group.
>>> re.search(r'^(\S+)\.$', text).group(0)
'abc.def.'
Is it possible for re.search(r'^(\S+)\.$', text).group(0) to return only abc.def? Are there flags that need to be set?
Because you are asking for the wrong group. Group 0 is the entire match, which includes the dot. Group 1 is the first capture group within the match. This is all spelled out in the docs for the match object, which re.search returns. If you absolutely need something zero-based, use re.search(...).groups()[0].
Group numbers start at 1, so you want group(1). group(0) is the entire match text.
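A short sketch of the group numbering on the second input from the question; the variable m is just a local name for the match object.

```python
import re

text = 'abc.def.'
m = re.search(r'^(\S+)\.$', text)

print(m.group(0))    # 'abc.def.'   the entire match, including the final dot
print(m.group(1))    # 'abc.def'    the first capture group
print(m.groups())    # ('abc.def',) all capture groups as a tuple, indexed from 0
print(m.groups()[0] == m.group(1))  # True
```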

Matching both possible solutions in Regex

I have a string aaab. I want a Python expression to match aa, so I expect the regular expression to return aa and aa since there are two ways to find substrings of aa.
However, this is not what's happening.
This is what I've done:
import re
a = "aaab"
b = re.match('aa', a)
You can achieve it with a look-ahead and a capturing group inside it:
(?=(a{2}))
Since a look-ahead does not move on to the next position in string, we can scan the same text many times thus enabling overlapping matches.
Python code:
import re
p = re.compile(r'(?=(a{2}))')
test_str = "aaab"
print(re.findall(p, test_str))
To generalize @stribizhev's solution to match one or more of the character a: (?=(a{1,}))
For three or more: (?=(a{3,})) etc.
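A small sketch of what the look-ahead patterns return on aaab, including the start position of each overlapping match. The lookahead itself matches the empty string, so the text of interest is in group 1. Variable names are illustrative.

```python
import re

test_str = "aaab"

# Exactly two a's, tried at every position: overlapping matches at index 0 and 1.
pairs = [(m.start(), m.group(1)) for m in re.finditer(r'(?=(a{2}))', test_str)]
print(pairs)  # [(0, 'aa'), (1, 'aa')]

# One or more a's: the group is greedy, so each position yields the longest run from there.
runs = re.findall(r'(?=(a{1,}))', test_str)
print(runs)  # ['aaa', 'aa', 'a']
```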

Python regexp: get all group's sequence

I have a regex like this: '^(a|ab|1|2)+$', and I want to get the whole sequence of pieces the group matched.
For example, for re.search(reg, 'ab1') I want to get ('ab','1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know in advance how many blocks will be matched by (pattern)+.
Is this possible, and if yes, how?
try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back just a list of strings, one string per match. If the pattern had more groups, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I'm removing the ^ and $ anchors from the pattern because if findall has to find more than one substring, it must be able to search anywhere in str. Then I've removed the + so it matches single occurrences of the options in the pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
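One way to keep the whole-string check and still get every piece is to validate first and split second. This is only a sketch: split_tokens is a hypothetical helper, and it assumes that a left-to-right findall with the longer alternative listed first cuts the string the same way the anchored pattern would, which holds for these four alternatives but not for arbitrary patterns.

```python
import re

def split_tokens(text):
    """Return the pieces of text matched by (ab|a|1|2)+, or None if text does not match."""
    # Validate the whole string with a non-capturing repeated group.
    if re.fullmatch(r"(?:ab|a|1|2)+", text) is None:
        return None
    # Then collect each piece as its own match (no quantifier, no anchors).
    return tuple(re.findall(r"ab|a|1|2", text))

# With the repeated capturing group, only the last repetition survives.
last_only = re.fullmatch(r"(ab|a|1|2)+", "ab1").group(1)

print(last_only)            # 1
print(split_tokens("ab1"))  # ('ab', '1')
print(split_tokens("ab3"))  # None
```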
I think you don't need regexes for this problem;
you need some recursive graph search function.
