why isn't the re.group function giving me the expected output - python

import re
v = "aeiou"
c = "qwrtypsdfghjklzxcvbnm"
m = re.finditer(r"(?<=[%s])([%s]{2,})[%s]" % (c, v, c), input(), flags=re.I)
for i in m:
    print(i.group())
The code above is my attempt to solve a HackerRank problem using re.finditer, but for the input
rabcdeefgyYhFjkIoomnpOeorteeeeetmy
my output is
eef
Ioom
Oeor
eeeeet
instead of
ee
Ioo
Oeo
eeeee
I would like to know why.

This is because findall() and finditer() return different things.
In the re doc, for findall():
If one or more groups are present in the pattern, return a list of groups
for finditer():
Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string.
In your case, findall() with a group discards the whole match and returns only the vowels captured by the group. finditer() instead yields the whole match object, and i.group() returns the entire match, including the trailing consonant.
There are two ways to get the result you want:
1. Keep the current pattern and use i.group(1) to get the text captured by group 1 instead of the whole match.
2. Use a lookahead assertion for the trailing consonant, such as (?=[%s]), so that the matched string contains only the vowels.
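Both options can be sketched side by side. This is only an illustration: the variable names are mine, and the sample input from the question is hard-coded instead of being read with input().

```python
import re

vowels = "aeiou"
consonants = "qwrtypsdfghjklzxcvbnm"
text = "rabcdeefgyYhFjkIoomnpOeorteeeeetmy"  # sample input from the question

# Option 1: keep the original pattern, but read capture group 1.
consuming = r"(?<=[%s])([%s]{2,})[%s]" % (consonants, vowels, consonants)
via_group = [m.group(1) for m in re.finditer(consuming, text, flags=re.I)]

# Option 2: turn the trailing consonant into a lookahead so it is not consumed.
lookahead = r"(?<=[%s])[%s]{2,}(?=[%s])" % (consonants, vowels, consonants)
via_lookahead = [m.group() for m in re.finditer(lookahead, text, flags=re.I)]

print(via_group)
print(via_lookahead)
```

Both lists contain only the vowel runs, without the trailing consonant.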

Related

Apart from returning string and iterator in re.findall() and re.finditer() in python do their working also differ?

I wrote the following code to get all variable-length runs matching str_key.
import re
line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"
regex = rf"({str_key})+"
find_all_found = re.findall(regex, line)
print(find_all_found)
find_iter_found = re.finditer(regex, line)
for i in find_iter_found:
    print(i.group())
Output I got:
['ABCD', 'ABCD', 'ABCD']
ABCDABCDABCD
ABCD
ABCDABCDABCD
The intended output is the last three lines, printed by finditer(). I was expecting both functions to give me the same output (list or iterator does not matter). Why does findall() differ? As far as I understood from other posts on Stack Overflow, these two functions differ only in their return types and not in how they match patterns. Do they work differently? If not, what have I done wrong?
You want to access groups() rather than group(); this gives the same captured text that findall() returns.
>>> find_iter_found = re.finditer(regex, line)
>>> for i in find_iter_found:
...     print(i.groups()[0])
The difference between the two methods is explained here.
The behaviour of the two functions is pretty much the same as far as the matching process is concerned as per:
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of
strings. The string is scanned left-to-right, and matches are returned
in the order found. If one or more groups are present in the pattern,
return a list of groups; this will be a list of tuples if the pattern
has more than one group. Empty matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
re.finditer(pattern, string, flags=0)
Return an iterator yielding match objects over all non-overlapping
matches for the RE pattern in string. The string is scanned
left-to-right, and matches are returned in the order found. Empty
matches are included in the result.
Changed in version 3.7: Non-empty matches can now start just after a
previous empty match.
For re.findall, change your regex from
regex = rf"({str_key})+"
to
regex = rf"((?:{str_key})+)"
The + quantifier has to be inside the capture group.
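As a quick check, here is a small sketch comparing the two patterns on the string from the question. The names outer, inner and whole are only for illustration.

```python
import re

line = "ABCDABCDABCDXXXABCDXXABCDABCDABCD"
str_key = "ABCD"

# Quantifier outside the group: findall() reports only the last repetition.
outer = re.findall(rf"({str_key})+", line)

# Quantifier inside the group: findall() reports each whole run.
inner = re.findall(rf"((?:{str_key})+)", line)

# finditer() exposes the whole match through group() regardless of grouping.
whole = [m.group() for m in re.finditer(rf"({str_key})+", line)]

print(outer)
print(inner)
print(whole)
```

If str_key could contain regex metacharacters, wrapping it in re.escape() before building the pattern would probably be safer; the question's key is plain letters, so the sketch skips that.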

Matching both possible solutions in Regex

I have a string aaab. I want a Python expression to match aa, so I expect the regular expression to return aa and aa since there are two ways to find substrings of aa.
However, this is not what's happening.
This is what I've done:
import re
a = "aaab"
b = re.match('aa', a)
You can achieve it with a look-ahead and a capturing group inside it:
(?=(a{2}))
Since a look-ahead does not move on to the next position in string, we can scan the same text many times thus enabling overlapping matches.
Python code:
import re
p = re.compile(r'(?=(a{2}))')
test_str = "aaab"
print(re.findall(p, test_str))
To generalize @stribizhev's solution to match one or more of the character a, use (?=(a{1,})).
For three or more, use (?=(a{3,})), and so on.
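To see where each overlapping match starts, the lookahead can be combined with finditer(). This is a minimal sketch; the pairs name is just for illustration.

```python
import re

text = "aaab"

# The lookahead consumes nothing, so the scan advances one character at a
# time and group 1 records what could be matched at each position.
pairs = [(m.start(), m.group(1)) for m in re.finditer(r"(?=(a{2}))", text)]
print(pairs)

# Without the lookahead, matches cannot overlap.
plain = re.findall(r"a{2}", text)
print(plain)
```

The first list has one entry per starting position, while the plain pattern finds a single non-overlapping match.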

regular expression: may or may not contain a string

I want to match a floating number that might be in the form of 0.1234567 or 1.23e-5
Here is my python code:
import re
def main():
    m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
    for svs_elem in m2:
        print(svs_elem)
main()
It prints blank lines... Based on my tests, the problem is in the (e-\d+)? part.
See the part of the documentation about groups:
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
You have a group, so it’s returned instead of the entire match, but it doesn’t match in any of your cases. Make it non-capturing with (?:e-\d+):
m2 = re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345')
Use a non-capturing group. The matches are succeeding, but the output is the contents of the optional groups that don't actually match.
See the output when your input includes something like e-6:
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['', '', 'e-6']
With a non-capturing group ((?:...)):
>>> re.findall(r'\d{1,4}:[-+]?\d+\.\d+(?:e-\d+)?', '1:0.00003 3:0.123456 8:-0.12345e-6')
['1:0.00003', '3:0.123456', '8:-0.12345e-6']
Here are some simpler examples to demonstrate how capturing groups work and how they influence the output of findall. First, no groups:
>>> re.findall("a[bc]", "ab")
['ab']
Here, the string "ab" matched the regex, so we print everything the regex matched.
>>> re.findall("a([bc])", "ab")
['b']
This time, we put the [bc] inside a capturing group, so even though the entire string is still matched by the regex, findall only includes the part inside the capturing group in its output.
>>> re.findall("a(?:[bc])", "ab")
['ab']
Now, by converting the capturing group to a non-capturing group, findall again uses the match of the entire regex in its output.
>>> re.findall("a([bc])?", "a")
['']
>>> re.findall("a(?:[bc])?", "a")
['a']
In both of these final cases, the regular expression as a whole matches, so the return value is a non-empty list. In the first one, though, the capturing group itself doesn't match any text, so the empty string is part of the output. In the second, we don't have a capturing group, so the match of the entire regex is used for the output.
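If you would rather keep the capturing group, one alternative (a sketch, not the only way) is to use finditer() and ask each match object for group(0), which is always the whole match. The variable names here are illustrative.

```python
import re

pattern = re.compile(r'\d{1,4}:[-+]?\d+\.\d+(e-\d+)?')
text = '1:0.00003 3:0.123456 8:-0.12345e-6'

for m in pattern.finditer(text):
    # group(0) is always the whole match; group(1) is None when the
    # optional group did not participate (findall() reports '' instead).
    print(m.group(0), m.group(1))

whole_matches = [m.group(0) for m in pattern.finditer(text)]
exponents = [m.group(1) for m in pattern.finditer(text)]
```

This gives access to both the full number and the optional exponent from the same match.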

Python regexp: get all group's sequence

I have a regex like '^(a|ab|1|2)+$' and want to get the whole sequence of group matches for it.
For example, for re.search(reg, 'ab1') I want to get ('ab','1').
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know in advance how many blocks will be matched by (pattern)+.
Is this possible, and if so, how?
Try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back just a list of strings, one string per match. If you had more groups, you would get back a list of tuples, each containing one string per group.
This should work for your second example:
pattern = '(7325189|7325|9087|087|18)'
str = '7325189087'
res = re.compile(pattern).findall(str)
print(pattern, str, res, [i for i in res])
I removed the ^ and $ anchors from the pattern because, for findall to find more than one substring, it has to be able to search anywhere in str. I also removed the +, so that each match is a single occurrence of one of the alternatives in pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
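A short sketch of that behaviour, together with a two-step workaround: validate the whole string first, then tokenize it with findall(). The workaround assumes that scanning left to right with the reordered alternation gives the split you want, which holds for this example but is not guaranteed for every set of alternatives.

```python
import re

text = 'ab1'

# With a repeated group, only the last repetition is kept.
last_only = re.search(r'^(a|ab|1|2)+$', text).groups()
print(last_only)

# Two-step workaround: validate the whole string, then tokenize it.
# Note the alternation order (ab before a) so the longer token wins.
if re.fullmatch(r'(?:a|ab|1|2)+', text):
    pieces = re.findall(r'ab|a|1|2', text)
else:
    pieces = None
print(pieces)
```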
I don't think you need regular expressions for this problem;
you need some kind of recursive graph search function.
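One possible shape for such a search, as a rough sketch: try every token that the remaining text starts with, then recurse on the rest. The function name segmentations is hypothetical, and the code assumes every token is a non-empty literal string.

```python
def segmentations(text, tokens):
    """Return every way to split text into a sequence of the given tokens."""
    if not text:
        return [[]]  # exactly one way to split the empty string
    results = []
    for token in tokens:
        if text.startswith(token):
            # Consume this token, then split whatever is left.
            for rest in segmentations(text[len(token):], tokens):
                results.append([token] + rest)
    return results


simple = segmentations('ab1', ['a', 'ab', '1', '2'])
digits = segmentations('7325189087', ['7325189', '7325', '9087', '087', '18'])
print(simple)
print(digits)
```

Unlike findall(), this returns all complete splits, so the second example yields two different segmentations. The running time can grow quickly for long inputs with many overlapping tokens.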

Find the indexes of all regex matches?

I'm parsing strings that could have any number of quoted strings inside them (I'm parsing code, and trying to avoid PLY). I want to find out if a substring is quoted, and I have the substring's index. My initial thought was to use re to find all the matches and then figure out the range of indexes they represent.
It seems like I should use re with a regex like \"[^\"]+\"|'[^']+' (I'm avoiding dealing with triple quoted and such strings at the moment). When I use findall() I get a list of the matching strings, which is somewhat nice, but I need indexes.
My substring might be as simple as c, and I need to figure out if this particular c is actually quoted or not.
This is what you want:
re.finditer(pattern, string[, flags])
Return an iterator yielding MatchObject instances over all
non-overlapping matches for the RE pattern in string. The string is
scanned left-to-right, and matches are returned in the order found. Empty
matches are included in the result unless they touch the beginning of
another match.
You can then get the start and end positions from the MatchObjects.
e.g.
[(m.start(0), m.end(0)) for m in re.finditer(pattern, string)]
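Building on those spans, a minimal sketch of the "is this index quoted?" check might look like this. The helper names quoted_spans and is_quoted are made up for the example, and the pattern is the simple one from the question, so triple-quoted strings and escaped quotes are not handled.

```python
import re

# Simple pattern from the question: double- or single-quoted, non-empty.
QUOTED = re.compile(r'"[^"]+"|\'[^\']+\'')


def quoted_spans(text):
    """Return (start, end) pairs for each quoted string; end is exclusive."""
    return [m.span() for m in QUOTED.finditer(text)]


def is_quoted(text, index):
    """True if the character at index lies inside a quoted string."""
    return any(start <= index < end for start, end in quoted_spans(text))


code = 'x = "abc" + c + \'def\''
print(quoted_spans(code))
print(is_quoted(code, 7))   # the c inside "abc"
print(is_quoted(code, 12))  # the bare c
```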
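To get the indices of all occurrences:
import re
S = input()  # Source string
k = input()  # String to be searched
pattern = re.compile(k)
r = pattern.search(S)
if not r:
    print("(-1, -1)")
while r:
    # end() is exclusive, so subtract 1 to print an inclusive end index
    print("({0}, {1})".format(r.start(), r.end() - 1))
    # Resume one character past the previous start to allow overlapping matches
    r = pattern.search(S, r.start() + 1)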
This should solve your issue:
pattern = r"(?=(\"[^\"]+\"|'[^']+'))"
Then use the following to get all overlapping (start, inclusive end) index pairs, where text is the string being searched:
indicesTuple = [(mObj.start(1), mObj.end(1) - 1) for mObj in re.finditer(pattern, text)]
