difference between regular expression with and without group '( )'? - python

Here are two snippets that produce two different results, and I don't understand where the difference comes from.
>>> re.findall('[a-z]+','abc')
['abc']
and this one with group:
>>> re.findall('([a-z])+','abc')
['c']
Why does the second snippet yield the character c?

In your last regex pattern, ([a-z])+, you are repeating a capturing group (the parentheses). A repeated capturing group keeps only its last iteration, and findall returns the group's contents rather than the whole match, so you get the last letter, which is c.
But in your first pattern, [a-z]+, you are repeating a character class ([]), which does not behave like a capturing group. There is no group, so findall returns the whole match, covering all the iterations.
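A minimal sketch of the difference, using only the standard re module. The variable names are illustrative and not part of the original question; the last two patterns are variations added here for comparison.

```python
import re

text = 'abc'

# No capturing group: findall returns each whole match.
whole = re.findall(r'[a-z]+', text)

# Repeated capturing group: group 1 is overwritten on every iteration,
# so only the last captured letter survives.
last_only = re.findall(r'([a-z])+', text)

# Capturing group wrapped around the repetition: group 1 holds the whole run.
wrapped = re.findall(r'([a-z]+)', text)

# Non-capturing group: grouping without changing what findall returns.
non_capturing = re.findall(r'(?:[a-z])+', text)

print(whole, last_only, wrapped, non_capturing)
# ['abc'] ['c'] ['abc'] ['abc']
```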

Related

Matching strings where multiple capture groups must be different in regex

I am trying to create a regular expression that picks out a boolean algebra identity, specifically ((A+B).(A+C)), where A, B and C are different strings consisting of characters [A-Z].
I am running into problems getting the regular expression to recognise that, in the string I am looking for, A != B != C.
Here is what I have tried:
\(\(([A-Z]+)\+([A-Z])\)\.\(\1\+([A-Z])\)\)
However, even though I have put every string that I want to be different in a capturing group, that doesn't stop it from treating strings B and C as the same. I can see this because the regular expression matches all three of the following strings:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
while I only want it to match the first one.
You can use negative lookahead to make sure that group 2 is not the same as group 1, and that group 3 is not the same as either groups 1 or 2.
\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)
Split up for readability:
\(\(
([A-Z]+)
\+
(?!\1)([A-Z])
\)\.\(
\1
\+
(?!\1)(?!\2)([A-Z])
\)\)
Inputs:
((A+B).(A+C))
((A+B).(A+A))
((A+A).(A+A))
((A+B).(A+B))
Matches:
((A+B).(A+C))
Try it on regex101
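The same check can be reproduced in Python. This is a small sketch that assumes the whole input string should be the identity, so it uses re.fullmatch; the names pattern, inputs and matches are illustrative.

```python
import re

# Pattern from the answer above: the lookaheads keep group 2 different from
# group 1, and group 3 different from both group 1 and group 2.
pattern = re.compile(
    r'\(\(([A-Z]+)\+(?!\1)([A-Z])\)\.\(\1\+(?!\1)(?!\2)([A-Z])\)\)'
)

inputs = ['((A+B).(A+C))', '((A+B).(A+A))', '((A+A).(A+A))', '((A+B).(A+B))']

# Keep only the inputs that the pattern matches in full.
matches = [s for s in inputs if pattern.fullmatch(s)]
print(matches)
# ['((A+B).(A+C))']
```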

How to find repeated substring in a string using regular expressions in Python?

I am trying to find the longest consecutive chain of repeated DNA nucleotides in a DNA sequence. The DNA sequence is a string. So, for example, if I have "AGA", I would want to know the length of the longest consecutive chain of repeats of "AGA" in the chain.
I am thinking of using regular expressions to extract all the chains of repeats of the nucleotides and store them in a list (using re.findall()). Then simply find the longest chain out of them, take its length and divide it by the length of the sequence of nucleotides.
What regular expression can I write for this? I was thinking, for example [AGA]+, but it would identify substrings with A or G or A. I want something similar, so that it identifies "AGA" and its repeats.
Note: if the sequence is AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT, there are two chains of consecutive "AGA", one of length 3 and the other of length 5. The longest chain is therefore of length 5.
You can use expression ((AGA)\2*) (regex101):
For example:
import re
s = 'AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT'
to_find = 'AGA'
# Each findall result is a tuple (whole run, single unit); pick the longest run.
m = max(re.findall(r'(({})\2*)'.format(to_find), s), key=lambda k: len(k[0]))[0]
print(m, len(m) // len(to_find))
Prints:
AGAAGAAGAAGAAGA 5
You could use the first match of the following regular expression:
r'((?:AGA)+)(?!.*\1)'
Python's regex engine performs the following operations.
( : begin capture group 1
(?:AGA) : match 'AGA' in a non-capture group
+ : execute non-capture group 1+ times
) : end capture group 1
(?! : begin negative lookahead
.* : match any character other than line terminators 0+ times
\1 : match contents of capture group 1
) : end negative lookahead
This rejects a candidate string of "AGA"'s if there is another string of "AGA"'s later in the string that is at least as long as the candidate string.
There may well be multiple matches. If, for example, the string were
AGAAGAAGATAGATAGAAGATAGA
^^^^^^^^^     ^^^^^^ ^^^
there would be, as I indicated by the party hats, three matches. As the matches always decrease in length from left to right, no match will be longer than the first match. We may therefore select the first match.
If one wanted to identify all longest matches (should there be more than one having the longest length), one could use the above regex to obtain a match of, say, four 'AGA's, and then match the string with the regex r'(?:AGA){4}'.
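The two steps described above might be sketched as follows. It assumes the sequence contains at least one 'AGA' (otherwise search returns None); the names first, repeats, longest_runs and all_matches are illustrative.

```python
import re

pattern = re.compile(r'((?:AGA)+)(?!.*\1)')

s = 'AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT'

# Step 1: the first match is the longest run of consecutive 'AGA'.
first = pattern.search(s).group(1)
repeats = len(first) // len('AGA')
print(first, repeats)
# AGAAGAAGAAGAAGA 5

# Step 2: find every run that has that many repeats.
longest_runs = re.findall(r'(?:AGA){%d}' % repeats, s)
print(longest_runs)
# ['AGAAGAAGAAGAAGA']

# The example string with three matches, each shorter than the one before it.
all_matches = pattern.findall('AGAAGAAGATAGATAGAAGATAGA')
print(all_matches)
# ['AGAAGAAGA', 'AGAAGA', 'AGA']
```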
This is another way to find the matching subsequences; it returns every run of consecutive AGA repeats.
re.findall("(?:AGA)+", "AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT")
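To get from that list to the length asked for in the question, one option is to take the longest run by character count and divide by the length of the unit. This is a sketch; runs, longest and the default='' fallback for sequences with no match are choices made here, not part of the answer.

```python
import re

s = 'AATGAGAAGAAGATCCTAGAAGAAGAAGAAGACGAT'
unit = 'AGA'

# Every run of one or more consecutive copies of the unit.
runs = re.findall(r'(?:{})+'.format(re.escape(unit)), s)

# Longest run by character count; default='' avoids an error when nothing matches.
longest = max(runs, key=len, default='')

print(runs)
# ['AGAAGAAGA', 'AGAAGAAGAAGAAGA']
print(len(longest) // len(unit))
# 5
```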

Python regex: unclear difference between repeating qualifier {n} and equivalent tuple

Why does xx yield something different from x{2}?
Please have a look at the following example:
import re
lines = re.findall(r'".*?"".*?"', '"x""y"')
print(lines) # yields: ['"x""y"']
lines = re.findall(r'(".*?"){2}', '"x""y"')
print(lines) # yields: ['"y"']
As per the documentation of findall, if you have a group in the regex, it returns a list of those groups: tuples for two or more groups, plain strings for a single group. In your case, the two regexes are not merely xx versus x{2}; the second one is (x){2}, which has a group, whereas the first regex has no groups.
Hence, "x" matches the group the first time, then "y" matches the group the second time. This fulfills your overall regex, but "y" overwrites "x" for the value of group 1.
The easiest way to solve this in your example is to convert your group to a non-capturing group: (?:".*?"){2}. If you want two groups, one for "x" and one for "y", you need to repeat the group twice: (".*?")(".*?"). You can potentially use named groups to simplify this repetition.
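The three options mentioned above could look like this in practice. The group names first and second are made up for illustration.

```python
import re

text = '"x""y"'

# Non-capturing group: findall returns the whole match.
whole = re.findall(r'(?:".*?"){2}', text)
print(whole)
# ['"x""y"']

# Two separate capturing groups: findall returns one tuple per match.
pairs = re.findall(r'(".*?")(".*?")', text)
print(pairs)
# [('"x"', '"y"')]

# Named groups make each part accessible by name on the match object.
m = re.search(r'(?P<first>".*?")(?P<second>".*?")', text)
print(m.group('first'), m.group('second'))
# "x" "y"
```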
The first expression is "X and then Y, where Y accidentally matches the same thing as X".
The second expression is "(X){repeat two times}". Group 1 cannot contain XX, because group 1 does not match XX. It matches X.
In other words: a group's contents do not change just because of a quantifier outside of the group.
One way to remedy the second expression is to make an outer group (and make the inner group non-capturing)
lines = re.findall(r'((?:".*?"){2})', '"x""y"')
About your second pattern (".*?"){2}:
A quote from the rules of matching:
If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
And findall does the following:
If one or more groups are present in the pattern, return a list of groups;
Your pattern (".*?"){2} means that (".*?") should match twice in a row, and according to the first rule, only the content of the last match is captured.
For your data findall finds the sequence (".*?"){2} only once, so it returns a list consisting of the last captured group for a single match: ['"y"'].
This example would make it more obvious:
import re
print(re.findall(r'(\d){2}', 'a12b34c56'))
# ['2', '4', '6']
You can see that findall finds the sequence (\d){2} three times and for each it returns the last captured content for the group (\d).
Now about your first pattern: ".*?"".*?".
This one does not contain subgroups, and, according to findall again, in this case it returns:
all non-overlapping matches of pattern in string, as a list of strings.
So for your data it is ['"x""y"'].
AFAIK, findall() gives priority to capture groups: if the regex contains any capture group, findall() returns only the capture group values.
Only when the regex contains no capture group does findall() return the full-match values.
Therefore, if you want findall() to return the full-match value, you must not use a capture group in the regex, like this
(?:".*?"){2}
in which (?: ... ) indicates a non-capturing group.
Thus, in python
print(re.findall(r'(?:".*?"){2}', '"x""y"'))
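If the capturing group has to stay in the pattern, another option (not mentioned in the answers above, so treat it as a suggestion) is re.finditer, which yields match objects. group(0) is then always the full match, whatever groups the pattern contains.

```python
import re

text = '"x""y"'
pattern = r'(".*?"){2}'

# group(0) is the whole match, regardless of capturing groups.
full_matches = [m.group(0) for m in re.finditer(pattern, text)]
print(full_matches)
# ['"x""y"']

# group(1) still holds only the last iteration of the repeated group.
last_iterations = [m.group(1) for m in re.finditer(pattern, text)]
print(last_iterations)
# ['"y"']
```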

regex. Find multiple occurrence of pattern

I have the following string
my_string = "this data is F56 F23 and G87"
And I would like to use regex to return the following output
['F56 F23', 'G87']
So basically, I'm interested in returning all the parts of the string that start with either F or G and are followed by two numbers. In addition, if there are multiple consecutive occurrences I would like regex to group them together.
I approached the problem with python and with this code
import re
re.findall(r'\b(F\d{2}|G\d{2})\b', my_string)
I was able to get all the occurrences
['F56', 'F23', 'G87']
But I would like to have the first two groups together since they are consecutive occurrences. Any ideas of how I can achieve that?
You can use this regex:
\b[FG]\d{2}(?:\s+[FG]\d{2})*\b
Non-capturing group (?:\s+[FG]\d{2})* will find zero or more of the following space separated F/G substrings.
Code:
>>> my_string = "this data is F56 F23 and G87"
>>> re.findall(r'\b[FG]\d{2}(?:\s+[FG]\d{2})*\b', my_string)
['F56 F23', 'G87']
You can do this with:
\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b
in case it is separated by at least one spacing character. If that is not a requirement, you can do this with:
\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b
Both the first and second regex generate:
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
>>> re.findall(r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b',my_string)
['F56 F23', 'G87']
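On the sample string the two variants agree. They differ once codes are written without a separator. The input 'F56F23 G87' below is a made-up example chosen to show that difference.

```python
import re

with_space = r'\b(?:[FG]\d{2})(?:\s+[FG]\d{2})*\b'      # requires whitespace between codes
optional_space = r'\b(?:[FG]\d{2})(?:\s*[FG]\d{2})*\b'  # whitespace is optional

glued = 'F56F23 G87'  # hypothetical input: first two codes have no space between them

# With \s+, 'F56' is followed directly by 'F', so the closing \b fails there,
# and 'F23' has no word boundary in front of it; only 'G87' matches.
strict = re.findall(with_space, glued)
print(strict)
# ['G87']

# With \s*, all three codes chain into a single match.
relaxed = re.findall(optional_space, glued)
print(relaxed)
# ['F56F23 G87']
```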
print(list(map(lambda x: x[0].strip(), re.findall(r'((\b(F\d{2}|G\d{2})\b\s*)+)', my_string))))
Change your regex to r'((\b(F\d{2}|G\d{2})\b\s*)+)': parentheses around the whole thing, \s* to join occurrences that are connected by whitespace, and a + after the last parenthesis to match more than one occurrence (greedy).
Now you have a list of tuples, of which you need the element at index 0. You can use map and lambda for this. To remove trailing blanks I used strip().

Python regexp: get all group's sequence

I have a regex like this '^(a|ab|1|2)+$' and want to get every block it matched, in order...
for example for re.search(reg, 'ab1') I want to get ('ab','1')
I can get an equivalent result with the pattern '^(a|ab|1|2)(a|ab|1|2)$',
but I don't know how many blocks will be matched by (pattern)+
Is this possible, and if yes - how?
try this:
import re
r = re.compile('(ab|a|1|2)')
for i in r.findall('ab1'):
    print(i)
The ab option has been moved to be first, so it will match ab in favor of just a.
The findall method applies your regular expression repeatedly and returns a list of the matched groups. In this simple example you get back a plain list of strings, one per match. With more than one group you would get back a list of tuples, each holding one string per group.
This should work for your second example:
import re
pattern = '(7325189|7325|9087|087|18)'
text = '7325189087'
res = re.compile(pattern).findall(text)
print(pattern, text, res)
I removed the ^ and $ anchors from the pattern because findall has to find more than one substring, so it must be able to search anywhere in text. I also removed the + so that each match is a single occurrence of one of the options in the pattern.
Your original expression does match the way you want to, it just matches the entire string and doesn't capture individual groups for each separate match. Using a repetition operator ('+', '*', '{m,n}'), the group gets overwritten each time, and only the final match is saved. This is alluded to in the documentation:
If a group matches multiple times, only the last match is accessible.
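One way to combine both requirements, sketched here with a hypothetical helper matched_blocks, is to validate the whole string first and then list the blocks with findall. For this particular set of alternatives the split is unambiguous; for other alternations the blocks returned by findall are not guaranteed to be the ones the + repetition would have used.

```python
import re

def matched_blocks(text):
    """Return the list of blocks if the whole text matches, else None."""
    # Step 1: check the whole string, using a non-capturing group.
    if re.fullmatch(r'(?:ab|a|1|2)+', text) is None:
        return None
    # Step 2: list the individual blocks, trying 'ab' before 'a'.
    return re.findall(r'ab|a|1|2', text)

print(matched_blocks('ab1'))
# ['ab', '1']
print(matched_blocks('abx'))
# None
```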
I think you don't need regexps for this problem;
you need some kind of recursive search function.
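A rough sketch of what such a recursive search could look like. The function segmentations is hypothetical, it assumes every token is a non-empty string, and it enumerates every way of splitting the text rather than only the one a regex engine would pick.

```python
def segmentations(text, tokens):
    """Yield every way to split text into a sequence of the given tokens."""
    if not text:
        # Nothing left to split: one valid (empty) continuation.
        yield []
        return
    for token in tokens:
        if text.startswith(token):
            # Consume the token and recurse on the remainder.
            for rest in segmentations(text[len(token):], tokens):
                yield [token] + rest

print(list(segmentations('ab1', ['a', 'ab', '1', '2'])))
# [['ab', '1']]

print(list(segmentations('7325189087', ['7325189', '7325', '9087', '087', '18'])))
# [['7325189', '087'], ['7325', '18', '9087']]
```

Unlike findall, this reports both possible splits of the second string, at the cost of exponential time in the worst case.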
