Regex, greedy quantifiers multiple capture groups - python

I would like to capture n words surrounding a word x without whitespaces. I need a capture group for each word. I can achieve this in the following way (here words after x):
import regex
n = 2
x = 'beef tomato chicken trump Madonna'
right_word = r'\s+(\S+)'
regex_right = r'^\S*{}\s*'.format(n*right_word)
m_right = regex.search(regex_right, x)
print(m_right.groups())
So if x = 'beef tomato chicken trump Madonna' and n = 2, then regex_right = '^\S*\s+(\S+)\s+(\S+)\s*', and I get two capture groups containing 'tomato' and 'chicken'. However, if n = 5 I capture nothing, which is not the behavior I was looking for. For n = 5 I want to capture all words to the right of 'beef'.
I have tried using the greedy quantifier
regex_right = r'^\S*(\s+\S+){,n}\s*'
but I only get a single group (the last word) no matter how many matches I get (furthermore I get the white spaces as well..).
I finally tried using regex.findall but I cannot limit it to n words but have to specify number of characters?
Can anyone help?
Wiktor helped me (see below), thanks. However, I have an additional problem:
if
x = 'beef, tomato, chicken, trump Madonna'
I cannot figure out how to capture the words without the commas. I do not want groups such as 'tomato,'.

You did not match all those words with the first approach because the pattern did not match the input string. You need to make the right_word pattern optional by enclosing it with (?:...)?:
import re
x = 'beef tomato chicken trump Madonna'
n = 5
right_word = r'(?:\s+(\S+))?'
regex_right = r'^\S*{}'.format(n*right_word)
print(regex_right)
m_right = re.search(regex_right, x)
if m_right:
    print(m_right.groups())
See the Python demo.
The second approach will only work with the PyPi regex module, because Python re does not keep repeated captures: once a quantified capturing group matches a substring again within the same match iteration, its value is overwritten.
>>> right_word = '\s+(\S+)'
>>> n = 5
>>> regex_right = r'^\S*(?:\s+(\S+)){{1,{0}}}'.format(n)
>>> result = [x.captures(1) for x in regex.finditer(regex_right, "beef tomato chicken trump Madonna")]
>>> result
[['tomato', 'chicken', 'trump', 'Madonna']]
>>> print(regex_right)
^\S*(?:\s+(\S+)){1,5}
Note that ^\S*(?:\s+(\S+)){1,5} has capturing group #1 inside a non-capturing group that is quantified with the {1,5} limiting quantifier. Since PyPi regex keeps track of all values captured with repeated capturing groups, they are all accessible via .captures(1) here. .NET regex keeps repeated captures as well, so you can also test this feature with a .NET regex tester.

You got the correct approach. However regex can't do what you're asking for. Each time your capturing group captures another pattern, the previous content is replaced. That is why your capturing group only returns the last pattern captured.
You can easily match n words, but you can't capture them separately without writing each capture group explicitly.
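For the follow-up about commas, one possible sketch (an assumption, not something taken from the answers above) is to treat any run of non-word characters as the separator, that is, to use \W+ instead of \s+ and \w+ instead of \S+ in the optional-group approach. The helper name words_right_of is hypothetical.

```python
import re

def words_right_of(text, n):
    """Return up to n words that follow the first word of text (hypothetical helper)."""
    # Each optional block skips separators (spaces, commas, ...) and captures one word.
    right_word = r'(?:\W+(\w+))?'
    pattern = r'^\w*' + n * right_word
    m = re.search(pattern, text)
    # Optional groups that did not take part in the match are None, so drop them.
    return [g for g in m.groups() if g is not None] if m else []

print(words_right_of('beef, tomato, chicken, trump Madonna', 5))
# ['tomato', 'chicken', 'trump', 'Madonna']
print(words_right_of('beef, tomato, chicken, trump Madonna', 2))
# ['tomato', 'chicken']
```

Note that \w also matches digits and underscores, so this only fits input where "word" can be read that way.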


Regex of sequences surrounded by specific pattern with overlapping problem

I am very new to using Python's re.finditer, but I am trying to find a fairly complex pattern, the G-quadruplex motif, which is described below.
The sequence starts with at least 3 g's, followed by a/t/g/c multiple times until the next group of ggg+ shows up. This repeats 3 times, resulting in a pattern like ggg+...ggg+...ggg+...ggg+
Case should be ignored, and matches may overlap in the original sequence: ggg+...ggg+...ggg+...ggg+...ggg+...ggg+ should return 3 such patterns.
I have struggled with this for some time and can only find a way like:
re.finditer(r"(?=(([Gg]{3,})([AaTtCcGg]+?(?=[Gg]{3,})[Gg]{3,}){3}))", seq)
and then filter out the matches that do not share a start position with a match of
re.finditer(r"([Gg]{3,})", seq)
Is there any better way to extract this type of sequence? And no for loops please since I have millions of rows like this.
Thank you very much!
PS: an example can be like this
ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC
1.ggggggcgggggggACGCTCggctcAAGGGCTCCGGG
2.gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg
3.GGGCTCCGGGCCCCgggggggACgcgcgAAGGG
You could start the match, asserting that what is directly to the left is not a g char to prevent matching on too many positions.
To match both upper and lowercase chars, you can make the pattern case insensitive using re.I
The value is in capture group 1, which will be returned by re.findall.
(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))
(?<!g) Negative lookbehind, assert not g directly to the left
(?= Positive lookahead
( Capture group 1
g{3,} Match 3 or more g chars to start with
(?: Non capture group
[atc](?:g{0,2}[atc])* Match a, t or c, then optionally repeat 0, 1 or 2 g chars followed by a, t or c, so the gap never crosses a ggg run
g{3,} Match 3 or more g chars to end with
){3} Close non capture group and repeat 3 times
) Close group 1
) Close lookahead
import re
pattern = r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))"
s = ("ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC \n")
print(re.findall(pattern, s, re.I))
Output
[
'ggggggcgggggggACGCTCggctcAAGGGCTCCGGG',
'gggggggACGCTCggctcAAGGGCTCCGGGCCCCggggggg',
'GGGCTCCGGGCCCCgggggggACgcgcgAAGGG'
]
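If the start positions are needed as well (the question mentions comparing start positions), a minimal sketch using re.finditer with the same pattern could look like the following. Since the whole match is a zero-width lookahead, the motif itself is in group 1 and m.start(1) is where it begins. The variable names are illustrative only.

```python
import re

pattern = re.compile(r"(?<!g)(?=(g{3,}(?:[atc](?:g{0,2}[atc])*g{3,}){3}))", re.I)
seq = "ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC"

# Collect (start position, motif) pairs; overlapping motifs are found because
# the lookahead consumes no characters.
hits = [(m.start(1), m.group(1)) for m in pattern.finditer(seq)]
for start, motif in hits:
    print(start, motif)
```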
import regex as re
seq = 'ggggggcgggggggACGCTCggctcAAGGGCTCCGGGCCCCgggggggACgcgcgAAGGGCTCC'
for match in re.finditer(r'(?<=(^|[^g]))g{3,}([atc](g{,2}[atc]+)*g{3,}){3}', seq, overlapped=True, flags=re.I):
    print(match[0])
Overlapped works by restarting the search from the character after the start of the current match rather than the end of it. This would give you a bunch of essentially duplicate results, each just removing an extra leading G. To stop that, check for a preceding G with a lookbehind:
(?<=(^|[^g]))
The middle section needs to be a bit more complicated to require an ATC, preventing a run of seven G's from being split up within a single match. So require one, then allow any number of runs of fewer than three G's followed by more ATCs, repeating as needed:
[atc](g{,2}[atc]+)*
The rest is just the Gs and the repeating.

getting only total match in a regex method checking multiple patterns in python

I would like to match several expressions or words in a text as follows
patterns = [r'(\bbmw\w*\b)', # bmw
r'(\bopel\w?\b)', # opel
r'(\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b))' # toyota corolla
]
# assume here that I am dealing with hundreds of regex coming from different coders.
text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'
def checkPatternInText(text, patterns):
    total_matches = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if len(matches) > 0:
            print(type(matches))
            if type(matches[0]) == type('astring'):
                total_matches.append(matches[0])
            else:
                total_matches.append(matches[0][0])
        print(matches)
    return total_matches
result = (checkPatternInText(text, patterns))
The result of this method is:
['bmw', 'opel', 'toyota the nice corolla diesel']
I check the type of matches because if the match is a single word then the type is string, and if the pattern produced several groups the match is a tuple with all of them. From this tuple of groups I want the longest one, which is the first in the tuple, hence matches[0][0].
Is there a more elegant way to do this without resorting to checking the variable type of the matches?
As a second question: I had to add () around all the patterns in order to access group 0, which is THE WHOLE MATCH. How would you proceed if the patterns do not have () around them?
It was suggested that this question has an answer here:
re.findall behaves weird
The situation is not totally the same since I have here a COLLECTION OF PATTERNS some might be surrounded by () some others not. Some might have groups, some others might not.
I am trying to get a more reliable solution than the one I proposed.
When you deal with one single pattern you can always resort to modifying the pattern (as last resort), when you are dealing with a collection of patterns a more general solution might be required.
The solution of making 1 regex for the three cases is not applicable. The real case has around 100 different regex and more and more are being continuously added.
You can achieve this in a single regex in re.findall using alternations:
\b(?:bmw|opel|toyota\s+(?:\w+\s+){0,2}corolla\s+diesel)\b
Code:
>>> import re
>>> text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'
>>> print (re.findall(r'\b(?:bmw|opel|toyota\s+(?:\w+\s+){0,2}corolla\s+diesel)\b', text))
['bmw', 'opel', 'toyota the nice corolla diesel']
RegEx Details:
\b: Word boundary
(?:: Start non-capture group
bmw: Match bmw
|: OR
opel: Match opel
|: OR
toyota\s+(?:\w+\s+){0,2}corolla\s+diesel: Match toyota substring
): End non-capture group
\b: Word boundary
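When the patterns cannot be merged or edited, as the question states, one alternative sketch is to avoid re.findall entirely: re.search (or re.finditer) returns match objects, and match.group(0) is always the whole match, whether or not the pattern contains groups or is wrapped in (). The function name first_full_matches and the mixed pattern list below are made up for illustration.

```python
import re

# Illustrative collection: no groups, wrapped in (), and with inner groups.
patterns = [
    r'\bbmw\w*\b',
    r'(\bopel\w?\b)',
    r'\btoyota\w?\b\s+(\w+\s+){0,2}(\bcorolla\w?\b\s+\bdiesel\w?\b)',
]
text = 'there is a bmw and also an opel and also this span with toyota the nice corolla diesel'

def first_full_matches(text, patterns):
    """Return the full text of the first match of each pattern that matches."""
    results = []
    for pattern in patterns:
        m = re.search(pattern, text)
        if m:
            # group(0) is the entire match, independent of any capture groups.
            results.append(m.group(0))
    return results

result = first_full_matches(text, patterns)
print(result)
# ['bmw', 'opel', 'toyota the nice corolla diesel']
```

To collect every occurrence rather than the first one, the same idea works with [m.group(0) for m in re.finditer(pattern, text)].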

python regex unexpected match groups

I am trying to find all occurrences of either "_"+digit or "^"+digit, using the regex ((_\^)[1-9])
The groups I'd expect back eg for "X_2ZZZY^5" would be [('_2'), ('^5')] but instead I am getting [('_2', '_'), ('^5', '^')]
Is my regex incorrect? Or is my expectation of what gets returned incorrect?
Many thanks
** my original re used (_|\^) this was incorrect, and should have been (_\^) -- question has been amended accordingly
You have 2 groups in your regex, so you're getting 2 groups. You also need to match at least 1 digit after the special character.
Try this:
([_\^][1-9]+)
See it in action here
Demand at least 1 digit (1-9) following the special characters _ or ^, placed inside a single capture group:
import re
text = "X_2ZZZY^5"
pattern = r"([_\^][1-9]{1,})"
regex = re.compile(pattern)
res = re.findall(regex, text)
print(res)
Returning:
['_2', '^5']
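To make the behaviour concrete: re.findall returns a tuple per match when the pattern has more than one group, a string per match when it has exactly one group, and the whole match when it has none. The short comparison below is a sketch with illustrative variable names; it uses the alternation form (_|\^) to reproduce the tuples from the question.

```python
import re

text = "X_2ZZZY^5"

# Two capturing groups: findall returns one tuple per match.
two_groups = re.findall(r"((_|\^)[1-9])", text)
print(two_groups)   # [('_2', '_'), ('^5', '^')]

# Non-capturing inner group, no outer group: findall returns whole matches.
no_groups = re.findall(r"(?:_|\^)[1-9]", text)
print(no_groups)    # ['_2', '^5']

# A character class avoids the inner group altogether.
char_class = re.findall(r"[_^][1-9]", text)
print(char_class)   # ['_2', '^5']
```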

Limiting regex length

I'm having an issue in Python creating a regex that returns each occurrence matching a pattern.
I have this code that I made that I need help with.
strToSearch= "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(re.findall('\d{1}[A-Z]{1}\d{3}', strToSearch.upper())) #1C331, 1N111
print(re.findall('\d{1}[A-Z]{1}\d{1}[X]\d{1}', strToSearch.upper())) #1A3X1
print(re.findall('\d{1}[A-Z]{1}\d{3}[A-Z]{1}', strToSearch.upper())) #1A851B
print(re.findall('\d{1}[A-Z]{1}\d{1}', strToSearch.upper())) #1A3
>['1A851', '1C331', '1N111']
>['1A3X1']
>['1A851B']
>['1A8', '1C3', '1A3', '1N1', '1A3']
As you can see, the first regex returns "1A851", which I don't want. How do I keep it from showing up in the first regex? One thing to know is that a code may appear in the string like " words words 1A851B?", so I need to keep the punctuation from being grabbed.
Also, how can I combine these into one regex? Essentially my end goal is to run a loop in Python similar to the pseudo code below.
lstResults = []
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = re.findall('<REGEX HERE>', strToSearch)
for r in lstResults:
    print(r)
And the desired output would be
1N1X1
3C191
1A831B
1A8
With single regex pattern:
strToSearch= " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = [i[0] for i in re.findall(r'(\d[A-Z]\d{1,3}(X\d|[A-Z])?)', strToSearch)]
print(lstResults)
The output:
['1N1X1', '3C191', '1A831B', '1A8']
You may use word boundaries:
\b\d{1}[A-Z]{1}\d{3}\b
See demo
For the combination, it is unclear by which criterion you consider a word a "random word", but you can use something like this:
[A-Z\d]*\d[A-Z\d]*[A-Z][A-Z\d]*
This is a word that contains at least a digit and at least a non-digit character. See demo.
Or maybe you can use:
\b\d[A-Z\d]*[A-Z][A-Z\d]*
for a word that starts with a digit and contains at least a non-digit character. See demo.
Or if you want to combine exactly those regexes, use:
\b\d[A-Z]\d(X\d|\d{2}[A-Z]?)?\b
See the final demo.
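As a quick check of that last combined pattern against the strings from the question, here is a small sketch. The only change is that the optional group is made non-capturing, (?:...), so that re.findall returns the whole match instead of the group contents.

```python
import re

# Combined pattern from above, with the optional part made non-capturing.
combined = re.compile(r'\b\d[A-Z]\d(?:X\d|\d{2}[A-Z]?)?\b')

strToSearch = " Alot of 1N1X1 people like to eat 3C191 cheese and I'm a 1A831B aka 1A8."
lstResults = combined.findall(strToSearch)
for r in lstResults:
    print(r)
# 1N1X1
# 3C191
# 1A831B
# 1A8

first_example = "1A851B 1C331 1A3X1 1N111 1A3 and a whole lot of random other words."
print(combined.findall(first_example.upper()))
# ['1A851B', '1C331', '1A3X1', '1N111', '1A3']
```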
If you want to find "words" where there are both digits and letters mixed, the easiest is to use the word-boundary operator, \b; but notice that you need to use r'' strings / escape the \ in the code (which you would need to do for the \d anyway in future Python versions). To match any sequence of alphanumeric characters separated by word boundary, you could use
r'\b[0-9A-Z]+\b'
However, this wouldn't yet guarantee that there is at least one number and at least one letter. For that we will use positive zero-width lookahead assertion (?= ) which means that the whole regex matches only if the contained pattern matches at that point. We need 2 of them: one ensures that there is at least one digit and one that there is at least one letter:
>>> p = r'\b(?=[0-9A-Z]*[0-9])(?=[0-9A-Z]*[A-Z])[0-9A-Z]+\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', 'A1', '1A123B']
This will now match everything including 33333A or AAAAAAAAAA3A for as long as there is at least one digit and one letter. However if the pattern will always start with a digit and always contain a letter, it becomes slightly easier, for example:
>>> p = r'\b\d+[A-Z][0-9A-Z]*\b'
>>> re.findall(p, '1A A1 32 AA 1A123B')
['1A', '1A123B']
i.e. A1 didn't match because it doesn't start with a digit.

Python re.finditer match.groups() does not contain all groups from match

I am trying to use regex in Python to find and print all matching lines from a multiline search.
The text that I am searching through may have the below example structure:
AAA
ABC1
ABC2
ABC3
AAA
ABC1
ABC2
ABC3
ABC4
ABC
AAA
ABC1
AAA
From this I want to retrieve the ABC*s that occur at least once and are preceded by an AAA.
The problem is, that despite the group catching what I want:
match = <_sre.SRE_Match object; span=(19, 38), match='AAA\nABC2\nABC3\nABC4\n'>
... I can access only the last match of the group:
match groups = ('AAA\n', 'ABC4\n')
Below is the example code that I use for this problem.
#! python
import sys
import re
import os
string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"
print(string)
p_MATCHES = []
p_MATCHES.append( (re.compile('(AAA\n)(ABC[0-9]\n){1,}')) ) #
matches = re.finditer(p_MATCHES[0],string)
for match in matches:
    strout = ''
    gr_iter=0
    print("match = "+str(match))
    print("match groups = "+str(match.groups()))
    for group in match.groups():
        gr_iter+=1
        sys.stdout.write("TEST GROUP:"+str(gr_iter)+"\t"+group) # test output
        if group is not None:
            if group != '':
                strout+= '"'+group.replace("\n","",1)+'"'+'\n'
    sys.stdout.write("\nCOMPLETE RESULT:\n"+strout+"====\n")
Here is your regular expression:
(AAA\r\n)(ABC[0-9]\r\n){1,}
Debuggex Demo
Your goal is to capture all ABC#s that immediately follow AAA. As you can see in this Debuggex demo, all ABC#s are indeed being matched (they're highlighted in yellow). However, since only the "what is being repeated" part
ABC[0-9]\r\n
is being captured (is inside the parentheses), and its quantifier,
{1,}
is not being captured, this therefore causes all matches except the final one to be discarded. To get them, you must also capture the quantifier:
AAA\r\n((?:ABC[0-9]\r\n){1,})
Debuggex Demo
I've placed the "what is being repeated" part (ABC[0-9]\r\n) into a non-capturing group. (I've also stopped capturing AAA, as you don't seem to need it.)
The captured text can be split on the newline, and will give you all the pieces as you wish.
(Note that \n by itself doesn't work in Debuggex. It requires \r\n.)
This is a workaround. Not many regular expression flavors offer the capability of iterating through repeating captures (which ones...?). A more common approach is to loop through the matches and process each one as it is found. Here's an example from Java:
import java.util.regex.*;
public class RepeatingCaptureGroupsDemo {
    public static void main(String[] args) {
        String input = "I have a cat, but I like my dog better.";
        Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.println(m.group());
        }
    }
}
Output:
cat
dog
(From http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/, about a 1/4 down)
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The links in this answer come from it.
You want the run of consecutive ABC\n lines occurring after an AAA\n, matched as greedily as possible. You also want only the group of consecutive ABC\n lines, not a tuple of that group and the most recent ABC\n. So in your regex, exclude the subgroup within the group.
First write the pattern that represents the whole string you want to match.
AAA\n(ABC[0-9]\n)+
Then capture the one you are interested in with (), while remembering to exclude subgroup(s)
AAA\n((?:ABC[0-9]\n)+)
You can then use either findall() or finditer(). I find finditer easier, especially when you are dealing with more than one capture.
finditer:-
import re
matches_iter = re.finditer(r'AAA\n((?:ABC[0-9]\n)+)', string)
[print(i.group(1)) for i in matches_iter]
findall, using the original {1,} since it's just a more verbose form of + :-
matches_all = re.findall(r'AAA\n((?:ABC[0-9]\n){1,})', string)
[[print(x) for x in y.split("\n")] for y in matches_all]
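If the lines are needed as data rather than printed, a variant of the same idea is to collect them into one list per AAA block. This is only a sketch: the input is the test string from the question, and the name blocks is illustrative.

```python
import re

string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"

# Group 1 holds the whole run of ABC<digit> lines; splitlines() breaks it into
# individual lines without leaving a trailing empty string.
blocks = [m.group(1).splitlines()
          for m in re.finditer(r'AAA\n((?:ABC[0-9]\n)+)', string)]
print(blocks)
# [['ABC1', 'ABC2', 'ABC3'], ['ABC1', 'ABC2', 'ABC3', 'ABC4'], ['ABC1']]
```

The bare ABC line and the final AAA do not appear because neither satisfies ABC[0-9]\n after an AAA\n.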
