How to find all occurrences of substrings except the first three, using REGEX? - python

I have the following string,
"ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
In the above string, using REGEX, I want to find all occurrences of 'TAG' except the first 3 occurrences.
I used this REGEX, '(TAG.*?){4}', but it only finds the 4th occurrence ('TAG:') and not the others ('TAG.','TAG ','TAGH').

If you want a capture group with all the remaining matches, you have to consume the first ones first:
(TAG.*?){3}(TAG.*?)*
This matches the first 3 occurrences in the first capture group and matches the rest in the 2nd.
If you don't want the first matches to be in a capture group, you can flag it as non-capturing group:
(?:TAG.*?){3}(TAG.*?)*
Depending on your example, I think the regex inside the capture group is not correct yet. If this doesn't give you the right idea of how to do this, please give us an example of the matches you want to see, and I'll edit my answer.
EDIT:
I get the feeling that you want to capture the 4th and following occurrences in their own capture groups while still ignoring the first 3 occurrences.
I can't properly explain why, but I think that's not possible for the following reasons:
Ignoring the first 3 occurrences in their own (non-)capturing group forces you to abandon the 'g' modifier for finding all occurrences (because that would just do 'ignore 3 TAGS, find 1' in a loop).
It is not possible to capture multiple values with just one capture group. Trying to do that always captures the last occurrence. It is possible to capture all occurrences together in a single capture group rather than just the last one, but it seems like you want them in separate groups.
So, how to solve this?
I'd come up with a proper regex for one TAG and repeat that using find-all or the g modifier. In Python you can then simply take all findings, skipping the first 3:
import re
s = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
pattern = r"(?:TAG((?:(?!TAG).)+))"
findings = re.findall(pattern, s)[3:]  # drop the first 3 matches, keep the rest
If you want to ignore the first character after TAG, just add a . behind TAG:
pattern = r"(?:TAG.((?:(?!TAG).)+))"
Explanation of the regex:
- I use ?: to make some capturing groups non-capturing groups. I only want to deal with one capture group.
- To get rid of the non-greedy modifier and be a little more specific about what we actually want, I've introduced the negative lookahead after the TAG occurrence.
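If the positions of the later occurrences matter as well, one variation on the same "match everything, then skip" idea is to iterate over match objects and drop the first three. This is only a sketch, not part of the original answer: the names text, skip and later_tags are illustrative, and it uses the simpler pattern TAG. (TAG plus one following character) instead of the lookahead pattern.

```python
import re
from itertools import islice

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
skip = 3  # number of leading occurrences to ignore (illustrative name)

# finditer yields match objects lazily; islice drops the first `skip` of them.
later_tags = list(islice(re.finditer(r"TAG.", text), skip, None))

# Each match object still knows where in the string it was found.
found = [(m.start(), m.group()) for m in later_tags]
print([g for _, g in found])  # ['TAG:', 'TAG.', 'TAG ', 'TAGH']
```

The slice on a findall result does the same job when only the matched text is needed; finditer is mainly useful when the offsets are needed too.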

Related

Matching two identical groups of characters in pandas with some number of characters in between

I'm trying to extract two numbers of interest from a string of docket text in a pandas dataframe. Here's an example with a couple of the idiosyncrasies that exist in the data
import pandas as pd
df = pd.DataFrame(["Fee: $ 15,732, and Expenses: $1,520.62."])
I used regexr to test some ideas and the closest I've been able to come up with is something along the lines of
df[0].str.extract("(\${0,2}\s*(\d+[,\.]*){1,5})")
Which returns:
0 1
0 $15,732,, 732,,
The problems I'm running into are making characters optional while capturing the groups (i.e. I don't know how to get rid of the inner parenthesis because if I make it brackets then I get an error). And then ideally I'd be able to match the other set of numbers too.
I used regexr and while I can make regular expressions that match what I want, I'm struggling with the grouping part so that I can capture both while not needing to use a cumbersome function like apply with re.
There are sometimes numbers that show up again later in the report that include dates, other numbers, etc... So I'm trying to find a pretty controlled sequence (Can't get too liberal with the .*'s haha)
The string I ended up writing after the hint provided in the comments is:
\$((?:\d+(?:[,\.])*)+).*?\$((?:\d+(?:[,\.])*)+). The non-capturing groups are what I hadn't understood before. I thought a non-capturing group meant that it would somehow remove the parts that matched from the group, but what it really means is that it's a group of characters that doesn't count as a capture group (not that they'll be removed from a group).
I appreciate the feedback I got on this post!
I am not sure if the text stays the same across all of the values but you can use the following regex:
r'Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\.'
returning two matching groups:
15,732
1,520.62
You can also abstract the text:
r'\w+:\s*\$\s?([\d,.]+),(\s*\w+)+:\s*\$\s?([\d,.]+)\.'
with the same result.
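To check the fixed-wording pattern quickly, it can be run through plain re on the sample string from the question. This is a sketch that assumes the docket text has exactly this wording; the variable names are illustrative.

```python
import re

docket = "Fee: $ 15,732, and Expenses: $1,520.62."
fee_pattern = re.compile(
    r"Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\."
)

m = fee_pattern.search(docket)
# groups() holds the two captured amounts, without the dollar signs.
fee, expenses = m.groups() if m else (None, None)
print(fee, expenses)  # 15,732 1,520.62
```

The same pattern string can be passed to df[0].str.extract, which returns one column per capture group.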
You can use
df[0].str.extract(r"(\$\s*\d+(?:[,.]\d+)*)") # To get the first value
df[0].str.extractall(r"(\$\s*\d+(?:[,.]\d+)*)") # To get all values
df[0].str.findall(r"\$\s*\d+(?:[,.]\d+)*") # To get all values
The str.extract pattern is wrapped in a capturing group because the method requires at least one capturing group in the regex pattern in order to return a value.
The regex matches
\$ - a $ char
\s* - zero or more whitespaces
\d+ - one or more digits
(?:[,.]\d+)* - a non-capturing group matching zero or more repetitions of a comma/dot and then one or more digits.
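As a sanity check of the breakdown above, the same pattern can be tried with re.findall on the sample string, followed by a clean-up step that turns the matches into numbers. The helper to_number is hypothetical and assumes that commas are thousands separators; it is not part of the answer.

```python
import re

text = "Fee: $ 15,732, and Expenses: $1,520.62."
amount = re.compile(r"\$\s*\d+(?:[,.]\d+)*")

matches = amount.findall(text)
print(matches)  # ['$ 15,732', '$1,520.62']

def to_number(raw):
    """Hypothetical helper: drop '$', whitespace and thousands commas."""
    return float(raw.replace("$", "").replace(",", "").strip())

values = [to_number(m) for m in matches]
print(values)  # [15732.0, 1520.62]
```

Because the group after \d+ insists on digits following each comma or dot, the trailing ", " and "." in the sentence are left out of the match.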

How can I find out which alternate regular expression matched my string

If I have a regular expression which has a | operator separating two possible patterns. Is it possible to find which pattern was the one that matched my string?
For example, if I have the pattern ([cC]at|[dD]og) and I find a match in the string clifford is a dog. Can I then look back to see that the pattern [dD]og was the successful match and not the alternative: [cC]at.
I understand that I could try to match each alternate pattern individually and then just take the successful ones but I am wondering if there is another solution that doesn't require a match attempt for each pattern (I'm hoping to apply this in a situation where I'm trying to match several hundred patterns at once)
You can use two different groups and check its index, like this:
([cC]at)|([dD]og)
Regex demo
Match information
MATCH 1
Group 2. [14-17] `dog`
MATCH 2
Group 1. [33-36] `cat`
Btw, if for some reason you have to group the whole alternation you can use a non capturing group like this:
(?:([cC]at)|([dD]og))
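In Python, the group check can be done without testing every group by hand. The sketch below gives each alternative a named group and reads the match object's lastgroup attribute. The alternatives table and the function name which_alternative are made up for illustration, and the approach assumes that the sub-patterns contain no capture groups of their own.

```python
import re

# Hypothetical label -> sub-pattern table; sub-patterns have no groups of their own.
alternatives = {"cat": r"[cC]at", "dog": r"[dD]og"}

# One named group per alternative, joined into a single alternation.
combined = re.compile(
    "|".join(f"(?P<{label}>{pat})" for label, pat in alternatives.items())
)

def which_alternative(text):
    """Return (label, matched text) for every match in text."""
    # lastgroup is the name of the last capturing group that matched.
    return [(m.lastgroup, m.group()) for m in combined.finditer(text)]

hits = which_alternative("clifford is a dog, garfield is a Cat")
print(hits)  # [('dog', 'dog'), ('cat', 'Cat')]
```

With unnamed groups, m.lastindex gives the group number in the same way.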

difference between two regular expressions: [abc]+ and ([abc])+

In [29]: re.findall("([abc])+","abc")
Out[29]: ['c']
In [30]: re.findall("[abc]+","abc")
Out[30]: ['abc']
I'm confused by the grouped one. How does the group make a difference?
There are two things that need to be explained here: the behavior of quantified groups, and the design of the findall() method.
In your first example, [abc] matches the a, which is captured in group #1. Then it matches b and captures it in group #1, overwriting the a. Then again with the c, and that's what's left in group #1 at the end of the match.
But it does match the whole string. If you were using search() or finditer(), you would be able to look at the MatchObject and see that group(0) contains abc and group(1) contains c. But findall() returns strings, not MatchObjects. If there are no groups, it returns a list of the overall matches; if there are groups, the list contains all the captures, but not the overall match.
So both of your regexes are matching the whole string, but the first one is also capturing and discarding each character individually (which is kinda pointless). It's only the unexpected behavior of findall() that makes it look like you're getting different results.
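The difference between the overall match and the group is easy to see by asking for a match object instead of using findall. A short sketch (the variable names are illustrative):

```python
import re

m = re.search(r"([abc])+", "abc")
whole = m.group(0)  # the overall match
last = m.group(1)   # only the final iteration of the repeated group

only_groups = re.findall(r"([abc])+", "abc")  # findall reports the group
no_groups = re.findall(r"[abc]+", "abc")      # findall reports the match
print(whole, last, only_groups, no_groups)    # abc c ['c'] ['abc']
```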
In the first example you have a repeated capturing group, which only captures the last iteration (here, c).
([abc])+
In the second example you are matching a single character from the list between one and unlimited times.
[abc]+
Here's the way I would think about it. ([abc])+ is attempting to repeat a captured group. When you use "+" after the capture group, it doesn't mean you are going to get two captured groups. What ends up happening, at least for Python's regex and most implementations, is that the "+" forces iteration until the capture group only contains the last match.
If you want to capture a repeated expression, you need to reverse the ordering of "(...)" and "+", e.g. instead of ([abc])+ use ([abc]+).
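A small comparison of the two orderings, using a made-up input with two runs of letters so the effect shows up more than once:

```python
import re

text = "abc-cab"  # illustrative input with two runs of a/b/c

inside = re.findall(r"([abc])+", text)   # group inside the repetition
outside = re.findall(r"([abc]+)", text)  # repetition inside the group
each = re.findall(r"[abc]", text)        # one match per character

print(inside)   # ['c', 'b']
print(outside)  # ['abc', 'cab']
print(each)     # ['a', 'b', 'c', 'c', 'a', 'b']
```

If every individual character is wanted, matching [abc] on its own, as in the last line, is usually simpler than trying to make one group hold several values.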
input "abc"
[abc]
match a single character => "a"
[abc]+
+ Between one and unlimited times, as many times as possible => "abc"
([abc])
Capturing group ([abc]) => "a"
([abc])+
+ A repeated capturing group will only capture the last iteration => "c"
Grouping just gives different preference.
([abc])+ => Find one from selection. Can match one or more. It finds one and all conditions are met as the + means 1 or more. This breaks up the regex into two stages.
While the ungrouped one is treated as a whole.

Don’t understand lazy regex

Say we have a string 1abcd1efg1hjk1lmn1 and want to find stuff between 1-s. What we do is
re.findall('1.*?1','1abcd1efg1hjk1lmn1')
and get two results
['1abcd1', '1hjk1']
ok I get that. But if we do
re.findall('1.*?1hj','1abcd1efg1hjk1lmn1')
why does it grab TWO intervals between 1s instead of one? Why do we get ['1abcd1efg1hj'] instead of ['1efg1hj']? Isn’t this what laziness is supposed to do?
Regex always tries to match the input string from left to right. Consider your '1.*?1hj' regex. The 1 in your regex matches the first 1 in the string, and the following .*? non-greedily matches all the characters up to the 1hj substring. That is why you got ['1abcd1efg1hj'] instead of ['1efg1hj'].
To get ['1efg1hj'] as output, you need to use a negated class, as in 1[^1]*1hj
>>> s = "1abcd1efg1hjk1lmn1"
>>> re.findall(r'1.*?1hj', s)
['1abcd1efg1hj']
>>> re.findall(r'1[^1]*1hj', s)
['1efg1hj']
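Looking at the match spans makes the point explicit: laziness only affects where a match ends, not where it starts. A sketch with the same string (variable names are illustrative):

```python
import re

s = "1abcd1efg1hjk1lmn1"

lazy = re.search(r"1.*?1hj", s)       # starts at the first 1 it can use
negated = re.search(r"1[^1]*1hj", s)  # cannot cross another 1, so starts later

print(lazy.span(), negated.span())  # (0, 12) (5, 12)
```

Both matches end at the same place; only the starting position differs, because the engine accepts the leftmost start that can be completed at all.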
['1abcd1efg1hj']
You get this because it satisfies your regex. 1.*?1hj essentially means: start from a 1, then move lazily until you find a 1 followed by hj. The 1 in between is followed by ef, so 1hj does not match there, but . consumes it along with everything else on the way. You don't get ['1efg1hj'] because that part of the string has already been consumed by the first match. Use a lookahead to see that both candidates satisfy the condition. See the demo.
A lookahead does not consume the string, so you get both matches:
https://regex101.com/r/aQ3zJ3/5
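In Python, the lookahead idea can be sketched by wrapping the pattern in a capturing group inside (?=...). This is an assumed reconstruction and may differ from the pattern used in the linked demo.

```python
import re

s = "1abcd1efg1hjk1lmn1"

# The lookahead itself matches the empty string, so the search can restart
# one character later and find candidates that overlap.
overlapping = re.findall(r"(?=(1.*?1hj))", s)
print(overlapping)  # ['1abcd1efg1hj', '1efg1hj']

# The same trick returns every interval between neighbouring 1s.
intervals = re.findall(r"(?=(1[^1]*1))", s)
print(intervals)  # ['1abcd1', '1efg1', '1hjk1', '1lmn1']
```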

Matching an object and a specific regex with Python

Given a text, I need to check for each char if it has exactly (edited) 3 capital letters on both sides and, if it does, add it to a string of such characters that is returned.
I wrote the following: m = re.match("[A-Z]{3}.[A-Z]{3}", text)
(let's say text="AAAbAAAcAAA")
I expected to get two groups in the match object: "AAAbAAA" and "AAAcAAA"
Now, when I invoke m.group(0) I get "AAAbAAA", which is right. Yet when invoking m.group(1), I find that there is no such group, meaning "AAAcAAA" wasn't a match. Why?
Also, when invoking m.groups(), I get an empty tuple although I should get a tuple of the matches, meaning that in my case I should have gotten a tuple with "AAAbAAA". Why doesn't that work?
You don't have any groups in your pattern. To capture something in a group, you have to surround it with parentheses:
([A-Z]{3}).[A-Z]{3}
The exception is m.group(0), which will always contain the entire match.
Looking over your question, it sounds like you aren't actually looking for capture groups, but rather overlapping matches. In regex, a group means a smaller part of the match that is set aside for later use. For example, if you're trying to match phone numbers with something like
([0-9]{3})-([0-9]{3}-[0-9]{4})
then the area code would be in group(1), the local part in group(2), and the entire thing would be in group(0).
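A short sketch of how groups show up on the match object. The phone number is a made-up sample value.

```python
import re

m = re.match(r"([0-9]{3})-([0-9]{3}-[0-9]{4})", "555-123-4567")
parts = (m.group(0), m.group(1), m.group(2))
print(parts)       # ('555-123-4567', '555', '123-4567')
print(m.groups())  # ('555', '123-4567')

# The pattern from the question has no parentheses, so there are no groups:
no_groups = re.match(r"[A-Z]{3}.[A-Z]{3}", "AAAbAAAcAAA")
print(no_groups.group(0), no_groups.groups())  # AAAbAAA ()
```

groups() only ever lists the parenthesised parts of a single match; it never lists further matches elsewhere in the string.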
What you want is to find overlapping matches. Here's a Stack Overflow answer that explains how to do overlapping matches in Python regex, and here's my favorite reference for capture groups and regex in general.
One, you are using match when it looks like you want findall. It won't grab the enclosing capital triplets, but re.findall('[A-Z]{3}([a-z])(?=[A-Z]{3})', search_string) will get you all single lower case characters surrounded on both sides by 3 caps.
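The findall pattern above accepts three or more capitals on each side. If "exactly 3" is meant strictly, one possible tightening is to add negative lookarounds. This is a sketch under that assumed reading of the question; the names loose, strict and guarded are illustrative.

```python
import re

loose = re.compile(r"[A-Z]{3}([a-z])(?=[A-Z]{3})")
# Assumed reading of "exactly": no fourth capital directly before or after.
strict = re.compile(r"(?<![A-Z])[A-Z]{3}([a-z])(?=[A-Z]{3}(?![A-Z]))")

def guarded(pattern, text):
    """Join all captured lowercase characters into one string."""
    return "".join(pattern.findall(text))

print(guarded(loose, "AAAbAAAcAAA"))   # bc
print(guarded(strict, "AAAbAAAcAAA"))  # bc
print(guarded(loose, "xAAAAbAAAy"))    # b
print(repr(guarded(strict, "xAAAAbAAAy")))  # ''
```

Because the capitals on the right are only checked in a lookahead, they stay available as the left-hand capitals of the next match, which is why both b and c are found in the first string.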
