I have a regular expression in which a | operator separates two possible patterns. Is it possible to find out which pattern was the one that matched my string?
For example, if I have the pattern ([cC]at|[dD]og) and I find a match in the string clifford is a dog, can I then look back to see that [dD]og was the pattern that matched, and not the alternative [cC]at?
I understand that I could try to match each alternative pattern individually and keep the successful ones, but I am wondering if there is another solution that doesn't require a match attempt for each pattern. (I'm hoping to apply this in a situation where I'm trying to match several hundred patterns at once.)
You can put each alternative in its own capturing group and check which group matched, like this:
([cC]at)|([dD]og)
Regex demo
Match information
MATCH 1
Group 2. [14-17] `dog`
MATCH 2
Group 1. [33-36] `cat`
By the way, if for some reason you have to group the whole alternation, you can use a non-capturing group like this:
(?:([cC]at)|([dD]og))
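With several hundred alternatives, counting group numbers by hand gets error-prone. Here is a minimal Python sketch of the same idea using named groups, assuming Python's re module. The names cat and dog and the helper matched_alternative are made up for illustration. The match's lastgroup attribute then reports which alternative matched:

```python
import re

# Hypothetical names mapped to the alternatives from the question.
alternatives = {"cat": r"[cC]at", "dog": r"[dD]og"}

# Builds (?P<cat>[cC]at)|(?P<dog>[dD]og): one named group per alternative.
combined = re.compile(
    "|".join(f"(?P<{name}>{pat})" for name, pat in alternatives.items())
)

def matched_alternative(text):
    """Return (name of the alternative that matched, matched text), or None."""
    m = combined.search(text)
    if m is None:
        return None
    return m.lastgroup, m.group()

print(matched_alternative("clifford is a dog"))  # ('dog', 'dog')
```

The string is still scanned only once; the cost is one extra group per alternative in the compiled pattern.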
I'm trying to extract two numbers of interest from a string of docket text in a pandas dataframe. Here's an example with a couple of the idiosyncrasies that exist in the data
import pandas as pd
df = pd.DataFrame(["Fee: $ 15,732, and Expenses: $1,520.62."])
I used regexr to test some ideas and the closest I've been able to come up with is something along the lines of
df[0].str.extract("(\${0,2}\s*(\d+[,\.]*){1,5})")
Which returns:
0 1
0 $15,732,, 732,,
The problems I'm running into are making characters optional while capturing the groups (i.e. I don't know how to get rid of the inner parenthesis because if I make it brackets then I get an error). And then ideally I'd be able to match the other set of numbers too.
I used regexr and while I can make regular expressions that match what I want, I'm struggling with the grouping part so that I can capture both while not needing to use a cumbersome function like apply with re.
There are sometimes numbers that show up again later in the report that include dates, other numbers, etc... So I'm trying to find a pretty controlled sequence (Can't get too liberal with the .*'s haha)
The string I ended up writing after the hint provided in the comments is:
\$((?:\d+(?:[,\.])*)+).*?\$((?:\d+(?:[,\.])*)+)
The non-capturing groups are what I hadn't understood before. I thought a non-capturing group would somehow remove the parts it matched from the enclosing group, but what it really means is that the parenthesized part doesn't count as a group of its own (not that its text is removed from the enclosing group).
I appreciate the feedback I got on this post!
I am not sure if the text stays the same across all of the values but you can use the following regex:
r'Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\.'
returning two matching groups:
15,732
1,520.62
You can also abstract the text:
r'\w+:\s*\$\s?([\d,.]+),(?:\s*\w+)+:\s*\$\s?([\d,.]+)\.'
with the same result.
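To check the two-group pattern outside pandas, here is a small sketch using the standard re module on the sample string from the question. This assumes every row has the same "Fee: ... and Expenses: ..." wording. With pandas, passing the same pattern to str.extract should give one column per capturing group.

```python
import re

text = "Fee: $ 15,732, and Expenses: $1,520.62."

# Group 1 is the fee, group 2 the expenses; the literal words anchor the match.
pattern = re.compile(r"Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\.")

m = pattern.search(text)
fee, expenses = m.groups()
print(fee, expenses)  # 15,732 1,520.62
```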
You can use
df[0].str.extract(r"(\$\s*\d+(?:[,.]\d+)*)") # To get the first value
df[0].str.extractall(r"(\$\s*\d+(?:[,.]\d+)*)") # To get all values
df[0].str.findall(r"\$\s*\d+(?:[,.]\d+)*") # To get all values
The str.extract pattern is wrapped in a capturing group because the method requires at least one capturing group in the regex pattern in order to return a value.
The regex matches
\$ - a $ char
\s* - zero or more whitespaces
\d+ - one or more digits
(?:[,.]\d+)* - a non-capturing group matching zero or more repetitions of a comma/dot and then one or more digits.
See the regex demo.
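The pattern can be tried without pandas, since Series.str.findall applies re.findall to each element. The sketch below also turns the matches into numbers. The helper to_number is hypothetical, and it assumes that "," is always a thousands separator and "." the decimal point:

```python
import re

text = "Fee: $ 15,732, and Expenses: $1,520.62."
amount = re.compile(r"\$\s*\d+(?:[,.]\d+)*")

# Every dollar amount in the string, in order of appearance.
raw = amount.findall(text)
print(raw)  # ['$ 15,732', '$1,520.62']

def to_number(s):
    # Drop the dollar sign, surrounding whitespace, and thousands separators.
    return float(s.lstrip("$").strip().replace(",", ""))

values = [to_number(s) for s in raw]
print(values)  # [15732.0, 1520.62]
```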
Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that strips out all hyphens and spaces, and then used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
From Regex101 online tester
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
Regex demo
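A short Python sketch of how this pattern behaves, assuming the check runs on the raw identifier before hyphens are stripped (once stripped, AB1201-02 no longer contains the hyphen that the lookahead tests for). The helper parse_identifier is a made-up name:

```python
import re

ID_PATTERN = re.compile(r"^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?")

def parse_identifier(identifier):
    """Return (base, suffix) for an accepted identifier, or None if rejected."""
    m = ID_PATTERN.match(identifier)
    if m is None:
        return None
    return m.group(1), m.group(2)

for s in ["AB1201", "AB1201L1", "AB1201-T", "AB1201-02"]:
    print(s, parse_identifier(s))
# AB1201 ('AB1201', None)
# AB1201L1 ('AB1201', 'L1')
# AB1201-T ('AB1201', None)
# AB1201-02 None
```

A None result can then be raised or logged as the exception case.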
Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.
I have the following string,
"ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
In the above string, using a regex, I want to find all occurrences of 'TAG' except the first 3.
I used the regex '(TAG.*?){4}', but it only finds the 4th occurrence ('TAG:') and not the others ('TAG.', 'TAG ', 'TAGH').
If you want a capture group with all the remaining matches, you have to consume the first ones first:
(TAG.*?){3}(TAG.*?)*
This matches the first 3 occurrences in the first capture group and the rest in the second.
If you don't want the first matches to be in a capture group, you can flag it as non-capturing group:
(?:TAG.*?){3}(TAG.*?)*
Judging by your example, I think the regex inside the capture group is not correct yet. If this doesn't already give you the right idea of how to do this, please give us an example of the matches you want to see, and I'll edit my answer.
EDIT:
I get the feeling that you want to capture the 4th and following occurrences in their own capture groups while still ignoring the first 3 occurrences.
I can't fully explain why, but I think that's not possible, for the following reasons:
Ignoring the first 3 occurrences in their own (non-)capturing group forces you to abandon the 'g' modifier for finding all occurrences (because that would just do 'ignore 3 TAGs, find 1' in a loop).
It is not possible to get multiple captures out of a single capture group. A repeated capture group always keeps only the last occurrence. It is possible to capture all occurrences together as one string in a single capture group, but it seems like you want them in separate groups.
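The second point can be seen in a short Python sketch using the string from the question. The pattern for a single TAG (the name one_tag is just for illustration) is repeated inside one regex, and group 1 ends up holding only the final repetition:

```python
import re

text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"

# One TAG plus everything up to (but not including) the next TAG.
one_tag = r"TAG((?:(?!TAG).)+)"

# Repeating the unit inside one regex: group 1 is overwritten on each pass.
repeated = re.search(rf"(?:{one_tag})+", text)
print(repeated.group(1))  # HHHH

# Applying the unit repeatedly keeps one capture per occurrence.
print(len(re.findall(one_tag, text)))  # 7
```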
So, how to solve this?
I'd come up with a proper regex for one TAG and repeat that using the find all or g modifier. In Python you can then simply take all the findings, skipping the first 3:
import re
text = "ATAG:AAAABTAG:BBBBCTAG:CCCCCTAG:DDDDEEEECTAG.FFFFCTAG GGGGCTAGHHHH"
pattern = r"(?:TAG((?:(?!TAG).)+))"
# Slice off the first three findings and keep the rest
findings = re.findall(pattern, text)[3:]
If you want to ignore the first character after TAG, just add a . after TAG:
pattern = r"(?:TAG.((?:(?!TAG).)+))"
Explanation of the regex:
- I use ?: to turn some capturing groups into non-capturing groups, since I only want to deal with one capture group.
- To get rid of the non-greedy modifier and be a little more specific about what we actually want, I've introduced the negative lookahead after the TAG occurrence.
I want to match the regular expressions \(.*\), \[.*\], \{.*\}, and \<.*\>. Is there a way to combine these regular expressions?
For example, I had in mind something like:
([\(\[\{\<]).*\1, but of course this matches \(.*\(, \[.*\[, \{.*\{, and \<.*\<.
My goal is to be able to match a previous regular expression group, but apply a function to the group before matching it.
Consider:
def match_pairs(pairs):
    re = '|'.join("({begin}.*{end})".format(begin=beg, end=end) for (beg, end) in pairs)
    return re
I'm considering using something similar to the above function for now, but ideally this function wouldn't return a really long regex. Do let me know if you think this question doesn't have any practical merit. I'm still curious to know if Python3 supports any feature like this, sort of how like re.sub can take a function as the replacement. If no such feature exists, how can I write match_pairs so that it can take in ["()", "[]", "[]", "{}"] as an argument?
The obvious (and shortest) regex for this task is \(.*\)|\[.*\]|\{.*\}|\<.*\>.
The downside is that you have four copies of the .* subpattern, so if you ever need to change it, you'll have to change it in 4 places. Luckily we can work around this problem with some use of capture groups:
(?:\(()|\[()|\{()|<()).*(?:\1\)|\2\]|\3\}|\4>)
Online demo.
This may look confusing, but it's actually very simple. The pattern is built like this:
(?:opening_char_1()|opening_char_2()|...).*(?:\1closing_char_1|\2closing_char_2|...)
This uses a fairly straightforward little trick: Each opening character ((, [, {, <) is accompanied by a capture group like so: \[(). This allows us to "remember" which opening character was matched - if capture group 1 matched, we know the opening character was (. If capture group 2 matched, the opening character was [, and so on. So we simply use backreferences (\1, \2, etc) to find out what the opening character was, and then match the corresponding closing character.
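As a rough sketch, the question's match_pairs could generate this pattern from a list of pairs. It assumes each pair is a two-character string and that the closing characters are not digits (a digit would run into the backreference number). It also relies on Python's re treating a backreference to a group that did not take part in the match as a failure; other engines may behave differently.

```python
import re

def match_pairs(pairs):
    """Build one regex from two-character strings such as "()" or "[]"."""
    # Each opener is followed by an empty group that records which one matched.
    openers = "|".join(re.escape(p[0]) + "()" for p in pairs)
    # Backreference i succeeds only if opener i matched; its closer must follow.
    closers = "|".join(
        "\\{}{}".format(i, re.escape(p[1])) for i, p in enumerate(pairs, start=1)
    )
    return "(?:{}).*(?:{})".format(openers, closers)

pattern = re.compile(match_pairs(["()", "[]", "{}", "<>"]))
print(bool(pattern.fullmatch("[abc]")))  # True
print(bool(pattern.fullmatch("(abc]")))  # False
```

The .* subpattern now appears once, so changing it later means editing a single place in match_pairs.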
This is in continuation of my earlier question where I wanted to compile many patterns as one regular expression and after the discussion I did something like this
REGEX_PATTERN = '|'.join(self.error_patterns.keys())
where self.error_patterns.keys() would be pattern like
: error:
: warning:
cc1plus:
undefine reference to
Failure:
and do
error_found = re.findall(REGEX_PATTERN,line)
Now when I run it against a file that might contain one or more of the patterns, how do I know exactly which pattern matched? I can of course inspect the line manually, but I want to know whether, after calling re.findall, I can find out the pattern with something like re.group().
Thank you
re.findall will return all portions of text that matched your expression.
If that is not sufficient to identify the pattern unambiguously, you can still do a second re.match/re.search against the individual subpatterns you have join()ed. By the time your combined regular expression is applied, the matcher is no longer aware that you composed it from several subpatterns, so it cannot tell you which subpattern matched.
Another, equally unwieldy option would be to enclose each pattern in a group (...). Then re.findall will return a list of tuples, one per match, with empty strings for all the groups that did not match and the matched text in the one group that did.
MatchObject has a lastindex property that contains the index of the last capturing group that participated in the match. If you enclose each pattern in its own capturing group, like this:
(: error:)|(: warning:)
...lastindex will tell you which one matched (assuming you know the order in which the patterns appear in the regex). You'll probably want to use finditer() (which creates an iterator of match objects) instead of findall() (which returns a list of strings). Also, make sure there are no other capturing groups in the regex, since they would throw your indexing out of sync.
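A minimal sketch of this approach, using a few of the patterns from the question. It assumes they are meant as literal text, so they are passed through re.escape. If they are real regexes, skip the escaping and make any inner groups non-capturing. The helper matched_patterns is a made-up name.

```python
import re

# A few of the patterns from the question, treated as literal text (assumption).
patterns = [": error:", ": warning:", "cc1plus:", "Failure:"]

# One capturing group per pattern, in list order: group i <-> patterns[i - 1].
combined = re.compile("|".join("({})".format(re.escape(p)) for p in patterns))

def matched_patterns(line):
    """Return the source pattern behind each match found in the line."""
    return [patterns[m.lastindex - 1] for m in combined.finditer(line)]

print(matched_patterns("main.cpp:3: error: 'x' was not declared"))  # [': error:']
```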