Given a text, I need to check, for each character, whether it has exactly 3 capital letters on both sides; if it does, I add it to a string of such characters, which is returned.
I wrote the following: m = re.match("[A-Z]{3}.[A-Z]{3}", text)
(let's say text="AAAbAAAcAAA")
I expected to get two groups in the match object: "AAAbAAA" and "AAAcAAA"
Now, when I invoke m.group(0) I get "AAAbAAA", which is right. Yet when I invoke m.group(1), I find that there is no such group, meaning "AAAcAAA" wasn't a match. Why?
Also, when invoking m.groups(), I get an empty tuple although I should get a tuple of the matches, meaning that in my case I should have gotten a tuple with "AAAbAAA". Why doesn't that work?
You don't have any groups in your pattern. To capture something in a group, you have to surround it with parentheses:
([A-Z]{3}).[A-Z]{3}
The exception is m.group(0), which will always contain the entire match.
Looking over your question, it sounds like you aren't actually looking for capture groups, but rather overlapping matches. In regex, a group means a smaller part of the match that is set aside for later use. For example, if you're trying to match phone numbers with something like
([0-9]{3})-([0-9]{3}-[0-9]{4})
then the area code would be in group(1), the local part in group(2), and the entire thing would be in group(0).
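To make the group numbering concrete, here is a minimal sketch using the phone-number pattern above. The sample number is made up for illustration.

```python
import re

# Hypothetical sample number, used only for illustration.
m = re.match(r"([0-9]{3})-([0-9]{3}-[0-9]{4})", "555-867-5309")

whole = m.group(0)       # the entire match, always available
area_code = m.group(1)   # text captured by the first pair of parentheses
local_part = m.group(2)  # text captured by the second pair of parentheses
all_groups = m.groups()  # only the parenthesized groups, group(0) is not included

print(whole)       # 555-867-5309
print(area_code)   # 555
print(local_part)  # 867-5309
print(all_groups)  # ('555', '867-5309')
```

Note that m.groups() never contains the whole match, which is why a pattern without parentheses gives an empty tuple.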
What you want is to find overlapping matches. Here's a Stack Overflow answer that explains how to do overlapping matches in Python regex, and here's my favorite reference for capture groups and regex in general.
One, you are using match when it looks like you want findall. It won't grab the enclosing capital triplets, but re.findall('[A-Z]{3}([a-z])(?=[A-Z]{3})', search_string) will get you all single lower case characters surrounded on both sides by 3 caps.
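Here is a short sketch of the difference, using the sample text from the question. The lookahead (?=...) checks for the trailing capitals without consuming them, so they can serve as the leading capitals of the next match. One caveat: as written, the pattern does not reject runs of more than three capitals, so "exactly 3" would need extra lookarounds.

```python
import re

text = "AAAbAAAcAAA"

# re.match only looks at the start of the string and returns a single match.
m = re.match(r"[A-Z]{3}.[A-Z]{3}", text)
first_match = m.group(0)  # 'AAAbAAA'
no_groups = m.groups()    # () because the pattern has no parentheses

# re.findall returns every match; with one capturing group it returns
# just the captured character. The lookahead leaves the trailing capitals
# unconsumed, so they can start the next match.
surrounded = re.findall(r"[A-Z]{3}([a-z])(?=[A-Z]{3})", text)
result = "".join(surrounded)

print(surrounded)  # ['b', 'c']
print(result)      # bc
```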
I'm trying to extract two numbers of interest from a string of docket text in a pandas dataframe. Here's an example with a couple of the idiosyncrasies that exist in the data
import pandas as pd
df = pd.DataFrame(["Fee: $ 15,732, and Expenses: $1,520.62."])
I used regexr to test some ideas and the closest I've been able to come up with is something along the lines of
df[0].str.extract("(\${0,2}\s*(\d+[,\.]*){1,5})")
Which returns:
0 1
0 $15,732,, 732,,
The problems I'm running into are making characters optional while capturing the groups (i.e. I don't know how to get rid of the inner parentheses, because if I change them to brackets I get an error). And then ideally I'd be able to match the other set of numbers too.
I used regexr and while I can make regular expressions that match what I want, I'm struggling with the grouping part so that I can capture both while not needing to use a cumbersome function like apply with re.
There are sometimes numbers that show up again later in the report that include dates, other numbers, etc... So I'm trying to find a pretty controlled sequence (Can't get too liberal with the .*'s haha)
The string I ended up writing after the hint provided in the comments is:
\$((?:\d+(?:[,\.])*)+).*?\$((?:\d+(?:[,\.])*)+)
The non-capturing groups are what I hadn't understood before. I thought a non-capturing group would somehow remove the parts it matched from the enclosing group, but what it really means is that the parenthesized part doesn't count as a group of its own (its text is not removed from the enclosing group).
I appreciate the feedback I got on this post!
I am not sure if the text stays the same across all of the values but you can use the following regex:
r'Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\.'
returning two matching groups:
15,732
1,520.62
You can also abstract the text:
r'\w+:\s*\$\s?([\d,.]+),(?:\s*\w+)+:\s*\$\s?([\d,.]+)\.'
with the same result.
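As a quick check, here is a sketch that applies the first pattern to the sample string with the plain re module. The same pattern string can be passed to df[0].str.extract, which returns one column per capturing group. The conversion to float is my own addition and assumes the comma is always a thousands separator.

```python
import re

docket = "Fee: $ 15,732, and Expenses: $1,520.62."

pattern = r"Fee: \$\s?([\d,.]+), and Expenses:\s*\$\s?([\d,.]+)\."
m = re.search(pattern, docket)

# The two capturing groups hold the raw amounts as strings.
fee, expenses = m.groups()
print(fee)       # 15,732
print(expenses)  # 1,520.62

# Assumption: commas are thousands separators, so they can be dropped
# before converting to a number.
fee_value = float(fee.replace(",", ""))
expenses_value = float(expenses.replace(",", ""))
print(fee_value, expenses_value)  # 15732.0 1520.62
```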
You can use
df[0].str.extract(r"(\$\s*\d+(?:[,.]\d+)*)") # To get the first value
df[0].str.extractall(r"(\$\s*\d+(?:[,.]\d+)*)") # To get all values
df[0].str.findall(r"\$\s*\d+(?:[,.]\d+)*") # To get all values
The str.extract pattern is wrapped in a capturing group because the method requires at least one capturing group in the regex pattern in order to return a value.
The regex matches
\$ - a $ char
\s* - zero or more whitespaces
\d+ - one or more digits
(?:[,.]\d+)* - a non-capturing group matching zero or more repetitions of a comma/dot and then one or more digits.
See the regex demo.
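To see what this pattern picks up without needing pandas, here is a small sketch with re.findall on the sample string from the question. Because (?:[,.]\d+)* only accepts a comma or dot that is followed by digits, the trailing comma and the final period are left out of the matches.

```python
import re

docket = "Fee: $ 15,732, and Expenses: $1,520.62."

# Same pattern as above, without the outer capturing group
# (re.findall returns whole matches when there are no capturing groups).
amount_pattern = r"\$\s*\d+(?:[,.]\d+)*"
amounts = re.findall(amount_pattern, docket)

print(amounts)  # ['$ 15,732', '$1,520.62']
```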
I have a program in which a user inputs a function, such as sin(x)+1. I'm using ast to try to determine if the string is 'safe' by whitelisting components as shown in this answer. Now I'd like to parse the string to add multiplication (*) signs between coefficients without them.
For example:
3x-> 3*x
4(x+5) -> 4*(x+5)
sin(3x)(4) -> sin(3x)*(4) (sin is already in globals, otherwise this would be s*i*n*(3x)*(4))
Are there any efficient algorithms to accomplish this? I'd prefer a pythonic solution (i.e. not complex regexes; not because they're unpythonic, but just because I don't understand them as well and want a solution I can understand. Simple regexes are OK.)
I'm very open to using sympy (which looks really easy for this sort of thing) under one condition: safety. Apparently sympy uses eval under the hood. I've got pretty good safety with my current (partial) solution. If anyone has a way to make sympy safer with untrusted input, I'd welcome this too.
A regex is easily the quickest and cleanest way to get the job done in vanilla python, and I'll even explain the regex for you, because regexes are such a powerful tool it's nice to understand.
To accomplish your goal, use the following statement:
import re
# <code goes here, set 'thefunction' variable to be the string you're parsing>
re.sub(r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()", r"\1*\2", thefunction)
I know it's a bit long and complicated, but a different, simpler solution doesn't make itself immediately obvious without even more hacky stuff than what's gone into the regex here. But, this has been tested against all three of your test cases and works out precisely as you want.
As a brief explanation of what's going on here: The first parameter to re.sub is the regular expression, which matches a certain pattern. The second is the thing we're replacing it with, and the third is the actual string to replace things in. Every time our regex sees a match, it removes it and plugs in the substitution, with some special behind-the-scenes tricks.
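Put together as a runnable sketch, it might look like this. The helper name add_multiplication is my own, not something from the question, and the inputs are the three test cases from the question.

```python
import re

# Group 1: a number or a simple function call; Group 2: a name or "(".
IMPLICIT_MUL = re.compile(
    r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()"
)

def add_multiplication(expr):
    """Insert '*' between group 1 and group 2 wherever the pattern matches.

    Hypothetical helper name, used only for illustration.
    """
    return IMPLICIT_MUL.sub(r"\1*\2", expr)

cases = ["3x", "4(x+5)", "sin(3x)(4)"]
results = [add_multiplication(c) for c in cases]

for before, after in zip(cases, results):
    print(before, "->", after)
# 3x -> 3*x
# 4(x+5) -> 4*(x+5)
# sin(3x)(4) -> sin(3x)*(4)
```

Note that in the last case the 3x inside the call is left alone, because it is consumed as part of the function-call alternative of group 1. That matches the expected output given in the question.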
A more in-depth analysis of the regex follows:
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\() : Matches a number or a function call, followed by a variable or parentheses.
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\))) : Group 1. Note: Parentheses delimit a Group, which is sort of a sub-regex. Capturing groups are indexed for future reference; groups can also be repeated with modifiers (described later). This group matches a number or a function call.
(?:\d+) : Non-capturing group. Any group with ?: immediately after the opening parenthesis will not assign an index to itself, but still act as a "section" of the pattern. Ex. A(?:bc)+ will match "Abcbcbcbc..." and so on, but you cannot access the "bcbcbcbc" match with an index. However, without this group, writing "Abc+" would match "Abcccccccc..."
\d : Matches any numerical digit once. A regex of \d all its own will match, separately, "1", "2", and "3" of "123".
+ : Matches the previous element one or more times. In this case, the previous element is \d, any number. In the previous example, \d+ on "123" will successfully match "123" as a single element. This is vital to our regex, to make sure that multi-digit numbers are properly registered.
| : Pipe character, and in a regex, it effectively says or: "a|b" will match "a" OR "b". In this case, it separates "a number" and "a function call"; match a number OR a function call.
(?:[a-zA-Z]\w*\(\w+\)) : Matches a function call. Also a non-capturing group, like (?:\d+).
[a-zA-Z] : Matches the first letter of the function call. There is no modifier on this because we only need to ensure the first character is a letter; A123 is technically a valid function name.
\w : Matches any alphanumeric character or an underscore. After the first letter is ensured, the following characters could be letters, numbers, or underscores and still be valid as a function name.
* : Matches the previous element 0 or more times. While it may seem unnecessary at first, the star effectively makes an element optional. In this case, the modified element is \w, and a function name doesn't technically need more than one character; A() is a valid function call. The A would be matched by [a-zA-Z], leaving nothing for \w to match. At the other end of the spectrum, any number of characters could follow the first letter, which is why we need this modifier.
\( : This is important to understand: this is not another group. The backslash here acts much like an escape character would in a normal string. In a regex, any time you preface a special character, such as a parenthesis, +, or *, with a backslash, it is treated as a literal character. \( matches an opening parenthesis, for the actual function call part of the function.
\w+ : Matches a number, letter or underscore one or more times. This ensures the function actually has a parameter going into it.
\) : Like \(, but matches a closing parenthesis
((?:[a-zA-Z]\w*)|\() : Group 2. Matches a variable, or an opening parenthesis.
(?:[a-zA-Z]\w*) : Matches a variable. This is the exact same as our function name matcher. Note that it sits in a non-capturing group, which makes the two sides of the OR easy to see. Strictly speaking the grouping is optional here: | has the lowest precedence of any regex operator, so inside Group 2 the pattern [a-zA-Z]\w*|\( would already mean "a whole variable name OR an opening parenthesis", not "\w* OR \(". The explicit group simply spells out what the OR applies to.
| : Or character. Matches (?:[a-zA-Z]\w*) or \(.
\( : Matches an opening parenthesis. Once we have checked if there is an opening parenthesis, we don't need to check anything beyond it for the purposes of our regex.
Now, remember our two groups, group one and group two? These are used in the substitution string, "\1*\2". The substitution string is not a true regex, but it still has certain special characters. In this case, \<number> will insert the group of that number. So our substitution string is saying: "Put group 1 in (which is either our function call or our number), then put in an asterisk (*), then put in our second group (either a variable or a parenthesis)"
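If it helps, the individual building blocks described above can be tried out in isolation. This is just a sketch with made-up sample strings, not part of the solution itself.

```python
import re

# \d matches one digit at a time; \d+ matches a run of digits as one unit.
single_digits = re.findall(r"\d", "123")
whole_number = re.findall(r"\d+", "123")
print(single_digits)  # ['1', '2', '3']
print(whole_number)   # ['123']

# A non-capturing group lets + repeat "bc" as a unit...
grouped = re.fullmatch(r"A(?:bc)+", "Abcbcbc")
# ...whereas without the group, + only repeats the "c".
ungrouped_repeat = re.fullmatch(r"Abc+", "Abcccc")
ungrouped_miss = re.fullmatch(r"Abc+", "Abcbcbc")
print(grouped is not None)           # True
print(ungrouped_repeat is not None)  # True
print(ungrouped_miss is None)        # True

# Non-capturing groups get no index, so only the two (\d+) groups are numbered
# and \1, \2 in the replacement refer to them.
m = re.match(r"(\d+)(?:[a-z]+)(\d+)", "12abc34")
numbered = m.groups()
swapped = re.sub(r"(\d+)(?:[a-z]+)(\d+)", r"\2*\1", "12abc34")
print(numbered)  # ('12', '34')
print(swapped)   # 34*12
```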
I think that about sums it up!
There is a big string and I need to find all substrings containing exactly N words (if it is possible).
For example:
big_string = "The most elegant way to find n words in String with the particular word"
N = 2
find_sub(big_string, 'find', N=2) # => ['way to find n words']
I've tried to solve it with regular expressions, but it turned out to be more complex than I expected at first. Is there an elegant solution that I've just overlooked?
Update
By word we mean everything separated by \b
The N parameter indicates how many words there should be on each side of 'find'
For your specific example (if we use the "word" definition of regular expressions, i.e. anything containing letters, digits and underscores) the regex would look like this:
r'(?:\w+\W+){2}find(?:\W+\w+){2}'
\w matches one of said word characters. \W matches any other character. I think it's obvious where in the pattern your parameters go. You can use the pattern with re.search or re.findall.
The issue is if there are fewer than the desired number of words around your query (i.e. if it's too close to one end of the string). But you should be able to get away with:
r'(?:\w+\W+){0,2}find(?:\W+\w+){0,2}'
thanks to the greediness of repetition. Note that in any case, if you want multiple results, matches can never overlap. So if two occurrences of find are too close to each other, the first pattern will only give you the first match, whereas the second pattern won't give you n words before the second find (the ones that were already consumed will be missing). In particular, if two occurrences of find are closer together than n words, so that the second find is already part of the first match, then you can't get the second match at all.
If you want to treat a word as anything that is not a white-space character, the approach looks similar:
r'(?:\S+\s+){0,2}find(?:\s+\S+){0,2}'
For anything else you will have to come up with the character classes yourself, I guess.
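The \w-based pattern can be wrapped in a small helper along these lines. This is only a sketch: find_sub takes its name from the question, and the use of re.escape assumes the query is a literal word rather than a regex. It does not add word boundaries, so a query like find would also match inside finding.

```python
import re

def find_sub(text, word, n):
    """Return each occurrence of `word` with up to n words on either side.

    Uses the {0,n} form so matches near the ends of the string still succeed.
    """
    pattern = r"(?:\w+\W+){0,%d}%s(?:\W+\w+){0,%d}" % (n, re.escape(word), n)
    # Only non-capturing groups are used, so findall returns whole matches.
    return re.findall(pattern, text)

big_string = "The most elegant way to find n words in String with the particular word"
print(find_sub(big_string, "find", 2))            # ['way to find n words']
print(find_sub("find me now please", "find", 2))  # ['find me now']
```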
I need to match the following sets of input:
foo_abc_bar
foo_bar
and get "abc" or an empty string as the result.
So this is the regular expression I wrote:
r'foo_(abc|)[_|]bar'
But for some reason, this does not match with the second string that I have given.
On further inspection, I found that [_|] does not match an empty string.
So, how do I solve this problem?
To make abc_ optional, you could use the question mark operator:
(abc_)?
Thus, the entire regex becomes:
r'foo_(abc_)?bar'
With this regex, the second underscore (if present) will become part of the capture group. If you don't want that, you could either remove it post-match with .rstrip('_') or use a slightly more complex regex:
r'foo_(?:(abc)_)?bar'
You wrote: "I found that [_|] does not match an empty string."
That's right. Square brackets denote a character class. [_|] matches exactly one underscore or exactly one vertical bar, and nothing else. In other words, the vertical bar loses its special meaning when it appears inside a character class.
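Here is a small sketch that checks both points against the two inputs from the question. The helper name middle is made up, and turning a missing group into an empty string with `or ""` is my own addition, since an optional group that did not participate comes back as None.

```python
import re

pattern = re.compile(r"foo_(?:(abc)_)?bar")

def middle(s):
    """Return 'abc', '' if the optional part is absent, or None if s doesn't match.

    Hypothetical helper name, used only for illustration.
    """
    m = pattern.fullmatch(s)
    if m is None:
        return None
    # An optional group that did not participate is None, not ''.
    return m.group(1) or ""

print(repr(middle("foo_abc_bar")))  # 'abc'
print(repr(middle("foo_bar")))      # ''
print(repr(middle("foo_xyz_bar")))  # None

# The original attempt: [_|] always needs one character, either "_" or a literal "|".
original = re.compile(r"foo_(abc|)[_|]bar")
print(original.fullmatch("foo_bar") is None)       # True
print(original.fullmatch("foo_|bar") is not None)  # True
```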
If you want to match strings like these:
xxx_xxx_xxx
xxx_xxx
then you need
([A-Za-z]{3})((_[A-Za-z]{3})+)?
but this will also work:
r'foo(_abc)?_bar'
? means optional (the preceding group may or may not match).
This is in continuation of my earlier question where I wanted to compile many patterns as one regular expression and after the discussion I did something like this
REGEX_PATTERN = '|'.join(self.error_patterns.keys())
where self.error_patterns.keys() would be pattern like
: error:
: warning:
cc1plus:
undefine reference to
Failure:
and do
error_found = re.findall(REGEX_PATTERN,line)
Now when I run it against a file that might contain one or more of these patterns, how do I know exactly which pattern matched? I can always inspect the line manually, but I'd like to know whether, after calling re.findall, I can find out which pattern matched, with something like re.group().
Thank you
re.findall will return all portions of text that matched your expression.
If that is not sufficient to identify the pattern unambiguously, you can still run a second re.match/re.search against the individual subpatterns you have join()ed. By the time your combined regular expression is applied, however, the matcher no longer knows that you composed it from several subpatterns, so it cannot tell you which subpattern matched.
Another, equally unwieldy option would be to enclose each pattern in a group (...). Then re.findall will return a list of tuples with one entry per group: an empty string for every pattern that did not match, and the matched text for the one group that did.
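As a rough sketch of the grouped variant with two of the patterns from the question: in current Python versions, re.findall reports groups that did not participate as empty strings, so the position of the non-empty entry tells you which pattern matched. The sample line is made up.

```python
import re

# Each subpattern in its own capturing group.
grouped_pattern = r"(: error:)|(: warning:)"

# Hypothetical compiler output line, used only for illustration.
line = "main.c:3: warning: unused variable"
found = re.findall(grouped_pattern, line)
print(found)  # [('', ': warning:')]

# Index of the non-empty entry in each tuple = index of the subpattern that matched.
which = [next(i for i, text in enumerate(groups) if text) for groups in found]
print(which)  # [1]
```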
MatchObject has a lastindex property that contains the index of the last capturing group that participated in the match. If you enclose each pattern in its own capturing group, like this:
(: error:)|(: warning:)
...lastindex will tell you which one matched (assuming you know the order in which the patterns appear in the regex). You'll probably want to use finditer() (which creates an iterator of MatchObjects) instead of findall() (which returns a list of strings). Also, make sure there are no other capturing groups in the regex that could throw your indexing out of sync.
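A minimal sketch of the lastindex approach follows. It assumes the patterns are plain literal strings (hence re.escape) and are kept in a list so their order is known. The names error_patterns and matched_patterns, and the sample line, are illustrative only.

```python
import re

# Assumption: the patterns are literal strings kept in a fixed order.
error_patterns = [": error:", ": warning:", "cc1plus:", "Failure:"]

# Wrap every pattern in its own capturing group; group i+1 belongs to error_patterns[i].
combined = re.compile("|".join("(%s)" % re.escape(p) for p in error_patterns))

def matched_patterns(line):
    """Return the original pattern string for every match found in `line`."""
    # lastindex is 1-based, so subtract 1 to index into the list.
    return [error_patterns[m.lastindex - 1] for m in combined.finditer(line)]

print(matched_patterns("foo.c:10: error: bad call; bar.c:2: warning: unused"))
# [': error:', ': warning:']
print(matched_patterns("all good"))  # []
```

If the patterns are themselves regexes rather than literals, drop re.escape and make sure none of them contains a capturing group of its own (use (?:...) inside them instead).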