From Python 3.4.1 docs:
(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn’t advance at all; the rest of the pattern is tried right where the assertion started.
I'm trying to understand regex in Python. Could you please help me understand the second sentences, especially the bolded words? Any example will be appreciated.
Lookarounds are zero-width assertions. They don't consume any characters on the string.
To touch briefly on the bolded portions of the documentation:
This means that after looking ahead, the regular expression engine is back at the same position on the string from where it started looking. From there, it can start matching again...
The key point:
You can get a zero-width match which is a match that does not consume any characters. It only matches a position in the string. The point of zero-width is the validation to see if a regular expression can or cannot be matched looking ahead or looking back from the current position, without adding them to the overall match.
An answer in an example form. On string "xy":
(?:x) will match "x"
(?:x)x will not match, because there is no another x after x
(?:x)y will match "xy", by advancing over x and then y.
(?=x) will match "" at the start of the string, since x is following.
(?=x)x will match "x" - it recognises that an x follows, and then it advances over it.
(?=x)y will not match, since it affirms there is an x following, but then tries to advance over it using y.
Generally a Regular Expression engine is "consuming" your string character by character as it matches up with your regular expression.
If you use a look-ahead operator, the engine will instead simply look ahead without "consuming" any characters while it looks for a match.
Example
A good example is a regular expression to match a password where it needs to have a single numeric digit as well as be between 6-20 characters long.
You could write two checks (one to check if a digit exists, and one to check if the string length is as required), or use a single regular expression:
(?=.*\d).{6,20}
The first portion (?=.*\d)checks if there is digit anywhere in the string. When it completes we are back at the beginning of the string again (we were only "looking-ahead") and if it passed, we go onto the next portion of the regex.
Now .{6,20} is no longer a lookahead, and begins consuming the string. When the entire string is consumed, a match has been found.
Related
I'm having a Python issue when I include a not / in my regex.
In the following example I only want to find a match if the string sitting in the first word boundary starts with a digit AND there isn't a / at any point afterwards.
Why does the following regex return 1ab as a group value? I was hoping it wouldn't find a match at all:
text = "1ab/"
regex = r"\b(\d[^/]*?)\b"
Whereas:
text = "1abc"
regex = r"\b(\d[^c]*?)\b"
does not return any match, which is the outcome I want for the / scenario.
Any help would be appreciated.
Thanks,
Roy
You can use a negative lookahead assertion:
r'\b(\d\w*?)\b(?!.*/)' (use flags=re.DOTALL with this or prepend (?s) to the regex)
(?!.*/) states that the rest of the input string does not contain a '/' character. If you don't want '/' to appear just as the next character, then use as the assertion (?!/).
You almost did it. Yet the slash is not alphanumerical and thus cannot be inside word . Therefore it makes no sense to match or prohibit it start and the end of the word. You have to place "not slash" sub-expression [^/] after the end of word. And add a star [^/]* (which matches the sequence of non-slash symbols) to address the case when slashes occurs toward the end of the string rather than immediately after the end of the first word.
Since you target the first word and absence of slash until the very end of string adding symbols of the start end might help. Especially, if you are use re.search. Resulting in
^[\W]*\b(\d\w*)\b[^/]*\Z
You can play with it using an online debugger such as https://regex101.com/r/uO27vU/2
to better understand the expression or tune it.
Above ^ is a start, \Z is the end of sting, \W is for "non-word" symbols, a \w is "word" symbol.
You can remove the first \b I kept it, as perhaps, it would easier for you to understand with it.
The second expression that you tried excludes words ending with c but first does not. ^c stands for any symbol but c and right after it you have \b which denotes the end of the word. Which reads please no "c"s at the end of the word.
Your first expression says pleas no slashes before the end of the word (sequence of alphanumeric) . Which is the case for you test.
Always use a debugger to get explanation of each symbol,test and
tune your expressions regex101.com/r/B6INGg/2
Note that the list of symbols in a word might be affected by flags. When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_].
I have a program in which a user inputs a function, such as sin(x)+1. I'm using ast to try to determine if the string is 'safe' by whitelisting components as shown in this answer. Now I'd like to parse the string to add multiplication (*) signs between coefficients without them.
For example:
3x-> 3*x
4(x+5) -> 4*(x+5)
sin(3x)(4) -> sin(3x)*(4) (sin is already in globals, otherwise this would be s*i*n*(3x)*(4)
Are there any efficient algorithms to accomplish this? I'd prefer a pythonic solution (i.e. not complex regexes, not because they're pythonic, but just because I don't understand them as well and want a solution I can understand. Simple regexes are ok. )
I'm very open to using sympy (which looks really easy for this sort of thing) under one condition: safety. Apparently sympy uses eval under the hood. I've got pretty good safety with my current (partial) solution. If anyone has a way to make sympy safer with untrusted input, I'd welcome this too.
A regex is easily the quickest and cleanest way to get the job done in vanilla python, and I'll even explain the regex for you, because regexes are such a powerful tool it's nice to understand.
To accomplish your goal, use the following statement:
import re
# <code goes here, set 'thefunction' variable to be the string you're parsing>
re.sub(r"((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\()", r"\1*\2", thefunction)
I know it's a bit long and complicated, but a different, simpler solution doesn't make itself immediately obvious without even more hacky stuff than what's gone into the regex here. But, this has been tested against all three of your test cases and works out precisely as you want.
As a brief explanation of what's going on here: The first parameter to re.sub is the regular expression, which matches a certain pattern. The second is the thing we're replacing it with, and the third is the actual string to replace things in. Every time our regex sees a match, it removes it and plugs in the substitution, with some special behind-the-scenes tricks.
A more in-depth analysis of the regex follows:
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\)))((?:[a-zA-Z]\w*)|\() : Matches a number or a function call, followed by a variable or parentheses.
((?:\d+)|(?:[a-zA-Z]\w*\(\w+\))) : Group 1. Note: Parentheses delimit a Group, which is sort of a sub-regex. Capturing groups are indexed for future reference; groups can also be repeated with modifiers (described later). This group matches a number or a function call.
(?:\d+) : Non-capturing group. Any group with ?: immediately after the opening parenthesis will not assign an index to itself, but still act as a "section" of the pattern. Ex. A(?:bc)+ will match "Abcbcbcbc..." and so on, but you cannot access the "bcbcbcbc" match with an index. However, without this group, writing "Abc+" would match "Abcccccccc..."
\d : Matches any numerical digit once. A regex of \d all its own will match, separately, "1", "2", and "3" of "123".
+ : Matches the previous element one or more times. In this case, the previous element is \d, any number. In the previous example, \d+ on "123" will successfully match "123" as a single element. This is vital to our regex, to make sure that multi-digit numbers are properly registered.
| : Pipe character, and in a regex, it effectively says or: "a|b" will match "a" OR "b". In this case, it separates "a number" and "a function call"; match a number OR a function call.
(?:[a-zA-Z]\w*\(\w+\)) : Matches a function call. Also a non-capturing group, like (?:\d+).
[a-zA-Z] : Matches the first letter of the function call. There is no modifier on this because we only need to ensure the first character is a letter; A123 is technically a valid function name.
\w : Matches any alphanumeric character or an underscore. After the first letter is ensured, the following characters could be letters, numbers, or underscores and still be valid as a function name.
* : Matches the previous element 0 or more times. While initially seeming unnecessary, the star character effectively makes an element optional. In this case, our modified element is \w, but a function doesn't technically need any more than one character; A() is a valid function name. A would be matched by [a-zA-Z], making \w unnecessary. On the other end of the spectrum, there could be any number of characters following the first letter, which is why we need this modifier.
\( : This is important to understand: this is not another group. The backslash here acts much like an escape character would in a normal string. In a regex, any time you preface a special character, such as parentheses, +, or * with a backslash, it uses it like a normal character. \( matches an opening parenthesis, for the actual function call part of the function.
\w+ : Matches a number, letter or underscore one or more times. This ensures the function actually has a parameter going into it.
\) : Like \(, but matches a closing parenthesis
((?:[a-zA-Z]\w*)|\() : Group 2. Matches a variable, or an opening parenthesis.
(?:[a-zA-Z]\w*) : Matches a variable. This is the exact same as our function name matcher. However, note that this is in a non-capturing group: this is important, because of the way the OR checks. The OR immediately following this looks at this group as a whole. If this was not grouped, the "last object matched" would be \w*, which would not be sufficient for what we want. It would say: "match one letter followed by more letters OR one letter followed by a parenthesis". Putting this element in a non-capturing group allows us to control what the OR registers.
| : Or character. Matches (?:[a-zA-Z]\w*) or \(.
\( : Matches an opening parenthesis. Once we have checked if there is an opening parenthesis, we don't need to check anything beyond it for the purposes of our regex.
Now, remember our two groups, group one and group two? These are used in the substitution string, "\1*\2". The substitution string is not a true regex, but it still has certain special characters. In this case, \<number> will insert the group of that number. So our substitution string is saying: "Put group 1 in (which is either our function call or our number), then put in an asterisk (*), then put in our second group (either a variable or a parenthesis)"
I think that about sums it up!
From the Python documentation of the re module:
{m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
I'm confused about how this works. How is this any different from {m}? I do not see how there could ever be a case where the pattern could match more than m repetitions. If there are m+1 repetitions in a row, then there are also m. What am I missing?
Whereas, it is true that a regex solely containing a{3,5}? and one with the pattern: a{3} will match the same thing (i.e. re.match(r'a{3,5}?', 'aaaaa').group(0) and re.match(r'a{3}', 'aaaaa').group(0)
will both return 'aaa'), the differences between the patterns becomes clear when you look at patterns containing these two elements. Say your pattern is a{3,5}?b, then aaab, aaaab, and aaaaab will be matched. If you just used a{3}b then only aaab would get matched. aaaab and aaaaab would not.
Look to Shashank's answer for examples that flush out this difference a little more, or test your own. I've found that this site is a good resource to use to test out python regular expressions.
I think the way to see the difference between the two is through the following examples:
>>> re.findall(r'ab{3,5}?', 'abbbbb')
['abbb']
>>> re.findall(r'ab{3}', 'abbbbb')
['abbb']
Those two runs give the same results as expected, but let's see some differences.
Difference 1: A range quantifier on a subpattern lets you match a large range of patterns containing that subpattern. This lets you find matches where there normally wouldn't be any if you used an exact quantifier:
>>> re.findall(r'ab{3,5}?c', 'abbbbbc')
['abbbbbc']
>>> re.findall(r'ab{3}c', 'abbbbbc')
[]
Difference 2: Greedy doesn't necessarily mean "match the shortest subpattern possible". It's actually a bit more like "match the shortest subpattern possible starting from the leftmost unmatched index that can possibly start off a match":
>>> re.findall(r'b{3,5}?c', 'bbbbbc')
['bbbbbc']
>>> re.findall(r'b{3}c', 'bbbbbc')
['bbbc']
The way I think of regex is as a construct that scans the string from left to right with two iterators that point to indices in the string. The first iterator marks the beginning of the next possible pattern. The second iterator goes through the suffix of the substring starting from the first iterator and tries to complete the pattern. The first iterator only advances when the construct determines that the regex pattern cannot possibly match a string starting from that index. Thus, defining a range for your quantifier will make it so that the first iterator will keep matching sub-patterns beyond the minimum value specified even if the quantifier is non-greedy.
A non-greedy regex will stop its second iterator as soon as the pattern can stop, but a greedy regex will "save" the position of a matched pattern and keep searching for a longer one. If a longer pattern is found, then it uses that one instead, if it's not found, then it uses the shorter one that it saved in memory earlier.
That's why you see the possibly surprising result with 'b{3,5}?c' and 'bbbbbc'. Although the regex is greedy, it will still never advance its first iterator until the pattern match fails, and that's why the substring with 5 'b' characters is matched by the non-greedy regex even though its not the shortest pattern matchable.
SwankSwashbucklers's answer describes the greedy version. The ? makes it non-greedy, which means it will try to match as few items as possible, which means that
`re.match('a{3,5}?b', 'aaaab').group(0)` # returns `'aaaab'`
but
`re.match('a{3,5}?', 'aaaa').group(0)` # returns `'aaa'`
let say we have a string to be searched is:
str ="aaaaa"
Now we have patter = a{3,5}
The string which it matches are :{aaa,aaaa,aaaaa}
But here we have string as "aaaaa" since we have only one option.
Now lets say we have pattern = a{3,5}?
in this case it matches only "aaa" not "aaaaa".
Thus it takes the minimum items as possible,being non greedy.
please try using online regular Expression at :https://pythex.org/
It will be great help and we check immediately what it matches and what it does not
I bumped into the problem while playing around in Python: when I create a random string, let's say "test 1981", the following Python call returns with an empty string.
>>> re.search('\d?', "test 1981").group()
''
I was wondering why this is. I was reading through some other posts, and it seems that it has to do with greedy vs. non-greedy operators. Is it that the '?' checks to see if the first value is a digit, and if it's not, it takes the easier, quicker path and just outputs nothing?
Any clarification would help. Thanks!
Your pattern matches a digit or the empty string. It starts at the first character and tries to match a digit, what it is doing next is trying to match the alternative, means the empty string, voilà a match is found before the first character.
I think you expected it to move on and try to match on the next character, but that is not done, first it tries to match what the quantifier allows on the first position. And that is 0 or one digit.
The use of the optional quantifier makes only sense in combination with a required part, say you want a digit followed by an optional one:
>>> re.search('\d\d?', "test 1981").group()
'19'
Otherwise your pattern is always true.
Regex
\d?
simply means that it should optionally (?) match single digit (\d).
If you use something like this, it will work as you expect (match single digit anywhere in the string):
\d
re.search('\d?', "test 1981").group() greedily matches the first match of the pattern (0 or 1 digits) it can find. In this case that's zero digits. Note that re.search('\d?', "1981 test").group() actually matches the string '1' at the beginning of the string. What you're probably looking for here is re.search('\d+', "test 1981").group(), which finds the whole string 1981 no matter where it is.
I want to find strings that do NOT match a particular sequence of characters. For example:
something like
REGEX = r'[^XY]*'
I'd like to look for strings that have any number of characters except an X and Y next to each other...the REGEX above doesn't work since it blocks X's and Y's separately.
How about:
if "XY" not in s:
print "matched"
else
print "not matched"
Or is this for inclusion in some longer regexp? Then maybe you want a negative lookahead expression:
REGEXP="...(?!XY)..."
EDIT: fixed typo
There are a few ways to do that.
^(?!.*XY).*$
The lookahead expression tries to match a XY sequence anywhere in the string. It's a negative lookahead, so if it finds one, the match attempt fails. Otherwise the .* goes ahead and consumes the whole string.
^(?:(?!XY).)*$
This one repeatedly matches any character (.), but only after the lookahead confirms that the character is not the beginning of a XY sequence.
^(?:[^X]+|X(?!Y))*$
Repeatedly matches one or more of any character except X, or X if it's not followed by Y.
With the first two regexes, you have to apply the DOT_ALL modifier if their might be newlines in the source string. The third one doesn't need that because it uses a negated character class - [^X] - instead of a dot.