Quick Python Regex Question: Matching negated sets of characters - python

I want to find strings that do NOT match a particular sequence of characters. For example:
something like
REGEX = r'[^XY]*'
I'd like to look for strings that have any number of characters except an X and Y next to each other...the REGEX above doesn't work since it blocks X's and Y's separately.

How about:
if "XY" not in s:
    print("matched")
else:
    print("not matched")
Or is this for inclusion in some longer regexp? Then maybe you want a negative lookahead expression:
REGEXP="...(?!XY)..."
EDIT: fixed typo

There are a few ways to do that.
^(?!.*XY).*$
The lookahead expression tries to match an XY sequence anywhere in the string. It's a negative lookahead, so if it finds one, the match attempt fails. Otherwise the .* goes ahead and consumes the whole string.
^(?:(?!XY).)*$
This one repeatedly matches any character (.), but only after the lookahead confirms that the character is not the beginning of an XY sequence.
^(?:[^X]+|X(?!Y))*$
Repeatedly matches one or more of any character except X, or X if it's not followed by Y.
With the first two regexes, you have to apply the DOTALL modifier (re.DOTALL) if there might be newlines in the source string. The third one doesn't need that because it uses a negated character class - [^X] - instead of a dot.
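As a quick check, here is a minimal sketch that runs the three patterns above against a few sample strings. The dictionary keys and the sample strings are made up for illustration; only the patterns come from the answer.

```python
import re

# The three patterns from the answer. The first two use a dot, so they get
# re.DOTALL to handle newlines; the third uses a negated class and needs no flag.
patterns = {
    "lookahead_once": re.compile(r"^(?!.*XY).*$", re.DOTALL),
    "lookahead_per_char": re.compile(r"^(?:(?!XY).)*$", re.DOTALL),
    "negated_class": re.compile(r"^(?:[^X]+|X(?!Y))*$"),
}

# Hypothetical sample inputs: the first three contain no "XY", the last two do.
samples = ["abc", "XaY", "YX", "aXYb", "line1\nXY"]

# For each pattern, collect the samples it accepts.
results = {
    name: [s for s in samples if pattern.match(s)]
    for name, pattern in patterns.items()
}

for name, accepted in results.items():
    print(name, accepted)
```

All three patterns should accept the same strings, namely the ones in which X is never immediately followed by Y.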

Related

Combining positive and negative lookahead in python

I'm trying to extract tokens that satisfy many conditions out of which, I'm using lookahead to implement the following two conditions:
The tokens must be either numeric/alphanumeric (i.e, they must have at least one digit). They can contain few special characters like - '-','/','\','.','_' etc.,
I want to match strings like: 165271, agya678, yah#123, kj*12-
The tokens can't have consecutive special characters like: ajh12-&
I don't want to match strings like: ajh12-&, 671%&i^
I'm using a positive lookahead for the first condition: (?=\w*\d\w*) and a negative lookahead for the second condition: (?!=[\_\.\:\;\-\\\/\#\+]{2})
I'm not sure how to combine these two look-ahead conditions.
Any suggestions would be helpful. Thanks in advance.
Edit 1 :
I would like to extract complete tokens that are part of a larger string too (i.e., They may be present in middle of the string).
I would like to match all the tokens in the string:
165271 agya678 yah#123 kj*12-
and none of the tokens (not even a part of a token) in the string: ajh12-& 671%&i^
In order to force the regex to consider the whole string I've also used \b in the above regexs : (?=\b\w*\d\w*\b) and (?!=\b[\_\.\:\;\-\\\/\#\+]{2}\b)
You can use
^(?!.*[_.:;\-\\\/#+*]{2})(?=[^\d\n]*\d)[\w.:;\-\\\/#+*]+$
The positive lookahead (?=[^\d\n]*\d) uses a negated character class to match any characters except digits or newlines, and then matches a digit. This ensures the string contains at least one digit.
Note that you also have to add * and that most characters don't have to be escaped in the character class.
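Using contrast, you could also turn the first .* into a negated character class to prevent some backtracking:
^(?![^_.:;\-\\\/#+*\n]*[_.:;\-\\\/#+*]{2})(?=[^\d\n]*\d)[\w.:;\-\\\/#+*]+$
Note that this variant only detects a doubled special character if no single special character comes before it, so the .* version is the more robust of the two.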
Edit
Without the anchors, you can use whitespace boundaries to the left (?<!\S) and to the right (?!\S)
(?<!\S)(?!\S*[_.:;\-\\\/#+*]{2})(?=[^\d\s]*\d)[\w.:;\-\\\/#+*]+(?!\S)
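The whitespace-boundary approach can be tried on the two sample strings from the question. This is only a sketch: the names SPECIALS and TOKEN are made up here, and the negative lookahead is written as (?!...), without an equals sign after the exclamation mark.

```python
import re

# Special characters allowed inside a token (hypothetical helper constant).
SPECIALS = r"_.:;\-\\/#+*"

TOKEN = re.compile(
    r"(?<!\S)"                          # left boundary: start or whitespace
    r"(?!\S*[" + SPECIALS + r"]{2})"    # no two special characters in a row
    r"(?=[^\d\s]*\d)"                   # at least one digit in the token
    r"[\w" + SPECIALS + r"]+"           # the token itself
    r"(?!\S)"                           # right boundary: end or whitespace
)

good = "165271 agya678 yah#123 kj*12-"
bad = "ajh12-& 671%&i^"

good_tokens = TOKEN.findall(good)
bad_tokens = TOKEN.findall(bad)

print(good_tokens)
print(bad_tokens)
```

Every token of the first string is returned, and nothing is extracted from the second string, not even a partial token, because the right-hand boundary (?!\S) rejects a match that stops in the middle of a token.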
You can use multiple look ahead assertions to only capture strings that
(?!.*(?:\W|_){2,}.*) - doesn't have consecutive special characters and
(?=.*\d.*) - has at least 1 digit
^(?!.*(?:\W|_){2,}.*)(?=.*\d.*).*$

Inverse regex match on group in Python

I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.
Given a list of words, I want to print all the words that do not have special characters.
I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.
In my case given a string like: "should print nŌt thìs"
should print should and print but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) gives you the special characters - I just want to invert that.
For this particular case, you can simply specify the regular alphabet range in your search:
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']
Of course you can add digits or anything else you want to match as well.
As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:
r"\b(?!\w*[À-ǚ])\w*"
This:
Checks for a word boundary \b, like a space or the start of the input string.
Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
Finally, if it makes it past the lookahead, it matches any word characters.
Note that on regex101.com you must select the Python flavor for \b to work properly with special characters.
There is a third option as well:
r"\b[^À-ǚ\s]*\b"
The middle part [^À-ǚ\s]* means match any character other than special characters or whitespace an unlimited number of times.
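A small sketch of the negative-lookahead variant applied with re.findall, assuming Python 3 (where \w matches Unicode letters). It uses \w+ rather than \w* so that empty matches are not returned, and writes the accented characters as \u escapes so the example does not depend on the file encoding; both choices are mine, not the answer's.

```python
import re

# Same sample sentence as in the question, with the accented letters escaped.
text = "should print n\u014ct th\u00ecs"

# \b            start at a word boundary
# (?!\w*[...])  give up if a character in the range appears later in the word
# \w+           otherwise match the word (at least one character)
plain_words = re.findall(r"\b(?!\w*[\u00C0-\u01DA])\w+", text)

print(plain_words)
```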
I know this is not a regex, but just a completely different idea you may not have had besides using regexes. I suppose it would be also much slower but I think it works:
>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']
...  if any('WITH' in ud.name(letter) for letter in word)]
['Cá', 'Lá']
To reverse the filter, use all('WITH' not in ud.name(letter) for letter in word) as the condition.

Understanding Positive Look Ahead Assertion

From Python 3.4.1 docs:
(?=...)
Positive lookahead assertion. This succeeds if the contained regular expression, represented here by ..., successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn’t advance at all; the rest of the pattern is tried right where the assertion started.
I'm trying to understand regex in Python. Could you please help me understand the second sentence, especially the bolded words? Any example will be appreciated.
Lookarounds are zero-width assertions. They don't consume any characters on the string.
To touch briefly on the bolded portions of the documentation:
This means that after looking ahead, the regular expression engine is back at the same position on the string from where it started looking. From there, it can start matching again...
The key point:
You can get a zero-width match which is a match that does not consume any characters. It only matches a position in the string. The point of zero-width is the validation to see if a regular expression can or cannot be matched looking ahead or looking back from the current position, without adding them to the overall match.
An answer in an example form. On string "xy":
(?:x) will match "x"
(?:x)x will not match, because there is no second x after the first x
(?:x)y will match "xy", by advancing over x and then y.
(?=x) will match "" at the start of the string, since x is following.
(?=x)x will match "x" - it recognises that an x follows, and then it advances over it.
(?=x)y will not match, since it affirms there is an x following, but then tries to advance over it using y.
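The six cases above can be reproduced with a short loop. This sketch uses re.match so that every attempt starts at the beginning of "xy"; the variable names are illustrative.

```python
import re

cases = ["(?:x)", "(?:x)x", "(?:x)y", "(?=x)", "(?=x)x", "(?=x)y"]

# Map each pattern to the text it matched, or None when there is no match.
results = {}
for pattern in cases:
    m = re.match(pattern, "xy")
    results[pattern] = m.group() if m else None

for pattern, matched in results.items():
    print(pattern, "->", repr(matched))
```

The lookahead versions never include the looked-at x in the result unless a later part of the pattern consumes it.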
Generally a Regular Expression engine is "consuming" your string character by character as it matches up with your regular expression.
If you use a look-ahead operator, the engine will instead simply look ahead without "consuming" any characters while it looks for a match.
Example
A good example is a regular expression to match a password that needs to contain at least one numeric digit and be between 6 and 20 characters long.
You could write two checks (one to check if a digit exists, and one to check if the string length is as required), or use a single regular expression:
^(?=.*\d).{6,20}$
The first portion, (?=.*\d), checks whether there is a digit anywhere in the string. When it completes we are back at the beginning of the string again (we were only "looking ahead"), and if it passed, we go on to the next portion of the regex.
Now .{6,20} is no longer a lookahead, and begins consuming the string. When the entire string is consumed, a match has been found.
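A minimal sketch of the password check in Python. It uses re.fullmatch so that the 6-20 length limit applies to the entire string; the function name is_valid and the sample passwords are made up for illustration.

```python
import re

# Lookahead: some digit exists; then consume 6 to 20 characters.
PASSWORD = re.compile(r"(?=.*\d).{6,20}")


def is_valid(password):
    """Return True if the whole password satisfies both conditions."""
    return PASSWORD.fullmatch(password) is not None


print(is_valid("abc123"))   # has a digit, length 6
print(is_valid("abcdef"))   # no digit
print(is_valid("a1"))       # too short
print(is_valid("1" * 21))   # too long
```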

Python Regex to capture single character alphabeticals

Why doesn't the below regex print True?
print(re.compile(r'^\b[a-z]\b$').search('(s)'))
I want to match single char alphabeticals that may have non alphanumeric characters before and after, but do not have any more alphanumeric characters anywhere in the string. So the following should be matches:
'b'
'b)'
'(b)'
'b,'
and the following should be misses:
'b(s)'
'blah(b)'
'bb)'
'b-b'
'bb'
The solutions here don't work.
The ^ at the beginning and $ at the end cause the expression to match only if the entire string is a single character. (Thus, they make each \b redundant.) Remove the anchors to match inside a larger string:
print(re.compile(r'\b[a-z]\b').search('b(s)'))
Alternatively, ensure only one character like:
print(re.compile(r'^\W*[a-z]\W*$').match('b(s)'))
Note that in the first case, 'b(s)', 'b-b' and 'blah(b)' will match because they contain single alphabetical characters that do not touch other alphanumerics. In the second case, 'b(s)' will not match, because it contains two alphabetical characters; the four intended cases match, and all of the no-match cases return None (a false logical value), as intended.
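To check the second pattern against every example from the question, a sketch like the following can be used. The list names are illustrative; the strings are the ones listed in the question.

```python
import re

# Optional non-word characters, exactly one lowercase letter, optional non-word characters.
SINGLE_LETTER = re.compile(r"^\W*[a-z]\W*$")

should_match = ["b", "b)", "(b)", "b,"]
should_miss = ["b(s)", "blah(b)", "bb)", "b-b", "bb"]

matched = [s for s in should_match if SINGLE_LETTER.match(s)]
wrongly_matched = [s for s in should_miss if SINGLE_LETTER.match(s)]

print(matched)
print(wrongly_matched)
```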
Ok here is the answer:
print(re.compile(r'^[(,\[]?[a-z][),;\]]?[,;]?$').search('(s)'))
It catches a variety of complex patterns for single character alphanumerics. I realize this is different than what I asked for but in reality it works better.

Python Regex: Question mark (?) doesn't match in middle of string

I bumped into the problem while playing around in Python: when I create a random string, let's say "test 1981", the following Python call returns with an empty string.
>>> re.search('\d?', "test 1981").group()
''
I was wondering why this is. I was reading through some other posts, and it seems that it has to do with greedy vs. non-greedy operators. Is it that the '?' checks to see if the first value is a digit, and if it's not, it takes the easier, quicker path and just outputs nothing?
Any clarification would help. Thanks!
Your pattern matches either a digit or the empty string. The engine starts at the first character and tries to match a digit. That fails, so it tries the alternative, the empty string, and a match is found before the first character.
I think you expected it to move on and try the next character, but that is not what happens: it first tries everything the quantifier allows at the first position, which is zero or one digit.
The optional quantifier only makes sense in combination with a required part. Say you want a digit followed by an optional one:
>>> re.search('\d\d?', "test 1981").group()
'19'
Otherwise your pattern always matches.
Regex
\d?
simply means that it should optionally (?) match single digit (\d).
If you use something like this, it will work as you expect (match single digit anywhere in the string):
\d
re.search('\d?', "test 1981").group() greedily matches the first match of the pattern (0 or 1 digits) it can find. In this case that's zero digits. Note that re.search('\d?', "1981 test").group() actually matches the string '1' at the beginning of the string. What you're probably looking for here is re.search('\d+', "test 1981").group(), which finds the whole string 1981 no matter where it is.
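The behaviours described in the answers can be compared side by side. This sketch uses raw strings for the patterns; the variable names are illustrative.

```python
import re

text = "test 1981"

# \d? may match zero digits, so it succeeds immediately at position 0.
optional = re.search(r"\d?", text)

# A required digit forces the engine to scan forward to the first digit.
required_then_optional = re.search(r"\d\d?", text)
one_or_more = re.search(r"\d+", text)

# When the string starts with a digit, \d? does consume it.
leading_digit = re.search(r"\d?", "1981 test")

print(repr(optional.group()), optional.span())
print(repr(required_then_optional.group()))
print(repr(one_or_more.group()))
print(repr(leading_digit.group()))
```

The span of the first match is (0, 0): an empty match at the very start of the string, which is why .group() returns ''.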
