Ignoring a word in regex (negative lookahead) - python

I'm looking to try and ignore a word in regex, but the solutions I've seen here did not work correctly for me.
Regular expression to match a line that doesn't contain a word
The issue I'm facing is I have an existing regex:
(?P<MovieCode>[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
That is matching on Deku-041114-575-boku.mp4.
However, I want this regex to fail to match for cases where the MovieCode group has Deku in it.
I tried
(?P<MovieCode>(?!Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
but unfortunately it just matches eku-124 and I need it to fail.
I have a regex101 with my attempts.
https://regex101.com/r/xqALM2/2

The MovieClose group can match 3-6 chars A-Z and Deku has 4 chars. If that part should not contain Deku, you could use the negative lookahead predeced by repeating 0+ times a character class [A-Za-z]* as it can not cross the -.
To prevent matching eku-124, you could prepend a word boundary before the MovieClose group or add (?<!\S if there should be a whitespace boundary at the left.
Note that you can omit {1} from the pattern.
\b(?P<MovieCode>(?![A-Za-z]*Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]\b)?
Regex demo

Related

Regex negative lookahead in python [duplicate]

I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.
You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
See this regex demo
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
See another regex demo
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (depending on the engine including/excluding linebreak symbols) followed with Thumb.
Tom(?!\s+Thumb) is what you search for.

Regex - Word boundary not working even with raw-string

I'm coding a set of regex to match dates in text using python. One of my regex was designed to match dates in the format MM/YYYY only. The regex is the following:
r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
Looks like the word boundary is not working as it is matching parts of dates like 12/02/2020 (it should not match this date format at all).
In the attached image only the second pattern should have been recognized. The first one shouldn't, even parts of it, have been a match.
Remembering that the regex should match the MM/YYYY pattern in strings like:
"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
Can you help me find the error in my pattern to make it match only my goal format?
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
What is a word boundary in regex?
What happens is that the \ character is not part of the group \w, thus every time your string has a new \ it is considered to be a new word boundary.
You have not provided the full string you are matching, but I could solve the example you have posted you could solve it by just putting the anchors ^$
^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$
https://regex101.com/r/xncZNN/1
edit:
Working on your full example and your regex I did some "clean up" because it was a bit confusing, but I think I understood the pattern you were trying to map
here is the new:
(?<=^|[a-zA-Z ])(0[0-9]|1[12]|[1-9])(?:\/|\\)([\d]{4})(?=[a-zA-Z ]|$)
I have substituted the word boundary by lookahead (?!...) and lookbehind (?<!...), and specified the pattern I want to match before and after the date. You can adjust it to your specific need and add other characters like numbers or specific stuff.
https://regex101.com/r/xncZNN/4
The problem is that \b\d{2}/\d{4}\b matches 02/2000 in the string 01/02/2000 because the first forward slash is a word break. The solution is to identify the characters that should not precede and follow the match and use negative lookarounds in place of word breaks. Here you could use the regular expression
r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'
The negative lookbehind, (?<![\d/]), prevents the two digits representing the month to be preceded by a digit or forward slash; the negative lookahead, (?![\d/]) prevents the four digits representing the year to be followed by a digit or forward slash.
Regex demo
Python demo
If 6/2000 is to be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].

How to say "match anything until a specific character, then work your way backwards"?

I am often faced with patterns where the part which is interesting is delimited by a specific character, the rest does not matter. A typical example:
/dev/sda1 472437724 231650856 216764652 52% /
I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".
I tried to code this as .*(\d*)%.* but the group is not matched:
.* match anything, any number of times
% ... until you get to the litteral % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell if this reasoning is incorrect, thank you)
(\d*) ... and now before that % you had a (\d*) to match and group
.* ... and the rest does not matter (match everything)
Your regex does not work because . matches too much, and the group matches too little. The group \d* can basically match nothing because of the * quantifier, leaving everything matched by the ..
And your description of .* is somewhat incorrect. It actually matches everything until the end, and moves backwards until the thing after it ((\d*).*) matches. For more info, see here.
In fact, I think your text can be matched simply by:
(\d{1,3})%
And getting group 1.
The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the % right?
If you are just looking to extract just the number then I would use:
import re
pattern = r"\d*(?=%)"
string = "/dev/sda1 472437724 231650856 216764652 52% /"
returnedMatches = re.findall(pattern, string)
The regex expression does a positive look ahead for the special character
In your pattern this part .* matches until the end of the string. Then it backtracks giving up as least as possible till it can match 0+ times a digit and a %.
The % is matched because matching 0+ digits is ok. Then you match again .* till the end of the string. There is a capturing group, only it is empty.
What you might do is add a word boundary or a space before the digits:
.* (\d{1,3})%.* or .*\b(\d{1,3})%.*
Regex demo 1 Or regex demo 2
Note that using .* (greedy) you will get the last instance of the digits and the % sign.
If you would make it non greedy, you would match the first occurrence:
.*?(\d{1,3})%.*
Regex demo
By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:
"/dev/sda1 472437724 231650856 216764652 52"
This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.
In this scenario a couple of options could work for you. I would most recommend to use the previous spaces to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percentage symbol":
' (\d*)%'
Try this:
.*(\b\d{1,3}(?=\%)).*
demo

Python regex: How to make a group of words/character optional?

I am trying to make regex that can match all of them:
word
word-hyphen
word-hyphen-again
that is -\w+could be many depends on words in a term. How can I make it optional
Thing I made so far is given here:- https://regex101.com/r/Atpwze/1
Try using
\w+(-\w+)* for matching 0 or more hyphenated words after first word
\w+(-\w+){0,} same as first case
based on your exact requirement.
In order to eliminate some extreme cases like a-+-+---, you could use \w+(-\w+)*[^\W]
\W matches all non-word characters and ^\W negates the matching of non-word characters
To catch all of your examples, I think you could use:
^\w+(?:\w+\-?|\-\w+)+$
Beginning of the string ^
Match a word character one or more times \w+
Start a non capturing group (?:
Match a word character one or more times with an optional hyphen \w+\-?
Or |
A hyphen with one or more word characters \-\w+
Close the non capturing group )
End of the string $

Negative lookahead not working after character range with plus quantifier

I am trying to implement a regex which includes all the strings which have any number of words but cannot be followed by a : and ignore the match if it does. I decided to use a negative look ahead for it.
/([a-zA-Z]+)(?!:)/gm
string: lame:joker
since i am using a character range it is matching one character at a time and only ignoring the last character before the : .
How do i ignore the entire match in this case?
Link to regex101: https://regex101.com/r/DlEmC9/1
The issue is related to backtracking: once your [a-zA-Z]+ comes to a :, the engine steps back from the failing position, re-checks the lookahead match and finds a match whenver there are at least two letters before a colon, returning the one that is not immediately followed by :. See your regex demo: c in c:real is not matched as there is no position to backtrack to, and rea in real:c is matched because a is not immediately followed with :.
Adding implicit requirement to the negative lookahead
Since you only need to match a sequence of letters not followed with a colon, you can explicitly add one more condition that is implied: and not followed with another letter:
[A-Za-z]+(?![A-Za-z]|:)
[A-Za-z]+(?![A-Za-z:])
See the regex demo. Since both [A-Za-z] and : match a single character, it makes sense to put them into a single character class, so, [A-Za-z]+(?![A-Za-z:]) is better.
Preventing backtracking into a word-like pattern by using a word boundary
As #scnerd suggests, word boundaries can also help in these situations, but there is always a catch: word boundary meaning is context dependent (see a number of ifs in the word boundary explanation).
[A-Za-z]+\b(?!:)
is a valid solution here, because the input implies the words end with non-word chars (i.e. end of string, or chars other than letter, digits and underscore). See the regex demo.
When does a word boundary fail?
\b will not be the right choice when the main consuming pattern is supposed to match even if glued to other word chars. The most common example is matching numbers:
\d+\b(?!:) matches 12 in 12,, but not in 12:, and also 12c and 12_
\d+(?![\d:]) matches 12 in 12, and 12c and 12_, not in 12: only.
Do a word boundary check \b after the + to require it to get to the end of the word.
([a-zA-Z]+\b)(?!:)
Here's an example run.

Categories