Regex negative lookahead in python [duplicate] - python

I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.

You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
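
A minimal sketch of how this pattern behaves in Python's re module. The sample strings are made up for illustration and are not from the question.

```python
import re

# Negative lookahead: "Tom" must not be followed by whitespace + "Thumb".
pattern = re.compile(r"Tom(?!\s+Thumb)")

# Hypothetical sample strings, chosen only to illustrate the behavior.
samples = ["Tom Thumb", "Tom  Thumb", "Tom Sawyer", "Tom", "TomThumb"]

matched = [s for s in samples if pattern.search(s)]
print(matched)
```

Note that "TomThumb" is still matched, because \s+ requires at least one whitespace character between the two words; use \s* instead if that case should be rejected as well.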

In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
See this regex demo
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
See another regex demo
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - a negative lookahead that fails the match if Thumb occurs anywhere to the right of the current position, i.e. after any 0+ characters (whether these may include line break characters depends on the engine and its flags).
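
The difference between the substring version and the whole-word version can be sketched in Python as follows. The sample strings are hypothetical, and re.DOTALL is shown as one way to let . cross line breaks in Python's re module.

```python
import re

anywhere = re.compile(r"Tom(?!.*Thumb)")              # substring check
whole_word = re.compile(r"\bTom\b(?!.*\bThumb\b)")    # whole-word check

# Hypothetical inputs for illustration only.
samples = ["Tom and Thumb", "Tom and Thumbelina", "Tomcat", "Tom Sawyer"]

anywhere_hits = [s for s in samples if anywhere.search(s)]
whole_hits = [s for s in samples if whole_word.search(s)]

# By default "." stops at a newline, so "Thumb" on the next line is not seen.
multiline = "Tom\nThumb"
default_match = anywhere.search(multiline) is not None
dotall_match = re.search(r"Tom(?!.*Thumb)", multiline, re.DOTALL) is not None

print(anywhere_hits, whole_hits, default_match, dotall_match)
```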

Tom(?!\s+Thumb) is what you are looking for.

Related

Ignoring a word in regex (negative lookahead)

I'm looking to try and ignore a word in regex, but the solutions I've seen here did not work correctly for me.
Regular expression to match a line that doesn't contain a word
The issue I'm facing is I have an existing regex:
(?P<MovieCode>[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
That is matching on Deku-041114-575-boku.mp4.
However, I want this regex to fail to match for cases where the MovieCode group has Deku in it.
I tried
(?P<MovieCode>(?!Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?
but unfortunately it just matches eku-124 and I need it to fail.
I have a regex101 with my attempts.
https://regex101.com/r/xqALM2/2
The MovieCode group can match 3-6 chars A-Z and Deku has 4 chars. If that part should not contain Deku, you could use a negative lookahead preceded by the character class [A-Za-z]* repeated 0+ times, as it cannot cross the -.
To prevent matching eku-124, you could prepend a word boundary before the MovieCode group, or add (?<!\S) if there should be a whitespace boundary at the left.
Note that you can omit {1} from the pattern.
\b(?P<MovieCode>(?![A-Za-z]*Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]\b)?
Regex demo
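
A short Python check of both patterns, as a sketch. The inputs other than the file name from the question (Deku-124, ABC-123A) are hypothetical.

```python
import re

# The asker's attempt: the lookahead is only tested at one position,
# so the engine simply starts the match one character later.
attempt = re.compile(r"(?P<MovieCode>(?!Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]{1}\b)?")

# The suggested fix: word boundary plus a "padded" lookahead.
fixed = re.compile(
    r"\b(?P<MovieCode>(?![A-Za-z]*Deku)[A-Za-z]{3,6}-\d{3,5})(?P<MoviePart>[A-C]\b)?"
)

partial = attempt.search("Deku-124").group("MovieCode")
rejected = fixed.search("Deku-041114-575-boku.mp4")
accepted = fixed.search("ABC-123A")  # hypothetical code that should still match

print(partial, rejected, accepted.group("MovieCode"), accepted.group("MoviePart"))
```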

Regex - Word boundary not working even with raw-string

I'm coding a set of regex to match dates in text using python. One of my regex was designed to match dates in the format MM/YYYY only. The regex is the following:
r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
Looks like the word boundary is not working as it is matching parts of dates like 12/02/2020 (it should not match this date format at all).
In the attached image only the second pattern should have been recognized. The first one should not have been matched at all, not even partially.
Remembering that the regex should match the MM/YYYY pattern in strings like:
"The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
Can you help me find the error in my pattern to make it match only my goal format?
A word boundary, in most regex dialects, is a position between \w and \W (non-word char), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).
What is a word boundary in regex?
What happens is that the / character is not part of \w, so every / in your string creates a new word boundary next to the adjacent digit.
You have not provided the full string you are matching, but for the example you posted you could solve it by simply using the anchors ^ and $:
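
The effect can be reproduced in Python with the pattern and the sample sentence from the question; this is only a sketch to make the partial matches visible.

```python
import re

# Pattern exactly as posted in the question.
pattern = re.compile(
    r'\b((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})\b'
)
text = "The range of dates go from 21/02/2020 to 21/03/2020 as specified above."

# \b is satisfied between "/" and the following digit, so the
# MM/YYYY tail of each DD/MM/YYYY date is picked up.
partial_matches = [m.group(0) for m in pattern.finditer(text)]
print(partial_matches)
```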
^((?:(?:0)[0-9])|(?:(?:1)[0-2])|(?:(?:[1-9])))(?:\/|\\)(\d{4})$
https://regex101.com/r/xncZNN/1
edit:
Working from your full example and your regex, I did some cleanup because the pattern was a bit confusing, but I think I understood what you were trying to match.
Here is the new pattern:
(?<=^|[a-zA-Z ])(0[0-9]|1[0-2]|[1-9])(?:\/|\\)([\d]{4})(?=[a-zA-Z ]|$)
I have replaced the word boundaries with a lookbehind (?<=...) and a lookahead (?=...), which specify what is allowed before and after the date. You can adjust them to your specific needs and allow other characters, such as digits or specific symbols.
https://regex101.com/r/xncZNN/4
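
One caveat if this is used from Python: the standard re module only accepts fixed-width lookbehinds, and the alternation ^|[a-zA-Z ] mixes a zero-width and a one-character branch. The sketch below shows the error and one possible rewrite using negative lookarounds. The rewrite is an assumption on my part, not part of the answer, and it writes the month range as 1[0-2] so that October is included.

```python
import re

# A lookbehind with branches of different widths is rejected by re.
try:
    re.compile(r"(?<=^|[a-zA-Z ])\d")
    compiled_ok = True
except re.error:
    compiled_ok = False

# Possible rewrite (assumption): "not preceded/followed by a character
# that is neither a letter nor a space" also covers start and end of string.
python_style = re.compile(
    r"(?<![^a-zA-Z ])(0[0-9]|1[0-2]|[1-9])[/\\](\d{4})(?![^a-zA-Z ])"
)

full_dates = python_style.findall(
    "The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
)
month_year = python_style.findall("due 06/2020 latest")  # hypothetical input
print(compiled_ok, full_dates, month_year)
```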
The problem is that \b\d{2}/\d{4}\b matches 02/2000 in the string 01/02/2000, because there is a word boundary between the first forward slash and the 0 that follows it. The solution is to identify the characters that should not precede and follow the match and use negative lookarounds in place of word boundaries. Here you could use the regular expression
r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])'
The negative lookbehind, (?<![\d/]), prevents the two digits representing the month from being preceded by a digit or forward slash; the negative lookahead, (?![\d/]), prevents the four digits representing the year from being followed by a digit or forward slash.
Regex demo
Python demo
If 6/2000 is to be matched as well as 06/2000, change (?:0[1-9] to (?:0?[1-9].
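
A quick Python sketch of both variants. The first input is the sentence from the question; the second is a made-up string for illustration.

```python
import re

strict = re.compile(r'(?<![\d/])(?:0[1-9]|1[0-2])/\d{4}(?![\d/])')
# Variant with an optional leading zero, so 6/2000 is accepted too.
lenient = re.compile(r'(?<![\d/])(?:0?[1-9]|1[0-2])/\d{4}(?![\d/])')

full_dates = "The range of dates go from 21/02/2020 to 21/03/2020 as specified above."
mixed = "Invoices from 06/2000 and 6/2000 and 12/1999."  # hypothetical input

strict_full = strict.findall(full_dates)
strict_mixed = strict.findall(mixed)
lenient_mixed = lenient.findall(mixed)
print(strict_full, strict_mixed, lenient_mixed)
```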

Regex that does not contain a substring after some point

I want a regex that does not match a string if it contains the word page, and matches if it does not.
^https?.+/(event|news)/.+(?!page).+$ is the regex I'm currently using. I want it not to match, e.g., https://www.foosite.com/news/foopath/page/10, but it does. Where did I make a mistake?
The double .+ expressions should imply that there should be some string around the page string, and (?!page) should imply there must not be a string like page between them. What's wrong with this expression? Thanks, and sorry for poor grammar.
Your problem is that .+(?!page).+ will match foopath/page/10: the first .+ can match foopath/page/1 and the second can match the remaining 0. The lookahead (?!page) is only checked at the single position between them, where the text that follows is 0, not page. Instead, just assert that page does not occur anywhere after (event|news)/:
^https?.+/(event|news)/(?!.*page)
Demo on regex101
If you want more than just a match/nomatch decision, you can capture the entire matching string with this regex:
^https?.+/(event|news)/(?!.*page).*$
Demo on regex101
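
As a sketch, the original and the corrected pattern can be compared in Python. Only the first URL comes from the question; the other two are hypothetical.

```python
import re

original = re.compile(r"^https?.+/(event|news)/.+(?!page).+$")
corrected = re.compile(r"^https?.+/(event|news)/(?!.*page).*$")

urls = [
    "https://www.foosite.com/news/foopath/page/10",  # from the question
    "https://www.foosite.com/news/foopath/10",       # hypothetical
    "https://www.foosite.com/event/launch",          # hypothetical
]

original_hits = [u for u in urls if original.search(u)]
corrected_hits = [u for u in urls if corrected.search(u)]
print(original_hits)
print(corrected_hits)
```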
You might be looking for
^https?.+/(event|news)/(?:(?!page).)+$
See a demo on regex101.com.
Matching is usually way easier in regex than excluding.
I would rather match your excluded words and invert the logic on the if-clause.
if(!re.match(...
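
One way to read this suggestion in Python, as a rough sketch: match the prefix positively, then search for the excluded word and negate the result. The function name is_wanted and the sample URLs other than the first are hypothetical.

```python
import re

prefix = re.compile(r"^https?.+/(event|news)/")
excluded = re.compile(r"page")

def is_wanted(url: str) -> bool:
    """Return True if the URL has the prefix and no 'page' after it."""
    m = prefix.search(url)
    if m is None:
        return False
    # Search for the excluded word only in the part after the prefix.
    return excluded.search(url, m.end()) is None

results = [
    is_wanted("https://www.foosite.com/news/foopath/page/10"),
    is_wanted("https://www.foosite.com/news/foopath/10"),
    is_wanted("https://www.foosite.com/about"),
]
print(results)
```

Note that re.search is used for the excluded word here, since re.match would only look at the start of the string.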

Ignore words containing substring using regular expressions

I am a beginner and have spent considerable amount of time on this. I was partially able to solve it.
Problem: I want to ignore all words that contain either the or The. E.g. atheist, others, The, the will be excluded. However, hottie shouldn't be excluded, because the letters t, h, e do not occur in it as the contiguous substring the.
I am using Python's re engine.
Here's my regex:
\b - Start at word boundary
(?! - Negative lookahead to avoid starting with the or The
[t|T]he - the and The
)
\w+ - Other letters are fine
(?<! - Negative look behind
[t|T]he - the or The shouldn't occur before \w+
)
\b - Word boundary
Expected output for a given input:
Input: Atheist Others Their Hello the The bathe hottie tahaie theater
Expected Output: Hello hottie tahaie
As one can see in regex101, I am able to exclude most of the words except words like atheist--i.e. cases when the or The appear inside words. I searched for this on SO and found some threads such as How to exclude specific string using regex in Python?, but they don't seem to be directly related to what I am trying to do.
Any help will be greatly appreciated.
Please note that I am interested in solving this problem only using regex. I am not looking for solutions using python's string manipulation.
The approach is simpler than your original regular expression:
\b(?!\w*[tT]he)\w+\b
We match a word, but make sure that there is no the within the word using a "padded" negative lookahead. Your original approach only disallowed the at the front or the back of the word as it allowed for no padding after/before the word boundary.
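(?![tT]he) only checks at the current position, while (?!\w*[tT]he) lets the check extend to the right of the current position, because the \w* can be used as filler. Note that the character class should be written [tT]: inside a character class, | is a literal pipe rather than alternation, so [t|T] would also accept a | character.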
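
Running both approaches on the input from the question makes the difference visible. This is a sketch; the asker's verbose pattern is condensed to a single line here and written with [tT].

```python
import re

text = "Atheist Others Their Hello the The bathe hottie tahaie theater"

# Asker's approach: only rejects "the"/"The" at the very start or very end.
edges_only = re.compile(r"\b(?![tT]he)\w+(?<![tT]he)\b")

# Padded lookahead: rejects "the"/"The" anywhere inside the word.
padded = re.compile(r"\b(?!\w*[tT]he)\w+\b")

edges_result = edges_only.findall(text)
padded_result = padded.findall(text)
print(edges_result)
print(padded_result)
```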
