Ignore words containing substring using regular expressions - python

I am a beginner and have spent a considerable amount of time on this. I was only partially able to solve it.
Problem: I want to ignore all words that contain either the or The. E.g. atheist, others, The, the will be excluded. However, hottie shouldn't be excluded, because the letters t, h and e never appear in it as the contiguous string the.
I am using Python's re engine.
Here's my regex:
\b - Start at word boundary
(?! - Negative lookahead to avoid starting with the or The
[t|T]he - the and The
)
\w+ - Other letters are fine
(?<! - Negative look behind
[t|T]he - the or The shouldn't occur before \w+
)
\b - Word boundary
Expected output for a given input:
Input: Atheist Others Their Hello the The bathe hottie tahaie theater
Expected Output: Hello hottie tahaie
As one can see in regex101, I am able to exclude most of the words except words like atheist--i.e. cases when the or The appear inside words. I searched for this on SO and found some threads such as How to exclude specific string using regex in Python?, but they don't seem to be directly related to what I am trying to do.
Any help will be greatly appreciated.
Please note that I am interested in solving this problem only using regex. I am not looking for solutions using python's string manipulation.

The approach is simpler than your original regular expression:
\b(?!\w*[tT]he)\w+\b
We match a word, but make sure that there is no the within the word using a "padded" negative lookahead. Your original approach only disallowed the at the front or the back of the word as it allowed for no padding after/before the word boundary.
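(?![tT]he) only checks at the current position, while (?!\w*[tT]he) lets the check extend from the current position to anywhere in the word, because the \w* can be used as filler. Note that the character class should be [tT], not [t|T]: inside square brackets | is a literal pipe, not an alternation.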
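A minimal sketch comparing the two approaches on the sample input from the question. The variable names are illustrative only, and the "unpadded" pattern is the question's regex written on one line with [tT] in place of [t|T].

```python
import re

text = "Atheist Others Their Hello the The bathe hottie tahaie theater"

# Unpadded lookahead and lookbehind: only reject "the"/"The" at the very
# start or the very end of a word.
unpadded = re.findall(r"\b(?![tT]he)\w+(?<![tT]he)\b", text)

# Padded lookahead: \w* lets the check slide to any position inside the word.
padded = re.findall(r"\b(?!\w*[tT]he)\w+\b", text)

print(unpadded)  # ['Atheist', 'Others', 'Hello', 'hottie', 'tahaie']
print(padded)    # ['Hello', 'hottie', 'tahaie']
```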

Related

Returning text when text contains key element using regex

I am writing a function in python using regex that should return text when an element of that text is matched but the outputs I'm getting aren't as expected and I'm not sure what is going wrong.
My function is as below:
import re

def latin_ish_words(text):
    latin = re.findall('tion|ex|ph|ost', text, re.I)
    return latin

When I pass latin_ish_words("This functions as expected") it returns the elements 'tion' and 'ex' rather than 'functions' and 'expected'.
If someone could tell me where I've gone wrong, I'd be most appreciative!
Many thanks,
Andrew
The function returns matching text - and that's what you saw. If you want to look for specific string within words, your search should state that.
I think \w*(?:tion|ex|ph|ost)\w* should help you find what you're expecting (quantifiers in Python's re are greedy by default, so no extra option is needed).
Let's look at the modifications:
\w - matches a "word-character" (letters in upper- or lowercase, digits or underscore)
* - previous pattern needs to match between zero and unlimited times
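(?: ... ) - a non-capturing group; one of the alternatives inside it must match next
So basically we're just allowing word characters before and after. If you wanted to be more strict and only accept letters, use [A-Za-z]* instead of \w* (the shorter [A-z] would also match the punctuation characters that sit between Z and a in ASCII).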
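As a quick check, here is a sketch of the question's function with the padded pattern swapped in. Only the pattern differs from the original function.

```python
import re

def latin_ish_words(text):
    # \w* on both sides expands each hit to the whole word that contains it.
    return re.findall(r"\w*(?:tion|ex|ph|ost)\w*", text, re.I)

print(latin_ish_words("This functions as expected"))  # ['functions', 'expected']
```

The group is non-capturing on purpose: with a capturing group, re.findall would return the group contents ('tion', 'ex') instead of the whole words.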

Regex negative lookahead in python [duplicate]

I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.
You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
See this regex demo
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
See another regex demo
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (depending on the engine including/excluding linebreak symbols) followed with Thumb.
Tom(?!\s+Thumb) is what you search for.
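The difference between the two lookaheads can be seen in a short sketch. The sample sentence is made up for illustration and the variable names are arbitrary.

```python
import re

text = "Tom Sawyer met Tom Thumb"

# Rejects only a Tom that is directly followed by whitespace and Thumb.
adjacent = [m.start() for m in re.finditer(r"Tom(?!\s+Thumb)", text)]

# Rejects any Tom that has Thumb anywhere later on the same line.
anywhere = [m.start() for m in re.finditer(r"Tom(?!.*Thumb)", text)]

print(adjacent)  # [0]
print(anywhere)  # []
```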

Inverse regex match on group in Python

I see a lot of similarly worded questions, but I've had a strikingly difficult time coming up with the syntax for this.
Given a list of words, I want to print all the words that do not have special characters.
I have a regex which identifies words with special characters \w*[\u00C0-\u01DA']\w*. I've seen a lot of answers with fairly straightforward scenarios like a simple word. However, I haven't been able to find anything that negates a group - I've seen several different sets of syntax to include the negative lookahead ?!, but I haven't been able to come up with a syntax that works with it.
In my case given a string like: "should print nŌt thìs"
should print should and print but not the other two words. re.findall("(\w*[\u00C0-\u01DA']\w*)", paragraph.text) gives you the special characters - I just want to invert that.
For this particular case, you can simply specify the regular alphabet range in your search:
a = "should print nŌt thìs"
re.findall(r"(\b[A-Za-z]+\b)", a)
# ['should', 'print']
Of course you can add digits or anything else you want to match as well.
As for negative lookaheads, they use the syntax (?!...), with ? before !, and they must be in parentheses. To use one here, you can use:
r"\b(?!\w*[À-ǚ])\w+"
This:
Checks for a word boundary \b, like a space or the start of the input string.
Does the negative lookahead and stops the match if it finds any special character preceded by 0 or more word characters. You have to include the \w* because (?![À-ǚ]) would only check for the special character being the first letter in the word.
Finally, if it makes it past the lookahead, it matches any word characters.
Demo. Note in regex101.com you must specify Python flavor for \b to work properly with special characters.
There is a third option as well:
r"\b[^À-ǚ\s]*\b"
The middle part [^À-ǚ\s]* means match any character other than special characters or whitespace an unlimited number of times.
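A small sketch checking the first two options against the example string from the question. The special characters are written as escapes (\u00C0-\u01DA is the same range as À-ǚ) to avoid encoding surprises, and the lookahead variant uses \w+ so that re.findall does not also return empty strings at the word boundaries.

```python
import re

# "should print nŌt thìs", with the accented letters written as escapes
text = "should print n\u014Ct th\u00ECs"

# Option 1: accept only words made entirely of ASCII letters.
plain = re.findall(r"\b[A-Za-z]+\b", text)

# Option 2: reject any word that contains a character in the special range.
lookahead = re.findall(r"\b(?!\w*[\u00C0-\u01DA])\w+", text)

print(plain)      # ['should', 'print']
print(lookahead)  # ['should', 'print']
```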
I know this is not a regex, but just a completely different idea you may not have had besides using regexes. I suppose it would be also much slower but I think it works:
>>> import unicodedata as ud
>>> [word for word in ['Cá', 'Lá', 'Aqui']
...  if any('WITH' in ud.name(letter) for letter in word)]
['Cá', 'Lá']
Or wrap the condition in not any(...) to get the words without accented letters instead.

Why does this regular expression to match two consecutive words not work?

There is a similar question here: Regular Expression For Consecutive Duplicate Words. This addresses the general question of how to solve this problem, whereas I am looking for specific advice on why my solution does not work.
I'm using Python regex, and I'm trying to match all consecutively repeated words, such as to to and this this in:
I am struggling to to make this this work
I tried:
[A-Za-z0-9]* {2}
This is the logic behind this choice of regex: The '[A-Za-z0-9]*' should match any word of any length, and '[A-Za-z0-9]* ' makes it consider the space at the end of the word. Hence [A-Za-z0-9]* {2} should flag a repetition of the previous word with a space at the end. In other words it says "For any word, find cases where it is immediately repeated after a space".
How is my logic flawed here? Why does this regex not work?
[A-Za-z0-9]* {2}
Quantifiers in regular expressions will always only apply to the element right in front of them. So a \d+ will look for one or more digits but x\d+ will look for a single x, followed by one or more digits.
If you want a quantifier to apply to more than just a single thing, you need to group it first, e.g. (x\d)+. This is a capturing group, so it will actually capture that in the result. This is sometimes undesired if you just want to group things to apply a common quantifier. In that case, you can prefix the group with ?: to make it a non-capturing group: (?:x\d)+.
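A brief sketch of how grouping changes what a quantifier applies to, using the x\d example from above. The test string "x1x2" is chosen purely for illustration.

```python
import re

# The quantifier applies only to \d, so a second "x" is not allowed.
single = re.fullmatch(r"x\d+", "x1x2")

# Grouping makes the quantifier repeat the whole "x digit" unit.
grouped = re.fullmatch(r"(?:x\d)+", "x1x2")

# With a capturing group, findall reports the group (its last repetition)
# rather than the whole match; a non-capturing group avoids that.
capturing = re.findall(r"(x\d)+", "x1x2")
non_capturing = re.findall(r"(?:x\d)+", "x1x2")

print(single)            # None
print(grouped.group(0))  # x1x2
print(capturing)         # ['x2']
print(non_capturing)     # ['x1x2']
```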
So, going back to your regular expression, you would have to do it like this:
([A-Za-z0-9]* ){2}
However, this does not actually have any check that the second matched word is the same as the first one. If you want to match for that, you will need to use backreferences. Backreferences allow you to reference a previously captured group within the expression, looking for it again. In your case, this would look like this:
([A-Za-z0-9]*) \1
The \1 will reference the first capturing group, which is ([A-Za-z0-9]*). So the group will match the first word. Then, there is a space, followed by a backreference to the first word again. So this will look for a repetition of the same word separated by a space.
As bobble bubble points out in the comments, there is still a lot one can do to improve the regular expression. While my main concern was to explain the various concepts without focusing too much on your particular example, I guess I still owe you a more robust regular expression for matching two consecutive words within a string that are separated by a space. This would be my take on that:
\b(\w+)\s\1\b
There are a few things that are different to the previous approach: First of all, I’m looking for word boundaries around the whole expression. The \b matches basically when a word starts or ends. This will prevent the expression from matching within other words, e.g. neither foo fooo nor foo oo would be matched.
Then, the regular expression requires at least one character. So empty words won’t be matched. I’m also using \w here which is a more flexible way of including alphanumerical characters. And finally, instead of looking for an actual space, I accept any kind of whitespace between the words, so this could even match tabs or line breaks. It might make sense to add a quantifier there too, i.e. \s+ to allow multiple whitespace characters.
Of course, whether this works better for you, depends a lot on your actual requirements which we won’t be able to tell just from your one example. But this should give you a few ideas on how to continue at least.
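To make the contrast concrete, here is a sketch that runs the grouped-quantifier pattern and the backreference pattern on the sentence from the question. The \s+ variant suggested above is used, and the variable names are illustrative.

```python
import re

s = "I am struggling to to make this this work"

# A repeated group only asks for "two words"; it never compares them.
any_two = re.search(r"([A-Za-z0-9]* ){2}", s).group(0)

# The backreference \1 requires the second word to equal the first.
repeated = re.findall(r"\b(\w+)\s+\1\b", s)

# The word boundaries stop partial overlaps from counting as repeats.
partial = re.findall(r"\b(\w+)\s+\1\b", "foo fooo and foo oo")

print(repr(any_two))  # 'I am '
print(repeated)       # ['to', 'this']
print(partial)        # []
```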
You can match a previous capture group with \1 for the first group, \2 for the second, etc...
import re
s = "I am struggling to to make this this work"
matches = re.findall(r'([A-Za-z0-9]+) \1', s)
print(matches)
# ['to', 'this']
If you want both occurrences, add a capture group around \1:
matches = re.findall(r'([A-Za-z0-9]+) (\1)', s)
print(matches)
# [('to', 'to'), ('this', 'this')]
At a glance it looks like this will match any two words, not repeated words. If I recall correctly asterisk (*) will match zero or more times, so perhaps you should be using plus (+) for one or more. Then you need to provide a capture and re-use the result of the capture. Additionally the \w can be used for alphanumerical characters for clarity. Also \b can be used to match empty string at word boundary.
Something along the lines of the example below will get you part of the way.
>>> import re
>>> p = re.compile(r'\b(\w+) \1\b')
>>> p.findall('fa fs bau saa saa fa bau eek mu muu bau')
['saa']
These pages may offer some guidance:
Python regex cheat sheet
RegExp match repeated characters
Regular Expression For Consecutive Duplicate Words.
This should work: \b([A-Za-z0-9]+)\s+\1\b
\b matches a word boundary, \s matches whitespace and \1 specifies the first capture group.
>>> s = 'I am struggling to to make this this work'
>>> re.findall(r'\b([A-Za-z0-9]+)\s+\1\b', s)
['to', 'this']
Here is a simple solution not using RegEx.
sentence = 'I am struggling to to make this this work'

def find_duplicates_in_string(words):
    """ Takes in a string and returns any duplicate words
    i.e. "this this"
    """
    duplicates = []
    words = words.split()
    for i in range(len(words) - 1):
        prev_word = words[i]
        word = words[i + 1]
        if word == prev_word:
            duplicates.append(word)
    return duplicates

print(find_duplicates_in_string(sentence))
