how to exclude words in regex using Negative Lookahead? - python

I am trying to exclude a word from a sentence, but if the excluded word does not appear, the regex should keep searching for characters until the exclude word is found.
For example, let's suppose I have a list like this:
S.no Vehicle Status
1 car sold
2 car not sold
3 car sold
4 car Repair
I want to match all the cars that don't have a status of sold (the status could be anything but sold), and I want to capture the status too (if it is not sold).
I tried this regex:
f"car(?!\s+sold)"
But how can I tell it to keep matching when "sold" is not found by the negative lookahead (while still applying that filter)?

You can write the pattern like this:
pattern = r"\bcar\b(?!\s+sold\b).+"
Explanation
\bcar\b Match the word car
(?!\s+sold\b) Assert not 1+ whitespace chars followed by the word "sold" to the right
.+ Match 1+ chars
See a regex demo.
If there has to be a non-whitespace char present after "car" and you don't want to cross newlines:
\bcar\b(?![^\S\n]+sold\b)[^\S\n]+\S.*
See another Regex demo
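A minimal sketch of both ideas applied to the sample list, assuming the rows are held in one multi-line string. The names `rows`, `matches`, and `statuses` are illustrative and not from the question, and the capture group around the status in the second pattern is an addition to the answer's pattern:

```python
import re

rows = """S.no Vehicle Status
1 car sold
2 car not sold
3 car sold
4 car Repair"""

# "car" not followed by whitespace + "sold"; .+ then takes the rest of the line.
pattern = re.compile(r"\bcar\b(?!\s+sold\b).+")
matches = pattern.findall(rows)
print(matches)

# Variant of the second pattern with a capture group (an assumption added here)
# so that only the status text is returned.
status_pattern = re.compile(r"\bcar\b(?![^\S\n]+sold\b)[^\S\n]+(\S.*)")
statuses = status_pattern.findall(rows)
print(statuses)
```

Without re.DOTALL the dot does not cross newlines, so each match stays on its own row.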

Related

Regex (Python) - Match words with two or more distinct vowels

I'm attempting to match words in a string that contain two or more distinct vowels. The question can be restricted to lowercase.
string = 'pool pound polio papa pick pair'
Expected result:
pound, polio, pair
pool and papa would fail because they contain only one distinct vowel. However, polio is fine, because even though it contains two os, it contains two distinct vowels (i and o). (mississippi would fail, but albuquerque would pass.)
Thought process: Using a lookaround, perhaps five times (ignore uppercase), wrapped in a parenthesis, with a {2} afterward. Something like:
re.findall(r'\w*((?=a{1})|(?=e{1})|(?=i{1})|(?=o{1})|(?=u{1})){2}\w*', string)
However, this matches on all six words.
I killed the {1}s, which makes it prettier (the {1}s seem to be unnecessary), but it still returns all six:
re.findall(r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*', string)
Thanks in advance for any assistance. I checked other queries, including "How to find words with two vowels", but none seemed close enough. Also, I'm looking for pure RegEx.
You don't need 5 separate lookaheads, that's complete overkill. Just capture the first vowel in a capture group, and then use a negative lookahead to assert that it's different from the second vowel:
[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*
See the online demo.
Your \w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w* regex matches all words that have at least one vowel. \w* matches 0+ word chars, so the first \w* grabs the whole chunk of letters, digits and underscores. Then backtracking begins: the regex engine steps back until it finds a location that is followed by either a, e, i, o, or u. Once it finds that location, the word chars it gave back are consumed again by the trailing \w*.
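The behaviour described above can be checked with a short sketch. The variable names are illustrative. Note that re.findall returns the contents of the capture group (empty here, since a lookahead consumes nothing), so the whole match is read with finditer and group(0) instead:

```python
import re

string = 'pool pound polio papa pick pair'
original = r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*'

# Whole matches: every word contains at least one vowel, so all six match.
whole_matches = [m.group(0) for m in re.finditer(original, string)]
print(whole_matches)

# findall reports group 1 only, which is an empty string for each match.
group_only = re.findall(original, string)
print(group_only)
```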
To match whole words with at least 2 different vowels, you may use
\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+
See the regex demo.
Details
\b - word boundary
(?=\w*([aeiou])\w*(?!\1)[aeiou]) - a positive lookahead that, immediately to the right of the current location, requires
\w* - 0+ word chars
([aeiou]) - Capturing group 1 (its value is referenced to with \1 backreference later in the pattern): any vowel
\w* - 0+ word chars
(?!\1)[aeiou] - any vowel from the [aeiou] set that is not equal to the vowel stored in Group 1 (due to the negative lookahead (?!\1) that fails the match if, immediately to the right of the current location, the lookahead pattern match is found)
\w+ - 1 or more word chars.
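A small sketch of this pattern in Python, with mississippi and albuquerque from the question appended to the sample string. Because the pattern contains a capture group, re.findall would return only the captured vowel, so the words are collected with finditer (the names `pattern` and `words` are illustrative):

```python
import re

string = 'pool pound polio papa pick pair mississippi albuquerque'
pattern = re.compile(r'\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+')

# group(0) is the whole word; group(1) would be the first vowel of the pair.
words = [m.group(0) for m in pattern.finditer(string)]
print(words)
```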
Match words in a string that contain at least two distinct vowels in the least amount of characters (to my knowledge): \w*([aeiou])\w*(?!\1)[aeiou]\w*
Demo: https://regex101.com/r/uRgVVa/1
Explanation:
\w*: matches 0 or more word characters. You don't need to start with a word boundary (\b) because \w does not include spaces, so using \b would be redundant.
([aeiou]): [aeiou] matches any one vowel. It is in parenthesis so we can reference what vowel was matched later. Whatever is inside these first parenthesis is group 1.
\w*: matches 0 or more word characters.
(?!\1): says the following regex cannot be the same as the character selected in group 1. For example, if the vowel matched in group 1 was a, the following regex cannot be a. This is called by \1, which references what character was chosen in group 1 (e.g. if a matched group 1, \1 references a). ?! is a negative lookahead that says the following regex outside the parenthesis cannot match what follows ?!.
\w*: matches 0 or more word characters.

How are regex quantifiers applied?

I have the following regex:
res = re.finditer(r'(?:\w+[ \t,]+){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)
for item in res:
    print(item.group())
When I use this regex with the following string:
"my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly."
I am getting the following results:
house is painted white, my car
the road, I drive my car
My question is about the quantifier {0,4} that should apply to the whole group. The group collects words with the expression \w+ and some separation symbols with the [ ]. Does the quantifier apply only to the "words" defined by \w+? In the results I am getting 4 words plus spaces and a comma. It's unclear to me.
So, here's what's happening. The ?: makes the group non-capturing. Inside it, \w+ matches one word and [ \t,]+ matches one or more of the separators that follow it (a space, tab, or comma). {0,4} applies to the whole group, so it matches between 0 and 4 repetitions of "word plus separators". The engine therefore finds "my car" and matches up to 4 words immediately before it: all 4 of them match \w+, and the comma and spaces are consumed by the character set you specified.
Broken apart more succinctly
(?: -- Start of the non-capturing group
\w+ -- Match one word (1+ word characters)
[ \t,]+ -- Match 1+ spaces, commas, or tab characters
) -- End of the non-capturing group
{0,4} -- Repeat the preceding group 0-4 times
my car -- Followed by the literal text "my car"
As a result, this will match 0-4 words, each with its trailing spaces / commas / tabs, before the appearance of "my car".
This is working as written
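To see that {0,4} counts repetitions of the whole group rather than individual characters, here is a minimal sketch that splits the first match back into its repeated units. The names `results` and `units` are illustrative:

```python
import re

txt = ("my house is painted white, my car is red.\n"
       "A horse is galloping very fast in the road, I drive my car slowly.")

pattern = re.compile(r'(?:\w+[ \t,]+){0,4}my car', re.IGNORECASE)
results = [m.group() for m in pattern.finditer(txt)]
print(results)

# Strip the trailing "my car" and list each repetition of the group:
# one word plus the separators that follow it.
prefix = results[0][:-len('my car')]
units = re.findall(r'\w+[ \t,]+', prefix)
print(units)
```

Each entry in `units` is one pass through the group, which is why "white, " counts as a single repetition even though it contains a comma and a space.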

Python regex with \w does not work

I want to have a regex to find a phrase and two words preceding it if there are two words.
For example I have the string (one sentence per line):
Chevy is my car and Rusty is my horse.
My car is very pretty my dog is red.
If I use the regex:
re.finditer(r'[\w+\b|^][\w+\b]my car',txt)
I do not get any match.
If I use the regex:
re.finditer(r'[\S+\s|^][\S+\s]my car',txt)
I am getting:
's my car' and '. My car' (I am ignoring case and using multi-line)
Why is the regex with \w+\b not finding anything? It should find two words and 'my car'
How can I get two complete words before 'my car' if there are two words. If there is only one word preceding my car, I should get it. If there are no words preceding it I should get only 'my car'. In my string example I should get: 'Chevy is my car' and 'My car' (no preceding words here)
In your r'[\w+\b|^][\w+\b]my car' regex, [\w+\b|^] matches 1 symbol that is either a word char, a +, a backspace, |, or ^, and [\w+\b] matches 1 symbol that is either a word char, a +, or a backspace.
The point is that inside a character class, quantifiers and a lot (but not all) special characters match literal symbols. E.g. [+] matches a plus symbol, [|^] matches either a | or ^. Since you want to match a sequence, you need to provide a sequence of subpatterns outside of a character class.
It seems as if you intended to use \b as a word boundary, however, \b inside a character class matches only a backspace character.
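A short sketch that checks these points, using the sample text from the question (the names `char_class`, `literal_hits`, and `original` are illustrative):

```python
import re

txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'

# Inside [...], \b is the backspace character (\x08), and + | ^ are literals.
char_class = re.compile(r'[\w+\b|^]')
candidates = ['a', '+', '|', '^', '\x08', ' ']
literal_hits = [ch for ch in candidates if char_class.fullmatch(ch)]
print(literal_hits)

# The original pattern needs two such single characters directly before
# "my car", but in this text "my car" is always preceded by whitespace.
original = re.search(r'[\w+\b|^][\w+\b]my car', txt, re.I | re.M)
print(original)
```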
To find two words and 'my car', you can use, for example
\S+\s+\S+\s+my car
See the regex demo (here, \S+ matches one or more non-whitespace symbols, and \s+ matches 1 or more whitespaces, and the 2 occurrences of these 2 consecutive subpatterns match these symbols as a sequence).
To make the sequences before my car optional, just use a {0,2} quantifier like this:
(?:\S+[ \t]+){0,2}my car
See this regex demo (to be used with the re.IGNORECASE flag). See Python demo:
import re
txt = 'Chevy is my car and Rusty is my horse.\nMy car is very pretty my dog is red.'
print(re.findall(r'(?:\S+[ \t]+){0,2}my car', txt, re.I))
Details:
(?:\S+[ \t]+){0,2} - 0 to 2 sequences of 1+ non-whitespaces followed by 1+ space or tab symbols (you may also replace [ \t] with [^\S\r\n] to match any horizontal whitespace, or with \s if you also plan to match linebreaks).
my car - a literal text my car.

Using Regex to capture phrase

My question is regarding the following tweets:
Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX
JPMorgan Chase & Co Trims Occidental Petroleum Co (OXY) Target Price to
I want to remove "Randgold Resources Limited (RRS)" from the first tweet and "Occidental Petroleum Co (OXY)" from the second tweet using Regex.
I am working in Python and so far I have tried this without much luck:
Trims\s[\w\s.()]+(?=Target)
I want to capture the phrase "Trims Target Price" in both instances. Help would be appreciated.
You can use this lookaround based regex:
p = re.compile(r'(?<= Trims) .*?(?= Target )')
result = re.sub(p, "", test_str)
(?<= Trims) .*?(?= Target ) will match any text that is between Trims and Target.
RegEx Demo
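A minimal runnable sketch of this substitution on the two tweets from the question. The list names `tweets` and `cleaned` are illustrative:

```python
import re

tweets = [
    "Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX",
    "JPMorgan Chase & Co Trims Occidental Petroleum Co (OXY) Target Price to",
]

# Remove everything between " Trims" and " Target "; the lazy .*? stops at
# the first " Target " that follows.
p = re.compile(r'(?<= Trims) .*?(?= Target )')
cleaned = [p.sub("", t) for t in tweets]
for line in cleaned:
    print(line)
```

The match starts at the space after "Trims" and stops before the space in front of "Target", so exactly one space is left between the two words.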
(?<=Trims )([A-Z][a-z]+ ){3}\([A-Z]{3}\)
See it in action
The idea is:
(?<=Trims ) - find a place preceded by Trims using positive lookbehind
[A-Z][a-z]+ - a word starting with capital letter that continues with multiple lower case letters
([A-Z][a-z]+ ){3} - three such words followed by space
\( and \) - brackets have to be escaped, otherwise they have the meaning of capturing group
[A-Z]{3} - three capital letters
Your pattern is missing the (?<=...) lookbehind assertion (match if preceded by ...) for the word Trims:
re.sub(r'(?<=Trims)\s[\w\s.()]+(?=Target)', ' ', text)

How to search for a particular string within a regex group within a single regex in Python?

I am trying to match the word 'noun' within the last group of my regex.
So far I have:
tags = 'motocykl mutka 1 motorcycle bike moped 0 transportation openair noun'
print(re.search(r'(?P<pol>\D+)(?P<d1>\d)(?P<eng>\D+)(?P<d2>\d)(?P<end>\D+)', tags).group('end'))
All I get is a string which is that last group:
transportation openair noun
I need to just get:
noun
UPDATE:
I forgot to mention that 'noun' will not be showing up as the last word in some strings I will be running the regex against. For example:
tags = 'dźwig 1 crane 0 noun construction vehicle'
tags = 'trycykl 1 tricycle 0 child noun transportation'
Any ideas on how to do this within a single regex?
Not sure what your tags mean, but \D+? should match "transportation openair" and [a-zA-Z]+ will match the last word (noun):
^(?P<pol>\D+)(?P<d1>\d)(?P<eng>\D+)(?P<d2>\d)\D+?(?P<end>[a-zA-Z]+)$
Your problem is that you are matching with \D+, which will match multiple words including spaces. It makes perfect sense that you are getting the last group of words.
So you need to make your last group only match non-whitespace characters, and before your last group match on a whitespace character.
Here's a pattern that matches "transportation openair" in a group called "category" and correctly matches "noun" in the group "end". Because we used the non-greedy +? in matching category, we need a $ to anchor the end group so that it is actually the last word in the string.
re.match(r'(?P<pol>\D+)(?P<d1>\d)(?P<eng>\D+)(?P<d2>\d)(?P<category>\D+?)\W+(?P<end>\w+)$', tags).group('end')
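The pattern above returns the last word, which is 'noun' only for the first sample. For the strings in the question's update, where 'noun' is not last, one possible variant is sketched below. It is an assumption and not taken from the answers: it keeps the same leading groups and lets a lazy \D*? skip ahead to the first whole word "noun" after the second digit. The names `samples`, `last_word`, `anywhere`, and `found` are illustrative.

```python
import re

samples = [
    'motocykl mutka 1 motorcycle bike moped 0 transportation openair noun',
    'dźwig 1 crane 0 noun construction vehicle',
    'trycykl 1 tricycle 0 child noun transportation',
]

# The answer's pattern: the "end" group is always the last word.
last_word = re.compile(
    r'(?P<pol>\D+)(?P<d1>\d)(?P<eng>\D+)(?P<d2>\d)(?P<category>\D+?)\W+(?P<end>\w+)$'
)
print([last_word.match(s).group('end') for s in samples])

# Hypothetical variant: find the word "noun" anywhere after the second digit.
anywhere = re.compile(
    r'^(?P<pol>\D+)(?P<d1>\d)(?P<eng>\D+)(?P<d2>\d)\D*?\b(?P<end>noun)\b'
)
found = [anywhere.search(s).group('end') for s in samples]
print(found)
```

If a string contains no "noun" after the second digit, `anywhere.search` returns None, so real code would need to check the result before calling group.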
