python regex include exceptions in the regex expression - python

I am looking for expressions such as Vc or Am in texts, and for that I have
rex = r"(\(?)(?<!([A-Za-z0-9]))[A-Z][a-z](?!([A-Za-z0-9]))(\)?)"
explanation:
[A-Z][a-z] = Cap followed by lower case letter
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9]))(\)?) _ Look ahead not being letter or number
# all of that optionally within parentheses
import re
text="this is Vc and not Cr nor Pb"
matches = re.finditer(rex,text)
What I want to achieve is exclude a list of terms like Cr or Pb.
How should I include exceptions in the expression?
thanks

First, let's shorten your RegEx:
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9]))(\)?) -> look ahead not being letter or number
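These are so common that there is a RegEx feature for them: word boundaries (\b). Like lookarounds, they have zero width; they match only at a position with a word character (letter, digit, or underscore) on one side and a non-word character or the start/end of the text on the other.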
Your RegEx then becomes \b[A-Z][a-z]\b; looking at this RegEx (and your examples), it appears you want to match certain element abbreviations?
Now you can simply use a lookbehind:
\b[A-Z][a-z](?<!Cr|Pb)\b
to assert that the element is neither chromium (Cr) nor lead (Pb).
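As a quick check in Python, the lookbehind version can be run against the sample sentence from the question. This is only a sketch; note that Python's re module requires lookbehinds to have a fixed width, which holds here because Cr and Pb are both two characters long.

```python
import re

# Capital letter + lowercase letter as a whole word. The lookbehind runs
# after the two letters are matched and rejects the pairs Cr and Pb.
pattern = re.compile(r"\b[A-Z][a-z](?<!Cr|Pb)\b")

text = "this is Vc and not Cr nor Pb"
found = pattern.findall(text)
print(found)  # ['Vc']
```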
Just for fun:
Alternatively, if you want a less readable (but more portable) RegEx that makes do with fewer advanced RegEx features (not every engine supports lookaround), you can use character sets as per the following observations:
If the first letter is not a C or P, the second letter may be any lowercase letter;
If the first letter is a C, the second letter may not be an r;
If the first letter is a P, the second letter may not be a b.
Using character sets, this gives us:
[ABD-OQ-Z][a-z]
C[a-qs-z]
P[ac-z]
Operator precedence works as expected here: concatenation (implicit) has higher precedence than alternation (|). This makes the RegEx [ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z]. Wrapping this in word boundaries using a group gives us \b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b.
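One way to convince yourself that the character-set version excludes exactly Cr and Pb is to try every capital + lowercase pair against both patterns. The sketch below does that in Python; the variable names are made up for illustration.

```python
import re
from string import ascii_lowercase, ascii_uppercase

lookbehind = re.compile(r"\b[A-Z][a-z](?<!Cr|Pb)\b")
portable = re.compile(r"\b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b")

# All 26 * 26 combinations of one uppercase and one lowercase ASCII letter.
candidates = [u + l for u in ascii_uppercase for l in ascii_lowercase]

# Pairs on which the two patterns give different answers (expected: none).
disagreements = [
    c for c in candidates
    if bool(lookbehind.fullmatch(c)) != bool(portable.fullmatch(c))
]
# Pairs the portable pattern refuses to match.
rejected = [c for c in candidates if not portable.fullmatch(c)]

print(disagreements)  # []
print(rejected)       # ['Cr', 'Pb']
```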

You might write the pattern without using the superfluous capture groups, and exclude matching Cr or Pb:
\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?
See a regex demo for the matches.
If you are not interested in matching the parentheses, and you also want to treat an underscore like a letter or digit (so that _Vc or Vc_ is not matched), you can use a word boundary instead:
\b(?!Cr\b|Pb\b)[A-Z][a-z]\b
Explanation
\b A word boundary to prevent a partial word match
(?! Negative lookahead
Cr\b|Pb\b Match either Cr or Pb
) Close the lookahead
[A-Z][a-z] Match a single uppercase and single lowercase char
\b A word boundary
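Since the goal is to exclude a list of terms, the lookahead can also be generated from that list. The following is a minimal sketch, not a definitive implementation: build_pattern is a hypothetical helper name, and it assumes the excluded terms are plain strings.

```python
import re

def build_pattern(excluded):
    """Compile a pattern for capital+lowercase pairs, skipping excluded terms."""
    if not excluded:
        # Nothing to exclude: an empty lookahead (?!) would reject everything.
        return re.compile(r"\b[A-Z][a-z]\b")
    # re.escape guards against terms containing regex metacharacters.
    alternatives = "|".join(re.escape(term) + r"\b" for term in excluded)
    return re.compile(r"\b(?!" + alternatives + r")[A-Z][a-z]\b")

pattern = build_pattern(["Cr", "Pb"])
found = pattern.findall("this is Vc and not Cr nor Pb, but (Am) counts")
print(pattern.pattern)  # \b(?!Cr\b|Pb\b)[A-Z][a-z]\b
print(found)            # ['Vc', 'Am']
```

Adding another exception is then a matter of extending the list passed to build_pattern.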

Related

Regular expressions to match numbers (both regular and romans)

I'm trying to write a regex to match both regular numbers (1, 2, 42...) and roman ones (X, VII...).
But the one I've currently written:
\b((?=[MDCLXVI])M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\b|\b\d+\b
is matching more than expected.
It has 9 matches, while I expect only 4:
XII
VII
2
12
How can I fix it?
You don't really need any lookahead in your regex.
Your regex can be simplified and refactored into this:
/
\b
(?:
[MDCLXVI]M{0,3}C[MD]
|
D?C{0,3}X[CL]
|
L?X{0,3}I[XV]
|
[XV]I{0,3}
|
I{1,3}
|
\d+
)
\b
/gix
Updated RegEx Demo
Note that I have used the x (extended mode) flag so that the regex engine ignores whitespace in the pattern, which lets you indent the alternations and makes the regex more readable. I don't know every valid form of Roman numerals, so please recheck each alternation.
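In Python, the closest equivalents of the g, i and x flags are re.findall, re.IGNORECASE and re.VERBOSE. The sketch below only illustrates the extended-mode layout, using I{1,3} for the last Roman alternative; it inherits the caveat above and is not a complete Roman numeral validator. The sample text is just the four matches the question expects.

```python
import re

roman_or_number = re.compile(
    r"""
    \b
    (?:
        [MDCLXVI]M{0,3}C[MD]
      | D?C{0,3}X[CL]
      | L?X{0,3}I[XV]
      | [XV]I{0,3}
      | I{1,3}
      | \d+
    )
    \b
    """,
    re.VERBOSE | re.IGNORECASE,  # whitespace in the pattern is ignored
)

found = roman_or_number.findall("XII VII 2 12")
print(found)  # ['XII', 'VII', '2', '12']
```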
The reason is that a zero-width match is possible between the two word boundaries (i.e. \b(?=[MDCLXVI])\b matches the empty string before any word that starts with a Roman numeral letter), because every part of the Roman numeral pattern is optional.
You need to make the word boundaries more precise: make the leading one match only before a word char, and the trailing one match only after a word char:
(?<!\w)(?:(?=[MDCLXVI])M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})|\d+)(?!\w)
See the regex demo.
Here, (?<!\w) acts as a word boundary that fails the match if, immediately to the left of the current location, there is a word char, and (?!\w) acts as a word boundary that fails the match if, immediately to the right of the current location, there is a word char.
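The difference can be reproduced in Python. The sample text below is an assumption (the question's original test text is not shown); the word Mix is included because it starts with a Roman numeral letter and triggers the empty match in the original pattern.

```python
import re

original = re.compile(
    r"\b((?=[MDCLXVI])M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\b"
    r"|\b\d+\b"
)
fixed = re.compile(
    r"(?<!\w)(?:(?=[MDCLXVI])M{0,3}(?:C[MD]|D?C{0,3})"
    r"(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})|\d+)(?!\w)"
)

sample = "XII and VII, Mix 2 or 12"  # hypothetical test text

# finditer + group(0) is used so the capture groups do not affect the output.
before = [m.group(0) for m in original.finditer(sample)]
after = [m.group(0) for m in fixed.finditer(sample)]

print(before)  # ['XII', 'VII', '', '2', '12']  (empty match in front of "Mix")
print(after)   # ['XII', 'VII', '2', '12']
```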

Regex for words not ending with a lowercase letter

I am creating a regex that matches words of at least 3 chars that do not end with a lowercase letter:
import re
re.findall(r'(\w{3,})(?![a-z])\b','I am tyinG a mixed charAv case VOW')
My output:
['tyinG', 'mixed', 'charAv', 'case', 'VOW']
My expected output is:
['tyinG', 'VOW']
I get the proper output when I run re.findall(r'(\w{3,})(?<![a-z])\b','I am tyinG a mixed charAv case VOW').
When I tried it on je.im, my first regex (the one without <) gave the correct result.
What is the relevance of < here?
The first pattern, (\w{3,})(?![a-z])\b, does not give you the expected result because it first matches 3+ word chars and then asserts, using a negative lookahead (?!, that what is directly to the right is not a lowercase char a-z.
That assertion is always true here: \w{3,} has already consumed every lowercase char of the word, so what follows is a non-word char or the end of the string.
The second pattern, (\w{3,})(?<![a-z])\b, does give you the right result because it first matches 3 or more word chars and then asserts, using a negative lookbehind (?<!, that what is directly to the left is not a lowercase char a-z.
If you want to use a lookaround, you can make the pattern a bit more efficient by making use of a word boundary at the beginning.
At the end of the pattern place the negative lookbehind after the word boundary to first anchor it and then do the assertion.
\b\w{3,}\b(?<![a-z])
Note that you can omit the capturing group if you want the single match only.
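The three patterns can be compared side by side on the sentence from the question. This is just a sketch to make the lookahead/lookbehind difference visible; the variable names are arbitrary.

```python
import re

text = 'I am tyinG a mixed charAv case VOW'

# Lookahead: checks the character AFTER the word, which is never a-z here.
with_lookahead = re.findall(r'(\w{3,})(?![a-z])\b', text)
# Lookbehind: checks the LAST character of the word itself.
with_lookbehind = re.findall(r'(\w{3,})(?<![a-z])\b', text)
# Anchored variant without a capture group.
anchored = re.findall(r'\b\w{3,}\b(?<![a-z])', text)

print(with_lookahead)   # ['tyinG', 'mixed', 'charAv', 'case', 'VOW']
print(with_lookbehind)  # ['tyinG', 'VOW']
print(anchored)         # ['tyinG', 'VOW']
```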

Regex (Python) - Match words with two or more distinct vowels

I'm attempting to match words in a string that contain two or more distinct vowels. The question can be restricted to lowercase.
string = 'pool pound polio papa pick pair'
Expected result:
pound, polio, pair
pool and papa would fail because they contain only one distinct vowel. However, polio is fine, because even though it contains two os, it contains two distinct vowels (i and o). (mississippi would fail, but albuquerque would pass.)
Thought process: Using a lookaround, perhaps five times (ignore uppercase), wrapped in parentheses, with a {2} afterward. Something like:
re.findall(r'\w*((?=a{1})|(?=e{1})|(?=i{1})|(?=o{1})|(?=u{1})){2}\w*', string)
However, this matches on all six words.
I killed the {1}s, which makes it prettier (the {1}s seem to be unnecessary), but it still returns all six:
re.findall(r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*', string)
Thanks in advance for any assistance. I checked other queries, including "How to find words with two vowels", but none seemed close enough. Also, I'm looking for pure RegEx.
You don't need 5 separate lookaheads, that's complete overkill. Just capture the first vowel in a capture group, and then use a negative lookahead to assert that it's different from the second vowel:
[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*
See the online demo.
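One Python-specific detail when trying this pattern: because it contains a capture group, re.findall returns only the captured vowel, not the word. A small sketch using re.finditer and group(0) to get the whole match instead:

```python
import re

string = 'pool pound polio papa pick pair'
pattern = re.compile(r'[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*')

# group(0) is the full match; findall would return only group 1 here.
words = [m.group(0) for m in pattern.finditer(string)]
print(words)  # ['pound', 'polio', 'pair']
```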
Your \w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w* regex matches all words that contain at least one vowel. \w* matches 0+ word chars, so the first pattern grabs the whole chunk of letters, digits and underscores. Then, backtracking begins, the regex engine tries to find a location that is followed with either a, e, i, o, or u. Once it finds that location, the previously grabbed word chars are again grabbed and consumed with the trailing \w*.
To match whole words with at least 2 different vowels, you may use
\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+
See the regex demo.
Details
\b - word boundary
(?=\w*([aeiou])\w*(?!\1)[aeiou]) - a positive lookahead that, immediately to the right of the current location, requires
\w* - 0+ word chars
([aeiou]) - Capturing group 1 (its value is referenced to with \1 backreference later in the pattern): any vowel
\w* - 0+ word chars
(?!\1)[aeiou] - any vowel from the [aeiou] set that is not equal to the vowel stored in Group 1 (due to the negative lookahead (?!\1) that fails the match if, immediately to the right of the current location, the lookahead pattern match is found)
\w+ - 1 or more word chars.
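A minimal Python sketch of this pattern follows. The helper name words_with_two_distinct_vowels is made up for illustration, and the extra sample words are the two mentioned in the question. Because the capture group sits inside the lookahead, re.finditer with group(0) is used to return the whole words.

```python
import re

TWO_VOWELS = re.compile(r'\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+')

def words_with_two_distinct_vowels(text):
    """Return whole words that contain at least two different vowels."""
    return [m.group(0) for m in TWO_VOWELS.finditer(text)]

result = words_with_two_distinct_vowels(
    'pool pound polio papa pick pair mississippi albuquerque'
)
print(result)  # ['pound', 'polio', 'pair', 'albuquerque']
```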
Match words in a string that contain at least two distinct vowels in the least amount of characters (to my knowledge): \w*([aeiou])\w*(?!\1)[aeiou]\w*
Demo: https://regex101.com/r/uRgVVa/1
Explanation:
\w*: matches 0 or more word characters. You don't need to start with a word boundary (\b) because \w does not include spaces, so using \b would be redundant.
([aeiou]): [aeiou] matches any one vowel. It is in parentheses so we can reference which vowel was matched later. Whatever is inside these first parentheses is group 1.
\w*: matches 0 or more word characters.
(?!\1): a negative lookahead that fails the match if the next character is the same as the one captured in group 1. \1 is a backreference to whatever group 1 matched, so if group 1 matched a, the vowel that follows cannot be a.
\w*: matches 0 or more word characters.
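As a sanity check, the regex can be compared against a plain set-based test on a few lowercase words. This is only a sketch; has_two_distinct_vowels is a hypothetical helper, not part of the answer.

```python
import re

pattern = re.compile(r'\w*([aeiou])\w*(?!\1)[aeiou]\w*')

def has_two_distinct_vowels(word):
    """Non-regex reference: count the different vowels in the word."""
    return len(set(word) & set('aeiou')) >= 2

words = 'pool pound polio papa pick pair mississippi albuquerque'.split()

regex_hits = [w for w in words if pattern.fullmatch(w)]
set_hits = [w for w in words if has_two_distinct_vowels(w)]

print(regex_hits)              # ['pound', 'polio', 'pair', 'albuquerque']
print(regex_hits == set_hits)  # True
```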

Regular expressions: replace comma in string, Python

Somehow puzzled by the way regular expressions work in python, I am looking to replace all commas inside strings that are preceded by a letter and followed either by a letter or a whitespace. For example:
2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15
2015,2135,602832/09,DOYLE V ICON, LLC,15,15
The first line has effectively 6 columns, while the second line has 7 columns. Thus I am trying to replace the comma between (N, L) in the second line by a whitespace (N L) as so:
2015,2135,602832/09,DOYLE V ICON LLC,15,15
This is what I have tried so far, without success however:
new_text = re.sub(r'([\w],[\s\w|\w])', "", text)
Any ideas where I am wrong?
Help would be much appreciated!
The pattern you use, ([\w],[\s\w|\w]), is consuming a word char (= an alphanumeric or an underscore, [\w]) before a ,, then matches the comma, and then matches (and again, consumes) 1 character - a whitespace, a word character, or a literal | (as inside the character class, the pipe character is considered a literal pipe symbol, not alternation operator).
So, the main problem is that \w matches both letters and digits.
You can actually leverage lookarounds:
(?<=[a-zA-Z]),(?=[a-zA-Z\s])
See the regex demo
The (?<=[a-zA-Z]) is a positive lookbehind that requires a letter to be right before the , and (?=[a-zA-Z\s]) is a positive lookahead that requires a letter or whitespace to be present right after the comma.
Here is a Python demo:
import re
p = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
test_str = "2015,1674,240/09,PEOPLE V. MICHAEL JORDAN,15,15\n2015,2135,602832/09,DOYLE V ICON, LLC,15,15"
result = p.sub("", test_str)
print(result)
If you still want to use \w, you can exclude digits and underscore from it using an opposite class \W inside a negated character class:
(?<=[^\W\d_]),(?=[^\W\d_]|\s)
See another regex demo
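A practical difference between the two lookaround patterns: in Python 3, \w on str patterns is Unicode-aware by default, so [^\W\d_] also covers letters outside a-z and A-Z. The sketch below uses a made-up line with an accented letter to show this.

```python
import re

ascii_letters = re.compile(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])')
any_letters = re.compile(r'(?<=[^\W\d_]),(?=[^\W\d_]|\s)')

# Hypothetical input: the comma to drop follows an accented capital E.
line = '2015,2135,602832/09,CAF\u00c9, LLC,15,15'

ascii_result = ascii_letters.sub('', line)
unicode_result = any_letters.sub('', line)

print(ascii_result == line)  # True: the accented letter is not in [a-zA-Z]
print(unicode_result)        # the comma after the accented letter is removed
```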
\w matches a-z, A-Z, 0-9 and the underscore, so your regex also matches commas between digits. You could try the following regex, and replace with \1\2.
([a-zA-Z]),(\s|[a-zA-Z])
Here is the DEMO.
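In Python this replacement could look like the sketch below. It also shows one difference from the lookaround approach: the capture groups consume the characters around the comma, so in a chain such as a,b,c the second comma is skipped, while lookarounds consume nothing and catch both.

```python
import re

line = '2015,2135,602832/09,DOYLE V ICON, LLC,15,15'

# Capture the characters around the comma and put them back without it.
with_groups = re.sub(r'([a-zA-Z]),(\s|[a-zA-Z])', r'\1\2', line)
print(with_groups)  # 2015,2135,602832/09,DOYLE V ICON LLC,15,15

# Adjacent commas: the letter b is consumed by the first match.
chained = 'a,b,c'
groups_chained = re.sub(r'([a-zA-Z]),(\s|[a-zA-Z])', r'\1\2', chained)
lookaround_chained = re.sub(r'(?<=[a-zA-Z]),(?=[a-zA-Z\s])', '', chained)
print(groups_chained)      # ab,c
print(lookaround_chained)  # abc
```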

How to match a word that doesn't start with X but ends with Y with regex

Example;
X=This
Y=That
not matching;
ThisWordShouldNotMatchThat
ThisWordShouldNotMatch
WordShouldNotMatch
matching;
AWordShouldMatchThat
I tried (?<!...) but seems not to be easy :)
^(?!This).*That$
As a free-spacing regex:
^ # Start of string
(?!This) # Assert that "This" can't be matched here
.* # Match the rest of the string
That # making sure we match "That"
$ # right at the end of the string
This will match a single word that fulfills your criteria, but only if this word is the only input to the regex. If you need to find words inside a string of many other words, then use
\b(?!This)\w*That\b
\b is the word boundary anchor, so it matches at the start and at the end of a word. \w means "alphanumeric character". If you also want to allow non-alphanumerics as part of your "word", then use \S instead - this will match anything that's not a space.
In Python, you could do words = re.findall(r"\b(?!This)\w*That\b", text).
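Both variants can be tried on the four example words from the question. This is a small sketch; the free-spacing regex is written with re.VERBOSE, and the variable names are arbitrary.

```python
import re

whole_string = re.compile(
    r"""
    ^          # start of string
    (?!This)   # the string must not start with This
    .*         # the rest of the string
    That       # ending in That
    $          # right at the end of the string
    """,
    re.VERBOSE,
)
in_text = re.compile(r'\b(?!This)\w*That\b')

candidates = [
    'ThisWordShouldNotMatchThat',
    'ThisWordShouldNotMatch',
    'WordShouldNotMatch',
    'AWordShouldMatchThat',
]

# Each word tested on its own, then all words inside one longer text.
accepted = [c for c in candidates if whole_string.search(c)]
found = in_text.findall(' '.join(candidates))

print(accepted)  # ['AWordShouldMatchThat']
print(found)     # ['AWordShouldMatchThat']
```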
