Regex for words not ending with a lowercase letter - Python

I am creating a regex that matches words with at least 3 characters that do not end with a lowercase letter.
import re
re.findall(r'(\w{3,})(?![a-z])\b','I am tyinG a mixed charAv case VOW')
My output:
['tyinG', 'mixed', 'charAv', 'case', 'VOW']
My expected output is:
['tyinG', 'VOW']
I get the expected output when I use re.findall(r'(\w{3,})(?<![a-z])\b','I am tyinG a mixed charAv case VOW')
When I tested on je.im, my first regex, which does not have the <, appeared to give the correct result.
What is the relevance of < here?

The first pattern (\w{3,})(?![a-z])\b does not give you the expected result because it first matches 3 or more word chars and then asserts, using a negative lookahead (?!, that what is directly to the right is not a lowercase char a-z.
That assertion is always true here: \w{3,} is greedy and has already matched the whole word, including any lowercase a-z chars, so the next character is never a lowercase letter.
The second pattern (\w{3,})(?<![a-z])\b does give you the right result, as it first tries to match 3 or more word chars and then asserts, using a negative lookbehind (?<!, that what is directly to the left is not a lowercase char a-z.
If you want to use a lookaround, you can make the pattern a bit more efficient by making use of a word boundary at the beginning.
At the end of the pattern place the negative lookbehind after the word boundary to first anchor it and then do the assertion.
\b\w{3,}\b(?<![a-z])
Note that you can omit the capturing group if you want the single match only.
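As a quick check, here is a minimal sketch that runs the three patterns discussed above against the sample sentence from the question. The variable names are only illustrative.

```python
import re

text = 'I am tyinG a mixed charAv case VOW'

# Negative lookahead: \w{3,} has already consumed the whole word,
# so the character to the right is never a lowercase letter.
lookahead = re.findall(r'(\w{3,})(?![a-z])\b', text)

# Negative lookbehind: inspects the last character that was matched.
lookbehind = re.findall(r'(\w{3,})(?<![a-z])\b', text)

# Variant anchored with a leading word boundary and no capturing group.
anchored = re.findall(r'\b\w{3,}\b(?<![a-z])', text)

print(lookahead)   # every word of 3+ chars
print(lookbehind)  # only words whose last char is not lowercase
print(anchored)    # same result as the lookbehind version
```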

Related

python regex include exceptions in the regex expression

I am looking for expressions such as Vc or Am in texts, and for that I have
rex = r"(\(?)(?<!([A-Za-z0-9]))[A-Z][a-z](?!([A-Za-z0-9]))(\)?)"
Explanation:
[A-Z][a-z] = capital followed by a lowercase letter
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9]))(\)?) -> lookahead not being a letter or number
# all that optionally within parentheses
import re
text="this is Vc and not Cr nor Pb"
matches = re.finditer(rex,text)
What I want to achieve is to exclude a list of terms like Cr or Pb.
How should I include exceptions in the expression?
Thanks
First, let's shorten your RegEx:
(?<!([A-Za-z0-9])) -> lookbehind not being a letter or number
(?!([A-Za-z0-9]))(\)?) -> look ahead not being letter or number
These are so common that there is a RegEx feature for them: word boundaries, \b. Like lookarounds they have zero width, and they match only between a word character (a letter, digit, or underscore) and a non-word character or the start/end of the string.
Your RegEx then becomes \b[A-Z][a-z]\b; looking at this RegEx (and your examples), it appears you want to match certain element abbreviations?
Now you can simply use a lookbehind:
\b[A-Z][a-z](?<!Cr|Pb)\b
to assert that the element is neither chromium (Cr) nor lead (Pb).
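A minimal sketch of the lookbehind version in Python, using the sample text from the question. Python's re module accepts this lookbehind because both alternatives (Cr and Pb) have the same fixed width.

```python
import re

text = "this is Vc and not Cr nor Pb"

# Two-letter abbreviation that is not Cr or Pb.
pattern = re.compile(r"\b[A-Z][a-z](?<!Cr|Pb)\b")
matches = pattern.findall(text)

print(matches)
```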
Just for fun:
Alternatively, if you want a less readable (but more portable) RegEx that makes do with fewer advanced RegEx features (not every engine supports lookaround), you can use character sets as per the following observations:
If the first letter is not a C or P, the second letter may be any lowercase letter.
If the first letter is a C, the second letter may not be an r.
If the first letter is a P, the second letter may not be a b.
Using character sets, this gives us:
[ABD-OQ-Z][a-z]
C[a-qs-z]
P[ac-z]
Operator precedence works as expected here: concatenation (implicit) has higher precedence than alternation (|). This makes the RegEx [ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z]. Wrapping this in word boundaries using a group gives us \b([ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b.
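A small sketch of the character-set variant. The sample text is made up for illustration, and the group is written as non-capturing (?:...) so that re.findall returns the whole match; that is a choice made here, not part of the pattern above.

```python
import re

# Hypothetical sample containing both excluded and allowed abbreviations.
text = "Vc Cr Ca Pb Pt Am"

pattern = re.compile(r"\b(?:[ABD-OQ-Z][a-z]|C[a-qs-z]|P[ac-z])\b")
matches = pattern.findall(text)

print(matches)  # Cr and Pb are skipped, everything else is kept
```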
You might write the pattern without using the superfluous capture groups, and exclude matching Cr or Pb:
\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?
See a regex demo for the matches.
If you are not interested in matching the parentheses, and you also do not want to allow an underscore next to the match (in addition to letters or numbers), you can use a word boundary instead:
\b(?!Cr\b|Pb\b)[A-Z][a-z]\b
Explanation
\b A word boundary to prevent a partial word match
(?! Negative lookahead
Cr\b|Pb\b Match either Cr or Pb
) Close the lookahead
[A-Z][a-z] Match a single uppercase and single lowercase char
\b A word boundary
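To see both lookahead patterns side by side, here is a minimal sketch. The sample text extends the one from the question with a parenthesised abbreviation; it is an assumption made for illustration.

```python
import re

# Hypothetical sample: the question's text plus a parenthesised term.
text = "this is Vc and (Am) but not Cr nor Pb"

# Version that also consumes optional surrounding parentheses.
with_parens = re.compile(
    r"\(?(?<![A-Za-z0-9])(?!Cr\b|Pb\b)[A-Z][a-z](?![A-Za-z0-9])\)?"
)

# Shorter version based on word boundaries.
word_only = re.compile(r"\b(?!Cr\b|Pb\b)[A-Z][a-z]\b")

paren_matches = with_parens.findall(text)
word_matches = word_only.findall(text)

print(paren_matches)
print(word_matches)
```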

Why does it not give all positive numbers in the string? Regex in Python

I don't understand why it only gives 125, the first number. Why doesn't it give all positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
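A minimal sketch of this technique in Python. With a single capture group, re.findall returns the content of group 1 for every match, which is an empty string whenever the -\d+ branch matched, so the empty strings are filtered out afterwards.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Negative numbers match the first branch and leave group 1 empty.
raw = re.findall(r"-\d+|(\d+)", text)

# Keep only the matches that were saved to capture group 1.
positives = [m for m in raw if m]

print(raw)
print(positives)
```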
The following pattern will only match non-negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness, here is another idea that uses \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match an optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
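The following sketch compares the anchored pattern from the question with the two lookbehind patterns above. The second sample string, which contains an explicit plus sign, is an assumption added for illustration.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Anchored at the start of the string: at most one match.
anchored = re.findall(r"^[+]?\d+", text)

# Whitespace boundary to the left, optional plus sign.
ws_boundary = re.findall(r"(?<!\S)\+?\d+\b", text)

# Word boundary plus a lookbehind that rejects a leading minus.
no_minus = re.findall(r"\b(?<!-)\d+", text)

# Hypothetical input with an explicit plus sign.
signed = re.findall(r"(?<!\S)\+?\d+\b", "+7 -3 12")

print(anchored)
print(ws_boundary)
print(no_minus)
print(signed)
```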
Use , to separate the numbers in the string.

Regex to find sentences of a minimum length

I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the Python re library. I am using regex101 to test, and it appears the regex above does quite a bit of backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-insensitive flag set.
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
\b(\w+)\b # match a word and save it in capture group 1
.+ # match 1+ chars
\b\1\b # match the contents of capture group 1 as a whole word
) # end negative lookahead
(?: # begin non-capture group
.* # match 0+ chars
\b\w+\b # match a word
) # end non-capture group
{5} # execute the non-capture group 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespace chars
$ # match end of line
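A minimal sketch applying this regex in Python to the five sample strings from the question. Each string is tested on its own with re.IGNORECASE, so no multiline flag is needed; the variable names are illustrative.

```python
import re

pattern = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,
)

lines = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

# Keep the lines that satisfy all three conditions.
matching = [line for line in lines if pattern.match(line)]

for line in matching:
    print(line)
```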
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the character " " and checking word by word. Quicker, no sweat.
Edit
It can be done with a lookahead, as Cary said.
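For comparison, the split-and-check approach could look roughly like the sketch below. The function name, the min_words parameter, and the set of punctuation stripped from each word are assumptions made for illustration; words are compared case-insensitively.

```python
def is_valid_sentence(line, min_words=5):
    """Return True if the line ends with . ? or ! and has enough distinct words."""
    stripped = line.strip()
    if not stripped or stripped[-1] not in ".?!":
        return False
    # Split on spaces, drop surrounding punctuation, and normalise case.
    words = [w.strip(".,;:?!").lower() for w in stripped.split(" ")]
    words = [w for w in words if w]
    # Enough words, and no word repeated.
    return len(words) >= min_words and len(set(words)) == len(words)


lines = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

valid = [line for line in lines if is_valid_sentence(line)]
print(valid)
```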

Regex (Python) - Match words with two or more distinct vowels

I'm attempting to match words in a string that contain two or more distinct vowels. The question can be restricted to lowercase.
string = 'pool pound polio papa pick pair'
Expected result:
pound, polio, pair
pool and papa would fail because they contain only one distinct vowel. However, polio is fine, because even though it contains two os, it contains two distinct vowels (i and o). (mississippi would fail, but albuquerque would pass.)
Thought process: Using a lookaround, perhaps five times (ignore uppercase), wrapped in a parenthesis, with a {2} afterward. Something like:
re.findall(r'\w*((?=a{1})|(?=e{1})|(?=i{1})|(?=o{1})|(?=u{1})){2}\w*', string)
However, this matches on all six words.
I killed the {1}s, which makes it prettier (the {1}s seem to be unnecessary), but it still returns all six:
re.findall(r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*', string)
Thanks in advance for any assistance. I checked other queries, including "How to find words with two vowels", but none seemed close enough. Also, I'm looking for pure RegEx.
You don't need 5 separate lookaheads, that's complete overkill. Just capture the first vowel in a capture group, and then use a negative lookahead to assert that it's different from the second vowel:
[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*
See the online demo.
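A minimal sketch of this pattern in Python. Because the pattern contains a capture group, re.findall would return only the captured vowel, so the sketch uses finditer and group(0) to get the whole word.

```python
import re

string = 'pool pound polio papa pick pair'

pattern = re.compile(r'[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*')

# group(0) is the full match; group(1) would be just the first vowel.
words = [m.group(0) for m in pattern.finditer(string)]

print(words)
```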
Your \w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w* regex matches every word that contains at least one vowel. \w* matches 0+ word chars, so the first \w* grabs the whole chunk of letters, digits and underscores. Then backtracking begins: the regex engine tries to find a location that is followed by a, e, i, o, or u. Once it finds that location, the word chars that were given back are matched again and consumed by the trailing \w*.
To match whole words with at least 2 different vowels, you may use
\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+
See the regex demo.
Details
\b - word boundary
(?=\w*([aeiou])\w*(?!\1)[aeiou]) - a positive lookahead that, immediately to the right of the current location, requires
\w* - 0+ word chars
([aeiou]) - Capturing group 1 (its value is referenced with the \1 backreference later in the pattern): any vowel
\w* - 0+ word chars
(?!\1)[aeiou] - any vowel from the [aeiou] set that is not equal to the vowel stored in Group 1 (due to the negative lookahead (?!\1) that fails the match if, immediately to the right of the current location, the lookahead pattern match is found)
\w+ - 1 or more word chars.
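A short sketch of the lookahead-based pattern in Python. The sample string adds the two extra words mentioned in the question (mississippi and albuquerque), and finditer with group(0) is used so that the whole word is returned rather than the captured vowel.

```python
import re

string = 'pool pound polio papa pick pair mississippi albuquerque'

pattern = re.compile(r'\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+')

# The lookahead checks the word; \w+ then consumes it.
words = [m.group(0) for m in pattern.finditer(string)]

print(words)
```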
Match words in a string that contain at least two distinct vowels, using the fewest characters (to my knowledge): \w*([aeiou])\w*(?!\1)[aeiou]\w*
Demo: https://regex101.com/r/uRgVVa/1
Explanation:
\w*: matches 0 or more word characters. You don't need to start with a word boundary (\b) because \w does not include spaces, so using \b would be redundant.
([aeiou]): [aeiou] matches any one vowel. It is in parentheses so we can later reference which vowel was matched. Whatever is inside this first pair of parentheses is group 1.
\w*: matches 0 or more word characters.
(?!\1)[aeiou]: (?! is a negative lookahead; it asserts that what follows at the current position does not match the pattern inside the parentheses. Here that pattern is \1, which references the character matched by group 1 (e.g. if group 1 matched a, \1 references a). The [aeiou] after the lookahead then matches a second vowel, which therefore cannot be the same as the first.
\w*: matches 0 or more word characters.

How to match a word that doesn't start with X but ends with Y with regex

Example:
X=This
Y=That
Not matching:
ThisWordShouldNotMatchThat
ThisWordShouldNotMatch
WordShouldNotMatch
Matching:
AWordShouldMatchThat
I tried (?<!...) but it does not seem to be easy :)
^(?!This).*That$
As a free-spacing regex:
^ # Start of string
(?!This) # Assert that "This" can't be matched here
.* # Match the rest of the string
That # making sure we match "That"
$ # right at the end of the string
This will match a single word that fulfills your criteria, but only if this word is the only input to the regex. If you need to find words inside a string of many other words, then use
\b(?!This)\w*That\b
\b is the word boundary anchor, so it matches at the start and at the end of a word. \w means "alphanumeric character" (letters, digits, and the underscore). If you also want to allow non-alphanumerics as part of your "word", then use \S instead - this will match anything that's not whitespace.
In Python, you could do words = re.findall(r"\b(?!This)\w*That\b", text).
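A minimal sketch that checks both patterns against the four example words from the question. The list and variable names are illustrative only.

```python
import re

candidates = [
    "ThisWordShouldNotMatchThat",
    "ThisWordShouldNotMatch",
    "WordShouldNotMatch",
    "AWordShouldMatchThat",
]

# Whole-string version: each candidate is the only input.
whole = re.compile(r"^(?!This).*That$")
whole_matches = [w for w in candidates if whole.match(w)]

# Word-level version: search inside a longer text.
text = " ".join(candidates)
words = re.findall(r"\b(?!This)\w*That\b", text)

print(whole_matches)
print(words)
```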
