I'm trying to write a regex to match both regular numbers (1, 2, 42...) and roman ones (X, VII...).
But the one I've currently written:
\b((?=[MDCLXVI])M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\b|\b\d+\b
is matching more than expected.
It has 9 matches, while I expect only 4:
XII
VII
2
12
How can I fix it?
You don't really need any lookahead in your regex.
Your regex can be simplified and refactored into this:
/
\b
(?:
[MDCLXVI]M{0,3}C[MD]
|
D?C{0,3}X[CL]
|
L?X{0,3}I[XV]
|
[XV]I{0,3}
|
I{1,3}
|
\d+
)
\b
/gix
Updated RegEx Demo
Note that I have used the x (extended mode) flag so that the regex ignores whitespace, which lets you indent the alternations and makes the regex more readable. I don't know every permutation of Roman numerals, so please recheck each alternation.
The extra matches come from the possibility of a zero-width match with just the word boundary patterns (i.e. \b(?=[MDCLXVI])\b matches before any word starting with a Roman numeral letter).
You need to make the word boundaries more precise: make the leading one match only before a word char, and the trailing one match only after a word char:
(?<!\w)(?:(?=[MDCLXVI])M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})|\d+)(?!\w)
See the regex demo.
Here, (?<!\w) acts as a word boundary that fails the match if, immediately to the left of the current location, there is a word char, and (?!\w) acts as a word boundary that fails the match if, immediately to the right of the current location, there is a word char.
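To see the difference in practice, here is a minimal Python sketch. The sample sentence is made up for illustration (the asker's real input is not shown), so treat it as an assumption. It collects the whole matches of the original pattern, including the zero-width one, and compares them with what the lookaround version returns.

```python
import re

# Hypothetical sample text; the asker's real input is not shown in the question.
text = "Chapter XII, part VII, page 2 of 12"

original = r"\b((?=[MDCLXVI])M{0,3}(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3}))\b|\b\d+\b"
fixed = r"(?<!\w)(?:(?=[MDCLXVI])M{0,3}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})|\d+)(?!\w)"

# finditer + group(0) yields whole matches even though `original` has capture groups.
original_matches = [m.group(0) for m in re.finditer(original, text)]

# `fixed` only has non-capturing groups, so findall returns whole matches directly.
fixed_matches = re.findall(fixed, text)

print(original_matches)  # contains an empty string matched in front of "Chapter"
print(fixed_matches)
```

The empty string appears because "Chapter" starts with C, one of the letters the lookahead accepts, while every other part of the Roman branch is optional.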
I need to match 'words' (string of characters with no spaces) that might have the word near at the beginning and/or the end and have only digits in the middle.
Examples: near3 4near near2near
It should not match words like nearing3 4nearsighted near3ness nearsighted
I tried this: x = re.match(r"((\bnear)|(near\b))(\d)", txt)
It works for this word: near3 and this word: near4near but not for this word 2near
You can use an alternation (the pipe |) to match either an optional near followed by digits and near, or near followed by digits.
You can surround the alternation with a non-capturing group and add word boundaries \b on both sides of the pattern to prevent a partial word match.
If you want to match only a single digit, use \d instead of \d+.
\b(?:(?:near)?\d+near|near\d+)\b
Regex demo
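As a quick check, here is a small Python sketch that runs the pattern over the words from the question. The variable names are just for illustration.

```python
import re

pattern = re.compile(r"\b(?:(?:near)?\d+near|near\d+)\b")

# Words taken from the question.
should_match = ["near3", "4near", "near2near"]
should_not_match = ["nearing3", "4nearsighted", "near3ness", "nearsighted"]

text = " ".join(should_match + should_not_match)

# Only non-capturing groups are used, so findall returns the whole matches.
found = pattern.findall(text)
print(found)
```

Because the question used re.match, note that re.match only tries the start of the string; findall (or search) scans the whole text.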
I am trying to create a regular expression that finds sentences with a minimum length.
Really my conditions are:
there must at least be 5 words in a sequence
words in sequence must be distinct
sequence must be followed by some punctuation character.
So far I have tried
^(\b\w*\b\s?){5,}\s?[.?!]$
If my sample text is:
This is a sentence I would like to parse.
This is too short.
Single word
Not not not distinct distinct words words.
Another sentence that I would be interested in.
I would like to match on strings 1 and 5.
I am using the Python re library. I am using regex101 to test, and it appears the regex I have above does quite a bit of backtracking, so I imagine those knowledgeable in regex may be a bit appalled (my apologies).
You can use the following regex to identify the strings that meet all three conditions:
^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$
with the case-insensitive flag set.
Demo
Python's regex engine performs the following operations.
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ chars
\b(\w+)\b # match a word in cap grp 1
.+ # match 1+ chars
\b\1\b # match the contents of cap grp 1 with word boundaries
) # end negative lookahead
(?: # begin non-cap grp
.* # match 0+ chars
\b\w+\b # match a word
) # end non-cap grp
{5} # execute non-cap grp 5 times
.* # match 0+ chars
[.?!] # match a punctuation char
\s* # match 0+ whitespaces
$ # match end of line
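Here is a minimal sketch of how the pattern could be applied in Python to the sample text from the question. Testing each line separately is my own choice; it avoids having to think about how ^, $ and \s* interact with newlines in multiline mode.

```python
import re

pattern = re.compile(
    r"^(?!.*\b(\w+)\b.+\b\1\b)(?:.*\b\w+\b){5}.*[.?!]\s*$",
    re.IGNORECASE,  # the case-insensitive flag mentioned above
)

# Sample lines from the question.
lines = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]

# Test each line on its own; the capture group lives inside the negative
# lookahead, so only whether a match exists is of interest here.
matching = [line for line in lines if pattern.match(line)]
print(matching)
```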
Items 1. and 3. are easily done by regex, but
2. words in sequence must be distinct
I don't see how you could do it with a regex pattern. Remember that regex is a string-matching operation; it doesn't do heavy logic. This problem doesn't sound like a regex problem to me.
I recommend splitting the string on the character " " and checking word by word. Quicker, no sweat.
Edit
It can be done with a lookahead, as Cary said.
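For what it's worth, the word-by-word check suggested above could look roughly like this in Python. The function name, and the choice to strip punctuation and lowercase each word before comparing, are my own assumptions and not something stated in the question.

```python
import string

def has_distinct_words(line, min_words=5):
    """Hypothetical helper: True if the line ends with . ? or ! and has
    at least min_words words, all of them distinct (case-insensitive)."""
    line = line.strip()
    if not line or line[-1] not in ".?!":
        return False
    # Split on spaces, then drop surrounding punctuation from each word.
    words = [w.strip(string.punctuation).lower() for w in line.split(" ")]
    words = [w for w in words if w]
    return len(words) >= min_words and len(set(words)) == len(words)

# Sample lines from the question.
sample = [
    "This is a sentence I would like to parse.",
    "This is too short.",
    "Single word",
    "Not not not distinct distinct words words.",
    "Another sentence that I would be interested in.",
]
kept = [s for s in sample if has_distinct_words(s)]
print(kept)
```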
Following up on my previous question (How do i find multiple occurences of this specific string and split them into a list?), I have a further question because the requirements have changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out in https://regexr.com/, I got this result instead:
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why not just return the first matched occurrence?"
Consider that if the value in the first "bar section" is empty, the first match would be the value of the next bar section.
Example :
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
I don't want that. In that case it should return nothing (no match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
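If only the first value is needed, the same splitting idea can be narrowed down. This is a rough sketch: first_value is a hypothetical helper name, and the two input strings are shortened versions of the examples in the question. It returns an empty string when the first field is empty, rather than falling through to the next record.

```python
def first_value(text):
    """Hypothetical helper: last field of the first '|#|'-delimited record."""
    first_record = text.split('|#|')[0]
    return first_record.split('|')[-1]

# Shortened versions of the inputs shown in the question.
filled = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei'
empty = 'text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei'

print(repr(first_value(filled)))
print(repr(first_value(empty)))
```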
import re
doc = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
# raw string, so the backslashes reach the regex engine unchanged
re.findall(r'[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (it checks that the text follows, but does not include it in the result)
About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts what is on the left is p1_1_ followed by any char except a newline using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic because the p1_1_ assertion is true, as it precedes all the |#| parts.
Note that using the quantifier in the lookbehind will require the PyPI regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
Python demo
You could also keep your original pattern and use re.search to get only the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern, and you can omit .* from the positive lookahead.
Python demo
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, a match, and a capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#| match any char, or match | 2 times
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string
Regex demo
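A minimal sketch of reading the value from capture group 1 with the standard re module. The input is a shortened, single-line version of the data from the question. This only exercises the case where the first field has a value; the empty-field input should be checked separately before relying on it.

```python
import re

pattern = re.compile(r"^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)")

# Shortened, single-line version of the data in the question.
doc = (
    "text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|"
    "text|p1_2_1120170AS074192161A0Z20|Huawei|#|"
    "text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW"
)

match = pattern.search(doc)
# The wanted value is in group 1, not in the whole match.
value = match.group(1) if match else None
print(value)
```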
I am creating a regex that matches words having at least 3 chars and not ending with a lowercase letter.
import re
re.findall(r'(\w{3,})(?![a-z])\b','I am tyinG a mixed charAv case VOW')
My output:
['tyinG', 'mixed', 'charAv', 'case', 'VOW']
My expected output is:
['tyinG', 'VOW']
I get the expected output when I run re.findall(r'(\w{3,})(?<![a-z])\b','I am tyinG a mixed charAv case VOW').
When I tried it on je.im, my first regex, which does not have the <, gave the correct result.
What is the relevance of < here?
The first pattern (\w{3,})(?![a-z])\b does not give you the expected result because the pattern is first matching 3+ word chars and then asserts using a negative lookahead (?! that what is directly on the right is not a lowercase char a-z.
That assertion will always be true: \w{3,} is greedy and has already consumed all the lowercase a-z chars of the word, so the char directly to the right is never a-z.
The second pattern (\w{3,})(?<![a-z])\b does give you the right result as it first tries to match 3 or more word chars and after that asserts using a negative lookbehind (?<! what is directly to the left is not a lowercase char a-z.
If you want to use a lookaround, you can make the pattern a bit more efficient by making use of a word boundary at the beginning.
At the end of the pattern place the negative lookbehind after the word boundary to first anchor it and then do the assertion.
\b\w{3,}\b(?<![a-z])
Note that you can omit the capturing group if you want the single match only.
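To compare the three variants side by side, here is a short Python sketch using the string from the question. The variable names are only for illustration.

```python
import re

text = 'I am tyinG a mixed charAv case VOW'

# Negative lookahead: checks the char to the RIGHT of the matched word chars.
with_lookahead = re.findall(r'(\w{3,})(?![a-z])\b', text)

# Negative lookbehind: checks the char to the LEFT, i.e. the last matched char.
with_lookbehind = re.findall(r'(\w{3,})(?<![a-z])\b', text)

# Anchored variant without a capturing group, as suggested above.
anchored = re.findall(r'\b\w{3,}\b(?<![a-z])', text)

print(with_lookahead)
print(with_lookbehind)
print(anchored)
```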
I'm attempting to match words in a string that contain two or more distinct vowels. The question can be restricted to lowercase.
string = 'pool pound polio papa pick pair'
Expected result:
pound, polio, pair
pool and papa would fail because they contain only one distinct vowel. However, polio is fine, because even though it contains two os, it contains two distinct vowels (i and o). (mississippi would fail, but albuquerque would pass.)
Thought process: Using a lookaround, perhaps five times (ignore uppercase), wrapped in a parenthesis, with a {2} afterward. Something like:
re.findall(r'\w*((?=a{1})|(?=e{1})|(?=i{1})|(?=o{1})|(?=u{1})){2}\w*', string)
However, this matches on all six words.
I killed the {1}s, which makes it prettier (the {1}s seem to be unnecessary), but it still returns all six:
re.findall(r'\w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w*', string)
Thanks in advance for any assistance. I checked other queries, including "How to find words with two vowels", but none seemed close enough. Also, I'm looking for pure RegEx.
You don't need 5 separate lookaheads, that's complete overkill. Just capture the first vowel in a capture group, and then use a negative lookahead to assert that it's different from the second vowel:
[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*
See the online demo.
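One practical note when trying this in Python: the pattern contains a capture group, so re.findall would return only the captured vowel. A small sketch using re.finditer and group(0) to get the whole words instead:

```python
import re

string = 'pool pound polio papa pick pair'
pattern = re.compile(r'[a-z]*([aeiou])[a-z]*(?!\1)[aeiou][a-z]*')

# group(0) is the whole match; findall would return only group 1 (a vowel).
words = [m.group(0) for m in pattern.finditer(string)]
print(words)
```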
Your \w*((?=a)|(?=e)|(?=i)|(?=o)|(?=u))\w* regex matches all words that have at least one vowel. \w* matches 0+ word chars, so the first pattern grabs the whole chunk of letters, digits and underscores. Then, backtracking begins, the regex engine tries to find a location that is followed with either a, e, i, o, or u. Once it finds that location, the previously grabbed word chars are again grabbed and consumed with the trailing \w*.
To match whole words with at least 2 different vowels, you may use
\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+
See the regex demo.
Details
\b - word boundary
(?=\w*([aeiou])\w*(?!\1)[aeiou]) - a positive lookahead that, immediately to the right of the current location, requires
\w* - 0+ word chars
([aeiou]) - Capturing group 1 (its value is referenced to with \1 backreference later in the pattern): any vowel
\w* - 0+ word chars
(?!\1)[aeiou] - any vowel from the [aeiou] set that is not equal to the vowel stored in Group 1 (due to the negative lookahead (?!\1) that fails the match if, immediately to the right of the current location, the lookahead pattern match is found)
\w+ - 1 or more word chars.
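A short Python sketch of the pattern described above, run on the words from the question plus mississippi and albuquerque. As the pattern has a capturing group, whole matches are read with finditer and group(0) rather than findall.

```python
import re

string = 'pool pound polio papa pick pair mississippi albuquerque'
pattern = re.compile(r'\b(?=\w*([aeiou])\w*(?!\1)[aeiou])\w+')

# The lookahead only tests the word; \w+ then consumes it as the match.
words = [m.group(0) for m in pattern.finditer(string)]
print(words)
```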
Match words in a string that contain at least two distinct vowels in the least amount of characters (to my knowledge): \w*([aeiou])\w*(?!\1)[aeiou]\w*
Demo: https://regex101.com/r/uRgVVa/1
Explanation:
\w*: matches 0 or more word characters. You don't need to start with a word boundary (\b) because \w does not include spaces, so using \b would be redundant.
([aeiou]): [aeiou] matches any one vowel. It is in parenthesis so we can reference what vowel was matched later. Whatever is inside these first parenthesis is group 1.
\w*: matches 0 or more word characters.
(?!\1): says the following regex cannot be the same as the character selected in group 1. For example, if the vowel matched in group 1 was a, the following regex cannot be a. This is called by \1, which references what character was chosen in group 1 (e.g. if a matched group 1, \1 references a). ?! is a negative lookahead that says the following regex outside the parenthesis cannot match what follows ?!.
\w*: matches 0 or more word characters.