Regex Matching - A letter not preceded by another letter - python

What regex would match any string containing daily, but not when daily is immediately preceded by m?
For example, it should match the following strings:
beta.daily
abcdaily
dailyabc
daily
But it must not match
mdaily or
abcmdaily or
mdailyabc
I have tried the following and other regexes, but each one failed:
r'[^m]daily': it doesn't match daily on its own
r'[^m]?daily': it matches strings containing mdaily, which is not intended

Just add a negative lookbehind, (?<!m), before daily:
(?<!m)daily
The zero width negative lookbehind, (?<!m), makes sure daily is not preceded by m.
Demo
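As a quick check, the lookbehind can be run against the strings from the question. This is a minimal sketch using Python's built-in re module; the list names are only illustrative.

```python
import re

# Negative lookbehind: "daily" must not be immediately preceded by "m".
pattern = re.compile(r"(?<!m)daily")

should_match = ["beta.daily", "abcdaily", "dailyabc", "daily"]
should_not_match = ["mdaily", "abcmdaily", "mdailyabc"]

for s in should_match + should_not_match:
    # search() scans the whole string; match() would only test position 0.
    print(f"{s!r}: {bool(pattern.search(s))}")
```

The first four strings print True and the last three print False.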

Related

Why it does not give all positive numbers in the string? Regex in Python

I don't understand why it only gives 125, the first number only, why it does not give all positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
Demo
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
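A minimal sketch of the technique in Python, using the text from the question. It uses finditer rather than findall because findall would return an empty string for every match where group 1 did not participate.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Left branch consumes negative numbers without capturing them;
# right branch captures the remaining digit runs in group 1.
pattern = re.compile(r"-\d+|(\d+)")

# Keep only the matches where group 1 actually participated.
non_negative = [m.group(1) for m in pattern.finditer(text)
                if m.group(1) is not None]
print(non_negative)  # ['125', '8969', '4788', '158', '599']
```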
The following pattern will only match non negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
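A short sketch of why the \b matters in this pattern. Without it, the lookbehind alone still lets a match start in the middle of a negative number.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# \b pins the match to the start of a digit run; (?<!-) rejects runs
# that directly follow a minus sign.
with_boundary = re.findall(r"\b(?<!-)\d+", text)
print(with_boundary)  # ['125', '8969', '4788', '158', '599']

# Without \b the match can start at the second digit of "-898".
without_boundary = re.findall(r"(?<!-)\d+", "-898")
print(without_boundary)  # ['98']
```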
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
Regex demo
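To compare the two patterns side by side, here is a small sketch. The sample text is shortened, and the "+42" is an addition of this sketch (not in the question) to exercise the optional plus sign.

```python
import re

text = "125 -898 +42 8969"  # "+42" is added here for illustration

anchored = re.compile(r"^[+]?\d+")         # original pattern from the question
spaced = re.compile(r"(?<!\S)\+?\d+\b")    # whitespace boundary to the left

anchored_result = anchored.findall(text)
spaced_result = spaced.findall(text)

print(anchored_result)  # ['125']
print(spaced_result)    # ['125', '+42', '8969']
```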
Use , to separate the numbers in the string.

Regex - How do i find this specific slice of string inside a bigger whole string

following my previous question (How do i find multiple occurences of this specific string and split them into a list?), I'm now going to ask something more since the rule has been changed.
Here's the string, and the bold words are the ones that I want to extract.
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
Here's my current regex:
(?<=p1\_1\_.*)[^|]+(?=\|\#\|.*|$)
After trying it out in https://regexr.com/, I found the result instead :
text|p1_1_1120170AS074192161A0Z20|C M E -
Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier
Module 3KW|#|text|p1_4_1120170AS074192161A0Z20|Shuangdeng
6-FMX-170|#|text|p1_5_1120170AS074192161A0Z20|24021665|#|text|p1_6_1120170AS074192161A0Z20|1120170AS074192161A0Z20|#|text|p1_7_1120170AS074192161A0Z20|OK|#|text|p1_8_1120170AS074192161A0Z20||#|text|p1_9_1120170AS074192161A0Z20|ACTIVE|#|text|p1_10_1120170AS074192161A0Z20|-OK|#|text|site_id|20MJK110|#|text|barcode_flag|auto|#|text|movement_flag||#|text|unit_of_measurement||#|text|flag_waste|no|#|text|req_qty_db|2|#|text|req_qty|2
The question remains: "Why not just return the first matched occurrence?".
Consider the case where the value in the first "bar section" is empty: it would then return the value of the next bar section.
Example :
text|p1_1_1120170AS074192161A0Z20||#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text . . .
And I don't want that. It should return nothing instead (no match).
What's the correct regex to acquire such a match?
Thank you :).
This data looks more structured than you are giving it credit for. A regular expression is great for e.g. extracting email addresses from unstructured text, but this data seems delimited in a straightforward manner.
If there is structure it will be simpler, faster, and more reliable to just split on | and perhaps #:
text = 'text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|p1_3_1120170AS074192161A0Z20|Rectifier Module 3KW|#|text|p1_4_11201...'
lines = text.split('|#|')
words = [line.split('|')[-1] for line in lines]
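Building on the split idea, the records could also be collected into a dictionary, so the value for one key can be looked up directly and an empty value stays empty instead of falling through to the next record. This is only a sketch: parse_records is a hypothetical helper name, and it assumes every record has the shape type|key|value.

```python
def parse_records(text):
    """Map each key (e.g. 'p1_1_...') to its value; empty values stay ''."""
    records = {}
    for record in text.split("|#|"):
        parts = record.split("|", 2)  # assumed shape: type | key | value
        if len(parts) == 3:
            _type, key, value = parts
            records[key] = value
    return records


key = "p1_1_1120170AS074192161A0Z20"
sample = ("text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|"
          "text|p1_2_1120170AS074192161A0Z20|Huawei")
empty_first = ("text|p1_1_1120170AS074192161A0Z20||#|"
               "text|p1_2_1120170AS074192161A0Z20|Huawei")

print(parse_records(sample)[key])             # C M E - Rectifier
print(repr(parse_records(empty_first)[key]))  # ''
```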
doc='text|p1_1_1120170AS074192161A0Z20|C M E - Rectifier|#|text|p1_2_1120170AS074192161A0Z20|Huawei|#|text|...'
re.findall(r'[^|]+(?=\|\#\|)', doc)
In the re expression:
[^|]+ finds chunks of text not containing the separator
(?=...) is a "lookahead assertion" (match the text but do not include in result)
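A small runnable sketch of this findall approach. The keys are shortened to a hypothetical p1_1_X form for readability. Two things are worth noticing: the last value is not returned because no |#| follows it, and an empty value is simply skipped rather than reported.

```python
import re

# Keys shortened to "p1_N_X" for illustration only.
doc = ("text|p1_1_X|C M E - Rectifier|#|text|p1_2_X|Huawei|#|"
       "text|p1_3_X|Rectifier Module 3KW")
values = re.findall(r"[^|]+(?=\|#\|)", doc)
print(values)  # ['C M E - Rectifier', 'Huawei']

# Same data with an empty first value.
doc_empty = "text|p1_1_X||#|text|p1_2_X|Huawei|#|text|p1_3_X|Last"
values_empty = re.findall(r"[^|]+(?=\|#\|)", doc_empty)
print(values_empty)  # ['Huawei']
```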
About the pattern you tried
This part of the pattern [^|]+ states to match any char other than |
Then (?=\|\#\|.*|$) asserts using a positive lookahead what is on the right is |#|.* or the end of the string.
The positive lookbehind (?<=p1\_1\_.*) asserts what is on the left is p1_1_ followed by any char except a newline using a quantifier in the lookbehind.
As the pattern is not anchored, you will get all the matches for this logic because the p1_1_ assertion is true as it precedes all the |#| parts.
Note that using the quantifier in the lookbehind will require the pypi regex module.
If you want the first match using a quantifier in the positive lookbehind you could for example use an anchor in combination with a negative lookahead to not cross the |#| or match || in case it is empty:
(?<=^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|)[^|]+(?=\|\#\||$)
Python demo
You could use your original pattern using re.search getting the first match.
(?<=p1_1_.*)[^|]+(?=\|\#\||$)
Note that you don't have to escape the underscore in your original pattern and you can omit .* from the positive lookahead
Python demo
But to get the first match you don't have to use a positive lookbehind. You could also use an anchor, match and capturing group.
^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)
^ Start of string
.*? Match any char except a newline, as few times as possible
p1_1_ Match literally
(?: Non capturing group
(?!\|#\|).|\|{2} If what is on the right is not |#|, match any char; or match || (two consecutive bars)
)* Close non capturing group and repeat 0+ times
\| Match |
( Capture group 1 (this will contain your value)
[^|]+ Match 1+ times any char except |
) Close group
(?:\|#\||$) Match either |#| or assert the end of the string
Regex demo
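The capture-group pattern has no quantifier inside a lookbehind, so it can be tried with the standard re module. The sketch below uses shortened, hypothetical keys (p1_1_X). It also includes strict_pattern, which is an assumption of this sketch and not part of the answer: since the \|{2} alternative lets the match continue past an empty || field, the stricter variant requires the value to start right after the bar that ends the p1_1_ key.

```python
import re

# Pattern from the answer, compiled with the standard re module.
answer_pattern = re.compile(
    r"^.*?p1_1_(?:(?!\|#\|).|\|{2})*\|([^|]+)(?:\|#\||$)")

# Hypothetical stricter variant (assumption): key, bar, then a non-empty value.
strict_pattern = re.compile(r"p1_1_[^|]*\|([^|]+)(?=\|#\||$)")

filled = "text|p1_1_X|C M E - Rectifier|#|text|p1_2_X|Huawei"
empty = "text|p1_1_X||#|text|p1_2_X|Huawei"

print(answer_pattern.search(filled).group(1))  # C M E - Rectifier
print(strict_pattern.search(filled).group(1))  # C M E - Rectifier
print(strict_pattern.search(empty))            # None
```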

Regex that not ending with smaller case

I am creating a regex that matches words of at least 3 chars that do not end with a lowercase letter.
import re
re.findall(r'(\w{3,})(?![a-z])\b','I am tyinG a mixed charAv case VOW')
My output:
['tyinG', 'mixed', 'charAv', 'case', 'VOW']
My expected output is:
['tyinG', 'VOW']
I get the expected output when I use re.findall(r'(\w{3,})(?<![a-z])\b','I am tyinG a mixed charAv case VOW')
When I tried it on je.im, my first regex, the one without the <, gave the correct result.
What is the relevance of < here?
The first pattern (\w{3,})(?![a-z])\b does not give you the expected result because the pattern is first matching 3+ word chars and then asserts using a negative lookahead (?! that what is directly on the right is not a lowercase char a-z.
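That assertion will always be true here: the word chars, including any lowercase a-z, are already matched by \w, and \b then forces the match to end at the end of the word, so the next char is never a lowercase letter.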
The second pattern (\w{3,})(?<![a-z])\b does give you the right result as it first tries to match 3 or more word chars and after that asserts using a negative lookbehind (?<! what is directly to the left is not a lowercase char a-z.
If you want to use a lookaround, you can make the pattern a bit more efficient by making use of a word boundary at the beginning.
At the end of the pattern place the negative lookbehind after the word boundary to first anchor it and then do the assertion.
\b\w{3,}\b(?<![a-z])
Note that you can omit the capturing group if you want the single match only.
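The three patterns can be compared directly on the sentence from the question. A minimal sketch, with variable names chosen only for illustration:

```python
import re

text = "I am tyinG a mixed charAv case VOW"

# Lookahead checks the char AFTER the word, which is never a lowercase letter.
lookahead = re.findall(r"(\w{3,})(?![a-z])\b", text)
# Lookbehind checks the LAST char of the word itself.
lookbehind = re.findall(r"(\w{3,})(?<![a-z])\b", text)
# Variant with a leading word boundary and no capturing group.
anchored = re.findall(r"\b\w{3,}\b(?<![a-z])", text)

print(lookahead)   # ['tyinG', 'mixed', 'charAv', 'case', 'VOW']
print(lookbehind)  # ['tyinG', 'VOW']
print(anchored)    # ['tyinG', 'VOW']
```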

Not able to get desired result from lookbehind in python regex

I am trying to find a pattern which allows me to find a year of four digits. But I do not want to get results in which year is preceded by month e.g "This is Jan 2009" should not give any result, but "This is 2009" should return 2009. I use findall with lookbehind at Jan|Feb but I get 'an 2009' instead of blank. What am I missing? How to do It?
Any otherwise matching string preceded by a string matching the negative lookbehind is not matched.
In your current regex, [a-z]* \d{4} matches "an 2009".
The negative lookbehind '(?<!Jan|Feb)' does not match the "This is J" part, so it is not triggered.
If you remove '[a-z]*' from the regex, then no match will be returned on your test string.
To fix such problems:
First, write the match you want \d{4}
Then, write what you don't want (?<!Jan |Feb )
That is (?<!Jan |Feb )\d{4}
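A quick sketch of that pattern on the two test strings from the question. Python's re module only accepts fixed-width lookbehinds, which holds here because both alternatives ("Jan " and "Feb ") are four characters long.

```python
import re

year = re.compile(r"(?<!Jan |Feb )\d{4}")

plain = year.findall("This is 2009")
with_month = year.findall("This is Jan 2009")

print(plain)       # ['2009']
print(with_month)  # []
```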
You may want to try this:
(?i)(?<!jan|feb)(?<!uary)\s+[0-9]*[0-9]
Hope it helps.
This generalized example should work for the cases you mentioned in your question above (edited to account for full month names):
INPUTS:
'This is 2009'
'This is Jan 2009'
REGEX:
re.findall(r'(?:\b[^A-Z][a-z]+\s)(\d{4})', text)
OUTPUTS:
['2009']
[]
EXPLANATION:
(?:...) is a non-capturing group, therefore it will not be included in the output
\b asserts a word boundary
[^A-Z] matches one character that is not a capital letter, so the word cannot start with a capital
[a-z]+ matches one or more lowercase letters
\s matches any whitespace character
(\d{4}) is a capturing group matching a digit (\d) for exactly four occurrences {4}

How to extract characters of particular length from a given string in python Regex

Hi I have records like,
Eg:
Health Insurance PortabilityNEG Ratio
Health Insurance PortabilityNEGRatio
Health Insurance PortabilityNEG NEGRatio
Here I need to extract NEG as my value. I tried to write a regex in Python like
Portability(.+?) Ratio,
Portability(.+?)Ratio
where the first "NEG" after Portability is the value which I should get. The first and second records give me the correct output, "NEG". But for my third record I get "NEG NEG", which is wrong.
I need to get only "NEG" for the third record as well. Should I specify a length of three characters to take only "NEG"?
If so, kindly let me know how I can write the regex accordingly.
The . means any character at all, and the + symbol means "at least one" but does not specify an upper limit. You want \w{n}, where \w means a word character and n means the number of occurrences.
Also, note that \w includes digits and the underscore, so if you only want letters, you'd better use [a-zA-Z]{3}
If you have to extract any 3 chars right after Portability use
re.findall(r"Portability(.{3}).*?Ratio", s)
See the regex demo
If these are uppercase letters, replace .{3} with [A-Z]{3}.
Details:
Portability - a literal char sequence
(.{3}) - Capturing group 1: exactly 3 chars (any chars other than line break chars if re.S/re.DOTALL modifier is not used) since {3} is a limiting quantifier matching the number of occurrences defined inside {...}
.*?Ratio - any 0+ chars other than line break chars as few as possible (as *? is a lazy quantifier) up to the first Ratio substring.
The re.findall only returns captured values, so you will only get NEG.
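Applied to the three records from the question, a minimal sketch using the uppercase-only variant [A-Z]{3}:

```python
import re

records = [
    "Health Insurance PortabilityNEG Ratio",
    "Health Insurance PortabilityNEGRatio",
    "Health Insurance PortabilityNEG NEGRatio",
]

# Capture exactly three uppercase letters right after "Portability",
# then lazily skip whatever precedes the first "Ratio".
pattern = re.compile(r"Portability([A-Z]{3}).*?Ratio")

results = [pattern.findall(record) for record in records]
for record, found in zip(records, results):
    print(record, "->", found)  # ['NEG'] for each record
```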
