Why does my regex with word boundary fail? - python

I'd like to match numbers, positive or negative, possibly with a currency sign in front. But I don't want to match something like PSM-9. My code is:
test='AAA PCSK-9, $111 -3,33'
re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)
Output is:['-9', '111', '3,33']
Could someone explain why -9 is matched? Thank you in advance.
Edit:
I don't want any part of PCSK-9 to be matched; it is the name of a product rather than a number. So my desired output is:
['111', '3,33']

This is because \b matches the gap between K and -, a word character and a non-word character. If you want to avoid matching - when it is preceded by a word character, you can use a negative lookbehind instead:
re.findall(r'[$€£]?(?:(?<!\w)-)?\d+[\d,.]*\b', test)
With your sample input, this returns:
['9', '$111', '-3,33']
Demo: https://regex101.com/r/A66C5W/1

The word boundary matches between the K and the dash. The two parts after the dash, [$€£]? and -?, are optional because of the question mark, and then you match a digit one or more times. This results in the match -9.
What you might use instead of a word boundary is a pair of assertions that check that what is directly before and after the match is not a non-whitespace character \S, using a negative lookbehind and a negative lookahead.
(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)
Regex demo | Python demo
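A quick check of the whitespace-boundary pattern above against the sample input from the question. This is only a minimal sketch; the variable names are illustrative.

```python
import re

test = 'AAA PCSK-9, $111 -3,33'

# (?<!\S) and (?!\S) require whitespace or a string edge on both sides,
# so the token "PCSK-9," is rejected as a whole.
pattern = re.compile(r'(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)')

# findall returns the content of capture group 1: the number without
# its sign or currency symbol.
numbers = pattern.findall(test)
print(numbers)  # ['111', '3,33']
```

Because the sign and currency symbol sit outside the capture group, they are consumed but not returned.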

-9 is matched because - is a non-word character and K is a word character, so between them there is a word boundary \b, which is exactly what your regexp asks for.
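To see where the boundaries fall, the positions at which \b matches inside the product name can be listed directly. This is a small illustrative sketch; the variable names are not from the question.

```python
import re

token = 'PCSK-9'

# Every position where \b matches: a word character on exactly one side.
boundaries = [m.start() for m in re.finditer(r'\b', token)]
print(boundaries)  # [0, 4, 5, 6]

# Position 4 lies between "K" and "-", which is why an optional dash
# placed right after \b can start a match there.
dash_match = re.findall(r'\b-?\d+', token)
print(dash_match)  # ['-9']
```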

Related

Why does it not give all positive numbers in the string? Regex in Python

I don't understand why it only gives 125, the first number only, why it does not give all positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
Demo
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
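In Python, the technique described above might look like the following sketch. The list comprehension and variable names are one possible way to discard the unwanted matches, not the only one.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Negative numbers are matched by the first alternative and leave group 1
# unset; only the second alternative fills group 1.
matches = re.finditer(r'-\d+|(\d+)', text)
non_negative = [m.group(1) for m in matches if m.group(1) is not None]
print(non_negative)  # ['125', '8969', '4788', '158', '599']
```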
The following pattern will only match non-negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
Another option is to assert a whitespace boundary to the left, and match the optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
Regex demo
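A short comparison of the anchored pattern from the question with the whitespace-boundary pattern above, using the same sample text. This is a minimal sketch; the variable names are illustrative.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Anchored with ^, the original pattern can only match at the very start.
anchored = re.findall(r'^[+]?\d+', text)
print(anchored)  # ['125']

# With a whitespace boundary on the left, every number that does not
# follow a "-" (or any other non-whitespace character) is found.
positive = re.findall(r'(?<!\S)\+?\d+\b', text)
print(positive)  # ['125', '8969', '4788', '158', '599']
```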
Use , to separate the numbers in the string.

Detecting alphanumeric/numeric values in python string

I'm trying to extract tokens/part of tokens that have numeric/alphanumeric characters that have a length greater than 8 from the text.
Example:
text = 'https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8'
The expected output would be :
59800512 510557XXXXXX2302 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg 69i57j0i22i30l8j0i390 4672j0j7
I have tried using the regular expression : ((\d+)|([A-Za-z]+\d)[\dA-Za-z]*) based on the answer Python Alphanumeric Regex. I got the following results :
[match for match in re.findall(r"((\d+)|([A-Za-z]+\d)[\dA-Za-z]*)",text)]
Output :
[('59800512', '59800512', ''),
('510557', '510557', ''),
('XXXXXX2302', '', 'XXXXXX2'),
('1601371803', '1601371803', ''),
('NhLw6NlR0EksRWkLddEo7NiEvrg', '', 'NhLw6'),
('69', '69', ''),
('i57j0i22i30l8j0i390', '', 'i5'),
('4672', '4672', ''),
('j0j7', '', 'j0'),
('8', '8', '')]
I'm getting a tuple of matching groups for each matching token.
It is possible to filter these tuples again. But I'm trying to make the code as efficient and pythonic as possible.
Could anyone suggest a solution? It need not be based on regular expressions.
Thanks in advance
Edit :
I expect alphanumeric values of length equal to or greater than 8
You get the tuples in the result, as re.findall returns the values of the capture groups.
But you can omit the capture groups and change the pattern to a single match, matching at least a digit between chars A-Z a-z and assert a minimum of 8 characters using a positive lookahead.
\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b
\b A word boundary
(?=[A-Za-z0-9]{8}) Positive lookahead, assert at least 8 occurrences of any of the listed ranges
[A-Za-z]* Optionally match chars A-Z a-z
\d Match a digit
[A-Za-z\d]* Optionally match chars A-Z a-z or digits
\b A word boundary
See a regex demo or a Python demo.
import re
from pprint import pprint
pattern = r"\b(?=[A-Za-z0-9]{8})[A-Za-z]*\d[A-Za-z\d]*\b"
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
pprint(re.findall(pattern, s))
Output
['59800512',
'510557XXXXXX2302',
'1601371803',
'NhLw6NlR0EksRWkLddEo7NiEvrg',
'69i57j0i22i30l8j0i390',
'4672j0j7']
I came up with:
\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b
See an online demo
\b - Word boundary.
[A-Za-z]{,7} - 0-7 alphabetic chars.
\d - A single digit.
[A-Za-z\d]{7,} - 7+ times an alphanumeric char.
\b - Word boundary.
Some sample code:
import re
s = "https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words 1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg https://www.google.com/search?q=some+google+search&oq=some+google+search&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8"
result = re.findall(r'\b[A-Za-z]{,7}\d[A-Za-z\d]{7,}\b', s)
print(result)
Prints:
['59800512', '510557XXXXXX2302', '1601371803', 'NhLw6NlR0EksRWkLddEo7NiEvrg', '69i57j0i22i30l8j0i390', '4672j0j7']
You could opt to match case-insensitive with:
(?i)\b[a-z]{,7}\d[a-z\d]{7,}\b
Although the selected answer returns the required output, it is not generic, and it fails to match specific cases (e.g., s = "thisword2H2g2d").
For a more generic regex that works for all combinations of alphanumeric values:
result = re.findall(r"(\d+[A-Za-z\d]+\d*)|([A-Za-z]+[\d]+[A-Za-z\d]*)", s)
See the demo here.
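As a possible alternative that keeps a single match per token and still enforces the minimum length of 8, a lookahead can require a digit somewhere in the token. This pattern is not taken from any of the answers above; it is a sketch under the assumption that a token is a run of ASCII letters and digits.

```python
import re

s = ("https://stackoverflow.com/questions/59800512/ 510557XXXXXX2302 Normal words "
     "1601371803 NhLw6NlR0EksRWkLddEo7NiEvrg "
     "https://www.google.com/search?q=some+google+search&oq=some+google+search"
     "&aqs=chrome..69i57j0i22i30l8j0i390.4672j0j7&sourceid=chrome&ie=UTF-8")

# (?=[A-Za-z]*\d) asserts that a digit follows any leading letters;
# [A-Za-z\d]{8,} then matches the whole token of 8 or more characters.
pattern = re.compile(r'\b(?=[A-Za-z]*\d)[A-Za-z\d]{8,}\b')

result = pattern.findall(s)
print(result)

# The case mentioned above, with 8 letters before the first digit.
edge_case = pattern.findall("thisword2H2g2d")
print(edge_case)  # ['thisword2H2g2d']
```

With no capture groups in the pattern, re.findall returns plain strings rather than tuples.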

Regex that not ending with smaller case

I am creating a regex that matches words with at least 3 chars that do not end with a lowercase letter.
import re
re.findall(r'(\w{3,})(?![a-z])\b','I am tyinG a mixed charAv case VOW')
My output is
['tyinG', 'mixed', 'charAv', 'case', 'VOW']
My expected output is
['tyinG', 'VOW']
I get the proper output when I use re.findall(r'(\w{3,})(?<![a-z])\b','I am tyinG a mixed charAv case VOW')
When I tried it on je.im, my first regex, the one without the <, gave the correct result.
What is the relevance of < here?
The first pattern (\w{3,})(?![a-z])\b does not give you the expected result because the pattern first matches 3+ word chars and then asserts, using a negative lookahead (?!, that what is directly to the right is not a lowercase char a-z.
That assertion will always be true: \w{3,} is greedy and has already consumed all the word chars, including any lowercase a-z, so the next char can only be a non-word char or the end of the string.
The second pattern (\w{3,})(?<![a-z])\b does give you the right result, as it first tries to match 3 or more word chars and after that asserts, using a negative lookbehind (?<!, that what is directly to the left is not a lowercase char a-z.
If you want to use a lookaround, you can make the pattern a bit more efficient by making use of a word boundary at the beginning.
At the end of the pattern place the negative lookbehind after the word boundary to first anchor it and then do the assertion.
\b\w{3,}\b(?<![a-z])
Note that you can omit the capturing group if you want the single match only.
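The difference between the lookahead and the lookbehind can be checked side by side on the sentence from the question. This is a minimal sketch; the variable names are illustrative and the capturing group is omitted.

```python
import re

text = 'I am tyinG a mixed charAv case VOW'

# Lookahead: \w{3,} has already consumed the whole word, so the next
# character is never a lowercase letter and every word passes.
with_lookahead = re.findall(r'\w{3,}(?![a-z])\b', text)
print(with_lookahead)  # ['tyinG', 'mixed', 'charAv', 'case', 'VOW']

# Lookbehind after the closing word boundary: inspects the last
# character of the word that was just matched.
with_lookbehind = re.findall(r'\b\w{3,}\b(?<![a-z])', text)
print(with_lookbehind)  # ['tyinG', 'VOW']
```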

Not able to get desired result from lookbehind in python regex

I am trying to find a pattern which allows me to find a year of four digits. But I do not want to get results in which year is preceded by month e.g "This is Jan 2009" should not give any result, but "This is 2009" should return 2009. I use findall with lookbehind at Jan|Feb but I get 'an 2009' instead of blank. What am I missing? How to do It?
Any otherwise matching string preceded by a string matching the negative lookbehind is not matched.
In your current regex, [a-z]* \d{4} matches "an 2009".
The negative lookbehind '(?<!Jan|Feb)' does not match the "This is J" part, so it is not triggered.
If you remove '[a-z]*' from the regex, then no match will be returned on your test string.
To fix such problems:
First, write the match you want \d{4}
Then, write what you don't want (?<!Jan |Feb )
That is (?<!Jan |Feb )\d{4}
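The resulting pattern can be tried on the two sentences from the question. This is a minimal sketch covering only Jan and Feb, as in the answer above. Python's re module needs every alternative inside a lookbehind to have the same width, which holds here because both are four characters long.

```python
import re

# Each lookbehind alternative is a 3-letter month abbreviation plus a space.
year = re.compile(r'(?<!Jan |Feb )\d{4}')

plain = year.findall('This is 2009')
with_month = year.findall('This is Jan 2009')

print(plain)       # ['2009']
print(with_month)  # []
```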
You may want to try this:
(?i)(?<!jan|feb)(?<!uary)\s+[0-9]*[0-9]
Hope it helps.
This generalized example should work for the cases you mentioned in your question above (edited to account for full month names):
INPUTS:
'This is 2009'
'This is Jan 2009'
REGEX:
re.findall(r'(?:\b[^A-Z][a-z]+\s)(\d{4})', text)
OUTPUTS:
['2009']
[]
EXPLANATION:
(?:...) is a non-capturing group, so its content is not included in the output
\b asserts a word boundary
[^A-Z] matches one character that is not a capital letter, so the word cannot start with one
[a-z]+ matches one or more lowercase letters after it
\s matches any whitespace character
(\d{4}) is a capturing group that matches a digit (\d) four times {4}

Get the first three digits in each gap

I just started learning regex and I have a problem.
This is my code so far.
match = re.findall(r'\d{1,3}', string)
I know this gives me every run of up to three digits. But I don't know how to get only the first three digits after each gap.
I have a string which looks like this:
string = "24812949 2472198 4271748 12472187"
I want a result like this:
["248", "247", "427", "124"]
Use word boundary \b. \b matches between a word character and a non-word character.
match = re.findall(r'\b\d{1,3}', string)
OR
Negative lookbehind assertion. (?<!\S) Asserts that the match won't be preceded by a non-space character.
match = re.findall(r'(?<!\S)\d{1,3}', string)
You can add \b as word boundaries:
>>> re.findall(r'\b\d{1,3}', string)
['248', '247', '427', '124']
But if your string is always in this form, you can do without regex:
>>> [i[:3] for i in string.split()]
['248', '247', '427', '124']
I am surprised no one thinks about consuming the rest of the number instead of worrying about the boundary:
>>> re.findall(r'(\d{1,3})\d*', string)
['248', '247', '427', '124']
By capturing the first 3 digits (or less in case the number is smaller), and match the rest of the digits, there is no way the next match can happen in the middle of a number. When the previous match ends, the next character after it, if any, must be non-digit, and since the engine scans from left to right, the next match will start at the beginning of a string of digits.
re.findall function also returns only the content in the capturing groups when there is at least 1 capturing group in the regex, which smooths out the whole process.
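The effect of consuming the rest of each number can be seen by comparing it with the pattern from the question. This is a minimal sketch; the variable names are illustrative.

```python
import re

string = "24812949 2472198 4271748 12472187"

# Without consuming the remaining digits, the next match starts in the
# middle of the same number.
naive = re.findall(r'\d{1,3}', string)
print(naive[:3])  # ['248', '129', '49']

# Capturing up to three digits and swallowing the rest with \d* leaves
# nothing of the number for the next match to start on.
first_three = re.findall(r'(\d{1,3})\d*', string)
print(first_three)  # ['248', '247', '427', '124']
```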
(?:^|(?<=\s))\d{1,3}
Try this. See demo.
https://regex101.com/r/gQ3kS4/22
match = re.findall(r'(?:^|(?<=\s))\d{1,3}', string)
Try the solution below:
string = "24812949 2472198 4271748 12472187"
match = re.findall(r'\b\d{1,3}', string)
print(match)
Output: ['248', '247', '427', '124']
