Get the first three digits in each gap - python

I just started learning regex and I have a problem.
This is my code so far:
match = re.findall(r'\d{1,3}', string)
I know this gives me every group of up to three digits, but I don't know how to get only the first three digits after each gap.
I have a string which looks like this:
string = "24812949 2472198 4271748 12472187"
I want a result like this:
["248", "247", "427", "124"]

Use word boundary \b. \b matches between a word character and a non-word character.
match = re.findall(r'\b\d{1,3}', string)
OR
Negative lookbehind assertion. (?<!\S) Asserts that the match won't be preceded by a non-space character.
match = re.findall(r'(?<!\S)\d{1,3}', string)
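The two patterns above agree on the sample string but can differ when a number follows a non-space, non-word character such as a hyphen. A minimal sketch; the second input string is a made-up example, not taken from the question:

```python
import re

string = "24812949 2472198 4271748 12472187"

# Both patterns give the first three digits of each space-separated number.
print(re.findall(r'\b\d{1,3}', string))
print(re.findall(r'(?<!\S)\d{1,3}', string))

# Hypothetical input with a hyphen directly in front of a number.
tricky = "x-12345 678"
boundary = re.findall(r'\b\d{1,3}', tricky)
lookbehind = re.findall(r'(?<!\S)\d{1,3}', tricky)
print(boundary)    # \b also matches between '-' and '1'
print(lookbehind)  # (?<!\S) needs whitespace or the string start before the match
```

Which one is right depends on whether something like "x-12345" should count as a number after a gap.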

You can add \b as word boundaries:
>>> re.findall(r'\b\d{1,3}', string)
['248', '247', '427', '124']
But if your string is always in this form, you can do without regex:
>>> [i[:3] for i in string.split()]
['248', '247', '427', '124']

I am surprised no one thinks about consuming the rest of the number instead of worrying about the boundary:
>>> re.findall(r'(\d{1,3})\d*', string)
['248', '247', '427', '124']
By capturing the first 3 digits (or fewer if the number is shorter) and then matching the rest of the digits, there is no way the next match can start in the middle of a number. When the previous match ends, the next character after it, if any, must be a non-digit, and since the engine scans from left to right, the next match will start at the beginning of a run of digits.
The re.findall function also returns only the content of the capturing groups when there is at least one capturing group in the regex, which simplifies the whole process.
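To make the "consume the rest" idea visible, here is a small sketch that prints both what findall returns (group 1) and what each full match actually covers. The extra "42" at the end of the string is an added assumption to show the short-number case:

```python
import re

# "42" is appended to the question's string as a short-number case.
string = "24812949 2472198 4271748 12472187 42"

pattern = re.compile(r'(\d{1,3})\d*')

# findall returns only group 1 because the pattern has one capturing group.
first_digits = pattern.findall(string)
print(first_digits)

# finditer shows that each full match swallows the whole number.
whole_matches = [m.group(0) for m in pattern.finditer(string)]
print(whole_matches)
```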

(?:^|(?<=\s))\d{1,3}
Try this. See demo.
https://regex101.com/r/gQ3kS4/22
match = re.findall(r'(?:^|(?<=\s))\d{1,3}', string)

Try the solution below:
import re
string = "24812949 2472198 4271748 12472187"
match = re.findall(r'\b\d{1,3}', string)
print(match)
Output: ['248', '247', '427', '124']

Related

Why does it not give all positive numbers in the string? Regex in Python

I don't understand why it only gives 125, the first number. Why does it not give all the positive numbers in that string? My goal is to extract all positive numbers.
import re
pattern = re.compile(r"^[+]?\d+")
text = "125 -898 8969 4788 -2 158 -947 599"
matches = pattern.finditer(text)
for match in matches:
    print(match)
Try using the regular expression
-\d+|(\d+)
Disregard the matches. The strings representing non-negative integers are saved in capture group 1.
Demo
The idea is to match but not save to a capture group what you don't want (negative numbers), and both match and save to a capture group what you do want (non-negative numbers).
The regex attempts to match -\d+. If that succeeds the regex engine's internal string pointer is moved to just after the last digit matched. If -\d+ is not matched an attempt is made to match the second part of the alternation (following |). If \d+ is matched the match is saved to capture group 1.
Any plus signs in the string can be disregarded.
For a fuller description of this technique see The Greatest Regex Trick Ever. (Search for "Tarzan"|(Tarzan) to get to the punch line.)
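A minimal sketch of this technique with re.findall, which returns the content of group 1 for every match. Matches of the negative branch leave group 1 empty, so they show up as empty strings and need to be filtered out:

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# Negative numbers match the left branch, so group 1 is empty for them.
raw = re.findall(r'-\d+|(\d+)', text)
print(raw)

# Keep only the matches where group 1 actually captured something.
non_negative = [m for m in raw if m]
print(non_negative)
```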
The following pattern will only match non negative numbers:
pattern = re.compile(r"(?:^|[^\-\d])(\d+)")
pattern.findall(text)
OUTPUT
['125', '8969', '4788', '158', '599']
For the sake of completeness another idea by use of \b and a lookbehind.
\b(?<!-)\d+
See this demo at regex101
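A short sketch of why the \b matters in this pattern: without it, the lookbehind alone still lets a match start at the second digit of a negative number.

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# \b forces the match to start at the first digit of a number,
# and (?<!-) rejects that start when a minus sign precedes it.
with_boundary = re.findall(r'\b(?<!-)\d+', text)
print(with_boundary)

# Without \b the engine can start one digit later, inside "-898" and "-947".
without_boundary = re.findall(r'(?<!-)\d+', text)
print(without_boundary)
```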
Your pattern ^[+]?\d+ is anchored at the start of the string, and will give only that match at the beginning.
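The effect of the anchor can be checked directly. In this sketch, dropping the ^ finds every run of digits, which also shows that removing the anchor alone is not enough, because the digits of the negative numbers are matched too:

```python
import re

text = "125 -898 8969 4788 -2 158 -947 599"

# With ^ the pattern can only match at the very start of the string.
anchored = [m.group() for m in re.finditer(r'^[+]?\d+', text)]
print(anchored)

# Without ^ every run of digits matches, including those after a minus sign.
unanchored = [m.group() for m in re.finditer(r'[+]?\d+', text)]
print(unanchored)
```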
Another option is to assert a whitespace boundary to the left, and match an optional + followed by 1 or more digits.
(?<!\S)\+?\d+\b
(?<!\S) Assert a whitespace boundary to the left
\+? Match an optional +
\d+\b Match 1 or more digits followed by a word boundary
Regex demo
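A sketch of this pattern in Python. The second string is a made-up input, added to show the optional + and the trailing word boundary at work:

```python
import re

pattern = re.compile(r'(?<!\S)\+?\d+\b')

text = "125 -898 8969 4788 -2 158 -947 599"
from_question = pattern.findall(text)
print(from_question)

# Hypothetical input: "+42" keeps its sign, "7a" is rejected by the trailing \b.
extra = "125 -898 +42 7a 599"
from_extra = pattern.findall(extra)
print(from_extra)
```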
Use , to separate the numbers in the string.

Why does my regex with word boundary fail?

I'd like to match a number, positive or negative, possibly with a currency sign in front. But I don't want something like PSM-9. My code is:
test='AAA PCSK-9, $111 -3,33'
re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)
Output is: ['-9', '111', '3,33']
Could someone explain why -9 is matched? Thank you in advance.
Edit:
I don't want any part of PCSK-9 to be matched; it is the name of a product rather than a number. So my desired output is:
['111', '3,33']
This is because \b matches the gap between K and -, a word and a non-word character. If you want to avoid matching - if it's preceded by a word you can use negative lookbehind instead:
re.findall(r'[$€£]?(?:(?<!\w)-)?\d+[\d,.]*\b', test)
With your sample input, this returns:
['9', '$111', '-3,33']
Demo: https://regex101.com/r/A66C5W/1
The word boundary matches between the K and the dash. The two parts after the dash, [$€£]?-?, are optional because of the question marks, and then you match a digit one or more times. This results in the match -9.
What you might use instead of a word boundary are assertions that check that what is directly before and after the match is not a non-whitespace character \S, using a negative lookbehind and a negative lookahead.
(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)
Regex demo | Python demo
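A sketch that runs the question's pattern and the lookaround pattern side by side on the same input. Note that the second pattern has a capturing group, so re.findall returns the number without its sign or currency symbol:

```python
import re

test = 'AAA PCSK-9, $111 -3,33'

# Original pattern: \b matches between 'K' and '-', so '-9' is picked up.
with_boundary = re.findall(r'\b-?[$€£]?-?\d+[\d,.]*\b', test)
print(with_boundary)

# Whitespace boundaries on both sides: 'PCSK-9,' is rejected as a whole.
with_lookarounds = re.findall(r'(?<!\S)-?[$€£]?(\d+(?:[,.]\d+)?)(?!\S)', test)
print(with_lookarounds)
```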
-9 is matched because - is a non-word character and K is a word character, so there is a word boundary \b between them, exactly as your regexp asks for.

python3: regex, find all substrings that start with and end with a certain string

Let's say that I have a string that looks like this:
a = '1253abcd4567efgh8910ijkl'
I want to find all substrings that start with a digit and end with a letter.
I tried,
b = re.findall(r'\d.*\w', a)
but this gives me,
['1253abcd4567efgh8910ijkl']
I want to have something like,
['1253abcd','4567efgh','8910ijkl']
How can I do this? I'm pretty new to regex, and would really appreciate it if anyone could show different ways to do this with regex and explain what's going on.
\w will match any word character, which includes digits, letters and the underscore. You need to use [a-zA-Z] to match letters only. See this example.
import re
a = '1253abcd4567efgh8910ijkl'
b = re.findall(r'(\d+[A-Za-z]+)', a)
Output:
['1253abcd', '4567efgh', '8910ijkl']
\d will match digits. \d+ will match one or more consecutive digits. For example:
>>> re.findall(r'(\d+)', a)
['1253', '4567', '8910']
Similarly, [a-zA-Z]+ will match one or more letters.
>>> re.findall(r'([a-zA-Z]+)', a)
['abcd', 'efgh', 'ijkl']
Now put them together to match what you exactly want.
The Python manual on regular expressions tells us that \w:
matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]
So you are actually capturing more than you need. Refine your regular expression a bit:
>>> re.findall(r'(\d+[a-z]+)', a, re.I)
['1253abcd', '4567efgh', '8910ijkl']
The re.I flag makes your expression case insensitive, so it matches both upper and lower case letters:
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA')
['12124adbad']
>>> re.findall(r'(\d+[a-z]+)', '12124adbad13434AGDFDF434348888AAA', re.I)
['12124adbad', '13434AGDFDF', '434348888AAA']
\w matches any alphanumeric character (or the underscore), and you have used .* in front of it. So your pattern matches a single string that starts with a digit and then greedily runs through characters of any kind up to the last word character.
Solution:
>>> b = re.findall(r'\d*[A-Za-z]*', a)
>>> b
['1253abcd', '4567efgh', '8910ijkl', '']
You will get '' (an empty string) at the end of the list, because the pattern can also match nothing at the end of the string. You can remove it using
b.pop(-1)
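As a hedged alternative to popping the last element, the empty match can be filtered out, or avoided altogether by using + instead of * so that the pattern can no longer match an empty string. A small sketch:

```python
import re

a = '1253abcd4567efgh8910ijkl'

# Both parts are optional, so the pattern also matches the empty string at the end.
loose = re.findall(r'\d*[A-Za-z]*', a)
print(loose)

# Option 1: drop empty matches wherever they occur.
cleaned = [m for m in loose if m]
print(cleaned)

# Option 2: require at least one digit and one letter, so no empty match is possible.
strict = re.findall(r'\d+[A-Za-z]+', a)
print(strict)
```

Filtering is a little more robust than b.pop(-1), since it does not assume the empty string is always the last element.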

Matching an apostrophe only within a word or string

I'm looking for a Python regex that can match 'didn't' and returns only the character that is immediately preceded by an apostrophe, like 't, but not the 'd or t' at the beginning and end.
I have tried (?=.*\w)^(\w|')+$ but it only matches the apostrophe at the beginning.
Some more examples:
'I'm' should only match 'm and not 'I
'Erick's' should only return 's and not 'E
The text will always start and end with an apostrophe and can include apostrophes within the text.
To match an apostrophe inside a whole string, match it anywhere but at the start/end of the string:
(?!^)'(?!$)
See the regex demo.
Often, the apostrophe is searched for only inside a word (in fact, a pair of words where the second one is shortened); then you may use
\b'\b
See this regex demo. Here, the ' is preceded and followed by a word boundary, so the ' must have a word character (a letter, digit or _) on each side. Yes, _ and digits are allowed on both sides.
If you need to match a ' only between two letters, use
(?<=[A-Za-z])'(?=[A-Za-z]) # ASCII only
(?<=[^\W\d_])'(?=[^\W\d_]) # Any Unicode letters
See this regex demo.
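To compare the three variants above, this sketch lists the positions at which each one finds an apostrophe. The sample string is a made-up input chosen so that the patterns disagree:

```python
import re

# Hypothetical sample: apostrophes at indices 0, 4, 8 and 10.
s = "'don't 5'9'"

patterns = {
    "not at start/end": r"(?!^)'(?!$)",
    "between word chars": r"\b'\b",
    "between ASCII letters": r"(?<=[A-Za-z])'(?=[A-Za-z])",
}

# Collect the start index of every match for each pattern.
positions = {
    name: [m.start() for m in re.finditer(p, s)]
    for name, p in patterns.items()
}
for name, found in positions.items():
    print(name, found)
```

The apostrophe between the digits is accepted by the first two patterns but not by the letters-only one.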
As for this current question, here is a bunch of possible solutions:
import re
s = "'didn't'"
print(s.strip("'")[s.strip("'").find("'")+1])
print(re.search(r'\b\'(\w)', s).group(1))
print(re.search(r'\b\'([^\W\d_])', s).group(1))
print(re.search(r'\b\'([a-z])', s, flags=re.I).group(1))
print(re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I))
The s.strip("'")[s.strip("'").find("'")+1] gets the character after the first ' after stripping the leading/trailing apostrophes.
The re.search(r'\b\'(\w)', s).group(1) solution gets the word (i.e. [a-zA-Z0-9_], can be adjusted from here) char after a ' that is preceded with a word char (due to the \b word boundary).
The re.search(r'\b\'([^\W\d_])', s).group(1) is almost identical to the above solution, it only fetches a letter character as [^\W\d_] matches any char other than a non-word, digit and _.
Note that the re.search(r'\b\'([a-z])', s, flags=re.I).group(1) solution is next to identical to the above one, but you cannot make it Unicode aware with re.UNICODE.
The last re.findall(r'\b\'([a-z])', "'didn't know I'm a student'", flags=re.I) just shows how to fetch multiple letter chars from a string input.

Python regex boundary

Is there an error in the way python handles '.' or '\b'? I'm not sure why this produces differing results.
import re
regex1 = r'\.?\b'
print(bool(re.match(regex1, '.')))
regex2 = r'a?\b'
print(bool(re.match(regex2, 'a')))
Output:
False
True
\b, word boundary, matches between word characters and non-word elements. As such, it will match between a word character like a and the end of the string, but not between a non-word character like . and end of string.
As geekosaur pointed out \b is merely a short way of writing
(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))
In your case you may want to use
(?!\w)
or
(?!\S)
instead of \b.
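A small sketch that checks the claims above: the lookaround spelling behaves like \b on the two inputs from the question, while (?!\w) also accepts the lone dot. The dictionary keys are just labels chosen for this example:

```python
import re

# Lookaround spelling of \b quoted above.
EXPANDED = r'(?:(?<=\w)(?!\w)|(?<!\w)(?=\w))'

tails = {
    'word_boundary': r'\b',
    'expanded': EXPANDED,
    'no_word_char_ahead': r'(?!\w)',
}

# For each tail, test '.' with an optional dot and 'a' with an optional a.
results = {
    name: (
        bool(re.match(r'\.?' + tail, '.')),
        bool(re.match(r'a?' + tail, 'a')),
    )
    for name, tail in tails.items()
}
for name, outcome in results.items():
    print(name, outcome)
```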
