How can I extract the first 3-digit number from a string:
in:
"Box 123 (NO) 456"
out:
123
Just search for \d{3} and grab the first match:
match = re.search(r'\d{3}', inputstring)
if match:
    print(match.group(0))
Demo:
>>> import re
>>> inputstring = "Box 123 (NO) 456"
>>> match = re.search(r'\d{3}', inputstring)
>>> if match:
...     print(match.group(0))
...
123
Note that the above also matches a substring; if you have a number that is four digits long it'll match the first 3 digits of that number.
Your post is sparse on details. If the above is not enough, and your numbers are delimited by whitespace or punctuation, you can match exactly 3 digits by using \b anchors:
match = re.search(r'\b\d{3}\b', inputstring)
which matches only 3 digits between non-word characters (the start or end of the string, whitespace, punctuation, etc.; anything that is not a letter, a digit, or an underscore):
>>> re.search(r'\b\d{3}\b', inputstring)
<_sre.SRE_Match object at 0x106c4f100>
>>> re.search(r'\b\d{3}\b', "Box 1234")
>>> re.search(r'\b\d{3}\b', "Box 123")
<_sre.SRE_Match object at 0x106c4f1d0>
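To make the difference between the two patterns concrete, here is a minimal sketch. The helper name first_three_digits and its whole_number_only flag are made up for illustration; only re.search and the two patterns come from the answer above.

```python
import re

def first_three_digits(text, whole_number_only=False):
    # With whole_number_only=True the \b anchors reject longer numbers
    # such as 1234 instead of returning their first three digits.
    pattern = r'\b\d{3}\b' if whole_number_only else r'\d{3}'
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(first_three_digits("Box 123 (NO) 456"))   # 123
print(first_three_digits("Box 1234"))           # 123 (first 3 digits of 1234)
print(first_three_digits("Box 1234", True))     # None
```

Returning None instead of raising keeps the "no match" case explicit for the caller.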
I am attempting to extract a substring that contains numbers and letters:
string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
I only want 11m56.95s and 13m31.14s
I have tried doing this:
re.findall('\d+', string)
that doesn't give me what I want, I also tried this:
re.findall('\d{2}[m]+\d[.]+\d|\+)
that did not work either, any other suggestions?
Try this:
re.findall(r"[0-9]{2}[m][0-9]{2}\.[0-9]{2}[s]", string)
Output:
['11m56.95s', '13m31.14s']
Your current regular expression does not match what you expect it to.
You could use the following regular expression to extract those substrings.
re.findall(r'\d+m\d+\.\d+s', string)
Example:
>>> import re
>>> s = 'LINE : 11m56.95s CPU 13m31.14s TODAY'
>>> for x in re.findall(r'\d+m\d+\.\d+s', s):
...     print(x)
11m56.95s
13m31.14s
Your Regex pattern is not formed correctly. It is currently matching:
\d{2} # Two digits
[m]+ # One or more m characters
\d # A digit
[.]+ # One or more . characters
\d # A digit
|\+ # Or, as an alternative to everything above, a literal +
Instead, you should use:
>>> import re
>>> string = "LINE : 11m56.95s CPU 13m31.14s TODAY"
>>> re.findall(r'\d+m\d+\.\d+s', string)
['11m56.95s', '13m31.14s']
>>>
Below is an explanation of what the new pattern matches:
\d+ # One or more digits
m # m
\d+ # One or more digits
\. # .
\d+ # One or more digits
s # s
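As a quick check of the two breakdowns above, the sketch below runs both patterns on the sample string. The first pattern is the one from the question with its missing closing quote restored, which is an assumption about what was intended.

```python
import re

string = "LINE : 11m56.95s CPU 13m31.14s TODAY"

# Pattern from the question, closing quote restored (assumed intent).
# It needs a '.' right after "two digits, m, one digit", which never occurs here.
attempt = re.findall(r'\d{2}[m]+\d[.]+\d|\+', string)

# Corrected pattern: digits, m, digits, a literal dot, digits, s.
fixed = re.findall(r'\d+m\d+\.\d+s', string)

print(attempt)  # []
print(fixed)    # ['11m56.95s', '13m31.14s']
```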
Another option is the pattern \b\d+.*?s\b, which reads piece by piece as:
\b #word boundary
\d+ #starts with digit
.*? #anything (non-greedy so its the smallest possible match)
s #ends with s
\b #word boundary
If all your lines look like your example, split will work:
s = "LINE : 11m56.95s CPU 13m31.14s TODAY"
spl = s.split()
a,b = spl[2],spl[4]
print(a,b)
('11m56.95s', '13m31.14s')
I have the following string, Hello, season 2 (VSF) and I need to parse "2" out of it. Here is what I'm trying:
s = 'Hello, season 2 (VSF)'
re.findall('Season|Saison|Staffel[\s]+\d',s)
>>> ["Season"]
How would I get "Season 2" here?
Season|Saison|Staffel should be grouped. Also pass the re.IGNORECASE (or re.I) flag to match case-insensitively.
>>> s = 'Hello, season 2 (VSF)'
>>> re.findall(r'(?:Season|Saison|Staffel)\s+\d+', s, flags=re.IGNORECASE)
['season 2']
>>> re.findall(r'(?:Season|Saison|Staffel)\s+\d+', s) # without re.I
[]
Use a non-capturing group. Otherwise the pattern includes a capturing group, and re.findall returns a list of the captured groups instead of the full matched strings.
>>> re.findall(r'(Season|Saison|Staffel)\s+\d+', s, flags=re.IGNORECASE)
['season']
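Since the question asks for the "2" itself, the capturing behaviour can also be used on purpose. This is a sketch, assuming the season number is the only part you want back; the variable names are illustrative.

```python
import re

s = 'Hello, season 2 (VSF)'

# Non-capturing group for the keyword, capturing group for the number.
pattern = re.compile(r'(?:Season|Saison|Staffel)\s+(\d+)', re.IGNORECASE)

match = pattern.search(s)
season_number = match.group(1) if match else None

print(season_number)       # 2
print(pattern.findall(s))  # ['2']
```

With exactly one capturing group, re.findall returns just that group, so the keyword is matched but not returned.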
So, I am trying to find a word (a complete word) in a sentence. Let's say the sentence is
Str1 = "1. how are you doing"
and that I am interested in finding if
Str2 = "1."
is in it. If I do,
re.search(r"%s\b" % Str2, Str1, re.IGNORECASE)
it should say that a match was found, shouldn't it? But re.search fails for this query. Why?
There are two things wrong here:
\b matches a position between a word and a non-word character, so between any letter, digit or underscore, and a character that doesn't match that set.
You are trying to match the boundary between a . and a space; both are non-word characters and the \b anchor would never match there.
You are handing re the pattern 1., in which the unescaped . means 'match a 1 followed by any other character'. You'd need to escape the dot by using re.escape() to match a literal dot.
The following works better:
re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
Now it'll match your input literally, and look for a following space or the end of the string. The (?:...) creates a non-capturing group (always a good idea unless you specifically need to capture sections of the match); inside the group there is a | pipe to give two alternatives: either match \s (whitespace) or match $ (the end of the string, or the end of a line if re.MULTILINE is set). You can expand this as needed.
Demo:
>>> import re
>>> Str1 = "1. how are you doing"
>>> Str2 = "1."
>>> re.search(r"%s(?:\s|$)" % re.escape(Str2), Str1, re.IGNORECASE)
<_sre.SRE_Match object at 0x10457eed0>
>>> _.group(0)
'1. '
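The effect of re.escape() is easiest to see next to the unescaped version. The sketch below wraps the search in a hypothetical helper, contains_token, which is not part of the answer above.

```python
import re

def contains_token(token, text):
    # re.escape() turns regex metacharacters in token into literals,
    # so "1." only matches a 1 followed by a real dot.
    pattern = r"%s(?:\s|$)" % re.escape(token)
    return re.search(pattern, text, re.IGNORECASE) is not None

print(contains_token("1.", "1. how are you doing"))  # True
print(contains_token("1.", "12 how are you doing"))  # False

# Without escaping, the dot matches any character, here the "2".
unescaped = re.search(r"1.(?:\s|$)", "12 how are you doing")
print(unescaped is not None)  # True
```

Note that this only checks what follows the token; it does not check what precedes it, so "21. " would also count as a hit.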
I tried to separate m's in a Python regex by using word boundaries and find them all. These m's should either have whitespace on both sides or begin/end the string:
r = re.compile("\\bm\\b")
re.findall(r, someString)
However, this method also finds m's within words like I'm since apostrophes are considered to be word boundaries. How do I write a regex that doesn't consider apostrophes as word boundaries?
I've tried this:
r = re.compile("(\\sm\\s) | (^m) | (m$)")
re.findall(r, someString)
but that just doesn't match any m. Odd.
Using lookaround assertion:
>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I'm a boy")
[]
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I m a boy")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "pm")
['m']
(?=...)
Matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example, Isaac
(?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match
for ... that ends at the current position. This is called a positive
lookbehind assertion. (?<=abc)def will find a match in abcdef, ...
from Regular expression syntax
BTW, if you use a raw string (r'this is a raw string'), you don't need to escape the backslash.
>>> r'\s' == '\\s'
True
You don't even need look-around (unless you want to capture the m without the spaces), but your second example was inches away. It was the extra spaces (ok in python, but not within a regex) which made them not work:
>>> re.findall(r'\sm\s|^m|m$', "I m a boy")
[' m ']
>>> re.findall(r'\sm\s|^m|m$', "mamam")
['m', 'm']
>>> re.findall(r'\sm\s|^m|m$', "mama")
['m']
>>> re.findall(r'\sm\s|^m|m$', "I'm a boy")
[]
>>> re.findall(r'\sm\s|^m|m$', "I'm a boym")
['m']
falsetru's answer is almost the equivalent of "\b except apostrophes", but not quite. It will still find matches where a boundary is missing. Using one of falsetru's examples:
>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
It finds 'm', but there is no occurrence of 'm' in 'mama' that would match '\bm\b'. The first 'm' matches '\bm', but that's as close as it gets.
The regex that implements "\b without apostrophes" is shown below:
(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$
This will find any of the following 4 cases:
'm' with white space before and after
'm' at beginning followed by white space
'm' at end preceded by white space
'm' with nothing preceding or following it (i.e. just literally the string "m")
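The four cases can be checked against a few inputs. The sketch below also compares them with a shorter pattern built from negative lookarounds, (?<!\S)m(?!\S); that compact form is not from the answer above, it is a swapped-in alternative that should behave the same on these samples.

```python
import re

# The four-case pattern from the answer above.
four_cases = re.compile(r'(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$')

# Alternative spelling (an assumption, not from the answer):
# "not preceded by a non-space and not followed by a non-space".
compact = re.compile(r'(?<!\S)m(?!\S)')

samples = ["I'm a boy", "I m a boy", "mama", "pm", "m", "m and m"]
results = {}
for text in samples:
    results[text] = four_cases.findall(text)
    # Both patterns should agree on every sample.
    assert results[text] == compact.findall(text)

print(results["mama"])     # []
print(results["m and m"])  # ['m', 'm']
```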
What is a word boundary in a Python regex? Can someone please explain this on these examples:
Example 1
>>> x = '456one two three123'
>>> y=re.search(r"\btwo\b",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>
Example 2
>>> y=re.search(r"two",x)
>>> y
<_sre.SRE_Match object at 0x2aaaaab47d30>
Example 3
>>> ip="192.168.254.1234"
>>> if re.search(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",ip):
...     print(ip)
...
Example 4
>>> ip="192.168.254.1234"
>>> if re.search(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",ip):
...     print(ip)
192.168.254.1234
"word boundary" means exactly what it says: the boundary of a word, i.e. either the beginning or the end.
It does not match any actual character in the input, but it will only match if the current match position is at the beginning or end of the word.
This is important because, unlike a pattern that matches whitespace, it also matches at the beginning or end of the entire input.
So '\bfoo' will match 'foobar' and 'foo bar' and 'bar foo', but not 'barfoo'.
'foo\b' will match 'foo bar' and 'bar foo' and 'barfoo', but not 'foobar'.
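The two claims above can be verified directly. This is a small sketch using the same four sample strings; the list names are illustrative.

```python
import re

samples = ['foobar', 'foo bar', 'bar foo', 'barfoo']

# \bfoo: "foo" must start at a word boundary.
starts_word = [s for s in samples if re.search(r'\bfoo', s)]

# foo\b: "foo" must end at a word boundary.
ends_word = [s for s in samples if re.search(r'foo\b', s)]

print(starts_word)  # ['foobar', 'foo bar', 'bar foo']
print(ends_word)    # ['foo bar', 'bar foo', 'barfoo']
```

The patterns are written as raw strings so that \b reaches the regex engine as a word boundary.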
Try this:
ip="192.168.254.1234"
res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}",ip)
print(res)
Notice how I correctly escaped the dots.
The ip is found because the regex doesn't care what comes after the last 1-3 digits.
Now:
ip="192.168.254.1234"
res = re.findall(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",ip)
print(res)
This will not work, since the last 1-3 digits are NOT ENDING AT A BOUNDARY.
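Putting the variants side by side shows what each one returns. This is a sketch; the name octets is made up for illustration, and the last line covers a common pitfall: without the r prefix, "\b" in a Python string literal is a backspace character rather than a word boundary.

```python
import re

ip = "192.168.254.1234"
octets = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}"

# No boundary: the regex stops after three digits of the last group.
unanchored = re.findall(octets, ip)

# With \b on both sides, "1234" can never end on a boundary.
anchored = re.findall(r"\b" + octets + r"\b", ip)

# A well-formed address still matches with the boundaries in place.
valid = re.findall(r"\b" + octets + r"\b", "192.168.254.123")

print(unanchored)  # ['192.168.254.123']
print(anchored)    # []
print(valid)       # ['192.168.254.123']

# Non-raw "\b" is the backspace character (0x08).
print("\b" == "\x08")  # True
```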