I have the following string, Hello, season 2 (VSF) and I need to parse "2" out of it. Here is what I'm trying:
s = 'Hello, season 2 (VSF)'
re.findall('Season|Saison|Staffel[\s]+\d',s)
>>> ["Season"]
How would I get "Season 2" here?
The alternation Season|Saison|Staffel should be grouped; otherwise [\s]+\d binds only to Staffel. Also specify the re.IGNORECASE (or re.I) flag to match case-insensitively.
>>> import re
>>> s = 'Hello, season 2 (VSF)'
>>> re.findall(r'(?:Season|Saison|Staffel)\s+\d+', s, flags=re.IGNORECASE)
['season 2']
>>> re.findall(r'(?:Season|Saison|Staffel)\s+\d+', s) # without re.I
[]
Use a non-capturing group. Otherwise the pattern contains a capturing group, and re.findall returns a list of the captured groups instead of the full matched strings.
>>> re.findall(r'(Season|Saison|Staffel)\s+\d+', s, flags=re.IGNORECASE)
['season']
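Since the original goal was to pull out just "2", one possible variant (a sketch, not part of the answer above) keeps the alternation non-capturing and wraps only the digits in a capturing group, so that re.findall returns just the number. The name season_numbers is made up for illustration.

```python
import re

s = 'Hello, season 2 (VSF)'

# Only the digits are captured; the keyword alternation stays non-capturing,
# so findall returns the captured digits rather than the whole match.
season_numbers = re.findall(
    r'(?:Season|Saison|Staffel)\s+(\d+)', s, flags=re.IGNORECASE
)
print(season_numbers)  # ['2']
```

The result is still a list of strings; call int() on an element if a number is needed.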
The source string is:
# Python 3.4.3
s = r'abc123d, hello 3.1415926, this is my book'
and here is my pattern:
pattern = r'-?[0-9]+(\\.[0-9]*)?|-?\\.[0-9]+'
However, re.search gives me the correct result:
m = re.search(pattern, s)
print(m) # output: <_sre.SRE_Match object; span=(3, 6), match='123'>
but re.findall just returns a list of empty strings:
L = re.findall(pattern, s)
print(L) # output: ['', '', '']
Why can't re.findall give me the expected list:
['123', '3.1415926']
There are two things to note here:
re.findall returns the captured texts if the regex pattern contains capturing groups.
The r'\\.' part of your pattern matches two consecutive characters: a literal \ followed by any character other than a newline.
See findall reference:
If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
Note that to make re.findall return just match values, you may usually
remove redundant capturing groups (e.g. (a(b)c) -> abc)
convert all capturing groups into non-capturing (that is, replace ( with (?:) unless there are backreferences that refer to the group values in the pattern (then see below)
use re.finditer instead ([x.group() for x in re.finditer(pattern, s)])
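As a small illustration of the last option, here is a sketch with a pattern whose group cannot be made non-capturing because a backreference depends on it. The sample string and the variable names are assumptions chosen for the example.

```python
import re

s = 'aabbcd'
pattern = r'(\w)\1'  # the group is required by the backreference \1

# findall reports only what group 1 captured for each match.
captured = re.findall(pattern, s)

# finditer yields match objects, so group() gives the whole match.
whole = [m.group() for m in re.finditer(pattern, s)]

print(captured)  # ['a', 'b']
print(whole)     # ['aa', 'bb']
```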
In your case, findall returned only the captured texts, and they were all empty because the \\ inside the r'' string literal tries to match a literal \, which never occurs in the input, so the optional group never participates in a match.
To match the numbers, you need to use
-?\d*\.?\d+
The regex matches:
-? - Optional minus sign
\d* - Optional digits
\.? - Optional decimal separator
\d+ - 1 or more digits.
Here is IDEONE demo:
import re
s = r'abc123d, hello 3.1415926, this is my book'
pattern = r'-?\d*\.?\d+'
L = re.findall(pattern, s)
print(L)
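The two patterns that appear in this thread are not fully equivalent. The following sketch compares them on a made-up string (text, compact, and alternation are illustrative names) to show where they agree and where they differ.

```python
import re

text = 'x -.5 y 42 z 7. end'

compact = r'-?\d*\.?\d+'                          # pattern from the answer above
alternation = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'  # the asker's pattern, fixed

compact_hits = re.findall(compact, text)
alternation_hits = re.findall(alternation, text)

print(compact_hits)      # ['-.5', '42', '7']
print(alternation_hits)  # ['-.5', '42', '7.']
```

Both handle a leading minus sign and a leading decimal point; only the alternation form keeps a trailing decimal point such as the one in "7.".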
import re
s = r'abc123d, hello 3.1415926, this is my book'
print(re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))
You don't need to double the backslashes when you use a raw string.
Output: ['123', '3.1415926']
The return value is a list of strings. If you want integers and floats instead, use map:
import re, ast
s = r'abc123d, hello 3.1415926, this is my book'
print(list(map(ast.literal_eval, re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s))))
Output: [123, 3.1415926]
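If you would rather not evaluate matched text as Python literals, a possible alternative to ast.literal_eval is a small conversion helper. This is only a sketch; to_number is a hypothetical name, not something from the answer above.

```python
import re


def to_number(token):
    """Convert a matched token to int when possible, otherwise to float."""
    try:
        return int(token)
    except ValueError:
        return float(token)


s = 'abc123d, hello 3.1415926, this is my book'
tokens = re.findall(r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+', s)
numbers = [to_number(t) for t in tokens]
print(numbers)  # [123, 3.1415926]
```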
To explain why search appeared to return what you wanted while findall did not:
search returns an SRE_Match object that holds information such as:
string: the string that was passed to the search function.
re: the regex object used in the search.
groups(): a tuple of the strings captured by the capturing groups in the regex.
group(index): the string captured by the group with the given index (index > 0).
group(0): the whole string matched by the regex.
search stops at the first match, builds the SRE_Match object, and returns it. Check this code:
import re
s = r'abc123d'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.search(pattern, s)
print(m.string) # 'abc123d'
print(m.group(0)) # REGEX matched 123
print(m.groups()) # the only group, (\.[0-9]*), did not participate in the match, which is why this returns (None,)
s = ', hello 3.1415926, this is my book'
m2 = re.search(pattern, s)
print(m2.string) # ', hello 3.1415926, this is my book'
print(m2.group(0)) # the regex matched '3.1415926'
print(m2.groups()) # the group captured this part: ('.1415926',)
findall behaves differently: it does not stop at the first match but keeps extracting matches until the end of the text. However, if the regex contains at least one capturing group, findall returns the strings captured by the groups instead of the whole matched strings:
import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m) # ['', '.1415926']
The first element comes from the first match, '123': the capturing group did not participate there, so findall reports ''. The second element comes from the second match, '3.1415926', where the capturing group matched '.1415926'.
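To see both pieces of information side by side, here is a sketch that walks the same matches with re.finditer and records the whole match next to the group contents. The names rows and found are illustrative.

```python
import re

s = 'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(\.[0-9]*)?|-?\.[0-9]+'

# For every match: (whole match, contents of group 1).
rows = [(m.group(0), m.group(1)) for m in re.finditer(pattern, s)]
print(rows)  # [('123', None), ('3.1415926', '.1415926')]

# findall keeps only the group column and reports '' for an unmatched group.
found = re.findall(pattern, s)
print(found)  # ['', '.1415926']
```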
If you want findall to return the matched strings, turn every capturing group () in your regex into a non-capturing group (?:):
import re
s = r'abc123d , hello 3.1415926, this is my book'
pattern = r'-?[0-9]+(?:\.[0-9]*)?|-?\.[0-9]+'
m = re.findall(pattern, s)
print(m) # ['123', '3.1415926']
In Python, suppose I want to search the string
"123"
for occurrences of the pattern
"abc|1.*|def|.23".
I would currently do this as follows:
import re
re.match("abc|1.*|def|.23", "123")
The above returns a match object from which I can retrieve the starting and ending indices of the match in the string, which in this case would be 0 and 3.
My question is: how can I retrieve the particular word(s) in the regular expression that matched "123"?
In other words, I would like to get "1.*" and ".23". Is this possible?
If the alternatives in your pattern are always separated by a common separator (in this case "|"), you can try:
import re
pattern = "abc|1.*|def|.23"
matches = [alt for alt in pattern.split("|") if re.match(alt, "123")]
print(matches)
output:
['1.*', '.23']
Another approach would be to create one capture group for each token in the alternation:
import re
s = 'def'
rgx = r'\b(?:(abc)|(1.*)|(def)|(.23))\b'
m = re.match(rgx, s)
print(m.group(0)) #=> def
print(m.group(1)) #=> None
print(m.group(2)) #=> None
print(m.group(3)) #=> def
print(m.group(4)) #=> None
This example shows the match is 'def' and was matched by the 3rd capture group,(def).
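Note that alternation is ordered, so a single combined regex reports only the first alternative that succeeds. The sketch below builds the grouped pattern from a list and uses the match object's lastindex attribute to look up that alternative; it assumes the alternatives contain no capturing groups of their own, and the names alternatives, first_hit, and all_hits are illustrative.

```python
import re

alternatives = ['abc', '1.*', 'def', '.23']

# One capturing group per alternative: (abc)|(1.*)|(def)|(.23)
combined = re.compile('|'.join('({})'.format(alt) for alt in alternatives))

m = combined.match('123')
# lastindex is the index of the last group that took part in the match.
first_hit = alternatives[m.lastindex - 1]
print(first_hit)  # 1.*

# Testing each alternative separately is needed to list all of them.
all_hits = [alt for alt in alternatives if re.match(alt, '123')]
print(all_hits)  # ['1.*', '.23']
```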
I need a regex for a color that can be described like this:
starts with #
followed by three identical pairs of hex characters (0-9, a-f, A-F); aA and Aa count as the same pair
Right now I have #(([0-9A-Fa-f]){2}){3}
How can I write a regexp that requires the SAME pair of hex characters each time?
Some examples of the matching strings:
"#FFFFFF",
"#000000",
"#aAAaaA",
"#050505",
"###93#0b0B0b1B34"
Strings like "#000100" shouldn't match
With the re.search() function:
import re
s = '#aAAaaA'
match = re.search(r'#([0-9a-f]{2})\1\1', s, re.I)
result = match.group() if match else None
print(result)
\1 - points to the 1st parenthesized group (...)
re.I - IGNORECASE regex flag
You may use the following regex with a capturing group and a backreference:
#([0-9A-Fa-f]{2})\1{2}
Details
# - a #
([0-9A-Fa-f]{2}) - Group 1: 2 hex chars
\1{2} - 2 consecutive occurrences of the same value as captured in Group 1.
NOTE: the case insensitive flag is required to make the \1 backreference match Group 1 contents in a case insensitive way. Bear in mind we need to use a raw string literal to define the regex to avoid overescaping the backreferences.
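The effect of the flag on the backreference can be checked directly. This is a minimal sketch using one of the sample strings from the question; the variable names are illustrative.

```python
import re

pattern = r'#([0-9A-Fa-f]{2})\1{2}'
s = '#aAAaaA'

# Without the flag, \1 must repeat 'aA' exactly, but the next pair is 'Aa'.
strict = re.search(pattern, s)
print(strict)  # None

# With re.I, the backreference compares case-insensitively.
relaxed = re.search(pattern, s, flags=re.I)
print(relaxed.group())  # #aAAaaA
```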
See the Python demo:
import re
strs = ["#FFFFFF","#000000","#aAAaaA","#050505","###93#0b0B0b1B34", "#000100"]
for s in strs:
    m = re.search(r'#([0-9A-Fa-f]{2})\1{2}', s, flags=re.I)
    if m:
        print("{} MATCHED".format(s))
    else:
        print("{} DID NOT MATCH".format(s))
Results:
#FFFFFF MATCHED
#000000 MATCHED
#aAAaaA MATCHED
#050505 MATCHED
###93#0b0B0b1B34 MATCHED
#000100 DID NOT MATCH