Python Regular expression not returning as expected - python

I am having trouble understanding the output of this regular expression. I am using the following regex to find dates in text:
^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$
It appears to be matching the pattern within text correctly, but I'm confused by the return values.
For this test string:
TestString = "10-20-2015"
It's returning this:
[('10', '20', '', '')]
If I put () around the entire regex, I get this returned:
[('10-20-2015', '10', '20', '', '')]
I would expect it to simply return the full date string, but it appears to be breaking the results up and I don't understand why. Wrapping my regex in () returns the full date string, but it also returns 4 extra values.
How do I make this ONLY match the full date string and not small parts of the string?
from my console:
Python 3.4.2 (default, Oct 8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> TestString = "10-20-2015"
>>> re.findall(pattern, TestString, re.I)
[('10', '20', '', '')]
>>> pattern = "(^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$)"
>>> re.findall(pattern, TestString, re.I)
[('10-20-2015', '10', '20', '', '')]
>>>
>>> TestString = "10--2015"
>>> re.findall(pattern, TestString, re.I)
[]
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> re.findall(pattern, TestString, re.I)
[]
Based on the response, here was my answer: ((?:(?:1[0-2]|0[1-9])-(?:3[01]|[12][0-9]|0[1-9])|(?:3[01]|[12][0-9]|0[1-9])-(?:1[0-2]|0[1-9]))-(?:[0-9]{2})?[0-9]{2})

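Every () is a capturing group: (1[0-2]|0?[1-9]) captures 10, (3[01]|[12][0-9]|0?[1-9]) captures 20, and so on. When a pattern contains capturing groups, re.findall returns the groups rather than the whole match; the two empty strings come from the groups in the second alternative, which did not take part in the match. When you wrap everything in (), that outer group opens before the others, so it becomes group 1 and captures the whole date. To group without capturing, use a non-capturing group: (?:...) instead of (...).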
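As a minimal sketch of that difference, the snippet below runs the pattern from the question next to a copy in which every group is non-capturing. The variable names capturing and non_capturing are illustrative and do not come from the question.

```python
import re

# Pattern from the question: four capturing groups inside the alternation.
capturing = r"^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"

# Same pattern with every group rewritten as non-capturing, (?:...).
non_capturing = r"^(?:(?:1[0-2]|0?[1-9])-(?:3[01]|[12][0-9]|0?[1-9])|(?:3[01]|[12][0-9]|0?[1-9])-(?:1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"

text = "10-20-2015"

with_groups = re.findall(capturing, text)
without_groups = re.findall(non_capturing, text)

print(with_groups)     # [('10', '20', '', '')]
print(without_groups)  # ['10-20-2015']
```

With no capturing groups left in the pattern, findall falls back to returning the whole match as a plain string.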

We can do that with one of the most commonly used re functions, search(). This function scans through a string, looking for the first location where the RE matches.
import re
text = "10-20-2015"
# Raw string, so the backslashes reach the regex engine unchanged.
date_regex = r'(\d{1,2})-(\d{1,2})-(\d{4})'
# \d in the pattern above stands for a digit character [0-9].
# The numbers in curly brackets {} give the number of digits permitted.
# Parentheses/round brackets create capturing groups so that we can treat
# multiple characters as a single unit.
search_date = re.search(date_regex, text)
# for entire match
print(search_date.group())
# also print(search_date.group(0)) can be used
# for the first parenthesized subgroup
print(search_date.group(1))
# for the second parenthesized subgroup
print(search_date.group(2))
# for the third parenthesized subgroup
print(search_date.group(3))
# for a tuple of all matched subgroups
print(search_date.group(1, 2, 3))
Output for each of the print statement mentioned above:
10-20-2015
10
20
2015
('10', '20', '2015')
Hope this answer clears your doubt :-)

Related

Python regular expression \W: with vs without parenthesis

Below is a quick demo that uses \W to match non-word characters and split a given string. Why do the results differ with and without parentheses?
>>> s = "abc:def:ghi"
>>> p = r"(\W+)"
>>> q = r"\W+"
>>> import re
>>> re.split(p, s, flags=re.UNICODE)
['abc', ':', 'def', ':', 'ghi']
>>> re.split(q, s, flags=re.UNICODE)
['abc', 'def', 'ghi']
From the re module documentation:
If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
For reference, wrapping parts of a regular expression in parentheses creates a capturing group. These are groups of the pattern that can later be referenced as individual entities.
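If the parentheses are only needed for grouping, a non-capturing group keeps the separators out of the result. The short sketch below assumes the same input string as the demo; the names kept and dropped are illustrative.

```python
import re

s = "abc:def:ghi"

# Capturing group: the matched separators are included in the result.
kept = re.split(r"(\W+)", s)

# Non-capturing group: groups the pattern without keeping the separators.
dropped = re.split(r"(?:\W+)", s)

print(kept)     # ['abc', ':', 'def', ':', 'ghi']
print(dropped)  # ['abc', 'def', 'ghi']
```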

Using Python and Regex to extract different formats of dates

I have the following code to match the dates
import re
date_reg_exp2 = re.compile(r'\d{2}([-/.])(\d{2}|[a-zA-Z]{3})\1(\d{4}|\d{2})|\w{3}\s\d{2}[,.]\s\d{4}')
matches_list = date_reg_exp2.findall("23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015")
print(matches_list)
The output I expect is
["23-SEP-2015","23-09-2015","23-09-15","Sep 23, 2015"]
What I am getting is:
[('-', 'SEP', '2015'), ('-', '09', '2015'), ('-', '09', '15'), ('', '', '')]
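The problem is that when the pattern contains capturing groups, re.findall returns only the captured texts and leaves out Group 0 (the whole match). That is also why the last tuple is ('', '', ''): the second alternative matched, but none of the groups took part in it. Since you need the whole match (Group 0), use re.finditer and grab the group() value: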
matches_list = [x.group() for x in date_reg_exp2.finditer("23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015")]
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings... If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
re.finditer(pattern, string, flags=0)
Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string.
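The documented behaviour can be checked with a small sketch. The sample text and variable names below are illustrative assumptions, not part of the question.

```python
import re

text = "23-09-2015 and 24-10-2016"

# No groups: findall returns the whole matches as strings.
no_groups = re.findall(r"\d{2}-\d{2}-\d{4}", text)

# One group: findall returns only that group's text.
one_group = re.findall(r"\d{2}-\d{2}-(\d{4})", text)

# Several groups: findall returns a list of tuples, one item per group.
many_groups = re.findall(r"(\d{2})-(\d{2})-(\d{4})", text)

# finditer yields match objects, so group() still gives the whole match.
whole = [m.group() for m in re.finditer(r"(\d{2})-(\d{2})-(\d{4})", text)]

print(no_groups)    # ['23-09-2015', '24-10-2016']
print(one_group)    # ['2015', '2016']
print(many_groups)  # [('23', '09', '2015'), ('24', '10', '2016')]
print(whole)        # ['23-09-2015', '24-10-2016']
```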
You could try this regex
date_reg_exp2 = re.compile(r'(\d{2}(/|-|\.)\w{3}(/|-|\.)\d{4})|([a-zA-Z]{3}\s\d{2}(,|-|\.|,)?\s\d{4})|(\d{2}(/|-|\.)\d{2}(/|-|\.)\d+)')
Then use re.finditer():
for m in re.finditer(date_reg_exp2,"23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015"):
    print(m.group())
The Output will be
23-SEP-2015
23-09-2015
23-09-15
Sep 23, 2015
Try this:
# The first alternative, (\d{2}-([A-Z]{3}|\d{2})-(\d{4}|\d{2})), matches the first three date formats;
# the second alternative matches the last format.
dates = "23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015"
for x in re.finditer(r'((\d{2}-([A-Z]{3}|\d{2})-(\d{4}|\d{2}))|([a-zA-Z]{3}\s\d{1,2},\s\d{4}))', dates):
    print(x.group(1))

python regex finditer

I have a question about re. I tried to find the answer in the re documentation, but I think I am too much of a newbie for this.
I have a string like this:
string = "id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2"
I want to retrieve every value after '=', so I used
re.finditer(r"=[\w]*", string)
My result was as follows:
186
0
(empty) <-- this should be [cspacer0]---BlaBla---
2
How should my pattern look to get the channel_name as well?
The \w token only matches word characters (letters, digits, and underscore), so the match stops at the [. To allow other characters I would use \S (any non-whitespace character) instead. Also, instead of finditer you can use findall with a capturing group for this task:
>>> import re
>>> s = 'id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'=(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
EDIT
The original string looks like this. I want to get everything that follows an =, but skip =ok and idx=0:
>>> s = 'error idx=0 msg=ok id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'
>>> re.findall(r'(?<!idx)=(?!ok)(\S+)', s)
['186', '0', '[cspacer0]---BlaBla---', '2']
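An alternative to the lookarounds, sketched here under the assumption that the keys themselves are available for filtering, is to capture key and value together and filter in Python. This is a different technique from the lookbehind/lookahead pattern above, and it is stricter in one respect: it drops only values exactly equal to ok, whereas (?!ok) rejects any value that starts with ok. The names pairs and values are illustrative.

```python
import re

s = 'error idx=0 msg=ok id=186 s_id=0 channel_name=[cspacer0]---BlaBla--- number=2'

# Capture each key (\w+) and its value (\S+) as a (key, value) tuple.
pairs = re.findall(r'(\w+)=(\S+)', s)

# Illustrative filter: skip the idx key and any value that is exactly "ok".
values = [value for key, value in pairs if key != 'idx' and value != 'ok']

print(values)  # ['186', '0', '[cspacer0]---BlaBla---', '2']
```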

python regular expression, extracting set of numbers from a string

How can I get the numbers 24 and 200 from the string "Size:24 Resp_code:200"
using re in Python? I have tried \d+, but then I only get 24.
I have also tried this:
import re
string2 = " Size:24 Resp_code:200"
regx = "(\d+) Resp_code:(\d+)"
print(re.search(regx, string2).group(0))
print(re.search(regx, string2).group(1))
Here the output is:
24 Resp_code:200
24
Any advice on how to solve this?
Thanks in advance.
Group 0 contains the whole matched string. Extract group 1 and group 2 instead:
>>> string2 = " Size:24 Resp_code:200"
>>> regx = r"(\d+) Resp_code:(\d+)"
>>> match = re.search(regx, string2)
>>> match.group(1), match.group(2)
('24', '200')
>>> match.groups() # to get all groups from 1 to however many groups
('24', '200')
or using re.findall:
>>> re.findall(r'\d+', string2)
['24', '200']
Use:
print(re.search(regx, string2).group(1))  # 24
print(re.search(regx, string2).group(2))  # 200
group(0) returns the whole string matched by your regex, whereas group(1) is the first captured group and group(2) is the second.
Check the doc:
If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)
You are doing it right, but capturing groups are numbered from 1, not 0; group(0) returns the whole match:
>>> re.search(regx, string2).group(1,2)
('24', '200')
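Putting the answers above together, here is a minimal sketch of a helper that pulls both numbers out and also handles the case where nothing matches, since re.search returns None then. The function name extract_size_and_code and the slightly stricter pattern (it also requires the Size: prefix) are assumptions for illustration.

```python
import re

def extract_size_and_code(line):
    """Return (size, resp_code) as ints, or None if the line does not match."""
    match = re.search(r"Size:(\d+) Resp_code:(\d+)", line)
    if match is None:
        # re.search returns None when there is no match anywhere in the line.
        return None
    size, code = match.groups()  # groups() returns groups 1..n, not group 0
    return int(size), int(code)

print(extract_size_and_code(" Size:24 Resp_code:200"))  # (24, 200)
print(extract_size_and_code("no numbers here"))         # None
```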

python return matching and non-matching patterns of string

I would like to split a string into parts that match a regexp pattern and parts that do not match into a list.
For example
import re
string = 'my_file_10'
pattern = r'\d+$'
# I know the matching pattern can be obtained with :
m = re.search(pattern, string).group()
print(m)
# prints 10
# The final result should be as following
['my_file_', '10']
Put parentheses around the pattern to make it a capturing group, then use re.split() to produce a list of matching and non-matching elements:
pattern = r'(\d+$)'
re.split(pattern, string)
Demo:
>>> import re
>>> string = 'my_file_10'
>>> pattern = r'(\d+$)'
>>> re.split(pattern, string)
['my_file_', '10', '']
Because you are splitting on digits at the end of the string, an empty string is included.
If you only ever expect one match, at the end of the string (which the $ in your pattern forces here), then just use the match.start() method to obtain an index to slice the input string:
pattern = r'\d+$'
match = re.search(pattern, string)
not_matched, matched = string[:match.start()], match.group()
This returns:
>>> pattern = r'\d+$'
>>> match = re.search(pattern, string)
>>> string[:match.start()], match.group()
('my_file_', '10')
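The slicing approach can be wrapped in a small function that also copes with input that has no trailing digits. This is a sketch only; the name split_trailing_digits and the choice to return a one-element list when nothing matches are assumptions, not part of the question.

```python
import re

def split_trailing_digits(name):
    """Split name into [prefix, digits]; return [name] if it has no trailing digits."""
    match = re.search(r"\d+$", name)
    if match is None:
        return [name]
    # match.start() is the index where the trailing digits begin.
    return [name[:match.start()], match.group()]

print(split_trailing_digits("my_file_10"))  # ['my_file_', '10']
print(split_trailing_digits("my_file"))     # ['my_file']
```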
You can use re.split to build the list of matching and non-matching parts, then use filter(None, ...) to drop every element that is considered false (here, the empty strings). In Python 3, filter returns an iterator, so wrap it in list():
>>> import re
>>> list(filter(None, re.split(r'(\d+$)', 'my_file_015_01')))
['my_file_015_', '01']
