Using Python and Regex to extract different formats of dates

I have the following code to match the dates
import re
date_reg_exp2 = re.compile(r'\d{2}([-/.])(\d{2}|[a-zA-Z]{3})\1(\d{4}|\d{2})|\w{3}\s\d{2}[,.]\s\d{4}')
matches_list = date_reg_exp2.findall("23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015")
print(matches_list)
The output I expect is
["23-SEP-2015","23-09-2015","23-09-15","Sep 23, 2015"]
What I am getting is:
[('-', 'SEP', '2015'), ('-', '09', '2015'), ('-', '09', '15'), ('', '', '')]

The problem is that re.findall returns only the captured groups when the pattern contains capturing groups; it leaves out Group 0 (the whole match). Since you need the whole match, use re.finditer and take the group() value of each match:
matches_list = [x.group() for x in date_reg_exp2.finditer("23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015")]
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings... If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
re.finditer(pattern, string, flags=0)
Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string.
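To make the difference in the quoted documentation concrete, here is a minimal sketch that runs the question's own pattern and sample string through both functions. The names groups_only and whole_matches are illustrative and do not come from the original post.

```python
import re

text = "23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015"

# Same pattern as in the question, split over two lines for readability.
date_re = re.compile(
    r'\d{2}([-/.])(\d{2}|[a-zA-Z]{3})\1(\d{4}|\d{2})'
    r'|\w{3}\s\d{2}[,.]\s\d{4}'
)

# findall: the pattern has three capturing groups, so each result is a
# tuple of those groups, not the matched text.
groups_only = date_re.findall(text)

# finditer: each result is a match object; group(0) is the whole match.
whole_matches = [m.group(0) for m in date_re.finditer(text)]

print(groups_only)
print(whole_matches)
```

The last tuple from findall is three empty strings because the second alternative of the pattern contains no capturing groups at all.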

You could try this regex:
date_reg_exp2 = re.compile(r'(\d{2}(/|-|\.)\w{3}(/|-|\.)\d{4})|([a-zA-Z]{3}\s\d{2}(,|-|\.)?\s\d{4})|(\d{2}(/|-|\.)\d{2}(/|-|\.)\d+)')
Then use re.finditer():
for m in re.finditer(date_reg_exp2, "23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015"):
    print(m.group())
The output will be:
23-SEP-2015
23-09-2015
23-09-15
Sep 23, 2015

Try this:
# The first (\d{2}-([A-Z]{3}|\d{2})-(\d{4}|\d{2})) group tries to match the first three types of dates
# The rest will match the last type
dates = "23-SEP-2015 and 23-09-2015 and 23-09-15 and Sep 23, 2015"
for x in re.finditer(r'((\d{2}-([A-Z]{3}|\d{2})-(\d{4}|\d{2}))|([a-zA-Z]{3}\s\d{1,2},\s\d{4}))', dates):
    print(x.group(1))

Related

python regex: string search for a date

I am searching for a specific string within a document that will have known words before and after a date, and I want to extract the date. For example, if the substring is "dated as of 29 Jan 2017 to the schedule", I want to extract "29 Jan 2017".
My code is:
m = re.search(r'dated as of \w+\s+(.+?)+to the schedule', text, re.IGNORECASE)
if m:
    items["date"] = m.group(1)
But - this just gives me "Jan 2017" - it misses the day.
I have tried various variations on the regex search string, but still can't get the day. Any thoughts?
Your capturing group (the parentheses) does not enclose the first part of the date, which is matched by \w+.
Use a capturing group for the whole date and turn your current parentheses into a non-capturing group:
r'dated as of (\w+\s+(?:.+?)+) to the schedule'
Here one simple group, with no repetition, encloses both \w+ and your previous parentheses.
Your previous parentheses became a non-capturing group by adding ?: just inside them.
Better yet, your existing parentheses and the combination of +? and + do not accomplish anything, so you can remove them:
r'dated as of (\w+\s+.+) to the schedule'
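As a variation on the patterns above, here is a hedged sketch that captures whatever sits between the two known phrases with a single lazy group. The longer sample sentence and the name extracted are assumptions made for illustration; only the phrase "dated as of 29 Jan 2017 to the schedule" comes from the question.

```python
import re

# Hypothetical surrounding text; only the middle phrase is from the question.
text = "This amendment, dated as of 29 Jan 2017 to the schedule, replaces it."

# The lazy .+? stops at the first " to the schedule" that follows, so the
# group holds only the text between the two known phrases.
m = re.search(r'dated as of (.+?) to the schedule', text, re.IGNORECASE)

extracted = m.group(1) if m else None
print(extracted)
```

The lazy quantifier matters only when "to the schedule" can occur more than once after the date; a greedy .+ would then run to the last occurrence.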
"re" module included with Python primarily used for string searching and manipulation
\w = any word character (letters, digits, and "_")
\d = any digit
+ = one or more of the preceding item
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
import re
data = "dated as of 29 Jan 2017 to the schedule"
match = re.findall(r'\d+ \w+ \d{4}', data)
print (match[0])
output:
29 Jan 2017
This works fine:
text = "dated as of 29 Jan 2017"
m = re.search(r'\d\d\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\d{4}',
              text, re.IGNORECASE)
if m:
    print(m.group(0))
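The pattern above requires a two-digit day. A small sketch of a looser variant, assuming single-digit days should also be accepted, uses \d{1,2} plus word boundaries; the sample text and the name found are made up for illustration.

```python
import re

# Hypothetical sample with a one-digit day and a lowercase month.
text = "dated as of 9 Jan 2017 and revised 29 feb 2016"

month = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'

# The month alternation is non-capturing, so findall returns whole matches.
date_re = re.compile(r'\b\d{1,2}\s' + month + r'\s\d{4}\b', re.IGNORECASE)

found = date_re.findall(text)
print(found)
```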

Regex doesn't filter out the right text on datetime

I have a string below:
senton = "Sent: Friday, June 18, 2010 12:57 PM"
I created a regex to filter out the datetime portion:
reg_datetime = "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), (January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4} \d{2}:\d{2} (AM|PM)"
I tested the regex in regex101.com and it works as expected, however, when running it in my python test script, it fails to give me the right text, can anyone help me fix it?
Using it this way:
real_senton = re.findall(reg_datetime, senton)
print(real_senton)
Produces this result:
[('Friday', 'June', 'PM')]
Thank you very much.
Function re.findall does the following:
Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
So, if there are groups, it returns the groups. A group is anything in the regular expression enclosed in parentheses.
solution 1
To get every item separately, put each one in parentheses:
reg_datetime = "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), "\
"(January|February|March|April|May|June|July|August|September|October|November|December)"\
" (\d{1,2}), (\d{4}) (\d{2}):(\d{2}) (AM|PM)"
Then re.findall(reg_datetime, senton) returns:
[('Friday', 'June', '18', '2010', '12', '57', 'PM')]
solution 2
Alternatively, put everything into one big group:
reg_datetime = "((Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), "\
"(January|February|March|April|May|June|July|August|September|October|November|December)"\
" \d{1,2}, \d{4} \d{2}:\d{2} (AM|PM))"
Now the big group is returned as well:
[('Friday, June 18, 2010 12:57 PM', 'Friday', 'June', 'PM')]
solution 3
Or change the existing groups into non-capturing groups (syntax (?:...)):
reg_datetime = "(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), "\
"(?:January|February|March|April|May|June|July|August|September|October|November|December)"\
" \d{1,2}, \d{4} \d{2}:\d{2} (?:AM|PM)"
Result:
['Friday, June 18, 2010 12:57 PM']
solution 4
Or don't use findall at all. Use re.search. It returns a Match object, which gives you more options. With the original reg_datetime it works this way:
>>> m = re.search(reg_datetime, senton)
>>> m.group(0)
'Friday, June 18, 2010 12:57 PM'
>>> m.group(1)
'Friday'
>>> m.group(2)
'June'
>>> m.group(3)
'PM'
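Building on solution 4, one further option is to name the groups so each field can be read by name. This is a sketch, not part of the original answer: the group names (weekday, month, day, and so on) are chosen here for illustration, while (?P<name>...) and Match.groupdict() are standard features of the re module.

```python
import re

senton = "Sent: Friday, June 18, 2010 12:57 PM"

# Same structure as the question's regex, with every field in a named group.
reg_datetime = re.compile(
    r"(?P<weekday>Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), "
    r"(?P<month>January|February|March|April|May|June|July|August|"
    r"September|October|November|December) "
    r"(?P<day>\d{1,2}), (?P<year>\d{4}) "
    r"(?P<hour>\d{2}):(?P<minute>\d{2}) (?P<ampm>AM|PM)"
)

m = reg_datetime.search(senton)
whole = m.group(0)      # the full datetime text
parts = m.groupdict()   # each field keyed by its group name

print(whole)
print(parts["weekday"], parts["day"], parts["ampm"])
```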
Without changing reg_datetime, use only re.search:
import re
senton = "Sent: Friday, June 18, 2010 12:57 PM"
reg_datetime = "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), (January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4} \d{2}:\d{2} (AM|PM)"
l = re.search(reg_datetime,senton,re.M|re.I)
print(l.group())
and run:
$ python file.py
Friday, June 18, 2010 12:57 PM
$
If you want the regex to return all of those values, you have to make sure that each one is in a separate group, like so:
reg_datetime = "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4}) (\d{2}):(\d{2}) (AM|PM)"
The problem is that the results returned to you are only the parts between '(' and ')', which are called groups.
Thus, your regex should look like this to return all the data:
reg_datetime = "(Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday), (January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4}) (\d{2}:\d{2}) (AM|PM)"
Or, if you want the whole date in one single string, enclose the entire regex in '(' and ')'.

Python Regular expression not returning as expected

I am having trouble understanding the output of this regular expression. I am using the following regex to find dates in text:
^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$
It appears to be matching the pattern within text correctly, but I'm confused by the return values.
For this test string:
TestString = "10-20-2015"
It's returning this:
[('10', '20', '', '')]
If I put () around the entire regex, I get this returned:
[('10-20-2015', '10', '20', '', '')]
I would expect it to simply return the full date string, but it appears to be breaking the results up and I don't understand why. Wrapping my regex in () returns the full date string, but it also returns 4 extra values.
How do I make this ONLY match the full date string and not small parts of the string?
from my console:
Python 3.4.2 (default, Oct 8 2014, 10:45:20)
[GCC 4.9.1] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> TestString = "10-20-2015"
>>> re.findall(pattern, TestString, re.I)
[('10', '20', '', '')]
>>> pattern = "(^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$)"
>>> re.findall(pattern, TestString, re.I)
[('10-20-2015', '10', '20', '', '')]
>>>
>>> TestString = "10--2015"
>>> re.findall(pattern, TestString, re.I)
[]
>>> pattern = "^(?:(1[0-2]|0?[1-9])-(3[01]|[12][0-9]|0?[1-9])|(3[01]|[12][0-9]|0?[1-9])-(1[0-2]|0?[1-9]))-(?:[0-9]{2})?[0-9]{2}$"
>>> re.findall(pattern, TestString, re.I)
[]
Based on the response, here was my answer: ((?:(?:1[0-2]|0[1-9])-(?:3[01]|[12][0-9]|0[1-9])|(?:3[01]|[12][0-9]|0[1-9])-(?:1[0-2]|0[1-9]))-(?:[0-9]{2})?[0-9]{2})
Every () is a capturing group: (1[0-2]|0?[1-9]) captures 10, (3[01]|[12][0-9]|0?[1-9]) captures 20, and so on. When you surrounded everything with (), that outer group opened before the others, so it became group 1 and captured the whole match. To group without capturing, use a non-capturing group, written (?:...) instead of (...).
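As a minimal sketch of that advice, the question's pattern is repeated below with every inner group made non-capturing. With no capturing groups left, re.findall returns the whole matched string. The names valid and invalid are illustrative.

```python
import re

# The question's pattern with each (...) rewritten as (?:...).
pattern = re.compile(
    r"^(?:(?:1[0-2]|0?[1-9])-(?:3[01]|[12][0-9]|0?[1-9])"
    r"|(?:3[01]|[12][0-9]|0?[1-9])-(?:1[0-2]|0?[1-9]))"
    r"-(?:[0-9]{2})?[0-9]{2}$"
)

valid = pattern.findall("10-20-2015")   # whole string, no tuples
invalid = pattern.findall("10--2015")   # no match at all

print(valid)
print(invalid)
```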
We can do that using one of the most important re functions - search(). This function scans through a string, looking for any location where this RE matches.
import re
text = "10-20-2015"
date_regex = r'(\d{1,2})-(\d{1,2})-(\d{4})'
# \d in the pattern above stands for a digit [0-9].
# The numbers in curly brackets {} give the number of digits permitted.
# Parentheses are used for capturing groups, so that we can treat
# multiple characters as a single unit.
search_date = re.search(date_regex, text)
# for entire match
print(search_date.group())
# also print(search_date.group(0)) can be used
# for the first parenthesized subgroup
print(search_date.group(1))
# for the second parenthesized subgroup
print(search_date.group(2))
# for the third parenthesized subgroup
print(search_date.group(3))
# for a tuple of all matched subgroups
print(search_date.group(1, 2, 3))
Output for each of the print statements mentioned above:
10-20-2015
10
20
2015
('10', '20', '2015')
Hope this answer clears your doubt :-)

python regular expression, extracting set of numbers from a string

How can I get the numbers 24 and 200 from the string "Size:24 Resp_code:200"
using re in Python? I have tried \d+, but then I only get 24.
In addition, I have also tried this:
import re
string2 = " Size:24 Resp_code:200"
regx = r"(\d+) Resp_code:(\d+)"
print(re.search(regx, string2).group(0))
print(re.search(regx, string2).group(1))
Here the output is:
24 Resp_code:200
24
Any advice on how to solve this?
Thanks in advance.
Group 0 contains the whole matched string. Extract group 1 and group 2 instead.
>>> string2 = " Size:24 Resp_code:200"
>>> regx = r"(\d+) Resp_code:(\d+)"
>>> match = re.search(regx, string2)
>>> match.group(1), match.group(2)
('24', '200')
>>> match.groups() # to get all groups from 1 to however many groups
('24', '200')
or using re.findall:
>>> re.findall(r'\d+', string2)
['24', '200']
Use:
print(re.search(regx, string2).group(1))  # 24
print(re.search(regx, string2).group(2))  # 200
group(0) returns the whole string matched by your regex, whereas group(1) is the first captured group and group(2) is the second.
Check the doc:
If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned)
You are doing it right, but capturing groups are numbered from 1, not 0; group(0) returns the whole match:
>>> re.search(regx, string2).group(1,2)
('24', '200')
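If the field names are wanted along with the numbers, a possible sketch is to capture each name:number pair with re.findall and build a dict. The pattern and the names pairs and fields are assumptions for illustration; they do not appear in the answers above.

```python
import re

string2 = " Size:24 Resp_code:200"

# Two capturing groups, so findall returns a list of (name, number) tuples.
# \w also matches "_", which is why "Resp_code" is captured whole.
pairs = re.findall(r'(\w+):(\d+)', string2)

fields = dict(pairs)

print(pairs)
print(fields["Size"], fields["Resp_code"])
```

The values are still strings; wrap them in int() if numbers are needed.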

python return matching and non-matching patterns of string

I would like to split a string into parts that match a regexp pattern and parts that do not match into a list.
For example
import re
string = 'my_file_10'
pattern = r'\d+$'
# I know the matching pattern can be obtained with :
m = re.search(pattern, string).group()
print(m)
# prints: 10
# The final result should be as following
['my_file_', '10']
Put parentheses around the pattern to make it a capturing group, then use re.split() to produce a list of matching and non-matching elements:
pattern = r'(\d+$)'
re.split(pattern, string)
Demo:
>>> import re
>>> string = 'my_file_10'
>>> pattern = r'(\d+$)'
>>> re.split(pattern, string)
['my_file_', '10', '']
Because you are splitting on digits at the end of the string, an empty string is included.
If you only ever expect one match, at the end of the string (which the $ in your pattern forces here), then just use the m.start() method to obtain an index to slice the input string:
pattern = r'\d+$'
match = re.search(pattern, string)
not_matched, matched = string[:match.start()], match.group()
This returns:
>>> pattern = r'\d+$'
>>> match = re.search(pattern, string)
>>> string[:match.start()], match.group()
('my_file_', '10')
You can use re.split to make a list of those separate parts and use filter, which removes all elements that are considered false (such as empty strings). In Python 3, filter returns an iterator, so wrap it in list():
>>> import re
>>> list(filter(None, re.split(r'(\d+$)', 'my_file_015_01')))
['my_file_015_', '01']
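The split-and-filter idea can be wrapped in a small helper. This is a sketch under the assumption that only trailing digits are of interest; the function name split_trailing_digits is hypothetical.

```python
import re

def split_trailing_digits(name):
    """Split name into [prefix, digits] if it ends in digits.

    The capturing group makes re.split keep the matched digits; the
    list comprehension drops the empty string left after the match.
    """
    return [part for part in re.split(r'(\d+$)', name) if part]

first = split_trailing_digits('my_file_10')
second = split_trailing_digits('my_file_015_01')
no_digits = split_trailing_digits('my_file')

print(first)
print(second)
print(no_digits)
```

When the input has no trailing digits, re.split finds nothing to split on and the helper returns the original string as the only element.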
