Extract Date from string using regex in Python 3.5.2 - python

I have this data extracted from Email body
Data=("""-------- Forwarded Message --------
Subject: Sample Report
Date: Thu, 6 Apr 2017 16:39:19 +0000
From: test1#abc.com
To: test2#xyz.com""")
I want to extract the day and month from the Date header and store them in variables.
The output I need is:
Date = 6
Month = "Apr"
Can anyone please help with this using regular expressions?

You can use this regex with multiline mode m:
^Date:[^,]+,\ (\d+) (\w+)
This will capture the date and the month in groups 1 and 2 respectively, so the match can easily be unpacked into two variables like so:
import re
date, month = re.search(r"^Date:[^,]+,\ (\d+) (\w+)", Data, re.MULTILINE).groups()
date = int(date)
print(date, month)
# output: 6 Apr
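If the header might be missing, a slightly more defensive variant could look like the sketch below. It is only an illustration: the helper name day_and_month and the named groups are my own choices, not part of the answer above, and it assumes the header keeps the "Date: Thu, 6 Apr 2017 ..." layout from the question.

```python
import re

# Hypothetical helper; assumes "Date: <weekday>, <day> <3-letter month> ...".
HEADER_RE = re.compile(
    r'^Date:\s*\w+,\s*(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]{3})',
    re.MULTILINE,
)

def day_and_month(text):
    """Return (day, month) from the Date header, or None if there is none."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return int(match.group('day')), match.group('month')

sample = """-------- Forwarded Message --------
Subject: Sample Report
Date: Thu, 6 Apr 2017 16:39:19 +0000
From: test1#abc.com
To: test2#xyz.com"""

result = day_and_month(sample)
print(result)
```

Returning None instead of raising means the caller can decide what to do with emails that have no Date header.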

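Adding to the solution of @Rakesh,
import re
from datetime import datetime
# Data is the string from the question; strip the spaces so one format string fits.
data1 = re.sub(' ', '', Data)
res = re.search(r'Date(.*)$', data1, re.MULTILINE).group()
res2 = datetime.strptime(res, 'Date:%a,%d%b%Y%X%z')
# res2.month is the month number; res2.strftime('%b') gives the abbreviated name.
print(res2.day, res2.month)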

You can use regex to extract the date
Ex:
import re
from dateutil.parser import parse
s = """-------- Forwarded Message --------
Subject: Sample Report
Date: Thu, 6 Apr 2017 16:39:19 +0000
From: test1#abc.com
To: test2#xyz.com"""
date = re.search("Date(.*)$", s, re.MULTILINE)
if date:
    date = date.group().replace("Date:", "").strip()
    d = parse(date)
    Date = d.day
    Month = d.strftime("%b")
    print(Date, Month)
Output:
6 Apr

Related

Split and Format Date Range

I am working on parsing a date range from an email in Zapier. Here is what comes in: Dec 4 - Jan 4, 2020. From this I need to separate the start and end dates into something like 12/04/2019 and 01/04/2020, accounting for the fact that some ranges start in the prior year, as in the example above, and some fall within the same year, for example Mar 4 - Mar 22, 2020. It seems the code to use in Zapier is Python. I have looked at examples for pandas:
import pandas as pd
date_series = pd.date_range(start='Mar 4' -, end='Mar 7, 2020')
print(date)
But keep getting errors.
Any suggestions would be much appreciated thanks
This is one way to do it:
import pandas as pd
def parse_email_range(date_string):
    dates = date_string.split(' - ')
    month_1 = pd.to_datetime(dates[0], format='%b %d').month
    month_2 = pd.to_datetime(dates[1]).month
    day_1 = pd.to_datetime(dates[0], format='%b %d').day
    day_2 = pd.to_datetime(dates[1]).day
    year_2 = pd.to_datetime(dates[1]).year
    # The start shares the end's year unless it would fall after the end date.
    year_1 = year_2 if (month_1 < month_2) or (month_1 == month_2 and day_1 < day_2) else year_2 - 1
    return '{}-{}-{}'.format(year_1, month_1, day_1), '{}-{}-{}'.format(year_2, month_2, day_2)
parse_email_range('Dec 4 - Jan 4, 2020')
## ('2019-12-4', '2020-1-4')
Split the two dates and record them into a single variable:
raw_dates = 'Dec 4 - Jan 4, 2020'.split(" - ")
dateutil package is capable of parsing most dates:
from dateutil.parser import parse
Parse and separate start and end date from the raw dates:
start_date, end_date = (parse(date) for date in raw_dates)
strftime is the method that could be used to format dates.
Store desired format in a variable (please note I have used day first format):
date_format = '%d/%m/%Y'
Convert the end date into the desired format:
print(end_date.strftime(date_format))
'04/01/2020'
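Convert the start date:
'Dec 4' carries no year, so dateutil fills in the current year. This example assumes it is run in 2020 and that the range crosses a year boundary, in which case dateutil's relativedelta function will help us to subtract one year from the start date: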
from dateutil.relativedelta import relativedelta
adjusted_start_date = start_date - relativedelta(years=1)
print(adjusted_start_date.strftime(date_format))
'04/12/2019'
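Neither answer above depends on it, but here is a minimal standard-library sketch that takes the year from the end date instead of the current date, and only rolls the start back when it would otherwise land after the end. The function name split_range is made up for illustration, and it assumes the input always looks like "Mon D - Mon D, YYYY" with English month abbreviations.

```python
from datetime import datetime

def split_range(text, out_format="%m/%d/%Y"):
    """Split 'Dec 4 - Jan 4, 2020' into formatted start and end dates."""
    start_raw, end_raw = text.split(" - ")
    end = datetime.strptime(end_raw, "%b %d, %Y")
    # Borrow the end date's year for the start, which has none of its own.
    start = datetime.strptime("{}, {}".format(start_raw, end.year), "%b %d, %Y")
    if start > end:
        # A start after the end means the range began in the previous year.
        # Limitation: a Feb 29 start is not handled by this sketch.
        start = start.replace(year=end.year - 1)
    return start.strftime(out_format), end.strftime(out_format)

print(split_range("Dec 4 - Jan 4, 2020"))
print(split_range("Mar 4 - Mar 22, 2020"))
```

The month-first format matches the 12/04/2019 style asked for in the question; pass a different out_format for day-first output.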

How to filter a list based on a substring in each element?

I have an inconsistent list of strings that contain dates in different formats. I need to determine the dates in each list.
My list/array looks like:
dates_list = []
my_array = [
'5364345354_01.05.2019.pdf',
'5364344354_ 01.05.2019.pdf',
'5345453454 - 21.06.2019.pdf',
'4675535643 - 19 June 2019.docx',
'57467874 25.06.18.pdf',
'6565653635_20 March 2019.txt',
'252252452_31.1.2019.txt'
]
I've tried a for loop and tried splitting the string, but each string has a different delimiter before the date. What's a plausible way to find the date in each string of this inconsistent list? The only thing that helps is that the dates are all positioned at the end of each string.
This is not the best way to do it, but it may solve your problem, and you can adapt it further:
dates_list = []
my_array = [
'5364345354_01.05.2019.pdf',
'5364344354_ 01.05.2019.pdf',
'5345453454 - 21.06.2019.pdf',
'4675535643 - 19 June 2019.docx',
'57467874 25.06.18.pdf',
'6565653635_20 March 2019.txt',
'252252452_31.1.2019.txt'
]
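import os
for i in my_array:
    # Drop the leading digits one at a time; stop at the first non-digit.
    for j in i:
        if j >= '0' and j <= '9':
            i = i.replace(j, "", 1)
        else:
            break
    # Remove the extension, the leftover delimiters, and surrounding spaces.
    print(os.path.splitext(i)[0].replace("_", "").replace("-", "").strip())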
output :
01.05.2019
01.05.2019
21.06.2019
19 June 2019
25.06.18
20 March 2019
31.1.2019
It is still unclear what you want to do with the dates, or whether you want them in some consistent format; all your question states is that you want to extract the date from the file name. You can do this with a regex based on your samples, which you say are the only 7 formats you have.
my_array = [
'5364345354_01.05.2019.pdf',
'5364344354_ 01.05.2019.pdf',
'5345453454 - 21.06.2019.pdf',
'4675535643 - 19 June 2019.docx',
'57467874 25.06.18.pdf',
'6565653635_20 March 2019.txt',
'252252452_31.1.2019.txt'
]
import re
for filename in my_array:
    date = re.search(r'(\d{1,2}([.\s])(?:\d{1,2}|\w+)\2\d{2,4})', filename).group()
    print(f"The date '{date}' was extracted from the file name '{filename}'")
OUTPUT
The date '01.05.2019' was extracted from the file name '5364345354_01.05.2019.pdf'
The date '01.05.2019' was extracted from the file name '5364344354_ 01.05.2019.pdf'
The date '21.06.2019' was extracted from the file name '5345453454 - 21.06.2019.pdf'
The date '19 June 2019' was extracted from the file name '4675535643 - 19 June 2019.docx'
The date '25.06.18' was extracted from the file name '57467874 25.06.18.pdf'
The date '20 March 2019' was extracted from the file name '6565653635_20 March 2019.txt'
The date '31.1.2019' was extracted from the file name '252252452_31.1.2019.txt'
The datetime module is helpful when working with dates and date formats, and may help to convert your variously formatted dates to a single format.
Extra characters, like the number in front of the dates, still need to be stripped manually. Other answers already pointed out several ways to do it; here I propose my own, which does not require regex. I'm going to assume that the patterns are the ones shown in your example. If there are other patterns, they need to be included in the code.
Once the numbers at the beginning of the strings and the file extension are discarded, datetime.strptime() is used to read the date and create a datetime object.
Then datetime.strftime() is used to get back a string representing the date in a single, given format.
import datetime
my_array = [
'5364345354_01.05.2019.pdf',
'5364344354_ 01.05.2019.pdf',
'5345453454 - 21.06.2019.pdf',
'4675535643 - 19 June 2019.docx',
'57467874 25.06.18.pdf',
'6565653635_20 March 2019.txt',
'252252452_31.1.2019.txt'
]
def multiformat(string, format_list, format_res):
    # Pick the delimiter that separates the leading number from the date.
    delim = None
    if '_' in string:
        delim = '_'
    elif '-' in string:
        delim = '-'
    else:
        delim = ' '
    # Take the part after the delimiter and drop the file extension.
    strdate = string.split(delim)[1].strip().split('.')[:-1]
    txtdate = ' '.join(strdate)
    date = None
    # Try each known format until one parses.
    for frm in format_list:
        try:
            date = datetime.datetime.strptime(txtdate, frm)
            break
        except ValueError:
            pass
    return date.strftime(format_res)
format_dates = ['%d %m %Y', '%d %m %y', '%d %B %Y']
dates_list = list(map(lambda x : multiformat(x, format_dates, '%d-%m-%Y'), my_array))
print(dates_list)
This prints:
['01-05-2019', '01-05-2019', '21-06-2019', '19-06-2019', '25-06-2018', '20-03-2019', '31-01-2019']
This can be solved with regex. The pattern I'm using here works in this case, but it's not pretty.
import re
regex = re.compile(r'\d{1,2}(\.| )\w+\1\d{2,4}')
for f in my_array:
    print(regex.search(f).group())
Output:
01.05.2019
01.05.2019
21.06.2019
19 June 2019
25.06.18
20 March 2019
31.1.2019
Broken down:
\d{1,2} - One or two digits
(\.| ) ... \1 - A dot or a space, then the same again
\w+ - One or more letters, digits, or underscores
\d{2,4} - Two to four digits
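To go one step further and turn the matched text into real dates, the pattern can be paired with datetime.strptime. This is a sketch under the assumption that only the formats from the question occur. The names DATE_RE, FORMATS and extract_date are mine, and the character class [. ] is just another spelling of (\.| ).

```python
import re
from datetime import datetime

# Same idea as the pattern above: day, separator, month, same separator, year.
DATE_RE = re.compile(r'\d{1,2}([. ])\w+\1\d{2,4}')
# Assumed formats, taken from the sample file names in the question.
FORMATS = ('%d.%m.%Y', '%d.%m.%y', '%d %B %Y')

def extract_date(filename):
    """Return a datetime.date found in filename, or None if nothing parses."""
    match = DATE_RE.search(filename)
    if match is None:
        return None
    for fmt in FORMATS:
        try:
            return datetime.strptime(match.group(), fmt).date()
        except ValueError:
            continue
    return None

for name in ['57467874 25.06.18.pdf', '4675535643 - 19 June 2019.docx']:
    print(name, '->', extract_date(name))
```

Returning None for anything that does not parse keeps a single odd file name from stopping the whole loop.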
You could try this. It's a little hacky, but you do have some variation in your date formats :)
import re
mons = {'January':'01','February':'02','March':'03','April':'04','May':'05','June':'06','July':'07','August':'08','September':'09','October':'10','November':'11','December':'12'}
unformatted = [re.sub(r'\d{5,}|\s-\s|_|\s','',d.rsplit('.',1)[0]).replace('.','-') for d in my_array]
output:
['01-05-2019', '01-05-2019', '21-06-2019', '19June2019', '25-06-18', '20March2019', '31-1-2019']
for i,d in enumerate(unformatted):
    if any(c.isalpha() for c in d):
        key = re.search('[a-zA-Z]+',d).group()
        unformatted[i] = d.replace(key,'-'+mons[key]+'-')
    if len(d.split('-')[-1])==2:
        yr = d.split('-')[-1]
        unformatted[i] = d[:-2]+'20'+yr
#was having issues getting this one to work in the same loop..but:
for i,d in enumerate(unformatted):
    if len(d.split('-')[1])==1:
        mnth = d.split('-')[1]
        unformatted[i] = d[:3]+'0'+mnth+d[-5:]
output:
['01-05-2019', '01-05-2019', '21-06-2019', '19-06-2019', '25-06-2018', '20-03-2019', '31-01-2019']
This not only extracts the date from each entry, but also puts them all into the same format, so you can use them in pandas or whatever you need to do with them afterwards.
If the provided example contains all variations of the dates, this should work. If not, some minor adaptations should get it working.

how to get this regular expression in python [duplicate]

I have this string:
Sat Apr 18 23:22:15 PDT 2009
and I want to extract
23
What pattern should I use for this? Something like \d\w?
Use datetime to parse datetime strings, then you can easily extract all the parts individually
from datetime import datetime
# strptime takes the date string first, then the format.
dtime = datetime.strptime('Sat Apr 18 23:22:15 PDT 2009', '%a %b %d %H:%M:%S %Z %Y')
hour = dtime.hour
year = dtime.year
# etc.
See docs for more details:
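One caveat, stated with some hedging: as far as I know, %Z in strptime only recognises UTC, GMT and the local machine's own timezone names, so 'PDT' may raise ValueError on a machine outside the Pacific zone. A minimal sketch that sidesteps this by dropping the timezone token before parsing follows. The helper name parse_without_tz is hypothetical, and it assumes the timezone is always the fifth whitespace-separated field and that day and month names are in English.

```python
from datetime import datetime

def parse_without_tz(stamp):
    """Parse 'Sat Apr 18 23:22:15 PDT 2009', ignoring the timezone name."""
    parts = stamp.split()
    # parts is [weekday, month, day, time, timezone, year]; drop the timezone.
    del parts[4]
    return datetime.strptime(' '.join(parts), '%a %b %d %H:%M:%S %Y')

dtime = parse_without_tz('Sat Apr 18 23:22:15 PDT 2009')
print(dtime.hour, dtime.year)
```

The result is a naive datetime, which is enough for reading off the hour. If the offset matters, it would have to be attached separately.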
You could use re.split to split on either spaces or colons and grab the 4th element:
import re
somedate = "Sat Apr 18 23:22:15 PDT 2009"
re.split(r'\s|:', somedate)
['Sat', 'Apr', '18', '23', '22', '15', 'PDT', '2009']
hour = re.split(r'\s|:', somedate)[3]
You could also unpack the result:
day_of_week, month, day_of_month, hour, minute, second, timezone, year = re.split(r'\s|:', somedate)
That gives you access to every field by name.
Otherwise, I'd go with @liamhawkins's suggestion of the datetime module.
EDIT: If you're looking for similar access paradigms to datetime objects, you can use a namedtuple from the collections module:
from collections import namedtuple
date_obj = namedtuple("date_obj", ['day_of_week', 'month', 'day_of_month', 'hour', 'minute', 'second', 'timezone', 'year'])
mydatetime = date_obj(*re.split(r'\s|:', somedate))
hour = mydatetime.hour
While this could be accomplished with re, the use of datetime.strptime in @liamhawkins's answer [ https://stackoverflow.com/a/54600322/214150 ] would be preferred, assuming you are always dealing with formatted dates.
In addition, you could accomplish your goal by simply using a string method (.split()) and basic slicing of the resulting list. For example:
import re
word = 'Sat Apr 18 23:22:15 PDT 2009'
# Example using re.
rehour = re.findall(r'(\d+):\d+:\d+', word)
print('rehour:', *rehour)
# Example using string.split() and slicing.
somedate = word.split(' ')
somehour = somedate[3][:2]
print('somedate:', somedate)
print('somehour:', somehour)
This finds the time in the string and returns the hour:
import re
def get_date(input_date):
    date_format = re.compile(r"[0-9]{2}:[0-9]{2}:[0-9]{2}")
    date_search = date_format.search(input_date)
    if date_search:
        # The first two characters of "HH:MM:SS" are the hour.
        return date_search.group()[:2]
    return ''
If it is truly just a string and the data you want will always be at the same position, you could just do this:
String = "Sat Apr 18 23:22:15 PDT 2009"
hour = String[11:13]
print(hour)
This prints:
23
The same slice works if the value comes from datetime or another source.
If it is the output of some function, convert it to a string first and extract the hour the same way:
hour = str(some_output)[11:13]
If, however, you are not sure the time will always be at the same position in the string, I would suggest the following:
import re
somestring = "More text here Sat Apr 18 23:22:15 PDT 2009 - oh boy! the date could be anywhere in this string"
regex = re.search(r'\d{2}:\d{2}:\d{2}', somestring)
hour = regex.group()[:2]
print(hour)
Here regex.group() returns:
23:22:15
And then [:2] takes the first two characters, giving:
23

Selecting specific dates from dataframe

I have a dataset with the column 'Date', which has dates in several formats, including:
2018.05.07
01-Jun-2018
Reported 01 Jun 2018
Jun 2018
2018
before 1970
1941-1945
Ca. 1960
There are also invalid dates, such as:
190Feb-2010
I am trying to find the entries that have an exact date (day, month, and year) and convert them to datetime. I also need to exclude entries with "Reported" in the field. Is there any way to filter such data without first enumerating all the possible date formats?
Using the dateutil library:
Parse each string twice with different default dates. If any part of the date (day, month, or year) is missing, the two results differ and the string is skipped.
Use fuzzy=True if you want to extract dates from strings such as "Reported 01 Jun 2018".
import datetime
import dateutil.parser
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formated_date = []
for date in dates:
    try:
        first = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2015, 1, 1))
        second = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2016, 2, 2))
        # Equal results mean no component was filled in from the default.
        if first == second:
            formated_date.append(first)
    except (ValueError, OverflowError):
        continue
Another solution: a brute-force method that checks each date against every format. Keep adding formats to cover more date styles. Note that this method is slow on large inputs.
import datetime
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formats = ["%Y%m%d","%Y.%m.%d","%Y-%m-%d","%Y/%m/%d","%Y%a%d","%Y.%a.%d","%Y-%a-%d","%Y%A%d","%Y.%A.%d","%Y-%A-%d",
"%d-%m-%Y","%d.%m.%Y","%d%m%Y","%d/%m/%Y","%d-%b-%Y","%d%b%Y","%d.%b.%Y","%d/%b/%Y"]
formated_date = []
for date in dates:
    for fmt in formats:
        try:
            dt = datetime.datetime.strptime(date, fmt)
        except ValueError:
            continue
        formated_date.append(dt)
        # Stop at the first format that parses.
        break
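Since strptime only succeeds when the whole string fits the format, a short list of full-date formats already filters out entries like "Jun 2018" or "2018". The sketch below adds the "Reported" exclusion from the question. The names FULL_DATE_FORMATS and to_exact_date are invented for illustration, and the format list only covers the samples shown, so treat it as a starting point.

```python
from datetime import datetime

# Assumed formats: each one requires a day, a month and a year.
FULL_DATE_FORMATS = ("%Y.%m.%d", "%d-%b-%Y", "%d %b %Y")

def to_exact_date(text):
    """Return a datetime for a complete date, or None for anything else."""
    if "reported" in text.lower():
        return None  # the question asks to exclude these entries
    for fmt in FULL_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

dates = ["2018.05.07", "01-Jun-2018", "Reported 01 Jun 2018", "Jun 2018",
         "2018", "before 1970", "1941-1945", "Ca. 1960", "190Feb-2010"]
exact = [d for d in map(to_exact_date, dates) if d is not None]
print(exact)
```

On a DataFrame, the same function could be applied to the 'Date' column with Series.apply, keeping only the rows where the result is not None.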
In [1]: string_with_dates = """entries are due by January 4th, 2017 at 8:00pm created 01/15/2005 by ACME Inc. and associates."""
In [2]: import datefinder
In [3]: matches = datefinder.find_dates(string_with_dates)
In [4]: for match in matches:
   ...:     print(match)
2017-01-04 20:00:00
2005-01-15 00:00:00
This should help you find dates inside a longer string.

Python Regex matching any order

Let's say I have dates in the following formats:
12 September, 2016
September 12, 2016
2016 September, 12
I need a regex that always returns the matches in the same order for any of the date formats above:
match-1 : 12
match-2 : September
match-3 : 2016
I need results in the same order always.
You can't switch the group order but you can name your groups:
(r'(?P<day>[\d]{2})(?:\s|,|\?|$)|(?P<month>[a-zA-Z]+)|(?P<year>[\d]{4})')
(?P<day>[\d]{2})(?:\s|,|\?|$): matches a day, can be accessed in python with l.group("day")
(?P<month>[a-zA-Z]+): matches a month, can be accessed in python with l.group("month")
(?P<year>[\d]{4}): matches a year, can be accessed in python with l.group("year")
Example:
import re
data = """
12 September, 2016
September 12, 2016
2016 September, 12
September 17, 2012
17 October, 2015
"""
rgx = re.compile(r'(?P<day>[\d]{2})(?:\s|,|\?|$)|(?P<month>[a-zA-Z]+)|(?P<year>[\d]{4})')
day = ""
month = ""
year = ""
for l in rgx.finditer(data):
    if l.group("day"):
        day = l.group("day")
    elif l.group("month"):
        month = l.group("month")
    elif l.group("year"):
        year = l.group("year")
    # Once all three parts are collected, print them and reset.
    if day != "" and month != "" and year != "":
        print("{0} {1} {2}".format(day, month, year))
        day = ""
        month = ""
        year = ""
Named groups as suggested below is a good way of doing it (especially if you already have the regexes set up) but for completion's sake here's how to handle it with the datetime module.
from datetime import datetime as date
def parse_date(s):
    formats = ["%d %B, %Y",
               "%B %d, %Y",
               "%Y %B, %d"]
    for f in formats:
        try:
            return date.strptime(s, f)
        except ValueError:
            pass
    raise ValueError("Invalid date format!")
arr = ["12 September, 2016",
"September 12, 2016",
"2016 September, 12",
"12/9/2016"]
for s in arr:
    dt = parse_date(s)
    print(dt.year, dt.strftime("%B"), dt.day)
"""
2016 September 12
2016 September 12
2016 September 12
Traceback (most recent call last):
File "C:/Python33/datetest.py", line 22, in <module>
dt = parse_date(s)
File "C:/Python33/datetest.py", line 19, in parse_date
raise ValueError("Invalid date format!")
ValueError: Invalid date format!
"""
For more information, see the datetime documentation page.
You cannot change group orderings. You need to do an "or" of 3 patterns and then walk through the result to determine which group mapped to what, which should be pretty simple.
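A rough sketch of that idea, with one pattern per ordering and named groups so the caller always gets day, month, year back in the same order. The names PATTERNS and split_date are mine, and the patterns assume exactly the three layouts from the question.

```python
import re

# One pattern per layout; the group names stay the same across all three.
PATTERNS = [
    re.compile(r'(?P<day>\d{1,2}) (?P<month>[A-Za-z]+), (?P<year>\d{4})'),
    re.compile(r'(?P<month>[A-Za-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})'),
    re.compile(r'(?P<year>\d{4}) (?P<month>[A-Za-z]+), (?P<day>\d{1,2})'),
]

def split_date(text):
    """Return (day, month, year) as strings, or None if no layout matches."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(text.strip())
        if match:
            return match.group('day'), match.group('month'), match.group('year')
    return None

for line in ['12 September, 2016', 'September 12, 2016', '2016 September, 12']:
    print(split_date(line))
```

Trying the patterns one at a time, rather than joining them with |, avoids the restriction that a group name may only be defined once per pattern.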
