Python Regex matching any order - python

Let's say I have dates in the following formats:
12 September, 2016
September 12, 2016
2016 September, 12
I need a regex that always returns the matches in the same order for any of the date formats above:
match-1 : 12
match-2 : September
match-3 : 2016
I need results in the same order always.

You can't switch the group order but you can name your groups:
(r'(?P<day>[\d]{2})(?:\s|,|\?|$)|(?P<month>[a-zA-Z]+)|(?P<year>[\d]{4})')
(?P<day>[\d]{2})(?:\s|,|\?|$): matches a day, can be accessed in python with l.group("day")
(?P<month>[a-zA-Z]+): matches a month, can be accessed in python with l.group("month")
(?P<year>[\d]{4}): matches a year, can be accessed in python with l.group("year")
Example:
import re
data = """
12 September, 2016
September 12, 2016
2016 September, 12
September 17, 2012
17 October, 2015
"""
rgx = re.compile(r'(?P<day>[\d]{2})(?:\s|,|\?|$)|(?P<month>[a-zA-Z]+)|(?P<year>[\d]{4})')
day = ""
month = ""
year = ""
for l in rgx.finditer(data):
    if l.group("day"):
        day = l.group("day")
    elif l.group("month"):
        month = l.group("month")
    elif l.group("year"):
        year = l.group("year")
    # Once all three parts of a date have been seen, print them and reset
    if day != "" and month != "" and year != "":
        print("{0} {1} {2}".format(day, month, year))
        day = ""
        month = ""
        year = ""

Named groups as suggested below are a good way of doing it (especially if you already have the regexes set up), but for completeness' sake here's how to handle it with the datetime module.
from datetime import datetime as date
def parse_date(s):
    formats = ["%d %B, %Y",
               "%B %d, %Y",
               "%Y %B, %d"]
    for f in formats:
        try:
            return date.strptime(s, f)
        except ValueError:
            pass
    raise ValueError("Invalid date format!")
arr = ["12 September, 2016",
       "September 12, 2016",
       "2016 September, 12",
       "12/9/2016"]
for s in arr:
    dt = parse_date(s)
    print(dt.year, dt.strftime("%B"), dt.day)
"""
2016 September 12
2016 September 12
2016 September 12
Traceback (most recent call last):
  File "C:/Python33/datetest.py", line 22, in <module>
    dt = parse_date(s)
  File "C:/Python33/datetest.py", line 19, in parse_date
    raise ValueError("Invalid date format!")
ValueError: Invalid date format!
"""
For more information, see the datetime documentation page.

You cannot change group orderings. You need to do an "or" of 3 patterns and then walk through the result to determine which group mapped to what, which should be pretty simple.
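One way to sketch the "or of 3 patterns" idea is shown below. It is only an illustration: because Python's re module does not allow the same group name twice in one pattern, the three layouts are kept as separate patterns and tried in turn rather than joined with `|`. The names `PATTERNS` and `split_date` are made up for this example.

```python
import re

# One full pattern per supported layout. The group names are what keep the
# output order stable, whatever order the parts appear in the input.
PATTERNS = [
    re.compile(r'(?P<day>\d{1,2}) (?P<month>[A-Za-z]+), (?P<year>\d{4})'),
    re.compile(r'(?P<month>[A-Za-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})'),
    re.compile(r'(?P<year>\d{4}) (?P<month>[A-Za-z]+), (?P<day>\d{1,2})'),
]

def split_date(text):
    """Return (day, month, year) as strings, or None if no layout matches."""
    for pattern in PATTERNS:
        m = pattern.fullmatch(text.strip())
        if m:
            return m.group('day'), m.group('month'), m.group('year')
    return None

for s in ["12 September, 2016", "September 12, 2016", "2016 September, 12"]:
    print(split_date(s))
```

Each of the three sample strings comes back as day, month, year in that order; a string in any other layout gives None.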

Related

How to filter a list based on a substring in each element?

I have an inconsistent list of strings that contain dates in different formats. I need to extract the date from each string.
My list/array looks like:
dates_list = []
my_array = [
'5364345354_01.05.2019.pdf',
'5364344354_ 01.05.2019.pdf',
'5345453454 - 21.06.2019.pdf',
'4675535643 - 19 June 2019.docx',
'57467874 25.06.18.pdf',
'6565653635_20 March 2019.txt',
'252252452_31.1.2019.txt'
]
I've tried a for loop and tried splitting the string; however, each string has a different delimiter before the date. What's a plausible way to find the date in each string of this inconsistent list? The only helpful regularity is that the dates are all positioned at the end of each string.
Well, it is not the best way to do it, but it may solve your problem, and you can adapt it further:
import os
dates_list = []
my_array = [
    '5364345354_01.05.2019.pdf',
    '5364344354_ 01.05.2019.pdf',
    '5345453454 - 21.06.2019.pdf',
    '4675535643 - 19 June 2019.docx',
    '57467874 25.06.18.pdf',
    '6565653635_20 March 2019.txt',
    '252252452_31.1.2019.txt'
]
for i in my_array:
    # Drop the leading digits one at a time
    for j in i:
        if j >= '0' and j <= '9':
            i = i.replace(j, "", 1)
        else:
            break
    # Remove the extension and the delimiters, then trim leftover spaces
    print(os.path.splitext(i)[0].replace("_", "").replace("-", "").strip())
output :
01.05.2019
01.05.2019
21.06.2019
19 June 2019
25.06.18
20 March 2019
31.1.2019
It is still unclear what you want to do with the dates, or whether you want them in some consistent format; all your question states is that you want to extract the date from the file name. You can do this with a regex based on your samples, which you say are the only 7 formats you have.
my_array = [
'5364345354_01.05.2019.pdf',
'5364344354_ 01.05.2019.pdf',
'5345453454 - 21.06.2019.pdf',
'4675535643 - 19 June 2019.docx',
'57467874 25.06.18.pdf',
'6565653635_20 March 2019.txt',
'252252452_31.1.2019.txt'
]
import re
for filename in my_array:
    date = re.search(r'(\d{1,2}([.\s])(?:\d{1,2}|\w+)\2\d{2,4})', filename).group()
    print(f"The date '{date}' was extracted from the file name '{filename}'")
OUTPUT
The date '01.05.2019' was extracted from the file name '5364345354_01.05.2019.pdf'
The date '01.05.2019' was extracted from the file name '5364344354_ 01.05.2019.pdf'
The date '21.06.2019' was extracted from the file name '5345453454 - 21.06.2019.pdf'
The date '19 June 2019' was extracted from the file name '4675535643 - 19 June 2019.docx'
The date '25.06.18' was extracted from the file name '57467874 25.06.18.pdf'
The date '20 March 2019' was extracted from the file name '6565653635_20 March 2019.txt'
The date '31.1.2019' was extracted from the file name '252252452_31.1.2019.txt'
The datetime module is helpful when working with dates and date formats, and may help convert your various date formats to a single format.
Extra characters, like the number in front of the dates, still need to be stripped manually. Other answers already pointed out several ways to do it; here I propose my own, which does not require regex. I'm going to assume that the patterns are the ones shown in your example; if there are other patterns, they need to be included in the code.
Once the numbers at the beginning of the strings and the file extension are discarded, datetime.strptime() is used to read the date and create a datetime object.
Then datetime.strftime() is used to get back a string representing the date with a given, unique format.
import datetime
my_array = [
'5364345354_01.05.2019.pdf',
'5364344354_ 01.05.2019.pdf',
'5345453454 - 21.06.2019.pdf',
'4675535643 - 19 June 2019.docx',
'57467874 25.06.18.pdf',
'6565653635_20 March 2019.txt',
'252252452_31.1.2019.txt'
]
def multiformat(string, format_list, format_res):
    # Pick the delimiter that separates the leading number from the date
    if '_' in string:
        delim = '_'
    elif '-' in string:
        delim = '-'
    else:
        delim = ' '
    # Drop the file extension and turn the dots into spaces
    strdate = string.split(delim)[1].strip().split('.')[:-1]
    txtdate = ' '.join(strdate)
    date = None
    # Try each known format until one parses
    for frm in format_list:
        try:
            date = datetime.datetime.strptime(txtdate, frm)
            break
        except ValueError:
            pass
    return date.strftime(format_res)
format_dates = ['%d %m %Y', '%d %m %y', '%d %B %Y']
dates_list = list(map(lambda x : multiformat(x, format_dates, '%d-%m-%Y'), my_array))
print(dates_list)
This prints:
['01-05-2019', '01-05-2019', '21-06-2019', '19-06-2019', '25-06-2018', '20-03-2019', '31-01-2019']
This can be solved with regex. The pattern I'm using here works in this case, but it's not pretty.
import re
regex = re.compile(r'\d{1,2}(\.| )\w+\1\d{2,4}')
for f in my_array:
    print(regex.search(f).group())
Output:
01.05.2019
01.05.2019
21.06.2019
19 June 2019
25.06.18
20 March 2019
31.1.2019
Broken down:
\d{1,2} - One or two digits
(\.| ) ... \1 - A dot or a space, then the same again
\w+ - One or more letters, digits, or underscores
\d{2,4} - Two to four digits
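The breakdown above can also be kept next to the pattern itself by compiling it with re.VERBOSE. The following is a minimal sketch of the same pattern; the names `DATE_RE` and `extract_date` are only for illustration, and the function returns None instead of raising when nothing matches.

```python
import re

DATE_RE = re.compile(r"""
    \d{1,2}      # day: one or two digits
    (\.|[ ])     # separator: a dot or a space, captured as group 1
    \w+          # month: digits or a month name
    \1           # the same separator again
    \d{2,4}      # year: two to four digits
""", re.VERBOSE)

def extract_date(filename):
    """Return the first date-like substring in filename, or None."""
    m = DATE_RE.search(filename)
    return m.group() if m else None

print(extract_date('4675535643 - 19 June 2019.docx'))
print(extract_date('57467874 25.06.18.pdf'))
```

In verbose mode a bare space in the pattern is ignored, which is why the space separator is written as `[ ]`.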
You could try this. It's a little hacky, but you do have some variations in your date formats :)
import re
mons = {'January':'01','February':'02','March':'03','April':'04','May':'05','June':'06','July':'07','August':'08','September':'09','October':'10','November':'11','December':'12'}
unformatted = [re.sub(r'\d{5,}|\s-\s|_|\s','',d.rsplit('.',1)[0]).replace('.','-') for d in my_array]
output:
['01-05-2019', '01-05-2019', '21-06-2019', '19June2019', '25-06-18', '20March2019', '31-1-2019']
for i,d in enumerate(unformatted):
    # Replace a month name with its two-digit number
    if any(c.isalpha() for c in d):
        key = re.search('[a-zA-Z]+',d).group()
        unformatted[i] = d.replace(key,'-'+mons[key]+'-')
    # Expand a two-digit year to four digits
    if len(d.split('-')[-1])==2:
        yr = d.split('-')[-1]
        unformatted[i] = d[:-2]+'20'+yr
# I was having issues getting this one to work in the same loop, but:
for i,d in enumerate(unformatted):
    # Zero-pad a one-digit month
    if len(d.split('-')[1])==1:
        mnth = d.split('-')[1]
        unformatted[i] = d[:3]+'0'+mnth+d[-5:]
output:
['01-05-2019', '01-05-2019', '21-06-2019', '19-06-2019', '25-06-2018', '20-03-2019', '31-01-2019']
This not only extracts the date for each entry, but also puts them into the same format so you can use them in pandas, or whatever you need to do with them afterwards.
If the provided example contains all variations of the dates, this should work; if not, you could make some minor adaptations and should be able to get it to work.

Selecting specific dates from dataframe

I have a dataset with the column 'Date', which has dates in several formats, including:
2018.05.07
01-Jun-2018
Reported 01 Jun 2018
Jun 2018
2018
before 1970
1941-1945
Ca. 1960
There are also invalid dates, such as:
190Feb-2010
I am trying to find dates which have an exact date (day, month, and year) and convert them to datetime. I also need to exclude dates with "Reported" in the field. Is there any way to filter such data without first enumerating all the possible date formats?
You can use the dateutil library.
Parse each string twice with different default values: if any part of the date (day, month, or year) is missing, the two results differ and the entry is skipped.
Use fuzzy=True if you want to extract dates from strings such as "Reported 01 Jun 2018".
import datetime
import dateutil.parser
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formated_date = []
for date in dates:
    try:
        # A missing day, month or year is filled in from the default,
        # so two different defaults only agree when the date is complete
        first = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2015, 1, 1))
        second = dateutil.parser.parse(date, fuzzy=False, default=datetime.datetime(2016, 2, 2))
        if first == second:
            formated_date.append(first)
    except (ValueError, OverflowError):
        continue
Another solution: a brute-force method that checks each date against every format. Keep adding formats to cover more date layouts. Note that this method is slow.
import datetime
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formats = ["%Y%m%d","%Y.%m.%d","%Y-%m-%d","%Y/%m/%d","%Y%a%d","%Y.%a.%d","%Y-%a-%d","%Y%A%d","%Y.%A.%d","%Y-%A-%d",
"%d-%m-%Y","%d.%m.%Y","%d%m%Y","%d/%m/%Y","%d-%b-%Y","%d%b%Y","%d.%b.%Y","%d/%b/%Y"]
formated_date = []
for date in dates:
    for fmt in formats:
        try:
            dt = datetime.datetime.strptime(date, fmt)
            formated_date.append(dt)
        except ValueError:
            continue
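A variation on the brute-force idea, sketched below, wraps the format loop in a function that stops at the first match and returns None for anything that is not an exact date, including entries containing "Reported". The format list `FULL_DATE_FORMATS` and the function name `parse_full_date` are assumptions for this example, and `%b` expects English month abbreviations under the default locale.

```python
from datetime import datetime

# Assumed list of layouts that carry a day, a month and a year.
# Extend it with whatever full-date layouts occur in the real data.
FULL_DATE_FORMATS = ["%Y.%m.%d", "%d-%b-%Y", "%d %b %Y"]

def parse_full_date(text):
    """Return a datetime for an exact date, or None otherwise."""
    if "Reported" in text:
        return None
    for fmt in FULL_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None

dates = ["2018.05.07", "01-Jun-2018", "Reported 01 Jun 2018", "Jun 2018",
         "2018", "before 1970", "1941-1945", "Ca. 1960", "190Feb-2010"]
exact = [d for d in (parse_full_date(s) for s in dates) if d is not None]
print(exact)
```

Since the function returns None for rejected values, it should also be usable with something like a pandas `apply` on the 'Date' column, although that is not shown here.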
In [1]: string_with_dates = """entries are due by January 4th, 2017 at 8:00pm created 01/15/2005 by ACME Inc. and associates."""
In [2]: import datefinder
In [3]: matches = datefinder.find_dates(string_with_dates)
In [4]: for match in matches:
   ...:     print(match)
2017-01-04 20:00:00
2005-01-15 00:00:00
I hope this helps you find dates in strings that contain them.

Python Error Control Date

I want my program to take user input in the form of a date and then check whether it is valid. But with the code I have now, the program says the input is wrong regardless of what format I give. I don't see the problem with the code:
import datetime
def visit_date():
    while True:
        date_visit = input("Enter the date you want to visit the Zoo in YYYY-MM-DD format: ")
        try:
            return datetime.datetime.strptime(date_visit, "%d/%m/%y")
        except ValueError:
            print("Not a valid format\n")
You're asking the user for a date in the format YYYY-MM-DD but then trying to parse it according to this format %d/%m/%y.
You instead should parse the string in the same way that you requested it, %Y-%m-%d
You're looking for format %d/%m/%y and asking for %Y-%m-%d
> date_visit = '2016-11-23'
> datetime.datetime.strptime(date_visit, "%Y-%m-%d")
datetime.datetime(2016, 11, 23, 0, 0)
Some notes:
%Y : Year in four digits %y : year in two digits.
%d/%m/%y translates to "day of month in one or two digits, /, month of year in one or two digits, / year in two digits".
%Y-%m-%d translates to "four-digit-year, -, month-of-year, - day-of-month"
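Putting the notes together, a corrected version of the question's loop might look like the sketch below. The `read` parameter is an addition that is not in the original code; it defaults to `input` and only exists so the function can be exercised without a keyboard.

```python
import datetime

DATE_FORMAT = "%Y-%m-%d"  # matches the YYYY-MM-DD prompt

def visit_date(read=input):
    """Keep asking until the reply parses as YYYY-MM-DD, then return the date."""
    while True:
        reply = read("Enter the date you want to visit the Zoo in YYYY-MM-DD format: ")
        try:
            return datetime.datetime.strptime(reply.strip(), DATE_FORMAT).date()
        except ValueError:
            print("Not a valid format\n")
```

Calling `visit_date()` with no arguments prompts on the console as before; the only functional change is that the format string now agrees with the prompt.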
Your prompt asks you to enter the date as YYYY-MM-DD, but your strptime call is attempting to use the format %d/%m/%y. The two formats need to match for strptime to work.
import datetime
d = '2016-11-21'
d = datetime.datetime.strptime(d, "%Y-%m-%d")
print(d)
>>> 2016-11-21 00:00:00
d = '11/21/2016'
d = datetime.datetime.strptime(d, "%m/%d/%Y")
print(d)
>>> 2016-11-21 00:00:00
I personally like to use the python-dateutil module for parsing date strings that allows for different formats
pip install python-dateutil
from dateutil.parser import parse
d1 = 'Tuesday, October 21 2003, 12:14 CDT'
d2 = 'Dec. 23rd of 2012 at 12:34pm'
d3 = 'March 4th, 2016'
d4 = '2015-12-09'
print(parse(d1))
>>> 2003-10-21 12:14:00
print(parse(d2))
>>> 2012-12-23 12:34:00
print(parse(d3))
>>> 2016-03-04 00:00:00
print(parse(d4))
>>> 2015-12-09 00:00:00

Python get last month and year

I am trying to get last month and current year in the format: July 2016.
I have tried (but that didn't work) and it does not print July but the number:
import datetime
now = datetime.datetime.now()
print now.year, now.month(-1)
If you're manipulating dates then the dateutil library is always a great one to have handy for things the Python stdlib doesn't cover easily.
First, install the dateutil library if you haven't already:
pip install python-dateutil
Next:
from datetime import datetime
from dateutil.relativedelta import relativedelta
# Returns the same day of last month if possible otherwise end of month
# (eg: March 31st->29th Feb and July 31st->June 30th)
last_month = datetime.now() - relativedelta(months=1)
# Create string of month name and year...
text = format(last_month, '%B %Y')
Gives you:
'July 2016'
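import datetime
now = datetime.datetime.now()
last_month = now.month-1 if now.month > 1 else 12
# The year only changes when the current month is January
last_year = now.year - 1 if now.month == 1 else now.year
To get the month name you can use
"Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split()[last_month-1]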
An alternative solution using Pandas which converts today to a monthly period and then subtracts one (month). Converted to desired format using strftime.
import datetime as dt
import pandas as pd
>>> (pd.Period(dt.datetime.now(), 'M') - 1).strftime('%B %Y')
u'July 2016'
You can use just the Python datetime library to achieve this.
Explanation:
Replace day in today's date with 1, so you get date of first day of this month.
Doing - timedelta(days=1) will give last day of previous month.
format and use '%B %Y' to convert to required format.
import datetime as dt
format(dt.date.today().replace(day=1) - dt.timedelta(days=1), '%B %Y')
>>>'June 2019'
from datetime import date, timedelta
last_month = date.today().replace(day=1) - timedelta(1)
last_month.strftime("%B %Y")
date.today().replace(day=1) gets the first day of the current month; subtracting 1 day gives the last day of the previous month.
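To see the first-of-month trick across a year boundary, here is a small sketch that takes the reference date as a parameter instead of always using today. The function name `last_month_label` is made up for this example, and `%B` gives English month names only under the default locale.

```python
from datetime import date, timedelta

def last_month_label(today=None):
    """Return the previous month as 'Month YYYY' relative to `today`."""
    if today is None:
        today = date.today()
    # First day of this month minus one day is the last day of last month
    last_day_prev = today.replace(day=1) - timedelta(days=1)
    return last_day_prev.strftime("%B %Y")

print(last_month_label(date(2016, 8, 15)))   # July 2016
print(last_month_label(date(2017, 1, 3)))    # December 2016
```

Because the subtraction is done on a real date, the January case rolls back to December of the previous year without any special handling.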
import datetime
def subOneMonth(dt):
    day = dt.day
    # Last day of the previous month
    res = dt.replace(day=1) - datetime.timedelta(days=1)
    try:
        # Keep the original day of month when the previous month has it
        res = res.replace(day=day)
    except ValueError:
        pass
    return res
print(subOneMonth(datetime.datetime(2016, 7, 11)).strftime('%d, %b %Y'))
11, Jun 2016
print(subOneMonth(datetime.datetime(2016, 1, 11)).strftime('%d, %b %Y'))
11, Dec 2015
print(subOneMonth(datetime.datetime(2016, 3, 31)).strftime('%d, %b %Y'))
29, Feb 2016
from datetime import datetime, timedelta, date, time
#Datetime: 1 month ago
datetime_to = datetime.now().replace(day=15) - timedelta(days=30 * 1)
#Date : 2 months ago
date_to = date.today().replace(day=15) - timedelta(days=30 * 2)
#Date : 12 months ago
date_to = date.today().replace(day=15) - timedelta(days=30 *12)
#Accounting standards: 13 months before the previous day
date_ma = (date.today()-timedelta(1)).replace(day=15)-timedelta(days=30*13)
yyyymm = date_ma.strftime('%Y%m') #201909
yyyy = date_ma.strftime('%Y') #2019
#Error Range Test
from datetime import datetime, timedelta, date, time
import pandas as pd
for i in range(1,120):
    pdmon = (pd.Period(datetime.now(), 'M')-i).strftime('%Y%m')
    wamon = (date.today().replace(day=15)-timedelta(days=30*i)).strftime('%Y%m')
    if pdmon != wamon:
        print('Incorrect %s months ago:%s,%s' % (i,pdmon,wamon))
        break
#Incorrect 37 months ago:201709,201710
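The drift shown by the range test comes from treating a month as 30 days. If exact calendar months are needed without pandas, one option is plain integer arithmetic on the year and month. This is a sketch, and `months_ago` is a hypothetical helper name, not something from the answer above.

```python
from datetime import date

def months_ago(d, n):
    """Return (year, month) exactly n calendar months before date d."""
    # Count months from year 0, subtract n, then split back into year/month
    total = d.year * 12 + (d.month - 1) - n
    year, month_index = divmod(total, 12)
    return year, month_index + 1

year, month = months_ago(date(2020, 10, 20), 37)
print("%04d%02d" % (year, month))   # 201709
```

For the sample date this gives 201709 at 37 months back, matching the pandas side of the comparison in the range test.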
import datetime as dt
.replace(day=1) replaces today's date with the first day of the month, simple
subtracting timedelta(1) subtracts 1 day, giving the last day of the previous month
last_month = dt.datetime.today().replace(day=1) - dt.timedelta(1)
The user wanted the month name (e.g. July), not the month number, so use %B instead of %m
last_month.strftime("%Y, %B")

Check if string has date, any format

How do I check if a string can be parsed to a date?
Jan 19, 1990
January 19, 1990
Jan 19,1990
01/19/1990
01/19/90
1990
Jan 1990
January1990
These are all valid dates. If there's any concern regarding the lack of space in between stuff in item #3 and the last item above, that can be easily remedied via automatically inserting a space in between letters/characters and numbers, if so needed.
But first, the basics:
I tried putting it in an if statement:
if datetime.strptime(item, '%Y') or datetime.strptime(item, '%b %d %y') or datetime.strptime(item, '%b %d %Y') or datetime.strptime(item, '%B %d %y') or datetime.strptime(item, '%B %d %Y'):
But that's in a try-except block, and keeps returning something like this:
16343 time data 'JUNE1890' does not match format '%Y'
Unless, it met the first condition in the if statement.
To clarify, I don't actually need the value of the date - I just want to know if it is. Ideally, it would've been something like this:
if item is date:
    print date
else:
    print "Not a date"
Is there any way to do this?
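The space-insertion step mentioned in the question (for inputs like "Jan 19,1990" and "January1990") could be sketched with two regex substitutions. The function name `add_missing_spaces` is only illustrative, and the rules cover just the letter/digit and comma cases listed above.

```python
import re

def add_missing_spaces(text):
    """Insert a space at letter/digit boundaries and after a tight comma."""
    # Zero-width match between a letter and a digit, in either order
    text = re.sub(r'(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])', ' ', text)
    # A comma directly followed by a non-space character
    text = re.sub(r',(?=\S)', ', ', text)
    return text

for s in ["Jan 19,1990", "January1990", "JUNE1890", "01/19/1990"]:
    print(repr(add_missing_spaces(s)))
```

Strings that are already spaced, or that contain no letters, pass through unchanged.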
The parse function in dateutil.parser is capable of parsing many date string formats to a datetime object.
If you simply want to know whether a particular string could represent or contain a valid date, you could try the following simple function:
from dateutil.parser import parse
def is_date(string, fuzzy=False):
    """
    Return whether the string can be interpreted as a date.
    :param string: str, string to check for date
    :param fuzzy: bool, ignore unknown tokens in string if True
    """
    try:
        parse(string, fuzzy=fuzzy)
        return True
    except ValueError:
        return False
Then you have:
>>> is_date("1990-12-1")
True
>>> is_date("2005/3")
True
>>> is_date("Jan 19, 1990")
True
>>> is_date("today is 2019-03-27")
False
>>> is_date("today is 2019-03-27", fuzzy=True)
True
>>> is_date("Monday at 12:01am")
True
>>> is_date("xyz_not_a_date")
False
>>> is_date("yesterday")
False
Custom parsing
parse might recognise some strings as dates which you don't want to treat as dates. For example:
Parsing "12" and "1999" will return a datetime object representing the current date with the day and year substituted for the number in the string
"23, 4" and "23 4" will be parsed as datetime.datetime(2023, 4, 16, 0, 0).
"Friday" will return the date of the nearest Friday in the future.
Similarly "August" corresponds to the current date with the month changed to August.
Also parse is not locale aware, so does not recognise months or days of the week in languages other than English.
Both of these issues can be addressed to some extent by using a custom parserinfo class, which defines how month and day names are recognised:
from dateutil.parser import parserinfo
class CustomParserInfo(parserinfo):
    # three months in Spanish for illustration
    MONTHS = [("Enero", "Enero"), ("Feb", "Febrero"), ("Marzo", "Marzo")]
An instance of this class can then be used with parse:
>>> parse("Enero 1990")
# ValueError: Unknown string format
>>> parse("Enero 1990", parserinfo=CustomParserInfo())
datetime.datetime(1990, 1, 27, 0, 0)
If you want to parse those particular formats, you can just match against a list of formats:
txt='''\
Jan 19, 1990
January 19, 1990
Jan 19,1990
01/19/1990
01/19/90
1990
Jan 1990
January1990'''
import datetime as dt
fmts = ('%Y','%b %d, %Y','%B %d, %Y','%B %d %Y','%m/%d/%Y','%m/%d/%y','%b %Y','%B%Y','%b %d,%Y')
parsed=[]
for e in txt.splitlines():
    # Keep the first format that parses this line
    for fmt in fmts:
        try:
            t = dt.datetime.strptime(e, fmt)
            parsed.append((e, fmt, t))
            break
        except ValueError as err:
            pass
# check that all the cases are handled
success={t[0] for t in parsed}
for e in txt.splitlines():
    if e not in success:
        print(e)
for t in parsed:
    print('"{:20}" => "{:20}" => {}'.format(*t))
Prints:
"Jan 19, 1990        " => "%b %d, %Y           " => 1990-01-19 00:00:00
"January 19, 1990    " => "%B %d, %Y           " => 1990-01-19 00:00:00
"Jan 19,1990         " => "%b %d,%Y            " => 1990-01-19 00:00:00
"01/19/1990          " => "%m/%d/%Y            " => 1990-01-19 00:00:00
"01/19/90            " => "%m/%d/%y            " => 1990-01-19 00:00:00
"1990                " => "%Y                  " => 1990-01-01 00:00:00
"Jan 1990            " => "%b %Y               " => 1990-01-01 00:00:00
"January1990         " => "%B%Y                " => 1990-01-01 00:00:00
