Extraction of some date formats failed when using Dateutil in Python

I have gone through multiple links before posting this question so please read through and below are the two answers which have solved 90% of my problem:
parse multiple dates using dateutil
How to parse multiple dates from a block of text in Python (or another language)
Problem: I need to parse multiple dates in multiple formats in Python.
Solution from the links above: most inputs parse correctly, but certain formats still fail.
Formats which still can't be parsed are:
text ='I want to visit from May 16-May 18'
text ='I want to visit from May 16-18'
text ='I want to visit from May 6 May 18'
I have also tried regular expressions, but since the dates can come in any format I ruled that option out because the code was getting very complex. Please suggest modifications to the code presented in the links above so that these 3 formats can also be handled.

This kind of problem is always going to need tweaking for new edge cases, but the following approach is fairly robust:
from itertools import groupby, zip_longest
from datetime import datetime
import calendar
import string
import re
def get_date_part(x):
    # Return a month name, or the numeric part of a day/year token, else False
    if x.lower() in month_list:
        return x
    day = re.match(r'(\d+)(\b|st|nd|rd|th)', x, re.I)
    if day:
        return day.group(1)
    return False
def month_full(month):
    # Normalise a full or abbreviated month name to its abbreviated form
    try:
        return datetime.strptime(month, '%B').strftime('%b')
    except ValueError:
        return datetime.strptime(month, '%b').strftime('%b')
tests = [
    'I want to visit from May 16-May 18',
    'I want to visit from May 16-18',
    'I want to visit from May 6 May 18',
    'May 6,7,8,9,10',
    '8 May to 10 June',
    'July 10/20/30',
    'from June 1, july 5 to aug 5 please',
    '2nd March to the 3rd January',
    '15 march, 10 feb, 5 jan',
    '1 nov 2017',
    '27th Oct 2010 until 1st jan',
    '27th Oct 2010 until 1st jan 2012'
]
cur_year = 2017
month_list = [m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if len(m)]
# Translation table mapping every punctuation character to a space
remove_punc = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
for date in tests:
    date_parts = [get_date_part(part) for part in date.translate(remove_punc).split() if get_date_part(part)]
    days = []
    months = []
    years = []
    # Sorting on isdigit() puts month names first and numbers last
    for k, g in groupby(sorted(date_parts, key=lambda x: x.isdigit()), lambda y: not y.isdigit()):
        values = list(g)
        if k:
            months = list(map(month_full, values))
        else:
            for v in values:
                if 1900 <= int(v) <= 2100:
                    years.append(int(v))
                else:
                    days.append(v)
    if days and months:
        if years:
            dates_raw = [datetime.strptime('{} {} {}'.format(m, d, y), '%b %d %Y') for m, d, y in zip_longest(months, days, years, fillvalue=years[0])]
        else:
            dates_raw = [datetime.strptime('{} {}'.format(m, d), '%b %d').replace(year=cur_year) for m, d in zip_longest(months, days, fillvalue=months[0])]
            years = [cur_year]
        # Fix for jumps in year
        dates = []
        start_date = datetime(years[0], 1, 1)
        next_year = years[0] + 1
        for d in dates_raw:
            if d < start_date:
                d = d.replace(year=next_year)
                next_year += 1
            start_date = d
            dates.append(d)
        print("{} -> {}".format(date, ', '.join(d.strftime("%d/%m/%Y") for d in dates)))
This converts the test strings as follows:
I want to visit from May 16-May 18 -> 16/05/2017, 18/05/2017
I want to visit from May 16-18 -> 16/05/2017, 18/05/2017
I want to visit from May 6 May 18 -> 06/05/2017, 18/05/2017
May 6,7,8,9,10 -> 06/05/2017, 07/05/2017, 08/05/2017, 09/05/2017, 10/05/2017
8 May to 10 June -> 08/05/2017, 10/06/2017
July 10/20/30 -> 10/07/2017, 20/07/2017, 30/07/2017
from June 1, july 5 to aug 5 please -> 01/06/2017, 05/07/2017, 05/08/2017
2nd March to the 3rd January -> 02/03/2017, 03/01/2018
15 march, 10 feb, 5 jan -> 15/03/2017, 10/02/2018, 05/01/2019
1 nov 2017 -> 01/11/2017
27th Oct 2010 until 1st jan -> 27/10/2010, 01/01/2011
27th Oct 2010 until 1st jan 2012 -> 27/10/2010, 01/01/2012
This works as follows:
First, create a list of valid month names, i.e. both full and abbreviated.
Make a translation table to make it easy to quickly remove any punctuation from the text.
Split the text, and extract only the date parts by using a function with a regular expression to spot days or months.
Sort the list based on whether or not the part is a digit; this groups months at the front and digits at the end.
Pair up the months and days (padding the shorter list). Normalise each month name to its abbreviated form, e.g. August to Aug, and convert each pair into a datetime object.
If a date appears to be before the previous one, add a whole year.
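The punctuation-stripping and token-extraction steps above can be sketched in Python 3 as follows. This is only a minimal illustration, not the answer's exact code: the names extract_date_tokens, MONTH_NAMES and STRIP_PUNCTUATION are hypothetical, and the regular expression is slightly stricter than the original (the whole word must be a number with an optional ordinal suffix).

```python
import calendar
import re
import string

# Hypothetical helper names, chosen for this sketch only
MONTH_NAMES = {m.lower() for m in list(calendar.month_name) + list(calendar.month_abbr) if m}
STRIP_PUNCTUATION = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def extract_date_tokens(text):
    """Return month names and day/year numbers found in text, in order."""
    tokens = []
    for word in text.translate(STRIP_PUNCTUATION).split():
        if word.lower() in MONTH_NAMES:
            tokens.append(word)
            continue
        # Accept plain numbers and ordinals such as 1st, 2nd, 3rd, 27th
        match = re.match(r'(\d+)(st|nd|rd|th)?$', word, re.I)
        if match:
            tokens.append(match.group(1))
    return tokens

print(extract_date_tokens('I want to visit from May 16-18'))  # ['May', '16', '18']
```

Everything that is neither a month name nor a number is discarded at this point, which is why surrounding words such as "from" or "please" do not disturb the later pairing of months and days.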

Related

Obtaining decimal format for range of years and specific months

I have monthly data (1993 - 2019) but I am hoping to get the decimal format of only July, August, and September months from 1993 - 2019.
Below is the code for the months in decimal format between 1993 - 2019 (all 12 months) but hoping to get the same thing but just for July, August, and September months:
year_start = 1993
year_end = 2019
full_time_months = np.arange(year_start+.5/12,year_end+1,1/12)
print(full_time_months[:12])
# these are the 12 months in 1993 as decimals
1993.04166667 1993.125 1993.20833333 1993.29166667 1993.375
1993.45833333 1993.54166667 1993.625 1993.70833333 1993.79166667
1993.875 1993.95833333
My goal is to just get an array of months july, august, and september:
1993.54167, 1993.625, 1993.708... 2019.54167 , 2019.625, 2019.708
where year.54167 = July, year.625 = August, and year.708 = September.
How might I go about doing this? Hope my question is clear enough, please comment if something is unclear, thank you!!!
I'm not sure what you want to achieve with this, but you can do something like this, to separate the data you want.
import numpy as np
year_start = 1993
year_end = 2019
full_time_months = np.arange(year_start+.5/12,year_end+1,1/12)
# Reshape into 2D array
full_time_months = full_time_months.reshape(-1, 12)
# Choose selected columns
# July, Aug, Sept
selected_months = full_time_months[:, [6,7,8]]
print(selected_months)
Results:
[[1993.54166667 1993.625 1993.70833333]
[1994.54166667 1994.625 1994.70833333]
...
[2018.54166667 2018.625 2018.70833333]
[2019.54166667 2019.625 2019.70833333]]
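If a flat 1-D array is wanted, as in the question, the decimal values can also be built directly from the year and month numbers. A minimal sketch, assuming the same mid-month convention as above (month m of a year maps to year + (m - 0.5)/12); the variable names are illustrative only:

```python
import numpy as np

year_start = 1993
year_end = 2019

years = np.arange(year_start, year_end + 1)   # 1993 ... 2019
months = np.array([7, 8, 9])                  # July, August, September

# Broadcast years (as a column) against the month offsets (as a row),
# then flatten row by row so the values stay in chronological order
summer_months = (years[:, None] + (months - 0.5) / 12).ravel()

print(summer_months[:3])
print(summer_months.shape)  # (81,)
```

Building the values from integers avoids relying on the exact length that np.arange produces with a fractional step.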

Selecting specific dates from dataframe

I have a dataset with the column 'Date', which has dates in several formats, including:
2018.05.07
01-Jun-2018
Reported 01 Jun 2018
Jun 2018
2018
before 1970
1941-1945
Ca. 1960
There are also invalid dates, such as:
190Feb-2010
I am trying to find dates which have an exact date (day, month, and year) and convert them to datetime. I also need to exclude dates with "Reported" in the field. Is there any way to filter such data without finding before all the possible formats of dates?
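Use the dateutil library. Parse each string twice with two different default dates: if any part of the date (day, month or year) is missing, it is filled in from the default, so the two results differ and the string can be skipped.
Pass fuzzy=True instead if you do want to extract dates from strings such as "Reported 01 Jun 2018".
import datetime
import dateutil.parser
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formated_date = []
default_a = datetime.datetime(2015, 1, 1)
default_b = datetime.datetime(2016, 2, 2)
for date in dates:
    try:
        parsed_a = dateutil.parser.parse(date, fuzzy=False, default=default_a)
        parsed_b = dateutil.parser.parse(date, fuzzy=False, default=default_b)
    except (ValueError, OverflowError):
        # Not parseable as a date at all
        continue
    # Equal results mean day, month and year all came from the string itself
    if parsed_a == parsed_b:
        formated_date.append(parsed_a)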
Another solution: a brute-force method that checks each date against every format. Keep adding formats to cover more inputs; note that this approach is slow.
import datetime
dates = ["2018.05.07","01-Jun-2018","Reported 01 Jun 2018","Jun 2018","2018","before 1970","1941-1945","Ca. 1960","190Feb-2010"]
formats = ["%Y%m%d","%Y.%m.%d","%Y-%m-%d","%Y/%m/%d","%Y%a%d","%Y.%a.%d","%Y-%a-%d","%Y%A%d","%Y.%A.%d","%Y-%A-%d",
           "%d-%m-%Y","%d.%m.%Y","%d%m%Y","%d/%m/%Y","%d-%b-%Y","%d%b%Y","%d.%b.%Y","%d/%b/%Y"]
formated_date = []
for date in dates:
    for fmt in formats:
        try:
            dt = datetime.datetime.strptime(date, fmt)
            formated_date.append(dt)
        except ValueError:
            # This format does not match; try the next one
            continue
In [1]: string_with_dates = """entries are due by January 4th, 2017 at 8:00pm created 01/15/2005 by ACME Inc. and associates."""
In [2]: import datefinder
In [3]: matches = datefinder.find_dates(string_with_dates)
In [4]: for match in matches:
...:     print(match)
2017-01-04 20:00:00
2005-01-15 00:00:00
Hope this would help you to find dates from string with dates

Get number of days in a specific month that are in a date range

Haven't been able to find an answer to this problem. Basically what I'm trying to do is this:
Take a daterange, for example October 10th to November 25th. What is the best algorithm for determining how many of the days in the daterange are in October and how many are in November.
Something like this:
def daysInMonthFromDaterange(daterange, month):
# do stuff
return days
I know that this is pretty easy to implement, I'm just wondering if there's a very good or efficient algorithm.
Thanks
Borrowing the algorithm from this answer, How do I divide a date range into months in Python?, this might work. The inputs here are date strings parsed with strptime; datetime objects can be used directly if preferred:
import datetime
begin = '2018-10-10'
end = '2018-11-25'
dt_start = datetime.datetime.strptime(begin, '%Y-%m-%d')
dt_end = datetime.datetime.strptime(end, '%Y-%m-%d')
one_day = datetime.timedelta(1)
start_dates = [dt_start]
end_dates = []
today = dt_start
while today <= dt_end:
    #print(today)
    tomorrow = today + one_day
    if tomorrow.month != today.month:
        start_dates.append(tomorrow)
        end_dates.append(today)
    today = tomorrow
end_dates.append(dt_end)
out_fmt = '%d %B %Y'
for start, end in zip(start_dates,end_dates):
    diff = (end - start).days
    print('{} to {}: {} days'.format(start.strftime(out_fmt), end.strftime(out_fmt), diff))
result (each count is the difference end - start, so add 1 to count both endpoints):
10 October 2018 to 31 October 2018: 21 days
01 November 2018 to 25 November 2018: 24 days
The problem as stated may not have a unique answer. For example what should you get from daysInMonthFromDaterange('Feb 15 - Mar 15', 'February')? That will depend on the year!
But if you substitute actual days, I would suggest converting from dates to integer days, using the first of the month to the first of the next month as your definition of a month. This is now reduced to intersecting intervals of integers, which is much easier.
The assumption that the first of the month always happened deals with months of different lengths, variable length months, and even correctly handles the traditional placement of the switch from the Julian calendar to the Gregorian. See cal 1752 for that. (It will not handle that switch for all locations though. Should you be dealing with a library that does Romanian dates in 1919, you could have a problem...)
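The interval-intersection idea described above can be sketched as follows. This is one possible reading of it, assuming Python's proleptic Gregorian ordinals as the "integer days" and treating the date range as inclusive at both ends; the function name days_in_month_from_range is hypothetical.

```python
from datetime import date

def days_in_month_from_range(start, end, year, month):
    """Count the days of the inclusive range [start, end] that fall in year/month."""
    month_start = date(year, month, 1)
    # The month is the half-open interval [first of month, first of next month)
    if month == 12:
        next_month_start = date(year + 1, 1, 1)
    else:
        next_month_start = date(year, month + 1, 1)
    # Intersect [start, end + 1 day) with the month, using integer ordinals
    lo = max(start.toordinal(), month_start.toordinal())
    hi = min(end.toordinal() + 1, next_month_start.toordinal())
    return max(0, hi - lo)

print(days_in_month_from_range(date(2018, 10, 10), date(2018, 11, 25), 2018, 10))  # 22
print(days_in_month_from_range(date(2018, 10, 10), date(2018, 11, 25), 2018, 11))  # 25
```

No loop over individual days is needed, so the cost does not depend on the length of the range.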
You can use the datetime module:
from datetime import datetime
start = datetime(2018,10,10)
end = datetime(2018,11,25)
print((end - start).days)
Something like this would work:
from datetime import date
def daysInMonthFromDaterange(date1, date2, month):
    # Note: range() excludes date2 itself
    return [x for x in range(date1.toordinal(), date2.toordinal()) if date.fromordinal(x).year == month.year and date.fromordinal(x).month == month.month]
print(len(daysInMonthFromDaterange(date(2018,10,10), date(2018,11,25), date(2018,10,1))))
This just loops through all the days between date1 and date2, and returns it as part of a list if it matches the year and month of the third argument.

How to find out week no of the month in python?

I have seen many ways to determine the week of the year. For example, datetime.date(2016, 2, 14).isocalendar()[1] gives 6, which means 14th Feb 2016 falls in the 6th week of the year. But I couldn't find any way to get the week of the month.
That is, if I call some_function(2016,2,16)
I should get 3 as output, denoting that 16th Feb 2016 is in the 3rd week of Feb 2016.
[This is a different question from the similar existing ones: here I'm asking about the week number of the month, not of the year.]
This function did what I wanted:
from math import ceil
def week_of_month(dt):
    first_day = dt.replace(day=1)
    dom = dt.day
    adjusted_dom = dom + first_day.weekday()
    return int(ceil(adjusted_dom/7.0))
I got this function from This StackOverFlow Answer
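As a quick check, the function above can be exercised on a few dates. This sketch repeats the function so that it runs on its own; the sample dates other than 16 Feb 2016 are chosen here for illustration. Weeks are counted as Monday-based calendar rows, so a month that begins on a Sunday has a one-day first week.

```python
from datetime import date
from math import ceil

def week_of_month(dt):
    # Shift the day of month by the weekday of the 1st (Monday == 0),
    # so that each calendar row starting on Monday counts as one week
    first_day = dt.replace(day=1)
    adjusted_dom = dt.day + first_day.weekday()
    return int(ceil(adjusted_dom / 7.0))

print(week_of_month(date(2016, 2, 16)))  # 3
print(week_of_month(date(2016, 5, 1)))   # 1  (1 May 2016 is a Sunday)
print(week_of_month(date(2016, 5, 2)))   # 2  (the following Monday)
```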
import datetime
def week_number_of_month(date_value):
    week_number = (date_value.isocalendar()[1] - date_value.replace(day=1).isocalendar()[1] + 1)
    if week_number == -46:
        week_number = 6
    return week_number
date_given = datetime.datetime(year=2018, month=12, day=31).date()
week_number_of_month(date_given)

nth weekday calculation in Python - whats wrong with this code?

I'm trying to calculate the nth weekday for a given date. For example, I should be able to calculate the 3rd wednesday in the month for a given date.
I have written 2 versions of a function that is supposed to do that:
from datetime import datetime, timedelta
### version 1
def nth_weekday(the_date, nth_week, week_day):
    temp = the_date.replace(day=1)
    adj = (nth_week-1)*7 + temp.weekday()-week_day
    return temp + timedelta(days=adj)
### version 2
def nth_weekday(the_date, nth_week, week_day):
    temp = the_date.replace(day=1)
    adj = temp.weekday()-week_day
    temp += timedelta(days=adj)
    temp += timedelta(weeks=nth_week)
    return temp
Console output
# Calculate the 3rd Friday for the date 2011-08-09
x=nth_weekday(datetime(year=2011,month=8,day=9),3,4)
print 'output:',x.strftime('%d%b%y')
# output: 11Aug11 (Expected: '19Aug11')
The logic in both functions is obviously wrong, but I can't seem to locate the bug - can anyone spot what is wrong with the code - and how do I fix it to return the correct value?
Your problem is here:
adj = temp.weekday()-week_day
First of all, you are subtracting things the wrong way: you need to subtract the actual day from the desired one, not the other way around.
Second, you need to ensure that the result of the subtraction is not negative - it should be put in the range 0-6 using % 7.
The result:
adj = (week_day - temp.weekday()) % 7
In addition, in your second version, you need to add nth_week-1 weeks like you do in your first version.
Complete example:
def nth_weekday(the_date, nth_week, week_day):
    temp = the_date.replace(day=1)
    adj = (week_day - temp.weekday()) % 7
    temp += timedelta(days=adj)
    temp += timedelta(weeks=nth_week-1)
    return temp
>>> nth_weekday(datetime(2011,8,9), 3, 4)
datetime.datetime(2011, 8, 19, 0, 0)
one-liner
You can find the nth weekday with a one liner that uses calendar from the standard library.
import calendar
calendar.Calendar(x).monthdatescalendar(year, month)[n][0]
where:
x : the integer representing your weekday (0 is Monday)
n : the 'nth' part of your question
year, month : the integers year and month
This will return a datetime.date object.
broken down
It can be broken down this way:
calendar.Calendar(x)
creates a calendar object with weekdays starting on your required weekday.
.monthdatescalendar(year, month)
returns all the calendar days of that month.
[n][0]
returns the 0 indexed value of the nth week (the first day of that week, which starts on the xth day).
why it works
The reason for starting the week on your required weekday is that by default 0 (Monday) will be used as the first day of the week and if the month starts on a Wednesday, calendar will consider the first week to start on the first occurrence of Monday (ie. week 2) and you'll be a week behind.
example
If you were to need the third Saturday of September 2013 (that month's US stock option expiry day), you would use the following:
calendar.Calendar(5).monthdatescalendar(2013,9)[3][0]
The problem with the one-liner with the most votes is it doesn't work.
It can however be used as a basis for refinement:
You see this is what you get:
c = calendar.Calendar(calendar.SUNDAY).monthdatescalendar(2018, 7)
for c2 in c:
print(c2[0])
2018-07-01
2018-07-08
2018-07-15
2018-07-22
2018-07-29
c = calendar.Calendar(calendar.SUNDAY).monthdatescalendar(2018, 8)
for c2 in c:
print(c2[0])
2018-07-29
2018-08-05
2018-08-12
2018-08-19
2018-08-26
If you think about it, it's trying to organise the calendar into nested lists so that a week's worth of dates can be printed at a time, so stragglers from other months come into play. Building a new list of only the valid days that fall in the month does the trick.
Answer with appended list
import calendar
import datetime
def get_nth_DOW_for_YY_MM(dow, yy, mm, nth) -> datetime.date:
    #dow - Python Cal - 6 Sun 0 Mon ... 5 Sat
    #nth is 1 based... -1. is ok for last.
    i = -1 if nth == -1 or nth == 5 else nth -1
    valid_days = []
    for d in calendar.Calendar(dow).monthdatescalendar(yy, mm):
        if d[0].month == mm:
            valid_days.append(d[0])
    return valid_days[i]
So here's how it could be called:
firstSundayInJuly2018 = get_nth_DOW_for_YY_MM(calendar.SUNDAY, 2018, 7, 1)
firstSundayInAugust2018 = get_nth_DOW_for_YY_MM(calendar.SUNDAY, 2018, 8, 1)
print(firstSundayInJuly2018)
print(firstSundayInAugust2018)
And here is the output:
2018-07-01
2018-08-05
get_nth_DOW_for_YY_MM() can be refactored using lambda expressions like so:
Answer with lambda expression refactoring
import calendar
import datetime
def get_nth_DOW_for_YY_MM(dow, yy, mm, nth) -> datetime.date:
    #dow - Python Cal - 6 Sun 0 Mon ... 5 Sat
    #nth is 1 based... -1. is ok for last.
    i = -1 if nth == -1 or nth == 5 else nth -1
    return list(filter(lambda x: x.month == mm, \
        list(map(lambda x: x[0], \
            calendar.Calendar(dow).monthdatescalendar(yy, mm) \
        )) \
    ))[i]
The one-liner answer does not seem to work if the target day falls on the first of the month. For instance, if you want the 2nd Friday of every month, then the one-liner approach
calendar.Calendar(4).monthdatescalendar(year, month)[2][0]
for March 2013 will return March 15th 2013 when it should be March 8th 2013. Perhaps add in a check like
if date(year, month, 1).weekday() == x:
    delivery_date.append(calendar.Calendar(x).monthdatescalendar(year, month)[n-1][0])
else:
    delivery_date.append(calendar.Calendar(x).monthdatescalendar(year, month)[n][0])
Alternatively, this will work for Python 2. It returns the occurrence of the weekday in the given month, i.e. if 16 June 2018 is the input, it returns the occurrence of that day of the week on 16th June 2018.
You may substitute any month/year/date integers you want; right now it gets the date from the system via datetime.
Omit the print statements or use pass where they're not needed.
import calendar
import datetime
import pprint
month_number = int(datetime.datetime.now().strftime('%m'))
year_number = int(datetime.datetime.now().strftime('%Y'))
date_number = int(datetime.datetime.now().strftime('%d'))
day_ofweek = str(datetime.datetime.now().strftime('%A'))
def weekday_occurance():
    print "\nFinding current date here\n"
    for week in xrange(5):
        try:
            calendar.monthcalendar(year_number, month_number)[week].index(date_number)
            occurance = week + 1
            print "Date %s of month %s and year %s is %s #%s in this month." % (date_number,month_number,year_number,day_ofweek,occurance)
            return occurance
            break
        except ValueError as e:
            print "The date specified is %s which is week %s" % (e,week)
myocc = weekday_occurance()
print myocc
A little tweak would make the one-liner work correctly:
import calendar
calendar.Calendar((weekday+1)%7).monthdatescalendar(year, month)[n_th][-1]
Here n_th should be interpreted as c-style, e.g. 0 is the first index.
Example: to find 1st Sunday in July 2018 one could type:
>>> calendar.Calendar(0).monthdatescalendar(2018, 7)[0][-1]
datetime.date(2018, 7, 1)
People here seem to like one-liners, so I will propose one below.
import calendar
[cal[0] for cal in calendar.Calendar(x).monthdatescalendar(year, month) if cal[0].month == month][n]
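Wrapped in a function, the filtered approach might look like the sketch below. It uses only the standard library; the function name nth_weekday_of_month and the choice of a 1-based n are assumptions made for this illustration, not part of the answer above.

```python
import calendar
from datetime import date

def nth_weekday_of_month(year, month, weekday, n):
    """Return the n-th (1-based) occurrence of weekday in a month; 0 is Monday."""
    # Start each week on the requested weekday, so week[0] is always that weekday
    weeks = calendar.Calendar(firstweekday=weekday).monthdatescalendar(year, month)
    # The first week may begin in the previous month; keep only in-month dates
    candidates = [week[0] for week in weeks if week[0].month == month]
    return candidates[n - 1]

print(nth_weekday_of_month(2011, 8, calendar.FRIDAY, 3))   # 2011-08-19
print(nth_weekday_of_month(2013, 3, calendar.FRIDAY, 2))   # 2013-03-08
print(nth_weekday_of_month(2018, 8, calendar.SUNDAY, 1))   # 2018-08-05
```

Because out-of-month week starts are filtered out before indexing, the result does not shift by a week when the month happens to begin on the requested weekday. An n larger than the number of occurrences raises IndexError.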
The relativedelta module that's an extension from the Python dateutil package (pip install python-dateutil) does exactly what you want:
from dateutil import relativedelta
import datetime
def nth_weekday(the_date, nth_week, week_day):
    return the_date.replace(day=1) + relativedelta.relativedelta(
        weekday=week_day(nth_week)
    )
print(nth_weekday(datetime.date.today(), 3, relativedelta.FR))
The key part here evaluates to weekday=relativedelta.FR(3): the third Friday of the month. Here is the relevant part of the docs for the weekday parameter:
weekday:
One of the weekday instances (MO, TU, etc) available in the
relativedelta module. These instances may receive a parameter N,
specifying the Nth weekday, which could be positive or negative
(like MO(+1) or MO(-2)).
