Stanford NLP's SUTime: Unable to capture certain date formats - python

I am using the Python wrapper for Stanford NLP's SUTime.
So far, compared with other date parsers such as duckling, dateparser's search_dates, parsedatetime and natty, SUTime gives the most reliable results.
However, it fails to capture some obvious dates from documents.
The following are the two types of documents whose dates I am having difficulty parsing with SUTime.
I am out and I won't be available until 9/19
I am out and I won't be available between (September 18-September 20)
It gives no results for the first document.
For the second document, it captures only the month, not the day or the date range.
I tried wrapping my head around the Java code to see if I could alter or add some rules to make this work, but couldn't figure it out.
If someone can suggest a way to make this work with SUTime, it would be really helpful.
I also tried dateparser's search_dates, but it is unreliable because it captures anything and everything. For the first document, for example, it parses a date from the text "am out" (which is not wanted) as well as from "9/19" (which is fine). So if there is a way to control this behavior, that would work as well.

Question: Unable to capture certain date formats
This solution uses datetime instead of SUTime.
import re
from datetime import datetime

def datetime_from_string(datestring):
    # Each rule: (regex, strptime format for every captured group, overrides)
    rules = [(r'(\d{1,2}/\d{1,2})', '%m/%d', {'year': 2018}),
             (r'(\w+ \d{1,2})-(\w+ \d{1,2})', '%B %d', {'year': 2018})]
    for rule in rules:
        match = re.match(rule[0], datestring)
        if match:
            result = []
            for part in match.groups():
                try:
                    date = datetime.strptime(part, rule[1])
                    # strptime defaults to the year 1900, so apply the rule's year
                    if 'year' in rule[2]:
                        date = datetime(rule[2]['year'], date.month, date.day)
                    result.append(date)
                except ValueError:
                    pass
            return result
    # If you reach here, NO rule matched
    raise ValueError("Datestring '{}', does not match any rule!".format(datestring))

# Usage
for datestring in ['9/19', 'September 18-September 20', '2018-09-01']:
    result = datetime_from_string(datestring)
    print("str:{!r} result:{}".format(datestring, result))
Output:
str:'9/19' result:[datetime.datetime(2018, 9, 19, 0, 0)]
str:'September 18-September 20' result:[datetime.datetime(2018, 9, 18, 0, 0), datetime.datetime(2018, 9, 20, 0, 0)]
ValueError: Datestring '2018-09-01', does not match any rule!
Tested with Python: 3.4.2
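The function above uses re.match, so it only recognises a date that starts the string. For the full sentences in the question, the same rules can be applied with finditer to scan the whole text. This is only a minimal sketch: the name find_dates, the PATTERNS list and the fixed year 2018 are assumptions made for illustration, not part of SUTime or of the answer above.

```python
import re
from datetime import datetime

ASSUMED_YEAR = 2018  # assumption: the documents carry no year of their own

# Each entry: (compiled pattern, strptime format applied to every captured group)
PATTERNS = [
    (re.compile(r"\b(\d{1,2}/\d{1,2})\b"), "%m/%d"),
    (re.compile(r"\b([A-Za-z]+ \d{1,2})\s*-\s*([A-Za-z]+ \d{1,2})\b"), "%B %d"),
]

def find_dates(text, year=ASSUMED_YEAR):
    """Return one list of datetimes per match found anywhere in text."""
    found = []
    for pattern, fmt in PATTERNS:
        for match in pattern.finditer(text):
            try:
                # Parse the year together with each group so the result is complete
                dates = [datetime.strptime("{} {}".format(year, group), "%Y " + fmt)
                         for group in match.groups()]
            except ValueError:
                continue  # looked like a date but is not one, e.g. "Foo 18-Bar 20"
            found.append(dates)
    return found

print(find_dates("I am out and I won't be available until 9/19"))
print(find_dates("I am out and I won't be available between (September 18-September 20)"))
```

Because only these two shapes are matched, stray words such as "am out" are never treated as dates, which is the kind of control the question asks for.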

Related

Converting different time formats into same format and finding difference between 2 times

Hope you're all doing good.
I've been working on the problem of finding the time difference between two times that come in many different formats. I have solved it for some cases, but I can't yet see how to build a solution that covers all of them.
My step-by-step process includes:
Converting all data (originally in String format) into datetime format
Finding cases where times are expressed differently in String format, so that they convert into datetime format without losing accuracy: PM must stay PM and not become AM when the string is just '4' rather than '4PM' or '16:00'
Then... calculating the difference between the two times (once in datetime format)
Some specifics include finding the time difference for the following times (stored as Strings originally and in this example):
16-19
19.30-20.00
5PM-6PM
4-5
5-5.10PM
16 - 18 (yes the space between the numbers and hyphen is intentional, although some string manipulation should resolve this quite simply)
12 - 14
I've managed to convert the 16-19 into 16:00:00 and 19:00:00. However, for the 19.30-20.00 example, I receive the following error: ValueError: unconverted data remains: .30
I'm assuming this is due to my code implementation:
theDate1, theDate2 = datetime.strptime(temp[0: temp.find('-')], "%H"), datetime.strptime(temp[temp.find('-') + 1: len(temp)], "%H")
...where the 19.30-20.00 includes a %M part and not only a %H, so the code doesn't say how to deal with the .30 part. I was going to try a conditional approach: if the string looks like 16-19, run some code, and if it looks like 19.30-20.00, run some other code (and the same for the other examples).
Apologies if my explanation is a bit all over the place... I'm in the headspace of hacking a solution together and trying all different combinations.
Thanks for any guidance with this!
Have a good day.
Well the error is pretty explicit: you're trying to parse '19.30' with format '%H', which matches 19, so the '.30' is left unmatched.
Using format '%H.%M' instead works for me ;)
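As a minimal sketch of that fix, using the 19.30-20.00 example from the question (the variable names are only illustrative):

```python
from datetime import datetime

# '%H.%M' consumes both the hour and the '.30' minutes part
start = datetime.strptime("19.30", "%H.%M")
end = datetime.strptime("20.00", "%H.%M")

difference = end - start  # a datetime.timedelta
print(difference)  # 0:30:00
```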
Also, have a look at the dateutil package, it's meant to make parsing easier. For example:
>>> from dateutil.parser import parse
>>> parse('6PM') - parse('5PM')
datetime.timedelta(seconds=3600)
>>> parse('19h') - parse('17h45')
datetime.timedelta(seconds=4500)
>>> parse('19:00') - parse('18:30')
datetime.timedelta(seconds=1800)
It's really powerful and can take care of a lot of little details like whitespace and such ;)
Here's a way to parse the example strings using dateutil's parser, after a little pre-processing:
from dateutil import parser

strings = ['16-19', '19.30-20.00', '5PM-6PM', '4-5', '5-5.10PM', '16 - 18', '12 - 14']
datepfx = '2020-07-21 '  # will prefix this so parser.parse works correctly
for s in strings:
    # split on '-', strip surrounding whitespace
    # replace . with : as time separator, ensure upper-case letters
    parts = [part.strip().replace('.', ':').upper() for part in s.split('-')]
    # if only one digit is given, assume hour and add minute :00
    parts = [p + ':00' if len(p) == 1 else p for p in parts]
    # check if AM or PM appears in only one of the parts
    ampm = [i for i in ('AM', 'PM') for p in parts if i in p]
    if len(ampm) == 1:
        parts = [p + ampm[0] if ampm[0] not in p else p for p in parts]
    print(f"\n'{s}' processed to -> {parts}")
    print([parser.parse(datepfx + p).time() for p in parts])
gives
'16-19' processed to -> ['16', '19']
[datetime.time(16, 0), datetime.time(19, 0)]
'19.30-20.00' processed to -> ['19:30', '20:00']
[datetime.time(19, 30), datetime.time(20, 0)]
'5PM-6PM' processed to -> ['5PM', '6PM']
[datetime.time(17, 0), datetime.time(18, 0)]
'4-5' processed to -> ['4:00', '5:00']
[datetime.time(4, 0), datetime.time(5, 0)]
'5-5.10PM' processed to -> ['5:00PM', '5:10PM']
[datetime.time(17, 0), datetime.time(17, 10)]
'16 - 18' processed to -> ['16', '18']
[datetime.time(16, 0), datetime.time(18, 0)]
'12 - 14' processed to -> ['12', '14']
[datetime.time(12, 0), datetime.time(14, 0)]
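The output above stops at the parsed times, while the question asks for the difference between them. Here is a rough sketch of that last step using only the standard library. The helper names parse_clock and span_length and the FORMATS list are assumptions chosen to cover the sample strings; matching AM/PM with %p also assumes an English or C locale.

```python
from datetime import datetime, timedelta

# Formats tried in order; an assumption that covers the sample strings only
FORMATS = ("%I:%M%p", "%I%p", "%H:%M", "%H")

def parse_clock(text):
    """Return a datetime (on strptime's default date) for one clock string."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError("unrecognised time: {!r}".format(text))

def span_length(span):
    """Return the timedelta between the two times in a 'start-end' string."""
    parts = [p.strip().replace(".", ":").upper() for p in span.split("-")]
    if len(parts) != 2:
        raise ValueError("expected 'start-end', got {!r}".format(span))
    # If only one kind of AM/PM marker appears, apply it to both parts
    markers = {p[-2:] for p in parts if p[-2:] in ("AM", "PM")}
    if len(markers) == 1:
        marker = markers.pop()
        parts = [p if p.endswith(marker) else p + marker for p in parts]
    start, end = (parse_clock(p) for p in parts)
    return end - start

for s in ['16-19', '19.30-20.00', '5PM-6PM', '4-5', '5-5.10PM', '16 - 18', '12 - 14']:
    print(s, '->', span_length(s))
```

A bare '4-5' is still read as 04:00 to 05:00 here, as in the dateutil version; deciding that it means PM needs a rule that only the data owner can supply.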

What is an efficient way to test a string for a specific datetime format like "%m/%d/%Y" in python 3.6?

In my Python 3.6 application, my input data can contain datetimes in two different formats:
"datefield":"12/29/2017" or "datefield":"2017-12-31"
I need to make sure that I can handle either datetime format and convert it to (or leave it in) the ISO 8601 format. I want to do something like this:
# Python pseudocode
import datetime
if datefield matches "%m/%d/%Y":
    final_date = datetime.datetime.strptime(datefield, "%m/%d/%Y").strftime("%Y-%m-%d")
elif datefield matches "%Y-%m-%d":
    final_date = datefield
The problem is that I don't know how to check the datefield for a specific datetime format in that first if-statement in my pseudocode. I want a true or false back. I read through the Python docs and some tutorials. I did see one or two obscure examples that used try-except blocks, but that doesn't seem like an efficient way to handle this. This question differs from other Stack Overflow posts because I need to handle and validate two different cases, not just one case where I can simply fail it if it doesn't validate.
You can detect the first style of date by a simple string test, looking for the / separators. Depending on how "loose" you want the check to be, you could check a specific index or scan the whole string with a substring test using the in operator:
if "/" in datefield:  # or "if datefield[2] == '/'", or maybe "if datefield[-5] == '/'"
    final_date = datetime.datetime.strptime(datefield, "%m/%d/%Y").strftime("%Y-%m-%d")
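If the check should be stricter than a substring test and return a plain True or False, a regular expression over the whole string is one option. This is only a sketch: the names US_DATE, ISO_DATE and the two helper functions are made up for illustration, and the patterns check the shape of the string, not whether the date itself is valid.

```python
import re

# Shape-only patterns: digits and separators, no calendar validation
US_DATE = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def looks_like_us_date(value):
    """True if value has the shape m/d/Y, e.g. '12/29/2017'."""
    return US_DATE.fullmatch(value) is not None

def looks_like_iso_date(value):
    """True if value has the shape Y-m-d, e.g. '2017-12-31'."""
    return ISO_DATE.fullmatch(value) is not None

print(looks_like_us_date("12/29/2017"))   # True
print(looks_like_us_date("2017-12-31"))   # False
print(looks_like_iso_date("2017-12-31"))  # True
```

Note that a string such as "13/45/2017" still passes the shape test, so strptime remains the final judge of validity.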
Since you'll only ever deal with two date formats, just check for a / or a - character.
import datetime

# M/D/Y
if '/' in datefield:
    final_date = datetime.datetime.strptime(datefield, '%m/%d/%Y').date().isoformat()
# Y-M-D
elif '-' in datefield:
    final_date = datetime.datetime.strptime(datefield, '%Y-%m-%d').date().isoformat()
A possible approach is to use the dateutil library. It contains many of the commonest datetime formats and can automagically detect these formats for you.
>>> from dateutil.parser import parse
>>> d1 = "12/29/2017"
>>> d2 = "2017-12-31"
>>> parse(d1)
datetime.datetime(2017, 12, 29, 0, 0)
>>> parse(d2)
datetime.datetime(2017, 12, 31, 0, 0)
NOTE: dateutil is a 3rd party library so you may need to install it with something like:
pip install python-dateutil
It can be found on pypi:
https://pypi.python.org/pypi/python-dateutil/2.6.1
And works with Python2 and Python3.
Alternate Examples:
Here are a couple of alternate examples of how well dateutil handles random date formats:
>>> d3 = "December 28th, 2017"
>>> parse(d3)
datetime.datetime(2017, 12, 28, 0, 0)
>>> d4 = "27th Dec, 2017"
>>> parse(d4)
datetime.datetime(2017, 12, 27, 0, 0)
I went with the advice of @ChristianDean and used the try-except block in an effort to be Pythonic. The first format, %m/%d/%Y, appears a bit more often in my data, so I lead the try-except with that format.
Here is my final solution:
import datetime

try:
    final_date = datetime.datetime.strptime(datefield, "%m/%d/%Y").strftime("%Y-%m-%d")
except ValueError:
    final_date = datetime.datetime.strptime(datefield, "%Y-%m-%d").strftime("%Y-%m-%d")
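The same try-except idea extends to any number of formats by looping over them in order of likelihood. A minimal sketch follows; the function name to_iso_date and the KNOWN_FORMATS tuple are illustrative assumptions, not part of the solution above.

```python
from datetime import datetime

# Most frequent format first, as in the solution above
KNOWN_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")

def to_iso_date(value, formats=KNOWN_FORMATS):
    """Return value as 'YYYY-MM-DD', trying each known format in turn."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # wrong format, try the next one
    raise ValueError("no known format matches {!r}".format(value))

print(to_iso_date("12/29/2017"))  # 2017-12-29
print(to_iso_date("2017-12-31"))  # 2017-12-31
```

A value that matches none of the formats raises ValueError, so bad input fails loudly instead of slipping through.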

How to trim spaces within timestamps using 'm/d/yy' format

I have a Python script that generates .csv files from other data sources.
Currently, an error happens when the user manually adds a space to a date by accident. Instead of inputting the date as "1/13/17", a space may be added at the front (" 1/13/17") so that there's a space in front of the month.
I've included the relevant part of my Python script below:
def processDateStamp(sourceStamp):
    matchObj = re.match(r'^(\d+)/(\d+)/(\d+)\s', sourceStamp)
    (month, day, year) = (matchObj.group(1), matchObj.group(2), matchObj.group(3))
    return "%s/%s/%s" % (month, day, year)
How do I trim the space in front of the month, and possibly around the other components of the date (the day and year) as well?
Thanks in advance.
Since you're dealing with dates, it might be more appropriate to use datetime.strptime than regex here. There are two advantages of this approach:
It makes it slightly clearer to anyone reading that you're trying to parse dates.
Your code will be more likely to throw exceptions when it is given data that doesn't represent dates, or that represents dates in an incorrect format. This is good because it helps you catch and address issues that might otherwise go unnoticed.
Here's the code:
from datetime import datetime

def processDateStamp(sourceStamp):
    # %m is the month (%M would be minutes); %y is the two-digit year
    date = datetime.strptime(sourceStamp.replace(' ', ''), '%m/%d/%y')
    # keep the original m/d/yy output, without zero-padding month and day
    return '{}/{}/{:02d}'.format(date.month, date.day, date.year % 100)

if __name__ == '__main__':
    print(processDateStamp('1/13/17'))  # 1/13/17
    print(processDateStamp(' 1/13/17'))  # 1/13/17
    print(processDateStamp(' 1 /13 /17'))  # 1/13/17
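To illustrate the second advantage, here is a small sketch of how strptime rejects input that a loose regex might let through. The helper name parse_stamp and the sample strings are made up for this example.

```python
from datetime import datetime

def parse_stamp(source_stamp):
    """Parse an m/d/yy stamp, ignoring any spaces the user typed."""
    cleaned = source_stamp.replace(" ", "")
    return datetime.strptime(cleaned, "%m/%d/%y")

for candidate in (" 1 /13 /17", "13/45/17", "not a date"):
    try:
        print(repr(candidate), "->", parse_stamp(candidate).date())
    except ValueError as error:
        # Raised for impossible months or days and for non-date text
        print(repr(candidate), "rejected:", error)
```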
You can also use the parser from the python-dateutil library. The main benefit is that it can recognize the datetime format for you, which is sometimes useful:
from dateutil import parser

def processDateTimeStamp(sourceStamp):
    dt = parser.parse(sourceStamp)
    return dt.strftime("%m/%d/%y")

processDateTimeStamp(" 1 /13 / 17")  # returns 01/13/17
processDateTimeStamp(" jan / 13 / 17")
processDateTimeStamp(" 1 - 13 - 17")
processDateTimeStamp(" 1 .13 .17")
Once again, a perfect opportunity to use split, strip, and join:
def remove_spaces(date_string):
    date_list = date_string.split('/')
    result = '/'.join(x.strip() for x in date_list)
    return result
Examples
In [7]: remove_spaces('1/13/17')
Out[7]: '1/13/17'
In [8]: remove_spaces(' 1/13/17')
Out[8]: '1/13/17'
In [9]: remove_spaces(' 1/ 13/17')
Out[9]: '1/13/17'

Convert Military Time from text file to Standard time Python

I am having problems with logic on how to convert military time from a text file to standard time and discard all the wrong values. I have only got to a point where the user is asked for the input file and the contents are displayed from the text file entered. Please help me
Python's datetime.time objects use "military" time. You can do things like this:
>>> import datetime
>>> t = datetime.time(hour=15, minute=12)
>>> u = datetime.time(hour=16, minute=44)
>>> t = datetime.datetime.combine(datetime.datetime.today(), t)
>>> t
datetime.datetime(2011, 5, 11, 15, 12)
>>> u = datetime.datetime.combine(datetime.datetime.today(), u)
>>> t - u
datetime.timedelta(-1, 80880)
With a little twiddling, conversions like the ones you describe should be pretty simple.
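As a rough sketch of that twiddling, assuming each line of the text file holds one four-digit value such as 2244 (the question does not show the file, so this layout and the function name military_to_standard are assumptions):

```python
from datetime import datetime

def military_to_standard(raw):
    """Convert 'HHMM' to 'hh:mm AM/PM'; return None for invalid values."""
    raw = raw.strip()
    if len(raw) != 4 or not raw.isdigit():
        return None
    try:
        parsed = datetime.strptime(raw, "%H%M")
    except ValueError:
        return None  # e.g. hour above 23 or minute above 59
    return parsed.strftime("%I:%M %p")

# Stand-in for the lines read from the user's file
lines = ["2244", "0000", "2460", "12ab", "0915"]
converted = [t for t in (military_to_standard(line) for line in lines) if t]
print(converted)
```

The AM/PM text produced by %p depends on the locale; the sketch assumes an English or C locale.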
Without seeing any code, it's hard to tell what exactly you want. But I assume you could do something like this:
raw_time = '2244'
converted_time = datetime.time(hour=int(raw_time[0:2]), minute=int(raw_time[2:4]))
converted_time = datetime.datetime.combine(datetime.datetime.today(), converted_time)
Now you can work with converted_time, adding and subtracting timedelta objects. Fyi, you can create a timedelta like so:
td = datetime.timedelta(hours=4)
The full list of possible keyword arguments to the timedelta constructor is in the datetime module documentation.
from dateutil.parser import parse
from dateutil.relativedelta import relativedelta

time_today = parse("08:00")
required_time = time_today - relativedelta(minutes=35)
print(repr(required_time))
# datetime.datetime(2011, 5, 11, 7, 25)
This isn't a true answer like the other two, but here is the philosophy I use when dealing with dates and times in Python: convert to a datetime object as soon as possible after getting the value from the user, and only convert back into a string when you are presenting the value back to the user.
Most of the date/time math you will need can be done by datetime objects, and their cousins the timedelta objects. Unless you need ratios of timedeltas to other timedeltas, but that's another story.
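A small sketch of that philosophy, with made-up input values: parse the strings once, do all the arithmetic on datetime and timedelta objects, and format only at the end. In Python 3, dividing one timedelta by another also gives a plain float ratio.

```python
from datetime import datetime, timedelta

# Convert user input to datetime objects immediately
raw_start, raw_end = "0800", "1630"
start = datetime.strptime(raw_start, "%H%M")
end = datetime.strptime(raw_end, "%H%M")

# All math happens on datetime / timedelta objects
shift = end - start                   # timedelta(hours=8, minutes=30)
ratio = shift / timedelta(hours=24)   # fraction of a day, as a float
later = start + timedelta(minutes=35)

# Convert back to a string only for presentation
label = later.strftime("%H:%M")
print(shift, round(ratio, 3), label)
```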

Python `YYYY-MM-DD`

In Python, what is the best way to get the RFC 3339 YYYY-MM-DD text from the output of gtk.Calendar::get_date()?
Thanks to Mark and treeface for the solutions, but it seems I've invented a better one:
year, month, day = cal.get_date()
return '{0:04d}-{1:02d}-{2:02d}'.format(year, month+1, day)
This solution is shorter, easier to understand (in my opinion), and I believe takes less processing as it cuts out the stage where the tuple is converted into UNIX time.
According to the docs, get_date returns a tuple of (year, month, day), where month is 0 to 11 and day is 1 to 31, so:
import datetime
dtTuple = calControl.get_date()
dtObj = datetime.datetime(dtTuple[0], dtTuple[1] + 1, dtTuple[2]) #add one to month to make it 1 to 12
print(dtObj.strftime("%Y-%m-%d"))
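The same conversion can be sketched without a running GTK application by passing the tuple in directly. The function name calendar_tuple_to_iso and the sample tuple are hypothetical; only the (year, zero-based month, day) layout comes from the description above.

```python
import datetime

def calendar_tuple_to_iso(year, month0, day):
    """Format a (year, 0-based month, day) tuple as 'YYYY-MM-DD'."""
    # Add one to the month because the calendar counts months from 0 to 11
    return datetime.date(year, month0 + 1, day).isoformat()

# Stand-in for the tuple returned by get_date(): 11 May 2011
sample = (2011, 4, 11)
print(calendar_tuple_to_iso(*sample))  # 2011-05-11
```

datetime.date also validates the values, so an impossible date raises ValueError rather than producing a malformed string.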
This might be useful to you:
http://www.tutorialspoint.com/python/time_strftime.htm
I've also just looked around for this gtk get_date method and I was able to find some code that handles it like this:
# here replace "self.window.get_date()" with "gtk.Calendar::get_date()"
year, month, day = self.window.get_date()
# mktime interprets the tuple as local time, so convert back with localtime
mytime = time.mktime((year, month+1, day, 0, 0, 0, 0, 0, 0))
return time.strftime("%x", time.localtime(mytime))
So just replace "%x" with "%Y-%m-%d" and I think it would work. Or at least it's worth a try.
