Best way to identify and extract dates from text Python?

As part of a larger personal project I'm working on, I'm attempting to separate out inline dates from a variety of text sources.
For example, I have a large list of strings (that usually take the form of English sentences or statements) that take a variety of forms:
Central design committee session Tuesday 10/22 6:30 pm
Th 9/19 LAB: Serial encoding (Section 2.2)
There will be another one on December 15th for those who are unable to make it today.
Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
He will be flying in Sept. 15th.
While these dates are in-line with natural text, none of them are in specifically natural language forms themselves (e.g., there's no "The meeting will be two weeks from tomorrow"—it's all explicit).
As someone who doesn't have too much experience with this kind of processing, what would be the best place to begin? I've looked into things like the dateutil.parser module and parsedatetime, but those seem to be for after you've isolated the date.
Because of this, is there any good way to extract the date and the extraneous text
input: Th 9/19 LAB: Serial encoding (Section 2.2)
output: ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']
or something similar? It seems like this sort of processing is done by applications like Gmail and Apple Mail, but is it possible to implement in Python?

I was also looking for a solution to this and couldn't find one, so a friend and I built a tool to do it. I thought I would come back and share it in case others find it helpful.
datefinder -- find and extract dates inside text
Here's an example:
import datefinder
string_with_dates = '''
Central design committee session Tuesday 10/22 6:30 pm
Th 9/19 LAB: Serial encoding (Section 2.2)
There will be another one on December 15th for those who are unable to make it today.
Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm
He will be flying in Sept. 15th.
We expect to deliver this between late 2021 and early 2022.
'''
matches = datefinder.find_dates(string_with_dates)
for match in matches:
    print(match)

I am surprised that there is no mention of SUTime and dateparser's search_dates method.
from sutime import SUTime
import os
import json
from dateparser.search import search_dates
str1 = "Let's meet sometime next Thursday"
# You'll get more information about these jar files from SUTime's github page
jar_files = os.path.join(os.path.dirname(__file__), 'jars')
sutime = SUTime(jars=jar_files, mark_time_ranges=True)
print(json.dumps(sutime.parse(str1), sort_keys=True, indent=4))
"""output:
[
{
"end": 33,
"start": 20,
"text": "next Thursday",
"type": "DATE",
"value": "2018-10-11"
}
]
"""
print(search_dates(str1))
#output:
#[('Thursday', datetime.datetime(2018, 9, 27, 0, 0))]
Although I have tried other modules like dateutil, datefinder and natty (I couldn't get duckling to work with Python), these two seem to give the most promising results.
The results from SUTime are more reliable, as the code snippet above shows: it resolves "next Thursday" as a whole, while search_dates only picks up "Thursday". However, SUTime fails in some basic scenarios, such as parsing the text
"I won't be available until 9/19"
or
"I won't be available between (September 18-September 20)".
It gives no result for the first text and only the month and year for the second.
The search_dates method handles both of these quite well.
On the other hand, search_dates is more aggressive and will return every possible date related to any word in the input text.
I haven't yet found a way to make search_dates parse the text strictly for dates. If I find one, it will be my first choice over SUTime, and I will update this answer.

You can use the dateutil module's parse method with the fuzzy option.
>>> from dateutil.parser import parse
>>> parse("Central design committee session Tuesday 10/22 6:30 pm", fuzzy=True)
datetime.datetime(2018, 10, 22, 18, 30)
>>> parse("There will be another one on December 15th for those who are unable to make it today.", fuzzy=True)
datetime.datetime(2018, 12, 15, 0, 0)
>>> parse("Workbook 3 (Minimum Wage): due Wednesday 9/18 11:59pm", fuzzy=True)
datetime.datetime(2018, 3, 9, 23, 59)
>>> parse("He will be flying in Sept. 15th.", fuzzy=True)
datetime.datetime(2018, 9, 15, 0, 0)
>>> parse("Th 9/19 LAB: Serial encoding (Section 2.2)", fuzzy=True)
datetime.datetime(2002, 9, 19, 0, 0)

If you can identify the segments that actually contain the date information, parsing them can be fairly simple with parsedatetime. There are a few things to consider, though: your dates don't have years, and you should pick a locale.
>>> import parsedatetime
>>> p = parsedatetime.Calendar()
>>> p.parse("December 15th")
((2013, 12, 15, 0, 13, 30, 4, 319, 0), 1)
>>> p.parse("9/18 11:59 pm")
((2014, 9, 18, 23, 59, 0, 4, 319, 0), 3)
>>> # It chooses 2014 since that's the *next* occurrence of 9/18
It doesn't always work perfectly when you have extraneous text.
>>> p.parse("9/19 LAB: Serial encoding")
((2014, 9, 19, 0, 15, 30, 4, 319, 0), 1)
>>> p.parse("9/19 LAB: Serial encoding (Section 2.2)")
((2014, 2, 2, 0, 15, 32, 4, 319, 0), 1)
Honestly, this seems like the kind of problem that would be simple enough to parse for particular formats and pick the most likely out of each sentence. Beyond that, it would be a decent machine learning problem.

Newer versions of the dateparser library provide search functionality.
Example
from dateparser.search import search_dates
dates = search_dates('Central design committee session Tuesday 10/22 6:30 pm')

Hi, I'm not sure whether the approach below counts as machine learning, but you may try it:
Add some context from outside the text, e.g. the publishing time of the message or post, the current time, etc. (your text doesn't say anything about the year).
Extract all tokens, using whitespace as the separator. You should get something like this:
['Th','Wednesday','9:34pm','7:34','pm','am','9/18','9/','/18', '19','12']
Process them with rule sets, e.g. ones consisting of weekdays and/or variations of the components that form a time, and mark the matches: patterns such as '%d:%dpm', '%d am', '%d/%d', '%d/ %d' may indicate a date or time.
Note that tokens may need to be composed, e.g. "12 / 31" is a 3-gram ('12','/','31') that should become the single token of interest "12/31".
"See" which tokens surround the marked tokens, e.g. ('Th','9/19','9:45pm') is a 3-gram formed from "interesting" tokens, and apply rules to it that may determine its meaning.
Do more specific analysis where needed. For example, in 31/12, 31 > 12 means d/m (and vice versa), but for 12/12 the month/day order can only be determined from context built from the text and/or from outside it.
Cheers
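Two of the steps above (composing "12 / 31" into one token, and deciding d/m versus m/d) can be sketched as follows. This is only an illustration under assumptions: the helper names merge_slash_tokens and day_and_month and the day_first hint are made up for this example, and only two-part slash dates are handled.

```python
import re

SLASH_DATE = re.compile(r"^(\d{1,2})/(\d{1,2})$")


def merge_slash_tokens(tokens):
    """Join 3-grams such as ('12', '/', '31') into the single token '12/31'."""
    merged = []
    i = 0
    while i < len(tokens):
        if (i + 2 < len(tokens) and tokens[i].isdigit()
                and tokens[i + 1] == "/" and tokens[i + 2].isdigit()):
            merged.append(tokens[i] + "/" + tokens[i + 2])
            i += 3
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def day_and_month(token, day_first=None):
    """Return (day, month) for an 'a/b' token, or None if it cannot be decided.

    day_first is the outside context (hypothetical hint): True for d/m,
    False for m/d, None when nothing is known.
    """
    match = SLASH_DATE.match(token)
    if match is None:
        return None
    first, second = int(match.group(1)), int(match.group(2))
    # A value above 12 can only be the day, so the other one is the month.
    if 12 < first <= 31 and 1 <= second <= 12:
        return (first, second)
    if 12 < second <= 31 and 1 <= first <= 12:
        return (second, first)
    # Both values could be a month: only outside context can decide.
    if 1 <= first <= 12 and 1 <= second <= 12 and day_first is not None:
        return (first, second) if day_first else (second, first)
    return None


tokens = merge_slash_tokens("due 12 / 31 at noon".split())
print(tokens)
print(day_and_month("31/12"), day_and_month("9/19"), day_and_month("12/12"))
```

The ambiguous case returns None on purpose, so the caller is forced to supply context instead of silently guessing.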

There is no perfect solution. It depends entirely on the type of data you are working with. Quickly review and analyze the data by going through a sample of it manually, then prepare a regex pattern and test whether it works.
All the predefined packages solve the date extraction problem only to a limited extent. If you can work out the approximate patterns by looking at the data, you can prepare your own regex. This also avoids iterating over all the rules written into the packages.
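As a rough sketch of that regex approach, here is a pattern written by looking only at the example sentences from the question. The pattern pieces and the split_date helper are assumptions made for illustration; real data will almost certainly need more cases (years, ranges, other separators).

```python
import re

WEEKDAY = (r"(?:Mon(?:day)?|Tue(?:s(?:day)?)?|Wed(?:nesday)?|Th(?:u(?:rs(?:day)?)?)?"
           r"|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)\.?")
MONTH = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?"
         r"|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\.?")
NUMERIC = r"\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"           # 9/19, 10/22, 10/01/2014
NAMED = MONTH + r"\s+\d{1,2}(?:st|nd|rd|th)?\b"        # December 15th, Sept. 15th
TIME = r"\d{1,2}:\d{2}(?:\s*(?:am|pm))?"               # 6:30 pm, 11:59pm

# Optional weekday, then a numeric or named date, then an optional time.
DATE_RE = re.compile(
    r"\b(?:" + WEEKDAY + r"\s+)?(?:" + NUMERIC + "|" + NAMED + r")(?:\s+" + TIME + r")?",
    re.IGNORECASE,
)


def split_date(text):
    """Return [date_text, remaining_text], or None if no date-like span is found."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    rest = text[:match.start()] + text[match.end():]
    rest = re.sub(r"\s+", " ", rest).strip()
    return [match.group(0), rest]


print(split_date("Th 9/19 LAB: Serial encoding (Section 2.2)"))
# ['Th 9/19', 'LAB: Serial encoding (Section 2.2)']
```

The extracted date string can then be handed to a parser such as dateutil or parsedatetime, which work better once the date has been isolated.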

Related

Parse unformatted dates in Python

I have some text, taken from different websites, that I want to extract dates from. As one can imagine, the dates vary substantially in how they are formatted, and look something like:
Posted: 10/01/2014
Published on August 1st 2014
Last modified on 5th of July 2014
Posted by Dave on 10-01-14
What I want to know is whether anyone knows of a Python library (or API) that would help with this (other than e.g. regex, which will be my fallback). I could probably remove the "Posted on" parts relatively easily, but getting the other stuff consistent does not look easy.

My solution using dateutil
Following Lukas's suggestion, I used the dateutil package (it seemed far more flexible than Arrow) with the fuzzy option, which basically ignores things that are not dates.
Caution on Fuzzy parsing using dateutil
The main thing to note is that, as discussed in the thread "Trouble in parsing date using dateutil", if it is unable to parse a day, month or year, it takes a default value (the current day, unless specified), and as far as I can tell no flag is reported to indicate that it took the default.
This would result in "random text" returning today's date (2015-4-16 at the time of writing), which could have caused problems.
Solution
Since I really want to know when it fails, rather than have the date filled in with a default value, I ended up running the parser twice with different defaults and checking whether a field took the default in both runs. If it did not, I assumed it was parsed correctly.
from datetime import datetime
from dateutil.parser import parse

def extract_date(text):
    # Parse twice with different defaults: a field that follows the default
    # in both runs was not present in the text.
    date = {}
    date_1 = parse(text, fuzzy=True, default=datetime(2001, 1, 1))
    date_2 = parse(text, fuzzy=True, default=datetime(2002, 2, 2))
    if date_1.day == 1 and date_2.day == 2:
        date["day"] = "XX"
    else:
        date["day"] = date_1.day
    if date_1.month == 1 and date_2.month == 2:
        date["month"] = "XX"
    else:
        date["month"] = date_1.month
    if date_1.year == 2001 and date_2.year == 2002:
        date["year"] = "XXXX"
    else:
        date["year"] = date_1.year
    return date

print(extract_date("Posted: by dave August 1st"))
Obviously this is a bit of a botch (so if anyone has a more elegant solution, please share), but it correctly parsed the four examples I had above (assuming US format for the date 10/01/2014 rather than UK format), and returned XX appropriately when data was missing.

You could use the Arrow library:
import arrow
arrow.get('10-01-2014', ['MM/DD/YYYY', 'MM-DD-YYYY'])
It takes two arguments: first a str to parse, and second a list of formats to try.

Parsing human-readable recurring dates in Python

The problem. In my Django application, users create tasks for scheduled execution. The users are quite non-technical, and it would be great if they can write conventional human-readable expressions to define when to execute certain task, such as:
every monday
every fri, wed
daily
1, 14, 20 of each month
every fri; every end of month
This is inspired by Todoist. For now, only dates are necessary; no times. I've spent a couple of hours googling for a library to do that, but with no luck. I am expecting a function, say, in_range(expression, date), such that:
>>> in_range('every monday, wednesday', date(2014, 4, 28))
True
>>> in_range('every end of month', date(2014, 5, 12))
False
>>> in_range('every millenium', date(2014, 5, 8))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unknown token "millenium".
Variants. That's what I've looked through.
Standard datetime library does date parsing, but not date range parsing as per above.
Python-dateutil - supports recurring dates via rrule, very functional, but still does not support parsing.
Python-crontab and Python-croniter accept standard Unix crontab syntax (and allow specifying weekdays, etc.), but that syntax is way too technical and I'd like to avoid it if possible.
Arrow and Parsedatetime do not support the feature.
So, is there a Python code snippet, or a library that I missed, that does this? If not, I'm going to write the parser myself, and I would like to release it as open source if it turns out to be not too bad.

Recurrent is a library that will do natural language date parsing with support for recurring dates. It doesn't match the API you provided, but allows you to create rules that can be used with Python's datetime library.
From their Github page:
Natural language parsing of dates and recurring events
Examples
Date times
next tuesday
tomorrow
in an hour
Recurring events
on weekdays
every fourth of the month from jan 1 2010 to dec 25th 2020
each thurs until next month
once a year on the fourth thursday in november
tuesdays and thursdays at 3:15
Messy strings
Please schedule the meeting for every other tuesday at noon
Set an alarm for next tuesday at 11pm
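If a full library is more than is needed, the in_range function described in the question can be sketched with the standard library alone for a very small grammar. This is not how Recurrent works; it is a hand-rolled illustration that assumes only "daily", "every end of month" and "every <weekdays>" clauses separated by semicolons, and the helper name weekday_index is made up for this example.

```python
import calendar
import re
from datetime import date

WEEKDAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
                 "friday", "saturday", "sunday"]


def weekday_index(token):
    """Map 'monday' or 'mon' to 0 ... 'sunday' or 'sun' to 6."""
    for index, name in enumerate(WEEKDAY_NAMES):
        if token == name or token == name[:3]:
            return index
    raise ValueError('unknown token "%s"' % token)


def in_range(expression, day):
    """Return True if `day` matches any clause of the expression."""
    matched = False
    for clause in expression.lower().split(";"):
        clause = clause.strip()
        if clause == "daily":
            matched = True
        elif clause == "every end of month":
            last_day = calendar.monthrange(day.year, day.month)[1]
            matched = matched or day.day == last_day
        elif clause.startswith("every "):
            # Validate every token, even after a match, so typos are reported.
            for token in re.split(r"[,\s]+", clause[len("every "):].strip()):
                if weekday_index(token) == day.weekday():
                    matched = True
        else:
            raise ValueError('unknown token "%s"' % clause)
    return matched


print(in_range('every monday, wednesday', date(2014, 4, 28)))  # True
print(in_range('every end of month', date(2014, 5, 12)))       # False
```

Expressions outside this grammar, such as "1, 14, 20 of each month", raise ValueError rather than being silently ignored.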

BC dates in Python

I'm setting out to build an app with Python that will need to handle BC dates extensively (store and retrieve in DB, do calculations). Most dates will be of various uncertainties, like "around 2000BC".
I know Python's datetime library only handles dates from 1 AD.
So far I only found FlexiDate. Are there any other options?
EDIT: The best approach would probably be to store them as strings (have String as the basic data type) and -as suggested- have a custom datetime class which can make some numerical sense of it. For the majority it looks like dates will only consist of a year. There are some interesting problems to solve like "early 500BC", "between 1600BC and 1500BC", "before 1800BC".

Astronomers and aerospace engineers have to deal with BC dates and a continuous time line, so that's the google context for your search.
Astropy's Time class will work for you (and even more precisely and completely than you hoped). pip install astropy and you're on your way.
If you roll your own, you should review some of the formulas in Vallado's chapter on dates. There are lots of obscure fudge factors required to convert dates from Julian to Gregorian etc.

It's an interesting question; it seems odd that such a class does not exist yet (re @Joel Cornett's comment). If you work in years only, it would simplify your class to handling integers rather than calendar dates. You could possibly use a dictionary with the text description (10 BC) against an integer value (-10).
EDIT: I googled this:
http://code.activestate.com/lists/python-list/623672/
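A minimal sketch of the integer-year idea, assuming year-only strings with an optional BC/BCE/AD/CE suffix. The parse_year name is hypothetical, and the mapping follows the simple "10 BC is -10" convention from this answer, not astronomical year numbering.

```python
import re

YEAR_RE = re.compile(r"^\s*(\d+)\s*(BCE|BC|CE|AD)?\s*$", re.IGNORECASE)


def parse_year(text):
    """Convert '10 BC' to -10, '2000BC' to -2000, and '1890' or '1890 AD' to 1890."""
    match = YEAR_RE.match(text)
    if match is None:
        raise ValueError("not a year-only date: %r" % text)
    year = int(match.group(1))
    era = (match.group(2) or "AD").upper()
    return -year if era in ("BC", "BCE") else year


years = ["500 BC", "1600BC", "1890"]
# Signed integers sort chronologically, which plain strings would not.
print(sorted(years, key=parse_year))
# ['1600BC', '500 BC', '1890']
```

Uncertain forms such as "around 2000BC" or "before 1800BC" are deliberately rejected here; they would need a richer type than a single integer.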

NASA Spice functions handle BC extremely well with conversions from multiple formats. In these examples begin_date and end_date contain the TDB seconds past the J2000 epoch corresponding to input dates:
import spiceypy as spice
# load a leap second kernel
spice.furnsh("path/to/leap/second/kernel/naif0012.tls")
begin_date = spice.str2et('13201 B.C. 05-06 00:00')
end_date = spice.str2et('17191 A.D. 03-15 00:00')
Documentation of str2et(),
Input format documentation, as well as
Leapsecond kernel files are available via the NASA Spice homepage.
Converting from datetime or other time methods to spice is simple:
if indate.year < 0.0:
    spice_indate = str(indate.year) + ' B.C. ' + sindate[-17:]
    spice_indate = str(spice_indate)[1:]
else:
    spice_indate = str(indate.year) + ' A.D. ' + sindate[-17:]
'2018 B.C. 03-31 19:33:38.44'
Other functions include: TIMOUT, TPARSE both converting to and from J2000 epoch seconds.
These functions are available in python through spiceypy, install e.g. via pip3 install spiceypy

This is an old question, but I had the same one and found this article announcing datautil, which is designed to handle dates like:
Dates in distant past and future including BC/BCE dates
Dates in a wild variety of formats: Jan 1890, January 1890, 1st Dec 1890, Spring 1890 etc
Dates of varying precision: e.g. 1890, 1890-01 (i.e. Jan 1890), 1890-01-02
Imprecise dates: c1890, 1890?, fl 1890 etc
Install is just
pip install datautil
I have explored it for only a few minutes so far, but have noted that it doesn't accept str as an argument (only unicode) and that it implements its own date class (FlexiDate, 'a slightly extended version of ISO8601'), which is sort of useful maybe.
>>> from datautil.date import parse
>>> parse('Jan 1890')
error: 'str' object has no attribute 'read'
>>> fd = parse(u'Jan 1890')
<class 'datautil.date.FlexiDate'> 1890-01
>>> fd.as_datetime()
datetime.datetime(1890, 1, 1, 0, 0)
>>> bc = parse(u'2000BC')
<class 'datautil.date.FlexiDate'> -2000
but alas...
>>> bc.as_datetime()
ValueError: year is out of range
Unfortunately for me, I was looking for something that could handle dates with "circa" (c., ca, ca., circ. or cca.)
>>> ca = parse(u'ca 1900')
<class 'datautil.date.FlexiDate'> [UNPARSED: ca 1900]
Oh well - I guess I can always send a pull request ;-)

getting datetime from python string when tzinfo is present

I have found answers to question like this one helpful but not complete for my problem.
I have a form where the user automatically produces a date. I would like to store that as a date time.
I don't need any of the information after the seconds, but I cannot find a datetime.datetime.strptime code to translate the remaining stuff. So I would either like a strptime code that works for python2.7 on google app engine, or a string editing trick for removing the extra information that is not needed.
datefromuser = '2012-09-22 07:36:36.333373-05:00'

You can slice your string to only select the first 19 characters:
>>> datefromuser='2012-09-22 07:36:36.333373-05:00'
>>> datefromuser[:19]
'2012-09-22 07:36:36'
This lets you parse the date without having to bother with the microseconds and timezone.
Do note that you probably do want to parse the timezone too though. You can use the iso8601 module to handle the whole format, without the need to slice:
>>> import iso8601
>>> iso8601.parse_date(datefromuser)
datetime.datetime(2012, 9, 22, 7, 36, 36, 333373, tzinfo=<FixedOffset '-05:00'>)
The iso8601 module is written in pure python and works without problems on the Google App Engine.
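For readers on a newer interpreter, the standard library can parse this exact string without a third-party module. The sketch below assumes Python 3.7 or later, so it does not apply to the Python 2.7 App Engine runtime from the question.

```python
from datetime import datetime, timedelta

datefromuser = '2012-09-22 07:36:36.333373-05:00'

# fromisoformat() accepts the fractional seconds and the "-05:00" offset.
parsed = datetime.fromisoformat(datefromuser)

# Drop the microseconds but keep the UTC offset.
trimmed = parsed.replace(microsecond=0)

print(parsed.utcoffset() == timedelta(hours=-5))  # True
print(trimmed.isoformat(sep=" "))                 # 2012-09-22 07:36:36-05:00
```

Keeping the offset means the value can later be converted to UTC instead of being treated as a naive local time.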

Python Docs would be a good place to start. strptime() would be your best option.
import datetime
datefromuser = '2012-09-22 07:36:36.333373-05:00'
datetime.datetime.strptime(datefromuser.split(".")[0], "%Y-%m-%d %H:%M:%S")
2012-09-22 07:36:36
http://docs.python.org/library/datetime.html#strftime-and-strptime-behavior

Convert Chrome history date/time stamp to readable format

I originally posted this question looking for an answer using Python and got some good help, but I have still not been able to find a solution. I have a script running on OS X 10.5 client machines that captures internet browsing history (required as part of my sys admin duties in a US public school). Firefox 3.x stores history in a sqlite db, and I have figured out how to get that info out using python/sqlite3. Firefox 3.x uses a conventional Unix timestamp to mark visits, and that is not difficult to convert. Chrome also stores browser history in a sqlite db, but its timestamp is formatted as the number of microseconds since January 1, 1601. I'd like to figure this out using Python, but as far as I know, the sqlite3 module doesn't support that UTC format. Is there another tool out there to convert Chrome timestamps to a human-readable format?

Use the datetime module. For example, if the number of microseconds in question is 10**16:
>>> import datetime
>>> datetime.datetime(1601, 1, 1) + datetime.timedelta(microseconds=1e16)
datetime.datetime(1917, 11, 21, 17, 46, 40)
>>> _.isoformat()
'1917-11-21T17:46:40'
This tells you it was just past a quarter to 6pm on November 21, 1917. You can format datetime objects in any way you want thanks to their strftime method, of course. If you also need to apply timezones (other than the UTC you start with), look at the third-party module pytz.
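The same arithmetic can be wrapped in a small helper that returns a timezone-aware value. This is a sketch, and the function name chrome_time_to_datetime is made up for illustration. The sample timestamp is the one used in the Bash example in this thread, and passing an integer keeps the microseconds exact.

```python
from datetime import datetime, timedelta, timezone

# Chrome counts microseconds from 1601-01-01 00:00:00 UTC.
CHROME_EPOCH = datetime(1601, 1, 1, tzinfo=timezone.utc)


def chrome_time_to_datetime(microseconds):
    """Convert a Chrome history timestamp to an aware UTC datetime."""
    return CHROME_EPOCH + timedelta(microseconds=microseconds)


visited = chrome_time_to_datetime(13315808702856828)
print(visited.isoformat())
# 2022-12-18T03:45:02.856828+00:00
print(visited.strftime("%Y-%m-%d %H:%M:%S %Z"))
# 2022-12-18 03:45:02 UTC
```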

Bash
$ date -ud @$[13315808702856828/10**6-11644473600] +"%F %T %Z"
2022-12-18 03:45:02 UTC
$ printf '%(%FT %T %z)T\n' $[13315808702856828/10**6-11644473600]
2022-12-17 T19:45:02 -0800
Perl
$ echo ".. 13315808702856828 .." |\
perl -MPOSIX -pe 's!\b(1\d{16})\b!strftime(q/%F/,gmtime($1/1e6-11644473600))!e'
.. 2022-12-17 ..
