How to handle multiple datetime-formats within a dataset? - python

Within a dataset, there are several different datetime strings.
For example:
2020-11-16T06:00:00Z
2020-11-16T06:00:00+01:00
2020-11-16T06:00:00+01:00Z
2020-11-16T06:00:00+02:00
2020-11-16T06:00:00.000Z
I thought about replacing everything after the seconds, but that gives me errors when, for example, +01:00 isn't present in the first place.
Do you have any clue how to handle this?
It would be absolutely enough, if I could get:
%Y-%m-%dT%H:%M
(The basics, how to strp and strf are known...)
I've racked my brain all night over this problem.
I hope one of you has a solution...
thank you in advance!

dateutil, a third-party library (installable as python-dateutil), deals with this problem:
import dateutil.parser
examples = [
'2020-11-16T06:00:00Z',
'2020-11-16T06:00:00+01:00',
'2020-11-16T06:00:00+01:00Z',
'2020-11-16T06:00:00+02:00',
'2020-11-16T06:00:00.000Z'
]
for e in examples:
    try:
        print(dateutil.parser.parse(e))
    except ValueError:
        print(f'Invalid datetime: {e}')
Result:
2020-11-16 06:00:00+00:00
2020-11-16 06:00:00+01:00
Invalid datetime: 2020-11-16T06:00:00+01:00Z
2020-11-16 06:00:00+02:00
2020-11-16 06:00:00+00:00
@Z4-tier also has a solution for your examples (be careful with just leaving off the end of the string though), but dateutil will also deal with more exotic stuff:
print(dateutil.parser.parse('15:45 16 Nov 2020'))
Result:
2020-11-16 15:45:00
Also note this:
print(dateutil.parser.parse(e).tzinfo)
If you add that, you'll see that dateutil includes the information about the time zone in the result, which would be lost if you only parse the first part of the strings.
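If the offset matters, one option is to convert every timestamp to UTC before formatting it. The following is only a sketch that uses the standard library instead of dateutil. The helper name to_utc is made up for illustration, and it assumes that a stray Z after an explicit offset (as in the third example) can simply be dropped and that strings without any offset are already in UTC.

```python
import re
from datetime import datetime, timezone


def to_utc(raw):
    """Parse an ISO-like timestamp string and return an aware datetime in UTC."""
    text = raw.strip()
    # Assumption: a "Z" that follows an explicit offset ("+01:00Z") is noise.
    text = re.sub(r"([+-]\d{2}:\d{2})Z$", r"\1", text)
    # Older versions of fromisoformat reject a bare "Z", so spell it as an offset.
    if text.endswith("Z"):
        text = text[:-1] + "+00:00"
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


for example in ("2020-11-16T06:00:00+01:00Z", "2020-11-16T06:00:00.000Z"):
    print(to_utc(example).strftime("%Y-%m-%dT%H:%M"))
```

With this, 06:00 at +01:00 and 06:00 at +02:00 end up as different UTC times instead of both collapsing to 06:00.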

how about this:
import datetime
dates = ['2020-11-16T06:00:00Z',
'2020-11-16T06:00:00+01:00',
'2020-11-16T06:00:00+01:00Z',
'2020-11-16T06:00:00+02:00',
'2020-11-16T06:00:00.000Z']
for d in dates:
    datetime.datetime.fromisoformat(d[0:19])
Since each date has the same format up to the offset and timezone, just strip that part off of the string and cast it to a datetime.datetime.
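Slicing works for these examples, but it silently accepts strings that are not timestamps at all. A slightly more defensive sketch (the function name to_minute_string is hypothetical) checks the prefix with a regular expression first and returns None when it does not look like a valid date. Like slicing, it discards the offset.

```python
import re
from datetime import datetime

# Matches the leading "YYYY-MM-DDTHH:MM" part of the string.
_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}")


def to_minute_string(raw):
    """Return the timestamp truncated to minutes, or None if it is not parseable."""
    match = _PREFIX.match(raw.strip())
    if match is None:
        return None
    try:
        # strptime rejects impossible values such as month 13.
        parsed = datetime.strptime(match.group(0), "%Y-%m-%dT%H:%M")
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%dT%H:%M")


print(to_minute_string("2020-11-16T06:00:00+01:00Z"))
print(to_minute_string("not a date"))
```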

The dataset is generated by a scraper, which scrapes news pages.
Therefore the datetime is scraped as a string in the first place, so the conversion has to handle several different variants before strptime can be run.
I found a solution to my problem, influenced by all of your approaches.
'''
date = '2020-11-16T06:00:00+01:00'
splitat = 19
date = date[:splitat]
date
'''
This resulted in the standardized format, which I needed:
'2020-11-16T06:00:00'

Related

Python library to return date format

I need to return the date format from a string. Currently I am using parser to parse a string as a date, then replacing the year with yyyy or yy, and similarly for the other date items. Is there some function I could use that would return mm-dd-yyyy when I send 12-05-2018?
Technically, it is an impossible question. If you send in 12-05-2018, there is no way for me to know whether you are sending in a mm-dd-yyyy (Dec 5, 2018) or dd-mm-yyyy (May 12, 2018).
One approach might be to do a regex replacement of anything which matches your expected date pattern, e.g.
import re
date = "Here is a date: 12-05-2018 and here is another one: 10-31-2010"
date_masked = re.sub(r'\b\d{2}-\d{2}-\d{4}\b', 'mm-dd-yyyy', date)
print(date)
print(date_masked)
Here is a date: 12-05-2018 and here is another one: 10-31-2010
Here is a date: mm-dd-yyyy and here is another one: mm-dd-yyyy
Of course, the above script makes no effort to check whether the dates are actually valid. If you require that, you may use one of the date libraries available in Python.
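If validity does matter, the substitution can be given a callback that only masks matches which the standard datetime module accepts. This is a rough sketch; the function name mask_valid_dates and its parameters are made up, and it assumes the dates are meant as month-day-year.

```python
import re
from datetime import datetime

_DATE = re.compile(r"\b\d{2}-\d{2}-\d{4}\b")


def mask_valid_dates(text, fmt="%m-%d-%Y", mask="mm-dd-yyyy"):
    """Replace date-like substrings with a mask, but only if they are real dates."""
    def _replace(match):
        try:
            datetime.strptime(match.group(0), fmt)
        except ValueError:
            # Looks like a date but is not one (e.g. 99-99-2018): leave it alone.
            return match.group(0)
        return mask

    return _DATE.sub(_replace, text)


print(mask_valid_dates("Valid: 12-05-2018, invalid: 99-99-2018"))
```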
I don't really understand what you plan to do with the format. There are two reasons I can think of why you might want it. (1) You want at some future point to convert a normalized datetime back into the original string. If that is what you want, you would be better off just storing the normalized datetime and the original string. Or (2) you want to draw (dodgy) conclusions about the person sending the data, because different nationalities will tend to use different formats. But, whatever you want it for, you can do it this way:
from dateutil import parser
def get_date_format(date_input):
    date = parser.parse(date_input)
    for date_format in ("%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d"):
        # You can extend the list above to include formats with %y in addition to %Y, etc, etc
        if date.strftime(date_format) == date_input:
            return date_format
>>> date_input = "12-05-2018"
>>> get_date_format(date_input)
'%m-%d-%Y'
You mention in a comment you are prepared to make assumptions about ambiguous dates like 12-05-2018 (could be May or December) and 05-12-18 (could be 2018 or 2005). You can pass those assumptions to dateutil.parser.parse. It accepts boolean keyword parameters dayfirst and yearfirst which it will use in ambiguous cases.
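To make the ambiguity visible instead of resolving it silently, a variation could return every candidate format that round-trips, not just the first. This is only a sketch using the standard library's strptime rather than dateutil; the name matching_formats and the candidate list are illustrative.

```python
from datetime import datetime

CANDIDATE_FORMATS = ("%m-%d-%Y", "%d-%m-%Y", "%Y-%m-%d")


def matching_formats(date_input, formats=CANDIDATE_FORMATS):
    """Return all formats under which date_input parses and formats back unchanged."""
    matches = []
    for fmt in formats:
        try:
            parsed = datetime.strptime(date_input, fmt)
        except ValueError:
            continue
        # The round trip filters out inputs without zero padding, such as "1-5-2018".
        if parsed.strftime(fmt) == date_input:
            matches.append(fmt)
    return matches


print(matching_formats("12-05-2018"))  # two candidates: the input is ambiguous
print(matching_formats("10-31-2010"))  # one candidate: 31 cannot be a month
```

More than one result means the caller has to apply an assumption such as dayfirst; an empty list means none of the candidates fit.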
Take a look at the datetime library. There you will find the function strptime(), which is exactly what you are looking for.
Here is the documentation: https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior

Information lost while parsing time using dateutil in python

I have an input which may contain a date, or time, or both. In case the date is provided I can use that date but in case the date is not provided (i.e. only the time is provided) then I have to use a different date provided from someplace else.
But once I parse the date using python dateutil, today's date is added to the parsed value. For example:
from dateutil.parser import parse
print(parse('03:15:08'))
print(parse('04-04-2019 03:15:08'))
the above code gives the following output:
2019-04-04 03:15:08
2019-04-04 03:15:08
As you can see the information (that the date was provided or not) is lost and can't be differentiated.
Also some kind of length manipulation may not work because only the date may be provided.
How to differentiate between the 2 inputs?
Thank you.
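One way to tell the two cases apart is to parse the input twice with different default dates and compare the results: if the date part follows the default, the input did not contain a full date. This is a sketch rather than a built-in dateutil feature; the function name parse_with_date_flag and the two default dates are arbitrary choices.

```python
from datetime import datetime

from dateutil.parser import parse

# Two defaults that differ in year, month and day.
_DEFAULT_A = datetime(2001, 1, 1)
_DEFAULT_B = datetime(2002, 2, 2)


def parse_with_date_flag(text):
    """Return (parsed datetime, True if the input supplied a complete date)."""
    first = parse(text, default=_DEFAULT_A)
    second = parse(text, default=_DEFAULT_B)
    # Fields missing from the input are copied from the default, so the two
    # results only agree on the date if the input supplied all of it.
    date_given = first.date() == second.date()
    return first, date_given


print(parse_with_date_flag("03:15:08"))
print(parse_with_date_flag("04-04-2019 03:15:08"))
```

When the flag is False, the time can be combined with the date from elsewhere, for example with datetime.combine(other_date, parsed.time()). A partial date such as a month and year without a day is also reported as False here.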

Is it possible to extract a format string (e.g. "YY-mm-DD HH:MM:SS.sss") from a python datetime object? [duplicate]

Here's an array of datetime values:
array = np.array(['2016-05-01T00:00:59.3+10:00', '2016-05-01T00:02:59.4+10:00',
'2016-05-01T00:03:59.4+10:00', '2016-05-01T00:13:00.1+10:00',
'2016-05-01T00:22:00.5+10:00', '2016-05-01T00:31:01.1+10:00'],
dtype=object)
pd.to_datetime is very good at inferring datetime formats.
array = pd.to_datetime(array)
print(array)
DatetimeIndex(['2016-04-30 14:00:59.300000', '2016-04-30 14:02:59.400000',
'2016-04-30 14:03:59.400000', '2016-04-30 14:13:00.100000',
'2016-04-30 14:22:00.500000', '2016-04-30 14:31:01.100000'],
dtype='datetime64[ns]', freq=None)
How can I dynamically figure out what datetime format pd.to_datetime inferred? Something like: %Y-%m-%dT... (sorry, my datetime foo is really bad).
I don't think it's possible to do this in full generality in pandas.
As mentioned in other comments and answers, the internal function _guess_datetime_format is close to being what you ask for, but it has strict criteria for what constitutes a guessable format and so it will only work for a restricted class of datetime strings.
These criteria are set out in the _guess_datetime_format function on these lines and you can also see some examples of good and bad formats in the test_parsing script.
Some of the main points are:
year, month and day must each be present and identifiable
the year must have four digits
exactly six digits must be used if using microseconds
you can't specify a timezone
This means that it will fail to guess the format for datetime strings in the question despite them being a valid ISO 8601 format:
>>> from pandas.core.tools.datetimes import _guess_datetime_format_for_array
>>> array = np.array(['2016-05-01T00:00:59.3+10:00'])
>>> _guess_datetime_format_for_array(array)
# returns None
In this case, dropping the timezone and padding the microseconds to six digits is enough to make pandas recognise the format:
>>> array = np.array(['2016-05-01T00:00:59.300000']) # six digits, no tz
>>> _guess_datetime_format_for_array(array)
'%Y-%m-%dT%H:%M:%S.%f'
This is probably as good as it gets.
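The string clean-up described above (drop the timezone, pad the fraction to six digits) could be done with a small helper before handing the strings to the guessing function. A minimal sketch, assuming ISO 8601 style input; the name make_guessable is made up, and dropping the offset of course loses information.

```python
import re

_ISO_PARTS = re.compile(
    r"^(?P<base>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})"  # date and time to seconds
    r"(?:\.(?P<frac>\d{1,6}))?"                        # optional fractional seconds
    r"(?P<tz>Z|[+-]\d{2}:?\d{2})?$"                    # optional timezone designator
)


def make_guessable(text):
    """Strip the timezone and pad fractional seconds to exactly six digits."""
    match = _ISO_PARTS.match(text)
    if match is None:
        # Not the shape this sketch understands: return it unchanged.
        return text
    frac = match.group("frac")
    if frac is None:
        return match.group("base")
    return f"{match.group('base')}.{frac.ljust(6, '0')}"


print(make_guessable("2016-05-01T00:00:59.3+10:00"))
```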
If pd.to_datetime is not asked to infer the format of the array, or given a format string to try, it will just try and parse each string separately and hope that it is successful. Crucially, it does not need to infer a format in advance to do this.
First, pandas parses the string assuming it is (approximately) an ISO 8601 format. This begins in a call to _string_to_dts and ultimately hits the low-level parse_iso_8601_datetime function that does the hard work.
You can check if your string is able to be parsed in this way using the _test_parse_iso8601 function. For example:
from pandas._libs.tslib import _test_parse_iso8601
def is_iso8601(string):
    try:
        _test_parse_iso8601(string)
        return True
    except ValueError:
        return False
The dates in the array you give are recognised as this format:
>>> is_iso8601('2016-05-01T00:00:59.3+10:00')
True
But this doesn't deliver what the question asks for and I don't see any realistic way to recover the exact format that is recognised by the parse_iso_8601_datetime function.
If parsing the string as an ISO 8601 format fails, pandas falls back to using the parse() function from the third-party dateutil library (called by parse_datetime_string). This allows a fantastic level of parsing flexibility but, again, I don't know of any good way to extract the recognised datetime format from this function.
If both of these two parsers fail, pandas either raises an error, ignores the string or defaults to NaT (depending on what the user specifies). No further attempt is made to parse the string or guess the format of the string.
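The same strict-parser-first, flexible-fallback-second pattern can be imitated in plain Python. To be clear, this is not pandas' code, just a sketch of the idea using the standard library: the fallback format list is an arbitrary illustration, None stands in for NaT, and the function name parse_or_none is hypothetical.

```python
from datetime import datetime

# Illustrative fallback formats; a real list would depend on the data.
FALLBACK_FORMATS = ("%d/%m/%Y %H:%M", "%d %b %Y", "%B %d, %Y")


def parse_or_none(text):
    """Try a strict ISO parser first, then a list of formats, else give up."""
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        pass
    for fmt in FALLBACK_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    # Neither stage succeeded; no further guessing is attempted.
    return None


for sample in ("2016-05-01T00:00:59", "16 Nov 2020", "random text"):
    print(sample, "->", parse_or_none(sample))
```

Unlike the dateutil fallback, this version also tells you which format matched if you return fmt alongside the result.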
The DateInfer (PyDateInfer) library can infer the date format from a sequence of example dates:
github.com/wdm0006/dateinfer
Usage from docs:
>>> import dateinfer
>>> dateinfer.infer(['Mon Jan 13 09:52:52 MST 2014', 'Tue Jan 21 15:30:00 EST 2014'])
'%a %b %d %H:%M:%S %Z %Y'
>>>
Disclaimer: I have used and then contributed to this library
You can use _guess_datetime_format from core.tools to get the format. ie
from pandas.core.tools import datetimes as tools
tools._guess_datetime_format(pd.to_datetime(array).format()[0][:10])
Output :
'%Y-%m-%d'
To know more about this method you can see here. Hope it helps.

Parse unformatted dates in Python

I have some text, taken from different websites, that I want to extract dates from. As one can imagine, the dates vary substantially in how they are formatted, and look something like:
Posted: 10/01/2014
Published on August 1st 2014
Last modified on 5th of July 2014
Posted by Dave on 10-01-14
What I want to know is if anyone knows of a Python library [or API] which would help with this - (other than e.g. regex, which will be my fallback). I could probably relatively easily remove the "Posted on" parts, but getting the other stuff consistent does not look easy.
My solution using dateutil
Following Lukas's suggestion, I used the dateutil package (seemed far more flexible than Arrow), using the Fuzzy entry, which basically ignores things which are not dates.
Caution on Fuzzy parsing using dateutil
The main thing to note, as discussed in the thread "Trouble in parsing date using dateutil", is that if it is unable to parse a day/month/year it takes a default value (the current day, unless specified), and as far as I can tell no flag is reported to indicate that the default was used.
This would result in "random text" returning today's date of 2015-4-16, which could have caused problems.
Solution
Since I really want to know when it fails, rather than have the date filled in with a default value, I ended up running the parser twice with different defaults and checking whether each field took the default in both runs. If not, I assumed that field had been parsed correctly.
from datetime import datetime
from dateutil.parser import parse
def extract_date(text):
    date = {}
    # Parse twice with different defaults; a field that matches the default
    # in both runs was not present in the text.
    date_1 = parse(text, fuzzy=True, default=datetime(2001, 1, 1))
    date_2 = parse(text, fuzzy=True, default=datetime(2002, 2, 2))
    if date_1.day == 1 and date_2.day == 2:
        date["day"] = "XX"
    else:
        date["day"] = date_1.day
    if date_1.month == 1 and date_2.month == 2:
        date["month"] = "XX"
    else:
        date["month"] = date_1.month
    if date_1.year == 2001 and date_2.year == 2002:
        date["year"] = "XXXX"
    else:
        date["year"] = date_1.year
    return date
print(extract_date("Posted: by dave August 1st"))
Obviously this is a bit of a botch (so if anyone has a more elegant solution, please share), but it correctly parsed the four examples I had above [where it assumed US format for the date 10/01/2014 rather than UK format], and returned XX appropriately when data was missing.
You could use the Arrow library:
arrow.get('2013-05-05 12:30:45', ['MM/DD/YYYY', 'YYYY-MM-DD HH:mm:ss'])
Two arguments: first a str to parse, and second a list of formats to try.

BC dates in Python

I'm setting out to build an app with Python that will need to handle BC dates extensively (store and retrieve in DB, do calculations). Most dates will be of various uncertainties, like "around 2000BC".
I know Python's datetime library only handles dates from 1 AD.
So far I only found FlexiDate. Are there any other options?
EDIT: The best approach would probably be to store them as strings (have String as the basic data type) and -as suggested- have a custom datetime class which can make some numerical sense of it. For the majority it looks like dates will only consist of a year. There are some interesting problems to solve like "early 500BC", "between 1600BC and 1500BC", "before 1800BC".
Astronomers and aerospace engineers have to deal with BC dates and a continuous time line, so that's the google context for your search.
Astropy's Time class will work for you (and even more precisely and completely than you hoped). pip install astropy and you're on your way.
If you roll your own, you should review some of the formulas in Vallado's chapter on dates. There are lots of obscure fudge factors required to convert dates from Julian to Gregorian etc.
It's an interesting question; it seems odd that such a class does not exist yet (re @joel Cornett's comment). If you work only in years, it would simplify your class to handling integers rather than calendar dates. You could possibly use a dictionary mapping the text description (10 BC) to an integer value (-10).
EDIT: I googled this:
http://code.activestate.com/lists/python-list/623672/
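The integer-year idea could look roughly like the sketch below, which also covers a few of the uncertain phrasings from the question. Everything here is an assumption for illustration: the YearRange class, the function name parse_year_text, the handful of keywords it recognises, and the sign convention (10 BC becomes -10, as suggested above, which is not the astronomical numbering where 1 BC is year 0).

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class YearRange:
    earliest: Optional[int]  # None means open-ended
    latest: Optional[int]
    approximate: bool = False


_YEAR = re.compile(r"(\d+)\s*(BC|AD)", re.IGNORECASE)


def _signed_year(digits, era):
    year = int(digits)
    return -year if era.upper() == "BC" else year


def parse_year_text(text):
    """Turn phrases like 'before 1800BC' into a YearRange, or None if unsupported."""
    stripped = text.strip()
    lowered = stripped.lower()
    years = [_signed_year(digits, era) for digits, era in _YEAR.findall(stripped)]
    if lowered.startswith("between") and len(years) == 2:
        return YearRange(min(years), max(years))
    if lowered.startswith("before") and len(years) == 1:
        return YearRange(None, years[0])
    if lowered.startswith(("around", "circa")) and len(years) == 1:
        return YearRange(years[0], years[0], approximate=True)
    if len(years) == 1 and _YEAR.fullmatch(stripped):
        return YearRange(years[0], years[0])
    # Phrases such as "early 500BC" need a rule of their own.
    return None


print(parse_year_text("between 1600BC and 1500BC"))
print(parse_year_text("around 2000BC"))
```

Storing the original string next to the parsed range keeps phrases the parser does not understand from being lost.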
NASA Spice functions handle BC extremely well with conversions from multiple formats. In these examples begin_date and end_date contain the TDB seconds past the J2000 epoch corresponding to input dates:
import spiceypy as spice
# load a leap second kernel
spice.furnsh("path/to/leap/second/kernel/naif0012.tls")
begin_date = spice.str2et('13201 B.C. 05-06 00:00')
end_date = spice.str2et('17191 A.D. 03-15 00:00')
Documentation of str2et(),
Input format documentation, as well as
Leapsecond kernel files are available via the NASA Spice homepage.
Converting from datetime or other time methods to spice is simple:
# indate and sindate (the string form of indate) are assumed to be defined earlier
if indate.year < 0.0:
    spice_indate = str(indate.year) + ' B.C. ' + sindate[-17:]
    spice_indate = str(spice_indate)[1:]
else:
    spice_indate = str(indate.year) + ' A.D. ' + sindate[-17:]
'2018 B.C. 03-31 19:33:38.44'
Other functions include: TIMOUT, TPARSE both converting to and from J2000 epoch seconds.
These functions are available in python through spiceypy, install e.g. via pip3 install spiceypy
This is an old question, but I had the same one and found this article announcing datautil, which is designed to handle dates like:
Dates in distant past and future including BC/BCE dates
Dates in a wild variety of formats: Jan 1890, January 1890, 1st Dec 1890, Spring 1890 etc
Dates of varying precision: e.g. 1890, 1890-01 (i.e. Jan 1890), 1890-01-02
Imprecise dates: c1890, 1890?, fl 1890 etc
Install is just
pip install datautil
I explored it for only a few minutes so far, but have noted that it doesn't accept str as an argument (only unicode) and it implements its own date class (Flexidate, 'a slightly extended version of ISO8601'), which is sort of useful maybe.
>>> from datautil.date import parse
>>> parse('Jan 1890')
error: 'str' object has no attribute 'read'
>>> fd = parse(u'Jan 1890')
<class 'datautil.date.FlexiDate'> 1890-01
>>> fd.as_datetime()
datetime.datetime(1890, 1, 1, 0, 0)
>>> bc = parse(u'2000BC')
<class 'datautil.date.FlexiDate'> -2000
but alas...
>>> bc.as_datetime()
ValueError: year is out of range
Unfortunately for me, I was looking for something that could handle dates with "circa" (c., ca, ca., circ. or cca.)
>>> ca = parse(u'ca 1900')
<class 'datautil.date.FlexiDate'> [UNPARSED: ca 1900]
Oh well - I guess I can always send a pull request ;-)
