Error filling in empty Numpy array with `np.datetime64` objects - python

I've always been confused about the interaction between Python's standard library datetime objects and Numpy's datetime objects. The following code gives an error, which baffles me.
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype=np.datetime64)
now = datetime.now()
b[0] = np.datetime64(now)
This gives the following error:
TypeError: Cannot cast NumPy timedelta64 scalar from metadata [us] to according to the rule 'same_kind'
What am I doing wrong here?

np.datetime64 is a class, whereas np.dtype('datetime64[us]') is a NumPy dtype:
import numpy as np
print(type(np.datetime64))
# <class 'type'>
print(type(np.dtype('datetime64[us]')))
# <class 'numpy.dtype'>
Specify the dtype of b using the NumPy dtype, not the class:
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype='datetime64[us]')
# b = np.empty((1,), dtype=np.dtype('datetime64[us]')) # also works
now = datetime.now()
b[0] = np.datetime64(now)
print(b)
# ['2019-05-30T08:55:43.111008']
Note that datetime64[us] is just one of a number of possible dtypes. For
instance, there are datetime64[ns], datetime64[ms], datetime64[s],
datetime64[D], datetime64[Y] dtypes, depending on the desired time
resolution.
datetime.datetime.now() returns a datetime with microsecond resolution,
so I chose datetime64[us] to match.
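As a minimal sketch of the difference (the variable names and the fixed timestamp are illustrative, not from the question): passing the class np.datetime64 as a dtype gives a generic datetime64 dtype with no unit, while the string 'datetime64[us]' pins the unit. An existing array can then be converted to another resolution with astype.

```python
from datetime import datetime
import numpy as np

# Passing the class gives a generic datetime64 dtype with no unit attached.
generic = np.empty((1,), dtype=np.datetime64)
print(generic.dtype.name)   # datetime64

# Passing a dtype string pins the unit to microseconds.
b = np.empty((1,), dtype='datetime64[us]')
print(b.dtype.name)         # datetime64[us]

# A fixed timestamp (illustrative value) keeps the example reproducible.
stamp = datetime(2019, 5, 30, 8, 55, 43, 111008)
b[0] = np.datetime64(stamp)

# astype converts to a coarser unit, dropping the sub-second part.
seconds = b.astype('datetime64[s]')
print(seconds)
```

Choosing the unit up front avoids relying on NumPy to infer one from the first value assigned.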

Related

Instance after pd.to_datetime?

The code below returns False. What type is a column an instance of after it has been converted with pd.to_datetime?
import datetime
import pandas as pd
d=pd.DataFrame({'time':['2021-03-23', '2022-03-21', '2022-08-18']})
d['time1']=pd.to_datetime(d.time)
isinstance(d.time1, datetime.date)
If I run:
print(d.time1.dtype)
It just returns...
dtype('<M8[ns]')
I've read this post about the dtype M8[ns] but I still can't figure out what instance it has.
Difference between data type 'datetime64[ns]' and '<M8[ns]'?
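One way to look at this, as a sketch rather than a definitive answer: the column is a pandas Series, so isinstance against datetime.date fails for the column as a whole, while each element is a pd.Timestamp, which subclasses datetime.datetime. The dtype check via pd.api.types.is_datetime64_any_dtype is an addition here and does not appear in the question.

```python
import datetime
import pandas as pd

d = pd.DataFrame({'time': ['2021-03-23', '2022-03-21', '2022-08-18']})
d['time1'] = pd.to_datetime(d['time'])

# The column itself is a pandas Series, not a date object.
print(isinstance(d['time1'], pd.Series))                 # True

# Each element is a pd.Timestamp, which subclasses datetime.datetime
# (and therefore datetime.date as well).
first = d['time1'].iloc[0]
print(isinstance(first, pd.Timestamp))                   # True
print(isinstance(first, datetime.date))                  # True

# To test the column as a whole, check its dtype rather than its instance.
print(pd.api.types.is_datetime64_any_dtype(d['time1']))  # True
```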

Why is pd.Timestamp converted to np.datetime64 when calling '.values'?

When accessing DataFrame.values, all pd.Timestamp objects are converted to np.datetime64 objects. Why? An np.ndarray containing pd.Timestamp objects can exist, so I don't understand why this automatic conversion always happens.
Would you know how to prevent it?
Minimal example:
import numpy as np
import pandas as pd
from datetime import datetime
# Let's declare an array with a datetime.datetime object
values = [datetime.now()]
print(type(values[0]))
> <class 'datetime.datetime'>
# Clearly, the datetime.datetime objects became pd.Timestamp once moved to a pd.DataFrame
df = pd.DataFrame(values, columns=['A'])
print(type(df.iloc[0][0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# Just to be sure, let's iterate over each datetime and manually convert them to pd.Timestamp
# (the result must be assigned back, otherwise the conversion is discarded)
df['A'] = df['A'].apply(lambda x: pd.Timestamp(x))
print(type(df.iloc[0][0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# df.values (or series.values in this case) returns an np.ndarray
print(type(df.iloc[0].values))
> <class 'numpy.ndarray'>
# When we check what is the type of elements of the '.values' array,
# it turns out the pd.Timestamp objects got converted to np.datetime64
print(type(df.iloc[0].values[0]))
> <class 'numpy.datetime64'>
# Just to double check, can an np.ndarray contain pd.Timestamps?
timestamp = pd.Timestamp(datetime.now())
timestamps = np.array([timestamp])
print(type(timestamps))
> <class 'numpy.ndarray'>
# Seems like it does. Why the above conversion then?
print(type(timestamps[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
python : 3.6.7.final.0
pandas : 0.25.3
numpy : 1.16.4
Found a workaround - using .array instead of .values (docs)
print(type(df['A'].array[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
This prevents the conversion and gives me access to the objects I wanted to use.
The whole idea behind .values is to:
Return a Numpy representation of the DataFrame. [docs]
I find it logical that a pd.Timestamp is then 'downgraded' to a dtype that is native to numpy. If it didn't do this, what would be the purpose of .values?
If you do want to keep the pd.Timestamp dtype I would suggest working with the original Series (df.iloc[0]). I don't see any other way, since .values converts through np.ndarray according to the source on GitHub.
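The three access paths discussed in this thread can be compared side by side. This is a small sketch with an illustrative one-row frame; it assumes pandas 0.24 or later, where Series.array and Series.to_numpy are available. The to_numpy(dtype=object) call is an addition not mentioned in the answers above.

```python
from datetime import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [datetime(2020, 1, 2, 3, 4, 5)]})

# .values returns NumPy's native datetime64 representation.
print(type(df['A'].values[0]))

# .array keeps the pandas extension array, whose elements are pd.Timestamp.
print(type(df['A'].array[0]))

# to_numpy(dtype=object) builds an object ndarray holding pd.Timestamp items.
as_objects = df['A'].to_numpy(dtype=object)
print(type(as_objects[0]))
```

An object array loses the vectorized datetime64 operations, so it is mainly useful when the pd.Timestamp methods themselves are needed.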

numpy datetime and pandas datetime

I'm confused by the interoperation between numpy and pandas date objects (or maybe just by numpy's datetime64 in general).
I was trying to count business days using numpy's built-in functionality like so:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64("28/03/2016")])
However, numpy apparently can't deal with the inverted date format:
ValueError: Error parsing datetime string "28/03/2016" at position 2
To get around this, I thought I'd just use pandas to_datetime, which can. However:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64(pd.to_datetime("28/03/2016"))])
ValueError: Cannot safely convert provided holidays input into an array of dates
Searching around for a bit, it seemed that this was caused by the fact that the chaining of to_datetime and np.datetime64 results in a datetime64[us] object, which apparently the busday_count function cannot accept (is this intended behaviour or a bug?). Thus, my next attempt was:
np.busday_count("2016-03-01", "2016-03-31", holidays=[np.datetime64(pd.Timestamp("28"), "D")])
But:
TypeError: Cannot cast datetime.datetime object from metadata [us] to [D] according to the rule 'same_kind'
And that's me out - why are there so many incompatibilities between all these datetime formats? And how can I get around them?
I've been having a similar issue using np.is_busday().
The unit of the datetime64 dtype is vital to get right. According to the numpy datetime docs, you can set the unit to D (days).
This works:
my_holidays=np.array([datetime.datetime.strptime(x,'%m/%d/%y') for x in holidays.Date.values], dtype='datetime64[D]')
day_flags['business_day'] = np.is_busday(days,holidays=my_holidays)
Whereas this throws the same error you got:
my_holidays=np.array([datetime.datetime.strptime(x,'%m/%d/%y') for x in holidays.Date.values], dtype='datetime64')
The only difference is specifying the type of datetime64.
dtype='datetime64[D]'
vs
dtype='datetime64'
Docs are here:
https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.datetime.html
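Applied to the question's own dates, the same idea might look like the sketch below. It assumes the holiday strings are in day-first format and parses them with the standard library instead of pandas; raw_holidays is an illustrative name.

```python
from datetime import datetime
import numpy as np

# Parse the day-first string with the standard library, keep only the date,
# and build the holidays array with an explicit day unit.
raw_holidays = ["28/03/2016"]  # illustrative input
holidays = np.array(
    [datetime.strptime(s, "%d/%m/%Y").date() for s in raw_holidays],
    dtype="datetime64[D]",
)

without_holiday = np.busday_count("2016-03-01", "2016-03-31")
with_holiday = np.busday_count("2016-03-01", "2016-03-31", holidays=holidays)
print(without_holiday, with_holiday)  # 22 21
```

The end date is exclusive, so the count covers 1 March through 30 March; 28 March 2016 is a Monday, which is why the holiday removes exactly one business day.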
I had the same issue while using np.busday_count. I later figured out that the problem was the hours, minutes, seconds, and milliseconds that get added when converting to a datetime object or a numpy datetime object.
I converted to date objects instead, which carry only the date and no hours, minutes, seconds, or milliseconds.
The following was my code:
holidays_list.json file:
{
"holidays_2019": [
"04-Mar-2019",
"21-Mar-2019",
"17-Apr-2019",
"19-Apr-2019",
"29-Apr-2019",
"01-May-2019",
"05-Jun-2019",
"12-Aug-2019",
"15-Aug-2019",
"02-Sep-2019",
"10-Sep-2019",
"02-Oct-2019",
"08-Oct-2019",
"28-Oct-2019",
"12-Nov-2019",
"25-Dec-2019"
],
"format": "%d-%b-%Y"
}
code file:
import json
import datetime
import numpy as np
with open('holidays_list.json', 'r') as infile:
    data = json.loads(infile.read())
# the following is where I convert the datetime object to date
holidays = list(map(lambda x: datetime.datetime.strptime(
    x, data['format']).date(), data['holidays_2019']))
start_date = datetime.datetime.today().date()
end_date = start_date + datetime.timedelta(days=30)
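# NOTE: this overwrites the list loaded from the JSON file with a single
# test holiday (tomorrow); remove this line to use the loaded holidays
holidays = [start_date + datetime.timedelta(days=1)]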
print(np.busday_count(start_date, end_date, holidays=holidays))

Problems with timezone in numpy.datetime64

I'm a bit confused about how numpy handles timezones. If I create a datetime object from just a date, it seems to use the Zulu (UTC) timezone. If I also include a time, it uses my current timezone. If I then manipulate these objects, e.g. add a timedelta, the results are different:
import numpy as np
a = np.datetime64('2015-04-22')
b = np.datetime64('2015-04-22T00:00')
delta = np.timedelta64(1,'h')
print(a+delta,b+delta)
I must ensure that all values are in the same timezone, so my question is: how can I ensure that a user who initializes these dates doesn't mix plain dates with dates that include a time?
If you specify Zulu in the datetime that includes a time, you'll get uniform data.
In [30]: b = np.datetime64('2015-04-22T00:00Z')
In [31]: b + delta
Out[31]: numpy.datetime64('2015-04-22T03:00+0200')
In [32]: a + delta
Out[32]: numpy.datetime64('2015-04-22T03:00+0200','h')
http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html#basic-datetimes
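Another way to keep user input uniform is to coerce every value to one unit on entry. The sketch below assumes NumPy 1.11 or later, where datetime64 is timezone-naive and no local offset is applied; to_seconds is a hypothetical helper name, not a NumPy function.

```python
import numpy as np

def to_seconds(value):
    """Hypothetical helper: coerce a date or datetime string to second resolution."""
    return np.datetime64(value, 's')

a = to_seconds('2015-04-22')
b = to_seconds('2015-04-22T00:00')
delta = np.timedelta64(1, 'h')

# Both values now share the same unit, so arithmetic gives identical results.
print(a.dtype, b.dtype)
print(a + delta == b + delta)
```

Routing all input through a single function like this means a date without a time and a date with a time can no longer end up with different resolutions.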

pandas handling of numpy timedelta64[ms]

>>> import pandas as pd
>>> pd.__version__
'0.11.0'
>>> import numpy as np
>>> np.__version__
'1.7.1'
>>> d={'a':np.array([68614867, 72200835], dtype=np.dtype('timedelta64[ms]'))}
>>> d['a'][0]
numpy.timedelta64(68614867,'ms')
>>> df = pd.DataFrame.from_dict(d)
>>> print df
a
0 00:00:00.068615
1 00:00:00.072201
It looks like it is interpreting the values in the underlying int64 as ns not ms. Is this a bug in pandas' handling of timedelta64[ms] types?
Timedelta handling is still a work in progress; see this issue: https://github.com/pydata/pandas/issues/3009
The main issue is that timedeltas are broken in numpy 1.6.2.
Passing arbitrary timedelta dtypes at creation is not supported yet. As
a workaround you can do the following, since the ONLY dtype supported at the moment is the
internal timedelta64[ns] (this is exactly how datetime64[ns] works, by the way). Pandas
converts to an internal representation and then you do what you want.
(This solution is ONLY good for numpy >= 1.7.)
In [22]: d['a'].astype('timedelta64[ns]')
Out[22]: array([68614867000000, 72200835000000], dtype='timedelta64[ns]')
In [23]: DataFrame(dict(a = d['a'].astype('timedelta64[ns]')))
Out[23]:
a
0 19:03:34.867000
1 20:03:20.835000
In [24]: DataFrame(dict(a = d['a'].astype('timedelta64[ns]'))).dtypes
Out[24]:
a timedelta64[ns]
dtype: object
What is the final goal you are trying to accomplish?
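For reference, the explicit nanosecond cast from the workaround above can be written as a self-contained Python 3 sketch. The pd.to_timedelta call with unit='ms' is an alternative added here, not part of the original answer, and assumes a reasonably recent pandas.

```python
import numpy as np
import pandas as pd

ms = np.array([68614867, 72200835], dtype='timedelta64[ms]')

# Explicit cast to nanoseconds, as in the workaround above.
ns = ms.astype('timedelta64[ns]')
df = pd.DataFrame({'a': ns})
print(df['a'].iloc[0])

# Alternative: start from plain integers and state the unit explicitly.
td = pd.to_timedelta([68614867, 72200835], unit='ms')
print(td[0] == df['a'].iloc[0])
```

Both routes yield 19 hours, 3 minutes and 34.867 seconds for the first value, matching the corrected output shown in the answer.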
