Why is pd.Timestamp converted to np.datetime64 when calling '.values'? - python

When accessing DataFrame.values, all pd.Timestamp objects are converted to np.datetime64 objects. Why? An np.ndarray containing pd.Timestamp objects can exist, so I don't understand why such an automatic conversion always happens.
Would you know how to prevent it?
Minimal example:
import numpy as np
import pandas as pd
from datetime import datetime
# Let's declare an array with a datetime.datetime object
values = [datetime.now()]
print(type(values[0]))
> <class 'datetime.datetime'>
# Clearly, the datetime.datetime objects became pd.Timestamp once moved to a pd.DataFrame
df = pd.DataFrame(values, columns=['A'])
print(type(df.iloc[0][0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# Just to be sure, let's iterate over each datetime and manually convert them to pd.Timestamp
df['A'] = df['A'].apply(pd.Timestamp)
print(type(df.iloc[0][0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# df.values (or series.values in this case) returns an np.ndarray
print(type(df.iloc[0].values))
> <class 'numpy.ndarray'>
# When we check what is the type of elements of the '.values' array,
# it turns out the pd.Timestamp objects got converted to np.datetime64
print(type(df.iloc[0].values[0]))
> <class 'numpy.datetime64'>
# Just to double check, can an np.ndarray contain pd.Timestamps?
timestamp = pd.Timestamp(datetime.now())
timestamps = np.array([timestamp])
print(type(timestamps))
> <class 'numpy.ndarray'>
# Seems like it does. Why the above conversion then?
print(type(timestamps[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
python : 3.6.7.final.0
pandas : 0.25.3
numpy : 1.16.4

Found a workaround - using .array instead of .values (docs)
print(type(df['A'].array[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
This prevents the conversion and gives me access to the objects I wanted to use.

The whole idea behind .values is to:
Return a Numpy representation of the DataFrame. [docs]
I find it logical that a pd.Timestamp is then 'downgraded' to a dtype that is native to numpy. If it didn't do this, what would be the purpose of .values?
If you do want to keep the pd.Timestamp dtype, I would suggest working with the original Series (df.iloc[0]). I don't see any other way, since .values uses np.ndarray for the conversion according to the source on GitHub.
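To make the difference concrete, here is a minimal sketch (assuming pandas ≥ 0.24, where .array and .to_numpy were introduced) showing the three accessors side by side:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'A': [datetime(2020, 1, 1)]})

# .values downgrades to the native NumPy dtype
print(type(df['A'].values[0]))
# <class 'numpy.datetime64'>

# .array returns a pandas ExtensionArray whose elements stay pd.Timestamp
print(type(df['A'].array[0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>

# .to_numpy(dtype=object) also yields an object ndarray of pd.Timestamp
print(type(df['A'].to_numpy(dtype=object)[0]))
# <class 'pandas._libs.tslibs.timestamps.Timestamp'>
```

So .values is the only one of the three that forces the NumPy-native representation.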

Related

Instance after pd.to_datetime?

The code below returns False. What instance does a column that has been transformed using pd.to_datetime have?
import datetime
import pandas as pd
d=pd.DataFrame({'time':['2021-03-23', '2022-03-21', '2022-08-18']})
d['time1']=pd.to_datetime(d.time)
isinstance(d.time1, datetime.date)
If I run:
print(d.time1.dtype)
It just returns...
dtype('<M8[ns]')
I've read this post about the dtype M8[ns] but I still can't figure out what instance it has.
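For reference, the distinction is between the Series itself (which is not a date) and its scalar elements, which come back as pd.Timestamp, a subclass of datetime.datetime and therefore of datetime.date. A quick check against the question's own frame:

```python
import datetime
import pandas as pd

d = pd.DataFrame({'time': ['2021-03-23', '2022-03-21', '2022-08-18']})
d['time1'] = pd.to_datetime(d.time)

# The Series object is a pd.Series, not a date
print(isinstance(d.time1, datetime.date))          # False

# but each element is a pd.Timestamp, which subclasses datetime.datetime
print(isinstance(d.time1.iloc[0], pd.Timestamp))   # True
print(isinstance(d.time1.iloc[0], datetime.date))  # True
```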
Difference between data type 'datetime64[ns]' and '<M8[ns]'?

Error filling in empty Numpy array with `np.datetime64` objects

I've always been confused about the interaction between Python's standard library datetime objects and Numpy's datetime objects. The following code gives an error, which baffles me.
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype=np.datetime64)
now = datetime.now()
b[0] = np.datetime64(now)
This gives the following error:
TypeError: Cannot cast NumPy timedelta64 scalar from metadata [us] to according to the rule 'same_kind'
What am I doing wrong here?
np.datetime64 is a class, whereas np.dtype('datetime64[us]') is a NumPy dtype:
import numpy as np
print(type(np.datetime64))
# <class 'type'>
print(type(np.dtype('datetime64[us]')))
# <class 'numpy.dtype'>
Specify the dtype of b using the NumPy dtype, not the class:
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype='datetime64[us]')
# b = np.empty((1,), dtype=np.dtype('datetime64[us]')) # also works
now = datetime.now()
b[0] = np.datetime64(now)
print(b)
# ['2019-05-30T08:55:43.111008']
Note that datetime64[us] is just one of a number of possible dtypes. For
instance, there are datetime64[ns], datetime64[ms], datetime64[s],
datetime64[D], datetime64[Y] dtypes, depending on the desired time
resolution.
datetime.datetime.now() returns a datetime with microsecond resolution,
so I chose datetime64[us] to match.
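A small sketch of how the unit choice affects stored values; casting to a coarser unit silently truncates the extra precision:

```python
import numpy as np

t = np.datetime64('2019-05-30T08:55:43.111008')  # six fractional digits -> [us]
print(t.dtype)
# datetime64[us]

# Casting to coarser units truncates what the unit cannot represent
print(t.astype('datetime64[s]'))
# 2019-05-30T08:55:43
print(t.astype('datetime64[D]'))
# 2019-05-30
```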

How get timedelta64[ns] to work with pandas astype() to cast multiple columns to different dtypes

I'm using pandas .astype() to cast a dict of column names to their correct dtypes. It works for str, int, datetime64[ns], and float but is failing on timedelta64[ns]. When I run this I get ValueError: Could not convert object to NumPy timedelta.
import pandas as pd
import numpy as np
sample_row = pd.DataFrame([['g1',
                            3912841,
                            '2018-09-29 16:03:49',
                            4.040196e+09,
                            '1 days 15:49:38']],
                          columns=['group',
                                   'job_number',
                                   'submission_time',
                                   'maxvmem',
                                   'wait_time'])
sample_row = (sample_row.astype(dtype={'group': 'str',
                                       'job_number': 'int',
                                       'submission_time': 'datetime64[ns]',
                                       'maxvmem': 'float',
                                       'wait_time': 'timedelta64[ns]'}))
I found this answer to a similar question but it seems to suggest I'm using the correct dtype format.
Update: Here's the same code with the suggested change from @hpaulj:
import pandas as pd
import numpy as np
sample_row = pd.DataFrame([['g1',
                            3912841,
                            '2018-09-29 16:03:49',
                            4.040196e+09,
                            pd.Timedelta('1 days 15:49:38')]],
                          columns=['group',
                                   'job_number',
                                   'submission_time',
                                   'maxvmem',
                                   'wait_time'])
sample_row = (sample_row.astype(dtype={'group': 'str',
                                       'job_number': 'int',
                                       'submission_time': 'datetime64[ns]',
                                       'maxvmem': 'float',
                                       'wait_time': 'timedelta64[ns]'}))
To confirm that the dtypes are set correctly:
for i in sample_row.loc[0, sample_row.columns]:
    print(type(i))
Output:
<class 'str'>
<class 'numpy.int32'>
<class 'pandas._libs.tslib.Timestamp'>
<class 'numpy.float64'>
<class 'pandas._libs.tslib.Timedelta'>
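As an aside, a minimal sketch of an alternative that avoids wrapping the literal in pd.Timedelta up front: pd.to_timedelta parses the duration strings after construction:

```python
import pandas as pd

df = pd.DataFrame({'wait_time': ['1 days 15:49:38', '0 days 02:10:00']})

# Parse the duration strings into a proper timedelta64[ns] column
df['wait_time'] = pd.to_timedelta(df['wait_time'])
print(df['wait_time'].dtype)
# timedelta64[ns]
```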

Why does pandas return timestamps instead of datetime objects when calling pd.to_datetime()?

According to the manual, pd.to_datetime() should create a datetime object.
Instead, when I call pd.to_datetime("2012-05-14"), I get a Timestamp object! Calling to_datetime() on that object finally gives me a datetime object.
In [1]: pd.to_datetime("2012-05-14")
Out[1]: Timestamp('2012-05-14 00:00:00', tz=None)
In [2]: t = pd.to_datetime("2012-05-14")
In [3]: t.to_datetime()
Out[3]: datetime.datetime(2012, 5, 14, 0, 0)
Is there an explanation for this unexpected behaviour?
A Timestamp object is the way pandas works with datetimes, so it is a datetime object in pandas. But you expected a datetime.datetime object.
Normally you should not care about this (it is just a matter of a different repr). As long as you are working with pandas, the Timestamp is OK. And even if you really want a datetime.datetime, most things will work (e.g. all methods), and otherwise you can use to_pydatetime to retrieve the datetime.datetime object.
The longer story:
pandas stores datetimes as data with type datetime64 in index/columns (these are not datetime.datetime objects). This is the standard numpy type for datetimes and is more performant than using datetime.datetime objects:
In [15]: df = pd.DataFrame({'A':[dt.datetime(2012,1,1), dt.datetime(2012,1,2)]})
In [16]: df.dtypes
Out[16]:
A datetime64[ns]
dtype: object
In [17]: df.loc[0,'A']
Out[17]: Timestamp('2012-01-01 00:00:00', tz=None)
When retrieving one value of such a datetime column/index, you will see a Timestamp object. This is a more convenient object for working with datetimes (more methods, better representation, etc. than datetime64), and it is a subclass of datetime.datetime, so it has all of its methods.
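A minimal sketch of both points, the subclass relationship and the to_pydatetime escape hatch:

```python
import datetime as dt
import pandas as pd

t = pd.to_datetime("2012-05-14")

# Timestamp subclasses datetime.datetime, so isinstance checks pass
print(isinstance(t, dt.datetime))  # True
print(t.year, t.month, t.day)      # 2012 5 14

# When a plain datetime.datetime is really needed:
print(type(t.to_pydatetime()))
# <class 'datetime.datetime'>
```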

Why does a pandas Series return the elements of my numpy datetime64 array as Timestamps?

I have a pandas Series which can be constructed like the following:
from datetime import datetime
import numpy as np
import pandas as pd
import psycopg2

given_time = datetime(2013, 10, 8, 0, 0, 33, 945109,
                      tzinfo=psycopg2.tz.FixedOffsetTimezone(offset=60, name=None))
given_times = np.array([given_time] * 3, dtype='datetime64[ns]')
column = pd.Series(given_times)
The dtype of my Series is datetime64[ns]
However, when I access it: column[1], somehow it becomes of type pandas.tslib.Timestamp, while column.values[1] stays np.datetime64. Does Pandas auto cast my datetime into Timestamp when accessing the item? Is it slow?
Do I need to worry about the difference in types? As far as I can see, Timestamp seems not to have a timezone (numpy.datetime64('2013-10-08T00:00:33.945109000+0100') -> Timestamp('2013-10-07 23:00:33.945109', tz=None)).
In practice, I would do datetime arithmetic like taking differences or comparing to a timedelta. Does the possible type inconsistency around my operators affect my use case at all?
Besides, am I encouraged to use pd.to_datetime instead of astype(dtype='datetime64') while converting datetime objects?
Pandas time types are built on top of numpy's datetime64.
In order to continue using the pandas operators, you should keep using pd.to_datetime rather than astype(dtype='datetime64'). This is especially true since you'll be taking datetime deltas, which pandas handles admirably, for example with resampling and period definitions.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#up-and-downsampling
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#period
Though I haven't measured, since the pandas times wrap numpy times, I suspect the conversion is quite fast. Alternatively, you can just use pandas' built-in time series definitions and avoid the conversion altogether.
As a rule of thumb, it's good to use the type from the package you'll be using functions from, though. So if you're really only going to use numpy to manipulate the arrays, then stick with numpy date time. Pandas methods => pandas date time.
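As a small sketch of why staying with pandas types pays off for this use case: differences between pandas datetimes come back as timedelta64[ns] and compare naturally with pd.Timedelta:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2013-10-08', '2013-10-10', '2013-10-15']))

# Element-wise differences are a timedelta64[ns] Series
deltas = s.diff()
print(deltas.dtype)
# timedelta64[ns]

# and they compare directly against pd.Timedelta values
print((deltas > pd.Timedelta(days=3)).tolist())
# [False, False, True]
```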
I had read in the documentation somewhere (apologies, can't find link) that scalar values will be converted to timestamps while arrays will keep their data type. For example:
from datetime import date
import pandas as pd
time_series = pd.Series([date(2010 + x, 1, 1) for x in range(5)])
time_series = time_series.apply(pd.to_datetime)
so that:
In[1]:time_series
Out[1]:
0 2010-01-01
1 2011-01-01
2 2012-01-01
3 2013-01-01
4 2014-01-01
dtype: datetime64[ns]
and yet:
In[2]:time_series.iloc[0]
Out[2]:Timestamp('2010-01-01 00:00:00')
while:
In[3]:time_series.values[0]
Out[3]:numpy.datetime64('2009-12-31T19:00:00.000000000-0500')
because iloc requests a scalar from pandas (type conversion to Timestamp) while values requests the full numpy array (no type conversion).
There is similar behavior for a series of length one. Additionally, referencing more than one element in a slice (i.e. iloc[1:10]) will return a series, which will always keep its datatype.
I'm unsure as to why pandas behaves this way.
In[4]: pd.__version__
Out[4]: '0.15.2'
