The code below returns False. What is a column that has been transformed using pd.to_datetime an instance of?
import datetime
import pandas as pd
d=pd.DataFrame({'time':['2021-03-23', '2022-03-21', '2022-08-18']})
d['time1']=pd.to_datetime(d.time)
isinstance(d.time1, datetime.date)
If I run:
d.time1.dtype
It just returns...
dtype('<M8[ns]')
I've read the post below about the dtype M8[ns], but I still can't figure out what the column is an instance of:
Difference between data type 'datetime64[ns]' and '<M8[ns]'?
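For what it's worth, a quick check (a minimal sketch building on the code above) shows where the False comes from: the column itself is a pandas Series, while its individual elements are pd.Timestamp objects, and pd.Timestamp subclasses datetime.datetime (and therefore datetime.date), so the isinstance test passes for the elements but not for the Series.
import datetime
import pandas as pd

d = pd.DataFrame({'time': ['2021-03-23', '2022-03-21', '2022-08-18']})
d['time1'] = pd.to_datetime(d.time)

print(isinstance(d.time1, pd.Series))                  # True: the column is a Series
print(type(d.time1.iloc[0]))                           # <class 'pandas._libs.tslibs.timestamps.Timestamp'>
print(isinstance(d.time1.iloc[0], datetime.datetime))  # True: Timestamp subclasses datetime.datetime
print(isinstance(d.time1.iloc[0], datetime.date))      # True, since datetime.datetime subclasses datetime.date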
I have a datetime variable date_var = datetime(2020, 9, 11, 0, 0, 0) and I am trying to populate a DataFrame column with this value for each row. So I did something like df['Time'] = date_var. First, this shows the 'Time' field's datatype as datetime64[ns] and not datetime, and second, it populates the Time field with the value 2020-09-11 instead of 2020-09-11 00:00:00. Am I doing something incorrect?
Thanks
You've done nothing wrong. The fact that it prints as the date without time is just a convention in Pandas for simpler output. You can use df['Time'].dt.strftime('%F %T') if you want the column printed with the time part as well.
Storing datetimes as the Pandas type (datetime64[ns]) is better than storing them as the Python type, because it is more efficient to manipulate (e.g. to add offsets to all of them).
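As an illustration of that point (a small sketch, not from the original thread), offsets can be added to the whole datetime64[ns] column in one vectorised operation:
import pandas as pd

df = pd.DataFrame({'Time': pd.to_datetime(['2020-09-11', '2020-09-12'])})

# Shift every timestamp in the column at once
print(df['Time'] + pd.Timedelta(days=1))
print(df['Time'] + pd.DateOffset(months=1))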
Try this code
import datetime
import pandas as pd
date_var = datetime.datetime(2020,9,11,0,0,0)
df['Time'] = date_var.strftime('%Y-%m-%d %H:%M:%S')
df['Time'] = pd.to_datetime(df['Time'])
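As a small check (a sketch using a hypothetical three-row df, since the asker's DataFrame isn't shown), both the direct assignment from the first answer and the string round-trip above end up as the same datetime64[ns] column, with the midnight time part stored even though the default printout hides it:
import datetime
import pandas as pd

df = pd.DataFrame({'row': [1, 2, 3]})   # hypothetical stand-in for the asker's DataFrame
date_var = datetime.datetime(2020, 9, 11, 0, 0, 0)

df['Time_direct'] = date_var                                                   # direct assignment
df['Time_roundtrip'] = pd.to_datetime(date_var.strftime('%Y-%m-%d %H:%M:%S'))  # string round-trip

print(df.dtypes)                                          # both columns are datetime64[ns]
print((df['Time_direct'] == df['Time_roundtrip']).all())  # True
print(df['Time_direct'].dt.strftime('%Y-%m-%d %H:%M:%S').iloc[0])  # 2020-09-11 00:00:00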
When accessing DataFrame.values, all pd.Timestamp objects are converted to np.datetime64 objects. Why? An np.ndarray containing pd.Timestamp objects can exist, so I don't understand why such an automatic conversion always happens.
Would you know how to prevent it?
Minimal example:
import numpy as np
import pandas as pd
from datetime import datetime
# Let's declare a list with a datetime.datetime object
values = [datetime.now()]
print(type(values[0]))
> <class 'datetime.datetime'>
# Clearly, the datetime.datetime objects became pd.Timestamp once moved to a pd.DataFrame
df = pd.DataFrame(values, columns=['A'])
print(type(df.iloc[0][0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# Just to be sure, let's iterate over each datetime and manually convert them to pd.Timestamp
df['A'] = df['A'].apply(lambda x: pd.Timestamp(x))
print(type(df.iloc[0][0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
# df.values (or series.values in this case) returns an np.ndarray
print(type(df.iloc[0].values))
> <class 'numpy.ndarray'>
# When we check what is the type of elements of the '.values' array,
# it turns out the pd.Timestamp objects got converted to np.datetime64
print(type(df.iloc[0].values[0]))
> <class 'numpy.datetime64'>
# Just to double check, can an np.ndarray contain pd.Timestamps?
timestamp = pd.Timestamp(datetime.now())
timestamps = np.array([timestamp])
print(type(timestamps))
> <class 'numpy.ndarray'>
# Seems like it does. Why the above conversion then?
print(type(timestamps[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
python : 3.6.7.final.0
pandas : 0.25.3
numpy : 1.16.4
Found a workaround - using .array instead of .values (docs)
print(type(df['A'].array[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>
This prevents the conversion and gives me access to the objects I wanted to use.
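A related option on pandas versions that have Series.to_numpy (0.24+): asking for an object-dtype array should also keep the Timestamp objects instead of converting them to np.datetime64. A minimal sketch (not verified against the exact 0.25.3 setup above):
import pandas as pd
from datetime import datetime

df = pd.DataFrame([datetime.now()], columns=['A'])

arr = df['A'].to_numpy(dtype=object)   # request an object array rather than datetime64[ns]
print(type(arr))      # <class 'numpy.ndarray'>
print(type(arr[0]))   # <class 'pandas._libs.tslibs.timestamps.Timestamp'>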
The whole idea behind .values is to:
Return a Numpy representation of the DataFrame. [docs]
I find it logical that a pd.Timestamp is then 'downgraded' to a dtype that is native to NumPy. If it didn't do this, what would be the purpose of .values?
If you do want to keep the pd.Timestamp objects, I would suggest working with the original Series (df.iloc[0]). I don't see any other way, since .values converts to an np.ndarray according to the source on GitHub.
I've always been confused about the interaction between Python's standard library datetime objects and Numpy's datetime objects. The following code gives an error, which baffles me.
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype=np.datetime64)
now = datetime.now()
b[0] = np.datetime64(now)
This gives the following error:
TypeError: Cannot cast NumPy timedelta64 scalar from metadata [us] to according to the rule 'same_kind'
What am I doing wrong here?
np.datetime64 is a class, whereas np.dtype('datetime64[us]') is a NumPy dtype:
import numpy as np
print(type(np.datetime64))
# <class 'type'>
print(type(np.dtype('datetime64[us]')))
# <class 'numpy.dtype'>
Specify the dtype of b using the NumPy dtype, not the class:
from datetime import datetime
import numpy as np
b = np.empty((1,), dtype='datetime64[us]')
# b = np.empty((1,), dtype=np.dtype('datetime64[us]')) # also works
now = datetime.now()
b[0] = np.datetime64(now)
print(b)
# ['2019-05-30T08:55:43.111008']
Note that datetime64[us] is just one of a number of possible dtypes. For
instance, there are datetime64[ns], datetime64[ms], datetime64[s],
datetime64[D], datetime64[Y] dtypes, depending on the desired time
resolution.
datetime.datetime.now() returns a datetime with microsecond resolution, so I chose datetime64[us] to match.
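To make the different resolutions concrete, a small sketch: the unit is inferred from the value you wrap, and you can cast between units with astype:
import numpy as np
from datetime import datetime

ts = np.datetime64(datetime.now())   # unit inferred as [us] from the datetime's microseconds
print(ts.dtype)                      # datetime64[us]
print(ts.astype('datetime64[s]'))    # truncated to whole seconds
print(ts.astype('datetime64[D]'))    # truncated to the calendar day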
I have a DataFrame with two columns of time information. The first is the epoch time in seconds, and the second is the corresponding formatted str time like "2015-06-01T09:00:00+08:00" where "+08:00" denotes the timezone.
I'm aware that time formats are in a horrible mess in Python, and that matplotlib.pyplot seems to only recognise the datetime format. I tried several ways to convert the str time to datetime but none of them would work. When I use pd.to_datetime it will convert to datetime64, and when using pd.Timestamp it converts to Timestamp, and even when I tried using combinations of these two functions, the output would always be either datetime64 or Timestamp but NEVER for once datetime. I also tried the method suggested in this answer. Didn't work. It's kind of driving me up the wall now.
Could anybody kindly figure out a quick way for this? Thanks!
I post a minimal example below:
import matplotlib.pyplot as plt
import time
import pandas as pd
df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"]], columns=["epoch", "strtime"])
# didn't work
df["usable_time"] = pd.to_datetime(df["strtime"])
# didn't work either
df["usable_time"] = pd.to_datetime(df["strtime"].apply(lambda s: pd.Timestamp(s)))
# produced a strange type called "struct_time". Don't think it'd be compatible with pyplot
df["usable_time"] = df["epoch"].apply(lambda x: time.localtime(x))
# attempted to plot with pyplot
df["usable_time"] = pd.to_datetime(df["strtime"])
plt.plot(x=df["usable_time"], y=[0.123])
plt.show()
UPDATE (per comments)
It seems like the confusion here is stemming from the fact that the call to plt.plot() takes positional x/y arguments instead of keyword arguments. In other words, the appropriate signature is:
plt.plot(x, y)
Or, alternately:
plt.plot('x_label', 'y_label', data=obj)
But not:
plt.plot(x=x, y=y)
There's a separate discussion of why this quirk of Pyplot exists here, also see ImportanceOfBeingErnest's comments below.
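Concretely, with the single-row df from the question, either of the working call styles looks like this (a brief sketch; the marker is only there so the lone point is visible):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"]], columns=["epoch", "strtime"])
df["usable_time"] = pd.to_datetime(df["strtime"])

plt.plot(df["usable_time"], [0.123], marker="o")      # positional x, y
# or, equivalently, column names plus the data keyword:
# plt.plot("usable_time", "epoch", data=df, marker="o")
plt.show()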
Original
This isn't really an answer, more of a demonstration that Pyplot doesn't have an issue with Pandas datetime data. I've added an extra row to df to make the plot clearer:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"],
[1433130400, "2015-07-01T09:00:00+08:00"]],
columns=["epoch", "strtime"])
df["usable_time"] = pd.to_datetime(df["strtime"])
df.dtypes
epoch int64
strtime object
usable_time datetime64[ns]
dtype: object
plt.plot(df.usable_time, df.epoch)
pd.__version__ # '0.23.3'
matplotlib.__version__ # '2.2.2'
You can use to_pydatetime (from the dt accessor or Timestamp) to get back native datetime objects if you really want to, e.g.:
pd.to_datetime(df["strtime"]).dt.to_pydatetime()
This will return an array of native datetime objects:
array([datetime.datetime(2015, 6, 1, 1, 0)], dtype=object)
However, pyplot seems to be able to work with pandas datetime series.
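Since the question also has the raw epoch column in seconds, one more option (a sketch, assuming the epochs are UTC seconds and using Asia/Shanghai as an example +08:00 zone): pd.to_datetime with unit='s' converts it directly, and tz_convert moves it into the timezone carried by the strings:
import pandas as pd

df = pd.DataFrame([[1433120400, "2015-06-01T09:00:00+08:00"]], columns=["epoch", "strtime"])

df["from_epoch"] = pd.to_datetime(df["epoch"], unit="s", utc=True).dt.tz_convert("Asia/Shanghai")
print(df["from_epoch"])   # 2015-06-01 09:00:00+08:00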
I'm trying to get a year (1980) into a datetime column in pandas, but I'm getting an error. Anybody know what I'm doing wrong?
import pandas as pd
import datetime
df = pd.read_csv(r'd:\downloads\googlebooks-eng-all-1gram-20120701-a', sep='\t',
                 header=None,
                 names=["word", "year", "occurred", "books"],
                 dtype={"word": "str", "year": "datetime", "occurred": "int64", "books": "int64"},
                 parse_dates=True)
df.head()
The error is
TypeError: data type "datetime" not understood
This seems to be a well-documented bug; the suggestion I can give for now is to:
Remove dtype from pd.read_csv() -> read_csv() automatically infers the data type of the columns.
Then run df.dtypes to ensure you have your preferred datatypes.
Now, to explicitly convert the column year to datetime, you can use the method pd.to_datetime. Since the values are bare integer years (e.g. 1980), pass them as strings with an explicit format, otherwise pandas interprets the integers as nanoseconds since the epoch. For example:
df['year'] = pd.to_datetime(df['year'].astype(str), format='%Y')
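Putting it together, a minimal sketch of the full flow under these suggestions (the file path is the one from the question; the other columns keep their explicit dtypes, only year is converted afterwards):
import pandas as pd

df = pd.read_csv(r'd:\downloads\googlebooks-eng-all-1gram-20120701-a', sep='\t',
                 header=None,
                 names=["word", "year", "occurred", "books"],
                 dtype={"word": "str", "occurred": "int64", "books": "int64"})

# Convert the integer years (e.g. 1980) to proper datetimes afterwards
df['year'] = pd.to_datetime(df['year'].astype(str), format='%Y')
print(df.dtypes)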
You need to import the datetime package.