Convert datetime ns to daily format - python

I have a column in my dataframe in this format:
2013-01-25 00:00:00+00:00
non-null datetime64[ns, UTC]
I would like to convert this to daily format, like this:
2013-01-25
I tried this approach, but have been receiving an error:
df['date_column'].date()
AttributeError: 'Series' object has no attribute 'date'
The error message is not quite clear to me, because the object should be a datetime object according to df.info()
Can anyone suggest an approach of how to do this?

In short: it is not advisable to convert to date objects, since you then lose a lot of functionality for inspecting and working with the dates. It might be better to just dt.floor(..) [pandas-doc] or dt.normalize(..) [pandas-doc].
You can convert a series of strings with pd.to_datetime(..) [pandas-doc], for example:
>>> pd.to_datetime(pd.Series(['2013-01-25 00:00:00+00:00']))
0 2013-01-25
dtype: datetime64[ns]
We can then later convert this to date objects with .dt.date [pandas-doc]:
>>> pd.to_datetime(pd.Series(['2013-01-25 00:00:00+00:00'])).dt.date
0 2013-01-25
dtype: object
Note that a date is not a native NumPy type, and thus it will use a Python date(..) object. A disadvantage of this is that you can no longer process the objects as datetime-like objects, so the Series more or less loses a lot of functionality.
It might be better to just dt.floor(..) [pandas-doc] to the day, and thus keep it a datetime64[ns] object:
>>> pd.to_datetime(pd.Series(['2013-01-25 00:00:00+00:00'])).dt.floor(freq='d')
0 2013-01-25
dtype: datetime64[ns]
We can use dt.normalize(..) [pandas-doc] as well. This just sets the time component to 0:00:00, and leaves the timezone unaffected:
>>> pd.to_datetime(pd.Series(['2013-01-25 00:00:00+00:00'])).dt.normalize()
0 2013-01-25
dtype: datetime64[ns]
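As a small extra sketch (with a made-up, non-midnight time), normalizing a tz-aware series keeps both the dtype and the timezone while resetting the time:
>>> pd.to_datetime(pd.Series(['2013-01-25 13:45:00+00:00']), utc=True).dt.normalize()
0   2013-01-25 00:00:00+00:00
dtype: datetime64[ns, UTC]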

Related

Modifying output of dtypes

Currently, dataframe.dtypes outputs:
age int64
gender object
date datetime64[ns]
time datetime64[ns]
dtype: object
I want the output to only have date and time columns, or conversely, only the columns with type datetime64[ns], i.e. the output should be:
date datetime64[ns]
time datetime64[ns]
dtype: object
I tried various methods such as using dataframe.select_dtypes, but none of them exactly match the required output.
You may select a subset of your df with fewer columns:
df[['date', 'time']].dtypes
To be more generic and get those with datetime64 type, do
import numpy as np
tt = df.dtypes
print(tt[tt.apply(lambda x: np.issubdtype(x, np.datetime64))])
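Alternatively, since the question mentions select_dtypes: selecting the datetime columns first and then asking for their dtypes should give the same output. A minimal sketch, assuming tz-naive datetime64[ns] columns:
import pandas as pd

df = pd.DataFrame({'age': [25], 'gender': ['F'],
                   'date': pd.to_datetime(['2014-07-30']),
                   'time': pd.to_datetime(['2014-07-30 12:00:00'])})

# keep only the datetime64 columns, then show their dtypes
print(df.select_dtypes(include=['datetime64']).dtypes)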

Convert Pandas Timestamp with Time-Offset Column

I get daily reports which include a timestamp column and a UTC Offset column. Using pandas, I can convert the int Timestamp into a datetime64 type. I unfortunately can't figure out how to use the offset.
Since the 'UTC Offset' column comes in as a string I have tried converting it to an int to help, but can't figure out how to use it. I tried using pd.offsets.Hour, but that can't use the column of offsets.
df = pd.read_csv(filename, encoding='utf-8', delimiter=r'\t',engine='python')
df['Timestamp'] = pd.to_datetime(df[r'Stream Timestamp'],utc=True, unit='s')
print(df[:3][r'Stream Timestamp'])
0 2019-05-01 14:21:37+00:00
1 2019-05-01 15:50:12+00:00
Name: Stream Timestamp, dtype: datetime64[ns, UTC]
0 -06:00
1 +01:00
2 -04:00
Name: UTC Offset, dtype: object
df[r"UTC Offset"] = df[r"UTC Offset"].astype(int)
Optimally, I want to do something like this
df[r'Adjusted'] = df[r'Timestamp'] + pd.offsets.Hour(df[r'UTC Offset'])
However I can't seem to figure out how best to reference the column of offsets. I'm a little new to datetime in general, but any help would be appreciated!
Maybe not the prettiest, but since the column is read in from the csv as an object (string), you can shave off the old offset and append the offset column to it as a string. For this to work, all of the timestamps must come in with an offset; if they don't, consider checking whether there is a + or - in the string and going from there.
Then convert to datetime. I included the format parameter in pd.to_datetime just so it would be faster; however, you do not need it if your dataset is small. I am actually surprised at how hard it is to find information on pandas timezones, but maybe check out tzinfo?
I included the intermediate steps in different columns for ease of understanding, but you of course need not do this.
df = pd.DataFrame({'timestamp_str': ['2019-05-01 14:21:37+00:00',
                                     '2019-05-01 15:50:12+00:00',
                                     '2019-05-01 15:50:12+00:00'],
                   'utc_offset': ['-06:00', '+01:00', '-04:00']})
df['timestamp_str_combine'] = df['timestamp_str'].str[:-6] + df['utc_offset']
df['timestamp'] = pd.to_datetime(df['timestamp_str_combine'],
                                 format="%Y-%m-%d %H:%M:%S%z", utc=True)
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
timestamp_str 3 non-null object
utc_offset 3 non-null object
timestamp_str_combine 3 non-null object
timestamp 3 non-null datetime64[ns, UTC]
dtypes: datetime64[ns, UTC](1), object(3)
memory usage: 176.0+ bytes
I prefer to convert to datetime as it's easier to apply offsets than with native pandas time format.
To get the timezone offset:
from datetime import datetime, timedelta
from tzlocal import get_localzone  # pip install tzlocal
millis = 1288483950000
ts = millis * 1e-3
local_dt = datetime.fromtimestamp(ts, get_localzone())
utc_offset = local_dt.utcoffset()
hours_offset = utc_offset / timedelta(hours=1)
Then apply the offset:
df['dt'] = pd.to_datetime(df['timestamp'],infer_datetime_format=True,errors='ignore')
df['dtos'] = df['dt'] + timedelta(hours=hours_offset)
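The snippet above applies a single local offset to every row. If instead each row's own 'UTC Offset' string should be applied, as in the question, one possible sketch (column names mirror the question; the helper below is hypothetical) is to parse the offset strings into Timedeltas and add them:
import pandas as pd

def offset_to_timedelta(s):
    # parse strings like '-06:00' or '+01:00' into a Timedelta
    sign = -1 if s.startswith('-') else 1
    hours, minutes = s.lstrip('+-').split(':')
    return sign * pd.Timedelta(hours=int(hours), minutes=int(minutes))

df = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2019-05-01 14:21:37',
                                 '2019-05-01 15:50:12'], utc=True),
    'UTC Offset': ['-06:00', '+01:00'],
})

# shift each UTC instant by its own row's offset to get local wall-clock time
# (the result keeps the UTC label but shows the local clock value)
offsets = pd.to_timedelta(df['UTC Offset'].map(offset_to_timedelta))
df['Adjusted'] = df['Timestamp'] + offsets
print(df['Adjusted'])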

Remove the time from datetime.datetime in pandas column

I have a pandas column called 'date'
which has values and type like 2014-07-30 00:00:00 <class 'datetime.datetime'>.
I want to remove the time from the date. The end result should be 2014-07-30 in datetime.datetime format.
I tried a bunch of solutions like:
df['PSG Date '] = df['PSG Date '].dt.date
but its giving me error-
AttributeError: Can only use .dt accessor with datetimelike values
I believe you need to_datetime first, and for dates use dt.date:
df['PSG Date '] = pd.to_datetime(df['PSG Date '], errors='coerce').dt.date
If you want datetimes with no times, use dt.floor:
df['PSG Date '] = pd.to_datetime(df['PSG Date '], errors='coerce').dt.floor('d')
First, you should begin with a datetime series; if you don't have one, use pd.to_datetime to force this conversion. This will permit vectorised computations:
df = pd.DataFrame({'col': ['2014-07-30 12:19:22', '2014-07-30 05:52:05',
                           '2014-07-30 20:15:00']})
df['col'] = pd.to_datetime(df['col'])
Next, note you cannot remove time from a datetime series in Pandas. By definition, a datetime series will include both "date" and "time" components.
Normalize time
You can use pd.Series.dt.floor or pd.Series.dt.normalize to reset the time component to 00:00:00:
df['col_floored'] = df['col'].dt.floor('d')
df['col_normalized'] = df['col'].dt.normalize()
print(df['col_floored'].iloc[0]) # 2014-07-30 00:00:00
print(df['col_normalized'].iloc[0]) # 2014-07-30 00:00:00
Convert to datetime.date pointers
You can convert your datetime series to an object series, consisting of datetime.date objects representing dates:
df['col_date'] = df['col'].dt.date
print(df['col_date'].iloc[0]) # 2014-07-30
Since these are not held in a contiguous memory block, operations on df['col_date'] will not be vectorised.
How to check the difference
It's useful to check the dtype for the series we have derived. Notice the one option which "removes" time involves converting your series to object.
Computations will be non-vectorised with such a series, since it consists of pointers to datetime.date objects instead of data in a contiguous memory block.
print(df.dtypes)
col datetime64[ns]
col_date object
col_floored datetime64[ns]
col_normalized datetime64[ns]
You can convert a datetime.datetime to a datetime.date by calling the .date() method of the object, e.g.
import datetime

current_datetime = datetime.datetime.now()
date_only = current_datetime.date()
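To apply the same idea to a whole column of datetime.datetime objects (a sketch reusing the question's 'PSG Date ' column, which has object dtype so the .dt accessor is unavailable), the method can be applied element-wise:
# element-wise .date() on a column of datetime.datetime objects
df['PSG Date '] = df['PSG Date '].apply(lambda d: d.date())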

Python np.busday_count with datetime64[ns] as input

I have a column from a pandas Dataframe that I want to use as input for np.busday_count:
np.busday_count(df['date_from'].tolist(), df['date_to_plus_one'].tolist(), weekmask='1000000')
I have always used .tolist(), but since one of the recent updates this results in an error:
> TypeError: Iterator operand 0 dtype could not be cast from
> dtype('<M8[us]') to dtype('<M8[D]') according to the rule 'safe'
The column df['date_from'] is of type dtype: datetime64[ns].
Any tips or solution for this?
Try using
df['date_from'].dt.date
The column df['date_from'] with datatype dtype: datetime64[ns] contains data like
2018-04-06 00:00:00, i.e. a timestamp.
But np.busday_count takes datetime.date as input, like "2018-04-06".
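A sketch of that idea with made-up dates; an alternative to .dt.date that avoids the object dtype is to cast the nanosecond values down to day resolution with astype('datetime64[D]'), which np.busday_count also accepts:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'date_from': pd.to_datetime(['2018-04-02', '2018-04-09']),
    'date_to_plus_one': pd.to_datetime(['2018-04-10', '2018-04-17']),
})

# day-resolution datetime64 avoids the unsafe-cast TypeError
start = df['date_from'].values.astype('datetime64[D]')
end = df['date_to_plus_one'].values.astype('datetime64[D]')

print(np.busday_count(start, end, weekmask='1000000'))  # counts only Mondays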

Converting into date-time format in pandas?

I need help converting into python/pandas date time format. For example, my times are saved like the following line:
2017-01-01 05:30:24.468911+00:00
.....
2017-05-05 01:51:31.351718+00:00
and I want to know the simplest way to convert this into date time format for essentially performing operations with time (like what is the range in days of my dataset to split up my dataset into chunks by time, what's the time difference from one time to another)? I don't mind losing some of the significance for the times if that makes things easier. Thank you so much!
Timestamp will convert it for you.
>>> pd.Timestamp('2017-01-01 05:30:24.468911+00:00')
Timestamp('2017-01-01 05:30:24.468911+0000', tz='UTC')
Let's say you have a dataframe that includes your timestamp column (let's call it stamp). You can use apply on that column together with Timestamp:
df = pd.DataFrame(
    {'stamp': ['2017-01-01 05:30:24.468911+00:00',
               '2017-05-05 01:51:31.351718+00:00']})
>>> df
stamp
0 2017-01-01 05:30:24.468911+00:00
1 2017-05-05 01:51:31.351718+00:00
>>> df['stamp'].apply(pd.Timestamp)
0 2017-01-01 05:30:24.468911+00:00
1 2017-05-05 01:51:31.351718+00:00
Name: stamp, dtype: datetime64[ns, UTC]
You could also use TimeSeries (note that pd.TimeSeries has been removed in later pandas versions; pd.Series covers it now):
>>> pd.TimeSeries(df.stamp)
0 2017-01-01 05:30:24.468911+00:00
1 2017-05-05 01:51:31.351718+00:00
Name: stamp, dtype: object
Once you have a Timestamp object, it is pretty efficient to manipulate. You can just difference their values, for example.
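For instance, a quick sketch continuing the stamp example above; subtracting two converted timestamps yields a Timedelta:
>>> ts = df['stamp'].apply(pd.Timestamp)
>>> ts.iloc[1] - ts.iloc[0]
Timedelta('123 days 20:21:06.882807')
>>> (ts.iloc[1] - ts.iloc[0]).days
123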
You may also want to have a look at this SO answer, which discusses converting timezone-unaware values to aware ones.
Let's say I have two strings 2017-06-06 and 1944-06-06 and I wanted to get the difference (what Python calls a timedelta) between the two.
First, I'll need to import datetime. Then I'll need to get both of those strings into datetime objects:
>>> a = datetime.datetime.strptime('2017-06-06', '%Y-%m-%d')
>>> b = datetime.datetime.strptime('1944-06-06', '%Y-%m-%d')
That will give us two datetime objects that can be used in arithmetic functions that will return a timedelta object:
>>> c = abs((a-b).days)
This will give us 26663, and days is the largest resolution that timedelta supports: documentation
Since the Pandas tag is there:
df = pd.DataFrame(['2017-01-01 05:30:24.468911+00:00'])
df.columns = ['Datetime']
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S.%f%z', utc=True)
print(df.dtypes)
