Truncating milliseconds out of DateTimeIndex - python

When I use pandas.date_range(), I sometimes have timestamp that have lots of milliseconds that I don't wish to keep.
Suppose I do...
import pandas as pd
dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)
>>> dr
DatetimeIndex([ '2011-01-01 00:00:00',
'2011-01-01 03:25:42.857142784',
'2011-01-01 06:51:25.714285824',
'2011-01-01 10:17:08.571428608',
'2011-01-01 13:42:51.428571392',
'2011-01-01 17:08:34.285714176',
'2011-01-01 20:34:17.142857216',
'2011-01-02 00:00:00',
'2011-01-02 03:25:42.857142784',
'2011-01-02 06:51:25.714285824',
'2011-01-02 10:17:08.571428608',
'2011-01-02 13:42:51.428571392',
'2011-01-02 17:08:34.285714176',
'2011-01-02 20:34:17.142857216',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
To ignore the currend miliseconds, I am forced to do this.
>>> t = []
>>> for item in dr:
... idx = str(item).find('.')
... if idx != -1:
... item = str(item)[:idx]
... t.append(pd.to_datetime(item))
...
>>> t
[Timestamp('2011-01-01 00:00:00'),
Timestamp('2011-01-01 03:25:42'),
Timestamp('2011-01-01 06:51:25'),
Timestamp('2011-01-01 10:17:08'),
Timestamp('2011-01-01 13:42:51'),
Timestamp('2011-01-01 17:08:34'),
Timestamp('2011-01-01 20:34:17'),
Timestamp('2011-01-02 00:00:00'),
Timestamp('2011-01-02 03:25:42'),
Timestamp('2011-01-02 06:51:25'),
Timestamp('2011-01-02 10:17:08'),
Timestamp('2011-01-02 13:42:51'),
Timestamp('2011-01-02 17:08:34'),
Timestamp('2011-01-02 20:34:17'),
Timestamp('2011-01-03 00:00:00')]
Is there a better way ?
I already tried this...
dr = [ pd.to_datetime(item, format='%Y-%m-%d %H:%M:%S') for item in dr ]
But it doesn't do anything.
(pd.date_range('2011-01-01', '2011-01-03', periods=15)).astype('datetime64[s]')
But it says it can't cast it.
dr = (dr.to_series()).apply(lambda x:x.replace(microseconds=0))
But this line doesn't solve my problem, as...
2018-04-17 15:07:04.777777664 gives --> 2018-04-17 15:07:04.000000664

I believe need DatetimeIndex.floor:
print (dr.floor('S'))
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 03:25:42',
'2011-01-01 06:51:25', '2011-01-01 10:17:08',
'2011-01-01 13:42:51', '2011-01-01 17:08:34',
'2011-01-01 20:34:17', '2011-01-02 00:00:00',
'2011-01-02 03:25:42', '2011-01-02 06:51:25',
'2011-01-02 10:17:08', '2011-01-02 13:42:51',
'2011-01-02 17:08:34', '2011-01-02 20:34:17',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)

Related

How to calculate time difference between two dates in pandas Dataframe

I have a dataframe which is having multiple rows with column date. date column is having date and time. not each row has incremental time so I want to calculate after each row how much was the time difference between current and previous date in seconds.
import pandas as pd
data = pd.date_range('1/1/2011', periods = 10, freq ='H')
In the above snippet time difference after each step is 1hr which means 3600 seconds so I want a list of tuple having [(<prev date time>, <current_datetime>, <time_difference>),.....].
I want a list of tuple having [(prev date time, current_datetime,
time_difference),.....]
In this case, use list with zip and compute the time difference with tolal_seconds :
data = pd.date_range("1/1/2011", periods = 10, freq ="H")
​
L = list(zip(data.shift(), # <- previous time
data, # <- current time
(data.shift() - data).total_seconds())) # <- time diff
NB : If you manipulate a dataframe, you need to replace data by df["date_column"].
​
Output :
print(L)
[(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 00:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
...
You can achieve this by using diff function in Pandas to calculate the time difference between consecutive rows in the data column. Here's an example:
df = pd.DataFrame({"date": pd.date_range("1/1/2011", periods=10, freq="H")})
# Calculate the time difference between consecutive rows in seconds
df["time_diff"] = df["date"].diff().dt.total_seconds()
# Create a list of tuples
result = [(df.iloc[i-1]["date"], row["date"], row["time_diff"]) for i, row in df[1:].iterrows()]
df:
date time_diff
0 2011-01-01 00:00:00 NaN
1 2011-01-01 01:00:00 3600.0
2 2011-01-01 02:00:00 3600.0
3 2011-01-01 03:00:00 3600.0
4 2011-01-01 04:00:00 3600.0
5 2011-01-01 05:00:00 3600.0
6 2011-01-01 06:00:00 3600.0
7 2011-01-01 07:00:00 3600.0
8 2011-01-01 08:00:00 3600.0
9 2011-01-01 09:00:00 3600.0
result:
[(Timestamp('2011-01-01 00:00:00'), Timestamp('2011-01-01 01:00:00'), 3600.0),
(Timestamp('2011-01-01 01:00:00'), Timestamp('2011-01-01 02:00:00'), 3600.0),
(Timestamp('2011-01-01 02:00:00'), Timestamp('2011-01-01 03:00:00'), 3600.0),
(Timestamp('2011-01-01 03:00:00'), Timestamp('2011-01-01 04:00:00'), 3600.0),
(Timestamp('2011-01-01 04:00:00'), Timestamp('2011-01-01 05:00:00'), 3600.0),
(Timestamp('2011-01-01 05:00:00'), Timestamp('2011-01-01 06:00:00'), 3600.0),
(Timestamp('2011-01-01 06:00:00'), Timestamp('2011-01-01 07:00:00'), 3600.0),
(Timestamp('2011-01-01 07:00:00'), Timestamp('2011-01-01 08:00:00'), 3600.0),
(Timestamp('2011-01-01 08:00:00'), Timestamp('2011-01-01 09:00:00'), 3600.0)]
It's possible to do this with list comprehension. [:-1] is required because we get a list of 10 intervals just using shift, but there are N-1 intervals between N points.
result = [(i[0],
i[1],
(i[1] - i[0]).total_seconds())
for i in list(zip(data, data.shift(1)))[:-1]]
print(result)
[(Timestamp('2011-01-01 00:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 05:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 06:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 06:00:00', freq='H'),
Timestamp('2011-01-01 07:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 07:00:00', freq='H'),
Timestamp('2011-01-01 08:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 08:00:00', freq='H'),
Timestamp('2011-01-01 09:00:00', freq='H'),
3600.0)]

why inclusive is not working in pandas date_range function?

pandas.date_range() to not return correct start and end timestamp when dates are set as a quarter string value.
start_date = '2021Q4'
end_date = '2024Q1'
dates=pd.date_range(start_date, end_date, freq='Q', inclusive='both').to_list()
dates
[Timestamp('2021-12-31 00:00:00', freq='Q-DEC'),
Timestamp('2022-03-31 00:00:00', freq='Q-DEC'),
Timestamp('2022-06-30 00:00:00', freq='Q-DEC'),
Timestamp('2022-09-30 00:00:00', freq='Q-DEC'),
Timestamp('2022-12-31 00:00:00', freq='Q-DEC'),
Timestamp('2023-03-31 00:00:00', freq='Q-DEC'),
Timestamp('2023-06-30 00:00:00', freq='Q-DEC'),
Timestamp('2023-09-30 00:00:00', freq='Q-DEC'),
Timestamp('2023-12-31 00:00:00', freq='Q-DEC')]
After I tried to fix your problem I found following solution:
The problem here is that date_range does not identify your strings correctly.
When you convert your "quarter"-strings to pandas.Period() and afterwards to pandas.Timestamp() this will work as expected:
import pandas as pd
from pprint import pprint
start_date = '2021Q4'
end_date = '2024Q1'
first_date_of_period = pd.Period(start_date, 'Q').to_timestamp('D', 'S') # D for day, S for start
last_date_of_period = pd.Period(end_date, 'Q').to_timestamp('D', 'E') # D for day, E for end
print(first_date_of_period)
# >> 2021-10-01 00:00:00
print(last_date_of_period)
# >> 2024-03-31 23:59:59.999999999
dates=pd.date_range(start=first_date_of_period, end=last_date_of_period, freq='Q').to_list()
pprint(dates)
# >> [Timestamp('2021-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-03-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-06-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-09-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-03-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-06-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-09-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2024-03-31 00:00:00', freq='Q-DEC')]
I used pandas v1.4.3 (last stable version).

How to construct non-leap datetime list in python?

I have a user case that I need always the non-leap calendar whatever the year is a leap year or not. I want to construct a 6-hourly datetime list for year 2000, for example:
import datetime
import pandas as pa
tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000,1,1,0,)
ts = [dt+i*tdelta for i in range(1460)]
pa.DatetimeIndex(ts)
With this block of code, I get the result:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 06:00:00',
'2000-01-01 12:00:00', '2000-01-01 18:00:00',
'2000-01-02 00:00:00', '2000-01-02 06:00:00',
'2000-01-02 12:00:00', '2000-01-02 18:00:00',
'2000-01-03 00:00:00', '2000-01-03 06:00:00',
...
'2000-12-28 12:00:00', '2000-12-28 18:00:00',
'2000-12-29 00:00:00', '2000-12-29 06:00:00',
'2000-12-29 12:00:00', '2000-12-29 18:00:00',
'2000-12-30 00:00:00', '2000-12-30 06:00:00',
'2000-12-30 12:00:00', '2000-12-30 18:00:00'],
dtype='datetime64[ns]', length=1460, freq=None, tz=None)
However I want the February to have 28 days and thus the last member of the output should be '2000-12-31 18:00:00', are there some way to do this with python? Thanks!!
All you need to do is check for the .month and .day attribute for the datetime instance. So just insert a condition that checks:
if month == 2
if day == 2
If both the conditions are true, you don't add it to the list.
To make it more descriptive:
ts = []
for i in range(1460):
x = dt + i * tdelta
if x.month == 2 and x.day == 29:
continue
ts.append(x)

Python Pandas: detecting frequency of time series

Assume I have loaded time series data from SQL or CSV (not created in Python), the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, the freq is None. I am wondering how can I detect the frequency of this series and set the freq as its frequency. If possible, I would like this to work in the case of data which isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between two timestamps, but I am not sure how to transfer it into a format that is readable by Series
It is worth mentioning that if data is continuous, you can use pandas.DateTimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix._set_freq(None)
dt_ix.inferred_freq
Out[2]: 'H'
or pandas.infer_freq method:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If not continuous pandas.infer_freq will return None. Similarly to what has been proposed yet, another alternative is using pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
Maybe try taking difference of the timeindex and use the mode (or smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
The minimum time difference is found with
np.diff(data.index.values).min()
which is normally in units of ns. To get a frequency, assuming ns:
freq = 1e9 / np.diff(df.index.values).min().astype(int)

numpy.datetime64: how to get weekday of numpy datetime64 and check if it's between time1 and time2

how to check if a numpy datetime is between time1 and time2(without date).
Say I have a series of datetime, i want to check its weekday, and whether it's between 13:00 and 13:30. For example
2014-03-05 22:55:00
is Wed and it's not between 13:00 and 13:30
Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]

Categories