How to construct a non-leap datetime list in Python?

I have a use case where I always need a non-leap (365-day) calendar, whether or not the year is actually a leap year. I want to construct a 6-hourly datetime list for the year 2000, for example:
import datetime
import pandas as pa
tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000,1,1,0,)
ts = [dt+i*tdelta for i in range(1460)]
pa.DatetimeIndex(ts)
With this block of code, I get the result:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 06:00:00',
'2000-01-01 12:00:00', '2000-01-01 18:00:00',
'2000-01-02 00:00:00', '2000-01-02 06:00:00',
'2000-01-02 12:00:00', '2000-01-02 18:00:00',
'2000-01-03 00:00:00', '2000-01-03 06:00:00',
...
'2000-12-28 12:00:00', '2000-12-28 18:00:00',
'2000-12-29 00:00:00', '2000-12-29 06:00:00',
'2000-12-29 12:00:00', '2000-12-29 18:00:00',
'2000-12-30 00:00:00', '2000-12-30 06:00:00',
'2000-12-30 12:00:00', '2000-12-30 18:00:00'],
dtype='datetime64[ns]', length=1460, freq=None, tz=None)
However, I want February to have only 28 days, so the last member of the output should be '2000-12-31 18:00:00'. Is there some way to do this with Python? Thanks!!

All you need to do is check the .month and .day attributes of each datetime instance. So just insert a condition that checks:
if month == 2
if day == 29
If both conditions are true, you don't add it to the list.
To make it more descriptive:
ts = []
for i in range(366 * 4):                 # 1464 six-hour steps cover the full (leap) year
    x = dt + i * tdelta
    if x.month == 2 and x.day == 29:     # skip the leap day
        continue
    ts.append(x)
# ts now has 1460 entries and ends at 2000-12-31 18:00:00
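Alternatively, here is a hedged sketch (my own, not from the original answer) that builds the full leap-year range with pandas and then masks out the leap day:
import pandas as pd

# full 6-hourly range for 2000 (1464 entries), then drop the four Feb 29 entries
full = pd.date_range('2000-01-01 00:00', '2000-12-31 18:00', freq='6H')
no_leap = full[~((full.month == 2) & (full.day == 29))]

len(no_leap)   # 1460
no_leap[-1]    # Timestamp('2000-12-31 18:00:00')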

Related

Why is inclusive not working in the pandas date_range function?

pandas.date_range() does not return the correct start and end timestamps when the dates are given as quarter strings.
start_date = '2021Q4'
end_date = '2024Q1'
dates=pd.date_range(start_date, end_date, freq='Q', inclusive='both').to_list()
dates
[Timestamp('2021-12-31 00:00:00', freq='Q-DEC'),
Timestamp('2022-03-31 00:00:00', freq='Q-DEC'),
Timestamp('2022-06-30 00:00:00', freq='Q-DEC'),
Timestamp('2022-09-30 00:00:00', freq='Q-DEC'),
Timestamp('2022-12-31 00:00:00', freq='Q-DEC'),
Timestamp('2023-03-31 00:00:00', freq='Q-DEC'),
Timestamp('2023-06-30 00:00:00', freq='Q-DEC'),
Timestamp('2023-09-30 00:00:00', freq='Q-DEC'),
Timestamp('2023-12-31 00:00:00', freq='Q-DEC')]
After trying to reproduce your problem, I found the following solution:
The problem here is that date_range does not interpret your quarter strings correctly.
When you convert the quarter strings to pandas.Period() and then to pandas.Timestamp(), it works as expected:
import pandas as pd
from pprint import pprint
start_date = '2021Q4'
end_date = '2024Q1'
first_date_of_period = pd.Period(start_date, 'Q').to_timestamp('D', 'S') # D for day, S for start
last_date_of_period = pd.Period(end_date, 'Q').to_timestamp('D', 'E') # D for day, E for end
print(first_date_of_period)
# >> 2021-10-01 00:00:00
print(last_date_of_period)
# >> 2024-03-31 23:59:59.999999999
dates=pd.date_range(start=first_date_of_period, end=last_date_of_period, freq='Q').to_list()
pprint(dates)
# >> [Timestamp('2021-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-03-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-06-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-09-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-03-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-06-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-09-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2024-03-31 00:00:00', freq='Q-DEC')]
I used pandas v1.4.3 (the latest stable version at the time of writing).
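As a shorter alternative (my own sketch, not part of the original answer), pandas.period_range accepts the quarter strings directly, and the resulting PeriodIndex converts straight to quarter-end timestamps:
import pandas as pd

# quarterly PeriodIndex built directly from the quarter strings
quarters = pd.period_range('2021Q4', '2024Q1', freq='Q')
# end-of-quarter timestamps; both endpoints (2021Q4 and 2024Q1) are included
dates = quarters.to_timestamp(how='end').to_list()
# note: 'end' timestamps land on the last nanosecond of the quarter,
# e.g. 2021-12-31 23:59:59.999999999; call .normalize() on the result if you want midnight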

Truncating milliseconds out of a DatetimeIndex

When I use pandas.date_range(), I sometimes get timestamps carrying fractional seconds that I don't wish to keep.
Suppose I do...
import pandas as pd
dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)
>>> dr
DatetimeIndex([ '2011-01-01 00:00:00',
'2011-01-01 03:25:42.857142784',
'2011-01-01 06:51:25.714285824',
'2011-01-01 10:17:08.571428608',
'2011-01-01 13:42:51.428571392',
'2011-01-01 17:08:34.285714176',
'2011-01-01 20:34:17.142857216',
'2011-01-02 00:00:00',
'2011-01-02 03:25:42.857142784',
'2011-01-02 06:51:25.714285824',
'2011-01-02 10:17:08.571428608',
'2011-01-02 13:42:51.428571392',
'2011-01-02 17:08:34.285714176',
'2011-01-02 20:34:17.142857216',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
To drop the current fractional seconds, I am forced to do this.
>>> t = []
>>> for item in dr:
...     idx = str(item).find('.')
...     if idx != -1:
...         item = str(item)[:idx]
...     t.append(pd.to_datetime(item))
...
>>> t
[Timestamp('2011-01-01 00:00:00'),
Timestamp('2011-01-01 03:25:42'),
Timestamp('2011-01-01 06:51:25'),
Timestamp('2011-01-01 10:17:08'),
Timestamp('2011-01-01 13:42:51'),
Timestamp('2011-01-01 17:08:34'),
Timestamp('2011-01-01 20:34:17'),
Timestamp('2011-01-02 00:00:00'),
Timestamp('2011-01-02 03:25:42'),
Timestamp('2011-01-02 06:51:25'),
Timestamp('2011-01-02 10:17:08'),
Timestamp('2011-01-02 13:42:51'),
Timestamp('2011-01-02 17:08:34'),
Timestamp('2011-01-02 20:34:17'),
Timestamp('2011-01-03 00:00:00')]
Is there a better way?
I already tried this...
dr = [ pd.to_datetime(item, format='%Y-%m-%d %H:%M:%S') for item in dr ]
But it doesn't do anything.
(pd.date_range('2011-01-01', '2011-01-03', periods=15)).astype('datetime64[s]')
But it says it can't cast it.
dr = dr.to_series().apply(lambda x: x.replace(microsecond=0))
But this line doesn't solve my problem, as...
2018-04-17 15:07:04.777777664 gives --> 2018-04-17 15:07:04.000000664
I believe you need DatetimeIndex.floor:
print (dr.floor('S'))
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 03:25:42',
'2011-01-01 06:51:25', '2011-01-01 10:17:08',
'2011-01-01 13:42:51', '2011-01-01 17:08:34',
'2011-01-01 20:34:17', '2011-01-02 00:00:00',
'2011-01-02 03:25:42', '2011-01-02 06:51:25',
'2011-01-02 10:17:08', '2011-01-02 13:42:51',
'2011-01-02 17:08:34', '2011-01-02 20:34:17',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
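If rounding to the nearest second is preferable to truncating, DatetimeIndex also offers round and ceil, which take the same kind of argument; a quick sketch:
# round to the nearest second instead of flooring
print (dr.round('S'))

# or always round up to the next second
print (dr.ceil('S'))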

Create custom date range, 22 hours a day, in Python

I'm working with pandas and want to create a month-long custom date range where the week starts on Sunday night at 6pm and ends Friday afternoon at 4pm. And each day has 22 hours, so for example Sunday at 6pm to Monday at 4pm, Monday 6pm to Tuesday 4pm, etc.
I tried day_range = pd.date_range(datetime(2016,9,12,18),datetime.now(),freq='H') but that always gives me all 24 hours of each day.
Any suggestions?
You need CustomBusinessHour with date_range:
from datetime import datetime

cbh = pd.offsets.CustomBusinessHour(start='06:00',
                                    end='16:00',
                                    weekmask='Mon Tue Wed Thu Fri Sat')
print (cbh)
<CustomBusinessHour: CBH=06:00-16:00>
day_range = pd.date_range(datetime(2016,9,12,18), datetime.now(), freq=cbh)
print (day_range)
DatetimeIndex(['2016-09-13 06:00:00', '2016-09-13 07:00:00',
'2016-09-13 08:00:00', '2016-09-13 09:00:00',
'2016-09-13 10:00:00', '2016-09-13 11:00:00',
'2016-09-13 12:00:00', '2016-09-13 13:00:00',
'2016-09-13 14:00:00', '2016-09-13 15:00:00',
...
'2016-10-11 08:00:00', '2016-10-11 09:00:00',
'2016-10-11 10:00:00', '2016-10-11 11:00:00',
'2016-10-11 12:00:00', '2016-10-11 13:00:00',
'2016-10-11 14:00:00', '2016-10-11 15:00:00',
'2016-10-12 06:00:00', '2016-10-12 07:00:00'],
dtype='datetime64[ns]', length=252, freq='CBH')
Test - it omits Sundays:
day_range = pd.date_range(datetime(2016,9,12,18), datetime.now(), freq=cbh)[45:]
print (day_range)
DatetimeIndex(['2016-09-17 11:00:00', '2016-09-17 12:00:00',
'2016-09-17 13:00:00', '2016-09-17 14:00:00',
'2016-09-17 15:00:00', '2016-09-19 06:00:00',
'2016-09-19 07:00:00', '2016-09-19 08:00:00',
'2016-09-19 09:00:00', '2016-09-19 10:00:00',
...
'2016-10-11 08:00:00', '2016-10-11 09:00:00',
'2016-10-11 10:00:00', '2016-10-11 11:00:00',
'2016-10-11 12:00:00', '2016-10-11 13:00:00',
'2016-10-11 14:00:00', '2016-10-11 15:00:00',
'2016-10-12 06:00:00', '2016-10-12 07:00:00'],
dtype='datetime64[ns]', length=207, freq='CBH')
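The answer above uses a 06:00-16:00 window; for the question's 6pm-to-4pm shifts, CustomBusinessHour should also accept hours that pass midnight (start later than end), the same way BusinessHour does. A sketch under that assumption, where the weekmask names the day each 22-hour shift starts on (worth verifying that the mask applies to the shift's start day):
from datetime import datetime
import pandas as pd

# 18:00 -> 16:00 the next day (22 hours, passing midnight); shifts start Sun-Thu
cbh_night = pd.offsets.CustomBusinessHour(start='18:00',
                                          end='16:00',
                                          weekmask='Sun Mon Tue Wed Thu')
night_range = pd.date_range(datetime(2016, 9, 11, 18), datetime(2016, 10, 12), freq=cbh_night)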

Python Pandas: detecting frequency of time series

Assume I have loaded time series data from SQL or CSV (not created in Python); the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, the freq is None. I am wondering how I can detect the frequency of this series and set it as the index's freq. If possible, I would like this to work even when the data isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert that into a frequency the Series/DatetimeIndex will accept.
It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix.freq = None  # clear the freq that date_range sets, to mimic data loaded from CSV/SQL
dt_ix.inferred_freq
Out[2]: 'H'
or the pandas.infer_freq function:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If the index is not continuous, pandas.infer_freq will return None. Similarly to what has already been proposed, another alternative is the pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
Maybe try taking the differences of the time index and use the mode (or the smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
The minimum time difference is found with
np.diff(df.index.values).min()
which is normally in units of nanoseconds. To get a sampling rate in Hz, assuming nanoseconds:
freq = 1e9 / np.diff(df.index.values).min().astype(int)
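To actually attach the detected frequency back to the data (my own sketch, not from the answers above), asfreq works when the index is continuous, and reindexing to a full range covers the gappy case:
# continuous index: infer the freq string and set it in one step
freq = pd.infer_freq(df.index)                      # e.g. 'H' (None if the index has gaps)
df_cont = df.asfreq(freq)                           # index now carries freq='H'

# gappy index: take the mode of the gaps, then reindex (missing rows become NaN)
step = df.index.to_series().diff().mode().iloc[0]   # Timedelta('0 days 01:00:00')
full_rng = pd.date_range(df.index[0], df.index[-1], freq=step)
df_full = df.reindex(full_rng)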

numpy.datetime64: how to get weekday of numpy datetime64 and check if it's between time1 and time2

How do I check whether a numpy datetime is between time1 and time2 (ignoring the date)?
Say I have a series of datetimes; I want to check the weekday of each, and whether it falls between 13:00 and 13:30. For example,
2014-03-05 22:55:00
is a Wednesday and it is not between 13:00 and 13:30.
Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]
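For a single numpy.datetime64 value (my own sketch, not part of the original answer), converting it to a pandas Timestamp gives both pieces of the check directly:
import datetime
import numpy as np
import pandas as pd

ts = pd.Timestamp(np.datetime64('2014-03-05 22:55:00'))
is_wednesday = ts.weekday() == 2                                  # Monday == 0, so Wednesday == 2
in_window = datetime.time(13, 0) <= ts.time() <= datetime.time(13, 30)
print(is_wednesday, in_window)                                    # True False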
