Why is inclusive not working in the pandas date_range function? - python

pandas.date_range() does not return the correct start and end timestamps when the dates are given as quarter strings.
start_date = '2021Q4'
end_date = '2024Q1'
dates=pd.date_range(start_date, end_date, freq='Q', inclusive='both').to_list()
dates
[Timestamp('2021-12-31 00:00:00', freq='Q-DEC'),
Timestamp('2022-03-31 00:00:00', freq='Q-DEC'),
Timestamp('2022-06-30 00:00:00', freq='Q-DEC'),
Timestamp('2022-09-30 00:00:00', freq='Q-DEC'),
Timestamp('2022-12-31 00:00:00', freq='Q-DEC'),
Timestamp('2023-03-31 00:00:00', freq='Q-DEC'),
Timestamp('2023-06-30 00:00:00', freq='Q-DEC'),
Timestamp('2023-09-30 00:00:00', freq='Q-DEC'),
Timestamp('2023-12-31 00:00:00', freq='Q-DEC')]

After looking into the problem I found the following solution:
The problem here is that date_range does not interpret your quarter strings the way you expect: '2024Q1' is parsed as the first instant of the quarter (2024-01-01), so the quarter-end 2024-03-31 falls outside the range.
When you convert your "quarter" strings to pandas.Period() and then to pandas.Timestamp(), this works as expected:
import pandas as pd
from pprint import pprint
start_date = '2021Q4'
end_date = '2024Q1'
first_date_of_period = pd.Period(start_date, 'Q').to_timestamp('D', 'S') # D for day, S for start
last_date_of_period = pd.Period(end_date, 'Q').to_timestamp('D', 'E') # D for day, E for end
print(first_date_of_period)
# >> 2021-10-01 00:00:00
print(last_date_of_period)
# >> 2024-03-31 23:59:59.999999999
dates=pd.date_range(start=first_date_of_period, end=last_date_of_period, freq='Q').to_list()
pprint(dates)
# >> [Timestamp('2021-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-03-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-06-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-09-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2022-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-03-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-06-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-09-30 00:00:00', freq='Q-DEC'),
# >> Timestamp('2023-12-31 00:00:00', freq='Q-DEC'),
# >> Timestamp('2024-03-31 00:00:00', freq='Q-DEC')]
I used pandas v1.4.3 (the latest stable version at the time of writing).
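Alternatively (a sketch of the same idea), you can skip date_range entirely and build a period_range, which parses the quarter strings natively and includes both endpoints:

```python
import pandas as pd

# period_range parses '2021Q4'/'2024Q1' as quarters directly,
# so both endpoints are included in the range.
quarters = pd.period_range('2021Q4', '2024Q1', freq='Q')

# Convert each quarter to its end-of-quarter date (normalize drops
# the 23:59:59.999999999 end-of-period time component).
dates = quarters.to_timestamp(how='end').normalize().to_list()
```

This yields 10 quarter-end timestamps, from 2021-12-31 through 2024-03-31 inclusive.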

Related

Assign rows from one dataframe to another dataframe at a discretion

I need to generate df_Result_Sensor automatically.
I would like the dataframe (df_Result_Sensor) to only receive the df_Sensor rows whose ['TimeStamp'] column does not fall within any of the df_Message ['date init'] to df_Message ['date end'] ranges.
#In the code example, I wrote a df_Result_Sensor manually, just to illustrate the desired output:
TimeStamp Sensor_one Sensor_two
0 2017-05-20 00:00:00 1 1
1 2017-04-13 00:00:00 1 1
2 2017-09-10 00:00:00 0 1
import pandas as pd
df_Sensor = pd.DataFrame({'TimeStamp' : ['2017-05-25 00:00:00','2017-05-20 00:00:00', '2017-04-13 00:00:00', '2017-08-29 01:15:12', '2017-08-15 02:15:12', '2017-09-10 00:00:00'], 'Sensor_one': [1,1,1,1,1,0], 'Sensor_two': [1,1,1,0,1,1]})
df_Message = pd.DataFrame({'date init': ['2017-05-22 00:00:00', '2017-08-14 00:00:10'], 'date end': ['2017-05-26 00:00:05', '2017-09-01 02:10:05'], 'Message': ['Cold', 'Cold']})
# just to illustrate the desired output:
df_Result_Sensor = pd.DataFrame({'TimeStamp' : ['2017-05-20 00:00:00', '2017-04-13 00:00:00', '2017-09-10 00:00:00'], 'Sensor_one': [1,1,0], 'Sensor_two': [1,1,1]})
This will work; make sure your date columns are converted to datetime before doing date comparisons:
df_Message["date init"] = pd.to_datetime(df_Message["date init"])
df_Message['date end'] = pd.to_datetime(df_Message['date end'])
df_Sensor["TimeStamp"] = pd.to_datetime(df_Sensor["TimeStamp"])
df_Sensor_ = df_Sensor.copy()
for index, row in df_Message.iterrows():
    df_Sensor_ = df_Sensor_[~((df_Sensor_["TimeStamp"] > row['date init']) & (df_Sensor_["TimeStamp"] < row['date end']))]
df_Result_Sensor = df_Sensor_
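The same filter can also be built as a single boolean mask instead of repeatedly re-slicing the dataframe; a sketch (assuming pandas >= 1.3 for the string `inclusive` argument, with `inclusive='neither'` mirroring the strict inequalities above):

```python
import pandas as pd

df_Sensor = pd.DataFrame({'TimeStamp': ['2017-05-25 00:00:00', '2017-05-20 00:00:00',
                                        '2017-04-13 00:00:00', '2017-08-29 01:15:12',
                                        '2017-08-15 02:15:12', '2017-09-10 00:00:00'],
                          'Sensor_one': [1, 1, 1, 1, 1, 0],
                          'Sensor_two': [1, 1, 1, 0, 1, 1]})
df_Message = pd.DataFrame({'date init': ['2017-05-22 00:00:00', '2017-08-14 00:00:10'],
                           'date end': ['2017-05-26 00:00:05', '2017-09-01 02:10:05'],
                           'Message': ['Cold', 'Cold']})

df_Sensor['TimeStamp'] = pd.to_datetime(df_Sensor['TimeStamp'])
df_Message['date init'] = pd.to_datetime(df_Message['date init'])
df_Message['date end'] = pd.to_datetime(df_Message['date end'])

# Mark rows that fall inside ANY message interval, then keep the complement.
in_any = pd.Series(False, index=df_Sensor.index)
for start, end in zip(df_Message['date init'], df_Message['date end']):
    in_any |= df_Sensor['TimeStamp'].between(start, end, inclusive='neither')

df_Result_Sensor = df_Sensor[~in_any].reset_index(drop=True)
```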

Choose rows a fixed time-interval apart in Datetime-indexed pandas dataframe

I have a pandas dataframe indexed by DateTime from hour "00:00:00" until hour "23:59:00" (increments by minute, seconds not counted).
in: df.index
out: DatetimeIndex(['2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
...
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 05:16:00', '2018-10-08 07:08:00',
'2018-10-08 13:58:00', '2018-10-08 09:30:00'],
dtype='datetime64[ns]', name='DateTime', length=91846, freq=None)
Now I want to choose rows at specific intervals, say every 1 minute or every 1 hour, starting from "00:00:00", and retrieve all the rows that are that interval apart, consecutively.
I can grab entire intervals, say the first hour interval, with
df.between_time("01:00:00","00:00:00")
But I want to be able to
(a) get only all the times that are a specific intervals apart
(b) get all the 1-hour intervals without having to manually ask for them 24 times. How do I increment the DatetimeIndex inside the between_time command? Is there a better way than that?
I would solve this problem with masking rather than making new dataframes. For example, you can add a column df['which_one'] and set a different number for each subset. Then you can access a subset by calling df[df['which_one']==x] where x is the subset you want to select. You can still use conditional statements and just about everything else that pandas has to offer by accessing the data this way.
P.S. There are other methods to access the data that might be faster; I just used what I'm most comfortable with. Another way would be df[df['which_one'].eq(x)].
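A minimal sketch of that masking idea (the index and column values here are hypothetical, modeled on the question's minute-resolution index):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the question's minute-resolution index.
idx = pd.date_range('2018-10-08 00:00:00', '2018-10-08 23:59:00', freq='min')
df = pd.DataFrame({'value': np.arange(len(idx))}, index=idx)

# (a) rows that are a fixed interval apart, e.g. every 15 minutes from 00:00:
every_15min = df[df.index.minute % 15 == 0]

# (b) label each row with its hour; df[df['which_one'] == x] is then the
# x-th one-hour subset, without calling between_time 24 times.
df['which_one'] = df.index.hour
hour_5 = df[df['which_one'] == 5]
```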
If you are dead set on separate dataframes, I would suggest using a dictionary of dataframes, such as:
import pandas as pd
dfdict={}
for i in range(0, 10):
    dfdict[i] = pd.DataFrame()
print(dfdict)
As you will see, they are indeed DataFrames:
out[1]
{0: Empty DataFrame
Columns: []
Index: [], 1: Empty DataFrame
Columns: []
Index: [], 2: Empty DataFrame
Columns: []
Index: [], 3: Empty DataFrame
Columns: []
Index: [], 4: Empty DataFrame
Columns: []
Index: [], 5: Empty DataFrame
Columns: []
Index: [], 6: Empty DataFrame
Columns: []
Index: [], 7: Empty DataFrame
Columns: []
Index: [], 8: Empty DataFrame
Columns: []
Index: [], 9: Empty DataFrame
Columns: []
Index: []}
Although, as others have suggested, there might be a more practical approach to solving your problem (difficult to say without more specifics of the issue).

Truncating milliseconds out of DateTimeIndex

When I use pandas.date_range(), I sometimes get timestamps with lots of fractional-second digits that I don't wish to keep.
Suppose I do...
import pandas as pd
dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)
>>> dr
DatetimeIndex([ '2011-01-01 00:00:00',
'2011-01-01 03:25:42.857142784',
'2011-01-01 06:51:25.714285824',
'2011-01-01 10:17:08.571428608',
'2011-01-01 13:42:51.428571392',
'2011-01-01 17:08:34.285714176',
'2011-01-01 20:34:17.142857216',
'2011-01-02 00:00:00',
'2011-01-02 03:25:42.857142784',
'2011-01-02 06:51:25.714285824',
'2011-01-02 10:17:08.571428608',
'2011-01-02 13:42:51.428571392',
'2011-01-02 17:08:34.285714176',
'2011-01-02 20:34:17.142857216',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
To drop the fractional seconds, I am forced to do this:
>>> t = []
>>> for item in dr:
... idx = str(item).find('.')
... if idx != -1:
... item = str(item)[:idx]
... t.append(pd.to_datetime(item))
...
>>> t
[Timestamp('2011-01-01 00:00:00'),
Timestamp('2011-01-01 03:25:42'),
Timestamp('2011-01-01 06:51:25'),
Timestamp('2011-01-01 10:17:08'),
Timestamp('2011-01-01 13:42:51'),
Timestamp('2011-01-01 17:08:34'),
Timestamp('2011-01-01 20:34:17'),
Timestamp('2011-01-02 00:00:00'),
Timestamp('2011-01-02 03:25:42'),
Timestamp('2011-01-02 06:51:25'),
Timestamp('2011-01-02 10:17:08'),
Timestamp('2011-01-02 13:42:51'),
Timestamp('2011-01-02 17:08:34'),
Timestamp('2011-01-02 20:34:17'),
Timestamp('2011-01-03 00:00:00')]
Is there a better way ?
I already tried this...
dr = [ pd.to_datetime(item, format='%Y-%m-%d %H:%M:%S') for item in dr ]
But it doesn't do anything.
(pd.date_range('2011-01-01', '2011-01-03', periods=15)).astype('datetime64[s]')
But it says it can't cast it.
dr = (dr.to_series()).apply(lambda x:x.replace(microsecond=0))
But this line doesn't solve my problem, as...
2018-04-17 15:07:04.777777664 gives --> 2018-04-17 15:07:04.000000664
I believe you need DatetimeIndex.floor:
print (dr.floor('S'))
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 03:25:42',
'2011-01-01 06:51:25', '2011-01-01 10:17:08',
'2011-01-01 13:42:51', '2011-01-01 17:08:34',
'2011-01-01 20:34:17', '2011-01-02 00:00:00',
'2011-01-02 03:25:42', '2011-01-02 06:51:25',
'2011-01-02 10:17:08', '2011-01-02 13:42:51',
'2011-01-02 17:08:34', '2011-01-02 20:34:17',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
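Besides floor, DatetimeIndex also has round and ceil, and the same methods are available on a Series through the .dt accessor; a quick sketch, assuming the same dr as above:

```python
import pandas as pd

dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)

floored = dr.floor('S')  # truncate toward the past
rounded = dr.round('S')  # round to the nearest whole second

# The same methods exist on a datetime Series via the .dt accessor:
s = dr.to_series().dt.floor('S')
```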

How to construct non-leap datetime list in python?

I have a use case where I always need the non-leap calendar, whether the year is a leap year or not. I want to construct a 6-hourly datetime list for year 2000, for example:
import datetime
import pandas as pa
tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000,1,1,0,)
ts = [dt+i*tdelta for i in range(1460)]
pa.DatetimeIndex(ts)
With this block of code, I get the result:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 06:00:00',
'2000-01-01 12:00:00', '2000-01-01 18:00:00',
'2000-01-02 00:00:00', '2000-01-02 06:00:00',
'2000-01-02 12:00:00', '2000-01-02 18:00:00',
'2000-01-03 00:00:00', '2000-01-03 06:00:00',
...
'2000-12-28 12:00:00', '2000-12-28 18:00:00',
'2000-12-29 00:00:00', '2000-12-29 06:00:00',
'2000-12-29 12:00:00', '2000-12-29 18:00:00',
'2000-12-30 00:00:00', '2000-12-30 06:00:00',
'2000-12-30 12:00:00', '2000-12-30 18:00:00'],
dtype='datetime64[ns]', length=1460, freq=None, tz=None)
However, I want February to have 28 days, and thus the last member of the output should be '2000-12-31 18:00:00'. Is there some way to do this with python? Thanks!!
All you need to do is check the .month and .day attributes of the datetime instance. So just insert a condition that checks:
if month == 2 and day == 29
If both conditions are true, you don't add it to the list.
To make it more descriptive:
ts = []
for i in range(366 * 4):  # cover the whole leap year in 6-hour steps
    x = dt + i * tdelta
    if x.month == 2 and x.day == 29:
        continue
    ts.append(x)
# ts now has 365 * 4 = 1460 entries, ending at 2000-12-31 18:00:00
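Note that iterating exactly 1460 steps would end the list early once the four Feb 29 entries are skipped; a comprehension-based sketch that covers the whole leap year and still yields 1460 entries ending at 2000-12-31 18:00:00:

```python
import datetime

tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000, 1, 1, 0)

# 366 days * 4 steps/day covers the whole leap year; dropping the
# four Feb 29 entries leaves 365 * 4 = 1460 timestamps.
steps = (dt + i * tdelta for i in range(366 * 4))
ts = [x for x in steps if not (x.month == 2 and x.day == 29)]
```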

Python Pandas: detecting frequency of time series

Assume I have loaded time series data from SQL or CSV (not created in Python), the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, the freq is None. I am wondering how can I detect the frequency of this series and set the freq as its frequency. If possible, I would like this to work in the case of data which isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert it into a frequency that the series can use.
It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix.freq = None  # simulate an index loaded without frequency information
dt_ix.inferred_freq
Out[2]: 'H'
or pandas.infer_freq method:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If the data is not continuous, pandas.infer_freq will return None. Similarly to what has already been proposed, another alternative is using the pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
Maybe try taking the difference of the time index and using the mode (or the smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
The minimum time difference is found with
np.diff(df.index.values).min()
which is normally in units of ns. To get a frequency in Hz, assuming ns:
freq = 1e9 / np.diff(df.index.values).min().astype(int)
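The mode-of-differences idea can also be written compactly in pandas; a small sketch on a hypothetical index with one 3-hour gap:

```python
import pandas as pd

# Hypothetical hourly index with one 3-hour gap in the middle.
idx = pd.DatetimeIndex(['2015-03-02 00:00', '2015-03-02 01:00',
                        '2015-03-02 02:00', '2015-03-02 05:00',
                        '2015-03-02 06:00', '2015-03-02 07:00'])

# Most common gap between consecutive timestamps -> candidate frequency.
step = idx.to_series().diff().mode().iloc[0]

# Convert the Timedelta into a pandas frequency offset usable in date_range.
freq = pd.tseries.frequencies.to_offset(step)
full_rng = pd.date_range(idx[0], idx[-1], freq=freq)
```

full_rng can then be used to reindex the original data onto the inferred regular grid.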
