Python Pandas: detecting frequency of time series

Python Pandas: detecting frequency of time series - python

Assume I have loaded time series data from SQL or CSV (not created in Python), the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, the freq is None. I am wondering how can I detect the frequency of this series and set the freq as its frequency. If possible, I would like this to work in the case of data which isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between two timestamps, but I am not sure how to transfer it into a format that is readable by Series

It is worth mentioning that if data is continuous, you can use pandas.DateTimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix._set_freq(None)
dt_ix.inferred_freq
Out[2]: 'H'
or pandas.infer_freq method:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If not continuous pandas.infer_freq will return None. Similarly to what has been proposed yet, another alternative is using pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')

Maybe try taking difference of the timeindex and use the mode (or smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)

The minimum time difference is found with
np.diff(data.index.values).min()
which is normally in units of ns. To get a frequency, assuming ns:
freq = 1e9 / np.diff(df.index.values).min().astype(int)

Related

How to calculate time difference between two dates in pandas Dataframe

I have a dataframe which is having multiple rows with column date. date column is having date and time. not each row has incremental time so I want to calculate after each row how much was the time difference between current and previous date in seconds.
import pandas as pd
data = pd.date_range('1/1/2011', periods = 10, freq ='H')
In the above snippet time difference after each step is 1hr which means 3600 seconds so I want a list of tuple having [(<prev date time>, <current_datetime>, <time_difference>),.....].

I want a list of tuple having [(prev date time, current_datetime,
time_difference),.....]
In this case, use list with zip and compute the time difference with tolal_seconds :
data = pd.date_range("1/1/2011", periods = 10, freq ="H")

L = list(zip(data.shift(), # <- previous time
data, # <- current time
(data.shift() - data).total_seconds())) # <- time diff
NB : If you manipulate a dataframe, you need to replace data by df["date_column"].

Output :
print(L)
[(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 00:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
...

You can achieve this by using diff function in Pandas to calculate the time difference between consecutive rows in the data column. Here's an example:
df = pd.DataFrame({"date": pd.date_range("1/1/2011", periods=10, freq="H")})
# Calculate the time difference between consecutive rows in seconds
df["time_diff"] = df["date"].diff().dt.total_seconds()
# Create a list of tuples
result = [(df.iloc[i-1]["date"], row["date"], row["time_diff"]) for i, row in df[1:].iterrows()]
df:
date time_diff
0 2011-01-01 00:00:00 NaN
1 2011-01-01 01:00:00 3600.0
2 2011-01-01 02:00:00 3600.0
3 2011-01-01 03:00:00 3600.0
4 2011-01-01 04:00:00 3600.0
5 2011-01-01 05:00:00 3600.0
6 2011-01-01 06:00:00 3600.0
7 2011-01-01 07:00:00 3600.0
8 2011-01-01 08:00:00 3600.0
9 2011-01-01 09:00:00 3600.0
result:
[(Timestamp('2011-01-01 00:00:00'), Timestamp('2011-01-01 01:00:00'), 3600.0),
(Timestamp('2011-01-01 01:00:00'), Timestamp('2011-01-01 02:00:00'), 3600.0),
(Timestamp('2011-01-01 02:00:00'), Timestamp('2011-01-01 03:00:00'), 3600.0),
(Timestamp('2011-01-01 03:00:00'), Timestamp('2011-01-01 04:00:00'), 3600.0),
(Timestamp('2011-01-01 04:00:00'), Timestamp('2011-01-01 05:00:00'), 3600.0),
(Timestamp('2011-01-01 05:00:00'), Timestamp('2011-01-01 06:00:00'), 3600.0),
(Timestamp('2011-01-01 06:00:00'), Timestamp('2011-01-01 07:00:00'), 3600.0),
(Timestamp('2011-01-01 07:00:00'), Timestamp('2011-01-01 08:00:00'), 3600.0),
(Timestamp('2011-01-01 08:00:00'), Timestamp('2011-01-01 09:00:00'), 3600.0)]

It's possible to do this with list comprehension. [:-1] is required because we get a list of 10 intervals just using shift, but there are N-1 intervals between N points.
result = [(i[0],
i[1],
(i[1] - i[0]).total_seconds())
for i in list(zip(data, data.shift(1)))[:-1]]
print(result)
[(Timestamp('2011-01-01 00:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 05:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 06:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 06:00:00', freq='H'),
Timestamp('2011-01-01 07:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 07:00:00', freq='H'),
Timestamp('2011-01-01 08:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 08:00:00', freq='H'),
Timestamp('2011-01-01 09:00:00', freq='H'),
3600.0)]

How to keep correct date when plotting data in Python?

I am working with a dataset with records ranging from 16-02-2022 00:00 to 01/04/2022 11:30. I want to plot the 4-variable daily and monthly means (H, LE, co2, h2o) and then compare them with the same variables from another dataset. I filtered the interest variables for the quality flags and I removed outliers using interquartile range. My problem is that I can't get the real date when I plot the average values. For example, I should get about 2 months, instead I get really more.
As you can see it is not the correct plot of Sensible Heat Flux monthly mean Cycle because I have more or less to months
I used this script to import the data:
datasetBio=pd.read_csv(io.BytesIO(uploaded["eddypro_Bioesame_full_output_exp2.csv"]),sep=';',header = 1, parse_dates= ['day&time'] ,index_col=['day&time'], na_values= "-9999")
and my DateTimeIndex look like this:
DatetimeIndex(['2022-02-16 00:00:00', '2022-02-16 00:30:00',
'2022-02-16 01:00:00', '2022-02-16 01:30:00',
'2022-02-16 02:00:00', '2022-02-16 02:30:00',
'2022-02-16 03:00:00', '2022-02-16 03:30:00',
'2022-02-16 04:00:00', '2022-02-16 04:30:00',
...
'2022-01-04 07:00:00', '2022-01-04 07:30:00',
'2022-01-04 08:00:00', '2022-01-04 08:30:00',
'2022-01-04 09:00:00', '2022-01-04 09:30:00',
'2022-01-04 10:00:00', '2022-01-04 10:30:00',
'2022-01-04 11:00:00', '2022-01-04 11:30:00'],
dtype='datetime64[ns]', name='day&time', length=2136, freq=None)
In my dataset there is also the DOY column, but I tried to use it without success. I tried with datetime moduls, strftime and strptime, but without success.
I also tried with:
#list comprehension
datasetBio['HM'] = ['%s-%s' %(el.hour,el.minute) for el in datasetBio.index]
list_m = []
list_h = []
for el in datasetBio['HM'].unique():
list_m.append(datasetBio['H'].loc[datasetBio['HM']==el].mean())
list_h.append(el)
#Look at the group
for el in datasetBio['HM'].unique():`
print(datasetBio.loc[datasetBio['HM']==el])
partial output:
2022-02-19 12:30:00 0.0 ... 0.575134 0.424103 0.066102 0.041973
2022-02-20 12:30:00 1.0 ... 0.898857 0.551975 0.069380 0.221436
2022-02-21 12:30:00 0.0 ... 221.180000 234.682000 0.369427 0.161920
2022-02-22 12:30:00 1.0 ... 0.521469 0.673882 0.074374 0.312831
2022-02-23 12:30:00 0.0 ... 0.303948 0.630388 0.069664 0.283314
When I try to plot together the variables coming from the 2 datasets obviously the problem of the days remains.
Instead of plotting the correct time parameters, it runs from January to December
Please someone help me to solve this problem because I don't know what to do anymore.
Thanks in advance.

Truncating milliseconds out of DateTimeIndex

When I use pandas.date_range(), I sometimes have timestamp that have lots of milliseconds that I don't wish to keep.
Suppose I do...
import pandas as pd
dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)
>>> dr
DatetimeIndex([ '2011-01-01 00:00:00',
'2011-01-01 03:25:42.857142784',
'2011-01-01 06:51:25.714285824',
'2011-01-01 10:17:08.571428608',
'2011-01-01 13:42:51.428571392',
'2011-01-01 17:08:34.285714176',
'2011-01-01 20:34:17.142857216',
'2011-01-02 00:00:00',
'2011-01-02 03:25:42.857142784',
'2011-01-02 06:51:25.714285824',
'2011-01-02 10:17:08.571428608',
'2011-01-02 13:42:51.428571392',
'2011-01-02 17:08:34.285714176',
'2011-01-02 20:34:17.142857216',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
To ignore the currend miliseconds, I am forced to do this.
>>> t = []
>>> for item in dr:
... idx = str(item).find('.')
... if idx != -1:
... item = str(item)[:idx]
... t.append(pd.to_datetime(item))
...
>>> t
[Timestamp('2011-01-01 00:00:00'),
Timestamp('2011-01-01 03:25:42'),
Timestamp('2011-01-01 06:51:25'),
Timestamp('2011-01-01 10:17:08'),
Timestamp('2011-01-01 13:42:51'),
Timestamp('2011-01-01 17:08:34'),
Timestamp('2011-01-01 20:34:17'),
Timestamp('2011-01-02 00:00:00'),
Timestamp('2011-01-02 03:25:42'),
Timestamp('2011-01-02 06:51:25'),
Timestamp('2011-01-02 10:17:08'),
Timestamp('2011-01-02 13:42:51'),
Timestamp('2011-01-02 17:08:34'),
Timestamp('2011-01-02 20:34:17'),
Timestamp('2011-01-03 00:00:00')]
Is there a better way ?
I already tried this...
dr = [ pd.to_datetime(item, format='%Y-%m-%d %H:%M:%S') for item in dr ]
But it doesn't do anything.
(pd.date_range('2011-01-01', '2011-01-03', periods=15)).astype('datetime64[s]')
But it says it can't cast it.
dr = (dr.to_series()).apply(lambda x:x.replace(microseconds=0))
But this line doesn't solve my problem, as...
2018-04-17 15:07:04.777777664 gives --> 2018-04-17 15:07:04.000000664

I believe need DatetimeIndex.floor:
print (dr.floor('S'))
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 03:25:42',
'2011-01-01 06:51:25', '2011-01-01 10:17:08',
'2011-01-01 13:42:51', '2011-01-01 17:08:34',
'2011-01-01 20:34:17', '2011-01-02 00:00:00',
'2011-01-02 03:25:42', '2011-01-02 06:51:25',
'2011-01-02 10:17:08', '2011-01-02 13:42:51',
'2011-01-02 17:08:34', '2011-01-02 20:34:17',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)

How to construct non-leap datetime list in python?

I have a user case that I need always the non-leap calendar whatever the year is a leap year or not. I want to construct a 6-hourly datetime list for year 2000, for example:
import datetime
import pandas as pa
tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000,1,1,0,)
ts = [dt+i*tdelta for i in range(1460)]
pa.DatetimeIndex(ts)
With this block of code, I get the result:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 06:00:00',
'2000-01-01 12:00:00', '2000-01-01 18:00:00',
'2000-01-02 00:00:00', '2000-01-02 06:00:00',
'2000-01-02 12:00:00', '2000-01-02 18:00:00',
'2000-01-03 00:00:00', '2000-01-03 06:00:00',
...
'2000-12-28 12:00:00', '2000-12-28 18:00:00',
'2000-12-29 00:00:00', '2000-12-29 06:00:00',
'2000-12-29 12:00:00', '2000-12-29 18:00:00',
'2000-12-30 00:00:00', '2000-12-30 06:00:00',
'2000-12-30 12:00:00', '2000-12-30 18:00:00'],
dtype='datetime64[ns]', length=1460, freq=None, tz=None)
However I want the February to have 28 days and thus the last member of the output should be '2000-12-31 18:00:00', are there some way to do this with python? Thanks!!

All you need to do is check for the .month and .day attribute for the datetime instance. So just insert a condition that checks:
if month == 2
if day == 2
If both the conditions are true, you don't add it to the list.
To make it more descriptive:
ts = []
for i in range(1460):
x = dt + i * tdelta
if x.month == 2 and x.day == 29:
continue
ts.append(x)

numpy.datetime64: how to get weekday of numpy datetime64 and check if it's between time1 and time2

how to check if a numpy datetime is between time1 and time2(without date).
Say I have a series of datetime, i want to check its weekday, and whether it's between 13:00 and 13:30. For example
2014-03-05 22:55:00
is Wed and it's not between 13:00 and 13:30

Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Pandas: detecting frequency of time series - python

The minimum time difference is found with np.diff(data.index.values).min() which is normally in units of ns. To get a frequency, assuming ns: freq = 1e9 / np.diff(df.index.values).min().astype(int)

Related

How to calculate time difference between two dates in pandas Dataframe

How to keep correct date when plotting data in Python?

Truncating milliseconds out of DateTimeIndex

How to construct non-leap datetime list in python?

numpy.datetime64: how to get weekday of numpy datetime64 and check if it's between time1 and time2

Categories

Resources