Python Pandas: detecting frequency of time series - python

Assume I have loaded time series data from SQL or CSV (not created in Python), the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, freq is None. How can I detect the frequency of this series and set freq accordingly? If possible, I would like this to work even when the data isn't continuous (there are plenty of breaks in the series).
I tried to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert it into a format that a Series can use.

It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix = pd.DatetimeIndex(dt_ix.values)  # rebuild from the raw values to drop the stored freq
dt_ix.inferred_freq
Out[2]: 'H'
or the pandas.infer_freq function:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If the index is not continuous, pandas.infer_freq will return None. Similar to what has already been proposed, another alternative is to use the pandas.Series.diff method and take the smallest gap:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')

Maybe try taking the differences between consecutive timestamps in the index and using the mode (or the smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered the frequency (value_counts sorts by count, most common first)
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
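To tie the steps above back to the original goal of ending up with an index whose freq is set, here is a minimal sketch. It assumes an hourly series with one block of missing timestamps (made-up data, not the asker's), takes the most common gap as the step, and reindexes onto the full range. The names `hour`, `gap`, `step` and `regular` are only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up data: an hourly range with one month-long block removed.
hour = pd.Timedelta(hours=1)
full = pd.date_range("2015-03-02 00:00", "2015-07-19 23:00", freq=hour)
gap = pd.date_range("2015-05-01 00:00", "2015-05-30 00:00", freq=hour)
idx = pd.DatetimeIndex(full.drop(gap).values)  # rebuilt from raw values, so freq is None
df = pd.DataFrame({"col": np.arange(len(idx), dtype=float)}, index=idx)

# Most common gap between consecutive timestamps.
step = df.index.to_series().diff().mode()[0]

# Rebuild the complete range with that step and align the data to it.
# Rows that were missing from the original data become NaN.
full_rng = pd.date_range(df.index[0], df.index[-1], freq=step)
regular = df.reindex(full_rng)

print(step)
print(regular.index.freq)
print(regular["col"].isna().sum())
```

Whether the mode or the minimum is the better estimate depends on the data: the minimum is sensitive to a single stray timestamp, while the mode assumes that most observations are present.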

The minimum time difference is found with
np.diff(df.index.values).min()
which is normally in units of ns. To get a frequency in Hz (samples per second), assuming ns:
freq = 1e9 / np.diff(df.index.values).min().astype('int64')
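If you would rather not depend on the index being stored in nanoseconds, one option is to divide the smallest gap by a one-second timedelta. A small sketch on a made-up hourly index (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# Made-up hourly index.
idx = pd.date_range("2015-03-02", periods=5, freq=pd.Timedelta(hours=1))

min_gap = np.diff(idx.values).min()          # a numpy.timedelta64
seconds = min_gap / np.timedelta64(1, "s")   # gap length in seconds, whatever the stored unit
freq_hz = 1.0 / seconds                      # samples per second

print(seconds, freq_hz)
```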

Related

How to calculate time difference between two dates in pandas Dataframe

I have a dataframe with multiple rows and a date column that holds both date and time. The time does not increase by the same amount from row to row, so for each row I want to calculate the time difference in seconds between the current and the previous date.
import pandas as pd
data = pd.date_range('1/1/2011', periods = 10, freq ='H')
In the above snippet the time difference after each step is 1 hour, i.e. 3600 seconds, so I want a list of tuples of the form [(<prev date time>, <current_datetime>, <time_difference>),.....].
"I want a list of tuple having [(prev date time, current_datetime, time_difference),.....]"
In this case, use list with zip and compute the time difference with total_seconds:
data = pd.date_range("1/1/2011", periods = 10, freq ="H")
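L = list(zip(data.shift(),  # <- each timestamp moved forward one freq step (the next time)
             data,          # <- current time
             (data.shift() - data).total_seconds()))  # <- gap between the two, in seconds
NB: DatetimeIndex.shift moves every timestamp forward by one frequency step, so the first element of each tuple is the later time. If you work with a dataframe column instead, df["date_column"].shift() moves the values down one row (the previous row's time, with NaT in the first row), so the subtraction order has to be swapped.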
Output :
print(L)
[(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 00:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
...
You can achieve this by using the diff function in pandas to calculate the time difference between consecutive rows in the date column. Here's an example:
df = pd.DataFrame({"date": pd.date_range("1/1/2011", periods=10, freq="H")})
# Calculate the time difference between consecutive rows in seconds
df["time_diff"] = df["date"].diff().dt.total_seconds()
# Create a list of tuples
result = [(df.iloc[i-1]["date"], row["date"], row["time_diff"]) for i, row in df[1:].iterrows()]
df:
date time_diff
0 2011-01-01 00:00:00 NaN
1 2011-01-01 01:00:00 3600.0
2 2011-01-01 02:00:00 3600.0
3 2011-01-01 03:00:00 3600.0
4 2011-01-01 04:00:00 3600.0
5 2011-01-01 05:00:00 3600.0
6 2011-01-01 06:00:00 3600.0
7 2011-01-01 07:00:00 3600.0
8 2011-01-01 08:00:00 3600.0
9 2011-01-01 09:00:00 3600.0
result:
[(Timestamp('2011-01-01 00:00:00'), Timestamp('2011-01-01 01:00:00'), 3600.0),
(Timestamp('2011-01-01 01:00:00'), Timestamp('2011-01-01 02:00:00'), 3600.0),
(Timestamp('2011-01-01 02:00:00'), Timestamp('2011-01-01 03:00:00'), 3600.0),
(Timestamp('2011-01-01 03:00:00'), Timestamp('2011-01-01 04:00:00'), 3600.0),
(Timestamp('2011-01-01 04:00:00'), Timestamp('2011-01-01 05:00:00'), 3600.0),
(Timestamp('2011-01-01 05:00:00'), Timestamp('2011-01-01 06:00:00'), 3600.0),
(Timestamp('2011-01-01 06:00:00'), Timestamp('2011-01-01 07:00:00'), 3600.0),
(Timestamp('2011-01-01 07:00:00'), Timestamp('2011-01-01 08:00:00'), 3600.0),
(Timestamp('2011-01-01 08:00:00'), Timestamp('2011-01-01 09:00:00'), 3600.0)]
It's possible to do this with a list comprehension. The [:-1] is required because zipping with shift gives 10 pairs, but there are only N-1 intervals between N points.
result = [(i[0],
i[1],
(i[1] - i[0]).total_seconds())
for i in list(zip(data, data.shift(1)))[:-1]]
print(result)
[(Timestamp('2011-01-01 00:00:00', freq='H'),
Timestamp('2011-01-01 01:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 01:00:00', freq='H'),
Timestamp('2011-01-01 02:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 02:00:00', freq='H'),
Timestamp('2011-01-01 03:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 03:00:00', freq='H'),
Timestamp('2011-01-01 04:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 04:00:00', freq='H'),
Timestamp('2011-01-01 05:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 05:00:00', freq='H'),
Timestamp('2011-01-01 06:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 06:00:00', freq='H'),
Timestamp('2011-01-01 07:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 07:00:00', freq='H'),
Timestamp('2011-01-01 08:00:00', freq='H'),
3600.0),
(Timestamp('2011-01-01 08:00:00', freq='H'),
Timestamp('2011-01-01 09:00:00', freq='H'),
3600.0)]
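The approaches that call shift on the index rely on it having a freq. For irregular timestamps, which is what the question describes, a sketch that pairs each element with its predecessor by slicing may be easier to reason about. The sample timestamps below are made up:

```python
import pandas as pd

# Made-up irregular timestamps (no freq on the index).
data = pd.DatetimeIndex(["2011-01-01 00:00", "2011-01-01 01:00", "2011-01-01 03:30"])

prev, curr = data[:-1], data[1:]          # N-1 aligned (previous, current) pairs
secs = (curr - prev).total_seconds()      # element-wise gaps in seconds

result = list(zip(prev, curr, secs))
print(result)
```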

How to keep correct date when plotting data in Python?

I am working with a dataset with records ranging from 16-02-2022 00:00 to 01/04/2022 11:30. I want to plot the daily and monthly means of four variables (H, LE, co2, h2o) and then compare them with the same variables from another dataset. I filtered the variables of interest by their quality flags and removed outliers using the interquartile range. My problem is that I can't get the real dates when I plot the average values. For example, I should get about 2 months, but I get many more.
As you can see, this is not the correct plot of the Sensible Heat Flux monthly mean cycle, because I only have about two months of data.
I used this script to import the data:
datasetBio=pd.read_csv(io.BytesIO(uploaded["eddypro_Bioesame_full_output_exp2.csv"]),sep=';',header = 1, parse_dates= ['day&time'] ,index_col=['day&time'], na_values= "-9999")
and my DatetimeIndex looks like this:
DatetimeIndex(['2022-02-16 00:00:00', '2022-02-16 00:30:00',
'2022-02-16 01:00:00', '2022-02-16 01:30:00',
'2022-02-16 02:00:00', '2022-02-16 02:30:00',
'2022-02-16 03:00:00', '2022-02-16 03:30:00',
'2022-02-16 04:00:00', '2022-02-16 04:30:00',
...
'2022-01-04 07:00:00', '2022-01-04 07:30:00',
'2022-01-04 08:00:00', '2022-01-04 08:30:00',
'2022-01-04 09:00:00', '2022-01-04 09:30:00',
'2022-01-04 10:00:00', '2022-01-04 10:30:00',
'2022-01-04 11:00:00', '2022-01-04 11:30:00'],
dtype='datetime64[ns]', name='day&time', length=2136, freq=None)
My dataset also has a DOY column, but I could not get it to work. I also tried the datetime module with strftime and strptime, without success.
I also tried:
# list comprehension
datasetBio['HM'] = ['%s-%s' % (el.hour, el.minute) for el in datasetBio.index]
list_m = []
list_h = []
for el in datasetBio['HM'].unique():
    list_m.append(datasetBio['H'].loc[datasetBio['HM'] == el].mean())
    list_h.append(el)
# look at the groups
for el in datasetBio['HM'].unique():
    print(datasetBio.loc[datasetBio['HM'] == el])
partial output:
2022-02-19 12:30:00 0.0 ... 0.575134 0.424103 0.066102 0.041973
2022-02-20 12:30:00 1.0 ... 0.898857 0.551975 0.069380 0.221436
2022-02-21 12:30:00 0.0 ... 221.180000 234.682000 0.369427 0.161920
2022-02-22 12:30:00 1.0 ... 0.521469 0.673882 0.074374 0.312831
2022-02-23 12:30:00 0.0 ... 0.303948 0.630388 0.069664 0.283314
When I try to plot the variables from the two datasets together, the same date problem obviously remains.
Instead of showing the correct time range, the axis runs from January to December.
Could someone please help me solve this problem? I don't know what else to try.
Thanks in advance.
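One possible cause, judging only from the index printed above: the last entries read 2022-01-04 although the records are said to end on 01/04/2022, which suggests that day-first dates were parsed month-first for some rows. A minimal sketch of parsing with an explicit format follows. The CSV text, the values of H and the format string "%d/%m/%Y %H:%M" are assumptions for illustration; the real file may use a different layout.

```python
import io
import pandas as pd

# Made-up sample in an assumed day-first layout.
csv_text = (
    "day&time;H\n"
    "16/02/2022 00:00;1.0\n"
    "16/02/2022 00:30;3.0\n"
    "01/04/2022 11:00;5.0\n"
    "01/04/2022 11:30;7.0\n"
)
df = pd.read_csv(io.StringIO(csv_text), sep=";")

# Parse with an explicit format so 01/04/2022 is read as 1 April, not 4 January.
df["day&time"] = pd.to_datetime(df["day&time"], format="%d/%m/%Y %H:%M")
df = df.set_index("day&time").sort_index()

daily = df["H"].resample("D").mean().dropna()                # daily means
monthly = df["H"].groupby(df.index.to_period("M")).mean()    # monthly means

print(df.index.min(), df.index.max())
print(monthly)
```

With the dates parsed consistently, the index spans February to April only, so means grouped by day or by month should no longer spread over the whole year.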

Truncating milliseconds out of DatetimeIndex

When I use pandas.date_range(), I sometimes get timestamps with fractional seconds that I don't wish to keep.
Suppose I do...
import pandas as pd
dr = pd.date_range('2011-01-01', '2011-01-03', periods=15)
>>> dr
DatetimeIndex([ '2011-01-01 00:00:00',
'2011-01-01 03:25:42.857142784',
'2011-01-01 06:51:25.714285824',
'2011-01-01 10:17:08.571428608',
'2011-01-01 13:42:51.428571392',
'2011-01-01 17:08:34.285714176',
'2011-01-01 20:34:17.142857216',
'2011-01-02 00:00:00',
'2011-01-02 03:25:42.857142784',
'2011-01-02 06:51:25.714285824',
'2011-01-02 10:17:08.571428608',
'2011-01-02 13:42:51.428571392',
'2011-01-02 17:08:34.285714176',
'2011-01-02 20:34:17.142857216',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
To drop the fractional seconds, I am currently forced to do this.
>>> t = []
>>> for item in dr:
...     idx = str(item).find('.')
...     if idx != -1:
...         item = str(item)[:idx]
...     t.append(pd.to_datetime(item))
...
>>> t
[Timestamp('2011-01-01 00:00:00'),
Timestamp('2011-01-01 03:25:42'),
Timestamp('2011-01-01 06:51:25'),
Timestamp('2011-01-01 10:17:08'),
Timestamp('2011-01-01 13:42:51'),
Timestamp('2011-01-01 17:08:34'),
Timestamp('2011-01-01 20:34:17'),
Timestamp('2011-01-02 00:00:00'),
Timestamp('2011-01-02 03:25:42'),
Timestamp('2011-01-02 06:51:25'),
Timestamp('2011-01-02 10:17:08'),
Timestamp('2011-01-02 13:42:51'),
Timestamp('2011-01-02 17:08:34'),
Timestamp('2011-01-02 20:34:17'),
Timestamp('2011-01-03 00:00:00')]
Is there a better way ?
I already tried this...
dr = [ pd.to_datetime(item, format='%Y-%m-%d %H:%M:%S') for item in dr ]
But it doesn't do anything.
(pd.date_range('2011-01-01', '2011-01-03', periods=15)).astype('datetime64[s]')
But it says it can't cast it.
dr = (dr.to_series()).apply(lambda x:x.replace(microsecond=0))
But this line doesn't solve my problem, as...
2018-04-17 15:07:04.777777664 gives --> 2018-04-17 15:07:04.000000664
I believe you need DatetimeIndex.floor:
print (dr.floor('S'))
DatetimeIndex(['2011-01-01 00:00:00', '2011-01-01 03:25:42',
'2011-01-01 06:51:25', '2011-01-01 10:17:08',
'2011-01-01 13:42:51', '2011-01-01 17:08:34',
'2011-01-01 20:34:17', '2011-01-02 00:00:00',
'2011-01-02 03:25:42', '2011-01-02 06:51:25',
'2011-01-02 10:17:08', '2011-01-02 13:42:51',
'2011-01-02 17:08:34', '2011-01-02 20:34:17',
'2011-01-03 00:00:00'],
dtype='datetime64[ns]', freq=None)
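For completeness, a short sketch of the same idea applied to a Series through the .dt accessor, and of why the replace attempt in the question left 664 ns behind: replace needs nanosecond=0 as well. I use the lowercase alias "s" here on the assumption that it is accepted across pandas versions; the answer above uses 'S'.

```python
import pandas as pd

dr = pd.date_range("2011-01-01", "2011-01-03", periods=15)

floored = dr.floor("s")              # DatetimeIndex rounded down to whole seconds
floored_col = pd.Series(dr).dt.floor("s")   # same operation on a Series / dataframe column

# On a single Timestamp, microsecond=0 alone leaves the nanosecond part untouched.
ts = pd.Timestamp("2018-04-17 15:07:04.777777664")
cleaned = ts.replace(microsecond=0, nanosecond=0)

print(floored[1], cleaned)
```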

How to construct non-leap datetime list in python?

I have a use case where I always need the non-leap calendar, regardless of whether the year is a leap year. I want to construct a 6-hourly datetime list for the year 2000, for example:
import datetime
import pandas as pa
tdelta = datetime.timedelta(hours=6)
dt = datetime.datetime(2000,1,1,0,)
ts = [dt+i*tdelta for i in range(1460)]
pa.DatetimeIndex(ts)
With this block of code, I get the result:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 06:00:00',
'2000-01-01 12:00:00', '2000-01-01 18:00:00',
'2000-01-02 00:00:00', '2000-01-02 06:00:00',
'2000-01-02 12:00:00', '2000-01-02 18:00:00',
'2000-01-03 00:00:00', '2000-01-03 06:00:00',
...
'2000-12-28 12:00:00', '2000-12-28 18:00:00',
'2000-12-29 00:00:00', '2000-12-29 06:00:00',
'2000-12-29 12:00:00', '2000-12-29 18:00:00',
'2000-12-30 00:00:00', '2000-12-30 06:00:00',
'2000-12-30 12:00:00', '2000-12-30 18:00:00'],
dtype='datetime64[ns]', length=1460, freq=None, tz=None)
However, I want February to have 28 days, so the last member of the output should be '2000-12-31 18:00:00'. Is there a way to do this in Python? Thanks!
All you need to do is check the .month and .day attributes of each datetime instance. Insert a condition that checks:
month == 2
day == 29
If both conditions are true, you don't add it to the list.
To make it more descriptive:
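ts = []
for i in range(366 * 4):  # enough 6-hourly steps to cover a leap year
    x = dt + i * tdelta
    # skip 29 February, and anything that spills into the next year
    if x.year != dt.year or (x.month == 2 and x.day == 29):
        continue
    ts.append(x)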
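The same filter can also be written without an explicit loop. A sketch using pandas, building the full 6-hourly range for the year and masking out 29 February (the names `idx` and `noleap` are illustrative):

```python
import pandas as pd

# Full 6-hourly range for the year 2000, leap day included.
idx = pd.date_range("2000-01-01 00:00", "2000-12-31 18:00", freq=pd.Timedelta(hours=6))

# Keep everything except 29 February.
noleap = idx[~((idx.month == 2) & (idx.day == 29))]

print(len(noleap), noleap[-1])
```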

numpy.datetime64: how to get weekday of numpy datetime64 and check if it's between time1 and time2

How can I check whether a numpy datetime falls between time1 and time2 (ignoring the date)?
Say I have a series of datetimes; I want to check each one's weekday and whether its time is between 13:00 and 13:30. For example,
2014-03-05 22:55:00
is a Wednesday and is not between 13:00 and 13:30.
Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]
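If you want to stay in plain numpy, as the title asks, here is a sketch that derives both the weekday and the time of day from a datetime64 array. It relies on 1970-01-01 having been a Thursday and uses the Monday=0 convention; the sample values are made up apart from the one in the question.

```python
import numpy as np

ts = np.array(
    ["2014-03-05T22:55:00", "2014-03-05T13:10:00", "2014-03-06T13:20:00"],
    dtype="datetime64[s]",
)

days = ts.astype("datetime64[D]")                 # truncate to the calendar day
# Days since 1970-01-01 (a Thursday, i.e. 3 when Monday=0), shifted to Monday=0.
weekday = (days.astype("int64") + 3) % 7

secs = (ts - days) / np.timedelta64(1, "s")       # seconds since midnight
in_window = (secs >= 13 * 3600) & (secs <= 13 * 3600 + 30 * 60)

print(weekday, in_window)
```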
