How to set a multiindex with multiple dates in pandas? - python

I have the following dataframe df:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 02:00 2018-01-01 03:00 2
2018-01-01 03:00 2018-01-01 04:00 3
2018-01-01 04:00 2018-01-01 05:00 6
I want to set a multi index composed of Datetime1 and Datetime2 to further proceed with the data resampling and interpolation (from 1 hour to 30 minutes frequency).
If I do df.set_index(["Datetime1","Datetime2"]).resample("30T").ffill(), then it fails.
Desired output:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 00:30 2018-01-01 01:30 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 01:30 2018-01-01 02:30 1
...

If there is one hour difference is possible create MultiIndex after resample with add 1H to new DatetimeIndex:
df = df.set_index(["Datetime1"])[['Value']].resample("30T").ffill()
df = df.set_index([df.index.rename('Datetime2') + pd.Timedelta('1H')], append=True)
print (df)
Value
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Or:
s = df.set_index(["Datetime1"])['Value'].resample("30T").ffill()
s.index = [s.index,s.index.rename('Datetime2') + pd.Timedelta('1H')]
print (s)
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Name: Value, dtype: int64

The multi-index is not meant for a double-index but for a hierarchical (grouped) index. See the docs. You said in the comments, that Datetime2 is always offset by 1 hour. That means it's probably fastest to recalculate it:
df.set_index("Datetime1","Datetime2").resample("30T").ffill()
df["Datetime2" = df.index + pd.Timedelta(1, "hour")

Related

Find value cycles in time series data

I have a large time-series > 5 million rows, the values in time series fluctuate randomly between 2-10:
A small section of time-series:
I want to identify a certain pattern from this time series, pattern:
when the value of pct_change is >= threshold " T " I want to raise a flag that says reading begins
if the value of pct_change is >= T or < T and !=0 after reading begins flag has been raised then a reading continue flag should be raised until a zero is encountered
if a zero is encountered then a reading stop flag should be raised if the value of pct_change is < T after this flag has been raised then a not reading flag should be raised.
I want to write a function that can tell me how many times and for what duration this happened.
If we take a threshold T of 4 and use pct_change from the example data screenshot then the output that I want is :
The main goal behind this is to find how many times this cycle is repeating for different thresholds.
To generate sample data :
import pandas as pd
a = [2,3,4,2,0,14,5,6,3,2,0,4,5,7,8,10,4,0,5,6,7,10,7,6,4,2,0,1,2,5,6]
idx = pd.date_range("2018-01-01", periods=len(a), freq="H")
ts = pd.Series(a, index=idx)
dd = pd.DataFrame()
dd['pct_change'] =ts
dd.head()
Can you please suggest an efficient way of doing it?
Output that I want if threshold 'T' is >= 4 :
First, keep only interesting data (>= T | == 0):
threshold = 4
df = dd.loc[dd["pct_change"].ge(threshold) | dd["pct_change"].eq(0)]
>>> df
pct_change
2018-01-01 02:00:00 4 # group 0, end=2018-01-01 04:00:00
2018-01-01 04:00:00 0
2018-01-01 05:00:00 14 # group 1, end=2018-01-01 10:00:00
2018-01-01 06:00:00 5
2018-01-01 07:00:00 6
2018-01-01 10:00:00 0
2018-01-01 11:00:00 4 # group 2, end=2018-01-01 17:00:00
2018-01-01 12:00:00 5
2018-01-01 13:00:00 7
2018-01-01 14:00:00 8
2018-01-01 15:00:00 10
2018-01-01 16:00:00 4
2018-01-01 17:00:00 0
2018-01-01 18:00:00 5 # group 3, end=2018-01-02 02:00:00
2018-01-01 19:00:00 6
2018-01-01 20:00:00 7
2018-01-01 21:00:00 10
2018-01-01 22:00:00 7
2018-01-01 23:00:00 6
2018-01-02 00:00:00 4
2018-01-02 02:00:00 0
2018-01-02 05:00:00 5 # group 4, end=2018-01-02 06:00:00
2018-01-02 06:00:00 6
Then, create wanting groups:
groups = df["pct_change"].eq(0).shift(fill_value=0).cumsum()
>>> groups
2018-01-01 02:00:00 0 # group 0
2018-01-01 04:00:00 0
2018-01-01 05:00:00 1 # group 1
2018-01-01 06:00:00 1
2018-01-01 07:00:00 1
2018-01-01 10:00:00 1
2018-01-01 11:00:00 2 # group 2
2018-01-01 12:00:00 2
2018-01-01 13:00:00 2
2018-01-01 14:00:00 2
2018-01-01 15:00:00 2
2018-01-01 16:00:00 2
2018-01-01 17:00:00 2
2018-01-01 18:00:00 3 # group 3
2018-01-01 19:00:00 3
2018-01-01 20:00:00 3
2018-01-01 21:00:00 3
2018-01-01 22:00:00 3
2018-01-01 23:00:00 3
2018-01-02 00:00:00 3
2018-01-02 02:00:00 3
2018-01-02 05:00:00 4 # group 4
2018-01-02 06:00:00 4
Name: pct_change, dtype: object
Finally, use groups to output result:
out = pd.DataFrame(df.groupby(groups) \
.apply(lambda x: (x.index[0], x.index[-1])) \
.tolist(), columns=["StartTime", "EndTime"])
>>> out
StartTime EndTime
0 2018-01-01 02:00:00 2018-01-01 04:00:00 # group 0
1 2018-01-01 05:00:00 2018-01-01 10:00:00 # group 1
2 2018-01-01 11:00:00 2018-01-01 17:00:00 # group 2
3 2018-01-01 18:00:00 2018-01-02 02:00:00 # group 3
4 2018-01-02 05:00:00 2018-01-02 06:00:00 # group 4
Bonus
There are some case where you have to remove groups:
The first pct value is 0
Two or more consecutive pct value is 0
To remove them:
out = out[~out["StartTime"].eq(out["EndTime"])]

How to convert hourly data to half hourly

I have the following dataframe:
datetime temp
0 2015-01-01 00:00:00 11.22
1 2015-01-01 01:00:00 11.32
2 2015-01-01 02:00:00 11.30
3 2015-01-01 03:00:00 11.25
4 2015-01-01 04:00:00 11.32
... ... ...
31339 2018-07-29 19:00:00 17.60
31340 2018-07-29 20:00:00 17.49
31341 2018-07-29 21:00:00 17.44
31342 2018-07-29 22:00:00 17.39
31343 2018-07-29 23:00:00 17.37
I want to convert this dataframe to have data each half hour, and inpute each new position with the mean between the previous and the following value (or any similar interpolation), that is, for example:
datetime temp
0 2015-01-01 00:00:00 11.00
1 2015-01-01 00:30:00 11.50
2 2015-01-01 01:00:00 12.00
Is there any pandas/datetime function to assist in this operation?
Thank you
You can use the resample() function in Pandas. With this you can set the time to down/upsample to and then what you want to do with it (mean, sum etc.). In your case you can also interpolate between the values.
For this to work your datetime column will have to be a datetime dtype, then set it to the index.
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)
Then you can resample to 30 minutes ('30T') and then interpolate.
df.resample('30T').interpolate()
Resulting in...
temp
datetime
2015-01-01 00:00:00 11.220
2015-01-01 00:30:00 11.270
2015-01-01 01:00:00 11.320
2015-01-01 01:30:00 11.310
2015-01-01 02:00:00 11.300
2015-01-01 02:30:00 11.275
2015-01-01 03:00:00 11.250
2015-01-01 03:30:00 11.285
2015-01-01 04:00:00 11.320
Read more about the frequency strings and resampling in the Pandas docs.

Counting continuous nan values in panda Time series

I actually work on time series in Python 3 and Pandas and I want to make a synthesis of periods of contiguous missing values but I'm only able to find the indexes of nan values ...
Sample data :
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results :
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can i manage to obtain this type of results with pandas (shift(), cumsum(), groupby() ???)?
Thank you for your advice!
Sylvain
groupby and agg
mask = df.Valeurs.isna()
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
Working on the underlying numpy array:
a = df.Valeurs.values
m = np.concatenate(([False],np.isnan(a),[False]))
idx = np.nonzero(m[1:] != m[:-1])[0]
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': (idx[1::2] - idx[::2])})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
If you have the indices where the values occur, you can use itertools as in this to find continuous chunks

Combine dataframes result with a DatetimeIndex index

i have a pandas dataframe with random values at every minute.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0,30,size=20), index=pd.date_range("20180101", periods=20, freq='T'))
df
0
2018-01-01 00:00:00 21
2018-01-01 00:01:00 21
2018-01-01 00:02:00 23
2018-01-01 00:03:00 18
2018-01-01 00:04:00 3
2018-01-01 00:05:00 11
2018-01-01 00:06:00 3
2018-01-01 00:07:00 4
2018-01-01 00:08:00 5
2018-01-01 00:09:00 25
2018-01-01 00:10:00 15
2018-01-01 00:11:00 11
2018-01-01 00:12:00 29
2018-01-01 00:13:00 22
2018-01-01 00:14:00 7
2018-01-01 00:15:00 13
2018-01-01 00:16:00 26
2018-01-01 00:17:00 7
2018-01-01 00:18:00 26
2018-01-01 00:19:00 15
Now, I must create a new column in the dataframe df that "reflects" the mean() of a window of 2 periods on an higher frequency(5 minutes).
df2 = df.resample('5T').sum().rolling(2).mean()
df2
0
2018-01-01 00:00:00 NaN
2018-01-01 00:05:00 67.0
2018-01-01 00:10:00 66.0
2018-01-01 00:15:00 85.5
Here comes the problem. I need to "map" somehow the values of the "higher frequency" frame to the lower.
I should get something like:
0 new_column
2018-01-01 00:00:00 21 NaN
2018-01-01 00:01:00 21 NaN
2018-01-01 00:02:00 23 NaN
2018-01-01 00:03:00 18 NaN
2018-01-01 00:04:00 3 NaN
2018-01-01 00:05:00 11 67.0
2018-01-01 00:06:00 3 67.0
2018-01-01 00:07:00 4 67.0
2018-01-01 00:08:00 5 67.0
2018-01-01 00:09:00 25 67.0
2018-01-01 00:10:00 15 66.0
2018-01-01 00:11:00 11 66.0
2018-01-01 00:12:00 29 66.0
2018-01-01 00:13:00 22 66.0
2018-01-01 00:14:00 7 66.0
2018-01-01 00:15:00 13 85.5
2018-01-01 00:16:00 26 85.5
2018-01-01 00:17:00 7 85.5
2018-01-01 00:18:00 26 85.5
2018-01-01 00:19:00 15 85.5
I am using pandas 0.23.4
You can just use:
df['new_column'] = df2[0].repeat(5).values
with 5 being your resampling factor
You can pd.concat both dataframes and fillforward
df3=pd.concat([df,df2],axis=1).ffill()

Pandas .resample() or .asfreq() fill forward times

I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method = 'ffill', how = 'end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
pd.date_range(
df.date.min(),
df.date.max() + pd.Timedelta('1H'), freq='15T', closed='left'
),
method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5

Categories