Pandas .resample() or .asfreq() fill forward times - python

I'm trying to resample a dataframe with a time series from 1-hour increments to 15-minute increments. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
import pandas as pd

df = pd.DataFrame({'date': pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq='1H'),
                   'num': 5})
df = df.set_index('date').asfreq('15T', method='ffill', how='end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?

Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
    pd.date_range(
        df.date.min(),
        df.date.max() + pd.Timedelta('1H'),
        freq='15T',
        closed='left'
    ),
    method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5
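To recover the original two-column layout from the question, the new index can be named and reset afterwards; a small follow-up sketch ('date' is the column name used in the question):
out = df.set_index('date').reindex(
    pd.date_range(
        df.date.min(),
        df.date.max() + pd.Timedelta('1H'),
        freq='15T',
        closed='left'
    ),
    method='ffill'
).rename_axis('date').reset_index()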

Related

How to set a multiindex with multiple dates in pandas?

I have the following dataframe df:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 02:00 2018-01-01 03:00 2
2018-01-01 03:00 2018-01-01 04:00 3
2018-01-01 04:00 2018-01-01 05:00 6
I want to set a multi-index composed of Datetime1 and Datetime2 to further proceed with resampling and interpolation of the data (from 1-hour to 30-minute frequency).
If I do df.set_index(["Datetime1","Datetime2"]).resample("30T").ffill(), it fails (resample cannot operate on a MultiIndex without being told which level to use).
Desired output:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 00:30 2018-01-01 01:30 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 01:30 2018-01-01 02:30 1
...
If there is always a one-hour difference, it is possible to create the MultiIndex after resample by adding 1H to the new DatetimeIndex:
df = df.set_index(["Datetime1"])[['Value']].resample("30T").ffill()
df = df.set_index([df.index.rename('Datetime2') + pd.Timedelta('1H')], append=True)
print (df)
Value
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Or:
s = df.set_index(["Datetime1"])['Value'].resample("30T").ffill()
s.index = [s.index,s.index.rename('Datetime2') + pd.Timedelta('1H')]
print (s)
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Name: Value, dtype: int64
The multi-index is not meant for a double index but for a hierarchical (grouped) index. See the docs. You said in the comments that Datetime2 is always offset by 1 hour. That means it's probably fastest to recalculate it:
df.set_index("Datetime1","Datetime2").resample("30T").ffill()
df["Datetime2" = df.index + pd.Timedelta(1, "hour")

Find value cycles in time series data

I have a large time series (> 5 million rows); the values fluctuate randomly between 2 and 10.
A small section of the time series was shown as a screenshot (not reproduced here).
I want to identify a certain pattern in this time series:
when the value of pct_change is >= a threshold "T", a reading begins flag should be raised;
after the reading begins flag has been raised, a reading continue flag should be raised for every value that is >= T, or < T but != 0, until a zero is encountered;
when a zero is encountered, a reading stop flag should be raised; if the value of pct_change is < T after this flag has been raised, a not reading flag should be raised.
I want to write a function that can tell me how many times this happened and for what duration.
If we take a threshold T of 4 and use pct_change from the example data screenshot, the output that I want is shown in a second screenshot.
The main goal behind this is to find how many times this cycle is repeating for different thresholds.
To generate sample data:
import pandas as pd
a = [2,3,4,2,0,14,5,6,3,2,0,4,5,7,8,10,4,0,5,6,7,10,7,6,4,2,0,1,2,5,6]
idx = pd.date_range("2018-01-01", periods=len(a), freq="H")
ts = pd.Series(a, index=idx)
dd = pd.DataFrame()
dd['pct_change'] = ts
dd.head()
Can you please suggest an efficient way of doing it?
The output I want for a threshold T of 4 was shown as a screenshot.
First, keep only interesting data (>= T | == 0):
threshold = 4
df = dd.loc[dd["pct_change"].ge(threshold) | dd["pct_change"].eq(0)]
>>> df
pct_change
2018-01-01 02:00:00 4 # group 0, end=2018-01-01 04:00:00
2018-01-01 04:00:00 0
2018-01-01 05:00:00 14 # group 1, end=2018-01-01 10:00:00
2018-01-01 06:00:00 5
2018-01-01 07:00:00 6
2018-01-01 10:00:00 0
2018-01-01 11:00:00 4 # group 2, end=2018-01-01 17:00:00
2018-01-01 12:00:00 5
2018-01-01 13:00:00 7
2018-01-01 14:00:00 8
2018-01-01 15:00:00 10
2018-01-01 16:00:00 4
2018-01-01 17:00:00 0
2018-01-01 18:00:00 5 # group 3, end=2018-01-02 02:00:00
2018-01-01 19:00:00 6
2018-01-01 20:00:00 7
2018-01-01 21:00:00 10
2018-01-01 22:00:00 7
2018-01-01 23:00:00 6
2018-01-02 00:00:00 4
2018-01-02 02:00:00 0
2018-01-02 05:00:00 5 # group 4, end=2018-01-02 06:00:00
2018-01-02 06:00:00 6
Then, create the desired groups:
# a zero closes a cycle; the shift delays the increment by one row, so each
# zero still belongs to the group it terminates; cumsum then numbers the groups
groups = df["pct_change"].eq(0).shift(fill_value=0).cumsum()
>>> groups
2018-01-01 02:00:00 0 # group 0
2018-01-01 04:00:00 0
2018-01-01 05:00:00 1 # group 1
2018-01-01 06:00:00 1
2018-01-01 07:00:00 1
2018-01-01 10:00:00 1
2018-01-01 11:00:00 2 # group 2
2018-01-01 12:00:00 2
2018-01-01 13:00:00 2
2018-01-01 14:00:00 2
2018-01-01 15:00:00 2
2018-01-01 16:00:00 2
2018-01-01 17:00:00 2
2018-01-01 18:00:00 3 # group 3
2018-01-01 19:00:00 3
2018-01-01 20:00:00 3
2018-01-01 21:00:00 3
2018-01-01 22:00:00 3
2018-01-01 23:00:00 3
2018-01-02 00:00:00 3
2018-01-02 02:00:00 3
2018-01-02 05:00:00 4 # group 4
2018-01-02 06:00:00 4
Name: pct_change, dtype: object
Finally, use groups to output result:
out = pd.DataFrame(df.groupby(groups)
                     .apply(lambda x: (x.index[0], x.index[-1]))
                     .tolist(), columns=["StartTime", "EndTime"])
>>> out
StartTime EndTime
0 2018-01-01 02:00:00 2018-01-01 04:00:00 # group 0
1 2018-01-01 05:00:00 2018-01-01 10:00:00 # group 1
2 2018-01-01 11:00:00 2018-01-01 17:00:00 # group 2
3 2018-01-01 18:00:00 2018-01-02 02:00:00 # group 3
4 2018-01-02 05:00:00 2018-01-02 06:00:00 # group 4
Bonus
There are some cases where a group has to be removed:
the first pct_change value is 0;
two or more consecutive pct_change values are 0.
In both cases the group collapses to a single zero row, so StartTime equals EndTime. To remove them:
out = out[~out["StartTime"].eq(out["EndTime"])]

Applying start and endtime as filters to dataframe

I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should set all values which are not within the start and end times to zero and retain the values that are within the start and end times specified in the filter. I tried applying two simultaneous filters for start and end time, but it didn't work.
Any help would be appreciated.
The idea is to create all the masks with Series.between in a list comprehension, join them with np.logical_or.reduce, and finally pass the combined mask to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
import numpy as np

L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
A solution using an outer merge (producing a cartesian product) and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
# NB: merge returns a new RangeIndex, so the original row labels must be
# carried through as a column before taking the cartesian product:
matched = (df1.reset_index().assign(key=0)
              .merge(df2.assign(key=0), on='key', how='outer')
              .query("timestamp >= start_time and timestamp < end_time")['index'])
df1.loc[df1.index.difference(matched), "Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The key column (assign(key=0)) is added to both dataframes to produce the cartesian product.
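For many intervals, an IntervalIndex can be a leaner alternative to the cartesian product. A sketch, assuming the intervals in df2 do not overlap (get_indexer raises on overlapping intervals); closed='left' mirrors the >= start_time and < end_time condition above:
intervals = pd.IntervalIndex.from_arrays(df2['start_time'], df2['end_time'],
                                         closed='left')
# get_indexer returns -1 for timestamps that fall inside no interval
inside = intervals.get_indexer(df1['timestamp']) != -1
df1['Value'] = df1['Value'].where(inside, 0)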

Insert missing rows in a specific time series

I have a specific time-series dataset, shown below.
0 2018-01-01 00:00:00+00:00 ...
1 2018-01-01 00:10:00+00:00 ...
2 2018-01-01 00:20:00+00:00 ...
3 2018-01-01 00:30:00+00:00 ...
4 2018-01-01 00:50:00+00:00 ...
5 2018-01-01 01:00:00+00:00 ...
6 2018-01-01 01:20:00+00:00 ...
7 2018-01-01 01:40:00+00:00 ...
.
.
.
However, there are some missing rows in the dataset.
I have searched for how to insert rows into this specific dataset and did not find any useful help. Rows have to be added so that there is an entry every 10 minutes, with NaN values in the other columns.
Any ideas?
Create a DatetimeIndex first, then call DataFrame.asfreq:
print (df)
date_col value
0 2018-01-01 00:00:00+00:00 4
1 2018-01-01 00:10:00+00:00 9
2 2018-01-01 00:20:00+00:00 1
3 2018-01-01 00:30:00+00:00 6
4 2018-01-01 00:50:00+00:00 3
5 2018-01-01 01:00:00+00:00 4
6 2018-01-01 01:20:00+00:00 5
7 2018-01-01 01:40:00+00:00 0
#if necessary
df['date_col'] = pd.to_datetime(df['date_col'])
df = df.set_index('date_col').asfreq('10Min')
print (df)
value
date_col
2018-01-01 00:00:00+00:00 4.0
2018-01-01 00:10:00+00:00 9.0
2018-01-01 00:20:00+00:00 1.0
2018-01-01 00:30:00+00:00 6.0
2018-01-01 00:40:00+00:00 NaN
2018-01-01 00:50:00+00:00 3.0
2018-01-01 01:00:00+00:00 4.0
2018-01-01 01:10:00+00:00 NaN
2018-01-01 01:20:00+00:00 5.0
2018-01-01 01:30:00+00:00 NaN
2018-01-01 01:40:00+00:00 0.0
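One side effect worth noting: asfreq introduces NaN rows, so the integer column is upcast to float (visible above as 4.0, 9.0, ...). If keeping integers matters, pandas' nullable integer dtype can restore them:
df['value'] = df['value'].astype('Int64')  # NaN becomes <NA>, values stay integers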

Group pandas rows into pairs then find timedelta

I have a dataframe where I need to group the TX/RX column into pairs, and then put these into a new dataframe with a new index and the timedelta between them as values.
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = pd.date_range('2018-01-01', periods=6, freq='1H1min')
df['id'] = range(1, 7)       # ids, as implied by the sample output below
df['val'] = ['A', 'B'] * 3   # vals, as implied by the sample output below
time1 time2 id val
0 2018-01-01 00:00:00 2018-01-01 00:00:00 1 A
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A
3 2018-01-01 03:00:00 2018-01-01 03:03:00 4 B
4 2018-01-01 04:00:00 2018-01-01 04:04:00 5 A
5 2018-01-01 05:00:00 2018-01-01 05:05:00 6 B
needs to be...
index timedelta A B
0 1 1 2
1 1 3 4
2 1 5 6
I think that pivot_tables or stack/unstack is probably the best way to go about this, but I'm not entirely sure how...
I believe you need:
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = df['time1'] + pd.to_timedelta([60,60,120,120,180,180], 's')
df['id'] = range(1,7)
df['val'] = ['A','B'] * 3
df['t'] = df['time2'] - df['time1']
print (df)
time1 time2 id val t
0 2018-01-01 00:00:00 2018-01-01 00:01:00 1 A 00:01:00
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B 00:01:00
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A 00:02:00
3 2018-01-01 03:00:00 2018-01-01 03:02:00 4 B 00:02:00
4 2018-01-01 04:00:00 2018-01-01 04:03:00 5 A 00:03:00
5 2018-01-01 05:00:00 2018-01-01 05:03:00 6 B 00:03:00
#if necessary convert to seconds
#df['t'] = (df['time2'] - df['time1']).dt.total_seconds()
df = df.pivot(index='t', columns='val', values='id').reset_index().rename_axis(None, axis=1)
#if necessary aggregate values
#df = (df.pivot_table(index='t',columns='val',values='id', aggfunc='mean')
# .reset_index().rename_axis(None, axis=1))
print (df)
t A B
0 00:01:00 1 2
1 00:02:00 3 4
2 00:03:00 5 6
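To match the integer timedelta column from the desired output in the question, the Timedelta values can be converted; a sketch, assuming whole-minute gaps as in the sample data:
df['timedelta'] = df['t'].dt.total_seconds().div(60).astype(int)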
