I have a large time series (more than 5 million rows) whose values fluctuate randomly between 2 and 10.
A small section of the time series:
I want to identify a certain pattern in this time series. The pattern:
- when the value of pct_change is >= a threshold "T", raise a flag that says reading begins;
- after the reading-begins flag has been raised, keep a reading-continue flag raised as long as the value of pct_change (whether >= T or < T) is != 0, until a zero is encountered;
- when a zero is encountered, raise a reading-stop flag; if the value of pct_change is < T after this flag has been raised, raise a not-reading flag.
I want to write a function that can tell me how many times and for what duration this happened.
If we take a threshold T of 4 and use pct_change from the example data screenshot, then the output that I want is shown further below.
The main goal behind this is to find how many times this cycle repeats for different thresholds.
To generate sample data :
import pandas as pd
a = [2,3,4,2,0,14,5,6,3,2,0,4,5,7,8,10,4,0,5,6,7,10,7,6,4,2,0,1,2,5,6]
idx = pd.date_range("2018-01-01", periods=len(a), freq="H")
ts = pd.Series(a, index=idx)
dd = pd.DataFrame()
dd['pct_change'] = ts
dd.head()
Can you please suggest an efficient way of doing it?
Output that I want if the threshold 'T' is 4 (i.e. flag values >= 4):
First, keep only interesting data (>= T | == 0):
threshold = 4
df = dd.loc[dd["pct_change"].ge(threshold) | dd["pct_change"].eq(0)]
>>> df
pct_change
2018-01-01 02:00:00 4 # group 0, end=2018-01-01 04:00:00
2018-01-01 04:00:00 0
2018-01-01 05:00:00 14 # group 1, end=2018-01-01 10:00:00
2018-01-01 06:00:00 5
2018-01-01 07:00:00 6
2018-01-01 10:00:00 0
2018-01-01 11:00:00 4 # group 2, end=2018-01-01 17:00:00
2018-01-01 12:00:00 5
2018-01-01 13:00:00 7
2018-01-01 14:00:00 8
2018-01-01 15:00:00 10
2018-01-01 16:00:00 4
2018-01-01 17:00:00 0
2018-01-01 18:00:00 5 # group 3, end=2018-01-02 02:00:00
2018-01-01 19:00:00 6
2018-01-01 20:00:00 7
2018-01-01 21:00:00 10
2018-01-01 22:00:00 7
2018-01-01 23:00:00 6
2018-01-02 00:00:00 4
2018-01-02 02:00:00 0
2018-01-02 05:00:00 5 # group 4, end=2018-01-02 06:00:00
2018-01-02 06:00:00 6
Then, create the desired groups (each zero closes a group, so the cumulative sum of the shifted zero-flags starts a new label on the row after every zero):
groups = df["pct_change"].eq(0).shift(fill_value=0).cumsum()
>>> groups
2018-01-01 02:00:00 0 # group 0
2018-01-01 04:00:00 0
2018-01-01 05:00:00 1 # group 1
2018-01-01 06:00:00 1
2018-01-01 07:00:00 1
2018-01-01 10:00:00 1
2018-01-01 11:00:00 2 # group 2
2018-01-01 12:00:00 2
2018-01-01 13:00:00 2
2018-01-01 14:00:00 2
2018-01-01 15:00:00 2
2018-01-01 16:00:00 2
2018-01-01 17:00:00 2
2018-01-01 18:00:00 3 # group 3
2018-01-01 19:00:00 3
2018-01-01 20:00:00 3
2018-01-01 21:00:00 3
2018-01-01 22:00:00 3
2018-01-01 23:00:00 3
2018-01-02 00:00:00 3
2018-01-02 02:00:00 3
2018-01-02 05:00:00 4 # group 4
2018-01-02 06:00:00 4
Name: pct_change, dtype: object
Finally, use the groups to build the result:
out = pd.DataFrame(
    df.groupby(groups)
      .apply(lambda x: (x.index[0], x.index[-1]))
      .tolist(),
    columns=["StartTime", "EndTime"],
)
>>> out
StartTime EndTime
0 2018-01-01 02:00:00 2018-01-01 04:00:00 # group 0
1 2018-01-01 05:00:00 2018-01-01 10:00:00 # group 1
2 2018-01-01 11:00:00 2018-01-01 17:00:00 # group 2
3 2018-01-01 18:00:00 2018-01-02 02:00:00 # group 3
4 2018-01-02 05:00:00 2018-01-02 06:00:00 # group 4
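The question also asks for the duration and the count; both follow directly from out (the Duration column is my addition, computed here before the bonus cleanup below):

out["Duration"] = out["EndTime"] - out["StartTime"]
n_cycles = len(out)   # how many times the cycle occurred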
Bonus
There are some cases where you have to remove groups:
- the first pct_change value is 0;
- two or more consecutive pct_change values are 0.
In both cases the group starts and ends on the same zero row. To remove them:
out = out[~out["StartTime"].eq(out["EndTime"])]
Related
I have the following dataframe df:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 02:00 2018-01-01 03:00 2
2018-01-01 03:00 2018-01-01 04:00 3
2018-01-01 04:00 2018-01-01 05:00 6
I want to set a MultiIndex composed of Datetime1 and Datetime2 to further proceed with resampling and interpolating the data (from 1-hour to 30-minute frequency).
If I do df.set_index(["Datetime1","Datetime2"]).resample("30T").ffill(), then it fails.
Desired output:
Datetime1 Datetime2 Value
2018-01-01 00:00 2018-01-01 01:00 5
2018-01-01 00:30 2018-01-01 01:30 5
2018-01-01 01:00 2018-01-01 02:00 1
2018-01-01 01:30 2018-01-01 02:30 1
...
If there is always a one-hour difference, it is possible to create the MultiIndex after resampling, by adding 1H to the new DatetimeIndex:
df = df.set_index(["Datetime1"])[['Value']].resample("30T").ffill()
df = df.set_index([df.index.rename('Datetime2') + pd.Timedelta('1H')], append=True)
print (df)
Value
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Or:
s = df.set_index(["Datetime1"])['Value'].resample("30T").ffill()
s.index = [s.index,s.index.rename('Datetime2') + pd.Timedelta('1H')]
print (s)
Datetime1 Datetime2
2018-01-01 00:00:00 2018-01-01 01:00:00 5
2018-01-01 00:30:00 2018-01-01 01:30:00 5
2018-01-01 01:00:00 2018-01-01 02:00:00 1
2018-01-01 01:30:00 2018-01-01 02:30:00 1
2018-01-01 02:00:00 2018-01-01 03:00:00 2
2018-01-01 02:30:00 2018-01-01 03:30:00 2
2018-01-01 03:00:00 2018-01-01 04:00:00 3
2018-01-01 03:30:00 2018-01-01 04:30:00 3
2018-01-01 04:00:00 2018-01-01 05:00:00 6
Name: Value, dtype: int64
The multi-index is not meant for a double index but for a hierarchical (grouped) index. See the docs. You said in the comments that Datetime2 is always offset by 1 hour, which means it's probably fastest to recalculate it:
df = df.set_index("Datetime1").resample("30T").ffill()
df["Datetime2"] = df.index + pd.Timedelta(1, "hour")
I have a data frame like the one below. I want to do sampling with '3S'.
There are situations where NaN is present. What I expect is that the data frame is sampled with '3S', but if a NaN is found in between, the sampling stops there and restarts from that index. I tried to achieve this with the dataframe.apply method, but it gets very complex. Is there a shorter way? What I have tried:
df.sample(n=3)
Code to generate Input:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='H')
series = pd.DataFrame(range(13), index=index)
series.iloc[4] = np.nan
series.iloc[10] = np.nan
print(series)
I tried to do the sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled based on '3S', but it should also take any NaN into account and restart the sampling where the NaN records are found.
Expected Output:
2015-01-01 02:00:00 2.0 -- sampled after 3S
2015-01-01 03:00:00 2.0 -- printed because a NaN is found next
2015-01-01 04:00:00 NaN -- print the NaN record
2015-01-01 07:00:00 4.0 -- sampled after 3S
2015-01-01 08:00:00 4.0 -- printed because a NaN is found next
2015-01-01 09:00:00 NaN -- print the NaN record
2015-01-01 12:00:00 4.0 -- sampled after 3S
Use:
index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='H')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
- Create a mask of the missing values with Series.isna.
- Create groups of consecutive values by comparing the mask with its shifted self using Series.ne (!=):
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
- Get the first index value per group, add the timedelta (2H here, so that every third hourly row is kept) and compare with the DatetimeIndex.
- Last, filter by boolean indexing, chaining the masks with | (bitwise OR).
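The same idea can also be written without timestamp arithmetic, using each row's position inside its run; a sketch of such a variant (the helper name sample_with_nan_breaks and the skip parameter are mine, and a regular frequency is assumed):

import pandas as pd

def sample_with_nan_breaks(df, col, skip=2):
    # keep NaN rows, plus rows at offset >= skip within each non-NaN run
    m = df[col].isna()
    s1 = m.ne(m.shift()).cumsum()    # label consecutive NaN / non-NaN runs
    pos = df.groupby(s1).cumcount()  # offset of each row within its run
    return df[(pos >= skip) | m]

# sample_with_nan_breaks(df, 'col') reproduces df1 above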
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
And then do the resampling on the series (if datetime is your index):
series.resample('30S').asfreq()
I have a specific time-series dataset, shown below.
0 2018-01-01 00:00:00+00:00 ...
1 2018-01-01 00:10:00+00:00 ...
2 2018-01-01 00:20:00+00:00 ...
3 2018-01-01 00:30:00+00:00 ...
4 2018-01-01 00:50:00+00:00 ...
5 2018-01-01 01:00:00+00:00 ...
6 2018-01-01 01:20:00+00:00 ...
7 2018-01-01 01:40:00+00:00 ...
.
.
.
However, there are some missing rows in the dataset.
I have searched for how to insert rows into this specific dataset and did not find any useful help. We have to add rows so that there is an entry every 10 minutes, with NaN values in the other columns.
Any idea?
Create a DatetimeIndex first and call DataFrame.asfreq:
print (df)
date_col value
0 2018-01-01 00:00:00+00:00 4
1 2018-01-01 00:10:00+00:00 9
2 2018-01-01 00:20:00+00:00 1
3 2018-01-01 00:30:00+00:00 6
4 2018-01-01 00:50:00+00:00 3
5 2018-01-01 01:00:00+00:00 4
6 2018-01-01 01:20:00+00:00 5
7 2018-01-01 01:40:00+00:00 0
# if necessary, convert the column to datetime
df['date_col'] = pd.to_datetime(df['date_col'])
df = df.set_index('date_col').asfreq('10Min')
print (df)
value
date_col
2018-01-01 00:00:00+00:00 4.0
2018-01-01 00:10:00+00:00 9.0
2018-01-01 00:20:00+00:00 1.0
2018-01-01 00:30:00+00:00 6.0
2018-01-01 00:40:00+00:00 NaN
2018-01-01 00:50:00+00:00 3.0
2018-01-01 01:00:00+00:00 4.0
2018-01-01 01:10:00+00:00 NaN
2018-01-01 01:20:00+00:00 5.0
2018-01-01 01:30:00+00:00 NaN
2018-01-01 01:40:00+00:00 0.0
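Alternatively, resample can produce the same grid here, because the existing timestamps already fall on 10-minute boundaries (a sketch, starting again from the original frame):

df = df.set_index('date_col').resample('10Min').asfreq()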
I am currently working on time series in Python 3 and pandas, and I want to build a summary of the periods of contiguous missing values, but I am only able to find the indexes of the NaN values...
Sample data :
Valeurs
2018-01-01 00:00:00 1.0
2018-01-01 04:00:00 NaN
2018-01-01 08:00:00 2.0
2018-01-01 12:00:00 NaN
2018-01-01 16:00:00 NaN
2018-01-01 20:00:00 5.0
2018-01-02 00:00:00 6.0
2018-01-02 04:00:00 7.0
2018-01-02 08:00:00 8.0
2018-01-02 12:00:00 9.0
2018-01-02 16:00:00 5.0
2018-01-02 20:00:00 NaN
2018-01-03 00:00:00 NaN
2018-01-03 04:00:00 NaN
2018-01-03 08:00:00 1.0
2018-01-03 12:00:00 2.0
2018-01-03 16:00:00 NaN
Expected results :
Start_Date number of contiguous missing values
2018-01-01 04:00:00 1
2018-01-01 12:00:00 2
2018-01-02 20:00:00 3
2018-01-03 16:00:00 1
How can I manage to obtain this type of result with pandas (shift(), cumsum(), groupby()...)?
Thank you for your advice!
Sylvain
groupby and agg
mask = df.Valeurs.isna()
# (~mask).cumsum() gives every NaN run the label of the non-NaN block before it;
# per run, take the first timestamp and the run size
d = df.index.to_series()[mask].groupby((~mask).cumsum()[mask]).agg(['first', 'size'])
d.rename(columns=dict(size='num of contig null', first='Start_Date')).reset_index(drop=True)
Start_Date num of contig null
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
Working on the underlying numpy array:
import numpy as np

a = df.Valeurs.values
# pad with False so runs touching either edge are closed off
m = np.concatenate(([False], np.isnan(a), [False]))
# indices where the mask flips: even entries start a NaN run, odd entries end one
idx = np.nonzero(m[1:] != m[:-1])[0]
# timestamps where a NaN run begins (current value NaN, previous one not)
out = df[df.Valeurs.isnull() & ~df.Valeurs.shift().isnull()].index
pd.DataFrame({'Start date': out, 'contiguous': (idx[1::2] - idx[::2])})
Start date contiguous
0 2018-01-01 04:00:00 1
1 2018-01-01 12:00:00 2
2 2018-01-02 20:00:00 3
3 2018-01-03 16:00:00 1
If you have the indices where the values occur, you can also use itertools to find the continuous chunks, for example:
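A sketch of that idea with itertools.groupby, run against the sample frame above (the variable names are mine):

from itertools import groupby

import pandas as pd

out, pos = [], 0
for is_nan, run in groupby(df['Valeurs'].isna()):
    length = len(list(run))          # length of this NaN / non-NaN run
    if is_nan:
        out.append((df.index[pos], length))
    pos += length

print(pd.DataFrame(out, columns=['Start_Date', 'number of contiguous missing values']))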
I'm trying to resample a dataframe with a time series from 1-hour to 15-minute increments. Both .resample() and .asfreq() do almost exactly what I want, but I'm having a hard time filling the last three intervals.
I could add an extra hour at the end, resample, and then drop that last hour, but it feels hacky.
Current code:
df = pd.DataFrame({'date':pd.date_range('2018-01-01 00:00', '2018-01-01 01:00', freq = '1H'), 'num':5})
df = df.set_index('date').asfreq('15T', method='ffill', how='end').reset_index()
Current output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
Desired output:
date num
0 2018-01-01 00:00:00 5
1 2018-01-01 00:15:00 5
2 2018-01-01 00:30:00 5
3 2018-01-01 00:45:00 5
4 2018-01-01 01:00:00 5
5 2018-01-01 01:15:00 5
6 2018-01-01 01:30:00 5
7 2018-01-01 01:45:00 5
Thoughts?
Not sure about asfreq but reindex works wonderfully:
df.set_index('date').reindex(
pd.date_range(
df.date.min(),
df.date.max() + pd.Timedelta('1H'), freq='15T', closed='left'
),
method='ffill'
)
num
2018-01-01 00:00:00 5
2018-01-01 00:15:00 5
2018-01-01 00:30:00 5
2018-01-01 00:45:00 5
2018-01-01 01:00:00 5
2018-01-01 01:15:00 5
2018-01-01 01:30:00 5
2018-01-01 01:45:00 5
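If the input is strictly hourly, an equivalent way to build the target index is to derive the number of 15-minute slots from the row count (a sketch; the factor 4, i.e. four 15-minute slots per hour, is my assumption):

idx = pd.date_range(df.date.min(), periods=len(df) * 4, freq='15T')
df.set_index('date').reindex(idx, method='ffill')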