I have the following dataset, a Pandas dataframe:
Score min max Date
Loc
0 2.757 0.000 2.757 2020-07-04 11:00:00
3 2.723 2.723 0.000 2020-07-04 14:00:00
8 2.724 2.724 0.000 2020-07-04 19:00:00
11 2.752 0.000 2.752 2020-07-04 22:00:00
13 2.742 2.742 0.000 2020-07-05 00:00:00
15 2.781 0.000 2.781 2020-07-05 02:00:00
18 2.758 2.758 0.000 2020-07-05 05:00:00
20 2.865 0.000 2.865 2020-07-05 07:00:00
24 2.832 0.000 2.832 2020-07-05 11:00:00
25 2.779 2.779 0.000 2020-07-05 12:00:00
29 2.775 2.775 0.000 2020-07-05 16:00:00
34 2.954 0.000 2.954 2020-07-05 21:00:00
37 2.886 2.886 0.000 2020-07-06 00:00:00
48 3.101 0.000 3.101 2020-07-06 11:00:00
53 3.012 3.012 0.000 2020-07-06 16:00:00
55 3.068 0.000 3.068 2020-07-06 18:00:00
61 2.970 2.970 0.000 2020-07-07 00:00:00
64 3.058 0.000 3.058 2020-07-07 03:00:00
Where:
Score is a very basic trend; min and max are the local minima and maxima of Score.
Loc is the x-axis position of that row, and Date is the timestamp of that row on the chart.
This data, when plotted, looks like this:
I'm trying to detect the data in the red box programmatically, so that I can detect it in other datasets. Basically, I'm looking for a way to define that piece of data in code, so that it can be detected in other data.
So far I have only managed to mark the local maxima and minima (yellow and red points) on the chart. I also know how to define the pattern in my own words; I just need to express it in code:
Find when a minimum/maximum point is very distant from the previous minimum/maximum point (i.e. its value is much higher).
After that, find when the local minima and maxima are very close to each other and their values do not differ much from one another. In short, when a strong increase is followed by a range where the score doesn't move up or down a lot.
I hope the question is clear enough; if needed I can give more details. I don't know if this is doable with NumPy or any other library.
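As a rough sketch of that verbal definition (the threshold values and the simple one-step look-ahead below are assumptions chosen only for illustration, not something taken from the data):
import numpy as np
import pandas as pd

# df is the frame above, indexed by Loc (every row here is already a local extremum)
JUMP = 0.15       # assumed: minimum rise vs. the previous extremum to call it a "strong increase"
MAX_GAP = 6       # assumed: maximum Loc distance between extrema to call them "near each other"
MAX_DRIFT = 0.05  # assumed: maximum value difference to call the following range "flat"

jump = df['Score'].diff()           # value change vs. the previous extremum
gap = df.index.to_series().diff()   # Loc distance to the previous extremum

strong_rise = jump > JUMP           # step 1: a big jump up from the previous extremum
flat_after = (gap.shift(-1) <= MAX_GAP) & (jump.shift(-1).abs() <= MAX_DRIFT)  # step 2: next extremum is close and similar

candidates = df[strong_rise & flat_after]
print(candidates)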
I think dynamic time warping (dtw) might work for you. I have used it for something similar. Essentially it allows you to evaluate time series similarity.
Here are the python implementations I know of:
fastdtw
dtw
dtw-python
Here is a decent explanation of how it works
Towards Data Science Explanation of DTW
You could use it to compare how similar the incoming time series is to the data in your red box.
For example:
import numpy as np
from fastdtw import fastdtw

# Event we're looking for
event = np.array([10, 100, 50, 60, 50, 70])
# A matching event occurring
event2 = np.array([0, 7, 12, 4, 11, 100, 51, 62, 53, 72])
# A non-matching event
non_event = np.array([0, 5, 10, 5, 10, 20, 30, 20, 11, 9])

distance, path = fastdtw(event, event2)
distance2, path2 = fastdtw(event, non_event)
This will produce a set of index pairs along which the two time series are best matched. At this point you can evaluate them with whichever method you prefer. I took a crude look at the correlation of the values:
def event_corr(event, event2, path):
    d = []
    for p in path:
        # crude similarity: ratio of the matched event2 value to the corresponding event value
        d.append((event2[p[1]] * event[p[0]]) / event[p[0]]**2)
    return np.mean(d)
print("Our event re-occuring is {:0.2f} correlated with our search event.".format(event_corr(event, event2, path)))
print("Our non-event is {:0.2f} correlated with our search event.".format(event_corr(event, non_event, path2)))
Produces:
Our event re-occurring is 0.85 correlated with our search event.
Our non-event is 0.45 correlated with our search event.
Related
I have data from many sensors, and observations arrive 200 times every second. Now I want to resample at a lower rate to make the dataset manageable computation-wise. But the time column is relative elapsed time, not an absolute date-time; please see the first column below. I want to create an absolute datetime index so that I can easily use the resample() methods for resampling and aggregation at different durations.
Example:
0.000000 1.397081 -0.672387 0.552749
0.005000 2.374832 -0.221770 1.348744
0.010000 3.191852 0.776504 0.044648
0.015000 2.304027 0.188047 0.433253
0.020000 2.331740 -0.000074 0.424112
0.025000 2.869129 0.282714 1.081615
0.030000 3.312915 0.997374 0.456503
0.035000 2.044041 -0.114705 0.993204
I want a method to generate timestamps 200 times a second, starting at the timestamp when this run of the experiment was started, for example 2020/03/14 23:49:19. This will help me build a DatetimeIndex and then resample and aggregate to 10 times a second.
I could find no example at this frequency and granularity after reading the pandas date functionality pages: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamps-vs-time-spans
The real data files are of course extremely big, and confidential, so I cannot post them.
Assuming we have, for example:
df
Out[52]:
t v1 v2 v3
0 0.000 1.397081 -0.672387 0.552749
1 0.005 2.374832 -0.221770 1.348744
2 0.010 3.191852 0.776504 0.044648
3 0.015 2.304027 0.188047 0.433253
4 0.020 2.331740 -0.000074 0.424112
5 0.025 2.869129 0.282714 1.081615
6 0.030 3.312915 0.997374 0.456503
7 0.035 2.044041 -0.114705 0.993204
we can define a start date/time and add the existing time axis as a timedelta (assuming seconds here) and set that as index:
start = pd.Timestamp("2020/03/14 23:49:19")
df.index = pd.DatetimeIndex(start + pd.to_timedelta(df['t'], unit='s'))
df
Out[55]:
t v1 v2 v3
t
2020-03-14 23:49:19.000 0.000 1.397081 -0.672387 0.552749
2020-03-14 23:49:19.005 0.005 2.374832 -0.221770 1.348744
2020-03-14 23:49:19.010 0.010 3.191852 0.776504 0.044648
2020-03-14 23:49:19.015 0.015 2.304027 0.188047 0.433253
2020-03-14 23:49:19.020 0.020 2.331740 -0.000074 0.424112
2020-03-14 23:49:19.025 0.025 2.869129 0.282714 1.081615
2020-03-14 23:49:19.030 0.030 3.312915 0.997374 0.456503
2020-03-14 23:49:19.035 0.035 2.044041 -0.114705 0.993204
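From here, the downsample to 10 samples per second mentioned in the question can be done directly on the new index (the mean aggregation below is just an assumed choice):
# downsample from 200 Hz to 10 Hz, i.e. one row every 100 ms;
# mean() is an assumed aggregation -- swap in sum(), max(), etc. as needed
df_10hz = df.drop(columns="t").resample("100ms").mean()
df_10hz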
I'm trying to get a rolling n-day annualized equity return volatility but am having trouble implementing it. Basically, in the last row (index 10) I would want something that computes np.std(df["log returns"]) * np.sqrt(252) over a rolling n-day window (e.g. indices 6-10 for a 5-day window). If there aren't n values available, leave the cell empty / fill it with np.nan.
Index   log returns   annualized volatility
0        0.01
1       -0.005
2        0.021
3        0.01
4       -0.01
5        0.02
6        0.012
7        0.022
8       -0.001
9       -0.01
10       0.01
I thought about doing this with a while loop, but since I'm working with a lot of data I thought an array-wise operation may be smarter. Unfortunately I can't come up with one for the life of me.
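A minimal sketch of the vectorized version, assuming a 5-day window; note that pandas' rolling std defaults to ddof=1 while np.std defaults to ddof=0, so ddof=0 is passed explicitly to match the formula above:
import numpy as np
import pandas as pd

# example data copied from the table above
df = pd.DataFrame({"log returns": [0.01, -0.005, 0.021, 0.01, -0.01, 0.02,
                                   0.012, 0.022, -0.001, -0.01, 0.01]})

n = 5  # assumed window length
# rolling n-day std of log returns, annualized with sqrt(252);
# the first n-1 rows have fewer than n values and stay NaN
df["annualized volatility"] = df["log returns"].rolling(n).std(ddof=0) * np.sqrt(252)
print(df)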
I will try and explain the problem I am currently having concerning cumulative sums on DataFrames in Python, and hopefully you'll grasp it!
Given a pandas DataFrame df with a column returns as such:
returns
Date
2014-12-10 0.0000
2014-12-11 0.0200
2014-12-12 0.0500
2014-12-15 -0.0200
2014-12-16 0.0000
Applying a cumulative sum on this DataFrame is easy, e.g. just df.cumsum(). But is it possible to apply a cumulative sum every X days (or data points), yielding only the cumulative sum of the last Y days (data points)?
Clarification: Given daily data as above, how do I get the accumulated sum of the last Y days, re-evaluated (from zero) every X days?
Hope it's clear enough,
Thanks,
N
"Every X days" and "every X data points" are very different; the following assumes you really mean the first, since you mention it more frequently.
If the index is a DatetimeIndex, you can resample to a daily frequency, take a rolling sum, and then select only the original dates:
>>> df.resample("1d").mean().rolling(2, min_periods=1).sum().loc[df.index]
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-15 -0.02
2014-12-16 -0.02
or, step by step:
>>> df.resample("1d")
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.05
2014-12-13 NaN
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 0.00
>>> df.resample("1d").mean().rolling(2, min_periods=1).sum()
returns
Date
2014-12-10 0.00
2014-12-11 0.02
2014-12-12 0.07
2014-12-13 0.05
2014-12-14 NaN
2014-12-15 -0.02
2014-12-16 -0.02
The way I would do it is with helper columns. It's a little kludgy but it should work:
numgroups = int(len(df)/(x-1))                              # number of x-row groups needed to cover df
df['groupby'] = sorted(list(range(numgroups))*x)[:len(df)]  # label each consecutive block of x rows
df['mask'] = (([0]*(x-y)+[1]*(y))*numgroups)[:len(df)]      # 1 only for the last y rows of each block
df['masked'] = df.returns*df['mask']                        # zero out everything except those last y rows
df.groupby('groupby').masked.cumsum()                       # cumulative sum restarts with every block
I am not sure if there is a built-in method, but it does not seem very difficult to write one.
For example, here is one for a pandas Series.
def cum(df, interval):
    all = []
    quotient = len(df)//interval
    intervals = range(quotient)
    for i in intervals:
        all.append(df[0:(i+1)*interval].sum())
    return pd.Series(all)
>>> s1 = pd.Series(range(20))
>>> print(cum(s1, 4))
0 6
1 28
2 66
3 120
4 190
dtype: int64
Thanks to @DSM I managed to come up with a variation of his solution that does pretty much what I was looking for:
import numpy as np
import pandas as pd
df.resample("1w"), how={'A': np.sum})
Yields what I want for the example below:
rng = range(1,29)
dates = pd.date_range('1/1/2000', periods=len(rng))
r = pd.DataFrame(rng, index=dates, columns=['A'])
r2 = r.resample("1w", how={'A': np.sum})
Outputs:
>>> print(r)
A
2000-01-01 1
2000-01-02 2
2000-01-03 3
2000-01-04 4
2000-01-05 5
2000-01-06 6
2000-01-07 7
2000-01-08 8
2000-01-09 9
2000-01-10 10
2000-01-11 11
...
2000-01-25 25
2000-01-26 26
2000-01-27 27
2000-01-28 28
>>> print(r2)
A
2000-01-02 3
2000-01-09 42
2000-01-16 91
2000-01-23 140
2000-01-30 130
Even though it doesn't start "one week in" in this case (hence the sum of 3 in the very first row), it always gets the correct rolling sum, starting from the previous date with an initial value of zero.
I want to compute the duration (in weeks) between changes. For example, p is the same for weeks 1, 2, 3 and changes to 1.11 in period 4, so the duration is 3. Right now the duration is computed in a loop ported from R. It works but it is slow. Any suggestion on how to improve this would be greatly appreciated.
raw['duration'] = np.nan
id = raw['unique_id'].unique()
for i in range(0, len(id)):
    pos1 = abs(raw['dp']) > 0
    pos2 = raw['unique_id'] == id[i]
    pos = np.where(pos1 & pos2)[0]
    raw['duration'][pos[0]] = raw['week'][pos[0]] - 1
    for j in range(1, len(pos)):
        raw['duration'][pos[j]] = raw['week'][pos[j]] - raw['week'][pos[j-1]]
The dataframe is raw, and values for a particular unique_id looks like this.
date week p change duration
2006-07-08 27 1.05 -0.07 1
2006-07-15 28 1.05 0.00 NaN
2006-07-22 29 1.05 0.00 NaN
2006-07-29 30 1.11 0.06 3
... ... ... ... ...
2010-06-05 231 1.61 0.09 1
2010-06-12 232 1.63 0.02 1
2010-06-19 233 1.57 -0.06 1
2010-06-26 234 1.41 -0.16 1
2010-07-03 235 1.35 -0.06 1
2010-07-10 236 1.43 0.08 1
2010-07-17 237 1.59 0.16 1
2010-07-24 238 1.59 0.00 NaN
2010-07-31 239 1.59 0.00 NaN
2010-08-07 240 1.59 0.00 NaN
2010-08-14 241 1.59 0.00 NaN
2010-08-21 242 1.61 0.02 5
Computing durations once you have your list in date order is trivial: iterate over the list, keeping track of how long it has been since the last change to p. If the slowness comes from how you get that list, you haven't provided nearly enough info to help with that.
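A minimal sketch of that iterate-and-track idea (assuming the rows for a single unique_id are already sorted by week and that a nonzero change marks a new price):
import numpy as np

durations = []
last_change_week = None
for week, change in zip(raw['week'], raw['change']):
    if change != 0:
        # weeks elapsed since the previous change (NaN for the very first change seen)
        durations.append(np.nan if last_change_week is None else week - last_change_week)
        last_change_week = week
    else:
        durations.append(np.nan)
raw['duration'] = durations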
You can simply get the list of weeks where there is a change, then compute their differences, and finally join those differences back onto your original DataFrame.
weeks = raw.query('change != 0.0')[['week']]
weeks['duration'] = weeks.week.diff()
pd.merge(raw, weeks, on='week', how='left')
raw2 = raw.loc[raw['change'] != 0, ['week', 'unique_id']]
data2 = raw2.groupby('unique_id')
raw2['duration'] = data2['week'].transform(lambda x: x.diff())
raw2.drop('unique_id', axis=1)
raw = pd.merge(raw, raw2, on=['unique_id', 'week'], how='left')
Thank you all. I modified the suggestion and got it to give the same answer as the complicated loop. For 10,000 observations it is not a whole lot faster, but the code seems more compact.
I set no-change rows to NaN because the duration seems to be undefined when no change is made, but zero would work too. With the above code, the NaN is put in automatically by the merge. In any case, I want to compute statistics for the non-change group separately.
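For instance, the two groups could be summarized separately like this (describe() is just one assumed choice of summary):
# summary statistics for the no-change rows vs. the rows where a change occurred
print(raw.loc[raw['change'] == 0, 'p'].describe())
print(raw.loc[raw['change'] != 0, 'duration'].describe())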
I'm fairly new to python so I apologize in advance if this is a rookie mistake. I'm using python 3.4. Here's the problem:
I have a pandas dataframe with a datetimeindex and multiple named columns like so:
>>>df
'a' 'b' 'c'
1949-01-08 42.915 0 1.448
1949-01-09 19.395 0 0.062
1949-01-10 1.077 0.05 0.000
1949-01-11 0.000 0.038 0.000
1949-01-12 0.012 0.194 0.000
1949-01-13 0.000 0 0.125
1949-01-14 0.000 0.157 0.007
1949-01-15 0.000 0.003 0.000
I am trying to extract a subset using both the year from the datetimeindex and a conditional statement on the values:
>>>df['1949':'1980'][df > 0]
'a' 'b' 'c'
1949-01-08 42.915 NaN 1.448
1949-01-09 19.395 NaN 0.062
1949-01-10 1.077 0.05 NaN
1949-01-11 NaN 0.038 NaN
1949-01-12 0.012 0.194 NaN
1949-01-13 NaN NaN 0.125
1949-01-14 NaN 0.157 0.007
1949-01-15 NaN 0.003 NaN
My final goal is to find percentiles of this subset, however np.percentile cannot handle NaNs. I have tried using the dataframe quantile method but there are a couple of missing data points which cause it to drop the whole column. It seems like it would be simple to use a conditional statement to select values without returning NaNs, but I can't seem to find anything that will return a smaller subset without the NaNs. Any help or suggestions would be much appreciated. Thanks!
I don't know exactly what result you expect.
You can use df >= 0 to keep 0 in columns.
df['1949':'1980'][df >= 0]
You can use .fillna(0) to change NaN into 0
df['1949':'1980'][df > 0].fillna(0)
You can use .dropna() to remove rows with any NaN - but this way you will probably get an empty result.
df['1949':'1980'][df > 0].dropna()
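Since the end goal is percentiles, another option is to compute them in a way that simply ignores the NaN values (the 90th percentile below is just an example):
import numpy as np

subset = df['1949':'1980'][df > 0]

# per-column percentile that skips NaN
print(np.nanpercentile(subset, 90, axis=0))

# or pool all non-NaN values together first
values = subset.values
print(np.percentile(values[~np.isnan(values)], 90))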