How to generate datetimeindex for 200 observations per second? - python

I have data from many sensors, and observations arrive 200 times per second. I want to resample at a lower rate to make the dataset manageable computation-wise. However, the time column is relative elapsed time rather than an absolute datetime; please see the first column below. I want to create an index in absolute datetime so that I can easily use the resample() methods for resampling and aggregation over different durations.
Example:
0.000000 1.397081 -0.672387 0.552749
0.005000 2.374832 -0.221770 1.348744
0.010000 3.191852 0.776504 0.044648
0.015000 2.304027 0.188047 0.433253
0.020000 2.331740 -0.000074 0.424112
0.025000 2.869129 0.282714 1.081615
0.030000 3.312915 0.997374 0.456503
0.035000 2.044041 -0.114705 0.993204
I want a method to generate timestamps at 200 per second, starting from the timestamp at which this experiment run was started, for example 2020/03/14 23:49:19. This will let me build a DatetimeIndex and then resample and aggregate the data down to 10 samples per second.
I could find no example at this frequency and granularity after reading the date functionality pages of the pandas user guide: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamps-vs-time-spans
The real data files are of course extremely big, and confidential, so I cannot post them.

Assuming we have, for example:
df
Out[52]:
t v1 v2 v3
0 0.000 1.397081 -0.672387 0.552749
1 0.005 2.374832 -0.221770 1.348744
2 0.010 3.191852 0.776504 0.044648
3 0.015 2.304027 0.188047 0.433253
4 0.020 2.331740 -0.000074 0.424112
5 0.025 2.869129 0.282714 1.081615
6 0.030 3.312915 0.997374 0.456503
7 0.035 2.044041 -0.114705 0.993204
we can define a start date/time and add the existing time axis as a timedelta (assuming seconds here) and set that as index:
start = pd.Timestamp("2020/03/14 23:49:19")
df.index = pd.DatetimeIndex(start + pd.to_timedelta(df['t'], unit='s'))
df
Out[55]:
t v1 v2 v3
t
2020-03-14 23:49:19.000 0.000 1.397081 -0.672387 0.552749
2020-03-14 23:49:19.005 0.005 2.374832 -0.221770 1.348744
2020-03-14 23:49:19.010 0.010 3.191852 0.776504 0.044648
2020-03-14 23:49:19.015 0.015 2.304027 0.188047 0.433253
2020-03-14 23:49:19.020 0.020 2.331740 -0.000074 0.424112
2020-03-14 23:49:19.025 0.025 2.869129 0.282714 1.081615
2020-03-14 23:49:19.030 0.030 3.312915 0.997374 0.456503
2020-03-14 23:49:19.035 0.035 2.044041 -0.114705 0.993204
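With that index in place, downsampling to 10 samples per second is a single resample call. A minimal sketch (mean is just one possible aggregation; pick whatever suits the data):

# Drop the now-redundant relative time column and aggregate into 100 ms bins,
# i.e. 10 samples per second (sum, max, etc. work the same way)
df_10hz = df.drop(columns='t').resample('100ms').mean()

# If the sampling is known to be exactly uniform, an equivalent index could also
# be built directly at a 5 ms frequency:
# df.index = pd.date_range(start, periods=len(df), freq='5ms')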

Related

Rolling calculation across a column - array-wise

I'm trying to get a rolling n-day annualized equity return volatility but am having trouble implementing it. Basically, I would want the last row (index 10) to contain something like np.std(df["log returns"]) * np.sqrt(252) computed over a rolling n-day window (e.g. indices 6-10 for a 5-day window). If there aren't n values available, leave it empty / fill with np.nan.
Index   log returns   annualized volatility
0        0.01
1       -0.005
2        0.021
3        0.01
4       -0.01
5        0.02
6        0.012
7        0.022
8       -0.001
9       -0.01
10       0.01
I thought about doing this with a while loop, but since I'm working with a lot of data, an array-wise operation seemed smarter. Unfortunately I can't come up with one for the life of me.
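A sketch of an array-wise approach using pandas' built-in rolling window (assuming a DataFrame df with a "log returns" column and, for illustration, a 5-day window; ddof=0 makes the rolling std match np.std's default):

import numpy as np
import pandas as pd

df = pd.DataFrame({"log returns": [0.01, -0.005, 0.021, 0.01, -0.01, 0.02,
                                   0.012, 0.022, -0.001, -0.01, 0.01]})

n = 5  # window length in days (an assumption for illustration)

# Rolling standard deviation of the last n returns, annualized with sqrt(252);
# rows with fewer than n values available are left as NaN automatically
df["annualized volatility"] = df["log returns"].rolling(window=n).std(ddof=0) * np.sqrt(252)
print(df)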

Extract a pattern from a time series

I have the following dataset, a Pandas dataframe:
Score min max Date
Loc
0 2.757 0.000 2.757 2020-07-04 11:00:00
3 2.723 2.723 0.000 2020-07-04 14:00:00
8 2.724 2.724 0.000 2020-07-04 19:00:00
11 2.752 0.000 2.752 2020-07-04 22:00:00
13 2.742 2.742 0.000 2020-07-05 00:00:00
15 2.781 0.000 2.781 2020-07-05 02:00:00
18 2.758 2.758 0.000 2020-07-05 05:00:00
20 2.865 0.000 2.865 2020-07-05 07:00:00
24 2.832 0.000 2.832 2020-07-05 11:00:00
25 2.779 2.779 0.000 2020-07-05 12:00:00
29 2.775 2.775 0.000 2020-07-05 16:00:00
34 2.954 0.000 2.954 2020-07-05 21:00:00
37 2.886 2.886 0.000 2020-07-06 00:00:00
48 3.101 0.000 3.101 2020-07-06 11:00:00
53 3.012 3.012 0.000 2020-07-06 16:00:00
55 3.068 0.000 3.068 2020-07-06 18:00:00
61 2.970 2.970 0.000 2020-07-07 00:00:00
64 3.058 0.000 3.058 2020-07-07 03:00:00
Where:
Score is a very basic trend; min and max are the local minima and maxima of Score.
Loc is the x-axis position of that row, and Date is the date of that row on the chart.
This is what the data looks like when plotted:
I'm trying to detect the data in the red box programmatically, so that I can detect the same pattern in other datasets. Basically, I'm looking for a way to define that piece of data in code, so that it can be detected in other data.
So far I have only managed to mark the local maxima and minima (the yellow and red points) on the chart. I can also describe the pattern in my own words; I just need to express it in code:
Define when a local minimum/maximum is very distant (i.e. has a much higher value) from the previous local minimum/maximum.
After that, find when the local minima and maxima are very close to each other and their values do not differ much. In short, when a strong increase is followed by a range where the score doesn't go up or down a lot.
I hope the question is clear enough; I can give more details if needed. I don't know if this is doable with NumPy or any other library.
I think dynamic time warping (dtw) might work for you. I have used it for something similar. Essentially it allows you to evaluate time series similarity.
Here are the python implementations I know of:
fastdtw
dtw
dtw-python
Here is a decent explanation of how it works
Towards Data Science Explanation of DTW
You could use it to compare how similar the incoming time series is to the data in your red box.
For example:
import numpy as np
from fastdtw import fastdtw

# Event we're looking for
event = np.array([10, 100, 50, 60, 50, 70])
# A matching event occurring
event2 = np.array([0, 7, 12, 4, 11, 100, 51, 62, 53, 72])
# A non-matching event
non_event = np.array([0, 5, 10, 5, 10, 20, 30, 20, 11, 9])

distance, path = fastdtw(event, event2)
distance2, path2 = fastdtw(event, non_event)
This produces a set of index pairs along which the two time series are best matched. At that point you can evaluate the match with whichever method you prefer; I did a crude look at the correlation of the matched values:
def event_corr(event, event2, path):
    d = []
    for p in path:
        d.append((event2[p[1]] * event[p[0]]) / event[p[0]]**2)
    return np.mean(d)

print("Our event re-occurring is {:0.2f} correlated with our search event.".format(event_corr(event, event2, path)))
print("Our non-event is {:0.2f} correlated with our search event.".format(event_corr(event, non_event, path2)))
Produces:
Our event re-occurring is 0.85 correlated with our search event.
Our non-event is 0.45 correlated with our search event.

Python - fastest way to populate a dataframe with a condition based on an index in another dataframe

I have data in an input dataframe (input_df). Based on an index in another benchmark dataframe (bm_df), I would like to create a third dataframe (output_df) that is populated based on a condition using the indices in the original two dataframes.
For each date in the bm_df index I would like to populate the output using the latest data available in input_df, subject to the condition that the data's index date is before or equal to the date in bm_df. For example, in the sample data below, the output for the first index date (2019-01-21) would be populated with the input_df datapoint for 2019-01-21; however, if a datapoint for 2019-01-21 did not exist, it would use the one for 2019-01-18.
The use case here is mapping and backfilling large datasets with the latest data available for a given date. I have written some Python that does this (and it works), but I think there is probably a more pythonic, and therefore faster, way to implement the solution. The underlying dataset this is applied to is large in both the number of columns and their length, so I would like something as efficient as possible; my current solution is too slow when run on the full dataset.
Any help is much appreciated!
input_df:
index data
2019-01-21 0.008
2019-01-18 0.016
2019-01-17 0.006
2019-01-16 0.01
2019-01-15 0.013
2019-01-14 0.017
2019-01-11 0.017
2019-01-10 0.024
2019-01-09 0.032
2019-01-08 0.012
bm_df:
index
2019-01-21
2019-01-14
2019-01-07
output_df:
index data
2019-01-21 0.008
2019-01-14 0.017
2019-01-07 NaN
Please see the code I am currently using below:
import pandas as pd
import numpy as np
# Import datasets
test_index = ['2019-01-21','2019-01-18','2019-01-17','2019-01-16','2019-01-15','2019-01-14','2019-01-11','2019-01-10','2019-01-09','2019-01-08']
test_data = [0.008, 0.016,0.006,0.01,0.013,0.017,0.017,0.024,0.032,0.012]
input_df= pd.DataFrame(test_data,columns=['data'], index=test_index)
test_index_2= ['2019-01-21','2019-01-14','2019-01-07']
bm_df= pd.DataFrame(index=test_index_2)
#Preallocate
data_mat= np.zeros([len(bm_df)])
# Loop over the bm_df index and find the most recent value from input_df dated on or before the index date
for i in range(len(bm_df)):
    # First check whether there are any dates on or before the selected date; if not, fill with NaN
    if sum(input_df.index <= bm_df.index[i]) > 0:
        data_mat[i] = input_df['data'][max(input_df.index[input_df.index <= bm_df.index[i]])]
    else:
        data_mat[i] = float('NaN')

output_df = pd.DataFrame(data_mat, columns=['data'], index=bm_df.index)
I have not tested the execution time, but I would rely on join, which the pandas documentation describes as efficient:
... Efficiently join multiple DataFrame objects by index at once...
And I would use shift to get the value for the latest date before the searched one.
All that give:
output_df = bm_df.join(input_df.shift(-1), how='left')
data
2019-01-21 0.016
2019-01-14 0.017
2019-01-07 NaN
This approach is indeed far less versatile than explicit loops; that is the price of pandas vectorization. For example, for a less-than-or-equal-to condition the code is slightly different. Here is an example with an additional date in bm_df that is not present in input_df:
...
test_index_2= ['2019-01-21','2019-01-14','2019-01-13','2019-01-07']
...
tmp_df = input_df.join(bm_df, how='outer').fillna(method='bfill')  # outer join so dates only in bm_df get rows, then backfill
output_df = bm_df.join(tmp_df, how='inner')
And we obtain as expected:
data
2019-01-21 0.008
2019-01-14 0.017
2019-01-13 0.017
2019-01-07 0.012
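An alternative worth sketching here (not part of the original answer) is pd.merge_asof, which does this "latest value on or before the date" lookup in a single vectorized call. A minimal sketch, assuming the string dates are converted to real datetimes and both frames are sorted ascending, as merge_asof requires:

import pandas as pd

test_index = ['2019-01-21', '2019-01-18', '2019-01-17', '2019-01-16', '2019-01-15',
              '2019-01-14', '2019-01-11', '2019-01-10', '2019-01-09', '2019-01-08']
test_data = [0.008, 0.016, 0.006, 0.01, 0.013, 0.017, 0.017, 0.024, 0.032, 0.012]
input_df = pd.DataFrame({'date': pd.to_datetime(test_index), 'data': test_data})
bm_df = pd.DataFrame({'date': pd.to_datetime(['2019-01-21', '2019-01-14', '2019-01-07'])})

# For each bm_df date, take the most recent input_df row whose date is <= it;
# dates with no earlier data come back as NaN
output_df = (
    pd.merge_asof(bm_df.sort_values('date'), input_df.sort_values('date'),
                  on='date', direction='backward')
      .set_index('date')
      .sort_index(ascending=False)
)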

Group rows in fixed duration windows that satisfy multiple conditions

I have a df as below. Consider df to be indexed by timestamps with dtype='datetime64[ns]', e.g. 1970-01-01 00:00:27.603046999; I am using dummy timestamps here.
Timestamp Address Type Arrival_Time Time_Delta
0.1 2 A 0.25 0.15
0.4 3 B 0.43 0.03
0.9 1 B 1.20 0.20
1.3 1 A 1.39 0.09
1.5 3 A 1.64 0.14
1.7 3 B 1.87 0.17
2.0 3 A 2.09 0.09
2.1 1 B 2.44 0.34
I have three unique "addresses" (1, 2, 3).
I have two unique "types" (A, B).
Now I am trying to do two things in a simple way (possibly using pd.Grouper and groupby in pandas):
I want to group rows into fixed bins of 1-second duration (using the timestamp values). Then, in each 1-second bin, for each "address", find the mean and sum of "Time_Delta" only where "Type" == A.
I want to group rows into fixed bins of 1-second duration (using the timestamp values). Then, in each bin, for each "address", find the mean and sum of the inter-arrival time (IAT).
IAT = Arrival Time(i) - Arrival Time(i-1)
Note: if the timestamp duration/length is 100 seconds, the output dataframe should have exactly 100 rows and six columns, i.e. two (mean, sum) for each address.
For Problem 1:
I tried the following code:
df = pd.DataFrame({'Timestamp': Timestamp, 'Address': Address,
                   'Type': Type, 'Arrival_Time': Arrival_time, 'Time_Delta': Time_delta})
# Set index to Datetime
index = pd.DatetimeIndex(df['Timestamp'] * 10**9)  # convert seconds to a nanosecond-based DatetimeIndex
df = df.set_index(index)                           # set timestamp as index
df_1 = df['Time_Delta'].groupby([pd.Grouper(freq='1S'), df['Address']]).mean().unstack(fill_value=0)
which gives results:
Timestamp 1 2 3
1970-01-01 00:00:00 0.20 0.15 0.030
1970-01-01 00:00:01 0.09 0.00 0.155
1970-01-01 00:00:02 0.34 0.00 0.090
As you can see, this gives the mean Time_Delta for each address in each 1-second bin, but I want to add the second condition, i.e. find the mean for each address only where Type == A. I hope problem 1 is now clear.
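A sketch of one way to add that condition (not from the original post): filter on Type before grouping, then aggregate exactly as above.

# Keep only Type == 'A' rows, then take mean and sum of Time_Delta
# per 1-second bin and per address
df_a = df[df['Type'] == 'A']
stats_a = (
    df_a.groupby([pd.Grouper(freq='1S'), 'Address'])['Time_Delta']
        .agg(['mean', 'sum'])
        .unstack(fill_value=0)
)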
For Problem 2:
Its a bit complicated. I want to do get Mean IAT for each address in the same format (See below):
One possible way is to add an extra column to the original df, e.g. df['IAT'], along the lines of:
df['IAT'] = 0.0
for i in range(1, len(df)):
    df.loc[df.index[i], 'IAT'] = df['Arrival_Time'].iloc[i] - df['Arrival_Time'].iloc[i - 1]
Then apply the same code as above to find the mean IAT for each address where Type == A.
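A vectorized sketch of the same idea (assuming, as in the loop above, that IAT is the difference between consecutive Arrival_Time values over the whole frame):

# diff() is the vectorized equivalent of the loop above; the first row becomes NaN
df['IAT'] = df['Arrival_Time'].diff()

# Mean and sum of IAT per address in 1-second bins, analogous to problem 1
iat_stats = (
    df.groupby([pd.Grouper(freq='1S'), 'Address'])['IAT']
      .agg(['mean', 'sum'])
      .unstack(fill_value=0)
)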
Actual Data
Timestamp Address Type Time Delta Arrival Time
1970-01-01 00:00:00.000000000 28:5a:ec:16:00:22 Control frame 0.000000 Nov 10, 2017 22:39:20.538561000
1970-01-01 00:00:00.000287000 28:5a:ec:16:00:23 Data frame 0.000287 Nov 10, 2017 22:39:20.548121000
1970-01-01 00:00:00.000896000 28:5a:ec:16:00:22 Control frame 0.000609 Nov 10, 2017 22:39:20.611256000
1970-01-01 00:00:00.001388000 28:5a:ec:16:00:21 Data frame 0.000492 Nov 10, 2017 22:39:20.321745000
... ...

Returning a subset of a dataframe using a conditional statement

I'm fairly new to python so I apologize in advance if this is a rookie mistake. I'm using python 3.4. Here's the problem:
I have a pandas dataframe with a datetimeindex and multiple named columns like so:
>>>df
'a' 'b' 'c'
1949-01-08 42.915 0 1.448
1949-01-09 19.395 0 0.062
1949-01-10 1.077 0.05 0.000
1949-01-11 0.000 0.038 0.000
1949-01-12 0.012 0.194 0.000
1949-01-13 0.000 0 0.125
1949-01-14 0.000 0.157 0.007
1949-01-15 0.000 0.003 0.000
I am trying to extract a subset using both the year from the datetimeindex and a conditional statement on the values:
>>>df['1949':'1980'][df > 0]
'a' 'b' 'c'
1949-01-08 42.915 NaN 1.448
1949-01-09 19.395 NaN 0.062
1949-01-10 1.077 0.05 NaN
1949-01-11 NaN 0.038 NaN
1949-01-12 0.012 0.194 NaN
1949-01-13 NaN NaN 0.125
1949-01-14 NaN 0.157 0.007
1949-01-15 NaN 0.003 NaN
My final goal is to find percentiles of this subset, but np.percentile cannot handle NaNs. I have tried the dataframe quantile method, but a couple of missing data points cause it to drop whole columns. It seems like it should be simple to use a conditional statement to select values without returning NaNs, but I can't find anything that returns a smaller subset without the NaNs. Any help or suggestions would be much appreciated. Thanks!
I don't know exactly what result you expect.
You can use df >= 0 to keep the 0 values in the columns:
df['1949':'1980'][df >= 0]
You can use .fillna(0) to change NaN into 0:
df['1949':'1980'][df > 0].fillna(0)
You can use .dropna() to remove rows containing any NaN, but this way you will probably get an empty result:
df['1949':'1980'][df > 0].dropna()
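One possible route to the percentile goal (a sketch, not from the original answer): np.nanpercentile ignores NaN, and DataFrame.stack() drops NaN by default, so the NaNs left by the mask never need to be filled or dropped row-wise.

import numpy as np

subset = df['1949':'1980'][df > 0]

# Per-column 90th percentile, ignoring the NaNs left by the mask
p90_per_column = np.nanpercentile(subset, 90, axis=0)

# 90th percentile over all remaining values pooled together
p90_overall = np.percentile(subset.stack(), 90)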
