Pandas accumulate time consecutively as long as condition is true - python

I wish to compute a time duration, i.e. accumulate the time diffs for as long as "state" == 1 is active, and output 'off' otherwise.
timestamp state
2020-01-01 00:00:00 0
2020-01-01 00:00:01 0
2020-01-01 00:00:02 0
2020-01-01 00:00:03 1
2020-01-01 00:00:04 1
2020-01-01 00:00:05 1
2020-01-01 00:00:06 1
2020-01-01 00:00:07 0
2020-01-01 00:00:08 0
2020-01-01 00:00:09 0
2020-01-01 00:00:10 0
2020-01-01 00:00:11 1
2020-01-01 00:00:12 1
2020-01-01 00:00:13 1
2020-01-01 00:00:14 1
2020-01-01 00:00:15 1
2020-01-01 00:00:16 1
2020-01-01 00:00:17 0
2020-01-01 00:00:18 0
2020-01-01 00:00:19 0
2020-01-01 00:00:20 0
Based on a similar question, I tried something with groupby; however, the code does not stop computing the time diff when "state" == 0.
I also tried to apply a lambda function (commented out below), but an error pops up saying "KeyError: ('state', 'occurred at index timestamp')".
Any idea how to do this better?
import numpy as np
import pandas as pd

dt = pd.date_range('2020-01-01', '2020-01-01 00:00:20', freq='1s')
s = [0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0]
df = pd.DataFrame({'timestamp': dt, 'state': s})
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')
df['tdiff'] = (df.groupby('state').diff().timestamp.values / 60)
#df['tdiff'] = df.apply(lambda x: x['timestamp'].diff().state.values/60 if x['state'] == 1 else 'off')
The desired output should be:
timestamp state tdiff accum.
2020-01-01 00:00:00 0 off 0
2020-01-01 00:00:01 0 off 0
2020-01-01 00:00:02 0 off 0
2020-01-01 00:00:03 1 nan 0
2020-01-01 00:00:04 1 1.0 1.0
2020-01-01 00:00:05 1 1.0 2.0
2020-01-01 00:00:06 1 1.0 3.0
2020-01-01 00:00:07 0 off 0
2020-01-01 00:00:08 0 off 0
2020-01-01 00:00:09 0 off 0
2020-01-01 00:00:10 0 off 0
2020-01-01 00:00:11 1 nan 0
2020-01-01 00:00:12 1 1.0 1.0
2020-01-01 00:00:13 1 1.0 2.0
2020-01-01 00:00:14 1 1.0 3.0
2020-01-01 00:00:15 1 1.0 4.0
2020-01-01 00:00:16 1 1.0 5.0

You can do this with groupby, using cumsum to build the additional group key:
# keep only the state==1 rows; the cumsum of state==0 is constant within each run of ones
g = df.loc[df['state'].ne(0)].groupby(df['state'].eq(0).cumsum())['timestamp']
s1 = g.diff().dt.total_seconds()                              # per-row time diff
s2 = g.apply(lambda x: x.diff().dt.total_seconds().cumsum())  # running total per run
df['tdiff'] = 'off'
df.loc[df['state'].ne(0), 'tdiff'] = s1
df['accum'] = s2
# notice I did not fillna with 0; you can do it with df['accum'].fillna(0, inplace=True)
df
Out[53]:
timestamp state tdiff accum
0 2020-01-01 00:00:00 0 off NaN
1 2020-01-01 00:00:01 0 off NaN
2 2020-01-01 00:00:02 0 off NaN
3 2020-01-01 00:00:03 1 NaN NaN
4 2020-01-01 00:00:04 1 1 1.0
5 2020-01-01 00:00:05 1 1 2.0
6 2020-01-01 00:00:06 1 1 3.0
7 2020-01-01 00:00:07 0 off NaN
8 2020-01-01 00:00:08 0 off NaN
9 2020-01-01 00:00:09 0 off NaN
10 2020-01-01 00:00:10 0 off NaN
11 2020-01-01 00:00:11 1 NaN NaN
12 2020-01-01 00:00:12 1 1 1.0
13 2020-01-01 00:00:13 1 1 2.0
14 2020-01-01 00:00:14 1 1 3.0
15 2020-01-01 00:00:15 1 1 4.0
16 2020-01-01 00:00:16 1 1 5.0
17 2020-01-01 00:00:17 0 off NaN
18 2020-01-01 00:00:18 0 off NaN
19 2020-01-01 00:00:19 0 off NaN
20 2020-01-01 00:00:20 0 off NaN
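
For reference, a fully vectorized sketch that avoids the groupby.apply (my own variant; the accum2 name is just for illustration): take the row-wise diff once, blank it outside runs of ones and at each run's first row, then cumsum within each run.
run_id = df['state'].eq(0).cumsum()                # constant within each run of ones
tdiff = df['timestamp'].diff().dt.total_seconds()
tdiff = tdiff.where(df['state'].eq(1) & df['state'].shift().eq(1))  # NaN at run starts and off rows
df['accum2'] = tdiff.groupby(run_id).cumsum()      # pandas cumsum skips the NaNs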

Related

Add missing timestamps for each different ID in dataframe

I have two dataframes (simple examples shown below):
df1:
time column
2022-01-01 00:00:00
2022-01-01 00:15:00
2022-01-01 00:30:00
2022-01-01 00:45:00
2022-01-02 00:00:00
2022-01-02 00:15:00
2022-01-02 00:30:00
2022-01-02 00:45:00
df2:
time column          ID column  Value
2022-01-01 00:00:00  1          10
2022-01-01 00:30:00  1          9
2022-01-02 00:30:00  1          5
2022-01-02 00:45:00  1          15
2022-01-01 00:00:00  2          6
2022-01-01 00:15:00  2          2
2022-01-02 00:45:00  2          7
df1 contains every timestamp I am interested in. df2 contains data sorted by timestamp and ID. What I need to do is add, for each unique ID, every timestamp from df1 that is missing in df2, with zero in the value column.
This is the outcome I'm interested in:
df3
time column ID column Value
2022-01-01 00:00:00 1 10
2022-01-01 00:15:00 1 0
2022-01-01 00:30:00 1 9
2022-01-01 00:45:00 1 0
2022-01-02 00:00:00 1 0
2022-01-02 00:15:00 1 0
2022-01-02 00:30:00 1 5
2022-01-02 00:45:00 1 15
2022-01-01 00:00:00 2 6
2022-01-01 00:15:00 2 2
2022-01-01 00:30:00 2 0
2022-01-01 00:45:00 2 0
2022-01-02 00:00:00 2 0
2022-01-02 00:15:00 2 0
2022-01-02 00:30:00 2 0
2022-01-02 00:45:00 2 7
My df2 is much larger (hundreds of thousands of rows, and more than 500 unique IDs), so doing this manually isn't feasible. I've searched for hours for something that could help, but everything has fallen flat. This data will ultimately be fed into a NN.
I am open to other libraries and can work in python or R.
Any help is greatly appreciated.
Try:
x = (
    df2.groupby("ID column")
    .apply(lambda x: x.merge(df1, how="outer").fillna(0))
    .drop(columns="ID column")
    .droplevel(1)
    .reset_index()
    .sort_values(by=["ID column", "time column"])
)
print(x)
Prints:
ID column time column Value
0 1 2022-01-01 00:00:00 10.0
4 1 2022-01-01 00:15:00 0.0
1 1 2022-01-01 00:30:00 9.0
5 1 2022-01-01 00:45:00 0.0
6 1 2022-01-02 00:00:00 0.0
7 1 2022-01-02 00:15:00 0.0
2 1 2022-01-02 00:30:00 5.0
3 1 2022-01-02 00:45:00 15.0
8 2 2022-01-01 00:00:00 6.0
9 2 2022-01-01 00:15:00 2.0
11 2 2022-01-01 00:30:00 0.0
12 2 2022-01-01 00:45:00 0.0
13 2 2022-01-02 00:00:00 0.0
14 2 2022-01-02 00:15:00 0.0
15 2 2022-01-02 00:30:00 0.0
10 2 2022-01-02 00:45:00 7.0
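
An alternative sketch without the per-group apply (assuming the column names above): build the full (ID, time) grid with MultiIndex.from_product and reindex df2 against it.
full = pd.MultiIndex.from_product(
    [df2["ID column"].unique(), df1["time column"]],
    names=["ID column", "time column"],
)
df3 = (
    df2.set_index(["ID column", "time column"])
       .reindex(full, fill_value=0)   # missing (ID, time) pairs get Value = 0
       .reset_index()
)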

pandas: evaluate if condition is met for consecutive data points in a given timeframe

I want to evaluate whether a given condition (e.g. a threshold) is met for a certain duration in a pandas dataframe, and set an output value accordingly.
E.g. set the output to 1 if data > threshold for at least the next 45 min, and back to 0 once data < threshold.
What works so far (for treshold = 3 and a minimum duration of 45 min):
import pandas as pd
import random, math

df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00','2020-01-01 06:00:00',freq='15T')})
treshold = 3
data = []
for i in range(0,df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')
timestep = df.index.to_series().diff().dt.seconds.div(3600,fill_value=None)[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)
output = []
for i,e in enumerate(data):
    if i > (len(data) - cell_range):
        futures = data[i:len(data)]
    else:
        futures = data[i:i + cell_range]
    if i == 0:
        last = 0
    else:
        last = output[i-1]
    current = data[i]
    if (min(futures) > treshold or (last > 0 and current > treshold)):
        output.append(1)
    else:
        output.append(0)
df['output'] = output
result:
data output
dt
2020-01-01 00:00:00 1 0
2020-01-01 00:15:00 1 0
2020-01-01 00:30:00 5 1
2020-01-01 00:45:00 6 1
2020-01-01 01:00:00 7 1
2020-01-01 01:15:00 0 0
2020-01-01 01:30:00 4 0
2020-01-01 01:45:00 5 0
2020-01-01 02:00:00 0 0
2020-01-01 02:15:00 10 1
2020-01-01 02:30:00 5 1
2020-01-01 02:45:00 9 1
2020-01-01 03:00:00 6 1
2020-01-01 03:15:00 6 1
2020-01-01 03:30:00 4 1
2020-01-01 03:45:00 10 1
2020-01-01 04:00:00 6 1
2020-01-01 04:15:00 5 1
2020-01-01 04:30:00 0 0
2020-01-01 04:45:00 8 1
2020-01-01 05:00:00 9 1
2020-01-01 05:15:00 5 1
2020-01-01 05:30:00 9 1
2020-01-01 05:45:00 6 1
2020-01-01 06:00:00 3 0
However, I'm wondering if there is an easier (and more efficient) way to do this with python/pandas?
I found a solution which seems to work, using .rolling and .shift.
import pandas as pd
import numpy as np
import random, math
df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00','2020-01-01 06:00:00',freq='15T')})
treshold = 3
data = []
for i in range(0,df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')
timestep = df.index.to_series().diff().dt.seconds.div(3600,fill_value=None)[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)
df['above_treshold'] = np.where(df['data'] > treshold, df['data'], 0)
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=cell_range)
rolling_min_fwd = df['above_treshold'].rolling(window=indexer).min()
rolling_min_bwd = df['above_treshold'].rolling(window=cell_range).min()
shifted_fwd = df['above_treshold'].shift(1)
shifted_bwd = df['above_treshold'].shift(-1)
start_condition = ((rolling_min_fwd > 0) & ((df['above_treshold'] - shifted_fwd) == df['above_treshold']))
stop_condition = ((rolling_min_bwd > 0) & ((df['above_treshold'] - shifted_bwd) == df['above_treshold']))
cycles = start_condition.cumsum()
idx = cycles == stop_condition.shift(1).cumsum()
cycles.loc[idx] = 0
df['output'] = np.where(cycles > 0, 1, 0)
resulting in:
data above_treshold output
dt
2020-01-01 00:00:00 8 8 0
2020-01-01 00:15:00 3 0 0
2020-01-01 00:30:00 3 0 0
2020-01-01 00:45:00 1 0 0
2020-01-01 01:00:00 3 0 0
2020-01-01 01:15:00 9 9 1
2020-01-01 01:30:00 4 4 1
2020-01-01 01:45:00 8 8 1
2020-01-01 02:00:00 6 6 1
2020-01-01 02:15:00 4 4 1
2020-01-01 02:30:00 6 6 1
2020-01-01 02:45:00 6 6 1
2020-01-01 03:00:00 1 0 0
2020-01-01 03:15:00 6 6 0
2020-01-01 03:30:00 7 7 0
2020-01-01 03:45:00 0 0 0
2020-01-01 04:00:00 2 0 0
2020-01-01 04:15:00 8 8 1
2020-01-01 04:30:00 8 8 1
2020-01-01 04:45:00 9 9 1
2020-01-01 05:00:00 1 0 0
2020-01-01 05:15:00 9 9 1
2020-01-01 05:30:00 10 10 1
2020-01-01 05:45:00 5 5 1
2020-01-01 06:00:00 8 8 1
Couldn't measure a significant performance impact (working on DataFrames with > 35k data points), but it is still better than iterating over each data point (though less intuitive).
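
For what it's worth, a shorter run-length sketch of the same idea (my own variant; it deliberately ignores the end-of-series edge case where the loop above accepts windows shorter than cell_range):
above = df['data'] > treshold
runs = (above != above.shift()).cumsum()                          # label each contiguous run
long_enough = above.groupby(runs).transform('size') >= cell_range
df['output2'] = (above & long_enough).astype(int)                 # 1 only inside long runs above the threshold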

Identify continuous sequences or groups of boolean data in Pandas

I have a boolean time-based data set, as per the example below. I am interested in highlighting continuous sequences of more than three 1's in the data set, and I would like to capture this in a new column called [Continuous_out_x]. Is there an efficient operation to do this?
I generated test data in this way:
df = pd.DataFrame(
    zip(list(np.random.randint(2, size=20)), list(np.random.randint(2, size=20))),
    columns=['tag1', 'tag2'],
    index=pd.date_range('2020-01-01', periods=20, freq='s'),
)
The output I got was the following:
print (df):
tag1 tag2
2020-01-01 00:00:00 0 0
2020-01-01 00:00:01 1 0
2020-01-01 00:00:02 1 0
2020-01-01 00:00:03 1 1
2020-01-01 00:00:04 1 0
2020-01-01 00:00:05 1 0
2020-01-01 00:00:06 1 1
2020-01-01 00:00:07 0 1
2020-01-01 00:00:08 0 0
2020-01-01 00:00:09 1 1
2020-01-01 00:00:10 1 0
2020-01-01 00:00:11 0 1
2020-01-01 00:00:12 1 0
2020-01-01 00:00:13 0 1
2020-01-01 00:00:14 0 1
2020-01-01 00:00:15 0 1
2020-01-01 00:00:16 1 1
2020-01-01 00:00:17 0 0
2020-01-01 00:00:18 0 1
2020-01-01 00:00:19 1 0
A solution to this example set (above) would look like this:
print(df):
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
You can do this as follows:
1. create a series that distinguishes each streak (group)
2. assign a bool to groups with more than three rows
code
# ok to loop over a few columns, still very performant
for col in ["tag1", "tag2"]:
    col_no = col[-1]
    df[f"group_{col}"] = np.cumsum(df[col].shift(1) != df[col])
    # streak longer than three rows, counting only streaks of ones
    df[f"{col}_counts"] = (df.groupby(f"group_{col}")[col].transform("count") > 3) & df[col].eq(1)
    df[f"Continuous_out_{col_no}"] = df[f"{col}_counts"].astype(int)
    df = df.drop(columns=[f"group_{col}", f"{col}_counts"])
output
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
You can identify the regions of contiguous True/False and check if they are greater than your cutoff.
for colname, series in df.items():
    new = f'Continuous_{colname}'
    df[new] = series.diff().ne(0).cumsum()               # label contiguous regions
    df[new] = series.groupby(df[new]).transform('size')  # get size of region
    df[new] = df[new].gt(3) * series                     # mark with cutoff
Output
tag1 tag2 Continuous_tag1 Continuous_tag2
index
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0

How can I select values in one column based on a condition in another using python?

Is there a way to filter rows if column2 has all zeroes 10 minutes ahead of the current value in column1? How can I do this while keeping the datetime index?
2020-01-01 00:01:00 60 0
2020-01-01 00:02:00 70 0
2020-01-01 00:03:00 80 0
2020-01-01 00:04:00 70 0
2020-01-01 00:05:00 60 0
2020-01-01 00:06:00 60 0
2020-01-01 00:07:00 70 0
2020-01-01 00:08:00 80 0
2020-01-01 00:09:00 80 2
2020-01-01 00:10:00 80 0
2020-01-01 00:11:00 70 0
2020-01-01 00:12:00 70 0
2020-01-01 00:13:00 50 0
2020-01-01 00:14:00 50 0
2020-01-01 00:15:00 60 0
2020-01-01 00:16:00 60 0
2020-01-01 00:17:00 70 0
2020-01-01 00:18:00 70 0
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
2020-01-01 00:21:00 80 1
2020-01-01 00:22:00 90 2
Expected output
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
I figured it out. It's actually simple.
input['col3'] = input['col2'].rolling(10).sum()   # sum over the current row and the previous 9
output = input.loc[input['col3'] == 0]
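Note that rolling(10) looks backward over the previous rows. If you truly want to look ahead, a sketch with a forward-looking window (assuming the 1-minute spacing and column names above):
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=10)
input['col3'] = input['col2'].rolling(indexer).sum()   # sums the current and next 9 rows
output = input.loc[input['col3'] == 0]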
Just a guess, because I do not know pandas, but assuming it is a bit like SQL or LINQ or linkable datasets in C#: what about joining your table (A) with itself (B) over the 10-minute window, grouping by each row of A, summing B's column2 (given only non-negative values there) and filtering (SQL HAVING) to the rows whose sum is 0?
As the result, report A.column0, A.column1 and SUM(B.column2).
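A rough pandas translation of that self-join idea (purely illustrative; the col2 column name and the forward 10-minute window are assumptions):
a = input.rename_axis('t').reset_index()
b = a[['t', 'col2']].rename(columns={'t': 't_b', 'col2': 'col2_b'})
joined = a.merge(b, how='cross')   # self-join: every pair of rows (quadratic, small frames only)
in_window = joined['t_b'].between(joined['t'], joined['t'] + pd.Timedelta('9min'))
sums = joined.loc[in_window].groupby('t')['col2_b'].sum()
result = a.set_index('t').loc[sums[sums == 0].index]   # rows whose window sums to 0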
Using pandas.DataFrame.query (see the pandas.DataFrame.query documentation):
df.query(f'column_1 == {0} and column_2 == {value} or column_3 == {another_value}')

Pandas: time column addition and repeating all rows for a month

I'd like to expand my dataframe, adding rows with a time interval for every hour during a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number - 1 = month_days_number * hours_per_day * orig_rows_number - 1 (31 * 24 * 3 - 1 = 2231)
What is the proper way to do this?
Use a cross join via DataFrame.merge with a new DataFrame holding all hours of the month, created by date_range:
df1 = pd.DataFrame({'a': 1,
                    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
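
In pandas 1.2+ the helper key 'a' is unnecessary; a small sketch of the same cross join with merge(how="cross"):
times = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(times, how='cross')   # every original row paired with every hour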
