I have a boolean time-based data set, as per the example below. I am interested in highlighting continuous sequences of more than three 1's in the data set, and I would like to capture this in a new column called [Continuous_out_x]. Is there an efficient operation to do this?
I generated test data in this way:
import numpy as np
import pandas as pd

df = pd.DataFrame(zip(list(np.random.randint(2, size=20)), list(np.random.randint(2, size=20))),
                  columns=['tag1', 'tag2'],
                  index=pd.date_range('2020-01-01', periods=20, freq='s'))
The output I got was the following:
print(df):
tag1 tag2
2020-01-01 00:00:00 0 0
2020-01-01 00:00:01 1 0
2020-01-01 00:00:02 1 0
2020-01-01 00:00:03 1 1
2020-01-01 00:00:04 1 0
2020-01-01 00:00:05 1 0
2020-01-01 00:00:06 1 1
2020-01-01 00:00:07 0 1
2020-01-01 00:00:08 0 0
2020-01-01 00:00:09 1 1
2020-01-01 00:00:10 1 0
2020-01-01 00:00:11 0 1
2020-01-01 00:00:12 1 0
2020-01-01 00:00:13 0 1
2020-01-01 00:00:14 0 1
2020-01-01 00:00:15 0 1
2020-01-01 00:00:16 1 1
2020-01-01 00:00:17 0 0
2020-01-01 00:00:18 0 1
2020-01-01 00:00:19 1 0
A solution to this example set (above) would look like this:
print(df):
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
You can do this as follows:
1. create a series that distinguishes each streak (group)
2. flag groups with more than three rows
3. keep the flag only where the value itself is 1

Code:
# ok to loop over a few columns, still very performant
for col in ["tag1", "tag2"]:
    col_no = col[-1]
    # label each streak: the counter increments whenever the value changes
    df[f"group_{col}"] = np.cumsum(df[col].shift(1) != df[col])
    # flag streaks longer than three rows
    df[f"{col}_counts"] = df.groupby(f"group_{col}")[col].transform("count") > 3
    # mask with the value itself, so long runs of 0's are not marked
    df[f"Continuous_out_{col_no}"] = (df[f"{col}_counts"] & df[col].astype(bool)).astype(int)
    df = df.drop(columns=[f"group_{col}", f"{col}_counts"])
Output:
tag1 tag2 Continuous_out_1 Continuous_out_2
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
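If you need this for other columns or a different cutoff, the same steps can be wrapped in a small helper. This is just a sketch of the same approach; mark_runs and the cutoff parameter are illustrative names, not part of the original answer:

def mark_runs(s, cutoff=3):
    # label contiguous runs, measure their size, and keep only long runs of 1's
    groups = s.ne(s.shift()).cumsum()
    long_enough = s.groupby(groups).transform('size') > cutoff
    return (long_enough & s.astype(bool)).astype(int)

for i, col in enumerate(['tag1', 'tag2'], start=1):
    df[f'Continuous_out_{i}'] = mark_runs(df[col])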
You can identify the regions of contiguous True/False and check whether they are longer than your cutoff.
for colname, series in list(df.items()):   # snapshot, since columns are added in the loop
    new = f'Continuous_{colname}'
    df[new] = series.diff().ne(0).cumsum()               # label contiguous regions
    df[new] = series.groupby(df[new]).transform('size')  # get size of each region
    df[new] = df[new].gt(3) * series                     # apply cutoff; zero rows stay zero
Output
tag1 tag2 Continuous_tag1 Continuous_tag2
index
2020-01-01 00:00:00 0 0 0 0
2020-01-01 00:00:01 1 0 1 0
2020-01-01 00:00:02 1 0 1 0
2020-01-01 00:00:03 1 1 1 0
2020-01-01 00:00:04 1 0 1 0
2020-01-01 00:00:05 1 0 1 0
2020-01-01 00:00:06 1 1 1 0
2020-01-01 00:00:07 0 1 0 0
2020-01-01 00:00:08 0 0 0 0
2020-01-01 00:00:09 1 1 0 0
2020-01-01 00:00:10 1 0 0 0
2020-01-01 00:00:11 0 1 0 0
2020-01-01 00:00:12 1 0 0 0
2020-01-01 00:00:13 0 1 0 1
2020-01-01 00:00:14 0 1 0 1
2020-01-01 00:00:15 0 1 0 1
2020-01-01 00:00:16 1 1 0 1
2020-01-01 00:00:17 0 0 0 0
2020-01-01 00:00:18 0 1 0 0
2020-01-01 00:00:19 1 0 0 0
I have data in the format below stored in a pandas DataFrame:
PolicyNumber InceptionDate
1 2017-12-28 00:00:00.0
I want to split this single record into 12 records based on the inception date. For example:
1 2017-12-28 00:00:00.0
1 2018-1-28 00:00:00.0
1 2018-2-28 00:00:00.0
1 2018-3-28 00:00:00.0
.
.
1 2018-11-28 00:00:00.0
Is this possible?
You can use pd.date_range to generate a range of dates and then explode the column:
df['InceptionDate'] = pd.to_datetime(df['InceptionDate'])
# build 12 month-start dates, then shift each one to the original day of the month
df = (df.assign(InceptionDate=df['InceptionDate'].apply(
          lambda date: pd.date_range(start=date, periods=12, freq='MS')
                       + pd.DateOffset(days=date.day - 1)))
        .explode('InceptionDate'))
print(df)
PolicyNumber InceptionDate
0 1 2018-01-28
0 1 2018-02-28
0 1 2018-03-28
0 1 2018-04-28
0 1 2018-05-28
0 1 2018-06-28
0 1 2018-07-28
0 1 2018-08-28
0 1 2018-09-28
0 1 2018-10-28
0 1 2018-11-28
0 1 2018-12-28
To convert it back to your original format from the datetime type:
df['InceptionDate'] = df['InceptionDate'].dt.strftime('%Y-%m-%d %H:%M:%S.%f')
PolicyNumber InceptionDate
0 1 2018-01-28 00:00:00.000000
0 1 2018-02-28 00:00:00.000000
0 1 2018-03-28 00:00:00.000000
0 1 2018-04-28 00:00:00.000000
0 1 2018-05-28 00:00:00.000000
0 1 2018-06-28 00:00:00.000000
0 1 2018-07-28 00:00:00.000000
0 1 2018-08-28 00:00:00.000000
0 1 2018-09-28 00:00:00.000000
0 1 2018-10-28 00:00:00.000000
0 1 2018-11-28 00:00:00.000000
0 1 2018-12-28 00:00:00.000000
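Note that the question's example sequence starts at the original date (2017-12-28) and runs through 2018-11-28, while the output above starts one month later. If the original row should be included, a sketch using a monthly DateOffset as the frequency avoids the month-start snapping:

df['InceptionDate'] = pd.to_datetime(df['InceptionDate'])
# yields 2017-12-28, 2018-01-28, ..., 2018-11-28
monthly = lambda date: pd.date_range(start=date, periods=12, freq=pd.DateOffset(months=1))
df = df.assign(InceptionDate=df['InceptionDate'].apply(monthly)).explode('InceptionDate')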
I want to evaluate whether a given condition (e.g. a threshold) is met for a certain duration in a pandas DataFrame, and set an output value accordingly.
E.g. set output to 1 if data > treshold for at least the next 45 min, and back to 0 if data < treshold.
What works so far (for treshold = 3 and a minimum duration of 45 min):
import pandas as pd
import random, math

df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00', '2020-01-01 06:00:00', freq='15T')})

treshold = 3
data = []
for i in range(0, df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')

# sample spacing in hours, taken from the first index step
timestep = df.index.to_series().diff().dt.seconds.div(3600, fill_value=None)[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)

output = []
for i, e in enumerate(data):
    # look ahead up to cell_range samples (truncated at the end of the data)
    if i > (len(data) - cell_range):
        futures = data[i:len(data)]
    else:
        futures = data[i:i + cell_range]
    last = 0 if i == 0 else output[i - 1]
    current = data[i]
    # start when the whole lookahead window is above the threshold;
    # stay on while the current value remains above it
    if min(futures) > treshold or (last > 0 and current > treshold):
        output.append(1)
    else:
        output.append(0)
df['output'] = output
result:
data output
dt
2020-01-01 00:00:00 1 0
2020-01-01 00:15:00 1 0
2020-01-01 00:30:00 5 1
2020-01-01 00:45:00 6 1
2020-01-01 01:00:00 7 1
2020-01-01 01:15:00 0 0
2020-01-01 01:30:00 4 0
2020-01-01 01:45:00 5 0
2020-01-01 02:00:00 0 0
2020-01-01 02:15:00 10 1
2020-01-01 02:30:00 5 1
2020-01-01 02:45:00 9 1
2020-01-01 03:00:00 6 1
2020-01-01 03:15:00 6 1
2020-01-01 03:30:00 4 1
2020-01-01 03:45:00 10 1
2020-01-01 04:00:00 6 1
2020-01-01 04:15:00 5 1
2020-01-01 04:30:00 0 0
2020-01-01 04:45:00 8 1
2020-01-01 05:00:00 9 1
2020-01-01 05:15:00 5 1
2020-01-01 05:30:00 9 1
2020-01-01 05:45:00 6 1
2020-01-01 06:00:00 3 0
However, I'm wondering if there is an easier (and more efficient) way to do this with Python/pandas?
I found a solution which seems to work, using .rolling and .shift.
import pandas as pd
import numpy as np
import random, math

df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00', '2020-01-01 06:00:00', freq='15T')})

treshold = 3
data = []
for i in range(0, df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')

timestep = df.index.to_series().diff().dt.seconds.div(3600, fill_value=None)[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)

# zero out values at or below the threshold
df['above_treshold'] = np.where(df['data'] > treshold, df['data'], 0)

# forward- and backward-looking rolling minima over the duration window
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=cell_range)
rolling_min_fwd = df['above_treshold'].rolling(window=indexer).min()
rolling_min_bwd = df['above_treshold'].rolling(window=cell_range).min()

shifted_fwd = df['above_treshold'].shift(1)
shifted_bwd = df['above_treshold'].shift(-1)

# a cycle starts on a rising edge whose forward window is entirely above the threshold,
# and stops on a falling edge whose trailing window is entirely above it
start_condition = ((rolling_min_fwd > 0) & ((df['above_treshold'] - shifted_fwd) == df['above_treshold']))
stop_condition = ((rolling_min_bwd > 0) & ((df['above_treshold'] - shifted_bwd) == df['above_treshold']))

# rows where cumulative starts and (shifted) stops balance out are outside a cycle
cycles = start_condition.cumsum()
idx = cycles == stop_condition.shift(1).cumsum()
cycles.loc[idx] = 0
df['output'] = np.where(cycles > 0, 1, 0)
resulting in:
data above_treshold output
dt
2020-01-01 00:00:00 8 8 0
2020-01-01 00:15:00 3 0 0
2020-01-01 00:30:00 3 0 0
2020-01-01 00:45:00 1 0 0
2020-01-01 01:00:00 3 0 0
2020-01-01 01:15:00 9 9 1
2020-01-01 01:30:00 4 4 1
2020-01-01 01:45:00 8 8 1
2020-01-01 02:00:00 6 6 1
2020-01-01 02:15:00 4 4 1
2020-01-01 02:30:00 6 6 1
2020-01-01 02:45:00 6 6 1
2020-01-01 03:00:00 1 0 0
2020-01-01 03:15:00 6 6 0
2020-01-01 03:30:00 7 7 0
2020-01-01 03:45:00 0 0 0
2020-01-01 04:00:00 2 0 0
2020-01-01 04:15:00 8 8 1
2020-01-01 04:30:00 8 8 1
2020-01-01 04:45:00 9 9 1
2020-01-01 05:00:00 1 0 0
2020-01-01 05:15:00 9 9 1
2020-01-01 05:30:00 10 10 1
2020-01-01 05:45:00 5 5 1
2020-01-01 06:00:00 8 8 1
I couldn't measure a significant performance impact (working on DataFrames with > 35k data points), but it is still better than iterating over each data point (though less intuitive).
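For reference, the run-labeling idiom from the first question above also applies here. A sketch (not benchmarked; output_alt is an illustrative name): it marks a run as valid when it lasts at least cell_range samples, which matches the loop above except for runs truncated at the very end of the series.

above = df['data'] > treshold
runs = above.ne(above.shift()).cumsum()                            # label contiguous runs
long_enough = above.groupby(runs).transform('size') >= cell_range  # keep only long runs
df['output_alt'] = (above & long_enough).astype(int)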
I wish to have the time duration/accumulation of the time diff as long as "state" == 1 is active, and 'off' otherwise.
timestamp state
2020-01-01 00:00:00 0
2020-01-01 00:00:01 0
2020-01-01 00:00:02 0
2020-01-01 00:00:03 1
2020-01-01 00:00:04 1
2020-01-01 00:00:05 1
2020-01-01 00:00:06 1
2020-01-01 00:00:07 0
2020-01-01 00:00:08 0
2020-01-01 00:00:09 0
2020-01-01 00:00:10 0
2020-01-01 00:00:11 1
2020-01-01 00:00:12 1
2020-01-01 00:00:13 1
2020-01-01 00:00:14 1
2020-01-01 00:00:15 1
2020-01-01 00:00:16 1
2020-01-01 00:00:17 0
2020-01-01 00:00:18 0
2020-01-01 00:00:19 0
2020-01-01 00:00:20 0
Based on a similar question, I tried something with groupby; however, the code does not stop computing the time diff when "state" == 0.
I also tried to apply a lambda function (commented out below), but an error pops up saying "KeyError: ('state', 'occurred at index timestamp')".
Any idea how to do this better?
import numpy as np
import pandas as pd

dt = pd.date_range('2020-01-01', '2020-01-01 00:00:20', freq='1s')
s = [0,0,0,1,1,1,1,0,0,0,0,1,1,1,1,1,1,0,0,0,0]
df = pd.DataFrame({'timestamp': dt,
                   'state': s})
df['timestamp'] = pd.to_datetime(df.timestamp, format='%Y-%m-%d %H:%M:%S')

df['tdiff'] = (df.groupby('state').diff().timestamp.values / 60)
#df['tdiff'] = df.apply(lambda x: x['timestamp'].diff().state.values/60 if x['state'] == 1 else 'off')
The desired output should be:
timestamp state tdiff accum.
2020-01-01 00:00:00 0 off 0
2020-01-01 00:00:01 0 off 0
2020-01-01 00:00:02 0 off 0
2020-01-01 00:00:03 1 nan 0
2020-01-01 00:00:04 1 1.0 1.0
2020-01-01 00:00:05 1 1.0 2.0
2020-01-01 00:00:06 1 1.0 3.0
2020-01-01 00:00:07 0 off 0
2020-01-01 00:00:08 0 off 0
2020-01-01 00:00:09 0 off 0
2020-01-01 00:00:10 0 off 0
2020-01-01 00:00:11 1 nan 0
2020-01-01 00:00:12 1 1.0 1.0
2020-01-01 00:00:13 1 1.0 2.0
2020-01-01 00:00:14 1 1.0 3.0
2020-01-01 00:00:15 1 1.0 4.0
2020-01-01 00:00:16 1 1.0 5.0
You can do this with groupby, using the cumsum of the zero rows as the additional group key:
# select the active rows, grouped by the number of zero rows seen so far
g = df.loc[df['state'].ne(0)].groupby(df['state'].eq(0).cumsum())['timestamp']

s1 = g.diff().dt.total_seconds()                              # per-row time diff
s2 = g.apply(lambda x: x.diff().dt.total_seconds().cumsum())  # running total per block

df['tdiff'] = 'off'
df.loc[df['state'].ne(0), 'tdiff'] = s1
df['accum'] = s2
# notice I did not fillna with 0; you can do it with df['accum'].fillna(0, inplace=True)
df
Out[53]:
timestamp state tdiff accum
0 2020-01-01 00:00:00 0 off NaN
1 2020-01-01 00:00:01 0 off NaN
2 2020-01-01 00:00:02 0 off NaN
3 2020-01-01 00:00:03 1 NaN NaN
4 2020-01-01 00:00:04 1 1 1.0
5 2020-01-01 00:00:05 1 1 2.0
6 2020-01-01 00:00:06 1 1 3.0
7 2020-01-01 00:00:07 0 off NaN
8 2020-01-01 00:00:08 0 off NaN
9 2020-01-01 00:00:09 0 off NaN
10 2020-01-01 00:00:10 0 off NaN
11 2020-01-01 00:00:11 1 NaN NaN
12 2020-01-01 00:00:12 1 1 1.0
13 2020-01-01 00:00:13 1 1 2.0
14 2020-01-01 00:00:14 1 1 3.0
15 2020-01-01 00:00:15 1 1 4.0
16 2020-01-01 00:00:16 1 1 5.0
17 2020-01-01 00:00:17 0 off NaN
18 2020-01-01 00:00:18 0 off NaN
19 2020-01-01 00:00:19 0 off NaN
20 2020-01-01 00:00:20 0 off NaN
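An alternative sketch that avoids groupby.apply: mask the per-row diffs so each active block starts at NaN, then take a cumulative sum per block (accum_alt is an illustrative name):

active = df['state'].eq(1)
# a diff is only valid when this row and the previous one are both active
diffs = df['timestamp'].diff().dt.total_seconds().where(active & active.shift(fill_value=False))
block = df['state'].eq(0).cumsum()                  # same group key as above
df['accum_alt'] = diffs.groupby(block).cumsum()     # inactive rows stay NaN, like accum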
Hello, I am trying to set a MultiIndex on my office computer:
data.set_index(['POM', 'DTM'], inplace=True)
but I get the following error:
Categorical levels must be unique
At home I don't get the error. Both pandas installations are version 0.13.1.
Here is some sample data
POM DTM RNF WET HMD TMP DEW INF
0 QuintaVilar 2011-11-01 00:00:00 0 0 0 0 0 0
1 QuintaVilar 2011-11-01 00:15:00 0 0 0 0 0 0
2 QuintaVilar 2011-11-01 00:30:00 0 0 0 0 0 0
3 QuintaVilar 2011-11-01 00:45:00 0 0 0 0 0 0
4 QuintaVilar 2011-11-01 01:00:00 0 0 0 0 0 0
5 QuintaVilar 2011-11-01 01:15:00 0 0 0 0 0 0
6 QuintaVilar 2011-11-01 01:30:00 0 0 0 0 0 0
Could you help me?
Thank you
It shouldn't fail. But how about just creating a MultiIndex directly?:
In [52]:
print df
POM DTM RNF WET HMD TMP DEW INF
0 QuintaVilar 2011-11-01 00:00:00 0 0 0 0 0 0
1 QuintaVilar 2011-11-01 00:15:00 0 0 0 0 0 0
2 QuintaVilar 2011-11-01 00:30:00 0 0 0 0 0 0
3 QuintaVilar 2011-11-01 00:45:00 0 0 0 0 0 0
4 QuintaVilar 2011-11-01 01:00:00 0 0 0 0 0 0
5 QuintaVilar 2011-11-01 01:15:00 0 0 0 0 0 0
6 QuintaVilar 2011-11-01 01:30:00 0 0 0 0 0 0
[7 rows x 8 columns]
In [53]:
idx=pd.MultiIndex.from_arrays(df[['POM','DTM']].values.T)
In [54]:
df.index=idx
In [56]:
print df
POM DTM RNF WET \
QuintaVilar 2011-11-01 00:00:00 QuintaVilar 2011-11-01 00:00:00 0 0
2011-11-01 00:15:00 QuintaVilar 2011-11-01 00:15:00 0 0
2011-11-01 00:30:00 QuintaVilar 2011-11-01 00:30:00 0 0
2011-11-01 00:45:00 QuintaVilar 2011-11-01 00:45:00 0 0
2011-11-01 01:00:00 QuintaVilar 2011-11-01 01:00:00 0 0
2011-11-01 01:15:00 QuintaVilar 2011-11-01 01:15:00 0 0
2011-11-01 01:30:00 QuintaVilar 2011-11-01 01:30:00 0 0
HMD TMP DEW INF
QuintaVilar 2011-11-01 00:00:00 0 0 0 0
2011-11-01 00:15:00 0 0 0 0
2011-11-01 00:30:00 0 0 0 0
2011-11-01 00:45:00 0 0 0 0
2011-11-01 01:00:00 0 0 0 0
2011-11-01 01:15:00 0 0 0 0
2011-11-01 01:30:00 0 0 0 0
[7 rows x 8 columns]
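A variant of the same workaround with named index levels (a sketch; dropping the now-duplicated columns is optional):

idx = pd.MultiIndex.from_arrays([df['POM'], df['DTM']], names=['POM', 'DTM'])
df.index = idx
df = df.drop(['POM', 'DTM'], axis=1)   # optional: the values now live in the index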
My goal is to create a Series from a pandas DataFrame by choosing an element from a different column for each row.
For example, I have the following DataFrame:
In [171]: pred[:10]
Out[171]:
0 1 2
Timestamp
2010-12-21 00:00:00 0 0 1
2010-12-20 00:00:00 1 1 1
2010-12-17 00:00:00 1 1 1
2010-12-16 00:00:00 0 0 1
2010-12-15 00:00:00 1 1 1
2010-12-14 00:00:00 1 1 1
2010-12-13 00:00:00 0 0 1
2010-12-10 00:00:00 1 1 1
2010-12-09 00:00:00 1 1 1
2010-12-08 00:00:00 0 0 1
And, I have the following series:
In [172]: useProb[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 1
2010-12-20 00:00:00 2
2010-12-17 00:00:00 1
2010-12-16 00:00:00 2
2010-12-15 00:00:00 2
2010-12-14 00:00:00 2
2010-12-13 00:00:00 0
2010-12-10 00:00:00 2
2010-12-09 00:00:00 2
2010-12-08 00:00:00 0
I would like to create a new series, usePred, that takes the values from pred based on the column indicated in useProb, to return the following:
In [172]: usePred[:10]
Out[172]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
This last step is where I fail. I've tried things like:
usePred = pd.DataFrame(index = pred.index)
for row in usePred:
usePred['PREDS'].ix[row] = pred.ix[row, useProb[row]]
And, I've tried:
usePred['PREDS'] = pred.iloc[:,useProb]
I've googled and searched on Stack Overflow for hours, but can't seem to solve the problem.
One solution could be to use get_dummies (which should be more efficient than apply):
In [11]: (pd.get_dummies(useProb) * pred).sum(axis=1)
Out[11]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: float64
You could use an apply with a couple of locs:
In [21]: pred.apply(lambda row: row.loc[useProb.loc[row.name]], axis=1)
Out[21]:
Timestamp
2010-12-21 00:00:00 0
2010-12-20 00:00:00 1
2010-12-17 00:00:00 1
2010-12-16 00:00:00 1
2010-12-15 00:00:00 1
2010-12-14 00:00:00 1
2010-12-13 00:00:00 0
2010-12-10 00:00:00 1
2010-12-09 00:00:00 1
2010-12-08 00:00:00 0
dtype: int64
The trick is that you have access to the row's index label via the name attribute.
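A quick way to see what name holds inside the apply (a sketch):

pred.apply(lambda row: row.name, axis=1)   # yields each row's Timestamp label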
Here is another way to do it using DataFrame.lookup:
pred.lookup(row_labels=pred.index,
col_labels=pred.columns[useProb['0']])
It seems to be exactly what you need, except that care must be taken to supply values which are labels. For example, if pred.columns are strings, and useProb['0'] values are integers, then we could use
pred.columns[useProb['0']]
so that the values passed to the col_labels parameter are proper label values.
For example,
import io
import pandas as pd

# StringIO (not BytesIO) so this runs on Python 3; columns are separated by 2+ spaces
content = io.StringIO('''\
Timestamp              0    1    2
2010-12-21 00:00:00    0    0    1
2010-12-20 00:00:00    1    1    1
2010-12-17 00:00:00    1    1    1
2010-12-16 00:00:00    0    0    1
2010-12-15 00:00:00    1    1    1
2010-12-14 00:00:00    1    1    1
2010-12-13 00:00:00    0    0    1
2010-12-10 00:00:00    1    1    1
2010-12-09 00:00:00    1    1    1
2010-12-08 00:00:00    0    0    1''')
pred = pd.read_table(content, sep=r'\s{2,}', engine='python', parse_dates=True, index_col=[0])

content = io.StringIO('''\
Timestamp              0
2010-12-21 00:00:00    1
2010-12-20 00:00:00    2
2010-12-17 00:00:00    1
2010-12-16 00:00:00    2
2010-12-15 00:00:00    2
2010-12-14 00:00:00    2
2010-12-13 00:00:00    0
2010-12-10 00:00:00    2
2010-12-09 00:00:00    2
2010-12-08 00:00:00    0''')
useProb = pd.read_table(content, sep=r'\s{2,}', engine='python', parse_dates=True, index_col=[0])

print(pd.Series(pred.lookup(row_labels=pred.index,
                            col_labels=pred.columns[useProb['0']]),
                index=pred.index))
yields
Timestamp
2010-12-21 0
2010-12-20 1
2010-12-17 1
2010-12-16 1
2010-12-15 1
2010-12-14 1
2010-12-13 0
2010-12-10 1
2010-12-09 1
2010-12-08 0
dtype: int64
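Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions, a NumPy-based sketch of the same row-wise pick:

import numpy as np

# useProb['0'] holds positional column indices into pred
usePred = pd.Series(
    pred.to_numpy()[np.arange(len(pred)), useProb['0'].to_numpy()],
    index=pred.index,
)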