Count rows with multiple criteria in pandas - python

I have a pandas DataFrame with "user_id", "datetime" and "action_type" columns, as shown below, and I want to compute the last column (the desired output) from the others:
import pandas as pd

data = {'user_id': list('ddabdacddaaa'),
        'datetime': pd.date_range("20201001", periods=12, freq='H'),
        'action_type': list('XXXWZWKOOXWX'),
        'as_if_X_calculated': list('121021022223')
        }
df = pd.DataFrame(data)
df
user_id datetime action_type as_if_X_calculated
0 d 2020-10-01 00:00:00 X 1
1 d 2020-10-01 01:00:00 X 2
2 a 2020-10-01 02:00:00 X 1
3 b 2020-10-01 03:00:00 W 0
4 d 2020-10-01 04:00:00 Z 2
5 a 2020-10-01 05:00:00 W 1
6 c 2020-10-01 06:00:00 K 0
7 d 2020-10-01 07:00:00 O 2
8 d 2020-10-01 08:00:00 O 2
9 a 2020-10-01 09:00:00 X 2
10 a 2020-10-01 10:00:00 W 2
11 a 2020-10-01 11:00:00 X 3
So the last column shows how many times the user has performed action X up to and including the current record. For user "a", the values in chronological order are 1-1-2-2-3. How can I count the X actions for a given user that happened at or before the time of the record?
P.S. In Excel it would look like =COUNTIFS(A:A; A2; B:B; "<="&B2; C:C; "X") (column A = "user_id")

If your DataFrame is sorted by datetime, you can create a temporary column for the condition on action_type and use a groupby with expanding():
df.sort_values('datetime', inplace=True)
df['dummy'] = df.action_type == 'X'
df['X_calculated'] = (df.groupby('user_id')['dummy']
                        .expanding().sum()
                        .reset_index(level=0, drop=True)
                        .astype('int'))
df.sort_index(inplace=True)
print(df.drop('dummy', axis=1))
assert df.as_if_X_calculated.astype('int').equals(df.X_calculated), 'X_calculated is not equal'
Out:
user_id datetime action_type as_if_X_calculated X_calculated
0 d 2020-10-01 00:00:00 X 1 1
1 d 2020-10-01 01:00:00 X 2 2
2 a 2020-10-01 02:00:00 X 1 1
3 b 2020-10-01 03:00:00 W 0 0
4 d 2020-10-01 04:00:00 Z 2 2
5 a 2020-10-01 05:00:00 W 1 1
6 c 2020-10-01 06:00:00 K 0 0
7 d 2020-10-01 07:00:00 O 2 2
8 d 2020-10-01 08:00:00 O 2 2
9 a 2020-10-01 09:00:00 X 2 2
10 a 2020-10-01 10:00:00 W 2 2
11 a 2020-10-01 11:00:00 X 3 3
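A shorter variant of the same idea (a sketch under the same assumptions; the column name X_calculated2 is made up) replaces expanding().sum() with a cumulative sum of the boolean mask per user:
df = df.sort_values('datetime')
# cumulative count of X actions per user, in datetime order
df['X_calculated2'] = (df.action_type == 'X').astype(int).groupby(df.user_id).cumsum()
df = df.sort_index()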

Related

pandas: evaluate if condition is met for consecutive data points in a given timeframe

I want to evaluate whether a given condition (e.g. a threshold) is met for a certain duration in a pandas DataFrame and set an output value accordingly.
E.g. set the output to 1 if data > threshold for at least the next 45 min, and back to 0 if data < threshold.
What works so far (with treshold = 3 and a minimum duration of 45 min):
import pandas as pd
import random, math
df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00','2020-01-01 06:00:00',freq='15T')})
treshold = 3
data = []
for i in range(0, df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')
timestep = df.index.to_series().diff().dt.seconds.div(3600, fill_value=None).iloc[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)
output = []
for i, e in enumerate(data):
    if i > (len(data) - cell_range):
        futures = data[i:len(data)]
    else:
        futures = data[i:i + cell_range]
    if i == 0:
        last = 0
    else:
        last = output[i-1]
    current = data[i]
    if (min(futures) > treshold or (last > 0 and current > treshold)):
        output.append(1)
    else:
        output.append(0)
df['output'] = output
result:
data output
dt
2020-01-01 00:00:00 1 0
2020-01-01 00:15:00 1 0
2020-01-01 00:30:00 5 1
2020-01-01 00:45:00 6 1
2020-01-01 01:00:00 7 1
2020-01-01 01:15:00 0 0
2020-01-01 01:30:00 4 0
2020-01-01 01:45:00 5 0
2020-01-01 02:00:00 0 0
2020-01-01 02:15:00 10 1
2020-01-01 02:30:00 5 1
2020-01-01 02:45:00 9 1
2020-01-01 03:00:00 6 1
2020-01-01 03:15:00 6 1
2020-01-01 03:30:00 4 1
2020-01-01 03:45:00 10 1
2020-01-01 04:00:00 6 1
2020-01-01 04:15:00 5 1
2020-01-01 04:30:00 0 0
2020-01-01 04:45:00 8 1
2020-01-01 05:00:00 9 1
2020-01-01 05:15:00 5 1
2020-01-01 05:30:00 9 1
2020-01-01 05:45:00 6 1
2020-01-01 06:00:00 3 0
However, I'm wondering if there is an easier (and more efficient) way to do this with Python/pandas?
I found a solution which seems to work, using .rolling and .shift.
import pandas as pd
import numpy as np
import random, math
df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00','2020-01-01 06:00:00',freq='15T')})
treshold = 3
data = []
for i in range(0, df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')
timestep = df.index.to_series().diff().dt.seconds.div(3600, fill_value=None).iloc[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)
df['above_treshold'] = np.where(df['data'] > treshold, df['data'], 0)
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=cell_range)
rolling_min_fwd = df['above_treshold'].rolling(window=indexer).min()
rolling_min_bwd = df['above_treshold'].rolling(window=cell_range).min()
shifted_fwd = df['above_treshold'].shift(1)
shifted_bwd = df['above_treshold'].shift(-1)
# a period starts where the forward window stays above the threshold and the previous row did not
start_condition = ((rolling_min_fwd > 0) & ((df['above_treshold'] - shifted_fwd) == df['above_treshold']))
# a period stops where the backward window stayed above the threshold and the next row does not
stop_condition = ((rolling_min_bwd > 0) & ((df['above_treshold'] - shifted_bwd) == df['above_treshold']))
cycles = start_condition.cumsum()                   # running count of period starts
idx = cycles == stop_condition.shift(1).cumsum()    # rows where every started period has already stopped
cycles.loc[idx] = 0
df['output'] = np.where(cycles > 0, 1, 0)
resulting in:
data above_treshold output
dt
2020-01-01 00:00:00 8 8 0
2020-01-01 00:15:00 3 0 0
2020-01-01 00:30:00 3 0 0
2020-01-01 00:45:00 1 0 0
2020-01-01 01:00:00 3 0 0
2020-01-01 01:15:00 9 9 1
2020-01-01 01:30:00 4 4 1
2020-01-01 01:45:00 8 8 1
2020-01-01 02:00:00 6 6 1
2020-01-01 02:15:00 4 4 1
2020-01-01 02:30:00 6 6 1
2020-01-01 02:45:00 6 6 1
2020-01-01 03:00:00 1 0 0
2020-01-01 03:15:00 6 6 0
2020-01-01 03:30:00 7 7 0
2020-01-01 03:45:00 0 0 0
2020-01-01 04:00:00 2 0 0
2020-01-01 04:15:00 8 8 1
2020-01-01 04:30:00 8 8 1
2020-01-01 04:45:00 9 9 1
2020-01-01 05:00:00 1 0 0
2020-01-01 05:15:00 9 9 1
2020-01-01 05:30:00 10 10 1
2020-01-01 05:45:00 5 5 1
2020-01-01 06:00:00 8 8 1
I couldn't measure a significant impact on performance (working with DataFrames of more than 35k data points), but it still beats iterating over each data point, though it is less intuitive.
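For reference, a more compact sketch of the same hysteresis logic (not the code above; it continues from that snippet and assumes the same df, treshold and cell_range, while the names above, start, run_id and output2 are made up): a forward rolling minimum marks where a qualifying period can start, and a per-run cumulative maximum keeps the output at 1 until the data drops back below the threshold.
above = df['data'] > treshold                        # rows above the threshold
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=cell_range)
# a period can start where the current and the following cell_range-1 values are all above the threshold
start = above.astype(int).rolling(window=indexer, min_periods=1).min().astype(bool)
run_id = (~above).cumsum()                           # label consecutive above-threshold runs
# inside each run, stay at 1 from the first start onward
df['output2'] = np.where(above & start.astype(int).groupby(run_id).cummax().astype(bool), 1, 0)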

Drop overlapping periods less than 6 months in pandas dataframe

I have the following pandas DataFrame and, for each customer, I want to drop the rows where the difference between dates is less than 6 months. For example, for the customer with ID 1 I want to keep the following dates: 2017-07-01, 2018-01-01, 2018-08-01.
Customer_ID Date
1 2017-07-01
1 2017-08-01
1 2017-09-01
1 2017-10-01
1 2017-11-01
1 2017-12-01
1 2018-01-01
1 2018-02-01
1 2018-03-01
1 2018-04-01
1 2018-06-01
1 2018-08-01
2 2018-11-01
2 2019-02-01
2 2019-03-01
2 2019-05-01
2 2020-02-01
2 2020-05-01
Define the following function to process each group of rows (for each customer):
def selDates(grp):
    res = []
    while grp.size > 0:
        stRow = grp.iloc[0]
        res.append(stRow)
        grp = grp[grp.Date >= stRow.Date + pd.DateOffset(months=6)]
    return pd.DataFrame(res)
Then apply this function to each group:
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
The result, for your data sample, is:
Customer_ID Date
0 1 2017-07-01
6 1 2018-01-01
11 1 2018-08-01
12 2 2018-11-01
15 2 2019-05-01
16 2 2020-02-01
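If Date is read in as strings, convert it first so the DateOffset comparison works; resetting the index afterwards is optional (a small follow-up sketch, not part of the answer above):
df['Date'] = pd.to_datetime(df['Date'])   # the comparison with stRow.Date + DateOffset needs a datetime dtype
result = df.groupby('Customer_ID', group_keys=False).apply(selDates)
result = result.reset_index(drop=True)    # optional: a clean 0..n index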

Pandas: time column addition and repeating all rows for a month

I'd like to change my DataFrame by adding a time entry for every hour of a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to do this?
Use a cross join via DataFrame.merge with a helper DataFrame that holds every hour of the month, created by date_range:
df1 = pd.DataFrame({'a': 1,
                    'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
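With pandas 1.2 or later, merge also supports how='cross' directly, so the helper column a is not needed (a small variant of the same idea; times is a made-up name):
times = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(times, how='cross')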

Group pandas rows into pairs then find timedelta

I have a dataframe where I need to group the TX/RX column into pairs, and then put these into a new dataframe with a new index and the timedelta between them as values.
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = pd.date_range('2018-01-01', periods=6, freq='1H1min')
df['id'] = ids
df['val'] = vals
time1 time2 id val
0 2018-01-01 00:00:00 2018-01-01 00:00:00 1 A
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A
3 2018-01-01 03:00:00 2018-01-01 03:03:00 4 B
4 2018-01-01 04:00:00 2018-01-01 04:04:00 5 A
5 2018-01-01 05:00:00 2018-01-01 05:05:00 6 B
needs to be...
index timedelta A B
0 1 1 2
1 1 3 4
2 1 5 6
I think that pivot_table or stack/unstack is probably the best way to go about this, but I'm not entirely sure how...
I believe you need:
df = pd.DataFrame()
df['time1'] = pd.date_range('2018-01-01', periods=6, freq='H')
df['time2'] = df['time1'] + pd.to_timedelta([60,60,120,120,180,180], 's')
df['id'] = range(1,7)
df['val'] = ['A','B'] * 3
df['t'] = df['time2'] - df['time1']
print (df)
time1 time2 id val t
0 2018-01-01 00:00:00 2018-01-01 00:01:00 1 A 00:01:00
1 2018-01-01 01:00:00 2018-01-01 01:01:00 2 B 00:01:00
2 2018-01-01 02:00:00 2018-01-01 02:02:00 3 A 00:02:00
3 2018-01-01 03:00:00 2018-01-01 03:02:00 4 B 00:02:00
4 2018-01-01 04:00:00 2018-01-01 04:03:00 5 A 00:03:00
5 2018-01-01 05:00:00 2018-01-01 05:03:00 6 B 00:03:00
#if necessary convert to seconds
#df['t'] = (df['time2'] - df['time1']).dt.total_seconds()
df = df.pivot(index='t', columns='val', values='id').reset_index().rename_axis(None, axis=1)
#if necessary aggregate values
#df = (df.pivot_table(index='t',columns='val',values='id', aggfunc='mean')
# .reset_index().rename_axis(None, axis=1))
print (df)
t A B
0 00:01:00 1 2
1 00:02:00 3 4
2 00:03:00 5 6

Temporal Binning in Pandas

I would like to perform something similar to an SQL GROUP BY or R's aggregate in pandas. I have a bunch of rows with irregular timestamps, and I would like to create temporal bins and count the number of rows falling into each bin. I can't quite see how to use resample to do this.
Example Rows
Time, Val
05.33, XYZ
05.45, ABC
07.13, DEF
Example Output
05.00-06.00, 2
06.00-07.00, 0
07.00-08.00, 1
If you are indexing on another value, you can use a groupby statement on the timestamp.
In [1]: import datetime
   ...: import numpy as np
   ...: import pandas as pd
   ...: dft = pd.DataFrame({'A' : ['spam', 'eggs', 'spam', 'eggs'] * 6,
   ...:                     'B' : np.random.randn(24),
   ...:                     'C' : [np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0), datetime.datetime(2013,1,2,0,0,0), freq='T')) for i in range(24)]})
In [2]: dft['B'].groupby([dft['C'].apply(lambda x:x.hour)]).agg(pd.Series.nunique)
Out[2]:
C
2 1
4 1
6 1
7 1
9 1
10 2
11 1
12 4
14 1
15 2
16 1
18 3
19 1
20 1
21 1
22 1
23 1
dtype: float64
If you're indexing on timestamps, then you can use resample.
In [3]: dft2 = pd.DataFrame({'A' : ['spam', 'eggs', 'spam', 'eggs'] * 6,
   ...:                      'B' : np.random.randn(24)},
   ...:                      index = [np.random.choice(pd.date_range(datetime.datetime(2013,1,1,0,0,0), datetime.datetime(2013,1,2,0,0,0), freq='T')) for i in range(24)])
In [4]: dft2.resample('H').agg(pd.Series.nunique)
Out[4]:
A B
2013-01-01 01:00:00 1 1
2013-01-01 02:00:00 0 0
2013-01-01 03:00:00 0 0
2013-01-01 04:00:00 0 0
2013-01-01 05:00:00 2 2
2013-01-01 06:00:00 2 3
2013-01-01 07:00:00 1 2
2013-01-01 08:00:00 2 2
2013-01-01 09:00:00 1 1
2013-01-01 10:00:00 2 3
2013-01-01 11:00:00 1 1
2013-01-01 12:00:00 1 2
2013-01-01 13:00:00 0 0
2013-01-01 14:00:00 1 1
2013-01-01 15:00:00 0 0
2013-01-01 16:00:00 1 1
2013-01-01 17:00:00 1 2
2013-01-01 18:00:00 0 0
2013-01-01 19:00:00 0 0
2013-01-01 20:00:00 2 2
2013-01-01 21:00:00 1 1
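To get exactly the hourly counts the question asks for (including empty bins), resampling with size also works. A sketch, assuming the Time values are parsed as timestamps on some arbitrary date:
import pandas as pd

df = pd.DataFrame({'Time': pd.to_datetime(['2013-01-01 05:33', '2013-01-01 05:45', '2013-01-01 07:13']),
                   'Val': ['XYZ', 'ABC', 'DEF']})
counts = df.set_index('Time').resample('H').size()
print(counts)
# 2013-01-01 05:00:00    2
# 2013-01-01 06:00:00    0
# 2013-01-01 07:00:00    1
# Freq: H, dtype: int64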
