Rolling sum for a window of 2 days - python

I am trying to compute a rolling 2-day sum of the Amount column, ordered by Trans_Date and grouped by ID, for the table below using Python.
ID  Trans_Date  Trans_Time  Amount
1   03/23/2019  06:51:03       100
1   03/24/2019  12:32:48       600
1   03/24/2019  14:15:35       250
1   06/05/2019  16:18:21        75
2   02/01/2019  18:02:52       200
2   02/02/2019  10:03:02       150
2   02/03/2019  23:47:51       800
3   01/18/2019  11:12:58      1000
3   01/23/2019  22:12:41        15
Ultimately, I am trying to achieve the result below.
ID  Trans_Date  Trans_Time  Amount  2d_Running_Total
1   03/23/2019  06:51:03       100               100
1   03/24/2019  12:32:48       600               700
1   03/24/2019  14:15:35       250               950
1   06/05/2019  16:18:21        75                75
2   02/01/2019  18:02:52       200               200
2   02/02/2019  10:03:02       150               350
2   02/03/2019  23:47:51       800               950
3   01/18/2019  11:12:58      1000              1000
3   01/23/2019  22:12:41        15                15
This thread came very close to solving it, but for records with multiple transactions on the same day it returns the same value for every row on that day:
https://python-forum.io/Thread-Rolling-sum-for-a-window-of-2-days-Pandas

This should do it:

import pandas as pd

# create dummy data
df = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount'],
    data=[
        [1, '03/23/2019', '06:51:03', 100],
        [1, '03/24/2019', '12:32:48', 600],
        [1, '03/24/2019', '14:15:35', 250],
        [1, '06/05/2019', '16:18:21', 75],
        [2, '02/01/2019', '18:02:52', 200],
        [2, '02/02/2019', '10:03:02', 150],
        [2, '02/03/2019', '23:47:51', 800],
        [3, '01/18/2019', '11:12:58', 1000],
        [3, '01/23/2019', '22:12:41', 15]
    ]
)

# expected result, for reference
df_out = pd.DataFrame(
    columns=['ID', 'Trans_Date', 'Trans_Time', 'Amount', '2d_Running_Total'],
    data=[
        [1, '03/23/2019', '06:51:03', 100, 100],
        [1, '03/24/2019', '12:32:48', 600, 700],
        [1, '03/24/2019', '14:15:35', 250, 950],
        [1, '06/05/2019', '16:18:21', 75, 75],
        [2, '02/01/2019', '18:02:52', 200, 200],
        [2, '02/02/2019', '10:03:02', 150, 350],
        [2, '02/03/2019', '23:47:51', 800, 950],
        [3, '01/18/2019', '11:12:58', 1000, 1000],
        [3, '01/23/2019', '22:12:41', 15, 15]
    ]
)

# combine date and time into a datetime object and set it as the index,
# so the time-based rolling window can see the exact transaction times
df['Trans_DateTime'] = pd.to_datetime(df['Trans_Date'] + ' ' + df['Trans_Time'])
df = df.set_index('Trans_DateTime')

# group by ID and apply a 2-day rolling window to the Amount column;
# assigning .values relies on df already being sorted by ID and datetime
df['2d_Running_Total'] = df.groupby('ID')['Amount'].rolling('2d').sum().values.astype(int)
df = df.reset_index(drop=True)
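
As a quick sanity check, the new column now matches the 2d_Running_Total values in df_out:

print(df['2d_Running_Total'].tolist())
#[100, 700, 950, 75, 200, 350, 950, 1000, 15]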

Related

Why is this Groupby transform not working?

For a dummy dataset, in which each id corresponds to one match:
df2 = pd.DataFrame(columns=['id', 'score', 'duration', 'user'],
                   data=[[1, 800, 60, 'abc'], [1, 900, 60, 'zxc'],
                         [2, 800, 250, 'abc'], [2, 5000, 250, 'bvc'],
                         [3, 6000, 250, 'zxc'], [3, 8000, 250, 'klp'],
                         [4, 1400, 500, 'kod'], [4, 8000, 500, 'bvc']])
If I want to keep only the ids where every row of the group has duration greater than 120 and score greater than 1500, this works fine:
cond = df2['duration'].gt(120) & df2['score'].gt(1500)
out = df2[cond.groupby(df2['id']).transform('all')]
and returns the two rows of the one qualifying id. However, if I want to keep only the pairs of ids where the user is 'abc', it does not work. I have tried:
out = df2[(df2['user'].eq('abc')).groupby(df2['id']).transform('all')]
out = df2[(df2['user'] == 'abc').groupby(df2['id']).transform('all')]
and they both return empty DataFrames. How can I solve this? The outcome should be every match that user 'abc' played in.
From the comments, you want 'any', not 'all':
out = df2[(df2['user'] == 'abc').groupby(df2['id']).transform('any')]
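
On the dummy data this keeps both matches that 'abc' played in, i.e. all rows of ids 1 and 2:

print(out)
#   id  score  duration user
#0   1    800        60  abc
#1   1    900        60  zxc
#2   2    800       250  abc
#3   2   5000       250  bvc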

Groupby column if value is less than some value

I have a dataframe like
df = pd.DataFrame({'time': [1, 5, 100, 250, 253, 260, 700], 'qty': [3, 6, 2, 5, 64, 2, 5]})
df['time_delta'] = df.time.diff()
and I would like to group by time_delta such that all rows where the time_delta is less than 10 are grouped together, the time_delta column is dropped, and qty is summed.
The expected result is
pd.DataFrame({'time': [1, 100, 250, 700], 'qty': [9, 2, 71, 5]})
Basically I am hoping there is something like a df.groupby(time_delta_func(10)).agg({'time': 'min', 'qty': 'sum'}) function. I read up on pd.Grouper, but its time-based grouping seems strict and interval-based.
You can do it with gt (greater than) and cumsum, creating a new group each time the time_delta exceeds 10:
res = (
    df.groupby(df['time_delta'].gt(10).cumsum(), as_index=False)
      .agg({'time': 'first', 'qty': 'sum'})
)
print(res)
   time  qty
0     1    9
1   100    2
2   250   71
3   700    5
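
To see why this works, here is the intermediate grouping key: cumsum bumps the counter at every row whose gap to the previous row exceeds 10, so each run of small gaps shares one integer.

df['time_delta'].gt(10).cumsum()
#0    0
#1    0
#2    1
#3    2
#4    2
#5    2
#6    3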

Splitting ranges according to list of excluded ranges

I have two arrays:
ranges = [[50, 60], [100, 100], [5000, 6000]]
exclude = [[1000, 1000], [5060, 5060]]
How do I get the result as a sorted list, namely
[[50, 60], [100, 100], [5000, 5059], [5061, 6000]]
Basically, remove the ranges of the second list from the ranges of the first list, creating new ranges where needed.
More examples:
ranges = [[2, 124235]]
exclude = [[2000, 3000], [400, 2500]]
which should give the output
[[2, 399], [3001, 124235]]
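
A minimal plain-Python sketch of one approach, treating all bounds as inclusive (subtract_ranges is a name chosen here for illustration): first merge overlapping or adjacent exclusion ranges, then sweep each input range and emit the pieces that survive.

def subtract_ranges(ranges, exclude):
    # merge overlapping/adjacent exclusion ranges into a sorted list
    merged = []
    for lo, hi in sorted(exclude):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    out = []
    for lo, hi in sorted(ranges):
        for ex_lo, ex_hi in merged:
            if ex_hi < lo or ex_lo > hi:
                continue  # this exclusion does not touch the range
            if ex_lo > lo:
                out.append([lo, ex_lo - 1])  # keep the part before the exclusion
            lo = ex_hi + 1  # resume after the exclusion
            if lo > hi:
                break
        if lo <= hi:
            out.append([lo, hi])  # keep whatever remains
    return out

print(subtract_ranges([[50, 60], [100, 100], [5000, 6000]], [[1000, 1000], [5060, 5060]]))
#[[50, 60], [100, 100], [5000, 5059], [5061, 6000]]
print(subtract_ranges([[2, 124235]], [[2000, 3000], [400, 2500]]))
#[[2, 399], [3001, 124235]]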

Numpy aggregate into bins, then calculate sum?

I have a matrix that looks like this:
M = [[1,   200],
     [1.8, 100],
     [2,   500],
     [2.5, 300],
     [3,   400],
     [3.5, 200],
     [5,   200],
     [8,   100]]
I want to group the rows by a bin size applied to the left column, e.g. for a bin size of 2 (the first bin covers values from 0-2, the second from 2-4, the third from 4-6, etc.):
[[1,   200],
 [1.8, 100],
----
 [2,   500],
 [2.5, 300],
 [3,   400],
 [3.5, 200],
----
 [5,   200],
----
 [8,   100]]
Then output a new array with the sum of the right column for each group:
[200+100, 500+300+400+200, 200, 100]
What is an efficient way to sum each value based on the bin_size boundaries?
With pandas:
Make a DataFrame and then use integer division to define your bins:
import pandas as pd
df = pd.DataFrame(M)
df.groupby(df[0]//2)[1].sum()
#0
#0.0     300
#1.0    1400
#2.0     200
#4.0     100
#Name: 1, dtype: int64
Use .tolist() to get your desired output:
df.groupby(df[0]//2)[1].sum().tolist()
#[300, 1400, 200, 100]
With numpy.bincount:

import numpy as np
gp, vals = np.transpose(M)
gp = (gp//2).astype(int)  # bin index for each left-column value
np.bincount(gp, vals)
#array([ 300., 1400.,  200.,    0.,  100.])

Unlike the pandas version, bincount also reports the empty 6-8 bin as 0.
You can make use of np.digitize and a scipy.sparse.csr_matrix here (this assumes M is a NumPy array, so convert it first):

M = np.asarray(M)
bins = [2, 4, 6, 8, 10]
b = np.digitize(M[:, 0], bins)  # bin index for each left-column value
v = M[:, 1]

Now run a vectorized groupby using a csr_matrix:

from scipy import sparse
sparse.csr_matrix(
    (v, b, np.arange(v.shape[0] + 1)), (v.shape[0], b.max() + 1)
).sum(0)
matrix([[ 300., 1400.,  200.,    0.,  100.]])
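
This works because the CSR triple (data, indices, indptr) places v[i] in column b[i] of row i, so summing over axis 0 accumulates the amounts per bin, again including a 0 for the empty 6-8 bin.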

Counting uneven bins in pandas

df = pd.DataFrame({'email': ["a@gmail.com", "b@gmail.com", "c@gmail.com", "d@gmail.com", "e@gmail.com"],
                   'one': [88, 99, 11, 44, 33],
                   'two': [80, 80, 85, 80, 70],
                   'three': [50, 60, 70, 80, 20]})
Given this DataFrame, I would like to compute, for each of the columns one, two and three, how many values fall into certain ranges.
The ranges are for example: 0-70, 71-80, 81-90, 91-100
So the result would be:
out = pd.DataFrame({'colname': ["one", "two", "three"],
                    'b0to70': [3, 1, 4],
                    'b71to80': [0, 3, 1],
                    'b81to90': [1, 1, 0],
                    'b91to100': [1, 0, 0]})
What would be a nice idiomatic way to do this?
This would do it:
out = pd.DataFrame()
for name in ['one', 'two', 'three']:
    out[name] = pd.cut(df[name], bins=[0, 70, 80, 90, 100]).value_counts()
out.sort_index(inplace=True)
Returns:
           one  two  three
(0, 70]      3    1      4
(70, 80]     0    3      1
(80, 90]     1    1      0
(90, 100]    1    0      0
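
If you want the exact layout from the question (bin counts as columns, one row per source column), a possible finishing step, reusing the b0to70-style labels from the question:

out = out.T
out.columns = ['b0to70', 'b71to80', 'b81to90', 'b91to100']
out = out.rename_axis('colname').reset_index()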
