I want to get a count and sum of values of a column over a +/- 7 day window, after the dataframe has been grouped by a certain column.
Example data (edited to reflect my real dataset):
group | date | amount
-------------------------------------------
A | 2017-12-26 04:20:20 | 50000.0
A | 2018-01-17 00:54:15 | 60000.0
A | 2018-01-27 06:10:12 | 150000.0
A | 2018-02-01 01:15:06 | 100000.0
A | 2018-02-11 05:05:34 | 150000.0
A | 2018-03-01 11:20:04 | 150000.0
A | 2018-03-16 12:14:01 | 150000.0
A | 2018-03-23 05:15:07 | 150000.0
A | 2018-04-02 10:40:35 | 150000.0
Group by group, then sum amount over the window date - 7 days < date < date + 7 days.
Results that I want:
group | date | amount | grouped_sum
-----------------------------------------------------------
A | 2017-12-26 04:00:00 | 50000.0 | 50000.0
A | 2018-01-17 00:00:00 | 60000.0 | 60000.0
A | 2018-01-27 06:00:00 | 150000.0 | 250000.0
A | 2018-02-01 01:00:00 | 100000.0 | 250000.0
A | 2018-02-11 05:05:00 | 150000.0 | 150000.0
A | 2018-03-01 11:00:04 | 150000.0 | 150000.0
A | 2018-03-16 12:00:01 | 150000.0 | 150000.0
A | 2018-03-23 05:00:07 | 100000.0 | 100000.0
A | 2018-04-02 10:00:00 | 100000.0 | 100000.0
Quick snippet to build the dataset:
import pandas as pd

group = 9 * ['A']
date = pd.to_datetime(['2017-12-26 04:20:20', '2018-01-17 00:54:15',
                       '2018-01-27 06:10:12', '2018-02-01 01:15:06',
                       '2018-02-11 05:05:34', '2018-03-01 11:20:04',
                       '2018-03-16 12:14:01', '2018-03-23 05:15:07',
                       '2018-04-02 10:40:35'])
amount = [50000.0, 60000.0, 150000.0, 100000.0, 150000.0,
          150000.0, 150000.0, 150000.0, 150000.0]
df = pd.DataFrame({'group': group, 'date': date, 'amount': amount})
Bit of explanation (these numbers refer to the smaller toy example used in the answers below, with groups A and B and amounts 10/20/30):
The 2nd row is 40 because it sums the data for A on 2018-01-14 and 2018-01-15.
The 4th row is 30 because it sums the data for B on 2018-02-03 plus the next 7 days.
The 6th row is 30 because it sums the data for B on 2018-02-05 plus the previous 7 days.
I don't have any idea how to do a sum over a date range. I might be able to do it this way (a rough vectorized sketch of this idea follows right after the list):
1. Create extra columns that hold date - 7 and date + 7 for each row:
group | date | amount | date-7 | date+7
-------------------------------------------------------------
A | 2017-12-26 | 50000.0 | 2017-12-19 | 2018-01-02
A | 2018-01-17 | 60000.0 | 2018-01-10 | 2018-01-24
2. For each row, calculate the amount inside its date range, something like df[(df.group == 'A') & (df.date > row['date-7']) & (df.date < row['date+7'])].amount.sum()
3. But this method is quite tedious.
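A rough, untested sketch of how that idea can be vectorized: self-merge each group with itself, keep only the pairs whose dates fall in the open +/- 7 day window, then aggregate (the column names grouped_sum and grouped_count are made up here):
t = pd.Timedelta(7, unit='d')
m = df.merge(df, on='group', suffixes=('', '_other'))
m = m[(m['date_other'] > m['date'] - t) & (m['date_other'] < m['date'] + t)]
window = (m.groupby(['group', 'date'])['amount_other']
            .agg(['sum', 'size'])
            .rename(columns={'sum': 'grouped_sum', 'size': 'grouped_count'})
            .reset_index())
df = df.merge(window, on=['group', 'date'], how='left')
Note that the self-merge can get large if a single group has many rows.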
EDIT (2018-09-01):
I found the method below, based on @jezrael's answer; it works for me, but only for a single group:
t = pd.Timedelta(7, unit='d')

def g(row):
    # 'created' is the date column in my real dataset (it corresponds to 'date' in the example above)
    res = df[(df.created > row.created - t) & (df.created < row.created + t)].amount.sum()
    return res

df['new'] = df.apply(g, axis=1)
The problem here is that we need to loop over each row within each group:
t = pd.Timedelta(7, unit='d')

def f(x):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f)
print (df)
group date amount new
0 A 2018-01-01 10 10.0
1 A 2018-01-14 20 40.0
2 A 2018-01-15 20 40.0
3 B 2018-02-03 10 30.0
4 B 2018-02-04 10 30.0
5 B 2018-02-05 10 30.0
Thanks to @jpp for this improvement:
def f(x, t):
    return x.apply(lambda y: x.loc[x['date'].between(y['date'] - t,
                                                     y['date'] + t,
                                                     inclusive=False), 'amount'].sum(), axis=1)

df['new'] = df.groupby('group', group_keys=False).apply(f, pd.Timedelta(7, unit='d'))
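The question also asks for a count over the same +/- 7 day window; an untested variation of the same pattern that counts rows instead of summing (the column name cnt is arbitrary):
def f_count(x, t):
    # number of rows of the same group whose date falls in the open (date - t, date + t) window
    return x.apply(lambda y: x['date'].between(y['date'] - t,
                                               y['date'] + t,
                                               inclusive=False).sum(), axis=1)

df['cnt'] = df.groupby('group', group_keys=False).apply(f_count, pd.Timedelta(7, unit='d'))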
Verify solution:
t = pd.Timedelta(7, unit='d')
df = df[df['group'] == 'A']
def test(y):
    a = df.loc[df['date'].between(y['date'] - t, y['date'] + t, inclusive=False)]
    print (a)
    print (a['amount'])
    return a['amount'].sum()
group date amount
0 A 2018-01-01 10
0 10
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
group date amount
1 A 2018-01-14 20
2 A 2018-01-15 20
1 20
2 20
Name: amount, dtype: int64
df['new'] = df.apply(test,axis=1)
print (df)
group date amount new
0 A 2018-01-01 10 10
1 A 2018-01-14 20 40
2 A 2018-01-15 20 40
Add a column with the first day of each week:
df['week_start'] = df['date'].dt.to_period('W').apply(lambda x: x.start_time)
Result:
group date amount week_start
0 A 2018-01-01 10 2017-12-26
1 A 2018-01-14 20 2018-01-09
2 A 2018-01-15 20 2018-01-09
3 B 2018-02-03 10 2018-01-30
4 B 2018-02-04 10 2018-01-30
5 B 2018-02-05 10 2018-01-30
Group by new column and find weekly total amount:
grouped_sum = df.groupby('week_start')['amount'].sum().reset_index()
Result:
week_start amount
0 2017-12-26 10
1 2018-01-09 40
2 2018-01-30 30
Merge dataframes on week_start:
pd.merge(df.drop('amount', axis=1), grouped_sum, on='week_start').drop('week_start', axis=1)
Result:
group date amount
0 A 2018-01-01 10
1 A 2018-01-14 40
2 A 2018-01-15 40
3 B 2018-02-03 30
4 B 2018-02-04 30
5 B 2018-02-05 30
Related
I want to extract the closing balance for each week across the different dates in the dataframe below:
Date Week Balance
2017-02-12 6 50000.46
2017-02-12 6 49531.46
2017-02-12 6 48108.46
2017-05-12 19 21558.96
2017-08-12 32 21561.1
2018-02-05 6 2816.20
2018-02-06 6 78.53
2018-02-07 6 39.53
2018-08-12 32 21561.1
Expected output is:
Date Week Balance
2017-02-12 6 48108.46
2017-05-12 19 21558.96
2018-02-07 6 39.53
2018-08-12 32 21561.1
I tried to use the .last() method of groupby, but I get multiple rows back for the same week:
weekly = df.groupby(["Transaction Date",'Week']).last().Balance
weekly
Date. week Balance
2017-02-12 6 48108.46
2017-03-12 10 46802.46
2017-04-12 15 39588.46
2017-05-12 19 21558.96
2018-02-03 5 24699.73
2018-02-04 5 103.20
2018-02-05 6 2816.20
2018-02-06 6 78.53
2018-02-07 6 39.53
You can use shift to check for consecutive rows and keep the last one:
df.loc[df['Week'] != df['Week'].shift(-1)]
Output:
| | Date | Week | Balance |
|---:|:-----------|-------:|----------:|
| 2 | 2017-02-12 | 6 | 48108.46 |
| 3 | 2017-05-12 | 19 | 21558.96 |
| 4 | 2017-08-12 | 32 | 21561.10 |
| 7 | 2018-02-07 | 6 | 39.53 |
| 8 | 2018-08-12 | 32 | 21561.10 |
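An equivalent, untested sketch that makes the "keep the last row of each consecutive run" idea explicit by labelling the runs first (the name runs is arbitrary):
# a new run starts whenever the Week value changes from the previous row
runs = (df['Week'] != df['Week'].shift()).cumsum()
df.groupby(runs).last()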
I'm a bit new to Python and data processing, so sorry if this is a newbie question.
So I have a large 3D tensor(?) dataset that looks something like this:
data = [[[a], [b]], [[c], [d]] ... ]
And each 2D tensor in the dataset is connected to a timestamp i.e.
2018-09-29 05:00:00 -> [[a], [b]]
2018-09-29 06:00:00 -> [[c], [d]]
...
And each dataset, i.e. a, b, c, d, contains the same columns, e.g.:
a.head()
| val1 | val2 | val3 |
----------------------
| 1 | 3 | 2 |
| 3 | 5 | 6 |
| 4 | 1 | 3 |
...
I need to create a multivariable index, that is, a timestamp should refer to a matrix.
I've tried with:
dfs = [[[a], [b]], [[c], [d]] ... ]
dates = ['2018-09-29 05:00:00', '2018-09-29 06:00:00']
x = pd.concat(dfs, keys=pd.to_datetime(dates))
which creates an outermost index with the dates, but I have no way of reaching this index. When I list the keys with x.keys(), I only get the columns of a, b, ..., i.e. val1, val2, val3. That is, it creates this kind of table:
| val1 | val2 | val3 |
----------------------
2018-09-29 05:00:00 | 1 | 3 | 2 |
| 3 | 5 | 6 |
| 4 | 1 | 3 |
----------------------
2018-09-29 06:00:00 | 1 | 3 | 2 |
| 3 | 5 | 6 |
| 4 | 1 | 3 |
So how do I create this DateTime indexing of multivariate values effectively? How can I access the timestamp keys? Are there better ways of doing this?
Edit
I.e. how can I achieve this, as shown in the pandas reshaping guide:
a b
variable val1 val2 val3 val1 val2 val3
date
2018-09-29 05:00:00 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265
2018-09-29 06:00:00 0.469112 -1.135632 0.119209 -2.104569 0.938225 -2.271265
Not sure if this is what you want to do, but I tried to create a little toy example, as specified in your question. So we have 2D matrices referenced by timestamps:
import pandas as pd
import numpy as np
data = {
    '2018-09-29 05:00:00': np.arange(9).reshape(3, 3),
    '2018-10-29 05:00:00': np.arange(9, 18).reshape(3, 3),
    '2018-11-29 05:00:00': np.arange(18, 27).reshape(3, 3)
}
Then I just vertically stack the data and create an index like so:
matrices = []
index = []
for k, v in data.items():
    matrices.append(v)
    for _ in range(v.shape[0]):
        index.append(k)
The dataframe would look like this:
df = pd.DataFrame(np.vstack(matrices), index=index)
print(df)
# 0 1 2
# 2018-09-29 05:00:00 0 1 2
# 2018-09-29 05:00:00 3 4 5
# 2018-09-29 05:00:00 6 7 8
# 2018-10-29 05:00:00 9 10 11
# 2018-10-29 05:00:00 12 13 14
# 2018-10-29 05:00:00 15 16 17
# 2018-11-29 05:00:00 18 19 20
# 2018-11-29 05:00:00 21 22 23
# 2018-11-29 05:00:00 24 25 26
If you want the data for a specific timestamp, all you have to do is use the loc method:
print(df.loc['2018-09-29 05:00:00'])
# 0 1 2
# 2018-09-29 05:00:00 0 1 2
# 2018-09-29 05:00:00 3 4 5
# 2018-09-29 05:00:00 6 7 8
Hope this helps.
Edit:
You can convert the strings to Timestamps as well with pd.Timestamp(...) and keep querying with strings. I'm not too knowledgeable about the dos and don'ts of pandas timestamps, though.
Edit 2:
You could store objects in the cells instead and include the whole NumPy matrix as one cell entry, but then you would lose the ability to query single rows/columns of a matrix.
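Coming back to the original pd.concat attempt: the timestamps do end up in the frame, just in the outer level of the row MultiIndex rather than in x.keys(). A small untested sketch with made-up values for a and b:
import pandas as pd
import numpy as np

a = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['val1', 'val2', 'val3'])
b = a + 10
dates = pd.to_datetime(['2018-09-29 05:00:00', '2018-09-29 06:00:00'])
x = pd.concat([a, b], keys=dates)

print(x.index.get_level_values(0).unique())        # the timestamp keys
print(x.loc[pd.Timestamp('2018-09-29 05:00:00')])  # the matrix stored under that key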
I have a data set that looks like this:
Date | ID | Task | Description
2016-01-06 00:00:00 | 1 | 010 | This is text
2016-01-06 00:10:00 | 1 | 020 | This is text
2016-01-06 00:20:00 | 1 | 010 | This is text
2016-01-06 01:00:00 | 1 | 020 | This is text
2016-01-06 01:10:00 | 1 | 030 | This is text
2016-02-06 00:00:00 | 2 | 010 | This is text
2016-02-06 00:10:00 | 2 | 020 | This is text
2016-02-06 00:20:00 | 2 | 010 | This is text
2016-02-06 01:00:00 | 2 | 020 | This is text
2016-02-06 01:01:00 | 2 | 030 | This is text
Task 020 usually occurs after task 010, which means that when task 020 starts, task 010 ends; the same applies to task 020: if any other task follows it, it has stopped.
I need to group by Task calculating the average duration, total sum and count of each type of task in each ID, so I am looking for something like this:
ID | Task | Average | Sum | Count
1 | 010 | 25 | 50 | 2
1 | 020 | 10 | 20 | 2
etc | etc | etc | etc | etc
There are more task codes, but I only care about 010 and 020, so whatever is returned for the others is acceptable.
Can someone help me on how to do this in Python?
I think it's a simple .groupby() that you need. Your sample output doesn't show any complicated linking between timestamps and Task or ID.
counts = df.groupby(['ID', 'Task']).size()
will give you the count of each unique ID/Task in your data. To do a sum or average, it's similar, but you need a column with something to sum.
See here for more details.
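For example, an untested one-liner assuming a hypothetical numeric column called duration_minutes existed:
# mean, sum and count per ID/Task of a (hypothetical) numeric column
stats = df.groupby(['ID', 'Task'])['duration_minutes'].agg(['mean', 'sum', 'size'])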
It seems you need agg with groupby, but the sample has no numeric column, so a column col was added:
print (df)
Date ID Task Description col
0 2016-01-06 00:00:00 1 010 This is text 1
1 2016-01-06 00:10:00 1 020 This is text 2
2 2016-01-06 00:20:00 1 010 This is text 6
3 2016-01-06 01:00:00 1 020 This is text 1
4 2016-01-06 01:10:00 1 030 This is text 3
5 2016-02-06 00:00:00 2 010 This is text 1
6 2016-02-06 00:10:00 2 020 This is text 8
7 2016-02-06 00:20:00 2 010 This is text 9
8 2016-02-06 01:00:00 2 020 This is text 1
df = df.groupby(['ID','Task'])['col'].agg(['sum','size', 'mean']).reset_index()
print (df)
ID Task sum size mean
0 1 010 7 2 3.5
1 1 020 3 2 1.5
2 1 030 3 1 3.0
3 2 010 10 2 5.0
4 2 020 9 2 4.5
If you need to aggregate datetimes, it is a bit more complicated, because you need timedeltas:
import numpy as np

df.Date = pd.to_timedelta(df.Date).dt.total_seconds()
df = (df.groupby(['ID','Task'])['Date']
        .agg(['sum','size', 'mean']).astype(np.int64).reset_index())
df['sum'] = pd.to_timedelta(df['sum'])
df['mean'] = pd.to_timedelta(df['mean'])
print (df)
ID Task sum size mean
0 1 010 00:00:02.904078 2 00:00:01.452039
1 1 020 00:00:02.904081 2 00:00:01.452040
2 1 030 00:00:01.452042 1 00:00:01.452042
3 2 010 00:00:02.909434 2 00:00:01.454717
4 2 020 00:00:02.909437 2 00:00:01.454718
For finding the differences in the Date column:
print (df.Date.dtypes)
object
#if dtype of column is not datetime, first convert
df.Date = pd.to_datetime(df.Date )
print (df.Date.diff())
0 NaT
1 0 days 00:10:00
2 0 days 00:10:00
3 0 days 00:40:00
4 0 days 00:10:00
5 30 days 22:50:00
6 0 days 00:10:00
7 0 days 00:10:00
8 0 days 00:40:00
9 0 days 00:01:00
Name: Date, dtype: timedelta64[ns]
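Putting the pieces together for the duration part of the question, here is a hedged, untested sketch. It assumes, as described in the question, that a task ends when the next row for the same ID starts, and that Task is stored as a string; the column name duration_min is made up:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date'])
# duration of a task = time until the next row within the same ID, in minutes
df['duration_min'] = (df.groupby('ID')['Date'].shift(-1) - df['Date']).dt.total_seconds() / 60

out = (df[df['Task'].isin(['010', '020'])]
         .groupby(['ID', 'Task'])['duration_min']
         .agg(['mean', 'sum', 'size'])
         .reset_index())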
This is a sample of my dataset:
Consumer_num | billed_units
29 | 984
29 | 1244
29 | 2323
29 | 1232
29 | 1150
30 | 3222
30 | 1444
30 | 2124
I want to group by consumer_num and then spread the values (billed_units) of each group across new columns. So my required output:
Consumer_num | month 1 | month 2 | month 3 | month 4 | month 5
29 | 984 | 1244 | 2323 | 1232 | 1150
30 | 3222 | 1444 | 2124 | NaN | NaN
This is what I've done so far:
group = df.groupby('consumer_num')['billed_units'].unique()
group[group.apply(lambda x: len(x)>1)]
df = group.to_frame()
print df
Output:
Consumer_num | billed_units
29 | [984,1244,2323,1232,1150]
30 | [3222,1444,2124]
I don't know whether my approach is correct. If it's right, then I would like to know how I can separate billed_units of each consumer and then add to new columns as I've shown in my required output. Or is there a better method to achieve my required output?
solution
c = 'Consumer_num'
m = 'month {}'.format
df.set_index(
    [c, df.groupby(c).cumcount() + 1]
).billed_units.unstack().rename(columns=m).reset_index()
Consumer_num month 1 month 2 month 3 month 4 month 5
0 29 984.0 1244.0 2323.0 1232.0 1150.0
1 30 3222.0 1444.0 2124.0 NaN NaN
how it works
put 'Consumer_num' into a variable c for convenience
put the mapper function into a variable m for convenience
set the index with two columns to make a pd.MultiIndex; groupby and cumcount create the level to unstack with
then unstack
finally, use the mapper function to rename the columns
response to comments
One approach for limiting the number of months is to use iloc. The following limits us to 3 months; you can adjust it to take the first 5. The NaNs should take care of themselves.
c = 'Consumer_num'
m = 'month {}'.format
df.set_index(
[c, df.groupby(c).cumcount() + 1]
).billed_units.unstack().rename(columns=m).iloc[:, :3].reset_index()
# ^..........^
Consumer_num month 1 month 2 month 3
0 29 984.0 1244.0 2323.0
1 30 3222.0 1444.0 2124.0
Or you could pre-process
c = 'Consumer_num'
m = 'month {}'.format
d1 = df.groupby(c).head(3) # pre-process and take just first 3
d1.set_index(
    [c, d1.groupby(c).cumcount() + 1]
).billed_units.unstack().rename(columns=m).reset_index()
You could use pivot like this:
In [70]: dfm = df.assign(m=df.groupby('Consumer_num').cumcount().add(1))
In [71]: dfm.pivot('Consumer_num', 'm', 'billed_units').add_prefix('month ')
Out[71]:
m month 1 month 2 month 3 month 4 month 5
Consumer_num
29 984.0 1244.0 2323.0 1232.0 1150.0
30 3222.0 1444.0 2124.0 NaN NaN
Details
In [75]: df
Out[75]:
Consumer_num billed_units
0 29 984
1 29 1244
2 29 2323
3 29 1232
4 29 1150
5 30 3222
6 30 1444
7 30 2124
In [76]: dfm
Out[76]:
Consumer_num billed_units m
0 29 984 1
1 29 1244 2
2 29 2323 3
3 29 1232 4
4 29 1150 5
5 30 3222 1
6 30 1444 2
7 30 2124 3
I have this dataframe (type could be 1 or 2):
user_id | timestamp | type
1 | 2015-5-5 12:30 | 1
1 | 2015-5-5 14:00 | 2
1 | 2015-5-5 15:00 | 1
I want to group my data by six hours and when doing this I want to keep type as:
1 (if there is only 1 within that 6 hour frame)
2 (if there is only 2 within that 6 hour frame) or
3 (if there was both 1 and 2 within that 6 hour frame)
Here is my code:
df = df.groupby(['user_id', pd.TimeGrouper(freq=(6,'H'))]).mean()
which produces:
user_id | timestamp | type
1 | 2015-5-5 12:00 | 4
However, I want to get 3 instead of 4. How can I replace the mean() in my groupby code to produce the desired output?
Try this:
In [54]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]) \
.agg({'type':lambda x: x.unique().sum()})
Out[54]:
type
user_id timestamp
1 2015-05-05 12:00:00 3
PS: it will only work with the given types (1, 2), as their sum is 3.
Another data set:
In [56]: df
Out[56]:
user_id timestamp type
0 1 2015-05-05 12:30:00 1
1 1 2015-05-05 14:00:00 1
2 1 2015-05-05 15:00:00 1
3 1 2015-05-05 20:00:00 1
In [57]: df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type':lambda x: x.unique().sum()})
Out[57]:
type
user_id timestamp
1 2015-05-05 12:00:00 1
2015-05-05 18:00:00 1
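As the PS above notes, the unique-sum trick only works because 1 + 2 happens to equal 3. An untested, more explicit variant that does not rely on that coincidence:
def combine_types(s):
    # 3 if both types occurred in the 6 hour window, otherwise the single type present
    u = set(s)
    return 3 if u == {1, 2} else u.pop()

df.groupby(['user_id', pd.Grouper(key='timestamp', freq='6H')]).agg({'type': combine_types})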