The Excel function SUMIFS supports calculation based on multiple criteria, including date inequalities, as follows:
values_to_sum, criteria_range_1, condition_1, ..., criteria_range_n, condition_n
Example
Input - tips per person per day, multiple entries per person per day allowed
date person tip
02/03/2022 X 10
05/03/2022 X 30
05/03/2022 Y 20
08/03/2022 X 12
08/03/2022 X 8
Output - sum per selected person per day
date X_sum_per_day
01/03/2022 0
02/03/2022 10
03/03/2022 0
04/03/2022 0
05/03/2022 30
06/03/2022 0
07/03/2022 0
08/03/2022 20
09/03/2022 0
10/03/2022 0
Can this be implemented in pandas and calculated as a series for an input range of days? The cumulative version would presumably just be an application of cumsum(), but the initial sum based on multiple criteria is tricky, especially if it is to be concise.
Code
import pandas as pd
df = pd.DataFrame({'date': ['02-03-2022 00:00:00',
'05-03-2022 00:00:00',
'05-03-2022 00:00:00',
'08-03-2022 00:00:00',
'08-03-2022 00:00:00'],
'person': ['X', 'X', 'Y', 'X', 'X'],
'tip': [10, 30, 20, 12, 8]},
index = [0, 1, 2, 3, 4])
df['date'] = pd.to_datetime(df['date'], dayfirst=True)  # parse the DD-MM-YYYY strings as dates
df2 = pd.DataFrame({'date': pd.date_range(start='2022-03-01', end='2022-03-10')})
temp = df[df['person'] == 'X'].groupby('date')['tip'].sum().reset_index()
df2['X_sum'] = df2['date'].map(temp.set_index('date')['tip']).fillna(0)
The above seems kinda hacky and not as simple to reason about as Excel SUMIFS. Additional conditions would also be a hassle (e.g. sum where country = X, company = Y, person = Z).
Any idea for alternative implementation?
IIUC, you want to filter on person X, then groupby day and sum the tips, and finally reindex the missing days:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
out = (df[df['person'].eq('X')]
.groupby('date')['tip'].sum()
.reindex(pd.date_range(start='2022-03-01', end='2022-03-10'),
fill_value=0)
.reset_index()
)
output:
index tip
0 2022-03-01 0
1 2022-03-02 10
2 2022-03-03 0
3 2022-03-04 0
4 2022-03-05 30
5 2022-03-06 0
6 2022-03-07 0
7 2022-03-08 20
8 2022-03-09 0
9 2022-03-10 0
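The same pattern extends to multiple criteria, SUMIFS-style: build one boolean mask per condition and combine them with &. A minimal sketch, assuming hypothetical country and company columns that are not in the sample data:

mask = (df['person'].eq('X')
        & df['country'].eq('UK')      # hypothetical column
        & df['company'].eq('Acme'))   # hypothetical column

out = (df[mask]
       .groupby('date')['tip'].sum()
       .reindex(pd.date_range(start='2022-03-01', end='2022-03-10'),
                fill_value=0)
       .reset_index(name='X_sum'))

Each additional condition is just one more mask term, which keeps it roughly as declarative as adding another criteria_range/condition pair in SUMIFS.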
Related
I'm trying to get the number of days between two dates, but split per month.
I found some answers, but I can't figure out how to do it when the dates span two different years.
For example, I have this dataframe:
df = {'Id': ['1','2','3','4','5'],
'Item': ['A','B','C','D','E'],
'StartDate': ['2019-12-10', '2019-12-01', '2019-01-01', '2019-05-10', '2019-03-10'],
'EndDate': ['2020-01-30' ,'2020-02-02','2020-03-03','2020-03-03','2020-02-02']
}
df = pd.DataFrame(df,columns= ['Id', 'Item','StartDate','EndDate'])
And I want to get a dataframe with the number of days falling in each month, one column per month, per row.
s = (df[["StartDate", "EndDate"]]
.apply(lambda row: pd.date_range(row.StartDate, row.EndDate), axis=1)
.explode())
new = (s.groupby([s.index, s.dt.year, s.dt.month])
.count()
.unstack(level=[1, 2], fill_value=0))
new.columns = new.columns.map(lambda c: f"{c[0]}-{str(c[1]).zfill(2)}")
new = new.sort_index(axis="columns")
get all the dates in between StartDate and EndDate per row, and explode that list of dates to their own rows
group by the row id, year and month & count records
unstack the year & month identifier to be on the columns side as a multiindex
join those year & month values with a hyphen in between (also zero-fill months, e.g., 03)
lastly sort the year-month pairs on columns
to get
>>> new
2019-11 2019-12 2020-01 2020-02 2020-03
0 0 22 30 0 0
1 0 31 31 2 0
2 0 31 31 29 3
3 21 31 31 29 3
4 9 31 31 2 0
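If you also want the Id and Item columns next to the month counts, a simple join on the shared row index should do it (a sketch, assuming the default integer index has not been changed):

result = df[['Id', 'Item']].join(new)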
I have a dataframe that is being read from database records and looks like this:
date added total
2020-09-14 5 5
2020-09-15 4 9
2020-09-16 2 11
I need to be able to resample by different periods and this is what I am using:
df = pd.DataFrame.from_records(raw_data, index='date')
df.index = pd.to_datetime(df.index)
# let's say I want yearly sample, then I would do
df = df.fillna(0).resample('Y').sum()
This almost works, but it is summing the total column, which is not what I want. I need the total column to keep its value at the last sampled date within each period, like this:
# What I need
date added total
2020 11 11
# What I'm getting
date added total
2020 11 25
You can do this by resampling differently for different columns. Here you want the sum() aggregator for the added column, but max() for total.
df = pd.DataFrame({'date':[20200914, 20200915, 20200916, 20210101, 20210102],
'added':[5, 4, 2, 1, 6],
'total':[5, 9, 11, 1, 7]})
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')
df_res = df.resample('Y', on='date').agg({'added':'sum', 'total':'max'})
And the result is:
df_res
added total
date
2020-12-31 11 11
2021-12-31 7 7
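If total is a running total that only grows within each period, 'last' works just as well as 'max' as the aggregator, provided the frame is sorted by date. A variant under that assumption:

df_res = df.sort_values('date').resample('Y', on='date').agg({'added': 'sum', 'total': 'last'})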
I have a pandas dataframe df with contiguous start_date and end_date ranges and a single ref_date for each user:
users = {'user_id': ['A','A','A','A', 'B','B','B'],
'start_date': ['2017-03-07', '2017-03-12', '2017-04-04', '2017-05-22', '2018-12-01', '2018-12-23', '2018-12-29'],
'end_date': ['2017-03-11', '2017-04-03', '2017-05-21', '2222-12-31', '2018-12-22', '2018-12-28', '2222-12-31'],
'status': ['S1', 'S2', 'S1', 'S3', 'S1', 'S2', 'S1'],
'score': [1000, 1000, 1000, 1000, 900, 900, 1500],
'ref_date': ['2017-05-22', '2017-05-22', '2017-05-22', '2017-05-22', '2019-01-19', '2019-01-19', '2019-01-19']
}
df = pd.DataFrame(users, columns = ['user_id', 'start_date', 'end_date', 'status', 'score', 'ref_date'])
print(df)
user_id start_date end_date status score ref_date
0 A 2017-03-07 2017-03-11 S1 1000 2017-05-22
1 A 2017-03-12 2017-04-03 S2 1000 2017-05-22
2 A 2017-04-04 2017-05-21 S1 1000 2017-05-22
3 A 2017-05-22 2222-12-31 S3 1000 2017-05-22
4 B 2018-12-01 2018-12-22 S1 900 2019-01-19
5 B 2018-12-23 2018-12-28 S2 900 2019-01-19
6 B 2018-12-29 2222-12-31 S1 1500 2019-01-19
I would like to calculate a number of key figures per user for the last x months (x = 1, 3, 6, 12) before each ref_date; examples are:
the number of days with status S1, S2, S3 in the last x months before ref_date
the number of score increases in the last x months before ref_date
the average daily score in the last x months before ref_date
The result should look like this (I hope I did the calculations correctly):
user_id ref_date nday_s1_last3m nday_s2_last3m nday_s3_last3m \
0 A 2017-05-22 53 23 0
1 B 2019-01-19 43 6 0
ninc_score_last3m avg_score_last3m
0 0 1000.00
1 1 1157.14
The problem is that ref_date - x months could fall inside an existing start_date/end_date interval, or even before the first start_date, in which case the time "starts" on the first start_date. Resampling works, but it creates huge dataframes if one has millions of users and many date ranges; I run out of memory. Any suggestions?
A detail to note: before a ref_date means up to and including ref_date-1
I would first compute the effective start and end dates for each row: the later of start_date and ref_date minus 3 months, and the earlier of end_date and ref_date, respectively. Once this is done, the number of days, score increases and average are simple to compute:
Code could be:
import numpy as np

# convert date columns to datetimes
for col in ['start_date', 'end_date', 'ref_date']:
    df[col] = pd.to_datetime(df[col])
# compute ref_date minus 3 months
ref = df.ref_date - pd.DateOffset(months=3)
# compute the real start and end dates
tmp = df.loc[(df.end_date >= ref)&(df.start_date < df.ref_date),
['start_date', 'end_date']].copy()
tmp.loc[df.start_date < ref, 'start_date'] = ref-pd.Timedelta('1D')
tmp.loc[df.end_date >= df.ref_date, 'end_date'] = df.ref_date-pd.Timedelta('1D')
# add the relevant columns to the temp dataframe
tmp['days'] = (tmp.end_date - tmp.start_date).dt.days + 1
tmp['score'] = df.score
tmp['status'] = df.status
# build a list of result fields per user
data =[]
for i in df.user_id.unique():
# user_id, ref_date
d = [i, df.loc[df.user_id == i, 'ref_date'].iat[0]]
data.append(d)
# extract data for that user
x = tmp[df.loc[tmp.index,'user_id'] == i]
# number of days per status
d.extend(x.groupby('status')['days'].sum().reindex(df.status.unique())
.fillna(0).astype('int').tolist())
# increase and average score
d.extend((np.sum(np.where(x.score > x.score.shift(), 1, 0)),
np.average(x.score, weights=x.days)))
# build the resulting dataframe
resul = pd.DataFrame(data, columns=['user_id', 'ref_date', 'nday_s1_last3m',
'nday_s2_last3m', 'nday_s3_last3m',
'ninc_score_last3m', 'avg_score_last3m'])
It gives as expected:
user_id ref_date nday_s1_last3m nday_s2_last3m nday_s3_last3m ninc_score_last3m avg_score_last3m
0 A 2017-05-22 53 23 0 0 1000.000000
1 B 2019-01-19 43 6 0 1 1157.142857
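If the explicit Python loop over users is too slow once there are millions of user_id values, the same figures can be obtained from tmp with grouped operations instead. A sketch along those lines (it reuses the tmp frame built above; the column naming and ordering here are illustrative, not part of the original answer):

import numpy as np

# carry the user id over to the clipped intervals (aligned on the index)
tmp['user_id'] = df.loc[tmp.index, 'user_id']

# days per status: one row per user, one column per status
days_by_status = (tmp.pivot_table(index='user_id', columns='status',
                                  values='days', aggfunc='sum', fill_value=0)
                     .reindex(columns=['S1', 'S2', 'S3'], fill_value=0)
                     .add_prefix('nday_').add_suffix('_last3m'))

# score increases and day-weighted average score per user
per_user = tmp.groupby('user_id').apply(
    lambda g: pd.Series({'ninc_score_last3m': int((g['score'] > g['score'].shift()).sum()),
                         'avg_score_last3m': np.average(g['score'], weights=g['days'])}))

result = days_by_status.join(per_user).reset_index()
result.insert(1, 'ref_date', result['user_id'].map(df.groupby('user_id')['ref_date'].first()))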
I have the following dataframe:
date money
0 2018-01-01 20
1 2018-01-05 30
2 2018-02-15 7
3 2019-03-17 150
4 2018-01-05 15
...
2530 2019-03-17 350
And I need:
[(2018-01-01,20),(2018-01-05,65),(2018-02-15,72),...,(2019-03-17,572)]
So I need to do a cumulative sum of money over all days.
So far I have tried many things, and the closest I think I've got is:
graph_df.date = pd.to_datetime(graph_df.date)
temporary = graph_df.groupby('date').money.sum()
temporary = temporary.groupby(temporary.index.to_period('date')).cumsum().reset_index()
But this gives me ValueError: Invalid frequency: date
Could anyone help please?
Thanks
I don't think you need the second groupby. You can simply add a column with the cumulative sum.
This does the trick for me:
import pandas as pd
df = pd.DataFrame({'date': ['01-01-2019','04-06-2019', '07-06-2019'], 'money': [12,15,19]})
df['date'] = pd.to_datetime(df['date']) # this is not strictly needed
tmp = df.groupby('date')['money'].sum().reset_index()
tmp['money_sum'] = tmp['money'].cumsum()
Converting the date column to an actual date is not needed for this to work, as long as the date strings already sort chronologically (e.g. ISO YYYY-MM-DD); otherwise convert first, so the groupby and cumulative sum come out in date order.
You can get the per-day sums as a list of tuples in one line:
list(map(tuple, df.groupby('date', as_index=False)['money'].sum().values))
Edit:
df = pd.DataFrame({'date': ['2018-01-01', '2018-01-05', '2018-02-15', '2019-03-17', '2018-01-05'],
'money': [20, 30, 7, 150, 15]})
#df['date'] = pd.to_datetime(df['date'])
#df = df.sort_values(by='date')
temporary = df.groupby('date', as_index=False)['money'].sum()
temporary['money_cum'] = temporary['money'].cumsum()
Result:
>>> list(map(tuple, temporary[['date', 'money_cum']].values))
[('2018-01-01', 20),
('2018-01-05', 65),
('2018-02-15', 72),
('2019-03-17', 222)]
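If the date strings are not already in a format that sorts chronologically, it is safer to parse them and sort explicitly before the cumulative sum, so the running total is guaranteed to accumulate in date order. A small variation on the above:

df['date'] = pd.to_datetime(df['date'])
temporary = df.groupby('date', as_index=False)['money'].sum().sort_values('date')
temporary['money_cum'] = temporary['money'].cumsum()
list(map(tuple, temporary[['date', 'money_cum']].values))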
You can also try adding a running cumulative sum column and then taking the last cumulative value per date with groupby('date').tail(1):
example data frame:
df
date money
0 01/01/2018 20
1 05/01/2018 30
2 15/02/2018 7
3 17/03/2019 150
4 05/01/2018 15
5 17/03/2019 550
6 15/02/2018 13
df['cumsum'] = df.money.cumsum()
list(zip(df.groupby('date').tail(1)['date'], df.groupby('date').tail(1)['cumsum']))
[('01/01/2018', 20),
('05/01/2018', 222),
('17/03/2019', 772),
('15/02/2018', 785)]
When I run the following code, the results appear to add the non-business day data to the result.
Code
import pandas as pd
df = pd.DataFrame({'id': [30820864, 32295510, 30913444, 30913445],
'ticket_id': [100, 101, 102, 103],
'date_time': [
'6/1/17 9:48',
'6/2/17 13:11',
'6/3/17 13:15',
'6/5/17 13:15'],
})
df['date_time'] = pd.to_datetime(df['date_time'])
df.index = df['date_time']
x = df.resample('B').count()
print(x)
Result
id ticket_id date_time
date_time
2017-06-01 1 1 1
2017-06-02 2 2 2
2017-06-05 1 1 1
I would expect that the count for 2017-06-02 would be 1 and not 2. Shouldn't the data from a non-business day (6/3/17) be ignored?
This seems to be standard behaviour: events on weekends are grouped with the preceding Friday (similar posts note that this is the convention).
One solution: drop the weekend rows, then resample as before:
df = df[df['date_time'].dt.weekday < 5]  # keep Monday (0) through Friday (4)
x = df.resample('B').count()
print(x)
Output:
date_time id ticket_id
date_time
2017-06-01 1 1 1
2017-06-02 1 1 1
2017-06-05 1 1 1
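Alternatively, if you would rather drop weekend data entirely than have it folded into Friday: count per calendar day and then reindex to business days only. A sketch, starting again from the original df in the question (which is already indexed by date_time, before any weekend rows are removed):

daily = df.resample('D').count()
bdays = pd.date_range(daily.index.min(), daily.index.max(), freq='B')
x = daily.reindex(bdays, fill_value=0)  # weekend rows disappear instead of being added to Friday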