I have a dataframe like this
df
order_date amount
0 2015-10-02 1
1 2015-12-21 15
2 2015-12-24 3
3 2015-12-26 4
4 2015-12-27 5
5 2015-12-28 10
I would like to sum df["amount"] over the range from each row's df["order_date"] to df["order_date"] + 6 days
order_date amount sum
0 2015-10-02 1 1
1 2015-12-21 15 27 //comes from 15 + 3 + 4 + 5
2 2015-12-24 3 22 //comes from 3 + 4 + 5 + 10
3 2015-12-26 4 19
4 2015-12-27 5 15
5 2015-12-28 10 10
The data type of order_date is datetime.
I have tried to use iloc but it did not work well.
If anyone has any idea/example of how to work on this,
please kindly let me know.
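For reference, a minimal snippet (not part of the original post) to reconstruct the example dataframe above:
import pandas as pd

df = pd.DataFrame({
    'order_date': pd.to_datetime(['2015-10-02', '2015-12-21', '2015-12-24',
                                  '2015-12-26', '2015-12-27', '2015-12-28']),
    'amount': [1, 15, 3, 4, 5, 10],
})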
If pandas rolling allowed a left-aligned window (the default is right-aligned), the answer would be a simple one-liner: df.set_index('order_date').amount.rolling('7d', min_periods=1, align='left').sum(). However, forward-looking windows have not been implemented yet (i.e. rolling does not accept an align parameter), so the trick I came up with is to "reverse" the dates temporarily. Solution:
df.index = pd.to_datetime(pd.Timestamp.now() - df.order_date)
df['sum'] = df.sort_index().amount.rolling('7d',min_periods=1).sum()
df.reset_index(drop=True)
Output:
order_date amount sum
0 2015-10-02 1 1.0
1 2015-12-21 15 27.0
2 2015-12-24 3 22.0
3 2015-12-26 4 19.0
4 2015-12-27 5 15.0
5 2015-12-28 10 10.0
Expanding on my comment:
from datetime import timedelta
df['sum'] = 0
for i in range(len(df)):
    dt1 = df['order_date'][i]
    dt2 = dt1 + timedelta(days=6)
    # write back with .loc to avoid chained-assignment issues
    df.loc[i, 'sum'] = df['amount'][(df['order_date'] >= dt1) & (df['order_date'] <= dt2)].sum()
There's probably a much better way to do this but it works...
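A shorter, hedged variant of the same idea (a sketch of my own, not from the original answer) uses a list comprehension over the dates instead of writing into the frame row by row:
from datetime import timedelta

df['sum'] = [
    df.loc[df['order_date'].between(d, d + timedelta(days=6)), 'amount'].sum()
    for d in df['order_date']
]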
Here is my way of approaching this problem. It works, though I believe there should be a much better way to do it.
import pandas as pd
df['order_date'] = pd.to_datetime(df['order_date'])
Temp = pd.DataFrame(pd.date_range(start='2015-10-02', end='2017-01-01'), columns=['STDate'])
Temp = Temp.merge(df, left_on='STDate', right_on='order_date', how='left')
Temp['amount'] = Temp['amount'].fillna(0)
Temp.sort_values('STDate', ascending=False, inplace=True)
Temp['rolls'] = Temp['amount'].rolling(window=7, min_periods=0).sum()
Temp.loc[Temp.STDate.isin(df.order_date), :].sort_values('STDate', ascending=True)
STDate order_date amount rolls
0 2015-10-02 2015-10-02 1.0 1.0
80 2015-12-21 2015-12-21 15.0 27.0
83 2015-12-24 2015-12-24 3.0 22.0
85 2015-12-26 2015-12-26 4.0 19.0
86 2015-12-27 2015-12-27 5.0 15.0
87 2015-12-28 2015-12-28 10.0 10.0
Set order_date to be a DatetimeIndex, so that you can use df.loc[time1:time2] to get the rows in that time range, then take the amount column and sum it.
You can try with:
from datetime import timedelta
df = pd.read_fwf('test2.csv')
df.order_date = pd.to_datetime(df.order_date)
df = df.set_index(pd.DatetimeIndex(df['order_date']))
sum_list = list()
for i in range(len(df)):
    start = df['order_date'].iloc[i]
    end = start + timedelta(days=6)
    sum_list.append(df.loc[start:end, 'amount'].sum())
df['sum'] = sum_list
df
Output:
order_date amount sum
2015-10-02 2015-10-02 1 1
2015-12-21 2015-12-21 15 27
2015-12-24 2015-12-24 3 22
2015-12-26 2015-12-26 4 19
2015-12-27 2015-12-27 5 15
2015-12-28 2015-12-28 10 10
import datetime
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')
df.set_index(['order_date'], inplace=True)
# Sum rows within the range of six days in the future
d = {t: df[(df.index >= t) & (df.index <= t + datetime.timedelta(days=6))]['amount'].sum()
for t in df.index}
# Assign the summed values back to the dataframe
df['amount_sum'] = [d[t] for t in df.index]
df is now:
amount amount_sum
order_date
2015-10-02 1.0 1.0
2015-12-21 15.0 27.0
2015-12-24 3.0 22.0
2015-12-26 4.0 19.0
2015-12-27 5.0 15.0
2015-12-28 10.0 10.0
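If order_date is wanted back as a regular column to match the layout in the question, a plain reset_index afterwards should be enough (a small follow-up of mine, not part of the original answer):
df = df.reset_index()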
Related
I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA':['A','A','A','A','A','B','B','B','B'],
'Date':['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020','20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
'Mark':[1,2,3,4,5,1,2,3,3]
})
print(df)
Based on this data frame, I wanted the Mark from the previous year. Using .max() I managed to acquire the maximum Mark per COTA, but I wanted the last one; I thought I could get it with .last(), but that didn't work.
Here is an example of my code.
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)
s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print (df)
COTA Date Mark LastYear Max_MarkLastYear
0 A 2021-10-14 1 2021-12-31 5.0
1 A 2020-10-19 2 2020-12-31 3.0
2 A 2019-10-29 3 2019-12-31 NaN
3 A 2021-09-30 4 2021-12-31 5.0
4 A 2020-09-20 5 2020-12-31 3.0
5 B 2021-10-20 1 2021-12-31 3.0
6 B 2020-10-29 2 2020-12-31 3.0
7 B 2019-10-15 3 2019-12-31 NaN
8 B 2020-10-09 3 2020-12-31 3.0
How do I create a new column with the last value of the previous year?
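A hedged sketch of one way to get the last Mark of the previous year rather than the max: sort by Date first so that .last() picks the chronologically latest value per (COTA, LastYear) group. The Last_MarkLastYear column name and the dayfirst flag are my own additions, not from the original post.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # dates are dd/mm/yyyy
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)

# .last() instead of .max(): take the chronologically latest Mark per (COTA, year)
s1 = (df.sort_values('Date')
        .groupby(['COTA', 'LastYear'])['Mark']
        .last())
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Last_MarkLastYear'), on=['COTA', 'LastYear'])
print(df)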
With some help from the community (see my previous question on building the function) I have managed to get to the function below. I am trying to work out how to get the resampled dates to run to the latest date that appears anywhere in either of the input data sets, for any code. Below I have included the current output I am getting and my desired output.
Input data:
Input 1 df1 - In
date code qty
0 2019-01-10 A 20
1 2019-01-10 B 12
2 2019-01-10 C 10
3 2019-01-11 A 2
4 2019-01-11 B 30
5 2019-01-11 C 2
7 2019-01-12 A 4
8 2019-01-12 B 6
11 2019-01-13 A 10
12 2019-01-13 B 12
13 2019-01-13 C 1
Input 2 df2 - Outbound
date code qty
0 2019-01-11 A 5
1 2019-01-11 B 1
2 2019-01-11 C 3
3 2019-01-12 A 100
6 2019-01-13 B 1
7 2019-01-13 C 1
8 2019-01-15 A 1
9 2019-01-16 B 1
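For reference, a convenience snippet (not part of the original post) to reconstruct the two inputs with datetime dates:
import pandas as pd

df1 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-10', '2019-01-10', '2019-01-10',
                            '2019-01-11', '2019-01-11', '2019-01-11',
                            '2019-01-12', '2019-01-12',
                            '2019-01-13', '2019-01-13', '2019-01-13']),
    'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
    'qty':  [20, 12, 10, 2, 30, 2, 4, 6, 10, 12, 1],
})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-11', '2019-01-11', '2019-01-11',
                            '2019-01-12', '2019-01-13', '2019-01-13',
                            '2019-01-15', '2019-01-16']),
    'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'qty':  [5, 1, 3, 100, 1, 1, 1, 1],
})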
Existing Code:
import numpy as np
from numba import njit

@njit
def poscumsum(x):
    # running sum that is floored at zero
    total = 0
    result = np.empty(x.shape)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0
        result[i] = total
    return result
a = df1.set_index(['code', 'date'])
b = df2.set_index(['code', 'date'])
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0))
df3 = df3.groupby('code').resample('D', level='date').sum()
df3['qty'] = df3.groupby('code')['qty'].transform(
    lambda g: poscumsum(g.values))
Current Output
Each code is only represented for dates on which it appears in the In or Outbound dfs.
code date qty
0 A 2019-01-10 20
1 A 2019-01-11 17
2 A 2019-01-12 0
3 A 2019-01-13 10
4 A 2019-01-14 10
5 A 2019-01-15 9
6 B 2019-01-10 12
7 B 2019-01-11 41
8 B 2019-01-12 47
9 B 2019-01-13 58
10 B 2019-01-14 58
11 B 2019-01-15 58
12 B 2019-01-16 57
13 C 2019-01-10 10
14 C 2019-01-11 9
15 C 2019-01-12 9
16 C 2019-01-13 9
Desired Output:
Each code is represented for every date between 2019-01-10 and 2019-01-16.
code date qty
0 A 2019-01-10 20
1 A 2019-01-11 17
2 A 2019-01-12 0
3 A 2019-01-13 10
4 A 2019-01-14 10
5 A 2019-01-15 9
6 A 2019-01-16 9
7 B 2019-01-10 12
8 B 2019-01-11 41
9 B 2019-01-12 47
10 B 2019-01-13 58
11 B 2019-01-14 58
12 B 2019-01-15 58
13 B 2019-01-16 57
14 C 2019-01-10 10
15 C 2019-01-11 9
16 C 2019-01-12 9
17 C 2019-01-13 9
18 C 2019-01-14 9
19 C 2019-01-15 9
20 C 2019-01-16 9
Ok, here is a 2D version of poscumsum (and generalized to cap the running sum at min and/or max):
@njit
def cumsum_capped_2d(x, xmin=None, xmax=None):
    # running column-wise sum, optionally capped at xmin and/or xmax
    n, m = x.shape
    result = np.empty_like(x)
    if n == 0:
        return result
    total = np.zeros_like(x[0])
    for i in range(n):
        total += x[i]
        if xmin is not None:
            total[total < xmin] = xmin
        if xmax is not None:
            total[total > xmax] = xmax
        result[i] = total
    return result
And here is how to use it (now that you want all dates spanning the same period); the good news is that there is no more groupby (so it is faster than ever):
a = df1.pivot(index='date', columns='code', values='qty')
b = df2.pivot(index='date', columns='code', values='qty')
idx = a.index.union(b.index).sort_values()
# fillna(0) so that a code missing on a date counts as 0 instead of turning the difference into NaN
df3 = (a.reindex(idx).fillna(0) - b.reindex(idx).fillna(0)).resample('D').sum()
df3.values[:, :] = cumsum_capped_2d(df3.values, xmin=0)
Or, in two (convoluted) lines:
df3 = (df1.set_index(['date', 'code']).subtract(df2.set_index(['date', 'code']), fill_value=0)
          .unstack('code', fill_value=0).resample('D').sum())
df3.values[:, :] = cumsum_capped_2d(df3.values, xmin=0)
On your data:
>>> df3
code A B C
date
2019-01-10 20.0 12.0 10.0
2019-01-11 17.0 41.0 9.0
2019-01-12 0.0 47.0 9.0
2019-01-13 10.0 58.0 9.0
2019-01-14 10.0 58.0 9.0
2019-01-15 9.0 58.0 9.0
2019-01-16 9.0 57.0 9.0
Of course, you are free to stack back into a skinny df, re-order, drop the index, etc. For example, to match your desired output:
>>> df3.stack().swaplevel(0,1).sort_index().reset_index()
code date qty
0 A 2019-01-10 20.0
1 A 2019-01-11 17.0
2 A 2019-01-12 0.0
3 A 2019-01-13 10.0
4 A 2019-01-14 10.0
5 A 2019-01-15 9.0
6 A 2019-01-16 9.0
7 B 2019-01-10 12.0
8 B 2019-01-11 41.0
9 B 2019-01-12 47.0
10 B 2019-01-13 58.0
11 B 2019-01-14 58.0
12 B 2019-01-15 58.0
13 B 2019-01-16 57.0
14 C 2019-01-10 10.0
15 C 2019-01-11 9.0
16 C 2019-01-12 9.0
17 C 2019-01-13 9.0
18 C 2019-01-14 9.0
19 C 2019-01-15 9.0
20 C 2019-01-16 9.0
Here is another approach using reindex. You can generate a date_range called dates covering every day across all groups. Then get the unique codes and create a multi-index to reindex by with pd.MultiIndex.from_product(). Then reindex and forward fill with ffill():
d = pd.to_datetime(df3.index.get_level_values(1))
dates = pd.date_range(d.min(), d.max(), freq= '1d')
codes = df3.index.get_level_values(0).unique()
idx = pd.MultiIndex.from_product([codes, dates], names=['code', 'date'])
df3 = df3.reindex(idx).reset_index().ffill()
Full code and output:
# original code
import numpy as np
from numba import njit

@njit
def poscumsum(x):
    total = 0
    result = np.empty(x.shape)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0
        result[i] = total
    return result
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
a = df1.set_index(['code', 'date'])
b = df2.set_index(['code', 'date'])
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0))
df3 = df3.groupby('code').resample('D', level='date').sum()
df3['qty'] = df3.groupby('code')['qty'].transform(
    lambda g: poscumsum(g.values))
#code I added
d = pd.to_datetime(df3.index.get_level_values(1))
dates = pd.date_range(d.min(), d.max(), freq= '1d')
codes = df3.index.get_level_values(0).unique()
idx = pd.MultiIndex.from_product([codes, dates], names=['code', 'date'])
df3 = df3.reindex(idx).reset_index().ffill()
df3
Out[1]:
code date qty
0 A 2019-01-10 20.0
1 A 2019-01-11 17.0
2 A 2019-01-12 0.0
3 A 2019-01-13 10.0
4 A 2019-01-14 10.0
5 A 2019-01-15 9.0
6 A 2019-01-16 9.0
7 B 2019-01-10 12.0
8 B 2019-01-11 41.0
9 B 2019-01-12 47.0
10 B 2019-01-13 58.0
11 B 2019-01-14 58.0
12 B 2019-01-15 58.0
13 B 2019-01-16 57.0
14 C 2019-01-10 10.0
15 C 2019-01-11 9.0
16 C 2019-01-12 9.0
17 C 2019-01-13 9.0
18 C 2019-01-14 9.0
19 C 2019-01-15 9.0
20 C 2019-01-16 9.0
This isn't a duplicate; I already referred to post_1 and post_2.
My question is different and not about the agg function. It is about displaying the grouped-by column as well during the ffill operation. Though the code works fine, I am sharing the full code so you can get an idea; the problem is in the commented line, so look out for that line below.
I have a dataframe as given below:
df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,2,2,2,2,2],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05 12:59:00',
               '2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06 13:39:00',
               '2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09 22:00:00',
               '2173-04-11 04:00:00','2173-04-13 04:30:00','2173-04-14 08:00:00'],
    'val': [5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
What this code (written with the help of Jezrael from the forum) does is add missing dates based on a threshold value. The only issue is that I don't see the grouped-by column in the output.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
         .groupby('subject_id')
         .resample('d')
         .last()
         .index
         .to_frame(index=False))
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
df2 = df2.groupby(df2['subject_id']).ffill()  # here is the problem
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
As shown in the code above, I tried the approaches below:
df2 = df2.groupby(df2['subject_id']).ffill() # doesn't help
df2 = df2.groupby(df2['subject_id']).ffill().reset_index() # doesn't help
df2 = df2.groupby('subject_id',as_index=False).ffill() # doesn't help
The output is incorrect because it is missing subject_id.
I expect my output to have the subject_id column as well.
Here are 2 possible solutions. First, specify all columns in a list after groupby and assign back:
cols = df2.columns.difference(['subject_id'])
df2[cols] = df2.groupby('subject_id')[cols].ffill()
Or create an index from the subject_id column and group by the index:
# newer pandas versions
df2 = df2.set_index('subject_id').groupby('subject_id').ffill().reset_index()
# older pandas versions
df2 = df2.set_index('subject_id').groupby(level=0).ffill().reset_index()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day month count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 4.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 4.0 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 4.0 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 4.0 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 5.0 1.0
33 1 2173-05-05 2173-05-05 13:37:00 1 5 5.0 1.0
95 1 2173-07-06 2173-07-06 13:39:00 6 6 7.0 1.0
96 1 2173-07-07 2173-07-07 13:39:00 6 7 7.0 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 7.0 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 4.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8 9 4.0 NaN
100 2 2173-04-10 2173-04-10 22:00:00 8 10 4.0 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 4.0 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 4.0 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 4.0 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 4.0 1.0
I have the code below:
import pandas as pd
import datetime
df=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
df["date"]=pd.to_datetime(df["date"])
df['date'] = df.date.apply(lambda x: datetime.datetime.strftime(x,'%b')) # SHOWS date as MONTH
pvt_enroll=df.pivot_table(index='site', columns="date", values = 'baseline', aggfunc = {'baseline' : 'count'}, fill_value=0, margins=True) # Pivot_Table with enrollment by SITE by MONTH
pvt_enroll.to_csv("pivot_test.csv")
table_enroll_site_month = pd.read_csv('pivot_test.csv', encoding='latin-1')
table_enroll_site_month.rename(columns={'site':'Study Site'}, inplace=True)
table_enroll_site_month
Study Site Apr Jul Jun May All
0 A 5.0 0.0 8.0 4.0 17.0
1 B 9.0 0.0 11.0 5.0 25.0
2 C 6.0 1.0 3.0 20.0 30.0
3 D 5.0 0.0 3.0 2.0 10.0
4 E 5.0 0.0 5.0 0.0 10.0
5 All 30.0 1.0 30.0 31.0 92.0
And I wonder how to:
1. Display months with the year, as
Apr16 Jul16 Jun16 May16
2. Is it possible to get the same table without running this step (pvt_enroll.to_csv("pivot_test.csv"))? I mean, can I get the same result without needing to save to a .csv file first?
I think that by using %b%y you can get the 'Apr16' etc. format.
I tried it with the following code, without saving to .csv.
import pandas as pd
from datetime import datetime
df=pd.read_csv("demo.csv")
df["date"]=pd.to_datetime(df["date"])
df['date'] = df['date'].apply(lambda x: datetime.strftime(x,'%b%y'))
pvt_enroll=df.pivot_table(index='site', columns="date", values = 'baseline', aggfunc = {'baseline' : 'count'}, fill_value=0, margins=True) # Pivot_Table with enrollment by SITE by MONTH
pvt_enroll.reset_index(inplace=True)
pvt_enroll.rename(columns={'site':'Study Site'}, inplace=True)
print(pvt_enroll)
And I got the output as follows
date Study Site Apr16 Jul16 Jun16 May16 All
0 A 5 0 8 4 17
1 B 9 0 11 5 25
2 C 6 1 3 20 30
3 D 5 0 3 2 10
4 E 5 0 5 0 10
5 All 30 1 30 31 92
For example (input pandas dataframe):
start_date end_date value
0 2018-05-17 2018-05-20 4
1 2018-05-22 2018-05-27 12
2 2018-05-14 2018-05-21 8
I want it to divide the value by the number of days in each row's interval (e.g. 2018-05-22 to 2018-05-27 has 6 days, so 12 / 6 = 2) and then create time series data like the following:
date value
0 2018-05-14 1
1 2018-05-15 1
2 2018-05-16 1
3 2018-05-17 2
4 2018-05-18 2
5 2018-05-19 2
6 2018-05-20 2
7 2018-05-21 1
8 2018-05-22 2
9 2018-05-23 2
10 2018-05-24 2
11 2018-05-25 2
12 2018-05-26 2
13 2018-05-27 2
Is this possible to do in pandas without an inefficient loop through every row? Is there also a name for this method?
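For reference, a minimal snippet (not part of the original question) to reconstruct the input frame:
import pandas as pd

df = pd.DataFrame({
    'start_date': pd.to_datetime(['2018-05-17', '2018-05-22', '2018-05-14']),
    'end_date':   pd.to_datetime(['2018-05-20', '2018-05-27', '2018-05-21']),
    'value': [4, 12, 8],
})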
You can use:
#convert to datetimes if necessary
df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])
For each row, generate a Series indexed by date_range, divide each Series by its own length, then aggregate by groupby with sum:
dfs = [pd.Series(r.value, pd.date_range(r.start_date, r.end_date)) for r in df.itertuples()]
df = (pd.concat([x / len(x) for x in dfs])
        .groupby(level=0)
        .sum()
        .rename_axis('date')
        .reset_index(name='val'))
print (df)
date val
0 2018-05-14 1.0
1 2018-05-15 1.0
2 2018-05-16 1.0
3 2018-05-17 2.0
4 2018-05-18 2.0
5 2018-05-19 2.0
6 2018-05-20 2.0
7 2018-05-21 1.0
8 2018-05-22 2.0
9 2018-05-23 2.0
10 2018-05-24 2.0
11 2018-05-25 2.0
12 2018-05-26 2.0
13 2018-05-27 2.0
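A hedged alternative sketch of my own (assuming pandas 0.25+ for DataFrame.explode, and not part of the original answer): build each row's date range, explode it to one row per day, spread the value over the interval length, and sum per day.
import pandas as pd

df['days'] = (df['end_date'] - df['start_date']).dt.days + 1
df['date'] = [list(pd.date_range(s, e)) for s, e in zip(df['start_date'], df['end_date'])]
out = (df.explode('date')
         .assign(val=lambda d: d['value'] / d['days'])   # each day gets value / interval length
         .groupby('date', as_index=False)['val'].sum())
print(out)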