Pandas dataframe Groupby and retrieve date range - python

Here is my dataframe that I am working on. There are two pay periods defined:
first 15 days and last 15 days for each month.
date employee_id hours_worked id job_group report_id
0 2016-11-14 2 7.50 385 B 43
1 2016-11-15 2 4.00 386 B 43
2 2016-11-30 2 4.00 387 B 43
3 2016-11-01 3 11.50 388 A 43
4 2016-11-15 3 6.00 389 A 43
5 2016-11-16 3 3.00 390 A 43
6 2016-11-30 3 6.00 391 A 43
I need to group by employee_id and job_group but at the same time
I have to achieve date range for that grouped row.
For example grouped results would be like following for employee_id 1:
Expected Output:
date employee_id hours_worked job_group report_id
1 2016-11-15 2 11.50 B 43
2 2016-11-30 2 4.00 B 43
4 2016-11-15 3 17.50 A 43
5 2016-11-16 3 9.00 A 43
Is this possible using pandas dataframe groupby?

Use SM with Grouper and last add SemiMonthEnd:
df['date'] = pd.to_datetime(df['date'])
d = {'hours_worked':'sum','report_id':'first'}
df = (df.groupby(['employee_id','job_group',pd.Grouper(freq='SM',key='date', closed='right')])
.agg(d)
.reset_index())
df['date'] = df['date'] + pd.offsets.SemiMonthEnd(1)
print (df)
employee_id job_group date hours_worked report_id
0 2 B 2016-11-15 11.5 43
1 2 B 2016-11-30 4.0 43
2 3 A 2016-11-15 17.5 43
3 3 A 2016-11-30 9.0 43

a. First, (for each employee_id) use multiple Grouper with the .sum() on the hours_worked column. Second, use DateOffset to achieve bi-weekly date column. After these 2 steps, I have assigned the date in the grouped DF based on 2 brackets (date ranges) - if day of month (from the date column) is <=15, then I set the day in date to 15, else I set the day to 30. This day is then used to assemble a new date. I calculated month end day based on 1, 2.
b. (For each employee_id) get the .last() record for the job_group and report_id columns
c. merge a. and b. on the employee_id key
# a.
hours = (df.groupby([
pd.Grouper(key='employee_id'),
pd.Grouper(key='date', freq='SM')
])['hours_worked']
.sum()
.reset_index())
hours['date'] = pd.to_datetime(hours['date'])
hours['date'] = hours['date'] + pd.DateOffset(days=14)
# Assign day based on bracket (date range) 0-15 or bracket (date range) >15
from pandas.tseries.offsets import MonthEnd
hours['bracket'] = hours['date'] + MonthEnd(0)
hours['bracket'] = pd.to_datetime(hours['bracket']).dt.day
hours.loc[hours['date'].dt.day <= 15, 'bracket'] = 15
hours['date'] = pd.to_datetime(dict(year=hours['date'].dt.year,
month=hours['date'].dt.month,
day=hours['bracket']))
hours.drop('bracket', axis=1, inplace=True)
# b.
others = (df.groupby('employee_id')['job_group','report_id']
.last()
.reset_index())
# c.
merged = hours.merge(others, how='inner', on='employee_id')
Raw data for employee_id==1 and employeeid==3
df.sort_values(by=['employee_id','date'], inplace=True)
print(df[df.employee_id.isin([1,3])])
index date employee_id hours_worked id job_group report_id
0 0 2016-11-14 1 7.5 481 A 43
10 10 2016-11-21 1 6.0 491 A 43
11 11 2016-11-22 1 5.0 492 A 43
15 15 2016-12-14 1 7.5 496 A 43
25 25 2016-12-21 1 6.0 506 A 43
26 26 2016-12-22 1 5.0 507 A 43
6 6 2016-11-02 3 6.0 487 A 43
4 4 2016-11-08 3 6.0 485 A 43
3 3 2016-11-09 3 11.5 484 A 43
5 5 2016-11-11 3 3.0 486 A 43
20 20 2016-11-12 3 3.0 501 A 43
21 21 2016-12-02 3 6.0 502 A 43
19 19 2016-12-08 3 6.0 500 A 43
18 18 2016-12-09 3 11.5 499 A 43
Output
print(merged)
employee_id date hours_worked job_group report_id
0 1 2016-11-15 7.5 A 43
1 1 2016-11-30 11.0 A 43
2 1 2016-12-15 7.5 A 43
3 1 2016-12-31 11.0 A 43
4 2 2016-11-15 31.0 B 43
5 2 2016-12-15 31.0 B 43
6 3 2016-11-15 29.5 A 43
7 3 2016-12-15 23.5 A 43
8 4 2015-03-15 5.0 B 43
9 4 2016-02-29 5.0 B 43
10 4 2016-11-15 5.0 B 43
11 4 2016-11-30 15.0 B 43
12 4 2016-12-15 5.0 B 43
13 4 2016-12-31 15.0 B 43

Related

Putting Na on several columns according to conditions on certain columns

I have a pandas dataframe with amount of bills columns,with dates and ids associated with those amounts. I would like to set the columns to Na when the date is less than 2016-12-31 and their associated id and amount. Here is an example
ID customer
Bill1
Date 1
ID Bill 1
Bill2
Date 2
ID Bill 2
Bill3
Date3
ID Bill 3
Gender
Age
4
6
2000-10-04
1
45
2000-11-05
2
51
1999-12-05
8
M
25
6
8
2016-05-03
7
39
2017-08-09
8
38
2018-07-14
17
W
54
12
14
2016-11-16
10
73
2017-05-04
15
14
2017-07-04
35
M
68
And I would like to get this:
ID customer
Bill1
Date 1
ID Bill 1
Bill2
Date 2
ID Bill 2
Bill3
Date 3
ID Bill 3
Gender
Age
4
Nan
Nan
Nan
Nan
Nan
Nan
Nan
Nan
Nan
M
25
6
Nan
Nan
Nan
39
2017-08-09
8
38
2018-07-14
17
W
54
12
Nan
Nan
Nan
73
2017-05-04
15
14
2017-07-04
35
M
68
One option is to create a MultiIndex.from_frame based on the values extracted str.extractall:
new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns # Save For Later
new_df.columns = pd.MultiIndex.from_frame(
new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 6 2000-10-04 1 45 2000-11-05 2 51 1999-12-05 8
6 W 54 8 2016-05-03 7 39 2017-08-09 8 38 2018-07-14 17
12 M 68 14 2016-11-16 10 73 2017-05-04 15 14 2017-07-04 35
Then mask on the Date column (in level 0) where dates are less than the threshold:
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
0 Bill Date ID Bill Bill Date ID Bill Bill Date ID Bill
1 1 1 1 2 2 2 3 3 3
ID customer Gender Age
4 M 25 NaN NaT NaN NaN NaT NaN NaN NaT NaN
6 W 54 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0
12 M 68 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0
Lastly, restore columns and order:
new_df.columns = orig_cols # Restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
ID customer Bill1 Date 1 ID Bill 1 Bill2 Date 2 ID Bill 2 Bill3 Date3 ID Bill 3 Gender Age
0 4 NaN NaT NaN NaN NaT NaN NaN NaT NaN M 25
1 6 NaN NaT NaN 39.0 2017-08-09 8.0 38.0 2018-07-14 17.0 W 54
2 12 NaN NaT NaN 73.0 2017-05-04 15.0 14.0 2017-07-04 35.0 M 68
All Together:
(ensure Date Columns are DateTime)
df['Date 1'] = pd.to_datetime(df['Date 1'])
df['Date 2'] = pd.to_datetime(df['Date 2'])
df['Date3'] = pd.to_datetime(df['Date3'])
new_df = df.set_index(['ID customer', 'Gender', 'Age'])
orig_cols = new_df.columns # Save For Later
new_df.columns = pd.MultiIndex.from_frame(
new_df.columns.str.extractall(r'(.*?)(?:\s+)?(\d+)')
)
new_df = new_df.mask(new_df['Date'].lt(pd.to_datetime('2016-12-31')))
new_df.columns = orig_cols # Restore from "save"
new_df = new_df.reset_index().reindex(columns=df.columns)
Another way:
#Assumption Your dates are of dtype datetime[ns]
c=~df.filter(like='Date').lt(pd.to_datetime('2016-12-31'))
c=pd.DataFrame(c.values.repeat(3,1),columns=df.columns[1:10])
Finally:
out=df[df.columns[1:10]]
out=out[c].join(df[['ID customer','Gender','Age']])
Now If you print out you will get your desired output

How can I get the difference of the values of the rows in a dataframe? (for customer code)

My dataset has Customer_Code, As_Of_Date and 24 products. The products have a value of 0 -1. I ordered the data set by customer code and as_of_date. I want to subtract from the next row in the products to the previous row. The important thing here is to get each customer out according to their as_of_date.
I try
df2.set_index('Customer_Code').diff()
and
df2.set_index('As_Of_Date').diff()
and
for i in new["Customer_Code"].unique():
df14 = df12.set_index('As_Of_Date').diff()
but is not true. My code is true for first customer but it is not true for second customer.
How I can do?
You didn't share any data so I made up something that you may use. Your expected outcome also lacks. For further reference, please do not share images. Let's say you have this data:
id date product
0 12 2008-01-01 1
1 12 2008-01-01 2
2 12 2008-01-01 1
3 12 2008-01-02 4
4 12 2008-01-02 5
5 34 2009-01-01 6
6 34 2009-01-01 7
7 34 2009-01-01 84
8 34 2009-01-02 4
9 34 2009-01-02 3
10 34 2009-01-02 3
11 34 2009-01-03 5
12 34 2009-01-03 6
13 34 2009-01-03 8
As I understand it, you want to substract the product value from the previous row, grouped by id and date. (if any other group, adapt). You then need to do this:
mask = df.duplicated(['id', 'date'])
df['product_diff'] = (np.where(mask, (df['product'] - df['product'].shift(1)), np.nan))
which returns:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -1.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 1.0
7 34 2009-01-01 84 77.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 -1.0
10 34 2009-01-02 3 0.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 1.0
13 34 2009-01-03 8 2.0
or if you want it the other way around:
mask = df.duplicated(['id', 'date'])
df['product_diff'] = (np.where(mask, (df['product'] - df['product'].shift(-1)), np.nan))
which gives:
id date product product_diff
0 12 2008-01-01 1 NaN
1 12 2008-01-01 2 1.0
2 12 2008-01-01 1 -3.0
3 12 2008-01-02 4 NaN
4 12 2008-01-02 5 -1.0
5 34 2009-01-01 6 NaN
6 34 2009-01-01 7 -77.0
7 34 2009-01-01 84 80.0
8 34 2009-01-02 4 NaN
9 34 2009-01-02 3 0.0
10 34 2009-01-02 3 -2.0
11 34 2009-01-03 5 NaN
12 34 2009-01-03 6 -2.0
13 34 2009-01-03 8 NaN

Split date range rows into years (ungroup) - Python Pandas

I have a dataframe like this:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2023 1 2
.......
I would like to split the rows where end - start > 1 year (see last row where end=2023 and start = 2020), keeping the same value for column A, while splitting proportionally the value in column B:
Start date end date A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
01.01.2020 31.12.2020 1 2/4
01.01.2021 31.12.2021 1 2/4
01.01.2022 31.12.2022 1 2/4
01.01.2023 31.12.2023 1 2/4
.......
Any idea?
Here is my solution. See the comments below:
import io
# TEST DATA:
text=""" start end A B
01.01.2020 30.06.2020 2 3
01.01.2020 31.12.2020 3 1
01.04.2020 30.04.2020 6 2
01.01.2021 31.12.2021 2 3
01.07.2020 31.12.2020 8 2
31.12.2020 20.01.2021 12 12
31.12.2020 01.01.2021 22 22
30.12.2020 01.01.2021 32 32
10.05.2020 28.09.2023 44 44
27.11.2020 31.12.2023 88 88
31.12.2020 31.12.2023 100 100
01.01.2020 31.12.2021 200 200
"""
df= pd.read_csv(io.StringIO(text), sep=r"\s+", engine="python", parse_dates=[0,1])
#print("\n----\n df:",df)
#----------------------------------------
# SOLUTION:
def split_years(r):
"""
Split row 'r' where "end"-"start" greater than 0.
The new rows have repeated values of 'A', and 'B' divided by the number of years.
Return: a DataFrame with rows per year.
"""
t1,t2 = r["start"], r["end"]
ys= t2.year - t1.year
kk= 0 if t1.is_year_end else 1
if ys>0:
l1=[t1] + [ t1+pd.offsets.YearBegin(i) for i in range(1,ys+1) ]
l2=[ t1+pd.offsets.YearEnd(i) for i in range(kk,ys+kk) ] + [t2]
return pd.DataFrame({"start":l1, "end":l2, "A":r.A,"B": r.B/len(l1)})
print("year difference <= 0!")
return None
# Create two groups, one for rows where the 'start' and 'end' is in the same year, and one for the others:
grps= df.groupby(lambda idx: (df.loc[idx,"start"].year-df.loc[idx,"end"].year)!=0 ).groups
print("\n---- grps:\n",grps)
# Extract the "one year" rows in a data frame:
df1= df.loc[grps[False]]
#print("\n---- df1:\n",df1)
# Extract the rows to be splitted:
df2= df.loc[grps[True]]
print("\n---- df2:\n",df2)
# Split the rows and put the resulting data frames into a list:
ldfs=[ split_years(df2.loc[row]) for row in df2.index ]
print("\n---- ldfs:")
for fr in ldfs:
print(fr,"\n")
# Insert the "one year" data frame to the list, and concatenate them:
ldfs.insert(0,df1)
df_rslt= pd.concat(ldfs,sort=False)
#print("\n---- df_rslt:\n",df_rslt)
# Housekeeping:
df_rslt= df_rslt.sort_values("start").reset_index(drop=True)
print("\n---- df_rslt:\n",df_rslt)
Outputs:
---- grps:
{False: Int64Index([0, 1, 2, 3, 4], dtype='int64'), True: Int64Index([5, 6, 7, 8, 9, 10, 11], dtype='int64')}
---- df2:
start end A B
5 2020-12-31 2021-01-20 12 12
6 2020-12-31 2021-01-01 22 22
7 2020-12-30 2021-01-01 32 32
8 2020-10-05 2023-09-28 44 44
9 2020-11-27 2023-12-31 88 88
10 2020-12-31 2023-12-31 100 100
11 2020-01-01 2021-12-31 200 200
---- ldfs:
start end A B
0 2020-12-31 2020-12-31 12 6.0
1 2021-01-01 2021-01-20 12 6.0
start end A B
0 2020-12-31 2020-12-31 22 11.0
1 2021-01-01 2021-01-01 22 11.0
start end A B
0 2020-12-30 2020-12-31 32 16.0
1 2021-01-01 2021-01-01 32 16.0
start end A B
0 2020-10-05 2020-12-31 44 11.0
1 2021-01-01 2021-12-31 44 11.0
2 2022-01-01 2022-12-31 44 11.0
3 2023-01-01 2023-09-28 44 11.0
start end A B
0 2020-11-27 2020-12-31 88 22.0
1 2021-01-01 2021-12-31 88 22.0
2 2022-01-01 2022-12-31 88 22.0
3 2023-01-01 2023-12-31 88 22.0
start end A B
0 2020-12-31 2020-12-31 100 25.0
1 2021-01-01 2021-12-31 100 25.0
2 2022-01-01 2022-12-31 100 25.0
3 2023-01-01 2023-12-31 100 25.0
start end A B
0 2020-01-01 2020-12-31 200 100.0
1 2021-01-01 2021-12-31 200 100.0
---- df_rslt:
start end A B
0 2020-01-01 2020-06-30 2 3.0
1 2020-01-01 2020-12-31 3 1.0
2 2020-01-01 2020-12-31 200 100.0
3 2020-01-04 2020-04-30 6 2.0
4 2020-01-07 2020-12-31 8 2.0
5 2020-10-05 2020-12-31 44 11.0
6 2020-11-27 2020-12-31 88 22.0
7 2020-12-30 2020-12-31 32 16.0
8 2020-12-31 2020-12-31 12 6.0
9 2020-12-31 2020-12-31 100 25.0
10 2020-12-31 2020-12-31 22 11.0
11 2021-01-01 2021-12-31 100 25.0
12 2021-01-01 2021-12-31 88 22.0
13 2021-01-01 2021-12-31 44 11.0
14 2021-01-01 2021-01-01 32 16.0
15 2021-01-01 2021-01-01 22 11.0
16 2021-01-01 2021-01-20 12 6.0
17 2021-01-01 2021-12-31 2 3.0
18 2021-01-01 2021-12-31 200 100.0
19 2022-01-01 2022-12-31 88 22.0
20 2022-01-01 2022-12-31 100 25.0
21 2022-01-01 2022-12-31 44 11.0
22 2023-01-01 2023-09-28 44 11.0
23 2023-01-01 2023-12-31 88 22.0
24 2023-01-01 2023-12-31 100 25.0
Bit of a different approach, adding new columns instead of new rows. But I think this accomplishes what you want to do.
df["years_apart"] = (
(df["end_date"] - df["start_date"]).dt.days / 365
).astype(int)
for years in range(1, df["years_apart"].max().astype(int)):
df[f"{years}_end_date"] = pd.NaT
df.loc[
df["years_apart"] == years, f"{years}_end_date"
] = df.loc[
df["years_apart"] == years, "start_date"
] + dt.timedelta(days=365*years)
df["B_bis"] = df["B"] / df["years_apart"]
Output
start_date end_date years_apart 1_end_date 2_end_date ...
2018-01-01 2018-01-02 0 NaT NaT
2018-01-02 2019-01-02 1 2019-01-02 NaT
2018-01-03 2020-01-03 2 NaT 2020-01-03
I have solved it creating a date difference and a counter that adds years to the repeated rows:
#calculate difference between start and end year
table['diff'] = (table['end'] - table['start'])//timedelta(days=365)
table['diff'] = table['diff']+1
#replicate rows depending on number of years
table = table.reindex(table.index.repeat(table['diff']))
#counter that increase for diff>1, assign increasing years to the replicated rows
table['count'] = table['diff'].groupby(table['diff']).cumsum()//table['diff']
table['start'] = np.where(table['diff']>1, table['start']+table['count']-1, table['start'])
table['end'] = table['start']
#split B among years
table['B'] = table['B']//table['diff']

How to add missing dates within date interval?

I have a dataframe like as shown below
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,2,2,2,2,2],
'time_1' :['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-05
12:59:00','2173-05-04 13:14:00','2173-05-05 13:37:00','2173-07-06
13:39:00','2173-07-08 11:30:00','2173-04-08 16:00:00','2173-04-09
22:00:00','2173-04-11 04:00:00','2173- 04-13 04:30:00','2173-04-14 08:00:00'],
'val' :[5,5,5,5,1,6,5,5,8,3,4,6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
As you can see from the dataframe above that there are few missing dates in between. I would like to create new records for those dates and fill in values from the immediate previous row
def dt(df):
r = pd.date_range(start=df.date.min(), end=df.date.max())
df.set_index('date').reindex(r)
new_df = df.groupby(['subject_id','month']).apply(dt)
This generates all the dates. I only want to find the missing date within the input date interval for each subject for each month
I did try the code from this related post. Though it helped me but doesn't get me the expected output for this updated/new requirement. As we do left join, it copies all records. I can't do inner join either because it will drop non-match column. I want a mix of left join and inner join
Currently it creates new records for all 365 days in a year which I don't want. something like below. This is not expected
I only wish to add missing dates between input date interval as shown below. For example subject = 1, in the 4th month has records from 3rd and 5th. but 4th is missing. So we add record for 4th day alone. We don't need 6th,7th etc unlike current output. Similarly in 7th month, record for 7th day missing. so we just add a new record for that
I expect my output to be like as shown below
Here is problem you need resample for append new days, so it is necessary.
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
print (df1)
subject_id date
0 1 2173-04-03
1 1 2173-04-04
2 1 2173-04-05
3 1 2173-04-06
4 1 2173-04-07
.. ... ...
99 2 2173-04-10
100 2 2173-04-11
101 2 2173-04-12
102 2 2173-04-13
103 2 2173-04-14
[104 rows x 2 columns]
Idea is remove unnecessary missing rows - you can create threshold for minimum consecutive mising values (here 5) and remove rows (created new column fro easy test):
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5.0 3.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5.0 3.0 NaN
2 1 2173-04-04 NaT NaN NaN 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5.0 5.0 NaN
32 1 2173-05-04 2173-05-04 13:14:00 5.0 4.0 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1.0 5.0 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6.0 6.0 NaN
96 1 2173-07-07 NaT NaN NaN 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5.0 8.0 NaN
98 2 2173-04-08 2173-04-08 16:00:00 5.0 8.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8.0 9.0 NaN
100 2 2173-04-10 NaT NaN NaN 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3.0 11.0 NaN
102 2 2173-04-12 NaT NaN NaN 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4.0 13.0 NaN
104 2 2173-04-14 2173-04-14 08:00:00 6.0 14.0 NaN
Last use previous solution:
df2 = df2.groupby(df['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1 5 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6 6 NaN
96 1 2173-07-07 2173-07-07 13:39:00 6 7 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 1.0
99 2 2173-04-09 2173-04-09 22:00:00 8 9 1.0
100 2 2173-04-10 2173-04-10 22:00:00 8 10 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 1.0
EDIT: Solution with reindex for each month:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df['month'] = df['time_1'].dt.month
df1 = (df.drop_duplicates(['date','subject_id'])
.set_index('date')
.groupby(['subject_id', 'month'])
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max())))
.rename_axis(('subject_id','month','date'))
.index
.to_frame(index=False)
)
print (df1)
subject_id month date
0 1 4 2173-04-03
1 1 4 2173-04-04
2 1 4 2173-04-05
3 1 5 2173-05-04
4 1 5 2173-05-05
5 1 7 2173-07-06
6 1 7 2173-07-07
7 1 7 2173-07-08
8 2 4 2173-04-08
9 2 4 2173-04-09
10 2 4 2173-04-10
11 2 4 2173-04-11
12 2 4 2173-04-12
13 2 4 2173-04-13
14 2 4 2173-04-14
df2 = df1.merge(df, how='left')
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id month date time_1 val day
0 1 4 2173-04-03 2173-04-03 12:35:00 5 3
1 1 4 2173-04-03 2173-04-03 12:50:00 5 3
2 1 4 2173-04-04 2173-04-04 12:50:00 5 4
3 1 4 2173-04-05 2173-04-05 12:59:00 5 5
4 1 5 2173-05-04 2173-05-04 13:14:00 5 4
5 1 5 2173-05-05 2173-05-05 13:37:00 1 5
6 1 7 2173-07-06 2173-07-06 13:39:00 6 6
7 1 7 2173-07-07 2173-07-07 13:39:00 6 7
8 1 7 2173-07-08 2173-07-08 11:30:00 5 8
9 2 4 2173-04-08 2173-04-08 16:00:00 5 8
10 2 4 2173-04-09 2173-04-09 22:00:00 8 9
11 2 4 2173-04-10 2173-04-10 22:00:00 8 10
12 2 4 2173-04-11 2173-04-11 04:00:00 3 11
13 2 4 2173-04-12 2173-04-12 04:00:00 3 12
14 2 4 2173-04-13 2173-04-13 04:30:00 4 13
15 2 4 2173-04-14 2173-04-14 08:00:00 6 14
Does this help?
def fill_dates(df):
result = pd.DataFrame()
for i,row in df.iterrows():
if i == 0:
result = result.append(row)
else:
start_date = result.iloc[-1]['time_1']
end_date = row['time_1']
# print(start_date, end_date)
delta = (end_date - start_date).days
# print(delta)
if delta > 0 and start_date.month == end_date.month:
for j in range(delta):
day = start_date + timedelta(days=j+1)
new_row = result.iloc[-1].copy()
new_row['time_1'] = day
new_row['remarks'] = 'added'
if new_row['time_1'].date() != row['time_1'].date():
result = result.append(new_row)
result = result.append(row)
else:
result = result.append(row)
result.reset_index(inplace = True)
return result

Improve Performance of Apply Method

I would like to groupby by the variable of my df "cod_id" and then apply this function:
[df.loc[df['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in df['dt_op']]
Moving from this df:
print(df)
dt_op quantity cod_id
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
To this one:
print(final_df)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...
I tried with:
def lookforward(x):
L = [x.loc[x['dt_op'].between(row.dt_op, row.dt_op + pd.Timedelta(days=7)), \
'quantity'].sum() for row in x.itertuples(index=False)]
return pd.Series(L, index=x.index)
s = df.groupby('cod_id').apply(lookforward)
s.index = s.index.droplevel(0)
df['Final_Quantity'] = s
print(df)
dt_op quantity cod_id Final_Quantity
0 2018-01-20 1 613 2
1 2018-01-21 8 611 8
2 2018-01-21 1 613 1
But it is not an efficient solution, since it is computationally slow;
How can I improve its performance?
I would achieve it even with a new code/new function that leads to the same result.
EDIT:
Subset of the original dataset, with just one product (cod_id == 2), I tried to run on the code provided by "w-m":
print(df)
cod_id dt_op quantita final_sum
0 2 2017-01-03 1 54.0
1 2 2017-01-04 1 53.0
2 2 2017-01-13 1 52.0
3 2 2017-01-23 2 51.0
4 2 2017-01-26 1 49.0
5 2 2017-02-03 1 48.0
6 2 2017-02-27 1 47.0
7 2 2017-03-05 1 46.0
8 2 2017-03-15 1 45.0
9 2 2017-03-23 1 44.0
10 2 2017-03-27 2 43.0
11 2 2017-03-31 3 41.0
12 2 2017-04-04 1 38.0
13 2 2017-04-05 1 37.0
14 2 2017-04-15 2 36.0
15 2 2017-04-27 2 34.0
16 2 2017-04-30 1 32.0
17 2 2017-05-16 1 31.0
18 2 2017-05-18 1 30.0
19 2 2017-05-19 1 29.0
20 2 2017-06-03 1 28.0
21 2 2017-06-04 1 27.0
22 2 2017-06-07 1 26.0
23 2 2017-06-13 2 25.0
24 2 2017-06-14 1 23.0
25 2 2017-06-20 1 22.0
26 2 2017-06-22 2 21.0
27 2 2017-06-28 1 19.0
28 2 2017-06-30 1 18.0
29 2 2017-07-03 1 17.0
30 2 2017-07-06 2 16.0
31 2 2017-07-07 1 14.0
32 2 2017-07-13 1 13.0
33 2 2017-07-20 1 12.0
34 2 2017-07-28 1 11.0
35 2 2017-08-06 1 10.0
36 2 2017-08-07 1 9.0
37 2 2017-08-24 1 8.0
38 2 2017-09-06 1 7.0
39 2 2017-09-16 2 6.0
40 2 2017-09-20 1 4.0
41 2 2017-10-07 1 3.0
42 2 2017-11-04 1 2.0
43 2 2017-12-07 1 1.0
Edit 181017: this approach doesn't work due to forward rolling functions on sparse time series not currently being supported by pandas, see the comments.
Using for loops can be a performance killer when doing pandas operations.
The for loop around the rows plus their timedelta of 7 days can be replaced with a .rolling("7D"). To get a forward-rolling time delta (current date + 7 days), we reverse the df by date, as shown here.
Then no custom function is required anymore, and you can just take .quantity.sum() from the groupby.
quant_sum = df.sort_values("dt_op", ascending=False).groupby("cod_id") \
.rolling("7D", on="dt_op").quantity.sum()
cod_id dt_op
611 2018-01-21 8.0
613 2018-01-21 1.0
2018-01-20 2.0
Name: quantity, dtype: float64
result = df.set_index(["cod_id", "dt_op"])
result["final_sum"] = quant_sum
result.reset_index()
cod_id dt_op quantity final_sum
0 613 2018-01-20 1 2.0
1 611 2018-01-21 8 8.0
2 613 2018-01-21 1 1.0
Implementing the exact behavior from the question is difficult due to two shortcoming in pandas: neither groupby/rolling/transform nor forward looking rolling sparse dates being implemented (see other answer for more details).
This answer attempts to work around both by resampling the data, filling in all days, and then joining the quant_sums back with the original data.
# Create a temporary df with all in between days filled in with zeros
filled = df.set_index("dt_op").groupby("cod_id") \
.resample("D").asfreq().fillna(0) \
.quantity.to_frame()
# Reverse and sum
filled["quant_sum"] = filled.reset_index().set_index("dt_op") \
.iloc[::-1] \
.groupby("cod_id") \
.rolling(7, min_periods=1) \
.quantity.sum().astype(int)
# Join with original `df`, dropping the filled days
result = df.set_index(["cod_id", "dt_op"]).join(filled.quant_sum).reset_index()

Categories