I have a pandas dataframe like this,
id d1 d2
0 1 2016-12-15 2017-02-08
1 2 2017-04-28 2017-07-20
2 3 2017-07-28 2017-10-19
3 4 2018-02-20 2019-01-21
4 5 2019-03-19 2019-06-10
5 1 2019-05-24 2019-05-30
6 2 2019-06-04 2019-07-22
I want to check whether any d2 is greater than the next row's d1, and if so, set that d2 to the next d1 minus one day.
I can figure out where I want to change the date with this code,
x['d2'].gt(x['d1'].shift(-1))
I am not sure how to proceed efficiently after this.
Result I am looking for is like this,
id d1 d2
0 1 2016-12-15 2017-02-08
1 2 2017-04-28 2017-07-20
2 3 2017-07-28 2017-10-19
3 4 2018-02-20 2019-01-21
4 5 2019-03-19 2019-05-23
5 1 2019-05-24 2019-05-30
6 2 2019-06-04 2019-07-22
How can I do this in pandas without loops?
I am currently solving it with apply like this:
x.apply(lambda x : x['d1_shifted'] - pd.Timedelta(days=1) if x['d2'] > x['d1_shifted'] else x['d2'], axis=1)
Try:
import numpy as np

c = df.d2.gt(df.d1.shift(-1))
df = df.assign(d2=np.where(c, df.d1.shift(-1) - pd.Timedelta(1, unit='d'), df.d2))
print(df)
id d1 d2
0 1 2016-12-15 2017-02-08
1 2 2017-04-28 2017-07-20
2 3 2017-07-28 2017-10-19
3 4 2018-02-20 2019-01-21
4 5 2019-03-19 2019-05-23
5 1 2019-05-24 2019-05-30
6 2 2019-06-04 2019-07-22
Another way is a direct assignment via .loc and pd.DateOffset, as follows:
m = df.d2.gt(df.d1.shift(-1))
df.loc[m, 'd2'] = df.shift(-1).loc[m, 'd1'] - pd.DateOffset(1)
id d1 d2
0 1 2016-12-15 2017-02-08
1 2 2017-04-28 2017-07-20
2 3 2017-07-28 2017-10-19
3 4 2018-02-20 2019-01-21
4 5 2019-03-19 2019-05-23
5 1 2019-05-24 2019-05-30
6 2 2019-06-04 2019-07-22
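For completeness, an equivalent one-liner with Series.mask is possible too (a sketch, not from the answers above; it assumes d1 and d2 are already datetime64 columns):
import pandas as pd

next_d1 = df['d1'].shift(-1)
# Where d2 spills past the next row's d1, clip it to the day before that d1.
df['d2'] = df['d2'].mask(df['d2'].gt(next_d1), next_d1 - pd.Timedelta(days=1))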
Related
I have date-interval data with a "periodicity" column representing how frequently the date interval recurs:
Weekly: same weekdays every week
Biweekly: same weekdays every other week
Monthly: Same DATES every month
Moreover, I have a "recurring_until" column specifying when the recurrence should stop.
What I need to accomplish is:
creating a separate row for each recurrence until "recurring_until" has been reached.
What I have:
What I need:
I have been trying with various for loops without much success. Here is the sample data:
import pandas as pd
data = {
    'id': ['1', '2', '3', '4'],
    'from': ['5/31/2020', '6/3/2020', '6/18/2020', '6/10/2020'],
    'to': ['6/5/2020', '6/3/2020', '6/19/2020', '6/10/2020'],
    'periodicity': ['weekly', 'weekly', 'biweekly', 'monthly'],
    'recurring_until': ['7/25/2020', '6/9/2020', '12/30/2020', '7/9/2020'],
}
df = pd.DataFrame(data)
First of all, preprocess:
df.set_index("id", inplace=True)
df["from"], df["to"], df["recurring_until"] = pd.to_datetime(df["from"]), pd.to_datetime(df.to), pd.to_datetime(df.recurring_until)
Next, compute all the periodic from dates:
new_from = df.apply(lambda x: pd.date_range(x["from"], x.recurring_until), axis=1)  # generate all days between from and recurring_until
new_from[df.periodicity=="weekly"] = new_from[df.periodicity=="weekly"].apply(lambda x: x[::7])  # slice by week
new_from[df.periodicity=="biweekly"] = new_from[df.periodicity=="biweekly"].apply(lambda x: x[::14])  # slice every other week
new_from[df.periodicity=="monthly"] = new_from[df.periodicity=="monthly"].apply(lambda x: x[x.day==x.day[0]])  # select only days equal to the first day
new_from = new_from.explode()  # explode to obtain a series
new_from.name = "from"  # name the series
After this we have new_from like this:
id
1 2020-05-31
1 2020-06-07
1 2020-06-14
1 2020-06-21
1 2020-06-28
1 2020-07-05
1 2020-07-12
1 2020-07-19
2 2020-06-03
3 2020-06-18
3 2020-07-02
3 2020-07-16
3 2020-07-30
3 2020-08-13
3 2020-08-27
3 2020-09-10
3 2020-09-24
3 2020-10-08
3 2020-10-22
3 2020-11-05
3 2020-11-19
3 2020-12-03
3 2020-12-17
4 2020-06-10
Name: from, dtype: datetime64[ns]
Now let's compute all the periodic to dates:
new_to = new_from+(df.to-df["from"]).loc[new_from.index]
new_to.name = "to"
and we have new_to like this:
id
1 2020-06-05
1 2020-06-12
1 2020-06-19
1 2020-06-26
1 2020-07-03
1 2020-07-10
1 2020-07-17
1 2020-07-24
2 2020-06-03
3 2020-06-19
3 2020-07-03
3 2020-07-17
3 2020-07-31
3 2020-08-14
3 2020-08-28
3 2020-09-11
3 2020-09-25
3 2020-10-09
3 2020-10-23
3 2020-11-06
3 2020-11-20
3 2020-12-04
3 2020-12-18
4 2020-06-10
Name: to, dtype: datetime64[ns]
We can finally concatenate these two series and join them to the initial dataframe:
periodic_df = pd.concat([new_from, new_to], axis=1).join(df[["periodicity", "recurring_until"]]).reset_index()
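If the rows should come out grouped by id and ordered by start date, a small optional polish (not part of the original answer) is:
periodic_df = periodic_df.sort_values(["id", "from"]).reset_index(drop=True)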
result:
I have a dataframe like the one shown below:
df = pd.DataFrame({
    'subject_id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'time_1': ['2173-04-03 12:35:00', '2173-04-03 12:50:00', '2173-04-05 12:59:00',
               '2173-05-04 13:14:00', '2173-05-05 13:37:00', '2173-07-06 13:39:00',
               '2173-07-08 11:30:00', '2173-04-08 16:00:00', '2173-04-09 22:00:00',
               '2173-04-11 04:00:00', '2173-04-13 04:30:00', '2173-04-14 08:00:00'],
    'val': [5, 5, 5, 5, 1, 6, 5, 5, 8, 3, 4, 6]})
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['month'] = df['time_1'].dt.month
As you can see from the dataframe above, there are a few missing dates in between. I would like to create new records for those dates and fill in the values from the immediately previous row.
def dt(df):
    r = pd.date_range(start=df.date.min(), end=df.date.max())
    return df.set_index('date').reindex(r)

new_df = df.groupby(['subject_id','month']).apply(dt)
This generates all the dates. I only want to find the missing dates within the input date interval for each subject for each month.
I did try the code from this related post. It helped, but it doesn't give me the expected output for this updated requirement. Since it does a left join, it copies all records. I can't do an inner join either, because that would drop the non-matching rows. I want a mix of a left join and an inner join.
Currently it creates new records for all 365 days in a year, which I don't want. It looks something like the output below, which is not what I expect.
I only wish to add the missing dates within the input date interval, as shown below. For example, subject = 1 in the 4th month has records for the 3rd and the 5th, but the 4th is missing, so we add a record for the 4th day alone. We don't need the 6th, 7th, etc., unlike the current output. Similarly, in the 7th month the record for the 7th day is missing, so we just add a new record for that.
I expect my output to be as shown below.
The problem here is that you need resample to append the new days, so it is necessary:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df1 = (df.set_index('date')
.groupby('subject_id')
.resample('d')
.last()
.index
.to_frame(index=False))
print (df1)
subject_id date
0 1 2173-04-03
1 1 2173-04-04
2 1 2173-04-05
3 1 2173-04-06
4 1 2173-04-07
.. ... ...
99 2 2173-04-10
100 2 2173-04-11
101 2 2173-04-12
102 2 2173-04-13
103 2 2173-04-14
[104 rows x 2 columns]
The idea is to remove the unnecessary missing rows - you can set a threshold for the minimum number of consecutive missing values (here 5) and remove those rows (a new count column is created for easy testing):
df2 = df1.merge(df, how='left')
thresh = 5
mask = df2['day'].notna()
s = mask.cumsum().mask(mask)
df2['count'] = s.map(s.value_counts())
df2 = df2[(df2['count'] < thresh) | (df2['count'].isna())]
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5.0 3.0 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5.0 3.0 NaN
2 1 2173-04-04 NaT NaN NaN 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5.0 5.0 NaN
32 1 2173-05-04 2173-05-04 13:14:00 5.0 4.0 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1.0 5.0 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6.0 6.0 NaN
96 1 2173-07-07 NaT NaN NaN 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5.0 8.0 NaN
98 2 2173-04-08 2173-04-08 16:00:00 5.0 8.0 NaN
99 2 2173-04-09 2173-04-09 22:00:00 8.0 9.0 NaN
100 2 2173-04-10 NaT NaN NaN 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3.0 11.0 NaN
102 2 2173-04-12 NaT NaN NaN 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4.0 13.0 NaN
104 2 2173-04-14 2173-04-14 08:00:00 6.0 14.0 NaN
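To see what the consecutive-gap counting does in isolation, here is a minimal sketch on a hypothetical toy Series (not data from the question):
import numpy as np
import pandas as pd

day = pd.Series([3.0, np.nan, 5.0, np.nan, np.nan, np.nan, 9.0])
mask = day.notna()               # True where a real record exists
s = mask.cumsum().mask(mask)     # rows of the same gap share one label, real rows become NaN
count = s.map(s.value_counts())  # length of the gap each missing row belongs to
print(count.tolist())            # [nan, 1.0, nan, 3.0, 3.0, 3.0, nan]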
Finally, use the previous solution:
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id date time_1 val day count
0 1 2173-04-03 2173-04-03 12:35:00 5 3 NaN
1 1 2173-04-03 2173-04-03 12:50:00 5 3 NaN
2 1 2173-04-04 2173-04-04 12:50:00 5 4 1.0
3 1 2173-04-05 2173-04-05 12:59:00 5 5 1.0
32 1 2173-05-04 2173-05-04 13:14:00 5 4 NaN
33 1 2173-05-05 2173-05-05 13:37:00 1 5 NaN
95 1 2173-07-06 2173-07-06 13:39:00 6 6 NaN
96 1 2173-07-07 2173-07-07 13:39:00 6 7 1.0
97 1 2173-07-08 2173-07-08 11:30:00 5 8 1.0
98 2 2173-04-08 2173-04-08 16:00:00 5 8 1.0
99 2 2173-04-09 2173-04-09 22:00:00 8 9 1.0
100 2 2173-04-10 2173-04-10 22:00:00 8 10 1.0
101 2 2173-04-11 2173-04-11 04:00:00 3 11 1.0
102 2 2173-04-12 2173-04-12 04:00:00 3 12 1.0
103 2 2173-04-13 2173-04-13 04:30:00 4 13 1.0
104 2 2173-04-14 2173-04-14 08:00:00 6 14 1.0
EDIT: Solution with reindex for each month:
df['time_1'] = pd.to_datetime(df['time_1'])
df['day'] = df['time_1'].dt.day
df['date'] = df['time_1'].dt.floor('d')
df['month'] = df['time_1'].dt.month
df1 = (df.drop_duplicates(['date','subject_id'])
.set_index('date')
.groupby(['subject_id', 'month'])
.apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max())))
.rename_axis(('subject_id','month','date'))
.index
.to_frame(index=False)
)
print (df1)
subject_id month date
0 1 4 2173-04-03
1 1 4 2173-04-04
2 1 4 2173-04-05
3 1 5 2173-05-04
4 1 5 2173-05-05
5 1 7 2173-07-06
6 1 7 2173-07-07
7 1 7 2173-07-08
8 2 4 2173-04-08
9 2 4 2173-04-09
10 2 4 2173-04-10
11 2 4 2173-04-11
12 2 4 2173-04-12
13 2 4 2173-04-13
14 2 4 2173-04-14
df2 = df1.merge(df, how='left')
df2 = df2.groupby(df2['subject_id']).ffill()
dates = df2['time_1'].dt.normalize()
df2['time_1'] += np.where(dates == df2['date'], 0, df2['date'] - dates)
df2['day'] = df2['time_1'].dt.day
df2['val'] = df2['val'].astype(int)
print (df2)
subject_id month date time_1 val day
0 1 4 2173-04-03 2173-04-03 12:35:00 5 3
1 1 4 2173-04-03 2173-04-03 12:50:00 5 3
2 1 4 2173-04-04 2173-04-04 12:50:00 5 4
3 1 4 2173-04-05 2173-04-05 12:59:00 5 5
4 1 5 2173-05-04 2173-05-04 13:14:00 5 4
5 1 5 2173-05-05 2173-05-05 13:37:00 1 5
6 1 7 2173-07-06 2173-07-06 13:39:00 6 6
7 1 7 2173-07-07 2173-07-07 13:39:00 6 7
8 1 7 2173-07-08 2173-07-08 11:30:00 5 8
9 2 4 2173-04-08 2173-04-08 16:00:00 5 8
10 2 4 2173-04-09 2173-04-09 22:00:00 8 9
11 2 4 2173-04-10 2173-04-10 22:00:00 8 10
12 2 4 2173-04-11 2173-04-11 04:00:00 3 11
13 2 4 2173-04-12 2173-04-12 04:00:00 3 12
14 2 4 2173-04-13 2173-04-13 04:30:00 4 13
15 2 4 2173-04-14 2173-04-14 08:00:00 6 14
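If the helper date and month columns are not wanted in the final output, they can simply be dropped afterwards (a small optional step, not in the original answer):
df2 = df2.drop(columns=['date', 'month'])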
Does this help?
from datetime import timedelta

def fill_dates(df):
    # note: DataFrame.append requires pandas < 2.0; with newer pandas use pd.concat instead
    result = pd.DataFrame()
    for i, row in df.iterrows():
        if i == 0:
            result = result.append(row)
        else:
            start_date = result.iloc[-1]['time_1']
            end_date = row['time_1']
            # print(start_date, end_date)
            delta = (end_date - start_date).days
            # print(delta)
            if delta > 0 and start_date.month == end_date.month:
                for j in range(delta):
                    day = start_date + timedelta(days=j + 1)
                    new_row = result.iloc[-1].copy()
                    new_row['time_1'] = day
                    new_row['remarks'] = 'added'
                    if new_row['time_1'].date() != row['time_1'].date():
                        result = result.append(new_row)
                result = result.append(row)
            else:
                result = result.append(row)
    result.reset_index(inplace=True)
    return result
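A hypothetical usage sketch (assuming the sample df from the question and pandas < 2.0 because of DataFrame.append; reset_index so that i == 0 marks the first row of the slice):
filled = fill_dates(df[df['subject_id'] == 1].reset_index(drop=True))
print(filled)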
I have two datetime columns - ColumnA and ColumnB. I want to create a new column - ColumnC, using conditional logic.
Originally, I created ColumnB from a YearMonth column of dates such as 201907, 201908, etc.
When ColumnA is NaN, I want to choose ColumnB.
Otherwise, I want to choose ColumnA.
Currently, my code below is causing ColumnC to have different formats. I'm not sure how to get rid of all of those 0's. I want the whole column to be YYYY-MM-DD.
ID YearMonth ColumnA ColumnB ColumnC
0 1 201712 2017-12-29 2017-12-31 2017-12-29
1 1 201801 2018-01-31 2018-01-31 2018-01-31
2 1 201802 2018-02-28 2018-02-28 2018-02-28
3 1 201806 2018-06-29 2018-06-30 2018-06-29
4 1 201807 2018-07-31 2018-07-31 2018-07-31
5 1 201808 2018-08-31 2018-08-31 2018-08-31
6 1 201809 2018-09-28 2018-09-30 2018-09-28
7 1 201810 2018-10-31 2018-10-31 2018-10-31
8 1 201811 2018-11-30 2018-11-30 2018-11-30
9 1 201812 2018-12-31 2018-12-31 2018-12-31
10 1 201803 NaN 2018-03-31 1522454400000000000
11 1 201804 NaN 2018-04-30 1525046400000000000
12 1 201805 NaN 2018-05-31 1527724800000000000
13 1 201901 NaN 2019-01-31 1548892800000000000
14 1 201902 NaN 2019-02-28 1551312000000000000
15 1 201903 NaN 2019-03-31 1553990400000000000
16 1 201904 NaN 2019-04-30 1556582400000000000
17 1 201905 NaN 2019-05-31 1559260800000000000
18 1 201906 NaN 2019-06-30 1561852800000000000
19 1 201907 NaN 2019-07-31 1564531200000000000
20 1 201908 NaN 2019-08-31 1567209600000000000
21 1 201909 NaN 2019-09-30 1569801600000000000
df['ColumnB'] = pd.to_datetime(df['YearMonth'], format='%Y%m', errors='coerce').dropna() + pd.offsets.MonthEnd(0)
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB'], format='%Y%m%d'), df['ColumnA'])
df['ColumnC'] = np.where(df['ColumnA'].isnull(),df['ColumnB'] , df['ColumnA'])
Just figured it out!
df['ColumnC'] = np.where(pd.isna(df['ColumnA']), pd.to_datetime(df['ColumnB']), pd.to_datetime(df['ColumnA']))
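The long integers in the earlier attempts are the datetime values shown as raw nanosecond counts, which can happen when np.where mixes datetime64 data with another dtype. A simpler alternative (a sketch, not from the question, assuming ColumnA and ColumnB are already datetime64) is fillna, which keeps the dtype throughout:
df['ColumnC'] = df['ColumnA'].fillna(df['ColumnB'])
# df['ColumnA'].combine_first(df['ColumnB']) is equivalent here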
I would like to group my df by the variable "cod_id" and then apply this function:
[df.loc[df['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in df['dt_op']]
Moving from this df:
print(df)
dt_op quantity cod_id
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
To this one:
print(final_df)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...
I tried with:
def lookforward(x):
    L = [x.loc[x['dt_op'].between(row.dt_op, row.dt_op + pd.Timedelta(days=7)),
               'quantity'].sum() for row in x.itertuples(index=False)]
    return pd.Series(L, index=x.index)
s = df.groupby('cod_id').apply(lookforward)
s.index = s.index.droplevel(0)
df['Final_Quantity'] = s
print(df)
dt_op quantity cod_id Final_Quantity
0 2018-01-20 1 613 2
1 2018-01-21 8 611 8
2 2018-01-21 1 613 1
But it is not an efficient solution, since it is computationally slow.
How can I improve its performance?
I would also accept a different piece of code or a new function that leads to the same result.
EDIT:
Here is a subset of the original dataset, with just one product (cod_id == 2), on which I ran the code provided by "w-m":
print(df)
cod_id dt_op quantita final_sum
0 2 2017-01-03 1 54.0
1 2 2017-01-04 1 53.0
2 2 2017-01-13 1 52.0
3 2 2017-01-23 2 51.0
4 2 2017-01-26 1 49.0
5 2 2017-02-03 1 48.0
6 2 2017-02-27 1 47.0
7 2 2017-03-05 1 46.0
8 2 2017-03-15 1 45.0
9 2 2017-03-23 1 44.0
10 2 2017-03-27 2 43.0
11 2 2017-03-31 3 41.0
12 2 2017-04-04 1 38.0
13 2 2017-04-05 1 37.0
14 2 2017-04-15 2 36.0
15 2 2017-04-27 2 34.0
16 2 2017-04-30 1 32.0
17 2 2017-05-16 1 31.0
18 2 2017-05-18 1 30.0
19 2 2017-05-19 1 29.0
20 2 2017-06-03 1 28.0
21 2 2017-06-04 1 27.0
22 2 2017-06-07 1 26.0
23 2 2017-06-13 2 25.0
24 2 2017-06-14 1 23.0
25 2 2017-06-20 1 22.0
26 2 2017-06-22 2 21.0
27 2 2017-06-28 1 19.0
28 2 2017-06-30 1 18.0
29 2 2017-07-03 1 17.0
30 2 2017-07-06 2 16.0
31 2 2017-07-07 1 14.0
32 2 2017-07-13 1 13.0
33 2 2017-07-20 1 12.0
34 2 2017-07-28 1 11.0
35 2 2017-08-06 1 10.0
36 2 2017-08-07 1 9.0
37 2 2017-08-24 1 8.0
38 2 2017-09-06 1 7.0
39 2 2017-09-16 2 6.0
40 2 2017-09-20 1 4.0
41 2 2017-10-07 1 3.0
42 2 2017-11-04 1 2.0
43 2 2017-12-07 1 1.0
Edit 181017: this approach doesn't work, because forward-rolling functions on sparse time series are not currently supported by pandas; see the comments.
Using for loops can be a performance killer when doing pandas operations.
The for loop around the rows plus their timedelta of 7 days can be replaced with a .rolling("7D"). To get a forward-looking window (current date + 7 days), we reverse the df by date, as shown here.
Then no custom function is required anymore, and you can just take .quantity.sum() from the groupby.
quant_sum = df.sort_values("dt_op", ascending=False).groupby("cod_id") \
.rolling("7D", on="dt_op").quantity.sum()
cod_id dt_op
611 2018-01-21 8.0
613 2018-01-21 1.0
2018-01-20 2.0
Name: quantity, dtype: float64
result = df.set_index(["cod_id", "dt_op"])
result["final_sum"] = quant_sum
result.reset_index()
cod_id dt_op quantity final_sum
0 613 2018-01-20 1 2.0
1 611 2018-01-21 8 8.0
2 613 2018-01-21 1 1.0
Implementing the exact behavior from the question is difficult due to two shortcomings in pandas: neither groupby/rolling/transform nor forward-looking rolling over sparse dates is implemented (see the other answer for more details).
This answer attempts to work around both by resampling the data, filling in all days, and then joining the quant_sums back with the original data.
# Create a temporary df with all in between days filled in with zeros
filled = df.set_index("dt_op").groupby("cod_id") \
.resample("D").asfreq().fillna(0) \
.quantity.to_frame()
# Reverse and sum
filled["quant_sum"] = filled.reset_index().set_index("dt_op") \
.iloc[::-1] \
.groupby("cod_id") \
.rolling(7, min_periods=1) \
.quantity.sum().astype(int)
# Join with original `df`, dropping the filled days
result = df.set_index(["cod_id", "dt_op"]).join(filled.quant_sum).reset_index()
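If the column should carry the name used in the question (a tiny optional step; Final_Quantity is the question's column name):
result = result.rename(columns={"quant_sum": "Final_Quantity"})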
I want to calculate the duration of each conversation using ID. Below is the data:
ID Ques Time Expected output
----------------------------------
11 Hi 11.21 1min
11 Hello 11.22
13 hey 12.11 10mins
13 what 12.22
14 so 01.01 2mins
14 ok 01.03
15 hru 02.00
15 hii 02.01 3mins
15 hey 02.02
----------------------------------
I tried:
First_last_cover = English_Logs['Date'].agg(['min','max'])
print ("First Conversation and Last Conversation of the month", First_last_cover)
I think you need to convert the times with to_timedelta and then get the difference into a new column with transform:
df['Time'] = pd.to_timedelta(df['Time'].astype(str).str.replace('.', ':').add(':00'))
df['new'] = df.groupby('ID')['Time'].transform(lambda x: x.max() - x.min())
print (df)
ID Ques Time Expected output new
0 11 Hi 11:21:00 1min 00:01:00
1 11 Hello 11:22:00 NaN 00:01:00
2 13 hey 12:11:00 10mins 00:11:00
3 13 what 12:22:00 NaN 00:11:00
4 14 so 01:01:00 2mins 00:02:00
5 14 ok 01:03:00 NaN 00:02:00
6 15 hru 02:00:00 NaN 00:02:00
7 15 hii 02:01:00 3mins 00:02:00
8 15 hey 02:02:00 NaN 00:02:00
If you want to convert the timedeltas to minutes, add total_seconds and divide by 60:
df['new'] = df['new'].dt.total_seconds().div(60)
print (df)
ID Ques Time Expected output new
0 11 Hi 11:21:00 1min 1.0
1 11 Hello 11:22:00 NaN 1.0
2 13 hey 12:11:00 10mins 11.0
3 13 what 12:22:00 NaN 11.0
4 14 so 01:01:00 2mins 2.0
5 14 ok 01:03:00 NaN 2.0
6 15 hru 02:00:00 NaN 2.0
7 15 hii 02:01:00 3mins 2.0
8 15 hey 02:02:00 NaN 2.0
... or aggregate to a new DataFrame with agg:
df1 = (df.groupby('ID')['Time']
         .agg(lambda x: x.max() - x.min())
         .dt.total_seconds()
         .div(60)
         .reset_index())
ID Time
0 11 1.0
1 13 11.0
2 14 2.0
3 15 2.0
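If the goal is a label like the Expected output column, the minutes can be formatted as strings afterwards (a small optional sketch, assuming df1 from the block above with Time already in minutes; duration_label is a hypothetical column name):
df1['duration_label'] = df1['Time'].astype(int).astype(str) + 'mins'  # e.g. '1mins', '11mins'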