Get the overlap duration between date intervals based on condition - python

I have two dataframes; each has a start/end datetime and a value. They do not have the same number of rows, and intervals that overlap may not be at the same row/index.
df1
start_datetime  end_datetime  value
08:50           09:50         5
09:52           10:10         6
10:50           11:30         2
df2
start_datetime  end_datetime  value
08:51           08:59         3
09:52           10:02         9
10:03           10:30         1
11:03           11:39         1
13:10           13:15         0
I would like to calculate the total duration of time where df1 and df2 overlap, but only where df1.value > df2.value.
During one df2 time interval, df1 can overlap multiple times, and the condition is only sometimes True.
I tried something like that:
time = timedelta()
for i, row1 in df1.iterrows():
    t1 = pd.Interval(row1.start, row1.end)
    for j, row2 in df2.iterrows():
        t2 = pd.Interval(row2.start, row2.end)
        if t1.overlaps(t2) and row1.value > row2.value:
            latest_start = max(row1.start, row2.start)
            earliest_end = min(row1.end, row2.end)
            delta = earliest_end - latest_start
            time += delta
I can loop over every df1 row and test against the whole of df2, but that is not efficient.
expected output (example):
Timedelta('0 days 00:99:99')

Here is my solution:
Create DataFrames:
import pandas as pd

df1 = pd.DataFrame(
    {"start_datetime1": ['08:50', '09:52', '10:50'],
     'end_datetime1': ['09:50', '10:10', '11:30'],
     'value1': [5, 6, 2]})
df2 = pd.DataFrame(
    {"start_datetime2": ['08:51', '09:52', '10:03', '11:03', '13:10'],
     'end_datetime2': ['08:59', '10:02', '10:30', '11:39', '13:15'],
     'value2': [3, 9, 1, 1, 0]})
df1["start_datetime1"] = pd.to_datetime(df1["start_datetime1"])
df1["end_datetime1"] = pd.to_datetime(df1["end_datetime1"])
df2["start_datetime2"] = pd.to_datetime(df2["start_datetime2"])
df2["end_datetime2"] = pd.to_datetime(df2["end_datetime2"])
Combine the dataframes to make the comparison easier. The combined dataframe has all possible pairings:
df1['temp'] = 1  # temporary key to create all combinations
df2['temp'] = 1
df_combined = pd.merge(df1, df2, on='temp').drop('temp', axis=1)
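On pandas 1.2 or newer, the temporary-key trick can be replaced by a cross merge, which should give the same combined frame:
df_combined = pd.merge(df1, df2, how='cross')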
Compare the rows with a lambda function; for overlapping pairs where value1 > value2, the overlap length is the earliest end minus the latest start:
df_combined['Result'] = df_combined.apply(
    lambda row: min(row["end_datetime1"], row["end_datetime2"]) -
                max(row["start_datetime1"], row["start_datetime2"])
    if pd.Interval(row['start_datetime1'], row['end_datetime1']).overlaps(
        pd.Interval(row['start_datetime2'], row['end_datetime2']))
    and row["value1"] > row["value2"]
    else 0, axis=1)
df_combined
Result :
total_timedelta = df_combined['Result'].loc[df_combined['Result'] != 0].sum()
0 days 00:42:00
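For larger frames, a fully vectorized version of the same idea (a sketch, assuming the df_combined built above with the column names used in this answer) avoids the row-wise apply: the overlap length is min(ends) - max(starts), clipped at zero, summed only where value1 > value2.
import numpy as np

starts = np.maximum(df_combined['start_datetime1'], df_combined['start_datetime2'])
ends = np.minimum(df_combined['end_datetime1'], df_combined['end_datetime2'])
overlap = (ends - starts).clip(lower=pd.Timedelta(0))  # negative difference means no overlap
total_overlap = overlap[df_combined['value1'] > df_combined['value2']].sum()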

use groupby() and for loop to count column values with conditions

The logic of what I am trying to do I think is best explained with code:
import pandas as pd
import numpy as np
from datetime import timedelta

np.random.seed(365)
# some data
start_date = pd.date_range(start="2015-01-09", end="2022-09-11", freq="6D")
end_date = [start_date + timedelta(days=np.random.exponential(scale=100)) for start_date in start_date]
df = pd.DataFrame(
    {"start_date": start_date,
     "end_date": end_date}
)
# randomly remove some end dates
df["end_date"] = df["end_date"].sample(frac=0.7).reset_index(drop=True)
df["end_date"] = df["end_date"].dt.date.astype("datetime64[ns]")
I first create a pd.Series with the 1st day of every month in the entire history of the data:
dates = pd.Series(df["start_date"].dt.to_period("M").sort_values(ascending = True).unique()).dt.start_time
What I then want to do is count the number of df["start_date"] values which are earlier than the 1st day of each month in the series and whose df["end_date"] values are null (recorded as NaT).
I assume I would use a for loop and somehow group by the dates series so that the resulting output looks something like this:
month_start  count
2015-01-01   5
2015-02-01   10
2015-03-01   35
The count column in the resulting output is the number of df rows whose df["start_date"] is before the 1st of that month and whose df["end_date"] is null; this count is produced for every value in the series.
Here is the logic of what I am trying to do:
df.groupby(by=dates)[["start_date", "end_date"]].apply(
    lambda x: [x["start_date"] < date for date in dates] & x["end_date"].isnull == True
)
Is this what you want:
df2 = df[df['end_date'].isnull()]
dates_count = dates.apply(lambda x: df2[df2['start_date'] < x]['start_date'].count())
print(pd.concat([dates, dates_count], axis=1))
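If dates is long, a faster variant of the same idea (a sketch, assuming the df and dates defined in the question) filters the NaT rows once and uses searchsorted instead of scanning the frame once per month:
import numpy as np

# start dates of rows whose end_date is still NaT, sorted once
open_starts = df.loc[df['end_date'].isna(), 'start_date'].sort_values().to_numpy()
# for each month start, count how many open start dates fall strictly before it
counts = pd.Series(np.searchsorted(open_starts, dates.to_numpy(), side='left'),
                   index=dates, name='count')
print(counts)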
IIUC, group by period (shifted by 1 month) and count the NaT, then cumsum to accumulate the counts:
(df['end_date'].isna()
.groupby(df['start_date'].dt.to_period('M').add(1).dt.start_time)
.sum()
.cumsum()
)
Output:
start_date
2015-02-01 0
2015-03-01 0
2015-04-01 0
2015-05-01 0
2015-06-01 0
...
2022-06-01 122
2022-07-01 127
2022-08-01 133
2022-09-01 138
2022-10-01 140
Name: end_date, Length: 93, dtype: int64

Compare two date columns of a dataframe with two date columns of a second dataframe in python

I have two dataframes df1 and df2
df1 contains a month column and two date columns
df1
Month Month_Start Month_End
Month1 2022-03-27 2022-04-30
Month2 2022-05-01 2022-05-28
Month3 2022-05-01 2022-06-25
and another dataframe df2
start_Month end_Month price
2022-03-27 2260-12-31 1
2022-03-27 2260-12-31 2
2022-03-27 2260-12-31 3
If Month_Start and Month_End of df1 fall between start_Month and end_Month of df2, assign the price value of df2 to that Month of df1,
like the following result:
Month price
Month1 1
Month2 1
Month3 1
I tried using for loops
for i in range(len(df2)):
    for j in range(len(df1)):
        if df2['start_Month'][i] <= df1['Month_Start'][j] <= df1['Month_End'][j] <= df2['end_Month'][i]:
            new.loc[len(new.index)] = [df1['Month'][j], df2['price'][i]]
but it takes a lot of time to execute for 1000+ rows.
Any ideas?
Is there a common column, such as an id, on which you can combine these two dataframes? If there is, it would be much more accurate to apply the conditions after combining the two tables. You can try the code below based on the current data and conditions (dataframes that are not the same size may be a problem).
import pandas as pd
import numpy as np

df1 = pd.DataFrame(data={'Month': ['Month1', 'Month2', 'Month3'],
                         'Month_Start': ['2022-03-27', '2022-05-01', '2022-05-01'],
                         'Month_End': ['2022-04-30', '2022-05-28', '2022-06-25']})
df2 = pd.DataFrame(data={'start_Month': ['2022-03-27', '2022-03-27', '2022-03-27'],
                         'end_Month': ['2260-12-31', '2260-12-31', '2260-12-31'],
                         'price': [1, 2, 3]})
con = [(df1['Month_Start'] >= df2['start_Month']) & (df1['Month_End'] <= df2['end_Month'])]
cho = [df2['price']]
df1['price'] = np.select(con, cho, default=np.nan)
Assuming these are your dataframes:
import pandas as pd

df1 = pd.DataFrame({'Month': ['Month1', 'Month2', 'Month3'],
                    'Month_Start': ['2022-03-27', '2022-05-01', '2022-05-01'],
                    'Month_End': ['2022-04-30', '2022-05-28', '2022-06-25']})
df1['Month_Start'] = pd.to_datetime(df1['Month_Start'])
df1['Month_End'] = pd.to_datetime(df1['Month_End'])
df2 = pd.DataFrame({'start_Month': ['2022-03-01', '2022-05-01', '2022-06-01'],
                    'end_Month': ['2022-04-30', '2022-05-30', '2022-06-30'],
                    'price': [1, 2, 3]})
df2['start_Month'] = pd.to_datetime(df2['start_Month'])
df2['end_Month'] = pd.to_datetime(df2['end_Month'])
print(df1)
Month Month_Start Month_End
0 Month1 2022-03-27 2022-04-30
1 Month2 2022-05-01 2022-05-28
2 Month3 2022-05-01 2022-06-25
print(df2) #note validity periods do not overlap, so only 1 price is valid!
start_Month end_Month price
0 2022-03-01 2022-04-30 1
1 2022-05-01 2022-05-30 2
2 2022-06-01 2022-06-30 3
I would define an external function to check the validity period and return the corresponding price. Note that if more than one matching validity period is found, the first one is returned; if none is found, a null value is returned.
def check_validity(row):
    try:
        return int(df2['price'][(df2['start_Month'] <= row['Month_Start']) &
                                (row['Month_End'] <= df2['end_Month'])].values[0])
    except:
        return

df1['price'] = df1.apply(lambda x: check_validity(x), axis=1)
print(df1)
Output:
Month Month_Start Month_End price
0 Month1 2022-03-27 2022-04-30 1.0
1 Month2 2022-05-01 2022-05-28 2.0
2 Month3 2022-05-01 2022-06-25 NaN
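A merge-based alternative may also work here (a sketch, assuming the df1/df2 defined in this answer and pandas 1.2+ for how='cross'): build every df1 x df2 pair, keep the pairs whose month window lies inside the validity window, and take the first matching price per Month.
# all df1 x df2 pairs
pairs = df1[['Month', 'Month_Start', 'Month_End']].merge(df2, how='cross')
# keep pairs whose month window falls inside the price validity window
valid = pairs[(pairs['start_Month'] <= pairs['Month_Start']) &
              (pairs['Month_End'] <= pairs['end_Month'])]
# first matching price per Month, joined back so unmatched months stay NaN
prices = valid.groupby('Month', as_index=False)['price'].first()
result = df1[['Month']].merge(prices, on='Month', how='left')
print(result)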

Problem Subtracting Values from a column for two Dataframes with the same dates in a for loop in Python

I have two dataframes which look like the following:
df1:
DATE Value1 Value2
04.01.05 2.754 2.757
05.01.05 2.7316 2.7505
06.01.05 2.7546 2.7568
07.01.05 2.7465 2.7525
10.01.05 2.7385 2.7415
11.01.05 2.7348 2.7388
12.01.05 2.7348 2.7388
13.01.05 2.7348 2.7388
14.01.05 2.7365 2.7435
17.01.05 2.7365 2.7435
18.01.05 2.7365 2.7435
19.01.05 2.7365 2.7435
df2:
DATE Value1 Value2
04.01.05 2.701 2.6995
05.01.05 2.7065 2.705
07.01.05 2.6348 2.6333
10.01.05 2.635 2.6315
11.01.05 2.6275 2.6265
12.01.05 2.6268 2.6253
13.01.05 2.6285 2.627
17.01.05 2.6565 2.6555
18.01.05 2.6275 2.626
19.01.05 2.643 2.6415
If I have the exact same dates, my code below works. As soon as the dates are not equal and I only want to calculate for dates which are equal, it does not work. My if statement somehow does not filter out the proper dates. I would like to add the calculated value to df1.
My code looks like the following:
import pandas as pd

file1 = 'File1.csv'
file2 = 'File2.csv'
df1 = pd.read_csv(file1, sep=';')
df1['DATE'] = pd.to_datetime(df1.DATE)
df2 = pd.read_csv(file2, sep=';')
df2['DATE'] = pd.to_datetime(df2.DATE)
for date1 in df1['DATE']:
    for date2 in df2['DATE']:
        if date1 == date2:
            print(date1, date2)
            df1['sub'] = df1.Value1 - df2.Value1
print(df1)
The expected output would be the following:
DATE Value1 Value2 LEVEL sub
04.01.05 2.701 2.6995 1 Year 0.053
05.01.05 2.7065 2.705 1 Year 0.0251
07.01.05 2.6348 2.6333 1 Year 0.1117
10.01.05 2.635 2.6315 1 Year 0.1035
11.01.05 2.6275 2.6265 1 Year 0.1073
12.01.05 2.6268 2.6253 1 Year 0.108
13.01.05 2.6285 2.627 1 Year 0.1063
17.01.05 2.6565 2.6555 1 Year 0.08
18.01.05 2.6275 2.626 1 Year 0.109
19.01.05 2.643 2.6415 1 Year 0.0935
This means only the difference will be calculated for equal dates.
First set the index to 'DATE' so that it will align. Then we subtract. Since you seem to want the output added to df2 we will do -(df2 - df1) which is the same as (df1 - df2)
df1 = df1.set_index('DATE')
df2 = df2.set_index('DATE')
df2['sub'] = -df2['Value1'].sub(df1['Value1'])
Value1 Value2 sub
DATE
04.01.05 2.7010 2.6995 0.0530
05.01.05 2.7065 2.7050 0.0251
07.01.05 2.6348 2.6333 0.1117
10.01.05 2.6350 2.6315 0.1035
11.01.05 2.6275 2.6265 0.1073
12.01.05 2.6268 2.6253 0.1080
13.01.05 2.6285 2.6270 0.1063
17.01.05 2.6565 2.6555 0.0800
18.01.05 2.6275 2.6260 0.1090
19.01.05 2.6430 2.6415 0.0935
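Because the subtraction aligns on the DATE index, a date present in only one of the frames comes out as NaN in 'sub'; if only matching dates should remain, those rows can be dropped (a small sketch):
df2 = df2.dropna(subset=['sub'])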
# this merge results in a df with all intersecting dates
df1 = df1.merge(df2, on='DATE', suffixes=('', '_df2'))
# the column you need
df1['sub'] = df1['Value1'] - df1['Value1_df2']
# next drop the unnecessary df2 columns
df1.drop(columns=['Value1_df2', 'Value2_df2'], inplace=True)

Problem in the selection of a part of data in pandas

I have the following data, extracted with pandas from csv files:
df1 = pd.read_csv(path, parse_dates=True)
Printing df1 gives:
control Avg_return
2019-09-07 True 0
2019-06-06 True 0
2019-02-19 True 0
2019-01-17 True 0
2018-12-20 True 0
2018-11-27 True 0
2018-10-12 True 0
... ... ...
Then I load the second csv file:
df2 = pd.read_csv(path, parse_dates=True)
Printing df2 gives:
return
2010-01-01 NaN
2010-04-01 0.010920
2010-05-01 -0.004404
2010-06-01 -0.025209
2010-07-01 -0.023280
... ...
The aim of my code is:
1. Take a date from df1.
2. Subtract 6 days from the date taken in point 1.
3. Subtract 244 days from the date taken in point 1.
4. Take all the returns between these two dates in df2.
5. Compute the mean of these returns and store it in Avg_return.
I did this :
for i in range(0, df1_row):
    # I go through my data df1
    if (control.iloc[i] == True):
        # I check if control_1 is true
        date_1 = df1.index[i] - pd.to_timedelta(6, unit='d')
        # I remove 6 days from my date
        date_2 = df1.index[i] - pd.to_timedelta(244, unit='d')
        # I remove 244 days from my date
        df1.loc[i,"Average_return"] = df2[[date_1:date_2],["return"]].mean()
        # I want to take the mean of the returns between my date - 6 days and my date - 244 days
Unfortunately it gives me this error :
df1.loc[i,"Average_return"] = df2[[date1:date2],["return"]].mean()
^
SyntaxError: invalid syntax
Is someone able to help me? :)
The following looks a bit ugly, but I think it works :)
Dummy df's:
import numpy as np
import pandas as pd

cols = ['date', 'control', 'Avg_return']
data = [
    [pd.to_datetime('2019-09-07'), True, 0],
    [pd.to_datetime('2019-06-06'), True, 0]
]
df1 = pd.DataFrame(data, columns=cols)

cols2 = ['date', 'return']
data2 = [
    [pd.to_datetime('2010-01-01'), np.nan],
    [pd.to_datetime('2010-04-01'), 0.010920],
    [pd.to_datetime('2019-09-01'), 1]
]
df2 = pd.DataFrame(data2, columns=cols2)
Drafted solution:
import datetime as dt

df1['date_minus_6'] = df1['date'] - dt.timedelta(days=6)
df1['date_minus_244'] = df1['date'] - dt.timedelta(days=244)
for i in range(0, df1.shape[0]):
    for j in range(0, df2.shape[0]):
        if df2['date'].iloc[j] == df1['date_minus_6'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
        elif df2['date'].iloc[j] == df1['date_minus_244'].iloc[i]:
            df1['Avg_return'].iloc[i] = (
                df1['Avg_return'].iloc[i] + df2['return'].iloc[j]
            ).mean()
Output:
date control Avg_return date_minus_6 date_minus_244
0 2019-09-07 True 1.0 2019-09-01 2019-01-06
1 2019-06-06 True 0.0 2019-05-31 2018-10-05
import pandas as pd

df1 = pd.read_csv('dsf1.csv', parse_dates=True)
df2 = pd.read_csv('dsf2.csv', parse_dates=True)
df1.columns = ['date', 'control', 'return']
df2.columns = ['date', 'return']
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
for i in range(0, df1.shape[0]):
    if df1['control'][i] == True:
        date_1 = df1['date'][i] - pd.to_timedelta(6, unit='d')
        date_2 = df1['date'][i] - pd.to_timedelta(244, unit='d')
        # I'm not sure if average_return has the correct condition, but adjust as you see fit
        df1.loc[i, 'average_return'] = (df1[df1['date'] > date_1]['return'] - df2[df2['date'] > date_2]['return']).mean()
print(df1)
This is a different approach without looping over all rows:
# make sure your index is a datetime index
df1.index = pd.to_datetime(df1.index)
df1['date_1'] = df1.index - pd.to_timedelta(6, unit='d')
df1['date_2'] = df1.index - pd.to_timedelta(244, unit='d')
df1['Average_return'] = df1.apply(lambda r: df2.loc[r['date_2']:r['date_1'], 'return'].mean(), axis=1)
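For the label slice to behave, df2 needs to be indexed by date with a sorted DatetimeIndex; a small sketch, assuming df2's dates are its index as in the question:
df2.index = pd.to_datetime(df2.index)
df2 = df2.sort_index()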

"Iterative" Window function on subset of dataframe

I am looking for a way to create the column 'min_value' in the dataframe df below. For each row i, we subset the entire dataframe to the records that share row i's ['Date_A', 'Date_B'] grouping and have 'Advance' less than the 'Advance' of row i, and we set 'min_value' for row i to the minimum of 'Amount' over this subset:
Initial dataframe:
dates_A = ['2017-12-25', '2017-12-25', '2017-12-25', '2018-1-25', '2018-1-25', '2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1', '2018-1-1', '2018-1-1', '2018-2-1', '2018-2-1', '2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df = pd.DataFrame({'Date_A': Date_A,
                   'Date_B': Date_B,
                   'Advance': [10, 103, 200, 5, 8, 150],
                   'Amount': [180, 220, 200, 230, 220, 240]})
df = df[['Date_A', 'Date_B', 'Advance', 'Amount']]
df
Desired output:
dates_A = ['2017-12-25', '2017-12-25', '2017-12-25', '2018-1-25', '2018-1-25', '2018-1-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1', '2018-1-1', '2018-1-1', '2018-2-1', '2018-2-1', '2018-2-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A': Date_A,
                       'Date_B': Date_B,
                       'Advance': [10, 103, 200, 5, 8, 150],
                       'Amount': [180, 220, 200, 230, 220, 240],
                       'min_value': [180, 180, 180, 230, 230, 220]})
df_out = df_out[['Date_A', 'Date_B', 'Advance', 'Amount', 'min_value']]
df_out
I wrote the following loop which I think does the job, but it takes far too long to run; I guess there must be much more efficient ways to accomplish this.
for i in range(len(df)):
    date1 = df['Date_A'][i]      # select the Date_A of row i
    date2 = df['Date_B'][i]      # select the Date_B of row i
    advance = df['Advance'][i]   # select the Advance of row i
    # subset the entire dataframe to meet the dates and advance conditions
    df.loc[i, 'min_value'] = df[df['Date_A'] == date1][df['Date_B'] == date2][df['Advance'] < advance]['Amount'].min()
# for the smallest advance value, set min_value to its own amount
df.loc[df['min_value'].isnull(), 'min_value'] = df['Amount']
df
I hope it is clear enough, thanks for your help.
Improvement question
Thanks a lot for the answer. For the last part, the NA rows, I'd like to replace the row's own amount by the overall amount of the Date_A, Date_B, Advance grouping, so that I have the overall minimum of the last day before Date_A.
Improved desired output (two records for the smallest advance value):
dates_A = ['2017-12-25', '2017-12-25', '2017-12-25', '2017-12-25']
Date_A = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_A]
dates_B = ['2018-1-1', '2018-1-1', '2018-1-1', '2018-1-1']
Date_B = [pd.to_datetime(date, format='%Y-%m-%d').date() for date in dates_B]
df_out = pd.DataFrame({'Date_A': Date_A,
                       'Date_B': Date_B,
                       'Advance': [5, 8, 150, 5],
                       'Amount': [230, 220, 240, 225],
                       'min_value': [225, 230, 220, 225]})
df_out = df_out[['Date_A', 'Date_B', 'Advance', 'Amount', 'min_value']]
df_out
Thanks
You can use groupby on 'Date_A' and 'Date_B' after sorting the values by 'Advance', apply cummin and shift to the column 'Amount', then use fillna with the values from the column 'Amount', such as:
df['min_value'] = (df.sort_values('Advance').groupby(['Date_A','Date_B'])['Amount']
.apply(lambda ser_g: ser_g.cummin().shift()).fillna(df['Amount']))
and you get:
Date_A Date_B Advance Amount min_value
0 2017-12-25 2018-01-01 10 180 180.0
1 2017-12-25 2018-01-01 103 220 180.0
2 2017-12-25 2018-01-01 200 200 180.0
3 2018-01-25 2018-02-01 5 230 230.0
4 2018-01-25 2018-02-01 8 220 230.0
5 2018-01-25 2018-02-01 150 240 220.0
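On recent pandas versions the group keys can end up in the index of the .apply result, which breaks the alignment when assigning back to df; passing group_keys=False keeps the original row index (a sketch of the same idea):
df['min_value'] = (df.sort_values('Advance')
                     .groupby(['Date_A', 'Date_B'], group_keys=False)['Amount']
                     .apply(lambda ser_g: ser_g.cummin().shift())
                     .fillna(df['Amount']))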
