How can I measure if there is overlap in begin time to end time within each group using Python?

How can I see whether the start and end times overlap within each group (by ID)? That is to say, did two "services" from one employee (ID) occur together for any length of time? I have a table like the following, but would like to calculate the Overlap column.
| ID | Begin Time     | End Time       | Overlap |
| 1  | 1/1/2023 13:30 | 1/1/2023 13:55 | False   |
| 1  | 1/7/2023 12:30 | 1/1/2023 13:45 | False   |
| 2  | 1/3/2023 15:30 | 1/3/2023 16:30 | True    |
| 1  | 1/5/2023 07:30 | 1/5/2023 08:30 | True    |
| 2  | 1/3/2023 14:55 | 1/3/2023 15:55 | True    |
| 1  | 1/5/2023 06:30 | 1/5/2023 09:30 | True    |
| 1  | 1/7/2023 06:30 | 1/7/2023 09:30 | True    |
| 1  | 1/7/2023 06:00 | 1/7/2023 06:45 | True    |
Here is a chunk of code that creates this dataframe:
import pandas as pd

id_list = [1, 1, 2, 1, 2, 1, 1, 1]
begin_time = ['1/1/2023 13:30', '1/7/2023 12:30', '1/3/2023 15:30', '1/5/2023 07:30', '1/3/2023 14:55',
              '1/5/2023 06:30', '1/7/2023 06:30', '1/7/2023 06:00']
end_time = ['1/1/2023 13:55', '1/1/2023 13:45', '1/3/2023 16:30', '1/5/2023 08:30', '1/3/2023 15:55',
            '1/5/2023 09:30', '1/7/2023 09:30', '1/7/2023 06:45']
df = pd.DataFrame(list(zip(id_list, begin_time, end_time)), columns=['ID', 'Begin_Time', 'End_Time'])
df['Begin_Time'] = pd.to_datetime(df['Begin_Time'])
df['End_Time'] = pd.to_datetime(df['End_Time'])
df

Use Interval.overlaps in a custom function, with enumerate used to exclude each row's own interval:
import numpy as np

def f(x):
    i = pd.IntervalIndex.from_arrays(x['Begin_Time'],
                                     x['End_Time'],
                                     closed="both")
    a = np.arange(len(x))
    # for each interval, test overlap against every other interval in the group
    x['overlap'] = [i[a != j].overlaps(y).any() for j, y in enumerate(i)]
    return x

df = df.groupby('ID').apply(f)
print (df)
ID Begin_Time End_Time overlap
0 1 2023-01-01 13:30:00 2023-01-01 13:55:00 False
1 1 2023-01-08 12:30:00 2023-01-08 13:45:00 False <- data was changed (the original row's End_Time preceded its Begin_Time)
2 2 2023-01-03 15:30:00 2023-01-03 16:30:00 True
3 1 2023-01-05 07:30:00 2023-01-05 08:30:00 True
4 2 2023-01-03 14:55:00 2023-01-03 15:55:00 True
5 1 2023-01-05 06:30:00 2023-01-05 09:30:00 True
6 1 2023-01-07 06:30:00 2023-01-07 09:30:00 True
7 1 2023-01-07 06:00:00 2023-01-07 06:45:00 True
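
If the groups are large, the quadratic pairwise check above can get slow. A vectorized alternative is a sketch of my own, not part of the original answer: sort each group by Begin_Time, then a row overlaps an earlier one exactly when its start is not after the running maximum of the earlier end times (closed intervals, hence <=). The overlap_alt column name is mine, and this assumes End_Time is never earlier than Begin_Time, i.e. the corrected data shown in the output above:

df = df.sort_values(['ID', 'Begin_Time'])
# running maximum of earlier End_Time values within each ID
prev_end = df.groupby('ID')['End_Time'].transform(lambda s: s.cummax().shift())
hits_prev = df['Begin_Time'] <= prev_end
# propagate the flag back one row within the group, so the overlapped row is marked too
next_hits = hits_prev.shift(-1, fill_value=False) & df['ID'].eq(df['ID'].shift(-1))
df['overlap_alt'] = hits_prev | next_hits

The backward propagation of a single step suffices because the groups are sorted by start: if a row overlaps anything before its immediate predecessor, that predecessor is itself overlapped as well.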

Related

Filling Missing Date Column using groupby method

I have a dataframe that looks something like:
+---+----+---------------+------------+------------+
| | id | date1 | date2 | days_ahead |
+---+----+---------------+------------+------------+
| 0 | 1 | 2021-10-21 | 2021-10-24 | 3 |
| 1 | 1 | 2021-10-22 | NaN | NaN |
| 2 | 1 | 2021-11-16 | 2021-11-24 | 8 |
| 3 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 4 | 2 | 2021-10-22 | 2021-10-24 | 2 |
| 5 | 3 | 2021-10-26 | 2021-10-31 | 5 |
| 6 | 3 | 2021-10-30 | 2021-11-04 | 5 |
| 7 | 3 | 2021-11-02 | NaN | NaN |
| 8 | 3 | 2021-11-04 | 2021-11-04 | 0 |
| 9 | 4 | 2021-10-28 | NaN | NaN |
+---+----+---------------+------------+------------+
I am trying to fill the missing data with the days_ahead median of each id group.
For example:
The median for id 1 is 5.5, which rounds to 6,
so the filled value of date2 at index 1 should be 2021-10-28.
Similarly, for id 3 the median is 5,
so the filled value of date2 at index 7 should be 2021-11-07.
And for id 4 the median is NaN,
so the filled value of date2 at index 9 should be 2021-10-28.
I tried
df['date2'].fillna(df.groupby('id')['days_ahead'].transform('median'), inplace = True)
But this fills date2 with plain numbers, not dates.
Although I could use apply with a lambda to detect the numbers and turn them into dates, how do I use groupby and fillna together directly?
You can round the values, convert them with to_timedelta, add them to date1 using the fill_value parameter, and use the result to replace the missing values:
df['date1'] = pd.to_datetime(df['date1'])
df['date2'] = pd.to_datetime(df['date2'])
td = pd.to_timedelta(df.groupby('id')['days_ahead'].transform('median').round(), unit='d')
df['date2'] = df['date2'].fillna(df['date1'].add(td, fill_value=pd.Timedelta(0)))
print (df)
id date1 date2 days_ahead
0 1 2021-10-21 2021-10-24 3.0
1 1 2021-10-22 2021-10-28 NaN
2 1 2021-11-16 2021-11-24 8.0
3 2 2021-10-22 2021-10-24 2.0
4 2 2021-10-22 2021-10-24 2.0
5 3 2021-10-26 2021-10-31 5.0
6 3 2021-10-30 2021-11-04 5.0
7 3 2021-11-02 2021-11-07 NaN
8 3 2021-11-04 2021-11-04 0.0
9 4 2021-10-28 2021-10-28 NaN
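
One subtlety worth spelling out: for id 4 the group median is NaN, so the computed timedelta is NaT, and fill_value=pd.Timedelta(0) makes the addition fall back to date1 unchanged, which is why index 9 ends up as 2021-10-28. A minimal demo of that fallback (illustrative values and variable names of my own):

td = pd.Series([pd.NaT], dtype='timedelta64[ns]')  # a missing offset
d1 = pd.Series(pd.to_datetime(['2021-10-28']))
# NaT is replaced by 0 days before the addition, so the date passes through
print(d1.add(td, fill_value=pd.Timedelta(0)))      # 0   2021-10-28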

How can I set last rows of a dataframe based on condition in Python?

I have one dataframe, df1, with two columns. The first column, 'col1', is a datetime column, and the second is an int column with only two possible values (0 or 1). Here is an example of the dataframe:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 1 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
As you can see, the datetimes are sorted in ascending order. What I would like is: for each different date (in this example there are two different dates, 2020-01-01 and 2020-01-02, with different times) I would like to keep the first 1 value and set the previous and following values on that date to 0. So the resulting dataframe would be:
+----------------------+----------+
| col1 | col2 |
+----------------------+----------+
| 2020-01-01 10:00:00 | 0 |
+----------------------+----------+
| 2020-01-01 11:00:00 | 1 |
+----------------------+----------+
| 2020-01-01 12:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 11:00:00 | 0 |
+----------------------+----------+
| 2020-01-02 12:00:00 | 1 |
+----------------------+----------+
| ... | ... |
+----------------------+----------+
How can I do it in Python?
Use:
df['col1'] = pd.to_datetime(df.col1)
mask = df.groupby(df.col1.dt.date)['col2'].cumsum().eq(1)
df.col2.where(mask, 0, inplace = True)
Output:
>>> df
col1 col2
0 2020-01-01 10:00:00 0
1 2020-01-01 11:00:00     1
2 2020-01-01 12:00:00 0
3 2020-01-02 11:00:00 0
4 2020-01-02 12:00:00 1
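
A side note, not part of the original answer: calling where(..., inplace=True) on a column pulled out via attribute access relies on chained assignment, which can silently stop updating the parent frame under pandas' copy-on-write mode (the default in newer versions). An equivalent assignment form, same logic but not in place:

# mask is the groupby/cumsum mask computed above
df['col2'] = df['col2'].where(mask, 0)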

How to create rows that fill the time between events in Python

I am building a data frame for survival analysis starting from 2018-01-01 00:00:00 and ending TODAY. I have two columns with start and end times, but only for the events that occurred, associated with an ID.
However, I need to add rows covering the times during which no event was observed.
Here I show what I have:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
And what I need is:
+--------+-----+-----+---------------------+---------------------+
| State | ID1 | ID2 | Start_Time | End_Time |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2018-01-01 00:00:00 | 2019-12-04 04:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 04:00:00 | 2019-12-04 19:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-04 19:30:00 | 2019-12-08 06:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-08 06:30:00 | 2019-12-20 10:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-20 10:00:00 | 2019-12-22 11:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 11:00:00 | 2019-12-22 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-22 23:00:00 | 2019-12-26 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-26 08:00:00 | 2019-12-29 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State1 | 111 | AA1 | 2019-12-29 16:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-01-01 00:00:00 | 2018-09-19 08:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-19 08:00:00 | 2018-09-20 04:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-20 04:30:00 | 2018-09-25 16:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-25 16:30:00 | 2018-09-26 23:00:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-26 23:00:00 | 2018-09-27 01:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 01:30:00 | 2018-09-27 10:30:00 |
+--------+-----+-----+---------------------+---------------------+
| State2 | 112 | AA2 | 2018-09-27 10:30:00 | TODAY |
+--------+-----+-----+---------------------+---------------------+
I have tried this code (borrowed from How to find the start time and end time of an event in python? and the answer provided by Fredy Montaño, below), but it gives me only the sequence of events, not the desired rows:
fill_date = []
for item in range(1, df.shape[0], 1):
    if (df['End_Time'][item-1] - df['Start_Time'][item]) == 0:
        pass
    else:
        fill_date.append([df["State"][item-1], df["ID1"][item-1], df["ID2"][item-1], df['End_Time'][item-1], df['Start_Time'][item]])
df_add = pd.DataFrame(fill_date)
df_add.columns = ["State", "ID1", "ID2", 'Start_Time', 'End_Time']
df_output = pd.concat([df[["State", "ID1", "ID2", "Start_Time", "End_Time"]], df_add], axis=0)
df_output = df_output.sort_values(["State", "ID2", "Start_Time"], ascending=True)
I think I have to put a condition on the State, ID1, and ID2 variables so as not to take times from the previous groups.
Any suggestions?
Maybe this solution works for you.
I slice the dataframe to take only the dates, but you can repeat it taking the states and IDs into account:
df = df[['Start_Time', 'End_Time']]
fill_date = []
for item in range(1, df.shape[0], 1):
    if df['Start_Time'][item] - df['End_Time'][item-1] == 0:
        pass
    else:
        fill_date.append([df['End_Time'][item-1], df['Start_Time'][item]])
df_add = pd.DataFrame(fill_date)
df_add.columns = ['Start_Time', 'End_Time']
and finally, I do a concat to join your original dataframe with the new dataframe of unobserved-event dates:
df_final = pd.concat([df, df_add], axis=0)
df_final.sort_index()
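
Extending that idea, here is a group-aware sketch of my own (not Fredy Montaño's answer); it assumes the column names from the question, the 2018-01-01 window start, and today as the window end, and fills the gaps per (State, ID1, ID2) group:

import pandas as pd

WINDOW_START = pd.Timestamp('2018-01-01')
WINDOW_END = pd.Timestamp.today().normalize()   # "TODAY"

def add_gaps(g):
    g = g.sort_values('Start_Time')
    # gap candidates: from each previous End_Time (or the window start) to the next Start_Time
    gaps = pd.DataFrame({
        'Start_Time': g['End_Time'].shift(fill_value=WINDOW_START),
        'End_Time': g['Start_Time'],
    })
    # closing gap: from the last observed End_Time to today
    tail = pd.DataFrame({'Start_Time': [g['End_Time'].iloc[-1]], 'End_Time': [WINDOW_END]})
    gaps = pd.concat([gaps, tail], ignore_index=True)
    gaps = gaps[gaps['Start_Time'] < gaps['End_Time']]   # drop zero-length gaps
    for col in ['State', 'ID1', 'ID2']:
        gaps[col] = g[col].iloc[0]
    return pd.concat([g, gaps], ignore_index=True).sort_values('Start_Time')

df_final = (df.groupby(['State', 'ID1', 'ID2'], group_keys=False)
              .apply(add_gaps)
              .reset_index(drop=True))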

Interpolate time series and resample/pivot. How to get the expected output

I have a df that looks like this:
Video | Start               | End                 | Duration |
vid1  | 2018-10-02 16:00:29 | 2018-10-02 20:07:05 | 246      |
vid2  | 2018-10-04 16:03:08 | 2018-10-04 16:10:11 | 7        |
vid3  | 2018-10-04 10:13:40 | 2018-10-06 12:07:38 | 113      |
What I want to do is resample the dataframe into 10-minute bins based on the start column and assign 1 if the video was running during that timestamp and 0 if not.
The desired output is:
Start | vid1 | vid2 | vid3 |
2018-10-02 16:00:00| 1 | 0 | 0 |
2018-10-02 16:10:00| 1 | 0 | 0 |
...
2018-10-04 16:10:00| 0 | 1 | 0 |
2018-10-04 16:20:00| 0 | 0 | 1 |
The output is presented only to visualize the idea, so it may contain errors.
The problem is that I cannot resample the dataframe in a way that produces the desired crosstab output.
Try this:
df.apply(lambda x: pd.Series(x['Video'],
index=pd.date_range(x['Start'].floor('10T'),
x['End'].ceil('10T'),
freq='10T')), axis=1)\
.stack().str.get_dummies().reset_index(level=0, drop=True)
Output:
vid1 vid2 vid3
2018-10-02 16:00:00 1 0 0
2018-10-02 16:10:00 1 0 0
2018-10-02 16:20:00 1 0 0
2018-10-02 16:30:00 1 0 0
2018-10-02 16:40:00 1 0 0
... ... ... ...
2018-10-06 11:30:00 0 0 1
2018-10-06 11:40:00 0 0 1
2018-10-06 11:50:00 0 0 1
2018-10-06 12:00:00 0 0 1
2018-10-06 12:10:00 0 0 1
[330 rows x 3 columns]
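
The snippet assumes Start and End are already parsed as datetimes; a sketch of that setup, built from the question's table:

import pandas as pd

df = pd.DataFrame({
    'Video': ['vid1', 'vid2', 'vid3'],
    'Start': pd.to_datetime(['2018-10-02 16:00:29', '2018-10-04 16:03:08', '2018-10-04 10:13:40']),
    'End':   pd.to_datetime(['2018-10-02 20:07:05', '2018-10-04 16:10:11', '2018-10-06 12:07:38']),
})

For each row, floor('10T') and ceil('10T') snap the start and end to the 10-minute grid, date_range enumerates the bins the video spans, and stack followed by str.get_dummies pivots the video names into 0/1 indicator columns. (On recent pandas, '10min' replaces the deprecated '10T' alias.)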

Extracting the max consecutive missing values between the first and last value within a dataframe

I have a dataset that has columns which start on different dates:
| Date | Hour | A | B | C | D |
--------------------------------------
| 01/01/2012 | 01:00 | | 1 | 2 | |
| 01/01/2012 | 03:00 | | | | 1 |
| 01/01/2012 | 07:00 | | 5 | | |
| 15/04/2012 | 01:00 | 1 | | 2 | 3 |
| 16/01/2013 | 05:00 | 1 | 1 | | |
I want to extract the maximum number of consecutive missing values in each column, excluding missing values that fall outside each column's first and last entries. I am currently using:
df['Consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
When df looks like:
| A | Count |
-------------
| | True |
| | True |
| | True |
| 1 | False |
| 1 | False |
Here Max Consecutive should be 0 (the statement above currently gives 3),
or
| D | Count |
-------------
| | True |
| 1 | False |
| | True |
| 3 | False |
| | True |
Here Max Consecutive should be 1,
and so on.
This gets the consecutive missing rows, but I can't figure out how to exclude the areas outside the data-collection range.
I believe I either need to restrict the calculation to the range between each column's first and last entries or drop the leading and trailing blank records, but I'm not sure how to go about doing this.
Any help would be greatly appreciated. Cheers.
I think you need:
print (df)
Date Hour A B C D
0 01/01/2012 01:00 NaN 1.0 2.0 NaN
1 01/01/2012 03:00 NaN NaN NaN 1.0
2 01/01/2012 07:00 NaN 5.0 NaN NaN
3 15/04/2012 01:00 1.0 NaN 2.0 3.0
4 16/01/2013 05:00 1.0 1.0 NaN NaN
5 01/01/2012 01:00 NaN 1.0 2.0 NaN
6 01/01/2012 03:00 NaN NaN NaN 1.0
7 01/01/2012 07:00 NaN NaN NaN NaN
8 15/04/2012 01:00 1.0 NaN 2.0 3.0
9 16/01/2013 05:00 1.0 1.0 NaN NaN
df = df.set_index(['Date','Hour'])
# leading/trailing NaNs per column: still NaN after a forward fill or a back fill
m = df.ffill().isnull() | df.bfill().isnull()
# interior missing cells only
a = (df.isnull() & ~m)
# length of each consecutive run of True values, per column, then the max
b = a.cumsum()
c = (b-b.mask(a).ffill().fillna(0)).max()
print (c)
print (c)
A 3.0
B 3.0
C 2.0
D 2.0
dtype: float64
Detail:
print (a)
A B C D
Date Hour
01/01/2012 01:00 False False False False
03:00 False True True False
07:00 False False True True
15/04/2012 01:00 False True False False
16/01/2013 05:00 False False True True
01/01/2012 01:00 True False False True
03:00 True True True False
07:00 True True True True
15/04/2012 01:00 False True False False
16/01/2013 05:00 False False False False
Explanation:
First create a boolean mask, by forward and back filling NaNs, to exclude each column's leading and trailing missing values.
Then count the consecutive True values per column and take the maximum.
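
A minimal sketch of the run-length step on a single boolean column (illustrative values of my own, where True marks an interior missing cell):

import pandas as pd

a = pd.Series([False, True, True, False, True])
b = a.cumsum()                       # 0, 1, 2, 2, 3
# subtracting the cumsum frozen at the last False resets the count at each run
c = b - b.mask(a).ffill().fillna(0)  # 0, 1, 2, 0, 1
print(c.max())                       # 2.0 -> longest consecutive True run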
