Is there a way to join two tables by matching one table's timestamp against another table's timestamp range? For example, one table has start and end timestamps, and I want to match the other table's timestamp when it falls between those start and end times.
table1
start_time end_time ID Ident
01/01/2022 17:56 01/01/2022 17:59 1 1A
01/01/2022 18:36 01/01/2022 18:40 2 1C
01/01/2022 19:48 01/01/2022 19:50 1 2D
01/01/2022 20:12 01/01/2022 20:14 2 4F
01/01/2022 21:47 01/01/2022 21:50 3 7R
01/01/2022 22:56 01/01/2022 22:59 5 2E
01/01/2022 23:57 01/01/2022 23:59 6 3E
Table2
Timestamp rate
01/01/2022 17:57 5
01/01/2022 19:49 5
01/01/2022 20:14 5
01/01/2022 21:47 5
01/01/2022 23:58 5
result
start_time end_time ID Timestamp rate
01/01/2022 17:56 01/01/2022 17:59 1 01/01/2022 17:57 5
01/01/2022 18:36 01/01/2022 18:40 2 null null
01/01/2022 19:48 01/01/2022 19:50 1 01/01/2022 19:49 5
01/01/2022 20:12 01/01/2022 20:14 2 01/01/2022 20:14 5
01/01/2022 21:47 01/01/2022 21:50 3 01/01/2022 21:47 5
01/01/2022 22:56 01/01/2022 22:59 5 null null
01/01/2022 23:57 01/01/2022 23:59 6 01/01/2022 23:58 5
I tried to use pandas.merge_asof in Python, but the problem I'm having is that the result contains duplicate rows from the right table when the times are too close together, which I don't want. Is there a way to do this without duplication, maybe in SQL? I have tried an inner join, but I'm not getting any change in the table, so maybe I did it wrong. This is the code I used:
SELECT *
FROM Table 1
inner Table 2
on Table 2.Timestamp
between Table1.Start_Stamp and Table1.END_TIMESTAMP
Related
I have a dataframe as follows:
id   value  date
001  True   01/01/2022 00:00:00
002  False  03/01/2022 00:00:00
003  True   03/01/2022 00:00:00
001  False  01/01/2022 01:30:00
001  True   01/01/2022 01:30:00
002  True   03/01/2022 00:00:00
003  True   03/01/2022 00:30:00
004  False  03/01/2022 00:30:00
005  False  01/01/2022 00:00:00
There are some duplicate rows in the raw dataframe, and I would like to remove them based on the following conditions:
If there are duplicate ids on the same date and same time, select a row with value "True" (e.g., id = 002)
If there are duplicate ids with same value, select a row with the latest date and time (e.g., id == 003)
If there are duplicate ids, select row with the latest date and time and select a row with value "True" (e.g., id == 001)
Expected output:
id   value  date
001  True   01/01/2022 01:30:00
002  True   03/01/2022 00:00:00
003  True   03/01/2022 00:30:00
004  False  03/01/2022 00:30:00
005  False  01/01/2022 00:00:00
Can somebody suggest how to drop duplicates from the dataframe based on the above conditions?
Thanks.
It looks like perhaps you just need to sort your dataframe prior to dropping duplicates: sorting by date and value in descending order puts the latest timestamp first and, within a tie, True before False, so keeping the first row per id satisfies all three conditions. Something like this:
output = (
    df.sort_values(by=['date', 'value'], ascending=False)  # latest date first; True before False on ties
      .drop_duplicates(subset='id')                        # keep the first (preferred) row per id
      .sort_values(by='id')
)
print(output)
Output
id value date
4 1 True 2022-01-01 01:30:00
5 2 True 2022-03-01 00:00:00
6 3 True 2022-03-01 00:30:00
7 4 False 2022-03-01 00:30:00
8 5 False 2022-01-01 00:00:00
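As a self-contained check, here is a hypothetical reconstruction of the question's frame (note the sort also works if value holds the strings 'True'/'False', since 'True' sorts after 'False'):

import pandas as pd

# Hypothetical reconstruction of the question's data.
df = pd.DataFrame({
    "id": ["001", "002", "003", "001", "001", "002", "003", "004", "005"],
    "value": [True, False, True, False, True, True, True, False, False],
    "date": pd.to_datetime([
        "01/01/2022 00:00:00", "03/01/2022 00:00:00", "03/01/2022 00:00:00",
        "01/01/2022 01:30:00", "01/01/2022 01:30:00", "03/01/2022 00:00:00",
        "03/01/2022 00:30:00", "03/01/2022 00:30:00", "01/01/2022 00:00:00",
    ]),
})

# Latest date first; within a tie, True before False; keep the first row per id.
output = (
    df.sort_values(by=["date", "value"], ascending=False)
      .drop_duplicates(subset="id")
      .sort_values(by="id")
)
print(output)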
Back to the original question about matching a timestamp to a start/end range: you want a range join, with the interval check in the join condition instead of an equality.
SELECT *
FROM table1
LEFT JOIN table2
    ON table2.Timestamp BETWEEN table1.start_time AND table1.end_time
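If you would rather stay in pandas, merge_asof itself can be made duplicate-free: with direction="forward", each table1 row picks at most the first table2 row at or after its start, and any match past end_time can be nulled afterwards. A minimal sketch (not from the thread), assuming both frames already hold real datetimes and use the question's column names:

import pandas as pd

table1 = pd.DataFrame({
    "start_time": pd.to_datetime(["2022-01-01 17:56", "2022-01-01 18:36"]),
    "end_time": pd.to_datetime(["2022-01-01 17:59", "2022-01-01 18:40"]),
    "ID": [1, 2],
})
table2 = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2022-01-01 17:57"]),
    "rate": [5],
})

# direction="forward" matches each row to the first Timestamp >= start_time,
# so every table1 row gets at most one table2 row (no duplicates).
out = pd.merge_asof(
    table1.sort_values("start_time"),
    table2.sort_values("Timestamp"),
    left_on="start_time",
    right_on="Timestamp",
    direction="forward",
)

# Null out matches that land after the interval, as in the desired result.
inside = out["Timestamp"] <= out["end_time"]
out["Timestamp"] = out["Timestamp"].where(inside)
out["rate"] = out["rate"].where(inside)
print(out)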
I have daily weather data:
rain (mm)
date
01/01/2022 0.0
02/01/2022 0.5
03/01/2022 2.0
...
And I have another table (df) broken down by hour
value
datetime
01/01/2022 01:00 x
01/01/2022 02:00 x
01/01/2022 03:00 x
...
And I want to join them like this:
value rain
datetime
01/01/2022 01:00 x 0.0
01/01/2022 02:00 x 0.0
01/01/2022 03:00 x 0.0
...
02/01/2022 01:00 x 0.5
02/01/2022 02:00 x 0.5
02/01/2022 03:00 x 0.5
...
03/01/2022 01:00 x 2.0
03/01/2022 02:00 x 2.0
03/01/2022 03:00 x 2.0
...
(nb: all dates are in %d/%m/%Y format, and all dates are the index of their respective df)
I'm sure there is a straight-forward solution, but I can't find it...
Thanks in advance for any help!
You probably want to resample weather then join df:
weather = weather.resample("H").ffill()
df_out = weather.join(df)
This will keep the resampled index of weather; you might want to keep df's index, the intersection, or the union of all indexes instead. Take a look at the docs and the how kwarg.
Output (default how="left"):
rain (mm) value
date
2022-01-01 00:00:00 0.0 NaN
2022-01-01 01:00:00 0.0 x
2022-01-01 02:00:00 0.0 x
2022-01-01 03:00:00 0.0 x
2022-01-01 04:00:00 0.0 NaN
... ... ...
2022-02-28 20:00:00 0.5 NaN
2022-02-28 21:00:00 0.5 NaN
2022-02-28 22:00:00 0.5 NaN
2022-02-28 23:00:00 0.5 NaN
2022-03-01 00:00:00 2.0 NaN
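Note that resample only works on a DatetimeIndex. If the indexes are still the day-first strings from the question, a sketch of the conversion first (the frame construction here is a hypothetical reconstruction):

import pandas as pd

weather = pd.DataFrame(
    {"rain (mm)": [0.0, 0.5, 2.0]},
    index=["01/01/2022", "02/01/2022", "03/01/2022"],
)
df = pd.DataFrame(
    {"value": ["x", "x", "x"]},
    index=["01/01/2022 01:00", "01/01/2022 02:00", "01/01/2022 03:00"],
)

# resample needs a real DatetimeIndex, so parse the string indexes first.
weather.index = pd.to_datetime(weather.index, format="%d/%m/%Y")
df.index = pd.to_datetime(df.index, format="%d/%m/%Y %H:%M")

# how="inner" keeps only the hours that actually appear in df.
df_out = weather.resample("H").ffill().join(df, how="inner")
print(df_out)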
Under the assumption that the first dataframe is named weather and the second df, let's try the following code:
# Assuming both indexes are DatetimeIndex objects; normalize() maps each
# hourly timestamp to midnight so it lines up with weather's daily index.
df['rain'] = [weather['rain (mm)'][i.normalize()] for i in df.index]
I have a DataFrame with 3 date fields: purchaseDate, releaseDate, and ceaseDate. A sample of the dataframe is seen below.
Product purchaseDate releaseDate ceaseDate
ABC 20/12/2020 01/01/2021 02/01/2022
ZXC 15/01/2021 05/01/2021 02/01/2022
QWE 29/03/2021 06/01/2021 02/01/2022
ASD 13/04/2021 07/01/2021 02/01/2022
If purchaseDate falls between releaseDate and ceaseDate, Active should be populated in a new column Status. If purchaseDate falls outside these two dates, it should show as Inactive. The required output is seen below.
Product purchaseDate releaseDate ceaseDate status
ABC 20/12/2020 01/01/2021 02/01/2022 Inactive
ZXC 04/01/2021 05/01/2021 02/01/2022 Inactive
QWE 29/03/2021 06/01/2021 02/01/2022 Active
ASD 13/04/2021 07/01/2021 02/01/2022 Active
Any assistance that could be provided would be greatly appreciated.
Convert the date columns to datetime type and use the between function:
date_columns = df.filter(regex='Date').columns
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%d/%m/%Y')
Use np.where to insert the value according to the condition:
in_between = df.purchaseDate.between(df.releaseDate, df.ceaseDate)
df['status'] = np.where(in_between, 'Active', 'Inactive')
print(df)
Output
Product purchaseDate releaseDate ceaseDate status
0 ABC 2020-12-20 2021-01-01 2022-01-02 Inactive
1 ZXC 2021-01-15 2021-01-05 2022-01-02 Active
2 QWE 2021-03-29 2021-01-06 2022-01-02 Active
3 ASD 2021-04-13 2021-01-07 2022-01-02 Active
NOTE: Don't forget to import numpy as np
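Putting both steps together as a self-contained sketch (the frame construction is a hypothetical reconstruction of the question's sample):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Product": ["ABC", "ZXC", "QWE", "ASD"],
    "purchaseDate": ["20/12/2020", "15/01/2021", "29/03/2021", "13/04/2021"],
    "releaseDate": ["01/01/2021", "05/01/2021", "06/01/2021", "07/01/2021"],
    "ceaseDate": ["02/01/2022"] * 4,
})

# All three columns contain "Date", so filter(regex='Date') picks them up.
date_columns = df.filter(regex="Date").columns
df[date_columns] = df[date_columns].apply(pd.to_datetime, format="%d/%m/%Y")

# True when purchaseDate lies inside [releaseDate, ceaseDate], inclusive.
in_between = df.purchaseDate.between(df.releaseDate, df.ceaseDate)
df["status"] = np.where(in_between, "Active", "Inactive")
print(df)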
Convert to datetime:
df = df.assign(
    **df.filter(like="Date").transform(pd.to_datetime, format="%d/%m/%Y")
)
Product purchaseDate releaseDate ceaseDate
0 ABC 2020-12-20 2021-01-01 2022-01-02
1 ZXC 2021-01-15 2021-01-05 2022-01-02
2 QWE 2021-03-29 2021-01-06 2022-01-02
3 ASD 2021-04-13 2021-01-07 2022-01-02
Use the between function and map the boolean output to active and inactive:
df.assign(
    status=df.purchaseDate.between(df.releaseDate, df.ceaseDate)
             .map({True: "Active", False: "Inactive"})
)
Product purchaseDate releaseDate ceaseDate status
0 ABC 2020-12-20 2021-01-01 2022-01-02 Inactive
1 ZXC 2021-01-15 2021-01-05 2022-01-02 Active
2 QWE 2021-03-29 2021-01-06 2022-01-02 Active
3 ASD 2021-04-13 2021-01-07 2022-01-02 Active
Problem: return all rows for an ID (1, 2, 3, 4) if there is any instance where the time difference between dissimilar categories (A, B) for that ID is below 60 minutes. This time difference, or 'Delta', should be the minimum between two dissimilar categories within the same 'ID'.
Example df:
ID Category Time
1 A 1:00
1 A 3:00
1 B 3:30
2 A 13:00
2 B 13:15
2 B 1:00
3 B 12:30
3 B 12:00
4 A 1:00
4 B 3:00
4 B 4:00
4 B 4:30
Desired Output. Note that event 2 B 1:00 is included because ID 2 does have an instance where a time difference between dissimilar categories was <=60.
ID Category Time Delta(minutes)
1 A 1:00 150
1 A 3:00 30
1 B 3:30 30
2 A 13:00 15
2 B 13:15 15
2 B 1:00 120
Not this because there is no duration between dissimilar categories:
ID Category Time Delta
3 B 12:00 n/a
3 B 12:30 n/a
Not this because Delta is not < 60min.
ID Category Time Delta
4 A 1:00 120
4 B 3:00 120
4 B 4:00 180
4 B 4:30 240
I've tried using:
df["Delta"] = df["Time"].groupby(df['ID']).diff()
But this does not take into account Category. Any assistance would be much appreciated. Thanks!
Here's a way:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M')

def f(x):
    # Cross join the group with itself, keeping only dissimilar-category pairs.
    x1 = x.assign(key=1).merge(x.assign(key=1), on='key').query('Category_x < Category_y')
    # Use the absolute gap so the order of the two events doesn't matter.
    x1['TimeDiff'] = (x1['Time_y'] - x1['Time_x']).abs()
    return (x1['TimeDiff'] <= pd.Timedelta('60min')).any()

df.groupby('ID').filter(f)
Output:
ID Category Time
0 1 A 1900-01-01 01:00:00
1 1 A 1900-01-01 03:00:00
2 1 B 1900-01-01 03:30:00
3 2 A 1900-01-01 13:00:00
4 2 B 1900-01-01 13:15:00
5 2 B 1900-01-01 01:00:00
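The filter above returns the qualifying rows but not the Delta column from the desired output. A sketch that also attaches, per row, the smallest absolute gap in minutes to any event of the other category within the same ID (assuming that is what Delta means; NaN when an ID has only one category):

import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Category": ["A", "A", "B", "A", "B", "B"],
    "Time": ["1:00", "3:00", "3:30", "13:00", "13:15", "1:00"],
})
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M")

def min_cross_gap(group):
    # For each row, the smallest absolute gap (in minutes) to any row of
    # the other category within the same ID; NaN if there is none.
    def gap(row):
        other = group.loc[group["Category"] != row["Category"], "Time"]
        if other.empty:
            return float("nan")
        return (other - row["Time"]).abs().min().total_seconds() / 60
    return group.apply(gap, axis=1)

df["Delta"] = df.groupby("ID", group_keys=False)[["Category", "Time"]].apply(min_cross_gap)

# Keep the IDs whose best cross-category gap is at most 60 minutes.
result = df[df.groupby("ID")["Delta"].transform("min") <= 60]
print(result)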