Join two datasets by time range - Python

Is there a way to join two tables by matching the timestamps in one table against the start/end interval in another? For example, one table has start and end timestamps, and I want to match each row with the timestamp from the other table that falls between its start and end times.
table1
start_time end_time ID Ident
01/01/2022 17:56 01/01/2022 17:59 1 1A
01/01/2022 18:36 01/01/2022 18:40 2 1C
01/01/2022 19:48 01/01/2022 19:50 1 2D
01/01/2022 20:12 01/01/2022 20:14 2 4F
01/01/2022 21:47 01/01/2022 21:50 3 7R
01/01/2022 22:56 01/01/2022 22:59 5 2E
01/01/2022 23:57 01/01/2022 23:59 6 3E
Table2
Timestamp rate
01/01/2022 17:57 5
01/01/2022 19:49 5
01/01/2022 20:14 5
01/01/2022 21:47 5
01/01/2022 23:58 5
result
start_time end_time ID Timestamp rate
01/01/2022 17:56 01/01/2022 17:59 1 01/01/2022 17:57 5
01/01/2022 18:36 01/01/2022 18:40 2 null null
01/01/2022 19:48 01/01/2022 19:50 1 01/01/2022 19:49 5
01/01/2022 20:12 01/01/2022 20:14 2 01/01/2022 20:14 5
01/01/2022 21:47 01/01/2022 21:50 3 01/01/2022 21:47 5
01/01/2022 22:56 01/01/2022 22:59 5 null null
01/01/2022 23:57 01/01/2022 23:59 6 01/01/2022 23:58 5
I tried to use pandas.merge_asof in Python, but the problem is that the result contains duplicate rows from the right table, which I don't want. Is there a way to do this without duplication, maybe in SQL? I have tried an inner join, but I am not getting any change from the table.

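-- LEFT JOIN keeps table1 rows without a match (Timestamp and rate come back NULL)
SELECT
    table1.start_time,
    table1.end_time,
    table1.ID,
    table2.Timestamp,
    table2.rate
FROM
    table1
LEFT JOIN
    table2
    ON table2.Timestamp >= table1.start_time
    AND table2.Timestamp <= table1.end_time  -- inclusive, so 20:14 matches the 20:12 to 20:14 row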
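On the pandas side, one possible approach is merge_asof with direction="forward", followed by masking any match that falls after end_time. This is only a sketch: the two frames are rebuilt from the sample data in the question, the day-first date format is an assumption, and it keeps at most one Timestamp per interval (the earliest one at or after start_time).

```python
import pandas as pd

fmt = "%d/%m/%Y %H:%M"  # assumed day-first format
table1 = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["01/01/2022 17:56", "01/01/2022 18:36", "01/01/2022 19:48", "01/01/2022 20:12",
         "01/01/2022 21:47", "01/01/2022 22:56", "01/01/2022 23:57"], format=fmt),
    "end_time": pd.to_datetime(
        ["01/01/2022 17:59", "01/01/2022 18:40", "01/01/2022 19:50", "01/01/2022 20:14",
         "01/01/2022 21:50", "01/01/2022 22:59", "01/01/2022 23:59"], format=fmt),
    "ID": [1, 2, 1, 2, 3, 5, 6],
})
table2 = pd.DataFrame({
    "Timestamp": pd.to_datetime(
        ["01/01/2022 17:57", "01/01/2022 19:49", "01/01/2022 20:14",
         "01/01/2022 21:47", "01/01/2022 23:58"], format=fmt),
    "rate": [5, 5, 5, 5, 5],
})

# For each interval, pick the first Timestamp at or after start_time.
# merge_asof needs both frames sorted by their join keys.
merged = pd.merge_asof(
    table1.sort_values("start_time"),
    table2.sort_values("Timestamp"),
    left_on="start_time",
    right_on="Timestamp",
    direction="forward",
)

# Blank out matches that fall after the end of the interval.
inside = merged["Timestamp"] <= merged["end_time"]
merged["Timestamp"] = merged["Timestamp"].where(inside)
merged["rate"] = merged["rate"].where(inside)

print(merged)
```

Every row of table1 appears exactly once in the result. If several timestamps can fall inside one interval and all of them are needed, a cross join followed by a filter would be the alternative, at the cost of more memory.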

Related

Drop duplicate rows from DataFrame based on conditions on multiple columns

I have a dataframe as follows:
id value date
001 True 01/01/2022 00:00:00
002 False 03/01/2022 00:00:00
003 True 03/01/2022 00:00:00
001 False 01/01/2022 01:30:00
001 True 01/01/2022 01:30:00
002 True 03/01/2022 00:00:00
003 True 03/01/2022 00:30:00
004 False 03/01/2022 00:30:00
005 False 01/01/2022 00:00:00
There are some duplicate rows in the raw dataframe, and I would like to remove them based on the following conditions:
If there are duplicate ids on the same date and same time, select a row with value "True" (e.g., id = 002)
If there are duplicate ids with same value, select a row with the latest date and time (e.g., id == 003)
If there are duplicate ids, select row with the latest date and time and select a row with value "True" (e.g., id == 001)
Expected output:
id value date
001 True 01/01/2022 01:30:00
002 True 03/01/2022 00:00:00
003 True 03/01/2022 00:30:00
004 False 03/01/2022 00:30:00
005 False 01/01/2022 00:00:00
Can somebody suggest how to drop duplicates from the dataframe based on the conditions above?
Thanks.
It looks like perhaps you just need to sort your dataframe prior to dropping duplicates. Something like this:
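# 'date' must already be datetime (e.g. via pd.to_datetime) so it sorts chronologically
output = (
    df.sort_values(by=['date', 'value'], ascending=False)  # latest first, True before False
    .drop_duplicates(subset='id')  # keep the first row per id
    .sort_values(by='id')
)
print(output)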
Output
id value date
4 1 True 2022-01-01 01:30:00
5 2 True 2022-03-01 00:00:00
6 3 True 2022-03-01 00:30:00
7 4 False 2022-03-01 00:30:00
8 5 False 2022-01-01 00:00:00
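The sort only gives this result if date holds real datetimes rather than strings. A minimal self-contained sketch, assuming the ids are stored as strings and the dates are day-first (the format string is an assumption; the printed output above suggests pandas parsed them month-first, which selects the same rows here):

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["001", "002", "003", "001", "001", "002", "003", "004", "005"],
    "value": [True, False, True, False, True, True, True, False, False],
    "date": [
        "01/01/2022 00:00:00", "03/01/2022 00:00:00", "03/01/2022 00:00:00",
        "01/01/2022 01:30:00", "01/01/2022 01:30:00", "03/01/2022 00:00:00",
        "03/01/2022 00:30:00", "03/01/2022 00:30:00", "01/01/2022 00:00:00",
    ],
})

# Parse the strings so that sorting is chronological, not alphabetical.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y %H:%M:%S")

output = (
    df.sort_values(by=["date", "value"], ascending=False)  # latest first, True before False
    .drop_duplicates(subset="id")                           # first row per id survives
    .sort_values(by="id")
)
print(output)
```

Note that this always prefers the latest date and only uses value as a tie-breaker, which is how the three conditions in the question appear to combine.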

Join two dataframes based on time interval

Is there a way to join two tables by matching the timestamps in one table against the start/end interval in another? For example, one table has start and end timestamps, and I want to match each row with the timestamp from the other table that falls between its start and end times.
table1
start_time end_time ID Ident
01/01/2022 17:56 01/01/2022 17:59 1 1A
01/01/2022 18:36 01/01/2022 18:40 2 1C
01/01/2022 19:48 01/01/2022 19:50 1 2D
01/01/2022 20:12 01/01/2022 20:14 2 4F
01/01/2022 21:47 01/01/2022 21:50 3 7R
01/01/2022 22:56 01/01/2022 22:59 5 2E
01/01/2022 23:57 01/01/2022 23:59 6 3E
Table2
Timestamp rate
01/01/2022 17:57 5
01/01/2022 19:49 5
01/01/2022 20:14 5
01/01/2022 21:47 5
01/01/2022 23:58 5
result
start_time end_time ID Timestamp rate
01/01/2022 17:56 01/01/2022 17:59 1 01/01/2022 17:57 5
01/01/2022 18:36 01/01/2022 18:40 2 null null
01/01/2022 19:48 01/01/2022 19:50 1 01/01/2022 19:49 5
01/01/2022 20:12 01/01/2022 20:14 2 01/01/2022 20:14 5
01/01/2022 21:47 01/01/2022 21:50 3 01/01/2022 21:47 5
01/01/2022 22:56 01/01/2022 22:59 5 null null
01/01/2022 23:57 01/01/2022 23:59 6 01/01/2022 23:58 5
I tried to use pandas.merge_asof in Python, but the problem is that the result contains duplicate rows from the right table when the times are too close, which I don't want. Is there a way to do this without duplication, maybe in SQL? I have tried an inner join, but I am not getting any change from the table, so maybe I did it wrong. This is the code I used:
SELECT *
FROM Table 1
inner Table 2
on Table 2.Timestamp
between Table1.Start_Stamp and Table1.END_TIMESTAMP

Python: Integrate daily list into df with hourly index

I have daily weather data:
rain (mm)
date
01/01/2022 0.0
02/01/2022 0.5
03/01/2022 2.0
...
And I have another table (df) broken down by hour
value
datetime
01/01/2022 01:00 x
01/01/2022 02:00 x
01/01/2022 03:00 x
...
And I want to join them like this:
value rain
datetime
01/01/2022 01:00 x 0.0
01/01/2022 02:00 x 0.0
01/01/2022 03:00 x 0.0
...
02/01/2022 01:00 x 0.5
02/01/2022 02:00 x 0.5
02/01/2022 03:00 x 0.5
...
03/01/2022 01:00 x 2.0
03/01/2022 02:00 x 2.0
03/01/2022 03:00 x 2.0
...
(nb: all dates are in %d/%m/%Y format, and the dates are the index of their respective df)
I'm sure there is a straight-forward solution, but I can't find it...
Thanks in advance for any help!
You probably want to resample weather then join df:
weather = weather.resample("H").ffill()
df_out = weather.join(df)
This keeps the resampled index of weather. You might want to keep the index of df, the intersection, or the union of both indexes instead; take a look at the documentation for the how keyword argument of join.
Output (default how="left"):
rain (mm) value
date
2022-01-01 00:00:00 0.0 NaN
2022-01-01 01:00:00 0.0 x
2022-01-01 02:00:00 0.0 x
2022-01-01 03:00:00 0.0 x
2022-01-01 04:00:00 0.0 NaN
... ... ...
2022-02-28 20:00:00 0.5 NaN
2022-02-28 21:00:00 0.5 NaN
2022-02-28 22:00:00 0.5 NaN
2022-02-28 23:00:00 0.5 NaN
2022-03-01 00:00:00 2.0 NaN
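Assuming the first dataframe is named 'weather', the second is named 'df', and both have a DatetimeIndex, try the following code. Each hourly timestamp is normalized to midnight so that it matches a date in the weather index:
df['rain'] = [weather['rain (mm)'][i.normalize()] for i in df.index]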
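A vectorised variant of the same daily lookup, as a sketch. The small frames below are rebuilt from the question's sample, and it assumes both indexes are DatetimeIndex objects and that each date occurs only once in weather:

```python
import pandas as pd

weather = pd.DataFrame(
    {"rain (mm)": [0.0, 0.5, 2.0]},
    index=pd.to_datetime(["01/01/2022", "02/01/2022", "03/01/2022"], format="%d/%m/%Y"),
)
df = pd.DataFrame(
    {"value": ["x"] * 6},
    index=pd.to_datetime(
        ["01/01/2022 01:00", "01/01/2022 02:00", "02/01/2022 01:00",
         "02/01/2022 02:00", "03/01/2022 01:00", "03/01/2022 02:00"],
        format="%d/%m/%Y %H:%M",
    ),
)

# Floor every hourly timestamp to midnight, then look up that day's rainfall.
# Hours whose day is missing from weather end up as NaN.
days = df.index.normalize()
df["rain"] = weather["rain (mm)"].reindex(days).to_numpy()

print(df)
```

Unlike the resample approach, this keeps the index of df untouched and does not create rows for hours that df does not contain.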

Python/Pandas: Return all rows for an ID where the duration between dissimilar categories is below 60 minutes

Problem: Return all rows for an ID (1,2,3,4) if there is any instance where the time difference between dissimilar categories (A,B) for that ID is below 60 minutes. This time difference, or 'Delta' should be the minimum between two dissimilar categories within the same 'ID'.
Example df:
ID Category Time
1 A 1:00
1 A 3:00
1 B 3:30
2 A 13:00
2 B 13:15
2 B 1:00
3 B 12:30
3 B 12:00
4 A 1:00
4 B 3:00
4 B 4:00
4 B 4:30
Desired Output. Note that event 2 B 1:00 is included because ID 2 does have an instance where a time difference between dissimilar categories was <=60.
ID Category Time Delta(minutes)
1 A 1:00 150
1 A 3:00 30
1 B 3:30 30
2 A 13:00 15
2 B 13:15 15
2 B 1:00 120
Not this because there is no duration between dissimilar categories:
ID Category Time Delta
3 B 12:00 n/a
3 B 12:30 n/a
Not this because Delta is not < 60min.
ID Category Time Delta
4 A 1:00 120
4 B 3:00 120
4 B 4:00 180
4 B 4:30 240
I've tried using:
df["Delta"] = df["Time"].groupby(df['ID']).diff()
But this does not take into account Category. Any assistance would be much appreciated. Thanks!
Here's a way:
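import pandas as pd

df['Time'] = pd.to_datetime(df['Time'], format='%H:%M')

def f(x):
    # Pair every row with every other row of the same ID, keeping only A-B pairs
    x1 = x.assign(key=1).merge(x.assign(key=1), on='key').query('Category_x < Category_y')
    # Absolute gap, so a B event that precedes an A event is not treated as negative
    x1['TimeDiff'] = (x1['Time_y'] - x1['Time_x']).abs()
    return (x1['TimeDiff'] <= pd.Timedelta(minutes=60)).any()

df.groupby('ID').filter(f)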
Output:
ID Category Time
0 1 A 1900-01-01 01:00:00
1 1 A 1900-01-01 03:00:00
2 1 B 1900-01-01 03:30:00
3 2 A 1900-01-01 13:00:00
4 2 B 1900-01-01 13:15:00
5 2 B 1900-01-01 01:00:00
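The same filter can also be sketched without the self-merge, by comparing the A times and B times of each ID through NumPy broadcasting. This is a swapped-in technique rather than the answer's method; has_close_pair is a hypothetical helper name, and the 60-minute limit is treated as inclusive, as in the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4],
    "Category": list("AABABBBBABBB"),
    "Time": ["1:00", "3:00", "3:30", "13:00", "13:15", "1:00",
             "12:30", "12:00", "1:00", "3:00", "4:00", "4:30"],
})
df["Time"] = pd.to_datetime(df["Time"], format="%H:%M")

LIMIT = np.timedelta64(60, "m")

def has_close_pair(group):
    """True if any A event and any B event of this ID are at most LIMIT apart."""
    a = group.loc[group["Category"] == "A", "Time"].to_numpy()
    b = group.loc[group["Category"] == "B", "Time"].to_numpy()
    if a.size == 0 or b.size == 0:
        return False  # only one category present, so there is no gap to measure
    # Matrix of absolute gaps between every A time and every B time.
    gaps = np.abs(a[:, None] - b[None, :])
    return bool((gaps <= LIMIT).any())

kept = df.groupby("ID").filter(has_close_pair)
print(kept)
```

With the sample data only IDs 1 and 2 survive: ID 3 has no A events, and the closest A-B pair of ID 4 is 120 minutes apart.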

Duplicate line based on start and end date

I could not find a way to do the following:
My data looks as follows:
Time (CET) Start Duration(min) End
2015-02-01 00:00 2015-02-01 00:00 2 2015-02-01 00:02
What I want is for every line (containing entries, a lot do not) to be duplicated based on the duration or end date, like this:
Time (CET) Start Duration(min) End
2015-02-01 00:00 2015-02-01 00:00 2 2015-02-01 00:02
2015-02-01 00:01 2015-02-01 00:00 2 2015-02-01 00:02
2015-02-01 00:02 2015-02-01 00:00 2 2015-02-01 00:02
In the final dataframe the Start and End columns are no longer necessary. I thought about using shift, but was not sure whether it is the right way or how to use the freq argument. Any ideas how to do that?
The Time columns are in datetime format and Time (CET) is the index.
Thanks a ton!
You can repeat rows with Index.repeat and loc, then add timedeltas built from cumcount with to_timedelta to the Time (CET) column:
print (df)
Time (CET) Start Duration(min) End
0 2015-02-01 00:00 2015-02-01 00:00 2 2015-02-01 00:02
1 2015-02-02 00:00 2015-02-02 00:00 3 2015-02-02 00:02
#convert columns to datetimes
c = ['Time (CET)','Start','End']
df[c] = df[c].apply(pd.to_datetime)
df = df.loc[df.index.repeat(df['Duration(min)'] + 1)]
df['Time (CET)'] += pd.to_timedelta(df.groupby(level=0).cumcount(), unit='s') * 60
df = df.reset_index(drop=True).drop(['Start','End'], axis=1)
print (df)
Time (CET) Duration(min)
0 2015-02-01 00:00:00 2
1 2015-02-01 00:01:00 2
2 2015-02-01 00:02:00 2
3 2015-02-02 00:00:00 3
4 2015-02-02 00:01:00 3
5 2015-02-02 00:02:00 3
6 2015-02-02 00:03:00 3
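An alternative sketch of the same expansion builds one minute-by-minute range per row with pd.date_range and then flattens it with explode. It assumes Start is already a datetime column and that Duration(min) counts whole minutes; the single-row frame is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Start": pd.to_datetime(["2015-02-01 00:00"]),
    "Duration(min)": [2],
})

# One list of minute timestamps per row, covering Start through Start + Duration.
df["Time (CET)"] = [
    list(pd.date_range(start, periods=minutes + 1, freq="min"))
    for start, minutes in zip(df["Start"], df["Duration(min)"])
]

# explode turns each list element into its own row; the other columns are repeated.
out = df.explode("Time (CET)").reset_index(drop=True)
out["Time (CET)"] = pd.to_datetime(out["Time (CET)"])
out = out.drop(columns="Start")

print(out)
```

This is easier to read than the repeat/cumcount version but loops over the rows in Python, so it may be slower on large frames.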
