I am new to Quantitative Finance in Python so please bear with me. I have the following data set:
> head(df, 20)
# A tibble: 20 × 15
deal_id book counterparty commodity_name commodity_code executed_date first_delivery_date last_delivery_date last_trading_date volume buy_sell trading_unit tenor delivery_window strategy
<int> <chr> <chr> <chr> <chr> <dttm> <dttm> <dttm> <dttm> <int> <chr> <chr> <chr> <chr> <chr>
1 0 Book_7 Counterparty_3 api2coal ATW 2021-03-07 11:50:24 2022-01-01 00:00:00 2022-12-31 00:00:00 2021-12-31 00:00:00 23000 sell MT year Cal 22 NA
2 1 Book_7 Counterparty_3 oil B 2019-11-10 18:33:39 2022-01-01 00:00:00 2022-12-31 00:00:00 2021-11-30 00:00:00 16000 sell bbl year Cal 22 NA
3 2 Book_4 Counterparty_3 oil B 2021-02-25 11:44:20 2021-04-01 00:00:00 2021-04-30 00:00:00 2021-02-26 00:00:00 7000 buy bbl month Apr 21 NA
4 3 Book_3 Counterparty_3 gold GC 2022-05-27 19:28:48 2022-11-01 00:00:00 2022-11-30 00:00:00 2022-10-31 00:00:00 200 buy oz month Nov 22 NA
5 4 Book_2 Counterparty_3 czpower CZ 2022-09-26 13:14:31 2023-03-01 00:00:00 2023-03-31 00:00:00 2023-02-27 00:00:00 2 buy MW quarter Mar 23 NA
6 5 Book_1 Counterparty_3 depower DE 2022-08-29 10:28:34 2022-10-01 00:00:00 2022-10-31 00:00:00 2022-09-30 00:00:00 23 buy MW month Oct 22 NA
7 6 Book_3 Counterparty_1 api2coal ATW 2022-12-08 08:17:11 2023-01-01 00:00:00 2023-01-31 00:00:00 2022-12-30 00:00:00 29000 sell MT quarter Jan 23 NA
8 7 Book_3 Counterparty_2 depower DE 2020-10-16 17:36:13 2022-03-01 00:00:00 2022-03-31 00:00:00 2022-02-25 00:00:00 3 sell MW quarter Mar 22 NA
9 8 Book_7 Counterparty_1 api2coal ATW 2020-10-13 09:35:24 2021-02-01 00:00:00 2021-02-28 00:00:00 2021-01-29 00:00:00 1000 sell MT quarter Feb 21 NA
10 9 Book_2 Counterparty_1 api2coal ATW 2020-05-19 11:04:39 2022-01-01 00:00:00 2022-12-31 00:00:00 2021-12-31 00:00:00 19000 sell MT year Cal 22 NA
11 10 Book_6 Counterparty_1 oil B 2022-03-03 08:04:04 2022-08-01 00:00:00 2022-08-31 00:00:00 2022-06-30 00:00:00 26000 buy bbl month Aug 22 NA
12 11 Book_3 Counterparty_1 gold GC 2021-05-09 18:08:31 2022-05-01 00:00:00 2022-05-31 00:00:00 2022-04-29 00:00:00 1600 sell oz month May 22 NA
13 12 Book_5 Counterparty_2 oil B 2020-08-20 11:54:34 2021-04-01 00:00:00 2021-04-30 00:00:00 2021-02-26 00:00:00 6000 buy bbl month Apr 21 Strategy_3
14 13 Book_6 Counterparty_2 gold GC 2020-12-23 16:28:55 2021-12-01 00:00:00 2021-12-31 00:00:00 2021-11-30 00:00:00 1700 sell oz month Dec 21 NA
15 14 Book_2 Counterparty_1 depower DE 2021-08-11 12:54:23 2024-01-01 00:00:00 2024-12-31 00:00:00 2023-12-28 00:00:00 15 buy MW year Cal 24 NA
16 15 Book_5 Counterparty_1 czpower CZ 2022-02-15 07:45:24 2022-12-01 00:00:00 2022-12-31 00:00:00 2022-11-30 00:00:00 28 buy MW month Dec 22 Strategy_3
17 16 Book_7 Counterparty_2 oil B 2021-05-19 07:37:05 2022-02-01 00:00:00 2022-02-28 00:00:00 2021-12-31 00:00:00 11000 buy bbl quarter Feb 22 Strategy_3
18 17 Book_4 Counterparty_3 depower DE 2022-02-01 12:34:49 2022-06-01 00:00:00 2022-06-30 00:00:00 2022-05-31 00:00:00 14 sell MW month Jun 22 NA
19 18 Book_2 Counterparty_3 czpower CZ 2022-06-02 09:39:16 2023-02-01 00:00:00 2023-02-28 00:00:00 2023-01-30 00:00:00 21 buy MW quarter Feb 23 NA
20 19 Book_3 Counterparty_1 czpower CZ 2021-10-28 12:41:11 2022-09-01 00:00:00 2022-09-30 00:00:00 2022-08-31 00:00:00 3 sell MW month Sep 22 NA
And I am asked to extract some information from it while applying what is called Yearly and Quarterly Futures Cascading, which I am not familiar with. The question is as follows:
Compute the position size (contracted volume) for a combination of books and commodities, for a selected time in history. The output format should be a data frame with future delivery periods as index (here comes yearly and quarterly cascading), commodities as column names and total volume as values. Provide negative values when the total volume for a given period was sold and positive values when it was bought.
I read some material online about Cascading Futures here and here, but it only gave me a vague idea of what they are about and doesn't help solve the problem at hand, and coding examples in Python are nonexistent.
Can someone please give me a hint as to how to approach this problem? I am a beginner in the field of quantitative finance and any help would be much appreciated.
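To illustrate the output format I am after (ignoring the cascading logic itself, which is the part I don't understand), a plain signed pivot in pandas would be something like:
# sign the volume by direction, then pivot delivery periods against commodities
df['signed_volume'] = df['volume'].where(df['buy_sell'] == 'buy', -df['volume'])
position = df.pivot_table(
    index='delivery_window',    # delivery periods, no cascading applied yet
    columns='commodity_name',
    values='signed_volume',
    aggfunc='sum',
)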
Here I have an extract from my pandas dataframe, which is survey data with two datetime fields. It appears that some of the start and end times were filled in the wrong position in the survey. Here is an example from my dataframe: I suspect the start and end times in the 8th row were entered the wrong way round.
Just to give context, I generated the third column like this:
df_time['trip_duration'] = df_time['tripEnd_time'] - df_time['tripStart_time']
The three columns are in timedelta64 format.
Here is the top of my dataframe:
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 -1 days +22:15:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
What I am trying to do is loop through these two columns, and each time 'tripEnd_time' is less than 'tripStart_time', swap these two entries. So in the case of row 8 above, I would make tripStart_time = tripEnd_time and tripEnd_time = tripStart_time.
I am not quite sure of the best way to approach this. Should I use a nested for loop where I compare each entry in the two columns?
Thanks
Use Series.abs:
df_time['trip_duration'] = (df_time['tripEnd_time'] - df_time['tripStart_time']).abs()
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
Which is the same as:
import numpy as np

a = df_time['tripEnd_time'] - df_time['tripStart_time']
b = df_time['tripStart_time'] - df_time['tripEnd_time']
mask = df_time['tripEnd_time'] > df_time['tripStart_time']
df_time['trip_duration'] = np.where(mask, a, b)
print (df_time)
tripStart_time tripEnd_time trip_duration
1 22:30:00 23:15:00 00:45:00
2 11:00:00 11:30:00 00:30:00
3 09:00:00 09:15:00 00:15:00
4 13:30:00 14:25:00 00:55:00
5 09:00:00 10:15:00 01:15:00
6 12:00:00 12:15:00 00:15:00
7 08:00:00 08:30:00 00:30:00
8 11:00:00 09:15:00 01:45:00
9 14:00:00 14:30:00 00:30:00
10 14:55:00 15:20:00 00:25:00
You can swap the column values on the selected rows:
mask = df_time['tripEnd_time'] < df_time['tripStart_time']
df_time.loc[mask, ['tripStart_time', 'tripEnd_time']] = df_time.loc[
    mask, ['tripEnd_time', 'tripStart_time']].values
I have a dataframe df that contains datetimes for every hour of every day between 2003-02-12 and 2017-06-30, and I want to delete all datetimes between 24th Dec and 1st Jan of EVERY year.
An extract of my data frame is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
7512,2003-12-24 00:00:00
7513,2003-12-24 01:00:00
7514,2003-12-24 02:00:00
7515,2003-12-24 03:00:00
7516,2003-12-24 04:00:00
7517,2003-12-24 05:00:00
7518,2003-12-24 06:00:00
...
7723,2004-01-01 19:00:00
7724,2004-01-01 20:00:00
7725,2004-01-01 21:00:00
7726,2004-01-01 22:00:00
7727,2004-01-01 23:00:00
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
and my expected output is:
...
7505,2003-12-23 17:00:00
7506,2003-12-23 18:00:00
7507,2003-12-23 19:00:00
7508,2003-12-23 20:00:00
7509,2003-12-23 21:00:00
7510,2003-12-23 22:00:00
7511,2003-12-23 23:00:00
...
7728,2004-01-02 00:00:00
7729,2004-01-02 01:00:00
7730,2004-01-02 02:00:00
7731,2004-01-02 03:00:00
7732,2004-01-02 04:00:00
7733,2004-01-02 05:00:00
7734,2004-01-02 06:00:00
7735,2004-01-02 07:00:00
...
Sample dataframe:
dates
0 2003-12-23 23:00:00
1 2003-12-24 05:00:00
2 2004-12-27 05:00:00
3 2003-12-13 23:00:00
4 2002-12-23 23:00:00
5 2004-01-01 05:00:00
6 2014-12-24 05:00:00
Solution:
If you want to exclude these dates for every year, extract the month and day first:
df['month'] = df['dates'].dt.month
df['day'] = df['dates'].dt.day
Now apply the condition check:
dec_days = [24, 25, 26, 27, 28, 29, 30, 31]
## if the month is dec, then check for these dates
## if the month is jan, then just check for the day to be 1 like below
df = df[~(((df.month == 12) & (df.day.isin(dec_days))) | ((df.month == 1) & (df.day == 1)))]
Sample output:
dates month day
0 2003-12-23 23:00:00 12 23
3 2003-12-13 23:00:00 12 13
4 2002-12-23 23:00:00 12 23
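If the helper columns are no longer needed afterwards, drop them:
df = df.drop(columns=['month', 'day'])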
This takes advantage of the fact that date strings in the form mm-dd sort lexicographically in chronological order. Read everything in from the CSV file, then filter for the dates you want:
df = pd.read_csv('...', parse_dates=['DateTime'])
s = df['DateTime'].dt.strftime('%m-%d')
excluded = (s == '01-01') | (s >= '12-24') # Jan 1 or >= Dec 24
df[~excluded]
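A quick check of the lexicographic comparisons the filter relies on:
>>> '12-25' >= '12-24'
True
>>> '12-23' >= '12-24'
False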
You can try dropping rows on conditionals, either with a pattern match on the date string or by parsing the date and removing rows conditionally.
datesIdontLike = df[df['colname'] == <stringPattern>].index
newDF = df.drop(datesIdontLike)  # note: with inplace=True, drop returns None
Check this out: https://thispointer.com/python-pandas-how-to-drop-rows-in-dataframe-by-conditions-on-column-values/
(If you have issues, let me know.)
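For this question specifically, a concrete version of the same idea (a sketch assuming the datetime column is named 'date' and already parsed) could be:
datesIdontLike = df[((df['date'].dt.month == 12) & (df['date'].dt.day >= 24)) |
                    ((df['date'].dt.month == 1) & (df['date'].dt.day == 1))].index
newDF = df.drop(datesIdontLike)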
You can use pandas boolean filtering with strftime:
# version 0.23.4
import pandas as pd
# make df
df = pd.DataFrame(pd.date_range('20181223', '20190103', freq='H'), columns=['date'])
# string format the date to only include the month and day
# then keep rows strictly less than '12-24' AND greater than or equal to '01-02'
df = df.loc[
(df.date.dt.strftime('%m-%d') < '12-24') &
(df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2018-12-23 00:00:00
1 2018-12-23 01:00:00
2 2018-12-23 02:00:00
3 2018-12-23 03:00:00
4 2018-12-23 04:00:00
5 2018-12-23 05:00:00
6 2018-12-23 06:00:00
7 2018-12-23 07:00:00
8 2018-12-23 08:00:00
9 2018-12-23 09:00:00
10 2018-12-23 10:00:00
11 2018-12-23 11:00:00
12 2018-12-23 12:00:00
13 2018-12-23 13:00:00
14 2018-12-23 14:00:00
15 2018-12-23 15:00:00
16 2018-12-23 16:00:00
17 2018-12-23 17:00:00
18 2018-12-23 18:00:00
19 2018-12-23 19:00:00
20 2018-12-23 20:00:00
21 2018-12-23 21:00:00
22 2018-12-23 22:00:00
23 2018-12-23 23:00:00
240 2019-01-02 00:00:00
241 2019-01-02 01:00:00
242 2019-01-02 02:00:00
243 2019-01-02 03:00:00
244 2019-01-02 04:00:00
245 2019-01-02 05:00:00
246 2019-01-02 06:00:00
247 2019-01-02 07:00:00
248 2019-01-02 08:00:00
249 2019-01-02 09:00:00
250 2019-01-02 10:00:00
251 2019-01-02 11:00:00
252 2019-01-02 12:00:00
253 2019-01-02 13:00:00
254 2019-01-02 14:00:00
255 2019-01-02 15:00:00
256 2019-01-02 16:00:00
257 2019-01-02 17:00:00
258 2019-01-02 18:00:00
259 2019-01-02 19:00:00
260 2019-01-02 20:00:00
261 2019-01-02 21:00:00
262 2019-01-02 22:00:00
263 2019-01-02 23:00:00
264 2019-01-03 00:00:00
This will work with multiple years because we are only filtering on the month and day.
# change range to include 2017
df = pd.DataFrame(pd.date_range('20171223', '20190103', freq='H'), columns=['date'])
df = df.loc[
(df.date.dt.strftime('%m-%d') < '12-24') &
(df.date.dt.strftime('%m-%d') >= '01-02')
].copy()
print(df)
date
0 2017-12-23 00:00:00
1 2017-12-23 01:00:00
2 2017-12-23 02:00:00
3 2017-12-23 03:00:00
4 2017-12-23 04:00:00
5 2017-12-23 05:00:00
6 2017-12-23 06:00:00
7 2017-12-23 07:00:00
8 2017-12-23 08:00:00
9 2017-12-23 09:00:00
10 2017-12-23 10:00:00
11 2017-12-23 11:00:00
12 2017-12-23 12:00:00
13 2017-12-23 13:00:00
14 2017-12-23 14:00:00
15 2017-12-23 15:00:00
16 2017-12-23 16:00:00
17 2017-12-23 17:00:00
18 2017-12-23 18:00:00
19 2017-12-23 19:00:00
20 2017-12-23 20:00:00
21 2017-12-23 21:00:00
22 2017-12-23 22:00:00
23 2017-12-23 23:00:00
240 2018-01-02 00:00:00
241 2018-01-02 01:00:00
242 2018-01-02 02:00:00
243 2018-01-02 03:00:00
244 2018-01-02 04:00:00
245 2018-01-02 05:00:00
... ...
8779 2018-12-23 19:00:00
8780 2018-12-23 20:00:00
8781 2018-12-23 21:00:00
8782 2018-12-23 22:00:00
8783 2018-12-23 23:00:00
9000 2019-01-02 00:00:00
9001 2019-01-02 01:00:00
9002 2019-01-02 02:00:00
9003 2019-01-02 03:00:00
9004 2019-01-02 04:00:00
9005 2019-01-02 05:00:00
9006 2019-01-02 06:00:00
9007 2019-01-02 07:00:00
9008 2019-01-02 08:00:00
9009 2019-01-02 09:00:00
9010 2019-01-02 10:00:00
9011 2019-01-02 11:00:00
9012 2019-01-02 12:00:00
9013 2019-01-02 13:00:00
9014 2019-01-02 14:00:00
9015 2019-01-02 15:00:00
9016 2019-01-02 16:00:00
9017 2019-01-02 17:00:00
9018 2019-01-02 18:00:00
9019 2019-01-02 19:00:00
9020 2019-01-02 20:00:00
9021 2019-01-02 21:00:00
9022 2019-01-02 22:00:00
9023 2019-01-02 23:00:00
9024 2019-01-03 00:00:00
Since you want this to happen for every year, we can first define a series where we replace the year with a static value (2000, for example). Let date be the column that stores the date; we can generate such a series as:
dt = pd.to_datetime({'year': 2000, 'month': df['date'].dt.month, 'day': df['date'].dt.day})
For the given sample data, we get:
>>> dt
0 2000-12-23
1 2000-12-23
2 2000-12-23
3 2000-12-23
4 2000-12-23
5 2000-12-23
6 2000-12-23
7 2000-12-24
8 2000-12-24
9 2000-12-24
10 2000-12-24
11 2000-12-24
12 2000-12-24
13 2000-12-24
14 2000-01-01
15 2000-01-01
16 2000-01-01
17 2000-01-01
18 2000-01-01
19 2000-01-02
20 2000-01-02
21 2000-01-02
22 2000-01-02
23 2000-01-02
24 2000-01-02
25 2000-01-02
26 2000-01-02
dtype: datetime64[ns]
Next we can filter the rows, like:
from datetime import date
df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
This gives us the following for your sample data:
>>> df[(dt >= date(2000,1,2)) & (dt < date(2000,12,24))]
id dt
0 7505 2003-12-23 17:00:00
1 7506 2003-12-23 18:00:00
2 7507 2003-12-23 19:00:00
3 7508 2003-12-23 20:00:00
4 7509 2003-12-23 21:00:00
5 7510 2003-12-23 22:00:00
6 7511 2003-12-23 23:00:00
19 7728 2004-01-02 00:00:00
20 7729 2004-01-02 01:00:00
21 7730 2004-01-02 02:00:00
22 7731 2004-01-02 03:00:00
23 7732 2004-01-02 04:00:00
24 7733 2004-01-02 05:00:00
25 7734 2004-01-02 06:00:00
26 7735 2004-01-02 07:00:00
So regardless of what the year is, we will only consider dates between the 2nd of January and the 23rd of December (both inclusive).
I have a df such as:
ID | Half Hour Bucket | clock in time | clock out time | Rate
232 | 4/1/19 8:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54
342 | 4/1/19 8:30 PM | 4/1/19 7:12 PM | 4/1/19 7:22 PM | 0.23
232 | 4/1/19 7:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54
I want my output to be
ID | Half Hour Bucket | clock in time | clock out time | Rate | Mins
232 | 4/1/19 8:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |
342 | 4/1/19 8:30 PM | 4/1/19 7:12 PM | 4/1/19 7:22 PM | 0.23 |
232 | 4/1/19 7:00 PM | 4/1/19 7:12 PM | 4/1/19 10:45 PM | 0.54 |
Where Mins represents the difference between clock out time and clock in time.
But each row should only contain the minutes that fall within that row's half hour bucket.
For example, for ID 342 it would be ten minutes, and the 10 mins would be on that row.
But for ID 232 the clock in to clock out time spans 3 hours. I would only want the 30 mins for 8:00 to 8:30 in the first row and the 18 mins in the third row. For the minutes in half hour buckets like 8:30-9:00 or 9:00-9:30 that don't exist in the first row, I would want to create new rows in the same df that contain NaNs for everything except the half hour bucket and mins fields.
The 30 mins from 8:00-8:30 would stay in the first row, but I would want 5 new rows for all the half hour buckets that aren't 4/1/19 8:00 PM, with only the half hour bucket and the rate carrying over from the original row. Is this possible?
I thank anyone for their time!
Realised my first answer probably wasn't what you wanted. This version, hopefully, is. It was a bit more involved than I first assumed!
Create Data
First of all create a dataframe to work with, based on that supplied in the question. The resultant formatting isn't quite the same but that would be easily fixed, so I've left it as-is here.
import math
import numpy as np
import pandas as pd
# Create a dataframe to work with from the data provided in the question
columns = ['id', 'half_hour_bucket', 'clock_in_time', 'clock_out_time' , 'rate']
data = [[232, '4/1/19 8:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54],
[342, '4/1/19 8:30 PM', '4/1/19 7:12 PM', '4/1/19 07:22 PM ', 0.23],
[232, '4/1/19 7:00 PM', '4/1/19 7:12 PM', '4/1/19 10:45 PM', 0.54]]
df = pd.DataFrame(data, columns=columns)
def convert_cols_to_dt(df):
# Convert relevant columns to datetime format
for col in df:
if col not in ['id', 'rate']:
df[col] = pd.to_datetime(df[col])
return df
df = convert_cols_to_dt(df)
# Create the mins column
df['mins'] = (df.clock_out_time - df.clock_in_time)
Output:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000
1 342 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 0 days 00:10:00.000000000
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 0 days 03:33:00.000000000
Solution
Next, define a simple function that returns a list whose length equals the number of 30-minute intervals in the mins column.
def upsample_list(x):
multiplier = math.ceil(x.total_seconds() / (60 * 30))
return list(range(multiplier))
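For example, the 3h33m duration from the sample spans eight half-hour intervals:
>>> upsample_list(pd.Timedelta('03:33:00'))
[0, 1, 2, 3, 4, 5, 6, 7]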
And apply this to the dataframe:
df['samples'] = df.mins.apply(upsample_list)
Next, create a new row for each list item in the 'samples' column (using the answer provided by Roman Pekar here):
s = df.apply(lambda x: pd.Series(x['samples']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'sample'
Join s to the dataframe and clean up the extra columns:
df = df.drop('samples', axis=1).join(s, how='inner').drop('sample', axis=1)
Which gives us this:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
0 232 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 342 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
2 232 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
Nearly there!
Reset the index:
df = df.reset_index(drop=True)
Set duplicate rows to NaN:
df = df.mask(df.duplicated())
Which gives:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232.0 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 NaN NaT NaT NaT NaN NaT
2 NaN NaT NaT NaT NaN NaT
3 NaN NaT NaT NaT NaN NaT
4 NaN NaT NaT NaT NaN NaT
5 NaN NaT NaT NaT NaN NaT
6 NaN NaT NaT NaT NaN NaT
7 NaN NaT NaT NaT NaN NaT
8 342.0 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
9 232.0 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
10 NaN NaT NaT NaT NaN NaT
11 NaN NaT NaT NaT NaN NaT
12 NaN NaT NaT NaT NaN NaT
13 NaN NaT NaT NaT NaN NaT
14 NaN NaT NaT NaT NaN NaT
15 NaN NaT NaT NaT NaN NaT
16 NaN NaT NaT NaT NaN NaT
Lastly, forward fill the half_hour_bucket and rate columns.
df[['half_hour_bucket', 'rate']] = df[['half_hour_bucket', 'rate']].ffill()
Final output:
id half_hour_bucket clock_in_time clock_out_time rate mins
0 232.0 2019-04-01 20:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
1 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
2 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
3 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
4 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
5 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
6 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
7 NaN 2019-04-01 20:00:00 NaT NaT 0.54 NaT
8 342.0 2019-04-01 20:30:00 2019-04-01 19:12:00 2019-04-01 19:22:00 0.23 00:10:00
9 232.0 2019-04-01 19:00:00 2019-04-01 19:12:00 2019-04-01 22:45:00 0.54 03:33:00
10 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
11 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
12 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
13 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
14 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
15 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
16 NaN 2019-04-01 19:00:00 NaT NaT 0.54 NaT
I have two data frames like the following: data frame A has datetimes down to the minute, while data frame B only has the hour.
df:A
dataDate original
2018-09-30 11:20:00 3
2018-10-01 12:40:00 10
2018-10-02 07:00:00 5
2018-10-27 12:50:00 5
2018-11-28 19:45:00 7
df:B
dataDate count
2018-09-30 10:00:00 300
2018-10-01 12:00:00 50
2018-10-02 07:00:00 120
2018-10-27 12:00:00 234
2018-11-28 19:05:00 714
I would like to merge the two on the basis of date and hour, so that dataframe A gets all its rows filled by the merge on date and hour.
I can try to do it via
A['date'] = A.dataDate.dt.date
B['date'] = B.dataDate.dt.date
A['hour'] = A.dataDate.dt.hour
B['hour'] = B.dataDate.dt.hour
and then merge
merge_df = pd.merge(A, B, how='left', on=['date', 'hour'])
but it's a very long process. Is there an efficient way to perform the same operation with pandas' time series or date functionality?
Use map if you only need to append one column from B to A, with floor to set the minutes and seconds (if present) to 0:
d = dict(zip(B.dataDate.dt.floor('H'), B['count']))
A['count'] = A.dataDate.dt.floor('H').map(d)
print (A)
dataDate original count
0 2018-09-30 11:20:00 3 NaN
1 2018-10-01 12:40:00 10 50.0
2 2018-10-02 07:00:00 5 120.0
3 2018-10-27 12:50:00 5 234.0
4 2018-11-28 19:45:00 7 714.0
For a general solution use DataFrame.join:
A.index = A.dataDate.dt.floor('H')
B.index = B.dataDate.dt.floor('H')
A = A.join(B, lsuffix='_left')
print (A)
dataDate_left original dataDate count
dataDate
2018-09-30 11:00:00 2018-09-30 11:20:00 3 NaT NaN
2018-10-01 12:00:00 2018-10-01 12:40:00 10 2018-10-01 12:00:00 50.0
2018-10-02 07:00:00 2018-10-02 07:00:00 5 2018-10-02 07:00:00 120.0
2018-10-27 12:00:00 2018-10-27 12:50:00 5 2018-10-27 12:00:00 234.0
2018-11-28 19:00:00 2018-11-28 19:45:00 7 2018-11-28 19:05:00 714.0
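If you prefer a single merge instead, the same floor trick works with an explicit key column (a sketch; the 'key' helper column is made up for this purpose):
A['key'] = A.dataDate.dt.floor('H')
B['key'] = B.dataDate.dt.floor('H')
merge_df = A.merge(B, on='key', how='left', suffixes=('_A', '_B')).drop(columns='key')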