Pandas getting next available datetime index if desired one doesn't exist - python

So I have a DataFrame, f, with weekly indexes:
Open High Low Close Volume
Date
2017-07-24 5.05 5.120 5.010 5.19 16306737.0
2017-07-31 5.31 5.475 5.280 5.24 45182199.0
2017-08-07 5.69 5.740 5.640 5.67 10167161.0
2017-08-14 5.65 5.680 5.440 5.76 28296416.0
2017-08-21 5.49 5.605 5.480 5.55 16126060.0
2017-08-28 6.00 6.030 5.940 5.95 19398271.0
2017-09-04 5.86 5.965 5.845 6.01 20218389.0
2017-09-11 5.98 6.030 5.830 5.98 15812289.0
2017-09-18 5.71 5.770 5.540 5.81 30786508.0
2017-09-25 5.16 5.190 5.090 5.17 13641128.0
I want to parse a datetime object to it, if that datetime object exists in the index then I'll use the data in that row, otherwise if it doesn't exist in the index then grab the next row after where my parsed date would be.
E.g: if I parse f.loc[(datetime.datetime(2017, 09, 07)]
then that isn't in the index so I want it to grab the row
2017-09-11 5.98 6.030 5.830 5.98 15812289.0
since that is the next indexed date after 7 September.

One straightforward solution is using np.searchsorted:
df.iloc[[np.searchsorted(df.index, '2017-09-07')]]
Open High Low Close Volume
Date
2017-09-11 5.98 6.03 5.83 5.98 15812289.0
Details
df
Open High Low Close Volume
Date
2017-07-24 5.05 5.120 5.010 5.19 16306737.0
2017-07-31 5.31 5.475 5.280 5.24 45182199.0
2017-08-07 5.69 5.740 5.640 5.67 10167161.0
2017-08-14 5.65 5.680 5.440 5.76 28296416.0
2017-08-21 5.49 5.605 5.480 5.55 16126060.0
2017-08-28 6.00 6.030 5.940 5.95 19398271.0
2017-09-04 5.86 5.965 5.845 6.01 20218389.0
2017-09-11 5.98 6.030 5.830 5.98 15812289.0
2017-09-18 5.71 5.770 5.540 5.81 30786508.0
2017-09-25 5.16 5.190 5.090 5.17 13641128.0
df.index.dtype
dtype('<M8[ns]')

Related

Pandas pivot_table. How to change sorting

I have this dataframe df:
alpha1 week_day calendar_week
0 2.49 Freitag 2022-04-(01/07)
1 1.32 Samstag 2022-04-(01/07)
2 2.70 Sonntag 2022-04-(01/07)
3 3.81 Montag 2022-04-(01/07)
4 3.58 Dienstag 2022-04-(01/07)
5 3.48 Mittwoch 2022-04-(01/07)
6 1.79 Donnerstag 2022-04-(01/07)
7 2.12 Freitag 2022-04-(08/14)
8 2.41 Samstag 2022-04-(08/14)
9 1.78 Sonntag 2022-04-(08/14)
10 3.19 Montag 2022-04-(08/14)
11 3.33 Dienstag 2022-04-(08/14)
12 2.88 Mittwoch 2022-04-(08/14)
13 2.98 Donnerstag 2022-04-(08/14)
14 3.01 Freitag 2022-04-(15/21)
15 3.04 Samstag 2022-04-(15/21)
16 2.72 Sonntag 2022-04-(15/21)
17 4.11 Montag 2022-04-(15/21)
18 3.90 Dienstag 2022-04-(15/21)
19 3.16 Mittwoch 2022-04-(15/21)
and so on, with ascending calendar weeks.
I performed a pivot table to generate a heatmap.
df_pivot = pd.pivot_table(df, values=['alpha1'], index=['week_day'], columns=['calendar_week'])
What I get is:
alpha1 \
calendar_week 2022-(04-29/05-05) 2022-(05-27/06-02) 2022-(07-29/08-04)
week_day
Dienstag 3.32 2.09 4.04
Donnerstag 3.27 2.21 4.65
Freitag 2.83 3.08 4.19
Mittwoch 3.22 3.14 4.97
Montag 2.83 2.86 4.28
Samstag 2.62 3.62 3.88
Sonntag 2.81 3.25 3.77
\
calendar_week 2022-(08-26/09-01) 2022-04-(01/07) 2022-04-(08/14)
week_day
Dienstag 2.92 3.58 3.33
Donnerstag 3.58 1.79 2.98
Freitag 3.96 2.49 2.12
Mittwoch 3.09 3.48 2.88
Montag 3.85 3.81 3.19
Samstag 3.10 1.32 2.41
Sonntag 3.39 2.70 1.78
As you see the sorting of the pivot table is messed up. I need the same sorting for the columns (calendar weeks) as in the original dataframe.
I have been looking all over but couldn't find how to achieve this.
Would be also very nice, if the sorting of the rows remains the same.
Any help will be greatly appreciated
UPDATE
I didn't paste all the data. It would have been too long
The calendar_week column consist of following elements
'2022-04-(01/07)',
'2022-04-(08/14)',
'2022-04-(15/21)',
'2022-04-(22/28)',
'2022-(04-29/05-05)',
'2022-05-(06/12)',
'2022-05-(13/19)',
'2022-05-(20/26)',
'2022-(05-27/06-02)',
'2022-06-(03/09)'
'2022-06-(10/16)'
'2022-06-(17/23)'
'2022-06-(24/30)'
'2022-07-(01/07)'
'2022-07-(08/14)'
'2022-07-(15/21)'
'2022-07-(22/28)'
'2022-(07-29/08-04)'
'2022-08-(05/11)'
etc....
Each occurs 7 times in df. It represents a calendar week.
The sorting is the natural time sorting.
After pivoting the dataframe, the sorting of the column get messed up. And I guess it's due to the 2 different types: 2022-(07-29/08-04) and 2022-07-(15/21).
Try running this:
df_pivot.sort_values(by = ['calendar_week'], axis = 1, ascending = True)
I got the following output. Is this what you wanted?
calendar_week
2022-04-(01/07)
2022-04-(08/14)
2022-04-(15/21)
week_day
Dienstag
3.58
3.33
3.90
Donnerstag
1.79
2.98
NaN
Freitag
2.49
2.12
3.01
Mittwoch
3.48
2.88
3.16
Montag
3.81
3.19
4.11
be sure to remove the NaN values using the fillna() function.
I hope that answers it. :)
You can use an ordered Categorical for your week days and sort the dates after pivoting with sort_index:
# define the desired order of the days
days = ['Montag', 'Dienstag', 'Mittwoch', 'Donnerstag',
'Freitag', 'Samstag', 'Sonntag']
df_pivot = (df
.assign(week_day=pd.Categorical(df['week_day'], categories=days,
ordered=True))
.pivot_table(values='alpha1', index='week_day',
columns='calendar_week')
.sort_index(axis=1)
)
output:
calendar_week 2022-04-(01/07) 2022-04-(08/14) 2022-04-(15/21)
week_day
Montag 3.81 3.19 4.11
Dienstag 3.58 3.33 3.90
Mittwoch 3.48 2.88 3.16
Donnerstag 1.79 2.98 NaN
Freitag 2.49 2.12 3.01
Samstag 1.32 2.41 3.04
Sonntag 2.70 1.78 2.72

How to sample data from Pandas DataFrame where data is present for every hour of a given day

I wish to create a DataFrame where each row is one day, and the columns provide the date, hourly data, and maximum minimum of the day's data. Here is an example (I provide the input data further down in the question):
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
My input DataFrame has a row for each hour, with the date & time, mean, max, and min for each hour as its columns.
I wish to iterate through each day in the input DataFrame and do the following:
Check that there is a row for each hour of the day
Check that there is both maximum and minimum data for each hour of the day
If the conditions above are met, I wish to:
Add a row to the output DataFrame for the given date
Use the date to fill the 'Date_time' cell for the row
Transpose the hourly data to the hourly cells
Find the max of the hourly max data, and use it to fill the max cell for the row
Find the min of the hourly min data, and use it to fill the min cell for the row
Example daily input data examples follow.
Example 1
All hours for day available
Max & min available for each hour
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
0 2019-02-03 00:00:00 18.6 18.7 18.5
1 2019-02-03 01:00:00 18.6 18.7 18.5
2 2019-02-03 02:00:00 18.2 18.5 18.0
3 2019-02-03 03:00:00 18.0 18.0 17.9
4 2019-02-03 04:00:00 18.0 18.1 17.9
5 2019-02-03 05:00:00 18.3 18.4 18.1
6 2019-02-03 06:00:00 18.7 19.1 18.4
7 2019-02-03 07:00:00 20.1 21.3 19.1
8 2019-02-03 08:00:00 21.7 22.9 21.0
9 2019-02-03 09:00:00 23.2 23.9 22.8
10 2019-02-03 10:00:00 23.7 24.1 23.3
11 2019-02-03 11:00:00 24.6 25.5 24.0
12 2019-02-03 12:00:00 25.1 25.7 24.7
13 2019-02-03 13:00:00 24.5 25.0 24.2
14 2019-02-03 14:00:00 23.9 25.3 21.2
15 2019-02-03 15:00:00 19.6 21.2 18.8
16 2019-02-03 16:00:00 19.2 19.5 18.7
17 2019-02-03 17:00:00 19.8 19.9 19.4
18 2019-02-03 18:00:00 19.6 19.8 19.5
19 2019-02-03 19:00:00 19.3 19.4 19.1
20 2019-02-03 20:00:00 19.2 19.4 19.1
21 2019-02-03 21:00:00 19.3 19.4 18.9
22 2019-02-03 22:00:00 18.8 19.0 18.7
23 2019-02-03 23:00:00 19.0 19.1 18.9
Example 2
All hours for day available
Max & min available for each hour
NaN values for some Mean_temp entries
Proceed to create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
24 2019-02-04 00:00:00 18.9 19.0 18.9
25 2019-02-04 01:00:00 18.8 18.9 18.7
26 2019-02-04 02:00:00 18.6 18.8 18.4
27 2019-02-04 03:00:00 18.4 18.6 18.1
28 2019-02-04 04:00:00 18.7 18.9 18.4
29 2019-02-04 05:00:00 18.8 18.8 18.7
30 2019-02-04 06:00:00 19.0 19.3 18.8
31 2019-02-04 07:00:00 19.7 20.4 19.3
32 2019-02-04 08:00:00 21.4 22.8 20.3
33 2019-02-04 09:00:00 23.5 23.9 22.8
34 2019-02-04 10:00:00 25.7 23.6
35 2019-02-04 11:00:00 26.5 25.4
36 2019-02-04 12:00:00 27.1 26.1
37 2019-02-04 13:00:00 25.8 26.8 24.8
38 2019-02-04 14:00:00 25.4 27.8 23.7
39 2019-02-04 15:00:00 22.1 24.1 20.2
40 2019-02-04 16:00:00 21.8 22.6 20.2
41 2019-02-04 17:00:00 20.9 22.4 19.6
42 2019-02-04 18:00:00 18.9 19.6 18.6
43 2019-02-04 19:00:00 18.8 18.9 18.6
44 2019-02-04 20:00:00 18.9 19.0 18.8
45 2019-02-04 21:00:00 18.8 18.9 18.7
46 2019-02-04 22:00:00 18.8 18.9 18.7
47 2019-02-04 23:00:00 18.9 19.2 18.7
Example 3
Not all hours of the day are available
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
48 2019-02-05 00:00:00 19.2 19.3 19.0
49 2019-02-05 01:00:00 19.3 19.4 19.3
50 2019-02-05 02:00:00 19.3 19.4 19.2
51 2019-02-05 03:00:00 19.4 19.5 19.4
52 2019-02-05 04:00:00 19.5 19.6 19.3
53 2019-02-05 05:00:00 19.3 19.5 19.1
54 2019-02-05 06:00:00 20.1 20.6 19.2
55 2019-02-05 07:00:00 21.1 21.7 20.6
56 2019-02-05 08:00:00 22.3 23.2 21.7
57 2019-02-05 15:00:00 25.3 25.8 25.0
58 2019-02-05 16:00:00 25.8 26.0 25.2
59 2019-02-05 17:00:00 24.3 25.2 23.3
60 2019-02-05 18:00:00 22.5 23.3 22.1
61 2019-02-05 19:00:00 21.6 22.1 21.1
62 2019-02-05 20:00:00 21.1 21.3 20.9
63 2019-02-05 21:00:00 21.2 21.3 20.9
64 2019-02-05 22:00:00 20.9 21.0 20.6
65 2019-02-05 23:00:00 19.9 20.6 19.7
Example 4
All hours of the day are available
Max and/or min have at least one NaN value
Do not create row in output DataFrame
Date_time Mean_temp Max_temp Min_temp
66 2019-02-06 00:00:00 19.7 19.8 19.7
67 2019-02-06 01:00:00 19.6 19.7 19.3
68 2019-02-06 02:00:00 19.0 19.3 18.6
69 2019-02-06 03:00:00 18.5 18.6 18.4
70 2019-02-06 04:00:00 18.6 18.7 18.4
71 2019-02-06 05:00:00 18.5 18.6
72 2019-02-06 06:00:00 19.0 19.6 18.5
73 2019-02-06 07:00:00 20.3 21.2 19.6
74 2019-02-06 08:00:00 21.5 21.7 21.2
75 2019-02-06 09:00:00 21.4 22.3 20.9
76 2019-02-06 10:00:00 23.5 24.4 22.3
77 2019-02-06 11:00:00 24.7 25.4 24.3
78 2019-02-06 12:00:00 24.9 25.5 23.9
79 2019-02-06 13:00:00 23.4 24.0 22.9
80 2019-02-06 14:00:00 23.3 23.8 22.9
81 2019-02-06 15:00:00 24.4 23.7
82 2019-02-06 16:00:00 24.9 25.1 24.7
83 2019-02-06 17:00:00 24.4 24.9 23.8
84 2019-02-06 18:00:00 22.5 23.8 21.7
85 2019-02-06 19:00:00 20.8 21.8 19.6
86 2019-02-06 20:00:00 19.1 19.6 18.9
87 2019-02-06 21:00:00 19.0 19.1 18.9
88 2019-02-06 22:00:00 19.1 19.1 19.0
89 2019-02-06 23:00:00 19.1 19.1 19.0
Just to recap, the above inputs would create the following output:
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.3 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
1 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 25.8 25.4 22.1 21.8 21.0 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1
I've had a really good think about this, and I can only come up with a horrible set of if statements that I known will be terribly slow and will take ages to write (apologies, this is due to me being bad at coding)!
Does anyone have any pointers to Pandas functions that could begin to deal with this problem efficiently?
You can use a groupby on the day of the Date_time column, and build each row of your final_df from each group (moving to the next iteration of the groupby whenever there are any missing values in the max_temp or min_temp columns, or whenever the length of the group is less than 24)
Note that I assuming that your Date_time column is of type datetime64[ns]. If it isn't, you should run the line: df['Date_time'] = pd.to_datetime(df['Date_time'])
all_hours = list(pd.date_range(start='1/1/22 00:00:00', end='1/1/22 23:00:00', freq='h').strftime('%H:%M'))
final_df = pd.DataFrame(columns=['Date_time'] + all_hours + ['Max','Min'])
## construct final_df by using a groupby on the day of the 'Date_time' column
for group,df_group in df.groupby(df['Date_time'].dt.date):
## check if NaN is in either 'Max Temp' or 'Min Temp' columns
new_df_data = {}
if (df_group[['Max_temp','Min_temp']].isnull().sum().sum() == 0) & (len(df_group) == 24):
## create a dictionary for the new row of the final_df
new_df_data['Date_time'] = group
new_df_data.update(dict(zip(all_hours, [[val] for val in df_group['Mean_temp']])))
new_df_data['Max'], new_df_data['Min'] = df_group['Max_temp'].max(), df_group['Min_temp'].min()
final_df = pd.concat([final_df, pd.DataFrame(new_df_data)])
else:
continue
Output:
>>> final_df
Date_time 00:00 01:00 02:00 03:00 04:00 05:00 06:00 07:00 08:00 09:00 10:00 11:00 12:00 13:00 14:00 15:00 16:00 17:00 18:00 19:00 20:00 21:00 22:00 23:00 Max Min
0 2019-02-03 18.6 18.6 18.2 18.0 18.0 18.3 18.7 20.1 21.7 23.2 23.7 24.6 25.1 24.5 23.9 19.6 19.2 19.8 19.6 19.3 19.2 19.3 18.8 19.0 25.7 17.9
0 2019-02-04 18.9 18.8 18.6 18.4 18.7 18.8 19.0 19.7 21.4 23.5 NaN NaN NaN 25.8 25.4 22.1 21.8 20.9 18.9 18.8 18.9 18.8 18.8 18.9 27.8 18.1

Python Pandas: How to combine or merge two difrent size dataframes based on dates

I like to merge or combine two dataframes of different size df1 and df2, based on a range of dates, for example:
df1:
Date Open High Low
2021-07-01 8.43 8.44 8.22
2021-07-02 8.36 8.4 8.28
2021-07-06 8.22 8.23 8.06
2021-07-07 8.1 8.19 7.98
2021-07-08 8.07 8.1 7.91
2021-07-09 7.97 8.11 7.92
2021-07-12 8 8.2 8
2021-07-13 8.15 8.18 8.06
2021-07-14 8.18 8.27 8.12
2021-07-15 8.21 8.26 8.06
2021-07-16 8.12 8.23 8.07
df2:
Day of month Revenue Earnings
01 45000 4000
07 43500 5000
12 44350 6000
15 39050 7000
results should be something like this:
combination:
Date Open High Low Earnings
2021-07-01 8.43 8.44 8.22 4000
2021-07-02 8.36 8.4 8.28 4000
2021-07-06 8.22 8.23 8.06 4000
2021-07-07 8.1 8.19 7.98 5000
2021-07-08 8.07 8.1 7.91 5000
2021-07-09 7.97 8.11 7.92 5000
2021-07-12 8 8.2 8 6000
2021-07-13 8.15 8.18 8.06 6000
2021-07-14 8.18 8.27 8.12 6000
2021-07-15 8.21 8.26 8.06 7000
2021-07-16 8.12 8.23 8.07 7000
The Earnings column is merged based on a range of date, how can I do this in python pandas?
Try merge_asof
#df1.date=pd.to_datetime(df1.date)
df1['Day of month'] = df1.Date.dt.day
out = pd.merge_asof(df1, df2, on ='Day of month', direction = 'backward')
out
Out[213]:
Date Open High Low Day of month Revenue Earnings
0 2021-07-01 8.43 8.44 8.22 1 45000 4000
1 2021-07-02 8.36 8.40 8.28 2 45000 4000
2 2021-07-06 8.22 8.23 8.06 6 45000 4000
3 2021-07-07 8.10 8.19 7.98 7 43500 5000
4 2021-07-08 8.07 8.10 7.91 8 43500 5000
5 2021-07-09 7.97 8.11 7.92 9 43500 5000
6 2021-07-12 8.00 8.20 8.00 12 44350 6000
7 2021-07-13 8.15 8.18 8.06 13 44350 6000
8 2021-07-14 8.18 8.27 8.12 14 44350 6000
9 2021-07-15 8.21 8.26 8.06 15 39050 7000
10 2021-07-16 8.12 8.23 8.07 16 39050 7000
A more general approach is the following:
First you introduce a key both dataframes share.
In this case, the day of the month (or, potentially, multiple keys like day of the month and month). df1["day"] = df1["Date"].dt.day
If you were to merge (leftjoin df2 on df1) now, you wouldn't have enough keys in df2, as there are days missing. To fill the gaps, we could interpolate, or use the naïve approach: If we don't know the Revenue / Earnings for a specific day, we take the last known one and apply no further calculation. One way to achieve this is described here: How to replace NaNs by preceding or next values in pandas DataFrame? df.fillna(method='ffill')
Now we merge on our key. Following the doc https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html , we do it like this: df1.merge(df2, left_on='day')
Voilà!

30 Day distance between dates in datetime64[ns] column

I have data of the following form:
6460 2001-07-24 00:00:00 67.5 75.1 75.9 71.0 75.2 81.8
6490 2001-06-24 00:00:00 68.4 74.9 76.1 70.9 75.5 82.7
6520 2001-05-25 00:00:00 69.6 74.7 76.3 70.8 75.5 83.2
6550 2001-04-25 00:00:00 69.2 74.6 76.1 70.6 75.0 83.1
6580 2001-03-26 00:00:00 69.1 74.4 75.9 70.5 74.3 82.8
6610 2001-02-24 00:00:00 69.0 74.0 75.3 69.8 73.8 81.9
6640 2001-01-25 00:00:00 68.9 73.9 74.6 69.7 73.5 80.0
6670 2000-12-26 00:00:00 69.0 73.5 75.0 69.5 72.6 81.8
6700 2000-11-26 00:00:00 69.8 73.2 75.1 69.5 72.0 82.7
6730 2000-10-27 00:00:00 70.3 73.1 75.0 69.4 71.3 82.6
6760 2000-09-27 00:00:00 69.4 73.0 74.8 69.4 71.0 82.3
6790 2000-08-28 00:00:00 69.6 72.8 74.6 69.2 70.7 81.9
6820 2000-07-29 00:00:00 67.8 72.9 74.4 69.1 70.6 81.8
I want all the dates to have a 30 day difference between each other. I know how to add a specific day or month to a datetime object with something like
ndfd = ndf['Date'].astype('datetime64[ns]')
ndfd = ndfd.apply(lambda dt: dt.replace(day=15))
But this does not take into account the difference in days from month to month.
How can I ensure there is a consistent step in days from month to month in my data, given that I am able to change the day as long as it remains on the same month?
You could use date_range:
df['date'] = pd.date_range(start=df['date'][0], periods=len(df), freq='30D')
IIUC you could change your date column like this:
import datetime
a = df.iloc[0,0] # first date, assuming date col is first
df['date'] = [a + datetime.timedelta(days=30 * i) for i in range(len(df))]
I haven't tested this so not sure it work as smooth as I thought it will =).
You can transform your first day into ordinal, add 30*i to it and then transform it back.
first_day=df.iloc[0]['date_column'].toordinal()
df['date']=(first_day+30*i for i in range(len(df))).fromordinal

Select rows that lie within datetime intervals

I'm trying to compare two dataframes and drop rows from the first dataframe that aren't between the dates in the second dataframe (or...selecting those rows that are between the dates in the 2nd dataframe). The selections should be inclusive. This might be really simple but its just not clicking for me right now.
Example data is below. For dataframe 1, this can be generated using daily data starting July 1 2018 and ending November 30 2018 with random numbers in the 'number' column. The ... in the dataframe 1 are meant used to show skipping data but the data is there in the real dataframe.
Dataframe 1:
Number
Date
2018-07-01 15.2
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-07-05 19.1
...
2018-09-15 30.4
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-09-19 23.4
...
2018-11-01 30.8
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
2018-11-05 43.8
Dataframe 2:
Change
Date
2018-07-02 Start
2018-07-04 End
2018-09-16 Start
2018-09-18 End
2018-11-02 Start
2018-11-04 End
With the example above, the output should be:
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
You can try this, I hope the Start and End comes one after the other and is sorted.
df3 = pd.concat([df[i:j] for i,j in zip(df2.loc[df2['Change']=='Start'].index, df2.loc[df2['Change']=='End'].index)]))
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7
You can build an IntervalIndex from df2's index and search in logarithmic time.
df2.index = pd.to_datetime(df2.index)
idx = pd.IntervalIndex.from_arrays(df2.index[df.Change == 'Start'],
df2.index[df.Change == 'End'],
closed='both')
df1[idx.get_indexer(pd.to_datetime(df1.index)) > -1]
Number
Date
2018-07-02 17.3
2018-07-03 19.5
2018-07-04 13.7
2018-09-16 25.7
2018-09-17 21.2
2018-09-18 19.7
2018-11-02 47.2
2018-11-03 25.3
2018-11-04 39.7

Categories