How can I group the dataframe
import numpy as np
import pandas as pd

np.random.seed(42)
days = pd.date_range(start='1/1/2018', end='12/31/2019')
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'col1': days, 'col2': data})
print(df.head())
col1 col2
0 2018-01-01 52
1 2018-01-02 93
2 2018-01-03 15
3 2018-01-04 72
4 2018-01-05 61
by day of year, such that the resulting data frame looks like
min
01-01 ...
01-02 ...
01-03 ...
01-04 ...
01-05 ...
... ...
i.e. contains the min values over col2 for each date, where the index represents month and day, e.g. 01-02 is January 2nd?
I believe you need Series.dt.strftime with %m for the month and %j for the zero-padded day of year:
df = df.groupby(df['col1'].dt.strftime('%m-%j'))['col2'].min()
print (df)
col1
01-001 30
01-002 93
01-003 15
01-004 6
01-005 61
..
12-361 18
12-362 47
12-363 17
12-364 14
12-365 15
Name: col2, Length: 365, dtype: int32
Or %d for the day of the month:
df = df.groupby(df['col1'].dt.strftime('%m-%d'))['col2'].min()
print (df)
col1
01-01 30
01-02 93
01-03 15
01-04 6
01-05 61
..
12-27 18
12-28 47
12-29 17
12-30 14
12-31 15
Name: col2, Length: 365, dtype: int32
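If you'd rather avoid string keys, a small variation (my own sketch, not part of the original answer) groups on the datetime attributes directly, starting again from the original df; the result carries a (month, day) MultiIndex that sorts numerically:

# Group on the month/day attributes instead of a formatted string
out = df.groupby([df['col1'].dt.month.rename('month'),
                  df['col1'].dt.day.rename('day')])['col2'].min()
print(out.head())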
Let's say that I start with this dataframe
import pandas as pd

d = {'price': [10, 12, 8, 12, 14, 18, 10, 20],
     'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
df = pd.DataFrame(d)
df['a_date'] = pd.date_range('01/01/2018', periods=8, freq='W')
df
price volume a_date
0 10 50 2018-01-07
1 12 60 2018-01-14
2 8 40 2018-01-21
3 12 100 2018-01-28
4 14 50 2018-02-04
5 18 100 2018-02-11
6 10 40 2018-02-18
7 20 50 2018-02-25
Now, I would like to resample/groupby in such a way that the data is aggregated on time intervals of roughly 10 days, but with pre-defined start and end dates, which fall on the 10th, 20th and last day of the month, such as:
2018-01-01 to 2018-01-10
2018-01-11 to 2018-01-20
2018-01-21 to 2018-01-31
2018-02-01 to 2018-02-10
2018-02-11 to 2018-02-20
2018-02-21 to 2018-02-28
and the results would be, in case of summing across the intervals:
price volume
a_date
2018-01-10 10 50
2018-01-20 12 60
2018-01-31 20 140
2018-02-10 14 50
2018-02-20 28 140
2018-02-28 20 50
The closest that I can get to this is df.resample('10D', on='a_date').sum(), but clearly I need something more customized as the interval.
I would be happy with just passing an array of intervals, but I don't think that's possible.
I've tried, as an experiment:
td = pd.to_datetime('2018-01-10') - pd.to_datetime('2018-01-01')
df.resample(td, on='a_date').sum()
but the pandas.Timedelta does not keep information on the specific dates.
EDIT:
A different dataframe to test the first day of the month:
import numpy as np
import pandas as pd

d = {'price': np.arange(20) + 1,
     'volume': np.arange(20) + 5}
df = pd.DataFrame(d)
df['a_date'] = pd.date_range('01/01/2018', periods=20, freq='D')
applying the accepted answer gives (the first day is not taken into account):
a_date price volume
0 2018-01-10 54 90
1 2018-01-20 155 195
compare with (for the first interval 2018-01-01 to 2018-01-10):
df.iloc[:10].sum()
price 55
volume 95
dtype: int64
Try:
from pandas.tseries.offsets import MonthEnd

bins = []
end = df["a_date"].max()
current = df["a_date"].min()
current = pd.Timestamp(year=current.year, month=current.month, day=1)
while True:
    bins.append(current)
    bins.append(current + pd.Timedelta(days=9))
    bins.append(current + pd.Timedelta(days=19))
    bins.append(current + MonthEnd())
    if bins[-1] > end:
        break
    current = bins[-1] + pd.Timedelta(days=1)

x = (df.groupby(pd.cut(df["a_date"], bins)).sum()).reset_index()
x["a_date"] = x["a_date"].cat.categories.right
print(x[~(x.price.eq(0) & x.volume.eq(0))])
Prints:
a_date price volume
0 2018-01-10 10 50
1 2018-01-20 12 60
2 2018-01-31 20 140
4 2018-02-10 14 50
5 2018-02-20 28 140
6 2018-02-28 20 50
EDIT: pd.cut builds right-closed intervals by default, so the first bin (2018-01-01, 2018-01-10] excludes January 1 itself; starting the edges one day earlier (at the previous month's end) fixes that. Adjusted bins:
from pandas.tseries.offsets import MonthEnd

end = df["a_date"].max()
current = df["a_date"].min()
bins = [
    pd.Timestamp(year=current.year, month=current.month, day=1) - MonthEnd(),
]
current = bins[-1]
while True:
    bins.append(bins[-1] + pd.Timedelta(days=10))
    bins.append(bins[-1] + pd.Timedelta(days=10))
    bins.append(current + MonthEnd())
    if bins[-1] > end:
        break
    current = bins[-1]

x = (df.groupby(pd.cut(df["a_date"], bins)).sum()).reset_index()
x["a_date"] = x["a_date"].cat.categories.right
print(x[~(x.price.eq(0) & x.volume.eq(0))])
Prints:
a_date price volume
0 2018-01-10 55 95
1 2018-01-20 155 195
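For what it's worth, the same edges can also be built per month without the while loop; a sketch of my own variation (it reproduces the adjusted output above, under the same pd.cut right-closed convention):

import pandas as pd
from pandas.tseries.offsets import MonthEnd

months = pd.period_range(df['a_date'].min(), df['a_date'].max(), freq='M')
edges = [months[0].to_timestamp() - pd.Timedelta(days=1)]  # day before the 1st
for m in months:
    start = m.to_timestamp()
    edges += [start + pd.Timedelta(days=9),
              start + pd.Timedelta(days=19),
              start + MonthEnd()]
x = df.groupby(pd.cut(df['a_date'], edges))[['price', 'volume']].sum()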
I am starting off with Python and using Pandas.
I have 2 CSVs, i.e.
CSV1
Date Col1 Col2
2021-01-01 20 15
2021-01-02 22 12
2021-01-03 30 18
.
.
2021-12-31 125 160
so on and so forth...
CSV2
Start_Date End_Date Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2021-01-01 2021-02-25 15 25 35 45 30 40 55
2021-02-26 2021-05-31 25 30 44 35 50 45 66
.
.
2021-09-01 2021-0-25 44 25 65 54 24 67 38
Desired result
Date Col1 Col2 New_Col3 New_Col4
2021-01-01 20 15 Fri 40
2021-01-02 22 12 Sat 55
2021-01-03 30 18 Sun 15
.
.
2021-12-31 125 160 Fri 67
New_Col3 is the weekday abbreviation of Date
New_Col4 is the cell in CSV2 where the Date falls between Start_Date and End_Date row-wise, and from the corresponding weekday column-wise.
# Convert date columns to datetime
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Start_Date'] = pd.to_datetime(df2['Start_Date'])
df2['End_Date'] = pd.to_datetime(df2['End_Date'])

# Get abbreviated weekday name
df1['New_Col3'] = df1['Date'].apply(lambda x: x.strftime('%a'))

New_Col4 = []
# Iterate over df1
for i in range(len(df1)):
    # If df1['Date'] is between df2['Start_Date'] and df2['End_Date'],
    # get the value from the column named after df1['Date']'s weekday
    for j in range(len(df2)):
        if df2.loc[j, 'Start_Date'] <= df1.loc[i, 'Date'] <= df2.loc[j, 'End_Date']:
            day_name = df1.loc[i, 'Date'].strftime('%A')
            New_Col4.append(df2.loc[j, day_name])

# Assign the result to a new column
df1['New_Col4'] = New_Col4
# print(df1)
Date Col1 Col2 New_Col3 New_Col4
0 2021-01-01 20 15 Fri 40
1 2021-01-02 22 12 Sat 55
2 2021-01-03 30 18 Sun 15
3 2021-03-03 40 18 Wed 35
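One caveat with the loop above (my note, not the author's): if a date matches no row of df2, nothing is appended and the final column assignment fails on a length mismatch. A defensive variant of the loop:

# Fall back to None when a date matches no interval,
# so the list length always matches len(df1).
for i in range(len(df1)):
    value = None
    for j in range(len(df2)):
        if df2.loc[j, 'Start_Date'] <= df1.loc[i, 'Date'] <= df2.loc[j, 'End_Date']:
            value = df2.loc[j, df1.loc[i, 'Date'].strftime('%A')]
            break
    New_Col4.append(value)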
Keys
Construct datetime and interval indexes to enable pd.IntervalIndex.get_indexer(pd.DatetimeIndex) for efficient row-matching. (reference post)
Apply a value retrieval function from df2 on each row of df1 for New_Col4.
With this approach, an explicit double for-loop search is avoided in the row-matching. However, a slow .apply() is still required. Maybe there is a fancy way to combine these two steps, but I will stop here for the time being (a possible vectorization is sketched after the result below).
Data
Typo in the last End_Date is changed.
import pandas as pd
import io
df1 = pd.read_csv(io.StringIO("""
Date Col1 Col2
2021-01-01 20 15
2021-01-02 22 12
2021-01-03 30 18
2021-12-31 125 160
"""), sep=r"\s+", engine='python')
df2 = pd.read_csv(io.StringIO("""
Start_Date End_Date Sunday Monday Tuesday Wednesday Thursday Friday Saturday
2021-01-01 2021-02-25 15 25 35 45 30 40 55
2021-02-26 2021-05-31 25 30 44 35 50 45 66
2021-09-01 2022-01-25 44 25 65 54 24 67 38
"""), sep=r"\s+", engine='python')
df1["Date"] = pd.to_datetime(df1["Date"])
df2["Start_Date"] = pd.to_datetime(df2["Start_Date"])
df2["End_Date"] = pd.to_datetime(df2["End_Date"])
Solution
# 1. Get weekday name
df1["day_name"] = df1["Date"].dt.day_name()
df1["New_Col3"] = df1["day_name"].str[:3]
# 2-1. find corresponding row in df2
df1.set_index("Date", inplace=True)
idx = pd.IntervalIndex.from_arrays(df2["Start_Date"], df2["End_Date"], closed="both")
df1["df2_row"] = idx.get_indexer(df1.index)
# 2-2. pick out the value from df2
def f(row):
    """Get (#row, day_name) in df2"""
    return df2[row["day_name"]].iloc[row["df2_row"]]
df1["New_Col4"] = df1.apply(f, axis=1)
Result
print(df1.drop(columns=["day_name", "df2_row"]))
Col1 Col2 New_Col3 New_Col4
Date
2021-01-01 20 15 Fri 40
2021-01-02 22 12 Sat 55
2021-01-03 30 18 Sun 15
2021-12-31 125 160 Fri 67
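As hinted at under Keys, the .apply() in step 2-2 can plausibly be replaced with NumPy fancy indexing; a sketch of my own (assuming every date falls inside some interval, so get_indexer returns no -1):

import numpy as np

rows = idx.get_indexer(df1.index)                 # row position in df2
cols = df2.columns.get_indexer(df1["day_name"])   # weekday column position
df1["New_Col4"] = df2.to_numpy()[rows, cols]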
I subsetted a big dataframe, slicing out only one column, Start Time, with dtype object.
test = taxi_2020['Start Time']
Got a column
0 00:15:00
1 00:15:00
2 00:15:00
3 00:15:00
4 00:15:00
...
4137289 00:00:00
4137290 00:00:00
4137291 00:00:00
4137292 00:00:00
4137293 00:00:00
Name: Start Time, Length: 4137294, dtype: object
Then I grouped and summarized it by the count (to the best of my knowledge):
test.value_counts().sort_index().reset_index()
and got two columns
index Start Time
0 00:00:00 24005
1 00:15:00 22815
2 00:30:00 20438
3 00:45:00 19012
4 01:00:00 18082
... ... ...
91 22:45:00 32365
92 23:00:00 31815
93 23:15:00 29582
94 23:30:00 26903
95 23:45:00 24599
Not sure why this index column appeared; I have failed to rename or convert it.
What would I like to see? I would like to group the time by hour (24h format is OK); the data looks like it counts rides every 15 minutes, so basically put each run of 4 rows together. 00:15:00 is OK to count as hour 0, 23:00:00 as the 23rd hour.
My ideal output:
Hour Rides
0 34000
1 60000
2 30000
3 40000
I would like to create afterwards a simple histogram to show the occurrence by the hour.
Appreciate any help!
IIUC,
# Create a dummy input dataframe
import numpy as np
import pandas as pd

test = pd.DataFrame({'time': pd.date_range('2020-06-01', '2020-06-01 23:59:00', freq='15T').strftime('%H:%M:%S'),
                     'rides': np.random.randint(15000, 28000, 96)})
Let's create a DateTimeIndex from string and resample, aggregate with sum and convert DateTimeIndex to hours:
test2 = (test.set_index(pd.to_datetime(test['time'], format='%H:%M:%S'))
             .rename_axis('hour')
             .resample('H')
             .sum())
test2.index = test2.index.hour
test2.reset_index()
Output:
hour rides
0 0 74241
1 1 87329
2 2 76933
3 3 86208
4 4 88002
5 5 82618
6 6 82188
7 7 81203
8 8 78591
9 9 95592
10 10 99778
11 11 85294
12 12 93931
13 13 80490
14 14 84181
15 15 71786
16 16 90962
17 17 96568
18 18 85646
19 19 88324
20 20 83595
21 21 89284
22 22 72061
23 23 74057
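On the design choice (my note): resample('H') walks the full timestamp grid, while grouping on the hour attribute collapses straight to 24 buckets; with this single-day dummy data the two coincide. A minimal equivalent sketch:

t = pd.to_datetime(test['time'], format='%H:%M:%S')
test2 = test.groupby(t.dt.hour.rename('hour'))['rides'].sum().reset_index()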
Step by step, I found the answer myself.
Using this code, I renamed the columns:
test2 = test.value_counts().sort_index().reset_index().rename(columns={'index': 'Time', 'Start Time': 'Rides'})
Got a dataframe with columns Time and Rides.
The remaining question - how to summarize by the hour.
After applying
test2['hour'] = pd.to_datetime(test2['Time'], format='%H:%M:%S').dt.hour
test2
I came closer.
Finally, I grouped by hour value
test3 = test2.groupby('hour', as_index=False).agg({"Rides": "sum"})
print(test3)
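For the histogram mentioned in the question, a minimal sketch on top of test3 (assumes matplotlib is installed):

import matplotlib.pyplot as plt

test3.plot.bar(x='hour', y='Rides', legend=False)
plt.xlabel('Hour of day')
plt.ylabel('Rides')
plt.show()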
Here is my dataframe:
import pandas as pd
df = pd.DataFrame({
    'KEY': ['A', 'B', 'C', 'A', 'A', 'B'],
    'START_DATE': ['2018-01-05', '2018-01-04', '2018-01-01', '2018-01-23', '2018-02-01', '2018-03-11'],
    'STOP_DATE': ['2018-01-22', '2018-03-10', '2018-01-31', '2018-02-15', '2018-04-01', '2018-07-21'],
    'AMOUNT': [5, 3, 11, 14, 7, 9],
})
df.START_DATE = pd.to_datetime(df.START_DATE, format='%Y-%m-%d')
df.STOP_DATE = pd.to_datetime(df.STOP_DATE, format='%Y-%m-%d')
df
>>> AMOUNT KEY START_DATE STOP_DATE
0 5 A 2018-01-05 2018-01-22
1 3 B 2018-01-04 2018-03-10
2 11 C 2018-01-01 2018-01-31
3 14 A 2018-01-23 2018-02-15
4 7 A 2018-02-01 2018-04-01
5 9 B 2018-03-11 2018-07-21
I am trying to get the AMOUNT per month and per KEY, treating the AMOUNT as linearly distributed (by day) between START_DATE and STOP_DATE. The output is shown below. I would also like to keep track of the number of charged days in a month. For example, KEY = A has overlapping periods in February, so the number of charged days can be > 28.
DAYS AMOUNT
A 2018_01 27 10.250000
2018_02 43 12.016667
2018_03 31 3.616667
2018_04 1 0.116667
B 2018_01 28 1.272727
2018_02 28 1.272727
2018_03 31 1.875598
2018_04 30 2.030075
2018_05 31 2.097744
2018_06 30 2.030075
2018_07 21 1.421053
C 2018_01 31 11.000000
2018_02 0 0.000000
I came up with the solution detailed below, but it is highly inefficient and takes an unaffordable amount of time to run on a dataset with ~100 million rows. I am looking for an improved version but could not manage to vectorize the pd.date_range part of it. Not sure if numba @jit could help here? Added a tag just in case.
from pandas.tseries.offsets import MonthEnd

# Prepare the final dataframe (filled with zeros)
bounds = df.groupby('KEY').agg({'START_DATE': min, 'STOP_DATE': max}).reset_index()
multiindex = []
for row in bounds.itertuples():
    dates = pd.date_range(start=row.START_DATE, end=row.STOP_DATE + MonthEnd(),
                          freq='M').strftime('%Y_%m')
    multiindex.extend([(row.KEY, date) for date in dates])
index = pd.MultiIndex.from_tuples(multiindex)
final = pd.DataFrame(0, index=index, columns=['DAYS', 'AMOUNT'])
# Run the actual iteration over rows
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
for row in df.itertuples():
    data = pd.Series(index=pd.date_range(start=row.START_DATE, end=row.STOP_DATE))
    data = data.resample('MS').size().rename('DAYS').to_frame()
    data['AMOUNT'] = data.DAYS / row.TOTAL_DAYS * row.AMOUNT
    data.index = data.index.strftime('%Y_%m')
    # Add data to the final dataframe
    final.loc[(row.KEY, data.index.tolist()), 'DAYS'] += data.DAYS.values
    final.loc[(row.KEY, data.index.tolist()), 'AMOUNT'] += data.AMOUNT.values
I eventually came up with this solution (heavily inspired by @jezrael's answer on this post). Probably not the most memory-efficient solution, but that is not a major concern for me; execution time was the problem!
from pandas.tseries.offsets import MonthBegin
df['ID'] = range(len(df))
df['TOTAL_DAYS'] = (df.STOP_DATE - df.START_DATE).dt.days + 1
df
>>> AMOUNT KEY START_DATE STOP_DATE ID TOTAL_DAYS
0 5 A 2018-01-05 2018-01-22 0 18
1 3 B 2018-01-04 2018-03-10 1 66
2 11 C 2018-01-01 2018-01-31 2 31
3 14 A 2018-01-23 2018-02-15 3 24
4 7 A 2018-02-01 2018-04-01 4 60
5 9 B 2018-03-11 2018-07-21 5 133
final = (df[['ID', 'START_DATE', 'STOP_DATE']].set_index('ID').stack()
         .reset_index(level=-1, drop=True)
         .rename('DATE_AFTER')
         .to_frame())
final = final.groupby('ID').apply(
    lambda x: x.set_index('DATE_AFTER').resample('M').asfreq()).reset_index()
final = final.merge(df[['ID', 'KEY', 'AMOUNT', 'TOTAL_DAYS']], how='left', on=['ID'])
final['PERIOD'] = final.DATE_AFTER.dt.to_period('M')
final['DATE_BEFORE'] = final.DATE_AFTER - MonthBegin()
At this point final looks like this:
final
>>> ID DATE_AFTER KEY AMOUNT TOTAL_DAYS PERIOD DATE_BEFORE
0 0 2018-01-31 A 5 18 2018-01 2018-01-01
1 1 2018-01-31 B 3 66 2018-01 2018-01-01
2 1 2018-02-28 B 3 66 2018-02 2018-02-01
3 1 2018-03-31 B 3 66 2018-03 2018-03-01
4 2 2018-01-31 C 11 31 2018-01 2018-01-01
5 3 2018-01-31 A 14 24 2018-01 2018-01-01
6 3 2018-02-28 A 14 24 2018-02 2018-02-01
7 4 2018-02-28 A 7 60 2018-02 2018-02-01
8 4 2018-03-31 A 7 60 2018-03 2018-03-01
9 4 2018-04-30 A 7 60 2018-04 2018-04-01
10 5 2018-03-31 B 9 133 2018-03 2018-03-01
11 5 2018-04-30 B 9 133 2018-04 2018-04-01
12 5 2018-05-31 B 9 133 2018-05 2018-05-01
13 5 2018-06-30 B 9 133 2018-06 2018-06-01
14 5 2018-07-31 B 9 133 2018-07 2018-07-01
We then merge back the initial df twice (start and end of month):
final = pd.merge(
    final,
    df[['ID', 'STOP_DATE']].assign(PERIOD=df.STOP_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])
final = pd.merge(
    final,
    df[['ID', 'START_DATE']].assign(PERIOD=df.START_DATE.dt.to_period('M')),
    how='left', on=['ID', 'PERIOD'])
final['STOP_DATE'] = final.STOP_DATE.combine_first(final.DATE_AFTER)
final['START_DATE'] = final.START_DATE.combine_first(final.DATE_BEFORE)
final['DAYS'] = (final.STOP_DATE - final.START_DATE).dt.days + 1
final = final.drop(columns=['ID', 'DATE_AFTER', 'DATE_BEFORE'])
final.AMOUNT *= final.DAYS / final.TOTAL_DAYS
final = final.groupby(['KEY', 'PERIOD']).agg({'AMOUNT': 'sum', 'DAYS': 'sum'})
With the expected result:
AMOUNT DAYS
KEY PERIOD
A 2018-01 10.250000 27
2018-02 12.016667 43
2018-03 3.616667 31
2018-04 0.116667 1
B 2018-01 1.272727 28
2018-02 1.272727 28
2018-03 1.875598 31
2018-04 2.030075 30
2018-05 2.097744 31
2018-06 2.030075 30
2018-07 1.421053 21
C 2018-01 11.000000 31
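As a quick sanity check of one cell (my own check, not part of the answer): KEY B's first row spans 2018-01-04 to 2018-03-10, 66 days in total, of which 28 fall in January, so January's share of the 3-unit AMOUNT should be 3 * 28 / 66:

days_jan = (pd.Timestamp('2018-01-31') - pd.Timestamp('2018-01-04')).days + 1
print(days_jan, 3 * days_jan / 66)  # 28 1.2727... matches B / 2018-01 above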
I have a Pandas dataframe with a DatetimeIndex and some other columns, similar to this:
import pandas as pd
import numpy as np
rng = pd.date_range('2017-12-01', '2018-01-05', freq='6H')
df = pd.DataFrame(index=rng)
# Average speed in miles per hour
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
df.info()
# DatetimeIndex: 141 entries, 2017-12-01 00:00:00 to 2018-01-05 00:00:00
# Freq: 6H
# Data columns (total 1 columns):
# value 141 non-null int64
# dtypes: int64(1)
# memory usage: 2.2 KB
df.head(10)
# value
# 2017-12-01 00:00:00 15
# 2017-12-01 06:00:00 54
# 2017-12-01 12:00:00 19
# 2017-12-01 18:00:00 13
# 2017-12-02 00:00:00 35
# 2017-12-02 06:00:00 31
# 2017-12-02 12:00:00 58
# 2017-12-02 18:00:00 6
# 2017-12-03 00:00:00 8
# 2017-12-03 06:00:00 30
How can I select or filter the entries that are:
Weekdays only (that is, not weekend days Saturday or Sunday)
Not within N days of the dates in a list (e.g. U.S. holidays like '12-25' or '01-01')?
I was hoping for something like:
df = exclude_Sat_and_Sun(df)
omit_days = ['12-25', '01-01']
N = 3 # days near the holidays
df = exclude_days_near_omit_days(N, omit_days)
I was thinking of creating a new column to break out the month and day and then comparing them to the criteria for 1 and 2 above. However, I was hoping for something more Pythonic using the DateTimeIndex.
Thanks for any help.
The first part can be easily accomplished using the Pandas DatetimeIndex.dayofweek property, which numbers weekdays from Monday as 0 through Sunday as 6.
df[df.index.dayofweek < 5] will give you only the weekdays.
For the second part you can use the datetime module. Below I will give an example for only one date, namely 2017-12-25. You can easily generalize it to a list of dates, for example by defining a helper function.
from datetime import datetime, timedelta
N = 3
df[abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N)]
This will give all dates that are more than N=3 days away from 2017-12-25. That is, it will exclude an interval of 7 days from 2017-12-22 to 2017-12-28.
Lastly, you can combine the two criteria using the & operator, as you probably know.
df[
(df.index.dayofweek < 5)
&
(abs(df.index.date - datetime.strptime("2017-12-25", '%Y-%m-%d').date()) > timedelta(N))
]
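The helper function mentioned above might look like this (a sketch of my own; the name far_from_all is mine):

from datetime import datetime, timedelta
import numpy as np

def far_from_all(index, dates, n):
    """Boolean mask: True where every date in `dates` is more than `n` days away."""
    mask = np.ones(len(index), dtype=bool)
    for d in dates:
        d = datetime.strptime(d, '%Y-%m-%d').date()
        mask &= abs(index.date - d) > timedelta(n)
    return mask

df = df[(df.index.dayofweek < 5) & far_from_all(df.index, ['2017-12-25', '2018-01-01'], 3)]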
I followed the answer by @Bahman Engheta and created a function to omit dates from a dataframe.
import pandas as pd
from datetime import datetime, timedelta

def omit_dates(df, list_years, list_dates, omit_days_near=3, omit_weekends=False):
    '''
    Given a Pandas dataframe with a DatetimeIndex, remove rows that have a date
    near a given list of dates and/or a date on a weekend.

    Parameters:
    ----------
    df : Pandas dataframe
    list_years : list of str
        Contains a list of years in string form
    list_dates : list of str
        Contains a list of dates in string form encoded as MM-DD
    omit_days_near : int
        Threshold of days away from list_dates to remove. For example, if
        omit_days_near=3, then omit all days that are 3 days away from
        any date in list_dates.
    omit_weekends : bool
        If True, omit dates that are on weekends.

    Returns:
    -------
    Pandas dataframe
        New resulting dataframe with dates omitted.
    '''
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df is expected to be a Pandas dataframe, not %s" % type(df).__name__)
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("Dataframe is expected to have an index of DatetimeIndex, not %s" %
                         type(df.index).__name__)
    if not isinstance(list_years, list):
        list_years = [list_years]
    if not isinstance(list_dates, list):
        list_dates = [list_dates]

    result = df.copy()
    if omit_weekends:
        result = result.loc[result.index.dayofweek < 5]
    omit = ['%s-%s' % (year, date) for year in list_years for date in list_dates]
    for date in omit:
        result = result.loc[abs(result.index.date - datetime.strptime(date, '%Y-%m-%d').date())
                            > timedelta(omit_days_near)]
    return result
Here is example usage. Suppose you have a dataframe that has a DateTimeIndex and other columns, like this:
import pandas as pd
import numpy as np

rng = pd.date_range('2017-12-01', '2018-01-05', freq='1D')
df = pd.DataFrame(index=rng)
df['value'] = np.random.randint(low=0, high=60, size=len(df.index))
The resulting dataframe looks like this:
value
2017-12-01 42
2017-12-02 35
2017-12-03 49
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-08 57
2017-12-09 3
2017-12-10 57
2017-12-11 46
2017-12-12 20
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-16 57
2017-12-17 4
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-23 45
2017-12-24 34
2017-12-25 42
2017-12-26 33
2017-12-27 17
2017-12-28 2
2017-12-29 2
2017-12-30 51
2017-12-31 19
2018-01-01 6
2018-01-02 43
2018-01-03 11
2018-01-04 45
2018-01-05 45
Now, let's specify dates to remove. I want to remove the dates '12-10', '12-25', '12-31', and '01-01' (following MM-DD notation) and all dates within 2 days of those dates. Further, I want to remove those dates from both the years '2016' and '2017'. I also want to remove weekend dates.
I'll call my function like this:
years = ['2016', '2017']
holiday_dates = ['12-10', '12-25', '12-31', '01-01']
omit_dates(df, years, holiday_dates, omit_days_near=2, omit_weekends=True)
The result is:
value
2017-12-01 42
2017-12-04 25
2017-12-05 19
2017-12-06 28
2017-12-07 21
2017-12-13 7
2017-12-14 5
2017-12-15 30
2017-12-18 46
2017-12-19 32
2017-12-20 48
2017-12-21 55
2017-12-22 52
2017-12-28 2
2018-01-03 11
2018-01-04 45
2018-01-05 45
Is that answer correct? Here are the calendars for December 2017 and January 2018:
December 2017
Su Mo Tu We Th Fr Sa
1 2
3 4 5 6 7 8 9
10 11 12 13 14 15 16
17 18 19 20 21 22 23
24 25 26 27 28 29 30
31
January 2018
Su Mo Tu We Th Fr Sa
1 2 3 4 5 6
7 8 9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30 31
Looks like it works.
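As an aside (my addition, not from the thread): pandas ships a US federal holiday calendar that could replace the hand-written year/date lists. A sketch of the within-N-days mask built by broadcasting against those holidays:

import numpy as np
from pandas.tseries.holiday import USFederalHolidayCalendar

N = 2
holidays = USFederalHolidayCalendar().holidays('2017-12-01', '2018-01-05')
# Distance from every row to every holiday; keep rows far from all of them
deltas = np.abs(df.index.values[:, None] - holidays.values[None, :])
far = (deltas > np.timedelta64(N, 'D')).all(axis=1)
result = df[(df.index.dayofweek < 5) & far]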