I have the following Dataframe:
Date Holiday
0 2018-01-01 New Year's Day
1 2018-01-15 Martin Luther King, Jr. Day
2 2018-02-19 Washington's Birthday
3 2018-05-08 Truman Day
4 2018-05-28 Memorial Day
... ... ...
58 2022-10-10 Columbus Day
59 2022-11-11 Veterans Day
60 2022-11-24 Thanksgiving
61 2022-12-25 Christmas Day
62 2022-12-26 Christmas Day (Observed)
I would like to resample this DataFrame from daily to hourly, copying the content of the Holiday column to every hour of the matching date. I'd like it to look like this (ignore the index of the table; it should contain a lot more rows than this):
Timestamp Holiday
0 2018-01-01 00:00:00 New Year's Day
1 2018-01-01 01:00:00 New Year's Day
2 2018-01-01 02:00:00 New Year's Day
3 2018-01-01 03:00:00 New Year's Day
4 2018-01-01 04:00:00 New Year's Day
5 2018-01-01 05:00:00 New Year's Day
... ... ...
62 2022-12-26 20:00:00 Christmas Day (Observed)
63 2022-12-26 21:00:00 Christmas Day (Observed)
64 2022-12-26 22:00:00 Christmas Day (Observed)
65 2022-12-26 23:00:00 Christmas Day (Observed)
What's the fastest way to go about doing so? Thanks in advance.
How about
df.set_index("Date").resample("H").ffill().reset_index().rename(
{"Date": "Timestamp"}, axis=1
)
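A minimal, self-contained sketch of that approach (a two-row toy frame standing in for the real data; the column names follow the question, and the lowercase "h" alias is hourly frequency):

```python
import pandas as pd

# toy version of the daily holiday frame from the question
df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01", "2018-01-15"]),
    "Holiday": ["New Year's Day", "Martin Luther King, Jr. Day"],
})

hourly = (
    df.set_index("Date")
      .resample("h")   # one row per hour
      .ffill()         # carry the last seen holiday forward
      .reset_index()
      .rename({"Date": "Timestamp"}, axis=1)
)
# 337 rows: every hour from 2018-01-01 00:00 through 2018-01-15 00:00
```

One caveat: ffill also fills the hours of the non-holiday days in between with the previous holiday's name, so if only holiday dates should carry a label, the per-day groupby approach below avoids that.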
(1) Create a new DataFrame using date_range, (2) concat this with the original DF, (3) make dates as a column again using reset_index, (4) fill the empty slots using groupby and ffill, (5) sort values and drop duplicates/NaN values.
dates = pd.DataFrame(
    pd.date_range(df2['date'].min(), df2['date'].max(), freq='H'),
    columns=['date'],
).set_index('date')
df3 = pd.concat([df2.set_index('date'), dates], sort=False)
df3.reset_index(inplace=True)
df3['Holiday'] = df3.groupby(df3['date'].dt.date)['Holiday'].ffill()
df3 = df3.sort_values('date').drop_duplicates().dropna(axis=0)
Related
I have two dataFrames as shown below:
df1 =
temperature Mon_start Mon_end Tues_start Tues_end
cold 1:00 3:00 9:00 10:00
warm 7:00 8:00 16:00 20:00
hot 4:00 6:00 12:00 14:00
df2 =
sample1 data_value
A 2:00
A 7:30
B 18:00
B 9:45
I need to use the values in df2['data_value'] together with df1 to find out on which day an experiment was performed and at what temperature. Essentially, df1 acts as a lookup table: for each data_value, check whether it falls between a given start and end time, and if so record the matching day and temperature in new columns. The output I've been trying to get is:
sample1 data_value day temperature
A 2:00 Mon cold
A 7:30 Mon warm
B 18:00 Tues warm
B 9:45 Tues cold
The actual DataFrame is quite long, so I defined a function and used np.vectorize() to speed things up, but I can't seem to get the mapping and new columns defined correctly.
Or do I need to do a for-loop and check over every combination of *_start and *_end to do so?
Any help would be greatly appreciated!
If your data are valid, i.e. there is no row in df2 whose time falls outside every interval (such as 3:30), then you can use merge_asof:
# convert data to timedelta so we can compare correctly
for col in df1.columns[1:]:
df1[col] = pd.to_timedelta(df1[col]+':00')
df2['data_value'] = pd.to_timedelta(df2['data_value'] + ':00')
pd.merge_asof(df2.sort_values('data_value'),
df1.melt('temperature', var_name='day').sort_values('value'),
left_on='data_value', right_on='value')
Output:
sample1 data_value temperature day value
0 A 0 days 02:00:00 cold Mon_start 0 days 01:00:00
1 A 0 days 07:30:00 warm Mon_start 0 days 07:00:00
2 B 0 days 09:45:00 cold Tues_start 0 days 09:00:00
3 B 0 days 18:00:00 warm Tues_start 0 days 16:00:00
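The merged frame still carries the melt suffix in the day column (e.g. Mon_start) plus the helper value column; a small cleanup sketch, using a toy frame that mimics the output above:

```python
import pandas as pd

# stand-in for the merge_asof result shown above
out = pd.DataFrame({
    "sample1": ["A", "A", "B", "B"],
    "day": ["Mon_start", "Mon_start", "Tues_start", "Tues_start"],
    "value": pd.to_timedelta(["01:00:00", "07:00:00", "09:00:00", "16:00:00"]),
})

# strip the "_start" suffix left over from melt and drop the join key
out["day"] = out["day"].str.replace("_start", "", regex=False)
out = out.drop(columns="value")
```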
I have generated this df
PredictionTargetDateEOM PredictionTargetDateBOM DayAfterTargetDateEOM business_days
0 2018-12-31 2018-12-01 2019-01-01 20
1 2019-01-31 2019-01-01 2019-02-01 21
2 2019-02-28 2019-02-01 2019-03-01 20
3 2018-11-30 2018-11-01 2018-12-01 21
4 2018-10-31 2018-10-01 2018-11-01 23
... ... ... ... ...
172422 2020-10-31 2020-10-01 2020-11-01 22
172423 2020-11-30 2020-11-01 2020-12-01 20
172424 2020-12-31 2020-12-01 2021-01-01 22
172425 2020-09-30 2020-09-01 2020-10-01 21
172426 2020-08-31 2020-08-01 2020-09-01 21
with this code:
# Get the first day of the target month
predicted_df['PredictionTargetDateBOM'] = predicted_df.apply(
    lambda x: pd.to_datetime(x['PredictionTargetDateEOM']).replace(day=1), axis=1)
predicted_df['PredictionTargetDateEOM'] = pd.to_datetime(predicted_df['PredictionTargetDateEOM'])
# Get the first day of the month after the target month, i.e. M+2
predicted_df['DayAfterTargetDateEOM'] = predicted_df['PredictionTargetDateEOM'] + timedelta(days=1)
# Count the number of business days in the target month
predicted_df['business_days_bankers'] = predicted_df.apply(
    lambda x: np.busday_count(
        x['PredictionTargetDateBOM'].date(),
        x['DayAfterTargetDateEOM'].date(),
        holidays=[
            list(holidays.US(years=x['PredictionTargetDateBOM'].year).keys())[index]
            for index in [
                list(holidays.US(years=x['PredictionTargetDateBOM'].year).values()).index(item)
                for item in rocket_holiday_including_observed
                if item in list(holidays.US(years=x['PredictionTargetDateBOM'].year).values())
            ]
        ],
    ),
    axis=1,
)
That counts the number of business days in the month of the PredictionTargetDateEOM column based on Python's holidays package, which provides a dictionary that includes the following holidays:
2022-01-01 New Year's Day
2022-01-17 Martin Luther King Jr. Day
2022-02-21 Washington's Birthday
2022-05-30 Memorial Day
2022-06-19 Juneteenth National Independence Day
2022-06-20 Juneteenth National Independence Day (Observed)
2022-07-04 Independence Day
2022-09-05 Labor Day
2022-10-10 Columbus Day
2022-11-11 Veterans Day
2022-11-24 Thanksgiving
2022-12-25 Christmas Day
2022-12-26 Christmas Day (Observed)
However, I would like to replicate the business day count but instead use this list called rocket_holiday as the reference for np.busday_count():
["New Year's Day",
'Martin Luther King Jr. Day',
'Memorial Day',
'Independence Day',
'Labor Day',
'Thanksgiving',
'Christmas Day',
"New Year's Day (Observed)",
'Martin Luther King Jr. Day (Observed)',
'Memorial Day (Observed)',
'Independence Day (Observed)',
'Labor Day (Observed)',
'Thanksgiving (Observed)',
'Christmas Day (Observed)']
So I've added this line
predicted_df['business_days_rocket'] = predicted_df.apply(lambda x: np.busday_count(x['PredictionTargetDateBOM'].date(), x['DayAfterTargetDateEOM'].date(), holidays=[rocket_holiday]), axis = 1)
But I get the ValueError listed in the title of this question. I think the problem is that the first collection is a dictionary keyed by the dates of those holidays, while rocket_holiday contains only names. So I need a function that dynamically generates the dates for the holidays in the second list based on the year and converts that list into a dictionary. Is there a way to do that with Python's holidays package so that I don't have to hard-code the dates?
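One possible sketch of that idea: since holidays.US(years=...) is described above as a date-to-name mapping, filtering its items by name yields the flat date list np.busday_count wants. A plain dict stands in for the real package output here, so the dates below are illustrative of 2022 only:

```python
import datetime
import numpy as np

# stand-in for holidays.US(years=2022): a date -> name mapping
us_2022 = {
    datetime.date(2022, 1, 1): "New Year's Day",
    datetime.date(2022, 1, 17): "Martin Luther King Jr. Day",
    datetime.date(2022, 5, 30): "Memorial Day",
    datetime.date(2022, 10, 10): "Columbus Day",
}
rocket_holiday = ["New Year's Day", "Martin Luther King Jr. Day", "Memorial Day"]

# keep only the dates whose holiday name is on the rocket list
rocket_dates = [d for d, name in us_2022.items() if name in rocket_holiday]

# note: busday_count wants a flat list of dates, not a nested one
n = np.busday_count(datetime.date(2022, 1, 1), datetime.date(2022, 2, 1),
                    holidays=rocket_dates)
# January 2022 has 21 weekdays; MLK Day removes one -> 20
```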
I have a pandas Series s, and I would like to extract the Monday before the third Friday of each month.
With the help of the answer at the following link, I can resample to the third Friday; I am still not sure how to get the Monday just before it.
pandas resample to specific weekday in month
from pandas.tseries.offsets import WeekOfMonth
s.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()
Any help is welcome
Many thanks
For each source date, compute your "wanted" date in 3 steps:
Shift back to the first day of the current month.
Shift forward to the Friday of the third week.
Shift back 4 days (from Friday to Monday).
For a Series containing dates, the code to do it is:
s.dt.to_period('M').dt.to_timestamp() + pd.offsets.WeekOfMonth(week=2, weekday=4)\
- pd.Timedelta('4D')
To test this code I created the source Series as:
s = (pd.date_range('2020-01-01', '2020-12-31', freq='MS') + pd.Timedelta('1D')).to_series()
It contains the second day of each month, both as the index and value.
When you run the above code, you will get:
2020-01-02 2020-01-13
2020-02-02 2020-02-17
2020-03-02 2020-03-16
2020-04-02 2020-04-13
2020-05-02 2020-05-11
2020-06-02 2020-06-15
2020-07-02 2020-07-13
2020-08-02 2020-08-17
2020-09-02 2020-09-14
2020-10-02 2020-10-12
2020-11-02 2020-11-16
2020-12-02 2020-12-14
dtype: datetime64[ns]
The left column contains the original index (the source date) and the right column the "wanted" date.
Note that the third-Monday formula (as proposed in one of the comments) is wrong.
E.g. the third Monday in January 2020 is 2020-01-20, whereas the correct date is 2020-01-13.
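Condensed to a single spot check (January 2020 only, where the third Friday is the 17th):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2020-01-02"]))
monday = (s.dt.to_period("M").dt.to_timestamp()        # 2020-01-01
          + pd.offsets.WeekOfMonth(week=2, weekday=4)  # third Friday: 2020-01-17
          - pd.Timedelta("4D"))                        # Monday before: 2020-01-13
```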
Edit
If you have a DataFrame, something like:
Date Amount
0 2020-01-02 10
1 2020-01-12 10
2 2020-01-13 2
3 2020-01-20 2
4 2020-02-16 2
5 2020-02-17 12
6 2020-03-15 12
7 2020-03-16 3
8 2020-03-31 3
and you want something like resample but each "period" should start
on a Monday before the third Friday in each month, and e.g. compute
a sum for each period, you can:
Define the following function:
def dateShift(d):
d += pd.Timedelta(4, 'D')
d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
return d - pd.Timedelta(4, 'D')
i.e.:
Add 4 days (e.g. moving 2020-01-13, a Monday, to 2020-01-17, a Friday).
Roll back (in the above case the date is already on offset, so it will not be moved).
Subtract 4 days.
Run:
df.groupby(df.Date.apply(dateShift)).sum()
The result is:
Amount
Date
2019-12-16 20
2020-01-13 6
2020-02-17 24
2020-03-16 6
E.g. the two values of 10 for 2020-01-02 and 2020-01-12 are assigned to the period starting on 2019-12-16 (the "wanted" date for December 2019).
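Tracing one input through dateShift shows why 2020-01-12 lands in December's period (the function from above is repeated here so the sketch is self-contained):

```python
import pandas as pd

def dateShift(d):
    d += pd.Timedelta(4, 'D')
    d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
    return d - pd.Timedelta(4, 'D')

# 2020-01-12 + 4 days = 2020-01-16, which is still before January's third
# Friday (2020-01-17), so rollback lands on December's third Friday
# (2019-12-20); subtracting 4 days gives the period start 2019-12-16
start = dateShift(pd.Timestamp('2020-01-12'))
```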
I have time series data covering a full year at one-minute resolution.
timestamp day hour min somedata
2010-01-01 00:00:00 1 0 0 x
2010-01-01 00:01:00 1 0 1 x
2010-01-01 00:02:00 1 0 2 x
2010-01-01 00:03:00 1 0 3 x
2010-01-01 00:04:00 1 0 4 x
... ...
2010-12-31 23:55:00 365 23 55
2010-12-31 23:56:00 365 23 56
2010-12-31 23:57:00 365 23 57
2010-12-31 23:58:00 365 23 58
2010-12-31 23:59:00 365 23 59
I want to group the data by calendar day, i.e. the 2010-01-01 data should be one group, 2010-01-02 another, and so on up to 2010-12-31.
I used daily_groupby = dataframe.groupby(pd.to_datetime(dataframe.index.day, unit='D', origin=pd.Timestamp('2009-12-31'))). This groups only by the day of the month, so the 1st of January, February, and so on up to December all end up in one group. I want to also group by month so that the months do not get mixed up.
I am a beginner in pandas.
If timestamp is the index, use DatetimeIndex.date:
df.groupby(pd.to_datetime(df.index).date)
Otherwise, use Series.dt.date:
df.groupby(pd.to_datetime(df['timestamp']).dt.date)
If you don't want to group by year, use:
time_index = pd.to_datetime(df.index)
df.groupby([time_index.month,time_index.day])
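A minimal runnable check on two days of toy minute data (grouping by the date keeps year, month, and day together, so January and February days can never mix):

```python
import pandas as pd

# two full days of minute-level data
idx = pd.date_range('2010-01-01', periods=2 * 24 * 60, freq='min')
df = pd.DataFrame({'somedata': 1}, index=idx)

# one group per calendar date
sizes = df.groupby(idx.date).size()
# two groups of 1440 minutes each
```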
Given a df of this kind, where we have DateTime Index:
DateTime A
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
I would like to subset observations using the attributes of the index, like:
First business day of the month
Last business day of the month
First Friday of the month 'WOM-1FRI'
Third Friday of the month 'WOM-3FRI'
I'm specifically interested to know if this can be done using something like:
df.loc[(df['A'] < 5) & (df.index == 'WOM-3FRI'), 'Signal'] = 1
Thanks
You could try...
# FIRST DAY OF MONTH
df.loc[df[1:][df.index.month[:-1]!=df.index.month[1:]].index]
# LAST DAY OF MONTH
df.loc[df[:-1][df.index.month[:-1]!=df.index.month[1:]].index]
# 1st Friday (index.week is the week of the year, so test the day of the month instead)
fr1 = df.groupby(df.index.year*100 + df.index.month).apply(lambda x: x[(x.index.day <= 7) & (x.index.weekday == 4)])
# 3rd Friday
fr3 = df.groupby(df.index.year*100 + df.index.month).apply(lambda x: x[(x.index.day >= 15) & (x.index.day <= 21) & (x.index.weekday == 4)])
If you want to remove extra-levels in the index of fr1 and fr3:
fr1.index=fr1.index.droplevel(0)
fr3.index=fr3.index.droplevel(0)
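Since "first Friday" and "third Friday" are really per-row conditions (weekday plus day of month), a groupby-free mask is also possible; a sketch on toy data, with dates chosen so each month contributes one first and one third Friday:

```python
import pandas as pd

idx = pd.to_datetime(['2020-01-03', '2020-01-17', '2020-02-07', '2020-02-21'])
df = pd.DataFrame({'A': [1, 2, 3, 4]}, index=idx)

# first Friday: weekday 4 within days 1-7 of the month
fr1 = df[(df.index.weekday == 4) & (df.index.day <= 7)]
# third Friday: weekday 4 within days 15-21
fr3 = df[(df.index.weekday == 4) & (df.index.day >= 15) & (df.index.day <= 21)]
```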