I have a pandas Series s and I would like to extract the Monday before the third Friday of each month:
With the help of the answer in the following link I can resample to the third Friday, but I am still not sure how to get the Monday just before it.
pandas resample to specific weekday in month
from pandas.tseries.offsets import WeekOfMonth
s.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()
Any help is welcome
Many thanks
For each source date, compute your "wanted" date in 3 steps:
Shift back to the first day of the current month.
Shift forward to the Friday of the third week.
Shift back 4 days (from that Friday to the preceding Monday).
For a Series containing dates, the code to do it is:
s.dt.to_period('M').dt.to_timestamp() + pd.offsets.WeekOfMonth(week=2, weekday=4)\
- pd.Timedelta('4D')
To test this code I created the source Series as:
s = (pd.date_range('2020-01-01', '2020-12-31', freq='MS') + pd.Timedelta('1D')).to_series()
It contains the second day of each month, both as the index and value.
When you run the above code, you will get:
2020-01-02 2020-01-13
2020-02-02 2020-02-17
2020-03-02 2020-03-16
2020-04-02 2020-04-13
2020-05-02 2020-05-11
2020-06-02 2020-06-15
2020-07-02 2020-07-13
2020-08-02 2020-08-17
2020-09-02 2020-09-14
2020-10-02 2020-10-12
2020-11-02 2020-11-16
2020-12-02 2020-12-14
dtype: datetime64[ns]
The left column contains the original index (the source date) and the right
column contains the "wanted" date.
Note that the "third Monday" formula (as proposed in one of the comments) is wrong.
E.g. the third Monday in January 2020 is 2020-01-20, whereas the correct date is 2020-01-13.
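You can verify this on single dates with the same WeekOfMonth offset (a quick one-off check):
pd.Timestamp('2020-01-01') + pd.offsets.WeekOfMonth(week=2, weekday=0)   # 2020-01-20, the third Monday
pd.Timestamp('2020-01-01') + pd.offsets.WeekOfMonth(week=2, weekday=4) - pd.Timedelta('4D')   # 2020-01-13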
Edit
If you have a DataFrame, something like:
Date Amount
0 2020-01-02 10
1 2020-01-12 10
2 2020-01-13 2
3 2020-01-20 2
4 2020-02-16 2
5 2020-02-17 12
6 2020-03-15 12
7 2020-03-16 3
8 2020-03-31 3
and you want something like resample but each "period" should start
on a Monday before the third Friday in each month, and e.g. compute
a sum for each period, you can:
Define the following function:
def dateShift(d):
    d += pd.Timedelta(4, 'D')
    d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
    return d - pd.Timedelta(4, 'D')
i.e.:
Add 4 days (e.g. move 2020-01-13 (Monday) to 2020-01-17 (Friday)).
Roll back to the third Friday (in the above case the date is already on the offset, so it is not moved).
Subtract 4 days.
Run:
df.groupby(df.Date.apply(dateShift)).sum()
The result is:
Amount
Date
2019-12-16 20
2020-01-13 6
2020-02-17 24
2020-03-16 6
E.g. the two values of 10 for 2020-01-02 and 2020-01-12 are assigned
to the period starting on 2019-12-16 (the "wanted" date for December 2019).
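A quick sanity check of this mapping on single dates, using the dateShift function defined above:
dateShift(pd.Timestamp('2020-01-02'))   # Timestamp('2019-12-16 00:00:00')
dateShift(pd.Timestamp('2020-01-13'))   # Timestamp('2020-01-13 00:00:00')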
I have up to three different timestamps for each day in a dataframe. In a new column called 'Category' I want to give them a number from 1 to 3 based on the time of the timestamp. Almost like a partition by with rank in SQL.
Something like: for each day, check the time of each run and assign a rank based on whether it was the first, the second or the third run (if there is a third run).
This dataframe has about half a million rows: a few years of data at hourly resolution, with 2-3 runs every day.
Any suggestion how to do this most efficiently?
Example of how it is supposed to look:
Timestamp             Category
2020-01-17 08:18:00   1
2020-01-17 11:57:00   2
2020-01-17 15:35:00   3
2020-01-18 09:00:00   1
2020-01-18 12:00:00   2
2020-01-18 17:00:00   3
Use groupby() and .cumcount()
df['Timestamp'] = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d %H:%M:%S')
df['Category'] = df.groupby(df['Timestamp'].dt.to_period('D')).cumcount().add(1)
# or, equivalently, with a Grouper:
df['Category'] = df.groupby(pd.Grouper(freq='D', key='Timestamp')).cumcount().add(1)
Output:
>>> df
Timestamp Category
0 2020-01-17 08:18:00 1
1 2020-01-17 11:57:00 2
2 2020-01-17 15:35:00 3
3 2020-01-18 09:00:00 1
4 2020-01-18 12:00:00 2
5 2020-01-18 17:00:00 3
UPDATE: If rows with identical timestamps should get the same number, try this (the cumulative sum has to be computed inside each daily group so that the numbering restarts every day):
df['Category'] = (df.groupby(pd.Grouper(freq='D', key='Timestamp'))['Timestamp']
                    .transform(lambda s: s.diff().ne(pd.Timedelta(0)).cumsum()))
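If the rows are not guaranteed to be sorted within each day, a rank-based variant (just a sketch, not part of the original answer) produces the same 1-3 numbering regardless of row order and also keeps ties on the same number:
df['Category'] = (df.groupby(df['Timestamp'].dt.normalize())['Timestamp']
                    .rank(method='dense')
                    .astype(int))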
I currently have a dataframe with the index "2018-01-02" to "2020-12-31".
I need to write a program that takes in this dataframe and outputs a new dataframe that contains the first date available for each month.
What is the best way to do this?
Assume that the source DataFrame is:
Amount
Date
2018-01-02 10
2018-01-03 11
2018-01-04 12
2018-02-03 13
2018-02-04 14
2018-02-05 15
2018-03-07 16
2018-03-09 17
2018-04-10 18
2018-04-12 19
(its index is of DatetimeIndex type, not string).
If you want only the first date in each month, you can run:
result = df.groupby(pd.Grouper(freq='MS')).apply(lambda grp: grp.index.min())
The result is a Series containing:
Date
2018-01-01 2018-01-02
2018-02-01 2018-02-03
2018-03-01 2018-03-07
2018-04-01 2018-04-10
Freq: MS, dtype: datetime64[ns]
The left column is the index - starting date of each month.
The right column is the value found - the first date in each month from
the source DataFrame.
But if you want the full first row of each month, you can run:
result = df.groupby(pd.Grouper(freq='MS')).head(1)
This time the result is:
Amount
Date
2018-01-02 10
2018-02-03 13
2018-03-07 16
2018-04-10 18
Note that df.groupby(pd.Grouper(freq='MS')).first() is a wrong
choice, since its keys are the first calendar day of each month,
not the first date that actually exists in that month (try it on your own).
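For comparison, running the .first() variant on the sample data above should give month-start keys (the values themselves are correct):
df.groupby(pd.Grouper(freq='MS')).first()
            Amount
Date
2018-01-01      10
2018-02-01      13
2018-03-01      16
2018-04-01      18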
I have a dataframe looking like this:
    open  Start show  Einde show
5    NaN       11:30         NaN
6  16:00       18:00       19:45
7  14:30       16:30       18:15
8    NaN         NaN         NaN
9  18:45       20:45       22:30
These hours are in string format and I would like to transform them to datetime format.
Whenever I try pd.to_datetime(evs['open'], errors='coerce') (to convert one of the columns), it changes the hours to a full datetime like 2020-04-03 16:00:00, with today's date. I would like to have just the hour, but still in a datetime-like format, so I can add minutes etc.
And when I use dt.hour to access the hour, it returns a plain number, not a value in HH:MM format.
Can someone help me out please? I'm reading in a CSV through pandas read_csv, but when I use the date parser I get the same problem. Ideally this would get fixed in the read_csv step instead of separately, but at this point I'll take anything.
Thanks!
As Chris commented, it is not possible to convert just the hours and minutes into datetime format. But you can use timedeltas to solve your problem.
import datetime
import pandas as pd
def to_timedelta(date):
    date = pd.to_datetime(date)
    try:
        date_start = datetime.datetime(date.year, date.month, date.day, 0, 0)
    except TypeError:
        return pd.NaT  # to keep dtype of series; alternative: pd.Timedelta(0)
    return date - date_start
df['open'].apply(to_timedelta)
Output:
5 NaT
6 16:00:00
7 14:30:00
8 NaT
9 18:45:00
Name: open, dtype: timedelta64[ns]
Now you can use datetime.timedelta to add/subtract minutes, hours or whatever:
df['open'] + datetime.timedelta(minutes=15)
Output:
5 NaT
6 16:15:00
7 14:45:00
8 NaT
9 19:00:00
Name: open, dtype: timedelta64[ns]
Also, it is pretty easy to get back to full datetimes:
df['open'] + datetime.datetime(2020, 4, 4)
Output:
5 NaT
6 2020-04-04 16:00:00
7 2020-04-04 14:30:00
8 NaT
9 2020-04-04 18:45:00
Name: open, dtype: datetime64[ns]
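As a possible shortcut (a sketch, assuming the columns really contain 'HH:MM' strings), you can build the same timedeltas without the helper function by parsing the times against a fixed dummy date and subtracting that date again:
opens = pd.to_datetime(df['open'], format='%H:%M', errors='coerce')
opens - opens.dt.normalize()   # same timedelta result; NaN rows become NaT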
When operating on a pandas series of dates, isolating the week number can be performed in two separate ways that produce different results.
Using the .dt.week accessor on a numpy.datetime64 value or a pd.Period within a series produces different results than using pd.Period.strftime on the same objects. The online documentation for pd.Period.strftime states that all days before the first occurrence of the start week in the beginning of the year are counted as week 0. This follows standard python strftime behavior.
The .dt.week accessor seems to start at 1 and restart after 52 weeks, making the final two days of 2018 week 1 of 2019. The online documentation for pd.Series.dt.week only states that it returns the week ordinal of the year. This seems to be the ISO week number?
Why is there this discrepancy in the behavior of the two methods? Which one should be used and why? How can I elegantly get the ISO week number from a single Python datetime (or pd.Period or pd.Timestamp) object, as opposed to a Series?
import pandas as pd
from datetime import timedelta

df2 = pd.DataFrame({"Date_string": ["2018-12-27", "2018-12-28", "2018-12-29", "2018-12-30", "2018-12-31", "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-04", "2019-01-05", "2019-01-06", "2019-01-07"]})
df2["Date_datestamp"] = pd.to_datetime(df2["Date_string"], format='%Y-%m-%d')
df2["Date_period"] = df2['Date_datestamp'].dt.to_period("D")
df2["Week1"] = df2['Date_period'].apply(lambda x: (x + timedelta(days=1)).week)
df2["Week2"] = df2['Date_period'].apply(lambda x: x.strftime("%U"))
df2
returns
Date_string Date_datestamp Date_period Week1 Week2
0 2018-12-27 2018-12-27 2018-12-27 52 51
1 2018-12-28 2018-12-28 2018-12-28 52 51
2 2018-12-29 2018-12-29 2018-12-29 52 51
3 2018-12-30 2018-12-30 2018-12-30 1 52
4 2018-12-31 2018-12-31 2018-12-31 1 52
5 2019-01-01 2019-01-01 2019-01-01 1 00
6 2019-01-02 2019-01-02 2019-01-02 1 00
7 2019-01-03 2019-01-03 2019-01-03 1 00
8 2019-01-04 2019-01-04 2019-01-04 1 00
9 2019-01-05 2019-01-05 2019-01-05 1 00
10 2019-01-06 2019-01-06 2019-01-06 2 01
11 2019-01-07 2019-01-07 2019-01-07 2 01
This is because, in the strftime numbering, 2018 actually runs into a week 53 (%W labels 2018-12-31 as week 53), while .dt.week follows the ISO calendar, where the last days of 2018 already belong to week 1 of 2019. I would recommend using a year-week combination, something like:
df2['Year-Week'] = df2['Date_period'].apply(lambda x: x.strftime('%Y-%U'))
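On the sample data above this produces values such as '2018-51', '2018-52', '2019-00' and '2019-01', so the year prefix keeps week numbers on either side of the year boundary distinct.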
Edited:
To count weeks starting on Monday instead, you can try:
df2["Week2"] = df2['Date_period'].apply(lambda x: x.strftime("%W"))
This shows 2018-12-31 as week 53.
%U - week number of the year, with Sunday as the first day of the week
%W - week number of the year, with Monday as the first day of the week
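To get the ISO week from a single object (the last part of the question), the standard isocalendar() method works on both datetime.date and pd.Timestamp, for example:
import datetime
import pandas as pd
datetime.date(2018, 12, 31).isocalendar()[1]   # 1 -- ISO weeks roll into 2019 here
pd.Timestamp('2018-12-31').isocalendar()[1]    # 1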
Given a df of this kind, with a DatetimeIndex:
DateTime A
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
I would like to subset observations using the attributes of the index, like:
First business day of the month
Last business day of the month
First Friday of the month 'WOM-1FRI'
Third Friday of the month 'WOM-3FRI'
I'm specifically interested to know if this can be done using something like:
df.loc[(df['A'] < 5) & (df.index == 'WOM-3FRI'), 'Signal'] = 1
Thanks
You could try...
# FIRST OBSERVED DAY OF EACH MONTH (the very first month in the data is skipped)
df.loc[df[1:][df.index.month[:-1]!=df.index.month[1:]].index]
# LAST OBSERVED DAY OF EACH MONTH (the very last month in the data is skipped)
df.loc[df[:-1][df.index.month[:-1]!=df.index.month[1:]].index]
# 1st Friday of each month (note: x.index.week is the week of the *year*,
# so test the day of the month instead)
fr1 = df.groupby(df.index.year*100 + df.index.month).apply(
    lambda x: x[(x.index.day <= 7) & (x.index.weekday == 4)])
# 3rd Friday of each month
fr3 = df.groupby(df.index.year*100 + df.index.month).apply(
    lambda x: x[(x.index.day >= 15) & (x.index.day <= 21) & (x.index.weekday == 4)])
If you want to remove the extra level in the index of fr1 and fr3:
fr1.index=fr1.index.droplevel(0)
fr3.index=fr3.index.droplevel(0)
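If you prefer something closer to the df.loc[...] form from the question, one option (a sketch, not part of the original answer) is to build the third-Friday dates with a 'WOM-3FRI' date_range and test the normalized index against them:
wom3fri = pd.date_range(df.index.min(), df.index.max(), freq='WOM-3FRI')
df.loc[(df['A'] < 5) & df.index.normalize().isin(wom3fri), 'Signal'] = 1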