Given a df of this kind, where we have DateTime Index:
DateTime A
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
I would like to subset observations using the attributes of the index, like:
First business day of the month
Last business day of the month
First Friday of the month 'WOM-1FRI'
Third Friday of the month 'WOM-3FRI'
I'm specifically interested to know if this can be done using something like:
df.loc[(df['A'] < 5) & (df.index == 'WOM-3FRI'), 'Signal'] = 1
Thanks
You could try...
# FIRST DAY OF MONTH
df.loc[df[1:][df.index.month[:-1]!=df.index.month[1:]].index]
# LAST DAY OF MONTH
df.loc[df[:-1][df.index.month[:-1]!=df.index.month[1:]].index]
# 1st Friday
fr1 = df.groupby(df.index.year*100+df.index.month).apply(lambda x: x[(x.index.week==1)*(x.index.weekday==4)])
# 3rd Friday
fr3 = df.groupby(df.index.year*100+df.index.month).apply(lambda x: x[(x.index.week==3)*(x.index.weekday==4)])
If you want to remove extra-levels in the index of fr1 and fr3:
fr1.index=fr1.index.droplevel(0)
fr3.index=fr3.index.droplevel(0)
Related
I have a dataframe as such
Date Value
2022-01-01 10:00:00 7
2022-01-01 10:30:00 5
2022-01-01 11:00:00 3
....
....
2022-02-15 21:00:00 8
I would like to convert it into a day by row and hour by column format. The hours are the columns in this case. and the value column is now filled as cell values.
Date 10:00 10:30 11:00 11:30............21:00
2022-01-01 7 5 3 4 11
2022-01-02 8 2 4 4 13
How can I achieve this? I have tried pivot table but no success
Use pivot_table:
df['Date'] = pd.to_datetime(df['Date'])
out = df.pivot_table('Value', df['Date'].dt.date, df['Date'].dt.time, fill_value=0)
print(out)
# Output
Date 10:00:00 10:30:00 11:00:00 21:00:00
Date
2022-01-01 7 5 3 0
2022-02-15 0 0 0 8
To remove Date labels, you can use rename_axis:
for the top Date label: out.rename_axis(columns=None)
for the bottom Date label: out.rename_axis(index=None)
for both: out.rename_axis(index=None, columns=None)
You can change None by any string to rename axis.
My DataFrame:
start_trade week_day
0 2021-01-16 09:30:00 Saturday
1 2021-01-19 14:30:00 Tuesday
2 2021-01-25 22:00:00 Monday
3 2021-01-29 12:15:00 Friday
4 2021-01-31 12:35:00 Sunday
There are no trades on the exchange on Saturday and Sunday. Therefore, if my trading signal falls on the weekend, I want to open a trade on Friday 23:50.
Expexted output:
start_trade week_day
0 2021-01-15 23:50:00 Friday
1 2021-01-19 14:30:00 Tuesday
2 2021-01-25 22:00:00 Monday
3 2021-01-29 12:15:00 Friday
4 2021-01-29 23:50:00 Friday
How to do it?
You can do it playing with to_timedelta to change the date to the Friday of the week and then set the time with Timedelta. Do this only on the rows wanted with the mask
#for week ends dates
mask = df['start_trade'].dt.weekday.isin([5,6])
df.loc[mask, 'start_trade'] = (df['start_trade'].dt.normalize() # to get midnight
- pd.to_timedelta(df['start_trade'].dt.weekday-4, unit='D') # to get the friday date
+ pd.Timedelta(hours=23, minutes=50)) # set 23:50 for time
df.loc[mask, 'week_day'] = 'Friday'
print(df)
start_trade week_day
0 2021-01-15 23:50:00 Friday
1 2021-01-19 14:30:00 Tuesday
2 2021-01-25 22:00:00 Monday
3 2021-01-29 12:15:00 Friday
4 2021-01-29 23:50:00 Friday
Try:
weekend = df['week_day'].isin(['Saturday', 'Sunday'])
df.loc[weekend, 'week_day'] = 'Friday'
Or np.where along with str.contains, and | operator:
df['week_day'] = np.where(df['week_day'].str.contains(r'Saturday|Sunday'),'Friday',df['week_day'])
I have a pandas series s, I would like to extract the Monday before the third Friday:
with the help of the answer in following link, I can get a resample of third friday, I am still not sure how to get the Monday just before it.
pandas resample to specific weekday in month
from pandas.tseries.offsets import WeekOfMonth
s.resample(rule=WeekOfMonth(week=2,weekday=4)).bfill().asfreq(freq='D').dropna()
Any help is welcome
Many thanks
For each source date, compute your "wanted" date in 3 steps:
Shift back to the first day of the current month.
Shift forward to Friday in third week.
Shift back 4 days (from Friday to Monday).
For a Series containing dates, the code to do it is:
s.dt.to_period('M').dt.to_timestamp() + pd.offsets.WeekOfMonth(week=2, weekday=4)\
- pd.Timedelta('4D')
To test this code I created the source Series as:
s = (pd.date_range('2020-01-01', '2020-12-31', freq='MS') + pd.Timedelta('1D')).to_series()
It contains the second day of each month, both as the index and value.
When you run the above code, you will get:
2020-01-02 2020-01-13
2020-02-02 2020-02-17
2020-03-02 2020-03-16
2020-04-02 2020-04-13
2020-05-02 2020-05-11
2020-06-02 2020-06-15
2020-07-02 2020-07-13
2020-08-02 2020-08-17
2020-09-02 2020-09-14
2020-10-02 2020-10-12
2020-11-02 2020-11-16
2020-12-02 2020-12-14
dtype: datetime64[ns]
The left column contains the original index (source date) and the right
column - the "wanted" date.
Note that third Monday formula (as proposed in one of comments) is wrong.
E.g. third Monday in January is 2020-01-20, whereas the correct date is 2020-01-13.
Edit
If you have a DataFrame, something like:
Date Amount
0 2020-01-02 10
1 2020-01-12 10
2 2020-01-13 2
3 2020-01-20 2
4 2020-02-16 2
5 2020-02-17 12
6 2020-03-15 12
7 2020-03-16 3
8 2020-03-31 3
and you want something like resample but each "period" should start
on a Monday before the third Friday in each month, and e.g. compute
a sum for each period, you can:
Define the following function:
def dateShift(d):
d += pd.Timedelta(4, 'D')
d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
return d - pd.Timedelta(4, 'D')
i.e.:
Add 4 days (e.g. move 2020-01-13 (Monday) to 2020-01-17 (Friday).
Roll back (in the above case (on offset) this date will not be moved).
Subtract 4 days.
Run:
df.groupby(df.Date.apply(dateShift)).sum()
The result is:
Amount
Date
2019-12-16 20
2020-01-13 6
2020-02-17 24
2020-03-16 6
E. g. two values of 10 for 2020-01-02 and 2020-01-12 are assigned
to period starting on 2019-12-16 (the "wanted" date for December 2019).
I have the following table:
Hora_Retiro count_uses
0 00:00:18 1
1 00:00:34 1
2 00:02:27 1
3 00:03:13 1
4 00:06:45 1
... ... ...
748700 23:58:47 1
748701 23:58:49 1
748702 23:59:11 1
748703 23:59:47 1
748704 23:59:56 1
And I want to group all values within each hour, so I can see the total number of uses per hour (00:00:00 - 23:00:00)
I have the following code:
hora_pico_aug= hora_pico.groupby(pd.Grouper(key="Hora_Retiro",freq='H')).count()
Hora_Retiro column is of timedelta64[ns] type
Which gives the following output:
count_uses
Hora_Retiro
00:00:02 2566
01:00:02 602
02:00:02 295
03:00:02 5
04:00:02 10
05:00:02 4002
06:00:02 16075
07:00:02 39410
08:00:02 76272
09:00:02 56721
10:00:02 36036
11:00:02 32011
12:00:02 33725
13:00:02 41032
14:00:02 50747
15:00:02 50338
16:00:02 42347
17:00:02 54674
18:00:02 76056
19:00:02 57958
20:00:02 34286
21:00:02 22509
22:00:02 13894
23:00:02 7134
However, the index column starts at 00:00:02, and I want it to start at 00:00:00, and then go from one hour intervals. Something like this:
count_uses
Hora_Retiro
00:00:00 2565
01:00:00 603
02:00:00 295
03:00:00 5
04:00:00 10
05:00:00 4002
06:00:00 16075
07:00:00 39410
08:00:00 76272
09:00:00 56721
10:00:00 36036
11:00:00 32011
12:00:00 33725
13:00:00 41032
14:00:00 50747
15:00:00 50338
16:00:00 42347
17:00:00 54674
18:00:00 76056
19:00:00 57958
20:00:00 34286
21:00:00 22509
22:00:00 13894
23:00:00 7134
How can i make it to start at 00:00:00??
Thanks for the help!
You can create an hour column from Hora_Retiro column.
df['hour'] = df['Hora_Retiro'].dt.hour
And then groupby on the basis of hour
gpby_df = df.groupby('hour')['count_uses'].sum().reset_index()
gpby_df['hour'] = pd.to_datetime(gpby_df['hour'], format='%H').dt.time
gpby_df.columns = ['Hora_Retiro', 'sum_count_uses']
gpby_df
gives
Hora_Retiro sum_count_uses
0 00:00:00 14
1 09:00:00 1
2 10:00:00 2
3 20:00:00 2
I assume that Hora_Retiro column in your DataFrame is of
Timedelta type. It is not datetime, as in this case there
would be printed also the date part.
Indeed, your code creates groups starting at the minute / second
taken from the first row.
To group by "full hours":
round each element in this column to hour,
then group (just by this rounded value).
The code to do it is:
hora_pico.groupby(hora_pico.Hora_Retiro.apply(
lambda tt: tt.round('H'))).count_uses.count()
However I advise you to make up your mind, what do you want to count:
rows or values in count_uses column.
In the second case replace count function with sum.
I need to reshape a dataframe that looks like df1 and turn it into df2. There are 2 considerations for this procedure:
I need to be able to set the number of rows to be sliced as a parameter (length).
I need to split date and time from the index, and use date in the reshape as the column names and keep time as the index.
Current df1
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10
Desired Output df2 - With the parameter 'length=5'
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
What have I done:
My approach was to create a multi-index (Date - Time) and then do a pivot table or some sort of reshape to achieve the desired df output.
import pandas as pd
'''
First separate time and date
'''
df['TimeStamp'] = df.index
df['date'] = df.index.date
df['time'] = df.index.time
'''
Then create a way to separate the slices and make those specific dates available for then create
a multi-index.
'''
for index, row in df.iterrows():
df['Num'] = np.arange(len(df))
for index, row in df.iterrows():
if row['Num'] % 5 == 0:
df.loc[index, 'EventDate'] = df.loc[index, 'Date']
df.set_index(['EventDate', 'Hour'], inplace=True)
del df['Date']
del df['Num']
del df['TimeStamp']
Problem: There's a NaN appears next to each date of the first level of the multi-index. And even if that worked well, I can't find how to do what I need with a multiindex df.
I'm stuck. I appreciate any input.
import numpy as np
import pandas as pd
import io
data = '''\
val
2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10'''
df = pd.read_table(io.BytesIO(data), sep='\s{2,}', parse_dates=True)
chunksize = 5
chunks = len(df)//chunksize
df['Date'] = np.repeat(df.index.date[::chunksize], chunksize)[:len(df)]
index = df.index.time[:chunksize]
df['Time'] = np.tile(np.arange(chunksize), chunks)
df = df.set_index(['Date', 'Time'], append=False)
df = df['val'].unstack('Date')
df.index = index
print(df)
yields
Date 2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
Note that the final DataFrame has an index with non-unique entries. (The
18:00:00 is repeated.) Some DataFrame operations are problematic when the
index has repeated entries, so in general it is better to avoid this if
possible.
First of all I'm assuming your datetime column is actually a datetime type if not use df['t'] = pd.to_datetime(df['t']) to convert.
Then set your index using a multindex and unstack...
df.index = pd.MultiIndex.from_tuples(df['t'].apply(lambda x: [x.time(),x.date()]))
df['v'].unstack()
This would be a canonical approach for pandas:
First, setup with imports and data:
import pandas as pd
import StringIO
txt = '''2007-08-07 18:00:00 1
2007-08-08 00:00:00 2
2007-08-08 06:00:00 3
2007-08-08 12:00:00 4
2007-08-08 18:00:00 5
2007-11-02 18:00:00 6
2007-11-03 00:00:00 7
2007-11-03 06:00:00 8
2007-11-03 12:00:00 9
2007-11-03 18:00:00 10'''
Now read in the DataFrame, and pivot on the correct columns:
df1 = pd.read_csv(StringIO.StringIO(txt), sep=' ',
names=['d', 't', 'n'], )
print(df1.pivot(index='t', columns='d', values='n'))
prints a pivoted df:
d 2007-08-07 2007-08-08 2007-11-02 2007-11-03
t
00:00:00 NaN 2 NaN 7
06:00:00 NaN 3 NaN 8
12:00:00 NaN 4 NaN 9
18:00:00 1 5 6 10
You won't get a length of 5, though. The following,
2007-08-07 2007-11-02
18:00:00 1 6
00:00:00 2 7
06:00:00 3 8
12:00:00 4 9
18:00:00 5 10
is incorrect, as you have 18:00:00 twice for the same date, and in your initial data, they apply to different dates.