Remove rows that are not the 15th counted day of the month - python

Sheet 1 has a column 'Date' with 10 years' worth of dates. These dates are trading days for the Australian stock market. I'm looking to remove all dates that are not the 15th trading day of each month (not necessarily the 15th day of the month). This code works for the first 12 months of the first year, but it stops after that.
Code:
import pandas as pd

df = pd.read_csv(r'C:\Users\\Desktop\Sheet1.csv')
df['Date'] = pd.to_datetime(df['Date'])
df['month'] = df['Date'].dt.month
df['trading_day'] = df.groupby(['month']).cumcount() + 1
df = df[df['trading_day'] == 15]
df.drop(['month', 'trading_day'], axis=1, inplace=True)
df.to_excel("Sheet2.xlsx", index=False)
Current output:
Date NAV
2009-06-22 00:00:00 $50.7731
2009-07-21 00:00:00 $52.2194
2009-08-21 00:00:00 $55.5233
2009-09-21 00:00:00 $61.1116
2009-10-21 00:00:00 $62.6512
2009-11-20 00:00:00 $60.9736
2009-12-21 00:00:00 $60.2841
2010-01-22 00:00:00 $61.2418
2010-02-19 00:00:00 $59.8768
2010-03-19 00:00:00 $63.4521
2010-04-23 00:00:00 $63.1672
2010-05-21 00:00:00 $55.8651

You also need to group by year to compute the cumcount:
df['trading_day'] = df.groupby([df['Date'].dt.year, 'month']).cumcount() + 1
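To see the fix end to end, here is a minimal sketch on a synthetic business-day calendar (the real file contains actual ASX trading days, so the dates here are only illustrative):

```python
import pandas as pd

# Synthetic stand-in for the CSV: weekday dates across two years
df = pd.DataFrame({'Date': pd.bdate_range('2009-01-01', '2010-12-31')})

# Grouping by (year, month) restarts the trading-day counter every
# month of every year, not just within the first twelve months
df['trading_day'] = df.groupby(
    [df['Date'].dt.year, df['Date'].dt.month]).cumcount() + 1
df = df[df['trading_day'] == 15].drop(columns='trading_day')
```

With only 'month' in the groupby, January 2010 continues counting where January 2009 left off, so the counter never hits 15 again after the first year; adding the year restarts it.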


Extract a min value and corresponding row information based on date time index

I'm unable to filter my dataframe to get the details I need. I'm trying to obtain the min value for any given 24-hour period while keeping the detail in the filtered rows, like the day name.
Below is the head of the df dataframe. It has a datetime index t; I've pulled the hour column as an integer from the index's time information, and Low is a float. When I run the code between the two dataframes below, df2 is the second dataframe I get returned:
import pandas as pd

df = pd.read_csv(r'C:/Users/Oliver/Documents/Data/eurusd1hr.csv')
df.drop(columns=(['Close', 'Open', 'High', 'Volume']), inplace=True)
df['t'] = (df['Date'] + ' ' + df['Time'])
df.drop(columns = ['Date', 'Time'], inplace=True)
df['t'] = pd.DatetimeIndex(df.t)
df.set_index('t', inplace=True)
df['hour'] = df.index.hour
df['day'] = df.index.day_name()
                         Low  hour     day
t
2003-05-05 03:00:00  1.12154     3  Monday
2003-05-05 04:00:00  1.12099     4  Monday
2003-05-05 05:00:00  1.12085     5  Monday
2003-05-05 06:00:00  1.12049     6  Monday
2003-05-05 07:00:00  1.12079     7  Monday

df2 = df.between_time('00:00', '23:00').groupby(pd.Grouper(freq='d')).min()

                Low  hour        day
t
2003-05-05  1.12014   3.0     Monday
2003-05-06  1.12723   0.0    Tuesday
2003-05-07  1.13265   0.0  Wednesday
2003-05-08  1.13006   0.0   Thursday
2003-05-09  1.14346   0.0     Friday
I want to keep the corresponding hour in the hour column, and also keep the hour information in the original index, the same way the day name has been maintained. I was expecting the index and hour column to keep that information. I've tried adding a second Grouper, and I have also tried resetting the index, but both failed. Any help would be gratefully received. Thanks!
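One pattern that keeps whole rows (rather than taking column-wise minima the way .min() does) is to ask the groupby for the index label of each day's lowest Low with idxmin, then select those labels with .loc. A sketch on synthetic data, since I don't have the eurusd1hr.csv file:

```python
import numpy as np
import pandas as pd

# Synthetic hourly data standing in for the FX file
idx = pd.date_range('2003-05-05', periods=48, freq='h')
df = pd.DataFrame({'Low': np.random.default_rng(0).random(48)}, index=idx)
df.index.name = 't'
df['hour'] = df.index.hour
df['day'] = df.index.day_name()

# idxmin returns the timestamp of each day's lowest Low; selecting
# those timestamps keeps hour and day from the winning row intact
daily_min = df.loc[df.groupby(df.index.date)['Low'].idxmin()]
```

Because entire rows are selected, the hour column and the full timestamp in the index survive, unlike with a column-wise .min().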

First week of year considering the first day last year

I have the following df:
time_series date sales
store_0090_item_85261507 1/2020 1,0
store_0090_item_85261501 2/2020 0,0
store_0090_item_85261500 3/2020 6,0
Here 'date' is Week/Year.
So I tried the following code:
from datetime import datetime

df['date'] = df['date'].apply(lambda x: datetime.strptime(x + '/0', "%U/%Y/%w"))
But it returns this df:
time_series date sales
store_0090_item_85261507 2020-01-05 1,0
store_0090_item_85261501 2020-01-12 0,0
store_0090_item_85261500 2020-01-19 6,0
But the first day of the first week of 2020 is 2019-12-29, considering Sunday as the first day. How can I get 2019-12-29 as the first day of the first week of 2020, rather than 2020-01-05?
From the datetime module's documentation:
%U: Week number of the year (Sunday as the first day of the week) as a zero padded decimal number. All days in a new year preceding the first Sunday are considered to be in week 0.
Edit: My original answer doesn't work for input 1/2023, and using ISO 8601 date values doesn't work for 1/2021, so I've edited this answer by adding a custom function.
Here is a way with a custom function:
import pandas as pd
from datetime import datetime, timedelta
##############################################
# to demonstrate issues with certain dates
print(datetime.strptime('0/2020/0', "%U/%Y/%w")) # 2019-12-29 00:00:00
print(datetime.strptime('1/2020/0', "%U/%Y/%w")) # 2020-01-05 00:00:00
print(datetime.strptime('0/2021/0', "%U/%Y/%w")) # 2020-12-27 00:00:00
print(datetime.strptime('1/2021/0', "%U/%Y/%w")) # 2021-01-03 00:00:00
print(datetime.strptime('0/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
print(datetime.strptime('1/2023/0', "%U/%Y/%w")) # 2023-01-01 00:00:00
#################################################
df = pd.DataFrame({'date':["1/2020", "2/2020", "3/2020", "1/2021", "2/2021", "1/2023", "2/2023"]})
print(df)
def get_first_day(date):
    # reference dates: week 0 and week 1 of the same year
    date0 = datetime.strptime('0/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date1 = datetime.strptime('1/' + date.split('/')[1] + '/0', "%U/%Y/%w")
    date = datetime.strptime(date + '/0', "%U/%Y/%w")
    return date if date0 == date1 else date - timedelta(weeks=1)

df['new_date'] = df['date'].apply(get_first_day)
print(df)
Input
date
0 1/2020
1 2/2020
2 3/2020
3 1/2021
4 2/2021
5 1/2023
6 2/2023
Output
date new_date
0 1/2020 2019-12-29
1 2/2020 2020-01-05
2 3/2020 2020-01-12
3 1/2021 2020-12-27
4 2/2021 2021-01-03
5 1/2023 2023-01-01
6 2/2023 2023-01-08
You'll want to use ISO week parsing directives, e.g.:
import pandas as pd
date = pd.Series(["1/2020", "2/2020", "3/2020"])
pd.to_datetime(date+"/1", format="%V/%G/%u")
0 2019-12-30
1 2020-01-06
2 2020-01-13
dtype: datetime64[ns]
You can also shift by one day if the week should start on Sunday:
pd.to_datetime(date+"/1", format="%V/%G/%u") - pd.Timedelta('1d')
0 2019-12-29
1 2020-01-05
2 2020-01-12
dtype: datetime64[ns]

How do I delete the first year of data by variable from a pandas dataframe?

I have a pandas dataframe with various stock symbols and prices by day, and I'm looking to remove the first year of data for each symbol.
Current dataframe:
Date Symbol Price
2009-01-01 00:00:00 A $10.00
2009-01-02 00:00:00 A $11.00
...
2010-01-01 00:00:00 A $12.00
...
2019-01-01 00:00:00 A $15.00
2009-01-01 00:00:00 B $100.00
...
2019-01-01 00:00:00 B $200.00
Goal dataframe:
Date Symbol Price
2010-01-01 00:00:00 A $12.00
...
2019-01-01 00:00:00 A $15.00
2010-01-01 00:00:00 B $100.00
...
2019-01-01 00:00:00 B $200.00
Any help is appreciated, thanks!
You can use the Date column to get only the year, and then use it to drop rows.
If Date is a string, then you could get the year using:
df["Year"] = df["Date"].str[:4]
and filter using the string "2009":
df = df[ df["Year"] != "2009" ]
If Date is kept as a datetime object, then you may need something like:
df["Year"] = df["Date"].dt.year
and filter using the integer 2009:
df = df[ df["Year"] != 2009 ]
but I'm not sure.
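Note that filtering out 2009 drops the same calendar year for every symbol. If the symbols don't all start in 2009, a per-symbol cutoff is safer: compute each symbol's first date with a groupby transform and keep only rows at least one year later. A sketch on toy data:

```python
import pandas as pd

# Toy data: symbol B starts later than symbol A
df = pd.DataFrame({
    'Date': pd.to_datetime(['2009-01-01', '2009-06-01', '2010-01-01',
                            '2010-03-01', '2010-09-01', '2011-03-01']),
    'Symbol': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Price': [10.0, 11.0, 12.0, 100.0, 150.0, 200.0],
})

# First observed date per symbol, broadcast back to every row
first_date = df.groupby('Symbol')['Date'].transform('min')

# Keep rows dated a full year or more after that symbol's first date
out = df[df['Date'] >= first_date + pd.DateOffset(years=1)]
```

Here A keeps rows from 2010-01-01 onward while B keeps rows from 2011-03-01 onward, each measured from its own first observation.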

How can I change the datetime values in a pandas df?

I have a pandas dataframe and datetime is used as an index in the following format: datetime.date(2018, 12, 31).
Each datetime represents the fiscal year end, i.e. 31/12/2018, 31/12/2017, 31/12/2016 etc.
However, for some companies the fiscal year end may be 30/11/2018 or 31/10/2018 and etc. instead of the last date of each year.
Is there any quick way of changing the non-standardized datetimes to the last date of each year?
i.e. from 30/11/2018 to 31/12/2018, and from 31/10/2018 to 31/12/2018, and so on.
df = pd.DataFrame({'datetime': ['2019-01-02','2019-02-01', '2019-04-01', '2019-06-01', '2019-11-30','2019-12-30'],
'data': [1,2,3,4,5,6]})
df['datetime'] = pd.to_datetime(df['datetime'])
df['quarter'] = df['datetime'] + pd.tseries.offsets.QuarterEnd(n=0)
df
datetime data quarter
0 2019-01-02 1 2019-03-31
1 2019-02-01 2 2019-03-31
2 2019-04-01 3 2019-06-30
3 2019-06-01 4 2019-06-30
4 2019-11-30 5 2019-12-31
5 2019-12-30 6 2019-12-31
We have a datetime column with some dates I picked. Then we add a time series offset to each date to roll it forward to quarter end and standardize the dates.
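Since the question asked for the last date of each year rather than of each quarter, the same offset idea also works with YearEnd. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(
    ['2018-10-31', '2018-11-30', '2018-12-31'])})

# YearEnd(n=0) rolls each date forward to the last day of its own
# year; dates already at year end are left unchanged
df['year_end'] = df['datetime'] + pd.tseries.offsets.YearEnd(n=0)
```

All three rows land on 2018-12-31, including the one that was already there, which is the reason for n=0 rather than the default n=1.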

How to add a new column by searching for data in a Pandas time series dataframe

I have a Pandas time series dataframe.
It has minute data for a stock for 30 days.
I want to create a new column stating the price of the stock at noon for that day, e.g. for all lines for January 1, I want a new column with the price at noon on January 1; for all lines for January 2, I want a new column with the price at noon on January 2, etc.
Existing timeframe:
Date     Time   Last_Price  12amT
1/1/19   08:00  100         ?
1/1/19   08:01  101         ?
1/1/19   08:02  100.50      ?
...
31/1/19  21:00  106         ?
I used this hack, but it is very slow, and I assume there is a quicker and easier way to do this.
for lab, row in df.iterrows():
    t = row["Date"]
    df.loc[lab, "12amT"] = df[(df['Date'] == t) & (df['Time'] == "12:00")]["Last_Price"].values[0]
One way to do this is to use groupby with pd.Grouper:
For pandas 0.24.1+:
df.groupby(pd.Grouper(freq='D'))[0]\
.transform(lambda x: x.loc[(x.index.hour == 12) &
(x.index.minute==0)].to_numpy()[0])
For older pandas, use:
df.groupby(pd.Grouper(freq='D'))[0]\
.transform(lambda x: x.loc[(x.index.hour == 12) &
(x.index.minute==0)].values[0])
MVCE:
import numpy as np

df = pd.DataFrame(np.arange(48*60), index=pd.date_range('02-01-2019', periods=(48*60), freq='T'))
df['12amT'] = df.groupby(pd.Grouper(freq='D'))[0].transform(lambda x: x.loc[(x.index.hour == 12)&(x.index.minute==0)].to_numpy()[0])
Output (head):
0 12amT
2019-02-01 00:00:00 0 720
2019-02-01 00:01:00 1 720
2019-02-01 00:02:00 2 720
2019-02-01 00:03:00 3 720
2019-02-01 00:04:00 4 720
I'm not sure why you have two DateTime columns, I made my own example to demonstrate:
import numpy as np

ind = pd.date_range('1/1/2019', '30/1/2019', freq='H')
df = pd.DataFrame({'Last_Price':np.random.random(len(ind)) + 100}, index=ind)
def noon_price(df):
    noon_price = df.loc[df.index.hour == 12, 'Last_Price'].values
    noon_price = noon_price[0] if len(noon_price) > 0 else np.nan
    df['noon_price'] = noon_price
    return df
df.groupby(df.index.day).apply(noon_price).reindex(ind)
reindex by default will fill each day's rows with its noon_price.
To add a column with the next day's noon price, you can shift the column 24 rows down, like this:
df['T+1'] = df.noon_price.shift(-24)
