How to calculate relative volume using pandas with faster way? - python

I am trying to implement the RVOL by the time of day technical indicator, which can be used as the indication of market strength.
The logic behind this is as follows:
If the current time is 2022/3/19 13:00, we look through the same moment (13:00) at the previous N days and average all the previous volumes at that moment to calculate Average_volume_previous.
Then, RVOL(t) is volume(t)/Average_volume_previous(t).
This logic is hard to express with methods like rolling and apply, so I wrote it as a for loop.
However, the running time of the for loop is catastrophically long.
from datetime import datetime
import pandas as pd
import numpy as np
datetime_array = pd.date_range(datetime.strptime('2015-03-19 13:00:00', '%Y-%m-%d %H:%M:%S'), datetime.strptime("2022-03-19 13:00:00", '%Y-%m-%d %H:%M:%S'), freq='30min')
volume_array = pd.Series(np.random.uniform(1000, 10000, len(datetime_array)))
df = pd.DataFrame({'Date':datetime_array, 'Volume':volume_array})
df.set_index(['Date'], inplace=True)
day_len = 10  # lookback N in days (value assumed; not given in the question)
output = []
for idx in range(len(df)):
    # Time of day of the current row, e.g. '13:0'
    date = str(df.index[idx].hour)+':'+str(df.index[idx].minute)
    # All earlier rows recorded at the same time of day
    temp_date = df.iloc[:idx].between_time(date, date)
    # Average of the last day_len of those rows
    output.append(temp_date.tail(day_len).mean().iloc[0])
output = np.array(output)
Practically, there might be missing data in the datetime array. So, it would be hard to use fixed length lookback period to solve this. Is there any way to make this code work faster?

I'm not sure I fully understand the question, but here is a solution based on my understanding.
I didn't use the date as the index, so I skipped this line:
# df.set_index(['Date'], inplace=True)
# Filter the data down to a single time of day (13:00)
rolling_day = 10
hour = df['Date'].dt.hour == 13
minute = df['Date'].dt.minute == 0
df_moment = df[hour & minute].copy()
# Calculation of moving averages
df_moment['rolling'] = df_moment['Volume'].rolling(rolling_day).mean()
# Calculation of volume / rolling average
for idx_s, idx_e in zip(df_moment['Volume'][::rolling_day], df_moment['rolling'][rolling_day::rolling_day]):
    print(f'{idx_s/idx_e}')
Output:
0.566379345408499
0.7229214799940626
0.6753586759429548
2.0588617812341354
0.7494803741982076
1.2132554086225438
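The rule described in the question (divide each volume by the average of the previous N volumes recorded at the same time of day) can probably be expressed without an explicit loop by grouping on the time of day and applying a shifted rolling mean inside each group. The following is only a sketch under assumptions: the names rvol_by_time_of_day, n_days and demo are hypothetical, and the data is a small made-up series. Because the window counts observations within each time-of-day group rather than calendar days, missing days simply shorten the group and do not break the lookback.

```python
import numpy as np
import pandas as pd

def rvol_by_time_of_day(volume, n_days):
    """volume(t) / mean of the previous n_days volumes at the same clock time.

    `volume` is a Series with a sorted DatetimeIndex (hypothetical helper).
    """
    same_time = volume.groupby(volume.index.time)
    # shift(1) excludes the current bar; min_periods=1 mirrors tail(N).mean()
    prev_avg = same_time.transform(
        lambda s: s.shift(1).rolling(n_days, min_periods=1).mean()
    )
    return volume / prev_avg

# Small made-up sample: four days, two bars per day
idx_1300 = pd.date_range("2022-03-01 13:00", periods=4, freq="D")
idx_1330 = pd.date_range("2022-03-01 13:30", periods=4, freq="D")
demo = pd.concat([
    pd.Series([10.0, 20.0, 30.0, 40.0], index=idx_1300),
    pd.Series([100.0] * 4, index=idx_1330),
]).sort_index()

rvol = rvol_by_time_of_day(demo, n_days=2)
print(rvol)
```

The first bar at each time of day has no history, so its RVOL is NaN.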

Related

Python: slice yearly data between February and June with pandas

I have a dataset with 10 years of data from 2000 to 2010. The initial datetime is 2000-01-01, and the data is resampled to daily. I also have a week counter so that, when I slice, I can ask only for week 5 to week 21 (February 1 to May 30).
I am a little stuck on how to slice it for every year. Does it involve a loop, or is there a time series function in Python that can slice a specific period in every year? Below is the code I have so far; I had a for loop that was supposed to do slice(5, 21), but that didn't work.
Any suggestions how might I get this to work?
import pandas as pd
from datetime import datetime, timedelta
initial_datetime = pd.to_datetime("2000-01-01")
# Read the file
df = pd.read_csv("D:/tseries.csv")
# Convert seconds to datetime
df["Time"] = df["Time"].map(lambda dt: initial_datetime+timedelta(seconds=dt))
df = df.set_index(pd.DatetimeIndex(df["Time"]))
resampling_period = "24H"
df = df.resample(resampling_period).mean().interpolate()
df["Week"] = df.index.map(lambda dt: dt.week)
print(df)
You can slice using loc:
df.loc[df.Week.isin(range(5,22))]
If you want separate calculations per year (e.g. the mean), you can use groupby:
subset = df.loc[df.Week.isin(range(5,22))]
subset.groupby(subset.index.year).mean()
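To see the two steps together, here is a small self-contained sketch on made-up daily data. The column name value and the three-year range are assumptions for illustration, and the week number is taken from ISO week numbering via isocalendar(), so week 5 can start in the last days of January.

```python
import numpy as np
import pandas as pd

# Made-up daily data covering three years
idx = pd.date_range("2000-01-01", "2002-12-31", freq="D")
df = pd.DataFrame({"value": np.arange(len(idx), dtype=float)}, index=idx)

# ISO week number of each day
df["Week"] = df.index.isocalendar().week.astype(int)

# Keep weeks 5 to 21 (inclusive) of every year
subset = df.loc[df["Week"].isin(range(5, 22))]

# One mean per calendar year over the selected weeks
yearly_mean = subset.groupby(subset.index.year)["value"].mean()
print(yearly_mean)
```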

find the time different between the day

Dataframe
I have different machines that run for different numbers of hours, and a run may cross midnight. I want to split the running time by day.
Example: Machine A runs for 8 hours, from 12-Aug 9pm to 13-Aug 5am.
I can't get the correct split of 3 hours on 12-Aug and 5 hours on 13-Aug.
I suspect this is because I'm using datetime.now.
How do I make the date match the Start date/End date in Python?
Here is my code:
endoftoday = datetime.now()
endoftoday = endoftoday.replace(hour=23,minute=59,second=59)
dt['Start_Date']=dt['Start_Time'].dt.strftime('%d/%m/%Y')
dt['End_Date']=dt['Finish_Time'].dt.strftime('%d/%m/%Y')
if (dt['Start_Date'].str == dt['End_Date'].str):
    dt['Tested_Time_Today']= endoftoday-dt['Start_Time']
    dt['Tested_Time_NextDay']= dt['Finish_Time'] - endoftoday
Here is my attempt:
import pandas as pd
import datetime
def get_times(args):
    start_time, end_time, start_date, end_date = args
    hours = {}
    for day in pd.date_range(start_date, end_date, freq='d'):
        hours[day] = max(day, end_time) - max(start_time, day) + datetime.timedelta(hours=24)
    return hours
df = pd.DataFrame({'Start_Time': [datetime.datetime(2021,8,21,6,2), datetime.datetime(2021,8,21,7,19)], 'Finish_Time': [datetime.datetime(2021,8,22,5,12), datetime.datetime(2021,8,21,16,50)], 'Start_Date': [datetime.date(2021,8,21), datetime.date(2021,8,21)], 'End_Date': [datetime.date(2021,8,22), datetime.date(2021,8,21)]})
df['hours'] = df.apply(get_times, axis=1)
print(df)
This is probably not exactly what you are looking for since I also don't really understand your question well enough. But what you get is a new column which contains in each row a dictionary with the dates as key and the hours during that day as value.
If you let us know what exactly you are after, I might be able to improve the answer.
Edit: This won't work if your time period covers more than two days. If that is necessary, the time calculation would have to be slightly extended. And if you have more columns than the ones that we perform the calculation on, please change the penultimate line to df['hours'] = df[['Start_Time', 'Finish_Time', 'Start_Date', 'End_Date']].apply(get_times, axis=1)
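One way to cover runs spanning any number of days is to clip each run to the boundaries of every calendar day it touches. The following is a minimal sketch, not the answer's original method: the helper name hours_per_day is hypothetical, and the year 2021 is assumed for the question's 12-Aug example.

```python
import pandas as pd

def hours_per_day(start_time, end_time):
    """Split the run [start_time, end_time) into hours per calendar day."""
    start_time = pd.Timestamp(start_time)
    end_time = pd.Timestamp(end_time)
    result = {}
    # Iterate over every midnight-aligned day the run touches
    for day in pd.date_range(start_time.normalize(), end_time.normalize(), freq="D"):
        next_day = day + pd.Timedelta(days=1)
        # Overlap of the run with [day, next_day)
        overlap = min(next_day, end_time) - max(day, start_time)
        if overlap > pd.Timedelta(0):
            result[day.date()] = overlap / pd.Timedelta(hours=1)
    return result

# Machine A from the question: 12-Aug 9pm to 13-Aug 5am (year assumed)
split = hours_per_day("2021-08-12 21:00", "2021-08-13 05:00")
print(split)
```

For that run the split is 3 hours on the first day and 5 hours on the second.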

Selecting specific date from pandas data-frame

From the daily stock price data, I want to sample and select the end-of-month price. I am doing this with the following code.
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin=end-pd.DateOffset(365*2)
st=begin.strftime('%Y-%m-%d')
ed=end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-1])).set_index(data.index)
The line above selects end of the month data and here is the output.
If I want to select penultimate value of the month, I can do it using the following code.
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-2]))
Here is the output.
However the index shows end of the month value. When I choose penultimate value of the month, I want index to be 2015-12-30 instead of 2015-12-31.
Please suggest the way forward. I hope my question is clear.
Thanking you in anticipation.
Regards,
Abhishek
I am not sure if there is a way to do it with resample. But you can get what you want using groupby and pd.Grouper (the replacement for the older pd.TimeGrouper, which has been removed from pandas).
import datetime
from pandas_datareader import data as pdr
import pandas as pd
end = datetime.date.today()
begin = end - pd.DateOffset(365*2)
st = begin.strftime('%Y-%m-%d')
ed = end.strftime('%Y-%m-%d')
data = pdr.get_data_yahoo("AAPL",st,ed)
# Keep the actual trading date as a column so it survives the grouping
data['Date'] = data.index
mon_data = (
    data[['Date', 'Adj Close']]
    .groupby(pd.Grouper(freq='M')).nth(-2)
    .set_index('Date')
)
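A variant of the same idea that does not depend on how nth labels its result is to number the rows of each month from the end and keep the row numbered 1. This is only a sketch on made-up business-day data (no download, no holiday calendar), and the names prices and penultimate are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up prices on plain business days
idx = pd.bdate_range("2015-10-01", "2015-12-31")
prices = pd.DataFrame({"Adj Close": np.arange(len(idx), dtype=float)}, index=idx)

# Calendar month of every row
month = prices.index.to_period("M")

# Position counted from the end of each month: 0 = last row, 1 = penultimate row
rank_from_end = prices.groupby(month).cumcount(ascending=False)

# Rows keep their original dates as the index
penultimate = prices[rank_from_end == 1]
print(penultimate)
```

Because the rows are selected rather than aggregated, the index holds the real dates, for example 2015-12-30 rather than 2015-12-31.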
The simplest solution is to take the index of your newly created dataframe and subtract the number of days you want to go back:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('M').apply(lambda x: x[-1-n]))
mon_data.index = mon_data.index - datetime.timedelta(days=n)
Also, looking at your data, I think you should resample not to 'month end frequency' but rather to 'business month end frequency':
.resample('BM')
But even that won't cover everything: for instance, December 29, 2017 is a business month end, but this date doesn't appear in your data (which ends on December 08, 2017). You could add a small fix for that (assuming the original data is sorted by date):
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
So the full code will look like:
n = 1
mon_data=pd.DataFrame(data['Adj Close'].resample('BM').apply(lambda x: x[-1-n]))
end_of_months = mon_data.index.tolist()
end_of_months[-1] = data.index[-1]
mon_data.index = end_of_months
mon_data.index = mon_data.index - datetime.timedelta(days=n)
By the way, your .set_index(data.index) throws an error because data and mon_data have different lengths (mon_data is grouped by month).

Change Date in TimeSeries Dataframes Python

I am trying to change the date part of an hourly time series. I am doing this by using this method.
import numpy as np
import pandas as pd
np.random.seed(0)
df_volume = pd.DataFrame(np.random.randint(60, 100, (24,4)), index=pd.date_range('2015-08-11', periods=24, freq='H'), columns='Col1 Col2 Col3 Col4'.split())
#df_volume.reset_index(inplace=True)
df_volume.index = pd.to_datetime('2015-01-01') + pd.to_datetime(df_volume.index,format='%H:%M:%S')
print (df_volume)
The error says raise NotImplementedError. I don't know how to solve this.
If I reset_index and then try to change the date, the error is time data 0 does not match format '%H:%M:%S' (match). Please help.
I have 24 hours' worth of data and I want to change the date and keep the hours the way they are.
I am not exactly sure what you are trying to do, however it sounds like you want to change the days of the index without changing the hours, minutes or seconds. If that is what you are looking for this should do it. You can change the days or months for what you need.
from pandas.tseries.offsets import DateOffset
df_volume.index = df_volume.index - DateOffset(months=7, days=10)
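If the target date is known but the offset is not, another option is to strip the date from each timestamp and add the time of day to the new date. A minimal sketch, assuming an hourly index like the one in the question (the names idx, time_of_day and new_idx are illustrative):

```python
import pandas as pd

# Hourly index like the one in the question
idx = pd.date_range("2015-08-11", periods=24, freq="60min")

# Time elapsed since midnight for each timestamp (a TimedeltaIndex)
time_of_day = idx - idx.normalize()

# Attach the same times of day to the new date
new_idx = pd.Timestamp("2015-01-01") + time_of_day
print(new_idx[:3])
```

The result can then be assigned with df_volume.index = new_idx.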

Calculate daily sums using python pandas

I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I have come up with so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
First problem is that calculated sum corresponds to the next day. I've been able to solve that by using parameter loffset='-1d'.
Now the actual problem is that the data may start not from 00:30 of a day but at any time of a day. Also the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold of number of values that are necessary to calculate daily sums? (e.g. if there're less than 40 values in a single day, then put NaN instead of a sum)
I believe that it is possible to define a custom function to do that and refer to it in 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper (note that this gives plain daily sums, without the minimum-count threshold):
d = s.groupby(pd.Grouper(freq='1D')).sum()
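For the threshold asked about in the question, the sum method of a resampled Series accepts a min_count argument: a day with fewer than min_count non-NaN values yields NaN instead of a sum. The sketch below uses made-up half-hourly data starting at 00:30, not the pastebin file, and the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up half-hourly data: 47 values on 2012-01-01 and 25 on 2012-01-02
idx = pd.date_range("2012-01-01 00:30", periods=72, freq="30min")
s = pd.Series(1.0, index=idx)

# Simulate gaps in the first day
s.iloc[[3, 4]] = np.nan

# Daily sums, but only for days with at least 40 valid values
daily = s.resample("D").sum(min_count=40)
print(daily)
```

The first day has 45 valid values and gets a sum; the second has only 25 and becomes NaN.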
