I am a complete beginner with pandas and I am following a tutorial that is apparently outdated.
I have this simple script that produces the following error when I run it:
ValueError: Array conditional must be same shape as self
# loading the class data from the package pandas_datareader
import pandas as pd
from pandas_datareader import data
import matplotlib.pyplot as plt
# Adj Close:
# The closing price of the stock that adjusts the price of the stock for corporate actions.
# This price takes into account the stock splits and dividends.
# The adjusted close is the price we will use for this example.
# Indeed, since it takes into account splits and dividends, we will not need to adjust the price manually.
# First day
start_date = '2014-01-01'
# Last day
end_date = '2018-01-01'
# Call the function DataReader from the class data
goog_data = data.DataReader('GOOG', 'yahoo', start_date, end_date)
goog_data_signal = pd.DataFrame(index=goog_data.index)
goog_data_signal['price'] = goog_data['Adj Close']
goog_data_signal['daily_difference'] = goog_data_signal['price'].diff()
goog_data_signal['signal'] = 0.0
# this line produces the error
goog_data_signal['signal'] = pd.DataFrame.where(goog_data_signal['daily_difference'] > 0, 1.0, 0.0)
goog_data_signal['positions'] = goog_data_signal['signal'].diff()
print(goog_data_signal.head())
I am trying to understand the theory, the libraries and the methodology through practice, so bear with me if this is too obvious... :]
In your code, where is called as an unbound DataFrame method, but here you only need to apply the condition to a Series, so I found two ways to solve this problem:
The where method doesn't let you set a value for the rows where the condition is true (1.0 in your case); those rows keep their original values. It does, however, let you set a value for the false rows (called the other parameter in the docs). So you can set the 1.0's manually afterwards:
goog_data_signal['signal'] = goog_data_signal['daily_difference'].where(goog_data_signal['daily_difference'] > 0, other=0.0)
# the rows where the condition holds keep their (positive) daily_difference values; set them to 1.0 as needed.
Or you can check the condition directly as follows:
goog_data_signal['signal'] = (goog_data_signal['daily_difference'] > 0).astype(int)
The second method produces this output for me:
price daily_difference signal positions
Date
2014-01-02 554.481689 NaN 0 NaN
2014-01-03 550.436829 -4.044861 0 0.0
2014-01-06 556.573853 6.137024 1 1.0
2014-01-07 567.303589 10.729736 1 0.0
2014-01-08 568.484192 1.180603 1 0.0
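For completeness (my addition, not part of the answer above): NumPy's where does take both a true value and a false value, so the tutorial's one-liner can also be reproduced with np.where, assuming numpy is imported:
import numpy as np
# 1.0 where the daily difference is positive, 0.0 otherwise
goog_data_signal['signal'] = np.where(goog_data_signal['daily_difference'] > 0, 1.0, 0.0)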
So I have an excel file containing data on a specific stock.
My Excel file contains about 2 months of data; it records the Open, High, Low and Close prices and the Volume of trades in 5-minute intervals, so there are about 3,000 rows in my file.
I want to calculate the RSI (or the EMA if that's easier) of the stock per day. I'm making a summary table that collects the daily data, so it converts my table of 3,000+ rows into a table with only about 60 rows (each row represents one day).
Essentially I want code that groups the Excel data by date and then calculates the RSI as a single value for each day. RSI is given by 100 - (100 / (1 + RS)), where RS = average gain of up periods / average loss of down periods.
Note: My excel uses 'Datetime' so each row's 'Datetime' looks something like '2022-03-03 9:30-5:00' and the next row would be '2022-03-03 9:35-5:00', etc. So the code needs to just look at the date and ignore the time I guess.
Some code to maybe help understand what I'm looking for:
Here I read in my Excel file. I want the code to take this data, group it by date and then calculate the RSI of each day using the formula I wrote above.
dat = pd.read_csv('AMD_5m.csv',index_col='Datetime',parse_dates=['Datetime'],
date_parser=lambda x: pd.to_datetime(x, utc=True))
dates = backtest.get_dates(dat.index)
#create a summary table
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI'] # add additional fields if necessary
summary_table = pd.DataFrame(index = dates, columns=cols)
# loop backtest by dates
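If the backtest.get_dates helper isn't available, a minimal substitute for the "look at the date and ignore the time" step (my assumption about what that helper returns) is to take the unique calendar dates from the parsed index:
# unique calendar dates from the DatetimeIndex (the time component is dropped by .date)
dates = pd.unique(dat.index.date)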
This is the code I used to fill out the other columns in my summary table; I'll put my SMA (simple moving average) function below.
for d in dates:
    this_dat = dat.loc[dat.index.date == d]
    # find the number of observations in date d
    summary_table.loc[d]['Num. Obs.'] = this_dat.shape[0]
    # get trading (i.e. position holding) signals
    signals = backtest.SMA(this_dat['Close'].values, window=10)
    # find the number of trades in date d
    summary_table.loc[d]['Num. Trade'] = np.sum(np.diff(signals) == 1)
    # find PnLs for 100 shares
    shares = 100
    PnL = -shares * np.sum(this_dat['Close'].values[1:] * np.diff(signals))
    if np.sum(np.diff(signals)) > 0:
        # close position at market close
        PnL += shares * this_dat['Close'].values[-1]
    summary_table.loc[d]['PnL'] = PnL
    # find the win ratio
    ind_in = np.where(np.diff(signals) == 1)[0] + 1
    ind_out = np.where(np.diff(signals) == -1)[0] + 1
    num_win = np.sum((this_dat['Close'].values[ind_out] - this_dat['Close'].values[ind_in]) > 0)
    if summary_table.loc[d]['Num. Trade'] != 0:
        summary_table.loc[d]['Win. Ratio'] = 1. * num_win / summary_table.loc[d]['Num. Trade']
This is my function for calculating the Simple Moving Average. I was told to try to adapt it for RSI or for the EMA (Exponential Moving Average). Apparently adapting it for the EMA isn't too much trouble, but I can't figure it out.
def SMA(p, window=10, signal_type='buy only'):
    # inputs: price "p", look-back window "window",
    # signal_type = 'buy only' (default) gives long signals, 'sell only' gives sell signals,
    # 'both' gives both long and short signals
    # returns a list of signals: 1 for a long position and -1 for a short position
    signals = np.zeros(len(p))
    if len(p) < window:
        # no signal if there is not enough data
        return signals
    sma = list(np.zeros(window) + np.nan)  # the first few prices do not give technical indicator values
    sma += [np.average(p[k:k+window]) for k in np.arange(len(p) - window)]
    for i in np.arange(len(p) - 1):
        if np.isnan(sma[i]):
            continue  # skip the open market time window
        if sma[i] < p[i] and (signal_type == 'buy only' or signal_type == 'both'):
            signals[i] = 1
        elif sma[i] > p[i] and (signal_type == 'sell only' or signal_type == 'both'):
            signals[i] = -1
    return signals
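On the "adapt this for EMA" part: a minimal sketch of the same signal function built on an exponential moving average instead (my adaptation, not the original code; it leans on pandas' ewm for the exponential weighting):
import numpy as np
import pandas as pd
def EMA(p, window=10, signal_type='buy only'):
    # same interface and signal logic as SMA above, but with an exponential moving average
    signals = np.zeros(len(p))
    if len(p) < window:
        # no signal if there is not enough data
        return signals
    # min_periods reproduces the NaN warm-up of the SMA version; adjust=False gives the recursive EMA
    ema = pd.Series(p).ewm(span=window, min_periods=window, adjust=False).mean().to_numpy()
    for i in range(len(p) - 1):
        if np.isnan(ema[i]):
            continue
        if ema[i] < p[i] and signal_type in ('buy only', 'both'):
            signals[i] = 1
        elif ema[i] > p[i] and signal_type in ('sell only', 'both'):
            signals[i] = -1
    return signals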
I have two solutions to this. One is to loop through each group and add the relevant value to the summary_table; the other is to calculate the whole series and assign it to the RSI column.
I first recreated the data:
import yfinance
import pandas as pd
# initially created similar data through yfinance,
# then copied this to Excel and changed the Datetime column to match yours.
df = yfinance.download("AAPL", period="60d", interval="5m")
# copied it and read it as a dataframe
df = pd.read_clipboard(sep=r'\s{2,}', engine="python")
df.head()
# Datetime Open High Low Close Adj Close Volume
#0 2022-03-03 09:30-05:00 168.470001 168.910004 167.970001 168.199905 168.199905 5374241
#1 2022-03-03 09:35-05:00 168.199997 168.289993 167.550003 168.129898 168.129898 1936734
#2 2022-03-03 09:40-05:00 168.119995 168.250000 167.740005 167.770004 167.770004 1198687
#3 2022-03-03 09:45-05:00 167.770004 168.339996 167.589996 167.718094 167.718094 2128957
#4 2022-03-03 09:50-05:00 167.729996 167.970001 167.619995 167.710007 167.710007 968410
Then I formatted the data and created the summary_table:
df["date"] = pd.to_datetime(df["Datetime"].str[:16], format="%Y-%m-%d %H:%M").dt.date
# calculate percentage change from open and close of each row
df["gain"] = (df["Close"] / df["Open"]) - 1
# your summary table, slightly changing the index to use the dates above
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio', 'RSI'] # add additional fields if necessary
summary_table = pd.DataFrame(index=df["date"].unique(), columns=cols)
Option 1:
# loop through each group, calculate the average gain and loss, then RSI
for grp, data in df.groupby("date"):
    # average gain for gain greater than 0
    average_gain = data[data["gain"] > 0]["gain"].mean()
    # average loss for gain less than 0
    average_loss = data[data["gain"] < 0]["gain"].mean()
    # add to relevant cell of summary_table
    summary_table["RSI"].loc[grp] = 100 - (100 / (1 + (average_gain / average_loss)))
Option 2:
# define a function to apply in the groupby
def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))

summary_table["RSI"] = df.groupby("date")["gain"].apply(lambda x: rsi_calc(x))
Output (same for each):
summary_table.head()
# Num. Obs. Num. Trade PnL Win. Ratio RSI
#2022-03-03 NaN NaN NaN NaN -981.214015
#2022-03-04 NaN NaN NaN NaN 501.950956
#2022-03-07 NaN NaN NaN NaN -228.379066
#2022-03-08 NaN NaN NaN NaN -2304.451654
#2022-03-09 NaN NaN NaN NaN -689.824739
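One caveat about the numbers above (my note, not from the original answer): they fall outside the usual 0-100 RSI range because average_loss is negative here, whereas the textbook RS divides by the magnitude of the average loss. If you want bounded values, take the absolute value:
def rsi_calc_bounded(series):
    # variant of rsi_calc above: RS = average gain / |average loss|
    avg_gain = series[series > 0].mean()
    avg_loss = abs(series[series < 0].mean())
    return 100 - (100 / (1 + (avg_gain / avg_loss)))
summary_table["RSI"] = df.groupby("date")["gain"].apply(rsi_calc_bounded)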
I implemented it, but it prints all of the values.
print(ta.STOCHRSI(df["close"], 14, 5, 3, 0)[-1])
2022-04-20 17:00:00 NaN
2022-04-20 18:00:00 NaN
2022-04-20 19:00:00 NaN
2022-04-20 20:00:00 NaN
2022-04-20 21:00:00 NaN
...
2022-04-28 20:00:00 79.700101
2022-04-28 21:00:00 0.000000
2022-04-28 22:00:00 0.000000
2022-04-28 23:00:00 44.877738
2022-04-29 00:00:00 65.792554
Length: 200, dtype: float64
I just want to get the most recent value of STOCHRSI as a single float. How can I get it?
And if I want the average of the most recent 3 values, how can I implement that?
If you really mean the TA-Lib library: as far as I know, the syntax there is different from yours.
From the TA-Lib documentation on the Streaming API: "An experimental Streaming API was added that allows users to compute the latest value of an indicator. This can be faster than using the Function API, for example in an application that receives streaming data, and wants to know just the most recent updated indicator value."
This works with 'SMA', but with 'STOCHRSI' the assert fails if I require a difference of less than 5.
To calculate the indicator you need a price history; you probably noticed that the first values are empty, since there is not yet enough data for the indicator's period.
You can try the following: determine how much data is needed to calculate the indicator correctly, and then feed in only an array of that length.
If resources allow, you can instead calculate all the values, store them in a variable and take only the last one, e.g. fastk[-1].
import talib
from talib import stream
sma = talib.SMA(df["close"], timeperiod=14)
latest = stream.SMA(df["close"], timeperiod=14)
assert (sma[-1] - latest) < 0.00001
print(sma[-1], latest)#1.6180066666666686 1.6180066666666668
fastk, fastd = talib.STOCHRSI(df["close"], timeperiod=14, fastk_period=5, fastd_period=3, fastd_matype=0)
f, fd = stream.STOCHRSI(df["close"], timeperiod=14, fastk_period=5, fastd_period=3, fastd_matype=0)
print(fastk[-1], f)
assert (fastk[-1] - f) < 5#64.32089013974793 59.52628987038199
print(fastk[-1], f)
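To directly answer the question as asked (my addition): once the full fastk array/Series is computed as above, the most recent value and the average of the last three are just a tail slice:
import numpy as np
latest = float(fastk[-1])                # most recent %K value; use fastk.iloc[-1] if fastk is a pandas Series with a date index
recent_avg = float(np.mean(fastk[-3:]))  # average of the three most recent values (or fastk.iloc[-3:].mean() for a Series)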
Use the condition of the %D line crossing above the %K line (from bottom to top):
if fastd[100] < fastk[100] and fastd[101] > fastk[101]:
    pass  # replace pass with your code
I also drew an indicator under the main chart to show what it looks like.
import matplotlib.pyplot as plt
import pandas as pd
import talib
date = df.iloc[:, 0].index.date
x = len(df)
fig, ax = plt.subplots(2)
ax[0].plot(date[x-100:], df.iloc[x-100:, 3])
ax[1].plot(date[x-100:], fastk[x-100:])
ax[1].plot(date[x-100:], fastd[x-100:])
fig.autofmt_xdate()
plt.show()
I also wrote some code to determine the minimum data length required for a correct calculation of the indicator.
x = len(df["C"])
fastk, fastd = talib.STOCHRSI(df["C"].values, timeperiod=14, fastk_period=5, fastd_period=3, fastd_matype=0)
fk = np.round(fastk[x - 3:], 5)
fd = np.round(fastd[x - 3:], 5)
print('fk', fk, 'fd', fd)
Output
fk [100. 32.52114 0. ] fd [43.27353 54.11391 44.17371]
Next, we find the desired length of the array.
for depth in range(10, 250, 5):
    fastk, fastd = talib.STOCHRSI(df["C"].values[x - depth:], timeperiod=14, fastk_period=5, fastd_period=3,
                                  fastd_matype=0)
    if (fk == np.round(fastk[depth - 3:], 5)).all() and (fd == np.round(fastd[depth - 3:], 5)).all():
        print('fastk[depth-3:]', fastk[depth - 3:], 'fastd[depth-3:]', fastd[depth - 3:])
        print('stop iteration required length', depth)
        break
Output
fastk[depth-3:] [100. 32.52113882 0. ] fastd[depth-3:] [43.27353345 54.11391306 44.17371294]
stop iteration required length 190
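With that number in hand, a hedged follow-up (190 is specific to this data and these parameters): feed only the most recent slice of that length and read off the last value, which avoids recomputing over the full history:
required = 190  # minimum length found by the search above; recheck it for your own data/parameters
fastk, fastd = talib.STOCHRSI(df["C"].values[-required:], timeperiod=14,
                              fastk_period=5, fastd_period=3, fastd_matype=0)
latest = fastk[-1]  # single float: the most recent STOCHRSI %K value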
I am trying to use Python pandas to compute:
the 10-day and 30-day cumulative % performance of (stock ticker RTH "minus" stock ticker SPY) after a certain performance threshold in stock ticker USO occurs (>= 10% in a 5-day window).
Here is my code:
import pandas as pd
import datetime
import pandas_datareader.data as web
from pandas import Series, DataFrame
start = datetime.datetime(2012, 4, 1)
end = datetime.datetime.now()
dfcomp = web.DataReader(['USO', 'RTH', 'SPY'],'yahoo',start=start,end=end)['Adj Close']
dfcomp_daily_returns = dfcomp.pct_change()
dfcomp_daily_returns = dfcomp_daily_returns.dropna().copy()
dfcomp_daily_returns.head()
Symbols USO RTH SPY
Date
2012-04-03 -0.009243 -0.004758 -0.004089
2012-04-04 -0.020676 -0.007411 -0.009911
2012-04-05 0.010814 0.003372 -0.000501
2012-04-09 -0.007387 -0.006961 -0.011231
2012-04-10 -0.011804 -0.018613 -0.016785
I added several more rows so it might be easier to work with, in case someone can help:
Symbols USO RTH SPY
Date
2012-04-03 -0.009243 -0.004758 -0.004089
2012-04-04 -0.020676 -0.007411 -0.009911
2012-04-05 0.010814 0.003372 -0.000501
2012-04-09 -0.007387 -0.006961 -0.011231
2012-04-10 -0.011804 -0.018612 -0.016785
2012-04-11 0.012984 0.010345 0.008095
2012-04-12 0.011023 0.010970 0.013065
2012-04-13 -0.007353 -0.004823 -0.011888
2012-04-16 0.000766 0.004362 -0.000656
2012-04-17 0.011741 0.015440 0.014812
2012-04-18 -0.014884 -0.000951 -0.003379
2012-04-19 -0.002305 -0.006183 -0.006421
2012-04-20 0.011037 0.002632 0.001670
2012-04-23 -0.009139 -0.015513 -0.008409
2012-04-24 0.003587 -0.004364 0.003802
I think this is a solution to your question. Note that I copied your code up to dropna(), and have also used import numpy as np. You don't need to use from pandas import Series, DataFrame, especially as you have already used import pandas as pd.
The main computations use rolling, apply and where.
# 5-day cumulative %
dfcomp_daily_returns["5_day_cum_%"] = dfcomp_daily_returns["USO"].rolling(5).apply(lambda x: np.prod(1+x)-1)
# RTH - SPY
dfcomp_daily_returns["RTH-SPY"] = dfcomp_daily_returns["RTH"] - dfcomp_daily_returns["SPY"]
# 10-day cumulative %
dfcomp_daily_returns["output_10"] = dfcomp_daily_returns["RTH-SPY"].rolling(10).apply(lambda x: np.prod(1+x)-1).shift(-10).where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan)
# 30-day cumulative %
dfcomp_daily_returns["output_30"] = dfcomp_daily_returns["RTH-SPY"].rolling(30).apply(lambda x: np.prod(1+x)-1).shift(-30).where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan)
I won't print the output, given that there are thousands of rows, and the occurrences of ["5_day_cum_%"] > 0.1 are irregular.
How this code works:
The 5_day_cum_% column is calculated over a rolling 5-day window as the compounded return, i.e. the product of (1 + daily return) across the window, minus 1.
RTH-SPY is column RTH "minus" column SPY.
The output columns compute the same rolling compounded return on RTH-SPY, then use .shift() to make the window forward-looking (it is not possible to use .rolling() to roll forwards; this idea came from Daniel Manso here). Finally, .where() keeps these values only where 5_day_cum_% > 0.1 (i.e. 10%), returning np.nan otherwise.
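A small toy illustration of the backward-rolling-plus-shift trick (my example, assuming a short Series of daily returns):
import numpy as np
import pandas as pd
r = pd.Series([0.01, -0.02, 0.03, 0.01, -0.01])
# rolling(3) compounds the three returns ending at each row...
back = r.rolling(3).apply(lambda x: np.prod(1 + x) - 1)
# ...and shift(-3) re-labels each result onto the row just before its window,
# i.e. the forward 3-day compounded return from that row
forward = back.shift(-3)
print(pd.DataFrame({"r": r, "backward_3d": back, "forward_3d": forward}))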
Additions from comments
From your additions in the comments, here are two options for each of those: one using pd.where again, the other using standard pandas filtering (I'm not sure whether it has an official name). In both cases, the standard filtering is shorter.
A list of all the dates:
# Option 1: pd.where
list(dfcomp_daily_returns.where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan).dropna(subset=["5_day_cum_%"]).index)
# Option 2: standard pandas filtering
list(dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1].index)
A dataframe of only those with 5-day return greater than 10%:
# Option 1: pd.where
dfcomp_daily_returns.where(dfcomp_daily_returns["5_day_cum_%"] > 0.1, np.nan).dropna(subset=["5_day_cum_%"])[["5_day_cum_%", "output_10", "output_30"]]
# Option 2: standard pandas row filtering
dfcomp_daily_returns[dfcomp_daily_returns["5_day_cum_%"] > 0.1][["5_day_cum_%", "output_10", "output_30"]]
I have two dataframes:
import numpy as np
import pandas as pd
test1 = pd.date_range(start='1/1/2018', end='1/10/2018')
test1 = pd.DataFrame(test1)
test1.rename(columns = {list(test1)[0]: 'time'}, inplace = True)
test2 = pd.date_range(start='1/5/2018', end='1/20/2018')
test2 = pd.DataFrame(test2)
test2.rename(columns = {list(test2)[0]: 'time'}, inplace = True)
Now in the first dataframe I create a column:
test1['values'] = np.zeros(10)
I want to fill this column so that next to each date there is the index of the closest date from the second dataframe. I want it to look like this:
0 2018-01-01 0
1 2018-01-02 0
2 2018-01-03 0
3 2018-01-04 0
4 2018-01-05 0
5 2018-01-06 1
6 2018-01-07 2
7 2018-01-08 3
Of course my real data is not evenly spaced and has minutes and seconds, but the idea is the same. I use the following code:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

for k in range(10):
    a = nearest(test2['time'], test1['time'][k])  # find the nearest timestamp in the second dataframe
    b = test2.index[test2['time'] == a].tolist()[0]  # identify the index of this timestamp
    test1['values'][k] = b  # assign this index to the cell
This code is very slow on large datasets; how can I make it more efficient?
P.S. timestamps in my real data are sorted and increasing just like in these artificial examples.
You could do this in one line, using numpy's argmin:
test1['values'] = test1['time'].apply(lambda t: np.argmin(np.absolute(test2['time'] - t)))
Note that applying a lambda function is essentially also a loop. Check if that satisfies your requirements performance-wise.
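If the frames are small enough, a fully vectorized variant of the same idea (my sketch, not from the answer) builds the whole |t1 - t2| matrix and takes the argmin per row; it trades memory for speed:
import numpy as np
# pairwise absolute time differences, shape (len(test1), len(test2))
diffs = np.abs(test1['time'].values[:, None] - test2['time'].values[None, :])
test1['values'] = diffs.argmin(axis=1)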
You might also be able to leverage the fact that your timestamps are sorted and the timedelta between each timestamp is constant (if I got that correctly). Calculate the offset in days and derive the index vector, e.g. as follows:
offset = (test1['time'] - test2['time']).iloc[0].days
if offset < 0:  # test1 time starts before test2 time, prepend zeros:
    offset = abs(offset)
    idx = np.append(np.zeros(offset), np.arange(len(test1['time']) - offset)).astype(int)
else:  # test1 time starts after or with test2 time, use arange right away:
    idx = np.arange(offset, offset + len(test1['time']))
test1['values'] = idx
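Since both time columns are already sorted, another option (my suggestion, not part of the original answers) is pandas.merge_asof with direction='nearest', which does a single ordered merge instead of a Python loop:
import pandas as pd
# expose test2's positional index as a column so the merge can carry it over
test2_idx = test2.reset_index().rename(columns={'index': 'nearest_idx'})
# both frames must be sorted on 'time'; direction='nearest' picks the closest timestamp on either side
merged = pd.merge_asof(test1.sort_values('time'), test2_idx, on='time', direction='nearest')
test1['values'] = merged['nearest_idx'].to_numpy()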
I am dealing with a pandas dataframe where the index is a DateTime object and the columns represent minute-by-minute returns on several stocks from the SP500 index, together with a column of returns from the index. It's fairly long (100 stocks, 1510 trading days, minute-by-minute data each day) and looks like this (only three stocks for the sake of example):
DateTime SPY AAPL AMZN T
2014-01-02 9:30 0.032 -0.01 0.164 0.007
2014-01-02 9:31 -0.012 0.02 0.001 -0.004
2014-01-02 9:32 -0.015 0.031 0.004 -0.001
I am trying to compute the betas of each stock for each day and for each 30-minute window. The beta of a stock in this case is defined as the covariance between its returns and the SPY returns divided by the variance of SPY over the same period. My desired output is a 3-dimensional numpy array beta_HF where beta_HF[s, i, j], for instance, means the beta of stock s on day i in window j. At the moment, I am computing the betas in the following way (let returns be the full dataframe):
trading_days = pd.unique(returns.index.date)
window = "30min"
moments = pd.date_range(start = "9:30", end = "16:00", freq = window).time
# requires: import numpy as np and import datetime as dt (in addition to pandas as pd)
def dispersion(trading_days, moments, df, verbose=True):
    index = 'SPY'
    beta_HF = np.zeros((df.shape[1] - 1, len(trading_days), len(moments) - 1))
    for i, day in enumerate(trading_days):
        daily_data = df[df.index.date == day]
        start_time = dt.time(9, 30)
        for j, end_time in enumerate(moments[1:]):
            moment_data = daily_data.between_time(start_time, end_time)
            covariances = np.array([moment_data[index].cov(moment_data[symbol]) for symbol in df])
            beta_HF[:, i, j] = covariances[1:] / covariances[0]
        if verbose == True:
            if np.remainder(i, 100) == 0:
                print("Current Trading Day: {}".format(day))
    return beta_HF
The dispersion() function generates the correct output. However, I understand that I am looping over long iterables and this is not very efficient. I am looking for a more efficient way to "slice" the dataframe at each 30-minute window for each day in the sample and compute the covariances. Effectively, for each slice, I need to compute 101 numbers (100 covariances + 1 variance). On my local machine (a 2013 Retina i5 MacBook Pro) it takes around 8 minutes to compute everything. I tested it on a research server at my university and the computing time was basically the same, which probably implies that computing power is not the bottleneck; my code is just inefficient in this part. I would appreciate any ideas on how to make this faster.
One might point out that parallelization is the way to go here since the elements in beta_HF never interact with each other. So this seems to be easy to parallelize. However, I have never implemented anything with parallelization so I am very new to these concepts. Any ideas on how to make the code run faster? Thanks a lot!
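As a hedged sketch of that parallelization idea (my addition, not part of the answer below; the helper name and structure are illustrative): each trading day is independent, so the outer loop of dispersion() can be mapped over worker processes with the standard library:
from concurrent.futures import ProcessPoolExecutor
import datetime as dt
import numpy as np
def betas_for_day(day):
    # body of dispersion()'s outer loop for one day; assumes returns and moments
    # are defined at module level so worker processes can see them
    daily_data = returns[returns.index.date == day]
    out = np.zeros((returns.shape[1] - 1, len(moments) - 1))
    for j, end_time in enumerate(moments[1:]):
        moment_data = daily_data.between_time(dt.time(9, 30), end_time)
        covariances = np.array([moment_data['SPY'].cov(moment_data[s]) for s in returns])
        out[:, j] = covariances[1:] / covariances[0]
    return out
if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        per_day = list(pool.map(betas_for_day, trading_days))
    beta_HF = np.stack(per_day, axis=1)  # shape: (stocks, days, windows)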
You can use pandas Grouper to group your data by frequency. The only drawbacks are that you cannot have overlapping windows and that it will iterate over times that do not exist.
The first issue basically means that the window will slide from 9:30-9:59 to 10:00-10:29 instead of from 9:30-10:00 to 10:00-10:30.
The second issue comes into play during holidays and overnight when no trading takes place. Hence, if you have a large period without trading, you might want to split the DataFrame and combine the results afterwards.
Create example data
import pandas as pd
import numpy as np
time = pd.date_range(start="2014-01-02 09:30",
end="2014-01-02 16:00", freq="min")
time = time.append( pd.date_range(start="2014-01-03 09:30",
end="2014-01-03 16:00", freq="min") )
df = pd.DataFrame(data=np.random.rand(time.shape[0], 4)-0.5,
index=time, columns=['SPY','AAPL','AMZN','T'])
define the range you want to use
freq = '30min'
obs_per_day = len(pd.date_range(start = "9:30", end = "16:00", freq = "30min"))
trading_days = len(pd.unique(df.index.date))
make a function to calculate the beta values
def beta(df):
    if df.empty:  # return nan when no trading takes place
        return np.nan
    mat = df.to_numpy()  # numpy is faster than pandas
    m = mat.mean(axis=0)
    mat = mat - m[np.newaxis, :]  # demean
    dof = mat.shape[0] - 1  # degrees of freedom
    if dof != 0:  # check that the data has more than one observation
        mat = mat.T.dot(mat[:, 0]) / dof  # covariance with the first column
        return mat[1:] / mat[0]  # beta
    else:
        return np.zeros(mat.shape[1] - 1)  # return zeros for too-short data, e.g. at 16:00
and in the end use .groupby().apply()
res = df.groupby(pd.Grouper(freq=freq)).apply(beta)
res = np.array( [k for k in res.values if ~np.isnan(k).any()] ) # remove NaN
res = res.reshape([trading_days, obs_per_day, df.shape[1]-1])
Note that the result is in a slightly different shape than yours.
The results also differ a bit because of the different window sliding. To check whether the results are the same, simply try something like this:
trading_days = pd.unique(df.index.date)
# Your result
moments1 = pd.date_range(start = "9:30", end = "10:00", freq = "30min").time
beta(df[df.index.date == trading_days[0]].between_time(moments1[0], moments1[1]))
# mine
moments2 = pd.date_range(start = "9:30", end = "10:00", freq = "29min").time
beta(df[df.index.date == trading_days[0]].between_time(moments2[0], moments2[1]))