Fast EMA calculation on large dataset with irregular time intervals - python

I have data that has 800,000+ rows. I want to take an Exponential Moving Average (EMA) of one of the columns. The times are not evenly sampled and I want to decay the EMA on each update (row). The code I have is this:
window = 5
for i in range(1, len(series)):
    dt = series['datetime'][i] - series['datetime'][i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * series['midpoint'].iloc[i]
return pandas.Series(result, index=series.index)
The problem is, for 800,000 rows, this is very slow. Is there any way to optimize this using other features of numpy? I can't vectorize it because result[i] is dependent on result[i - 1].
sample data here:
Timestamp Midpoint
1559655000001096130 2769.125
1559655000001162260 2769.127
1559655000001171688 2769.154
1559655000001408734 2769.138
1559655000001424200 2769.123
1559655000001433128 2769.110
1559655000001541560 2769.125
1559655000001640406 2769.125
1559655000001658436 2769.127
1559655000001755924 2769.129
1559655000001793266 2769.125
1559655000001878688 2769.143
1559655000002061024 2769.125

How about something like the following which takes me 0.34 seconds to run on a series of irregularly spaced data with 900k rows? I am assuming the window of 5 implies a 5 day span.
First, let's create some sample data.
import numpy as np
import pandas as pd

# Create sample data for a price stream of 2.6m price observations sampled 1 second apart.
seconds_per_day = 60 * 60 * 24 # 60 seconds / minute * 60 minutes / hour * 24 hours / day
starting_value = 100
annualized_vol = .3
sampling_percentage = .35 # 35%
start_date = '2018-12-01'
end_date = '2018-12-31'
np.random.seed(0)
idx = pd.date_range(start=start_date, end=end_date, freq='s') # One second intervals.
periodic_vol = annualized_vol * (1/ 252 / seconds_per_day) ** 0.5
daily_returns = np.random.randn(len(idx)) * periodic_vol
cumulative_indexed_return = (1 + daily_returns).cumprod() * starting_value
index_level = pd.Series(cumulative_indexed_return, index=idx)
# Sample 35% of the simulated prices to create a time series of 907k rows with irregular time intervals.
s = index_level.sample(frac=sampling_percentage).sort_index()
Now let's create a generator function to store the latest value of the exponentially weighted time series. This can run c. 4x faster by installing numba, importing it, and then adding the single decorator line @jit(nopython=True) above the function definition.
from numba import jit  # Optional, see below.

# @jit(nopython=True)  # Optional: uncomment to compile the generator with numba.
def ewma_generator(vals, decay_vals):
    result = vals[0]
    yield result
    for val, decay in zip(vals[1:], decay_vals[1:]):
        result = result * (1 - decay) + val * decay
        yield result
Now let's run this generator on the irregularly spaced series s. For this sample with 900k rows, it takes me 1.2 seconds to run the following code. I can further cut the execution time to 0.34 seconds by optionally using the just-in-time compiler from numba; you first need to install that package, e.g. conda install numba. Note that I use a list comprehension to populate the ewma values from the generator, and then I assign these values back to the original series after first converting it to a dataframe.
# Assumes time series data is now named `s`.
window = 5 # Span of 5 days?
dt = pd.Series(s.index).diff().dt.total_seconds().div(seconds_per_day) # Measured in days.
decay = (1 - (dt / -window).apply(np.exp))
g = ewma_generator(s.values, decay.values)
result = s.to_frame('midpoint').assign(
    ewma=pd.Series([next(g) for _ in range(len(s))], index=s.index))
>>> result.tail()
midpoint ewma
2018-12-30 23:59:45 103.894471 105.546004
2018-12-30 23:59:49 103.914077 105.545929
2018-12-30 23:59:50 103.901910 105.545910
2018-12-30 23:59:53 103.913476 105.545853
2018-12-31 00:00:00 103.910422 105.545720
>>> result.shape
(907200, 2)
To make sure the numbers follow our intuition, let's visualize the result taking hourly samples. This looks good to me.
obs_per_day = 24 # 24 hourly observations per day.
step = int(seconds_per_day / obs_per_day)
>>> result.iloc[::step, :].plot()

A slight improvement may be obtained by iterating on the underlying numpy arrays instead of on pandas DataFrames and Series:
result = numpy.empty(len(series))
window = 5
serdt = series['datetime'].values
sermp = series['midpoint'].values
result[0] = sermp[0]  # seed the EMA with the first observation
for i in range(1, len(series)):
    dt = serdt[i] - serdt[i - 1]
    decay = 1 - numpy.exp(-dt / window)
    result[i] = (1 - decay) * result[i - 1] + decay * sermp[i]
return pandas.Series(result, index=series.index)
With your sample data it is about 6 times faster than the original method.
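If numba is available, the same loop can also be compiled for a further speedup. A minimal sketch, assuming the datetime column can be converted to a plain float64 array in the same units as window:
from numba import jit
import numpy as np

@jit(nopython=True)
def ewma_irregular(times, values, window):
    # EMA that decays according to the irregular gap between observations.
    out = np.empty(len(values))
    out[0] = values[0]
    for i in range(1, len(values)):
        decay = 1.0 - np.exp(-(times[i] - times[i - 1]) / window)
        out[i] = (1.0 - decay) * out[i - 1] + decay * values[i]
    return out

# result = pandas.Series(
#     ewma_irregular(series['datetime'].values.astype(np.float64),
#                    series['midpoint'].values.astype(np.float64), 5.0),
#     index=series.index)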

Related

How can I perform a calculation on my CVXPY variable?

I have a convex programming problem in which I am constrained to several periods, each of which represents a different time of day in minutes. Assume we are constrained to 7 periods in the day; these periods consist of [480, 360, 120, 180, 90, 120, 90] minutes.
Update to my thoughts on this:
Can the 7-interval variable be transferred to a binary variable of length 1440? This would mean we can calculate the level as needed.
I would use these periods as the maximum for my integer variable, which can be defined as X; X is a CVXPY variable, X = cp.Variable(7). To define the problem I create constraints; the constraints I want to work with are as follows:
Target >= min target
Reservoir level >= min level
Reservoir level <= max level
I understand that in order to calculate the reservoir levels I must feed in the correct data, such as the surface area and what is expected to leave the reservoir. The problem I am struggling with is that, due to the shape of X, I feel I should be ensuring the reservoir isn't overfilling between periods. At the moment my calculation just checks at point 0, point 1 ... point 7, and this does satisfy the constraint; however, in the real world we are exceeding the max level in between these points at stages and would need to factor this in. How could we refactor the code below to account for this, given the variable running times of the pumps as set by X?
Please see below the code that we currently are working with.
# Imports
import numpy as np
import cvxpy as cp
# Initial variables required to reproduce the problem
periods_minutes = [480, 360, 120, 180, 90, 120, 90]
energy_costs = np.array([0.19, 0.22, 0.22, 0.22, 0.22, 0.22, 0.19])
flow_out = np.array([84254.39998627, 106037.09985495, 35269.19992447, 47066.40017509, 26121.59963608, 33451.20002747, 20865.5999279])
# Constant variables
pump_flow = 9.6
pump_energy = 9.29
pump_flow_per_minute = pump_flow * 60
# CVXPY section
N = len(periods_minutes)
MIN_RUN_TIME = 10
running_time = cp.Variable(N, integer=True)
mins = np.ones(N) * MIN_RUN_TIME
maxs = np.ones(N) * periods_minutes
k = cp.Variable(N, boolean=True)
# Optimization calculations
running_time_hours = running_time / 60
cost_of_running = cp.multiply(running_time_hours, energy_costs) * pump_energy
sum_of_energy = cp.sum(cost_of_running)
volume_cp = cp.sum(running_time*pump_flow_per_minute)
period_volume = running_time * pump_flow_per_minute
# Create a variable that will represent 1 if the period is running and 0 for the remainder, 1440 total
# Example: running_time[0] = 231 means 231 Trues in the variable
# test = np.zeros((1, 1440))
# for i in range(N):
#     for j in range(running_time[i]):
#         test[0][j] = 1
# Reservoir information and calculations
FACTOR = 1/160.6
flow_in = running_time * pump_flow_per_minute
flow_diff = (flow_in - flow_out) / 1000
res_level = cp.cumsum(flow_diff) * FACTOR + 2.01
# Constant constraints
min_level_constraint = res_level >= 1.8
max_level_constraint = res_level <= 2.4
volume_constraint = volume_cp >= 353065.5
# Build constraints
constraints = []
# Convert the integer variables to binary variables
# constraints += [test_cp[0] == 1]
# Append common constraints
constraints += [min_level_constraint]
constraints += [max_level_constraint]
constraints += [volume_constraint]
constraints += [running_time >= cp.multiply(k, mins)]
constraints += [running_time <= cp.multiply(k, maxs)]
# Objective definition
objective = cp.Minimize(cp.sum(sum_of_energy))
# Problem declaration
prob = cp.Problem(objective, constraints)
prob.solve(solver=cp.CPLEX, verbose=False)
# Each ith element of the array represents running time in minutes
running_time.value
Note that some variables are part of our external class:
Surface Area: 160.6m²
Min Level: 1.85m
Max Level: 2.4m
Pump Flow: 9.6l/s
Pump Energy: 9kW
At the moment our outflow data for the reservoir is in 30 minute intervals. Ideally we could develop a solution that allows for this, in the sense of an inflow matrix that accounts for the various running times over a period and the corresponding volume. For example, imagine an output for X of [231, 100, 0, 0, 30, 90, 99]. Looking at the first element, 231, I would expect something like the following in our matrix: given 480 minutes as the maximum running time for period 1, this yields 16 elements (480/30).
The expected outcome given this would be something like
[17280 17280 17280 17280 17280 17280 17280 12096 0 0 0 0 0 0 0 0]
Figures shown above are volumes: 17280 is a full 30 minute interval of running, 12096 is 21 minutes of the interval, and 0 is not running. I hope to have provided enough information to entice people into looking at this problem and look forward to answering any queries you may have. Thanks for taking the time to read through my post.
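Purely as an illustration of that expected matrix (the 231-minute running time is hypothetical; pump_flow_per_minute = 9.6 * 60 comes from the code above, and the pump is assumed to run from the start of the period without interruption), the breakdown could be computed like this:
import numpy as np

def volume_buckets(running_minutes, period_minutes, flow_per_minute, bucket=30):
    # Minutes pumped within each 30-minute bucket of the period, assuming the
    # pump starts at the beginning of the period and runs continuously.
    starts = np.arange(period_minutes // bucket) * bucket
    minutes = np.clip(running_minutes - starts, 0, bucket)
    return minutes * flow_per_minute

print(volume_buckets(231, 480, 9.6 * 60))
# [17280. 17280. 17280. 17280. 17280. 17280. 17280. 12096. 0. 0. 0. 0. 0. 0. 0. 0.]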
Problem
I assume that the pump starts running at the beginning of each time period and stops after running_time minutes, until the next time period starts. We are checking the level at the end of each period, but within a period the level may get higher while the pump is working. I hope I've understood the problem correctly.
Solution
The constraint is:
res_level(t) < 2.4
The function is piecewise smooth, the pieces being separated by time period boundaries and the event of pump shutdown within each time period.
Mathematically we know that the constraint is satisfied if the value of res_level(t) is smaller than 2.4 at all critical points—i.e. piece boundaries and interior extrema.
We also know that res_level(t) is linear in the piece intervals. So there are no interior extrema—except in case of constant function, but in that case the value is already checked at the boundaries.
So your approach of checking res_level at ends of each period is correct, except that you also need to check the level at the times of pump shutdown.
From simple mathematics:
res_level(t_shutdown) = res_level(period_start) + flow_in - (t_shutdown/t_period) * flow_out
In CVXPY this can be implemented as:
res_level_at_beginning_of_period = res_level - flow_diff * FACTOR
flow_diff_until_pump_shutdown = (flow_in - cp.multiply(flow_out, (running_time / periods_minutes))) / 1000
res_level_at_pump_shutdown = res_level_at_beginning_of_period + flow_diff_until_pump_shutdown * FACTOR
max_level_constraint_at_pump_shutdown = res_level_at_pump_shutdown <= 2.4
constraints += [max_level_constraint_at_pump_shutdown]
Running the code with this additional constraint gave me the following res_levels (levels at end of periods):
[2.0448792 1.8831538 2.09393089 1.80086488 1.96100436 1.81727335
2.0101401 ]

How to implement Ehlers Homodyne Discriminator in Pandas series

I want to include a dynamic 'lookback period' for my stock indicators for a given time period entry. I've previously implemented Ehler's Homodyne Discriminator using a rolling window; every time a new datapoint comes into my algorithm, the discriminator is recalculated (but retains memory of prior calculations...see below). I would rather determine the period using Pandas as it seems to be a faster method of implementing data processing over large datasets.
Note that I encounter data two ways: first, historical data is generated in bulk; and second, data comes in 1 minute at a time and will be added to the historical data for reprocessing.
The issues I face are:
Calculations are dependent on the at-index value of period, and period depends on the other calculations (see the original script). However, the pandas calculations are currently done in bulk, so the period never varies over time, which it should.
The dataframe includes values for multiple assets (MultiIndex) and so I currently process the discriminator once per asset; is there a way I can run this once and let Pandas do the grouping?
Should I simply reprocess the entire dataset every time new data comes in, or should I do away with the benefits of Pandas and just iterate through each new row and use my old script?
Historical Data:
close high low open volume
symbol time
SPY 2019-06-07 15:41:00 288.03 288.060 287.98 288.030 132296.0
2019-06-07 15:42:00 288.04 288.060 287.96 288.035 103635.0
2019-06-07 15:43:00 288.15 288.160 288.04 288.045 144841.0
2019-06-07 15:44:00 288.10 288.190 288.09 288.150 166086.0
2019-06-07 15:45:00 287.93 288.120 287.93 288.100 145304.0
2019-06-07 15:46:00 287.77 287.935 287.75 287.935 253202.0
2019-06-07 15:47:00 287.86 287.870 287.76 287.760 140996.0
2019-06-07 15:48:00 287.78 287.865 287.76 287.860 178082.0
2019-06-07 15:49:00 287.83 287.855 287.62 287.790 631133.0
2019-06-07 15:50:00 287.83 287.915 287.78 287.825 279326.0
Original Script (self.Value is the actual period). If you don't use QuantConnect, you could replace all RollingWindows with arrays of reversed data, or reverse the references. In this script, Update is called every time a new row is created in the dataframe:
class HomodyneDiscriminatorPeriodOld():
    Values = RollingWindow[int](2)
    SmoothedPeriod = RollingWindow[float](2)
    Smooth = RollingWindow[float](7)
    Detrend = RollingWindow[float](7)
    Source = RollingWindow[float](4)
    I1 = RollingWindow[float](7)
    I2 = RollingWindow[float](7)
    Q1 = RollingWindow[float](7)
    Q2 = RollingWindow[float](7)
    Re = RollingWindow[float](2)
    Im = RollingWindow[float](2)

    def FillWindows(self, *args, value=0):
        for window in args:
            for i in range(window.Size):
                window.Add(value)

    def __init__(self, period=1):
        self.Value = period
        self.Period = period
        # Start with history
        self.FillWindows(self.Smooth, self.SmoothedPeriod, self.Detrend, self.I1, self.I2, self.Q1, self.Q2, self.Re, self.Im)
        self.FillWindows(self.Values, value=self.Value)

    def __repr__(self):
        return "{}".format(self.Value)

    def Weighted(self, first, second, percent=0.2):
        return percent * first + (1 - percent) * second

    def Quadrature(self, window):
        C1 = 0.0962
        C2 = 0.5769
        C3 = self.Period * 0.075 + 0.54
        return (window[0] * C1 + window[2] * C2 - window[4] * C2 - window[6] * C1) * C3

    def Update(self, data):
        self.Source.Add((data.High + data.Low) / 2)
        if not self.Source.IsReady: return self.Value
        #
        # --- Start the Homodyne Discriminator Calculations
        #
        # Mutable Variables (non-series)
        self.Smooth.Add((self.Source[0] * 4.0 + self.Source[1] * 3.0 + self.Source[2] * 2.0 + self.Source[3]) / 10.0)
        self.Detrend.Add(self.Quadrature(self.Smooth))
        # Compute InPhase and Quadrature components
        self.Q1.Add(self.Quadrature(self.Detrend))
        self.I1.Add(self.Detrend[3])
        # Advance Phase of I1 and Q1 by 90 degrees
        jI = self.Quadrature(self.I1)
        jQ = self.Quadrature(self.Q1)
        # Phaser addition for 3 bar averaging and
        # Smooth i and q components before applying discriminator
        self.I2.Add(self.Weighted(self.I1[0] - jQ, self.I2[0]))
        self.Q2.Add(self.Weighted(self.Q1[0] + jI, self.Q2[0]))
        # Extract Homodyne Discriminator
        self.Re.Add(self.Weighted(self.I2[0] * self.I2[1] + self.Q2[0] * self.Q2[1], self.Re[0]))
        self.Im.Add(self.Weighted(self.I2[0] * self.Q2[1] - self.Q2[0] * self.I2[1], self.Im[0]))
        # Calculate the period
        period = ((math.pi * 2) / math.atan(self.Im[0] / self.Re[0])) if (self.Re[0] != 0 and self.Im[0] != 0) else 0
        period = min(max(max(min(period, 1.5 * self.Period), 0.6667 * self.Period), 6), 50)
        self.Period = self.Weighted(period, self.Period)
        self.SmoothedPeriod.Add(self.Weighted(self.Period, self.SmoothedPeriod[0], 0.33))
        self.Value = round(self.SmoothedPeriod[0] * 0.5 - 1)
        if self.Value < 1: self.Value = 1
        self.Values.Add(self.Value)
        return self.Value
Pandas Script. Update is currently only called once, after the bulk import of historical data. I have yet to implement a walk-forward method of calculation as raised in question 3 above, if it's even required:
class HomodyneDiscriminatorPeriod():
    def Weighted(self, series, other=None, percent=0.2):
        if other is None: other = series
        return percent * series + (1 - percent) * other

    def Quadrature(self, series):
        C1 = 0.0962
        C2 = 0.5769
        C3 = self.Frame.period * 0.075 + 0.54
        return (series * C1 + series.shift(2) * C2 - series.shift(4) * C2 - series.shift(6) * C1) * C3

    def Update(self, frame):
        # Add period column to timeframe's dataframe
        frame['period'] = 1
        # Initialize internal dataframe with same structure
        # as timeframe's dataframe but without original columns
        self.Frame = pd.DataFrame().reindex_like(frame)
        self.Frame.drop(frame.columns, axis=1)
        self.Frame['period'] = 1
        self.Frame['smoothed_period'] = 1
        self.Frame['i2'] = 0
        self.Frame['q2'] = 0
        self.Frame['re'] = 0
        self.Frame['im'] = 0
        # Shorthand references
        period = self.Frame['period']
        smoothed_period = self.Frame['smoothed_period']
        i2 = self.Frame['i2']
        q2 = self.Frame['q2']
        re = self.Frame['re']
        im = self.Frame['im']
        #
        # --- Start the Homodyne Discriminator Calculations
        #
        # Mutable Variables (non-series)
        hl2 = (frame.high + frame.low) / 2
        smooth = (hl2 * 4.0 + hl2.shift(1) * 3.0 + hl2.shift(2) * 2.0 + hl2.shift(3)) / 10.0
        detrend = self.Quadrature(smooth)
        # Compute InPhase and Quadrature components
        q1 = self.Quadrature(detrend)
        i1 = detrend.shift(3)
        # Advance Phase of I1 and Q1 by 90 degrees
        ji = self.Quadrature(i1)
        jq = self.Quadrature(q1)
        # Phaser addition for 3 bar averaging and
        # smooth i and q components before applying discriminator
        i2 = self.Weighted(i1 - jq)
        q2 = self.Weighted(q1 + ji)
        # Extract Homodyne Discriminator
        re = self.Weighted(i2 * i2.shift(1) + q2 * q2.shift(1))
        im = self.Weighted(i2 * q2.shift(1) - q2 * i2.shift(1))
        # Calculate the period
        # TODO: Use 360 or 2 * np.pi???? Official doc says 360...
        _period = (2 * np.pi / np.arctan(im / re)).clip(upper=1.5 * period, lower=0.6667 * period).clip(upper=50, lower=6)
        period = self.Weighted(_period, period)
        smoothed_period = self.Weighted(period, smoothed_period, 0.33)
        return (smoothed_period * 0.5 - 1).round().clip(lower=1)
I would think that recalculating the homodyne filter for the entire dataset each time a new bar became available would be much too expensive. Recall, most of Ehler's cycle filters are determined recursively -- and the homodyne looks back more bars than the supersmoother or high-pass filter. Given this, most trading platforms simply hold the resulting arrays in memory, and then just pick off the array elements a few bars back to calculate results for each new bar.
Note that none of the platforms go all the way back to the beginning and calculate the resulting output arrays for the entire time series when a new bar becomes available. If Pandas is that fast, then this may not be an issue. But in theory, I would not do that computationally, since it would be duplicative (unnecessary) computation. In other words, no matter how fast a platform is, why would you calculate the same array elements over and over again thousands of times within the time series, when you only need to look back about 6 bars for most Ehlers filters, and a few more for the homodyne when each new bar becomes available?
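A minimal sketch of that incremental pattern, reusing the per-bar class from the question rather than the bulk pandas version (the dictionary and callback names here are just illustrative):
# Keep one discriminator per symbol in memory and feed it only the newest bar,
# instead of recomputing the whole history on every update.
discriminators = {}

def on_new_bar(symbol, bar):
    # `bar` is any object exposing .High and .Low, as Update() expects.
    disc = discriminators.setdefault(symbol, HomodyneDiscriminatorPeriodOld(period=1))
    return disc.Update(bar)  # current dynamic lookback period for this symbol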

Getting my def function to .apply() to my stocks

I'm having trouble getting my code to work. I'm coding Python in a backtesting environment called "Quantopian". Regardless, the .apply(), Series, pd, or whatever terminology is beyond my skill level. (Assuming I'm even on the right track, lol.)
What I'm trying to accomplish:
Taking a couple of stocks and constantly calculating the MACD. Then, when the indicator meets a certain condition, the algo purchases or sells that specific stock.
What the MACD is simplistically:
A momentum indicator that looks at historical data, using 12, 26 and 9 day Exponential Moving Averages and comparing them with each other.
I've designed my own function, that's not my problem...
Help:
I'm trying to apply it to the pool of stocks in my universe to constantly calculate the MACD every minute.
Where I'm specifically confused:
I defined a MACD function but don't know how to get it to calculate every minute for whatever stocks are in my pool.
CODE:
import numpy as np
import math
import talib as ta
import pandas as pd

def initialize(context):
    set_commission(commission.PerTrade(cost=10))
    context.stocks = symbols('AAPL', 'GOOG_L')

def handle_data(context, data):
    for stock in context.stocks:
        prices_fast = data.history(context.stocks, "close", 390, "1m").resample("30min").dropna()
        prices_slow = data.history(context.stocks, "close", 390, "1m").resample("30min").dropna()
        prices_signal = data.history(context.stocks, "close", 390, "1m").resample("30min").dropna()
        curr_price = data.history(context.stocks, "price", 30, "1m").resample("30min")[-1:].dropna()
        series = pd.Series([stock]).dropna()
        macd = series.apply(MACD)
        macd_func = stock.apply(MACD)
        if macd_func[stock] > 0:
            order(stock, 1)
            print macd_func
            record(macd=macd_func[stock])

def MACD(prices_fast, prices_slow, prices_signal, curr_price):
    # Setting MACD Conditions:
    slow = 26
    fast = 12
    signal = 9
    # Calculating Averages:
    avg_fast = pd.rolling_sum(prices_fast[:fast], fast)[-1:] / fast
    avg_slow = pd.rolling_sum(prices_slow[:slow], slow)[-1:] / slow
    avg_signal = pd.rolling_sum(prices_signal[:signal], signal)[-1:] / signal
    # Calculating the Weighting Multipliers:
    A = 2 / (fast + 1)
    B = 2 / (slow + 1)
    C = 2 / (signal + 1)
    # Calculating the Exponential Moving Averages:
    EMA_fast = (curr_price * A) + [avg_fast * (1 - A)]
    EMA_slow = (curr_price * B) + [avg_slow * (1 - B)]
    EMA_signal = (curr_price * C) + [avg_signal * (1 - C)]
    # Calculating MACD Histogram:
    macd = EMA_fast - EMA_slow - EMA_signal
If someone could give me a handle, I would GREATLY appreciate it!
Thank you very VERY much,
Mike
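For reference, the 12/26/9 comparison described above boils down to three exponential moving averages. A minimal pandas sketch outside the Quantopian API (assuming close is a Series of closing prices; this is not the asker's function):
import pandas as pd

def macd(close, fast=12, slow=26, signal=9):
    # MACD line = fast EMA - slow EMA; signal line = EMA of the MACD line.
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    histogram = macd_line - signal_line
    return macd_line, signal_line, histogram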

Efficiently Running Newton Algorithm

This is related to another question I asked earlier. I want to run the Newton method on a large dataset. Below is the code I created using a loop. I need to run it on ~50 million lines and the loop is quite unwieldy. Is there a more efficient way to run it using Pandas/NumPy/etc.? Thanks in advance.
In:
from pandas import *
from pylab import *
import pandas as pd
import pylab as plt
import numpy as np
from scipy import *
import scipy
df = DataFrame(list([100,2,34.1556,9,105,-100]))
df = DataFrame.transpose(df)
df = df.rename(columns={0:'Face',1:'Freq',2:'N',3:'C',4:'Mkt_Price',5:'Yield'})
df2= df
df = concat([df, df2])
df = df.reset_index(drop=True)
df
Out:
Face Freq N C Mkt_Price Yield
0 100 2 34.1556 9 105 -100
1 100 2 34.1556 9 105 -100
In:
def Px(Rate):
    return Mkt_Price - (Face * ( 1 + Rate / Freq ) ** ( - N ) + ( C / Rate ) * ( 1 - (1 + ( Rate / Freq )) ** -N ) )

for count, row in df.iterrows():
    Face = row['Face']
    Freq = row['Freq']
    N = row['N']
    C = row['C']
    Mkt_Price = row['Mkt_Price']
    row['Yield'] = scipy.optimize.newton(Px, .1, tol=.0001, maxiter=100)
df
Out:
Face Freq N C Mkt_Price Yield
0 100 2 34.1556 9 105 0.084419
1 100 2 34.1556 9 105 0.084419
One possibility that pops into my mind is that you might do it vectorized. However, you must then throw away all conditional code and just run the required number of iterations.
The basic step in Newton-Raphson is always the same, so you do not need to have any conditional code. Your function Px looks as if it could be vectorized without any extra effort.
The steps are roughly:
def Px(Rate, Mkt_Price, Face, Freq, N, C):
    return Mkt_Price - (Face * (1 + Rate / Freq) ** (-N) + (C / Rate) * (1 - (1 + Rate / Freq) ** -N))

# initialize the iteration vector (num_rows = number of rows in your data)
y = 0.1 * np.ones(num_rows)
# just a guess for the differentiation step, might be smaller
h = 1e-6
# then iterate for a suitable number of iterations
for i in range(100):
    f = Px(y, Mkt_Price, Face, Freq, N, C)
    fp = Px(y + h, Mkt_Price, Face, Freq, N, C)
    y -= h * f / (fp - f)
After this you have the iteration results in y. I have assumed Mkt_Price, Face, etc. are 50-million-row vectors.
There will be billions of calculations, so this will still take maybe a dozen seconds. Also, there is no error checking, so if something goes wildly oscillating, there is nothing to warn you about it.
One way to make this better is to calculate the first differential analytically, as it can be done. The practical improvement may be small, though. You will have to experiment to find the best number of iterations. If the function converges fast (as I suppose), 20 iterations will be plenty.
The code is completely untested, but it should illustrate the idea.
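A sketch of that analytic-derivative variant, using the same assumed column vectors as above (the derivative below comes from differentiating Px with respect to Rate by hand and is worth checking against a finite difference before trusting it):
def dPx(Rate, Face, Freq, N, C):
    # d(Px)/d(Rate), derived from the Px expression above.
    disc = (1 + Rate / Freq) ** (-N - 1)
    return (Face * N / Freq * disc
            + C / Rate**2 * (1 - (1 + Rate / Freq) ** -N)
            - C * N / (Rate * Freq) * disc)

y = 0.1 * np.ones(num_rows)
for i in range(20):
    # Plain Newton step with the analytic derivative instead of a finite difference.
    y -= Px(y, Mkt_Price, Face, Freq, N, C) / dPx(y, Face, Freq, N, C)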

Relative Strength Index in python pandas

I am new to pandas. What is the best way to calculate the relative strength part in the RSI indicator in pandas? So far I got the following:
from pylab import *
import pandas as pd
import numpy as np

def Datapull(Stock):
    try:
        df = (pd.io.data.DataReader(Stock,'yahoo',start='01/01/2010'))
        return df
        print 'Retrieved', Stock
        time.sleep(5)
    except Exception, e:
        print 'Main Loop', str(e)

def RSIfun(price, n=14):
    delta = price['Close'].diff()
    #-----------
    dUp=
    dDown=

    RolUp=pd.rolling_mean(dUp, n)
    RolDown=pd.rolling_mean(dDown, n).abs()

    RS = RolUp / RolDown
    rsi= 100.0 - (100.0 / (1.0 + RS))
    return rsi

Stock='AAPL'
df=Datapull(Stock)
RSIfun(df)
Am I doing it correctly so far? I am having trouble with the difference part of the equation, where you separate out the upward and downward calculations.
It is important to note that there are various ways of defining the RSI. It is commonly defined in at least two ways: using a simple moving average (SMA) as above, or using an exponential moving average (EMA). Here's a code snippet that calculates various definitions of RSI and plots them for comparison. I'm discarding the first row after taking the difference, since it is always NaN by definition.
Note that when using EMA one has to be careful: since it includes a memory going back to the beginning of the data, the result depends on where you start! For this reason, typically people will add some data at the beginning, say 100 time steps, and then cut off the first 100 RSI values.
In the plot below, one can see the difference between the RSI calculated using SMA and EMA: the SMA one tends to be more sensitive. Note that the RSI based on EMA has its first finite value at the first time step (which is the second time step of the original period, due to discarding the first row), whereas the RSI based on SMA has its first finite value at the 14th time step. This is because by default rolling_mean() only returns a finite value once there are enough values to fill the window.
import datetime
from typing import Callable

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pandas_datareader.data as web

# Window length for moving average
length = 14

# Dates
start, end = '2010-01-01', '2013-01-27'

# Get data
data = web.DataReader('AAPL', 'yahoo', start, end)
# Get just the adjusted close
close = data['Adj Close']

# Define function to calculate the RSI
def calc_rsi(over: pd.Series, fn_roll: Callable) -> pd.Series:
    # Get the difference in price from previous step
    delta = over.diff()
    # Get rid of the first row, which is NaN since it did not have a previous row to calculate the differences
    delta = delta[1:]

    # Make the positive gains (up) and negative gains (down) Series
    up, down = delta.clip(lower=0), delta.clip(upper=0).abs()

    roll_up, roll_down = fn_roll(up), fn_roll(down)
    rs = roll_up / roll_down
    rsi = 100.0 - (100.0 / (1.0 + rs))

    # Avoid division-by-zero if `roll_down` is zero
    # This prevents inf and/or nan values.
    rsi[:] = np.select([roll_down == 0, roll_up == 0, True], [100, 0, rsi])
    rsi.name = 'rsi'

    # Assert range
    valid_rsi = rsi[length - 1:]
    assert ((0 <= valid_rsi) & (valid_rsi <= 100)).all()
    # Note: rsi[:length - 1] is excluded from above assertion because it is NaN for SMA.

    return rsi

# Calculate RSI using MA of choice
# Reminder: Provide ≥ `1 + length` extra data points!
rsi_ema = calc_rsi(close, lambda s: s.ewm(span=length).mean())
rsi_sma = calc_rsi(close, lambda s: s.rolling(length).mean())
rsi_rma = calc_rsi(close, lambda s: s.ewm(alpha=1 / length).mean())  # Approximates TradingView.

# Compare graphically
plt.figure(figsize=(8, 6))
rsi_ema.plot(), rsi_sma.plot(), rsi_rma.plot()
plt.legend(['RSI via EMA/EWMA', 'RSI via SMA', 'RSI via RMA/SMMA/MMA (TradingView)'])
plt.show()
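As a sketch of the warm-up trick mentioned above (the padded start date is only illustrative), one can download extra history, compute the EMA-based RSI on the padded series, and then drop the padded part:
# Start roughly 100 observations earlier so the EMA's memory has settled,
# then keep only the originally requested date range.
padded_start = '2009-08-01'  # illustrative warm-up start, before `start`
padded_close = web.DataReader('AAPL', 'yahoo', padded_start, end)['Adj Close']
rsi_ema_warm = calc_rsi(padded_close, lambda s: s.ewm(span=length).mean())
rsi_ema_warm = rsi_ema_warm[start:]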
dUp= delta[delta > 0]
dDown= delta[delta < 0]
also you need something like:
RolUp = RolUp.reindex_like(delta, method='ffill')
RolDown = RolDown.reindex_like(delta, method='ffill')
otherwise RS = RolUp / RolDown will not do what you desire
Edit: seems this is a more accurate way of RS calculation:
# dUp= delta[delta > 0]
# dDown= delta[delta < 0]
# dUp = dUp.reindex_like(delta, fill_value=0)
# dDown = dDown.reindex_like(delta, fill_value=0)
dUp, dDown = delta.copy(), delta.copy()
dUp[dUp < 0] = 0
dDown[dDown > 0] = 0
RolUp = pd.rolling_mean(dUp, n)
RolDown = pd.rolling_mean(dDown, n).abs()
RS = RolUp / RolDown
My answer is tested on StockCharts sample data.
StockChart RSI info
def RSI(series, period):
    delta = series.diff().dropna()
    u = delta * 0
    d = u.copy()
    u[delta > 0] = delta[delta > 0]
    d[delta < 0] = -delta[delta < 0]
    u[u.index[period-1]] = np.mean(u[:period])  # first value is the simple average of the first `period` gains
    u = u.drop(u.index[:(period-1)])
    d[d.index[period-1]] = np.mean(d[:period])  # first value is the simple average of the first `period` losses
    d = d.drop(d.index[:(period-1)])
    rs = pd.DataFrame.ewm(u, com=period-1, adjust=False).mean() / \
         pd.DataFrame.ewm(d, com=period-1, adjust=False).mean()
    return 100 - 100 / (1 + rs)
# sample data from StockCharts
data = pd.Series([44.34, 44.09, 44.15, 43.61,
                  44.33, 44.83, 45.10, 45.42,
                  45.84, 46.08, 45.89, 46.03,
                  45.61, 46.28, 46.28, 46.00,
                  46.03, 46.41, 46.22, 45.64])
print RSI(data, 14)
#output
14 70.464135
15 66.249619
16 66.480942
17 69.346853
18 66.294713
19 57.915021
I too had this question and was working down the rolling_apply path that Jev took. However, when I tested my results, they didn't match up against the commercial stock charting programs I use, such as StockCharts.com or thinkorswim. So I did some digging and discovered that when Welles Wilder created the RSI, he used a smoothing technique now referred to as Wilder Smoothing. The commercial services above use Wilder Smoothing rather than a simple moving average to calculate the average gains and losses.
I'm new to Python (and Pandas), so I'm wondering if there's some brilliant way to refactor out the for loop below to make it faster. Maybe someone else can comment on that possibility.
I hope you find this useful.
More info here.
def get_rsi_timeseries(prices, n=14):
    # RSI = 100 - (100 / (1 + RS))
    # where RS = (Wilder-smoothed n-period average of gains / Wilder-smoothed n-period average of -losses)
    # Note that losses above should be positive values
    # Wilder-smoothing = ((previous smoothed avg * (n-1)) + current value to average) / n
    # For the very first "previous smoothed avg" (aka the seed value), we start with a straight average.
    # Therefore, our first RSI value will be for the n+2nd period:
    #     0: first delta is nan
    #     1:
    #     ...
    #     n: lookback period for first Wilder smoothing seed value
    #     n+1: first RSI

    # First, calculate the gain or loss from one price to the next. The first value is nan so replace with 0.
    deltas = (prices - prices.shift(1)).fillna(0)

    # Calculate the straight average seed values.
    # The first delta is always zero, so we will use a slice of the first n deltas starting at 1,
    # and filter only deltas > 0 to get gains and deltas < 0 to get losses
    avg_of_gains = deltas[1:n+1][deltas > 0].sum() / n
    avg_of_losses = -deltas[1:n+1][deltas < 0].sum() / n

    # Set up pd.Series container for RSI values
    rsi_series = pd.Series(0.0, deltas.index)

    # Now calculate RSI using the Wilder smoothing method, starting with n+1 delta.
    up = lambda x: x if x > 0 else 0
    down = lambda x: -x if x < 0 else 0
    i = n + 1
    for d in deltas[n+1:]:
        avg_of_gains = ((avg_of_gains * (n-1)) + up(d)) / n
        avg_of_losses = ((avg_of_losses * (n-1)) + down(d)) / n
        if avg_of_losses != 0:
            rs = avg_of_gains / avg_of_losses
            rsi_series[i] = 100 - (100 / (1 + rs))
        else:
            rsi_series[i] = 100
        i += 1

    return rsi_series
You can use rolling_apply in combination with a subfunction to make a clean function like this:
def rsi(price, n=14):
    ''' rsi indicator '''
    gain = (price - price.shift(1)).fillna(0)  # calculate price gain with previous day, first row nan is filled with 0

    def rsiCalc(p):
        # subfunction for calculating rsi for one lookback period
        avgGain = p[p > 0].sum() / n
        avgLoss = -p[p < 0].sum() / n
        rs = avgGain / avgLoss
        return 100 - 100 / (1 + rs)

    # run for all periods with rolling_apply
    return pd.rolling_apply(gain, n, rsiCalc)
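A usage sketch, assuming a DataFrame df with a Close column:
df['RSI_14'] = rsi(df['Close'], 14)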
# Relative Strength Index
# RSI = Avg(PriceUp) / (Avg(PriceUp) + Avg(PriceDown)) * 100
# Where: PriceUp(t) = 1 * (Price(t) - Price(t-1)) {Price(t) - Price(t-1) > 0};
#        PriceDown(t) = -1 * (Price(t) - Price(t-1)) {Price(t) - Price(t-1) < 0};
# Change the formula for your own requirement
def rsi(values):
    up = values[values > 0].mean()
    down = -1 * values[values < 0].mean()
    return 100 * up / (up + down)

stock['RSI_6D'] = stock['Momentum_1D'].rolling(center=False, window=6).apply(rsi)
stock['RSI_12D'] = stock['Momentum_1D'].rolling(center=False, window=12).apply(rsi)
Momentum_1D = Pt - P(t-1) where P is closing price and t is date
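That Momentum_1D column can be precomputed with a simple diff, e.g. (assuming a Close column):
stock['Momentum_1D'] = stock['Close'].diff().fillna(0)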
You can get a massive speed-up of Bill's answer by using numba: 100 loops over a 20k-row series takes 113 seconds in plain Python versus 0.28 seconds with numba. Numba excels with loops and arithmetic.
import numpy as np
import numba as nb

@nb.jit(fastmath=True, nopython=True)
def calc_rsi(array, deltas, avg_gain, avg_loss, n):
    # Use Wilder smoothing method
    up = lambda x: x if x > 0 else 0
    down = lambda x: -x if x < 0 else 0
    i = n + 1
    for d in deltas[n+1:]:
        avg_gain = ((avg_gain * (n-1)) + up(d)) / n
        avg_loss = ((avg_loss * (n-1)) + down(d)) / n
        if avg_loss != 0:
            rs = avg_gain / avg_loss
            array[i] = 100 - (100 / (1 + rs))
        else:
            array[i] = 100
        i += 1
    return array

def get_rsi(array, n=14):
    deltas = np.append([0], np.diff(array))
    avg_gain = np.sum(deltas[1:n+1].clip(min=0)) / n
    avg_loss = -np.sum(deltas[1:n+1].clip(max=0)) / n
    array = np.empty(deltas.shape[0])
    array.fill(np.nan)
    array = calc_rsi(array, deltas, avg_gain, avg_loss, n)
    return array
rsi = get_rsi( array or series, 14 )
def rsi_Indictor(close, n_days):
    rsi_series = pd.DataFrame(close)

    # Change = close[i] - close[i-1]
    rsi_series["Change"] = (rsi_series["Close"] - rsi_series["Close"].shift(1)).fillna(0)

    # Upword Movement
    rsi_series["Upword Movement"] = (rsi_series["Change"][rsi_series["Change"] > 0])
    rsi_series["Upword Movement"] = rsi_series["Upword Movement"].fillna(0)

    # Downword Movement
    rsi_series["Downword Movement"] = (abs(rsi_series["Change"])[rsi_series["Change"] < 0]).fillna(0)
    rsi_series["Downword Movement"] = rsi_series["Downword Movement"].fillna(0)

    # Average Upword Movement
    # For the first value, the mean of the first n_days elements.
    rsi_series["Average Upword Movement"] = 0.00
    rsi_series["Average Upword Movement"][n_days] = rsi_series["Upword Movement"][1:n_days+1].mean()
    # For the second value onwards
    for i in range(n_days + 1, len(rsi_series), 1):
        #print(rsi_series["Average Upword Movement"][i-1], rsi_series["Upword Movement"][i])
        rsi_series["Average Upword Movement"][i] = (rsi_series["Average Upword Movement"][i-1] * (n_days - 1) + rsi_series["Upword Movement"][i]) / n_days

    # Average Downword Movement
    # For the first value, the mean of the first n_days elements.
    rsi_series["Average Downword Movement"] = 0.00
    rsi_series["Average Downword Movement"][n_days] = rsi_series["Downword Movement"][1:n_days+1].mean()
    # For the second value onwards
    for i in range(n_days + 1, len(rsi_series), 1):
        #print(rsi_series["Average Downword Movement"][i-1], rsi_series["Downword Movement"][i])
        rsi_series["Average Downword Movement"][i] = (rsi_series["Average Downword Movement"][i-1] * (n_days - 1) + rsi_series["Downword Movement"][i]) / n_days

    # Relative Strength
    rsi_series["Relative Strength"] = (rsi_series["Average Upword Movement"] / rsi_series["Average Downword Movement"]).fillna(0)

    # RSI
    rsi_series["RSI"] = 100 - 100 / (rsi_series["Relative Strength"] + 1)
    return rsi_series.round(2)
Just to add to the above, you can do this using the finta package as well:
ref: https://github.com/peerchemist/finta/tree/master/examples
import pandas as pd
from finta import TA
import matplotlib.pyplot as plt
ohlc = pd.read_csv("C:\\WorkSpace\\Python\\ta-lib\\intraday_5min_IBM.csv", index_col="timestamp", parse_dates=True)
ohlc['RSI']= TA.RSI(ohlc)
It is not really necessary to calculate the means: since the two are divided by each other, the counts cancel, so cumulative sums are enough and we can use Series.cumsum:
def rsi(serie, n):
    diff_serie = serie.diff()
    cumsum_incr = diff_serie.where(lambda x: x.gt(0), 0).cumsum()
    cumsum_decr = diff_serie.where(lambda x: x.lt(0), 0).abs().cumsum()
    rs_serie = cumsum_incr.div(cumsum_decr)
    rsi = rs_serie.mul(100).div(rs_serie.add(1)).fillna(0)
    return rsi
Less code here but seems to work for me:
df['Change'] = (df['Close'].shift(-1)-df['Close']).shift(1)
df['ChangeAverage'] = df['Change'].rolling(window=2).mean()
df['ChangeAverage+'] = df.apply(lambda x: x['ChangeAverage'] if x['ChangeAverage'] > 0 else 0,axis=1).rolling(window=14).mean()
df['ChangeAverage-'] = df.apply(lambda x: x['ChangeAverage'] if x['ChangeAverage'] < 0 else 0,axis=1).rolling(window=14).mean()*-1
df['RSI'] = 100-(100/(1+(df['ChangeAverage+']/df['ChangeAverage-'])))
