I am trying to simulate a pandas dataframe, using random values, with a combination of hard upper/lower values. I am using np.random.normal, as the original data is fairly normally distributed.
The code I am using to create the dataframe is:
df = pd.DataFrame({
"Temp": np.random.normal(6.809892, 2.975827,93),
"Sun": np.random.normal(1.615054,2.053996,93),
"Rel Hum": np.random.normal(87.153118,5.529958,93)
})
In the above example, I would like there to be a hard lower and upper bound for all three values. For example, Rel Hum could not go below 0 or above 100. (Edit: the three values would not all share the same bounds, either upper or lower. Temp can go negative, while Sun would be bounded at 0 and 24.)
How can I enforce these bounds while keeping a roughly normal distribution, and pass the values to the dataframe at the same time?
Edit : Note that this samples from a truncated normal for the given parameters and will most likely not be truly normally distributed, sorry for the confusion.
Use scipy truncated normal defined as :
"The standard form of this distribution is a standard normal truncated to the range [a, b]"
from scipy.stats import truncnorm
low_bound = 0
upper_bound = 100
mean = 8
std = 2
a, b = (low_bound - mean) / std, (upper_bound - mean) / std
n_samples = 1000
samples = truncnorm.rvs(a = a, b = b,
                        loc = mean, scale = std,
                        size = n_samples)
Thanks to ALollz for the corrections !
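To cover the per-column bounds from the question, the same recipe can be wrapped in a small helper and called once per column. This is only a minimal sketch: the helper name truncated_normal and the Temp bounds of -20 and 40 are assumptions for illustration (the question only says that Temp may go negative), while the means and standard deviations are the ones from the question.

```python
import numpy as np
import pandas as pd
from scipy.stats import truncnorm

def truncated_normal(mean, std, low, high, size, rng=None):
    """Draw `size` samples from a normal(mean, std) truncated to [low, high]."""
    # truncnorm expects the bounds in units of standard deviations from the mean
    a, b = (low - mean) / std, (high - mean) / std
    return truncnorm.rvs(a, b, loc=mean, scale=std, size=size, random_state=rng)

# (mean, std, lower bound, upper bound) per column
params = {
    "Temp": (6.809892, 2.975827, -20.0, 40.0),   # bounds are an assumption
    "Sun": (1.615054, 2.053996, 0.0, 24.0),
    "Rel Hum": (87.153118, 5.529958, 0.0, 100.0),
}

rng = np.random.default_rng(0)
df = pd.DataFrame({
    name: truncated_normal(mean, std, low, high, size=93, rng=rng)
    for name, (mean, std, low, high) in params.items()
})
```

Because Sun has a mean close to its lower bound, its truncated samples will average noticeably higher than the 1.615 passed in as loc.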
Try clip() function to bound the values, example:
>>> df[df['Rel Hum']>100].head()
Temp Sun Rel Hum
32 4.734005 4.102939 100.064077
>>> df['Rel Hum'] = df['Rel Hum'].clip(0, 100) # values outside the bounds are set to 0 or 100
>>> df.head()
Temp Sun Rel Hum
0 9.714943 6.255931 93.105135
1 0.551001 3.063972 85.923184
2 7.780588 3.580514 79.124139
3 3.766066 3.684801 84.543149
4 8.541507 -3.066196 83.598925
>>> df[df['Rel Hum']>100].head()
Empty DataFrame
Columns: [Temp, Sun, Rel Hum]
Index: []
Just do a clip:
df = pd.DataFrame({
"Temp": np.random.normal(6.809892, 2.975827,93),
"Sun": np.random.normal(1.615054,2.053996,93),
"Rel Hum": np.random.normal(87.153118,5.529958,93)
}).clip(0,100)
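Note that .clip(0,100) applies the same bounds to every column, whereas the question asks for different bounds per column. A minimal sketch of per-column clipping could look like the following; the bounds dictionary is an assumption for illustration, with None meaning no bound on that side.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Temp": rng.normal(6.809892, 2.975827, 93),
    "Sun": rng.normal(1.615054, 2.053996, 93),
    "Rel Hum": rng.normal(87.153118, 5.529958, 93),
})

# (lower, upper) per column; None leaves that side unbounded (assumed bounds)
bounds = {"Temp": (None, None), "Sun": (0, 24), "Rel Hum": (0, 100)}

for col, (lower, upper) in bounds.items():
    df[col] = df[col].clip(lower=lower, upper=upper)
```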
And plot:
df.plot.density(subplots=True);
gives:
You can clip, though this leaves you with a spike at the edges:
import pandas as pd
import numpy as np
N = 10**5
df = pd.DataFrame({"Rel Hum": np.random.normal(87.153118,5.529958, N)})
df['Rel Hum'].clip(lower=0, upper=100).plot(kind='hist', bins=np.arange(60,101,1))
If you want to avoid that spike, redraw the out-of-bounds points until everything is within bounds:
while not df['Rel Hum'].between(0, 100).all():
    m = ~df['Rel Hum'].between(0, 100)
    df.loc[m, 'Rel Hum'] = np.random.normal(87.153118, 5.529958, m.sum())
df['Rel Hum'].plot(kind='hist', bins=np.arange(60,101,1))
Background:
I'd like to solve a wide array of optimization problems such as asset weights in a portfolio, and parameters in trading strategies where the variables are passed to functions containing a bunch of other variables as well.
Until now, I've been able to do these things easily in Excel using the Solver Add-In. But I think it would be much more efficient and even more widely applicable using Python. For the sake of clarity, I'm going to boil the question down to the essence of portfolio optimization.
My question (short version):
Here's a dataframe and a corresponding plot with asset returns.
Dataframe 1:
A1 A2
2017-01-01 0.0075 0.0096
2017-01-02 -0.0075 -0.0033
.
.
2017-01-10 0.0027 0.0035
Plot 1 - Asset returns
Based on that, I would like to find the weights for the optimal portfolio with regards to risk / return (Sharpe ratio), represented by the green dot in the plot below (the red dot is the so-called minimum variance portfolio, and represents another optimization problem).
Plot 2 - Efficient frontier and optimal portfolios:
How can I do this with numpy or scipy?
The details:
The following code section contains the function returns() to build a dataframe with random returns for two assets, as well as a function pf_sharpe to calculate the Sharpe ratio of two given weights for a portfolio of the returns.
# imports
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
''' Function to create data sample with random returns
Parameters
==========
rows : number of rows in the dataframe
names: list of names to represent assets
Example
=======
>>> returns(rows = 2, names = ['A', 'B'])
A B
2017-01-01 0.0027 0.0075
2017-01-02 -0.0050 -0.0024
'''
listVars= names
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
df_temp = df_temp.set_index(rng)
df_temp = df_temp / 10000
return df_temp
# Sharpe ratio
def pf_sharpe(df, w1, w2):
''' Function to calculate risk / reward ratio
based on a pandas dataframe with two return series
Parameters
==========
df : pandas dataframe
w1 : portfolio weight for asset 1
w2 : portfolio weight for asset 2
'''
weights = [w1,w2]
# Calculate portfolio returns and volatility
pf_returns = (np.sum(df.mean() * weights) * 252)
pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
# Calculate sharpe ratio
pf_sharpe = pf_returns / pf_volatility
return pf_sharpe
# Make df with random returns and calculate
# sharpe ratio for a 80/20 split between assets
df_returns = returns(rows = 10, names = ['A1', 'A2'])
df_returns.plot(kind = 'bar')
sharpe = pf_sharpe(df = df_returns, w1 = 0.8, w2 = 0.2)
print(sharpe)
# Output:
# 5.09477512073
Now I'd like to find the portfolio weights that optimize the Sharpe ratio. I think you could express the optimization problem as follows:
maximize:
pf_sharpe()
by changing:
w1, w2
under the constraints:
0 < w1 < 1
0 < w2 < 1
w1 + w2 = 1
What I've tried so far:
I found a possible setup in the post Python Scipy Optimization.minimize using SLSQP showing maximized results. Below is what I have so far, and it addresses a central aspect of my question directly:
[...]where the variables are passed to functions containing a bunch of other variables as well.
As you can see, my initial challenge prevents me from even testing if my bounds and constraints will be accepted by the function optimize.minimize(). I haven't even bothered to take into consideration the fact that this is a maximization and not a minimization problem (hopefully amendable by changing the sign of the function).
Attempts:
# bounds
b = (0,1)
bnds = (b,b)
# constraints
def constraint1(w1,w2):
return w1 - w2
cons = ({'type': 'eq', 'fun':constraint1})
# initial guess
x0 = [0.5, 0.5]
# Testing the initial guess
print(pf_sharpe(df = df_returns, weights = x0))
# Optimization attempts
attempt1 = optimize.minimize(pf_sharpe(), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt2 = optimize.minimize(pf_sharpe(df = df_returns, weights), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
attempt3 = optimize.minimize(pf_sharpe(weights, df = df_returns), x0, method = 'SLSQP', bounds = bnds, constraints = cons)
Results:
Attempt1 is closest to the scipy setup here, but understandably fails because neither df nor weights have been specified.
Attempt2 fails with SyntaxError: positional argument follows keyword argument
Attempt3 fails with NameError: name 'weights' is not defined
I was under the impression that df could freely be specified, and that x0 in optimize.minimize would be considered the variables to be tested as 'representatives' for the weights in the function specified by pf_sharpe().
As you surely understand, my transition from Excel to Python in this regard has not been the easiest, and there is plenty I don't understand here. Anyway, I'm hoping some of you may offer some suggestions or clarifications!
Thank you!
Appendix 1 - Simulation approach:
This particular portfolio optimization problem can easily be solved by simulating a bunch of portfolio weights. And I did exactly that to produce the portfolio plot above. Here's the whole function if anyone is interested:
# Portfolio simulation
def portfolioSim(df, simRuns):
''' Function to take a df with asset returns,
runs a number of simulated portfolio weights,
plots return and risk for those weights,
and finds minimum risk portfolio
and max risk / return portfolio
Parameters
==========
df : pandas dataframe with returns
simRuns : number of simulations
'''
prets = []
pvols = []
pwgts = []
names = list(df_returns)
for p in range (simRuns):
# Assign random weights
weights = np.random.random(len(list(df_returns)))
weights /= np.sum(weights)
weights = np.asarray(weights)
# Calculate risk and returns with random weights
prets.append(np.sum(df_returns.mean() * weights) * 252)
pvols.append(np.sqrt(np.dot(weights.T, np.dot(df_returns.cov() * 252, weights))))
pwgts.append(weights)
prets = np.array(prets)
pvols = np.array(pvols)
pwgts = np.array(pwgts)
pshrp = prets / pvols
# Store calculations in a df
df1 = pd.DataFrame({'return':prets})
df2 = pd.DataFrame({'risk':pvols})
df3 = pd.DataFrame(pwgts)
df3.columns = names
df4 = pd.DataFrame({'sharpe':pshrp})
df_temp = pd.concat([df1, df2, df3, df4], axis = 1)
# Plot results
plt.figure(figsize=(8, 4))
plt.scatter(pvols, prets, c=prets / pvols, cmap = 'viridis', marker='o')
# Min risk
min_vol_port = df_temp.iloc[df_temp['risk'].idxmin()]
plt.plot([min_vol_port['risk']], [min_vol_port['return']], marker='o', markersize=12, color="red")
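# Max sharpe
max_sharpe_port = df_temp.iloc[df_temp['sharpe'].idxmax()]
plt.plot([max_sharpe_port['risk']], [max_sharpe_port['return']], marker='o', markersize=12, color="green")
# Return the max-Sharpe portfolio so the caller can inspect it (see sr_opt in the end notes)
return max_sharpe_port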
# Test run
portfolioSim(df = df_returns, simRuns = 250)
Appendix 2 - Excel Solver approach:
Here is how I would approach the problem using Excel Solver. Instead of linking to a file, I've only attached a screenshot and included the most important formulas in a code section. I'm guessing not many of you are going to be interested in reproducing this anyway, but I've included it just to show that it can be done quite easily in Excel.
Grey ranges represent formulas. Ranges that can be changed and used as arguments in the optimization problem are highlighted in yellow. The green range is the objective function.
Here's an image of the worksheet and Solver setup:
Excel formulas:
C3 =AVERAGE(C7:C16)
C4 =AVERAGE(D7:D16)
H4 =COVARIANCE.P(C7:C16;D7:D16)
G5 =COVARIANCE.P(C7:C16;D7:D16)
G10 =G8+G9
G13 =MMULT(TRANSPOSE(G8:G9);C3:C4)
G14 =SQRT(MMULT(TRANSPOSE(G8:G9);MMULT(G4:H5;G8:G9)))
H13 =G12/G13
H14 =G13*252
G16 =G13/G14
H16 =H13/H14
End notes:
As you can see from the screenshot, Excel Solver suggests a 47% / 53% split between A1 and A2 to obtain an optimal Sharpe ratio of 5.6. Running the Python function sr_opt = portfolioSim(df = df_returns, simRuns = 25000) yields a Sharpe ratio of 5.3 with corresponding weights of roughly 47% and 53% for A1 and A2:
print(sr_opt)
#Output
#return 0.361439
#risk 0.067851
#A1 0.465550
#A2 0.534450
#sharpe 5.326933
The method applied in Excel is GRG Nonlinear. I understand that changing the SLSQP argument to a non-linear method would get me somewhere, and I've looked into nonlinear solvers in scipy as well, but with little success.
And maybe scipy isn't even the best option here?
A more detailed answer. The first part of your code remains the same:
import pandas as pd
import numpy as np
from scipy.optimize import minimize
import matplotlib.pyplot as plt
np.random.seed(1234)
# Reproducible data sample
def returns(rows, names):
''' Function to create data sample with random returns
Parameters
==========
rows : number of rows in the dataframe
names: list of names to represent assets
Example
=======
>>> returns(rows = 2, names = ['A', 'B'])
A B
2017-01-01 0.0027 0.0075
2017-01-02 -0.0050 -0.0024
'''
listVars= names
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_temp = pd.DataFrame(np.random.randint(-100,100,size=(rows, len(listVars))), columns=listVars)
df_temp = df_temp.set_index(rng)
df_temp = df_temp / 10000
return df_temp
The function pf_sharpe is modified: its first input is now the parameter to be optimised, passed as a one-element array holding the weight of the first asset. Instead of passing the constraint w1 + w2 = 1, we define w2 as 1 - w1 inside pf_sharpe, which is equivalent but simpler and faster. Also, minimize will attempt to minimize pf_sharpe while you actually want to maximize it, so the output of pf_sharpe is now multiplied by -1.
# Sharpe ratio
def pf_sharpe(weight, df):
''' Function to calculate risk / reward ratio
based on a pandas dataframe with two return series
'''
weights = [weight[0], 1-weight[0]]
# Calculate portfolio returns and volatility
pf_returns = (np.sum(df.mean() * weights) * 252)
pf_volatility = (np.sqrt(np.dot(np.asarray(weights).T, np.dot(df.cov() * 252, weights))))
# Calculate sharpe ratio
pf_sharpe = pf_returns / pf_volatility
return -pf_sharpe
# initial guess
x0 = [0.5]
df_returns = returns(rows = 10, names = ['A1', 'A2'])
# Optimization attempts
out = minimize(pf_sharpe, x0, method='SLSQP', bounds=[(0, 1)], args=(df_returns,))
optimal_weights = [out.x, 1-out.x]
print(optimal_weights)
print(-pf_sharpe(out.x, df_returns))
This returns an optimized Sharpe Ratio of 6.16 (better than 5.3) for w1 practically one and w2 practically 0
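For more than two assets the substitution w2 = 1 - w1 no longer applies, and the budget constraint can instead be handed to minimize directly. The following is a minimal sketch of that variant, not part of the answer above: the function names neg_sharpe and max_sharpe_weights, the sample data df_demo and the zero risk-free rate are all assumptions for illustration. Extra inputs such as the mean returns and covariance matrix reach the objective through args, which is the mechanism the question asks about.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

def neg_sharpe(weights, mean_returns, cov_matrix, periods=252):
    """Negative annualized Sharpe ratio (risk-free rate assumed to be 0)."""
    port_return = weights @ mean_returns * periods
    port_vol = np.sqrt(weights @ (cov_matrix * periods) @ weights)
    return -port_return / port_vol

def max_sharpe_weights(df):
    """Maximize the Sharpe ratio over long-only weights that sum to 1."""
    n_assets = df.shape[1]
    mean_returns = df.mean().to_numpy()
    cov_matrix = df.cov().to_numpy()
    x0 = np.full(n_assets, 1.0 / n_assets)            # start from equal weights
    bounds = [(0.0, 1.0)] * n_assets                  # 0 <= w_i <= 1
    constraints = {"type": "eq", "fun": lambda w: np.sum(w) - 1.0}
    return minimize(neg_sharpe, x0, args=(mean_returns, cov_matrix),
                    method="SLSQP", bounds=bounds, constraints=constraints)

# Hypothetical sample data: 250 days of returns for three assets
rng = np.random.default_rng(1234)
df_demo = pd.DataFrame(rng.normal(0.001, 0.01, size=(250, 3)),
                       columns=["A1", "A2", "A3"])

result = max_sharpe_weights(df_demo)
print(result.x, -result.fun)   # optimal weights and the Sharpe ratio they achieve
```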
I am trying to implement bootstrap to estimate CI for statistics. Here is the code I have written
import numpy as np
import numpy.random as npr
import pylab
def bootstrap(data, num_samples, statistic, alpha):
"""Returns bootstrap estimate of 100.0*(1-alpha) CI for statistic."""
num_samples = len(data)
idx = npr.randint(min(data), max(data), num_samples)
samples = data[idx]
stat = np.sort(statistic(samples, 1))
return (stat[int((alpha/2.0)*num_samples)],
stat[int((1-alpha/2.0)*num_samples)])
X,Y = np.loadtxt('data/ABC.txt',
unpack =True,
delimiter =',',
skiprows = 1)
The text file contains 2 columns and I need to calculate the confidence interval for both columns.
My first thought is to convert the columns into an array and calculate the high and low 95% CI. I was thinking of something like this:
data = np.array([X,Y])
low, high = bootstrap(X, len(data), np.mean, 0.05)
low1, high1 = bootstrap(Y, len(data), np.mean, 0.05)
But I am not sure if this is the correct way of calculating a confidence interval. Can someone help me with this?
Thank you in advance!
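Instead of drawing indices between the smallest and largest data values:
idx = npr.randint(min(data), max(data), num_samples)
draw the resample directly. Note that this returns resampled values rather than indices, so it replaces both the idx and the samples lines:
samples = np.random.choice(data, size=len(data), replace=True)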
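A single resample is not enough for a confidence interval: the statistic has to be recomputed over many resamples and the interval read off the percentiles of those values. Below is a minimal sketch of a percentile bootstrap under that reading of the question; the function name bootstrap_ci and the generated sample X are assumptions for illustration, not the data from the text file.

```python
import numpy as np

def bootstrap_ci(data, statistic=np.mean, num_samples=10000, alpha=0.05, seed=None):
    """Percentile bootstrap 100*(1-alpha)% CI for `statistic` of 1-D `data`."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    # Each row holds the indices of one resample, drawn with replacement
    idx = rng.integers(0, n, size=(num_samples, n))
    # statistic must accept an `axis` argument (np.mean, np.median, np.std, ...)
    stats = statistic(data[idx], axis=1)
    low, high = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return low, high

# Hypothetical data standing in for one column of the text file
rng = np.random.default_rng(0)
X = rng.normal(10.0, 2.0, size=200)
low, high = bootstrap_ci(X, np.mean, num_samples=5000, alpha=0.05, seed=1)
```

For the two columns in the question, the function would simply be called once with X and once with Y.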
I have many (4000+) CSVs of stock data (Date, Open, High, Low, Close) which I import into individual Pandas dataframes to perform analysis. I am new to Python and want to calculate a rolling 12-month beta for each stock. I found a post on calculating rolling beta (Python pandas calculate rolling stock beta using rolling apply to groupby object in vectorized fashion), however when used in my code below it takes over 2.5 hours! Considering I can run the exact same calculations in SQL tables in under 3 minutes, this is too slow.
How can I improve the performance of my below code to match that of SQL? I understand Pandas/python has that capability. My current method loops over each row which I know slows performance but I am unaware of any aggregate way to perform a rolling window beta calculation on a dataframe.
Note: the first 2 steps of loading the CSVs into individual dataframes and calculating daily returns only takes ~20seconds. All my CSV dataframes are stored in the dictionary called 'FilesLoaded' with names such as 'XAO'.
Your help would be much appreciated!
Thank you :)
import pandas as pd, numpy as np
import datetime
import ntpath
pd.set_option('display.precision',10) #Set the Decimal Point precision to DISPLAY
start_time=datetime.datetime.now()
MarketIndex = 'XAO'
period = 250
MinBetaPeriod = period
# ***********************************************************************************************
# CALC RETURNS
# ***********************************************************************************************
for File in FilesLoaded:
FilesLoaded[File]['Return'] = FilesLoaded[File]['Close'].pct_change()
# ***********************************************************************************************
# CALC BETA
# ***********************************************************************************************
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
#Build Custom "Rolling_Apply" function
def rolling_apply(df, period, func, min_periods=None):
if min_periods is None:
min_periods = period
result = pd.Series(np.nan, index=df.index)
for i in range(1, len(df)+1):
sub_df = df.iloc[max(i-period, 0):i,:]
if len(sub_df) >= min_periods:
idx = sub_df.index[-1]
result[idx] = func(sub_df)
return result
#Create empty BETA dataframe with same index as RETURNS dataframe
df_join = pd.DataFrame(index=FilesLoaded[MarketIndex].index)
df_join['market'] = FilesLoaded[MarketIndex]['Return']
df_join['stock'] = np.nan
for File in FilesLoaded:
df_join['stock'].update(FilesLoaded[File]['Return'])
df_join = df_join.replace(np.inf, np.nan) #get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.replace(-np.inf, np.nan)#get rid of infinite values "inf" (SQL won't take "Inf")
df_join = df_join.fillna(0) #get rid of the NaNs in the return data
FilesLoaded[File]['Beta'] = rolling_apply(df_join[['market','stock']], period, calc_beta, min_periods = MinBetaPeriod)
# ***********************************************************************************************
# CLEAN-UP
# ***********************************************************************************************
print('Run-time: {0}'.format(datetime.datetime.now() - start_time))
Generate Random Stock Data
20 Years of Monthly Data for 4,000 Stocks
dates = pd.date_range('1995-12-31', periods=480, freq='M', name='Date')
stoks = pd.Index(['s{:04d}'.format(i) for i in range(4000)])
df = pd.DataFrame(np.random.rand(480, 4000), dates, stoks)
df.iloc[:5, :5]
Roll Function
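Returns a groupby object ready for applying custom functions. Note that pd.Panel, used below, was removed in pandas 0.25, so this version only runs on older pandas releases; the generator-based roll function further down does not depend on it.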
See Source
def roll(df, w):
# stack df.values w-times shifted once at each stack
roll_array = np.dstack([df.values[i:i+w, :] for i in range(len(df.index) - w + 1)]).T
# roll_array is now a 3-D array and can be read into
# a pandas panel object
panel = pd.Panel(roll_array,
items=df.index[w-1:],
major_axis=df.columns,
minor_axis=pd.Index(range(w), name='roll'))
# convert to dataframe and pivot + groupby
# is now ready for any action normally performed
# on a groupby object
return panel.to_frame().unstack().T.groupby(level=0)
Beta Function
Use closed form solution of OLS regression
Assume column 0 is market
See Source
def beta(df):
# first column is the market
X = df.values[:, [0]]
# prepend a column of ones for the intercept
X = np.concatenate([np.ones_like(X), X], axis=1)
# matrix algebra
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values[:, 1:])
return pd.Series(b[1], df.columns[1:], name='Beta')
Demonstration
rdf = roll(df, 12)
betas = rdf.apply(beta)
Timing
Validation
Compare calculations with OP
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
print(calc_beta(df.iloc[:12, :2]))
-0.311757542437
print(beta(df.iloc[:12, :2]))
s0001 -0.311758
Name: Beta, dtype: float64
Note the first cell
Is the same value as validated calculations above
betas = rdf.apply(beta)
betas.iloc[:5, :5]
Response to comment
Full working example with simulated multiple dataframes
num_sec_dfs = 4000
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(480, 4), dates, cols) for i in range(num_sec_dfs)}
market = pd.Series(np.random.rand(480), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(axis=1)
betas = roll(df.pct_change().dropna(), 12).apply(beta)
for c, col in betas.items():
    dfs[c]['Beta'] = col
dfs['s0001'].head(20)
Using a generator to improve memory efficiency
Simulated data
m, n = 480, 10000
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
stocks = pd.Index(['s{:04d}'.format(i) for i in range(n)])
df = pd.DataFrame(np.random.rand(m, n), dates, stocks)
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([df, market], axis=1)
Beta Calculation
def beta(df, market=None):
# If the market values are not passed,
# I'll assume they are located in a column
# named 'Market'. If not, this will fail.
if market is None:
market = df['Market']
df = df.drop('Market', axis=1)
X = market.values.reshape(-1, 1)
X = np.concatenate([np.ones_like(X), X], axis=1)
b = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(df.values)
return pd.Series(b[1], df.columns, name=df.index[-1])
roll function
This returns a generator and will be far more memory efficient
def roll(df, w):
for i in range(df.shape[0] - w + 1):
yield pd.DataFrame(df.values[i:i+w, :], df.index[i:i+w], df.columns)
Putting it all together
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
Validation
OP beta calc
def calc_beta(df):
np_array = df.values
m = np_array[:,0] # market returns are column zero from numpy array
s = np_array[:,1] # stock returns are column one from numpy array
covariance = np.cov(s,m) # Calculate covariance between stock and market
beta = covariance[0,1]/covariance[1,1]
return beta
Experiment setup
m, n = 12, 2
dates = pd.date_range('1995-12-31', periods=m, freq='M', name='Date')
cols = ['Open', 'High', 'Low', 'Close']
dfs = {'s{:04d}'.format(i): pd.DataFrame(np.random.rand(m, 4), dates, cols) for i in range(n)}
market = pd.Series(np.random.rand(m), dates, name='Market')
df = pd.concat([market] + [dfs[k].Close.rename(k) for k in dfs.keys()], axis=1).sort_index(axis=1)
betas = pd.concat([beta(sdf) for sdf in roll(df.pct_change().dropna(), 12)], axis=1).T
for c, col in betas.items():
    dfs[c]['Beta'] = col
dfs['s0000'].head(20)
calc_beta(df[['Market', 's0000']])
0.0020118230147777435
NOTE:
The calculations are the same
While efficient subdivision of the input data set into rolling windows is important to the optimization of the overall calculations, the performance of the beta calculation itself can also be significantly improved.
The following optimizes only the subdivision of the data set into rolling windows:
import numpy
from pandas import DataFrame

def numpy_betas(x_name, window, returns_data, intercept=True):
    if intercept:
        ones = numpy.ones(window)

    def lstsq_beta(window_data):
        # regress every column on the x column (plus an intercept column if requested)
        x_data = numpy.vstack([window_data[x_name], ones]).T if intercept else window_data[[x_name]]
        beta_arr, residuals, rank, s = numpy.linalg.lstsq(x_data, window_data, rcond=None)
        return beta_arr[0]

    indices = [int(x) for x in numpy.arange(0, returns_data.shape[0] - window + 1, 1)]
    return DataFrame(
        data=[lstsq_beta(returns_data.iloc[i:(i + window)]) for i in indices]
        , columns=list(returns_data.columns)
        , index=returns_data.index[window - 1::1]
    )
The following also optimizes the beta calculation itself:
def custom_betas(x_name, window, returns_data):
    window_inv = 1.0 / window
    x_sum = returns_data[x_name].rolling(window, min_periods=window).sum()
    y_sum = returns_data.rolling(window, min_periods=window).sum()
    xy_sum = returns_data.mul(returns_data[x_name], axis=0).rolling(window, min_periods=window).sum()
    xx_sum = numpy.square(returns_data[x_name]).rolling(window, min_periods=window).sum()
    xy_cov = xy_sum - window_inv * y_sum.mul(x_sum, axis=0)
    x_var = xx_sum - window_inv * numpy.square(x_sum)
    betas = xy_cov.divide(x_var, axis=0)[window - 1:]
    betas.columns.name = None
    return betas
Comparing the performance of the two different calculations, you can see that as the window used in the beta calculation increases, the second method dramatically outperforms the first:
Compared with @piRSquared's implementation, the custom method takes roughly 350 ms to evaluate, versus over 2 seconds.
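The rolling-sum approach rests on the identities n*cov(x, y) = sum(xy) - sum(x)*sum(y)/n and n*var(x) = sum(x^2) - sum(x)^2/n, where the factor n cancels in the ratio. As a sanity check, here is a minimal sketch that restates the same calculation and compares one window against a direct covariance/variance computation. The function name rolling_sum_betas and the generated returns are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def rolling_sum_betas(returns, x_name, window):
    """Rolling beta of every column against `x_name`, built from rolling sums."""
    x = returns[x_name]
    x_sum = x.rolling(window).sum()
    y_sum = returns.rolling(window).sum()
    xy_sum = returns.mul(x, axis=0).rolling(window).sum()
    xx_sum = (x ** 2).rolling(window).sum()
    # Scaled covariance and variance; the common factor cancels in the ratio
    xy_cov = xy_sum - y_sum.mul(x_sum, axis=0) / window
    x_var = xx_sum - x_sum ** 2 / window
    return xy_cov.div(x_var, axis=0).iloc[window - 1:]

# Hypothetical returns: one market column and two stocks
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0, 0.01, size=(60, 3)),
                       columns=["Market", "s0", "s1"])
betas = rolling_sum_betas(returns, "Market", window=12)

# Direct calculation for the last window, for comparison
last = returns.iloc[-12:]
direct = np.cov(last["s0"], last["Market"])[0, 1] / last["Market"].var()
```

The last row of betas["s0"] should agree with direct up to floating-point error, and the Market column should be 1 throughout.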
This further optimizes @piRSquared's implementation for both speed and memory. The code is also simplified for clarity.
from numpy import nan, ndarray, ones_like, vstack, random
from numpy.lib.stride_tricks import as_strided
from numpy.linalg import pinv
from pandas import DataFrame, date_range
def calc_beta(s: ndarray, m: ndarray):
x = vstack((ones_like(m), m))
b = pinv(x.dot(x.T)).dot(x).dot(s)
return b[1]
def rolling_calc_beta(s_df: DataFrame, m_df: DataFrame, period: int):
result = ndarray(shape=s_df.shape, dtype=float)
l, w = s_df.shape
ls, ws = s_df.values.strides
result[0:period - 1, :] = nan
s_arr = as_strided(s_df.values, shape=(l - period + 1, period, w), strides=(ls, ls, ws))
m_arr = as_strided(m_df.values, shape=(l - period + 1, period), strides=(ls, ls))
for row in range(period, l):
result[row, :] = calc_beta(s_arr[row - period, :], m_arr[row - period])
return DataFrame(data=result, index=s_df.index, columns=s_df.columns)
if __name__ == '__main__':
num_sec_dfs, num_periods = 4000, 480
dates = date_range('1995-12-31', periods=num_periods, freq='M', name='Date')
stocks = DataFrame(data=random.rand(num_periods, num_sec_dfs), index=dates,
columns=['s{:04d}'.format(i) for i in
range(num_sec_dfs)]).pct_change()
market = DataFrame(data=random.rand(num_periods), index=dates, columns=
['Market']).pct_change()
betas = rolling_calc_beta(stocks, market, 12)
%timeit betas = rolling_calc_beta(stocks, market, 12)
335 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
HERE'S THE SIMPLEST AND FASTEST SOLUTION
The accepted answer was too slow for what I needed, and I didn't understand the math behind the solutions asserted as faster. They also gave different answers, though in fairness I probably just messed it up.
I don't think you need a custom rolling function to calculate beta with pandas 1.1.4 (or even since at least 0.19). The code below assumes the data is in the same format as in the answers above: a pandas dataframe with a date index, percent returns of some periodicity for the stocks, and market values located in a column named 'Market'.
If you don't have this format, I recommend joining the stock returns to the market returns to ensure the same index with:
# Use .pct_change() only if joining Close data
beta_data = stock_data.join(market_data, how = 'inner').pct_change().dropna()
After that, it's just covariance divided by variance.
ticker_covariance = beta_data.rolling(window).cov()
# Limit results to the stock (i.e. column name for the stock) vs. 'Market' covariance
ticker_covariance = ticker_covariance.loc[pd.IndexSlice[:, stock], 'Market'].dropna()
benchmark_variance = beta_data['Market'].rolling(window).var().dropna()
beta = ticker_covariance / benchmark_variance
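An alternative that sidesteps the multi-index slicing is to pass the market series to the rolling covariance directly, one stock column at a time. This is a minimal sketch rather than the method above: the function name rolling_beta and the generated beta_data are assumptions for illustration, and it relies only on Series.rolling(...).cov(other) and Series.rolling(...).var().

```python
import numpy as np
import pandas as pd

def rolling_beta(beta_data, window, market_col="Market"):
    """Rolling beta of every non-market column against the market column."""
    market = beta_data[market_col]
    market_var = market.rolling(window).var()
    stocks = beta_data.drop(columns=market_col)
    # Series.rolling(...).cov(other) gives the rolling covariance of two series
    covs = stocks.apply(lambda col: col.rolling(window).cov(market))
    return covs.div(market_var, axis=0).dropna()

# Hypothetical returns for two stocks and a market index
rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=100, freq="D")
beta_data = pd.DataFrame(rng.normal(0, 0.01, size=(100, 3)),
                         index=dates, columns=["s0", "s1", "Market"])
betas = rolling_beta(beta_data, window=20)
```

The result keeps the plain date index and has one column per stock, so it handles several stocks without further changes.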
NOTES: If you have a multi-index, you'll have to drop the non-date levels to use the rolling().apply() solution. I only tested this for one stock and one market. If you have multiple stocks, a modification to the ticker_covariance equation after .loc is probably needed. Lastly, if you want to calculate beta values for the periods before the full window is available (e.g. stock_data begins 1 year ago, but you use 3 years of data), you can change the above to an expanding (instead of rolling) window with the same calculation and then .combine_first() the two.
Created a simple python package finance-calculator based on numpy and pandas to calculate financial ratios including beta. I am using the simple formula (as per investopedia):
beta = covariance(returns, benchmark returns) / variance(benchmark returns)
Covariance and variance are directly calculated in pandas which makes it fast. Using the api in the package is also simple:
import finance_calculator as fc
beta = fc.get_beta(scheme_data, benchmark_data, tail=False)
which will give you a dataframe of date and beta or the last beta value if tail is true.
These approaches become a bottleneck when you need beta calculations across m dates for n stocks, which results in m x n calculations.
Some relief can be had by running each date or stock on multiple cores, but then you end up needing a lot of hardware.
Most of the time in the available solutions goes into computing the variance and covariance. NaN values must also be removed from both the index and the stock data for a correct calculation (as of pandas==0.23.0).
Recomputing everything on each run is therefore wasteful unless the calculations are cached.
The numpy variance and covariance functions also miscalculate beta if NaNs are not dropped.
A Cython implementation is a must for very large data sets.
So I need to calculate the joint probability distribution for N variables. I have code for two variables, but I am having trouble generalizing it to higher dimensions. I imagine there is some sort of pythonic vectorization that could be helpful, but, right now my code is very C like (and yes I know that is not the right way to write Python). My 2D code is below:
import numpy
import math
feature1 = numpy.array([1.1,2.2,3.0,1.2,5.4,3.4,2.2,6.8,4.5,5.6,1.9,2.8,3.7,4.4,7.3,8.3,8.1,7.0,8.0,6.8,6.2,4.9,5.7,6.3,3.7,2.4,4.5,8.5,9.5,9.9]);
feature2 = numpy.array([11.1,12.8,13.0,11.6,15.2,13.8,11.1,17.8,12.5,15.2,11.6,20.8,14.7,14.4,15.3,18.3,11.4,17.0,16.0,16.8,12.2,14.9,15.7,16.3,13.7,12.4,14.2,18.5,19.8,19.0]);
#===Concatenate All Features===#
numFrames = len(feature1);
allFeatures = numpy.zeros((2,numFrames));
allFeatures[0,:] = feature1;
allFeatures[1,:] = feature2;
#===Create the Array to hold all the Bins===#
numBins = int(0.25*numFrames);
allBins = numpy.zeros((allFeatures.shape[0],numBins+1));
#===Find the maximum and minimum of each feature===#
allRanges = numpy.zeros((allFeatures.shape[0],2));
for f in range(allFeatures.shape[0]):
allRanges[f,0] = numpy.amin(allFeatures[f,:]);
allRanges[f,1] = numpy.amax(allFeatures[f,:]);
#===Create the Array to hold all the individual feature probabilities===#
allIndividualProbs = numpy.zeros((allFeatures.shape[0],numBins));
#===Grab all the Individual Probs and the Bins===#
for f in range(allFeatures.shape[0]):
freqhist, binedges = numpy.histogram(allFeatures[f,:],bins=numBins,range=[allRanges[f,0],allRanges[f,1]],density=False);
allBins[f,:] = binedges;
allIndividualProbs[f,:] = freqhist;
#===Create the joint probability array===#
jointProbs = numpy.zeros((numBins,numBins));
#===Compute the joint probability distribution===#
numElements = 0;
for b1 in range(numBins):
    for b2 in range(numBins):
        for f1 in range(numFrames):
            for f2 in range(numFrames):
                if ( ( (feature1[f1] >= allBins[0,b1]) and (feature1[f1] <= allBins[0,b1+1]) ) and ((feature2[f2] >= allBins[1,b2]) and (feature2[f2] <= allBins[1,b2+1])) ):
                    jointProbs[b1,b2] += 1;
                    numElements += 1;
jointProbs /= numElements;
#===But what if I add the following===#
feature3 = numpy.array([21.1,21.8,23.5,27.6,25.2,23.8,22.1,22.8,26.5,25.2,28.6,20.8,24.7,24.4,29.3,28.3,27.4,26.0,26.2,26.1,25.9,24.0,22.7,22.3,23.7,26.4,24.2,28.5,29.8,29.0]);
How can I generalize the large loop? For N variables (features) this loop would be enormous. Is there a Pythonic way to do this easily?
Check out the function numpy.histogramdd. This function can compute histograms in an arbitrary number of dimensions. If you set the parameter density=True (named normed in older NumPy releases), it returns the bin count divided by the bin hypervolume. If you'd prefer something more like a probability mass function (where everything sums to 1), just normalize it yourself. All together, you'll have something like:
import numpy as np
numBins = 10 # number of bins in each dimension
data = np.random.randn(100000, 3) # generate 100000 3-d random data points
jointProbs, edges = np.histogramdd(data, bins=numBins)
jointProbs /= jointProbs.sum()
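Applied to per-feature arrays like those in the question, the features first need to be stacked into one array with a row per frame and a column per feature. The sketch below assumes that layout; the generated feature values are placeholders, not the question's data. Unlike the nested f1/f2 loops in the question, which pair every frame with every other frame, this counts each frame once using its own values of all features.

```python
import numpy as np

# Hypothetical features: three 1-D arrays of equal length, one value per frame
rng = np.random.default_rng(0)
feature1 = rng.uniform(1, 10, size=30)
feature2 = rng.uniform(11, 21, size=30)
feature3 = rng.uniform(20, 30, size=30)

# histogramdd expects shape (numFrames, numFeatures): one row per observation
allFeatures = np.column_stack([feature1, feature2, feature3])
numBins = int(0.25 * len(feature1))

counts, edges = np.histogramdd(allFeatures, bins=numBins)
jointProbs = counts / counts.sum()

# Marginal distribution of feature1: sum the joint over the other axes
marginal1 = jointProbs.sum(axis=(1, 2))
```

Adding a fourth feature only means adding another column to allFeatures; the rest stays the same, although the number of bins grows as numBins to the power of the number of features.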