Is it possible to improve the performance of the np.irr function so that it can be applied to a 2-dimensional array of cash flows without using a for-loop, either by vectorizing the np.irr function or through an alternative algorithm?
The irr function in the numpy library calculates the periodically compounded rate of return that gives a net present value of 0 for an array of cash flows. This function can only be applied to a 1-dimensional array:
x = np.array([-100,50,50,50])
r = np.irr(x)
np.irr will not work against a 2-dimensional array of cash flows, such as:
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
where each row represents a series of cash flows, and columns represent time periods. Therefore a slow implementation would be to loop over each row and apply np.irr to individual rows:
out = []
for x in cfs:
    out.append(np.irr(x))
For large arrays, this is an optimization barrier. Looking at the source code of the np.irr function, I believe the main obstacle is vectorizing the np.roots function:
def irr(values):
    res = np.roots(values[::-1])
    mask = (res.imag == 0) & (res.real > 0)
    if res.size == 0:
        return np.nan
    res = res[mask].real
    # NPV(rate) = 0 can have more than one solution so we return
    # only the solution closest to zero.
    rate = 1.0/res - 1
    rate = rate.item(np.argmin(np.abs(rate)))
    return rate
I have found a similar implementation in R: Fast loan rate calculation for a big number of loans, but don't know how to port this into Python. Also, I don't consider np.apply_along_axis or np.vectorize to be solutions to this issue since my main concern is performance, and I understand both are wrappers for a for-loop.
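For reference, the np.apply_along_axis form would look like the line below, but as far as I understand it still loops over the rows in Python, so it is not a real vectorization:

# Loops over the rows under the hood, so roughly as slow as the explicit for-loop
out = np.apply_along_axis(np.irr, 1, cfs)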
Thanks!
Looking at the source of np.roots,
import inspect
print(inspect.getsource(np.roots))
We see that it works by finding the eigenvalues of the "companion matrix". It also does some special handling of coefficients that are zero. I really don't understand the mathematical background, but I do know that np.linalg.eigvals can calculate eigenvalues for multiple matrices in a vectorized way.
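As a quick sanity check of that companion-matrix idea (my own toy example, not part of np.roots itself): the eigenvalues of the companion matrix of a monic polynomial are exactly its roots.

import numpy as np

# p(x) = x^3 - 6x^2 + 11x - 6 = (x - 1)(x - 2)(x - 3), coefficients highest power first
coeffs = np.array([1.0, -6.0, 11.0, -6.0])

# Companion matrix as np.roots builds it: negated lower-order coefficients in the
# first row, ones on the subdiagonal.
C = np.zeros((3, 3))
C[0, :] = -coeffs[1:] / coeffs[0]
C[1:, :-1] = np.eye(2)

print(np.sort(np.linalg.eigvals(C)))  # approx. [1. 2. 3.]
print(np.sort(np.roots(coeffs)))      # same roots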
Merging it with the source of np.irr has resulted in the following "Frankencode":
def irr_vec(cfs):
    # Create companion matrix for every row in `cfs`
    M, N = cfs.shape
    A = np.zeros((M, (N-1)**2))
    A[:, N-1::N] = 1
    A = A.reshape((M, N-1, N-1))
    A[:, 0, :] = cfs[:, -2::-1] / -cfs[:, -1:]  # slice [-1:] to keep dims
    # Calculate roots; `eigvals` is a gufunc
    res = np.linalg.eigvals(A)
    # Find the solution that makes the most sense...
    mask = (res.imag == 0) & (res.real > 0)
    res = np.ma.array(res.real, mask=~mask, fill_value=np.nan)
    rate = 1.0/res - 1
    idx = np.argmin(np.abs(rate), axis=1)
    irr = rate[np.arange(M), idx].filled()
    return irr
This does not handle zero coefficients and will certainly fail when any(cfs[:,-1] == 0). Some input argument checking wouldn't hurt either, and there may be other issues. But for the supplied example data it achieves what we wanted (at the cost of increased memory use):
In [487]: cfs = np.zeros((10000,4))
...: cfs[:,0] = -100
...: cfs[:,1:] = 50
In [488]: %timeit [np.irr(x) for x in cfs]
1 loops, best of 3: 2.96 s per loop
In [489]: %timeit irr_vec(cfs)
10 loops, best of 3: 77.8 ms per loop
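One possible guard for the any(cfs[:,-1] == 0) limitation mentioned above (my own rough suggestion, not part of the tested code): filter those rows out first and fill their result with NaN.

# Rows whose last cash flow is zero would cause a division by zero in irr_vec,
# so handle them separately (here: just return NaN for them).
bad = cfs[:, -1] == 0
irr = np.full(cfs.shape[0], np.nan)
irr[~bad] = irr_vec(cfs[~bad])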
If you have the special case of loans with a fixed payback amount (like in the question you linked) you may be able to do it faster using interpolation...
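A rough sketch of that interpolation idea (an assumption on my part, only valid for the equal-payment case): for a loan of principal P repaid in n equal installments A, the NPV is zero when the annuity factor equals P/A, so you can tabulate the annuity factor once on a grid of rates and look every loan up with np.interp.

import numpy as np

def irr_fixed_payment(P, A, n, rate_grid=np.linspace(1e-6, 1.0, 4001)):
    # Hypothetical helper: approximate IRR for loans with principal P and n equal payments of A.
    # Annuity factor sum_{t=1..n} (1+r)**-t as a function of the rate grid
    annuity = (1 - (1 + rate_grid) ** -n) / rate_grid
    # np.interp needs increasing x-values; the annuity factor decreases with the rate
    return np.interp(P / A, annuity[::-1], rate_grid[::-1])

# e.g. the cash flow [-100, 50, 50, 50]: principal 100, three payments of 50
print(irr_fixed_payment(np.array([100.0]), np.array([50.0]), 3))  # approx. 0.234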
After posting this question I kept working on it and came up with a vectorized solution that uses a different algorithm:
def virr(cfs, precision=0.005, rmin=0, rmax1=0.3, rmax2=0.5):
    '''
    Vectorized IRR calculator. First calculate a 3D array of the discounted
    cash flows along cash flow series, time period, and discount rate. Sum over
    time to collapse to a 2D array which gives the NPV along a range of discount
    rates for each cash flow series. Next, find the crossover where NPV is zero --
    this corresponds to the lowest real IRR value. For performance, negative IRRs
    are not calculated -- they return "-1" -- and values are only calculated to an
    acceptable precision.
    IN:
        cfs - numpy 2d array - rows are cash flow series, cols are time periods
        precision - level of accuracy for the inner IRR band, e.g. 0.005
        rmin - lower bound of the inner IRR band, e.g. 0
        rmax1 - upper bound of the inner IRR band, e.g. 0.3
        rmax2 - upper bound of the outer IRR band, e.g. 0.5. Values in the outer
                band are calculated to 0.01 precision; IRRs above the outer band
                return the rmax2 value
    OUT:
        r - numpy column array of IRRs for the cash flow series
    '''
    if cfs.ndim == 1:
        cfs = cfs.reshape(1, len(cfs))
    # Range of time periods
    years = np.arange(0, cfs.shape[1])
    # Range of the discount rates
    rates_length1 = int((rmax1 - rmin)/precision) + 1
    rates_length2 = int((rmax2 - rmax1)/0.01)
    rates = np.zeros((rates_length1 + rates_length2,))
    rates[:rates_length1] = np.linspace(rmin, rmax1, rates_length1)
    rates[rates_length1:] = np.linspace(rmax1 + 0.01, rmax2, rates_length2)
    # Discount rate multiplier; rows are years, cols are rates
    drm = (1 + rates)**-years[:, np.newaxis]
    # Calculate discounted cfs
    discounted_cfs = cfs[:, :, np.newaxis] * drm
    # Calculate NPV array by summing over discounted cash flows
    npv = discounted_cfs.sum(axis=1)
    # Find where the NPV changes sign, which implies an IRR solution
    signs = npv < 0
    # Pairwise differences in the boolean values; where the sign crosses over,
    # the pairwise diff is True
    crossovers = np.diff(signs, 1, 1)
    # Extract the IRR from the first crossover for each row
    irr = np.min(np.ma.masked_equal(rates[1:] * crossovers, 0), 1)
    # Error handling: negative IRRs are returned as "-1", IRRs greater than
    # rmax2 are returned as rmax2
    negative_irrs = cfs.sum(1) < 0
    r = np.where(negative_irrs, -1, irr)
    r = np.where(irr.mask * (negative_irrs == False), rmax2, r)
    return r
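A quick sanity check against the single-series example from the question (np.irr([-100, 50, 50, 50]) is about 0.234):

print(virr(np.array([-100., 50., 50., 50.])))  # approx. [0.235] at the default 0.005 grid step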
Performance:
import numpy as np
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
%timeit [np.irr(x) for x in cfs]
10 loops, best of 3: 1.06 s per loop
%timeit virr(cfs)
10 loops, best of 3: 29.5 ms per loop
pyxirr is super fast, and np.irr is deprecated, so I'd use this now:
https://pypi.org/project/pyxirr/
import numpy as np
import pandas as pd
import pyxirr
cfs = np.zeros((10000,4))
cfs[:,0] = -100
cfs[:,1:] = 50
df = pd.DataFrame(cfs).T
df.apply(pyxirr.irr)
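Note that df.apply still iterates over the columns in Python; the speedup comes from each pyxirr.irr call being native code. If you don't need pandas, a plain comprehension over the rows should also work (assuming, as I believe, that pyxirr.irr accepts any iterable of amounts):

# Assumed usage: pyxirr.irr takes a sequence of cash flow amounts and returns the IRR
out = [pyxirr.irr(row.tolist()) for row in cfs]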
Related
I have some experimental data over time that I need to discount in the sense that each data point is weighted depending on how far back in time it lies.
For this I have the following code:
import numpy as np
n_spots = 20
n_times = 5000
data = np.random.random((n_times, n_spots))
rate = 0.9
weight_vec = rate ** (n_times - 1 - np.arange(1, n_times, 1))
result = np.zeros((n_times, n_spots))
for k in range(n_times):
    result[k, :] = data[0:k, ].transpose().dot(weight_vec[n_times - 1 - k:n_times - 1])
The for-loop becomes very slow as n_times increases, and I wonder whether there is a way to optimize it or even eliminate it completely. There are similar cases where one can add new axes and perform the computation as a matrix product in higher dimensions, but I struggle to make this work here, where the sub-arrays of data are not equal in size.
You're adding one extra column to the current result in each step and accumulating what you've got, so np.cumsum is a good option:
cols = data[:n_times-1].transpose() * weight_vec
result = np.cumsum(cols, axis=1)
result = np.insert(result.transpose(), 0, values=0, axis=0)
The first row of result contains zeros, which is why I added the extra insertion. What is more, the calculation of weight_vec itself can be improved:
weight_vec = np.full(n_times - 1, fill_value = 0.9)
weight_vec[0] = 1
weight_vec = np.cumprod(weight_vec)[::-1]
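A quick check (my own, using the rate variable in place of the hard-coded 0.9) that the cumprod construction reproduces the original weight_vec:

import numpy as np

n_times, rate = 5000, 0.9
original = rate ** (n_times - 1 - np.arange(1, n_times, 1))

weight_vec = np.full(n_times - 1, fill_value=rate)
weight_vec[0] = 1
weight_vec = np.cumprod(weight_vec)[::-1]

assert np.allclose(original, weight_vec)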
I am doing some scientific computing and I couldn't find an elegant way of performing the following operation. Suppose I have a 2-dimensional numpy array D which stores measurements of a given quantity at several times along the day. Each row corresponds to a different measuring instrument and each column corresponds to a different moment in the day at which the measurement was done.
Consider a list of desired percentiles. For example:
quantiles = [0.25, 0.5, 0.75]
My goal is to compute the average measurement by percentile group, at each moment in the day. In other words, given a column of measurements, I would like to sort all the measurements from that column into groups respecting the quantiles above and then take averages within groups. Using the example, I would have 4 groups at each moment of the day: the measurements in the lower quartile, then the measurements between the 25th and 50th percentiles, the ones between the 50th and the 75th, and finally the ones in the upper quartile. Therefore, if m is the number of moments in the day when measurements were taken and q is the number of elements in the quantiles variable, my desired output would be a (q+1) x m numpy array.
Currently, I am doing this in the most inefficient and hard-coded way possible. Here we go:
quantiles = [0.25, 0.5, 0.75]
window = "30min"
moments = pd.date_range(start = "9:30", end = "16:00", freq = window).time
quantile_curves = np.zeros((len(quantiles)+1, len(moments)-1))
EmpQuantiles = np.quantile(D, quantiles, axis = 0)
for moment in range(len(moments)-1):
    quantile_curves[0, moment] = np.mean(D[:, moment][D[:,moment] < EmpQuantiles[0, moment]])
    quantile_curves[1, moment] = np.mean(D[:, moment][np.logical_and(D[:,moment] > EmpQuantiles[0, moment], D[:,moment] < EmpQuantiles[1, moment])])
    quantile_curves[2, moment] = np.mean(D[:, moment][np.logical_and(D[:,moment] > EmpQuantiles[1, moment], D[:,moment] < EmpQuantiles[2, moment])])
    quantile_curves[3, moment] = np.mean(D[:, moment][D[:,moment] > EmpQuantiles[2, moment]])
What's a simpler, more elegant way of doing this? I couldn't find the answer here; however, there is a related (but not identical) question in R: ddply multiple quantiles by group
I intend to plot the evolution of the in-group averages along the day. I am satisfied with the plot and I get the result I want; I am only looking for a better way of computing the quantile_curves variable.
Thanks a lot in advance!
You can do it efficiently using masked_arrays:
import numpy as np
quantiles = [0.25, 0.5, 0.75]
print('quantiles:\n', quantiles)
moments = [f'moment {i}' for i in range(5)]
print('nb of moments:\n', len(moments))
nb_measurements = 10000
D = np.random.rand(nb_measurements,len(moments))
quantile_values = np.quantile(D,quantiles,axis=0)
print('quantile_values (for each moment):\n', quantile_values)
quantile_curves = np.zeros((len(quantiles)+1,len(moments)))
quantile_curves[0, :] = np.mean(np.ma.masked_array(D, mask=D>quantile_values[[0],:]), axis=0)
for q in range(len(quantiles)-1):
    quantile_curves[q+1, :] = np.mean(np.ma.masked_array(D, mask=np.logical_or(D<quantile_values[[q],:], D>quantile_values[[q+1],:])), axis=0)
quantile_curves[len(quantiles), :] = np.mean(np.ma.masked_array(D, mask=D<quantile_values[[len(quantiles)-1],:]), axis=0)
print('mean for each group and at each moment:')
print(quantile_curves)
Output:
% python3 script.py
quantiles:
[0.25, 0.5, 0.75]
nb of moments:
5
quantile_values (for each moment):
[[0.25271343 0.25434056 0.24658732 0.24612319 0.25221014]
[0.51114344 0.50103699 0.49671249 0.49113293 0.49819521]
[0.75629377 0.75427293 0.74676209 0.74211813 0.7490436 ]]
mean for each group and at each moment:
[[0.12650993 0.12823392 0.12492136 0.12200609 0.12655318]
[0.3826476 0.373516 0.37050513 0.36974876 0.37722219]
[0.63454102 0.63023986 0.62280545 0.61696283 0.6238492 ]
[0.87866019 0.87614489 0.87492553 0.87253142 0.87403426]]
Note that I'm using random values between 0 and 1, which is why the quantile values (the extremities of the group intervals) are almost equal to the quantiles themselves. Also note that this code works for an arbitrary number of quantiles or moments.
Is there an open source function to compute a moving z-score like https://turi.com/products/create/docs/generated/graphlab.toolkits.anomaly_detection.moving_zscore.create.html? I have access to pandas rolling_std for computing the std, but want to see if it can be extended to compute rolling z-scores.
rolling.apply with a custom function is significantly slower than using builtin rolling functions (such as mean and std). Therefore, compute the rolling z-score from the rolling mean and rolling std:
def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z
According to the definition given on this page the rolling z-score depends on the rolling mean and std just prior to the current point. The shift(1) is used above to achieve this effect.
Below, even for a small Series (of length 100), zscore is over 5x faster than using rolling.apply. Since rolling.apply(zscore_func) calls zscore_func once for each rolling window in essentially a Python loop, the advantage of using the Cythonized r.mean() and r.std() functions becomes even more apparent as the size of the loop increases.
Thus, as the length of the Series increases, the speed advantage of zscore increases.
In [58]: %timeit zscore(x, N)
1000 loops, best of 3: 903 µs per loop
In [59]: %timeit zscore_using_apply(x, N)
100 loops, best of 3: 4.84 ms per loop
This is the setup used for the benchmark:
import numpy as np
import pandas as pd
np.random.seed(2017)

def zscore(x, window):
    r = x.rolling(window=window)
    m = r.mean().shift(1)
    s = r.std(ddof=0).shift(1)
    z = (x-m)/s
    return z

def zscore_using_apply(x, window):
    def zscore_func(x):
        return (x[-1] - x[:-1].mean())/x[:-1].std(ddof=0)
    return x.rolling(window=window+1).apply(zscore_func)

N = 5
x = pd.Series((np.random.random(100) - 0.5).cumsum())
result = zscore(x, N)
alt = zscore_using_apply(x, N)
assert not ((result - alt).abs() > 1e-8).any()
You should use native functions of pandas:
# Compute rolling zscore for column ="COL" and window=window
col_mean = df["COL"].rolling(window=window).mean()
col_std = df["COL"].rolling(window=window).std()
df["COL_ZSCORE"] = (df["COL"] - col_mean)/col_std
def zscore(arr, window):
    x = arr.rolling(window = 1).mean()  # rolling mean with window 1 is just the series itself
    u = arr.rolling(window = window).mean()
    o = arr.rolling(window = window).std()
    return (x-u)/o
df['zscore'] = zscore(df['value'],window)
Let us say you have a data frame called data, with one column per variable.
Then you run the following code:
data_zscore=data.apply(lambda x: (x-x.expanding().mean())/x.expanding().std())
Please note that the first row will always have NaN values as it doesn't have a standard deviation.
This can be solved in a single line of code. Given that s is the input series and wlen is the window length:
zscore = s.sub(s.rolling(wlen).mean()).div(s.rolling(wlen).std())
If you need to shift the mean and std it can still be done:
zscore = s.sub(s.rolling(wlen).mean().shift()).div(s.rolling(wlen).std().shift())
I have a time-series data that has 45000 rows and 30 columns. I want to split this data up into rolling windows of length 150 along the rows and compute some statistics for each window.
Basically, this means I first take rows 0 to 149 and compute mean followed by std for each column. I then take rows 1 to 150 and again compute mean and std for each column separately. I stack vectors of means and std for each column horizontally.
At the moment I have the following function:
def statFeaturesWindowed(data):
    dataLen = data.shape[0]
    count = 0
    winSize = 150
    ftVect = np.empty((1, 60))
    while count+winSize <= dataLen:
        if not count % 1000:
            print("Row {} of {}".format(count, dataLen-winSize))
        dataWin = data[count:count+winSize,:] # data window
        means = np.mean(dataWin, axis = 0) # mean
        stDev = np.std(dataWin, axis = 0) # standard deviation
        tempVect = np.hstack((means, stDev)).T
        ftVect = np.vstack((ftVect, tempVect))
        count += 1
    ftVect = np.delete(ftVect, 0, 0)  # drop the uninitialized first row (np.delete returns a copy, so assign it)
    return ftVect
However, this code is super slow and slows down as the output matrix becomes larger and larger. At the moment running this for 45000 lines takes roughly 5 minutes. Is there a more efficient way to do this?
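One approach worth trying (a sketch on my part, not benchmarked): the repeated np.vstack is likely the main cost, and pandas rolling aggregations compute all window means and stds at once (ddof=0 to match np.std).

import numpy as np
import pandas as pd

def statFeaturesWindowed_fast(data, winSize=150):
    df = pd.DataFrame(data)
    # Rolling mean/std produce NaN for the first winSize-1 rows; drop those so each
    # remaining row corresponds to one full window, as in the original loop.
    means = df.rolling(winSize).mean().values[winSize - 1:]
    stds = df.rolling(winSize).std(ddof=0).values[winSize - 1:]
    return np.hstack((means, stds))  # shape (n_windows, 2 * n_columns)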
I have run this simulation (given below) and got the simulated transition probabilities for dry-to-dry and wet-to-wet conditions. The simulated results for dry-to-dry are almost equal to the estimated dry-to-dry values (d2d_tran). But the simulated wet-to-wet values are substantially lower than the estimated ones. It seems there is something wrong in the program. I have tried several other approaches but haven't got the expected results. Could you please run the program and suggest how I might get improved results for the wet-to-wet probabilities? Thanks in advance.
My codes:
import numpy as np
import random, datetime

d2d = np.zeros(12)
d2w = np.zeros(12)
w2w = np.zeros(12)
w2d = np.zeros(12)
pd2d = np.zeros(12)
pw2w = np.zeros(12)
dry = [0.333] ##unconditional probability of dry for January
d2d_tran = [0.564,0.503,0.582,0.621,0.634,0.679,0.738,0.667,0.604,0.564,0.577,0.621]
w2w_tran = [0.784,0.807,0.8,0.732,0.727,0.728,0.64,0.64,0.665,0.717,0.741,0.769]
mu = [3.71,4.46,4.11,2.94,3.01,2.87,2.31,2.44,2.56,3.45,4.32,4.12]
sigma = [6.72,7.92,7.49,6.57,6.09,5.53,4.38,4.69,4.31,5.71,7.64,7.54]
days = np.array([31,28,31,30,31,30,31,31,30,31,30,31])
rain = np.array([])
for y in xrange(0,10000):
    for m in xrange(0,12):
        # Include leap years in the calculation and create random variables for each month
        if ((y%4 == 0 and y%100 != 0) or y%400 == 0) and m==1:
            random_num = np.random.rand(29)
        else:
            random_num = np.random.rand(days[m])
        # Let's generate a rainfall amount for the first day of the random series
        if random_num[0] <= dry[0]:
            random_num[0] = 0
        else:
            random_num[0] = abs(random.gauss(mu[0],sigma[0]))
        # Generate the whole series in sequence of month and year
        for i in xrange(0,days[m]):
            if random_num[i-1] == 0: #if yesterday was dry
                if random_num[i] <= d2d_tran[m]: #check today against the dry2dry transition probabilities
                    random_num[i] = 0
                    d2d[m] += 1.0
                else:
                    random_num[i] = abs(random.gauss(mu[m],sigma[m]))
                    d2w[m] += 1.0
            else:
                if random_num[i] <= w2w_tran[m]:
                    random_num[i] = abs(random.gauss(mu[m],sigma[m]))
                    w2w[m] += 1.0
                else:
                    random_num[i] = 0
                    w2d[m] += 1.0
        pd2d[m] = d2d[m]/(d2d[m] + d2w[m])
        pw2w[m] = w2w[m]/(w2d[m] + w2w[m])
print 'Simulated transition probability of dry2dry:\n', np.around(pd2d, decimals=3)
print 'Simulated transition probability of wet2wet:\n', np.around(pw2w, decimals=3)
### pd2d and pw2w of generated data should be identical to d2d_tran and w2w_tran respectively
The simulation looks correct as far as it goes, and after running it for 8000 years, I get transition probabilities within .001 most of the time, and there is convergence as the number of days increases.
Nothing guarantees that you will get the exact transition probabilities - on any single run you may get anything. What you've done is generate an estimator for each single transition probability that has mean equal to the actual value (0.345), and some positive variance. The variance of your estimator decreases with n = sample size, but it will always be positive.
If you'd like values closer to the actual transition probabilities (faster convergence), apply some well-known variance reduction techniques: Stratified Sampling, Importance Sampling, etc. - too many to mention. Here's a quick technique - take the uniform random deviates generated by np.random.rand(), and estimate as usual. Then generate another estimator using the transformed deviates: [(1-x) for x in stored_deviates]. The average of the two estimators has reduced variance (by .5).
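A minimal illustration of that antithetic-deviates idea on a single Bernoulli-style probability (my own toy example, using p = 0.728, which is one of the w2w_tran entries):

import numpy as np

np.random.seed(0)
p, n = 0.728, 10000
u = np.random.rand(n)            # the stored uniform deviates

est1 = np.mean(u <= p)           # plain Monte Carlo estimate of the probability
est2 = np.mean((1 - u) <= p)     # the same estimate built from the antithetic deviates 1 - u
est_antithetic = 0.5 * (est1 + est2)

print(est1, est_antithetic)      # the averaged estimator has lower variance across runs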