Calculating Variable Cash-flow IRR in Python (pandas)

I have a DataFrame of unpredictable cashflows and unpredictable period lengths, and I need to generate a backward-looking IRR.
Doing it in Excel is pretty straightforward using the solver, wondering if there's a good way to pull it off in Python. (I think I could leverage openpyxl to get solver to work in excel from python, but that feels unnecessarily cumbersome).
The problem is pretty straightforward:
NPV of Cash Flow = ((cash_flow)/(1+IRR)^years_ago)
GOAL: Find IRR where SUM(NPV) = 0
My dataframe looks something like this:
cash_flow |years_ago
-----------------------
-3.60837e+06 |4.09167
31462 |4.09167
1.05956e+06 |3.63333
-1.32718e+06 |3.28056
-4.46554e+06 |3.03889
It seems as though other IRR calculators (such as numpy.irr) assume strict period cutoffs (every 3 months, 1 year, etc), which won't work. The other option seems to be the iterative route, where I continually guess, check, and iterate, but that feels like the wrong way to tackle this. Ideally, I'm looking for something that would do this:
irr = calc_irr((cash_flow1,years_ago1),(cash_flow2,years_ago2),etc)
EDIT: Here is the code I'm running the problem from. I have a list of transactions, and I've chosen to create temporary tables by id.
for id in df_tran.id.unique():
    temp_df = df_tran[df_tran.id == id]
    cash_flow = temp_df.cash_flows.values
    years = temp_df.years.values
    print(id, cash_flow)
    print(years)
    #irr_calc = irr(cfs=cash_flow, yrs=years,x0=0.100000)
    #print(sid, irr_calc)
where df_tran (which temp_df is based on) looks like:
cash_flow |years |id
0 -3.60837e+06 4.09167 978237
1 31462 4.09167 978237
4 1.05956e+06 3.63333 978237
6 -1.32718e+06 3.28056 978237
8 -4.46554e+06 3.03889 978237
10 -3.16163e+06 2.81944 978237
12 -5.07288e+06 2.58889 978237
14 268833 2.46667 978237
17 -4.74703e+06 1.79167 978237
20 -964987 1.40556 978237
22 -142920 1.12222 978237
24 163894 0.947222 978237
26 -2.2064e+06 0.655556 978237
27 1.23804e+06 0.566667 978237
29 180655 0.430556 978237
30 -85297 0.336111 978237
34 -2.3529e+07 0.758333 1329483
36 21935 0.636111 1329483
38 -3.55067e+06 0.366667 1329483
41 -4e+06 4.14167 1365051
temp_df looks identical to df_tran, except it only holds transactions for a single id.

You can use scipy.optimize.fsolve:
Return the roots of the (non-linear) equations defined by func(x) = 0
given a starting estimate.
First define the function that will be the func parameter to fsolve. This is NPV as a result of your IRR, cash flows, and years. (Vectorize with NumPy.)
import numpy as np
def npv(irr, cfs, yrs):
    return np.sum(cfs / (1. + irr) ** yrs)
An example:
cash_flow = np.array([-2., .5, .75, 1.35])
years = np.arange(4)
# A guess
print(npv(irr=0.10, cfs=cash_flow, yrs=years))
0.0886551465064
Now to use fsolve:
from scipy.optimize import fsolve
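def irr(cfs, yrs, x0):
    # fsolve returns a length-1 array; .item() extracts the scalar
    # (np.asscalar is no longer available in recent NumPy releases).
    return fsolve(npv, x0=x0, args=(cfs, yrs)).item()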
Your IRR is:
print(irr(cfs=cash_flow, yrs=years, x0=0.10))
0.12129650313214262
And you can confirm that this gets you to a 0 NPV:
res = irr(cfs=cash_flow, yrs=years, x0=0.10)
print(np.allclose(npv(res, cash_flow, years), 0.))
True
All code together:
import numpy as np
from scipy.optimize import fsolve
def npv(irr, cfs, yrs):
    return np.sum(cfs / (1. + irr) ** yrs)
def irr(cfs, yrs, x0, **kwargs):
    # .item() replaces np.asscalar, which recent NumPy releases no longer provide.
    return fsolve(npv, x0=x0, args=(cfs, yrs), **kwargs).item()
To make this compatible with your pandas example, just use
cash_flow = df.cash_flow.values
years = df.years_ago.values
Update: the values in your question seem a bit nonsensical (your IRR is going to be some astronomical number if it even exists) but here is how you'd run:
cash_flow = np.array([-3.60837e+06, 31462, 1.05956e+06, -1.32718e+06, -4.46554e+06])
years_ago = np.array([4.09167, 4.09167, 3.63333, 3.28056, 3.03889])
print(irr(cash_flow, years_ago, x0=0.10, maxfev=10000))
1.3977721900669127e+82
Second update: there are a couple minor typos in your code, and your actual flows of $ and timing work out to nonsensical IRRs, but here's what you're looking to do, below. For instance, notice you have one id with one single negative transaction, a negatively infinite IRR.
for i, df in df_tran.groupby('id'):
    cash_flow = df.cash_flow.values
    years = df.years.values
    print('id:', i, 'irr:', irr(cash_flow, years, x0=0.))
id: 978237 irr: 347.8254979851405
id: 1329483 irr: 3.2921314448062817e+114
id: 1365051 irr: 1.0444951674872467e+25
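The astronomical values above come from flows for which no sensible root exists. A minimal sketch of one way to guard against that, assuming you are happy to search only inside a fixed bracket: it swaps fsolve for scipy.optimize.brentq, which requires the NPV to change sign between the two ends of the bracket. The helper name bracketed_irr, the bracket [-0.99, 10.0] and the demo data are all assumptions made for illustration, not part of the question.

```python
import numpy as np
import pandas as pd
from scipy.optimize import brentq


def npv(rate, cfs, yrs):
    # Same NPV definition as above: discount each flow by (1 + rate) ** years.
    return np.sum(cfs / (1.0 + rate) ** yrs)


def bracketed_irr(cfs, yrs, lo=-0.99, hi=10.0):
    """Hypothetical helper: IRR inside [lo, hi], or NaN if NPV has no sign change there."""
    cfs = np.asarray(cfs, dtype=float)
    yrs = np.asarray(yrs, dtype=float)
    # brentq needs NPV(lo) and NPV(hi) to have opposite signs.
    if npv(lo, cfs, yrs) * npv(hi, cfs, yrs) > 0:
        return np.nan
    return brentq(npv, lo, hi, args=(cfs, yrs))


# Made-up transactions: id 1 mixes outflows and inflows, id 2 is a single outflow.
df_demo = pd.DataFrame({
    'id': [1, 1, 1, 1, 2],
    'cash_flow': [-2.0, 0.5, 0.75, 1.35, -4.0],
    'years': [0.0, 1.0, 2.0, 3.0, 4.0],
})
irrs = pd.Series({i: bracketed_irr(g['cash_flow'].values, g['years'].values)
                  for i, g in df_demo.groupby('id')})
print(irrs)
```

For id 1 this should agree with the fsolve result shown earlier (about 0.1213), while id 2 comes back as NaN instead of a huge number. The bracket is a judgment call: a root outside it is reported as NaN too.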

Related

Python: numpy, pandas, and performing operations on the previous array value (smoothed averages): any way to not use FOR loop? EWMA?

Tbh, I'm not really sure how to ask this question. I've got an array of values, and I'm looking to take the smoothed average of these values moving forward. In Excel, the calculation process is:
average_val_1 = mean average of values through window_size
average_val_2 = (value at location (window_size+1) * (window_size-1) + average_val_1) / window_size
average_val_3 = (value at location (window_size+2) * (window_size-1) + average_val_2) / window_size
etc., etc.
In pandas and numpy, my code for this is the following
df = pd.DataFrame({'av':np.nan, 'values':np.random.rand(10)})
df = df[['values','av']]
window = 5
df['av'].iloc[5] = np.mean(df['values'][:5])
for i in range(window+1,len(df.index)):
    df['av'].iloc[i] = (df['values'].iloc[i] * (window-1) + df['av'].iloc[i-1])/window
Which returns:
values av
0 0.418498 NaN
1 0.570326 NaN
2 0.296878 NaN
3 0.308445 NaN
4 0.127376 NaN
5 0.381160 0.344305
6 0.239725 0.260641
7 0.928491 0.794921
8 0.711632 0.728290
9 0.319791 0.401491
These are the values I am looking for, but there has to be a better way than using for loops. I think the answer has something to do with using exponentially weighted moving averages, but I'll be damned if I can figure out the syntax to make any sense of that.
Any suggestions?
You can use ewm, such as:
window = 5
df['av'] = np.nan
# Seed the first average with a single .loc assignment (avoids chained indexing).
df.loc[window, 'av'] = np.mean(df['values'][:window])
df.loc[window:,'av'] = (df.loc[window:,'av'].fillna(df['values'])
                        .ewm(adjust=False, alpha=(window-1.)/window).mean())
and you get the same result as with your for loop. For this to work, column 'av' must be NaN beforehand; otherwise the fillna with column 'values' will not happen and the value calculated in 'av' will be wrong. The alpha parameter of ewm is the weight given to the row being calculated.
Note: while this code does the same as yours, I would recommend having a look at this line in your code:
df['av'].iloc[5] = np.mean(df['values'][:5])
Because slicing with [:5] excludes the upper bound, df['values'][:5] is:
0 0.418498
1 0.570326
2 0.296878
3 0.308445
4 0.127376
Name: values, dtype: float64
so I think that what you should do is df['av'].iloc[4] = np.mean(df['values'][:5]). If you agree, then my code above must be changed slightly:
df.loc[window-1, 'av'] = np.mean(df['values'][:window])
df.loc[window-1:,'av'] = (df.loc[window-1:,'av'].fillna(df['values'])
                          .ewm(adjust=False, alpha=(window-1.)/window).mean())
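To check that the ewm recursion really matches the loop from the question, here is a small self-contained comparison. It is only a sketch: the random data, the seed and the variable names are made up for illustration, and it uses the original indexing (first average stored at position window).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = pd.Series(rng.random(10))
window = 5
first = values.iloc[:window].mean()

# Loop version: av[i] = (values[i] * (window - 1) + av[i - 1]) / window
loop_av = np.full(len(values), np.nan)
loop_av[window] = first
for i in range(window + 1, len(values)):
    loop_av[i] = (values.iloc[i] * (window - 1) + loop_av[i - 1]) / window

# ewm version: seed the first element with the plain mean, then let
# ewm(adjust=False) apply y[t] = (1 - alpha) * y[t-1] + alpha * x[t].
seeded = values.iloc[window:].copy()
seeded.iloc[0] = first
ewm_tail = seeded.ewm(adjust=False, alpha=(window - 1.0) / window).mean()
ewm_av = np.concatenate([np.full(window, np.nan), ewm_tail.to_numpy()])

print(np.allclose(loop_av, ewm_av, equal_nan=True))
```

With alpha = (window - 1) / window, the weight on the previous average is 1 / window, which is exactly the loop's update rule.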

Aggregate and calculate median of arrayfield in django queryset

I'm wondering if this is possible in a more efficient way.
I have a dataset in PostgreSQL that is structured like this:
Year, Sitename, Array (length = 4500)
For example:
1982, DANC, array([2,3,4,5,6,7,...])
1982, ANCH, array([5,6,4,3,5,7,...])
1983, DANC, array([3,3,4,6,3,6,...])
1983, ANCH, array([8,8,5,4,3,2,...])
What I want to do is add up the arrays (across rows) by years
E.G.,
1982 1982 1982
DANC ANCH TOT
2 5 7
3 6 9
4 4 8
5 3 8
6 5 11
7 7 14
... ... ...
My Django model looks like this:
class Abundance(models.Model):
    abundance_id = models.AutoField(primary_key=True)
    site = models.ForeignKey('Site')
    season = models.SmallIntegerField()
    samples = ArrayField(models.DecimalField(blank=True, decimal_places=3, max_digits=30))
    def __unicode__(self):
        return self.site
The following code in my Views.py works:
import numpy as np
import bottleneck as bn
...
def testview(request):
    s = ["ACUN","BRDM"]
    quants = []
    medians = []
    for yr in range(1982,2015):
        X = Abundance.objects.values_list('samples').filter(site__site_id__in = s).filter(season = yr)
        h = np.matrix(np.array(X,dtype=float))
        i = h.sum(axis=0)
        m = bn.median(i)
        up = np.percentile(i,95)
        down = np.percentile(i,5)
        qlist = [yr, round(down,3), round(up,3)]
        mlist = [yr, round(m,3)]
        quants.append(qlist)
        medians.append(mlist)
    return JsonResponse({'quants':quants, 'medians':medians})
However, the above code is very slow - especially when drawing many sites. I have tried playing with .aggregate() but I've not found a good solution.
Thanks in advance
You can probably use .aggregate() to push some of the load down to Postgres, but I think one of the bigger speed problems here is the Decimal field. It offers the highest precision, but it is also one of the more expensive types for Python to move in and out of.
That said, I'm not sure there's a quick way to get the percentiles out of the DB call, but the sums and medians you can easily push down to the DB via the Django ORM. The others (percentiles, etc.) can probably be pushed down as well, but you'll be delving into custom aggregates for Django (https://docs.djangoproject.com/en/1.9/ref/models/expressions/#creating-your-own-aggregate-functions). If you're going to go that far, it might be worth checking out something like aldjemy (https://github.com/Deepwalker/aldjemy/) and converting the entire query over to SQLAlchemy so you have maximum control over it.
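Separately from pushing work into the database, one cheaper thing to try is issuing a single query for all years and converting the Decimal values to float once. This is not what the answer above proposes, just a sketch of a Python-side alternative. The function name summarize_by_season is hypothetical, and the demo rows stand in for what something like values_list('season', 'samples') might return.

```python
import numpy as np


def summarize_by_season(rows):
    """Hypothetical helper. rows: iterable of (season, samples) pairs."""
    totals = {}
    for season, samples in rows:
        # Convert once to float; Decimal values are accepted here too.
        arr = np.asarray(samples, dtype=float)
        if season in totals:
            totals[season] += arr
        else:
            totals[season] = arr.copy()

    quants, medians = [], []
    for season in sorted(totals):
        total = totals[season]
        down, up = np.percentile(total, [5, 95])
        quants.append([season, round(float(down), 3), round(float(up), 3)])
        medians.append([season, round(float(np.median(total)), 3)])
    return quants, medians


# Made-up rows in place of a real queryset.
demo_rows = [
    (1982, [2, 3, 4]),
    (1982, [5, 6, 4]),
    (1983, [3, 3, 4]),
]
quants, medians = summarize_by_season(demo_rows)
print(quants)
print(medians)
```

Whether this helps in practice depends on how much of the time goes into the per-year queries versus the Decimal conversion, so it is worth profiling before committing to it.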

Rolling Product in PANDAS over 30-day time window

I am trying to get data ready for a financial event analysis and want to calculate the buy-and-hold abnormal return (BHAR). For a test data set I have three events (noted by event_id), and for each event I have 272 rows, going from t-252 days to t+20 days (noted by the variable time). For each day I also have the stock's return data (ret) as well as the expected return (Exp_Ret), which was calculated using a market model. Here's a sample of the data:
index event_id time ret vwretd Exp_Ret
0 0 -252 0.02905 0.02498 nan
1 0 -251 0.01146 -0.00191 nan
2 0 -250 0.01553 0.00562 nan
...
250 0 -2 -0.00378 0.00028 -0.00027
251 0 -1 0.01329 0.00426 0.00479
252 0 0 -0.01723 -0.00875 -0.01173
271 0 19 0.01335 0.01150 0.01398
272 0 20 0.00722 -0.00579 -0.00797
273 1 -252 0.01687 0.00928 nan
274 1 -251 -0.00615 -0.01103 nan
And here's the issue. I would like to calculate the following BHAR formula for each day:
BHAR(t1, t2) = PROD_{t=t1..t2}(1 + ret_t) - PROD_{t=t1..t2}(1 + Exp_Ret_t)
So, using the above formula as an example, if I would like to calculate the 10-day buy-and-hold abnormal return, I would have to calculate (1+ret_t=0)x(1+ret_t=1)...x(1+ret_t=10), then do the same with the expected return, (1+Exp_Ret_t=0)x(1+Exp_Ret_t=1)...x(1+Exp_Ret_t=10), then subtract the latter from the former.
I have made some progress using rolling_apply but it doesn't solve all my problems:
df['part1'] = pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod())
This seems to correctly implement the left-side of the BHAR equation in that it will add in the correct product -- though it will enter the value two rows down (which can be solved by shifting). One problem, though, is that there are three different 'groups' in the dataframe (3 events), and if the window were to go forward more than 30 days it might start using products from the next event. I have tried to implement a groupby with rolling_apply but keep getting error: TypeError: 'Series' objects are mutable, thus they cannot be hashed
df.groupby('event_id').apply(pd.rolling_apply(df['ret'], 10, lambda x : (1+x).prod()))
I am sure I am missing something basic here so any help would be appreciated. I might just need to approach it from a different angle. Here's one thought: In the end, what I am most interested in is getting the 30-day and 60-day buy-and-hold abnormal returns starting at time=0. So, maybe it is easier to select each event at time=0 and then calculate the 30-day product going forward? I'm not sure how I could best approach that.
Thanks in advance for any insights.
import numpy as np
import pandas as pd
# Create sample data.
np.random.seed(0)
VOL = .3
df = pd.DataFrame({'event_id': [0] * 273 + [1] * 273 + [2] * 273,
                   'time': list(range(-252, 21)) * 3,  # list(...) so this also works on Python 3
                   'ret': np.random.randn(273 * 3) * VOL / 252 ** .5,
                   'Exp_Ret': np.random.randn(273 * 3) * VOL / 252 ** .5})
# Pivot on time and event_id.
df = df.set_index(['time', 'event_id']).unstack('event_id')
# Calculate the return difference from t=0 onward (.loc replaces the removed .ix).
df_diff = df.loc[df.index >= 0, 'ret'] - df.loc[df.index >= 0, 'Exp_Ret']
# Calculate cumulative abnormal returns.
cum_returns = (1 + df_diff).cumprod() - 1
# Get 10 day abnormal returns.
>>> cum_returns.loc[10]
event_id
0 -0.014167
1 -0.172599
2 -0.032647
Name: 10, dtype: float64
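Note that the code above compounds the daily difference between ret and Exp_Ret, whereas the question describes the difference between the two compounded products. A sketch of the latter, kept in the long (un-pivoted) layout and restricted to time >= 0; the data is randomly generated and the names post, gross and bhar_10 are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_events = 3
times = np.arange(-5, 21)
df = pd.DataFrame({
    'event_id': np.repeat(np.arange(n_events), len(times)),
    'time': np.tile(times, n_events),
    'ret': rng.normal(0.0, 0.01, n_events * len(times)),
    'Exp_Ret': rng.normal(0.0, 0.01, n_events * len(times)),
})

# Keep only the event window, then compound within each event separately
# so products never spill over into the next event.
post = df[df['time'] >= 0].copy()
gross = (1.0 + post[['ret', 'Exp_Ret']]).groupby(post['event_id']).cumprod()
post['BHAR'] = gross['ret'] - gross['Exp_Ret']

# BHAR from t=0 through t=10 for each event.
bhar_10 = post.loc[post['time'] == 10].set_index('event_id')['BHAR']
print(bhar_10)
```

Changing the time == 10 filter to another horizon gives the corresponding figure, provided the data extends that far past t=0.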
Edited so that final values of BHAR are included in the main DataFrame.
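def bhar(arr):
    # Product of (1 + return) over the window.
    return np.cumprod(arr+1)[-1]
grouped = df.groupby('event_id')
parts = []
for name, group in grouped:
    # .rolling(...).apply(...) replaces pd.rolling_apply, which newer pandas no longer has.
    parts.append(group['ret'].rolling(10).apply(bhar, raw=True) -
                 group['Exp_Ret'].rolling(10).apply(bhar, raw=True))
# pd.concat replaces Series.append, which newer pandas no longer has.
df['BHAR'] = pd.concat(parts)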
You can then slice the DataFrame using df[df['time']>=0] such that you get only the required part.
You can obviously collapse the loop in one line using .apply() on the group, but I like it this way. Shorter lines to read = better readability.
This is what I did:
((df+1.0)
 .apply(lambda x: np.log(x),axis=1)
 .rolling(365).sum()
 .apply(lambda x: np.exp(x),axis=1)-1.0)
The result is a rolling product (minus 1): the logs of the gross returns are summed over the window and then exponentiated.
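A quick way to convince yourself that the log/sum/exp trick gives the same numbers as multiplying directly. This is a sketch on made-up returns with a 10-row window; it assumes every 1 + return is strictly positive, otherwise the log is undefined.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ret = pd.Series(rng.normal(0.0, 0.01, size=50))
window = 10

# Rolling product via logs: exp(sum(log(1 + r))) - 1
via_logs = np.exp(np.log1p(ret).rolling(window).sum()) - 1.0

# Direct rolling product for comparison.
direct = (1.0 + ret).rolling(window).apply(np.prod, raw=True) - 1.0

print(np.allclose(via_logs, direct, equal_nan=True))
```

The log version stays fully vectorized, while the direct version calls a Python-level function per window, so the former tends to be the faster of the two on long series.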

Selecting elements of a pandas dataframe that fall above a critical threshold

I have a pandas.df and I'm trying to remove all hypotheses that can be rejected.
Here is a snippet of the df in question:
best value p_value
0 11.9549 0.986927
1 11.9588 0.986896
2 12.1185 0.985588
3 12.1682 0.985161
4 12.3907 0.983131
5 12.4148 0.982899
6 12.6273 0.980750
7 12.9020 0.977680
8 13.4576 0.970384
9 13.5058 0.969679
10 13.5243 0.969405
11 13.5886 0.968439
12 13.8025 0.965067
13 13.9840 0.962011
14 14.1896 0.958326
15 14.3939 0.954424
16 14.6229 0.949758
17 14.6689 0.948783
18 14.9464 0.942626
19 15.1216 0.938494
20 15.5326 0.928039
21 17.7720 0.851915
22 17.8668 0.847993
23 17.9662 0.843822
24 19.2481 0.785072
25 19.5257 0.771242
I want to remove the elements with a p_value greater than a critical threshold alpha by selecting the ones that fall below alpha. The p value is calculated using scipy.stats.chisqprob(chisq,df) where chisq is the chi squared statistic and df is the degrees of freedom. This is all done using the custom method self.get_p_values shown below.
def reject_null_hypothesis(self,alpha,df):
    assert alpha>0
    assert alpha<1
    p_value=self.get_p_values(df) #calculates the data frame above
    return p_value.loc[p_value['best value']
I'm then calling this method using:
PE=Modelling_Tools.PE_Results(PE_file) #Modelling.Tools is the module and PE_Results is the class which is given the data 'PE_file'
print PE.reject_null_hypothesis(0.5,25)
From what I've read this should do what I want but I'm new to pandas.df and this code returns the unchanged
Are you getting any errors when you run this? I ask because:
print PE.reject_null_hypothesis(0.5, 25)
is passing into reject_null_hypothesis() 25, an int object instead of a pandas.DataFrame object, in the last argument position.
(Apologies. I would respond with this as a comment instead of an answer, but I only have 46 reputation at the moment, and 50 is needed to comment.)
Refer to indexing with a boolean array:
df[ df.p_value < threshold ]
Turns out there is a simple way to do what I want. Here is the code for those who want to know.
def reject_null_hypothesis(self,alpha,df):
    '''
    alpha = critical threshold for chisq statistic
    df=degrees of freedom
    values below this critical threshold are rejected.
    values above this threshold are not 'proven' but
    cannot be rejected and must therefore be subject to
    further statistics
    '''
    assert alpha>0
    assert alpha<1
    p_value=self.get_p_values(df)
    passed= p_value[p_value.loc[:,'p_value']>alpha].index
    # Positional slice: keeps rows before position max(passed), so the row at
    # max(passed) itself is not included.
    return p_value[:max(passed)]
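For comparison, here is the boolean-indexing suggestion from the earlier answer as a small runnable sketch. It uses a few rows copied from the table in the question; alpha = 0.9 is an illustrative value chosen so that the mask keeps some rows and drops others.

```python
import pandas as pd

p_values = pd.DataFrame({
    'best value': [11.9549, 15.5326, 17.7720, 19.5257],
    'p_value': [0.986927, 0.928039, 0.851915, 0.771242],
})
alpha = 0.9

# A boolean mask selects rows directly and does not depend on the row order.
below = p_values[p_values['p_value'] < alpha]
above = p_values[p_values['p_value'] > alpha]

print(below)
print(above)
```

Unlike slicing up to max(passed), the mask does not rely on the frame being sorted by p_value.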

Inexpensive way to add time series intensity in python pandas dataframe

I am trying to sum (and plot) a total from functions which change states at different times using Python's Pandas.DataFrame. For example:
Suppose we have 3 people whose states can be a) holding nothing, b) holding a 5 pound weight, and c) holding a 10 pound weight. Over time, these people pick weights up and put them down. I want to plot the total amount of weight being held. So, given:
My brute force attempt:
import pandas as ps
import math
import numpy as np
person1=[3,0,10,10,10,10,10]
person2=[4,0,20,20,25,25,40]
person3=[5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['count','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDfNoCount=allPeopleDf[['start1', 'end1', 'start2', 'end2', 'start3','end3']]
uniqueTimes=sorted(ps.unique(allPeopleDfNoCount.values.ravel()))
possibleStates=[-1,0,1,2] #extra state 0 for initialization
stateData={}
comboStates={}
#initialize dict to add up all of the stateData
for time in uniqueTimes:
    comboStates[time]=0.0
allPeopleDf['track']=-1
allPeopleDf['status']=-1
numberState=len(possibleStates)
starti=-1
endi=0
startState=0
for i in range(3):
    starti=starti+2
    print starti
    endi=endi+2
    for time in uniqueTimes:
        def helper(row):
            start=row[starti]
            end=row[endi]
            track=row[7]
            if start <= time and time < end:
                return possibleStates[i+1]
            else:
                return possibleStates[0]
        def trackHelp(row):
            status=row[8]
            track=row[7]
            if track<=status:
                return status
            else:
                return track
        def Multiplier(row):
            x=row[8]
            if x==0:
                return 0.0*row[0]
            if x==1:
                return 5.0*row[0]
            if x==2:
                return 10.0*row[0]
            if x==-1:#numeric place holder for non-contributing
                return 0.0*row[0]
        allPeopleDf['status']=allPeopleDf.apply(helper,axis=1)
        allPeopleDf['track']=allPeopleDf.apply(trackHelp,axis=1)
        stateData[time]=allPeopleDf.apply(Multiplier,axis=1).sum()
    for k,v in stateData.iteritems():
        comboStates[k]=comboStates.get(k,0)+v
print allPeopleDf
print stateData
print comboStates
Plots of weight being held over time might look like the following:
And the sum of the intensities over time might look like the black line in the following:
with the black line defined with the Cartesian points: (0,0 lbs),(5,0 lbs),(5,5 lbs),(15,5 lbs),(15,10 lbs),(20,10 lbs),(20,15 lbs),(25,15 lbs),(25,20 lbs),(40,20 lbs). However, I'm flexible and don't necessarily need to define the combined intensity line as a set of Cartesian points. The unique times can be found with:
print list(set(uniqueTimes).intersection(allNoCountT[1].values.ravel())).sort()
but I can't come up with a slick way of getting the corresponding intensity values.
I started out with a very ugly function to break apart each "person's" graph so that all people had start and stop times (albeit many stop and start times without state change) at the same time, and then I could add up all the "chunks" of time. This was cumbersome; there has to be a slick pandas way of handling this. If anyone can offer a suggestion or point me to another SO like that I might have missed, I'd appreciate the help!
In case my simplified example isn't clear, another might be plotting the intensity of sound coming from a piano: there are many notes being played for different durations with different intensities. I would like the sum of intensity coming from the piano over time. While my example is simplistic, I need a solution that is more on the scale of a piano song: thousands of discrete intensity levels per key, and many keys contributing over the course of a song.
Edit--Implementation of mgab's provided solution:
import pandas as ps
import math
import numpy as np
person1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
person2=['person2',4,0,20,20,25,25,40]
person3=['person3',5,0,5,5,15,15,40]
allPeopleDf=ps.DataFrame(np.array(zip(person1,person2,person3)).T)
allPeopleDf.columns=['id','intensity','start1', 'end1', 'start2', 'end2', 'start3','end3']
allPeopleDf=ps.melt(allPeopleDf,id_vars=['intensity','id'])
allPeopleDf.columns=['intensity','id','timeid','time']
df=ps.DataFrame(allPeopleDf).drop('timeid',1)
df[df.id=='person1'].drop('id',1) #easier to visualize one id for check
df['increment']=df.groupby('id')['intensity'].transform( lambda x: x.sub(x.shift(), fill_value= 0 ))
TypeError: unsupported operand type(s) for -: 'str' and 'int'
End Edit
Going for the piano keys example, let's assume you have three keys, with 30 levels of intensity.
I would try to keep the data in this format:
import pandas as pd
df = pd.DataFrame([[10,'A',5],
[10,'B',7],
[13,'C',10],
[15,'A',15],
[20,'A',7],
[23,'C',0]], columns=["time", "key", "intensity"])
time key intensity
0 10 A 5
1 10 B 7
2 13 C 10
3 15 A 15
4 20 A 7
5 23 C 0
where you record every change in intensity of any of the keys. From here you can already get the Cartesian coordinates for each individual key as (time,intensity) pairs
df[df.key=="A"].drop(columns='key')
time intensity
0 10 5
3 15 15
4 20 7
Then, you can easily create a new column increment that will indicate the change in intensity that occurred for that key at that time point (intensity indicates just the new value of intensity)
df["increment"]=df.groupby("key")["intensity"].transform(
    lambda x: x.sub(x.shift(), fill_value= 0 ))
df
time key intensity increment
0 10 A 5 5
1 10 B 7 7
2 13 C 10 10
3 15 A 15 10
4 20 A 7 -8
5 23 C 0 -10
And then, using this new column, you can generate the (time, total_intensity) pairs to use as Cartesian coordinates
df.groupby("time")["increment"].sum().cumsum()
time
10 12
13 22
15 32
20 24
23 14
dtype: int64
EDIT: applying the specific data presented in question
I'm assuming the data comes as a list of values: first the element id (person/piano key), then a factor multiplying the measured weights/intensities for this element, and then pairs of time values indicating the start and end of a series of known states (weight being carried/intensity being emitted). I'm not sure I got the data format right. From your question:
data1=['person1',3,0.0,10.0,10.0,10.0,10.0,10.0]
data2=['person2',4,0,20,20,25,25,40]
data3=['person3',5,0,5,5,15,15,40]
And if we know the weight/intensity of each one of the states, we can define:
known_states = [5, 10, 15]
DF_columns = ["time", "id", "intensity"]
Then, the easiest way I came up with to load the data includes this function:
import pandas as pd
def read_data(data, states, columns):
    id = data[0]
    factor = data[1]
    reshaped_data = []
    for i in range(len(states)):
        # Index of the start time for state i; its end time sits at j+1.
        j = 2+2*i
        if not data[j] == data[j+1]:
            reshaped_data.append([data[j], id, factor*states[i]])
            reshaped_data.append([data[j+1], id, -1*factor*states[i]])
    return pd.DataFrame(reshaped_data, columns=columns)
Notice that the if not data[j] == data[j+1]: avoids loading data to the dataframe when start and end times for a given state are equal (seems uninformative, and wouldn't appear in your plots anyway). But take it out if you still want these entries.
Then, you load the data:
# pd.concat replaces DataFrame.append, which newer pandas no longer has.
df = pd.concat([read_data(data1, known_states, DF_columns),
                read_data(data2, known_states, DF_columns),
                read_data(data3, known_states, DF_columns)],
               ignore_index=True)
# and so on...
And then you're right at the beginning of this answer (substituting 'key' by 'id' and the ids, of course)
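The idea underneath both halves of this answer is to turn every interval into two signed events (plus the weight at the start, minus the weight at the end), add up the events per time point, and take a running total. A minimal end-to-end sketch, using made-up intervals rather than the data from the question; the column names and the intervals list are assumptions for illustration.

```python
import pandas as pd

# (start, end, weight) intervals, made up for illustration:
# one person holds 5 lb from t=0 to t=10; another holds 5 lb from t=0 to t=20
# and then 10 lb from t=20 to t=25.
intervals = [(0, 10, 5), (0, 20, 5), (20, 25, 10)]

events = []
for start, end, weight in intervals:
    events.append((start, weight))   # weight picked up
    events.append((end, -weight))    # weight put down
events = pd.DataFrame(events, columns=['time', 'increment'])

# Net change at each distinct time, then the running total held.
total = events.groupby('time')['increment'].sum().cumsum()
print(total)
```

Each value in total is the combined weight held from that time until the next index entry, so plotting it with a step-style line (for example drawstyle='steps-post' in matplotlib) gives the summed intensity curve.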
Appears to be what .sum() is for:
In [10]:
allPeopleDf.sum()
Out[10]:
aStart 0
aEnd 35
bStart 35
bEnd 50
cStart 50
cEnd 90
dtype: int32
