I have a problem calculating a function in Python. I want to calculate the IRR for a number of investments, each of which is described by its own dataframe. For each investment I have one dataframe per cut-off date, so I end up with multiple dataframes describing the payment flows each investment has made up to different dates, with the last row of each dataframe containing the stock of capital that the investment holds at that point. I do this in order to build a time series of the IRR for each investment. All of the dataframes whose IRR I want to calculate are stored in a list.
To calculate the IRR for each dataframe I made these functions:
import numpy as np
from scipy.optimize import fsolve

def npv(irr, cfs, yrs):
    return np.sum(cfs / ((1. + irr) ** yrs))

def irr(cfs, yrs, x0):
    return np.asscalar(fsolve(npv, x0=x0, args=(cfs, yrs)))
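A quick sanity check on any single dataframe (a sketch, not part of the original post; cash_flow and years are the arrays built in the loop below) is to plug the solved rate back into npv and confirm it is close to zero:

rate = irr(cash_flow, years, x0=0.)
# at a true root the NPV evaluated at the solved rate should be ~0
print(rate, npv(rate, cash_flow, years))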
So in order to calculate the IRR for each dataframe in my list I did:
for i, new_df in enumerate(dfs):
    cash_flow = new_df.FLOWS.values
    years = new_df.timediff.values
    output.loc[i, ['DATE']] = new_df['DATE'].iloc[-1]
    output.loc[i, ['Investment']] = new_df['Investment'].iloc[-1]
    output.loc[i, ['irr']] = irr(cash_flow, years, x0=0.)
Output is the dataframe I want to create that contains the information I want, i.e. the IRR of each investment up until a certain date. The problem is, it calculates the IRR correctly for some dataframes, but not for others. For example, it calculates the IRR correctly for this dataframe:
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-06-30 1 116957227.3 5.347945205479452
With an IRR of 0.215. But for this dataframe, for the exact same investment, it does not: it returns an IRR of 0.0001, while the real IRR should be around 0.216.
DATE INVESTMENT FLOWS timediff
0 2014-02-24 1 -36278400.0 0.0
1 2014-03-25 1 -11490744.0 0.07945205479452055
2 2015-01-22 1 -13244300.0 0.9095890410958904
3 2015-09-24 1 -10811412.0 1.5808219178082192
4 2015-11-12 1 -6208238.0 1.715068493150685
5 2016-01-22 1 -6210161.0 1.9095890410958904
6 2016-03-31 1 -4535569.0 2.0986301369863014
7 2016-05-25 1 8420470.0 2.249315068493151
8 2016-06-30 1 12357138.0 2.347945205479452
9 2016-07-14 1 3498535.0 2.3863013698630136
10 2016-12-26 1 4085285.0 2.8383561643835615
11 2017-06-07 1 3056835.0 3.2849315068493152
12 2017-09-11 1 11254424.0 3.547945205479452
13 2017-11-16 1 9274834.0 3.728767123287671
14 2018-02-22 1 1622857.0 3.9972602739726026
15 2018-05-23 1 2642985.0 4.243835616438356
18 2018-08-23 1 9265099.0 4.495890410958904
16 2018-11-29 1 1011915.0 4.764383561643836
19 2018-12-28 1 1760734.0 4.843835616438356
17 2019-01-14 1 1940112.0 4.890410958904109
20 2019-09-30 1 123753575.7 5.6
These two dataframes have exactly the same flows except for the last row, which contains the stock of capital up until that date for that investment. So the only difference between these two dataframes is the last row. This means this investment hasn't had any inflows or outflows during that time. I don't understand why the IRR varies so much, or why some IRRs are calculated incorrectly.
Most are calculated correctly but a few are not.
Thanks for helping me.
As I thought, it is a problem with the optimization method.
When I tried your irr function with the second df, I even received a warning:
RuntimeWarning: The iteration is not making good progress, as measured by the
improvement from the last ten iterations.
warnings.warn(msg, RuntimeWarning)
But trying out scipy.optimize.root with other methods seems to work for me. I changed the function to:
import scipy.optimize as optimize
def irr(cfs, yrs, x0):
    r = optimize.root(npv, args=(cfs, yrs), x0=x0, method='broyden1')
    return float(r.x)
I just checked lm and broyden1, and both converged with your second example to around 0.216. There are multiple methods, and I have no clue which would be the best choice among them, but most seem to be better than the hybr method used by fsolve.
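To compare how the different solvers behave on a troublesome cash-flow series, a quick loop like the following can help (a sketch, not part of the original answer; cash_flow and years are the arrays taken from one of the dataframes above, and the method names are standard options accepted by scipy.optimize.root):

import numpy as np
from scipy import optimize

def npv(irr, cfs, yrs):
    return np.sum(cfs / ((1. + irr) ** yrs))

# try a few of scipy.optimize.root's methods on the same cash flows
for method in ['hybr', 'lm', 'broyden1']:
    res = optimize.root(npv, x0=0.0, args=(cash_flow, years), method=method)
    rate = float(np.atleast_1d(res.x)[0])
    print(f'{method}: irr={rate:.4f}, converged={res.success}')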
For example, let's consider the following dataframe:
Restaurant_ID Floor Cust_Arrival_Datetime
0 100 1 2021-11-17 17:20:00
1 100 1 2021-11-17 17:22:00
2 100 1 2021-11-17 17:25:00
3 100 1 2021-11-17 17:30:00
4 100 1 2021-11-17 17:50:00
5 100 1 2021-11-17 17:51:00
6 100 2 2021-11-17 17:25:00
7 100 2 2021-11-17 18:00:00
8 100 2 2021-11-17 18:50:00
9 100 2 2021-11-17 18:56:00
For the above toy example we can assume that Cust_Arrival_Datetime is sorted as well as grouped by restaurant and floor (as seen above). How could we now calculate something such as the median time interval between customer arrivals for each unique restaurant and floor group?
The desired output would be:
Restaurant_ID Floor Median Arrival Interval(in minutes)
0 100 1 3
1 100 2 35
The Median Arrival Interval is calculated as follows: for the first floor of the store we can see that by the time the second customer arrives 2 minutes have already passed since the first one arrived. Similarly, 3 minutes have elapsed between the 2nd and the 3rd customer and 5 minutes for the 3rd and 4th customer etc. The median for floor 1 and restaurant 100 would be 3.
I have tried something like this:
df.groupby(['Restaurant_ID', 'Floor']).apply(lambda row: row['Cust_Arrival_Datetime'].shift() - row['Cust_Arrival_Datetime']).apply(np.median)
but this does not work!
Any help is welcome!
IIUC, you can do
(df.groupby(['Restaurant_ID', 'Floor'])['Cust_Arrival_Datetime']
.agg(lambda x: x.diff().dt.total_seconds().median()/60))
and you get
Restaurant_ID Floor
100 1 3.0
2 35.0
Name: Cust_Arrival_Datetime, dtype: float64
You can chain with reset_index if needed, for example:
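A minimal sketch reusing the same groupby, which also renames the column to match the desired output:

result = (df.groupby(['Restaurant_ID', 'Floor'])['Cust_Arrival_Datetime']
            .agg(lambda x: x.diff().dt.total_seconds().median() / 60)
            .reset_index(name='Median Arrival Interval (in minutes)'))
print(result)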
Consider the following data frame:
df = pd.DataFrame({
    'group': [1,1,1,2,2,2],
    'time': pd.to_datetime(
        ['14:14', '14:17', '14:25', '17:29', '17:40', '17:43']
    )
})
Suppose, you'd like to apply a range of transformations:
def stats(group):
    diffs = group.diff().dt.total_seconds()/60
    return {
        'min': diffs.min(),
        'mean': diffs.mean(),
        'median': diffs.median(),
        'max': diffs.max()
    }
Then you simply have to apply these:
>>> df.groupby('group')['time'].agg(stats).apply(pd.Series)
min mean median max
group
1 3.0 5.5 5.5 8.0
2 3.0 7.0 7.0 11.0
I have a csv file with bid/ask prices of many bonds (identified by ISIN) for the past 1 year. Using these historical prices, I'm trying to calculate the historical volatility for each bond. Although it should typically be an easy task, the issue is that not all bonds have exactly the same number of days of trading price data, while they're all in the same column rather than spread across columns. Hence, if I need to calculate a rolling standard deviation, I can't choose a standard rolling window of 252 days for 1 year.
The data set has this format-
BusinessDate    ISIN     Bid    Ask
Date 1          ISIN1    P1     P2
Date 2          ISIN1    P1     P2
...             ...      ...    ...
Date 252        ISIN1    P1     P2
Date 1          ISIN2    P1     P2
Date 2          ISIN2    P1     P2
...
& so on.
My current code is as follows-
vol_df = pd.read_csv('hist_prices.csv')
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df['hist_vol'] = vol_df['log_return'].std() * np.sqrt(252)
The last line of code seems to be giving all NaN values in the column. This is most likely because the operation for calculating the std deviation is happening on the same row number and not for a list of numbers. I tried replacing the last line to use rolling_std-
vol_df.set_index('BusinessDate').groupby('ISIN').rolling(window = 1, freq = 'A').std()['log_return']
But this doesn't help either; it gives 2 numbers for each ISIN. I also tried to use pivot() to place the ISINs in columns and BusinessDate as index, with the prices as values, but it gives an error. Also, I have close to 9,000 different ISINs, so putting them in columns to calculate std() for each column may not be the best way. Any clues on how I can sort this out?
I was able to resolve this in a crude way-
vol_df_2 = vol_df.groupby('ISIN')['logret'].std()
vol_df_3 = vol_df_2.to_frame()
vol_df_3.rename(columns = {'logret':'daily_std'}, inplace = True)
The first line above returns a Series, with the standard deviation column named 'logret'. So the 2nd and 3rd lines of code convert it into a dataframe and rename the daily standard deviation column accordingly. Finally, the annual vol can be calculated by multiplying by sqrt(252).
If anyone has a better way to do it in the same dataframe instead of creating a series, that'd be great.
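One way to keep everything in the same dataframe (a minimal sketch, assuming the log-return column is called logret as in the snippet above) is groupby().transform, which broadcasts each ISIN's standard deviation back onto its own rows:

# per-ISIN daily std, aligned back to every row of the original dataframe
vol_df['daily_std'] = vol_df.groupby('ISIN')['logret'].transform('std')
vol_df['hist_vol'] = vol_df['daily_std'] * np.sqrt(252)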
OK, this almost works now.
It does need some math per ISIN to figure out the rolling period; I just used 3 and 2 in my example. You probably need to count how many days of trading there are in the year, or whatever, and fix it at that per ISIN somehow.
And then you need to figure out how to merge the data back. The output actually has errors because it's updating a copy, but that is kind of what I was looking for here. I am sure someone who knows more could fix it at this point. I can't get the merge working.
toy_data={'BusinessDate': ['10/5/2020','10/6/2020','10/7/2020','10/8/2020','10/9/2020',
'10/12/2020','10/13/2020','10/14/2020','10/15/2020','10/16/2020',
'10/5/2020','10/6/2020','10/7/2020','10/8/2020'],
'ISIN': [1,1,1,1,1, 1,1,1,1,1, 2,2,2,2],
'Bid': [0.295,0.295,0.295,0.295,0.295,
0.296, 0.296,0.297,0.298,0.3,
2.5,2.6,2.71,2.8],
'Ask': [0.301,0.305,0.306,0.307,0.308,
0.315,0.326,0.337,0.348,0.37,
2.8,2.7,2.77,2.82]}
#vol_df = pd.read_csv('hist_prices.csv')
vol_df = pd.DataFrame(toy_data)
vol_df['BusinessDate'] = pd.to_datetime(vol_df['BusinessDate'])
vol_df['Mid Price'] = vol_df[['Bid', 'Ask']].mean(axis = 1)
vol_df['log_return'] = vol_df.groupby('ISIN')['Mid Price'].apply(lambda x: np.log(x) - np.log(x.shift(1)))
vol_df.dropna(subset = ['log_return'], inplace=True)
# do some math here to calculate how many days you want to roll for an ISIN
# maybe count how many days over a 1 year period exist???
# not really sure how you'd miss days unless stuff just doesnt trade
# (but I don't need to understand it anyway)
rolling = {1: 3, 2: 2}
for isin in vol_df['ISIN'].unique():
    roll = rolling[isin]
    print(f'isin={isin}, roll={roll}')
    df_single = vol_df[vol_df['ISIN']==isin]
    df_single['rolling'] = df_single['log_return'].rolling(roll).std()
    # i can't get the right syntax to merge data back, but this shows it
    vol_df[isin, 'rolling'] = df_single['rolling']
    print(df_single)
print(vol_df)
which outputs (minus the warning errors):
isin=1, roll=3
BusinessDate ISIN Bid Ask Mid Price log_return rolling
1 2020-10-06 1 0.295 0.305 0.3000 0.006689 NaN
2 2020-10-07 1 0.295 0.306 0.3005 0.001665 NaN
3 2020-10-08 1 0.295 0.307 0.3010 0.001663 0.002901
4 2020-10-09 1 0.295 0.308 0.3015 0.001660 0.000003
5 2020-10-12 1 0.296 0.315 0.3055 0.013180 0.006650
6 2020-10-13 1 0.296 0.326 0.3110 0.017843 0.008330
7 2020-10-14 1 0.297 0.337 0.3170 0.019109 0.003123
8 2020-10-15 1 0.298 0.348 0.3230 0.018751 0.000652
9 2020-10-16 1 0.300 0.370 0.3350 0.036478 0.010133
isin=2, roll=2
BusinessDate ISIN Bid ... log_return (1, rolling) rolling
11 2020-10-06 2 2.60 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.71 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.80 ... 2.522656e-02 NaN 0.005778
[3 rows x 8 columns]
BusinessDate ISIN Bid ... log_return (1, rolling) (2, rolling)
1 2020-10-06 1 0.295 ... 6.688988e-03 NaN NaN
2 2020-10-07 1 0.295 ... 1.665279e-03 NaN NaN
3 2020-10-08 1 0.295 ... 1.662511e-03 0.002901 NaN
4 2020-10-09 1 0.295 ... 1.659751e-03 0.000003 NaN
5 2020-10-12 1 0.296 ... 1.317976e-02 0.006650 NaN
6 2020-10-13 1 0.296 ... 1.784313e-02 0.008330 NaN
7 2020-10-14 1 0.297 ... 1.910886e-02 0.003123 NaN
8 2020-10-15 1 0.298 ... 1.875055e-02 0.000652 NaN
9 2020-10-16 1 0.300 ... 3.647821e-02 0.010133 NaN
11 2020-10-06 2 2.600 ... 2.220446e-16 NaN NaN
12 2020-10-07 2 2.710 ... 3.339828e-02 NaN 0.023616
13 2020-10-08 2 2.800 ... 2.522656e-02 NaN 0.005778
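For the merge-back step the loop above struggles with, one option (a sketch using the same toy data and rolling dict) is to assign through .loc on the original frame, which aligns on the index and avoids both the copy warning and the tuple column names:

vol_df['rolling'] = np.nan
for isin, roll in rolling.items():
    mask = vol_df['ISIN'] == isin
    # index-aligned assignment writes the per-ISIN rolling std back into vol_df
    vol_df.loc[mask, 'rolling'] = vol_df.loc[mask, 'log_return'].rolling(roll).std()
print(vol_df)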
I have a dataset with the following [structure][1] -
At a high level it is time series data. I want to plot this time series data with a unique color for each column. This will let me show the transitions to the viewer more clearly. The column names/labels change from one dataset to another, which means I need to assign colors to the y values based on the labels present in each dataset. I am trying to decide how to do this in a scalable manner.
Sample data ->
;(1275, 51) PCell Tput Avg (kbps) (Average);(1275, 95) PCell Tput Avg (kbps) (Average);(56640, 125) PCell Tput Avg (kbps) (Average);Time Stamp
0;;;79821.1;2021-04-29 23:01:53.624
1;;;79288.3;2021-04-29 23:01:53.652
2;;;77629.2;2021-04-29 23:01:53.682
3;;;78980.3;2021-04-29 23:01:53.695
4;;;77953.4;2021-04-29 23:01:53.723
5;;;;2021-04-29 23:01:53.748
6;;75558.7;;2021-04-29 23:01:53.751
7;;73955.5;;2021-04-29 23:01:53.780
8;;73689.8;;2021-04-29 23:01:53.808
9;;74819.8;;2021-04-29 23:01:53.839
10;10000;;;2021-04-29 23:01:53.848
11;68499;;;2021-04-29 23:01:53.867
[1]: https://i.stack.imgur.com/YM2P6.png
As long as each dataframe has the datetime column labeled as 'Time Stamp', you should just be able to do this for each one:
import matplotlib.pyplot as plt
df = ...  # get your df here
df.plot.line(x = 'Time Stamp',grid=True)
plt.show()
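For the semicolon-separated sample in the question, a minimal sketch would be the following (assuming the sample is saved to a file named data.csv, which is a placeholder name):

import pandas as pd
import matplotlib.pyplot as plt

# the sample uses ';' as separator and its first (unnamed) column as a row index
df = pd.read_csv('data.csv', sep=';', index_col=0, parse_dates=['Time Stamp'])

# one line per throughput column; matplotlib assigns each column its own color
df.plot.line(x='Time Stamp', grid=True)
plt.show()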
Example df:
>>> df
A B C D Time Stamp
0 6 9 7 8 2018-01-01 00:00:00.000000000
1 7 3 8 6 2018-05-16 18:04:05.267114496
2 4 1 4 0 2018-09-29 12:08:10.534228992
3 1 2 5 8 2019-02-12 06:12:15.801343744
4 6 7 9 3 2019-06-28 00:16:21.068458240
5 8 5 9 9 2019-11-10 18:20:26.335572736
6 3 0 8 2 2020-03-25 12:24:31.602687232
7 8 0 0 9 2020-08-08 06:28:36.869801984
8 0 9 7 8 2020-12-22 00:32:42.136916480
9 7 8 9 2 2021-05-06 18:36:47.404030976
Resulting plot:
I have two sets of continuous data that I would like to pass into a contour plot. The x-axis would be time, the y-axis would be mass, and the z-axis would be frequency (as in how many times that data point appears). However, most data points are not identical but rather very similar. Thus, I suspect it's easiest to discretize both the x-axis and y-axis.
Here's the data I currently have:
INPUT
import pandas as pd
df = pd.read_excel('data.xlsx')
df['Dates'].head(5)
df['Mass'].head(5)
OUTPUT
13 2003-05-09
14 2003-09-09
15 2010-01-18
16 2010-11-21
17 2012-06-29
Name: Date, dtype: datetime64[ns]
13 2500.0
14 3500.0
15 4000.0
16 4500.0
17 5000.0
Name: Mass, dtype: float64
I'd like to convert the data such that it groups up data points within the year (ex: all datapoints taken in 2003) and it groups up data points within different levels of mass (ex: all datapoints between 3000-4000 kg). Next, the code would count how many data points are within each of these blocks and pass that as the z-axis.
Ideally, I'd also like to be able to adjust the levels of slices. Ex: grouping points up every 100kg instead of 1000kg, or passing a custom list of levels that aren't equally distributed. How would I go about doing this?
I think the function you are looking for is pd.cut
import pandas as pd
import numpy as np
import datetime
n = 10
scale = 1e3
Min = 0
Max = 1e4
np.random.seed(6)
Start = datetime.datetime(2000, 1, 1)
Dates = np.array([Start + datetime.timedelta(days=i*180) for i in range(n)])
Mass = np.random.rand(n)*10000
df = pd.DataFrame(index = Dates, data = {'Mass':Mass})
print(df)
gives you:
Mass
2000-01-01 8928.601514
2000-06-29 3319.798053
2000-12-26 8212.291231
2001-06-24 416.966257
2001-12-21 1076.566799
2002-06-19 5950.520642
2002-12-16 5298.173622
2003-06-14 4188.074286
2003-12-11 3354.078493
2004-06-08 6225.194322
If you want to group your Masses by, say, 1000, or implement your own custom bins, you can do this:
Bins,Labels=np.arange(Min,Max+.1,scale),(np.arange(Min,Max,scale))+(scale)/2
EqualBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(1,'Equal Bins',EqualBins)
Bins,Labels=[0,1000,5000,10000],['Small','Medium','Big']
CustomBins = pd.cut(df['Mass'],bins=Bins,labels=Labels)
df.insert(2,'Custom Bins',CustomBins)
If you want to just show the year, month, etc it is very simple:
df['Year'] = df.index.year
df['Month'] = df.index.month
but you can also do custom date ranges if you like:
Bins=[datetime.datetime(1999, 12, 31),datetime.datetime(2000, 9, 1),
datetime.datetime(2002, 1, 1),datetime.datetime(2010, 9, 1)]
Labels = ['Early','Middle','Late']
CustomDateBins = pd.cut(df.index,bins=Bins,labels=Labels)
df.insert(3,'Custom Date Bins',CustomDateBins)
print(df)
This yields something like what you want:
Mass Equal Bins Custom Bins Custom Date Bins Year Month
2000-01-01 8928.601514 8500.0 Big Early 2000 1
2000-06-29 3319.798053 3500.0 Medium Early 2000 6
2000-12-26 8212.291231 8500.0 Big Middle 2000 12
2001-06-24 416.966257 500.0 Small Middle 2001 6
2001-12-21 1076.566799 1500.0 Medium Middle 2001 12
2002-06-19 5950.520642 5500.0 Big Late 2002 6
2002-12-16 5298.173622 5500.0 Big Late 2002 12
2003-06-14 4188.074286 4500.0 Medium Late 2003 6
2003-12-11 3354.078493 3500.0 Medium Late 2003 12
2004-06-08 6225.194322 6500.0 Big Late 2004 6
The .groupby function is probably of interest to you as well:
yeargroup = df.groupby(df.index.year).mean()
massgroup = df.groupby(df['Equal Bins']).count()
print(yeargroup)
print(massgroup)
Mass Year Month
2000 6820.230266 2000.0 6.333333
2001 746.766528 2001.0 9.000000
2002 5624.347132 2002.0 9.000000
2003 3771.076389 2003.0 9.000000
2004 6225.194322 2004.0 6.000000
Mass Custom Bins Custom Date Bins Year Month
Equal Bins
500.0 1 1 1 1 1
1500.0 1 1 1 1 1
2500.0 0 0 0 0 0
3500.0 2 2 2 2 2
4500.0 1 1 1 1 1
5500.0 2 2 2 2 2
6500.0 1 1 1 1 1
7500.0 0 0 0 0 0
8500.0 2 2 2 2 2
9500.0 0 0 0 0 0
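To get from these bins to the contour plot asked about in the question, one approach (a sketch building on the df above; the bin labels are used as mass-bin centers) is to count rows per (year, mass bin) cell and pass that grid as the z-axis:

import matplotlib.pyplot as plt

# frequency of observations in each (mass bin, year) cell
counts = pd.crosstab(df['Equal Bins'], df['Year'])

X, Y = np.meshgrid(counts.columns.astype(int), counts.index.astype(float))
plt.contourf(X, Y, counts.values)
plt.xlabel('Year')
plt.ylabel('Mass bin (kg)')
plt.colorbar(label='Frequency')
plt.show()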
I have 4,188,006 rows of data. I want to group my data by the value of its Code column, and set the Code value as the key and the corresponding data as the value in a dict.
_a_stock_basic_data is my data:
Code date_time open high low close \
0 000001.SZ 2007-03-01 19.000000 19.000000 18.100000 18.100000
1 000002.SZ 2007-03-01 14.770000 14.800000 13.860000 14.010000
2 000004.SZ 2007-03-01 6.000000 6.040000 5.810000 6.040000
3 000005.SZ 2007-03-01 4.200000 4.280000 4.000000 4.040000
4 000006.SZ 2007-03-01 13.050000 13.470000 12.910000 13.110000
... ... ... ... ... ... ...
88002 603989.SH 2015-06-30 44.950001 50.250000 41.520000 49.160000
88003 603993.SH 2015-06-30 10.930000 12.500000 10.540000 12.360000
88004 603997.SH 2015-06-30 21.400000 24.959999 20.549999 24.790001
88005 603998.SH 2015-06-30 65.110001 65.110001 65.110001 65.110001
amt volume
0 418404992 22927500
1 659624000 46246800
2 23085800 3853070
3 131162000 31942000
4 251946000 19093500
.... ....
88002 314528000 6933840
88003 532364992 46215300
88004 169784992 7503370
88005 0 0
[4188006 rows x 8 columns]
And my code is:
_a_stock_basic_data = pandas.concat(dfs)
_all_universe = set(all_universe.values.tolist())
for _code in _all_universe:
    _temp_data = _a_stock_basic_data[_a_stock_basic_data['Code']==_code]
    data[_code] = _temp_data[_temp_data.notnull()]
_all_universe contains the values of _a_stock_basic_data['Code']. The length of _all_universe is about 2,816, so the for loop runs about 2,816 times and takes a lot of time to complete.
So I just wonder how to group these data with a high-performance method. I think multiprocessing is an option, but shared memory is its problem. And as the data gets larger and larger, the performance of the code needs to be taken into consideration, otherwise it will cost a lot. Thank you for your help.
I'll show an example which I think will solve your problem. Below I make a dataframe with random elements, where the column Code will have duplicate values:
import numpy as np
import pandas as pd

a = pd.DataFrame({'a': np.arange(20), 'b': np.random.random(20), 'Code': np.random.randint(0, 11, 20)})
To group by the column Code, set it as index:
a.index = a['Code']
you can now use the index to access the data by the value of Code:
In : a.loc[8]
Out:
a b Code
Code
8 1 0.589938 8
8 3 0.030435 8
8 13 0.228775 8
8 14 0.329637 8
8 17 0.915402 8
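For the performance concern in the question, a single groupby pass can also build the dictionary directly, without a Python-level loop over codes (a minimal sketch, assuming _a_stock_basic_data is the dataframe shown above with a Code column):

# one pass over the data: each group is already the sub-dataframe for that code
data = {code: group for code, group in _a_stock_basic_data.groupby('Code')}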
Did you try the pd.concat function? With it you can append dataframes along an axis of your choice.
pd.concat([data,_temp_data],axis=1)
- dict(_a_stock_basic_data.groupby(['Code']).size())  ## number of occurrences per code
- dict(_a_stock_basic_data.groupby(['Code'])['Column_you_want_to_Aggregate'].sum())  ## if you want to do an aggregation on a certain column