How to create a dictionary to hold dataframes in Python with loops

I am saving a large amount of data from some Monte Carlo simulations. I simulate 20 things over a period of 10 time steps, using a varying number of random draws. So, for a given number of random draws, I have a folder with 10 .csv files (one for each time step), each of which has 20 columns of data and n rows per column, where n is the number of random draws in that simulation. Currently my basic code for loading data in looks something like this:
import pandas as pd
import numpy as np

load_path = r'...\path\to\data'
numScenarios = [100, 500, 1000, 2500, 5000, 10000, 20000]
yearsSimulated = np.arange(1, 11)

for n in numScenarios:
    folder_path = load_path + '\draws = ' + str(n)
    for year in yearsSimulated:
        filename = '\year ' + str(year) + '.csv'
        path = folder_path + filename
        df = pd.read_csv(path)
        # save df.describe() somewhere
I want to efficiently save df.describe() somehow so that I can compare how the number of random draws is affecting results for the 20 things for a given time step. That is, I would ultimately like some object that I can access easily that will store all the df.describe() outputs for each individual time step. I'm not sure of a nice way to do this though. Some previous questions seem to suggest that dictionaries may be the way to go here but I've not been able to get them going.
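For concreteness, the kind of thing I have in mind is a nested dictionary keyed first by the number of draws and then by the year, roughly like this (an untested sketch along the lines of the loading code above, using os.path.join instead of manual backslash concatenation):

import os
import pandas as pd

summaries = {}  # summaries[n][year] -> the df.describe() frame for that run
for n in numScenarios:
    folder_path = os.path.join(load_path, 'draws = ' + str(n))
    summaries[n] = {}
    for year in yearsSimulated:
        path = os.path.join(folder_path, 'year ' + str(year) + '.csv')
        df = pd.read_csv(path)
        summaries[n][year] = df.describe()

# e.g. compare the same time step across draw counts:
# summaries[100][3], summaries[500][3], summaries[10000][3], ...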

Edit:
My final approach is to use an answer to a question here with a bunch of loops. So now I have:
class ngram(dict):
    """Based on perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return super(ngram, self).__getitem__(item)
        except KeyError:
            value = self[item] = type(self)()
            return value

results = ngram()
for i, year in enumerate(years):
    year_str = str(year)
    ann_stats = pd.DataFrame()
    for j, n in enumerate(numScenarios):
        n_str = str(n)
        folder_path = load_path + '\draws = ' + str(n)
        filename = '\scenarios ' + str(year) + '.csv'
        path = folder_path + filename
        df = pd.read_csv(path)
        ann_stats['mean'] = df.mean()
        ann_stats['std. dev'] = df.std()
        ann_stats['1%'] = df.quantile(0.01)
        ann_stats['25%'] = df.quantile(0.25)
        ann_stats['50%'] = df.quantile(0.5)
        ann_stats['75%'] = df.quantile(0.75)
        ann_stats['99%'] = df.quantile(0.99)
        results[year_str][n_str] = ann_stats.T
And so now the summary data for each time step and number of draws is accessed as a dataframe with
test = results[year_str][n_str]
where the columns of test hold results for each of my 20 things.
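If a single object turns out to be more convenient for comparing across draw counts, the stored frames can also be stacked into one DataFrame with a MultiIndex (a sketch, assuming results has been filled as above):

frames = []
keys = []
for year_str, by_draws in results.items():
    for n_str, frame in by_draws.items():
        keys.append((year_str, n_str))
        frames.append(frame)

all_stats = pd.concat(frames, keys=keys, names=['year', 'draws', 'stat'])
# e.g. all summary statistics for year 5 with 1000 draws:
# all_stats.loc[('5', '1000')]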

Related

Line Plot based on a Pandas DataFrame

I'm trying to learn how to analyze data in Python, so I'm using a dataset that I've already done some work on with PowerBI, and now I'm trying to reproduce the same plots in Python.
The Pandas dataframe is this...
And I'm trying to build a line plot like this one...
This line represents the number of 'Água e sabonete' and 'Fricção com álcool' entries in the column Ação divided by the total count of Ação.
This is how I managed to do it in PowerBI using DAX:
Adesão = VAR nReal = (COUNTROWS(FILTER(Tabela1,Tabela1[Ação]="Água e sabonete")) + COUNTROWS(FILTER(Tabela1,Tabela1[Ação]="Fricção com álcool")))
//VAR acao = COUNTA(Tabela1[Ação]
RETURN
DIVIDE(nReal,COUNTA(Tabela1[Ação]))
I want to know if it is possible to do something similar to build the plot, or if there is another way to build it in Python.
I haven't tried anything specific yet; I think it should be possible to build it with a function, but that is still too difficult for me to create right now since I'm a beginner.
Any help would be greatly appreciated!
The idea here is to get access to each month and count every time Água e sabonete and Fricção com álcool appear.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

def dateprod(year):  # could change this with an array of each month instead
    year = str(year) + "-"
    dates = []
    for i in range(1, 13):
        if i >= 10:
            date = year + str(i)
        else:
            date = year + "0" + str(i)
        dates.append(date)
    return dates

def y_construct(year, search_1, search_2):
    y_vals = np.zeros(0)
    for j in dateprod(year):
        score = 0
        temp_list = df.loc[str(j)]["aloc"].values  # gets all values of each month
        for k in temp_list:
            if k == str(search_1) or k == str(search_2):  # check whether the value is Água e sabonete or Fricção com álcool
                score += 1
        y_vals = np.append(y_vals, score)
    return y_vals / df.size  # divide by the total, assuming the number of Ação rows equals df.size

# this does the plotting
y_vals = y_construct(2022, "Água e sabonete", "Fricção com álcool")
x_label = ["jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec"]
plt.ylabel("%")
plt.plot(x_label, y_vals)
plt.show()
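A shorter, vectorized alternative is to let pandas do the monthly counting (a sketch; it assumes df has a DatetimeIndex and that the action column is named 'Ação' as in the question, so adjust the names to your data):

import pandas as pd
import matplotlib.pyplot as plt

target = ["Água e sabonete", "Fricção com álcool"]
# boolean flag per row, then the monthly mean gives the fraction of matching rows
monthly = df["Ação"].isin(target).resample("M").mean()

monthly.plot()
plt.ylabel("%")
plt.show()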

How to Run script for multiple text files with a common string?

In my research work I have multiple text files that share a common string "Max", where Max takes values in the range 0.10 to 2.00 with a step of 0.10, as follows:
A_100Hz_Rate20Hz_5tot_0.10Max_1_
A_100Hz_Rate20Hz_5tot_0.10Max_2_
A_100Hz_Rate20Hz_5tot_0.10Max_3_
.
.
.
A_100Hz_Rate20Hz_5tot_2.00Max_1_
A_100Hz_Rate20Hz_5tot_2.00Max_2_
A_100Hz_Rate20Hz_5tot_2.00Max_3_
I need to import all files according to their Max value (e.g. 0.10Max) and average the files that share the same Max value separately, to get:
Ave_A_100Hz_Rate20Hz_5tot_0.10Max_3_
.
.
.
Ave_A_100Hz_Rate20Hz_5tot_2.00Max_3_
I've tried using the glob module manually, and it works well for one value of "Max", but it doesn't work for the full range. This is my code:
import numpy as np
import glob
import pandas as pd

h = np.linspace(0.10, 2.00, 20)
for x in h:
    x1 = ("%.2f" % x)
    glob_path = 'input/*_{}Max_*.txt'.format(x1)
    import_files = glob.glob(glob_path)
    print(x, import_files)
    for index, file_name in enumerate(import_files):
        merged_data = pd.read_csv(file_name, header=None, delimiter="\t").values
        if index == 0:
            summation = merged_data
        else:
            summation = summation + merged_data
    averaging = summation / len(import_files)
    np.savetxt('output/Ave_' + file_name[10:], averaging, delimiter="\t")
I need to write a general script, but for now, to keep it simple, I ran it with files for just two values, x = 1.50 and x = 2.00. I tried print(import_files) and expected the output to be:
['input\\A_100Hz_Rate20Hz_5tot_1.50Max_1_.txt',
'input\\A_100Hz_Rate20Hz_5tot_1.50Max_2_.txt',
'input\\A_100Hz_Rate20Hz_5tot_1.50Max_3_.txt']
['input\\A_100Hz_Rate20Hz_5tot_2.00Max_1_.txt',
'input\\A_100Hz_Rate20Hz_5tot_2.00Max_2_.txt',
'input\\A_100Hz_Rate20Hz_5tot_2.00Max_3_.txt']
But the actual output is (in short):
0.1 []
0.2 []
1.5 ['input\\A_100Hz_Rate20Hz_5tot_1.50Max_1_.txt',
'input\\A_100Hz_Rate20Hz_5tot_1.50Max_2_.txt',
'input\\A_100Hz_Rate20Hz_5tot_1.50Max_3_.txt']
1.6 []
1.7 []
2.0 ['input\\A_100Hz_Rate20Hz_5tot_2.00Max_1_.txt',
'input\\A_100Hz_Rate20Hz_5tot_2.00Max_2_.txt',
'input\\A_100Hz_Rate20Hz_5tot_2.00Max_3_.txt']
and it caused an error in the kernel
np.savetxt('output/Ave_'+file_name[10:], averaging, delimiter="\t" )
NameError: name 'file_name' is not defined
Please, any suggestions?
I think that you just have to test whether import_files is empty:
for x in h:
    x1 = ("%.2f" % x)
    glob_path = 'input/*_{}Max_*.txt'.format(x1)
    import_files = glob.glob(glob_path)
    print(x, import_files)
    if len(import_files) != 0:
        for index, file_name in enumerate(import_files):
            merged_data = pd.read_csv(file_name, header=None, delimiter="\t").values
            if index == 0:
                summation = merged_data
            else:
                summation = summation + merged_data
        averaging = summation / len(import_files)
        np.savetxt('output/Ave_' + file_name[10:], averaging, delimiter="\t")
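An alternative that avoids matching on formatted floats altogether is to glob once and group the files by the Max value parsed out of each filename (a sketch, assuming the value always appears in the name as e.g. 0.10Max and that the output naming below is acceptable):

import re
import glob
from collections import defaultdict
import numpy as np
import pandas as pd

groups = defaultdict(list)
for file_name in glob.glob('input/*Max_*.txt'):
    max_value = re.search(r'(\d+\.\d+)Max', file_name).group(1)  # e.g. '0.10'
    groups[max_value].append(file_name)

for max_value, files in groups.items():
    arrays = [pd.read_csv(f, header=None, delimiter="\t").values for f in files]
    averaging = sum(arrays) / len(arrays)
    np.savetxt('output/Ave_{}Max.txt'.format(max_value), averaging, delimiter="\t")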

Problem with multiprocessing in Python writing into csv

I created a function which takes a date as an argument and writes the output it produces to a csv file. If I run a multiprocessing Pool with e.g. 28 processes over a list of 100 dates, then the last 72 rows in the output csv file are twice as long as they should be (each one is just the same row repeated twice, joined side by side).
My code:
import numpy as np
import pandas as pd
import multiprocessing

# load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()

def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    return sample

# list_s is a list of dates I want to calculate function funk for
def mp_handler():
    # 28 is the number of processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')

if __name__ == '__main__':
    mp_handler()
And the output looks like this:
date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315
...
# and after first 28 rows like this
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972
I tried to insert a lock() into funk(), but it yielded the same results and just took longer to run. Any ideas how to fix it?
Edit. funk looks like this. e is equivalent to date.
def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print ran0
        piv = df.pivot(index='date', columns='permno', values='date')
        # drop the stocks which do not have returns for the given time window
        # and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print sample.index.unique()
        # get past 20 days returns in additional 20 columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        # merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0
        # calculate expected return
        for i in range(0, 20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)]*sample['r_{}'.format(i)]
        # print(sample)
        # define a stock into two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x < 0 else 1)
        # workaround for short leg, multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret']*sample['sign']
        # create 5 columns for future realised 5 days returns (multiplied by -1 for short leg)
        for i in range(1, 6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)]*sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)]*(1+sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
        for k in range(0, 20):
            sample.drop(columns=['ma_beta_{}'.format(k), 'r_{}'.format(k)])
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)]*sample['rp_{}'.format(k)]
            q = ['port_ret_{}'.format(k)]
            list_names.extend(q)
        block = sample.groupby('date')[list_names].sum().copy()
    return block
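One thing visible in the output regardless of the doubled columns is that the header line is rewritten on every append. A minimal sketch of a writer loop that emits the header only once (same structure as mp_handler above):

def mp_handler():
    p = multiprocessing.Pool(28)
    with open('crsp_full.csv', 'w') as f:
        for i, result in enumerate(p.imap(funk, list_s[0:100])):
            # write the column names only with the first chunk
            result.to_csv(f, header=(i == 0))

This only removes the repeated header lines; the doubled port_ret columns in the last rows most likely come from inside funk itself, for example if list_names is a module-level list that keeps growing across the several dates each worker process handles, so that groupby('date')[list_names] selects the same columns more than once.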

How to make a Python loop faster to run a pairwise association test

I have a list of patient ids and drug names, and a list of patient ids and disease names. I want to find the most indicative drug for each disease.
To find this I want to run a Fisher exact test to get the p-value for each disease/drug pair. But the loop runs very slowly, taking more than 10 hours. Is there a way to make the loop more efficient, or a better way to solve this association problem?
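For reference, scipy's fisher_exact takes a 2x2 contingency table and returns the odds ratio and the p-value, e.g.:

from scipy.stats import fisher_exact

# rows: drug prescribed yes/no, columns: disease yes/no (illustrative counts)
table = [[8, 2],
         [1, 5]]
oddsratio, p_value = fisher_exact(table)
print(p_value)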
My loop:
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]

for i in disease_list:
    print i
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)
You need multiprocessing/multithreading; I have added the code:
from multiprocessing.dummy import Pool as ThreadPool

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique())
disease_list = list(meps_base_data.columns.values)[8:]

def run_pairwise(i):
    print i
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)

pool = ThreadPool(3)
pairwise_test_results = pool.map(run_pairwise, disease_list)
pool.close()
pool.join()
Notes: http://chriskiehl.com/article/parallelism-in-one-line/
Crunching faster is good, but a better algorithm usually beats it ;-)
Filling in a bit,
import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

# data files can be downloaded at
# https://github.com/Saynah/platform/tree/d7e9f150ef2ff436387585960ca312a301847a46/data
meps_meds = pd.read_csv("meps_meds.csv")            # 8 cols * 1,148,347 rows
meps_base_data = pd.read_csv("meps_base_data.csv")  # 18 cols * 61,489 rows

# merge to get disease and drug info in the same table
merged = pd.merge(                                   # 25 cols * 1,148,347 rows
    meps_base_data, meps_meds,
    how='inner', left_on='id', right_on='id'
)
rx_list = meps_meds.rxName.unique().tolist()               # 9218 items
disease_list = meps_base_data.columns.values[8:].tolist()  # 10 items
Note that rx_list has a LOT of duplicates (e.g. 52 entries for Amoxicillin, if you include misspellings).
Then
most_indicative = {}

for disease in disease_list:
    # get unique (id, disease, prescription)
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    # keep only Yes/No entries under disease
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    # summarize (replaces the inner loop)
    tab = pd.crosstab(subset.rxName, subset[disease])
    # bind "No" values for the Fisher exact function
    nf, yf = tab.sum().values
    def p_value(x, nf=nf, yf=yf):
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]
    # OPTIONAL:
    # We can probably assume that the most-indicative drugs are among
    # the most-prescribed; get just the 100 most-prescribed drugs
    # Note: you have to get the nf, yf values before doing this!
    tab = tab.sort_values("Yes", ascending=False)[:100]
    # and apply the function
    tab["P-value"] = tab.apply(p_value, axis=1)
    # find the best match
    best_med = tab.sort_values("P-value").index[0]
    most_indicative[disease] = best_med
This now runs in about 18 minutes on my machine, and you could probably combine it with EM28's answer to speed it up by another factor of 4 or more.
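If you do want to combine this with the threaded approach from the other answer, a rough sketch (assuming merged and disease_list are built as above) is to factor the loop body into a function and map it over the diseases:

from multiprocessing.dummy import Pool as ThreadPool
from scipy.stats import fisher_exact
import pandas as pd

def best_drug_for(disease):
    # same logic as the loop body above, for a single disease
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    tab = pd.crosstab(subset.rxName, subset[disease])
    nf, yf = tab.sum().values
    def p_value(x, nf=nf, yf=yf):
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]
    tab = tab.sort_values("Yes", ascending=False)[:100]
    tab["P-value"] = tab.apply(p_value, axis=1)
    return disease, tab.sort_values("P-value").index[0]

pool = ThreadPool(4)
most_indicative = dict(pool.map(best_drug_for, disease_list))
pool.close()
pool.join()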

Daily Pricing of a Bond with QuantLib using Python

I would like to use QuantLib within python mainly to price interest rate instruments (derivatives down the track) within a portfolio context. The main requirement would be to pass daily yield curves to the system to price on successive days (let's ignore system performance issues for now). My question is, have I structured the example below correctly to do this? My understanding is that I would need at least one curve object per day with the necessary linking etc. I have made use of pandas to attempt this. Guidance on this would be appreciated.
import QuantLib as ql
import math
import pandas as pd
import datetime as dt

# MARKET PARAMETERS
calendar = ql.SouthAfrica()
bussiness_convention = ql.Unadjusted
day_count = ql.Actual365Fixed()
interpolation = ql.Linear()
compounding = ql.Compounded
compoundingFrequency = ql.Quarterly

def perdelta(start, end, delta):
    date_list = []
    curr = start
    while curr < end:
        date_list.append(curr)
        curr += delta
    return date_list

def to_datetime(d):
    return dt.datetime(d.year(), d.month(), d.dayOfMonth())

def format_rate(r):
    return '{0:.4f}'.format(r.rate()*100.00)

# QuantLib must have dates in its date objects
dicPeriod = {'DAY': ql.Days, 'WEEK': ql.Weeks, 'MONTH': ql.Months, 'YEAR': ql.Years}
issueDate = ql.Date(19,8,2014)
maturityDate = ql.Date(19,8,2016)

# bond schedule
schedule = ql.Schedule(issueDate, maturityDate,
                       ql.Period(ql.Quarterly), ql.TARGET(), ql.Following, ql.Following,
                       ql.DateGeneration.Forward, False)
fixing_days = 0
face_amount = 100.0

def price_floater(myqlvalDate, jindex, jibarTermStructure, discount_curve):
    bond = ql.FloatingRateBond(settlementDays=0,
                               faceAmount=100,
                               schedule=schedule,
                               index=jindex,
                               paymentDayCounter=ql.Actual365Fixed(),
                               spreads=[0.02])
    bondengine = ql.DiscountingBondEngine(ql.YieldTermStructureHandle(discount_curve))
    bond.setPricingEngine(bondengine)
    ql.Settings.instance().evaluationDate = myqlvalDate
    return [bond.NPV(), bond.cleanPrice()]

start_date = dt.datetime(2014,8,19)
end_date = dt.datetime(2015,8,19)
all_dates = perdelta(start_date, end_date, dt.timedelta(days=1))

dtes = []
fixings = []
for d in all_dates:
    if calendar.isBusinessDay(ql.QuantLib.Date(d.day, d.month, d.year)):
        dtes.append(ql.QuantLib.Date(d.day, d.month, d.year))
        fixings.append(0.1)

df_ad = pd.DataFrame(all_dates, columns=['valDate'])
df_ad['qlvalDate'] = df_ad.valDate.map(lambda x: ql.DateParser.parseISO(x.strftime('%Y-%m-%d')))
df_ad['jibarTermStructure'] = df_ad.qlvalDate.map(lambda x: ql.RelinkableYieldTermStructureHandle())
df_ad['discountStructure'] = df_ad.qlvalDate.map(lambda x: ql.RelinkableYieldTermStructureHandle())
df_ad['jindex'] = df_ad.jibarTermStructure.map(lambda x: ql.Jibar(ql.Period(3, ql.Months), x))
df_ad.jindex.map(lambda x: x.addFixings(dtes, fixings))
df_ad['flatCurve'] = df_ad.apply(lambda r: ql.FlatForward(r['qlvalDate'], 0.1, ql.Actual365Fixed(), compounding, compoundingFrequency), axis=1)
df_ad.apply(lambda x: x['jibarTermStructure'].linkTo(x['flatCurve']), axis=1)
df_ad.apply(lambda x: x['discountStructure'].linkTo(x['flatCurve']), axis=1)
df_ad['discount_curve'] = df_ad.apply(lambda x: ql.ZeroSpreadedTermStructure(x['discountStructure'], ql.QuoteHandle(ql.SimpleQuote(math.log(1+0.02)))), axis=1)
df_ad['all_in_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['jindex'], r['jibarTermStructure'], r['discount_curve'])[0], axis=1)
df_ad['clean_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['jindex'], r['jibarTermStructure'], r['discount_curve'])[1], axis=1)

df_plt = df_ad[['valDate', 'all_in_price', 'clean_price']]
df_plt = df_plt.set_index('valDate')

from matplotlib import ticker

def func(x, pos):
    s = str(x)
    ind = s.index('.')
    return s[:ind] + '.' + s[ind+1:]

ax = df_plt.plot()
ax.yaxis.set_major_formatter(ticker.FuncFormatter(func))
Thanks to Luigi Ballabio I have reworked the example above to incorporate the design principles within QuantLib so as to avoid unnecessary calling.
Now the static data is truly static and only the market data varies (I hope).
I now understand better how the live objects listen for changes in linked variables.
Static data is the following:
bondengine
bond
structurehandles
historical jibar index
Market data will be the only varying component
daily swap curve
market spread over swap curve
The reworked example is below:
import QuantLib as ql
import math
import pandas as pd
import datetime as dt
import numpy as np

# MARKET PARAMETERS
calendar = ql.SouthAfrica()
bussiness_convention = ql.Unadjusted
day_count = ql.Actual365Fixed()
interpolation = ql.Linear()
compounding = ql.Compounded
compoundingFrequency = ql.Quarterly

def perdelta(start, end, delta):
    date_list = []
    curr = start
    while curr < end:
        date_list.append(curr)
        curr += delta
    return date_list

def to_datetime(d):
    return dt.datetime(d.year(), d.month(), d.dayOfMonth())

def format_rate(r):
    return '{0:.4f}'.format(r.rate()*100.00)

# QuantLib must have dates in its date objects
dicPeriod = {'DAY': ql.Days, 'WEEK': ql.Weeks, 'MONTH': ql.Months, 'YEAR': ql.Years}
issueDate = ql.Date(19,8,2014)
maturityDate = ql.Date(19,8,2016)

# bond schedule
schedule = ql.Schedule(issueDate, maturityDate,
                       ql.Period(ql.Quarterly), ql.TARGET(), ql.Following, ql.Following,
                       ql.DateGeneration.Forward, False)
fixing_days = 0
face_amount = 100.0

start_date = dt.datetime(2014,8,19)
end_date = dt.datetime(2015,8,19)
all_dates = perdelta(start_date, end_date, dt.timedelta(days=1))

dtes = []
fixings = []
for d in all_dates:
    if calendar.isBusinessDay(ql.QuantLib.Date(d.day, d.month, d.year)):
        dtes.append(ql.QuantLib.Date(d.day, d.month, d.year))
        fixings.append(0.1)

jibarTermStructure = ql.RelinkableYieldTermStructureHandle()
jindex = ql.Jibar(ql.Period(3, ql.Months), jibarTermStructure)
jindex.addFixings(dtes, fixings)
discountStructure = ql.RelinkableYieldTermStructureHandle()

bond = ql.FloatingRateBond(settlementDays=0,
                           faceAmount=100,
                           schedule=schedule,
                           index=jindex,
                           paymentDayCounter=ql.Actual365Fixed(),
                           spreads=[0.02])
bondengine = ql.DiscountingBondEngine(discountStructure)
bond.setPricingEngine(bondengine)

spread = ql.SimpleQuote(0.0)
discount_curve = ql.ZeroSpreadedTermStructure(jibarTermStructure, ql.QuoteHandle(spread))
discountStructure.linkTo(discount_curve)

# ...here is the pricing function...
# pricing:
def price_floater(myqlvalDate, jibar_curve, credit_spread):
    credit_spread = math.log(1.0 + credit_spread)
    ql.Settings.instance().evaluationDate = myqlvalDate
    jibarTermStructure.linkTo(jibar_curve)
    spread.setValue(credit_spread)
    ql.Settings.instance().evaluationDate = myqlvalDate
    return pd.Series({'NPV': bond.NPV(), 'cleanPrice': bond.cleanPrice()})

# ...and here are the remaining varying parts:
df_ad = pd.DataFrame(all_dates, columns=['valDate'])
df_ad['qlvalDate'] = df_ad.valDate.map(lambda x: ql.DateParser.parseISO(x.strftime('%Y-%m-%d')))
df_ad['jibar_curve'] = df_ad.apply(lambda r: ql.FlatForward(r['qlvalDate'], 0.1, ql.Actual365Fixed(), compounding, compoundingFrequency), axis=1)
df_ad['spread'] = np.random.uniform(0.015, 0.025, size=len(df_ad))  # market spread
df_ad['all_in_price'], df_ad["clean_price"] = zip(*df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['jibar_curve'], r['spread']), axis=1).to_records())[1:]

# plot result
df_plt = df_ad[['valDate', 'all_in_price', 'clean_price']]
df_plt = df_plt.set_index('valDate')

from matplotlib import ticker

def func(x, pos):  # formatter function takes tick label and tick position
    s = str(x)
    ind = s.index('.')
    return s[:ind] + '.' + s[ind+1:]  # change dot to comma

ax = df_plt.plot()
ax.yaxis.set_major_formatter(ticker.FuncFormatter(func))
Your solution would work, but creating a bond per day kind of goes against the grain of the library. You can create the bond and the JIBAR index just once, and just change the evaluation date and the corresponding curves; the bond will detect the changes and recalculate.
In the general case, this would be something like:
# here are the parts that stay the same...
jibarTermStructure = ql.RelinkableYieldTermStructureHandle()
jindex = ql.Jibar(ql.Period(3, ql.Months), jibarTermStructure)
jindex.addFixings(dtes, fixings)
discountStructure = ql.RelinkableYieldTermStructureHandle()
bond = ql.FloatingRateBond(settlementDays=0,
                           faceAmount=100,
                           schedule=schedule,
                           index=jindex,
                           paymentDayCounter=ql.Actual365Fixed(),
                           spreads=[0.02])
bondengine = ql.DiscountingBondEngine(discountStructure)
bond.setPricingEngine(bondengine)

# ...here is the pricing function...
def price_floater(myqlvalDate, jibar_curve, discount_curve):
    ql.Settings.instance().evaluationDate = myqlvalDate
    jibarTermStructure.linkTo(jibar_curve)
    discountStructure.linkTo(discount_curve)
    return [bond.NPV(), bond.cleanPrice()]

# ...and here are the remaining varying parts:
df_ad = pd.DataFrame(all_dates, columns=['valDate'])
df_ad['qlvalDate'] = df_ad.valDate.map(lambda x: ql.DateParser.parseISO(x.strftime('%Y-%m-%d')))
df_ad['flatCurve'] = df_ad.apply(lambda r: ql.FlatForward(r['qlvalDate'], 0.1, ql.Actual365Fixed(), compounding, compoundingFrequency), axis=1)
df_ad['discount_curve'] = df_ad.apply(lambda x: ql.ZeroSpreadedTermStructure(jibarTermStructure, ql.QuoteHandle(ql.SimpleQuote(math.log(1+0.02)))), axis=1)
df_ad['all_in_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['flatCurve'], r['discount_curve'])[0], axis=1)
df_ad['clean_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['flatCurve'], r['discount_curve'])[1], axis=1)
df_plt = df_ad[['valDate', 'all_in_price', 'clean_price']]
df_plt = df_plt.set_index('valDate')
Now, even in the most general case, the above can be optimized: you're calling price_floater twice per date, so you're doing twice the work. I'm not familiar with pandas, but I'd guess you can make a single call and set df_ad['all_in_price'] and df_ad['clean_price'] with a single assignment.
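For what it's worth, one way to do that in pandas (a sketch, assuming price_floater returns the [NPV, cleanPrice] pair as above) is to have the per-row call return a Series and assign both columns in one go:

prices = df_ad.apply(
    lambda r: pd.Series(
        price_floater(r['qlvalDate'], r['flatCurve'], r['discount_curve']),
        index=['all_in_price', 'clean_price']),
    axis=1)
df_ad[['all_in_price', 'clean_price']] = prices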
Moreover, there might be ways to simplify the code even further depending on your use cases. The discount curve might be instantiated once and the spread changed during pricing:
# in the "only once" part:
spread = ql.SimpleQuote()
discount_curve = ql.ZeroSpreadedTermStructure(jibarTermStructure,ql.QuoteHandle(spread))
discountStructure.linkTo(discount_curve)
# pricing:
def price_floater(myqlvalDate, jibar_curve, credit_spread):
    ql.Settings.instance().evaluationDate = myqlvalDate
    jibarTermStructure.linkTo(jibar_curve)
    spread.setValue(credit_spread)
    return [bond.NPV(), bond.cleanPrice()]
and in the varying part, you'll just have an array of credit spreads instead of an array of discount curves.
If the curves are all flat, you can do the same by taking advantage of another feature: if you initialize a curve with a number of days and a calendar instead of a date, its reference date will move with the evaluation date (if the number of days is 0, it will be the evaluation date; if it's 1, it will be the next business day, and so on).
# only once:
risk_free = ql.SimpleQuote()
jibar_curve = ql.FlatForward(0,calendar,ql.QuoteHandle(risk_free),ql.Actual365Fixed(),compounding,compoundingFrequency)
jibarTermStructure.linkTo(jibar_curve)
# pricing:
def price_floater(myqlvalDate, risk_free_rate, credit_spread):
    ql.Settings.instance().evaluationDate = myqlvalDate
    risk_free.setValue(risk_free_rate)
    spread.setValue(credit_spread)
    return [bond.NPV(), bond.cleanPrice()]
and in the varying part, you'll replace the array of jibar curves with a simple array of rates.
The above should give you the same result as your code, but will instantiate a lot less objects and thus probably save memory and increase performance.
One final warning: neither my code nor yours will work if pandas' map evaluates the results in parallel; you'd end up trying to set up the global evaluation date to several values simultaneously, and that wouldn't go well.
