I've got the following code that takes historical prices and pre-calculated forecasts for a single asset, and computes how you would have fared if you had actually invested your money according to the forecast. In financial parlance, it's a back-test.
The main problem is that it's very slow, and I'm not sure what the right strategy is for improving it. I need to run this thousands of times, so an order-of-magnitude speedup is required.
Where should I begin looking?
class accountCurve():
    def __init__(self, forecasts, prices):
        self.curve = pd.DataFrame(columns=['Capital','Holding','Cash','Trade', 'Position'], dtype=float)
        forecasts.dropna(inplace=True)
        self.curve['Forecast'] = forecasts
        self.curve['Price'] = prices
        self.curve.loc[self.curve.index[0],['Capital', 'Holding', 'Cash', 'Trade', 'Position']] = [10000, 0, 10000, 0, 0]
        for date, forecast in forecasts.iteritems():
            x = self.curve.loc[date]
            previous = self.curve.shift(1).loc[date]
            if previous.isnull()['Cash'] == False:
                x['Cash'] = previous['Cash'] - previous['Trade'] * x['Price']
                x['Position'] = previous['Position'] + previous['Trade']
                x['Holding'] = x['Position'] * x['Price']
                x['Capital'] = x['Cash'] + x['Holding']
            x['Trade'] = np.fix(x['Capital']/x['Price'] * x['Forecast']/20) - x['Position']
Edit:
Datasets as requested:
Prices:
import quandl
corn = quandl.get('CHRIS/CME_C2')
prices = corn['Open']
Forecasts:
def ewmac(d):
    columns = pd.Series([2, 4, 8, 16, 32, 64])
    g = lambda x: d.ewm(span=x, min_periods=x*4).mean() - d.ewm(span=x*4, min_periods=x*4).mean()
    f = columns.apply(g).transpose()
    f = f*10/f.abs().mean()
    f.columns = columns
    return f.clip(-20, 20)

forecasts = ewmac(prices)
I would suggest using a numpy array instead of a data frame inside the for loop. It usually gives a significant speed boost.
So the code may look like:
class accountCurve():
    def __init__(self, forecasts, prices):
        self.curve = pd.DataFrame(columns=['Capital','Holding','Cash','Trade', 'Position'], dtype=float)
        # forecasts.dropna(inplace=True)
        self.curve['Forecast'] = forecasts.dropna()
        self.curve['Price'] = prices
        # helper np.array:
        # column order is 0=Capital, 1=Holding, 2=Cash, 3=Trade, 4=Position, 5=Forecast, 6=Price
        self.arr = np.array(self.curve)
        self.arr[0, :5] = [10000, 0, 10000, 0, 0]
        for i in range(1, self.arr.shape[0]):
            this = self.arr[i]
            prev = self.arr[i-1]
            # the lines below mirror the loop body from the question
            cash = prev[2] - prev[3] * this[6]
            position = prev[4] + prev[3]
            holding = position * this[6]
            capital = cash + holding
            trade = np.fix(capital / this[6] * this[5] / 20) - position
            this[:5] = [capital, holding, cash, trade, position]
        # back to data frame:
        self.curve[['Capital','Holding','Cash','Trade', 'Position']] = self.arr[:, :5]
        # or maybe this would be faster:
        # self.curve[:] = self.arr
I don't quite understand the significance of the line if previous.isnull()['Cash']==False:. It looks as if previous['Cash'] was never null, except maybe for the first row - but you set the first row earlier.
Also, you may consider executing forecasts.dropna(inplace=True) outside of the class. If it's originally a data frame, you'll run it once instead of repeating it for every column; see the sketch below. (Do I understand correctly that you input single columns of forecasts into the class?)
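If so, a minimal sketch of what I mean (assuming forecasts is the DataFrame returned by ewmac and accountCurve is your class):
forecasts = ewmac(prices).dropna()   # drop NaN rows once, up front (note: this drops rows where any column is NaN)
curves = {col: accountCurve(forecasts[col], prices) for col in forecasts.columns}   # one back-test per forecast column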
The next step I'd recommend is using a line profiler to see where your code spends most of its time, and trying to optimize those bottlenecks. If you use IPython, you can try running %prun or %lprun. For example,
%lprun -f accountCurve.__init__ A = accountCurve(...)
will produce stats for every line in your __init__.
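A side note (my assumption, not stated above): %lprun is provided by the line_profiler package, which has to be installed and loaded into IPython first:
%load_ext line_profiler   # run once per session before using %lprun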
Related
I created a function which takes a date as an argument and writes the produced output into a CSV. If I run a multiprocessing Pool with e.g. 28 tasks and I have a list of 100 dates, then the last 72 rows in the output CSV file are twice as long as they should be (just a joined repetition of the last 72 rows).
My code:
import numpy as np
import pandas as pd
import multiprocessing

# Load the data
df = pd.read_csv('data.csv', low_memory=False)
list_s = df.date.unique()

def funk(date):
    ...
    # for each date in df.date.unique() do stuff which gives sample dataframe
    # as an output
    return sample

# list_s is a list of dates I want to calculate function funk for
def mp_handler():
    # 28 is a number of processes I want to run
    p = multiprocessing.Pool(28)
    for result in p.imap(funk, list_s[0:100]):
        result.to_csv('crsp_full.csv', mode='a')

if __name__ == '__main__':
    mp_handler()
And the output looks like this:
date,port_ret_1,port_ret_2
2010-03-05,0.0,0.002
date,port_ret_1,port_ret_2
2010-02-12,-0.001727,0.009139189315
...
# and after first 28 rows like this
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-03,0.002045,0.00045092025,0.002045,0.00045092025
date,port_ret_1,port_ret_2,port_ret_1,port_ret_2
2010-03-15,-0.006055,-0.00188451972,-0.006055,-0.00188451972
I tried to insert a lock() into funk(), but it yielded the same results and just took more time. Any ideas how to fix it?
Edit. funk looks like this. e is equivalent to date.
def funk(e):
    block = pd.DataFrame()
    i = s_list.index(e)
    if i > 19:
        ran = s_list[i-19:i+6]
        ran0 = s_list[i-19:i+1]
        # print ran0
        piv = df.pivot(index='date', columns='permno', values='date')
        # Drop the stocks which do not have returns for the given time window and make the list of suitable stocks
        s = list(piv.loc[ran].dropna(axis=1).columns)
        sample = df[df['permno'].isin(s)]
        sample = sample.loc[ran]
        permno = ['10001', '93422']
        sample = sample[sample['permno'].isin(permno)]
        # print sample.index.unique()
        # get past 20 days returns in additional 20 columns
        for i in range(0, 20):
            sample['r_{}'.format(i)] = sample.groupby('permno')['ret'].shift(i)
        # merge dataset with betas
        sample = pd.merge(sample, betas_aug, left_index=True, right_index=True)
        sample['ex_ret'] = 0
        # calculate expected return
        for i in range(0, 20):
            sample['ex_ret'] += sample['ma_beta_{}'.format(i)]*sample['r_{}'.format(i)]
        # print(sample)
        # define a stock into two legs based on expected return
        sample['sign'] = sample['ex_ret'].apply(lambda x: -1 if x < 0 else 1)
        # workaround for short leg, multiply returns by -1
        sample['abs_ex_ret'] = sample['ex_ret']*sample['sign']
        # create 5 columns for future realised 5 days returns (multiplied by -1 for short leg)
        for i in range(1, 6):
            sample['rp_{}'.format(i)] = sample.groupby(['permno'])['ret'].shift(-i)
            sample['rp_{}'.format(i)] = sample['rp_{}'.format(i)]*sample['sign']
        sample = sample.reset_index(drop=True)
        sample['w_0'] = sample['abs_ex_ret'].div(sample.groupby(['date'])['abs_ex_ret'].transform('sum'))
        for i in range(1, 5):
            sample['w_{}'.format(i)] = sample['w_{}'.format(i-1)]*(1+sample['rp_{}'.format(i)])
        sample = sample.dropna(how='any')
        for k in range(0, 20):
            sample.drop(columns=['ma_beta_{}'.format(k), 'r_{}'.format(k)])
        for k in range(1, 6):
            sample['port_ret_{}'.format(k)] = sample['w_{}'.format(k-1)]*sample['rp_{}'.format(k)]
            q = ['port_ret_{}'.format(k)]
            list_names.extend(q)
        block = sample.groupby('date')[list_names].sum().copy()
    return block
I have some data that I'm trying to apply multiprocessing.Pool to, since I have a machine with 16 processors available.
Here I generate some pseudo data:
from datetime import datetime, timedelta
import numpy as np
import pandas as pd

y = pd.Series(np.random.randint(400, high=600, size=1250))
date_today = datetime.now()
x = pd.date_range(date_today, date_today + timedelta(1250), freq='D')
data = pd.DataFrame(columns=['Date','Price'])
data['Date'] = x
data['Price'] = y
d = {name: group for name, group in data.groupby(np.arange(len(data)) // (len(data)))}
What I want exactly is to apply a Pool over the parameters inside the for loop, using one processor per constant:
parameters = range(300, 550, 50)
portfolio = pd.DataFrame(columns=['Parameter','Date','Price','Calculation'])

for key, value in sorted(d.items()):
    for constante in parameters:
        print('Constante:', constante)
        # HERE I WANT TO USE MP.POOL()
In the real code I'm using a sort of sliding window to perform calculations on; this is the simplest version of it. So I want to assign a process per constant in parameters while writing the results to a DataFrame. How does one achieve this?
You'll want to use multiprocessing.pool.map a bit like this, though you'll probably have to adjust for your needs...
from functools import partial
from multiprocessing import Pool

def pool_map_fn(i, value=None, constante=None):
    # note: `i` must be the first positional parameter, since pool.map supplies it
    # while `value` and `constante` are bound via functools.partial
    s = {'val': value[i:i+constante]}
    window = pd.concat([s['val']['Date'], s['val']['Price']], axis=1)
    window['Price'] = pd.to_numeric(window['Price'], errors='coerce').fillna(0)
    calc = window['Price'].mean()
    date_variable = window['Date'].iloc[-1]
    price_var = window['Price'].iloc[-1]
    if price_var < calc:
        print('Parameter', constante, 'Lower than average', date_variable, price_var, calc)
        portfolio = portfolio.append({'Parameter': constante,
                                      'Date': date_variable,
                                      'Price': price_var,
                                      'Calculation': calc}, ignore_index=True)
    if price_var > calc:
        print('Parameter', constante, 'Higher than average', date_variable, price_var, calc)

parameters = range(300, 550, 50)
portfolio = pd.DataFrame(columns=['Parameter','Date','Price','Calculation'])

for key, value in sorted(d.items()):
    for constante in parameters:
        with Pool() as pool:
            results = pool.map(partial(pool_map_fn, value=value, constante=constante),
                               range(len(value) - constante + 1))
Note: This is untested but should work, if you get errors try to resolve them as the concept should be sound.
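One possible adjustment (my own assumption, not something the answer claims): portfolio inside pool_map_fn is a per-process copy, so rows appended there never make it back to the parent process. A sketch of an alternative is to have the worker return a plain dict and build the DataFrame from the mapped results:
def pool_map_fn(i, value=None, constante=None):
    window = value.iloc[i:i + constante]
    price = pd.to_numeric(window['Price'], errors='coerce').fillna(0)
    calc = price.mean()
    price_var = price.iloc[-1]
    if price_var < calc:
        # only windows that close below their average produce a row
        return {'Parameter': constante, 'Date': window['Date'].iloc[-1],
                'Price': price_var, 'Calculation': calc}
    return None

with Pool() as pool:
    rows = pool.map(partial(pool_map_fn, value=value, constante=constante),
                    range(len(value) - constante + 1))
portfolio = pd.DataFrame([r for r in rows if r is not None])
The with Pool() block would still sit inside the loops over d and parameters, exactly as in the code above.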
Hello, I am trying to take a CSV file and iterate over each customer's data. To explain: each customer has data for 12 months. I want to analyze their yearly data, save the correlations of that data to a new list, and loop until all customers have been analyzed.
For instance, here is what a customer's data might look like (simplified case):
I have been able to get this to work to generate correlations in a CSV of one customer's data. However, there are thousands of customers in my datasheet. I want to use a nested for loop to get all of the correlation values for each customer into a list/array. The list would have a row of a specific customer's correlations, and the next row would be the next customer's.
Here is my current code:
import numpy
from numpy import genfromtxt

overalldata = genfromtxt('C:\Users\User V\Desktop\CUSTDATA.csv', delimiter=',')
emptylist = []
overalldatasubtract = overalldata[13::]

# This is where I try to use the for loop to go through all the customers.
# I don't know if len will give me all the rows or the number of columns.
for x in range(0, len(overalldata), 11):
    for x in range(0, 13, 1):
        cust_months = overalldata[0:x,1]
        cust_balancenormal = overalldata[0:x,16]
        cust_demo_one = overalldata[0:x,2]
        cust_demo_two = overalldata[0:x,3]
        num_acct_A = overalldata[0:x,4]
        num_acct_B = overalldata[0:x,5]
        # Correlation Calculations
        demo_one_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_one)[1,0]
        demo_two_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_two)[1,0]
        demo_one_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_one)[1,0]
        demo_one_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_one)[1,0]
        demo_two_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_two)[1,0]
        demo_two_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_two)[1,0]
        result_correlation = [demo_one_corr_balance, demo_two_corr_balance, demo_one_corr_acct_a, demo_one_corr_acct_b, demo_two_corr_acct_a, demo_two_corr_acct_b]
        result_correlation_combined = emptylist.append(result_correlation)
    # This is where I try to delete the rows I have already analyzed.
    overalldata = overalldata[11**x::]

print result_correlation_combined
print overalldatasubtract
It seemed that my subtraction method was working, but when I tried it with my larger data set, I realized my method is totally wrong.
Would you do this a different way? I think that it can work, but I cannot find my mistake.
You use the same variable x for both loops. In the inner loop, x goes from 0 to 12 regardless of the customer, and since you compute the line number from x alone, you're stuck on the first customer.
Your double loop should look more like this:
# loop over the customers
for x_customer in range(0, len(overalldata), 12):
    # loop over the months
    for x_month in range(0, 12, 1):
        # absolute line number in overalldata
        x = x_customer + x_month
        ...
I changed the bounds and steps of the loops because:
loop 1: there are 12 months so 12 lines per customer -> step = 12
loop 2: there are 12 months, so month number ranges from 0 to 11 -> range(0,12,1)
This is how I solved the problem: it was a problem with the placement of my for loops, a simple indentation issue. Thank you to the above poster for the help.
for x_customer in range(0, len(overalldata), 12):
    for x in range(0, 13, 1):
        cust_months = overalldata[0:x,1]
        cust_balancenormal = overalldata[0:x,16]
        cust_demo_one = overalldata[0:x,2]
        cust_demo_two = overalldata[0:x,3]
        num_acct_A = overalldata[0:x,4]
        num_acct_B = overalldata[0:x,5]
        # Correlation Calculations
        demo_one_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_one)[1,0]
        demo_two_corr_balance = numpy.corrcoef(cust_balancenormal, cust_demo_two)[1,0]
        demo_one_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_one)[1,0]
        demo_one_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_one)[1,0]
        demo_two_corr_acct_a = numpy.corrcoef(num_acct_A, cust_demo_two)[1,0]
        demo_two_corr_acct_b = numpy.corrcoef(num_acct_B, cust_demo_two)[1,0]
        result_correlation = [(demo_one_corr_balance),(demo_two_corr_balance),(demo_one_corr_acct_a),(demo_one_corr_acct_b),(demo_two_corr_acct_a),(demo_two_corr_acct_b)]
        numpy.savetxt('correlationoutput.csv', (result_correlation))
        result_correlation_combined = emptylist.append([result_correlation])
    cust_delete_list = [0, (x_customer), 1]
    overalldata = numpy.delete(overalldata, (cust_delete_list), axis=0)
I would like to use QuantLib within python mainly to price interest rate instruments (derivatives down the track) within a portfolio context. The main requirement would be to pass daily yield curves to the system to price on successive days (let's ignore system performance issues for now). My question is, have I structured the example below correctly to do this? My understanding is that I would need at least one curve object per day with the necessary linking etc. I have made use of pandas to attempt this. Guidance on this would be appreciated.
import QuantLib as ql
import math
import pandas as pd
import datetime as dt

# MARKET PARAMETRES
calendar = ql.SouthAfrica()
bussiness_convention = ql.Unadjusted
day_count = ql.Actual365Fixed()
interpolation = ql.Linear()
compounding = ql.Compounded
compoundingFrequency = ql.Quarterly

def perdelta(start, end, delta):
    date_list = []
    curr = start
    while curr < end:
        date_list.append(curr)
        curr += delta
    return date_list

def to_datetime(d):
    return dt.datetime(d.year(), d.month(), d.dayOfMonth())

def format_rate(r):
    return '{0:.4f}'.format(r.rate()*100.00)

# QuantLib must have dates in its date objects
dicPeriod = {'DAY':ql.Days,'WEEK':ql.Weeks,'MONTH':ql.Months,'YEAR':ql.Years}
issueDate = ql.Date(19,8,2014)
maturityDate = ql.Date(19,8,2016)

# Bond Schedule
schedule = ql.Schedule(issueDate, maturityDate,
                       ql.Period(ql.Quarterly), ql.TARGET(), ql.Following, ql.Following,
                       ql.DateGeneration.Forward, False)

fixing_days = 0
face_amount = 100.0

def price_floater(myqlvalDate, jindex, jibarTermStructure, discount_curve):
    bond = ql.FloatingRateBond(settlementDays=0,
                               faceAmount=100,
                               schedule=schedule,
                               index=jindex,
                               paymentDayCounter=ql.Actual365Fixed(),
                               spreads=[0.02])
    bondengine = ql.DiscountingBondEngine(ql.YieldTermStructureHandle(discount_curve))
    bond.setPricingEngine(bondengine)
    ql.Settings.instance().evaluationDate = myqlvalDate
    return [bond.NPV(), bond.cleanPrice()]

start_date = dt.datetime(2014,8,19)
end_date = dt.datetime(2015,8,19)
all_dates = perdelta(start_date, end_date, dt.timedelta(days=1))

dtes = []
fixings = []
for d in all_dates:
    if calendar.isBusinessDay(ql.QuantLib.Date(d.day,d.month,d.year)):
        dtes.append(ql.QuantLib.Date(d.day,d.month,d.year))
        fixings.append(0.1)

df_ad = pd.DataFrame(all_dates, columns=['valDate'])
df_ad['qlvalDate'] = df_ad.valDate.map(lambda x: ql.DateParser.parseISO(x.strftime('%Y-%m-%d')))
df_ad['jibarTermStructure'] = df_ad.qlvalDate.map(lambda x: ql.RelinkableYieldTermStructureHandle())
df_ad['discountStructure'] = df_ad.qlvalDate.map(lambda x: ql.RelinkableYieldTermStructureHandle())
df_ad['jindex'] = df_ad.jibarTermStructure.map(lambda x: ql.Jibar(ql.Period(3,ql.Months), x))
df_ad.jindex.map(lambda x: x.addFixings(dtes, fixings))
df_ad['flatCurve'] = df_ad.apply(lambda r: ql.FlatForward(r['qlvalDate'], 0.1, ql.Actual365Fixed(), compounding, compoundingFrequency), axis=1)
df_ad.apply(lambda x: x['jibarTermStructure'].linkTo(x['flatCurve']), axis=1)
df_ad.apply(lambda x: x['discountStructure'].linkTo(x['flatCurve']), axis=1)
df_ad['discount_curve'] = df_ad.apply(lambda x: ql.ZeroSpreadedTermStructure(x['discountStructure'], ql.QuoteHandle(ql.SimpleQuote(math.log(1+0.02)))), axis=1)
df_ad['all_in_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['jindex'], r['jibarTermStructure'], r['discount_curve'])[0], axis=1)
df_ad['clean_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['jindex'], r['jibarTermStructure'], r['discount_curve'])[1], axis=1)

df_plt = df_ad[['valDate','all_in_price','clean_price']]
df_plt = df_plt.set_index('valDate')

from matplotlib import ticker

def func(x, pos):
    s = str(x)
    ind = s.index('.')
    return s[:ind] + '.' + s[ind+1:]

ax = df_plt.plot()
ax.yaxis.set_major_formatter(ticker.FuncFormatter(func))
Thanks to Luigi Ballabio, I have reworked the example above to incorporate QuantLib's design principles and avoid unnecessary calls.
Now the static data is truly static and only the market data varies (I hope).
I now understand better how the live objects listen for changes in linked variables.
Static data is the following:
bondengine
bond
structurehandles
historical jibar index
Market data will be the only varying component
daily swap curve
market spread over swap curve
The reworked example is below:
import QuantLib as ql
import math
import pandas as pd
import datetime as dt
import numpy as np

# MARKET PARAMETRES
calendar = ql.SouthAfrica()
bussiness_convention = ql.Unadjusted
day_count = ql.Actual365Fixed()
interpolation = ql.Linear()
compounding = ql.Compounded
compoundingFrequency = ql.Quarterly

def perdelta(start, end, delta):
    date_list = []
    curr = start
    while curr < end:
        date_list.append(curr)
        curr += delta
    return date_list

def to_datetime(d):
    return dt.datetime(d.year(), d.month(), d.dayOfMonth())

def format_rate(r):
    return '{0:.4f}'.format(r.rate()*100.00)

# QuantLib must have dates in its date objects
dicPeriod = {'DAY':ql.Days,'WEEK':ql.Weeks,'MONTH':ql.Months,'YEAR':ql.Years}
issueDate = ql.Date(19,8,2014)
maturityDate = ql.Date(19,8,2016)

# Bond Schedule
schedule = ql.Schedule(issueDate, maturityDate,
                       ql.Period(ql.Quarterly), ql.TARGET(), ql.Following, ql.Following,
                       ql.DateGeneration.Forward, False)

fixing_days = 0
face_amount = 100.0

start_date = dt.datetime(2014,8,19)
end_date = dt.datetime(2015,8,19)
all_dates = perdelta(start_date, end_date, dt.timedelta(days=1))

dtes = []
fixings = []
for d in all_dates:
    if calendar.isBusinessDay(ql.QuantLib.Date(d.day,d.month,d.year)):
        dtes.append(ql.QuantLib.Date(d.day,d.month,d.year))
        fixings.append(0.1)

jibarTermStructure = ql.RelinkableYieldTermStructureHandle()
jindex = ql.Jibar(ql.Period(3,ql.Months), jibarTermStructure)
jindex.addFixings(dtes, fixings)
discountStructure = ql.RelinkableYieldTermStructureHandle()

bond = ql.FloatingRateBond(settlementDays=0,
                           faceAmount=100,
                           schedule=schedule,
                           index=jindex,
                           paymentDayCounter=ql.Actual365Fixed(),
                           spreads=[0.02])
bondengine = ql.DiscountingBondEngine(discountStructure)
bond.setPricingEngine(bondengine)

spread = ql.SimpleQuote(0.0)
discount_curve = ql.ZeroSpreadedTermStructure(jibarTermStructure, ql.QuoteHandle(spread))
discountStructure.linkTo(discount_curve)

# ...here is the pricing function...
# pricing:
def price_floater(myqlvalDate, jibar_curve, credit_spread):
    credit_spread = math.log(1.0 + credit_spread)
    ql.Settings.instance().evaluationDate = myqlvalDate
    jibarTermStructure.linkTo(jibar_curve)
    spread.setValue(credit_spread)
    ql.Settings.instance().evaluationDate = myqlvalDate
    return pd.Series({'NPV': bond.NPV(), 'cleanPrice': bond.cleanPrice()})

# ...and here are the remaining varying parts:
df_ad = pd.DataFrame(all_dates, columns=['valDate'])
df_ad['qlvalDate'] = df_ad.valDate.map(lambda x: ql.DateParser.parseISO(x.strftime('%Y-%m-%d')))
df_ad['jibar_curve'] = df_ad.apply(lambda r: ql.FlatForward(r['qlvalDate'], 0.1, ql.Actual365Fixed(), compounding, compoundingFrequency), axis=1)
df_ad['spread'] = np.random.uniform(0.015, 0.025, size=len(df_ad))  # market spread
df_ad['all_in_price'], df_ad["clean_price"] = zip(*df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['jibar_curve'], r['spread']), axis=1).to_records())[1:]

# plot result
df_plt = df_ad[['valDate','all_in_price','clean_price']]
df_plt = df_plt.set_index('valDate')

from matplotlib import ticker

def func(x, pos):  # formatter function takes tick label and tick position
    s = str(x)
    ind = s.index('.')
    return s[:ind] + '.' + s[ind+1:]  # change dot to comma

ax = df_plt.plot()
ax.yaxis.set_major_formatter(ticker.FuncFormatter(func))
Your solution would work, but creating a bond per day kind of goes against the grain of the library. You can create the bond and the JIBAR index just once, and just change the evaluation date and the corresponding curves; the bond will detect the changes and recalculate.
In the general case, this would be something like:
# here are the parts that stay the same...
jibarTermStructure = ql.RelinkableYieldTermStructureHandle()
jindex = ql.Jibar(ql.Period(3,ql.Months), jibarTermStructure)
jindex.addFixings(dtes, fixings)
discountStructure = ql.RelinkableYieldTermStructureHandle()
bond = ql.FloatingRateBond(settlementDays=0,
                           faceAmount=100,
                           schedule=schedule,
                           index=jindex,
                           paymentDayCounter=ql.Actual365Fixed(),
                           spreads=[0.02])
bondengine = ql.DiscountingBondEngine(discountStructure)
bond.setPricingEngine(bondengine)

# ...here is the pricing function...
def price_floater(myqlvalDate, jibar_curve, discount_curve):
    ql.Settings.instance().evaluationDate = myqlvalDate
    jibarTermStructure.linkTo(jibar_curve)
    discountStructure.linkTo(discount_curve)
    return [bond.NPV(), bond.cleanPrice()]

# ...and here are the remaining varying parts:
df_ad = pd.DataFrame(all_dates, columns=['valDate'])
df_ad['qlvalDate'] = df_ad.valDate.map(lambda x: ql.DateParser.parseISO(x.strftime('%Y-%m-%d')))
df_ad['flatCurve'] = df_ad.apply(lambda r: ql.FlatForward(r['qlvalDate'], 0.1, ql.Actual365Fixed(), compounding, compoundingFrequency), axis=1)
df_ad['discount_curve'] = df_ad.apply(lambda x: ql.ZeroSpreadedTermStructure(jibarTermStructure, ql.QuoteHandle(ql.SimpleQuote(math.log(1+0.02)))), axis=1)
df_ad['all_in_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['flatCurve'], r['discount_curve'])[0], axis=1)
df_ad['clean_price'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['flatCurve'], r['discount_curve'])[1], axis=1)
df_plt = df_ad[['valDate','all_in_price','clean_price']]
df_plt = df_plt.set_index('valDate')
Now, even in the most general case, the above can be optimized: you're calling price_floater twice per date, so you're doing twice the work. I'm not familiar with pandas, but I'd guess you can make a single call and set df_ad['all_in_price'] and df_ad['clean_price'] with a single assignment.
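For what it's worth, here is one way that single call could look in pandas (my own sketch, not part of the answer; it assumes price_floater still returns a two-element list):
prices = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['flatCurve'], r['discount_curve']), axis=1)
df_ad[['all_in_price', 'clean_price']] = pd.DataFrame(prices.tolist(), index=df_ad.index)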
Moreover, there might be ways to simplify the code even further depending on your use cases. The discount curve might be instantiated once and the spread changed during pricing:
# in the "only once" part:
spread = ql.SimpleQuote()
discount_curve = ql.ZeroSpreadedTermStructure(jibarTermStructure, ql.QuoteHandle(spread))
discountStructure.linkTo(discount_curve)

# pricing:
def price_floater(myqlvalDate, jibar_curve, credit_spread):
    ql.Settings.instance().evaluationDate = myqlvalDate
    jibarTermStructure.linkTo(jibar_curve)
    spread.setValue(credit_spread)
    return [bond.NPV(), bond.cleanPrice()]
and in the varying part, you'll just have an array of credit spreads instead of an array of discount curves.
If the curves are all flat, you can do the same by taking advantage of another feature: if you initialize a curve with a number of days and a calendar instead of a date, its reference date will move with the evaluation date (if the number of days is 0, it will be the evaluation date; if it's 1, it will be the next business day, and so on).
# only once:
risk_free = ql.SimpleQuote()
jibar_curve = ql.FlatForward(0, calendar, ql.QuoteHandle(risk_free), ql.Actual365Fixed(), compounding, compoundingFrequency)
jibarTermStructure.linkTo(jibar_curve)

# pricing:
def price_floater(myqlvalDate, risk_free_rate, credit_spread):
    ql.Settings.instance().evaluationDate = myqlvalDate
    risk_free.setValue(risk_free_rate)   # a SimpleQuote is updated with setValue
    spread.setValue(credit_spread)
    return [bond.NPV(), bond.cleanPrice()]
and in the varying part, you'll replace the array of jibar curves with a simple array of rates.
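Purely as a hypothetical illustration of that varying part (my own sketch; the flat 0.1 rate and 0.02 spread are placeholder values):
df_ad['risk_free_rate'] = 0.1   # in practice, one market rate per valuation date
df_ad['credit_spread'] = 0.02   # in practice, one market spread per valuation date
df_ad['prices'] = df_ad.apply(lambda r: price_floater(r['qlvalDate'], r['risk_free_rate'], r['credit_spread']), axis=1)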
The above should give you the same result as your code, but it will instantiate far fewer objects and thus probably save memory and increase performance.
One final warning: neither my code nor yours will work if pandas' map evaluates the results in parallel; you'd end up trying to set the global evaluation date to several values simultaneously, and that wouldn't go well.
I need to optimize this loop, which takes about 2.5 seconds; I call it more than 3000 times in my script.
The aim of this code is to create two matrices which are used later in a linear system.
Does anyone have any ideas, in Python or Cython?
## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=pd.datetime(2010,1,1),
                                      end=pd.datetime(2020,1,1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))
date_indicatrice = [(pd.datetime(2010,1,1), pd.datetime(2010,4,1)),
                    (pd.datetime(2012,5,1), pd.datetime(2019,4,1)),
                    (pd.datetime(2013,4,1), pd.datetime(2019,4,1)),
                    (pd.datetime(2014,3,1), pd.datetime(2019,4,1)),
                    (pd.datetime(2015,1,1), pd.datetime(2015,4,1)),
                    (pd.datetime(2013,6,1), pd.datetime(2018,4,1))]

timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    result = df[(mat.index >= d1) & (mat.index <= d2)]
    result2 = df[(mat.index >= d1) & (mat.index <= d2) & (mat.index.hour >= 8)]
    mat.loc[result.index, j] = 1.
    mat_bp.loc[result2.index, j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]
print time.time()-timer
Here you go. I tested the following and I get the same resultant matrices in mat and mat_bp as in your original code, but in 0.07 seconds vs. 1.4 seconds for the original code on my machine.
The real slowdown was due to using result.index and result2.index. Looking up rows by datetime is much slower than looking them up by integer position. I used binary searches where possible to find the right indices.
import pandas as pd
import numpy as np
import time
import bisect

## df is only here for illustration and date_indicatrice changes upon function call
df = pd.DataFrame(0, columns=range(6),
                  index=pd.date_range(start=pd.datetime(2010,1,1),
                                      end=pd.datetime(2020,1,1), freq="H"))
mat = pd.DataFrame(0, index=df.index, columns=range(6))
mat_bp = pd.DataFrame(0, index=df.index, columns=range(6*2))
date_indicatrice = [(pd.datetime(2010,1,1), pd.datetime(2010,4,1)),
                    (pd.datetime(2012,5,1), pd.datetime(2019,4,1)),
                    (pd.datetime(2013,4,1), pd.datetime(2019,4,1)),
                    (pd.datetime(2014,3,1), pd.datetime(2019,4,1)),
                    (pd.datetime(2015,1,1), pd.datetime(2015,4,1)),
                    (pd.datetime(2013,6,1), pd.datetime(2018,4,1))]

timer = time.time()
for j, (d1, d2) in enumerate(date_indicatrice):
    ind_start = bisect.bisect_left(mat.index, d1)
    ind_end = bisect.bisect_right(mat.index, d2)
    inds = np.array(xrange(ind_start, ind_end))
    valid_inds = inds[mat.index[ind_start:ind_end].hour >= 8]
    mat.loc[ind_start:ind_end, j] = 1.
    mat_bp.loc[valid_inds, j*2] = 1.
    mat_bp[j*2+1] = (1 - mat_bp[j*2]) * mat[j]
print time.time()-timer
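As a final side note (my own addition, not part of the answer above): the same binary search can also be expressed with the index's searchsorted method, which avoids importing bisect:
ind_start = mat.index.searchsorted(d1)               # first position with index >= d1
ind_end = mat.index.searchsorted(d2, side='right')   # first position with index > d2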