Let's suppose we have the following DataFrame of returns:
import numpy as np
import pandas as pd
from pandas_datareader import data as web  # pandas.io.data has moved to the pandas-datareader package
data = web.DataReader(['AAPL','GOOG'],data_source='google')
returns = data['Close'].pct_change()
Now let's say I want to backtest an investment in the two assets, and let's also suppose that the cash flows are not invested at the same time:
positions = {}
positions['AAPL'] = {returns.index[10]: 20000.0}
positions['GOOG'] = {returns.index[20]: 80000.0}
wealth = pd.DataFrame.from_dict(positions).reindex(returns.index).fillna(0.0)
My question is: is there a pythonic way to let the 20k dollars of positive cashflow on Apple and the 80k dollars on Google grow, based on their respective daily returns?
At the moment I'm doing this by iterating over each position (column) and then over each row i:
wealth.iloc[i] = wealth.iloc[i - 1] * (1 + returns.iloc[i])
but I know that with Python and Pandas this kind of iteration can often be avoided.
Thanks for the time you'll spend on this.
link to iPython Notebook
Simone
First you need to change your positions to forward-fill, since you keep the investment:
pos = pd.DataFrame.from_dict(positions).reindex(returns.index).fillna(method="ffill")
Then you need a cumulative product (cumprod):
wealth = pos.shift() * (1+returns).cumprod(axis=0)
The shift is necessary since you do not get the return on the first day.
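For readers who can no longer reproduce the download (the 'google' data source has been retired), here is a minimal, self-contained sketch that exercises the same two lines on synthetic returns; the business-day range and the random normal returns are stand-ins, not the asker's data:
import numpy as np
import pandas as pd
# toy stand-in for the downloaded returns
idx = pd.bdate_range("2013-01-01", periods=30)
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(0.0, 0.01, size=(30, 2)),
                       index=idx, columns=["AAPL", "GOOG"])
positions = {
    "AAPL": {returns.index[10]: 20000.0},
    "GOOG": {returns.index[20]: 80000.0},
}
# forward-fill the positions, then apply the cumulative growth factor
pos = pd.DataFrame.from_dict(positions).reindex(returns.index).ffill()
wealth = pos.shift() * (1 + returns).cumprod(axis=0)
print(wealth.tail())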
Related
My goal is to simulate the past growth of a stock portfolio based on historical stock prices. I wrote some code that works (at least I think so). However, I am pretty sure that the basic structure of the code is not very clever and probably makes things more complicated than they actually are. Maybe someone can help me and tell me the best way to solve a problem like mine.
I started with a dataframe containing historical stock prices for a number (here: 2) of stocks:
import pandas as pd
import numpy as np
price_data = pd.DataFrame({'Stock_A': [5, 6, 10],
                           'Stock_B': [5, 7, 2]})
Then I defined a starting capital (here: 1000 €). Furthermore, I decide how much of my money I want to invest in Stock_A (here: 50%) and Stock_B (here: also 50%).
capital = 1000
weighting = {'Stock_A': 0.5, 'Stock_B': 0.5}
assets = list(weighting.keys())  # referenced by the functions below
Now I can calculate how many shares of Stock_A and Stock_B I can buy at the beginning:
quantities = {key: weighting[key] * capital / price_data[key].iloc[0] for key in weighting}
As time goes by, the weights of the portfolio components will of course change, as the prices of Stock A and B move in opposite directions. So at some point the portfolio will mainly consist of Stock A, while the proportion of Stock B (value-wise) gets pretty small. To correct for this, I want to restore the initial 50:50 weighting as soon as the portfolio weights deviate too much from the initial weighting (so-called rebalancing). I defined a function to decide whether rebalancing is needed or not:
def need_to_rebalance(row):
    rebalance = False
    for asset in assets:
        if not 0.4 < row[asset] * quantities[asset] / portfolio_value < 0.6:
            rebalance = True
            break
    return rebalance
If we perform a rebalancing, the following function returns the updated number of shares for Stock A and Stock B:
def rebalance(row):
    for asset in assets:
        quantities[asset] = weighting[asset] * portfolio_value / row[asset]
    return quantities
Finally, I defined a third function that I can use to loop over the dataframe containing the stock prices, in order to calculate the value of the portfolio based on the current number of shares we own. It looks like this:
def run_backtest(row):
    global portfolio_value, quantities
    portfolio_value = sum(np.array(row[assets]) * np.array(list(quantities.values())))
    if need_to_rebalance(row):
        quantities = rebalance(row)
    for asset in assets:
        historical_quantities[asset].append(quantities[asset])
    return portfolio_value
Then I put it all to work using .apply:
historical_quantities = {}
for asset in assets:
    historical_quantities[asset] = []
output = price_data.copy()
output['portfolio_value'] = price_data.apply(run_backtest, axis = 1)
output.join(pd.DataFrame(historical_quantities), rsuffix='_weight')
The result looks reasonable to me and it is basically what I wanted to achieve. However, I was wondering whether there is a more efficient way to solve the problem. Somehow, doing the calculation line by line and storing all the values in the variable historical_quantities just to add them to the dataframe at the end doesn't look very clever to me. Furthermore, I have to use a lot of global variables. Storing a lot of values from inside the functions as global variables makes the code pretty messy (in particular if the calculations concerning rebalancing get more complex, for example when including tax effects). Has anyone read this far and is maybe willing to help me?
All the best
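This doesn't address the vectorisation question, but one way to avoid the global variables is to keep all the state inside a single function and return a frame of results. Here is a sketch using the same price_data, capital, weighting and 0.4/0.6 band as in the question; the function and the *_qty column names are my own:
import pandas as pd

def backtest_portfolio(price_data, capital, weighting, lower=0.4, upper=0.6):
    """Loop over the price rows, keeping quantities and portfolio value as locals."""
    assets = list(weighting)
    # initial purchase at the first row's prices
    quantities = {a: weighting[a] * capital / price_data[a].iloc[0] for a in assets}
    records = []
    for _, row in price_data.iterrows():
        portfolio_value = sum(row[a] * quantities[a] for a in assets)
        # rebalance when any asset's value weight drifts outside the band
        if any(not (lower < row[a] * quantities[a] / portfolio_value < upper) for a in assets):
            quantities = {a: weighting[a] * portfolio_value / row[a] for a in assets}
        records.append({**{a + "_qty": quantities[a] for a in assets},
                        "portfolio_value": portfolio_value})
    return price_data.join(pd.DataFrame(records, index=price_data.index))

price_data = pd.DataFrame({'Stock_A': [5, 6, 10], 'Stock_B': [5, 7, 2]})
print(backtest_portfolio(price_data, capital=1000, weighting={'Stock_A': 0.5, 'Stock_B': 0.5}))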
I have two .csv files with data arranged date-wise. I want to compute the monthly accumulated value for each month and for all the years. Reading the csv files works without any error. However, when computing the monthly accumulated values, the code works correctly for one time series (in one csv file) but malfunctions for the other. The only difference I notice is that when I read the first csv file (with a 'Date' and 'Value' column and a total of 826 rows), the dataframe has 827 rows (the last row is NaN). This NaN row is not observed for the other csv file.
Please note that my time series runs from 06-06-20xx to 01-10-20xx each year, from 2008 to 2014. I am obtaining the monthly accumulated value for each month and then removing the zero values (for the months Jan-May and Nov-Dec). When my code runs, for the first csv I get monthly accumulated values starting from June 2008. But for the second, it starts from January 2008 (and has a non-zero value, which ideally should be zero).
Since I am new to Python coding, I cannot figure out where the issue is. Any help is appreciated. Thanks in advance.
Here is my code:
import pandas as pd
import numpy as np
# read the csv file
df_obs = pd.read_csv("..path/first_csv.csv")
df_fore = pd.read_csv("..path/second_csv.csv")
# convert 'Date' column to datetime index
df_obs['Date'] = pd.to_datetime(df_obs['Date'])
df_fore['Date'] = pd.to_datetime(df_fore['Date'])
# perform GroupBy operation over monthly frequency
monthly_accumulated_obs = df_obs.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
monthly_accumulated_fore = df_fore.set_index('Date').groupby(pd.Grouper(freq='M'))['Observed'].sum().reset_index()
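One thing worth ruling out (an assumption on my part, since the CSV files are not shown) is the date format: without an explicit format, pd.to_datetime can parse a day-first date such as 01-10-2008 as 10 January instead of 1 October, which would explain a January total appearing where you expect none. Passing the format explicitly removes the ambiguity; note that the answer below assumes month-first, so check which convention your files actually use:
# assuming the dates are written day-first, e.g. 01-10-2008 meaning 1 October 2008
df_obs['Date'] = pd.to_datetime(df_obs['Date'], format='%d-%m-%Y')
df_fore['Date'] = pd.to_datetime(df_fore['Date'], format='%d-%m-%Y')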
Sometimes more verbose but explicit solutions work better and are more flexible, so here's an alternative, using convtools:
from datetime import date, datetime
from convtools import conversion as c
from convtools.contrib.tables import Table
# generate an ad hoc grouper;
# it's a simple function to be reused further
converter = (
    c.group_by(c.item("Date"))
    .aggregate(
        {
            "Date": c.item("Date"),
            "Observed": c.ReduceFuncs.Sum(c.item("Observed")),
        }
    )
    .gen_converter()
)
# read the stream of prepared rows
rows = (
    Table.from_csv("..path/first_csv.csv", header=True)
    .update(
        Date=c.call_func(
            datetime.strptime, c.col("Date"), "%m-%d-%Y"
        ).call_method("replace", day=1),
        Observed=c.col("Observed").as_type(float),
    )
    .into_iter_rows(dict)
)
# process them
result = converter(rows)
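If you want to carry on in pandas afterwards (for instance to drop the zero months described in the question), the grouped result, which the dict template above should yield as a list of per-month dicts, converts directly:
import pandas as pd
# assuming `result` is an iterable of {"Date": ..., "Observed": ...} dicts
monthly_accumulated = pd.DataFrame(result)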
I have a big problem filtering my data. I've read a lot here on Stack Overflow and on other pages and tutorials, but I could not solve my specific problem...
The first part of my code, where I load my data into Python, looks as follows:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from arch import arch_model
spotmarket = pd.read_excel("./data/external/Spotmarket_dhp.xlsx", index=True)
r = spotmarket['Price'].pct_change().dropna()
returns = 100 * r
df = pd.DataFrame(returns)
The Excel table has 43,000 values in one column and contains the hourly prices. I use this data to calculate the percentage change from hour to hour, and the problem is that there are sometimes big changes of between 1,000 and 40,000%. The dataframe looks as follows:
df
Out[12]:
Price
1 20.608229
2 -2.046870
3 6.147789
4 16.519258
...
43827 -16.079874
43828 -0.438322
43829 -40.314465
43830 -100.105374
43831 700.000000
43832 -62.500000
43833 -40400.000000
43834 1.240695
43835 52.124183
43836 12.996778
43837 -17.157795
43838 -30.349971
43839 6.177924
43840 45.073701
43841 76.470588
43842 2.363636
43843 -2.161042
43844 -6.444781
43845 -14.877102
43846 6.762918
43847 -38.790036
[43847 rows x 1 columns]
I want to exclude these outliers. I've tried different ways, like calculating the mean and the std and excluding all values that are more than three times the std away from the mean. It works for a small part of the data, but for the complete data the mean and std are both NaN. Does anyone have an idea how I can filter my dataframe?
I think you need to filter by percentiles, using quantile (an IQR filter):
r = spotmarket['Price'].pct_change() * 100
Q1 = r.quantile(.25)
Q3 = r.quantile(.75)
q1 = Q1-1.5*(Q3-Q1)
q3 = Q3+1.5*(Q3-Q1)
df = spotmarket[r.between(q1, q3)]
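For comparison, the three-sigma filter the asker originally tried can be written the same way; the pd.to_numeric step is only a guess at why mean and std came out as NaN on the full file (non-numeric entries in the Price column), not something verified against the data:
# coerce non-numeric entries to NaN, then keep values within mean +/- 3 standard deviations
r = pd.to_numeric(spotmarket['Price'], errors='coerce').pct_change() * 100
m, s = r.mean(), r.std()  # NaNs are skipped by default
filtered = spotmarket[r.between(m - 3 * s, m + 3 * s)]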
Maybe you should first discard all the values that are causing those fluctuations and then create the dataframe. One way is to use filter().
Though Pandas has time series functionality, I am still struggling with dataframes that have incomplete time series data.
See the pictures below: the lower picture has complete data, the upper has gaps. Both pictures show correct values. In red are the columns that I want to calculate using the data in black. The column Cumm_Issd shows the accumulated issued shares during the year; MV is market value.
I want to calculate the issued shares per quarter (IssdQtr), the quarterly change in Market Value (D_MV_Q) and the MV of last year (L_MV_Y).
For the underlying csv data, see this link for the full data and this link for the gapped data. There are two firms, 1020180 and 1020201.
However, when I try Pandas' shift method it fails when there are gaps; try it yourself using the csv files and the code below. All computed columns (DiffEq, Dif1MV, Lag4MV) differ, for some quarters, from IssdQtr, D_MV_Q and L_MV_Y, respectively.
Are there ways to deal with gaps in data using Pandas?
import pandas as pd
import numpy as np
import os
dfg = pd.read_csv('example_soverflow_gaps.csv',low_memory=False)
dfg['date'] = pd.to_datetime(dfg['Period'], format='%Y%m%d')
dfg['Q'] = pd.DatetimeIndex(dfg['date']).to_period('Q')
dfg['year'] = dfg['date'].dt.year
dfg['DiffEq'] = dfg.sort_values(['Q']).groupby(['Firm','year'])['Cumm_Issd'].diff()
dfg['Dif1MV'] = dfg.groupby(['Firm'])['MV'].diff(1)
dfg['Lag4MV'] = dfg.groupby(['Firm'])['MV'].shift(4)
Gapped data:
Full data:
Solved the basic problem by using a merge. First, create a variable that shows the lagged date or quarter. Here we want last year's MV (4 quarters back):
from pandas.tseries.offsets import QuarterEnd
dfg['lagQ'] = dfg['date'] + QuarterEnd(-4)
Then create a data-frame with the keys (Firm and date) and the relevant variable (here MV).
lagset=dfg[['Firm','date', 'MV']].copy()
lagset.rename(columns={'MV':'Lag_MV', 'date':'lagQ'}, inplace=True)
Lastly, merge the new frame into the existing one:
dfg=pd.merge(dfg, lagset, on=['Firm', 'lagQ'], how='left')
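The same pattern should also work for the one-quarter comparisons; here is a sketch with column names of my own choosing:
# one-quarter lag of MV via the same merge trick, then the quarterly change
dfg['lag1Q'] = dfg['date'] + QuarterEnd(-1)
prev = dfg[['Firm', 'date', 'MV']].rename(columns={'date': 'lag1Q', 'MV': 'MV_prev_q'})
dfg = pd.merge(dfg, prev, on=['Firm', 'lag1Q'], how='left')
dfg['D_MV_Q'] = dfg['MV'] - dfg['MV_prev_q']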
I'm working on replacing an Excel financial model into Python Pandas. By financial model I mean forecasting a cash flow, profit & loss statement and balance sheet over time for a business venture as opposed to pricing swaps / options or working with stock price data that are also referred to as financial models. It's quite possible that the same concepts and issues apply to the latter types I just don't know them that well so can't comment.
So far I like a lot of what I see. The models I work with in Excel have a common time series across the top of the page, defining the time period we're interested in forecasting. Calculations then run down the page as a series of rows. Each row is therefore a TimeSeries object, or a collection of rows becomes a DataFrame. Obviously you need to transpose to read between these two constructs but this is a trivial transformation.
Better yet each Excel row should have a common, single formula and only be based on rows above on the page. This lends itself to vector operations that are computationally fast and simple to write using Pandas.
The issue I get is when I try to model a corkscrew-type calculation. These are often used to model accounting balances, where the opening balance for one period is the closing balance of the prior period. You can't use a .shift() operation as the closing balance in a given period depends, amongst other things, on the opening balance in the same period. This is probably best illustrated with an example:
Time             2013-04-01  2013-05-01  2013-06-01  2013-07-01  ...
Opening Balance          0          +3          -2         -10
[...]
Some Operations         +3          -5          -8         +20
[...]
Closing Balance         +3          -2         -10         +10
In pseudo-code, my solution for calculating these sorts of things is as follows. It is not a vectorised solution and it looks like it is pretty slow:
# Set up date range
dates = pd.date_range('2012-04-01',periods=500,freq='MS')
# Initialise empty lists
lOB = []
lSomeOp1 = []
lSomeOp2 = []
lCB = []
# Set the closing balance for the initial loop's OB
sCB = 0
# As this is a corkscrew calculation will need to loop through all dates
for d in dates:
    # Create a datetime object as we will reference it several times below
    dt = d.to_pydatetime()
    # Opening balance is either the initial opening balance if at the
    # initial date, or else the last closing balance from the prior
    # period
    sOB = inp['ob'] if (dt == obDate) else sCB
    # Calculate some additions, write-offs, amortisation, depreciation, whatever!
    sSomeOp1 = 10
    sSomeOp2 = -sOB / 2
    # Calculate the closing balance
    sCB = sOB + sSomeOp1 + sSomeOp2
    # Build up lists of outputs
    lOB.append(sOB)
    lSomeOp1.append(sSomeOp1)
    lSomeOp2.append(sSomeOp2)
    lCB.append(sCB)
# Convert lists to timeseries objects
ob = pd.Series(lOB, index=dates)
someOp1 = pd.Series(lSomeOp1, index=dates)
someOp2 = pd.Series(lSomeOp2, index=dates)
cb = pd.Series(lCB, index=dates)
I can see that where you only have one or two lines of operations there might be some clever hacks to vectorise the computation; I'd be grateful to hear any tips people have on these sorts of tricks.
Some of the corkscrews I have to build, however, have hundreds of intermediate operations. In these cases, what's my best way forward? Is it to accept the slow performance of Python? Should I migrate to Cython? I've not really looked into it (so I could be way off base), but the issue with the latter approach is that if I'm moving hundreds of lines into C, why am I bothering with Python in the first place? It doesn't feel like a simple lift and shift.
The following makes in-place updates, which should improve performance:
import pandas as pd
import numpy as np
book = pd.DataFrame([[0, 3, np.nan], [np.nan, -5, np.nan],
                     [np.nan, -8, np.nan], [np.nan, +20, np.nan]],
                    columns=['ob', 'so', 'cb'],
                    index=['2013-04-01', '2013-05-01', '2013-06-01', '2013-07-01'])
for i, row in enumerate(book.index[:-1]):
    # closing balance = opening balance + operations for the period
    book.loc[row, 'cb'] = book.loc[row, ['ob', 'so']].sum()
    # carry the closing balance forward as the next period's opening balance
    book.iloc[i + 1, book.columns.get_loc('ob')] = book.loc[row, 'cb']
book.loc[book.index[-1], 'cb'] = book.loc[book.index[-1], ['ob', 'so']].sum()
book
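For the figures in the question this reproduces the corkscrew exactly: closing balances of +3, -2, -10 and +10, with each period's opening balance equal to the previous period's close.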