Pandas capitalization of compound interest - python

I am writing an emulation of a bank deposit account in pandas.
I got stuck on compound interest (the result of reinvesting interest, so that interest in the next period is earned on the principal sum plus previously accumulated interest).
So far I have the following code:
import pandas as pd
from pandas.tseries.offsets import MonthEnd
from datetime import datetime
# Create a date range
start = '21/11/2017'
now = datetime.now()
date_rng = pd.date_range(start=start, end=now, freq='d')
# Create an example data frame with the timestamp data
df = pd.DataFrame(date_rng, columns=['Date'])
# Add column (EndOfMonth) - shows the last day of the current month
df['LastDayOfMonth'] = pd.to_datetime(df['Date']) + MonthEnd(0)
# Add columns for Debit, Credit, Total, Description
df['Debit'] = 0
df['Credit'] = 0
df['Total'] = 0
df['Description'] = ''
# Flag the last day of each month ("IsItLastDay"); the comparison is
# vectorized, so no loop is needed
df['IsItLastDay'] = (df['LastDayOfMonth'] == df['Date'])
# Add the transaction of the first deposit
df.loc[df.Date == '2017-11-21', ['Debit', 'Description']] = 10000, "First deposit"
# Calculate the principal sum (the sum of all deposits minus all withdrawals, plus all compounded interest)
df['Total'] = (df.Debit - df.Credit).cumsum()
# Calculate interest per day and Cumulative interest
# 11% is the interest rate per year
df['InterestPerDay'] = (df['Total'] * 0.11) / 365
df['InterestCumulative'] = ((df['Total'] * 0.11) / 365).cumsum()
# Change the order of columns
df = df[['Date', 'LastDayOfMonth', 'IsItLastDay', 'InterestPerDay', 'InterestCumulative', 'Debit', 'Credit', 'Total', 'Description']]
df.to_excel("results.xlsx")
The output file looks fine, but I still need the following:
The "InterestCumulative" column should be added to the "Total" column on the last day of each month (compounding the interest).
At the beginning of each month the "InterestCumulative" column should be reset to zero (because the interest was added to the principal sum).
How can I do this?

You're going to need to loop, as your total changes depending on previous rows, which then affects the later rows. As a result your current interest calculations are wrong.
total = 0
cumulative_interest = 0
total_per_day = []
interest_per_day = []
cumulative_per_day = []
for day in df.itertuples():
    # apply the day's transactions to the running balance
    total += day.Debit - day.Credit
    # accrue one day of interest on the current balance
    interest = total * 0.11 / 365
    cumulative_interest += interest
    if day.IsItLastDay:
        # month end: add the accumulated interest to the principal
        total += cumulative_interest
    total_per_day.append(total)
    interest_per_day.append(interest)
    cumulative_per_day.append(cumulative_interest)
    if day.IsItLastDay:
        # reset the accumulator for the new month
        cumulative_interest = 0
df.Total = total_per_day
df.InterestPerDay = interest_per_day
df.InterestCumulative = cumulative_per_day
This is unfortunately a lot more confusing looking, but that's what happens when values depend on previous values. Depending on your exact requirements there may be nice ways to simplify this using math, but otherwise this is what you've got.
I've written this directly into stackoverflow so it may not be perfect.
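As an aside on the math shortcut: if the only transaction is the initial deposit, the daily interest within a month is constant, so you could compound once per month instead of tracking every day. A rough sketch of that idea (it breaks as soon as there are intra-month deposits or withdrawals, and it also compounds the final partial month, unlike the daily loop):
# df is the frame built in the question; balance starts at the first deposit
balance = 10000.0
for month_end, chunk in df.groupby('LastDayOfMonth'):
    days_accruing = len(chunk)  # days of this month covered by the date range
    balance *= 1 + 0.11 / 365 * days_accruing  # capitalize at month end
print(balance)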

Related

Python/Pandas: Search for date in one dataframe and return value in column of another dataframe with matching date

I have two dataframes, one with earnings date and code for before market/after market and the other with daily OHLC data.
First dataframe df:
     earnDate    anncTod
103  2015-11-18  0900
104  2016-02-24  0900
105  2016-05-18  0900
...  ..........  .......
128  2022-03-01  0900
129  2022-05-18  0900
130  2022-08-17  0900
Second dataframe af:
Datetime    Open      High      Low       Close     Volume
2005-01-03  36.3458   36.6770   35.5522   35.6833   3343500
...         ...       ...       ...       ...       ...
2022-04-22  246.5500  247.2000  241.4300  241.9100  1817977
I want to take a date from the first dataframe and find the open and/or close price in the second dataframe. Depending on anncTod value, I want to find the close price of the previous day (if =0900) or the open and close price on the following day (else). I'll use these numbers to calculate the overnight, intraday and close-to-close move which will be stored in new columns on df.
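For example, with hypothetical prices, the three moves would be computed as:
# hypothetical prices, just to illustrate the three ratios
prior_close, open_, close = 100.0, 102.0, 103.0
total_move = close / prior_close  # 1.03   -> +3% close-to-close
overnight = open_ / prior_close   # 1.02   -> +2% overnight
intraday = close / open_          # ~1.0098 -> ~+1% intraday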
I'm not sure how to search for matching values and fetch a value from that row but in a different column. I'm trying to do this with df.iloc and a for loop.
Here's the full code:
import pandas as pd
import requests
import datetime as dt
ticker = 'TGT'
## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])
## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]
## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])
## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4
for date in df['earnDate']:
    if df.iloc[date,1] == '0900':
        priorday = af.loc[af.index.get_loc(date)-1,0]
        priorclose = af.loc[priorday,4]
        open = af.loc[date,1]
        close = af.loc[date,4]
        df.iloc[date,2] = close/priorclose
        df.iloc[date,3] = open/priorclose
        df.iloc[date,4] = close/open
    else:
        print('afternoon')
I get an error:
if df.iloc[date,1] == '0900':
ValueError: Location based indexing can only have [integer, integer slice (START point is INCLUDED, END point is EXCLUDED), listlike of integers, boolean array] types
Converting the date columns to integers creates another error. Is there a better way I should go about doing this?
Ideal output would look like (made up numbers, abbreviated output):
earnDate    anncTod  Total Move  Overnight Move  Intraday Move
2015-11-18  0900     9%          7.2%            1.8%
But would include all the dates given in the first dataframe.
UPDATE
I swapped df.iloc for df.loc and that seems to have solved that problem. The new issue is searching for the variable 'date' in the second dataframe af. I have simplified the code to just print the value in the 'Open' column while I troubleshoot.
Here is updated and simplified code (all else remains the same):
import pandas as pd
import requests
import datetime as dt
ticker = 'TGT'
## pull orats earnings dates and store in pandas dataframe
url = f'https://api.orats.io/datav2/hist/earnings.json?token=keyhere={ticker}'
response = requests.get(url, allow_redirects=True)
data = response.json()
df = pd.DataFrame(data['data'])
## reduce number of dates to last 28 quarters and remove updatedAt column
n = len(df.index)-28
df.drop(index=df.index[:n], inplace=True)
df = df.iloc[: , 1:-1]
## set index to earnDate
df = df.set_index(pd.DatetimeIndex(df['earnDate']))
## import daily OHLC stock data file
loc = f"C:\\Users\\anon\\Historical Stock Data\\us3000_tickers_daily\\{ticker}_daily.txt"
af = pd.read_csv(loc, delimiter=',', names=['Datetime','Open','High','Low','Close','Volume'])
## create total return, overnight and intraday columns in df
df['Total Move'] = '' ##col #2
df['Overnight'] = '' ##col #3
df['Intraday'] = '' ##col #4
for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(af.loc[date,'Open']) ##this is the line generating the error
    else:
        print('afternoon')
I now get KeyError:'2015-11-18'
To use .loc to access a certain row, the label you search for must be in the index. Specifically, that means you'll need to set the date column as the index. EX:
import pandas as pd

df = pd.DataFrame({'earnDate': ['2015-11-18', '2015-11-19', '2015-11-20'],
                   'anncTod': ['0900', '1000', '0800'],
                   'Open': [111, 222, 333]})
df = df.set_index(df["earnDate"])
for date in df['earnDate']:
    if df.loc[date, 'anncTod'] == '0900':
        print(df.loc[date, 'Open'])
# prints
# 111
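The KeyError on af.loc[date, 'Open'] has the same cause: af was read without an index_col, so it still has a default integer index and the date label can't be found there. A minimal sketch of the fix, using a hypothetical stand-in for your af:
import pandas as pd
# hypothetical stand-in for af; the point is only the index change
af = pd.DataFrame({'Datetime': ['2015-11-17', '2015-11-18'],
                   'Open': [70.0, 71.5]})
af = af.set_index('Datetime')
print(af.loc['2015-11-18', 'Open'])  # 71.5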

RSI in spyder using data in excel

So I have an excel file containing data on a specific stock.
My excel file contains about 2 months of data, it monitors the Open price, Close price, High Price, Low Price and Volume of trades in 5 minute intervals, so there are about 3000 rows in my file.
I want to calculate the RSI (or the EMA, if that's easier) of a stock daily. I'm making a summary table that collects the daily data, converting my table of 3000+ rows into a table with only about 60 rows (each row represents one day).
Essentially I want some sort of code that sorts the excel data by date then calculates the RSI as a single value for that day. RSI is given by: 100-(100/(1+RS)) where RS = average gain of up periods/average loss of down periods.
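For example, plugging hypothetical averages into that formula:
# made-up numbers, just to sanity-check the formula
avg_gain = 0.010              # average gain of up periods
avg_loss = 0.005              # average loss of down periods, as a magnitude
rs = avg_gain / avg_loss      # 2.0
rsi = 100 - (100 / (1 + rs))  # ~66.67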
Note: My excel uses 'Datetime' so each row's 'Datetime' looks something like '2022-03-03 9:30-5:00' and the next row would be '2022-03-03 9:35-5:00', etc. So the code needs to just look at the date and ignore the time I guess.
Some code to maybe help understand what I'm looking for:
So here I'm calling my excel file, I want the code to take the called excel file, group data by date and then calculate the RSI of each day using the formula I wrote above.
dat = pd.read_csv('AMD_5m.csv',index_col='Datetime',parse_dates=['Datetime'],
date_parser=lambda x: pd.to_datetime(x, utc=True))
dates = backtest.get_dates(dat.index)
#create a summary table
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio','RSI'] #add additional fields if necessary
summary_table = pd.DataFrame(index = dates, columns=cols)
# loop backtest by dates
This is the code I used to fill out the other columns in my summary table, I'll put my SMA (simple moving average) function below.
for d in dates:
    this_dat = dat.loc[dat.index.date==d]
    #find the number of observations in date d
    summary_table.loc[d]['Num. Obs.'] = this_dat.shape[0]
    #get trading (i.e. position holding) signals
    signals = backtest.SMA(this_dat['Close'].values, window=10)
    #find the number of trades in date d
    summary_table.loc[d]['Num. Trade'] = np.sum(np.diff(signals)==1)
    #find PnLs for 100 shares
    shares = 100
    PnL = -shares*np.sum(this_dat['Close'].values[1:]*np.diff(signals))
    if np.sum(np.diff(signals))>0:
        #close position at market close
        PnL += shares*this_dat['Close'].values[-1]
    summary_table.loc[d]['PnL'] = PnL
    #find the win ratio
    ind_in = np.where(np.diff(signals)==1)[0]+1
    ind_out = np.where(np.diff(signals)==-1)[0]+1
    num_win = np.sum((this_dat['Close'].values[ind_out]-this_dat['Close'].values[ind_in])>0)
    if summary_table.loc[d]['Num. Trade']!=0:
        summary_table.loc[d]['Win. Ratio'] = 1.*num_win/summary_table.loc[d]['Num. Trade']
This is my function for calculating Simple Moving Average. I was told to try and adapt this for RSI or for EMA (Exponential Moving Average). Apparently adapting this for EMA isn't too troublesome but I can't figure it out.
def SMA(p,window=10,signal_type='buy only'):
    #input: price "p", look-back window "window";
    #signal_type = 'buy only' (default) gives long signals, 'sell only' gives sell signals, 'both' gives both long and short signals
    #returns an array of signals: 1 for a long position, -1 for a short position
    signals = np.zeros(len(p))
    if len(p)<window:
        #no signal if there is insufficient data
        return signals
    sma = list(np.zeros(window)+np.nan) #the first few prices do not give technical indicator values
    sma += [np.average(p[k:k+window]) for k in np.arange(len(p)-window)]
    for i in np.arange(len(p)-1):
        if np.isnan(sma[i]):
            continue #skip the market-open time window
        if sma[i]<p[i] and (signal_type=='buy only' or signal_type=='both'):
            signals[i] = 1
        elif sma[i]>p[i] and (signal_type=='sell only' or signal_type=='both'):
            signals[i] = -1
    return signals
I have two solutions to this. One is to loop through each group and add the relevant value to the summary_table; the other is to calculate the whole series at once and set the RSI column to the result.
I first recreated the data:
import yfinance
import pandas as pd
# initially created similar data through yfinance,
# then copied this to Excel and changed the Datetime column to match yours.
df = yfinance.download("AAPL", period="60d", interval="5m")
# copied it and read it as a dataframe
df = pd.read_clipboard(sep=r'\s{2,}', engine="python")
df.head()
# Datetime Open High Low Close Adj Close Volume
#0 2022-03-03 09:30-05:00 168.470001 168.910004 167.970001 168.199905 168.199905 5374241
#1 2022-03-03 09:35-05:00 168.199997 168.289993 167.550003 168.129898 168.129898 1936734
#2 2022-03-03 09:40-05:00 168.119995 168.250000 167.740005 167.770004 167.770004 1198687
#3 2022-03-03 09:45-05:00 167.770004 168.339996 167.589996 167.718094 167.718094 2128957
#4 2022-03-03 09:50-05:00 167.729996 167.970001 167.619995 167.710007 167.710007 968410
Then I formatted the data and created the summary_table:
df["date"] = pd.to_datetime(df["Datetime"].str[:16], format="%Y-%m-%d %H:%M").dt.date
# calculate percentage change from open and close of each row
df["gain"] = (df["Close"] / df["Open"]) - 1
# your summary table, slightly changing the index to use the dates above
cols = ['Num. Obs.', 'Num. Trade', 'PnL', 'Win. Ratio','RSI'] #add additional fields if necessary
summary_table = pd.DataFrame(index=df["date"].unique(), columns=cols)
Option 1:
# loop through each group, calculate the average gain and loss, then RSI
for grp, data in df.groupby("date"):
    # average gain for gain greater than 0
    average_gain = data[data["gain"] > 0]["gain"].mean()
    # average loss for gain less than 0
    average_loss = data[data["gain"] < 0]["gain"].mean()
    # add to the relevant cell of summary_table
    summary_table["RSI"].loc[grp] = 100 - (100 / (1 + (average_gain / average_loss)))
Option 2:
# define a function to apply in the groupby
def rsi_calc(series):
    avg_gain = series[series > 0].mean()
    avg_loss = series[series < 0].mean()
    return 100 - (100 / (1 + (avg_gain / avg_loss)))

summary_table["RSI"] = df.groupby("date")["gain"].apply(lambda x: rsi_calc(x))
Output (same for each):
summary_table.head()
# Num. Obs. Num. Trade PnL Win. Ratio RSI
#2022-03-03 NaN NaN NaN NaN -981.214015
#2022-03-04 NaN NaN NaN NaN 501.950956
#2022-03-07 NaN NaN NaN NaN -228.379066
#2022-03-08 NaN NaN NaN NaN -2304.451654
#2022-03-09 NaN NaN NaN NaN -689.824739
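The question also asked about EMA; that part isn't covered above, but a minimal sketch of adapting the same price-array style might look like this (an assumption on my part: the common smoothing factor 2/(window+1), seeded with the first price):
import numpy as np

def EMA(p, window=10):
    # exponential moving average of price array "p", seeded with p[0]
    alpha = 2 / (window + 1)
    ema = np.zeros(len(p))
    if len(p) == 0:
        return ema
    ema[0] = p[0]
    for i in range(1, len(p)):
        ema[i] = alpha * p[i] + (1 - alpha) * ema[i - 1]
    return ema
With pandas, this_dat['Close'].ewm(span=window, adjust=False).mean() should give the same values.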

How to populate a dataframe from row-by-row calculations?

I am seeking to populate a pandas dataframe row-by-row, whereby each new row is calculated on the basis of the contents of the previous row. I am using this for simple financial projections.
Let us take a dataframe 'df_basic_financials':
df_basic_financials = pd.DataFrame({'current_account': [18357.], 'savings_account': [14809.]})
Now I want to forecast what my current and saving accounts will look like in five years, assuming that I earn 24000 a year and that my saving accounts yields 2% yearly, assuming I spend zero money and do not transfer any money to my savings account.
How do I write the code so that I get this:
   current_account  savings_account
0  18357            14809
1  42357            15105.18
2  66357            15407.2836
etc... for any number of years I want, each time using the calculation 'value of the previous row in the same column + 24000' for current_account and 'value of the previous row in the same column*1.02' for savings_account.
You can get the number of years from the user as input and then run the code this way:
import pandas as pd

df = pd.DataFrame({'current_account': [18357], 'savings_account': [14809]})
years = int(input("Enter years: "))
for n in range(years):
    lastrow = df.iloc[-1]
    print(lastrow['current_account'], lastrow['savings_account'])
    # keep the values as floats; casting to int would drop the interest cents
    df.loc[len(df.index)] = [lastrow['current_account'] + 24000,
                             lastrow['savings_account'] * 1.02]
df
Out will be....
Just use math
df_basic_financials = pd.DataFrame({'current_account': [18357.], 'savings_account': [14809.]})
current_account_projection = [df_basic_financials['current_account'].iloc[-1] + (24000 * i) for i in range(10)]
savings_account_projection = [df_basic_financials['savings_account'].iloc[-1] * (1.02 ** i) for i in range(10)]
df_basic_financials = pd.DataFrame({'current_account': current_account_projection, 'savings_account': savings_account_projection})
If you really want an iterative solution, apply the function on savings_account.iloc[-1]:
current_account_next = df_basic_financials.iloc[-1]['current_account'] + 24000
savings_account_next = df_basic_financials.iloc[-1]['savings_account'] * 1.02
# ignore_index=True is needed when appending a nameless Series; note that
# DataFrame.append was removed in pandas 2.0 -- pd.concat is the replacement
df_basic_financials = df_basic_financials.append(
    pd.Series({'current_account': current_account_next,
               'savings_account': savings_account_next}),
    ignore_index=True)

Pandas find min date within lookback window from first order for each user

For every user, I'd like to find the date of their earliest visit that falls within a 90 day lookback window from their first order date.
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
lookback = test[test['orderNumber']==1]['date'].apply(lambda x: x - timedelta(days=90))
lookback.name = 'window_min'
ids = test['fullVisitorId']
ids = ids.reset_index()
ids = ids.set_index('index')
lookback = lookback.reset_index()
lookback['fullVisitorId'] = lookback['index'].map(ids['fullVisitorId'])
lookback = lookback.set_index('fullVisitorId')
test['window'] = test['fullVisitorId'].map(lookback['window_min'])
test = test[test['window']<test['date']]
test.loc[test.groupby('fullVisitorId')['date'].idxmin()]
This works, but I feel like there ought to be a cleaner way...
How about this? Basically we assign a new column (the first order date minus 90 days) to help us filter away the rows that fall outside the window.
We then apply groupby and pick the first (0-th) element of each group.
import pandas as pd
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
test.sort_values(by='date', inplace=True)
firstorder = test[test.orderNumber > 0].set_index('fullVisitorId').date
test['firstorder_90'] = test.fullVisitorId.map(firstorder - pd.Timedelta(days=90))
test.query('date >= firstorder_90').groupby('fullVisitorId', as_index=False).nth(0)
We get:
date fullVisitorId sessionId \
121154 2016-10-07 7634897085866546110 7634897085866546110_1475846055
189763 2016-12-18 643786734868244401 0643786734868244401_1482120932
orderNumber firstorder_90
121154 0.0 2016-10-07
189763 1.0 2016-09-19
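If you prefer, since the frame is already sorted by date, groupby(...).head(1) should return the same rows:
test.query('date >= firstorder_90').groupby('fullVisitorId').head(1)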

How to calculate with previous row value of a column if it can be declared only after it should be used in pandas dataframe?

The following is the dataframe I have:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
data = pd.DataFrame(index=index)
data['Open'] = np.random.randint(20,40, size=len(data))
data['High'] = np.random.randint(40,50, size=len(data))
data['Low'] = np.random.randint(10,20, size=len(data))
data['Close'] = np.random.randint(10,20, size=len(data))
The calculations I would like to perform are the following:
capital = 30000
data['Shares'] = (capital * 0.05 / data['Close'].shift(1) - data['Low'].shift(1)).round(0)
data['Open_price'] = data['Open'] + 0.5 * (data['High'] - data['Open'])
data['Floating_P/L'] = data['Shares'] * data['Close']
data['Close_price'] = data['Close'] - 0.5 * (data['Close'] - data['Low'])
data['Closed_P/L'] = data['Shares'].shift(1) * data['Close_price']
data['Closed_Balance'] = capital + data['Closed_P/L'].cumsum()
data['Equity'] = data['Closed_Balance'] + data['Floating_P/L']
capital = data['Equity'].shift(1)
As you can see, Equity is calculated from today's Shares number, which is in turn calculated from yesterday's Equity. I want to set capital to the initial value of Equity at the first index and calculate Shares from it at the first index. From the second index on, Shares should be calculated from Equity shifted one row up. How can I do this?
In the code you have given above, I see that the initial value of Shares and the first two values of Equity are NaNs, since the first value of Shares is calculated using the shifted Close and Low columns. Assuming you can set up some values here, try doing this:
Create empty columns in data for Equity and Shares:
data['Equity'] = ''
data['Shares'] = ''
Set capital to the initial value of Equity. Since you mentioned the initial value of capital = 30000:
capital = 30000
data['Equity'][0] = 30000
Calculate the initial value of Shares. Here, try setting up values in place of the NaNs for Close and Low:
data['Shares'][0] = (capital * 0.05 / data['Close'].shift(1)[0] - data['Low'].shift(1)[0]).round(0)
Then use a for loop that iterates from the second index to the length of data, as sketched below.
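A minimal sketch of that loop (only Shares and a placeholder Equity update are shown; substitute the question's full Floating_P/L / Closed_Balance chain for the Equity line):
for i in range(1, len(data)):
    # yesterday's Equity becomes today's capital
    capital = data['Equity'].iloc[i - 1]
    data.iloc[i, data.columns.get_loc('Shares')] = round(
        capital * 0.05 / data['Close'].iloc[i - 1] - data['Low'].iloc[i - 1])
    # placeholder update -- replace with the question's full Equity formula
    data.iloc[i, data.columns.get_loc('Equity')] = (
        capital + data['Shares'].iloc[i]
        * (data['Close'].iloc[i] - data['Close'].iloc[i - 1]))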
