Here's how Yahoo Finance apparently calculates Adjusted Close stock prices:
https://help.yahoo.com/kb/adjusted-close-sln28256.html
From this, I understand that a constant factor is applied to the unadjusted price, that this factor changes only with each dividend or split event (which should not happen very often), and that I should be able to infer the factor by dividing the unadjusted price by the adjusted price.
However, if I verify this with AAPL data (using Python), I get confusing results:
import yfinance
df = yfinance.download("AAPL", start="2010-01-01", end="2019-12-31")
df["Factor"] = df["Close"] / df["Adj Close"]
print(df["Factor"].nunique(), df["Factor"].count())
Which produces: 2442 2516
So the factor differs in the vast majority of cases. But AAPL usually has four dividend events per year and had one stock split during that period, so I would expect roughly 40 distinct factors rather than 2442.
Is the formula Yahoo Finance provides under the link above overly simplified or am I missing something here?
The problem is that Yahoo Finance doesn't provide BOTH raw and adjusted prices for you to work with. If you check the footnote of a sample historical price page (e.g., MSFT), you will see a text that says "Close price adjusted for splits; Adjusted close price adjusted for both dividends and splits."
In order to derive clean adjusted ratios, both raw (unadjusted) and adjusted prices are needed. Then you can apply an adjustment method such as CRSP to derive the correct values. In summary, you didn't do anything wrong! It's the intrinsic limitation of Yahoo's output.
References:
[1] https://medium.com/@patrick.collins_58673/stock-api-landscape-5c6e054ee631
[2] http://www.crsp.org/products/documentation/crsp-calculations
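For what it's worth, yfinance can return both the (split-adjusted) Close and the Adj Close in one call, together with the dividend events, so one can sketch a CRSP-style reconstruction and compare it to Yahoo's column. This is only an illustration of the methodology under my own assumptions (auto_adjust=False, actions=True, a flat column layout as in the question), not Yahoo's actual code:
import pandas as pd
import yfinance as yf

# Sketch: rebuild a dividend adjustment factor CRSP-style. Each ex-dividend
# date scales all earlier prices by (1 - dividend / previous close); splits
# need no extra handling here because Yahoo's Close is already split-adjusted.
raw = yf.download("AAPL", start="2010-01-01", end="2019-12-31",
                  auto_adjust=False, actions=True)
if isinstance(raw.columns, pd.MultiIndex):          # newer yfinance versions
    raw.columns = raw.columns.get_level_values(0)

multiplier = (1 - raw["Dividends"] / raw["Close"].shift(1)).fillna(1.0)
# factor for a given day = product of the multipliers of all later ex-dividend dates
factor = multiplier[::-1].cumprod()[::-1].shift(-1).fillna(1.0)

reconstructed = raw["Close"] * factor
print((reconstructed / raw["Adj Close"] - 1).abs().max())  # small if the model holds
The reconstruction should track Adj Close to within rounding only if the assumed methodology matches what Yahoo actually does.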
I'm not sure this is a complete answer, but it's too long for a comment:
First, there is definitely an issue with rounding. If you modify your third line to
df["Factor"] = (df["Close"] / df["Adj Close"]).round(12)
you get 2441 unique factors. If, however, you use, for example, round(6), you only get 46 unique factors.
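For example, to see how the count depends on the rounding precision (a quick sketch using the df from the question; the exact counts can vary with the data Yahoo serves):
factor = df["Close"] / df["Adj Close"]
for digits in (12, 8, 6, 4):
    # number of distinct factors after rounding to the given precision
    print(digits, factor.round(digits).nunique())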
Second, according to the adjustment rules in the Yahoo help page in your question, each dividend results in an adjustment for the 5 trading dates immediately prior to the ex-dividend date. During the 10-year period in your question, there were no stock splits and approximately 40 quarterly dividends. These should have resulted in 200 dates with adjusted closing prices. All the other 2,300 or so dates should have no closing adjustments, i.e., a Factor of 1. Yet when you run:
df[df.Factor == 1].shape
you get only 37 dates (regardless of the rounding used) with no adjustments.
Obviously, that doesn't make sense and - unless I'm missing something basic - there is some error either in the implementation of the adjustment methodology or in the Yahoo code.
Related
My goal is to simulate the past growth of a stock portfolio based on historical stock prices. I wrote some code that works (at least I think so). However, I am pretty sure that the basic structure of the code is not very clever and probably makes things more complicated than they need to be. Maybe someone can help me and tell me the best way to solve a problem like mine.
I started with a dataframe containing historical stock prices for a number (here: 2) of stocks:
import pandas as pd
import numpy as np

price_data = pd.DataFrame({'Stock_A': [5, 6, 10],
                           'Stock_B': [5, 7, 2]})
Then I defined a starting capital (here: 1000 €). Furthermore, I decided how much of my money I want to invest in Stock_A (here: 50%) and in Stock_B (here: also 50%).
capital = 1000
weighting = {'Stock_A': 0.5, 'Stock_B': 0.5}
Now I can calculate how many shares of Stock_A and Stock_B I can buy at the beginning:
assets = list(weighting)   # asset names, used by the functions below
quantities = {key: weighting[key] * capital / price_data[key].iloc[0] for key in weighting}
While time goes by, the weights of the portfolio components will of course change, as the prices of Stock A and Stock B move in opposite directions. So at some point the portfolio will mainly consist of Stock A, while the proportion of Stock B (value-wise) gets pretty small. To correct for this, I want to restore the initial 50:50 weighting as soon as the portfolio weights deviate too much from the initial weighting (so-called rebalancing). I defined a function to decide whether rebalancing is needed or not.
def need_to_rebalance(row):
    rebalance = False
    for asset in assets:
        if not 0.4 < row[asset] * quantities[asset] / portfolio_value < 0.6:
            rebalance = True
            break
    return rebalance
If we perform a rebalancing, the following function returns the updated number of shares for Stock A and Stock B:
def rebalance(row):
    for asset in assets:
        quantities[asset] = weighting[asset] * portfolio_value / row[asset]
    return quantities
Finally, I defined a third function that I can use to loop over the dataframe containing the stock prices in order to calculate the value of the portfolio based on the current number of shares we own. It looks like this:
def run_backtest(row):
    global portfolio_value, quantities
    portfolio_value = sum(np.array(row[assets]) * np.array(list(quantities.values())))
    if need_to_rebalance(row):
        quantities = rebalance(row)
    for asset in assets:
        historical_quantities[asset].append(quantities[asset])
    return portfolio_value
Then I put it all to work using .apply:
historical_quantities = {}
for asset in assets:
    historical_quantities[asset] = []

output = price_data.copy()
output['portfolio_value'] = price_data.apply(run_backtest, axis=1)
output.join(pd.DataFrame(historical_quantities), rsuffix='_weight')
The result looks reasonable to me and it is basically what I wanted to achieve. However, I was wondering whether there is a more efficient way to solve the problem. Somehow, doing the calculation line by line and storing all the values in the variable historical_quantities just to add them to the dataframe at the end doesn't look very clever to me. Furthermore, I have to use a lot of global variables. Storing a lot of values from inside the functions as global variables makes the code pretty messy (in particular if the calculations concerning rebalancing get more complex, for example when including tax effects). Has someone read until here and is maybe willing to help me?
All the best
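For what it's worth, here is one possible refactor (just a sketch under the same assumptions as the question's toy data, and certainly not the only way) that keeps all state inside a single function instead of globals and collects the quantities as it goes:
import pandas as pd

def backtest(price_data, weighting, capital, lower=0.4, upper=0.6):
    assets = list(weighting)
    first_prices = price_data.iloc[0]
    quantities = {a: weighting[a] * capital / first_prices[a] for a in assets}
    records = []
    for _, row in price_data.iterrows():
        value = sum(row[a] * quantities[a] for a in assets)
        weights = {a: row[a] * quantities[a] / value for a in assets}
        if any(not (lower < w < upper) for w in weights.values()):
            # rebalance back to the target weighting at current prices
            quantities = {a: weighting[a] * value / row[a] for a in assets}
        records.append({**{f"qty_{a}": quantities[a] for a in assets},
                        "portfolio_value": value})
    return pd.concat([price_data.reset_index(drop=True), pd.DataFrame(records)], axis=1)

price_data = pd.DataFrame({'Stock_A': [5, 6, 10], 'Stock_B': [5, 7, 2]})
print(backtest(price_data, {'Stock_A': 0.5, 'Stock_B': 0.5}, 1000))
With the toy data above this reproduces the portfolio values of the original loop, and extra bookkeeping (taxes, transaction costs) can live inside the function instead of in globals.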
I am calculating an EMA with Python on Binance (BTC futures) monthly open-price data (2020/12–2021/01).
ema2 gives 25872.823333 for the second month, as shown below.
import pandas as pd

df = pd.Series([19722.09, 28948.19])
ema2 = df.ewm(span=2,adjust=False).mean()
ema2
0 19722.090000
1 25872.823333
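(For reference, that second value is just the adjust=False recursion seeded with the first observation, with alpha = 2 / (span + 1) = 2/3 for span=2:)
alpha = 2 / (2 + 1)              # span = 2
ema_0 = 19722.09                 # seed: the first observation
ema_1 = alpha * 28948.19 + (1 - alpha) * ema_0
print(ema_1)                     # 25872.8233...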
But on Binance, EMA(2) gives a different value (25108.05), as shown on the chart there:
https://www.binance.com/en/futures/BTCUSDT_perpetual
Any help would be appreciated.
I had the same problem: the EMA calculated with pandas (df.ewm...) wasn't the same as the one from Binance. You have to use a longer series. First I used 25 candlesticks, then changed to 500. When you query Binance, query a lot of data, because the EMA calculation depends on the whole series from its beginning.
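To illustrate the point with made-up earlier closes (not real Binance data): with adjust=False the EMA is seeded with the first value of the series, so the number reported for the latest bar depends on how far back the series starts.
import pandas as pd

short = pd.Series([19722.09, 28948.19])
longer = pd.Series([15000.0, 16500.0, 18200.0, 19722.09, 28948.19])  # invented earlier closes

print(short.ewm(span=2, adjust=False).mean().iloc[-1])   # 25872.82...
print(longer.ewm(span=2, adjust=False).mean().iloc[-1])  # a different number for the same last bar
The effect is small for span=2 and grows with larger spans, which is why pulling a few hundred candles brings the pandas EMA close to the value on the chart.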
best regards
I downloaded some stock data from CRSP and need the variance of each company's stock returns over the last 36 months.
So, basically the variance based on two conditions:
Same PERMCO (company number)
Monthly stock returns of the last 3 years.
However, I excluded penny stocks from my sample (stocks with prices < $2). Hence, sometimes months are missing and, e.g., April's and June's monthly returns end up directly on top of each other.
If I am not mistaken, a rolling function (grouped by PERMCO) would just take the last 36 monthly returns as they appear. But when months are missing, the rolling window would actually cover more than 3 years of data (since the last 36 monthly returns would exceed that timeframe).
Usually I work with MS Excel. However, in this case the amount of data is too big and it takes ages for Excel to calculate everything. That's why I want to tackle the problem with Python.
The sample is organized as follows:
PERMNO date SHRCD PERMCO PRC RET
When I have figured out how to make a proper table in here I will show you a sample of my data.
What I have tried so far:
data["RET"]=data["RET"].replace(["C","B"], np.nan)
data["date"] = pd.to_datetime(date["date"])
data=data.sort_values[("PERMCO" , "date"]).reset_index()
L3Yvariance=data.groupby("PERMCO")["RET"].rolling(36).var().reset_index()
Sometimes there are C and B codes instead of actual returns; that's why I added the first line.
You can replace the missing values with the mean value. It won't affect the variance much, as the variance is calculated after subtracting the mean, so for the months where you don't have a value, the contribution to the variance will be 0.
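A sketch of that idea (my own interpretation, assuming one observation per PERMCO per month and the column names from the question): reindex each PERMCO to a complete monthly range, fill the gaps with that stock's mean return, and a 36-row window then corresponds to exactly 36 calendar months.
import pandas as pd

def var_36m(group):
    # complete monthly index for this PERMCO; gaps become NaN
    g = group.set_index(pd.PeriodIndex(group["date"], freq="M"))
    full = pd.period_range(g.index.min(), g.index.max(), freq="M")
    ret = pd.to_numeric(g["RET"], errors="coerce").reindex(full)
    # fill missing months with the stock's mean return, as suggested above
    ret = ret.fillna(ret.mean())
    return ret.rolling(36).var()

L3Yvariance = data.groupby("PERMCO").apply(var_36m)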
Having a strange issue that skirts the line between stat/math and programming (I'm unsure what the root issue is).
I have a dataframe of returns by asset by month. Here's a small snapshot of the dataframe:
Strategy Theme Idea month PnL
Event Catalyst European Oil Services 2019-05 -1.412264e-10
Event Catalyst European Oil Services 2019-06 -2.688968e-08
Event Catalyst None 2019-06 1.546945e-08
Event M&A None 2019-06 2.128868e-08
Fundamental 5G Rollout Intelsat 2019-01 1.375019e-02
Now, when I group by month and sum, then compound the result, I get the answer I'd expect, and what I am able to replicate in excel:
x = df.groupby(['month']).sum()
total_pnl = x.compound()['PnL']
This gets me my expected result of 4.16%. So far, so good. However, taking the same dataframe, compounding the individual Ideas, and then summing yields a different answer: 5.89%.
dfx = df.pivot_table(index='month',columns=['Strategy','Theme','Idea'],values='PnL')
dfx = dfx.fillna(0.0)
dfx = dfx.compound()
dfx = dfx.reset_index(drop=False)
dfx.columns= ['Strategy','Theme','Idea','PnL']
total_pnl = sum(dfx['PnL'])
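For concreteness, the two orders of operations can be reproduced on a tiny made-up frame ((1 + r).prod() - 1 stands in for the deprecated .compound()):
import pandas as pd

toy = pd.DataFrame({
    "month": ["2019-01", "2019-01", "2019-02", "2019-02"],
    "Idea":  ["A", "B", "A", "B"],
    "PnL":   [0.02, -0.01, 0.03, 0.01],
})

# 1) sum within each month, then compound across months
monthly = toy.groupby("month")["PnL"].sum()
total_1 = (1 + monthly).prod() - 1

# 2) compound each Idea across months, then sum across Ideas
by_idea = toy.pivot_table(index="month", columns="Idea", values="PnL").fillna(0.0)
total_2 = ((1 + by_idea).prod() - 1).sum()

print(total_1, total_2)  # the two orders of operations need not agree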
At first, I thought it was simply that it is not mathematically acceptable to compound individual returns and then sum them, but a simple example I did in Excel proved to me that the two methods should be the same. I then checked in Excel whether having 0 returns in any given month would be a problem -- it isn't.
Now I've been scratching my head for a couple of hours trying to figure out why these numbers don't match when I do it in Python, while my simple Excel example shows me they should.
Can you think of any caveats that I'm not taking into account that may be causing this?
I'm fairly new to Python and pandas, and I'm wondering if anyone knows of any libraries for Python built on top of pandas that would take a time series of orders with the following columns:
timestamp, id, price, size, exchange
Each record adjusts the total per price and exchange by the size to give you a current view, i.e. records might look like:
9:00:25.123, 1, 1.02, 100, N
9:00:25.123, 2, 1.02, -50, N
9:00:25.129, 3, 1.03, 50, X
9:00:25.130, 4, 1.02, 150, X
9:00:25.131, 5, 1.02, -5, X
I want to be able, for any time, to get the current view of the market. So, for example, if I made the call for the market at 9:00:25.130, I would get:
1.02, N, 50
1.02, X, 150
1.03, X, 50
A query for 9:00:25.131 would return
1.02, N, 50
1.02, X, 145
1.03, X, 50
There may be a million or more of these records; iterating over all of the records for every request would take a long time, particularly if you were trying to look at times later in the day. I suppose one could create "snapshots" at some time interval and use them like key frames in MPEG playback, and I could code it myself, but I think that book building/playback is such a common need for folks using pandas with financial data that there might already be libraries out there to do this.
Any ideas, or do I roll my own?
I know this is old but it's instructive to see the benefits and limits of pandas
I built a trivial jupyter notebook to show how an order book like you describe could be built to be used as you requested.
The core is a loop that updates the state of the order book and saves it for amalgamation into a pandas DataFrame:
states = []
current_timestamp = None
current_state = {}
# df is assumed to be indexed by timestamp, with the remaining columns in the
# order id, price, exchange, size
for timestamp, (id_, price, exch, size) in df.iterrows():
    if current_timestamp is None:
        current_timestamp = timestamp
    if current_timestamp != timestamp:
        # drop price levels whose size has netted out to zero
        for key in list(current_state):
            if current_state[key] == 0.:
                del current_state[key]
        # snapshot the book state as of the previous timestamp
        states.append((current_timestamp, current_state.copy()))
        current_timestamp = timestamp
    key = (exch, price)
    current_state.setdefault(key, 0.)
    current_state[key] += size
states.append((timestamp, current_state.copy()))
order_book = pd.DataFrame.from_dict(dict(states), orient="index")
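A hypothetical query helper on top of that frame (the name and interface are my own, not from any library): since each snapshot row holds the full book state at its timestamp and an absent column simply means an empty level, the prevailing view at any time is the last row at or before it.
def book_at(order_book, when):
    # snapshots at or before `when`; NaN means the (exchange, price) level is empty
    snapshot = order_book.sort_index().loc[:when]
    if snapshot.empty:
        return pd.Series(dtype=float)
    last = snapshot.iloc[-1].fillna(0.0)
    return last[last != 0]   # Series keyed by (exchange, price) tuples

# e.g. book_at(order_book, some_timestamp) reproduces the views asked for in the question.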
However: note how the book state has to be built up outside of pandas, and that a pandas.DataFrame of order book state isn't so well suited to model order book per-level priority or depth (Level 3 data), which can be a major limitation depending on how accurately you want to model the order book.
Order books and the orders and quotes that update them (both of which you group into the term "request") in the real world have fairly complex interactions. These interactions are governed by the rules of the exchange that manages them, and these rules change all the time. Since these rules take time to model correctly, are worth understanding to very few, and old sets of rules are usually not even of much academic interest, the only places one would tend to find these rules codified into a library are the places not very interested in sharing them with others.
To understand the theory behind a simple ("stylised") model of an order book, its orders and quotes thereupon, see the paper "A stochastic model for order book dynamics" by Rama Cont, Sasha Stoikov, Rishi Talreja, Section 2:
2.1 Limit order books
Consider a financial asset traded in an order-driven market. Market participants can post two types of buy/sell orders. A limit order is an order to trade a certain amount of a security at a given price. Limit orders are posted to an electronic trading system and the state of outstanding limit orders can be summarized by stating the quantities posted at each price level: this is known as the limit order book. The lowest price for which there is an outstanding limit sell order is called the ask price and the highest buy price is called the bid price.
[...more useful description]
2.2. Dynamics of the order book
Let us now describe how the limit order book is updated by the inflow of new orders. [...] Assuming that all orders are of unit size [...],
• a limit buy order at price level p < p_A(t) increases the quantity at level p: x → x^{p−1}
• a limit sell order at price level p > p_B(t) increases the quantity at level p: x → x^{p+1}
• a market buy order decreases the quantity at the ask price: x → x^{p_A(t)−1}
• a market sell order decreases the quantity at the bid price: x → x^{p_B(t)+1}
• a cancellation of an outstanding limit buy order at price level p < p_A(t) decreases the quantity at level p: x → x^{p+1}
• a cancellation of an outstanding limit sell order at price level p > p_B(t) decreases the quantity at level p: x → x^{p−1}
(In the paper's notation, x^{p±1} is the state with the quantity at level p moved up or down by one unit; bid quantities carry a negative sign, which is why an extra buy order maps to x^{p−1}.)
The evolution of the order book is thus driven by the incoming flow of market orders, limit orders and cancellations at each price level [...]
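As a toy illustration of that stylised model (invented event rates and price grid, no cancellations, and nothing like the paper's level-dependent Poisson intensities), the four quoted event types can be simulated on an integer state vector, using the paper's convention that bid depth carries a negative sign:
import numpy as np

rng = np.random.default_rng(0)
n_levels = 20
book = np.zeros(n_levels, dtype=int)   # x_p > 0: resting sell depth, x_p < 0: resting buy depth
book[:8] = -rng.integers(1, 5, 8)      # some initial bids at the low price levels
book[12:] = rng.integers(1, 5, 8)      # some initial asks at the high price levels

def best_bid(x):
    return np.nonzero(x < 0)[0].max() if (x < 0).any() else -1

def best_ask(x):
    return np.nonzero(x > 0)[0].min() if (x > 0).any() else len(x)

for _ in range(1000):
    event = rng.choice(["limit_buy", "limit_sell", "market_buy", "market_sell"])
    bid, ask = best_bid(book), best_ask(book)
    if event == "limit_buy" and ask > 0:
        book[rng.integers(0, ask)] -= 1             # x -> x^{p-1} for some p < p_A(t)
    elif event == "limit_sell" and bid < n_levels - 1:
        book[rng.integers(bid + 1, n_levels)] += 1  # x -> x^{p+1} for some p > p_B(t)
    elif event == "market_buy" and ask < n_levels:
        book[ask] -= 1                              # x -> x^{p_A(t)-1}
    elif event == "market_sell" and bid >= 0:
        book[bid] += 1                              # x -> x^{p_B(t)+1}

print(book)  # negative entries are resting bids, positive entries are resting asks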
Some libraries where you can see people's attempts at modeling or visualising a simple limit order book are:
PyRebuildLOB is a working example, but pandas plays a relatively small part in its implementation beyond serving as a fancy array.
Background and visualisation: "Analysing an Electronic Limit Order Book" by David Kane, Andrew Liu, and Khanh Nguyen.
A usual visualisation from Oculus Information (image not reproduced here).
An unusual visualisation from working4arbitrage (image not reproduced here).
And there is a good quant.stackoverflow.com question and answers here.