Fix Beginning of Time Series Plot - It Is Wildly Distorted - python

Regardless of the time frame I put on this chart (e.g. 10, 20, or 25 years), the data at the beginning is always distorted on the chart, i.e. very high or very low.
I thought it was a min_periods issue, but that doesn't seem to be it.
Appreciate any insight!
I tried making sure that there were no N/As in the calculation columns, and I tried making min_periods larger.
Neither made any difference.
What's charted on the bottom subplot is sharpe_rf in the code below.
asset1 = 'XLY Equity'
asset2 = 'USGG10YR Index'
#-------------Dataframe 1: Price Data from Multiple Securities-------------#
sids = mgr[asset1, asset2]
d = sids.get_historical(prices,start=startD,end=endD)
df = pd.DataFrame(d)
df = df.dropna()
df.columns = ['Asset', 'Risk Free Rate']
#------------Returns and Standard Deviation---------------#
log_return = np.log(df['Asset']) - np.log(df['Asset'].shift(1))
# Note: despite the names, both of these use a 30-row window (10 rows minimum).
ten_day_return = log_return.rolling(min_periods=10, window=30, center=False).sum()
std_20d = df['Asset'].rolling(min_periods=10, window=30, center=False).std()
max_std_today = std_20d.max() - std_20d
#------------Moving Averages------------#
ma_50 = df['Asset'].rolling(min_periods=10, window=50, center=False).mean()
ma_100 = df['Asset'].rolling(min_periods=10, window=100, center=False).mean()
ma_200 = df['Asset'].rolling(min_periods=10, window=200, center=False).mean()
# #------------Rolling Sharpe Ratio------------#
lookback = 252
daily_returns = log_return[1:]
std_returns = daily_returns.rolling(min_periods=10,window=lookback,center=False).std()
std_returns = std_returns.dropna()
avg_daily_return = daily_returns.rolling(min_periods=10,window=lookback,center=False).mean()
avg_daily_return = avg_daily_return.dropna()
sharpe = avg_daily_return/std_returns
sharpe_final = np.sqrt(252)*sharpe
sharpe_final = sharpe_final.dropna()
sharpe_rf = sharpe_final / df['Risk Free Rate']
sharpe_rf = sharpe_rf.dropna()
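One possible contributor (an assumption, since the chart and the data are not shown): with window=252 and min_periods=10, the first rolling mean and standard deviation are computed from as few as 10 returns, so the earliest Sharpe values are much noisier than the later ones. Any min_periods below the window size still leaves partial windows at the start. A minimal sketch on synthetic returns compares that with requiring a full window; the random series and the helper rolling_sharpe are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic daily log returns standing in for the asset's returns (assumption).
idx = pd.bdate_range("2015-01-01", periods=1000)
returns = pd.Series(rng.normal(0.0004, 0.01, size=len(idx)), index=idx)

lookback = 252


def rolling_sharpe(r, window, min_periods):
    """Annualized rolling mean/std ratio of a daily return series."""
    roll = r.rolling(window=window, min_periods=min_periods)
    return np.sqrt(252) * roll.mean() / roll.std()


# Partial windows allowed: values appear after only 10 observations.
short_start = rolling_sharpe(returns, lookback, min_periods=10)
# Full windows only: the first lookback - 1 rows are NaN.
full_window = rolling_sharpe(returns, lookback, min_periods=lookback)

# Largest absolute value inside the warm-up period vs. over full windows.
print(short_start.iloc[9:lookback - 1].abs().max())
print(full_window.dropna().abs().max())
```

With min_periods equal to the window, the warm-up rows are NaN and are simply not drawn. Dividing by the risk-free rate can also inflate values whenever the yield is close to zero, which is worth checking separately.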

Related

Identify in a pandas series when the trend changes from positive to negative

I have a pandas dataframe with securities prices and several moving average trend lines of various moving average lengths. The data frames are sufficiently large that I would like to identify the most efficient way to capture the index of a particular series where the slope changes (In this example, let's just say from positive to negative for a given series in the dataframe.)
My hack seems very "hacky". I am currently doing the following (Note, imagine this is for a single moving average series):
filter = (df.diff()>0).diff().dropna(axis=0)
new_df = df[filter].dropna(axis=0)
Full example code below:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Create a sample Dataframe
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
close = pd.Series([1,2,3,4,2,1,4,3])
df = pd.DataFrame({"date":days, "prices":close})
df.set_index("date", inplace=True)
print("Original DF")
print(df)
# Long Explanation
updays = (df.diff()>0) # Show True for all updays false for all downdays
print("Updays df is")
print(updays)
reversal_df = (updays.diff()) # this will only show change days as True
reversal_df.dropna(axis=0, inplace=True) # Handle the first day
trade_df = df[reversal_df].dropna() # Select only the days where the trend reversed
print("These are the days where the trend reverses itself from negative to positive or vice versa ")
print(trade_df)
# Simplified below by combining the above into two lines
filter = (df.diff()>0).diff().dropna(axis=0)
new_df = df[filter].dropna(axis=0)
print("The final result is this: ")
print(new_df)
Any help here would be appreciated. Note, I'm more interested in balancing efficiencies between how best to do this so I can understand it, and how to make it sufficiently quick to compute.
Multiple moving average solution.
Look for the comment # *** THE SOLUTION BEGINS HERE *** to see the solution, before that is just generating data, printing and plotting to validate.
What I do here is calculate the sign of the MVA slopes, so a positive slope has a value of 1 and a negative slope a value of -1.
Slope_i = MVA(i, ask; periods) - MVA(i-1, ask; periods)
m<periods>_slp_sgn_i = sign(Slope_i)
Then, to spot slope changes, I calculate:
m<periods>_slp_chg_i = sign(m<periods>_slp_sgn_i - m<periods>_slp_sgn_(i-1))
So, for example, if the slope changes from 1 (positive) to -1 (negative):
sign(-1 - 1) = sign(-2) = -1
On the other hand, if the slope changes from -1 to 1:
sign(1 - (-1)) = sign(2) = 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter/IPython magic; uncomment when running in a notebook)
# GENERATE DATA RANDOM PRICE
_periods = 1000
_value_0 = 1.1300
_std = 0.0005
_freq = '5min'  # 5-minute bars ('5T' is a deprecated alias in recent pandas)
_output_col = 'ask'
_last_date = pd.to_datetime('2021-12-15')
_p_t = np.zeros(_periods)
_wn = np.random.normal(loc=0, scale=_std, size=_periods)
_p_t[0] = _value_0
_wn[0] = 0
_date_index = pd.date_range(end=_last_date, periods=_periods, freq=_freq)
df= pd.DataFrame(np.stack([_p_t, _wn], axis=1), columns=[_output_col, "wn"], index=_date_index)
# Build the random walk: each price is the previous price plus that row's noise.
# _wn[0] is 0, so the cumulative sum starts exactly at _value_0.
df[_output_col] = _value_0 + df["wn"].cumsum()
print(df.head(5))
# CALCULATE MOVING AVERAGES (3)
df['mva_25'] = df['ask'].rolling(25).mean()
df['mva_50'] = df['ask'].rolling(50).mean()
df['mva_100'] = df['ask'].rolling(100).mean()
# plot to check
df['ask'].plot(figsize=(15,5))
df['mva_25'].plot(figsize=(15,5))
df['mva_50'].plot(figsize=(15,5))
df['mva_100'].plot(figsize=(15,5))
plt.show()
# *** THE SOLUTION BEGINS HERE ***
# calculate mva slopes directions
# positive slope: 1, negative slope -1
df['m25_slp_sgn'] = np.sign(df['mva_25'] - df['mva_25'].shift(1))
df['m50_slp_sgn'] = np.sign(df['mva_50'] - df['mva_50'].shift(1))
df['m100_slp_sgn'] = np.sign(df['mva_100'] - df['mva_100'].shift(1))
# CALCULATE CHANGE IN SLOPE
# from 1 to -1: -1
# from -1 to 1: 1
df['m25_slp_chg'] = np.sign(df['m25_slp_sgn'] - df['m25_slp_sgn'].shift(1))
df['m50_slp_chg'] = np.sign(df['m50_slp_sgn'] - df['m50_slp_sgn'].shift(1))
df['m100_slp_chg'] = np.sign(df['m100_slp_sgn'] - df['m100_slp_sgn'].shift(1))
# clean NAN
df.dropna(inplace=True)
# print data to visually check
print(df.iloc[20:40][['mva_25', 'm25_slp_sgn', 'm25_slp_chg']])
# query where slope of MVA25 changes from positive to negative
df[(df['m25_slp_chg'] == -1)].head(5)
WARNING: The data is randomly generated, so the plots and printed output will change each time you run the code.
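The same sign-of-difference idea, applied to the small price series from the question, can be sketched as follows. The variable names are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample prices from the question.
prices = pd.Series([1, 2, 3, 4, 2, 1, 4, 3], name="prices")

# +1 on up days, -1 on down days (NaN for the first row).
slope_sign = np.sign(prices.diff())

# -1 where the slope flips from positive to negative, +1 for the opposite flip.
slope_change = np.sign(slope_sign.diff())

pos_to_neg = prices.index[slope_change == -1]
neg_to_pos = prices.index[slope_change == 1]

print(list(pos_to_neg))
print(list(neg_to_pos))
```

Both steps are vectorized, so the cost grows linearly with the length of the series and no Python-level loop is needed.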

Why does my ema 15 and ema 200 look the same?

I'm not sure if I coded something wrong or just don't seem to understand what an exponential moving average does.
I'm calculating a moving average on SPY 5 minute tick data at ema 15 and ema 200.
They are exactly the same once the 200 starts. This is the code I used.
def EMA(df, column, window, alpha = .2):
    """
    Parameters
    ----------
    df : dataframe.
    column : column to compute the EMA
    window : the window of the EMA
    Returns
    -------
    df with ema column.
    """
    df = df.copy()
    values = np.array(df[column])
    window2 = 'ema_'+str(window)
    df[window2] = df[column].ewm(min_periods = window, alpha=0.3, adjust=False).mean()
    return df
df = avg.EMA(df = df, column = 'adjusted_close', window = 15)
df = avg.EMA(df = df, column = 'adjusted_close', window = 200)
Is there something glaringly obvious or is it doing what it's supposed to and I don't understand the ema like I thought I did?
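A likely explanation, going only by the code shown: ewm is called with a fixed alpha=0.3 in both calls, so window only changes min_periods (how many leading rows are NaN) and not the smoothing. Once both columns have values, they hold the same numbers. A hedged sketch that ties the smoothing to the window through span (pandas then uses alpha = 2 / (span + 1)); add_ema and the synthetic prices are made up for illustration.

```python
import numpy as np
import pandas as pd


def add_ema(df, column, window):
    """Return a copy of df with an 'ema_<window>' column smoothed by span=window."""
    out = df.copy()
    out[f"ema_{window}"] = (
        out[column].ewm(span=window, min_periods=window, adjust=False).mean()
    )
    return out


# Synthetic random-walk prices standing in for the SPY data (assumption).
rng = np.random.default_rng(1)
prices = pd.DataFrame({"adjusted_close": 400 + rng.normal(0, 0.5, 1000).cumsum()})

prices = add_ema(prices, "adjusted_close", 15)
prices = add_ema(prices, "adjusted_close", 200)

print(prices[["ema_15", "ema_200"]].tail())
```

With span set this way, the 200-period line reacts far more slowly than the 15-period line, so the two no longer coincide.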

windowed cross-correlation using numpy

I have two data series that are slightly shifted relative to each other. Both contain nan values that need to be respected, so I would like to align them automatically. My idea is to use cross-correlation and numpy arrays to solve the problem. The code below is extremely slow and I would like to speed things up, but as a non-expert in Python I don't see any possibilities for improvement.
The idea is to have a baseline array and a target array. The code calculates the offset of each target position relative to the baseline in a windowed fashion: for each window, it calculates how far the data point has to be shifted for an optimal alignment. The first point that can be aligned is at window_size//2 and the last at baseline.size-window_size//2
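import numpy as np
import numpy.ma as ma
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view
window_size = 50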
N = 100
randN = 10
baseline = np.random.rand(N,)
target = np.random.rand(N,)
mask=np.zeros(N,dtype=bool)
mask[:randN] = True
np.random.shuffle(mask)
baseline[mask] = np.nan
np.random.shuffle(mask)
target[mask] = np.nan
stacked = np.column_stack((baseline,target))
stacked_windows = sliding_window_view(stacked, (window_size,2))
offset_np = np.zeros([stacked.shape[0], ])
offset_np[:] = np.nan
for idx in range(stacked_windows.shape[0]):
    window = stacked_windows[idx]
    baseline_window_np = window.reshape(window_size,2)[:,0]
    target_window_np = window.reshape(window_size,2)[:,1]
    #
    baseline_window_masked = ma.masked_invalid(baseline_window_np)
    target_window_masked = ma.masked_invalid(target_window_np)
    #
    cc_np = np.zeros([window_size, ])
    cc_np[:] = np.nan
    # shift_numpy is a helper that is not shown in the question
    for lag in range(-int(window_size//2),int(window_size//2)):
        masked_tmp = ma.masked_invalid(shift_numpy(target_window_masked, lag))
        cc_np[lag+int(window_size//2)] = ma.corrcoef(baseline_window_masked,masked_tmp)[0,1]
    if not np.isnan(cc_np).all():
        offset_np[window_size//2+idx] = np.floor(window_size//2)-np.argmax(cc_np)
result_np = np.column_stack((stacked, offset_np))
result_df = df = pd.DataFrame(result_np, columns = ['baseline','target','offset'])
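As a rough sketch of one way to cut per-lag overhead, the shift can be expressed with plain slices and the NaN handling with a boolean mask, which avoids building masked arrays for every lag. This handles a single pair of arrays (one window), not the full sliding scheme above, and its sign convention may differ from the question's; nan_corr and best_lag are hypothetical helper names.

```python
import numpy as np


def nan_corr(a, b):
    """Pearson correlation over the positions where both inputs are finite."""
    ok = np.isfinite(a) & np.isfinite(b)
    if ok.sum() < 3:
        return np.nan
    a, b = a[ok], b[ok]
    if a.std() == 0 or b.std() == 0:
        return np.nan
    return np.corrcoef(a, b)[0, 1]


def best_lag(baseline, target, max_lag):
    """Lag k in [-max_lag, max_lag] that maximizes corr(baseline[i], target[i + k])."""
    n = len(baseline)
    lags = np.arange(-max_lag, max_lag + 1)
    scores = np.full(lags.shape, np.nan)
    for j, k in enumerate(lags):
        if k >= 0:
            a, b = baseline[:n - k], target[k:]
        else:
            a, b = baseline[-k:], target[:n + k]
        scores[j] = nan_corr(a, b)
    if np.isnan(scores).all():
        return np.nan
    return int(lags[np.nanargmax(scores)])


rng = np.random.default_rng(0)
baseline = rng.random(200)
target = np.full(200, np.nan)
target[3:] = baseline[:-3]  # target is the baseline delayed by 3 samples
baseline[rng.choice(200, 10, replace=False)] = np.nan
target[rng.choice(200, 10, replace=False)] = np.nan

print(best_lag(baseline, target, max_lag=25))
```

Calling best_lag once per window would reproduce the windowed behaviour, with the loop over lags working on array views rather than copies.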

How to combine two line charts with (Numbers/Percentage) for y axis and date for x axis and set scaling with openpyxl

I have a data frame that contains a date column, an amount column, and a percentage column. I want to draw a chart with the date column on the x axis, the amount column on the left y axis, and the percentage column on the right y axis.
I put together the code below to draw two charts and combine them into one, but the second y axis doesn't appear as expected and the lines for the percentage values are not displayed. I think that is due to a scaling issue. Some help would be highly appreciated, as I can't figure out why the second y axis is not displayed or what the properties below mean.
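import pandas as pd
from openpyxl import load_workbook
from openpyxl.chart import LineChart, Reference
from openpyxl.chart.axis import DateAxis
updating_file = "D:\\Testing\\Pivot.xlsx"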
wb = load_workbook(filename=updating_file, data_only=True)
sheet = wb["Sheet1"]
columns = next(sheet.values)[0:]
current_sheet_df = pd.DataFrame(sheet.values, columns=columns)
c1 = LineChart()
c1.title = None
c1.y_axis.title = None
c1.y_axis.crossAx = 500
c1.x_axis = DateAxis(crossAx=100)
c1.x_axis.number_format = 'd-mmm'
c1.x_axis.lblAlgn = 'l'
c1.x_axis.majorTimeUnit = "days"
c1.x_axis.title = "Date"
data1 = Reference(sheet, min_col=10, min_row=3, max_row=current_sheet_df.shape[0]+2)
data2 = Reference(sheet, min_col=14, min_row=3, max_row=current_sheet_df.shape[0]+2)
c1.add_data(data1, titles_from_data=True)
c1.add_data(data2, titles_from_data=True)
dates = Reference(sheet, min_col=9, min_row=4, max_row=current_sheet_df.shape[0]+2)
c1.set_categories(dates)
# Create a second chart
c2 = LineChart()
v2 = Reference(sheet, min_col=11, max_col=13, min_row=3, max_row=current_sheet_df.shape[0]+2)
c2.add_data(v2, titles_from_data=True)
c2.set_categories(dates)
# c2.y_axis.axId = 200
c2.y_axis.title = None
c2.x_axis = DateAxis(crossAx=100)
c2.x_axis.number_format = 'd-mmm'
c2.x_axis.lblAlgn = 'l'
c2.x_axis.majorTimeUnit = "days"
c2.x_axis.title = "Date"
c2.y_axis.majorGridlines = None
c1.y_axis.majorGridlines = None
# Display y-axis of the second chart on the right by setting it to cross the x-axis at its maximum
c1.y_axis.crosses = "min"
c2.y_axis.crosses = "max"
c1 += c2
c1.legend.position = 't'
# # Style the lines
s1 = c1.series[0]
s1.graphicalProperties.line.solidFill = "BDCDCD" # Marker outline
sheet.add_chart(c1, "Q20")
wb.save(filename=updating_file)
The chart drawn with the above code: (image not included)
What I want to draw: (image not included)
Based on the sample code at https://openpyxl.readthedocs.io/en/latest/charts/secondary.html
and some experimenting, I found that the second chart's y axis needs its own axis ID so that the two y axes are kept separate:
c2.y_axis.axId = 200
c1.y_axis.crosses = "max"
c1 += c2

How does one append rows to a dataframe from a loop using Pandas?

I'm running a loop that appends values to an empty dataframe defined outside of the loop. However, when the loop finishes, the dataframe is still empty. I'm not sure what's going on. The goal is to find the power value that results in the lowest sum of squared residuals.
Example code below:
import numpy as np
import pandas as pd
import tweedie
power_list = np.arange(1.3, 2, .01)
mean = 353.77
std = 17298.24
size = 860310
x = tweedie.tweedie(mu = mean, p = 1.5, phi = 50).rvs(size)
variance = 299228898.89
sum_ssr_df = pd.DataFrame(columns = ['power', 'dispersion', 'ssr'])
for i in power_list:
    power = i
    phi = variance/(mean**power)
    tvs = tweedie.tweedie(mu = mean, p = power, phi = phi).rvs(len(x))
    sort_tvs = np.sort(tvs)
    df = pd.DataFrame([x, sort_tvs]).transpose()
    df.columns = ['actual', 'random']
    df['residual'] = df['actual'] - df['random']
    ssr = df['residual']**2
    sum_ssr = np.sum(ssr)
    df_i = pd.DataFrame([i, phi, sum_ssr])
    df_i = df_i.transpose()
    df_i.columns = ['power', 'dispersion', 'ssr']
    sum_ssr_df.append(df_i)
sum_ssr_df[sum_ssr_df['ssr'] == sum_ssr_df['ssr'].min()]
What exactly am I doing incorrectly?
This code isn't as efficient as it could be, as noted by ALollz. Each append basically creates a new dataframe in memory (I'm oversimplifying here).
The error in your code is:
sum_ssr_df.append(df_i)
should be:
sum_ssr_df = sum_ssr_df.append(df_i)
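Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the reassignment above raises AttributeError. A minimal sketch of the usual alternative is to collect plain rows in a list and build the frame once after the loop. Here ssr_for_power is a hypothetical stand-in for the Tweedie simulation, so the example runs without the tweedie package.

```python
import numpy as np
import pandas as pd


def ssr_for_power(power):
    """Hypothetical stand-in for the simulation; returns (dispersion, ssr)."""
    dispersion = 100.0 / power
    ssr = (power - 1.6) ** 2
    return dispersion, ssr


rows = []
for power in np.arange(1.3, 2.0, 0.01):
    dispersion, ssr = ssr_for_power(power)
    # Appending to a Python list is cheap and mutates the list in place.
    rows.append({"power": power, "dispersion": dispersion, "ssr": ssr})

# Build the DataFrame once, after the loop.
sum_ssr_df = pd.DataFrame(rows, columns=["power", "dispersion", "ssr"])

# Row with the smallest sum of squared residuals.
best = sum_ssr_df.loc[sum_ssr_df["ssr"].idxmin()]
print(best)
```

If the per-iteration results are already DataFrames, calling pd.concat on a list of them after the loop (with ignore_index=True) serves the same purpose.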
