How to implement a normality-check function in Python?

I am building a Monte Carlo simulation to study the behaviour of a set of 1000 iterations. Every simulation produces an output graph from a Pandas dataframe, saved as a png with matplotlib.pyplot. I am not sure that every output follows a Normal distribution, even though I read an article that claims every output does, so I'd like to understand how to check it.
I've found something in this link but I didn't understand which method is the best or how to implement it.
Here's the code:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('whitegrid')
avg = 1
std_dev = .1
num_reps = 500
num_simulations = 1000
#generate a list of percentages that will replicate our historical normal distribution
#two decimal places in order to make it very easy to see the boundaries
pct_to_target = np.random.normal(avg, std_dev, num_reps).round(2)
#input of historical data
sales_target_values = [75_000, 100_000, 200_000, 300_000, 400_000, 500_000]
sales_target_prob = [.3, .3, .2, .1, .05, .05]
sales_target = np.random.choice(sales_target_values, num_reps, p=sales_target_prob)
#build up a pandas dataframe
df = pd.DataFrame(index=range(num_reps), data={'Pct_To_Target': pct_to_target,
                                               'Sales_Target': sales_target})
df['Sales'] = df['Pct_To_Target'] * df['Sales_Target']
#Here is what our new dataframe looks like
print("how our dataframe looks like")
print(df)
#Return the commission rate based on the Excel table
def calc_commission_rate(x):
    if x <= .90:
        return .02
    if x <= .99:
        return .03
    else:
        return .04
#create our commission rate and multiply it times sales
df['Commission_Rate'] = df['Pct_To_Target'].apply(calc_commission_rate)
df['Commission_Amount'] = df['Commission_Rate'] * df['Sales']
print(df)
# Define a list to keep all the results from each simulation that we want to analyze
all_stats = []
# Loop through many simulations
for i in range(num_simulations):
    # Choose random inputs for the sales targets and percent to target
    sales_target = np.random.choice(sales_target_values, num_reps, p=sales_target_prob)
    pct_to_target = np.random.normal(avg, std_dev, num_reps).round(2)
    # Build the dataframe based on the inputs and number of reps
    df = pd.DataFrame(index=range(num_reps), data={'Pct_To_Target': pct_to_target,
                                                   'Sales_Target': sales_target})
    # Back into the sales number using the percent to target rate
    df['Sales'] = df['Pct_To_Target'] * df['Sales_Target']
    # Determine the commissions rate and calculate it
    df['Commission_Rate'] = df['Pct_To_Target'].apply(calc_commission_rate)
    df['Commission_Amount'] = df['Commission_Rate'] * df['Sales']
    #print(df)
    # We want to track sales,commission amounts and sales targets over all the simulations
    all_stats.append([df['Sales'].sum().round(0),
                      df['Commission_Amount'].sum().round(0),
                      df['Sales_Target'].sum().round(0)])
results_df = pd.DataFrame.from_records(all_stats, columns=['Sales',
                                                           'Commission_Amount',
                                                           'Sales_Target'])
results_df.describe().style.format('{:,}')
print(results_df)
results_df['Commission_Amount'].plot(kind='hist', title="Total Commission Amount")
plt.savefig('graph.png')
# results_df['Sales'].plot(kind='hist')
# plt.savefig('graph2.png')
print(results_df)
I'd like to add a function that checks whether the output distribution is a Gaussian (normal) distribution, because I am not sure that it actually is on every run.
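One way to add such a check, sketched here under the assumption that SciPy is available (it is not imported in the code above), is to run the Shapiro-Wilk test and D'Agostino's K^2 test on a results column and treat a small p-value as evidence against normality. The function name check_normality, the 0.05 threshold and the stand-in samples are illustrative choices, not part of the original script.

```python
import numpy as np
from scipy import stats

def check_normality(values, alpha=0.05):
    """Run two normality tests; small p-values are evidence against normality."""
    values = np.asarray(values, dtype=float)
    _, shapiro_p = stats.shapiro(values)        # Shapiro-Wilk test
    _, dagostino_p = stats.normaltest(values)   # D'Agostino-Pearson K^2 test
    return {
        'shapiro_p': float(shapiro_p),
        'dagostino_p': float(dagostino_p),
        # True only means that neither test rejected normality at level alpha
        'looks_normal': bool(min(shapiro_p, dagostino_p) > alpha),
    }

# Stand-in data; in the script above you would pass results_df['Commission_Amount']
rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=1.0, scale=0.1, size=1000)
skewed_sample = rng.exponential(scale=1.0, size=1000)
print(check_normality(normal_sample))
print(check_normality(skewed_sample))
```

These tests can only reject normality, never prove it, so a Q-Q plot (for example with scipy.stats.probplot) is a useful visual complement to the p-values.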

Related

Error using Santiment sanpy library for cryptocurrency data analysis

I am using sanpy to gather crypto market data and statsmodels to compute alpha, beta and R-squared. I then want to create a crypto = input("Cryptocurrency: ") prompt inside a while loop that asks the user for a specific crypto, outputs its statistics, and then shows the prompt again.
With the following code I receive the error: ValueError: If using all scalar values, you must pass an index
import san
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import datetime
import statsmodels.api as sm
from statsmodels import regression
cryptos = ["bitcoin", "ethereum", "ripple", "bitcoin-cash", "tether",
"bitcoin-sv", "litecoin", "binance-coin", "eos", "chainlink",
"monero", "bitcoin-gold"]
def get_and_process_data(c):
    raw_data = san.get("daily_closing_price_usd/" + c, from_date="2014-12-31", to_date="2019-12-31", interval="1d") # "query/slug"
    return raw_data.pct_change()[1:]
df = pd.DataFrame({c: get_and_process_data(c) for c in cryptos})
df['MKT Return'] = df.mean(axis=1) # avg market return
#print(df) # show dataframe with all data
def model(x, y):
    # Calculate r-squared
    X = sm.add_constant(x) # artificially add intercept to x, as advised in the docs
    model = sm.OLS(y,X).fit()
    rsquared = model.rsquared
    # Fit linear regression and calculate alpha and beta
    X = sm.add_constant(x)
    model = regression.linear_model.OLS(y,X).fit()
    # params is a labelled Series, so select the intercept and slope by position
    alpha = model.params.iloc[0]
    beta = model.params.iloc[1]
    return rsquared, alpha, beta
results = pd.DataFrame({c: model(df[df[c].notnull()]['MKT Return'], df[df[c].notnull()][c]) for c in cryptos}).transpose()
results.columns = ['rsquared', 'alpha', 'beta']
print(results)
The error is in the following line:
df = pd.DataFrame({c: get_and_process_data(c) for c in cryptos})
I tried solving the issue by changing it to:
df = {c: get_and_process_data(c) for c in cryptos}
df['MKT Return'] = df.mean(axis=1) # avg market return
print(df) # show dataframe with all data
But with that, it gave me a different error: AttributeError: 'dict' object has no attribute 'mean'.
The goal is to create a single DataFrame with the datetime column, one column per crypto holding its pct_change data, and an additional MKT Return column with the daily mean of all cryptos' pct_change. I then want to use this data to calculate each crypto's statistics and finally create the input function mentioned at the beginning.
I hope I made myself clear and that someone is able to help me with this matter.
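This is a great start, but I think you are getting confused by what san returns. san.get gives back a DataFrame, and the numbers you want are in its 'value' column, so you can build the combined frame like this: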
import san
import pandas as pd
# List of data we are interested in
cryptos = ["bitcoin", "ethereum", "ripple", "bitcoin-cash", "tether",
"bitcoin-sv", "litecoin", "binance-coin", "eos", "chainlink",
"monero", "bitcoin-gold"]
# function to get the data from san into a dataframe and turn it into
# a daily percentage change
def get_and_process_data(c):
    raw_data = san.get("daily_closing_price_usd/" + c, from_date="2014-12-31", to_date="2019-12-31", interval="1d") # "query/slug"
    return raw_data.pct_change()[1:]
# now set up an empty dataframe to get all the data put into
df = pd.DataFrame()
# cycle through your list
for c in cryptos:
    # get the data as percentage changes
    dftemp = get_and_process_data(c)
    # then add it to the output dataframe df
    df[c] = dftemp['value']
# have a look at what you have
print(df)
From that point on you know you have good data and can build on it.
I would suggest getting the regression working for a single currency first, then moving on to cycling through all of them.
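The same frame can also be assembled in one step with pd.concat. The sketch below does not call san: fake_get_and_process_data is a hypothetical stand-in that returns a DataFrame with a single 'value' column, which is the shape the loop above assumes for the san result, and the shortened crypto list and dates are made up for illustration.

```python
import numpy as np
import pandas as pd

cryptos = ["bitcoin", "ethereum", "ripple"]
dates = pd.date_range("2019-01-01", periods=6, freq="D")
rng = np.random.default_rng(0)

def fake_get_and_process_data(c):
    # Hypothetical stand-in for san.get(...): random prices in a 'value' column
    prices = pd.DataFrame({"value": rng.uniform(10, 20, len(dates))}, index=dates)
    return prices.pct_change()[1:]

# One Series per crypto; the dict keys become the column names
df = pd.concat({c: fake_get_and_process_data(c)["value"] for c in cryptos}, axis=1)
df["MKT Return"] = df[cryptos].mean(axis=1)  # daily mean across the crypto columns
print(df)
```

Because every Series carries its own date index, pd.concat aligns the columns by date, so cryptos with shorter histories simply get NaN on the missing days.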
You are passing scalar values but you need to pass lists, so try the following:
data = {c: [get_and_process_data(c)] for c in cryptos}
df = pd.DataFrame(data)
Maybe try this first

Is this the correct way to forecast stock price volatility using GARCH?

I am attempting to make a forecast of a stock's volatility some time into the future (say 90 days). It seems that GARCH is a traditionally used model for this.
I have implemented this below using Python's arch library. Everything I do is explained in the comments, the only thing that needs to be changed to run the code is to provide your own daily prices, rather than where I retrieve them from my own API.
import utils
import numpy as np
import pandas as pd
import arch
import matplotlib.pyplot as plt
ticker = 'AAPL' # Ticker to retrieve data for
forecast_horizon = 90 # Number of days to forecast
# Retrieve prices from IEX API
prices = utils.dw.get(filename=ticker, source='iex', iex_range='5y')
df = prices[['date', 'close']].copy() # Copy so the new columns are not written to a slice of prices
df['daily_returns'] = np.log(df['close']).diff() # Daily log returns
df['monthly_std'] = df['daily_returns'].rolling(21).std() # Standard deviation across trading month
df['annual_vol'] = df['monthly_std'] * np.sqrt(252) # Annualize monthly standard deviation
df = df.dropna().reset_index(drop=True)
# Convert decimal returns to %
returns = df['daily_returns'] * 100
# Fit GARCH model
am = arch.arch_model(returns[:-forecast_horizon])
res = am.fit(disp='off')
# Calculate fitted variance values from model parameters
# Convert variance to standard deviation (volatility)
# Revert previous multiplication by 100
fitted = 0.1 * np.sqrt(
    res.params['omega'] +
    res.params['alpha[1]'] *
    res.resid**2 +
    res.conditional_volatility**2 *
    res.params['beta[1]']
)
# Make forecast
# Convert variance to standard deviation (volatility)
# Revert previous multiplication by 100
forecast = 0.1 * np.sqrt(res.forecast(horizon=forecast_horizon).variance.values[-1])
# Store actual, fitted, and forecasted results
vol = pd.DataFrame({
    'actual': df['annual_vol'],
    'model': np.append(fitted, forecast)
})
# Plot Actual vs Fitted/Forecasted
plt.plot(vol['actual'][:-forecast_horizon], label='Train')
plt.plot(vol['actual'][-forecast_horizon - 1:], label='Test')
plt.plot(vol['model'][:-forecast_horizon], label='Fitted')
plt.plot(vol['model'][-forecast_horizon - 1:], label='Forecast')
plt.legend()
plt.show()
For Apple, this produces the following plot:
Clearly, the fitted values are consistently far lower than the actual values, and this results in the forecast being a huge underestimate, too. (This is a poor example given that Apple's volatility was unusually high in this test period, but with every company I try, the fitted values are always too low.)
Am I doing everything correctly, and the GARCH model just isn't very powerful, or modelling volatility is simply very difficult? Or is there some error I am making?
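One thing worth checking is the units of the two series being compared. The returns are multiplied by 100 before fitting, so the model's conditional volatility is a daily figure in percent, while annual_vol is an annualized figure in decimals. The sketch below shows the conversion under the usual square-root-of-time assumption; annualize_daily_vol_pct is a hypothetical helper, not part of the arch library.

```python
import numpy as np

def annualize_daily_vol_pct(daily_vol_pct, trading_days=252):
    """Convert daily volatility in percent to annualized volatility in decimals."""
    daily_vol = np.asarray(daily_vol_pct, dtype=float) / 100.0  # percent -> decimal
    return daily_vol * np.sqrt(trading_days)                    # daily -> annual

# A daily volatility of 1% corresponds to roughly 15.9% per year
print(annualize_daily_vol_pct(1.0))

# The combined factor is 0.01 * sqrt(252), about 0.159, rather than 0.1
print(0.01 * np.sqrt(252))
```

With the variables in the code above, this conversion would be applied to res.conditional_volatility and to the square root of the forecast variances in place of the 0.1 multiplier. Whether that closes the whole gap is not something this sketch can show.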

Classification of continuous data

I've got a Pandas df that I use for Machine Learning in Scikit for Python.
One of the columns is a target value which is continuous data (varying from -10 to +10).
From the target-column, I want to calculate a new column with 5 classes where the number of rows per class is the same, i.e. if I have 1000 rows I want to distribute into 5 classes with roughly 200 in each class.
So far, I have done this in Excel, separate from my Python code, but as the data has grown it's getting impractical.
In Excel I have calculated the percentiles and then used some logic to build the classes.
How to do this in Python?
#create data
import numpy as np
import pandas as pd
df = pd.DataFrame(20*np.random.rand(50, 1)-10, columns=['target'])
#find quantiles
quantiles = df['target'].quantile([.2, .4, .6, .8])
#labeling of groups
df['group'] = 5
# use .loc so the assignment writes to df itself rather than to a temporary copy
df.loc[df['target'] < quantiles[.8], 'group'] = 4
df.loc[df['target'] < quantiles[.6], 'group'] = 3
df.loc[df['target'] < quantiles[.4], 'group'] = 2
df.loc[df['target'] < quantiles[.2], 'group'] = 1
While looking for an answer to a similar question, I found this post and the following tip: What is the difference between pandas.qcut and pandas.cut?
import numpy as np
import pandas as pd
#generate 1000 rows of uniform distribution between -10 and 10
rows = np.random.uniform(-10, 10, size = 1000)
#generate the discretization in 5 classes
rows_cut = pd.qcut(rows, 5)
classes = rows_cut.factorize()[0]
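If the class labels should follow the order of the quantiles, pd.qcut can return integer bin codes directly through labels=False. A short sketch on synthetic data (the column names target and group mirror the earlier code; the seed is arbitrary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'target': rng.uniform(-10, 10, size=1000)})

# labels=False returns the bin index 0..4; add 1 to get classes 1..5
df['group'] = pd.qcut(df['target'], q=5, labels=False) + 1

# Each class should hold about a fifth of the rows
print(df['group'].value_counts().sort_index())
```

With heavily tied data the quantile edges can coincide, in which case qcut raises an error unless duplicates='drop' is passed, and the classes are then no longer equally sized.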

Momentum portfolio(trend following) quant simulation on pandas

I am trying to construct a trend-following momentum portfolio strategy based on the S&P 500 index (monthly data).
I used Kaufman's fractal efficiency ratio to filter out whipsaw signals.
(http://etfhq.com/blog/2011/02/07/kaufmans-efficiency-ratio/)
I succeeded in coding, but it's very clumsy, so I need advice for better code.
Strategy
1. Get data of the S&P 500 index from Yahoo Finance.
2. Calculate Kaufman's efficiency ratio over lookback period X, multiplied by a direction flag (1 if close > close(n), else 0).
3. Average the value calculated in step 2 over lookback periods 1 to 12 ---> monthly asset allocation ratio; 1 - asset allocation ratio = cash (3% per year).
I am having difficulty averaging the efficiency ratios for periods 1 to 12. I know it should be easy to implement with a for loop, but my attempts failed.
I need more concise and refined code. Can anybody help me?
The a['meanfractal'] line bothers me in the code below.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pandas_datareader.data as web
def price(stock, start):
    price = web.DataReader(name=stock, data_source='yahoo', start=start)['Adj Close']
    return price.div(price.iat[0]).resample('M').last().to_frame('price')
a = price('SPY','2000-01-01')
def fractal(a,p):
    a['direction'] = np.where(a['price'].diff(p)>0,1,0)
    a['abs'] = a['price'].diff(p).abs()
    a['volatility'] = a.price.diff().abs().rolling(p).sum()
    a['fractal'] = a['abs'].values/a['volatility'].values*a['direction'].values
    return a['fractal']
def meanfractal(a):
    a['meanfractal']= (fractal(a,1).values+fractal(a,2).values+fractal(a,3).values+fractal(a,4).values+fractal(a,5).values+fractal(a,6).values+fractal(a,7).values+fractal(a,8).values+fractal(a,9).values+fractal(a,10).values+fractal(a,11).values+fractal(a,12).values)/12
    a['portfolio1'] = (a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+(1-a.meanfractal.shift(1).values)*1.03**(1/12)).cumprod()
    a['portfolio2'] = ((a.price/a.price.shift(1).values*a.meanfractal.shift(1).values+1.03**(1/12))/(1+a.meanfractal.shift(1))).cumprod()
    a=a.dropna()
    a=a.div(a.iloc[0]) # .ix was removed from pandas; .iloc selects the first row by position
    return a[['price','portfolio1','portfolio2']].plot()
print(a)
plt.show()
You could simplify further by storing the values corresponding to p in a DF rather than computing for each series separately as shown:
def fractal(a, p):
    df = pd.DataFrame()
    for count in range(1,p+1):
        a['direction'] = np.where(a['price'].diff(count)>0,1,0)
        a['abs'] = a['price'].diff(count).abs()
        a['volatility'] = a.price.diff().abs().rolling(count).sum()
        a['fractal'] = a['abs']/a['volatility']*a['direction']
        df = pd.concat([df, a['fractal']], axis=1)
    return df
Then, you could assign the repeating operations to a variable which reduces the re-computation time.
def meanfractal(a, l=12):
    a['meanfractal']= pd.DataFrame(fractal(a, l)).sum(axis=1,skipna=False)/l
    mean_shift = a['meanfractal'].shift(1)
    price_shift = a['price'].shift(1)
    factor = 1.03**(1/l)
    a['portfolio1'] = (a['price']/price_shift*mean_shift+(1-mean_shift)*factor).cumprod()
    a['portfolio2'] = ((a['price']/price_shift*mean_shift+factor)/(1+mean_shift)).cumprod()
    a.dropna(inplace=True)
    a = a.div(a.iloc[0]) # .ix was removed from pandas; .iloc selects the first row by position
    return a[['price','portfolio1','portfolio2']].plot()
Resulting plot obtained:
meanfractal(a)
Note: If speed is not a major concern, you could perform the operations via the built-in pandas methods instead of converting the columns into their corresponding numpy array values.
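The twelve repeated fractal(...) calls can also be collapsed with a list comprehension that never mutates a. A minimal sketch, assuming the same definition of the ratio as in the question; the names efficiency_ratio and mean_efficiency_ratio and the synthetic price series are illustrative only.

```python
import numpy as np
import pandas as pd

def efficiency_ratio(price, p):
    """Kaufman efficiency ratio over p periods, zeroed when the net move is not up."""
    net_move = price.diff(p)
    path_length = price.diff().abs().rolling(p).sum()
    direction = (net_move > 0).astype(float)
    return net_move.abs() / path_length * direction

def mean_efficiency_ratio(price, max_lookback=12):
    """Average the ratio over lookbacks 1..max_lookback; NaN until all are defined."""
    ratios = [efficiency_ratio(price, p) for p in range(1, max_lookback + 1)]
    return pd.concat(ratios, axis=1).mean(axis=1, skipna=False)

# Synthetic, steadily rising monthly prices: every ratio is 1 once defined
price = pd.Series(100.0 + np.arange(24.0))
print(mean_efficiency_ratio(price))
```

With the question's data this would be called as mean_efficiency_ratio(a['price']) and the result assigned to a['meanfractal']. A completely flat stretch of prices gives 0/0 and therefore NaN, which may need separate handling.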

Plotting a discrete variable over time (scarf plot)

I have time series data from a repeated-measures eyetracking experiment.
The dataset consists of a number of respondents, and for each respondent there are 48 trials.
The data set has a variable ('saccade'), which records the transitions between eye fixations, and a variable ('time'), which ranges from 0 to 1 within each trial. The transitions are classified into three categories ('ver', 'hor' and 'diag').
Here is a script that will create a small example data set in python (one participant and two trials):
import numpy as np
import pandas as pd
saccade1 = np.array(['diag','hor','ver','hor','diag','ver','hor','diag','diag',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','hor','hor','diag',
'diag','ver','ver','ver','ver'])
time1 = np.array(range(len(saccade1)))/float(len(saccade1)-1)
trial1 = [1]*len(time1)
saccade2 = np.array(['diag','ver','hor','diag','diag','diag','hor','ver','hor',
'diag','hor','ver','ver','ver','ver','diag','ver','ver','hor','diag',
'diag','hor','hor','diag','diag','ver','ver','ver','ver','hor','diag','diag'])
time2 = np.array(range(len(saccade2)))/float(len(saccade2)-1)
trial2 = [2]*len(time2)
saccade = np.append(saccade1,saccade2)
time = np.append(time1,time2)
trial = np.append(trial1,trial2)
subject = [1]*len(time)
df = pd.DataFrame(index=range(len(subject)))
df['subject'] = subject
df['saccade'] = saccade
df['trial'] = trial
df['time'] = time
Alternatively I have made a csv-file with the same data which can be downloaded here
I would like to be able to make a so-called scarf plot to visualize the sequence of transitions over time, but I have no clue how to make these plots.
I would like plots (for each participant separately) where time is on the x-axis and trial is on the y-axis. For each trial I would like the transitions represented as colored "stacked" bars.
The only example I have of these kinds of plots are in the book "Eye Tracking - A comprehensive guide to methods and measures" (fig. 6.8b) link
Can anyone help me do this?
(I can work with either Python or R, preferably Python.)
Here is a solution in R using ggplot2. You need to compute time2 so that it indicates the elapsed time instead of the total time.
library(ggplot2)
dataset <- read.csv("~/Downloads/example_data_for_scarf.csv")
dataset$trial <- factor(dataset$trial)
dataset$saccade <- factor(dataset$saccade)
dataset$time2 <- c(0, diff(dataset$time))
dataset$time2[dataset$time == 0] <- 0
ggplot(dataset, aes(x = trial, y = time2, fill = saccade)) +
geom_bar(stat = "identity") +
coord_flip()
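Since the question prefers Python, here is a rough matplotlib counterpart, offered as a sketch rather than a polished solution. Like the R code, it gives each row a bar as wide as the time elapsed since the previous row. The helper names scarf_segments and scarf_plot, the colour mapping and the tiny example frame are all made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Patch

# Arbitrary colour choice for the three transition categories
COLORS = {'ver': 'tab:blue', 'hor': 'tab:orange', 'diag': 'tab:green'}

def scarf_segments(trial_df):
    """Return (start, width, category) per transition within one trial."""
    trial_df = trial_df.sort_values('time')
    times = trial_df['time'].to_numpy()
    # Like the R code: each row spans the time elapsed since the previous row
    return list(zip(times[:-1], np.diff(times), trial_df['saccade'].to_numpy()[1:]))

def scarf_plot(df, ax=None):
    """Draw one horizontal bar per trial, coloured by transition category."""
    if ax is None:
        _, ax = plt.subplots()
    for trial, trial_df in df.groupby('trial'):
        segments = scarf_segments(trial_df)
        ax.broken_barh([(start, width) for start, width, _ in segments],
                       (trial - 0.4, 0.8),
                       facecolors=[COLORS[cat] for _, _, cat in segments])
    ax.set_xlabel('time')
    ax.set_ylabel('trial')
    ax.set_yticks(sorted(df['trial'].unique()))
    ax.legend(handles=[Patch(color=c, label=k) for k, c in COLORS.items()])
    return ax

# Tiny example with the same columns as the question's data set
example = pd.DataFrame({
    'trial': [1, 1, 1, 2, 2, 2],
    'time': [0.0, 0.5, 1.0, 0.0, 0.25, 1.0],
    'saccade': ['diag', 'hor', 'ver', 'ver', 'diag', 'hor'],
})
ax = scarf_plot(example)
```

For one plot per participant, call scarf_plot on df[df['subject'] == s] for each subject s, then plt.show() or plt.savefig(...) as needed.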
