Here is a sample ADF test in Python that checks for cointegration between two series. However, it returns only a single numeric result for the whole sample. How can I get the historical (expanding-window) results of the cointegration test?
Taken from http://www.leinenbock.com/adf-test-in-python/
import numpy as np
import statsmodels.api as stat
import statsmodels.tsa.stattools as ts

x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)

def cointegration_test(y, x):
    # regress y on x and run an ADF test on the residuals (Engle-Granger style)
    result = stat.OLS(y, x).fit()
    return ts.adfuller(result.resid)
I assume you want to test for expanding-window cointegration? Note that you should use sm.tsa.coint to test for cointegration rather than running the ADF test on OLS residuals yourself. You could test for a historical cointegrating relationship between realgdp and realdpi using pandas like so:
import pandas as pd
import statsmodels.api as sm

data = sm.datasets.macrodata.load_pandas().data

def rolling_coint(x, y):
    # truncate y to the current expanding window of x
    yy = y[:len(x)]
    # sm.tsa.coint returns (t-stat, p-value, critical values); keep the p-value
    return sm.tsa.coint(x, yy)[1]

# expanding-window cointegration p-values, starting once 36 observations are available
historical_coint = data.realgdp.expanding(min_periods=36).apply(
    rolling_coint, args=(data.realdpi,))
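The result is a Series of p-values aligned with realgdp's index, so you can plot it to see how the evidence for cointegration evolves over the sample. A minimal usage sketch:
import matplotlib.pyplot as plt

# p-values below 0.05 suggest cointegration over the expanding sample
historical_coint.plot()
plt.axhline(0.05, linestyle='--')
plt.ylabel('cointegration p-value')
plt.show()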
Here I create the output x in the first method (null_checking) and want to use that same output x as the input to the second method (variance) within the class. I have tried a lot but could not get it to work.
# import dependencies
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
import pandas as pd
from .prac import checking_null

# import dataset
datasets = pd.read_csv('Car_sales.csv')

features = ['Fuel_efficiency', 'Power_perf_factor', 'Engine_size', 'Horsepower', 'Fuel_capacity', 'Curb_weight']
class extraction():
    def __init__(self, datasets, features):
        self.features = features
        self.datasets = datasets

    # checking for and removing null rows present in the dataframe
    def null_checking(self):
        datasets1 = self.datasets[self.features]
        print(datasets1)
        for items, x in enumerate(datasets1):
            if items == 'True':
                continue
            x = datasets1.dropna(axis=0)
            print(x)

    # calculating variance inflation factors
    def variance(self):
        x.inner_display()  # fails: x only exists inside null_checking
        # we need an intercept for calculating the variance inflation factor
        x['intercept'] = 1
        print(x)
        # making the new dataframe
        df = pd.DataFrame()
        df['variables'] = x.columns
        df['VIF'] = [variance_inflation_factor(x, i) for i in range(x.shape[1])]
        print(df)
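One way to make this work (a minimal sketch, assuming what you want from null_checking is the cleaned dataframe) is to return the result from the first method, or store it on self, and then use it in the second method:
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

class Extraction:
    def __init__(self, datasets, features):
        self.datasets = datasets
        self.features = features

    def null_checking(self):
        # keep only the selected features and drop rows with nulls
        cleaned = self.datasets[self.features].dropna(axis=0)
        self.cleaned = cleaned  # store for other methods
        return cleaned

    def variance(self):
        # reuse the output of null_checking (compute it if not done yet)
        x = getattr(self, 'cleaned', None)
        if x is None:
            x = self.null_checking()
        x = x.copy()
        # an intercept column is needed for the variance inflation factor
        x['intercept'] = 1
        df = pd.DataFrame()
        df['variables'] = x.columns
        df['VIF'] = [variance_inflation_factor(x.values, i) for i in range(x.shape[1])]
        return df

With this, Extraction(datasets, features).variance() works without referring to a variable that only exists inside the other method.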
The data shown in the screenshot above is referred to as sample.xlsx below. I've been having trouble getting the beta for each stock using LinearRegression().
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()

for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value(model.coef_) for each stock?
x and y must have the same number of samples, so you need to reshape y into one row and one column as well. In this case that amounts to the following:
np.array(mean).reshape(-1, 1) or np.array(mean).reshape(1, 1)
Given that you are training 5 models, each on just one sample, it is not surprising that all 5 will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.34 (y).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
    "ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()

for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1, 1)
    y = np.array(mean).reshape(-1, 1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
This is correct from an algorithmic point of view, but it makes no practical sense given that you are only providing one sample to train each model.
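For the regression to be meaningful, each stock needs several observations. A minimal sketch, assuming you have a time series of percent changes per stock plus a benchmark series to regress against (the column names and values here are hypothetical):
import pandas as pd
from sklearn.linear_model import LinearRegression

# hypothetical data: percent change per period for each stock and a market index
returns = pd.DataFrame({
    "ABCD":   [0.5, -1.2, 0.8, 2.1, -0.3],
    "XYZ":    [1.1,  0.4, -0.9, 1.5, 0.2],
    "market": [0.7, -0.5, 0.3, 1.8, -0.1],
})

for symbol in ["ABCD", "XYZ"]:
    x = returns[["market"]].values  # shape (n_periods, 1)
    y = returns[symbol].values      # shape (n_periods,)
    model = LinearRegression().fit(x, y)
    print(symbol, "beta:", model.coef_[0])  # slope vs. the benchmark = beta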
Using DBSCAN, I am iterating through a CSV with x, y, z, and id data. I would like to generate a new CSV for every combination of eps and minimum samples within a set range.
The output CSV should also include the number of points in the cluster, the eps value, and the minimum samples used as columns, along with the original x, y, z, and id data. The code below does this, but only creates a CSV for the last cluster calculated.
%matplotlib notebook
from interp import *
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
import itertools

points_file = "examples/example-1/fcr.csv"

eps_values = np.arange(0.1, 0.5, 0.1)
min_samples = np.arange(5, 20)
dbscan_params = list(itertools.product(eps_values, min_samples))

cluster_n = []
epsvalues = []
min_samp = []

for p in dbscan_params:
    dbscan_cluster = DBSCAN(eps=p[0], min_samples=p[1]).fit(points)
    epsvalues.append(p[0])
    min_samp.append(p[1])
    cluster_n.append(len(np.unique(dbscan_cluster.labels_)))

df = pd.read_csv(points_file, header=None, names=["x", "y", "z", "id"])
df["cluster"] = clustering.labels_
df["eps"] = eps
df["min_sample"] = min_sam
csv_name = f'fcr_{eps}e{min_sam}m.csv'
df.to_csv(csv_name, index=False)
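The CSV writing happens after the loop, so it only ever uses the last iteration's values. To get one file per parameter combination, write the CSV inside the loop with that iteration's labels and parameters. A minimal sketch, assuming points holds the same rows as points_file:
df_base = pd.read_csv(points_file, header=None, names=["x", "y", "z", "id"])

for eps, min_sam in dbscan_params:
    labels = DBSCAN(eps=eps, min_samples=min_sam).fit(points).labels_
    df = df_base.copy()
    df["cluster"] = labels
    df["n_clusters"] = len(np.unique(labels))  # number of clusters found
    df["eps"] = eps
    df["min_sample"] = min_sam
    # one file per (eps, min_samples) combination
    df.to_csv(f"fcr_{eps:.1f}e{min_sam}m.csv", index=False)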
Every time I run a VARMAX model I get different coefficients.
Is there any way I could replicate my previous results without imposing a seed?
Thank you.
I tried to replicate the VARMA(p,q) example posted on the statsmodels webpage (https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_varmax.html). To check the replicability of the results, I just added a loop to re-estimate the model and a dataframe (parameters) to save the results. This is my code:
%matplotlib inline
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

dta = sm.datasets.webuse('lutkepohl2', 'https://www.stata-press.com/data/r12/')
dta.index = dta.qtr
endog = dta.loc['1960-04-01':'1978-10-01', ['dln_inv', 'dln_inc', 'dln_consump']]
exog = endog['dln_consump']

parameters = pd.DataFrame()
for p in range(10):
    print(p)
    mod = sm.tsa.VARMAX(endog[['dln_inv', 'dln_inc']], order=(1, 1))
    res = mod.fit(maxiter=1000, disp=False)
    print(res.summary())
    param = pd.DataFrame(res.params, columns=["estimation " + str(p)])
    parameters = pd.concat([parameters, param], axis=1)

print(parameters)
As you can see, the results change every time I re-estimate the model:
estimation 0 estimation 1 estimation 2 \
const.dln_inv 0.010974 0.010934 0.010934
const.dln_inc 0.016554 0.016536 0.016536
L1.dln_inv.dln_inv -0.010164 -0.010087 -0.010087
L1.dln_inc.dln_inv 0.360306 0.362187 0.362187
L1.dln_inv.dln_inc -0.032975 -0.033071 -0.033071
L1.dln_inc.dln_inc 0.230657 0.231421 0.231421
L1.e(dln_inv).dln_inv -0.249916 -0.250307 -0.250307
L1.e(dln_inc).dln_inv 0.125546 0.125581 0.125581
L1.e(dln_inv).dln_inc 0.088878 0.089001 0.089001
L1.e(dln_inc).dln_inc -0.235258 -0.235176 -0.235176
sqrt.var.dln_inv 0.044926 0.044927 0.044927
sqrt.cov.dln_inv.dln_inc 0.001670 0.001662 0.001662
sqrt.var.dln_inc 0.011554 0.011554 0.011554
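One hedged suggestion: if the run-to-run variation comes from the optimizer's starting point, passing explicit, deterministic start values should make repeated fits reproducible. start_params is an existing property of statsmodels state-space models; whether this removes all of the variation in your setup is an assumption to verify:
# fix the starting values so every fit begins from the same point
mod = sm.tsa.VARMAX(endog[['dln_inv', 'dln_inc']], order=(1, 1))
start = mod.start_params  # deterministic default starting values
res = mod.fit(start_params=start, maxiter=1000, disp=False)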
The following code gives me this error: ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
The error is produced on the line where the prediction is invoked. I am assuming something is wrong with the shape of the dataframe obs_to_pred. I checked the shape, which is (1046, 3).
What do you recommend so I can fix this and run the prediction?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import scipy.stats as stats
from sklearn import linear_model

# Import Titanic data
train_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/train.csv'
test_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/test.csv'
train = pd.read_csv(train_loc)
test = pd.read_csv(test_loc)

# Predict missing Age values based on the factors Pclass, SibSp, and Parch.
# Inside the function, combine the train and test data.
def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    allobs = allobs[~allobs.Age.isnull()]
    y = allobs.Age
    y, X = dmatrices('y ~ Pclass + SibSp + Parch', data=allobs, return_type='dataframe')
    mod = sm.OLS(y, X)
    res = mod.fit()
    predictors = ['Pclass', 'SibSp', 'Parch']
    regr = linear_model.LinearRegression()
    regr.fit(allobs.loc[:, predictors], y)
    obs_to_pred = allobs[allobs.Age.isnull()].loc[:, predictors]
    prediction = regr.predict(obs_to_pred)  # error produced on this line ***
    return res.summary(), prediction

regressionPred(train, test)
In case you may want to look at the dataset, the link will take you there: https://www.kaggle.com/c/titanic/data
In the line
allobs = allobs[~allobs.Age.isnull()]
you redefine allobs to contain only the rows with no NaN in the Age column.
Later, with
obs_to_pred = allobs[allobs.Age.isnull()].loc[:, predictors]
you have no cases left to predict on: allobs.Age.isnull() evaluates to False for every remaining row, so obs_to_pred is empty. Hence your error:
array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
Check the logic of what you actually want to predict.
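A minimal sketch of a fix, assuming the intent is to fit on rows where Age is known and predict the rows where it is missing (keep both subsets instead of overwriting allobs):
def regressionPred(traindata, testdata):
    predictors = ['Pclass', 'SibSp', 'Parch']
    allobs = pd.concat([traindata, testdata])

    known = allobs[~allobs.Age.isnull()]   # rows with Age: used for fitting
    missing = allobs[allobs.Age.isnull()]  # rows without Age: used for prediction

    regr = linear_model.LinearRegression()
    regr.fit(known.loc[:, predictors], known.Age)

    # predict Age for the rows where it is missing
    return regr.predict(missing.loc[:, predictors])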