I have a model whose residuals I would like to analyse. Ultimately, I would like to identify extreme residuals that lie outside the confidence interval for each day, but I am having trouble calculating the pointwise standard deviation of the residuals across the models in the bagging regressor.
My sample code is below:
import datetime
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.ensemble import BaggingRegressor
# Sample DataFrame
df = pd.DataFrame(np.random.randint(0,200,size=(500, 4)), columns=list('ABCD'))
# Add dates to sample data
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(500)]
df['date'] = date_list
df['date'] = df['date'].astype('str')
# Split dataset into testing and training
train = df[:int(len(df)*0.80)]
test = df[int(len(df)*0.20):]
X_train = train[['B','C','D','date']]
X_test = test[['B','C','D','date']]
y_train = train[['A']]
y_test = test[['A']]
# Function to encode the data
def encode_and_bind(data_in, feature_to_encode):
    dummies = pd.get_dummies(data_in[[feature_to_encode]])
    data_out = pd.concat([data_in, dummies], axis=1)
    data_out = data_out.drop([feature_to_encode], axis=1)
    return data_out
# One-hot encode the only categorical feature, 'date'
X_train_final = encode_and_bind(X_train, 'date')
X_test_final = encode_and_bind(X_test, 'date')
# Define Model
svr_lin = SVR(kernel="linear", C=100, gamma="auto")
regr = BaggingRegressor(base_estimator=svr_lin,random_state=5).fit(X_train_final, y_train.values.ravel())
# Predictions
y_pred = regr.predict(X_test_final)
# Join the predictions back into the original dataframe
y_test['predict'] = y_pred
# Calculate residuals
y_test['residuals'] = y_test['A'] - y_test['predict']
I found this method online
raw_pred = [x.predict([[0, 0, 0, 0]]) for x in regr.estimators_]
but am not sure of what to use for the x.predict([[0, 0, 0, 0]]) part since I have far more than 4 features.
EDIT:
Building off of @2MuchC0ff33's answer, I tried:
stdevs = []
for dates in X_test_final.columns[3:]:
    test = X_test_final[X_test_final[dates]==1]
    raw_pred = [x.predict([test.iloc[0]]) for x in regr.estimators_]
    sdev = np.std(raw_pred)
    sdev = sdev.astype('str')
    stdevs.append(dates + "," + sdev)
It seems to be correct, but I don't know enough about how these calculations are done to judge whether it works the way I think it does.
F, thanks for sharing your attempt from my answer.
I am going to try to break everything down and hopefully give you the solution you need. Apologies in advance if I repeat some of your code, but that is how my brain works, haha.
You can group the residuals by date and calculate the standard deviation for each group to calculate the pointwise standard deviation of residuals for each day. Here's how to go about it:
# y_test has no 'date' column yet, so take it from X_test (same row index)
y_test['date'] = X_test['date'].str[:10]
grouped = y_test.groupby(['date'])
residual_groups = grouped['residuals']
residual_stds = residual_groups.std()
This will give you the residual standard deviation for each day. For each day, multiply the standard deviation by a constant such as 1.96 (for a 95% confidence interval) and add/subtract it from the mean of the residuals.
residual_means = residual_groups.mean()
CI = 1.96 * residual_stds
upper_bound = residual_means + CI
lower_bound = residual_means - CI
Finally, by comparing the residuals with the lower and upper bounds, you can identify the extreme residuals that lie outside the confidence interval for each day:
extreme_residuals = y_test[(y_test['residuals'] > y_test['date'].map(upper_bound)) | (y_test['residuals'] < y_test['date'].map(lower_bound))]
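As a self-contained illustration of the per-day bounds, here is a minimal sketch on made-up residuals (the toy frame and its values are assumptions, not the asker's data). It uses groupby(...).transform so that the bounds line up row by row with the residuals, and it needs more than one residual per day for the standard deviation to be defined.

```python
import numpy as np
import pandas as pd

# Hypothetical residuals: 20 observations on each of two days
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "date": ["2023-01-01"] * 20 + ["2023-01-02"] * 20,
    "residuals": rng.normal(0.0, 1.0, size=40),
})
# Plant one obviously extreme residual on the first day
toy.loc[0, "residuals"] = 25.0

# Per-day mean and standard deviation, broadcast back to every row
by_day = toy.groupby("date")["residuals"]
day_mean = by_day.transform("mean")
day_std = by_day.transform("std")

# 1.96 standard deviations either side of the daily mean
upper_bound = day_mean + 1.96 * day_std
lower_bound = day_mean - 1.96 * day_std

extreme_residuals = toy[(toy["residuals"] > upper_bound) | (toy["residuals"] < lower_bound)]
print(extreme_residuals)
```

Because the bounds share the frame's index, the comparison needs no extra alignment step.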
You can extend this method to find the standard deviation of the individual estimators' predictions for each day.
# Group the test data by date (the encoded frame no longer has a 'date' column, so group by the original one)
grouped = X_test_final.groupby(X_test['date'])
stdevs = []
for name, group in grouped:
    raw_pred = [x.predict(group) for x in regr.estimators_]
    # Calculate the standard deviation of the predictions for each group
    sdev = np.std(raw_pred)
    stdevs.append((name, sdev))
I think we could replace [0, 0, 0, 0] with a row of X_test_final. Let me know your thoughts on my updated method below:
raw_pred = [x.predict([X_test_final.iloc[0]]) for x in regr.estimators_]
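To get a pointwise standard deviation for every test row at once, one option is to predict the whole test matrix with each fitted estimator and take the standard deviation across estimators. The sketch below is an illustration under assumptions: it uses synthetic data and the default decision-tree base estimator as stand-ins for the encoded frame and the SVR, and it indexes columns with estimators_features_, which records the features each estimator was fitted on.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

# Synthetic stand-in data: 200 training rows, 50 test rows, 6 features
rng = np.random.default_rng(5)
X_train = rng.normal(size=(200, 6))
y_train = X_train[:, 0] * 3.0 + rng.normal(size=200)
X_test = rng.normal(size=(50, 6))

# Default base estimator (a decision tree) keeps the sketch fast
regr = BaggingRegressor(n_estimators=10, random_state=5).fit(X_train, y_train)

# One row of predictions per estimator: shape (n_estimators, n_test_rows).
# Each estimator is given the columns listed in estimators_features_.
raw_pred = np.stack([
    est.predict(X_test[:, feats])
    for est, feats in zip(regr.estimators_, regr.estimators_features_)
])

# Spread of the individual predictions for each test row
pointwise_std = raw_pred.std(axis=0)
print(pointwise_std.shape)
```

With a DataFrame such as X_test_final you would presumably pass X_test_final.to_numpy(dtype=float) in place of X_test.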
Related
I run IPR outlier control on a relatively big DataFrame df (more than 1,000,000 rows).
I perform the IPR within subsets of the data, so I use a for loop.
How can I write the result back to the original df? It looks like this:
   months product  brick  units  is_outlier
0  202104     abc      3   1.00       False
1  202104     abc      6   3.00       False
for product in df['product'].unique():
    for brick in df['brick'].unique():
        try:
            # Extract the units for the current product and brick
            data = df.loc[(df['product'] == product) & (df['brick'] == brick)]['units'].values
            # Scale the data
            scaler = StandardScaler()
            data_scaled = scaler.fit_transform(data.reshape(-1, 1))
            # Fit a linear regression model to the data
            reg = LinearRegression()
            reg.fit(np.arange(len(data_scaled)).reshape(-1, 1), data_scaled)
            # Calculate the residuals of the regression
            residuals = data_scaled - reg.predict(np.arange(len(data_scaled)).reshape(-1, 1))
            # Identify any observations with a residual larger than 2 standard deviations from the mean
            threshold = 2*residuals.std()
            outliers = np.where(np.abs(residuals) > threshold)
            # Set the "is_outlier" column to True for the outliers in the current product
            df.loc[(df['product'] == product ) & (df['brick']== brick) & (df.index.isin(outliers[0])), 'is_outlier'] = True
        except:
            pass
As @QuangHoang suggested, use groupby and apply your custom function:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
def outlier(df):
    data = df.to_numpy().reshape((-1, 1))
    # Scale the data
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data)
    # Fit a linear regression model to the data
    reg = LinearRegression()
    reg.fit(np.arange(len(data_scaled)).reshape(-1, 1), data_scaled)
    # Calculate the residuals of the regression
    residuals = data_scaled - reg.predict(np.arange(len(data_scaled)).reshape(-1, 1))
    # Identify any observations with a residual
    # larger than 2 standard deviations from the mean
    threshold = 2*residuals.std()
    return np.ravel(np.abs(residuals) > threshold)
df['is_outlier'] = df.groupby(['product', 'brick'])['units'].transform(outlier)
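For a quick check of the groupby/transform pattern without scikit-learn, here is a small sketch on made-up data. It swaps StandardScaler and LinearRegression for np.polyfit (a plain least-squares line), so it is a stand-in for the function above rather than a copy of it; the column values and the helper name flag_outliers are hypothetical.

```python
import numpy as np
import pandas as pd

def flag_outliers(units):
    """Flag points whose residual from a fitted line exceeds 2 residual std devs."""
    y = units.to_numpy(dtype=float)
    x = np.arange(len(y))
    # Least-squares straight line through the group's values
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return np.abs(residuals) > 2 * residuals.std()

# Hypothetical data: two (product, brick) groups of 12 rows each
df = pd.DataFrame({
    "product": ["abc"] * 12 + ["xyz"] * 12,
    "brick": [3] * 12 + [6] * 12,
    "units": [1.0] * 12 + [2.0, 3.0] * 6,
})
df.loc[5, "units"] = 50.0  # plant a spike in the first group

# transform returns one value per input row, aligned with the original index,
# so the result can be assigned straight back to the full frame
flags = df.groupby(["product", "brick"])["units"].transform(flag_outliers)
df["is_outlier"] = flags.astype(bool)
print(df[df["is_outlier"]])
```

The point of transform here is that no manual bookkeeping of row positions is needed, which is what the loop version struggled with.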
I have fit a linearmodels.PanelOLS model and stored it in m. I now want to test if certain coefficients are simultaneously equal to zero.
Does a fitted linearmodels.PanelOLS object have an F-test function where I can pass my own restriction matrix?
I am looking for something like statsmodels' f_test method.
Here's a minimal reproducible example.
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
# Is there an f_test method for m???
m.f_test(r_mat=some_matrix_here) # Something along these lines?
You can use wald_test (a standard F-test is numerically identical to a Wald test under some assumptions on the covariance).
# Libraries
from linearmodels.panel import PanelOLS
from linearmodels.datasets import wage_panel
# Load data and set index
df = wage_panel.load()
df = df.set_index(['nr','year'])
# Add constant term
df['const'] = 1
# Fit model
m = PanelOLS(dependent=df['lwage'], exog=df[['const','expersq','married']])
m = m.fit(cov_type='clustered', cluster_entity=True)
Then the test
import numpy as np
# Use matrix notation RB - q = 0 where R is restr and q is value
# Restrictions: expersq = 0.001 & expersq+married = 0.2
restr = np.array([[0,1,0],[0,1,1]])
value = np.array([0.001, 0.2])
m.wald_test(restr, value)
This returns
Linear Equality Hypothesis Test
H0: Linear equality constraint is valid
Statistic: 0.2608
P-value: 0.8778
Distributed: chi2(2)
WaldTestStatistic, id: 0x2271cc6fdf0
You can also use formula syntax if you used formulas to define your model, which can be easier to code up.
fm = PanelOLS.from_formula("lwage~ 1 + expersq + married", data=df)
fm = fm.fit(cov_type='clustered', cluster_entity=True)
fm.wald_test(formula="expersq = 0.001,expersq+married = 0.2")
The result is the same as above.
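To make the R and q notation concrete, here is a sketch of the textbook Wald statistic W = (Rb - q)' (R V R')^{-1} (Rb - q) computed by hand with NumPy. The coefficient vector b and covariance V are made-up numbers, not output from the model above, and the closed-form p-value exp(-W/2) applies only because there are exactly two restrictions (chi2 with 2 degrees of freedom). With a fitted model you would presumably take b and V from the results object; wald_test remains the simpler route.

```python
import numpy as np

# Hypothetical estimates for (const, expersq, married) and their covariance
b = np.array([1.5, 0.004, 0.18])
V = np.diag([0.01, 0.000004, 0.0009])

# Restrictions R b = q: expersq = 0.001 and expersq + married = 0.2
R = np.array([[0, 1, 0], [0, 1, 1]], dtype=float)
q = np.array([0.001, 0.2])

# Wald statistic: (Rb - q)' (R V R')^{-1} (Rb - q)
diff = R @ b - q
W = float(diff @ np.linalg.solve(R @ V @ R.T, diff))

# Under H0, W is asymptotically chi2 with one degree of freedom per restriction.
# For exactly 2 degrees of freedom the survival function is exp(-W / 2).
p_value = float(np.exp(-W / 2))
print(W, p_value)
```

Each row of R picks out one linear combination of coefficients, and the matching entry of q is the value that combination is tested against.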
Currently, this is the code I have. Not attached are various graphs I have made that show the actual stock price from the CSV alongside my projections. I want to simply predict tomorrow's stock price given all of this historical data, but I'm having a difficult time. The line "df.loc[len(df.index)] = ['2022-04-05',0,0,0,0,0,0]" is where I was trying to make room for predictions for future days, although I am open to other ways.
# Machine learning
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# For data manipulation
import pandas as pd
import numpy as np
# To plot
import matplotlib.pyplot as plt
plt.style.use('seaborn-darkgrid')
# To ignore warnings
import warnings
warnings.filterwarnings("ignore")
# method of pandas
df = pd.read_csv('data_files/MSFT.csv')
#add extra row of blank data for future prediction
df.loc[len(df.index)] = ['2022-04-05',0,0,0,0,0,0]
df.loc[len(df.index)] = ['2022-04-06',0,0,0,0,0,0]
df.loc[len(df.index)] = ['2022-04-07',0,0,0,0,0,0]
df.loc[len(df.index)] = ['2022-04-08',0,0,0,0,0,0]
# Changes The Date column as index columns
df.index = pd.to_datetime(df['Date'])
# drop The original date column
df = df.drop(['Date'], axis='columns')
print(df)
# Create predictor variables
df['Open-Close'] = df.Open - df.Close
df['High-Low'] = df.High - df.Low
# Store all predictor variables in a variable X
X = df[['Open-Close', 'High-Low']]
X.head()
# Target variables
y = np.where(df['Close'].shift(-1) > df['Close'], 1, 0)
print(y)
split_percentage = 0.8
split = int(split_percentage*len(df))
# Train data set
X_train = X[:split]
y_train = y[:split]
# Test data set
X_test = X[split:]
y_test = y[split:]
# Support vector classifier
cls = SVC().fit(X_train, y_train)
df['Predicted_Signal'] = cls.predict(X)
# Calculate daily returns
df['Return'] = df.Close.pct_change()
# Calculate strategy returns
df['Strategy_Return'] = df.Return * df.Predicted_Signal.shift(1)
# Calculate cumulative returns
df['Cum_Ret'] = df['Return'].cumsum()
# Plot Strategy Cumulative returns
df['Cum_Strategy'] = df['Strategy_Return'].cumsum()
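One way to avoid the zero-filled placeholder rows is to train only on rows whose next-day close is known and then call predict on the features of the most recent row. The sketch below is a rough illustration on synthetic prices (the data, dates and column values are made up, standing in for the CSV), and it predicts the up/down signal that the classifier above is trained on, not a price level.

```python
import numpy as np
import pandas as pd
from sklearn.svm import SVC

# Synthetic stand-in for the CSV: 120 business days of made-up prices
rng = np.random.default_rng(0)
idx = pd.bdate_range("2021-10-01", periods=120)
close = 300 + rng.normal(0, 2, size=120).cumsum()
open_ = close + rng.normal(0, 1, size=120)
df = pd.DataFrame({
    "Open": open_,
    "High": np.maximum(open_, close) + rng.uniform(0.5, 2.0, size=120),
    "Low": np.minimum(open_, close) - rng.uniform(0.5, 2.0, size=120),
    "Close": close,
}, index=idx)

# Same predictors as in the question
df["Open-Close"] = df.Open - df.Close
df["High-Low"] = df.High - df.Low
X = df[["Open-Close", "High-Low"]]

# Target: 1 if the next day's close is higher; not defined for the last row
y = (df["Close"].shift(-1) > df["Close"]).astype(int)

# Train only on rows whose next-day close is known (all but the last)
cls = SVC().fit(X.iloc[:-1], y.iloc[:-1])

# The last row has today's features but no label yet: predict tomorrow's signal
tomorrow_signal = int(cls.predict(X.iloc[[-1]])[0])
print(tomorrow_signal)
```

Keeping the last row out of training also avoids the issue that np.where(...shift(-1)...) silently labels it 0.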
Given this simulated data:
import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.statespace.structural import UnobservedComponents
np.random.seed(12345)
ar = np.r_[1, 0.9]
ma = np.array([1])
arma_process = ArmaProcess(ar, ma)
X = 100 + arma_process.generate_sample(nsample=100)
y = 1.2 * X + np.random.normal(size=100)
We build a UnobservedComponents model with the first 70 points to run inferences on the last 30 points like so:
model = UnobservedComponents(y[:70], level='llevel', exog=X[:70])
f_model = model.fit()
forecaster = f_model.get_forecast(
    steps=30,
    exog=X[70:].reshape(-1, 1)
)
conf_int = forecaster.conf_int()
If we observe the mean for the 95% confidence interval, we get the following:
conf_int.mean(axis=0)
array([118.19789195, 122.14101161])
But when trying to get the same values through model simulations, we don't quite get the same results. Here's the script we run for the simulated boundaries:
sim_model = UnobservedComponents(np.zeros(30), level='llevel', exog=X[70:])
res = []
predicted_state = f_model.predicted_state[..., -1]
predicted_state_cov = f_model.predicted_state_cov[..., -1]
for i in range(1000):
    init_state = np.random.multivariate_normal(
        predicted_state,
        predicted_state_cov
    )
    sim = sim_model.simulate(
        f_model.params,
        30,
        initial_state=init_state)
    res.append(sim.mean())
Printing the lower 2.5 and upper 97.5 percentile we get:
np.percentile(res, [2.5, 97.5])
array([119.06735028, 121.26810407])
As we use model simulations to distinguish signal from noise in data, this difference ended up being big enough to lead to contradictory conclusions. For instance, if we set:
y[70:] += 1
Then, according to the first technique, we conclude that the new y carries no signal, as its mean is lower than 122.14. But the same is not true if we use the second technique: as the upper boundary is 121.27, we conclude that there is a signal.
What we are trying to understand now is whether this is expected. Shouldn't the lower and upper 95% confidence interval of both techniques be equal?
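One thing that may contribute to the gap: conf_int() returns a separate interval for each forecast step, so conf_int.mean(axis=0) averages pointwise bounds, whereas the simulation takes percentiles of the mean of each 30-step path. These are different quantities in general. The sketch below illustrates this with plain NumPy on independent Gaussian noise, which is an assumption made for simplicity; it is not the UnobservedComponents model and the numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_steps, n_sims = 30, 20000
mu, sigma = 120.0, 1.0  # hypothetical forecast mean and per-step std dev

# Simulated paths: independent noise around a constant mean
paths = mu + sigma * rng.normal(size=(n_sims, n_steps))

# Technique 1: pointwise 95% bounds at every step, then averaged over steps
pointwise = np.percentile(paths, [2.5, 97.5], axis=0)  # shape (2, n_steps)
avg_of_bounds = pointwise.mean(axis=1)

# Technique 2: 95% bounds of the per-path mean
bounds_of_mean = np.percentile(paths.mean(axis=1), [2.5, 97.5])

width_1 = avg_of_bounds[1] - avg_of_bounds[0]
width_2 = bounds_of_mean[1] - bounds_of_mean[0]
print(width_1, width_2)
```

In a local level model the errors are correlated across steps, so the narrowing is smaller than in this independent case, but the two intervals still need not coincide.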
I am learning how to build a simple linear model to find a flat price based on its squared meters and the number of rooms. I have a .csv data set with several features and of course 'Price' is one of them, but it contains several suspicious values like '1' or '4000'. I want to remove these values based on mean and standard deviation, so I use the following function to remove outliers:
import numpy as np
import pandas as pd
def reject_outliers(data):
    u = np.mean(data)
    s = np.std(data)
    data_filtered = [e for e in data if (u - 2 * s < e < u + 2 * s)]
    return data_filtered
Then I construct a function to build the linear regression:
def linear_regression(data):
    data_filtered = reject_outliers(data['Price'])
    print(len(data_filtered))  # based on the length I see that several outliers have been removed
The next step is to define the data/predictors. I set my features:
features = data[['SqrMeters', 'Rooms']]
target = data_filtered
X = features
Y = target
And here is my question. How can I get the same set of observations for my X and Y? Now I have inconsistent numbers of samples (5000 for my X and 4995 for my Y after removing outliers). Thank you for any help in this topic.
The features and labels should have the same length, and you should pass the whole data object to reject_outliers:
def reject_outliers(data):
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    data_filtered = data[(data["Price"]>(u-2*s)) & (data["Price"]<(u+2*s))]
    return data_filtered
You can use it in this way:
data_filtered=reject_outliers(data)
features = data_filtered[['SqrMeters', 'Rooms']]
target = data_filtered['Price']
X=features
y=target
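A quick way to convince yourself that X and y stay aligned is to run the filter on a small made-up frame (the listings below are invented for illustration):

```python
import numpy as np
import pandas as pd

def reject_outliers(data):
    """Keep rows whose Price lies within 2 standard deviations of the mean."""
    u = np.mean(data["Price"])
    s = np.std(data["Price"])
    return data[(data["Price"] > u - 2 * s) & (data["Price"] < u + 2 * s)]

# Hypothetical listings: ten plausible prices plus one suspicious value
data = pd.DataFrame({
    "SqrMeters": [40, 55, 60, 72, 48, 90, 65, 58, 80, 50, 45],
    "Rooms": [1, 2, 2, 3, 2, 4, 3, 2, 3, 2, 2],
    "Price": [200000, 260000, 280000, 330000, 230000, 410000,
              300000, 270000, 360000, 240000, 5000000],
})

data_filtered = reject_outliers(data)
X = data_filtered[["SqrMeters", "Rooms"]]
y = data_filtered["Price"]

# Features and target come from the same filtered rows
print(len(X), len(y))
```

Because whole rows are dropped, the features and the target always share the same index.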
The following works for pandas DataFrames (data):
def reject_outliers(data):
    u = np.mean(data.Price)
    s = np.std(data.Price)
    data_filtered = data[(data.Price > u-2*s) & (data.Price < u+2*s)]
    return data_filtered