Python: Predict the y value using Statsmodels - Linear Regression

I am using Python's statsmodels library to predict the future balance using linear regression. The CSV file is shown below:
| Year | Balance |
|------|---------|
| 3 | 30 |
| 8 | 57 |
| 9 | 64 |
| 13 | 72 |
| 3 | 36 |
| 6 | 43 |
| 11 | 59 |
| 21 | 90 |
| 1 | 20 |
| 16 | 83 |
It contains 'Year' as the independent (x) variable and 'Balance' as the dependent (y) variable.
Here's the code for Linear Regression for this data:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
import numpy as np
from matplotlib import pyplot as plt
import os
os.chdir(r'C:\Users\Admin\Desktop\csv')  # raw string so the backslashes are not treated as escape sequences
cw = pd.read_csv('data-table.csv')
y=cw.Balance
X=cw.Year
X = sm.add_constant(X) # Adds a constant term to the predictor
est = sm.OLS(y, X)
est = est.fit()
print(est.summary())
print(est.params)
X_prime = np.linspace(X.Year.min(), X.Year.max(), 100)[:, np.newaxis]
X_prime = sm.add_constant(X_prime) # add constant as we did before
y_hat = est.predict(X_prime)
plt.scatter(X.Year, y, alpha=0.3) # Plot the raw data
plt.xlabel("Year")
plt.ylabel("Total Balance")
plt.plot(X_prime[:, 1], y_hat, 'r', alpha=0.9) # Add the regression line, colored in red
plt.show()
The question is: how do I predict the 'Balance' value with statsmodels when 'Year' = 10?

You can use the predict method of the result object est, but to use it successfully with new values you have to fit the model through the formula API (smf is statsmodels.formula.api):
est = smf.ols("y ~ x", data=data).fit()
est.predict(exog=new_values)
where new_values is a dictionary of the new exogenous values.
Check out this link.
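For the concrete case in the question (Year = 10), a minimal sketch using the formula interface could look like this; it assumes the same data-table.csv as above and only illustrates the call:
import pandas as pd
import statsmodels.formula.api as smf
cw = pd.read_csv('data-table.csv')
est = smf.ols("Balance ~ Year", data=cw).fit()
pred = est.predict(exog={"Year": [10]})  # dict of new exogenous values
print(pred)  # predicted Balance when Year = 10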

Related

How is the X axis on a LinearRegression formatted and processed?

I am trying to build a regression line based on the date and the closing price of a stock.
I know the regression can't be calculated directly on dates, so I transform each date into a numerical value.
I have been able to format the data as required.
Here is my sample code:
import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
source = 'C:\\path'
#gets file
df = pd.read_csv(source+'\\ABBN.SW.csv')
#change string to datetime
df['Date'] = pd.to_datetime(df['Date'])
#change datetime to numerical value
df['Date'] = df['Date'].map(dt.datetime.toordinal)
#build X and Y axis
x = np.array(df['Date']).reshape(-1, 1)
y = np.array(df['Close'])
model = LinearRegression()
model.fit(x,y)
print(model.intercept_)
print(model.coef_)
print(x)
[[734623]
[734625]
[734626]
...
[738272]
[738273]
[738274]]
print(y)
[16.54000092 16.61000061 16.5 28.82999992 28.88999939 ... 29.60000038]
intercept: -1824.9528261991056  # completely off the charts, it should be around 18-20
coef: [0.00250826]
The question here is: what am I missing on the X axis (date) to produce a correct intercept?
The coefficient looks right, though.
See the example in Excel (old data).
References used :
https://realpython.com/linear-regression-in-python/
https://medium.com/python-data-analysis/linear-regression-on-time-series-data-like-stock-price-514a42d5ac8a
https://www.alpharithms.com/predicting-stock-prices-with-linear-regression-214618/
I would suggest applying min-max normalisation to your ordinal dates. The intercept is the fitted value at x = 0, and with raw ordinal dates x = 0 sits roughly two thousand years before your sample, which is why it comes out so far off; after normalisation, x = 0 falls at the start of your sample, so you get the desired "small" intercept out of the linear regression.
import datetime as dt
import csv
import pandas as pd
from sklearn.linear_model import LinearRegression
import numpy as np
df = pd.read_csv("data.csv")
df['Date'] = pd.to_datetime(df["Date"])
df['Date_ordinal'] = df["Date"].map(dt.datetime.toordinal)
df["Date_normalized"] = df["Date"].apply(lambda x: len(df["Date"]) * (x - df["Date"].min()) / (df["Date"].max() - df["Date"].min()))
print(df)
def apply_linear(df, label_dates):
    x = np.array(df[label_dates]).reshape(-1, 1)
    y = np.array(df['Close'])
    model = LinearRegression()
    model.fit(x, y)
    print("intercep = ", model.intercept_)
    print("coef = ", model.coef_[0])
print("Without normalization")
apply_linear(df,"Date_ordinal")
print("With normalization")
apply_linear(df,"Date_normalized")
And the results of my execution are as follows, using an invented but representative data set for your purpose:
PS C:\Users\ruben\PycharmProjects\stackOverFlowQnA> python .\main.py
Date Close Date_ordinal Date_normalized
0 2022-04-01 111 738246 0.000000
1 2022-04-02 112 738247 0.818182
2 2022-04-03 120 738248 1.636364
3 2022-04-04 115 738249 2.454545
4 2022-04-05 105 738250 3.272727
5 2022-04-09 95 738254 6.545455
6 2022-04-10 100 738255 7.363636
7 2022-04-11 105 738256 8.181818
8 2022-04-12 112 738257 9.000000
Without normalization
intercep = 743632.8904761908
coef = -1.0071428571428576
With normalization
intercep = 113.70476190476191
coef = -1.2309523809523817
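If you later want to predict the close price for a new date, the same transformation has to be applied to that date before calling predict. A minimal sketch, assuming the df built above; the helper normalize_date and the date 2022-04-13 are only illustrative:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
def normalize_date(d, dates):
    # same formula used for the Date_normalized column above
    return len(dates) * (d - dates.min()) / (dates.max() - dates.min())
model = LinearRegression()
model.fit(np.array(df["Date_normalized"]).reshape(-1, 1), np.array(df["Close"]))
new_date = pd.Timestamp("2022-04-13")  # hypothetical date to predict for
x_new = normalize_date(new_date, df["Date"])
print(model.predict(np.array([[x_new]])))  # predicted Close for that date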

Unable to plot distribution of a column containing binary values using Python

I'm trying to plot the original data, before handling the imbalance, in a way that shows the class distribution and class imbalance (the class is Failure = 0/1). I might need to do some transformation on the data to be able to visualize it.
Here's what the column looks like:
| failure |
|---------|
| 1 |
| 0 |
| 0 |
| 1 |
| 0 |
Here's what I have tried so far:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats.kde import gaussian_kde

def distribution_scatter(x, symmetric=True, cmap=None, size=None):
    # spread the points vertically according to the estimated density so they don't overlap
    pdf = gaussian_kde(x)
    w = np.random.rand(len(x))
    if symmetric:
        w = w*2 - 1
    pseudo_y = pdf(x) * w
    if cmap:
        plt.scatter(x, pseudo_y, c=x, cmap=cmap, s=size)
    else:
        plt.scatter(x, pseudo_y, s=size)
    return pseudo_y
Results:
The problem with the results:
I want to plot the distribution of 0's and 1's, for which I believe I need to transform the data in some way.
Desired output:
If you want a KDE plot, you can check kdeplot from seaborn:
import numpy as np
import seaborn as sns

x = np.random.binomial(1, 0.2, 100)
sns.kdeplot(x)
Output:
Update: Or a swarmplot if you want a scatter:
x = np.random.binomial(1, 0.2, 25)
sns.swarmplot(x=x)
Output:
Update 2: In fact, your function seems to also produce a reasonable visualization:
distribution_scatter(np.random.binomial(1, 0.2, 100))
Output:
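Not part of the answer above, but since the stated goal is to show the class imbalance, a plain count of the two classes is another option. A minimal sketch, assuming the failure column lives in a DataFrame named df:
import seaborn as sns
from matplotlib import pyplot as plt
sns.countplot(x='failure', data=df)  # one bar per class, height = number of rows
plt.title('Class distribution of failure')
plt.show()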

Selecting features in Python

I am trying to implement the algorithm from http://venom.cs.utsa.edu/dmz/techrep/2007/CS-TR-2007-011.pdf
import pandas as pd
import pathlib
import gaitrec
from tsfresh import extract_features
from collections import defaultdict
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import euclidean_distances

class PFA(object):
    def __init__(self, n_features, q=None):
        self.q = q
        self.n_features = n_features

    def fit(self, X):
        if not self.q:
            self.q = X.shape[1]
        pca = PCA(n_components=self.q).fit(X)
        A_q = pca.components_.T
        kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
        clusters = kmeans.predict(A_q)
        cluster_centers = kmeans.cluster_centers_
        dists = defaultdict(list)
        for i, c in enumerate(clusters):
            dist = euclidean_distances(A_q[i, :].reshape(1, -1), cluster_centers[c, :].reshape(1, -1))[0][0]
            dists[c].append((i, dist))
        self.indices_ = [sorted(f, key=lambda x: x[1])[0][0] for f in dists.values()]
        self.features_ = X[:, self.indices_]

p = pathlib.Path(gaitrec.__file__).parent
dataset_file = p / 'DatasetC' / 'subj_001' / 'walk0' / 'subj_0010.csv'
read_csv = pd.read_csv(dataset_file, sep=';', decimal='.', names=['time', 'x', 'y', 'z', 'id'])
read_csv['id'] = 0

if __name__ == '__main__':
    print(read_csv)
    extracted_features = extract_features(read_csv, column_id="id", column_sort="time")
    features_withno_nanvalues = extracted_features.dropna(how='all', axis=1)
    print(features_withno_nanvalues)
    X = features_withno_nanvalues.to_numpy()
    pfa = PFA(n_features=2274, q=1)
    pfa.fit(X)
    Y = pfa.features_
    print(Y)  # features extracted
    column_indices = pfa.indices_  # indices of the features
    print(column_indices)
C:\Users\Thund\AppData\Local\Programs\Python\Python37\python.exe C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py
time x y z id
0 0 -0.833333 0.416667 -0.041667 0
1 1 -0.833333 0.416667 -0.041667 0
2 2 -0.833333 0.416667 -0.041667 0
3 3 -0.833333 0.416667 -0.041667 0
4 4 -0.833333 0.416667 -0.041667 0
... ... ... ... ... ..
1337 1337 -0.833333 0.416667 0.083333 0
1338 1338 -0.833333 0.416667 0.083333 0
1339 1339 -0.916667 0.416667 0.083333 0
1340 1340 -0.958333 0.416667 0.083333 0
1341 1341 -0.958333 0.416667 0.083333 0
[1342 rows x 5 columns]
Feature Extraction: 100%|██████████| 3/3 [00:04<00:00, 1.46s/it]
C:\Users\Thund\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\decomposition\_pca.py:461: RuntimeWarning: invalid value encountered in true_divide
explained_variance_ = (S ** 2) / (n_samples - 1)
variable x__abs_energy ... z__variation_coefficient
id ...
0 1430.496338 ... 5.521904
[1 rows x 2274 columns]
C:/Users/Thund/Desktop/RepoBitbucket/Gaitrec/gaitrec/extraction.py:21: ConvergenceWarning: Number of distinct clusters (2) found smaller than n_clusters (2274). Possibly due to duplicate points in X.
kmeans = KMeans(n_clusters=self.n_features).fit(A_q)
[[1430.49633789 66.95824 ]]
[0, 1]
Process finished with exit code 0
I don't understand the warnings, or why it extracts only the first 2 features out of the 2k+. This is what I did:
Produce the covariance matrix from the original data.
Compute the eigenvectors and eigenvalues of the covariance matrix using the SVD method.
Those two steps combined are what you call PCA. The principal components are the eigenvectors of the covariance matrix of the original data, and the K-means algorithm is then applied to them.
My questions are:
How can I fix the warnings it gives me?
It only selects 2 features out of the 2k+, so is something wrong?
As mentioned in the comments, the features after the fit are coming from the indices of the A_q matrix, which has a reduced number of features from PCA. You're getting two features instead of q features (1 in this case) because of the reshape. self.features_ should probably come from A_q instead of X.
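A minimal sketch of that suggested change, as I read it (replace the last line of PFA.fit; this is not code from the paper):
# rows of A_q correspond to the original features, so select those rows
self.features_ = A_q[self.indices_, :]  # instead of X[:, self.indices_]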
I think the problem in your code is in the following statement:
pfa = PFA(n_features=2274, q=1)
I haven't read the paper, but you have to observe the PCA behaviour. If the authors set the q variable to 1, you should understand why q is 1.
For instance:
from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
Note: if you are running this outside a Jupyter notebook, add show() at the end, otherwise you may not see the graph:
from matplotlib.pyplot import plot
from matplotlib.pyplot import xlabel
from matplotlib.pyplot import ylabel
from matplotlib.pyplot import figure
from matplotlib.pyplot import show
pca_obj = PCA().fit(X=X)
figure(1, figsize=(6,3), dpi=300)
plot(pca_obj.explained_variance_, linewidth=2)
xlabel('Components')
ylabel('Explained Variances')
show()
For my dataset, the result is:
Now, I can say: "My q variable is 100, since PCA performs better starting with 100 components."
Can you say the same? How do you know that q is 1?
Now find the q value that performs best for your data and see if it solves your problem.
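As a follow-up to reading the plot by eye, you can also pick q programmatically from the cumulative explained variance ratio; a minimal sketch, where the 0.95 threshold is only an illustrative choice:
import numpy as np
from sklearn.decomposition import PCA
pca_obj = PCA().fit(X=X)
cumulative = np.cumsum(pca_obj.explained_variance_ratio_)
q = int(np.argmax(cumulative >= 0.95)) + 1  # smallest number of components explaining >= 95% of the variance
print("suggested q =", q)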

Is there something similar to R's brglm to help deal with quasi-separation in Python using statsmodels Logit?

I am using Logit from statsmodels to create a regression model.
I get the error LinAlgError: Singular matrix. Then, when I remove one variable at a time from my dataset, I eventually get a different error: PerfectSeparationError: Perfect separation detected, results not available.
I suspect that the original error (LinAlgError) is related to perfect separation, because I had the same problem in R and got around it using brglm (bias-reduced GLM).
I have a boolean y variable and 23 numeric and boolean x variables.
I have already run a VIF function to remove any variables with high multicollinearity scores (I started with 26 variables).
I have tried using firth_regression.py (https://gist.github.com/johnlees/3e06380965f367e4894ea20fbae2b90d) to account for perfect separation, but I got a MemoryError.
I have tried LogisticRegression from sklearn, but it does not give p-values, which is no good to me.
I even tried removing one variable at a time from my dataset. When I got down to 4 variables left (from 23), I still got PerfectSeparationError: Perfect separation detected, results not available.
Has anyone experienced this and how do you get around it?
Appreciate any advice!
X = df.loc[:, df.columns != 'VehicleMake']
y = df.iloc[:,0]
# Split data
X_train, X_test, y_train, y_test = skl.model_selection.train_test_split(X, y, test_size=0.3)
Code in question:
# Perform logistic regression and get p values
logit_model = sm.Logit(y_train, X_train.astype(float))
result = logit_model.fit()
This is the Firth regression code I tried instead, which gave me the memory error:
# For the firth_regression
import sys
import warnings
import math
import numpy as np
import statsmodels
from scipy import stats
import statsmodels.formula.api as smf

def firth_likelihood(beta, logit):
    return -(logit.loglike(beta) + 0.5*np.log(np.linalg.det(-logit.hessian(beta))))

step_limit = 1000
convergence_limit = 0.0001

logit_model = smf.Logit(y_train, X_train.astype(float))

start_vec = np.zeros(X.shape[1])

beta_iterations = []
beta_iterations.append(start_vec)
for i in range(0, step_limit):
    pi = logit_model.predict(beta_iterations[i])
    W = np.diagflat(np.multiply(pi, 1-pi))
    var_covar_mat = np.linalg.pinv(-logit_model.hessian(beta_iterations[i]))

    # build hat matrix
    rootW = np.sqrt(W)
    H = np.dot(np.transpose(X_train), np.transpose(rootW))
    H = np.matmul(var_covar_mat, H)
    H = np.matmul(np.dot(rootW, X), H)

    # penalised score
    U = np.matmul(np.transpose(X_train), y - pi + np.multiply(np.diagonal(H), 0.5 - pi))
    new_beta = beta_iterations[i] + np.matmul(var_covar_mat, U)

    # step halving
    j = 0
    while firth_likelihood(new_beta, logit_model) > firth_likelihood(beta_iterations[i], logit_model):
        new_beta = beta_iterations[i] + 0.5*(new_beta - beta_iterations[i])
        j = j + 1
        if (j > step_limit):
            sys.stderr.write('Firth regression failed\n')
            None

    beta_iterations.append(new_beta)
    if i > 0 and (np.linalg.norm(beta_iterations[i] - beta_iterations[i-1]) < convergence_limit):
        break

return_fit = None
if np.linalg.norm(beta_iterations[i] - beta_iterations[i-1]) >= convergence_limit:
    sys.stderr.write('Firth regression failed\n')
else:
    # Calculate stats
    fitll = -firth_likelihood(beta_iterations[-1], logit_model)
    intercept = beta_iterations[-1][0]
    beta = beta_iterations[-1][1:].tolist()
    bse = np.sqrt(np.diagonal(-logit_model.hessian(beta_iterations[-1])))
    return_fit = intercept, beta, bse, fitll
#print(return_fit)
I fixed my problem by changing the default optimization method in the logit regression to 'bfgs':
result = logit_model.fit(method='bfgs')
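Putting that together with the code from the question, the fitted result then exposes the p-values the question asks for. A minimal sketch (maxiter=1000 is only an illustrative choice):
import statsmodels.api as sm
logit_model = sm.Logit(y_train, X_train.astype(float))
result = logit_model.fit(method='bfgs', maxiter=1000)  # BFGS instead of the default Newton solver
print(result.summary())
print(result.pvalues)  # per-coefficient p-values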
A few years late to this question, but I'm working on a Python implementation of Firth logistic regression, following the procedure detailed in the R logistf package and Heinze and Schemper, 2002. There are a few implementation differences compared to the gist you linked that make it much more memory efficient, and p-values are calculated using penalized likelihood ratio tests. Confidence intervals are also calculated.
Obviously I don't have your data, so let's use the sex2 dataset included with the logistf R package.
>>> from firthlogist import FirthLogisticRegression, load_sex2
>>> fl = FirthLogisticRegression()
>>> X, y, feature_names = load_sex2()
>>> fl.fit(X, y)
FirthLogisticRegression()
>>> fl.summary(xname=feature_names)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- ---------- -----------
age -1.10598 0.42366 -1.97379 -0.307427 0.00611139
oc -0.0688167 0.443793 -0.941436 0.789202 0.826365
vic 2.26887 0.548416 1.27304 3.43543 1.67219e-06
vicl -2.11141 0.543082 -3.26086 -1.11774 1.23618e-05
vis -0.788317 0.417368 -1.60809 0.0151846 0.0534899
dia 3.09601 1.67501 0.774568 8.03028 0.00484687
Intercept 0.120254 0.485542 -0.818559 1.07315 0.766584
Log-Likelihood: -132.5394
Newton-Raphson iterations: 8
Compare results with brglm:
> library(brglm)
Loading required package: profileModel
'brglm' will gradually be superseded by the 'brglm2' R package (https://cran.r-project.org/package=brglm2), which provides utilities for mean and median bias reduction for all GLMs.
Methods for the detection of separation and infinite estimates in binomial-response models are provided by the 'detectseparation' R package (https://cran.r-project.org/package=detectseparation).
> fit <- brglm(case~age+oc+vic+vicl+vis+dia, data=logistf::sex2)
> summary(fit)
Call:
brglm(formula = case ~ age + oc + vic + vicl + vis + dia, data = logistf::sex2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.12025 0.48554 0.248 0.804390
age -1.10598 0.42366 -2.611 0.009040 **
oc -0.06882 0.44379 -0.155 0.876770
vic 2.26887 0.54842 4.137 3.52e-05 ***
vicl -2.11141 0.54308 -3.888 0.000101 ***
vis -0.78832 0.41737 -1.889 0.058921 .
dia 3.09601 1.67501 1.848 0.064551 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 304.61 on 238 degrees of freedom
Residual deviance: 276.91 on 232 degrees of freedom
Penalized deviance: 265.0788
AIC: 290.91
The p-values are slightly different because they are calculated by penalized likelihood ratio tests, whereas brglm uses Wald tests. firthlogist can also use Wald:
>>> fl = FirthLogisticRegression(wald=True)
>>> fl.fit(X, y)
FirthLogisticRegression(wald=True)
>>> fl.summary(xname=feature_names)
coef std err [0.025 0.975] p-value
--------- ---------- --------- --------- ---------- -----------
age -1.10598 0.42366 -1.93634 -0.275623 0.00903995
oc -0.0688167 0.443793 -0.938636 0.801002 0.87677
vic 2.26887 0.548416 1.194 3.34375 3.51659e-05
vicl -2.11141 0.543082 -3.17583 -1.04699 0.000101147
vis -0.788317 0.417368 -1.60634 0.0297084 0.0589208
dia 3.09601 1.67501 -0.186943 6.37896 0.0645508
Intercept 0.120254 0.485542 -0.83139 1.0719 0.80439
Log-Likelihood: -132.5394
Newton-Raphson iterations: 8

How can I better format the output that I'm attempting to save from several regressions?

I'd like to loop through several specifications of a linear regression and save the results for each model in a Python dictionary. The code below is somewhat successful, but additional text (e.g. datatype information) is included in the dictionary, making it unreadable. Moreover, regarding the confidence interval, I'd like to have two separate columns - one for the upper bound and another for the lower bound - but I'm unable to do that.
code:
import pandas as pd
import patsy
import statsmodels.api as sm
from collections import defaultdict

colleges = ['ARC_g', u'CCSF_g', u'DAC_g', u'DVC_g', u'LC_g', u'NVC_g', u'SAC_g', u'SRJC_g', u'SC_g', u'SCC_g']

results = defaultdict(lambda: defaultdict(int))
for exog in colleges:
    exog = exog.encode('ascii')
    f1 = 'GRADE_PT_103 ~ %s -1' % exog
    y, X = patsy.dmatrices(f1, data, return_type='dataframe')
    mod = sm.OLS(y, X)  # Describe model
    res = mod.fit()     # Fit model
    results[exog]['beta'] = res.params
    # I'd like the confidence interval to be separated into two columns ('upper' and 'lower')
    results[exog]['CI'] = res.conf_int()
    results[exog]['rsq'] = res.rsquared

pd.DataFrame(results)
Current output:
ARC_g | CCSF_g | ...
beta | ARC_g 0.79304 dtype: float64 | CCSF_g 0.833644 dtype: float64
CI | 0 1 ARC_g 0.557422 1.0... 0 1| CCSF_g 0.655746 1...
rsq | 0.122551 | 0.213053
This is how I'd summarize what you were showing. Hopefully it helps give you some ideas.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

data = pd.DataFrame(np.random.randn(30, 5), columns=list('YABCD'))

results = {}
for c in data.columns[1:]:
    f = 'Y ~ {}'.format(c)
    r = smf.ols(formula=f, data=data).fit()
    coef = pd.concat([r.params,
                      r.conf_int().iloc[:, 0],
                      r.conf_int().iloc[:, 1]], axis=1, keys=['coef', 'lower', 'upper'])
    coef.index = ['Intercept', 'Beta']
    results[c] = dict(coef=coef, rsq=r.rsquared)

keys = data.columns[1:]
summary = pd.concat([results[k]['coef'].stack() for k in keys], axis=1, keys=keys)
summary.index = summary.index.to_series().str.join(' - ')
summary.append(pd.Series([results[k]['rsq'] for k in keys], keys, name='R Squared'))
