AttributeError: 'OLSResults' object has no attribute 'norm_resid'

When I run the code below I get the following error:
AttributeError: 'OLSResults' object has no attribute 'norm_resid'
I have the latest version of statsmodels, so the attribute norm_resid should be there.
Any ideas?
from scipy import stats
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn import datasets, linear_model
from statsmodels.formula.api import ols
"""
Data Management
"""
data = pd.read_csv("TestExer1-sales-round1.csv")
X_train = data["Advertising"]
Y_train = data["Sales"]
# fit an ordinary least squares (OLS) regression
model = ols("Y_train ~ X_train", data).fit()
print(model.summary())
plt.plot(X_train, Y_train, 'ro')
plt.plot(X_train, model.fittedvalues, 'b')
plt.legend(['Sales', 'Advertising'])
plt.ylim(0, 70)
plt.xlim(5, 18)
plt.hist(model.norm_resid())
plt.ylabel('Count')
plt.xlabel('Normalized residuals')
plt.show()

I had the same issue, but the following worked:
plt.hist(model.resid_pearson)
Thus your solution should look like:
from scipy import stats
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
import matplotlib.pylab as plt
from sklearn import datasets, linear_model
from statsmodels.formula.api import ols
"""
Data Management
"""
data = pd.read_csv("TestExer1-sales-round1.csv")
X_train = data["Advertising"]
Y_train = data["Sales"]
# fit an ordinary least squares (OLS) regression
model = ols("Y_train ~ X_train", data).fit()
print(model.summary())
plt.plot(X_train, Y_train, 'ro')
plt.plot(X_train, model.fittedvalues, 'b')
plt.legend(['Sales', 'Advertising'])
plt.ylim(0, 70)
plt.xlim(5, 18)
plt.show()
# plot the normalized (Pearson) residuals in a fresh figure
plt.figure()
plt.hist(model.resid_pearson)
plt.ylabel('Count')
plt.xlabel('Normalized residuals')
plt.show()
when using statsmodels version 0.8.0 or greater.
Note: Pearson residuals only divide each residual value by the standard error of the residuals, whereas normalisation also divides each residual by the sum of all residuals; see the statsmodels documentation for details.
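If you want to sanity-check what resid_pearson gives you, here is a minimal sketch (assuming the fitted model above): in statsmodels, resid_pearson should equal the raw residuals divided by the square root of the estimated residual variance, which is exposed as model.scale.
import numpy as np

# resid_pearson scales the raw residuals to unit variance;
# model.scale holds the estimated residual variance (SSR / df_resid)
manual = model.resid / np.sqrt(model.scale)
print(np.allclose(model.resid_pearson, manual))  # expected: True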

Related

select subset for regression

I have the following code that I want to use. Column 0 is the year (1950-2020) and the rest of the columns are months. I only want to use the data from 1979-2020 in my linear regression model.
Can you help me? I am quite a beginner in Python. Below is my code:
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
data1 = pd.read_csv(r'C:\Users\User-PC\sample.csv')
x1 = pd.DataFrame(data1, columns=['Year','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
#data2 = pd.read_csv (r'C:\Users\User-PC\sample2.csv', parse_dates=[0], index_col=0)
#x2 = pd.DataFrame(data2,columns=['Year','Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
plt.plot(x1['Year'], x1['Jan'], color='green')
plt.title('Model 1')
plt.xlabel('Year')
plt.ylabel('index')
plt.show()
You can filter your dataframe by year before applying linear regression:
new_df = x1[x1['Year'].between(1979, 2020, inclusive="both")]
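Note that inclusive="both" needs pandas 1.3 or newer (older versions used inclusive=True). As a minimal sketch of how the filtered frame could then feed the regression, reusing the clf created above and treating the 'Jan' column as the response purely for illustration:
X = new_df[['Year']]   # scikit-learn expects 2D predictors
y = new_df['Jan']      # illustrative response column
clf.fit(X, y)
print(clf.coef_, clf.intercept_)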

Data mining for machine learning

I am starting out in data analysis and ran into a problem with an exercise from Kaggle (file 'ENB.csv'). I import my data, compute the correlations, and create a new column in my dataframe that totals my target variables:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn import model_selection
from sklearn.model_selection import validation_curve
from sklearn import ensemble
from sklearn import svm
from sklearn import neighbors
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.ensemble import VotingClassifier
df = pd.read_csv('ENB.csv')
df.columns= ["relative_compactness","surface_area","wall_area","roof_area","overall_height","orientaion",
"glazing_area","glazing_area_dist","heating_load","cooling_load"]
df.head()
corr =df.corr(method = 'pearson')
plt.figure(figsize = (20,10))
sns.heatmap(df.corr(), annot=True, cmap='Greens');
df['total_charges'] = pd.Series([1]).astype(dtype=float)
df['total_charges'] = df['heating_load'] + df['cooling_load']
I have to create a new variable 'charges_classes' that splits the buildings into 4 distinct classes, labelled 0, 1, 2, 3, according to the 3 quantiles of the newly created variable. I have searched but cannot find a solution. Can someone help? Here is what I did:
charge_classes = pd.get_dummies(df['total_charges'])
charge_classes
You could use qcut:
df['charge_classes'] = pd.qcut(df['total_charges'], 4, labels=False)
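qcut bins by sample quantiles, so asking for 4 bins uses exactly the 3 quantile cut points the exercise mentions, and labels=False returns the integer codes 0-3. A tiny self-contained illustration with made-up numbers (not the ENB data):
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
# four equal-frequency bins -> integer labels 0..3
print(pd.qcut(s, 4, labels=False).tolist())  # [0, 0, 1, 1, 2, 2, 3, 3]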

How can I apply this clustering algorithm to my own data?

I'd like to replace the iris data with my own data. Please tell me what steps I should follow to do that.
Thanks
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix, classification_report
plt.rc('figure', figsize=(7, 4))
iris = datasets.load_iris()
X = scale(iris.data)
Y = pd.DataFrame(iris.target)
variable_name = iris.feature_names
X[0:10,]  # inspect the first ten scaled rows
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
iris_df = pd.DataFrame(iris.data)
iris_df.columns = ['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width']
Y.columns = ['Targets']
The import section will stay the same.
Let's assume you have a dataframe:
# read your dataframe (several input formats are possible)
df = pd.read_csv('test.csv')
# define a target variable (named 'target' in my case) and the features X
Y = df['target']
X = df.drop(['target'], axis=1)
# here your k-means algorithm starts
clustering = KMeans(n_clusters=3, random_state=5)
clustering.fit(X)
Let me add one more thing: what are you using k-means for? It is an unsupervised learning method, so you do not have a target variable.
Normally it should be:
df = pd.read_csv('test.csv')
#columns header you want to use
relevant_columns = ['A', 'B']
X = df[relevant_columns]
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn import datasets
from sklearn.metrics import confusion_matrix,classification_report
# CHANGED CODE START
df = pd.read_excel('tmp.xlsx')
Y = df['target']
X = df.drop(['target'], axis=1)
# CHANGED CODE END
variable_name = X.columns
clustering = KMeans(n_clusters=3,random_state=5)
clustering.fit(X)
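Once fit, the cluster assignment for every row and the cluster centres are available on the fitted estimator; a short sketch of how you might inspect them:
# integer cluster label (0, 1 or 2) assigned to each row of X
print(clustering.labels_[:10])
# one centre per cluster, in the same feature space as X
print(clustering.cluster_centers_)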

Receiving ValueError: x and y must be the same size. Any help would be appreciated

I am currently working on a machine learning problem for predicting the weather. While running my code in a Jupyter notebook I came across the above error, and I am not sure where I am going wrong, as my data values should both be 2D arrays. Any help would be greatly appreciated. My notebook specifically points to line 133,
axes[row, col].scatter(df2[feature], df2['meantempm'])
as the problem. If it helps, I am using https://stackabuse.com/using-machine-learning-to-predict-the-weather-part-2/ as my main resource for this.
import jupyter
import IPython
from IPython import get_ipython
from datetime import datetime
from datetime import timedelta
import time
from collections import namedtuple
import pandas as pd
import requests
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_absolute_error, median_absolute_error
import tensorflow as tf
df = pd.read_csv('end-part2_df.csv').set_index('date')
df.corr()[['meantempm']].sort_values('meantempm')
predictors = ['meantempm_1', 'meantempm_2', 'meantempm_3',
'mintempm_1', 'mintempm_2', 'mintempm_2',
'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
'maxdewptm_1', 'maxdewptm_2', 'maxdewptm_3',
'mindewptm_1', 'mindewptm_2', 'mindewptm_3',
'maxtempm_1', 'maxtempm_2', 'maxtempm_3']
df2 = df[['meantempm'] + predictors]
get_ipython().run_line_magic('matplotlib','inline')
plt.rcParams['figure.figsize'] = [16, 22]
fig, axes = plt.subplots(nrows=6, ncols=3, sharey=True)
arr = np.array(predictors).reshape(6, 3)
for row, col_arr in enumerate(arr):
    for col, feature in enumerate(col_arr):
        axes[row, col].scatter(df2[feature], df2['meantempm'])
        if col == 0:
            axes[row, col].set(xlabel=feature, ylabel='meantempm')
        else:
            axes[row, col].set(xlabel=feature)
plt.show()
Your df2['mintempm_2'] is 2D with shape (997, 2). This is because your predictors list includes 'mintempm_2' twice, so selecting that column name returns both duplicate columns at once.
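Judging by the _1/_2/_3 pattern of the other variables, the second duplicate was presumably meant to be 'mintempm_3'. With that fixed, df2[feature] is a single column again and the scatter call works:
predictors = ['meantempm_1', 'meantempm_2', 'meantempm_3',
              'mintempm_1', 'mintempm_2', 'mintempm_3',   # was 'mintempm_2' twice
              'meandewptm_1', 'meandewptm_2', 'meandewptm_3',
              'maxdewptm_1', 'maxdewptm_2', 'maxdewptm_3',
              'mindewptm_1', 'mindewptm_2', 'mindewptm_3',
              'maxtempm_1', 'maxtempm_2', 'maxtempm_3']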

Translate cross_validation algorithm to model_selection

In 2016, I ran a lasso regression model using the code below:
#Import required packages
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pylab as plt
import matplotlib.pyplot as plp
import seaborn as sns
import statsmodels.formula.api as smf
from scipy import stats
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LassoLarsCV
# split data into train and test sets
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.4, random_state=123)
#%
# specify the lasso regression model
model=LassoLarsCV(cv=10, precompute=False).fit(pred_train,tar_train)
#%
# print variable names and regression coefficients
dict(zip(predictors.columns, model.coef_))
#regcoef.to_csv('variable+regresscoef.csv')
#%%
# plot coefficient progression
m_log_alphas = -np.log10(model.alphas_)
ax = plt.gca()
plt.plot(m_log_alphas, model.coef_path_.T)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.ylabel('Regression Coefficients')
plt.xlabel('-log(alpha)')
plt.title('Regression Coefficients Progression for Lasso Paths')
#%
# plot mean square error for each fold
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.cv_mse_path_, ':')
plt.plot(m_log_alphascv, model.cv_mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)
plt.axvline(-np.log10(model.alpha_), linestyle='--', color='k',
            label='alpha CV')
plt.legend()
plt.xlabel('-log(alpha)')
plt.ylabel('Mean squared error')
plt.title('Mean squared error on each fold')
#%
# MSE from training and test data
from sklearn.metrics import mean_squared_error
train_error = mean_squared_error(tar_train, model.predict(pred_train))
test_error = mean_squared_error(tar_test, model.predict(pred_test))
print('training data MSE')
print(train_error)
print('test data MSE')
print(test_error)
#%
# R-square from training and test data
rsquared_train = model.score(pred_train, tar_train)
rsquared_test = model.score(pred_test, tar_test)
print('training data R-square')
print(rsquared_train)
print('test data R-square')
print(rsquared_test)
Now, when I run it again, I get the following warning:
DeprecationWarning: This module was deprecated in version 0.18 in
favor of the model_selection module into which all the refactored
classes and functions are moved.
How can I rewrite this code using model_selection ?
The only thing I can see here that used the cross_validation module is train_test_split.
So just change your import from:
from sklearn.cross_validation import train_test_split
to:
from sklearn.model_selection import train_test_split
and you are good to go.
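One version caveat: in the same deprecation cycle, LassoLarsCV's cv_mse_path_ attribute was renamed, and on recent scikit-learn releases the per-fold MSEs live in mse_path_ instead. If you hit an AttributeError on the MSE plot, a sketch of the likely fix (assuming a recent scikit-learn):
# on newer scikit-learn, use mse_path_ instead of cv_mse_path_
m_log_alphascv = -np.log10(model.cv_alphas_)
plt.figure()
plt.plot(m_log_alphascv, model.mse_path_, ':')
plt.plot(m_log_alphascv, model.mse_path_.mean(axis=-1), 'k',
         label='Average across the folds', linewidth=2)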
