from_formula() missing 2 required positional arguments: 'formula' and 'data' - python

I am getting a positional-arguments error for the ols function under statsmodels.formula.api. I have tried statsmodels.regression.linear_model and changing OLS to ols and vice versa.
import statsmodels.regression.linear_model as sm
X = np.append(arr=np.ones((50,1)).astype(int),values=X,axis=1)
X_opt = X[:,[0,1,2,3,4,5]]
regressor_OLS = sm.ols(endog = Y, exog = X_opt).fit()
The expected output is the fitted regression model, but instead I get this error:
from_formula() missing 2 required positional arguments: 'formula' and 'data'

To get this example to work (I am assuming you are running the Udemy machine learning course, which is line for line this example), I had to change the import statement. The OLS function no longer resides in the library they are using.
import statsmodels.regression.linear_model as lm
then
regressor_ols = lm.OLS(endog = y, exog = x_optimal).fit()
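Note that the error itself comes from the formula interface: statsmodels.formula.api.ols is OLS.from_formula under the hood, so it expects a formula string and a DataFrame rather than endog/exog arrays. A minimal sketch of that interface, assuming a hypothetical DataFrame df with columns y, x1 and x2:
import statsmodels.formula.api as smf
# the formula API takes a patsy formula plus the DataFrame holding the columns
results = smf.ols(formula='y ~ x1 + x2', data=df).fit()
print(results.summary())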

This should work:
import statsmodels.api as sm
X = np.append(arr=np.ones((50, 1), dtype=int), values=X, axis=1)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_ols = sm.OLS(y, X_opt).fit()

Remove
import statsmodels.regression.linear_model as sm
and just import statsmodels.api as follows:
import statsmodels.api as sm
The course is quite old, which is why fragments of the code are obsolete; no idea why they are not updating it anymore.

The OLS class lives in the linear_model module, so use the following code to make it work.
import statsmodels.regression.linear_model as lm
X = np.append(arr=np.ones((50,1)).astype(int),values=X,axis=1)
X_opt = X[:,[0,1,2,3,4,5]]
regressor_OLS = lm.OLS(endog = y, exog = X_opt).fit()

Use import statsmodels.regression.linear_model as lm or import statsmodels.api as sm
import statsmodels.regression.linear_model as lm
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_x = lm.OLS(endog=y, exog=X_opt).fit()
regressor_x.summary()

This one worked for me:
import statsmodels.api as sm
X = np.insert(X, 0, np.ones(X.shape[0]), axis=1)
colList = list()
for i in range(X.shape[1]):
    colList.append(i)
X_opt = np.array(X[:, colList], dtype=float)
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()

Solution 1:
import statsmodels.api as sm
x = np.append(arr= np.ones((50, 1)).astype(int), values= x, axis=1)
x_opt = x[:, [0,1,2,3,4,5]]
regressor_OLS = sm.OLS(endog=y, exog=x_opt).fit()
Solution 2:
import statsmodels.regression.linear_model as lm
x = np.append(arr= np.ones((50, 1)).astype(int), values= x, axis=1)
x_opt = x[:, [0,1,2,3,4,5]]
regressor_ols = lm.OLS(endog=y, exog=x_opt).fit()

I recently had the same problem. As auticus said, the OLS function no longer resides in statsmodels.formula.api. But you must also create X_opt as a list:
import statsmodels.regression.linear_model as lm
X = np.append(arr=np.ones((50, 1)).astype(int), values=X, axis=1)
X_opt = X[:, [0, 1, 2, 3, 4, 5]].tolist()
SL = 0.05
regression_OLS = lm.OLS(endog=y, exog=X_opt).fit()

LinearRegression TypeError

The data, a screenshot of a table of stocks and their ChangePercent values (not reproduced here), is referred to as sample.xlsx. I've been having trouble getting the beta for each stock using the LinearRegression() function.
Input:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.read_excel('sample.xlsx')
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape((-1, 1))
    y = np.array(mean)
    model = LinearRegression().fit(x, y)
    print(model.coef_)
Output:
Line 16: model = LinearRegression().fit(x, y)
"Singleton array %r cannot be considered a valid collection." % x
TypeError: Singleton array array(3.34) cannot be considered a valid collection.
How can I make the collection valid so that I can get a beta value(model.coef_) for each stock?
X and y must have the same number of samples, so here you also need to reshape y into a single row and column. In this case that reduces to the following:
np.array(mean).reshape(-1,1) or np.array(mean).reshape(1,1)
Given that you are training 5 models, each one on just one sample, it is not surprising that all 5 will "learn" that the coefficient of the linear regression is 0 and the intercept is 3.34 (y).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
df = pd.DataFrame({
    "stock": ["ABCD", "XYZ", "JK", "OPQ", "GHI"],
    "ChangePercent": [-1.7, 30, 3.7, -15.3, 0]
})
mean = df['ChangePercent'].mean()
for index, row in df.iterrows():
    symbol = row['stock']
    perc = row['ChangePercent']
    x = np.array(perc).reshape(-1,1)
    y = np.array(mean).reshape(-1,1)
    model = LinearRegression().fit(x, y)
    print(f"{model.intercept_} + {model.coef_}*{x} = {y}")
Which is correct from an algorithmic point of view, but it doesn't make any practical sense given that you're only providing one example to train each model.
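If the goal is a per-stock beta, each model needs several observations. A minimal sketch with hypothetical market-vs-stock daily returns (all numbers here are made up for illustration):
import numpy as np
from sklearn.linear_model import LinearRegression
# hypothetical daily returns over five days
market = np.array([0.5, -1.0, 2.0, 0.3, -0.7]).reshape(-1, 1)  # X: (n_samples, 1)
stock = np.array([0.8, -1.5, 2.9, 0.4, -1.1])                  # y: (n_samples,)
model = LinearRegression().fit(market, stock)
print(model.coef_[0])  # the slope is the stock's beta against the market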

GridSearchCV results heatmap

I am trying to generate a heatmap for the GridSearchCV results from sklearn. The thing I like about sklearn-evaluation is that it is really easy to generate the heatmap. However, I have hit one issue. When I give a parameter as None, e.g.
max_depth = [3, 4, 5, 6, None]
then while generating the heatmap it shows an error saying:
TypeError: '<' not supported between instances of 'NoneType' and 'int'
Is there any workaround for this?
I have found other ways to generate a heatmap, like using matplotlib and seaborn, but nothing gives as beautiful heatmaps as sklearn-evaluation.
I fiddled around with the file /lib/python3.8/site-packages/sklearn_evaluation/plot/grid_search.py. At lines 192/193, change the lines
From
row_names = sorted(set([t[0] for t in matrix_elements.keys()]),
                   key=itemgetter(1))
col_names = sorted(set([t[1] for t in matrix_elements.keys()]),
                   key=itemgetter(1))
To:
row_names = sorted(set([t[0] for t in matrix_elements.keys()]),
                   key=lambda x: (x[1] is None, x[1]))
col_names = sorted(set([t[1] for t in matrix_elements.keys()]),
                   key=lambda x: (x[1] is None, x[1]))
Moving all None to the end of a list while sorting is based on a previous answer from Andrew Clarke.
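The key maps each entry to a tuple whose first element is True only for None, and since False sorts before True, every real value comes ahead of every None. A tiny standalone check:
vals = [3, None, 1, None, 2]
print(sorted(vals, key=lambda x: (x is None, x)))  # [1, 2, 3, None, None]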
Using this tweak, my demo script is shown below:
import numpy as np
import sklearn.datasets as datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn_evaluation import plot
data = datasets.make_classification(n_samples=200, n_features=10, n_informative=4, class_sep=0.5)
X = data[0]
y = data[1]
hyperparameters = {
    "max_depth": [1, 2, 3, None],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
}
est = RandomForestClassifier(n_estimators=5)
clf = GridSearchCV(est, hyperparameters, cv=3)
clf.fit(X, y)
plot.grid_search(clf.cv_results_, change=("max_depth", "criterion"), subset={"max_features": "sqrt"})
import matplotlib.pyplot as plt
plt.show()
The output is the heatmap (image not reproduced here), now rendered with max_depth=None sorted to the end of the axis.

Can you identify what is wrong with this program implementing the normal equation for linear regression?

1. I got theta values that are unusably large numbers.
2. Can you determine what problem it has?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("headbrain.csv")
data.head()
x = np.array(data["Head Size(cm^3)"].values)
y = np.array(data["Brain Weight(grams)"].values)
print(x.shape)
x1 = np.ones(len(y))
X = np.array([x, x1])
X.shape
# normal equation: creating (x.transpose*x)*(x.transpose*y)
first = np.matmul(X, X.transpose())  # first part of the normal equation (x.transpose*x)
second = np.matmul(X, y)             # second part of the normal equation (x.transpose*y)
theta = np.matmul(first, second)     # normal equation for theta
print(theta)
# this returns theta values that are huge numbers, in scientific notation
Your code never inverts the matrix: the normal equation is theta = (X^T X)^-1 (X^T y), but you compute (X^T X)(X^T y) without the inverse, which is why theta blows up. Build the design matrix as a column of ones next to the feature column and use np.linalg.inv:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv("headbrain.csv")
data.head()
x = np.array(data["Head Size(cm^3)"].values)
y = np.array(data["Brain Weight(grams)"].values)
print(x.shape)
X_b = np.c_[np.ones((len(x), 1)), x]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)  # (X^T X)^-1 X^T y
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict
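As a sanity check (a sketch reusing the X_b and y defined above), np.linalg.lstsq solves the same least-squares problem without forming the inverse explicitly, and the two solutions should agree:
theta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(np.allclose(theta_best, theta_lstsq))  # expect True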

NameError: name 'X' is not defined sklearn

I am working through a multiple regression problem following this walkthrough: https://towardsdatascience.com/what-makes-a-movie-hit-a-jackpot-learning-from-data-with-multiple-linear-regression-339f6c1a7022. The failing code starts at the section "Treating categorical variables with One-hot-encoding". I ran the code up to this point, but it fails on X.
Actual code:
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
le = preprocessing.LabelEncoder()
# LabelEncoder for a number of columns
class MultiColumnLabelEncoder:
    def __init__(self, columns=None):
        self.columns = columns  # list of columns to encode
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.iteritems():
                output[colname] = LabelEncoder().fit_transform(col)
        return output
    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)
le = MultiColumnLabelEncoder()
X_train_le = le.fit_transform(X)
Here is the error that I get:
Traceback (most recent call last):
File "<ipython-input-63-581cea150670>", line 34, in <module>
X_train_le = le.fit_transform(X)
NameError: name 'X' is not defined
Your code can't work as posted because you left out 40 lines of code that she wrote before that snippet; she defines X earlier. The code can be obtained from GitHub.
#importing the libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
import pyreadr
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import explained_variance_score
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
result = pyreadr.read_r('Movies.RData')# also works for Rds
print(result.keys())
df = pd.DataFrame(result['movies'], columns=result['movies'].keys() )
df.shape
df.shape[0]
df.set_index("title", inplace=True) #setting the index name
df_1 = df.loc[:, ['imdb_rating', 'genre', 'runtime', 'best_pic_nom',
                  'top200_box', 'director', 'actor1']]
#Let's also check the column-wise distribution of null values
print(df_1.isnull().values.sum())
print(df_1.isnull().sum())
#Dropping missing values from my dataset
df_1.dropna(how='any', inplace=True)
print(df_1.isnull().values.sum()) #checking for missing values after the dropna()
#Splitting for 2 matrices: independent variables used for prediction and dependent variables (that is predicted)
X = df_1.drop(["imdb_rating", 'runtime'], axis = 1) #Feature Matrix
y = df_1["imdb_rating"] #Dependent Variables
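With X and y defined as above, the snippet from the question runs. A quick sketch, for example restricting the encoder (the class from the question) to the categorical columns of df_1:
le = MultiColumnLabelEncoder(columns=['genre', 'best_pic_nom', 'top200_box', 'director', 'actor1'])
X_train_le = le.fit_transform(X)
print(X_train_le.head())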

In Scikit, How Do You Fix Value Error When Predicting?

The following code gives me this error: ValueError: Found array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
The error is produced on the line where the prediction is invoked. I am assuming there is something wrong with the shape of the dataframe obs_to_pred; I checked its shape, which is (1046, 3).
What do you recommend so I can fix this and run the prediction?
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
import scipy.stats as stats
from sklearn import linear_model
# Import Titanic Data
train_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/train.csv'
test_loc = 'C:/Users/Young/Desktop/Kaggle/Titanic/test.csv'
train = pd.read_csv(train_loc)
test = pd.read_csv(test_loc)
# Predict Missing Age Values Based on Factors Pclass, SibSp, and Parch.
# In the function, combine train and test data.
def regressionPred(traindata, testdata):
    allobs = pd.concat([traindata, testdata])
    allobs = allobs[~allobs.Age.isnull()]
    y = allobs.Age
    y, X = dmatrices('y ~ Pclass + SibSp + Parch', data=allobs, return_type='dataframe')
    mod = sm.OLS(y, X)
    res = mod.fit()
    predictors = ['Pclass', 'SibSp', 'Parch']
    regr = linear_model.LinearRegression()
    regr.fit(allobs.ix[:, predictors], y)
    obs_to_pred = allobs[allobs.Age.isnull()].ix[:, predictors]
    prediction = regr.predict(obs_to_pred)  # error produced on this line ***
    return res.summary(), prediction
regressionPred(train, test)
In case you may want to look at the dataset, the link will take you there: https://www.kaggle.com/c/titanic/data
In the line
allobs = allobs[~allobs.Age.isnull()]
you redefine allobs as only the cases with no NaN in the Age column.
Later, with:
obs_to_pred = allobs[allobs.Age.isnull()].ix[:,predictors]
you no longer have any cases to predict on: every allobs.Age.isnull() evaluates to False, so obs_to_pred is empty. Hence your error:
array with 0 sample(s) (shape=(0, 3)) while a minimum of 1 is required.
Check the logic of what you want your predictions to do.
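A minimal sketch of the likely intent (an assumption on my part: fit on the rows where Age is known, then predict for the rows where it is missing), reusing the names from the question and keeping allobs as the unfiltered concatenation:
known = allobs[~allobs.Age.isnull()]
unknown = allobs[allobs.Age.isnull()]
regr = linear_model.LinearRegression()
regr.fit(known[predictors], known.Age)
prediction = regr.predict(unknown[predictors])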
