I have a train_data DataFrame which holds information about stores and their sales.
I want to build a multiple-feature linear regression to predict 'Sales' on a test_data set, using 'DayOfWeek', 'Customers' and 'Promo' as features.
How do I build a multiple linear regression model for this, preferably using scikit-learn?
edit: here's the link to the dataset I am using, if anyone is interested : https://www.kaggle.com/c/rossmann-store-sales
This is what I've tried so far:
import pandas as pd
from sklearn import linear_model
x=train_data[['Promo','Customers','DayOfWeek']]
y=train_data['Sales']
lm=LinearRegression()
lm.fit(x,y)
This gives me an error saying 'LinearRegression' is not defined.
You aren't actually importing the LinearRegression class. If you want to import everything in the linear_model module (which is generally frowned upon) you could do:
from sklearn.linear_model import *
lr = LinearRegression()
...
A better practice is to import the module itself and give it an alias. Like so:
import sklearn.linear_model as lm
lr = lm.LinearRegression()
...
Finally you could import just the class you want:
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
...
You've imported linear_model, which is the module that contains
the LinearRegression() class. To call the LinearRegression class use this:
lm = linear_model.LinearRegression()
lm.fit(x,y)
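For completeness, here is a minimal end-to-end sketch of the fit-and-predict flow. It does not use the real Rossmann files: the DataFrame below is synthetic, the relationship between the columns and 'Sales' is made up, and the test_data frame is a hypothetical stand-in with the same feature columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the Rossmann data (assumption: same column names).
rng = np.random.default_rng(0)
n = 500
train_data = pd.DataFrame({
    "Promo": rng.integers(0, 2, size=n),
    "Customers": rng.integers(100, 1500, size=n),
    "DayOfWeek": rng.integers(1, 8, size=n),
})
# Made-up relationship, used only so the model has something to learn.
train_data["Sales"] = (
    8.0 * train_data["Customers"]
    + 900.0 * train_data["Promo"]
    + rng.normal(0, 200, size=n)
)

features = ["Promo", "Customers", "DayOfWeek"]
X = train_data[features]
y = train_data["Sales"]

model = LinearRegression()
model.fit(X, y)

# Hypothetical test set: any frame with the same feature columns works.
test_data = train_data[features].head(5)
predictions = model.predict(test_data)

print(dict(zip(features, model.coef_)))
print(predictions)
```

Note that this treats DayOfWeek as a plain number; one-hot encoding it may fit real data better, but that is a modelling choice outside the original question.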
I'm trying to retrieve the feature importances from a RandomForestClassifier model, i.e. one importance value for each feature in the model.
I'm running the following code:
random_forest = SelectFromModel(RandomForestClassifier(n_estimators = 200, random_state = 123))
random_forest.fit(X_train, y_train)
print(random_forest.estimator.feature_importances_)
but I am receiving the following error:
NotFittedError: This RandomForestClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
What exactly am I doing wrong? You can see that I fit the model right before trying to read the feature importances, but it doesn't work as I expected.
Similarly, I have the code below with a LogisticRegression model, and it works fine:
log_reg = SelectFromModel(LogisticRegression(class_weight = "balanced", random_state = 123))
log_reg.fit(X_train, y_train)
print(log_reg.estimator_.coef_)
You have to call the attribute estimator_ to access the fitted estimator (see the docs). Observe that you forgot the trailing _. So it should be:
print(random_forest.estimator_.feature_importances_)
Interestingly, you did it correctly for your example with the LogisticRegression model.
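The difference between the two attributes can be checked directly. This is a small sketch on synthetic data (make_classification and the parameter values are my own choices, not from the question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data, only for illustration.
X_train, y_train = make_classification(
    n_samples=200, n_features=8, random_state=123
)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=20, random_state=123)
)
selector.fit(X_train, y_train)

# `estimator` is the object passed to the constructor; it is left unfitted,
# so it has no fitted attributes such as `estimators_`.
template_is_fitted = hasattr(selector.estimator, "estimators_")

# `estimator_` is the clone that SelectFromModel actually fitted.
importances = selector.estimator_.feature_importances_

print(template_is_fitted)
print(importances)
```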
I set up a small pipeline with scikit-learn that I wrapped in a TransformedTargetRegressor object. After training, I would like to access an attribute of my trained estimator (e.g. feature_importances_). Can anyone tell me how this can be done?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import TransformedTargetRegressor
# set up the pipeline
pipeline = Pipeline(steps = [('scale', StandardScaler(with_mean=True, with_std=True)),
                             ('estimator', RandomForestRegressor())])
# transform target variable
model = TransformedTargetRegressor(regressor=pipeline,
                                   transformer=MinMaxScaler())
# fit model
model.fit(X_train, y_train)
I tried the following:
# try to access the attribute of the fitted estimator
model.get_params()['regressor__estimator'].feature_importances_
model.regressor.named_steps['estimator'].feature_importances_
But this results in the following NotFittedError:
NotFittedError: This RandomForestRegressor instance is not fitted yet.
Call 'fit' with appropriate arguments before using this method.
When you look into the documentation of TransformedTargetRegressor it says that the attribute .regressor_ (note the trailing underscore) returns the fitted regressor. Hence, your call should look like:
model.regressor_.named_steps['estimator'].feature_importances_
Your previous calls were returning the original, unfitted regressor, because fit trains a clone and stores it in regressor_. That's where the error came from.
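A runnable sketch of the same setup, using synthetic data from make_regression (my assumption, since the question does not show X_train or y_train), makes the distinction visible:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data, only for illustration.
X_train, y_train = make_regression(n_samples=200, n_features=6, random_state=0)

pipeline = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("estimator", RandomForestRegressor(n_estimators=20, random_state=0)),
])
model = TransformedTargetRegressor(regressor=pipeline, transformer=MinMaxScaler())
model.fit(X_train, y_train)

# The object passed in as `regressor` is never fitted itself.
original_forest = model.regressor.named_steps["estimator"]
original_is_fitted = hasattr(original_forest, "estimators_")

# The fitted pipeline lives in `regressor_` (trailing underscore).
fitted_forest = model.regressor_.named_steps["estimator"]
importances = fitted_forest.feature_importances_

print(original_is_fitted)
print(importances)
```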
One-shot question: I'm trying to build a multioutput stacked regressor (StackingRegressor was added in sklearn 0.22).
As far as I understand, I have to combine StackingRegressor and MultiOutputRegressor. After several attempts, this seems to be the right order:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import OrthogonalMatchingPursuit
from sklearn.ensemble import StackingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR
estimators = [ ('svr', SVR(kernel='rbf', C=1e3, gamma=0.1)),
('knn',KNeighborsRegressor(n_neighbors=5)),
('omp', OrthogonalMatchingPursuit())
]
reg = MultiOutputRegressor(StackingRegressor( estimators = estimators, final_estimator= RandomForestRegressor(n_estimators=5)))
X=np.random.random((200,20))
y = np.random.random((200,4))
reg.fit(X,y)
reg.predict(X)
But the predict method ends with an error:
*** ValueError: The base estimator should implement a predict method
I searched for this error in the sklearn source files, and it seems to be related to MultiOutputRegressor:
if not hasattr(self.estimator, "predict"):
raise ValueError("The base estimator should implement a predict method")
So I tried to look at the self.estimator model:
reg.estimator.predict(X)
but I obtain this error:
*** AttributeError: 'StackingRegressor' object has no attribute 'final_estimator_'
Looking at the attributes of reg.estimator, I cannot find final_estimator_, only final_estimator, so my workaround is to create that attribute:
reg.estimator.final_estimator_ = reg.estimator.final_estimator
It works, but I'm no longer sure that my model is doing what it is supposed to do (maybe it is using the same final estimator for each coordinate of the output).
Is this a bug caused by combining StackingRegressor and MultiOutputRegressor, or am I missing something?
Thanks!
Set stack_method='predict' when initializing the model, and it should work fine. I don't know why the 'auto' option doesn't work, but a quick fix looks like this:
model = StackingClassifier(estimators=level0, final_estimator=level1, stack_method='predict', cv=5)
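On the worry in the question about a shared final estimator: as far as I can tell, MultiOutputRegressor fits one clone of the wrapped estimator per output column and keeps the clones in estimators_, while reg.estimator itself stays unfitted. The sketch below checks this with a reduced set of base estimators (KNeighborsRegressor and LinearRegression are my simplification of the list in the question) and random data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = rng.random((200, 4))

estimators = [
    ("knn", KNeighborsRegressor(n_neighbors=5)),
    ("lin", LinearRegression()),
]
stack = StackingRegressor(
    estimators=estimators,
    final_estimator=RandomForestRegressor(n_estimators=5, random_state=0),
)
reg = MultiOutputRegressor(stack)
reg.fit(X, y)
predictions = reg.predict(X)

# The template passed to MultiOutputRegressor is not fitted in place.
template_is_fitted = hasattr(reg.estimator, "final_estimator_")

# One fitted StackingRegressor per output column, each with its own final estimator.
final_estimators = [est.final_estimator_ for est in reg.estimators_]

print(predictions.shape)
print(template_is_fitted, len(final_estimators))
```

If this holds in your version too, patching final_estimator_ onto reg.estimator only changes the unfitted template, so it is better to inspect reg.estimators_ instead.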
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
n_nodes = rf.tree_.node_count
Every time I run this code, I get the following error:
'RandomForestClassifier' object has no attribute 'tree_'
Any ideas why?
According to scikit-learn documentation, it doesn't have .tree_ attribute.
It only has: estimators_, classes_, n_classes_, n_features_, n_outputs_, feature_importances_, oob_score_, and oob_decision_function_ attributes.
You want to pull a single DecisionTreeClassifier out of your forest. From the documentation, estimators_ is the list of fitted DecisionTreeClassifier objects. base_estimator_ is also a DecisionTreeClassifier, but it is only the unfitted template the trees are cloned from, so it has no tree_ attribute either. The change to your code is:
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
# tree_ lives on each fitted tree, not on the forest itself
n_nodes = rf.estimators_[0].tree_.node_count
'tree_' is not an attribute of RandomForestClassifier. It is an attribute of DecisionTreeClassifier.
You should not use it directly on a RandomForestClassifier; the forest itself has no single tree to describe.
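If a node count for the whole forest is what you are after, one option is to read tree_ from every fitted tree in estimators_. A minimal sketch on synthetic data (make_classification and the parameter values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, only for illustration.
X_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X_train, y_train)

# Each element of estimators_ is a fitted DecisionTreeClassifier with its own tree_.
node_counts = [est.tree_.node_count for est in rf.estimators_]
total_nodes = sum(node_counts)

print(node_counts)
print(total_nodes)
```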
I am following step 3 of this example:
model.fit(dataset.data, dataset.target)
expected = dataset.target
predicted = model.predict(dataset.data)
I don't understand why scikit-learn doesn't recognize model.fit.
Do I need to assign that variable first?
Is there a missing import?
I'm working in jupyter, scikit-learn 0.17.1.
You first need to create an instance of whatever model you're using:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
fit(x, y) is a method that can be called on an estimator.
To use this method on model, you have to create model first and make sure it is an instance of an estimator class.
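Put together, a self-contained version of that step might look like the sketch below. The original example's dataset is not shown, so load_iris is an assumption here, as is the choice of DecisionTreeClassifier:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset; any object with .data and .target works the same way.
dataset = load_iris()

# Create the estimator first, then call fit on that instance.
model = DecisionTreeClassifier(random_state=0)
model.fit(dataset.data, dataset.target)

expected = dataset.target
predicted = model.predict(dataset.data)

# Fraction of training samples predicted correctly.
accuracy = (predicted == expected).mean()
print(accuracy)
```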