raise NotFittedError(msg % {'name': type(estimator).name})
sklearn.exceptions.NotFittedError: This Lasso instance is not fitted
yet. Call 'fit' with appropriate arguments before using this
estimator.
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
#
# Load the Boston Data Set
#
bh = datasets.load_boston()
X = bh.data
y = bh.target
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#
# Create an instance of Lasso Regression implementation
#
lasso = Lasso(alpha=1.0)
#
# Fit the Lasso model
#
lasso.fit(X_test, y_test)
#
# Create the model score
#
#lasso.score(X_test, y_test), lasso.score(X_train, y_train)
lasso_reg = Lasso(normalize=True)
y_pred_lass =lasso_reg.predict(X_test)
print(y_pred_lass)
You've actually created two lasso models. One called lasso which you fit. But after that you create another one lasso_reg = Lasso(normalize=True) which you try to call predict on but that model hasn't been fitted yet. Try this:
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
bh = datasets.load_boston()
X = bh.data
y = bh.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lasso = Lasso(alpha=1.0, normalize=True)
lasso.fit(X_test, y_test)
y_pred_lass =lasso.predict(X_test)
print(y_pred_lass)
As the error says you have to call lasso_reg.fit(X_test, y_test) before calling lasso_reg.predict(X_test) This will fix the issue.
lasso_reg = Lasso(normalize=True)
lasso_reg.fit(X_test, y_test)
y_pred_lass =lasso_reg.predict(X_test)
print(y_pred_lass)
Related
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
y_pred = none
accuracy = accuracy_score(y_test, y_pred)
print (accuracy)
What should I put in the y_pred = none area? Is there anything wrong with my code?
Normally you would split your data in train and test, below an example using iris:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X, y = make_classification()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
We can fit the model like you did:
logistic_model = LogisticRegression()
logistic_model.fit(X_train, y_train)
Get the prediction on test set and score with actual value:
y_pred = logistic_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
I built a linear model with the sklearn based on the Cement and Concrete Composites dataset.
Initially, i used the train_test_split(X, Y, test_size=0.3, Shuffle=False) and i found the train and test error.
Now i try to run the same model 10 times with Shuffle=True and compute the mean and sd of the errors. The new results should be compared to the first ones.
How could i loop the same model n times and save the errors in a list?
Try something like this:
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LinearRegression
errors = []
for i in range(10):
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, Shuffle=True)
model = LinearRegression() # the model you want to use here
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
error = accuracy_score(y_test, y_pred) # the error metric you want to use here
errors.append(error)
What you need is cross-validation: repeated evaluation of the model on different splits of the same data. train_test_split in this case is a wrapper around ShuffleSplit cross-validation.
In your case it might look like this:
from sklearn.model_selection import ShuffleSplit, cross_val_score
import numpy as np
from sklearn.linear_model import LinearRegression
X, y = ... # read dataset
model = LinearRegression()
# n_splits=10 is for 10 random shuffled train-test splits
cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
np.mean(scores), np.std(scores)
If you want to compute the error on your own or do anything else with models/results, you could do it like this:
for train_ids, test_ids in cv.split(X):
model.fit(X[train_ids], y[train_ids])
model.score(X[test_ids], y[test_ids])
...
More about this:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
Here is my code, and it always returns 100% accuracy, regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicates of data. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape
(20640,)
features.shape
(20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape
(16512,)
X_train.shape
(16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the below code to ensure no bugs are left.
Issues -
You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are removing nans after doing the test train split which will mess up the count of samples. Do the data.dropna() before the split.
You are using the model.score(X_test, y_test) incorrectly by passing it (X_test, predictions). Please use accuracy_score(X_test, predictions) with those parameters instead, or fix the syntax.
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = accuracy_score(y_test, predictions) #<----- THIRD ISSUE
score
I am running the [code] of multi-label classification1.how to fix the NameError that the "X_train" is not defined.the python code is given below.
import scipy
from scipy.io import arff
data, meta = scipy.io.arff.loadarff('./yeast/yeast-train.arff')
from sklearn.datasets import make_multilabel_classification
# this will generate a random multi-label dataset
X, y = make_multilabel_classification(sparse = True, n_labels = 20,
return_indicator = 'sparse', allow_unlabeled = False)
# using binary relevance
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
# initialize binary relevance multi-label classifier
# with a gaussian naive bayes base classifier
classifier = BinaryRelevance(GaussianNB())
# train
classifier.fit(X_train, y_train)
# predict
predictions = classifier.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predictions)
You forgot to split the dataset into train and test sets.
Import the library
from sklearn.model_selection import train_test_split
Add this line before classifier.fit()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
X_train does not exist, you have to split between train and test :
from sklearn.preprocessing import StandardScaler
s =StandardScaler()
X_train = s.fit_transform(X_train)
X_test = s.fit_transform(X_test)
I have 4 features and one target variable. I am using RandomForestRegressor instead of RandomForestClassifer as my target variable is float. When I am trying to fit my model and then output them in sorted order to get the important features I am getting Not fitted error how to fix it?
Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
feat_labels = data.columns[:4]
regr = RandomForestRegressor(max_depth=2, random_state=0)
#clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
#clf.fit(X_train, y_train)
regr.fit(X, y)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
You are fitting to regr but calling the feature importances on clf. Try calling this instead:
importances = regr.feature_importances_
I noticed that previously your classifier was being fit with the training data you setup, but the regressor is now being fit with X and y.
However, I don't see here where you're setting X and y in the first place or even more where you actually load in a dataset. Could it be you forgot this step as well as what Harpal mentioned in another answer?