I built a linear model with sklearn on the Cement and Concrete Composites dataset.
Initially, I used train_test_split(X, Y, test_size=0.3, shuffle=False) and computed the train and test error.
Now I want to run the same model 10 times with shuffle=True, compute the mean and standard deviation of the errors, and compare the new results to the first ones.
How could I loop the same model n times and save the errors in a list?
Try something like this (note that accuracy_score is only defined for classification; since LinearRegression is a regressor, a regression metric such as mean_squared_error is used here instead):
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

errors = []
for i in range(10):
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, shuffle=True)
    model = LinearRegression()  # the model you want to use here
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    error = mean_squared_error(y_test, y_pred)  # the error metric you want to use here
    errors.append(error)
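To get the mean and standard deviation the question asks for, you can then summarize the list with numpy:
import numpy as np

print(np.mean(errors), np.std(errors))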
What you need is cross-validation: repeated evaluation of the model on different splits of the same data. Under the hood, train_test_split is a wrapper around ShuffleSplit, so shuffled train-test splits map directly onto scikit-learn's cross-validation tools.
In your case it might look like this:
from sklearn.model_selection import ShuffleSplit, cross_val_score
import numpy as np
from sklearn.linear_model import LinearRegression
X, y = ... # read dataset
model = LinearRegression()
# n_splits=10 is for 10 random shuffled train-test splits
cv = ShuffleSplit(n_splits=10, test_size=.3, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_squared_error')
np.mean(scores), np.std(scores)
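Keep in mind that 'neg_mean_squared_error' returns the negated MSE (so that higher is better); flip the sign before reporting, for example to get RMSE statistics:
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())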
If you want to compute the error on your own or do anything else with models/results, you could do it like this:
for train_ids, test_ids in cv.split(X):
    # integer indexing like this assumes numpy arrays;
    # with pandas DataFrames use X.iloc[train_ids] instead
    model.fit(X[train_ids], y[train_ids])
    print(model.score(X[test_ids], y[test_ids]))  # R^2 on this split
    ...
More about this:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.ShuffleSplit.html
I wanted to find an optimal model for the assigned classification problem. Everything went smoothly until I applied the pd.get_dummies() function to preprocess the data. The experiment then showed an impossibly perfect result. I know this is unlikely to happen, but I do not know why. Any help would be highly appreciated.
The code for preprocessing the data is below:
# Encoding Booking Status
status_dict = {'Not_Canceled':1, 'Canceled':0}
df.booking_status = df.booking_status.map(status_dict)
df.drop('Booking_ID',axis=1, inplace=True)
df = df.dropna()
df = pd.get_dummies(df)
# Standardizing Data
from sklearn.preprocessing import StandardScaler
import numpy as np
X = df.iloc[:,0:-1]
y = df.iloc[:,-1]
scaler = StandardScaler().fit(X)
rescaledX = scaler.transform(X)
np.set_printoptions(precision=3)
print(rescaledX[0:5,:])
And I split my data into training and testing sets with a test proportion of 0.3:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(rescaledX, y, test_size=0.3, random_state=15)
I used several models, and the amazing result was that every one of them scored near-perfectly (screenshot omitted).
Simple code, stupid me. By the way, I am just a beginner in the ML field. Any advice on how to master it?
It was caused by data leakage. You must split your data first, before any data pre-processing step. For example,
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=15)  # split the raw features, not rescaledX
Then do your data scaling on the training and test data separately, fitting the scaler on the training data only:
scaler = StandardScaler().fit(X_train)
rescaledX_train = scaler.transform(X_train)
rescaledX_test = scaler.transform(X_test)  # same scaler, never refit on the test data
You could use a Pipeline as well to avoid data leakage:
# correct data preparation for model evaluation with k-fold cross-validation
from numpy import mean
from numpy import std
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=7)
# define the pipeline: the scaler is fit inside each CV fold, so it never sees that fold's test portion
pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), ('model', LogisticRegression())])
# define the evaluation procedure
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model using cross-validation
scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
# report performance
print('Accuracy: %.3f (%.3f)' % (mean(scores)*100, std(scores)*100))
Ref: https://machinelearningmastery.com/data-preparation-without-data-leakage/
raise NotFittedError(msg % {'name': type(estimator).__name__})
sklearn.exceptions.NotFittedError: This Lasso instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
#
# Load the Boston Data Set
#
bh = datasets.load_boston()
X = bh.data
y = bh.target
#
# Create training and test split
#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
#
# Create an instance of Lasso Regression implementation
#
lasso = Lasso(alpha=1.0)
#
# Fit the Lasso model
#
lasso.fit(X_test, y_test)
#
# Create the model score
#
#lasso.score(X_test, y_test), lasso.score(X_train, y_train)
lasso_reg = Lasso(normalize=True)
y_pred_lass = lasso_reg.predict(X_test)
print(y_pred_lass)
You've actually created two Lasso models: one called lasso, which you fit, and a second one, lasso_reg = Lasso(normalize=True), on which you then call predict even though it has never been fitted. Try this:
from sklearn import datasets
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
bh = datasets.load_boston()
X = bh.data
y = bh.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
lasso = Lasso(alpha=1.0, normalize=True)
lasso.fit(X_test, y_test)
y_pred_lass = lasso.predict(X_test)
print(y_pred_lass)
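As a side note, this keeps the original question's choice of fitting on the test split, which tells you nothing about generalization; a more conventional sketch would fit on the training split and predict on the test split:
lasso = Lasso(alpha=1.0, normalize=True)
lasso.fit(X_train, y_train)
y_pred_lass = lasso.predict(X_test)
(Note that the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions, standardize the features with StandardScaler instead.)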
As the error says, you have to call lasso_reg.fit(X_test, y_test) before calling lasso_reg.predict(X_test). This will fix the issue:
lasso_reg = Lasso(normalize=True)
lasso_reg.fit(X_test, y_test)
y_pred_lass = lasso_reg.predict(X_test)
print(y_pred_lass)
Here is my code, and it always returns 100% accuracy regardless of how big the test size is. I used the train_test_split method, so I do not believe there should be any duplicated data. Could someone inspect my code?
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = pd.read_csv('housing.csv')
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis = 1)
prices.shape    # (20640,)
features.shape  # (20640, 8)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
X_train = X_train.dropna()
y_train = y_train.dropna()
X_test = X_test.dropna()
y_test = X_test.dropna()
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_train.shape  # (16512,)
X_train.shape  # (16512, 8)
predictions = model.predict(X_test)
score = model.score(y_test, predictions)
score
EDIT: I have reworked my answer since I found multiple issues. Please copy-paste the code below to make sure no bugs are left.
Issues:
You are using DecisionTreeClassifier instead of DecisionTreeRegressor for a regression problem.
You are removing NaNs after doing the train-test split, which messes up the sample counts. Do data.dropna() before the split.
You are calling model.score() incorrectly: it expects (X_test, y_test), but you passed it (y_test, predictions). Also, accuracy is not defined for regression, so use a regression metric such as r2_score(y_test, predictions) instead.
from sklearn.tree import DecisionTreeRegressor #<---- FIRST ISSUE
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
data = pd.read_csv('housing.csv')
data = data.dropna() #<--- SECOND ISSUE
prices = data['median_house_value']
features = data.drop(['median_house_value', 'ocean_proximity'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = r2_score(y_test, predictions) #<----- THIRD ISSUE
score
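If you want an error measure rather than R², mean_squared_error from the same module works the same way; a minimal sketch with the variables above:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, predictions)  # average squared error on the test set
print(mse)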
I am using cross-validation to evaluate my ML models but now I want to look into the distribution of the errors, i.e. I want to get the average error of specific data points whenever they are in the test set.
from sklearn import linear_model
from sklearn.model_selection import KFold, cross_val_score
import numpy as np

X = ...  # data points
y = ...  # output
lm = linear_model.LinearRegression()
kfold = KFold(n_splits=10)
scores = cross_val_score(lm, X, y, scoring='neg_mean_squared_error', cv=kfold)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
print('Testing RMSE (lin reg): {:.3f}'.format(np.mean(rmse_scores)))
Is there an easy way to get the individual errors of each of the data points whenever they are in the test set (not training error) using cross-validation with scikit-learn?
Thank you!
If I understood your question correctly, this should be what you are looking for (note this gives one error per fold; see the sketch after the code for per-point errors):
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
import numpy as np

model = lm  # the estimator from the question; any regressor works
kf = KFold(n_splits=3)
error = []
for train_index, val_index in kf.split(X, y):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(X_train, y_train)
    pred = model.predict(X_val)
    current_error = mean_squared_error(y_val, pred)  # error for this fold
    error.append(current_error)
print(np.mean(error))  # mean error over all CV folds
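If you literally want one error per data point rather than per fold, cross_val_predict gives you the out-of-fold prediction for every sample, i.e. each point is predicted by a model that did not see it during training; a minimal sketch with the question's lm and kfold:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(lm, X, y, cv=kfold)  # out-of-fold prediction for every sample
per_point_error = (y - y_pred) ** 2  # squared error for each individual data point
With a plain KFold each point lands in the test set exactly once; to average over repeated appearances, use RepeatedKFold and aggregate these per-point errors across repeats.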
I have 4 features and one target variable. I am using RandomForestRegressor instead of RandomForestClassifier, as my target variable is a float. When I try to fit my model and then output the features sorted by importance, I get a NotFittedError. How do I fix it?
Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
feat_labels = data.columns[:4]
regr = RandomForestRegressor(max_depth=2, random_state=0)
#clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
#clf.fit(X_train, y_train)
regr.fit(X, y)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
You are fitting regr but reading the feature importances from clf, which was never fitted. Try calling this instead:
importances = regr.feature_importances_
I noticed that your classifier was previously fit on the training data you set up, but the regressor is now being fit on X and y.
However, I don't see where you set X and y in the first place, or where you actually load a dataset. Could it be that you forgot this step, in addition to what Harpal mentioned in another answer?
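For completeness, a hypothetical setup might look like the sketch below; the file name and column layout (four feature columns followed by the target) are assumed, since the original snippet never shows them:
import pandas as pd

data = pd.read_csv('your_data.csv')  # hypothetical file name
X = data.iloc[:, :4].values  # assumes the first four columns are the features
y = data.iloc[:, -1].values  # assumes the last column is the float target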