I have a regression task and am predicting with linear regression and random forest models. I need some hints or a code example showing how to ensemble them (simple averaging is already done). Here are my model implementations in Python:
np.random.seed(42)
mask = np.random.rand(happiness2.shape[0]) <= 0.7
print('Train set shape {0}, test set shape {1}'.format(happiness2[mask].shape, happiness2[~mask].shape))
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(happiness22[mask].drop(['Country', 'Happiness_Score_2017',
'Happiness_Score_2018','Happiness_Score_2019'], axis=1).fillna(0),
happiness22[mask]['Happiness_Score_2019'] )
pred = lr.predict(happiness22[~mask].drop(['Country', 'Happiness_Score_2017',
'Happiness_Score_2018','Happiness_Score_2019'], axis=1).fillna(0))
print('RMSE = {0:.04f}'.format(np.sqrt(np.mean((pred - happiness22[~mask]['Happiness_Score_2019'])**2))))
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100)
rf.fit(happiness22[mask].drop(['Country', 'Happiness_Score_2017',
'Happiness_Score_2018','Happiness_Score_2019'], axis=1).fillna(0),
happiness22[mask]['Happiness_Score_2019'] )
pred3 = rf.predict(happiness22[~mask].drop(['Country', 'Happiness_Score_2017',
'Happiness_Score_2018','Happiness_Score_2019'], axis=1).fillna(0))
print('RMSE = {0:.04f}'.format(np.sqrt(np.mean((pred3 - happiness22[~mask]['Happiness_Score_2019'])**2))))
avepred=(pred+pred3)/2
print('RMSE = {0:.04f}'.format(np.sqrt(np.mean((avepred - happiness22[~mask]['Happiness_Score_2019'])**2))))
First, evaluate each model (linear regression and random forest) on a validation set to obtain its error (MSE, for instance).
Then weight each model according to this error, giving the model with the lower error the larger weight, and use these weights later when predicting.
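The weighting idea above could be sketched roughly as follows. This is only one possible scheme (weights proportional to the inverse validation MSE), and the synthetic data, the three-way split and the `mse` helper are assumptions made for illustration, not part of the original code:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with your own features and target).
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0]) + np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# Three-way split: train to fit, validation to set the weights, test to report.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

models = {
    "lr": LinearRegression(),
    "rf": RandomForestRegressor(n_estimators=100, random_state=0),
}


def mse(y_true, y_pred):
    """Mean squared error (hypothetical helper for this sketch)."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Fit each model on the training part and measure its validation error.
val_mse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_mse[name] = mse(y_val, model.predict(X_val))

# Inverse-error weights, normalised so that they sum to 1.
inverse = {name: 1.0 / err for name, err in val_mse.items()}
total = sum(inverse.values())
weights = {name: value / total for name, value in inverse.items()}

# Weighted average of the test-set predictions.
ensemble_pred = sum(weights[name] * models[name].predict(X_test) for name in models)
ensemble_rmse = np.sqrt(mse(y_test, ensemble_pred))
print(weights)
print('RMSE = {0:.04f}'.format(ensemble_rmse))
```

The weights are chosen on data that neither model was fitted on, so the test RMSE stays an honest estimate. Whether this beats plain averaging depends on the data.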
You can also use the COBRA ensemble method (developed by Guedj et al.):
https://modal.lille.inria.fr/pycobra/
I am working on linear regression and polynomial ML models.
I have done the following:
Built a least squares multiple linear regression model to predict mass from the other ten attributes in my dataset and printed out the weights (coefficients and intercept) for the model
Reduced the model down to using only one feature 𝑥 by using recursive feature elimination from scikit-learn library to determine the feature I should use.
I am now stuck on the next parts of the question where I have to:
Use the feature 𝑥 I have identified to construct a polynomial regression model of the form:
$f(x) = w_0 + w_1 x + w_2 x^2$
Print out the weights for this model and plot the polynomial regression model
I am not sure where I am going wrong with the code. Here is what I have so far:
import numpy as np
import pandas as pd
from google.colab import drive
from sklearn.linear_model import LinearRegression
#what to predict MASS from other 10 attributes
drive.mount('/content/gdrive')
sampledata = pd.read_csv (r'/content/gdrive/MyDrive/PG Cert in AI/Machine Learning/Coursework/coursework.txt', delimiter='\t',skiprows=0)
#number of rows and columns
print(f"The number of rows and columns {sampledata.shape}")
print(sampledata.head())
cols = ["Fore", "Bicep", "Chest","Neck", "Shoulder", "Waist", "Height", "Calf", "Thigh", "Head"]
x = sampledata[cols]
y = sampledata["Mass"]
print(x.shape)
print(y.shape)
# Create an instance of a linear regression model and fit it to the data with the fit() function:
regressor = LinearRegression()
regressor.fit(x, y)
# print the intercept (w0)
print('w0 = ',regressor.intercept_)
# print the weight vector for the model (w1, w2, ...)
print('coefficients = ',regressor.coef_)
from sklearn.feature_selection import RFE
#reducing the number of features
reg2 = LinearRegression()
rfe = RFE(reg2, n_features_to_select=1)
fit = rfe.fit(x,y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
#constructing a polynomial regression model
from sklearn import metrics
highestFeature = sampledata["Fore"]
def build_polynomial_model(maxorder):
    X = sampledata["Fore"]
    for i in range(2,maxorder+1):
        output = np.hstack((X,X**i))
    regressor = LinearRegression()
    regressor.fit(output,y)
    y = regressor.predict(drive._output)
    test_mse = metrics.mean_squared_error(output,y)
    return test_mse
print(build_polynomial_model(2))
This is my error message:
[screenshot of the error message; image not available]
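For reference, here is a minimal sketch of how a quadratic model of this form might be fitted with scikit-learn. The data are synthetic stand-ins (an assumption; in the question `x` would be `sampledata["Fore"]` and `y` would be `sampledata["Mass"]`), and names such as `X_poly` and `x_grid` are hypothetical. A few things in the posted function look like likely trouble spots: `np.hstack` on two 1-D inputs concatenates them end to end instead of creating two columns, assigning to `y` inside the function makes `y` a local name there, `drive._output` is not the feature matrix, and the MSE should compare targets with predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the selected feature and the target (assumption).
rng = np.random.default_rng(0)
x = rng.uniform(20.0, 35.0, size=50)
y = 2.0 + 0.5 * x + 0.1 * x**2 + rng.normal(scale=1.0, size=50)

# Design matrix with one column per power: shape (n_samples, 2).
X_poly = np.column_stack((x, x**2))

poly_model = LinearRegression().fit(X_poly, y)
w0 = poly_model.intercept_
w1, w2 = poly_model.coef_
print('w0 =', w0, 'w1 =', w1, 'w2 =', w2)

# MSE compares the true targets with the model's predictions.
train_mse = np.mean((poly_model.predict(X_poly) - y) ** 2)
print('MSE =', train_mse)

# Points along the fitted curve, ready for plotting, e.g. with matplotlib:
#   plt.scatter(x, y); plt.plot(x_grid, y_grid); plt.show()
x_grid = np.linspace(x.min(), x.max(), 200)
y_grid = w0 + w1 * x_grid + w2 * x_grid**2
```

`sklearn.preprocessing.PolynomialFeatures` would be an alternative way to build the powers; the explicit `column_stack` is used here only to keep the mapping to $w_1$ and $w_2$ visible.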
I am training different models for a regression problem. Since I want to find the best model among the candidates, I want to perform cross-validation with k = 20 to characterize the MSE of each model and statistically determine which one is better.
The problem has multiple dependent variables, and I would like to determine the MSE separately for each of them, but cross_val_score doesn't let me do that explicitly.
Here is some example code of one of my models:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x_test)
mse = mean_squared_error(scaler2.inverse_transform(y_test), scaler2.inverse_transform(y_pred), multioutput="raw_values")
How can I iterate over the k train/test splits, so that k models are trained and tested as in k-fold cross-validation?
Scikit-learn provides KFold, but as far as I can tell it is just a way to specify the number of folds and doesn't actually return the training and test folds, so I can't think of a way to train the different models following the k-fold cross-validation procedure. Plus, I need to evaluate the MSE separately on each dependent variable, since the problem has multiple dependent variables.
You can use scikit-learn's KFold cross-validation with just a simple for loop: kf.split() yields the train and test indices for each fold.
Here is an example that runs 5-fold cross-validation on a Bayes classifier:
import numpy as np
from sklearn.model_selection import KFold
k = 5
kf = KFold(n_splits=k)
res = []
# trainBayes and predict are my own helper functions (not shown here)
for train_index, test_index in kf.split(X_train_concat):
    X_train_kf, X_test_kf = X_train_concat[train_index, :], X_train_concat[test_index, :]
    y_train_kf, y_test_kf = y_train_concat[train_index], y_train_concat[test_index]
    # train only on this fold's training rows (labels appended as the last column)
    X_train = np.append(X_train_kf, np.reshape(y_train_kf, (len(y_train_kf), 1)), axis=1)
    W_bayes = trainBayes(X_train)
    y_pred = predict(X_test_kf, W_bayes)
    # misclassification rate for this fold, in percent
    mis_classification = len(y_pred) - np.count_nonzero(y_pred == y_test_kf)
    e = (mis_classification / y_test_kf.shape[0]) * 100
    res.append(e)
avg_res = sum(res) / k
print('Result of each fold - {}'.format(res))
print('Avg result : {}'.format(avg_res))
For more check this
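The same loop can be adapted to the regression case in the question. The following is a rough sketch, assuming NumPy arrays and synthetic data with two dependent variables; the scaling with `scaler2` from the question is left out, and all variable names are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data with two dependent variables (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
W = rng.normal(size=(4, 2))
Y = X @ W + rng.normal(scale=0.1, size=(200, 2))

kf = KFold(n_splits=20, shuffle=True, random_state=0)
fold_mse = []
for train_index, test_index in kf.split(X):
    # A fresh model is trained on each fold's training rows only.
    model = LinearRegression().fit(X[train_index], Y[train_index])
    Y_pred = model.predict(X[test_index])
    # multioutput="raw_values" returns one MSE per dependent variable.
    fold_mse.append(mean_squared_error(Y[test_index], Y_pred, multioutput="raw_values"))

fold_mse = np.array(fold_mse)  # shape (n_folds, n_targets)
print('Mean MSE per target:', fold_mse.mean(axis=0))
print('Std of MSE per target:', fold_mse.std(axis=0))
```

Keeping the per-fold values in an array, rather than only a running sum, leaves the 20 MSE values per target available for any statistical comparison between models.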
So I have this small dataset and I want to perform multiple linear regression on it.
First I dropped the deliveries column because of its high correlation with miles. Although gasprice is also supposed to be removed, I kept it so that I can perform multiple linear regression rather than simple linear regression.
Finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#let's find the coefficients of the multiple linear regression and also the intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the (column name, coefficient) pairs and print them
for coef in zip(X.columns, regression_model.coef_[0]):
    print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
model_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calculate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I get different coefficients every time I print them out. What did I do wrong, and are any of them correct?
I see you are trying three different things here, so let me summarize:
1. sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it, since you fixed the random state)
2. statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not split into train and test)
3. sklearn.linear_model.LinearRegression() with the full dataset, as in #2.
I tried to reproduce this with the iris dataset, and I get identical results for cases #2 and #3 (which are trained on exactly the same data), and only slightly different coefficients for case #1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html
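To illustrate the point about the three cases, here is a small sketch on synthetic data (an assumption; it is not the dataset from the question). To keep it dependency-light, plain least squares with an explicit constant column via `np.linalg.lstsq` stands in for `statsmodels.api.OLS` with `sm.add_constant`, which is also an assumption of this sketch:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for two predictors such as miles and gasprice (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 2))
y = 3.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=120)

# Case 1: sklearn fitted on the 80% training split only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
coef_split = LinearRegression().fit(X_train, y_train).coef_

# Case 3: sklearn fitted on the full dataset.
coef_full = LinearRegression().fit(X, y).coef_

# Case 2 analogue: ordinary least squares with an explicit constant column.
X_const = np.column_stack((np.ones(len(X)), X))
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)
coef_ols = beta[1:]  # drop the intercept term

print('80% split :', coef_split)
print('full data :', coef_full)
print('OLS       :', coef_ols)
```

The two full-data fits agree up to floating-point error, while the fit on the 80% split differs slightly because it sees different rows.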
I'm trying to compare the RMSE from performing multiple linear regression on the full data set with the RMSE from 10-fold cross-validation, using the KFold class in scikit-learn. I found some code that I tried to adapt, but I can't get it to work (and I suspect it never worked in the first place).
TIA for any help!
Here's my linear regression function
def standRegres(xArr,yArr):
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T*xMat
    if np.linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
## I run it on my matrix ("comm_df") and my dependent var (comm_target)
## Calculate RMSE (omitted some code)
initial_regress_RMSE = np.sqrt(np.mean((yHat_array - comm_target_array)**2))
## Now trying to get RMSE after training model through 10-fold cross validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf:
    linreg.fit(comm_df,comm_target)
    p = linreg.predict(comm_df)
    e = p-comm_target
    xval_err += np.sqrt(np.dot(e,e)/len(comm_df))
rmse_10cv = xval_err/10
I get an error about how kfold object is not iterable
There are several things you need to correct in this code.
You cannot iterate over kf. You can only iterate over kf.split(comm_df)
You need to somehow use the train test split that KFold provides. You are not using them in your code! The goal of the KFold is to fit your regression on the train observations, and to evaluate the regression (ie compute the RMSE in your case) on the test observations.
With this in mind, here is how I would correct your code (it is assumed here that your data is in numpy arrays, but you can easily switch to pandas)
linreg = LinearRegression()
kf = KFold(n_splits=10)
xval_err = 0
for train, test in kf.split(comm_df):
    # fit on the training fold only
    linreg.fit(comm_df[train], comm_target[train])
    # evaluate on the held-out fold
    p = linreg.predict(comm_df[test])
    e = p - comm_target[test]
    xval_err += np.sqrt(np.dot(e, e) / len(comm_target[test]))
rmse_10cv = xval_err / 10
So the code you provided still threw an error. I abandoned what I had above in favor of the following, which works:
## KFold cross-validation
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
## Define variables for the for loop
kf = KFold(n_splits=10)
RMSE_sum=0
RMSE_length=10
X = np.array(comm_df)
y = np.array(comm_target)
for loop_number, (train, test) in enumerate(kf.split(X)):
    ## Get Training Matrix and Vector
    training_X_array = X[train]
    training_y_array = y[train].reshape(-1, 1)
    ## Get Testing Matrix Values
    X_test_array = X[test]
    y_actual_values = y[test]
    ## Fit the Linear Regression Model
    lr_model = LinearRegression().fit(training_X_array, training_y_array)
    ## Compute the predictions for the test data
    prediction = lr_model.predict(X_test_array)
    crime_probabilites = np.array(prediction)
    ## Calculate the RMSE
    RMSE_cross_fold = RMSEcalc(crime_probabilites, y_actual_values)
    ## Add each RMSE_cross_fold value to the sum
    RMSE_sum=RMSE_cross_fold+RMSE_sum
## Calculate the average and print
RMSE_cross_fold_avg=RMSE_sum/RMSE_length
print('The Mean RMSE across all folds is',RMSE_cross_fold_avg)
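For anyone who wants to run the comparison end to end, here is a self-contained sketch of the same idea. The data are synthetic and the `rmse` helper is a hypothetical stand-in for `RMSEcalc`, which is not shown above. Both arrays are flattened before subtracting, because a prediction of shape (n, 1) minus a target of shape (n,) would otherwise broadcast to an (n, n) matrix:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def rmse(predicted, actual):
    """Root mean squared error (hypothetical stand-in for RMSEcalc)."""
    predicted = np.asarray(predicted).ravel()
    actual = np.asarray(actual).ravel()
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Synthetic stand-in for comm_df / comm_target (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([0.4, -0.2, 0.1]) + rng.normal(scale=0.05, size=100)

# RMSE when fitting and evaluating on the full data set.
full_rmse = rmse(LinearRegression().fit(X, y).predict(X), y)

# RMSE averaged over 10 folds, each evaluated on held-out rows.
kf = KFold(n_splits=10)
fold_rmse = []
for train, test in kf.split(X):
    lr_model = LinearRegression().fit(X[train], y[train])
    fold_rmse.append(rmse(lr_model.predict(X[test]), y[test]))
cv_rmse = float(np.mean(fold_rmse))

print('RMSE on the full data set:', full_rmse)
print('Mean RMSE across all folds:', cv_rmse)
```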
I'm learning ML and working on the Boston house price prediction task. I have the following code:
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
def fit_model(X, y):
    """ Tunes a decision tree regressor model using GridSearchCV on the input data X
    and target labels y and returns this optimal model. """
    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()
    # Set up the parameters we wish to tune
    parameters = {'max_depth':(1,2,3,4,5,6,7,8,9,10)}
    # Make an appropriate scoring function
    scoring_function = make_scorer(fbeta_score, beta=2)
    # Make the GridSearchCV object
    reg = GridSearchCV(regressor, param_grid=parameters, scoring=scoring_function)
    print(reg)
    # Fit the learner to the data to obtain the optimal model with tuned parameters
    reg.fit(X, y)
    # Return the optimal model
    return reg.best_estimator_
reg = fit_model(housing_features, housing_prices)
This gives me ValueError: continuous is not supported for the reg.fit(X, y) line and I don't understand why. What is the reason for this, what am I missing here?
That's because of the line:
scoring_function = make_scorer(fbeta_score, beta=2)
This sets the scoring metric to F-beta, which is for classification tasks!
You are doing regression here, as seen in:
regressor = DecisionTreeRegressor()
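One way to resolve this is to swap in a regression metric. The sketch below uses mean squared error, which is an assumption about what you want to optimise; the data are synthetic stand-ins for the housing features and prices. Passing `greater_is_better=False` tells `make_scorer` that lower values are better, so the resulting scores are negated MSE values:

```python
import numpy as np
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing_features / housing_prices (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))
y = 5.0 * np.sin(X[:, 0]) + X[:, 1] + rng.normal(scale=0.5, size=200)

# A regression metric wrapped as a scorer; MSE is a loss, so lower is better.
scoring_function = make_scorer(mean_squared_error, greater_is_better=False)

parameters = {'max_depth': list(range(1, 11))}
reg = GridSearchCV(DecisionTreeRegressor(random_state=0),
                   param_grid=parameters,
                   scoring=scoring_function,
                   cv=5)
reg.fit(X, y)

best_model = reg.best_estimator_
print(reg.best_params_)
print(reg.best_score_)  # negated MSE of the best setting
```

The built-in string `scoring='neg_mean_squared_error'` should give the same behaviour without constructing the scorer by hand.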
From the docs