Best Practice for Executing Ensemble Methods - python

I am running the sample code below.
df = pd.read_csv('C:\\my_path\\test.csv', header=0, encoding = 'unicode_escape')
df = df.fillna(0)
X = df.drop(columns = ['PRICE','MATURITYDATE'])
y = df['PRICE']
from sklearn.model_selection import train_test_split
#split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
#create new a knn model
knn = KNeighborsClassifier()
#create a dictionary of all values we want to test for n_neighbors
params_knn = {'n_neighbors': np.arange(1, 25)}
#use gridsearch to test all values for n_neighbors
knn_gs = GridSearchCV(knn, params_knn, cv=5)
#fit model to training data
knn_gs.fit(X_train, y_train)
#save best model
knn_best = knn_gs.best_estimator_
#check best n_neigbors value
print(knn_gs.best_params_)
# RANDOM FOREST
from sklearn.ensemble import RandomForestClassifier
#create a new random forest classifier
rf = RandomForestClassifier()
#create a dictionary of all values we want to test for n_estimators
params_rf = {'n_estimators': [50, 100, 200]}
#use gridsearch to test all values for n_estimators
rf_gs = GridSearchCV(rf, params_rf, cv=5)
#fit model to training data
rf_gs.fit(X_train, y_train)
#save best model
rf_best = rf_gs.best_estimator_
#check best n_estimators value
print(rf_gs.best_params_)
# LOGISTIC REGRESSION
from sklearn.linear_model import LogisticRegression
#create a new logistic regression model
log_reg = LogisticRegression()
#fit the model to the training data
log_reg.fit(X_train, y_train)
#test the three models with the test data and print their accuracy scores
print('knn: {}'.format(knn_best.score(X_test, y_test)))
print('rf: {}'.format(rf_best.score(X_test, y_test)))
print('log_reg: {}'.format(log_reg.score(X_test, y_test)))
# VOTING CLASSIFIER
from sklearn.ensemble import VotingClassifier
#create a dictionary of our models
estimators=[('knn', knn_best), ('rf', rf_best), ('log_reg', log_reg)]
#create our voting classifier, inputting our models
ensemble = VotingClassifier(estimators, voting='hard')
It's all from the link below
https://towardsdatascience.com/ensemble-learning-using-scikit-learn-85c4531ff86a
The problem that I'm running into with each of these methods is always this:
ValueError: Unknown label type: 'continuous'
I guess everything needs to be converted into a categorical type or, perhaps, one hot encoding needs to be applied. Is this correct? What is the best way to deal with this kind of issue? I'm hoping to keep things simple and very generic, without introducing custom coding. This is why I am leaning towards the scikit-learn libraries. I'd greatly appreciate any/all thoughts and insights. Thanks so much!
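One simple way to keep the classification workflow above is to discretize the continuous PRICE target before splitting, for example with pandas.cut; alternatively, each model can be swapped for its regressor counterpart (KNeighborsRegressor, RandomForestRegressor, VotingRegressor). Below is a minimal sketch of the binning approach; the bin count and labels are arbitrary placeholders, not values from the original post.
import pandas as pd
from sklearn.model_selection import train_test_split
#bin the continuous PRICE into three ordinal classes so the classifiers see discrete labels
y_binned = pd.cut(df['PRICE'], bins=3, labels=['low', 'mid', 'high'])
#re-split with the binned target; the rest of the pipeline stays unchanged
X_train, X_test, y_train, y_test = train_test_split(X, y_binned, test_size=0.33, random_state=42)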

Related

How can I create an instance of a multi-layer perceptron network to use in a bagging classifier?

I am trying to create an instance of a multi-layer perceptron network to use in a bagging classifier, but I don't understand how to fix it.
Here is my code:
My task is:
1. Apply a bagging classifier (with or without replacement) with the eight base classifiers created in the previous step.
It would be really great if you could show me how to implement this in my algorithm. I did my research but couldn't find a way to do it.
To train your BaggingClassifier:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report, confusion_matrix
#Load the digits data:
X,y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
# Feature scaling
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
# Finally for the MLP- Multilayer Perceptron
mlp = MLPClassifier(hidden_layer_sizes=(16, 8, 4, 2), max_iter=1001)
clf = BaggingClassifier(mlp, n_estimators=8)
clf.fit(X_train,y_train)
To analyze your output you may try:
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
print(cm)
To see the number of correctly predicted instances per class (numpy is needed here):
import numpy as np
print(cm[np.eye(len(clf.classes_)).astype("bool")])
To see the percentage of correctly predicted instances per class:
cm[np.eye(len(clf.classes_)).astype("bool")]/cm.sum(1)
To see the total accuracy of your algorithm:
(y_pred==y_test).mean()
EDIT
To access predictions on a per-base-estimator basis, i.e. from your individual MLPs, you can do:
estimators = clf.estimators_
# print(len(estimators), type(estimators[0]))
preds = []
for base_estimator in estimators:
    preds.append(base_estimator.predict(X_test))
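As a quick usage check (not part of the original answer), each base estimator's predictions collected above can be scored against the test labels to see how the individual MLPs perform on their own:
for i, p in enumerate(preds):
    print("mlp {}: accuracy {:.3f}".format(i, (p == y_test).mean()))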

How to get the predicted probabilities of a classification model?

I'm trying out different classification models using a binary dependent variable (occupied/unoccupied). The models I am interested in are Logistic regression, Decision tree and Gaussian Naïve Bayes.
My input data is a csv-file with a datetime index (e.g. 2019-01-07 14:00), three variable columns ("R", "P", "C", containing numerical values), and the dependent variable column ("value", containing the binary values).
Training the model is not the problem; that all works fine. All the models give me their prediction in binary values (which of course should be the ultimate outcome), but I would also like to see the predicted probabilities that made them decide on either of the binary values. Is there any way to also get these values?
I have tried all of the classification visualizers that work with the yellowbrick package (ClassBalance, ROCAUC, ClassificationReport, ClassPredictionError), but none of them gives me a graph that shows the probabilities calculated by the model for the data set.
import pandas as pd
import numpy as np
data = pd.read_csv('testrooms_data.csv', parse_dates=['timestamp'])
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
##split dataset into training and test sets
X = data.drop("value", axis=1) # X contains all the features
y = data["value"] # y contains only the label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 1)
###model training
###Logistic Regression###
clf_lr = LogisticRegression()
# fit the dataset into LogisticRegression Classifier
clf_lr.fit(X_train, y_train)
#predict on the unseen data
pred_lr = clf_lr.predict(X_test)
###Decision Tree###
from sklearn.tree import DecisionTreeClassifier
clf_dt = DecisionTreeClassifier()
pred_dt = clf_dt.fit(X_train, y_train).predict(X_test)
###Bayes###
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
###visualization for e.g. LogReg
from yellowbrick.classifier import ClassificationReport
from yellowbrick.classifier import ClassPredictionError
from yellowbrick.classifier import ROCAUC
#classificationreport
visualizer = ClassificationReport(clf_lr, support=True)
visualizer.fit(X_train, y_train) # Fit the visualizer and the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
g = visualizer.poof() # Draw/show/poof the data
#classprediction report
visualizer2 = ClassPredictionError(LogisticRegression())
visualizer2.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer2.score(X_test, y_test) # Evaluate the model on the test data
g2 = visualizer2.poof() # Draw visualization
#(ROC)
visualizer3 = ROCAUC(LogisticRegression())
visualizer3.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer3.score(X_test, y_test) # Evaluate the model on the test data
g3 = visualizer3.poof() # Draw/show/poof the data
It would be great to have, e.g., an array similar to pred_lr that contains the probabilities calculated for each row of the csv file. Is that possible? If yes, how can I get it?
Most sklearn estimators (if not all) have a method for obtaining the probabilities that underlie the classification, either as plain probabilities or as log probabilities.
For example, if you have your Naive Bayes classifier and you want to obtain the probabilities rather than the classification itself, you could do the following (using the same names as in your code):
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
pred_bayes = bayes.fit(X_train, y_train).predict(X_test)
#for probabilities
bayes.predict_proba(X_test)
bayes.predict_log_proba(X_test)
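The same pattern applies to the other models. As a sketch of what the question describes, the logistic regression probabilities for every test row can be collected into a DataFrame aligned with X_test (one column per class, taken from clf_lr.classes_):
import pandas as pd
#one row per test sample, one column per class, indexed like the original csv rows
proba_lr = pd.DataFrame(clf_lr.predict_proba(X_test), columns=clf_lr.classes_, index=X_test.index)
print(proba_lr.head())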
Hope this helps.

Why am I getting a NotFittedError in my random forest classifier?

I have 4 features and one target variable. I am using RandomForestRegressor instead of RandomForestClassifier as my target variable is a float. When I try to fit my model and then output the features in sorted order of importance, I get a NotFittedError. How do I fix it?
Code:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn import datasets
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
feat_labels = data.columns[:4]
regr = RandomForestRegressor(max_depth=2, random_state=0)
#clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
#clf.fit(X_train, y_train)
regr.fit(X, y)
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))
You are fitting to regr but calling the feature importances on clf. Try calling this instead:
importances = regr.feature_importances_
I noticed that previously your classifier was being fit with the training data you set up, but the regressor is now being fit with X and y.
However, I don't see where you're setting X and y in the first place, or, more to the point, where you actually load in a dataset. Could it be that you forgot this step, in addition to what Harpal mentioned in another answer?
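A hedged sketch of how the corrected pieces could fit together, assuming data is a DataFrame whose first four columns are the features and whose fifth column is the float target (the column positions are an assumption, not something stated in the question):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
feat_labels = data.columns[:4]
X = data[feat_labels]
y = data[data.columns[4]]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
regr = RandomForestRegressor(max_depth=2, random_state=0)
regr.fit(X_train, y_train)
#importances must come from the estimator that was actually fitted, i.e. regr
importances = regr.feature_importances_
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30, feat_labels[indices[f]], importances[indices[f]]))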

I am not able to output the best selected features using random forest?

I have made a random forest classifier with a feature-importance threshold of 0.15, but when I try to iterate over the selected model it does not output the best selected features.
Code:
X = data.loc[:,'IFATHER':'VEREP']
y = data.loc[:,'Criminal']
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
# Split the data into 30% test and 70% training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Train the classifier
clf.fit(X_train, y_train)
# Print the name and gini importance of each feature
for feature in zip(X, clf.feature_importances_):
    print(feature)
# Create a selector object that will use the random forest classifier to identify
# features that have an importance of more than 0.15
sfm = SelectFromModel(clf, threshold=0.15)
# Train the selector
sfm.fit(X_train, y_train)
The code below does not work:
# Print the names of the most important features
for feature_list_index in sfm.get_support(indices=True):
    print(X[feature_list_index])
I am able to get the importance of each feature from the random forest classifier, but not the features selected by the threshold value. I think get_support() is not the right method.
To create a new X data set containing the most important features:
X_selected_features = sfm.fit_transform(X_train, y_train)
To see the feature names:
features = np.array(list_of_feature_names)
print(features[sfm.get_support()])
If X is a pandas DataFrame:
features = X.columns.values
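Putting those pieces together, a short sketch (assuming X is a pandas DataFrame, as in the question):
import pandas as pd
#names of the features SelectFromModel kept (importance above the 0.15 threshold)
selected_features = X.columns.values[sfm.get_support()]
print(selected_features)
#reduced training matrix, re-wrapped as a DataFrame with the selected column names
X_train_selected = pd.DataFrame(sfm.transform(X_train), columns=selected_features)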

Logistic Regression: Train using past data and predict using current data?

I've trained and tested my logistic regression using available data but now need to output a future prediction. I want to include the 2017 values that I used in my training and test set to predict the 2018 probability.
This is the code I used to train and test my model:
Xadj = train.ix[:,('2016 transaction count','critical_CI', 'critical_CN','critical_CS',
'critical_FI', 'critical_IN','critical_OI','critical_RA','create_year_2012', 'create_year_2013',
'create_year_2014', 'create_year_2015','create_year_2016')]
#Coded is the transformation of 2017 transaction count to a binary variable
y = train.ix[:,('2017 transaction count coded')]
logit_model=sm.Logit(y,Xadj)
result=logit_model.fit()
print(result.summary())
X_train, X_test, y_train, y_test = train_test_split(Xadj, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
#Cross Validation
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
kfold = model_selection.KFold(n_splits=10, random_state=7)
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))
In an attempt to export predictions for 2018, I have done the following:
#Create 2018 Purchase Probability
train['2018 Purchase Probability']=pd.DataFrame({'2018 Purchase Probability' : []})
yact=train.ix[:,('2018 Purchase Probability')]
#Adding in 2017 values
X = train.ix[:, ('2017 transaction count','critical_CI', 'critical_CN','critical_CS',
'critical_FI', 'critical_IN','critical_OI','critical_RA','create_year_2012', 'create_year_2013',
'create_year_2014', 'create_year_2015','create_year_2016','create_year_2017')]
from sklearn.preprocessing import scale, StandardScaler
scaler = StandardScaler()
scaler.fit(Xadj)
X = scaler.transform(Xadj)
X_pred = scaler.transform(X)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(Xadj, y)
#Generate 0/1 prediction
prediction = logreg.predict(X= X)
#Generate odds ratio
precent_prediction = logreg.predict_proba(X= X)
prediction = pd.DataFrame(prediction)
I'm not sure if I've done this correctly and judging from my output (which is mostly 1's) I don't think I have. I am new to coding in Python and am struggling to turn my tested model into a future prediction that can be used to make decisions.
Thanks in advance for any help!
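Not an authoritative answer, but one common way to structure this kind of workflow is to fit the scaler and the model on the training features (Xadj) and the coded 2017 target, then call predict_proba on a separately scaled 2017-feature matrix to get 2018 probabilities. The sketch below assumes a hypothetical X_2017 frame built from train with exactly the same columns, in the same order, as Xadj (so create_year_2017 cannot simply be appended):
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
scaler = StandardScaler()
Xadj_scaled = scaler.fit_transform(Xadj)   #fit the scaler on the training features only
X_2017_scaled = scaler.transform(X_2017)   #reuse the same scaling for the prediction features
logreg = LogisticRegression()
logreg.fit(Xadj_scaled, y)                 #y is the coded 2017 transaction count
#probability of the positive class for each row, rather than the hard 0/1 label
train['2018 Purchase Probability'] = logreg.predict_proba(X_2017_scaled)[:, 1]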
