Forecasting future data (e.g. day x) with a trained model - python

I have a time-series dataset. I have used cross-validation and an XGBRegressor model. Now I want to forecast my prediction for a particular day x.
As per my understanding, any fundamental ML prediction model can be expressed as y = a·f(x) + b, where x = input, b = bias, y = prediction.
I have already trained the model, so in my case x is the time vector. The predicted forecast output, say a matrix Y, already contains the forecasts for all x days/months.
So the only thing left for me is to filter out of the data what I need.
I am trying to write a function that takes Y as its argument and filters out, for example, the values below 20%, or the forecast on day 36. Can someone explain how I can write this?
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
import xgboost as xgb

tss = TimeSeriesSplit(n_splits=3, test_size=24*365*1, gap=24)
df3 = df3.sort_index()
fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df3):
    train = df3.iloc[train_idx].dropna()
    test = df3.iloc[val_idx].dropna()
    FEATURES = ['7day_rolling_avg', 'Lag_1']
    TARGET = 'Liquid Lvl % C'
    X_train = train[FEATURES]
    y_train = train[TARGET]
    X_test = test[FEATURES]
    y_test = test[TARGET]

###############################################################################
xgbr = xgb.XGBRegressor(verbosity=0)
print(xgbr)
# Output:
# XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
#              colsample_bynode=1, colsample_bytree=1, gamma=0,
#              importance_type='gain', learning_rate=0.1, max_delta_step=0,
#              max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
#              n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
#              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
#              silent=None, subsample=1, verbosity=1)
xgbr.fit(X_train, y_train)
score = xgbr.score(X_train, y_train)
print("Training score: ", score)
scores = cross_val_score(xgbr, X_train, y_train, cv=3)
print("Mean cross-validation score: %.2f" % scores.mean())
ypred = xgbr.predict(X_test)
mse = mean_squared_error(y_test, ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % (mse**(1/2.0)))
x_ax = range(len(y_test))
plt.plot(x_ax, y_test, label="original")
plt.plot(x_ax, ypred, label="predicted")
plt.title("Data Prediction")
plt.legend()
plt.show()
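Since the forecasts line up one-to-one with the rows of the test set, the simplest approach is to wrap them in a pandas Series indexed by the test dates and then select by position or by condition. Below is a minimal sketch of such a filter function; it assumes test has a DatetimeIndex, and the names filter_forecast, day, and below_pct are illustrative placeholders:

import pandas as pd

def filter_forecast(Y, day=None, below_pct=None):
    '''Select parts of a forecast Series Y indexed by timestamp.'''
    if day is not None:
        # collapse to one value per day, then pick the day-th entry (1-based)
        daily = Y.resample('D').mean()
        return daily.iloc[day - 1]
    if below_pct is not None:
        # boolean mask keeps only forecasts under the threshold
        return Y[Y < below_pct]
    return Y

# usage: wrap the model output in a Series carrying the test index
Y = pd.Series(ypred, index=test.index)
print(filter_forecast(Y, day=36))        # forecast on day 36
print(filter_forecast(Y, below_pct=20))  # all forecasts below 20%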

Related

How to plot dates on the X-axis for actual vs predicted values

I have a time-series dataset which looks like below:
Time series Dataset
I am using linear regression, for which I have to plot the original and predicted values. This is the plot I got.
Below is my code.
from sklearn.model_selection import TimeSeriesSplit
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

tss = TimeSeriesSplit(n_splits=5)
df = df.sort_index()
fold = 0
preds = []
scores = []
for train_idx, val_idx in tss.split(df):
    train = df.iloc[train_idx].dropna()
    test = df.iloc[val_idx].dropna()
    FEATURES = ['7day_rolling_avg', 'Lag_1']
    TARGET = 'Liquid Lvl % C'
    X_train = train[FEATURES]
    y_train = train[TARGET]
    X_test = test[FEATURES]
    y_test = test[TARGET]

###############################################################################
model = LinearRegression()
model.fit(X_train, y_train)
score = model.score(X_train, y_train)
print("Training score: ", score)
scores = cross_val_score(model, X_train, y_train, cv=3)
print("Mean cross-validation score: %.2f" % scores.mean())
ypred = model.predict(X_test)
mse = mean_squared_error(y_test, ypred)
print("MSE: %.2f" % mse)
print("RMSE: %.2f" % (mse**(1/2.0)))
#x_ax = range(len(y_test))
#plt.plot(x_ax, y_test, label="original")
#plt.plot(x_ax, ypred, color='red', label="Linear/predicted")
#plt.title("Data Prediction")
#plt.legend()
#plt.show()
I want the date to be plotted on my X-axis. There was a similar question on Stack Overflow, How to plot predicted values with actual values when we have multi-index, but it didn't work for me. This is what I tried:
fig=plt.figure()
c=df.to_list()
plt.plot(c,y_test)
plt.plot(c,y_pred)
plt.show()
ax.plot(df)
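One way to get dates on the X-axis is to use the test set's index directly, since it carries the timestamps. A minimal sketch, assuming test has a DatetimeIndex and ypred comes from the loop above:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(test.index, y_test, label="original")
ax.plot(test.index, ypred, color='red', label="Linear/predicted")
ax.set_xlabel("Date")
fig.autofmt_xdate()  # rotate the date labels so they don't overlap
ax.legend()
plt.show()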

how to use "scikit-learn calibration" after fine-tuning lightgbm

I fine-tuned LGBM and want to apply calibration, but I'm having trouble applying it.
I have 1) train, 2) valid, and 3) test data.
I trained and fine-tuned LGBM using 1) the train data and 2) the valid data.
That gave me the best parameters of LGBM.
After that, I want to calibrate so that my model's output can be directly interpreted as a confidence level. But I'm confused about using scikit-learn's CalibratedClassifierCV.
In my situation, should I use cv='prefit' or cv=5? Also, should I use the train data or the valid data to fit CalibratedClassifierCV?
1) uncalibrated_clf but after training
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
2-1) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
2-2) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
2-3) Calibrated_clf
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
Which one is right? Are all of them right, or only one or two?
Below is the code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.calibration import calibration_curve
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb
import matplotlib.pyplot as plt

np.random.seed(0)
n_samples = 10000
X, y = make_classification(
    n_samples=3*n_samples, n_features=20, n_informative=2,
    n_classes=2, n_redundant=2, random_state=32)
X_train, y_train = X[:n_samples], y[:n_samples]
X_valid, y_valid = X[n_samples:2*n_samples], y[n_samples:2*n_samples]
X_test, y_test = X[2*n_samples:], y[2*n_samples:]

plt.figure(figsize=(12, 9))
plt.plot([0, 1], [0, 1], '--', color='gray')

# 1) Uncalibrated clf, fine-tuned on training data
clf = lgb.LGBMClassifier()
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=True, early_stopping_rounds=20)
y_prob = clf.predict_proba(X_test)[:, 1]
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_prob, n_bins=10)
plt.plot(fraction_of_positives, mean_predicted_value, 'o-', label='uncalibrated_clf')

# 2-1) Calibrated clf: prefit, calibrated on the validation set
cal_clf = CalibratedClassifierCV(clf, cv='prefit', method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob1 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives1, mean_predicted_value1 = calibration_curve(y_test, y_prob1, n_bins=10)
plt.plot(fraction_of_positives1, mean_predicted_value1, 'o-', label='calibrated_clf1')

# 2-2) Calibrated clf: 5-fold CV on the training set
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_train, y_train)
y_prob2 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives2, mean_predicted_value2 = calibration_curve(y_test, y_prob2, n_bins=10)
plt.plot(fraction_of_positives2, mean_predicted_value2, 'o-', label='calibrated_clf2')

# 2-3) Calibrated clf: 5-fold CV on the validation set
cal_clf = CalibratedClassifierCV(clf, cv=5, method='isotonic')
cal_clf.fit(X_valid, y_valid)
y_prob3 = cal_clf.predict_proba(X_test)[:, 1]
fraction_of_positives3, mean_predicted_value3 = calibration_curve(y_test, y_prob3, n_bins=10)
plt.plot(fraction_of_positives3, mean_predicted_value3, 'o-', label='calibrated_clf3')
plt.legend()
The way to go about this is:
a) fit the model and calibrate on the hold out set
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, cv='prefit').fit(X_val, y_val)
y_pred = calibrated.predict(X_test)
(this is actually the meaning of prefit here: the model is already fitted; now take a new relevant set and calibrate the output).
b) fit the model and calibrate with cross validation on the training set
model.fit(X_train, y_train)
calibrated = CalibratedClassifierCV(model, cv=5).fit(X_train, y_train)
y_pred_val = calibrated.predict(X_val)
As is usually the case, the number of cross-validation folds and the method (isotonic regression vs. Platt scaling, or 'sigmoid' in scikit-learn's jargon) depend critically on your data and your setup. Therefore, I'd suggest putting those in a grid search and seeing what produces the best results.
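A minimal sketch of such a grid search; the parameter values and the Brier-score scoring are illustrative assumptions, not prescriptions:

from sklearn.model_selection import GridSearchCV
from sklearn.calibration import CalibratedClassifierCV
import lightgbm as lgb

# search over calibration method and number of folds; the Brier score
# directly measures the quality of predicted probabilities
param_grid = {'method': ['sigmoid', 'isotonic'], 'cv': [3, 5, 10]}
search = GridSearchCV(CalibratedClassifierCV(lgb.LGBMClassifier()),
                      param_grid, scoring='neg_brier_score')
search.fit(X_train, y_train)
print(search.best_params_)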
Finally, a deeper dive can be found here:
https://machinelearningmastery.com/calibrated-classification-model-in-scikit-learn/

Plotting history of accuracy in BaggingClassifier

I've trained a simple random forest algorithm and a bagging classifier (n_estimators = 100). Is it possible to plot the history of accuracy in BaggingClassifier? How do I calculate the variance over the 100 bootstrap samples?
I've just printed the accuracy value for both algorithms:
# DecisionTree
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90)
clf2 = tree.DecisionTreeClassifier()
clf2.fit(X_train, y_train)
pred2 = clf2.predict(X_test)
acc2 = clf2.score(X_test, y_test)
acc2  # 0.6983930778739185

# Bagging
clf3 = BaggingClassifier(tree.DecisionTreeClassifier(), max_samples=0.5,
                         max_features=0.5, n_estimators=100, verbose=2)
clf3.fit(X_train, y_train)
pred3 = clf3.predict(X_test)
acc3 = clf3.score(X_test, y_test)
acc3  # 0.911619283065513
I don't think that you can get this information from the fitted BaggingClassifier. But you can create such a plot by fitting for different n_estimators:
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, X_test, y, y_test = train_test_split(iris.data,
                                        iris.target,
                                        test_size=0.20)
estimators = list(range(1, 20))
accuracy = []
for n_estimators in estimators:
    clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                            max_samples=0.2,
                            n_estimators=n_estimators)
    clf.fit(X, y)
    acc = clf.score(X_test, y_test)
    accuracy.append(acc)
plt.plot(estimators, accuracy)
plt.xlabel("Number of estimators")
plt.ylabel("Accuracy")
plt.show()
(Of course, the iris dataset is easily fit with just a single DecisionTreeClassifier, so I set max_depth=1 in this example.)
For a statistically meaningful result, you should fit a BaggingClassifier multiple times for each n_estimators and take the average of the obtained accuracies.
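A minimal sketch of that averaging, extending the example above (n_repeats = 10 is an arbitrary choice):

import numpy as np

n_repeats = 10
mean_accuracy = []
for n_estimators in estimators:
    runs = []
    for _ in range(n_repeats):
        clf = BaggingClassifier(DecisionTreeClassifier(max_depth=1),
                                max_samples=0.2,
                                n_estimators=n_estimators)
        clf.fit(X, y)
        runs.append(clf.score(X_test, y_test))
    # averaging over repeats smooths out the randomness of bagging
    mean_accuracy.append(np.mean(runs))
plt.plot(estimators, mean_accuracy)
plt.xlabel("Number of estimators")
plt.ylabel("Mean accuracy over %d runs" % n_repeats)
plt.show()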

roc_auc in VotingClassifier, RandomForestClassifier in scikit-learn (sklearn)

I am trying to calculate roc_auc for a hard VotingClassifier that I built. I present the code with a reproducible example. Now I want to calculate the roc_auc score and plot the ROC curve, but unfortunately I got the following error: predict_proba is not available when voting='hard'.
# Voting Ensemble for Classification
import pandas
from sklearn import datasets
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer,confusion_matrix, f1_score, precision_score, recall_score, cohen_kappa_score,accuracy_score,roc_curve
import numpy as np
np.random.seed(42)
iris = datasets.load_iris()
X = iris.data[:, :4] # we take all four features.
Y = iris.target
print(Y)
seed = 7
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
# create the sub models
estimators = []
model1 = LogisticRegression()
estimators.append(('logistic', model1))
model2 = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)
estimators.append(('RandomForest', model2))
model3 = MultinomialNB()
estimators.append(('NaiveBayes', model3))
model4=SVC(probability=True)
estimators.append(('svm', model4))
model5=DecisionTreeClassifier()
estimators.append(('Cart', model5))
# create the ensemble model
print('Majority Class Labels (Majority/Hard Voting)')
ensemble = VotingClassifier(estimators,voting='hard')
#accuracy
results = model_selection.cross_val_score(ensemble, X, Y, cv=kfold,scoring='accuracy')
y_pred = cross_val_predict(ensemble, X ,Y, cv=10)
print("Accuracy ensemble model : %0.2f (+/- %0.2f) " % (results.mean(), results.std() ))
print(results.mean())
#recall
recall_scorer = make_scorer(recall_score, pos_label=1)
recall = cross_val_score(ensemble, X, Y, cv=kfold, scoring=recall_scorer)
print('Recall', np.mean(recall), recall)
# Precision
precision_scorer = make_scorer(precision_score, pos_label=1)
precision = cross_val_score(ensemble, X, Y, cv=kfold, scoring=precision_scorer)
print('Precision', np.mean(precision), precision)
#f1_score
f1_scorer = make_scorer(f1_score, pos_label=1)
f1_score = cross_val_score(ensemble, X, Y, cv=kfold, scoring=f1_scorer)
print('f1_score ', np.mean(f1_score ),f1_score )
#roc_auc_score
roc_auc_score = cross_val_score(ensemble, X, Y, cv=kfold, scoring='roc_auc')
print('roc_auc_score ', np.mean(roc_auc_score ),roc_auc_score )
To calculate the roc_auc metric you first need to
Replace: ensemble = VotingClassifier(estimators,voting='hard')
with: ensemble = VotingClassifier(estimators,voting='soft').
Next, the last 2 lines of code will throw an error:
roc_auc_score = cross_val_score(ensemble, X, Y, cv=3, scoring='roc_auc')
print('roc_auc_score ', np.mean(roc_auc_score), roc_auc_score)
ValueError: multiclass format is not supported
This is normal, since in Y you have 3 classes (np.unique(Y) == array([0, 1, 2])).
You can't use roc_auc as a single summary metric for multiclass models. If you want, you could calculate a per-class roc_auc.
How to solve this:
1) Use only two classes to calculate the roc_auc_score, or
2) use label binarization in advance, before calling roc_auc_score (a sketch follows below).
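A minimal sketch of option 2, reusing the soft-voting ensemble from above; label_binarize and the per-class loop are illustrative, not part of the original answer:

from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# one indicator column per class
Y_bin = label_binarize(Y, classes=[0, 1, 2])

# out-of-fold probabilities from the soft-voting ensemble
ensemble = VotingClassifier(estimators, voting='soft')
y_proba = cross_val_predict(ensemble, X, Y, cv=10, method='predict_proba')

# one-vs-rest AUC per class
for cls in range(3):
    auc = roc_auc_score(Y_bin[:, cls], y_proba[:, cls])
    print('Class %d roc_auc: %.3f' % (cls, auc))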

scikit learn cross validation classification_report

I want metrics per class label and an aggregate confusion matrix from a cross-validation in scikit-learn.
I wrote a method that performs a cross-validation in scikit-learn, sums the confusion matrices, and also stores all the predicted labels. It then calls scikit-learn methods to print out the metrics.
The code below should run with any recent scikit-learn installation; you can test it with any dataset.
Is the code below the correct way to gather an aggregate confusion matrix and a classification_report when doing StratifiedKFold cross-validation?
from sklearn import metrics
from sklearn.cross_validation import StratifiedKFold
import numpy as np

def customCrossValidation(self, X, y, classifier, n_folds=10, shuffle=True, random_state=0):
    '''Perform a cross validation and print out the metrics'''
    skf = StratifiedKFold(y, n_folds=n_folds, shuffle=shuffle, random_state=random_state)
    cm = None
    y_predicted_overall = None
    y_test_overall = None
    for train_index, test_index in skf:
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        classifier.fit(X_train, y_train)
        y_predicted = classifier.predict(X_test)
        # collect the y_predicted per fold
        if y_predicted_overall is None:
            y_predicted_overall = y_predicted
            y_test_overall = y_test
        else:
            y_predicted_overall = np.concatenate([y_predicted_overall, y_predicted])
            y_test_overall = np.concatenate([y_test_overall, y_test])
        cv_cm = metrics.confusion_matrix(y_test, y_predicted)
        # sum the confusion matrices per fold
        if cm is None:
            cm = cv_cm
        else:
            cm += cv_cm
    print(metrics.classification_report(y_test_overall, y_predicted_overall, digits=3))
    print(cm)
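For reference, a minimal sketch of the same idea using the modern model_selection API, which collects the out-of-fold predictions in one call (an equivalent approach, not the original code):

from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.metrics import classification_report, confusion_matrix

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# one prediction per sample, each produced by a model that never saw it
y_pred = cross_val_predict(classifier, X, y, cv=skf)
print(classification_report(y, y_pred, digits=3))
print(confusion_matrix(y, y_pred))  # aggregate over all folds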
