Obtaining XGBoost regression predicted values for each data point - python

I'm new to using XGBoost and I'm confused about how we can obtain the XGBoost predicted values for each data point.
This is how I've approached the problem so far:
# Creating dataframe of predictor variables (dropping target variable and string columns)
X = players.drop(['Overall', 'Age', 'Market value', 'Player', 'Nationality', 'Contract End',
'Potential', 'Team', 'Position', 'Contract expires'], axis=1)
y = players['Overall']
# Splitting the data into training (80%) & test sets (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# XGBoost Regression model
model = xgb.XGBRegressor()
# Fitting the model
model.fit(X_train, y_train)
# Generating Test Predictions
y_pred_test = model.predict(X_test)
# Test RMSE
rmse_test = np.sqrt(MSE(y_test, y_pred_test))
print("RMSE: %f" % (rmse_test))
At this point I want to see the XGBoost model predictions for Overall for each Player, however I can't find any examples online with code for this. For example, the output would ideally look like:
Player, Overall, XGB Predicted Overall
Mbappé, 91, 92.3
Neymar, 90, 91.7
Messi, 93, 90.1
...
How should I go about obtaining these predicted values?
Here's a sample of my dataset:
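One possible way to get the per-player table described above is to rely on the fact that train_test_split keeps the original row labels, so X_test.index can be used to look the players up again. This is only a sketch: the small `players` frame and its `Pace`/`Passing` columns are made up for illustration, and scikit-learn's GradientBoostingRegressor stands in for xgb.XGBRegressor so the example runs without XGBoost installed. The lookup step would be the same for any fitted model that has a predict method.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the `players` DataFrame from the question.
players = pd.DataFrame({
    "Player": [f"P{i}" for i in range(20)],
    "Pace": [60 + i for i in range(20)],
    "Passing": [55 + (i * 3) % 17 for i in range(20)],
    "Overall": [65 + i for i in range(20)],
})

X = players.drop(["Player", "Overall"], axis=1)
y = players["Overall"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in model (assumption): swap in xgb.XGBRegressor() if XGBoost is available.
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# X_test keeps the row labels of `players`, so .loc recovers the matching rows.
results = players.loc[X_test.index, ["Player", "Overall"]].copy()
results["Predicted Overall"] = model.predict(X_test)
print(results)
```

To get a prediction for every player rather than only the test rows, `model.predict(X)` could be assigned to a new column on `players` directly, keeping in mind that predictions on the training rows will look better than they would on unseen data.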

Related

How to fit long time period data into Regression models in scikit-learn?

I'm working on a regression model with population and demand values. My data covers the period from 1980 to 2021 by country; in the example below, the columns named by year hold the population and the year_dem columns hold the demand for the item.
The task is to create a prediction model to forecast future demand for each country.
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# Load the dataset containing past data on vaccine demand and supply
data = df.iloc[0]
X = data.drop(['Country','ISO','1980_dem', '1981_dem', '1982_dem','1983_dem','1984_dem','1985_dem','1986_dem','1987_dem','1988_dem','1989_dem','1990_dem','1991_dem','1992_dem','1993_dem','1994_dem','1995_dem','1996_dem','1997_dem','1998_dem','1999_dem','2000_dem','2001_dem','2002_dem','2003_dem','2004_dem','2005_dem','2006_dem','2007_dem','2008_dem','2009_dem','2010_dem','2011_dem','2012_dem','2013_dem','2014_dem','2015_dem','2016_dem','2017_dem','2018_dem','2019_dem','2020_dem','2021_dem'])
y = data['1980_dem']
# Split the DataFrame into a training set and a testing set (this has to happen before fitting)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)
# max_features="auto" is no longer accepted by recent scikit-learn; the default behaves the same for regressors
model = RandomForestRegressor(n_estimators=50, random_state=44)
model.fit(X_train, y_train)
# Use the trained model to make predictions on the test set
#predictions = model.predict(X_test)
# Calculate the accuracy of the predictions
#accuracy = model.score(X_test, y_test)
#print('Accuracy:', round(accuracy,2),'%.')
I expect to end up with a fitted model, its accuracy printed, and the possibility to predict future values based on the model.
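One common way to make a wide table like this usable by scikit-learn is to reshape it to long format first, with one row per country and year. The sketch below assumes the column layout described above (population under `1980`, `1981`, ... and demand under `1980_dem`, `1981_dem`, ...); the two-country frame and its numbers are invented purely for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year and per year_dem.
df = pd.DataFrame({
    "Country": ["A", "B"],
    "1980": [100, 200], "1981": [110, 210],
    "1980_dem": [10, 20], "1981_dem": [11, 21],
})

pop_cols = [c for c in df.columns if c.isdigit()]
dem_cols = [c for c in df.columns if c.endswith("_dem")]

# Melt population and demand separately, then join them on (Country, year).
pop = df.melt(id_vars="Country", value_vars=pop_cols,
              var_name="year", value_name="population")
dem = df.melt(id_vars="Country", value_vars=dem_cols,
              var_name="year", value_name="demand")
dem["year"] = dem["year"].str.replace("_dem", "", regex=False)

long_df = pop.merge(dem, on=["Country", "year"])
long_df["year"] = long_df["year"].astype(int)
print(long_df)
```

With the data in this shape, `year` and `population` (and possibly an encoded `Country`) can serve as features and `demand` as the target. For forecasting, splitting by year rather than randomly is usually the safer choice, so that the test set only contains years later than the training years.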

XGBoost : How to get feature names of a encoded dataframe for feature importance plot?

I am using xgboost to make some predictions. We do some pre-processing, hyper-parameter tuning before fitting the model. While performing model diagnostics, we'd like to plot feature importances with feature names.
Here are the steps we've taken.
# split df into train and test
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:21], df.iloc[:,-1], test_size=0.2)
X_train.shape
(1671, 21)
#Encoding of categorical variables
cat_vars = ['cat1','cat2']
cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')
encoder = cat_transform.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)
X_train.shape
(1671, 420)
# Define xgb object
model = XGBRegressor()
# Tune hyper-parameters
r = RandomizedSearchCV(model, param_distributions=params, n_iter=200, cv=3, verbose=1, n_jobs=1)
# Fit model
r.fit(X_train, y_train)
xgb = r.best_estimator_
xgb
# Plot feature importance
plt.barh(X_train.feature_names?, xgbest.feature_importances)
X_train has encoded variable names only. And we cannot use column names with orig dataframe because of shape mismatch (21 vs 420).

SciKit Gradient Boosting - How to combine predictions with initial table?

I'm trying to use a gradient-boosting model to predict future scores in fantasy football, for now looking only at the 2 previous rounds. Currently, if a player is expected to score more than 6 points, the model returns '1', otherwise '0', indicating whether the player would be a good captain choice or not.
In my original table I have the player name and round information to give context, but I removed these when training the algorithm. My question is: once the model makes a prediction, how can I show this prediction together with the player name? For example:
PlayerA - captain prediction = 1
etc.
y = ds.isCaptain
GB_table = ds.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(GB_table, y, test_size=0.2)
baseline = GradientBoostingClassifier(learning_rate=0.01,n_estimators=1500,max_depth=4, min_samples_split=40, min_samples_leaf=7,max_features=4 , subsample=0.95, random_state=10)
baseline.fit(X_train,y_train)
predictors=list(X_train)
feat_imp = pd.Series(baseline.feature_importances_, predictors).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Importance of Features')
plt.ylabel('Feature Importance Score')
print('Accuracy of GBM on test set: {:.3f}'.format(baseline.score(X_test, y_test)))
pred=baseline.predict(X_test)
print(classification_report(y_test, pred))
The above shows me the predicted results, but unfortunately since I removed the player name and round information from the GB_table, I can no longer understand from who/which round the prediction is made.
I'm assuming you are using pandas DataFrames, in which case it's quite straightforward.
The index numbers in your X_train and X_test DataFrames will correspond to the index in your original 'ds' DataFrame.
Try:
pred = baseline.predict(X_test)
pred_original_data = ds.loc[X_test.index].copy()  # .loc selects by index label; .copy() avoids a SettingWithCopyWarning
pred_original_data['prediction'] = pred
You could drop the player column and other fields after train_test_split.
Here is my suggestion
y = ds.isCaptain
X_train, X_test, y_train, y_test = train_test_split(ds, y, test_size=0.2)
baseline = GradientBoostingClassifier(learning_rate=0.01, n_estimators=1500,max_depth=4, min_samples_split=40, min_samples_leaf=7,max_features=4 , subsample=0.95, random_state=10)
baseline.fit(X_train.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1),y_train)
X_test_input = X_test.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1)
score = baseline.score(X_test_input, y_test)
print('Accuracy of GBM on test set: {:.3f}'.format(score))
X_test['prediction'] = baseline.predict(X_test_input)
print(classification_report(y_test, X_test['prediction']))

How to get SHAP values of the model averaged by folds?

This is how I get SHAP values from a model trained on a single fold:
clf.fit(X_train, y_train,
eval_set=[(X_train, y_train), (X_test, y_test)],
eval_metric='auc', verbose=100, early_stopping_rounds=200)
import shap # package used to calculate Shap values
# Create object that can calculate shap values
explainer = shap.TreeExplainer(clf)
# Calculate Shap values
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
As you know, results from different folds might differ. How can these shap_values be averaged?
There is this rule to keep in mind:
It is fine to average the SHAP values from models with the same output
trained on the same input features, just make sure to also average the
expected_value from each explainer. However, if you have
non-overlapping test sets then you can't average the SHAP values from
the test sets since they are for different samples. You could just
explain the SHAP values for the whole dataset using each of your
models and then average that into a single matrix. (It's fine to
explain examples in your training set, just remember you may be
overfit to them)
So we need a holdout dataset here to follow that rule. I did something like this to get everything to work as expected:
import numpy as np
import lightgbm as lgb
import shap
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

shap_values = None
# Hold out a test set that every fold's model will explain
(X_train, X_test, y_train, y_test) = train_test_split(df[feat], df['target'].values,
                                                      test_size=0.2, shuffle=True,
                                                      stratify=df['target'].values,
                                                      random_state=42)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds_idx = [(train_idx, val_idx)
             for train_idx, val_idx in folds.split(X_train, y=y_train)]
auc_scores = []
oof_preds = np.zeros(X_train.shape[0])
test_preds = []
for n_fold, (train_idx, valid_idx) in enumerate(folds_idx):
    # The fold indices are positions within X_train, so index X_train / y_train rather than df
    train_x, train_y = X_train.iloc[train_idx], y_train[train_idx]
    valid_x, valid_y = X_train.iloc[valid_idx], y_train[valid_idx]
    clf = lgb.LGBMClassifier(nthread=4, boosting_type='gbdt', is_unbalance=True, random_state=42,
                             learning_rate=0.05, max_depth=3,
                             reg_lambda=0.1, reg_alpha=0.01, min_child_samples=21,
                             subsample_for_bin=5000,
                             metric='auc', n_estimators=5000)
    clf.fit(train_x, train_y,
            eval_set=[(train_x, train_y), (valid_x, valid_y)],
            eval_metric='auc', verbose=False, early_stopping_rounds=100)
    explainer = shap.TreeExplainer(clf)
    # Accumulate SHAP values for the same holdout rows across folds
    if shap_values is None:
        shap_values = explainer.shap_values(X_test)
    else:
        shap_values += explainer.shap_values(X_test)
    oof_preds[valid_idx] = clf.predict_proba(valid_x)[:, 1]
    auc_scores.append(roc_auc_score(valid_y, oof_preds[valid_idx]))
print('AUC: ', np.mean(auc_scores))
shap_values /= folds.n_splits  # average over the number of folds
shap.summary_plot(shap_values, X_test)
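The quoted rule also asks for the expected_value of each explainer to be averaged, which the loop above does not do. A minimal sketch of that averaging step follows, using random placeholder arrays in place of the real per-fold outputs of explainer.shap_values(X_test) and explainer.expected_value (the shapes and values here are assumptions for illustration only). It also checks that SHAP's additivity still holds after averaging.

```python
import numpy as np

n_folds, n_samples, n_features = 3, 5, 4
rng = np.random.default_rng(0)

# Placeholders (not real SHAP output): one matrix and one base value per fold,
# all computed on the same holdout rows.
per_fold_shap = [rng.normal(size=(n_samples, n_features)) for _ in range(n_folds)]
per_fold_expected = [float(rng.normal()) for _ in range(n_folds)]

# Average the SHAP matrices element-wise and the base values as scalars.
mean_shap = np.mean(np.stack(per_fold_shap), axis=0)
mean_expected = float(np.mean(per_fold_expected))

# Each fold's raw model output is base value + sum of SHAP values per row.
per_fold_output = [e + s.sum(axis=1) for s, e in zip(per_fold_shap, per_fold_expected)]

# Averaging is linear, so the averaged pieces reproduce the averaged model output.
assert np.allclose(np.mean(per_fold_output, axis=0),
                   mean_expected + mean_shap.sum(axis=1))
print(mean_shap.shape, mean_expected)
```

Collecting the per-fold arrays in a list and averaging once at the end, instead of accumulating with `+=`, also makes it easy to inspect how much the folds disagree (for example with `np.std` over the stacked array).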

Logistic Regression: Train using past data and predict using current data?

I've trained and tested my logistic regression using available data but now need to output a future prediction. I want to include the 2017 values that I used in my training and test set to predict the 2018 probability.
This is the code I used to train and test my model:
# .ix has been removed from pandas; .loc with a list of column labels does the same selection
Xadj = train.loc[:, ['2016 transaction count', 'critical_CI', 'critical_CN', 'critical_CS',
'critical_FI', 'critical_IN', 'critical_OI', 'critical_RA', 'create_year_2012', 'create_year_2013',
'create_year_2014', 'create_year_2015', 'create_year_2016']]
# Coded is the transformation of 2017 transaction count to a binary variable
y = train['2017 transaction count coded']
logit_model=sm.Logit(y,Xadj)
result=logit_model.fit()
print(result.summary())
X_train, X_test, y_train, y_test = train_test_split(Xadj, y, test_size=0.3, random_state=42)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
#Cross Validation
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)  # random_state only takes effect with shuffle=True
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("10-fold cross validation average accuracy: %.3f" % (results.mean()))
In an attempt to export predictions for 2018, I have done the following:
#Create 2018 Purchase Probability
train['2018 Purchase Probability']=pd.DataFrame({'2018 Purchase Probability' : []})
yact = train['2018 Purchase Probability']
#Adding in 2017 values
X = train.loc[:, ['2017 transaction count', 'critical_CI', 'critical_CN', 'critical_CS',
'critical_FI', 'critical_IN', 'critical_OI', 'critical_RA', 'create_year_2012', 'create_year_2013',
'create_year_2014', 'create_year_2015', 'create_year_2016', 'create_year_2017']]
from sklearn.preprocessing import scale, StandardScaler
scaler = StandardScaler()
scaler.fit(Xadj)
X = scaler.transform(Xadj)
X_pred = scaler.transform(X)
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
logreg = LogisticRegression()
logreg.fit(Xadj, y)
#Generate 0/1 prediction
prediction = logreg.predict(X= X)
#Generate class probabilities (predict_proba returns probabilities, not odds ratios)
precent_prediction = logreg.predict_proba(X= X)
prediction = pd.DataFrame(prediction)
I'm not sure if I've done this correctly and judging from my output (which is mostly 1's) I don't think I have. I am new to coding in Python and am struggling to turn my tested model into a future prediction that can be used to make decisions.
Thanks in advance for any help!
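One way to structure this kind of "train on last year, score this year" setup is sketched below. It assumes the model should see the same feature layout at fit time and at prediction time, with the year-specific count renamed to a neutral name, and it puts the scaler and the classifier in one pipeline so that both steps are always applied together. The toy `train` frame, the `prev_year_count` name, and the rule used to generate the label are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200

# Hypothetical stand-in for the `train` table in the question.
count_2016 = rng.poisson(3, n)
count_2017 = rng.poisson(1 + count_2016)
train = pd.DataFrame({
    "2016 transaction count": count_2016,
    "2017 transaction count": count_2017,
    "critical_CI": rng.integers(0, 2, n),
})
train["2017 transaction count coded"] = (train["2017 transaction count"] > 2).astype(int)

# Fit: previous-year count (2016) -> next-year outcome (2017).
X_fit = train[["2016 transaction count", "critical_CI"]].rename(
    columns={"2016 transaction count": "prev_year_count"})
y_fit = train["2017 transaction count coded"]

# Scaling and the classifier live in one pipeline, so prediction reuses the fitted scaler.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_fit, y_fit)

# Predict: slide the window forward, so the 2017 count now plays the "previous year" role.
X_next = train[["2017 transaction count", "critical_CI"]].rename(
    columns={"2017 transaction count": "prev_year_count"})
train["2018 Purchase Probability"] = model.predict_proba(X_next)[:, 1]
print(train["2018 Purchase Probability"].describe())
```

The point of the pipeline is that the model is never fitted on unscaled data and then asked to predict on scaled data, or the other way round, which is one thing that can produce lopsided predictions.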
