SciKit Gradient Boosting - How to combine predictions with initial table? - python

I'm trying to use a gradient-boosting model to predict future scores in fantasy football, for now looking only at the two previous rounds. Currently, if a player is expected to score more than 6 points, the model returns '1', otherwise '0', indicating whether the player would be a good captain choice or not.
My original table has player name and round information for context, but I removed these when training the algorithm. My question is: once the model makes a prediction, how can I show that prediction together with the player name, for example:
PlayerA - captain prediction = 1
etc.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report

y = ds.isCaptain
GB_table = ds.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(GB_table, y, test_size=0.2)
baseline = GradientBoostingClassifier(learning_rate=0.01, n_estimators=1500, max_depth=4,
                                      min_samples_split=40, min_samples_leaf=7,
                                      max_features=4, subsample=0.95, random_state=10)
baseline.fit(X_train, y_train)
predictors = list(X_train)
feat_imp = pd.Series(baseline.feature_importances_, predictors).sort_values(ascending=False)
feat_imp.plot(kind='bar', title='Importance of Features')
plt.ylabel('Feature Importance Score')
print('Accuracy of GBM on test set: {:.3f}'.format(baseline.score(X_test, y_test)))
pred = baseline.predict(X_test)
print(classification_report(y_test, pred))
The above shows me the predicted results, but unfortunately, since I removed the player name and round information from GB_table, I can no longer tell which player/round each prediction belongs to.

I'm assuming you are using pandas DataFrames, in which case it's quite straightforward.
The index numbers in your X_train and X_test DataFrames will correspond to the index in your original 'ds' DataFrame.
Try:
pred = baseline.predict(X_test)
pred_original_data = ds.iloc[X_test.index].copy()  # .copy() avoids SettingWithCopyWarning
pred_original_data['prediction'] = pred

You could drop the player column and the other fields after train_test_split. Here is my suggestion:
y = ds.isCaptain
X_train, X_test, y_train, y_test = train_test_split(ds, y, test_size=0.2)
baseline = GradientBoostingClassifier(learning_rate=0.01, n_estimators=1500, max_depth=4,
                                      min_samples_split=40, min_samples_leaf=7,
                                      max_features=4, subsample=0.95, random_state=10)
baseline.fit(X_train.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1), y_train)
X_test_input = X_test.drop(['Player', 'Round', 'isCaptain', 'Points'], axis=1)
score = baseline.score(X_test_input, y_test)
print('Accuracy of GBM on test set: {:.3f}'.format(score))
X_test['prediction'] = baseline.predict(X_test_input)
print(classification_report(y_test, X_test['prediction']))

Related

I am getting 100% accuracy with my decision tree model. Where did I go wrong?

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# split dataset into features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols]  # Features
scale = StandardScaler()
X = scale.fit_transform(X)  # note: fitting the scaler before the split leaks test-set statistics into training
y = data['depre_score']  # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)  # 70% training and 30% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification... e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as noted by #gtomer: an exact data point present in your training set may also be present in your test set. You can run K-fold cross-validation on your data and see how the accuracy holds up, as sketched below. Secondly, try different classifiers too (Random Forests are generally a better choice than single Decision Trees).
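A minimal sketch of that K-fold check, assuming the X and y defined in the question:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
# Accuracy per fold; wildly uneven fold scores can hint at leakage or duplicate rows
cv_scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=KFold(n_splits=5, shuffle=True, random_state=1))
print('Decision tree fold accuracies:', cv_scores)
# Compare against a different model family as a sanity check
rf_scores = cross_val_score(RandomForestClassifier(random_state=1), X, y, cv=5)
print('Random forest fold accuracies:', rf_scores)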

Obtaining XGBoost regression predicted values for each data point

I'm new to using XGBoost and I'm confused about how we can obtain the XGBoost predicted values for each data point.
This is how I've approached the problem so far:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

# Creating dataframe of predictor variables (dropping target variable and string columns)
X = players.drop(['Overall', 'Age', 'Market value', 'Player', 'Nationality', 'Contract End',
                  'Potential', 'Team', 'Position', 'Contract expires'], axis=1)
y = players['Overall']
# Splitting the data into training (80%) and test (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# XGBoost regression model
model = xgb.XGBRegressor()
# Fitting the model
model.fit(X_train, y_train)
# Generating test predictions
y_pred_test = model.predict(X_test)
# Test RMSE
rmse_test = np.sqrt(MSE(y_test, y_pred_test))
print("RMSE: %f" % rmse_test)
At this point I want to see the XGBoost model's predicted Overall for each Player, but I can't find any code examples for this online. For example, the output would ideally look like:
Player, Overall, XGB Predicted Overall
Mbappé, 91, 92.3
Neymar, 90, 91.7
Messi, 93, 90.1
...
How should I go about obtaining these predicted values?
Here's a sample of my dataset:
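One way to build that table (a sketch, assuming X_test keeps the row index of the original players DataFrame, as it does after train_test_split on a DataFrame): use the shared index to pull the player names back alongside the predictions.
# Align predictions with player names through the preserved DataFrame index
results = players.loc[X_test.index, ['Player', 'Overall']].copy()
results['XGB Predicted Overall'] = model.predict(X_test)
print(results.head())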

Show sorted results from loop

I'm testing different models (classifiers): I've created a list of (name, model) pairs and loop through it to print the accuracy and cross-validation score for each one. It works fine.
What I'd like to do is show them ordered by descending accuracy_score (metrics.accuracy_score(y_test, y_pred) in the code below). How do I do that easily?
Thanks a lot to anyone who'll be willing to help!
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn import metrics

#create an array of models
models = []
models.append(("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)))
#models.append(("Logistic Regression", LogisticRegression()))
models.append(("Naive Bayes", GaussianNB()))
models.append(("SVM", SVC()))
models.append(("Dtree", DecisionTreeClassifier()))
models.append(("KNN", KNeighborsClassifier()))
models.append(("Gradient Boosting", GradientBoostingClassifier()))
#measure the accuracy and show results per model
for name, model in models:
    # fit the model with x and y data
    model.fit(X_train, y_train)
    #Prediction of test set
    y_pred = model.predict(X_test)
    kfold = KFold(n_splits=4)  #, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', metrics.accuracy_score(y_test, y_pred), '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')
Append your scores to a new list, and then sort that list using the .sort() method, like so:
#create an array of models
models = []
models.append(("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)))
#models.append(("Logistic Regression", LogisticRegression()))
models.append(("Naive Bayes", GaussianNB()))
models.append(("SVM", SVC()))
models.append(("Dtree", DecisionTreeClassifier()))
models.append(("KNN", KNeighborsClassifier()))
models.append(("Gradient Boosting", GradientBoostingClassifier()))
results = []  # New list to store (name, accuracy) tuples
#measure the accuracy and show results per model
for name, model in models:
    # fit the model with x and y data
    model.fit(X_train, y_train)
    #Prediction of test set
    y_pred = model.predict(X_test)
    kfold = KFold(n_splits=4)  #, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    results.append((name, metrics.accuracy_score(y_test, y_pred)))
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', metrics.accuracy_score(y_test, y_pred), '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')
results.sort(key=lambda tup: tup[1], reverse=True)  # sort in-place, descending accuracy
print(results)  # print results
Rather than doing everything in the same big chunk of code inside the loop, I suggest identifying the different types of operations you're doing and separating them into their own functions:
Run the model, fit, predict, compute score;
Sort the list;
Print the model;
import operator # itemgetter
#create an array of models
models = []
models.append(("Random Forest",RandomForestClassifier(n_estimators = 100, random_state = 0)))
#models.append(("Logistic Regression",LogisticRegression()))
models.append(("Naive Bayes",GaussianNB()))
models.append(("SVM",SVC()))
models.append(("Dtree",DecisionTreeClassifier()))
models.append(("KNN",KNeighborsClassifier()))
models.append(("Gradient Boosting",GradientBoostingClassifier()))
def run_model(m):
    name, model = m
    # fit the model with x and y data
    model.fit(X_train, y_train)
    #Prediction of test set
    y_pred = model.predict(X_test)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    kfold = KFold(n_splits=4)  #, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    return (name, accuracy, cv_result)
def print_model(name, accuracy, cv_result):
    print('\033[1m', name, '\033[0m')
    print('accuracy score is: \033[1m', accuracy, '\033[0m')
    print('cross validation score is: ', cv_result, '\n------------------------------------------------------------------------------------')
# sort by accuracy, descending
results = sorted(map(run_model, models), key=operator.itemgetter(1), reverse=True)
for name, accuracy, cv_result in results:
    print_model(name, accuracy, cv_result)
Disclaimer: Contrary to all best practices, I did not test this code before posting it, because the OP didn't provide example values for X_train, y_train, X_test, y_test, nor the relevant import to make their code work.

How to tell a SciKit LinearRegression model that a predicted value cannot be less than Zero?

I have the following code that attempts to value stocks based on non-price features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

price = df.loc[:, 'regularMarketPrice']
features = df.loc[:, feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size=0.15, random_state=1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1, 1)
    X_test = np.array(X_test).reshape(-1, 1)
#
model = LinearRegression()
model.fit(X_train, y_train)
#
print('Train Score:', model.score(X_train, y_train))
print('Test Score:', model.score(X_test, y_test))
#
y_predicted = model.predict(X_test)
In my df (which is very large), there is never an instance where 'regularMarketPrice' is less than 0. However, I occasionally receive a value less than 0 for some points in y_predicted.
Is there a way in Scikit to say anything less than 0 is an invalid prediction? I am hoping this makes my model more accurate.
Please comment if there is a need for further explanation.
To keep predictions from going below 0, you should not use plain linear regression. Consider a generalized linear model (GLM) instead, such as Poisson regression.
from sklearn.linear_model import PoissonRegressor
price = df.loc[:, 'regularMarketPrice']
features = df.loc[:, feature_list]
#
X_train, X_test, y_train, y_test = train_test_split(features, price, test_size=0.15, random_state=1)
if len(X_train.shape) < 2:
    X_train = np.array(X_train).reshape(-1, 1)
    X_test = np.array(X_test).reshape(-1, 1)
#
model = PoissonRegressor()
model.fit(X_train, y_train)
#
print('Train Score:', model.score(X_train, y_train))  # note: for PoissonRegressor, score is D² (deviance explained), not R²
print('Test Score:', model.score(X_test, y_test))
#
y_predicted = model.predict(X_test)
All predictions will be greater than or equal to 0.
Consider using something other than a Gaussian response. Plot your y-values with a histogram; if the data are right-skewed, consider modeling with a GLM using a gamma distribution and a log link.
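A sketch of that suggestion with scikit-learn's GammaRegressor (which uses a log link), assuming the same X_train/y_train as above and strictly positive prices:
from sklearn.linear_model import GammaRegressor
gamma_model = GammaRegressor()  # gamma deviance with a log link
gamma_model.fit(X_train, y_train)  # requires y > 0
y_predicted = gamma_model.predict(X_test)  # exp(...) keeps predictions strictly positive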
Alternatively, you could clip the predictions, setting each predicted value to the maximum of itself and 0.
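For the clipping approach, a one-liner using the LinearRegression model from the question:
y_predicted = np.clip(model.predict(X_test), 0, None)  # floor negative predictions at zero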

Python DummyRegressor with min of group

Trying to use sklearn.dummy's DummyRegressor to create a baseline for my model, which is a regression model with encoded categorical variables predicting a continuous target. The baseline strategy I want is 'min', taken by group. Below is a reproducible example. My actual dataset is larger; it is a collection of runners ('a' ids) racing on courses ('c' ids), with the time recorded for each performance as the target 'T'. I'm trying to see if the model performs better than each runner's best/fastest recorded time (min).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor
from sklearn import metrics

df = pd.DataFrame([['a1','c1',10],
                   ['a1','c2',15],
                   ['a1','c3',20],
                   ['a1','c1',15],
                   ['a2','c2',26],
                   ['a4','c3',15],
                   ['a2','c1',23],
                   ['a2','c2',15],
                   ['a3','c3',20],
                   ['a3','c3',13],
                   ['a1','c3',19],
                   ['a4','c3',19],
                   ['a3','c3',12],
                   ['a3','c3',20]], columns=['aid','cid','T'])
X = pd.get_dummies(df, columns=['aid','cid'], prefix_sep='', prefix='')
X.drop(['T'], axis=1, inplace=True)
y = df['T']
# train test split 80-20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regr = LinearRegression()
Lin_model = regr.fit(X_train, y_train)
y_pred = Lin_model.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred))
For comparison I'd like to use DummyRegressor. Using 'mean' as the strategy, it works; as I understand it, this uses the mean of the entire target column.
dummy_mean = DummyRegressor(strategy='mean')
dummy_mean.fit(X_train, y_train)
y_pred2 = dummy_mean.predict(X_test)
print('R-squared:', metrics.r2_score(y_test, y_pred2))
print('MAE:', metrics.mean_absolute_error(y_test, y_pred2))
To compare against the lowest T (the fastest/best time), I tried the 'constant' strategy and defined the constant as the min value by group:
min_value = df.groupby('aid').agg({'T': ['min']})
dummy_min = DummyRegressor(strategy='constant',constant = min_value)
dummy_min.fit(X_train, y_train)
y_pred4 = dummy_min.predict(X_test)
which returns
ValueError: could not broadcast input array from shape (1,3) into shape (3,1)
What am I missing?
When you use min_value = df.groupby('aid').agg({'T': ['min']}), the resulting DataFrame has shape (3, 1). Try changing it to min_value = df.groupby('aid').agg({'T': ['min']}).values.reshape(1, -1). Hope it helps.
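Note, though, that DummyRegressor ignores the input features entirely, so even with the reshape it cannot return a different constant per runner. A per-group baseline can instead be computed directly with pandas (a sketch, assuming the train/test split above, where y_train and y_test keep the row index of df):
# Best (minimum) recorded time per runner, learned from the training rows only
best_time = df.loc[y_train.index].groupby('aid')['T'].min()
# Map each test row's runner to that runner's best time; fall back to the global min for unseen runners
y_pred_min = df.loc[y_test.index, 'aid'].map(best_time).fillna(y_train.min())
print('MAE:', metrics.mean_absolute_error(y_test, y_pred_min))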
