I'm building a model that predicts football match results. Right now I'm trying to predict the total number of goals in each match, so I'm treating this as a classification problem.
The Y column has 7 possible values (0-6). I scale the data with RobustScaler, then fit the model.
Could anyone give me some advice on fixing the script (if there are errors), improving accuracy and getting better predictions?
Below is part of the script and three rows of my dataset.
from sklearn import model_selection, preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Partitioning
dtf_train, dtf_test = model_selection.train_test_split(dtf, train_size=0.95, shuffle=True)
# Scaling (features only; the target stays as integer class labels)
scalerX = preprocessing.RobustScaler()
scalerY = preprocessing.RobustScaler()
X_names = dtf_train.drop("Y", axis=1).columns
dtf_train[X_names] = scalerX.fit_transform(dtf_train[X_names])
dtf_train["Y"] = dtf_train["Y"].astype(int)
dtf_test[X_names] = scalerX.transform(dtf_test[X_names])
X_train = dtf_train.drop("Y", axis=1).values
y_train = dtf_train["Y"].values
X_test = dtf_test.drop("Y", axis=1).values
y_test = dtf_test["Y"].values
# Grid search over an RBF-kernel SVC
parameters = {
    'C': [100, 1000],
    'gamma': [1, 0.1, 0.001],
    'kernel': ['rbf']
}
grid = GridSearchCV(SVC(), parameters, refit=True, verbose=3)
grid.fit(X_train, y_train)
predicted = grid.predict(X_test)
I tried GradientBoostingRegressor, LinearRegression and SVR while treating it as a regression problem, then switched to classification with SVC, but little has changed. I hope I'll be able to improve my model and reach my goal of predicting match results.
Related
I'm currently working on an XGBoost regression model to predict ticket bookings.
My issue is that the model has good accuracy on the training set (around 96%) and on the testing set (around 94%), but when I use it to predict bookings on another held-out dataset, the accuracy drops to 82%.
I tried moving some data from my testing set to this held-out set and the accuracy is still pretty bad, even though the model predicts those same rows well when they are part of my testing set.
I assume I'm doing something wrong but I can't figure out what.
Any help would be appreciated, thanks
Here's the XGBoost model part of my code:
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_conso, y_conso = data_conso2.iloc[:, :-1], data_conso2.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X_conso, y_conso, test_size=0.3, random_state=20)

d_train = xgb.DMatrix(X_train, label=y_train)
d_test = xgb.DMatrix(X_test, label=y_test)
d_fcst_held_out = xgb.DMatrix(X_fcst_held_out)

params = {'p_colsample_bytree_conso': 0.9,
          'p_colsample_bylevel_conso': 0.9,
          'p_colsample_bynode_conso': 0.9,
          'p_learning_rate_conso': 0.3,
          'p_max_depth_conso': 10,
          'p_alpha_conso': 3,
          'p_n_estimators_conso': 10,
          'p_gamma_conso': 0.8}

steps = 100
watchlist = [(d_train, 'train'), (d_test, 'test')]
model = xgb.train(params, d_train, steps, watchlist, early_stopping_rounds=50)

preds_train = model.predict(d_train)
preds_test = model.predict(d_test)
preds_fcst = model.predict(d_fcst_held_out)
And my error levels:
Error train: 4.524787%
Error test: 5.978759%
Error fcst: 18.008451%
This is generally normal: accuracy on truly unseen data is usually lower.
To improve it you can optimize your hyperparameters, for example with Optuna; a rough sketch is below.
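A minimal sketch of what that could look like, assuming the d_train/d_test DMatrix objects from the question and purely illustrative parameter ranges:
import optuna
import xgboost as xgb

def objective(trial):
    # Hypothetical search space; adjust the ranges to your data
    params = {
        'objective': 'reg:squarederror',
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'alpha': trial.suggest_float('alpha', 0.0, 5.0),
        'gamma': trial.suggest_float('gamma', 0.0, 1.0),
    }
    booster = xgb.train(params, d_train, num_boost_round=100,
                        evals=[(d_test, 'test')], early_stopping_rounds=50,
                        verbose_eval=False)
    # Minimize the evaluation metric on the test set
    return booster.best_score

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)
print(study.best_params)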
I am using ElasticNet to fit my data. To determine the hyperparameters (l1_ratio, alpha), I am using ElasticNetCV. With the obtained hyperparameters, I refit the model to the whole dataset for production use. I am unsure whether this is correct, both in the machine learning sense and, if so, in how I do it. The code "works" and presumably does what it should, but I wanted to be certain that it is also correct.
My procedure is:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet, ElasticNetCV

X_tr, X_te, y_tr, y_te = train_test_split(X, y)
# Cross-validated search for alpha and l1_ratio on the training split
optimizer = ElasticNetCV(l1_ratio=[.1, .5, .7, .9, .99, 1], n_alphas=400, cv=5, normalize=True)
optimizer.fit(X_tr, y_tr)
# Refit on the whole dataset with the selected hyperparameters
best = ElasticNet(alpha=optimizer.alpha_, l1_ratio=optimizer.l1_ratio_, normalize=True)
best.fit(X, y)
Thank you in advance
I am a beginner at this, but I would love to share my approach to ElasticNet hyperparameter tuning. I would suggest using RandomizedSearchCV instead. Here is part of the code I am currently writing:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import mean_squared_error

def tune_elastic_net(X_train, X_test, Y_train, Y_test):
    #-----------------------------------------------
    # Input:
    #   X_train, X_test, Y_train, Y_test: datasets
    # Returns:
    #   R² and RMSE scores, plus the best parameters found
    #-----------------------------------------------
    # Standardize the data first
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    # Define the grid
    params = dict()
    # Values for alpha: 100 values between 10^-5 and 10^5
    params['alpha'] = np.logspace(-5, 5, 100, endpoint=True)
    # Values for l1_ratio: 100 values between 0 and 1
    params['l1_ratio'] = np.arange(0, 1, 0.01)
    # Warning: the full grid has 100 x 100 = 10,000 possible combinations
    # Create an instance of the Elastic Net regressor
    regressor = ElasticNet()
    # Run randomized search with cross-validation on the chosen regressor
    rs_cv = RandomizedSearchCV(regressor, params, n_iter=100, scoring=None, cv=5, verbose=0, refit=True)
    rs_cv.fit(X_train, Y_train.values.ravel())
    # Results
    Y_pred = rs_cv.predict(X_test)
    R2_score = rs_cv.score(X_test, Y_test)
    RMSE_score = np.sqrt(mean_squared_error(Y_test, Y_pred))
    return R2_score, RMSE_score, rs_cv.best_params_
The advantage is that with RandomizedSearchCV the number of iterations can be fixed in advance. The points to be tested are chosen at random, which in some cases makes it around 90% faster than GridSearchCV (which tests all possible combinations).
I am using this same approach for other regressors like RandomForest and GradientBoosting, whose parameter grids are far more complicated and demand much more computing power to run; a rough sketch is below.
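For example, a minimal sketch of the same idea applied to a RandomForestRegressor, with purely illustrative grid values and the same X_train/Y_train as above:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Illustrative grid; the ranges are assumptions, not recommendations
rf_params = {
    'n_estimators': np.arange(100, 1001, 100),
    'max_depth': [None, 5, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}

rf_search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                               rf_params, n_iter=50, cv=5, refit=True)
rf_search.fit(X_train, Y_train.values.ravel())
print(rf_search.best_params_)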
As I said at the beginning I am new to this field, so any constructive comment will be welcomed.
Johnny
I have a dataframe with 36540 rows. The objective is to predict the target y, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x, label=y)

xg_reg = xgb.XGBRegressor(learning_rate=0.1, objectif='reg:linear', max_depth=5,
                          n_estimators=1000)
xg_reg.fit(X_train, y_train)
preds = xg_reg.predict(X_test)

df = pd.DataFrame({'ACTUAL': y_test, 'PREDICTED': preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter, which doesn't exist in xgboost). However, you have to consider how xgboost works: it builds trees, and trees split on the values of the features. From the plot you show here, it looks like very few samples have a value above 0, so a random train/test split will likely produce a test set with virtually no samples above 0 (hence a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
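As mentioned above, parameter tuning is still worth doing. A minimal sketch of how that could look with GridSearchCV, reusing the X_train/y_train from the snippet above; the grid values are just illustrative assumptions:
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are assumptions, not tuned recommendations
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1, 0.3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    xgboost.XGBRegressor(objective="reg:squaredlogerror"),
    param_grid,
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.best_score_)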
I am trying to solve a regression problem on the Boston dataset with the help of a random forest regressor. I am using GridSearchCV to select the best hyperparameters.
Problem 1
Should I fit GridSearchCV on some X_train, y_train and then get the best parameters?
OR
Should I fit it on X, y to get the best parameters? (X, y = entire dataset)
Problem 2
Say I fit it on X, y, get the best parameters, and then build a new model with these best parameters.
Now what should I train this new model on?
Should I train the new model on X_train, y_train, or on X, y?
Problem 3
If I train the new model on X, y, then how will I validate the results?
My code so far
#Dataframes
feature_cols = ['CRIM','ZN','INDUS','NOX','RM','AGE','DIS','TAX','PTRATIO','B','LSTAT']
X = boston_data[feature_cols]
y = boston_data['PRICE']
Train Test Split of Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1)
Grid Search to get best hyperparameters
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

RFReg = RandomForestRegressor(random_state=1)
param_grid = {
    'n_estimators': [100, 500, 1000, 1500],
    'max_depth': [4, 5, 6, 7, 8, 9, 10]
}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_train, y_train)
CV_rfc.best_params_
#{'max_depth': 10, 'n_estimators': 100}
Train a Model on the max_depth: 10, n_estimators: 100
RFReg = RandomForestRegressor(max_depth = 10, n_estimators = 100, random_state = 1)
RFReg.fit(X_train, y_train)
y_pred = RFReg.predict(X_test)
y_pred_train = RFReg.predict(X_train)
RMSE: 2.8139766730629394
I just want some guidance with what the correct steps would be
In general, to tune the hyperparameters, you should always train your model on X_train and use X_test to check the results. You then tune the parameters based on the results obtained on X_test.
You should never tune hyperparameters on the whole dataset, because that would defeat the purpose of the train/test split (as you correctly point out in Problem 3).
This is a valid concern indeed.
Problem 1
GridSearchCV does indeed use cross-validation to find a proper set of hyperparameters. But you should still keep a held-out validation set to make sure the chosen parameters are also sound on data that played no part in the search (so in the end you have train, test and validation sets); a small sketch is below.
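A minimal sketch of that workflow, reusing the RFReg and param_grid names from the question (the split sizes are just an example):
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# First split off a final validation set that the search never sees
X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

# GridSearchCV does its own internal cross-validation on the development part
RFReg = RandomForestRegressor(random_state=1)
param_grid = {'n_estimators': [100, 500, 1000, 1500],
              'max_depth': [4, 5, 6, 7, 8, 9, 10]}
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv=10)
CV_rfc.fit(X_dev, y_dev)

# Check that the selected parameters also hold up on the untouched validation set
print(CV_rfc.best_params_)
print(CV_rfc.score(X_val, y_val))  # R² of the refit best estimator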
Problem 2
GridSearchCV already gives you the best estimator; you don't need to train a new one. Cross-validation is really just there to check that the model-building procedure is sound, so you can then train on the full dataset (see https://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation for a detailed discussion); a short sketch is below.
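A minimal sketch of both options, assuming the CV_rfc object from the question has already been fitted:
from sklearn.base import clone

# Option 1: use the estimator that GridSearchCV already refit on X_train, y_train
best_model = CV_rfc.best_estimator_
y_pred = best_model.predict(X_test)

# Option 2: rebuild the same model with the selected parameters and train it on the full dataset
final_model = clone(CV_rfc.best_estimator_)
final_model.fit(X, y)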
Problem 3
What you already validated is the way you trained your model (i.e. you already validated that the hyperparameters you found are sound and the training works as expected for the data you have).
I'm working on a classification problem using Python scikit-learn; it's medical diagnostics data with 6 features and 2 targets. I started with one target and trained a model using the KNN algorithm; prediction accuracy is 100% with this model.
Now I want to extend this to the second target, i.e. predict the outcome of two y values for the same feature set (6 columns).
Following is my code, where I'm able to accurately predict the outcome of Target 1 ('Outcome1-Urinary-bladder'). How can I extend it to also predict the outcome of the second target ('Outcome2-Nephritis-of-renal')?
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X = Feature_set
y = Target1['Outcome1-Urinary-bladder'].values

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)
y_predictor = knn.predict(X)
print(metrics.accuracy_score(y, y_predictor))
What modifications should be made to the code to predict the outcomes of both target values ('Outcome1-Urinary-bladder' and 'Outcome2-Nephritis-of-renal')?
Please help me out. Thanks in advance.
In general you just wrap your classifier in the one-vs-rest classifier wrapper:
http://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html#sklearn.multiclass.OneVsRestClassifier
And feed it a y matrix that has both columns at the same time.
Example of usage:
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

selClassifiers = {
    'linear': LinearSVC(),
    'linearWithSGD': SGDClassifier(),
    'rbf': SVC(kernel='rbf', probability=True),
    'poly': SVC(kernel='poly', probability=True),
    'sigmoid': SVC(kernel='sigmoid', probability=True),
    'bayes': MultinomialNB()
}
# classif is the key of the chosen classifier, e.g. 'linear'
classifier = Pipeline([('vectorizer', CountVectorizer()),
                       ('tfidf', TfidfTransformer()),
                       ('clf', OneVsRestClassifier(selClassifiers[classif]))])
classifier.fit(X_train, Y)
predicted = classifier.predict(X_test)
all_labels = lb.inverse_transform(predicted)  # lb: the previously fitted label binarizer used to encode Y
As pointed out by @yangjie, for your specific classifier there is no need to wrap it: KNeighborsClassifier already supports multi-output classification; a short sketch is below.
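A minimal sketch of that, assuming both outcome columns live in the same Target1 dataframe used in the question:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X = Feature_set
# Stack both targets into a single (n_samples, 2) matrix
y = Target1[['Outcome1-Urinary-bladder', 'Outcome2-Nephritis-of-renal']].values

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)            # multi-output fit: one model predicting both columns
y_pred = knn.predict(X)  # shape (n_samples, 2)

# Accuracy per target (evaluated on the training data, as in the question)
print(metrics.accuracy_score(y[:, 0], y_pred[:, 0]))
print(metrics.accuracy_score(y[:, 1], y_pred[:, 1]))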