I have a video games Dataset with many categorical columns.
I binarized all these columns.
Now I want to predict a column (called Rating) with Logistic Regression, but this column is now binarized into 4 columns (Rating_Everyone, Rating_Everyone10+, Rating_Teen and Rating_Mature).
So I applied Logistic Regression four times. Here is my code:
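import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")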
y = df2['Rating_Everyone'].values
df2 = df2.drop(['Rating_Everyone'], axis=1)
df2 = df2.drop(['Rating_Everyone10'], axis=1)
df2 = df2.drop(['Rating_Teen'], axis=1)
df2 = df2.drop(['Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
class_weight=None, random_state=None, solver='liblinear', max_iter=100,
multi_class='ovr',
verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)
print("Logistic Regression Rating_Everyone accuracy: ", ris)
And again:
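# Reload the data first: the rating columns were dropped from df2 above
df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")
y = df2['Rating_Everyone10'].values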
df2 = df2.drop(['Rating_Everyone'], axis=1)
df2 = df2.drop(['Rating_Everyone10'], axis=1)
df2 = df2.drop(['Rating_Teen'], axis=1)
df2 = df2.drop(['Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
class_weight=None, random_state=None, solver='liblinear', max_iter=100,
multi_class='ovr',
verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)
print("Logistic Regression Rating_Everyone10 accuracy: ", ris)
And so on for Rating_Teen and Rating_Mature.
Can you tell me how to merge all these four results into one result OR how can I do this multiclass Logistic Regression problem better?
The LogisticRegression model can inherently handle multiclass problems:
Below is a summary of the classifiers supported by scikit-learn
grouped by strategy; you don’t need the meta-estimators in this class
if you’re using one of these, unless you want custom multiclass
behavior: Inherently multiclass: Naive Bayes, LDA and QDA, Decision
Trees, Random Forests, Nearest Neighbors, setting
multi_class='multinomial' in sklearn.linear_model.LogisticRegression.
As a basic model, without class weighting (which you may need, since the samples may not be balanced across the ratings), set multi_class='multinomial' and change the solver to 'lbfgs' or one of the other solvers that support multinomial loss:
For multiclass problems, only ‘newton-cg’, ‘sag’ and ‘lbfgs’ handle
multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes
So you don't have to split your dataset up the way you have. Instead, provide the original ratings column as the labels.
Here is a minimal example:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(10, 10)
y = np.random.randint(1, 4, size=10) # 3 classes simulating ratings
lg = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lg.fit(X, y)
lg.predict(X)
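If only the binarized rating columns are at hand, one way to follow this advice is to collapse them back into a single label column before fitting. The sketch below is illustrative only: the toy DataFrame and the feature names feat_a and feat_b are made up, the rating column names follow the question, and it relies on the default lbfgs solver, which in recent scikit-learn releases fits a multinomial model for multiclass targets without setting multi_class explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
rating_cols = ["Rating_Everyone", "Rating_Everyone10", "Rating_Teen", "Rating_Mature"]

# Toy stand-in for the binarized dataset (feat_a and feat_b are hypothetical)
n = 200
codes = rng.integers(0, len(rating_cols), size=n)
df = pd.DataFrame({
    "feat_a": rng.normal(size=n) + codes,
    "feat_b": rng.normal(size=n),
})
for i, col in enumerate(rating_cols):
    df[col] = (codes == i).astype(int)

# Collapse the four one-hot columns back into one label per row
y = df[rating_cols].idxmax(axis=1)
X = df.drop(columns=rating_cols)

# A single model over all four classes instead of four separate binary fits
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

pred = clf.predict(X)          # one rating name per row
proba = clf.predict_proba(X)   # one column per rating, each row sums to 1
```

The predicted label is then a single result per game, so there is nothing left to merge.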
Edit: responding to comment.
tl;dr: I expect that the model will learn that interaction on its own. If not, you might encode that information as a feature. So there is no obvious need to binarize your classes.
The way I understand it, you have features of a video game and you have the rating for that game as the label (which you're trying to predict). This is then a multiclass problem, which you can start modeling using logistic regression (this you knew). This is the model I proposed above.
Now you have recognized that there is an implicit distance between classes. The way I would use this information is as a feature for the model. However, I'd first be inclined to see if the model will learn this on its own.
Related
I'm making a model that predicts football match results. Right now I'm trying to predict the total number of goals in each match, so I think this is a classification problem.
The Y column has 7 values (0-6). I scale the data with RobustScaler, then I fit the model.
Could anyone give me some advice on fixing the script (if there are errors), improving accuracy and getting better predictions?
Below is part of the script and three rows of my dataset.
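from sklearn import model_selection, preprocessing
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Partitioning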
dtf_train, dtf_test = model_selection.train_test_split(dtf, train_size=0.95, shuffle=True)
# Scaling
scalerX = preprocessing.RobustScaler()
scalerY = preprocessing.RobustScaler()
X_names = dtf_train.drop("Y", axis=1).columns
dtf_train[X_names] = scalerX.fit_transform(dtf_train[X_names])
dtf_train["Y"] = dtf_train["Y"].astype(int)
dtf_test[X_names] = scalerX.transform(dtf_test[X_names])
X_train = dtf_train.drop("Y", axis=1).values
y_train = dtf_train["Y"].values
X_test = dtf_test.drop("Y", axis=1).values
y_test = dtf_test["Y"].values
parameters = {
'C': [100,1000],
'gamma': [1,0.1,0.001],
'kernel': ['rbf']
}
grid = GridSearchCV(SVC(), parameters, refit=True, verbose=3)
grid.fit(X_train, y_train)
predicted = grid.predict(X_test)
I tried GradientBoostingRegressor, LinearRegression and SVR, treating it as a regression problem, then changed my mind and switched to classification with SVC, but little has changed. I hope I'll be able to improve my model and reach my goal of predicting match results.
I'm new to using XGBoost and I'm confused about how we can obtain the XGBoost predicted values for each data point.
This is how I've approached the problem so far:
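import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import train_test_split

# Creating dataframe of predictor variables (dropping target variable and string columns)
X = players.drop(columns=['Overall', 'Age', 'Market value', 'Player', 'Nationality', 'Contract End',
'Potential', 'Team', 'Position', 'Contract expires'])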
y = players['Overall']
# Splitting the data into training (80%) & test sets (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# XGBoost Regression model
model = xgb.XGBRegressor()
# Fitting the model
model.fit(X_train, y_train)
# Generating Test Predictions
y_pred_test = model.predict(X_test)
# Test RMSE
rmse_test = np.sqrt(MSE(y_test, y_pred_test))
print("RMSE: %f" % (rmse_test))
At this point I want to see the XGBoost model predictions for Overall for each Player, however I can't find any examples online with code for this. For example, the output would ideally look like:
Player, Overall, XGB Predicted Overall
Mbappé, 91, 92.3
Neymar, 90, 91.7
Messi, 93, 90.1
...
How should I go about obtaining these predicted values?
Here's a sample of my dataset:
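For the per-player table asked about above, one possible approach is to use the fact that train_test_split keeps the original row index, so the test rows can be matched back to the player names. This is only a sketch under stated assumptions: the toy players DataFrame and its Pace and Shooting columns are invented, and scikit-learn's GradientBoostingRegressor is swapped in so the example runs without xgboost installed; xgb.XGBRegressor exposes the same fit/predict interface and could be dropped in instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for xgb.XGBRegressor
from sklearn.model_selection import train_test_split

# Toy stand-in for the `players` DataFrame (Pace and Shooting are hypothetical)
rng = np.random.default_rng(42)
n = 50
players = pd.DataFrame({
    "Player": [f"player_{i}" for i in range(n)],
    "Pace": rng.integers(50, 100, size=n),
    "Shooting": rng.integers(50, 100, size=n),
})
players["Overall"] = (0.5 * players["Pace"] + 0.5 * players["Shooting"]).round()

X = players.drop(columns=["Overall", "Player"])
y = players["Overall"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# The split preserves the original index, so it can be used to look up the names
results = pd.DataFrame(
    {
        "Player": players.loc[X_test.index, "Player"],
        "Overall": y_test,
        "Predicted Overall": model.predict(X_test),
    },
    index=X_test.index,
)
print(results.head())
```

Calling model.predict(X) on the full feature matrix would give a prediction for every player, although the values for training rows are not an honest estimate of out-of-sample error.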
I'm working with the fetch_kddcup99 dataset, and using pandas I've converted the original dataset to something like this, with all the dummy variables:
DataFrame
Note that after dropping duplicates, the final dataframe only contains 149 observations.
Then I start the feature engineering phase by one-hot encoding protocol_type, which is a string categorical variable, and transforming y to 0/1.
X = pd_data.drop(target, axis=1)
y = pd_data[target]
y=y.astype('int')
protocol_type = [['tcp','udp','icmp']]
col_transformer = ColumnTransformer([
("encoder_tipo1", OneHotEncoder(categories=protocol_type, handle_unknown='ignore'),
['protocol_type']),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=89)
Finally I proceed to the model evaluation, which gives me the following result:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('SVM', SVC()))
#selector = SelectFromModel(estimator=model)
scaler = option2
selector = SelectKBest(score_func=f_classif,k = 3)
results=[]
for name, model in models:
    pipeline = make_pipeline(col_transformer, scaler, selector)
    #print(pipeline)
    X_train_selected = pipeline.fit_transform(X_train, y_train)
    #print(X_train_selected)
    # Only transform the test set; the pipeline must be fitted on the training data alone
    X_test_selected = pipeline.transform(X_test)
    modelo = model.fit(X_train_selected, y_train)
    kf = KFold(n_splits=10, shuffle=True, random_state=89)
    cv_results = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='accuracy')
    results.append(cv_results)
    print(name, cv_results)
plt.boxplot(results)
plt.show()
Boxplots from CV
My question is: why are the results of all the models the same? Could it be due to the small number of rows in the dataframe, or am I doing something wrong?
You have 149 rows, of which 80% go into the training set, so 119. You then do 10-fold cross-validation, so each test fold has about 12 samples. So each individual test fold has only 13 possible accuracy scores; even if the classifiers predict some samples a little differently, they may have the same accuracy. (The common scores you see (1, 0.88, 0.71) don't line up with the fractions I'm expecting though, so maybe I've missed something?) So yes, possibly it's just the small number of rows, compounded with the cross-validation. Selecting down to just 3 features also probably contributes.
One quick thing to check is some continuous score of the models' performance, say log-loss or Brier score.
(And, Gaussian is probably the wrong Naive Bayes to use with your data, containing so many binary features.)
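The suggestion to look at a continuous score can be sketched as follows. Everything here is an assumption made for illustration: the data is synthetic (119 rows and 3 features, to mimic the training set size described above), only two of the models are compared, and StratifiedKFold is used instead of the question's KFold so that every small test fold contains both classes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the 119 training rows with 3 selected features
X, y = make_classification(n_samples=119, n_features=3, n_informative=3,
                           n_redundant=0, random_state=89)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=89)
models = {"LR": LogisticRegression(), "NB": GaussianNB()}
metrics = ("accuracy", "neg_log_loss", "neg_brier_score")

scores = {}
for name, model in models.items():
    # Same folds for every metric, so the scores are directly comparable
    scores[name] = {
        metric: cross_val_score(model, X, y, cv=cv, scoring=metric)
        for metric in metrics
    }
    for metric, values in scores[name].items():
        print(name, metric, values.round(3))

# With about 12 samples per test fold, accuracy can only take the values k/12
print((np.arange(13) / 12).round(3))
```

Two models that tie on accuracy in a fold will usually still differ in log-loss or Brier score, because those metrics depend on the predicted probabilities rather than only on the thresholded labels.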
I have a dataframe with 36540 rows. The objective is to predict the target column, HITS_DAY.
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost, which doesn't exist). However, you have to consider how xgboost works. It will try to create "trees". Trees have splits based on the values of the features. From the plot you show here, it looks like there are very few samples that go above 0. So a purely random train/test split will likely result in a test set with virtually no samples with a value above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
I have a multiclass classification problem for various classifiers (random forest, SVM, NN) and I use OneVsRestClassifier to wrap my models. I want to use an interpretability method (LIME) which makes use of probabilities that sum to 1, but when I use the function predict_proba, the rows of the returned matrix do not always sum to 1.
It's a multiclass classification problem. I have checked my raw data, my binarized values, and my train/test data to check that there is no overlap of classes. Each instance has a distinct label (100, 010, or 001).
x = pd.read_pickle(r"x.pkl").values
y = pd.read_pickle(r"y.pkl").values
# binarize labels for multilabel auc calculations
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# create train and test sets, stratified
x_train, x_test, y_train, y_test = train_test_split(x, y, stratify=y, test_size = 0.20, random_state=5)
rfclassifier = RandomForestClassifier(n_estimators=100, random_state=5, criterion = 'gini', bootstrap = True)
classifier = OneVsRestClassifier(rfclassifier)
classifier.fit(x_train, y_train)
prediction = classifier.predict(x_test)
probability = classifier.predict_proba(x_test)
#check probabilities
print(classifier.predict_proba([x_test[0]]).round(3))
print(classifier.predict_proba([x_test[1]]).round(3))
print(classifier.predict_proba([x_test[20]]).round(3))
The print statements show examples for label 1, 0, and 2 respectively.
The outputs are [[0.164 0.836 0. ]], [[0.953 0.015 0. ]], and [[0.01 0.12 0.96]]. The last two (as well as many other instances) do not sum to 1 and prevent me from implementing the interpretability method.
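A possible explanation, based on how scikit-learn documents OneVsRestClassifier.predict_proba: when the classifier is fitted on a binarized indicator matrix it treats the task as multilabel and returns one independent marginal probability per label, whereas fitting on 1-D class labels makes it normalize each row to sum to 1. The sketch below compares the two on the iris dataset, which is only a stand-in for the pickled x and y in the question.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)  # stand-in 3-class dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.20, random_state=5
)

# Fitted on 1-D class labels: handled as single-label multiclass
clf_multiclass = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=100, random_state=5)
)
clf_multiclass.fit(X_train, y_train)
proba_multiclass = clf_multiclass.predict_proba(X_test)

# Fitted on the binarized indicator matrix: handled as multilabel
Y_train_bin = label_binarize(y_train, classes=[0, 1, 2])
clf_multilabel = OneVsRestClassifier(
    RandomForestClassifier(n_estimators=100, random_state=5)
)
clf_multilabel.fit(X_train, Y_train_bin)
proba_multilabel = clf_multilabel.predict_proba(X_test)

# Row sums: normalized in the first case, not necessarily in the second
print(proba_multiclass.sum(axis=1).round(3))
print(proba_multilabel.sum(axis=1).round(3))
```

If this is indeed the cause, the binarized labels can still be kept for the AUC calculations while the classifier itself is fitted on the original 1-D labels.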