I'm working with the fetch_kddcup99 dataset, and using pandas I've converted the original dataset to something like this, with all the dummy variables:
DataFrame
Note that after dropping duplicates, the final dataframe only contains 149 observations.
Then I start the feature engineering phase by one-hot encoding protocol_type, which is a string categorical variable, and converting y to 0/1.
X = pd_data.drop(target, axis=1)
y = pd_data[target]
y=y.astype('int')
protocol_type = [['tcp','udp','icmp']]
col_transformer = ColumnTransformer([
    ("encoder_tipo1", OneHotEncoder(categories=protocol_type, handle_unknown='ignore'),
     ['protocol_type']),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=89)
Finally I proceed to the model evaluation, which gives me the following result:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('SVM', SVC()))
#selector = SelectFromModel(estimator=model)
scaler = option2
selector = SelectKBest(score_func=f_classif,k = 3)
results=[]
for name, model in models:
    pipeline = make_pipeline(col_transformer, scaler, selector)
    #print(pipeline)
    X_train_selected = pipeline.fit_transform(X_train, y_train)
    #print(X_train_selected)
    # Reuse the pipeline fitted on the training data; do not refit it on the test set
    X_test_selected = pipeline.transform(X_test)
    modelo = model.fit(X_train_selected, y_train)
    kf = KFold(n_splits=10, shuffle=True, random_state=89)
    cv_results = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='accuracy')
    results.append(cv_results)
    print(name, cv_results)
plt.boxplot(results)
plt.show()
Boxplots from CV
My question is: why do all the models give the same results? Could it be due to the small number of rows in the dataframe, or am I doing something wrong?
You have 149 rows, of which 80% go into the training set, so 119. You then do 10-fold cross-validation, so each test fold has about 12 samples. So each individual test fold has only 13 possible accuracy scores; even if the classifiers predict some samples a little differently, they may have the same accuracy. (The common scores you see (1, 0.88, 0.71) don't line up with the fractions I'm expecting though, so maybe I've missed something?) So yes, possibly it's just the small number of rows, compounded with the cross-validation. Selecting down to just 3 features also probably contributes.
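The arithmetic above can be checked with a few lines of plain Python. This is only a sketch, assuming the 119 training rows and 10 folds from the question; the fold-size rule mirrors how scikit-learn's KFold hands out leftover samples (the first n_samples % n_splits folds get one extra).

```python
from fractions import Fraction

n_train = 119   # 80% of the 149 rows, as in the question
n_splits = 10

# Fold sizes: the first (n_train % n_splits) folds get one extra sample
base, extra = divmod(n_train, n_splits)
fold_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
print(fold_sizes)

# Every accuracy a 12-sample fold can produce: k correct out of 12
possible = [Fraction(k, 12) for k in range(13)]
print([round(float(p), 3) for p in possible])
```

With so few distinct values available per fold, two classifiers that disagree on a sample or two can easily land on identical fold scores.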
One quick thing to check is some continuous score of the models' performance, say log-loss or Brier score.
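A minimal sketch of that check, using synthetic data from make_classification as a stand-in for the real features (an assumption, not the asker's dataframe). The scoring names "neg_log_loss" and "neg_brier_score" are built into scikit-learn; both need predict_proba, so SVC would have to be created with probability=True. StratifiedKFold is used here only so that every fold contains both classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with the same number of rows as the training set
X_demo, y_demo = make_classification(n_samples=119, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=89)
candidates = {"LR": LogisticRegression(max_iter=1000), "NB": GaussianNB()}

scores = {}
for name, model in candidates.items():
    for scoring in ("accuracy", "neg_log_loss", "neg_brier_score"):
        # Continuous scores can separate models whose accuracies tie
        fold_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring=scoring)
        scores[(name, scoring)] = fold_scores
        print(name, scoring, fold_scores.round(3))
```

If the accuracies tie but the log-loss or Brier values differ, the models are not actually identical; the accuracy metric is just too coarse at this sample size.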
(And, Gaussian is probably the wrong Naive Bayes to use with your data, containing so many binary features.)
Related
I'm building a model that predicts football match results. Right now I'm trying to predict the total number of goals in each match, so I think this is a classification problem.
The Y column has 7 values (0-6). I scale data with RobustScaler, then I fit the model.
Could anyone give me some advice on fixing the script (if there are errors), improving accuracy, and getting better predictions?
Below is part of the script and three rows of my dataset.
# Partitioning
dtf_train, dtf_test = model_selection.train_test_split(dtf, train_size=0.95, shuffle=True)
# Scaling
scalerX = preprocessing.RobustScaler()
scalerY = preprocessing.RobustScaler()
X_names = dtf_train.drop("Y", axis=1).columns
dtf_train[X_names] = scalerX.fit_transform(dtf_train[X_names])
dtf_train["Y"] = dtf_train["Y"].astype(int)
dtf_test[X_names] = scalerX.transform(dtf_test[X_names])
X_train = dtf_train.drop("Y", axis=1).values
y_train = dtf_train["Y"].values
X_test = dtf_test.drop("Y", axis=1).values
y_test = dtf_test["Y"].values
parameters = {
    'C': [100, 1000],
    'gamma': [1, 0.1, 0.001],
    'kernel': ['rbf']
}
grid = GridSearchCV(SVC(), parameters, refit=True, verbose=3)
grid.fit(X_train, y_train)
predicted = grid.predict(X_test)
I tried GradientBoostingRegressor, LinearRegression and SVR, treating it as a regression problem, then I switched to classification using SVC, but little has changed. I hope I'll be able to improve my model and reach my goal of predicting match results.
I am writing a small program in which I train a random forest to predict a binary value. My dataset has around 20,000 entries, and each entry has 25 features (continuous and categorical) with a binary target value to predict.
I am getting over 99% test accuracy which is surprisingly high. I tried to reduce the number of my features, even with two features I am still getting such high accuracy. I just want to make sure I am not doing anything wrong in my code, such as the training set leaking into my test set.
Here is the code snippet
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv(r'test.csv')
data = data.drop_duplicates()
# splitting data
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# preprocessing the dataset by one-hot encoding (fitted on the training split only)
l1 = OneHotEncoder(handle_unknown='ignore')
l1.fit(X_train)
X_train = l1.transform(X_train)
X_test = l1.transform(X_test)
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train.to_numpy())
# evaluation
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))
Additionally, I forgot to mention that my dataset is balanced, and the precision and recall scores are 100%!
This is quite a big dataset. How balanced is your dataset? It might be the case that your test split is filled mostly with entries of one label, and that the model failed every time the entry was from the other label. Therefore, I would say accuracy is not a good measure to rely on here.
Have a look at this:
Difference of model accuracy and performance
Have a look at your confusion matrix and inspect your splits.
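To illustrate why accuracy alone can mislead, here is a small sketch with made-up labels (95 negatives and 5 positives, not the asker's data) and a degenerate "model" that always predicts the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up, heavily imbalanced test labels: 95 of class 0, 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate classifier that always predicts class 0
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class

print(np.bincount(y_true))   # class counts in the split
print(acc)
print(cm)
```

The accuracy looks high, yet the second row of the confusion matrix shows that every positive sample was missed. Inspecting the class counts of y_train and y_test in the same way shows whether a split is skewed.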
I have a dataframe with 36540 rows. The objective is to predict y (HITS_DAY).
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
                          n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter for xgboost, which doesn't exist; the parameter is spelled objective). However, you have to consider how xgboost works. It builds "trees", and trees have splits based on the values of the features. From the plot you show here, it looks like very few samples go above 0. So a random train/test split will likely result in a test set with virtually no samples with a value above 0 (hence the horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=1/3, random_state=85)
models = [
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    LinearSVC(),
    MultinomialNB(),
    LogisticRegression(random_state=0),
]
# 5-fold cross-validation
CV = 5
cv_df = pd.DataFrame(index=range(CV * len(models)))
entries = []
for model in models:
    model_name = model.__class__.__name__
    accuracies = cross_val_score(model, features, labels, scoring='accuracy', cv=CV)
    for fold_idx, accuracy in enumerate(accuracies):
        entries.append((model_name, fold_idx, accuracy))
cv_df = pd.DataFrame(entries, columns=['model_name', 'fold_idx', 'accuracy'])
Error:
UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
When using cross-validation this way, the whole training set is split so that the train and test parts of each fold have the same class distribution. If there are 10 objects with label "A", making up about 20% of all examples, they will be spread across the folds so that every fold gets 2 objects with label "A", which is again 20% of that fold's test part.
But what happens when label "A" has only 1 object (one row with that class) and you try to split it into 5 groups? That is what the message you get is about: the splitter does not know how to handle that.
It's a bit hard to tell how to solve it without knowing what your data looks like and what your needs are. Different problems may have different solutions.
You can:
Remove the problematic label from the dataset, or check for all classes with extremely low frequency and group them together as "Other" or something like that.
Give up on the stratified split (which is what an integer cv gives you for a classifier) and use KFold, which does not require the folds to have the same class distribution.
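The first option can be sketched with pandas. The helper name group_rare_labels and the toy labels below are hypothetical, chosen only for illustration; the threshold would normally be the number of folds.

```python
import pandas as pd

def group_rare_labels(labels: pd.Series, min_count: int, other_label: str = "Other") -> pd.Series:
    """Replace every label that occurs fewer than min_count times with other_label."""
    counts = labels.value_counts()
    rare = counts[counts < min_count].index
    # Keep labels that are not rare; overwrite the rest
    return labels.where(~labels.isin(rare), other_label)

# Toy labels: "C", "D" and "E" are too small for 5-fold stratification
labels = pd.Series(["A"] * 10 + ["B"] * 8 + ["C"] * 1 + ["D"] * 2 + ["E"] * 3)
grouped = group_rare_labels(labels, min_count=5)
print(grouped.value_counts())
```

Note that the merged "Other" class can itself end up smaller than the number of folds, so it is worth checking the counts again afterwards. For the second option, passing cv=KFold(n_splits=5, shuffle=True, random_state=0) to cross_val_score instead of cv=5 avoids stratification altogether, at the cost of some folds possibly missing the rare classes entirely.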
I have a dataset that shows whether a person has diabetes based on indicators, it looks like this (original dataset):
I've created a straightforward model in order to predict the last column (Outcome).
#Libraries imported
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
#Dataset imported
data = pd.read_csv('diabetes.csv')
#Assign X and y
X = data.iloc[:,:-1].values
y = data.iloc[:,-1].values
#Data preprocessed
sc = StandardScaler()
X = sc.fit_transform(X)
#Dataset split between train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Predicting the results for the whole dataset
y_pred2 = model.predict(data)
#Add prediction column to original dataset
data['prediction'] = y_pred2
However, I get the following error: ValueError: X has 9 features per sample; expecting 8.
My questions are:
Why can't I create a new column with the predictions for my entire dataset?
How can I make predictions for blank outcomes (that need to be predicted)? That is to say, should I upload the file again? Let's say I want to predict the following:
Rows to predict:
Please let me know if my questions are clear!
You are feeding data (with all 9 initial features) to a model that was trained with X (8 features, since Outcome has been removed to create y), hence the error.
What you need to do is:
Get predictions using X instead of data
Append the predictions to your initial data set
i.e.:
y_pred2 = model.predict(X)
data['prediction'] = y_pred2
Keep in mind that this means that your prediction variable will come both from data that have already been used for model fitting (i.e. the X_train part) and from data unseen by the model during training (the X_test part). I'm not quite sure what your final objective is (nor is that what the question is about), but this is a rather unusual situation from an ML point of view.
If you have a new dataset data_new for which to predict the outcome, you do it in a similar way, always assuming that X_new has the same features as X (i.e. again without the Outcome column, as you have done with X):
y_new = model.predict(X_new)
data_new['prediction'] = y_new
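One detail worth keeping in mind: in the question, X was standardized with sc before fitting, so new rows have to go through the same fitted scaler before model.predict. A minimal sketch of one way to handle that, wrapping the scaler and the model in a scikit-learn pipeline; the column names and the randomly generated data below are hypothetical stand-ins for diabetes.csv:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for diabetes.csv: two feature columns plus Outcome
data = pd.DataFrame({
    "Glucose": rng.normal(120, 30, size=200),
    "BMI": rng.normal(32, 7, size=200),
})
data["Outcome"] = (data["Glucose"] + rng.normal(0, 20, size=200) > 125).astype(int)

feature_cols = [c for c in data.columns if c != "Outcome"]

# The pipeline learns the scaling on the training data and reapplies it at predict time
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(data[feature_cols], data["Outcome"])

# New rows whose Outcome is unknown: same feature columns, no Outcome column
data_new = pd.DataFrame({"Glucose": [90.0, 180.0], "BMI": [25.0, 40.0]})
data_new["prediction"] = model.predict(data_new[feature_cols])
print(data_new)
```

Without a pipeline, the equivalent would be calling sc.transform (not fit_transform) on the new features before model.predict.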