AutoML H2O gives multinomial result instead of binomial - python

I am trying to process the dataset kr-vs-kp using H2O AutoML. The dataset has two possible target values, "nowin" and "win", so I assume it should be a binary classification problem. But after a model is found, it turns out that H2O treated it as a multiclass classification problem (the accuracy score is absent, and a multinomial confusion matrix is present).
Why does that happen, and what do I have to fix so that it becomes a binary classification problem?
The code to run AutoML is as follows:
import h2o
from h2o.automl import H2OAutoML

h2o.init()
info = h2o.import_file("kr-vs-kp.csv")
train, test = info.split_frame(ratios=[.75])
x = train.columns
y = x.pop()
train[y] = train[y].asfactor()  # doesn't change anything
test[y] = test[y].asfactor()    # doesn't change anything
automl = H2OAutoML(max_runtime_secs=900)
automl.train(x=x, y=y, training_frame=train)
perf = automl.leader.model_performance(test)
print("perf type:", type(perf))
print("Algorithm", automl.leader.show())
print("Confusion Matrix", perf.confusion_matrix())
print("Accuracy score", perf.accuracy())
The output is as follows:
perf type: <class 'h2o.model.metrics.multinomial.H2OMultinomialModelMetrics'>
Algorithm GBM_1_AutoML_1_20230217_142818
Confusion Matrix Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
class    nowin    won    Error      Rate
-------  -------  -----  ---------  --------
0        0        0      nan        0 / 0
0        341      14     0.0394366  14 / 355
0        16       402    0.0382775  16 / 418
0        357      416    0.0388098  30 / 773
AttributeError: type object 'MetricsBase' has no attribute 'accuracy'
Update: OK, it seems I have found the source of the problem. For some strange reason the first line of the file is treated not as column names but as data, so the target ends up with three values: win, nowin, and class. But why? All other files I have tried so far were parsed normally, with the first line taken as the column names.
First line of the file, with the column names:
bkblk,bknwy,bkon8,bkona,bkspr,bkxbq,bkxcr,bkxwp,blxwp,bxqsq,cntxt,dsopp,dwipd,hdchk,katri,mulch,qxmsq,r2ar8,reskd,reskr,rimmx,rkxwp,rxmsq,simpl,skach,skewr,skrxp,spcop,stlmt,thrsk,wkcti,wkna8,wknck,wkovl,wkpos,wtoeg,class
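A minimal sketch of a possible fix (not from the original question): force H2O to treat the first row as the header and the target as categorical at import time, using import_file's header and col_types arguments; "class" is the target column name from the header line above.
import h2o

h2o.init()
info = h2o.import_file("kr-vs-kp.csv",
                       header=1,                     # 1 = first line is the header row
                       col_types={"class": "enum"})  # parse the target as categorical
print(info["class"].levels())  # a binomial target should show exactly two levels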

Related

Prediction in Python with SKLearn not working

I'm trying to predict the future values of a share with SKLearn regressors (but it could be the next value of anything, I've tried the same function on Covid Cases data with the same results) but it doesn't work.
I've written a function that takes my train dataset, the target variable, the test Xs and the features to take into account and gives back the prediction:
from sklearn.ensemble import GradientBoostingRegressor

def predict_share_valuesGRDBST(data, target_variable, X_test, features=None):
    # Split data into features (X) and target (y)
    if features:
        X = data[features]
    else:
        X = data.drop(target_variable, axis=1)
    y = data[target_variable]
    # Fit Gradient Boosting model to the training data
    model = GradientBoostingRegressor(n_estimators=200, random_state=20)
    model.fit(X, y)
    # Use the model to make predictions on the test rows
    next_values = model.predict(X_test[features])
    return next_values
The data variable looks like this:
Date        CloseValue  OpenValue  TradeVolume
...         ...         ...        ...
2023-01-19  100         90         1000
2023-01-20  110         101        1100
target_variable is 'CloseValue'.
X_test is like data but with values for future dates.
The features variable is ['OpenValue', 'TradeVolume', 'Date'], so a call looks like the sketch below.
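A minimal, hypothetical reproduction of such a call (the tiny frames are only placeholders for the real data and X_test; in this sketch the Date column is converted to an ordinal number, since scikit-learn estimators only accept numeric features):
import pandas as pd

data = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-19', '2023-01-20']).map(pd.Timestamp.toordinal),
    'CloseValue': [100, 110],
    'OpenValue': [90, 101],
    'TradeVolume': [1000, 1100],
})
X_test = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-21', '2023-01-22']).map(pd.Timestamp.toordinal),
    'OpenValue': [112, 108],
    'TradeVolume': [1200, 950],
})
preds = predict_share_valuesGRDBST(data, 'CloseValue', X_test,
                                   features=['OpenValue', 'TradeVolume', 'Date'])
print(preds)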
But the returned values don't fit at all.
I've tried other regressors (AdaBoost, RandomForest) but they all give me the same wrong results.
That's why I think I'm doing something wrong and it's not just a problem of correlation between the variables; it seems they're working on the wrong data, but I cannot figure out how to fix it. Any ideas?

h2o: F1 score and other binary classification metrics missing

I am able to run the following example code and get an F1 score:
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
h2o.init()
# import the airlines dataset:
# This dataset is used to classify whether a flight will be delayed 'YES' or not "NO"
# original data can be found at http://www.transtats.bts.gov/
airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
# convert columns to factors
airlines["Year"]= airlines["Year"].asfactor()
airlines["Month"]= airlines["Month"].asfactor()
airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
airlines["Cancelled"] = airlines["Cancelled"].asfactor()
airlines['FlightNum'] = airlines['FlightNum'].asfactor()
# set the predictor names and the response column name
predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
              "DayOfWeek", "Month", "Distance", "FlightNum"]
response = "IsDepDelayed"
# split into train and validation sets
train, valid = airlines.split_frame(ratios = [.8], seed = 1234)
# train your model
airlines_gbm = H2OGradientBoostingEstimator(sample_rate = .7, seed = 1234)
airlines_gbm.train(x = predictors,
                   y = response,
                   training_frame = train,
                   validation_frame = valid)
# retrieve the model performance
perf = airlines_gbm.model_performance(valid)
perf
With output like so:
ModelMetricsBinomial: gbm
** Reported on test data. **
MSE: 0.20546330299964743
RMSE: 0.4532806007316521
LogLoss: 0.5967028742962095
Mean Per-Class Error: 0.31720065289432364
AUC: 0.7414970113257631
AUCPR: 0.7616331690362552
Gini: 0.48299402265152613
Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.35417599264806404:
NO YES Error Rate
0 NO 1641.0 2480.0 0.6018 (2480.0/4121.0)
1 YES 595.0 4011.0 0.1292 (595.0/4606.0)
2 Total 2236.0 6491.0 0.3524 (3075.0/8727.0)
...
However, my dataset doesn't work in a similar manner, despite appearing to be of the same form. My dataset target variable also has a binary label. Some information about my dataset:
y_test.nunique()
failure 2
dtype: int64
Yet my performance (perf) metrics are a much smaller subset of those shown in the example:
perf = gbm.model_performance(hf_test)
perf
ModelMetricsRegression: gbm
** Reported on test data. **
MSE: 0.02363221438767555
RMSE: 0.1537277281028883
MAE: 0.07460874699751764
RMSLE: 0.12362377397478382
Mean Residual Deviance: 0.02363221438767555
It is difficult to share my data due to its sensitive nature. Any ideas on what to check?
You're training a regression model and that's why you're missing the binary classification metrics. The way that H2O knows whether to train a regression vs classification model is by looking at the data type of the response column.
We explain it here in the H2O User Guide, but this is a frequent question we get since it's different than how scikit-learn works, which uses different methods for regression vs classification and doesn't require you to think about column types.
y_test.nunique()
failure 2
dtype: int64
On the response column in your training data, you can do something like this:
train["response"] = train["response"].asfactor()
Alternatively, when you read the file in from disk, you can parse the response column as the "enum" type, so you don't have to convert it after the fact. There are some examples of how to do that in Python here. If the response is stored as integers, H2O just assumes it's a numeric column when it reads the data from disk, but if the response is stored as strings, it will correctly parse it as a categorical (a.k.a. "enum") column and you won't need to specify or convert anything.
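As a brief illustration of that import-time option (a sketch only; the file name is a placeholder and the response column is assumed to be called "failure", as in the question):
import h2o

h2o.init()
train = h2o.import_file("train.csv", col_types={"failure": "enum"})
print(train["failure"].isfactor())  # confirms the column was parsed as categorical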

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input

I'm trying to reproduce this GitHub project, on Topological Data Analysis (TDA), on my machine.
My steps:
get the best parameters from a cross-validation output
load my dataset
feature selection
extract topological features from the dataset for prediction
create a Random Forest Classifier model built on the best parameters
calculate probabilities on test data
Background:
Feature selection
In order to decide which attributes belong to which group, we created a correlation matrix.
From this, we saw that there were two big groups where player attributes were strongly correlated with each other. Therefore, we decided to split the attributes into two groups, one summarising the attacking characteristics of a player and the other the defensive ones. Finally, since the goalkeeper has completely different statistics from the other players, we decided to take into account only the overall rating. Below are the 24 features used for each player:
Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing",
"reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration",
"sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"
Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle",
"long_passing"
Goalkeeper: "overall_rating"
From this set of features, the next step was, for each non-goalkeeper player, to compute the mean of the attack attributes and the mean of the defensive ones.
Finally, for each team in a given match, we compute the mean and the standard deviation of the attack and of the defense over that team's players, as well as the best attack and best defense.
In this way a match is described by 14 features (for each team: GK overall value, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match into a space following the characteristics of the two teams.
Feature extraction
The aim of TDA is to catch the structure of the space underlying the data. In our project, we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for
this kind of correlation.
Methods:
def get_best_params():
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output
    return top_feat_params, top_model_feat_params
def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]
    return x_train_with_topo, y_train
def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST', y_test.shape)
    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)
    return x_test_topo
def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features
def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    for i in tqdm(range(0, len(x_test), shift)):
        print(range(0, len(x_test), shift))
        if i + shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test
def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams

    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df
Working code:
best_pipeline_params, best_model_feat_params = get_best_params()
# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}
pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
# SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
#('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])
x_train, y_train = load_dataset()
# x_train.shape -> (2565, 19)
# y_train.shape -> (2565,)
x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)
# x_test.shape -> (380, 24)
rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids) # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')
But I'm getting the error:
ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.
Loaded dataset (X_train):
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64
Please note that the first 14 columns are the features that describe the match, and the 5 remaining features (excluding label) are the topological ones, which are already extracted.
The problem seems to arise when the code gets to extract_x_test_features() and extract_features_for_prediction(), which should compute the topological features and stack them onto the test set.
Since X_train already has topological features, 5 more get added, and so I end up with 24 features.
I'm not sure, though. I'm just trying to wrap my head around this project and how prediction is being made here.
How do I fix the mismatch using the code above?
NOTES:
1 - x_train and y_test are not DataFrames but numpy.ndarrays
2 - This question is completely reproducible if one clones or downloads the project from the following link:
Github Link
Returning a slice with 19 features here:
def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    (...)
    return final_x_test[:, :19]
Got rid of the error and ran the test.
I still don't get the gist of it, though.
I will grant the bounty to anyone who explains to me the idea behind the test set in the context of this project, as used in the project notebook, which can be found here:
Project Notebook
The answer is actually given in the question already.
You mentioned in your question that x_test.shape -> (380, 24) and x_train.shape -> (2565, 19). It is clear that your test data shape doesn't match your train data: your train data has 19 features, whereas the test data has 24 (they must contain the same number of features). That is why you get the error "X has 24 features, but DecisionTreeClassifier is expecting 19 features as input" when you pass x_test to your model in the line get_probabilities(rf_model, x_test, team_ids).
So your test data must have the same number of features as your train data.
In your x_train you have 19 features, whereas in X_test you have 24 features. Why is that?
To solve it, inspect both data frames (x_train and X_test) and try to find out why they have different features. In the end, you must have the same number of features, in the same order, in each frame. If not, you will get this error.
It is probably an error in the dataset you imported.
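As a tiny sanity check of the mismatch described above (a sketch only; the arrays below are hypothetical stand-ins with the shapes reported in the question):
import numpy as np

x_train = np.zeros((2565, 19))  # shape from the question
x_test = np.zeros((380, 24))    # shape from the question
assert x_train.shape[1] == x_test.shape[1], (
    "train and test must have the same number of features")  # raises here, mirroring the ValueError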

How to interpret output of .predict() from fitted scikit-survival model in python?

I'm confused how to interpret the output of .predict from a fitted CoxnetSurvivalAnalysis model in scikit-survival. I've read through the notebook Intro to Survival Analysis in scikit-survival and the API reference, but can't find an explanation. Below is a minimal example of what leads to my confusion:
import pandas as pd
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxnetSurvivalAnalysis
# load data
data_X, data_y = load_veterans_lung_cancer()
# one-hot-encode categorical columns in X
categorical_cols = ['Celltype', 'Prior_therapy', 'Treatment']
X = data_X.copy()
for c in categorical_cols:
    dummy_matrix = pd.get_dummies(X[c], prefix=c, drop_first=False)
    X = pd.concat([X, dummy_matrix], axis=1).drop(c, axis=1)
# display final X to fit Cox Elastic Net model on
del data_X
print(X.head(3))
so here's the X going into the model:
   Age_in_years  Celltype  Karnofsky_score  Months_from_Diagnosis  \
0          69.0  squamous             60.0                    7.0
1          64.0  squamous             70.0                    5.0
2          38.0  squamous             60.0                    3.0

  Prior_therapy Treatment
0            no  standard
1           yes  standard
2            no  standard
...moving on to fitting model and generating predictions:
# Fit Model
coxnet = CoxnetSurvivalAnalysis()
coxnet.fit(X, data_y)
# What are these predictions?
preds = coxnet.predict(X)
preds has the same number of records as X, but its values are way different from the values in data_y, even when predicting on the same data the model was fit on.
print(preds.mean())
print(data_y['Survival_in_days'].mean())
output:
-0.044114643249153422
121.62773722627738
So what exactly are preds? Clearly .predict means something pretty different here than in scikit-learn, but I can't figure out what. The API Reference says it returns "The predicted decision function," but what does that mean? And how do I get to the predicted estimate in months yhat for a given X? I'm new to survival analysis so I'm obviously missing something.
I posted this question on github, though the author renamed the issue question.
I got some helpful explanation of what the predict output is, but still am not sure how to get to a set of predicted survival times, which is what I really want. Here's a couple helpful explanations from that github thread:
predictions are risk scores on an arbitrary scale, which means you can
usually only determine the sequence of events, but not their exact time.
-sebp (library author)
It [predict] returns a type of risk score. Higher value means higher
risk of your event (class value = True)...You were probably looking
for a predicted time. You can get the predicted survival function with
estimator.predict_survival_function as in the example 00
notebook...EDIT: Actually, I’m trying to extract this but it’s been a
bit of a pain to munge
-pavopax.
There's more explanation at the github thread, though I wasn't really able to follow all of it. I need to play around with predict_survival_function and predict_cumulative_hazard_function and see if I can get to a set of predictions for most likely survival time by row in X, which is what I really want.
I'm not going to accept this answer here, in case anyone else has a better one.
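For what it's worth, here is a hedged sketch of that direction (it assumes a newer scikit-survival where fitting CoxnetSurvivalAnalysis with fit_baseline_model=True exposes predict_survival_function; it yields survival curves and a rough median survival time per row, not exact predicted times):
import numpy as np
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxnetSurvivalAnalysis

data_X, data_y = load_veterans_lung_cancer()
X_num = data_X.select_dtypes('number')  # numeric columns only, to keep the sketch short

coxnet = CoxnetSurvivalAnalysis(fit_baseline_model=True)
coxnet.fit(X_num, data_y)

# Each element is a step function S(t); take the first time it drops to 0.5 or below
# as an approximate median survival time for that row.
for fn in coxnet.predict_survival_function(X_num.iloc[:3]):
    below = fn.x[fn.y <= 0.5]
    print(below[0] if len(below) else "median beyond observed follow-up")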
With the X input, you get an evaluation of the input array:
def predict(self, X, alpha=None):
    """The linear predictor of the model.

    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test data of which to calculate log-likelihood from

    alpha : float, optional
        Constant that multiplies the penalty terms. If the same alpha was used during training, exact
        coefficients are used, otherwise coefficients are interpolated from the closest alpha values that
        were used during training. If set to ``None``, the last alpha in the solution path is used.

    Returns
    -------
    T : array, shape = (n_samples,)
        The predicted decision function
    """
    X = check_array(X)
    coef = self._get_coef(alpha)
    return numpy.dot(X, coef)
The definition of check_array comes from another library (scikit-learn).
You can review the code of coxnet.

Saving and using TFIDF vectorizer for future examples, then leading to error with dimension

So I am training a Multinomial Naive Bayes classifier from scikit-learn. I can now save that classifier using from sklearn.externals import joblib.
I now want to make a script to classify new examples. My only issue is taking new data, which comes in as strings, and passing it to classifier.predict(...), which requires the data to be in vectorized form.
Previously I would create a vectorizer as follows:
vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english', strip_accents='unicode', norm='l2', decode_error="ignore")
Now, the way TF-IDF vectorization works, it needs many documents to compute its weightings. So by creating a new vectorizer, I can't just pass it a single example and classify it. I clearly need to save the fitted vectorizer.
Really this comes down to how to transform new data into the same form I trained the classifier on.
Am I right in using vectorizer.transform(X_test_title)?
EDIT:
It seems I was right in my last comment above. However, now when loading the classifier and vectorizer into my script, I seem to have issues passing the vectorized data to the classifier. Here is my function, which takes a title and a document, both of which are clean strings:
def predict_function(title_data, document_data):
    data = ((title_data + ' ') * number_repeat_title(title_data, document_data)) + document_data
    # requires a list
    data = [data, 'testing another element works']
    print data
    data_vector = vectorizer.transform(data)
    print data_vector  # checking data is good!
    predicted = classifier.predict(data_vector)
    return predicted
An example for calling this function is as follows:
predict_function('mr sponge bob square pants', "SpongeBob SquarePants is an American animated television series created by marine biologist and animator Stephen Hillenburg for Nickelodeon. The series chronicles the adventures and endeavors of the title character and his various friends in the fictional underwater city of Bikini Bottom. The series' popularity has made it a media franchise, as well as Nickelodeon network's highest rated show, and the most distributed property of MTV Networks. The media franchise has generated $8 billion in merchandising revenue for Nickelodeon.")
I get an error, where I predict:
predicted = classifier.predict(data_vector)
giving....
/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in predict(self, X)
61 Predicted target values for X
62 """
---> 63 jll = self._joint_log_likelihood(X)
64 return self.classes_[np.argmax(jll, axis=1)]
65
/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
455 """Calculate the posterior log probability of the samples X"""
456 X = atleast2d_or_csr(X)
--> 457 return (safe_sparse_dot(X, self.feature_log_prob_.T)
458 + self.class_log_prior_)
459
/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
189 from scipy import sparse
190 if sparse.issparse(a) or sparse.issparse(b):
--> 191 ret = a * b
192 if dense_output and hasattr(ret, "toarray"):
193 ret = ret.toarray()
/Library/Python/2.7/site-packages/scipy-0.14.0.dev_572aaf0-py2.7-macosx-10.9-intel.egg/scipy/sparse/base.pyc in __mul__(self, other)
337
338 if other.shape[0] != self.shape[1]:
--> 339 raise ValueError('dimension mismatch')
340
341 result = self._mul_multivector(np.asarray(other))
ValueError: dimension mismatch
Looking at the scikit-learn documentation found here (http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html)
I believe you are correct.
The training data in the scikit-learn example is vectorized as follows:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)
This means the vectorizer will now remember the TF-IDF weightings.
These weightings are then applied to the test data with the following line of code:
X_test = vectorizer.transform(data_test.data)
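To tie this back to the saving-and-reloading part of the question, here is a hedged sketch (the file names are placeholders, vectorizer and classifier are assumed to be already fitted, and sklearn.externals.joblib is used as in the question; in recent scikit-learn versions you would import joblib directly):
from sklearn.externals import joblib  # newer scikit-learn: import joblib

# after fitting: persist both the fitted vectorizer and the classifier
joblib.dump(vectorizer, 'tfidf_vectorizer.pkl')
joblib.dump(classifier, 'nb_classifier.pkl')

# later, in the prediction script: reload them and only call transform(), never fit again
vectorizer = joblib.load('tfidf_vectorizer.pkl')
classifier = joblib.load('nb_classifier.pkl')
new_docs = ['some new title and document text']
data_vector = vectorizer.transform(new_docs)  # same feature space as training
print(classifier.predict(data_vector))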
