I'm currently working on a model to predict the probability of fatality once a person is infected with the coronavirus.
I'm using a Dutch dataset with categorical variables: date of infection, outcome (fatality or cured), gender, age group, etc.
It was suggested to use a decision tree, which I've already built.
Since I'm new to decision trees I would like some assistance.
I would like the prediction (target variable) expressed as a probability (%), not as a binary output.
How can I achieve this?
I also want to play around with samples by inputting data myself and seeing what the outcome is.
For instance: take someone who is 40, male, etc., and calculate what their survival chance is.
How can I achieve this?
I've attached the code below:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd
filename = '/Users/sef/Downloads/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
# Output of model.fit (the classifier's repr):
# DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
#                        max_features=None, max_leaf_nodes=None,
#                        min_impurity_decrease=0.0, min_impurity_split=None,
#                        min_samples_leaf=1, min_samples_split=2,
#                        min_weight_fraction_leaf=0.0, presort=False, random_state=None,
#                        splitter='best')
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = YHat
print(df)
You can use the predict_proba method of the DecisionTreeClassifier to compute class probabilities instead of the binary classification values.
To test individual data that you create by hand, build an array with the same shape as one row of your X_test data (i.e. a single entry of shape (1, n_features)). Then you can pass it to model.predict(array) or model.predict_proba(array).
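For example, a minimal sketch (the feature values below are made up for illustration):
import numpy as np

# one hand-crafted sample with the same 8 features the model was trained on:
# preg, plas, pres, skin, test, mass, pedi, age
X_new = np.array([[2, 120, 70, 30, 80, 25.0, 0.5, 40]])  # shape (1, 8)
print(model.predict(X_new))        # hard class label
print(model.predict_proba(X_new))  # probability of each class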
By the way, your tree is currently not useful for retrieving probabilities: grown without limits, every leaf ends up pure, so predict_proba only ever returns 0s and 1s. There is an article that explains the problem very well: https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html
So you can fix your code by limiting the max_depth of your tree:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd
filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)
model = DecisionTreeClassifier(max_depth=1)  # all other parameters left at their defaults
model.fit(X_train, Y_train)
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = list(YHat)
print(df)
A decision tree can also estimate the probability that an instance belongs to a particular class. model.predict() returns the class with the highest probability; use predict_proba() as below with your feature data to get the probability of each class you want to predict:
model.predict_proba(X_test)
For the second part of your question, here is what you will have to do:
Create your own custom dataset with exactly the same column names you trained on.
Read your data from a CSV and apply the same encoder values, if any.
You can also save your label encoder object so you can reuse it (a persistence sketch follows the code below).
from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()
label_encoded_columns = ['Date_statistics_type', 'Agegroup', 'Sex', 'Province', 'Hospital_admission', 'Municipal_health_service', 'Deceased']
for col in label_encoded_columns:
    dataframe[col] = dataframe[col].astype(str)
Label_Encoder = label_encoder.fit(dataframe[label_encoded_columns].values.flatten())
Encoded_Array = Label_Encoder.transform(dataframe[label_encoded_columns].values.flatten()).reshape(dataframe[label_encoded_columns].shape)
LE_Dataframe = pd.DataFrame(Encoded_Array, columns=label_encoded_columns, index=dataframe.index)
LE_mapping = dict(zip(Label_Encoder.classes_, Label_Encoder.transform(Label_Encoder.classes_).tolist()))
# This gives you a dictionary mapping every value to its encoding,
# e.g. {'Apple': 0, 'Banana': 1}
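A small sketch for persisting the fitted encoder (joblib ships as a scikit-learn dependency):
from joblib import dump, load

dump(Label_Encoder, 'label_encoder.joblib')   # save the fitted encoder
Label_Encoder = load('label_encoder.joblib')  # reload it later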
There are two ways to do this.
The first one is pretty straightforward: use rows of X_test to produce a prediction.
model.predict(X_test.iloc[0:30]) ###First 30 rows
model.predict_proba(X_test.iloc[0:30])
The second one is introducing genuinely new data; in that case, you will have to label-encode the raw data once again.
If a value was not present during fitting, you may get a "previously unseen labels" error.
Refer to this link in that case.
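As a hedged workaround sketch, you can map never-seen values to a sentinel instead of letting transform() raise (the helper name here is made up for illustration):
import numpy as np

def safe_transform(encoder, values):
    # encode values the encoder already knows; map unseen values to -1
    known = set(encoder.classes_)
    return np.array([encoder.transform([v])[0] if v in known else -1 for v in values])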
I am making a sklearn model (RandomForestRegressor) and have trained it successfully with my data; however, I am unsure how to make predictions with it.
My CSV contains two items per row: a year (expressed as years since 2003) and a number (what's being predicted), usually above 1,000. When I use model.predict([[20]]), I get a decimal for a number that is supposed to be in the thousands, despite a very high R² value:
R-squared: 0.9804779528842772 Prediction: [0.67932727]
I have a feeling I'm not using this method correctly, but I couldn't really find anything online. A user on another question of mine said that the last item in a CSV row is supposed to be the output, so I assumed that is how it works. Please forgive me if something is unclear; just comment and I will try my best to clarify, as I am new to this.
Code:
import pandas
import scipy
import numpy
import matplotlib
import matplotlib.pyplot as plt
from pandas import read_csv
from sklearn import set_config
from sklearn.datasets import load_boston
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import scale
names = ['YEAR', 'TOTAL']
url = 'energy/energyTotal.csv'
dataset = read_csv(url, names=names)
array = dataset.values
x = array[:, 0:1]
y = array[:, 1]
y=y.astype('int')
# rfr = RandomForestRegressor(max_depth=3)
# rfr.fit(x, y)
# print(rfr.predict([[0, 1, 0, 1]]))
x = scale(x)
y = scale(y)
xtrain, xtest, ytrain, ytest=train_test_split(x, y, test_size=0.10)
#Train model
set_config(print_changed_only=False)
rfr = RandomForestRegressor()
print(rfr)
# Output:
# RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
#                       max_depth=None, max_features='auto', max_leaf_nodes=None,
#                       max_samples=None, min_impurity_decrease=0.0,
#                       min_samples_leaf=1,
#                       min_samples_split=2, min_weight_fraction_leaf=0.0,
#                       n_estimators=100, n_jobs=None, oob_score=False,
#                       random_state=None, verbose=0, warm_start=False)
rfr.fit(xtrain, ytrain)
score = rfr.score(xtrain, ytrain)
print("R-squared:", score)
print(rfr.predict([[20]]))
The CSV:
18,28564
17,28411
16,27515
15,24586
14,26653
13,26836
12,26073
11,27055
10,26236
9,26020
8,26538
7,25800
6,26682
5,24997
4,25100
3,24651
2,12053
1,11500
Your data has been scaled, so your predictions are not in the original range of the TOTAL variable. You can try training your model without scaling the data; the results are still quite good.
I would also recommend fitting the scaler only on the training set, to avoid leaking information about the whole dataset into the test set. And you need to keep the fitted scaler around so you can reverse your predictions back into the original range.
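A minimal sketch of that workflow, assuming xtrain/ytrain come from a split of the raw, unscaled data (StandardScaler stands in for the scale() helper because it can invert the transform):
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler

x_scaler = StandardScaler().fit(xtrain)
y_scaler = StandardScaler().fit(ytrain.reshape(-1, 1))

rfr = RandomForestRegressor()
rfr.fit(x_scaler.transform(xtrain), y_scaler.transform(ytrain.reshape(-1, 1)).ravel())

# predictions come out in scaled units; invert them back into the TOTAL range
pred_scaled = rfr.predict(x_scaler.transform([[20]]))
print(y_scaler.inverse_transform(pred_scaled.reshape(-1, 1)))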
I'm working with the fetch_kddcup99 dataset, and using pandas I've converted the original dataset into something like this, with dummy variables:
(image: preview of the resulting DataFrame)
Note that after dropping duplicates, the final dataframe contains only 149 observations.
Then I start the feature-engineering phase by one-hot encoding protocol_type, which is a string categorical variable, and transforming y to 0/1.
X = pd_data.drop(target, axis=1)
y = pd_data[target]
y=y.astype('int')
protocol_type = [['tcp','udp','icmp']]
col_transformer = ColumnTransformer([
("encoder_tipo1", OneHotEncoder(categories=protocol_type, handle_unknown='ignore'),
['protocol_type']),
])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=89)
Finally, I proceed to model evaluation, which gives me the following result:
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DTC', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RFC', RandomForestClassifier()))
models.append(('SVM', SVC()))
#selector = SelectFromModel(estimator=model)
scaler = option2
selector = SelectKBest(score_func=f_classif,k = 3)
results = []
for name, model in models:
    pipeline = make_pipeline(col_transformer, scaler, selector)
    #print(pipeline)
    X_train_selected = pipeline.fit_transform(X_train, y_train)
    #print(X_train_selected)
    X_test_selected = pipeline.fit_transform(X_test, y_test)
    modelo = model.fit(X_train_selected, y_train)
    kf = KFold(n_splits=10, shuffle=True, random_state=89)
    cv_results = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='accuracy')
    results.append(cv_results)
    print(name, cv_results)
plt.boxplot(results)
plt.show()
(image: boxplots of the cross-validation results)
My question is: why are all the models the same? Could it be due to the small number of rows in the dataframe, or am I doing something wrong?
You have 149 rows, of which 80% go into the training set, so 119. You then do 10-fold cross-validation, so each test fold has about 12 samples, which means each fold admits only 13 possible accuracy scores; even if the classifiers predict some samples a little differently, they can still land on the same accuracy. (The common scores you see (1, 0.88, 0.71) don't line up with the fractions I'd expect, though, so maybe I've missed something.) So yes, it is probably just the small number of rows, compounded by the cross-validation; selecting down to just 3 features probably contributes as well.
One quick thing to check is a continuous score of the models' performance, say log-loss or the Brier score (sketched below).
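A hedged sketch, swapping the scoring metric inside the loop above (log-loss needs predict_proba, so the SVC would need probability=True):
cv_logloss = cross_val_score(modelo, X_train_selected, y_train, cv=kf, scoring='neg_log_loss')
print(name, -cv_logloss.mean())  # sklearn negates log-loss; flip the sign for readability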
(And GaussianNB is probably the wrong Naive Bayes variant for your data, since it contains so many binary features.)
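If most features are binary after the encoding, BernoulliNB may be the more natural choice; a one-line swap in the models list (a suggestion, not tested here):
from sklearn.naive_bayes import BernoulliNB
models.append(('BNB', BernoulliNB()))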
I'm trying to train a decision tree classifier using Python. I'm using MinMaxScaler() to scale the data and f1_score as my evaluation metric. The strange thing is that my model gives different results, in a pattern, on each run.
data in my code is a (2000, 7) pandas.DataFrame with 6 feature columns, the last column being the target value. Columns 1, 3, and 5 contain categorical data.
The following code is what I did to preprocess and format my data:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
# Data Preprocessing Step
# =============================================================================
data = pd.read_csv("./data/train.csv")
X = data.iloc[:, :-1]
y = data.iloc[:, 6]
# Choose which columns are categorical data, and convert them to numeric data.
labelenc = LabelEncoder()
categorical_data = list(data.select_dtypes(include='object').columns)
for i in range(len(categorical_data)):
    X[categorical_data[i]] = labelenc.fit_transform(X[categorical_data[i]])
# Convert categorical numeric data to one-of-K data, and change y from Series to ndarray.
onehotenc = OneHotEncoder()
X = onehotenc.fit_transform(X).toarray()
y = y.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
min_max_scaler = MinMaxScaler()
X_train_scaled = min_max_scaler.fit_transform(X_train)
X_val_scaled = min_max_scaler.fit_transform(X_val)
The next code is for the actual decision tree model training:
dectree = DecisionTreeClassifier(class_weight='balanced')
dectree = dectree.fit(X_train_scaled, y_train)
predictions = dectree.predict(X_val_scaled)
score = f1_score(y_val, predictions, average='macro')
print("Score is = {}".format(score))
The output that I get (i.e. the score) varies, but in a pattern: for example, it fluctuates within the range of 0.39 to 0.42.
On some iterations, I even get the UndefinedMetricWarning, which claims "F-score is ill-defined and being set to 0.0 in labels with no predicted samples."
I'm familiar with what the UndefinedMetricWarning means, after doing some searching on this community and Google. I guess my two questions are:
Why does my output vary on each iteration? Is there something happening in the preprocessing stage that I'm not aware of?
I've also tried to use the F-score with other data splits, but I always get the warning. Is this unavoidable?
Thank you.
train_test_split divides the data into train and test sets randomly, so on every run you train the model on different training data and test it on different test data; this gives you a range of F-scores depending on how well the model happens to be trained each time.
To reproduce the result on each run, use the random_state parameter. It fixes the state of the random number generator, so the same "random" split is generated every time; the value itself can be any number.
#train test split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=13)
#Decision tree model
dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
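If you want to see the spread of the score rather than a single number, a small cross-validation sketch (macro F1, matching the scoring above; X and y are the encoded arrays from your code):
from sklearn.model_selection import cross_val_score

dectree = DecisionTreeClassifier(class_weight='balanced', random_state=2018)
scores = cross_val_score(dectree, X, y, cv=5, scoring='f1_macro')
print(scores.mean(), scores.std())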
I have a video games dataset with many categorical columns.
I binarized all of these columns.
Now I want to predict a column (called Rating) with logistic regression, but this column has actually been binarized into four columns (Rating_Everyone, Rating_Everyone10+, Rating_Teen and Rating_Mature).
So I applied logistic regression four times; here is my code:
df2 = pd.read_csv('../MQPI/docs/Video_Games_Sales_as_at_22_Dec_2016.csv', encoding="utf-8")
y = df2['Rating_Everyone'].values
df2 = df2.drop(['Rating_Everyone'], axis=1)
df2 = df2.drop(['Rating_Everyone10'], axis=1)
df2 = df2.drop(['Rating_Teen'], axis=1)
df2 = df2.drop(['Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
class_weight=None, random_state=None, solver='liblinear', max_iter=100,
multi_class='ovr',
verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)
print("Logistic Regression Rating_Everyone accuracy: ", ris)
And again:
y = df2['Rating_Everyone10'].values
df2 = df2.drop(['Rating_Everyone'], axis=1)
df2 = df2.drop(['Rating_Everyone10'], axis=1)
df2 = df2.drop(['Rating_Teen'], axis=1)
df2 = df2.drop(['Rating_Mature'], axis=1)
X = df2.values
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.20)
log_reg = LogisticRegression(penalty='l1', dual=False, C=1.0, fit_intercept=False, intercept_scaling=1,
class_weight=None, random_state=None, solver='liblinear', max_iter=100,
multi_class='ovr',
verbose=0, warm_start=False, n_jobs=-1)
log_reg.fit(Xtrain, ytrain)
y_val_l = log_reg.predict(Xtest)
ris = accuracy_score(ytest, y_val_l)
print("Logistic Regression Rating_Everyone accuracy: ", ris)
And so on for Rating_Teen and Rating_Mature.
Can you tell me how to merge these four results into one, or how I can do this multiclass logistic regression problem better?
The LogisticRegression model can inherently handle multiclass problems:
Below is a summary of the classifiers supported by scikit-learn
grouped by strategy; you don’t need the meta-estimators in this class
if you’re using one of these, unless you want custom multiclass
behavior: Inherently multiclass: Naive Bayes, LDA and QDA, Decision
Trees, Random Forests, Nearest Neighbors, setting
multi_class='multinomial' in sklearn.linear_model.LogisticRegression.
As a basic model, without class weighting (which you may need, since the samples may not be balanced across the ratings), set multi_class='multinomial' and change the solver to 'lbfgs' or one of the other solvers that support multiclass problems:
For multiclass problems, only ‘newton-cg’, ‘sag’ and ‘lbfgs’ handle
multinomial loss; ‘liblinear’ is limited to one-versus-rest schemes
So you don't have to split your dataset up the way you have. Instead, provide the original ratings column as the labels.
Here is a minimal example:
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(10, 10)
y = np.random.randint(1, 4, size=10)  # 3 classes, simulating ratings
lg = LogisticRegression(multi_class='multinomial', solver='lbfgs')
lg.fit(X, y)
lg.predict(X)
Edit: responding to a comment.
tl;dr: I expect the model to learn that interaction on its own; if not, you might encode that information as a feature. So there is no obvious need to binarize your classes.
The way I understand it, you have features of a movie and the MPAA rating of the movie as the label (which you're trying to predict). This is then a multiclass problem, which you can start modeling with logistic regression (this you knew); that is the model I proposed above.
Now, you recognized that there is an implicit distance between the classes. The way I would use this information is as a feature for the model; however, I'd first be inclined to see if the model learns it on its own.
I'm hoping to output a clean dataframe that shows the model name, the parameters used in the model, and the resulting scoring metrics. It would be even better if there were a smarter way to iterate through the metric functions (given their varying parameters). (Example picture of what I'm aiming for.)
Here's what I have so far:
def train_predict_score(clf, X_train, y_train, X_test, y_test):
    clf = clf.fit(X_train, y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_test = clf.predict(X_test)
    result = []
    result.append(roc_auc_score(y_train, y_pred_train))
    result.append(roc_auc_score(y_test, y_pred_test))
    result.append(cohen_kappa_score(y_train, y_pred_train))
    result.append(cohen_kappa_score(y_test, y_pred_test))
    result.append(f1_score(y_train, y_pred_train, pos_label=1))
    result.append(f1_score(y_test, y_pred_test, pos_label=1))
    result.append(precision_score(y_train, y_pred_train, pos_label=1))
    result.append(precision_score(y_test, y_pred_test, pos_label=1))
    result.append(recall_score(y_train, y_pred_train, pos_label=1))
    result.append(recall_score(y_test, y_pred_test, pos_label=1))
    return result
# Initialize default models
clf1 = LogisticRegression(random_state=0)
clf2 = DecisionTreeClassifier(random_state=0)
clf3 = RandomForestClassifier(random_state=0)
clf4 = GradientBoostingClassifier(random_state=0)
results = []
# Build initial models
for clf in [clf1, clf2, clf3, clf4]:
    result = []
    result.append(clf)  # name and parameters - how can I show all info? it gets truncated
    result.append(train_predict_score(clf, X_train, y_train, X_test, y_test))  # how to parse this out into individual columns?
    results.append(result)
results = pd.DataFrame(results, columns=['clf', 'auc_train', 'auc_test', 'f1_train', 'f1_test', 'prec_train',
                                         'prec_test', 'recall_train', 'recall_test'])
results
Iterating through functions
Because functions are objects, you can put them in a list and simply iterate over it. For example:
def add1(x):
    return x + 1

def sub1(x):
    return x - 1

for func in [add1, sub1]:
    print(func(10))
yields
11
9
Getting model name and parameters
As far as I understand, you want to store the name of a model (e.g. LogisticRegression) and its parameters in different columns.
First off, you can get the parameters like this:
clf.get_params()
This returns all model parameters as a dictionary.
For getting the model name, you can take the string representation of the model and split it once on '('. The first element of the resulting list is the name of the model. So
>>>clf
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
becomes
>>>str(clf).split('(',1)[0]
LogisticRegression
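A cleaner alternative (a suggestion; not what the example below uses) is to take the class name directly:
type(clf).__name__  # 'LogisticRegression'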
Example
Here is a small example that should do what you want. It trains 3 different classifiers on sklearn's breast_cancer dataset and returns the roc_auc, f1, precision and recall score on both the train- and test-set as a DataFrame:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
#load and split example dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
#classifiers with default parameters
clf1 = LogisticRegression()
clf2 = RandomForestClassifier()
clf3 = SVC()
clf_list = [clf1, clf2, clf3]
results_list = []
for clf in clf_list:
    clf.fit(X_train, y_train)
    res = {}
    #extract the model name from the object string
    res['Model'] = str(clf).split('(', 1)[0]
    #get parameters via get_params() method
    res['Parameters'] = clf.get_params()
    #for every metric, record performance on train and test set
    for metric_score in [roc_auc_score, f1_score, precision_score, recall_score]:
        metric_name = metric_score.__name__
        res[metric_name + '_train'] = metric_score(y_train, clf.predict(X_train))
        res[metric_name + '_test'] = metric_score(y_test, clf.predict(X_test))
    results_list.append(res)
results_df = pd.DataFrame(results_list)
results_df = pd.DataFrame(results_list)
The resulting DataFrame:
print(results_df.to_string())
Model Parameters f1_test f1_train precision_test precision_train recall_test recall_train roc_au_test roc_au_train
0 LogisticRegression {'fit_intercept': True, 'warm_start': False, '... 0.922384 0.969697 0.922384 0.966038 0.922384 0.973384 0.922384 0.959085
1 RandomForestClassifier {'criterion': 'gini', 'warm_start': False, 'n_... 0.928137 0.998095 0.928137 1.000000 0.928137 0.996198 0.928137 0.998099
2 SVC {'decision_function_shape': None, 'verbose': F... 0.500000 1.000000 0.500000 1.000000 0.500000 1.000000 0.500000 1.000000
Note: because you mentioned DataFrame contents being truncated in your question: that happens only for display purposes, for example when you print the DataFrame in a console like I did above. When you access the respective cells directly, the content is still there.
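If you do want the full parameter dictionaries visible when printing, you can lift pandas' display truncation (assuming a recent pandas version; older ones use -1 instead of None):
pd.set_option('display.max_colwidth', None)
print(results_df.to_string())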