I have a binary classification problem where I want to calculate the roc_auc of the results. For this purpose, I computed it in two different ways using sklearn. My code is as follows.
Code 1:
import numpy as np
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
myscore = make_scorer(roc_auc_score, needs_proba=True)
from sklearn.model_selection import cross_validate
my_value = cross_validate(clf, X, y, cv=10, scoring=myscore)
print(np.mean(my_value['test_score'].tolist()))
I get the output as 0.60.
Code 2:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_curve, auc

y_score = cross_val_predict(clf, X, y, cv=10, method="predict_proba")
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(2):
    fpr[i], tpr[i], _ = roc_curve(y, y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
print(roc_auc)
I get the output as {0: 0.41, 1: 0.59}.
I am confused since I get two different scores from the two code snippets. Please let me know why this difference happens and what the correct way of doing this is.
I am happy to provide more details if needed.
It seems that you used part of my code from another answer, so I thought I would also answer this question.
For a binary classification case, you have 2 classes and one is the positive class.
For example, see here: pos_label is the label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1; otherwise an error will be raised.
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
mask = (y!=2)
y = y[mask]
X = X[mask,:]
print(y)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
positive_class = 1
clf = OneVsRestClassifier(LogisticRegression())
y_score = cross_val_predict(clf, X, y, cv=10, method='predict_proba')
fpr = dict()
tpr = dict()
roc_auc = dict()
fpr[positive_class], tpr[positive_class], _ = roc_curve(y, y_score[:, positive_class])
roc_auc[positive_class] = auc(fpr[positive_class], tpr[positive_class])
print(roc_auc)
{1: 1.0}
and
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate
myscore = make_scorer(roc_auc_score, needs_proba=True)
clf = OneVsRestClassifier(LogisticRegression())
my_value = cross_validate(clf, X, y, cv=10, scoring=myscore)
print(np.mean(my_value['test_score'].tolist()))
1.0
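Tying this back to your original two snippets: the scorer in Code 1 uses the probability of the positive class (column 1 of predict_proba), which corresponds to the {1: 0.59} entry of Code 2, while the {0: 0.41} entry is simply 1 - 0.59, because column 0 holds one minus the positive-class probability. The small remaining gap (0.60 vs 0.59) comes from cross_validate averaging one AUC per fold, whereas cross_val_predict pools all out-of-fold predictions into a single curve. A minimal sketch, assuming the clf, X and y from your question:

from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Pool the out-of-fold probabilities and score only the positive class (column 1)
y_score = cross_val_predict(clf, X, y, cv=10, method="predict_proba")
print(roc_auc_score(y, y_score[:, 1]))   # pooled AUC over all folds (~0.59 in your case)
print(roc_auc_score(y, y_score[:, 0]))   # simply 1 minus the line above; not a separate score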
I am trying to work on local explainability using a LIME chart. Before building the model, I encode some of the categorical variables.
Sample Data and code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
df = pd.DataFrame({'customer_id': np.arange(1, 21),
                   'gender': np.random.choice(['male', 'female'], 20),
                   'age': np.random.randint(19, 50, 20),
                   'salary': np.random.randint(20000, 95000, 20),
                   'purchased': np.random.choice([0, 1], 20, p=[.8, .2])})
Preprocessing:
df['gender'] = df['gender'].map({'female' : 0, 'male' : 1})
df['age'] = df['age'].map(lambda x : 'young' if x<=35 else 'middle aged')
df['age'] = df['age'].map({'young' : 0, 'middle aged' : 1})
bins = [0, df['salary'].quantile(q=.33),df['salary'].quantile(q=.66),df['salary'].quantile(q=1)+1]
labels = ['low salary', 'medium salary', 'high salary']
df['salary'] = pd.cut(df['salary'], bins = bins, labels=labels)
from sklearn import preprocessing
l_encoder={}
label_encoder = preprocessing.LabelEncoder()
df['salary']= label_encoder.fit_transform(df['salary'])
df
customer_id gender age salary purchased
0 1 0 0 1 0
1 2 0 0 0 0
2 3 0 1 2 0
3 4 1 0 0 0
4 5 1 1 2 0
5 6 0 1 1 0
6 7 1 0 2 0
7 8 1 1 0 0
8 9 1 1 1 0
9 10 1 0 0 0
10 11 0 1 0 0
11 12 0 0 1 0
12 13 1 1 1 0
13 14 1 1 1 0
14 15 1 1 2 1
15 16 1 1 0 0
16 17 1 1 1 0
17 18 0 0 0 0
18 19 0 0 2 0
19 20 0 0 2 0
# input
x = df.iloc[:, :-1]
# output
y = df.iloc[:, 4]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.20, random_state = 0)
Separating the customer_id column:
X_train_cust = X_train.pop('customer_id')
X_test_cust = X_test.pop('customer_id')
Fitting a logistic regression model:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train, y_train)
Building a LIME chart:
import lime
import lime.lime_tabular
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                                   feature_names=X_train.columns,
                                                   verbose=True, mode='classification')
exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
exp.as_pyplot_figure()
The LIME chart displays the encoded feature values, but I need the original values. For example, if the LIME chart says 0 for gender, I need to display it as female.
Could someone please let me know how to fix it?
You can use:
# Your direct mapping dictionary
dmap = {'gender': {'female': 0, 'male': 1},
        'age': {'young': 0, 'middle aged': 1},
        'salary': {'low salary': 0, 'medium salary': 1, 'high salary': 2}}
# Reverse mapping dictionary (not used here)
rmap = {col: {v: k for k, v in dm.items()} for col, dm in dmap.items()}
# Categorical names, col0->gender, col1->age, col2->salary
cmap = {c: list(d.keys()) for c, d in enumerate(dmap.values())}
# Now use
explainer = lime.lime_tabular.LimeTabularExplainer(np.array(X_train),
                                                   feature_names=X_train.columns,
                                                   categorical_features=[0, 1, 2],  # <- first 3 columns
                                                   categorical_names=cmap,          # <- int to string
                                                   verbose=True, mode='classification')
exp = explainer.explain_instance(X_test.iloc[0], classifier.predict_proba)
exp.as_pyplot_figure()
Read this tutorial.
I want to deploy a multinomial logistic regression model (or a pruned version of it) without relying on a pickle file.
Here's X:
index 2853 1864 2658 11187 2874
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 1
Here's y (categorical):
index a.age
0 >50
1 15-20
2 35-50
Regards
import pandas as pd
import numpy as np
from sklearn import linear_model

logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)
df = pd.DataFrame(logreg.coef_, columns=X.columns, index=['15-20', '35-50', '>50'])
It works
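If the goal is to ship this without a pickle file, one option is to export that coefficient table (plus logreg.intercept_) and re-implement prediction as a plain argmax over the linear scores. This is only a sketch, assuming logreg and X are the objects above; predict_age_band is a made-up helper name for illustration:

import numpy as np

coefs = logreg.coef_             # shape (n_classes, n_features), exportable to CSV
intercepts = logreg.intercept_   # shape (n_classes,)
classes = list(logreg.classes_)  # e.g. ['15-20', '35-50', '>50']

def predict_age_band(x_row):
    # x_row: 1-D feature vector in the same column order as X
    scores = coefs @ np.asarray(x_row, dtype=float) + intercepts
    return classes[int(np.argmax(scores))]

# Example: should agree with logreg.predict(X.iloc[[0]])
print(predict_age_band(X.iloc[0].values))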
I am a beginner in Python and currently learning machine learning. The problem is that I got the error mentioned in the title, so I used np.ravel() to change the shape of y and also used .reshape() on y_train, as suggested in other solutions on Stack Overflow, but now, after 30+ minutes, I don't get a response. I want to check all 9 classifiers, which is why I used a for loop, but I only get output for the first one and nothing after that. naujasdf is the final data frame for machine learning, and I want to compare all columns against the 'survival_status numeric' column for classification. I would appreciate any kind of advice.
import seaborn as sbn
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, WhiteKernel as WK
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn import tree
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
import pydotplus
classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", random_state=0),
    GaussianProcessClassifier(kernel=(1.0 * RBF(1.0)), random_state=0),
    DecisionTreeClassifier(max_depth=5, random_state=0),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1, random_state=0),
    MLPClassifier(alpha=1, max_iter=1000, random_state=0),
    AdaBoostClassifier(random_state=0),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
]
names = [
    "KNeighborsClassifier",
    "SVC",
    "GaussianProcessClassifier",
    "DecisionTreeClassifier",
    "RandomForestClassifier",
    "MLPClassifier",
    "AdaBoostClassifier",
    "GaussianNB",
    "QuadraticDiscriminantAnalysis",
]
X = naujasdf.loc[:, naujasdf.columns != 'survival_status numeric']
y = naujasdf.loc[:, naujasdf.columns == 'survival_status numeric'].values
y = np.ravel(y, order='C')
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)
h = -1
y_train = y_train.reshape(-1, 1)
for clf in classifiers:
    clf.fit(X_train, y_train)
    y_predikt = clf.predict(X_test)
    clf_score = clf.score(X_test, y_test)
    ntm = confusion_matrix(y_test, y_predikt)
    tikslumas = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=2)
    h = h + 1
    print(names[h])
    print('Test score', clf_score)
    print('Confusion_matrix', ntm)
    print('Cross-validation', tikslumas)
    print("Mean accuracy: " + str(tikslumas.mean()))
    print("Standard deviation: " + str(tikslumas.std()))
    if names[h] == "DecisionTreeClassifier":
        tree.plot_tree(clf, rounded=True)
        plt.show()
**And this is what I get:**
[1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1
0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 1
1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0
1 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 1
1 1 0 0 0 0 0 0 1 1 1 1 0 1]
KNeighborsClassifier
Test score 0.9411764705882353
Confusion_matrix [[10 0]
[ 1 6]]
Cross-validation [0.91780822 0.91666667]
Mean accuracy: 0.9172374429223744
Standard deviation: 0.0005707762557077833
I am trying to fit a logistic regression model to a dataset, and while training I am getting the following error:
1 from sklearn.linear_model import LogisticRegression
2 classifier = LogisticRegression()
----> 3 classifier.fit(X_train, y_train)
ValueError: could not convert string to float: 'Cragorn'
The code snippet is as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('predict_death_in_GOT.csv')
data.head(10)
X = data.iloc[:, 0:4]
y = data.iloc[:, 4]
plt.rcParams['figure.figsize'] = (10, 10)
alive = data.loc[y == 1]
not_alive = data.loc[y == 0]
plt.scatter(alive.iloc[:,0], alive.iloc[:,1], s = 10, label = "alive")
plt.scatter(not_alive.iloc[:,0], not_alive.iloc[:,1], s = 10, label = "not alive")
plt.legend()
plt.show()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
print(X_train, y_train)
print(X_test, y_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)   # <- this line raises the ValueError
The dataset looks like :
Sr No name houseID titleID isAlive
0 0 Viserys II Targaryen 0 0 0
1 1 Tommen Baratheon 0 0 1
2 2 Viserys I Targaryen 0 0 0
3 3 Will (orphan) 0 0 1
4 4 Will (squire) 0 0 1
5 5 Willam 0 0 1
6 6 Willow Witch-eye 0 0 0
7 7 Woth 0 0 0
8 8 Wyl the Whittler 0 0 1
9 9 Wun Weg Wun Dar Wun 0 0 1
I looked over the web but couldn't find any relevant solutions. Please help me with this error.
Thank you!
You cannot pass strings to the fit() method.
The name column needs to be transformed into numeric values.
A good method is to use sklearn.preprocessing.LabelEncoder.
Given the sample of the dataset above, here is a reproducible example of how to perform label encoding:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
le = preprocessing.LabelEncoder()
data.name = le.fit_transform(data.name)
X = data.iloc[:, 0:4]
y = data.iloc[:, 4]  # isAlive column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(classifier.coef_,classifier.intercept_)
resulting model coefficients and intercept:
[[ 0.09253555 0.09253555 -0.15407024 0. ]] [-0.1015314]
scikit-learn models only accept numeric features, so you need to transform your string variables before passing them to the fit method. One way of doing this is by creating a series of dummy variables for each column containing strings. Check pandas.get_dummies.
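For instance, a rough sketch assuming data is the DataFrame shown in the question and name is the only string column:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# One-hot encode the string column; the numeric columns are left untouched
X = pd.get_dummies(data.drop(columns=['isAlive']), columns=['name'])
y = data['isAlive']

classifier = LogisticRegression()
classifier.fit(X, y)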
I'm trying to use machine learning to guess whether a person has an income over or under 50k using this data set. I think the code does not work because the data set contains strings. When I use a shorter data set containing 4 instead of 14 variables (all numeric), the code works. What am I doing wrong?
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
dataset = pandas.read_csv(url, names=names)
# Split dataset
array = dataset.values
X = array[:,0:14]
Y = array[:,14]
validation_size = 0.20  # assumed value; not defined in the original snippet
seed = 7                # assumed value; not defined in the original snippet
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
Let's take a really simple example from your dataset.
Looking at dataset['income'].nunique() (which returns 2), we can see you have two classes you're trying to predict. You're on the right track with taking the classification route (there are methodological arguments as to whether this problem is better suited to a continuous regression approach, but save that for another day).
Say you want to use age and education to predict whether someone's income is above $50k. Let's try it out:
X = dataset[['age', 'education']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
This Exception should be raised:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 891, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
estimator=estimator)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 567, in check_array
array = array.astype(np.float64)
ValueError: could not convert string to float: ' Bachelors'
What if we tried with just age?
X = dataset[['age']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey! That works! So there's something unique about the education column that we need to account for. You've noticed this above: scikit-learn (and many other ML packages, though not all) doesn't operate on strings. So we need to do something like "one-hot" encoding: creating k columns, where k is the number of unique values in your categorical "string" column (there's a methodological question as to whether you include k-1 or k features; read up on the dummy-variable trap for more on that). Each column is composed of 1s and 0s: a 1 if the observation in a particular row has that kth attribute, a 0 if not.
There are many ways of doing this in Python:
pandas.get_dummies:
dummies = pandas.get_dummies(dataset['education'], prefix='education')
Here's a sample of dummies:
>>> dummies
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th ... education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0 0 0 0 0 ... 0 0 0 0 0
1 0 0 0 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 1 0 0 0 0
3 0 1 0 0 0 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
5 0 0 0 0 0 ... 0 1 0 0 0
6 0 0 0 0 0 ... 0 0 0 0 0
7 0 0 0 0 0 ... 1 0 0 0 0
8 0 0 0 0 0 ... 0 1 0 0 0
9 0 0 0 0 0 ... 0 0 0 0 0
Now we can use this education feature like so:
dataset = dataset.join(dummies)
X = dataset[['age'] + list(dummies)]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey, that worked!
Hopefully that helps to answer your question. There are tons of ways to perform one-hot encoding (e.g. through a list comprehension or sklearn.preprocessing.OneHotEncoder). I'd suggest you read more on "feature engineering" before progressing with your model-building - feature engineering is one of the most important parts of the ML process.
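For reference, here's a rough sketch of the OneHotEncoder alternative mentioned above, assuming scikit-learn >= 0.20 (where OneHotEncoder accepts string categories directly; on very recent versions the argument is sparse_output instead of sparse):

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(sparse=False)   # dense 0/1 output
education_ohe = enc.fit_transform(dataset[['education']])
print(enc.categories_)     # the learned category labels
print(education_ohe[:5])   # first five one-hot rows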
For columns that contain categorical strings, you should transform them with one-hot encoding using the function:
dataset = pd.get_dummies(dataset, columns=['my_column1', 'my_column2', ...])
Where my_column1, my_column2, ... are the column names containing the categorical strings. Be careful: it changes the number of columns you have in your dataframe, so change your split of X accordingly.
Here is the link to the documentation.
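As a rough sketch of that advice applied to the dataset above (the categorical column names are taken from the names list in the question; treat the exact list as illustrative):

import pandas

categorical_cols = ['workclass', 'education', 'marital-status', 'occupation',
                    'relationship', 'race', 'sex', 'native-country']
encoded = pandas.get_dummies(dataset, columns=categorical_cols)

# The column count changes after encoding, so select X and Y by name instead of position
X = encoded.drop(columns=['income'])
Y = encoded['income']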