I am trying to fit a logistic regression model to a dataset, and while training the data, I am getting the following error :
1 from sklearn.linear_model import LogisticRegression
2 classifier = LogisticRegression()
----> 3 classifier.fit(X_train, y_train)
ValueError: could not convert string to float: 'Cragorn'
The code snippet is as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('predict_death_in_GOT.csv')
data.head(10)
X = data.iloc[:, 0:4]
y = data.iloc[:, 4]
plt.rcParams['figure.figsize'] = (10, 10)
alive = data.loc[y == 1]
not_alive = data.loc[y == 0]
plt.scatter(alive.iloc[:,0], alive.iloc[:,1], s = 10, label = "alive")
plt.scatter(not_alive.iloc[:,0], not_alive.iloc[:,1], s = 10, label = "not alive")
plt.legend()
plt.show()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
print(X_train, y_train)
print(X_test, y_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
**classifier.fit(X_train, y_train)**
The dataset looks like :
Sr No name houseID titleID isAlive
0 0 Viserys II Targaryen 0 0 0
1 1 Tommen Baratheon 0 0 1
2 2 Viserys I Targaryen 0 0 0
3 3 Will (orphan) 0 0 1
4 4 Will (squire) 0 0 1
5 5 Willam 0 0 1
6 6 Willow Witch-eye 0 0 0
7 7 Woth 0 0 0
8 8 Wyl the Whittler 0 0 1
9 9 Wun Weg Wun Dar Wun 0 0 1
I looked over the web but couldn't find any relevant solutions.Please help me with this error.
Thank you!
You cannot pass string to fit() method.
Column name needs to be transformed into float.
Good method is to use: sklearn.preprocessing.LabelEncoder
Given above sample of dataset, here is reproducible example how to perform LabelEncoding:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
le = preprocessing.LabelEncoder()
data.name = le.fit_transform(data.name)
X = data.iloc[:, 0:4]
y = data.iloc[:, 5]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(classifier.coef_,classifier.intercept_)
resulting model coefficients and intercept:
[[ 0.09253555 0.09253555 -0.15407024 0. ]] [-0.1015314]
Sklearn models only accept floats as arguments. You need to transform your variables into floats before passing them to the fit method. One way of doing this is by creating a series of dummy variables for each column containing strings. Check: pandas.get_dummies
Related
I have a DataFrame (as shown below) that is already set up with one-hot encoding. When I try to pass it into sklearn models, I keep getting dimension errors. If MultinomialNB only accept 1d arrays, how do I implement one-hot encoding?
Color
Col B
Col C
Col D
Col E
red
1
0
0
0
green
0
0
1
0
blue
0
1
0
0
green
0
0
1
0
brown
0
0
0
1
This runs fine:
cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(df['Color']).toarray()
y = df.loc[:, df.columns != 'Color'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
I get an error stating "ValueError: y should be a 1d array" when I run the following:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
I want to deploy multinomial logistic regression (or pruned version of this) that is easy to deploy without pickle file
here's the X
index 2853 1864 2658 11187 2874
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 1
here's the y (categorical)
ndex a.age
0 >50
1 15-20
2 35-50
Regards
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
h = .02
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, y)
df = pd.DataFrame(logreg.coef_, columns = X.columns, index = ['15-20' , '35-50' , '>50'])
It works
I have binary classification problem where I want to calculate the roc_auc of the results. For this purpose, I did it in two different ways using sklearn. My code is as follows.
Code 1:
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
myscore = make_scorer(roc_auc_score, needs_proba=True)
from sklearn.model_selection import cross_validate
my_value = cross_validate(clf, X, y, cv=10, scoring = myscore)
print(np.mean(my_value['test_score'].tolist()))
I get the output as 0.60.
Code 2:
y_score = cross_val_predict(clf, X, y, cv=k_fold, method="predict_proba")
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(2):
fpr[i], tpr[i], _ = roc_curve(y, y_score[:,i])
roc_auc[i] = auc(fpr[i], tpr[i])
print(roc_auc)
I get the output as {0: 0.41, 1: 0.59}.
I am confused since I get two different scores in the two codes. Please let me know why this difference happens and what is the correct way of doing this.
I am happy to provide more details if needed.
It seems that you used a part of my code from another answer, so I though to also answer this question.
For a binary classification case, you have 2 classes and one is the positive class.
For example see here. pos_label is the label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1, otherwise an error will be raised..
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
mask = (y!=2)
y = y[mask]
X = X[mask,:]
print(y)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
positive_class = 1
clf = OneVsRestClassifier(LogisticRegression())
y_score = cross_val_predict(clf, X, y, cv=10 , method='predict_proba')
fpr = dict()
tpr = dict()
roc_auc = dict()
fpr[positive_class], tpr[positive_class], _ = roc_curve(y, y_score[:, positive_class])
roc_auc[positive_class] = auc(fpr[positive_class], tpr[positive_class])
print(roc_auc)
{1: 1.0}
and
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate
myscore = make_scorer(roc_auc_score, needs_proba=True)
clf = OneVsRestClassifier(LogisticRegression())
my_value = cross_validate(clf, X, y, cv=10, scoring = myscore)
print(np.mean(my_value['test_score'].tolist()))
1.0
I have the following data:
X1 X2 Y
-10 4 0
-10 3 4
-10 2.5 8
-8 3 7
-8 4 8
-8 4.4 9
0 2 9
0 2.3 9.2
0 4 10
0 5 12
I need to create a simple regression model to predict Y given X1 and X2: Y = f(X1,X2).
This is my code:
poly = PolynomialFeatures(degree=2)
X1 = poly.fit_transform(df["X1"].values.reshape(-1,1))
X2 = poly.fit_transform(df["X2"].values.reshape(-1,1))
clf = linear_model.LinearRegression()
clf.fit([X1,X2], df["Y"].values.reshape(-1, 1))
print(clf.coef_)
print(clf.intercept_)
Y_test = clf.predict([X1, X2])
df_test=pd.DataFrame()
df_test["X1"] = df["X1"]
df_test["Y"] = df["Y"]
df_test["Y_PRED"] = Y_test
df_test.plot(x="X1",y=["Y","Y_PRED"], figsize=(10,5), grid=True)
plt.show()
But it fails at line clf.fit([X1,X2], df["Y"].values.reshape(-1, 1)):
ValueError: Found array with dim 3. Estimator expected <= 2
It looks like the model cannot work with 2 input parameters X1 and X2. How should I change the code to fix it?
Well, your mistake resides in the way you append your feature dataframes. You should instead concatenate them, for instance using pandas:
import pandas as pd
X12_p = pd.concat([pd.DataFrame(X1), pd.DataFrame(X2)], axis=1)
Or the same using numpy:
import numpy as np
X12_p = np.concatenate([X1, X2], axis=1)
Your final snippet should look like:
# Fit
Y = df["Y"].values.reshape(-1,1)
X12_p = pd.concat([pd.DataFrame(X1), pd.DataFrame(X2)], axis=1)
clf.fit(X12_p, Y)
# Predict
Y_test = clf.predict(X12_p)
You can as well evaluate some performance metrics such as rmse using:
from sklearn.metrics import mean_squared_error
print('rmse = {0:.5f}'.format(mean_squared_error(Y, Y_test)))
Please also note that you can exclude the bias term from polynomial features by changing the default param:
PolynomialFeatures(degree=2, include_bias=False)
Hope this helps.
I'm trying to use Machine learning to guess if a person has an income of over or under 50k using this data set. I think the code does not work because the data set contains strings. When I use a shorter data set containing 4 instead of 14 variables(and with numbers) the code works. What am I doing wrong?
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
dataset = pandas.read_csv(url, names=names)
# Split dataset
array = dataset.values
X = array[:,0:14]
Y = array[:,14]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
Let's take a really simple example from your dataset.
Looking at dataset['income'].nunique() (produces 2), we can see you have two classes you're trying to predict. You're on the right track with taking the classification route (although there are different methodological arguments to be made as to whether this problem is better suited for a continuous regression approach, but save that for another day).
Say you want to use age and education to predict whether someone's income is above $50k. Let's try it out:
X = dataset[['age', 'education']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
This Exception should be raised:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 891, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
estimator=estimator)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 567, in check_array
array = array.astype(np.float64)
ValueError: could not convert string to float: ' Bachelors'
What if we tried with just age?
X = dataset[['age']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey! That works! So there's something unique about the education column that we need to account for. You've noticed this above - scikit-learn (and many other ML packages - though not all) don't operate off of strings. So we need to do something like "one-hot" encoding - creating k columns, where k represents the number of unique values in your categorical, "string" column (again, there's a methodological question as to whether you include k-1 or k features, but read up on the dummy-variable trap for more info to that end), where each column is composed of 1s and 0s - a 1 if the case/observation in a particular row has that kth attribute, a 0 if not.
There are many ways of doing this in Python:
pandas.get_dummies:
dummies = pandas.get_dummies(dataset['education'], prefix='education')
Here's a sample of dummies:
>>> dummies
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th ... education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0 0 0 0 0 ... 0 0 0 0 0
1 0 0 0 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 1 0 0 0 0
3 0 1 0 0 0 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
5 0 0 0 0 0 ... 0 1 0 0 0
6 0 0 0 0 0 ... 0 0 0 0 0
7 0 0 0 0 0 ... 1 0 0 0 0
8 0 0 0 0 0 ... 0 1 0 0 0
9 0 0 0 0 0 ... 0 0 0 0 0
Now we can use this education feature like so:
dataset = dataset.join(dummies)
X = dataset[['age'] + list(dummies)]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey, that worked!
Hopefully that helps to answer your question. There are tons of ways to perform one-hot encoding (e.g. through a list comprehension or sklearn.preprocessing.OneHotEncoder). I'd suggest you read more on "feature engineering" before progressing with your model-building - feature engineering is one of the most important parts of the ML process.
For columns that contain categorical strings, you should transform them to one hot encoding using the function:
dataset = pd.get_dummies(dataset, column=[my_column1, my_column2, ...])
Where my_column1, my_colum2, ...are the column names containing the categorical strings. Be careful, it changes the number of columns you have in your dataframe. Thus, change your split of X accordingly.
Here is the link to the documentation.