I have a DataFrame (shown below) that is already one-hot encoded. When I pass it into sklearn models, I keep getting dimension errors. Since MultinomialNB only accepts a 1-d array for y, how do I implement one-hot encoding?
Color  Col B  Col C  Col D  Col E
red        1      0      0      0
green      0      0      1      0
blue       0      1      0      0
green      0      0      1      0
brown      0      0      0      1
This runs fine:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# df is the one-hot encoded dataframe shown above
cv = CountVectorizer(max_features=1500)
X = cv.fit_transform(df['Color']).toarray()
y = df.loc[:, df.columns != 'Color'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
I get an error stating "ValueError: y should be a 1d array" when I run the following:
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
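One way past this particular error (a sketch, assuming the goal is a single categorical target and each row has exactly one 1 among Col B–Col E) is to collapse the one-hot columns back into a 1-d label array with idxmax, since MultinomialNB expects y to be one class label per row:

```python
import pandas as pd

# Reconstruction of the dataframe shown above
df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'green', 'brown'],
    'Col B': [1, 0, 0, 0, 0],
    'Col C': [0, 0, 1, 0, 0],
    'Col D': [0, 1, 0, 1, 0],
    'Col E': [0, 0, 0, 0, 1],
})

# idxmax returns, for each row, the name of the column holding the 1,
# turning the one-hot block back into a single 1-d label vector
y = df.loc[:, df.columns != 'Color'].idxmax(axis=1)
print(y.tolist())  # ['Col B', 'Col D', 'Col C', 'Col D', 'Col E']
```

With y in this 1-d form, classifier.fit(X_train, y_train) no longer trips the "y should be a 1d array" check.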
Dataset file : google drive link
I have a dataset of shape (27884 rows, 8933 columns).
Here's a small preview of the dataset:
user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11
1         1   7   2   3   8   0   4   0   6    0    5
2         7   8   1   2   4   6   5   9  10    3    0
3         0   0   0   0   1   5   2   3   4    0    6
4         1   7   2   3   8   0   5   0   6    0    4
5         0   4   7   0   6   1   5   3   0    0    2
6         1   0   2   3   0   5   4   0   0    6    7
Here the user_iD column represents students, and the columns b1-b11 represent book chapters. The values record the order in which each student studied the chapters (which chapter first, which second, and so on); a 0 means the student did not study that chapter.
This is just a small preview of a big dataset.
There are a total of 27884 users and 8932 chapters, labelled b1-b8932.
Here's the complete dataset shape information
I'm applying PCA, and I'm getting this error:
ValueError: Found array with 0 feature(s) (shape=(22307, 0)) while a minimum of 1 is required.
As I stated, there are 27884 users & 8932 other columns.
What I have tried so far
df3 = pd.read_feather('Bundles.ftr')
X = df3['user_iD']
y = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train= X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
How do I apply PCA to this use case?
Here's how to use PCA to pre-process your data:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

df3 = pd.read_feather('Bundles.ftr')
# Use every column except the ID as a feature; user_iD itself carries no signal
X = df3.loc[:, df3.columns != 'user_iD']
# Splitting X into the training set and testing set
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)
# Performing the preprocessing part
X_train = X_train.values
X_test = X_test.values
# Applying the PCA function on the training and testing sets
print(X_train.shape)
print(X_test.shape)
pca = PCA(n_components=2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
This is what the X_train variable looks like after preprocessing:
array([[-1846.8651992 , 437.17734222],
[-1847.05838019, 437.41158726],
[-1845.67443438, 436.28046074],
...,
[-1847.00651974, 437.20374889],
[ -780.18296423, 116.65908052],
[-1847.09404683, 437.30545959]])
However, I don't think that PCA is the right tool here. There are a few reasons for this:
I think kNN would be easier to interpret.
The way the inputs are encoded mixes ordinal and categorical information, which will make your clustering algorithm work less well.
For example, if one user read a chapter first, and another user doesn't read a chapter at all, that is assigned 1 and 0. In this case, higher number means the user is more interested.
In another case, if one user read a chapter seventh, and another user read it eighth, that is assigned 7 and 8. In this case, a higher number means that the user is less interested.
On top of that, you're saying that the difference between reading something seventh or eighth is the same as the difference as between not reading it at all. To me, if someone didn't read it, that's a much bigger difference than a slight change in reading order.
So I would suggest having two sets of input features: did they read it at all, and if they did, where in their reading did the chapter fall.
The first set of features could be computed like this:
did_read = (X.values >= 1).astype(int)
These features are 1 if read and 0 otherwise.
The second set of features could be computed like this:
X_values = X.values
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)
order_normalized = X_values / max_order
These features are in the range [0, 1] based on whether it was toward the beginning or end of the chapters that they read.
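The two feature sets can then be stacked into a single input matrix, e.g. for the kNN suggested above. A sketch with a small synthetic X standing in for the real Bundles.ftr data:

```python
import numpy as np
import pandas as pd

# Small synthetic stand-in for the real chapter-order dataframe
X = pd.DataFrame({'b1': [1, 7, 0], 'b2': [7, 8, 0], 'b3': [2, 1, 1]})

X_values = X.values
did_read = (X_values >= 1).astype(int)                      # 1 if the chapter was read at all
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)  # initial=1 avoids division by zero
order_normalized = X_values / max_order                     # position within the reading order

features = np.hstack([did_read, order_normalized])          # shape (n_users, 2 * n_chapters)
print(features.shape)  # (3, 6)
```

Each user row now carries both a binary read/not-read signal and a [0, 1] order signal, which distance-based methods can weight separately.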
I am a beginner in Python and currently learning machine learning. I got the error mentioned in the title, so I used np.ravel() to change the shape of y, and also used .reshape() on y_train, as suggested in other Stack Overflow answers. But now, after 30+ minutes, I get no response. I want to check all 9 classifiers (that's why I used a for loop), but I only get output for the first one and then nothing. naujasdf is the final data frame for machine learning, and I want to classify all columns against the 'survival_status numeric' column. I would appreciate any kind of advice.
import seaborn as sbn
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, WhiteKernel as WK
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn import tree
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
import pydotplus
classifiers = [
KNeighborsClassifier(3),
SVC(kernel="linear",random_state=0),
GaussianProcessClassifier(kernel = (1.0 * RBF(1.0)), random_state = 0),
DecisionTreeClassifier(max_depth=5,random_state = 0),
RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1,random_state = 0),
MLPClassifier(alpha=1, max_iter=1000,random_state = 0),
AdaBoostClassifier(random_state=0),
GaussianNB(),
QuadraticDiscriminantAnalysis(),
]
names = [
"KNeighborsClassifier",
"SVC",
"GaussianProcessClassifier",
"DecisionTreeClassifier",
"RandomForestClassifier",
"MLPClassifier",
"AdaBoostClassifier",
"GaussianNB",
"QuadraticDiscriminantAnalysis"
]
X = naujasdf.loc[:, naujasdf.columns != 'survival_status numeric']
y = naujasdf.loc[:, naujasdf.columns == 'survival_status numeric'].values
y = np.ravel(y,order='C')
print(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 0)
h = -1
y_train = y_train.reshape(-1, 1)
for clf in classifiers:
    clf.fit(X_train, y_train)
    y_predikt = clf.predict(X_test)
    clf_score = clf.score(X_test, y_test)
    ntm = confusion_matrix(y_test, y_predikt)
    tikslumas = cross_val_score(estimator=clf, X=X_train, y=y_train, cv=2)
    h = h + 1
    print(names[h])
    print('Test score', clf_score)
    print('Confusion matrix', ntm)
    print('Cross-validation', tikslumas)
    print("Mean accuracy: " + str(tikslumas.mean()))
    print("Standard deviation: " + str(tikslumas.std()))
    if names[h] == "DecisionTreeClassifier":
        tree.plot_tree(clf, rounded=True)
        plt.show()
**And this is what I get:**
[1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0 1 0 0 0 0 1 0 1 0 0 1 1 1 0 0 0 1 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 1 0 1 1 1 1 0 0 0 0 1 0 1 1 0 1 1 1 1 1
 1 0 0 1 0 1 0 0 0 1 1 1 0 0 1 1 0 1 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0
 1 0 0 0 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 0 1 0 1
 1 1 0 0 0 0 0 0 1 1 1 1 0 1]
KNeighborsClassifier
Test score 0.9411764705882353
Confusion matrix [[10 0]
 [ 1 6]]
Cross-validation [0.91780822 0.91666667]
Mean accuracy: 0.9172374429223744
Standard deviation: 0.0005707762557077833
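On the hang itself: the loop is most likely stalled inside the second classifier, SVC(kernel="linear"), which can iterate almost indefinitely on unscaled data (GaussianProcessClassifier is the other usual suspect, since its training cost grows roughly with the cube of the sample count). A sketch, using synthetic data since naujasdf isn't available here, that prints each classifier's name and fit time so the stall becomes visible:

```python
import time
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the real naujasdf data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

classifiers = {
    "KNeighborsClassifier": KNeighborsClassifier(3),
    "DecisionTreeClassifier": DecisionTreeClassifier(max_depth=5, random_state=0),
    "GaussianNB": GaussianNB(),
}
for name, clf in classifiers.items():
    start = time.perf_counter()
    clf.fit(X, y)  # note: y stays 1-d here; no reshape(-1, 1) needed
    print(f"{name}: fitted in {time.perf_counter() - start:.3f}s")
```

Scaling the features with StandardScaler (already imported in the script above) usually fixes the SVC stall, and keeping y 1-d avoids the column-vector warnings that prompted the reshape in the first place.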
I am trying to fit a logistic regression model to a dataset, and while training the data, I am getting the following error :
1 from sklearn.linear_model import LogisticRegression
2 classifier = LogisticRegression()
----> 3 classifier.fit(X_train, y_train)
ValueError: could not convert string to float: 'Cragorn'
The code snippet is as follows:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('predict_death_in_GOT.csv')
data.head(10)
X = data.iloc[:, 0:4]
y = data.iloc[:, 4]
plt.rcParams['figure.figsize'] = (10, 10)
alive = data.loc[y == 1]
not_alive = data.loc[y == 0]
plt.scatter(alive.iloc[:,0], alive.iloc[:,1], s = 10, label = "alive")
plt.scatter(not_alive.iloc[:,0], not_alive.iloc[:,1], s = 10, label = "not alive")
plt.legend()
plt.show()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
print(X_train, y_train)
print(X_test, y_test)
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(X_train, y_train)  # <-- this is the line that raises the error
The dataset looks like :
Sr No name houseID titleID isAlive
0 0 Viserys II Targaryen 0 0 0
1 1 Tommen Baratheon 0 0 1
2 2 Viserys I Targaryen 0 0 0
3 3 Will (orphan) 0 0 1
4 4 Will (squire) 0 0 1
5 5 Willam 0 0 1
6 6 Willow Witch-eye 0 0 0
7 7 Woth 0 0 0
8 8 Wyl the Whittler 0 0 1
9 9 Wun Weg Wun Dar Wun 0 0 1
I looked over the web but couldn't find any relevant solutions. Please help me with this error.
Thank you!
You cannot pass strings to the fit() method; the name column needs to be transformed into floats first. A good tool for this is sklearn.preprocessing.LabelEncoder.
Given the sample of the dataset above, here is a reproducible example of how to perform the label encoding:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
le = preprocessing.LabelEncoder()
data.name = le.fit_transform(data.name)  # data is the dataframe loaded in the question
X = data.iloc[:, 0:4]
y = data.iloc[:, 4]  # isAlive is the fifth column (index 4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
print(classifier.coef_,classifier.intercept_)
resulting model coefficients and intercept:
[[ 0.09253555 0.09253555 -0.15407024 0. ]] [-0.1015314]
Sklearn models only accept numeric inputs. You need to transform string variables into numbers before passing them to the fit method. One way of doing this is to create a series of dummy variables for each column containing strings. Check pandas.get_dummies.
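A sketch of that approach with a tiny made-up frame standing in for the GOT data (note that a free-text name column has nearly one unique value per row, so dummy-encoding it mostly memorizes rows; dropping it is often the better call):

```python
import pandas as pd

# Tiny stand-in for the dataset shown in the question
data = pd.DataFrame({
    'name': ['Tommen Baratheon', 'Woth', 'Willam'],
    'houseID': [0, 0, 0],
    'titleID': [0, 0, 0],
    'isAlive': [1, 0, 1],
})

# One dummy column per unique name; numeric columns pass through unchanged
X = pd.get_dummies(data.drop(columns='isAlive'), columns=['name'])
y = data['isAlive']
print(X.shape)  # (3, 5): houseID, titleID, plus one column per unique name
```

With all-numeric X, LogisticRegression().fit(X, y) no longer raises the string-conversion error.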
I have binary classification problem where I want to calculate the roc_auc of the results. For this purpose, I did it in two different ways using sklearn. My code is as follows.
Code 1:
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
myscore = make_scorer(roc_auc_score, needs_proba=True)
from sklearn.model_selection import cross_validate
my_value = cross_validate(clf, X, y, cv=10, scoring = myscore)
print(np.mean(my_value['test_score'].tolist()))
I get the output as 0.60.
Code 2:
y_score = cross_val_predict(clf, X, y, cv=k_fold, method="predict_proba")
from sklearn.metrics import roc_curve, auc
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(2):
fpr[i], tpr[i], _ = roc_curve(y, y_score[:,i])
roc_auc[i] = auc(fpr[i], tpr[i])
print(roc_auc)
I get the output as {0: 0.41, 1: 0.59}.
I am confused since I get two different scores in the two codes. Please let me know why this difference happens and what is the correct way of doing this.
I am happy to provide more details if needed.
It seems that you used part of my code from another answer, so I thought I'd answer this question as well.
For a binary classification case, you have 2 classes, and one is the positive class.
For example, see here: pos_label is the label of the positive class. When pos_label=None, if y_true is in {-1, 1} or {0, 1}, pos_label is set to 1; otherwise an error will be raised.
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression
import numpy as np
iris = datasets.load_iris()
X = iris.data
y = iris.target
mask = (y!=2)
y = y[mask]
X = X[mask,:]
print(y)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
positive_class = 1
clf = OneVsRestClassifier(LogisticRegression())
y_score = cross_val_predict(clf, X, y, cv=10 , method='predict_proba')
fpr = dict()
tpr = dict()
roc_auc = dict()
fpr[positive_class], tpr[positive_class], _ = roc_curve(y, y_score[:, positive_class])
roc_auc[positive_class] = auc(fpr[positive_class], tpr[positive_class])
print(roc_auc)
{1: 1.0}
and
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate
myscore = make_scorer(roc_auc_score, needs_proba=True)
clf = OneVsRestClassifier(LogisticRegression())
my_value = cross_validate(clf, X, y, cv=10, scoring = myscore)
print(np.mean(my_value['test_score'].tolist()))
1.0
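The two numbers in the original question are two views of one score: the make_scorer(roc_auc_score, needs_proba=True) scorer uses only the positive-class probability column, which corresponds to roc_auc[1], while roc_auc[0] is its complement (0.41 ≈ 1 - 0.59). A minimal sketch of that complement relationship, with made-up scores:

```python
from sklearn.metrics import roc_curve, auc

y = [0, 0, 1, 1]
p1 = [0.1, 0.4, 0.35, 0.8]   # predicted probability of class 1
p0 = [1 - p for p in p1]     # the class-0 column of predict_proba is just 1 - p1

fpr1, tpr1, _ = roc_curve(y, p1)
fpr0, tpr0, _ = roc_curve(y, p0)
print(auc(fpr1, tpr1), auc(fpr0, tpr0))  # 0.75 0.25 -- the two AUCs sum to 1
```

So reporting roc_auc[positive_class] alone, as in the answer above, is the correct way for a binary problem.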
I'm trying to use machine learning to predict whether a person has an income of over or under 50k using this data set. I think the code does not work because the data set contains strings. When I use a shorter data set containing 4 instead of 14 variables (all numeric), the code works. What am I doing wrong?
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
dataset = pandas.read_csv(url, names=names)
# Split dataset
array = dataset.values
X = array[:,0:14]
Y = array[:,14]
validation_size = 0.20  # not defined in the original snippet
seed = 7                # not defined in the original snippet
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
Let's take a really simple example from your dataset.
Looking at dataset['income'].nunique() (produces 2), we can see you have two classes you're trying to predict. You're on the right track with taking the classification route (although there are different methodological arguments to be made as to whether this problem is better suited for a continuous regression approach, but save that for another day).
Say you want to use age and education to predict whether someone's income is above $50k. Let's try it out:
X = dataset[['age', 'education']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
This Exception should be raised:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 891, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
estimator=estimator)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 567, in check_array
array = array.astype(np.float64)
ValueError: could not convert string to float: ' Bachelors'
What if we tried with just age?
X = dataset[['age']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey! That works! So there's something unique about the education column that we need to account for. You've noticed this above: scikit-learn (like many other ML packages, though not all) doesn't operate on strings. So we need to do something like "one-hot" encoding: creating k columns, where k is the number of unique values in your categorical, "string" column (again, there's a methodological question as to whether you include k-1 or k features; read up on the dummy-variable trap for more on that), where each column is composed of 1s and 0s: a 1 if the case/observation in a particular row has that kth attribute, a 0 if not.
There are many ways of doing this in Python:
pandas.get_dummies:
dummies = pandas.get_dummies(dataset['education'], prefix='education')
Here's a sample of dummies:
>>> dummies
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th ... education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0 0 0 0 0 ... 0 0 0 0 0
1 0 0 0 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 1 0 0 0 0
3 0 1 0 0 0 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
5 0 0 0 0 0 ... 0 1 0 0 0
6 0 0 0 0 0 ... 0 0 0 0 0
7 0 0 0 0 0 ... 1 0 0 0 0
8 0 0 0 0 0 ... 0 1 0 0 0
9 0 0 0 0 0 ... 0 0 0 0 0
Now we can use this education feature like so:
dataset = dataset.join(dummies)
X = dataset[['age'] + list(dummies)]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey, that worked!
Hopefully that helps to answer your question. There are tons of ways to perform one-hot encoding (e.g. through a list comprehension or sklearn.preprocessing.OneHotEncoder). I'd suggest you read more on "feature engineering" before progressing with your model-building - feature engineering is one of the most important parts of the ML process.
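For completeness, a sketch of the sklearn.preprocessing.OneHotEncoder route mentioned above, using a tiny made-up frame (note the leading spaces in the adult dataset's category strings):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'education': [' Bachelors', ' HS-grad', ' Bachelors']})

# handle_unknown='ignore' encodes categories unseen at fit time as all-zeros
enc = OneHotEncoder(handle_unknown='ignore')
onehot = enc.fit_transform(df[['education']]).toarray()
print(onehot)
# [[1. 0.]
#  [0. 1.]
#  [1. 0.]]
```

Unlike get_dummies, a fitted OneHotEncoder remembers its categories, so you can apply the exact same encoding to validation data via enc.transform.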
For columns that contain categorical strings, you should transform them to one-hot encoding using:
dataset = pd.get_dummies(dataset, columns=[my_column1, my_column2, ...])
where my_column1, my_column2, ... are the names of the columns containing categorical strings. Be careful: this changes the number of columns in your dataframe, so change your split of X accordingly.
Here is the link to the documentation.
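A small sketch of what that column change looks like, with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 32], 'workclass': [' Private', ' State-gov']})

# The listed column is replaced by one 0/1 column per unique value
encoded = pd.get_dummies(df, columns=['workclass'])
print(list(encoded.columns))
# ['age', 'workclass_ Private', 'workclass_ State-gov']
```

Here a 2-column frame becomes 3 columns, which is why the X slice has to be adjusted after encoding.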