Losing my target variable when encoding categorical variables - python

I am dealing with a small challenge. I am trying to build a multiclass logistic regression model. Some of my variables are categorical, so I am trying to encode them.
My initial dataset looks like this:
The column I want to predict is action1_preflop; it contains 3 possible classes: "r", "c", "f".
When encoding the categorical features, I end up losing the variable I want to predict, because it gets converted into 3 sub-variables:
action1_preflop_r
action1_preflop_f
action1_preflop_c
Below is the new dataframe after encoding
tiers tiers2_theory ... action1_preflop_f action1_preflop_r
0 7 11 ... 1 0
1 1 7 ... 0 1
2 5 11 ... 1 0
3 1 11 ... 0 1
4 1 7 ... 0 1
... ... ... ... ...
31007 4 11 ... 0 1
31008 1 11 ... 0 1
31009 1 11 ... 0 1
31010 1 11 ... 0 1
31011 2 7 ... 0 1
[31012 rows x 11 columns]
Could you please let me know how I am supposed to deal with those new variables, given that the original variable, before encoding, was the one I wanted to predict?
Thanks for the help
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
df_raw = pd.read_csv('\\Users\\rapha\\Desktop\\Consulting\\Poker\\Tables test\\SB_preflop_a1_prob V1.csv', sep=";")
#Select categorical features only & use binary encoding
feature_cols = ['tiers','tiers2_theory','tiers3_theory','assorties','score','proba_preflop','action1_preflop']
df_raw = df_raw[feature_cols]
cat_features = df_raw.select_dtypes(include=[object])
num_features = df_raw.select_dtypes(exclude=[object])
df = num_features.join(pd.get_dummies(cat_features))
df = df.select_dtypes(exclude = [object])
df_outcome = df.action1_preflop
df_variables = df.drop('action1_preflop',axis=1)
x = df_variables
y = df.action1_preflop
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)
predict_test=lm.predict(x_test)
print(lm.score(x_test, y_test))

You should leave 'action1_preflop' out of the 'cat_features' dataframe and include it in the 'num_features' dataframe:
cat_features = df_raw.select_dtypes(include=[object])
cat_features = cat_features.drop(['action1_preflop'], axis=1)
num_features = df_raw.select_dtypes(exclude=[object])
num_features = pd.concat([num_features, df_raw['action1_preflop']], axis=1)

You can also save some typing, and skip the join as well:
cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove("action1_preflop")
Then you can pass this list of columns to the columns parameter:
df = pd.get_dummies(df_raw, columns=cat_features)
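
Putting the pieces together, here is a minimal sketch of the whole flow: encode only the feature columns and keep action1_preflop intact as the target. The toy values and the 'position' column are assumptions made up for illustration, not the asker's data. The categorical columns are picked with is_numeric_dtype rather than select_dtypes so that the selection does not depend on how pandas stores strings.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the asker's data; the values and the 'position' column are invented.
df_raw = pd.DataFrame({
    "tiers": [7, 1, 5, 1, 1, 4, 1, 2, 3, 6, 2, 5],
    "position": ["sb", "bb", "sb", "bb", "sb", "bb", "sb", "bb", "sb", "bb", "sb", "bb"],
    "action1_preflop": ["f", "r", "f", "r", "c", "r", "c", "r", "f", "c", "r", "f"],
})

target = "action1_preflop"

# Non-numeric feature columns, excluding the target.
cat_features = [
    col for col in df_raw.columns
    if col != target and not pd.api.types.is_numeric_dtype(df_raw[col])
]

# One-hot encode the features only; the target column passes through untouched.
df = pd.get_dummies(df_raw, columns=cat_features)

x = df.drop(columns=[target])
y = df[target]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1)

lm = LogisticRegression(max_iter=1000)
lm.fit(x_train, y_train)
print(lm.score(x_test, y_test))
```

scikit-learn classifiers accept string labels for y directly, so the target does not need to be encoded at all for this model.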

Related

split data such that a categorical value is either in train or test python

I have a data set (df) as follows
Company Col1 Col2 Output
AB 10 20 1
AB 20 22 1
AB 14 12 0
XZ 33 22 1
XZ 43 62 0
I want to train_test_split the data such that if a company is in the test set, it is not in the training set at all. By that I mean that if the first row (AB, 10, 20, 1) is in the test set, the second row (AB, 20, 22, 1) should also be in the test set. I know that passing stratify=df[["Name"]] would do the exact opposite of what I want. Is there a built-in function for this?
P.S. Company column is string

This might be a little verbose and not a generic function, but this approach might work for you:
counts = df.groupby("Company").count()["Output"]
frac = 0.8 # Fraction of the training table, will only be approximated
train_companies = []
i = 0
c = 0
total_count = counts.values.sum()
train_count = total_count * frac
while c < train_count:
    # Add whole companies until roughly `frac` of the rows are covered
    train_companies.append(counts.index[i])
    c = c + counts.values[i]
    i = i + 1
df_train = df[df['Company'].isin(train_companies)]
df_test = df[~df['Company'].isin(train_companies)]
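
As for a built-in: scikit-learn ships GroupShuffleSplit, which keeps every group on one side of the split. A minimal sketch follows; the extra companies (CD, EF, GH) and their values are invented here so that an 80/20 split by group is meaningful.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame shaped like the one in the question; rows beyond AB and XZ are invented.
df = pd.DataFrame({
    "Company": ["AB", "AB", "AB", "XZ", "XZ", "CD", "CD", "EF", "GH", "GH"],
    "Col1": [10, 20, 14, 33, 43, 5, 8, 21, 17, 30],
    "Col2": [20, 22, 12, 22, 62, 9, 4, 11, 25, 3],
    "Output": [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
})

# test_size is the fraction of groups (companies) held out, not of rows.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["Company"]))

df_train = df.iloc[train_idx]
df_test = df.iloc[test_idx]

# No company appears on both sides of the split.
overlap = set(df_train["Company"]) & set(df_test["Company"])
print(overlap)
```

Because the split is made over companies, the row-level train fraction is only approximate, just as with the manual loop.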

Applying PCA Clustering based on user id

Dataset file : google drive link
I have a dataset consisting (27884 ROWS, 8933 Columns)
Here's a little preview of a dataset
user_iD b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11
1 1 7 2 3 8 0 4 0 6 0 5
2 7 8 1 2 4 6 5 9 10 3 0
3 0 0 0 0 1 5 2 3 4 0 6
4 1 7 2 3 8 0 5 0 6 0 4
5 0 4 7 0 6 1 5 3 0 0 2
6 1 0 2 3 0 5 4 0 0 6 7
Here the column user_iD represents students, and columns b1-b11 represent book chapters. The values give the order in which each student studied the chapters: first, second, third, and so on. A 0 entry means that the student did not study that particular chapter.
This is just a small preview of a big dataset.
There are a total of 27884 users and 8932 chapters, labeled b1 to b8932.
Here's the complete dataset shape information
I'm applying PCA and I'm getting this error:
ValueError: Found array with 0 feature(s) (shape=(22307, 0)) while a minimum of 1 is required.
As I stated, there are 27884 users and 8932 other columns.
What I have tried so far:
df3 = pd.read_feather('Bundles.ftr')
X = df3['user_iD']
y = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train= X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
How do I apply PCA to this use case?

Here's how to use PCA to pre-process your data:
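import pandas as pd
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
df3 = pd.read_feather('Bundles.ftr')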
X = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X into the
# Training set and Testing set
X_train, X_test = train_test_split(X, test_size = 0.2, random_state = 0)
# performing preprocessing part
X_train = X_train.values
X_test = X_test.values
# Applying PCA function on training
# and testing set of X component
print(X_train.shape)
print(X_test.shape)
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
This is what the X_train variable looks like after preprocessing:
array([[-1846.8651992 , 437.17734222],
[-1847.05838019, 437.41158726],
[-1845.67443438, 436.28046074],
...,
[-1847.00651974, 437.20374889],
[ -780.18296423, 116.65908052],
[-1847.09404683, 437.30545959]])
However, I don't think that PCA is the right tool here. There are a few reasons for this:
I think kNN would be easier to interpret.
The way that the input weights are encoded is combining ordinal information and categorical information, which will make your clustering algorithm not work as well.
For example, if one user read a chapter first, and another user doesn't read a chapter at all, that is assigned 1 and 0. In this case, higher number means the user is more interested.
In another case, if one user read a chapter seventh, and another user read it eighth, that is assigned 7 and 8. In this case, a higher number means that the user is less interested.
On top of that, you're saying that the difference between reading something seventh and reading it eighth is the same as the difference between reading it first and not reading it at all. To me, not reading a chapter is a much bigger difference than a slight change in reading order.
So I would suggest having two sets of input features: did they read it at all, and if they did, where in their reading did the chapter fall.
The first set of features could be computed like this:
did_read = (X.values >= 1).astype(int)
These features are 1 if read and 0 otherwise.
The second set of features could be computed like this:
X_values = X.values
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)
order_normalized = X_values / max_order
These features are in the range [0, 1] based on whether it was toward the beginning or end of the chapters that they read.
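
To make the two feature sets concrete, here is a small sketch that computes both on the first two rows of the preview (columns b1 to b11) and stacks them side by side. The all-zero third row is an invented example of a user who read nothing, and stacking with np.hstack is one possible way to combine the sets, not the only one.

```python
import numpy as np

# First two rows of the preview (b1..b11), plus an invented all-zero row.
X_values = np.array([
    [1, 7, 2, 3, 8, 0, 4, 0, 6, 0, 5],
    [7, 8, 1, 2, 4, 6, 5, 9, 10, 3, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
])

# 1 if the chapter was read at all, 0 otherwise.
did_read = (X_values >= 1).astype(int)

# Position in the user's own reading sequence, scaled to [0, 1].
# initial=1 keeps the divisor at 1 for users who read nothing.
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)
order_normalized = X_values / max_order

# Both feature sets side by side: 2 * n_chapters columns per user.
features = np.hstack([did_read, order_normalized])
print(features.shape)
```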

Grouping and splitting to avoid leakage

I have a pandas dataframe where the data is arranged as follows:
filename label
0 4456723 0
1 4456723_01 0
2 4456723_02 0
3 ab43912 1
4 ab43912_01 1
5 ab43912_03 1
... ... ...
I want to randomly split this dataframe in training and validation sets. Though if I do so, I will introduce a leakage because the files are images with slight variations but represented with different names, for example ab43912, ab43912_01, ab43912_03, are all same images with some variations.
Is there any efficient way to group these files and then make a split that doesn't introduce leakage?

You can manually select ~80% of the unique file handles randomly.
import numpy as np
import pandas as pd
df = pd.DataFrame({'filename': list('aaabbbcccdddeeefff')})
df['filename'] = df['filename'] + ['', '_01', '_02']*6
# Get the unique handles
files = df.filename.str.split('_').str[0]
# Randomly select ~80%.
m = files.isin(np.random.choice(files.unique(), int(files.nunique()*0.8), replace=False))
# Split
train, test = df.loc[m], df.loc[~m]
In effect we got a 2/3-1/3 split because of the small N
train:
filename
0 a
1 a_01
2 a_02
6 c
7 c_01
8 c_02
12 e
13 e_01
14 e_02
15 f
16 f_01
17 f_02
test:
filename
3 b
4 b_01
5 b_02
9 d
10 d_01
11 d_02

ML on "Adult data set"(dataset) from archive.ics... with KNeighborsClassifier won't run

I'm trying to use Machine learning to guess if a person has an income of over or under 50k using this data set. I think the code does not work because the data set contains strings. When I use a shorter data set containing 4 instead of 14 variables(and with numbers) the code works. What am I doing wrong?
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
dataset = pandas.read_csv(url, names=names)
# Split dataset
array = dataset.values
X = array[:,0:14]
Y = array[:,14]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

Let's take a really simple example from your dataset.
Looking at dataset['income'].nunique() (produces 2), we can see you have two classes you're trying to predict. You're on the right track with taking the classification route (although there are different methodological arguments to be made as to whether this problem is better suited for a continuous regression approach, but save that for another day).
Say you want to use age and education to predict whether someone's income is above $50k. Let's try it out:
X = dataset[['age', 'education']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
This Exception should be raised:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 891, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
estimator=estimator)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 567, in check_array
array = array.astype(np.float64)
ValueError: could not convert string to float: ' Bachelors'
What if we tried with just age?
X = dataset[['age']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey! That works! So there's something unique about the education column that we need to account for. You've noticed this above: scikit-learn (and many other ML packages, though not all) doesn't operate on strings. So we need to do something like "one-hot" encoding: creating k columns, where k is the number of unique values in your categorical "string" column, and each column is composed of 1s and 0s: a 1 if the observation in a particular row has that attribute, a 0 if not. (There's a methodological question as to whether you include k-1 or k features; read up on the dummy-variable trap for more on that.)
There are many ways of doing this in Python:
pandas.get_dummies:
dummies = pandas.get_dummies(dataset['education'], prefix='education')
Here's a sample of dummies:
>>> dummies
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th ... education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0 0 0 0 0 ... 0 0 0 0 0
1 0 0 0 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 1 0 0 0 0
3 0 1 0 0 0 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
5 0 0 0 0 0 ... 0 1 0 0 0
6 0 0 0 0 0 ... 0 0 0 0 0
7 0 0 0 0 0 ... 1 0 0 0 0
8 0 0 0 0 0 ... 0 1 0 0 0
9 0 0 0 0 0 ... 0 0 0 0 0
Now we can use this education feature like so:
dataset = dataset.join(dummies)
X = dataset[['age'] + list(dummies)]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey, that worked!
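
On the k versus k-1 question: pandas.get_dummies has a drop_first parameter that drops the first level and leaves k-1 columns. A small sketch, using a few invented education values:

```python
import pandas as pd

# Invented education values, for illustration only.
education = pd.Series(["Bachelors", "HS-grad", "Masters", "HS-grad"], name="education")

# k columns: one per distinct value.
k_dummies = pd.get_dummies(education, prefix="education")

# k-1 columns: the first level is dropped and becomes the implicit baseline.
k_minus_1_dummies = pd.get_dummies(education, prefix="education", drop_first=True)

print(k_dummies.shape[1], k_minus_1_dummies.shape[1])
```

The k-1 form mostly matters for linear models with an intercept; for a distance-based model such as KNeighborsClassifier, keeping all k columns is usually fine.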
Hopefully that helps to answer your question. There are tons of ways to perform one-hot encoding (e.g. through a list comprehension or sklearn.preprocessing.OneHotEncoder). I'd suggest you read more on "feature engineering" before progressing with your model-building - feature engineering is one of the most important parts of the ML process.
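
For the sklearn.preprocessing.OneHotEncoder route, one possible sketch wraps the encoder and the classifier in a single pipeline, so the same encoding is applied at fit and predict time. The small sample frame, n_neighbors=3, and the new_people row are assumptions for illustration; only the column names come from the question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Small illustrative sample using the Adult column names (values are for demo only).
dataset = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52],
    "education": [" Bachelors", " Bachelors", " HS-grad", " 11th",
                  " Bachelors", " Masters", " 9th", " HS-grad"],
    "income": [" <=50K", " <=50K", " <=50K", " <=50K",
               " <=50K", " <=50K", " <=50K", " >50K"],
})

# One-hot encode 'education'; pass 'age' through unchanged.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["education"])],
    remainder="passthrough",
)

model = make_pipeline(preprocess, KNeighborsClassifier(n_neighbors=3))
model.fit(dataset[["age", "education"]], dataset["income"])

# A hypothetical new person with an education level not seen in training.
new_people = pd.DataFrame({"age": [40], "education": [" Doctorate"]})
prediction = model.predict(new_people)
print(prediction)
```

Compared with get_dummies, the fitted encoder remembers the training categories, so new data always ends up with the same columns.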

For columns that contain categorical strings, you should transform them to a one-hot encoding using:
dataset = pd.get_dummies(dataset, columns=[my_column1, my_column2, ...])
Here my_column1, my_column2, ... are the names of the columns containing the categorical strings. Be careful: this changes the number of columns in your dataframe, so adjust your split of X accordingly.
Here is the link to the documentation.

Get Top 3 predicted classes from GaussianNB classifier python

I am trying to predict a class using GaussianNB, but I need to get the top 3 predicted classes to create a custom score for the prediction.
My training data is x, y, class, where the class must be predicted from x and y.
The test variable contains (x, y) values and testclass contains class values.
test is a list in the following format:
Index Type Size Value
0 tuple 2 (0.6424, 0.8325)
1 tuple 2 (0.8493, 0.7848)
2 tuple 2 (0.791, 0.4191)
Test class data
Index Type Size Value
0 str 1 1.274e+09
1 str 1 9.5047e+09
Code:
import csv
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
clf_pf = GaussianNB()
clf_pf.fit(train, trainclass)
print(clf_pf.score(test, testclass))
ff = clf_pf.predict_proba(test)
How do I get the top 3 predicted classes from the variable ff above?
My ff data is like below
0 1 2 3 4 5 6 7 8
0 1.80791e-05 0 0.00126251 0 6.38504e-256 0 0 0 0
1 2.89477e-199 1.01093e-06 0 1.1056e-55 0 5.52213e-67 0 0
2 2.47755e-05 0 2.43499e-08 0 1.00392e-239 0 0 0 0
3 2.54941e-161 3.79815e-06 0 1.53516e-40 0 1.63465e-41 0 0

As said in the comment, ff has shape [n_samples, n_classes]. Using numpy.argsort you obtain, for each row, the class indices ordered by ascending probability, again as a matrix of shape [n_samples, n_classes]. You then take the last three elements of every row ([:, -3:]) and reverse their order ([:, ::-1]) so that the most probable class comes first:
np.argsort(ff)[:, -3:][:, ::-1]
Note that the leading [:, in each slice just means "get all the rows".
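
One detail worth spelling out: the argsort result holds column indices into ff, not class labels. The columns of predict_proba follow the order of the fitted classifier's classes_ attribute, so indexing classes_ with those indices gives the labels. A minimal sketch on invented data (four made-up classes "a" to "d" in the (x, y) plane):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Invented training data: four well-separated classes in the (x, y) plane.
rng = np.random.default_rng(0)
centers = {"a": (0.2, 0.2), "b": (0.8, 0.2), "c": (0.2, 0.8), "d": (0.8, 0.8)}
train = np.vstack([rng.normal(c, 0.05, size=(20, 2)) for c in centers.values()])
trainclass = np.repeat(list(centers.keys()), 20)

clf_pf = GaussianNB()
clf_pf.fit(train, trainclass)

test = np.array([[0.25, 0.2], [0.75, 0.8]])
ff = clf_pf.predict_proba(test)

# Column indices of the three most probable classes, best first.
top3_idx = np.argsort(ff, axis=1)[:, -3:][:, ::-1]

# Map column indices back to the actual class labels.
top3_classes = clf_pf.classes_[top3_idx]
print(top3_classes)
```

When several classes have a probability of exactly 0, as in the ff sample above, their relative order within the top 3 is arbitrary.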
