Dataset file: Google Drive link
I have a dataset of shape (27884 rows, 8933 columns).
Here's a small preview of the dataset:
user_iD  b1  b2  b3  b4  b5  b6  b7  b8  b9  b10  b11
1         1   7   2   3   8   0   4   0   6    0    5
2         7   8   1   2   4   6   5   9  10    3    0
3         0   0   0   0   1   5   2   3   4    0    6
4         1   7   2   3   8   0   5   0   6    0    4
5         0   4   7   0   6   1   5   3   0    0    2
6         1   0   2   3   0   5   4   0   0    6    7
Here the user_iD column represents students, and the columns b1-b11 represent book chapters: each value records the order in which that student studied the chapter (first, second, third, and so on), and a 0 entry means the student did not study that particular chapter.
This is just a small preview of a big dataset. There are a total of 27884 users and 8932 chapters, labelled b1 through b8932.
I'm applying PCA and I'm getting the following error:
ValueError: Found array with 0 feature(s) (shape=(22307, 0)) while a minimum of 1 is required.
As stated above, there are 27884 users and 8932 other columns.
What I have tried so far:
import pandas as pd

df3 = pd.read_feather('Bundles.ftr')
X = df3['user_iD']
y = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X and Y into the
# Training set and Testing set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# performing preprocessing part
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train= X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
# Applying PCA function on training
# and testing set of X component
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
How do I apply PCA to this use case?
Here's how to use PCA to pre-process your data:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

df3 = pd.read_feather('Bundles.ftr')
X = df3.loc[:, df3.columns != 'user_iD']
# Splitting the X into the
# Training set and Testing set
X_train, X_test = train_test_split(X, test_size = 0.2, random_state = 0)
# performing preprocessing part
X_train = X_train.values
X_test = X_test.values
# Applying PCA function on training
# and testing set of X component
print(X_train.shape)
print(X_test.shape)
pca = PCA(n_components = 2)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
explained_variance = pca.explained_variance_ratio_
This is what the X_train variable looks like after preprocessing:
array([[-1846.8651992 , 437.17734222],
[-1847.05838019, 437.41158726],
[-1845.67443438, 436.28046074],
...,
[-1847.00651974, 437.20374889],
[ -780.18296423, 116.65908052],
[-1847.09404683, 437.30545959]])
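It may also be worth checking how much of the variance these two components actually capture, using the explained_variance array computed above; a minimal sketch:
# Fraction of the total variance captured by each of the two principal components
print(explained_variance)
print(explained_variance.sum())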
However, I don't think that PCA is the right tool here. There are a few reasons for this:
I think kNN would be easier to interpret.
The way the input features are encoded mixes ordinal and categorical information, which will make your clustering algorithm work less well.
For example, if one user read a chapter first, and another user doesn't read a chapter at all, that is assigned 1 and 0. In this case, higher number means the user is more interested.
In another case, if one user read a chapter seventh, and another user read it eighth, that is assigned 7 and 8. In this case, a higher number means that the user is less interested.
On top of that, this encoding treats the difference between reading something seventh and reading it eighth as the same size as the difference between reading a chapter and not reading it at all. To me, if someone didn't read it, that's a much bigger difference than a slight change in reading order.
So I would suggest having two sets of input features: did they read it at all, and if they did, where in their reading did the chapter fall.
The first set of features could be computed like this:
did_read = (X.values >= 1).astype(int)
These features are 1 if read and 0 otherwise.
The second set of features could be computed like this:
X_values = X.values
# Highest reading-order value per user; initial=1 avoids dividing by zero for users who read nothing
max_order = X_values.max(axis=1, initial=1).reshape(-1, 1)
order_normalized = X_values / max_order
These features are in the range [0, 1] based on whether it was toward the beginning or end of the chapters that they read.
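If you then want to project or cluster on these features, a minimal sketch (assuming the did_read and order_normalized arrays computed above) for stacking them into a single feature matrix would be:
import numpy as np
from sklearn.decomposition import PCA

# Put the binary "read at all" features next to the normalized reading-order features
features = np.hstack([did_read, order_normalized])

pca = PCA(n_components=2)
features_2d = pca.fit_transform(features)
print(pca.explained_variance_ratio_)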
Related
I am dealing with a little challenge. I am trying to create a multiclass logistic regression model. Some of my variables are categorical, so I'm trying to encode them.
The column I want to predict is action1_preflop; it contains 3 possible classes: "r", "c", and "f".
When encoding categorical features, I end up losing the variable I want to predict as it gets converted into 3 sub-variables:
action1_preflop_r
action1_preflop_f
action1_preflop_c
Below is the new dataframe after encoding
       tiers  tiers2_theory  ...  action1_preflop_f  action1_preflop_r
0          7             11  ...                  1                  0
1          1              7  ...                  0                  1
2          5             11  ...                  1                  0
3          1             11  ...                  0                  1
4          1              7  ...                  0                  1
...      ...            ...  ...                ...                ...
31007      4             11  ...                  0                  1
31008      1             11  ...                  0                  1
31009      1             11  ...                  0                  1
31010      1             11  ...                  0                  1
31011      2              7  ...                  0                  1

[31012 rows x 11 columns]
Could you please let me know how I am supposed to deal with these new variables, given that the variable that got encoded is the one I actually want to predict?
Thanks for the help
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
df_raw = pd.read_csv('\\Users\\rapha\\Desktop\\Consulting\\Poker\\Tables test\\SB_preflop_a1_prob V1.csv', sep=";")
#Select categorical features only & use binary encoding
feature_cols = ['tiers','tiers2_theory','tiers3_theory','assorties','score','proba_preflop','action1_preflop']
df_raw = df_raw[feature_cols]
cat_features = df_raw.select_dtypes(include=[object])
num_features = df_raw.select_dtypes(exclude=[object])
df = num_features.join(pd.get_dummies(cat_features))
df = df.select_dtypes(exclude = [object])
df_outcome = df.action1_preflop
df_variables = df.drop('action1_preflop',axis=1)
x = df_variables
y = df.action1_preflop
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)
predict_test=lm.predict(x_test)
print(lm.score(x_test, y_test))
You should leave the 'action1_preflop' out of the 'cat_features' dataframe and include it in the 'num_features' dataframe:
cat_features = df_raw.select_dtypes(include=[object])
cat_features = cat_features.drop(['action1_preflop'], axis=1)
num_features = df_raw.select_dtypes(exclude=[object])
num_features = pd.concat([num_features, df_raw['action1_preflop']], axis=1)
You can also save some typing (and the join) by doing it this way:
cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove("action1_preflop")
Then you can just pass this list of columns to the columns parameter of get_dummies:
df = pd.get_dummies(df_raw, columns=cat_features)
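Putting it together with the rest of the code from the question, a minimal sketch (reusing the df_raw dataframe and model setup above) might look like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model

# One-hot encode only the categorical feature columns, leaving the target untouched
cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove("action1_preflop")
df = pd.get_dummies(df_raw, columns=cat_features)

# The target stays a single column of class labels "r" / "c" / "f"
y = df["action1_preflop"]
x = df.drop("action1_preflop", axis=1)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)
print(lm.score(x_test, y_test))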
I have a DataFrame (as shown below) that is already set up with one-hot encoding. When I try to pass it into sklearn models, I keep getting dimension errors. If MultinomialNB only accepts 1d arrays, how do I implement one-hot encoding?
Color   Col B  Col C  Col D  Col E
red         1      0      0      0
green       0      0      1      0
blue        0      1      0      0
green       0      0      1      0
brown       0      0      0      1
This runs fine:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

cv = CountVectorizer(max_features = 1500)
X = cv.fit_transform(df['Color']).toarray()
y = df.loc[:, df.columns != 'Color'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
I get an error stating "ValueError: y should be a 1d array" when I run the following:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train, y_train)
I am using a multiclass classification-ready dataset with 14 continuous variables and classes from 1 to 10.
This is the data file:
https://drive.google.com/file/d/1nPrE7UYR8fbTxWSuqKPJmJOYG3CGN5y9/view?usp=sharing
My goal is to apply the scikit-learn Gaussian NB model to the data, but in a binary classification task where only class 2 is the positive label and the remainder of the classes are all negatives. For that, I did the following code:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB, CategoricalNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, matthews_corrcoef
dataset = pd.read_csv("PD_21_22_HA1_dataset.txt", index_col=False, sep="\t")
x_d = dataset.values[:, :-1]
y_d = dataset.values[:, -1]
### train_test_split to split the dataframe into train and test sets
## with a partition of 20% for the test https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.20, random_state=23)
yc_TRAIN=np.array([int(i==2) for i in y_TRAIN])
mdl = GaussianNB()
mdl.fit(X_TRAIN, yc_TRAIN)
preds = mdl.predict(X_IVS)
# binarization of "y_true" array
yc_IVS=np.array([int(i==2) for i in y_IVS])
print("The Precision is: %7.4f" % precision_score(yc_IVS, preds))
print("The Matthews correlation coefficient is: %7.4f" % matthews_corrcoef(yc_IVS, preds))
But I get the following warning message when calculating precision:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples.
The matthews_corrcoef function also outputs 0 and raises a RuntimeWarning: invalid value encountered in double_scalars.
Furthermore, by inspecting preds, I found that the model predicts only negatives/zeros.
I've tried increasing the 20% test partition as some forums suggested but it didn't do anything.
Is this simply a problem of the model not being able to fit against the data or am I doing something wrong that may be inputting the wrong data format/type into the model?
Edit: yc_TRAIN is the result of turning all cases of class 2 into positives ("1") and all remaining classes into negatives ("0"), so it's a 1-d array of length 9450 (which matches my total number of prediction cases), with 8697 0s and 753 1s. It looks something like this:
[0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ]
Your code looks fine; this is a classic problem with imbalanced datasets, and it actually means you do not have enough training data to correctly classify the rare positive class.
The only thing you could improve in the given code is to set stratify=y_d in train_test_split, in order to get a stratified training set; decreasing the size of the test set (i.e. leaving more samples for training) may also help:
X_TRAIN, X_IVS, y_TRAIN, y_IVS = train_test_split(x_d, y_d, test_size=0.10, random_state=23, stratify=y_d)
If this does not work, you should start thinking of applying class imbalance techniques (or different models); but this is not a programming question any more but a theory/methodology one, and it should be addressed at the appropriate SE sites and not here (see the intro and NOTE in the machine-learning tag info).
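As a quick sanity check before reaching for those techniques, it can also help to look at the full confusion matrix rather than a single score; a minimal sketch using the predictions from above:
from sklearn.metrics import confusion_matrix

# Rows are true classes (0 = negative, 1 = class 2), columns are predicted classes;
# an all-zero second column confirms the model never predicts the positive class.
print(confusion_matrix(yc_IVS, preds))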
I have the following data:
X1    X2    Y
-10   4     0
-10   3     4
-10   2.5   8
-8    3     7
-8    4     8
-8    4.4   9
0     2     9
0     2.3   9.2
0     4     10
0     5     12
I need to create a simple regression model to predict Y given X1 and X2: Y = f(X1,X2).
This is my code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

poly = PolynomialFeatures(degree=2)
X1 = poly.fit_transform(df["X1"].values.reshape(-1,1))
X2 = poly.fit_transform(df["X2"].values.reshape(-1,1))
clf = linear_model.LinearRegression()
clf.fit([X1,X2], df["Y"].values.reshape(-1, 1))
print(clf.coef_)
print(clf.intercept_)
Y_test = clf.predict([X1, X2])
df_test=pd.DataFrame()
df_test["X1"] = df["X1"]
df_test["Y"] = df["Y"]
df_test["Y_PRED"] = Y_test
df_test.plot(x="X1",y=["Y","Y_PRED"], figsize=(10,5), grid=True)
plt.show()
But it fails at line clf.fit([X1,X2], df["Y"].values.reshape(-1, 1)):
ValueError: Found array with dim 3. Estimator expected <= 2
It looks like the model cannot work with 2 input parameters X1 and X2. How should I change the code to fix it?
Well, your mistake resides in the way you combine your feature arrays: passing them as the Python list [X1, X2] stacks them into a 3-dimensional array, which is what the error is complaining about. You should instead concatenate them column-wise, for instance using pandas:
import pandas as pd
X12_p = pd.concat([pd.DataFrame(X1), pd.DataFrame(X2)], axis=1)
Or the same using numpy:
import numpy as np
X12_p = np.concatenate([X1, X2], axis=1)
Your final snippet should look like:
# Fit
Y = df["Y"].values.reshape(-1,1)
X12_p = pd.concat([pd.DataFrame(X1), pd.DataFrame(X2)], axis=1)
clf.fit(X12_p, Y)
# Predict
Y_test = clf.predict(X12_p)
You can also evaluate some performance metrics such as RMSE using:
import numpy as np
from sklearn.metrics import mean_squared_error
print('rmse = {0:.5f}'.format(np.sqrt(mean_squared_error(Y, Y_test))))
Please also note that you can exclude the bias term from polynomial features by changing the default param:
PolynomialFeatures(degree=2, include_bias=False)
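As an aside, instead of transforming X1 and X2 separately and concatenating, you could also apply PolynomialFeatures to both columns at once, which additionally generates the interaction term X1*X2; a minimal sketch:
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model

# Transforming both columns together yields 1, X1, X2, X1^2, X1*X2, X2^2
poly = PolynomialFeatures(degree=2)
X12_joint = poly.fit_transform(df[["X1", "X2"]].values)

clf = linear_model.LinearRegression()
clf.fit(X12_joint, df["Y"].values.reshape(-1, 1))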
Hope this helps.
I'm trying to use machine learning to guess whether a person has an income over or under 50k using this data set. I think the code does not work because the data set contains strings. When I use a shorter data set containing 4 instead of 14 variables (and only numbers), the code works. What am I doing wrong?
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
dataset = pandas.read_csv(url, names=names)
# Split dataset
array = dataset.values
X = array[:,0:14]
Y = array[:,14]
validation_size = 0.20  # assumed value; not defined in the original snippet
seed = 7                # assumed value; not defined in the original snippet
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
Let's take a really simple example from your dataset.
Looking at dataset['income'].nunique() (which produces 2), we can see you have two classes you're trying to predict. You're on the right track with taking the classification route (although there are different methodological arguments to be made as to whether this problem is better suited to a continuous regression approach; save that for another day).
Say you want to use age and education to predict whether someone's income is above $50k. Let's try it out:
X = dataset[['age', 'education']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
This Exception should be raised:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/neighbors/base.py", line 891, in fit
X, y = check_X_y(X, y, "csr", multi_output=True)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 756, in check_X_y
estimator=estimator)
File "/Users/jake/Documents/assets/venv/lib/python3.6/site-packages/sklearn/utils/validation.py", line 567, in check_array
array = array.astype(np.float64)
ValueError: could not convert string to float: ' Bachelors'
What if we tried with just age?
X = dataset[['age']]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey! That works! So there's something unique about the education column that we need to account for. You've noticed this above: scikit-learn (and many other ML packages, though not all) doesn't operate on strings. So we need to do something like "one-hot" encoding: create k columns, where k is the number of unique values in your categorical "string" column (again, there's a methodological question as to whether you include k-1 or k features; read up on the dummy-variable trap for more on that). Each column is composed of 1s and 0s: a 1 if the case/observation in a particular row has that kth attribute, a 0 if not.
There are many ways of doing this in Python:
pandas.get_dummies:
dummies = pandas.get_dummies(dataset['education'], prefix='education')
Here's a sample of dummies:
>>> dummies
education_ 10th education_ 11th education_ 12th education_ 1st-4th education_ 5th-6th ... education_ HS-grad education_ Masters education_ Preschool education_ Prof-school education_ Some-college
0 0 0 0 0 0 ... 0 0 0 0 0
1 0 0 0 0 0 ... 0 0 0 0 0
2 0 0 0 0 0 ... 1 0 0 0 0
3 0 1 0 0 0 ... 0 0 0 0 0
4 0 0 0 0 0 ... 0 0 0 0 0
5 0 0 0 0 0 ... 0 1 0 0 0
6 0 0 0 0 0 ... 0 0 0 0 0
7 0 0 0 0 0 ... 1 0 0 0 0
8 0 0 0 0 0 ... 0 1 0 0 0
9 0 0 0 0 0 ... 0 0 0 0 0
Now we can use this education feature like so:
dataset = dataset.join(dummies)
X = dataset[['age'] + list(dummies)]
y = dataset['income']
model = KNeighborsClassifier()
model.fit(X, y)
Hey, that worked!
Hopefully that helps to answer your question. There are tons of ways to perform one-hot encoding (e.g. through a list comprehension or sklearn.preprocessing.OneHotEncoder). I'd suggest you read more on "feature engineering" before progressing with your model-building - feature engineering is one of the most important parts of the ML process.
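For completeness, a minimal sketch of the OneHotEncoder route mentioned above (the handle_unknown='ignore' option and the two-column selection are illustrative choices, not something from the original question; drop='first' would give the k-1 encoding discussed earlier):
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier

# One-hot encode the categorical column and pass the numeric column through unchanged
preprocess = ColumnTransformer(
    [('edu', OneHotEncoder(handle_unknown='ignore'), ['education'])],
    remainder='passthrough')

model = Pipeline([('prep', preprocess), ('knn', KNeighborsClassifier())])
model.fit(dataset[['age', 'education']], dataset['income'])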
For columns that contain categorical strings, you should transform them to a one-hot encoding using:
dataset = pd.get_dummies(dataset, columns=['my_column1', 'my_column2', ...])
where 'my_column1', 'my_column2', ... are the names of the columns containing the categorical strings. Be careful: this changes the number of columns in your dataframe, so adjust your split of X accordingly.
See the pandas get_dummies documentation for the details.