I am dealing with a little challenge: I am trying to create a multiclass logistic regression model. Some of my variables are categorical, so I am trying to encode them.
My initial dataset looks like this:
The column I want to predict is action1_preflop; it contains 3 possible classes: "r", "c", "f"
When encoding categorical features, I end up losing the variable I want to predict as it gets converted into 3 sub-variables:
action1_preflop_r
action1_preflop_f
action1_preflop_c
Below is the new dataframe after encoding
tiers tiers2_theory ... action1_preflop_f action1_preflop_r
0 7 11 ... 1 0
1 1 7 ... 0 1
2 5 11 ... 1 0
3 1 11 ... 0 1
4 1 7 ... 0 1
... ... ... ... ...
31007 4 11 ... 0 1
31008 1 11 ... 0 1
31009 1 11 ... 0 1
31010 1 11 ... 0 1
31011 2 7 ... 0 1
[31012 rows x 11 columns]
Could you please let me know how I am supposed to deal with these new variables, considering that the initial variable, before being encoded, was actually the variable I wanted to predict?
Thanks for the help
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import linear_model
df_raw = pd.read_csv('\\Users\\rapha\\Desktop\\Consulting\\Poker\\Tables test\\SB_preflop_a1_prob V1.csv', sep=";")
#Select categorical features only & use binary encoding
feature_cols = ['tiers','tiers2_theory','tiers3_theory','assorties','score','proba_preflop','action1_preflop']
df_raw = df_raw[feature_cols]
cat_features = df_raw.select_dtypes(include=[object])
num_features = df_raw.select_dtypes(exclude=[object])
df = num_features.join(pd.get_dummies(cat_features))
df = df.select_dtypes(exclude = [object])
df_outcome = df.action1_preflop  # fails: 'action1_preflop' was replaced by its dummy columns
df_variables = df.drop('action1_preflop',axis=1)
x = df_variables
y = df.action1_preflop
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=1)
lm = linear_model.LogisticRegression(multi_class='ovr', solver='liblinear')
lm.fit(x_train, y_train)
predict_test=lm.predict(x_test)
print(lm.score(x_test, y_test))
You should leave 'action1_preflop' out of the cat_features dataframe and include it in the num_features dataframe:
cat_features = df_raw.select_dtypes(include=[object])
cat_features = cat_features.drop(['action1_preflop'], axis=1)
num_features = df_raw.select_dtypes(exclude=[object])
num_features = pd.concat([num_features, df_raw['action1_preflop']], axis=1)
You can also save some typing (and the join) by working with column names instead:
cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove("action1_preflop")
Then you can just pass this list to the columns parameter of pd.get_dummies:
df = pd.get_dummies(df_raw, columns=cat_features)
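Putting it all together, here is a minimal end-to-end sketch (the toy dataframe below is hypothetical, standing in for the original CSV):
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# hypothetical toy frame standing in for the original CSV
df_raw = pd.DataFrame({
    'tiers': [7, 1, 5, 1, 4, 2],
    'score': [0.2, 0.5, 0.1, 0.9, 0.4, 0.7],
    'assorties': ['y', 'n', 'y', 'n', 'n', 'y'],        # categorical feature
    'action1_preflop': ['f', 'r', 'f', 'r', 'f', 'r'],  # target, kept untouched
})

# encode every object column except the target
cat_features = df_raw.select_dtypes(include=[object]).columns.to_list()
cat_features.remove('action1_preflop')
df = pd.get_dummies(df_raw, columns=cat_features)

x = df.drop('action1_preflop', axis=1)
y = df['action1_preflop']  # still the raw 'r'/'c'/'f' labels
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
lm = LogisticRegression(multi_class='ovr', solver='liblinear').fit(x_train, y_train)
print(lm.score(x_test, y_test))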
I use StratifiedKFold and a form of grid search for my Logistic Regression.
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=SEED)
I call this for loop for each combination of parameters:
for fold, (trn_idx, test_idx) in enumerate(skf.split(X, y)):
My question is, are trn_idx and test_idx the same for each fold every time I run the loop?
For example, if fold 0 contains trn_idx = [1,2,5,7,8] and test_idx = [3,4,6], is fold 0 going to contain the same trn_idx and test_idx the next 5 times I run the loop?
Yes, the stratified k-fold split is fixed if random_state=SEED is fixed. shuffle=True only shuffles the samples (together with their targets) before the k-fold split.
This means that each fold will always contain the same indices:
from sklearn.model_selection import StratifiedKFold

x = list(range(10))
y = [1]*5 + [2]*5
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (trn_idx, test_idx) in enumerate(skf.split(x, y)):
    print(trn_idx, test_idx)
Output:
[1 2 4 5 7 9] [0 3 6 8]
[0 1 3 5 6 8 9] [2 4 7]
[0 2 3 4 6 7 8] [1 5 9]
No matter how many times I run this code.
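Conversely, if you keep shuffle=True but leave random_state unset (None), the shuffle draws from a fresh random state each time, so the folds will generally differ between runs; fixing the seed is what pins them down:
from sklearn.model_selection import StratifiedKFold

# random_state omitted: with shuffle=True the folds will generally change between runs
skf = StratifiedKFold(n_splits=3, shuffle=True)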
As we know, the Bernoulli Naive Bayes classifier uses binary predictors (features). What I am not getting is how BernoulliNB in scikit-learn gives results even when the predictors are not binary. The following example is taken verbatim from the documentation:
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X, Y)
print(clf.predict(X[2:3]))
Output:
array([3])
Here are the first few features of X, and they are obviously not binary:
3 4 0 1 3 0 0 1 4 4 1
1 0 2 4 4 0 4 1 4 1 0
2 4 4 0 3 3 0 3 1 0 2
2 2 3 1 4 0 0 3 2 4 1
0 4 0 3 2 4 3 2 4 2 4
3 3 3 3 0 2 3 1 3 2 3
How does BernoulliNB work here even though the predictors are not binary?
This is due to the binarize argument; from the docs:
binarize : float or None, default=0.0
Threshold for binarizing (mapping to booleans) of sample features. If None, input is presumed to already consist of binary vectors.
When called with its default value binarize=0.0, as is the case in your code (since you do not specify it explicitly), every element of X greater than 0 is converted to 1, so the transformed X that is actually fed to the BernoulliNB classifier does indeed consist of binary values.
The binarize argument works exactly the same way with the stand-alone preprocessing function of the same name; here is a simplified example, adapting your own:
from sklearn.preprocessing import binarize
import numpy as np
rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 1))
X
# result
array([[3],
[4],
[0],
[1],
[3],
[0]])
binarize(X) # here as well, default threshold=0.0
# result (binary values):
array([[1],
[1],
[0],
[1],
[1],
[0]])
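If you want to convince yourself that the classifier really works on the thresholded data, a small sketch (with a hypothetical non-zero threshold t) shows that BernoulliNB(binarize=t) behaves like BernoulliNB(binarize=None) fed with pre-binarized input:
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import binarize

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))
Y = np.array([1, 2, 3, 4, 4, 5])

t = 2.0                                               # hypothetical threshold
Xb = binarize(X, threshold=t)                         # values > 2 become 1, the rest 0
clf_a = BernoulliNB(binarize=t).fit(X, Y)             # thresholds internally
clf_b = BernoulliNB(binarize=None).fit(Xb, Y)         # expects already-binary input
print((clf_a.predict(X) == clf_b.predict(Xb)).all())  # True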
I need to do a K-fold CV on some models, but I need to ensure the validation (test) data set is clustered together by a group and t number of years. GroupKFold is close, but it still splits up the validation set (see second fold).
For example, if I have a set of data with years from 2000-2008 and I want to K-fold it into 3 groups, the appropriate sets would be: Validation: 2000-2002, Train: 2003-2008; Validation: 2003-2005, Train: 2000-2002 & 2006-2008; and Validation: 2006-2008, Train: 2000-2005.
Is there a way to group and cluster the data using K-Fold CV where the validation set is clustered by t years?
from sklearn.model_selection import GroupKFold
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
gkf = GroupKFold(n_splits=3)  # 3 splits, matching the output below
for train_index, test_index in gkf.split(X, y, groups=groups):
    print("Train:", train_index, "Validation:", test_index)
Output:
Train: [ 0 1 2 3 4 5 10 11 12] Validation: [6 7 8 9]
Train: [3 4 5 6 7 8 9] Validation: [ 0 1 2 10 11 12]
Train: [ 0 1 2 6 7 8 9 10 11 12] Validation: [3 4 5]
Desired Output (assume 2 years for each group):
Train: [ 7 8 9 10 11 12 ] Validation: [0 1 2 3 4 5 6]
Train: [0 1 2 10 11 12 ] Validation: [ 3 4 5 6 7 8 9 ]
Train: [ 0 1 2 3 4 5 ] Validation: [6 7 8 9 10 11 12]
(Note that the test and train subsets need not be sequential, though, and more years can be selected per group.)
I hope I understood you correctly.
The LeaveOneGroupOut method from scikit-learn's model_selection might help:
Let's say you assign group label 1 to all the data points from 2000-2002, label 2 to all data points between 2003 and 2005, and label 3 to the data from 2006-2008.
Then you could use the following method, to create training and test splits, where the three test splits are created from one of the three groups:
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np
groups=[1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3]
X=np.random.random(len(groups))
y=np.random.randint(0,4,len(groups))
logo = LeaveOneGroupOut()
print("n_splits=", logo.get_n_splits(X,y,groups))
for train_index, test_index in logo.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17] test_idx: [0 1 2 3]
train_idx: [ 0 1 2 3 10 11 12 13 14 15 16 17] test_idx: [4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5 6 7 8 9] test_idx: [10 11 12 13 14 15 16 17]
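If your data carries an explicit year column, those group labels can be computed instead of typed by hand; a small sketch with a hypothetical years array:
import numpy as np

years = np.array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008])  # hypothetical
groups = (years - years.min()) // 3 + 1  # 1 for 2000-2002, 2 for 2003-2005, 3 for 2006-2008
print(groups)  # [1 1 1 2 2 2 3 3 3]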
Edit
I think I now finally understood what you want. Sorry that it took me so long.
I don't think your desired split method is already implemented in sklearn, but we can easily extend the BaseCrossValidator class.
import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.validation import check_array


class GroupOfGroups(BaseCrossValidator):
    def __init__(self, group_of_groups):
        """
        :param group_of_groups: list with length n_splits. Each entry in the list is a list
        with group ids from set(groups). In each of the n_splits splits, the groups given
        in the current group_of_groups sublist are used for validation.
        """
        self.group_of_groups = group_of_groups

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self.group_of_groups)

    def _iter_test_masks(self, X=None, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
        for g in self.group_of_groups:
            test_index = np.zeros(len(groups), dtype=bool)  # np.bool is removed in recent numpy
            for g_id in g:
                test_index[groups == g_id] = True
            yield test_index
The usage is quite simple. As before, we define X,y and groups. Additionally we define a list of lists (groups of groups) which define which groups should be used together in which test fold.
So g_of_g=[[1,2],[2,3],[3,4]] means that groups 1 and 2 are used as test set in the first fold, while the remaining groups 3 and 4 are used for training. In fold 2, data from groups 2 and 3 are used as test set etc.
I am not quite happy with the naming "GroupOfGroups" so maybe you find something better.
Now we can test this cross validator:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = [[1,2],[2,3],[3,4]]
gg = GroupOfGroups(g_of_g)
print("n_splits=", gg.get_n_splits(X,y,groups))
for train_index, test_index in gg.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 6 7 8 9 10 11 12] test_idx: [0 1 2 3 4 5]
train_idx: [ 0 1 2 10 11 12] test_idx: [3 4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5] test_idx: [ 6 7 8 9 10 11 12]
Please keep in mind that I did not include a lot of checks and didn't do thorough testing. So verify carefully that this works for you.
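Since GroupOfGroups implements the standard cross-validator interface, you can also plug it straight into helpers such as cross_val_score; a quick sketch reusing X, y, groups and gg from above, with a hypothetical placeholder model:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='most_frequent')  # hypothetical stand-in for a real model
X_2d = np.array(X).reshape(-1, 1)                # scikit-learn expects 2-D feature arrays
scores = cross_val_score(clf, X_2d, y, groups=groups, cv=gg)
print(scores)  # one score per fold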
I'm using the LinearRegression model from scikit-learn for an explanatory fit on a time series:
from sklearn import linear_model
import numpy as np
X = np.array([np.random.random(100), np.random.random(100)]).T  # shape (100, 2): samples x features
y = np.random.random(100)
regressor = linear_model.LinearRegression()
regressor.fit(X, y)
y_hat = regressor.predict(X)
I want to cross-validate the prediction. As far as I know, I can't use the cross-validation utilities from sklearn (like KFold) because they split the data randomly, and I need the folds to be sequential. For example,
data_set = [1 2 3 4 5 6 7 8 9 10]
# first train set
train = [1]
# first test set
test = [2 3 4 5 6 7 8 9 10]
#fit, predict, evaluate
# train set
train = [1 2]
# test set
test = [3 4 5 6 7 8 9 10]
#fit, predict, evaluate
...
# train set
train = [1 2 3 4 5 6 7 8]
# test set
test = [9 10]
#fit, predict, evaluate
Is it possible to do this using sklearn?
You do not need scikit-learn for this kind of folding; slicing is sufficient, something like:
step = 1
for i in range(1, len(data_set), step):
    train = data_set[:i]
    test = data_set[i:]
    # fit, predict, evaluate...
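That said, newer scikit-learn versions do ship a built-in sequential splitter, TimeSeriesSplit. It produces growing train sets with the next contiguous chunk as the test set (rather than everything remaining), which may already be close enough:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3)
for train_index, test_index in tscv.split(X):
    print("Train:", train_index, "Test:", test_index)
# Train: [0 1 2 3] Test: [4 5]
# Train: [0 1 2 3 4 5] Test: [6 7]
# Train: [0 1 2 3 4 5 6 7] Test: [8 9]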