Scikit-learn, GroupKFold with shuffling groups?

I was using StratifiedKFold from scikit-learn, but now I also need to account for "groups". There is a nice function GroupKFold, but my data are very time dependent. Similar to the example in the help, the week number is the grouping index, and each week should appear in only one fold.
Suppose I need 10 folds. What I need is to shuffle the data first, before I can use GroupKFold.
The shuffling is in a group sense - whole groups should be shuffled among each other.
Is there an elegant way to do this with scikit-learn? It seems to me that GroupKFold is unaffected by shuffling the data first.
If there is no way to do it with scikit-learn, can anyone write some efficient code for this? I have large data sets.
The inputs are the feature matrix, labels and groups.

EDIT: This solution does not work - GroupKFold assigns groups to folds deterministically from the group sizes, so shuffling the rows beforehand does not change which groups land in which fold.
I think using sklearn.utils.shuffle is an elegant solution!
For data in X, y and groups:
from sklearn.utils import shuffle
X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=0)
Then use X_shuffled, y_shuffled and groups_shuffled with GroupKFold:
from sklearn.model_selection import GroupKFold
group_k_fold = GroupKFold(n_splits=10)
splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
Of course, you probably want to shuffle multiple times and do the cross-validation with each shuffle. You could put the entire thing in a loop - here's a complete example with 5 shuffles (and only 3 splits instead of your required 10):
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.utils import shuffle

X = np.arange(20).reshape((10, 2))
y = np.arange(10)
groups = [0, 0, 0, 1, 2, 3, 4, 5, 6, 7]
n_shuffles = 5
group_k_fold = GroupKFold(n_splits=3)
for i in range(n_shuffles):
    X_shuffled, y_shuffled, groups_shuffled = shuffle(X, y, groups, random_state=i)
    splits = group_k_fold.split(X_shuffled, y_shuffled, groups_shuffled)
    # do something with splits here, I'm just printing them out
    print('Shuffle', i)
    print('groups_shuffled:', groups_shuffled)
    for train_idx, val_idx in splits:
        print('Train:', train_idx)
        print('Val:', val_idx)

The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds).
In GroupKFold, the groups array must have the same length as the data.
For data in X, y and groups:
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.array([[1,2,1,1], [3,4,7,8], [5,6,1,3], [7,8,4,7]])
y = np.array([0,2,1,2])
groups = np.array([2,1,0,1])
group_kfold = GroupKFold(n_splits=len(np.unique(groups)))
group_kfold.get_n_splits(X, y, groups)
param_grid = {
    'min_child_weight': [50,100],
    'subsample': [0.1,0.2],
    'colsample_bytree': [0.1,0.2],
    'max_depth': [2,3],
    'learning_rate': [0.01],
    'n_estimators': [100,500],
    'reg_lambda': [0.1,0.2]
}
xgb = XGBClassifier()
grid_search = GridSearchCV(xgb, param_grid, cv=group_kfold.split(X, y, groups), n_jobs=-1)
result = grid_search.fit(X, y)

Here is a performant solution that essentially reassigns the values of the keys in a way that respects the original groups.
Code is shown below, but the 4 steps are:
Shuffle the grouping-key vector. The key goal here is to rearrange the order in which each grouping key first appears.
Use np.unique() to return the first_index values for each unique key and the inverse_index values that could be used to reconstruct the grouping-key vector.
Use fancy indexing of the inverse indexes operating on the first_index values to construct a new array of grouping keys where each grouping key has been transformed to a number representing the order in which it first shows up in the shuffled grouping vector.
This new vector of grouping keys can be used in the standard GroupKFold splitter to get a different set of splits than the original because you have reordered the grouping indexes.
To give a quick example, imagine your original grouping-key vector was [3, 1, 1, 5, 3, 5], then this procedure would create a new grouping key vector [0, 1, 1, 2, 0, 2]. The 3's have become 0's because they were the first key to show up, the 1's have become 1's because they were the second key to show up, and the 5's have become 2's because they were the 3rd key to show up. As long as you shuffle the keys, you will get a transformation of grouping-keys, leading to a different set of splits by GroupKFold.
Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Say that A is the official grouping key
A = list(range(10)) + list(range(10))
B = list(range(20))
y = np.zeros(20)
X = pd.DataFrame({
    'group': A,
    'var': B
})
X = X.sample(frac=1)
original_grouping_vector = X['group']
unique_values, indexes, inverse = np.unique(original_grouping_vector, return_inverse=True, return_index=True)
new_grouping_vector = indexes[inverse]  # This is where the magic happens!
splitter = GroupKFold()
for train, test in splitter.split(X, y, groups=new_grouping_vector):
    print(X.iloc[test, :])
The above will print out different splits upon shuffling because the grouping-keys are being reordered, causing the value of new_grouping_vector to change.
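If you want several different reshuffles, the same trick can be wrapped in a small helper that takes a seed. This is only a sketch of the idea above; the function name shuffled_group_labels is illustrative, not a scikit-learn API:
import numpy as np

def shuffled_group_labels(groups, random_state=None):
    """Relabel groups by the order in which they first appear after a random shuffle."""
    groups = np.asarray(groups)
    rng = np.random.RandomState(random_state)
    order = rng.permutation(len(groups))  # shuffle the positions
    shuffled = groups[order]
    _, first_index, inverse = np.unique(shuffled, return_index=True, return_inverse=True)
    relabeled_shuffled = first_index[inverse]  # relabel by order of first appearance
    # map the relabeled keys back to the original (unshuffled) row order
    relabeled = np.empty_like(relabeled_shuffled)
    relabeled[order] = relabeled_shuffled
    return relabeled

# different seeds can lead to different GroupKFold splits over the same data
print(shuffled_group_labels([3, 1, 1, 5, 3, 5], random_state=0))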

Related

Split data into train, test, validation with stratifying using Numpy

I've just seen this answer on SO which shows how to split data using numpy.
Assuming we're going to split the data as 0.8, 0.1, 0.1 for training, testing, and validation respectively, you do it this way:
train, test, val = np.split(df, [int(.8 * len(df)), int(.9 * len(df))])
I'm interested to know how could I consider stratifying while splitting data using this methodology.
Stratifying means splitting the data while keeping the priors of each class you have in the data. That is, if you're going to take 0.8 for the training set, you take 0.8 from each class you have. The same applies to the test and validation sets.
I tried grouping the data first by class using:
grouped_df = df.groupby(class_col_name, group_keys=False)
But it did not show correct results.
Note: I'm familiar with train_test_split
Simply use your groupby object, grouped_df, which consists of one subsetted data frame per class, and run the needed np.split on each. Then concatenate all sampled data frames with pd.concat. Altogether, this stratifies according to your quoted message:
train_list = []; test_list = []; val_list = []
grouped_df = df.groupby(class_col_name)
# ITERATE THROUGH EACH SUBSET DF
for i, g in grouped_df:
    # STRATIFY THE g (CLASS) DATA FRAME
    train, test, val = np.split(g, [int(.8 * len(g)), int(.9 * len(g))])
    train_list.append(train); test_list.append(test); val_list.append(val)
final_train = pd.concat(train_list)
final_test = pd.concat(test_list)
final_val = pd.concat(val_list)
Alternatively, a short-hand version using list comprehensions:
# LIST OF ARRAYS
arr_list = [np.split(g, [int(.8 * len(g)), int(.9 * len(g))]) for i, g in grouped_df]
final_train = pd.concat([t[0] for t in arr_list])
final_test = pd.concat([t[1] for t in arr_list])
final_val = pd.concat([t[2] for t in arr_list])
This assumes you have done stratification already such that a "category" column indicates which stratification each entry belongs to.
from collections import namedtuple
Dataset = namedtuple('Dataset', 'train test val')
# one shuffled sub-frame per category value
grouped = df.groupby('headline')
splitted = {x: grouped.get_group(x).sample(frac=1) for x in grouped.groups}
# 0.8 / 0.1 / 0.1 split of each category's rows
datasets = {k: Dataset(*np.split(v, [int(.8 * len(v)), int(.9 * len(v))])) for k, v in splitted.items()}
This stores each stratified split by the category name assigned in df.
Each item in datasets is a Dataset namedtuple such that training, testing, and validation subsets are accessible by .train, .test, and .val respectively.
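For example, if the headline column contains a value 'sports' (an illustrative value, not taken from the original data), the three subsets for that category can be pulled out like this:
sports_data = datasets['sports']
train_df, test_df, val_df = sports_data.train, sports_data.test, sports_data.val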

Generate all combinations of a list given a condition

I would like to generate all combinations of length n, for a list of k variables. I can do this as follows:
import itertools
import pandas as pd
from sklearn import datasets
dataset = datasets.load_breast_cancer()
X = dataset.data
y = dataset.target
df = pd.DataFrame(X, columns=dataset.feature_names)
features = dataset.feature_names
x = set(['mean radius', 'mean texture'])
for s in itertools.combinations(features, 3):
    if x.issubset(set(s)):
        print(s)
len(features) = 30, thus this will generate 4060 combinations where n=3. When n=10, this is 30,045,015 combinations.
len(tuple(itertools.combinations(features, 10)))
Each of these combinations will then be evaluated based on the conditional statement. However for n>10 this becomes unfeasible.
Instead of generating all combinations, and then filtering by some condition like in this example, is it possible to generate all combinations given this condition?
In other words, generate all combinations where n=3, 4, 5 ... k, given 'mean radius' and 'mean texture' appear in the combination?
Just generate the combinations without 'mean radius' and 'mean texture' and add those two to every combination, thus largely reducing the number of combinations. This way you don't have to filter; every combination generated will be useful.
n = 3  # total combination length, including the two fixed features
# remove the fixed features from the pool:
features = set(features) - x
for s in itertools.combinations(features, n - len(x)):
    s = set(s) | x  # add the fixed features to each combination
    print(s)
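As a rough check of the savings (a sketch using the standard library's math.comb):
from math import comb

# all length-3 combinations of 30 features, vs. choosing only the 1 remaining
# feature once the 2 fixed ones are set aside
print(comb(30, 3))   # 4060
print(comb(28, 1))   # 28

# the gap grows quickly; for n = 10:
print(comb(30, 10))  # 30045015
print(comb(28, 8))   # 3108105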

Train test split without using scikit learn

I have a house price prediction dataset. I have to split the dataset into train and test.
I would like to know if it is possible to do this by using numpy or scipy?
I cannot use scikit learn at this moment.
I know that your question was only about doing a train_test_split with numpy or scipy, but there is actually a very simple way to do it with Pandas:
import pandas as pd
# Shuffle your dataset
shuffle_df = df.sample(frac=1)
# Define a size for your train set
train_size = int(0.7 * len(df))
# Split your dataset
train_set = shuffle_df[:train_size]
test_set = shuffle_df[train_size:]
For those who would like a fast and easy solution.
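If you also need the shuffle to be reproducible, DataFrame.sample accepts a random_state, for example:
shuffle_df = df.sample(frac=1, random_state=42)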
Although this is an old question, this answer might help.
This is how sklearn implements train_test_split; the method given below takes arguments similar to sklearn's.
import numpy as np
from itertools import chain
def _indexing(x, indices):
    """
    :param x: array from which indices have to be fetched
    :param indices: indices to be fetched
    :return: sub-array from given array and indices
    """
    # np array indexing
    if hasattr(x, 'shape'):
        return x[indices]
    # list indexing
    return [x[idx] for idx in indices]

def train_test_split(*arrays, test_size=0.25, shuffle=True, random_seed=1):
    """
    Splits arrays into train and test data.
    :param arrays: arrays to split into train and test
    :param test_size: size of test set in range (0, 1)
    :param shuffle: whether to shuffle arrays or not
    :param random_seed: random seed value
    :return: list of 2*len(arrays) arrays, each input split into train and test
    """
    # checks
    assert 0 < test_size < 1
    assert len(arrays) > 0
    length = len(arrays[0])
    for i in arrays:
        assert len(i) == length

    n_test = int(np.ceil(length * test_size))
    n_train = length - n_test

    if shuffle:
        perm = np.random.RandomState(random_seed).permutation(length)
        test_indices = perm[:n_test]
        train_indices = perm[n_test:]
    else:
        train_indices = np.arange(n_train)
        test_indices = np.arange(n_train, length)

    return list(chain.from_iterable((_indexing(x, train_indices), _indexing(x, test_indices)) for x in arrays))
Of course sklearn's implementation supports stratified splitting, splitting of pandas Series, etc. This one only works for splitting lists and numpy arrays, which I think will work for your case.
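For instance, with made-up data just to show the call, the function above can be used much like sklearn's version:
X = [[1, 2], [3, 4], [5, 6], [7, 8]]
y = [0, 1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_seed=0)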
This solution uses pandas and numpy only:
def split_train_valid_test(data, valid_ratio, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    valid_set_size = int(len(data) * valid_ratio)
    valid_indices = shuffled_indices[:valid_set_size]
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[valid_set_size:valid_set_size + test_set_size]
    train_indices = shuffled_indices[valid_set_size + test_set_size:]
    return data.iloc[train_indices], data.iloc[valid_indices], data.iloc[test_indices]

train_set, valid_set, test_set = split_train_valid_test(dataset, valid_ratio=0.2, test_ratio=0.2)
print(len(train_set), len(valid_set), len(test_set))
## out: 12384 4128 4128
This code should work (Assuming X_data is a pandas DataFrame):
import numpy as np
num_of_rows = int(len(X_data) * 0.8)
values = X_data.values
np.random.shuffle(values)  # shuffles the rows in place to make the split random
train_data = values[:num_of_rows]  # rows for training data
test_data = values[num_of_rows:]  # rows for test data
Hope this helps!
import numpy as np
import pandas as pd
X_data = pd.read_csv('house.csv')
Y_data = X_data["prices"]
X_data.drop(["offers", "brick", "bathrooms", "prices"],
            axis=1, inplace=True)  # important to drop prices as well
# create random train/test split
indices = list(range(X_data.shape[0]))
num_training_indices = int(0.8 * X_data.shape[0])
np.random.shuffle(indices)
train_indices = indices[:num_training_indices]
test_indices = indices[num_training_indices:]
# split the actual data
X_data_train, X_data_test = X_data.iloc[train_indices], X_data.iloc[test_indices]
Y_data_train, Y_data_test = Y_data.iloc[train_indices], Y_data.iloc[test_indices]
This assumes you want a random split. What happens is that we're creating a list of indices as long as the number of data points you have, i.e. the first axis of X_data (or Y_data). We then put them in random order and just take the first 80% of those random indices as training data and the rest for testing. [:num_training_indices] just selects the first num_training_indices from the list. After that you just extract the rows from your data using the lists of random indices and your data is split. Remember to drop the prices from your X_data and to set a seed if you want the split to be reproducible (np.random.seed(some_integer) in the beginning).

How to find the feature names of the coefficients using scikit-learn linear regression?

I use scikit-learn linear regression and if I change the order of the features, the coefficients are still printed in the same order, hence I would like to know the mapping of each feature to its coefficient.
#training the model
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])
model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])
# extracting the coef
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)
The trick is that right after you have trained your model, you know the order of the coefficients:
model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))
This will print the coefficients and the correct feature. (Tested with pandas DataFrame)
If you want to reuse the coefficients later you can also put them in a dictionary:
coef_dict = {}
for coef, feat in zip(model_1.coef_, model_1_features):
    coef_dict[feat] = coef
(You can test it for yourself by training two models with the same features but, as you said, shuffled order of features.)
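A quick way to run that check (a sketch on random data; the column names and target are made up):
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
df = pd.DataFrame(rng.rand(50, 3), columns=['a', 'b', 'c'])
target = 2 * df['a'] + 3 * df['b'] + rng.rand(50)

m1 = LinearRegression().fit(df[['a', 'b', 'c']], target)
m2 = LinearRegression().fit(df[['c', 'b', 'a']], target)

d1 = dict(zip(['a', 'b', 'c'], m1.coef_))
d2 = dict(zip(['c', 'b', 'a'], m2.coef_))
# both dictionaries map each feature name to (numerically) the same coefficient
print(d1)
print(d2)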
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())
Robin posted a great answer, but for me I had to make one tweak to get it to work the way I wanted: refer to the dimension of the coef_ array that I wanted, namely model_1.coef_[0,:], as below:
coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:], model_1_features):
    coef_dict[feat] = coef
Then the dict was created as I pictured it, with {'feature_name' : coefficient_value} pairs.
Here is what I use for pretty printing of coefficients in Jupyter. I'm not sure I follow why order is an issue - as far as I know the order of the coefficients should match the order of the input data that you gave it.
Note that the first line assumes you have a Pandas data frame called df in which you originally stored the data prior to turning it into a numpy array for regression:
fieldList = np.array(list(df)).reshape(-1,1)
coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
Borrowing from Robin, but simplifying the syntax:
coef_dict = dict(zip(model_1_features, model_1.coef_))
Important note about zip: zip assumes its inputs are of equal length, making it especially important to confirm that the lengths of the features and coefficients match (which in more complicated models might not be the case). If one input is longer than the other, the longer input will have the values in its extra index positions cut off. Notice the missing 7 in the following example:
In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
pd.DataFrame(data=regression.coef_, index=X_train.columns)
All of these answers were great, but what personally worked for me was this, as the feature names I needed were the columns of my train_data dataframe:
pd.DataFrame(data=[model_1.coef_], columns=train_data.columns)
Right after training the model, the coefficient values are stored in the variable model.coef_[0]. We can iterate over the column names and store each column name and its coefficient value in a dictionary.
model.fit(X_train, y)
# assuming all the columns except the last one are used in training
columns = data.iloc[:, :-1].columns
coef_dict = {}
for i in range(0, len(columns)):
    coef_dict[columns[i]] = model.coef_[0][i]
Hope this helps!
As of scikit-learn version 1.0, the LinearRegression estimator has a feature_names_in_ attribute. From the docs:
feature_names_in_ : ndarray of shape (n_features_in_,)
Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
Assuming you're fitting on a pandas.DataFrame (train_data), your estimators (model_1, model_2, and model_3) will have the attribute. You can line up your coefficients using any of the methods listed in previous answers, but I'm in favor of this one:
coef_series = pd.Series(
    data=model_1.coef_,
    index=model_1.feature_names_in_
)
A minimal reproducible example:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# for repeatability
np.random.seed(0)
# random data
Xy = pd.DataFrame(
    data=np.random.random((10, 3)),
    columns=["x0", "x1", "y"]
)
# separate X and y
X = Xy.drop(columns="y")
y = Xy.y
# initialize estimator
lr = LinearRegression()
# fit to pandas.DataFrame
lr.fit(X, y)
# get coefficients and their respective feature names
coef_series = pd.Series(
    data=lr.coef_,
    index=lr.feature_names_in_
)
print(coef_series)
x0 0.230524
x1 -0.275611
dtype: float64

Scikit-learn - order of fit and predict inputs, does it matter?

Just getting started with this library... having some issues (I've read the docs but didn't get clarity) with RandomForestClassifier
My question is pretty simple, say i have a train data set like
A B C
1 2 3
Where A is the target variable (y) and B-C are the features (x). Let's say the test set looks the same; however, the order is
B A C
1 2 3
When I call forest.fit(train_data[0:,1:],train_data[0:,0])
do I then need to reorder the test set to match this order before running? (Ignoring the fact that I need to remove the already predicted y value (A), so let's just say B and C are out of order...)
Yes, you need to reorder them. Imagine a simpler case, Linear Regression. The algorithm will calculate the weights for each of the features, so for example if feature 1 is unimportant, it will get assigned a close to 0 weight.
If at prediction time the order is different, an important feature will be multiplied by this almost null weight, and the prediction will be totally off.
elyase is correct. scikit-learn will simply take the data in whatever order you give it. Hence, you'll have to ensure that the data is in the same order during training and prediction time.
Here's a simple illustrative example:
Training time:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
x = pd.DataFrame({
    'feature_1': [0, 0, 1, 1],
    'feature_2': [0, 1, 0, 1]
})
y = [0, 0, 1, 1]
model.fit(x, y)
# we now have a model that
# (i) predicts 0 when x = [0, 0] or [0, 1], and
# (ii) predicts 1 when x = [1, 0] or [1, 1]
Prediction time:
# positive example
http_request_payload = {
    'feature_1': 0,
    'feature_2': 1
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features)  # this returns 0, as expected

# negative example
http_request_payload = {
    'feature_2': 1,  # notice that the order is jumbled up
    'feature_1': 0
}
input_features = pd.DataFrame([http_request_payload])
model.predict(input_features)  # this returns 1, when it should have returned 0.
# scikit-learn doesn't care about the key-value mapping of the features.
# it simply vectorizes the dataframe in whatever order it comes in.
This is how I cache the column order during training so that I can use it during prediction time.
# training
x = pd.DataFrame([...])
column_order = x.columns
model = SomeModel().fit(x, y)  # train model

# save the things that we need at prediction time. you can also use pickle if you don't want to pip install joblib
import joblib
joblib.dump(model, 'my_model.joblib')
joblib.dump(column_order, 'column_order.txt')

# load the artifacts from disk
model = joblib.load('my_model.joblib')
column_order = joblib.load('column_order.txt')

# imaginary http request payload
request_payload = {'feature_1': ..., 'feature_2': ...}

# create empty dataframe with the right shape and order (using column_order)
input_features = pd.DataFrame([], columns=column_order)
input_features = input_features.append(request_payload, ignore_index=True)
input_features = input_features.fillna(0)  # handle any missing data however you like

model.predict(input_features.values.tolist())
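A lighter-weight alternative (a sketch; it assumes any missing keys can simply be filled with 0) is to let pandas put the incoming payload into the cached column order directly:
input_features = pd.DataFrame([request_payload]).reindex(columns=column_order).fillna(0)
model.predict(input_features)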
