Group/Cluster K-Fold CV with Sklearn

Group/Cluster K-Fold CV with Sklearn - python

I need to do a K-fold CV on some models, but I need to ensure the validation (test) data set is clustered together by a group and t number of years. GroupKFold is close, but it still splits up the validation set (see second fold).
For example, if I have a set of data with years from 2000-2008 and I want to K-fold into 3 groups. The appropriate sets would be: Validation: 2000-2002, Train: 2003-2008; V:2003-2005, T:2000-2002 & 2006-2008; and V: 2006-2008, T: 2000-2005).
Is there a way to group and cluster the data using K-Fold CV where the validation set is clustered by t years?
from sklearn.model_selection import GroupKFold
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
gkf = GroupKFold(n_splits=2)
for train_index, test_index in gkf.split(X, y, groups=groups):
print("Train:", train_index, "Validation:",test_index)
Output:
Train: [ 0 1 2 3 4 5 10 11 12] Validation: [6 7 8 9]
Train: [3 4 5 6 7 8 9] Validation: [ 0 1 2 10 11 12]
Train: [ 0 1 2 6 7 8 9 10 11 12] Validation: [3 4 5]
Desired Output (assume 2 years for each group):
Train: [ 7 8 9 10 11 12 ] Validation: [0 1 2 3 4 5 6]
Train: [0 1 2 10 11 12 ] Validation: [ 3 4 5 6 7 8 9 ]
Train: [ 0 1 2 3 4 5 ] Validation: [6 7 8 9 10 11 12]
Although, the test and train subsets are not sequential along and can select more years to group.

I hope I understood you correctly.
The LeaveOneGroupOut method from scikits model_selection might help:
Lets say you assign the group label 0 to all the data points from 2000-2002, label 1 for all data points between 2003 and 2005 and label 2 for the data in 2006-2008.
Then you could use the following method, to create training and test splits, where the three test splits are created from one of the three groups:
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np
groups=[1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3]
X=np.random.random(len(groups))
y=np.random.randint(0,4,len(groups))
logo = LeaveOneGroupOut()
print("n_splits=", logo.get_n_splits(X,y,groups))
for train_index, test_index in logo.split(X, y, groups):
print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17] test_idx: [0 1 2 3]
train_idx: [ 0 1 2 3 10 11 12 13 14 15 16 17] test_idx: [4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5 6 7 8 9] test_idx: [10 11 12 13 14 15 16 17]
Edit
I think I now finally understood what you want. Sorry that it took me so long.
I dont think that your desired split method is already implemented in sklearn. But we can easily extend the BaseCrossValidator method.
import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.validation import check_array
class GroupOfGroups(BaseCrossValidator):
def __init__(self, group_of_groups):
"""
:param group_of_groups: list with length n_splits. Each entry in the list is a list with group ids from
set(groups). In each of the n_splits splits, the groups given in the current group_of_groups sublist are used
for validation.
"""
self.group_of_groups = group_of_groups
def get_n_splits(self, X=None, y=None, groups=None):
return len(self.group_of_groups)
def _iter_test_masks(self, X=None, y=None, groups=None):
if groups is None:
raise ValueError("The 'groups' parameter should not be None.")
groups=check_array(groups, copy=True, ensure_2d=False, dtype=None)
for g in self.group_of_groups:
test_index = np.zeros(len(groups), dtype=np.bool)
for g_id in g:
test_index[groups == g_id] = True
yield test_index
The usage is quite simple. As before, we define X,y and groups. Additionally we define a list of lists (groups of groups) which define which groups should be used together in which test fold.
So g_of_g=[[1,2],[2,3],[3,4]] means that groups 1 and 2 are used as test set in the first fold, while the remaining groups 3 and 4 are used for training. In fold 2, data from groups 2 and 3 are used as test set etc.
I am not quite happy with the naming "GroupOfGroups" so maybe you find something better.
Now we can test this cross validator:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = [[1,2],[2,3],[3,4]]
gg = GroupOfGroups(g_of_g)
print("n_splits=", gg.get_n_splits(X,y,groups))
for train_index, test_index in gg.split(X, y, groups):
print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 6 7 8 9 10 11 12] test_idx: [0 1 2 3 4 5]
train_idx: [ 0 1 2 10 11 12] test_idx: [3 4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5] test_idx: [ 6 7 8 9 10 11 12]
Please keep in mind that I did not include a lot of checks and didn't do thorough testing. So verify carefully that this works for you.

Related

Find local maxima or peaks(index) in a numeric series using numpy and pandas Peak refers to the values surrounded by smaller values on both sides

Write a python program to find all the local maxima or peaks(index) in a numeric series using numpy and pandas Peak refers to the values surrounded by smaller values on both sides
Note
Create a Pandas series from the given input.
Input format:
First line of the input consists of list of integers separated by spaces to from pandas series.
Output format:
Output display the array of indices where peak values present.
Sample testcase
input1
12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
output1
[2 5 10 12]
smapletest cases image
How to solve this problem?

import pandas as pd
a = "12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"
a = [int(x) for x in a.split(" ")]
angles = []
for i in range(len(a)):
if i!=0:
if a[i]>a[i-1]:
angles.append('rise')
else:
angles.append('fall')
else:
angles.append('ignore')
prev=0
prev_val = "none"
counts = []
for s in angles:
if s=="fall" and prev_val=="rise":
prev_val = s
counts.append(1)
else:
prev_val = s
counts.append(0)
peaks_pd = pd.Series(counts).shift(-1).fillna(0).astype(int)
df = pd.DataFrame({
'a':a,
'peaks':peaks_pd
})
peak_vals = list(df[df['peaks']==1]['a'].index)
This could be improved further. Steps I have followed:
First find the angle whether its rising or falling
Look at the index at which it starts falling after rising and call it as peaks

Use:
data = [12, 1, 2, 1.1, 9, 10, 2.1, 5, 7, 8, 9.1, -9, 10.1, 5.1, 15]
s = pd.Series(data)
n = 3 # number of points to be checked before and after
from scipy.signal import argrelextrema
local_max_index = argrelextrema(s.to_frame().to_numpy(), np.greater_equal, order=n)[0].tolist()
print (local_max_index)
[0, 5, 14]
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
local_max_index = s.index[s == s.rolling(n, center=True).max()].tolist()
print (local_max_index)
[2, 5, 10, 12]
EDIT: Solution for processing value in DataFrame:
df = pd.DataFrame({'Input': ["12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15"]})
print (df)
Input
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15
s = df['Input'].iloc[[0]].str.split().explode().astype(int).reset_index(drop=True)
print (s)
0 12
1 1
2 2
3 1
4 9
5 10
6 2
7 5
8 7
9 8
10 9
11 -9
12 10
13 5
14 15
Name: Input, dtype: int32
local_max_index = s.index[(s.shift() <= s) & (s.shift(-1) <= s)].tolist()
print (local_max_index)
[2, 5, 10, 12]
df['output'] = [local_max_index]
print (df)
Input output
0 12 1 2 1 9 10 2 5 7 8 9 -9 10 5 15 [2, 5, 10, 12]

Write sklearn LOO splits to pandas dataframe with index as label column

I'm trying (badly) to use sklearn's LOO functionality and what I would like to do is append each training split set into a dataframe column with a label for the split index. So using the example from the sklearn page, but slightly modified:
import numpy as np
from sklearn.model_selection import LeaveOneOut
x = np.array([1,2])
y = np.array([3,4])
coords = np.column_stack((x,y))
z = np.array([8, 12])
loo = LeaveOneOut()
loo.get_n_splits(coords)
print(loo)
LeaveOneOut()
for train_index, test_index in loo.split(coords):
print("TRAIN:", train_index, "TEST:", test_index)
XY_train, XY_test = coords[train_index], coords[test_index]
z_train, z_test = z[train_index], z[test_index]
print(XY_train, XY_test, z_train, z_test)
Which returns:
TRAIN: [1] TEST: [0]
[[2 4]] [[1 3]] [12] [8]
TRAIN: [0] TEST: [1]
[[1 3]] [[2 4]] [8] [12]
In my case I'd like to write each split value to a dataframe like this:
X Y Ztrain Ztest split
0 1 2 8 12 0
1 3 4 8 12 0
2 1 2 12 8 1
3 3 4 12 8 1
And so on.
The motivation for doing this is I want to try a jackknifing interpolation of sparse point data. Ideally I want to run an interpolation/gridder on each of the LOO training sets, and then stack them. But I am struggling to access each train set to then use in something like griddata
Any help would be appreciated, for the problem here or the approach in general.

I don't quite get the logic of your dataframe, but you can try something like below to get your dataframe:
df = []
for train_index, test_index in loo.split(coords):
x = pd.DataFrame({'XY_train':coords[train_index][0],\
'XY_test':coords[test_index][0],\
'Ztrain':z[train_index][0],\
'Ztest':z[test_index][0]})
df.append(x)
df = pd.concat(df)
df
XY_train XY_test Ztrain Ztest
0 2 1 12 8
1 4 3 12 8
0 1 2 8 12
1 3 4 8 12

Split on train and test separating by group

I have a sample data as follows:
import pandas as pd
df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
"id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
"label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"]})
So my data look like this
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
I would like to split this data into two groups (train, test) based on label distribution given the number of test samples (e.g. 6 samples). My settings prefers to define size of test set as integer representing the number of test samples rather than percentage. However, with my specific domain, any id MUST be allocated in ONLY one group. For example, if id 1 was assigned to the training set, other samples with id 1 cannot be assigned to the test set. So the expected output are 2 dataframes as follows:
Training set
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
Test set
x id label
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
Both training set and test set have the same class distribution (a:b is 4:2) and id 1, 2 were assigned to only the training set while id 3, 4, 5 were assigned to only the test set. I used to do with sklearn train_test_split but I could not figure out how to apply it with such a condition. May I have your suggestions how to handle such conditions?

sklearn.model_selection has several other options other than train_test_split. One of them, aims at solving what you're after. In this case you could use GroupShuffleSplit, which as mentioned inthe docs it provides randomized train/test indices to split data according to a third-party provided group. You also have GroupKFold for these cases which is very useful.
from sklearn.model_selection import GroupShuffleSplit
X = df.drop('label',1)
y=df.label
You can now instantiate GroupShuffleSplit, and do as you would with train_test_split, with the only difference of specifying a group column, which will be used to split X and y so the groups are split according the the groups values:
gs = GroupShuffleSplit(n_splits=2, test_size=.6, random_state=0)
train_ix, test_ix = next(gs.split(X, y, groups=X.id))
Now you can index the dataframe to create the train and test sets:
X_train = X.loc[train_ix]
y_train = y.loc[train_ix]
X_test = X.loc[test_ix]
y_test = y.loc[test_ix]
Giving:
print(X_train)
x id
4 50 2
5 60 2
8 90 4
9 100 4
10 110 5
11 120 5
And for the test set:
print(X_test)
x id
0 10 1
1 20 1
2 30 1
3 40 1
6 70 3
7 80 3

Adding to Yatu's brilliant answer, you can split your data only using pandas if you liked, although its better to use what was proposed in his answer.
import pandas as pd
df = pd.DataFrame(
{
"x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
"id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
"label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"],
}
)
TRAIN_TEST_SPLIT_PERC = 0.75
uniques = df["id"].unique()
sep = int(len(uniques) * TRAIN_TEST_SPLIT_PERC)
df = df.sample(frac=1).reset_index(drop=True) #For shuffling your data
train_ids, test_ids = uniques[:sep], uniques[sep:]
train_df, test_df = df[df.id.isin(train_ids)], df[df.id.isin(test_ids)]
print("\nTRAIN DATAFRAME\n", train_df)
print("\nTEST DATAFRAME\n", test_df)

Printing all solutions in the shape of a matrix using \n\

This function returns all possible multiplication from 1 to d. I want to print the solution in the shape of a d×d matrix.
def example(d):
for i in range(1,d+1):
for l in range(1,d+1):
print(i*l)
For d = 5, the expected output should look like:
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25

You could add the values in your second for loop to a list, join the list, and finally print it.
def mul(d):
for i in range(1, d+1):
list_to_print = []
for l in range(1, d+1):
list_to_print.append(str(l*i))
print(" ".join(list_to_print))
>>> mul(5)
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25
If you want it to be printed in aligned rows and columns, have a read at Pretty print 2D Python list.
EDIT
The above example will work for both Python 3 and Python 2. However, for Python 3 (as #richard has put in the comments), you can use:
def mul(d):
for i in range(1, d+1):
for l in range(1, d+1):
print(i*l, end=" ")
print()
>>> mul(5)
1 2 3 4 5
2 4 6 8 10
3 6 9 12 15
4 8 12 16 20
5 10 15 20 25

Try this:
mm = []
ll = []
def mul(d):
for i in range(1,d+1):
ll = []
for l in range(1,d+1):
# print(i*l),
ll.append((i*l))
mm.append(ll)
mul(5)
for x in mm:
print(x)
[1, 2, 3, 4, 5]
[2, 4, 6, 8, 10]
[3, 6, 9, 12, 15]
[4, 8, 12, 16, 20]
[5, 10, 15, 20, 25]

python, scikit-learn - Weird behaviour using LabelShuffleSplit

Following the scikit-learn documentation for LabelShuffleSplit, I wish to randomise my train/validation batches to ensure I'm training on all possible data (e.g. for an ensemble).
According to the doc, I should see something like (indeed, notice that train/validation sets are evenly split via test_size=0.5):
>>> from sklearn.cross_validation import LabelShuffleSplit
>>> labels = [1, 1, 2, 2, 3, 3, 4, 4]
>>> slo = LabelShuffleSplit(labels, n_iter=4, test_size=0.5, random_state=0)
>>> for train, test in slo:
>>> print("%s %s" % (train, test))
...
[0 1 2 3] [4 5 6 7]
[2 3 6 7] [0 1 4 5]
[2 3 4 5] [0 1 6 7]
[4 5 6 7] [0 1 2 3]
But then I tried using labels = [0, 0, 0, 0, 0, 0, 0, 0] which returned:
...
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
[] [0 1 2 3 4 5 6 7]
(i.e not evenly split - all data has simply been put into the validation set?) I understand in this case that is doesn't really matter which indices are put into the train/validation sets, but I was hoping it would still be a 50%:50% split???

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Group/Cluster K-Fold CV with Sklearn - python

Related

Find local maxima or peaks(index) in a numeric series using numpy and pandas Peak refers to the values surrounded by smaller values on both sides

Write sklearn LOO splits to pandas dataframe with index as label column

Split on train and test separating by group

Printing all solutions in the shape of a matrix using \n\

python, scikit-learn - Weird behaviour using LabelShuffleSplit

Categories

Resources