I have sample data as follows:
import pandas as pd
df = pd.DataFrame({"x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
"id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
"label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"]})
So my data looks like this:
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
I would like to split this data into two groups (train, test) based on the label distribution, given the number of test samples (e.g. 6 samples). My setting prefers to define the size of the test set as an integer representing the number of test samples rather than a percentage. However, in my specific domain, any id MUST be allocated to ONLY one group. For example, if id 1 was assigned to the training set, other samples with id 1 cannot be assigned to the test set. So the expected output is two dataframes as follows:
Training set
x id label
10 1 a
20 1 a
30 1 a
40 1 b
50 2 a
60 2 b
Test set
x id label
70 3 a
80 3 a
90 4 b
100 4 a
110 5 b
120 5 a
Both the training set and the test set have the same class distribution (a:b is 4:2), and ids 1 and 2 were assigned only to the training set while ids 3, 4 and 5 were assigned only to the test set. I used to do this with sklearn's train_test_split, but I could not figure out how to apply it with such a condition. Do you have any suggestions for handling this?
sklearn.model_selection has several options other than train_test_split, and one of them aims at exactly what you're after. In this case you can use GroupShuffleSplit, which, as mentioned in the docs, provides randomized train/test indices to split data according to a third-party provided group. You also have GroupKFold for these cases, which is very useful.
from sklearn.model_selection import GroupShuffleSplit
X = df.drop('label', axis=1)
y = df.label
You can now instantiate GroupShuffleSplit and proceed as you would with train_test_split, the only difference being that you specify a group column, which will be used to split X and y according to the group values:
gs = GroupShuffleSplit(n_splits=2, test_size=.6, random_state=0)
train_ix, test_ix = next(gs.split(X, y, groups=X.id))
Now you can index the dataframe to create the train and test sets:
X_train = X.loc[train_ix]
y_train = y.loc[train_ix]
X_test = X.loc[test_ix]
y_test = y.loc[test_ix]
Giving:
print(X_train)
x id
4 50 2
5 60 2
8 90 4
9 100 4
10 110 5
11 120 5
And for the test set:
print(X_test)
x id
0 10 1
1 20 1
2 30 1
3 40 1
6 70 3
7 80 3
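Side note on the integer test size the question asks for: with GroupShuffleSplit, passing an integer test_size counts groups (here: ids), not individual rows, so the exact number of test samples depends on the group sizes. A minimal sketch reusing df, X and y from above:
gs = GroupShuffleSplit(n_splits=1, test_size=3, random_state=0)  # 3 of the 5 ids go to the test set
train_ix, test_ix = next(gs.split(X, y, groups=X.id))
train_df, test_df = df.loc[train_ix], df.loc[test_ix]  # full train/test frames, label included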
Adding to Yatu's brilliant answer, you can split your data using only pandas if you like, although it's better to use what was proposed in his answer.
import pandas as pd
df = pd.DataFrame(
{
"x": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120],
"id": [1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
"label": ["a", "a", "a", "b", "a", "b", "b", "b", "a", "b", "a", "b"],
}
)
TRAIN_TEST_SPLIT_PERC = 0.75
uniques = df["id"].unique()
sep = int(len(uniques) * TRAIN_TEST_SPLIT_PERC)
df = df.sample(frac=1).reset_index(drop=True)  # shuffle the rows; which ids end up in train/test is still determined by the order of uniques above
train_ids, test_ids = uniques[:sep], uniques[sep:]
train_df, test_df = df[df.id.isin(train_ids)], df[df.id.isin(test_ids)]
print("\nTRAIN DATAFRAME\n", train_df)
print("\nTEST DATAFRAME\n", test_df)
Say I have a DataFrame whose columns are features that I want to feed to a random forest classifier. These features are signals sampled at different rates, and each row represents the values output by the sensors every 30 seconds. Each feature column holds lists of values, and within a column all cells contain lists of the same length. Say my table looks like this:
|Epoch (30 sec) | Nasal Airflow 25hz | EEG 200hz | Target (0,1,2) |
| -------- | -------------- |-------------- |----- |
| 1 | [12,3,4,5,6...43] | [6,9,8,5...,69] | 1 |
| 2 | [15,45,8,4,9...89] |[7,9.6,8.5,9...,89] | 2 |
| 3 | [18,5,88,400,2...88] |[8,10.15,9.8,9.5...,45] | 0 |
All lists under the Nasal Airflow column have 750 numbers and all lists under the EEG column have 6000 numbers. The Target column is the value I want to predict.
I tried training a random forest classifier on this kind of data and it did not work. The error I got was:
ValueError: setting an array element with a sequence.
I understand that I could apply some statistical methods like finding the mean, mode, or median of each array, but I feel like I would be losing a lot of information. Are there classifier models that can handle data like this?
(Turning comments into an answer)
If columns contain lists of equal length:
import pandas as pd
from io import StringIO
data_file = StringIO("""airflow|eeg|target
[12,3,4,5,6,43]|[6,9,8,5,69]|0
[15,4,8,4,9,89]|[7,9,8,9,89]|1
[18,5,5,7,2,88]|[8,8,9,9,45]|0
""")
df = pd.read_csv(
    data_file,
    delimiter="|",
    converters={
        # parse "[1,2,3]" into a list of numbers rather than a list of strings
        "airflow": lambda x: [int(i) for i in x.strip("[]").split(",")],
        "eeg": lambda x: [int(i) for i in x.strip("[]").split(",")],
    },
)
airflow eeg target
0 [12, 3, 4, 5, 6, 43] [6, 9, 8, 5, 69] 0
1 [15, 4, 8, 4, 9, 89] [7, 9, 8, 9, 89] 1
2 [18, 5, 5, 7, 2, 88] [8, 8, 9, 9, 45] 0
Then the easiest option is to turn each list into columns representing "airflow at t1, t2, t3, etc.":
df[[f"af{i}" for i in range(len(df.airflow[0]))]] = df.airflow.apply(pd.Series)
df[[f"eeg{i}" for i in range(len(df.eeg[0]))]] = df.eeg.apply(pd.Series)
df.drop(["airflow", "eeg"], axis=1, inplace=True)
target af0 af1 af2 af3 af4 af5 eeg0 eeg1 eeg2 eeg3 eeg4
0 0 12 3 4 5 6 43 6 9 8 5 69
1 1 15 4 8 4 9 89 7 9 8 9 89
2 0 18 5 5 7 2 88 8 8 9 9 45
This can then be used for model fitting. If the lists expand into a large number of features (e.g. 19,500), then feature-selection approaches might be worth exploring. If there is a time-series component, non-linear models (like trees) can fit relationships of the form "the target is influenced by the airflow and EEG at t12", but dedicated time-series classification methods also exist.
from sklearn.linear_model import LogisticRegression
X = df.drop(["target"], axis=1)
y = df["target"]
clf = LogisticRegression().fit(X, y)
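If the expanded frame really has thousands of columns, here is a hedged sketch of one possible feature-selection route (SelectKBest is just one of several options in sklearn.feature_selection; k is arbitrary here and would be tuned on the real data):
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# keep only the k columns with the strongest univariate score w.r.t. the target,
# then fit the classifier on the reduced matrix
pipe = make_pipeline(SelectKBest(f_classif, k=5), LogisticRegression(max_iter=1000))
pipe.fit(X, y)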
The main logic of my question involves comparing two dataframes, but it is different from the existing questions here: Q1, Q2, Q3.
Let's create two dummy dataframes:
import pandas as pd

data1 = {'user': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
         'checkinid': [10, 20, 30, 40, 50, 35, 45, 55, 20, 120, 100, 35, 55, 180, 200, 400],
         'count': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
data2 = {'checkinid': [10, 20, 30, 35, 40, 45, 50, 55, 60, 70, 100, 120, 180, 200, 300, 400]}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
data2 consists of the complete set of checkinid values. I am trying to create a training file.
For example, user 1 visited 5 places whose ids are (10, 20, 30, 40, 50).
I want to randomly add places that user 1 did not visit and set their 'count' column to 0.
My expected dataframe looks like this:
user checkinid count
1 10 1
1 20 1
1 30 1
1 40 1
1 50 1
1 300 0 (add randomly)
1 180 0 (add randomly)
1 55 0 (add randomly)
2 35 1
2 45 1
2 55 1
2 20 1
2 120 1
2 10 0 (add randomly)
2 400 0 (add randomly)
2 180 0 (add randomly)
... ...
Now, those who read the question may ask how much random data should be added.
For this example, adding 3 non-visited places per user is enough.
This might not be the best solution, but it works.
You have to iterate over the users and then pick the checkinids that are not assigned to them:
# get all users
users = df1.user.unique()

for user in users:
    checkins = df1.loc[df1['user'] == user]
    # keep only the checkinids this user has NOT visited, then sample 3 of them
    df = (checkins.merge(df2, how='outer', indicator=True)
                  .loc[lambda x: x['_merge'] == 'right_only']
                  .sample(n=3))
    df['user'] = user
    df['count'] = 0
    df.pop("_merge")
    df1 = pd.concat([df1, df], ignore_index=True)

# sort the dataframe by user
df1 = df1.sort_values(by=['user'])

# re-arrange the columns
df1 = df1[['user', 'checkinid', 'count']]

print(df1)
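For reference, a more pandas-idiomatic sketch of the same idea, starting from the original df1 and df2 (the helper name add_negatives is made up here, and the random seed is only for reproducibility):
import pandas as pd

def add_negatives(group, all_ids, n=3, seed=0):
    # checkinids this user has not visited
    unvisited = all_ids[~all_ids.isin(group["checkinid"])]
    negatives = pd.DataFrame({
        "user": group["user"].iloc[0],
        "checkinid": unvisited.sample(n=n, random_state=seed).to_numpy(),
        "count": 0,
    })
    return pd.concat([group, negatives], ignore_index=True)

result = (df1.groupby("user", group_keys=False)
             .apply(add_negatives, all_ids=df2["checkinid"])
             .reset_index(drop=True))
print(result)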
I need to do a K-fold CV on some models, but I need to ensure the validation (test) data set is clustered together by a group and t number of years. GroupKFold is close, but it still splits up the validation set (see second fold).
For example, suppose I have data with years from 2000-2008 and I want to K-fold it into 3 groups. The appropriate sets would be: Validation: 2000-2002, Train: 2003-2008; Validation: 2003-2005, Train: 2000-2002 & 2006-2008; and Validation: 2006-2008, Train: 2000-2005.
Is there a way to group and cluster the data using K-Fold CV where the validation set is clustered by t years?
from sklearn.model_selection import GroupKFold
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
gkf = GroupKFold(n_splits=3)
for train_index, test_index in gkf.split(X, y, groups=groups):
    print("Train:", train_index, "Validation:", test_index)
Output:
Train: [ 0 1 2 3 4 5 10 11 12] Validation: [6 7 8 9]
Train: [3 4 5 6 7 8 9] Validation: [ 0 1 2 10 11 12]
Train: [ 0 1 2 6 7 8 9 10 11 12] Validation: [3 4 5]
Desired Output (assume 2 years for each group):
Train: [ 7 8 9 10 11 12 ] Validation: [0 1 2 3 4 5 6]
Train: [0 1 2 10 11 12 ] Validation: [ 3 4 5 6 7 8 9 ]
Train: [ 0 1 2 3 4 5 ] Validation: [6 7 8 9 10 11 12]
Note that the test and train subsets do not have to be sequential, and more years can be selected per group.
I hope I understood you correctly.
The LeaveOneGroupOut method from scikit-learn's model_selection might help.
Let's say you assign group label 0 to all data points from 2000-2002, label 1 to all data points from 2003-2005, and label 2 to the data from 2006-2008.
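For instance, assuming the years are available in an array called years (a name used here purely for illustration), the three-year buckets could be built like this:
import numpy as np

years = np.array([2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008])
groups = (years - years.min()) // 3   # 0 for 2000-2002, 1 for 2003-2005, 2 for 2006-2008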
Then you can use the following to create training and test splits, where each of the three test splits is built from one of the three groups:
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np
groups=[1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3]
X=np.random.random(len(groups))
y=np.random.randint(0,4,len(groups))
logo = LeaveOneGroupOut()
print("n_splits=", logo.get_n_splits(X,y,groups))
for train_index, test_index in logo.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17] test_idx: [0 1 2 3]
train_idx: [ 0 1 2 3 10 11 12 13 14 15 16 17] test_idx: [4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5 6 7 8 9] test_idx: [10 11 12 13 14 15 16 17]
Edit
I think I now finally understood what you want. Sorry that it took me so long.
I don't think your desired split method is already implemented in sklearn, but we can easily extend the BaseCrossValidator class.
import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.validation import check_array
class GroupOfGroups(BaseCrossValidator):
    def __init__(self, group_of_groups):
        """
        :param group_of_groups: list with length n_splits. Each entry in the list is a list with group ids from
        set(groups). In each of the n_splits splits, the groups given in the current group_of_groups sublist are
        used for validation.
        """
        self.group_of_groups = group_of_groups

    def get_n_splits(self, X=None, y=None, groups=None):
        return len(self.group_of_groups)

    def _iter_test_masks(self, X=None, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
        for g in self.group_of_groups:
            test_index = np.zeros(len(groups), dtype=bool)  # np.bool is deprecated; use the builtin bool
            for g_id in g:
                test_index[groups == g_id] = True
            yield test_index
The usage is quite simple. As before, we define X, y and groups. Additionally, we define a list of lists (groups of groups) which defines which groups should be used together in each test fold.
So g_of_g = [[1,2],[2,3],[3,4]] means that groups 1 and 2 are used as the test set in the first fold, while the remaining groups 3 and 4 are used for training. In fold 2, data from groups 2 and 3 are used as the test set, etc.
I am not quite happy with the name "GroupOfGroups", so maybe you can find something better.
Now we can test this cross validator:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = [[1,2],[2,3],[3,4]]
gg = GroupOfGroups(g_of_g)
print("n_splits=", gg.get_n_splits(X,y,groups))
for train_index, test_index in gg.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 6 7 8 9 10 11 12] test_idx: [0 1 2 3 4 5]
train_idx: [ 0 1 2 10 11 12] test_idx: [3 4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5] test_idx: [ 6 7 8 9 10 11 12]
Please keep in mind that I did not include a lot of checks and didn't do thorough testing. So verify carefully that this works for you.
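As a usage note (the classifier choice here is arbitrary), the custom splitter can be passed straight to cross_val_score through the cv and groups arguments:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_arr = np.array(X).reshape(-1, 1)   # sklearn expects a 2-D feature matrix
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X_arr, y,
                         groups=groups, cv=gg)
print(scores)   # one accuracy value per fold (3 folds here)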
I would like to print dictionaries to file in a different way.
Right now, I am using Pandas to convert dictionaries to Dataframes, combine several Dataframes and then print them to file (see below code).
However, the Pandas operations seem to take a very long time and I would like to do this more efficiently.
Is it possible to do the below approach more efficiently while retaining the structure of the output files? (e.g. by printing from dictionary directly?)
import pandas as pd
labels = ["A", "B", "C"]
periods = [0, 1, 2]
header = ['key', 'scenario', 'metric', 'labels']
metrics_names = ["metric_balances", "metric_record"]
key = "key_x"
scenario = "base"
# The metrics are structured as dicts where the keys are `periods` and the values
# are arrays (where each array entry correspond to one of the `labels`)
metric_balances = {0: [1000, 100, 50], 1: [900, 150, 100], 2: [800, 350, 100]}
metric_record = {0: [20, 10, 5], 1: [90, 15, 10], 2: [80, 35, 10]}
# Combine all metrics into one output structure for key "x"
output_x = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
pd.DataFrame(metric_record, columns=periods, index=labels)],
keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
names=header)
key = "key_y"
scenario = "base_2"
metric_balances = {0: [2000, 200, 50], 1: [1900, 350, 100], 2: [1200, 750, 100]}
metric_record = {0: [40, 5, 3], 1: [130, 45, 10], 2: [82, 25, 18]}
# Combine all metrics into one output structure for key "y"
output_y = pd.concat([pd.DataFrame(metric_balances, columns=periods, index=labels),
pd.DataFrame(metric_record, columns=periods, index=labels)],
keys=pd.MultiIndex.from_product([[key], [scenario], metrics_names]),
names=header)
# Concatenate all output dataframes
output = pd.concat([output_x, output_y], names=header)
# Print results to a csv file
output.to_csv("test.csv", index=False)
Below are the respective outputs:
OUTPUT X
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
-----------------------------------
OUTPUT Y
0 1 2
key scenario metric labels
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
------------------------------
OUTPUT COMBINED
0 1 2
key scenario metric labels
key_x base metric_balances A 1000 900 800
B 100 150 350
C 50 100 100
metric_record A 20 90 80
B 10 15 35
C 5 10 10
key_y base_2 metric_balances A 2000 1900 1200
B 200 350 750
C 50 100 100
metric_record A 40 130 82
B 5 45 25
C 3 10 18
I was looking into row-wise printing of the dictionaries, but I had difficulties merging the labels with the relevant arrays.
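To illustrate what "printing from the dictionary directly" might look like, here is a rough, untested sketch with the standard csv module that writes one row per key/scenario/metric/label and one column per period, assuming the metric dicts are shaped as in the code above:
import csv

with open("test.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for metric_name, metric in [("metric_balances", metric_balances),
                                ("metric_record", metric_record)]:
        for i, label in enumerate(labels):
            # metric[p][i] is the value for `label` in period p
            writer.writerow([key, scenario, metric_name, label] + [metric[p][i] for p in periods])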
I have a large pandas DataFrame that I need to fill.
Here is my code:
trains = np.arange(1, 101)
#The above are example values, it's actually 900 integers between 1 and 20000
tresholds = np.arange(10, 70, 10)
tuples = []
for i in trains:
for j in tresholds:
tuples.append((i, j))
index = pd.MultiIndex.from_tuples(tuples, names=['trains', 'tresholds'])
df = pd.DataFrame(np.zeros((len(index), len(trains))), index=index, columns=trains, dtype=float)
metrics = dict()
for i in trains:
m = binary_metric_train(True, i)
#Above function returns a binary array of length 35
#Example: [1, 0, 0, 1, ...]
metrics[i] = m
for i in trains:
for j in tresholds:
trA = binary_metric_train(True, i, tresh=j)
for k in trains:
if k != i:
trB = metrics[k]
corr = abs(pearsonr(trA, trB)[0])
df[k][i][j] = corr
else:
df[k][i][j] = np.nan
My problem is, when this piece of code is finally done computing, my DataFrame df still contains nothing but zeros. Even the NaN are not inserted. I think that my indexing is correct. Also, I have tested my binary_metric_train function separately, it does return an array of length 35.
Can anyone spot what I am missing here?
EDIT: For clarity, this DataFrame looks like this:
1 2 3 4 5 ...
trains tresholds
1 10
20
30
40
50
60
2 10
20
30
40
50
60
...
As @EdChum noted, you should take a look at pandas indexing. Here's some test data for the purpose of illustration, which should clear things up.
import numpy as np
import pandas as pd
trains = [ 1, 1, 1, 2, 2, 2]
thresholds = [10, 20, 30, 10, 20, 30]
data = [ 1, 0, 1, 0, 1, 0]
df = pd.DataFrame({
'trains' : trains,
'thresholds' : thresholds,
'C1' : data,
'C2' : data
}).set_index(['trains', 'thresholds'])
print(df)

# .ix has been removed from modern pandas, so use label-based .loc:
df.loc[(2, 30), 'C1'] = 3  # row picked by the (trains, thresholds) tuple, column by name
# but note that...
df.loc[(2, 30), 1] = 3     # ...an unknown column label such as 1 creates a new column
print(df)
Which outputs the DataFrame before and after modification:
C1 C2
trains thresholds
1 10 1 1
20 0 0
30 1 1
2 10 0 0
20 1 1
30 0 0
C1 C2 1
trains thresholds
1 10 1 1 NaN
20 0 0 NaN
30 1 1 NaN
2 10 0 0 NaN
20 1 1 NaN
30 3 0 3
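Applied back to the loop in the question (a sketch reusing the question's own variable names; df, trA, trB, i, j and k come from that loop), the assignment would then read:
import numpy as np
from scipy.stats import pearsonr

# inside the nested loops of the question, instead of df[k][i][j] = ...
if k != i:
    df.loc[(i, j), k] = abs(pearsonr(trA, trB)[0])
else:
    df.loc[(i, j), k] = np.nan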